A course in Time Series Analysis. Suhasini Subba Rao
August 5, 2015
Contents

1 Introduction
1.1 Time Series data
1.1.1 R code
1.2 Detrending a time series
1.2.1 Estimation of parametric trend
1.2.2 Estimation using nonparametric methods
1.2.3 Estimation of the period
1.3 Some formalism
1.4 Estimating the mean
1.5 Stationary processes
1.5.1 Types of stationarity (with Ergodicity thrown in)
1.5.2 Towards statistical inference for time series
1.6 What makes a covariance a covariance?

2 Linear time series
2.1 Motivation
2.2 Linear time series and moving average models
2.2.1 Infinite sums of random variables
2.3 The autoregressive model and the solution
2.3.1 Difference equations and back-shift operators
2.3.2 Solution of two particular AR(1) models
2.3.3 The unique solution of a general AR(1)
2.3.4 The solution of a general AR(p)
2.3.5 Explicit solution of an AR(2) model
2.3.6 Features of a realisation from an AR(2)
2.3.7 Solution of the general AR(∞) model
2.3.8 An explanation as to why the backshift operator method works
2.3.9 Representing the AR(p) as a vector AR(1)
2.4 The ARMA model
2.5 Simulating from an Autoregressive process

3 The autocovariance function of a linear time series
3.1 The autocovariance function
3.1.1 The rate of decay of the autocovariance of an ARMA process
3.1.2 The autocovariance of an autoregressive process
3.1.3 The autocovariance of a moving average process
3.1.4 The autocovariance of an autoregressive moving average process
3.2 The partial covariance and correlation of a time series
3.2.1 A review of partial correlation in multivariate analysis
3.2.2 Partial correlation in time series
3.2.3 The variance/covariance matrix and precision matrix of an autoregressive and moving average process
3.3 Correlation and non-causal time series
3.3.1 The Yule-Walker equations of a non-causal process
3.3.2 Filtering non-causal AR models

4 Nonlinear Time Series Models
4.1 Data Motivation
4.1.1 Yahoo data from FTSE 100 from January - August
4.2 The ARCH model
4.2.1 Features of an ARCH
4.2.2 Existence of a strictly stationary solution and second order stationarity of the ARCH
4.3 The GARCH model
4.3.1 Existence of a stationary solution of a GARCH(1,1)
4.3.2 Extensions of the GARCH model
4.3.3 R code
4.4 Bilinear models
4.4.1 Features of the Bilinear model
4.4.2 Solution of the Bilinear model
4.4.3 R code
4.5 Nonparametric time series models

5 Prediction
5.1 Forecasting given the present and infinite past
5.2 Review of vector spaces
5.2.1 Spaces spanned by infinite number of elements
5.3 Levinson-Durbin algorithm
5.3.1 A proof based on projections
5.3.2 A proof based on symmetric Toeplitz matrices
5.3.3 Using the Durbin-Levinson to obtain the Cholesky decomposition of the precision matrix
5.4 Forecasting for ARMA processes
5.5 Forecasting for nonlinear models
5.5.1 Forecasting volatility using an ARCH(p) model
5.5.2 Forecasting volatility using a GARCH(1,1) model
5.5.3 Forecasting using a BL(1,0,1,1) model
5.6 Nonparametric prediction
5.7 The Wold Decomposition

6 Estimation of the mean and covariance
6.1 An estimator of the mean
6.1.1 The sampling properties of the sample mean
6.2 An estimator of the covariance
6.2.1 Asymptotic properties of the covariance estimator
6.2.2 Proof of Bartlett's formula
6.3 Using Bartlett's formula for checking for correlation
6.4 Long range dependence versus changes in the mean

7 Parameter estimation
7.1 Estimation for Autoregressive models
7.1.1 The Yule-Walker estimator
7.1.2 The Gaussian maximum likelihood
7.2 Estimation for ARMA models
7.2.1 The Gaussian maximum likelihood estimator
7.2.2 The Hannan-Rissanen AR(∞) expansion method
7.3 The quasi-maximum likelihood for ARCH processes

8 Spectral Representations
8.1 How we have used Fourier transforms so far
8.2 The near uncorrelatedness of the Discrete Fourier Transform
8.2.1 Seeing the decorrelation in practice
8.2.2 Proof 1 of Lemma 8.2.1: By approximating Toeplitz with Circulant matrices
8.2.3 Proof 2 of Lemma 8.2.1: Using brute force
8.2.4 Heuristics
8.3 The spectral density and spectral distribution
8.3.1 The spectral density and some of its properties
8.3.2 The spectral distribution and Bochner's theorem
8.4 The spectral representation theorem
8.5 The spectral density functions of MA, AR and ARMA models
8.5.1 The spectral representation of linear processes
8.5.2 The spectral density of a linear process
8.5.3 Approximations of the spectral density to AR and MA spectral densities
8.6 Higher order spectrums
8.7 Extensions
8.7.1 The spectral density of a time series with randomly missing observations

9 Spectral Analysis
9.1 The DFT and the periodogram
9.2 Distribution of the DFT and Periodogram under linearity
9.3 Estimating the spectral density function
9.4 The Whittle Likelihood
9.5 Ratio statistics in Time Series
9.6 Goodness of fit tests for linear time series models

10 Consistency and asymptotic normality of estimators
10.1 Modes of convergence
10.2 Sampling properties
10.3 Showing almost sure convergence of an estimator
10.3.1 Proof of Theorem (The stochastic Ascoli theorem)
10.4 Toy Example: Almost sure convergence of the least squares estimator for an AR(p) process
10.5 Convergence in probability of an estimator
10.6 Asymptotic normality of an estimator
10.6.1 Martingale central limit theorem
10.6.2 Example: Asymptotic normality of the least squares estimator
10.6.3 Example: Asymptotic normality of the weighted periodogram
10.7 Asymptotic properties of the Hannan and Rissanen estimation method
10.7.1 Proof of Theorem 10.7.1 (A rate for ‖b̂_T − b_T‖₂)
10.8 Asymptotic properties of the GMLE

A Background
A.1 Some definitions and inequalities
A.2 Martingales
A.3 The Fourier series
A.4 Application of Burkholder's inequality
A.5 The Fast Fourier Transform (FFT)

B Mixingales
B.1 Obtaining almost sure rates of convergence for some sums
B.2 Proof of Theorem
Preface

The material for these notes comes from several different places, in particular:

- Brockwell and Davis (1998)
- Shumway and Stoffer (2006) (a shortened version is Shumway and Stoffer EZ).
- Fuller (1995)
- Pourahmadi (2001)
- Priestley (1983)
- Box and Jenkins (1970)
- A whole bunch of articles.

Tata Subba Rao and Piotr Fryzlewicz were very generous in giving advice and sharing homework problems.

When doing the homework, you are encouraged to use all materials available, including Wikipedia and Mathematica/Maple (software which allows you to easily derive analytic expressions; a web-based version which is not sensitive to syntax is Wolfram Alpha). You are encouraged to use R (see David Stoffer's tutorial). I have tried to include R code in the notes so that you can replicate some of the results. Exercise questions will be in the notes and will be set at regular intervals. You will be given some projects at the start of the semester, which you should select and then present in November.
Chapter 1

Introduction

A time series is a series of observations x_t, observed over a period of time. Typically the observations can be over an entire interval, randomly sampled on an interval or at fixed time points. Different types of time sampling require different approaches to the data analysis. In this course we will focus on the case that observations are observed at fixed equidistant time points, hence we will suppose we observe {x_t : t ∈ Z} (Z = {..., 0, 1, 2, ...}).

Let us start with a simple example: independent, uncorrelated random variables (the simplest example of a time series). A plot is given in Figure 1.1. We observe that there aren't any clear patterns in the data. Our best forecast (predictor) of the next observation is zero (which appears to be the mean). The feature that distinguishes a time series from classical statistics is that there is dependence in the observations. This allows us to obtain better forecasts of future observations. Keep Figure 1.1 in mind, and compare this to the following real examples of time series (observe that in all these examples you see patterns).

1.1 Time Series data

Below we discuss four different data sets.

The Southern Oscillation Index from 1876-present

The Southern Oscillation Index (SOI) is an indicator of the intensity of the El Niño effect (see wiki). The SOI measures the fluctuations in air surface pressures between Tahiti and Darwin.
Figure 1.1: Plot of independent uncorrelated random variables.

In Figure 1.2 we give a plot of the monthly SOI from January 1876 - July 2014 (note that there is some doubt on the reliability of the data before 1930). The data was obtained from … . Using this data set one major goal is to look for patterns, in particular periodicities, in the data.

Figure 1.2: Plot of monthly Southern Oscillation Index, January 1876 - July 2014.
Nasdaq Data from 1985-present

The daily closing Nasdaq price from 1st October, 1985 - 8th August, 2014 is given in Figure 1.3. The (historical) data was obtained from … . Of course with this type of data the goal is to make money! Therefore the main objective is to forecast (predict future volatility).

Figure 1.3: Plot of daily closing price of Nasdaq, 1985-2014.

Yearly sunspot data

Sunspot activity is measured by the number of sunspots seen on the sun. In recent years it has had renewed interest because periods of high activity cause huge disruptions to communication networks (see wiki and NASA). In Figure 1.4 we give a plot of yearly sunspot numbers. The data was obtained from … . For this type of data the main aim is both to look for patterns in the data and to forecast (predict future sunspot activity).

Yearly and monthly temperature data

Given that climate change is a very topical subject, we consider global temperature data. Figure 1.5 gives the yearly temperature anomalies and in Figure 1.6 we plot
the monthly temperatures from January … to July 2014. The data was obtained from … and nasa.gov/gistemp/graphs_v3/fig.c.txt respectively.

Figure 1.4: Plot of Sunspot numbers.

Figure 1.5: Plot of global, yearly average, temperature anomalies.

Figure 1.6: Plot of global, monthly average, temperatures, January … - July, 2014.

For this type of data one may be trying to detect global warming (a long term change/increase in the average temperatures). This would be done by fitting trend functions through the data. However, sophisticated time series analysis is required to determine whether these estimators are statistically significant.

1.1.1 R code

A large number of the methods and concepts will be illustrated in R. If you are not familiar with this language please learn the very basics. Here we give the R code for making the plots above.

# assuming the data is stored in your main directory we scan the data into R
soi <- scan("~/soi.txt")
soi <- ts(soi, start=c(1876,1), frequency=12)
# the function ts creates a time series object, start = starting year,
# where 1 denotes January. Frequency = number of observations in a
# unit of time (year). As the data is monthly it is 12.
plot.ts(soi)
1.2 Detrending a time series

In time series, the main focus is on modelling the relationship between observations. Time series analysis is usually performed after the data has been detrended. In other words, if Y_t = µ_t + ε_t, where {ε_t} is a zero mean time series, we first estimate µ_t and then conduct the time series analysis on the residuals. Once the analysis has been performed, we return to the trend estimators and use the results from the time series analysis to construct confidence intervals etc. In this course the main focus will be on the data after detrending. However, we start by reviewing some well known detrending methods. A very good primer is given in Shumway and Stoffer, Chapter 2, and you are strongly encouraged to read it.

1.2.1 Estimation of parametric trend

Often a parametric trend is assumed. Common examples include the linear trend

Y_t = β_0 + β_1 t + ε_t  (1.1)

and the quadratic trend

Y_t = β_0 + β_1 t + β_2 t² + ε_t.  (1.2)

For example we may fit such models to the yearly average temperature data. Alternatively we may want to include seasonal terms

Y_t = β_0 + β_1 sin(2πt/12) + β_3 cos(2πt/12) + ε_t.

For example, we may believe that the Southern Oscillation Index has period 12 (since the observations are taken monthly) and we use sine and cosine functions to model the seasonality. For these types of models, least squares can be used to estimate the parameters.

Remark 1.2.1 (Taking differences to avoid fitting linear and higher order trends) A commonly used method to avoid fitting a linear trend is to take first differences. For example if Y_t = β_0 + β_1 t + ε_t, then Z_t = Y_{t+1} − Y_t = β_1 + ε_{t+1} − ε_t. Taking higher order differences removes higher order polynomials (i.e. taking first differences of {Z_t} removes quadratic terms).
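To illustrate the remark, here is a minimal R sketch (the coefficients are made up for illustration) showing that first differences of a linear trend leave only the constant slope plus a stationary error:

# A minimal sketch of Remark 1.2.1 (coefficients are made up):
# first differences remove a linear trend.
set.seed(1)
n <- 100
t <- 1:n
y <- 2 + 0.5*t + rnorm(n)    # Y_t = beta0 + beta1*t + eps_t
z <- diff(y)                 # Z_t = Y_{t+1} - Y_t = beta1 + eps_{t+1} - eps_t
mean(z)                      # close to beta1 = 0.5
# diff(y, differences = 2) similarly removes a quadratic trend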
Exercise 1.1 (i) Import the yearly temperature data (file global mean temp.txt) into R and fit the linear model in (1.1) to the data (use the R command lsfit).

(ii) Suppose the errors in (1.1) are correlated. Under the correlated assumption, explain why the standard errors reported in the R output are unreliable.

(iii) Make a plot of the residuals after fitting the linear model in (i). Make a plot of the first differences. What do you notice about the two plots, are they similar? (What I found was quite strange.)

The AIC (Akaike Information Criterion) is usually used to select the parameters in the model (see wiki). You should have studied the AIC/AICc/BIC in several of the prerequisites you have taken. In this course it will be assumed that you are familiar with it.

1.2.2 Estimation using nonparametric methods

In Section 1.2.1 we assumed that the mean had a certain known parametric form. This may not always be the case. If we have no a priori idea of what features may be in the mean, we can estimate the mean trend using a nonparametric approach. Since we do not have any a priori knowledge of the mean function, we cannot estimate it without placing some assumptions on its structure. The most common is to assume that the mean µ_t is sampled from a smooth function, i.e. µ_t = µ(t/n). Under this assumption the following approaches are valid.

Possibly the simplest method is to use a rolling window. There are several windows that one can use. We describe, below, the exponential window, since it can be evaluated in an online way.
For t = 1 let µ̂_1 = Y_1, then for t > 1 define

µ̂_t = (1 − λ)µ̂_{t−1} + λY_t,

where 0 < λ < 1. The choice of λ depends on how much weight one wants to give the present observation. It is straightforward to show that

µ̂_t = Σ_{j=1}^t (1 − λ)^{t−j} λ Y_j = Σ_{j=1}^t [1 − exp(−γ)] exp[−γ(t − j)] Y_j,

where γ = −log(1 − λ). Let b = 1/γ and K(u) = exp(−u)I(u ≥ 0); then µ̂_t can be written as

µ̂_t = b(1 − e^{−1/b}) × (1/b) Σ_{j=1}^t K((t − j)/b) Y_j.

Thus we observe that the exponential rolling window estimator is very close to a nonparametric kernel estimator of the mean, which has the form

µ̃_t = (1/b) Σ_j K((t − j)/b) Y_j.

It is likely you came across such estimators in your nonparametric classes. The main difference between the rolling window estimator and the nonparametric kernel estimator is that the kernel/window for the rolling window is not symmetric. This is because we are trying to estimate the mean at time t given only the observations up to time t, whereas for nonparametric kernel estimators we can use observations on both sides of the neighbourhood of t.
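A small R sketch of the recursion above (the choice λ = 0.1 is arbitrary):

# A sketch of the exponential rolling window (lambda is arbitrary here)
exp_window <- function(y, lambda = 0.1) {
  muhat <- numeric(length(y))
  muhat[1] <- y[1]                         # mu_hat_1 = Y_1
  for (t in 2:length(y)) {
    # mu_hat_t = (1 - lambda) mu_hat_{t-1} + lambda Y_t
    muhat[t] <- (1 - lambda)*muhat[t - 1] + lambda*y[t]
  }
  muhat
}
# e.g. plot.ts(y); lines(exp_window(y), col = "red")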
Other types of estimators include sieve estimators. This is where we expand µ(u) in terms of an orthogonal basis {φ_k(u); k ∈ Z}:

µ(u) = Σ_k a_k φ_k(u).

Examples of basis functions are the Fourier basis φ_k(u) = exp(iku), Haar/other wavelet functions etc. We observe that the unknown coefficients a_k are linear in the regressors φ_k. Thus we can use least squares to estimate the coefficients {a_k}. To estimate these coefficients, we truncate the above expansion to order M and minimise the least squares criterion

Σ_{t=1}^n [Y_t − Σ_{k=1}^M a_k φ_k(t/n)]².  (1.3)

The orthogonality of the basis means that the least squares estimator â_k is approximately

â_k ≈ (1/n) Σ_{t=1}^n Y_t φ_k(t/n).

It is worth pointing out that regardless of the method used, correlations in the errors {ε_t} will play a role in the quality of the estimator and even in the choice of bandwidth, b, or equivalently the number of basis functions, M (see Hart (1991)). To understand why, suppose the mean function is µ_t = µ(t/200) (the sample size n = 200), where µ(u) = 5(2u − 2.5u²) + 20. We corrupt this quadratic function with both iid and dependent noise (the dependent noise is the AR(2) process defined in equation (1.6)). The plots are given in Figure 1.7.

Figure 1.7: Top: realisations from iid random noise and dependent noise (left = iid and right = dependent). Bottom: Quadratic trend plus corresponding noise.

We observe that the dependent noise looks smooth (dependence can induce smoothness in a realisation). This means that in the case that the mean has been corrupted by dependent noise it is difficult to see that the underlying trend is a simple quadratic function.

1.2.3 Estimation of the period

Suppose that the observations {Y_t; t = 1, ..., n} satisfy the following regression model

Y_t = A cos(ωt) + B sin(ωt) + ε_t,

where {ε_t} are iid standard normal random variables and 0 < ω < π. The parameters A, B, and ω are real and unknown. Unlike the regression models given in Section 1.2.1, the model here is nonlinear, since the unknown parameter, ω, is inside a trigonometric function. Standard least squares methods cannot be used to estimate the parameters.
Assuming Gaussianity of {ε_t}, the maximum likelihood corresponding to the model is

L_n(A, B, ω) = −(1/2) Σ_{t=1}^n (Y_t − A cos(ωt) − B sin(ωt))².

Nonlinear least squares methods (which would require the use of a numerical maximisation scheme) can be employed to estimate the parameters. However, using some algebraic manipulations, explicit expressions for the estimators can be obtained (see Walker (1971) and Exercise 1.3). These are

ω̂_n = arg max_ω I_n(ω), where I_n(ω) = (1/n) |Σ_{t=1}^n Y_t exp(itω)|²  (1.4)
(we look for the maximum over the fundamental frequencies ω_k = 2πk/n for 1 ≤ k ≤ n), and

Â_n = (2/n) Σ_{t=1}^n Y_t cos(ω̂_n t) and B̂_n = (2/n) Σ_{t=1}^n Y_t sin(ω̂_n t).

The rather remarkable aspect of this result is that the rate of convergence of |ω̂_n − ω| is O(n^{−3/2}), which is faster than the standard O(n^{−1/2}) that we usually encounter (we will see this in Example 1.2.1). I_n(ω) is usually called the periodogram. Searching for peaks in the periodogram is a long established method for detecting periodicities. If we believe that there were two or more periods in the time series, we can generalize the method to searching for the largest and second largest peak etc. We consider an example below.

Example 1.2.1 Consider the following model

Y_t = 2 sin(2πt/8) + ε_t, t = 1, ..., n,  (1.5)

where ε_t are iid standard normal random variables. It is clear that {Y_t} is made up of a periodic signal with period eight. In Figure 1.8 we give a plot of one realisation (using sample size n = 128) together with a plot of the periodogram I_n(ω) (defined in (1.4)). We observe that there is a symmetry; this is because of the e^{itω} in the definition of I_n(ω), from which we can show that I_n(ω) = I_n(2π − ω). Notice there is a clear peak at frequency 2π/8 (where we recall that 8 is the period).

This method works extremely well if the error process {ε_t} is uncorrelated. However, problems arise when the errors are correlated. To illustrate this issue, consider again model (1.5) but this time let us suppose the errors are correlated. More precisely, they satisfy the AR(2) model

ε_t = 1.5ε_{t−1} − 0.75ε_{t−2} + ɛ_t,  (1.6)

where {ɛ_t} are iid random variables (do not worry if this does not make sense to you; we define this class of models precisely in Chapter 2). As in the iid case we use a sample size n = 128.
20 signal Time P frequency Figure.8: Left: Realisation of (.5) with iid noise, Right: Periodogram n = 28. In Figure.9 we give a plot of one realisation and the corresponding periodogram. We observe that the peak at 2π/8 is not the highest. The correlated errors (often called coloured noise) is masking the peak by introducing new peaks. To see what happens for signal Time P frequency Figure.9: Left: Realisation of (.5) with correlated noise and n = 28, Right: Periodogram larger sample sizes, we consider exactly the same model (.5) with the noise generated as 9
in (1.6). But this time we use n = 1024 (8 times the previous sample size). A plot of one realisation, together with the periodogram, is given in Figure 1.10. In contrast to the smaller sample size, a large peak is seen at 2π/8.

Figure 1.10: Left: Realisation of (1.5) with correlated noise and n = 1024. Right: Periodogram.

These examples illustrate two important points:

(i) When the noise is correlated and the sample size is relatively small it is difficult to disentangle the deterministic period from the noise. Indeed we will show in Chapters 2 and 3 that linear time series can exhibit similar types of behaviour to a periodic deterministic signal. This is a subject of ongoing research that dates back at least 60 years (see Quinn and Hannan (2001)). However, the similarity is only up to a point. Given a large enough sample size (which may in practice not be realistic), the deterministic frequency dominates again.

(ii) The periodogram contains important information about the correlations in the noise (observe the periodograms in both Figures 1.9 and 1.10: there is some interesting activity in the lower frequencies that appears to be due to the noise). This is called spectral analysis and is explored in Chapters 8 and 9. Indeed a lot of time series analysis can be done within the so-called frequency domain as well as the time domain.
22 Exercise.2 (Understanding Fourier transforms) (i) Let Y t =. Plot the Periodogram of {Y t ; t =,..., 28}. (ii) Let Y t = + ε t, where {ε t } are iid standard normal random variables. Plot the Periodogram of {Y t ; t =,..., 28}. (iii) Let Y t = µ( t 28 ) where µ(u) = 5 (2u 2.5u2 ) Plot the Periodogram of {Y t ; t =,..., 28}. (iv) Let Y t = 2 sin( 2πt 8 ). Plot the Periodogram of {Y t; t =,..., 28}. (v) Let Y t = 2 sin( 2πt 2πt ) + 4 cos( ). Plot the Periodogram of {Y 8 2 t; t =,..., 28}. Exercise.3 (i) Let ( S n (A, B, ω) = Yt 2 2 t= t= ) ( ) Y t A cos(ωt) + B sin(ωt) + 2 n(a2 + B 2 ). Show that 2L n (A, B, ω) + S n (A, B, ω) = (A2 B 2 ) 2 cos(2tω) + AB t= sin(2tω). t= and thus L n (A, B, ω) + 2 S n(a, B, ω) = O() (ie. the difference does not grow with n). Since L n (A, B, ω) and 2 S n(a, B, ω) are asymptotically equivalent (i) shows that we can maximise 2 S n(a, B, ω) instead of the likelihood L n (A, B, ω). (ii) By profiling out the parameters A and B, use the the profile likelihood to show that ˆω n = arg max ω n t= Y t exp(itω) 2. (iii) By using the identity (which is the one-sided Dirichlet kernel) exp(iωt) = t= exp( 2 i(n+)ω) sin( 2 nω) sin( 2 Ω) 0 < Ω < 2π n Ω = 0 or 2π. (.7) 2
we can show that for 0 < Ω < 2π we have

Σ_{t=1}^n t cos(Ωt) = O(n),  Σ_{t=1}^n t² cos(Ωt) = O(n²),  Σ_{t=1}^n t sin(Ωt) = O(n),  Σ_{t=1}^n t² sin(Ωt) = O(n²).

Using the above identities, show that the Fisher information of L_n(A, B, ω) (denoted I(A, B, ω)) is asymptotically equivalent to

2I(A, B, ω) = E(∇²S_n) =
[ n                    0                     (n²/2)B + O(n)            ]
[ 0                    n                     −(n²/2)A + O(n)           ]
[ (n²/2)B + O(n)       −(n²/2)A + O(n)       (n³/3)(A² + B²) + O(n²)   ].

(iv) Use the Fisher information to show that |ω̂_n − ω| = O(n^{−3/2}).

Exercise 1.4 (i) Simulate three hundred times from model (1.5) using n = 128. Estimate ω, A and B for each simulation and obtain the empirical mean squared error (1/300) Σ_{i=1}^{300} (θ̂_i − θ)² (where θ denotes the parameter and θ̂_i the estimate). In your simulations, is the estimate of the period, ω, superior to the estimators of the coefficients, A and B?

(ii) Do the same as above but now use the coloured noise given in (1.6) as the errors. How do your estimates compare to (i)?

R Code

Simulation and periodogram for model (1.5) with iid errors:

temp <- rnorm(128)
signal <- 1.5*sin(2*pi*c(1:128)/8) + temp # this simulates the series
# Use the command fft to make the periodogram
P <- abs(fft(signal)/128)**2
frequency <- 2*pi*c(0:127)/128
# To plot the series and periodogram
par(mfrow=c(2,1))
plot.ts(signal)
plot(frequency, P, type="o")

Simulation and periodogram for model (1.5) with correlated errors:

set.seed(0)
ar2 <- arima.sim(list(order=c(2,0,0), ar = c(1.5, -0.75)), n=128)
signal2 <- 1.5*sin(2*pi*c(1:128)/8) + ar2
P2 <- abs(fft(signal2)/128)**2
frequency <- 2*pi*c(0:127)/128
par(mfrow=c(2,1))
plot.ts(signal2)
plot(frequency, P2, type="o")
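As a complement to the plotting code above, here is a sketch (our own, not from the original notes) of how the estimators in (1.4) could be computed from signal2; the variable names are ours:

# Sketch: estimate omega, A and B following (1.4), using signal2 from above
n <- 128
I2 <- abs(fft(signal2))**2/n            # periodogram at frequencies 2*pi*k/n
k <- which.max(I2[2:(n/2)])             # search the fundamental frequencies in (0, pi)
omegahat <- 2*pi*k/n
Ahat <- (2/n)*sum(signal2*cos(omegahat*c(1:n)))
Bhat <- (2/n)*sum(signal2*sin(omegahat*c(1:n)))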
1.3 Some formalism

When we observe the time series {x_t}, usually we assume that {x_t} is a realisation from a random process {X_t}. We formalise this notion below. The random process {X_t; t ∈ Z} (where Z denotes the integers) is defined on the probability space (Ω, F, P). We explain what these mean below:

(i) Ω is the set of all possible outcomes. Suppose that ω ∈ Ω, then {X_t(ω)} is one realisation from the random process. For any given ω, {X_t(ω)} is not random. In time series we will usually assume that what we observe, x_t = X_t(ω) (for some ω), is a typical realisation. That is, for any other ω* ∈ Ω, {X_t(ω*)} will be different, but its general or overall characteristics will be similar.

(ii) F is known as a sigma algebra. It is a set of subsets of Ω (though not necessarily the set of all subsets, as this can be too large). It consists of all sets for which a probability can be assigned. That is, if A ∈ F, then a probability is assigned to the set A.

(iii) P is the probability.

Different types of convergence we will be using in class:

(i) Almost sure convergence: X_n → a almost surely as n → ∞ (in this course a will always be a constant). This means X_n(ω) → a for every ω in a set Ω̄ with P(Ω̄) = 1 (this is the classical limit of a sequence, see Wiki for a definition).

(ii) Convergence in probability: X_n →^P a. This means that for every ε > 0, P(|X_n − a| > ε) → 0 as n → ∞ (see Wiki).

(iii) Convergence in mean square: X_n →² a. This means E|X_n − a|² → 0 as n → ∞ (see Wiki).

(iv) Convergence in distribution. This means the distribution of X_n converges to the distribution of X, i.e. for all x where F_X is continuous, we have F_n(x) → F_X(x) as n → ∞ (where F_n and F_X are the distribution functions of X_n and X respectively). This is the simplest definition (see Wiki).

Which implies which? (i), (ii) and (iii) imply (iv); (i) implies (ii); (iii) implies (ii). Central limit theorems require (iv). It is often easiest to show (iii) (since this only requires mean and variance calculations).

1.4 Estimating the mean

Based on one realisation of a time series we want to make inference about parameters associated with the process {X_t}, such as the mean etc. Let us consider the simplest case, estimating the mean. We recall that in classical statistics we usually assume we observe several independent realisations {X_k} of a random variable X, and use the multiple realisations to make inference about the mean: X̄ = (1/n) Σ_{k=1}^n X_k. Roughly speaking, by using several independent realisations we are sampling over the entire probability space and obtaining a good estimate of the mean. On the other hand, if the samples were highly dependent, then it is likely that {X_k} would be concentrated over a small part of the probability space. In this case, the variance of the sample mean would not converge to zero as the sample size grows.
A typical time series is a halfway house between totally dependent data and independent data. Unlike classical statistics, in time series, parameter estimation is based on only one realisation x_t = X_t(ω) (not multiple, independent, replications). Therefore, it would appear impossible to obtain a good estimator of the mean. However, good estimates of the mean can be made based on just one realisation so long as certain assumptions are satisfied: (i) the process has a constant mean (a type of stationarity) and (ii) despite the fact that the time series is generated from one realisation, there is short memory in the observations. That is, what is observed today, x_t, has little influence on observations in the future, x_{t+k} (when k is relatively large). Hence, even though we observe one trajectory, that trajectory traverses much of the probability space. The amount of dependency in the time series determines the quality of the estimator. There are several ways to measure the dependency. The most common measure of linear dependency is the covariance. The covariance in the stochastic process {X_t} is defined as

cov(X_t, X_{t+k}) = E(X_t X_{t+k}) − E(X_t)E(X_{t+k}).

Note that if {X_t} has zero mean, then the above reduces to cov(X_t, X_{t+k}) = E(X_t X_{t+k}).

Remark 1.4.1 It is worth bearing in mind that the covariance only measures linear dependence. For some statistical analysis, such as deriving an expression for the variance of an estimator, the covariance is often a sufficient measure. However, given cov(X_t, X_{t+k}) we cannot say anything about cov(g(X_t), g(X_{t+k})), where g is a nonlinear function. There are occasions where we require a more general measure of dependence (for example, to show asymptotic normality). Examples of more general measures include mixing (and other related notions, such as mixingales, near-epoch dependence, approximate m-dependence, physical dependence, weak dependence), first introduced by Rosenblatt in the 1950s (see Grenander and Rosenblatt (1997)). In this course we will not cover mixing.

Returning to the sample mean example, suppose that {X_t} is a time series. In order to estimate the mean we need to be sure that the mean is constant over time (else the estimator will be meaningless). Therefore we will assume that {X_t} is a time series with constant mean µ. We observe {X_t}_{t=1}^n and estimate the mean µ with the sample mean X̄ = (1/n) Σ_{t=1}^n X_t. It is clear that this is an unbiased estimator of µ, since E(X̄) = µ. To see whether it converges in mean square
to µ we consider its variance

var(X̄) = (1/n²) Σ_{t=1}^n var(X_t) + (2/n²) Σ_{t=1}^n Σ_{τ=t+1}^n cov(X_t, X_τ).  (1.8)

If the covariance structure decays at such a rate that the sum over all lags is finite (sup_t Σ_{τ=−∞}^∞ |cov(X_t, X_τ)| < ∞, often called short memory), then the variance is O(n^{−1}), just as in the iid case. However, even with this assumption we need to be able to estimate var(X̄) in order to test/construct confidence intervals for µ. Usually this requires the stronger assumption of stationarity, which we define in Section 1.5.

Example 1.4.1 (The variance of a regression model with correlated errors) Let us return to the parametric models discussed in Section 1.2.1. The general model is

Y_t = β_0 + Σ_{j=1}^p β_j u_{t,j} + ε_t = β′u_t + ε_t,

where E[ε_t] = 0 and we will assume that the {u_{t,j}} are nonrandom regressors. Note this includes the parametric trend models discussed in Section 1.2.1. We use least squares to estimate β, with

L_n(β) = Σ_{t=1}^n (Y_t − β′u_t)²,

β̂_n = arg min L_n(β) = (Σ_{t=1}^n u_t u_t′)^{−1} Σ_{t=1}^n Y_t u_t.

Thus ∂L_n(β̂_n)/∂β = 0. To evaluate the variance of β̂_n we will derive an expression for β̂_n − β (this expression also applies to many nonlinear estimators too). We note that by using straightforward algebra we can show that

∂L_n(β̂_n)/∂β − ∂L_n(β)/∂β = [β̂_n − β]′ Σ_{t=1}^n u_t u_t′.  (1.9)
Moreover, because ∂L_n(β̂_n)/∂β = 0, we have

∂L_n(β̂_n)/∂β − ∂L_n(β)/∂β = −∂L_n(β)/∂β = Σ_{t=1}^n [Y_t − β′u_t] u_t = Σ_{t=1}^n u_t ε_t.  (1.10)

Altogether (1.9) and (1.10) give

[β̂_n − β]′ Σ_{t=1}^n u_t u_t′ = Σ_{t=1}^n u_t ε_t,

and

[β̂_n − β] = ((1/n) Σ_{t=1}^n u_t u_t′)^{−1} (1/n) Σ_{t=1}^n u_t ε_t.

Using this expression we can see that

var[β̂_n − β] = ((1/n) Σ_{t=1}^n u_t u_t′)^{−1} var((1/n) Σ_{t=1}^n u_t ε_t) ((1/n) Σ_{t=1}^n u_t u_t′)^{−1}.

Finally we need only evaluate var((1/n) Σ_{t=1}^n u_t ε_t), which is

var((1/n) Σ_{t=1}^n u_t ε_t) = (1/n²) Σ_{t,τ=1}^n cov[ε_t, ε_τ] u_t u_τ′
= (1/n²) Σ_{t=1}^n var[ε_t] u_t u_t′ (the expression if the errors were independent)
+ (2/n²) Σ_{t=1}^n Σ_{τ=t+1}^n cov[ε_t, ε_τ] u_t u_τ′ (the additional term due to correlation in the errors).

Under the assumptions that ((1/n) Σ_{t=1}^n u_t u_t′) is non-singular, sup_t ‖u_t‖ < ∞ and sup_t Σ_τ |cov(ε_t, ε_τ)| < ∞, we can see that var[β̂_n − β] = O(n^{−1}). But just as in the case of the sample mean, we need to impose some additional conditions on {ε_t} if we want to construct confidence intervals/tests for β.
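A small simulation sketch of this point (our own illustration, with made-up parameter values): the iid-based standard errors reported by lm understate the true sampling variability of the slope when the errors are correlated.

# Sketch: Monte Carlo check that correlated errors inflate the variance of
# the least squares slope (the parameter values are made up)
set.seed(3)
n <- 200; t <- 1:n; nsim <- 500
slope <- replicate(nsim, {
  e <- as.numeric(arima.sim(list(ar = 0.7), n = n))  # correlated errors
  y <- 1 + 0.05*t + e
  coef(lm(y ~ t))[2]
})
sd(slope)  # true sampling sd; compare with the iid-based s.e. from summary(lm(y ~ t))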
1.5 Stationary processes

We have established that one of the main features that distinguishes time series analysis from classical methods is that observations taken over time (a time series) can be dependent, and this dependency tends to decline the further apart in time the observations are. However, to do any sort of analysis on a time series we have to assume some sort of invariance in the time series, for example that the mean or variance of the time series does not change over time. If the marginal distributions of the time series were totally different, no sort of inference would be possible (suppose in classical statistics you were given independent random variables all with different distributions: what parameter would you be estimating? It is not possible to estimate anything!).

The typical assumption that is made is that a time series is stationary. Stationarity is a rather intuitive concept; it is an invariance property which means that the statistical characteristics of the time series do not change over time. For example, the yearly rainfall may vary year by year, but the average rainfall over two equal-length time intervals will be roughly the same, as would the number of times the rainfall exceeds a certain threshold. Of course, over long periods of time this assumption may not be so plausible. For example, the climate change that we are currently experiencing is causing changes in the overall weather patterns (we will consider nonstationary time series towards the end of this course). However in many situations, including short time intervals, the assumption of stationarity is quite plausible. Indeed often the statistical analysis of a time series is done under the assumption that the time series is stationary.

1.5.1 Types of stationarity (with Ergodicity thrown in)

There are two definitions of stationarity: weak stationarity, which only concerns the covariance of a process, and strict stationarity, which is a much stronger condition and supposes the distributions are invariant over time.

Definition 1.5.1 (Strict stationarity) The time series {X_t} is said to be strictly stationary if for any finite sequence of integers t_1, ..., t_k and shift h, the distributions of (X_{t_1}, ..., X_{t_k}) and (X_{t_1+h}, ..., X_{t_k+h}) are the same.

The above assumption is often considered to be rather strong (and given data it is very hard to check). Often it is possible to work under a weaker assumption called weak/second order stationarity.

Definition 1.5.2 (Second order stationarity/weak stationarity) The time series {X_t} is said to be second order stationary if the mean is constant for all t and if for any t and k the covariance between X_t and X_{t+k} only depends on the lag difference k. In other words there exists a function
c : Z → R such that for all t and k we have c(k) = cov(X_t, X_{t+k}).

Remark 1.5.1 (Strict and second order stationarity) (i) If a process is strictly stationary and E|X_t|² < ∞, then it is also second order stationary. But the converse is not necessarily true. To show that strict stationarity (with E|X_t|² < ∞) implies second order stationarity, suppose that {X_t} is a strictly stationary process; then

cov(X_t, X_{t+k}) = E(X_t X_{t+k}) − E(X_t)E(X_{t+k})
= ∫ xy [P_{X_t,X_{t+k}}(dx, dy) − P_{X_t}(dx)P_{X_{t+k}}(dy)]
= ∫ xy [P_{X_0,X_k}(dx, dy) − P_{X_0}(dx)P_{X_k}(dy)] = cov(X_0, X_k),

where P_{X_t,X_{t+k}} and P_{X_t} are the joint distribution and marginal distribution of (X_t, X_{t+k}) respectively. The above shows that cov(X_t, X_{t+k}) does not depend on t, and hence {X_t} is second order stationary.

(ii) If a process is strictly stationary but its second moment is not finite, then it is not second order stationary.

(iii) It should be noted that a weakly stationary Gaussian time series is also strictly stationary (this is the only case where weak stationarity implies strict stationarity).

Example 1.5.1 (The sample mean and its variance under stationarity) Returning to the variance of the sample mean discussed in (1.8), if a time series is second order stationary, then the sample mean X̄ is estimating the mean µ and the variance of X̄ is

var(X̄) = (1/n) var(X_0) + (2/n²) Σ_{t=1}^n Σ_{τ=t+1}^n cov(X_t, X_τ)
= (1/n) var(X_0) + (2/n) Σ_{r=1}^{n−1} (1 − r/n) cov(X_0, X_r),

where var(X_0) = c(0) and cov(X_0, X_r) = c(r). We approximate the above by using that the covariances are absolutely summable, Σ_r |c(r)| < ∞. Then for all r, (1 − r/n)c(r) → c(r) and |(1 − r/n)c(r)| ≤ |c(r)|, thus by dominated convergence (see
Chapter A) Σ_{r=1}^{n−1} (1 − r/n)c(r) → Σ_{r=1}^∞ c(r). This implies that

var(X̄) ≈ (1/n) c(0) + (2/n) Σ_{r=1}^∞ c(r) = O(n^{−1}).

The above is often called the long term variance. It implies that

E(X̄ − µ)² = var(X̄) → 0,

which we recall is convergence in mean square. Thus we have convergence in probability, X̄ →^P µ.

The example above illustrates how second order stationarity gave a rather elegant expression for the variance. We now motivate the concept of ergodicity. Sometimes it is difficult to evaluate the mean and variance of an estimator, but often we may only require almost sure convergence or convergence in probability. Therefore, we may want to find an alternative method to evaluating the mean squared error. To see whether this is possible, we recall that for iid random variables we have the very useful law of large numbers

(1/n) Σ_{t=1}^n X_t → µ almost surely,

and in general (1/n) Σ_{t=1}^n g(X_t) → E[g(X_0)] almost surely (if E|g(X_0)| < ∞). Does such a result exist for time series? It does, but we require the slightly stronger condition that the time series is ergodic (which is a slightly stronger condition than strict stationarity).

Definition 1.5.3 (Ergodicity - very rough) Let (Ω, F, P) be a probability space. A transformation T : Ω → Ω is said to be measure preserving if for every set A ∈ F, P(T⁻¹A) = P(A). Moreover, it is said to be an ergodic transformation if T⁻¹A = A implies that P(A) = 0 or 1.

It is not obvious what this has to do with stochastic processes, but we attempt to make a link. Let us suppose that X = {X_t} is a strictly stationary process defined on the probability space (Ω, F, P). By strict stationarity the transformation (shifting a sequence by one)

T(x_1, x_2, ...) = (x_2, x_3, ...)

is a measure preserving transformation. Thus a stationary process corresponds to a measure preserving transformation.
To understand ergodicity we define the set A, where

A = {ω : (X_1(ω), X_2(ω), ...) ∈ H} = {ω : (X_2(ω), X_3(ω), ...) ∈ H}.

The stochastic process is said to be ergodic if the only sets which satisfy the above are such that P(A) = 0 or 1. Roughly, this means there cannot be too many outcomes ω which generate sequences which repeat themselves (are periodic in some sense). See Billingsley (1994), pages 32-34, for examples and a better explanation.

The definition of ergodicity given above is quite complex and is rarely used in time series analysis. However, one consequence of ergodicity is the ergodic theorem, which is extremely useful in time series. It states that if {X_t} is an ergodic stochastic process then

(1/n) Σ_{t=1}^n g(X_t) → E[g(X_0)] almost surely

for any function g(·). And in general for any shifts τ_1, ..., τ_k and function g : R^{k+1} → R we have

(1/n) Σ_{t=1}^n g(X_t, X_{t+τ_1}, ..., X_{t+τ_k}) → E[g(X_0, X_{τ_1}, ..., X_{τ_k})] almost surely  (1.11)

(often (1.11) is used as the definition of ergodicity, as it is an "if and only if" with the ergodic definition). Later you will see how useful this is.

(1.11) gives us an idea of what constitutes an ergodic process. Suppose that {ε_t} is an ergodic process (a classical example is iid random variables); then any reasonable (meaning measurable) function of {ε_t} is also ergodic. More precisely, if X_t is defined as

X_t = h(..., ε_{t−1}, ε_t, ε_{t+1}, ...),  (1.12)

where {ε_t} are iid random variables and h(·) is a measurable function, then {X_t} is an ergodic process. For full details see Stout (1974).

Remark 1.5.2 As mentioned above, all ergodic processes are stationary, but a stationary process is not necessarily ergodic. Here is one simple example. Suppose that {ε_t} are iid random variables and Z is a Bernoulli random variable with outcomes {1, 2} (where the chance of either outcome is half). Suppose that Z stays the same for all t. Define

X_t = { µ_1 + ε_t   if Z = 1;
        µ_2 + ε_t   if Z = 2.

It is clear that E(X_t | Z = i) = µ_i and E(X_t) = (1/2)(µ_1 + µ_2). This sequence is stationary. However, we observe that (1/T) Σ_{t=1}^T X_t will only converge to one of the means, hence we do not have almost sure convergence (or convergence in probability) to (1/2)(µ_1 + µ_2).
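A quick simulation sketch of this remark (our own illustration; the values µ_1 = 0 and µ_2 = 10 are arbitrary):

# Sketch of Remark 1.5.2: stationary but not ergodic. Each realisation's
# sample mean settles near mu_1 or mu_2, never near their average.
set.seed(2)
mu <- c(0, 10)
one_path_mean <- function(n) {
  z <- sample(1:2, 1)             # Z is drawn once and fixed for the whole path
  mean(mu[z] + rnorm(n))
}
replicate(6, one_path_mean(5000)) # values near 0 or 10, not near 5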
Exercise 1.5 State, with explanation, which of the following time series are second order stationary, which are strictly stationary and which are both.

(i) {ε_t} are iid random variables with mean zero and variance one.

(ii) {ε_t} are iid random variables from a Cauchy distribution.

(iii) X_{t+1} = X_t + ε_t, where {ε_t} are iid random variables with mean zero and variance one.

(iv) X_t = Y where Y is a random variable with mean zero and variance one.

(v) X_t = U_t + U_{t−1} + V_t, where {(U_t, V_t)} is a strictly stationary vector time series with E[U_t²] < ∞ and E[V_t²] < ∞.

Example 1.5.2 In Chapter 6 we consider estimation of the autocovariance function. However, for now use the R command acf. In Figure 1.11 we give the sample acf plots of the Southern Oscillation Index and the Sunspot data. We observe that they are very different. The acf of the SOI decays rapidly, but there does appear to be some sort of pattern in the correlations. On the other hand, there is more persistence in the acf of the Sunspot data. The correlations of the Sunspot data appear to decay, but over a longer period of time, and there is a clear periodicity in the correlation.

Exercise 1.6 (i) Make an ACF plot of the monthly temperature data.

(ii) Make an ACF plot of the yearly temperature data.

(iii) Make an ACF plot of the residuals (after fitting a line through the data, using the command lsfit(..)$res) of the yearly temperature data.

Briefly describe what you see.
Figure 1.11: Top: ACF of Southern Oscillation data. Bottom: ACF plot of Sunspot data.

R code. To make the above plots we use the commands

par(mfrow=c(2,1))
acf(soi,lag.max=300)
acf(sunspot,lag.max=60)

1.5.2 Towards statistical inference for time series

Returning to the sample mean Example 1.5.1, suppose we want to construct confidence intervals or apply statistical tests to the mean. This requires us to estimate the long run variance (assuming stationarity)

var(X̄) ≈ (1/n) c(0) + (2/n) Σ_{r=1}^∞ c(r).

There are several ways this can be done: either by fitting a model to the data and estimating the covariance from the model, or by doing it nonparametrically (a crude nonparametric sketch is given below).
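For concreteness, a rough R sketch of the nonparametric route (this simple truncation is our own illustration, not a recommended estimator; the truncation point M is ad hoc):

# Rough sketch: estimate the long run variance by truncating
# c(0) + 2*sum_{r=1}^M c(r), with sample autocovariances
long_run_var <- function(x, M = floor(sqrt(length(x)))) {
  ch <- acf(x, lag.max = M, type = "covariance", plot = FALSE)$acf
  ch[1] + 2*sum(ch[-1])
}
# var(mean(x)) is then approximately long_run_var(x)/length(x)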
This example motivates the contents of the course:

(i) Modelling: finding suitable time series models to fit to the data.

(ii) Forecasting: essentially predicting the future given current and past observations.

(iii) Estimation of the parameters in the time series model.

(iv) The spectral density function and frequency domain approaches; sometimes within the frequency domain time series methods become extremely elegant.

(v) Analysis of nonstationary time series.

(vi) Analysis of nonlinear time series.

(vii) How to derive sampling properties.

1.6 What makes a covariance a covariance?

The covariance of a stationary process has several very interesting properties. The most important is that it is positive semi-definite, which we define below.

Definition 1.6.1 (Positive semi-definite sequence) (i) A sequence {c(k); k ∈ Z} (Z is the set of all integers) is said to be positive semi-definite if for any n ∈ Z and sequence x = (x_1, ..., x_n) ∈ R^n the following is satisfied:

Σ_{i,j=1}^n c(i − j) x_i x_j ≥ 0.

(ii) A sequence is said to be an even positive semi-definite sequence if (i) is satisfied and c(k) = c(−k) for all k ∈ Z.

An extension of this notion is the positive semi-definite function.

Definition 1.6.2 (Positive semi-definite function) (i) A function {c(u); u ∈ R} is said to be positive semi-definite if for any n ∈ Z and sequence x = (x_1, ..., x_n) ∈ R^n the following is satisfied:

Σ_{i,j=1}^n c(u_i − u_j) x_i x_j ≥ 0.

(ii) A function is said to be an even positive semi-definite function if (i) is satisfied and c(u) = c(−u) for all u ∈ R.
36 Remark.6. You have probably encountered this positive definite notion before, when dealing with positive definite matrices. Recall the n n matrix Σ n is positive semi-definite if for all x R n x Σ n x 0. To see how this is related to positive semi-definite matrices, suppose that the matrix Σ n has a special form, that is the elements of Σ n are (Σ n ) i,j = c(i j). Then x Σ n x = n i,j c(i j)x ix j. We observe that in the case that {X t } is a stationary process with covariance c(k), the variance covariance matrix of X n = (X,..., X n ) is Σ n, where (Σ n ) i,j = c(i j). We now take the above remark further and show that the covariance of a stationary process is positive semi-definite. Theorem.6. Suppose that {X t } is a discrete time/continuous stationary time series with covariance function {c(k)}, then {c(k)} is a positive semi-definite sequence/function. Conversely for any even positive semi-definite sequence/function there exists a stationary time series with this positive semi-definite sequence/function as its covariance function. PROOF. We prove the result in the case that {X t } is a discrete time time series, ie. {X t ; t Z}. We first show that {c(k)} is a positive semi-definite sequence. Consider any sequence x = (x,..., x n ) R n, and the double sum n i,j x ic(i j)x j. Define the random variable Y = n i= x ix i. It is straightforward to see that var(y ) = x var(x n )x = n i, c(i j)x ix j where X n = (X,..., X n ). Since for any random variable Y, var(y ) 0, this means that n i, x ic(i j)x j 0, hence {c(k)} is a positive definite sequence. To show the converse, that is for any positive semi-definite sequence {c(k)} we can find a corresponding stationary time series with the covariance {c(k)} is relatively straightfoward, but depends on defining the characteristic function of a process and using Komologorov s extension theorem. We omit the details but refer an interested reader to Brockwell and Davis (998), Section.5. In time series analysis usually the data is analysed by fitting a model to the data. The model (so long as it is correctly specified, we will see what this means in later chapters) guarantees the covariance function corresponding to the model (again we cover this in later chapters) is positive definite. This means, in general we do not have to worry about positive definiteness of the covariance function, as it is implicitly implied. On the other hand, in spatial statistics, often the object of interest is the covariance function and specific classes of covariance functions are fitted to the data. In which case it is necessary to 35
ensure that the covariance function is positive semi-definite (noting that once such a covariance function has been found, by Theorem 1.6.1 there must exist a spatial process which has this covariance function). It is impossible to check for positive definiteness using Definitions 1.6.1 or 1.6.2. Instead an alternative but equivalent criterion is used. The general result, which does not impose any conditions on {c(k)}, is stated in terms of positive measures (this result is often called Bochner's theorem). Instead, we place some conditions on {c(k)} and state a simpler version of the theorem.

Theorem 1.6.2 Suppose the coefficients {c(k); k ∈ Z} are absolutely summable (that is, Σ_k |c(k)| < ∞). Then the sequence {c(k)} is positive semi-definite if and only if the function

f(ω) = (1/2π) Σ_{k=−∞}^∞ c(k) exp(ikω)

is nonnegative for all ω ∈ [0, 2π].

We also state a variant of this result for positive semi-definite functions. Suppose the function {c(u); u ∈ R} is absolutely integrable (that is, ∫_R |c(u)| du < ∞). Then the function {c(u)} is positive semi-definite if and only if the function

f(ω) = (1/2π) ∫_R c(u) exp(iuω) du

is nonnegative for all ω ∈ R.

PROOF. See Section … .

Example 1.6.1 We will show that the sequence c(0) = 1, c(1) = 0.5, c(−1) = 0.5 and c(k) = 0 for |k| > 1 is a positive definite sequence. From the definition of the spectral density given above, we see that the spectral density corresponding to the above sequence is f(ω) = (1/2π)(1 + cos(ω)). Since |cos(ω)| ≤ 1, f(ω) ≥ 0, thus the sequence is positive definite. An alternative method is to find a model which has this covariance structure. Let X_t = ε_t + ε_{t−1}, where ε_t are iid random variables with E[ε_t] = 0 and var(ε_t) = 0.5. This model has this covariance structure.
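As a numerical companion to Example 1.6.1 (our own sketch), one can tabulate f(ω) on a grid and cross-check the covariance by simulating the MA(1) model:

# Sketch for Example 1.6.1: f(omega) = (1 + cos(omega))/(2*pi) is nonnegative
omega <- seq(0, 2*pi, length.out = 200)
min((1 + cos(omega))/(2*pi))     # >= 0, so {c(k)} is positive semi-definite
# cross-check via the model X_t = eps_t + eps_{t-1}, var(eps_t) = 0.5
eps <- rnorm(100000, sd = sqrt(0.5))
x <- eps[-1] + eps[-length(eps)]
acf(x, lag.max = 2, type = "covariance", plot = FALSE) # near c(0)=1, c(1)=0.5, c(2)=0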
38 We note that Theorem.6.2 can easily be generalized to higher dimensions, d, by taking Fourier transforms over Z d or R d. Exercise.7 Which of these sequences can used as the autocovariance function of a second order stationary time series? (i) c( ) = /2, c(0) =, c() = /2 and for all k >, c(k) = 0. (ii) c( ) = /2, c(0) =, c() = /2 and for all k >, c(k) = 0. (iii) c( 2) = 0.8, c( ) = 0.5, c(0) =, c() = 0.5 and c(2) = 0.8 and for all k > 2, c(k) = 0. Exercise.8 (i) Show that the function c(u) = exp( a u ) where a > 0 is a positive semidefinite function. (ii) Show that the commonly used exponential spatial covariance defined on R 2, c(u, u 2 ) = exp( a u 2 + u2 2 ), where a > 0, is a positive semi-definite function. 37
Chapter 2

Linear time series

Prerequisites:

- Familiarity with linear models.
- Solving polynomial equations.
- Familiarity with complex numbers.
- Understanding under what conditions the partial sum S_n = Σ_{j=1}^n a_j has a well defined limit (i.e. if Σ_j |a_j| < ∞, then S_n → S, where S = Σ_{j=1}^∞ a_j).

Objectives:

- Understand what causal and invertible mean.
- Know what the AR, MA and ARMA time series models are.
- Know how to find a solution of an ARMA time series, and understand why this is important (how the roots determine causality and why this is important to know, in terms of characteristics of the process and also simulations).
- Understand how the roots of the AR polynomial can determine features in the time series and covariance structure (such as pseudo-periodicities).
2.1 Motivation

The objective of this chapter is to introduce the linear time series model. Linear time series models are designed to model the covariance structure in the time series. There are two popular subgroups of linear time series models, (a) the autoregressive and (b) the moving average models, which can be combined to make the autoregressive moving average model.

We motivate the autoregressive model from the perspective of classical linear regression. We recall that one objective in linear regression is to predict the response variable given variables that are observed. To do this, typically linear dependence between response and variables is assumed and we model Y_i as

Y_i = Σ_{j=1}^p a_j X_{ij} + ε_i,

where ε_i is such that E[ε_i | X_{ij}] = 0 and, more commonly, ε_i and X_{ij} are independent. In linear regression, once the model has been defined we can immediately find estimators of the parameters, do model selection etc.

Returning to time series, one major objective is to predict/forecast the future given current and past observations (just as in linear regression our aim is to predict the response given the observed variables). At least formally, it seems reasonable to represent this as

X_t = Σ_{j=1}^p a_j X_{t−j} + ε_t,  (2.1)

where we assume that {ε_t} are independent, identically distributed, zero mean random variables. Model (2.1) is called an autoregressive model of order p (AR(p) for short). It is easy to see that

E(X_t | X_{t−1}, ..., X_{t−p}) = Σ_{j=1}^p a_j X_{t−j}

(this is the expected value of X_t given that X_{t−1}, ..., X_{t−p} have already been observed); thus the past values of X_t have a linear influence on the conditional mean of X_t (compare with the linear model Y_t = Σ_{j=1}^p a_j X_{t,j} + ε_t, where E(Y_t | X_{t,j}) = Σ_{j=1}^p a_j X_{t,j}).

Conceptually, the autoregressive model appears to be a straightforward extension of the linear regression model. Don't be fooled by this; it is a more complex object. Unlike the linear regression model, (2.1) is an infinite set of linear difference equations. This means, for this system of equations to be well defined, it needs
to have a solution which is meaningful. To understand why, recall that the equation is defined for all t ∈ Z, so let us start the equation at the beginning of time (t = −∞) and run it on. Without any constraint on the parameters {a_j}, there is no reason to believe the solution is finite (contrast this with linear regression where these issues are not relevant). Therefore, the first thing to understand is under what conditions the AR model (2.1) has a well defined stationary solution and what features in a time series the solution is able to capture.

Of course, one could ask why go through the effort. One could simply use least squares to estimate the parameters. This is possible, but without a proper analysis it is not clear whether the model has a meaningful solution (for example, in Section 3.3 we show that the least squares estimator can lead to misspecified models); it is not even possible to make simulations of the process. Therefore, there is a practical motivation behind our theoretical treatment.

In this chapter we will be deriving conditions for a strictly stationary solution of (2.1). We will place moment conditions on the innovations {ε_t}; these conditions will be sufficient but not necessary conditions. Under these conditions we obtain a strictly stationary solution but not a second order stationary process. In Chapter 3 we obtain conditions for (2.1) to have a solution which is both strictly and second order stationary. It is possible to obtain a strictly stationary solution under far weaker conditions (see Theorem 4.0.1), but these won't be considered here.

Example 2.1.1 How would you simulate from the model

X_t = φ_1 X_{t−1} + φ_2 X_{t−2} + ε_t?
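One hedged answer (a sketch, not from the notes): simulate the recursion with arbitrary starting values and discard a burn-in so their effect dies out; this presumes parameters for which a stationary causal solution exists (here we borrow φ_1 = 1.5, φ_2 = −0.75 from (1.6)).

# Sketch for Example 2.1.1: recursive simulation with a burn-in
set.seed(1)
n <- 500; burn <- 200
phi1 <- 1.5; phi2 <- -0.75      # borrowed from (1.6); assumed stationary region
eps <- rnorm(n + burn)
x <- numeric(n + burn)          # starting values x[1] = x[2] = 0 are arbitrary
for (t in 3:(n + burn)) {
  x[t] <- phi1*x[t - 1] + phi2*x[t - 2] + eps[t]
}
x <- x[(burn + 1):(n + burn)]   # discard the burn-in
# equivalently: arima.sim(list(ar = c(1.5, -0.75)), n = 500)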
2.2 Linear time series and moving average models

2.2.1 Infinite sums of random variables

Before defining a linear time series, we define the MA(q) model, which is a subclass of linear time series. Let us suppose that {ε_t} are iid random variables with mean zero and finite variance. The time series {X_t} is said to have an MA(q) representation if it satisfies

X_t = Σ_{j=0}^q ψ_j ε_{t−j},

where E(ε_t) = 0 and var(ε_t) = 1. It is clear that X_t is a rolling finite weighted sum of {ε_t}, therefore {X_t} must be well defined (which for finite sums means it is almost surely finite; this you can see because it has a finite variance). We now extend this notion and consider infinite sums of random variables. Things become more complicated, since care must always be taken with anything involving infinite sums. More precisely, for the sum

X_t = Σ_{j=−∞}^∞ ψ_j ε_{t−j}

to be well defined (to have a finite limit), the partial sums S_n = Σ_{j=−n}^n ψ_j ε_{t−j} should be (almost surely) finite and the sequence S_n should converge (i.e. |S_{n_1} − S_{n_2}| → 0 as n_1, n_2 → ∞). Below, we give conditions under which this is true.

Lemma 2.2.1 Suppose {X_t} is a strictly stationary time series with E|X_t| < ∞. Then {Y_t}, defined by

Y_t = Σ_{j=−∞}^∞ ψ_j X_{t−j},

where Σ_j |ψ_j| < ∞, is a strictly stationary time series. Furthermore, the partial sum converges almost surely, Y_{n,t} = Σ_{j=−n}^n ψ_j X_{t−j} → Y_t. If var(X_t) < ∞, then {Y_t} is second order stationary and converges in mean square (that is, E(Y_{n,t} − Y_t)² → 0).

PROOF. See Brockwell and Davis (1998), Proposition 3.1.1 or Fuller (1995), Theorem 2.1.1 (also Shumway and Stoffer (2006), page 86).

Example 2.2.1 Suppose {X_t} is a strictly stationary time series with var(X_t) < ∞. Define {Y_t} as the following infinite sum

Y_t = Σ_{j=0}^∞ j^k ρ^j X_{t−j},

where |ρ| < 1. Then {Y_t} is also a strictly stationary time series with a finite variance. We will use this example later in the course.

Having derived conditions under which infinite sums are well defined, we can now define the general class of linear and MA(∞) processes.
Definition 2.2.1 (The linear process and moving average MA(∞)) Suppose that {ε_t} are iid random variables with Σ_j |ψ_j| < ∞ and E|ε_t| < ∞.

(i) A time series is said to be a linear time series if it can be represented as

X_t = Σ_{j=−∞}^∞ ψ_j ε_{t−j},

where {ε_t} are iid random variables with finite variance.

(ii) The time series {X_t} has an MA(∞) representation if it satisfies

X_t = Σ_{j=0}^∞ ψ_j ε_{t−j}.

Note that since these sums are well defined, by equation (1.12) {X_t} is a strictly stationary (ergodic) time series.

The difference between an MA(∞) process and a linear process is quite subtle. A linear process involves past, present and future innovations {ε_t}, whereas the MA(∞) uses only past and present innovations.

Definition (Causal and invertible) (i) A process is said to be causal if it has the representation

X_t = Σ_{j=0}^∞ a_j ε_{t−j}.

(ii) A process is said to be invertible if it has the representation

X_t = Σ_{j=1}^∞ b_j X_{t−j} + ε_t

(so far we have yet to give conditions under which the above has a well defined solution).

Causal and invertible solutions are useful in both estimation and forecasting (predicting the future based on the current and past). A very interesting class of models which have MA(∞) representations are the autoregressive and autoregressive moving average models. In the following sections we prove this.
2.3 The autoregressive model and the solution

In this section we will examine under what conditions the AR process has a stationary solution.

2.3.1 Difference equations and back-shift operators

The autoregressive model is defined in terms of inhomogeneous difference equations. Difference equations can often be represented in terms of backshift operators, so we start by defining them and see why this representation may be useful (and why it should work).

The time series {X_t} is said to be autoregressive of order p (AR(p)) if it satisfies the equation

X_t − φ_1 X_{t−1} − ... − φ_p X_{t−p} = ε_t, t ∈ Z,

where {ε_t} are zero mean, finite variance random variables. As we mentioned previously, the autoregressive model is a difference equation (which can be treated as an infinite number of simultaneous equations). Therefore for it to make any sense it must have a solution. To obtain a general solution we write the autoregressive model in terms of backshift operators:

X_t − φ_1 BX_t − ... − φ_p B^p X_t = ε_t, i.e. φ(B)X_t = ε_t,

where φ(B) = 1 − Σ_{j=1}^p φ_j B^j and B is the backshift operator, defined such that B^k X_t = X_{t−k}. Simply rearranging φ(B)X_t = ε_t gives the solution of the autoregressive difference equation as X_t = φ(B)^{−1} ε_t; however this is just an algebraic manipulation. Below we investigate whether it really has any meaning. To do this, we start with an example.
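As a practical aside (our own sketch, anticipating the chapter objective that the roots of φ(·) determine causality): in R the roots of the characteristic polynomial can be inspected with polyroot.

# Sketch: roots of phi(z) = 1 - 1.5 z + 0.75 z^2, the AR(2) polynomial
# from (1.6); polyroot takes coefficients in increasing order of power
Mod(polyroot(c(1, -1.5, 0.75)))  # both moduli are about 1.15 > 1,
                                 # consistent with a causal stationary solution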
$X_t = 0.5X_{t-1} + \varepsilon_t$ and $X_{t-1} = 0.5X_{t-2} + \varepsilon_{t-1}$. Using this we get $X_t = \varepsilon_t + 0.5(0.5X_{t-2} + \varepsilon_{t-1}) = \varepsilon_t + 0.5\varepsilon_{t-1} + 0.5^2X_{t-2}$. Continuing this backward iteration we obtain, at the $k$th iteration,
$$X_t = \sum_{j=0}^{k}(0.5)^j\varepsilon_{t-j} + (0.5)^{k+1}X_{t-k-1}.$$
Because $(0.5)^{k+1} \to 0$ as $k \to \infty$, by taking the limit we can show that $X_t = \sum_{j=0}^{\infty}(0.5)^j\varepsilon_{t-j}$ is almost surely finite and a solution of (2.2). Of course, like any other equation, one may wonder whether it is the unique solution (recalling that $3x^2 + 2x + 1 = 0$ has two solutions). We show in Section 2.3.3 that this is the unique stationary solution of (2.2).

Let us see whether we can obtain a solution using the difference equation representation. We recall that, by crudely taking inverses, the solution is $X_t = (1 - 0.5B)^{-1}\varepsilon_t$. The obvious question is whether this has any meaning. Note that $(1-0.5B)^{-1} = \sum_{j=0}^{\infty}(0.5B)^j$ for $|B| \le 2$; hence substituting this power series expansion into $X_t$ we have
$$X_t = (1-0.5B)^{-1}\varepsilon_t = \Big(\sum_{j=0}^{\infty}(0.5B)^j\Big)\varepsilon_t = \Big(\sum_{j=0}^{\infty}(0.5)^jB^j\Big)\varepsilon_t = \sum_{j=0}^{\infty}(0.5)^j\varepsilon_{t-j},$$
which corresponds to the solution above. Hence the backshift operator in this example helps us to obtain a solution. Moreover, because the solution can be written in terms of past values of $\varepsilon_t$, it is causal.

(ii) Let us consider the AR model, which we will see has a very different solution:
$$X_t = 2X_{t-1} + \varepsilon_t. \qquad (2.3)$$
Doing what we did in (i) we find that after the $k$th back iteration we have $X_t = \sum_{j=0}^{k}2^j\varepsilon_{t-j} + 2^{k+1}X_{t-k-1}$. However, unlike example (i), $2^k$ does not converge as $k \to \infty$. This suggests that if we continue the iteration, $X_t = \sum_{j=0}^{\infty}2^j\varepsilon_{t-j}$ is not a quantity that is finite (when $\varepsilon_t$ are iid). Therefore $X_t = \sum_{j=0}^{\infty}2^j\varepsilon_{t-j}$ cannot be considered as a solution of (2.3). We need to write (2.3) in a slightly different way in order to obtain a meaningful solution. Rewriting (2.3) we have $X_{t-1} = 0.5X_t - 0.5\varepsilon_t$. Forward iterating this we get $X_t = -(0.5)\sum_{j=0}^{k}(0.5)^j\varepsilon_{t+j+1} + (0.5)^{k+1}X_{t+k+1}$. Since $(0.5)^{k+1} \to 0$ we have
$$X_t = -(0.5)\sum_{j=0}^{\infty}(0.5)^j\varepsilon_{t+j+1}$$
as a solution of (2.3).
Let us see whether the difference equation can also offer a solution. Since $(1-2B)X_t = \varepsilon_t$, using the crude manipulation we have $X_t = (1-2B)^{-1}\varepsilon_t$. Now we see that $(1-2B)^{-1} = \sum_{j=0}^{\infty}(2B)^j$ for $|B| < 1/2$. Using this expansion gives $X_t = \sum_{j=0}^{\infty}2^jB^j\varepsilon_t = \sum_{j=0}^{\infty}2^j\varepsilon_{t-j}$, but as pointed out above this sum is not well defined. What we find is that $\phi(B)^{-1}\varepsilon_t$ only makes sense (is well defined) if the series expansion of $\phi(B)^{-1}$ converges in a region that includes the unit circle $|B| = 1$.

What we need is another series expansion of $(1-2B)^{-1}$ which converges in a region which includes the unit circle $|B| = 1$ (as an aside, we note that a function does not necessarily have a unique series expansion; it can have different series expansions which converge in different regions). We now show that a convergent series expansion needs to be defined in terms of negative powers of $B$, not positive powers. Writing $(1-2B) = -(2B)(1-(2B)^{-1})$, we therefore have
$$(1-2B)^{-1} = -(2B)^{-1}\sum_{j=0}^{\infty}(2B)^{-j},$$
which converges for $|B| > 1/2$. Using this expansion we have
$$X_t = -\sum_{j=0}^{\infty}(0.5)^{j+1}B^{-j-1}\varepsilon_t = -\sum_{j=0}^{\infty}(0.5)^{j+1}\varepsilon_{t+j+1},$$
which we have shown above is a well defined solution of (2.3). In summary, $(1-2B)^{-1}$ has two series expansions:
$$(1-2B)^{-1} = \sum_{j=0}^{\infty}(2B)^{j},$$
which converges for $|B| < 1/2$, and
$$(1-2B)^{-1} = -(2B)^{-1}\sum_{j=0}^{\infty}(2B)^{-j},$$
which converges for $|B| > 1/2$. The one that is useful for us is the series which converges when $|B| = 1$.

It is clear from the above examples how to obtain the solution of a general AR(1). We now show that this solution is the unique stationary solution.
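A small numerical check of the two examples above: the truncated causal sum solves (2.2) and the truncated future sum solves (2.3), both up to geometrically small truncation error. The truncation point M and series length n below are hypothetical choices made only for illustration.

# Sketch: verifying the causal and noncausal AR(1) solutions numerically
# (M and n are arbitrary illustrative truncation/length choices).
set.seed(1)
n <- 300; M <- 50
eps <- rnorm(n + 2 * M + 1)
t.idx <- (M + 1):(n + M)
# causal solution of X_t = 0.5 X_{t-1} + e_t:  X_t = sum_j 0.5^j e_{t-j}
X <- sapply(t.idx, function(t) sum(0.5^(0:M) * eps[t - (0:M)]))
max(abs(X[-1] - (0.5 * X[-length(X)] + eps[t.idx[-1]])))   # ~0, the equation holds
# noncausal solution of X_t = 2 X_{t-1} + e_t:  X_t = -sum_j 0.5^{j+1} e_{t+j+1}
Y <- sapply(t.idx, function(t) -sum(0.5^((0:M) + 1) * eps[t + (0:M) + 1]))
max(abs(Y[-1] - (2 * Y[-length(Y)] + eps[t.idx[-1]])))     # ~0, up to truncation error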
47 2.3.3 The unique solution of a general AR() Consider the AR() process X t = φx t + ε t, where φ <. Using the method outlined in (i), it is straightforward to show that X t = j=0 φj ε t j is it s stationary solution, we now show that this solution is unique. We first show that X t = j=0 φj ε t j is well defined (that it is almost surely finite). We note that X t j=0 φj ε t j. Thus we will show that j=0 φj ε t j is almost surely finite, which will imply that X t is almost surely finite. By montone convergence we can exchange sum and expectation and we have E( X t ) E(lim n n j=0 φj ε t j ) = lim n n j=0 φj E ε t j ) = E( ε 0 ) j=0 φj <. Therefore since E X t <, j=0 φj ε t j is a well defined solution of X t = φx t + ε t. To show that it is the unique stationary causal solution, let us suppose there is another (causal) solution, call it Y t (note that this part of the proof is useful to know as such methods are often used when obtaining solutions of time series models). Clearly, by recursively applying the difference equation to Y t, for every s we have Y t = s φ j ε t j + φ s Y t s. j=0 Evaluating the difference between the two solutions gives Y t X t = A s B s where A s = φ s Y t s and B s = j=s+ φj ε t j for all s. To show that Y t and X t coincide almost surely we will show that for every ɛ > 0, s= P ( A s B s > ε) < (and then apply the Borel-Cantelli lemma). We note if A s B s > ε), then either A s > ε/2 or B s > ε/2. Therefore P ( A s B s > ε) P ( A s > ε/2)+p ( B s > ε/2). To bound these two terms we use Markov s inequality. It is straightforward to show that P ( B s > ε/2) Cφ s /ε. To bound E A s, we note that Y s φ Y s + ε s, since {Y t } is a stationary solution then E Y s ( φ ) E ε s, thus E Y t E ε t /( φ ) <. Altogether this gives P ( A s B s > ε) Cφ s /ε (for some finite constant C). Hence s= P ( A s B s > ε) < s= Cφs /ε <. Thus by the Borel-Cantelli lemma, this implies that the event { A s B s > ε} happens only finitely often (almost surely). Since for every ε, { A s B s > ε} occurs (almost surely) only finite often for all ε, then Y t = X t almost surely. Hence X t = j=0 φj ε t j is (almost surely) the unique causal solution. 46
48 2.3.4 The solution of a general AR(p) Let us now summarise our observation for the general AR() process X t = φx t + ε t. If φ <, then the solution is in terms of past values of {ε t }, if on the other hand φ > the solution is in terms of future values of {ε t }. Now we try to understand this in terms of the expansions of the characteristic polynomial φ(b) = φb (using the AR() as a starting point). From what we learnt in the previous section, we require the characteristic polynomial of the AR process to have a convergent power series expansion in the region including the ring B =. In terms of the AR() process, if the root of φ(b) is greater than one, then the power series of φ(b) is in terms of positive powers, if it is less than one, then φ(b) is in terms of negative powers. Generalising this argument to a general polynomial, if the roots of φ(b) are greater than one, then the power series of φ(b) (which converges for B = ) is in terms of positive powers (hence the solution φ(b) ε t will be in past terms of {ε t }). On the other hand, if the roots are both less than and greater than one (but do not lie on the unit circle), then the power series of φ(b) will be in both negative and positive powers. Thus the solution X t = φ(b) ε t will be in terms of both past and future values of {ε t }. We summarize this result in a lemma below. Lemma 2.3. Suppose that the AR(p) process satisfies the representation φ(b)x t = ε t, where none of the roots of the characteristic polynomial lie on the unit circle and E ε t <. Then {X t } has a stationary, almost surely unique, solution. We see that where the roots of the characteristic polynomial φ(b) lie defines the solution of the AR process. We will show in Sections and 3..2 that it not only defines the solution but determines some of the characteristics of the time series. Exercise 2. Suppose {X t } satisfies the AR(p) representation X t = p φ j X t j + ε t, where p φ j < and E ε t <. Show that {X t } will always have a causal stationary solution. 47
2.3.5 Explicit solution of an AR(2) model

Specific example

Suppose $\{X_t\}$ satisfies
$$X_t = 0.75X_{t-1} - 0.125X_{t-2} + \varepsilon_t,$$
where $\{\varepsilon_t\}$ are iid random variables. We want to obtain a solution for the above equations.

It is not easy to use the backward (or forward) iterating technique for AR processes beyond order one. This is where using the backshift operator becomes useful. We start by writing $X_t = 0.75X_{t-1} - 0.125X_{t-2} + \varepsilon_t$ as $\phi(B)X_t = \varepsilon_t$, where $\phi(B) = 1 - 0.75B + 0.125B^2$, which leads to what is commonly known as the characteristic polynomial $\phi(z) = 1 - 0.75z + 0.125z^2$. If we can find a power series expansion of $\phi(B)^{-1}$ which is valid for $|B| = 1$, then the solution is $X_t = \phi(B)^{-1}\varepsilon_t$.

We first observe that $\phi(z) = 1 - 0.75z + 0.125z^2 = (1-0.5z)(1-0.25z)$. Therefore by using partial fractions we have
$$\frac{1}{\phi(z)} = \frac{1}{(1-0.5z)(1-0.25z)} = \frac{2}{(1-0.5z)} - \frac{1}{(1-0.25z)}.$$
We recall from geometric expansions that
$$\frac{2}{(1-0.5z)} = 2\sum_{j=0}^{\infty}(0.5)^jz^j \quad (|z| \le 2), \qquad \frac{-1}{(1-0.25z)} = -\sum_{j=0}^{\infty}(0.25)^jz^j \quad (|z| \le 4).$$
Putting the above together gives
$$\frac{1}{(1-0.5z)(1-0.25z)} = \sum_{j=0}^{\infty}\{2(0.5)^j - (0.25)^j\}z^j \qquad |z| < 2.$$
The above expansion is valid for $|z| = 1$, because $\sum_{j=0}^{\infty}|2(0.5)^j - (0.25)^j| < \infty$ (see Lemma 2.3.2). Hence
$$X_t = \{(1-0.5B)(1-0.25B)\}^{-1}\varepsilon_t = \Big(\sum_{j=0}^{\infty}\{2(0.5)^j - (0.25)^j\}B^j\Big)\varepsilon_t = \sum_{j=0}^{\infty}\{2(0.5)^j - (0.25)^j\}\varepsilon_{t-j},$$
which gives a stationary solution to the AR(2) process (see Lemma 2.2.1).

The discussion above shows how the backshift operator can be applied and how it can be used
50 to obtain solutions to AR(p) processes. The solution of a general AR(2) model We now generalise the above to general AR(2) models X t = (a + b)x t abx t 2 + ε t, the characteristic polynomial of the above is (a + b)z + abz 2 = ( az)( bz). This means the solution of X t is X t = ( Ba) ( Bb) ε t, thus we need an expansion of ( Ba) ( Bb). Assuming that a b, as above we have Cases: ( za)( zb) = ( b b a bz a ) az (i) a < and b <, this means the roots lie outside the unit circle. Thus the expansion is which leads to the causal solution ( za)( zb) = ( b b j z j a a j z j), (b a) X t = b a ( j=0 j=0 j=0 ( b j+ a j+ )ε t j ). (2.4) (ii) Case that a > and b <, this means the roots lie inside and outside the unit circle and we have the expansion ( za)( zb) = = ( b b a (b a) ) bz a (az)((az) ) ( b b j z j + z a j z ), j (2.5) j=0 j=0 49
51 which leads to the non-causal solution X t = ( b j+ ε t j + a j ) ε t++j. (2.6) b a j=0 Later we show that the non-causal solution has the same correlation structure as the causal solution when a = a. This solution throws up additional interesting results. Let us return to the expansion in (2.5) and apply it to X t j=0 X t = = ( Ba)( Bb) ε t = b a b a (Y t + Z t+ ) b bb ε t } {{ } causal AR() + B( a B ) ε t } {{ } noncausal AR() where Y t = by t + ε t and Z t+ = a Z t+2 + ε t+. In other words, the noncausal AR(2) process is the sum of a causal and a future AR() process. This is true for all noncausal time series (except when there is multiplicity in the roots) and is discussed further in Section 2.6. Several authors including Richard Davis, Jay Breidt and Beth Andrews argue that noncausal time series can model features in data which causal time series cannot. (iii) a = b >. The characteristic polynomial is ( az) 2. To obtain the convergent expansion when z = we note that ( az) 2 = ( ) d( az) d(az). Thus This leads to the causal solution ( ) ( az) 2 = ( ) j(az) j. X t = ( ) j=0 ja j ε t j. Exercise 2.2 Show for the AR(2) model X t = φ X t + φ 2 X t 2 + ε t to have a causal stationary 50
52 solution the parameters φ, φ 2 must lie in the region φ 2 + φ <, φ 2 φ < φ 2 <. Exercise 2.3 (a) Consider the AR(2) process X t = φ X t + φ 2 X t 2 + ε t, where {ε t } are iid random variables with mean zero and variance one. Suppose the roots of the characteristic polynomial φ z φ 2 z 2 are greater than one. Show that φ + φ 2 < 4. (b) Now consider a generalisation of this result. Consider the AR(p) process X t = φ X t + φ 2 X t φ p X t p + ε t. Suppose the roots of the characteristic polynomial φ z... φ p z p are greater than one. Show that φ φ p 2 p Features of a realisation from an AR(2) We now explain why the AR(2) (and higher orders) can characterise some very interesting behaviour (over the rather dull AR()). For now we assume that X t is a causal time series which satisfies the AR(2) representation X t = φ X t + φ 2 X t 2 + ε t where {ε t } are iid with mean zero and finite variance. The characteristic polynomial is φ(b) = φ B φ 2 B 2. Let us assume the roots of φ(b) are complex, since φ and φ 2 are real, the roots are complex conjugates. Thus by using case (i) above we have φ B φ 2 B 2 = ( ) λ λ λ λb λ λb, where λ and λ are the roots of the characteristic. Thus X t = C λ j ε t j C λ j ε t j, (2.7) j=0 j=0 5
53 where C = [λ λ]. Since λ and C are complex we use the representation λ = r exp(iθ) and C = α exp(iβ) (noting that r < ), and substitute these expressions for λ and C into (2.7) to give X t = α r j cos(θj + β)ε t j. j=0 We can see that X t is effectively the sum of cosines with frequency θ that have been modulated by the iid errors and exponentially damped. This is why for realisations of autoregressive processes you will often see periodicities (depending on the roots of the characteristic). These arguments can be generalised to higher orders p. Exercise 2.4 (a) Obtain the stationary solution of the AR(2) process X t = 7 3 X t 2 3 X t 2 + ε t, where {ε t } are iid random variables with mean zero and variance σ 2. Does the solution have an MA( ) representation? (b) Obtain the stationary solution of the AR(2) process X t = 4 3 X t X t 2 + ε t, where {ε t } are iid random variables with mean zero and variance σ 2. Does the solution have an MA( ) representation? (c) Obtain the stationary solution of the AR(2) process X t = X t 4X t 2 + ε t, where {ε t } are iid random variables with mean zero and variance σ 2. Does the solution have an MA( ) representation? Exercise 2.5 Construct a causal stationary AR(2) process with pseudo-period 7. Using the R function arima.sim simulate a realisation from this process (of length 200) and make a plot of the periodogram. What do you observe about the peak in this plot? Below we now consider solutions to general AR( ) processes. 52
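Before moving on, here is a short sketch connecting the complex roots above to the pseudo-period of a realisation (in the spirit of Exercise 2.5). The damping $r = 0.9$ and target period 12 are arbitrary illustrative choices.

# Sketch: building a causal AR(2) with a prescribed pseudo-period
# (period 12 and damping r = 0.9 are arbitrary illustrative choices).
r <- 0.9; theta <- 2 * pi / 12
phi1 <- 2 * r * cos(theta)    # the factorisation (1 - re^{i theta}z)(1 - re^{-i theta}z)
phi2 <- -r^2                  # gives phi(z) = 1 - 2r cos(theta) z + r^2 z^2
roots <- polyroot(c(1, -phi1, -phi2))
Mod(roots)                    # both equal 1/r > 1, so the solution is causal
2 * pi / abs(Arg(roots))      # recovers the pseudo-period, 12
xt <- arima.sim(list(ar = c(phi1, phi2)), n = 200)
plot.ts(xt)                   # the realisation shows quasi-periodic behaviour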
54 2.3.7 Solution of the general AR( ) model AR( ) models are more general than the AR(p) model and are able to model more complex behaviour, such as slower decay of the covariance structure. It is arguable how useful these models are in modelling data, however recently it has become quite popular in time series bootstrap methods. In order to obtain the stationary solution of an AR( ), we need to define an analytic function and its inverse. Definition 2.3. (Analytic functions in the region Ω) Suppose that z C. φ(z) is an analytic complex function in the region Ω, if it has a power series expansion which converges in Ω, that is φ(z) = j= φ jz j. If there exists a function φ(z) = j= φ j z j such that φ(z)φ(z) = for all z Ω, then φ(z) is the inverse of φ(z) in the region Ω. Well known examples of analytic functions include (i) Finite order polynomials such as φ(z) = p j=0 φ jz j for Ω = C. (ii) The expansion ( 0.5z) = j=0 (0.5z)j for Ω = {z; z 2}. We observe that for AR processes we can represent the equation as φ(b)x t = ε t, which formally gives the solution X t = φ(b) ε t. This raises the question, under what conditions on φ(b) is φ(b) ε t a valid solution. For φ(b) ε t to make sense φ(b) should be represented as a power series expansion. Below, we give conditions on the power series expansion which give a stationary solution. It is worth noting this is closely related to Lemma Lemma Suppose that φ(z) = j= φ jz j is finite on a region that includes z = (hence it is analytic) and {X t } is a strictly stationary process with E X t <. Then j= φ j < and Y t = φ(b)x t j = j= φ jx t j is almost surely finite and strictly stationary time series. PROOF. It can be shown that if sup z = φ(z) <, in other words on the unit circle j= φ jz j <, then j= φ j <. Since the coefficients are absolutely summable, then by Lemma 2.2. we have that Y t = φ(b)x t j = j= φ jx t j is almost surely finite and strictly stationary. case). Using the above we can obtain the solution of an AR( ) (which includes an AR(p) as a special 53
55 Corollary 2.3. Suppose that X t = φ j X t j + ε t and φ(z) has an inverse ψ(z) = j= ψ jz j which is analytic in a region including z =, then the solution of X t is X t = j= ψ j ε t j. Corollary Let X t be an AR(p) time series, where X t = p φ j X t j + ε t. Suppose the roots of the characteristic polynomial φ(b) = p φ jb j do not lie on the unit circle B =, then X t admits a strictly stationary solution. In addition suppose the roots of φ(b) all lie outside the unit circle, then X t admits a strictly stationary, causal solution. This summarises what we observed in Section Rules of the back shift operator: (i) If a(z) is analytic in a region Ω which includes the unit circle z = in it s interior, then a(b)x t is a well defined random variable. (ii) The operator is commutative and associative, that is [a(b)b(b)]x t = a(b)[b(b)x t ] = [b(b)a(b)]x t (the square brackets are used to indicate which parts to multiply first). This may seems obvious, but remember matrices are not commutative! (iii) Suppose that a(z) and its inverse a(z) are both have solutions in the region Ω which includes the unit circle z = in it s interior. If a(b)x t = Z t, then X t = a(b) Z t. Example 2.3. (Analytic functions) (i) Clearly a(z) = 0.5z is analytic for all z C, and has no zeros for z < 2. The inverse is a(z) = j=0 (0.5z)j is well defined in the region z < 2. 54
56 (ii) Clearly a(z) = 2z is analytic for all z C, and has no zeros for z > /2. The inverse is a(z) = ( 2z) ( (/2z)) = ( 2z) ( j=0 (/(2z))j ) well defined in the region z > /2. (iii) The function a(z) = ( 0.5z)( 2z) is analytic in the region 0.5 < z < 2. (iv) a(z) = z, is analytic for all z C, but is zero for z =. Hence its inverse is not well defined for regions which involve z = (see Example 2.3.2). Example (Unit root/integrated processes and non-invertible processes) (i) If the difference equation has root one, then an (almost sure) stationary solution of the AR model do not exist. The simplest example is the random walk X t = X t + ε t (φ(z) = ( z)). This is an example of an Autoregressive Integrated Moving Average ARIMA(0,, 0) model ( B)X t = ε t. To see that it does not have a stationary solution, we iterate the equation n steps backwards and we see that X t = n j=0 ε t j + X t n. S t,n = n j=0 ε t j is the partial sum, but it is clear that the partial sum S t,n does not have a limit, since it is not a Cauchy sequence, ie. S t,n S t,m does not have a limit. However, given some initial value X 0, for t > 0 we can define the unit process X t = X t +ε. Notice that the nonstationary solution of this sequence is X t = X 0 + t ε t j which has variance var(x t ) = var(x 0 ) + t (assuming that {ε t } are iid random variables with variance one and independent of X 0 ). We observe that we can stationarize the process by taking first differences, ie. defining Y t = X t X t = ε t. (ii) The unit process described above can be generalised ARIMA(0, d, 0), where ( B) d X t = ε t. In this case to stationarize the sequence we take d differences, ie. let Y t,0 = X t and for i d define the iteration Y t,i = Y t,i Y t,i and Y t = Y t,d will be a stationary sequence. Or define Y t = d j=0 d! j!(d j)! ( )j X t j, in which case Y t as defined above will be a stationary sequence. 55
57 (iii) The general ARIMA(p, d, q) is defined as ( B) d φ(b)x t = θ(b)ε t, where φ(b) and θ(b) are p and q order polynomials respectively and the roots of φ(b) lie outside the unit circle. Another way of describing the above model is that after taking d differences (as detailed in (ii)) the resulting process is an ARMA(p, q) process (see Section 2.5 for the definition of an ARMA model). To illustrate the difference between stationary ARMA and ARIMA processes, in Figure 2. (iv) In examples (i) and (ii) a stationary solution does not exist. We now consider an example where the process is stationary but an autoregressive representation does not exist. Consider the MA() model X t = ε t ε t. We recall that this can be written as X t = φ(b)ε t where φ(b) = B. From Example 2.3.(iv) we know that φ(z) does not exist, therefore it does not have an AR( ) representation since ( B) X t = ε t is not well defined. ar ar2i Time Time (a) X t =.5X t 0.75X t 2 + ε t (b) ( B)Y t = X t, where is defined in (a) Figure 2.: Realisations from an AR process and it s corresponding integrated process, using N(0, ) innovations (generated using the same seed). 56
2.4 An explanation as to why the backshift operator method works

To understand why the magic backshift operator works, we use matrix notation to rewrite the AR(p) model as an infinite set of difference equations
$$\begin{pmatrix} \ddots & \ddots & & & & \\ \cdots & 1 & -\phi_1 & \cdots & -\phi_p & \cdots \\ \cdots & 0 & 1 & -\phi_1 & \cdots & -\phi_p \\ & & & \ddots & \ddots & \end{pmatrix}\begin{pmatrix}\vdots \\ X_t \\ X_{t-1} \\ \vdots\end{pmatrix} = \begin{pmatrix}\vdots \\ \varepsilon_t \\ \varepsilon_{t-1} \\ \vdots\end{pmatrix}.$$
The above is an infinite dimensional equation (and the matrix is an infinite upper triangular matrix). Formally, to obtain a solution we invert the matrix to write $X_t$ in terms of $\{\varepsilon_t\}$. Of course, in reality it is not straightforward to define this inverse. Instead let us consider a finite (truncated) version of the above matrix equation. Except for the edge effects this is a circulant matrix (where the rows are repeated, but each time shifted by one; see wiki for a description). Truncating the matrix to have dimension $n$, we approximate the above by the finite set of $n$ equations
$$\begin{pmatrix} 1 & -\phi_1 & \cdots & -\phi_p & & 0 \\ 0 & 1 & -\phi_1 & \cdots & -\phi_p & \\ \vdots & & \ddots & \ddots & & \ddots \\ -\phi_1 & \cdots & -\phi_p & & 0 & 1 \end{pmatrix}\begin{pmatrix}X_n \\ X_{n-1} \\ \vdots \\ X_0\end{pmatrix} = \begin{pmatrix}\varepsilon_n \\ \varepsilon_{n-1} \\ \vdots \\ \varepsilon_0\end{pmatrix}, \qquad C_n\underline{X}_n = \underline{\varepsilon}_n.$$
The approximation of the AR(p) equation only arises in the first $p$ equations, where
$$X_0 - \sum_{j=1}^{p}\phi_jX_{n+1-j} = \varepsilon_0, \quad X_1 - \phi_1X_0 - \sum_{j=2}^{p}\phi_jX_{n+2-j} = \varepsilon_1, \;\ldots,\; X_{p-1} - \sum_{j=1}^{p-1}\phi_jX_{p-1-j} - \phi_pX_n = \varepsilon_{p-1}.$$
We now define the $n \times n$ matrix $U_n$, where
$$U_n = \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ 1 & 0 & 0 & \cdots & 0 \end{pmatrix}.$$
We observe that $U_n$ is a deformed diagonal matrix where all the ones along the diagonal have been shifted once to the right, and the left over one is placed in the bottom left hand corner. $U_n$ is another example of a circulant matrix; moreover $U_n^2$ shifts all the ones once again to the right,
$$U_n^2 = \begin{pmatrix} 0 & 0 & 1 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \end{pmatrix},$$
$U_n^3$ shifts the ones to the third off-diagonal, and so forth until $U_n^n = I_n$. Thus all circulant matrices can be written in terms of powers of $U_n$ (the matrix $U_n$ can be considered as the building block of circulant matrices). In particular, $C_n = I_n - \sum_{j=1}^{p}\phi_jU_n^j$, so that
$$\Big[I_n - \sum_{j=1}^{p}\phi_jU_n^j\Big]\underline{X}_n = \underline{\varepsilon}_n$$
and the solution to the equation is $\underline{X}_n = (I_n - \sum_{j=1}^{p}\phi_jU_n^j)^{-1}\underline{\varepsilon}_n$. Our aim is to write $(I_n - \sum_{j=1}^{p}\phi_jU_n^j)^{-1}$ as a power series in $U_n$, with $U_n$ playing the role of the backshift operator. To do this, we recall the similarity between the matrix $I_n - \sum_{j=1}^{p}\phi_jU_n^j$ and the characteristic polynomial $\phi(z) = 1 - \sum_{j=1}^{p}\phi_jz^j$. In particular, since we can factorize the characteristic polynomial as $\phi(z) = \prod_{j=1}^{p}[1 - \lambda_jz]$, we can factorize the matrix $I_n - \sum_{j=1}^{p}\phi_jU_n^j = \prod_{j=1}^{p}[I_n - \lambda_jU_n]$. To obtain the inverse, for simplicity, we assume that the roots of the characteristic function are greater than
60 one (ie. λ j <, which we recall corresponds to a causal solution) and are all different. Then there exists constants c j where [I n p φ j Un] j = p c j (I n λ j U n ) (just as in partial fractions) - to see why multiply the above by [I n p φ ju j n]. Finally, we recall that if the eigenvalues of A are less than one, then ( A) = j=0 Aj. The eigenvalues of U n are {exp( 2πij n ); j =,..., n}, thus the eigenvalues of λ ju n are less than one. This gives (I n λ j U n ) = k=0 λk j U k n and [I n p φ j Un] j = Therefore, the solution of C n X n = ε n is X n = C n ε n = p p c j c j k=0 k=0 λ k j U k n. (2.8) λ k j Un k ε n. Let us focus on the first element of the vector X n, which is X n. Since U k nε n shifts the elements of ε n up by k (note that this shift is with wrapping of the vector) we have X n = p c j λ k j ε n k + k=0 p c j k=n+ λ k j ε n k mod (n) } {{ } 0. (2.9) Note that the second term decays geometrically fast to zero. Thus giving the stationary solution X n = p c j k=0 λk j ε n k. To recollect, we have shown that [I n p φ ju j n] admits the solution in (2.8) (which is the same as the solution of the inverse of φ(b) ) and that U j nε n plays the role of the backshift operator. Therefore, we can use the backshift operator in obtaining a solution of an AR process because it plays the role of the matrix U n. 59
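The role of $U_n$ can be checked numerically. Below is a small sketch for the AR(1) case with $\phi = 0.5$ ($\phi$ and $n = 100$ are arbitrary illustrative choices).

# Sketch of Section 2.4 for an AR(1) with phi = 0.5 (n = 100 is arbitrary).
set.seed(2)
n <- 100; phi <- 0.5
U <- matrix(0, n, n)               # U_n: ones on the first super-diagonal,
U[cbind(1:(n - 1), 2:n)] <- 1      # with the left-over one in the
U[n, 1] <- 1                       # bottom-left corner
eps <- rnorm(n)
Cn <- diag(n) - phi * U
Xn <- solve(Cn, eps)
Xn[1]                              # first element of (X_n, ..., X_1)'
sum(phi^(0:(n - 1)) * eps)         # the causal sum; agrees up to the
                                   # geometrically small wrap-around term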
61 Example 2.4. The AR() model, X t φ X t = ε t is written as φ φ φ X n X n. X 0 = ε n ε n. ε 0 C n X n = ε n. The approximation of the AR() is only for the first equation, where X 0 φ X n = ε 0. Using the matrix U n, the above equation can be written as (I n φ U n )X n = ε n, which gives the solution X n = (I n φ U n ) ε n. Let us suppose that φ > (ie, the root lies inside the unit circle and the solution is noncausal), then to get a convergent expansion of ( n φ U n ) we rewrite (I n φ U n ) = φ U n (I n φ U n ). Thus we have Therefore the solution is (I n φ U n ) = X n = ( k=0 [ k=0 φ k U n k φ k+ Un k+ which in it s limit gives the same solution as Section 2.3.2(ii). Notice that U j n and B j are playing the same role. ] ) (φ U n ). ε n, 2.4. Representing the AR(p) as a vector AR() Let us suppose X t is an AR(p) process, with the representation p X t = φ j X t j + ε t. For the rest of this section we will assume that the roots of the characteristic function, φ(z), lie outside the unit circle, thus the solution causal. We can rewrite the above as a Vector Autoregressive 60
(VAR(1)) process
$$\underline{X}_t = A\underline{X}_{t-1} + \underline{\varepsilon}_t \qquad (2.10)$$
where
$$A = \begin{pmatrix} \phi_1 & \phi_2 & \cdots & \phi_{p-1} & \phi_p \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{pmatrix}, \qquad (2.11)$$
$\underline{X}_t = (X_t, \ldots, X_{t-p+1})'$ and $\underline{\varepsilon}_t = (\varepsilon_t, 0, \ldots, 0)'$. It is straightforward to show that the eigenvalues of $A$ are the inverses of the roots of $\phi(z)$ (since
$$\det(A - zI) = (-1)^p\Big(z^p - \sum_{i=1}^{p}\phi_iz^{p-i}\Big) = (-1)^pz^p\underbrace{\Big(1 - \sum_{i=1}^{p}\phi_iz^{-i}\Big)}_{=\phi(z^{-1})}\;\Big),$$
thus the eigenvalues of $A$ lie inside the unit circle. It can be shown that for any $|\lambda_{\max}(A)| < \delta < 1$, there exists a constant $C_\delta$ such that $\|A^j\|_{\mathrm{spec}} \le C_\delta\delta^j$ (see Appendix A). Note that this result is extremely obvious if the eigenvalues are distinct (in which case the spectral decomposition can be used), giving $\|A^j\|_{\mathrm{spec}} \le C|\lambda_{\max}(A)|^j$ (note that $\|A\|_{\mathrm{spec}}$ is the spectral norm of $A$, the square root of the largest eigenvalue of the symmetric matrix $AA'$).

We can apply the same back iterating that we did for the AR(1) to the vector AR(1). Iterating (2.10) backwards $k$ times gives
$$\underline{X}_t = \sum_{j=0}^{k-1}A^j\underline{\varepsilon}_{t-j} + A^k\underline{X}_{t-k}.$$
Since $\|A^k\underline{X}_{t-k}\|_2 \le \|A^k\|_{\mathrm{spec}}\|\underline{X}_{t-k}\|_2 \overset{P}{\to} 0$ we have
$$\underline{X}_t = \sum_{j=0}^{\infty}A^j\underline{\varepsilon}_{t-j}.$$
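A quick numerical sketch of this eigenvalue correspondence, using the AR(2) coefficients $\phi = (1.5, -0.75)$ that appear in later examples:

# Sketch: companion-matrix eigenvalues are the inverses of the roots of phi(z).
phi <- c(1.5, -0.75)
A <- rbind(phi, c(1, 0))     # companion matrix of the AR(2)
eigen(A)$values              # eigenvalues of A (complex conjugate pair)
1 / polyroot(c(1, -phi))     # inverses of the roots of phi(z); the same set
Mod(eigen(A)$values)         # modulus < 1, so ||A^j|| decays geometrically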
2.5 The ARMA model

Up to now, we have defined the moving average and the autoregressive model. The MA(q) model has the feature that after q lags there isn't any correlation between two random variables. On the other hand, there are correlations at all lags for an AR(p) model. In addition, as we shall see later on, it is much easier to estimate the parameters of an AR model than an MA. Therefore, there are several advantages in fitting an AR model to the data (note that when the roots of the characteristic polynomial lie outside the unit circle, the AR model is causal and can also be written as an MA($\infty$)). However, if we do fit an AR model to the data, what order of model should we use? Usually one uses the AIC (BIC or similar criterion) to determine the order. But for many data sets the selected order tends to be relatively large, for example order 14. The large order is usually chosen when correlations tend to decay slowly and/or the autocorrelation structure is quite complex (not just monotonically decaying). However, a model involving 10-15 unknown parameters is not particularly parsimonious, and more parsimonious models which can model the same behaviour would be useful. A very useful generalisation which can be more flexible (and parsimonious) is the ARMA(p, q) model, in which case $X_t$ satisfies
$$X_t - \sum_{i=1}^{p}\phi_iX_{t-i} = \varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j}.$$
64 (iii) The autoregressive moving average ARMA(p, q) model: {X t } satisfies p q X t φ i X t i = ε t + θ j ε t j. (2.4) i= We observe that we can write X t as φ(b)x t = θ(b)ε t. Below we give conditions for the ARMA to have a causal solution and also be invertible. We also show that the coefficients of the MA( ) representation of X t will decay exponentially. Lemma 2.5. Let us suppose X t is an ARMA(p, q) process with representation given in Definition (i) If the roots of the polynomial φ(z) lie outside the unit circle, and are greater than ( + δ) (for some δ > 0), then X t almost surely has the solution X t = a j ε t j, (2.5) j=0 where for j > q, a j = [A j ], + q i= θ i[a j i ],, with A = φ φ 2... φ p φ p where j a j < (we note that really a j = a j (φ, θ) since its a function of {φ i } and {θ i }). Moreover for all j, a j Kρ j (2.6) for some finite constant K and /( + δ) < ρ <. (ii) If the roots of φ(z) lie both inside or outside the unit circle and are larger than ( + δ) or less than ( + δ) for some δ > 0, then we have X t = a j ε t j, (2.7) j= 63
65 (a vector AR() is not possible), where a j Kρ j (2.8) for some finite constant K and /( + δ) < ρ <. (iii) If the absolute value of the roots of θ(z) = + q θ jz j are greater than ( + δ), then (2.4) can be written as where X t = b j X t j + ε t. (2.9) b j Kρ j (2.20) for some finite constant K and /( + δ) < ρ <. PROOF. We first prove (i) There are several way to prove the result. The proof we consider here, uses the VAR expansion given in Section We write the ARMA process as a vector difference equation X t = AX t + ε t (2.2) where X t = (X t,..., X t p+ ), ε t = (ε t + q θ jε t j, 0,..., 0). Now iterating (2.2), we have X t = concentrating on the first element of the vector X t we see that A j ε t j, (2.22) j=0 X t = q [A i ], (ε t i + θ j ε t i j ). i=0 Comparing (2.5) with the above it is clear that for j > q, a j = [A j ], + q i= θ i[a j i ],. Observe that the above representation is very similar to the AR(). Indeed as we will show below the A j behaves in much the same way as the φ j in AR() example. As with φ j, we will show that A j converges to zero as j (because the eigenvalues of A are less than one). We now show that 64
66 X t K ρj ε t j for some 0 < ρ <, this will mean that a j Kρ j. To bound X t we use (2.22) X t X t 2 A j spec ε t j 2. j=0 Hence, by using Section 2.4. we have A j spec C ρ ρ j (for any λ max (A) < ρ < ), which gives the corresponding bound for a j. To prove (ii) we consider a power series expansion of θ(z) φ(z). If the roots of φ(z) are distinct, then it is straightforward to write φ(z) in terms of partial fractions and a convergent power series for z =. This expansion immediately gives the the linear coefficients a j and show that a j C( + δ) j for some finite constant C. On the other hand, if there are multiple roots, say the roots of φ(z) are λ,..., λ s with multiplicity m,..., m s (where s m s = p) then we need to adjust the partial fraction expansion. It can be shown that a j C j maxs ms ( + δ) j. We note that for every ( + δ) < ρ <, there exists a constant such that j maxs ms ( + δ) j Cρ j, thus we obtain the desired result. To show (iii) we use a similar proof to (i), and omit the details. Corollary 2.5. An ARMA process is invertible if the roots of θ(b) (the MA coefficients) lie outside the unit circle and causal if the roots of φ(b) (the AR coefficients) lie outside the unit circle. The representation of an ARMA process is unique upto AR and MA polynomials θ(b) and φ(b) having common roots. A simplest example is X t = ε t, this also satisfies the representation X t φx t = ε t φε t etc. Therefore it is not possible to identify common factors in the polynomials. One of the main advantages of the invertibility property is in prediction and estimation. We will consider this in detail below. It is worth noting that even if an ARMA process is not invertible, one can generate a time series which has identical correlation structure but is invertible (see Section 3.3). 65
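The geometric decay $|a_j| \le K\rho^j$ of the MA($\infty$) coefficients in Lemma 2.5.1 is easy to see numerically; a sketch for an ARMA(1,1) is below (the parameters $\phi = 0.75$, $\theta = 0.5$ are arbitrary illustrative choices).

# Sketch: geometric decay of the MA(infinity) coefficients of an ARMA(1,1).
a <- ARMAtoMA(ar = 0.75, ma = 0.5, lag.max = 30)   # a_1, a_2, ..., a_30
plot(log(abs(a)), type = "h", xlab = "j", ylab = "log|a_j|")
# the points lie on a straight line of slope log(0.75), i.e. |a_j| ~ K (0.75)^j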
2.6 Simulating from an Autoregressive process

Simulating from a Gaussian AR process

It is straightforward to simulate from an AR process with Gaussian innovations $\{\varepsilon_t\}$. Given the autoregressive structure we can deduce the correlation structure (see Chapter 3), regardless of the distribution of the innovations. Furthermore, from Lemma 2.5.1(ii) we observe that all stationary AR processes can be written as an infinite sum of the innovations. Thus if the innovations are Gaussian, so is the AR process. This allows us to deduce the joint distribution of $X_1, \ldots, X_p$, which in turn allows us to generate the AR(p) process. We illustrate the details with an AR(1) process.

Suppose $X_t = \phi_1X_{t-1} + \varepsilon_t$, where $\{\varepsilon_t\}$ are iid standard normal random variables (note that for Gaussian processes it is impossible to discriminate between causal and non-causal processes - see Section 3.3 - therefore we will assume $|\phi_1| < 1$). We will show in Section 3.1, equation (3.1), that the autocovariance of an AR(1) is
$$c(r) = \phi_1^{r}\sum_{j=0}^{\infty}\phi_1^{2j} = \frac{\phi_1^{r}}{1-\phi_1^{2}}.$$
Therefore the marginal distribution of $X_t$ is Gaussian with mean zero and variance $(1-\phi_1^2)^{-1}$. Therefore, to simulate an AR(1) Gaussian time series, we draw $X_1$ from a Gaussian distribution with mean zero and variance $(1-\phi_1^2)^{-1}$. We then iterate $X_t = \phi_1X_{t-1} + \varepsilon_t$ for $2 \le t \le n$. This will give us a stationary realization from an AR(1) Gaussian time series. Note that the function arima.sim is a routine in R which does the above. See below for details.

Simulating from a non-Gaussian AR model

Unlike the Gaussian AR process, it is difficult to simulate exactly from a non-Gaussian model, but we can obtain a very close approximation. This is because if the innovations are non-Gaussian, even with known distribution, it is not clear what the distribution of $X_t$ will be. Here we describe how to obtain a close approximation in the case that the AR process is causal. Again we describe the method for the AR(1). Let $\{X_t\}$ be an AR(1) process, $X_t = \phi_1X_{t-1} + \varepsilon_t$, which has the stationary, causal solution
$$X_t = \sum_{j=0}^{\infty}\phi_1^{j}\varepsilon_{t-j}.$$
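Before turning to the approximation, a minimal sketch of the exact Gaussian scheme just described ($\phi = 0.8$ and $n = 200$ are arbitrary illustrative choices):

# Sketch: exact simulation of a stationary Gaussian AR(1).
set.seed(3)
phi <- 0.8; n <- 200
x <- numeric(n)
x[1] <- rnorm(1, sd = sqrt(1 / (1 - phi^2)))        # X_1 from the stationary marginal
for (t in 2:n) x[t] <- phi * x[t - 1] + rnorm(1)    # iterate X_t = phi X_{t-1} + e_t
# compare with arima.sim(list(ar = phi), n = n)

Returning to the non-Gaussian case, the causal expansion above is the basis of the following approximation.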
68 To simulate from the above model, we set X = 0. Then obtain the iteration X t = φ Xt + ε t for t 2. We note that the solution of this equation is X t = t φ j ε t j. j=0 We recall from Lemma 2.5. that X t X t φ t j=0 φj ε j, which converges geometrically fast to zero. Thus if we choose a large n to allow burn in and use { X t ; t n} in the simulations we have a simulation which is close to a stationary solution from an AR() process. Simulating from an Integrated process To simulate from an integrated process ARIMA(p,, q) ( B)Y t = X t, where X t is a causal ARMA(p, q) process. We first simulate {X t } using the method above. Then we define the recursion Y = X and for t > Y t = Y t + X t. Thus giving a realisation from an ARIMA(p,, q). Simulating from a non-gaussian non-causal model Suppose that X t satisfies the representation p X t = φ j X t j + ε t, whose characteristic function have roots both inside and outside the unit circle. Thus, the stationary solution of this equation is not causal. It is not possible to simulate from this equation. To see why, consider directly simulating from X t = 2X t + ε t without rearranging it as X t = 2 X t 2 ε t, the solution would explode. Now if the roots are both inside and outside the unit circle, there would not be a way to rearrange the equation to iterate a stationary solution. There are two methods to remedy this problem: 67
69 (i) From Lemma 2.5.(ii) we recall that X t has the solution X t = a j ε t j, (2.23) j= where the coefficients a j are determined from the characteristic equation. Thus to simulate the process we use the above representation, though we do need to truncate the number of terms in (2.23) and use M X t = a j ε t j. j= M (ii) The above is a brute force method is an approximation which is also difficult to evaluate. There is a simpler method, if one studies the roots of the characteristic equation. Let us suppose that {λ j ; j =,..., p } are the roots of φ(z) which lie outside the unit circle and {µ j2 ; j 2 =,..., p 2 } are the roots which lie inside the unit circle. For ease of calculation we will assume the roots are distinct. We can rewrite φ(z) as φ(z) = = = [ p j = ( λ j z) p j = p j = ] p2 c j ( λ j z) + [ p2 j 2 = ( µ j 2 z) j 2 = p2 c j ( λ j z) j 2 = d jd ( µ jd z) ] d jd µ jd z( µ j d z ) Thus the solution of X t is X t = φ(b) ε t = p j = c j ( λ j B) ε t p 2 j 2 = d jd µ jd B( µ j d B ) ε t Let Y j,t = λ j Y j,t + ε t and Z j2,t = µ j2 Z j2,t + ε t (thus the stationary solution is generated with Z j2,t = µ j 2 Z j2,t µ j 2 ε t ). Generate the time series {Y j,t; j =,..., p } and {Y j,t; j =,..., p } using the method described above. Then the non-causal time series can be generated by using X t = p j = c j Y j,t 68 p 2 j 2 = d j2 Z j2,t.
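As a concrete illustration of method (ii), the following sketch simulates from the noncausal model $(1-0.5B)(1-2B)X_t = \varepsilon_t$, i.e. $X_t = 2.5X_{t-1} - X_{t-2} + \varepsilon_t$; here partial fractions give $\phi(z)^{-1} = -\tfrac{1}{3}(1-0.5z)^{-1} + \tfrac{4}{3}(1-2z)^{-1}$. The model, sample size and burn-in margins are illustrative assumptions.

# Sketch of method (ii): noncausal AR(2) as causal part + "future" part.
set.seed(4)
N <- 600
eps <- rnorm(N)
Y <- filter(eps, 0.5, method = "recursive")   # causal part: Y_t = 0.5 Y_{t-1} + e_t
Z <- numeric(N)                               # noncausal part, iterated backwards:
for (t in (N - 1):1) Z[t] <- 0.5 * (Z[t + 1] - eps[t + 1])   # Z_t = 0.5(Z_{t+1} - e_{t+1})
X <- -(1/3) * Y + (4/3) * Z
# check the AR(2) equation away from the burn-in edges
t <- 100:(N - 100)
max(abs(X[t] - 2.5 * X[t - 1] + X[t - 2] - eps[t]))   # ~0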
70 Comments: Remember Y j,t is generated using the past ε t and Z j,t is generated using future innovations. Therefore to ensure that the generated {Y j,t } and {Z j,t } are close to the stationary we need to ensure that the initial value of Y j,t is far in the past and the initial value for Z j,t is far in the future. If the roots are complex conjugates, then the corresponding {Y j,t } or {Z j,t } should be written as AR(2) models (to avoid complex processes). R functions Shumway and Stoffer (2006) and David Stoffer s website gives a comprehensive introduction to time series R-functions. The function arima.sim simulates from a Gaussian ARIMA process. For example, arima.sim(list(order=c(2,0,0), ar = c(.5, -0.75)), n=50) simulates from the AR(2) model X t =.5X t 0.75X t 2 + ε t, where the innovations are Gaussian. Exercise 2.6 In the following simulations, use non-gaussian innovations. (i) Simulate an AR(4) process with characteristic function φ(z) = [ 0.8 exp(i 2π ] [ 3 )z 0.8 exp( i 2π ] [ 3 )z.5 exp(i 2π5 ] [ )z.5 exp( i 2π5 ] )z. (ii) Simulate an AR(4) process with characteristic function φ(z) = [ 0.8 exp(i 2π ] [ 3 )z 0.8 exp( i 2π ] [ 3 )z 23 ] [ exp(i2π5 )z 23 ] exp( i2π5 )z. Do you observe any differences between these realisations? 69
71 Chapter 3 The autocovariance function of a linear time series Objectives Be able to determine the rate of decay of an ARMA time series. Be able solve the autocovariance structure of an AR process. Understand what partial correlation is and how this may be useful in determining the order of an AR model. Understand why autocovariance is blind to processes which are non-causal. But the higher order cumulants are not blind to causality. 3. The autocovariance function The autocovariance function (ACF) is defined as the sequence of covariances of a stationary process. That is suppose that {X t } is a stationary process with mean zero, then {c(k) : k Z} is the ACF of {X t } where c(k) = E(X 0 X k ). Clearly different time series give rise to different features in the ACF. We will explore some of these features below. Before investigating the structure of ARMA processes we state a general result connecting linear time series and the summability of the autocovariance function. 70
Lemma 3.1.1 Suppose the stationary time series $X_t$ satisfies the linear representation $X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j}$. The covariance is $c(r) = \mathrm{var}(\varepsilon_t)\sum_{j=-\infty}^{\infty}\psi_j\psi_{j+r}$.

(i) If $\sum_{j}|\psi_j| < \infty$, then $\sum_k|c(k)| < \infty$.

(ii) If $\sum_{j}|j\psi_j| < \infty$, then $\sum_k|k\,c(k)| < \infty$.

(iii) If $\sum_{j}|\psi_j|^2 < \infty$, then we cannot say anything about the summability of the covariance.

PROOF. It is straightforward to show that $c(k) = \mathrm{var}(\varepsilon_t)\sum_j\psi_j\psi_{j+k}$. Using this result, it is easy to see that $|c(k)| \le \mathrm{var}(\varepsilon_t)\sum_j|\psi_j||\psi_{j+k}|$, thus
$$\sum_k|c(k)| \le \mathrm{var}(\varepsilon_t)\sum_j|\psi_j|\sum_k|\psi_{j+k}| = \mathrm{var}(\varepsilon_t)\Big(\sum_j|\psi_j|\Big)^2 < \infty,$$
which proves (i). The proof of (ii) is similar. To prove (iii), we observe that $\sum_j|\psi_j|^2 < \infty$ is a weaker condition than $\sum_j|\psi_j| < \infty$ (for example, the sequence $\psi_j = |j|^{-1}$ satisfies the former condition but not the latter). Thus, based on this condition alone, we cannot say anything about the summability of the covariances.

First we consider a general result on the covariance of a causal ARMA process (to obtain the covariance we always use the MA($\infty$) expansion - you will see why below).

3.1.1 The rate of decay of the autocovariance of an ARMA process

We evaluate the covariance of an ARMA process using its MA($\infty$) representation. Let us suppose that $\{X_t\}$ is a causal ARMA process; then it has the representation in (2.5) (where the roots of $\phi(z)$ have absolute value greater than $1+\delta$). Using (2.5) and the independence of $\{\varepsilon_t\}$ we have
$$\mathrm{cov}(X_t, X_\tau) = \mathrm{cov}\Big(\sum_{j_1=0}^{\infty}a_{j_1}\varepsilon_{t-j_1},\; \sum_{j_2=0}^{\infty}a_{j_2}\varepsilon_{\tau-j_2}\Big) = \sum_{j_1,j_2}a_{j_1}a_{j_2}\mathrm{cov}(\varepsilon_{t-j_1},\varepsilon_{\tau-j_2}) = \mathrm{var}(\varepsilon_t)\sum_{j=0}^{\infty}a_ja_{j+|t-\tau|} \qquad (3.1)$$
(here we see the beauty of the MA($\infty$) expansion). Using (2.6) we have
$$|\mathrm{cov}(X_t, X_\tau)| \le \mathrm{var}(\varepsilon_t)C_\rho^2\sum_{j=0}^{\infty}\rho^j\rho^{j+|t-\tau|} = \mathrm{var}(\varepsilon_t)C_\rho^2\rho^{|t-\tau|}\sum_{j=0}^{\infty}\rho^{2j} = \mathrm{var}(\varepsilon_t)C_\rho^2\frac{\rho^{|t-\tau|}}{1-\rho^2}, \qquad (3.2)$$
for any $1/(1+\delta) < \rho < 1$. The above bound is useful: it tells us that the ACF of an ARMA process decays exponentially fast. In other words, there is very little memory in an ARMA process. However, it is not very enlightening about features within the process. In the following we obtain an explicit expression for the ACF of an autoregressive process. So far we have used the characteristic polynomial associated with an AR process to determine whether it is causal. Now we show that the roots of the characteristic polynomial also give information about the ACF and what a typical realisation of an autoregressive process could look like.

3.1.2 The autocovariance of an autoregressive process

Let us consider the zero mean AR(p) process $\{X_t\}$ where
$$X_t = \sum_{j=1}^{p}\phi_jX_{t-j} + \varepsilon_t. \qquad (3.3)$$
From now onwards we will assume that $\{X_t\}$ is causal (the roots of $\phi(z)$ lie outside the unit circle). Given that $\{X_t\}$ is causal we can derive a recursion for the covariances. Multiplying both sides of the above equation by $X_{t-k}$ ($k > 0$) and taking expectations gives the equation
$$\mathrm{E}(X_tX_{t-k}) = \sum_{j=1}^{p}\phi_j\mathrm{E}(X_{t-j}X_{t-k}) + \underbrace{\mathrm{E}(\varepsilon_tX_{t-k})}_{=0} = \sum_{j=1}^{p}\phi_j\mathrm{E}(X_{t-j}X_{t-k}). \qquad (3.4)$$
It is worth mentioning that if the process were not causal this equation would not hold, since $\varepsilon_t$ and $X_{t-k}$ are not necessarily independent. These are the Yule-Walker equations; we will discuss them in detail when we consider estimation. For now, letting $c(k) = \mathrm{E}(X_0X_k)$ and using the above, we see that the autocovariance satisfies the homogeneous difference equation
$$c(k) - \sum_{j=1}^{p}\phi_jc(k-j) = 0, \qquad (3.5)$$
for $k > 0$. In other words, the autocovariance function of $\{X_t\}$ is the solution of this difference equation. The study of difference equations is an entire field of research; however, we will now scratch the surface to obtain a solution for (3.5). Solving (3.5) is very similar to solving homogeneous differential equations, which some of you may be familiar with (do not worry if you are not).
Recall the characteristic polynomial of the AR process, $\phi(z) = 1 - \sum_{j=1}^{p}\phi_jz^j$, whose roots we denote $\lambda_1, \ldots, \lambda_p$. In Section 2.3.4 we used the roots of the characteristic equation to find the stationary solution of the AR process. In this section we use these roots to obtain the solution of (3.5). It can be shown that if the roots are distinct (the roots are all different) the solution of (3.5) is
$$c(k) = \sum_{j=1}^{p}C_j\lambda_j^{-k}, \qquad (3.6)$$
where the constants $\{C_j\}$ are chosen depending on the initial values $\{c(k) : 1 \le k \le p\}$ and are such that they ensure that $c(k)$ is real (recalling that the $\lambda_j$ can be complex). The simplest way to prove (3.6) is to use a plug-in method. Plugging $c(k) = \sum_{j=1}^{p}C_j\lambda_j^{-k}$ into (3.5) gives
$$c(k) - \sum_{i=1}^{p}\phi_ic(k-i) = \sum_{j=1}^{p}C_j\lambda_j^{-k}\Big(1 - \sum_{i=1}^{p}\phi_i\lambda_j^{i}\Big) = \sum_{j=1}^{p}C_j\lambda_j^{-k}\underbrace{\phi(\lambda_j)}_{=0} = 0.$$
In the case that the roots of $\phi(z)$ are not distinct, let the roots be $\lambda_1, \ldots, \lambda_s$ with multiplicities $m_1, \ldots, m_s$ (where $\sum_{k=1}^{s}m_k = p$). In this case the solution is
$$c(k) = \sum_{j=1}^{s}\lambda_j^{-k}P_{m_j}(k), \qquad (3.7)$$
where $P_{m_j}(k)$ is a polynomial of order $m_j - 1$ and the coefficients $\{C_j\}$ are now hidden in $P_{m_j}(k)$. We now study the covariance in greater detail and see what it tells us about a realisation. As motivation consider the following example.

Example 3.1.1 Consider the AR(2) process
$$X_t = 1.5X_{t-1} - 0.75X_{t-2} + \varepsilon_t, \qquad (3.8)$$
where $\{\varepsilon_t\}$ are iid random variables with mean zero and variance one. The corresponding characteristic polynomial is $1 - 1.5z + 0.75z^2$, which has roots $1 \pm i3^{-1/2} = \sqrt{4/3}\exp(\pm i\pi/6)$. Using the
discussion above we see that the autocovariance function of $\{X_t\}$ is
$$c(k) = (\sqrt{4/3})^{-k}\big(C\exp(-ik\pi/6) + \bar{C}\exp(ik\pi/6)\big),$$
for a particular value of $C$. Now write $C = a\exp(ib)$; then the above can be written as
$$c(k) = a(\sqrt{4/3})^{-k}\cos\Big(k\frac{\pi}{6} + b\Big).$$
We see that the covariance decays at an exponential rate, but there is a periodicity within the decay. This means that observations separated by a lag $k = 12$ are more closely correlated than other lags, which suggests a quasi-periodicity in the time series. The ACF of the process is given in Figure 3.1; notice that it decays to zero but also observe that it undulates. A plot of a realisation of the time series is given in Figure 3.2; notice the quasi-period of about 12.

Let us now briefly return to the definition of the periodogram given in Section 1.2.3 ($I_n(\omega) = \frac{1}{n}|\sum_{t=1}^{n}X_t\exp(it\omega)|^2$). We used the periodogram to identify the period of a deterministic signal, but showed that when dependent, correlated noise was added to the signal the periodogram exhibited more complex behaviour. In Figure 3.3 we give a plot of the periodogram corresponding to Figure 3.2. Recall that this AR(2) gives a quasi-periodicity of 12, which corresponds to the frequency $2\pi/12 \approx 0.52$; this matches the main peaks in the periodogram. We will learn later that the periodogram is a crude (meaning inconsistent) estimator of the spectral density function. The spectral density is given in the lower plot of Figure 3.3.

We now generalise the above example. Let us consider the general AR(p) process defined in (3.3). Suppose the roots of the corresponding characteristic polynomial are distinct and let us split them into real and complex roots. Because the characteristic polynomial is comprised of real coefficients, the complex roots come in complex conjugate pairs. Hence let us suppose the real roots are $\{\lambda_j\}_{j=1}^{r}$ and the complex roots are $\{\lambda_j, \bar{\lambda}_j\}_{j=r+1}^{r+(p-r)/2}$. The covariance in (3.6) can be written as
$$c(k) = \sum_{j=1}^{r}C_j\lambda_j^{-k} + \sum_{j=r+1}^{r+(p-r)/2}a_j|\lambda_j|^{-k}\cos(k\theta_j + b_j) \qquad (3.9)$$
where for $j > r$ we write $\lambda_j = |\lambda_j|\exp(i\theta_j)$, and $a_j$ and $b_j$ are real constants. Notice that, as in the example above, the covariance decays exponentially with the lag, but there is undulation. A typical realisation from such a process will be quasi-periodic with frequencies $\theta_{r+1}, \ldots, \theta_{r+(p-r)/2}$, though the magnitude of each period will vary.
[Figure 3.1: The ACF of the time series $X_t = 1.5X_{t-1} - 0.75X_{t-2} + \varepsilon_t$.]

[Figure 3.2: A simulation of the time series $X_t = 1.5X_{t-1} - 0.75X_{t-2} + \varepsilon_t$.]

[Figure 3.3: Top: Periodogram of $X_t = 1.5X_{t-1} - 0.75X_{t-2} + \varepsilon_t$ for sample size $n = 144$. Lower: The corresponding spectral density function (note that 0.5 on the x-axis of the spectral density corresponds to $\pi$ on the x-axis of the periodogram).]
An interesting discussion on the covariances of an AR process and realisations of an AR process is given in Shumway and Stoffer (2006), Chapter 3.3 (it uses the example above). A discussion of difference equations is also given in Brockwell and Davis (1998), Sections 3.3 and 3.6, and Fuller (1995), Section 2.4.

Example 3.1.2 (Autocovariance of an AR(2)) Let us suppose that $X_t$ satisfies the model $X_t = (a+b)X_{t-1} - abX_{t-2} + \varepsilon_t$. We have shown that if $|a| < 1$ and $|b| < 1$, then it has the solution
$$X_t = \frac{1}{b-a}\sum_{j=0}^{\infty}(b^{j+1} - a^{j+1})\varepsilon_{t-j}.$$
By writing out a timeline it is straightforward to show that for $r \ge 0$
$$\mathrm{cov}(X_t, X_{t-r}) = \frac{1}{(b-a)^2}\sum_{j=0}^{\infty}(b^{j+1} - a^{j+1})(b^{j+1+r} - a^{j+1+r}).$$
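The formula in Example 3.1.2 can be checked against R's ARMA routines; in the sketch below we take the hypothetical choice $a = 0.25$, $b = 0.5$ (so that $\phi_1 = a+b = 0.75$ and $\phi_2 = -ab = -0.125$, the model of Section 2.3.5) and innovation variance one.

# Sketch: checking Example 3.1.2 numerically (a = 0.25, b = 0.5 are illustrative).
a <- 0.25; b <- 0.5
psi <- (b^(1:60) - a^(1:60)) / (b - a)    # psi_j = (b^{j+1}-a^{j+1})/(b-a), j = 0, 1, ...
max(abs(psi[-1] - ARMAtoMA(ar = c(a + b, -a * b), lag.max = 59)))   # ~0
r <- 3                                     # compare c(r)/c(0) with R's autocorrelation
c.r <- sum(psi[1:50] * psi[(1:50) + r])    # covariance formula, truncated at 50 terms
c.0 <- sum(psi[1:50]^2)
c.r / c.0
ARMAacf(ar = c(a + b, -a * b), lag.max = r)[r + 1]   # agrees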
Example 3.1.3 (The autocorrelation of a causal and noncausal time series) Let us consider the two AR(1) processes considered in Section 2.3.2. We recall that the model $X_t = 0.5X_{t-1} + \varepsilon_t$ has the stationary causal solution $X_t = \sum_{j=0}^{\infty}(0.5)^j\varepsilon_{t-j}$. Assuming the innovations have variance one, the ACF of $X_t$ is
$$c_X(0) = \frac{1}{1-(0.5)^2}, \qquad c_X(k) = \frac{(0.5)^{|k|}}{1-(0.5)^2}.$$
On the other hand, the model $Y_t = 2Y_{t-1} + \varepsilon_t$ has the noncausal stationary solution
$$Y_t = -\sum_{j=0}^{\infty}(0.5)^{j+1}\varepsilon_{t+j+1}.$$
This process has the ACF
$$c_Y(0) = \frac{(0.5)^2}{1-(0.5)^2}, \qquad c_Y(k) = (0.5)^2\frac{(0.5)^{|k|}}{1-(0.5)^2}.$$
Thus we observe that, except for the factor $(0.5)^2$, both models have an identical autocovariance function; indeed their autocorrelation functions are exactly the same. Furthermore, by letting the innovations of $X_t$ have standard deviation 0.5, both time series would have the same autocovariance function. Therefore we observe an interesting feature: the non-causal time series has the same correlation structure as a causal time series. We show in Section 3.3 that for every non-causal time series there exists a causal time series with the same autocovariance function. Therefore the autocorrelation function is blind to non-causality.

Exercise 3.1 Recall the AR(2) models considered in Exercise 2.4. Now we want to derive their ACF functions.

(i) (a) Obtain the ACF corresponding to $X_t = \frac{7}{3}X_{t-1} - \frac{2}{3}X_{t-2} + \varepsilon_t$, where $\{\varepsilon_t\}$ are iid random variables with mean zero and variance $\sigma^2$.

(b) Obtain the ACF corresponding to $X_t = \frac{4}{3}X_{t-1} - X_{t-2} + \varepsilon_t$, where $\{\varepsilon_t\}$ are iid random variables with mean zero and variance $\sigma^2$.

(c) Obtain the ACF corresponding to $X_t = X_{t-1} - 4X_{t-2} + \varepsilon_t$, where $\{\varepsilon_t\}$ are iid random variables with mean zero and variance $\sigma^2$.
(ii) For all these models plot the true ACF in R. You will need to use the function ARMAacf. BEWARE of the ACF it gives for non-causal solutions. Find a method of plotting a causal solution in the non-causal case.

Exercise 3.2 In Exercise 2.5 you constructed a causal AR(2) process with pseudo-period 7. Load Shumway and Stoffer's package astsa into R (use the command install.packages("astsa") and then library("astsa")). Use the command arma.spec to make a plot of the corresponding spectral density function. How does your periodogram compare with the true spectral density function?

R code

We use the code given in Shumway and Stoffer (2006) to make Figures 3.1 and 3.2.

To make Figure 3.1:

acf = ARMAacf(ar=c(1.5,-0.75), ma=0, lag.max=50)
plot(acf, type="h", xlab="lag")
abline(h=0)

To make Figures 3.2 and 3.3:

set.seed(5)
ar2 <- arima.sim(list(order=c(2,0,0), ar=c(1.5,-0.75)), n=144)
plot.ts(ar2, axes=F); box(); axis(2)
axis(1, seq(0,144,24))
abline(v=seq(0,144,12), lty="dotted")
# `frequency` and `Periodogram` must first be computed from ar2
plot(frequency, Periodogram, type="o")
library("astsa")
arma.spec(ar=c(1.5,-0.75), log="no", main="Autoregressive")

3.1.3 The autocovariance of a moving average process

Suppose that $\{X_t\}$ satisfies
$$X_t = \varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j}.$$
The covariance is
$$\mathrm{cov}(X_t, X_{t-k}) = \begin{cases} \mathrm{var}(\varepsilon_t)\sum_{i=0}^{q}\theta_i\theta_{i-|k|} & k = -q, \ldots, q \\ 0 & \text{otherwise,} \end{cases}$$
where $\theta_0 = 1$ and $\theta_i = 0$ for $i < 0$ and $i > q$. Therefore we see that there is no correlation when the lag between $X_t$ and $X_{t-k}$ is greater than $q$.

3.1.4 The autocovariance of an autoregressive moving average process

We see from the above that an MA(q) model is only really suitable when we believe that there is no correlation between two random variables separated by more than a certain distance. Often autoregressive models are fitted. However, in several applications we find that autoregressive models of a very high order are needed to fit the data. If a very long autoregressive model is required, a more suitable model may be the autoregressive moving average process. It has several of the properties of an autoregressive process, but can be more parsimonious than a long autoregressive process. In this section we consider the ACF of an ARMA process. Let us suppose that the causal time series $\{X_t\}$ satisfies the equations
$$X_t - \sum_{i=1}^{p}\phi_iX_{t-i} = \varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j}.$$
We now define a recursion for the ACF which is similar to the ACF recursion for AR processes. Let us suppose that the lag $k$ is such that $k > q$; then it can be shown that the autocovariance function of the ARMA process satisfies
$$\mathrm{E}(X_tX_{t-k}) - \sum_{i=1}^{p}\phi_i\mathrm{E}(X_{t-i}X_{t-k}) = 0.$$
On the other hand, if $k \le q$, then we have
$$\mathrm{E}(X_tX_{t-k}) - \sum_{i=1}^{p}\phi_i\mathrm{E}(X_{t-i}X_{t-k}) = \sum_{j=1}^{q}\theta_j\mathrm{E}(\varepsilon_{t-j}X_{t-k}) = \sum_{j=k}^{q}\theta_j\mathrm{E}(\varepsilon_{t-j}X_{t-k}).$$
We recall that $X_t$ has the MA($\infty$) representation $X_t = \sum_{j=0}^{\infty}a_j\varepsilon_{t-j}$ (see (2.5)), therefore for
$k \le j \le q$ we have $\mathrm{E}(\varepsilon_{t-j}X_{t-k}) = a_{j-k}\mathrm{var}(\varepsilon_t)$ (where $a(z) = \theta(z)\phi(z)^{-1}$). Altogether the above gives the difference equations
$$c(k) - \sum_{i=1}^{p}\phi_ic(k-i) = \mathrm{var}(\varepsilon_t)\sum_{j=k}^{q}\theta_ja_{j-k} \qquad \text{for } 1 \le k \le q, \qquad (3.10)$$
$$c(k) - \sum_{i=1}^{p}\phi_ic(k-i) = 0 \qquad \text{for } k > q,$$
where $c(k) = \mathrm{E}(X_0X_k)$. Since for $k > q$ this is a homogeneous difference equation, it can be shown that for $k > q$ the solution is
$$c(k) = \sum_{j=1}^{s}\lambda_j^{-k}P_{m_j}(k),$$
where $\lambda_1, \ldots, \lambda_s$ with multiplicities $m_1, \ldots, m_s$ (where $\sum_k m_k = p$) are the roots of the characteristic polynomial $1 - \sum_{j=1}^{p}\phi_jz^j$. Observe the similarity to the autocovariance function of the AR process (see (3.7)). The coefficients in the polynomials $P_{m_j}$ are determined by the initial conditions given in (3.10).

You can also look at Brockwell and Davis (1998), Chapter 3.3 and Shumway and Stoffer (2006), Chapter 3.4.

3.2 The partial covariance and correlation of a time series

We see that by using the autocovariance function we are able to identify the order of an MA(q) process: when the covariance lag is greater than q the covariance is zero. However, the same is not true for AR(p) processes. The autocovariances do not enlighten us on the order p. However, a variant of the autocovariance, called the partial autocovariance, is quite informative about the order of an AR(p). We start by reviewing partial correlation and its relationship to the inverse of the variance/covariance matrix (often called the precision matrix).
83 3.2. A review of partial correlation in multivariate analysis Partial correlation Suppose X = (X,..., X d ) is a zero mean random vector (we impose the zero mean condition to simplify notation and it s not necessary). The partial correlation is the covariance between X i and X j, conditioned on the other elements in the vector. In other words, the covariance between the residuals of X i conditioned on X (ij) (the vector not containing X i and X j ) and the residual of X j conditioned on X (ij). That is the partial covariance between X i and X j given X (ij) is defined as cov ( X i var[x (ij) ] E[X (ij) X i ]X (ij), X j var[x (ij) ] ) E[X (ij) X j ]X (ij) = cov[x i X j ] E[X (ij) X i ] var[x (ij) ] E[X (ij) X j ]. Taking the above argument further, the variance/covariance matrix of the residual of X ij = (X i, X j ) given X (ij) is defined as var ( X ij E[X ij X (ij) ] var[x (ij) ] X (ij) ) = Σij c ijσ (ij) c ij (3.) where Σ ij = var(x ij ), c ij = E(X ij X (ij) ) (=cov(x ij, X (ij) )) and Σ (ij) = var(x (ij) ) ( denotes the tensor product). Let s ij denote the (i, j)th element of the (2 2) matrix Σ ij c ij Σ (ij) c ij. The partial correlation between X i and X j given X (ij) is ρ ij = s 2 s s 22, observing that (i) s 2 is the partial covariance between X i and X j. (ii) s = E(X i k i,j β i,kx k ) 2 (where β i,k are the coefficients of the best linear predictor of X i given {X k ; k i, j}). (ii) s 22 = E(X j k i,j β j,kx k ) 2 (where β j,k are the coefficients of the best linear predictor of X j given {X k ; k i, j}). In the following section we relate partial correlation to the inverse of the variance/covariance matrix (often called the precision matrix). 82
The precision matrix and its properties

Let us suppose that X = (X_1,...,X_d)' is a zero mean random vector with variance Σ. The (i,j)th element of Σ is the covariance cov(X_i, X_j) = Σ_{ij}. Here we consider the inverse of Σ, and ask what the (i,j)th element of the inverse tells us about the correlation between X_i and X_j. Let Σ^{ij} denote the (i,j)th element of Σ^{-1}. We will show that, with the appropriate standardisation, Σ^{ij} is the negative partial correlation between X_i and X_j. More precisely,

Σ^{ij} / sqrt(Σ^{ii} Σ^{jj}) = −ρ_{ij}.   (3.12)

The proof uses the inverse of block matrices. To simplify the notation, we focus on the (1,2)th element of Σ and Σ^{-1} (which concerns the correlation between X_1 and X_2). Let X_{1,2} = (X_1, X_2)', X_{-(1,2)} = (X_3,...,X_d)', Σ_{-(1,2)} = var(X_{-(1,2)}), c_{1,2} = cov(X_{1,2}, X_{-(1,2)}) and Σ_{1,2} = var(X_{1,2}). Using this notation it is clear that

var(X) = Σ = ( Σ_{1,2}    c_{1,2}
               c_{1,2}'   Σ_{-(1,2)} ).   (3.13)

It is well known that the inverse of the above block matrix is

Σ^{-1} = ( P^{-1}                          −P^{-1} c_{1,2} Σ_{-(1,2)}^{-1}
           −Σ_{-(1,2)}^{-1} c_{1,2}' P^{-1}   Σ_{-(1,2)}^{-1} + Σ_{-(1,2)}^{-1} c_{1,2}' P^{-1} c_{1,2} Σ_{-(1,2)}^{-1} ),   (3.14)

where P = Σ_{1,2} − c_{1,2} Σ_{-(1,2)}^{-1} c_{1,2}'. Comparing P with (3.11), we see that P is the 2 × 2 variance/covariance matrix of the residuals of X_{1,2} conditioned on X_{-(1,2)}. Thus the partial correlation between X_1 and X_2 is

ρ_{1,2} = P_{1,2} / sqrt(P_{1,1} P_{2,2}),   (3.15)

where P_{ij} denotes the (i,j)th element of the matrix P. Inverting P (since it is a two by two matrix), we see that

P^{-1} = (1 / (P_{1,1}P_{2,2} − P_{1,2}²)) (  P_{2,2}   −P_{1,2}
                                              −P_{1,2}   P_{1,1} ).   (3.16)

Thus, by comparing (3.14) and (3.16) and by the definition of partial correlation given in (3.15), we have

Σ^{12} / sqrt(Σ^{11} Σ^{22}) = (P^{-1})_{12} / sqrt((P^{-1})_{11} (P^{-1})_{22}) = −ρ_{1,2}.

The same argument applies to any pair (i,j), which proves (3.12):

ρ_{ij} = −Σ^{ij} / sqrt(Σ^{ii} Σ^{jj}).

In other words, the (i,j)th element of Σ^{-1}, divided by the square root of the product of the corresponding diagonal elements, gives the negative partial correlation. Therefore, if the partial correlation between X_i and X_j given X_{-(ij)} is zero, then Σ^{ij} = 0.

The precision matrix, Σ^{-1}, contains many other hidden treasures. For example, its coefficients convey information about the best linear predictor of X_i given X_{-(i)} = (X_1,...,X_{i−1}, X_{i+1},...,X_d)' (all elements of X except X_i). Let

X_i = ∑_{j ≠ i} β_{i,j} X_j + ε_i,

where {β_{i,j}} are the coefficients of the best linear predictor. Then it can be shown that

β_{i,j} = −Σ^{ij} / Σ^{ii}   and   Σ^{ii} = 1 / E[X_i − ∑_{j ≠ i} β_{i,j} X_j]².   (3.17)

The proof uses the same arguments as those based on (3.13).

Exercise 3.3 By using the decomposition

var(X) = Σ = ( Σ_{1,1}   c_1
               c_1'      Σ_{-(1)} ),   (3.18)

where Σ_{1,1} = var(X_1), c_1 = E[X_1 X_{-(1)}'] and Σ_{-(1)} = var[X_{-(1)}], prove (3.17).

The Cholesky decomposition and the precision matrix

We now represent the precision matrix through its Cholesky decomposition. It should be mentioned that Mohsen Pourahmadi has done a lot of interesting research in this area and has written a review paper on the topic. We define the sequence of linear equations

X_t = ∑_{j=1}^{t−1} β_{t,j} X_j + ε_t,   t = 2,...,k,   (3.19)

where {β_{t,j}; 1 ≤ j ≤ t−1} are the coefficients of the best linear predictor of X_t given X_1,...,X_{t−1}. Let σ_t² = var[ε_t] = E[X_t − ∑_{j=1}^{t−1} β_{t,j} X_j]² and σ_1² = var[X_1]. We standardize (3.19) and define

∑_{j=1}^{t} γ_{t,j} X_j = (1/σ_t) ( X_t − ∑_{j=1}^{t−1} β_{t,j} X_j ),   (3.20)

where we note that γ_{t,t} = 1/σ_t and, for j < t, γ_{t,j} = −β_{t,j}/σ_t. By construction it is clear that var(L X_k) = I_k, where X_k = (X_1,...,X_k)' and

L = ( γ_{1,1}
      γ_{2,1}  γ_{2,2}
      γ_{3,1}  γ_{3,2}  γ_{3,3}
      ⋮                        ⋱
      γ_{k,1}  γ_{k,2}  γ_{k,3}  ...  γ_{k,k−1}  γ_{k,k} )   (3.21)

is lower triangular (all entries above the diagonal are zero). Since var(L X_k) = L Σ L' = I_k, where Σ = var(X_k), we have Σ^{-1} = L'L. Consequently, writing Σ_t = var[X_t] with X_t = (X_1,...,X_t)',

(Σ_t^{-1})_{ij} = ∑_{s=max(i,j)}^{t} γ_{s,i} γ_{s,j}.

We apply these results below to the partial correlations of autoregressive processes and the inverse of their variance/covariance matrices.

3.2.2 Partial correlation in time series

The partial covariance/correlation of a time series is defined in an analogous way.

Definition 3.2.1 The partial covariance/correlation between X_t and X_{t+k+1} is defined as the partial covariance/correlation between X_t and X_{t+k+1} after conditioning out the in-between time series X_{t+1},...,X_{t+k}.
We now obtain an expression for the partial correlation between X_t and X_{t+k+1} in terms of the autocovariance function (for the final result see equation (3.22)). As the underlying assumption is that the time series is stationary, this is the same as the partial covariance/correlation between X_0 and X_{k+1}. In Chapter 5 we will introduce the idea of the linear predictor of a future time point given the present and the past (usually called forecasting); this can be neatly described using the idea of projections onto subspaces. This notation is quite succinct, therefore we derive an expression for the partial correlation using projection notation.

The projection of X_{k+1} onto the space spanned by X_k = (X_1, X_2,...,X_k)' is the best linear predictor of X_{k+1} given X_k. We will denote the projection of X_{k+1} onto the space spanned by X_1,...,X_k as P_{X_k}(X_{k+1}) (this is the same as the best linear predictor). Thus

P_{X_k}(X_{k+1}) = X_k' ( var[X_k]^{-1} E[X_{k+1} X_k] ) = X_k' Σ_k^{-1} c_k := ∑_{j=1}^{k} φ_{k,j} X_j,

where Σ_k = var(X_k) and c_k = E(X_{k+1} X_k). To derive a similar expression for P_{X_k}(X_0) we use stationarity. Let E_k denote the exchange matrix, which swaps round all the elements of a vector,

E_k = ( 0  ...  0  1
        0  ...  1  0
        ⋮
        1  0  ...  0 ).

Since Σ_k is a symmetric Toeplitz matrix we have E_k Σ_k E_k = Σ_k (and hence E_k Σ_k^{-1} = Σ_k^{-1} E_k), and by stationarity E[X_0 X_k] = E_k c_k. Therefore

P_{X_k}(X_0) = X_k' ( var[X_k]^{-1} E[X_0 X_k] ) = X_k' Σ_k^{-1} E_k c_k = X_k' E_k Σ_k^{-1} c_k := ∑_{j=1}^{k} φ_{k,k+1−j} X_j.

Thus the partial covariance between X_0 and X_{k+1} is the covariance between X_0 − P_{X_k}(X_0) and X_{k+1} − P_{X_k}(X_{k+1}), which is

cov( X_{k+1} − P_{X_k}(X_{k+1}), X_0 − P_{X_k}(X_0) ) = cov(X_{k+1}, X_0) − c_k' Σ_k^{-1} E_k c_k.   (3.22)
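Equation (3.22) is easy to evaluate numerically. The sketch below is a toy check (the autocovariance of the AR(1) process used in the example below is hard-coded; the function names are ours): it computes the partial covariance between X_0 and X_{k+1} for several k.

acf.ar1 <- function(k, phi = 0.5) phi^k / (1 - phi^2)   # c(k) for X_t = 0.5 X_{t-1} + e_t, var(e) = 1
pcov <- function(k) {
  Sigma.k <- toeplitz(acf.ar1(0:(k - 1)))   # var(X_1,...,X_k)
  c.k  <- acf.ar1(k:1)                      # E(X_{k+1} X_k)
  Ec.k <- rev(c.k)                          # E_k c_k
  acf.ar1(k + 1) - sum(c.k * solve(Sigma.k, Ec.k))   # equation (3.22)
}
sapply(1:5, pcov)    # all zero (up to rounding); cf. the example below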
We consider an example.

Example 3.2.1 (The PACF of an AR(1) process) Consider the causal AR(1) process X_t = 0.5X_{t−1} + ε_t where E(ε_t) = 0 and var(ε_t) = 1. It is easily shown that cov(X_t, X_{t−2}) = 0.5² var(X_t) = 1/3 (compare with the MA(1) process X_t = ε_t + 0.5ε_{t−1}, for which cov(X_t, X_{t−2}) = 0). We evaluate the partial covariance between X_t and X_{t−2}. Remember we have to condition out the random variables in between, which in this case is X_{t−1}. It is clear that the projection of X_t onto X_{t−1} is 0.5X_{t−1} (since X_t = 0.5X_{t−1} + ε_t). Therefore X_t − P_{X_{t−1}}(X_t) = X_t − 0.5X_{t−1} = ε_t. The projection of X_{t−2} onto X_{t−1} is a little more complicated; it is

P_{X_{t−1}}(X_{t−2}) = ( E(X_{t−1}X_{t−2}) / E(X_{t−1}²) ) X_{t−1}.

Therefore the partial covariance between X_t and X_{t−2} is

cov( X_t − P_{X_{t−1}}(X_t), X_{t−2} − P_{X_{t−1}}(X_{t−2}) ) = cov( ε_t, X_{t−2} − ( E(X_{t−1}X_{t−2}) / E(X_{t−1}²) ) X_{t−1} ) = 0.

In fact the above is true for the partial covariance between X_t and X_{t−k}, for all k ≥ 2. Hence we see that although the autocovariance of an AR(1) process is non-zero at every lag, the partial covariance is zero at all lags greater than or equal to two.

Using the same argument as above, it is easy to show that the partial covariance of an AR(p) process at lags greater than p is zero. Hence in many respects the partial covariance can be considered as an analogue of the autocovariance. It should be noted that though the covariance of an MA(q) process is zero at lags greater than q, the same is not true of its partial covariance. Whereas taking partial covariances removes correlation for autoregressive processes, it seems to add correlation for moving average processes!

Model identification:

If the autocovariances are zero after lag q, it may be appropriate to fit an MA(q) model to the time series. On the other hand, the autocovariances of an AR(p) process only decay to zero as the lag increases.

If the partial autocovariances are zero after lag p, it may be appropriate to fit an AR(p) model to the time series. On the other hand, the partial covariances of an MA(q) process only decay to zero as the lag increases.
Exercise 3.4 (The partial correlation of an invertible MA(1)) Let φ_{t,t} denote the partial correlation between X_{t+1} and X_1. It is well known (this is the Levinson-Durbin algorithm, which we cover in Chapter 5) that φ_{t,t} can be deduced recursively from the autocovariance function using the algorithm:

Step 1: φ_{1,1} = c(1)/c(0) and r(2) = E[X_2 − X_{2|1}]² = E[X_2 − φ_{1,1}X_1]² = c(0) − φ_{1,1}c(1).

Step 2: For t ≥ 2,

φ_{t,t} = ( c(t) − ∑_{j=1}^{t−1} φ_{t−1,j} c(t−j) ) / r(t),
φ_{t,j} = φ_{t−1,j} − φ_{t,t} φ_{t−1,t−j},   1 ≤ j ≤ t−1,

and r(t+1) = r(t)(1 − φ_{t,t}²).

(i) Using this algorithm, show that the PACF of the MA(1) process X_t = ε_t + θε_{t−1}, where |θ| < 1 (so it is invertible), is

φ_{t,t} = (−1)^{t+1} θ^t (1 − θ²) / (1 − θ^{2(t+1)}).

(ii) Explain how this partial correlation is similar to the ACF of the AR(1) model X_t = −θX_{t−1} + ε_t.

Exercise 3.5 (Comparing the ACF and PACF of an AR process) Compare the plots below:

(i) Compare the ACF and PACF of the AR(2) model X_t = 1.5X_{t−1} − 0.75X_{t−2} + ε_t using ARMAacf(ar=c(1.5,-0.75), ma=0, lag.max=30) and ARMAacf(ar=c(1.5,-0.75), ma=0, pacf=T, lag.max=30).

(ii) Compare the ACF and PACF of the MA(1) model X_t = ε_t − 0.5ε_{t−1} using ARMAacf(ar=0, ma=c(-0.5), lag.max=30) and ARMAacf(ar=0, ma=c(-0.5), pacf=T, lag.max=30).

(iii) Compare the ACF and PACF of the ARMA(2,1) model X_t − 1.5X_{t−1} + 0.75X_{t−2} = ε_t − 0.5ε_{t−1} using ARMAacf(ar=c(1.5,-0.75), ma=c(-0.5), lag.max=30) and ARMAacf(ar=c(1.5,-0.75), ma=c(-0.5), pacf=T, lag.max=30).

Exercise 3.6 Compare the ACF and PACF plots of the monthly temperature data considered earlier. Would you fit an AR, MA or ARMA model to this data?
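A direct implementation of the Levinson-Durbin recursion in Exercise 3.4 is sketched below (the function name levinson is ours); it is checked against the closed form PACF of the MA(1).

levinson <- function(cvec, kmax) {
  # cvec = (c(0), c(1), ..., c(kmax)); returns phi_{t,t} for t = 1,...,kmax
  phi <- numeric(0); pacf <- numeric(kmax); r <- cvec[1]
  for (t in 1:kmax) {
    num <- if (t == 1) cvec[2] else cvec[t + 1] - sum(phi * cvec[t:2])
    phitt <- num / r
    if (t > 1) phi <- phi - phitt * rev(phi)   # update phi_{t,j}, 1 <= j <= t-1
    phi <- c(phi, phitt); pacf[t] <- phitt
    r <- r * (1 - phitt^2)                     # r(t+1) = r(t)(1 - phi_{t,t}^2)
  }
  pacf
}
theta <- 0.5
levinson(c(1 + theta^2, theta, rep(0, 8)), 9)
t <- 1:9; (-1)^(t + 1) * theta^t * (1 - theta^2) / (1 - theta^(2 * (t + 1)))   # same values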
Rcode

The sample partial autocorrelation of a time series can be obtained using the command pacf. However, remember that just because the sample PACF is not zero, it does not mean the true PACF is non-zero. This is why we require error bars!

3.2.3 The variance/covariance matrix and precision matrix of an autoregressive and moving average process

Let us suppose that {X_t} is a stationary time series. In this section we consider the variance/covariance matrix var(X_k) = Σ_k, where X_k = (X_1,...,X_k)'. We will consider two cases: (i) when X_t follows an MA(p) model and (ii) when X_t follows an AR(p) model. The variance and the inverse of the variance matrices for both cases yield quite interesting results. We will use classical results from multivariate analysis, stated in Section 3.2.1.

We recall that the variance/covariance matrix of a stationary time series has a (symmetric) Toeplitz structure (see wiki for a definition). Let X_k = (X_1,...,X_k)'; then

Σ_k = var(X_k) = ( c(0)     c(1)     ...  c(k−2)  c(k−1)
                   c(1)     c(0)     c(1) ...     c(k−2)
                   ⋮                  ⋱           ⋮
                   c(k−1)   c(k−2)   ...  c(1)    c(0) ).

Σ_k for AR(p) and MA(p) models

(i) If {X_t} satisfies an MA(p) model and k > p, then Σ_k will be bandlimited: the p off-diagonals above and below the diagonal will be non-zero and the remaining off-diagonals will be zero.

(ii) If {X_t} satisfies an AR(p) model, then Σ_k will not be bandlimited.

Σ_k^{-1} for an AR(p) model

We now consider the inverse of Σ_k. Warning: note that the inverse of a Toeplitz matrix is not necessarily Toeplitz (unlike the circulant, whose inverse is). We use the results in Section 3.2.1. Suppose that we have an AR(p) process and we consider the precision matrix of X_k = (X_1,...,X_k)', where k > p.

Recall that the (i,j)th element of Σ_k^{-1}, divided by the square roots of the corresponding diagonal elements, is the negative partial correlation between X_i and X_j conditioned on all the other elements of X_k. In Section 3.2.2 we showed that if |i − j| > p, then the partial correlation between X_i and X_j given X_{i+1},...,X_{j−1} (assuming without loss of generality that i < j) is zero. We now show that the precision matrix Σ_k^{-1} is bandlimited (note that this is not immediately obvious, since (Σ_k^{-1})_{ij} gives the negative partial correlation between X_i and X_j given all of X_{-(ij)}, not just the elements between X_i and X_j).

To show this we use the Cholesky decomposition given in (3.19). Since X_t is an autoregressive process of order p, plugging this information into (3.19) gives, for t > p,

X_t = ∑_{j=1}^{t−1} β_{t,j} X_j + ε_t = ∑_{j=1}^{p} φ_j X_{t−j} + ε_t;

thus β_{t,t−j} = φ_j for 1 ≤ j ≤ p and β_{t,t−j} = 0 otherwise. Moreover, for t > p we have σ_t² = var(ε_t) = 1. For t ≤ p we use the same notation as that used in (3.19). This gives the lower triangular p-bandlimited matrix

L_k = ( γ_{1,1}
        γ_{2,1}  γ_{2,2}
        ⋮                ⋱
        −φ_p  ...  −φ_1  1
                ⋱            ⋱
                −φ_p  ...  −φ_1  1 ),   (3.23)

that is, after the first p rows there are ones along the diagonal, the p lower off-diagonals contain the negated AR coefficients, and all other entries are zero. We recall that Σ_k^{-1} = L_k'L_k; thus, since L_k is a lower triangular bandlimited matrix, Σ_k^{-1} is a bandlimited matrix with only the p off-diagonals on either side of the diagonal non-zero. Writing Σ^{ij} for the (i,j)th element of Σ_k^{-1}, we therefore have Σ^{ij} = 0 if |i − j| > p. Moreover, for indices away from the corners of the matrix (p < i, j ≤ k − p), we have Σ^{ii} = 1 + ∑_{s=1}^{p} φ_s² and, for 0 < |i − j| ≤ p,

Σ^{ij} = −φ_{|i−j|} + ∑_{s=1}^{p−|i−j|} φ_s φ_{s+|i−j|}.
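The banded structure is easily verified numerically. A small sketch using the AR(2) model from Exercise 3.5 (the overall scaling of Σ_k is irrelevant for the bandedness):

k <- 10
rho <- ARMAacf(ar = c(1.5, -0.75), lag.max = k - 1)   # autocorrelations of the AR(2)
P <- solve(toeplitz(as.numeric(rho)))                 # precision matrix (up to a constant)
round(P, 3)    # all entries with |i - j| > 2 are zero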
The coefficients Σ^{ij} give us a fascinating insight into the prediction of X_i given its past and future observations. We recall from equation (3.17) that −Σ^{ij}/Σ^{ii} are the coefficients of the best linear predictor of X_i given X_{-(i)}. The result above therefore tells us that if the observations come from a stationary AR(p) process, then the best linear predictor of X_i given X_{i−1},...,X_{i−a} and X_{i+1},...,X_{i+b} (where a, b > p) is the same as the best linear predictor of X_i given X_{i−1},...,X_{i−p} and X_{i+1},...,X_{i+p} (knowledge of the other values will not improve the prediction). There is an interesting duality between the AR and MA models which we will explore further in the course.

3.3 Correlation and non-causal time series

Here we demonstrate that it is not possible to identify whether a process is noninvertible/noncausal from its covariance structure. The simplest way to show this result uses the spectral density function, which we now define and will return to and study in depth in Chapter 8.

Definition 3.3.1 (The spectral density) Given the covariances c(k) (with ∑_k |c(k)|² < ∞), the spectral density function is defined as

f(ω) = ∑_k c(k) exp(ikω).

The covariances can be obtained from the spectral density by using the inverse Fourier transform

c(k) = (1/2π) ∫_0^{2π} f(ω) exp(−ikω) dω.

Hence the covariance yields the spectral density and vice versa. For reference below, we point out that the spectral density function uniquely identifies the autocovariance function.

Let us suppose that {X_t} satisfies the AR(p) representation

X_t = ∑_{i=1}^{p} φ_i X_{t−i} + ε_t,

where var(ε_t) = 1 and the roots of φ(z) = 1 − ∑_{j=1}^{p} φ_j z^j can lie inside and outside the unit circle, but not on the unit circle (thus it has a stationary solution). We will show in Chapter 8 that the spectral density of this AR process is

f(ω) = 1 / |1 − ∑_{j=1}^{p} φ_j exp(ijω)|².   (3.24)

Factorizing f(ω). Let us suppose the characteristic polynomial factorizes as φ(z) = 1 − ∑_{j=1}^{p} φ_j z^j = ∏_{j=1}^{p} (1 − λ_j z) (so {λ_j^{-1}} are the roots of φ). Using this factorization, (3.24) can be written as

f(ω) = ∏_{j=1}^{p} 1 / |1 − λ_j exp(iω)|².   (3.25)

As we have not assumed {X_t} is causal, the roots of φ(z) can lie both inside and outside the unit circle. We separate the λ_j into those whose corresponding roots lie outside the unit circle, {λ_{O,j₁}; j₁ = 1,...,p₁} (with |λ_{O,j₁}| < 1), and inside the unit circle, {λ_{I,j₂}; j₂ = 1,...,p₂} (with |λ_{I,j₂}| > 1), where p₁ + p₂ = p. Thus, using 1 − λz = −λz(1 − λ^{-1}z^{-1}),

φ(z) = [∏_{j₁=1}^{p₁} (1 − λ_{O,j₁} z)][∏_{j₂=1}^{p₂} (1 − λ_{I,j₂} z)]
     = (−1)^{p₂} (∏_{j₂=1}^{p₂} λ_{I,j₂}) z^{p₂} [∏_{j₁=1}^{p₁} (1 − λ_{O,j₁} z)][∏_{j₂=1}^{p₂} (1 − λ_{I,j₂}^{-1} z^{-1})].   (3.26)

Thus (3.25) can be rewritten as

f(ω) = (∏_{j₂=1}^{p₂} |λ_{I,j₂}|²)^{-1} · 1 / ( ∏_{j₁=1}^{p₁} |1 − λ_{O,j₁} exp(iω)|² ∏_{j₂=1}^{p₂} |1 − λ_{I,j₂}^{-1} exp(iω)|² )   (3.27)

(using that complex λ's come in conjugate pairs). Let

f_O(ω) = 1 / ( ∏_{j₁=1}^{p₁} |1 − λ_{O,j₁} exp(iω)|² ∏_{j₂=1}^{p₂} |1 − λ_{I,j₂}^{-1} exp(iω)|² ).

Then f(ω) = (∏_{j₂=1}^{p₂} |λ_{I,j₂}|²)^{-1} f_O(ω).

A parallel causal AR(p) process with the same covariance structure always exists. We now define a process which has the same autocovariance function as {X_t} but is causal.
Using (3.26) we define the polynomial

φ̃(z) = [∏_{j₁=1}^{p₁} (1 − λ_{O,j₁} z)][∏_{j₂=1}^{p₂} (1 − λ_{I,j₂}^{-1} z)].   (3.28)

By construction, the roots of this polynomial lie outside the unit circle. We then define the AR(p) process

φ̃(B) X̃_t = ε_t;   (3.29)

from Lemma 2.3.1 we know that {X̃_t} has an almost surely unique, stationary solution. Moreover, because the roots lie outside the unit circle, the solution is causal. By (3.24), the spectral density of {X̃_t} is f_O(ω). We know that the spectral density function uniquely determines the autocovariance function. Comparing the spectral density of {X̃_t} with the spectral density of {X_t}, we see that they are the same up to a multiplicative constant. Thus both have the same autocovariance structure up to a multiplicative constant (which can be made equal to one if, in the definition (3.29), the innovation process is given variance ∏_{j₂=1}^{p₂} |λ_{I,j₂}|^{-2}).

Therefore, for every non-causal process there exists a causal process with the same autocovariance function. By using the same arguments as above, we can generalize this result to ARMA processes.

Definition An ARMA process is said to have minimum phase when the roots of φ(z) and θ(z) both lie outside the unit circle.

Remark 3.3.1 For Gaussian random processes it is impossible to discriminate between a causal and a non-causal time series, because the mean and autocovariance function uniquely identify the process. However, if the innovations are non-Gaussian, then even though the autocovariance function is blind to non-causal processes, by looking for other features in the time series we are able to discriminate between a causal and a non-causal process.
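The construction of φ̃ from φ can be carried out numerically by flipping the roots of φ(z) that lie inside the unit circle. A sketch (the helper names make.causal and polymul are ours) that returns the coefficients of φ̃:

polymul <- function(a, b) {   # multiply two polynomials given as coefficient vectors
  out <- rep(0 + 0i, length(a) + length(b) - 1)
  for (i in seq_along(a)) out[i:(i + length(b) - 1)] <- out[i:(i + length(b) - 1)] + a[i] * b
  out
}
make.causal <- function(phi) {
  roots <- polyroot(c(1, -phi))                        # roots of 1 - phi_1 z - ... - phi_p z^p
  inside <- Mod(roots) < 1
  roots[inside] <- 1 / Conj(roots[inside])             # flip the offending roots outside
  coef <- 1
  for (r in roots) coef <- polymul(coef, c(1, -1/r))   # rebuild prod_j (1 - z/root_j)
  -Re(coef[-1])                                        # phi~_1, ..., phi~_p
}
make.causal(2.5)   # the non-causal AR(1) with phi = 2.5 maps to the causal phi~ = 0.4

Reversing the inequality (flipping the outside roots inside) gives the non-causal counterpart of a causal process, which is what Exercise 3.7 below asks for.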
3.3.1 The Yule-Walker equations of a non-causal process

Once again let us consider the zero mean AR(p) model

X_t = ∑_{j=1}^{p} φ_j X_{t−j} + ε_t,

where var(ε_t) < ∞. Suppose the roots of the corresponding characteristic polynomial lie outside the unit circle; then {X_t} is strictly stationary, where the solution of X_t is only in terms of past and present values of {ε_t}. Moreover, it is second order stationary with covariance {c(k)}. We recall from Section 3.1.2, equation (3.4), that we derived the Yule-Walker equations for causal AR(p) processes:

E(X_t X_{t−k}) = ∑_{j=1}^{p} φ_j E(X_{t−j} X_{t−k})   ⟹   c(k) − ∑_{j=1}^{p} φ_j c(k−j) = 0 for k > 0.   (3.30)

Let us now consider the case that the roots of the characteristic polynomial lie both outside and inside the unit circle; thus X_t does not have a causal solution, but it is still strictly and second order stationary (with autocovariance, say, {c(k)}). In the previous section we showed that there exists a causal AR(p) process φ̃(B)X̃_t = ε_t (where φ(z) and φ̃(z) = 1 − ∑_{j=1}^{p} φ̃_j z^j are the characteristic polynomials defined in (3.26) and (3.28)), and we showed that both processes have the same autocovariance structure. Therefore

c(k) − ∑_{j=1}^{p} φ̃_j c(k−j) = 0 for k > 0.

This means the Yule-Walker equations for {X_t} would actually give the AR(p) coefficients of {X̃_t}. Thus if the Yule-Walker equations were used to estimate the AR coefficients of {X_t}, in reality we would be estimating the AR coefficients of the corresponding causal {X̃_t}.

3.3.2 Filtering non-causal AR models

Here we discuss the surprising result that filtering a non-causal time series with the corresponding causal AR parameters leaves a sequence which is uncorrelated but not independent. Let us suppose that

X_t = ∑_{j=1}^{p} φ_j X_{t−j} + ε_t,

where the ε_t are iid, E(ε_t) = 0 and var(ε_t) < ∞. It is clear that, given the input X_t, if we apply the filter X_t − ∑_{j=1}^{p} φ_j X_{t−j} we obtain an iid sequence (which is {ε_t}). Suppose instead that we filter {X_t} with the causal coefficients {φ̃_j}. The output ε̃_t = X_t − ∑_{j=1}^{p} φ̃_j X_{t−j} is not an independent sequence; however, it is an uncorrelated sequence. We illustrate this with an example.

Example 3.3.1 Let us return to the AR(1) example, where X_t = φX_{t−1} + ε_t. Let us suppose that |φ| > 1, which corresponds to a non-causal time series; then X_t has the solution

X_t = −∑_{j=1}^{∞} φ^{-j} ε_{t+j}.

The causal time series with the same covariance structure as X_t is X̃_t = φ^{-1}X̃_{t−1} + ε_t (which has backshift representation (1 − B/φ)X̃_t = ε_t). Suppose we pass X_t through the causal filter:

ε̃_t = (1 − φ^{-1}B)X_t = X_t − φ^{-1}X_{t−1} = φ^{-2}ε_t − (1 − φ^{-2}) ∑_{j=1}^{∞} φ^{-j} ε_{t+j}.

Evaluating the covariance of the above (assuming without loss of generality that var(ε_t) = 1), for r ≥ 1,

cov(ε̃_t, ε̃_{t+r}) = −φ^{-2}(1 − φ^{-2})φ^{-r} + (1 − φ^{-2})² φ^{-r} ∑_{j=1}^{∞} φ^{-2j} = 0.

Thus we see that {ε̃_t} is an uncorrelated sequence, but unless it is Gaussian it is clearly not independent. One method to study the higher order dependence of {ε̃_t} is to consider its higher order cumulant structure.

The above result can be generalised to general AR models, and it is relatively straightforward to prove using the Cramér representation of a stationary process (see Section 8.4).
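This phenomenon is easily reproduced in a simulation. A sketch, assuming centred exponential innovations (so the process is non-Gaussian) and truncating the infinite sum in the non-causal solution at 50 terms:

set.seed(1)
n <- 2000; phi <- 2; m <- 50
eps <- rexp(n + m) - 1                        # iid, mean zero, non-Gaussian
X <- sapply(1:n, function(t) -sum(phi^(-(1:m)) * eps[t + (1:m)]))   # non-causal solution
eps.tilde <- X[-1] - (1/phi) * X[-n]          # filter with the causal coefficient 1/phi
par(mfrow = c(1, 2))
acf(eps.tilde)       # flat: the filtered sequence is uncorrelated
acf(eps.tilde^2)     # correlated squares: the sequence is not independent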
Exercise 3.7

(i) Consider the causal AR(2) process X_t = 1.5X_{t−1} − 0.75X_{t−2} + ε_t. Derive a parallel process with the same autocovariance structure but which is non-causal (it should be real).

(ii) Simulate both the causal process above and the corresponding non-causal process with non-Gaussian innovations (see Section 2.6). Show that they have the same ACF.

(iii) Find features which allow you to discriminate between the causal and the non-causal process.
Chapter 4

Nonlinear Time Series Models

Prerequisites

A basic understanding of expectations, conditional expectations and how one can use conditioning to obtain an expectation.

Objectives:

Use relevant results to show that a model has a stationary solution.

Derive moments of these processes.

Understand the differences between linear and nonlinear time series.

So far we have focused on linear time series, that is, time series which have the representation

X_t = ∑_{j=−∞}^{∞} ψ_j ε_{t−j},

where {ε_t} are iid random variables. Such models are extremely useful, because they are designed to model the autocovariance structure and are straightforward to use for forecasting. These are some of the reasons that they are used widely in several applications. A typical realisation from a linear time series will be quite regular, with no sudden bursts or jumps. This is due to the linearity of the system. However, if one looks at financial data, for example, there are sudden bursts in volatility (variation) and extreme values, which calm down after a while. It is not possible to model such behaviour well with a linear time series. In order to capture nonlinear behaviour several nonlinear models have been proposed. These models typically consist of products of random variables, which make possible the sudden erratic bursts seen in the data. Over the past 30 years there has been a lot of research into nonlinear time series models. Probably one of the first nonlinear models proposed for time series analysis is the bilinear model; this model is used extensively in signal processing and engineering. A popular family of models for financial data is the (G)ARCH family. Other popular models are random autoregressive coefficient models and threshold models, to name but a few (see, for example, Subba Rao (1977), Granger and Andersen (1978), Nicholls and Quinn (1982), Engle (1982), Subba Rao and Gabr (1984), Bollerslev (1986), Terdik (1999), Fan and Yao (2003), Straumann (2005) and Douc et al. (2014)).

Once a model has been defined, the first difficult task is to show that it actually has a solution which is almost surely finite (recall these models have dynamics which start at t = −∞, so if they are not well defined they could be "infinite"), with a stationary solution. Typically, in the nonlinear world, we look for causal solutions. I suspect this is because the mathematics behind the existence of non-causal solutions makes the problem even more complex. We state a result that gives sufficient conditions for a stationary, causal solution of a certain class of models. These models include ARCH/GARCH and bilinear models. We note that the theorem guarantees a solution, but does not give conditions for its moments. The result is based on Brandt (1986), but under stronger conditions.

Theorem 4.0.1 (Brandt (1986)) Let us suppose that {X_t} is a d-dimensional time series defined by the stochastic recurrence relation

X_t = A_t X_{t−1} + B_t,   (4.1)

where {A_t} and {B_t} are iid random matrices and vectors respectively. If E log ‖A_t‖ < 0 and E log ‖B_t‖ < ∞ (where ‖·‖ denotes the spectral norm of a matrix), then

X_t = B_t + ∑_{k=1}^{∞} ( ∏_{i=0}^{k−1} A_{t−i} ) B_{t−k}   (4.2)

converges almost surely and is the unique strictly stationary causal solution.

Note: The conditions given above are very strong, and Brandt (1986) states the result under weaker conditions; we outline the differences here. Firstly, the assumption that {A_t, B_t} are iid can be relaxed to their being ergodic sequences. Secondly, the assumption E log ‖A_t‖ < 0 can be relaxed to E log ‖A_t‖ < ∞ together with {A_t} having a negative Lyapunov exponent, where the Lyapunov exponent is defined by lim_{n→∞} (1/n) log ‖∏_{j=1}^{n} A_j‖ = γ almost surely, with γ < 0 (see Brandt (1986)).

The conditions given in the above theorem may appear a little cryptic. However, the condition E log |A_t| < 0 (in the univariate case) becomes quite clear if you compare the SRE model with the AR(1) model X_t = ρX_{t−1} + ε_t, where |ρ| < 1 (which is the special case of the SRE where the coefficient is deterministic). We recall that the solution of the AR(1) is X_t = ∑_{j=0}^{∞} ρ^j ε_{t−j}. The important part of this decomposition is that ρ^j decays geometrically fast to zero. Now let us compare this to (4.2): we see that ρ^j plays a role similar to ∏_{i=0}^{k−1} A_{t−i}. Given the similarities between the AR(1) and the SRE, it seems reasonable that for (4.2) to converge, ∏_{i=0}^{k−1} A_{t−i} should converge geometrically to zero too (at least almost surely). However, analysis of a product is not straightforward, therefore we take logarithms to turn it into a sum:

(1/k) log ∏_{i=0}^{k−1} |A_{t−i}| = (1/k) ∑_{i=0}^{k−1} log |A_{t−i}| → E[log |A_t|] := γ  almost surely,

since it is the sum of iid random variables. Thus, taking anti-logs,

∏_{i=0}^{k−1} |A_{t−i}| ≈ exp[kγ],

which only converges to zero if γ < 0, in other words E[log |A_t|] < 0. Thus we see that the condition E log |A_t| < 0 is quite a logical condition after all.

4.1 Data Motivation

4.1.1 Yahoo data from 1996-2014

We consider here the closing share price of the Yahoo daily data downloaded from finance.yahoo.com/q/hp?s=yhoo. The data starts from 10th April 1996 and runs to 8th August 2014 (over 4000 observations). A plot is given in Figure 4.1. Typically the logarithm of such data is taken, and in order to remove linear and/or stochastic trend the first difference of the logarithm is taken (i.e. X_t = log S_t − log S_{t−1}). The hope is that after taking differences the data has been stationarized (see Example 2.3.2). However, the data set spans almost 20 years and this assumption is rather precarious; it will be investigated later. A plot of the data after taking first differences, together with the QQ-plot, is given in Figure 4.2.

[Figure 4.1: Plot of daily closing Yahoo share price, 1996-2014.]

[Figure 4.2: Plot of log differences of daily Yahoo share price and the corresponding QQ-plot.]

From the QQ-plot in Figure 4.2, we observe that the log differences {X_t} appear to have very thick tails, which may mean that higher order moments of the log returns do not exist (are not finite). In Figure 4.3 we give the autocorrelation (ACF) plots of the log differences, the absolute log differences and the squares of the log differences. Note that the sample autocorrelation is defined as

ρ̂(k) = ĉ(k)/ĉ(0),   where   ĉ(k) = (1/T) ∑_{t=1}^{T−|k|} (X_t − X̄)(X_{t+|k|} − X̄).   (4.3)

The dotted lines are the error bars (the 95% confidence intervals of the sample correlations, constructed under the assumption that the observations are independent; see Section 6.2.1).

[Figure 4.3: ACF plots of the transformed Yahoo data: (a) the log differences, (b) the absolute log differences, (c) the squares of the log differences.]

From Figure 4.3a we see that there appears to be no correlation in the data. More precisely, most of the sample correlations are within the error bars; the few that are outside could be so by chance, as the error bars are constructed pointwise. However, Figure 4.3b, the ACF plot of the absolute values, gives significant large correlations. In contrast, in Figure 4.3c, the ACF plot of the squares, there does not appear to be any significant correlation.

To summarise, {X_t} appears to be uncorrelated (white noise). However, once absolute values have been taken there does appear to be dependence. This type of behaviour cannot be modelled with a linear model. What is quite interesting is that there does not appear to be any significant correlation in the squares. However, one explanation for this can be found in the QQ-plot. The data has extremely thick tails, which suggests that the fourth moment of the process may not exist (the empirical variance of X_t² will be extremely large). Since the sample correlation (4.3) involves division by ĉ(0), which could be extremely large, this can mask the sample autocovariances.

R code for Yahoo data

Here we give the R code for making the plots above.

yahoo <- scan("~/yahoo.txt")
yahoo <- yahoo[c(length(yahoo):1)]   # switch the entries to ascending order in time
yahoo.log.diff <- log(yahoo[-1]) - log(yahoo[-length(yahoo)])   # take log differences
par(mfrow=c(1,1))
plot.ts(yahoo)
par(mfrow=c(1,2))
plot.ts(yahoo.log.diff)
qqnorm(yahoo.log.diff)
qqline(yahoo.log.diff)
par(mfrow=c(1,3))
acf(yahoo.log.diff)         # ACF plot of log differences
acf(abs(yahoo.log.diff))    # ACF plot of absolute log differences
acf((yahoo.log.diff)**2)    # ACF plot of square of log differences

4.1.2 FTSE 100 from January - August 2014

For completeness we discuss a much shorter data set, the daily closing price of the FTSE 100 from 20th January to 8th August, 2014 (141 observations). This data was downloaded from http://markets.ft.com/research//tearsheets/pricehistorypopup?symbol=ftse:fsi. Exactly the same analysis that was applied to the Yahoo data is applied to the FTSE data, and the plots are given in Figures 4.4-4.6.

[Figure 4.4: Plot of daily closing FTSE 100 price, January-August 2014.]

[Figure 4.5: Plot of log differences of the daily FTSE 100 price, January-August 2014, and the corresponding QQ-plot.]

[Figure 4.6: ACF plots of the transformed FTSE data: (a) the log differences, (b) the absolute log differences, (c) the squares of the log differences.]

We observe that for this (much shorter) data set the observations do not appear to deviate much from normality. Furthermore, the ACF plots of the log differences, absolutes and squares do not suggest any evidence of correlation. Could it be that, after taking log differences, there is no dependence in the data (the data is a realisation from iid random variables)? Or is there dependence, but it lies in a higher order structure or in more sophisticated transformations? Comparing this to the Yahoo data, maybe we see dependence in the Yahoo data because it is actually nonstationary. The mystery continues. It would be worthwhile conducting a similar analysis on a comparable portion of the Yahoo data.
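Following the suggestion above, a sketch that repeats the ACF analysis on the final portion of the Yahoo log differences (matching the 141 observations of the FTSE data set; yahoo.log.diff is as constructed in the R code above):

yahoo.short <- tail(yahoo.log.diff, 141)
par(mfrow = c(1, 3))
acf(yahoo.short)
acf(abs(yahoo.short))
acf(yahoo.short^2)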
4.2 The ARCH model

During the early 1980s econometricians were trying to find a suitable model for forecasting stock prices. They were faced with data similar to the log differences of the Yahoo data in Figure 4.2. As Figure 4.3a demonstrates, there does not appear to be any linear dependence in the data, which makes the best linear predictor quite useless for forecasting. Instead, they tried to predict the variance of future prices given the past, var[X_{t+1}|X_t, X_{t−1},...]. This called for a model that has a zero autocorrelation function but models the conditional variance.

To address this need, Engle (1982) proposed the autoregressive conditionally heteroskedastic (ARCH) model (note that Rob Engle, together with Clive Granger, received the 2003 Nobel prize for Economics, for work on ARCH models and cointegration). He proposed the ARCH(p) model, which satisfies the representation

X_t = σ_t Z_t,   σ_t² = a_0 + ∑_{j=1}^{p} a_j X_{t−j}²,

where the Z_t are iid random variables with E(Z_t) = 0 and var(Z_t) = 1, a_0 > 0 and a_j ≥ 0 for 1 ≤ j ≤ p. Before worrying about whether a solution of such a model exists, let us consider the reasons why this model was first proposed.

4.2.1 Features of an ARCH

Let us suppose that a causal, stationary solution of the ARCH model exists (X_t is a function of Z_t, Z_{t−1}, Z_{t−2},...) and that all the necessary moments exist. Then we obtain the following.

(i) The first moment:

E[X_t] = E[Z_t σ_t] = E[E(Z_t σ_t | X_{t−1}, X_{t−2},...)] = E[σ_t E(Z_t | X_{t−1}, X_{t−2},...)] = E[σ_t E(Z_t)] = E[σ_t · 0] = 0,

where we use that σ_t is a function of X_{t−1},...,X_{t−p} and, by causality, Z_t is independent of the past. Thus the ARCH process has a zero mean.

(ii) The conditional variance:

var(X_t | X_{t−1},...,X_{t−p}) = E(X_t² | X_{t−1},...,X_{t−p}) = E(Z_t² σ_t² | X_{t−1},...,X_{t−p}) = σ_t² E[Z_t²] = σ_t².

Thus the conditional variance is σ_t² = a_0 + ∑_{j=1}^{p} a_j X_{t−j}² (a weighted sum of the squared past).

(iii) The autocovariance function: without loss of generality assume k > 0; then

cov[X_t, X_{t+k}] = E[X_t X_{t+k}] = E[X_t E(X_{t+k} | X_{t+k−1},...,X_t)] = E[X_t σ_{t+k} E(Z_{t+k} | X_{t+k−1},...,X_t)] = E[X_t σ_{t+k} E(Z_{t+k})] = E[X_t σ_{t+k} · 0] = 0.

The autocorrelation function is zero (it is a white noise process).

(iv) We will show in Section 4.2.2 that E[X_t^{2d}] < ∞ iff [∑_{j=1}^{p} a_j] E[Z_t^{2d}]^{1/d} < 1. It is well known that even for Gaussian innovations E[Z_t^{2d}]^{1/d} grows with d; therefore if any of the a_j are non-zero (recall all must be non-negative), there will exist a d_0 such that for all d ≥ d_0 the moment E[X_t^{2d}] will not be finite. Thus we see that the ARCH process has thick tails. Usually we measure the thickness of tails in data using the kurtosis (see wiki).

Points (i)-(iv) demonstrate that the ARCH model is able to model many of the features seen in the stock price data.

In some sense the ARCH model can be considered as a generalisation of the AR model. That is, the squares of the ARCH model satisfy

X_t² = σ_t² Z_t² = a_0 + ∑_{j=1}^{p} a_j X_{t−j}² + (Z_t² − 1)σ_t²,   (4.4)

with characteristic polynomial φ(z) = 1 − ∑_{j=1}^{p} a_j z^j. It can be shown that if ∑_{j=1}^{p} a_j < 1, then the roots of φ(z) lie outside the unit circle (see Chapter 2). Moreover, the innovations ɛ_t = (Z_t² − 1)σ_t² are martingale differences (see wiki). This can be shown by noting that

E[(Z_t² − 1)σ_t² | X_{t−1}, X_{t−2},...] = σ_t² E(Z_t² − 1 | X_{t−1}, X_{t−2},...) = σ_t² E(Z_t² − 1) = 0.

Thus cov(ɛ_t, ɛ_s) = 0 for s ≠ t. Martingales are a useful asymptotic tool in time series; we demonstrate how they can be used in Chapter 10.

To summarise, in many respects the ARCH(p) model resembles the AR(p), except that the innovations {ɛ_t} are martingale differences and not iid random variables. This means that, despite the resemblance, it is not a linear time series. We now show that a unique, stationary causal solution of the ARCH model exists and derive conditions under which its moments exist.
4.2.2 Existence of a strictly stationary solution and second order stationarity of the ARCH

To simplify notation we will consider the ARCH(1) model

X_t = σ_t Z_t,   σ_t² = a_0 + a_1 X_{t−1}².   (4.5)

It is difficult to directly obtain a solution for X_t; instead we obtain a solution for σ_t² (since X_t can immediately be obtained from it). Using that X_{t−1}² = σ_{t−1}² Z_{t−1}² and substituting this into (4.5), we obtain

σ_t² = a_0 + a_1 X_{t−1}² = (a_1 Z_{t−1}²) σ_{t−1}² + a_0.   (4.6)

We observe that (4.6) can be written in the stochastic recurrence relation form given in (4.1) with A_t = a_1 Z_{t−1}² and B_t = a_0. Therefore, by using Theorem 4.0.1, if

E[log(a_1 Z_t²)] = log a_1 + E[log Z_t²] < 0,   (4.7)

then σ_t² has the strictly stationary causal solution

σ_t² = a_0 + a_0 ∑_{k=1}^{∞} a_1^k ∏_{j=1}^{k} Z_{t−j}².

The condition (4.7) is immediately implied if a_1 < 1 (since E[log Z_t²] ≤ log E[Z_t²] = 0), but it is also satisfied under weaker conditions on a_1. To obtain the moments of X_t² we use that it has the solution

X_t² = Z_t² ( a_0 + a_0 ∑_{k=1}^{∞} a_1^k ∏_{j=1}^{k} Z_{t−j}² );

therefore, taking expectations, we have

E[X_t²] = E[Z_t²] E[ a_0 + a_0 ∑_{k=1}^{∞} a_1^k ∏_{j=1}^{k} Z_{t−j}² ] = a_0 + a_0 ∑_{k=1}^{∞} a_1^k.

Thus E[X_t²] < ∞ if and only if a_1 < 1 (heuristically we can see this from E[X_t²] = E[Z_t²](a_0 + a_1 E[X_{t−1}²])). By placing stricter conditions on a_1, namely a_1 E(Z_t^{2d})^{1/d} < 1, we can show that E[X_t^{2d}] < ∞.
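Condition (4.7) can be explored by Monte Carlo. For standard normal Z_t we have E[log Z_t²] = −(log 2 + γ_E) ≈ −1.27 (γ_E being Euler's constant), so a strictly stationary ARCH(1) solution exists for a_1 up to about exp(1.27) ≈ 3.56, well beyond the second order stationarity region a_1 < 1. A sketch:

set.seed(1)
Z <- rnorm(1e6)
ElogZ2 <- mean(log(Z^2))   # approximately -1.27
exp(-ElogZ2)               # approximate upper bound on a_1 for strict stationarity, about 3.56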
108 Thus E[X 2 t ] < if and only if a < (heuristically we can see this from E[X 2 t ] = E[Z 2 2 ](a 0 + a E[X 2 t ])). By placing stricter conditions on a, namely a E(Zt 2d ) /d <, we can show that E[Xt 2d ] <. The ARCH(p) model We can generalize the above results to ARCH(p) processes (but to show existence of a solution we need to write the ARCH(p) process as a vector process similar to the Vector AR() representation of an AR(p) given in Section 2.4.). It can be shown that under sufficient conditions on the coefficients {a j } that the stationary, causal solution of the ARCH(p) model is X 2 t = a 0 Z 2 t + k m t (k) (4.8) where m t (k) = j,...,j k ( k ) k a 0 a jr r= r=0 Z 2 t r s=0 js (j 0 = 0). The above solution belongs to a general class of functions called a Volterra expansion. We note that E[X 2 t ] < iff p a j <. 4.3 The GARCH model A possible drawback of the ARCH(p) model is that the conditional variance only depends on finite number of the past squared observations/log returns (in finance, the share price is often called the return). However, when fitting the model to the data, analogous to order selection of an autoregressive model (using, say, the AIC), often a large order p is selected. This suggests that the conditional variance should involve a large (infinite?) number of past terms. This observation motivated the GARCH model (first proposed in Bollerslev (986) and Taylor (986)), which in many respects is analogous to the ARMA. The conditional variance of the GARCH model is a weighted average of the squared returns, the weights decline with the lag, but never go completely to zero. The GARCH class of models is a rather parsimonous class of models and is extremely popular in finance. The GARCH(p, q) model is defined as p q X t = σ t Z t σt 2 = a 0 + a j Xt j 2 + b i σt i 2 (4.9) i= 07
109 where Z t are iid random variables where E(Z t ) = 0 and var(z t ) =, a 0 > 0 and for j p a j 0 and i q b i 0. Under the assumption that a causal solution with sufficient moments exist, the same properties defined for the ARCH(p) in Section 4.2. also apply to the GARCH(p, q) model. It can be shown that under suitable conditions on {b j } that X t satisfies an ARCH( ) representation. Formally, we can write the conditional variance σ 2 t exists) as (assuming that a stationarity solution ( q b i B i )σt 2 = (a 0 + i= p a j Xt j), 2 where B denotes the backshift notation defined in Chapter 2. Therefore if the roots of b(z) = ( q i= b iz i ) lie outside the unit circle (which is satisfied if i b i < ) then σ 2 t = ( q b jb j ) (a 0 + p a j Xt j) 2 = α 0 + α j Xt j, 2 (4.0) where a recursive equation for the derivation of α j can be found in Berkes et al. (2003). In other words the GARCH(p, q) process can be written as a ARCH( ) process. This is analogous to the invertibility representation given in Definition This representation is useful when estimating the parameters of a GARCH process (see Berkes et al. (2003)) and also prediction. The expansion in (4.0) helps explain why the GARCH(p, q) process is so popular. As we stated at the start of this section, the conditional variance of the GARCH is a weighted average of the squared returns, the weights decline with the lag, but never go completely to zero, a property that is highly desirable. Example 4.3. (Inverting the GARCH(, )) If b <, then we can write σ 2 t as σ 2 t = b j B j [a 0 + a Xt ] 2 a 0 = b + a b j Xt j. 2 j=0 j=0 This expression is useful in both prediction and estimation. In the following section we derive conditions for existence of the GARCH model and also it s moments. 08
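For the GARCH(1,1), the ARCH(∞) weights in (4.10) are explicit: α_0 = a_0/(1 − b_1) and α_j = a_1 b_1^{j−1}. The sketch below (parameter values are illustrative) simulates a GARCH(1,1) and confirms that the truncated ARCH(∞) expansion of Example 4.3.1 reproduces the recursively computed conditional variance:

set.seed(1)
n <- 1000; a0 <- 0.1; a1 <- 0.1; b1 <- 0.85
sigma2 <- numeric(n); X <- numeric(n)
sigma2[1] <- a0 / (1 - a1 - b1); X[1] <- sqrt(sigma2[1]) * rnorm(1)
for (t in 2:n) {
  sigma2[t] <- a0 + a1 * X[t-1]^2 + b1 * sigma2[t-1]
  X[t] <- sqrt(sigma2[t]) * rnorm(1)
}
M <- 100   # truncation point of the ARCH(infinity) expansion
approx.sig2 <- sapply((M + 1):n, function(t)
  a0 / (1 - b1) + a1 * sum(b1^(0:(M - 1)) * X[t - (1:M)]^2))
max(abs(approx.sig2 - sigma2[(M + 1):n]))   # negligible truncation error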
4.3.1 Existence of a stationary solution of a GARCH(1,1)

We will focus on the GARCH(1,1) model, as this substantially simplifies the conditions. We recall that the conditional variance of the GARCH(1,1) can be written as

σ_t² = a_0 + a_1 X_{t−1}² + b_1 σ_{t−1}² = (a_1 Z_{t−1}² + b_1) σ_{t−1}² + a_0.   (4.11)

We observe that (4.11) can be written in the stochastic recurrence relation form given in (4.1) with A_t = a_1 Z_{t−1}² + b_1 and B_t = a_0. Therefore, by using Theorem 4.0.1, if E[log(a_1 Z_t² + b_1)] < 0, then σ_t² has the strictly stationary causal solution

σ_t² = a_0 + a_0 ∑_{k=1}^{∞} ∏_{j=1}^{k} (a_1 Z_{t−j}² + b_1).   (4.12)

These conditions are relatively weak and depend on the distribution of Z_t. They are certainly satisfied if a_1 + b_1 < 1, since E[log(a_1 Z_t² + b_1)] ≤ log E[a_1 Z_t² + b_1] = log(a_1 + b_1). However, existence of a stationary solution does not require such a strong condition on the coefficients (there can still exist a stationary solution when a_1 + b_1 > 1, so long as the distribution of Z_t² is such that E[log(a_1 Z_t² + b_1)] < 0).

By taking expectations of (4.12) we see that

E[X_t²] = E[σ_t²] = a_0 + a_0 ∑_{k=1}^{∞} (a_1 + b_1)^k.

Thus E[X_t²] < ∞ iff a_1 + b_1 < 1 (noting that a_1 and b_1 are both positive). Expanding on this argument, for d > 1 we can use the Minkowski inequality to show

(E[σ_t^{2d}])^{1/d} ≤ a_0 + a_0 ∑_{k=1}^{∞} ( E[ ∏_{j=1}^{k} (a_1 Z_{t−j}² + b_1)^d ] )^{1/d} = a_0 + a_0 ∑_{k=1}^{∞} ( E[(a_1 Z_t² + b_1)^d] )^{k/d}.

Therefore, if E[(a_1 Z_t² + b_1)^d] < 1, then E[X_t^{2d}] < ∞. This is in fact an iff condition, since from the definition in (4.11) we have

E[σ_t^{2d}] = E[a_0 + (a_1 Z_{t−1}² + b_1)σ_{t−1}²]^d ≥ E[(a_1 Z_{t−1}² + b_1)σ_{t−1}²]^d = E[(a_1 Z_{t−1}² + b_1)^d] E[σ_{t−1}^{2d}],

since every term is positive and, because σ_{t−1}² has a causal solution, it is independent of Z_{t−1}². We observe that by stationarity, if E[σ_t^{2d}] < ∞ then E[σ_t^{2d}] = E[σ_{t−1}^{2d}]; thus the above inequality can only hold if E[(a_1 Z_t² + b_1)^d] < 1. Therefore E[X_t^{2d}] < ∞ iff E[(a_1 Z_t² + b_1)^d] < 1. Indeed, in order for E[X_t^{2d}] < ∞, a huge constraint needs to be placed on the parameter space of a_1 and b_1.

Exercise 4.1 Suppose {Z_t} are standard normal random variables. Find conditions on a_1 and b_1 such that E[X_t^4] < ∞.

The above results can be generalised to the GARCH(p,q) model. Conditions for the existence of a stationary solution hinge on the random matrix corresponding to the SRE representation of the GARCH model (see Bougerol and Picard (1992a) and Bougerol and Picard (1992b)), and they are nearly impossible to verify. A sufficient and necessary condition for both a stationary (causal) solution and second order stationarity (E[X_t²] < ∞) is ∑_{j=1}^{p} a_j + ∑_{i=1}^{q} b_i < 1. However, many econometricians believe this condition places an unreasonable constraint on the parameter space of {a_j} and {b_i}. A large amount of research has been done on finding consistent parameter estimators under weaker conditions. Indeed, in the very interesting paper by Berkes et al. (2003) (see also Straumann (2005)), consistent estimators of the GARCH parameters are derived under a far milder set of conditions on {a_j} and {b_i} (which do not require E(X_t²) < ∞).

Definition 4.3.1 The IGARCH model is a GARCH model

X_t = σ_t Z_t,   σ_t² = a_0 + ∑_{j=1}^{p} a_j X_{t−j}² + ∑_{i=1}^{q} b_i σ_{t−i}²,   (4.13)

where the coefficients are such that ∑_{j=1}^{p} a_j + ∑_{i=1}^{q} b_i = 1. This is an example of a time series model which has a strictly stationary solution but is not second order stationary.

Exercise 4.2 Simulate realisations of ARCH(1) and GARCH(1,1) models. Simulate with both iid Gaussian and t-distribution errors ({Z_t} standardised so that E[Z_t²] = 1). Remember to burn in each realisation. In all cases fix a_0 > 0. Then

(i) Simulate an ARCH(1) with a_1 = 0.3, and one with a_1 = 0.9.

(ii) Simulate a GARCH(1,1) with a_1 = 0.1 and b_1 = 0.85, and a GARCH(1,1) with a_1 = 0.85 and b_1 = 0.1. Compare the two behaviours.
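One possible sketch for Exercise 4.2 (the function name garch11.sim is ours; an ARCH(1) is the special case b1 = 0, and df = Inf gives Gaussian innovations):

garch11.sim <- function(n, a0, a1, b1, df = Inf, n0 = 400) {
  # innovations standardised so that E[Z^2] = 1 (t-errors require df > 2)
  Z <- if (is.infinite(df)) rnorm(n + n0) else rt(n + n0, df) / sqrt(df / (df - 2))
  sigma2 <- numeric(n + n0); X <- numeric(n + n0)
  sigma2[1] <- a0; X[1] <- sqrt(sigma2[1]) * Z[1]
  for (t in 2:(n + n0)) {
    sigma2[t] <- a0 + a1 * X[t-1]^2 + b1 * sigma2[t-1]
    X[t] <- sqrt(sigma2[t]) * Z[t]
  }
  X[(n0 + 1):(n0 + n)]   # discard the burn-in
}
par(mfrow = c(2, 1))
plot.ts(garch11.sim(400, a0 = 1, a1 = 0.1, b1 = 0.85))
plot.ts(garch11.sim(400, a0 = 1, a1 = 0.9, b1 = 0))   # an ARCH(1) with a1 = 0.9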
4.3.2 Extensions of the GARCH model

One criticism of the GARCH model is that it is blind to the sign of the return X_t. In other words, the conditional variance of X_t only takes into account the magnitude of the past X_{t−j} and does not depend on increases or decreases in S_t (which correspond to X_t being positive or negative). In contrast, it is largely believed that financial markets react differently to negative and positive X_t: the general view is that there is greater volatility/uncertainty/variation in the market when the price goes down. This observation has motivated extensions of the GARCH, such as the EGARCH, which take into account the sign of X_t. Deriving conditions for a stationary solution of such models to exist can be a difficult task, and the reader is referred to Straumann (2005) for the details. Other extensions of the GARCH include autoregressive type models with GARCH innovations.

R code

install.packages("tseries"); library("tseries"). More recently a new package has been developed: library("fGarch").

4.4 Bilinear models

The bilinear model was first proposed in Subba Rao (1977) and Granger and Andersen (1978) (see also Subba Rao (1981)). The general bilinear BL(p,q,r,s) model is defined as

X_t − ∑_{j=1}^{p} φ_j X_{t−j} = ε_t + ∑_{i=1}^{q} θ_i ε_{t−i} + ∑_{k=1}^{r} ∑_{k'=1}^{s} b_{k,k'} X_{t−k} ε_{t−k'},

where {ε_t} are iid random variables with mean zero and variance σ². To motivate the bilinear model let us consider its simplest version, the BL(1,0,1,1):

X_t = φ_1 X_{t−1} + b_{1,1} X_{t−1} ε_{t−1} + ε_t = [φ_1 + b_{1,1} ε_{t−1}] X_{t−1} + ε_t.   (4.14)

Comparing (4.14) with the conditional variance of the GARCH(1,1) in (4.11), we see that they are very similar. The main differences are that (a) the bilinear model does not constrain the coefficients to be positive (whereas the conditional variance requires positive coefficients); (b) the multiplicative innovation ε_{t−1} depends on X_{t−1}, whereas in the GARCH(1,1) Z_{t−1}² and σ_{t−1}² are independent; and (c) the additive term in the GARCH(1,1) recursion is the deterministic constant a_0, whereas in the bilinear model it is the random innovation ε_t. Points (b) and (c) make the analysis of the bilinear model more complicated than that of the GARCH model.

4.4.1 Features of the Bilinear model

In this section we assume that a causal, stationary solution of the bilinear model exists, that the appropriate number of moments exist, and that it is invertible in the sense that there exists a function g such that ε_t = g(X_t, X_{t−1},...). Under the assumption that the bilinear process is invertible we can show that

E[X_t | X_{t−1}, X_{t−2},...] = E[(φ_1 + b_{1,1}ε_{t−1})X_{t−1} | X_{t−1}, X_{t−2},...] + E[ε_t | X_{t−1}, X_{t−2},...] = (φ_1 + b_{1,1}ε_{t−1})X_{t−1};   (4.15)

thus, unlike the autoregressive model, the conditional expectation of X_t given the past is a nonlinear function of the past. It is this nonlinearity that gives rise to the spontaneous peaks that we see in a typical realisation. To see how the bilinear model was motivated, in Figure 4.7 we give plots of realisations of

X_t = φ_1 X_{t−1} + b_{1,1} X_{t−1} ε_{t−1} + ε_t,   (4.16)

where φ_1 = 0.5, b_{1,1} = 0, 0.35, 0.65 and −0.65, and {ε_t} are iid standard normal random variables. Figure 4.7a is a realisation from an AR(1) process (b_{1,1} = 0), and the subsequent plots are for the other values of b_{1,1}. Figure 4.7a is quite regular, whereas the sudden bursts in activity become more pronounced as b_{1,1} grows (see Figures 4.7b and 4.7c). In Figure 4.7d we give a plot of a realisation from a model where b_{1,1} is negative, and we see that the fluctuation has changed direction.

[Figure 4.7: Realisations from different BL(1,0,1,1) models: (a) φ_1 = 0.5, b_{1,1} = 0; (b) φ_1 = 0.5, b_{1,1} = 0.35; (c) φ_1 = 0.5, b_{1,1} = 0.65; (d) φ_1 = 0.5, b_{1,1} = −0.65.]

Remark 4.4.1 (Markov Bilinear model) Some authors define the BL(1,0,1,1) as

Y_t = φ_1 Y_{t−1} + b_{1,1} Y_{t−1} ε_t + ε_t = [φ_1 + b_{1,1} ε_t] Y_{t−1} + ε_t.

The fundamental difference between this model and (4.16) is that the multiplicative innovation (ε_t rather than ε_{t−1}) does not depend on Y_{t−1}. This means that E[Y_t | Y_{t−1}, Y_{t−2},...] = φ_1 Y_{t−1}, and the autocovariance function is the same as that of an AR(1) model with the same AR parameter. Therefore, if the aim is to forecast, the advantage of using this version of the model is unclear, since its forecasts coincide with those of the corresponding AR(1) process X_t = φ_1 X_{t−1} + ε_t. Forecasting with this model does not take into account its nonlinear behaviour.

4.4.2 Solution of the Bilinear model

We observe that (4.16) can be written in the stochastic recurrence relation form given in (4.1) with A_t = φ_1 + b_{1,1}ε_{t−1} and B_t = ε_t. Therefore, by using Theorem 4.0.1, if E[log|φ_1 + b_{1,1}ε_t|] < 0 and E[log|ε_t|] < ∞, then X_t has the strictly stationary, causal solution

X_t = ε_t + ∑_{k=1}^{∞} [ ∏_{j=1}^{k−1} (φ_1 + b_{1,1}ε_{t−j}) ] (φ_1 + b_{1,1}ε_{t−k}) ε_{t−k}.   (4.17)

To show that it is second order stationary we require E[X_t²] < ∞, which imposes additional conditions on the parameters. To derive conditions for E[X_t²] < ∞ we use (4.17) and the Minkowski inequality to give

(E[X_t²])^{1/2} ≤ (E[ε_t²])^{1/2} + ∑_{k=1}^{∞} ( E[ ∏_{j=1}^{k−1} (φ_1 + b_{1,1}ε_{t−j})² ] )^{1/2} ( E[(φ_1 + b_{1,1}ε_{t−k})ε_{t−k}]² )^{1/2}
= (E[ε_t²])^{1/2} + ∑_{k=1}^{∞} ( E[(φ_1 + b_{1,1}ε_t)²] )^{(k−1)/2} ( E[(φ_1 + b_{1,1}ε_{t−k})ε_{t−k}]² )^{1/2}.   (4.18)

Therefore, if E[ε_t^4] < ∞ and E[(φ_1 + b_{1,1}ε_t)²] = φ_1² + b_{1,1}² var(ε_t) < 1, then E[X_t²] < ∞ (note that this equality uses E[ε_t] = 0).

Remark (Inverting the Bilinear model) We note that

ε_t = (−b_{1,1}X_{t−1})ε_{t−1} + [X_t − φ_1X_{t−1}];

thus, by iterating backwards with respect to ε_{t−j}, we have

ε_t = ∑_{j=0}^{∞} (−b_{1,1})^j ( ∏_{i=1}^{j} X_{t−i} ) [X_{t−j} − φ_1 X_{t−j−1}].

This invertible representation is useful both in forecasting and in estimation (see Section 5.5.3).

Exercise 4.3 Simulate from the BL(2,0,1,1) model (using the AR(2) parameters φ_1 = 1.5 and φ_2 = −0.75). Experiment with different parameters to obtain different types of behaviour.

Exercise 4.4 The random coefficient AR model is a nonlinear time series proposed by Barry Quinn (see Nicholls and Quinn (1982) and Aue et al. (2006)). The random coefficient AR(1) model is defined as

X_t = (φ + η_t) X_{t−1} + ε_t,

where {ε_t} and {η_t} are iid random variables which are independent of each other.

(i) State sufficient conditions which ensure that {X_t} has a strictly stationary solution.

(ii) State conditions which ensure that {X_t} is second order stationary.

(iii) Simulate from this model for different φ and var[η_t].

R code

Code to simulate a BL(1,0,1,1) model:

# Bilinear simulation
# BL(1,0,1,1) model; the first n0 observations are used as burn-in
# in order to get close to the stationary solution.
bilinear <- function(n, phi, b, n0 = 400) {
  y <- rnorm(n + n0)
  w <- rnorm(n + n0)
  for (t in 2:(n + n0)) {
    y[t] <- phi * y[t-1] + b * w[t-1] * y[t-1] + w[t]
  }
  return(y[(n0+1):(n0+n)])
}

4.5 Nonparametric time series models

Many researchers argue that fitting parametric models can lead to misspecification, and that it may be more realistic to fit nonparametric or semi-parametric time series models instead. There exist several nonparametric and semi-parametric time series models (see Fan and Yao (2003) and Douc et al. (2014) for comprehensive summaries); we give a few examples below. The most general nonparametric model is

X_t = m(X_{t−1},...,X_{t−p}, ε_t),

but this is so general that it loses all meaning, especially if the aim is to predict. A slight restriction is to make the innovation term additive (see Jones (1978)):

X_t = m(X_{t−1},...,X_{t−p}) + ε_t;

it is clear that for this model E[X_t | X_{t−1},...,X_{t−p}] = m(X_{t−1},...,X_{t−p}). However, this model has the distinct disadvantage that, without placing any structure on m(·), nonparametric estimators of m(·) are poor for p > 2 (they suffer from the curse of dimensionality). Thus such a generalisation renders the model useless. Instead, semi-parametric approaches have been developed. Examples include the functional AR(p) model, defined as

X_t = ∑_{j=1}^{p} φ_j(X_{t−p}) X_{t−j} + ε_t,

the semi-parametric AR(1) model

X_t = φX_{t−1} + γ(X_{t−1}) + ε_t,

and the nonparametric ARCH(p)

X_t = σ_t Z_t,   σ_t² = a_0 + ∑_{j=1}^{p} a_j(X_{t−j}²).

For all of these models it is not easy to establish conditions under which a stationary solution exists. More often than not, if conditions are established they are similar in spirit to those used in the parametric setting. For some details of the proofs see Vogt (2013), who considers nonparametric and nonstationary models (note the nonstationarity he considers is where the covariance structure changes over time, not the unit root type). For example, in the case of the semi-parametric AR(1) model, a stationary causal solution exists if |φ + γ′(0)| < 1.
Chapter 5

Prediction

Prerequisites

The best linear predictor.

Some idea of what a basis of a vector space is.

Objectives

Understand that prediction using a long past can be difficult because a large matrix has to be inverted; thus alternative, recursive methods are often used to avoid direct inversion.

Understand the derivation of the Levinson-Durbin algorithm, and why the coefficient φ_{t,t} corresponds to the partial correlation between X_1 and X_{t+1}.

Understand how these predictive schemes can be used to write the space sp̄(X_t, X_{t−1},...,X_1) in terms of the orthogonal basis sp̄(X_t − P_{X_{t−1},...,X_1}(X_t),...,X_1).

Understand how the above leads to the Wold decomposition of a second order stationary time series.

Understand how prediction for an ARMA time series can be approximated by a scheme which explicitly uses the ARMA structure, and that this approximation improves geometrically as the length of the observed past grows.

One motivation behind fitting models to a time series is to forecast future unobserved observations, which would not be possible without a model. In this chapter we consider forecasting, based on the assumption that the model and/or autocovariance structure is known.
5.1 Forecasting given the present and infinite past

In this section we will assume that the linear time series {X_t} is both causal and invertible, that is,

X_t = ∑_{j=0}^{∞} a_j ε_{t−j} = ∑_{i=1}^{∞} b_i X_{t−i} + ε_t,   (5.1)

where {ε_t} are iid random variables (recall Definition 2.2.2). Both these representations play an important role in prediction. Furthermore, in order to predict X_{t+k} given X_t, X_{t−1},..., we will assume that the infinite past is observed. In later sections we consider the more realistic situation where only the finite past is observed.

We note that since X_t, X_{t−1}, X_{t−2},... are observed, we can obtain ε_τ (for τ ≤ t) by using the invertibility condition

ε_τ = X_τ − ∑_{i=1}^{∞} b_i X_{τ−i}.

Now we consider the prediction of X_{t+k} given {X_τ; τ ≤ t}. Using the MA(∞) representation (since the time series is causal) of X_{t+k} we have

X_{t+k} = ∑_{j=0}^{∞} a_{j+k} ε_{t−j}  (innovations that are observed)  +  ∑_{j=0}^{k−1} a_j ε_{t+k−j}  (future innovations, impossible to predict),

since E[∑_{j=0}^{k−1} a_j ε_{t+k−j} | X_t, X_{t−1},...] = E[∑_{j=0}^{k−1} a_j ε_{t+k−j}] = 0. Therefore the best linear predictor of X_{t+k} given X_t, X_{t−1},..., which we denote by X_t(k), is

X_t(k) = ∑_{j=0}^{∞} a_{j+k} ε_{t−j} = ∑_{j=0}^{∞} a_{j+k} ( X_{t−j} − ∑_{i=1}^{∞} b_i X_{t−j−i} ).   (5.2)

X_t(k) is called the k-step ahead predictor, and it is straightforward to see that its mean squared error is

E[X_{t+k} − X_t(k)]² = E[ ∑_{j=0}^{k−1} a_j ε_{t+k−j} ]² = var[ε_t] ∑_{j=0}^{k−1} a_j²,   (5.3)

where the last line is due to the uncorrelatedness and zero mean of the innovations.

Often we would like to obtain the k-step ahead predictor for k = 1,...,n, where n is some time in the future. We now explain how X_t(k) can be evaluated recursively using the invertibility assumption.

Step 1: Use invertibility in (5.1) to give

X_t(1) = ∑_{i=1}^{∞} b_i X_{t+1−i},

and E[X_{t+1} − X_t(1)]² = var[ε_t].

Step 2: To obtain the 2-step ahead predictor we note that

X_{t+2} = ∑_{i=2}^{∞} b_i X_{t+2−i} + b_1 X_{t+1} + ε_{t+2} = ∑_{i=2}^{∞} b_i X_{t+2−i} + b_1 [X_t(1) + ε_{t+1}] + ε_{t+2};

thus it is clear that

X_t(2) = ∑_{i=2}^{∞} b_i X_{t+2−i} + b_1 X_t(1),

and E[X_{t+2} − X_t(2)]² = var[ε_t](b_1² + 1) = var[ε_t](1 + a_1²).

Step 3: To obtain the 3-step ahead predictor we note that

X_{t+3} = ∑_{i=3}^{∞} b_i X_{t+3−i} + b_2 X_{t+1} + b_1 X_{t+2} + ε_{t+3} = ∑_{i=3}^{∞} b_i X_{t+3−i} + b_2 (X_t(1) + ε_{t+1}) + b_1 (X_t(2) + b_1 ε_{t+1} + ε_{t+2}) + ε_{t+3}.

Thus

X_t(3) = ∑_{i=3}^{∞} b_i X_{t+3−i} + b_2 X_t(1) + b_1 X_t(2),

and E[X_{t+3} − X_t(3)]² = var[ε_t][(b_2 + b_1²)² + b_1² + 1] = var[ε_t](1 + a_1² + a_2²).

Step k: Using the arguments above it is easily seen that

X_t(k) = ∑_{i=k}^{∞} b_i X_{t+k−i} + ∑_{i=1}^{k−1} b_i X_t(k−i).

Thus the k-step ahead predictor can be evaluated recursively.

We note that the predictor given above is based on the assumption that the infinite past is observed. In practice this is not a realistic assumption. However, in the special case that the time series is an autoregressive process of order p (with AR parameters {φ_j}_{j=1}^{p}) and X_t,...,X_{t−m+1} is observed with m ≥ p, then the above scheme can be used for forecasting. More precisely,

X_t(1) = ∑_{j=1}^{p} φ_j X_{t+1−j},
X_t(k) = ∑_{j=k}^{p} φ_j X_{t+k−j} + ∑_{j=1}^{k−1} φ_j X_t(k−j)   for 2 ≤ k ≤ p,
X_t(k) = ∑_{j=1}^{p} φ_j X_t(k−j)   for k > p.   (5.4)

However, in the general case more sophisticated algorithms are required when only the finite past is known.

Example: Forecasting yearly temperatures

We now fit an autoregressive model to the yearly global mean temperature series and use this model to forecast the final five years, which are withheld from the fit. In Figure 5.1 we give a plot of the temperature time series together with its ACF. There is clearly some trend in the temperature data, therefore we take second differences; a plot of the second differences and their ACF is given in Figure 5.2.

[Figure 5.1: Yearly temperature series and its ACF.]

[Figure 5.2: Second differences of the yearly temperature series and their ACF.]

We now use the command ar.yw(res, order.max=10) (we will discuss in Chapter 7 how this function estimates the AR parameters) to estimate the AR parameters. The function ar.yw uses the AIC to select the order of the AR model. When fitting the first 127 second differences, the AIC chooses an AR(7) model,

X_t = ∑_{j=1}^{7} φ̂_j X_{t−j} + ε_t.
An ACF plot of the estimated residuals {ε̂_t} after fitting this model is given in Figure 5.3. We observe that the ACF of the residuals appears to show no correlation, which suggests that the AR(7) model fits the data well (there is a formal test for this, called the Ljung-Box test, which we cover later).

[Figure 5.3: An ACF plot of the estimated residuals {ε̂_t}.]

Using the recursion (5.4) with p = 7 gives the sequence of forecasts

X̂_127(1) = ∑_{j=1}^{7} φ̂_j X_{128−j},
X̂_127(k) = ∑_{j=1}^{k−1} φ̂_j X̂_127(k−j) + ∑_{j=k}^{7} φ̂_j X_{127+k−j},   k = 2,...,5.

We can use X̂_127(1),...,X̂_127(5) as forecasts of X_128,...,X_132 (we recall these are the second differences), which we then use to construct forecasts of the temperatures. A plot of the second difference forecasts together with the true values is given in Figure 5.4.

[Figure 5.4: Forecasts of second differences, plotted against the true values.]

We note that (5.3) can be used to give the mean squared error of the forecasts; for example,

E[X_128 − X̂_127(1)]² = σ²   and   E[X_129 − X̂_127(2)]² = (1 + φ̂_1²)σ²,

where σ² = var[ε_t]. If we believe the residuals are Gaussian, we can use the mean squared error to construct confidence intervals for the predictions. On the other hand, if the residuals are not Gaussian, we can construct 95% confidence intervals for the forecasts using the bootstrap. Specifically, we rewrite the AR(7) process as the MA(∞) process

X_t = ∑_{j=0}^{∞} ψ_j(φ̂) ε_{t−j}.

Hence the best linear predictor can be rewritten as

X_t(k) = ∑_{j=k}^{∞} ψ_j(φ̂) ε_{t+k−j},

thus giving the prediction error

X_{t+k} − X_t(k) = ∑_{j=0}^{k−1} ψ_j(φ̂) ε_{t+k−j}.

We have the prediction estimates; therefore all we need is the distribution of ∑_{j=0}^{k−1} ψ_j(φ̂) ε_{t+k−j}. This can be done by estimating the residuals and then using the bootstrap: resampling from the empirical distribution of the estimated residuals to approximate the distribution of ∑_{j=0}^{k−1} ψ_j(φ̂) ε_{t+k−j}. From this we can construct the 95% confidence intervals for the forecasts.

A small criticism of our approach is that we have fitted a rather large AR(7) model to a time series of length 127. It may be more appropriate to fit an ARMA model to this time series.

Exercise 5.1 In this exercise we analyze the sunspot data found on the course website. In the data analysis below only use the first section of the data (the remaining data we will use for prediction). In this section you will need to use the function ar.yw in R.

(i) Fit the following models to the data and study the residuals (using the ACF):

X_t = μ + A cos(ωt) + B sin(ωt) + ε_t   or   X_t = μ + ε_t,

where in both cases {ε_t} is modelled as an autoregressive process. Using this, decide which model is more appropriate (taking into account the number of parameters estimated overall).

(ii) Use these models to forecast the remaining sunspot numbers.

R code for the temperature example:

diff = global.mean[c(2:134)] - global.mean[c(1:133)]
diff2 = diff[c(2:133)] - diff[c(1:132)]
res = diff2[c(1:127)]
fitar7 <- ar.yw(res, order.max=10)
residualsar7 <- fitar7$resid
residuals <- residualsar7[-c(1:7)]

# Forecast using the fitted AR(7) model; phi holds the estimated coefficients,
# and the loop implements the recursion (5.4).
phi = fitar7$ar
res = c(res, rep(0,5))
for (t in 128:132) {
  res[t] = sum(phi * res[(t-1):(t-7)])
}
5.2 Review of vector spaces

In the next few sections we will consider prediction/forecasting for stationary time series; in particular, finding the best linear predictor of $X_{t+1}$ given the finite past $X_1,\ldots,X_t$. Setting up notation, our aim is to find

\[
X_{t+1|t} = P_{X_1,\ldots,X_t}(X_{t+1}) = \sum_{j=1}^{t}\phi_{t,j}X_{t+1-j},
\]

where $\{\phi_{t,j}\}$ are chosen to minimise the mean squared error $\min_{\phi_t}\mathrm{E}\big(X_{t+1} - \sum_{j=1}^{t}\phi_{t,j}X_{t+1-j}\big)^2$. Basic results from multiple regression show that

\[
(\phi_{t,1},\ldots,\phi_{t,t})' = \Sigma_t^{-1}r_t,
\]

where $(\Sigma_t)_{i,j} = \mathrm{E}(X_iX_j)$ and $(r_t)_i = \mathrm{E}(X_{t+1-i}X_{t+1})$. Given the covariances this can easily be done. However, if $t$ is large a brute force method would require $O(t^3)$ computing operations to calculate (5.7). Our aim is to exploit stationarity to reduce the number of operations. To do this, we will briefly discuss the notion of projections onto a space, which helps in our derivation of computationally more efficient methods. Before we continue we first briefly discuss the ideas of a vector space, inner product spaces, Hilbert spaces, spans and bases. A more complete review is given in Brockwell and Davis (1998), Chapter 2.

First a brief definition of a vector space: $\mathcal{X}$ is called a vector space if for every $x,y \in \mathcal{X}$ and $a,b \in \mathbb{R}$ (this can be generalised to $\mathbb{C}$), $ax + by \in \mathcal{X}$. An inner product space is a vector space which comes with an inner product; in other words, for every pair of elements $x,y \in \mathcal{X}$ we can define an inner product $\langle x,y\rangle$, where $\langle\cdot,\cdot\rangle$ satisfies all the conditions of an inner product. Thus for every element $x \in \mathcal{X}$ we can define its norm as $\|x\| = \sqrt{\langle x,x\rangle}$. If the inner product space is complete
(meaning the limit of every Cauchy sequence in the space is also in the space) then the inner product space is a Hilbert space.

Example 5.2.1

(i) The classical example of a Hilbert space is the Euclidean space $\mathbb{R}^n$, where the inner product between two elements is simply the scalar product, $\langle x,y\rangle = \sum_{i=1}^{n}x_iy_i$.

(ii) The subset of random variables on the probability space $(\Omega,\mathcal{F},P)$ which have a finite second moment, i.e. $\mathrm{E}(X^2) = \int_\Omega X(\omega)^2\,dP(\omega) < \infty$. This space is denoted as $L^2(\Omega,\mathcal{F},P)$. In this case, the inner product is $\langle X,Y\rangle = \mathrm{E}(XY)$.

(iii) The function space $L^2[\mathbb{R},\mu]$, where $f \in L^2[\mathbb{R},\mu]$ if $f$ is $\mu$-measurable and $\int_{\mathbb{R}}|f(x)|^2\,d\mu(x) < \infty$, is a Hilbert space. For this space, the inner product is defined as $\langle f,g\rangle = \int_{\mathbb{R}}f(x)g(x)\,d\mu(x)$. In this chapter we will not use this function space, but it will be used in Chapter ?? (when we prove the spectral representation theorem).

It is straightforward to generalise the above to complex random variables and functions defined on $\mathbb{C}$. We simply need to remember to take conjugates when defining the inner product, i.e. $\langle X,Y\rangle = \mathrm{E}(X\bar{Y})$ and $\langle f,g\rangle = \int_{\mathbb{C}}f(z)\overline{g(z)}\,d\mu(z)$. In this chapter our focus will be on certain spaces of random variables which have a finite variance.

Basis. The random variables $\{X_t,X_{t-1},\ldots,X_1\}$ span the space $X_1^t$ (denoted as $\mathrm{sp}(X_t,X_{t-1},\ldots,X_1)$) if for every $Y \in X_1^t$ there exist coefficients $\{a_j \in \mathbb{R}\}$ such that

\[
Y = \sum_{j=1}^{t}a_jX_{t+1-j}. \tag{5.5}
\]

Moreover, $\mathrm{sp}(X_t,X_{t-1},\ldots,X_1) = X_1^t$ if for every $\{a_j \in \mathbb{R}\}$, $\sum_{j=1}^{t}a_jX_{t+1-j} \in X_1^t$. We now define the basis of a vector space, which is closely related to the span. The random variables $\{X_t,\ldots,X_1\}$ form a basis of the space $X_1^t$ if for every $Y \in X_1^t$ we have a representation (5.5) and
this representation is unique; more precisely, there does not exist another set of coefficients $\{b_j\}$ such that $Y = \sum_{j=1}^{t}b_jX_{t+1-j}$. For this reason, one can consider a basis as a minimal span, that is, the smallest set of elements which spans the space.

Definition 5.2.1 (Projections) The projection of the random variable $Y$ onto the space $\mathrm{sp}(X_t,X_{t-1},\ldots,X_1)$ (often denoted $P_{X_t,X_{t-1},\ldots,X_1}(Y)$) is defined as $P_{X_t,\ldots,X_1}(Y) = \sum_{j=1}^{t}c_jX_{t+1-j}$, where $\{c_j\}$ is chosen such that the difference $Y - P_{X_t,\ldots,X_1}(Y)$ is uncorrelated with (orthogonal/perpendicular to) every element in $\mathrm{sp}(X_t,\ldots,X_1)$. In other words, $P_{X_t,\ldots,X_1}(Y)$ is the best linear predictor of $Y$ given $X_t,\ldots,X_1$.

Orthogonal basis. An orthogonal basis is a basis in which every element is orthogonal to every other element. It is straightforward to orthogonalise any given basis using the method of projections. To simplify notation let $X_{t|t-1} = P_{X_{t-1},\ldots,X_1}(X_t)$. By definition, $X_t - X_{t|t-1}$ is orthogonal to the space $\mathrm{sp}(X_{t-1},\ldots,X_1)$; in other words, $X_t - X_{t|t-1}$ and $X_s$ ($1 \le s \le t-1$) are orthogonal ($\mathrm{cov}(X_s, X_t - X_{t|t-1}) = 0$), and by a similar argument $X_t - X_{t|t-1}$ and $X_s - X_{s|s-1}$ are orthogonal. Thus by using projections we have created an orthogonal basis $X_1, (X_2-X_{2|1}),\ldots,(X_t-X_{t|t-1})$ of the space $\mathrm{sp}(X_1,(X_2-X_{2|1}),\ldots,(X_t-X_{t|t-1}))$. By construction it is clear that this space is a subspace of $\mathrm{sp}(X_t,\ldots,X_1)$. We now show that the two spaces are equal. To do this we define the sum of spaces: if $U$ and $V$ are two orthogonal vector spaces (which share the same inner product), then $y \in U \oplus V$ if there exist $u \in U$ and $v \in V$ such that $y = u+v$. By the definition of $X_1^t$, it is clear that $(X_t - X_{t|t-1}) \in X_1^t$, but $(X_t - X_{t|t-1}) \notin X_1^{t-1}$. Hence $X_1^t = \mathrm{sp}(X_t - X_{t|t-1}) \oplus X_1^{t-1}$. Continuing this argument we see that $X_1^t = \mathrm{sp}(X_t-X_{t|t-1}) \oplus \mathrm{sp}(X_{t-1}-X_{t-1|t-2}) \oplus \cdots \oplus \mathrm{sp}(X_1)$. Hence $\mathrm{sp}(X_t,\ldots,X_1) = \mathrm{sp}(X_t-X_{t|t-1},\ldots,X_2-X_{2|1},X_1)$. Therefore for every $P_{X_t,\ldots,X_1}(Y) = \sum_{j=1}^{t}a_jX_{t+1-j}$, there exist coefficients $\{b_j\}$ such that

\[
P_{X_t,\ldots,X_1}(Y) = P_{X_t-X_{t|t-1},\ldots,X_2-X_{2|1},X_1}(Y) = \sum_{j=1}^{t-1}b_j\big(X_{t+1-j} - X_{t+1-j|t-j}\big) + b_tX_1,
\]

where $b_j = \mathrm{E}\big[Y(X_{t+1-j}-X_{t+1-j|t-j})\big]/\mathrm{E}\big(X_{t+1-j}-X_{t+1-j|t-j}\big)^2$. A useful consequence of an orthogonal basis is the ease with which the coefficients $b_j$ are obtained, avoiding the inversion of a matrix. This is the underlying idea behind the innovations algorithm proposed in Brockwell and Davis (1998), Chapter 5.
5.2.1 Spaces spanned by an infinite number of elements

The notions above can be generalised to spaces which have an infinite number of elements in their basis (and are useful for proving Wold's decomposition theorem). We now construct the space spanned by the infinite collection of random variables $\{X_t,X_{t-1},\ldots\}$. As with anything involving $\infty$, we need to define precisely what we mean by an infinite basis. To do this we construct a sequence of subspaces, each defined by a finite number of elements in the basis; we increase the number of elements in the subspace and consider the limit of this space. Let $X_{-n}^t = \mathrm{sp}(X_t,\ldots,X_{-n})$; clearly if $m > n$, then $X_{-n}^t \subset X_{-m}^t$. We define $X_{-\infty}^t = \bigcup_{n=1}^{\infty}X_{-n}^t$; in other words, if $Y \in X_{-\infty}^t$, then there exists an $n$ such that $Y \in X_{-n}^t$. However, we also need to ensure that the limits of all sequences lie in this infinite dimensional space; therefore we close the space by defining a new space which includes the old space and also all its limits. To make this precise, suppose the sequence of random variables is such that $Y_s \in X_{-\infty}^t$ and $\mathrm{E}(Y_{s_1}-Y_{s_2})^2 \to 0$ as $s_1,s_2 \to \infty$. Since $\{Y_s\}$ is a Cauchy sequence there exists a limit; more precisely, there exists a random variable $Y$ such that $\mathrm{E}(Y_s - Y)^2 \to 0$ as $s \to \infty$. Since the closure of the space, $\overline{X_{-\infty}^t}$, contains the set $X_{-\infty}^t$ and all the limits of Cauchy sequences in this set, $Y \in \overline{X_{-\infty}^t}$. We let

\[
\overline{\mathrm{sp}}(X_t,X_{t-1},\ldots) = \overline{X_{-\infty}^t}. \tag{5.6}
\]

The orthogonal basis of $\overline{\mathrm{sp}}(X_t,X_{t-1},\ldots)$. An orthogonal basis of $\overline{\mathrm{sp}}(X_t,X_{t-1},\ldots)$ can be constructed using the same method used to orthogonalise $\mathrm{sp}(X_t,\ldots,X_1)$. The main difference is how to deal with the initial value, which in the case of $\mathrm{sp}(X_t,\ldots,X_1)$ is $X_1$. The analogous version of the initial value in the infinite dimensional space $\overline{\mathrm{sp}}(X_t,X_{t-1},\ldots)$ would be an infinitely remote starting variable, but this is not a well defined quantity (again we have to be careful with these pesky infinities). Let $X_{t-1}(1)$ denote the best linear predictor of $X_t$ given $X_{t-1},X_{t-2},\ldots$. As in Section 5.2 it is clear that $(X_t - X_{t-1}(1))$ and $X_s$ for $s \le t-1$ are uncorrelated and

\[
\overline{\mathrm{sp}}(X_t,X_{t-1},\ldots) = \mathrm{sp}(X_t - X_{t-1}(1)) \oplus \overline{\mathrm{sp}}(X_{t-1},X_{t-2},\ldots).
\]

Thus we can construct the orthogonal basis $(X_t - X_{t-1}(1)), (X_{t-1}-X_{t-2}(1)),\ldots$ and the corresponding space $\overline{\mathrm{sp}}((X_t-X_{t-1}(1)),(X_{t-1}-X_{t-2}(1)),\ldots)$. It is clear that $\overline{\mathrm{sp}}((X_t-X_{t-1}(1)),(X_{t-1}-X_{t-2}(1)),\ldots) \subset \overline{\mathrm{sp}}(X_t,X_{t-1},\ldots)$. However, unlike the finite dimensional case it is not clear that they are equal; roughly speaking this is because the first space lacks an initial value.
Of course, "the infinitely remote past" is not a well defined quantity. Instead, the way we overcome this issue is to define the initial starting random variable via the intersection of subspaces; more precisely, let $X_{-\infty} = \bigcap_{n}\overline{\mathrm{sp}}(X_n,X_{n-1},\ldots)$. Furthermore, we note that since $X_n - X_{n-1}(1)$ and $X_s$ (for any $s \le n-1$) are orthogonal, the spaces $\overline{\mathrm{sp}}((X_t-X_{t-1}(1)),(X_{t-1}-X_{t-2}(1)),\ldots)$ and $X_{-\infty}$ are orthogonal. Using $X_{-\infty}$, we have

\[
\bigoplus_{j=0}^{\infty}\mathrm{sp}\big(X_{t-j} - X_{t-j-1}(1)\big) \oplus X_{-\infty} = \overline{\mathrm{sp}}(X_t,X_{t-1},\ldots).
\]

We will use this result when we prove the Wold decomposition theorem (in Section 5.7).

5.3 Levinson-Durbin algorithm

We recall that in prediction the aim is to predict $X_{t+1}$ given $X_t,X_{t-1},\ldots,X_1$. The best linear predictor is

\[
X_{t+1|t} = P_{X_1,\ldots,X_t}(X_{t+1}) = \sum_{j=1}^{t}\phi_{t,j}X_{t+1-j}, \tag{5.7}
\]

where $\{\phi_{t,j}\}$ are chosen to minimise the mean squared error and are the solution of the equation

\[
(\phi_{t,1},\ldots,\phi_{t,t})' = \Sigma_t^{-1}r_t, \tag{5.8}
\]

where $(\Sigma_t)_{i,j} = \mathrm{E}(X_iX_j)$ and $(r_t)_i = \mathrm{E}(X_{t+1-i}X_{t+1})$. Using standard methods, such as Gauss-Jordan elimination, to solve this system of equations requires $O(t^3)$ operations. However, we recall that $\{X_t\}$ is a stationary time series, thus $\Sigma_t$ is a Toeplitz matrix; by using this information, in the 1940s Norman Levinson proposed an algorithm which reduced the number of operations to $O(t^2)$. In the 1960s, Jim Durbin adapted the algorithm to time series and improved it. We first outline the algorithm. We recall that the best linear predictor of $X_{t+1}$ given $X_t,\ldots,X_1$ is

\[
X_{t+1|t} = \sum_{j=1}^{t}\phi_{t,j}X_{t+1-j}. \tag{5.9}
\]

The mean squared error is $r(t+1) = \mathrm{E}[X_{t+1} - X_{t+1|t}]^2$. Given the second order stationary covariance structure, the idea of the Levinson-Durbin algorithm is to recursively compute $\{\phi_{t,j};\, j=1,\ldots,t\}$ from $\{\phi_{t-1,j};\, j=1,\ldots,t-1\}$ (which are the coefficients of the best linear predictor of $X_t$
131 given X t,..., X ). Let us suppose that the autocovariance function c(k) = cov[x 0, X k ] is known. The Levinson-Durbin algorithm is calculated using the following recursion. Step φ, = c()/c(0) and r(2) = E[X 2 X 2 ] 2 = E[X 2 φ, X ] 2 = 2c(0) 2φ, c(). Step 2 For j = t φ t,t = c(t) t φ t,jc(t j) r(t) φ t,j = φ t,j φ t,t φ t,t j j t, and r(t + ) = r(t)( φ 2 t,t). We give two proofs of the above recursion. Exercise 5.2 (i) Suppose X t = φx t +ε t (where φ < ). Use the Levinson-Durbin algorithm, to deduce an expression for φ t,j for ( j t). (ii) Suppose X t = φε t + ε t (where φ < ). Use the Levinson-Durbin algorithm (and possibly Maple/Matlab), deduce an expression for φ t,j for ( j t). (recall from Exercise 3.4 that you already have an analytic expression for φ t,t ) A proof based on projections Let us suppose {X t } is a zero mean stationary time series and c(k) = E(X k X 0 ). Let P Xt,...,X2 (X ) denote the best linear predictor of X given X t,..., X 2 and P Xt,...,X2 (X t+ ) denote the best linear predictor of X t+ given X t,..., X 2. Stationarity means that the following predictors share the same coefficients X t t = P Xt,...,X 2 (X ) = t t φ t,j X t j P Xt,...,X2 (X t+ ) = φ t,j X t+ j (5.0) t φ t,j X j+. The last line is because stationarity means that flipping a time series round has the same correlation structure. These three relations are an important component of the proof. Recall our objective is to derive the coefficients of the best linear predictor of P Xt,...,X (X t+ ) based on the coefficients of the best linear predictor P Xt,...,X (X t ). To do this we partition the 30
132 space sp(x t,..., X 2, X ) into two orthogonal spaces sp(x t,..., X 2, X ) = sp(x t,..., X 2, X ) sp(x P Xt,...,X 2 (X )). Therefore by uncorrelatedness we have the partition X t+ t = P Xt,...,X 2 (X t+ ) + P X P Xt,...,X 2 (X )(X t+ ) = = t φ t,j X t+ j } {{ } by (5.0) + φ tt (X P Xt,...,X 2 (X )) } {{ } by projection onto one variable t t φ t,j X t+ j + φ t,t X φ t,j X j+. (5.) } {{ } by (5.0) We start by evaluating an expression for φ t,t (which in turn will give the expression for the other coefficients). It is straightforward to see that φ t,t = E(X t+(x P Xt,...,X 2 (X ))) E(X P Xt,...,X 2 (X )) 2 (5.2) = E[(X t+ P Xt,...,X 2 (X t+ ) + P Xt,...,X 2 (X t+ ))(X P Xt,...,X 2 (X ))] E(X P Xt,...,X 2 (X )) 2 = E[(X t+ P Xt,...,X 2 (X t+ ))(X P Xt,...,X 2 (X ))] E(X P Xt,...,X 2 (X )) 2 Therefore we see that the numerator of φ t,t is the partial covariance between X t+ and X (see Section 3.2.2), furthermore the denominator of φ t,t is the mean squared prediction error, since by stationarity E(X P Xt,...,X 2 (X )) 2 = E(X t P Xt,...,X (X t )) 2 = r(t) (5.3) Returning to (5.2), expanding out the expectation in the numerator and using (5.3) we have φ t,t = E(X t+(x P Xt,...,X 2 (X ))) r(t) = c(0) E[X t+p Xt,...,X 2 (X ))] r(t) = c(0) t φ t,jc(t j), r(t) (5.4) which immediately gives us the first equation in Step 2 of the Levinson-Durbin algorithm. To 3
obtain the recursion for $\phi_{t,j}$ we use (5.11) to give

\[
X_{t+1|t} = \sum_{j=1}^{t}\phi_{t,j}X_{t+1-j}
= \sum_{j=1}^{t-1}\phi_{t-1,j}X_{t+1-j} + \phi_{t,t}\Big(X_1 - \sum_{j=1}^{t-1}\phi_{t-1,j}X_{j+1}\Big).
\]

Comparing coefficients gives

\[
\phi_{t,j} = \phi_{t-1,j} - \phi_{t,t}\phi_{t-1,t-j}, \qquad 1 \le j \le t-1.
\]

This gives the middle equation in Step 2. To obtain the recursion for the mean squared prediction error we note that, by orthogonality of $\{X_t,\ldots,X_2\}$ and $X_1 - P_{X_t,\ldots,X_2}(X_1)$, using (5.11) we have

\begin{align*}
r(t+1) &= \mathrm{E}(X_{t+1}-X_{t+1|t})^2 = \mathrm{E}\big[X_{t+1} - P_{X_t,\ldots,X_2}(X_{t+1}) - \phi_{t,t}\big(X_1 - P_{X_t,\ldots,X_2}(X_1)\big)\big]^2\\
&= \mathrm{E}\big[X_{t+1}-P_{X_t,\ldots,X_2}(X_{t+1})\big]^2 + \phi_{t,t}^2\mathrm{E}\big[X_1 - P_{X_t,\ldots,X_2}(X_1)\big]^2\\
&\qquad - 2\phi_{t,t}\mathrm{E}\big[(X_{t+1}-P_{X_t,\ldots,X_2}(X_{t+1}))(X_1 - P_{X_t,\ldots,X_2}(X_1))\big]\\
&= r(t) + \phi_{t,t}^2r(t) - 2\phi_{t,t}\underbrace{\mathrm{E}\big[X_{t+1}(X_1 - P_{X_t,\ldots,X_2}(X_1))\big]}_{=r(t)\phi_{t,t}\text{ by }(5.14)} = r(t)\big[1 - \phi_{t,t}^2\big].
\end{align*}

This gives the final equation in Step 2 of the Levinson-Durbin algorithm. Further references: Brockwell and Davis (1998), Chapter 5, and Fuller (1995).

5.3.2 A proof based on symmetric Toeplitz matrices

We now give an alternative proof, which is based on properties of the (symmetric) Toeplitz matrix. We use (5.8), which is the matrix equation

\[
\Sigma_t\begin{pmatrix}\phi_{t,1}\\ \vdots\\ \phi_{t,t}\end{pmatrix} = r_t, \tag{5.15}
\]
with

\[
\Sigma_t = \begin{pmatrix}
c(0) & c(1) & c(2) & \ldots & c(t-1)\\
c(1) & c(0) & c(1) & \ldots & c(t-2)\\
\vdots & \vdots & & \ddots & \vdots\\
c(t-1) & c(t-2) & \ldots & & c(0)
\end{pmatrix}
\quad\text{and}\quad
r_t = \begin{pmatrix}c(1)\\ c(2)\\ \vdots\\ c(t)\end{pmatrix}.
\]

The proof is based on embedding $r_{t-1}$ and $\Sigma_{t-1}$ into $\Sigma_t$ and using that $\Sigma_{t-1}\phi_{t-1} = r_{t-1}$. To do this, we define the $(t-1)\times(t-1)$ matrix $E_{t-1}$, which reverses the order of the elements in a vector:

\[
E_{t-1} = \begin{pmatrix}
0 & 0 & \ldots & 0 & 1\\
0 & 0 & \ldots & 1 & 0\\
\vdots & \vdots & & \vdots & \vdots\\
1 & 0 & \ldots & 0 & 0
\end{pmatrix}
\]

(recall we came across this reversal matrix in Section 3.2.2). Using the above notation, we have the interesting block matrix structure

\[
\Sigma_t = \begin{pmatrix}\Sigma_{t-1} & E_{t-1}r_{t-1}\\ r_{t-1}'E_{t-1} & c(0)\end{pmatrix}
\quad\text{and}\quad
r_t = (r_{t-1}',\, c(t))'.
\]

Returning to the matrix equation in (5.15) and substituting the above into (5.15) we have

\[
\Sigma_t\phi_t = r_t
\;\Longrightarrow\;
\begin{pmatrix}\Sigma_{t-1} & E_{t-1}r_{t-1}\\ r_{t-1}'E_{t-1} & c(0)\end{pmatrix}
\begin{pmatrix}\underline{\phi}_{t}\\ \phi_{t,t}\end{pmatrix}
= \begin{pmatrix}r_{t-1}\\ c(t)\end{pmatrix},
\]

where $\underline{\phi}_{t} = (\phi_{t,1},\ldots,\phi_{t,t-1})'$. This leads to the two equations

\[
\Sigma_{t-1}\underline{\phi}_{t} + E_{t-1}r_{t-1}\phi_{t,t} = r_{t-1}, \tag{5.16}
\]
\[
r_{t-1}'E_{t-1}\underline{\phi}_{t} + c(0)\phi_{t,t} = c(t). \tag{5.17}
\]

We first show that equation (5.16) corresponds to the second equation in the Levinson-Durbin
algorithm. Multiplying (5.16) by $\Sigma_{t-1}^{-1}$ and rearranging the equation we have

\[
\underline{\phi}_{t} = \underbrace{\Sigma_{t-1}^{-1}r_{t-1}}_{=\phi_{t-1}} - \underbrace{\Sigma_{t-1}^{-1}E_{t-1}r_{t-1}}_{=E_{t-1}\phi_{t-1}}\phi_{t,t}.
\]

Thus we have

\[
\underline{\phi}_{t} = \phi_{t-1} - \phi_{t,t}E_{t-1}\phi_{t-1}. \tag{5.18}
\]

This proves the second equation in Step 2 of the Levinson-Durbin algorithm. We now use (5.17) to obtain an expression for $\phi_{t,t}$, which is the first equation in Step 2. Substituting (5.18) into (5.17) gives

\[
r_{t-1}'E_{t-1}\big(\phi_{t-1} - \phi_{t,t}E_{t-1}\phi_{t-1}\big) + c(0)\phi_{t,t} = c(t). \tag{5.19}
\]

Solving for $\phi_{t,t}$ we have

\[
\phi_{t,t} = \frac{c(t) - r_{t-1}'E_{t-1}\phi_{t-1}}{c(0) - r_{t-1}'\phi_{t-1}}. \tag{5.20}
\]

Noting that $r(t) = c(0) - r_{t-1}'\phi_{t-1}$, (5.20) is the first equation of Step 2 in the Levinson-Durbin algorithm. Note that from this proof it does not appear that we need the (symmetric) Toeplitz matrix to be positive semi-definite.

5.3.3 Using the Levinson-Durbin algorithm to obtain the Cholesky decomposition of the precision matrix

We recall from Section 3.2.1 that sequentially projecting the elements of a random vector on the past elements in the vector gives rise to the Cholesky decomposition of the inverse of the variance/covariance (precision) matrix. This is exactly what is done when we run the Levinson-Durbin
algorithm. In other words,

\[
\mathrm{var}\begin{pmatrix}
X_1/\sqrt{r(1)}\\
(X_2 - \phi_{1,1}X_1)/\sqrt{r(2)}\\
\vdots\\
\big(X_n - \sum_{j=1}^{n-1}\phi_{n-1,j}X_{n-j}\big)/\sqrt{r(n)}
\end{pmatrix} = I_n.
\]

Therefore, if $\Sigma_n = \mathrm{var}[\underline{X}_n]$, where $\underline{X}_n = (X_1,\ldots,X_n)'$, then $\Sigma_n^{-1} = L_n'D_n^{-1}L_n$, where

\[
L_n = \begin{pmatrix}
1 & 0 & 0 & \ldots & 0\\
-\phi_{1,1} & 1 & 0 & \ldots & 0\\
-\phi_{2,2} & -\phi_{2,1} & 1 & \ldots & 0\\
\vdots & & & \ddots & \vdots\\
-\phi_{n-1,n-1} & -\phi_{n-1,n-2} & \ldots & -\phi_{n-1,1} & 1
\end{pmatrix} \tag{5.21}
\]

and $D_n = \mathrm{diag}(r(1),r(2),\ldots,r(n))$.
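Continuing the code sketch from Section 5.3, the output of our hypothetical levinsonDurbin function immediately gives the factorisation (5.21); the function below and its name are ours.

buildPrecision <- function(cv) {
  # cv = c(c(0), ..., c(n-1)): autocovariances of X_1, ..., X_n
  n <- length(cv)
  ld <- levinsonDurbin(cv)               # coefficients up to order n-1 and r(1), ..., r(n)
  L <- diag(n)
  for (t in 1:(n - 1)) {
    # row t+1 of L_n: (-phi_{t,t}, -phi_{t,t-1}, ..., -phi_{t,1}, 1, 0, ..., 0)
    L[t + 1, 1:t] <- -ld$phi[t, t:1]
  }
  t(L) %*% diag(1 / ld$r[1:n]) %*% L     # = Sigma_n^{-1} = L_n' D_n^{-1} L_n
}

As a quick sanity check, for an AR(1)-type covariance sequence the result should agree with direct inversion:

cv <- 0.8^(0:9)
max(abs(buildPrecision(cv) - solve(toeplitz(cv))))   # numerically zero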
5.4 Forecasting for ARMA processes

Given the autocovariance of any stationary process, the Levinson-Durbin algorithm allows us to systematically obtain one-step predictors of second order stationary time series without directly inverting a matrix. In this section we consider forecasting for a special case of stationary processes, the ARMA process. We will assume throughout this section that the parameters of the model are known. We showed in Section 5.1 that if $\{X_t\}$ has an AR(p) representation and $t > p$, then the best linear predictor can easily be obtained using (5.4). Therefore, when $t > p$, there is no real gain in using the Levinson-Durbin algorithm for prediction of AR(p) processes. However, we do show in Chapter ?? that we can apply the Levinson-Durbin algorithm to obtain estimators of the autoregressive parameters. Similarly, if $\{X_t\}$ satisfies an ARMA(p,q) representation, then the prediction scheme can be greatly simplified. Unlike the AR(p) process, which is p-Markovian, $P_{X_t,X_{t-1},\ldots,X_1}(X_{t+1})$ does involve all regressors $X_t,\ldots,X_1$. However, some simplifications can be made in the scheme. To explain how, let us suppose that $X_t$ satisfies the representation

\[
X_t - \sum_{i=1}^{p}\phi_iX_{t-i} = \varepsilon_t + \sum_{i=1}^{q}\theta_i\varepsilon_{t-i},
\]

where $\{\varepsilon_t\}$ are iid zero mean random variables and the roots of $\phi(z)$ and $\theta(z)$ lie outside the unit circle. For the analysis below, let $W_t = X_t$ for $1 \le t \le p$ and for $t > \max(p,q)$ let $W_t = \varepsilon_t + \sum_{i=1}^{q}\theta_i\varepsilon_{t-i}$ (which is the MA(q) part of the process). Since $X_{p+1} = \sum_{j=1}^{p}\phi_jX_{p+1-j} + W_{p+1}$ and so forth, it is clear that $\mathrm{sp}(X_1,\ldots,X_t) = \mathrm{sp}(W_1,\ldots,W_t)$ (i.e. they are linear combinations of each other). We will show for $t > \max(p,q)$ that

\[
X_{t+1|t} = P_{X_t,\ldots,X_1}(X_{t+1}) = \sum_{j=1}^{p}\phi_jX_{t+1-j} + \sum_{i=1}^{q}\theta_{t,i}\big(X_{t+1-i} - X_{t+1-i|t-i}\big) \tag{5.22}
\]

for some $\theta_{t,i}$ which can be evaluated from the autocovariance structure. To prove the result we use the following steps:

\begin{align*}
P_{X_t,\ldots,X_1}(X_{t+1}) &= \sum_{j=1}^{p}\phi_j\underbrace{P_{X_t,\ldots,X_1}(X_{t+1-j})}_{=X_{t+1-j}} + \sum_{i=1}^{q}\theta_iP_{X_t,\ldots,X_1}(\varepsilon_{t+1-i})\\
&= \sum_{j=1}^{p}\phi_jX_{t+1-j} + \sum_{i=1}^{q}\theta_i\underbrace{P_{X_t-X_{t|t-1},\ldots,X_2-X_{2|1},X_1}(\varepsilon_{t+1-i})}_{=P_{W_t-W_{t|t-1},\ldots,W_2-W_{2|1},W_1}(\varepsilon_{t+1-i})}\\
&= \sum_{j=1}^{p}\phi_jX_{t+1-j} + \sum_{i=1}^{q}\theta_i\underbrace{P_{W_{t+1-i}-W_{t+1-i|t-i}}(\varepsilon_{t+1-i})}_{\text{since }\varepsilon_{t+1-i}\text{ is independent of the earlier innovations}}\\
&= \sum_{j=1}^{p}\phi_jX_{t+1-j} + \sum_{i=1}^{q}\theta_{t,i}\underbrace{\big(W_{t+1-i}-W_{t+1-i|t-i}\big)}_{=X_{t+1-i}-X_{t+1-i|t-i}}
= \sum_{j=1}^{p}\phi_jX_{t+1-j} + \sum_{i=1}^{q}\theta_{t,i}\big(X_{t+1-i}-X_{t+1-i|t-i}\big), \tag{5.23}
\end{align*}

which gives the desired result. Thus given the parameters $\{\theta_{t,i}\}$ it is straightforward to construct the predictor $X_{t+1|t}$. It can be shown that $\theta_{t,i} \to \theta_i$ as $t \to \infty$ (see Brockwell and Davis (1998), Chapter 5).
Remark 5.4.1 In terms of notation we can understand the above result for the MA(q) case. In this case, the above result reduces to

\[
X_{t+1|t} = \sum_{i=1}^{q}\theta_{t,i}\big(X_{t+1-i} - X_{t+1-i|t-i}\big).
\]

We now state a few results which will be useful later.

Lemma 5.4.1 Suppose $\{X_t\}$ is a stationary time series with spectral density $f(\omega)$. Let $\underline{X}_t = (X_1,\ldots,X_t)'$ and $\Sigma_t = \mathrm{var}(\underline{X}_t)$.

(i) If the spectral density function is bounded away from zero (there is some $\gamma > 0$ such that $\inf_\omega f(\omega) \ge \gamma$), then for all $t$, $\lambda_{\min}(\Sigma_t) \ge \gamma$ (where $\lambda_{\min}$ and $\lambda_{\max}$ denote the smallest and largest absolute eigenvalues of the matrix).

(ii) Consequently, $\lambda_{\max}(\Sigma_t^{-1}) \le \gamma^{-1}$ (since for symmetric matrices the spectral norm and the largest eigenvalue coincide, $\|\Sigma_t^{-1}\|_{\mathrm{spec}} \le \gamma^{-1}$).

(iii) Analogously, if $\sup_\omega f(\omega) \le M < \infty$, then $\lambda_{\max}(\Sigma_t) \le M$ (hence $\|\Sigma_t\|_{\mathrm{spec}} \le M$).

PROOF. See Chapter ??.

Remark 5.4.2 Suppose $\{X_t\}$ is an ARMA process, where the roots of $\phi(z)$ and $\theta(z)$ have absolute value greater than $1+\delta_1$ and less than $\delta_2$. Then the spectral density $f(\omega)$ is bounded away from zero and infinity, with bounds depending only on $\delta_1$, $\delta_2$, $p$ and $\mathrm{var}(\varepsilon_t)$. Therefore, from Lemma 5.4.1, $\lambda_{\max}(\Sigma_t)$ and $\lambda_{\max}(\Sigma_t^{-1})$ are bounded uniformly over $t$.

The prediction can be simplified if we make a simple approximation (which works well if $t$ is relatively large). For $1 \le t \le \max(p,q)$, set $\hat{X}_{t+1|t} = X_t$, and for $t > \max(p,q)$ we define the recursion

\[
\hat{X}_{t+1|t} = \sum_{j=1}^{p}\phi_jX_{t+1-j} + \sum_{i=1}^{q}\theta_i\big(X_{t+1-i} - \hat{X}_{t+1-i|t-i}\big). \tag{5.24}
\]

This approximation seems reasonable, since in the exact predictor (5.23), $\theta_{t,i} \to \theta_i$.
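The recursion (5.24) is easy to code. Below is a minimal sketch in R (function name and interface ours), assuming the ARMA coefficients phi and theta are known.

armaOneStep <- function(x, phi, theta) {
  # Recursion (5.24): returns hat{X}_{t+1|t} for t = 1, ..., n
  n <- length(x); p <- length(phi); q <- length(theta)
  m <- max(p, q)
  xhat <- numeric(n + 1)              # xhat[t+1] stores hat{X}_{t+1|t}
  xhat[2:(m + 1)] <- x[1:m]           # initial values: hat{X}_{t+1|t} = X_t for t <= max(p,q)
  for (t in (m + 1):n) {
    ar <- sum(phi * x[t:(t - p + 1)])
    # innovations X_{t+1-i} - hat{X}_{t+1-i|t-i}, i = 1, ..., q
    ma <- sum(theta * (x[t:(t - q + 1)] - xhat[t:(t - q + 1)]))
    xhat[t + 1] <- ar + ma
  }
  xhat[-1]
}

The last entry of the output is the out-of-sample forecast $\hat{X}_{n+1|n}$; multi-step forecasts can then be obtained by iterating with the unobserved future innovations set to zero.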
In the following proposition we show that the best linear predictor of $X_{t+1}$ given $X_1,\ldots,X_t$, namely $X_{t+1|t}$, the approximating predictor $\hat{X}_{t+1|t}$, and the best linear predictor given the infinite past, $X_t(1)$, are asymptotically equivalent. To do this we obtain expressions for $X_t(1)$ and $\hat{X}_{t+1|t}$:

\[
X_t(1) = \sum_{j=1}^{\infty}b_jX_{t+1-j} \qquad \Big(\text{since } X_{t+1} = \sum_{j=1}^{\infty}b_jX_{t+1-j} + \varepsilon_{t+1}\Big).
\]

Furthermore, by iterating (5.24) backwards we can show that

\[
\hat{X}_{t+1|t} = \sum_{j=1}^{t-\max(p,q)}b_jX_{t+1-j} + \sum_{j=1}^{\max(p,q)}\gamma_jX_j, \tag{5.25}
\]

where $|\gamma_j| \le C\rho^t$, with $1/(1+\delta) < \rho < 1$ and the roots of $\theta(z)$ outside $1+\delta$. We give a proof in the remark below.

Remark 5.4.3 We prove (5.25) for the ARMA(1,2). We first recall that the AR(1) part in the ARMA(1,2) model does not play any role, since $\mathrm{sp}(X_1,X_2,\ldots,X_t) = \mathrm{sp}(W_1,W_2,\ldots,W_t)$, where $W_1 = X_1$ and for $t \ge 2$ we define the corresponding MA(2) process $W_t = \theta_1\varepsilon_{t-1} + \theta_2\varepsilon_{t-2} + \varepsilon_t$. The corresponding approximating predictor is defined as $\hat{W}_{2|1} = W_1$, $\hat{W}_{3|2} = W_2$ and for $t > 3$

\[
\hat{W}_{t+1|t} = \theta_1\big[W_t - \hat{W}_{t|t-1}\big] + \theta_2\big[W_{t-1} - \hat{W}_{t-1|t-2}\big].
\]

Using this and rearranging (5.24) gives

\[
\underbrace{\hat{X}_{t+1|t} - \phi_1X_t}_{\hat{W}_{t+1|t}} = \theta_1\underbrace{\big[X_t - \hat{X}_{t|t-1}\big]}_{=(W_t-\hat{W}_{t|t-1})} + \theta_2\underbrace{\big[X_{t-1} - \hat{X}_{t-1|t-2}\big]}_{=(W_{t-1}-\hat{W}_{t-1|t-2})}.
\]

By subtracting the above from $W_{t+1}$ we have

\[
W_{t+1} - \hat{W}_{t+1|t} = -\theta_1\big(W_t - \hat{W}_{t|t-1}\big) - \theta_2\big(W_{t-1} - \hat{W}_{t-1|t-2}\big) + W_{t+1}. \tag{5.26}
\]

It is straightforward to rewrite $W_{t+1}-\hat{W}_{t+1|t}$ as the matrix difference equation

\[
\underbrace{\begin{pmatrix}W_{t+1}-\hat{W}_{t+1|t}\\ W_t-\hat{W}_{t|t-1}\end{pmatrix}}_{=\underline{\hat{\varepsilon}}_{t+1}}
= \underbrace{\begin{pmatrix}-\theta_1 & -\theta_2\\ 1 & 0\end{pmatrix}}_{=Q}
\underbrace{\begin{pmatrix}W_t-\hat{W}_{t|t-1}\\ W_{t-1}-\hat{W}_{t-1|t-2}\end{pmatrix}}_{=\underline{\hat{\varepsilon}}_t}
+ \underbrace{\begin{pmatrix}W_{t+1}\\ 0\end{pmatrix}}_{=\underline{W}_{t+1}}.
\]

We now show that $\varepsilon_{t+1}$ and $W_{t+1}-\hat{W}_{t+1|t}$ satisfy the same difference equation, except for some
initial conditions; it is this that will give us the result. To do this we write $\varepsilon_t$ as a function of $\{W_t\}$ (the invertibility condition). We first note that $\varepsilon_t$ satisfies the matrix difference equation

\[
\underbrace{\begin{pmatrix}\varepsilon_{t+1}\\ \varepsilon_t\end{pmatrix}}_{=\underline{\varepsilon}_{t+1}}
= \underbrace{\begin{pmatrix}-\theta_1 & -\theta_2\\ 1 & 0\end{pmatrix}}_{=Q}
\underbrace{\begin{pmatrix}\varepsilon_t\\ \varepsilon_{t-1}\end{pmatrix}}_{=\underline{\varepsilon}_t}
+ \underbrace{\begin{pmatrix}W_{t+1}\\ 0\end{pmatrix}}_{=\underline{W}_{t+1}}. \tag{5.27}
\]

Thus iterating backwards we can write

\[
\varepsilon_{t+1} = \sum_{j=0}^{\infty}(-1)^j[Q^j]_{(1,1)}W_{t+1-j} = \sum_{j=0}^{\infty}\tilde{b}_jW_{t+1-j},
\]

where $\tilde{b}_j = (-1)^j[Q^j]_{(1,1)}$ (noting that $\tilde{b}_0 = 1$) denotes the $(1,1)$th element of the matrix $Q^j$ (note we did something similar in Section 2.4.1). Furthermore, the same iteration shows that

\[
\varepsilon_{t+1} = \sum_{j=0}^{t-3}(-1)^j[Q^j]_{(1,1)}W_{t+1-j} + (-1)^{t-2}\big[Q^{t-2}\underline{\varepsilon}_3\big]_{(1)}
= \sum_{j=0}^{t-3}\tilde{b}_jW_{t+1-j} + (-1)^{t-2}\big[Q^{t-2}\underline{\varepsilon}_3\big]_{(1)}. \tag{5.28}
\]

Therefore, by comparison we see that

\[
\varepsilon_{t+1} - \sum_{j=0}^{t-3}\tilde{b}_jW_{t+1-j} = (-1)^{t-2}\big[Q^{t-2}\underline{\varepsilon}_3\big]_{(1)} = \sum_{j=t-2}^{\infty}\tilde{b}_jW_{t+1-j}.
\]

We now return to the approximating predictor in (5.26). Comparing (5.26) and (5.27), we see that they are almost the same difference equation; the only difference is the point at which the recursion starts. $\varepsilon_t$ goes all the way back to the start of time, whereas we have set the initial values $\hat{W}_{2|1} = W_1$, $\hat{W}_{3|2} = W_2$, so that $\underline{\hat{\varepsilon}}_3 = (W_3 - W_2, W_2 - W_1)'$. Therefore, by iterating both (5.26) and (5.27) backwards, focusing on the first element of the vector and using (5.28), we have

\[
\hat{\varepsilon}_{t+1} - \varepsilon_{t+1} = (-1)^{t-2}\big[Q^{t-2}\underline{\hat{\varepsilon}}_3\big]_{(1)} - (-1)^{t-2}\big[Q^{t-2}\underline{\varepsilon}_3\big]_{(1)},
\]

where $\hat{\varepsilon}_{t+1} = W_{t+1} - \hat{W}_{t+1|t}$. We recall that $\varepsilon_{t+1} = W_{t+1} + \sum_{j=1}^{\infty}\tilde{b}_jW_{t+1-j}$. Substituting this into
the above gives

\[
\hat{W}_{t+1|t} + \sum_{j=1}^{t-3}\tilde{b}_jW_{t+1-j} = -(-1)^{t-2}\big[Q^{t-2}\underline{\hat{\varepsilon}}_3\big]_{(1)},
\]

so that $\hat{W}_{t+1|t}$ equals $-\sum_{j=1}^{t-3}\tilde{b}_jW_{t+1-j}$ plus a term whose magnitude decays geometrically in $t$. Replacing $W_t$ with $X_t - \phi_1X_{t-1}$ gives (5.25), where the $b_j$ can easily be deduced from $\tilde{b}_j$ and $\phi_1$.

Proposition 5.4.1 Suppose $\{X_t\}$ is an ARMA process where the roots of $\phi(z)$ and $\theta(z)$ are greater in absolute value than $1+\delta$. Let $X_{t+1|t}$, $\hat{X}_{t+1|t}$ and $X_t(1)$ be defined as in (5.23), (5.24) and above, respectively. Then

\[
\mathrm{E}\big[X_{t+1|t} - \hat{X}_{t+1|t}\big]^2 \le K\rho^t, \tag{5.29}
\]
\[
\mathrm{E}\big[\hat{X}_{t+1|t} - X_t(1)\big]^2 \le K\rho^t, \tag{5.30}
\]
\[
\mathrm{E}\big[X_{t+1} - \hat{X}_{t+1|t}\big]^2 - \sigma^2 \le K\rho^t \tag{5.31}
\]

for any $\frac{1}{1+\delta} < \rho < 1$, where $\mathrm{var}(\varepsilon_t) = \sigma^2$.

PROOF. The proof of (5.29) becomes clear when we use the expansion $X_{t+1} = \sum_{j=1}^{\infty}b_jX_{t+1-j} + \varepsilon_{t+1}$, noting that by Lemma 2.5.1(iii), $|b_j| \le C\rho^j$. Evaluating the best linear predictor of $X_{t+1}$ given $X_t,\ldots,X_1$ using the autoregressive expansion gives

\[
X_{t+1|t} = \sum_{j=1}^{\infty}b_jP_{X_t,\ldots,X_1}(X_{t+1-j}) + \underbrace{P_{X_t,\ldots,X_1}(\varepsilon_{t+1})}_{=0}
= \underbrace{\sum_{j=1}^{t-m}b_jX_{t+1-j}}_{\text{leading term of }\hat{X}_{t+1|t}} + \sum_{j=t-m+1}^{\infty}b_jP_{X_t,\ldots,X_1}(X_{t+1-j}),
\]

where $m = \max(p,q)$. Therefore, by using (5.25), we see that the difference between the best linear predictor and $\hat{X}_{t+1|t}$ is

\[
X_{t+1|t} - \hat{X}_{t+1|t} = \sum_{j=t-m+1}^{\infty}b_jP_{X_t,\ldots,X_1}(X_{t+1-j}) - \sum_{j=1}^{m}\gamma_jX_j := I + II.
\]
By using (5.25), it is straightforward to show that the second term satisfies $\mathrm{E}[II^2] = \mathrm{E}[\sum_{j=1}^{m}\gamma_jX_j]^2 \le C\rho^t$; therefore what remains is to show that $\mathrm{E}[I^2]$ attains a similar bound. Heuristically this seems reasonable, since $|b_j| \le K\rho^j$; the main obstacle is to show that $\mathrm{E}[P_{X_t,\ldots,X_1}(X_{t+1-j})^2]$ does not grow with $t$. Basic results in linear regression show that

\[
P_{X_t,\ldots,X_1}(X_{t+1-j}) = \underline{\beta}_{j,t}'\underline{X}_t, \tag{5.32}
\]

where $\underline{\beta}_{j,t} = \Sigma_t^{-1}r_{t,j}$, with $\underline{\beta}_{j,t} = (\beta_{1,j,t},\ldots,\beta_{t,j,t})'$, $\underline{X}_t = (X_1,\ldots,X_t)'$, $\Sigma_t = \mathrm{E}(\underline{X}_t\underline{X}_t')$ and $r_{t,j} = \mathrm{E}(\underline{X}_tX_{t+1-j})$. Substituting (5.32) into $I$ gives

\[
I = \sum_{j=t-m+1}^{\infty}b_j\underline{\beta}_{j,t}'\underline{X}_t = \Big(\sum_{j=t-m+1}^{\infty}b_jr_{t,j}'\Big)\Sigma_t^{-1}\underline{X}_t. \tag{5.33}
\]

Therefore, using the Cauchy-Schwarz inequality, the spectral norm inequality ($\|B\underline{b}\|_2 \le \|B\|_{\mathrm{spec}}\|\underline{b}\|_2$) and Minkowski's inequality ($\|\sum_{j=1}^{n}a_j\|_2 \le \sum_{j=1}^{n}\|a_j\|_2$), we have

\[
\mathrm{E}[I^2] \le \Big\|\sum_{j=t-m+1}^{\infty}b_jr_{t,j}\Big\|_2^2\,\|\Sigma_t^{-1}\|_{\mathrm{spec}}^2
\le \Big(\sum_{j=t-m+1}^{\infty}|b_j|\,\|r_{t,j}\|_2\Big)^2\|\Sigma_t^{-1}\|_{\mathrm{spec}}^2. \tag{5.34}
\]

We now bound each of the terms above. We note that for all $t$, using Remark 5.4.2, $\|\Sigma_t^{-1}\|_{\mathrm{spec}} \le K$ (for some constant $K$). We now consider $r_{t,j}$, whose entries are autocovariances $c(\cdot)$. By using (3.2) we have $|c(k)| \le C\rho^{|k|}$, therefore $\|r_{t,j}\|_2 \le K(1-\rho^2)^{-1/2}$ uniformly in $j$ and $t$, and since $\sum_{j=t-m+1}^{\infty}|b_j| \le K\rho^{t-m}$, substituting these bounds into (5.34) gives $\mathrm{E}[I^2] \le K\rho^t$. Altogether the bounds for $I$ and $II$ give

\[
\mathrm{E}\big(X_{t+1|t} - \hat{X}_{t+1|t}\big)^2 \le K\rho^t.
\]
Thus proving (5.29). To prove (5.30) we note that

\[
\mathrm{E}\big[X_t(1) - \hat{X}_{t+1|t}\big]^2 = \mathrm{E}\Big[\sum_{j=t-m+1}^{\infty}b_jX_{t+1-j} - \sum_{j=1}^{m}\gamma_jX_j\Big]^2.
\]

Using the above and that $|b_j| \le K\rho^j$, it is straightforward to prove the result. Finally, to prove (5.31), by Minkowski's inequality we have

\[
\Big[\mathrm{E}\big(X_{t+1}-\hat{X}_{t+1|t}\big)^2\Big]^{1/2}
\le \underbrace{\Big[\mathrm{E}\big(X_{t+1}-X_t(1)\big)^2\Big]^{1/2}}_{=\sigma}
+ \underbrace{\Big[\mathrm{E}\big(X_t(1)-\hat{X}_{t+1|t}\big)^2\Big]^{1/2}}_{\le K\rho^{t/2}\text{ by }(5.30)}.
\]

Thus giving the desired result. □

5.5 Forecasting for nonlinear models

In this section we consider forecasting for nonlinear models. The forecasts we construct may not necessarily/formally be the best linear predictor, because the best linear predictor is based on minimising the mean squared error, which we recall from Chapter 4 requires the existence of higher order moments. Instead our forecast will be the conditional expectation of $X_{t+1}$ given the past (note that we can think of it as the best predictor). Furthermore, with the exception of the ARCH model, we will derive approximations of the conditional expectation/best predictor, analogous to the forecasting approximation $\hat{X}_{t+1|t}$ for the ARMA model (given in (5.24)).

5.5.1 Forecasting volatility using an ARCH(p) model

We recall the ARCH(p) model defined in Section 4.2:

\[
X_t = \sigma_tZ_t, \qquad \sigma_t^2 = a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2.
\]
Using a similar calculation to those given in Section 4.2.1, we see that

\begin{align*}
\mathrm{E}[X_{t+1}\mid X_t,X_{t-1},\ldots,X_{t-p+1}] &= \mathrm{E}(Z_{t+1}\sigma_{t+1}\mid X_t,X_{t-1},\ldots,X_{t-p+1})\\
&= \sigma_{t+1}\,\mathrm{E}(Z_{t+1}\mid X_t,X_{t-1},\ldots,X_{t-p+1}) \qquad (\sigma_{t+1}\text{ is a function of }X_t,\ldots,X_{t-p+1})\\
&= \sigma_{t+1}\underbrace{\mathrm{E}(Z_{t+1})}_{\text{by causality}} = 0.
\end{align*}

In other words, past values of $X_t$ have no influence on the expected value of $X_{t+1}$. On the other hand, in Section 4.2.1 we showed that

\[
\mathrm{E}(X_{t+1}^2\mid X_t,X_{t-1},\ldots,X_{t-p+1}) = \mathrm{E}(Z_{t+1}^2\sigma_{t+1}^2\mid X_t,X_{t-1},\ldots,X_{t-p+1}) = \sigma_{t+1}^2\mathrm{E}[Z_{t+1}^2] = \sigma_{t+1}^2 = a_0 + \sum_{j=1}^{p}a_jX_{t+1-j}^2,
\]

thus $X_t$ does have an influence on the conditional mean square/variance. Therefore, if we let $\hat{X}^2_{t+k|t}$ denote the conditional variance of $X_{t+k}$ given $X_t,\ldots,X_{t-p+1}$, it can be derived using the following recursion:

\[
\hat{X}^2_{t+1|t} = a_0 + \sum_{j=1}^{p}a_jX_{t+1-j}^2,
\]
\[
\hat{X}^2_{t+k|t} = a_0 + \sum_{j=k}^{p}a_jX_{t+k-j}^2 + \sum_{j=1}^{k-1}a_j\hat{X}^2_{t+k-j|t} \qquad 2 \le k \le p,
\]
\[
\hat{X}^2_{t+k|t} = a_0 + \sum_{j=1}^{p}a_j\hat{X}^2_{t+k-j|t} \qquad k > p.
\]
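The recursion above is a direct analogue of (5.4) with squared values in place of levels; a minimal sketch in R (function name and interface ours), assuming a0 and the coefficients a = (a_1, ..., a_p) are known:

archForecast <- function(x, a0, a, K) {
  # Forecast sigma^2_{t+k}, k = 1, ..., K, given the last p observations in x
  p <- length(a)
  x2 <- c(x[(length(x) - p + 1):length(x)]^2, rep(NA, K))  # last p squared values, then forecasts
  for (k in 1:K) {
    past <- x2[(p + k - 1):k]        # squared values/forecasts at lags 1, ..., p
    x2[p + k] <- a0 + sum(a * past)  # observed squares and earlier forecasts enter identically
  }
  x2[(p + 1):(p + K)]
}

For example, archForecast(x, a0 = 0.1, a = c(0.3, 0.2), K = 5) returns the five volatility forecasts for an ARCH(2) (the parameter values here are purely illustrative).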
5.5.2 Forecasting volatility using a GARCH(1,1) model

We recall the GARCH(1,1) model defined in Section 4.3:

\[
\sigma_t^2 = a_0 + a_1X_{t-1}^2 + b_1\sigma_{t-1}^2 = \big(a_1Z_{t-1}^2 + b_1\big)\sigma_{t-1}^2 + a_0.
\]

Similar to the ARCH model, it is straightforward to show that $\mathrm{E}[X_{t+1}\mid X_t,X_{t-1},\ldots] = 0$ (where we use the notation $X_t,X_{t-1},\ldots$ to denote the infinite past, or more precisely conditioning on the sigma-algebra $\mathcal{F}_t = \sigma(X_t,X_{t-1},\ldots)$). Therefore, like the ARCH process, our aim is to predict $X_t^2$. We recall from Example 4.3.1 that if the GARCH process is invertible (satisfied if $b_1 < 1$), then

\[
\mathrm{E}[X_{t+1}^2\mid X_t,X_{t-1},\ldots] = \sigma_{t+1}^2 = a_0 + a_1X_t^2 + b_1\sigma_t^2 = \frac{a_0}{1-b_1} + a_1\sum_{j=0}^{\infty}b_1^jX_{t-j}^2. \tag{5.35}
\]

Of course, in reality we only observe the finite past $X_t,X_{t-1},\ldots,X_1$. We can approximate $\mathrm{E}[X_{t+1}^2\mid X_t,\ldots,X_1]$ using the following recursion: set $\hat{\sigma}^2_{1|0} = 0$, then for $t \ge 1$ let

\[
\hat{\sigma}^2_{t+1|t} = a_0 + a_1X_t^2 + b_1\hat{\sigma}^2_{t|t-1}.
\]

(Note this is similar in spirit to the recursive approximate one-step ahead predictor defined in (5.25).) It is straightforward to show that

\[
\hat{\sigma}^2_{t+1|t} = \frac{a_0(1-b_1^t)}{1-b_1} + a_1\sum_{j=0}^{t-1}b_1^jX_{t-j}^2,
\]

taking note that this is not the same as $\mathrm{E}[X_{t+1}^2\mid X_t,\ldots,X_1]$ (if the mean square error existed, $\mathrm{E}[X_{t+1}^2\mid X_t,\ldots,X_1]$ would give a smaller mean square error), but just like the ARMA process it closely approximates it. Furthermore, from (5.35) it can be seen that $\hat{\sigma}^2_{t+1|t}$ closely approximates $\sigma_{t+1}^2$.

Exercise 5.3 To answer this question you need R; run install.packages("tseries") and then load it with library("tseries").

(i) You will find the Nasdaq data from 4th January 2010 - 15th October 2014 on my website.

(ii) By taking log differences, fit a GARCH(1,1) model to the daily closing data (ignore the adjusted closing value) from 4th January 2010 to the end of September 2014 (use the function garch(x, order = c(1,1)) to fit the GARCH(1,1) model).

(iii) Using the fitted GARCH(1,1) model, forecast the volatility $\sigma_t^2$ from October 1st-15th (noting that no trading is done during the weekends). Denote these forecasts as $\hat{\sigma}^2_{t|0}$. Evaluate $\sum_t\hat{\sigma}^2_{t|0}$ over these days.

(iv) Compare this to the actual volatility $\sum_tX_t^2$ over the same days (where $X_t$ are the log differences).
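The recursion for $\hat{\sigma}^2_{t+1|t}$ is a one-line loop; a minimal sketch in R (function name ours), assuming the GARCH(1,1) parameters are known or have been estimated:

garchVolForecast <- function(x, a0, a1, b1) {
  # hat{sigma}^2_{t+1|t} = a0 + a1 * X_t^2 + b1 * hat{sigma}^2_{t|t-1}, with hat{sigma}^2_{1|0} = 0
  n <- length(x)
  s2 <- numeric(n + 1)
  for (t in 1:n) s2[t + 1] <- a0 + a1 * x[t]^2 + b1 * s2[t]
  s2[n + 1]                       # the one-step ahead volatility forecast
}

For horizons $k \ge 2$ the standard recursion $\hat{\sigma}^2_{t+k|t} = a_0 + (a_1+b_1)\hat{\sigma}^2_{t+k-1|t}$ can then be applied, which follows by taking conditional expectations of the GARCH recursion and using $\mathrm{E}[X_s^2\mid\mathcal{F}_t] = \mathrm{E}[\sigma_s^2\mid\mathcal{F}_t]$ for $s > t$.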
5.5.3 Forecasting using a BL(1,0,1,1) model

We recall the bilinear BL(1,0,1,1) model defined in Section 4.4:

\[
X_t = \phi_1X_{t-1} + b_{1,1}X_{t-1}\varepsilon_{t-1} + \varepsilon_t.
\]

Assuming invertibility, so that $\varepsilon_t$ can be written in terms of $X_t$ (see Remark 4.4.2):

\[
\varepsilon_t = \sum_{j=0}^{\infty}(-b)^j\Big(\prod_{i=0}^{j-1}X_{t-1-i}\Big)\big[X_{t-j} - \phi X_{t-j-1}\big],
\]

it can be shown that

\[
X_t(1) = \mathrm{E}[X_{t+1}\mid X_t,X_{t-1},\ldots] = \phi_1X_t + b_{1,1}X_t\varepsilon_t.
\]

However, just as in the ARMA and GARCH cases, we can obtain an approximation by setting $\hat{X}_{1|0} = 0$ and for $t \ge 1$ defining the recursion

\[
\hat{X}_{t+1|t} = \phi_1X_t + b_{1,1}X_t\big(X_t - \hat{X}_{t|t-1}\big).
\]

See ? and ? for further details.
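The recursion mirrors the GARCH case; a minimal sketch in R (function name ours), assuming the bilinear parameters are known:

blForecast <- function(x, phi1, b11) {
  # hat{X}_{t+1|t} = phi1 * X_t + b11 * X_t * (X_t - hat{X}_{t|t-1}), with hat{X}_{1|0} = 0
  n <- length(x)
  xhat <- numeric(n + 1)
  for (t in 1:n) xhat[t + 1] <- phi1 * x[t] + b11 * x[t] * (x[t] - xhat[t])
  xhat[n + 1]                     # the one-step ahead forecast
}

Here x[t] - xhat[t] plays the role of the estimated innovation, exactly as the term $X_t - \hat{X}_{t|t-1}$ does in the displayed recursion.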
Remark 5.5.1 (How well does $\hat{X}_{t+1|t}$ approximate $X_t(1)$?) We now derive conditions for $\hat{X}_{t+1|t}$ to be a close approximation of $X_t(1)$ when $t$ is large. We use a similar technique to that used in Remark 5.4.3. We note that $X_{t+1} - X_t(1) = \varepsilon_{t+1}$ (since a future innovation, $\varepsilon_{t+1}$, cannot be predicted). We will show that $X_{t+1} - \hat{X}_{t+1|t}$ is close to $\varepsilon_{t+1}$. Subtracting $\hat{X}_{t+1|t}$ from $X_{t+1}$ gives the recursion

\[
X_{t+1} - \hat{X}_{t+1|t} = -b_{1,1}\big(X_t - \hat{X}_{t|t-1}\big)X_t + \big(b_{1,1}\varepsilon_tX_t + \varepsilon_{t+1}\big). \tag{5.36}
\]

We will compare the above recursion to the recursion based on $\varepsilon_{t+1}$. Rearranging the bilinear equation gives

\[
\varepsilon_{t+1} = -b\varepsilon_tX_t + \underbrace{\big(X_{t+1} - \phi_1X_t\big)}_{=b\varepsilon_tX_t + \varepsilon_{t+1}}. \tag{5.37}
\]

We observe that (5.36) and (5.37) are almost the same difference equation; the only difference is that an initial value is set for $\hat{X}_{1|0}$. This gives the difference between the two equations as

\[
\varepsilon_{t+1} - \big[X_{t+1} - \hat{X}_{t+1|t}\big] = (-1)^tb^tX_1\prod_{j=1}^{t}\varepsilon_j + (-1)^tb^t\big[X_1 - \hat{X}_{1|0}\big]\prod_{j=1}^{t}\varepsilon_j.
\]

Thus if $b^t\prod_{j=1}^{t}\varepsilon_j \overset{a.s.}{\to} 0$ as $t\to\infty$, then $\hat{X}_{t+1|t} \overset{P}{\to} X_t(1)$ as $t\to\infty$. We now show that if $\mathrm{E}[\log|\varepsilon_t|] < -\log|b|$, then $b^t\prod_{j=1}^{t}\varepsilon_j \overset{a.s.}{\to} 0$. Since $b^t\prod_{j=1}^{t}\varepsilon_j$ is a product, it seems appropriate to take logarithms to transform it into a sum. To ensure that it is positive, we take absolute values and $t$th roots:

\[
\log\Big|b^t\prod_{j=1}^{t}\varepsilon_j\Big|^{1/t} = \log|b| + \underbrace{\frac{1}{t}\sum_{j=1}^{t}\log|\varepsilon_j|}_{\text{average of iid random variables}}.
\]

Therefore by the law of large numbers we have

\[
\log\Big|b^t\prod_{j=1}^{t}\varepsilon_j\Big|^{1/t} = \log|b| + \frac{1}{t}\sum_{j=1}^{t}\log|\varepsilon_j| \overset{P}{\to} \log|b| + \mathrm{E}\log|\varepsilon_0| =: \gamma.
\]

Thus we see that $|b^t\prod_{j=1}^{t}\varepsilon_j|^{1/t} \overset{a.s.}{\to} \exp(\gamma)$. In other words, $|b^t\prod_{j=1}^{t}\varepsilon_j| \approx \exp(t\gamma)$, which will only converge to zero if $\mathrm{E}[\log|\varepsilon_t|] < -\log|b|$.

5.6 Nonparametric prediction

In this section we briefly consider how prediction can be achieved in the nonparametric world. Let us assume that $\{X_t\}$ is a stationary time series. Our objective is to predict $X_{t+1}$ given the past. However, we don't want to make any assumptions about the nature of $\{X_t\}$. Instead we want to obtain a predictor of $X_{t+1}$ given $X_t$ which minimises the mean squared error, $\mathrm{E}[X_{t+1}-g(X_t)]^2$. It is well known that this is the conditional expectation $\mathrm{E}[X_{t+1}\mid X_t]$ (since $\mathrm{E}[X_{t+1}-g(X_t)]^2 = \mathrm{E}[X_{t+1}-\mathrm{E}(X_{t+1}\mid X_t)]^2 + \mathrm{E}[g(X_t)-\mathrm{E}(X_{t+1}\mid X_t)]^2$). Therefore, one can estimate $\mathrm{E}[X_{t+1}\mid X_t = x] = m(x)$
nonparametrically. A classical estimator of $m(x)$ is the Nadaraya-Watson estimator

\[
\hat{m}_n(x) = \frac{\sum_{t=1}^{n-1}X_{t+1}K\big(\frac{x-X_t}{b}\big)}{\sum_{t=1}^{n-1}K\big(\frac{x-X_t}{b}\big)},
\]

where $K:\mathbb{R}\to\mathbb{R}$ is a kernel function (see Fan and Yao (2003), Chapters 5 and 6); a code sketch of this predictor is given at the end of this section. Under some regularity conditions it can be shown that $\hat{m}_n(x)$ is a consistent estimator of $m(x)$ and converges to $m(x)$ in mean square (with the typical mean squared rate $O(b^4 + (bn)^{-1})$). The advantage of going the nonparametric route is that we have not imposed any form of structure on the process (such as linear/(G)ARCH/bilinear); therefore, we do not run the risk of misspecifying the model. A disadvantage is that nonparametric estimators tend to be a lot worse than parametric estimators (in Chapter ?? we show that parametric estimators have the $O(n^{-1/2})$ rate of convergence, which is faster than the nonparametric rate $O(b^2 + (bn)^{-1/2})$). Another possible disadvantage is that if we wanted to include more past values in the predictor, i.e.

\[
m(x_1,\ldots,x_d) = \mathrm{E}[X_{t+1}\mid X_t = x_1,\ldots,X_{t-p} = x_d],
\]

then the estimator would have an extremely poor rate of convergence (due to the curse of dimensionality). A possible solution to this problem is to assume some structure on the nonparametric model, and define a semi-parametric time series model. We state some examples below:

(i) An additive structure of the type

\[
X_t = \sum_{j=1}^{p}g_j(X_{t-j}) + \varepsilon_t,
\]

where $\{\varepsilon_t\}$ are iid random variables.

(ii) A functional autoregressive type structure

\[
X_t = \sum_{j=1}^{p}g_j(X_{t-d})X_{t-j} + \varepsilon_t.
\]

(iii) The semi-parametric GARCH(1,1)

\[
X_t = \sigma_tZ_t, \qquad \sigma_t^2 = b\sigma_{t-1}^2 + m(X_{t-1}).
\]

However, once a structure has been imposed, conditions need to be derived in order that the model has a stationary solution (just as we did with the fully parametric models). See ?, ?, ?, ?, ? etc.
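The Nadaraya-Watson predictor mentioned above is only a few lines in R. Below is a minimal sketch (function name ours) using a Gaussian kernel; b is the bandwidth, which in practice must be chosen by some data-driven rule.

nwPredict <- function(x, xeval, b) {
  # Nadaraya-Watson estimate of m(u) = E[X_{t+1} | X_t = u] at the points in xeval
  n <- length(x)
  sapply(xeval, function(u) {
    w <- dnorm((u - x[1:(n - 1)]) / b)   # kernel weights on the regressors X_1, ..., X_{n-1}
    sum(w * x[2:n]) / sum(w)             # weighted average of the responses X_2, ..., X_n
  })
}

The one-step forecast of $X_{n+1}$ is then nwPredict(x, x[length(x)], b), i.e. the estimated regression function evaluated at the last observation.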
5.7 The Wold Decomposition

Section 5.2.1 nicely leads to the Wold decomposition, which we now state and prove. The Wold decomposition theorem states that any stationary process has something that appears close to an MA($\infty$) representation (though it is not quite one). We state the theorem below and use some of the notation introduced in Section 5.2.1.

Theorem 5.7.1 Suppose that $\{X_t\}$ is a second order stationary time series with a finite variance (we shall assume that it has mean zero, though this is not necessary). Then $X_t$ can be uniquely expressed as

\[
X_t = \sum_{j=0}^{\infty}\psi_jZ_{t-j} + V_t, \tag{5.38}
\]

where $\{Z_t\}$ are uncorrelated random variables with $\mathrm{var}(Z_t) = \mathrm{E}(X_t - X_{t-1}(1))^2$ (noting that $X_{t-1}(1)$ is the best linear predictor of $X_t$ given $X_{t-1},X_{t-2},\ldots$) and $V_t \in X_{-\infty} = \bigcap_{n}\overline{\mathrm{sp}}(X_n,X_{n-1},\ldots)$, which is defined in Section 5.2.1.

PROOF. First let us consider the one-step ahead prediction of $X_t$ given the infinite past, denoted $X_{t-1}(1)$. Since $\{X_t\}$ is a second order stationary process, it is clear that $X_{t-1}(1) = \sum_{j=1}^{\infty}b_jX_{t-j}$, where the coefficients $\{b_j\}$ do not vary with $t$. For this reason $\{X_{t-1}(1)\}$ and $\{X_t - X_{t-1}(1)\}$ are second order stationary random variables. Furthermore, since $X_t - X_{t-1}(1)$ is uncorrelated with $X_s$ for any $s \le t-1$, the variables $\{X_s - X_{s-1}(1);\, s \in \mathbb{Z}\}$ are uncorrelated. Define $Z_s = X_s - X_{s-1}(1)$ and observe that $Z_s$ is the one-step ahead prediction error. We recall from Section 5.2.1 that

\[
X_t \in \overline{\mathrm{sp}}\big((X_t-X_{t-1}(1)),(X_{t-1}-X_{t-2}(1)),\ldots\big) \oplus X_{-\infty} = \bigoplus_{j=0}^{\infty}\mathrm{sp}(Z_{t-j}) \oplus X_{-\infty}.
\]

Since the spaces $\bigoplus_{j=0}^{\infty}\mathrm{sp}(Z_{t-j})$ and $X_{-\infty}$ are orthogonal, we shall first project $X_t$ onto $\bigoplus_{j=0}^{\infty}\mathrm{sp}(Z_{t-j})$; due to orthogonality, the difference between $X_t$ and its projection will lie in $X_{-\infty}$. This leads to the Wold decomposition. The projection of $X_t$ onto the space $\bigoplus_{j=0}^{\infty}\mathrm{sp}(Z_{t-j})$ is

\[
P_{Z_t,Z_{t-1},\ldots}(X_t) = \sum_{j=0}^{\infty}\psi_jZ_{t-j},
\]

where, due to orthogonality, $\psi_j = \mathrm{cov}\big(X_t, Z_{t-j}\big)/\mathrm{var}\big(Z_{t-j}\big)$. Since $X_t - P_{Z_t,Z_{t-1},\ldots}(X_t)$ is orthogonal to $\{Z_t\}$ and belongs in
150 sp(x ). Hence we have X t = ψ j Z t j + V t, j=0 where V t = X t j=0 ψ jz t j and is uncorrelated to {Z t }. Hence we have shown (5.38). To show that the representation is unique we note that Z t, Z t,... are an orthogonal basis of sp(z t, Z t,...), which pretty much leads to uniqueness. Exercise 5.4 Consider the process X t = A cos(bt + U) where A, B and U are random variables such that A, B and U are independent and U is uniformly distributed on (0, 2π). (i) Show that X t is second order stationary (actually it s stationary) and obtain it s means and covariance function. (ii) Show that the distribution of A and B can be chosen is such a way that {X t } has the same covariance function as the MA() process Y t = ε t + φε t (where φ < ) (quite amazing). (iii) Suppose A and B have the same distribution found in (ii). (a) What is the best predictor of X t+ given X t, X t,...? (b) What is the best linear predictor of X t+ given X t, X t,...? It is worth noting that variants on the proof can be found in Brockwell and Davis (998), Section 5.7 and Fuller (995), page 94. Remark 5.7. Notice that the representation in (5.38) looks like an MA( ) process. There is, however, a significant difference. The random variables {Z t } of an MA( ) process are iid random variables and not just uncorrelated. We recall that we have already come across the Wold decomposition of some time series. In Section 3.3 we showed that a non-causal linear time series could be represented as a causal linear time series with uncorrelated but dependent innovations. Another example is in Chapter 4, where we explored ARCH/GARCH process which have an AR and ARMA type representation. Using this representation we can represent ARCH and GARCH processes as the weighted sum of {(Zt 2 )σt 2 } which are uncorrelated random variables. 49
Remark 5.7.2 (Variation on the Wold decomposition) In many technical proofs involving time series, we often use a result related to the Wold decomposition. More precisely, we often decompose the time series in terms of an infinite sum of martingale differences. In particular, we define the sigma-algebra $\mathcal{F}_t = \sigma(X_t,X_{t-1},\ldots)$, and suppose that $\mathrm{E}(X_t\mid\mathcal{F}_{-\infty}) = \mu$. Then by telescoping we can formally write $X_t$ as

\[
X_t - \mu = \sum_{j=0}^{\infty}Z_{t,j},
\]

where $Z_{t,j} = \mathrm{E}(X_t\mid\mathcal{F}_{t-j}) - \mathrm{E}(X_t\mid\mathcal{F}_{t-j-1})$. It is straightforward to see that the $Z_{t,j}$ are martingale differences, and under certain conditions (mixing, physical dependence, your favourite dependence flavour, etc.) it can be shown that $\sum_{j=0}^{\infty}\|Z_{t,j}\|_p < \infty$ (where $\|\cdot\|_p$ denotes the $p$th moment norm). This means the above representation holds almost surely. Thus in several proofs we can replace $X_t - \mu$ by $\sum_{j=0}^{\infty}Z_{t,j}$. This decomposition allows us to use martingale theorems to prove results.
Chapter 6

Estimation of the mean and covariance

Prerequisites

Some idea of what a cumulant is.

Objectives

To derive the sample autocovariance of a time series, and show that this is a positive definite sequence.

To show that the variance of the sample covariance involves fourth order cumulants, which can be unwieldy to estimate in practice, but that under linearity the expression for the variance greatly simplifies.

To show that under linearity the correlation does not involve the fourth order cumulant. This is the Bartlett formula.

To use the above results to construct a test for uncorrelatedness of a time series (the Portmanteau test), and to understand how this test may be useful for testing for independence in various different settings. Also to understand situations where the test may fail.
6.1 An estimator of the mean

Suppose we observe $\{Y_t\}_{t=1}^{n}$, where $Y_t = \mu + X_t$, $\mu$ is the finite mean and $\{X_t\}$ is a zero mean stationary time series with absolutely summable covariances ($\sum_k|\mathrm{cov}(X_0,X_k)| < \infty$). Our aim is to estimate the mean $\mu$. The most obvious estimator is the sample mean, that is, $\bar{Y}_n = n^{-1}\sum_{t=1}^{n}Y_t$.

6.1.1 The sampling properties of the sample mean

We recall from Example 1.5.1 that we obtained an expression for the variance of the sample mean. We showed that

\[
\mathrm{var}(\bar{Y}_n) = \frac{1}{n}\mathrm{var}(X_0) + \frac{2}{n}\sum_{k=1}^{n-1}\Big(\frac{n-k}{n}\Big)c(k).
\]

Furthermore, if $\sum_k|c(k)| < \infty$, then in Example 1.5.1 we showed that

\[
\mathrm{var}(\bar{Y}_n) = \frac{1}{n}\mathrm{var}(X_0) + \frac{2}{n}\sum_{k=1}^{\infty}c(k) + o\Big(\frac{1}{n}\Big).
\]

Thus if the time series has sufficient decay in its correlation structure, a mean squared consistent estimator of the mean can be achieved. However, one drawback is that the dependency means that one observation will influence the next, and if the influence is positive (seen by a positive covariance), the resulting estimator may have a (much) larger variance than in the iid case. A small simulation illustrating this variance formula is given after the list below.

The above result does not require any more conditions on the process besides second order stationarity and summability of its covariance. However, to obtain confidence intervals we require a stronger result, namely a central limit theorem for the sample mean. The above conditions are not enough to give a central limit theorem. To obtain a CLT for sums of the form $\sum_{t=1}^{n}X_t$ we need the following main ingredients:

(i) The variance needs to be finite.

(ii) The dependence between the $X_t$ decreases the further apart in time the observations are. However, this is more than just the correlation; it really means the dependence.
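Before moving to the CLT, the variance formula above is easy to check by simulation; a minimal sketch in R (all names ours), using an AR(1) with $\phi = 0.7$, for which $c(k) = \phi^k/(1-\phi^2)$ and hence $c(0) + 2\sum_{k\ge 1}c(k) = 1/(1-\phi)^2$:

phi <- 0.7; n <- 200
lrv <- 1 / (1 - phi)^2                 # theoretical long run variance of the sample mean
means <- replicate(5000, mean(arima.sim(list(ar = phi), n = n)))
c(empirical = n * var(means), theoretical = lrv)

The two numbers should be close, and both are much larger than the iid value $\mathrm{var}(X_0) = 1/(1-\phi^2)$, illustrating the inflation caused by positive correlation.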
The above conditions are satisfied by linear time series if the coefficients $\psi_j$ decay sufficiently fast. However, these conditions can also be verified for nonlinear time series (for example the (G)ARCH and bilinear models described in Chapter 4). We now state the asymptotic normality result for linear models.

Theorem 6.1.1 Suppose that $X_t$ is a linear time series of the form $X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j}$, where $\varepsilon_t$ are iid random variables with mean zero and variance one, $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$ and $\sum_{j=-\infty}^{\infty}\psi_j \ne 0$. Let $Y_t = \mu + X_t$; then we have

\[
\sqrt{n}\big(\bar{Y}_n - \mu\big) \overset{D}{\to} \mathcal{N}(0,\sigma^2),
\]

where $\sigma^2 = \mathrm{var}(X_0) + 2\sum_{k=1}^{\infty}c(k)$.

PROOF. Later in this course we will give precise details on how to prove asymptotic normality of several different types of estimators in time series. However, we give a small flavour here by showing asymptotic normality of $\bar{Y}_n$ in the special case that $\{X_t\}_{t=1}^{n}$ satisfies an MA(q) model, then explain how it can be extended to MA($\infty$) processes. The main idea of the proof is to transform/approximate the average into a quantity that we know is asymptotically normal. We know that if $\{\epsilon_t\}_{t=1}^{n}$ are iid random variables with mean $\mu$ and variance one, then

\[
\sqrt{n}(\bar{\epsilon}_n - \mu) \overset{D}{\to} \mathcal{N}(0,1). \tag{6.1}
\]

We aim to use this result to prove the theorem. Returning to $\bar{Y}_n$, by a change of variables ($s = t-j$) we can show that

\begin{align*}
\frac{1}{n}\sum_{t=1}^{n}Y_t &= \mu + \frac{1}{n}\sum_{t=1}^{n}\sum_{j=0}^{q}\psi_j\varepsilon_{t-j}
= \mu + \frac{1}{n}\Big(\sum_{j=0}^{q}\psi_j\Big)\sum_{s=1}^{n-q}\varepsilon_s + \frac{1}{n}\sum_{s=-q+1}^{0}\varepsilon_s\sum_{j=1-s}^{q}\psi_j + \frac{1}{n}\sum_{s=n-q+1}^{n}\varepsilon_s\sum_{j=0}^{n-s}\psi_j\\
&:= \mu + \Psi\frac{(n-q)}{n}\bar{\varepsilon}_{n-q} + E_1 + E_2, \tag{6.2}
\end{align*}
where $\Psi = \sum_{j=0}^{q}\psi_j$. It is straightforward to show that $\mathrm{E}|E_1| \le Cn^{-1}$ and $\mathrm{E}|E_2| \le Cn^{-1}$. Finally we examine $\Psi\frac{(n-q)}{n}\bar{\varepsilon}_{n-q}$. We note that if the assumptions are not satisfied and $\sum_{j=0}^{q}\psi_j = 0$ (for example the process $X_t = \varepsilon_t - \varepsilon_{t-1}$), then

\[
\frac{1}{n}\sum_{t=1}^{n}Y_t = \mu + \frac{1}{n}\sum_{s=-q+1}^{0}\varepsilon_s\sum_{j=1-s}^{q}\psi_j + \frac{1}{n}\sum_{s=n-q+1}^{n}\varepsilon_s\sum_{j=0}^{n-s}\psi_j.
\]

This is a degenerate case, since $E_1$ and $E_2$ only consist of a finite number of terms, and thus if the $\varepsilon_t$ are non-Gaussian these terms will never be asymptotically normal. Therefore, in this case we simply have that $\frac{1}{n}\sum_{t=1}^{n}Y_t = \mu + O_p(n^{-1})$ (this is why it was stated in the assumptions that $\Psi \ne 0$). On the other hand, if $\Psi \ne 0$, then the dominating term in $\bar{Y}_n$ is $\bar{\varepsilon}_{n-q}$. From (6.1) it is clear that $\sqrt{n-q}\,\bar{\varepsilon}_{n-q} \overset{D}{\to} \mathcal{N}(0,1)$ as $n \to \infty$. However, for finite $q$, $(n-q)/n \to 1$, therefore $\sqrt{n}\,\bar{\varepsilon}_{n-q} \overset{D}{\to} \mathcal{N}(0,1)$. Altogether, substituting $\mathrm{E}|E_1| \le Cn^{-1}$ and $\mathrm{E}|E_2| \le Cn^{-1}$ into (6.2) gives

\[
\sqrt{n}\big(\bar{Y}_n - \mu\big) = \Psi\sqrt{n}\,\bar{\varepsilon}_{n-q} + O_p\big(n^{-1/2}\big) \overset{D}{\to} \mathcal{N}\big(0,\Psi^2\big).
\]

With a little work, it can be shown that $\Psi^2 = \sigma^2$. Observe that the proof simply approximated the sum by a sum of iid random variables. In the case that the process is an MA($\infty$) or general linear time series, a similar method is used. More precisely, we have

\[
\sqrt{n}\big(\bar{Y}_n - \mu\big) = \frac{1}{\sqrt{n}}\sum_{t=1}^{n}\sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j}
= \frac{1}{\sqrt{n}}\Big(\sum_{j=-\infty}^{\infty}\psi_j\Big)\sum_{t=1}^{n}\varepsilon_t + R_n,
\]

where

\[
R_n = \frac{1}{\sqrt{n}}\sum_{j=-\infty}^{\infty}\psi_j\Big(\sum_{s=1-j}^{n-j}\varepsilon_s - \sum_{s=1}^{n}\varepsilon_s\Big) := R_{n,1} + R_{n,2} + R_{n,3} + R_{n,4},
\]

the four terms collecting the boundary sums on either side.
We will show that $\mathrm{E}[R_{n,j}^2] = o(1)$ for $1 \le j \le 4$. We start with $R_{n,1}$:

\begin{align*}
\mathrm{E}[R_{n,1}^2] &= \frac{1}{n}\sum_{j_1,j_2=0}^{\infty}\psi_{j_1}\psi_{j_2}\mathrm{cov}\Big(\sum_{s=1-j_1}^{0}\varepsilon_s,\sum_{s=1-j_2}^{0}\varepsilon_s\Big)
= \frac{1}{n}\sum_{j_1,j_2=0}^{\infty}\psi_{j_1}\psi_{j_2}\min[j_1,j_2]\\
&\le \frac{1}{n}\sum_{j=0}^{\infty}j\psi_j^2 + \frac{2}{n}\sum_{j_1=0}^{\infty}|\psi_{j_1}|\sum_{j_2=0}^{j_1}j_2|\psi_{j_2}|.
\end{align*}

Since $\sum_{j=0}^{\infty}|\psi_j| < \infty$ and, thus, $\sum_{j=0}^{\infty}|\psi_j|^2 < \infty$, by dominated convergence $\sum_{j=0}^{n}(1-j/n)\psi_j \to \sum_{j=0}^{\infty}\psi_j$ and $\sum_{j=0}^{n}(1-j/n)\psi_j^2 \to \sum_{j=0}^{\infty}\psi_j^2$ as $n \to \infty$. This implies that $\sum_{j=0}^{n}(j/n)\psi_j \to 0$ and $\sum_{j=0}^{n}(j/n)\psi_j^2 \to 0$. Substituting this into the above bounds for $\mathrm{E}[R_{n,1}^2]$ we immediately obtain $\mathrm{E}[R_{n,1}^2] = o(1)$. Using the same argument we obtain the same bound for $R_{n,2}$, $R_{n,3}$ and $R_{n,4}$. Thus

\[
\sqrt{n}\big(\bar{Y}_n - \mu\big) = \Psi\frac{1}{\sqrt{n}}\sum_{t=1}^{n}\varepsilon_t + o_p(1),
\]

and the result then immediately follows. □

Estimation of the so-called long run variance (given in Theorem 6.1.1) can be difficult. There are various methods that can be used, such as estimating the spectral density function (which we define in Chapter 8) at zero. An interesting approach advocated by Xiaofeng Shao is the method of so-called self-normalization, which circumvents the need to estimate the long run variance; see Shao (2010).

6.2 An estimator of the covariance

Suppose we observe $\{Y_t\}_{t=1}^{n}$; we can estimate the covariance $c(k) = \mathrm{cov}(Y_0,Y_k)$ from the observations. A plausible estimator is

\[
\hat{c}_n(k) = \frac{1}{n}\sum_{t=1}^{n-|k|}\big(Y_t - \bar{Y}_n\big)\big(Y_{t+|k|} - \bar{Y}_n\big), \tag{6.3}
\]

since $\mathrm{E}[(Y_t - \bar{Y}_n)(Y_{t+|k|} - \bar{Y}_n)] \approx c(k)$.
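The estimator (6.3) is exactly what R's acf function computes; a direct sketch (function name ours):

chat <- function(y, k) {
  # hat{c}_n(k) in (6.3): note the divisor n, not n - |k|
  n <- length(y); ybar <- mean(y); k <- abs(k)
  sum((y[1:(n - k)] - ybar) * (y[(k + 1):n] - ybar)) / n
}

One can check that chat(y, k) agrees with acf(y, type = "covariance", plot = FALSE)$acf[k + 1], since acf also uses the divisor n.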
Of course if the mean of $Y_t$ is known to be zero ($Y_t = X_t$), then the covariance estimator is

\[
\hat{c}_n(k) = \frac{1}{n}\sum_{t=1}^{n-|k|}X_tX_{t+|k|}. \tag{6.4}
\]

The eagle-eyed amongst you may wonder why we don't use $\frac{1}{n-|k|}\sum_{t=1}^{n-|k|}X_tX_{t+|k|}$, which is an unbiased estimator, whereas $\hat{c}_n(k)$ is a biased estimator. However, $\hat{c}_n(k)$ has some very nice properties, which we discuss in the lemma below.

Lemma 6.2.1 Suppose we define the empirical covariances

\[
\hat{c}_n(k) = \begin{cases}\frac{1}{n}\sum_{t=1}^{n-|k|}X_tX_{t+|k|} & |k| \le n-1,\\ 0 & \text{otherwise,}\end{cases}
\]

then $\{\hat{c}_n(k)\}$ is a positive definite sequence. Therefore, using Lemma 1.6.1 there exists a stationary time series $\{Z_t\}$ which has the covariance $\hat{c}_n(k)$.

PROOF. There are various ways to show that $\{\hat{c}_n(k)\}$ is a positive definite sequence. One method uses that the spectral density corresponding to this sequence is non-negative; we give this proof in Section ??. Here we give an alternative proof. We recall that a sequence is positive definite if for any vector $\underline{a} = (a_1,\ldots,a_n)'$ we have

\[
\sum_{k_1,k_2=1}^{n}a_{k_1}a_{k_2}\hat{c}_n(k_1-k_2) = \underline{a}'\hat{\Sigma}_n\underline{a} \ge 0, \quad\text{where}\quad
\hat{\Sigma}_n = \begin{pmatrix}
\hat{c}_n(0) & \hat{c}_n(1) & \hat{c}_n(2) & \ldots & \hat{c}_n(n-1)\\
\hat{c}_n(1) & \hat{c}_n(0) & \hat{c}_n(1) & \ldots & \hat{c}_n(n-2)\\
\vdots & & & \ddots & \vdots\\
\hat{c}_n(n-1) & \hat{c}_n(n-2) & \ldots & & \hat{c}_n(0)
\end{pmatrix},
\]

noting that $\hat{c}_n(k) = \frac{1}{n}\sum_{t=1}^{n-|k|}X_tX_{t+|k|}$. However, $\hat{c}_n(k)$ has a very interesting construction: it can be shown that the above covariance matrix is $\hat{\Sigma}_n = \frac{1}{n}\mathcal{X}_n\mathcal{X}_n'$, where $\mathcal{X}_n$ is an
$n \times 2n$ matrix with

\[
\mathcal{X}_n = \begin{pmatrix}
X_1 & X_2 & \ldots & X_{n-1} & X_n & 0 & 0 & \ldots & 0\\
0 & X_1 & X_2 & \ldots & X_{n-1} & X_n & 0 & \ldots & 0\\
\vdots & & \ddots & & & & \ddots & & \vdots\\
0 & \ldots & 0 & X_1 & X_2 & \ldots & X_{n-1} & X_n & 0
\end{pmatrix}.
\]

Using the above we have $\underline{a}'\hat{\Sigma}_n\underline{a} = \frac{1}{n}\underline{a}'\mathcal{X}_n\mathcal{X}_n'\underline{a} = \frac{1}{n}\|\mathcal{X}_n'\underline{a}\|_2^2 \ge 0$. This proves that $\{\hat{c}_n(k)\}$ is a positive definite sequence. Finally, by using Lemma 1.6.1, there exists a stochastic process with $\{\hat{c}_n(k)\}$ as its autocovariance function. □

6.2.1 Asymptotic properties of the covariance estimator

The main reason we construct an estimator is either for testing or for constructing a confidence interval for the parameter of interest. To do this we need the variance and distribution of the estimator. It is impossible to derive the finite sample distribution, thus we look at the asymptotic distribution. Besides showing asymptotic normality, it is important to derive an expression for the variance. In an ideal world the variance will be simple and will not involve unknown parameters. Usually in time series this will not be the case, and the variance will involve several (often an infinite number of) parameters which are not straightforward to estimate. Later in this section we show that the variance of the sample covariance can be extremely complicated. However, a substantial simplification can arise if we consider only the sample correlation (not the covariance) and assume linearity of the time series. This result is known as Bartlett's formula (you may have come across Maurice Bartlett before; besides his fundamental contributions to time series he is well known for proposing the famous Bartlett correction). This example demonstrates how the assumption of linearity can really simplify problems in time series analysis, and also how we can circumvent certain problems which arise by making slight modifications of the estimator (such as going from covariance to correlation).

The following theorem gives the asymptotic sampling properties of the covariance estimator (6.3). One proof of the result can be found in Brockwell and Davis (1998), Chapter 8, and Fuller
(1995), but it goes back to Bartlett (indeed it is called Bartlett's formula). We prove the result in Section 6.2.2.

Theorem 6.2.1 Suppose $\{X_t\}$ is a linear stationary time series where

\[
X_t = \mu + \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j},
\]

with $\sum_j|\psi_j| < \infty$ and $\{\varepsilon_t\}$ iid random variables with $\mathrm{E}(\varepsilon_t^4) < \infty$. Suppose we observe $\{X_t: t=1,\ldots,n\}$ and use (6.3) as an estimator of the covariance $c(k) = \mathrm{cov}(X_0,X_k)$. Define $\hat{\rho}_n(r) = \hat{c}_n(r)/\hat{c}_n(0)$ as the sample correlation. Then for each $h \in \{1,\ldots,n\}$

\[
\sqrt{n}\big(\hat{\rho}_n(h) - \rho(h)\big) \overset{D}{\to} \mathcal{N}(0,W_h), \tag{6.5}
\]

where $\hat{\rho}_n(h) = (\hat{\rho}_n(1),\ldots,\hat{\rho}_n(h))'$, $\rho(h) = (\rho(1),\ldots,\rho(h))'$ and

\[
(W_h)_{ij} = \sum_{k=1}^{\infty}\{\rho(k+i)+\rho(k-i)-2\rho(i)\rho(k)\}\{\rho(k+j)+\rho(k-j)-2\rho(j)\rho(k)\}. \tag{6.6}
\]

Equation (6.6) is known as Bartlett's formula. In Section 6.3 we apply the method for checking for correlation in a time series. We first show how the expression for the asymptotic variance is obtained.
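Under the null hypothesis that the sequence is iid, $\rho(k) = 0$ for $k \ne 0$, and (6.6) reduces to $(W_h)_{ii} = 1$; this is where the $\pm 1.96/\sqrt{n}$ bands drawn by R's acf come from. A minimal sketch in R (the simulated series is purely illustrative):

x <- arima.sim(list(ar = 0.5), n = 200)      # any example series
rho <- acf(x, plot = FALSE)$acf[-1]          # sample autocorrelations at lags 1, 2, ...
n <- length(x)
which(abs(rho) > 1.96 / sqrt(n))             # lags falling outside the Bartlett bands

For a genuinely uncorrelated series roughly 5% of lags will fall outside the bands by chance, so isolated exceedances should not be over-interpreted.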
6.2.2 Proof of Bartlett's formula

The variance of the sample covariance in the case of strict stationarity. We first derive an expression for $\mathrm{var}[\hat{c}_n(r)]$ under the assumption that $\{X_t\}$ is a strictly stationary time series with finite fourth order moment, $\sum_k|c(k)| < \infty$ and, for all $r_1,r_2 \in \mathbb{Z}$, $\sum_k|\kappa_4(r_1,k,k+r_2)| < \infty$, where $\kappa_4(k_1,k_2,k_3) = \mathrm{cum}(X_0,X_{k_1},X_{k_2},X_{k_3})$.

Remark 6.2.1 (Strict stationarity and cumulants) We note that if the time series is strictly stationary then the cumulants are shift invariant (just as the covariance is):

\[
\mathrm{cum}(X_t,X_{t+k_1},X_{t+k_2},X_{t+k_3}) = \mathrm{cum}(X_0,X_{k_1},X_{k_2},X_{k_3}) = \kappa_4(k_1,k_2,k_3).
\]

A simple expansion shows that

\[
\mathrm{var}[\hat{c}_n(r)] = \frac{1}{n^2}\sum_{t,\tau=1}^{n-|r|}\mathrm{cov}\big(X_tX_{t+r},X_\tau X_{\tau+r}\big).
\]

One approach to the analysis of $\mathrm{cov}(X_tX_{t+r},X_\tau X_{\tau+r})$ is to expand it in terms of expectations,

\[
\mathrm{cov}(X_tX_{t+r},X_\tau X_{\tau+r}) = \mathrm{E}(X_tX_{t+r}X_\tau X_{\tau+r}) - \mathrm{E}(X_tX_{t+r})\mathrm{E}(X_\tau X_{\tau+r});
\]

however, it is not clear how this gives $\mathrm{var}[\hat{c}_n(r)] = O(n^{-1})$. Instead we observe that $\mathrm{cov}(X_tX_{t+r},X_\tau X_{\tau+r})$ is the covariance of a product of random variables, which belongs to the general class of cumulants of products of random variables. We now use a standard result on cumulants, which shows that

\[
\mathrm{cov}[XY,UV] = \mathrm{cov}[X,U]\mathrm{cov}[Y,V] + \mathrm{cov}[X,V]\mathrm{cov}[Y,U] + \mathrm{cum}(X,Y,U,V)
\]

(this result can be generalized to higher order cumulants; see ?). Using this result we have

\begin{align*}
\mathrm{var}[\hat{c}_n(r)] &= \frac{1}{n^2}\sum_{t,\tau=1}^{n-|r|}\Big(\underbrace{\mathrm{cov}(X_t,X_\tau)}_{=c(t-\tau)\text{ by stationarity}}\mathrm{cov}(X_{t+r},X_{\tau+r}) + \mathrm{cov}(X_t,X_{\tau+r})\mathrm{cov}(X_{t+r},X_\tau) + \mathrm{cum}(X_t,X_{t+r},X_\tau,X_{\tau+r})\Big)\\
&= \frac{1}{n^2}\sum_{t,\tau=1}^{n-|r|}\Big[c(t-\tau)^2 + c(t-\tau-r)c(t+r-\tau) + \kappa_4(r,\tau-t,\tau+r-t)\Big] := I + II + III,
\end{align*}

where the above is due to strict stationarity of the time series. We analyse the above term by term. Either (i) by changing variables, letting $k = t-\tau$, and changing the limits of the summand appropriately, or (ii) observing that $\sum_{t,\tau=1}^{n-|r|}c(t-\tau)^2$ is the sum of the elements of the Toeplitz matrix

\[
\begin{pmatrix}
c(0)^2 & c(1)^2 & \ldots & c(n-|r|-1)^2\\
c(-1)^2 & c(0)^2 & \ldots & c(n-|r|-2)^2\\
\vdots & & \ddots & \vdots\\
c(-(n-|r|-1))^2 & c(-(n-|r|-2))^2 & \ldots & c(0)^2
\end{pmatrix}
\]
161 For all k, ( k /n)c(k) 2 c(k) 2 and n r k= (n r ) ( k /n)c(k)2 k c(k)2, thus by dominated convergence (see Chapter A) n k= (n r ) ( k /n)c(k)2 k= c(k)2. This gives I = n k= Using a similar argument we can show that II = n k= c(k) 2 + o( n ). c(k + r)c(k r) + o( n ). To derive the limit of III, again we use a change of variables to give III = n n r k= (n r ) ( n r k n ) k 4 (r, k, k + r). To bound we note that for all k, ( k /n)k 4 (r, k, k + r) k 4 (r, k, k + r) and n r k= (n r ) ( k /n)k 4 (r, k, k + r) k k 4(r, k, k + r), thus by dominated convergence we have n k= (n r ) ( k /n)k 4 (r, k, k + r) k= k 4(r, k, k + r). This gives Therefore altogether we have nvar[ĉ n (r)] = k= III = κ 4 (r, k, k + r) + o( n n ). c(k) 2 + Using similar arguments we obtain ncov[ĉ n (r ), ĉ n (r 2 )] = k= k k= c(k)c(k + r r 2 ) + c(k + r)c(k r) + k= k= c(k r )c(k + r 2 ) + κ 4 (r, k, k + r) + o(). k= κ 4 (r, k, k + r 2 ) + o(). We observe that the covariance of the covariance estimator contains both covariance and cumulants terms. Thus if we need to estimate them, for example to construct confidence intervals, this can be extremely difficult. However, we show below that under linearity the above fourth order cumulant term has a simpler form. 60
The covariance of the sample covariance under linearity. We recall that

\[
\sum_{k=-\infty}^{\infty}c(k+r_1-r_2)c(k) + \sum_{k=-\infty}^{\infty}c(k-r_1)c(k+r_2) + \sum_{k=-\infty}^{\infty}\kappa_4(r_1,k,k+r_2) + o(1) = T_1 + T_2 + T_3 + o(1).
\]

We now show that under linearity $T_3$ (the fourth order cumulant term) has a much simpler form. Let us suppose that the time series is linear:

\[
X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j},
\]

where $\sum_j|\psi_j| < \infty$, $\{\varepsilon_t\}$ are iid, $\mathrm{E}(\varepsilon_t) = 0$, $\mathrm{var}(\varepsilon_t) = 1$ and $\kappa_4 = \mathrm{cum}_4(\varepsilon_t)$. Then $T_3$ is

\begin{align*}
T_3 &= \sum_{k=-\infty}^{\infty}\mathrm{cum}\Big(\sum_{j_1}\psi_{j_1}\varepsilon_{-j_1},\sum_{j_2}\psi_{j_2}\varepsilon_{r_1-j_2},\sum_{j_3}\psi_{j_3}\varepsilon_{k-j_3},\sum_{j_4}\psi_{j_4}\varepsilon_{k+r_2-j_4}\Big)\\
&= \sum_{k=-\infty}^{\infty}\sum_{j_1,\ldots,j_4=-\infty}^{\infty}\psi_{j_1}\psi_{j_2}\psi_{j_3}\psi_{j_4}\,\mathrm{cum}\big(\varepsilon_{-j_1},\varepsilon_{r_1-j_2},\varepsilon_{k-j_3},\varepsilon_{k+r_2-j_4}\big).
\end{align*}

Standard results on cumulants (which can be proved using the characteristic function) show that $\mathrm{cum}[Y_1,Y_2,\ldots,Y_n] = 0$ if any of these variables is independent of all the others. Applying this result to $\mathrm{cum}(\varepsilon_{-j_1},\varepsilon_{r_1-j_2},\varepsilon_{k-j_3},\varepsilon_{k+r_2-j_4})$ reduces $T_3$ to

\[
T_3 = \kappa_4\sum_{k=-\infty}^{\infty}\sum_{j=-\infty}^{\infty}\psi_j\psi_{j-r_1}\psi_{j-k}\psi_{j-r_2-k}.
\]

Using the change of variables $j_1 = j$ and $j_2 = j - k$ we have

\[
T_3 = \kappa_4\Big(\sum_{j_1=-\infty}^{\infty}\psi_{j_1}\psi_{j_1-r_1}\Big)\Big(\sum_{j_2=-\infty}^{\infty}\psi_{j_2}\psi_{j_2-r_2}\Big) = \kappa_4c(r_1)c(r_2),
\]

recalling the covariance of a linear process from Lemma 3.1.1. Altogether this gives

\[
n\,\mathrm{cov}[\hat{c}_n(r_1),\hat{c}_n(r_2)] = \sum_{k=-\infty}^{\infty}c(k)c(k+r_1-r_2) + \sum_{k=-\infty}^{\infty}c(k-r_1)c(k+r_2) + \kappa_4c(r_1)c(r_2) + o(1). \tag{6.7}
\]

Thus in the case of linearity our expression for the variance is simpler, and the only difficult
parameter to estimate is $\kappa_4$, which can be done using various methods.

The variance of the sample correlation under linearity. A surprising twist in the story is that (6.7) can be reduced further if we are interested in estimating the correlation rather than the covariance. We recall that the sample correlation is

\[
\hat{\rho}_n(r) = \frac{\hat{c}_n(r)}{\hat{c}_n(0)},
\]

which is an estimator of $\rho(r) = c(r)/c(0)$.

Lemma 6.2.2 (Bartlett's formula) Suppose $\{X_t\}$ is a linear time series, where $\sum_j|\psi_j| < \infty$. Then the asymptotic variance of $\hat{\rho}_n(r)$ is

\[
\sum_{k=1}^{\infty}\{\rho(k+r)+\rho(k-r)-2\rho(r)\rho(k)\}^2.
\]

PROOF. By making a Taylor expansion of $\hat{c}_n(0)^{-1}$ about $c(0)^{-1}$ we have

\begin{align*}
\hat{\rho}_n(r) - \rho(r) &= \frac{\hat{c}_n(r)}{\hat{c}_n(0)} - \frac{c(r)}{c(0)}
= \frac{[\hat{c}_n(r)-c(r)]}{c(0)} - \underbrace{[\hat{c}_n(0)-c(0)]\frac{\hat{c}_n(r)}{c(0)^2}}_{\text{replace }\hat{c}_n(r)\text{ with }c(r)} + \underbrace{[\hat{c}_n(0)-c(0)]^2\frac{\hat{c}_n(r)}{\bar{c}_n(0)^3}}_{=O_p(n^{-1})}\\
&= \frac{[\hat{c}_n(r)-c(r)]}{c(0)} - [\hat{c}_n(0)-c(0)]\frac{c(r)}{c(0)^2} + O_p\Big(\frac{1}{n}\Big) := A_n + O_p\Big(\frac{1}{n}\Big),
\end{align*}

where $\bar{c}_n(0)$ lies between $\hat{c}_n(0)$ and $c(0)$. We observe that the remainder terms above are of order $O_p(n^{-1})$ (by (6.7) and because $c(0)$ is bounded away from zero), and the dominating term is $A_n$, which is of order $O_p(n^{-1/2})$ (again by (6.7)). Thus the limiting distribution of $\hat{\rho}_n(r) - \rho(r)$ is determined by $A_n$, and the variance of the limiting distribution is also determined by $A_n$. It is straightforward to show that

\[
n\,\mathrm{var}[A_n] = \frac{n\,\mathrm{var}[\hat{c}_n(r)]}{c(0)^2} - 2n\,\mathrm{cov}[\hat{c}_n(r),\hat{c}_n(0)]\frac{c(r)}{c(0)^3} + n\,\mathrm{var}[\hat{c}_n(0)]\frac{c(r)^2}{c(0)^4}. \tag{6.8}
\]
By using (6.7) we have

\begin{align*}
n\,\mathrm{var}[\hat{c}_n(r)] &= \sum_{k}c(k)^2 + \sum_kc(k+r)c(k-r) + \kappa_4c(r)^2 + o(1),\\
n\,\mathrm{cov}[\hat{c}_n(r),\hat{c}_n(0)] &= 2\sum_kc(k)c(k+r) + \kappa_4c(r)c(0) + o(1),\\
n\,\mathrm{var}[\hat{c}_n(0)] &= 2\sum_kc(k)^2 + \kappa_4c(0)^2 + o(1).
\end{align*}

Substituting the above into (6.8) gives

\begin{align*}
n\,\mathrm{var}[A_n] &= \frac{1}{c(0)^2}\Big(\sum_kc(k)^2 + \sum_kc(k+r)c(k-r) + \kappa_4c(r)^2\Big)\\
&\quad - \frac{2c(r)}{c(0)^3}\Big(2\sum_kc(k)c(k+r) + \kappa_4c(r)c(0)\Big) + \frac{c(r)^2}{c(0)^4}\Big(2\sum_kc(k)^2 + \kappa_4c(0)^2\Big) + o(1).
\end{align*}

Focusing on the fourth order cumulant terms, we see that they cancel ($\kappa_4c(r)^2/c(0)^2 - 2\kappa_4c(r)^2/c(0)^2 + \kappa_4c(r)^2/c(0)^2 = 0$), which gives the result. □

To prove Theorem 6.2.1, we simply use the lemma to obtain an asymptotic expression for the variance, then we use $A_n$ to show asymptotic normality of $\hat{c}_n(r)$ (under linearity).

Exercise 6.1 Under the assumption that $\{X_t\}$ are iid random variables, show that $\hat{c}_n(1)$ is asymptotically normal. Hint: Let $m = n/(b+1)$ and partition the sum $\sum_{t=1}^{n-1}X_tX_{t+1}$ as follows:

\begin{align*}
\sum_{t=1}^{n-1}X_tX_{t+1} &= \Big(\sum_{t=1}^{b}X_tX_{t+1}\Big) + X_{b+1}X_{b+2} + \Big(\sum_{t=(b+1)+1}^{(b+1)+b}X_tX_{t+1}\Big) + X_{2(b+1)}X_{2(b+1)+1} + \cdots\\
&\quad + X_{(m-1)(b+1)}X_{(m-1)(b+1)+1} + \sum_{t=(m-1)(b+1)+1}^{n-1}X_tX_{t+1}
= \sum_{j=0}^{m-1}U_{b,j} + \sum_{j=1}^{m-1}X_{j(b+1)}X_{j(b+1)+1},
\end{align*}

where $U_{b,j} = \sum_{t=j(b+1)+1}^{j(b+1)+b}X_tX_{t+1}$. Show that the second term in the above sum is asymptotically negligible, and show that the classical CLT for iid random variables can be applied to the first term.
Exercise 6.2 Under the assumption that \{X_t\} is an MA(1) process, show that \hat{c}_n(1) is asymptotically normal.

Exercise 6.3 The block bootstrap scheme is a commonly used method for estimating the finite sample distribution of a statistic (which includes its variance). The aim in this exercise is to see how well the bootstrap variance approximates the finite sample variance of a statistic.

(i) In R, write a function to calculate the autocovariance \hat{c}_n(1) = n^{-1} \sum_{t=1}^{n-1} X_t X_{t+1}. Remember a function is defined as cov = function(x){...}.

(ii) Load the library boot (library("boot")) into R. We will use the block bootstrap, which partitions the data into blocks of length l and then samples from the blocks n/l times to construct a new bootstrap time series of length n. For each bootstrap time series the covariance is evaluated, and this is done R times. The variance is calculated based on these R bootstrap estimates. You will need to use the function

tsboot(tseries, statistic, R = 100, l = 20, sim = "fixed").

Here tseries refers to the original data, statistic to the function you wrote in part (i) (which should only be a function of the data), R is the number of bootstrap replications and l is the length of the block. Note that tsboot(tseries, statistic, R = 100, l = 20, sim = "fixed")$t will be a vector of length R = 100 which contains the bootstrap statistics; you can calculate the variance of this vector.

(iii) Simulate the AR(2) time series arima.sim(list(order = c(2, 0, 0), ar = c(1.5, -0.75)), n = 128) 500 times. For each realisation calculate the sample autocovariance at lag one and also the bootstrap variance.

(iv) Calculate the mean of the bootstrap variances and also the mean squared error (compared with the empirical variance); how does the bootstrap perform?

(v) Play around with the bootstrap block length l. Observe how the block length can influence the result.
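To make the exercise concrete, the following is a minimal sketch of parts (i)-(iii) for a single realisation (the function name cov1 and the parameter choices are our own):

cov1 <- function(x){
  n <- length(x)
  sum(x[1:(n - 1)] * x[2:n]) / n        # sample autocovariance at lag one
}
library("boot")
test <- arima.sim(list(order = c(2, 0, 0), ar = c(1.5, -0.75)), n = 128)
bb <- tsboot(test, cov1, R = 100, l = 20, sim = "fixed")
var(bb$t)                                # block bootstrap variance estimate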
Remark The above would appear to be a nice trick, but there are two major factors that lead to the cancellation of the fourth order cumulant term:

(i) linearity of the time series;
(ii) the ratio between \hat{c}_n(r) and \hat{c}_n(0).

Indeed this is not a chance result; there is a logical reason why this result is true (and it is true for many statistics which have a similar form, commonly called ratio statistics). It is easiest explained in the Fourier domain. If the estimator can be written as

( n^{-1} \sum_{k=1}^{n} \phi(\omega_k) I_n(\omega_k) ) / ( n^{-1} \sum_{k=1}^{n} I_n(\omega_k) ),

where I_n(\omega) is the periodogram and \{X_t\} is a linear time series, then we will show later that the asymptotic distribution of the above has a variance which is only in terms of the covariances, not higher order cumulants. We prove this result later, when we consider estimation in the frequency domain.

6.3 Using Bartlett's formula for checking for correlation

Bartlett's formula is commonly used to check, by eye, whether a time series is uncorrelated (there are more sensitive tests, but this one is often used to construct confidence intervals for the sample autocorrelations in several statistical packages). This is an important problem, for many reasons:

Given a data set, we need to check whether there is dependence; if there is, we need to analyse it in a different way.

Suppose we fit a linear regression to time series data. We may want to check whether the residuals are actually uncorrelated, else the standard errors based on the assumption of uncorrelatedness would be unreliable.

We need to check whether a time series model is the appropriate model. To do this we fit the model and estimate the residuals. If the residuals appear to be uncorrelated, it would seem likely that the model is correct. If they are correlated, then the model is inappropriate. For example, we may fit an AR(1) to the data and estimate the residuals \hat{\varepsilon}_t; if there is still correlation in the residuals, then the AR(1) was not the correct model, since X_t - \hat{\phi} X_{t-1} is still correlated (which it would not be, if it were the correct model). A small illustration of this diagnostic is sketched below.
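As a quick illustration (our own example, not from the text): fit an AR(1) to data generated from an AR(2) and inspect the residual ACF.

set.seed(1)
x <- arima.sim(list(order = c(2, 0, 0), ar = c(1.5, -0.75)), n = 200)
fit1 <- arima(x, order = c(1, 0, 0), include.mean = FALSE)
acf(residuals(fit1))    # several lags fall outside the error bars
fit2 <- arima(x, order = c(2, 0, 0), include.mean = FALSE)
acf(residuals(fit2))    # consistent with uncorrelated residuals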
[Figure 6.1: The sample ACF of an iid sample with error bars (sample size n = 200).]

We now apply Theorem 6.2.1 to the case that the time series consists of iid random variables. Suppose \{X_t\} are iid random variables; then it is clear that this is a trivial example of a (not necessarily Gaussian) linear process. We use (6.3) as an estimator of the autocovariances. To derive the asymptotic variance of \{\hat{c}_n(r)\}, we recall that if \{X_t\} are iid then \rho(k) = 0 for k \neq 0. Substituting this into (6.6) we see that

\sqrt{n}(\hat{\rho}_n(h) - \rho(h)) \to^D N(0, W_h), where (W_h)_{ij} = 1 if i = j and 0 if i \neq j.

In other words, \sqrt{n}(\hat{\rho}_n(h) - \rho(h)) \to^D N(0, I_h). Hence the sample autocorrelations at different lags are asymptotically uncorrelated and have variance one. This allows us to easily construct error bars for the sample autocorrelations under the assumption of independence. If the vast majority of the sample autocorrelations lie inside the error bars, there is not enough evidence to reject the hypothesis that the data is a realisation of iid random variables (often called a white noise process). An example of the empirical ACF and error bars is given in Figure 6.1. We see that the empirical autocorrelations of the realisation from iid random variables all lie within the error bars. In contrast, in Figure 6.2 we give a plot of the sample ACF of an AR(2). We observe that a large number of the sample autocorrelations lie outside the error bars.
[Figure 6.2: Top: the sample ACF of the AR(2) process X_t = 1.5X_{t-1} - 0.75X_{t-2} + \varepsilon_t with error bars (n = 200). Bottom: the true ACF.]

Of course, simply checking by eye means that we risk misconstruing a sample coefficient that lies outside the error bars as meaning that the time series is correlated, whereas this could simply be a false positive (due to multiple testing). To counter this problem, we construct a test statistic for testing uncorrelatedness. Since under the null \sqrt{n}(\hat{\rho}_n(h) - \rho(h)) \to^D N(0, I_h), one method of testing is to use the squared correlations

S_h = n \sum_{r=1}^{h} \hat{\rho}_n(r)^2; (6.9)

under the null it will asymptotically have a \chi^2-distribution with h degrees of freedom, while under the alternative it will be a non-central (generalised) chi-squared. The non-centrality is what makes us reject the null if the alternative of correlatedness is true. This is known as the Box-Pierce test (a test which gives better finite sample results is the Ljung-Box test). Of course, a big question is how to select h. In general, we do not have to use a large h, since most correlations will arise when r is small. However, the choice of h will have an influence on power. If h is too large the test will lose power (since the mean of the chi-squared grows with h); on the other hand, choosing h too small may mean that correlations at higher lags are missed. How to select h is discussed in several papers; see for example Escanciano and Lobato (2009).
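Both tests are available in base R through Box.test; a quick example (the lag choice h = 10 is our own):

x <- arima.sim(list(order = c(2, 0, 0), ar = c(1.5, -0.75)), n = 200)
Box.test(x, lag = 10, type = "Box-Pierce")   # S_h with h = 10
Box.test(x, lag = 10, type = "Ljung-Box")    # finite sample correction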
6.4 Long range dependence versus changes in the mean

A process is said to have long range dependence if the autocovariances are not absolutely summable, i.e. \sum_k |c(k)| = \infty. From a practical point of view, data is said to exhibit long range dependence if the autocovariances do not decay very fast to zero as the lag increases. Returning to the Yahoo data considered in Section 4.1.1, we recall that the ACF plot of the absolute log differences, given again in Figure 6.3, appears to exhibit this type of behaviour.

[Figure 6.3: ACF plot of the absolute values of the log differences.]

However, it has been argued by several authors that the appearance of long memory is really because a time-dependent mean has not been corrected for. Could this be the reason we see the long memory in the log differences? We now demonstrate that one must be careful when diagnosing long range dependence, because a slow (or absent) decay of the autocovariance could also indicate a time-dependent mean that has not been corrected for. This was shown in Bhattacharya et al. (1983), and applied to econometric data in Mikosch and Stărică (2000) and Mikosch and Stărică (2003). A test for distinguishing between long range dependence and change points is proposed in Berkes et al. (2006).

Suppose that Y_t satisfies Y_t = \mu_t + \varepsilon_t, where \{\varepsilon_t\} are iid random variables and the mean \mu_t depends on t. We observe \{Y_t\} but do not know the mean is changing. We want to evaluate the autocovariance function, hence we estimate the
autocovariance at lag k using

\hat{c}_n(k) = n^{-1} \sum_{t=1}^{n-k} (Y_t - \bar{Y}_n)(Y_{t+k} - \bar{Y}_n).

Observe that \bar{Y}_n is not really estimating the mean but the average mean! If we plotted the empirical ACF \{\hat{c}_n(k)\} we would see that the covariances do not decay with the lag. However, the true ACF would be zero at all lags except lag zero. The reason the empirical ACF does not decay to zero is that we have not corrected for the time-dependent mean. Indeed it can be shown that

\hat{c}_n(k) = n^{-1} \sum_{t=1}^{n-k} (Y_t - \mu_t + \mu_t - \bar{Y}_n)(Y_{t+k} - \mu_{t+k} + \mu_{t+k} - \bar{Y}_n)
\approx n^{-1} \sum_{t=1}^{n-k} (Y_t - \mu_t)(Y_{t+k} - \mu_{t+k}) + n^{-1} \sum_{t=1}^{n-k} (\mu_t - \bar{Y}_n)(\mu_{t+k} - \bar{Y}_n)
\approx c(k) [the true autocovariance, which is 0 for k \neq 0] + n^{-1} \sum_{t=1}^{n-k} (\mu_t - \bar{Y}_n)(\mu_{t+k} - \bar{Y}_n) [the additional term due to the time-dependent mean].

Expanding the second term and assuming that k << n and \mu_t \approx \mu(t/n) (and is thus smooth), we have

n^{-1} \sum_{t=1}^{n-k} (\mu_t - \bar{Y}_n)(\mu_{t+k} - \bar{Y}_n) \approx n^{-1} \sum_{t=1}^{n} \mu_t^2 - ( n^{-1} \sum_{t=1}^{n} \mu_t )^2 + o_p(1) = n^{-2} \sum_{s=1}^{n} \sum_{t=1}^{n} \mu_t(\mu_t - \mu_s) + o_p(1).

Therefore, symmetrising the double sum,

n^{-1} \sum_{t=1}^{n-k} (\mu_t - \bar{Y}_n)(\mu_{t+k} - \bar{Y}_n) \approx (2n^2)^{-1} \sum_{s=1}^{n} \sum_{t=1}^{n} (\mu_t - \mu_s)^2.

Thus we observe that the sample covariances are positive and do not tend to zero for large lags.
This gives the false impression of long memory. It should be noted that if you study a realisation of a time series with a large amount of dependence, it is unclear whether what you see is actually a stochastic time series or an underlying trend. This makes disentangling a trend from data with a large amount of correlation extremely difficult.
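A quick simulation (our own construction) illustrates the effect: iid noise plus a smooth, slowly varying mean produces a sample ACF that decays very slowly, mimicking long memory.

set.seed(2)
n <- 512
mu <- 2 * sin(2 * pi * (1:n) / n)   # smooth time-dependent mean
y <- mu + rnorm(n)                  # true autocorrelation is zero
acf(y, lag.max = 50)                # yet the sample ACF decays very slowly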
Chapter 7

Parameter estimation

Prerequisites

The Gaussian likelihood.

Objectives

To be able to derive the Yule-Walker and least squares estimators of the AR parameters.

To understand what the quasi-Gaussian likelihood for the estimation of ARMA models is, and how the Durbin-Levinson algorithm is useful in obtaining this likelihood (in practice); also how we can approximate it by using approximations of the predictions.

To understand that there exist alternative methods for estimating the ARMA parameters, which exploit the fact that the ARMA can be written as an AR(\infty).

We will consider various methods for estimating the parameters in a stationary time series. We first consider estimating the parameters of AR and ARMA processes. It is worth noting that we will look at maximum likelihood estimators for the AR and ARMA parameters. The maximum likelihood will be constructed as if the observations were Gaussian. However, these estimators work both when the process is Gaussian and when it is non-Gaussian. In the non-Gaussian case, the likelihood simply acts as a contrast function (and is commonly called the quasi-likelihood). In time series, the distribution of the random variables is often unknown and the notion of likelihood has little meaning. Instead we seek methods that give good estimators of the parameters, meaning that they are consistent and as close to efficiency as possible without placing too many assumptions on
the distribution. We need to free ourselves from the notion of the quasi-likelihood acting as an actual likelihood (and attaining the Cramér-Rao lower bound).

7.1 Estimation for Autoregressive models

Let us suppose that \{X_t\} is a zero mean stationary time series which satisfies the AR(p) representation

X_t = \sum_{j=1}^{p} \phi_j X_{t-j} + \varepsilon_t,

where E(\varepsilon_t) = 0, var(\varepsilon_t) = \sigma^2 and the roots of the characteristic polynomial 1 - \sum_{j=1}^{p} \phi_j z^j lie outside the unit circle. We will assume that the AR(p) is causal (the techniques discussed here will not consistently estimate the parameters in the case that the process is non-causal; they will only consistently estimate the corresponding causal model). Our aim in this section is to construct estimators of the AR parameters \{\phi_j\}. We will show that in the case that \{X_t\} has an AR(p) representation the estimation is relatively straightforward, and the estimation methods all have properties which are asymptotically equivalent to the Gaussian maximum likelihood estimator.

7.1.1 The Yule-Walker estimator

The Yule-Walker estimator is based on the Yule-Walker equations derived in (3.4) (Section 3.1.4). We recall that the Yule-Walker equations state that if an AR process is causal, then for i > 0 we have

E(X_t X_{t-i}) = \sum_{j=1}^{p} \phi_j E(X_{t-j} X_{t-i}), that is c(i) = \sum_{j=1}^{p} \phi_j c(i-j). (7.1)

Putting the cases 1 \le i \le p together, we can write the above as

r_p = \Sigma_p \phi_p, (7.2)

where (\Sigma_p)_{i,j} = c(i-j), (r_p)_i = c(i) and \phi_p = (\phi_1, \dots, \phi_p)'. Thus the autoregressive parameters solve these equations.
The Yule-Walker equations inspire the method of moments estimator called the Yule-Walker estimator. We use (7.2) as the basis of the estimator. It is clear that \hat{r}_p and \hat{\Sigma}_p, where (\hat{\Sigma}_p)_{i,j} = \hat{c}_n(i-j) and (\hat{r}_p)_i = \hat{c}_n(i), are estimators of r_p and \Sigma_p. Therefore we can use

\hat{\phi}_p = \hat{\Sigma}_p^{-1} \hat{r}_p (7.3)

as an estimator of the AR parameters \phi_p = (\phi_1, \dots, \phi_p)'. We observe that if p is large this involves inverting a large matrix. However, we can use the Durbin-Levinson algorithm to calculate \hat{\phi}_p by recursively fitting lower order AR processes to the observations and increasing the order; this way an explicit inversion can be avoided. We detail how the Durbin-Levinson algorithm can be used to estimate the AR parameters below (an implementation sketch follows the description).

Step 1 Set \hat{\phi}_{1,1} = \hat{c}_n(1)/\hat{c}_n(0) and \hat{r}_n(2) = \hat{c}_n(0)(1 - \hat{\phi}_{1,1}^2).

Step 2 For 2 \le t \le p, we define the recursion

\hat{\phi}_{t,t} = ( \hat{c}_n(t) - \sum_{j=1}^{t-1} \hat{\phi}_{t-1,j} \hat{c}_n(t-j) ) / \hat{r}_n(t),
\hat{\phi}_{t,j} = \hat{\phi}_{t-1,j} - \hat{\phi}_{t,t} \hat{\phi}_{t-1,t-j} for 1 \le j \le t-1,

and \hat{r}_n(t+1) = \hat{r}_n(t)(1 - \hat{\phi}_{t,t}^2).

Step 3 We recall from (5.2) that \phi_{t,t} is the partial correlation between X_{t+1} and X_1; therefore the \hat{\phi}_{t,t} are estimators of these partial correlations.

As mentioned in Step 3, the Yule-Walker estimators have the useful property that the partial correlations can easily be evaluated within the procedure. This is useful when trying to determine the order of the model to fit to the data. In Figure 7.1 we give the partial correlation plot corresponding to the AR(2) realisation considered in the previous chapter. Notice that only the first two terms lie outside the error bars. This rightly suggests the time series comes from an autoregressive process of order two. Assuming that \hat{\phi}_{t,t} is asymptotically normal, the error bars (confidence interval) are determined by the variance of \hat{\phi}_{t,t}, which can be obtained by deriving the variance of \hat{\phi}_t.
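The recursion is straightforward to implement directly; the following is a minimal sketch (our own code, with our own function name dl.yw), which can be checked against R's built-in ar.yw:

dl.yw <- function(x, p){
  # sample autocovariances c-hat_n(0),...,c-hat_n(p); chat[k+1] is lag k
  chat <- acf(x, lag.max = p, type = "covariance", plot = FALSE)$acf[, 1, 1]
  phi <- chat[2] / chat[1]              # Step 1: phi-hat_{1,1}
  r <- chat[1] * (1 - phi^2)            # r-hat_n(2)
  pacf.hat <- phi
  if (p >= 2) for (t in 2:p) {          # Step 2
    phi.tt <- (chat[t + 1] - sum(phi * chat[t:2])) / r
    phi <- c(phi - phi.tt * rev(phi), phi.tt)
    r <- r * (1 - phi.tt^2)
    pacf.hat <- c(pacf.hat, phi.tt)     # Step 3: partial correlations
  }
  list(ar = phi, pacf = pacf.hat)
}
x <- arima.sim(list(order = c(2, 0, 0), ar = c(1.5, -0.75)), n = 200)
dl.yw(x, 2)$ar
ar.yw(x, order.max = 2, aic = FALSE)$ar   # should agree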
[Figure 7.1: The sample partial autocorrelation plot of the AR(2) process X_t = 1.5X_{t-1} - 0.75X_{t-2} + \varepsilon_t with error bars (n = 200).]

The Yule-Walker estimator has the useful property that the parameter estimates \{\hat{\phi}_j; j = 1, \dots, p\} correspond to a causal AR(p); in other words, the roots of \hat{\phi}(z) = 1 - \sum_{j=1}^{p} \hat{\phi}_j z^j lie outside the unit circle. This is because the covariances \{\hat{c}_n(r)\} form a positive definite sequence; thus there exists a random vector Z_{p+1} = (Z_1, \dots, Z_{p+1})' with var[Z_{p+1}] = \hat{\Sigma}_{p+1}, where (\hat{\Sigma}_{p+1})_{i,j} = \hat{c}_n(i-j). Using this and the following result, it follows that \{\hat{\phi}_j; j = 1, \dots, p\} corresponds to a causal AR process.

Lemma 7.1.1 Let us suppose Z_{p+1} = (Z_1, \dots, Z_{p+1})' is a random vector with var[Z_{p+1}] = \Sigma_{p+1}, where (\Sigma_{p+1})_{i,j} = c(i-j) (which is Toeplitz). Let Z_{p+1|p} be the best linear predictor of Z_{p+1} given Z_p, \dots, Z_1, where \phi_p = (\phi_1, \dots, \phi_p)' = \Sigma_p^{-1} r_p are the coefficients of the best linear predictor. Then the roots of the corresponding characteristic polynomial \phi(z) = 1 - \sum_{j=1}^{p} \phi_j z^j lie outside the unit circle.

PROOF. We first note that by the definition of the best linear predictor, for any coefficients \{a_j\} we have the inequality

E( Z_{p+1} - \sum_{j=1}^{p} \phi_j Z_{p+1-j} )^2 = E(\phi(B) Z_{p+1})^2 \le E( Z_{p+1} - \sum_{j=1}^{p} a_j Z_{p+1-j} )^2 = E(a(B) Z_{p+1})^2. (7.4)

We use the above inequality to prove the result by contradiction. Let us suppose that there exists at least one root of \phi(z) which lies inside the unit circle. We denote this root by \lambda^{-1} (so |\lambda| > 1) and factorise \phi(z) = (1 - \lambda z) R(z), where R(z) contains the remaining roots, which can be either inside or outside the unit circle. Define the two new random variables Y_{p+1} = R(B) Z_{p+1} and Y_p = R(B) Z_p (where B acts as the backshift operator), which are linear combinations of Z_{p+1}, \dots, Z_2 and Z_p, \dots, Z_1 respectively, i.e. Y_{p+1} = \sum_{i=0}^{p-1} R_i Z_{p+1-i} and Y_p = \sum_{i=0}^{p-1} R_i Z_{p-i}. The most important observation
in this construction is that the matrix \Sigma_{p+1} is Toeplitz (i.e. \{Z_t\} is a stationary vector); therefore Y_{p+1} and Y_p have the same covariance structure, in particular they have the same variance. Let

\rho = cov[Y_{p+1}, Y_p]/var[Y_p] = cov[Y_{p+1}, Y_p]/\sqrt{var[Y_p] var[Y_{p+1}]} = cor(Y_{p+1}, Y_p),

where the second equality holds by stationarity; in particular |\rho| \le 1. We recall that \rho Y_p is the best linear predictor of Y_{p+1} given Y_p.

We now return to the proof. We start by defining a new polynomial \phi^*(z) = (1 - \rho z) R(z). We will show that E[\phi^*(B) Z_{p+1}]^2 \le E[\phi(B) Z_{p+1}]^2, which by (7.4) leads to a contradiction. Evaluating E[\phi^*(B) Z_{p+1}]^2 we have

E[\phi^*(B) Z_{p+1}]^2 = E[(1 - \rho B) R(B) Z_{p+1}]^2 = E[R(B) Z_{p+1} - \rho B R(B) Z_{p+1}]^2
= E[Y_{p+1} - \rho Y_p]^2 \le E[Y_{p+1} - \lambda Y_p]^2 = E[\phi(B) Z_{p+1}]^2.

From (7.4), the above can only be true if \lambda = \rho; but |\rho| \le 1 < |\lambda|, a contradiction, and thus no root of \phi(z) can lie inside the unit circle. Hence all the roots of \phi(z) lie outside the unit circle, giving the required result.

The above result can immediately be used to show that the Yule-Walker estimators of the AR(p) coefficients yield a causal solution. Since the autocovariance estimators \{\hat{c}_n(r)\} form a positive semi-definite sequence, there exists a vector Y_{p+1} with var[Y_{p+1}] = \hat{\Sigma}_{p+1}, where (\hat{\Sigma}_{p+1})_{i,j} = \hat{c}_n(i-j); thus by the above lemma \hat{\Sigma}_p^{-1} \hat{r}_p are the coefficients of a causal AR process. We note that the intuitively obvious least squares estimator, defined below, does not necessarily have this property.

The least squares estimator can either be defined in its own right or be considered as the maximiser of the conditional Gaussian likelihood. We start by defining the Gaussian likelihood.

7.1.2 The Gaussian maximum likelihood

Our object here is to obtain the maximum likelihood estimator of the AR(p) parameters. We recall that the maximum likelihood estimator is the parameter which maximises the joint density of the observations. Since the log-likelihood often has a simpler form, we will focus on the log-likelihood. We note that the Gaussian MLE is constructed as if the observations \{X_t\} were Gaussian, though it
is not necessary that \{X_t\} is Gaussian when doing the estimation. In the case that the innovations are not Gaussian, the estimator will be less efficient (it will not attain the Cramér-Rao lower bound) than the likelihood constructed with the true distribution.

Suppose we observe \{X_t; t = 1, \dots, n\}, where the X_t are observations from an AR(p) process. Let us suppose for the moment that the innovations of the AR process are Gaussian; this implies that X_n = (X_1, \dots, X_n)' is an n-dimensional Gaussian random vector, with the corresponding log-likelihood (up to constants)

L_n(a) = -\log|\Sigma_n(a)| - X_n' \Sigma_n(a)^{-1} X_n, (7.5)

where \Sigma_n(a) is the variance-covariance matrix of X_n constructed as if X_n came from an AR process with parameters a. Of course, in practice the likelihood in the form given above is impossible to maximise. Therefore we need to rewrite the likelihood in a more tractable form.

We now derive a tractable form of the likelihood under the assumption that the innovations come from an arbitrary distribution. To construct the likelihood, we use the method of conditioning to write the likelihood as a product of conditional likelihoods. In order to do this, we derive the conditional distribution of X_{t+1} given X_t, \dots, X_1. We first note that the AR(p) process is p-Markovian; therefore if t \ge p, all the information about X_{t+1} is contained in the past p observations, so that

P(X_{t+1} \le x | X_t, X_{t-1}, \dots, X_1) = P(X_{t+1} \le x | X_t, X_{t-1}, \dots, X_{t-p+1}). (7.6)

Since the Markov property applies to the distribution function, it also applies to the density: f(X_{t+1} | X_t, \dots, X_1) = f(X_{t+1} | X_t, \dots, X_{t-p+1}). By using (7.6) we have

P(X_{t+1} \le x | X_t, \dots, X_{t-p+1}) = P_\varepsilon( \varepsilon \le x - \sum_{j=1}^{p} a_j X_{t+1-j} ), (7.7)

where P_\varepsilon denotes the distribution of the innovation. Differentiating (7.7) with respect to x gives

f(X_{t+1} | X_t, \dots, X_{t-p+1}) = f_\varepsilon( X_{t+1} - \sum_{j=1}^{p} a_j X_{t+1-j} ). (7.8)
Example 7.1.1 (AR(1)) To understand why (7.6) is true, consider the simple case p = 1 (AR(1)). Studying the conditional probability gives

P(X_{t+1} \le x_{t+1} | X_t = x_t, \dots, X_1 = x_1) = P( a x_t + \varepsilon_t \le x_{t+1} | X_t = x_t, \dots, X_1 = x_1 )
= P_\varepsilon( \varepsilon_t \le x_{t+1} - a x_t ) = P(X_{t+1} \le x_{t+1} | X_t = x_t),

since all the information about X_{t+1} contained in the past is contained in X_t, and where P_\varepsilon denotes the distribution function of the innovation \varepsilon.

Using (7.8) we can derive the joint density of \{X_t\}_{t=1}^n. By conditioning we obtain

f(X_1, X_2, \dots, X_n) = f(X_1, \dots, X_p) \prod_{t=p}^{n-1} f(X_{t+1} | X_t, \dots, X_1) (by repeated conditioning)
= f(X_1, \dots, X_p) \prod_{t=p}^{n-1} f(X_{t+1} | X_t, \dots, X_{t-p+1}) (by the Markov property)
= f(X_1, \dots, X_p) \prod_{t=p}^{n-1} f_\varepsilon( X_{t+1} - \sum_{j=1}^{p} a_j X_{t+1-j} ) (by (7.8)).

Therefore the full log-likelihood L_n(a; X_n) = \log f(X_1, X_2, \dots, X_n) is

\log f(X_1, X_2, \dots, X_n) = \log f(X_1, \dots, X_p) [initial observations] + \sum_{t=p}^{n-1} \log f_\varepsilon( X_{t+1} - \sum_{j=1}^{p} a_j X_{t+1-j} ) [conditional log-likelihood = \mathcal{L}_n(a; X_n)].

In the case that the sample size is large (n >> p), the contribution of the initial observations, \log f(X_1, \dots, X_p), is minimal, and the conditional log-likelihood and full log-likelihood are asymptotically equivalent.

So far we have not specified the distribution of \varepsilon. From now on we shall assume that it is Gaussian. In the case that \varepsilon is Gaussian, \log f(X_1, \dots, X_p) is the log density of a multivariate normal with mean zero (since we are assuming, for convenience, that the time series has zero mean) and variance \Sigma_p(a). We recall that \Sigma_p(a) is a Toeplitz matrix whose entries are determined by the AR parameters a; see (3.7). As can be seen from (3.7), the coefficients are buried within the covariance (which is expressed in terms of the roots of the characteristic polynomial), and this makes it quite an unpleasant part of the likelihood to
maximise. On the other hand, the conditional log-likelihood has a far simpler form:

\mathcal{L}_n(a; X) = -(n-p) \log \sigma^2 - \sigma^{-2} \sum_{t=p}^{n-1} ( X_{t+1} - \sum_{j=1}^{p} a_j X_{t+1-j} )^2.

The maximum likelihood estimator is

\hat{\phi}_n = \arg\max_{a \in \Theta} [ -\log|\Sigma_p(a)| - X_p' \Sigma_p(a)^{-1} X_p + \mathcal{L}_n(a; X) ]. (7.9)

By constraining the parameter space, we can ensure that the estimator corresponds to a causal AR process. However, it is clear that despite having the advantage of attaining the Cramér-Rao lower bound in the case that the innovations are Gaussian, this estimator is not simple to evaluate. A far simpler estimator can be obtained by focusing on the conditional log-likelihood \mathcal{L}_n(a; X) alone. An explicit expression for its maximum can easily be obtained (so long as we do not constrain the parameter space): it is simply the least squares estimator. In other words, \tilde{\phi}_p = \arg\max \mathcal{L}_n(a; X) satisfies

\tilde{\phi}_p = \tilde{\Sigma}_p^{-1} \tilde{r}_p, where (\tilde{\Sigma}_p)_{i,j} = (n-p)^{-1} \sum_{t=p+1}^{n} X_{t-i} X_{t-j} and (\tilde{r}_p)_i = (n-p)^{-1} \sum_{t=p+1}^{n} X_t X_{t-i}.

Remark 7.1.1 (A comparison of the Yule-Walker and least squares estimators) Comparing the least squares estimator \tilde{\phi}_p = \tilde{\Sigma}_p^{-1} \tilde{r}_p with the Yule-Walker estimator \hat{\phi}_p = \hat{\Sigma}_p^{-1} \hat{r}_p, we see that they are very similar. The difference lies in \tilde{\Sigma}_p and \hat{\Sigma}_p (and the corresponding \tilde{r}_p and \hat{r}_p). We see that \hat{\Sigma}_p is a Toeplitz matrix, defined entirely by the positive definite sequence \hat{c}_n(r). On the other hand, \tilde{\Sigma}_p is not a Toeplitz matrix; the estimator of c(r) changes subtly in each row. This means that the proof given in Lemma 7.1.1 cannot be applied to the least squares estimator, as it relies on the matrix \Sigma_{p+1} (which is a combination of \Sigma_p and r_p) being Toeplitz (thus stationary). Thus the characteristic polynomial corresponding to the least squares estimator will not necessarily have roots which lie outside the unit circle.
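A sketch of the least squares estimator as a plain regression (our own code; for an AR(2) it can be compared with the Yule-Walker fit):

x <- arima.sim(list(order = c(2, 0, 0), ar = c(1.5, -0.75)), n = 500)
n <- length(x)
X <- cbind(x[2:(n - 1)], x[1:(n - 2)])    # lagged regressors, no intercept
y <- x[3:n]
solve(t(X) %*% X, t(X) %*% y)             # least squares estimate of (phi_1, phi_2)
ar.yw(x, order.max = 2, aic = FALSE)$ar   # compare with Yule-Walker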
Example 7.1.2 (Toy example) To illustrate the difference between the Yule-Walker and least squares estimators (at least for small samples), consider the rather artificial example in which the time series consists of two observations X_1 and X_2 (we will assume the mean is zero). We fit an AR(1) model to the data. The least squares estimator of the AR(1) parameter is

\tilde{\phi}_{LS} = X_1 X_2 / X_1^2,

whereas the Yule-Walker estimator of the AR(1) parameter is

\hat{\phi}_{YW} = X_1 X_2 / (X_1^2 + X_2^2).

It is clear that |\tilde{\phi}_{LS}| < 1 only if |X_2| < |X_1|. On the other hand, |\hat{\phi}_{YW}| < 1 always; indeed, since (|X_1| - |X_2|)^2 \ge 0, we see that |\hat{\phi}_{YW}| \le 1/2.

Exercise 7.1 In R you can estimate the AR parameters using ordinary least squares (ar.ols), Yule-Walker (ar.yw) and (Gaussian) maximum likelihood (ar.mle). Simulate the causal AR(2) model X_t = 1.5X_{t-1} - 0.75X_{t-2} + \varepsilon_t using the routine arima.sim (which gives Gaussian realisations) and also with innovations from a t-distribution with 4 df. Use the sample sizes n = 100 and n = 500, and compare the three methods through a simulation study.

Exercise 7.2 None of these methods is able to consistently estimate the parameters of a non-causal AR(p) time series. This is because all of these methods are estimating the autocovariance function (regardless of whether the Yule-Walker or least squares method is used). It is possible that other criteria may give a consistent estimator, for example the \ell_1-norm

\mathcal{L}_n(\phi) = \sum_{t=p+1}^{n} | X_t - \sum_{j=1}^{p} \phi_j X_{t-j} |, with \hat{\phi}_n = \arg\min \mathcal{L}_n(\phi).

(i) Simulate a stationary solution of the non-causal AR(1) process X_t = 2X_{t-1} + \varepsilon_t, where the innovations come from a double exponential distribution, and estimate \phi using \mathcal{L}_n(\phi). Do this 100 times; does this estimator appear to consistently estimate 2?

(ii) Simulate a stationary solution of the non-causal AR(1) process X_t = 2X_{t-1} + \varepsilon_t, where the innovations come from a t-distribution with 4 df, and estimate \phi using \mathcal{L}_n(\phi). Do this 100 times; does this estimator appear to consistently estimate 2?

You will need to use a quantile regression package to minimise the \ell_1-norm. I suggest using the package quantreg and the function rq, where we set \tau = 0.5 (the median).
7.2 Estimation for ARMA models

Let us suppose that \{X_t\} satisfies the ARMA representation

X_t - \sum_{i=1}^{p} \phi_i X_{t-i} = \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j},

where \theta = (\theta_1, \dots, \theta_q), \phi = (\phi_1, \dots, \phi_p) and \sigma^2 = var(\varepsilon_t). We will suppose for now that p and q are known. The objective in this section is to consider various methods for estimating these parameters.

7.2.1 The Gaussian maximum likelihood estimator

We now derive the Gaussian maximum likelihood estimator (GMLE) of the parameters \theta and \phi. Let X_n = (X_1, \dots, X_n)'. The criterion (the GMLE) is constructed as if \{X_t\} were Gaussian, but this need not be the case. The likelihood is similar to the likelihood given in (7.5), and just as in the autoregressive case it cannot be directly maximised, i.e.

L_n(\phi, \theta, \sigma) = -\log|\Sigma_n(\phi, \theta, \sigma)| - X_n' \Sigma_n(\phi, \theta, \sigma)^{-1} X_n, (7.10)

where \Sigma_n(\phi, \theta, \sigma) is the variance-covariance matrix of X_n. However, in Section 5.3.3, equation (5.2), the Cholesky decomposition of \Sigma_n is given, and using this we can show that

X_n' \Sigma_n(\phi, \theta, \sigma)^{-1} X_n = X_1^2/r(1; \theta) + \sum_{t=1}^{n-1} ( X_{t+1} - \sum_{j=1}^{t} \phi_{t,j}(\theta) X_{t+1-j} )^2 / r(t+1; \theta),

where (with a slight abuse of notation) \theta = (\phi, \theta, \sigma^2). Furthermore, since \Sigma_n = L_n D_n L_n', we have \det(\Sigma_n) = \det(L_n)^2 \det(D_n) = \prod_{t=1}^{n} r(t; \theta), which implies \log|\Sigma_n(\phi, \theta, \sigma)| = \sum_{t=1}^{n} \log r(t; \theta). Thus the log-likelihood is

L_n(\phi, \theta, \sigma) = -\sum_{t=1}^{n} \log r(t; \theta) - X_1^2/r(1; \theta) - \sum_{t=1}^{n-1} ( X_{t+1} - \sum_{j=1}^{t} \phi_{t,j}(\theta) X_{t+1-j} )^2 / r(t+1; \theta).
We recall from (5.24) that the best linear predictor \sum_{j=1}^{t} \phi_{t,j}(\theta) X_{t+1-j} can be simplified by taking into account the ARMA structure:

X_n' \Sigma_n(\phi, \theta, \sigma)^{-1} X_n = X_1^2/r(1; \theta) + \sum_{t=1}^{\max(p,q)-1} ( X_{t+1} - \sum_{j=1}^{t} \phi_{t,j}(\theta) X_{t+1-j} )^2 / r(t+1; \theta)
+ \sum_{t=\max(p,q)}^{n-1} ( X_{t+1} - \sum_{j=1}^{p} \phi_j X_{t+1-j} - \sum_{i=1}^{q} \theta_{t,i} ( X_{t+1-i} - X_{t+1-i|t-i}(\theta) ) )^2 / r(t+1; \theta).

Substituting this into L_n(\theta) gives

L_n(\phi, \theta, \sigma) = -\sum_{t=1}^{n} \log r(t; \theta) - X_1^2/r(1; \theta) - \sum_{t=1}^{\max(p,q)-1} ( X_{t+1} - \sum_{j=1}^{t} \phi_{t,j}(\theta) X_{t+1-j} )^2 / r(t+1; \theta)
- \sum_{t=\max(p,q)}^{n-1} ( X_{t+1} - \sum_{j=1}^{p} \phi_j X_{t+1-j} - \sum_{i=1}^{q} \theta_{t,i} ( X_{t+1-i} - X_{t+1-i|t-i}(\theta) ) )^2 / r(t+1; \theta).

The maximum likelihood estimators are the parameters \hat{\theta}_n, \hat{\phi}_n, \hat{\sigma}_n^2 which maximise L_n(\theta).

We can also use an approximation to the log-likelihood which simplifies the estimation scheme. We recall that in Section 5.4 we approximated X_{t+1|t} with \hat{X}_{t+1|t}. This motivates the approximation in which we replace X_{t+1|t} in L_n(\theta) with \hat{X}_{t+1|t} (defined in (5.25)) and r(t; \theta) with \sigma^2, to give the approximate Gaussian log-likelihood

\tilde{L}_n(\theta) = -n \log \sigma^2 - \sigma^{-2} \sum_{t=1}^{n-1} [ X_{t+1} - \hat{X}_{t+1|t}(\theta) ]^2 = -n \log \sigma^2 - \sigma^{-2} \sum_{t=1}^{n-1} [ ( \theta(B)^{-1} \phi(B) )_{[t]} X_{t+1} ]^2,

where ( \theta(B)^{-1} \phi(B) )_{[t]} denotes the approximation of the power series \theta(B)^{-1}\phi(B) by its first t terms. This approximate likelihood greatly simplifies the estimation scheme, because the derivatives (the main tool used in maximising it) can easily be obtained. To do this we note that

d/d\theta_i [ \phi(B) \theta(B)^{-1} X_t ] = -B^i \phi(B) \theta(B)^{-2} X_t = -\phi(B) \theta(B)^{-2} X_{t-i}, (7.11)
d/d\phi_j [ \phi(B) \theta(B)^{-1} X_t ] = -B^j \theta(B)^{-1} X_t = -\theta(B)^{-1} X_{t-j},

(recalling that \phi(B) = 1 - \sum_j \phi_j B^j and \theta(B) = 1 + \sum_i \theta_i B^i);
therefore

d/d\theta_i ( \phi(B) \theta(B)^{-1} X_t )^2 = -2 ( \phi(B) \theta(B)^{-1} X_t ) ( \phi(B) \theta(B)^{-2} X_{t-i} ) and
d/d\phi_j ( \phi(B) \theta(B)^{-1} X_t )^2 = -2 ( \phi(B) \theta(B)^{-1} X_t ) ( \theta(B)^{-1} X_{t-j} ). (7.12)

Substituting this into the approximate likelihood gives the derivatives

\partial\tilde{L}_n/\partial\theta_i = (2/\sigma^2) \sum_t [ ( \theta(B)^{-1}\phi(B) )_{[t]} X_t ] [ ( \phi(B)\theta(B)^{-2} )_{[t-i]} X_{t-i} ],
\partial\tilde{L}_n/\partial\phi_j = (2/\sigma^2) \sum_t [ ( \theta(B)^{-1}\phi(B) )_{[t]} X_t ] [ ( \theta(B)^{-1} )_{[t-j]} X_{t-j} ],
\partial\tilde{L}_n/\partial\sigma^2 = -n/\sigma^2 + \sigma^{-4} \sum_t [ ( \theta(B)^{-1}\phi(B) )_{[t]} X_t ]^2. (7.13)

We then use the Newton-Raphson scheme to maximise the approximate likelihood. It can be shown that the approximate likelihood is close to the actual likelihood, and asymptotically both methods are equivalent.

Theorem 7.2.1 Let us suppose that X_t has a causal and invertible ARMA representation X_t - \sum_{j=1}^{p} \phi_j X_{t-j} = \varepsilon_t + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i}, where \{\varepsilon_t\} are iid random variables with mean zero and var[\varepsilon_t] = \sigma^2. Then the (quasi-)Gaussian maximum likelihood estimators satisfy

\sqrt{n} ( (\hat{\phi}_n - \phi)', (\hat{\theta}_n - \theta)' )' \to^D N(0, \Lambda^{-1}),

with

\Lambda = [ E(U_t U_t'), E(U_t V_t'); E(V_t U_t'), E(V_t V_t') ],

where U_t = (U_t, \dots, U_{t-p+1})' and V_t = (V_t, \dots, V_{t-q+1})', and \{U_t\} and \{V_t\} are the autoregressive processes which satisfy \phi(B) U_t = \varepsilon_t and \theta(B) V_t = \varepsilon_t.

We do not give the proof in this section; however it is possible to understand where this result comes from. We recall that the maximum likelihood and the approximate likelihood are
asymptotically equivalent. They are both approximations of the unobserved likelihood

\bar{L}_n(\theta) = -n \log \sigma^2 - \sigma^{-2} \sum_{t=1}^{n-1} [ X_{t+1} - X_{t+1|t,\dots,-\infty}(\theta) ]^2 = -n \log \sigma^2 - \sigma^{-2} \sum_{t=1}^{n-1} [ \theta(B)^{-1}\phi(B) X_{t+1} ]^2.

This likelihood is infeasible in the sense that it cannot be maximised, since the infinite past X_0, X_{-1}, \dots is unobserved; however it is a very convenient tool for the asymptotic analysis. Using Lemma 5.4.1 we can show that all three likelihoods L_n, \tilde{L}_n and \bar{L}_n are asymptotically equivalent. Therefore, to obtain the asymptotic sampling properties of L_n or \tilde{L}_n, we can simply consider the unobserved likelihood \bar{L}_n. To show asymptotic normality (we assume here that the estimators are consistent), we need to consider the first and second derivatives of \bar{L}_n (since the asymptotic properties are determined by Taylor expansions). In particular, we need the distribution of the first derivatives of \bar{L}_n at the true parameters and the expectation of the second derivatives at the true parameters. We note that by using (7.12) we have

\partial\bar{L}_n/\partial\theta_i = (2/\sigma^2) \sum_t [ \theta(B)^{-1}\phi(B) X_t ] [ \phi(B)\theta(B)^{-2} X_{t-i} ],
\partial\bar{L}_n/\partial\phi_j = (2/\sigma^2) \sum_t [ \theta(B)^{-1}\phi(B) X_t ] [ \theta(B)^{-1} X_{t-j} ]. (7.14)

Since we are considering the derivatives at the true parameters, we observe that \theta(B)^{-1}\phi(B) X_t = \varepsilon_t, and

\phi(B)\theta(B)^{-2} X_{t-i} = \phi(B)\theta(B)^{-2} \phi(B)^{-1}\theta(B) \varepsilon_{t-i} = \theta(B)^{-1} \varepsilon_{t-i} = V_{t-i},
\theta(B)^{-1} X_{t-j} = \theta(B)^{-1} \phi(B)^{-1}\theta(B) \varepsilon_{t-j} = \phi(B)^{-1} \varepsilon_{t-j} = U_{t-j}.

Thus \phi(B) U_t = \varepsilon_t and \theta(B) V_t = \varepsilon_t are autoregressive processes (compare with the theorem). This means that the derivatives of the unobserved likelihood at the true parameters can be written as

\partial\bar{L}_n/\partial\theta_i = (2/\sigma^2) \sum_t \varepsilon_t V_{t-i} and \partial\bar{L}_n/\partial\phi_j = (2/\sigma^2) \sum_t \varepsilon_t U_{t-j}. (7.15)

Note that by causality, \varepsilon_t is independent of U_{t-j} and V_{t-i}. Again, like many of the other estimators we
have encountered, these sums are mean-like, so we can show their normality by using a central limit theorem designed for dependent data. Indeed, we can show asymptotic normality of \{\partial\bar{L}_n/\partial\theta_i; i = 1, \dots, q\}, \{\partial\bar{L}_n/\partial\phi_j; j = 1, \dots, p\} and their linear combinations using the martingale central limit theorem, see Theorem 3.2 (and Corollary 3.1) of Hall and Heyde (1980) (note that one can also use m-dependence). Moreover, it is relatively straightforward to show that the vector n^{-1/2}(\partial\bar{L}_n/\partial\theta_i, \partial\bar{L}_n/\partial\phi_j) has a limiting variance determined by \Lambda. Finally, by taking the second derivative of the likelihood, we can show that the expectation of the normalised Hessian, E[n^{-1} \partial^2\bar{L}_n/\partial\theta^2], converges to a matrix proportional to \Lambda. This gives us the desired result.
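In practice, the Gaussian maximum likelihood estimator for ARMA models is implemented in R's arima function; a small example (the parameter values are our own choice):

x <- arima.sim(list(order = c(1, 0, 1), ar = 0.7, ma = 0.4), n = 500)
arima(x, order = c(1, 0, 1), include.mean = FALSE, method = "ML")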
7.2.2 The Hannan-Rissanen AR(\infty) expansion method

The methods detailed above require good initial values in order to begin the maximisation (in order to prevent convergence to a local maximum). We now describe a simple method first proposed in Hannan and Rissanen (1982) and An et al. (1982). It is worth bearing in mind that currently the large p, small n problem is a hot topic. These are generally regression problems where the sample size n is quite small but the number of regressors p is quite large (usually model selection is of importance in this context). The method proposed by Hannan and Rissanen involves expanding the ARMA process (assuming invertibility) as an AR(\infty) process and estimating the parameters of the AR(\infty) process. In some sense this can be considered as a regression problem with an infinite number of regressors; hence there are some parallels between the estimation described below and the large p, small n problem.

As we mentioned in Lemma 2.5.1, if an ARMA process is invertible it can be represented as

X_t = \sum_{j=1}^{\infty} b_j X_{t-j} + \varepsilon_t. (7.16)

The idea behind the Hannan-Rissanen method is to estimate the parameters \{b_j\}, then estimate the innovations \varepsilon_t, and use the estimated innovations to construct a multiple linear regression estimator of the ARMA parameters \{\theta_i\} and \{\phi_j\}. Of course, in practice we cannot estimate all the parameters \{b_j\}, as there are an infinite number of them. So instead we do a type of sieve estimation, where we estimate only a finite number of them and let the number of estimated parameters grow as the sample size increases. We describe the estimation steps below (a code sketch follows):

(i) Suppose we observe \{X_t\}_{t=1}^n. Recalling (7.16), we will estimate the first p_n parameters \{b_j\}_{j=1}^{p_n}. We will suppose that p_n \to \infty as n \to \infty and p_n << n (we will state the rate below). We use Yule-Walker to estimate \{b_j\}_{j=1}^{p_n}:

\hat{b}_{p_n} = \hat{\Sigma}_{p_n}^{-1} \hat{r}_{p_n},

where (\hat{\Sigma}_{p_n})_{i,j} = n^{-1} \sum_{t=1}^{n-|i-j|} (X_t - \bar{X})(X_{t+|i-j|} - \bar{X}) and (\hat{r}_{p_n})_j = n^{-1} \sum_{t=1}^{n-j} (X_t - \bar{X})(X_{t+j} - \bar{X}).

(ii) Having estimated the first p_n coefficients, we estimate the residuals with

\tilde{\varepsilon}_t = X_t - \sum_{j=1}^{p_n} \hat{b}_{j,n} X_{t-j}.

(iii) Now use as estimates of \phi_0 and \theta_0 the values

(\tilde{\phi}_n, \tilde{\theta}_n) = \arg\min \sum_{t=p_n+1}^{n} ( X_t - \sum_{j=1}^{p} \phi_j X_{t-j} - \sum_{i=1}^{q} \theta_i \tilde{\varepsilon}_{t-i} )^2.

We note that the above can easily be minimised. In fact (\tilde{\phi}_n, \tilde{\theta}_n) = R_n^{-1} s_n, where

R_n = n^{-1} \sum_t \tilde{Y}_t \tilde{Y}_t' and s_n = n^{-1} \sum_t \tilde{Y}_t X_t, with \tilde{Y}_t = (X_{t-1}, \dots, X_{t-p}, \tilde{\varepsilon}_{t-1}, \dots, \tilde{\varepsilon}_{t-q})',

the sums being taken over those t for which all the required terms are defined.
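A rough sketch of the three steps for an ARMA(1,1) (our own code; the choice p_n = (log n)^2 is one possible rate, used here purely for illustration):

x <- arima.sim(list(order = c(1, 0, 1), ar = 0.7, ma = 0.4), n = 500)
n <- length(x); pn <- floor(log(n)^2)
fit <- ar.yw(x, order.max = pn, aic = FALSE)   # step (i): long AR fit
eps <- fit$resid                               # step (ii): estimated innovations
idx <- (pn + 2):n                              # step (iii): regress on lagged X and eps
Y <- cbind(x[idx - 1], eps[idx - 1])
solve(t(Y) %*% Y, t(Y) %*% x[idx])             # (phi-tilde, theta-tilde)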
7.3 The quasi-maximum likelihood for ARCH processes

In this section we consider an estimator of the parameters a_0 = \{a_j : j = 0, \dots, p\}, given the observations \{X_t : t = 1, \dots, n\}, where \{X_t\} is an ARCH(p) process. We use the conditional log-likelihood to construct the estimator. We will assume throughout that E(Z_t^2) = 1 and \sum_{j=1}^{p} a_j = \rho < 1.

We now construct an estimator of the ARCH parameters based on Z_t \sim N(0, 1). It is worth mentioning that despite the criterion being constructed under this condition, it is not necessary that the innovations Z_t are normally distributed. In fact, in the case that the innovations are not normally distributed but have a finite fourth moment, the estimator is still good. This is why it is called the quasi-maximum likelihood rather than the maximum likelihood (similar to how the GMLE estimates the parameters of an ARMA model regardless of whether the innovations are Gaussian or not).

Let us suppose that Z_t is Gaussian. Since Z_t = X_t/\sqrt{a_0 + \sum_{j=1}^{p} a_j X_{t-j}^2}, we have E(X_t | X_{t-1}, \dots, X_{t-p}) = 0 and var(X_t | X_{t-1}, \dots, X_{t-p}) = a_0 + \sum_{j=1}^{p} a_j X_{t-j}^2; then the log density of X_t given X_{t-1}, \dots, X_{t-p} is, up to constants,

-( \log( a_0 + \sum_{j=1}^{p} a_j X_{t-j}^2 ) + X_t^2/( a_0 + \sum_{j=1}^{p} a_j X_{t-j}^2 ) ).

Therefore the conditional log density of X_{p+1}, X_{p+2}, \dots, X_n given X_1, \dots, X_p is

-\sum_{t=p+1}^{n} ( \log( a_0 + \sum_{j=1}^{p} a_j X_{t-j}^2 ) + X_t^2/( a_0 + \sum_{j=1}^{p} a_j X_{t-j}^2 ) ).

This inspires the conditional log-likelihood criterion

L_n(\alpha) = (n-p)^{-1} \sum_{t=p+1}^{n} ( \log( \alpha_0 + \sum_{j=1}^{p} \alpha_j X_{t-j}^2 ) + X_t^2/( \alpha_0 + \sum_{j=1}^{p} \alpha_j X_{t-j}^2 ) ).

To obtain the estimator we define the parameter space

\Theta = \{ \alpha = (\alpha_0, \dots, \alpha_p) : \sum_{j=1}^{p} \alpha_j \le 1, 0 < c_1 \le \alpha_0 \le c_2 < \infty, c_1 \le \alpha_j \}

and assume the true parameters lie in its interior, a = (a_0, \dots, a_p) \in Int(\Theta). We let

\hat{a}_n = \arg\min_{\alpha \in \Theta} L_n(\alpha). (7.17)

The method for estimating GARCH parameters parallels the approximate likelihood ARMA estimator described in the previous section.
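A minimal sketch (our own code) for an ARCH(1), minimising the criterion numerically with optim:

arch.qlik <- function(alpha, x){
  n <- length(x)
  sig2 <- alpha[1] + alpha[2] * x[1:(n - 1)]^2     # conditional variances
  mean(log(sig2) + x[2:n]^2 / sig2)                # L_n(alpha)
}
set.seed(3)                                         # simulate an ARCH(1)
n <- 1000; x <- numeric(n); z <- rnorm(n); x[1] <- z[1]
for (t in 2:n) x[t] <- z[t] * sqrt(1 + 0.5 * x[t - 1]^2)
optim(c(0.5, 0.3), arch.qlik, x = x, method = "L-BFGS-B",
      lower = c(1e-4, 0), upper = c(10, 0.99))$par  # estimates of (a_0, a_1)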
Chapter 8

Spectral Representations

Prerequisites

Knowledge of complex numbers.

Some idea of what the covariance of a complex random variable is (we define it below).

Some idea of what a Fourier transform is (a review is given in Section A.3).

Objectives

Know the definition of the spectral density.

The spectral density is always non-negative, and this gives a way of checking whether a sequence is actually non-negative definite (i.e. is an autocovariance).

The DFT of a second order stationary time series is almost uncorrelated.

The spectral density of an ARMA time series, and how the roots of the characteristic polynomial of an AR may influence the spectral density function.

There is no need to understand the proofs of either Bochner's (generalised) theorem or the spectral representation theorem; just know what these theorems say. However, you should know the proof of Bochner's theorem in the simple case that \sum_r |r c(r)| < \infty.
8.1 How we have used Fourier transforms so far

We recall that in Section 1.2.3 we considered models of the form

X_t = A \cos(\omega t) + B \sin(\omega t) + \varepsilon_t, t = 1, \dots, n, (8.1)

where the \varepsilon_t are iid random variables with mean zero and variance \sigma^2, and \omega is unknown. We estimated the frequency \omega by taking the Fourier transform J_n(\omega) = (2\pi n)^{-1/2} \sum_{t=1}^{n} X_t e^{it\omega} and using as an estimator of \omega the value which maximised |J_n(\omega)|^2. As the sample size grows, the peak (which corresponds to the frequency estimator) grows in size. Besides the fact that this corresponds to the least squares estimator of \omega, we note that

n^{-1/2} J_n(\omega_k) = (2\pi)^{-1/2} n^{-1} \sum_{t=1}^{n} \mu(t/n) \exp(it\omega_k) [= O(1)] + (2\pi)^{-1/2} n^{-1} \sum_{t=1}^{n} \varepsilon_t \exp(it\omega_k) [= O_p(n^{-1/2}); compare with n^{-1}\sum_t \varepsilon_t], (8.2)

where \omega_k = 2\pi k/n; that is, n^{-1/2} J_n(\omega_k) estimates the Fourier transform of the deterministic mean at frequency \omega_k. In the case that the mean is simply the sin function, there is only one frequency which is nonzero. A plot of one realisation (n = 128), the periodogram of the realisation, the periodogram of the iid noise and the periodogram of the sin function is given in Figure 8.1. Take careful note of the scale (y-axis); observe that the periodogram of the sin function dominates the periodogram of the noise (it is orders of magnitude larger). We can understand why from (8.2), where the asymptotic rates are given: the periodogram of the deterministic signal grows like n times the squared Fourier coefficient, whereas the periodogram of the noise is O_p(1). However, this is an asymptotic result; for small sample sizes you may not see such a big difference between the deterministic mean and the noise. Next, looking at the periodogram of the noise, we see that it is very erratic (we will show later that this is because it is an inconsistent estimator of the spectral density function); however, despite the erratic behaviour, the amount of variation over all frequencies seems to be the same (there is just one large peak, which could be explained by the randomness of the periodogram).

Returning again to Section 1.2.3, we now consider the case that the sin function has been
corrupted by colored noise, which follows the AR(2) model

\varepsilon_t = 1.5 \varepsilon_{t-1} - 0.75 \varepsilon_{t-2} + \epsilon_t. (8.3)

[Figure 8.1: Top left: realisation of model (1.5), 2 sin(2\pi t/8) plus iid noise. Top right: periodogram of the realisation. Bottom left: periodogram of just the noise. Bottom right: periodogram of the sin function.]

A realisation and the corresponding periodograms are given in Figure 8.2. The results are different to the iid case. The peak in the periodogram no longer corresponds to the period of the sin function. From the periodogram of just the AR(2) process we observe that it is erratic, just as in the iid case; however, there appear to be varying degrees of variation over the frequencies (though this is not so obvious in this plot). We recall from Chapters 2 and 3 that the AR(2) process has a pseudo-period, which means the periodogram of the colored noise will have pronounced peaks which correspond to the frequencies around the pseudo-period. It is these pseudo-periods which are dominating the periodogram, giving a peak at a frequency that does not correspond to the sin function. However, asymptotically the rates given in (8.2) still hold in this case too. In other words, for large enough sample sizes the DFT of the signal should dominate the noise. To see that this is the case, we increase the sample size to n = 1024; a realisation is given in Figure 8.3. We see that the period corresponding to the sin function dominates the periodogram. Studying the periodogram of just the AR(2) noise we see that it is still erratic (despite the large sample size),
but we also observe that the variability clearly changes over frequency.

[Figure 8.2: Top left: realisation of model (1.5), 2 sin(2\pi t/8) plus AR(2) noise (n = 128). Top right: periodogram of the realisation. Bottom left: periodogram of just the AR(2) noise. Bottom right: periodogram of the sin function.]

[Figure 8.3: Top left: realisation of model (1.5), 2 sin(2\pi t/8) plus AR(2) noise (n = 1024). Top right: periodogram of the realisation. Bottom left: periodogram of just the AR(2) noise. Bottom right: periodogram of the sin function.]

From now on we focus on constant mean stationary time series (e.g. the iid noise and the AR(2))
(where the mean is either constant or zero). As we have observed above, the periodogram is the absolute square of the discrete Fourier transform (DFT), where

J_n(\omega_k) = (2\pi n)^{-1/2} \sum_{t=1}^{n} X_t \exp(it\omega_k). (8.4)

This is simply a (linear) transformation of the data, and thus it is easily reversible by taking the inverse DFT:

X_t = \sqrt{2\pi/n} \sum_{k=1}^{n} J_n(\omega_k) \exp(-it\omega_k). (8.5)

Therefore, just as one often analyses the log transform of data (which is also an invertible transform), one can analyse a time series through its DFT. In Figure 8.4 we give plots of the periodogram of an iid sequence and of the AR(2) process defined in equation (8.3), and in Figure 8.5 we plot the corresponding spectral density functions

f(\omega) = (2\pi)^{-1} \sum_{r=-\infty}^{\infty} c(r) \exp(ir\omega).

We recall from Chapter 3 that the periodogram is an inconsistent estimator of the spectral density function. We will show later that by inconsistent estimator we mean that

E[|J_n(\omega_k)|^2] = f(\omega_k) + O(n^{-1}) but var[|J_n(\omega_k)|^2] does not converge to 0 as n \to \infty.

This explains why the general shape of |J_n(\omega_k)|^2 looks like f(\omega_k), but |J_n(\omega_k)|^2 itself is extremely erratic and variable.

[Figure 8.4: Left: periodogram of iid noise. Right: periodogram of the AR(2) process.]
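A quick way to see this in R (our own code): overlay the periodogram of an AR(2) realisation on its true spectral density f(\omega) = (2\pi)^{-1} |1 - 1.5e^{-i\omega} + 0.75e^{-2i\omega}|^{-2}.

x <- arima.sim(list(order = c(2, 0, 0), ar = c(1.5, -0.75)), n = 256)
w <- 2 * pi * (1:128) / 256
I <- Mod(fft(x)[2:129])^2 / (2 * pi * 256)     # periodogram at w_k
f <- 1 / (2 * pi * Mod(1 - 1.5 * exp(-1i * w) + 0.75 * exp(-2i * w))^2)
plot(w, I, type = "h"); lines(w, f, col = "red", lwd = 2)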
[Figure 8.5: Left: spectral density of iid noise. Right: spectral density of the AR(2); note that the interval [0, 1] on the x-axis corresponds to [0, 2\pi].]

Remark 8.1.1 (Properties of the spectral density function) The spectral density function was first introduced in Section 1.6. We recall that given an autocovariance sequence \{c(k)\}, the spectral density is defined as

f(\omega) = (2\pi)^{-1} \sum_{r=-\infty}^{\infty} c(r) \exp(ir\omega),

and vice versa, given the spectral density we can recover the autocovariance via the inverse transform

c(r) = \int_0^{2\pi} f(\omega) \exp(-ir\omega) d\omega.

We recall from Section 1.6 that the spectral density function can be used to construct a valid autocovariance function, since only a sequence whose Fourier transform is real and nonnegative can be positive semi-definite. In Section 5.4 we used the spectral density function to give conditions under which the variance-covariance matrix of a stationary time series has bounded minimum and maximum eigenvalues. Now, from the discussion above, we observe that the variance of the DFT is approximately the spectral density function (note that for this reason the spectral density is sometimes called the power spectrum).

We now collect some of the above observations, to summarise some of the basic properties of the DFT:

(i) We note that J_n(\omega_k) = \overline{J_n(\omega_{n-k})}; therefore all the information on the time series is contained in the first n/2 frequencies \{J_n(\omega_k); k = 1, \dots, n/2\}.

(ii) If the time series has mean E[X_t] = \mu and k \neq 0, then

E[J_n(\omega_k)] = (2\pi n)^{-1/2} \mu \sum_{t=1}^{n} \exp(it\omega_k) = 0.
In other words, the mean of the DFT is zero regardless of whether the time series has a zero mean (it just needs to have a constant mean).

(iii) However, unlike the original stationary time series, we observe that the variance of the DFT depends on frequency (unless it is a white noise process), and that for k \neq 0,

var[J_n(\omega_k)] = E[|J_n(\omega_k)|^2] = f(\omega_k) + O(n^{-1}).

The focus of this chapter will be on properties of the spectral density function (proving some of the results we stated previously) and on the so-called Cramér representation (or spectral representation) of a second order stationary time series. However, before we go into these results (and proofs), we give one final reason why the analysis of a time series is frequently done by transforming to the frequency domain via the DFT. Above we showed that there is a one-to-one correspondence between the DFT and the original time series; below we show that the DFT almost decorrelates the stationary time series. In other words, one of the main advantages of working in the frequency domain is that we have transformed a correlated time series into something that is almost uncorrelated (this also happens to be a heuristic reason behind the spectral representation theorem).

8.2 The near uncorrelatedness of the Discrete Fourier Transform

Let X_n = (X_1, \dots, X_n)' and \Sigma_n = var[X_n]. It is clear that \Sigma_n^{-1/2} X_n is an uncorrelated sequence. This means that to formally decorrelate X_n we need to know \Sigma_n^{-1/2}. However, if X_t is a second order stationary time series, something curious and remarkable happens: the DFT almost decorrelates X_n. The implication of this is extremely useful in time series, and we shall be using this transform for estimation in Chapter 9. We start by defining the Fourier transform of \{X_t\}_{t=1}^n as

J_n(\omega_k) = (2\pi n)^{-1/2} \sum_{t=1}^{n} X_t \exp( it 2\pi k/n ),

where the frequencies \omega_k = 2\pi k/n are often called the fundamental (Fourier) frequencies.
Lemma 8.2.1 Suppose \{X_t\} is a second order stationary time series where \sum_r |r c(r)| < \infty. Then we have

cov( J_n(2\pi k_1/n), J_n(2\pi k_2/n) ) = f(2\pi k_1/n) + O(n^{-1}) if k_1 = k_2, and O(n^{-1}) if k_1 \neq k_2,

where f(\omega) = (2\pi)^{-1} \sum_{r=-\infty}^{\infty} c(r) \exp(ir\omega).

In the sections below we give two proofs of this result. We note that the principal reason behind both proofs is that

\sum_{t=1}^{n} \exp( it 2\pi j/n ) = n if j \in n\mathbb{Z}, and 0 if j \notin n\mathbb{Z}. (8.6)

8.2.1 Seeing the decorrelation in practice

We evaluate the DFT using the following piece of code (note that we do not standardise by \sqrt{2\pi}):

dft <- function(x){
  n = length(x)
  dft <- fft(x)/sqrt(n)
  return(dft)
}

We have shown above that the \{J_n(\omega_k)\}_k are close to uncorrelated and have variance close to f(\omega_k). This means that the ratios J_n(\omega_k)/f(\omega_k)^{1/2} are close to uncorrelated with variance close to one. Let us treat

Z_k = J_n(\omega_k)/f(\omega_k)^{1/2}

as the transformed random variables; noting that \{Z_k\} is complex, our aim is to show that the acf corresponding to \{Z_k\} is close to zero. Of course, in practice we do not know the spectral density function f, therefore we estimate it using the following piece of code (where test is the time series):

k <- kernel("daniell", 6)
temp2 <- spec.pgram(test, k, taper = 0, log = "no")$spec
n <- length(temp2)
temp3 <- c(temp2[c(1:n)], temp2[c(n:1)])

This simply takes a local average of the periodogram about the frequency of interest (it is worth noting that spec.pgram does not do precisely this, which can be a bit annoying). In Section 9.3 we explain why this is a consistent estimator of the spectral density function. Notice that we also double the length, because the estimator temp2 only gives estimates in the interval [0, \pi]. Thus our estimate of \{Z_k\}, which we denote as \hat{Z}_k = J_n(\omega_k)/\hat{f}_n(\omega_k)^{1/2}, is

temp <- dft(test)
temp4 <- temp/sqrt(temp3)

We want to evaluate the covariance of \{\hat{Z}_k\} over various lags,

\hat{C}_n(r) = n^{-1} \sum_{k=1}^{n} \hat{Z}_k \overline{\hat{Z}_{k+r}} = n^{-1} \sum_{k=1}^{n} J_n(\omega_k) \overline{J_n(\omega_{k+r})} / ( \hat{f}_n(\omega_k)^{1/2} \hat{f}_n(\omega_{k+r})^{1/2} );

to do this we exploit the speed of the FFT (Fast Fourier Transform):

temp5 <- Mod(dft(temp4))**2
dftcov <- fft(temp5, inverse = TRUE)/(length(temp5))
dftcov = dftcov[-1]

and make an ACF plot of the real and imaginary parts of the Fourier ACF:

n = length(temp5)
par(mfrow=c(2,1))
plot(sqrt(n)*Re(dftcov[1:30]))
lines(c(1,30),c(1.96,1.96))
lines(c(1,30),c(-1.96,-1.96))
plot(sqrt(n)*Im(dftcov[1:30]))
lines(c(1,30),c(1.96,1.96))
lines(c(1,30),c(-1.96,-1.96))

Note that the 1.96 corresponds to the 2.5% limits; however, this bound only holds if the time series is Gaussian. If it is non-Gaussian, some corrections have to be made (see Dwivedi and Subba Rao (2011) and Jentsch and Subba Rao (2014)). A plot of a realisation from the AR(2) model

\varepsilon_t = 1.5 \varepsilon_{t-1} - 0.75 \varepsilon_{t-2} + \epsilon_t,
together with the real and imaginary parts of its DFT autocovariance, is given in Figure 8.6. We observe that most of the correlations lie between [-1.96, 1.96].

[Figure 8.6: Top: realisation of the AR(2) model. Bottom: real and imaginary parts of \sqrt{n}\hat{C}_n(r) plotted against the lag r.]

Exercise 8.1 (a) Simulate an AR(2) process and run the above code using the sample sizes

(i) n = 64 (however, use k <- kernel("daniell", 3));
(ii) n = 128 (however, use k <- kernel("daniell", 4)).

Does the near decorrelation property hold when the sample size is very small? Explain your answer by looking at the proof of the lemma.

(b) Simulate a piecewise stationary time series (this is a simple example of a nonstationary time series) by stringing two stationary time series together. One example is
ts1 = arima.sim(list(order=c(2,0,0), ar = c(1.5, -0.75)), n=128)
ts2 = arima.sim(list(order=c(1,0,0), ar = c(0.7)), n=128)
test = c(ts1/sd(ts1), ts2/sd(ts2))

Make a plot of this time series. Calculate the DFT covariance of this time series; what do you observe in comparison to the stationary case?

8.2.2 Proof of Lemma 8.2.1: By approximating Toeplitz with circulant matrices

Let X_n = (X_1, \dots, X_n)' and let F_n be the Fourier transformation matrix, (F_n)_{s,t} = n^{-1/2} \Omega_n^{(s-1)(t-1)} = n^{-1/2} \exp(-2i\pi(s-1)(t-1)/n) (where \Omega_n = \exp(-2i\pi/n)). It is clear that F_n X_n = (J_n(\omega_0), \dots, J_n(\omega_{n-1}))' (up to the standardisation by \sqrt{2\pi}). We now prove that F_n X_n is almost an uncorrelated sequence. The first proof is based on approximating the symmetric Toeplitz variance matrix of X_n by circulant matrices, which have well known eigenvalues and eigenvectors. We start by considering the variance of F_n X_n, var(F_n X_n) = F_n \Sigma_n F_n^*, and our aim is to show that it is almost diagonal. We first recall that if \Sigma_n were a circulant matrix, then F_n X_n would be exactly uncorrelated, since F_n is the eigenvector matrix of any circulant matrix. This is not the case here. However, the upper right half and the lower left half of \Sigma_n can be approximated by circulant matrices; this is the trick in showing the near uncorrelatedness. Studying \Sigma_n, the symmetric Toeplitz matrix with entries (\Sigma_n)_{s,t} = c(s-t) (so its first row is (c(0), c(1), c(2), \dots, c(n-1))), we observe that it can be written as the sum of two circulant matrices plus an error, which we will bound. That is, we define the circulant matrix C_{1,n} with first row (c(0), c(1), c(2), \dots, c(n-1)),
and the circulant matrix C_{2,n} with first row (0, c(n-1), c(n-2), \dots, c(1)). We observe that the upper right half of C_{1,n} matches that of \Sigma_n, and the lower left half of C_{2,n} matches that of \Sigma_n. As the above matrices are circulant, their eigenvector matrix is F_n (note that F_n is unitary, F_n^{-1} = F_n^*). The kth eigenvalues of C_{1,n} and C_{2,n} are, respectively,

\lambda_{1,k} = \sum_{j=0}^{n-1} c(j) \Omega_n^{j(k-1)} and \lambda_{2,k} = \sum_{j=1}^{n-1} c(n-j) \Omega_n^{j(k-1)} = \sum_{j=1}^{n-1} c(j) \Omega_n^{-j(k-1)}.

Observe that

\lambda_{1,k} + \lambda_{2,k} = \sum_{j=-(n-1)}^{n-1} c(j) e^{-2\pi i j(k-1)/n} = f_n(\omega_{k-1}),

where f_n(\omega) = \sum_{|r| \le n-1} c(r) \exp(ir\omega) (for a real, symmetric autocovariance the sign of the exponent does not matter); thus the sums of these eigenvalues approximate the (unnormalised) spectral density function. We now show that under the condition \sum_r |r c(r)| < \infty we have

F_n \Sigma_n F_n^* - F_n ( C_{1,n} + C_{2,n} ) F_n^* = O(n^{-1}) I, (8.7)

where I is the n x n matrix of ones. To show the above, we consider the differences element by element. Since the upper right half of C_{1,n}
matches that of \Sigma_n and the lower left half of C_{2,n} matches that of \Sigma_n, the nonzero entries of \Sigma_n - C_{1,n} - C_{2,n} consist, for each lag r, of at most 2|r| entries of size |c(r)|. Since every entry of F_n has modulus n^{-1/2}, the (s,t)th entry of the difference satisfies

| ( F_n \Sigma_n F_n^* - F_n ( C_{1,n} + C_{2,n} ) F_n^* )_{(s,t)} | \le n^{-1} \sum_{u,v} | ( \Sigma_n - C_{1,n} - C_{2,n} )_{u,v} | \le 2 n^{-1} \sum_r |r| |c(r)| = O(n^{-1}).

Thus we have shown (8.7). Therefore, since F_n is the eigenvector matrix of both C_{1,n} and C_{2,n}, altogether we have

var(F_n X_n) = F_n \Sigma_n F_n^* = diag( f_n(0), f_n(2\pi/n), \dots, f_n(2\pi(n-1)/n) ) + O(n^{-1}),

where f_n(\omega) = \sum_{r=-(n-1)}^{n-1} c(r) \exp(ir\omega). Finally, we note that since \sum_r |r c(r)| < \infty,

| f_n(\omega) - \sum_{r=-\infty}^{\infty} c(r) \exp(ir\omega) | \le \sum_{|r| \ge n} |c(r)| \le n^{-1} \sum_{|r| \ge n} |r c(r)| = O(n^{-1}), (8.8)

which gives the required result (the factor (2\pi)^{-1} in f appears because J_n is standardised by (2\pi n)^{-1/2}, whereas F_n is standardised by n^{-1/2}).

Remark 8.2.1 Note that the eigenvalues of a matrix are often called its spectrum, and the above calculation shows that the spectrum of var[X_n] is close to \{f(\omega_k); k = 0, \dots, n-1\}, which may be one reason why f(\omega) is called the spectral density (the reason for "density" probably comes from the fact that f is nonnegative). These ideas can also be used for inverting Toeplitz matrices (see Chen et al. (2006)).

8.2.3 Proof 2 of Lemma 8.2.1: Using brute force

A more hands-on proof is to just calculate cov( J_n(2\pi k_1/n), J_n(2\pi k_2/n) ). The important aspect of this proof is that if we can isolate the exponentials then we can use (8.6); it is this that gives rise to the near uncorrelatedness property. Remember also that \exp(i 2\pi jk/n) = \exp(ij\omega_k) = \exp(ik\omega_j), hence
we can interchange between the two notations. We note that for complex random variables cov(A, B) = E(A\bar{B}) - E(A)E(\bar{B}), thus we have

cov( J_n(2\pi k_1/n), J_n(2\pi k_2/n) ) = (2\pi n)^{-1} \sum_{t,\tau=1}^{n} cov(X_t, X_\tau) \exp( i(t k_1 - \tau k_2) 2\pi/n ).

Now change variables with r = t - \tau; this gives (for 0 \le k_1, k_2 < n)

cov( J_n(2\pi k_1/n), J_n(2\pi k_2/n) ) = (2\pi)^{-1} \sum_{r=-(n-1)}^{n-1} c(r) \exp( ir 2\pi k_2/n ) [ n^{-1} \sum_{t=1}^{n} \exp( 2\pi i t (k_1 - k_2)/n ) ] + R_n,

where the bracketed term equals \delta_{k_1}(k_2) by (8.6), and R_n collects the boundary terms left over from the change of variables; each lag r contributes at most |r| such terms, so that

|R_n| \le (2\pi n)^{-1} \sum_{|r| \le n-1} |r| |c(r)| = O(n^{-1}).

Finally, by using (8.8) we obtain the result.

Exercise 8.2 The above proof (in Section 8.2.3) uses that \sum_r |r c(r)| < \infty. What bounds do we obtain if we relax this assumption to \sum_r |c(r)| < \infty?

8.2.4 Heuristics

In this section we summarise some spectral properties. We do this by considering the DFT of the data, \{J_n(\omega_k)\}_{k=1}^n. It is worth noting that calculating \{J_n(\omega_k)\}_{k=1}^n is computationally very fast, requiring only O(n \log n) computing operations (see Section A.5, where the Fast Fourier Transform is described).
The spectral (Cramér's) representation theorem

We observe that any sequence \{X_t\}_{t=1}^n can be recovered from its DFT by the inverse transform

X_t = \sqrt{2\pi/n} \sum_{k=1}^{n} J_n(\omega_k) \exp(-it\omega_k), 1 \le t \le n, (8.9)

which can be written as an integral

X_t = \sum_{k=2}^{n} \exp(-it\omega_k) [ Z_n(\omega_k) - Z_n(\omega_{k-1}) ] = \int_0^{2\pi} \exp(-it\omega) dZ_n(\omega), (8.10)

where Z_n(\omega) = \sqrt{2\pi/n} \sum_{k: \omega_k \le \omega} J_n(\omega_k). The second order stationarity of X_t means that the DFT J_n(\omega_k) is close to an uncorrelated sequence, or equivalently that the process Z_n(\omega) has near orthogonal increments, meaning that for any two non-intersecting intervals [\omega_1, \omega_2] and [\omega_3, \omega_4], the increments Z_n(\omega_2) - Z_n(\omega_1) and Z_n(\omega_4) - Z_n(\omega_3) are close to uncorrelated.

The spectral representation theorem generalises this result. It states that for any second order stationary time series \{X_t\} there exists a process \{Z(\omega); \omega \in [0, 2\pi]\} such that for all t \in \mathbb{Z},

X_t = \int_0^{2\pi} \exp(-it\omega) dZ(\omega), (8.11)

and Z(\omega) has orthogonal increments, meaning that for any two non-intersecting intervals [\omega_1, \omega_2] and [\omega_3, \omega_4],

E[ ( Z(\omega_2) - Z(\omega_1) ) \overline{( Z(\omega_4) - Z(\omega_3) )} ] = 0.

We now explore the relationship between the DFT and the orthogonal increment process. Using (8.11) we see that

J_n(\omega_k) = (2\pi n)^{-1/2} \sum_{t=1}^{n} X_t \exp(it\omega_k) = (2\pi n)^{-1/2} \int_0^{2\pi} ( \sum_{t=1}^{n} \exp(it[\omega_k - \omega]) ) dZ(\omega)
= (2\pi n)^{-1/2} \int_0^{2\pi} e^{i(n+1)(\omega_k - \omega)/2} D_{n/2}(\omega_k - \omega) dZ(\omega),

where D_{n/2}(x) = \sin(((n+1)/2)x)/\sin(x/2) is the Dirichlet kernel (see Priestley (1983)). We recall that the Dirichlet kernel limits to the Dirac delta function; therefore, very crudely speaking, we observe that the DFT is an approximation of the orthogonal increment process localised about \omega_k (though mathematically this is not strictly correct).
Bochner's theorem

This is a closely related result that is stated in terms of the so-called spectral distribution. First the heuristics. We see from Lemma 8.2.1 that the DFT J_n(\omega_k) is close to uncorrelated. Using this and the inverse Fourier transform, we see that for 1 \le t, \tau \le n we have

c(t-\tau) = cov(X_t, X_\tau) = (2\pi/n) \sum_{k_1, k_2 = 1}^{n} cov( J_n(\omega_{k_1}), J_n(\omega_{k_2}) ) \exp(-it\omega_{k_1} + i\tau\omega_{k_2})
\approx (2\pi/n) \sum_{k=1}^{n} var( J_n(\omega_k) ) \exp(-i(t-\tau)\omega_k). (8.12)

Let F_n(\omega) = (2\pi/n) \sum_{k: \omega_k \le \omega} var[J_n(\omega_k)]; then the above can be written as

c(t-\tau) \approx \int_0^{2\pi} \exp(-i(t-\tau)\omega) dF_n(\omega),

where we observe that F_n(\omega) is a nonnegative function which is non-decreasing in \omega. Bochner's theorem is an extension of this. It states that for any autocovariance function \{c(k)\} we have the representation

c(t-\tau) = \int_0^{2\pi} \exp(-i(t-\tau)\omega) dF(\omega),

where F(\omega) is a nonnegative, non-decreasing, bounded function. Moreover, F(\omega) = E(|Z(\omega)|^2), where Z is the orthogonal increment process above. We note that if the spectral density function exists (which is true if \sum_r c(r)^2 < \infty), then F(\omega) = \int_0^{\omega} f(\lambda) d\lambda and

c(t-\tau) = \int_0^{2\pi} \exp(-i(t-\tau)\omega) f(\omega) d\omega.

Remark The above results hold for both linear and nonlinear time series; however, in the case that X_t has a linear representation, X_t = \sum_{j=-\infty}^{\infty} \psi_j \varepsilon_{t-j}, the spectral representation takes the particular form

X_t = \int_0^{2\pi} A(\omega) \exp(-it\omega) dZ(\omega), (8.13)

where A(\omega) = \sum_{j=-\infty}^{\infty} \psi_j \exp(ij\omega) and Z(\omega) is an orthogonal increment process, but in addition
204 E( dz(ω) 2 ) = dω ie. the variance of increments do not vary over frequency (as this varying has been absorbed by A(ω), since F (ω) = A(ω) 2 ). We mention that a more detailed discussion on spectral analysis in time series is give in Priestley (983), Chapters 4 and 6, Brockwell and Davis (998), Chapters 4 and 0, Fuller (995), Chapter 3, Shumway and Stoffer (2006), Chapter 4. In many of these references they also discuss tests for periodicity etc (see also Quinn and Hannan (200) for estimation of frequencies etc.). 8.3 The spectral density and spectral distribution 8.3. The spectral density and some of its properties We start by showing that under certain strong conditions the spectral density function is nonnegative. We later weaken these conditions (and this is often called Bochner s theorem). Theorem 8.3. (Positiveness of the spectral density) Suppose the coefficients {c(k)} are absolutely summable (that is k c(k) < ). Then the sequence {c(k)} is positive semi-definite if an only if the function f(ω), where is nonnegative. Moreover f(ω) = 2π c(k) = 2π 0 k= c(k) exp(ikω) exp( ikω)f(ω)dω. (8.4) It is worth noting that f is called the spectral density corresponding to the covariances {c(k)}. PROOF. We first show that if {c(k)} is a non-negative definite sequence, then f(ω) is a nonnegative function. We recall that since {c(k)} is non-negative then for any sequence x = (x,..., x N ) (real or complex) we have n s,t= x sc(s t) x s 0 (where x s is the complex conjugate of x s ). Now we consider the above for the particular case x = (exp(iω),..., exp(inω)). Define the function f n (ω) = 2πn exp(isω)c(s t) exp( itω). s,t= 203
205 Thus by definition f n (ω) 0. We note that f n (ω) can be rewritten as f n (ω) = 2π (n ) k= (n ) ( n k n ) c(k) exp(ikω). Comparing f(ω) = 2π k= c(k) exp(ikω) with f n(ω) we see that f(ω) f n (ω) c(k) exp(ikω) + 2π 2π k n := I n + II n. (n ) k= (n ) k n c(k) exp(ikω) Since k= c(k) < it is clear that I n 0 as n. Using Lemma A.. we have II n 0 as n. Altogether the above implies f(ω) f n (ω) 0 as n. (8.5) Now it is clear that since for all n, f n (ω) are nonnegative functions, the limit f must be nonnegative (if we suppose the contrary, then there must exist a sequence of functions {f nk (ω)} which are not necessarily nonnegative, which is not true). Therefore we have shown that if {c(k)} is a nonnegative definite sequence, then f(ω) is a nonnegative function. 2π We now show the converse, that is the Fourier coefficients of any non-negative l 2 function f(ω) = k= c(k) exp(ikω), is a positive semi-definite sequence. Writing c(k) = 2π 0 f(ω) exp(ikω)dω we substitute this into Definition.6. to give x s c(s t) x s = s,t= 2π 0 f(ω) { n s,t= } 2π x s exp(i(s t)ω) x s dω = f(ω) x s exp(isω) 0 s= 2 dω 0. Hence we obtain the desired result. The above theorem is very useful. It basically gives a simple way to check whether a sequence {c(k)} is non-negative definite or not (hence whether it is a covariance function - recall Theorem.6.). See Brockwell and Davis (998), Corollary or Fuller (995), Theorem 3..9, for alternative explanations. Example 8.3. Consider the empirical covariances (here we gives an alternative proof to Remark 204
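Theorem 8.3.1 suggests a simple numerical check: evaluate the Fourier transform of a candidate symmetric, summable sequence on a fine grid and inspect its sign. Below is a minimal R sketch (the candidate sequence c(k) = 0.75^{|k|} and the truncation lag 50 are illustrative choices; the truncation is harmless here because the sequence decays geometrically):

# Evaluate f(w) = (1/2pi) * sum_k c(k) exp(ikw) for a symmetric sequence,
# with c(-k) = c(k) and cvals = (c(0), c(1), ..., c(m)), and check its sign.
spec_from_acf <- function(cvals, omega) {
  k <- seq_along(cvals) - 1
  sapply(omega, function(w)
    (cvals[1] + 2 * sum(cvals[-1] * cos(k[-1] * w))) / (2 * pi))
}
omega <- seq(0, 2 * pi, length.out = 512)
f <- spec_from_acf(0.75^(0:50), omega)
min(f) >= 0   # TRUE: 0.75^|k| is a non-negative definite sequence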
Example 8.3.1 Consider the empirical covariances defined in Chapter 6,

ĉ_n(k) = n^{-1} Σ_{t=1}^{n−|k|} X_t X_{t+|k|}  for |k| ≤ n − 1,  and ĉ_n(k) = 0 otherwise.

We give an alternative proof to Lemma 6.2.1 that {ĉ_n(k)} is a non-negative definite sequence: we take the Fourier transform of {ĉ_n(k)} and apply Theorem 8.3.1. The Fourier transform is

Σ_{k=−(n−1)}^{n−1} exp(ikω) ĉ_n(k) = Σ_{k=−(n−1)}^{n−1} exp(ikω) n^{-1} Σ_{t=1}^{n−|k|} X_t X_{t+|k|} = n^{-1} |Σ_{t=1}^{n} X_t exp(itω)|² ≥ 0.

Since this is non-negative for every ω, {ĉ_n(k)} is a non-negative definite sequence.

We now state a useful result relating the largest and smallest eigenvalues of the variance matrix of a stationary process to the supremum and infimum of the spectral density (we used this in Lemma 5.4.1).

Lemma 8.3.1 Suppose that {X_t} is a stationary process with covariance function {c(k)} and spectral density f(ω). Let Σ_n = var(X_n), where X_n = (X₁, ..., X_n)′. Suppose inf_ω f(ω) ≥ m > 0 and sup_ω f(ω) ≤ M < ∞. Then for all n we have

λ_min(Σ_n) ≥ 2π inf_ω f(ω)  and  λ_max(Σ_n) ≤ 2π sup_ω f(ω).

PROOF. Let e = (e₁, ..., e_n)′ be the unit-norm eigenvector corresponding to the smallest eigenvalue of Σ_n. Using c(s − t) = ∫_0^{2π} f(ω) exp(i(s − t)ω) dω we have

λ_min(Σ_n) = e*Σ_n e = Σ_{s,t=1}^{n} \bar{e}_s c(s − t) e_t = ∫_0^{2π} f(ω) |Σ_{s=1}^{n} \bar{e}_s exp(isω)|² dω
           ≥ inf_ω f(ω) ∫_0^{2π} |Σ_{s=1}^{n} \bar{e}_s exp(isω)|² dω = 2π inf_ω f(ω),

since ∫_0^{2π} |Σ_s \bar{e}_s exp(isω)|² dω = 2π Σ_s |e_s|² = 2π by Parseval's identity. An identical argument shows λ_max(Σ_n) ≤ 2π sup_ω f(ω). ∎

We now state a version of this result which requires a much weaker condition on the autocovariance function (only that it decays to zero).
Lemma 8.3.2 Suppose the covariances {c(k)} decay to zero as k → ∞. Then for every n, Σ_n = var(X_n) is a non-singular matrix. (Note that we do not require the stronger condition that the covariances be absolutely summable.)

PROOF. See Brockwell and Davis (1998). ∎

8.3.2 The spectral distribution and Bochner's theorem

Theorem 8.3.1 only holds when the sequence {c(k)} is absolutely summable. Of course, this may not always be the case. An extreme example is the time series X_t = Z for all t. Clearly this is a stationary time series and its covariance is c(k) = var(Z) = 1 for all k. The sequence {c(k) = 1} is not absolutely summable, hence the representation of the covariance in Theorem 8.3.1 does not apply: the Fourier transform of the infinite sequence {c(k) = 1}_k is not well defined (clearly {c(k) = 1}_k does not belong to ℓ₂). However, Theorem 8.3.1 can be generalised to include all non-negative definite sequences, and thus all stationary processes, by considering the spectral distribution rather than the spectral density.

Theorem 8.3.2 A sequence {c(k)} is non-negative definite if and only if

c(k) = ∫_0^{2π} exp(−ikω) dF(ω),   (8.16)

where F(ω) is a right-continuous (meaning F(x + h) → F(x) as 0 < h → 0), non-decreasing, non-negative, bounded function on [0, 2π]. Hence F has all the properties of a distribution function and can be treated as one; it is usually called the spectral distribution. This representation is unique.

This is a very constructive result. It shows that the Fourier coefficients of any distribution function form a non-negative definite sequence and thus, if in addition c(k) = c(−k) (so the sequence is symmetric), correspond to the covariance function of a random process. In Figure 8.7 we plot two distribution functions. The top plot is continuous and smooth, so its derivative exists, is non-negative and belongs to L²; it is then clear that its Fourier coefficients form a non-negative definite sequence. The interesting aspect of Theorem 8.3.2 is that the Fourier coefficients corresponding to the distribution function in the bottom plot also form a non-negative definite sequence, even though the derivative
of this distribution function does not exist. In that case the sequence will not belong to ℓ₂ (i.e. the correlation function will not decay to zero as the lag grows).

Figure 8.7: Both plots are of non-decreasing functions, hence are valid distribution functions. The top plot is continuous and smooth, thus its derivative (the spectral density function) exists; the bottom plot is not, so the spectral density does not exist.

PROOF of Theorem 8.3.2. We first show that if {c(k)} is a non-negative definite sequence, then we can write c(k) = ∫_0^{2π} exp(−ikω) dF(ω), where F(ω) is a distribution function. We adapt some of the ideas used to prove Theorem 8.3.1. As in that proof, define the (non-negative) function

f_n(ω) = var[J_n(ω)] = (2πn)^{-1} Σ_{s,t=1}^{n} exp(isω) c(s − t) exp(−itω) = (2π)^{-1} Σ_{k=−(n−1)}^{n−1} ((n − |k|)/n) c(k) exp(ikω).

If {c(k)} is not absolutely summable, the limit of f_n(ω) need no longer be well defined. Instead we consider its integral, which is always a distribution function (in the sense that it is non-decreasing and bounded). Define

F_n(ω) = ∫_0^{ω} f_n(λ) dλ = (ω/2π) c(0) + (1/π) Σ_{r=1}^{n−1} (1 − r/n) c(r) sin(ωr)/r,  0 ≤ ω ≤ 2π.

Since f_n(λ) ≥ 0, F_n(ω) is a non-decreasing function. Furthermore it is bounded, since F_n(2π) = ∫_0^{2π} f_n(λ) dλ = c(0). Hence F_n satisfies all the properties of a distribution function and can be treated as one. This allows us to use Helly's theorem, which states that for any sequence of distributions {G_n} on [0, 2π] with G_n(0) = 0 and sup_n G_n(2π) ≤ M < ∞, there exists a subsequence {n_m}_m such that G_{n_m}(x) → G(x) as m → ∞ at every x ∈ [0, 2π] at which G is continuous. Furthermore, the pointwise convergence G_{n_m}(x) → G(x) implies that for any bounded continuous function h we have ∫ h(x) dG_{n_m}(x) → ∫ h(x) dG(x) as m → ∞ (a very nice proof is given in Varadhan, Theorem 4.1).

We now apply this result to F_n. By Helly's theorem there exists a subsequence of distributions {F_{n_m}}_m with limit F, so that ∫ h(x) dF_{n_m}(x) → ∫ h(x) dF(x). We focus on the functions h(x) = exp(−ikx). It is clear that for every k and n

∫_0^{2π} exp(−ikω) dF_n(ω) = ∫_0^{2π} exp(−ikω) f_n(ω) dω = (1 − |k|/n) c(k) for |k| ≤ n, and 0 for |k| > n.   (8.17)

Fixing k and letting n → ∞, we see that d_{n,k} := ∫_0^{2π} exp(−ikω) dF_n(ω) = (1 − |k|/n) c(k) is a Cauchy sequence with

d_{n,k} → d_k = c(k)   (8.18)

as n → ∞. Thus

d_{n_m,k} = ∫ exp(−ikx) dF_{n_m}(x) → ∫ exp(−ikx) dF(x)  as m → ∞,
and by (8.18) we have c(k) = ∫ exp(−ikx) dF(x), where F(x) is a well-defined distribution. This gives the first part of the assertion.

To show the converse, that {c(k)} is a non-negative definite sequence whenever c(k) = ∫_0^{2π} exp(−ikω) dF(ω), we use the same method as in the proof of Theorem 8.3.1:

Σ_{s,t=1}^{n} x_s c(s − t) \bar{x}_t = ∫_0^{2π} { Σ_{s,t=1}^{n} x_s exp(−i(s − t)ω) \bar{x}_t } dF(ω) = ∫_0^{2π} |Σ_{s=1}^{n} x_s exp(−isω)|² dF(ω) ≥ 0,

since F is a distribution.

Finally, if {c(k)} is absolutely summable, then by Theorem 8.3.1 we can write c(k) = ∫_0^{2π} exp(−ikω) dF(ω), where F(ω) = ∫_0^{ω} f(λ) dλ and f(λ) = (2π)^{-1} Σ_k c(k) exp(ikλ). By Theorem 8.3.1 the function f is non-negative, hence F is a distribution, and we have the result. ∎

Example 8.3.2 Using the above we can construct the spectral distribution for the (rather silly) time series X_t = Z. Let F(ω) = 0 for ω < 0 and F(ω) = var(Z) for ω ≥ 0 (so F is a step function with a single jump at zero). Then

cov(X_t, X_{t+k}) = var(Z) = ∫ exp(−ikω) dF(ω).

Example 8.3.3 Consider the second order stationary time series

X_t = U₁ cos(λt) + U₂ sin(λt),

where U₁ and U₂ are iid random variables with mean zero and variance σ², and λ is a fixed frequency. It can be shown that

cov(X_t, X_{t+k}) = (σ²/2)[exp(iλk) + exp(−iλk)].

Observe that this covariance does not decay with the lag k. Nevertheless,

cov(X_t, X_{t+k}) = ∫_0^{2π} exp(−ikω) dF(ω),

where

F(ω) = 0 for ω < λ;  F(ω) = σ²/2 for λ ≤ ω < 2π − λ;  F(ω) = σ² for 2π − λ ≤ ω.
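A short simulation makes Example 8.3.3 concrete: all the "power" of the process sits at the single frequency λ, so the squared modulus of its DFT spikes there, mirroring the jump in F. (A minimal R sketch; the sample size and the choice λ = 2π·40/n are illustrative.)

# X_t = U1*cos(lambda*t) + U2*sin(lambda*t): the periodogram concentrates
# at lambda, matching the jump of the spectral distribution F at that frequency.
set.seed(2)
n <- 512; lambda <- 2 * pi * 40 / n
U <- rnorm(2)
x <- U[1] * cos(lambda * (1:n)) + U[2] * sin(lambda * (1:n))
I <- Mod(fft(x))^2 / (2 * pi * n)
which.max(I[2:(n / 2)])   # returns 40: essentially all power sits at w_40 = lambda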
211 Observe that this covariance does not decay with the lag k. Then cov(x t, X t+k ) = var(z) = where 2π 0 exp( ikω)df (ω). F (ω) = 0 ω < λ σ 2 /2 λ ω < λ σ 2 λ ω. 8.4 The spectral representation theorem We now state the spectral representation theorem and give a rough outline of the proof. Theorem 8.4. If {X t } is a second order stationary time series with mean zero, and spectral distribution F (ω), and the spectral distribution function is F (ω), then there exists a right continuous, orthogonal increment process {Z(ω)} (that is E[(Z(ω ) Z(ω 2 )(Z(ω 3 ) Z(ω 4 ))] = 0, when the intervals [ω, ω 2 ] and [ω 3, ω 4 ] do not overlap) such that X t = 2π 0 exp( itω)dz(ω), (8.9) where for ω ω 2, E Z(ω ) Z(ω 2 ) 2 = F (ω ) F (ω 2 ) (noting that F (0) = 0). (One example of a right continuous, orthogonal increment process is Brownian motion, though this is just one example, and usually Z(ω) will be far more general than Brownian motion). Heuristically we see that (8.9) is the decomposition of X t in terms of frequencies, whose amplitudes are orthogonal. In other words X t is decomposed in terms of frequencies exp(itω) which have the orthogonal amplitudes dz(ω) (Z(ω + δ) Z(ω)). Remark 8.4. Note that so far we have not defined the integral on the right hand side of (8.9). It is known as a stochastic integral. Unlike many deterministic functions (functions whose derivative exists), one cannot really suppose dz(ω) Z (ω)dω, because usually a typical realisation of Z(ω) will not be smooth enough to differentiate. For example, it is well known that Brownian is quite rough, that is a typical realisation of Brownian motion satisfies B(t, ω) B(t 2, ω) K( ω) t t t γ, where ω is a realisation and γ /2, but in general γ will not be larger. The integral 20
212 g(ω)dz(ω) is well defined if it is defined as the limit (in the mean squared sense) of discrete sums. More precisely, let Z n (ω) = n k= Z(ω k)i ωnk,ω nk (ω) = nω/2π k= [Z(ω k ) Z(ω k )], then g(ω)dz n (ω) = g(ω k ){Z(ω k ) Z(ω k )}. k= The limit of g(ω)dz n (ω) as n is g(ω)dz(ω) (in the mean squared sense, that is E[ g(ω)dz(ω) g(ω)dzn (ω)] 2 ). Compare this with our heuristics in equation (8.0). For a more precise explanation, see Parzen (959), Priestley (983), Sections and Section 4., page 254, and Brockwell and Davis (998), Section 4.7. For a very good review of elementary stochastic calculus see Mikosch (999). A very elegant explanation on the different proofs of the spectral representation theorem is given in Priestley (983), Section 4.. We now give a rough outline of the proof using the functional theory approach. Rough PROOF of the Spectral Representation Theorem To prove the result we first define two Hilbert spaces H and H 2, where H one contains deterministic functions and H 2 contains random variables. First we define the space H = sp{e itω ; t Z} with inner-product f, g = 2π 0 f(x)g(x)df (x) (8.20) (and of course distance f g, f g = 2π 0 f(x) g(x) 2 df (x)) it is clear that this inner product is well defined because f, f 0 (since F is a measure). It can be shown (see Brockwell and Davis { (998), page 44) that H = g; } 2π 0 g(ω) 2 df (ω) <. We also define the space H 2 = sp{x t ; t Z} Roughly speaking it is because all continuous functions on [0, 2π] are dense in L 2 ([0, 2π], B, F ) (using the metric f g = f g, f g and the limit of Cauchy sequences). Since all continuous function can be written as linear combinations of the Fourier basis, this gives the result. 2
213 with inner-product cov(x, Y ) = E[XY ] E[X]E[Y ]. Now let us define the linear mapping T : H H 2 T ( a j exp(ikω)) = a j X k, (8.2) for any n (it is necessary to show that this can be extended to infinite n, but we won t do so here). We will shown that T defines an isomorphism (ie. it is a one-to-one linear mapping that preserves norm). To show that it is a one-to-one mapping see Brockwell and Davis (998), Section 4.7. It is clear that it is linear, there all that remains is to show that the mapping preserves inner-product. Suppose f, g H, then there exists coefficients {f j } and {g j } such that f(x) = j f j exp(ijω) and g(x) = j g j exp(ijω). Hence by definition of T in (8.2) we have T f, T g = cov( j f j X j, j g j X j ) = j,j 2 f j g j2 cov(x j, X j2 ) (8.22) Now by using Bochner s theorem (see Theorem 8.3.2) we have T f, T g = 2π 0 ( ) 2π f j g j2 exp(i(j j 2 )ω) df (ω) = f(x)g(x)df (x) = f, g. j,j 0 2 (8.23) Hence < T f, T g >=< f, g >, so the inner product is preserved (hence T is an isometry). Altogether this means that T defines an isomorphism betwen H and H 2. Therefore all functions which are in H have a corresponding random variable in H 2 which has similar properties. For all ω [0, 2π], it is clear that the identity functions I [0,ω] (x) H. Thus we define the random function {Z(ω); 0 ω 2π}, where T (I [0,ω] ( )) = Z(ω) H 2 (since T is an isomorphism). Since that mapping T is linear we observe that T (I [ω,ω 2 ]) = T (I [0,ω ] I [0,ω2 ]) = T (I [0,ω ]) T (I [0,ω2 ]) = Z(ω ) Z(ω 2 ). Moreover, since T preserves the norm for any non-intersecting intervals [ω, ω 2 ] and [ω 3, ω 4 ] we have cov ((Z(ω ) Z(ω 2 ), (Z(ω 3 ) Z(ω 4 )) = T (I [ω,ω 2 ]), T (I [ω3,ω 4 ]) = I [ω,ω 2 ], I [ω3,ω 4 ] = I [ω,ω 2 ](ω)i [ω3,ω 4 ](ω)df (ω) = 0. 22
214 Therefore by construction {Z(ω); 0 ω 2π} is an orthogonal increment process, where E Z(ω 2 ) Z(ω ) 2 = < T (I [ω,ω 2 ]), T (I [ω,ω 2 ]) >=< I [ω,ω 2 ], I [ω,ω 2 ] > = 2π 0 I [ω,ω 2 ]df (ω) = ω2 ω df (ω) = F (ω 2 ) F (ω ). Having defined the two spaces which are isomorphic and the random function {Z(ω); 0 ω 2π} and function I [0,ω] (x) which have orthogonal increments, we can now prove the result. Since di [0,ω] (s) = δ ω (s)ds, where δ ω (s) is the dirac delta function, any function g L 2 [0, 2π] can be represented as Thus for g(ω) = exp( itω) we have Therefore T (exp( itω)) = T = g(ω) = exp( itω) = ( 2π 0 2π 0 2π 0 2π 0 g(s)di [ω,2π] (s). exp( its)di [ω,2π] (s). ) 2π exp( its)di [ω,2π] (s) = exp( its)t [di [ω,2π] (s)] 0 exp( its)dt [I [ω,2π] (s)], where the mapping goes inside the integral due to the linearity of the isomorphism. Using that I [ω,2π] (s) = I [0,s] (ω) we have T (exp( itω)) = 2π 0 exp( its)dt [I [0,s] (ω)]. By definition we have T (I [0,s] (ω)) = Z(s) which we substitute into the above to give X t = 2π 0 exp( its)dz(s), which gives the required result. Note that there are several different ways to prove this result. 23
It is worth taking a step back from the proof to see where the assumption of stationarity crept in. By Bochner's theorem we have c(t − τ) = ∫ exp(−i(t − τ)ω) dF(ω), where F is a distribution. We used F to define the space H₁, the mapping T (through {exp(−ikω)}_k), the inner product, and thus the isomorphism. However, it was the construction of the orthogonal random functions {Z(ω)} that was instrumental. The main idea of the proof was that there are functions {φ_k(ω)} and a distribution H such that all the covariances of the stochastic process {X_t} can be written as

E(X_t \bar{X}_τ) = c(t, τ) = ∫_0^{2π} φ_t(ω) \bar{φ_τ(ω)} dH(ω),

where H is a measure. As long as such a representation exists, we can define two spaces H₁ and H₂, where {φ_k} is a basis of the function space H₁ = {f; ∫ |f(ω)|² dH(ω) < ∞} and H₂ is the random space s̄p{X_t; t ∈ Z}. From here we can define an isomorphism T: H₁ → H₂ where, for all f(ω) = Σ_k f_k φ_k(ω) ∈ H₁, T(f) = Σ_k f_k X_k ∈ H₂; an important special case is T(φ_k) = X_k. By the same arguments as in the proof above we then obtain X_t = ∫ φ_t(ω) dZ(ω), where {Z(ω)} are orthogonal random functions with E|Z(ω)|² = H(ω). We state this result in the theorem below (see Priestley (1983), Section 4.11).

Theorem 8.4.2 (General orthogonal expansions) Let {X_t} be a time series (not necessarily second order stationary) with covariances E(X_t \bar{X}_s) = c(t, s). If there exists a sequence of functions {φ_k(·)} which satisfy, for all k,

∫_0^{2π} |φ_k(ω)|² dH(ω) < ∞,

and the covariance admits the representation

c(t, s) = ∫_0^{2π} φ_t(ω) \bar{φ_s(ω)} dH(ω),   (8.24)

where H is a distribution, then for all t we have the representation

X_t = ∫ φ_t(ω) dZ(ω),   (8.25)

where {Z(ω)} are orthogonal random functions with E|Z(ω)|² = H(ω). Conversely, if X_t has the representation (8.25), then c(s, t) admits the representation (8.24).

Remark 8.4.2 The above representation applies to both stationary and nonstationary time series. What makes the exponential functions {exp(−ikω)} special is that if a process is stationary, then the representation of c(k) := cov(X_t, X_{t+k}) in terms of exponentials is guaranteed:

c(k) = ∫_0^{2π} exp(−ikω) dF(ω).   (8.26)

Therefore there always exists an orthogonal random function {Z(ω)} such that X_t = ∫ exp(−itω) dZ(ω). Indeed, whenever the exponential basis is used in the definition of either the covariance or the process {X_t}, the resulting process is second order stationary. We mention that it is not guaranteed that an arbitrary basis {φ_t} can represent a given covariance as in (8.24). However, (8.25) is a very useful starting point for characterising
nonstationary processes.

8.5 The spectral density functions of MA, AR and ARMA models

We first obtain the spectral density function of an MA(∞) process; from this, the spectral density of an ARMA process follows easily. Suppose that {X_t} satisfies the representation

X_t = Σ_{j=−∞}^{∞} ψ_j ε_{t−j},   (8.27)

where {ε_t} are iid random variables with mean zero and variance σ², and Σ_j |ψ_j| < ∞. We recall that the covariance of the above is

c(k) = E(X_t X_{t+k}) = σ² Σ_{j=−∞}^{∞} ψ_j ψ_{j+k}.   (8.28)

Since Σ_j |ψ_j| < ∞, it follows that

Σ_k |c(k)| ≤ σ² Σ_k Σ_j |ψ_j||ψ_{j+k}| < ∞.

Hence, by Theorem 8.3.1, the spectral density function of {X_t} is well defined. There are several ways to derive it: we can either use (8.28) together with f(ω) = (2π)^{-1} Σ_k c(k) exp(ikω), or obtain the spectral representation of {X_t} and read off f(ω) from it. We use the latter method.

8.5.1 The spectral representation of linear processes

Since {ε_t} is a second order stationary sequence, by Theorem 8.4.1 there exists an orthogonal random function {Z(ω)} such that

ε_t = ∫_0^{2π} exp(−itω) dZ(ω).

Since E(ε_t) = 0 and E(ε_t²) = σ², multiplying the above by ε_τ, taking expectations, and noting that by the orthogonality of {Z(ω)} we have E(dZ(ω₁)\bar{dZ(ω₂)}) = 0 unless ω₁ = ω₂, we find E(|dZ(ω)|²) = (σ²/2π)dω; hence f_ε(ω) = σ²/(2π). Using this we obtain the following spectral representation for {X_t}:

X_t = ∫_0^{2π} ( Σ_{j=−∞}^{∞} ψ_j exp(ijω) ) exp(−itω) dZ(ω) = ∫_0^{2π} A(ω) exp(−itω) dZ(ω),   (8.29)

where A(ω) = Σ_{j=−∞}^{∞} ψ_j exp(ijω); this is the unique spectral representation of X_t.

Definition 8.5.1 (The Cramér representation) The representation (8.29) of a stationary process,

X_t = ∫_0^{2π} A(ω) exp(−itω) dZ(ω),

where {Z(ω); 0 ≤ ω ≤ 2π} has orthogonal increments, is usually called the Cramér representation of a stationary process.

Exercise 8.3 (i) Suppose that {X_t} has the MA(1) representation X_t = θε_{t−1} + ε_t. What is its Cramér representation?
(ii) Suppose that {X_t} has a causal AR(1) representation X_t = φX_{t−1} + ε_t. What is its Cramér representation?

8.5.2 The spectral density of a linear process

Multiplying (8.29) by \bar{X}_{t+k} and taking expectations gives

E(X_t \bar{X}_{t+k}) = c(k) = ∫_0^{2π} ∫_0^{2π} A(ω₁)\bar{A(ω₂)} exp(−itω₁ + i(t + k)ω₂) E(dZ(ω₁)\bar{dZ(ω₂)}).

By the orthogonality of {Z(ω)}, E(dZ(ω₁)\bar{dZ(ω₂)}) = 0 unless ω₁ = ω₂; altogether this gives

c(k) = ∫_0^{2π} |A(ω)|² exp(ikω) E(|dZ(ω)|²) = ∫_0^{2π} f(ω) exp(ikω) dω,  where f(ω) = (σ²/2π)|A(ω)|².

Comparing with (8.14), we see that f(·) is the spectral density function. That is, the spectral density function corresponding to the linear process defined in (8.27) is

f(ω) = (σ²/2π) |Σ_{j=−∞}^{∞} ψ_j exp(ijω)|².

Remark 8.5.1 (An alternative, more hands-on proof) An alternative derivation, which avoids the Cramér representation, uses the fact that the acf of a linear time series is c(r) = σ² Σ_j ψ_j ψ_{j+r} (see Lemma 3.1.1). By definition the spectral density is

f(ω) = (2π)^{-1} Σ_{r=−∞}^{∞} c(r) exp(irω) = (σ²/2π) Σ_{r=−∞}^{∞} Σ_{j=−∞}^{∞} ψ_j ψ_{j+r} exp(irω).

Changing variables to s = j + r gives

f(ω) = (σ²/2π) Σ_{j=−∞}^{∞} Σ_{s=−∞}^{∞} ψ_j ψ_s exp(i(s − j)ω) = (σ²/2π)|A(ω)|².

Example 8.5.1 Suppose that {X_t} is a stationary ARMA(p, q) time series (not necessarily invertible or causal) satisfying

X_t − Σ_{j=1}^{p} φ_j X_{t−j} = ε_t + Σ_{j=1}^{q} θ_j ε_{t−j},

where {ε_t} are iid random variables with E(ε_t) = 0 and E(ε_t²) = σ². Then the spectral density of {X_t} is

f(ω) = (σ²/2π) |1 + Σ_{j=1}^{q} θ_j exp(ijω)|² / |1 − Σ_{j=1}^{p} φ_j exp(ijω)|².

Because this is a ratio of trigonometric polynomials, it is known as a rational spectral density.
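As an illustration, the minimal R sketch below evaluates the rational spectral density on a grid (the function name and the example AR(2) coefficients, the same model that appears in Figure 9.1 later on, are illustrative choices):

# Rational spectral density of an ARMA(p, q):
# f(w) = (sigma2/2pi) * |1 + sum_j theta_j e^{ijw}|^2 / |1 - sum_j phi_j e^{ijw}|^2.
arma_spec <- function(omega, phi = numeric(0), theta = numeric(0), sigma2 = 1) {
  num <- Mod(1 + sapply(omega, function(w)
    sum(theta * exp(1i * w * seq_along(theta)))))^2
  den <- Mod(1 - sapply(omega, function(w)
    sum(phi * exp(1i * w * seq_along(phi)))))^2
  sigma2 / (2 * pi) * num / den
}
omega <- seq(0, pi, length.out = 256)
plot(omega, arma_spec(omega, phi = c(1.5, -0.75)), type = "l")  # pronounced peak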
Remark 8.5.2 The roots of the characteristic polynomial of an AR process influence the location of the peaks in its spectral density. To see why, consider the AR(2) model

X_t = φ₁X_{t−1} + φ₂X_{t−2} + ε_t,

where {ε_t} are iid with mean zero and E(ε²) = σ². Suppose the roots of the characteristic polynomial φ(B) = 1 − φ₁B − φ₂B² lie outside the unit circle and are complex conjugates, with reciprocal roots λ₁ = r exp(iθ) and λ₂ = r exp(−iθ), |r| < 1. Then the spectral density function is

f(ω) = (σ²/2π) |1 − r exp(i(θ − ω))|^{−2} |1 − r exp(−i(θ + ω))|^{−2}
     = (σ²/2π) / ([1 + r² − 2r cos(θ − ω)][1 + r² − 2r cos(θ + ω)]).

If r > 0, then f(ω) is maximised at ω = θ; if r < 0, it is maximised at ω = θ − π. The peaks in f(ω) thus correspond to the pseudo-periodicities seen in realisations of the time series and in its covariance structure, as one would expect (compare the discussion of AR(2) realisations in Chapter 2). How pronounced the peak is depends on how close |r| is to one: the closer |r| is to one, the sharper the peak. The same argument generalises to higher order autoregressive models, where there may be multiple peaks. In fact, this suggests that the larger the number of peaks in a spectral density, the higher the order of the AR model that should be fitted.

8.5.3 Approximations of the spectral density by AR and MA spectral densities

In this section we show that the spectral density f(ω) = (2π)^{-1} Σ_r c(r) exp(irω) can be approximated to any accuracy by the spectral density of an AR(p) or an MA(q) process. One might try simply truncating the infinite sum of covariances at a finite lag; however, this does not necessarily lead to a non-negative function. Indeed,

f_m(ω) = (2π)^{-1} Σ_{r=−m}^{m} c(r) exp(irω) = ∫_0^{2π} f(λ) D_m(ω − λ) dλ,

where D_m(λ) = sin((m + 1/2)λ)/sin(λ/2) is (proportional to) the Dirichlet kernel. Since D_m(·) can be negative, f_m(ω) can be negative even though f is non-negative.

Example 8.5.2 Consider the AR(1) process X_t = 0.75X_{t−1} + ε_t with var[ε_t] = 1. In Lemma 3.1.1 we showed that the autocovariance of this model is c(r) = (1 − 0.75²)^{−1} 0.75^{|r|}. Define the truncated sequence c̃(0) = c(0), c̃(1) = c̃(−1) = 0.75 c(0) and c̃(r) = 0 for |r| > 1. The "spectral density" of this sequence is

f_1(ω) = (2π)^{-1} c(0) (1 + 1.5 cos ω),

which is negative whenever cos ω < −2/3. This means that {c̃(r)} is not a non-negative definite sequence, hence there exists no time series with this covariance structure. In other words, simply truncating an autocovariance is not enough to guarantee a non-negative definite sequence. Instead we consider a slight variant of this and define

f̃_m(ω) = (2π)^{-1} Σ_{r=−m}^{m} (1 − |r|/m) c(r) exp(irω),

which is non-negative.

Remark 8.5.3 The weighted sum f̃_m is known as a Cesàro sum, because it can be written as

f̃_m(ω) = (2π)^{-1} Σ_{r=−m}^{m} (1 − |r|/m) c(r) exp(irω) = (1/m) Σ_{n=0}^{m−1} f_n(ω),   (8.30)

where f_n(ω) = (2π)^{-1} Σ_{r=−n}^{n} c(r) exp(irω). Curiously, while there is no guarantee that the truncated Fourier series f_n is non-negative, the Cesàro mean f̃_m(·) definitely is. There are a few ways to prove this:

(i) The first we came across previously: var[J_m(ω)] = f̃_m(ω), and being a variance, inf_ω f̃_m(ω) ≥ 0.

(ii) Using (8.30) we can write f̃_m(·) as

f̃_m(ω) = ∫_0^{2π} f(λ) F_m(ω − λ) dλ,

where F_m(λ) = (1/m) Σ_{r=0}^{m−1} D_r(λ) = (1/m)(sin(mλ/2)/sin(λ/2))² and D_r(λ) = Σ_{j=−r}^{r} exp(ijλ) (the Fejér and Dirichlet kernels, respectively). Since both f and F_m are non-negative, f̃_m must be non-negative.
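The R sketch below evaluates both the lag-one truncation from Example 8.5.2 and the corresponding m = 2 Cesàro mean on a grid (the grid size is an arbitrary choice):

# Truncating an acf need not give a non-negative function; the Cesaro
# (Fejer-weighted) mean does. Here c(r) = 0.75^|r| / (1 - 0.75^2), as in Example 8.5.2.
omega <- seq(0, 2 * pi, length.out = 512)
c0 <- 1 / (1 - 0.75^2); c1 <- 0.75 * c0
f_trunc  <- (c0 + 2 * c1 * cos(omega)) / (2 * pi)               # truncation at lag 1
f_cesaro <- (c0 + 2 * (1 - 1 / 2) * c1 * cos(omega)) / (2 * pi) # m = 2 Cesaro mean
min(f_trunc)    # negative: not a valid spectral density
min(f_cesaro)   # strictly positive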
The Cesàro sum is special in the sense that

sup_ω |f̃_m(ω) − f(ω)| → 0  as m → ∞.   (8.31)

Thus for large enough m, f̃_m(ω) is within any prescribed δ of the spectral density f. Using this we can prove the results below.

Lemma 8.5.1 Suppose that Σ_r |c(r)| < ∞ and f is the spectral density of the covariances. Then for every δ > 0 there exists an m such that sup_ω |f(ω) − f_m(ω)| < δ, where f_m(ω) = σ²|ψ(ω)|² with ψ(ω) = Σ_{j=0}^{m} ψ_j exp(ijω). Thus we can approximate the spectral density f by the spectral density of an MA(m) process.

PROOF. We show that there exists an MA(m) process whose spectral density is the Cesàro mean f̃_m defined in (8.30); by (8.31) this gives the result. First note the following factorisation. If a(r) = a(−r), then

Σ_{r=−m}^{m} a(r) exp(irω) = exp(−imω) Σ_{r=0}^{2m} a(r − m) exp(irω),

and since the coefficients of p(z) = Σ_{r=0}^{2m} a(r − m) z^r are symmetric, whenever λ is a root of p so is λ^{−1} (this is clear by factorising the polynomial and arguing by contradiction). Hence p(z) = C ∏_{j=1}^{m} (1 − λ_j z)(1 − λ_j^{−1} z) with |λ_j| ≤ 1, and therefore

Σ_{r=−m}^{m} a(r) exp(irω) = C(−1)^m ( ∏_{j=1}^{m} λ_j^{−1} ) ∏_{j=1}^{m} (1 − λ_j exp(iω))(1 − \bar{λ}_j exp(−iω)),

for some finite constant C. Applying this with a(r) = (1 − |r|/m)c(r), we can write f̃_m as

f̃_m(ω) = K ∏_{j=1}^{m} (1 − λ_j exp(iω))(1 − \bar{λ}_j exp(−iω)) = |A(ω)|²,

where A(z) = K^{1/2} ∏_{j=1}^{m} (1 − λ_j z). Since A(z) is an mth order polynomial all of whose roots have modulus at least one, we can always construct an MA(m) process with A(z) as its transfer function. Thus there exists an MA(m) process with spectral density f̃_m(ω). ∎

Lemma 8.5.2 Suppose that Σ_r |c(r)| < ∞, f is the corresponding spectral density, and inf_ω f(ω) > 0. Then for every δ > 0 there exists an m such that sup_ω |f(ω) − g_m(ω)| < δ, where g_m(ω) = σ²|φ(ω)|^{−2} with φ(ω) = Σ_{j=0}^{m} φ_j exp(ijω), and the roots of φ(z) lie outside the unit circle. Thus we can approximate the spectral density f by the spectral density of a causal autoregressive process.

PROOF. First note that

|f(ω) − g_m(ω)| = f(ω) g_m(ω) |g_m(ω)^{−1} − f(ω)^{−1}|.

Since f is bounded and bounded away from zero, f^{−1} is a well-defined, positive, symmetric function in L², and we can write

f(ω)^{−1} = Σ_{r=−∞}^{∞} d_r exp(irω),

where the Fourier coefficients {d_r} of f^{−1} form a symmetric non-negative definite sequence. We can therefore define the positive function

g_m(ω)^{−1} = Σ_{|r|≤m} (1 − |r|/m) d_r exp(irω),

the Cesàro mean of the partial sums of f^{−1}, which for m large enough satisfies sup_ω |g_m(ω)^{−1} − f(ω)^{−1}| < δ. This implies

|f(ω) − g_m(ω)| ≤ [Σ_r |c(r)|]² δ.

Applying the same factorisation argument as in the proof of Lemma 8.5.1, g_m^{−1} can be factorised as g_m(ω)^{−1} = C|φ_m(ω)|², where φ_m is an mth order polynomial whose roots lie outside the unit circle. Thus g_m(ω) = C^{−1}|φ_m(ω)|^{−2}, and we obtain the desired result. ∎

8.6 Higher order spectra

We recall that the covariance is a measure of linear dependence between two random variables. Higher order cumulants measure higher order dependence. For example, the third order cumulant of the zero mean random variables X₁, X₂, X₃ is

cum(X₁, X₂, X₃) = E(X₁X₂X₃),

and the fourth order cumulant of the zero mean random variables X₁, X₂, X₃, X₄ is

cum(X₁, X₂, X₃, X₄) = E(X₁X₂X₃X₄) − E(X₁X₂)E(X₃X₄) − E(X₁X₃)E(X₂X₄) − E(X₁X₄)E(X₂X₃).

From the definition we see that if the random variables are independent, then cum(X₁, X₂, X₃) = 0 and cum(X₁, X₂, X₃, X₄) = 0. Moreover, if X₁, X₂, X₃, X₄ are Gaussian, then cum(X₁, X₂, X₃) = 0 and cum(X₁, X₂, X₃, X₄) = 0; indeed all cumulants of order higher than two are zero. This follows because the cumulants are the coefficients in the power series expansion of the logarithm of the characteristic function of X, which in the Gaussian case is exactly

g_X(t) = i μ′t − (1/2) t′Σt,

with mean μ and covariance Σ.

Since the spectral density is the Fourier transform of the covariance, it is natural to ask whether one can define higher order spectral densities as Fourier transforms of the higher order cumulants. This turns out to be the case, and the higher order spectra have several interesting properties. Suppose that {X_t} is a strictly stationary time series (notice that here we assume strict stationarity, not just second order stationarity). Let κ₃(t, s) = cum(X₀, X_t, X_s), κ₄(t, s, r) = cum(X₀, X_t, X_s, X_r) and, in general, κ_q(t₁, ..., t_{q−1}) = cum(X₀, X_{t₁}, ..., X_{t_{q−1}}) (noting that, like the covariance, the higher order cumulants are invariant to shifts in time). The third, fourth and general qth order spectra are defined as follows.
We set

f₃(ω₁, ω₂) = Σ_{s=−∞}^{∞} Σ_{t=−∞}^{∞} κ₃(s, t) exp(isω₁ + itω₂),

f₄(ω₁, ω₂, ω₃) = Σ_{s,t,r=−∞}^{∞} κ₄(s, t, r) exp(isω₁ + itω₂ + irω₃),

f_q(ω₁, ..., ω_{q−1}) = Σ_{t₁,...,t_{q−1}=−∞}^{∞} κ_q(t₁, ..., t_{q−1}) exp(it₁ω₁ + ⋯ + it_{q−1}ω_{q−1}).

Example 8.6.1 (Third and fourth order spectral densities of a linear process) Suppose that {X_t} satisfies X_t = Σ_{j=−∞}^{∞} ψ_j ε_{t−j}, where Σ_j |ψ_j| < ∞, E(ε_t) = 0 and E(ε_t⁴) < ∞. Let A(ω) = Σ_j ψ_j exp(ijω). Then it is straightforward to show that

f(ω) = σ²|A(ω)|²,
f₃(ω₁, ω₂) = κ₃ A(ω₁)A(ω₂)A(−ω₁ − ω₂),
f₄(ω₁, ω₂, ω₃) = κ₄ A(ω₁)A(ω₂)A(ω₃)A(−ω₁ − ω₂ − ω₃),

where κ₃ = cum(ε_t, ε_t, ε_t) and κ₄ = cum(ε_t, ε_t, ε_t, ε_t). We see from this example that, unlike the spectral density, the higher order spectra are not necessarily positive, or even real. A review of higher order spectra can be found in Brillinger (2001); they have several applications, especially to nonlinear processes, see Subba Rao and Gabr (1984). We will consider one such application in a later chapter.
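The sketch below evaluates f₃ for an MA(1) using the formula from Example 8.6.1 (the coefficient θ = 0.5 and the value κ₃ = 1 are illustrative choices):

# Third order spectrum f_3(w1, w2) = k3 * A(w1) * A(w2) * A(-w1 - w2)
# of a linear process, here the MA(1) X_t = eps_t + 0.5*eps_{t-1}.
A <- function(w, psi = c(1, 0.5)) {          # transfer function; psi_j, j = 0, 1, ...
  sapply(w, function(x) sum(psi * exp(1i * x * (seq_along(psi) - 1))))
}
f3 <- function(w1, w2, k3 = 1) k3 * A(w1) * A(w2) * A(-w1 - w2)
f3(0.3, 1.2)   # complex-valued in general, unlike the real, non-negative f(w)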
Using the definition of the higher order spectra we can now generalise Lemma 8.2.1 to higher order cumulants (see Brillinger (2001), Theorem 4.3.4).

Proposition 8.6.1 Suppose {X_t} is a strictly stationary time series where, for all 1 ≤ i ≤ q − 1,

Σ_{t₁,...,t_{q−1}=−∞}^{∞} (1 + |t_i|) |κ_q(t₁, ..., t_{q−1})| < ∞

(this is simply a generalisation of the covariance assumption Σ_r |r c(r)| < ∞). Then, with ω_{k_i} = 2πk_i/n,

cum(J_n(ω_{k₁}), ..., J_n(ω_{k_q})) = n^{−(q−2)/2} f_q(ω_{k₂}, ..., ω_{k_q}) + O(n^{−q/2})  if k₁ + ⋯ + k_q ∈ nZ,

and cum(J_n(ω_{k₁}), ..., J_n(ω_{k_q})) = O(n^{−q/2}) otherwise.

8.7 Extensions

8.7.1 The spectral density of a time series with randomly missing observations

Suppose that {X_t} is a second order stationary time series which is not observed at every time point: we only observe X_t at the times {τ_k}_k, that is, we observe {X_{τ_k}}. The question is how to deal with this type of data. One method was suggested in ?: model the missingness mechanism {τ_k} stochastically. Define the random process {Y_t}, taking only the values {0, 1}, where Y_t = 1 if X_t is observed and Y_t = 0 if X_t is not observed. Thus we observe {X_tY_t}_t = {X_{τ_k}} and also {Y_t} (the time points at which the process is observed). It is also suggested that {Y_t} be modelled as a stationary process, independent of {X_t} (so the missingness mechanism and the time series are independent).

The spectral densities of {X_tY_t}, {X_t} and {Y_t} have an interesting relationship, which can be exploited to estimate the spectral density of {X_t} given estimators of the spectral densities of {X_tY_t} and {Y_t} (both of which are computable from the observed data). First note that since {X_t} and {Y_t} are stationary, so is {X_tY_t}; furthermore

cov(X_tY_t, X_τY_τ) = cov(X_t, X_τ)cov(Y_t, Y_τ) + cov(X_t, Y_τ)cov(Y_t, X_τ) + cum(X_t, Y_t, X_τ, Y_τ)
                    = cov(X_t, X_τ)cov(Y_t, Y_τ) = c_X(t − τ)c_Y(t − τ),
where the second equality uses the independence of {X_t} and {Y_t}. Thus the spectral density of {X_tY_t} is

f_{XY}(ω) = (2π)^{-1} Σ_{r=−∞}^{∞} cov(X₀Y₀, X_rY_r) exp(irω) = (2π)^{-1} Σ_{r=−∞}^{∞} c_X(r)c_Y(r) exp(irω) = ∫ f_X(λ) f_Y(ω − λ) dλ,

where f_X(λ) = (2π)^{-1} Σ_r c_X(r) exp(irλ) and f_Y(λ) = (2π)^{-1} Σ_r c_Y(r) exp(irλ) are the spectral densities of the underlying process and of the missingness process; that is, f_{XY} is the convolution of f_X and f_Y.

Chapter 9

Spectral Analysis

Prerequisites

- The Gaussian likelihood.
- The approximation of a Toeplitz matrix by a circulant matrix (covered in previous chapters).

Objectives

- The DFTs are close to uncorrelated but have a frequency-dependent variance (under stationarity).
- The DFTs are asymptotically Gaussian.
- For a linear time series the DFT is almost equal to the transfer function times the DFT of the innovations.
- The periodogram is the squared modulus of the DFT; its expectation is approximately equal to the spectral density.
- Smoothing the periodogram leads to an estimator of the spectral density, as does truncating the sample covariances.
- The Whittle likelihood and how it is related to the Gaussian likelihood.
- Understand that many estimators can be written in the frequency domain.
- Calculating the variance of an estimator.
9.1 The DFT and the periodogram

In the previous chapter we motivated transforming the stationary time series {X_t} into its discrete Fourier transform

J_n(ω_k) = (2πn)^{-1/2} Σ_{t=1}^{n} X_t exp(itω_k) = (2πn)^{-1/2} Σ_{t=1}^{n} X_t cos(tω_k) + i(2πn)^{-1/2} Σ_{t=1}^{n} X_t sin(tω_k),  k = 0, ..., n/2,

(the frequency series) as an alternative way of analysing the time series. Since there is a one-to-one mapping between the time series and its DFT, nothing is lost by making this transformation. Our principal reason for using it is Lemma 8.2.1, where we showed that {J_n(ω_k)}_{k=1}^{n/2} is an almost uncorrelated sequence. However, the uncorrelatedness comes at a cost: unlike the original stationary time series {X_t}, the variance of the DFT varies over the frequencies, and the variance at a frequency is the spectral density there. We summarise this result below; first recall the definition of the spectral density function

f(ω) = (2π)^{-1} Σ_{r=−∞}^{∞} c(r) exp(irω),  ω ∈ [0, 2π].   (9.1)

We now summarise some of the results derived in Chapter 8.

Lemma 9.1.1 Suppose that {X_t} is a zero mean, second order stationary time series with cov(X₀, X_r) = c(r) and Σ_r |c(r)| < ∞. Define ω_k = 2πk/n. Then:

(i) |J_n(ω)|² = (2π)^{-1} Σ_{r=−(n−1)}^{n−1} ĉ_n(r) exp(irω),   (9.2)

where ĉ_n(r) is the sample autocovariance.

(ii) For k ≠ 0 we have E[J_n(ω_k)] = 0 and, as n → ∞,

|E(|J_n(ω_k)|²) − f(ω_k)| ≤ (2π)^{-1} ( Σ_{|r|≥n} |c(r)| + n^{-1} Σ_{|r|<n} |r c(r)| ) → 0.   (9.3)

(iii) cov[J_n(2πk₁/n), J_n(2πk₂/n)] = f(2πk₁/n) + o(1) when k₁ = k₂, and o(1) when k₁ ≠ k₂,

where f(ω) is the spectral density function defined in (9.1). Under the stronger condition Σ_r |r c(r)| < ∞, the o(1) terms above can be replaced with O(n^{-1}).

In addition, under higher order stationarity (or strict stationarity), we can also obtain expressions for the higher order cumulants of the DFT (see Proposition 8.6.1). It should be noted that even if the mean of the stationary time series is not zero (i.e. E(X_t) = μ ≠ 0), we still have E(J_n(ω_k)) = 0 for all ω_k ≠ 0, without centering X_t by the sample mean X̄ (since Σ_{t=1}^{n} exp(itω_k) = 0).

Since there is a one-to-one mapping between the observations and the DFT, it is not surprising that classical estimators can be written in terms of the DFT. For example, the sample covariance can be rewritten as

ĉ_n(r) + ĉ_n(n − r) = (2π/n) Σ_{k=1}^{n} |J_n(ω_k)|² exp(−irω_k)   (9.4)

(see Appendix A.3(iv)). Since ĉ_n(n − r) = n^{-1} Σ_{t=1}^{r} X_t X_{t+n−r} contains only r summands, for small r (relative to n) this term is negligible, giving

ĉ_n(r) ≈ (2π/n) Σ_{k=1}^{n} |J_n(ω_k)|² exp(−irω_k).   (9.5)

The squared modulus of the DFT plays such an important role in time series analysis that it has its own name, the periodogram, defined as

I_n(ω) = |J_n(ω)|² = (2π)^{-1} Σ_{r=−(n−1)}^{n−1} ĉ_n(r) exp(irω).   (9.6)

By Lemma 9.1.1 we have E(I_n(ω)) = f(ω) + O(n^{-1}). Moreover, (9.4) belongs to a general class of integrated mean periodogram estimators, which have the form

A(φ, I_n) = n^{-1} Σ_{k=1}^{n} I_n(ω_k) φ(ω_k).   (9.7)
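The sketch below computes the raw periodogram I_n(ω_k) directly from the FFT (the function name is an illustrative choice; base R's spec.pgram offers similar functionality under a different normalisation):

# Raw periodogram at the fundamental frequencies w_k = 2*pi*k/n, k = 1, ..., n/2.
periodogram <- function(x) {
  n <- length(x)
  I <- Mod(fft(x - mean(x)))^2 / (2 * pi * n)   # centering only affects w = 0
  list(freq = 2 * pi * (1:(n %/% 2)) / n, I = I[2:(n %/% 2 + 1)])
}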
Replacing the sum by an integral and the periodogram by its limit, it is clear that these are estimators of the integrated spectral density

A(f, φ) = (2π)^{-1} ∫_0^{2π} f(ω)φ(ω)dω.

Before considering these estimators (in Section 9.5), we analyse some of the properties of the DFT.

9.2 Distribution of the DFT and periodogram under linearity

An interesting aspect of the DFT is that under certain conditions it is asymptotically normal. We can heuristically justify this by noting that the DFT is a (weighted) sample mean; in fact, at frequency zero it is proportional to the sample mean (J_n(0) = √(n/2π) X̄). In this section we prove this result, and a similar result for the periodogram. We give the proof under linearity of the time series, that is, X_t = Σ_{j=−∞}^{∞} ψ_j ε_{t−j}; the result also holds for nonlinear time series (but that is beyond this course).

The DFT of the innovations, J_ε(ω_k) = (2πn)^{-1/2} Σ_{t=1}^{n} ε_t e^{itω_k}, is a very simple object to deal with. First, the DFT is an orthogonal transformation, and an orthogonal transformation of iid random variables yields uncorrelated random variables; in other words, {J_ε(ω_k)} is completely uncorrelated, as are its real and imaginary parts. Secondly, if {ε_t} is Gaussian, then the {J_ε(ω_k)} are independent and Gaussian. We therefore start by showing that the DFT of a linear time series is approximately equal to the DFT of the innovations multiplied by the transfer function; this allows us to transfer results about J_ε(ω_k) to J_n(ω_k). We will use the assumption Σ_j |j|^{1/2}|ψ_j| < ∞, which is slightly stronger than Σ_j |ψ_j| < ∞ (under which we worked in Chapter 2).

Lemma 9.2.1 Suppose that {X_t} satisfies X_t = Σ_{j=−∞}^{∞} ψ_j ε_{t−j}, where Σ_j |j|^{1/2}|ψ_j| < ∞ and {ε_t} are iid random variables with mean zero and variance σ². Let

J_ε(ω) = (2πn)^{-1/2} Σ_{t=1}^{n} ε_t exp(itω).

Then we have

J_n(ω) = ( Σ_j ψ_j exp(ijω) ) J_ε(ω) + Y_n(ω),   (9.8)

where Y_n(ω) = (2πn)^{-1/2} Σ_j ψ_j exp(ijω) U_{n,j}, with U_{n,j} = Σ_{t=1−j}^{n−j} exp(itω)ε_t − Σ_{t=1}^{n} exp(itω)ε_t, and

E|Y_n(ω)|² ≤ C ( n^{-1/2} Σ_j |ψ_j| min(|j|, n)^{1/2} )² = O(n^{-1}).

PROOF. We note that

J_n(ω) = Σ_j ψ_j exp(ijω) (2πn)^{-1/2} Σ_{t=1}^{n} ε_{t−j} exp(i(t − j)ω) = ( Σ_j ψ_j exp(ijω) ) J_ε(ω) + Y_n(ω),

with Y_n as defined above. We show that Y_n(ω) is negligible relative to the first term. If we took the expectation of the absolute value of |Y_n(ω)| directly, we would require the condition Σ_j |j||ψ_j| < ∞ (and we would not exploit the independence of the innovations); by evaluating E|Y_n(ω)|² we exploit the independence of {ε_t}. For |j| > n the two sums defining U_{n,j} have no terms in common, so E|U_{n,j}|² ≤ 2nσ², while for |j| ≤ n they have n − |j| terms in common and at most 2|j| terms not in common, so E|U_{n,j}|² ≤ 2|j|σ². Applying Minkowski's inequality,

[E|Y_n(ω)|²]^{1/2} ≤ (2πn)^{-1/2} Σ_{|j|≤n} |ψ_j|(2|j|σ²)^{1/2} + (2πn)^{-1/2} Σ_{|j|>n} |ψ_j|(2nσ²)^{1/2} ≤ C n^{-1/2} Σ_j |j|^{1/2}|ψ_j|,

which gives the desired result. ∎

The above shows that under linearity and the condition Σ_j |j|^{1/2}|ψ_j| < ∞ we have

J_n(ω) = ( Σ_j ψ_j exp(ijω) ) J_ε(ω) + O_p(n^{-1/2}),   (9.9)

which implies that the distribution of J_n(ω) is determined by the DFT of the innovations, J_ε(ω). We generalise this result to the periodogram.

Lemma 9.2.2 Suppose that {X_t} is a linear time series X_t = Σ_{j=−∞}^{∞} ψ_j ε_{t−j}, where Σ_j |j|^{1/2}|ψ_j| < ∞ and {ε_t} are iid random variables with mean zero, variance σ² and E(ε_t⁴) < ∞. Then we have

I_n(ω) = |Σ_j ψ_j exp(ijω)|² |J_ε(ω)|² + R_n(ω),   (9.10)

where E(sup_ω |R_n(ω)|) = O(n^{-1/2}).

PROOF. See Priestley (1983), Theorem 6.2.1, or Brockwell and Davis (1998), Theorem 10.3.1. ∎
To summarise the above results: for a general linear process X_t = Σ_j ψ_j ε_{t−j} we have

I_n(ω) = |Σ_j ψ_j exp(ijω)|² |J_ε(ω)|² + O_p(n^{-1/2}) = 2π f(ω) I_ε(ω) + O_p(n^{-1/2}),   (9.11)

where we assume without loss of generality that var(ε_t) = 1, I_ε(ω) = |J_ε(ω)|², and f(ω) = (2π)^{-1}|Σ_j ψ_j exp(ijω)|² is the spectral density of {X_t}. The asymptotic normality of J_n(ω) therefore follows from the asymptotic normality of J_ε(ω), which we prove in the following proposition.

Proposition 9.2.1 Suppose {ε_t} are iid random variables with mean zero and variance σ². Define J_ε(ω) = (2πn)^{-1/2} Σ_{t=1}^{n} ε_t exp(itω) and I_ε(ω) = |J_ε(ω)|². Then for 0 < ω < π

(Re J_ε(ω), Im J_ε(ω))′ →D N(0, (σ²/4π) I₂),   (9.12)

where I₂ is the identity matrix; and for any finite m and distinct fundamental frequencies,

(J_ε(ω_{k₁}), ..., J_ε(ω_{k_m}))′ →D N(0, (σ²/4π) I_{2m}),   (9.13)

identifying each J_ε(ω_{k_i}) with its real and imaginary parts. Consequently (4π/σ²) I_ε(ω) →D χ²(2), equivalently (2π/σ²) I_ε(ω) converges to an exponential distribution with mean one. Furthermore,

cov(|J_ε(ω_j)|², |J_ε(ω_k)|²) = κ₄/((2π)²n) for j ≠ k, and σ⁴/(2π)² + κ₄/((2π)²n) for j = k,   (9.14)

where ω_j = 2πj/n, ω_k = 2πk/n (with j, k ≠ 0 or n) and κ₄ = cum₄(ε_t).

PROOF. We first show (9.12). Write Re J_ε(ω_k) = (2πn)^{-1/2} Σ_{t=1}^{n} α_{t,n} and Im J_ε(ω_k) = (2πn)^{-1/2} Σ_{t=1}^{n} β_{t,n}, where α_{t,n} = ε_t cos(tω_k) and β_{t,n} = ε_t sin(tω_k). These are weighted sums of iid random variables, hence {α_{t,n}} and {β_{t,n}} are martingale differences, and we show asymptotic normality using the martingale central limit theorem together with the Cramér–Wold device. (Since the summands are in fact independent, a CLT for independent non-identically distributed variables would also work; for practice we use the martingale CLT.) We verify its three conditions. The conditional variances satisfy

(2πn)^{-1} Σ_{t=1}^{n} E( α_{t,n}² | ε_{t−1}, ε_{t−2}, ... ) = (σ²/2πn) Σ_{t=1}^{n} cos(tω_k)² → σ²/(4π),

(2πn)^{-1} Σ_{t=1}^{n} E( β_{t,n}² | ε_{t−1}, ε_{t−2}, ... ) = (σ²/2πn) Σ_{t=1}^{n} sin(tω_k)² → σ²/(4π),

(2πn)^{-1} Σ_{t=1}^{n} E( α_{t,n}β_{t,n} | ε_{t−1}, ε_{t−2}, ... ) = (σ²/2πn) Σ_{t=1}^{n} cos(tω_k)sin(tω_k) → 0,

which follow from elementary trigonometric sums. Finally we verify the Lindeberg condition; we do so for the α_{t,n} (the same argument applies to the β_{t,n}). For every ε̃ > 0, using |α_{t,n}| ≤ |ε_t|,

(2πn)^{-1} Σ_{t=1}^{n} E[ α_{t,n}² I(|α_{t,n}| ≥ √(2πn) ε̃) ] ≤ (2π)^{-1} E[ ε_t² I(|ε_t| ≥ √(2πn) ε̃) ] → 0  as n → ∞,

because E(ε_t²) < ∞. Hence the Lindeberg condition holds and we obtain (9.12). The proof of (9.13) is similar, so we omit the details. Because I_ε(ω) = (Re J_ε(ω))² + (Im J_ε(ω))², the χ²(2) limit follows from (9.12).

To prove (9.14) we argue from first principles. Expanding the covariance of products of innovations into covariances and a fourth order cumulant,

cov(ε_{t₁}ε_{t₁+k₁}, ε_{t₂}ε_{t₂+k₂}) = cov(ε_{t₁}, ε_{t₂+k₂})cov(ε_{t₂}, ε_{t₁+k₁}) + cov(ε_{t₁}, ε_{t₂})cov(ε_{t₁+k₁}, ε_{t₂+k₂}) + cum(ε_{t₁}, ε_{t₁+k₁}, ε_{t₂}, ε_{t₂+k₂}).

Since the {ε_t} are iid, this is zero for most index configurations; the exceptions are t₁ = t₂ with k₁ = k₂, t₁ = t₂ with k₁ = k₂ = 0, and t₁ − t₂ = ±k₁ = ∓k₂. Counting all these combinations gives

cov(|J_ε(ω_j)|², |J_ε(ω_k)|²) = (σ⁴/((2π)²n²)) |Σ_{t=1}^{n} exp(it(ω_j − ω_k))|² + κ₄/((2π)²n).

Since Σ_t exp(it(ω_j − ω_k)) = 0 for j ≠ k and equals n for j = k, substituting gives the desired result. ∎

By (9.9), the following result follows immediately from Proposition 9.2.1, equation (9.12).

Corollary 9.2.1 Suppose that {X_t} is a linear time series X_t = Σ_{j=−∞}^{∞} ψ_j ε_{t−j}, where Σ_j |j|^{1/2}|ψ_j| < ∞ and {ε_t} are iid random variables with mean zero, variance σ² and E(ε_t⁴) < ∞. Then

(Re J_n(ω), Im J_n(ω))′ →D N(0, (f(ω)/2) I₂).   (9.15)

Using (9.11) we see that I_n(ω) ≈ 2πf(ω)I_ε(ω). This suggests that most properties of |J_ε(ω)|² also transfer to I_n(ω). Indeed, in the following theorem we show that the asymptotic distribution of I_n(ω) is exponential with mean f(ω) and variance f(ω)² (unless ω = 0 or π, in which case the variance is 2f(ω)²). Using Lemma 9.2.2 we now generalise Proposition 9.2.1 to linear processes, showing that, just like the DFT, the periodogram is nearly uncorrelated at different frequencies. This result will be useful when motivating and deriving the sampling properties of the spectral density estimator in Section 9.3.

Theorem 9.2.1 Suppose {X_t} is a linear time series X_t = Σ_{j=−∞}^{∞} ψ_j ε_{t−j}, where Σ_j |j|^{1/2}|ψ_j| < ∞, E[ε_t] = 0, var[ε_t] = σ² and E[ε_t⁴] < ∞. Let I_n(ω) denote the periodogram associated with {X₁, ..., X_n} and f(·) the spectral density. Then:

(i) If f(ω) > 0 for all ω ∈ [0, 2π] and 0 < ω₁, ..., ω_m < π, then (I_n(ω₁)/f(ω₁), ..., I_n(ω_m)/f(ω_m))′ converges in distribution (as n → ∞) to a vector of independent exponential random variables with mean one.

(ii) Furthermore, for ω_j = 2πj/n and ω_k = 2πk/n,

cov(I_n(ω_k), I_n(ω_j)) = 2f(ω_k)² + O(n^{-1/2}) if ω_j = ω_k = 0 or π;  f(ω_k)² + O(n^{-1/2}) if 0 < ω_j = ω_k < π;  O(n^{-1}) if ω_j ≠ ω_k,

where the bounds are uniform in ω_j and ω_k.

Remark 9.2.1 (Summary of the properties of the periodogram)
(i) The periodogram is non-negative and is an asymptotically unbiased estimator of the spectral density (when Σ_j |ψ_j| < ∞).
(ii) Like the spectral density, it is symmetric: I_n(ω) = I_n(2π − ω).
(iii) At the fundamental frequencies, the {I_n(ω_j)} are asymptotically uncorrelated.
(iv) If 0 < ω < π, I_n(ω) is asymptotically exponentially distributed with mean f(ω).
It should be mentioned that Theorem 9.2.1 also holds for several nonlinear time series.
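A quick Monte Carlo check of Theorem 9.2.1(i) (the AR(1) model, sample size, frequency index and replication count are all arbitrary illustrative choices):

# Check that I_n(w_k)/f(w_k) is approximately Exp(1) for a linear process.
set.seed(4)
n <- 512; phi <- 0.7; nrep <- 2000; k <- 60; wk <- 2 * pi * k / n
f_true <- (1 / (2 * pi)) / Mod(1 - phi * exp(1i * wk))^2   # AR(1), sigma^2 = 1
ratio <- replicate(nrep, {
  x <- arima.sim(model = list(ar = phi), n = n)
  Mod(fft(x))[k + 1]^2 / (2 * pi * n) / f_true             # I_n(w_k) / f(w_k)
})
c(mean(ratio), var(ratio))   # both close to 1, as for an Exp(1) variable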
9.3 Estimating the spectral density function

There are several explanations as to why the raw periodogram cannot be used as an estimator of the spectral density, despite its mean being approximately equal to it. One explanation is a direct consequence of Theorem 9.2.1: the periodogram standardized by the spectral density has a limiting chi-squared distribution with two degrees of freedom, so it does not concentrate around its mean, however large the sample size. An alternative explanation is that the periodogram is the Fourier transform of autocovariance estimators at n different lags. The variance of each covariance estimator ĉ_n(k) is of order O(n^{-1}); roughly speaking, the variance of I_n(ω) is then the sum of n such O(n^{-1}) variances, giving a variance of order O(1), which clearly does not converge to zero. Both explanations motivate estimators of the spectral density, and the two routes turn out to lead to the same estimators. It is worth noting that Parzen (1957) first proposed a consistent estimator of the spectral density. These results led not only to a revolution in spectral density estimation but also to the usual density estimation you may have encountered in nonparametric statistics (one of the first papers on density estimation being Parzen (1962)).

We recall that the J_n(ω_k) are zero mean, nearly uncorrelated random variables whose variance is approximately f(ω_k); that is, E|J_n(ω_k)|² = E[I_n(ω_k)] ≈ f(ω_k).

Remark 9.3.1 (Smoothness of the spectral density) We observe that

f^{(s)}(ω) = (2π)^{-1} Σ_{r∈Z} (ir)^s c(r) exp(irω).

Therefore the smoothness of the spectral density is determined by the finiteness of Σ_r |r|^s|c(r)|, in other words by how fast the autocovariance function converges to zero. We recall that the acf of an ARMA process decays exponentially fast to zero, so its spectral density is extremely smooth (all derivatives exist). Assuming that the autocovariances converge to zero sufficiently fast, f varies slowly over frequency.

Furthermore, using Theorem 9.2.1 we know that the {I_n(ω_k)} are close to uncorrelated and I_n(ω_k)/f(ω_k) is asymptotically exponentially distributed. Therefore we can write

I_n(ω_k) = E(I_n(ω_k)) + [I_n(ω_k) − E(I_n(ω_k))] ≈ f(ω_k) + f(ω_k)U_k,  k = 1, ..., n,   (9.16)

where {U_k} is a sequence of nearly uncorrelated random variables with mean zero and constant variance. Equation (9.16) resembles the usual "function plus noise" set-up often considered in nonparametric statistics.

Remark 9.3.2 (Nonparametric kernel estimation) Suppose we observe Y_i = g(i/n) + ε_i, 1 ≤ i ≤ n, where {ε_i} are iid random variables and g(·) is a smooth function. The kernel estimator of g(j/n) is

ĝ_n(j/n) = (1/bn) Σ_i W((j − i)/(bn)) Y_i,

where W(·) is a smooth kernel function of your choosing, such as the Gaussian kernel.

This suggests estimating the spectral density by a local weighted average of the {I_n(ω_k)}. Equation (9.16) motivates the nonparametric estimator

f̂_n(ω_j) = (1/bn) Σ_k W((j − k)/(bn)) I_n(ω_k),   (9.17)

where W(·) is a spectral window satisfying ∫W(x)dx = 1 and ∫W(x)²dx < ∞.

Example 9.3.1 (Spectral windows) Here we give examples of spectral windows (see Section 6.2.3 of Priestley (1983)).

(i) The Daniell spectral window is the local average W(x) = 1/2 for |x| ≤ 1 and W(x) = 0 for |x| > 1. This window leads to the estimator

f̂_n(ω_j) = (1/bn) Σ_{k=j−bn/2}^{j+bn/2} I_n(ω_k).

A plot of the periodogram, the true spectral density, and two Daniell estimators (bn = 2 and bn = 10) for the AR(2) process X_t = 1.5X_{t−1} − 0.75X_{t−2} + ε_t is given in Figure 9.1. We observe that too small a bandwidth b leads to undersmoothing, while too large a b oversmooths away features. There are various methods for selecting the bandwidth; one common method, based on the Kullback–Leibler criterion, is proposed in Beltrao and Bloomfield (1987).
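In R this estimator can be computed with spec.pgram and a Daniell kernel; the sketch below mimics Figure 9.1 (the seed is arbitrary, and note that spec.pgram normalises frequency to cycles per sampling interval, so its output differs from f(ω) by a fixed axis rescaling):

# Smoothed periodogram of the AR(2) X_t = 1.5 X_{t-1} - 0.75 X_{t-2} + eps_t,
# with a narrow and a wide Daniell window.
set.seed(3)
ar2 <- arima.sim(model = list(ar = c(1.5, -0.75)), n = 256)
par(mfrow = c(1, 2))
spec.pgram(ar2, kernel = kernel("daniell", 2), taper = 0, log = "no")    # undersmoothed
spec.pgram(ar2, kernel = kernel("daniell", 10), taper = 0, log = "no")   # smoother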
239 This suggest that to estimate the spectral density we could use a local weighted average of {I n (ω k )}. Equation (9.6) motivates the following nonparametric estimator of f(ω) f n (ω j ) = k bn W ( j k bn ) I n (ω k ), (9.7) where W ( ) is a spectral window which satisfies W (x)dx = and W (x) 2 dx <. Example 9.3. (Spectral windows) Here we give examples of spectral windows (see Section 6.2.3, page 437 in Priestley (983)). (i) The Daniell spectral Window is the local average /2 x W (x) = 0 x > This window leads to the estimator f n (ω j ) = bn j+bn/2 k=j bn/2 I n (ω k ). A plot of the periodgram, spectral density and different estimators (using Daniell kernel with bn = 2 and bn = 0) of the AR(2) process X t =.5X t 0.75X t 2 +ε t is given in Figure 9.. We observe that too small b leads to undersmoothing but too large b leads to over smoothing of features. There are various methods for selecting the bandwidth, one commonly method based on the Kullbach-Leibler criterion is proposed in Beltrao and Bloomfield (987). (ii) The Bartlett-Priestley spectral Window W (x) = ( 3 4 x 2 ) x 0 x > This spectral window was designed to reduce the mean squared error of the spectral density estimator (under certain smoothness conditions). The above estimator was constructed within the frequency domain. We now consider a spectral density estimator constructed within the time domain. We do this by considering the periodogram 238
240 Autoregressive P2[c(:28)] spectrum frequency[c(:28)] frequency Series: ar2 Smoothed Periodogram Series: ar2 Smoothed Periodogram spectrum spectrum frequency bandwidth = frequency bandwidth = Figure 9.: Using a realisation of the AR(2): X t =.5X t 0.75X t 2 + ε t where n = 256. Top left: Periodogram, Top Right: True spectral density function. Bottom left: Spectral density estimator with bn = 2 and Bottom right: Spectral density estimator with bn = 0. from an alternative angle. We recall that I n (ω) = 2π n k= (n ) ĉ n (k) exp(ikω), thus it is the sum of n autocovariance estimators. This is a type of sieve estimator (a nonparametric function estimator which estimates the coefficients/covariances in a series expansion). But as we explained above, this estimator is not viable because it uses too many coefficient estimators. Since the true coefficients/covariances decay to zero for large lags, this suggests that we do not use all the sample covariances in the estimator, just some of them. Hence a viable estimator of the spectral density is the truncated autocovariance estimator f n (ω) = 2π m k= m ĉ n (k) exp(ikω), (9.8) or a generalised version of this which down weights the sample autocovariances at larger lags f n (ω) = 2π n k= (n ) ( ) k λ ĉ n (k) exp(ikω), (9.9) m 239
241 where λ( ) is the so called lag window. The estimators (9.7) and (9.9) are very conceptionally similar, this can be understood if we rewrite ĉ n (r) in terms the periodogram ĉ n (r) = 2π 0 I n (ω) exp( irω)dω, and transforming (9.9) back into the frequency domain f n (ω) = 2π I n (λ) n k= (n ) λ( k ) exp(ik(ω λ))dλ = m 2π I n (λ)w m (ω λ)dω, (9.20) where W m (ω) = n 2π k= (n ) λ( k m ) exp(ikω). Example (Examples of Lag windows) Here we detail examples of lag windows. (i) Truncated Periodogram lag Window λ(u) = I [,] (u), where {λ(k/m)} corresponds to W m (x) = 2π which is the Dirichlet kernel. m k= m e ikω = 2π sin[(m + /2)x], sin(x/2) Note that the Dirchlet kernel can be negative, thus we can see from (9.20) that f n can be negative. Which is one potential drawback of this estimator (see Example 8.5.2). (ii) The Bartlett lag Window λ(x) = ( x )I [,] (x), where {λ(k/m)} corresponds to W m (x) = 2π m k= m ( k ) e ikω = m 2πn ( ) sin(nx/2) 2 sin(x/2) which is the Fejer kernel. We can immediately see that one advantage of the Bartlett window is that it corresponds to a spectral density estimator which is positive. Note that in the case that m = n (the sample size), the truncated periodogram window estimator corresponds to r n c(r)eirω and the Bartlett window estimator corresponds to r n [ r /n]c(r)e irω. W m ( ) and b W ( b ) (defined in (9.7)) cannot not be the same function, but they share many of the same characteristics. In particular, W m (ω) = n k= (n ) = m m ( ) k λ exp(ikω) = m m m k= (m ) m k= (m ) λ (ω k ) exp (iω k (mω)), 240 ( ) k m λ exp (i km ) m mω
242 where ω k = k/n. By using (A.2) and (A.3) (in the appendix), we can approximate the sum by the integral and obtain W m (ω) = mw (mω) + O(), where W (ω) = λ(x) exp(iω)dx. Therefore f n (ω) m I n (λ)k(m(ω λ))dω. Comparing with f n and f n (ω) we see that m plays the same role as b. Furthermore, we observe k bn W ( j k bn )I(ω k) is the sum of about nb I(ω k ) terms. The equivalent for W m ( ), is that it has the spectral width n/m. In other words since f n (ω) = n 2π k= (n ) λ( k M )ĉ n(k) exp(ikω) = min (λ)w (M(ω λ))dω, it is the sum of about n/m terms. 2π We now analyze the sampling properties of the spectral density estimator. It is worth noting that the analysis is very similar to the analysis of nonparametric kernel regression estimator ĝ n ( j n ) = bn i W ( j i bn )Y i, where Y i = g( i n ) + g( i n )ε i and {ε i } are iid random variables. This is because the periodogram {I n (ω)} k is near uncorrelated. However, still some care needs to be taken in the proof to ensure that the errors in the near uncorrelated term does not build up. Theorem 9.3. Suppose {X t } satisfy X t = j= ψ jε t j, where j= jψ j < and E(ε 4 t ) <. Let ˆf n (ω) be the spectral estimator defined in (9.7). Then E( ˆf n (ω j )) f(ω j ) ( ) C n + b (9.2) and var[ ˆf n (ω j )] bn f(ω j) 2 2 bn f(ω j) 2 0 < ω j < π ω j = 0 or π, (9.22) bn, b 0 as n. PROOF. The proof of both (9.2) and (9.22) are based on the spectral window W (x/b) becoming narrower as b 0, hence there is increasing localisation as the sample size grows (just like nonparametric regression). We first note that by using Lemma 3..(ii) we have r rc(r) <, thus f (ω) r rc(r) < 24
243 . Hence f is continuous with a bounded first derivative. To prove (9.2) we take expections E( ˆf ( ) k {E n (ω j )) f(ω j ) = bn W [ ] I(ωj k) f(ω j ) } bn k = ( ) k bn W [ ] E I(ωj k) f(ω j k) + ( ) k bn bn W f(ω j ) f(ω j k) bn k k := I + II. Using Lemma 9.. we have I = ( ) k bn W ( bn E I(ωj k ) ) f(ω j k ) k ( ) C W ( k bn bn ) c(k) + kc(k) = O( n n ). k k n k n To bound II we use that f(ω ) f(ω 2 ) sup f (ω) ω ω 2, this gives II = k bn K( k bn ){ f(ω j ) f(ω j k ) } = O(b). Altogether this gives I = O(n ) and II = O(b) as bn, b 0 and n. The above two bounds mean give (9.2). We will use Theorem 9.2. to prove (9.22). We first assume that j 0 or n. To prove the result we use that cov( J n (ω k ) 2, J n (ω k2 ) 2 ) = [f(ω k )I(k = k 2 ) + O( n )]2 + [f(ω k )I(k = n k 2 ) + O( n )][f(ω k )I(n k = k 2 ) + O( n )] +[ n f 4(ω, ω, ω 2 ) + O( n 2 )]. 242
244 where the above follows from Proposition This gives var( f n (ω j )) = k,k 2 (bn) 2 W ( j k bn = ( j (bn) 2 W k bn k,k 2 ) W ) W ( j k2 bn ( ) j k2 bn ) cov(i(ω k ), I(ω k )) = = = ( [f(ωk )I(k = k 2 ) + O( n )] 2 [ + f(ωk )I(k = n k 2 ) + O( n )][f(ω k )I(n k = k 2 ) + O( n )] + [ n f 4(ω, ω, ω 2 ) + O( ) n 2 )] ( ) j 2 (bn) 2 W k f(ωk 2 bn ) k= ( j k + k= 2πnb (bn) 2 W b W bn ( ωj ω b ) W ( j (n k ) bn ) 2 f(ω) 2 dω + 2πnb ( ω ) 2 2πnb f(ω j) 2 b W dω + O( b n ) ) f(ω 2 k ) + O( n ) ( ) ( ) b W ωj 2π + ω ωj ω W f(ω)dω +O( b b n ) } {{ } 0 where the above is using the Riemann integral. A similar proof can be used to prove the case j = 0 or n. The above result means that the mean squared error of the estimator E [ ˆfn (ω j ) f(ω j ) ] 2 0, where bn and b 0 as n. Moreover E [ ˆfn (ω j ) f(ω j ) ] 2 ( ) = O bn + b. Remark (The distribution of the spectral density estimator) Using that the periodogram I n (ω)/f(ω) is asymptotically χ 2 (2) distributed and uncorrelated at the fundemental frequencies, we can heuristically deduce the limiting distribution of ˆf n (ω). Here we consider the distribution with 243
245 the rectangular spectral window ˆf n (ω j ) = bn j+bn/2 k=j bn/2 I(ω k ). Since I(ω k )/f(ω k ) are approximately χ 2 (2), then since the sum j+bn/2 k=j bn/2 I(ω k) is taken over a local neighbourhood of ω j, we have that f(ω j ) j+bn/2 k=j bn/2 I(ω k) is approximately χ 2 (2bn). We note that when bn is large, then χ 2 (2bn) is close to normal. Hence bn ˆfn (ω j ) N(f(ω j ), f(ω j ) 2 ). Using this these asymptotic results, we can construct confidence intervals for f(ω j ). In general, to prove normality of ˆfn we rewrite it as a quadratic form, from this asymptotic normality can be derived, where bn ˆfn (ω j ) N (f(ω j ), f(ω j ) 2 ) W (u) 2 du. The variance of the spectral density estimator is simple to derive by using Proposition The remarkable aspect is that the variance of the spectral density does not involve (asymptotically) the fourth order cumulant (as it is off lower order). 9.4 The Whittle Likelihood In Chapter 6 we considered various methods for estimating the parameters of an ARMA process. The most efficient method (in terms of Fisher efficiency), when the errors are Gaussian is the Gaussian maximum likelihood estimator. This estimator was defined in the time domain, but it is interesting to note that a very similar estimator which is asymptotically equivalent to the GMLE estimator can be defined within the frequency domain. We start by using heuristics to define the Whittle likelihood. We then show how it is related to the Gaussian maximum likelihood. To motivate the method let us return to the Sunspot data considered in Exercise 5.. The 244
246 Periodogram and the spectral density corresponding to the best fitting autoregressive model, f(ω) = (2π).584e iω e i2ω 0.674e i3ω 0.385e i4ω 0.054e i5ω e i6ω e i7ω e i8ω e ω 2, is given in Figure 9.2. We see that the spectral density of the best fitting AR process closely follows the shape of the periodogram (the DFT modulo square). This means that indirectly the autoregressive estimator (Yule-Walker) chose the AR parameters which best fitted the shape of the periodogram. The Whittle likelihood estimator, that we describe below, does this directly. By selecting the parametric spectral density function which best fits the periodogram. The Whittle P2[c(2:n3)] frequency2[c(2:n3)] Autoregressive spectrum frequency Figure 9.2: The periodogram of sunspot data (with the mean removed, which is necessary to prevent a huge peak at zero) and the spectral density of the best fitting AR model. likelihood measures the distance between I n (ω) and the parametric spectral density function using the Kullbach-Leibler criterion L w n (θ) = k= ( log f θ (ω k ) + I ) n(ω k ), ω k = 2πk f θ (ω k ) n, and the parametric model which minimises this distance is used as the estimated model. The choice of this criterion over the other distance criterions may appear to be a little arbitrary, however there are several reasons why this is considered a good choice. Below we give some justifications as to 245
247 why this criterion is the prefered one. First let us suppose that we observe {X t } n t=, where X t satisfies the ARMA representation X t = p φ j X t j + q ψ j ε t j + ε t, and {ε t } are iid random variables. We will assume that {φ j } and {ψ j } are such that the roots of their corresponding characteristic polynomial are greater than + δ. Let θ = (φ, θ). As we mentioned in Section 8.2 if r rc(r) <, then cov(j n (ω k ), J n (ω k2 )) = f(ω k ) + O( n ) k = k 2 O( n ) k k 2,. where f(ω) = σ2 + q θ j exp(ijω) 2 2π + p φ j exp(ijω) 2. In other words, if the time series satisfies an ARMA presentation the DFT is near uncorrelated, its mean is zero and its variance has a well specified parametric form. Using this information we can define a criterion for estimating the parameters. We motivate this criterion through the likelihood, however there are various other methods for motivating the criterion for example the Kullbach-Leibler criterion is an alternative motivation, we comment on this later on. If the innovations are Gaussian then RJ n (ω) and IJ n (ω) are also Gaussian, thus by using above we approximately have J n = RJ n (ω ) IJ n (ω ). RJ n (ω n/2 ) IJ n (ω n/2 ) N (0, diag(f(ω ), f(ω ),..., f(ω n/2 ), f(ω n/2 ))). In the case that the innovations are not normal then, by Corollary 9.2., the above holds asymptotically for a finite number of frequencies. Here we construct the likelihood under normality of the innovations, however, this assumption is not required and is only used to motivate the construction. Since J n is normally distributed random vector with mean zero and approximate diagonal 246
248 matrix variance matrix diag(f(ω ),..., f(ω n )), the negative log-likelihood of J n is approximately L w n (θ) = k= ( log f θ (ω k ) + J X(ω k ) 2 ). f θ (ω k ) To estimate the parameter we would choose the θ which minimises the above criterion, that is θ w n = arg min θ Θ Lw n (θ), (9.23) where Θ consists of all parameters where the roots of the corresponding characteristic polynomial have absolute value greater than ( + δ) (note that under this assumption all spectral densities corresponding to these parameters will be bounded away from zero). Example 9.4. Fitting an ARMA(, ) model to the data To fit an ARMA model to the data using the Whittle likelihood we use the criterion n/2 L w n (θ) = (log σ2 + θe iω k 2 ) 2π φeiωk 2 2π φe iω + I n (ω k k ) σ 2 + θe iω. k 2 k= By differentiating L ω n with respect to φ, σ 2 and θ we solve these three equations (usually numerically), this gives us the Whittle likelihood estimators. Whittle (962) showed that the above criterion is an approximation of the GMLE. The correct proof is quite complicated and uses several matrix approximations due to Grenander and Szegö (958). Instead we give a heuristic proof which is quite enlightening. as Returning the the Gaussian likelihood for the ARMA process, defined in (7.0), we rewrite it L n (θ) = ( det R n (θ) + X nr n (θ) X n ) = ( det Rn (f θ ) + X nr n (f θ ) X n ), (9.24) where R n (f θ ) s,t = f θ (ω) exp(i(s t)ω)dω and X n = (X,..., X n ). We now show that L n (θ) L w n (θ). Lemma 9.4. Suppose that {X t } is a stationary ARMA time series with absolutely summable 247
249 covariances and f θ (ω) is the corresponding spectral density function. Then for large n. det R n (f θ ) + X nr n (f θ ) X n = k= ( log f θ (ω k ) + J n(ω k ) 2 ) + O(), f θ (ω k ) PROOF. There are various ways to precisely prove this result. All of them show that the Toeplitz matrix can in some sense be approximated by a circulant matrix. This result uses Szegö s identity (Grenander and Szegö (958)). The main difficulty in the proof is showing that R n (f θ ) U n (f θ ), where U n(f θ ) s,t = f θ (ω) exp(i(s t)ω)dω. An interesting derivation is given in Brockwell and Davis (998), Section 0.8. The main ingredients in the proof are:. For a sufficiently large m, R n (f θ ) can be approximated by R n (g m ), where g m is the spectral density of an mth order autoregressive process (this follows from Lemma 8.5.2), and showing that X nr n (f θ ) X n X nr n (g m ) X n = X [ n Rn (f θ ) R n (g m ) ] X n = X nr n (g m ) [R n (g m ) R n (f θ )] R n (f θ )X n From Section 3.2.3, we recall if g m is the spectral density of an AR(m) process, then for n >> m, R n (g m ) will be bandlimited with most of its rows a shift of the other (thus with the exception of the first m and last m rows it is close to circulant). 3. We approximate R n (g m ) with a circulant matrix, showing that X [ n Rn (g m ) C n (gm ) ] X n 0, where C n (g 2 m ) is the corresponding circulant matrix (where for 0 < i j m and either i or j is greater than m, (C n (g )) ij = 2 m k= i j φ m,kφ m,k i j + φ m, i j ) with the eigenvalues {g m (ω k ) } n k=. 4. These steps show that X [ n Rn (f θ ) U n (gm ) ] X n 0 as m as n, which gives the result. 248
250 Remark 9.4. (A heuristic derivation) We give a heuristic proof. Using the results in Section 8.2 we have see that R n (f θ ) can be approximately written in terms of the eigenvalue and eigenvectors of the circulant matrix associated with R n (f θ ), that is R n (f θ ) F n (f θ ) F n thus R n (f θ ) F n (f θ ) F n, (9.25) where (f θ ) = diag(f (n) θ (ω ),..., f (n) θ ω k = 2πk/n. Basic calculations give (ω n)), f (n) θ (ω) = (n ) j= (n ) c θ(k) exp(ikω) f θ (ω) and X n Fn = (J n (ω ),..., J n (ω n )). (9.26) Substituting (9.26) and (9.25) into (9.27) yields n L n(θ) n k= ( log f θ (ω k ) + J n(ω k ) 2 ) = f θ (ω k ) n Lw (θ). (9.27) Hence using the approximation in (9.25) leads to a heuristic equivalence between the Whittle and Gaussian likelihood. Lemma (Consistency) Suppose that {X t } is a causal ARMA process with parameters θ whose roots lie outside the ( + δ)-circle (where δ > 0 is arbitrary). Let θ w be defined as in (9.23) and suppose that E(ε 4 t ) <. Then we have θ w P θ. PROOF. To show consistency we need to show pointwise convergence and equicontinuity of n L n. Let L w (θ) = 2π ( log f θ (ω) + f ) θ 0 (ω) dω. 2π 0 f θ (ω) It is straightforward to show that E( n Lw n (θ)) L n (θ). Next we evaluate the variance, to do this 249
251 we use Proposition 8.6. and obtain [ ] var n Lw n (θ) = n 2 k,k 2 = f θ (ω k )f θ (ω k2 ) cov( J n(ω k ) 2, J n (ω k2 ) 2 ) = O( n ). Thus we have n Lw n (θ) P L w (θ). To show equicontinuity we apply the mean value theorem to n Lw n. We note that because the parameters (φ, θ) Θ, have characteristic polynomial whose roots are greater than ( + δ) then f θ (ω) is bounded away from zero (there exists a δ > 0 where inf ω,θ f θ (ω) δ ). Hence it can be shown that there exists a random sequence {K n } such that n Lw n (θ ) n Lw n (θ 2 )) K n ( θ θ 2 ) and K n converges almost surely to a finite constant as n. Therefore n L n is stochastically equicontinuous. Since the parameter space Θ is compact, the three standard conditions are satisfied and we have consistency of the Whittle estimator. To show asymptotic normality we note that n Lw n (θ) can be written as a quadratic form where 2π n Lw n (θ) = log f θ (ω k ) + 0 n d n (r; θ) = n n r= (n ) f θ (ω k ) exp(irω k ). k= n r d n (r; θ) X k X k+r Using the above quadratic form and it s derivatives wrt θ one can show normality of the Whittle likelihood under various dependence conditions on the time series. Using this result, in the following theorem we show asymptotic normality of the Whittle estimator. Note, this result not only applies to linear time series, but several types of nonlinear time series too. k= Theorem 9.4. Let us suppose that {X t } is a strictly stationary time series with a sufficient dependence structure (such as linearity, mixing at a certain rate, etc.) with spectral density function 250
252 f θ (ω) and E X 4 t <. Let L w n (θ) = k= ( log f θ (ω k ) + J n(ω k ) 2 ), f θ (ω k ) θ n = arg min θ Θ Lw n (θ) θ = arg min θ Θ Lw (θ) Then we have n ( θn θ ) D N (0, 2V + V W V ) where V = 2π 2π W = 2 (2π) 2 ( ) ( ) θ f θ (ω) θ f θ (ω) dω f θ (ω) f θ (ω) ( θ f θ (ω ) ) ( θ f θ (ω 2 ) ) f4,θ0 (ω, ω, ω 2 )dω dω 2, 0 2π 2π 0 and f 4,θ0 (ω, ω 2, ω 3 ) is the fourth order spectrum of {X t }. 0 We now apply the above result to the case of linear time series. We now show that in this case, in the fourth order cumulant term, W, falls out. This is due to the following lemma. Lemma Suppose that the spectral density has the form f(ω) = σ 2 + ψ j exp(ijω) 2 and inf f(ω) > 0. Then we have 2π log f(ω)dω = log σ 2 2π 0 PROOF. Since f(z) is non-zero for z, then log f(z) has no poles in {z; z }. Thus we have 2π log f(ω)dω = 2π log σ 2 dω + 2π log + 2π 0 2π 0 2π 0 = 2π log σ 2 dω + 2π 0 2π = 2π log σ 2 dω. 2π 0 z = log + ψ j z 2 dz ψ j exp(ijω) 2 dω An alternative proof is that since f(z) is analytic and does not have any poles for z, then 25
253 log f(z) is also analytic in the region z, thus for z we have the power series expansion log + ψ j exp(ijω) 2 = b jz j (a Taylor expansion about log ). Using this we have 2π log + 2π 0 = 2π b j 2π 0 ψ j exp(ijω) 2 dω = 2π 2π 0 exp(ijω)dω = 0, b j exp(ijω)dω and we obtain the desired result. Lemma Suppose that {X t } is a linear ARMA time series X t p φ jx t j = q i= θ iε t i + ε t, where E[ε t ] = 0, var[ε t ] = σ 2 and E[ε 4 t ] <. Let θ = ({φ j, θ j }), then we have W = 0 and n ( θw n θ) D N (0, 2V ). PROOF. The result follows from Theorem 9.4., however we need to show that in the case of linearity that W = 0. We use Example 8.6. for linear processes to give f 4,θ (ω, ω, ω 2 ) = κ 4 A(ω ) 2 A(ω 2 ) 2 = κ 4 σ 4 f(ω )f(ω 2 ). Substituting this into W gives W = 2π 0 = κ ( 4 σ 4 2π = κ ( 4 σ 4 2π = κ 4 σ 4 2π 2π 0 2π 0 2π 0 ( 2π θ ( θ f θ (ω ) )( θ f θ (ω 2 ) ) f 4,θ0 (ω, ω, ω 2 )dω dω 2 θ f θ (ω) f θ (ω) 2 f θ(ω)dω θ log f θ (ω)dω 2π 0 ) 2 ) 2 = κ 4 ) 2 log f θ (ω)dω = κ 4 σ 4 σ 4 ( 2π 2π 0 ) θ f θ (ω) 2 f θ (ω) dω ) 2 ( θ log σ2 = 0, 2π where by using Lemma we have 2π 0 log f θ (ω)dω = 2π log σ2 2π and since θ does not include σ 2 we obtain the above. Hence for linear processes the higher order cumulant does not play an asymptotic role in the variance thus giving the result. On first appearances there does not seem to be a connection between the Whittle likelihood and the sample autocorrelation estimator defined in Section However, we observe that the variance of both estimators, under linearity, do not contain the fourth order cumulant (even for non-gaussian linear time series). In Section 9.5 we explain there is a connection between the two, 252
254 and it is this connection that explains away this fourth order cumulant term. Remark Under linearity, the GMLE and the Whittle likelihood are asymptotically equivalent, therefore they have the same asymptotic distributions. The GMLE has the asymptotic distribution n(ˆφ n φ, ˆθ n θ) D N (0, Λ ), where Λ = E(U tu t) E(V t U t) E(U t V t ) E(V t V t ) and {U t } and {V t } are autoregressive processes which satisfy φ(b)u t = ε t and θ(b)v t = ε t. By using the similar derivatives to those given in (7.) we can show that E(U tu t) E(V t U t) = E(U t V t ) E(V t V t ) 2π 2π 0 ( θ f θ (ω) f θ (ω) 9.5 Ratio statistics in Time Series ) ( ) θ f θ (ω) dω. f θ (ω) We recall from (9.4) that the covariance can be written as a general periodogram mean which has the form The variance of this statistic is var(a(φ, I n )) = n 2 k,k 2 = = n 2 k,k 2 = A(φ, I n ) = n I n (ω k )φ(ω k ). (9.28) k= φ(ω k )φ(ω k )cov( J n (ω k ) 2, J n (ω k2 ) 2 ) [ φ(ω k )φ(ω k ) cov(j n (ω k ), J n (ω k2 ))cov(j n (ω k ), J n (ω k2 )) +cov(j n (ω k ), J n (ω k2 ))cov(j n (ω k ), J n (ω k2 )) ] +cum(j n (ω k ), J n (ω k2 ), J n (ω k2 ), J n (ω k2 ). (9.29) 253
255 By using Proposition 8.6. we have cov( J n (ω k ) 2, J n (ω k2 ) 2 ) = [ ( )] 2 [ f(ω k )I(k = k 2 ) + O + f(ω k )I(k = n k 2 ) + O n + ( ) n f 4(ω, ω, ω 2 ) + O n 2. Substituting (9.30) into (9.29) the above gives ( )] [ f(ω k )I(n k = k 2 ) + O n ( )] n (9.30) var(a(φ, I n )) = n 2 φ(ω k ) 2 f(ω k ) 2 + n 2 = n k= + n 3 + n k,k 2 = 2π 0 2π 2π 0 φ(ω k )φ(ω n k )f(ω k ) 2 k= φ(ω k )φ(ω k2 )f 4 (ω k, ω k, ω k2 ) + O( n 2 ) φ(ω) 2 f(ω) 2 dω + n 0 2π 0 φ(ω)φ(2π ω)f(ω) 2 dω φ(ω )φ(ω 2 )f 4 (ω, ω, ω 2 )dω dω 2 + O( ), (9.3) n2 where f 4 is the fourth order cumulant of {X t }. From above we see that unless φ satisfies some special conditions, var(a(φ, I n )) contains the fourth order spectrum, which can be difficult to estimate. There are bootstrap methods which can be used to estimate the variance or finite sample distribution, but simple bootstrap methods, such as the frequency domain bootstrap, cannot be applied to A(φ, I n ), since it is unable to capture the fourth order cumulant structure. However, in special cases the fourth order structure is disappears, we consider this case below and then discuss how this case can be generalised. Lemma 9.5. Suppose {X t } is a linear time series, with spectral density f(ω). Let A(φ, I n ) be defined as in (9.28) and suppose the condition A(φ, f) = φ(ω)f(ω)dω = 0 (9.32) holds, then var(a(φ, I n )) = n 2π 0 φ(ω) 2 f(ω) 2 dω + n 2π 0 φ(ω)φ(2π ω)f(ω) 2 dω. 254
256 PROOF. By using (9.3) we have = n var(a(φ, I n )) n 2π 0 2π 2π 0 φ(ω) 2 f(ω) 2 dω + n 0 2π 0 φ(ω)φ(2π ω)f(ω) 2 dω φ(ω )φ(ω 2 )f 4 (ω, ω, ω 2 ) + O( n 2 ). But under linearity f 4 (ω, ω, ω 2 ) = κ 4 σ 4 f(ω )f(ω 2 ), substituting this into the above gives = n = n var(a(φ, I n )) 2π 2π φ(ω) 2 f(ω) 2 dω + φ(ω)φ(2π ω)f(ω) 2 dω n 0 ( ) φ(ω )φ(ω 2 )f(ω )f(ω 2 )dω dω 2 + O n 2 0 κ 2π 2π 4 σ 4 n 0 2π 0 + κ 4 σ 4 n 0 2π φ(ω) 2 f(ω) 2 dω + φ(ω)φ(2π ω)f(ω) 2 dω n 0 2π 2 ( ) φ(ω)f(ω)dω + O n 2. 0 Since φ(ω)f(ω)dω = 0 we have the desired result. Example 9.5. (The Whittle likelihood) Let us return to the Whittle likelihood in the case of linearity. In Lemma we showed that the fourth order cumulant term does not play a role in the variance of the ARMA estimator. We now show that condition (9.32) holds. Consider the partial derivative of the Whittle likelihood θ L w n (θ) = k= ( θ f θ (ω k ) f θ (ω k ) I ) n(ω k ) f θ (ω k ) 2 θf θ (ω k ). To show normality we consider the above at the true parameter θ, this gives θ L w n (θ) = k= ( θ f θ (ω k ) f θ (ω k ) I ) n(ω k ) f θ (ω k ) 2 θf θ (ω k ). Only the second term of the above is random, therefore it is only this term that yields the variance. Let A(f 2 θ θf θ, I n ) = n k= 255 I n (ω k ) f θ (ω k ) 2 θf θ (ω k ).
257 To see whether this term satisfies the conditions of Lemma 9.5. we evaluate A(f 2 θ θf θ, f θ ) = = 2π 0 2π 0 = θ 2π f θ (ω) f θ (ω) 2 θf θ (ω) θ log f θ (ω) 0 2π log f θ (ω) = θ log f θ (ω)dω = 0, 2π 0 by using Lemma Thus we see that the derivative of the Whittle likelhood satisfies the condition (9.32). Therefore the zero cumulant term is really due to this property. The Whittle likelihood is a rather special example. However we now show that any statistic of the form A(φ, I n ) can be transformed such that the resulting transformed statistic satisfies condition (9.32). To find the suitable transformation we recall from Section 6.2. that the variance of ĉ n (r) involves the fourth order cumulant, but under linearity the sample correlation ρ n (r) = ĉ n (r)/ĉ n (0) does given not. Returning to the frequency representation of the autocovariance given in (9.5) we observe that ρ n (r) = n/2 I n (ω k ) exp(irω k ) ĉ n (0) n ĉ n (0) n k= I n (ω k ) exp(irω k ), k= (it does not matter whether we sum over n or n/2 for the remainder of this section we choose the case of summing over n). Motivated by this example we define the so called ratio statistic Ã(φ, I n ) = n k= I n (ω k )φ(ω k ) ĉ n (0) = n k= I n (ω k )φ(ω k ), (9.33) ˆF n (2π) where ˆF n (2π) = n n k= I n(ω k ) = n n t= X2 t = ĉ n (0). We show in the following lemma that Ã(φ, I n ) can be written in a form that almost satisfies condition (9.32). Lemma Let us suppose that Ã(φ, I n) satisfies (9.33) and Ã(φ, f) = n k= f(ω k )φ(ω k ), F n (2π) 256
258 where F n (2π) = n n f(ω k). Then we can represent Ã(φ, I n) as Ã(φ, I n ) Ã(φ, f) = F (2π) ˆF n (2π) n where ψ n (ω k )I n (ω k ), k= ψ n (ω k ) = φ(ω k )F n (2π) n PROOF. Basic algebra gives φ(ω j )f(ω j ) and n ψ(ω k )f(ω k ) = 0. (9.34) k= Ã(φ, I n ) Ã(φ, f) = n = n = n = n ( φ(ωk )I n (ω k ) φ(ω ) k)f(ω k ) ˆF n (2π) F n (2π) ( φ(ωk )F n (2π)I n (ω k ) φ(ω k ) ˆF ) n (2π)f(ω k ) k= F n (2π) ˆF n (2π) ( φ(ω k )F n (2π) ) I n (ω k ) φ(ω k )f(ω k ) n F k= n (2π) ˆF n (2π) ψ(ω k )I n (ω k ) F n (2π) ˆF n (2π), k= k= k= where F n (2π) and ψ are defined as above. To show (9.34), again we use basic algebra to give n ψ(ω k )f(ω k ) = n k= = n k= ( φ(ω)f n (2π) n φ(ω k )f(ω k )F n (2π) n k= ) φ(ω j )f(ω j ) f(ω k ) φ(ω k )f(ω k ) n k= f(ω j ) = 0. From the lemma above we see that Ã(φ, I n) Ã(φ, f) almost seems to satisfy the conditions in Lemma 9.5., the only difference is the random term ĉ n (0) = F n (2π) in the denominator. We now show that that we can replace F n (2π) with it s limit and that error is asymptotically negligible. Let Ã(φ, I n ) Ã(φ, f) = F n (2π) ˆF n (2π) n ψ n (ω k )I n (ω k ) := B(ψ, I n ) k= 257
259 and B(ψ n, I n ) = F n (2π) 2 n ψ(ω k )I n (ω k ). By using the mean value theorem (basically the Delta method) and expanding B(ψ n, I n ) about B(ψ n, I n ) (noting that B(φ n, f) = 0) gives k= B(ψ, I n ) B(ψ, I n ) = ( ˆFn (2π) F n (2π) ) } {{ } O p(n /2 ) F n (2π) 3 n ψ n (ω k )I n (ω k ) = O p ( n ), k= } {{ } O p(n /2 ) where F n (2π) lies between F n (2π) and F n (2π). Ã(φ, I n ) Ã(φ, f) is determined by Therefore the limiting distribution variance of Ã(φ, I n ) Ã(φ, f) = B(ψ n, I n ) + O p (n /2 ). B(ψ n, I n ) does satisfy the conditions in (9.32) and the lemma below immediately follows. Lemma Suppose that {X t } is a linear time series, then where var(b(ψ n, I n )) = n 2π 0 ψ(ω) 2 f(ω) 2 dω + n 2π ψ(ω) = φ(ω)f (2π) 2π φ(ω)f(ω)dω. 2π 0 0 ψ(ω)ψ(2π ω)f(ω) 2 dω + O( n 2 ), Therefore, the limiting variance of Ã(φ, I n) is 2π ψ(ω) 2 f(ω) 2 dω + 2π ψ(ω)ψ(2π ω)f(ω) 2 dω + O( n 0 n 0 n 2 ). This is a more elegant explanation as to why under linearity the limiting variance of the correlation estimator does not contain the fourth order cumulant term. It also allows for a general class of statistics. 258
260 Remark 9.5. (Applications) As we remarked above, many statistics can be written as a ratio statistic. The advantage of this is that the variance of the limiting distribution is only in terms of the spectral densities, and not any other higher order terms (which are difficult to estimate). Another perk is that simple schemes such as the frequency domain bootstrap can be used to estimate the finite sample distributions of statistics which satisfy the assumptions in Lemma 9.5. or is a ratio statistic (so long as the underlying process is linear), see Dahlhaus and Janas (996) for the details. The frequency domain bootstrap works by constructing the DFT from the data {J n (ω)} and dividing by the square root of either the nonparametric estimator of f or a parametric estimator, ie. {J n (ω)/ ˆfn (ω)}, these are close to constant variance random variables. {Ĵε(ω k ) = J n (ω k )/ ˆfn (ω k )} is bootstrapped, thus J n(ω k ) = Ĵ ε (ω k ) ˆfn (ω k ) is used as the bootstrap DFT. This is used to construct the bootstrap estimator, for example The Whittle likelihood estimator. The sample correlation. With these bootstrap estimators we can construct an estimator of the finite sample distribution. The nature of frequency domain bootstrap means that the higher order dependence structure is destroyed, eg. cum (J n(ω k ), J n(ω k2 ),..., J n(ω kr )) = 0 (where cum is the cumulant with respect to the bootstrap measure) if all the k i s that are not the same. However, we know from Proposition 8.6. that for the actual DFT this is not the case, there is still some small dependence, which can add up. Therefore, the frequency domain bootstrap is unable to capture any structure beyond the second order. This means for a linear time series which is not Gaussian the frequency domain bootstrap cannot approximate the distribution of the sample covariance (since it is asymptotically with normal with a variance which contains the forth order cumulant), but it can approximate the finite sample distribution of the correlation. Remark (Estimating κ 4 in the case of linearity) Suppose that {X t } is a linear time series X t = ψ j ε t j, j= with E(ε t ) = 0, var(ε t ) = σ 2 and cum 4 (ε t ) = κ 4. Then we can use the spectral density estimator to estimate κ 4 without any additional assumptions on {X t } (besides linearity). Let f(ω) denote the 259
261 spectral density of {X t } and g 2 (ω) the spectral density of {X 2 t }, then it can be shown that κ 4 = 2πg 2(0) 4π 2π 0 f(ω) 2 dω ( 2π 0 f(ω)dω ) 2. Thus by estimating f and g 2 we can estimate κ 4. Alternatively, we can use the fact that for linear time series, the fourth order spectral density f 4 (ω, ω 2, ω 3 ) = κ 4 A(ω )A(ω 2 )A(ω 3 )A( ω ω 2 ω 3 ). Thus we have κ 4 = σ4 f 4 (ω, ω, ω 2 ). f(ω )f(ω 2 ) This just demonstrates, there is no unique way to solve a statistical problem! 9.6 Goodness of fit tests for linear time series models As with many other areas in statistics, we often want to test the appropriateness of a model. In this section we briefly consider methods for validating whether, say an ARMA(p, q), is the appropriate model to fit to a time series. One method is to fit the model to the data and the estimate the residuals and conduct a Portmanteau test (see Section 3, equation (6.9)) on the estimated residuals. It can be shown that if model fitted to the data is the correct one, the estimated residuals behave almost like the true residuals in the model and the Portmanteau test statistic h S h = n ˆρ n (r) 2, where ˆρ n (r) = ĉ n (r)/ĉ n (0) r= ĉ n (r) = n n r t= ˆε tˆε t+r should be asymptotically a chi-squared. An alternative (but somehow equivalent) way to do the test, is through the DFTs. We recall if the time series is linear then (9.) is true, thus I X (ω) f θ (ω) = J ε(ω) 2 + o p (). 260
262 Therefore, if we fit the correct model to the data we would expect that I X (ω) fˆθ(ω) = J ε(ω) 2 + o p (). where ˆθ are the model parameter estimators. Now J ε (ω) 2 has the special property that not only is it almost uncorrelated at various frequencies, but it is constant over all the frequencies. Therefore, we would expect that n/2 2π (I X (ω) n fˆθ(ω) 2) D N(0, ) k= Thus, as an alternative to the goodness fit test based on the portmanteau test statistic we can use the above as a test statistic, noting that under the alternative the mean would be different. 26
263 Chapter 0 Consistency and and asymptotic normality of estimators In the previous chapter we considered estimators of several different parameters. The hope is that as the sample size increases the estimator should get closer to the parameter of interest. When we say closer we mean to converge. In the classical sense the sequence {x k } converges to x (x k x), if x k x 0 as k (or for every ε > 0, there exists an n where for all k > n, x k x < ε). Of course the estimators we have considered are random, that is for every ω Ω (set of all out comes) we have an different estimate. The natural question to ask is what does convergence mean for random sequences. 0. Modes of convergence We start by defining different modes of convergence. Definition 0.. (Convergence) Almost sure convergence We say that the sequence {X t } converges almost sure to µ, if there exists a set M Ω, such that P(M) = and for every ω N we have X t (ω) µ. 262
264 In other words for every ε > 0, there exists an N(ω) such that X t (ω) µ < ε, (0.) for all t > N(ω). Note that the above definition is very close to classical convergence. We denote X t µ almost surely, as X t a.s. µ. An equivalent definition, in terms of probabilities, is for every ε > 0 X t a.s. µ if P (ω; m= t=m { X t (ω) µ > ε}) = 0. It is worth considering briefly what m= t=m { X t (ω) µ > ε} means. If m= t=m { X t (ω) µ > ε}, then there exists an ω m= t=m { X t (ω) µ > ε} such that for some infinite sequence {k j }, we have X kj (ω ) µ > ε, this means X t (ω ) does not converge to µ. Now let m= t=m { X t (ω) µ > ε} = A, if P (A) = 0, then for most ω the sequence {X t (ω)} converges. Convergence in mean square We say X t µ in mean square (or L 2 convergence), if E(X t µ) 2 0 as t. Convergence in probability Convergence in probability cannot be stated in terms of realisations X t (ω) but only in terms P of probabilities. X t is said to converge to µ in probability (written X t µ) if P ( X t µ > ε) 0, t. Often we write this as X t µ = o p (). If for any γ we have E(X t µ) γ 0 t, then it implies convergence in probability (to see this, use Markov s inequality). Rates of convergence: (i) Suppose a t 0 as t. We say the stochastic process {X t } is X t µ = O p (a t ), 263
265 if the sequence {a t X t µ } is bounded in probability (this is defined below). We see from the definition of boundedness, that for all t, the distribution of a t X t µ should mainly lie within a certain interval. (ii) We say the stochastic process {X t } is X t µ = o p (a t ), if the sequence {a t X t µ } converges in probability to zero. Definition 0..2 (Boundedness) (i) Almost surely bounded If the random variable X is almost surely bounded, then for a positive sequence {e k }, such that e k as k (typically e k = 2 k is used), we have P (ω; { k= { X(ω) e k}}) =. Usually to prove the above we consider the complement P ((ω; { k= { X e k}}) c ) = 0. Since ( k= { X e k}) c = k= { X > e k} k= m=k { X > e k}, to show the above we show P (ω : { k= m=k { X(ω) > e k}}) = 0. (0.2) We note that if (ω : { k= m=k { X(ω) > e k}}), then there exists a ω Ω and an infinite subsequence k j, where X(ω ) > e kj, hence X(ω ) is not bounded (since e k ). To prove (0.2) we usually use the Borel Cantelli Lemma. This states that if k= P (A k) <, the events {A k } occur only finitely often with probability one. Applying this to our case, if we can show that m= P (ω : { X(ω) > e m }) <, then { X(ω) > e m } happens only finitely often with probability one. Hence if m= P (ω : { X(ω) > e m }) <, then P (ω : { k= m=k { X(ω) > e k}}) = 0 and X is a bounded random variable. It is worth noting that often we choose the sequence e k = 2 k, in this case m= P (ω : { X(ω) > e m }) = m= P (ω : {log X(ω) > log 2k }) CE(log X ). Hence if we can show that E(log X ) <, then X is bounded almost surely. b (ii) Sequences which are bounded in probability A sequence is bounded in probability, 264
266 written X t = O p (), if for every ε > 0, there exists a δ(ε) < such that P ( X t δ(ε)) < ε. Roughly speaking this means that the sequence is only extremely large with a very small probability. And as the largeness grows the probability declines. 0.2 Sampling properties Often we will estimate the parameters by maximising (or minimising) a criterion. Suppose we have the criterion L n (a) (eg. likelihood, quasi-likelihood, Kullback-Leibler etc) we use as an estimator of a 0, â n where â n = arg max a Θ L n(a) and Θ is the parameter space we do the maximisation (minimisation) over. parameter a should maximise (minimise) the limiting criterion L. Typically the true If this is to be a good estimator, as the sample size grows the estimator should converge (in some sense) to the parameter we are interesting in estimating. As we discussed above, there are various modes in which we can measure this convergence (i) almost surely (ii) in probability and (iii) in mean squared error. Usually we show either (i) or (ii) (noting that (i) implies (ii)), in time series its usually quite difficult to show (iii). Definition 0.2. (i) An estimator â n is said to be almost surely consistent estimator of a 0, if there exists a set M Ω, where P(M) = and for all ω M we have â n (ω) a. (ii) An estimator â n is said to converge in probability to a 0, if for every δ > 0 P ( â n a > δ) 0 T. To prove either (i) or (ii) usually involves verifying two main things, pointwise convergence and equicontinuity. 265
267 0.3 Showing almost sure convergence of an estimator We now consider the general case where L n (a) is a criterion which we maximise. Let us suppose we can write L n as L n (a) = l t (a), (0.3) n t= where for each a Θ, {l t (a)} t is a ergodic sequence. Let L(a) = E(l t (a)), (0.4) we assume that L(a) is continuous and has a unique maximum in Θ. We define the estimator ˆα n where ˆα n = arg min a Θ L n (a). Definition 0.3. (Uniform convergence) L n (a) is said to almost surely converge uniformly to L(a), if sup L n (a) L(a) a.s. 0. a Θ In other words there exists a set M Ω where P (M) = and for every ω M, sup L n (ω, a) L(a) 0. a Θ Theorem 0.3. (Consistency) Suppose that â n = arg max a Θ L n (a) and a 0 = arg max a Θ L(a) is the unique maximum. If sup a Θ L n (a) L(a) a.s. 0 as n and L(a) has a unique maximum. Then Then â n a.s. a 0 as n. PROOF. We note that by definition we have L n (a 0 ) L n (â n ) and L(â n ) L(a 0 ). Using this inequality we have L n (a 0 ) L(a 0 ) L n (â n ) L(a 0 ) L n (â n ) L(â n ). Therefore from the above we have L n (â T ) L(a 0 ) max { L n (a 0 ) L(a 0 ), L n (â T ) L(â n ) } sup L n (a) L(a). a Θ 266
268 Hence since we have uniform converge we have L n (â n ) L(a 0 ) a.s. 0 as n. Now since L(a) has a unique maximum, we see that L n (â n ) L(a 0 ) a.s. 0 implies â n a.s. a 0. We note that directly establishing uniform convergence is not easy. Usually it is done by assuming the parameter space is compact and showing point wise convergence and stochastic equicontinuity, these three facts imply uniform convergence. Below we define stochastic equicontinuity and show consistency under these conditions. Definition The sequence of stochastic functions {f n (a)} n is said to be stochastically equicontinuous if there exists a set M Ω where P (M) = and for every ω M and and ε > 0, there exists a δ and such that for every ω M for all n > N(ω). sup f n (ω, a ) f n (ω, a 2 ) ε, a a 2 δ A sufficient condition for stochastic equicontinuity of f n (a) (which is usually used to prove equicontinuity), is that f n (a) is in some sense Lipschitz continuous. In other words, sup f n (a ) f n (a 2 ) < K n a a 2, a,a 2 Θ where k n is a random variable which converges to a finite constant as n (K n a.s. K 0 as n ). To show that this implies equicontinuity we note that K n a.s. K 0 means that for every ω M (P (M) = ) and γ > 0, we have K n (ω) K 0 < γ for all n > N(ω). Therefore if we choose δ = ε/(k 0 + γ) we have for all n > N(ω). sup f n (ω, a ) f n (ω, a 2 ) < ε, a a 2 ε/(k 0 +γ) In the following theorem we state sufficient conditions for almost sure uniform convergence. It is worth noting this is the Arzela-Ascoli theorem for random variables. Theorem (The stochastic Ascoli Lemma) Suppose the parameter space Θ is compact, for every a Θ we have L n (a) a.s. L(a) and L n (a) is stochastic equicontinuous. Then sup a Θ L n (a) L(a) a.s. 0 as n. 267
269 We use the theorem below. Corollary 0.3. Suppose that â n = arg max a Θ L n (a) and a 0 = arg max a Θ L(a), moreover L(a) has a unique maximum. If (i) We have point wise convergence, that is for every a Θ we have L n (a) a.s. L(a). (ii) The parameter space Θ is compact. (iii) L n (a) is stochastic equicontinuous. then â n a.s. a 0 as n. PROOF. By using Theorem three assumptions imply that sup θ Θ L n (θ) L(θ) 0, thus by using Theorem 0.3. we obtain the result. We prove Theorem in the section below, but it can be omitted on first reading Proof of Theorem (The stochastic Ascoli theorem) We now show that stochastic equicontinuity and almost pointwise convergence imply uniform convergence. We note that on its own, pointwise convergence is a much weaker condition than uniform convergence, since for pointwise convergence the rate of convergence can be different for each parameter. Before we continue a few technical points. We recall that we are assuming almost pointwise convergence. This means for each parameter a Θ there exists a set N a Ω (with P (N a ) = ) such that for all ω N a L n (ω, a) L(a). In the following lemma we unify this set. That is show (using stochastic equicontinuity) that there exists a set N Ω (with P (N) = ) such that for all ω N L n (ω, a) L(a). Lemma 0.3. Suppose the sequence {L n (a)} n is stochastically equicontinuous and also pointwise convergent (that is L n (a) converges almost surely to L(a)), then there exists a set M Ω where P ( M) = and for every ω M and a Θ we have L n (ω, a) L(a) 0. PROOF. Enumerate all the rationals in the set Θ and call this sequence {a i } i. Since we have almost sure convergence, this implies for every a i there exists a set M ai where P (M ai ) = and for every 268
270 ω M ai we have L T (ω, a i ) L(a i ) 0. Define M = M ai, since the number of sets is countable P (M) = and for every ω M and a i we have L n (ω, a i ) L(a i ). Since we have stochastic equicontinuity, there exists a set M where P ( M) = and for every ω M, {L n (ω, )} is equicontinuous. Let M = M { M ai }, we will show that for all a Θ and ω M we have L n (ω, a) L(a). By stochastic equicontinuity for every ω M and ε/3 > 0, there exists a δ > 0 such that sup L n (ω, b ) L n (ω, b 2 ) ε/3, (0.5) b b 2 δ for all n > N(ω). Furthermore by definition of M for every rational aj Θ and ω N we have L n (ω, a i ) L(a i ) ε/3, (0.6) where n > N (ω). Now for any given a Θ, there exists a rational a i such that a a j δ. Using this, (0.5) and (0.6) we have L n (ω, a) L(a) L n (ω, a) L n (ω, a i ) + L n (ω, a i ) L(a i ) + L(a) L(a i ) ε, for n > max(n(ω), N (ω)). To summarise for every ω M and a Θ, we have L n (ω, a) L(a) 0. Hence we have pointwise covergence for every realisation in M. We now show that equicontinuity implies uniform convergence. Proof of Theorem Using Lemma 0.3. we see that there exists a set M Ω with P ( M) =, where L n is equicontinuous and also pointwise convergent. We now show uniform convergence on this set. Choose ε/3 > 0 and let δ be such that for every ω M we have sup L T (ω, a ) L T (ω, a 2 ) ε/3, (0.7) a a 2 δ for all n > n(ω). Since Θ is compact it can be divided into a finite number of open sets. Construct the sets {O i } p i=, such that Θ p i= O i and sup x,y,i x y δ. Let {a i } p i= be such that a i O i. We note that for every ω M we have L n (ω, a i ) L(a i ), hence for every ε/3, there exists an n i (ω) such that for all n > n i (ω) we have L T (ω, a i ) L(a i ) ε/3. Therefore, since p is finite (due 269
271 to compactness), there exists a ñ(ω) such that max L n(ω, a i ) L(a i ) ε/3, i p for all n > ñ(ω) = max i p (n i (ω)). For any a Θ, choose the i, such that open set O i such that a O i. Using (0.7) we have L T (ω, a) L T (ω, a i ) ε/3, for all n > n(ω). Altogether this gives L T (ω, a) L(a) L T (ω, a) L T (ω, a i ) + L T (ω, a i ) L(a i ) + L(a) L(a i ) ε, for all n max(n(ω), ñ(ω)). We observe that max(n(ω), ñ(ω)) and ε/3 does not depend on a, therefore for all n max(n(ω), ñ(ω)) and we have sup a L n (ω, a) L(a) < ε. This gives for every ω M (P( M) = ), sup a L n (ω, a) L(a) 0, thus we have almost sure uniform convergence. 0.4 Toy Example: Almost sure convergence of the least squares estimator for an AR(p) process In Chapter?? we will consider the sampling properties of many of the estimators defined in Chapter 6. However to illustrate the consistency result above we apply it to the least squares estimator of the autoregressive parameters. To simply notation we only consider estimator for AR() models. Suppose that X t satisfies X t = φx t + ε t (where φ < ). To estimate φ we use the least squares estimator defined below. Let L n (a) = n we use ˆφ n as an estimator of φ, where (X t ax t ) 2, (0.8) t=2 ˆφ n = arg min a Θ L T (a), (0.9) 270
272 where Θ = [, ]. How can we show that this is consistent? In the case of least squares for AR processes, â T has the explicit form ˆφ n = n n t=2 X tx t. n T t= X2 t By just applying the ergodic theorem to the numerator and denominator we get ˆφ n a.s. φ. It is worth noting, that unlike the Yule-Walker estimator < is not necessarily true. n n t=2 XtX t n n t= X2 t Here we will tackle the problem in a rather artifical way and assume that it does not have an explicit form and instead assume that ˆφ n is obtained by minimising L n (a) using a numerical routine. In order to derive the sampling properties of ˆφ n we need to directly study the least squares criterion L n (a). We will do this now in the least squares case. We will first show almost sure convergence, which will involve repeated use of the ergodic theorem. We will then demonstrate how to show convergence in probability. We look at almost sure convergence as its easier to follow. Note that almost sure convergence implies convergence in probability (but the converse is not necessarily true). The first thing to do it let l t (a) = (X t ax t ) 2. Since {X t } is an ergodic process (recall Example??(ii)) by using Theorem?? we have for a, that {l t (a)} t is an ergodic process. Therefore by using the ergodic theorem we have L n (a) = n t=2 l t (a) a.s. E(l 0 (a)). In other words for every a [, ] we have that L n (a) a.s. E(l 0 (a)) (almost sure pointwise convergence). Since the parameter space Θ = [, ] is compact and a is the unique minimum of l( ) in the 27
273 parameter space, all that remains is to show show stochastic equicontinuity. From this we deduce almost sure uniform convergence. To show stochastic equicontinuity we expand L T (a) and use the mean value theorem to obtain L n (a ) L n (a 2 ) = L T (ā)(a a 2 ), (0.0) where ā [min[a, a 2 ], max[a, a 2 ]] and Because ā [, ] we have L n (ā) = 2 n X t (X t āx t ). t=2 L n (ā) D n, where D n = 2 n ( X t X t + Xt ). 2 t=2 Since {X t } t is an ergodic process, then { X t X t + Xt 2 } is an ergodic process. var(ε t ) <, by using the ergodic theorem we have Therefore, if D n a.s. 2E( X t X t + X 2 t ). Let D := 2E( X t X t + Xt 2 ). Therefore there exists a set M Ω, where P(M) = and for every ω M and ε > 0 we have D T (ω) D δ, for all n > N(ω). Substituting the above into (0.0) we have L n (ω, a ) L n (ω, a 2 ) D n (ω) a a 2 (D + δ ) a a 2, for all n N(ω). Therefore for every ε > 0, there exists a δ := ε/(d + δ ) such that sup L n (ω, a ) L n (ω, a 2 ) ε, a a 2 ε/(d+δ ) for all n N(ω). Since this is true for all ω M we see that {L n (a)} is stochastically equicontinuous. 272
274 Theorem 0.4. Let ˆφ n be defined as in (0.9). Then we have ˆφ n a.s. φ. PROOF. Since {L n (a)} is almost sure equicontinuous, the parameter space [, ] is compact and we have pointwise convergence of L n (a) a.s. L(a), by using Theorem 0.3. we have that ˆφ n a.s. a, where a = min a Θ L(a). Finally we need to show that a = φ. Since L(a) = E(l 0 (a)) = E(X ax 0 ) 2, we see by differentiating L(a) with respect to a, that it is minimised at a = E(X 0 X )/E(X0 2 ), hence a = E(X 0 X )/E(X0 2 ). To show that this is φ, we note that by the Yule-Walker equations X t = φx t + ɛ t E(X t X t ) = φe(xt ) 2 + E(ɛ t X t ). } {{ } =0 Therefore φ = E(X 0 X )/E(X 2 0 ), hence ˆφ n a.s. φ. We note that by using a very similar methods we can show strong consistency of the least squares estimator of the parameters in an AR(p) model. 0.5 Convergence in probability of an estimator We described above almost sure (strong) consistency (â T a.s. a 0 ). Sometimes its not possible to show strong consistency (eg. when ergodicity cannot be verified). As an alternative, weak consistency where â T P a0 (convergence in probability), is shown. This requires a weaker set of conditions, which we now describe: (i) The parameter space Θ should be compact. (ii) Probability pointwise convergence: for every a Θ L n (a) P L(a). (iii) The sequence {L n (a)} is equicontinuous in probability. That is for every ɛ > 0 and η > 0 there exists a δ such that lim P n ( sup L n (a ) L n (a 2 ) > ɛ a a 2 δ ) < η. (0.) If the above conditions are satisified we have â T P a0. 273
275 Verifying conditions (ii) and (iii) may look a little daunting but by using Chebyshev s (or Markov s) inequality it can be quite straightforward. For example if we can show that for every a Θ E(L n (a) L(a)) 2 0 T. Therefore by applying Chebyshev s inequality we have for every ε > 0 that P ( L n (a) L(a) > ε) E(L n(a) L(a)) 2 ε 2 0 T. Thus for every a Θ we have L n (a) P L(a). have To show (iii) we often use the mean value theorem L n (a). Using the mean value theorem we L n (a ) L n (a 2 ) sup a L n (a) 2 a a 2. a Now if we can show that sup n E sup a a L n (a) 2 < (in other words it is uniformly bounded in probability over n) then we have the result. To see this observe that P ( sup L n (a ) L n (a 2 ) > ɛ a a 2 δ ) ( ) P sup a L n (a) 2 a a 2 > ɛ a Ω sup n E( a a 2 sup a Ω a L n (a) 2 ). ɛ Therefore by a careful choice of δ > 0 we see that (0.) is satisfied (and we have equicontinuity in probability). 0.6 Asymptotic normality of an estimator Once consistency of an estimator has been shown this paves the way to showing normality. To make the derivations simple we will assume that θ is univariate (this allows to easily use Taylor expansion). We will assume that that the third derivative of the contrast function, L n (θ), exists, its expectation is bounded and it s variance converges to zero as n. If this is the case we have have the following result 274
276 Lemma 0.6. Suppose that the third derivative of the contrast function L n (θ) exists, for k = 0,, 2 E( k L n(θ) ) = k L theta k θ k and var( k L n(θ) ) 0 as n and 3 L n(θ) theta k theta 3 is bounded by a random variable Z n which is independent of n where E(Z n ) < and var(z n ) 0. Then we have (ˆθ n θ 0 ) = V (θ) L n(θ) + o p () L n(θ), θ θ=θ 0 θ θ=θ 0 where V (θ 0 ) = 2 L(θ) θ 2 θ 0. PROOF. By the mean value theorem we have L n (θ) = L n(θ) (ˆθ n θ 0 ) 2 L n (θ) θ θ=θ 0 θ θ=ˆθ n θ 2 = (ˆθ n θ 0 ) 2 L n (θ) θ= θ n θ 2 (0.2) θ= θ n where θ n lies between θ 0 and ˆθ n. We first study 2 L n(θ) θ 2 θ= θ n. By using the man value theorem we have 2 L n (θ) θ 2 θ= θ n = 2 L n (θ) θ 2 θ 0 + ( θ n θ 0 ) 2 L n (θ) θ 2 θ= θ n where θ n lies between θ 0 and θ n. Since 2 L n(θ) θ 2 we have 2 L(θ) = V (θ θ 0 θ 2 θ 0 ), under the stated assumptions 0 2 L θ 2 θ= θ n V (θ 0 ) θn θ 0 2 L n (θ) θ 2 θ= θ n θn θ 0 W n. Therefore, by consistency of the estimator it is clear that 2 L θ 2 θ= θ n (0.2) we have P V (θ0 ). Substituting this into L θ θ=θ 0 = (ˆθ n θ 0 )(V (θ 0 ) + o()), since V (θ 0 ) is bounded away from zero we have [ 2 L θ 2 θ= θ n ] = V (θ 0 ) + o p () and we obtain the desired result. The above result means that the distribution of (ˆθ n θ 0 ) is determined by L θ θ=θ 0. following section we show to show asymptotic normality of L θ θ=θ 0. In the 275
277 0.6. Martingale central limit theorem The first central limit theorm goes back to the asymptotic distribution of sums of binary random variables (these have a binomial distribution and Bernoulli showed that they could be approximated to a normal distribution). This result was later generalised to sums of iid random variables. However from mid 20th century to late 20th century several advances have been made for generalisating the results to dependent random variables. These include generalisations to random variables which have n-dependence, mixing properties, cumulant properties, near-epoch dependence etc (see, for example, Billingsley (995) and Davidson (994)). In this section we will concentrate on a central limit theore for martingales. Our reason for choosing this flavour of CLT is that it can be applied in various estimation settings - as it can often be shown that the derivative of a criterion at the true parameter is a martingale. Let us suppose that S n = n Z t, t= we shall show asymptotic normality of n(s n E(S n )). The reason for normalising by n, is that (Ŝn E(S n )) a.s. 0 as n, hence in terms of distributions it converges towards the point mass at zero. Therefore we need to increase the magnitude of the difference. It it can show that var(s n ) = O(n ), then n(s n E(S 0 ) = O(). Definition 0.6. The random variables {Z t } are called martingale differences if E(Z t Z t, Z t 2,...) = 0. The sequence {S T } T, where S T = T k= are called martingales if {Z t } are martingale differences. Z t Remark 0.6. (Martingales and covariances) We observe that if {Z t } are martingale dif- 276
278 ferences then if t > s and F s = σ(z s, Z s,...) cov(z s, Z t ) = E(Z s Z t ) = E ( E(Z s Z t F s ) ) = E ( Z s E(Z t F s ) ) = E(Z s 0) = 0. Hence martingale differences are uncorrelated. Example 0.6. Suppose that X t = φx t + ε t, where {ε t } are idd rv with E(ε t ) = 0 and φ <. Then {ε t X t } t are martingale differences. Let us define S T as T S T = Z t, (0.3) t= where F t = σ(z t, Z t,...), E(Z t F t ) = 0 and E(Zt 2 ) <. In the following theorem adapted from Hall and Heyde (980), Theorem 3.2 and Corollary 3., we show that S T is asymptotically normal. Theorem 0.6. Let {S T } T be defined as in (0.62). Further suppose T where σ 2 is a finite constant, for all ε > 0, T (this is known as the conditional Lindeberg condition) and Then we have T t= Z 2 t P σ 2, (0.4) T E(Zt 2 I( Z t > ε T ) F t ) P 0, (0.5) t= T T E(Zt 2 F t ) P σ 2. (0.6) t= T /2 S T D N (0, σ 2 ). (0.7) 277
279 0.6.2 Example: Asymptotic normality of the least squares estimator In this section we show asymptotic normality of the least squares estimator of the AR() (X t = φx t + ε t, with var(ε t ) = σ 2 ) defined in (0.8). We call that the least squares estimator is ˆφ n = arg max a [,] L n (a). Recalling the criterion the first and the second derivative is L n (a) = n L n (a) = 2 n and 2 L n (a) = Therefore by using (??) we have 2 n (X t ax t ) 2, t=2 X t (X t ax t ) = t=2 t=2 X 2 t. 2 n X t ɛ t t=2 ( ˆφ n φ) = ( 2 L n ) Ln (φ). (0.8) Since {X 2 t } are ergodic random variables, by using the ergodic theorem we have 2 L n a.s. 2E(X 2 0 ). This with (0.8) implies n( ˆφn φ) = ( 2 ) L n } {{ } n Ln (φ). (0.9) a.s. (2E(X0 2)) To show asymptotic normality of n( ˆφ n φ), will show asymptotic normality of n L n (φ). We observe that L n (φ) = 2 n X t ɛ t, is the sum of martingale differences, since E(X t ɛ t X t ) = X t E(ɛ t X t ) = X t E(ɛ t ) = 0 (here we used Definition 0.6.). In order to show asymptotic of L n (φ) we will use the martingale central limit theorem. t=2 We now use Theorem 0.6. to show that n L n (φ) is asymptotically normal, which means we 278
280 have to verify conditions (0.4)-(0.6). We note in our example that Z t := X t ɛ t, and that the series {X t ɛ t } t is an ergodic process. Furthermore, since for any function g, E(g(X t ɛ t ) F t ) = E(g(X t ɛ t ) X t ), where F t = σ(x t, X t,...) we need only to condition on X t rather than the entire sigma-algebra F t. C : By using the ergodicity of {X t ɛ t } t we have n Zt 2 = n t= t= X 2 t ɛ 2 t P E(Xt ) 2 E(ɛ 2 t ) = σ 2 c(0). } {{ } = C2 : We now verify the conditional Lindeberg condition. n E(Zt 2 I( Z t > ε n) F t ) = n t= E(Xt ɛ 2 2 t I( X t ɛ t > ε n) X t ) t= We now use the Cauchy-Schwartz inequality for conditional expectations to split X 2 t ɛ2 t and I( X t ɛ t > ε). We recall that the Cauchy-Schwartz inequality for conditional expectations is E(X t Z t G) [E(X 2 t G)E(Z 2 t G)] /2 almost surely. Therefore n n n E(Zt 2 I( Z t > ε n) F t ) t= { E(X 4 t ɛ 4 t X t )E(I( X t ɛ t > ε n) 2 X t ) } /2 t= Xt E(ɛ 2 4 t ) /2 { E(I( X t ɛ t > ε n) 2 X t ) } /2. (0.20) t= We note that rather than use the Cauchy-Schwartz inequality we can use a generalisation of it called the Hölder inequality. The Hölder inequality states that if p + q =, then E(XY ) {E(X p )} /p {E(Y q )} /q (the conditional version also exists). The advantage of using this inequality is that one can reduce the moment assumptions on X t. Returning to (0.20), and studying E(I( X t ɛ t > ε) 2 X t ) we use that E(I(A)) = P(A) and the Chebyshev inequality to show E(I( X t ɛ t > ε n) 2 X t ) = E(I( X t ɛ t > ε n) X t ) = E(I( ɛ t > ε n/x t ) X t ) = P ε ( ɛ t > ε n )) X2 t var(ɛ t) X t ε 2. (0.2) n 279
281 Substituting (0.2) into (0.20) we have n E(Zt 2 I( Z t > ε n) F t ) t= { X X 2 n t E(ɛ 4 t ) /2 2 t var(ɛ t ) ε 2 n t= E(ɛ4 t ) /2 εn 3/2 X t 3 E(ɛ 2 t ) /2 t= E(ɛ4 t ) /2 E(ɛ 2 t ) /2 εn /2 n X t 3. t= } /2 If E(ɛ 4 t ) <, then E(X 4 t ) <, therefore by using the ergodic theorem we have n E( X 0 3 ). Since almost sure convergence implies convergence in probability we have n P 0. E(Zt 2 I( Z t > ε n) F t ) E(ɛ4 t ) /2 E(ɛ 2 t ) /2 } εn{{ /2 X t 3 } n t= 0 } {{ } t= P E( X 0 3 ) n t= X t 3 a.s. Hence condition (0.5) is satisfied. C3 : We need to verify that n E(Zt 2 F t ) P σ 2. t= Since {X t } t is an ergodic sequence we have n = n P E(Zt 2 F t ) = n t= E(Xt ε 2 2 X t ) t= Xt E(ε 2 2 X t ) = E(ε 2 ) Xt 2 n t= } {{ } a.s. E(X0 2) t= E(ε 2 )E(X 2 0) = σ 2 c(0), hence we have verified condition (0.6). 280
282 Altogether conditions C-C3 imply that n Ln (φ) = n D X t ɛ t N (0, σ 2 c(0)). (0.22) t= Recalling (0.9) and that n L n (φ) D N (0, σ 2 ) we have n( ˆφn φ) = ( 2 L n ) } {{ } a.s. n Ln (φ). (0.23) } {{ } (2E(X0 2)) D N (0,σ 2 c(0)) Using that E(X0 2 ) = c(0), this implies that n( ˆφn φ) D N (0, 4 σ2 c(0) ). (0.24) Thus we have derived the limiting distribution of ˆφ n. Remark We recall that ( ˆφ n φ) = ( 2 L n ) Ln (φ) = 2 n n t=2 ε tx t 2 n n t=2 X2 t, (0.25) and that var( 2 n n t=2 ε tx t ) = 2 n n t=2 var(ε tx t ) = O( n ). This implies ( ˆφ n φ) = O p (n /2 ). Indeed the results also holds almost surely ( ˆφ n φ) = O(n /2 ). (0.26) The same result is true for autoregressive processes of arbitrary finite order. That is n(ˆφn φ) D N (0, E(Γ p ) σ 2 ). (0.27) Example: Asymptotic normality of the weighted periodogram Previously we have discussed the weight peiodogram, here we show normality of it, in the case that the time series X t is zero mean linear time series (has the representation X t = j ψ jε t j ). 28
283 Recalling Lemma we have A(φ, I n ) = n = n φ(ω k )I n (ω k ) k= φ(ω k ) A(ω k ) 2 I ε (ω k ) + o( n ). k= Therefore we will show asymptotic normality of n n k= φ(ω k) A(ω k ) 2 I ε (ω k ), which will give asymptotic normality of A(φ, I n ). Expanding I ε (ω k ) and substituting this into n n k= φ(ω k) A(ω k ) 2 I ε (ω k ) gives n φ(ω k ) A(ω k ) 2 I ε (ω k ) = n k= where g n (t τ) = n k= t,τ= ε t ε τ n φ(ω k ) A(ω k ) 2 exp(iω k (t τ)) = n k= ε t ε τ g n (t τ) t,τ= φ(ω k ) A(ω k ) 2 exp(iω k (t τ)) = 2π φ(ω) A(ω) 2 exp(iω(t τ))dω + O( 2π 0 n 2 ), (the rate for the derivative exchange is based on assuming that the second derivatives of A(ω) and φ exist and φ(0) = φ(2π)). We can rewrite n n t,τ= ε tε τ g n (t τ) as n = n := n [ε t ε τ E(ε t ε τ )]g n (t τ) t,τ= ( [(ε 2 t E(ε 2 ( t )]g n (0) + ε t ε τ [g n (t τ) g n (τ t)] )) t= t= Z t,n where it is straightforward to show that {Z t,n } are the sum of martingale differences. Thus we can show that n t,τ= ε t ε τ g n (t τ) E ( n t,τ= τ<t ε t ε τ g n (t τ) ) = n satisfies the conditions of the martingale central limit theorem, which gives asymptotic normality of n n t,τ= ε tε τ g n (t τ) and thus A(φ, I n ). In the remainder of this chapter we obtain the sampling properties of the ARMA estimators t= Z t,n 282
284 defined in Sections 7.2. and Asymptotic properties of the Hannan and Rissanen estimation method In this section we will derive the sampling properties of the Hannan-Rissanen estimator. We will obtain an almost sure rate of convergence (this will be the only estimator where we obtain an almost sure rate). Typically obtaining only sure rates can be more difficult than obtaining probabilistic rates, moreover the rates can be different (worse in the almost sure case). We now illustrate why that is with a small example. Suppose {X t } are iid random variables with mean zero and variance one. Let S n = n t= X t. It can easily be shown that var(s n ) = n therefore S n = O p ( n ). (0.28) However, from the law of iterated logarithm we have for any ε > 0 P (S n ( + ε) 2n log log n infinitely often) = 0P (S n ( ε) 2n log log n infinitely often) = (0.29). Comparing (0.28) and (0.29) we see that for any given trajectory (realisation) most of the time n S n will be within the O( log log n n ) bound but there will be excursions above when it to the O( n bound. In other words we cannot say that n S n = ( n ) almost surely, but we can say that This basically means that 2 log log n n S n = O( ) almost surely. n Hence the probabilistic and the almost sure rates are (slightly) different. Given this result is true for the average of iid random variables, it is likely that similar results will hold true for various estimators. In this section we derive an almost sure rate for Hannan-Rissanen estimator, this rate will be determined by a few factors (a) an almost sure bound similar to the one derived above (b) the increasing number of parameters p n (c) the bias due to estimating only a finite number of parameters when there are an infinite number in the model. 283
We first recall the algorithm:

(i) Use least squares to estimate $\{b_j\}_{j=1}^{p_n}$ and define

$$\hat{\underline b}_n = \hat R_n^{-1}\hat r_n, \qquad (10.30)$$

where $\hat{\underline b}_n = (\hat b_{1,n},\ldots,\hat b_{p_n,n})'$,

$$\hat R_n = \frac1n\sum_{t=p_n+1}^{n}\underline X_{t-1}\underline X_{t-1}', \qquad \hat r_n = \frac1n\sum_{t=p_n+1}^{n}\underline X_{t-1}X_t,$$

and $\underline X_{t-1}' = (X_{t-1},\ldots,X_{t-p_n})$.

(ii) Estimate the residuals with

$$\tilde\varepsilon_t = X_t - \sum_{j=1}^{p_n}\hat b_{j,n}X_{t-j}.$$

(iii) Now use as estimates of $\phi_0$ and $\theta_0$ the values $\tilde\phi_n,\tilde\theta_n$, where

$$(\tilde\phi_n,\tilde\theta_n) = \arg\min\sum_{t=p_n+1}^{n}\Big(X_t - \sum_{j=1}^{p}\phi_jX_{t-j} - \sum_{i=1}^{q}\theta_i\tilde\varepsilon_{t-i}\Big)^2. \qquad (10.31)$$

We note that the above is easily minimised. In fact $(\tilde\phi_n,\tilde\theta_n) = \tilde R_n^{-1}\tilde s_n$, where

$$\tilde R_n = \frac1n\sum_{t=p_n+1}^{n}\tilde{\underline Y}_t\tilde{\underline Y}_t', \qquad \tilde s_n = \frac1n\sum_{t=p_n+1}^{n}\tilde{\underline Y}_tX_t, \qquad \tilde{\underline Y}_t' = (X_{t-1},\ldots,X_{t-p},\tilde\varepsilon_{t-1},\ldots,\tilde\varepsilon_{t-q}).$$

Let $\hat\varphi_n = (\tilde\phi_n,\tilde\theta_n)$. A small code sketch of this two-stage scheme is given below.
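Here is a minimal R sketch of the scheme (added; not the notes' code) for an ARMA(1,1), using `ar.ols` for stages (i)–(ii) and a least squares regression for stage (iii). The orders $(p,q)=(1,1)$ and the choice $p_n = 10\log_{10}n$ are illustrative assumptions only.

```r
# Sketch (added) of the Hannan-Rissanen two-stage scheme for an ARMA(1,1).
set.seed(2)
n <- 2000
x <- as.numeric(arima.sim(model = list(ar = 0.5, ma = 0.4), n = n))
pn <- floor(10 * log10(n))             # order of the long autoregression (assumption)
# Stages (i)-(ii): fit AR(pn) by least squares and extract the residuals
ar.fit <- ar.ols(x, aic = FALSE, order.max = pn,
                 demean = FALSE, intercept = FALSE)
ehat <- as.numeric(ar.fit$resid)       # NA for t <= pn, estimated residuals after
# Stage (iii): regress X_t on X_{t-1} and the estimated residual ehat_{t-1}
t.idx <- (pn + 2):n
fit <- lm(x[t.idx] ~ x[t.idx - 1] + ehat[t.idx - 1] - 1)
coef(fit)                              # compare with (phi, theta) = (0.5, 0.4)
```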
We observe that in the second stage of the scheme, where the ARMA parameters are estimated, it is important to show that the empirical residuals are close to the true residuals, that is $\tilde\varepsilon_t = \varepsilon_t + o(1)$. From the definition of $\tilde\varepsilon_t$, this depends on the rate of convergence of the AR estimators $\hat b_{j,n}$:

$$\tilde\varepsilon_t = X_t - \sum_{j=1}^{p_n}\hat b_{j,n}X_{t-j} = \varepsilon_t - \sum_{j=1}^{p_n}(\hat b_{j,n}-b_j)X_{t-j} + \sum_{j=p_n+1}^{\infty}b_jX_{t-j}. \qquad (10.32)$$

Hence

$$|\tilde\varepsilon_t - \varepsilon_t| \le \Big|\sum_{j=1}^{p_n}(\hat b_{j,n}-b_j)X_{t-j}\Big| + \Big|\sum_{j=p_n+1}^{\infty}b_jX_{t-j}\Big|. \qquad (10.33)$$

Therefore, to study the asymptotic properties of $\hat\varphi_n = (\tilde\phi_n,\tilde\theta_n)$ we need to:

- Obtain a rate of convergence for $\sup_j|\hat b_{j,n}-b_j|$.
- Obtain a rate for $|\tilde\varepsilon_t - \varepsilon_t|$.
- Use the above to obtain a rate for $\hat\varphi_n = (\tilde\phi_n,\tilde\theta_n)$.

We first obtain the uniform rate of convergence for $\sup_j|\hat b_{j,n}-b_j|$. Deriving this is technically quite challenging. We state the rate in the following theorem; an outline of the proof can be found in Section 10.7.1. The proof uses results from mixingale theory, which can be found in Appendix B.

Theorem 10.7.1 Suppose that $\{X_t\}$ is from an ARMA process where the roots of the true characteristic polynomials $\phi(z)$ and $\theta(z)$ both have absolute value greater than $1+\delta$. Let $\hat{\underline b}_n$ be defined as in (10.30). Then we have, almost surely,

$$\|\hat{\underline b}_n - \underline b_n\|_2 = O\Big(p_n^2\sqrt{\frac{(\log\log n)^{1+\gamma}\log n}{n}} + \frac{p_n^3}{n} + p_n\rho^{p_n}\Big)$$

for any $\gamma > 0$.

PROOF. See Section 10.7.1. □

Corollary 10.7.1 Suppose the conditions in Theorem 10.7.1 are satisfied. Then we have

$$|\tilde\varepsilon_t - \varepsilon_t| \le p_n\max_{1\le j\le p_n}|\hat b_{j,n}-b_j|\,Z_{t,p_n} + K\rho^{p_n}Y_{t-p_n}, \qquad (10.34)$$
287 where Z t,pn = p n pn t= X t j and Y t = p n t= ρ j X t, n t=p n+ ε t i X t j ε t i X t j = O(p n Q(n) + ρ pn ) (0.35) where Q(n) = p 2 n n t=p n+ (log log n) +γ log n n + p3 n n + p n ρ pn. PROOF. Using (0.33) we immediately obtain (0.34). To obtain (0.35) we use (0.33) to obtain n t=p n+ O(p n Q(n)) n ε t i ε t j ε t i ε t j = O(p n Q(n) + ρ pn ) (0.36) εt i X t j ε t i X t j n t=p n+ = O(p n Q(n) + ρ pn ). t=p n+ X t Z t,pn + O(ρ pn ) n X t j εt i ε t i t=p n+ X t Y t pn To prove (0.36) we use a similar method, hence we omit the details. We apply the above result in the theorem below. Theorem Suppose the assumptions in Theorem 0.7. are satisfied. Then ( ) ϕ n ϕ 2 (log log n) 0 = O p 3 +γ log n n + p4 n n n + p2 nρ pn. for any γ > 0, where ϕ n = ( φ n, θ n ) and ϕ 0 = (φ 0, θ 0 ). PROOF. We note from the definition of ϕ n that ) ( ϕn ϕ 0 = R n ( sn R ) n ϕ 0. Now in the R n and s n we replace the estimated residuals ε n with the true unobserved residuals. 286
288 This gives us ( ϕn ϕ 0 ) = R n ( ) sn R n ϕ 0 + (R s n n R n s n) (0.37) R n = n Y t Y t t=max(p,q) s n = n Y t X t, t=max(p,q) Y t = (X t,..., X t p, ε t,..., ε t q ) (recalling that Y t = (X t,..., X t p, ε t,..., ε t q ). error term is The (R n s n R n s n) = R ( R n R n ) n R n s n + R n (s n s n ). Now, almost surely R n, R n bound for R n R n and s n s n. We recall that = O() (if E(R n ) is non-singular). Hence we only need to obtain a R n R n = n t=p n+ (ỸtỸ t Y t Y t), hence the terms differ where we replace the estimated ε t with the true ε t, hence by using (0.35) and (0.36) we have almost surely R n R n = O(p n Q(n) + ρ pn ) and s n s n = O(p n Q(n) + ρ pn ). Therefore by substituting the above into (0.38) we obtain ) ( ) ( ϕn ϕ 0 = R n sn R n ϕ 0 + O(pn Q(n) + ρ pn ). (0.38) Finally using straightforward algebra it can be shown that s n R n ϕ n = n t=max(p,q) ε t Y t. (log log n) By using Theorem 0.7.3, below, we have s n R n ϕ n = O((p + q) +γ log n n ). Substituting 287
the above bound into (10.38), and noting that $O(Q(n))$ dominates $O\big(\sqrt{\frac{(\log\log n)^{1+\gamma}\log n}{n}}\big)$, gives

$$\|\tilde\varphi_n - \varphi_0\|_2 = O\Big(p_n^3\sqrt{\frac{(\log\log n)^{1+\gamma}\log n}{n}} + \frac{p_n^4}{n} + p_n^2\rho^{p_n}\Big)$$

for any $\gamma > 0$, and the required result. □

10.7.3 Proof of Theorem 10.7.1 (a rate for $\|\hat{\underline b}_n - \underline b_n\|_2$)

We observe that

$$\hat{\underline b}_n - \underline b_n = R_n^{-1}\big(\hat r_n - \hat R_n\underline b_n\big) + \big(\hat R_n^{-1} - R_n^{-1}\big)\big(\hat r_n - \hat R_n\underline b_n\big),$$

where $\underline b_n$, $R_n$ and $r_n$ are deterministic, with $\underline b_n = (b_1,\ldots,b_{p_n})'$, $(R_n)_{i,j} = E(X_iX_j)$ and $(r_n)_i = E(X_0X_i)$. Evaluating the Euclidean distance we have

$$\|\hat{\underline b}_n - \underline b_n\|_2 \le \|R_n^{-1}\|_{\mathrm{spec}}\|\hat r_n - \hat R_n\underline b_n\|_2 + \|R_n^{-1}\|_{\mathrm{spec}}\|\hat R_n^{-1}\|_{\mathrm{spec}}\|\hat R_n - R_n\|_2\|\hat r_n - \hat R_n\underline b_n\|_2, \qquad (10.39)$$

where we used that $\hat R_n^{-1} - R_n^{-1} = \hat R_n^{-1}(R_n - \hat R_n)R_n^{-1}$ and the norm inequalities. Now by using Lemma 5.4.1 we have $\lambda_{\min}(R_n) > \delta/2$ for all $n$. Thus our aim is to obtain almost sure bounds for $\|\hat r_n - \hat R_n\underline b_n\|_2$ and $\|\hat R_n - R_n\|_2$, which requires the theorem below.

Theorem 10.7.3 Let us suppose that $\{X_t\}$ has an ARMA representation where the roots of the characteristic polynomials $\phi(z)$ and $\theta(z)$ have absolute value greater than $1+\delta$. Then for any $\gamma > 0$,

(i) $\displaystyle\ \frac1n\sum_{t=r+1}^{n}\varepsilon_tX_{t-r} = O\Big(\sqrt{\frac{(\log\log n)^{1+\gamma}\log n}{n}}\Big)$  (10.40)

(ii) $\displaystyle\ \frac1n\sum_{t=\max(i,j)}^{n}\big[X_{t-i}X_{t-j} - E(X_{t-i}X_{t-j})\big] = O\Big(\sqrt{\frac{(\log\log n)^{1+\gamma}\log n}{n}}\Big)$.  (10.41)
PROOF. The result is proved in Appendix B.2. □

To obtain the bounds, we first note that if there were no MA component in the ARMA process, in other words if $\{X_t\}$ were an AR(p) process with $p_n \ge p$, then $\hat r_n - \hat R_n\underline b_n = \frac1n\sum_{t=p_n+1}^{n}\varepsilon_t\underline X_{t-1}$, which has mean zero. However, because an ARMA process has an AR($\infty$) representation and we are only estimating the first $p_n$ parameters, there is a bias in $\hat r_n - \hat R_n\underline b_n$. We therefore obtain the decomposition

$$(\hat r_n - \hat R_n\underline b_n)_r = \frac1n\sum_{t=p_n+1}^{n}\Big(X_t - \sum_{j=1}^{p_n}b_jX_{t-j}\Big)X_{t-r} = \underbrace{\frac1n\sum_{t=p_n+1}^{n}\varepsilon_tX_{t-r}}_{\text{stochastic term}} + \underbrace{\frac1n\sum_{t=p_n+1}^{n}\sum_{j=p_n+1}^{\infty}b_jX_{t-j}X_{t-r}}_{\text{bias}}. \qquad (10.42)$$

Therefore we can bound the bias with

$$\Big|\frac1n\sum_{t=p_n+1}^{n}\sum_{j=p_n+1}^{\infty}b_jX_{t-j}X_{t-r}\Big| \le K\rho^{p_n}\frac1n\sum_{t=1}^{n}|X_{t-r}|\sum_{j=1}^{\infty}\rho^{j}|X_{t-p_n-j}|. \qquad (10.43)$$

Let $Y_t = \sum_{j=1}^{\infty}\rho^j|X_{t-j}|$ and $S_{n,k,r} = \frac1n\sum_{t=1}^{n}|X_{t-r}|\sum_{j=1}^{\infty}\rho^j|X_{t-k-j}|$. We note that $\{Y_t\}$ and $\{X_t\}$ are ergodic sequences. By applying the ergodic theorem we can show that, for fixed $k$ and $r$, $S_{n,k,r}\overset{a.s.}{\to}E(|X_{t-r}|Y_{t-k})$. Hence the $S_{n,k,r}$ are almost surely bounded sequences, and

$$\rho^{p_n}\frac1n\sum_{t=1}^{n}|X_{t-r}|\sum_{j=1}^{\infty}\rho^j|X_{t-p_n-j}| = O(\rho^{p_n}).$$

Therefore, almost surely, we have

$$\|\hat r_n - \hat R_n\underline b_n\|_2 \le \Big\|\frac1n\sum_{t=p_n+1}^{n}\varepsilon_t\underline X_{t-1}\Big\|_2 + O(p_n\rho^{p_n}).$$

Now by using (10.40) we have

$$\|\hat r_n - \hat R_n\underline b_n\|_2 = O\Big(\sqrt{p_n}\Big\{\sqrt{\frac{(\log\log n)^{1+\gamma}\log n}{n}} + \rho^{p_n}\Big\}\Big). \qquad (10.45)$$
291 This gives us a rate for ˆr n ˆR n b n. Next we consider ˆR n. It is clear from the definition of ˆR n that almost surely we have ( ˆR n ) i,j E(X i X j ) = n = n = n t=p n+ t=min(i,j) T t=min(i,j) X t i X t j E(X i X j ) Now by using (0.4) we have almost surely [X t i X t j E(X i X j )] n p n t=min(i,j) [X t i X t j E(X i X j )] + O( p n n ). X t i X t j + min(i, j) E(X i X j ) n Therefore we have almost surely ( ˆR n ) i,j E(X i X j ) = O( p n (log log n) n + +γ log n ). n ˆR n R n 2 = O ( p 2 n { }) p n (log log n) n + +γ log n. (0.46) n We note that by using (0.39), (0.45) and (0.46) we have ˆb n b n 2 Rn spec ˆR n spec O ( p 2 n (log log n) +γ log n n + p2 n n + p nρ pn ). As we mentioned previously, because the spectrum of X t is bounded away from zero, λ min (R n ) is bounded away from zero for all T. Moreover, since λ min ( ˆR n ) λ min (R n ) λ max ( ˆR n R n ) λ min (R n ) tr(( ˆR n R n ) 2 ), which for a large enough n is bounded away from zero. Hence we obtain almost surely ˆb n b n 2 = O ( p 2 n (log log n) +γ log n n + p3 n n + p nρ pn ), (0.47) thus proving Theorem 0.7. for any γ >
10.8 Asymptotic properties of the GMLE

Let us suppose that $\{X_t\}$ satisfies the ARMA representation

$$X_t - \sum_{i=1}^{p}\phi_i^{(0)}X_{t-i} = \varepsilon_t + \sum_{j=1}^{q}\theta_j^{(0)}\varepsilon_{t-j}, \qquad (10.48)$$

where $\theta_0 = (\theta_1^{(0)},\ldots,\theta_q^{(0)})$, $\phi_0 = (\phi_1^{(0)},\ldots,\phi_p^{(0)})$ and $\sigma_0^2 = \mathrm{var}(\varepsilon_t)$. In this section we consider the sampling properties of the GML estimator defined in Section 7.2.1. We first recall the estimator. We use as an estimator of $(\theta_0,\phi_0)$,

$$(\hat\theta_n,\hat\phi_n,\hat\sigma_n) = \arg\min_{(\theta,\phi,\sigma)\in\Theta}L_n(\phi,\theta,\sigma),$$

where

$$\frac1nL_n(\phi,\theta,\sigma) = \frac1n\sum_{t=1}^{n-1}\log r_{t+1}(\sigma,\phi,\theta) + \frac1n\sum_{t=1}^{n-1}\frac{\big(X_{t+1}-X_{t+1|t}^{(\phi,\theta)}\big)^2}{r_{t+1}(\sigma,\phi,\theta)}. \qquad (10.49)$$

To show consistency and asymptotic normality we will use the following assumptions.

Assumption 10.8.1 (i) $X_t$ is both invertible and causal.

(ii) The parameter space is such that all $\phi(z)$ and $\theta(z)$ in the parameter space have roots whose absolute value is greater than $1+\delta$, and $\phi_0(z)$ and $\theta_0(z)$ belong to this space.

Assumption 10.8.1 means that for some finite constant $K$ and $\frac{1}{1+\delta} \le \rho < 1$, the coefficients of $\phi(z)^{-1}$ and $\theta(z)^{-1}$ are bounded by $K\rho^j$.

To prove the result, we require the following approximations of the GML. Let

$$\hat X_{t+1|t,\ldots}^{(\phi,\theta)} = \sum_{j=1}^{t}b_j(\phi,\theta)X_{t+1-j}. \qquad (10.50)$$

This is an approximation of the one-step ahead predictor. Since the likelihood is constructed from the one-step ahead predictors, we can approximate the likelihood $\frac1nL_n(\phi,\theta,\sigma)$ with the above and define

$$\frac1n\hat L_n(\phi,\theta,\sigma) = \log\sigma^2 + \frac{1}{n\sigma^2}\sum_{t=1}^{n-1}\big(X_{t+1}-\hat X_{t+1|t,\ldots}^{(\phi,\theta)}\big)^2. \qquad (10.51)$$
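As a practical aside (an added sketch, not from the notes): R's `arima` implements both the exact Gaussian likelihood (`method = "ML"`) and a conditional sum-of-squares criterion in the spirit of the approximation (10.51) (`method = "CSS"`); for moderate $n$ the two sets of estimates are very close, as the results of this section lead us to expect.

```r
# Sketch (added): exact Gaussian likelihood vs a conditional sum-of-squares
# approximation of it, for a simulated ARMA(1,1).
set.seed(3)
x <- arima.sim(model = list(ar = 0.5, ma = 0.4), n = 500)
fit.ml  <- arima(x, order = c(1, 0, 1), include.mean = FALSE, method = "ML")
fit.css <- arima(x, order = c(1, 0, 1), include.mean = FALSE, method = "CSS")
cbind(ML = coef(fit.ml), CSS = coef(fit.css))  # the estimates agree closely
```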
We recall that $\hat X_{t+1|t,\ldots}^{(\phi,\theta)}$ was derived from $X_{t+1|t,\ldots}^{(\phi,\theta)}$, the one-step ahead predictor of $X_{t+1}$ given the infinite past $X_t, X_{t-1},\ldots$:

$$X_{t+1|t,\ldots}^{(\phi,\theta)} = \sum_{j=1}^{\infty}b_j(\phi,\theta)X_{t+1-j}. \qquad (10.52)$$

Using the above we define an approximation of $\frac1nL_n(\phi,\theta,\sigma)$ which in practice cannot be obtained (since the infinite past of $\{X_t\}$ is not observed). Let us define the criterion

$$\frac1n\tilde L_n(\phi,\theta,\sigma) = \log\sigma^2 + \frac{1}{n\sigma^2}\sum_{t=1}^{n-1}\big(X_{t+1}-X_{t+1|t,\ldots}^{(\phi,\theta)}\big)^2. \qquad (10.53)$$

In practice $\frac1n\tilde L_n(\phi,\theta,\sigma)$ cannot be evaluated, but it proves to be a convenient tool in obtaining the sampling properties of $\hat\varphi_n$. The main reason is that $\frac1n\tilde L_n(\phi,\theta,\sigma)$ is a function of $\{X_t\}$ and of $\{X_{t+1|t,\ldots}^{(\phi,\theta)} = \sum_jb_j(\phi,\theta)X_{t+1-j}\}$, and both of these are ergodic (since the ARMA process is ergodic when its roots lie outside the unit circle, and the roots of $\phi,\theta\in\Theta$ also lie outside the unit circle). In contrast, $L_n(\phi,\theta,\sigma)$ is composed of $\{X_{t+1|t}\}$, which is not an ergodic sequence, because $X_{t+1|t}$ is the best linear predictor of $X_{t+1}$ given $X_t,\ldots,X_1$ (note that the number of elements in the prediction changes with $t$). Using this approximation greatly simplifies the proof, though it is possible to prove the result without these approximations. First we obtain the result for the estimators

$$\tilde\varphi_n = (\tilde\theta_n,\tilde\phi_n,\tilde\sigma_n) = \arg\min_{(\theta,\phi,\sigma)\in\Theta}\tilde L_n(\phi,\theta,\sigma),$$

and then show that the same result applies to $\hat\varphi_n$.

Proposition 10.8.1 Suppose $\{X_t\}$ is an ARMA process which satisfies (10.48), and Assumption 10.8.1 is satisfied. Let $X_{t+1|t}^{(\phi,\theta)}$ (the finite-past best linear predictor defined earlier), $\hat X_{t+1|t,\ldots}^{(\phi,\theta)}$ and $X_{t+1|t,\ldots}^{(\phi,\theta)}$ be the predictors defined above and in (10.50) and (10.52), obtained using the parameters $\phi = \{\phi_j\}$ and $\theta = \{\theta_i\}$, where the roots of the corresponding characteristic polynomials $\phi(z)$ and $\theta(z)$ have absolute value greater than $1+\delta$. Then

$$\big|X_{t+1|t}^{(\phi,\theta)} - X_{t+1|t,\ldots}^{(\phi,\theta)}\big| \le \frac{K\rho^t}{1-\rho}\sum_{i=1}^{t}\rho^i|X_i|, \qquad (10.54)$$

$$E\big(X_{t+1|t}^{(\phi,\theta)} - X_{t+1|t,\ldots}^{(\phi,\theta)}\big)^2 \le K\rho^t, \qquad (10.55)$$
294 Xt+ t,... () X t+ t,... = j=t+ b j (φ, θ)x t+ j Kρ t ρ j X j, (0.56) j=0 E(X (φ,θ) (φ,θ) t+ t,... X t+ t,... )2 Kρ t (0.57) and r t (σ, φ, θ) σ 2 Kρ t (0.58) for any /( + δ) < ρ < and K is some finite constant. PROOF. The proof follows closely the proof of Proposition First we define a separate ARMA process {Y t }, which is driven by the parameters θ and φ (recall that {X t } is drive by the parameters θ 0 and φ 0 ). That is Y t satisfies Y t p φ jy t j = ε t + q θ jε t j. Recalling that X φ,θ t+ t is the best linear predictor of X t+ given X t,..., X and the variances of {Y t } (noting that it is the process driven by θ and φ), we have X φ,θ t+ t = t b j (φ, θ)x t+ j + ( j=t+ b j (φ, θ)r t,j(φ, θ)σ t (φ, θ) ) X t, (0.59) where Σ t (φ, θ) s,t = E(Y s Y t ), (r t,j ) i = E(Y t i Y j ) and X t = (X t,..., X ). Therefore X φ,θ t+ t X t+ t,... = ( j=t+ b j r t,jσ t (φ, θ) ) X t. Since the largest eigenvalue of Σ t (φ, θ) is bounded (see Lemma 5.4.) and (r t,j ) i = E(Y t i Y j ) Kρ t i+j we obtain the bound in (0.54). Taking expectations, we have E(X φ,θ φ,θ t+ t X t+ t,... )2 = ( j=t+ b j r ) t,j Σt (φ, θ) Σ t (φ 0, θ 0 )Σ t (φ, θ) ( ) b t+j r t,j. j=t+ Now by using the same arguments given in the proof of (5.29) we obtain (0.55). To prove (0.57) we note that E( X t+ t,... () X t+ t,... ) 2 = E( j=t+ b j (φ, θ)x t+ j ) 2 = E( b t+j (φ, θ)x j ) 2, 293
295 now by using (2.20), we have b t+j (φ, θ) Kρ t+j, for +δ Using this we have E( X t+ t,... () X t+ t,... ) 2 Kρ t, which proves the result. < ρ <, and the bound in (0.56). Using ε t = X t b j(φ 0, θ 0 )X t j and substituting this into L n (φ, θ, σ) gives n L n(φ, θ, σ) = log σ 2 + nσ 2 ( Xt ) 2 b j (φ, θ)x t+ j = n L n(φ, θ, σ) log σ 2 + nσ 2 = log σ 2 + nσ 2 + n ε 2 t + 2 n t= T { θ(b) }{ φ(b)x t θ(b) } φ(b)x t t= ( ) ε t b j (φ, θ)x t j t= ( ) 2. (b j (φ, θ) b j (φ 0, θ 0 ))X t j t= Remark 0.8. (Derivatives involving the Backshift operator) Consider the transformation θb X t = θ j B j X t = j=0 θ j X t j. Suppose we want to differentiate the above with respect to θ, there are two ways this can be done. Either differentiate j=0 θj X t j with respect to θ or differentiate θb words d dθ θb X t = B ( θb) 2 X t = j=0 jθ j X t j. j=0 with respect to θ. In other Often it is easier to differentiate the operator. Suppose that θ(b) = + p θ jb j and φ(b) = q φ jb j, then we have d φ(b) dθ j θ(b) X t = Bj φ(b) θ(b) 2 d φ(b) dφ j θ(b) X t = Moreover in the case of squares we have X t = φ(b) θ(b) 2 X t j Bj θ(b) 2 X t = θ(b) 2 X t j. d ( φ(b) dθ j θ(b) X t) 2 = 2( φ(b) θ(b) X t)( φ(b) θ(b) 2 X t j), d ( φ(b) dφ j θ(b) X t) 2 = 2( φ(b) θ(b) X t)( θ(b) 2 X t j). 294
296 Using the above we can easily evaluate the gradient of n L n n θ i L n (φ, θ, σ) = 2 σ 2 n φ j L n (φ, θ, σ) = 2 nσ 2 (θ(b) φ(b)x t ) φ(b) θ(b) 2 X t i (θ(b) φ(b)x t ) θ(b) X t j t= t= n σ 2L n(φ, θ, σ) = σ 2 nσ 4 ( ) 2. Xt b j (φ, θ)x t j (0.60) Let = ( φi, θj, σ 2). We note that the second derivative 2 L n can be defined similarly. t= Lemma 0.8. Suppose Assumption 0.8. holds. Then sup φ,θ Θ n L n 2 KS n sup φ,θ Θ n 3 L n 2 KS n (0.6) for some constant K, S n = n max(p,q) r,r 2 =0 Y t r Y t r2 (0.62) t= where Y t = K ρ j X t j. j=0 for any (+δ) < ρ <. PROOF. The proof follows from the the roots of φ(z) and θ(z) having absolute value greater than + δ. Define the expectation of the likelihood L(φ, θ, σ)) = E( n L n(φ, θ, σ)). We observe where L(φ, θ, σ)) = log σ 2 + σ2 0 σ 2 + σ 2 E(Z t(φ, θ) 2 ) Z t (φ, θ) = (b j (φ, θ) b j (φ 0, θ 0 ))X t j 295
297 Lemma Suppose that Assumption 0.8. are satisfied. Then for all θ, φ, θ Θ we have (i) n i L n (φ, θ, σ)) a.s. i L(φ, θ, σ)) for i = 0,, 2, 3. a.s. (ii) Let S n defined in (0.62), then S n E( max(p,q) n t= Y t r Y t r2 ). r,r 2 =0 PROOF. Noting that the ARMA process {X t } are ergodic random variables, then {Z t (φ, θ)} and {Y t } are ergodic random variables, the result follows immediately from the Ergodic theorem. We use these results in the proofs below. Theorem 0.8. Suppose that Assumption 0.8. is satisfied. Let (ˆθ n, ˆφ n, ˆσ n) = arg min L n (θ, φ, σ) (noting the practice that this cannot be evaluated). Then we have (i) (ˆθ n, ˆφ n, ˆσ n) a.s. (θ 0, φ 0, σ 0 ). (ii) n(ˆθ n θ 0, ˆφ n θ 0) D N (0, σ 2 0 Λ ), where Λ = E(U tu t) E(V t U t) E(U t V t ) E(V t V t ) and {U t } and {V t } are autoregressive processes which satisfy φ 0 (B)U t = ε t and θ 0 (B)V t = ε t. PROOF. We prove the result in two stages below. PROOF of Theorem 0.8.(i) We will first prove Theorem 0.8.(i). Noting the results in Section 0.3, to prove consistency we recall that we must show (a) the (φ 0, θ 0, σ 0 ) is the unique minimum of L( ) (b) pointwise convergence a.s. T L(φ, θ, σ)) L(φ, θ, σ)) and (b) stochastic equicontinuity (as defined in Definition 0.3.2). To show that (φ 0, θ 0, σ 0 ) is the minimum we note that L(φ, θ, σ)) L(φ 0, θ 0, σ 0 )) = log( σ2 σ0 2 ) + σ2 σ0 2 + E(Z t (φ, θ) 2 ). Since for all positive x, log x + x is a positive function and E(Z t (φ, θ) 2 ) = E( (b j(φ, θ) b j (φ 0, θ 0 ))X t j ) 2 is positive and zero at (φ 0, θ 0, σ 0 ) it is clear that φ 0, θ 0, σ 0 is the minimum of L. We will assume for now it is the unique minimum. Pointwise convergence is an immediate consequence of Lemma 0.8.2(i). To show stochastic equicontinuity we note that for any ϕ = (φ, θ, σ ) and ϕ 2 = (φ 2, θ 2, σ 2 ) we have by the mean value theorem L n (φ, θ, σ ) L n (φ 2, θ 2, σ 2 )) = (ϕ ϕ 2 ) L n ( φ, θ, σ). 296
Now by using (10.61) we have

$$\big|\tilde L_n(\phi_1,\theta_1,\sigma_1) - \tilde L_n(\phi_2,\theta_2,\sigma_2)\big| \le S_n\,\big\|\big((\phi_1-\phi_2)',(\theta_1-\theta_2)',(\sigma_1-\sigma_2)\big)\big\|_2.$$

By using Lemma 10.8.2(ii) we have $S_n\overset{a.s.}{\to}E\big(\sum_{r_1,r_2=0}^{\max(p,q)}\frac1n\sum_tY_{t-r_1}Y_{t-r_2}\big)$, hence $\{S_n\}$ is almost surely bounded. This implies that $\tilde L_n$ is equicontinuous. Since we have shown pointwise convergence and equicontinuity of $\tilde L_n$, by using Corollary 10.3.1 we obtain almost sure convergence of the estimator, thus proving (i). □

PROOF of Theorem 10.8.1(ii). We now prove Theorem 10.8.1(ii) using the martingale central limit theorem (see Billingsley (1995) and Hall and Heyde (1980)) in conjunction with the Cramér–Wold device (see Theorem 10.6.1). Using the mean value theorem we have

$$(\tilde\varphi_n - \varphi_0) = -\big(\nabla^2\tilde L_n(\bar\varphi_n)\big)^{-1}\nabla\tilde L_n(\phi_0,\theta_0,\sigma_0),$$

where $\tilde\varphi_n = (\tilde\phi_n,\tilde\theta_n,\tilde\sigma_n)$, $\varphi_0 = (\phi_0,\theta_0,\sigma_0)$ and $\bar\varphi_n = (\bar\phi,\bar\theta,\bar\sigma)$ lies between $\tilde\varphi_n$ and $\varphi_0$. Using the same techniques as in Theorem 10.8.1(i) and Lemma 10.8.2, we have pointwise convergence and equicontinuity of $\nabla^2\tilde L_n$. This means that $\nabla^2\tilde L_n(\bar\varphi_n)\overset{a.s.}{\to}E\big(\nabla^2\tilde L_n(\phi_0,\theta_0,\sigma_0)\big) = \sigma^{-2}\Lambda$ (since by definition $\bar\varphi_n\overset{a.s.}{\to}\varphi_0$). Therefore, applying Slutsky's theorem (since $\Lambda$ is nonsingular), we have

$$\big(\nabla^2\tilde L_n(\bar\varphi_n)\big)^{-1}\overset{a.s.}{\to}\sigma^2\Lambda^{-1}. \qquad (10.63)$$

Now we show that $\nabla\tilde L_n(\varphi_0)$ is asymptotically normal. By using (10.60) and replacing $X_{t-i}$ by $\frac{\theta_0(B)}{\phi_0(B)}\varepsilon_{t-i}$ we have

$$\frac{\partial}{\partial\theta_i}\frac1n\tilde L_n(\phi_0,\theta_0,\sigma_0) = -\frac{2}{\sigma^2n}\sum_t\varepsilon_t\Big(\frac{1}{\theta_0(B)}\varepsilon_{t-i}\Big) = -\frac{2}{\sigma^2n}\sum_t\varepsilon_tV_{t-i}, \qquad i = 1,\ldots,q,$$

$$\frac{\partial}{\partial\phi_j}\frac1n\tilde L_n(\phi_0,\theta_0,\sigma_0) = -\frac{2}{\sigma^2n}\sum_t\varepsilon_t\Big(\frac{1}{\phi_0(B)}\varepsilon_{t-j}\Big) = -\frac{2}{\sigma^2n}\sum_t\varepsilon_tU_{t-j}, \qquad j = 1,\ldots,p,$$

$$\frac{\partial}{\partial\sigma^2}\frac1n\tilde L_n(\phi_0,\theta_0,\sigma_0) = \frac{1}{\sigma^2} - \frac{1}{\sigma^4n}\sum_t\varepsilon_t^2 = \frac{1}{\sigma^4n}\sum_t\big(\sigma^2 - \varepsilon_t^2\big),$$
299 where U t = φ 0 (B) ε t and V t = θ 0 (B) ε t. We observe that n L n is the sum of vector martingale differences. If E(ε 4 t ) <, it is clear that E((ε t U t j ) 4 ) = E((ε 4 t )E(U t j ) 4 ) <, E((ε t V t i ) 4 ) = E((ε 4 t )E(V t i ) 4 ) < and E((σ 2 ε 2 t ) 2 ) <. Hence Lindeberg s condition is satisfied (see the proof given in Section 0.6.2, for why this is true). Hence we have n Ln (φ 0, θ 0, σ 0 ) D N (0, Λ). Now by using the above and (0.63) we have n (ˆϕ n ϕ 0 ) = n 2 L n ( ϕ n ) L n (ϕ 0 ) n (ˆϕ n ϕ 0 ) D N (0, σ 4 Λ ). Thus we obtain the required result. The above result proves consistency and asymptotically normality of (ˆθ n, ˆφ n, ˆσ n), which is based on L n (θ, φ, σ), which in practice is impossible to evaluate. However we will show below that the gaussian likelihood, L n (θ, φ, σ) and is derivatives are sufficiently close to L n (θ, φ, σ) such that the estimators (ˆθ n, ˆφ n, ˆσ n) and the GMLE, (ˆθ n, ˆφ n, ˆσ n ) = arg min L n (θ, φ, σ) are asymptotically equivalent. We use Lemma 0.8. to prove the below result. Proposition Suppose that Assumption 0.8. hold and L n (θ, φ, σ), L n (θ, φ, σ) and L n (θ, φ, σ) are defined as in (0.49), (0.5) and (0.53) respectively. Then we have for all (θ, φ) T heta we have almost surely sup (φ,θ,σ) n (k) L(φ, θ, σ) k L n (φ, θ, σ) = O( n ) sup (φ,θ,σ) n L n (φ, θ, σ) L(φ, θ, σ) = O( n ), for k = 0,, 2, 3. PROOF. The proof of the result follows from (0.54) and (0.56). We show that result for sup (φ,θ,σ) n L(φ, θ, σ) L n (φ, θ, σ), a similar proof can be used for the rest of the result. Let us consider the difference L n (φ, θ) L n (φ, θ) = n (I n + II n + III n ), 298
300 where I n = III n = n { rt (φ, θ, σ) σ 2}, II n = t= n t= n t= r t (φ, θ, σ) (X(φ,θ) t+ X(φ,θ) t+ t )2 { 2Xt+ σ 2 (X (φ,θ) (φ,θ) t+ t X t+ t,... ) + ((X(φ,θ) t+ t )2 (φ,θ) ( X t+ t,... )2 ) }. Now we recall from Proposition 0.8. that X (φ,θ) (φ,θ) t+ t X t+ t,... ρ t K V t ( ρ) where V t = t i= ρi X i. Hence since E(X 2 t ) < and E(V 2 t ) < we have that sup n E I n <, sup n E II n < and sup n E III n <. Hence the sequence { I n + II n + III n } n is almost surely bounded. This means that almost surely sup L n (φ, θ) L n (φ, θ) = O( φ,θ,σ n ). Thus giving the required result. Now by using the above proposition the result below immediately follows. Theorem Let (ˆθ, ˆφ) = arg min L T (θ, φ, σ) and ( θ, ˆφ) = arg min L T (θ, φ, σ) a.s. a.s. (i) (ˆθ, ˆφ) (θ 0, φ 0 ) and ( θ, φ) (θ 0, φ 0 ). (ii) T (ˆθ T θ 0, ˆφ T θ 0 ) D N (0, σ 4 0 Λ ) and T ( θ T θ 0, φ T θ 0 ) D N (0, σ 4 0 Λ ). PROOF. The proof follows immediately from Proposition
Appendix A

Background

A.1 Some definitions and inequalities

Some norm definitions. The norm of an object is a positive number which measures the magnitude of that object. Suppose $x = (x_1,\ldots,x_n)\in\mathbb R^n$; then we define $\|x\|_1 = \sum_{j=1}^{n}|x_j|$ and $\|x\|_2 = \big(\sum_{j=1}^{n}x_j^2\big)^{1/2}$ (the latter is known as the Euclidean norm). There are various norms for matrices; the most popular is the spectral norm $\|\cdot\|_{\mathrm{spec}}$: for a matrix $A$, $\|A\|_{\mathrm{spec}} = \lambda_{\max}(AA')^{1/2}$, where $\lambda_{\max}$ denotes the largest eigenvalue.
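A quick numerical check (added) of these norm definitions in R; the matrix `A` below is an arbitrary example.

```r
# Sketch (added): vector norms and the spectral norm of a matrix.
x <- c(3, -4)
sum(abs(x))                           # l1 norm: 7
sqrt(sum(x^2))                        # Euclidean norm: 5
A <- matrix(c(2, 0, 1, 1), 2, 2)
sqrt(max(eigen(A %*% t(A))$values))   # spectral norm: sqrt(lambda_max(AA'))
norm(A, type = "2")                   # the same value via base R
```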
$\mathbb Z$ denotes the set of all integers $\{\ldots,-1,0,1,2,\ldots\}$, and $\mathbb R$ denotes the real line $(-\infty,\infty)$.

Complex variables. $i = \sqrt{-1}$ and the complex variable is $z = x+iy$, where $x$ and $y$ are real. Often the polar representation of a complex variable is useful: if $z = x+iy$, then it can also be written as $r\exp(i\theta)$, where $r = \sqrt{x^2+y^2}$ and $\theta = \tan^{-1}(y/x)$. If $z = x+iy$, its complex conjugate is $\bar z = x-iy$.

The roots of an $r$th order polynomial $a(z)$ are those values $\lambda_1,\ldots,\lambda_r$ where $a(\lambda_i) = 0$ for $i = 1,\ldots,r$.

Let $\lambda(A)$ denote the spectral radius of the matrix $A$ (the largest eigenvalue in absolute value). Then for any matrix norm $\|A\|$ we have $\lim_{j\to\infty}\|A^j\|^{1/j} = \lambda(A)$ (see Gelfand's formula). Suppose $\lambda(A) < 1$; then Gelfand's formula implies that for any $\lambda(A) < \rho < 1$ there exists a constant $C_{A,\rho}$ (which depends only on $A$ and $\rho$) such that $\|A^j\| \le C_{A,\rho}\rho^j$.

The mean value theorem. This basically states that if the partial derivatives of the function $f(x_1,\ldots,x_n)$ are bounded in a domain $\Omega$, then for $x = (x_1,\ldots,x_n)$ and $y = (y_1,\ldots,y_n)$ in $\Omega$,

$$f(x_1,\ldots,x_n) - f(y_1,\ldots,y_n) = \sum_{i=1}^{n}(x_i - y_i)\frac{\partial f}{\partial x_i}\Big|_{x=x^*},$$

where $x^*$ lies somewhere between $x$ and $y$.

The Taylor series expansion. This is closely related to the mean value theorem; a second order expansion is

$$f(x_1,\ldots,x_n) - f(y_1,\ldots,y_n) = \sum_{i=1}^{n}(x_i-y_i)\frac{\partial f}{\partial x_i} + \frac12\sum_{i,j}(x_i-y_i)(x_j-y_j)\frac{\partial^2 f}{\partial x_i\partial x_j}\Big|_{x=x^*}.$$

Partial fractions. We use the following result mainly for obtaining the MA($\infty$) expansion of an AR process. Suppose that $|g_i| > 1$ for $1\le i\le n$. Then if $g(z) = \prod_{i=1}^{n}(1 - z/g_i)^{r_i}$, the inverse of $g(z)$ satisfies

$$\frac{1}{g(z)} = \sum_{i=1}^{n}\sum_{j=1}^{r_i}\frac{g_{i,j}}{(1-\frac{z}{g_i})^j},$$

where $g_{i,j} = \ldots$. Now we can make a polynomial series expansion of $(1-\frac{z}{g_i})^{-j}$, which is valid for all $|z| \le 1$.

Dominated convergence. Suppose a sequence of functions $f_n(x)$ is such that pointwise $f_n(x)\to f(x)$ and, for all $n$ and $x$, $|f_n(x)| \le g(x)$ with $\int g(x)dx < \infty$; then $\int f_n(x)dx \to \int f(x)dx$ as $n\to\infty$. We use this result all over the place to exchange infinite sums and expectations. For example, if $\sum_j|a_j|E(|Z_j|) < \infty$, then by using dominated convergence we have $E(\sum_ja_jZ_j) = \sum_ja_jE(Z_j)$.

Dominated convergence can be used to prove the following lemma. A more hands-on proof is given below the lemma.

Lemma A.1.1 Suppose $\sum_{k=-\infty}^{\infty}|c(k)| < \infty$. Then we have

$$\frac1n\sum_{k=-(n-1)}^{n-1}|k|\,|c(k)| \to 0 \quad\text{as }n\to\infty.$$

Moreover, if $\sum_{k=-\infty}^{\infty}|k|\,|c(k)| < \infty$, then $\frac1n\sum_{k=-(n-1)}^{n-1}|k|\,|c(k)| = O(\frac1n)$.

PROOF. The proof is straightforward in the case that $\sum_k|k||c(k)| < \infty$ (the second assertion); in this case $\frac1n\sum_{|k|\le n-1}|k||c(k)| = O(\frac1n)$. The proof is slightly more tricky in the case that only $\sum_k|c(k)| < \infty$. First we note that since $\sum_k|c(k)| < \infty$, for every $\varepsilon > 0$ there exists an $N_\varepsilon$ such that for all $n\ge N_\varepsilon$, $\sum_{|k|\ge n}|c(k)| < \varepsilon$. Let us suppose that $n > N_\varepsilon$; then, since $|k|/n \le 1$ in the second sum below, we have the bound

$$\frac1n\sum_{k=-(n-1)}^{n-1}|k||c(k)| \le \frac1n\sum_{|k|\le N_\varepsilon-1}|k||c(k)| + \frac1n\sum_{N_\varepsilon\le|k|\le n-1}|k||c(k)| \le \frac1n\sum_{|k|\le N_\varepsilon-1}|k||c(k)| + \varepsilon.$$

Keeping $N_\varepsilon$ fixed, we see that $\frac1n\sum_{|k|\le N_\varepsilon-1}|k||c(k)|\to 0$ as $n\to\infty$. Since this is true for all $\varepsilon$ (with different thresholds $N_\varepsilon$), we obtain the required result. □

Cauchy–Schwarz inequality. In terms of sequences it is

$$\Big|\sum_ja_jb_j\Big| \le \Big(\sum_ja_j^2\Big)^{1/2}\Big(\sum_jb_j^2\Big)^{1/2}.$$

For integrals and expectations it is

$$E|XY| \le E(X^2)^{1/2}E(Y^2)^{1/2}.$$

Hölder's inequality. This is a generalisation of the Cauchy–Schwarz inequality. It states that if $p,q\ge1$ and $\frac1p+\frac1q=1$, then $E|XY| \le E(|X|^p)^{1/p}E(|Y|^q)^{1/q}$. A similar result is true for sequences.

Martingale differences. Let $\mathcal F_t$ be a sigma-algebra, with $X_t,X_{t-1},\ldots\in\mathcal F_t$. Then $\{X_t\}$ is a sequence of martingale differences if $E(X_t|\mathcal F_{t-1}) = 0$.

Minkowski's inequality. If $1\le p<\infty$, then

$$\Big(E\Big|\sum_{i=1}^{n}X_i\Big|^p\Big)^{1/p} \le \sum_{i=1}^{n}\big(E|X_i|^p\big)^{1/p}.$$

Doob's inequality. This inequality concerns martingales. Let $S_n = \sum_{t=1}^{n}X_t$, where $\{X_t\}$ are martingale differences; then $E(\sup_{n\le N}S_n^2) \le 4E(S_N^2)$.

Burkhölder's inequality. Suppose that $\{X_t\}$ are martingale differences and define $S_n = \sum_{k=1}^{n}X_k$. For any $p\ge2$ we have

$$\big\{E(|S_n|^p)\big\}^{1/p} \le \Big(2p\sum_{k=1}^{n}\big\{E(|X_k|^p)\big\}^{2/p}\Big)^{1/2}.$$

An application: in the case that $\{X_t\}$ are identically distributed random variables, we have the bound $E(|S_n|^p) \le E(|X_0|^p)(2p)^{p/2}n^{p/2}$. It is worth noting that the Burkhölder inequality can also be defined for $1\le p<2$ (see
Davidson (1994), page 242). It can also be generalised to random variables $\{X_t\}$ which are not necessarily martingale differences (see Dedecker and Doukhan (2003)).

Riemann–Stieltjes integrals. In basic calculus we often use the basic definition of the Riemann integral $\int g(x)f(x)dx$, and if the function $F(x)$ is continuous with $F'(x) = f(x)$, we can write $\int g(x)f(x)dx = \int g(x)dF(x)$. There are several instances where we need to broaden this definition to include functions $F$ which are not continuous everywhere. To do this we define the Riemann–Stieltjes integral, which coincides with the Riemann integral in the case that $F(x)$ is continuous. The integral $\int g(x)dF(x)$ is defined in a slightly different way to the Riemann integral. Let us first consider the case that $F(x)$ is the step function $F(x) = \sum_{i=1}^{n}a_iI_{[x_{i-1},x_i]}(x)$; then $\int g(x)dF(x)$ is defined as

$$\int g(x)dF(x) = \sum_{i=1}^{n}(a_i - a_{i-1})g(x_i) \qquad (\text{with }a_0 = 0).$$

Already we see the advantage of this definition, since the derivative of the step function is not well defined at the jumps. As most functions of interest can be written as the limit of step functions ($F(x) = \lim_{k\to\infty}F_k(x)$, where $F_k(x) = \sum_{i=1}^{n_k}a_{i,n_k}I_{[x_{i,k},x_{i+1,k}]}(x)$), we define

$$\int g(x)dF(x) = \lim_{k\to\infty}\sum_{i=1}^{n_k}\big(a_{i,n_k} - a_{i-1,n_k}\big)g(x_{i,k}).$$

In statistics, the function $F$ will usually be non-decreasing and bounded. We call such functions distributions.

Theorem A.1.1 (Helly's theorem) Suppose that $\{F_n\}$ is a sequence of distributions with $F_n(-\infty) = 0$ and $\sup_nF_n(\infty) \le M < \infty$. Then there exists a distribution $F$ and a subsequence $\{F_{n_k}\}$ such that $F_{n_k}(x)\to F(x)$ at each continuity point $x\in\mathbb R$ of $F$, and $F$ is right continuous.

A.2 Martingales

Definition A.2.1 A sequence $\{X_t\}$ is said to be a sequence of martingale differences if $E(X_t|\mathcal F_{t-1}) = 0$, where $\mathcal F_{t-1} = \sigma(X_{t-1},X_{t-2},\ldots)$. In other words, the best predictor of $X_t$ given the past is simply zero.

Martingales are very useful when proving several results, including central limit theorems. Martingales arise naturally in several situations. We now show that if the correct likelihood is used (not the quasi-likelihood), then the gradient of the conditional log likelihood evaluated at the true parameter is a sum of martingale differences. To see why, let

$$B_T(\theta) = \sum_{t=2}^{T}\log f_\theta(X_t|X_{t-1},\ldots,X_1)$$

be the conditional log likelihood and $C_T(\theta)$ its derivative, where

$$C_T(\theta) = \sum_{t=2}^{T}\frac{\partial\log f_\theta(X_t|X_{t-1},\ldots,X_1)}{\partial\theta}.$$

We want to show that $C_T(\theta_0)$ is a sum of martingale differences. By definition, $C_T(\theta_0)$ is a sum of martingale differences if

$$E\Big(\frac{\partial\log f_\theta(X_t|X_{t-1},\ldots,X_1)}{\partial\theta}\Big|_{\theta=\theta_0}\;\Big|\;X_{t-1},X_{t-2},\ldots,X_1\Big) = 0;$$

we will show this. Rewriting the above in terms of integrals and exchanging derivative with integral, we have

$$E\Big(\frac{\partial\log f_\theta(X_t|X_{t-1},\ldots,X_1)}{\partial\theta}\Big|_{\theta=\theta_0}\;\Big|\;X_{t-1},\ldots,X_1\Big) = \int\frac{\partial\log f_\theta(x_t|X_{t-1},\ldots,X_1)}{\partial\theta}\Big|_{\theta=\theta_0}f_{\theta_0}(x_t|X_{t-1},\ldots,X_1)\,dx_t$$

$$= \int\frac{\frac{\partial f_\theta(x_t|X_{t-1},\ldots,X_1)}{\partial\theta}\big|_{\theta=\theta_0}}{f_{\theta_0}(x_t|X_{t-1},\ldots,X_1)}f_{\theta_0}(x_t|X_{t-1},\ldots,X_1)\,dx_t = \frac{\partial}{\partial\theta}\Big(\int f_\theta(x_t|X_{t-1},\ldots,X_1)\,dx_t\Big)\Big|_{\theta=\theta_0} = 0.$$

Therefore $\big\{\frac{\partial\log f_\theta(X_t|X_{t-1},\ldots,X_1)}{\partial\theta}\big|_{\theta=\theta_0}\big\}_t$ is a sequence of martingale differences, and $C_T(\theta_0)$ is a sum of martingale differences (hence it is a martingale).

A.3 The Fourier series

The Fourier transform is a commonly used tool. We recall that $\{\exp(2\pi ij\omega); j\in\mathbb Z\}$ is an orthonormal basis of the space $L^2[0,1]$. In other words, if $f\in L^2[0,1]$ (i.e., $\int_0^1f(\omega)^2d\omega < \infty$), then setting

$$f_n(u) = \sum_{j=-n}^{n}c_je^{2\pi iju}, \qquad c_j = \int_0^1f(u)\exp(-2\pi iju)\,du,$$

we have $\int|f(u)-f_n(u)|^2du\to0$ as $n\to\infty$. Roughly speaking, if the function is continuous then we can say that $f(u) = \sum_{j\in\mathbb Z}c_je^{2\pi iju}$.
307 An important property is that f(u) constant iff c j = 0 for all j 0. Moreover, for all n Z f(u + n) = f(u) (hence f is periodic). Some relations: (i) Discrete Fourier transforms of finite sequences It is straightforward to show (by using the property n exp(i2πk/n) = 0 for k 0) that if d k = n x j exp(i2πjk/n), then {x r } can be recovered by inverting this transformation x r = n d k exp( i2πrk/n), k= (ii) Fourier sums and integrals Of course the above only has meaning when {x k } is a finite sequence. However suppose that {x k } is a sequence which belongs to l 2 (that is k x2 k < ), then we can define the function f(ω) = 2π k= x k exp(ikω), where 2π 0 f(ω) 2 dω = k x2 k, and we we can recover {x k} from f(ω), through x k = 2π 2π 0 f(ω) exp( ikω). (iii) Convolutions. Let us suppose that k a k 2 < and k b k 2 < and we define the Fourier transform of the sequences {a k } and {b k } as A(ω) = 2π a k exp(ikω) and B(ω) = 2π k b k exp(ikω) respectively. Then j= a j b k j = j= a j b j exp(ijω) = 2π 0 2π 0 A(ω)B( ω) exp( ikω)dω A(λ)B(ω λ)dλ. (A.) 306
The proof of the above follows from substituting $a_j = \int_0^{2\pi}A(\lambda_1)e^{-ij\lambda_1}d\lambda_1$ and $b_j = \int_0^{2\pi}B(\lambda_2)e^{-ij\lambda_2}d\lambda_2$, which gives

$$\sum_{j=-\infty}^{\infty}a_jb_j\exp(ij\omega) = \int_0^{2\pi}\!\!\int_0^{2\pi}A(\lambda_1)B(\lambda_2)\underbrace{\sum_{r=-\infty}^{\infty}\exp\big(ir(\omega-\lambda_1-\lambda_2)\big)}_{=\,\delta_\omega(\lambda_1+\lambda_2)}\,d\lambda_1d\lambda_2 = \int_0^{2\pi}A(\lambda)B(\omega-\lambda)\,d\lambda.$$

(iv) Using the DFT to calculate convolutions. Our objective is to calculate $\sum_{j=s+1}^{n}a_jb_{j-s}$ for all $s = 0,\ldots,n-1$ in as few computing operations as possible. This is typically done via the DFT. An example in time series where this is useful is the calculation of the sample autocovariance function. Suppose we have two sequences $a = (a_1,\ldots,a_n)$ and $b = (b_1,\ldots,b_n)$. Let

$$A_n(\omega_{k,n}) = \sum_{j=1}^{n}a_j\exp(ij\omega_{k,n}) \quad\text{and}\quad B_n(\omega_{k,n}) = \sum_{j=1}^{n}b_j\exp(ij\omega_{k,n}),$$

where $\omega_{k,n} = 2\pi k/n$. It is straightforward to show (using the property $\sum_{k=1}^{n}\exp(i2\pi rk/n) = 0$ for $r\not\equiv0\ (\mathrm{mod}\ n)$) that

$$\frac1n\sum_{k=1}^{n}A_n(\omega_{k,n})B_n(-\omega_{k,n})\exp(-is\omega_{k,n}) = \sum_{j=s+1}^{n}a_jb_{j-s} + \sum_{j=1}^{s}a_jb_{j-s+n};$$

this is very fast to compute (requiring only $O(n\log n)$ operations, using first the FFT and then the inverse FFT). The only problem is that we do not want the second term. By padding the sequences, defining

$$A_n(\omega_{k,2n}) = \sum_{j=1}^{n}a_j\exp(ij\omega_{k,2n}) = \sum_{j=1}^{2n}a_j\exp(ij\omega_{k,2n})$$

with $\omega_{k,2n} = 2\pi k/(2n)$ (where we set $a_j = 0$ for $j > n$), and analogously $B_n(\omega_{k,2n}) = \sum_{j=1}^{n}b_j\exp(ij\omega_{k,2n})$, we are able to remove the second term. Using the same calculations, we have

$$\frac{1}{2n}\sum_{k=1}^{2n}A_n(\omega_{k,2n})B_n(-\omega_{k,2n})\exp(-is\omega_{k,2n}) = \sum_{j=s+1}^{n}a_jb_{j-s} + \underbrace{\sum_{j=1}^{s}a_jb_{j-s+2n}}_{=0}.$$

This only requires $O(2n\log(2n))$ operations to compute the convolution for all $0\le s\le n-1$.
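The following sketch (added; index conventions adapted to R's `fft`, which uses the opposite sign in the exponent) computes the sample autocovariances via the padded DFT and checks them against the direct $O(n^2)$ computation.

```r
# Sketch (added): sample autocovariances via the padded FFT, as in (iv).
set.seed(4)
n <- 128
x <- rnorm(n)
d <- fft(c(x, rep(0, n)))                      # DFT of the zero-padded sequence
acv.fft <- Re(fft(d * Conj(d), inverse = TRUE))[1:n] / (2 * n) / n
acv.dir <- sapply(0:(n - 1),
                  function(s) sum(x[1:(n - s)] * x[(1 + s):n]) / n)
max(abs(acv.fft - acv.dir))                    # agreement up to rounding error
```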
(v) The Poisson summation formula. Suppose we do not observe the entire function but only a sample from it, say $f_{t,n} = f(\frac tn)$. We can use this to estimate the Fourier coefficient $c_j$ via the discrete Fourier transform:

$$c_{j,n} = \frac1n\sum_{t=1}^{n}f\Big(\frac tn\Big)\exp\Big(-\frac{2\pi ijt}{n}\Big).$$

The Poisson summation formula is

$$c_{j,n} = c_j + \sum_{k=1}^{\infty}c_{j+kn} + \sum_{k=1}^{\infty}c_{j-kn},$$

which we can prove by replacing $f(\frac tn)$ with $\sum_{j\in\mathbb Z}c_je^{2\pi ijt/n}$. In other words, $c_{j,n}$ cannot disentangle the frequency $e^{ij\omega}$ from its harmonics $e^{i(j+n)\omega}$ (this is aliasing); a small numerical illustration is given below.

(vi) Error in the DFT. By using the Poisson summation formula we can see that

$$|c_{j,n} - c_j| \le \sum_{k=1}^{\infty}\big(|c_{j+kn}| + |c_{j-kn}|\big).$$

It can be shown that if a function $f(\cdot)$ is $(p+1)$ times differentiable with bounded derivatives, or if $f^{(p)}(\cdot)$ is bounded and piecewise monotonic, then the corresponding Fourier coefficients satisfy $|c_j| \le C|j|^{-(p+1)}$. Using this result and the Poisson summation formula we can show, for $|j|\le n/2$, that if $f(\cdot)$ is $(p+1)$ times differentiable with bounded derivatives, or if $f^{(p)}(\cdot)$ is piecewise monotonic with $p\ge1$, then

$$|c_{j,n} - c_j| \le Cn^{-(p+1)}, \qquad (A.2)$$

where $C$ is some finite constant. We cannot use this result in the case that $f$ itself is merely bounded and piecewise monotone; however, it can still be shown that

$$|c_{j,n} - c_j| \le Cn^{-1}, \qquad (A.3)$$

see Section 6.3, page 89, of Briggs and Henson (1997).
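A one-line numerical check (added) of the aliasing phenomenon in (v): on a grid of $n$ points the frequencies $j$ and $j+n$ produce identical values.

```r
# Sketch (added): aliasing - exp(2*pi*i*j*t/n) and exp(2*pi*i*(j+n)*t/n)
# coincide on the sampling grid t = 0, ..., n-1.
n <- 16; j <- 3; t <- 0:(n - 1)
max(Mod(exp(2i * pi * j * t / n) - exp(2i * pi * (j + n) * t / n)))  # exactly 0
```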
310 A.4 Application of Burkholder s inequality There are two inequalities (one for < p 2). Which is the following: Theorem A.4. Suppose that Y k are martingale differences and that S n = n Y k, then for 0 < q 2 E S n q 2 E(X q k ), (A.4) See for example Davidson (p. 242, Theorem 5.7). And one for (p 2), this is the statement for the Burkölder inequality: Theorem A.4.2 Suppose {S i : F i } is a martingale and < p <. Then there exists constants C, C 2 depending only on p such that ( m ) p/2 ( m ) p/2 C E Xi 2 E S n p C 2 E Xi 2. (A.5) i= i= An immediately consequence of the above for p 2 is the following corollary (by using Hölder s inequality): Corollary A.4. Suppose {S i : F i } is a martingale and 2 p <. Then there exists constants C, C 2 depending only on p such that S n E p ( C 2/p 2 /2 m Xi 2 p/2) E. (A.6) i= PROOF. By using the right hand side of (A.5) we have { E Sn p} /p = ( m C 2 E C 2/p 2 m i= i= X 2 i X 2 i ) p/2 2/p /2 E p/2 /2. (A.7) 309
311 By using Hölder inequality we have Thus we have the desired result. { E Sn p} /p [ C 2/p 2 /2 m Xi 2 p/2] E. (A.8) i= We see the value of the above result in the following application. Suppose S n = n n k= X k and X k E p K. Then we have E ( n ) p X k k= [ n C2/p 2 C 2 n p [ k= k= X 2 k E p/2 X 2 k E p/2 ] p/2 ] p/2 C 2 n p [ k= ] p/2 K 2 = O( ). np/2 (A.9) Below is the result that that Moulines et al (2004) use (they call it the generalised Burkholder inequality) the proof can be found in Dedecker and Doukhan (2003). Note that it is for p 2, which I forgot to state in what I gave you. Lemma A.4. Suppose {φ k : k =, 2,...} is a stochastic process which satisfies E(φ k ) = 0 and E(φ p k ) < for some p 2. Let F k = σ(φ k, φ k,...). Then we have that s E φ k k= p s 2p φ k E p k= s E(φ j F k ) E p j=k /2. (A.0) We note if s j=k E(φ j F k ) E p <, then we (A.) is very similar to (A.6), and gives the same rate as (A.9). But I think one can obtain something similar for p 2. I think the below is correct. Lemma A.4.2 Suppose {φ k : k =, 2,...} is a stochastic process which satisfies E(φ k ) = 0 and E(φ q k ) < for some < q 2. Let F k = σ(φ k, φ k,...). Further, we suppose that there exists a 0 < ρ <, and 0 < K < such that E(φ t F t j ) q < Kρ j. Then we have that s E a k φ k k= q ( s ) /q K a k q, (A.) ρ k= 30
312 where K is a finite constant. PROOF. Let E j (φ k ) = E(φ k F k j ). We note that by definition {φ k } is a mixingale (see, for example, Davidson (997), chapter 6), therefore amost surely φ k satisfies the representation φ k = By substituting the above into the sum s k= a kφ k we obtain s a k φ k = k= s k= j=0 [E k j (φ k ) E k j (φ k )]. (A.2) j=0 [E k j (φ k ) E k j (φ k )] = ( s ) [E k j (φ k ) E k j (φ k )]. (A.3) Keeping j constant, we see that {E k j (φ k ) E k j (φ k )} k is a martingale sequence. Hence s k= [E k j(φ k ) E k j (φ k )] is the sum of martingale differences. This implies we can apply (A.4) to (A.3), and get j=0 k= s E a k φ k k= q s a k [E k j (φ k ) E k j (φ k )] j=0 k= ( 2 j=0 k= ) /q s a k ( E k j (φ k ) E k j (φ k ) E q ) q E q Under the stated assumption E k j (φ k ) E k j (φ k ) E q the above gives 2Kρ j. Substituting this inequality into s E a k φ k k= q ( ) /q ) /q s 2 a k q (2Kρ j ) q 2 +/q K a k q. j=0 k= j=0 ρ j ( s k= Thus we obtain the desired result. A.5 The Fast Fourier Transform (FFT) The Discrete Fourier transform is used widely in several disciplines. Even in areas its use may not be immediately obvious (such as inverting Toeplitz matrices) it is still used because it can be evalated in a speedy fashion using what is commonly called the fast fourier transform (FFT). It is an algorithm which simplifies the number of computing operations required to compute the Fourier 3
313 transform of a sequence of data. Given that we are in the age of big data it is useful to learn what one of most popular computing algorithms since the 60s actually does. Recalling the notation in Section the Fourier transform is the linear transformation F n X n = (J n (ω 0 ),..., J n (ω n )). If this was done without any using any tricks this requires O(n 2 ) computing operations. By using some neat factorizations, the fft reduces this to n log n computing operations. To prove this result we will ignore the standardization factor (2πn) /2 and consider just the Fourier transform d(ω k,n ) = x t exp (itω k,n ), t= } {{ } k different frequencies where ω k,n = 2πk n. Here we consider the proof for general n, later in Example A.5. we consider the specific case that n = 2 m. Let us assume that n is not a prime (if it is then we simply pad the vector with one zero and increase the length to n + ), then it can be factorized as n = pq. Using these factors we write t as t = t p + tmodp where t is some integer value that lies between 0 to q and t 0 = tmodp lies between 0 to p. Substituting this into d(ω k ) gives d(ω k ) = = x t exp [i(t p + tmodp)ω k,n ] t= p q t 0 =0 t =0 x t p+t 0 exp [i(t p + t 0 )ω k,n ] = p t 0 =0 exp [it 0 ω k,n ] q t =0 x t p+t 0 exp [it pω k,n ] It is straightforward to see that t pω k,n = 2πt pk n = 2πt k q = t ω k,q and that exp(it pω k,n ) = 32
314 exp(it ω k,q ) = exp(it ω kmodq,q ). This means d(ω k ) can be simplified as d(ω k ) = = = p t 0 =0 p t 0 =0 p t 0 =0 exp [it 0 ω k,n ] exp [it 0 ω k,n ] q t =0 q t =0 x t p+t 0 exp [it ω kmodq,q ] x t p+t 0 exp [it ω k0,q] } {{ } embedded Fourier transform exp [it 0 ω k,n ] A(t 0, kmodq), } {{ } q frequencies where k 0 = kmodq can take values from 0,..., q. Thus to evaluate d(ω k ) we need to evaluate A(t 0, kmodq) for 0 t 0 p, 0 k 0 q. To evaluate A(t 0, kmodq) requires q computing operations, to evaluate it for all t 0 and kmodq requires pq 2 operations. Note, the key is that less frequencies need to be evaluated when calculating A(t 0, kmodq), in particular q frequencies rather than N. After evaluating {A(t 0, k 0 ); 0 t 0 p, 0 k 0 q } we then need to take the Fourier transform of this over t 0 to evaluate d(ω k ) which is p operations and this needs to be done n times (to get all {d(ω k )} k ) this leads to np. Thus in total this leads to p 2 q }{{} + np }{{} = pq 2 + pn = n(q + p). (A.4) evaluation of all A evaluation of the transforms of A Observe that n(p + q) is a lot smaller than n 2. Looking back at the above calculation we observe that q 2 operations were required to calculate A(t 0, kmodq) = A(t 0, k 0 ) for all 0 k 0 q. However A(t 0, k 0 ) is a Fourier transform A(t 0, k 0 ) = q t =0 x t p+t 0 exp [it ω k0,q]. Therefore, we can use the same method as was used above to reduce this number. To do this we 33
315 need to factorize q into p = p q and using the above method we can write this as A(t 0, k 0 ) = = = p q t 2 =0 t 3 =0 p t 2 =0 p t 2 =0 x (t2 +t 3 p )p+t 0 exp [i(t 2 + t 3 p )ω k0,q] q exp [it 2 ω k0,q] t 3 =0 q exp [it 2 ω k0,q] t 3 =0 x (t2 +t 3 p )p+t 0 exp [it 3 p ω k0,q] x (t2 +t 3 p )p+t 0 exp [it 3 ω k0 modq,q ]. We note that k 0 modq = (kmod(p q )modq ) = kmodq, substituting this into the above we have A(t 0, k 0 ) = = p t 2 =0 p t 2 =0 q exp [it 2 ω k0,q] t 3 =0 exp [it 2 ω k0,q] A(t 0, t 2, k 0 modq ). } {{ } q frequencies x (t2 +t 3 p )p+t 0 exp [it 3 ω k0 modq,q ] Thus we see that q computing operations are required to calculate A(t 0, t 2, k 0 modq ) and to calculate A(t 0, t 2, kmodq ) for all 0 t 2 p and 0 kmodq q requires in total q 2p computing operations. After evaluating {A(t 0, t 2, k 0 modq ); 0 t 2 q 2, 0 kmodq q } we then need to take its Fourier transform over t 2 to evaluate A(t 0, k 0 ), which is p operations. Thus in total to evaluate A(t 0, k 0 ) over all k 0 we require q 2p + p q operations. Thus we have reduced the number of computing operations for A(t 0, k 0 ) from q 2 to q(p + q ), substituting this into (A.4) gives the total number of computing operations to calculate {d(ω k )} pq(p + q ) + np = n(p + p + q ). In general the same idea can be used to show that given the prime factorization of n = m s= p s, then the number of computing operations to calculate the DFT is n( m s= p s). 34
316 Example A.5. Let us suppose that n = 2 m then we can write d(ω k ) as d(ω k ) = x t exp(itω k ) = t= = n/2 t= n/2 t= (n/2) X 2t exp(i2tω k ) + X 2t+ exp(i(2t + )ω k ) t=0 (n/2) X 2t exp(i2tω k ) + exp(iω k ) X 2t+ exp(i2tω k ) t=0 = A(0, kmod(n/2)) + exp(iω k )A(, kmod(n/2)), since n/2 t= X 2t exp(i2tω k ) and n/2 t= X 2t+ exp(i2tω k ) are the Fourier transforms of {X t } on a coarser scale, therefore we can only identify the frequencies on a coarser scale. It is clear from the above that the evaluation of A(0, kmod(n/2)) for 0 kmod(n/2) n/2 requires (n/2) 2 operations and same for A(, kmod(n/2)). Thus to evaluate both A(0, kmod(n/2)) and A(, kmod(n/2)) requires 2(n/2) 2 operations. Then taking the Fourier transform of these two terms over all 0 k n is an additional 2n operations leading to 2(n/2) 2 + 2n = n 2 /2 + 2n operations < n 2. We can continue this argument and partition A(0, kmod(n/2)) = = n/2 X 2t exp(i2tω k ) t= n/4 t= (n/4) X 4t exp(i4tω k ) + exp(i2ω k ) X 4t+2 exp(i4tω k ). t=0 Using the same argument as above the calculation of this term over all k requires 2(n/4) 2 +2(n/2) = n 2 /8 + n operations. The same decomposition applies to A(, kmod(n/2)). Thus calculation of both terms over all k requires 2[n 2 /8 + n] = n 2 /4 + 2n operations. In total this gives (n 2 /4 + 2n + 2n)operations. Continuing this argument gives mn = n log 2 n operations, which is the often cited rate. Typically, if the sample size is not of order 2 m zeros are added to the end of the sequence (called padding) to increase the length to 2 m. 35
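To see the $O(n\log n)$ claim in practice, here is a small timing sketch (added; `n = 2^10` is kept modest so the full Fourier matrix fits comfortably in memory): a naive $O(n^2)$ DFT via the Fourier matrix against `fft()`.

```r
# Sketch (added): naive O(n^2) DFT versus the FFT.
n <- 2^10
x <- rnorm(n)
W <- exp(-2i * pi * outer(0:(n - 1), 0:(n - 1)) / n)  # n x n Fourier matrix
system.time(d.naive <- W %*% x)                       # O(n^2) multiply
system.time(d.fft <- fft(x))                          # O(n log n)
max(Mod(as.vector(d.naive) - d.fft))                  # same result up to rounding
```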
317 Appendix B Mixingales In this section we prove some of the results stated in the previous sections using mixingales. We first define a mixingale, noting that the definition we give is not the most general definition. Definition B.0. (Mixingale) Let F t = σ(x t, X t,...), {X t } is called a mixingale if it satisfies ρ t,k = { ( 2 } /2 E E(X t F t k ) E(X t )), where ρ t,k 0 as k. We note if {X t } is a stationary process then ρ t,k = ρ k. Lemma B.0. Suppose {X t } is a mixingale. Then {X t } almost surely satisfies the decomposition X t = j=0 PROOF. We first note that by using a telescoping argument that { } E(X t F t j ) E(X t F t j ). (B.) X t E(X t ) = m { E(Xt F t k ) E(X t F t k ) } + { E(X t F t m ) E(X t ) }. k=0 By definition of a martingale E ( E(X t F t m ) E(X t ) ) 2 0 as k, hence the remainder term in the above expansion becomes negligable as m and we have almost surely = X t E(X t ) { E(Xt F t k ) E(X t F t k ) }. k=0 36
Thus giving the required result. □

We observe that (B.1) resembles the Wold decomposition. The difference is that the Wold decomposition decomposes a stationary process into elements which are the errors in the best linear predictors, whereas the result above decomposes a process into sums of martingale differences. It can be shown that functions of several ARCH-type processes are mixingales (where $\rho_{t,k}\le K\rho^k$ with $\rho<1$), and Subba Rao (2006) and Dahlhaus and Subba Rao (2007) used these properties to obtain the rate of convergence for various types of ARCH parameter estimators. In a series of papers, Wei Biao Wu considered properties of a general class of stationary processes which satisfy Definition B.0.1 with $\sum_{k=1}^{\infty}\rho_k<\infty$.

In Section B.2 we use the mixingale property to prove Theorem 10.7.3. This is a simple illustration of how useful mixingales can be. In the following section we give a result on the rate of convergence of some random sums.

B.1 Obtaining almost sure rates of convergence for some sums

The following lemma is a simple variant on a result proved in Móricz (1976), Theorem 6.

Lemma B.1.1 Let $\{S_T\}$ be a random sequence where $E(\sup_{t\le T}|S_t|^2)\le\phi(T)$ and $\{\phi(t)\}$ is a monotonically increasing sequence with $\phi(2^j)/\phi(2^{j-1})\le K<\infty$ for all $j$. Then we have, almost surely,

$$\frac1TS_T = O\Big(\frac{1}{T}\sqrt{\phi(T)(\log T)(\log\log T)^{1+\delta}}\Big).$$

PROOF. The idea behind the proof is that we find a subsequence of the natural numbers and define a random variable on this subsequence which dominates (in some sense) $S_T$. We then obtain a rate of convergence for the subsequence (you will see that for the subsequence this is quite easy, using the Borel–Cantelli lemma), which, due to the dominance, can be transferred over to $S_T$. We make this argument precise below.

Define the sequence $V_j = \sup_{t\le2^j}|S_t|$. Using Chebyshev's inequality we have

$$P(V_j > \varepsilon) \le \frac{\phi(2^j)}{\varepsilon^2}.$$
Let $\varepsilon(t) = \sqrt{\phi(t)(\log\log t)^{1+\delta}\log t}$. It is clear that

$$\sum_jP\big(V_j>\varepsilon(2^{j-1})\big) \le \sum_j\frac{C\phi(2^j)}{\phi(2^{j-1})(\log j)^{1+\delta}j} < \infty,$$

where $C$ is a finite constant. Now by Borel–Cantelli, this means that almost surely $V_j\le\varepsilon(2^{j-1})$ for all sufficiently large $j$. Let us now return to the original sequence $S_T$. Suppose $2^{j-1}\le T\le 2^j$; then by definition of $V_j$ we have

$$\frac{|S_T|}{\varepsilon(T)} \le \frac{V_j}{\varepsilon(2^{j-1})} \overset{a.s.}{\le} \frac{\varepsilon(2^j)}{\varepsilon(2^{j-1})} < \infty$$

under the stated assumptions. Therefore, almost surely, we have $S_T = O(\varepsilon(T))$, which gives us the required result. □

We observe that the above result resembles the law of the iterated logarithm. It is a very simple and clean way of obtaining an almost sure rate of convergence; the main problem is obtaining bounds for $E(\sup_{t\le T}|S_t|^2)$. There is one exception to this: when $S_t$ is a sum of martingale differences one can simply apply Doob's inequality, which gives $E(\sup_{t\le T}|S_t|^2)\le 4E(|S_T|^2)$. In the case that $S_T$ is not a sum of martingale differences it is not so straightforward; however, if we can show that $S_T$ is a sum of mixingales, then with some modifications a bound for $E(\sup_{t\le T}|S_t|^2)$ can be obtained. We will use this result in the section below.

B.2 Proof of Theorem 10.7.3

We summarise Theorem 10.7.3 below.

Theorem 10.7.3 (restated) Let us suppose that $\{X_t\}$ has an ARMA representation where the roots of the characteristic polynomials $\phi(z)$ and $\theta(z)$ have absolute value greater than $1+\delta$. Then

(i) $\displaystyle\ \frac1n\sum_{t=r+1}^{n}\varepsilon_tX_{t-r} = O\Big(\sqrt{\frac{(\log\log n)^{1+\gamma}\log n}{n}}\Big)$  (B.2)

(ii) $\displaystyle\ \frac1n\sum_{t=\max(i,j)}^{n}\big[X_{t-i}X_{t-j}-E(X_{t-i}X_{t-j})\big] = O\Big(\sqrt{\frac{(\log\log n)^{1+\gamma}\log n}{n}}\Big)$  (B.3)
320 for any γ > 0. By using Lemma??, and that n t=r+ ε tx t r is the sum of martingale differences, we prove Theorem 0.7.3(i) below. PROOF of Theorem We first observe that {ε t X t r } are martingale differences, hence we can use Doob s inequality to give E(sup r+ s T ( s t=r+ ε tx t r ) 2 ) (T r)e(ε 2 t )E(X 2 t ). Now we can apply Lemma?? to obtain the result. We now show that T T t=max(i,j) (log log T ) X t i X t j = O( +δ log T ). T However the proof is more complex, since {X t i X t j } are not martingale differences and we cannot directly use Doob s inequality. However by showing that {X t i X t j } is a mixingale we can still show the result. To prove the result let F t = σ(x t, X t,...) and G t = σ(x t i X t j, X t i X t j i,...). observe that if i > j, then G t F t i. We Lemma B.2. Let F t = σ(x t, X t,...) and suppose X t comes from an ARMA process, where the roots are greater than + δ. Then if E(ε 4 t ) < we have E ( E(X t i X t j F t min(i,j) k ) E(X t i X t j ) ) 2 Cρ k. PROOF. By expanding X t as an MA( ) process we have = E(X t i X t j F t min(i,j) k ) E(X t i X t j ) { a j a j2 E(εt i j ε t j j2 F t k min(i,j) ) E(ε t i j ε t j j2 ) }. j,j 2 =0 Now in the case that t i j > t k min(i, j) and t j j 2 > t k min(i, j), E(ε t i j ε t j j2 F t k min(i,j) ) = E(ε t i j ε t j j2 ). Now by considering when t i j t k min(i, j) or t j j 2 t k min(i, j) we have have the result. Lemma B.2.2 Suppose {X t } comes from an ARMA process. Then 39
321 (i) The sequence {X t i X t j } t satisfies the mixingale property E ( E(X t i X t j F t min(i,j) k ) E(X t i X t j F t k ) ) 2 Kρ k, (B.4) and almost surely we can write X t i X t j as X t i X t j E(X t i X t j ) = k=0 t=min(i,j) V t,k (B.5) where V t,k = E(X t i X t j F t k min(i,j) ) E(X t i X t j F t k min(i,j) ), are martingale differences. (ii) Furthermore E(V 2 t,k ) Kρk and E { sup min(i,j) s n where K is some finite constant. ( s t=min(i,j) {X t i X t j E(X t i X t j )}) 2} Kn, (B.6) PROOF. To prove (i) we note that by using Lemma B.2. we have (B.4). To prove (B.5) we use the same telescoping argument used to prove Lemma B.0.. To prove (ii) we use the above expansion to give E { sup min(i,j) s n = E { sup ( min(i,j) s n = E { = ( ( s t=min(i,j) k=0 t=min(i,j) sup k =0 k 2 =0 min(i,j) s n k=0 { E ( sup min(i,j) s n {X t i X t j E(X t i X t j )}) 2} (B.7) s V t,k ) 2 } s t=min(i,j) s t=min(i,j) V t,k s t=min(i,j) )} /2 ) 2 V t,k 2 V t,k2 } Now we see that {V t,k } t = {E(X t i X t j F t k min(i,j) ) E(X t i X t j F t k min(i,j) )} t, therefore {V t,k } t are also martingale differences. Hence we can apply Doob s inequality to E { ( sup s min(i,j) s n t=min(i,j) V ) t,k 320
322 and by using (B.4) we have E { sup min(i,j) s n ( s t=min(i,j) V t,k ) 2 } E ( Therefore now by using (B.7) we have E { sup min(i,j) s n ( s t=min(i,j) t=min(i,j) V t,k ) 2 = t=min(i,j) {X t i X t j E(X t i X t j )}) 2} Kn. E(V 2 t,k ) K nρk. Thus giving (B.6). We now use the above to prove Theorem 0.7.3(ii). PROOF of Theorem 0.7.3(ii). To prove the result we use (B.6) and Lemma B... 32
Bibliography

Hong-Zhi An, Zhao-Guo Chen, and E. J. Hannan. Autocorrelation, autoregression and autoregressive approximation. Ann. Statist., 10, 1982.

A. Aue, L. Horvath, and J. Steinbach. Estimation in random coefficient autoregressive models. Journal of Time Series Analysis, 27:61–76, 2006.

K. I. Beltrao and P. Bloomfield. Determining the bandwidth of a kernel spectrum estimate. Journal of Time Series Analysis, 8, 1987.

I. Berkes, L. Horváth, and P. Kokoszka. GARCH processes: structure and estimation. Bernoulli, 9:201–227, 2003.

I. Berkes, L. Horváth, P. Kokoszka, and Q. Shao. On discriminating between long-range dependence and changes in mean. Ann. Statist., 34:1140–1165, 2006.

R. N. Bhattacharya, V. K. Gupta, and E. Waymire. The Hurst effect under trends. J. Appl. Probab., 20, 1983.

P. Billingsley. Probability and Measure. Wiley, New York, 1995.

T. Bollerslev. Generalized autoregressive conditional heteroscedasticity. J. Econometrics, 31:307–327, 1986.

P. Bougerol and N. Picard. Stationarity of GARCH processes and some nonnegative time series. J. Econometrics, 52:115–127, 1992a.

P. Bougerol and N. Picard. Strict stationarity of generalised autoregressive processes. Ann. Probab., 20:1714–1730, 1992b.

G. E. P. Box and G. M. Jenkins. Time Series Analysis, Forecasting and Control. Cambridge University Press, Oakland, 1970.

A. Brandt. The stochastic equation Y_{n+1} = A_nY_n + B_n with stationary coefficients. Adv. in Appl. Probab., 18:211–220, 1986.

W. L. Briggs and V. E. Henson. The DFT: An Owner's Manual for the Discrete Fourier Transform. SIAM, Philadelphia, 1997.

D. R. Brillinger. Time Series: Data Analysis and Theory. SIAM Classics, 2001.

P. Brockwell and R. Davis. Time Series: Theory and Methods. Springer, New York, 1998.

W. W. Chen, C. Hurvich, and Y. Lu. On the correlation matrix of the discrete Fourier transform and the fast solution of large Toeplitz systems for long memory time series. Journal of the American Statistical Association, 101:812–822, 2006.

R. Dahlhaus and D. Janas. A frequency domain bootstrap for ratio statistics in time series analysis. Ann. Statist., 24, 1996.

R. Dahlhaus and S. Subba Rao. A recursive online algorithm for the estimation of time-varying ARCH parameters. Bernoulli, 13, 2007.

J. Davidson. Stochastic Limit Theory. Oxford University Press, Oxford, 1994.

J. Dedecker and P. Doukhan. A new covariance inequality. Stochastic Processes and their Applications, 106:63–80, 2003.

R. Douc, E. Moulines, and D. Stoffer. Nonlinear Time Series: Theory, Methods and Applications with R Examples. Chapman and Hall, 2014.

Y. Dwivedi and S. Subba Rao. A test for second order stationarity based on the discrete Fourier transform. Journal of Time Series Analysis, 32:68–91, 2011.

R. Engle. Autoregressive conditional heteroscedasticity with estimates of the variance of the United Kingdom inflation. Econometrica, 50, 1982.

J. C. Escanciano and I. N. Lobato. An automatic Portmanteau test for serial correlation. Journal of Econometrics, 151:140–149, 2009.

J. Fan and Q. Yao. Nonlinear Time Series: Nonparametric and Parametric Methods. Springer, Berlin, 2003.

W. Fuller. Introduction to Statistical Time Series. Wiley, New York, 1995.

C. W. J. Granger and A. P. Andersen. An Introduction to Bilinear Time Series Models. Vandenhoek and Ruprecht, Göttingen, 1978.

U. Grenander and G. Szegö. Toeplitz Forms and Their Applications. Univ. California Press, Berkeley, 1958.

P. Hall and C. C. Heyde. Martingale Limit Theory and its Application. Academic Press, New York, 1980.

E. J. Hannan and J. Rissanen. Recursive estimation of ARMA order. Biometrika, 69:81–94, 1982.

J. Hart. Kernel regression estimation with time series errors. Journal of the Royal Statistical Society (B), 53:173–187, 1991.

C. Jentsch and S. Subba Rao. A test for second order stationarity of multivariate time series. Journal of Econometrics, 2014.

D. A. Jones. Nonlinear autoregressive processes. Proceedings of the Royal Society (A), 360:71–95, 1978.

M. Rosenblatt and U. Grenander. Statistical Analysis of Stationary Time Series. Chelsea Publishing Co., 1997.

T. Mikosch. Elementary Stochastic Calculus With Finance in View. World Scientific, 1999.

T. Mikosch and C. Stărică. Is it really long memory we see in financial returns? In P. Embrechts, editor, Extremes and Integrated Risk Management. Risk Books, London, 2000.

T. Mikosch and C. Stărică. Long-range dependence effects and ARCH modelling. In P. Doukhan, G. Oppenheim, and M. S. Taqqu, editors, Theory and Applications of Long Range Dependence. Birkhäuser, Boston, 2003.

F. Móricz. Moment inequalities and the strong law of large numbers. Z. Wahrsch. verw. Gebiete, 35:298–314, 1976.

D. F. Nicholls and B. G. Quinn. Random Coefficient Autoregressive Models: An Introduction. Springer-Verlag, New York, 1982.

E. Parzen. On consistent estimates of the spectrum of a stationary process. Ann. Math. Statist., 1957.

E. Parzen. On estimation of the probability density function and the mode. Ann. Math. Statist., 1962.

M. Pourahmadi. Foundations of Time Series Analysis and Prediction Theory. Wiley, 2001.

M. B. Priestley. Spectral Analysis and Time Series: Volumes I and II. Academic Press, London, 1983.

B. G. Quinn and E. J. Hannan. The Estimation and Tracking of Frequency. Cambridge University Press, 2001.

X. Shao. A self-normalized approach to confidence interval construction in time series. Journal of the Royal Statistical Society (B), 72:343–366, 2010.

R. Shumway and D. Stoffer. Time Series Analysis and Its Applications: With R Examples. Springer, New York.

D. Straumann. Estimation in Conditionally Heteroscedastic Time Series Models. Springer, Berlin, 2005.

S. Subba Rao. A note on uniform convergence of an ARCH(∞) estimator. Sankhya, 2006.

T. Subba Rao. On the estimation of bilinear time series models. In Bull. Inst. Internat. Statist. (paper presented at the 41st session of the ISI, New Delhi, India), volume 41, 1977.

T. Subba Rao. On the theory of bilinear time series models. Journal of the Royal Statistical Society (B), 43, 1981.

T. Subba Rao and M. M. Gabr. An Introduction to Bispectral Analysis and Bilinear Time Series Models. Lecture Notes in Statistics (24). Springer, New York, 1984.

S. C. Taylor. Modelling Financial Time Series. John Wiley and Sons, Chichester, 1986.

Gy. Terdik. Bilinear Stochastic Models and Related Problems of Nonlinear Time Series Analysis: A Frequency Domain Approach, volume 142 of Lecture Notes in Statistics. Springer Verlag, New York, 1999.

M. Vogt. Nonparametric regression for locally stationary time series. Annals of Statistics, 40:2601–2633, 2012.

A. M. Walker. On the estimation of a harmonic component in a time series with stationary independent residuals. Biometrika, 58:21–36, 1971.

P. Whittle. Gaussian estimation in stationary time series. Bulletin of the International Statistical Institute, 39:105–129, 1962.
Chapter 10 Introduction to Time Series Analysis
Chapter 1 Introduction to Time Series Analysis A time series is a collection of observations made sequentially in time. Examples are daily mortality counts, particulate air pollution measurements, and
Trend and Seasonal Components
Chapter 2 Trend and Seasonal Components If the plot of a TS reveals an increase of the seasonal and noise fluctuations with the level of the process then some transformation may be necessary before doing
Time Series HILARY TERM 2010 PROF. GESINE REINERT http://www.stats.ox.ac.uk/~reinert
Time Series HILARY TERM 2010 PROF. GESINE REINERT http://www.stats.ox.ac.uk/~reinert Overview Chapter 1: What are time series? Types of data, examples, objectives. Definitions, stationarity and autocovariances.
Booth School of Business, University of Chicago Business 41202, Spring Quarter 2015, Mr. Ruey S. Tsay. Solutions to Midterm
Booth School of Business, University of Chicago Business 41202, Spring Quarter 2015, Mr. Ruey S. Tsay Solutions to Midterm Problem A: (30 pts) Answer briefly the following questions. Each question has
Chapter 1. Vector autoregressions. 1.1 VARs and the identi cation problem
Chapter Vector autoregressions We begin by taking a look at the data of macroeconomics. A way to summarize the dynamics of macroeconomic data is to make use of vector autoregressions. VAR models have become
3. Regression & Exponential Smoothing
3. Regression & Exponential Smoothing 3.1 Forecasting a Single Time Series Two main approaches are traditionally used to model a single time series z 1, z 2,..., z n 1. Models the observation z t as a
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written
Introduction to General and Generalized Linear Models
Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby
Time Series Analysis
Time Series Analysis [email protected] Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby 1 Outline of the lecture Identification of univariate time series models, cont.:
Statistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
Time Series Analysis
Time Series Analysis Autoregressive, MA and ARMA processes Andrés M. Alonso Carolina García-Martos Universidad Carlos III de Madrid Universidad Politécnica de Madrid June July, 212 Alonso and García-Martos
Multiple Linear Regression in Data Mining
Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple
Statistics in Retail Finance. Chapter 6: Behavioural models
Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural
Discrete Time Series Analysis with ARMA Models
Discrete Time Series Analysis with ARMA Models Veronica Sitsofe Ahiati ([email protected]) African Institute for Mathematical Sciences (AIMS) Supervised by Tina Marquardt Munich University of Technology,
Threshold Autoregressive Models in Finance: A Comparative Approach
University of Wollongong Research Online Applied Statistics Education and Research Collaboration (ASEARC) - Conference Papers Faculty of Informatics 2011 Threshold Autoregressive Models in Finance: A Comparative
Traffic Safety Facts. Research Note. Time Series Analysis and Forecast of Crash Fatalities during Six Holiday Periods Cejun Liu* and Chou-Lin Chen
Traffic Safety Facts Research Note March 2004 DOT HS 809 718 Time Series Analysis and Forecast of Crash Fatalities during Six Holiday Periods Cejun Liu* and Chou-Lin Chen Summary This research note uses
Statistics Graduate Courses
Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.
Summary Nonstationary Time Series Multitude of Representations Possibilities from Applied Computational Harmonic Analysis Tests of Stationarity
Nonstationary Time Series, Priestley s Evolutionary Spectra and Wavelets Guy Nason, School of Mathematics, University of Bristol Summary Nonstationary Time Series Multitude of Representations Possibilities
TIME SERIES ANALYSIS
TIME SERIES ANALYSIS Ramasubramanian V. I.A.S.R.I., Library Avenue, New Delhi- 110 012 [email protected] 1. Introduction A Time Series (TS) is a sequence of observations ordered in time. Mostly these
4. Simple regression. QBUS6840 Predictive Analytics. https://www.otexts.org/fpp/4
4. Simple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/4 Outline The simple linear model Least squares estimation Forecasting with regression Non-linear functional forms Regression
Forecasting model of electricity demand in the Nordic countries. Tone Pedersen
Forecasting model of electricity demand in the Nordic countries Tone Pedersen 3/19/2014 Abstract A model implemented in order to describe the electricity demand on hourly basis for the Nordic countries.
Least Squares Estimation
Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David
PITFALLS IN TIME SERIES ANALYSIS. Cliff Hurvich Stern School, NYU
PITFALLS IN TIME SERIES ANALYSIS Cliff Hurvich Stern School, NYU The t -Test If x 1,..., x n are independent and identically distributed with mean 0, and n is not too small, then t = x 0 s n has a standard
SYSTEMS OF REGRESSION EQUATIONS
SYSTEMS OF REGRESSION EQUATIONS 1. MULTIPLE EQUATIONS y nt = x nt n + u nt, n = 1,...,N, t = 1,...,T, x nt is 1 k, and n is k 1. This is a version of the standard regression model where the observations
Luciano Rispoli Department of Economics, Mathematics and Statistics Birkbeck College (University of London)
Luciano Rispoli Department of Economics, Mathematics and Statistics Birkbeck College (University of London) 1 Forecasting: definition Forecasting is the process of making statements about events whose
Probability and Random Variables. Generation of random variables (r.v.)
Probability and Random Variables Method for generating random variables with a specified probability distribution function. Gaussian And Markov Processes Characterization of Stationary Random Process Linearly
The Monte Carlo Framework, Examples from Finance and Generating Correlated Random Variables
Monte Carlo Simulation: IEOR E4703 Fall 2004 c 2004 by Martin Haugh The Monte Carlo Framework, Examples from Finance and Generating Correlated Random Variables 1 The Monte Carlo Framework Suppose we wish
Lecture 2: ARMA(p,q) models (part 3)
Lecture 2: ARMA(p,q) models (part 3) Florian Pelgrin University of Lausanne, École des HEC Department of mathematics (IMEA-Nice) Sept. 2011 - Jan. 2012 Florian Pelgrin (HEC) Univariate time series Sept.
8. Time Series and Prediction
8. Time Series and Prediction Definition: A time series is given by a sequence of the values of a variable observed at sequential points in time. e.g. daily maximum temperature, end of day share prices,
Lecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
Basics of Statistical Machine Learning
CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu [email protected] Modern machine learning is rooted in statistics. You will find many familiar
The VAR models discussed so fare are appropriate for modeling I(0) data, like asset returns or growth rates of macroeconomic time series.
Cointegration The VAR models discussed so fare are appropriate for modeling I(0) data, like asset returns or growth rates of macroeconomic time series. Economic theory, however, often implies equilibrium
Performing Unit Root Tests in EViews. Unit Root Testing
Página 1 de 12 Unit Root Testing The theory behind ARMA estimation is based on stationary time series. A series is said to be (weakly or covariance) stationary if the mean and autocovariances of the series
Advanced Forecasting Techniques and Models: ARIMA
Advanced Forecasting Techniques and Models: ARIMA Short Examples Series using Risk Simulator For more information please visit: www.realoptionsvaluation.com or contact us at: [email protected]
ADVANCED FORECASTING MODELS USING SAS SOFTWARE
ADVANCED FORECASTING MODELS USING SAS SOFTWARE Girish Kumar Jha IARI, Pusa, New Delhi 110 012 [email protected] 1. Transfer Function Model Univariate ARIMA models are useful for analysis and forecasting
Matrices and Polynomials
APPENDIX 9 Matrices and Polynomials he Multiplication of Polynomials Let α(z) =α 0 +α 1 z+α 2 z 2 + α p z p and y(z) =y 0 +y 1 z+y 2 z 2 + y n z n be two polynomials of degrees p and n respectively. hen,
2.2 Elimination of Trend and Seasonality
26 CHAPTER 2. TREND AND SEASONAL COMPONENTS 2.2 Elimination of Trend and Seasonality Here we assume that the TS model is additive and there exist both trend and seasonal components, that is X t = m t +
Lecture 4: Seasonal Time Series, Trend Analysis & Component Model Bus 41910, Time Series Analysis, Mr. R. Tsay
Lecture 4: Seasonal Time Series, Trend Analysis & Component Model Bus 41910, Time Series Analysis, Mr. R. Tsay Business cycle plays an important role in economics. In time series analysis, business cycle
15.062 Data Mining: Algorithms and Applications Matrix Math Review
.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop
Autocovariance and Autocorrelation
Chapter 3 Autocovariance and Autocorrelation If the {X n } process is weakly stationary, the covariance of X n and X n+k depends only on the lag k. This leads to the following definition of the autocovariance
Time Series Analysis of Aviation Data
Time Series Analysis of Aviation Data Dr. Richard Xie February, 2012 What is a Time Series A time series is a sequence of observations in chorological order, such as Daily closing price of stock MSFT in
Lecture 13: Martingales
Lecture 13: Martingales 1. Definition of a Martingale 1.1 Filtrations 1.2 Definition of a martingale and its basic properties 1.3 Sums of independent random variables and related models 1.4 Products of
Time series Forecasting using Holt-Winters Exponential Smoothing
Time series Forecasting using Holt-Winters Exponential Smoothing Prajakta S. Kalekar(04329008) Kanwal Rekhi School of Information Technology Under the guidance of Prof. Bernard December 6, 2004 Abstract
FEGYVERNEKI SÁNDOR, PROBABILITY THEORY AND MATHEmATICAL
FEGYVERNEKI SÁNDOR, PROBABILITY THEORY AND MATHEmATICAL STATIsTICs 4 IV. RANDOm VECTORs 1. JOINTLY DIsTRIBUTED RANDOm VARIABLEs If are two rom variables defined on the same sample space we define the joint
THE SVM APPROACH FOR BOX JENKINS MODELS
REVSTAT Statistical Journal Volume 7, Number 1, April 2009, 23 36 THE SVM APPROACH FOR BOX JENKINS MODELS Authors: Saeid Amiri Dep. of Energy and Technology, Swedish Univ. of Agriculture Sciences, P.O.Box
An explicit link between Gaussian fields and Gaussian Markov random fields; the stochastic partial differential equation approach
Intro B, W, M, & R SPDE/GMRF Example End An explicit link between Gaussian fields and Gaussian Markov random fields; the stochastic partial differential equation approach Finn Lindgren 1 Håvard Rue 1 Johan
The Heat Equation. Lectures INF2320 p. 1/88
The Heat Equation Lectures INF232 p. 1/88 Lectures INF232 p. 2/88 The Heat Equation We study the heat equation: u t = u xx for x (,1), t >, (1) u(,t) = u(1,t) = for t >, (2) u(x,) = f(x) for x (,1), (3)
How To Understand The Theory Of Probability
Graduate Programs in Statistics Course Titles STAT 100 CALCULUS AND MATR IX ALGEBRA FOR STATISTICS. Differential and integral calculus; infinite series; matrix algebra STAT 195 INTRODUCTION TO MATHEMATICAL
Some probability and statistics
Appendix A Some probability and statistics A Probabilities, random variables and their distribution We summarize a few of the basic concepts of random variables, usually denoted by capital letters, X,Y,
Time Series Analysis and Forecasting
Time Series Analysis and Forecasting Math 667 Al Nosedal Department of Mathematics Indiana University of Pennsylvania Time Series Analysis and Forecasting p. 1/11 Introduction Many decision-making applications
Adaptive Online Gradient Descent
Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650
Rob J Hyndman. Forecasting using. 11. Dynamic regression OTexts.com/fpp/9/1/ Forecasting using R 1
Rob J Hyndman Forecasting using 11. Dynamic regression OTexts.com/fpp/9/1/ Forecasting using R 1 Outline 1 Regression with ARIMA errors 2 Example: Japanese cars 3 Using Fourier terms for seasonality 4
Forecasting in supply chains
1 Forecasting in supply chains Role of demand forecasting Effective transportation system or supply chain design is predicated on the availability of accurate inputs to the modeling process. One of the
Generating Random Numbers Variance Reduction Quasi-Monte Carlo. Simulation Methods. Leonid Kogan. MIT, Sloan. 15.450, Fall 2010
Simulation Methods Leonid Kogan MIT, Sloan 15.450, Fall 2010 c Leonid Kogan ( MIT, Sloan ) Simulation Methods 15.450, Fall 2010 1 / 35 Outline 1 Generating Random Numbers 2 Variance Reduction 3 Quasi-Monte
Chapter 6: Multivariate Cointegration Analysis
Chapter 6: Multivariate Cointegration Analysis 1 Contents: Lehrstuhl für Department Empirische of Wirtschaftsforschung Empirical Research and und Econometrics Ökonometrie VI. Multivariate Cointegration
Studying Achievement
Journal of Business and Economics, ISSN 2155-7950, USA November 2014, Volume 5, No. 11, pp. 2052-2056 DOI: 10.15341/jbe(2155-7950)/11.05.2014/009 Academic Star Publishing Company, 2014 http://www.academicstar.us
Springer Texts in Statistics
Springer Texts in Statistics Series Editors G. Casella S. Fienberg I. Olkin For other titles published in this series, go to www.springer.com/series/417 Robert H. Shumway David S. Stoffer Time Series
11. Time series and dynamic linear models
11. Time series and dynamic linear models Objective To introduce the Bayesian approach to the modeling and forecasting of time series. Recommended reading West, M. and Harrison, J. (1997). models, (2 nd
CITY UNIVERSITY LONDON. BEng Degree in Computer Systems Engineering Part II BSc Degree in Computer Systems Engineering Part III PART 2 EXAMINATION
No: CITY UNIVERSITY LONDON BEng Degree in Computer Systems Engineering Part II BSc Degree in Computer Systems Engineering Part III PART 2 EXAMINATION ENGINEERING MATHEMATICS 2 (resit) EX2005 Date: August
Analysis of Mean-Square Error and Transient Speed of the LMS Adaptive Algorithm
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 7, JULY 2002 1873 Analysis of Mean-Square Error Transient Speed of the LMS Adaptive Algorithm Onkar Dabeer, Student Member, IEEE, Elias Masry, Fellow,
Practical Time Series Analysis Using SAS
Practical Time Series Analysis Using SAS Anders Milhøj Contents Preface... vii Part 1: Time Series as a Subject for Analysis... 1 Chapter 1 Time Series Data... 3 1.1 Time Series Questions... 3 1.2 Types
Advanced Signal Processing and Digital Noise Reduction
Advanced Signal Processing and Digital Noise Reduction Saeed V. Vaseghi Queen's University of Belfast UK WILEY HTEUBNER A Partnership between John Wiley & Sons and B. G. Teubner Publishers Chichester New
7 Time series analysis
7 Time series analysis In Chapters 16, 17, 33 36 in Zuur, Ieno and Smith (2007), various time series techniques are discussed. Applying these methods in Brodgar is straightforward, and most choices are
Notes on Factoring. MA 206 Kurt Bryan
The General Approach Notes on Factoring MA 26 Kurt Bryan Suppose I hand you n, a 2 digit integer and tell you that n is composite, with smallest prime factor around 5 digits. Finding a nontrivial factor
Simple Linear Regression Inference
Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation
Eigenvalues, Eigenvectors, Matrix Factoring, and Principal Components
Eigenvalues, Eigenvectors, Matrix Factoring, and Principal Components The eigenvalues and eigenvectors of a square matrix play a key role in some important operations in statistics. In particular, they
Introduction to Regression and Data Analysis
Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it
Applications to Data Smoothing and Image Processing I
Applications to Data Smoothing and Image Processing I MA 348 Kurt Bryan Signals and Images Let t denote time and consider a signal a(t) on some time interval, say t. We ll assume that the signal a(t) is
Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 )
Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 ) and Neural Networks( 類 神 經 網 路 ) 許 湘 伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 10 1 / 35 13 Examples
Vector Time Series Model Representations and Analysis with XploRe
0-1 Vector Time Series Model Representations and Analysis with plore Julius Mungo CASE - Center for Applied Statistics and Economics Humboldt-Universität zu Berlin [email protected] plore MulTi Motivation
Factor analysis. Angela Montanari
Factor analysis Angela Montanari 1 Introduction Factor analysis is a statistical model that allows to explain the correlations between a large number of observed correlated variables through a small number
Time Series in Mathematical Finance
Instituto Superior Técnico (IST, Portugal) and CEMAT [email protected] European Summer School in Industrial Mathematics Universidad Carlos III de Madrid July 2013 Outline The objective of this short
A Coefficient of Variation for Skewed and Heavy-Tailed Insurance Losses. Michael R. Powers[ 1 ] Temple University and Tsinghua University
A Coefficient of Variation for Skewed and Heavy-Tailed Insurance Losses Michael R. Powers[ ] Temple University and Tsinghua University Thomas Y. Powers Yale University [June 2009] Abstract We propose a
How To Model A Series With Sas
Chapter 7 Chapter Table of Contents OVERVIEW...193 GETTING STARTED...194 TheThreeStagesofARIMAModeling...194 IdentificationStage...194 Estimation and Diagnostic Checking Stage...... 200 Forecasting Stage...205
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
