Estimating an ARMA Process




Statistics 910, #12

Overview
1. Main ideas
2. Fitting autoregressions
3. Fitting with moving average components
4. Standard errors
5. Examples
6. Appendix: Simple estimators for autoregressions

Main ideas

Efficiency  Maximum likelihood is nice, if you know the right distribution. For time series, it serves mostly as motivation for least squares: if the data are Gaussian, then ML is efficient; if the data are not Gaussian, then ML amounts to a complicated form of weighted least squares. The Gaussian log-likelihood for the stationary process {X_t} that generates X = (X_1, ..., X_n)′ is (minus twice the log of the likelihood)

    −2ℓ(μ, φ, θ, σ²) = n log 2π + log|Γ_n| + (X − μ)′ Γ_n^{−1} (X − μ).   (1)

Think of the covariances γ(k) as functions of the parameters of the process,

    γ(k) = σ² g(k; φ, θ).   (2)

Finding the maximum likelihood estimates of μ, φ, and θ for an ARMA(p, q) process is then simply a numerical minimization of the negative log-likelihood; all you need to do is express the covariances in (1) as functions of the unknown parameters. For example, for the AR(1) process X_t = φ_1 X_{t−1} + w_t with μ = 0 (given), γ(0) = σ²/(1 − φ_1²) and γ(h) = φ_1^h γ(0) for h ≥ 0.
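As an illustration of this "plug the covariances into (1) and minimize" recipe, here is a small R sketch (my addition, not part of the notes). It builds Γ_n for an AR(1) from γ(k) = σ² φ^k/(1 − φ²), evaluates −2ℓ with μ = 0, and hands the result to a general-purpose optimizer; the tanh/exp reparametrization simply keeps the search inside the stationary region.

## Sketch: exact Gaussian likelihood (1) for an AR(1) with mu = 0,
## minimized numerically over (phi, sigma^2).
set.seed(1)
x <- as.numeric(arima.sim(model = list(ar = 0.7), n = 200))

neg2loglik <- function(par, x) {
  phi    <- tanh(par[1])                            # keep |phi| < 1
  sigma2 <- exp(par[2])                             # keep sigma2 > 0
  n      <- length(x)
  gam    <- sigma2 * phi^(0:(n - 1)) / (1 - phi^2)  # gamma(k) for the AR(1)
  Gamma  <- toeplitz(gam)                           # covariance matrix Gamma_n
  n * log(2 * pi) + as.numeric(determinant(Gamma)$modulus) +   # -2 log-likelihood, eq. (1)
    drop(t(x) %*% solve(Gamma, x))
}

fit <- optim(c(atanh(0.5), 0), neg2loglik, x = x)
c(phi = tanh(fit$par[1]), sigma2 = exp(fit$par[2]))
## compare with the built-in routine, which evaluates the same likelihood
## through a state-space recursion rather than by forming Gamma_n:
arima(x, order = c(1, 0, 0), include.mean = FALSE)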

Recursion  The models we consider are causal, with time flowing in one direction. Hence, it is useful to decompose the joint distribution of X in the log-likelihood (1) as a sequence of one-sided conditional distributions:

    f(x_1, ..., x_n) = f(x_1) f(x_2 | x_1) f(x_3 | x_1, x_2) ⋯ f(x_n | x_1, ..., x_{n−1}).

MLEs for AR(1)  It is useful to solve for the MLE in closed form for the simplest of models. For a Gaussian AR(1) process, f(X_1) is the N(0, σ²/(1 − φ²)) density and the factored log-likelihood simplifies to

    ℓ(φ, σ²) = log f(X_1) f(X_2 | X_1) ⋯ f(X_n | X_{n−1})
             = −(n/2) log(2πσ²) + (1/2) log(1 − φ²) − SS/(2σ²),

where SS = X_1²(1 − φ²) + Σ_{t=2}^n (X_t − φ X_{t−1})² denotes the sum of squares from the exponential term in the likelihood. The derivatives that give the MLEs for σ² and φ are

    ∂ℓ/∂σ² = −n/(2σ²) + SS/(2σ⁴)

and

    ∂ℓ/∂φ = [φ X_1² + Σ_{t=2}^n (X_t − φ X_{t−1}) X_{t−1}]/σ² − φ/(1 − φ²).

Setting these to zero and solving gives

    σ̂² = SS/n   and   σ̂² φ̂/(1 − φ̂²) − φ̂ X_1² = Σ_{t=2}^n (X_t − φ̂ X_{t−1}) X_{t−1}.

Since the left-hand side of the second equation has approximate expected value zero (note the expression for the variance of X_t), the MLE can be seen to be quite similar to the usual LS estimator.
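To see how close "quite similar" is in practice, here is a short R check (an addition, not from the notes): the conditional least-squares estimate from regressing X_t on X_{t−1} next to the full ML estimate.

## Sketch: conditional LS vs full ML for a simulated AR(1); for moderate n the
## two estimates of phi differ only through the treatment of X_1.
set.seed(2)
x <- as.numeric(arima.sim(model = list(ar = 0.7), n = 200))
n <- length(x)

phi_ls <- sum(x[-1] * x[-n]) / sum(x[-n]^2)     # regression of x_t on x_{t-1}
phi_ml <- coef(arima(x, order = c(1, 0, 0), include.mean = FALSE))["ar1"]
c(LS = phi_ls, ML = unname(phi_ml))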

Initial values  Again thinking recursively, the likelihood looks a lot like that in the normal linear model if we only knew enough to get started. If we condition on X_1, X_2, ..., X_p and w_1, w_2, ..., w_q (i.e., assume we know these), then

    −2 log f(x_{p+1}, ..., x_n | x_1^p, w_1^q) = c + (n − p) log σ² + Σ_{t=p+1}^n w_t²/σ²,
    where  w_t = x_t − φ_1 x_{t−1} − ⋯ − φ_p x_{t−p} − θ_1 w_{t−1} − ⋯ − θ_q w_{t−q},

by the change of variables from x_t to w_t, as in the normal linear model. This expression becomes amenable to least squares. The approach is called conditional least squares in R.

What does this leave out?  Conditioning on the initial values does not leave out too much, really. We still
- need values for w_1, ..., w_q, and
- would like to gain further information from X_1, ..., X_p.

Autoregressions  Why start with autoregressions? Several reasons:
- These often fit quite well (we don't need the MA terms), because we can approximate the process by a long autoregression, X_t = Σ_j π_j X_{t−j} + w_t. This Markovian approach has become more popular because of its speed.
- Estimation is fast and noniterative (MLEs require some iteration).
- We avoid estimation of the initial w_t's.
- Sampling properties are well known (essentially those of the normal linear model with stochastic explanatory variables).

Backwards  Even if we don't want the AR model itself, autoregressions are often used to estimate the initial errors w_1, w_2, ..., w_q. By fitting an autoregression backwards in time, we can use the fit to estimate, say, ŵ_t^{(m)} = X_t − Σ_{j=1}^m π̂_j X_{t+j} (if we assume normality, the process is reversible).
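The conditional least squares option just mentioned is available directly in R's arima(); a minimal sketch (my addition):

## Sketch: conditional least squares (method = "CSS") vs the full likelihood
## (method = "ML") for a simulated ARMA(1,1).
set.seed(3)
x <- arima.sim(model = list(ar = 0.5, ma = 0.4), n = 300)

fit_css <- arima(x, order = c(1, 0, 1), include.mean = FALSE, method = "CSS")
fit_ml  <- arima(x, order = c(1, 0, 1), include.mean = FALSE, method = "ML")
rbind(CSS = coef(fit_css), ML = coef(fit_ml))

The default, method = "CSS-ML", uses conditional least squares only to find starting values for the full maximum likelihood iteration.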

MLE for autoregression  In the AR(p) case,

    f(x_1, ..., x_n) = f(x_1, x_2, ..., x_p) f(x_{p+1} | x_1^p) ⋯ f(x_n | x_{n−p}^{n−1})
                     = f(x_1, x_2, ..., x_p) exp(−(1/2) Σ_{t=p+1}^n w_t²/σ²) / (2πσ²)^{(n−p)/2},

where w_t = (x_t − μ) − Σ_j φ_j (x_{t−j} − μ) for t = p+1, ..., n. The leading factor f(x_1, ..., x_p) is the messy part; the product of conditionals is the simple part. But for the initial terms (the messy part), we have the same likelihood as in the normal linear model, and the MLEs are those that we would get from the least squares regression of x_t on x_{t−1}, ..., x_{t−p} and a constant. (If we call the constant in that regression β_0, then μ̂ = β̂_0/(1 − φ̂_1 − ⋯ − φ̂_p).) Except for ignoring the contribution from x_1, ..., x_p, least squares matches maximum likelihood in the AR(p) case. Hence, maximum likelihood cannot improve the estimates much unless p is large relative to n.
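A least-squares version of this fit takes only a couple of lines in R; the sketch below (not from the notes) regresses x_t on its lags and recovers the mean from the intercept as μ̂ = β̂_0/(1 − Σ_j φ̂_j).

## Sketch: fit an AR(2) by ordinary least squares and back out the mean.
set.seed(4)
x <- as.numeric(arima.sim(model = list(ar = c(0.6, -0.3)), n = 400)) + 5
p <- 2

Z      <- embed(x, p + 1)                 # columns: x_t, x_{t-1}, ..., x_{t-p}
fit    <- lm(Z[, 1] ~ Z[, -1])
phihat <- coef(fit)[-1]
muhat  <- coef(fit)[1] / (1 - sum(phihat))
c(mu = unname(muhat), phi = unname(phihat))
## compare with ar.ols(x, order.max = p, aic = FALSE) or arima(x, order = c(p, 0, 0))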

Recursion = triangular factorization  A recursion captures the full likelihood. For an AR(p) model with coefficients φ_p = (φ_{p1}, φ_{p2}, ..., φ_{pp}), express the lower-order coefficients as functions of φ_p (e.g., find γ(0) and φ_{11} = Corr(X_t, X_{t−1}) in terms of φ_p). If we can do that, it is simple to model

    f(x_1, ..., x_p) = f(x_1) f(x_2 | x_1) ⋯ f(x_p | x_1^{p−1}).

The prior AR(1) example shows this. In general, use the Levinson recursion to obtain a triangular decomposition of the covariance matrix Γ_n. This is done by converting the correlated variables X_1, ..., X_n into a collection, say U_1, U_2, ..., U_n, of uncorrelated variables. One has many ways of doing this, such as the Gram-Schmidt or Cholesky factorization. In the following, let P_{1⋯j} denote the projection onto the random variables X_1, ..., X_j (as in fitting a regression). Following the Cholesky factorization, construct

    U_1 = X_1
    U_2 = X_2 − P_1 X_2 = X_2 − φ_{1,1} X_1
    U_3 = X_3 − P_{12} X_3 = X_3 − φ_{2,2} X_1 − φ_{2,1} X_2
    U_4 = X_4 − P_{123} X_4 = X_4 − φ_{3,3} X_1 − φ_{3,2} X_2 − φ_{3,1} X_3
    ⋮
    U_j = X_j − Σ_{k=1}^{j−1} φ_{j−1,k} X_{j−k}.

(This sequence of projections differs from the one used in the numerically superior modified Gram-Schmidt method. GS sweeps X_1 from all of the others first, filling the first column of L rather than building it recursively one row at a time.) Let L denote the lower triangular matrix that begins

    L = [   1        0        0        0      0  ...
          −φ_11      1        0        0      0  ...
          −φ_22    −φ_21      1        0      0  ...
          −φ_33    −φ_32    −φ_31      1      0  ...
            ⋮                                      ],

with diagonal elements L_kk = 1 and off-diagonal elements L_kj = −φ_{k−1,k−j}, j = 1, ..., k−1 (the signs follow from U = LX). Since the U_j are uncorrelated, we have

    D_n = Var(U) = Var(LX) = L Γ_n L′,   (3)
    Γ_n^{−1} = L′ D_n^{−1} L,   (4)

where D_n is the diagonal matrix with the conditional variances

    σ_k² = Var( X_t − Σ_{j=1}^k φ_{kj} X_{t−j} )

along the diagonal.

Comments  It follows that σ² = lim_n |Γ_{n+1}|/|Γ_n|. (Where have we seen this ratio of determinants before? Toeplitz matrices, Szegö's theorem, and Levinson's recursion itself.) If the process is indeed AR(p), the lower triangular matrix L is banded, with p subdiagonal stripes: the element L_kj = 0 for j < k − p.
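The Levinson recursion itself is only a few lines. The sketch below (an addition to the notes) computes the φ_{k,j} and the innovation variances σ_k² from the autocovariances of an AR(1), builds L, and checks that L Γ_n L′ is diagonal, as in (3).

## Sketch: Levinson(-Durbin) recursion and the triangular factorization (3).
gam <- 0.7^(0:5) / (1 - 0.7^2)      # gamma(0), ..., gamma(5) for an AR(1), sigma2 = 1
n   <- length(gam)

levinson <- function(gam) {
  m   <- length(gam) - 1            # highest order fitted
  phi <- matrix(0, m, m)            # phi[k, j]: j-th coefficient of the order-k fit
  v   <- numeric(m)                 # v[k]: innovation variance sigma_k^2
  phi[1, 1] <- gam[2] / gam[1]
  v[1]      <- gam[1] * (1 - phi[1, 1]^2)
  for (k in 2:m) {
    phi[k, k]         <- (gam[k + 1] - sum(phi[k - 1, 1:(k - 1)] * gam[k:2])) / v[k - 1]
    phi[k, 1:(k - 1)] <- phi[k - 1, 1:(k - 1)] - phi[k, k] * phi[k - 1, (k - 1):1]
    v[k]              <- v[k - 1] * (1 - phi[k, k]^2)
  }
  list(phi = phi, v = v)
}

out <- levinson(gam)
L   <- diag(n)
for (k in 2:n) L[k, 1:(k - 1)] <- -out$phi[k - 1, (k - 1):1]   # L[k, j] = -phi_{k-1, k-j}
round(L %*% toeplitz(gam) %*% t(L), 10)   # diagonal: gamma(0), sigma_1^2, ..., sigma_{n-1}^2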

If not an AR model?  Since w_t is a linear combination of X_s, s ≤ t, a long autoregression gives a simple way of computing/representing

    w̃_t = X_t − Σ_{j=1}^{t−1} π_j X_{t−j}.

The approximation becomes more accurate (w̃_t approaches w_t) as t increases, since w_t − w̃_t = Σ_{j≥t} π_j X_{t−j}, which is of order c^t for some c < 1 (the π_j are eventually dominated by a geometric series). We'll do an example of this in R; see the accompanying script.

ARMA via regression  Given estimates of the w_t, fit the regression model using the estimated error process as covariates. That is, regress X_t on its p lags and on q lags of the estimated errors ŵ_t,

    X_t = φ_1 X_{t−1} + ⋯ + φ_p X_{t−p} + θ_1 ŵ_{t−1} + ⋯ + θ_q ŵ_{t−q} + w_t.

This method works rather well, and it opens up all of the tools of regression diagnostics for use in time series analysis. The method dates back to J. Durbin (1960), "The fitting of time series models" (International Statistical Review, 233-244), and was exploited in a model-selection context by Hannan and Kavalieris (1984, Biometrika 71, 273-280).

Standard errors  Another insight from this approach is that it suggests the properties of the estimators of φ and θ. Estimates of φ (ignoring the MA components) resemble those from the usual regression, only with lags. Hence, for an AR(p) model, it follows that

    Var(φ̂) ≈ σ² Γ̂_p^{−1} / n.
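Here is a minimal R sketch of the regression route just described (my addition): estimate the w_t from a long autoregression, then regress X_t on lagged X's and lagged residuals. The lm() summary also supplies regression-style standard errors of the kind discussed above.

## Sketch: Durbin-style regression for an ARMA(1,1) using estimated errors.
set.seed(6)
x <- as.numeric(arima.sim(model = list(ar = 0.5, ma = 0.4), n = 500))

ar_fit <- ar(x, order.max = 20)                 # long autoregression (order chosen by AIC)
what   <- as.numeric(na.omit(ar_fit$resid))     # estimated w_t (leading values are NA)
xx     <- tail(x, length(what))                 # align the series with the residuals
n      <- length(xx)

fit <- lm(xx[-1] ~ 0 + xx[-n] + what[-n])       # X_t on X_{t-1} and w-hat_{t-1}
coef(fit)                                       # rough estimates of (phi1, theta1)
## compare with arima(x, order = c(1, 0, 1), include.mean = FALSE)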

Maximum likelihood for ARMA(p, q) models  As noted earlier, the MLE is easy to compute if we condition on prior observations X_0, ..., X_{−p+1} and prior errors w_0, ..., w_{−q+1}, for then the likelihood can be written in terms of w_1, ..., w_n, which are iid Gaussian. Estimation then reduces to minimizing Σ_{t=1}^n w_t². Without this conditioning, the resulting estimates of w_1, w_2, ..., w_n are functions of the parameters φ and θ; we can choose the parameters that minimize this sum of squares, giving a nonlinear least squares problem.

Approximate noise terms  Assume for this section that μ = 0 so that we can focus on the estimates of φ and θ. Let β = (φ_1, ..., φ_p, θ_1, ..., θ_q)′ denote the vector of unknown coefficients. Consider setting the prior unknown X_j and w_j for j ≤ 0 to their marginal expected values (0), and define

    w_1(β) = X_1,
    w_2(β) = X_2 − φ_1 X_1 − θ_1 w_1,
    w_3(β) = X_3 − φ_1 X_2 − φ_2 X_1 − θ_1 w_2 − θ_2 w_1,
    ⋮
    w_n(β) = X_n − Σ_j φ_j X_{n−j} − Σ_j θ_j w_{n−j}.

Backcasting  In general, algorithms start like this and then find estimates of x_t and w_t for t ≤ 0 (the method is known as "backcasting"). We will avoid those methods and consider (in a coming lecture) methods based on the Kalman filter (used in R). We continue this discussion, though, to explore how the iterative estimation proceeds and to discover the form of the asymptotic variance matrix of the estimates.

Nonlinear least squares  The nonlinear sum of squares is approximately a least squares problem,

    Σ_{t=1}^n w_t(β)² ≈ Σ_{t=1}^n ( w_t(β_0) − D_t′(β − β_0) )²,   (5)

where the derivative terms are D_{tj} = −∂w_t/∂β_j. The negative sign is chosen to make the sum (5) look like a regression sum of squares Σ_i (y_i − x_i′β)², with D_t playing the role of x_i and w_t(β_0) playing the role of y_i. The estimate of β is thus

    β̂ = β_0 + (D′D)^{−1} D′ w(β_0).

Continuing the analogy to least squares, the variance matrix of β̂ is then (approximating w̃_t ≈ w_t)

    Var(β̂) ≈ σ² (D′D)^{−1}.
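A crude version of this calculation is easy to write down: compute the w_t(β) by the recursion above (zeros for the presample values) and let a generic optimizer minimize the sum of squares. The R sketch below (my addition) does this for an ARMA(1,1); arima()'s "CSS" method handles the start-up somewhat differently, so the two answers are close but need not agree exactly.

## Sketch: conditional sum of squares for an ARMA(1,1), presample values set to 0.
set.seed(7)
x <- as.numeric(arima.sim(model = list(ar = 0.5, ma = 0.4), n = 300))

css <- function(beta, x) {
  phi <- beta[1]; theta <- beta[2]
  w <- numeric(length(x))
  w[1] <- x[1]                                   # x_0 = w_0 = 0
  for (t in 2:length(x))
    w[t] <- x[t] - phi * x[t - 1] - theta * w[t - 1]
  sum(w^2)                                       # the sum being minimized
}

fit <- optim(c(0, 0), css, x = x)
fit$par                                          # compare with:
arima(x, order = c(1, 0, 1), include.mean = FALSE, method = "CSS")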

Derivatives simplify once the estimates are beyond the direct presence of the initial conditions where we filled in zeros. For t > max(p, q),

    D_{tj} = −∂w_t/∂φ_j = X_{t−j} − Σ_{k=1}^q θ_k D_{t−k,j},   j = 1, ..., p,

and, for the moving average terms (use the product rule for derivatives),

    D_{t,p+j} = −∂w_t/∂θ_j = w_{t−j} − Σ_{k=1}^q θ_k D_{t−k,p+j},   j = 1, ..., q.

That is, θ(B) D_{t,j} = X_{t−j} for the AR terms and θ(B) D_{t,p+j} = w_{t−j} for the MA terms. The derivatives thus behave as two AR processes, for which (see Property 3.9)

    D_{t,j} ≈ B^j u_t,    u_t = (1/θ(B)) X_t = (1/φ(B)) w_t,

and

    D_{t,p+j} ≈ B^j v_t,    v_t = (1/θ(B)) w_t.

Hence, the matrix D′D/n is approximately the variance-covariance matrix of two distinct AR processes, both driven by the errors w_t.

Example: ARMA(1,1) process  The variance/covariance matrix of the estimates is (page 135 in the 2nd edition of the text, 134 in the 3rd)

    n Var(φ̂, θ̂) ≈ σ² [ E u_t²     E u_t v_t ]^{−1}   =   [ 1/(1−φ²)   1/(1+φθ) ]^{−1}
                      [ E u_t v_t  E v_t²    ]            [ 1/(1+φθ)   1/(1−θ²) ]    .

A related issue at this point is whether it is better to use this expression for the expected Fisher information or to use the observed matrix of second derivatives (which is available within the actual estimation routine as D′D). There is a long literature comparing the use of expected versus observed Fisher information.
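To make the ARMA(1,1) formula concrete, the sketch below (my addition) evaluates the asymptotic standard errors from this 2×2 matrix and compares them with the standard errors reported by arima() for a simulated series.

## Sketch: asymptotic vs reported standard errors for an ARMA(1,1).
set.seed(8)
phi <- 0.5; theta <- 0.4; n <- 500
x <- arima.sim(model = list(ar = phi, ma = theta), n = n)

M <- matrix(c(1 / (1 - phi^2),       1 / (1 + phi * theta),
              1 / (1 + phi * theta), 1 / (1 - theta^2)), 2, 2)
se_asym <- sqrt(diag(solve(M)) / n)              # sqrt of the diagonal of M^{-1}/n

fit <- arima(x, order = c(1, 0, 1), include.mean = FALSE)
rbind(asymptotic = se_asym, reported = sqrt(diag(fit$var.coef)))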

Asymptotic distributions

Information matrix  MLEs in regular problems are asymptotically normal, unbiased, and have variance determined by the information matrix (to this order of approximation). The information matrix is the negative of the expected matrix of second derivatives of the log-likelihood, and the asymptotic variance is its inverse. That is,

    √n (θ̂ − θ) → N(0, I(θ)^{−1}),    I(θ) = −(1/n) E( ∂²ℓ/∂θ∂θ′ ).

For the AR(1) example, we have

    −E( ∂²ℓ/∂φ² ) = E( (X_1² + Σ_{t=2}^n X_{t−1}²)/σ² ) − (1 − 3φ²)/(1 − φ²)² ≈ n/(1 − φ²),

as we obtained for least squares.

Independence of σ² and φ  The mixed partial derivative is

    E( ∂²ℓ/∂σ²∂φ ) = E( (∂SS/∂φ)/(2σ⁴) ) = −E( φ X_1² + Σ_{t=2}^n w_t X_{t−1} )/σ⁴ ≈ 0

(the sum has mean zero and the X_1² term contributes only O(1), not O(n)), so that the estimated coefficients and the error variance estimate are asymptotically independent.

Invariance  The variance of φ̂ or θ̂ does not depend on σ²: the accuracy of the estimate of φ depends only on the correlations of the process, not on the level of the noise variance.

Examples  Examples using US unemployment are in the file macro.r on the text web site.

Alternative conditions  These results are utopian in the sense that they hold when all of the assumptions (large n, normality, known orders, ...) hold. Simulation makes it easy to explore what happens when these assumptions fail. Interesting examples that you can explore include:

- Skewness in the distribution of the AR(1) estimator.
- The effect of outliers in the w_t process.
- Boundary effects as φ_1 nears 1 (see the sketch below).
- The effect of model specification errors.

Effects on prediction  What are the optimal predictors once we replace φ and θ by estimates, particularly when the fitted model may not match the p and q of the underlying data-generating mechanism? (As if the underlying process were an ARMA process!)
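As one example, the sketch below (an addition to the notes) simulates the sampling distribution of the conditional LS estimator of φ_1 when the process is close to a unit root; the distribution is skewed and the estimator is biased downward, in contrast to what the normal approximation suggests.

## Sketch: sampling distribution of the AR(1) estimator near the boundary phi = 1.
set.seed(9)
phi <- 0.95; n <- 100; reps <- 1000
phihat <- replicate(reps, {
  x <- as.numeric(arima.sim(model = list(ar = phi), n = n))
  sum(x[-1] * x[-n]) / sum(x[-n]^2)              # conditional LS estimate
})
hist(phihat, breaks = 40, main = "phi-hat, true phi = 0.95")
c(mean = mean(phihat), sd = sd(phihat),
  normal_approx_sd = sqrt((1 - phi^2) / n))      # sd suggested by the asymptotics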

Appendix: Estimators for Autoregressions

Introduction  In applications, the ability to fit autoregressions easily is important for obtaining initial values of parameters and for getting estimates of the error process. Least squares is a popular choice, as is the Yule-Walker procedure. Unlike the Yule-Walker procedure, least squares can produce a nonstationary fitted model. Both the Yule-Walker and LS estimators are noniterative and consistent, so they can be used as starting values for iterative methods like ML.

Yule-Walker estimators are basically moment estimators, and so have the consistency and asymptotic normality associated with smooth functions of moment estimators (Slutsky's theorem). For an autoregression of order p, the Yule-Walker estimator solves the system

    Γ̂_p φ̂_p = γ̂_p.

The Yule-Walker estimator has two important properties not shared by other estimators for autoregressions:
1. Estimates for a sequence of models of order p = 1, 2, ..., n−1 are easily computed recursively using the Levinson algorithm applied to the estimated covariance matrix Γ̂_n.
2. The associated estimates define stationary processes; that is, the zeros of φ̂(z) must lie outside the unit circle. The stationarity follows from observing that Γ̂_p is positive definite.

Asymptotics  The result is essentially that which obtains in the corresponding regression equation. Assuming that the process is indeed AR(p), we have the same asymptotic distribution for all three of the estimators:

    √n (φ̂ − φ) → AN(0, σ² Γ_p^{−1}).

A feature useful for identification of p is embedded here. If we overfit a model of order m > p, the fitted model is still correct (though not efficient), with φ_j = 0 for j = p+1, ..., m. Consequently, the overfit coefficients have mean zero with variance that is O(1/n).
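In R, ar.yw() and ar.ols() implement the two estimators. The sketch below (my addition) compares them on a simulated AR(2) and confirms that the zeros of the Yule-Walker fit lie outside the unit circle.

## Sketch: Yule-Walker vs least-squares fits of an AR(2), plus a stationarity check.
set.seed(10)
x <- arima.sim(model = list(ar = c(1.5, -0.75)), n = 200)

fit_yw  <- ar.yw(x, order.max = 2, aic = FALSE)
fit_ols <- ar.ols(x, order.max = 2, aic = FALSE)
rbind(YW = fit_yw$ar, OLS = as.numeric(fit_ols$ar))

Mod(polyroot(c(1, -fit_yw$ar)))   # moduli of the zeros of phi-hat(z); all exceed 1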

Similar properties  The methods for deriving properties of the YW and LS estimators are similar. The LS estimator is considered here, and the properties of the YW estimator follow similarly. The two types of estimators are asymptotically equivalent (though rather different in reasonably sized applications).

Estimating equations  Write the normal equations for the LS estimator as

    √n (φ̂ − φ) = (X′X/n)^{−1} (X′w)/√n,

where the (n−p) × p matrix X and the vector w are defined as

    X = (X_p, ..., X_{n−1})′,  with rows  X_t′ = (X_t, ..., X_{t−p+1}),  and  w = (w_{p+1}, ..., w_n)′.

The cross-product matrix is nonsingular because of the a.s. convergence X′X/n → Γ_p and the nonsingularity of Γ_p. Asymptotically, then,

    √n (φ̂ − φ) ≈ Γ_p^{−1} (X′w)/√n.

Return of the sandwich  To make life easier, assume that the w_t are independent. It follows that

    √n (φ̂ − φ) → AN( 0, Γ_p^{−1} Var(X′w/√n) Γ_p^{−1} ),

so we are left to deal with X′w/√n. Consider the jth element of this vector (not related to the terms in Levinson's method considered earlier),

    U_j = Σ_{t=p+1}^n w_t X_{t−j} / √n.

From the assumed independence, the variances and covariances of the U_j's are

    n Cov(U_j, U_k) = Σ_{s,t=p+1}^n E( w_t X_{t−j} w_s X_{s−k} ) = (n − p) σ² γ(j − k),

or, in matrix notation, Var(U) ≈ σ² Γ_p. Hence, the asymptotic distribution is

    √n (φ̂ − φ) → AN(0, σ² Γ_p^{−1}),

as one would anticipate from the results for regression (σ² (X′X)^{−1}).
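The pieces of this argument are easy to compute directly. In the sketch below (my addition), the lagged design matrix is built with embed(), X′X/n is compared with the sample autocovariances that make up Γ̂_p, and the normal equations give φ̂.

## Sketch: the least-squares estimating equations for an AR(2).
set.seed(11)
p <- 2
x <- as.numeric(arima.sim(model = list(ar = c(0.6, -0.3)), n = 2000))

Z      <- embed(x, p + 1)                    # columns: x_t, x_{t-1}, x_{t-2}
X      <- Z[, -1]                            # lagged design matrix
phihat <- solve(crossprod(X), crossprod(X, Z[, 1]))

crossprod(X) / nrow(X)                       # approximately Gamma_p ...
acf(x, lag.max = p - 1, type = "covariance", plot = FALSE)$acf   # ... i.e., gamma-hat(0), gamma-hat(1)
drop(phihat)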

Weaker assumptions  You can obtain the same result without independence by assuming that the w_t's are stationary to order 4 and writing the process as an infinite moving average, X_t = Σ_{j≥0} ψ_j w_{t−j}, or by assuming that the w_t's form a martingale difference sequence. Taking the former approach,

    U_j = Σ_{t=p+1}^n w_t ( Σ_{m=0}^∞ ψ_m w_{t−j−m} ) / √n,

and the covariance is (assuming the fourth cumulant κ_4 = 0)

    n Cov(U_j, U_k) = Σ_{s,t=p+1}^n Σ_{l,m} ψ_l ψ_m E( w_t w_{t−j−m} w_s w_{s−k−l} )
                    = Σ_{s,t} Σ_{l,m} ψ_l ψ_m E(w_t w_s) E(w_{t−j−m} w_{s−k−l})
                    = Σ_t Σ_{l, m=k−j+l} ψ_l ψ_m E(w_t²) E(w_{t−j−m} w_{t−k−l})
                    = (n − p) σ⁴ Σ_l ψ_l ψ_{l+k−j}
                    = (n − p) σ² γ(k − j),

as before.
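As a numerical footnote (my addition), this calculation can be checked by simulation: generate w_t and X_t together, form the U_j, and compare their sample covariance matrix with σ² Γ_p. The sketch below does this for a Gaussian AR(2); start-up transients in the simulated series are ignored, so the agreement is only approximate.

## Sketch: Monte Carlo check that Var(U) is roughly sigma^2 * Gamma_p for an AR(2).
set.seed(12)
phi <- c(0.6, -0.3); p <- 2; n <- 500; reps <- 2000

U <- t(replicate(reps, {
  w <- rnorm(n)                                                 # sigma^2 = 1
  x <- as.numeric(stats::filter(w, phi, method = "recursive"))  # AR(2) driven by w
  sapply(1:p, function(j) sum(w[(p + 1):n] * x[(p + 1 - j):(n - j)]) / sqrt(n))
}))
round(var(U), 3)                                          # simulated Var(U)

rho    <- ARMAacf(ar = phi, lag.max = p)                  # rho(0), ..., rho(p)
gamma0 <- 1 / (1 - sum(phi * rho[-1]))                    # gamma(0) with sigma^2 = 1
round(gamma0 * toeplitz(rho[1:p]), 3)                     # sigma^2 * Gamma_p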