Lectures in Modern Economic Time Series Analysis. 2 ed. c. Linköping, Sweden [email protected]
Lectures in Modern Economic Time Series Analysis. 2 ed. © Bo Sjö, Linköping, Sweden, [email protected], October 30, 2011
CONTENTS

1 Introduction
Outline of this Book/Text/Course/Workshop
Why Econometrics?
Junk Science and Junk Econometrics
Introduction to Econometric Time Series
Programs
Different types of time series
Repetition - Your First Courses in Statistics and Econometrics

I Basic Statistics
3 Time Series Modeling - An Overview
Statistical Models
Random Variables
Moments of random variables
Popular Distributions in Econometrics
Analysing the Distribution
Multidimensional Random Variables
Marginal and Conditional Densities
The Linear Regression Model
A General Description
The Method of Maximum Likelihood
MLE for a Univariate Process
MLE for a Linear Combination of Variables
The Classical tests - Wald, LM and LR tests

II Time Series Modeling
6 Random Walks, White noise and All That
Different types of processes
White Noise
The Log Normal Distribution
The ARIMA Model
The Random Walk Model
Martingale Processes
Markov Processes
Brownian Motions
Brownian motions and the sum of white noise
The geometric Brownian motion
A more formal definition
7 Introduction to Time Series Modeling
Descriptive Tools for Time Series
Weak and Strong Stationarity
Weak Stationarity, Covariance Stationary and Ergodic Processes
Strong Stationarity
Finding the Optimal Lag Length and Information Criteria
The Lag Operator
Generating Functions
The Difference Operator
Filters
Dynamics and Stability
Fractional Integration
Building an ARIMA Model. The Box-Jenkins Approach
Is the ARMA model identified?
Theoretical Properties of Time Series Models
The Principle of Duality
Wold's decomposition theorem
Additional Topics
Seasonality
Non-stationarity
Aggregation
Overview of Single Equation Dynamic Models
Multipliers and Long-run Solutions of Dynamic Models
Vector Autoregressive Models
How to estimate a VAR?
Impulse responses in a VAR with non-stationary variables and cointegration
BVAR, TVAR etc

III Granger Non-causality Tests
Introduction to Exogeneity and Multicollinearity
Exogeneity
Weak Exogeneity
Strong Exogeneity
Super Exogeneity
Multicollinearity and understanding of multiple regression
Univariate Tests of The Order of Integration
The DF-test
The ADF-test
The Phillips-Perron test
The LMSP-test
The KPSS-test
The G(p, q) test
The Alternative Hypothesis in I(1) Tests
Fractional Integration
12 Non-Stationarity and Co-integration
The Spurious Regression Problem
Integrated Variables and Co-integration
Approaches to Testing for Co-integration
Integrated Variables and Common Trends
A Deeper Look at Johansen's Test
The Estimation of Dynamic Models
Deterministic Explanatory Variables
The Deterministic Trend Model
Stochastic Explanatory Variables
Lagged Dependent Variables
Lagged Dependent Variables and Autocorrelation
The Problems of Dependence and the Initial Observation
Estimation with Integrated Variables
Encompassing
ARCH Models
Practical Modelling Tips
Some ARCH Theory
Some Different Types of ARCH and GARCH Models
The Estimation of ARCH models
Econometrics and Rational Expectations
Rational vs. other Types of Expectations
Typical Errors in the Modeling of Expectations
Modeling Rational Expectations
Testing Rational Expectations
A Research Strategy
References

APPENDIX
Appendix III Operators
The Expectations Operator
The Variance Operator
The Covariance Operator
The Sum Operator
The Plim Operator
The Lag and the Difference Operators
Abstract
1. INTRODUCTION

"He who controls the past controls the future." George Orwell in "1984".

Please respect that this is work in progress. It has never been my intention to write a commercial book, or a perfect textbook in time series econometrics. It is simply a collection of lectures in a popular form that can serve as a complement to ordinary textbooks and articles used in education. The parts dealing with tests for unit roots (order of integration) and cointegration are not well developed. These topics have a memo of their own, "A Guide to testing for unit roots and cointegration".

When I started to put these lecture notes together some years ago I decided on the title "Lectures in Modern Time Series Econometrics" because I thought that the contents were a bit "modern" compared to standard econometric textbooks. During the fall of 2010, as I started to update the notes, I thought that it was time to remove the word "modern" from the title. A quick look in Damodar Gujarati's textbook "Basic Econometrics" from 2009 convinced me to keep the word "modern" in the title. Gujarati's text on time series hasn't changed since the 1970s even though time series econometrics has changed completely since the 70s. Thus, under these circumstances I see no reason to change the title, at least not yet.

There are four ways in which one can do time series econometrics. The first is to use the approach of the 1970s: view your time series model just like any linear regression, and impose a number of ad hoc restrictions that will hide all problems you find. This is not a good approach. It is only found in old textbooks and never in today's research. You might only see it used in journals of very low scientific standing. Second, you can use theory to derive a time series model, and interesting parameters, that you then estimate with appropriate estimators. Examples of this are to derive utility functions, to assume that agents have rational expectations, etc. This is a proper research strategy.
However, it typically takes good data, and you need to be original in your approach, but you can get published in good journals. The third approach is simply to do a statistical description of the data series, in the form of a vector autoregressive system, or the reduced form of the vector error correction model. This system can be used for forecasting, for analysing relationships among data series, and for investigating the effects of unforeseen shocks such as drastic changes in energy prices, money supply etc. The fourth way is to go beyond the vector autoregressive system and try to estimate structural parameters in the form of elasticities and policy intervention parameters. If you forget about the first method, the choice depends on the problem at hand and how you choose to formulate it. This book aims at telling you how to use methods three and four.

The basic thinking is that your data is the real world, and theories are abstractions that we use to understand the real world. In applied econometric time series you should always strive to build well-defined statistical models, that is, models that are consistent with the data chosen. There is a complex statistical theory behind all this, which I will try to popularize in this book. I do not see this book as a substitute for an ordinary textbook. It is simply a complement.
1.1 Outline of this Book/Text/Course/Workshop

This book is intended for people who have done a basic course in statistics and econometrics, either at the undergraduate or at the graduate level. If you did an undergraduate course I assume that you did it well. Econometrics is a type of course where every lecture, and every textbook chapter, leads to the next level. The best way to learn econometrics is to be active, read several books, and work on your own with econometric software. No teacher can teach you how to run a piece of software. That is something you have to learn on your own by practicing with it. There is some very good software out there. The outline differs between the graduate and Ph.D. levels mainly in the theoretical parts. At the Ph.D. level, there is more stress on theoretical backgrounds.

1) I will begin by talking about why econometrics is different from statistics, and why econometric time series is different from the econometrics you meet in many basic textbooks.
2) I will repeat very briefly basic statistics and linear regression, and stress what you should know in terms of testing and modeling dynamic models. For most students that will imply going back and doing some quick repetition.
3) Introduction to statistical theory including maximum likelihood, random variables, density functions and stochastic processes.
4) Basic time series properties and processes.
5) Using and understanding ARFIMA and VAR modelling techniques.
6) Testing for non-stationarity in the form of stochastic trends, i.e. tests for unit roots.
7) The spurious regression problem.
8) Testing and understanding cointegration.
9) Testing for Granger non-causality.
10) The theory of reduction, exogeneity and building dynamic models and systems.
11) Modelling time varying variances, ARCH and GARCH models.
12) The implications and consequences of rational expectations for econometric modelling.
13) Non-linearities.
14) Additional topics.

For most of these topics I have developed more or less self-instructing exercises.

1.2 Why Econometrics?

Why is there a subject called econometrics? Why study econometrics, instead of statistics? Why not let the statisticians teach statistics, and in particular time series techniques? These are common questions, raised during seminars and in private, by students, statisticians and economists. The answer is that each scientific area tends to create its own special methodological problems, often heavily interrelated with theoretical issues. These problems, and the ways of solving them, are important in a particular area of science but not necessarily in others. Economics is a typical example, where the formulation of the economic and the statistical problem is deeply interrelated from the beginning.

In everyday life we are forced to make decisions based on limited information. Most of our decisions deal with an uncertain stochastic future. We all base our
decisions on some view of the economy where we assume that certain events are linked to each other in more or less complex ways. Economists call this a model of the economy. We can describe the economy and the behavior of the individuals in terms of multivariate stochastic processes. Decisions based on stochastic sequences play a central role in economics and in finance. Stochastic processes are the basis for our understanding of the behavior of economic agents and of how their behavior determines the future path of the economy.

Most econometric textbooks deal with stochastic time series as a special application of the linear regression technique. Though this approach is acceptable for an introductory course in econometrics, it is unsatisfactory for students with a deeper interest in economics and finance. To understand the empirical and theoretical work in these areas, it is necessary to understand some of the basic philosophy behind stochastic time series.

This work is a work in progress. It is based on my lectures on Modern Economic Time Series Analysis at the Department of Economics, first at the University of Gothenburg and later at the University of Skövde and Linköping University in Sweden. The material is not ready for widespread distribution. This work most likely contains lots of errors; some are known by the author, and some are not yet detected. The different sections do not necessarily follow in a logical order. Therefore, I invite anyone who has opinions about this work to share them with me.

The first part of this work provides a repetition of some basic statistical concepts, which are necessary for understanding modern economic time series analysis. The motive for repeating these concepts is that they play a larger role in econometrics than many contemporary textbooks in econometrics indicate. Econometrics did not change much from the first edition of Johnston in the 60s until the revised version of Kmenta in the mid 80s.
However, the critique against the use of econometrics delivered by Sims, Lucas, Leamer, Hendry and others, in combination with new insights into the behavior of non-stationary time series and the rapid development of computer technology, has revolutionized econometric modeling and resulted in an explosion of knowledge. The demands for writing a decent thesis, or a scientific paper, based on econometric methods have risen far beyond what one can learn in an introductory course in econometrics.

1.3 Junk Science and Junk Econometrics

In the media you often hear about this and that being proved by scientific research. In the late 1990s newspapers reported that someone had proved that genetically modified (GM) food could be dangerous. The news spread quickly, and according to the story the original article had been stopped from being published by scientists with suspicious motives. Various lobby groups immediately jumped up. GM food was dangerous, should be banned, and more money should go into this line of research.

What had happened was the following. A researcher claimed to have shown that GM food was bad for health. He presented these results to a number of media people, who distributed the results. (Remember the fuss about cold fusion.) The results were presented in a paper sent to a scientific journal for publication. The journal, however, did not publish the article. It was dismissed because the results were not based on a sound scientific method. The researcher had fed rats with potatoes. One group of rats got GM potatoes, the other group of rats got normal non-GM potatoes. The rats that got GM potatoes seemed to develop cancer more often than the control group. The statistical difference
between the groups was not big, but sufficiently big for those wanting to confirm their a priori beliefs that GM food is bad. A somewhat embarrassing detail, never reported in the media, is that rats in general do not like potatoes. As a consequence both groups of rats in this study were suffering from starvation, which severely affected the test. It was not possible to determine if the difference between the two groups was caused by starvation or by GM food. Once the researcher conditioned on the effects of starvation, the difference became insignificant.

This is an example of junk science: bad science getting a lot of media exposure because the results fit the interests of lobby groups, and can be used to scare people. The lesson for econometricians is obvious: if you come up with good results you get rewarded; bad results, on the other hand, can quickly be forgotten. The GM food example is extreme compared with econometric work. Econometric research seldom gets such media coverage, though there are examples, such as the claim that Sweden's economic growth is lower than that of other similar countries, and the assumed dynamic effects of a reduction of marginal taxes. There are significant results that depend on one single outlier. Once the outlier is removed, the significance is gone, and the whole story behind this particular book is also gone.

In these lectures we will argue that the only way to avoid junk econometrics is careful and systematic construction and testing of models. Basically, this is the modern econometric time series approach. Why is this modern, and why stress the idea of testing? The answers are simply that careers have been built on running junk econometric equations, and that most people are unfamiliar with scientific methods in general and with the consequences of living in a world surrounded by random variables in particular.
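The remark above about significant results that hinge on one single outlier can be made concrete with a small sketch. The data are made up for illustration, `ols_slope_t` is a hypothetical helper written here (not something from the lecture notes), and the rule of thumb |t| > 2 is used only as a rough significance threshold.

```python
import math

def ols_slope_t(x, y):
    """Return the OLS slope and its t-statistic for y = a + b*x + e."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    # Residual sum of squares and the usual standard error of the slope
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(rss / (n - 2) / sxx)
    return b, b / se_b

# Made-up data: eight observations with no systematic relation between x and y
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, -1, 1, -1, 1, -1, 1, -1]

# The same data plus one extreme observation (assumed outlier)
x_out = x + [20]
y_out = y + [15]

b_without, t_without = ols_slope_t(x, y)
b_with, t_with = ols_slope_t(x_out, y_out)

print(f"without outlier: slope={b_without:.3f}, t={t_without:.2f}")
print(f"with outlier:    slope={b_with:.3f}, t={t_with:.2f}")
```

In this constructed case the slope looks clearly significant only when the single extreme observation is included, which is the pattern the text warns about.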
2. INTRODUCTION TO ECONOMETRIC TIME SERIES

"Time is a great teacher, but unfortunately it kills all its pupils" Louis Hector Berlioz

A time series is simply data ordered by time. For an econometrician, a time series is usually data that is also generated over time in such a way that time can be seen as a driving factor behind the data. Time series analysis consists of approaches that look for regularities in data ordered by time. In comparison with other academic fields, the modeling of economic time series is characterized by the following problems, which partly motivate why econometrics is a subject of its own:

- The empirical sample sizes in economics are generally small, especially compared with many applications in physics or biology. Typical sample sizes range between observations. In many areas anything below 500 observations is considered a small sample.
- Economic time series are dependent in the sense that they are correlated with other economic time series. In economics, problems are almost never concerned with univariate series. Consumption, as an example, is a function of income, and at the same time consumption also affects income, directly and through various other variables.
- Economic time series are often dependent over time. Many series display high autocorrelation, as well as cross autocorrelation with other variables over time.
- Economic time series are generally non-stationary. Their means and variances change over time, implying that estimated parameters might follow unknown distributions instead of standard tabulated distributions like the normal distribution. Non-stationarity arises from productivity growth and price inflation. Non-stationary economic series appear to be integrated, driven by stochastic trends, perhaps as a result of stochastic changes in total factor productivity. Integrated variables, and in particular the need to model them, are not that common outside economics.
  In some situations, therefore, inference in econometrics becomes quite complicated, and requires the development of new statistical techniques for handling stochastic trends. The concepts of cointegration and common trends, and the recently developed asymptotic theory for integrated variables, are examples of this.
- Economic time series cannot be assumed to be drawn from samples in the way assumed in classical statistics. The classical approach is to start from a population from which a sample is drawn. Since the sampling process can be controlled, the variables which make up the sample can be seen as random variables. Hypotheses are then formulated and tested conditionally on the assumption that the random variables have a specific distribution. Economic time series are seldom random variables drawn from some underlying population in the classical statistical sense. Observations do not represent
  a random sample in the classical statistical sense, because the econometrician cannot control the sampling process of variables. Variables like GDP, money, prices and dividends are given from history. To get a different sample we would have to re-run history, which of course is impossible. The way statistical theory deals with this situation is to reverse the approach taken in classical statistical analysis, and build a model that describes the behavior of the observed data. A model which achieves this is called a well-defined statistical model; it can be understood as a parsimonious, time-invariant model with white noise residuals that makes sense from economic theory.
- Finally, from the view of economics, the subject of statistics deals mainly with the estimation and inference of covariances only. The econometrician, however, must also give estimated parameters an economic interpretation. This problem cannot always be solved ex post, after a model has been estimated. When it comes to time series, economic theory is an integrated part of the modeling process. Given a well-defined statistical model, estimated parameters should represent the behavior of economic agents. Many econometric studies fail because researchers assume that their estimates can be given an economic interpretation without considering the statistical properties of the model, or the simple fact that there is in general not a one-to-one correspondence between observed variables and the concepts defined in economic theory.

2.1 Programs

Here is a list of statistical software that you should be familiar with, please google (those recommended for time series are marked with *):

- *RATS and CATS in RATS: Regression Analysis of Time Series and Cointegration Analysis of Time Series
- *PcGive - Comes highly recommended. Included in OxMetrics modules; see also Timberlake Consultants for more programs.
- *Gretl (Free GNU license, very good for students in econometrics)
- *JMulti (Free for multivariate time series analysis, updated? The discussion forum is quite dead)
- *EViews
- Gauss (good for simulation)
- STATA (used by the World Bank, good for microeconometrics, panel data, OK on time series)
- LIMDEP (Mostly free with some editions of Greene's econometrics textbook? You need to pay for duration models?)
- SAS - Statistical Analysis System (good for big data sets, but not time series, mainly medicine, "the calculus program for decision makers")
- Shazam

And more, some are very special programs for this and that, ... but I don't find them worth mentioning in this context.

Footnote 1: For a recent discussion about the controversies in econometrics see The Economic Journal
There is a bunch of software that allows you to program your own models or use other people's modules:

- Matlab
- R (Free, GNU license, connects with Gretl)
- Ox

You should also know about C, C++, and LaTeX to be a good econometrician. Please google. For Data Envelopment Analysis (DEA) I recommend Tom Coelli's DEAP 2.1 or Paul W. Wilson's FEAR.

2.2 Different types of time series

Given the general definition of time series above, there are many types of time series. The focus in econometrics, macroeconomics and finance is on stochastic time series, typically in the time domain, which are non-stationary in levels but become what is called covariance stationary after differencing. In a broad perspective, time series analysis typically aims at making time series more understandable by decomposing them into different parts. The aim of this introduction is to give a general overview of the subject.

A time series is any sequence ordered by time. The sequence can be either deterministic or stochastic. The primary interest in economics is in stochastic time series, where the sequence of observations is made up by the outcome of random variables. A sequence of stochastic variables ordered by time is called a stochastic time series process. The random variables that make up the process can either be discrete random variables, taking on a given set of integer numbers, or be continuous random variables, taking on any real number between ±∞. While discrete random variables are possible, they are not that common in economic time series research.

Another dimension in modeling time series is to consider processes in discrete time or in continuous time. The principal difference is that stochastic variables in continuous time can take different values at any time. In a discrete time process, the variables are observed at fixed intervals of time (t), and they do not change between these observation points. Discrete time variables are not common in finance and economics.
There are few, if any, variables that remain fixed between their points of observation. The distinction between continuous time and discrete time is not a matter of measurability alone. A common mistake is to be confused by the fact that economic variables are measured at discrete time intervals. The money stock is generally measured and recorded as an end-of-month value. The way of measuring the stock of money does not imply that it remains unchanged between the observation points; instead it changes whenever the money market is open. The same holds for variables like production and consumption. These activities take place 24 hours a day, during the whole year. They are measured as the flow of income and consumption over a period, typically a quarter, representing the integral sum of these activities.

Usually, a discrete time variable is written with a time subscript ($x_t$) while continuous time variables are written as $x(t)$. The continuous time approach has a number of benefits, but the cost and quality of the empirical results seldom motivate the continuous time approach. It is better to use discrete time approaches
as an approximation to the underlying continuous time system. The cost of this simplification is small compared with the complexity of continuous time analysis. This should not be understood as a rejection of all continuous time approaches. Continuous time is good for analyzing a number of well-defined problems like aggregation over time and individuals. In the end it should lead to a better understanding of adjustment speeds, stability conditions and interactions among economic time series, see Sjöö (1990, 1995) (footnote 2).

In addition, stochastic time series can be analysed in the time domain or in the frequency domain. In the time domain the data is analysed ordered in given time periods such as days, weeks, years etc. The frequency approach decomposes time series into frequencies by using trigonometric functions like sines, etc. Spectral analysis is an example of analysis that uses the frequency domain to identify regularities such as seasonal factors, trends, and systematic lags in adjustment etc. The main advantage of analysing time series in the frequency domain is that it is relatively easy to handle continuous time processes and observations observed as aggregations over time, such as consumption. However, in economics and finance we are typically faced with given observations at given frequencies, and we seek to study the behavior of agents operating in real time. Under these circumstances, the time domain is the most interesting road ahead because it has a direct intuitive appeal to both economists and policy makers.
Our interest is usually in analysing discrete time stochastic processes in the time domain. A time series process is generally indicated with brackets, like $\{y_t\}$. In some situations it is necessary to be more precise about the length of the process. Writing $\{y\}_{1}^{\infty}$ indicates that the process starts at period one and continues infinitely. The process consists of random variables because we can view each element in $\{y_t\}$ as a random variable. Let the process go from the integer values 1 up to $T$. If necessary, to be exact, the first variable in the process can be written as $y_{t_1}$, the second variable as $y_{t_2}$, etc., up until $y_{t_T}$. The distribution function of the process can then be written as $F(y_{t_1}, y_{t_2}, \ldots, y_{t_T})$.

Footnote 2: We can also mention the different types of series that are used: stocks, flows and price variables. Stocks are variables that can be observed at a point in time, like the money stock or inventories. Flows are variables that can only be observed over some period, like consumption or GDP.
In this context price variables include prices, interest rates and similar variables which can be observed in a market at a given point in time. Combining these variables into multivariate processes and constructing econometric models from observed variables in discrete time produces further problems, and in general they are quite difficult to solve without using continuous time methods. Usually, careful discrete time models will reduce the problems to a large extent.
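The distinction between stocks (observed at a point in time) and flows (observed over a period) can be illustrated with a small sketch. The "daily" numbers below are invented, a period of four days stands in for a month or quarter, and the helper names `end_of_period` and `period_sum` are hypothetical; the only point is how the two kinds of variables are turned into discrete time observations.

```python
def end_of_period(series, period):
    """Sample a stock variable: keep the last value in each complete period."""
    return [series[i + period - 1]
            for i in range(0, len(series) - period + 1, period)]

def period_sum(series, period):
    """Aggregate a flow variable: sum all values within each complete period."""
    return [sum(series[i:i + period])
            for i in range(0, len(series) - period + 1, period)]

# Invented high-frequency data; the underlying variables change every "day"
money_stock_daily = [100, 102, 101, 105, 107, 104, 103, 108, 110, 109, 111, 115]
consumption_daily = [3, 4, 2, 5, 4, 4, 3, 6, 5, 5, 4, 7]

# A stock is recorded as an end-of-period value ...
stock_obs = end_of_period(money_stock_daily, 4)
# ... while a flow is recorded as the sum (integral) over the period
flow_obs = period_sum(consumption_daily, 4)

print(stock_obs)  # [105, 108, 115]
print(flow_obs)   # [14, 17, 21]
```

Note that the recorded stock values say nothing about the movements within each period, which is why measurement at discrete intervals should not be confused with a variable that is constant between observations.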
In some situations it is necessary to start from the very beginning. A time series is data ordered by time. A stochastic time series is a set of random variables ordered by time. Let $\tilde{Y}_{it}$ represent the stochastic variable $\tilde{Y}_i$ given at time $t$. Observations on this random variable are often indicated as $y_{it}$. In general terms a stochastic time series is a series of random variables ordered by time. A series starting at time $t = 1$ and ending at time $t = T$, consisting of $T$ different random variables, is written as $\{\tilde{Y}_{1,1}, \tilde{Y}_{2,2}, \ldots, \tilde{Y}_{T,T}\}$. Of course, assuming that the series is built up by individual random variables, with their own independent probability distributions, is a complex thought. But nothing in our definition of stochastic time series rules out that the data is made up by completely different random variables. Sometimes, to understand and find solutions to practical problems, it will be necessary to go all the way back to the most basic assumptions.

Suppose we are given a time series consisting of yearly observations of interest rates, {6.6, 7.5, 5.9, 5.4, 5.5, 4.5, 4.3, 4.8}. The first question to ask is: is this a stochastic series, in the sense that these numbers were generated by one stochastic process or perhaps several different stochastic processes? Further questions would be to ask if the process or processes are best represented as continuous or discrete, and whether the observations are independent or dependent. Quite often we will assume that the series are generated by the same identical stochastic process in discrete time. Based on these assumptions the modelling process tries to find systematic historical patterns and cross-correlations with other variables in the data. All time series methods aim at decomposing the series into separate parts in some way.
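As a first, purely descriptive look at whether the eight interest rate observations above are dependent over time, one can compute sample autocorrelations. The sketch below uses the standard sample autocorrelation formula (deviations from the overall mean, divided by the total sum of squared deviations); the function name `sample_autocorr` is a hypothetical helper, and with only eight observations any such estimate is very imprecise.

```python
# Yearly interest rate observations quoted in the text
rates = [6.6, 7.5, 5.9, 5.4, 5.5, 4.5, 4.3, 4.8]

def sample_autocorr(series, lag):
    """Sample autocorrelation at a given lag, using the full-sample mean."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean)
              for t in range(lag, n))
    return num / denom

r1 = sample_autocorr(rates, 1)
r2 = sample_autocorr(rates, 2)
print(f"lag 1: {r1:.2f}, lag 2: {r2:.2f}")
```

A clearly positive first-order autocorrelation is what one would expect if neighbouring observations come from the same persistent process rather than from independent draws, but it does not by itself settle which process generated the data.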
The standard approach in time series analysis is to decompose the series as

$$y_t = T_{t,d} + S_{t,d} + C_{t,d} + I_t,$$

where $T_{t,d}$ and $S_{t,d}$ represent (deterministic) trend and seasonal components, $C_{t,d}$ is the deterministic cyclical component and $I_t$ is a process representing irregular factors (footnote 3). For time series econometrics this definition is limited, since the econometrician is highly interested in the irregular component. As an alternative, let $\{y_t\}$ be a stochastic time series process, which is composed as

$$y_t = \text{systematic components} + \text{unsystematic components} = T_d + T_s + S_d + S_s + \{y_t^{*}\} + e_t, \tag{2.1}$$

where the systematic components include deterministic trends $T_d$, stochastic trends $T_s$, deterministic seasonals $S_d$, stochastic seasonals $S_s$, a stationary process (or the short-run dynamics) $y_t^{*}$, and finally a white noise innovation term $e_t$. The modeling problem can be described as the problem of identifying the systematic components such that the residual becomes a white noise process. For all series, remember that any inference is potentially wrong if not all components have been modeled correctly. This is so regardless of whether we model a simple univariate series with time series techniques, a reduced system, or a structural model. Inference is only valid for a correctly specified model.

Footnote 3: For simplicity we assume a linear process. An alternative is to assume that the components are multiplicative, $x_t = T_{t,d} S_{t,d} C_{t,d} I_t$.

2.3 Repetition - Your First Courses in Statistics and Econometrics

1. To be completed...
In your first course in statistics you learned how to use descriptive statistics: the mean and the variance. Next you learned to calculate the mean and the variance from a sample that represents the whole underlying population. For the mean and the variance to work as a description of the underlying population, it is necessary to construct the sample in such a way that the difference between the sample mean and the true population mean is non-systematic, meaning that the difference between the sample mean and the population mean is unpredictable. This means that your estimated sample mean is a random variable with known characteristics. The most important thing is to construct a sampling mechanism so that the mean calculated from the sample has the characteristics you want it to have. That is, the estimated mean should be unbiased, efficient and consistent. You learned about random variables, probabilities, distribution functions and frequency distributions.

Your first course in econometrics

"A theory should be as simple as possible, but not simpler" Albert Einstein

To be completed...

Random variables, OLS, minimize the sum of squares, assumptions 1-5(6), understanding, multiple regression, multicollinearity, properties of the OLS estimator
Matrix algebra
Tests and solutions for heteroscedasticity (cross-section) and autocorrelation (time series). If you took a good course you should have learned the three golden rules: test, test, test, and learned about the properties of the OLS estimator.
Generalized least squares (GLS)
System estimation: demand and supply models
Further extensions: panel data, Tobit, Heckit, discrete choice, probit/logit, duration
Time series: distributed lag models, partial adjustment models, error correction models, lag structure, stationarity vs. non-stationarity, co-integration

What you need to know... What you probably do not know but should know.

OLS

Ordinary least squares is a common estimation method.
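Suppose there are two series $\{y_t, x_t\}$,

$$y_t = \alpha + \beta x_t + \varepsilon_t.$$

Minimize the sum of squares over the sample $t = 1, 2, \ldots, T$,

$$S = \sum_{t=1}^{T} \varepsilon_t^2 = \sum_{t=1}^{T} (y_t - \alpha - \beta x_t)^2.$$

Take the derivative of $S$ with respect to $\alpha$ and $\beta$, set the expressions to zero, and solve for $\beta$ and $\alpha$:

$$\frac{\partial S}{\partial \beta} = -2 \sum_{t=1}^{T} x_t (y_t - \alpha - \beta x_t) = 0,$$

$$\frac{\partial S}{\partial \alpha} = -2 \sum_{t=1}^{T} (y_t - \alpha - \beta x_t) = 0,$$

$$\hat{\beta} = \frac{\sum_{t=1}^{T} (x_t - \bar{x})(y_t - \bar{y})}{\sum_{t=1}^{T} (x_t - \bar{x})^2}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}.$$

The total sum of squares (TSS) splits into the explained (ESS) and the residual (RSS) sum of squares,

$$TSS = ESS + RSS,$$

$$1 = \frac{ESS}{TSS} + \frac{RSS}{TSS},$$

$$R^2 = 1 - \frac{RSS}{TSS} = \frac{ESS}{TSS}.$$

Basic assumptions

1) $E(\varepsilon_t) = 0$ for all $t$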
2) $E(\varepsilon_t^2) = \sigma^2$ for all $t$
3) $E(\varepsilon_t \varepsilon_{t-k}) = 0$ for all $k \neq 0$
4) $E(X_t \varepsilon_t) = 0$
5) $E(X'X) \neq 0$
6) $\varepsilon_t \sim NID(0, \sigma^2)$

Discuss these properties

Properties: Gauss-Markov, BLUE

Deviations:
Misspecification: adding an extra variable, forgetting a relevant variable
Multicollinearity
Errors-in-variables problem
Homoscedasticity
Heteroscedasticity
Autocorrelation
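The OLS formulas above can be sketched in a few lines of code. The following is a minimal illustration, not part of the original notes: it computes the closed-form estimates and the sums of squares using only the Python standard library. The function names `ols_fit` and `sums_of_squares` are hypothetical, and regressing the interest rate series quoted earlier in this chapter on a simple time index is an assumption made purely to have numbers to work with.

```python
# Closed-form OLS for y_t = alpha + beta * x_t + eps_t (standard library only).

def ols_fit(x, y):
    """Return (alpha_hat, beta_hat) that minimize the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


def sums_of_squares(x, y, alpha_hat, beta_hat):
    """Return (TSS, ESS, RSS) for the fitted regression line."""
    y_bar = sum(y) / len(y)
    fitted = [alpha_hat + beta_hat * xi for xi in x]
    tss = sum((yi - y_bar) ** 2 for yi in y)
    ess = sum((fi - y_bar) ** 2 for fi in fitted)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return tss, ess, rss


# Illustration only: the yearly interest rates from the text, regressed on a
# time index 1, 2, ..., T (the choice of regressor is an assumption).
rates = [6.6, 7.5, 5.9, 5.4, 5.5, 4.5, 4.3, 4.8]
trend = list(range(1, len(rates) + 1))

alpha_hat, beta_hat = ols_fit(trend, rates)
tss, ess, rss = sums_of_squares(trend, rates, alpha_hat, beta_hat)
r_squared = 1 - rss / tss

print(f"alpha_hat = {alpha_hat:.4f}, beta_hat = {beta_hat:.4f}")
print(f"TSS = {tss:.4f}, ESS + RSS = {ess + rss:.4f}, R^2 = {r_squared:.4f}")
```

With an intercept in the regression, TSS and ESS + RSS should agree up to rounding, which is a quick sanity check on any hand-written OLS routine.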
Part I Basic Statistics
3. TIME SERIES MODELING - AN OVERVIEW

Economists are generally interested in a small part of what is normally included in the subject Time Series Analysis. Various techniques such as filtering, smoothing and interpolation, developed for deterministic time series, are of relatively minor interest for economists. Time series econometrics is more focused on the stochastic part of time series. The following is a brief overview of time series modeling from an econometric perspective. It is not a textbook in mathematical statistics, nor is the ambition to be extremely rigorous in the presentation of statistical concepts. The aim is rather to be a guide for the not yet so informed economist who wants to know more about the statistical concepts behind time series econometrics.

When approaching time series econometrics, the statistical vocabulary quickly increases and can become overwhelming. These first two chapters seek to make it possible for people without deeper knowledge in mathematical statistics to read and follow the econometric and financial time series literature.

A time series is simply a set of observations ordered by time. Time series techniques seek to decompose this ordered series into different components, which in turn can be used to generate forecasts, to learn about the dynamics of the series, and to learn how it relates to other series. There are a number of dimensions and decisions to keep track of when approaching this subject. First, the series, or the process, can be univariate or multivariate, depending on the problem at hand. Second, the series can be stochastic or purely deterministic. In the former case a stochastic random process is generating the observations. Third, given that the series is stochastic, with perhaps deterministic components, it can be modeled in the time domain or in the frequency domain. Modeling in the frequency domain implies describing the series in terms of cosine functions of different wavelengths.
This is a useful approach for solving some problems, but not a general approach for economic time series modeling. Fourth, the data generating process and the statistical model can be constructed in continuous or discrete time. Continuous time econometrics is good for some problems but not all. In general it leads to more complex models. A discrete time approach builds on the assumption that the observed data are unchanged between the intervals of observation. This is a convenient approximation that makes modeling easier, but it comes at a cost in the form of aggregation biases. However, in the general case, this is a low cost compared with the costs of general misspecification. A special chapter deals with the discussion of discrete versus continuous time modeling.

The typical economic time series is a discrete stochastic process modeled in the time domain. Time series can be modelled by smoothing and filter techniques. For economists these techniques are generally uninteresting, though we will briefly come back to the concept of filters. The simplest way to model an economic time series is to use autoregressive techniques, or ARIMA techniques in the general case. Most economic time series, however, are better modeled as part of a multivariate stochastic process. Economic theory typically describes systems of economic variables, leading to single-equation transfer functions and systems of equations in a VAR model. These techniques are descriptive; they do not identify structural, or deep, parameters like elasticities, marginal propensities to consume, etc. To estimate more
specific economic models, we turn to techniques such as VECM, SVAR, and structural VECM. What is outlined above is quite different from the typical basic econometric textbook approach, which starts with OLS and ends in practice with GLS as the solution to all problems. Here we will develop methods which first describe the statistical properties of the (joint) series at hand, and then allow the researcher to answer economic questions in such a way that the conclusions are statistically and economically valid. To get there we have to start with some basic statistics.

3.1 Statistical Models

A general definition of statistical time series analysis is that it finds a mathematical model that links observed variables with the stochastic mechanism that generated the data. This sounds abstract, but the purpose of this abstraction is to understand the analytical tools of time series statistics. The practical problem is the following: we have some stochastic observations over time. We know that these observations have been generated by a process, but we do not know what this process looks like. Statistical time series analysis is about developing the tools needed to mimic the unknown data generating process (DGP).

We can formulate some general features of the model. First, it should be a well-defined statistical model in the sense that the assumptions behind the model should be valid for the data chosen. Later we will define more exactly what this implies for an econometric model. For the time being, we can say that the single most important criterion for a model is that the residuals should be a white noise process. Second, the parameters of the model should be stable over time. Third, the model should be simple, or parsimonious, meaning that its functional form should be simple. Fourth, the model should be parameterized in such a way that it is possible to give the parameters a clear interpretation and identify them with events in the real world.
Finally, the model should be able to explain rival models describing the dependent variable(s).

The way to build a well-defined statistical model is to investigate the underlying assumptions of the model in a systematic way. It can easily be shown that t-values, $R^2$, and Durbin-Watson values are not sufficient for determining the fit of a model. In later chapters we will introduce a systematic test procedure.

The final aim of econometric modelling is to learn about economic behavior. To some extent this always implies using some a priori knowledge in the form of theoretical relationships. Economists, in general, have extremely strong a priori beliefs about the size and sign of certain parameters. This way of thinking has led to much confusion, because a priori beliefs can be driven too far. Econometrics is basically about measuring correlations. It is a common misunderstanding among non-econometricians that correlations can be too high or too low, or be deemed right or wrong. Measured correlations are the outcome of the data used, and nothing else. Anyone who thinks of an estimated correlation as wrong must also explain what went wrong in the estimation process, which requires knowledge of econometrics and of the real world.
3.2 Random Variables

The basic reason for dealing with stochastic models rather than deterministic models is that we are faced with random variables. A popular definition of random variables goes like this: a random variable is a variable that can take on more than one value (see footnote 1). For every possible value that a random variable can take on, there is a number between zero and one that describes the probability that the random variable will take on this value. In the following, a random variable is indicated with a tilde, as in $\tilde{X}$.

In statistical terms, a random variable is associated with the outcome of a statistical experiment. All possible outcomes of such an experiment can be called the sample space. If $S$ is a sample space with a probability measure, and if $\tilde{X}$ is a real-valued function defined over $S$, then $\tilde{X}$ is called a random variable.

There are two types of random variables: discrete random variables, which only take on a specific number of real values, and (absolutely) continuous random variables, which can take on any value between $\pm\infty$. It is also possible to examine discontinuous random variables, but we will limit ourselves to the first two types.

If the discrete random variable $\tilde{X}$ can take $k$ values $(x_1, \ldots, x_k)$, the probability of observing a value $x_j$ can be stated as

$$P(x_j) = p_j. \qquad (3.1)$$

Since probabilities of discrete random variables are additive, the probability of observing one of the $k$ possible outcomes is equal to 1.0, or, using the notation just introduced,

$$P(x_1, x_2, \ldots, \text{or } x_k) = p_1 + p_2 + \cdots + p_k = 1. \qquad (3.2)$$

A discrete random variable is described by its probability function, $f(x_i)$, which specifies the probability with which $\tilde{X}$ takes on a certain value. (Summing these probabilities up to and including a given value gives the cumulative distribution function, $F(x_i)$.) In time series econometrics we are in most applications dealing with continuous random variables.
Unlike discrete variables, it is not possible to associate a specific observation with a certain probability, since these variables can take on an infinite range of numbers. The probability that a continuous random variable will take on a certain value is always zero. Because the variable is continuous, we cannot distinguish between, say, 1.01 and a number arbitrarily close to it. This does not mean that the variables do not take on specific values. The outcome of the experiment, or the observation, is of course always a given number. Thus, for a continuous random variable, statements of the probability of an observation must be made in terms of the probability that the random variable $\tilde{X}$ is less than or equal to some specific value. We express this with the distribution function $F(x)$ of the random variable $\tilde{X}$ as follows,

$$F(x) = P(\tilde{X} \leq x) \quad \text{for } -\infty < x < \infty, \qquad (3.3)$$

which states the probability of $\tilde{X}$ taking a value less than or equal to $x$. The continuous analogue of the probability function is called the density function $f(x)$, which we get by differentiating the distribution function with respect to the observations $(x)$,

$$\frac{dF(x)}{dx} = f(x). \qquad (3.4)$$

Footnote 1: Random variables (RV:s) are also called stochastic variables, chance variables, or variates.
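The link between the distribution function and the density in (3.3) and (3.4) can be checked numerically. The sketch below is an illustration added here, assuming a standard normal variable purely as an example: it compares a finite-difference derivative of $F(x)$ with the known density $f(x)$. The helper names `normal_cdf`, `normal_pdf` and `numerical_density` are hypothetical.

```python
import math


def normal_cdf(x):
    """Distribution function F(x) = P(X <= x) of a standard normal variable."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_pdf(x):
    """Density function f(x) of a standard normal variable."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def numerical_density(x, h=1e-5):
    """Central-difference approximation of dF(x)/dx."""
    return (normal_cdf(x + h) - normal_cdf(x - h)) / (2.0 * h)


# The two density columns should agree to several decimals at every point.
for x in (-1.0, 0.0, 0.5, 2.0):
    print(f"x = {x:5.2f}  F(x) = {normal_cdf(x):.6f}  "
          f"f(x) = {normal_pdf(x):.6f}  dF/dx = {numerical_density(x):.6f}")
```

Any other continuous distribution with a known distribution function could be substituted; the normal is used only because its distribution function is available through `math.erf`.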
The fundamental theorem of integral calculus gives us the following expression for the probability that $\tilde{X}$ takes on a value less than or equal to $x$,

$$F(x) = \int_{-\infty}^{x} f(u)\,du. \qquad (3.5)$$

It follows that for any two constants $a$ and $b$, with $a < b$, the probability that $\tilde{X}$ takes on a value in the interval from $a$ to $b$ is given by

$$F(b) - F(a) = \int_{-\infty}^{b} f(u)\,du - \int_{-\infty}^{a} f(u)\,du \qquad (3.6)$$

$$= \int_{a}^{b} f(u)\,du. \qquad (3.7)$$

The term density function is used in a way that is analogous to density in physics. Think of a rod of variable density, measured by the function $f(x)$. To obtain the weight of some given length of this rod, we would have to integrate its density function over the particular part in which we are interested.

Random variables are described by their density function and/or by their moments: the mean, the variance, etc. Given the density function, the moments can be determined exactly. In statistical work, we must first estimate the moments; from the moments we can then learn about the density function. For instance, we can test whether the assumption of an underlying normal density function is consistent with the observed data.

A random variable can be predicted; in other words, it is possible to form an expectation of its outcome based on its density function. Appendix III deals with the expectations operator and other operators related to random variables.

3.3 Moments of random variables

Random variables are characterized by their probability density functions (pdf:s) or their moments. In the previous section we introduced pdf:s. Moments refer to measurements such as the mean, the variance, skewness, etc. If we know the exact density function of a random variable, then we also know the moments. In applied work, we will typically first calculate the moments from a sample, and from the moments figure out the density function of the variables. The term moment originates from physics and the moment of a pendulum.
For our purposes it can be thought of as a general term which includes the definition of concepts like the mean and the variance, without referring to any specific distribution. Starting with the first moment, the mathematical expectation of a discrete random variable is given by

$$E(\tilde{X}) = \sum_{x} x f(x), \qquad (3.8)$$

where $E$ is the expectation operator and $f(x)$ is the value of its probability function at $x$. Thus, $E(\tilde{X})$ represents the mean of the discrete random variable $\tilde{X}$, or, in other words, the first moment of the random variable. For a continuous random variable $\tilde{X}$, the mathematical expectation is

$$E(\tilde{X}) = \int_{-\infty}^{\infty} x f(x)\,dx, \qquad (3.9)$$
where $f(x)$ is the value of its probability density at $x$. The first moment can also be referred to as the location of the random variable. Location is a more generic concept than the first moment or the mean.

The term moments is used in situations where we are interested in the expected value of a function of a random variable, rather than the expectation of the specific variable itself. Say that we are interested in $\tilde{Y}$, whose values are related to $\tilde{X}$ by the equation $y = g(x)$. The expectation of $\tilde{Y}$ is equal to the expectation of $g(\tilde{X})$, since $E(\tilde{Y}) = E[g(\tilde{X})]$. In the continuous case this leads to

$$E(\tilde{Y}) = E[g(\tilde{X})] = \int_{-\infty}^{\infty} g(x) f(x)\,dx. \qquad (3.10)$$

Like density, the term moment, or moment about the origin, has its explanation in physics. (In physics the length of a lever arm is measured as the distance from the origin. Or, if we refer to the example with the rod above, the first moment about the origin, the mean, would correspond to the horizontal center of gravity of the rod.) Reasoning from intuition, the mean can be seen as the midpoint of the limits of the density. The midpoint can be scaled in such a way that it becomes the origin of the x-axis.

The term moments of a random variable is a more general way of talking about the mean and variance of a variable. Setting $g(x)$ equal to $x^r$, we get the r:th moment about the origin,

$$\mu'_r = E(\tilde{X}^r) = \sum_{x} x^r f(x) \qquad (3.11)$$

when $\tilde{X}$ is a discrete variable. In the continuous case we get

$$\mu'_r = E(\tilde{X}^r) = \int_{-\infty}^{\infty} x^r f(x)\,dx. \qquad (3.12)$$

The first moment is nothing else than the mean, or the expected value of $\tilde{X}$, which we write $\mu = \mu'_1$. The second moment, taken about the mean, is the variance. Higher moments give additional information about the distribution and density functions of random variables.

Now, defining $g(\tilde{X}) = (\tilde{X} - \mu)^r$, we get what is called the r:th moment about the mean of the distribution of the random variable $\tilde{X}$. For $r = 0, 1, 2, 3, \ldots$ we get, for a discrete variable,

$$\mu_r = E[(\tilde{X} - \mu)^r] = \sum_{x} (x - \mu)^r f(x), \qquad (3.13)$$

and when $\tilde{X}$ is continuous,

$$\mu_r = E[(\tilde{X} - \mu)^r] = \int_{-\infty}^{\infty} (x - \mu)^r f(x)\,dx. \qquad (3.14)$$

The second moment about the mean, also called the second central moment, is nothing else than the variance of $g(x) = x$,

$$\mathrm{var}(\tilde{X}) = \int_{-\infty}^{\infty} [x - E(\tilde{X})]^2 f(x)\,dx \qquad (3.15)$$

$$= \int_{-\infty}^{\infty} x^2 f(x)\,dx - [E(\tilde{X})]^2 \qquad (3.16)$$

$$= E(\tilde{X}^2) - [E(\tilde{X})]^2, \qquad (3.17)$$

where $f(x)$ is the value of the probability density function of the random variable $\tilde{X}$ at $x$. A more generic expression for the variance is dispersion. We can say that
the second moment, or the variance, is a measure of dispersion, in the same way as the mean is a measure of location.

The third moment, $r = 3$, measures asymmetry around the mean, referred to as skewness. The normal distribution is symmetric around the mean: the likelihood of observing a value above or below the mean is the same for a normal distribution. For a right-skewed distribution, the likelihood of observing a value higher than the mean is higher than that of observing a lower value. For a left-skewed distribution, the likelihood of observing a value below the mean is higher than that of observing a value above the mean.

The fourth moment, referred to as kurtosis, measures the thickness of the tails of the distribution. A distribution with thicker tails than the normal is characterized by a higher likelihood of extreme events compared with the normal distribution. Higher moments give further information about the skewness, tails and the peak of the distribution. The fifth, the seventh moments, etc. give more information about the skewness. Even moments above four give further information about the thickness of the tails and the peak.

3.4 Popular Distributions in Econometrics

In time series econometrics, and financial economics, there is a small set of distributions that one has to know. The following is a list of common distributions:

Distribution                  Notation
Normal distribution           $N(\mu, \sigma^2)$
Log normal distribution       $LogN(\mu, \sigma^2)$
Student t distribution        $St(\nu, \mu, \sigma^2)$
Cauchy distribution           $Ca(\mu, \sigma^2)$
Gamma distribution            $Ga(\nu, \mu, \sigma^2)$
Chi-square distribution       $\chi^2(\nu)$
F distribution                $F(d_1, d_2)$
Poisson distribution          $Pois(\lambda)$
Uniform distribution          $U(a, b)$

The pdf of a normal distribution is written as

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}.$$

The normal distribution is characterized by the following: the distribution is symmetric around its mean, and it is completely described by two moments, the mean and the variance, $N(\mu, \sigma^2)$.
The normal distribution can be standardised to have a mean of zero and a variance of unity (by forming $(x - E(x))/\sigma$), and is then called a standardised normal distribution, $N(0, 1)$. In addition, it follows that the first four moments, the mean, the variance, the skewness and the kurtosis, are $E(\tilde{X}) = \mu$, $Var(\tilde{X}) = \sigma^2$, $Sk(\tilde{X}) = 0$, and $Ku(\tilde{X}) = 3$.

There are random variables that are not normal by themselves but become normal if they are logged. The typical examples are stock prices and various macroeconomic variables. Let $S_t$ be a stock price. The dollar return over a given interval, $R_t = S_t - S_{t-1}$, is not likely to be normally distributed, due to the simple fact that the stock price is rising over time, partly because investors demand a return on their investment but mostly because of inflation. However, if you take the log of the stock price and calculate the per cent return (approximately),
$r_t = \ln S_t - \ln S_{t-1}$, this variable is much more likely to have a normal distribution (or a distribution that can be approximated with a normal distribution). Thus, since you have taken logs of variables in your econometric models, you have already worked with log normal variables. Knowledge about log normal distributions is necessary if you want to model, or better understand, the movements of actual stock prices and dollar returns.

The Student t distribution is similar to the normal distribution: it is symmetric around the mean and it has a variance, but it has thicker tails than the normal distribution. The Student t distribution is described by $(\nu, \mu, \sigma^2)$, where $\mu$ refers to the mean and $\sigma^2$ refers to the variance. The parameter $\nu$ is called the degrees of freedom of the Student t distribution and governs the thickness of the tails. A random variable that follows a Student t distribution will converge to a normal random variable as the degrees of freedom (in practice, the number of observations) go to infinity.

The Cauchy distribution is related to the normal distribution and the Student t distribution. Like the normal it is symmetric and described by two parameters, but it has fatter tails and is therefore better suited for modelling random variables which take on relatively more extreme values than the normal. The setback for empirical work is that its moments are not defined, meaning that it is difficult to use empirical moments to test for a Cauchy distribution against, say, the normal or the Student t distribution.

The gamma and the chi-square distributions are related to variances of normal random variables. If we have a set of normal random variables $\{\tilde{Y}_1, \tilde{Y}_2, \ldots, \tilde{Y}_\nu\}$ and form a new variable as $\tilde{X} = \tilde{Y}_1^2 + \tilde{Y}_2^2 + \cdots + \tilde{Y}_\nu^2$, then this new variable will have a gamma distribution, $\tilde{X} \sim Ga(\nu, \mu, \sigma^2)$. A special case of the gamma distribution is when we have $\mu = 0$ and $\sigma^2 = 1$; the distribution is then called a chi-square distribution, $\chi^2(\nu)$, with $\nu$ degrees of freedom.
Thus, take the square of an estimated regression parameter and divide it by its variance, and you get a chi-square distributed test for the significance of the estimated $\beta$, $(\hat{\beta}/\sigma_{\hat{\beta}})^2 \sim \chi^2(\nu)$, with $\nu = 1$ when a single parameter is tested.

The F distribution comes about when you compare the ratio (or log difference) of two sums of squared normal random variables.

The Poisson distribution is used to model jumps in the data, usually in combination with a geometric Brownian motion (jump diffusion models). The typical example is stock prices that might move up or down drastically. The parameter $\lambda$ measures the probability of a jump in the data.

3.5 Analysing the Distribution

In practical work we need to know the empirical distribution of the variables we are working with in order to make any inference. All empirical distributions can be analysed with the help of their first four moments. Through the first four moments we get information first about the mean and the variance, and second about the skewness and kurtosis. The latter moments are often critical when we decide if a certain empirical distribution should be seen as normal, or at least approximately normal.

It is, of course, extremely convenient to work with the assumption of a normal distribution, since a normal distribution is described by its first two moments only. In finance, the expected return is given by the mean, and the risk of the asset is given by its variance. An approximation to the holding period return of an asset is the log difference of its price. In the case of a normal distribution, there is no need to consider higher moments. Furthermore, linear combinations of
normal variates result in new normally distributed variables. In econometric work, building regression equations, the residual process is assumed to be a normally and independently distributed white noise process, in order to allow for inference and testing.

It is by calculating the sample moments that we learn about the distribution of the series at hand. The most typical problem in empirical work is to investigate how well the distribution of a variable can be approximated with a normal distribution. If the normal distribution is rejected for the residuals in a regression, the typical conclusion is that there is something important missing in the regression equation. The missing part is either an important explanatory variable, or the direct cause of an outlier.

To investigate the empirical distribution we need to calculate the sample moments of the variable. The sample mean of $\{x_t\} = \{x_1, x_2, \ldots, x_T\}$ can be estimated as $\hat{\mu}_x = \bar{x} = (1/T) \sum_{t=1}^{T} x_t$. Higher moments can be estimated with the formula $m_r = (1/T) \sum_{t=1}^{T} (x_t - \bar{x})^r$ (see footnote 2).

If a series is normally distributed, $\tilde{X}_t \sim N(\mu_x, \sigma_x^2)$, subtracting the mean and dividing by the standard deviation leads to a standardised normal variable, distributed as $\tilde{X} \sim N(0, 1)$. For a standardised normal variable the third and fourth moments equal 0 and 3, respectively. The standardised third moment is known as skewness, given as $\sqrt{b_1} = m_3 / m_2^{3/2}$, so that $b_1 = m_3^2 / m_2^3$. A skewness with a negative value indicates a left-skewed distribution, compared with the normal. If the series is the return on an asset, it means that bad or negative surprises dominate over good or positive surprises. A positive value of skewness implies a right-skewed distribution. In terms of asset returns, good or positive surprises are more likely than bad or negative surprises.

The fourth moment, kurtosis, is calculated as $b_2 = m_4 / m_2^2$. A value above 3 implies that the distribution generates more extreme values than the normal distribution. The distribution has fatter tails than the normal.
Referring to asset returns, approximating the distribution with the normal would underestimate the risk associated with the asset.

An asymptotic test, with a null of a normal distribution, is given by (see footnote 3)

$$JB = T \left[ \frac{m_3^2 / m_2^3}{6} + \frac{\left[ (m_4 / m_2^2) - 3 \right]^2}{24} \right] + T \left[ \frac{3 m_1^2}{2 m_2} - \frac{m_3 m_1}{m_2^2} \right] \sim \chi^2(2).$$

This test is known as the Jarque-Bera (JB) test and is the most common test for normality in regression analysis. The null hypothesis is that the series is normally distributed. Let $\mu_1$, $\mu_2$, $\mu_3$ and $\mu_4$ represent the mean, the variance, the skewness and the kurtosis. The null of a normal distribution is rejected if the test statistic is significant. The fact that the test is only valid asymptotically means that we do not know the reason for a rejection in a limited sample. In a less than asymptotic sample, rejection of normality is often caused by outliers. If we think the most extreme value(s) in the sample are non-typical outliers, removing them from the calculation of the sample moments usually results in a non-significant JB test. Removing outliers is ad hoc. It could be that these outliers are typical values of the true underlying distribution.

Footnote 2: For these moments to be meaningful, the series must be stationary. Also, we would like $\{x_t\}$ to be an independent process. Finally, notice that the estimators of the higher moments suggested here are not necessarily efficient estimators.

Footnote 3: This test statistic is for a variable with a non-zero mean. If the variable is adjusted for its mean (say, an estimated residual), the second term should be removed from the expression.
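The sample moments and the normality test discussed in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any particular econometrics package: it uses the commonly quoted form of the Jarque-Bera statistic for a mean-adjusted series, $T[b_1/6 + (b_2 - 3)^2/24]$, and the fact that the $\chi^2(2)$ distribution function is $1 - e^{-x/2}$, so the asymptotic p-value is $e^{-JB/2}$. The function names and the toy sample are hypothetical.

```python
import math


def central_moment(x, r):
    """Sample moment about the mean: m_r = (1/T) * sum((x_t - x_bar)^r)."""
    n = len(x)
    x_bar = sum(x) / n
    return sum((xi - x_bar) ** r for xi in x) / n


def skewness(x):
    """Signed skewness, m_3 / m_2^(3/2); zero for a symmetric sample."""
    return central_moment(x, 3) / central_moment(x, 2) ** 1.5


def kurtosis(x):
    """Kurtosis, m_4 / m_2^2; equals 3 for a normal distribution."""
    return central_moment(x, 4) / central_moment(x, 2) ** 2


def jarque_bera(x):
    """Return (JB, asymptotic p-value) for a mean-adjusted series."""
    t = len(x)
    jb = t * (skewness(x) ** 2 / 6.0 + (kurtosis(x) - 3.0) ** 2 / 24.0)
    p_value = math.exp(-jb / 2.0)  # survival function of chi-square(2)
    return jb, p_value


# Toy sample for illustration only (symmetric, so the skewness is zero).
sample = [-2.0, -1.0, 0.0, 1.0, 2.0]
jb, p = jarque_bera(sample)
print(f"skewness = {skewness(sample):.4f}, kurtosis = {kurtosis(sample):.4f}")
print(f"JB = {jb:.4f}, p-value = {p:.4f}")
```

Since the test is only valid asymptotically, a p-value from a sample this small says very little; the toy data are there only to make the arithmetic easy to follow by hand.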
3.6 Multidimensional Random Variables

We will now generalize the work of the previous sections by considering a vector of $n$ random variables,

$$\tilde{X} = (\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_n), \qquad (3.18)$$

whose elements are continuous random variables with density functions $f(x_1), \ldots, f(x_n)$ and distribution functions $F(x_1), \ldots, F(x_n)$. The joint distribution will look like

$$F(x_1, x_2, \ldots, x_n) = \int_{-\infty}^{x_n} \cdots \int_{-\infty}^{x_1} f(x_1, x_2, \ldots, x_n)\,dx_1 \cdots dx_n, \qquad (3.19)$$

where $f(x_1, x_2, \ldots, x_n)$ is the joint density function. If these random variables are independent, it will be possible to write their joint density as the product of their univariate densities,

$$f(x_1, x_2, \ldots, x_n) = f(x_1) f(x_2) \cdots f(x_n). \qquad (3.20)$$

For these random variables we can define the r:th product moment as

$$E(\tilde{X}_1^{r_1} \tilde{X}_2^{r_2} \cdots \tilde{X}_n^{r_n}) \qquad (3.21)$$

$$= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} x_1^{r_1} x_2^{r_2} \cdots x_n^{r_n} f(x_1, x_2, \ldots, x_n)\,dx_1\,dx_2 \cdots dx_n, \qquad (3.22)$$

which, if the variables are independent, factorizes into the product

$$E(\tilde{X}_1^{r_1})\, E(\tilde{X}_2^{r_2}) \cdots E(\tilde{X}_n^{r_n}). \qquad (3.23)$$

It follows from this result that the variance of a sum of independent random variables is merely the sum of the individual variances,

$$\mathrm{var}(\tilde{X}_1 + \tilde{X}_2 + \cdots + \tilde{X}_n) = \mathrm{var}(\tilde{X}_1) + \mathrm{var}(\tilde{X}_2) + \cdots + \mathrm{var}(\tilde{X}_n). \qquad (3.24)$$

We can extend the discussion of covariance to linear combinations of random variables, say

$$a'\tilde{X} = a_1 \tilde{X}_1 + a_2 \tilde{X}_2 + \cdots + a_p \tilde{X}_p, \qquad (3.25)$$

which leads to

$$\mathrm{var}(a'\tilde{X}) = \mathrm{cov}(a'\tilde{X}, a'\tilde{X}) = \sum_{i=1}^{p} \sum_{j=1}^{p} a_i a_j \sigma_{ij}. \qquad (3.26)$$

These results hold for matrices as well. If we have $\tilde{Y} = A\tilde{X}$, $\tilde{Z} = B\tilde{X}$, and the covariance matrix of $\tilde{X}$ is $\Sigma$, we also have that

$$\mathrm{cov}(\tilde{Y}, \tilde{Y}) = A \Sigma A', \qquad (3.27)$$

$$\mathrm{cov}(\tilde{Z}, \tilde{Z}) = B \Sigma B', \qquad (3.28)$$

and

$$\mathrm{cov}(\tilde{Y}, \tilde{Z}) = A \Sigma B'. \qquad (3.29)$$
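The covariance rules for linear combinations in (3.26) to (3.29) are easy to verify with a small numerical example. The sketch below is an added illustration using plain Python lists as matrices; the covariance matrix `sigma`, the weights `a` and the matrix `A` are made-up numbers, and the helper names are hypothetical.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def mat_mul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_t = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_t] for row in a]


def var_linear_combination(a, sigma):
    """Double sum in (3.26): sum_i sum_j a_i * a_j * sigma_ij."""
    p = len(a)
    return sum(a[i] * a[j] * sigma[i][j] for i in range(p) for j in range(p))


def cov_transformed(A, sigma, B):
    """Covariance between A*X and B*X when cov(X) = sigma: A * sigma * B'."""
    return mat_mul(mat_mul(A, sigma), transpose(B))


# Made-up covariance matrix of (X1, X2): variances 4 and 9, covariance 1.
sigma = [[4.0, 1.0],
         [1.0, 9.0]]

a = [1.0, 1.0]                      # weights of the sum X1 + X2
A = [[1.0, 1.0],                    # first row:  X1 + X2
     [1.0, -1.0]]                   # second row: X1 - X2

print(var_linear_combination(a, sigma))   # var(X1) + var(X2) + 2 cov(X1, X2)
print(cov_transformed(A, sigma, A))       # covariance matrix of (X1 + X2, X1 - X2)
```

Setting the off-diagonal elements of `sigma` to zero reproduces (3.24): the variance of the sum is then just the sum of the variances.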
3.7 Marginal and Conditional Densities

Given a joint density function of $n$ random variables, the joint probability of a subsample of them is called the joint marginal density. We can also talk about joint marginal distribution functions. If we set $n = 3$ we get the joint density function $f(x_1, x_2, x_3)$. Given the marginal distribution $g(x_2, x_3)$, the conditional probability density function of the random variable $\tilde{X}_1$, given that the random variables $\tilde{X}_2$ and $\tilde{X}_3$ take on the values $x_2$ and $x_3$, is defined as

$$\varphi(x_1 \mid x_2, x_3) = \frac{f(x_1, x_2, x_3)}{g(x_2, x_3)}, \qquad (3.30)$$

or

$$f(x_1, x_2, x_3) = \varphi(x_1 \mid x_2, x_3)\, g(x_2, x_3). \qquad (3.31)$$

Of course we can define a conditional density for various combinations of $\tilde{X}_1$, $\tilde{X}_2$ and $\tilde{X}_3$, like $p(x_1, x_3 \mid x_2)$ or $g(x_3 \mid x_1, x_2)$. And, instead of three different variables, we can talk about the density function for one random variable, say $\tilde{Y}_t$, for which we have a sample of $T$ observations. If all observations are independent we get

$$f(y_1, y_2, \ldots, y_T) = f(y_1) f(y_2) \cdots f(y_T). \qquad (3.32)$$

Like before, we can also look at conditional densities, like

$$f(y_t \mid y_1, y_2, \ldots, y_{t-1}), \qquad (3.33)$$

which in this case would mean that $y_t$, the observation at time $t$, is dependent on all earlier observations on $\tilde{Y}_t$.

It is seldom that we deal with independent variables when modeling economic time series. For example, a simple first order autoregressive model like $y_t = \beta y_{t-1} + \epsilon_t$ implies dependence between the observations. The same holds for all time series models. Despite this shortcoming, density functions with independent random variables are still good tools for describing time series modelling, because the results based on independent variables carry over to dependent variables in almost every case.

3.8 The Linear Regression Model - A General Description

In this section we look at the linear regression model starting from two random variables $\tilde{Y}$ and $\tilde{X}$. Two regressions can be formulated,

$$y = \alpha + \beta x + \epsilon, \qquad (3.34)$$

and

$$x = \gamma + \delta y + \nu. \qquad (3.35)$$

Whether one chooses to condition $y$ on $x$, or $x$ on $y$, depends on the parameter of interest. In the following it is shown how these regression expressions are constructed from the correlation between $x$ and $y$ and their first moments, by making use of the (bivariate) joint density function of $x$ and $y$. (One can view this section as an exercise in using density functions.)
Without explicitly stating what the density function looks like, we will assume that we know the joint density function for the two random variables $\tilde{Y}$ and $\tilde{X}$, and want to estimate a set of parameters, $\alpha$ and $\beta$. Hence we have the joint density

$$D(y, x; \Psi), \qquad (3.36)$$

where $\Psi$ is a vector of parameters which describes the relation between $\tilde{Y}$ and $\tilde{X}$. To get the linear regression model above we have to condition on the outcome of $\tilde{X}$, which replaces the joint density with the conditional density

$$D(y \mid x; \theta), \qquad (3.37)$$

where $\theta$ represents the vector of parameters of interest, $\theta = [\alpha, \beta]$. This operation requires that the parameters of interest can be written as a function of the parameters in the joint distribution function, $\theta = f(\Psi)$.

The expected mean of $\tilde{Y}$ for given $\tilde{X}$ is

$$E(\tilde{Y} \mid x, \theta) = \int y\, D(y \mid x, \theta)\,dy = \alpha + \beta x, \qquad (3.38)$$

or, if we choose to condition on $\tilde{Y}$ instead,

$$E(\tilde{X} \mid y, \phi) = \int x\, D(x \mid y, \phi)\,dx = \gamma + \delta y.$$

The parameters in (3.38) can be estimated by using means, variances and covariances of the variables, or, in other terms, by using some of the lower moments of the joint distribution of $\tilde{X}$ and $\tilde{Y}$. Hence, the first step is to rewrite (3.38) in such a way that we can write $\alpha$ and $\beta$ in terms of the means of $\tilde{X}$ and $\tilde{Y}$.

Looking at the LHS of (3.38), it can be seen that a multiplication of the conditional density with the marginal density for $\tilde{X}$, $g(x)$, leads to the joint density. Given the joint density we can choose to integrate out either $x$ or $y$. In this case we choose to integrate over $x$. Thus we have, after multiplication,

$$\int y\, D(y \mid x, \theta)\,dy\; g(x) = \alpha\, g(x) + \beta x\, g(x). \qquad (3.39)$$

Integrating over $x$ leads to, on the LHS,

$$\int\!\!\int y\, D(y \mid x, \theta)\, g(x)\,dy\,dx = \int\!\!\int y\, D(y, x; \Psi)\,dy\,dx = \int y\, D(y; \Psi)\,dy = E(\tilde{Y}; \Psi) = \mu_y. \qquad (3.40)$$

Performing the same operations on the RHS leads to

$$\alpha \int g(x)\,dx + \beta \int x\, g(x)\,dx = \alpha + \beta E(\tilde{X}) = \alpha + \beta \mu_x. \qquad (3.41)$$

If we put the two sides together we get

$$E(\tilde{Y}) = \alpha + \beta E(\tilde{X}), \quad \text{or} \quad \mu_y = \alpha + \beta \mu_x. \qquad (3.42)$$

We now have one equation to solve for the two unknowns.
Since we have used up the means, let us turn to the variances by multiplying both sides of 3.38 with x and performing the same operations again.
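Multiplication with x and $g(x)$ leads to,

$$\int x y\, D(y \mid x; \theta)\, dy\; g(x) = \alpha x\, g(x) + \beta x^2 g(x). \tag{3.43}$$

Integrating over x gives,

$$\int\!\!\int x y\, D(y \mid x; \theta)\, g(x)\, dy\, dx = \alpha \int x\, g(x)\, dx + \beta \int x^2 g(x)\, dx. \tag{3.44}$$

The LHS leads to,

$$\int\!\!\int x y\, D(y, x; \Psi)\, dy\, dx = E(\tilde{X}\tilde{Y}), \tag{3.45}$$

and the RHS,

$$\alpha \int x\, g(x)\, dx + \beta \int x^2 g(x)\, dx = \alpha E(\tilde{X}) + \beta E(\tilde{X}^2). \tag{3.46}$$

Hence our second equation is,

$$E(\tilde{X}\tilde{Y}) = \alpha E(\tilde{X}) + \beta E(\tilde{X}^2). \tag{3.47}$$

Remembering the rules for the expectations operator, $E(\tilde{X}\tilde{Y}) = \mu_x \mu_y + \sigma_{xy}$ and $E(\tilde{X}^2) = \mu_x^2 + \sigma_x^2$, makes it possible to solve for $\alpha$ and $\beta$ in terms of means and variances. From the first equation we get for $\alpha$,

$$\alpha = \mu_y - \beta \mu_x. \tag{3.48}$$

If we substitute this into 3.47, we get

$$E(\tilde{X}\tilde{Y}) = (\mu_y - \beta \mu_x)\mu_x + \beta(\mu_x^2 + \sigma_x^2),$$
$$\mu_x \mu_y + \sigma_{xy} = \mu_x \mu_y - \beta \mu_x^2 + \beta \mu_x^2 + \beta \sigma_x^2, \tag{3.49}$$

which gives

$$\beta = \frac{\sigma_{xy}}{\sigma_x^2}. \tag{3.50}$$

Using these expressions in the linear regression line leads to,

$$E(\tilde{Y} \mid x, \theta) = \mu_y + \frac{\sigma_{xy}}{\sigma_x^2}(x - \mu_x) = \alpha + \beta x, \tag{3.51}$$

or if we choose to condition on $\tilde{Y}$ instead,

$$E(\tilde{X} \mid y, \Phi) = \mu_x + \frac{\sigma_{yx}}{\sigma_y^2}(y - \mu_y) = \gamma + \delta y. \tag{3.52}$$

We can now make use of the correlation coefficient and the $\beta$ parameter in the linear regression. The correlation coefficient between $\tilde{X}$ and $\tilde{Y}$ is defined as,

$$\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \quad \text{or} \quad \rho\, \sigma_x \sigma_y = \sigma_{xy}. \tag{3.53}$$

If we put this into the equations above we get,

$$E(\tilde{Y} \mid x, \theta) = \mu_y + \rho\, \frac{\sigma_y}{\sigma_x}(x - \mu_x), \tag{3.54}$$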
$$E(\tilde{X} \mid y, \Phi) = \mu_x + \rho\, \frac{\sigma_x}{\sigma_y}(y - \mu_y). \tag{3.55}$$

So, if the two variables are independent their covariance is zero, and the correlation is also zero. Therefore, the conditional mean of each variable does not depend on the outcome, mean or variance of the other variable. The final message is that a non-zero correlation between two normal random variables results in a linear relationship between them. With a multivariate model, with more than two random variables, things are more complex.
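The moment expressions (3.48), (3.50) and (3.53) can be checked numerically. The following is a minimal sketch, assuming a hypothetical data-generating process y = 1 + 2x + noise that is not taken from the text; the names `mean`, `cov` and `beta_from_rho` are illustrative only. It computes the regression parameters from sample means, variances and the covariance, and confirms that the slope in (3.50) equals the slope written with the correlation coefficient in (3.54).

```python
import random

random.seed(0)

# Hypothetical data-generating process (an assumption for illustration):
# y = 1.0 + 2.0 * x + noise, with x normal.
T = 5000
x = [random.gauss(3.0, 1.5) for _ in range(T)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 1.0) for xi in x]


def mean(v):
    """Sample mean (first moment)."""
    return sum(v) / len(v)


def cov(u, v):
    """Sample covariance with divisor T; cov(u, u) is the variance."""
    mu_u, mu_v = mean(u), mean(v)
    return sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v)) / len(u)


mu_x, mu_y = mean(x), mean(y)
var_x, var_y, cov_xy = cov(x, x), cov(y, y), cov(x, y)

beta = cov_xy / var_x                      # equation (3.50)
alpha = mu_y - beta * mu_x                 # equation (3.48)
rho = cov_xy / (var_x ** 0.5 * var_y ** 0.5)       # equation (3.53)
beta_from_rho = rho * var_y ** 0.5 / var_x ** 0.5  # slope in (3.54)

print(f"alpha = {alpha:.3f}, beta = {beta:.3f}, rho = {rho:.3f}")
print(f"slope via rho = {beta_from_rho:.3f}")
```

The two slope expressions agree up to floating point error, because substituting (3.53) into (3.50) is an identity; the estimates of α and β only approximate the assumed values 1 and 2 because of sampling noise.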
4. THE METHOD OF MAXIMUM LIKELIHOOD

There are two fundamental approaches to estimation in econometrics, the method of moments and the maximum likelihood method. The difference is that the moments estimator deals with estimation without choosing a specific density function a priori. The maximum likelihood estimator (MLE), on the other hand, requires that a specific density function is chosen from the beginning. Asymptotically there is no difference between the two approaches. The MLE is more general, and is the basis for all the various tests applied in practical modeling. In this section we will focus on MLE exclusively because of its central role.

The principles of MLE were developed early, but for a long time it was considered mainly a theoretical device, with limited practical use. The progress in computer capacity has changed this. Many presentations of the MLE are too complex for students below the advanced graduate level. The aim of this chapter is to change this. The principle of ML is not different from OLS. The way to learn MLE is to start with the simplest case, the estimation of the mean and the variance of a single normal random variable. In the next step, it is easy to show how the parameters of a simple linear regression model can be found, and tested, using the techniques of MLE. In the third step, we can analyse how the parameters of any density function can be estimated. Finally, it is often interesting to study the bivariate joint normal density function. This last exercise is good for understanding when certain variables can be treated as exogenous. The general idea is that after viewing how a single random variable can be replaced by a function of random variables, it becomes obvious how a multivariate non-linear system of variables can be estimated.

Let us start with a single stochastic time series. The first moment, or the sample mean, of the random process $\tilde{X}_t$ with the observations $(x_1, x_2, \ldots, x_T)$ is found as $\bar{x} = \sum_{t=1}^{T} x_t / T$.
By using this technique we simply calculate a number that we can use to describe one characteristic of the process $\tilde{X}_t$. In the same way we can calculate the second moment around the mean, etc. In the long run, and for a stationary variable, we can use the central limit theorem (CLT) to argue that the sample moments computed from $(x_1, x_2, \ldots, x_T)$ are approximately normally distributed, which allows us to test for significance etc.

4.1 MLE for a Univariate Process

The MLE approach starts from a random variable $\tilde{X}_t$, and a sample of T independent observations $(x_1, x_2, \ldots, x_T)$. The joint density function is

$$f(x_1, x_2, \ldots, x_T; \theta) = f(x; \theta) = \prod_{t=1}^{T} f(x_t; \theta). \tag{4.1}$$

To describe this process there are k parameters, $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$, so we write the density function as,

$$f(x; \theta) \tag{4.2}$$
where $(x; \theta)$ indicates that it is the shape of the density, described by the parameters, which gives us the sample. If the density function describes a normal distribution, $\theta$ would consist of two parameters, the mean and the variance. Now, suppose that we know the functional form of the density function. If we also have a sample of observations on $\tilde{X}_t$, we can ask which estimates of $\theta$ would be the most likely to find, given the functional form of the density and given the observations. Viewing the density in this way amounts to asking which values of $\theta$ maximize the value of the density function. Formulating the estimation problem in this way leads to a restatement of the density function in terms of a likelihood function,

$$L(\theta; x), \tag{4.3}$$

where the parameters are seen as a function of the sample. It is often convenient to work with the log of the likelihood instead, leading to the log likelihood

$$\log L(\theta; x) = l(\theta; x). \tag{4.4}$$

What is left is to find the maximum of this function with respect to the parameters in $\theta$. The maximum, if it exists, is found by solving the system of k simultaneous equations,

$$\frac{\partial l(\theta; x)}{\partial \theta_i} = 0, \tag{4.5}$$

for $\theta$, which will be the maximum likelihood estimates $\hat{\theta}$, provided that $D^2 l(\theta; x)$ is a negative definite matrix. In matrix form this expression is also known as the score matrix, or the efficient score for $\theta$, which can be written as,

$$\frac{\partial l(\theta; x)}{\partial \theta} = S(\theta), \tag{4.6}$$

such that the matrix of the efficient score is zero at the maximum. The matrix of the expected second order expressions, with the sign reversed, is known as the information matrix

$$-E\left[\frac{\partial^2 l(\theta; x)}{\partial \theta\, \partial \theta'}\right] = I(\theta). \tag{4.7}$$

The information matrix plays an important role in demonstrating that ML estimators asymptotically attain the Cramer-Rao lower bound, and in the derivation of the so-called classical test statistics associated with the ML estimator.
It can be shown, under quite general conditions, that the variances of the estimated parameters from above ($\hat{\theta}$) are given by the inverse of the information matrix,

$$\mathrm{var}(\hat{\theta}) = [I(\theta)]^{-1}. \tag{4.8}$$

So far we have not assigned any specific distribution to the density function. Let us assume a sample of T independent normal random variables $\{\tilde{X}_t\}$. The normal distribution is particularly easy to work with since it only requires two parameters to describe it. We want to estimate the first two moments, the mean $\mu$ and the variance $\sigma^2$, thus $\theta = (\mu, \sigma^2)$. The likelihood is,

$$L(\theta; x) = \left[2\pi\sigma^2\right]^{-T/2} \exp\left[-\frac{1}{2\sigma^2}\sum_{t=1}^{T}(x_t - \mu)^2\right]. \tag{4.9}$$

Taking logs of this expression yields,

$$l(\theta; x) = -(T/2)\log 2\pi - (T/2)\log \sigma^2 - (1/2\sigma^2)\sum_{t=1}^{T}(x_t - \mu)^2. \tag{4.10}$$
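The log likelihood (4.10) is easy to evaluate directly. The sketch below assumes a hypothetical sample drawn from N(5, 4); the function name `loglik` and the perturbation grid are illustrative choices, not part of the text. It evaluates (4.10) at the closed-form candidates, the sample mean and the sample variance with divisor T, and at a few neighbouring parameter values, to show that the candidates give the highest value.

```python
import math
import random

random.seed(1)

# Hypothetical sample (an assumption for illustration): 200 draws from N(5, 2^2).
x = [random.gauss(5.0, 2.0) for _ in range(200)]
T = len(x)


def loglik(mu, sigma2, data):
    """Normal log likelihood, equation (4.10)."""
    n = len(data)
    ss = sum((xi - mu) ** 2 for xi in data)
    return (-(n / 2) * math.log(2 * math.pi)
            - (n / 2) * math.log(sigma2)
            - ss / (2 * sigma2))


# Closed-form candidates: sample mean and variance with divisor T.
mu_hat = sum(x) / T
sigma2_hat = sum((xi - mu_hat) ** 2 for xi in x) / T
l_max = loglik(mu_hat, sigma2_hat, x)

# Evaluate the log likelihood at perturbed parameter values.
neighbours = [
    loglik(mu_hat + dm, sigma2_hat * ds, x)
    for dm in (-0.2, 0.0, 0.2)
    for ds in (0.8, 1.0, 1.25)
    if not (dm == 0.0 and ds == 1.0)
]

print(f"log likelihood at candidates: {l_max:.3f}")
print(f"best neighbouring value:      {max(neighbours):.3f}")
```

A grid comparison like this is only a check, not an estimation method; in practice the maximum is found analytically, as for the normal case, or by a numerical optimizer.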
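The partial derivatives with respect to $\mu$ and $\sigma^2$ are,

$$\frac{\partial l}{\partial \mu} = \frac{1}{\sigma^2}\sum_{t=1}^{T}(x_t - \mu), \tag{4.11}$$

and,

$$\frac{\partial l}{\partial \sigma^2} = -(T/2\sigma^2) + (1/2\sigma^4)\sum_{t=1}^{T}(x_t - \mu)^2. \tag{4.12}$$

If these equations are set to zero, the result is,

$$\sum_{t=1}^{T} x_t - T\mu = 0, \tag{4.13}$$

$$\sum_{t=1}^{T}(x_t - \mu)^2 - T\sigma^2 = 0. \tag{4.14}$$

If this system is solved for $\mu$ and $\sigma^2$ we get the estimates of the mean and the variance as

$$\hat{\mu}_x = \frac{1}{T}\sum_{t=1}^{T} x_t, \tag{4.15}$$

$$\hat{\sigma}_x^2 = \frac{1}{T}\sum_{t=1}^{T}(x_t - \hat{\mu}_x)^2 = \frac{1}{T}\left[\sum_{t=1}^{T} x_t^2 - \frac{1}{T}\left(\sum_{t=1}^{T} x_t\right)^2\right]. \tag{4.16}$$

[Footnote 1: The solution is given by

$$\frac{1}{T}\sum_{t=1}^{T}[x_t - \mu]^2 = \frac{1}{T}\sum_{t=1}^{T}\left[x_t^2 + \mu^2 - 2x_t\mu\right] = \frac{1}{T}\sum_{t=1}^{T} x_t^2 + \frac{1}{T}T\mu^2 - \frac{2}{T}\mu\sum_{t=1}^{T} x_t$$
$$= \frac{1}{T}\sum_{t=1}^{T} x_t^2 + \frac{1}{T^2}\left[\sum_{t=1}^{T} x_t\right]^2 - \frac{2}{T^2}\sum_{t=1}^{T} x_t \sum_{t=1}^{T} x_t = \frac{1}{T}\sum_{t=1}^{T} x_t^2 - \frac{1}{T^2}\left[\sum_{t=1}^{T} x_t\right]^2.]$$

Do these estimates of $\mu$ and $\sigma^2$ really represent the maximum solution of the likelihood function? To answer that question we have to look at the sign of the Hessian of the log likelihood function, the second order conditions, evaluated at the estimated values of the parameters in $\theta$,

$$D^2 l(\theta; x) = \begin{pmatrix} \dfrac{\partial^2 l}{\partial\mu\,\partial\mu} & \dfrac{\partial^2 l}{\partial\mu\,\partial\sigma^2} \\[2ex] \dfrac{\partial^2 l}{\partial\sigma^2\,\partial\mu} & \dfrac{\partial^2 l}{\partial\sigma^2\,\partial\sigma^2} \end{pmatrix} = \begin{pmatrix} -\dfrac{T}{\sigma^2} & -\dfrac{1}{\sigma^4}\sum(x_t - \mu) \\[2ex] -\dfrac{1}{\sigma^4}\sum(x_t - \mu) & \dfrac{T}{2\sigma^4} - \dfrac{1}{\sigma^6}\sum(x_t - \mu)^2 \end{pmatrix}. \tag{4.17}$$

If we substitute from the solutions of the estimates of $\mu$ and $\sigma^2$, we get,

$$D^2 l(\hat{\theta}; x) = -\begin{pmatrix} \dfrac{T}{\hat{\sigma}_x^2} & 0 \\[2ex] 0 & \dfrac{T}{2\hat{\sigma}_x^4} \end{pmatrix} = -I(\hat{\theta}). \tag{4.18}$$

Since the variance $\hat{\sigma}_x^2$ is always positive we have a negative definite matrix, and a maximum value for the function at $\hat{\mu}_x$ and $\hat{\sigma}_x^2$. It remains to investigate whether the estimates are unbiased. Therefore, replace the observations, in the solutions for $\mu$ and $\sigma_x^2$, by the random variable $\tilde{X}$ and take expectations. The expected value of the mean is,

$$E(\hat{\mu}_x) = \frac{1}{T}\sum_{t=1}^{T} E(\tilde{X}_t) = \frac{1}{T}\sum_{t=1}^{T}\mu = \frac{1}{T}T\mu = \mu, \tag{4.19}$$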
which proves that $\hat{\mu}_x$ is an unbiased estimate of the mean. The calculations for the variance are a bit more complex, but the idea is the same. The expected variance is,

$$E[\hat{\sigma}_x^2] = \frac{1}{T}E\left[\sum_{t=1}^{T}\tilde{X}_t^2 - \frac{1}{T}\left(\sum_{t=1}^{T}\tilde{X}_t\right)^2\right]$$
$$= \frac{1}{T}\left[\sum_{t=1}^{T}E(\tilde{X}_t^2) - \frac{1}{T}E\left(\sum_{t=1}^{T}\sum_{s=1}^{T}\tilde{X}_t\tilde{X}_s\right)\right]$$
$$= \frac{1}{T}\left[T E(\tilde{X}_t^2) - E(\tilde{X}_t^2) - \frac{1}{T}E\left(\sum_{t \neq s}\tilde{X}_t\tilde{X}_s\right)\right]$$
$$= \frac{1}{T}\left[(T-1)E(\tilde{X}_t^2) - \frac{1}{T}T(T-1)[E(\tilde{X}_t)]^2\right] = \frac{T-1}{T}\sigma^2. \tag{4.20}$$

Thus, $\hat{\sigma}^2$ is not an unbiased estimate of $\sigma^2$. The bias factor $(T-1)/T$ goes to one, so the bias itself, $-\sigma^2/T$, goes to zero as $T \to \infty$. This is a typical result from MLE: the mean is correct but the variance is biased. To get an unbiased estimate we need to correct the estimate in the following manner,

$$s^2 = \frac{T}{T-1}\hat{\sigma}^2 = \frac{1}{T-1}\left[\sum_{t=1}^{T} x_t^2 - \frac{1}{T}\left(\sum_{t=1}^{T} x_t\right)^2\right]. \tag{4.21}$$

The correction involves multiplying the estimated variance with $T/(T-1)$.

4.2 MLE for a Linear Combination of Variables

We have derived the maximum likelihood estimates for a single independent normal variable. How does this relate to a linear regression model? Earlier, when we discussed the moments of a variable, we showed how it was possible, as a general principle, to substitute a random variable with a function of the variable. The same reasoning applies here. Say that $\tilde{X}$ is a function of two other random variables $\tilde{Y}$ and $\tilde{Z}$. Assume the linear model

$$y_t = \beta z_t + x_t, \tag{4.22}$$

where $\tilde{Y}$ is a random variable, with observations $\{y_t\}$, and $z_t$ is, for the time being, assumed to be a deterministic variable. (This is not a necessary assumption.) Instead of using the symbol x for observations on the random variable $\tilde{X}$, let us set $x_t = \epsilon_t$ where $\epsilon_t \sim NID(0, \sigma^2)$. Thus, we have formulated a linear regression model with a white noise residual. This linear equation can be rewritten as,

$$\epsilon_t = y_t - \beta z_t, \tag{4.23}$$

where the RHS is the function to be substituted for the single normal variable $x_t$ used in the MLE example above. The algebra gets a bit more complicated but the principal steps are the same.
[Footnote 2: As a consequence of the more complex algebra, the computer algorithms for estimating the parameters will also get more complex. For the ordinary econometrician there are a lot of software packages that cover most of the cases.]

The unknown parameters in this case are $\beta$ and $\sigma_\epsilon^2$. The log likelihood function will now look like,

$$l(\beta, \sigma_\epsilon^2; y, z) = -(T/2)\log 2\pi - (T/2)\log \sigma_\epsilon^2 - (1/2\sigma_\epsilon^2)\sum_{t=1}^{T}(y_t - z_t\beta)^2. \tag{4.24}$$

The last factor in this expression can be identified as the sum of squares function, $S(\beta)$. In matrix form we have,

$$S(\beta) = \sum_{t=1}^{T}(y_t - z_t\beta)^2 = (Y - Z\beta)'(Y - Z\beta) \tag{4.25}$$

and

$$l(\beta, \sigma_\epsilon^2; y, z) = -(T/2)\log 2\pi - (T/2)\log \sigma_\epsilon^2 - (1/2\sigma_\epsilon^2)(Y - Z\beta)'(Y - Z\beta). \tag{4.26}$$

Differentiation of $S(\beta)$ with respect to $\beta$ yields

$$\frac{\partial S}{\partial \beta} = -2Z'(Y - Z\beta), \tag{4.27}$$

which, if set to zero, solves to

$$\hat{\beta} = (Z'Z)^{-1}(Z'Y). \tag{4.28}$$

Notice that the ML estimator of the linear regression model is identical to the OLS estimator. The variance estimate is,

$$\hat{\sigma}_\epsilon^2 = \hat{\epsilon}'\hat{\epsilon}/T, \tag{4.29}$$

which in contrast to the OLS estimate is biased. To obtain these estimates we did not have to make any direct assumptions about the distribution of $y_t$ or $z_t$. The necessary and sufficient condition is that $y_t$ conditional on $z_t$ is normal, which means that $y_t - \beta z_t = \epsilon_t$ should follow a normal distribution. This is the reason why MLE is feasible even though $y_t$ might be a dependent AR(p) process. In the AR(p) process the residual term is an independent normal random variable. The MLE is given by substitution of the independently distributed normal variable with the conditional mean of $y_t$.

The above results can be extended to a vector of normal random variables. In this case we have a multivariate normal distribution, where the density is

$$D(X) = D(X_1, X_2, \ldots, X_n). \tag{4.30}$$

The random variables X will have a mean vector $\mu$ and a covariance matrix $\Sigma$. The density function for the multivariate normal is,

$$D(X) = \left[(2\pi)^{n/2}|\Sigma|^{1/2}\right]^{-1}\exp\left[-(1/2)(X - \mu)'\Sigma^{-1}(X - \mu)\right], \tag{4.31}$$

which can be expressed in a compact form $X \sim N(\mu, \Sigma)$. With multivariate densities it is possible to handle systems of equations with stochastic variables, the typical case in econometrics. The bivariate normal is an often used device to derive models including 2 variables.
Set $X = (\tilde{X}_1, \tilde{X}_2)$, and

$$\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{bmatrix} \quad \text{with} \quad |\Sigma| = \sigma_1^2\sigma_2^2(1 - \rho^2), \tag{4.32}$$

where $\rho$ is the correlation coefficient. As can be seen $|\Sigma| > 0$ unless $\rho^2 = 1$. If $\sigma_{12} = \sigma_{21} = 0$, the two processes are independent and can be estimated individually without losing any important information. In principle, if $\sigma_{12} = \sigma_{21} \neq 0$, the two equations are dependent, and it will be necessary to estimate a complete system of equations to get correct estimates, which are unbiased and efficient.

A disadvantage with MLE is that the variance estimate is biased. This, however, is only a small sample effect. It can be shown that as T goes to infinity the bias disappears. Hence, the MLE is an asymptotically efficient estimator. Furthermore, it can also be shown that MLE behaves asymptotically nicely even if we drop the assumption of normally independently distributed residuals. The estimates will tend towards those given by NID errors. This situation is referred to as quasi maximum likelihood.

The advantages are easy to see. MLE offers a general approach to the estimation of econometric models. These models can be quite complex; non-linearity, moving average residuals and so on can be handled by MLE. Consequently there exists a large literature on MLE. In principle this literature is not difficult. The main problem for our understanding of the use of MLE in different situations lies in our understanding of matrix algebra.
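The small sample bias of the ML variance estimate in the regression model can be illustrated by simulation. The sketch below assumes a hypothetical model y_t = 0.5 z_t + ε_t with a deterministic regressor, T = 10 and unit error variance; none of these numbers come from the text. With one estimated slope parameter, the expected value of the residual sum of squares divided by T is σ²(T-1)/T, so the average ML variance estimate should be close to 0.9, and close to one after multiplying by T/(T-1).

```python
import random

random.seed(2)

# Hypothetical design (assumptions for illustration).
T, beta_true, sigma = 10, 0.5, 1.0
z = [float(t) for t in range(1, T + 1)]   # deterministic regressor
szz = sum(zt * zt for zt in z)            # Z'Z for a single regressor
reps = 5000

beta_hats, ml_vars = [], []
for _ in range(reps):
    y = [beta_true * zt + random.gauss(0.0, sigma) for zt in z]
    # ML/OLS slope: (Z'Z)^{-1} Z'Y with one regressor and no intercept.
    beta_hat = sum(zt * yt for zt, yt in zip(z, y)) / szz
    ssr = sum((yt - beta_hat * zt) ** 2 for zt, yt in zip(z, y))
    beta_hats.append(beta_hat)
    ml_vars.append(ssr / T)               # ML variance estimate, divisor T

mean_beta = sum(beta_hats) / reps
mean_ml_var = sum(ml_vars) / reps
mean_corrected = mean_ml_var * T / (T - 1)   # degrees of freedom correction

print(f"mean slope estimate:         {mean_beta:.3f}")
print(f"mean ML variance estimate:   {mean_ml_var:.3f}")
print(f"mean corrected variance:     {mean_corrected:.3f}")
```

The slope estimate is centred on the assumed value, while the uncorrected variance estimate is centred below the true variance; increasing T in the sketch shrinks the gap, in line with the asymptotic argument above.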
5. THE CLASSICAL TESTS - WALD, LM AND LR TESTS

(To be completed: add a figure of a normally distributed variable with the value of the likelihood function (L) on the vertical axis and the parameter value on the horizontal axis, with $\hat{\theta}$ indicating the maximum value of L.)

There are three approaches to testing a statistical model. The first is to start with an unrestricted model and impose restrictions on the estimated model. The second approach is to impose the restrictions prior to estimation, and estimate a restricted model. The test is then performed by asking if the restriction should be lifted. The third approach is to test for significant differences between an estimated restricted model and an estimated unrestricted model. The last approach involves estimating two models, rather than one. The three approaches of testing are named

Wald tests (W) - estimate an unrestricted model.
Lagrange Multiplier tests (LM) - estimate a restricted model.
Likelihood Ratio tests (LR) - estimate both the unrestricted and the restricted models.

A test is labeled Wald, Lagrange Multiplier or Likelihood Ratio depending on how it is constructed. A typical Wald test is the t-test for significance. A Lagrange multiplier test is the LM test of autocorrelation. Finally, the F-test for testing the significance of one or several parameters in a group represents a typical Likelihood Ratio test.

Imagine a figure of a normal density function, with the shape of a normal random variable centered around its (true) mean. On the vertical axis put the value of the likelihood function. The maximum is given by the peak of the distribution. Let the horizontal axis represent the estimated mean. The true mean is indicated by the peak of the normal distribution. The LR test is based on a comparison of likelihood values. If a restriction, which is imposed on the unrestricted model, is valid, the value of the likelihood should not be reduced significantly.
This test is based on two estimations, one unrestricted giving the value of the likelihood $\hat{L}_U$ and one restricted leading to $\hat{L}_R$. From these two values the likelihood ratio is defined as,

$$\lambda = \frac{\hat{L}_R}{\hat{L}_U}. \tag{5.1}$$

This leads to the test statistic $(-2\ln\lambda)$ which has a $\chi^2(R)$ distribution, where R is the number of restrictions.

The Wald test compares (squared) estimated parameters with their variances. In a linear regression, if the residual is $NID(0, \sigma^2)$, then $\hat{\beta} \sim N(\beta, \mathrm{var}(\hat{\beta}))$, so $(\hat{\beta} - \beta) \sim N(0, \mathrm{var}(\hat{\beta}))$, and a standard t-test will tell if $\beta$ is significant or not. More generally, if we have a vector of J normally distributed random variables $\hat{X} \sim N_J(\mu, \Sigma)$, then we have

$$(x - \mu)'\Sigma^{-1}(x - \mu) \sim \chi^2(J). \tag{5.2}$$
The LM test starts from a restricted model and tests if the restrictions are valid. Here restrictions should be understood as a general concept. A model is restricted if it assumes homoscedasticity, no autocorrelation, etc. The test is formulated as,

$$LM = \left[\frac{\partial \ln L(\hat{\theta}_R)}{\partial \hat{\theta}_R}\right]'\left[I(\hat{\theta}_R)\right]^{-1}\left[\frac{\partial \ln L(\hat{\theta}_R)}{\partial \hat{\theta}_R}\right]. \tag{5.3}$$

The formula looks complex but is in many cases extremely easy to apply. Consider the LM test for p:th order autocorrelation in the residuals $\hat{\epsilon}_t$,

$$\hat{\epsilon}_t = \alpha_0 + \alpha_1\hat{\epsilon}_{t-1} + \alpha_2\hat{\epsilon}_{t-2} + \ldots + \alpha_p\hat{\epsilon}_{t-p} + \eta_t. \tag{5.4}$$

The LM test for testing if the parameters $\alpha_1$ to $\alpha_p$ are zero amounts to estimating the equation with OLS and calculating the test statistic $TR^2$, distributed as $\chi^2(p)$ under the null of no autocorrelation. Similar tests can be formulated for testing various forms of heteroscedasticity.

Tests can often be formulated in such a way that they follow both $\chi^2$ and F distributions. In samples that are not large, the general rule for choosing among tests based on the F or the $\chi^2$ distribution is to use the F distribution, since it has better small sample properties.

If the information matrix is known (meaning that it is not necessary to estimate it), all three tests would lead to the same test statistic, regardless of the chosen distribution, $\chi^2$ or F. If all three approaches lead to the same test statistic, we would have $R_W = R_{LR} = R_{LM}$. However, when the information matrix is estimated we get the following relation between the tests, $R_W \geq R_{LR} \geq R_{LM}$.

Remember (1) that when dealing with limited samples the three tests might lead to different conclusions, and (2) that if the null is rejected the alternative can never be accepted. As a matter of principle, statistical tests only reject the null hypothesis. Rejection of the null does not lead to accepting the alternative hypothesis; it leads only to the formulation of a new null.
As an example, in a test where the null hypothesis is homoscedasticity, the alternative is not necessarily heteroscedasticity. Tests are generally derived on the assumption that everything else is OK in the model. Thus, in this example, rejection of homoscedasticity could be caused by autocorrelation, non-normality, etc. The econometrician has to search for all possible alternatives.
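The ordering of the three statistics can be seen in the simplest possible case. The sketch below assumes a hypothetical normal sample and the single restriction H0: μ = 0, with the variance estimated by ML under each model; the variable names are illustrative. In this case the Wald statistic uses the unrestricted variance estimate, the LM statistic uses the restricted one, and the LR statistic is T times the log of their ratio, which follows from inserting the ML estimates into the normal log likelihood.

```python
import math
import random

random.seed(3)

# Hypothetical sample (an assumption for illustration): 50 draws from N(0.3, 1).
x = [random.gauss(0.3, 1.0) for _ in range(50)]
T = len(x)
mu0 = 0.0                                   # restriction under H0: mu = mu0

x_bar = sum(x) / T
s2_u = sum((xi - x_bar) ** 2 for xi in x) / T   # unrestricted ML variance
s2_r = sum((xi - mu0) ** 2 for xi in x) / T     # restricted ML variance

# Wald: squared distance from the null, scaled by the unrestricted variance.
wald = T * (x_bar - mu0) ** 2 / s2_u
# LR: -2 ln(L_R / L_U), which reduces to T * ln(s2_r / s2_u) here.
lr = T * math.log(s2_r / s2_u)
# LM: squared score at the restricted estimate times the inverse information.
lm = T * (x_bar - mu0) ** 2 / s2_r

print(f"Wald = {wald:.3f}, LR = {lr:.3f}, LM = {lm:.3f}")
```

Each statistic would be compared with the χ²(1) distribution, whose 5% critical value is 3.84. Writing d for (x̄ - μ0)²/s2_u, the three statistics are T·d, T·ln(1 + d) and T·d/(1 + d), which makes the ordering Wald ≥ LR ≥ LM hold for every sample in this example.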
Part II

Time Series Modeling
6. RANDOM WALKS, WHITE NOISE AND ALL THAT

6.1 Different types of processes

This section looks at different types of stochastic time series processes that are important in economics and finance. A time series is a series where the data are ordered by time. A random variable ($\tilde{X}$) is a variable which can take on more than one value, and for each value it can take on there is a number between zero and one that describes the probability of observing that value. We distinguish between discrete and continuous random variables. Discrete random variables can only take on a finite (or countable) number of outcomes. A continuous random variable can take on any value between $-\infty$ and $+\infty$. The mathematical model of the probabilities associated with a random variable is given by the distribution function $F(x)$, $F(x) = P(\tilde{X} \leq x)$. If we have a continuous random variable, we can define the probability density function of the random variable as $f(x) = dF(x)/dx$.

Random variables are characterized by their probability functions and their moments. First, second, third and fourth moments all describe the characteristics of a random variable. By estimating these we describe a random variable. All moments have direct implications for risk and return decisions. Mean = return, variance = risk, while skewness and kurtosis imply deviations from the normal distribution and might affect behavior. To be completed.

A stochastic time series process is then made up of a random variable that over time can take on more than one value. We denote a stochastic process as $\{\tilde{X}_t\}_0^T$, indicating that it starts at time zero and continues to time T. To define a stochastic time series process we start with the random variable ($\tilde{X}_{t_i}$), which at time t can take on different values at the future periods $i = 1, 2, 3, \ldots, n$, where n might go to infinity. Often we will talk about the conditional expectation of ($\tilde{X}_t$); we want to estimate the most likely future value, given the information we have today.
A stochastic time series process can be discrete or continuous. A discrete series only changes value at discrete time periods, while a continuous process changes, or can potentially change, value continuously and not only at discrete time intervals. The conditional expectation is written as $E(\tilde{X}_{t+1} \mid I_t)$ or $E_t(\tilde{X}_{t+1})$.

To formalize the use of conditional expectations, assume a probability space $(\Omega, \mathcal{F}, P)$, where $\Omega$ is the total sample space (or possible states of the world), $\mathcal{F}$ denotes the tribe of subsets of $\Omega$ that are outcomes (observations), and P is a probability measure associated with the outcomes. A very practical question in modeling is whether there exists a simple mathematical form for associating outcomes with probabilities. Usually we will refer to the tribe of subsets $\mathcal{F}$ as the information set $I_t$. We will assume that decision makers do not forget, so the information set is increasing over time, $I_{t_0} \subseteq I_{t_1} \subseteq \ldots \subseteq I_{t_k} \subseteq I_{t_{k+1}} \ldots$ In a discrete time setting we refer to these increasing sets as an increasing sequence of sigma-fields. In a continuous time setting, where new information arrives continuously, rather than at discrete time intervals, the increasing information set is referred to as a filtration, or an increasing family of sigma-algebras. A very unofficial standard is to use $I_t$ for discrete time settings and $\mathcal{F}_t$ for continuous time settings. We can also say that the set $\{\mathcal{F}_t, t \geq 0\}$ is a filtration, representing an increasing family of sub-$\sigma$-algebras on $\mathcal{F}$. Over time, outcomes of $\tilde{X}_{t_i}$, $(x_1, x_2, \ldots, x_t)$, will be added to the increasing family of information sets. We refer to the observed process, $(x_1, x_2, \ldots, x_t)$, as adapted to the filtration $\mathcal{F}_t$. We can also say that if $(x_1, x_2, \ldots, x_t)$ is an adapted process, then for the sequence $\{x_t\}$, $\tilde{X}_t$ is a random variable with respect to $(\Omega, \mathcal{F})$, and for each t the value of $\tilde{X}_t$ is known as $x_t$.

6.2 White Noise

A random variable is a white noise process if its expected mean is equal to zero,

$$E[\epsilon_t] = \mu = 0, \tag{6.1}$$

its variance exists and is constant, $\sigma^2$, and there is no memory in the process so the autocorrelation function is zero,

$$E[\epsilon_t\epsilon_t] = \sigma^2, \tag{6.2}$$

$$E[\epsilon_t\epsilon_s] = 0 \quad \text{for } t \neq s. \tag{6.3}$$

In addition, the white noise process is supposed to follow a normal and independent distribution, $\epsilon_t \sim NID(0, \sigma^2)$. A standardized white noise has a distribution like $NID(0, 1)$. Dividing $\epsilon_t$ by $\sigma$ gives $(\epsilon_t/\sigma) \sim NID(0, 1)$. The independent normal distribution has some important characteristics. First, if we add normal random variables together, the sum will have a mean equal to the sum of the means of all variables. Thus, adding T white noise variables together as $z_T = \sum_{t=1}^{T}(\epsilon_t/\sigma)$ forms a new variable with mean $E(z_T) = E(\epsilon_1/\sigma) + E(\epsilon_2/\sigma) + \ldots + E(\epsilon_T/\sigma) = (1/\sigma)[E(\epsilon_1) + E(\epsilon_2) + \ldots + E(\epsilon_T)] = 0$. Since each variable is independent, we have the variance as $\sigma_z^2 = \sigma_{z,1}^2 + \sigma_{z,2}^2 + \ldots + \sigma_{z,T}^2 = 1 + 1 + \ldots + 1 = T$. The random variable is distributed as $z_T \sim N(0, T)$, with a standard deviation given as $\sqrt{T}$. As the forecast horizon for $z_T$ increases, a 95% forecast confidence interval also increases, with $\pm 1.96\sqrt{T}$. In the same way, we can define the distribution, mean and variance during subsets of time. Suppose $\epsilon_t \sim N(0, 1)$ is defined for the period of a year.
The variable will be distributed over six months as $N(0, 1/2)$, with a standard deviation of $1/\sqrt{2}$, and over three months as $N(0, 1/4)$, with a standard deviation of $1/\sqrt{4}$. If the year is divided into $\delta$ equal sub-periods, the distribution over one sub-period becomes $N(0, 1/\delta)$ and the standard deviation $1/\sqrt{\delta}$. This property of the variable, following from the assumption of independent distribution, is known as the Markov property. Given that the increments are generated from an independent normal distribution $N(\mu, \sigma^2)$, the change in x from time 0 to time T is distributed as $N(\mu T, \sigma^2 T)$.

To sum up, it follows from the definition that a white noise process is not linearly predictable from its own past. The expected mean of a white noise, conditional on its history, is zero,

$$E[\epsilon_t \mid \epsilon_{t-1}, \epsilon_{t-2}, \ldots, \epsilon_1] = E[\epsilon_t] = 0. \tag{6.4}$$

This is a relatively weak condition. A white noise process might be predicted by other variables, and by its own past using non-linear functions. A process is called an innovation if it is unpredictable given some information set $I_t$. A process $y_t$ is an innovation process with respect to an information set if,

$$E[y_t \mid I_t] = 0, \tag{6.5}$$
where the information set $I_t$ includes not only the history of $\epsilon_t$, but also all other information which might be of importance for explaining this process. Stating that a series is a white noise innovation process, with respect to some information set $I_t$, is a stronger requirement than being a white noise process. It is also a stronger statement than saying that $\epsilon_t$ is a martingale difference process, because we add the assumption of a normal distribution. The martingale and the martingale difference processes are defined in terms of their first moments only. Creating a residual process that is a white noise innovation term is a basic requirement in the modelling process.

6.3 The Log Normal Distribution

The normal distribution is central in econometric modeling. However, financial prices display two characteristics which make them unfit for a stochastic process based on the assumption of normal distributions. Stock prices cannot be negative, due to limited liability, and they tend to grow over time due to the time value of money. Thus, the distribution of stock prices is typically non-negative and skewed. The normal distribution on the other hand is symmetric and stretches from $-\infty$ to $+\infty$. A better alternative for modelling stock prices, and many other asset prices, is to assume a log normal distribution, which compared to the normal is only defined over the non-negative range $[0, \infty)$, and is right skewed, reflecting the fact that stock prices have a tendency to move up rather than down. Furthermore, the log normal distribution has the property that the log of a log normal random variable has a normal distribution. Thus, taking the log of log normal random stock prices transforms their distribution to a normal distribution. Let $\tilde{S}_t$ be a random log normal stock price, with mean $\mu$ and variance $\sigma^2$. The log of $s_t$ is then distributed as $\ln s_t \sim N\left[\left(\mu - \frac{\sigma^2}{2}\right), \sigma^2\right]$.
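The link between the log normal and the normal distribution can be illustrated with simulated prices. The sketch below assumes hypothetical values μ = 0.05 and σ = 0.2 and uses `random.lognormvariate` from the Python standard library, whose two arguments are the mean and standard deviation of the underlying normal variable. It draws prices whose logs are N(μ - σ²/2, σ²), then checks that all prices are positive, that the logs have the stated mean and variance, and that the average price is close to exp(μ), a standard property of the log normal distribution under this parametrization.

```python
import math
import random

random.seed(4)

# Hypothetical parameters (assumptions for illustration).
mu, sigma = 0.05, 0.2
log_mean = mu - sigma ** 2 / 2      # mean of ln(s) in the text's parametrization

# lognormvariate takes the mean and st.dev. of the underlying normal.
prices = [random.lognormvariate(log_mean, sigma) for _ in range(20000)]
logs = [math.log(p) for p in prices]

n = len(prices)
mean_log = sum(logs) / n
var_log = sum((v - mean_log) ** 2 for v in logs) / n
mean_price = sum(prices) / n

print(f"smallest price:      {min(prices):.4f}")
print(f"mean of log prices:  {mean_log:.4f}  (theory {log_mean:.4f})")
print(f"var of log prices:   {var_log:.4f}  (theory {sigma ** 2:.4f})")
print(f"mean price:          {mean_price:.4f}  (theory {math.exp(mu):.4f})")
```

The correction term σ²/2 in the mean of the logs is what keeps the expected price at exp(μ); without it the average of the simulated prices would be higher.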
Given that $\tilde{S}_t$ has a log normal distribution, it follows that the log distance between $S_t$ and $S_{t+n}$ is distributed as $\ln S_{t+n} - \ln S_t \sim N\left[\left(\mu - \frac{\sigma^2}{2}\right)n, \sigma^2 n\right]$.

6.4 The ARIMA Model

The white noise process can be used to define (or generate) autoregressive models (AR) and moving average models (MA). The AR(p) model is

$$y_t = \mu + a_1 y_{t-1} + \ldots + a_p y_{t-p} + \epsilon_t, \tag{6.6}$$

where $\mu$ is a constant (for a stationary process the mean is $E(y_t) = \mu/(1 - a_1 - \ldots - a_p)$), and $\epsilon_t \sim NID(0, \sigma^2)$. Or, using the lag operator, $L^i x_t = x_{t-i}$,

$$A(L)y_t = \mu + \epsilon_t, \tag{6.7}$$

where $A(L) = (1 - a_1 L - a_2 L^2 - \ldots - a_p L^p)$. The eigenvalues (roots) associated with this polynomial inform about the time path of $y_t$. The moving average model of order q is,
$$y_t = \mu + \epsilon_t + b_1\epsilon_{t-1} + \ldots + b_q\epsilon_{t-q}, \tag{6.8}$$

or, using the lag operator,

$$y_t = \mu + B(L)\epsilon_t. \tag{6.9}$$

6.5 The Random Walk Model

A special case of the AR(1) model is the random walk model,

$$x_t = x_{t-1} + \epsilon_t \quad \text{where } \epsilon_t \sim NID(0, \sigma^2), \tag{6.10}$$

where $x_{t-1}$ is the lagged value of $x_t$, with an implicit parameter of unity, and $\epsilon_t$ is a white noise process. It follows that, given the past of the series, the best prediction we can use is the present value of the series, and that the first difference is nothing else than white noise, $x_t - x_{t-1} = \Delta x_t = \epsilon_t$. The important factor is that the increments of the series are unpredictable from the series' own past.

A random walk is non-stationary. By definition, it is integrated of order one, I(1). Taking the first difference of a random walk series produces a stationary I(0) (white noise) series. A random walk has the property that today's value is the prediction of the variable's future values,

$$E(x_{t+1} \mid x_t, x_{t-1}, x_{t-2}, \ldots, x_{t-n}) = E(x_{t+1} \mid x_t) = x_t, \tag{6.11}$$

where n might be equal to infinity. This definition does not rule out the case that there are other variables that can be correlated with $x_t$ and thereby also predict $x_{t+1}$. We can also say that a random walk has an infinitely long memory. With the starting value $x_0 = 0$, the mean is zero, and the variance and autocovariance are equal to $\mathrm{var}(\tilde{X}_t) = \sigma^2 t$ and $\mathrm{cov}(\tilde{X}_t, \tilde{X}_{t-n}) = (t - n)\sigma^2$:

$$E(x_t) = E\left[\sum_{i=1}^{t}\epsilon_i\right] = 0, \tag{6.12}$$

$$\mathrm{var}(x_t) = E(x_t^2) = E\left[\left(\sum_{i=1}^{t}\epsilon_i\right)^2\right] = \sum_{i=1}^{t}\sum_{j=1}^{t}E[\epsilon_i\epsilon_j] = t\sigma^2. \tag{6.13}$$

The first autocovariance is $(t - 1)\sigma^2$,

$$\mathrm{cov}(x_t, x_{t-1}) = E(x_t x_{t-1}) = E\left[\left(\sum_{i=1}^{t}\epsilon_i\right)\left(\sum_{j=1}^{t-1}\epsilon_j\right)\right] = \sum_{i=1}^{t}\sum_{j=1}^{t-1}E[\epsilon_i\epsilon_j] = (t - 1)\sigma^2. \tag{6.14}$$

The autocovariances for higher lag orders follow from this example. As can be seen these are non-stationary moments, since both are dependent on time (t). It follows that the autocorrelation function looks like,

$$\rho_n = [(t - n)/t]^{1/2}. \tag{6.15}$$

We can see that, given a sufficiently large number of observations, there is an infinite memory.
All theoretical autocorrelations approach one as t grows.
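The time dependence of the moments in (6.13) and (6.15) can be checked by simulating many independent random walks. The sketch below assumes hypothetical settings: 2000 paths, unit variance shocks, starting value zero, t = 100 and lag n = 36. Under these assumptions the variance across paths at t = 100 should be close to 100, and the correlation between the values at t = 100 and t = 64 should be close to (64/100)^{1/2} = 0.8.

```python
import random

random.seed(5)

# Hypothetical simulation settings (assumptions for illustration).
n_paths, t_end, lag = 2000, 100, 36

x_t, x_lag = [], []
for _ in range(n_paths):
    x = 0.0                                # starting value x_0 = 0
    for t in range(1, t_end + 1):
        x += random.gauss(0.0, 1.0)        # x_t = x_{t-1} + e_t
        if t == t_end - lag:
            x_lag.append(x)                # value at time t - n
    x_t.append(x)                          # value at time t


def mean(v):
    return sum(v) / len(v)


def variance(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / len(v)


def corr(u, v):
    mu_u, mu_v = mean(u), mean(v)
    c = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v)) / len(u)
    return c / (variance(u) * variance(v)) ** 0.5


var_t = variance(x_t)
rho = corr(x_t, x_lag)
rho_theory = ((t_end - lag) / t_end) ** 0.5   # equation (6.15)

print(f"variance across paths at t = {t_end}: {var_t:.1f}")
print(f"correlation at lag {lag}: {rho:.3f} (theory {rho_theory:.3f})")
```

Note that the moments here are computed across paths at fixed dates; a time average along a single random walk path would not estimate them, precisely because the process is non-stationary.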
If the series is $(x_0, x_1, \ldots, x_n)$, we can substitute repeatedly backwards,

$$x_t = x_0 + \sum_{i=1}^{t}\epsilon_i. \tag{6.16}$$

Thus, a random walk is a sum of white noise errors from the beginning of the series ($x_0$). Hence, the value of today is dependent on shocks built up since the beginning of the series. All shocks in the past are still affecting the series today. Furthermore, all shocks are equally important. The process formed by $\sum_{i=1}^{t}\epsilon_i$ is called a stochastic trend. In contrast to a deterministic trend, the stochastic trend is changing its slope in a random way period by period. Ex post a stochastic trend might look like a deterministic trend. Thus, it is not really possible to determine whether a variable is driven by a stochastic or a deterministic trend, or a combination of both.

If we add a constant term to the model we get a random walk with a drift,

$$x_t = \mu + x_{t-1} + \epsilon_t, \tag{6.17}$$

where the constant $\mu$ represents the drift term. In this process $x_t$ is driven by both a deterministic and a stochastic trend. If we perform the same backward substitution as above, we get,

$$x_t = \mu t + \sum_{i=1}^{t}\epsilon_i + x_0, \tag{6.18}$$

where $t = 1, 2, \ldots, n$. Thus, a constant term in a random walk model implies that the variable follows a linear deterministic trend ($\mu t$) and a stochastic trend in the long run. In the long run the deterministic trend will dominate the stochastic trend and determine the path of $x_t$. Taking first differences leads to,

$$\Delta x_t = \mu + \epsilon_t, \tag{6.19}$$

where the constant measures the average growth rate of $x_t$, since $E(\Delta x_t) = \mu$. The expected value of a driftless random walk, for any future date, is always today's value, $E_t(x_{t+n}) = x_t$. For a random walk with a drift the expected value is $E_t(x_{t+n}) = \mu n + x_t$.

At first glance the random walk model might seem extreme. Is it possible to motivate that a series has an infinite memory, so that shocks remain in the series forever? The answer is yes.
The most common example is that of innovations leading to economic growth, which then spills over into other economic variables. Innovations leading to economic growth do not occur at fixed intervals, nor is every single invention equally important. Over time, innovations will occur at random intervals and some inventions will be more important than others. The outcome is that productivity and economic growth are driven by a stochastic trend, just as described by a random walk.

In empirical work it is common to find variables that behave like random walks. Given forward-looking behavior of economic agents, it is often possible to construct economic models where transformed variables will behave like random walks. In a forward-looking world agents will use all relevant information when they determine today's prices. One important characteristic follows from this, namely that today's price is the best prediction of future prices. However, the relationship between today's price and the predicted future price is more complex. We return to this issue below, when we talk about martingales.
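The decomposition of a random walk with drift into a deterministic and a stochastic trend, and the forecast rule $E(x_{t+n} \mid x_t) = \mu n + x_t$, can be illustrated numerically. This is a small sketch under assumed values ($\mu = 0.5$, $\sigma = 1$, $x_0 = 10$); the helper names are hypothetical.

```python
import random


def simulate_drift_walk(mu, sigma, t_max, x0=0.0, seed=1):
    """Simulate x_t = mu + x_{t-1} + eps_t; return the path [x_0..x_t_max] and the shocks."""
    rng = random.Random(seed)
    shocks = [rng.gauss(0.0, sigma) for _ in range(t_max)]
    x, path = x0, [x0]
    for eps in shocks:
        x = mu + x + eps
        path.append(x)
    return path, shocks


def forecast(x_t, mu, n):
    """E(x_{t+n} | x_t) = mu * n + x_t; setting mu = 0 gives the driftless case."""
    return mu * n + x_t


mu, sigma, t_max, x0 = 0.5, 1.0, 200, 10.0
path, shocks = simulate_drift_walk(mu, sigma, t_max, x0)

# Backward substitution: deterministic trend + stochastic trend + starting value.
decomposed = mu * t_max + sum(shocks) + x0

# The first differences average out to the drift term.
diffs = [path[i] - path[i - 1] for i in range(1, t_max + 1)]
mean_change = sum(diffs) / t_max

print("x_T from recursion:    ", round(path[-1], 6))
print("x_T from decomposition:", round(decomposed, 6))
print("average first difference:", round(mean_change, 3), "drift:", mu)
print("forecast 4 periods ahead from x_t = 3:", forecast(3.0, mu, 4))
```

The recursion and the decomposition give the same end value up to rounding error, which is the content of the backward substitution.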
A note on the estimation and testing of random walks

A random walk process is also a series integrated of order one; it is also called a unit root process, and it contains a stochastic trend. Furthermore, a random walk process can be embedded in another process, say an ARIMA(p, d, q) process. The problem is that inference on random walk variables (and integrated variables in general) is difficult, because the estimated parameter on the lagged term will not follow a standard normal distribution. Hence, the ordinary t, chi-square and F distributions are not suitable for inference. Parameter estimates will generally be asymptotically unbiased, but their standard errors and variances do not follow standard distributions. For instance, a common t-test cannot be used to test if $a = 1$ in the regression

$$x_t = a x_{t-1} + \epsilon_t. \tag{6.20}$$

If $x_t$ follows a random walk, the distribution of the t-ratio $(\hat{a} - 1)/\text{st.dev}(\hat{a})$ will be skewed to the left, and thus depart from the Student t-table. Just as in any autoregressive model, the estimate $\hat{a}$ will be biased downward in finite samples. The term $(\hat{a} - a)$, however, becomes asymptotically a ratio between two random variables, which leads to a second-order bias in the estimation of the variance as $T \to \infty$. In this case, with a unit root process, the ratio is between random variables which in turn are functions of Wiener processes. In this situation one common approach is to use the so-called Dickey-Fuller test in combination with simulated distributions.

Testing for a unit root ($a = 1$) is one aspect of testing if a variable is a random walk. Another aspect, if it is not possible to reject a unit root, is to test if the residual is $\epsilon_t \sim NID(0, \sigma^2)$. Campbell, Lo and MacKinlay (1997, Ch 1) show how you can test for the absence of autocorrelation when dealing with the null hypothesis of a random walk.
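The left skew of the t-ratio under a unit root can be seen in a small Monte Carlo experiment. The sketch below is only an illustration: it assumes the regression (6.20) without a constant, a sample of 100 observations and 2000 replications, and all function names are hypothetical. It is not a replacement for the tabulated Dickey-Fuller critical values.

```python
import math
import random


def df_t_ratio(x):
    """OLS t-ratio for H0: a = 1 in x_t = a * x_{t-1} + eps_t (no constant)."""
    y, lag = x[1:], x[:-1]
    sxx = sum(v * v for v in lag)
    a_hat = sum(u * v for u, v in zip(y, lag)) / sxx
    resid = [u - a_hat * v for u, v in zip(y, lag)]
    s2 = sum(e * e for e in resid) / (len(y) - 1)  # residual variance
    return (a_hat - 1.0) / math.sqrt(s2 / sxx)


def simulate_t_ratios(n_rep=2000, t_max=100, seed=2):
    """t-ratios computed on simulated driftless random walks, sorted ascending."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_rep):
        x = [0.0]
        for _ in range(t_max):
            x.append(x[-1] + rng.gauss(0.0, 1.0))
        ratios.append(df_t_ratio(x))
    return sorted(ratios)


ratios = simulate_t_ratios()
mean_ratio = sum(ratios) / len(ratios)
q05 = ratios[int(0.05 * len(ratios))]                     # simulated 5% quantile
share_below = sum(r < -1.645 for r in ratios) / len(ratios)

print("mean of t-ratio:", round(mean_ratio, 2))
print("simulated 5% quantile:", round(q05, 2))
print("share below the one-sided normal 5% value -1.645:", round(share_below, 3))
```

Under a standard normal distribution the mean would be zero and about 5 percent of the t-ratios would fall below -1.645. In the simulation the mean is negative and the rejection share is clearly larger, so the normal table would reject a true unit root too often.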
Unfortunately, it is quite common in the literature to conclude that a series is a random walk (meaning not rejecting the null of a unit root) on the basis of unit root testing only, forgetting about the properties of the residual term, which under a random walk is simply the first difference. When testing for a random walk in limited samples it is extremely difficult to distinguish between a random walk and a stationary AR(1) model with a parameter just below unity.

A problem with random walks, as well as all variables which include stochastic trends, is that it is in general not possible to use standard distributions for inference. Parameter estimates will generally be asymptotically unbiased, but their standard deviations and variances do not follow standard distributions.

6.6 Martingale Processes

A random variable is said to be a martingale if the present observation is the best prediction of all future values. Let $\{\tilde{X}_t\}_{t=1}^{\infty}$ be a process of the random variable $\tilde{X}_t$. We say that the variable is a martingale with respect to the information set $I_t$,¹ if the expected value of $\tilde{X}_{t+s}$ is equal to the present value of $\tilde{X}_t$,

$$E\left[\tilde{X}_{t+s} \mid I_t\right] = x_t \quad \text{for } s > 0. \tag{6.21}$$

¹ Alternatively, it is possible to define the information set at time $t-1$, and write the definition as $E[\tilde{X}_t \mid I_{t-1}] = \tilde{X}_{t-1}$.
Given the information set, all information relevant for predicting $\tilde{X}_{t+s}$ is contained in today's value of $\tilde{X}_t$. Thus, the best prediction of $\tilde{X}_{t+1}$ is $x_t$, and the value today is the best prediction for all periods in the future. The information set might include the history of $\tilde{X}_t$ as well as all other information that might be of relevance for predicting $\tilde{X}_{t+s}$. The definition of a martingale is always relative, since we have the freedom of defining different information sets. If $\tilde{X}_t$ is a martingale with respect to the information set $I_t$, it might not be a martingale with respect to another information set $I_t'$, unless the two sets are identical.

We can now continue and define the martingale difference process through the expected difference between $\tilde{X}_{t+s}$ and $\tilde{X}_t$,

$$E\left[(\tilde{X}_{t+s} - \tilde{X}_t) \mid I_t\right] = E\left[\tilde{X}_{t+s} \mid I_t\right] - x_t = 0. \tag{6.22}$$

If a process is a martingale difference process, changes in the process are unpredictable from the information set.

The sub-martingale and the super-martingale are two versions of martingale processes. A sub-martingale is defined as

$$E\left[\tilde{X}_{t+s} \mid I_t\right] \geq x_t,$$

which says that, on average, the expected value is growing over time. A super-martingale is defined as

$$E\left[\tilde{X}_{t+s} \mid I_t\right] \leq x_t,$$

which says that the expected value of $\tilde{X}_{t+s}$ is at most $x_t$ and, on average, declining over time.

Martingales are well known in the financial literature. If the agents on a financial market use all relevant information to predict the yields of financial assets, the prices of these assets will, under certain special conditions, behave like martingales. The random walk hypothesis of asset prices does not come from finance theory. It is based on empirical observations, and is mainly a hypothesis about the empirical behavior of asset prices which lacks a theoretical foundation. A random walk process is a martingale, but it also includes statements about distributions. If we compare with the random walk we have the model $x_t = x_{t-1} + \epsilon_t$, where $\epsilon_t$ is a normally distributed white noise process.
The latter is a stronger condition than assuming a martingale process. A random walk with drift, $x_t = \mu + x_{t-1} + \epsilon_t$ with $\mu > 0$, is a sub-martingale, since the deterministic trend will increase the expectation over time, $E(\tilde{X}_{t+1} \mid x_t) = \mu + x_t$.

Let us now turn to finance theory. Theory states that the price of an asset ($P_{t+1}$) at time $t+1$ is given by the price at $t$ plus a risk-adjusted discount factor $r$. If we assume, for simplicity, that the discount factor is a constant, we get $P_{t+1} = (1+r)P_t$. Asset prices are therefore not driftless random walks, or martingales. The process described by theory is $\ln P_{t+1} = \ln(1+r) + \ln P_t + \epsilon_{t+1}$, which is a sub-martingale given, in this case, a constant discount factor. If we would like to say that asset prices are martingales we must either transform the price process according to $[P_{t+1}/(1+r)]$, or we must include the risk-adjusted discount factor in the information set.² Thus, the expected value of an asset price is, by definition, $E(P_{t+1}) = E[(1+r)P_t]$. If the discount factor (and risk) is a constant ($g$) we get, in logs, $E(P_{t+1}) = g + P_t$, which is a random walk with drift. If the risk premium is a time-varying stochastic variable ($\tilde{G}_t$), we have $E(P_{t+1}) = g_t + P_t$, which takes us even further away from the random walk.

² It is obvious that we can transform a variable into a martingale by subtracting elements from the process by conditioning or direct calculation. In fact, most variables can be transformed into a martingale in this way. An alternative way of transforming a variable into a martingale is to transform its probability distribution. In this method you look for a probability distribution which is equivalent to the one generating the conditional expectations. This type of distribution is called an equivalent martingale distribution.

It is important to distinguish between martingales and random walks. Financial theory ends in statements about the expected mean of a variable with respect to a given information set. A random walk is defined in terms of its own past only. Thus, saying that a variable is a random walk does not exclude the case that there exists an information set for which the variable is not a martingale. Furthermore, the residuals in a random walk model are by definition independent, if we assume them to be white noise. But a martingale describes the behavior of the first moment of a random variable. It does not imply independence between the higher moments of the series. If we model a martingale by a first-order autoregressive process, we might find that the errors are dependent through higher moments. The conditional variance of $\epsilon_t$ is then not a constant $\sigma^2$, but a function of its own past, like

$$\epsilon_t^2 = \alpha + \beta\epsilon_{t-1}^2 + \nu_t, \tag{6.23}$$

where $\nu_t$ is a white noise process. This is a first-order ARCH model, ARCH(1) (Auto Regressive Conditional Heteroscedasticity), which implies that a large shock to the series is likely to be followed by another large shock. In addition, it implies that the residuals are not independent of each other.

The conclusion is that we must be careful when reading articles which claim that the exchange rate, or some other variable, should be, or is, a random walk. Often what the authors really mean is that the variable is a martingale, conditional on some information. The martingale property is directly related to the efficient market hypothesis (EMH), which sets out the conditions under which changes in asset prices become unpredictable given different types of information.

6.7 Markov Processes

Markov³ processes represent a general type of series with the property that the value at time $t$ contains all information necessary to form probability assessments of all future values of the variable.
Compared with the martingale property above, this property is more far-reaching. The martingale property is concerned with the conditional expectation of a variable, and not with the actual distribution function and the higher moments of the variable.

Markov processes and the associated Markov property are important because they help us to formulate stochastic time series processes. In economics and finance we like to explain how expectations are generated and how expectations affect the outcome of observed prices and quantities on various markets. In particular, in financial economics and the pricing of derivatives, we like to model asset prices as continuous stochastic processes. Once we can trace the price of an asset continuously over time into the future, we can also determine the price of derivatives through replication and arbitrage. In addition, we learn how to use derivatives to continuously hedge risky positions.⁴

To predict or generate future possible paths of a Markov variable, we only need to know the most recent value, or the most recent values, of the variable. This is, in many modeling situations, a very practical assumption: we do not need to know the history of the variable to learn how it behaves, nor do we need to know actual values/observations of the future. The future of the series can be generated from its conditional past.

³ Markov is known for a number of results, including the so-called Markov estimates that prove the equality between OLS and MLE.

⁴ Recall the definition of a derivative asset: it is a financial contract that (1) derives its value from some underlying asset, and (2) at the time of expiration has exactly the same price as the underlying asset.

Let $F(x_1, x_2, \dots, x_t)$ be the distribution function of the random variable $\tilde{X}_t$. There are $1, 2, \dots, t$ observations of the series, where $t$ might be equal to infinity. For each observation ($x_i$) there is a probability statement, $F(x_1, x_2, \dots, x_t) = \Pr(\tilde{X}_1 \leq x_1, \tilde{X}_2 \leq x_2, \dots, \tilde{X}_t \leq x_t)$. A discrete time Markov process is characterized by the following property,

$$\Pr(\tilde{X}_{t+s} \leq x_{t+s} \mid x_1, x_2, \dots, x_t) = \Pr(\tilde{X}_{t+s} \leq x_{t+s} \mid x_t), \tag{6.24}$$

where $s > 0$. The expression says that all probability statements about future values of the random variable $\tilde{X}_{t+s}$ depend only on the value the variable takes at time $t$, and do not depend on earlier realizations. By stating that a variable is a Markov process we put a restriction on the memory of the process.

The AR(1) model, and the random walk, are first-order Markov processes,

$$x_t = a_1 x_{t-1} + \epsilon_t, \quad \text{where } \epsilon_t \sim NID(0, \sigma^2). \tag{6.25}$$

Given that we know that $\epsilon_t$ is a white noise process, $NID(0, \sigma^2)$, and can observe $x_t$, we know all there is to know about the distribution of $x_{t+1}$, since $x_t$ contains all information about the future. In practical terms, it is not necessary to work with the whole series, only a limited present. We can also say that the future of the process, given the present, is independent of the past.

For a first-order Markov process, the expected value of $\tilde{X}_{t+1}$, given all its possible present and historical values $\tilde{X}_t, \tilde{X}_{t-1}, \tilde{X}_{t-2}, \dots$, can be expressed as

$$E\left[\tilde{X}_{t+1} \mid \tilde{X}_t, \tilde{X}_{t-1}, \tilde{X}_{t-2}, \dots\right] = E\left[\tilde{X}_{t+1} \mid \tilde{X}_t\right]. \tag{6.26}$$

Thus, for a first-order Markov process the conditional expectation depends on the present value only. The process is also a martingale when this expectation equals $x_t$, as for the random walk ($a_1 = 1$) in (6.25). Typically, the value of $\tilde{X}_t$ is known at time $t$ as $x_t$. The Markov property is a very convenient property if we want to build theoretical models describing the continuous evolution of asset prices.
We can focus on the value today, and generate future time series, irrespective of the past history of the process. Furthermore, at each period in the future we can easily determine an exact future value, which is the equilibrium price for that period.

The white noise process, as an example, is a Markov process. This follows from the fact that we assumed that each $\epsilon_t$ was independent of its own past and future. One outcome of the assumption of a normal and independent process was that we could relatively easily form predictions and confidence intervals given only the value of $\epsilon_t$ today.

The definition of a Markov process can be extended to an $m$:th order Markov process, for which we have

$$E\left[\tilde{X}_{t+1} \mid \tilde{X}_t, \tilde{X}_{t-1}, \tilde{X}_{t-2}, \dots\right] = E\left[\tilde{X}_{t+1} \mid \tilde{X}_t, \tilde{X}_{t-1}, \dots, \tilde{X}_{t-m+1}\right], \tag{6.27}$$

where we need to condition on $m$ historical (random) values (including the present value $\tilde{X}_t$) to predict the future.
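The practical content of the first-order Markov property is that future paths can be generated from the current value alone. The sketch below does this for the AR(1) model with two assumed parameter values, $a_1 = 0.5$ and $a_1 = 1$, and compares the simulated mean of $x_{t+s}$ with the conditional expectation $a_1^s x_t$. The function names, the starting value and the number of paths are illustrative assumptions.

```python
import random


def conditional_mean(x_t, a, s):
    """E(x_{t+s} | x_t) for x_t = a * x_{t-1} + eps_t with zero-mean shocks."""
    return (a ** s) * x_t


def simulate_future(x_t, a, s, n_paths=20000, sigma=1.0, seed=3):
    """Draw x_{t+s} using only the current value x_t; no earlier history is needed."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_paths):
        x = x_t
        for _ in range(s):
            x = a * x + rng.gauss(0.0, sigma)
        draws.append(x)
    return draws


x_t, s = 2.0, 5
results = {}
for a in (0.5, 1.0):
    draws = simulate_future(x_t, a, s)
    results[a] = sum(draws) / len(draws)
    print("a =", a, "simulated mean:", round(results[a], 3),
          "conditional mean:", conditional_mean(x_t, a, s))
```

Both processes are Markov, since only $x_t$ enters the simulation. Only the case $a_1 = 1$ has a conditional mean equal to today's value, which is the martingale property of the random walk.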
6.8 Brownian Motions

Consider the random walk model, $x_t = x_{t-1} + \epsilon_t$, and assume that the distance between $t$ and $t-1$ becomes smaller and smaller. As the distance between the observations gets smaller, the function will in the end get so close to a continuous function that it becomes indistinguishable from a function in continuous time, $x(t) = x(t-1) + \epsilon(t)$. This takes us to the random walk in continuous time, known as a Brownian motion or Wiener process. This section introduces Brownian motions (Wiener processes), the geometric Brownian motion, jump diffusion models and the Ornstein-Uhlenbeck process.

There are (at least) two very important reasons for studying Wiener processes. The first is that the limiting distributions of most non-stationary variables in economics and finance are given as functions of a Brownian motion. It is this knowledge that helps us to understand the distribution of estimates based on non-stationary variables. The second reason for learning about Brownian motions is that they play an important role in modeling asset prices in finance. A word of warning: though Brownian motions have nice mathematical properties, it is not necessarily so that they also fit a given data series better. Normal discrete empirical modelling will take you a long way.

The random walk is defined in discrete time. The intuition behind the random walk and the Brownian motion is as follows. If we let the steps between $t$ and $t-1$ become infinitely small, the random walk can be said to converge to a Brownian motion (or Wiener process). As the distance between $t$ and $t-1$, alternatively between $t$ and $t+1$, shrinks, it becomes harder and harder to distinguish between a discrete time process and a continuous time process. In the end, the difference will be so small that it will not matter.

These processes have a long history. The Brownian motion was named after a Scottish botanist, Robert Brown, who in 1827 observed that small particles immersed in a liquid exhibited ceaseless irregular motion.
Brown himself, however, named a few persons who had observed this phenomenon before him. In 1900 a French mathematician named Bachelier described the random variation in stock prices when he wanted to explain option prices. In 1905 Einstein analysed similar behavior of molecules. Finally, Norbert Wiener gave the process a rigorous mathematical treatment in a series of papers from 1918 onwards.

Is there a difference between what we call a Wiener process and a Brownian motion? In practice the answer is no. The two terms can be, and are, used interchangeably. If you look at the details you will find that the Brownian motion has normally distributed increments. The Wiener process, on the other hand, is explicitly assumed to be a martingale. No such statement is made for the Brownian motion.⁵ In practice, these differences mean nothing (for more information search for the Lévy theorem). In econometrics there is a tendency to use Wiener processes to represent univariate processes and Brownian motions for multivariate processes.

⁵ See Neftci, Salih (2000), An Introduction to the Mathematics of Financial Derivatives, 2 ed. Academic Press, Amsterdam.

The most important characteristic of a Brownian motion is that all increments are independent, and not predictable from the past. Thus the Brownian motion can be said to be a martingale and it fulfills the Markov property. The latter means that the distribution of future values at $(t + dt)$ depends only on the current value of $x(t)$. This is a good characteristic of models describing uncertainty, in particular situations when nature is evolving as a function of random steps that we cannot predict. The further we look into the future, the larger the number of random changes gets, and probability statements about future events get harder and harder.

A generalized (arithmetic) Brownian motion is written as

$$dx_t = \alpha\,dt + \sigma\,dW_t, \tag{6.28}$$

where $d$ represents the continuous or infinitesimally small change in the variable $x$ over the time interval $dt$. This can be written as $dx_t = x(t + dt) - x(t)$. The parameters $\alpha$ and $\sigma$ are real numbers (constants), where $\sigma$ is strictly positive. As in a random walk, the term $\alpha\,dt$ represents the drift and $\sigma\,dW$ can be said to add stochastic noise to the series. $W$ represents a standardized Wiener process, or Brownian motion, such that $dW$ represents the differential of the Brownian motion, and $dW_t = W(t + dt) - W(t)$ has a normal distribution with mean zero and variance equal to $dt$.

It is easy to see that $\alpha\,dt$ represents a drift term. Take the expected value of the process, $E(dx_t) = \alpha\,dt + \sigma \cdot 0$: both $\alpha$ and $dt$ are non-stochastic, and $dW$ has an expected value of zero. It follows that $\alpha = \frac{1}{dt}E(dx_t)$ represents the average change in $x$ per unit of time. Of course, if $\alpha = 0$ we have a driftless random walk in continuous time, $E(dx_t) = \sigma E(dW_t) = 0$. The variance is $Var(dx_t) = \sigma^2 Var(dW) = \sigma^2 dt$. Not shown here is that the changes in $x$ ($dx_t$) are independent and stationary.

6.9 Brownian motions and the sum of white noise

In terms of the change over a specific (possibly observable) time period we need to introduce the notation $\delta t$ to represent the change over some fraction of time $t$. By using this notation we can let $t$ be a year or a month, and then by changing $\delta$ we can let the length of the period become smaller and smaller. The change due to the deterministic trend is written, per unit of time, as $\alpha\,\delta t$. The stochastic noise that we add over a given interval is written as $\sigma\epsilon_t\sqrt{\delta t}$, where $\epsilon_t \sim NID(0, 1)$. In the limit, as $\delta \to 0$, we have that $\delta t \to dt$ and $\delta x_t \to dx_t$. In terms of small intervals the Brownian motion becomes

$$\delta x_t = \alpha\,\delta t + \sigma\epsilon_t\sqrt{\delta t}.$$
(6.29)

To understand the asymptotic properties of the Brownian motion we could let $\delta t$ shrink towards zero directly, but there is a better way to see what happens. As we study a standardized Brownian motion/Wiener process $W(t)$ over the interval $[0, T]$, we can divide this interval into $n$ segments $t_i - t_{i-1}$,

$$0 = t_0 < t_1 < t_2 < \dots < t_i < \dots < t_n = T. \tag{6.30}$$

Let the length of each segment be $\delta = t_i - t_{i-1}$, and assume that there is a random variable $\Delta W_{t_i}$ that takes on either the value $\sqrt{\delta}$ or $-\sqrt{\delta}$, with equal probability. Furthermore, assume that $\Delta W_{t_i}$ is independent of $\Delta W_{t_j}$ for $i \neq j$, so that each increment is uncorrelated with the other increments. The Wiener process is now defined as the sum of the $\Delta W_{t_i}$ as $\delta \to 0$, which is the same as saying that as the interval $[0, T]$ is divided into finer and finer segments, we have

$$W(t) = \sum_{i=1}^{n}\Delta W_{t_i} \quad \text{as } n \to \infty. \tag{6.31}$$

An extension of this, if $\epsilon_t \sim NID(0, \sigma^2)$, is that the scaled partial sum $\frac{1}{\sigma\sqrt{T}}\sum_{t}\epsilon_t$ will also converge to a Wiener process. Thus, the sum of a standardized white noise will also converge to a standardized Wiener process. This result is crucial for the understanding of the distribution of a random walk and other unit root variables.

The geometric Brownian motion

The arithmetic Brownian motion is not well suited for asset prices, as their changes seldom display a normal distribution. The log of asset prices, and returns, are better described by a normal distribution. This takes us to the geometric Brownian motion

$$\frac{dx_t}{x_t} = \mu\,dt + \sigma\,dW_t.$$

What happens here is that we assume that $\ln x_t$ has a normal distribution, meaning that $x_t$ follows a log-normal distribution, and $\mu\,dt + \sigma\,dW_t$ is a normal variable. Ito's lemma can be used to show that

$$d\ln x_t = \left(\mu - \frac{\sigma^2}{2}\right)dt + \sigma\,dW_t.$$

The expected value of the geometric Brownian motion is $E(dx_t/x_t) = \mu\,dt$, and the variance is $Var(dx_t/x_t) = \sigma^2 dt$.

There are several ways in which the model can be modified to better suit real world asset prices. One way is to introduce jumps in the process, so-called "jump diffusion models". This is done by adding a Poisson process to the geometric Brownian motion,

$$\frac{dx_t}{x_t} = \mu\,dt + \sigma\,dW_t + U_t\,dN_t(\lambda),$$

where $U_t$ is a normally distributed random variable and $N_t$ represents a Poisson process with intensity $\lambda$, which accounts for jumps in the price process.

The random walk model is good for asset prices, but not for interest rates. The movements of interest rates are more bounded than those of asset prices. In this case the so-called Ornstein-Uhlenbeck process provides a more realistic description of the dynamics,

$$dr_t = \alpha(b - r_t)\,dt + \sigma\,dW_t.$$

Thus the idea behind the Ornstein-Uhlenbeck process is that it restricts the movements of the variable ($r$) to be mean reverting, or to stay in a band, around $b$, where $b$ can be zero.

A more formal definition
If $\tilde{X}(t)$ is a Wiener process, it is defined for $0 \leq t < \infty$. The series always starts at zero, $\tilde{X}(0) = 0$, and if $t_0 \leq t_1 \leq t_2 \leq \dots \leq t_n$, then all increments of $\tilde{X}(t_i)$ are independent. In terms of the density function we have

$$D\left[x(t_1) - x(t_0), x(t_2) - x(t_1), \dots, x(t_n) - x(t_{n-1}) \mid t_0, t_1, \dots, t_n\right] \tag{6.32}$$

$$= \prod_{i=1}^{n} D\left[x(t_i) - x(t_{i-1}) \mid t_0, t_1, \dots, t_n\right]. \tag{6.33}$$

The expected value of each increment is zero,

$$E\left[\tilde{X}(t_n) - \tilde{X}(t_{n-1})\right] = 0, \tag{6.34}$$

with variance

$$var\left[\tilde{X}(t) - \tilde{X}(s)\right] = \sigma^2(t - s), \tag{6.35}$$

where $0 \leq s < t$. Finally, since the increments are a martingale difference process, we can assume that these increments follow a normal distribution, so $\tilde{X}(t) - \tilde{X}(s) \sim N[0, \sigma^2(t - s)]$. These assumptions lead to the density function

$$D[x(t)] = \frac{1}{\sigma\sqrt{2\pi t_1}}\exp\left[-\frac{x_1^2}{2\sigma^2 t_1}\right]\prod_{i=2}^{n}\frac{1}{\sigma\sqrt{2\pi(t_i - t_{i-1})}}\exp\left[-\frac{(x_i - x_{i-1})^2}{2\sigma^2(t_i - t_{i-1})}\right]. \tag{6.36}$$

When $\sigma^2 = 1$, the process is called a standard Wiener process or standard Brownian motion. That the Brownian motion is quite special can be seen from this density function. The sample path is continuous, but not differentiable. [In physics this is explained as the motion of a particle which at no time has a velocity.]

Wiener processes are of interest in economics for many reasons. First, they offer a way of modeling uncertainty, especially in financial markets, where we sometimes have an almost continuous stream of observations. Secondly, many macroeconomic variables appear to be integrated or near integrated. The limiting distributions of such variables are known to be best described as functions of Wiener processes. In general we must assume that these distributions are non-standard.

To sum up, there are five important things to remember about the Brownian motion/Wiener process:

- It represents the continuous time (asymptotic) counterpart of random walks.
- It always starts at zero and is defined over $0 \leq t < \infty$.
- The increments, meaning any change between two points, regardless of the length of the intervals, are not predictable, are independent, and are distributed as $N(0, (t - s)\sigma^2)$, for $0 \leq s < t$.
- It is continuous over $0 \leq t < \infty$, but nowhere differentiable. The intuition behind this result is that the differential implies predictability, which would go against the previous condition.
- Finally, a function of a Brownian motion/Wiener process will behave like a Brownian motion/Wiener process.

The last characteristic is important, because most economic time series variables can be classified as random walks, integrated or near-integrated processes. In practice this means that their variances, covariances etc. have distributions that are functionals of Brownian motions. Even in small samples, functionals of Brownian motions will better describe the distributions associated with economic variables that display tendencies of stochastic growth.
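The construction of the Wiener process as a sum of small independent steps can be mimicked on a computer. The sketch below follows the binary-step idea described earlier in this chapter, with increments of size $+\sqrt{\delta}$ or $-\sqrt{\delta}$ drawn with equal probability, and checks that the end value $W(T)$ has mean close to zero, variance close to $T$, and roughly normal tail coverage. The interval length $T = 2$, the number of steps and paths, and all function names are assumptions made for the illustration.

```python
import math
import random


def wiener_endpoint(T, n_steps, rng):
    """Approximate W(T) by n_steps increments of +/- sqrt(delta), delta = T / n_steps."""
    step = math.sqrt(T / n_steps)
    return sum(step if rng.random() < 0.5 else -step for _ in range(n_steps))


def sample_endpoints(T=2.0, n_steps=400, n_paths=4000, seed=4):
    """Independent approximate draws of W(T)."""
    rng = random.Random(seed)
    return [wiener_endpoint(T, n_steps, rng) for _ in range(n_paths)]


T = 2.0
w = sample_endpoints(T=T)
mean_w = sum(w) / len(w)
var_w = sum(v * v for v in w) / len(w) - mean_w ** 2
# Share of draws inside the normal 95% band +/- 1.96 * sqrt(T).
coverage = sum(abs(v) <= 1.96 * math.sqrt(T) for v in w) / len(w)

print("mean of W(T):", round(mean_w, 3))
print("variance of W(T):", round(var_w, 3), "theory:", T)
print("share inside normal 95% band:", round(coverage, 3))
```

Increasing `n_steps` makes each path finer without changing the variance of the end point, which is the sense in which the sum of many small independent steps settles down to a process with $N(0, t)$ increments.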
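The Ito correction in the geometric Brownian motion, where the drift of $\ln x_t$ is $\mu - \sigma^2/2$ rather than $\mu$, can also be checked numerically. The following sketch uses a simple Euler discretisation of $dx_t = \mu x_t\,dt + \sigma x_t\,dW_t$, which is an approximation chosen for the illustration and not a method taken from the text. The parameter values ($\mu = 0.10$, $\sigma = 0.20$, $T = 1$) and the function name are assumptions.

```python
import math
import random


def simulate_gbm_endpoint(x0, mu, sigma, T, n_steps, rng):
    """Euler steps of dx = mu * x * dt + sigma * x * dW over [0, T]; returns x(T)."""
    dt = T / n_steps
    x = x0
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # dW ~ N(0, dt)
        x += mu * x * dt + sigma * x * dw
    return x


rng = random.Random(5)
mu, sigma, T = 0.10, 0.20, 1.0
ends = [simulate_gbm_endpoint(1.0, mu, sigma, T, 100, rng) for _ in range(4000)]

mean_log = sum(math.log(x) for x in ends) / len(ends)   # theory: (mu - sigma^2 / 2) * T
mean_level = sum(ends) / len(ends)                      # theory: exp(mu * T)

print("mean of ln x(T):", round(mean_log, 3),
      "Ito drift:", round((mu - 0.5 * sigma ** 2) * T, 3))
print("mean of x(T):", round(mean_level, 3),
      "exp(mu * T):", round(math.exp(mu * T), 3))
```

The level grows on average at the rate $\mu$, while the log grows at the lower rate $\mu - \sigma^2/2$. The gap between the two is the effect described by Ito's lemma.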
7. INTRODUCTION TO TIME SERIES MODELING

"Time is a great teacher, but unfortunately it kills all its pupils" Louis Hector Berlioz

A time series is simply data ordered by time. And time series analysis is simply approaches that look for regularities in these data ordered by time. Stochastic time series play an important part in economics and finance. To forecast and analyse these series it is necessary to take into account not only their stochastic nature but also the fact that they are non-stationary, dependent over time and by nature correlated with each other. In theoretical models, the emphasis on intertemporal decision making highlights the role expectations play in a world where decisions must be made from information sets made up of stochastic processes.

All time series techniques aim at making the series more understandable by decomposing them into different parts. This can be done in several ways. The aim of this introduction is to give a general overview of the subject. A time series is any sequence ordered by time. The sequence can be either deterministic or stochastic. The primary interest in economics is in stochastic time series, where the sequence is made up of random variables. A sequence of stochastic variables ordered by time is called a stochastic time series process. The random variables making up the process can either be discrete, taking on a given set of integer numbers, or continuous, taking on any real number between $\pm\infty$. While discrete random variables are possible, they are not common.

Stochastic time series can be analysed in the time domain or in the frequency domain. The former approach analyses stochastic processes in given time periods like days, weeks, years etc. The frequency approach aims at decomposing the process into frequencies by using trigonometric functions like sines and cosines.
Spectral analysis is an example of analysis that uses the frequency domain to identify regularities like seasonal factors, trends, and systematic lags in adjustment. In economics and finance, where we are faced with given observations and we study the behavior of agents operating in real time, the time domain is the most interesting road ahead. There are relatively few problems that are interesting to analyze in the frequency domain.

Another dimension in modeling is whether processes are in discrete time or in continuous time. The principal difference here is that the stochastic variables in a continuous time process can be measured at any time $t$, and that they can take different values at any time. In a discrete time process, the variables are observed at fixed intervals of time ($t$), and they do not change between these observation points. Discrete time variables are not common in finance and economics. There are few, if any, variables that remain fixed between their points of observation. The distinction between continuous time and discrete time is not a matter of measurability alone. A common mistake is to be confused by the fact that economic variables are measured at discrete time intervals. The money stock is generally measured and recorded as an end-of-month value. The way of measuring the stock of money does not imply that it remains unchanged between the observation points; instead it changes whenever the money market is open. The same holds for variables like production and consumption. These activities take place 24 hours a day, during the whole year. They are measured as the flow of income and consumption over a period, typically a quarter, representing the integral sum of these activities.

Usually, a discrete time variable is written with a time subscript ($x_t$), while continuous time variables are written as $x(t)$. The continuous time approach has a number of benefits, but the cost and quality of the empirical results seldom motivate the continuous time approach. It is better to use discrete time approaches as an approximation to the underlying continuous time system. The cost of this simplification is small compared with the complexity of continuous time analysis. This should not be understood as a rejection of continuous time approaches. Continuous time is good for analyzing a number of well defined problems like aggregation over time and individuals. In the end it should lead to a better understanding of adjustment speeds, stability conditions and interactions among economic time series, see Sjöö (1990, 1995).¹

Thus, our interest is in analysing discrete time stochastic processes in the time domain. A time series process is generally indicated with brackets, like $\{y_t\}$. In some situations it will be necessary to be more precise about the length of the process. Writing $\{y_t\}_{1}^{\infty}$ indicates that the process starts at period one and continues infinitely. The process consists of random variables because we can view each element in $\{y_t\}$ as a random variable. Let the process go from the integer values 1 up to $T$. If necessary, to be exact, the first variable in the process can be written as $y_{t_1}$, the second variable as $y_{t_2}$, etc. up until $y_{t_T}$. The distribution function of the process can then be written as $F(y_{t_1}, y_{t_2}, \dots, y_{t_T})$.

In some situations it is necessary to start from the very beginning. A time series is data ordered by time. A stochastic time series is a set of random variables ordered by time. Let $\tilde{Y}_{it}$ represent the stochastic variable $\tilde{Y}_i$ given at time $t$. Observations on this random variable are often indicated as $y_{it}$.
In general terms a stochastic time series is a series of random variables ordered by time. A series starting at time $t = 1$ and ending at time $t = T$, consisting of $T$ different random variables, is written as $\{\tilde{Y}_{1,1}, \tilde{Y}_{2,2}, \ldots, \tilde{Y}_{T,T}\}$. Of course, assuming that the series is built up by individual random variables, each with its own independent probability distribution, is a complex thought. But nothing in our definition of a stochastic time series rules out that the data are made up of completely different random variables. Sometimes, to understand and find solutions to practical problems, it will be necessary to go all the way back to the most basic assumptions.

Suppose we are given a time series consisting of yearly observations of interest rates, {6.6, 7.5, 5.9, 5.4, 5.5, 4.5, 4.3, 4.8}. The first question to ask is whether this is a stochastic series in the sense that these numbers were generated by one stochastic process, or perhaps by several different stochastic processes. Further questions are whether the process or processes are best represented as continuous or discrete, and whether the observations are independent or dependent. Quite often we will assume that the series is generated by one and the same stochastic process in discrete time. Based on these assumptions the modelling process tries to find systematic historical patterns and cross-correlations with other variables in the data.

All time series methods aim at decomposing the series into separate parts in some way. The standard approach in time series analysis is to decompose as

$$y_t = T_{t,d} + S_{t,d} + C_{t,d} + I_t,$$

where $T_{t,d}$ and $S_{t,d}$ represent (deterministic) trend and seasonal components, $C_{t,d}$ is a deterministic cyclical component and $I_t$ is a process representing irregular factors.² For time series econometrics this definition is limited. Instead, let $\{y_t\}$ be a stochastic time series process, composed as

$$y_t = \text{systematic components} + \text{unsystematic component} = T_d + T_s + S_d + S_s + \{y_t^{*}\} + e_t, \tag{7.1}$$

where the systematic components include deterministic trends $T_d$, stochastic trends $T_s$, deterministic seasonals $S_d$, stochastic seasonals $S_s$, a stationary process (or the short-run dynamics) $y_t^{*}$, and finally a white noise innovation term $e_t$. The modeling problem can be described as the problem of identifying the systematic components such that the residual becomes a white noise process. For all series, remember that any inference is potentially wrong if not all components have been modeled correctly. This is so regardless of whether we model a simple univariate series with time series techniques, a reduced system, or a structural model. Inference is only valid for a correctly specified model.

Present ARIMA: a class of models, ARIMA(p,d,q) and ARFIMA(p,d,q) models
Operators
Box-Jenkins
Identification tools: ACF, PACF, Q-test
Deal with: non-stationarity, dynamics, trend, seasonal effects, deterministic variables
Theory ARIMA
After ARIMA? ARIMAX, transfer function, RDL, ARCH/GARCH
Structural: single equation, ADL, error correction models (older stuff)
Multivariate: VAR, VECM, SVAR
Add for VAR: how to build VARs. Lags, white noise. Lags, dummies, white noise. Information criteria + min number of equations with AR
Add for rational expectations: GMM, GARCH

¹ We can also mention the different types of series that are used: stocks, flows and price variables. Stocks are variables that can be observed at a point in time, like the money stock or inventories. Flows are variables that can only be observed over some period, like consumption or GDP. In this context price variables include prices, interest rates and similar variables which can be observed in a market at a given point in time. Combining these variables into multivariate processes and constructing econometric models from observed variables in discrete time produces further problems, and in general they are quite difficult to solve without using continuous time methods. Usually, careful discrete time models will reduce the problems to a large extent.

² For simplicity we assume a linear process. An alternative is to assume that the components are multiplicative, $x_t = T_{t,d} \times S_{t,d} \times C_{t,d} \times I_t$.
7.1 Descriptive Tools for Time Series

Random variables are described by their moments. Stochastic time series can be described by their means, variances and autocovariances. Given a random variable $\tilde{Y}_t$ which generates an observed process $\{y_t\}$, the mean and the variance are $E\{\tilde{Y}_t\} = \mu$ and $\mathrm{var}\{\tilde{Y}_t\}$. The autocovariance at lag $k$ is

$$\gamma_k = \mathrm{cov}(\tilde{Y}_t, \tilde{Y}_{t-k}) = E\{[\tilde{Y}_t - E(\tilde{Y}_t)][\tilde{Y}_{t-k} - E(\tilde{Y}_{t-k})]\}.$$

Because a covariance is measured in the units of the series, it is difficult to read as a measure of the strength of the relation. For practical work, a more useful measure is provided by the autocorrelation,

$$\rho_k = \frac{\mathrm{cov}(\tilde{Y}_t, \tilde{Y}_{t-k})}{\sqrt{\mathrm{var}(\tilde{Y}_t)\,\mathrm{var}(\tilde{Y}_{t-k})}} = \frac{\gamma_k}{\gamma_0},$$

where $\rho_k$ is the autocorrelation between a realisation of the series at time $t$ and at time $t \pm k$. The autocorrelation comes out as a number between $-1$ and $+1$. Since it can be calculated for any lag $\pm k$, it is generally referred to as the autocorrelation function (ACF). Furthermore, if the series has a stationary mean and variance, it does not matter if we calculate the correlation function (or the autocovariances) backwards or forwards, $\rho_k = \rho_{-k}$.

The ACF tells us the following: the higher the value of $\rho_k$, the stronger is the memory of the series. By studying how the autocorrelations change as the lag $k$ increases, we can see if they tend to die out slowly or quickly, or remain constant for a given number of lags.³ If the ACF starts close to unity and dies out slowly, this is a sign of a non-stationary variable. On the other hand, if the ACF is zero at all lags, it is a sign of a white noise process where no historical values can predict coming observations of the same series.

For a random time series process, the sample autocorrelation function becomes

$$\hat{\rho}_k = \frac{\frac{1}{T-k}\sum_{t=k+1}^{T}(y_t - \bar{y})(y_{t-k} - \bar{y})}{\frac{1}{T}\sum_{t=1}^{T}(y_t - \bar{y})^2}, \qquad k = 0, 1, 2, 3, \ldots,$$

where $T$ is the number of observations, and $\bar{y}$ is the sample mean, $\bar{y} = (1/T)\sum_{i=1}^{T} y_i$. In practical work, the standard assumption is a constant variance over the sample, so that $\mathrm{var}(y_t) = \mathrm{var}(y_{t-k})$.
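As a minimal sketch of the sample autocorrelation formula above, the following Python function computes $\hat{\rho}_k$ with the same scaling as in the text ($1/(T-k)$ in the numerator, $1/T$ in the denominator). The function name `sample_acf` is only illustrative, and the data are the yearly interest rates used as an example earlier in the chapter.

```python
def sample_acf(y, max_lag):
    """Sample autocorrelations for lags 0..max_lag.

    Follows the formula in the text: the autocovariance at lag k is scaled
    by 1/(T-k) and the variance by 1/T.
    """
    T = len(y)
    ybar = sum(y) / T
    dev = [v - ybar for v in y]
    gamma0 = sum(d * d for d in dev) / T
    acf = []
    for k in range(max_lag + 1):
        # Products (y_t - ybar)(y_{t-k} - ybar) for t = k+1, ..., T.
        cross = sum(dev[t] * dev[t - k] for t in range(k, T))
        acf.append(cross / (T - k) / gamma0)
    return acf


# Yearly interest rates from the example earlier in the chapter.
rates = [6.6, 7.5, 5.9, 5.4, 5.5, 4.5, 4.3, 4.8]
acf = sample_acf(rates, max_lag=2)  # T/4 = 2 lags for T = 8
```

With only eight observations the estimates are very imprecise, so the example should be read as an illustration of the arithmetic and not as evidence about the memory of interest rates.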
The sample autocorrelations are estimates of random variables, and are therefore associated with variances. Bartlett (1946) shows that the variance of the k:th sample autocorrelation is approximately

$$\mathrm{var}(\hat{\rho}_k) \approx \frac{1}{T}\Big(1 + 2\sum_{j=1}^{k-1}\hat{\rho}_j^{2}\Big).$$

Given the variance, and thereby the standard deviation, of the estimate, it becomes possible to set up a significance test. Asymptotically, this t-test has a normal distribution, with an expected value of zero under the null of no autocorrelation (no memory in the series). In a limited sample, a value of $\hat{\rho}_k$ larger than two times its standard error is considered significant.

The next question is how much autocorrelation is left between the observations at $t$ and $t-k$ ($\tilde{Y}_t$ and $\tilde{Y}_{t-k}$) after we remove (condition on) the autocorrelation that runs through the observations in between. Removing this autocorrelation means that we first calculate the mean of $\tilde{Y}_t$ conditional on all observations between $\tilde{Y}_{t-1}$ and $\tilde{Y}_{t-k+1}$. Another way of expressing this is to say that we filter $\tilde{Y}_t$ from the influence of all lags of $\tilde{Y}_t$ between $t-1$ and $t-k+1$. Using the expectations operator, we define the conditional mean as

$$E\{\tilde{Y}_t \mid y_{t-1}, y_{t-2}, \ldots, y_{t-k+1}\} = \tilde{Y}_t^{*}.$$

The partial autocorrelation is then the slope coefficient in a regression between $\tilde{Y}_t^{*}$ and $\tilde{Y}_{t-k}$. This leads to the following definition of the partial autocorrelation function,

$$\phi_k = \frac{\mathrm{cov}(\tilde{Y}_t, \tilde{Y}_{t-k} \mid y_{t-1}, \ldots, y_{t-k+1})}{\mathrm{var}(\tilde{Y}_{t-k})}. \tag{7.2}$$

The partial autocorrelation at lag $k$ can be recognised as the coefficient on the lag at $t-k$ in the autoregressive regression

$$y_t = a_0 + a_1 y_{t-1} + \cdots + \phi_k y_{t-k} + e_t. \tag{7.3}$$

Notice the difference: (7.2) defines the partial autocorrelation $\phi_k$, while the regression (7.3) is a way of estimating it. The first partial autocorrelation is estimated by regressing $y_t$ on $y_{t-1}$, the second partial autocorrelation is estimated by regressing $y_t$ on $y_{t-1}$ and $y_{t-2}$, and so on.⁴

The partial autocorrelation function can be estimated through regression techniques, by the so-called Yule-Walker estimator, or alternatively by recursive techniques (Durbin 1961). The recursive technique utilises the fact that the first autocorrelation is equal to the first partial autocorrelation, $\hat{\rho}_1 = \hat{\phi}_1$. Given $\hat{\phi}_1$, the higher order $\phi_i$ are then solved step by step in a recursive equation system.

The complicating factor is to estimate the variance of the partial autocorrelation function. If a regression technique is used, the estimated regression variance of $\hat{\phi}_k$ is not a correct estimate of the variance, because until the residual process is white noise, or at least free from autocorrelation, the estimated variance is inefficient. Furthermore, the other (older) techniques of estimating the PACF do not involve a variance estimate in the same way as the OLS estimator of $\phi_k$. The solution, therefore, is to assume that the estimated $\hat{\phi}_k$:s are a white noise process. Anderson (1944) shows that the asymptotic variance for a white noise series is $1/T$.

³ Standard practice is to calculate the first $K \leq T/4$ sample autocorrelations.
This leads to the (asymptotic) significance test, $\hat{\phi}_k/(1/\sqrt{T})$. As a practical rule of thumb, in a limited sample, a test statistic greater than 2 in absolute value is considered significant, and leads to a rejection of the null of $\phi_k = 0$.

The PACF informs about the length of an autoregressive process. The necessary number of lags to describe an autoregressive process of order $p$ ends at $\phi_p$. A closer look at these measures, and the way they are calculated, reveals that they are only interesting for stationary series. The same holds for the mean and the variance, and other moments.

The two measures, the ACF and the PACF, are complementary to other descriptive devices, such as the mean, the variance, kurtosis, etc. The ACF and the PACF describe the memory of a process. They explain if and how a series can be predicted from its own past. They help us to identify which type of process we are studying: a white noise process, an integrated process, an AR process, an MA process, or an ARMA process. A white noise series is recognized by its lack of significant ACF and PACF coefficients. Integrated variables are identified by the fact that their ACF dies out very slowly, in combination with at least one PACF coefficient close to unity. Stationary ARMA models are identified with the following identification scheme:

            ACF                  PACF
AR(p)       Tails off            Cuts off at lag p
MA(q)       Cuts off at lag q    Tails off
ARMA        Tails off            Tails off

⁴ Notice that in the regression, the parameters $a_1, a_2, \ldots, a_{k-1}$ are not identical to $\phi_1, \phi_2, \ldots, \phi_{k-1}$, due to the (possible) correlation between $y_{t-1}$ and lower order lags like $y_{t-2}$ etc. The regression formula only identifies the last coefficient, at lag $k$, as the PACF $\phi_k$.
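The recursive route to the partial autocorrelations, and the first row of the identification scheme, can be illustrated with a short sketch. It assumes the standard Durbin-Levinson recursion, which starts from $\phi_1 = \rho_1$ and solves the higher orders step by step; the function name `pacf_from_acf`, the AR(1) parameter 0.5 and the sample size 100 are only illustrative.

```python
import math


def pacf_from_acf(rho):
    """Partial autocorrelations phi_1..phi_K from autocorrelations rho[0..K].

    Durbin-Levinson recursion; rho[0] is assumed to equal 1.
    """
    pacf = []
    prev = []  # coefficients phi_{k-1,1}, ..., phi_{k-1,k-1} of the previous step
    for k in range(1, len(rho)):
        if k == 1:
            phi_kk = rho[1]
        else:
            num = rho[k] - sum(prev[j] * rho[k - 1 - j] for j in range(k - 1))
            den = 1.0 - sum(prev[j] * rho[j + 1] for j in range(k - 1))
            phi_kk = num / den
        # Update the lower-order coefficients, then append the new last one.
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf


# Theoretical ACF of an AR(1) process with a_1 = 0.5: rho_k = 0.5 ** k.
ar1_rho = [0.5 ** k for k in range(5)]
ar1_pacf = pacf_from_acf(ar1_rho)

T = 100                     # hypothetical sample size
band = 2.0 / math.sqrt(T)   # rule-of-thumb significance limit for a PACF estimate
```

For this AR(1) example the ACF tails off geometrically while the PACF is 0.5 at lag one and zero afterwards, which is the "cuts off at lag p" pattern in the table.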
The identification scheme above is a direct consequence of the properties of each type of model, and the properties of each model can be calculated theoretically. These calculations are an important part of time series analysis and we will come back to them below.

The idea behind ARIMA modeling is to first calculate the ACF and the PACF and use these to form an idea about the order of integration and the orders of $p$ and $q$. The second step, given what we know about the orders $d$, $p$ and $q$, is to estimate an ARIMA model. The third step is to test the estimated model for autocorrelation in the residual. The fourth step is to re-estimate models to find the best model according to three criteria: i) no autocorrelation, ii) the lowest possible residual variance, and iii) not so many parameters that the model becomes too complex.

7.1.1 Weak and Strong Stationarity

A fundamental issue when analyzing time series processes is whether they are stationary or not. As a first, general definition, we can say that a non-stationary series changes its behavior over time, such that the mean is changing over time. Many economic time series are non-stationary in the sense that they are growing over time, their estimated variances are also growing, and the covariance function never dies out. In other words, the calculated mean, autocovariances etc. depend on the time period we study, and inference becomes impossible. A stationary series, on the other hand, displays a behavior which is independent of the time period, and it becomes possible to test for significance.

Non-stationarity must either be removed before modeling or included in the model. This requires that we know what type of non-stationarity we are dealing with. The problem with non-stationarity is that a series can be non-stationary in an infinite number of ways.
And, to make the problem even more complex, some types of non-stationarity will skew the distributions of the estimates such that inference based on standard distributions, such as the $t$, the $F$ or the $\chi^2$ distributions, is not only wrong but completely misleading. In order to model time series, we need to understand what non-stationarity is, how to estimate it and how to deal with it.

7.1.2 Weak Stationarity, Covariance Stationary and Ergodic Processes

Of the two concepts, weak stationarity is the practical one. Weak stationarity is defined in terms of the first two moments of the process, the mean and the variance. A process $\{x_t\}$ is (weakly) stationary if (1) the mean is independent of time $t$,

$$E\{x_t\} = \mu,$$

(2) the variance exists and is less than infinity,

$$\mathrm{var}\{x_t\} = \sigma^2 < \infty,$$

and (3) the autocovariance is

$$\mathrm{cov}\{x_t, x_{t-k}\} = \gamma_k.$$
Thus, the mean and the variance are constant over time, and the covariance between two values of the process is only a function of the distance between the two points. A related concept is that of covariance stationarity: if the autocovariances go to zero as the distance between the two points increases, the series is said to be covariance stationary (or ergodic),

$$\mathrm{cov}(x_t, x_{t-k}) \rightarrow 0 \quad \text{as } k \rightarrow \infty.$$

This definition brings us to the concept of ergodicity, which can be understood as a weak form of average asymptotic independence. The most important condition, necessary but not sufficient, for a series to be ergodic is

$$\lim_{T \rightarrow \infty}\Big(\frac{1}{T}\sum_{k=1}^{T}\mathrm{cov}(x_t, x_{t-k})\Big) = 0.$$

Compared with the former concept, $\mathrm{cov}(x_t, x_{t-k}) \rightarrow 0$, ergodicity implies a restriction on the strength of the covariance structure. As more and more autocovariances are calculated, their mean should go to zero. The term ergodic is used in connection with stationarity conditions.

7.1.3 Strong Stationarity

Strong stationarity is defined in terms of the distribution function of $\{x_t\}$. Suppose a process that is ordered from observation 1 up to observation $T$. Each observation up to $T$ can be thought of as a random variable. Hence we can write the first variable in the process as $x_{t_1}$, the second variable as $x_{t_2}$, etc., up until $x_{t_T}$. The distribution function for this process is $F(x_{t_1}, x_{t_2}, \ldots, x_{t_T})$. Next, define the distribution function of $\{x_t\}$ for another time interval, namely $t + j$, where $j = 1, 2, \ldots, T$. This leads to the distribution function $F_j(x_{t_1+j}, x_{t_2+j}, \ldots, x_{t_T+j})$. Strong stationarity requires that the two distribution functions are identical, such that

$$F(x_{t_1}, x_{t_2}, \ldots, x_{t_T}) = F_j(x_{t_1+j}, x_{t_2+j}, \ldots, x_{t_T+j}),$$

meaning that the characteristics of the process are independent of time. We will get the same means, etc., independently of the time period we choose for our calculations. By letting $j$ take different integer values we get the $j$:th order strong stationarity. Thus, $j = 1$ leads to first order (strong) stationarity, etc.
Strong stationarity incorporates the definition of weak stationarity, provided the first two moments exist. But the practical problem is that it is difficult to work with distribution functions for continuous random variables, so strong stationarity is mainly a theoretical concept.

In this chapter we deal with a very broad class of models named ARMA models, autoregressive moving average models. These are a set of models that describe the process $\{x_t\}$ as a function of its own lags and a white noise process. The autoregressive model of order $p$, AR(p), is defined as

$$x_t = a_0 + a_1 x_{t-1} + \cdots + a_p x_{t-p} + e_t,$$

where $e_t$ is a white noise process. A moving average model of order $q$, MA(q), is defined as

$$x_t = a_0 + e_t - b_1 e_{t-1} - \cdots - b_q e_{t-q},$$

where $e_t$ is a white noise process. The combination of autoregressive and moving average processes gives the ARMA(p, q) model

$$x_t = a_0 + a_1 x_{t-1} + \cdots + a_p x_{t-p} + e_t - b_1 e_{t-1} - \cdots - b_q e_{t-q}.$$

In addition we have integrated processes. An integrated process is defined as follows: a process $x_t$ is said to be integrated of order $d$, written I(d), if it contains no deterministic components, is non-stationary in levels, but becomes stationary after differencing $d$ times. Thus, a stationary series is denoted $x_t \sim I(0)$, a first order integrated series is denoted I(1), etc.

To analyse time series it is necessary to introduce additional descriptive statistical tools besides means and variances. Then, to handle the equations in an efficient way, we need a set of operators. Also, we need to classify time series as stationary or non-stationary. The descriptive devices are autocovariances, autocorrelations and partial autocorrelations. An important classification is stationarity or non-stationarity. For this purpose we need the concepts of weak and strong stationarity, and ergodic processes. The operators needed are the sum operator, the lag operator and the difference operator.

7.1.4 Finding the Optimal Lag Length and Information Criteria

In empirical work, the question is how to find the correct lag length. If we choose too few lags, the model will by definition be misspecified, and the assumption of a normally distributed white noise residual will be wrong. On the other hand, adding more lags to the AR or MA process will make the model capture more of the possible memory of the process, but the estimates will be inefficient. We need to add as few lags as possible, without rejecting the assumption of white noise residuals.

The Box-Jenkins method suggests that we start with a relatively large number of lags and test for autocorrelation. Among those models which have no significant autocorrelation, we then pick the model with the lowest possible value of the information criterion.
In the Box-Jenkins approach, testing for white noise is equal to testing for autocorrelation. The typical test for autocorrelation is the Box-Pierce test, also known as the portmanteau test, sometimes as the Q-test or the Ljung-Box test. To test for $p$:th order autocorrelation in a mean adjusted series, $\hat{\varepsilon}_t$, calculate the $k$:th order autocorrelation coefficient,

$$\hat{\rho}_k = \frac{\sum_{t=k+1}^{T}\hat{\varepsilon}_t\hat{\varepsilon}_{t-k}}{\sum_{t=1}^{T}\hat{\varepsilon}_t^{2}}$$

for $k = 1, 2, \ldots, p$. The Box-Pierce test statistic is then given by

$$BP = T\sum_{k=1}^{p}\hat{\rho}_k^{2}.$$

Under the null of no autocorrelation this test statistic has a $\chi^2(p)$ distribution. The Box-Pierce statistic is best suited for testing the residual in an AR model. A modification, for ARMA and more general regression models, is the so-called Box-Ljung statistic,

$$BL = T(T+2)\sum_{k=1}^{p}\frac{\hat{\rho}_k^{2}}{T-k},$$

which is also distributed as $\chi^2(p)$.

Given that the residuals of the estimated ARMA model do not display autocorrelation, we can turn to the optimal lag length. Information criteria are simply versions of adjusted $R^2$ values. In an ordinary linear regression, as more explanatory variables are added to the model, the $R^2$ value will go up, and the efficiency of the estimated parameters will go down. To compare the $R^2$ values of the same model, estimated with more or fewer explanatory variables, it is necessary to look at the so-called adjusted $R^2$ values. The principle behind an information criterion is to create a measure that rewards us in the modelling process for reducing the residual variance, but punishes us for adding too many lags, which makes the estimates inefficient and the prediction intervals too wide.

There are several information criteria. They are developed for special situations. In practice, however, they often tend to give the same answer in the end. The most well-known criterion is Akaike's Information Criterion (AIC). If we estimate an autoregressive model with $k$ lags from a sample of $T$ observations, Akaike's information criterion is

$$AIC = \log\hat{\sigma}_{\varepsilon}^{2} + 2k/T,$$

where $\hat{\sigma}_{\varepsilon}^{2}$ is the estimated residual variance. Since an estimated residual variance gets smaller the more lags there are in the model, the last term ($2k/T$) tries to compensate for the number of estimated parameters in the model. The smaller the value of the information criterion the better is the model, as long as there is no autocorrelation. For models with both AR and MA components Hannan and Rissanen suggested a different criterion,

$$\log\hat{\sigma}_{\varepsilon}^{2} + (p + q)(\log T)/T,$$

where $p$ and $q$ are the lag orders of the autoregressive and the moving average parts of the model. As for Akaike's criterion, the smaller the value the better the model.
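The portmanteau statistics and Akaike's criterion defined in this section involve only simple sums, which the following sketch spells out. The function names and the residual series are hypothetical and chosen for illustration; the residuals are assumed to be mean adjusted already.

```python
import math


def residual_autocorr(e, k):
    """k:th order autocorrelation of a mean-adjusted residual series e."""
    num = sum(e[t] * e[t - k] for t in range(k, len(e)))
    den = sum(v * v for v in e)
    return num / den


def box_pierce(e, p):
    """BP = T * sum_{k=1}^{p} rho_k^2."""
    T = len(e)
    return T * sum(residual_autocorr(e, k) ** 2 for k in range(1, p + 1))


def box_ljung(e, p):
    """BL = T (T + 2) * sum_{k=1}^{p} rho_k^2 / (T - k)."""
    T = len(e)
    return T * (T + 2) * sum(
        residual_autocorr(e, k) ** 2 / (T - k) for k in range(1, p + 1)
    )


def aic(e, n_lags):
    """AIC = log(estimated residual variance) + 2 k / T."""
    T = len(e)
    sigma2 = sum(v * v for v in e) / T
    return math.log(sigma2) + 2.0 * n_lags / T


# Hypothetical residuals that alternate in sign, i.e. strongly autocorrelated.
resid = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
bp = box_pierce(resid, p=1)
bl = box_ljung(resid, p=1)
```

Here the first order autocorrelation is $-5/6$, so both statistics exceed 3.84, the 5 per cent critical value of the $\chi^2(1)$ distribution, and white noise residuals would be rejected. The Box-Ljung value is the larger of the two, reflecting its small sample correction.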
From these two original criteria a number of different criteria have been developed, such as Schwarz's information criterion (SIC), the Bayesian information criterion (BIC) and Hatami's information criterion (HIC).

7.1.5 The Lag Operator

When dealing with time series and dynamic econometric models, the expressions are easier to handle with the backward shift operator ($B$) or the lag operator ($L$).⁵ The backward shift operator is the symbol most often used in statistical textbooks. Econometricians tend to use the lag operator more often. The first order lag operator is defined as

$$Lx_t = x_{t-1}, \tag{7.4}$$

or more generally as the $n$:th order lag operator,

$$L^{n}x_t = x_{t-n}. \tag{7.5}$$

The lag operator is an expression such that when it is multiplied with an observation at any given time, it will shift the observation one period backwards in time. In other words, the lag operator can be viewed as a time traveling device, which makes it possible to travel both forward and backwards in time. A forward shift operator can be constructed along the same lines. Thus, moving forward $n$ observations in the series from an observation at time $t$ is done by $L^{-n}x_t = x_{t+n}$.

The properties of the lag operator imply that we can write an autoregressive expression of order $p$, AR(p), as

$$a_0 x_t + a_1 x_{t-1} + a_2 x_{t-2} + \cdots + a_p x_{t-p} = a_0 x_t + a_1 Lx_t + a_2 L^{2}x_t + \cdots + a_p L^{p}x_t = (a_0 + a_1 L + a_2 L^{2} + \cdots + a_p L^{p})x_t = A(L)x_t. \tag{7.6}$$

Notice that the lag operator can be moved across the equal sign. The AR(1) model, $x_t = a_1 x_{t-1} + \varepsilon_t$, can be written as $(1 - a_1 L)x_t = \varepsilon_t$, or $A(L)x_t = \varepsilon_t$, or $x_t = [A(L)]^{-1}\varepsilon_t$. If necessary the lag length of the process can be indicated as $A_p(L)$. An ARMA(p, q) process can be written compactly as

$$A_p(L)x_t = B_q(L)\varepsilon_t. \tag{7.7}$$

Skipping the indication of lag lengths for convenience, the ARMA model can be written as $x_t = [A(L)]^{-1}B(L)\varepsilon_t$, or alternatively, depending on the context, as $[B(L)]^{-1}A(L)x_t = \varepsilon_t$. Thus, the lag operator works as any mathematical expression. However, whether or not moving the lag operator around results in a meaningful expression is associated with the principles of stationarity and invertibility, known as duality.

7.1.6 Generating Functions

The function $A(L)$ is a convenient way of writing the sequence. More generally we can refer to any expression of the type $A(L)$ as a generating function. This includes the mean operator, the variance and covariance operators, etc. Generating functions summarize a lot of information about sequences in a compact way and are an important tool in time series analysis. Their main advantage is that they save time and make the expressions much simpler, since a number of mathematical operations can be applied to generating functions. As an example, given certain conditions concerning the sum of the $a_i$, we can invert $A(L)$, so that $A(L)^{-1}A(L) = 1$.

⁵ The practical difference between using the lag operator or the backward shift operator is that the lag operator also affects the conditional expectations generator $E_t$, which is of interest when working with economic theories dealing with expectations.
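To make the inversion of $A(L)$ concrete, the sketch below treats a lag polynomial as a list of coefficients on $L^0, L^1, L^2, \ldots$ and multiplies polynomials term by term. It assumes the AR(1) case $A(L) = 1 - a_1 L$ with the illustrative value $a_1 = 0.5$, and truncates the inverse after 20 terms; the helper `poly_mul` is hypothetical and not part of any library.

```python
def poly_mul(a, b):
    """Multiply two lag polynomials given as coefficient lists [c0, c1, ...],
    where c_i is the coefficient on L^i."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


a1 = 0.5
A = [1.0, -a1]                        # A(L) = 1 - a1 L
n = 20
A_inv = [a1 ** i for i in range(n)]   # truncated inverse: 1 + a1 L + a1^2 L^2 + ...
product = poly_mul(A_inv, A)          # should be close to the polynomial "1"
```

The product has a one on $L^0$, zeros on $L^1$ to $L^{19}$, and a remainder of $-a_1^{20}$ on $L^{20}$. The remainder vanishes as more terms are added only because $|a_1| < 1$, which is the kind of condition on the coefficients referred to in the text.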
The generating function for the lag operator is

$$D(L) = \sum_{i=0}^{k} d_i L^{i}, \tag{7.8}$$

where $d_i$ is generated by some other function. The point here is that it is often easier to do manipulations on $D(L)$ directly than on each individual element in the expression. In the example above, we would refer to $A(L)x_t$ as the generating function of $x_t$.

A property of generating functions is that they are additive. If we have two series, $a_i$ and $b_i$, with $i = 0, 1, 2, \ldots$, and define a third series as $c_i = a_i + b_i$, it then follows that

$$C(L) = A(L) + B(L). \tag{7.9}$$
Another property is that of convolution. Take the series $a_i$ and $b_i$ from above; a new series $d_i$ can then be defined by

$$d_i = a_0 b_i + a_1 b_{i-1} + a_2 b_{i-2} + \cdots + a_i b_0 = \sum_{h=0}^{i} a_h b_{i-h}. \tag{7.10}$$

In this case we write $D(L)$ as

$$D(L) = A(L)B(L). \tag{7.11}$$

The results stated in this section should be compared with chapter 19, below, which shows how long-run multipliers, etc. can be derived from the lag operator.

7.1.7 The Difference Operator

Given the definition of the lag operator (or the backward shift operator), the difference operator ($\Delta$) is defined as

$$\Delta = 1 - L, \tag{7.12}$$

which for a variable $x_t$ leads to $\Delta x_t = (1 - L)x_t = x_t - x_{t-1}$. Notice that in time series statistics the difference operator is usually denoted by $\nabla$. In practice the $\Delta$-symbol denotes taking first differences of a discrete variable. For a continuous variable, taking first differences implies taking the derivative with respect to time. If $x(t)$ is a continuous time stochastic variable,

$$Dx = dx/dt, \tag{7.13}$$

where $D = d/dt$.

Differences of higher order are denoted in the same way as for the lag operator. Thus for the second difference of $x_t$ we write

$$\Delta^{2}x_t = (1 - L)^{2}x_t = (1 - 2L + L^{2})x_t = x_t - 2x_{t-1} + x_{t-2}. \tag{7.14}$$

Higher order differences are given as $\Delta^{d}x_t = (1 - L)^{d}x_t$. Notice the difference between the difference operators $\Delta^{d}x_t$ and $\Delta_{s}x_t$. The first is the conventional difference operator; the second is the seasonal difference operator, such that $\Delta_{s}x_t = (1 - L^{s})x_t = x_t - x_{t-s}$. The subscript $s$ indicates the interval over which we take the (seasonal) difference. If $x_t$ is quarterly, setting $s = 4$ leads to the yearly changes in the series. This new series can then be differenced by using the difference operator, $\Delta^{d}\Delta_{s}x_t = \Delta^{d}(1 - L^{s})x_t$.
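The difference operators are easy to try out on data. The following sketch uses a hypothetical quarterly series, built as a linear trend plus a fixed seasonal pattern, to show the ordinary, second and seasonal differences; the function name `difference` and all numbers are assumptions made for the illustration.

```python
def difference(x, lag=1):
    """Apply (1 - L^lag): returns x_t - x_{t-lag} for every t where both exist."""
    return [x[t] - x[t - lag] for t in range(lag, len(x))]


# Hypothetical quarterly series: linear trend plus a fixed seasonal pattern.
season = [2.0, -1.0, 0.5, -1.5]
x = [0.5 * t + season[t % 4] for t in range(16)]

d1 = difference(x)               # first difference, (1 - L) x_t
d2 = difference(difference(x))   # second difference, x_t - 2 x_{t-1} + x_{t-2}
d4 = difference(x, lag=4)        # seasonal difference, (1 - L^4) x_t
d1d4 = difference(d4)            # first difference of the seasonal difference
```

In this constructed case the seasonal difference removes the seasonal pattern and leaves the constant yearly change of 2.0, and differencing once more gives zeros. Each differencing step shortens the sample: `d1` has 15 observations, `d4` has 12 and `d1d4` has 11.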
7.1.8 Filters

The generating functions take us to the concept of filters. If $x_t$ is an AR(p) process, then the autoregressive part of this model can be thought of as a filter, such that if we multiply $x_t$ with $A_p(L)$ the result is a white noise process. In the same way, given a white noise series $e_t$ and some filter $B(L)$, $B(L)e_t = y_t$ generates the series $y_t$. Alternatively, think of $S(t)$ as the seasonal component of the series $x_t$, or in other words the seasonal filter. Multiplying $x_t$ with $S(t)$, or in a linear relation subtracting $S(t)x_t$ from $x_t$, gives a deseasonalised variable. Thus, in this context the term filter is a broad concept, which indicates that we can transform series in different ways. From white noise we can produce ARIMA processes, or we can extract certain components out of a series.

7.1.9 Dynamics and Stability

Given the parameters of an autoregressive process we may ask if the process is stationary or not. Starting from a steady state solution, will a shock to the process, given its parameters, result in an explosion of the series, in infinite growth, or in a temporary deviation from the steady state? The answers to these questions are given by analysing the roots of the polynomial given by the autoregressive process $A(L)$.

An autoregressive process can always be expressed as a stochastic difference equation, and we can deal with it in the same way as with a normal difference equation. Starting from $A(L)y_t = \varepsilon_t$, subtracting $y_{t-1}$ from both sides leads to the difference equation, $\Delta y_t = A^{*}(L)y_{t-1} + \varepsilon_t$. The solution of this equation is

$$y_t = y_p + y_c, \tag{7.15}$$

where $y_p$ represents the particular solution, the long-run steady state equilibrium or the stationary long-run mean of $y_t$, and $y_c$ represents the complementary solution, the deviation from the long-run steady state. Dynamic stability requires that $y_c$ vanishes as $T \rightarrow \infty$. The roots of the polynomial $A(L)$ tell us if this occurs. Given a change in $\varepsilon_t$, what will happen to $y_{t+1}, y_{t+2}, \ldots, y_{t+\infty}$?
Will $y_{t+\infty}$ explode, continue to grow for ever, or change temporarily until it returns to the steady state equilibrium described by $y_p$? The roots are given by solving for the $r$:s in the following equation,

$$r^{p} + a_1 r^{p-1} + a_2 r^{p-2} + \cdots + a_p = 0. \tag{7.16}$$

This equation leads to the latent roots of the polynomial. The condition for stability, when using the latent roots, is that the roots should be less than unity in absolute value, or that the roots should be inside the unit circle. Roots equal to unity, so-called unit roots, imply an ever-growing series (stochastic trend); roots greater than unity imply an explosive process. Complex roots suggest that the adjustment is cyclical. Though not very likely, the process could follow an explosive cyclical path or have cyclical permanent shocks. If the process is stationary, following a shock, $y_t$ will return to its stationary long-run mean.

The case with one or several unit roots is of particular interest because it represents stochastic growth in a non-stationary variable. Series with one or more unit roots are also called integrated series. Many economic time series appear to have a unit root, or roots close to unity.

Using latent roots to define stability is common, but it is not the only way to define stability. Latent roots, or eigenvalues, are motivated by the fact that they are easier to work with when matrix algebra is used. An alternative way of defining stability is to solve for the roots ($\lambda$) in the following equation,

$$1 + a_1\lambda + a_2\lambda^{2} + \cdots + a_p\lambda^{p} = 0. \tag{7.17}$$

If the roots are greater than unity in absolute value, $|\lambda| > 1$, that is, if they lie outside the unit circle, the process is stationary. If the roots are less than unity the process is explosive. The historical literature on time series uses both definitions; however, latent roots, or eigenvalues, are now the established standard.

7.1.10 Fractional Integration

7.1.11 Building an ARIMA Model. The Box-Jenkins Approach

The Box-Jenkins approach is a practical way of finding a suitable ARMA representation of a given time series. The steps are:

1) Identification. Determine (i) if seasonal differencing is necessary to remove seasonal factors, (ii) the number of times the series needs to be differenced to achieve stationarity, and (iii) study the ACF and PACF to determine a suitable order of the ARMA process.

2) Estimation. The identification step (1) leads to a stationary series and (2) narrows down the possible ARMA(p, q) processes of interest to estimate. Methods of estimation? Remember problems with t-values?!

3) Testing. Test the estimated model(s) for white noise residuals, using the Box-Pierce test for autocorrelation. Among models with white noise residuals, pick the one with the smallest information criterion (AIC, BIC). Differences among information criteria?

This leads quickly to a forecast model, or a representation of the expectations generating mechanism that can be used in simple (rational) expectations modeling.

Limitations of univariate ARIMA models: most economic problems are multivariate. Variables depend on each other. Furthermore, the test procedure is only aimed at finding a forecast model. To build an econometric model that can be used for inference, the demands for testing are higher.

7.1.12 Is the ARMA model identified?

The parameters of an ARMA model might not be unique.
To see the conditions for uniqueness, decompose the polynomials of the ARMA process $A(L)y_t = B(L)\varepsilon_t$ into their factors⁶ as

$$A(L) = \prod_{i=1}^{p}(1 - \lambda_i L), \tag{7.18}$$

and

$$B(L) = \prod_{j=1}^{q}(1 - \delta_j L). \tag{7.19}$$

⁶ If $A(L)$ contains the polynomial $1 - L$ the process is said to have a unit root.
For a unique representation of the ARMA process there should be no common factors, like $(1 - \lambda_m L) = (1 - \delta_k L)$. If there is a common factor, it is possible to take any other polynomial $C(L)$ of finite order ($< p$), and multiply both sides of the ARMA process such that

$$C(L)A(L)y_t = C(L)B(L)\varepsilon_t \tag{7.20}$$

leads to

$$A^{*}(L)y_t = B^{*}(L)\varepsilon_t. \tag{7.21}$$

Thus, in the case of a common factor there is no unique representation of the parameters in $A(L)$ and $B(L)$.

7.2 Theoretical Properties of Time Series Models

7.2.1 The Principle of Duality

There is a link between AR and MA models, as the presentation of the lag operator indicated. An AR process with an infinite number of lags can under certain conditions be rewritten as a finite MA process. In a similar way an infinite moving average process can be inverted to an autoregressive process of finite order.

These results have two practical implications. The first is that in practical modelling, a long MA process can often be rewritten as a shorter AR process instead, and the other way around. The second implication is that the two processes are complementary to each other. The combination of AR and MA into ARMA will lead to relatively parsimonious models, meaning models with quite few parameters. In fact, it is quite uncommon to find ARMA models above the order $p = 2$ and $q = 2$.

The AR(1) process, $y_t = a_1 y_{t-1} + \varepsilon_t$, can be written as $(1 - a_1 L)y_t = \varepsilon_t$, and in the next step as $y_t = (1 - a_1 L)^{-1}\varepsilon_t$. The term $(1 - a_1 L)^{-1}$ represents the sum of an infinite moving average process,

$$y_t = \frac{1}{1 - a_1 L}\varepsilon_t = \sum_{i=0}^{\infty} b_i\varepsilon_{t-i} = B_{\infty}(L)\varepsilon_t,$$

where $b_0 = 1$. In the same way, an MA(1) process, $y_t = \varepsilon_t - b_1\varepsilon_{t-1}$, can be written as an infinite autoregressive process, $\sum_{i=0}^{\infty} a_i y_{t-i} = A_{\infty}(L)y_t = \varepsilon_t$. These transformations can be generalized for AR(p) and MA(q) processes, as well as for vector processes. The question is, when are these transformations meaningful? An AR process can always be inverted, but it will only have a (meaningful) summable MA process if it is stationary.
Another way to state this condition is to say that the (latent) roots of $A(L) = 0$ should be less than unity in absolute value (inside the unit circle). An MA process, on the other hand, is always stationary, since $\varepsilon_t$ is by definition a stationary process. However, an MA process can only be inverted if the latent roots of the polynomial $B(L) = 0$ are less than unity, that is, if the roots are inside the unit circle. (Notice that we refer to the latent roots. If we switch to the ordinary roots, the requirement is that they should be outside the unit circle, larger than one. See this text for the definitions of inside and outside the unit circle!) Thus, an MA process is always stationary, but only invertible if the latent roots of $B(L)$ are inside the unit circle. An AR process is always invertible to an infinite
MA process but only stationary if the latent roots of $A(L)$ are inside the unit circle. The latter has one interesting implication: it is often convenient to rewrite an AR or a VAR in moving average form and investigate the properties and consequences of non-stationarity from the MA representation. The conditions are similar, and actually more general, for multivariate processes, so that a VAR(p) has a corresponding MA(q) representation.

7.2.2 Wold's decomposition theorem

Linear ARIMA models are reasonably good approximations to many empirical time series processes. A theoretical result which suggests why ARIMA models are useful approximations is offered by Wold's decomposition theorem, Wold (1954). The theorem says that any covariance stationary process can be uniquely represented as the sum of two uncorrelated processes, $x_t = d_t + y_t$, where $d_t$ is a linearly deterministic process and $y_t$ is an infinite moving average process, MA($\infty$). Thus, we can write $x_t$ as

$$x_t = d_t + \sum_{j=0}^{\infty} b_j e_{t-j},$$

where $b_0 = 1$, and $e_t$ is stationary (white noise), such that $\sum_{j=0}^{\infty} b_j^2 < \infty$, $E(e_t) = 0$, $E(e_t^2) = \sigma^2$ and $E(e_t e_{t-j}) = 0$ for $j \neq 0$.

The theorem has two implications. The first is that any series which appears to be covariance stationary can be modeled as an infinite MA process. Given the principle of duality, we can expect to find a finite autoregressive process as well. Since many economic time series are covariance stationary after first differencing, we expect ARMA models, as well as linear autoregressive distributed lag models, to work quite well for these series. The second implication is that we should be able to extract a white noise process out of any covariance stationary process. This leads to the conclusion that finding (or constructing) a white noise process in an empirical model is a basic necessity in the modeling process, because most economic time series are covariance stationary after differencing.
The presentation above has focused on the practical side of time series modelling. Time series can also be described and analysed theoretically. Consider the AR(1) model $y_t = a_1 y_{t-1} + \varepsilon_t$. The series $y_t$ is generated by the parameter $a_1$, the white noise process $\varepsilon_t$ and some initial value $y_0$ at the beginning of time, say $t = 0$. Thus, given an initial value, a parameter $a_1$ and a random number generator that generates $\varepsilon_t \sim N(0, \sigma^2)$, where we for simplicity can set $\sigma^2 = 1$, it becomes possible to generate possible series of $y_t$ using Monte Carlo techniques. The different outcomes of the series $y_t$ can then be used to estimate the distribution of $\hat{a}_1$, to learn how to do inference in small and medium sized samples, and to understand these distributions.

We can also calculate the mean and the variance of $y_t$. The observations of $y_t$ are not independent, since the series is autoregressive. Therefore, the sample mean and variance of the observed $y_t$ are not informative for describing the series. Instead, look at the mean of the zero mean (no constant) AR(1) process in the form of the expected value, $E(y_t) = E(a_1 y_{t-1}) + E(\varepsilon_t)$. Looking at the expression, the left hand side tells us that the right hand side represents the mean of $y_t$. The expected value of a white noise process is by definition zero, so $E(\varepsilon_t) = 0$. Since $a_1$ is a given constant we have for the other factor $E(a_1 y_{t-1}) = a_1 E(y_{t-1})$. To find an answer we need to substitute the lags $y_{t-1}$, $y_{t-2}$, etc.
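The Monte Carlo idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the true parameter ($a_1 = 0.5$), the sample size (50), the number of replications (2000), the seed and the function names are all chosen here and do not come from the text. The estimator is OLS without a constant.

```python
import random
import statistics


def simulate_ar1(a1, n_obs, rng, y0=0.0):
    """Generate y_0, ..., y_n from y_t = a1 * y_{t-1} + e_t with e_t ~ N(0, 1)."""
    series = [y0]
    for _ in range(n_obs):
        series.append(a1 * series[-1] + rng.gauss(0.0, 1.0))
    return series


def ols_ar1(series):
    """OLS estimate of a1 in y_t = a1 * y_{t-1} + e_t (no constant)."""
    numerator = sum(series[t] * series[t - 1] for t in range(1, len(series)))
    denominator = sum(series[t - 1] ** 2 for t in range(1, len(series)))
    return numerator / denominator


rng = random.Random(12345)  # fixed seed so that the experiment can be repeated
true_a1 = 0.5
estimates = [ols_ar1(simulate_ar1(true_a1, 50, rng)) for _ in range(2000)]
mean_estimate = statistics.mean(estimates)
sd_estimate = statistics.stdev(estimates)
print(round(mean_estimate, 3), round(sd_estimate, 3))
```

The list `estimates` is an empirical approximation to the small sample distribution of $\hat{a}_1$. Repeating the experiment with other sample sizes, or with $a_1$ closer to one, shows how that distribution changes.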
For the first lag, by substitution, we get $a_1 E(y_{t-1}) = a_1 E(a_1 y_{t-2} + \varepsilon_{t-1}) = a_1^2 E(y_{t-2})$. Substitute one more time, $a_1^2 E(y_{t-2}) = a_1^2 E(a_1 y_{t-3} + \varepsilon_{t-2}) = a_1^3 E(y_{t-3})$. As we continue substituting backwards we will end up with the initial value. Later we will examine the case of minus infinity. Since the initial value can be seen as a constant, we get as the final product $a_1^t E(y_0) = a_1^t y_0$. (Recall that the expected value of a constant is equal to the constant.) If the initial value is set to zero it follows that $a_1^t y_0 = 0$, and that the mean of $y_t$ is $E(y_t) = a_1^t E(y_0) = 0$. It is standard to assume that the initial value is zero in this type of analysis.

What happens if $y_t$ has a mean different from zero, and if the initial value is different from zero? The answer is simple: as long as we can assume that the AR process is stationary, and therefore that the initial value is a constant, there are no problems. Under these conditions, a non-zero mean can be represented by a constant parameter in the AR process, such as $y_t = \mu + a_1 y_{t-1} + \varepsilon_t$. The expected value of $y_t$ is $E(y_t) = E(\mu) + a_1 E(y_{t-1}) + E(\varepsilon_t)$, which means that the right hand side is $\mu + a_1 E(y_{t-1})$. Again we need to substitute backwards, leading to $\mu + a_1 E(\mu + a_1 y_{t-2} + \varepsilon_{t-1}) = \mu + a_1\mu + a_1^2 E(y_{t-2})$. The next substitution gives $\mu + a_1\mu + a_1^2\mu + a_1^3 E(y_{t-3})$. If we continue substituting back to minus infinity, and set the initial value to zero, we get

$$E(y_t) = \mu(1 + a_1 + a_1^2 + a_1^3 + \dots) = \mu\sum_{i=0}^{\infty} a_1^i = \frac{\mu}{1 - a_1}.$$

The last step is simply an application of the solution to an infinite geometric series, which works in this case as long as the AR process is stationary, $|a_1| < 1$. It is important that you understand the use of the expectations operator in this example because the technique is frequently used to derive a number of results. We could have reached the result in a simpler way if we had used the lag operator.
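The backward substitution for the mean can also be run forwards as a recursion, which gives a quick numerical check of the result $\mu/(1 - a_1)$. A minimal sketch, where the values $\mu = 2$ and $a_1 = 0.5$ and the function name are assumptions made for the illustration:

```python
def expected_path(mu, a1, y0, n_periods):
    """Iterate E(y_t) = mu + a1 * E(y_{t-1}), starting from E(y_0) = y0."""
    path = [y0]
    for _ in range(n_periods):
        path.append(mu + a1 * path[-1])
    return path


mu, a1 = 2.0, 0.5  # hypothetical constant and AR parameter, with |a1| < 1
path = expected_path(mu, a1, y0=0.0, n_periods=60)
long_run_mean = mu / (1.0 - a1)
print(path[:4])
print(path[-1], long_run_mean)
```

Starting the recursion from a non-zero `y0` changes only the early part of the path. With $|a_1| < 1$ the term $a_1^t y_0$ dies out and the expected value approaches the same long-run mean.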
Take the expectation of both sides of $(1 - a_1 L)y_t = \mu + \varepsilon_t$, that is, $E[(1 - a_1 L)y_t] = E(\mu + \varepsilon_t)$. The lag operator is a deterministic factor, so the result is $E(y_t) = \mu/(1 - a_1 L) = \mu/(1 - a_1)$, since lagging a constant leaves it unchanged. Again, the right hand side is the sum of an infinite series. If there is no constant, $\mu = 0$, it follows immediately that $E(y_t) = 0$.

What is the variance of the process $y_t$? The answer is given by understanding that, for the zero mean process, $E(y_t y_t) = Var(y_t) = \gamma_0$. Thus, start from the AR(1) process and multiply both sides with $y_t$ to get $y_t y_t = a_1 y_t y_{t-1} + y_t\varepsilon_t$. Next, take expectations of both sides, $E(y_t y_t) = a_1 E(y_t y_{t-1}) + E(y_t\varepsilon_t)$, and substitute for $y_t$ in the products: $y_t y_{t-1} = (a_1 y_{t-1} + \varepsilon_t)y_{t-1} = a_1 y_{t-1}^2 + \varepsilon_t y_{t-1}$ and $y_t\varepsilon_t = (a_1 y_{t-1} + \varepsilon_t)\varepsilon_t$. From this we have $a_1 E(y_t y_{t-1}) = a_1^2 E(y_{t-1}^2) + a_1 E(\varepsilon_t y_{t-1})$ and $E(y_t\varepsilon_t) = a_1 E(\varepsilon_t y_{t-1}) + E(\varepsilon_t^2)$. In these expressions we have by definition that $E(\varepsilon_t y_{t-1}) = 0$ (recall the basic assumptions of OLS) and that $E(\varepsilon_t^2) = \sigma_\varepsilon^2$. Put the results together,

$$E(y_t y_t) = a_1^2 E(y_{t-1}^2) + \sigma_\varepsilon^2,$$
$$\gamma_0 = a_1^2\gamma_0 + \sigma_\varepsilon^2,$$
$$\gamma_0(1 - a_1^2) = \sigma_\varepsilon^2,$$
$$\gamma_0 = \frac{\sigma_\varepsilon^2}{1 - a_1^2}.$$

The technique is the same for any AR(p) process. From the calculation of the variance we can also see the value of the autocovariance and the autocorrelation coefficients, say at lag $k$. Multiply both sides of
the process with $y_{t-k}$ and solve $E(y_t y_{t-k}) = a_1 E(y_{t-1}y_{t-k}) + E(\varepsilon_t y_{t-k})$. From this it follows that the autocovariance is

$$\gamma_k = a_1^k\,\frac{\sigma_\varepsilon^2}{1 - a_1^2}.$$

The autocorrelation is simply

$$\rho_k = \frac{\gamma_k}{\gamma_0} = a_1^k,$$

where $\gamma_0 = \sigma_\varepsilon^2/(1 - a_1^2)$ is the variance of $y_t$. From this expression it is obvious that the autocorrelation function for the AR(1) process dies out geometrically as the lag length $k$ increases, and slowly when $a_1$ is close to one. Calculating the mean, variance, autocovariances and autocorrelations for AR(1), AR(2), MA(1) and MA(2) processes is a standard exercise in time series courses, followed by an investigation of the unit root case $a_1 = 1$. (To be completed.)

7.3 Additional Topics

7.3.1 Seasonality

Seasonality is an inherent characteristic of most time series data. Seasonality can be dealt with in three ways. The first is to use seasonal dummy variables. The second method is to use seasonal differencing. The third method is to use a program called X12 (previously X11). All methods suffer from the fact that efficient estimation of seasonal effects requires a lot of data observations, which is rare in most applied econometric time series work.

Econometricians tend to use seasonal dummies, since they are easy to use and lead to transparency in the model. Seasonal differencing is the standard method in the Box-Jenkins approach. For a quarterly series seasonal differencing implies differencing in the following way: $(1 - L^4)y_t = y_t - y_{t-4} = \Delta_4 y_t$. The corresponding operator for monthly data is $(1 - L^{12})$. In econometrics, the assumption of seasonal unit roots is difficult to test. There are few clear cut examples of such processes in the literature and the tests for seasonal unit roots are quite complex, especially given the limited samples in econometrics. Thus, econometricians tend to use seasonal differencing when dummy variables do not work. Otherwise, including lags at seasonal frequencies will usually take care of seasonal effects.
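The seasonal difference operator $(1 - L^4)$ is easy to illustrate. The sketch below uses a made-up quarterly series (a linear trend plus a fixed seasonal pattern). Both the data and the function name are assumptions for the example and are not taken from the text.

```python
def seasonal_difference(series, period):
    """Apply (1 - L^period): return y_t - y_{t-period} for t >= period."""
    return [series[t] - series[t - period] for t in range(period, len(series))]


# Hypothetical quarterly data: trend of 0.5 per quarter plus a fixed seasonal pattern.
pattern = [5.0, -2.0, 1.0, -4.0]
quarterly = [0.5 * t + pattern[t % 4] for t in range(16)]
d4 = seasonal_difference(quarterly, 4)
print(d4)
```

In this stylised case the fixed seasonal pattern cancels completely and the seasonal difference equals the annual change in the trend. For monthly data the same function would be called with `period=12`. Note that four observations are lost at the start of the sample.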
Finally, X12 can be described as a state of the art tool, or as a black box, where you send in seasonal data and out comes a deseasonalised series. X12 is a respected method, and this procedure, or some similar program, is frequently used to remove seasonality from public statistics. Removing seasonality by seasonal differencing, by seasonal dummies or by using X12 does not affect the presence of one or more unit roots in the series. The Dickey-Fuller test and other tests for unit roots work as before. X12 is a program designed for univariate analysis, meaning that if seasonality is removed from single series by X12 prior to modeling a system, seasonality can still be left in a multivariate single equation model or in a system of equations. The problem with X12 is its black box nature: the econometrician loses some control over the modeling process. Some care in the use of X12 is recommended.
7.3.2 Non-stationarity

(To be completed.) Differencing until stationarity is the standard Box-Jenkins approach. It is a bit ad hoc. In econometrics the approach is to test first, but only reject the null of integration of order one in the case of strong evidence against it. Alternatives include linear deterministic trends, polynomial trends etc. These are dangerous, since they can lead to spurious detrending under the maintained hypothesis of integrated variables.

7.4 Aggregation

The following section offers a brief discussion of the problems of aggregation. The interested reader is referred to the literature to learn more [Wei (1990) is a good textbook with many references on the subject; see also Sjöö (1990, ch. 4)]. Aggregation of series means aggregation over agents and markets, or aggregation over time. The stock of money, measured by M3, at the end of the month represents an aggregation over individuals. A series like aggregate consumption in the national accounts represents an aggregation over both individuals and time.

Aggregation over time is usually referred to as temporal aggregation. Money holdings is a stock variable which can be measured at any point in time. Temporal aggregation of a stock variable implies picking observations at larger intervals, using say a money series measured at the end of a quarter instead of at the end of each month. Consumption, on the other hand, is a flow variable. It cannot be measured at a point in time, only as the sum of consumption over a given period. Temporal aggregation in this case implies taking the sum of consumption over intervals. The distinction is of importance because the effects of temporal aggregation are different for stock and flow variables.

Aggregation, both over time and over individuals, can change the functional form of the distribution of the variables, and it can affect the residual variance and t-values. Exactly how aggregation changes a model varies from situation to situation.
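The difference between temporal aggregation of a stock and of a flow can be made concrete with two small helper functions. This is a sketch only. The monthly figures are invented, and the function names are not from the text.

```python
def aggregate_stock(observations, m):
    """Systematic sampling: keep the last observation in each block of m periods."""
    return [observations[i + m - 1] for i in range(0, len(observations) - m + 1, m)]


def aggregate_flow(observations, m):
    """Sum the observations within each block of m periods."""
    return [sum(observations[i:i + m]) for i in range(0, len(observations) - m + 1, m)]


# Hypothetical monthly data for one year.
monthly_money = [100, 101, 103, 102, 104, 107, 108, 110, 109, 111, 113, 115]
monthly_consumption = [10, 11, 12, 10, 12, 13, 11, 12, 14, 12, 13, 15]

quarterly_money = aggregate_stock(monthly_money, 3)              # end-of-quarter stock
quarterly_consumption = aggregate_flow(monthly_consumption, 3)   # quarterly flow
print(quarterly_money)
print(quarterly_consumption)
```

The stock series simply discards the within-quarter observations, while the flow series is a moving sum sampled every third month. This is one way to see why the two kinds of aggregation have different effects on the time series structure.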
There are, however, some general conclusions regarding temporal aggregation which we will repeat in this section. In many situations there is little we can do about these problems, except working with continuous time models and/or selecting series with a low degree of temporal aggregation. That the problems are hard to deal with is no excuse for forgetting or hiding them, as is done in many textbooks in econometrics. The area of aggregation is an interesting challenge for econometricians since it has not been explored as much as it deserves.

An interesting example of the consequences of aggregation is given in Christiano and Eichenbaum (1987). They show how one can get extremely different results by using discrete time models with yearly, quarterly and monthly data compared with a continuous time model. They tried to estimate the speed of adjustment in the stock of inventories in the U.S. national accounts. Using a continuous time model they estimated the average time for closing 95% of the gap between the desired and the actual stock of inventories to be 17 days. The discrete models predicted much slower adjustment. Using monthly data the result was 46 days, with quarterly data 7 months, and with yearly data 5 1/2 years!

Aggregation also becomes an important problem if we have a theory that describes the stochastic behavior of a variable which we would like to test with empirical data. There are many results, in macro and finance, that predict that series should follow a random walk, or be the outcome of a martingale process.
There are several factors to consider if we would like to estimate a process suggested by theory. An example is Hall (1978) who, from a life cycle hypothesis, derived that private consumption should follow an AR(1) process, and be a random walk under the assumption of rational expectations. The first factor is that of temporal aggregation. An additional complication is adjustment costs, which will also affect the original model. If private consumption, as an example, is defined as an AR(1) model, temporal aggregation changes it to ARMA(1,1), and the existence of adjustment costs will then transform it to an ARMA(2,1) model.

Temporal aggregation, adjustment costs and measurement errors are factors which can affect the structure of the model and the size of estimated parameters. To this list one could also add problems of seasonal factors, trends and hidden periodicity. The latter is a problem because the larger the temporal aggregation, the more difficult it is to get a correct estimate of parameters that reflect cycles which are not timed with the sampling interval. Therefore, one should be critical of papers which try to prove that some empirical series behaves like a theoretical process. Is it possible for the author to control all of these factors?

For a flow variable with an ARIMA representation, the outcome of temporal aggregation depends on hidden periodicity, which if it exists can affect both the AR and the MA process. In general, aggregation will complicate the structure of the ARIMA model. A simple AR model becomes an ARMA model. But as aggregation becomes larger the structure of the model becomes simpler.

For a stock variable the consequences are clearer. An ARIMA(p, d, q) process of a stock variable becomes, after temporal aggregation, an ARIMA(p, d, s) process, where $s \le \text{integer}[(p + d) + (q - p - d)/m]$, and where $m$ is the degree of temporal aggregation, or in other words the systematic sampling interval.
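The expression for the MA order after systematic sampling of a stock variable can be evaluated directly. The sketch below reads "integer" as the integer part and treats the result as an upper bound on $s$, which is an interpretation made here. The function name and the example orders are illustrative assumptions.

```python
def ma_order_bound(p, d, q, m):
    """Integer part of (p + d) + (q - p - d) / m for sampling interval m >= 1.

    The expression equals ((m - 1) * (p + d) + q) / m, which is non-negative,
    so integer (floor) division gives its integer part exactly.
    """
    if m < 1:
        raise ValueError("the sampling interval m must be at least 1")
    return ((m - 1) * (p + d) + q) // m


# Hypothetical cases: how the bound on the MA order changes with the sampling interval.
for p, d, q in [(1, 0, 0), (1, 1, 1), (2, 1, 3)]:
    bounds = [ma_order_bound(p, d, q, m) for m in (1, 2, 4, 12)]
    print((p, d, q), bounds)
```

With `m = 1` nothing is aggregated and the bound is simply `q`. As `m` grows the term $(q - p - d)/m$ vanishes, so the original MA order matters less and less for the sampled series.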
As a rule of thumb it can be assumed that temporal aggregation adds one to the order of the MA process. Since differencing is a form of temporal aggregation, taking higher and higher differences of a series will create an MA process. This can be seen in any time series program that produces ACFs and PACFs. The more one differences a series, the more clearly the series will look like an MA process. Thus, it follows that observing an MA process in the identification step of the Box-Jenkins approach is a sign of over-differencing.

The expression holds even for an ARMA model where $d = 0$. For an ARIMA model, as $m$ gets larger, the model turns towards an IMA(d, d-1) process. Thus, for $d = 1$, we end up with a random walk model. This is an interesting result for two reasons. First, since the random walk model often seems to fit macroeconomic and especially financial time series quite well, could that be the outcome of having too large sampling intervals? Second, the result explains the findings in Christiano and Eichenbaum (1987), that larger sampling intervals lead to slower and slower adjustment speed in inventories. The larger the sampling interval, the more inventories seemed like a random walk. As a consequence, historical shocks further and further back in history seemed more important. In the end, in the random walk model, all historical shocks have the same importance and there would be no adjustment at all.

Temporal aggregation will also affect prediction. The general result is that aggregation reduces the efficiency of the forecasts, and that the relative loss of efficiency is larger for a non-stationary series than for a stationary one. (Remember that most macroeconomic series are non-stationary.)

It is also worth mentioning some conclusions concerning causality. Aggregation will not affect the direction of causality, if there is a clear causality from one variable to another, when dealing with stock variables. It will, however, weaken the
estimated strength of the relationship and can therefore lead to wrong conclusions from Granger non-causality tests. For flow variables, on the other hand, temporal aggregation turns a one direction causality into what will appear to be a two-sided causality. In this situation a clear warning is in place.

Finally, we also look at the aggregation of two random variables, $\tilde{X}_t$ and $\tilde{Y}_t$. Suppose that they are two independent stationary processes with mean zero,

$$E[\tilde{X}_t y_t] = E[\tilde{Y}_t x_t] = 0. \tag{7.22}$$

The autocovariances of $\tilde{X}_t$ and $\tilde{Y}_t$ are

$$cov(x_t, x_{t-k}) = \rho_{x,k}, \tag{7.23}$$
$$cov(y_t, y_{t-k}) = \rho_{y,k}. \tag{7.24}$$

The sum of $\tilde{X}_t$ and $\tilde{Y}_t$ is

$$\tilde{Z}_t = \tilde{X}_t + \tilde{Y}_t, \tag{7.25}$$

which will have an autocovariance equal to

$$\rho_{z,k} = \rho_{x,k} + \rho_{y,k}. \tag{7.26}$$

In general, we can write this in the following way. If

$$\tilde{X}_t \sim ARMA(p, m), \tag{7.27}$$

and

$$\tilde{Y}_t \sim ARMA(q, n), \tag{7.28}$$

and

$$\tilde{Z}_t = \tilde{X}_t + \tilde{Y}_t, \tag{7.29}$$

then

$$\tilde{Z}_t \sim ARMA(x_1, x_2), \tag{7.30}$$

where $x_1 \le p + q$ and $x_2 \le \max(p + n, q + m)$.

As an example, think of a series which is measured with a white noise error. That is, the true series is added to a white noise series. If the true series is AR(p) then the result of this aggregation will be an ARMA(p, p) process.

We can conclude this section by stating that aggregation leads to a loss of information, which, if the aggregation is large, might fool us into assuming that the random walk is the appropriate model. The extent to which aggregation leads us to wrong conclusions has not been established yet. Partly this is so because we need better data on shorter time intervals than what is available. Remember that ignoring problems is not a way of solving them. One way of dealing with the problems of aggregation is to use continuous time econometric techniques instead, see Sjöö (1993) for a discussion and further references.

7.5 Overview of Single Equation Dynamic Models

The autoregressive process represents a basic way of modeling time series.
As complexity and multivariate processes are introduced, the AR model transforms into a system of equations, where it becomes possible to give the parameters a structural (economic) interpretation. In principle, we have the following types of single equation models, where $\epsilon_t \sim NID(0, \sigma^2)$:
1. Autoregressive models, AR(p): $A(L)y_t = \epsilon_t$.
2. Moving average models, MA(q): $y_t = \phi(L)\epsilon_t$.
3. ARMA(p, q) models: $A(L)y_t = \phi(L)\epsilon_t$ (+ ARIMA).
4. Distributed lag models, DL(p): $y_t = B(L)x_t + \epsilon_t$.
5. Autoregressive distributed lag models, ADL(p): $A(L)y_t = B(L)x_t + \epsilon_t$.
6. ARMA models with exogenous explanatory variables, ARMAX (ARIMAX): $A(L)y_t = B(L)x_t + \phi(L)\epsilon_t$.
7. Rational distributed lag models, RDL: $y_t = \frac{B(L)}{A(L)}x_t + \phi(L)\epsilon_t$.
8. Transfer functions: $y_t = \frac{B(L)}{A(L)}x_t + \frac{\phi(L)}{\varphi(L)}\epsilon_t$.

Notice that the transfer function is also a rational distributed lag since it contains a ratio of two lag structures. Also, (7) and (8) can be viewed as distributed lag models since $D(L) = [B(L)/A(L)]$. Notice that rational distributed lag models require some information about $B(L)$ to be workable. Imposing restrictions on the lag structure $B(L)$ in distributed lag models leads to further models:

9. Geometric lag structure (= Koyck), where $B(L)$ is assumed to decline according to some exponential function.
10. Polynomial distributed lag (PDL) models (= Almon lags), where $B(L)$ declines according to some polynomial function, decided a priori.
11. All other types of a priori restrictions on $B(L)$ not covered by (9) and (10).
12. The error correction model. This model embraces all of the above models as special cases. The following explains why this is so.

Introduction to Error Correction Models

Economic time series are often non-stationary: their means and variances change over time. The trend component in the data can either be deterministic or stochastic, or a combination of both. Fitting a deterministic trend assumes that the data series grows at a fixed rate each period. This is seldom a good way of describing trends in economic time series. Instead they are better described as containing stochastic trends with a drift. The series might be growing over time, but it is not possible to predict whether it grows or declines in the next period.
Variables with stochastic trends can be made stationary by taking first differences. This type of variable is called integrated of order 1, where the order of integration is determined by the number of times the variable needs to be differenced before it becomes stationary. A necessary condition for fitting trending data in an econometric model is that the variables share the same trend, otherwise there is no meaningful long-run relationship between them (see footnote 8).

Footnote 7: Restrictions are put on the lag process to make the estimation more effective. A priori restrictions can be motivated by a limited sample and by multicollinearity that affects the estimated standard errors of the individual lags. These types of restrictions are not used anymore. Today, it is recognized that it is more important to focus on information criteria, white noise residuals and building a well-defined statistical model, instead of imposing restrictions that might not be valid.

Footnote 8: The exception is tests of the efficient market hypothesis, and related tests of rational expectations. See Appendix A in Sjöö and Sweeney (1998) and Sjöö (1998).

Testing for co-integration is a way of testing if the data
has a common trend, or if they tend to drift apart as time increases. The simplest way to test for cointegration is the so called Engle and Granger two step procedure. The test implies determining whether the data contain stochastic trends, and if so, testing if there are common trends. If $x_t$ and $y_t$ are two variables with stochastic trends that become stationary after first differencing, cointegration can be tested by running the following co-integrating regression,

$$y_t = \alpha + \beta x_t + \epsilon_t. \tag{7.31}$$

If both $y_t$ and $x_t$ are integrated variables of the same order, a necessary condition for a statistically meaningful long-run relationship is that the residual term ($\epsilon_t$) is stationary. If that is the case, the error term from the regression can be seen as temporary deviations from the long run, and $\alpha$ and $\beta$ can be viewed as estimates of the long-run steady state relation between $x$ and $y$.

A general way of building a model of time series, without imposing ad hoc a priori restrictions, is the autoregressive distributed lag model. For two variables we have

$$A(L)y_t = B(L)x_t + \eta_t, \tag{7.32}$$

where the lags are given by $A(L) = \sum_{i=0}^{k} a_i L^i$ and $B(L) = \sum_{i=0}^{k} b_i L^i$. The first coefficient in $A(L)$ is set to unity, $a_0 = 1$. The lag length is chosen such that the error term becomes a white noise process, $\eta_t \sim NID(0, \sigma^2)$. The long-run solution of this model is given by

$$y_t = \pi x_t + \eta_t, \tag{7.33}$$

where $\pi = B(1)/A(1)$, that is, the ratio $B(L)/A(L)$ evaluated at $L = 1$. Without loss of generality we can use the difference operator, $\Delta x_t = x_t - x_{t-1}$, to rewrite the autoregressive model as an error correction model,

$$\Delta y_t = \sum_{i=0}^{k}\beta_i\Delta x_{t-i} + \sum_{i=1}^{k}\gamma_i\Delta y_{t-i} + \alpha\,ECM_{t-1} + \eta_t, \tag{7.34}$$

where the error correction mechanism is given by $ECM_{t-1} = (\pi x_{t-1} - y_{t-1})$. The latter term can be said to represent the deviation from the long run steady state relation between the two variables. It is convenient to think of the $ECM_t$ variable at the first lag as controlling the long-run path of the dependent variable.
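The two steps can be illustrated on simulated data. The sketch below is a deliberately simplified version and not a full implementation of the procedure: the data generating process (long-run coefficient 2, impact coefficient 0.5, adjustment coefficient 0.3), the seed and the helper functions are all assumptions made here. The second step omits the constant and the lagged differences of (7.34), and no formal test of the stationarity of the residual is carried out.

```python
import random


def ols_with_constant(x, y):
    """OLS of y on a constant and x; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def ols_two_regressors(x1, x2, y):
    """OLS of y on x1 and x2 without a constant, via the 2x2 normal equations."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s1y * s22 - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


# Simulate a hypothetical cointegrated pair: x is a random walk and y adjusts to 2 * x.
rng = random.Random(2024)
x, y = [0.0], [0.0]
for _ in range(2000):
    dx_t = rng.gauss(0.0, 1.0)
    ecm_lag = 2.0 * x[-1] - y[-1]          # ECM_{t-1} = pi * x_{t-1} - y_{t-1}
    dy_t = 0.5 * dx_t + 0.3 * ecm_lag + rng.gauss(0.0, 1.0)
    x.append(x[-1] + dx_t)
    y.append(y[-1] + dy_t)

# Step 1: co-integrating regression y_t = const + pi * x_t + error, as in (7.31).
const_hat, pi_hat = ols_with_constant(x, y)
ecm = [const_hat + pi_hat * xi - yi for xi, yi in zip(x, y)]

# Step 2: regress the change in y on the change in x and the lagged ECM term.
dx = [x[t] - x[t - 1] for t in range(1, len(x))]
dy = [y[t] - y[t - 1] for t in range(1, len(y))]
beta0_hat, alpha_hat = ols_two_regressors(dx, ecm[:-1], dy)
print(round(pi_hat, 2), round(beta0_hat, 2), round(alpha_hat, 2))
```

With the sign convention $ECM_{t-1} = (\pi x_{t-1} - y_{t-1})$ used in the text, a positive $\alpha$ means that $y_t$ moves towards the long-run relation. If the ECM term is instead defined as the residual from (7.31), the sign of the estimated coefficient is reversed.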
Asymptotically it will not matter at which lag the $ECM_t$ is placed. In a multivariate model, however, and for a finite sample, it might make a difference, and a seasonal lag might work better (see footnote 9). Furthermore, for an ECM to work well in a model it should not display any signs of seasonal effects or extreme outliers. These effects should be removed when the $ECM_t$ is constructed. The $\alpha$-parameter of the error correction term indicates how changes in $y_t$ react to deviations from the long-run equilibrium.

When modeling integrated variables, rewriting the system as a (vector) error correction model is a natural step. However, error correction models also work with stationary data series. Assuming costly adjustment leads generally to partial adjustment models, which are better written in the less restrictive error correction form. Optimal control theory, approximations to structural systems in continuous time etc. will also lead to error correction models, see Hendry, Pagan and Wickens (1982), Hendry (1995), or ch. 2 in Banerjee et al. (1993).

If $x_t$ and $y_t$ contain stochastic trends it is necessary that they are co-integrated for the ADL model to make sense in the long run. For instance, if the variables are co-integrated, the error term from the co-integrating regression ($\epsilon_t$ above) can be used as the error correction mechanism.

Footnote 9: For comparison see the discussion of seasonality, earlier in this text.

This was shown in Engle and Granger
(1987). If there is cointegration there is an ECM formulation, the reason being that cointegration implies Granger causality in at least one direction.

The advantage of the error correction model is that it does not put a priori restrictions on the model and that it separates long-run and short-run effects. It has proven to be a very efficient way to model various economic relations, like money demand, consumption etc. It should be recognized that the early literature on EC models tended to overlook the problem of weak exogeneity. With the developments in the field of multivariate cointegration it has been shown that when the same EC expression determines more than one variable, there are cross equation restrictions between the co-integrating parameters. These restrictions imply that error correction expressions have to be estimated within complete systems, not by OLS.

Multivariate Model Survey

Multivariate models are introduced later. For the time being we can conclude our listing of models with the following:

- Vector autoregressive models, VAR.
- Vector autoregressive moving average processes, VARMA. The VAR and the VARMA represent multivariate ARIMA models.
- Vector error correction models, VECMs.
- Structural vector autoregressive models, SVAR.
- Systems of structural equations estimated using estimators.
- Structural vector error correction models.

The latter represent the final step, where a complete system of interacting variables is modeled and given an (economic) structural interpretation.
8. MULTIPLIERS AND LONG-RUN SOLUTIONS OF DYNAMIC MODELS

Given an autoregressive or distributed lag structure $A(L)$, $B(L)$ or $D(L)$, the long run static solution of the model is found by setting $L = 1$. The intuition is that in the long run there will be no changes in the explanatory variables, and it will not matter if we explain $y_t$ by say $x_t$ and/or $x_{t-i}$. The conditional mean of $y_t$ in an ADL model, for example, is

$$E_t\{y_t\} = \bar{y}_t = A(L)^{-1}B(L)x_t. \tag{8.1}$$

The mean path of $y_t$ is therefore

$$\bar{y} = \frac{B(1)}{A(1)}\bar{x}, \tag{8.2}$$

where $A(1) = (1 - a_1 - a_2 - \dots - a_r)$ and $B(1) = (b_0 + b_1 + b_2 + \dots + b_j)$. In a distributed lag model we would have

$$\bar{y} = D(1)\bar{x}. \tag{8.3}$$

Now a unit change (easier if in percent) in $\bar{x}$ leads to a new equilibrium,

$$\bar{y} = D(1)(\bar{x} + 1). \tag{8.4}$$

The total effect of a change in $x_t$ is given by the sum of the coefficients in $D(L)$ when $L = 1$. If there are $m$ lags in $D(L)$, the total multiplier is

$$D(1) = (\gamma_0 + \gamma_1 + \gamma_2 + \dots + \gamma_m) = \sum_{j=0}^{m}\gamma_j. \tag{8.5}$$

It is also possible to think of the total multiplier as an infinite sum of $\gamma$ parameters which dies out slowly in the long run.

The impact multiplier is associated with the first parameter in $D(1)$, which is $\gamma_0$. Thus taking $\gamma_0 x_t$ gives you the impact multiplier, the first period's effect following a change (a shock) in $x_t$.

The $j$:th interim multiplier, here denoted $\gamma_j^{*}$, is the sum of the coefficients up to and including the $j$:th lag,

$$\gamma_j^{*} = \sum_{i=0}^{j}\gamma_i. \tag{8.6}$$

It is common to standardize the $j$:th interim multiplier in the following way,

$$\tilde{\gamma}_j = \Big[\sum_{i=0}^{j}\gamma_i\Big] / D(1), \tag{8.7}$$

such that it represents the share of the total multiplier up until the $j$:th lag.
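The impact, interim and total multipliers are simple sums and can be computed as follows. This is a sketch with invented coefficients. The function names and the numbers are assumptions for the illustration only.

```python
def long_run_multiplier(a_coefs, b_coefs):
    """B(1) / A(1) for A(L) = 1 - a_1 L - ... - a_r L^r and B(L) = b_0 + b_1 L + ..."""
    return sum(b_coefs) / (1.0 - sum(a_coefs))


def interim_multipliers(gammas):
    """Cumulative sums of the distributed lag coefficients gamma_0, gamma_1, ..."""
    cumulated, running = [], 0.0
    for gamma in gammas:
        running += gamma
        cumulated.append(running)
    return cumulated


# Hypothetical distributed lag D(L) with four coefficients.
gammas = [0.4, 0.3, 0.2, 0.1]
impact = gammas[0]                               # impact multiplier, gamma_0
interim = interim_multipliers(gammas)            # interim multipliers, as in (8.6)
total = interim[-1]                              # total multiplier D(1), as in (8.5)
shares = [value / total for value in interim]    # standardized multipliers, as in (8.7)

# Hypothetical ADL model: y_t = 0.5 y_{t-1} + 0.2 x_t + 0.1 x_{t-1} + error.
adl_long_run = long_run_multiplier([0.5], [0.2, 0.1])
print(impact, interim, shares)
print(adl_long_run)
```

The last standardized multiplier is one by construction, since all of the total effect has then been realised.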
The mean lag is given as

$$\bar{\gamma} = \Big[\sum_{j=0}^{m} j\gamma_j\Big] / \Big[\sum_{j=0}^{m}\gamma_j\Big]. \tag{8.8}$$

Notice that $m$ could be equal to infinity if we have a stable model, with stationary variables, such that the infinite sum of $\gamma_i$ converges to a constant sum in the long run.

The mean lag can be derived in a more sophisticated way, by differentiating $D(L)$ with respect to $L$ and then dividing by $D(1)$. That is,

$$D(L) = \gamma_0 + \gamma_1 L + \gamma_2 L^2 + \dots + \gamma_s L^s, \tag{8.9}$$

and

$$D'(L) = \gamma_1 + 2\gamma_2 L + 3\gamma_3 L^2 + \dots + s\gamma_s L^{s-1}. \tag{8.10}$$

By dividing $D'(1)$ by $D(1)$ we get, as a general result for ADL models,

$$\bar{\gamma} = \frac{D'(1)}{D(1)} = \frac{B'(1)}{B(1)} - \frac{A'(1)}{A(1)}. \tag{8.11}$$

Finally we have the median lag, representing the number of periods required for 50% of the total effect to be achieved. The median lag is obtained by solving for $j$ in

$$\Big[\sum_{i=0}^{j}\gamma_i\Big] / D(1) = 0.5. \tag{8.12}$$

Sometimes the median lag is approximated by choosing the $j$:th interim multiplier in the middle of the lag structure.
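The mean lag and the median lag can be computed from the same coefficients. In the sketch below the coefficients are invented and all are assumed to have the same sign. Since $j$ is an integer, the median lag is taken to be the first lag at which the cumulated share reaches 50%, which is one possible reading of (8.12).

```python
def mean_lag(gammas):
    """Sum of j * gamma_j divided by the sum of gamma_j, that is D'(1) / D(1)."""
    return sum(j * gamma for j, gamma in enumerate(gammas)) / sum(gammas)


def median_lag(gammas):
    """First lag j at which the cumulated share of the total multiplier reaches 50%."""
    total, running = sum(gammas), 0.0
    for j, gamma in enumerate(gammas):
        running += gamma
        if running / total >= 0.5:
            return j
    return None


# Hypothetical distributed lag coefficients gamma_0, ..., gamma_3.
gammas = [0.4, 0.3, 0.2, 0.1]
average_delay = mean_lag(gammas)
half_effect_lag = median_lag(gammas)
print(average_delay, half_effect_lag)
```

The numerator of `mean_lag` is exactly $D'(1)$ from (8.10), so the function is a direct evaluation of $D'(1)/D(1)$.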
9. VECTOR AUTOREGRESSIVE MODELS

The extension of ARIMA modeling into a multivariate framework leads to Vector Autoregressive (VAR) models, Vector Moving Average (VMA) models and Vector Autoregressive Moving Average (VARMA) models. In economics, since most variables display autocorrelation and are cross-correlated, VAR models are an interesting choice for modeling economic systems.

Vector models can be constructed using similar techniques as those for single variable ARIMA models. The autocorrelation and partial autocorrelation functions can be extended to display cross-correlations among the variables in the system. However, when modelling more than two variables, these cross autocorrelation and cross partial autocorrelation functions quickly turn into complex matrix expressions for each lag (see footnote 1). Thus, the cross-correlation functions are not practical tools to work with.

The advantage of using VARs is that a VAR represents a statistical description of the economy. When using ARIMA on univariate series, in many situations the combination of AR and MA processes turns out to be an efficient way of finding a stochastic representation of a process. VAR models are usually effective in modeling multivariate systems, and can be used to make forecasts and dynamic simulations of different shocks to the system. These shocks can come from policy, from productivity or basically from anywhere in the economy, and the shocks can be assumed to be transitory or permanent. The main complicating factor is that in order to understand what shocks and simulations actually mean it is necessary to identify the underlying economic relations among the variables. To make VAR models work for economic analysis it is necessary to impose some restrictions on the residual covariance matrix of the VAR. Thus, there is no free lunch here in terms of avoiding discussing causality and simultaneity problems.
It is necessary to point out the latter because in the beginning of the history of VAR models it seemed like VAR models could be used without economic theory, but that was build on a misunderstanding. In econometrics the focus is on finding a parsimonious VAR representation with N ID residuals. Let x t be an p dimensional vector of stochastic time series variables, represented as a the k : th order VAR model, x t = k A i x t i + e t, or i=1 A(L)x t = e t (9.1) where A i is the matrix of coeffi cients of lag number i, so A 0 A = p i=0 A i, where A 0 is a diagonal matrix, e t is a vector of white noise residual terms. Notice that all variables across all equations have the same lag length (k). This is so because it makes it possible estimate the system with OLS. If the lag order is allowed vary, the VAR must be estimated with the seemingly unrelated regressor method. A VAR model can be inverted into its VECMA form as 1 See Wei (1989) for a presentation of the Box-Jenkin s technique in a multivariate framework. VECTOR AUTOREGRESSIVE MODELS 85
86 x t = C i x t i = C(L)e t i=1 The MA form is convenient for analysing the properties of a VAR and investigate the consequences of shocks to the system. Estimation, however, is usually done in the VAR form, and is straightforward since each equation can be estimated individually with OLS. The lag length (k) of the autoregressive process is chosen such that the estimated residual process, in combination with constants, trend, dummy variables and seasonals, becomes white noise process in each equation. The idea is that the lag length is equal for all variables in all equations. A second order VAR of dimension p with a constant is x 1t x 2t. x pt = a 0 a 1. a p [ ] + a11 a 12 a 1p a 21 a 22 a 2p x 1t 1 x 2t 1. x p 1 x 1t 2 x 2t 2. x pt 2 + VAR models were strongly advocated by Sims (1980) as a response to what he described as incredible restrictions imposed on standard structural econometric models. Up until the mid 80s, empirical time series econometrics was dominated by the estimation of text-book equations. Researchers simply took an equation from theory, estimated it, and did not pay much attention to whether the model and the data actually fitted each other. Typically, dynamic lag structures where treated in a very ad hoc way. Sims argued that it would be better to find a statistical model, which described the data series and their interaction, as well as possible. Once the statistical model was there, it could be used to forecast and simulate the economy. In particular, it would according to Sims be possible to analyse the effects of various policy changes. Sims critique is related to the "Lucas critique". Lucas showed how in a world of rational expectations, it was not possible to understand estimated parameters in structural econometric models as (deep) structural behavior or policy parameters. 
Since agents form their behavior on plans building on forecasts of variables, not on historical outcomes of variables, the estimated parameters based on historical observation become a mixture of behavioral parameters and forecast generating parameters. Further, under rational expectations, econometric models could not be used to analyze policy changes, because a change in policy would by definition lead to a change in the parameters of the system. Sims therefore argued for VAR models as a statistical description of the economy, under given policy rules. The effects of surprise changes in policy variables could then be analysed in the reduced form. VAR models represent the reduced form of an underlying structural model. This can be seen by starting from a general (but not necessarily identified) structural model, and rewriting it in reduced form. As an example, start from the bivariate model, e 1t e 2t y t = γ 1 + a 11 x t + b 11 y t 1 + b 12 x t 1 + ɛ 1t (9.2) x t = γ 2 + a 21 y t + b 21 y t 1 + b 22 x t 1 + ɛ 2t (9.3) This system can be rewritten in reduced form by substituting for x t and y t on the RHS of the equations,. e pt y t = µ 1 + π 11 y t 1 + π 12 x t 1 + e 1t (9.4) x t = µ 2 + π 21 x t 1 + π 22 x t 1 + e 2t. (9.5) 86 VECTOR AUTOREGRESSIVE MODELS
The equations form a bivariate VAR model of order one. The residuals of the VAR model (the reduced form) contain the residuals and the parameters ($a_{11}$ and $a_{21}$) of the structural model. The reduced system can be estimated by applying OLS to each equation.[2] The parameters of the VAR relate to the structural model as

$$ \pi_{11} = \frac{a_{11} b_{21} + b_{11}}{1 - a_{11} a_{21}}, \quad \text{etc.} $$

Thus, the parameters of the VAR are complex functions of some underlying structural model, and as such they are on their own quite uninteresting for economic analysis. It is the lag structure, and sometimes its signs, that are more interesting. The two residuals in this VAR are

$$ e_{1t} = \frac{\epsilon_{1t} + a_{11}\epsilon_{2t}}{1 - a_{11} a_{21}} \tag{9.6} $$

and

$$ e_{2t} = \frac{a_{21}\epsilon_{1t} + \epsilon_{2t}}{1 - a_{21} a_{11}}. \tag{9.7} $$

These residuals are both white noise terms, but they are correlated with each other whenever the coefficients $a_{11}$ or $a_{21}$ are different from zero.

The generalization of the structural system above, setting $z_t = \{y_t, x_t\}$, is

$$ B z_t = \Gamma_0 + \Gamma_1 z_{t-1} + \epsilon_t, \tag{9.8} $$

where

$$ B = \begin{bmatrix} 1 & -a_{11} \\ -a_{21} & 1 \end{bmatrix}, \quad \Gamma_0 = \begin{bmatrix} \gamma_{01} \\ \gamma_{02} \end{bmatrix} \quad \text{and} \quad \Gamma_1 = \begin{bmatrix} \gamma_{11} & \gamma_{12} \\ \gamma_{21} & \gamma_{22} \end{bmatrix}. $$

In terms of (9.2)-(9.3), $\gamma_{01} = \gamma_1$, $\gamma_{02} = \gamma_2$ and $\gamma_{ij} = b_{ij}$. If both sides of (9.8) are multiplied by $B^{-1}$ the result is

$$ z_t = B^{-1}\Gamma_0 + B^{-1}\Gamma_1 z_{t-1} + B^{-1}\epsilon_t = \Pi_0 + \Pi_1 z_{t-1} + e_t, \tag{9.9} $$

where $\Pi_0 = B^{-1}\Gamma_0$, $\Pi_1 = B^{-1}\Gamma_1$ and $e_t = B^{-1}\epsilon_t$. This shows that the VAR model is a reduced form of an underlying structural model, where the structural dependence is hidden in the covariance matrix of the error terms.

VAR models are estimated in their AR form, $A(L)y_t = e_t$. They can be inverted and analysed in their MA form, $y_t = C(L)e_t$. Besides predictions, VAR models are used for three types of analysis: Granger non-causality tests, forecast error variance decomposition and impulse response analysis. Granger non-causality tests deserve a separate treatment and are therefore discussed in the following part. The other two techniques are typical VAR methods that make use of the MA form.

Forecast Error Variance Decomposition. The forecast error variances are explained in terms of the shocks to each variable. This analysis tells us how strong the influence among the variables of the system is. It gives the proportion of movements in a sequence (of $y_i$) that is due to its own shocks and the proportion due to shocks in other variables. If these other variables have little influence on the investigated variable, they will contribute little to its forecast error variance. Variables that are exogenous will be little affected by shocks to other variables.

[2] OLS is as efficient as the seemingly unrelated regressions (SUR) estimator in this case, because the equations contain the same explanatory variables. However, if we set some lags to zero and have a system with different lags in different equations, SUR will be a more efficient estimator than OLS.
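The mapping from the structural form (9.8) to the reduced form (9.9) can be checked numerically. The sketch below uses made-up parameter values for the bivariate model (9.2)-(9.3) and assumes, for illustration only, that the structural shocks are uncorrelated with unit variances; it relies on NumPy.

```python
import numpy as np

# Hypothetical structural parameters for (9.2)-(9.3), illustration only
a11, a21 = 0.5, 0.2
B = np.array([[1.0, -a11],
              [-a21, 1.0]])
Gamma0 = np.array([1.0, 0.5])            # constants (gamma_1, gamma_2)
Gamma1 = np.array([[0.6, 0.1],           # [[b11, b12],
                   [0.2, 0.4]])          #  [b21, b22]]

# Reduced form (9.9): Pi_0 = B^{-1} Gamma_0 and Pi_1 = B^{-1} Gamma_1
B_inv = np.linalg.inv(B)
Pi0 = B_inv @ Gamma0
Pi1 = B_inv @ Gamma1

# Closed-form expression for pi_11 obtained by direct substitution
b11, b12, b21, b22 = Gamma1.ravel()
pi11_closed = (a11 * b21 + b11) / (1.0 - a11 * a21)

# Assumed: structural shocks are uncorrelated with unit variance.
# The reduced-form residuals e_t = B^{-1} eps_t are then correlated.
Sigma_eps = np.eye(2)
Sigma_e = B_inv @ Sigma_eps @ B_inv.T

print(Pi1[0, 0], pi11_closed)
print(Sigma_e)
```

The off-diagonal element of `Sigma_e` is non-zero as soon as `a11` or `a21` differs from zero, which is the sense in which the structural dependence ends up in the residual covariance matrix.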
Impulse response analysis. This is a graphic or numerical presentation of a simulation of the system's response to an unexpected shock in one variable in the system. A typical example is to study how the economy, and real GDP, reacts to an unexpected change in the money supply under the assumption of rational expectations. Typical questions to ask are: how long does it take before a shock in $y_t$ or $x_t$ dies out, will there be an effect at all, will it be positive or negative, and will it die out smoothly or through fluctuations? We can also ask if shocks in $y_t$ affect $x_t$, etc.

Let the MA form be $y_t = C(L)e_t = \sum_{i=0}^{t} C_i e_{t-i}$, where $C_i$ is the matrix of coefficients for lag $i$. In matrix form, for a two dimensional system,

$$ \begin{bmatrix} y_{1t} \\ y_{2t} \end{bmatrix} = \sum_{i=0}^{t} \begin{bmatrix} c_{11,i} & c_{12,i} \\ c_{21,i} & c_{22,i} \end{bmatrix} \begin{bmatrix} e_{1,t-i} \\ e_{2,t-i} \end{bmatrix}. \tag{9.10} $$

Setting $i = 0$ gives the impact multiplier, $C_0$, the initial effect of a shock. The matrix of total, or long-run, multipliers is given by $\sum_{i=0}^{\infty} C_i$. The impulse response functions are given by $C_j$, where $j = 0, \dots, t$.

Both the variance decomposition and the impulse response analysis require that the residual covariance matrix of the VAR is orthogonalized. This is so because the errors $e_t$ are dependent on each other through the $B^{-1}$ matrix. Unless the residuals of the VAR are orthogonalized, it will not be possible to identify a shock as a unique shock coming from one specific variable.[3]

There are several ways of performing the orthogonalization of the residuals. (In the following we assume that the VAR is made up of stationary variables.) The idea is that restrictions must be put on the covariance matrix of the VAR.

Cholesky decomposition. The Cholesky decomposition represents a purely mathematical way to orthogonalize the residuals, and the result depends on the ordering of the variables. It is customary to do several different decompositions, by changing the order of the equations in the model, to show how sensitive the results are to the way the orthogonalization is created.

In terms of the residual covariance matrix, what the Cholesky decomposition achieves is a triangular factor in which all elements above the diagonal are zero. Assume a three dimensional VAR, $p = 3$, and therefore a $3 \times 3$ covariance matrix,

$$ \Sigma = \begin{bmatrix} \sigma_{11}^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_{22}^2 & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_{33}^2 \end{bmatrix}. $$

The outcome of the decomposition is the lower triangular matrix

$$ P = \begin{bmatrix} p_{11} & 0 & 0 \\ p_{21} & p_{22} & 0 \\ p_{31} & p_{32} & p_{33} \end{bmatrix}. $$

The problem for identifying the VAR and doing the impulse responses is that the covariance matrix $\Sigma$ is not diagonal. The Cholesky decomposition builds on the fact that for any matrix $P$ with the property $PP' = \Sigma$, the transformed residuals $\epsilon_t = P^{-1}e_t$ have a diagonal covariance matrix, $\epsilon_t \sim (0, I_N)$. The ordering of the equations determines the outcome, and the causal ordering of the residual shocks. With $N = 3$, there are $3! = 6$ possible orderings and outcomes, which can be more or less different.

[3] Early VAR modelers did not recognize the need for orthogonalization. Thus papers from the first part of the 1980s must be read with some care.
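The Cholesky orthogonalization, the resulting impulse responses and the forecast error variance decomposition can be sketched in a few lines of NumPy. The VAR(1) coefficient matrix and the residual covariance matrix below are made-up numbers, and the helper names `orthogonal_irf` and `fevd` are hypothetical; for a VAR(1) the MA matrices are simply $C_j = A_1^j$.

```python
import numpy as np

# Hypothetical stationary VAR(1): x_t = A1 x_{t-1} + e_t, Cov(e_t) = Sigma
A1 = np.array([[0.5, 0.1],
               [0.2, 0.3]])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])

# Lower triangular factor with P @ P.T == Sigma
P = np.linalg.cholesky(Sigma)


def orthogonal_irf(A1, P, horizon):
    """Orthogonalized responses Theta_j = C_j P, with C_j = A1**j for a VAR(1)."""
    responses = []
    C = np.eye(A1.shape[0])
    for _ in range(horizon + 1):
        responses.append(C @ P)
        C = A1 @ C
    return np.array(responses)          # shape (horizon + 1, p, p)


def fevd(irf):
    """Share of the forecast error variance of variable i (rows)
    that is due to orthogonal shock k (columns), for each horizon."""
    contrib = np.cumsum(irf ** 2, axis=0)
    return contrib / contrib.sum(axis=2, keepdims=True)


irf = orthogonal_irf(A1, P, horizon=10)
shares = fevd(irf)
print(irf[0])        # impact responses: the triangular factor P itself
print(shares[-1])    # variance shares at the longest horizon
```

Because `P` is lower triangular, the first variable responds on impact only to its own shock. Repeating the calculation with the variables in the opposite order generally gives different responses whenever `Sigma` is not diagonal, which is the ordering sensitivity discussed above.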
Set up a recursive system. Instead of letting the computer do all of the job, you can set up the matrix $B^{-1}$ so that the residuals form a recursive system, by deciding on an ordering of the equations that corresponds to the ordering and residual correlations created by the Cholesky decomposition. Thus, the residual in equation one is not affected by the other two (meaning that $x_{1t}$ is not explained by $x_{2t}$ or $x_{3t}$). The second residual is only affected by the first residual. And finally, the last (third) residual is affected by residuals one and two. Econometric programs often include Cholesky decomposition routines in combination with the analysis of VAR models. By changing the ordering of the equations it becomes possible to compare the effects of different recursive orderings of the variables. The problem is that we are drowning in output as the dimension of the VAR increases.

Structural vector autoregressive (SVAR) models. If economic theory does not suggest a recursive ordering, use economic theory to impose restrictions on the $B^{-1}$ matrix. This approach is called Structural Vector Autoregressive (SVAR) modeling.[4] In practice the approach implies formulating a small structural (static) economic system for the residual process $e_t$. If $y_t$ is a $p$-dimensional system, the matrix $B^{-1}$ contains a total of $p^2$ parameters, while the symmetric error covariance matrix only contains $p(p+1)/2$ distinct elements. Identification therefore requires $p(p+1)/2 = (p^2 + p)/2$ restrictions on the matrix $B^{-1}$, counting the unit coefficients on the diagonal. As an example for a 3 variable system, the error process could be set up as

$$ e_{1t} = \epsilon_{1t}, \qquad e_{2t} = c_{21}\epsilon_{1t} + \epsilon_{2t}, \qquad e_{3t} = c_{31}\epsilon_{1t} + c_{32}\epsilon_{2t} + \epsilon_{3t}, \tag{9.11} $$

which happens to be a recursive ordering. Alternatively, the system could look like

$$ e_{1t} = \epsilon_{1t} + c_{13}\epsilon_{3t}, \qquad e_{2t} = c_{21}\epsilon_{1t} + \epsilon_{2t}, \qquad e_{3t} = c_{32}\epsilon_{2t} + \epsilon_{3t}. \tag{9.12} $$

In both examples the number of restrictions imposed is equal to $(3^2 + 3)/2 = 6$. Behind each equation is some reasoning about the plausible correlation among the variables at time $t$. In each equation there is one white noise residual term with an implicit parameter of unity, leaving three free parameters (the $c_{ij}$) to describe how the shocks in the errors are related.

A more general framework for identifying the VAR is

$$ A z_t = A_0 + A_1 z_{t-1} + B\epsilon_t, $$

where contemporaneous relations among the variables are captured by $A$, and $B$ takes care of correlations in the residuals, such that the covariance matrix of $\epsilon_t$ becomes diagonal.

Once the error process is set up in such a way that the errors are orthogonal, it becomes possible to analyze the effects of one specific shock on the system and argue that the shock is unique, coming only from that particular variable. Without orthogonalization the shock can be a mixture of effects from different variables, and not a clean shock.

One controversy here is that it is up to the econometrician to identify and label the shocks as, for instance, demand or supply shocks. The basis for such labeling might not be strong. Further, by definition, the errors include not only structural relations, but also everything that we do not know or understand about the system. For that reason it might be better to use economic theory to identify structural relations and build conventional econometric models instead, rather than trying to analyse what we do not understand. On the other hand, in a world of rational expectations where the expectations generating mechanism is unknown, or cannot be modelled, VAR models are the best we can do.

[4] A fourth approach is offered by Blanchard and Quah (1989), and builds on classifying shocks as temporary or permanent. This approach can be seen as an extension of the SVAR approach to processes including integrated variables with common trends.

How to estimate a VAR?

First, you think about your system. What is it that you want to explain? How could it be modelled as a recursive system? Second, you estimate the equations by OLS, with the same lag length on all variables across the equations, to avoid having to use the SUR estimation technique. Third, you investigate outliers and shifts and put in the appropriate dummy variables. Fourth, you try to find a short lag structure and white noise residuals. Fifth, if you cannot fulfill the fourth step, you minimize an information criterion. In this case AIC is not the best choice; use BIC or something else.

Impulse responses in a VAR with non-stationary variables and cointegration.

The orthogonalization of the residuals can offer some interesting intellectual challenges, especially in the SVAR approach. If the variables in the VAR are integrated variables which are also co-integrating, we are faced with some interesting problems. In the co-integrated VAR model there will be both stationary shocks and permanent shocks, and identifying these two types in the system is not always easy. If the VAR is of dimension $p$, there can be $r$ co-integrating vectors, $0 \le r \le p$, and $p - r$ common stochastic trends. Juselius (2006) ("The Co-integrated VAR Model", Oxford University Press) shows how an identification of the structural MA model, and orthogonalization of the residuals, can be done in terms of both the short run and the long run of the system. The VAR(2), with no constants, trends or other deterministic variables, will have the following VECM representation after finding $r$ co-integrating vectors,

$$ \Delta x_t = \Gamma_1 \Delta x_{t-1} + \alpha\beta' x_{t-1} + \varepsilon_t. $$

The MA version of this model is

$$ x_t = C \sum_{i=1}^{t} \varepsilon_i + C^*(L)\varepsilon_t + x_0, $$

where the first term on the right hand side represents the stochastic trends in the system and the second term represents the stationary part. The $C$ matrix thus represents everything that is not captured by the stationary (co-integrating) vectors, and is related to the co-integrating vectors as

$$ C = \beta_\perp (\alpha_\perp' \Gamma \beta_\perp)^{-1} \alpha_\perp', \qquad \Gamma = I - \Gamma_1, $$

where $\alpha_\perp$ and $\beta_\perp$ are the orthogonal complements of $\alpha$ and $\beta$.
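The long-run impact matrix $C$ can be computed directly from $\alpha$, $\beta$ and $\Gamma_1$ once orthogonal complements are available. The following NumPy sketch uses made-up values with $p = 3$ and $r = 1$, takes $\Gamma = I - \Gamma_1$ as in the VAR(2) case, and obtains the orthogonal complements from a singular value decomposition; the helper name `orth_complement` is hypothetical.

```python
import numpy as np


def orth_complement(M):
    """Columns spanning the orthogonal complement of the column space
    of the p x r matrix M (assumed to have full column rank)."""
    r = M.shape[1]
    U, _, _ = np.linalg.svd(M, full_matrices=True)
    return U[:, r:]                      # p x (p - r)


# Hypothetical adjustment coefficients, co-integrating vector and
# short-run matrix for p = 3 variables and r = 1 relation
alpha = np.array([[-0.4], [0.1], [0.0]])
beta = np.array([[1.0], [-1.0], [0.5]])
Gamma1 = np.array([[0.2, 0.0, 0.1],
                   [0.0, 0.3, 0.0],
                   [0.1, 0.0, 0.1]])
Gamma = np.eye(3) - Gamma1

alpha_perp = orth_complement(alpha)
beta_perp = orth_complement(beta)

# C = beta_perp (alpha_perp' Gamma beta_perp)^{-1} alpha_perp'
C = beta_perp @ np.linalg.inv(alpha_perp.T @ Gamma @ beta_perp) @ alpha_perp.T

print(np.round(C, 3))
print(np.round(beta.T @ C, 12))   # shocks have no long-run effect on beta'x_t
```

The matrix has rank $p - r = 2$, one for each common stochastic trend, and $\beta' C = 0$ confirms that the co-integrating combination is unaffected by the permanent shocks.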
9.1 BVAR, TVAR etc.

VAR models represent statistical descriptions of data series. As such, they are a basis for reducing your model and moving on to more ordinary structural econometric models, such as Vector Error Correction Models (VECMs). Estimating a VAR is then a way of making sure that the final model is a well-defined statistical model, i.e. a model that is consistent with the data chosen.

We have talked about what you can do with the VAR in terms of forecasting, simulations, impulse responses, forecast error decomposition and Granger causality testing. In this context we meet the so-called SVAR (Structural VAR). There are, however, a number of other VARs that one needs to know about.

The problems of working with VARs are obvious: there is a large number of parameters to be estimated, the estimated parameters might not be stable over time, and there are a number of variables that are not modelled in the VAR because the VAR would get too large to handle. If we want to use the VAR for forecasting, we need to address these problems.

To handle the problem of time varying parameters there are Time-Varying-Parameter VARs (TVP-VARs). In addition there are various VAR modeling techniques that deal with regime changes: Markov switching VARs, threshold VARs, floor and ceiling VARs, and smooth transition VARs. To work with a large number of variables and reduce the model it is possible to use factor analysis, which takes us to Factor Augmented VARs (FA-VARs).

Another approach is to use a priori information about the parameters and their distribution, as in Bayesian VARs (BVARs). The latter is a popular approach in many central banks. We can illustrate the problem in the following way. Your model predicts that the inflation rate will vary around 10%, and at the same time you have additional information indicating that inflation will fluctuate around 5 per cent, say because there is a sudden drop in inflation. What do you do? One approach is simply to reduce the constant term and predict changes in inflation around 5 per cent instead. A more ambitious approach is to incorporate more information in your model, from more data, to place more emphasis on recent observations, etc.

Changing the constant is easy and quite normal. As you start walking along the path of making assumptions about the data and the parameters of the model, you might go too far in the other direction. As long as we talk about forecasting, the proof is in the pudding. The best forecast wins, but when we talk about the best policy to achieve goals in the future you have to be much more careful.

The types of VARs we have discussed so far are basically statistical representations of the data. Without further restrictions, and the incorporation of long-run steady state relations in the form of co-integrating vectors, their predictive ability will be quite poor. Also, the economy is more complex, involving many more variables than the two to six variables that can be handled in a standard VAR. If your model contains fifty or one hundred variables, there will be too many lags and coefficients to estimate. One way of dealing with this problem is to use so-called Bayesian VARs (BVARs). In the BVAR you can use prior information to reduce the number of coefficients you need to estimate. BVARs are popular among many central banks, including both the ECB and the Fed, for constructing better and bigger VARs for forecasting.[5]

[5] Gary Koop at the University of Strathclyde has a home page with course material dealing with BVAR models.
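How prior information tames a heavily parameterized VAR can be illustrated with a deliberately stylised sketch. It assumes a Gaussian prior centred on a random walk for each variable (own first lag equal to one, all other coefficients zero) with a single tightness parameter `lam`, which is a simplification and not the full prior used in practice; under that assumption the posterior mean has a ridge-type closed form. The data are simulated random walks and the function names are hypothetical.

```python
import numpy as np


def lagged_design(data, k):
    """Return Y (T-k x p) and X (T-k x p*k) with lags 1..k stacked side by side."""
    T = data.shape[0]
    Y = data[k:]
    X = np.hstack([data[k - i:T - i] for i in range(1, k + 1)])
    return Y, X


def shrinkage_var(data, k, lam):
    """Posterior mean of the VAR coefficients under a Gaussian prior centred
    on a random walk; lam = 0 reproduces equation-by-equation OLS."""
    Y, X = lagged_design(data, k)
    p = data.shape[1]
    prior_mean = np.zeros((p * k, p))
    prior_mean[:p, :p] = np.eye(p)       # own first lag = 1, everything else 0
    lhs = X.T @ X + lam * np.eye(p * k)
    rhs = X.T @ Y + lam * prior_mean
    return np.linalg.solve(lhs, rhs), prior_mean


# Simulated data: three independent random walks (illustration only)
rng = np.random.default_rng(0)
data = np.cumsum(rng.standard_normal((80, 3)), axis=0)

ols, prior = shrinkage_var(data, k=2, lam=0.0)
post, _ = shrinkage_var(data, k=2, lam=10.0)

print(np.linalg.norm(ols - prior))   # distance of OLS from the prior mean
print(np.linalg.norm(post - prior))  # smaller: estimates are pulled to the prior
```

Increasing `lam` pulls all coefficients towards the prior mean, which is how prior information effectively reduces the number of coefficients that the data alone have to determine.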
Finally, remember that the data is the real world, economic theories are constructions of the human mind (quote from David Hendry). If you want to use a priori information of some kind you might miss what the data, the real world, is trying to tell you.
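The estimation steps described under "How estimate a VAR?" can be sketched as follows: each candidate lag length is estimated by OLS on a common sample, and the lag length is then chosen by minimizing the Schwarz criterion (BIC). The data are simulated from a made-up VAR(1), the function name `fit_var_ols` is hypothetical, and the residual diagnostics and dummy variables from the third and fourth steps are left out.

```python
import numpy as np


def fit_var_ols(data, k, k_max):
    """OLS estimate of a VAR(k) with a constant. Observations k_max..T-1 are
    used for every k, so that all lag lengths are compared on the same sample."""
    T, p = data.shape
    n_obs = T - k_max
    Y = data[k_max:]
    lags = [data[k_max - i:T - i] for i in range(1, k + 1)]
    X = np.hstack([np.ones((n_obs, 1))] + lags)
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ coef
    sigma = resid.T @ resid / n_obs
    n_params = p * (1 + p * k)
    bic = np.log(np.linalg.det(sigma)) + n_params * np.log(n_obs) / n_obs
    return coef, resid, bic


# Simulated bivariate VAR(1) with made-up coefficients
rng = np.random.default_rng(1)
A1_true = np.array([[0.5, 0.1],
                    [0.2, 0.3]])
data = np.zeros((400, 2))
for t in range(1, 400):
    data[t] = A1_true @ data[t - 1] + rng.standard_normal(2)

k_max = 4
bics = {k: fit_var_ols(data, k, k_max)[2] for k in range(1, k_max + 1)}
best_k = min(bics, key=bics.get)

coef, resid, _ = fit_var_ols(data, 1, k_max)
A1_hat = coef[1:3].T      # rows 1..p of coef hold the transposed lag-1 matrix
print(best_k)
print(np.round(A1_hat, 2))
```

Since every equation contains the same regressors, the single least squares call is identical to running OLS equation by equation.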
Part III Granger Non-causality Tests
Whether one variable affects another in such a way that it can be said to cause the other variable is a fundamental question in all sciences. However, to validate empirically that one variable is caused by another variable is problematic in economics, since it is often quite difficult to set up controlled experiments. Granger (1969), building upon work done by Wiener, was the first to formalize an empirical concept of causality in economics. Granger's basic idea is that the future cannot predict the present or the past. It follows, as a necessary condition, that for one variable ($x_t$) to cause another variable ($y_t$), lagged values of $x_t$ must predict $y_t$. This can be tested with the following vector autoregressive model,

$$ y_t = \sum_{i=1}^{k} \alpha_i y_{t-i} + \sum_{i=1}^{k} \beta_i x_{t-i} + e_t, \tag{9.13} $$

where $y_t$ is explained by lagged values of $y_t$ and $x_t$. The lag length ($k$) is determined such that $e_t$ is a white noise process, $e_t \sim NID(0, \sigma^2)$. Alternatively, if you cannot find white noise residuals, minimize an information criterion instead. If at least one of the parameters associated with the process $x_t$ is different from zero, so that the hypothesis $\beta_1 = \dots = \beta_k = 0$ is rejected, then $x_t$ is predicting $y_t$, and $x_t$ can also be said to Granger cause the variable $y_t$. If, on the other hand, all $\beta$-parameters are zero, $x_t$ cannot predict or cause $y_t$. An $F$-test of the joint significance of the $\beta$ parameters is sufficient in this case. (Alternatively, the test can be set up in the form of a chi-square test, depending mainly on the software you are using.) The $F$-test works by comparing the mean squared error from the equation above with that from a regression where the $x$'s are excluded. If the inclusion of lagged $x$ variables leads to a significant reduction in the mean squared error, lagged values of $x_t$ are predicting $y_t$ and the variable $x_t$ can be said to Granger cause $y_t$.

Please notice the distinction between prediction and causality, which is important in a policy context. The fact that the $\beta_i$, $i = 1, \dots, k$, are jointly significantly different from zero, so that $x_t$ is predicting $y_t$, does not imply that $x_t$ causes $y_t$. It is easy to understand why from the following analogy. A weatherman who predicts rain tomorrow does not cause the rain that might fall tomorrow. This is so no matter how good this person is at predicting tomorrow's weather. This is the reason why the test should always be referred to as a Granger non-causality test and not a test of causality. Based on the assumption that the future cannot predict the present and the past, we can only test whether a variable is not causing another.

Of course, the outcome of the test might be affected by the number of lags chosen in the VAR, and by the variables chosen to be included in the VAR. Though two variable VARs are common, this is often a crude simplification. The classical example is the effect of real money growth on real GDP growth. In one set-up you might find that monetary policy is effective, but add the interest rate to the VAR and you might find that monetary policy is ineffective.

Finding that $x_t$ Granger causes $y_t$ does not exclude that the reverse is true. Two variables can Granger cause each other. A test of whether $y_t$ Granger causes $x_t$ is performed with the following model,

$$ x_t = \sum_{i=1}^{k} \gamma_i x_{t-i} + \sum_{i=1}^{k} \delta_i y_{t-i} + \eta_t, \tag{9.14} $$

where the lag length is the same as before, and $\eta_t \sim NID(0, \omega^2)$. If lagged values of $y_t$ predict $x_t$, then $y_t$ is Granger causing $x_t$. In some situations testing the reverse relationship is of no interest. For instance, the inflation rate in a small open economy should not Granger cause the inflation rate of the world.

The main weakness of the Granger non-causality test is the assumption that the error process in the VAR is not only a white noise process, but also a white noise innovation process with respect to all relevant information for explaining the movements of $x_t$ and $y_t$. This is an important issue which is often forgotten in applied work, where bivariate systems are the rule rather than the exception.

Granger's basic definition of non-causality is based on the assumption that all factors relevant for predicting $y_t$ are known. Let $I_t$ represent all relevant information, both past and present, and let $X_t$ be present and past observations on $x_t$, such that $X_t = (x_t, x_{t-1}, x_{t-2}, \dots, x_0)$; $I_{t-1}$ and $X_{t-1}$ represent past observations only. The variable $x_t$ can therefore be said to Granger cause $y_t$ if the mean squared error (MSE) increases when $y_t$ is regressed against the information set where $X_{t-1}$ is removed. In the bivariate case, this can be stated as

$$ MSE(\hat{y}_t \mid I_{t-1}) < MSE[\hat{y}_t \mid (I_{t-1} - X_{t-1})], \tag{9.15} $$

where $\hat{y}_t$ is the predicted value of $y_t$. The problem is to know what should be included in $I_t$. If too many variables are included, the degrees of freedom will diminish. If too few variables are included, the test might lead to the wrong conclusions. The result of a unidirectional relation from $x_t$ to $y_t$ in a bivariate model might be reversed if a relevant third variable is included in the system. This is a serious limitation of the Granger causality test. A way of reducing the problem is to always perform the tests in a VAR system. If some variable is to be treated as exogenous in the system, this must be based on strong a priori knowledge.

The Granger non-causality test is sensitive to the spurious regression problem. The $F$-test is unreliable when used on integrated or near-integrated variables, which is the standard situation in economics. However, using only first differences of the variables implies a loss of information. In this situation it is recommended to include error correction terms (or co-integrating vectors) in the VAR to increase the efficiency of the $F$-tests. There is an interesting relation between cointegration and Granger causality, as shown by Engle and Granger (1987). If a co-integrating relationship is found, it follows that there must exist Granger causality in at least one direction. Tests of cointegration do not make causality tests redundant, since they cannot determine the direction of the causality. However, if no cointegration is found, we can only conclude that there is no long-run Granger causality working through an error correction term; short-run Granger causality among the differenced variables may still exist.
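The $F$-test described around (9.13) and (9.14) compares the residual sum of squares with and without the lags of the other variable. The sketch below computes the $F$ statistic with NumPy on simulated stationary data in which $x_t$ helps predict $y_t$ but not the other way around. The data generating process and the function name `granger_f_stat` are made up for illustration, and a constant is added to the regressions, which (9.13) does not show.

```python
import numpy as np


def granger_f_stat(y, x, k):
    """F statistic for H0: beta_1 = ... = beta_k = 0 in (9.13),
    estimated with a constant added to both regressions."""
    T = len(y)
    Y = y[k:]
    const = np.ones((T - k, 1))
    y_lags = np.column_stack([y[k - i:T - i] for i in range(1, k + 1)])
    x_lags = np.column_stack([x[k - i:T - i] for i in range(1, k + 1)])

    def rss(X):
        coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
        resid = Y - X @ coef
        return resid @ resid

    rss_restricted = rss(np.hstack([const, y_lags]))
    rss_unrestricted = rss(np.hstack([const, y_lags, x_lags]))
    dof = (T - k) - (2 * k + 1)
    return ((rss_restricted - rss_unrestricted) / k) / (rss_unrestricted / dof)


# Simulated data: x is an AR(1); y depends on its own lag and on lagged x
rng = np.random.default_rng(2)
T = 300
x = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.standard_normal()

f_x_to_y = granger_f_stat(y, x, k=2)   # does x Granger cause y?
f_y_to_x = granger_f_stat(x, y, k=2)   # does y Granger cause x?
print(f_x_to_y, f_y_to_x)
```

Each statistic is compared with a critical value from the $F(k, T - 3k - 1)$ distribution. In this simulated setting the first statistic should be far above any conventional critical value, while the second tests a null hypothesis that is true by construction.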
10. INTRODUCTION TO EXOGENEITY AND MULTICOLLINEARITY

10.1 Exogeneity

Exogeneity assumptions are necessary in econometric model building. In many situations they are used in an ad hoc way: variables are said to be determined outside the system, or are classified as endogenous or predetermined. Based on this classification of the variables in the system, the basic econometric text book explains how to apply the rank and the order conditions to identify a simultaneous system, and whether it is possible to use OLS or if a system estimator is necessary.

In this section we introduce three basic concepts of exogeneity that cover (1) estimation and inference, (2) conditional forecasting and simulations, and (3) policy conclusions. The three concepts that allow you to perform these tasks are weak exogeneity, strong exogeneity and super exogeneity.

Consider the following system, in which there is co-integration,

$$ y_t = \beta x_t + \varepsilon_{1t} $$

$$ \Delta x_t = \varepsilon_{2t}. $$

If $\varepsilon_{1t}$ and $\varepsilon_{2t}$ are both stationary, it follows that $x_t$ is I(1), and that $y_t$ is I(0) if $\beta = 0$. On the other hand, if $\beta \neq 0$, it follows that $y_t$ is I(1). To estimate $\beta$ it is required that $y_t$ does not simultaneously influence $x_t$. If $y_t$ or $\Delta y_t$ is part of the right-hand side of the $\Delta x_t$ equation (and thus embedded in $\varepsilon_{2t}$), the result is that $E(\varepsilon_{1t}\varepsilon_{2t}) \neq 0$, and we can write

$$ \varepsilon_{1t} = \rho\varepsilon_{2t} + u_t, $$

where for simplicity we assume that $u_t \sim N(0, \sigma^2)$. Now, if we estimate $\beta$ with OLS, the outcome would be a biased estimate of $\beta$, since $E(x_t\varepsilon_{1t}) = E(x_t(\rho\varepsilon_{2t} + u_t)) \neq 0$, and we can no longer assume that $x_t$ and $\varepsilon_{1t}$ are independent. This is an example of lack of weak exogeneity. With the first model it is not possible to estimate the parameter of interest $\beta$; the outcome from OLS is a different and biased $\beta$ value.

Weak Exogeneity

Weak exogeneity spells out the conditions under which it is possible to obtain unbiased and efficient estimates. The definition is based on splitting the joint density function into a conditional density and a marginal density function,

$$ D_1(y_t, z_t \mid Y_{t-1}, Z_{t-1}; \lambda_1) = D_2(y_t \mid z_t, Y_{t-1}, Z_{t-1}; \lambda_2)\, D_3(z_t \mid Y_{t-1}, Z_{t-1}; \lambda_3), \tag{10.1} $$

where the parameters of interest ($\theta$) are given as $\theta = f(\lambda_1)$, and $Y_{t-1}$ and $Z_{t-1}$ are matrices of the finite historical values of these variables. The conditions under which it is possible to estimate the parameters of interest by modeling only the conditional density are that $\theta$ can be obtained from $\lambda_2$ alone, that $\lambda_2$ and $\lambda_3$ are variation free, and that there are no cross restrictions between the parameters of $\lambda_2$ and $\lambda_3$. In practical situations, using stationary data, this comes down to judging whether the error terms of the marginal and conditional models are correlated.[1] (If the data series are integrated, the question becomes one of long-run independence between the two residual processes.)

Three important conclusions follow from the definition above. The first is that whether a variable is exogenous or not depends on the parameters of interest. An OLS regression will always lead to estimates of some kind, but what is their meaning? To understand the regression we identify parameters of interest that relate to other variables through the (not modelled) marginal density functions. Thus, exogeneity must be stated in terms of parameters of interest, i.e. the variable $z_t$ is weakly exogenous for the parameter $\beta$.

Second, it is difficult to test for weak exogeneity. Most existing tests fail, with the exception of Johansen's test for weak exogeneity of the variables in co-integrating vectors.[2] The purpose of an exogeneity test is mainly to find an argument for not specifying the marginal model. However, the definition of weak exogeneity tells us that this is not possible. A test will need the estimated marginal model, otherwise it will not work. But when the marginal model is estimated (and tested for misspecification) the work is already done, so the only thing left is to compare the results.

The third conclusion is that it is not possible to simply state that a variable like US inflation is determined outside a model for inflation in Zambia, or that rainfall is determined outside an agricultural model. If these variables enter the system in terms of expectations, it might be necessary to specify the stochastic process that generates these expectations in the model, to get unbiased and efficient estimates of the parameters of interest.

Strong Exogeneity

Strong exogeneity spells out the conditions for conditional forecasting and simulations of a model with non-modelled variables. The condition is weak exogeneity, and that the marginal model should not depend on the endogenous variable. Thus the marginal process must be

$$ D_3(z_t \mid Y_{t-1}, Z_{t-1}; \lambda_3) = D_3(z_t \mid Z_{t-1}; \lambda_3). \tag{10.2} $$

This means that there is no feedback from $y_t$ to $z_t$, so that it is not necessary to estimate the marginal process to forecast $y_t$ conditionally on $z_t$.

[1] The condition of no correlation between the error terms is easily understandable if we assume that $\{y_t, z_t\}$ is a bivariate normal process. Set up the density function, and determine the condition under which it is possible to estimate the parameters of interest from the conditional model only.

[2] Regarding Johansen's test, it is important to remember that it is model dependent. The test is performed conditionally on the short-run dynamics of the variables included in the system, the dummy variables and the specification of the deterministic trend.
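The two-equation system at the beginning of Section 10.1 can be simulated to see what a failure of weak exogeneity does to OLS. The Monte Carlo sketch below uses made-up values ($\beta = 1$, $\rho = 0.8$, unit variances) and a hypothetical function name. Because $x_t$ is I(1) and co-integrated with $y_t$ in this set-up, the bias is a small-sample phenomenon that shrinks as the sample grows, so the experiment is run for a short and a long sample.

```python
import numpy as np


def mean_ols_bias(beta, rho, T, reps, seed):
    """Average OLS bias of beta in y_t = beta*x_t + eps1_t, where
    x_t is a random walk driven by eps2_t and eps1_t = rho*eps2_t + u_t."""
    rng = np.random.default_rng(seed)
    eps2 = rng.standard_normal((reps, T))
    u = rng.standard_normal((reps, T))
    x = np.cumsum(eps2, axis=1)          # Delta x_t = eps2_t, so x_t is I(1)
    eps1 = rho * eps2 + u                # correlated with the shock driving x_t
    y = beta * x + eps1
    beta_hat = (x * y).sum(axis=1) / (x * x).sum(axis=1)   # OLS, no constant
    return beta_hat.mean() - beta


bias_short = mean_ols_bias(beta=1.0, rho=0.8, T=25, reps=2000, seed=3)
bias_long = mean_ols_bias(beta=1.0, rho=0.8, T=400, reps=2000, seed=3)
bias_exog = mean_ols_bias(beta=1.0, rho=0.0, T=25, reps=2000, seed=3)

print(bias_short, bias_long, bias_exog)
```

With $\rho = 0$ the regressor is weakly exogenous for $\beta$ and the average bias is close to zero, while with $\rho \neq 0$ the estimate is pushed away from the true value in short samples.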
99 Super Exogeneity Super exogeneity determines the conditions for using the estimated parameters for policy decisions. The condition is weak exogeneity and that the parameters of the conditional model are stable w.r.t. to changes in the marginal model. For instance, if the money supply rule changes, the parameters of the marginal process will also change. If this also leads to changes of the parameters of the conditional model, the conditional model cannot be used to analyse the implications of policy changes. Thus, super exogeneity defines the situations when the Lucas critique is not valid Multicollinearity and understanding of multiple regression. Multicollinearity has to do with how we understand the estimated parameters. Study the following model, y t = β 0 + β 1 x t + β 2 z t + ɛ t The estimated parameters of this model is analysed under the assumption that there is no correlation between the variables. The parameter β 1 is understood as the effect on y t following a unit change in x t while holding the other variables in the model (z t ) constant. In the same way β 2 measures the effect on y t while x t is held constant. Another way of expressing this is the following; E{y t z t } = β 1 x t and E{y t x t } = β 2 z t, which tells us that the effect of one parameter cannot be analysed in isolation from the rest of the model. The effect of z t in the model is not on y t in it self, it is on y t conditional on x t. The meaning of holding say x t constant in the model, while z t is free to vary implies that we study the effect on y t after removing the effects of x t on y t.if x t and z t are correlated it is not possible to keep one of the constant while the other is changing. This is the multicollinearity problem. The statistical problem is best understood by looking at the OLS variance of ˆβ. The variance is V ar(ˆβ 2 ) = σ 2 ɛ (xt x 2 ) (1 ρ xz ), where ρ xz is the correlation between x t and z t. 
If the correlation is perfect, $\rho_{xz} = 1$, the denominator becomes zero and the calculation of the variance breaks down. Perfect multicollinearity means that the inverse $(X'X)^{-1}$ does not exist, and there is no solution to $\hat\beta = (X'X)^{-1}X'Y$. This is seldom a practical problem, since the computer program that calculates the estimates will break down when it tries to invert the matrix (see footnote 3). Near, but less than perfect, multicollinearity, meaning that $\rho$ is between zero and unity, is more complex. However, the problem is limited to the understanding of the estimated parameters; it does not affect the understanding of the model. Less than perfect multicollinearity affects the estimated variances of the parameters, not the residual variance of the model ($\sigma^2$).

3 If the inversion process does not break down completely, the estimated variances of one or more parameters will be incredibly large.

Historically, a number of measurements, remedies and quick fixes for multicollinearity have been suggested. None of these actually work. In cross section studies a typical problem is to explain household consumption. If you use household income, the number of rooms that the household possesses, the number of children and the size of the car as explanatory variables, you would not be surprised to learn that these explanatory variables are highly correlated with each other. As a consequence it might be hard to understand what the parameters are estimating. This example shows that throwing in explanatory variables without a clear economic model in the background will lead to problems. There is no substitute for economic theory in this example.

In time series modelling multicollinearity is often, somewhat mistakenly, linked to the estimation of lag lengths. Take the following distributed lag model as an example,

$y_t = \beta_1 x_t + \beta_2 x_{t-1} + \varepsilon_t$.

If $x_t$ is an AR(p) process, the $x$ variables in the equation are of course correlated, meaning that we cannot hold $x_{t-1}$ constant and at the same time analyse the effect of varying $x_t$ on its own. On the other hand, we are not interested in changing one lag while keeping the rest fixed. In a time series regression, estimation aims at finding the sufficient number of lags to describe the dynamic process. However, since the lags are correlated with each other, this will affect the estimated variance of each lag. This makes it more difficult to determine the correct number of lags in a model if we check the fit of the model by looking at the t-values of the parameters only. Since model building should be aimed at finding a white noise innovation term, t-values are seldom used to decide the overall fit of the model. Instead we focus on misspecification tests of the model. We can summarize the facts about multicollinearity as follows.
There is no way to accurately measure the degree of multicollinearity and there are no quick fixes. Never, under any circumstances, should you delete variables to solve the problem, as is suggested in some textbooks. Deleting variables means that you change the specification and the fit of the model. Leaving out a relevant explanatory variable leads to a misspecified model, which creates bias in the estimates and affects inference. As shown in Hendry (1990 Ch. 6), multicollinearity is not a model problem or a misspecification problem; it has to do with the interpretation of the estimated parameters only, and not with the fit of the model. It can be shown that the variables in a given model can be transformed such that they become orthogonal to each other, without affecting the fit of the model. Returning to the example above, the interpretation of the parameters can be made clearer if we use the transformation $\Delta = 1 - L$,

$y_t = \beta_1 \Delta x_t + \beta_3 x_{t-1} + \epsilon_t$. (10.3)

The transformation is just a reparameterization and does not affect the residual term. The parameter $\beta_3 = \beta_1 + \beta_2$ is the long-run static solution of the model. Thus we get an estimate of the short-run effect on $y_t$ from $\beta_1$ and at the same time a direct estimate of the static long-run solution from $\beta_3$. If the collinearity between $x_t$ and $x_{t-1}$ is high, it can be assumed to be quite small between $\Delta x_t$ and $x_{t-1}$.

Since our final interest in modelling economic time series is to find a well-defined statistical model, which mimics the DGP of the variable(s), multicollinearity is not really a problem. We will therefore not deal with this topic any further.
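The reparameterization argument can be checked numerically. The sketch below is only an illustration under assumed values: the AR(1) coefficient 0.95, the parameters 0.5 and 0.3, and the helper name `ols` are not from the text. It fits the distributed lag model in levels and in the $\Delta x_t$, $x_{t-1}$ form and compares residuals, parameters and regressor correlations.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500

# Assumed DGP: a persistent AR(1) regressor, so x_t and x_{t-1} are highly correlated.
x = np.zeros(T + 1)
for t in range(1, T + 1):
    x[t] = 0.95 * x[t - 1] + rng.normal()
x_t, x_lag = x[1:], x[:-1]
y = 0.5 * x_t + 0.3 * x_lag + rng.normal(size=T)

def ols(X, y):
    """Hypothetical helper: OLS coefficients and residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

# Levels form: y_t = b1 x_t + b2 x_{t-1} + e_t
b_levels, e_levels = ols(np.column_stack([x_t, x_lag]), y)
# Reparameterized form: y_t = b1 dx_t + b3 x_{t-1} + e_t
b_repar, e_repar = ols(np.column_stack([x_t - x_lag, x_lag]), y)

corr_levels = np.corrcoef(x_t, x_lag)[0, 1]
corr_repar = np.corrcoef(x_t - x_lag, x_lag)[0, 1]

print("b1 + b2 (levels):", b_levels[0] + b_levels[1])
print("b3 (reparameterized):", b_repar[1])
print("corr(x_t, x_{t-1}):", corr_levels)
print("corr(dx_t, x_{t-1}):", corr_repar)
```

Both regressions span the same column space, so the residuals coincide and the coefficient on $x_{t-1}$ in the second form equals the sum of the two coefficients in the first, while the correlation between the regressors drops sharply.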
11. UNIVARIATE TESTS OF THE ORDER OF INTEGRATION

This section looks at a number of unit root tests, which can be applied to determine the order of integration of a variable. The following tests are presented:

- DF-test: Dickey-Fuller test
- ADF-test: Augmented Dickey-Fuller test
- Z-test: Phillips and Perron's Z-test (to be included)
- LMSP-test: Schmidt and Phillips LM test
- KPSS-test: Kwiatkowski, Phillips, Schmidt and Shin test
- G(p, q)-test: Park's G-test

The alternative hypotheses to having an integrated series are discussed in a following section.

The DF-test

The Dickey-Fuller test is one of the oldest tests. The test builds on the assumed DGP, $y_t = y_{t-1} + \epsilon_t$ with $\epsilon_t \sim NID(0, \sigma^2)$. Given this DGP, subtract $y_{t-1}$ from both sides, and estimate the equation

a) $\Delta y_t = \pi y_{t-1} + \epsilon_t$,

or put a constant term in the regression, to allow for the alternative of a deterministic trend in $y_t$ (footnote 1),

b) $\Delta y_t = \alpha + \pi y_{t-1} + \epsilon_t$,

or put both a constant and a time trend in the estimated equation, to allow for both a linear deterministic trend and a quadratic deterministic trend in $y_t$,

c) $\Delta y_t = \alpha + \pi y_{t-1} + \beta t + \epsilon_t$,

where $\pi = 0$ if $y_t$ is I(1). In this regression we know that the estimate of $\pi$ will be biased downwards in a limited sample. Thus, we can put all the risk on the negative side and perform a one-sided test instead of a standard two-sided t-test. The one-sided t-test is $H_0: \pi = 0$, $y_t \sim I(1)$, against $H_1: \pi < 0$, $y_t \sim I(0)$. The correct t-statistic for testing the significance of $\hat\pi$ is tabulated in Fuller (1976), under the assumption that $y_t$ is a random walk, $\Delta y_t \sim N(0, \sigma^2)$. The correct distribution for the t-test can also be calculated from MacKinnon (1991) for the exact sample size at hand. In practice the differences are small though. The t-statistics for the constant term and the trend term are tabulated in Dickey and Fuller (1980). Notice that the null hypothesis is that $\Delta y_t = \epsilon_t$, where $\epsilon_t$ is white noise.
The econometrician, however, will not know this in advance. S/he must therefore set up the estimated model so that there is a meaningful alternative hypothesis to the stochastic trend (or unit root) hypothesis. A general alternative is to assume that $y_t$ is driven by a combination of $t$ and $t^2$. It is therefore recommendable, if $\epsilon_t$ is white noise, to start with model c. If the t-value on $\pi$ is significant according to the table in Fuller (1976), the null hypothesis of a unit root process is rejected. It then follows that the t-statistics for testing the significance of $\beta$ and $\alpha$ follow standard distributions. But, as long as the unit root hypothesis ($\pi = 0$) cannot be rejected, both $\beta$ and $\alpha$ must be assumed to follow non-standard distributions. Thus, under the hypothesis that $\pi = 0$, the appropriate distributions for $\beta$ and $\alpha$ are found in Dickey and Fuller (1980). In a limited sample it might be wise to compare the outcome of both model c and model a. The test is easily extended to higher order unit roots, simply by performing the test on differenced data series.

When will the test go wrong? First, if $\epsilon_t$ is not white noise. In principle, $\epsilon_t$ can be an ARIMA process. In the following, a number of models dealing with this situation are presented. Second, if there is more than one unit root, then testing for one unit root is likely to be misleading. Hence a good testing strategy is to start by testing for two unit roots, which is done by applying the DF-test to the first difference of the series ($\Delta y_t$). If a unit root in $\Delta y_t$ is rejected, one can continue with testing for one unit root, using the series in level form, $y_t$.

1 To understand why the constant represents a linear deterministic trend, go back to the discussion about the properties of the random walk process.

The ADF-test

The DF-test, like all tests of I(1) versus I(0), is sensitive to deviations from the assumption $\epsilon_t \sim NID(0, \sigma^2)$. The assumption of NID errors is critical to the simulated distributions in Fuller (1976). If there is autocorrelation in the residual process, the OLS estimated residual will be inappropriate, and the residual variance estimate will be biased and inconsistent.
The ADF-test seeks to solve the problem by augmenting the equations with lagged $\Delta y_t$,

$\Delta y_t = \pi y_{t-1} + \sum_{i=1}^{k} \gamma_i \Delta y_{t-i} + \epsilon_t$, (11.1)

or

$\Delta y_t = \alpha + \pi y_{t-1} + \sum_{i=1}^{k} \gamma_i \Delta y_{t-i} + \epsilon_t$, (11.2)

or

$\Delta y_t = \alpha + \pi y_{t-1} + \beta t + \sum_{i=1}^{k} \gamma_i \Delta y_{t-i} + \epsilon_t$. (11.3)

The asymptotic test statistic is distributed as the DF-test, and the same recommendation applies to these equations: make sure there is a meaningful alternative hypothesis. Therefore start with the model including both a constant and a trend. The ADF-test is better than the original DF-test since the augmentation leads to empirical white noise residuals. As for the DF-test, the ADF-test must be set up in such a way that it has a meaningful alternative hypothesis, and higher order integration must be tested before the single unit root case (footnote 2).

2 Sjöö (2000b) explains in some detail how the test is used in practice.
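A minimal sketch of the regression in (11.1) to (11.3), written directly in NumPy so that each term is visible. The function name `adf_tstat`, the `trend` codes and the simulated series are assumptions made for illustration; the returned value is the ordinary OLS t-ratio on $\pi$, which must be compared with Dickey-Fuller tables, not with the normal distribution.

```python
import numpy as np

def adf_tstat(y, k=1, trend="ct"):
    """OLS t-ratio on pi in the ADF regression.

    trend: "n" (no deterministics), "c" (constant) or "ct" (constant and trend),
    corresponding to (11.1), (11.2) and (11.3). Hypothetical helper.
    """
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)                          # dy[j] = y[j+1] - y[j]
    n = len(dy) - k                          # usable observations
    cols = [y[k:-1]]                         # y_{t-1}
    for i in range(1, k + 1):
        cols.append(dy[k - i:len(dy) - i])   # lagged differences, dy_{t-i}
    if trend in ("c", "ct"):
        cols.append(np.ones(n))
    if trend == "ct":
        cols.append(np.arange(1, n + 1, dtype=float))
    X = np.column_stack(cols)
    target = dy[k:]
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    s2 = resid @ resid / (n - X.shape[1])    # residual variance
    cov = s2 * np.linalg.inv(X.T @ X)
    return beta[0] / np.sqrt(cov[0, 0])

rng = np.random.default_rng(1)
T = 500
random_walk = np.cumsum(rng.normal(size=T))
stationary = np.zeros(T)
for t in range(1, T):
    stationary[t] = 0.5 * stationary[t - 1] + rng.normal()

t_rw = adf_tstat(random_walk, k=2, trend="ct")
t_st = adf_tstat(stationary, k=2, trend="ct")
# With constant and trend, the asymptotic 5 % critical value is roughly -3.41.
print("random walk:", t_rw, " stationary AR(1):", t_st)
```

In applied work the lag length `k` would be chosen so that the residuals show no autocorrelation, as discussed in the text.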
The critical factor is to choose the length of the augmentation. Because $\Delta y_t$ is stationary, the estimates of the lag parameters follow normal distributions, and standard tests, including Q-tests and LM tests for serial correlation in the residual, can be used. In small samples the augmentation might play an important role for the outcome of the test. No general rule can be established, other than that the residuals should not display autocorrelation. It is therefore up to the modeller to convince the readers (the critics) that the final verdict regarding the significance, or non-significance, of $\pi$ rests on solid ground.

An additional complication is how to treat outliers in the sample. Outliers will affect the estimation, in particular the significance of the constant and the trend variable. If trends are significant under the null of a unit root process, according to the tabulations in Dickey and Fuller (1979), the conclusion is that the estimate of the parameter on $y_{t-1}$ follows a normal distribution. Finding significant time trends often implies the rejection of a unit root. But, if this is caused by an outlier affecting the estimation of the trend, one has to be careful in rejecting the unit root. In the case of significant trend variables leading to the rejection of the unit root hypothesis, some careful investigation of outliers is called for, to guard against spurious regressions.

The DF and ADF tests are the most well-known tests, and are easily understood by most people. However, in limited samples and with $\epsilon_t$ not being white noise, they are often quite inconclusive. The tests should therefore be accompanied by graphs and perhaps other tests.

The Phillips-Perron test

The ADF-test tries to solve the problem of non-white noise residuals by adding lags of the dependent variable. It should be stressed that the ADF-test is quite adequate as a data descriptive device under the maintained hypothesis that the variables in a sample are integrated of order one.
There are, however, a number of tests which try to improve on some of the weaknesses of the ADF-test. Phillips and Perron (1988) suggest a non-parametric correction of the test statistic so that the Dickey-Fuller distribution can be used even in cases when the residual in the DF-test is not white noise. (The KPSS-test below is a more recent modification of the same principle.) The method starts from the estimated t-value ($\hat t_\pi$) and the estimated residuals from the DF equation. The test statistic ($t^*$), the t-value, is modified with the following formula,

$t^* = \dfrac{S_\epsilon}{S_\Pi}\,\hat t_\pi - \dfrac{T\,[S_\Pi^2 - S_\epsilon^2]\,[\mathrm{std.err}(\hat\pi)/s]}{2 S_\Pi}$, (11.4)

where $s$ is the standard error of the DF regression (the square root of the residual variance), and

$S_\epsilon^2 = T^{-1} \sum_{t=1}^{T} \hat\epsilon_t^2$, (11.5)

$S_\Pi^2 = T^{-1} \sum_{t=1}^{T} \hat\epsilon_t^2 + 2 T^{-1} \sum_{j=1}^{l} [1 - j(l+1)^{-1}] \sum_{t=j+1}^{T} \hat\epsilon_t \hat\epsilon_{t-j}$. (11.6)

The last expression is a non-parametric estimate of the residual variance, using Bartlett's triangular window. The critical factor is to determine the size of the lag window $l$.
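The correction in (11.4) to (11.6) can be sketched as follows. The function names and the example inputs (a t-value of -2.5, a standard error of 0.02, AR(1) residuals with coefficient 0.7) are illustrative assumptions, not values from the text; `s` is taken to be the standard error of the DF regression.

```python
import numpy as np

def bartlett_long_run_variance(resid, l):
    """Return (S_eps^2, S_Pi^2) as in (11.5) and (11.6). Hypothetical helper."""
    e = np.asarray(resid, dtype=float)
    T = len(e)
    s2_eps = e @ e / T
    s2_pi = s2_eps
    for j in range(1, l + 1):
        weight = 1.0 - j / (l + 1.0)         # Bartlett triangular window
        s2_pi += 2.0 * weight * (e[j:] @ e[:-j]) / T
    return s2_eps, s2_pi

def pp_tstat(t_pi, se_pi, s, resid, l):
    """Modified t-value t* of (11.4), given DF regression output."""
    T = len(resid)
    s2_eps, s2_pi = bartlett_long_run_variance(resid, l)
    s_pi = np.sqrt(s2_pi)
    return (np.sqrt(s2_eps) / s_pi) * t_pi - T * (s2_pi - s2_eps) * (se_pi / s) / (2.0 * s_pi)

# Assumed example: positively autocorrelated residuals.
rng = np.random.default_rng(4)
e = np.zeros(2000)
for t in range(1, len(e)):
    e[t] = 0.7 * e[t - 1] + rng.normal()

s2_eps, s2_pi = bartlett_long_run_variance(e, l=5)
print("S_eps^2:", s2_eps, " S_Pi^2:", s2_pi)
print("t* with l = 5:", pp_tstat(-2.5, 0.02, np.sqrt(s2_eps), e, l=5))
```

With `l = 0` the two variance estimates coincide and $t^*$ reduces to the unadjusted t-value, which is a convenient sanity check on any implementation.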
The LMSP-test

Start with the following DGP, $y_t = \alpha + \beta t + x_t$ and $x_t = \pi x_{t-1} + \mu_t$, where $\mu_t \sim NID(0, \sigma^2)$. Under a unit root, $H_0: \pi = 1$. To test, run the following regression,

$\Delta y_t = \alpha + \phi \hat S_{t-1}$,

where $\hat S_t = \sum_{j=2}^{t} [\Delta y_j - (y_T - y_1)/(T-1)]$. Schmidt and Phillips (1992) simulated the t-statistic for $\hat\phi$.

The KPSS-test

This test is calculated by RATS 4. The DGP is assumed to be $y_t = \beta t + r_t + \epsilon_t$, where $r_t = r_{t-1} + \upsilon_t$, $\epsilon_t \sim NID(0, \sigma_\epsilon^2)$ and $\upsilon_t \sim NID(0, \sigma_\upsilon^2)$. The null hypothesis is that $y_t$ is stationary. The test is $H_0: \sigma_\upsilon^2 = 0$ against $H_1: \sigma_\upsilon^2 > 0$. Start by estimating the following equation,

$y_t = \alpha + \beta t + e_t$, (11.7)

and use the estimated residuals to construct the following LM test statistic,

$\eta = T^{-2} \sum_{t=1}^{T} S_t^2 / s^2(k)$, (11.8)

where

$S_t = \sum_{i=1}^{t} \hat e_i$ (11.9)

and

$s^2(k) = T^{-1} \sum_{t=1}^{T} \hat e_t^2 + 2 T^{-1} \sum_{s=1}^{k} w(s, k) \sum_{t=s+1}^{T} \hat e_t \hat e_{t-s}$. (11.10)

The critical values for the test are given in Kwiatkowski et al. (1992). A Bartlett type window, $w(s, k) = 1 - [s/(k+1)]$, is used so that the sample test statistic corresponds to the simulated distribution, which is based on white noise residuals.

The KPSS test appears to be powerful against the alternative of a fractionally integrated series. That is, a rejection of I(0) does not lead to I(1), as in most unit root tests, but rather to an I(d) process where $0 < d < 1$. These types of series are called fractionally integrated. A high value of $d$ implies a long memory process. In contrast to an integrated series, I(1) or I(2) etc., a fractionally integrated series is mean reverting; see Baillie and Bollerslev (1994).

The G(p, q) test

This test builds on the conclusion that for a unit root variable, the estimated residuals are inappropriate and will indicate that unrelated variables are statistically significant (spurious regression). Therefore estimate,

Model 1: $y_t = \alpha + \beta t + \epsilon_{1t}$ (11.11)
Model 2: $y_t = \alpha + \beta_1 t + \beta_2 t^2 + \epsilon_{2t}$, (11.12)

where $t^2$ is a superfluous variable. Calculate the following test statistic,

$G(1, 2) = (RSS_1 - RSS_2)/s^2(k)$, (11.13)

where $RSS_1$ and $RSS_2$ are the residual sums of squares from models 1 and 2 respectively, and $s^2(k)$ is as above.

We can conclude that among these tests, the ADF test is robust as long as the lag structure is correctly specified. The gains from correcting the estimated residual variance seem to be small.

The Alternative Hypothesis in I(1) Tests

Rejecting one unit root does not necessarily mean that one can accept the alternative of an I(0) series. Sometimes unit root tests will reject the assumption of a unit root even though the series is clearly non-stationary. There are several alternatives when the I(1) hypothesis is rejected:

- The series is actually I(0).
- The series is driven by a deterministic rather than a stochastic trend.
- The series contains more than one unit root (footnote 3).
- The series is driven by segmented trends, meaning that there are different deterministic trends for different sub-periods.
- The series contains fractionally integrated trends. It has an ARFIMA representation (AutoRegressive Fractionally Integrated Moving Average).
- The series is non-stationary, but driven by some (to us) unknown trend process.

Tests for deterministic trends and more than one unit root are straightforward from the section above and not discussed here. The segmented trend approach was launched by Perron (1989). He argues that few series really are I(1). If we have detailed knowledge about the data generating process, we might establish that series have different deterministic trends for different time periods. The fact that these segmented trends shift over time implies that unit root tests cannot reject the hypothesis of an integrated variable. Thus, instead of detecting the correct deterministic trend(s), the test approximates the changing deterministic trend with a stochastic trend.
Perron (1989) demonstrates this fact and derives a test for a known break date in the series. Banerjee et al. (1992) develop a test for an unknown break date. The problem with this approach is that we somehow have to estimate these segmented trends. Sometimes it will be possible to argue for segmented trends, like World War One and Two, etc., but in principle we are left more or less with ad hoc estimates of what might be segmented trends.

3 Testing for integration should be done according to the Pantula Principle: since higher order integration dominates lower order integration, test from higher to lower order, and stop when it is not possible to reject the null. For instance, a test for I(1) vs. I(0) assumes that there are no I(2) processes. The presence of higher order integration might ruin the test for lower order integration; therefore start with I(2), and only if I(2) is rejected will it be meaningful to test for I(1), etc.
11.2 Fractional Integration

For the class of integrated series discussed above the order of differencing was assumed to be $d = 1$. The choice between $d = 0$ and $d = 1$ might be too restrictive in some situations, especially if unit root tests reject I(0) in favour of the I(1) hypothesis when we have theoretical information that suggests that I(1) is implausible or highly unrealistic. For example, unit root tests might find that both the forward and the spot foreign exchange rates are I(1), and that the forward premium ($f - s$), the log difference, is also I(1), indicating no mean reversion in this difference series, and that the forward and the spot rates are not co-integrating. The expectations part of the forward rate would therefore be extremely small or irrational in some sense, so the risk premiums are causing the I(1) behaviour.

Autoregressive Fractionally Integrated Moving Average (ARFIMA) models represent a more general class of models than ARMA and ARIMA models, see Granger and Joyeux (1980) and Granger (1980). The ARFIMA(p, d, q) model is defined as

$\phi(L)(1 - L)^d y_t = \mu + \theta(L)\epsilon_t$, (11.14)

where $d$ is the fractional differencing parameter. The difference operator $(1 - L)^d$ is defined in terms of its Maclaurin series expansion. The difference operator works in the same way as for ARIMA models: applying the operator to $y_t$ results in $(1 - L)^d y_t = z_t$, where $z_t$ has an ARMA representation. The FI operator transforms the original series into a series which has an ARMA representation. Once the long-run memory is removed, the standard techniques for identifying the ARMA process can be applied. The difference between ARIMA and ARFIMA models is that the latter allows for a more complex memory process. The Wold theorem says that any non-deterministic series has an infinite MA representation like

$y_t = \sum_{i=0}^{\infty} \psi_i \epsilon_{t-i}$, (11.15)

where $\epsilon_t \sim iid(0, \sigma^2)$ and $\sum_{i=0}^{\infty} \psi_i^2 < \infty$.
If this series also belongs to the class of series which have an ARMA representation, the autocorrelation function will die out exponentially. For an I(1) series the autocorrelation function will display complete persistence: the theoretical autocorrelation function is unity for all lags. Because the autocorrelation function of an ARMA process dies out exponentially, it can be said to have a relatively short memory compared to series whose autocorrelation functions do not die out as quickly. ARFIMA series therefore represent long memory time series. The ARFIMA model allows the autocorrelation coefficients to exhibit hyperbolic patterns. For $d < 1$ the series is mean reverting, and for $-0.5 < d < 0.5$ the ARFIMA series is covariance stationary.

For a statistician who is describing the behaviour of a time series, an ARFIMA model might offer a better representation than the more traditional ARMA model, see Diebold and Rudebusch (1989) and Sowell (1992). For an econometrician, however, the economic understanding is of equal importance. The standard question in most economic work is whether to use levels or percentage growth rates of the data, in order to construct models with known distributions. That means deciding whether series are I(0) or I(1). Fractional integration does not affect these problems. It becomes important when we ask specific questions about the type of long-run memory we are dealing with, such as whether there is mean reversion in the forward premium, the real exchange rate, or asset prices. Thus only when economic theory gives us a reason for testing something other than I(0) and I(1) is fractional integration of interest. For applications of long-memory tests in general see Lo (1991) and Cheung and Lai (1995).
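The Maclaurin expansion of $(1 - L)^d$ mentioned above can be made concrete. Writing $(1 - L)^d = \sum_{k=0}^{\infty} \pi_k L^k$, the binomial series gives $\pi_0 = 1$ and $\pi_k = \pi_{k-1}(k - 1 - d)/k$. The sketch below computes these weights and applies them to a series, truncating the infinite sum at the start of the sample; the function names are illustrative assumptions.

```python
import numpy as np

def frac_diff_weights(d, n):
    """First n coefficients pi_k of (1 - L)^d (binomial series)."""
    w = np.empty(n)
    w[0] = 1.0
    for k in range(1, n):
        w[k] = w[k - 1] * (k - 1 - d) / k
    return w

def frac_diff(y, d):
    """Apply (1 - L)^d to y, truncating the expansion at the first observation."""
    y = np.asarray(y, dtype=float)
    w = frac_diff_weights(d, len(y))
    # z_t = sum_{k=0}^{t} pi_k * y_{t-k}
    return np.array([w[:t + 1] @ y[t::-1] for t in range(len(y))])

print(frac_diff_weights(1.0, 4))   # ordinary first difference: 1, -1, then zeros
print(frac_diff_weights(0.4, 4))   # weights decay slowly for fractional d
```

For $d = 1$ the weights collapse to the usual first difference, while for fractional $d$ they decay hyperbolically, which is the source of the long memory described in the text.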
12. NON-STATIONARITY AND CO-INTEGRATION

Most macroeconomic and finance variables are non-stationary. This has enormous consequences for the use of statistical methods in economics research. Statistical theory assumes that variables are stationary; if they are not stationary, statistical inference is generally not possible. It doesn't matter that numerous old textbooks in econometrics and research papers have ignored the problem. The problems associated with non-stationary variables in econometrics have been known since the 1920s, but did not get a solution until the end of the 1980s. In principle there are two ways of dealing with non-stationarity: you must either remove the non-stationarity before setting up the econometric model, or set up a model of non-stationary variables that forms a stationary relation. Typically, in neither of these cases can you use standard inference based on t-, chi-square or F-distributions.

Now, variables can be non-stationary in an infinite number of ways. In practice, there are broadly two types of non-stationary variables of interest in econometrics. The first type are variables stationary around a deterministic trend. The second type are variables stationary around a stochastic trend. Stochastic trend variables are also known as integrated variables. Most variables in economics and finance seem to be driven by stochastic trends. The problem with stochastic trend variables (integrated variables) is not only that they do not follow standard distributions; if you try to use standard distributions you will most likely be fooled into thinking there are significant relations when in fact there is no relation. This is known as the spurious regression problem in the literature.

Historically, trends were dealt with by removing what people assumed was a linear deterministic trend. This was done in the following way.
The non-stationary variable was regressed against a constant and a linear trend variable,

$y_t = \alpha + \beta t + \tilde y_t$, (12.1)

where $t$ was a deterministic time trend, defined as $t = 1, 2, \ldots, T$. The residual $\tilde y_t$ in this regression represents the de-trended $y_t$ series, which was then used in regression models with other stationary or detrended variables. In the equation above $\alpha$ becomes a combination of the sample mean of $y_t$ and the average of the time variable. In general, the deterministic trend removal can be done with models including polynomial deterministic trends, such as

$y_t = \alpha + \beta_1 t + \beta_2 t^2 + \ldots + \beta_n t^n + \tilde y_t$. (12.2)

This approach of fitting deterministic trends can be extended to cyclical trends, using trigonometric functions in combination with the time trend. In the literature there are various deterministic filters that aim at removing long-run (supposedly deterministic) trends, such as the so-called Hodrick-Prescott filter. However, if the series is driven by a stochastic trend, the estimated parameters of these models will not follow standard distributions, and the regression will impose a spurious autocorrelation pattern on the spuriously detrended variable $\tilde y_t$. Thus, until you have investigated the non-stationary properties of the series and tested for stochastic trends (order of integration), it is not possible to do any econometric modelling.
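The warning about spurious detrending can be illustrated with a small simulation. The sketch below applies the regression in (12.1) to a trend-stationary series and to a random walk and compares the first-order autocorrelation of the two residual series. The sample size, trend slope and helper names are assumptions chosen for illustration.

```python
import numpy as np

def detrend(y):
    """Residual from regressing y on a constant and a linear trend, as in (12.1)."""
    T = len(y)
    X = np.column_stack([np.ones(T), np.arange(1, T + 1, dtype=float)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def autocorr1(e):
    """First-order sample autocorrelation."""
    e = e - e.mean()
    return (e[1:] @ e[:-1]) / (e @ e)

rng = np.random.default_rng(2)
T = 500
trend = np.arange(1, T + 1, dtype=float)

trend_stationary = 1.0 + 0.05 * trend + rng.normal(size=T)   # deterministic trend
random_walk = np.cumsum(rng.normal(size=T))                  # stochastic trend

rho_ts = autocorr1(detrend(trend_stationary))
rho_rw = autocorr1(detrend(random_walk))
print("detrended trend-stationary series:", rho_ts)
print("detrended random walk:", rho_rw)
```

The detrended trend-stationary series behaves like white noise, whereas the "detrended" random walk remains strongly autocorrelated: the linear trend has not removed the stochastic trend.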
Deterministic trends are seldom the best choice for economic time series. Instead the non-stationary behaviour is often better described with stochastic trends, which have no fixed trend that can be predicted from period to period. A random walk serves as the simplest example of a stochastic trend. Starting from the model,

$y_t = y_{t-1} + v_t$, where $v_t \sim NID(0, \sigma^2)$, (12.3)

repeated substitution backwards leads to

$y_t = y_0 + \sum_{i=1}^{t} v_i$. (12.4)

The expression shows how the random walk variable is made up of the sum of all historical white noise shocks to the series. The sum represents the stochastic trend. The variable is non-stationary, but we cannot predict how it changes, at least not by looking at the history of the series. (See also the discussion above concerning random walks under the section about different stochastic processes.)

The stochastic trend term is removed by taking the first difference of the series. In the random walk case it implies that $\Delta y_t = v_t$ is a stationary variable with constant mean and variance. Variables driven by stochastic trends are also called integrated variables because the sum process represents the integrated property of these variables.

A generic representation is the combination of deterministic and stochastic trends,

$y_t = \alpha + \mu_t + \beta t + \tilde y_t$, (12.5)

where $\mu_t = \mu_{t-1} + v_t$, $v_t$ is $NID(0, \sigma^2)$, $t$ is the deterministic trend and $\tilde y_t$ is a stationary process representing the stationary part of $y_t$. In this model, the stochastic trend is represented by $\sum_{i=1}^{t} v_i$.

An alternative trend representation is segmented deterministic trends, illustrated by the model

$y_t = \alpha + \beta_1 t_1 + \beta_2 t_2 + \ldots + \beta_k t_k + \tilde y_t$, (12.6)

where $t_1$, $t_2$, etc. are deterministic trends for different periods, such as wars, or policy regimes for exchange rates, monetary policy etc. Segmented trends are an alternative to stochastic trends, see Perron (1989), but the problem is that the identification of these different trends might be ad hoc.
Given a suitable choice of trends almost any empirical series can be made stationary, but are the different trends really picking up anything interesting that is not embraced by the assumption of stochastic trends, arising from innovations with permanent effects on the economy?

The Spurious Regression Problem

Most macroeconomic time series display non-stationarity and appear to be driven by stochastic trends. Regression with these variables leads to the danger of non-standard distributed parameter estimates, which make inference much more difficult. The spurious regression problem was introduced in an article by Granger and Newbold. Granger and Newbold generated two random walk series, which were independent of each other by construction. Let the two variables be $x_t$ and $y_t$, with first differences $\Delta y_t \sim NID(0, \sigma_y^2)$ and $\Delta x_t \sim NID(0, \sigma_x^2)$, and by construction let $y_t$ and $x_t$ be independent. Next, consider the linear regression of $y_t$ on $x_t$,

$y_t = \alpha + \beta x_t + \varepsilon_t$. (12.7)

Since $y_t$ and $x_t$ are independent there is no relation between them, so $\beta$ must be zero, and we would expect the t-statistic of $\hat\beta$ to be centred on zero, $t_{\hat\beta} \sim N(0, 1)$. If we repeat the regression with new independent random walks, we expect that in 5 per cent of the tests we would be unlucky and erroneously conclude that there is significance even though the true value of $\beta$ is zero.

However, this is not what happens. Granger and Newbold studied the empirical distribution of the regression above. They ran 1000 regressions and found that the distribution of the t-statistic of $\hat\beta$ was the opposite of what we expect. In 95 % of the regressions they found a significant relation, even though the expected rejection rate was 5 %. The t-value of $\hat\beta$ tended towards absolute values of 2.0 or more. The problem got worse when more independent random walks were put into the equation. Granger and Newbold also found that the reported $R^2$ values became relatively high while the Durbin-Watson value became low.

Later in the 1980s, researchers such as Peter Phillips showed that, due to the integrated properties of the variables, their sample moments converge to functions of Wiener processes (Brownian motions). The sample moments will not converge to constants, as in the case of stationary stochastic regressors. Instead, the sample moments converge to random variables which are functions of Wiener processes. In this situation, with two (or more) random walk variables regressed against each other, the t-statistics will tend towards absolute values of 2.0 or more instead of 0.0. Thus, by using the t-distribution to test the null of no correlation between the variables, one will be fooled into rejecting the assumption of no correlation. This is the spurious regression problem.
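The experiment described above is easy to replicate in outline. The following sketch is not Granger and Newbold's original design; the sample size of 100, the 500 replications, the 1.96 cut-off and the helper name `ols_t_and_dw` are assumptions made for illustration. It regresses one independent random walk on another, as in (12.7), and records how often the conventional t-test rejects $\beta = 0$.

```python
import numpy as np

def ols_t_and_dw(y, x):
    """t-ratio on the slope and Durbin-Watson statistic from y = a + b x + e."""
    T = len(y)
    X = np.column_stack([np.ones(T), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (T - 2)
    se_beta = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    dw = np.sum(np.diff(resid) ** 2) / (resid @ resid)
    return beta[1] / se_beta, dw

rng = np.random.default_rng(3)
T, reps = 100, 500
t_vals, dw_vals = [], []
for _ in range(reps):
    y = np.cumsum(rng.normal(size=T))   # independent random walks
    x = np.cumsum(rng.normal(size=T))
    t_stat, dw = ols_t_and_dw(y, x)
    t_vals.append(t_stat)
    dw_vals.append(dw)

rejection_rate = np.mean(np.abs(t_vals) > 1.96)
mean_dw = np.mean(dw_vals)
print("share of |t| > 1.96:", rejection_rate)   # nominal size would be 0.05
print("average Durbin-Watson:", mean_dw)        # values near 2 indicate no autocorrelation
```

A rejection rate far above the nominal 5 % together with a low Durbin-Watson statistic is the typical signature of a spurious regression.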
The spurious regression problem is caused by parameter estimates which are not distributed according to the normal distribution, not even in the long run. In practical work, that is when using limited samples, it will occur not only when regressing random walk variables, but also when regressing integrated or near-integrated variables. Near-integrated variables are variables which, in a limited sample, look and behave like integrated variables. An autoregressive process with an autoregressive parameter close to unity (say 0.9) can be called near-integrated. In these situations, the distribution theory of integrated variables is a much better approximation than the standard normal.

Integrated Variables and Co-integration

Normally, a linear combination of integrated variables will also be integrated of the same order as the individual variables. The exception to this rule is called co-integration, when a linear combination of integrated variables results in a lower order of integration. So, in the linear regression above, since both $y_t$ and $x_t$ are integrated of order one, I(1), and independent, the residual term $\varepsilon_t$ will be integrated of order one, I(1), as well. In the case when the two I(1) variables share the same stochastic trend and form an I(0) residual, we say that they are co-integrating.
The intuition here is that for the two variables to form a meaningful long-run relationship, they must share the same trend. Otherwise they will be drifting away from each other as time elapses. Therefore, to build econometric models which make sense in the long run, we have to investigate the trend properties of the variables and determine the type of trend and whether the variables are co-trending and co-integrating or not. In econometric work, trend properties refer to the properties of the sample and how to do inference. It is not a theoretical concept about how economic variables grow in the long run. Once we have clarified the trend properties, it becomes possible to establish stationary relations and models, econometric modelling can proceed as usual, and standard techniques for inference can be used.

Definitions:

Definition 1 A series with no deterministic component and which has a stationary and invertible autoregressive moving average (ARMA) representation after differencing $d$ times, but which is not stationary after differencing $d - 1$ times, is said to be integrated of order $d$, denoted $x_t \sim I(d)$.

Definition 2 The components of the vector $x_t$ are said to be co-integrated of order $d, b$, denoted $x_t \sim CI(d, b)$, if (i) $x_t$ is $I(d)$ and (ii) there exists a non-zero vector $\beta$ such that $\beta' x_t \sim I(d - b)$, $d \geq b > 0$. The vector $\beta$ is called the co-integrating vector. (Adapted from Engle and Granger (1987).)

Remark 1 If $x_t$ has more than two elements there can be more than one co-integrating vector $\beta$.

Remark 2 The order of integration of the vector $x_t$ is determined by the element which has the highest order of integration. Thus, $x_t$ can in principle have variables integrated of different orders.

A related definition concerns the error correction representation following from co-integration.
Definition 3 A vector time series $x_t$ has an error-correction representation if it can be expressed as

$$A(L)(1-L)x_t = -\gamma z_{t-1} + \omega_t,$$

where $\omega_t$ is a stationary multivariate disturbance term, with $A(0) = I$, $A(1)$ having only finite elements, $z_t = \beta' x_t$, and $\gamma$ a non-zero vector. For the case where $d = b = 1$, and with co-integrating rank $r$, the Granger Representation Theorem holds. (Adapted from Banerjee et al. (1993).)

Remark 3 This definition and the Granger Representation Theorem (Engle and Granger, 1987) tell us that if there is co-integration then there is also an error correction representation, and there must be Granger causality in at least one direction.

Approaches to Testing for Co-integration

Under the general null hypothesis of independent and integrated variables, estimated variances and test statistics do not follow standard distributions. Therefore the way ahead is to test for co-integration, and then try to formulate a regression model (or system) in terms of stationary variables only. Traditionally there are two approaches to testing for co-integration: residual-based approaches and other approaches. The first type starts with the formulation of a co-integrating regression, a regression model with integrated variables. Co-integration is then determined by investigating the residual(s) from that regression. The Engle and Granger two-step procedure and the Phillips-Ouliaris test are examples of this approach. The other approach is to start from some representation of a co-integrated system (VAR, VECM, etc.) and test for some specific characterization of co-integrated systems. Johansen's VECM approach and tests for common trends are examples.

Engle and Granger's two-step procedure is the simplest and most widely used residual-based test. It is popular because it is easy to use, but it is not a good test. The procedure starts with the estimation of the co-integrating regression. If $y_t$ and $x_t$ are two variables integrated of order one, the first step is to estimate the following OLS regression

$$y_t = \alpha + \beta x_t + z_t, \tag{12.8}$$

where the estimated residuals are $\hat{z}_t$. If the variables are co-integrating, $\hat{z}_t$ will be I(0). The second step is to perform an Augmented Dickey-Fuller unit root test on the estimated residual,

$$\Delta \hat{z}_t = \alpha + \pi \hat{z}_{t-1} + \sum_{i=1}^{k} \gamma_i \Delta \hat{z}_{t-i} + \varepsilon_t. \tag{12.9}$$

If $y_t$ and $x_t$ share a common trend and co-integrate, the residual must be a stationary process. If they do not share a common trend, they do not co-integrate, the parameter $\beta$ must be zero, and the residual $z_t$ must be non-stationary and integrated of the same order as $y_t$. If the null, $H_0: \pi = 0$, is rejected in favour of the alternative $H_A: \pi < 0$, we conclude that the variables are co-integrated, and that the long-run co-integrating parameter is $\beta$. Furthermore, we can refer to the OLS regression as the co-integrating regression. We then know that the residual is stationary, $\hat{z}_t$ is I(0), and therefore $\hat{z}_{t-1}$ can be used as an error correction term, identifying the long-run steady-state relation between $y_t$ and $x_t$. The relevant critical values are not those tabulated by Fuller (1976). Instead you have to use the simulated tables in Engle and Granger (1987), Engle and Yoo (1987), or Banerjee et al. (1993).
The reason is that the unit root test is now performed, not on a univariate process, but on a variable constructed from several stochastic processes. The critical values change depending on how many explanatory variables there are in the model.

Remark 4 Remember that the t-statistics, and the estimated standard deviations, from the co-integrating regression must be treated with care, even if we find co-integration. Unless $x_t$ is exogenous, the estimated parameters follow unknown non-normal distributions, even asymptotically.

Remark 5 For the outcome of the test, it will not matter which variable is chosen to be the dependent variable. As an economist you might favour setting one variable as dependent and understand the $\beta$ parameters as long-run economic parameters (elasticities etc.).

There are a number of problems with the Engle and Granger two-step procedure. The first is that the tabulated (non-standard) test statistic assumes white noise residuals. The augmentation tries to deal with this, but in most cases it is only a crude approximation. Second, the test assumes a common factor in the dynamic processes of $y_t$ and $x_t$. In practice this restriction is quite restrictive, and the test will not behave well when it does not hold. The dynamics of the two processes and their possible co-integrating relation are usually more complex. Third, the test assumes that there is only one co-integrating vector. If we test for co-integration between two variables this is not a problem, because then there can be only one co-integrating vector. Suppose that we add another I(1) variable ($u_t$) to the co-integrating regression equation,

$$y_t = \alpha + \beta_1 x_t + \beta_2 u_t + \eta_t. \tag{12.10}$$

If $y_t$ and $x_t$ are co-integrating, they already form one linear combination ($z_t$) which is stationary. If $u_t \sim I(1)$ is not co-integrating with the other variables, OLS will set $\beta_2$ to zero, and the estimated residual $\hat{\eta}_t$ is I(0). This is why the test will only work if there is only one co-integrating vector among the variables. If $y_t$ and $x_t$ are not co-integrating, then adding $u_t \sim I(1)$ might lead to a co-integrating relation. Thus, in this respect the test is limited, and testing must be done by creating logical chains of bivariate co-integration hypotheses.

Other residual-based tests try to solve at least the first problem by adjusting the test statistic in the second step, so that it always fulfills the criteria for testing the null correctly. Some approaches try to transform the co-integrating regression in such a way that the estimated parameters follow a standard normal distribution.

A better alternative for testing for co-integration among more than two variables is offered by Johansen's test. This test finds long-run steady-state, or co-integrating, relations in the VAR representation of a system. Let the VAR,

$$A_k(L) x_t = \Psi D_t + \varepsilon_t, \tag{12.11}$$

represent the system. The VAR is a $p$-dimensional system, the variables are assumed to be integrated of order $d$, $\{x\}_t \sim I(d)$, $D_t$ is a vector of deterministic variables (constants, dummies, seasonals and possibly trends), and $\Psi$ is the associated coefficient matrix. The residual process is normally distributed white noise, $\varepsilon_t \sim NID(0, \Sigma)$.
It is important to find the optimal lag length in the VAR and, in addition to white noise, to have normally distributed error terms, because the test uses a full information maximum likelihood (FIML) estimator. FIML estimators are notoriously sensitive to small samples and misspecification, so care must be taken in the formulation of the VAR. Once the VAR has been found, it can be rewritten in error correction form,

$$\Delta x_t = \Pi x_{t-1} + \sum_{i=1}^{k-1} \Gamma_i \Delta x_{t-i} + \Psi D_t + \varepsilon_t. \tag{12.12}$$

In practical use the problem is to formulate the VAR; the program will rewrite the VAR for the user automatically. Johansen's test builds on the knowledge that if $x_t$ is $I(d)$, co-integration implies that there exist vectors $\beta$ such that $\beta' x_t \sim I(d-b)$. In a practical situation we will assume that $x_t \sim I(1)$, so that if there is co-integration, $\beta' x_t \sim I(0)$. If there is co-integration, the matrix $\Pi$ must have reduced rank. The rank of $\Pi$ indicates the number of independent rows in the matrix. Thus, if $x_t$ is a $p$-dimensional process, the rank ($r$) of the $\Pi$ matrix determines the number of co-integrating vectors ($\beta$), or the number of linear steady-state relations among the variables in $\{x\}_t$. Zero rank ($r = 0$) implies no co-integrating vectors, full rank ($r = p$) means that all variables are stationary, while a reduced rank ($0 < r < p$) means co-integration and the existence of $r$ co-integrating vectors among the variables. The procedure is to estimate the eigenvalues of $\Pi$ and determine their significance.[1]

[1] The test is called the Trace test and its use is explained in Sjö, Guide to testing for ...
However, under the null of no co-integration, these estimates have non-standard distributions which depend on whether there is a deterministic trend and/or a constant term in the model. The distribution of the test statistic is only known asymptotically, and only for a closed system without exogenous variables. In other situations the decision must be based on viewing the test statistics as approximations. Once the rank of $\Pi$ is known, the matrix can be written as $\Pi = \alpha\beta'$ such that $\beta' x_t$ forms stationary co-integrating relations. The $\beta$ are the co-integrating parameters, and the $\alpha$ are the adjustment parameters. The significance of the $\alpha$ parameters can be determined by ordinary t-tests, since they are associated with stationary relations, $\beta' x_t \sim I(0)$.

Finding the VECM

In practical use the problem is to formulate the VAR; the program will rewrite the VAR for the user and present the estimated $\alpha$ and $\beta$ vectors. Sometimes it is necessary to understand how the VECM is found. Consider the two-dimensional VAR model, where the deterministic terms have been removed for simplification,

$$y_t = a_{11} y_{t-1} + a_{12} y_{t-2} + a_{13} z_{t-1} + a_{14} z_{t-2} + e_{1t} \tag{12.13}$$

$$z_t = a_{21} z_{t-1} + a_{22} z_{t-2} + a_{23} y_{t-1} + a_{24} y_{t-2} + e_{2t}. \tag{12.14}$$

Start with the first equation and subtract $y_{t-1}$ from both sides of the equality sign. This gives

$$\Delta y_t = (a_{11} - 1) y_{t-1} + a_{12} y_{t-2} + a_{13} z_{t-1} + a_{14} z_{t-2} + e_{1t}.$$

Since the equation was correctly specified from the beginning, it can be transformed as long as we do not do anything that affects the properties of the error term. Our aim is to split all lag terms into first differences and lagged levels, in such a way that the model consists of levels at lag $t-1$ for all variables, and first differences. We can do this by using the difference operator, $\Delta = (1 - L)$, which can be used as $\Delta y_t = y_t - y_{t-1}$, or $y_{t-1} = y_t - \Delta y_t$. In terms of the operators we have $L = (1 - \Delta)$, or $L y_t = (y_t - \Delta y_t)$. If we apply this to all lags longer than $t-1$, we get for $t-2$ the following: $y_{t-2} = y_{t-1} - \Delta y_{t-1}$, and $z_{t-2} = z_{t-1} - \Delta z_{t-1}$.
Substitute this into the equation to get

$$\Delta y_t = (a_{11} - 1) y_{t-1} + a_{12}(y_{t-1} - \Delta y_{t-1}) + a_{13} z_{t-1} + a_{14}(z_{t-1} - \Delta z_{t-1}) + e_{1t}.$$

Collecting terms gives

$$\Delta y_t = (-1 + a_{11} + a_{12}) y_{t-1} - a_{12} \Delta y_{t-1} + (a_{13} + a_{14}) z_{t-1} - a_{14} \Delta z_{t-1} + e_{1t}.$$

Performing the same operations on the second equation gives

$$\Delta z_t = (-1 + a_{21} + a_{22}) z_{t-1} - a_{22} \Delta z_{t-1} + (a_{23} + a_{24}) y_{t-1} - a_{24} \Delta y_{t-1} + e_{2t}.$$

Write the system in matrix form,

$$\Delta x_t = \Pi x_{t-1} + \sum_{i=1}^{1} \Gamma_i \Delta x_{t-i} + \varepsilon_t, \tag{12.15}$$

where $\Delta x_t = (\Delta y_t, \Delta z_t)'$, $x_t = (y_t, z_t)'$,

$$\Pi = \begin{bmatrix} \pi_{11} & \pi_{12} \\ \pi_{21} & \pi_{22} \end{bmatrix}, \quad \text{and} \quad \Gamma_1 = \begin{bmatrix} \gamma_{11} & \gamma_{12} \\ \gamma_{21} & \gamma_{22} \end{bmatrix}.$$

Matching terms with the two equations above gives, for example, $\pi_{11} = a_{11} + a_{12} - 1$ and $\pi_{12} = a_{13} + a_{14}$.

Since $x_t$ is integrated of order one, it follows that $\Delta x_t$ is integrated of order zero and therefore stationary. And, since $x_t$ is non-stationary, the variables in $x_t$ grow in two dimensions unless they share the same trend. In that case we would say that they are co-integrated and share one common trend. In the case of a $p$-dimensional system, the system can expand in $p$ dimensions, or in fewer than $p$ dimensions if variables share the same trend. Under these properties a single $y_{t-1}$ or $z_{t-1}$ cannot be correlated with $\Delta y_t$ or $\Delta z_t$. The only situation in which the rows of $\Pi$ can be different from zero is when $(\pi_{11} y_{t-1} + \pi_{12} z_{t-1})$ forms a stationary process, i.e. there exist non-zero parameters $\pi_{11}$ and $\pi_{12}$ (or $\pi_{21}$ and $\pi_{22}$) such that, when multiplied with the variables in $x_{t-1}$, a stationary relation is established. The test for this is to test for the rank of the $\Pi$ matrix, the number of independent non-zero rows in $\Pi$. A rank of zero means no co-integration, and a rank of 2 in this case means that the variables in $x_t$ are stationary, or stationary around deterministic trends if we allowed for constants in the equations. A reduced rank, which in this case is a rank equal to 1, implies co-integration. Co-integration implies that at least one $\alpha$ parameter will be significant, so there will be (long-run) Granger causality in at least one direction. At least one variable must follow the other for them to stay together in fixed formation in the long run.

Johansen's test is better than the two-step procedure in almost all respects. The practical problems originate from choosing a correct combination of lags and dummy variables to make the residuals come out as white noise. In a limited sample this can be difficult, and the results might change among different specifications of the system, just as they do in the two-step procedure.
It is recommended to start with the two-step procedure, to learn about the data and get some preliminary results, instead of getting stuck with the Johansen test and having problems finding a specification that leads to economically interesting results.
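As a starting point for such preliminary work, the two steps in (12.8) and (12.9) can be sketched in a few lines. This is a minimal sketch under stated simplifications, not a full implementation: the second step uses no augmentation lags and no constant (the OLS residuals already have mean zero), the data are simulated, and the function names, seed and parameter values are hypothetical. The resulting t-ratio must still be compared with the simulated critical values in the tables cited in the text, not with the normal or the ordinary Dickey-Fuller tables.

```python
import math
import random


def ols_with_constant(y, x):
    """Step 1: intercept and slope of the co-integrating regression."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def df_t_ratio(z):
    """Step 2: t-ratio of pi in  dz_t = pi * z_{t-1} + e_t  (no lags, no constant)."""
    lagged = z[:-1]
    dz = [z[t] - z[t - 1] for t in range(1, len(z))]
    sxx = sum(v * v for v in lagged)
    pi_hat = sum(l * d for l, d in zip(lagged, dz)) / sxx
    errors = [d - pi_hat * l for l, d in zip(lagged, dz)]
    s2 = sum(e * e for e in errors) / (len(dz) - 1)
    return pi_hat / math.sqrt(s2 / sxx)


def engle_granger(y, x):
    intercept, slope = ols_with_constant(y, x)
    z_hat = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    return slope, df_t_ratio(z_hat)


rng = random.Random(99)  # fixed seed, hypothetical data
T = 500
x, level = [], 0.0
for _ in range(T):
    level += rng.gauss(0.0, 1.0)
    x.append(level)

y_coint = [1.0 + 0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]  # co-integrated with x
y_indep, level = [], 0.0
for _ in range(T):
    level += rng.gauss(0.0, 1.0)
    y_indep.append(level)                                     # independent random walk

beta_coint, t_coint = engle_granger(y_coint, x)
beta_indep, t_indep = engle_granger(y_indep, x)
print("co-integrated pair: beta =", round(beta_coint, 3), " t =", round(t_coint, 2))
print("independent walks:  beta =", round(beta_indep, 3), " t =", round(t_indep, 2))
```

In the co-integrated case the t-ratio is strongly negative and the slope is close to the value used in the simulation. For the independent random walks the slope estimate is meaningless and the t-ratio is much closer to zero.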
13. INTEGRATED VARIABLES AND COMMON TRENDS

This chapter looks at the common trends approach and some of the economics behind co-integration, for instance the question of creating positive or negative shocks in stabilization policy.

An important characteristic of integrated variables is that they become stationary after differencing. The definition of an integrated series is: a series, or a vector of series, $y_t$ with no deterministic component, which has a stationary, invertible ARMA representation after differencing $d$ times, is said to be integrated of order $d$, denoted $y_t \sim I(d)$.

It is possible to have variables driven by both stochastic and deterministic trends. In the very long run a deterministic trend will always dominate over a stochastic trend. In a limited sample, however, it becomes an empirical question whether the deterministic trend is sufficiently strong to have an effect on the distributions of the estimates of the model.[1]

We know, from the Wold representation theorem, that if $y_t$ is I(0) and has no deterministic component, it can be written as an infinite moving average process. (If the series has a deterministic component, this can be removed before solving for the MA process.)

$$y_t = C(L)\epsilon_t, \tag{13.1}$$

where $L$ is the lag operator, and $\epsilon_t \sim iid(0, \sigma^2)$. Now, suppose that $y_t$ is I(1). Then its first difference is stationary and has an infinite MA representation,

$$\Delta y_t = C(L)\epsilon_t. \tag{13.2}$$

Under the assumption that $\epsilon_t \sim iid(0, \sigma^2)$, we also have that

$$y_t = \Delta^{-1} C(L)\epsilon_t = [1/(1-L)]\,C(L)\epsilon_t, \tag{13.3}$$

where $1/(1-L)$ represents the sum of an infinite series. For a limited sample, we get approximately

$$y_t = y_0 + (1 + L + L^2 + \dots + L^{t-1})\,C(L)\epsilon_t, \tag{13.4}$$

where $y_0$ is the initial value of the process, seen as a deterministic component conditional on everything known at time zero. The long-run solution of this expression, setting $L = 1$, gives $tC(1)$, and

$$y_t = y_0 + t\,C(1)\epsilon_t. \tag{13.5}$$

Unless $C(1) = 0$, this process will grow infinitely large as $t \to \infty$.
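A worked special case may help to see what the summation does. The MA(1) specification and the symbol $\theta$ below are assumptions chosen for illustration; they are not part of the original text. Let $\Delta y_t = \epsilon_t + \theta\epsilon_{t-1}$, so that $C(L) = 1 + \theta L$ and $C(1) = 1 + \theta$. Setting $\epsilon_0 = 0$ and adding up the differences from period 1 to $t$ gives

$$y_t = y_0 + \sum_{s=1}^{t} (\epsilon_s + \theta\epsilon_{s-1}) = y_0 + (1 + \theta)\sum_{s=1}^{t}\epsilon_s - \theta\epsilon_t.$$

The first sum is a random walk scaled by $C(1)$, with a variance that grows with $t$, while the last term is stationary. If $\theta = -1$, then $C(1) = 0$, the random walk part vanishes, and $y_t = y_0 + \epsilon_t$ is stationary.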
Looking at the second difference of $y_t \sim I(1)$ leads to

$$\Delta^2 y_t = (1 - L)\,C(L)\epsilon_t, \tag{13.6}$$

where $\Delta = (1 - L)$ is applied to both sides of the expression. This series has no long-run MA representation, irrespective of whether $C(1) \neq 0$, since setting $L = 1$ gives $(1 - L)C(L) = (1 - 1)C(1) = 0$.

[1] See Nelson and Plosser (1982) for a discussion about the proper way to model the trend in economic time series.
Let us see what happens with the process in the future. From above we get the MA representation for some future period $t + h$,

$$y_{t+h} = y_0 + \sum_{i=1}^{t+h} \Big[ \sum_{j=0}^{t+h-i} C_j \Big] \epsilon_i \tag{13.7}$$

$$\phantom{y_{t+h}} = y_0 + \sum_{i=1}^{t} \Big[ \sum_{j=0}^{t+h-i} C_j \Big] \epsilon_i + \sum_{i=t+1}^{t+h} \Big[ \sum_{j=0}^{t+h-i} C_j \Big] \epsilon_i. \tag{13.8}$$

The expression is decomposed into what is known at time $t$, the first double sum, and what is going to happen between $t$ and $t + h$. The latter is unknown at time $t$. Therefore we have to form the conditional forecast of $y_{t+h}$ at time $t$,

$$y_{t+h|t} = y_0 + \sum_{i=1}^{t} \Big[ \sum_{j=0}^{t+h-i} C_j \Big] \epsilon_i.$$

The effect of a shock today (at time $t$) on future periods is found by taking the derivative of the above expression with respect to a change in $\epsilon_t$,

$$\partial y_{t+h|t} / \partial \epsilon_t = \sum_{j=0}^{h} C_j \to C(1) \quad \text{as } h \to \infty. \tag{13.9}$$

Thus, the long-run effect of a shock today can be expressed by the static long-run solution of the MA representation of $\Delta y_t$, which is equal to the sum of the MA coefficients. The persistence of a shock depends on the value of $C(1)$. If $C(1)$ happens to be 0, there is no long-run effect of today's shock. Otherwise we have three cases: $C(1)$ lies between 0 and 1, $C(1) = 1$, or $C(1)$ is greater than unity. If $C(1)$ is greater than 0 but less than unity, the effect of the shock diminishes over time. If $C(1) = 1$, the integrated variables (unit roots) case, a shock will be as important today as it is for all future periods. Finally, if $C(1)$ is greater than one (explosive roots), the shock magnifies into the future, and we have an unstable system. If the series are truly I(1), spectral analysis can be applied to measure the persistence of a shock exactly.

The persistence of shocks has interesting implications for economic policy. If shocks are very persistent, or explosive in some cases, it might be a good policy to try to avoid negative shocks, but create positive shocks.
In our stabilization policy example, this can be understood as saying that the authorities should be careful with deflationary policies, for instance, since they might result in high and persistent social costs; see Mankiw and Shapiro (198x) for a discussion of these issues.

In the following, the MA representation of systems of integrated processes is analysed. For this purpose let $y_t$ be a vector of I(1) variables. Using the lag operator as $L = 1 - (1 - L)$ $[= 1 - \Delta]$,[2] and Wold's decomposition theorem, gives

$$\Delta y_t = C(L)\epsilon_t = C(1)\epsilon_t + C^*(L)\Delta\epsilon_t. \tag{13.10}$$

If $y_t$ is a vector of I(1) variables, then we know from above that if the matrix $C(1) \neq 0$, any shock to the series has permanent effects on future levels of $y_t$. Let us consider a linear combination of these variables, $\beta' y_t = z_t$. Multiplication of the expression with $\beta'$ gives

$$\Delta z_t = \beta' C(1)\epsilon_t + \beta' C^*(L)\Delta\epsilon_t. \tag{13.11}$$

In general, it is the case that when $y_t$ is I(1), $z_t$ is I(1) as well. Thus a linear combination of integrated variables will also be integrated, implying that $\beta' C(1)$ is different from zero.

[2] This is the same as $y_t = \Delta y_t + y_{t-1} = [(1 - L) + L]\,y_t$.

Suppose, however, that there exists a matrix $\beta$ such that $\beta' C(1) = 0$, which implies that when $\beta'$ is multiplied with $y_t$ we get a stationary process, $\beta' y_t \sim I(0)$. As an example, consider private aggregate consumption and private aggregate (disposable) income. Both variables could be random walks, but what about the difference between these variables? Is it reasonable to assume that a linear combination of them could be driven by a stochastic trend, meaning that consumption would deviate permanently from income in the long run? The answer is no. In the long run it is not likely that a person consumes more than his or her income, nor is it likely that a person will save more and more. Thus, we have to think of situations where two variables co-integrate, that is, where a linear combination of I(1) series forms a new stationary series, integrated of order zero. (A more formal definition of co-integration is given in a following section.)

In terms of the $C(1)$ matrix, common trends or co-integration implies that there exists a matrix $\beta$ such that $\beta' C(1) = 0$. Hence we get

$$z_t = \beta' C^*(L)\epsilon_t, \tag{13.12}$$

where $z_t$ is integrated of order zero, when $y_t$ consists of variables integrated of order one. The mathematical condition for having a vector $\beta$ such that $\beta' C(1) = 0$ is that $C(1)$ has reduced rank. There must be at least one row representing the long run that can be solved from the other long-run relations in $C(1)$. If $C(1)$ has reduced rank, there can be several $\beta$ vectors that lead to $\beta' C(1) = 0$. We can express this as follows: any vector lying in the null space of $C(1)$ is a co-integrating vector, and the co-integrating rank is the dimension of this null space. Say that $y_t$ is a vector of $n$ variables. If all variables are non-stationary, and integrated of order one, the whole system could expand in $n$ different directions. If some or all variables share the same trend in the long run, the system would be expanding in fewer than $n$ dimensions.
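A small numerical illustration of the reduced rank condition, using hypothetical numbers that are not taken from the text: let $n = 2$ and

$$C(1) = \begin{bmatrix} 1 & 0.5 \\ 2 & 1 \end{bmatrix}.$$

The second row is twice the first, so $C(1)$ has rank one. Choosing $\beta = (2, -1)'$ gives

$$\beta' C(1) = \begin{bmatrix} 2\cdot 1 - 1\cdot 2 & \; 2\cdot 0.5 - 1\cdot 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 \end{bmatrix},$$

so the combination $2y_{1t} - y_{2t}$ is not affected by the accumulated shocks. The two variables are driven by a single common trend, and the system expands in one dimension instead of two.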
How should we understand the reduced rank of $C(1)$ in economic terms? Think of consumption and income again. If both series are I(1), constantly growing in the long run, the difference between them should be stationary in the long run. In other words, they must have a common (stochastic) trend, which they both follow in the long run. This common trend could be understood as given by technological growth, which leads to growth in income and thereby also to long-run growth in consumption. Another way of expressing the same thing is to say that the common trend represents the cumulation of past technology shocks.

Stock and Watson (1988) modeled the common trends representation of $y_t$ in the following way. Starting from $\Delta y_t = C(1)\epsilon_t + C^*(L)\Delta\epsilon_t$, the level of $y_t$ is determined by

$$y_t = y_0 + (1 + L + L^2 + \dots + L^{t-1})[C(1) + (1 - L)C^*(L)]\epsilon_t \tag{13.13}$$

$$\phantom{y_t} = y_0 + C(1)(1 + L + L^2 + \dots + L^{t-1})\epsilon_t + C^*(L)\epsilon_t. \tag{13.14}$$

If we have co-integration, and therefore common trends, $C(1)$ must be of reduced rank. The matrix $C(1)$ can be thought of as consisting of two sub-matrices, such that $C(1) = AJ$, where $J$ defines the common trends through

$$\tau_t = \tau_{t-1} + J\epsilon_t = (1 + L + L^2 + \dots + L^{t-1})J\epsilon_t. \tag{13.15}$$

The variable $\tau_t$ represents the common trends, modelled as random walks. Setting the initial condition $\tau_0 = 0$, the level of $y_t$ is solved as

$$y_t = y_0 + A\tau_t + C^*(L)\epsilon_t, \tag{13.16}$$

which shows that $y_t$ is driven by the common trends representation $A\tau_t$. It can also be shown, since $C(1) = AJ$, that $\beta' A = 0$, which implies that the co-integrating linear combinations of $y_t$ have no common trends. In terms of the $C(1)$ matrix, we can talk about two types of shocks. The first type of shocks decline over time, so that the variables in the system return to their equilibrium relation. These shocks are driven by the co-integrating vectors. The second type of shocks are those which move the whole system over time without affecting the long-run equilibrium. These shocks are the common trends of the system.

Co-integration and common trends have interesting implications for econometric model building and inference on dynamic models. For an econometrician, however, the MA representation is not always the easiest way to approach the concept of co-integration and stationary long-run relations.
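The common trends representation in (13.16) can be illustrated with a short simulation. The sketch below rests on assumed values: one common trend, loadings A = (1, 2)', white noise standing in for the stationary part, and beta = (2, -1)' chosen so that beta'A = 0. None of these numbers come from the text. It generates two series from a single random walk and compares the variability of the levels with that of the co-integrating combination.

```python
import random

rng = random.Random(2024)  # fixed seed, hypothetical data
T = 3000
A = (1.0, 2.0)       # loadings of the two variables on the common trend (assumed)
beta = (2.0, -1.0)   # co-integrating vector, chosen so that beta'A = 0

tau = 0.0            # common trend with initial condition tau_0 = 0
y1, y2, z = [], [], []
for _ in range(T):
    tau += rng.gauss(0.0, 1.0)                 # tau_t = tau_{t-1} + shock
    first = A[0] * tau + rng.gauss(0.0, 1.0)   # y_t = A tau_t + stationary part
    second = A[1] * tau + rng.gauss(0.0, 1.0)
    y1.append(first)
    y2.append(second)
    z.append(beta[0] * first + beta[1] * second)  # beta'y_t removes the trend


def variance(series):
    mean = sum(series) / len(series)
    return sum((v - mean) ** 2 for v in series) / len(series)


beta_times_A = beta[0] * A[0] + beta[1] * A[1]
var_y1, var_y2, var_z = variance(y1), variance(y2), variance(z)
print("beta'A =", beta_times_A)
print("sample variance of y1, y2:", round(var_y1, 1), round(var_y2, 1))
print("sample variance of beta'y:", round(var_z, 2))
```

The levels wander with the trend, so their sample variances are large and depend on the sample length. The combination beta'y contains only the stationary parts, so its variance stays near 2² + 1² = 5 in this setup.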
14. A DEEPER LOOK AT JOHANSEN'S TEST

Earlier we looked at the moving average representation of a vector of integrated variables. This took us to a definition of common trends in a system of variables with stochastic trends. For an economist it is usually more interesting to analyse a system in autoregressive form. By looking at the VAR representation we get a definition of co-integration, or long-run steady-state relations, among the variables. Let the process $\{y_t\}$ be represented by the following $k$:th order vector autoregressive (VAR) model, consisting of $p$ variables,

$$A(L)y_t = \Psi D_t + \epsilon_t, \tag{14.1}$$

where $D_t$ is a vector of deterministic variables, including dummies and constants. The error term is $\epsilon_t \sim NID(0, \Omega)$, and $A(L) = I - \sum_{j=1}^{k} A_j L^j$, so that $A(0) = I$. Thus, we are assuming that the process is multivariate normal, $y_t \sim NID(\mu, \Omega)$, with mean $\mu = A_1 y_{t-1} + \dots + A_k y_{t-k} + \Psi D_t$, and a positive definite error covariance matrix. The system can be rewritten in error correction form, using the definition of the difference operator, $\Delta y_t \equiv y_t - y_{t-1}$,

$$\Delta y_t = \sum_{i=1}^{k-1} \Gamma_i \Delta y_{t-i} + \Pi y_{t-k} + \Psi D_t + \epsilon_t, \tag{14.2}$$

where

$$\Gamma_i = -I + \sum_{j=1}^{i} A_j, \tag{14.3}$$

and

$$\Pi = -I + \sum_{j=1}^{k} A_j = -A(1). \tag{14.4}$$

Notice that in this example the system was rewritten such that the variables in levels ($y_{t-k}$) ended up at the $k$:th lag. As an alternative it is possible to rewrite the system such that the levels enter the expression at the first lag, followed by $k - 1$ lags of $\Delta y_{t-i}$. The two ways of rewriting the system are equivalent. The preferred form depends on one's preferences.

Since $y_t$ is integrated of order one and $\Delta y_t$ is stationary, it follows that there can be at most $p - 1$ steady-state relationships between the non-stationary variables in $y_t$. Hence, $p - 1$ is the largest possible number of linearly independent rows in the $\Pi$ matrix. The latter is determined by the number of significant eigenvalues in the estimated matrix $\hat{\Pi} = -\hat{A}(1)$.
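The equivalence of the two ways of rewriting the VAR can be checked numerically. The sketch below is an illustration under assumed values: a bivariate VAR(2) without deterministic terms, with hypothetical coefficient matrices A1 and A2 chosen so that Π has rank one. It simulates the VAR and confirms that both error correction forms reproduce the same first differences, with Π = -I + A1 + A2 in both cases.

```python
import random

I2 = [[1.0, 0.0], [0.0, 1.0]]
A1 = [[0.6, 0.2], [0.1, 0.7]]  # hypothetical VAR(2) coefficients
A2 = [[0.2, 0.0], [0.0, 0.2]]


def mat_sum(*terms):
    return [[sum(m[i][j] for m in terms) for j in range(2)] for i in range(2)]


def mat_scale(m, c):
    return [[c * m[i][j] for j in range(2)] for i in range(2)]


def mat_vec(m, v):
    return [m[i][0] * v[0] + m[i][1] * v[1] for i in range(2)]


def vec_sum(*terms):
    return [sum(v[i] for v in terms) for i in range(2)]


def vec_diff(a, b):
    return [a[i] - b[i] for i in range(2)]


Pi = mat_sum(A1, A2, mat_scale(I2, -1.0))        # Pi = -I + A1 + A2 = -A(1)
Gamma_lag_k = mat_sum(A1, mat_scale(I2, -1.0))   # levels at lag k = 2: Gamma_1 = -I + A1
Gamma_lag_1 = mat_scale(A2, -1.0)                # levels at lag 1:     Gamma_1 = -A2

rng = random.Random(7)
y = [[0.0, 0.0], [0.0, 0.0]]   # two initial values
shocks = [None, None]
for t in range(2, 202):
    e = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    shocks.append(e)
    y.append(vec_sum(mat_vec(A1, y[t - 1]), mat_vec(A2, y[t - 2]), e))

gap_lag_k = gap_lag_1 = 0.0
for t in range(2, 202):
    dy = vec_diff(y[t], y[t - 1])
    dy_lag = vec_diff(y[t - 1], y[t - 2])
    ecm_k = vec_sum(mat_vec(Gamma_lag_k, dy_lag), mat_vec(Pi, y[t - 2]), shocks[t])
    ecm_1 = vec_sum(mat_vec(Gamma_lag_1, dy_lag), mat_vec(Pi, y[t - 1]), shocks[t])
    gap_lag_k = max(gap_lag_k, max(abs(a - b) for a, b in zip(dy, ecm_k)))
    gap_lag_1 = max(gap_lag_1, max(abs(a - b) for a, b in zip(dy, ecm_1)))

det_Pi = Pi[0][0] * Pi[1][1] - Pi[0][1] * Pi[1][0]
print("largest gap, levels at lag k:", gap_lag_k)
print("largest gap, levels at lag 1:", gap_lag_1)
print("det(Pi):", det_Pi)
```

Both gaps are zero up to rounding error, so the choice between the two forms only changes the short-run coefficient matrices, not Π. The determinant of Π is zero here because the example was constructed with one co-integrating vector.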
Let $r$ be the rank of $\Pi$. Then rank($\Pi$) = 0 implies that there are no combinations of variables that lead to stationarity. In other words, there is no co-integration. If we have rank($\Pi$) = $p$, the $\Pi$ matrix is said to have full rank, and all variables in $y_t$ must be stationary. Finally, reduced rank, $0 < r < p$, means that there are $r$ co-integrating vectors in the system. Once a reduced rank has been determined, the matrix $\Pi$ can be written as $\Pi = \alpha\beta'$, where $\beta' y_t$ represents the vectors of co-integrating relations, and $\alpha$ is a matrix of adjustment coefficients measuring the strength by which each co-integrating vector affects an element of $\Delta y_t$. Whether the co-integrating vectors $\beta' y_t$ are referred to as error correction mechanisms, steady-state relations, long-run equilibrium solutions or desired values is a question of how one views the underlying economic mechanisms.

Given estimates of the eigenvalues of $\Pi$, $\alpha$ and $\beta$, it becomes possible to impose various restrictions on the parameter vectors to test homogeneity conditions in the $\beta$ vectors, how $\beta' y_t$ affects $\Delta y_t$, or more general hypotheses regarding which combinations of variables form stationary vectors. The tests are performed by comparing changes in the estimated eigenvalues from the unrestricted reduced rank estimate of $\Pi$ with the outcome of a restricted estimation.

In Johansen (1988) it is shown how to estimate the $\alpha$ and the $\beta$ vectors in the $\Pi$ matrix, given that the latter has reduced rank. The solution starts from conditioning out the short-run dynamics, as well as the effects of the dummy variables, on $\Delta y_t$ and $y_{t-k}$ respectively,

$$\Delta y_t = \sum_{i=1}^{k-1} \rho_{1,i} \Delta y_{t-i} + \gamma_0 D_t + R_{1t}, \tag{14.5}$$

$$y_{t-k} = \sum_{i=1}^{k-1} \rho_{2,i} \Delta y_{t-i} + \gamma_2 D_t + R_{kt}. \tag{14.6}$$

The system in 14.1 can now be written in terms of the residuals above as

$$R_{1t} = \alpha\beta' R_{kt} + e_t. \tag{14.7}$$

The vectors $\alpha$ and $\beta$ can now be estimated by forming the product moment matrices $S_{11}$, $S_{kk}$ and $S_{1k}$ from the residuals $R_{1t}$ and $R_{kt}$,

$$S_{ij} = T^{-1} \sum_{t=1}^{T} R_{it} R_{jt}', \quad i, j = 1, k. \tag{14.8}$$

For fixed $\beta$ vectors, $\alpha$ is given by $\hat{\alpha}(\beta) = S_{1k}\beta(\beta' S_{kk} \beta)^{-1}$, and the sums of squares function by $\hat{\Omega}(\beta) = S_{11} - \hat{\alpha}(\beta)(\beta' S_{kk} \beta)\hat{\alpha}(\beta)'$. Minimizing this sum of squares function leads to maximum likelihood estimates of $\alpha$ and $\beta$. The estimates of $\beta$ are found after solving the eigenvalue problem

$$|\lambda S_{kk} - S_{k1} S_{11}^{-1} S_{1k}| = 0, \tag{14.9}$$

where $\lambda$ is a vector of eigenvalues. The solution leads to estimates of the eigenvalues $(\hat{\lambda}_1, \hat{\lambda}_2, \dots, \hat{\lambda}_p)$, and the corresponding eigenvectors $\hat{V} = (\hat{v}_1, \hat{v}_2, \dots, \hat{v}_p)$, normalized around the squared residuals $R_{kt}$ such that $\hat{V}' S_{kk} \hat{V} = I$.
The size of the eigenvalues ($\lambda_i$) tells us how much each linear combination of eigenvectors and variables, $v_i' y_t$, is correlated with the conditional process $R_{1t}$ ($\Delta y_t \mid \Delta y_{t-i}, D$). The number of non-zero eigenvalues ($r$) determines the rank of $\Pi$ and leads to the co-integrating vectors of the system, while the number of zero eigenvalues ($p - r$) defines the common trends in the system. These are the combinations of $v_i' y_t$ that determine the directions in which the process is non-stationary. Given that 14.1 is a well-defined statistical model, it is possible to determine the distribution of the estimated eigenvalues under different assumptions about the number of co-integrating vectors in the model. The distributions of the eigenvalues depend not only on 14.1 being a well-defined statistical model, but also on the number of variables, the inclusion of constant terms in the co-integrating vectors, and deterministic trends in the equations. Distributions for different models are tabulated in Johansen (1995).
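The interpretation of the eigenvalues as squared correlations is easiest to see in the simplest possible case. The sketch below assumes a single variable (p = 1), one lag (k = 1) and no deterministic terms, so that the residuals in (14.5) and (14.6) are just the first difference and the lagged level, and the eigenvalue problem in (14.9) reduces to one number. The function names, the seed and the AR parameter are hypothetical choices for illustration.

```python
import random


def scalar_eigenvalue(y):
    """lambda = S_1k^2 / (S_11 * S_kk) when p = 1, k = 1, no deterministic terms."""
    dy = [y[t] - y[t - 1] for t in range(1, len(y))]  # plays the role of R_1t
    lagged = y[:-1]                                    # plays the role of R_kt
    n = len(dy)
    s11 = sum(d * d for d in dy) / n
    skk = sum(l * l for l in lagged) / n
    s1k = sum(d * l for d, l in zip(dy, lagged)) / n
    return s1k * s1k / (s11 * skk)


def simulate_ar1(phi, n, rng):
    level, path = 0.0, []
    for _ in range(n):
        level = phi * level + rng.gauss(0.0, 1.0)
        path.append(level)
    return path


rng = random.Random(314)  # fixed seed, hypothetical data
T = 2000
lambda_stationary = scalar_eigenvalue(simulate_ar1(0.5, T, rng))   # I(0) series
lambda_random_walk = scalar_eigenvalue(simulate_ar1(1.0, T, rng))  # I(1) series

print("eigenvalue, stationary AR(1):", round(lambda_stationary, 3))
print("eigenvalue, random walk:     ", round(lambda_random_walk, 4))
```

For the stationary series the lagged level helps to predict the change, and the eigenvalue is clearly above zero (its population value is 0.25 for an AR parameter of 0.5). For the random walk the lagged level carries almost no information about the change, and the eigenvalue is close to zero. In the multivariate case the same logic applies to each linear combination in turn.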
The maximized log likelihood, conditional on the short-run dynamics and the deterministic variables of the model, is

$$\ln L = \text{constant} - (T/2)\ln|S_{11}| - (T/2)\sum_{i=1}^{r} \ln(1 - \hat{\lambda}_i). \tag{14.10}$$

From this expression two likelihood ratio tests for determining the number of non-zero eigenvalues are formulated. The first test concerns the hypothesis that the number of non-zero eigenvalues is less than or equal to some given number ($q$), such that $H_0: r \leq q$, against an unrestricted model where $H_1: r \leq p$. The test is given by

$$-2\ln(Q; q \mid p) = -T \sum_{i=q+1}^{p} \ln(1 - \hat{\lambda}_i). \tag{14.11}$$

The second test uses the same null hypothesis but a narrower alternative, $H_0: r \leq q$ against $H_1: r \leq q + 1$, and is given by

$$-2\ln(Q; q \mid q + 1) = -T \ln(1 - \hat{\lambda}_{q+1}). \tag{14.12}$$

Both tests follow non-standard distributions which depend on the number of variables in the system ($p$), and on the presence of trends and constant terms. The estimates of $\beta$ associated with the $r$ non-zero eigenvalues are given by the corresponding eigenvectors, such that $\hat{\beta} = (\hat{v}_1, \hat{v}_2, \dots, \hat{v}_r)$. Based on $\hat{\beta}$, the $\alpha$ vectors can be solved by

$$\hat{\alpha} = S_{1k}\hat{\beta}(\hat{\beta}' S_{kk} \hat{\beta})^{-1}. \tag{14.13}$$

The estimated matrix $\Pi = \alpha\beta'$ is not identified, in the sense that we can pick any non-singular matrix $M$ ($r \times r$), so that $\alpha M\,[\beta (M')^{-1}]' = \alpha\beta' = \Pi$. There is no unique solution for the co-integrating vectors. The solution, explaining the economic meaning of the co-integrating vectors, is something that the econometrician must impose on the estimates, first by normalizing each $\beta$ vector around a variable, and then by testing different assumptions about the vector. By looking at the signs and relative sizes of the $\hat{\beta}$ parameters, it is in general possible to find an appropriate normalization of the $\beta$ vectors, such that the outcome can be understood in terms of error correction mechanisms or long-run equilibrium relationships between economic variables.
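Given a set of estimated eigenvalues, the two statistics in (14.11) and (14.12) are simple to compute. The following is a minimal sketch: the function names are hypothetical, the eigenvalues and the sample size are made-up numbers, and the critical values still have to be taken from the appropriate non-standard tables.

```python
import math


def trace_statistic(eigenvalues, q, T):
    """-T * sum of ln(1 - lambda_i) over i = q+1, ..., p  (H0: r <= q vs r <= p)."""
    ordered = sorted(eigenvalues, reverse=True)
    return -T * sum(math.log(1.0 - lam) for lam in ordered[q:])


def max_eigenvalue_statistic(eigenvalues, q, T):
    """-T * ln(1 - lambda_{q+1})  (H0: r <= q vs r <= q + 1)."""
    ordered = sorted(eigenvalues, reverse=True)
    return -T * math.log(1.0 - ordered[q])


eigenvalues = [0.1, 0.5]  # hypothetical estimates for a system with p = 2
T = 100                   # hypothetical number of observations

trace_0 = trace_statistic(eigenvalues, 0, T)
trace_1 = trace_statistic(eigenvalues, 1, T)
max_0 = max_eigenvalue_statistic(eigenvalues, 0, T)
max_1 = max_eigenvalue_statistic(eigenvalues, 1, T)

for q, (trace, lmax) in enumerate([(trace_0, max_0), (trace_1, max_1)]):
    print("q =", q, " trace =", round(trace, 2), " max-eigenvalue =", round(lmax, 2))
```

The eigenvalues are sorted in descending order before use, since the tests are applied sequentially starting from q = 0. For q = p - 1 the two statistics coincide, because only the smallest eigenvalue is left in the sum.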
Assumptions concerning the sizes and relative signs of the parameters can be tested by comparing an unrestricted maximization with one where the restrictions have been imposed.

Furthermore, to rule out the cases where $y_t$ is integrated of order 2, we must require that the matrix $\alpha_\perp' \Phi \beta_\perp$ has full rank, where $\Phi$ is the mean lag matrix of $\Pi$ evaluated at unity, and $\alpha_\perp$ and $\beta_\perp$ are the orthogonal matrices to $\alpha$ and $\beta$ such that $\alpha'\alpha_\perp = \beta'\beta_\perp = 0$.

The system in 14.1 also has a moving average form given by

$$\Delta y_t = C(L)(\epsilon_t + \mu + \Psi D_t). \tag{14.14}$$

Expression (14.14) can be compared with

$$\Delta z_t = \beta' C(1)\epsilon_t + \beta' C^*(L)\Delta\epsilon_t, \tag{14.15}$$

from the previous chapter. Since $C(L)$ can be expanded as $C(L) = C(1) + (1 - L)C^*(L)$, when $y_t$ is integrated of order one we get

$$y_t = y_0 + C(1)\sum_{i=1}^{t} \epsilon_i + C(1)\mu t + C(1)\sum_{i=1}^{t} \Psi D_i + C^*(L)(\epsilon_t + \Psi D_t), \tag{14.16}$$

where $C^*(L) = [C(L) - C(1)](1 - L)^{-1}$. The impact matrix $C(1)$ shows how the non-stationary part of $y_t$ is generated from the underlying stochastic and deterministic trends. The link between the MA and the autoregressive form is shown in Johansen (1991), and is given by

$$C(1) = \beta_\perp(\alpha_\perp' \Phi \beta_\perp)^{-1}\alpha_\perp', \tag{14.17}$$

where $\beta_\perp$ and $\alpha_\perp$ are the orthogonal vectors of $\beta$ and $\alpha$ respectively.[1] Equation (14.17) can be used to estimate $C(1)$ from given estimates of $\alpha$ and $\beta$. But, since the error terms ($\epsilon_{i,t}$) in the reduced form are correlated, the estimate of $C(1)$ is not invariant to different ways of conditioning on current variables ($\Delta y_t$). Given this limitation, and the assumption that the driving trends should not be affected by the equilibrium forces, the common trends in the system are represented by $\alpha_\perp' y_t$ or alternatively by $\alpha_\perp' \sum \epsilon_{it}$; see Juselius (1992).

The test procedure can be extended to incorporate variables integrated of order 2 as well. With both I(1) and I(2) processes in the system, two new co-integrating relations are possible. There can be combinations of I(2) variables forming stationary I(0) vectors, or I(2) variables forming non-stationary I(1) vectors which in turn co-integrate with I(1) variables to form stationary vectors. The error correction system in 14.1 can be written as

$$\Delta^2 y_t = \Gamma \Delta y_{t-1} + \Pi y_{t-2} + \sum_{i=1}^{k-2} \Phi_i \Delta^2 y_{t-i} + \Psi D_t + \epsilon_t, \tag{14.18}$$

where $\Gamma = \sum_{i=1}^{k-1} \Gamma_i - I + \Pi$, $\Phi_i = -\sum_{j=i+1}^{k-1} \Gamma_j$, and $i = 1, \dots, k-2$. If $y_t$ is I(2) and $\Delta y_t$ is I(1), a reduced rank condition for the matrix $\Pi$ must be combined with a reduced rank condition for the matrix of first differences $\Gamma$ as well. Johansen (1991) shows that the condition for an I(2) process is

$$\alpha_\perp' \Gamma \beta_\perp = \varphi\eta', \tag{14.19}$$

where $\varphi$ and $\eta$ are $(p - r) \times s$, with rank $s$. With I(2) variables, $\beta' y_t$ is I(1). To make these vectors stationary they have to be combined with the vectors of first differences ($\kappa \beta_\perp^{2\prime} \Delta y_t$) to form stationary processes. In the latter expression $\beta_\perp^2$ is the squared orthogonal $\beta$ vectors, and $\kappa = (\alpha'\alpha)^{-1}\alpha' \Gamma \beta_\perp^2 (\beta_\perp^{2\prime} \beta_\perp^2)^{-1}$.
The squared orthogonal $\beta$ vectors indicate which variables are I(2). An I(2) model is estimated in a way similar to the I(1) model. Maximum likelihood estimation is feasible since the residual terms of an I(2) model can be assumed to be a Gaussian process. The first step is to perform a reduced rank regression for the I(1) model of $\Delta y_t$ on $y_{t-1}$, corrected for the short run dynamics ($\Delta y_{t-1}, \dots, \Delta y_{t-k+1}$) and the deterministic components ($\Psi D_t$). This leads to estimates of $\hat r$, $\hat\alpha$ and $\hat\beta$. In the second step, given the estimates of $\hat r$, $\hat\alpha$ and $\hat\beta$, a reduced rank test is performed of $\hat\alpha_\perp'\Delta^2 y_t$ on $\hat\beta_\perp'\Delta y_{t-1}$, corrected for $\Delta^2 y_{t-1}, \dots, \Delta^2 y_{t-k+2}$, and the constant terms. This leads to the estimates $\hat s$, $\hat\varphi$, and $\hat\eta$.

An I(2) process is harder to analyze in economic terms since the parameters and the test hypotheses have different interpretations. The tests concerning the $\beta$ vectors are still valid, but are in general only valid for I(1) processes. It is, however, possible to form stationary relations by combining levels ($\hat\beta' y_t$) with first difference expressions ($\kappa\hat\beta_\perp^{2\prime}\Delta y_t$). The practical solution is to identify the I(2) terms and find ways of transforming them to I(1) relations. The transformation to an I(1) system can be done by taking first differences of I(2) variables or by taking ratios of variables, for example modeling the real money stock rather than the money stock and the price level separately.

[1] An orthogonal vector is often indicated by the sign $\perp$ attached to the original vector. The vector $\beta_\perp$ is the orthogonal vector to the vector $\beta$ if $\beta'\beta_\perp = 0$.
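To make the link in (14.17) between $\alpha$, $\beta$ and the impact matrix $C(1)$ concrete, the following is a minimal numerical sketch. It is not Johansen's estimation procedure: the values of `alpha`, `beta` and `phi` are made-up illustrations, and the helper names `orth_complement` and `impact_matrix` are hypothetical. The orthogonal complements are obtained from a singular value decomposition, assuming that `alpha` and `beta` have full column rank.

```python
import numpy as np

def orth_complement(a):
    """Basis for the orthogonal complement of the columns of a (p x r).

    Assumes a has full column rank r, so the last p - r left singular
    vectors of a span the space orthogonal to its columns.
    """
    p, r = a.shape
    u, _, _ = np.linalg.svd(a, full_matrices=True)
    return u[:, r:]

def impact_matrix(alpha, beta, phi):
    """C(1) = beta_perp (alpha_perp' Phi beta_perp)^{-1} alpha_perp'."""
    a_perp = orth_complement(alpha)
    b_perp = orth_complement(beta)
    middle = a_perp.T @ phi @ b_perp      # must have full rank (no I(2) trends)
    return b_perp @ np.linalg.inv(middle) @ a_perp.T

# Illustrative (assumed) values: p = 3 variables, r = 1 cointegrating vector.
alpha = np.array([[-0.4], [0.1], [0.0]])  # adjustment coefficients
beta = np.array([[1.0], [-1.0], [0.5]])   # cointegrating vector
phi = np.eye(3)                           # assumed mean lag matrix

C1 = impact_matrix(alpha, beta, phi)
print(np.round(C1, 3))
print("beta' C(1):", np.round(beta.T @ C1, 10))  # cointegrating relation removes the trends
```

In this sketch $C(1)$ has rank $p - r = 2$, the number of common trends, and $\beta' C(1) = 0$, so the stochastic trends cancel in the cointegrating combination.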
15. THE ESTIMATION OF DYNAMIC MODELS

(To be completed...)

The modelling of stochastic difference models introduces some problems which clearly violate the assumptions behind the classical linear regression model. With some care most of these problems can be solved. The most important factors are whether the data series are stationary, and whether the residuals are white noise. As long as the variables are stationary and the residual is a white noise process, OLS estimation is generally feasible. Autocorrelated residuals, however, mean that the OLS estimator is no longer consistent. In this situation the model must either be re-specified, or the whole model including the autoregressive process in the residuals must be estimated by maximum likelihood.

To understand the differences between the estimation of stochastic difference equations and the classical linear regression model, we will introduce these differences step by step. In all there are 6 models of interest here:

1. The classical linear regression model.
2. Regression with deterministic trends.
3. Models with stochastic explanatory variables.
4. Autoregressive models, lagged dependent variable.
5. Autoregressive models with integrated variables (testing for unit roots).
6. Regression models with integrated variables (spurious regression and cointegration).

The following sections do not present any rigorous proofs concerning the properties of the OLS estimator. The aim is only to review known problems and introduce some new ones.

15.1 Deterministic Explanatory Variables (The Classical Linear Regression Model)

Starting with $y_t = \beta x_t + \epsilon_t$, the matrix form of this model is

$$y = X\beta + \epsilon, \tag{15.1}$$

where $y$ is a vector of $T$ observations of $y_t$, $X$ a matrix of explanatory variables, $\beta$ the parameters and $\epsilon$ a vector of residuals of the same dimension as $y$. (To keep the example simple, $\beta$ is one parameter, but the example could be extended to a multivariate case.) The classical case builds upon four assumptions.
First, the model is linear, or log linear in variables. Second, the residuals are independent, have a mean of zero and a finite variance,

$$E(\epsilon) = 0, \quad Var(\epsilon) = \sigma^2 I. \tag{15.2}$$
This is basically a statement of correct specification. The model should be set up in such a way that the expected value of the residuals is zero. Third, the explanatory variables are non-stochastic and therefore independent of the errors,

$$E(X'\epsilon) = 0. \tag{15.3}$$

Finally, the explanatory variables are linearly independent such that

$$rank(X'X) = rank(X) = k, \tag{15.4}$$

which ensures that the inverse of $(X'X)$ exists. Minimizing the sum of squared residuals leads to the following OLS estimator of $\beta$,

$$\hat\beta = (X'X)^{-1}(X'y) = \beta + (X'X)^{-1}(X'\epsilon). \tag{15.5}$$

If we simplify the model to one parameter ($\beta$) and one explanatory variable ($x_t$), we have for a sample of $T$ observations,

$$\hat\beta = \beta + \left[\frac{1}{T}\sum_{t=1}^{T}x_t^2\right]^{-1}\left[\frac{1}{T}\sum_{t=1}^{T}x_t\epsilon_t\right]. \tag{15.6}$$

The estimated parameter $\hat\beta$ is equal to its true value $\beta$ plus an additional term. For the estimate to be unbiased the expected value of the last factor must be zero. If we assume that the $x$'s are deterministic the problem is relatively easy. A correct specification of the model, $E(\epsilon) = 0$, leads to the result that $\hat\beta$ is unbiased. The estimator $\hat\beta$ has the variance

$$Var(\hat\beta) = E[(\hat\beta - \beta)(\hat\beta - \beta)'] = (x'x)^{-1}x'(\sigma^2 I)x(x'x)^{-1} = \sigma^2(x'x)^{-1}. \tag{15.7}$$

Taking expectations of $\hat\beta$, under the assumptions above, verifies that $\hat\beta$ is an unbiased estimate of $\beta$, [1]

$$E(\hat\beta) = E(\beta) + E[(X'X)^{-1}(X'\epsilon)] = \beta + E[(X'X)^{-1}]E(X'\epsilon), \tag{15.8}$$

where $(X'X)$ is a constant when $x_t$ is deterministic, and $E(X'\epsilon) = X'E(\epsilon) = 0$, if the residuals have a zero mean. Thus, under these assumptions OLS is unbiased and also consistent (not proven here). Consistency implies that $var(\hat\beta)$ tends to zero as $T \to \infty$. The problem with assuming that the $x$'s are non-stochastic is of course that it is an unrealistic assumption in a time series setting. Typically the explanatory variables are as stochastic as the dependent variable.

So far we have not made any statements about the distribution of the estimates. OLS has the advantage that it leads to unbiased and efficient estimates under quite general assumptions.
However, to make any inference on $\hat\beta$, we need to make assumptions about its distribution. In most cases the assumption of a normal distribution is reasonable, at least asymptotically, or a reasonable approximation in a limited sample, leading to

$$(\hat\beta - \beta) \sim N(0, \sigma^2(X'X)^{-1}). \tag{15.9}$$

Thus, the limiting distribution of $\hat\beta$ is normal, and since $(\hat\beta - \beta)$ is built from a white noise process we know that it converges to the true value with the speed given by the standard error of a white noise process, $1/T^{1/2}$.

[1] The expectation of an expectation is equal to the expectation, $E[E(\hat\beta)] = E(\hat\beta)$, and the expectation of a constant is equal to the constant; since the true parameter $\beta$ can be treated as a constant we have $E(\beta) = \beta$.
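The unbiasedness result in (15.8) and the variance formula in (15.7) can be checked with a small Monte Carlo experiment. The sketch below is only an illustration under assumed values (the sample size, the fixed regressor `x`, `beta = 2` and `sigma = 1` are all made up): the regressor is held fixed across replications, which is exactly the "deterministic $x$" assumption of the classical model.

```python
import numpy as np

rng = np.random.default_rng(0)

T, beta, sigma = 50, 2.0, 1.0          # assumed illustrative values
x = np.linspace(1.0, 5.0, T)           # deterministic regressor, fixed in repeated samples
n_rep = 20000

# Each row is one replication of y_t = beta * x_t + eps_t with new residuals.
eps = rng.normal(0.0, sigma, size=(n_rep, T))
y = beta * x + eps

# One-regressor OLS estimator for every replication: sum(x*y) / sum(x^2).
beta_hat = (y @ x) / (x @ x)

theoretical_var = sigma**2 / (x @ x)   # sigma^2 (x'x)^{-1} from (15.7)

print("mean of beta_hat      :", beta_hat.mean())
print("simulated variance    :", beta_hat.var())
print("theoretical variance  :", theoretical_var)
```

The average of the estimates should lie close to the true $\beta$, and the simulated variance close to $\sigma^2(x'x)^{-1}$, up to simulation noise.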
15.2 The Deterministic Trend Model

A situation when the assumption of deterministic explanatory variables can make sense in a time series setting is when the dependent variable is driven by a deterministic trend. [2] Suppose that the explanatory variable is a deterministic time trend,

$$y_t = \alpha + \beta t + \epsilon_t, \tag{15.10}$$

where $t$ is a time trend, $t = 1, 2, 3, \dots, T$, without stochastic variation. If the time trend is adjusted for its mean, $t^* = (t - \bar t)$, the constant term ($\alpha$) will measure the unconditional mean of $y_t$. Under the assumption that $y_t$ has a sufficiently large deterministic trend component, relative to the sample size, the error terms from this regression can be understood as the detrended $y_t$ series. Assume that both $y_t$ and $t$ have been corrected for their means. OLS leads to

$$\hat\beta = \beta + \left[\frac{1}{T}\sum_{t=1}^{T}t^2\right]^{-1}\left[\frac{1}{T}\sum_{t=1}^{T}t\,\epsilon_t\right]. \tag{15.11}$$

Taking expectations leads to the result that $\hat\beta$ is unbiased. The most important reason why this regression works well is that there is an additional $t$ variable in the denominator. As $t$ goes to infinity the denominator gets larger and larger compared to the numerator, so the ratio goes to zero much faster than otherwise.

15.3 Stochastic Explanatory Variables

Applying OLS to time series data introduces the problem of stochastic explanatory variables. The explanatory variables can be stochastic on their own, and lags of the dependent variable imply stochastic regressors. Let the model be

$$y_t = \beta x_t + \epsilon_t, \tag{15.12}$$

where $x_t$ is generated by the covariance stationary stochastic process $\{x_t\}_1^T$. The OLS estimator leads to

$$\hat\beta = \beta + \left[\sum_{t=1}^{T}x_t^2\right]^{-1}\left[\sum_{t=1}^{T}x_t\epsilon_t\right]. \tag{15.13}$$

Taking expectations of the expression leads to

$$E\left[\sum_{t=1}^{T}x_t^2\right]^{-1} \text{ for the first factor and} \tag{15.14}$$

$$E\left[\sum_{t=1}^{T}x_t\epsilon_t\right] \text{ for the second factor.} \tag{15.15}$$

In the classical linear regression case $x_t$ is assumed to be deterministic, implying that $[X'X]$ is a constant and that $E(x_t\epsilon_t) = x_t E(\epsilon_t) = 0$.
Here $x_t$ is a random variable, so additional assumptions must be made for the OLS estimator.

[2] Other realistic examples in economics are deterministic dummy variables and deterministic seasonal components.
The necessary conditions are that $\{x_t\}_1^T$ is a stationary process and that $\{x_t\}_1^T$ and $\{\epsilon_t\}_1^T$ are independent. The first condition means that we can view the covariance matrix $(X'X)$ as fixed in repeated samples. In a time series perspective we cannot generally talk about repeated samples; instead we have to look at the sample moments as $T \to \infty$. If $x_t$ is a stationary variable then we can state that as $T \to \infty$, the covariance matrix will become a constant. This can be written as

$$\frac{1}{T}\sum_{t=1}^{T}x_t x_t' \xrightarrow{p} Q, \tag{15.16}$$

meaning that the expression will converge in probability to a constant $Q$. [3] An alternative way to show the properties of OLS in the case of stochastic explanatory variables is to use the probability limit operator (plim), $plim\,[(1/T)X'X] = Q$. A convenient property of plim operators is that $plim(x^{-1}) = [plim(x)]^{-1}$. Here, it remains to look at the numerator in the OLS expression. If $\{x_t\}_1^T$ and $\{\epsilon_t\}_1^T$ are generated by two independent stochastic processes we have, for each pair of observations, that $E(x_t\epsilon_t) = E(x_t)E(\epsilon_t)$. It can then be shown that

$$\frac{1}{T}\sum_{t=1}^{T}x_t\epsilon_t \xrightarrow{p} 0, \tag{15.17}$$

or, alternatively,

$$plim\,[(1/T)X'\epsilon] = 0. \tag{15.18}$$

The intuition behind this result is that, because $\epsilon_t$ is zero on average, we are multiplying $x_t$ with zero. It follows then that the average of $(x_t\epsilon_t)$ will be zero. The practical implication is that given a sufficiently large sample the OLS estimator will be unbiased, efficient and consistent even when the explanatory variables are stochastic variables.

If $\epsilon_t \sim NID(0, \sigma^2)$, we also have, conditional on the stochastic process $\{x_t\}_1^T$, that the estimated $\beta$ is distributed as

$$\hat\beta \mid x_t \sim N[\beta, \sigma_\epsilon^2(X'X)^{-1}] \tag{15.19}$$

and

$$(\hat\beta - \beta) \mid x_t \sim N(0, \sigma_\beta^2), \tag{15.20}$$

making $\hat\beta$ an unbiased and consistent estimate, with a normal distribution such that standard distributions can be used for inference.

The example can be extended by two assumptions.
First let the residuals be $\epsilon_t \sim iid(0, \sigma^2)$: they are independent and identically distributed as before, but not necessarily normal. Second, let the process $\{x_t\}_1^T$ be only covariance stationary in the long run, allowing the sample covariance to vary with time in a limited sample, $E(X'X) = (1/T)\sum_t x_t x_t' = Q_t$. The processes $\{x_t\}_1^T$ and $\{\epsilon_t\}_1^T$ are independent as above. Under these conditions the estimated $\beta$ is

$$\hat\beta_t = \beta + Q_t^{-1}\left[(1/T)\sum_{t=1}^{T}x_t\epsilon_t\right]. \tag{15.21}$$

The estimated $\beta_t$ can vary with $t$ since $Q_t$ varies with time. To establish that OLS is a consistent estimator we need to establish that

$$\left[(1/T)\sum_{t=1}^{T}x_t x_t'\right]^{-1} = Q_t^{-1} \xrightarrow{p} Q^{-1}. \tag{15.22}$$

[3] In a multivariate model we would say that $Q$ converges to a matrix of constants.
The condition holds if $\{x_t\}_1^T$ is covariance stationary: as $T$ goes to infinity the estimate will converge in probability ($\xrightarrow{p}$) to a constant. The second condition is that the average $(1/T)\sum_{t=1}^{T}x_t\epsilon_t$ converges in probability to zero, which takes place whenever $x_t$ and $\epsilon_t$ are independent. The error process is iid, but not necessarily normal. Under the conditions given here, the central limit theorem is sufficient to establish that the scaled sequence converges (weakly in distribution) to a normal distribution,

$$T^{1/2}\left[(1/T)\sum_{t=1}^{T}x_t\epsilon_t\right] \xrightarrow{d} N(0, \sigma^2 Q), \tag{15.23}$$

so that $(\hat\beta_t - \beta)$ is asymptotically distributed as

$$T^{1/2}(\hat\beta_t - \beta) \sim N(0, \sigma^2 Q^{-1}). \tag{15.24}$$

In a limited sample the normal distribution will be an approximation. The result is necessary for using $t$, $\chi^2$ and $F$-distributions for inference on $\hat\beta$ and $\hat\sigma^2$.

To see how the last result works, recall the central limit theorem (CLT). The CLT states that the sample mean $\bar z_T$ of an iid process, suitably scaled, converges weakly to a normally distributed variable as the sample size $T$ increases, so that

$$T^{1/2}(\bar z_T - \mu) \to N(0, \sigma^2), \tag{15.25}$$

where $\mu$ is the population mean of $z_t$. From the OLS estimator we have

$$(\hat\beta_t - \beta) = \left[(1/T)\sum_{t=1}^{T}x_t x_t'\right]^{-1}\left[(1/T)\sum_{t=1}^{T}x_t\epsilon_t\right]. \tag{15.26}$$

Since $(1/T) = (1/T^{1/2})(1/T^{1/2})$ the CLT can be invoked by rewriting the expression as

$$T^{1/2}(\hat\beta_t - \beta) = \left[(1/T)\sum_{t=1}^{T}x_t x_t'\right]^{-1}\left[(1/T^{1/2})\sum_{t=1}^{T}x_t\epsilon_t\right], \tag{15.27}$$

where the LHS and the numerator on the RHS correspond to the CLT. From the numerator on the RHS we get, as $T$ goes to infinity,

$$(1/T^{1/2})\sum_{t=1}^{T}x_t\epsilon_t \to N(0, \sigma^2 Q). \tag{15.28}$$

Moreover, we can also conclude that the rate of convergence is given by $1/T^{1/2}$. Dividing both sides of (15.27) by $T^{1/2}$ leaves $1/T^{1/2}$ on the RHS, which then represents the speed by which the estimate $\hat\beta_t$ converges to its true value $\beta$.

15.4 Lagged Dependent Variables

Let us now turn to the AR(1) model,

$$y_t = \beta y_{t-1} + \epsilon_t, \tag{15.29}$$
where $\epsilon_t \sim iid(0, \sigma^2)$. (The estimation of AR(p) models follows from this example in a straightforward way.) The estimated $\beta$ is

$$\hat\beta = \left[T^{-1}\sum_{t=1}^{T}y_{t-1}^2\right]^{-1}\left[T^{-1}\sum_{t=1}^{T}y_{t-1}y_t\right], \tag{15.30}$$

leading to

$$\hat\beta - \beta = \left[T^{-1}\sum_{t=1}^{T}y_{t-1}^2\right]^{-1}\left[T^{-1}\sum_{t=1}^{T}y_{t-1}\epsilon_t\right]. \tag{15.31}$$

This is similar to the stochastic regressor case, but here the processes $\{y_{t-1}\}$ and $\{\epsilon_t\}$ cannot be assumed to be independent, so the expectation of the RHS does not factor into a term containing $E(y_{t-1})E(\epsilon_t) = 0$, and $\hat\beta$ can be biased in a limited sample. The dependence can be explained as follows: $y_t$ depends on $\epsilon_t$, and $y_t$ is through the AR(1) process correlated with $y_{t+1}$, so $y_{t+1}$ is correlated with $\epsilon_t$. The long-run covariance (lrcov) between $y_{t-1}$ and $\epsilon_t$ is defined as

$$lrcov(y_{t-1}\epsilon_t) = \left[T^{-1}\sum_{t=1}^{T}y_{t-1}\epsilon_t\right] + \sum_{k=1}^{\infty}E(y_{t-1}\epsilon_{t+k}) + \sum_{k=1}^{\infty}E(y_{t-1+k}\epsilon_t), \tag{15.32}$$

where the first term on the RHS is the sample estimate of the covariance, and the last two terms capture leads and lags in the cross correlation between $y_{t-1}$ and $\epsilon_t$. As long as $y_t$ is covariance stationary and $\epsilon_t$ is iid, the sample estimate of the covariance will converge to its true long-run value. This dependence from $\epsilon_t$ to $y_{t+1}$ is not of major importance for estimation. Since $(y_{t-1}\epsilon_t)$ is still a martingale difference sequence w.r.t. the history of $y_t$ and $\epsilon_t$, we have that $E\{y_{t-1}\epsilon_t \mid y_{t-2}, y_{t-3}, \dots, \epsilon_{t-1}, \epsilon_{t-2}, \dots\} = 0$, so it can be established in line with the CLT, [4] that

$$(1/T^{1/2})\sum_{t=1}^{T}y_{t-1}\epsilon_t \to N(0, \sigma^2 Q). \tag{15.33}$$

Using the same assumptions and notation as above, the variance is given by $E(y_{t-1}\epsilon_t\epsilon_t y_{t-1}) = E(\sigma^2)E(y_{t-1}y_{t-1}) = \sigma^2 Q_t$. These results are sufficient to establish that OLS is a consistent estimator, though not necessarily unbiased in a limited sample. It follows that the distribution of the estimated $\beta$, and its rate of convergence, is as above. The results are the same for higher order stochastic difference models.

15.5 Lagged Dependent Variables and Autocorrelation

In this section we look at the AR(1) model with an autoregressive residual process.
[4] This result is established by the so-called Mann-Wald theorem.

Let the error process be

$$\epsilon_t = \rho\epsilon_{t-1} + v_t, \tag{15.34}$$

where $v_t \sim iid(0, \sigma^2)$. In this case we get the following expression,
$$\begin{aligned}
\sum_{t=1}^{T}y_{t-1}\epsilon_t &= \sum_{t=1}^{T}[y_{t-1}(\rho\epsilon_{t-1} + v_t)] = \rho\sum_{t=1}^{T}y_{t-1}\epsilon_{t-1} + \sum_{t=1}^{T}y_{t-1}v_t \\
&= \rho\sum_{t=1}^{T}[y_{t-1}(y_{t-1} - \beta y_{t-2})] + \sum_{t=1}^{T}y_{t-1}v_t \\
&= \rho\sum_{t=1}^{T}y_{t-1}^2 - \beta\rho\sum_{t=1}^{T}y_{t-1}y_{t-2} + \sum_{t=1}^{T}y_{t-1}v_t.
\end{aligned} \tag{15.35}$$

Multiplying the expression by $(1/T)$ and taking expectations gives

$$E\left[(1/T)\sum_{t=1}^{T}y_{t-1}\epsilon_t\right] = \rho\,var(y_t) - \beta\rho\,cov(y_{t-1}, y_{t-2}) + cov(y_{t-1}, v_t), \tag{15.36}$$

which establishes that the OLS estimator is biased and inconsistent. Only the last covariance term can be assumed to go to zero as $T$ goes to infinity. In this situation OLS is always inconsistent. [5] Thus, the conclusion is that with a lagged dependent variable OLS is only feasible if there is no serial correlation in the residual.

There are two solutions in this situation: to respecify the equation so that the serial correlation is removed from the residual process, or to turn to an iterative ML estimation of the model ($y_t - \beta y_{t-1} - \rho\epsilon_{t-1} = v_t$). The latter specification implies common factor restrictions, which if not tested are an ad hoc assumption. The approach was extremely popular in the late 70s and early 80s, when people used to rely on a priori assumptions in the form of adaptive expectations or costly adjustment, as examples, to derive their estimated models. Often economists started from a static formulation of the economic model and then added assumptions about expectations or adjustment costs. These assumptions could then lead to an infinite lag structure with white noise residuals. To estimate the model these authors called upon the so-called Koyck transformation to reduce the model to a first order autoregressive stochastic difference model, with an assumed first order serially correlated residual term.

15.6 The Problems of Dependence and the Initial Observation

An additional problem is that of dependent observations. When we derived the estimators, in particular the MLE, we had to assume that the observations were drawn independently.
A basic assumption is therefore violated, because the observations in a typical time series model are dependent. The AR(1) can serve as an example,

$$x_t = a_1 x_{t-1} + \epsilon_t, \quad \epsilon_t \sim NID(0, \sigma^2). \tag{15.37}$$

In this model each observation of $x_t$ is dependent on the observation of $x_t$ in the previous period. How does this affect the ML estimator?

[5] Asymptotically, though, the estimates have normal distributions, because the long-run bias converges to a constant while the error process $v_t$ converges to $NID(0, \sigma^2)$. This is a result of the CLT.

Suppose the sample
only consists of two observations $x_1$ and $x_2$. The joint density function for these two observations can be factorised as

$$D(x_1, x_2) = D_1(x_2 \mid x_1)D_2(x_1). \tag{15.38}$$

Extend the sample to 3 observations and we get

$$D(x_1, x_2, x_3) = D_1(x_3 \mid x_2, x_1)D_2(x_2, x_1) \tag{15.39}$$

$$= D_1(x_3 \mid x_2, x_1)D_2(x_2 \mid x_1)D_3(x_1). \tag{15.40}$$

With three observations, we have that the joint probability density function is equal to the density function of $x_3$, conditional on $x_2$ and $x_1$, multiplied by the conditional density for $x_2$, multiplied by the marginal density for $x_1$. It follows that for a sample of $T$ observations, the likelihood function can be written as

$$L(\theta; x) = \prod_{t=2}^{T}D(x_t \mid X_{t-1}, \theta)\,f(x_1), \tag{15.41}$$

where $X_{t-1}$ represents the observations up to and including $x_{t-1}$. Now, the AR(1) model implies that the conditional density function of $x_t$, $D(x_t \mid x_{t-1}, \dots, x_1)$, is normal with mean $a_1 x_{t-1}$ and variance $\sigma^2$. The log likelihood function is

$$\log L(a_1, \sigma^2; x) = -\frac{T-1}{2}\log 2\pi - \frac{T-1}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{t=2}^{T}(x_t - a_1 x_{t-1})^2 + \log D(x_1).$$

This looks like the expression for the MLE derived earlier, with the exception of the last term, the log likelihood for the very first observation. By definition, the first observation here contains the initial conditions for the model, meaning everything that happened up to and including the first period of the sample. The question is, how do we get rid of this term?

A practical solution is to assume that $x_1$ can be treated as a fixed value in repeated realizations. (Compare with the stochastic regressor case in OLS.) In this case $\log f(x_1)$ can be seen as a constant which can be left out of the MLE because it will not affect the estimates of the parameters.

An alternative way is to assume that $x_t$ is stationary and normally distributed. The absolute value of $a_1$ will then be less than one. The unconditional normal distribution of $x_1$ is therefore known to have mean zero and variance $\sigma^2/(1 - a_1^2)$.
The likelihood becomes

$$\log L(a_1, \sigma^2; x) = -\frac{T}{2}\log 2\pi - \frac{T}{2}\log\sigma^2 + \frac{1}{2}\log(1 - a_1^2) - \frac{1}{2\sigma^2}(1 - a_1^2)x_1^2 - \frac{1}{2\sigma^2}\sum_{t=2}^{T}(x_t - a_1 x_{t-1})^2.$$

Unfortunately the first-order conditions of this log likelihood are no longer linear in $a_1$. The most convenient solution in this case is to drop the third and the fourth terms from the likelihood, with the argument that we are only dealing with one observation, so the asymptotic properties of the estimator should be unchanged. The conclusion would be the same if we assume that $x_1$ is fixed in repeated samples.

Finally, the most difficult way of dealing with the situation is to use the sample information to estimate the initial conditions of $x_t$. This would be recommended if we are modeling non-stationary variables where the distribution of the initial value might differ to a large extent from the following observations. (An example of this can be found in Bergstrom (1989).)
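The two treatments of the first observation can be compared numerically. The sketch below is an illustration only: the function names `cond_loglik` and `exact_loglik`, the simulated sample and the grid search are assumptions made for the example, not part of the original text. The conditional likelihood treats $x_1$ as fixed, so its maximizer in $a_1$ is the OLS estimator; the exact likelihood adds the log density of $x_1 \sim N(0, \sigma^2/(1 - a_1^2))$.

```python
import numpy as np

def cond_loglik(a1, sigma2, x):
    """Gaussian AR(1) log likelihood conditional on the first observation."""
    n = len(x) - 1                       # T - 1 conditional terms
    resid = x[1:] - a1 * x[:-1]
    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * n * np.log(sigma2)
            - resid @ resid / (2 * sigma2))

def exact_loglik(a1, sigma2, x):
    """Adds log D(x_1) with x_1 ~ N(0, sigma2 / (1 - a1^2)); requires |a1| < 1."""
    var1 = sigma2 / (1 - a1**2)
    log_d1 = -0.5 * np.log(2 * np.pi) - 0.5 * np.log(var1) - x[0]**2 / (2 * var1)
    return cond_loglik(a1, sigma2, x) + log_d1

# Simulate a stationary AR(1) sample (assumed values a1 = 0.7, sigma = 1).
rng = np.random.default_rng(2)
T, a_true = 500, 0.7
x = np.empty(T)
x[0] = rng.normal(0.0, 1.0 / np.sqrt(1 - a_true**2))   # draw from the stationary distribution
for t in range(1, T):
    x[t] = a_true * x[t - 1] + rng.normal()

# Conditional MLE of a1 is OLS of x_t on x_{t-1}.
a_ols = (x[1:] @ x[:-1]) / (x[:-1] @ x[:-1])
s2_hat = np.mean((x[1:] - a_ols * x[:-1])**2)

# Exact MLE of a1 by a simple grid search, holding sigma^2 at s2_hat for simplicity.
grid = np.linspace(-0.99, 0.99, 1999)
a_exact = grid[np.argmax([exact_loglik(a, s2_hat, x) for a in grid])]

print("conditional (OLS) estimate:", a_ols)
print("exact-likelihood estimate :", a_exact)
```

With a sample of this size the two estimates should be close, which illustrates the argument that a single observation does not change the asymptotic properties of the estimator.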
15.7 Estimation with Integrated Variables

(To be completed and extended)

In this section we investigate the problems of estimating integrated series. An integrated variable can be defined as follows: "A series ($x_t$) with no deterministic component and which has a stationary and invertible autoregressive moving average (ARMA) representation after differencing $d$ times, but which is not stationary after differencing only $d-1$ times, is said to be integrated of order $d$, denoted $x_t \sim I(d)$." (Banerjee et al. (1993))

In many areas where time series techniques are applied, integrated variables are rare exceptions, which are seldom interesting to analyse. In economics this is not the case: most macroeconomic time series appear to be integrated or nearly integrated series, see Nelson and Plosser (1982). Thus, the estimation and distribution of sample estimates are of great importance in economics, especially since regression with integrated variables often results in spurious correlations when standard distributions are used for inference.

The simplest example of an I(1) series is the random walk model $y_t = y_{t-1} + \epsilon_t$, where $\epsilon_t \sim NID(0, \sigma^2)$. Taking the first difference of this variable results in a stationary I(0) series according to the definition given above. If $y_t$ is generated as an integrated series, the main problem with estimating a random walk model,

$$y_t = \beta y_{t-1} + \epsilon_t, \tag{15.42}$$

is that the estimated $\beta$ does not follow a normal distribution, not even asymptotically. The problem here is not inconsistency, but the nonstandard distribution of the estimated parameters. This is clearly established in Fuller (1976), where the results from simulating the empirical distribution of the random walk model are presented. Fuller generated data series from a driftless random walk model, and estimated the following models,

a) $\Delta y_t = \rho y_{t-1} + \upsilon_t$,
b) $\Delta y_t = \alpha + \rho y_{t-1} + \upsilon_t$,
c) $\Delta y_t = \alpha + \pi(t - \bar t) + \rho y_{t-1} + \upsilon_t$,

where $\alpha$ is a constant and $(t - \bar t)$ a mean adjusted deterministic trend.
These equations follow from the random walk model. The reason for setting up these three models is that the modeler will not know in practice that the data is generated by a driftless random walk. S/he will therefore add a constant (representing the deterministic growth trend in $y_t$) or a constant and trend. The models are easy to understand: simply subtract $y_{t-1}$ from both sides of the random walk model, which leads to

$$y_t - y_{t-1} = \beta y_{t-1} - y_{t-1} + \epsilon_t, \tag{15.43}$$

$$\Delta y_t = (\beta - 1)y_{t-1} + \epsilon_t = \rho y_{t-1} + \epsilon_t. \tag{15.44}$$

Thus, if $y_t$ is integrated of order one, $\rho = 0$. The problem here is that since $\hat\rho$ does not follow a standard distribution the conventional t-statistic cannot be used. This would not be a problem if $\beta$ equals say 0.99; then $\beta < 1$ and the series would be stationary, and its asymptotic behaviour would be like the AR(1) model above. Fuller's simulations of the empirical t distribution of the estimated $\rho$ in the three models showed that they did not converge to the normal distribution. With these results he established what is now known as the Dickey-Fuller distributions. Furthermore, the divergences compared to the normal distribution are huge. So, here is a case where the central limit theorem does not work.
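A simulation in the spirit of the experiment described above can be sketched as follows. This is not Fuller's original design: the sample size, the number of replications and the function name `df_tstats` are assumptions chosen for illustration. Driftless random walks are generated, models (a) to (c) are estimated by OLS, and the empirical 5% quantile of the t-statistic for $\rho = 0$ is reported. The constant and the trend are partialled out before the regression on $y_{t-1}$, which gives the same t-statistic as the full regression once the degrees of freedom are adjusted.

```python
import numpy as np

def df_tstats(y, model):
    """t-statistics for rho = 0 in Delta y_t = [const] [+ trend] + rho * y_{t-1} + v_t.

    y     : array of shape (n_rep, T + 1), one simulated series per row
    model : "none" (a), "const" (b) or "trend" (c)
    """
    dy = np.diff(y, axis=1)
    ylag = y[:, :-1]
    T = dy.shape[1]
    cols = []
    if model in ("const", "trend"):
        cols.append(np.ones(T))
    if model == "trend":
        cols.append(np.arange(T) - (T - 1) / 2.0)    # mean adjusted trend
    if cols:
        Z = np.column_stack(cols)
        # Residual maker M = I - Z (Z'Z)^{-1} Z' removes the deterministic terms.
        M = np.eye(T) - Z @ np.linalg.solve(Z.T @ Z, Z.T)
        dy, ylag = dy @ M, ylag @ M
    sxx = np.sum(ylag**2, axis=1)
    rho = np.sum(ylag * dy, axis=1) / sxx
    resid = dy - rho[:, None] * ylag
    dof = T - 1 - len(cols)
    s2 = np.sum(resid**2, axis=1) / dof
    return rho / np.sqrt(s2 / sxx)

rng = np.random.default_rng(1)
n_rep, T = 10000, 200
eps = rng.standard_normal((n_rep, T))
y = np.concatenate([np.zeros((n_rep, 1)), np.cumsum(eps, axis=1)], axis=1)

crit = {m: np.quantile(df_tstats(y, m), 0.05) for m in ("none", "const", "trend")}
for m, q in crit.items():
    print(f"model {m:5s}: empirical 5% quantile of t = {q:.2f}")
```

The simulated quantiles can be compared with the one-sided 5% value from the normal distribution; they lie further out in the left tail, and more so when a constant or a trend is included.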
The standard t-statistic for an infinitely large sample is, for a two sided test of $\hat\rho = 0$, equal to 1.96 at the 5% level. However, according to the simulations of Dickey and Fuller the appropriate value of the t-statistic in model (a) is 2.23, for an infinitely large sample. In an autoregressive model we know that the estimate of $\rho$ is biased downward. Thus, the alternative hypothesis in models (a) to (c) is that $\rho$ is less than zero. The associated asymptotic t-value for an estimate from a normal distribution is therefore -1.65. Dickey and Fuller established that the asymptotic critical values for one sided t-tests at the 5% level in the models (a) to (c) are -1.95, -2.86 and -3.41 respectively. (See Fuller (1976) Table 3.2.1, page 373.) Notice that the critical values change depending on the parameters included in the empirical model. Also, the empirical distributions assume white noise residuals; if this is not the case, either the model or the test statistic must be adjusted. Moreover, as long as $\beta = 1$ or $\rho = 0$ cannot be rejected, the estimated constant term in model b, as well as the constant and the quadratic trend in model c, also follow non-normal distributions. These cases are tabulated in Dickey and Fuller (1981).

The consequence of ignoring the results of Dickey and Fuller is obvious. If using the standard tables, one will reject the null hypothesis of $\rho = 0$ ($\beta = 1.0$) too many times. It follows that if you use standard t-tests you will end up modelling non-stationary series, which in turn takes you to the spurious regression problem. The alternative hypotheses for unit root tests are discussed in the following chapter.

The explanation of why the t-statistic ends up being non-normally distributed can be introduced as follows. As $T$ goes to infinity, the relative distance between $y_t$ and $y_{t-1}$ becomes smaller and smaller. Increasing the sample size implies that the random walk model goes towards a continuous time random walk model.
The asymptotic distribution of such a model is that of a Wiener process (or Brownian motion). The OLS estimate is

$$(\hat\beta - 1.0) = \left[T^{-1}\sum_{t=1}^{T}y_{t-1}^2\right]^{-1}\left[T^{-1}\sum_{t=1}^{T}y_{t-1}\epsilon_t\right], \tag{15.45}$$

where, since $y_t$ is driven by a stochastic trend, $y_t = \sum_{i=1}^{t}\epsilon_i$, the sample moments of the two factors on the RHS will not converge to constants, but to random variables instead. These random variables will have a non-standard distribution, often called a Dickey-Fuller distribution. We can express this as

$$(\hat\beta - 1.0) \Rightarrow [W_{yy}(t)]^{-1}[W_{y\epsilon}(t)], \tag{15.46}$$

where $W(t)$ indicates that the sample moment converges to a random variable which is a function of a Wiener process and therefore distributed according to a non-standard distribution. If the residuals are white noise then we get the so-called Dickey-Fuller distributions. The intuition behind this result is that an integrated variable has an infinite memory so the correlation between $y_{t-1}$ and $\epsilon_t$ does not disappear as $T$ grows.

The nonstandard distribution remains, and gets worse if we choose to regress two independent integrated variables against each other. Assume that $x_t$ and $y_t$ are two random walk variables, such that

$$y_t = y_{t-1} + \mu_t, \quad \text{and} \quad x_t = x_{t-1} + \eta_t, \tag{15.47}$$

where both $\mu_t$ and $\eta_t$ are $NID(0, \sigma^2)$. In this case, $\beta$ would equal zero in the model $y_t = \beta x_t + \epsilon_t$. The estimated t-value from this model, when $y_t$ and $x_t$ are independent, should converge to zero. This is not what happens when $y_t$ and $x_t$ are also
integrated variables. In this case the empirical t-value does not converge to zero; it will typically exceed 2.0 in absolute value, leading to spurious correlation if a standard t-table at 5% is used to test for dependence between the variables. The problem can be described as follows. If $\beta$ is zero, the residual term will be I(1), having the same sample moments as $y_t$. Since $y_t$ is a random walk we know that the variance of $\epsilon_t$ will be time dependent and non-stationary as $T$ goes to infinity. The sample estimate of $\epsilon_t^2$ is therefore not representative of the true long run variance of the $y_t$ series. The OLS estimator gives

$$\hat\beta = \beta + \left[\frac{1}{T}\sum_{t=1}^{T}x_t^2\right]^{-1}\left[\frac{1}{T}\sum_{t=1}^{T}x_t\epsilon_t\right], \tag{15.48}$$

which, if the variables are integrated, converges to

$$(\hat\beta - \beta) \Rightarrow [B_1(t)]^{-1}[B_2(t)], \tag{15.49}$$

where $B_1$ and $B_2$ represent sample moments that are functions of random variables, which follow a Brownian motion (Wiener process). The intuition here is that in the long run a random walk variable collapses to its continuous time counterpart, which is the Brownian motion (the Wiener process). The important difference is that instead of having sample moments which are constant in the long run, we have a ratio between two random variables which are functions of Brownian motions. In this situation the estimated parameters end up following non-standard distributions. It is easy to understand why this is a bit problematic: just recall that a random walk can be written as the sum of all shocks to the series, plus the initial value. In other words, the sample moments in this case are sums of partial sums, since each observation of $x_t$ can be written as a sum of shocks.

The estimated parameter $\beta$ will still converge to its true sample moment. The variance, however, will be different. It can be shown that in this case the sample moment of $\beta$ is $(\hat\beta - \beta) \sim NID[0, \sigma^2(t)]$, where the variance is a function of time. The estimate of $\beta$ is still asymptotically correct and normally distributed but its variance is the variance of a Brownian motion.
It can also be shown that the convergence of $\hat\beta$ to its true value is much faster than in the standard stationary case. Stock (1985) showed that the rate of convergence is $1/T$, instead of the standard OLS rate $1/T^{1/2}$. This is known as super convergence. Unfortunately, this is only an asymptotic result. In most applications the short run dynamics between the variables will seriously bias the OLS estimate in this situation. The consequence is that if one tries to use standard tables, like $t$ or $F$, to test the significance of $\beta$, one might not be able to reject spurious results. The true critical values for this model will be much higher. If one regresses one or more independent random walks against each other, standard $t$ and $F$ tables become useless and will lead the researcher to accept hypotheses of correlation when there is no correlation whatsoever.

These results might look like a special case, but they are not. In fact they carry over to small sample estimates involving all types of integrated and near integrated variables. The distributions of the parameters based on strongly autocorrelated data are closer to those of a random walk than to those of standard stationary normal variables. These results stress the importance of testing for the type of non-stationarity, order of integration, and presence of cointegration when working with time series. Otherwise one can easily fall into the spurious regression trap.

The problems are likely to carry over even to the situation when $\beta$ is different from zero. In this case $\epsilon_t$ will be stationary, but the distribution of $\hat\beta$ is nonstandard as long as the two residual terms $\mu_t$ and $\eta_t$ are dependent. In general, without a priori knowledge, the estimated standard errors from integrated variables must be
assumed to follow nonstandard distributions. The estimated equation must either be modified, or cointegration tests must be carried out.
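The spurious regression problem is easy to reproduce by simulation. The following sketch is an illustration under assumed settings (sample sizes, number of replications and the function name `rejection_rate` are all made up for the example): it regresses one series on another, independent, series without a constant, as in $y_t = \beta x_t + \epsilon_t$, and records how often the conventional rule $|t| > 1.96$ rejects $\beta = 0$. The experiment is run both for independent random walks and, as a control, for independent white noise series.

```python
import numpy as np

def rejection_rate(T, n_rep, integrated, rng):
    """Share of replications where |t| > 1.96 in y_t = beta * x_t + e_t.

    x and y are generated independently, so the true beta is zero.
    integrated=True  -> both series are random walks (cumulated shocks)
    integrated=False -> both series are white noise
    """
    u = rng.standard_normal((n_rep, T))
    e = rng.standard_normal((n_rep, T))
    if integrated:
        y, x = np.cumsum(u, axis=1), np.cumsum(e, axis=1)
    else:
        y, x = u, e
    sxx = np.sum(x * x, axis=1)
    b = np.sum(x * y, axis=1) / sxx            # OLS slope, no constant
    resid = y - b[:, None] * x
    s2 = np.sum(resid**2, axis=1) / (T - 1)    # residual variance
    t = b / np.sqrt(s2 / sxx)                  # conventional t-statistic
    return float(np.mean(np.abs(t) > 1.96))

rng = np.random.default_rng(4)
n_rep = 2000
rates = {
    ("random walks", 50): rejection_rate(50, n_rep, True, rng),
    ("random walks", 400): rejection_rate(400, n_rep, True, rng),
    ("white noise", 400): rejection_rate(400, n_rep, False, rng),
}
for (kind, T), r in rates.items():
    print(f"{kind:12s} T={T:3d}: share of |t| > 1.96 = {r:.2f}")
```

For stationary series the rejection share should be close to the nominal 5%, while for independent random walks it is far higher and grows with the sample size, which is the spurious regression trap described above.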
16. ENCOMPASSING

Often you will find that there are several alternative variables that you can put into a model; there might be several measures of income, capital or interest rates to choose from. Going from a general to a specific model, several models of the same dependent variable might display white noise innovation terms and stable parameters that all have signs and sizes that are in line with economic theory. A typical example is given by Mankiw and Shapiro (1986), who argue that in a money demand equation, private consumption is a better variable than income. Thus, we are faced with two empirical models of money demand. [1] The first model is

$$m_t = \beta_0 + \beta_1 y_t + \beta_2 y_{t-1} + \beta_3 r_t + \epsilon_t, \tag{16.1}$$

and the second is

$$m_t = \alpha_0 + \alpha_1 c_t + \alpha_2 c_{t-1} + \alpha_3 r_t + \eta_t. \tag{16.2}$$

Which of these models is the best one, given that both can be claimed to be good estimates of the data generating process? The better model is the one that explains more of the systematic variation of $m_t$ and explains where other models go wrong. Thus, the better model will encompass the not so good models. The crucial factor is that $y_t$ and $c_t$ are two different variables, which leads to a non-nested test. To understand the difference between nested and non-nested tests set $\beta_2 = 0$. This is a nested test because it involves a restriction on the first model only. Now, set $\beta_1 = \beta_2 = 0$; this is also a nested test, because it only reduces the information of model one. If $\alpha_1 = \alpha_2 = 0$, this is also a nested test of the second model. Thus, setting $\beta_1 = \beta_2 = 0$, or $\alpha_1 = \alpha_2 = 0$, are only special cases of each model.

The problem that we would like to address here is whether to choose either $y_t$ or $c_t$ as the scale variable in the money demand equation. This is a non-nested test because the test cannot be written as a restriction in terms of one model only.
The first thing to consider is that a stable model is better than an unstable one, so if only one of the models is stable, that is the one to choose. The next step is to compare the residual variances and choose the model with the significantly smaller error variance. However, variance domination is not really sufficient. PcGive therefore offers more tests that allow the comparison of Model one versus Model two, and vice versa. Thus, there are three possible outcomes: Model one is better, Model two is better, or there is no significant difference between the two models.

Footnote 1: For simplicity we assume that there is only one lag on income and consumption. This should not be seen as a restriction; the lag length can vary between the first and the second model.
17. ARCH MODELS

Autoregressive Conditional Heteroscedasticity (ARCH) means that the variance of a process changes in a systematic way over time. Why should one bother about heteroscedasticity in time series models? Heteroscedasticity is often viewed as unimportant in time series modeling, except for the fact that it leads to inefficient estimates. Recall the linear regression model,

$$ y_t = \beta x_t + \epsilon_t, \quad \epsilon_t \sim N(0, \sigma^2), \qquad (17.1) $$

where the residual variance $\sigma^2$, the variance of $y_t$ around its conditional mean, is usually assumed to be constant over time. In principle, however, nothing prevents the variance from varying over time ($\sigma_t^2$). There are four reasons why this type of heteroscedasticity is important in time series models.

The first is that any departure from white noise residuals is a sign of misspecification. Heteroscedasticity tests represent a way of detecting misspecification originating from leaving out an important explanatory variable, which is totally orthogonal to the other explanatory variables in the model.

Second, if the variance of the model is changing over time, so will the forecast intervals of the model. Hence, for the purpose of making better predictions ARCH is of interest because it leads to better forecast confidence intervals. One example is the so-called Value at Risk (VaR) models, which are used to forecast the level of reserves needed to meet cash flow fluctuations.

Third, the modeling of ARCH disturbances is sometimes implied by theory, and in general it makes sense from economic theory in many situations. ARCH represents a time series approach to the variance component, which picks up effects not otherwise included in the model. Various types of time varying risk premiums are examples of this. Variables such as time varying risk premiums are difficult to observe and measure. But we can trace their effects on the variance in a model like the one above.
Examples of applications are intertemporal asset market models, CAPM, exchange rate markets, etc.

Fourth, option prices depend critically on the expected future variance of the price of the underlying asset. ARCH models offer a way of forecasting variances such that pricing can be more exact, and more profitable for those who are able to make better forecasts.

An example of an ARCH(1) model is provided by,

$$ y_t = \beta x_t + \epsilon_t, \quad \epsilon_t \sim N(0, h_t), \qquad (17.2) $$
$$ h_t = \omega + \alpha_1 \epsilon_{t-1}^2, \qquad (17.3) $$

where the error variance depends on the squared error of the previous period. The first equation is referred to as the mean equation and the second equation is referred to as the variance equation. Together they form an ARCH model, and both equations must be estimated simultaneously. In the mean equation here, $\beta x_t$ is simply an expression for the conditional mean of $y_t$. In a real situation this can be explanatory variables, or an AR or ARIMA process. It will be understood that $y_t$ is stationary and I(0), otherwise the variance will not exist.

This example is an ARCH model of order one, ARCH(1). ARCH models can be said to represent an ARMA process in the variance. The implication is that a high variance in period $t-1$ will be followed by higher variances in periods $t$, $t+1$, $t+2$ etc. How long the shock persists depends, as in the ARMA model, on the size of
the parameters in combination with the lag lengths. A low variance period is likely to be followed by another low variance period, but a shock to the process and/or its variance will cause the variance to become higher before it settles down in the future. A consequence of an ARCH process is that the variance can be predicted. In other words, it is possible to predict whether the future variances and standard errors will be large or small. This will improve forecasting in general and is a useful tool for the pricing of derivative instruments.

An ARCH(q) process is,

$$ y_t = \beta x_t + \epsilon_t, \quad \epsilon_t \sim D(0, h_t), \qquad (17.4) $$
$$ h_t = \omega + \alpha_1 \epsilon_{t-1}^2 + \alpha_2 \epsilon_{t-2}^2 + \dots + \alpha_q \epsilon_{t-q}^2 = \omega + \sum_{i=1}^{q} \alpha_i \epsilon_{t-i}^2. \qquad (17.5) $$

The expression for the variance shows an autoregressive process in the variance of $\epsilon_t$. The distribution of the residual term is deliberately left undetermined. In ARCH models normality is one option, but often the residual process will be non-normal, display thicker tails, and be leptokurtic. Thus, other distributions such as the Student t-distribution can be a better alternative. The t-distribution has three parameters: the mean, the variance and the "degrees of freedom of the Student t-distribution". In this case the residual process is $\epsilon_t \sim St(0, h_t, \nu)$, where $\nu$ is a positive parameter that measures the relative importance of the peak in relation to the thickness of the tails. The Student t-distribution is a symmetrical distribution that contains the normal distribution as a special case, as $\nu \to \infty$.

The ARCH process can be detected by testing for ARCH and by inspecting the PACF and ACF of the estimated squared residuals $\hat\epsilon_t^2$. As is the case for AR models, ARCH has a more general form, the Generalised ARCH, which implies adding lags of the dependent variable $h_t$. A long lag structure in the ARCH process can be substituted with lagged dependent variables to create a shorter process, just as for ARMA processes.
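To make the ARCH(1) mechanics above concrete, the following is a minimal simulation sketch using only the Python standard library. The function names (`simulate_arch1`, `autocorr`, `kurtosis`), the parameter values and the zero start value are illustrative assumptions, not part of the original text. Under these assumptions the errors themselves should show little autocorrelation, while their squares are autocorrelated and the sample kurtosis exceeds the normal value of 3.

```python
import math
import random


def simulate_arch1(n, omega, alpha, seed=0):
    """Simulate eps_t = sqrt(h_t) * z_t with h_t = omega + alpha * eps_{t-1}^2."""
    rng = random.Random(seed)
    eps, h = [], []
    eps_prev = 0.0  # assumed start value (hypothetical choice)
    for _ in range(n):
        h_t = omega + alpha * eps_prev ** 2      # variance equation
        eps_t = math.sqrt(h_t) * rng.gauss(0.0, 1.0)  # conditionally normal error
        h.append(h_t)
        eps.append(eps_t)
        eps_prev = eps_t
    return eps, h


def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    den = sum((v - mean) ** 2 for v in x)
    return num / den


def kurtosis(x):
    """Sample kurtosis (fourth central moment over squared variance)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m4 / m2 ** 2


eps, h = simulate_arch1(20000, omega=0.5, alpha=0.5)
squared = [e * e for e in eps]
print("ACF(1) of errors:        ", round(autocorr(eps, 1), 3))
print("ACF(1) of squared errors:", round(autocorr(squared, 1), 3))
print("Sample kurtosis:         ", round(kurtosis(eps), 2))
```

Inspecting the autocorrelations of the squared series in this way mirrors the identification step described in the text; the normal innovations are only one possible choice for the distribution $D$.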
A GARCH(1,1) model is written as,

$$ y_t = \beta x_t + \epsilon_t, \quad \epsilon_t \sim D(0, h_t), \qquad (17.6) $$
$$ h_t = \omega + \alpha_1 \epsilon_{t-1}^2 + \beta_1 h_{t-1}. \qquad (17.7) $$

The GARCH(1,1) process is a very typical process found in a number of empirical applications of ARCH processes. The convention is to indicate the length of the ARCH part with q, and to use the letter p to indicate the number of lags of the variance $h_t$. The same convention assigns $\alpha$ to the ARCH process, and $\beta$ to the GARCH process. The constant, time independent part of the variance is usually denoted $\omega$; in some equations below it is written $\alpha_0$. For an asset market this type of process would imply that there are persistent periods when asset prices fluctuate relatively little, compared with other periods where prices fluctuate more and for longer times.

A general GARCH(q,p) process is,

$$ y_t = \beta x_t + \epsilon_t, \quad \epsilon_t \sim D(0, h_t), \qquad (17.8) $$
$$ h_t = \omega + \sum_{i=1}^{q} \alpha_i \epsilon_{t-i}^2 + \sum_{i=1}^{p} \beta_i h_{t-i}. \qquad (17.9) $$

ARCH and GARCH models cannot be estimated by OLS, or by standard regression programs. It is necessary to use an iterative system estimation method because the model now consists of two equations, the mean equation and the
variance equation, where the variance equation depends on estimates in the mean equation. In the example above, the additional parameters are $\omega$ and $\alpha$, which must be estimated in the same model. Therefore, some iterative ML estimator is necessary (special algorithms are also necessary). Gauss is a good program for estimating ARCH models, but takes some investment to learn; SAS (from ver. 6.08) is quite good; EViews is also good, with excellent help facilities; RATS is an alternative. Finally, PcGive 10 can also do ARCH and GARCH models.

A practical problem in estimation is that in a finite sample there is no guarantee that the estimated variance ($h_t$) will be a positive number. For that reason, software will offer you the opportunity to restrict the values of the $\alpha$'s and $\beta$'s, as well as their sums, to positive numbers.

Practical Modelling Tips

In practical modelling it is necessary to start with the mean equation. It is necessary to have a correct specification of the mean equation in order to get the variance process right. A stationary autoregressive process, relevant explanatory variables, and possibly seasonal and other dummies must be included in the mean equation to get rid of autocorrelation and general misspecification. This is a relatively easy procedure for financial return series, which are often martingale processes.

Notice that ARCH and GARCH disappear with aggregation over time and with low frequencies in the recording of data. Thus, ARCH/GARCH is typically never found for data sampled less often than monthly. Monthly data, or shorter intervals, are necessary for the modelling of an ARCH/GARCH process. Even if models estimated with quarterly data and lower frequencies can display ARCH when the residuals are tested, it is usually never possible to build an ARCH/GARCH model with that type of data.
An ARCH process can be identified by testing for ARCH(q) structure in combination with using ACFs and PACFs of the squared residuals from the mean equation. Estimate the mean equation, save the estimated residuals, square them and inspect the ACFs and PACFs of these squared residuals to identify a preliminary lag order for the GARCH. However, this method is highly approximate regarding the order of q and p.

Some ARCH Theory

To explore ARCH models, let us start with the following AR(1) model, which could represent an asset price,

$$ y_t = \gamma y_{t-1} + \epsilon_t, \qquad (17.10) $$

where $E(\epsilon_t) = 0$, $Var(\epsilon_t) = \sigma^2$ and $|\gamma| < 1$. (Thus, the model is stable and $y_t$ is stationary.) Furthermore, let us assume that the unconditional mean of $y_t$ is $E(y_t) = (1/T)\sum_{t=1}^{T} y_t$, which is not dependent on time. The expected value of $y_{t+1}$, conditional on the past history of $y_t$, is

$$ E_t(y_{t+1} \mid y_t) = \gamma y_t, \qquad (17.11) $$
which varies over time since $y_t$ is a random variable. Now turn to the variance of $y_{t+1}$,

$$ Var(y_{t+1}) = Var(\gamma y_t) + Var(\epsilon_{t+1}). \qquad (17.12) $$

There are two variances to consider. First we have the unconditional variance of $y_{t+1}$ which, for an AR(1), is given by

$$ Var(y_{t+1}) = \frac{\sigma^2}{1 - \gamma^2}. \qquad (17.13) $$

Second we have the conditional variance of $y_{t+1}$,

$$ Var_t(y_{t+1} \mid y_t) = E[y_{t+1} - E(y_{t+1} \mid y_t)]^2 = \sigma^2. \qquad (17.14) $$

We can see that while the conditional expectation of $y_{t+1}$ depends on the information set $I_t = y_t$, neither the conditional variance ($Var_t$) nor the unconditional variance ($Var$) depends on $I_t = y_t$. If we extend the forecasts k periods ahead we get, by repeated substitution,

$$ y_{t+k} = \gamma^k y_t + \sum_{i=1}^{k} \gamma^{k-i} \epsilon_{t+i}. \qquad (17.15) $$

The first term is the conditional expectation of $y_t$ k periods ahead. The second term is the forecast error. Hence, the conditional variance of $y_t$ k periods ahead is equal to

$$ Var_t(y_{t+k}) = \sigma^2 \sum_{i=1}^{k} \gamma^{2(k-i)}. \qquad (17.16) $$

It can be seen that the forecast of $y_{t+k}$ depends on the information at time t. The conditional variance, on the other hand, depends on the length of the forecast horizon (k periods into the future), but not on the information set. Nothing says that this conditional variance should be stable. Like the forecast of $y_t$, it could very well depend on available information as well, and therefore change over time.

So let us turn to the simplest case, where the errors follow an ARCH(1) model. We have the following model: $y_t = \gamma y_{t-1} + \epsilon_t$, where $\epsilon_t \sim D(0, h_t)$, $E(\epsilon_t) = 0$, $E(\epsilon_t \epsilon_{t-i}) = 0$ for $i \neq 0$, and $h_t = w + \alpha \epsilon_{t-1}^2$. The process is assumed to be stable, $|\gamma| < 1$, and since $\epsilon_t^2$ is positive we must have $w > 0$ and $\alpha \geq 0$. Notice that the errors are not autocorrelated, but at the same time they are not independent, since they are correlated in higher moments through the ARCH effect. Thus, we cannot assume that the errors really are normally distributed. If we choose to use the normal distribution as a basis for ML estimation, this is only an approximation.
(As an alternative we could think of using the t-distribution, since the distribution of the errors tends to have fatter tails than the normal.) Looking at the conditional expectations of the mean and the variance of this process, $E_t(y_{t+1} \mid y_t) = \gamma y_t$ and $Var_t(y_{t+1} \mid y_t) = h_{t+1} = w + \alpha(y_t - \gamma y_{t-1})^2$. We can see that both depend on the available information at time t. In particular, it should be noticed that the conditional variance of $y_{t+1}$ is increased by both positive and negative shocks in $y_t$. Extending the conditional variance expression k periods ahead, as above, we get

$$ Var_t(y_{t+k} \mid y_t) = \sum_{i=1}^{k} \gamma^{2(k-i)} E_t(h_{t+i}), \qquad (17.17) $$

where $E_t(h_{t+i})$ is the conditional expectation of the error variance i periods ahead. To solve for the latter, and express the forecast in the same way as the one
for the conditional mean, let us turn to the unconditional variance of $\epsilon_t$, which is $E(\epsilon_t \epsilon_t) = \sigma^2$. In terms of $h_t$,

$$ E(\epsilon_t^2) = w + \alpha E(\epsilon_{t-1}^2), \qquad (17.18) $$

from which we get

$$ (1 - \alpha L)E(\epsilon_t^2) = w, \qquad (17.19) $$

which, since $E(\epsilon_t^2) = \sigma^2$, implies that

$$ (1 - \alpha)\sigma^2 = w. \qquad (17.20) $$

Substitute into $h_t$,

$$ h_t = (1 - \alpha)\sigma^2 + \alpha \epsilon_{t-1}^2, \qquad (17.21) $$

to get the relationship between the conditional and the unconditional variances of $y_t$. The expected value of $h_t$ in any period $i$ is

$$ E(h_{t+i}) = \sigma^2 + \alpha E[h_{t+i-1} - \sigma^2]. \qquad (17.22) $$

Repeated substitution leads to the conditional variance k periods ahead,

$$ Var_t(y_{t+k} \mid y_t) = \sigma^2 \sum_{i=0}^{k-1} \gamma^{2i} + \alpha^{k-1}(h_{t+1} - \sigma^2) \sum_{i=0}^{k-1} \gamma^{2i} \alpha^{-i}. \qquad (17.23) $$

The first term on the RHS is the long run unconditional forecast variance of $y_t$. The second term represents the memory in the process, given by the presence of $h_{t+1}$. If $\alpha < 1$ the influence of $(h_{t+1} - \sigma^2)$ will die out in the long run and the second term vanishes. Thus, for long-run forecasts it is only the unconditional forecast variance which is of importance. Under the assumption of $\alpha < 1$ the memory in the ARCH effect dies out. (Below we will relax this assumption, and allow for unit roots in the ARCH process.)

Some Different Types of ARCH and GARCH Models

ARCH models represent a class of models where the variance is changing over time in a systematic way. Let us now define different types of ARCH models. In all these models there is always a mean equation, which must be correctly specified for the ARCH process to be modeled correctly.

1) ARCH(q), the ARCH model of order q,

$$ h_t = \alpha_0 + \sum_{i=1}^{q} \alpha_i \epsilon_{t-i}^2 = \alpha_0 + A(L)\epsilon_t^2. \qquad (17.24) $$

This is the basic ARCH model, to which we now introduce different effects.

2) GARCH(q, p). Generalized ARCH models. If q is large then it is possible to get a more parsimonious representation by adding lagged $h_t$ to the model. This is like using ARMA instead of AR models. A GARCH(q, p) model is
$$ h_t = \alpha_0 + \sum_{i=1}^{q} \alpha_i \epsilon_{t-i}^2 + \sum_{i=1}^{p} \beta_i h_{t-i} = \alpha_0 + A(L)\epsilon_t^2 + B(L)h_t, \qquad (17.25) $$

where $p \geq 0$, $q > 0$, $\alpha_0 > 0$, $\alpha_i \geq 0$, and $\beta_i \geq 0$. The sum of the estimated parameters, $\sum \alpha_i + \sum \beta_i$, shows the memory of the process. A sum equal to unity indicates that shocks to the variance have permanent effects, as in a random walk model. A high sum, but less than unity, indicates a long memory process: it takes a long time before shocks to the variance disappear. If the roots of $[1 - B(L)] = 0$ are outside the unit circle, the process is invertible and

$$ h_t = \alpha_0 [1 - B(L)]^{-1} + A(L)[1 - B(L)]^{-1} \epsilon_t^2 \qquad (17.26) $$
$$ = \alpha_0 \left[1 - \sum_{i=1}^{p} \beta_i \right]^{-1} + \sum_{i=1}^{\infty} \delta_i \epsilon_{t-i}^2 \qquad (17.27) $$
$$ = a + D(L)\epsilon_t^2 \sim ARCH(\infty). \qquad (17.28) $$

If $D(L) < 1$ then GARCH = ARCH. Moreover, if the long run solution of the model, $B(1)$, is less than 1, the $\delta_i$ will decrease for all $i > \max(p, q)$. GARCH models are standard tools, in particular for modeling foreign exchange rate markets and financial market data. Often the GARCH(1,1) is the preferred choice.

GARCH models capture some empirical observations quite well. The distributions of many financial series display fatter tails than the standard normal distribution. GARCH models in combination with the assumption of a normal distribution of the residual can generate such distributions. However, many series, like foreign exchange rates, both display fatter tails and are leptokurtic (the peak of the distribution is higher than that of the normal). A GARCH process combined with the assumption that the errors follow the t-distribution can generate this type of observed data.

Before continuing with different ARCH models, we can now look at an alternative formulation of ARCH models which shows their similarities with ordinary time series models. Define the innovations in the conditional variance as

$$ v_t = \epsilon_t^2 - h_t. \qquad (17.29) $$

The variable $v_t$ can be thought of as surprises in volatility, arising from new, unexpected information on the markets.
The GARCH model is then,

$$ [1 - B(L)](\epsilon_t^2 - v_t) = \alpha_0 + A(L)\epsilon_t^2, \qquad (17.30) $$

which can be written as,

$$ [1 - B(L)]\epsilon_t^2 = \alpha_0 + A(L)\epsilon_t^2 + [1 - B(L)]v_t, \qquad (17.31) $$

and

$$ [1 - B(L) - A(L)]\epsilon_t^2 = \alpha_0 + v_t - B(L)v_t, \qquad (17.32) $$

which is an ARMA process. This shows us that we can identify a GARCH process using the same tools as for an ARMA model, that is, by looking at the autocorrelations and partial autocorrelations of $\hat\epsilon_t^2$, estimated from OLS. Solving for the GARCH(1,1) model,

$$ \epsilon_t^2 = \alpha_0 + (\alpha_1 + \beta_1)\epsilon_{t-1}^2 - \beta_1 v_{t-1} + v_t. \qquad (17.33) $$

If $\alpha_1 + \beta_1 = 1$, or $\sum(\alpha_i + \beta_i) = 1$ in a GARCH(q, p) model, we get what is called an integrated GARCH model.
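The ARMA form of the GARCH(1,1) model can be checked numerically from the definitions $h_t = \alpha_0 + \alpha_1\epsilon_{t-1}^2 + \beta_1 h_{t-1}$ and $v_t = \epsilon_t^2 - h_t$. The sketch below, in standard-library Python, simulates a GARCH(1,1) series and rebuilds $\epsilon_t^2$ from its own lag and the volatility surprises, with the lagged surprise entering with a negative sign. The function names, parameter values, normal innovations and the start at the unconditional variance are assumptions made for illustration only.

```python
import math
import random


def simulate_garch11(n, alpha0, alpha1, beta1, seed=1):
    """Simulate eps_t with h_t = alpha0 + alpha1*eps_{t-1}^2 + beta1*h_{t-1}."""
    rng = random.Random(seed)
    # Assumed start value: the unconditional variance (requires alpha1 + beta1 < 1).
    h_prev = alpha0 / (1.0 - alpha1 - beta1)
    e_prev = math.sqrt(h_prev) * rng.gauss(0.0, 1.0)
    eps, h = [e_prev], [h_prev]
    for _ in range(1, n):
        h_t = alpha0 + alpha1 * e_prev ** 2 + beta1 * h_prev
        e_t = math.sqrt(h_t) * rng.gauss(0.0, 1.0)
        eps.append(e_t)
        h.append(h_t)
        e_prev, h_prev = e_t, h_t
    return eps, h


def arma_reconstruction(eps, h, alpha0, alpha1, beta1):
    """Rebuild eps_t^2 (t >= 1) from the ARMA(1,1) form in the squared errors."""
    v = [e * e - h_t for e, h_t in zip(eps, h)]  # volatility surprises v_t
    return [
        alpha0 + (alpha1 + beta1) * eps[t - 1] ** 2 - beta1 * v[t - 1] + v[t]
        for t in range(1, len(eps))
    ]


a0, a1, b1 = 0.1, 0.1, 0.8
eps, h = simulate_garch11(1000, a0, a1, b1)
rebuilt = arma_reconstruction(eps, h, a0, a1, b1)
max_gap = max(abs(r - e * e) for r, e in zip(rebuilt, eps[1:]))
print("Largest gap between eps_t^2 and its ARMA form:", max_gap)
```

The gap should be zero up to floating point rounding, which is simply the algebra of substituting $h_{t-1} = \epsilon_{t-1}^2 - v_{t-1}$ into the variance equation.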
3) ARCH(q) model with explanatory variables,

$$ h_t = \alpha_0 + A(L)\epsilon_t^2 + \beta x_t, \qquad (17.34) $$

where $x_t$ is a vector of explanatory variables, and $\beta$ a vector of parameters. In this model we have added explanatory variables into the ARCH process, just like we can add exogenous explanatory variables into an ARMA model.

4) M-ARCH. Multivariate ARCH. The multivariate ARCH is basically an extension of the univariate model to a system of equations with time varying variances and covariances, like

$$ H_t = \begin{bmatrix} h_{11,t} & h_{12,t} & \cdots & h_{1n,t} \\ h_{21,t} & h_{22,t} & \cdots & h_{2n,t} \\ \vdots & \vdots & \ddots & \vdots \\ h_{n1,t} & h_{n2,t} & \cdots & h_{nn,t} \end{bmatrix}. $$

The M-ARCH is like a VAR model for a system of variables, only now the system is extended to allow for interaction among the variances as well. Typical applications of multivariate ARCH are CAPM models of asset portfolios.

5) ARCH in mean. It is possible to put the ARCH process back into the conditional mean of the process, and let it represent some variable, for example a time varying risk premium. In this case we get the following system,

$$ y_t = \beta x_t + \delta h_t^{1/2} + \epsilon_t, \qquad h_t = \alpha_0 + A(L)\epsilon_t^2. \qquad (17.35) $$

There exist various ways of putting the variance back into the mean equation. The example above assumes that it is the standard error which is the interesting variable in the mean equation.

6) IGARCH. Integrated GARCH. When the coefficients sum to unity we get a model with extremely long memory (similar to the random walk model). Unlike the cases discussed earlier, the shocks to the variance will not die out. Current information remains important for all future forecasts. We talk about an integrated variance and persistence in variance. A significant constant term in a GARCH process can be understood as mean reversion of the variance. But if the variance is not mean-reverting, integrated GARCH is an alternative, which in a GARCH(1,1) process sets the constant to zero and restricts the two parameters to sum to unity.

7) EGARCH. Exponential GARCH and ARCH models.
(Exponential due to logs of the variables in the GARCH model.) These models have the interesting characteristic that they allow for different reactions to negative and positive shocks, a phenomenon observed on many financial markets. In the output the first lagged residual indicates the effect of a positive shock, while the second lagged residual (in absolute terms) indicates the effect of a negative shock.

8) FIGARCH. Fractionally Integrated GARCH. This approach builds on the idea of fractional integration and allows for a slow hyperbolic rate of decay for the lagged squared innovations in the conditional variance function. See Baillie, Bollerslev and Mikkelsen (1996).

9) NGARCH and NARCH. Non-linear GARCH and ARCH models.

10) Common Volatility. Introduced by Engle and Isle 1989 (and 1993), it allows you to test for common GARCH structure in different series.
11) Other types of GARCH models. In the literature there exist a number of X-GARCH-type models. It is not possible to keep track of all possible twists here, but 1-10 are the relevant approaches.

The Estimation of ARCH models

Let us now turn to the estimation of ARCH and GARCH models. The main problem is the distribution of the error terms; in general they are not normally distributed. The most used alternatives are the t-distribution and the gamma distribution. In applications in finance and foreign exchange rates, a t-distribution is often motivated by the fact that the empirical distributions of these variables display fatter tails than the normal distribution. If we assume that the residuals of the model follow a normal distribution, we have that the conditional distribution of the residuals is normal, or $\epsilon_t \mid \epsilon_{t-1} \sim N(0, h_t)$. Using that assumption, the following likelihood function is estimated,

$$ \log L = -\frac{T}{2}\log 2\pi - \frac{1}{2}\sum_{t=1}^{T}\left[\log h_t + \frac{\epsilon_t^2}{h_t}\right]. \qquad (17.36) $$

Notice that there are two equations involved here, the mean equation and the variance equation. The process is correctly modelled only when both equations are correctly modelled. To estimate ARCH and GARCH processes, non-standard algorithms are generally needed. If $y_{t-i}$ is among the regressors some iterative method is always required. (GAUSS, RATS and SAS provide such facilities.) There are also special programs which deal with ARCH, GARCH and multivariate ARCH.

The research strategy is to begin by testing for ARCH, using standard test procedures. The following LM test for q-th order ARCH is an example,

$$ \hat\epsilon_t^2 = \lambda_1 \hat\epsilon_{t-1}^2 + \lambda_2 \hat\epsilon_{t-2}^2 + \dots + \lambda_q \hat\epsilon_{t-q}^2 + \beta y_t + v_t, \qquad (17.37) $$

where $TR^2 \sim \chi^2(q)$. Notice that this requires that $E(\epsilon_t) = 0$, and $E(\epsilon_t \epsilon_{t-i}) = 0$ for $i \neq 0$. If ARCH is found, or suspected, use standard time series techniques to identify the process. The specification of an ARCH model can be tested by Lagrange multiplier tests, or likelihood ratio tests.
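As an illustration of the $TR^2$ idea, the sketch below computes an LM statistic for first-order ARCH in standard-library Python. It is a simplified variant, not the exact regression in the text: it assumes $q = 1$, regresses the squared residual on a constant and its own first lag only, and uses simulated residuals. The function names are hypothetical. Under the null of no ARCH the statistic is compared with a $\chi^2(1)$ distribution, whose 5% critical value is 3.841.

```python
import math
import random


def simulate_arch1(n, omega, alpha, seed=0):
    """Simulated residuals with h_t = omega + alpha * eps_{t-1}^2 (alpha=0: white noise)."""
    rng = random.Random(seed)
    eps, eps_prev = [], 0.0
    for _ in range(n):
        h_t = omega + alpha * eps_prev ** 2
        eps_prev = math.sqrt(h_t) * rng.gauss(0.0, 1.0)
        eps.append(eps_prev)
    return eps


def arch1_lm_statistic(resid):
    """T * R^2 from regressing eps_t^2 on a constant and eps_{t-1}^2."""
    sq = [e * e for e in resid]
    y, x = sq[1:], sq[:-1]
    n = len(y)
    mean_y, mean_x = sum(y) / n, sum(x) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r_squared = sxy * sxy / (sxx * syy)  # R^2 of a simple regression
    return n * r_squared


CHI2_1_CRITICAL_5PCT = 3.841

for label, alpha in (("white noise", 0.0), ("ARCH(1)", 0.5)):
    stat = arch1_lm_statistic(simulate_arch1(2000, omega=1.0, alpha=alpha, seed=1))
    print(label, "LM statistic:", round(stat, 2),
          "reject no-ARCH at 5%:", stat > CHI2_1_CRITICAL_5PCT)
```

For higher orders q the same statistic comes from a multiple regression on q lags, which needs a linear solver and is left out of this sketch.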
As in time series modeling, the Box-Ljung test on the estimated residuals from an ARCH equation serves as a misspecification test. ARCH type processes are seldom found in low frequency data. High frequency data is generally needed to observe these effects: daily, weekly, sometimes monthly data, but hardly ever quarterly or yearly data.

Finally, remember two things. First, ARCH effects imply thicker tails than the standard normal distribution. It is not obvious that the normal distribution should be used. On the other hand, there is no obvious alternative either. Often the normal distribution is the best approximation, unless there is some other information. One example of other information is that some series are leptokurtic, with a higher peak than the normal, in combination with fat tails. In that case the t-distribution might be an alternative. Thus, using the normal density function is often an approximation. Second, correct inference on ARCH effects builds upon a correct specification of the mean equation. Misspecification tests of the mean equation are therefore necessary.
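To show what an iterative ML routine evaluates at each step, here is a minimal sketch of the Gaussian log-likelihood, $-\frac{1}{2}\sum_t(\log 2\pi + \log h_t + \epsilon_t^2/h_t)$, for a GARCH(1,1) variance equation in standard-library Python. It assumes the mean-equation residuals are already given, starts the variance recursion at the sample variance, and rejects parameter values outside $\omega > 0$, $\alpha \geq 0$, $\beta \geq 0$, $\alpha + \beta < 1$. These choices and the function names are assumptions of the sketch, not a description of what any particular package does.

```python
import math


def garch11_variances(resid, omega, alpha, beta):
    """Conditional variances h_t = omega + alpha*eps_{t-1}^2 + beta*h_{t-1}."""
    n = len(resid)
    h = [sum(e * e for e in resid) / n]  # assumed start value: sample variance
    for t in range(1, n):
        h.append(omega + alpha * resid[t - 1] ** 2 + beta * h[t - 1])
    return h


def gaussian_loglik(resid, omega, alpha, beta):
    """Gaussian log-likelihood of the residuals under a GARCH(1,1) variance."""
    # Positivity and stationarity restrictions (an assumption of this sketch).
    if omega <= 0.0 or alpha < 0.0 or beta < 0.0 or alpha + beta >= 1.0:
        return -math.inf
    h = garch11_variances(resid, omega, alpha, beta)
    return -0.5 * sum(
        math.log(2.0 * math.pi) + math.log(h_t) + e * e / h_t
        for e, h_t in zip(resid, h)
    )


# Hypothetical residuals from some mean equation, for illustration only.
resid = [0.5, -1.0, 0.25, 2.0, -0.75]
print("constant variance:", gaussian_loglik(resid, 1.175, 0.0, 0.0))
print("GARCH(1,1) guess: ", gaussian_loglik(resid, 0.2, 0.1, 0.7))
```

An optimizer would search over $(\omega, \alpha, \beta)$, and in practice over the mean-equation parameters at the same time, to maximize this function; replacing the normal density with a t density changes only the term inside the sum.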
18. ECONOMETRICS AND RATIONAL EXPECTATIONS

The presence of expectations has consequences for econometric model building. In particular, rational expectations have extremely important consequences. The most pessimistic views following from rational expectations reduce econometric modeling to simple data description, with little or no room for increasing our understanding of the behavior of economic agents.

Muth's (1961) original definition of rational expectations goes very far. It assumes that agents know the true data generating process (DGP) of the complete system. This is in contrast to the econometrician, who must estimate what he/she thinks is the DGP. The econometrician must also test for significant changes in his/her model before he/she can find out whether the process has changed. In econometrics we can only deal with a limited aspect of rational expectations, namely expectations formed conditionally on past (observed) history. In contrast to using econometrics, in the world of Muth and other rational expectation theorists, agents are free to form the best expectation at any time without estimating, or making inference from, historical data.

We can describe the econometric approach to rational expectations as follows. Let $x_t^e$ be the expected future value of the variable $x_t$ held by the agent(s) at time t. The expectation held at time t is $x_t^e = E(x_{t+1} \mid I_t)$, where $I_t$ is the information set containing the historical data used to form the expectation. Under rational expectations, by definition, the information set contains all relevant information for determining the expectation, so that the expected difference between the actual outcome of $x_{t+1}$ and its expectation ($x_t^e$) is zero, $E[(x_{t+1} - x_t^e) \mid I_t] = 0$. This is a weak condition. It allows expectations to be erroneous in individual periods, but requires that they are correct on average.
Thus, in applied work the difference between the outcome and the expectation should be a martingale difference process. Assuming that the difference is also a white noise innovation process is generally stronger than necessary. If the ordinary econometric model, not based on expectations, is formulated as $y_t = \beta x_t + e_t$, the assumption of rational expectations leads to the following model, $y_t = \beta E\{x_{t+1} \mid I_t\} + e_t$, or

$$ y_t = \beta x_t^e + e_t, \qquad (18.1) $$

where $x_t^e$ is the expected value of the variable $x_t$ (footnote 1).

Footnote 1: This is a generic example where $x_t^e$ can be any variable, including $y_t^e$.

Rational vs. other Types of Expectations

In earlier literature some researchers used to model other types of expectations than rational expectations, like myopic or static expectations. These alternatives are generally ad hoc, and not based on any reasonable assumptions about the
behavior of economic agents. Expectations other than rational imply that agents might ignore information that would raise their utility. With anything other than rational expectations, agents are allowed to make systematic mistakes, implying that they ignore profit opportunities or that they are, for some unexplained reason, not maximizing their utility. Economic science has yet to identify such behavior in the real world. Rational expectations becomes an equilibrium condition in the sense that the difference between predictions and outcomes cannot be predicted. A model which allows for predictable differences between the expectations and the outcome is not complete without an economic explanation of what the difference means, and why it occurs.

The correct way to approach the modeling of expectations is to assume that agents form expectations so that they do not make systematic mistakes that reduce their welfare. Information used to predict the future will be collected and processed up to the point where the cost of gathering more information balances the revenue from additional information. Based on this type of behavior it might, as a special case, be optimal to use, say, today's value of a variable to predict all future values of that variable. But these are exceptions to the rule.

In general there is a catch-22 situation in the modeling of rational expectations behavior. If the econometrician finds from ex post data that the agents are making systematic mistakes, this is no evidence against the rational expectations hypothesis. Instead, the empirical finding might be the result of conditioning on the wrong information set. Alternatively, the modeling of the expectation might be correct, and be an unbiased and efficient estimate of the expectation held only at a certain point in time.
This argument also includes situations where there is a small probability of an event with large consequences, such as devaluations, unpredicted changes in the monetary regime, wars, natural disasters etc. To examine these situations generally requires further testing of the model, where the outcome will depend to a large extent on assumptions regarding the distributions of the processes, whether they are linear or non-linear, etc.

The discussion about other types of expectations brings us to the concepts of forward looking vs. backward looking behavior. The difference can be explained as follows. Consumption based on forward looking behavior is determined on the basis of expected future income. Consumption based on actual (existing) income is backward looking. In practice there might not be a big difference; your present or recent income might be a good approximation to your future income. In some cases rational expectations might be to base decisions on contingent rules, and revise these rules only when the cost of deviating from the optimal/desired consumption is too big (or when the alternative cost of being outside equilibrium is too high).

Typical Errors in the Modeling of Expectations

Without given values of the expected value, there are two types of common mistakes in econometric models of expectations driven stochastic processes. The first mistake is to substitute $x_t^e$ with the observed value $x_t$. This leads to an errors-in-variables problem, since $x_t = x_t^e + v_t$, where $E(v_t) = 0$. The errors-in-variables problem implies that $\beta$ will not be estimated correctly. OLS is inconsistent for the estimation of the original $\beta$ parameter.

The second mistake is to model the process for $x_t$ and substitute this process into 18.1. Assume that the variable $x_t$ follows an AR(2) process, like $x_t = a_1 x_{t-1} + a_2 x_{t-2} + n_t$, where $n_t \sim NID(0, \sigma^2)$. Estimation of equation 18.1 leads to,

$$ y_t = \beta a_1 x_{t-1} + \beta a_2 x_{t-2} + e_t = \pi_1 x_{t-1} + \pi_2 x_{t-2} + e_t. \qquad (18.2) $$

This estimated model also gives the wrong results, if we are interested in estimating the (deep) behavioral parameter $\beta$. The variables $x_{t-1}$ and $x_{t-2}$ are not weakly exogenous for the parameter of interest ($\beta$) in this case. The estimated parameters will be a mixture of the deep behavioral parameter and the parameters of the expectations generating process ($a_1$ and $a_2$). Not only are the estimates biased, but policy conclusions based on this estimated model will also be misleading. If the parameters of the marginal model ($a_1$ and $a_2$) describe some policy reaction function, say a particular type of money supply rule, changing this rule, i.e. changing $a_1$ and $a_2$, will also change $\pi_1$ and $\pi_2$. This is a typical example of when super exogeneity does not hold, and when an estimated model cannot be used to form policy recommendations.

What is the solution to this dilemma of estimating deep behavioral parameters, in order to understand the working of the economy better?

1. One conclusion is that econometrics will not work. The problems of correctly specifying the expectation process, in combination with short samples, make it impossible to use econometrics to estimate deep parameters. A better alternative is to construct micro-based theoretical models and simulate these models. (As an example, use calibration techniques.)

2. Sims's solution was to advocate VAR models, and avoid estimating deep parameters. VAR models can then be used to increase our understanding of the economy, and be used to simulate the consequences of unpredictable events, like monetary or fiscal policy shocks, in order to optimize policy.

3. Though the rational expectations critique (Lucas, Sims and others) seems to be devastating for structural econometric modeling, the critique has yet to be proven.
In surprisingly many situations, policy changes appear to have small effects on estimated equations, e.g. the effects of the switch in monetary policy in the UK in the early 1980s.

4. Finally, the assumption of rational expectations provides a priori information that can be used to formulate an econometric model from the beginning.

There are, in principle, three ways in which one can approach this problem: i) substitution, ii) system estimation based on the Full Information Maximum Likelihood (FIML) estimator, or iii) the Generalized Method of Moments (GMM) estimator. Substitution means replacing the expected explanatory variable with an expectation. This expectation could either be a survey expectation or an expectation generated by a forecasting model, e.g. an ARIMA model. The FIML method can be said to build the econometric forecast into an estimated system. The GMM estimator builds on the assumption that the explanatory variable and the residuals are orthogonal to each other. Since rational expectations implies that the (rationally expected) explanatory variables are orthogonal to the residuals, the GMM estimator is well suited for rational expectations models. Because of this it is the preferred choice when it comes to estimating rational expectations models, especially in finance applications.
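The errors-in-variables problem and the orthogonality argument can be illustrated with a small simulation in standard-library Python. The setup is an assumption made for this sketch: $x_t$ follows an AR(1) with coefficient $a$, the rational expectation is $x_t^e = E(x_{t+1} \mid I_t) = a x_t$, and $y_t = \beta x_t^e + e_t$. Regressing $y_t$ on the realized $x_{t+1}$ by OLS is compared with a simple instrumental-variables estimator that uses $x_t$ as instrument, which is the just-identified special case of the method-of-moments idea. All function names are hypothetical.

```python
import random


def simulate(n, beta, a, seed=0):
    """x follows an AR(1); y_t = beta * E(x_{t+1} | I_t) + e_t = beta*a*x_t + e_t."""
    rng = random.Random(seed)
    x = [0.0]
    for _ in range(n + 1):
        x.append(a * x[-1] + rng.gauss(0.0, 1.0))
    # y_t depends on the expectation of next period's x, not on its realization.
    y = [beta * a * x[t] + rng.gauss(0.0, 1.0) for t in range(n + 1)]
    return x, y


def ols_on_realized(x, y):
    """OLS (no intercept) of y_t on the realized x_{t+1}: errors-in-variables."""
    num = sum(y[t] * x[t + 1] for t in range(len(y)))
    den = sum(x[t + 1] ** 2 for t in range(len(y)))
    return num / den


def iv_with_lagged_x(x, y):
    """IV estimator using x_t, which is in the information set, as instrument."""
    num = sum(x[t] * y[t] for t in range(len(y)))
    den = sum(x[t] * x[t + 1] for t in range(len(y)))
    return num / den


beta_true, a = 2.0, 0.7
x, y = simulate(20000, beta_true, a)
print("true beta:               ", beta_true)
print("OLS on realized x_{t+1}: ", round(ols_on_realized(x, y), 3))
print("IV with x_t as instrument:", round(iv_with_lagged_x(x, y), 3))
```

In this setup the OLS estimate converges to $\beta a^2$ rather than $\beta$, because the forecast error in $x_{t+1}$ is part of the regressor, while the instrument $x_t$ is orthogonal to that forecast error and recovers $\beta$.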
Modeling Rational Expectations

(This section is very incomplete, see overheads.)

The substitution approach is perhaps the easiest way of modeling rational expectations. The approach is to find an estimate of $E\{x_t \mid I_t\}$. The simplest approach is to let the information set contain only historical values of $x_t$. As an example, suppose that $x_t$ is an AR(1) process, so $x_t = \alpha_1 x_{t-1} + v_t$, where $v_t$ is $NID(0, \sigma^2)$. The estimated process gives the estimates $\hat{x}_t$ that can be substituted into the equation for $y_t$. The outcome of the substitution is

$$y_t = \beta \hat{x}_t + u_t, \quad \text{where} \quad u_t = e_t - \beta(\hat{x}_t - x^e_t) = e_t - \beta(v_t - \hat{v}_t). \tag{18.3}$$

OLS will lead to an unbiased estimate of $\beta$, because $\hat{x}_t$ is weakly exogenous with respect to $\beta$.

FIML estimation builds on substituting $x^e_t$ with the actual value $x_t$ and estimating this equation simultaneously with the marginal model for $x_t$, say the AR(1) model assumed in the substitution example above.

GMM and Instrumental Variables techniques start with substitution of the expected value ($x^e_t$) with the actual observation ($x_t$), and then approach the errors-in-variables problem. The key to the solution lies in the assumption that the difference between the expectation and the actual outcome is orthogonal to the information set used, which is the basic assumption for the method of moments estimator. The variables in the marginal process and the possible exogenous variables in the conditional model can then be used as instruments in the estimation of $\beta$.

Testing Rational Expectations

(To be completed.)

Tests concerning given values of $x^e_t$

Given some values of the expectation process, there are three types of tests that can be performed.

1. Test if the difference between the expectation and the outcome is a martingale difference process, conditional on assumptions regarding risk premiums.

2. Test for news. Under the assumption of rational expectations, the expectations-driven variable should react only to unpredictable events (news), not to events that can be predicted.
These assumptions are directly testable as soon as we have a forecasting model for $x^e_t$.

3. Variance bounds tests. Again, given $x^e_t$, it follows that the variance of $y_t$ in equation 18.1 must be higher than the variance of $x^e_t$.

Encompassing tests

If a model that takes account of assumed rational expectations behavior is the correct model, it follows that this model should encompass other models which lack this feature. Thus, encompassing tests can be used to discriminate between models based on rational expectations and other models.

Tests of super exogeneity
It follows from the rational expectations assumption that the parameters of the conditional model will change whenever the parameters of the marginal model change. First, if it can be established that the conditional model is stable while the marginal model changes, this would be evidence against the rational expectations assumption, at least in the form of forward looking behavior. In the same way, it is possible to test for joint changes/shifts in the marginal and conditional models.

Is rational expectations important?

The answer is that it depends on your problem. If you really want to estimate a stochastic phenomenon derived from theory, especially in finance, it is important to take rational expectations into account. It has to be at least weakly rational expectations, because nobody has found any solid evidence against weak rational expectations. However, if you want to forecast or do standard structural modelling, you can test for super exogeneity, and thereby also for rational expectations. Ericsson and Hendry (1989), Ericsson and Irons (1995), and Ericsson and Hendry (1997) do this for almost all instances of radical economic policy changes and find no evidence of the structural breaks in the econometric models predicted by the rational expectations theory. Thus, in practice it is not a big problem unless you want it to be a big problem.
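The dependence of the conditional model on the marginal model that runs through this chapter can be illustrated with a small simulation. The sketch below assumes, for illustration only, a single-lag expectations process $x^e_t = \alpha x_{t-1}$, so that the reduced form coefficient is $\pi = \beta\alpha$; the two "policy regimes" (`alpha = 0.9` and `alpha = 0.3`), the value of `beta`, and the function name `estimate_pi` are hypothetical choices, not something taken from the text.

```python
import random


def estimate_pi(T, alpha, beta, seed):
    """Simulate one policy regime and return the OLS slope of y_t on x_{t-1}.

    Marginal model:    x_t = alpha * x_{t-1} + v_t
    Conditional model: y_t = beta * E(x_t | x_{t-1}) + e_t = beta * alpha * x_{t-1} + e_t
    """
    rng = random.Random(seed)
    x_prev, s_xy, s_xx = 0.0, 0.0, 0.0
    for _ in range(T):
        y = beta * alpha * x_prev + rng.gauss(0.0, 1.0)
        x = alpha * x_prev + rng.gauss(0.0, 1.0)
        s_xy += y * x_prev
        s_xx += x_prev * x_prev
        x_prev = x
    return s_xy / s_xx


beta = 0.8  # the deep parameter is the same in both regimes
pi_regime_1 = estimate_pi(T=4000, alpha=0.9, beta=beta, seed=11)
pi_regime_2 = estimate_pi(T=4000, alpha=0.3, beta=beta, seed=12)

# Under rational expectations the estimated coefficient follows the policy rule,
# even though beta itself has not changed.
print(round(pi_regime_1, 3), round(pi_regime_2, 3))
```

In this simulated world the reduced form coefficient shifts when the marginal process shifts, so super exogeneity fails by construction. Finding a stable conditional model alongside an unstable marginal model in real data is what the tests discussed above would treat as evidence against this mechanism.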
19. A RESEARCH STRATEGY

This section describes a research strategy for finding a well-defined statistical model of the DGP, which also has an economic interpretation.

I. Start from theory! Economic theory gives the parameters of interest and the relevant variables for estimating these parameters. Furthermore, theory suggests interesting long-run equilibria, homogeneity conditions etc. It is important to remember that theories are constructions of the human mind. The available data, on the other hand, is the real world. But there might not be a one-to-one mapping between the variables of the real world and theory, no matter how good the theory might be. Aggregation over time and individual units, adjustment costs, measurement errors etc. will affect the estimated model.

II. Determine the order of integration and type of non-stationarity among the variables. Are some or all variables non-stationary? What type of non-stationarity? The null should be integrated of order one, unless there is sufficient evidence to reject this hypothesis. Once you know the order of integration you know how to organise variables into meaningful statistical relations. You can test for cointegration, or co-trending, and with this knowledge formulate stationary relations where standard inference is possible, and where you can separate long-run relations (or alternatively permanent shocks) and short-term relations. The golden rule is that if a variable looks like I(1), treat it like an I(1) variable unless you have clear evidence to reject that hypothesis.

III. Build a VAR and test for cointegration among integrated variables. Cointegration tests aim at identifying long-run stable (stationary), economically interesting relationships among the variables. This can be done 1) in the form of testing specific relations such as PPP, consumption function, money demand etc.
2) In the case of building and modeling systems, it can be done in a "complete system" or by dividing your problem into separate variables such as domestic inflation, money demand, economic growth etc. Remember the (asymptotic) property of co-integrating relations, that once you find them they continue to exist even if you add more variables to the model.

This requires building a VAR and testing for cointegration. The VAR will be the point of departure for formulating a reduced form VECM and then a structural VECM, or single equation structural equations. The critical step is to find a suitable order of the VAR (number of lags). The principle is to work from general to specific models, and search for parsimonious models. For cointegration tests a lag order of 2 is the minimum and often optimal. Sometimes identifying extreme outliers and including impulse and step dummies will help to cure both non-normality and autocorrelation in all equations.

If it is not possible to get rid of autocorrelation with a small number of lags (perhaps in combination with dummies and seasonals), the alternative is to focus on second best. Autocorrelation in these equations is very bad for modelling, but it might not be possible to achieve both no autocorrelation and a parsimonious model with sufficient degrees of freedom for inference. In that situation, the relevant question is how much of the variation in the left-hand-side variables needs to be modeled to get a near-well-identified statistical model.
The second best in VAR modelling is to get rid of autocorrelation in as many equations as possible, hopefully with the result that the vector test of no error autocorrelation is not rejected. In this case, study the F-test for the significance of each lag across the equations in the model. Look at the LR test for comparing lag orders in the VAR and, most importantly, choose the model with the smallest information criteria and the smallest residual autocorrelation.^1 When you test the lag structure, also look at the I(1) test for cointegration and study the estimated $\Pi$ matrix for possible economically interesting co-integrating vectors, $\beta' x_{t-1}$. Quite often you will see a stable vector emerging, quite independently of the lag order and of autocorrelation in some residuals.

Once the co-integrating rank is determined, it remains to identify the estimated co-integrating vectors. If there is only one vector this is relatively simple. If there is more than one vector, the vectors should fulfill the rank condition for identification of co-integrating vectors. This is explained in the work of Juselius and Johansen, and in more advanced textbooks on econometric time series. The golden rule is that the vectors should be unique (look different from each other) and should, through the alpha values, determine a left-hand variable. This is achieved by first choosing a suitable normalization, then imposing unit elasticities and/or equal values with opposite signs, and by restricting some parameters to be zero in some vectors. (Remember that the size of the $\beta$ coefficients is not related to their significance.)

If co-integration is not found:

Rethink the problem. Have you forgotten some important explanatory variable?

Look for outliers and test their effects. Use dummies, trends etc. if they can be motivated.

Look for structural breaks and consider the sample size.
Use first (and/or second) differences instead, to get a model with only stationary I(0) variables that leads to estimated parameters with well defined distributions. You have to conclude that your model might not be good for long-run analysis.

Continue with the modeling process to get the least bad of all possible models, at least. If possible, show that there may be strong a priori information that justifies the model. Add that cointegration is only an asymptotic result, and that your sample is too short.

Consider stopping the modelling, and conclude that the absence of cointegration is an interesting conclusion in itself (data problem, wrong theory, missing explanatory factors etc.). Do not waste too much time on a problem where the answers will be dependent on ad hoc assumptions concerning distributions, or on unstable results which will be totally model dependent.

If you find cointegration:

Continue by testing for long-run homogeneity assumptions, weak exogeneity, and identification. This can be done by using Johansen's multivariate co-integrating technique. If there is more than one vector, think about identification of the vectors.

IV. Decide on single or simultaneous model. There are no good tests for weak exogeneity. Typically a good test of simultaneity requires the specification of the complete model to work. And then the work is already done.

^1 In PcGive 12 you need to indicate in the "Option" window under Model choice that you want information criteria for each model. Then, when you press "Progress", you will see both the F-test for lag order and the information criteria for the different VAR models you estimated.
If you reduce to a single equation (or very limited systems), can you motivate the weak exogeneity assumptions? The reduced form VECM gives you ideas about what a system might look like, and not look like, through the estimated (significant) alpha values. It is possible to test for predictability in the VECM by looking at the estimated alpha values, and to argue for reductions of the system. Of course, from the reduced form VECM the logical step is to construct a simultaneous structural model based on testing the order and the rank condition in the model. However, this can be a bit of a challenge, especially if you are short of time. Furthermore, identification must be done on significant parameters (including lags), not on the underlying theoretical lag structure.

V. Set up the Error Correction Representation. In the following we assume that you have chosen to continue with a single equation.

Use the results from Johansen's multivariate cointegration technique, then formulate an ECM model directly.

Test for cointegration in the ADL representation of the model (PcGive test). It is necessary to choose lag lengths long enough to get white noise residuals.

Test if the residuals are $NID(0, \sigma^2)$, and add a RESET test if possible. Having white noise innovation error terms is a necessary condition.

If the error terms are not white noise innovations: Add more lags. Did you forget something important? Study outliers. Use dummies and trends to get white noise, but remember that they should be motivated. Or continue to the least bad of all possible models, see above. Rethink the problem or stop. RESET test! (Perhaps you should try to condition on some other variable instead?)

When white noise is established: Is the equation in line with what you think can be an economically meaningful long-run equilibrium? Check the signs and sizes of the parameters.

VI. Reduce the model. Remove insignificant variables (t-values below 1.0 to begin with). Start at low lags. Go from general to specific.
Check misspecification/specification during reductions. Run the test summary after each reduction. In PcGive all reductions are saved under Progress. It is all about "Data Mining", but done efficiently, building on the empirical approach for ARIMA models introduced by Box-Jenkins and on new developments in statistical theory. Modern mathematical statistical theory explains how you can go about finding a Data Generating Process by reversing the sampling process in classical statistics. Textbooks: Spanos, Mittelhammer.
In the reduction process remember the following identities,

$$\Delta = 1 - L \quad \text{and} \quad 1 = \Delta + L.$$

So if you have, as an example, $\beta_1 x_{t-1} - \beta_2 x_{t-2}$, where $\beta_1 \approx \beta_2$ (or there is no significant difference), with different signs on the lags, this can also be written as

$$\beta_1 x_{t-1} - \beta_2 x_{t-2} = \beta_1 \Delta x_{t-1} + (\beta_1 - \beta_2) x_{t-2},$$

and if $\beta_1 = \beta_2$ the expression reduces to $\beta_1 \Delta x_{t-1}$. Hence, you save one degree of freedom under these conditions.

VII. Test the stability of the model. Use the recursive estimation method in PcGive. Remember that this is also useful during the identification of cointegrating vectors. For instance, it will allow you to see if you need to put (restricted) impulse dummies in the co-integrating vector.

VIII. Test for rival models. Encompassing tests. Does your model explain the results, and the failure, of other rival models? Encompassing tests imply a comparison of the goodness of fit between different models, based on different explanatory variables. The reduction process might lead to several models with white noise residuals. To discriminate between these models they have to be tested against each other.

IX. Test for super exogeneity (rational expectations), if you want. Establish the stability of the conditional model without using ad hoc trends or dummies (= criteria for stability). Test for instability in the marginal model. If it is unstable while the conditional model is stable, you have super exogeneity. If the marginal model is unstable you can go one step further by forcing the marginal model to be stable, by imposing trends and dummies in such a way that it becomes stable. Then put these trends and dummies into the conditional model and test if they are significant there. If not, you have super exogeneity, and can reject parts of the assumptions of the rational expectations theory.

X. STOP when you find a model that is consistent with the data chosen, and where the parameters make economic sense. In other words "a well-defined statistical model".
That is a model with white noise innovation residuals and stable parameters, which also encompasses all other rival models. Encompassing means that your model explains other models and picks up more of the variation in the dependent variables, and that it has an economic meaning.

XI. Report your results, both parameters and misspecification tests. It is not sufficient to report only $R^2$ and DW-values. Show the (corresponding) test summary and graphs of your data, in levels and first differences, and of the error terms, etc. Be open minded and inform the reader of the tests and the problems you have found. Don't try to prove things which one can easily reject by a simple test. The rule is to minimize the number of assumptions behind your model, and remember that the errors are the outcome of the formulation of the model.
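The reparameterisation used in the reduction step (VI), where two lagged levels with similar coefficients of opposite sign are replaced by one lagged difference, can be checked numerically. The sketch below is only an illustration: the series `x`, the coefficient values, and the function names `levels_form` and `difference_form` are made up for the example.

```python
def levels_form(x, t, beta1, beta2):
    """beta1 * x_{t-1} - beta2 * x_{t-2}, the term as it appears before reduction."""
    return beta1 * x[t - 1] - beta2 * x[t - 2]


def difference_form(x, t, beta1, beta2):
    """Same term rewritten as beta1 * (x_{t-1} - x_{t-2}) + (beta1 - beta2) * x_{t-2}."""
    d_x_lag = x[t - 1] - x[t - 2]
    return beta1 * d_x_lag + (beta1 - beta2) * x[t - 2]


x = [1.0, 1.3, 0.9, 1.7, 2.1, 1.8]   # hypothetical observations
beta1, beta2 = 0.45, 0.40            # similar size, opposite sign on the two lags

# The two forms are identical for any coefficient values.
gaps = [abs(levels_form(x, t, beta1, beta2) - difference_form(x, t, beta1, beta2))
        for t in range(2, len(x))]
print(max(gaps))

# If beta1 == beta2 the level term drops out and only the lagged difference remains.
only_difference = [0.45 * (x[t - 1] - x[t - 2]) for t in range(2, len(x))]
restricted = [levels_form(x, t, 0.45, 0.45) for t in range(2, len(x))]
print(max(abs(a - b) for a, b in zip(only_difference, restricted)))
```

In an actual reduction the restriction $\beta_1 = \beta_2$ would of course be tested rather than imposed, for instance by checking the significance of the remaining level term $x_{t-2}$.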
20. REFERENCES

Anderson, T.W. (1971) The Statistical Analysis of Time Series, John Wiley & Sons, New York.
Anderson, T.W. (1984) An Introduction to Multivariate Statistical Analysis, John Wiley & Sons, New York.
Banerjee, A., J. Dolado, J.W. Galbraith and D.F. Hendry (1993) Cointegration, Error-Correction and the Econometric Analysis of Non-stationary Data, Oxford University Press, Oxford.
Baillie, Richard T. and Tim Bollerslev (1994) The Long Memory of the Forward Premium, Journal of International Money and Finance 13 (5).
Baillie, Richard T., Tim Bollerslev and Hans Ole Mikkelsen (1996) Fractionally Integrated Generalized Autoregressive Conditional Heteroskedasticity, Journal of Econometrics 74.
Banerjee, A., R.L. Lumsdaine and J.H. Stock (1992) Recursive and Sequential Tests of the Unit Root and Trend Break Hypothesis: Theory and International Evidence, Journal of Business and Economic Statistics.
Cheung, Y. and K. Lai (1993) Finite Sample Sizes of Johansen's Likelihood Ratio Tests for Cointegration, Oxford Bulletin of Economics and Statistics 55.
Cheung, Y. and K. Lai (1995) A Search for Long Memory in International Stock Market Returns, Journal of International Money and Finance 14 (4).
Davidson, James (1994) Stochastic Limit Theory, Oxford University Press, Oxford.
Dickey, D. and W.A. Fuller (1979) Distribution of the Estimators for Autoregressive Time Series with a Unit Root, Journal of the American Statistical Association 74.
Diebold, F.X. and G.D. Rudebusch (1989) Long Memory and Persistence in Aggregate Output, Journal of Monetary Economics 24 (September).
Eatwell, J., M. Milgate and P. Newman, eds. (1990) Econometrics, Macmillan, London.
Eatwell, J., M. Milgate and P. Newman, eds. (1990) Time Series and Statistics, Macmillan, London.
Engle, Robert F., ed. (1995) ARCH: Selected Readings, Oxford University Press, Oxford.
Engle, R.F. and C.W.J. Granger, eds. (1991) Long-Run Economic Relationships. Readings in Cointegration, Oxford University Press, Oxford.
Engle, R.F. and B.S. Yoo (1991) Cointegrated Economic Time Series: An Overview with New Results, in R.F. Engle and C.W.J. Granger, eds., Long-Run Economic Relationships. Readings in Cointegration, Oxford University Press, Oxford.
Ericsson, Neil R. and John S. Irons (1994) Testing Exogeneity, Oxford University Press, Oxford.
Fuller, Wayne A. (1996) Introduction to Statistical Time Series, John Wiley & Sons, New York.
Freund, J.E. (1972) Mathematical Statistics, 2nd ed., Prentice/Hall, London.
Granger, C.W.J. and P. Newbold (1986) Forecasting Economic Time Series, Academic Press, San Diego.
Granger, C.W.J. and T. Lee (1989) Multicointegration, Advances in Econometrics 8.
Hamilton, James D. (1994) Time Series Analysis, Princeton University Press, Princeton, New Jersey.
Hargreaves, Colin P., ed. (1994) Nonstationary Time Series Analysis and Cointegration, Oxford University Press, Oxford.
Harvey, A. (1990) The Econometric Analysis of Time Series, Philip Allan, New York.
Hendry, David F. (1995) Dynamic Econometrics, Oxford University Press, Oxford.
Hylleberg, Svend (1992) Modelling Seasonality, Oxford University Press, Oxford.
Johansen, Søren (1995) Likelihood-Based Inference in Cointegrated Vector Autoregressive Models, Oxford University Press, Oxford.
Johnston, J. (1984) Econometric Methods, McGraw-Hill, Singapore.
Kwiatkowski, D., P.C.B. Phillips, P. Schmidt and Y. Shin (1992) Testing the Null Hypothesis of Stationarity Against the Alternative of a Unit Root, Journal of Econometrics 54.
Lo, Andrew W. (1991) Long-Term Memory in Stock Market Prices, Econometrica 59 (5: September).
Maddala, G.S. (1988) Introduction to Econometrics, Macmillan, New York.
Morrison, D.F. (1967) Multivariate Statistical Methods, McGraw-Hill, New York.
Pagan, A.R. and M.R. Wickens (1989) Econometrics: A Survey, Economic Journal.
Park, J.Y. (1990) Testing for Unit Roots and Cointegration by Variable Addition, in T.B. Fomby and G.F. Rhodes, eds., Co-integration, Spurious Regressions, and Unit Roots: Advances in Econometrics 8, JAI Press, New York.
Perron, Pierre (1989) The Great Crash, the Oil Price Shock and the Unit Root Hypothesis, Econometrica 57.
Phillips, P.C.B. (1988) Reflections on Econometric Methodology, The Economic Record, Symposium on Econometric Methodology, December.
Sjöö, Boo (2000) Testing for Unit Roots and Cointegration, memo.
Sowell, F.B. (1992) Modeling Long-Memory Behavior with the Fractional ARMA Model, Journal of Monetary Economics 29 (April).
Spanos, A.
(1986) Statistical Foundations of Econometric Modelling, Cambridge University Press, Cambridge.
Wei, William W.S. (1990) Time Series Analysis. Univariate and Multivariate Methods, Addison-Wesley Publishing Company, Redwood City.

APPENDIX 1

A1 Smoothing Time Series

Lag Windows.

In the discussion about non-stationarity, different ways of removing the trend in a time series were shown. If the trend is removed from, say, GDP we are left with swings in the data that can be identified as business cycles. In time series analysis such cycles are referred to as low frequency or periodic components. Applications of smoothing filters arise in empirical studies of real business cycles, and in modelling financial variables such as daily interest rates, where for example news about inflation and other variables occurs only at monthly intervals and might
cause monthly cycles in the data.^1 Smoothing methods are, of course, closely related to spectral analysis. In this appendix we concentrate on two filters, or lag windows, which represent the best, or most commonly used, methods for time series in the time domain.

Start from a time series, $r_t$. What we are looking for is some weights $b_i$ such that the filtered series $x_t$ is free of low frequency components,

$$x_t = \sum_{i=-k}^{+k} b_i r_{t+i}. \tag{20.1}$$

In this formula the window is applied both backwards and forwards, implying a combination of backward and forward looking behavior. Whether this is a good or a bad thing depends totally on the series at hand, and is left to the judgment of the econometrician. The alternative is to let the window end at time $i = 0$.

The literature is filled with methods of calculating the weights $b_i$. In this appendix we will look at the two most commonly used methods: the Parzen window and the Tukey-Hanning window. The Parzen window is calculated using the following weights,

$$w_i = \begin{cases} 1 - 6(i/k)^2 + 6(|i|/k)^3, & |i| \leq k/2, \\ 2(1 - |i|/k)^3, & k/2 \leq |i| \leq k, \\ 0, & |i| > k, \end{cases}$$

where $k$ is the size of the lag window. The Parzen window tries to fit a third-degree polynomial to the original series. An alternative is the so-called Tukey-Hanning window, calculated as

$$w_i = \begin{cases} \frac{1}{2}\left[1 + \cos(\pi i/k)\right], & |i| \leq k, \\ 0, & |i| > k. \end{cases}$$

As with the Parzen window, the weights need to be normalized.

Under optimal conditions, that is with correct identification of the underlying cycles, the difference between $x_t$ and $r_t$ will appear to be normally distributed. The problem is to determine the bandwidth, the size of the window, or $k$ in the formula above. Unfortunately there is no easy way to determine this in practice. Choosing the size of the lag window involves a trade-off between bias in the mean and the variance of the smoothed series. The larger the window, the smaller the variance but the higher the bias.
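The two lag windows above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`parzen`, `tukey_hanning`, `normalised_weights`, `smooth`), the test series `r`, and the bandwidth `k = 5` are all hypothetical, and normalisation is taken to mean that the weights are scaled to sum to one, giving the $b_i$ in (20.1).

```python
import math


def parzen(i, k):
    """Parzen window weight w_i for lag i and window size k."""
    u = abs(i) / k
    if u <= 0.5:
        return 1.0 - 6.0 * u ** 2 + 6.0 * u ** 3
    if u <= 1.0:
        return 2.0 * (1.0 - u) ** 3
    return 0.0


def tukey_hanning(i, k):
    """Tukey-Hanning window weight w_i for lag i and window size k."""
    if abs(i) <= k:
        return 0.5 * (1.0 + math.cos(math.pi * i / k))
    return 0.0


def normalised_weights(window, k):
    """Weights b_i for i = -k, ..., +k, scaled so that they sum to one."""
    raw = [window(i, k) for i in range(-k, k + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def smooth(r, k, window=parzen):
    """Two-sided filter x_t = sum_{i=-k}^{k} b_i r_{t+i}, for the t where it is defined."""
    b = normalised_weights(window, k)
    return [sum(b[j] * r[t - k + j] for j in range(2 * k + 1))
            for t in range(k, len(r) - k)]


# Hypothetical series: a 12-period cycle around a linear trend.
r = [math.sin(2 * math.pi * t / 12) + 0.05 * t for t in range(60)]
x = smooth(r, k=5)
print(len(r), len(x))
```

Because the filter is two-sided, $k$ observations are lost at each end of the sample. A one-sided version, where the window ends at $i = 0$, would only lose observations at the start.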
In practice, make sure that the weights at the end of the window are close to zero, and then judge the best fit by examining $x_t - r_t$. As a rule of thumb, choose a bandwidth equal to $N^{2/5}$, the number of observations ($N$) raised to the power of 2 over 5. The alternative rule is to set the bandwidth equal to $N^{1/4}$, or to make a decision based on the last significant autocorrelation. Since the choice of the window is always ad hoc in some sense, great care is needed if the smoothed series is going to be used to reveal correlations of great economic consequence.

^1 To be clear, we are not saying that daily interest rates necessarily contain monthly cycles, only that it might be the case. One example is daily observations of the Swedish overnight interbank rate.

APPENDIX II

Testing the Random Walk Hypothesis using the Variance Ratio Test.

For a random walk, $x_t = x_{t-1} + \varepsilon_t$, where $\varepsilon_t \sim NID(0, \sigma^2)$, we have that the variance of $x_t$ is $\sigma^2 t$ and that the autocovariance function is $cov(x_t, x_{t-k}) = (t - k)\sigma^2$. It follows that $var(x_t - x_{t-1}) = \sigma_1^2$, and that $var(x_t - x_{t-k}) = k\sigma_1^2$. Define $\sigma_k^2 = \frac{1}{k} var(x_t - x_{t-k})$. For a random walk we get that the estimated variance ratio $VR(k) = \hat{\sigma}_k^2 / \hat{\sigma}_1^2$ is not significantly different from one. The estimated (unbiased) variances are given as, for $k = 1$,

$$\hat{\sigma}_1^2 = \frac{1}{T - 1} \sum_{t=1}^{T} (x_t - x_{t-1} - \hat{\mu})^2, \tag{20.2}$$

and, for $k > 1$,

$$\hat{\sigma}_k^2 = \frac{1}{k(T - k + 1)\left(1 - \frac{k}{T}\right)} \sum_{t=k}^{T} (x_t - x_{t-k} - k\hat{\mu})^2, \tag{20.3}$$

where $\hat{\mu} = \frac{1}{T}(x_T - x_0)$, and $T$ is the sample size. Assuming homoscedasticity, the asymptotic variance of the random variable $VR(k)$ is

$$\Phi(k) = \frac{2(2k - 1)(k - 1)}{3kT}. \tag{20.4}$$

Under these assumptions a test statistic is given as

$$Z(k) = \frac{VR(k) - 1}{[\Phi(k)]^{1/2}} \overset{a}{\sim} N(0, 1), \tag{20.5}$$

where $a$ indicates that the test statistic converges to an asymptotic normal distribution. Since many time series, especially in finance, show time varying heteroscedasticity, the test statistic needs to be modified to take this into account. Lo and MacKinlay (1988) show that a heteroscedasticity consistent estimator of the asymptotic variance is given as

$$\Phi^*(k) = \sum_{j=1}^{k-1} \left[ 2\left(1 - \frac{j}{k}\right) \right]^2 \hat{\delta}(j), \tag{20.6}$$

where

$$\hat{\delta}(j) = \frac{\sum_{t=j+1}^{T} (x_t - x_{t-1} - \hat{\mu})^2 (x_{t-j} - x_{t-j-1} - \hat{\mu})^2}{\left[ \sum_{t=1}^{T} (x_t - x_{t-1} - \hat{\mu})^2 \right]^2}. \tag{20.7}$$

The heteroscedasticity consistent test statistic is therefore

$$Z^*(k) = \frac{VR(k) - 1}{[\Phi^*(k)]^{1/2}} \overset{a}{\sim} N(0, 1). \tag{20.8}$$

The test is performed by calculating sequences of $VR(k)$ as $k$ goes from 1 to $n$, where $n$ is some chosen fraction of the total number of observations. Since the test statistics only hold asymptotically, Monte Carlo simulations of limited samples are recommended. Under the null hypothesis of a random walk, it will not be possible to reject the hypothesis that $Z(k)$ or $Z^*(k)$ is equal to zero.

Appendix III Operators

When dealing with random variables and series of data, there are some operators that simplify the work. This chapter presents the rules of some common operators applied
to random variables and series of observations. These are the expectations operator, the variance operator, the covariance operator, the lag operator, the difference operator, and the sum operator.^2 The formal proofs behind these operators are not given; instead the chapter states the basic rules for using the operators.

All operators serve the basic purpose of simplifying the calculations and communication involving random variables. Take the expectations operator ($E$) as an example. Writing $E(x_t)$ means the same as "I will calculate the mean (or the first moment) of the observations on the random variable $\tilde{X}$".^3 But I am not telling exactly which specific estimator I would be using, if I were to estimate the mean from empirical data, because in this context it is not important.

One important use of operators is in investigating the properties of estimators under different assumptions concerning the underlying process. For instance, the properties of the OLS estimator when the explanatory variables are stochastic, when the variables in the model are trending, etc.

The Expectations Operator

The first operator is the expectations operator. This is a linear operator and is therefore easy to apply, as shown by the following rules. In the following, let $c$ and $k$ be two non-random constants, $\mu_i$ the mean of variable $i$, and $\sigma_{ij}$ the covariance between variable $i$ and variable $j$. It follows that

$$E(c) = c.$$
$$E(c\tilde{X}) = cE(\tilde{X}) = c\mu_x.$$
$$E(k + c\tilde{X}) = k + cE(\tilde{X}) = k + c\mu_x.$$
$$E(\tilde{X} + \tilde{Y}) = E(\tilde{X}) + E(\tilde{Y}) = \mu_x + \mu_y.$$
$$E(\tilde{X}\tilde{Y}) = E(\tilde{X})E(\tilde{Y}) + cov(\tilde{X}, \tilde{Y}) = \mu_x \mu_y + \sigma_{xy},$$

where $\sigma_{xy} = 0$ if $\tilde{X}$ and $\tilde{Y}$ are two independent random variables. Compare with the expectation of $\tilde{X}^2$,

$$E(\tilde{X}^2) = E(\tilde{X})E(\tilde{X}) + var(\tilde{X}) = \mu_x^2 + \sigma_x^2.$$

The expectations operator is linear and straightforward to use, with one important exception: the expectation of a ratio. This is an important exception since it represents a quite common problem. $E(\tilde{Y}/\tilde{X})$ is not equal to $E(\tilde{Y})/E(\tilde{X})$. The problem is that the numerator and the denominator are not necessarily independent. In this situation it is necessary to use the plim operator, or alternatively to let the number of observations go to infinity and use convergence in probability or distribution to analyze the outcome. In the derivation of the OLS estimator, the following transformation is often used, when $\tilde{X}$ is viewed as given,

$$E\left[\frac{\tilde{Y}}{\tilde{X}}\right] = E\left[\frac{1}{\tilde{X}}\tilde{Y}\right] = E(\tilde{W}\tilde{Y}).$$

A similar problem occurs in financial economics. If $F$ is the forward foreign exchange rate, and $S$ is the spot rate,

$$E\left(\frac{F}{S}\right) \neq \frac{E(F)}{E(S)}.$$

However, $E(\ln F - \ln S) = E(\ln F) - E(\ln S)$.

^2 The probability limit operator is introduced in a later chapter.
^3 Notice the difference between an estimator and an estimate.
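The exception for ratios can be made concrete with a quick simulation. The sketch below is an illustration under arbitrary assumptions that are not taken from the text: `F` and `S` are drawn as independent uniform variables on the interval from 1 to 3, purely so that both are positive, and sample means stand in for the expectations operator.

```python
import math
import random

rng = random.Random(0)
n = 100_000

# Hypothetical positive "forward" and "spot" rates, drawn independently.
F = [rng.uniform(1.0, 3.0) for _ in range(n)]
S = [rng.uniform(1.0, 3.0) for _ in range(n)]


def mean(values):
    return sum(values) / len(values)


mean_of_ratio = mean([f / s for f, s in zip(F, S)])   # sample analogue of E(F/S)
ratio_of_means = mean(F) / mean(S)                    # sample analogue of E(F)/E(S)

mean_of_log_diff = mean([math.log(f) - math.log(s) for f, s in zip(F, S)])
diff_of_log_means = mean([math.log(f) for f in F]) - mean([math.log(s) for s in S])

print(round(mean_of_ratio, 3), round(ratio_of_means, 3))
print(abs(mean_of_log_diff - diff_of_log_means) < 1e-9)
```

Even with independent draws the mean of the ratio exceeds the ratio of the means, because $1/S$ is a convex function of $S$. The log difference, by contrast, is linear in $\ln F$ and $\ln S$, so the two ways of computing it agree up to rounding error.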
The Variance Operator

For the variance operator, $var(\cdot)$ or $V(\cdot)$, we have the following rules,

$$var(c) = 0.$$
$$var(c\tilde{X}) = c^2 var(\tilde{X}) = c^2 \sigma_x^2.$$
$$var(k + c\tilde{X}) = c^2 var(\tilde{X}) = c^2 \sigma_x^2.$$
$$var(\tilde{Y} + \tilde{X}) = var(\tilde{Y}) + var(\tilde{X}) + 2cov(\tilde{Y}, \tilde{X}) = \sigma_y^2 + \sigma_x^2 + 2\sigma_{yx}.$$

If $\tilde{Y}$ and $\tilde{X}$ are independent the covariance term is zero and we get

$$var(\tilde{Y} + \tilde{X}) = var(\tilde{Y}) + var(\tilde{X}) = \sigma_y^2 + \sigma_x^2.$$

The Covariance Operator

The covariance operator ($cov$) has already been used above. It can be thought of as a generalization of the variance operator. Suppose we have two elements of $\tilde{X}$, call them $\tilde{X}_i$ and $\tilde{X}_j$. The elements can be two random variables in a multivariate process, or refer to observations at different times ($i$) and ($j$) of the same univariate time series process. The covariance between $\tilde{X}_i$ and $\tilde{X}_j$ is

$$cov(\tilde{X}_i, \tilde{X}_j) = E\{[\tilde{X}_i - E(\tilde{X}_i)][\tilde{X}_j - E(\tilde{X}_j)]\} = \sigma_{ij}.$$

[To be completed!]

The covariance matrix of a random variable $\tilde{X}$ with $p$ elements can be defined as

$$E\{[\tilde{X} - E(\tilde{X})][\tilde{X} - E(\tilde{X})]'\} = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix},$$

where $\sigma_{ii} = \sigma_i^2$, the variance of the $i$:th element.

As with the expectations and the variance operators, there are some simple rules. If we add constants, $a$ and $b$, to $\tilde{X}_i$ and $\tilde{X}_j$,

$$cov(\tilde{X}_i + a, \tilde{X}_j + b) = cov(\tilde{X}_i, \tilde{X}_j).$$

If we multiply $\tilde{X}_i$ and $\tilde{X}_j$ by the constants ($a$) and ($b$) respectively, we get

$$cov(a\tilde{X}_i, b\tilde{X}_j) = ab \, cov(\tilde{X}_i, \tilde{X}_j).$$

The covariance operator is sometimes also written as $C(\cdot)$.

The Sum Operator

In the following $\sum$ represents the sum operator. The basic definition of the sum operator is

$$\sum_{i=m}^{n} x_i = x_m + x_{m+1} + \dots + x_n, \tag{20.9}$$
where $m$ and $n$ are integers, and $m \leq n$. The important characteristic of the sum operator is that it is linear; all proofs of the following rules of the sum operator build on this fact. If $k$ is a constant,

$$\sum_{i=1}^{n} k x_i = k \sum_{i=1}^{n} x_i. \tag{20.10}$$

Some important rules deal with series of integer numbers, like a deterministic time trend $t = 1, 2, \dots, T$. These are of interest when dealing with integrated variables and determining the order of probability, that is the order of convergence, here indicated with $O(\cdot)$,

$$\sum_{t=1}^{T} t = 1 + 2 + \dots + T = \tfrac{1}{2}[T(T + 1)] = \tfrac{1}{2}[(T + 1)^2 - (T + 1)] = O(T^2), \tag{20.11}$$

$$\sum_{t=1}^{T} t^2 = 1^2 + 2^2 + \dots + T^2 = \tfrac{1}{6}[T(T + 1)(2T + 1)] = \tfrac{1}{3}\left[(T + 1)^3 - \tfrac{3}{2}(T + 1)^2 + \tfrac{1}{2}(T + 1)\right] = O(T^3), \tag{20.12}$$

$$\sum_{t=1}^{T} t^3 = 1^3 + 2^3 + \dots + T^3 = \tfrac{1}{4}[T^2(T + 1)^2] = \tfrac{1}{4}[(T + 1)^4 - 2(T + 1)^3 + (T + 1)^2] = O(T^4). \tag{20.13}$$

The Plim Operator

An estimator should be unbiased, have minimum variance and be consistent. In limited samples these requirements will not always be met. To investigate what happens as the sample size increases towards infinity we use probability limits. If $\hat{\theta}$ is an estimate of the true parameter $\theta$, we say that the estimator $\hat{\theta}$ is consistent if the probability that we estimate $\theta$ as the sample size increases to infinity is equal to one. That is, as the sample size approaches the population size, we should end up with the parameter describing the population and nothing else. Formally this can be stated as: the estimator $\hat{\theta}$ is a consistent estimator of $\theta$ if, for arbitrarily small (positive) numbers $\epsilon$ and $\delta$, there exists a sample size ($n_0$) such that

$$\Pr[\,|\hat{\theta} - \theta| < \epsilon\,] > 1 - \delta \quad \text{for } n > n_0. \tag{20.14}$$

This can also be written as

$$\lim_{n \to \infty} \Pr[\,|\hat{\theta} - \theta| < \epsilon\,] = 1, \tag{20.15}$$

or, in shorthand, as

$$\hat{\theta} \overset{p}{\to} \theta \quad \text{or} \quad \text{plim}\, \hat{\theta} = \theta. \tag{20.16}$$

Probability limits are useful for examining the asymptotic properties of estimators of stationary processes. There are a few simple rules to follow,

$$\text{plim}(ax + by) = a\,\text{plim}(x) + b\,\text{plim}(y), \tag{20.17}$$
$$\text{plim}(xy) = \text{plim}(x)\,\text{plim}(y), \tag{20.18}$$
$$\text{plim}(x/y) = [\text{plim}(x)]/[\text{plim}(y)], \tag{20.19}$$
$$\text{plim}(x^{-1}) = [\text{plim}(x)]^{-1}, \tag{20.20}$$
$$\text{plim}(x^2) = [\text{plim}(x)]^2. \tag{20.21}$$

These rules can be extended to matrices as

$$\text{plim}(AB) = \text{plim}(A)\,\text{plim}(B), \tag{20.22}$$
$$\text{plim}(A^{-1}) = [\text{plim}(A)]^{-1}. \tag{20.23}$$

These rules hold regardless of whether the variables are independent or not.

The Lag and the Difference Operators

The lag operator is defined as

$$L^n x_t = x_{t-n}.$$

It can also be used to move forward in a time series,

$$L^{-n} x_t = x_{t+n}.$$

With the lag operator it becomes possible to write long lag structures in a simpler way. From the lag operator follows the difference operator, such that

$$\Delta = 1 - L, \qquad \Delta x_t = x_t - x_{t-1}.$$

Notice that the difference operator can be used as

$$x_t = \Delta x_t + x_{t-1},$$

or as

$$x_{t-1} = x_t - \Delta x_t.$$

Differencing at higher order is done as

$$\Delta^d x_t = (1 - L)^d x_t.$$
Setting $d = 2$ we get

$$\Delta^2 x_t = (1 - L)^2 x_t = (1 - 2L + L^2)x_t = x_t - 2x_{t-1} + x_{t-2} = \Delta x_t - \Delta x_{t-1}.$$

The letter $d$ indicates the order of differencing, which can be an integer such as -2, -1, 0, 1 or 2. It is also possible to use real numbers, typically not below -1.5. With non-integer differencing we come to fractional integration, and so-called long memory series.

If variables are expressed in logs, which is the typical thing in time series, the first difference will be a close approximation to per cent growth.

The lag operator is sometimes called the backward shift operator and is then indicated with the symbol $B^n$. The difference operator, defined with the backward shift operator, is written as $\nabla^d = (1 - B)^d$. Econometricians use the terms lag operator and difference operator with the symbols above. Time series statisticians often use the backward shift notation.
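The lag and difference operators can be mimicked with a few small functions, which may help when checking expressions like the ones above by hand. This is a sketch under stated assumptions: the function names (`lag`, `diff_weights`, `difference`) are hypothetical, series are plain Python lists, and non-integer $d$ is handled by truncating the binomial expansion of $(1 - L)^d$ after `n_terms` terms, which is one common way to do it but not the only one.

```python
import math


def lag(x, n=1):
    """L^n x_t = x_{t-n}. Negative n shifts forward; unavailable values are None."""
    return [x[t - n] if 0 <= t - n < len(x) else None for t in range(len(x))]


def diff_weights(d, n_terms):
    """First n_terms coefficients of (1 - L)^d = sum_j w_j L^j, for integer or real d."""
    w = [1.0]
    for j in range(1, n_terms):
        w.append(w[-1] * (j - 1 - d) / j)
    return w


def difference(x, d, n_terms):
    """Apply the (possibly truncated) filter (1 - L)^d to x, for t >= n_terms - 1."""
    w = diff_weights(d, n_terms)
    return [sum(w[j] * x[t - j] for j in range(n_terms))
            for t in range(n_terms - 1, len(x))]


x = [1.0, 4.0, 9.0, 16.0, 25.0]

print(lag(x, 1))               # backward shift
print(lag(x, -1))              # forward shift
print(diff_weights(2, 4))      # coefficients of 1 - 2L + L^2, then zeros
print(difference(x, 2, 3))     # second difference of t^2 is constant
print(diff_weights(0.5, 3))    # first terms of a fractional difference filter

# With data in logs, the first difference approximates the growth rate.
log_growth = math.log(105.0) - math.log(100.0)
print(round(log_growth, 4))
```

For integer $d$ the expansion terminates after $d + 1$ terms, so `n_terms = d + 1` gives the exact difference. For fractional $d$ the weights die out slowly, which is the long memory property referred to above, and the choice of truncation point becomes a practical decision.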
