Analysis of Bayesian Dynamic Linear Models

Emily M. Casleton
December 17, 2010

1 Introduction

The main purpose of this project is to explore the Bayesian analysis of Dynamic Linear Models (DLMs). DLMs are mainly used to model observations collected over time, for the purposes of forecasting or detecting model shifts. A DLM is actually a sequence of models that is updated at each time step, which justifies the word dynamic. The characteristic of interest, or the unknown parameters of the time series, is modeled as $\theta$, possibly a vector of parameters, and the analysis of the evolution of $\theta$ is most commonly the main goal. Forecasting is also a common goal, but it too depends on how $\theta$ behaves over time. A model is fit to the parameter(s) of interest at time $t-1$, and this model is used to forecast a value for time $t$. An observation is then received at time $t$ and compared to the forecast, and the model is updated given this new information as well as any newly obtained and relevant outside information. The Bayesian approach is the most natural way to allow for this updating of information at each time $t$.

The DLM is specified by the following observation equation and system equation:

$$y_t = F_t \theta_t + \nu_t, \qquad \nu_t \sim (0, V_t) \tag{1}$$
$$\theta_t = G_t \theta_{t-1} + \omega_t, \qquad \omega_t \sim (0, W_t) \tag{2}$$

The observation equation (1) models the observation vector at time $t$. The observations are modeled with $F_t$, a design matrix of known values of the independent variable(s), multiplied by the state or system vector $\theta_t$, plus $\nu_t$, the observation error, which is assumed to have mean 0. The system equation (2) models the state vector as the sum of the zero-mean system (or evolution) error $\omega_t$ and the product of the state vector at the previous time, $\theta_{t-1}$, and the matrix $G_t$, known as the evolution, system, transfer, or state matrix. The observation and evolution errors are assumed to be independent of each other and internally independent.
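The observation and system equations (1) and (2) can be simulated directly. The following is a minimal sketch, not the code used for this report: it assumes normal errors, a constant $G$, $V$ and $W$, and a scalar observation, and the names `simulate_dlm` and `F_fn` are hypothetical.

```python
import numpy as np

def simulate_dlm(F_fn, G, V, W, theta0, n, seed=0):
    """Simulate n steps of a DLM with a scalar observation.

    Assumptions (not from the report): normal errors, constant G, V, W.
    F_fn(t) returns the row vector F_t; V is a scalar variance and W is
    a p x p covariance matrix.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    p = theta.shape[0]
    states = np.empty((n, p))
    obs = np.empty(n)
    for t in range(1, n + 1):
        # System equation (2): theta_t = G theta_{t-1} + omega_t
        omega = rng.multivariate_normal(np.zeros(p), W)
        theta = G @ theta + omega
        # Observation equation (1): y_t = F_t theta_t + nu_t
        nu = rng.normal(0.0, np.sqrt(V))
        F_t = np.asarray(F_fn(t), dtype=float)
        obs[t - 1] = F_t @ theta + nu
        states[t - 1] = theta
    return states, obs
```

Under these assumptions, the quadruple $\{1, 1, V, W\}$ corresponds to `F_fn = lambda t: [1.0]`, `G = np.eye(1)` and `W = np.array([[W]])`.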
More concisely, a DLM can be characterized by the quadruple $\{F_t, G_t, V_t, W_t\}$, each element of which may or may not depend on time. For example, the quadruple $\{1, 1, V, W\}$ represents a pure random walk if the error terms are assumed to be normally distributed. In the textbook problems, all four values of this quadruple are

considered known. Clearly, $F_t$ and $G_t$ are chosen by the modeler in accordance with the design of the model. In practice, the evolution variance $W_t$ is typically also chosen by the modeler, and it is only $V_t$ that is often unknown and sometimes needs to be estimated from the data.

Because a DLM is dynamic, it is only locally appropriate in time. The model in Equation 2 is appropriate at time $t$ until an observation $y_t$ is received and the model is updated according to the new information. The information available at time $t$ will be denoted $D_t$. This information can consist of more than just the observation. For example, if the observations represent a company's sales during month $t$, and it is known that a rival company is going out of business in month $t+1$, the expected increase in sales can also be included. In this project, however, models will be fit to simulated data, so the information set will be closed to external information, i.e. $D_t = \{Y_t, D_{t-1}\}$. The initial information available before the process is observed, i.e. at time $t = 0$, will be denoted $D_0$.

According to West, Harrison, and Migon [3], a key feature of the Bayesian analysis of DLMs is the use of a conjugate prior and posterior distribution for the parameters. Typically this distribution is taken to be normal, and the normal distribution is used for all models examined in this project. This leads to the following restatement of the general observation and system equations (Equations 1 and 2):

$$(Y_t \mid \theta_t) \sim N[F_t \theta_t, V_t] \tag{3}$$
$$(\theta_t \mid \theta_{t-1}) \sim N[G_t \theta_{t-1}, W_t] \tag{4}$$

In addition, the initial information, expressed through the prior distribution of the parameter, is assumed normal, with initial mean $m_0$ and variance $C_0$:

$$(\theta_0 \mid D_0) \sim N[m_0, C_0] \tag{5}$$

It is also assumed that the observation and system errors are independent of the initial information.
This leads to another key feature of the analysis, argued by West and Harrison [2]: given the present, the future is independent of the past. At each time increment, the following distributions describe how the model is updated with respect to the new information.

Posterior at $t-1$: $(\theta_{t-1} \mid D_{t-1}) \sim N(m_{t-1}, C_{t-1})$
Prior at $t$: $(\theta_t \mid D_{t-1}) \sim N(a_t, R_t)$
One-step forecast: $(Y_t \mid D_{t-1}) \sim N(f_t, Q_t)$
Posterior at $t$: $(\theta_t \mid D_t) \sim N(m_t, C_t)$

The mean of the prior for $\theta_t$ is obtained from the mean of the posterior of $\theta_{t-1}$ by $a_t = G_t m_{t-1}$, and its variance follows as $R_t = G_t C_{t-1} G_t^\top + W_t$. The

one-step forecast mean is then computed as $f_t = F_t a_t$, and the variance of the forecast distribution follows as $Q_t = F_t R_t F_t^\top + V_t$. Finally, the mean of the posterior is updated through $m_t = a_t + A_t e_t$, where $a_t$ is the mean of the prior, $A_t = R_t F_t^\top Q_t^{-1}$ is known as the adaptive vector, and $e_t = Y_t - f_t$ is the one-step forecast error. The variance of the posterior is updated as $C_t = R_t - A_t Q_t A_t^\top$, which reduces to $C_t = R_t V_t / Q_t$ when the observation and the state are both scalar. This posterior for $\theta_t$ then determines the prior for $\theta_{t+1}$, and the process is repeated for time $t+1$.

The general observation and system equations (Equations 1 and 2), with normal errors, will be used to simulate a data set from each of three types of model: a random walk, a dynamic straight line through the origin, and a dynamic linear regression. Using the forecasting and recurrence relations above, a dynamic linear model will then be fit to each simulated data set. The one-step forecast distribution will be plotted against the simulated values to judge the accuracy of the forecasts. Other interesting features of the analysis will also be explored.

2 Random Walk

The first data set to which a Bayesian analysis of a DLM will be applied is the simplest: a random walk. This constant model takes $F_t = 1$ and $G_t = 1$, with observation and system error variances that do not depend on time. The particular model simulated here uses $V = 4$ and $W = 0.25$. The quadruple that describes this model is therefore $\{1, 1, 4, 0.25\}$, which leads to the following observation and system equations:

$$y_t = \theta_t + \nu_t, \qquad \nu_t \sim (0, 4)$$
$$\theta_t = \theta_{t-1} + \omega_t, \qquad \omega_t \sim (0, 0.25)$$

The simulation process to obtain a random walk data set is as follows:

1. Start with an initial system value $\theta_0$. Here this is taken to be 16.
2. Simulate the first system error, $\omega_1$, from $N(0, 0.25)$.
3. Calculate $\theta_1 = \theta_0 + \omega_1$.
4. Simulate the first observation error, $\nu_1$, from $N(0, 4)$.
5.
Calculate the first observation $y_1 = \theta_1 + \nu_1$.
6. Repeat steps 2-5 a further $n-1$ times.

The above algorithm was run with $n = 50$, resulting in a random walk with 50 values. These values are plotted against time in Figure 1.
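The six simulation steps above can be sketched as follows. This is an illustrative sketch rather than the code that produced Figure 1: the function name and the seed are assumptions, and the second argument of $N(\cdot, \cdot)$ is read as a variance, so the standard deviation passed to the generator is its square root.

```python
import numpy as np

def simulate_random_walk(theta0=16.0, V=4.0, W=0.25, n=50, seed=0):
    """Simulate the random walk DLM {1, 1, V, W} for n time steps."""
    rng = np.random.default_rng(seed)  # seed is an arbitrary choice
    theta = theta0                     # step 1: initial system value
    thetas, ys = [], []
    for _ in range(n):
        omega = rng.normal(0.0, np.sqrt(W))  # step 2: system error
        theta = theta + omega                # step 3: new state
        nu = rng.normal(0.0, np.sqrt(V))     # step 4: observation error
        ys.append(theta + nu)                # step 5: observation
        thetas.append(theta)
    return np.array(thetas), np.array(ys)    # step 6 is the loop itself

thetas, ys = simulate_random_walk()
```

Because the seed is arbitrary, the simulated values will not match those plotted in Figure 1.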

Figure 1: Simulated Random Walk

Next, the time series is forecast using the forecasting and recurrence relations of the Bayesian analysis of the DLM. The quantities examined in further detail are the mean and variance of the forecast distribution, $f_t$ and $Q_t$; the adaptive coefficient, $A_t$; the error between the mean of the forecast distribution and the actual value, $e_t$; and the posterior distribution of $\theta_t$ at each time step, characterized by its mean and variance, $m_t$ and $C_t$. These values are computed at each step using the following algorithm:

1. Start with initial values for the distribution of $\theta$, with mean $m_0$ and variance $C_0$, and with values for the observation error variance $V$ and the system error variance $W$.
2. Compute the forecast mean $f_1 = m_0$.
3. Compute the forecast variance $Q_1 = R_1 + V$, where $R_1 = C_0 + W$.
4. Compute the adaptive coefficient $A_1 = R_1 / Q_1 = (C_0 + W) / Q_1$.

5. Compute the forecast error $e_1 = Y_1 - f_1$, where $Y_1$ is the first value in the random walk sequence.
6. Compute the posterior mean $m_1 = m_0 + A_1 e_1$.
7. Compute the posterior variance $C_1 = A_1 V$.
8. Repeat steps 2-7 a further $n-1$ times.

This algorithm is first used to forecast the random walk data set shown in Figure 1 by cheating somewhat: the assumed known values of $m_0$, $C_0$, $V$ and $W$ are taken to be the values used to simulate the data. The results of this DLM are plotted against the original values in Figure 2. The solid red line represents the forecast mean, $f_t$, while the dashed red lines represent 95% confidence bands, computed under the usual normal assumption as $f_t \pm 1.96\sqrt{Q_t}$. For the most part, the forecasted values react to the peaks of the observations, typically one time step behind the peak. The forecast also shows fewer random fluctuations than the original observations. Taken as a sequence, the forecasted values appear to be a smoothed version of the original data, delayed by one time step. Although the forecasted values show less variability than the original values, at most time steps the original value is contained within the 95% confidence band. For the first nine time steps, the values of the components of interest are shown in Table 1, in a layout similar to that reported by West and Harrison.

Month   Forecast distr.    Adaptive coef.   Datum    Error    Posterior info.
t       Q_t      f_t       A_t              y_t      e_t      m_t      C_t
0                                                             16       1
1       5.25     16        0.238            16.48     0.48    16.11    0.95
2       5.2      16.11     0.231            14.33    -1.79    15.7     0.92
3       5.17     15.7      0.227            16.14     0.44    15.8     0.91
4       5.16     15.8      0.224            11.94    -3.86    14.93    0.9
5       5.15     14.93     0.223            20.34     5.4     16.14    0.89
6       5.14     16.14     0.222            24.13     7.99    17.91    0.89
7       5.14     17.91     0.222            11.2     -6.71    16.43    0.89
8       5.14     16.43     0.221            21.58     5.16    17.57    0.88
9       5.13     17.57     0.221            14.55    -3.02    16.9     0.88

Table 1: Various components of the one-step forecasting and updating recurrence relations for the random walk data.
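The eight forecasting steps for the random walk can be sketched as a short loop. This is a minimal illustration, not the report's own code; the function name `filter_random_walk` is hypothetical, and the three observations used in the example are the first three $y_t$ values from Table 1.

```python
def filter_random_walk(y, m0, C0, V, W):
    """One-step forecasting and updating for the DLM {1, 1, V, W}."""
    rows = []
    m, C = m0, C0
    for y_t in y:
        R = C + W        # prior variance R_t
        f = m            # step 2: forecast mean f_t
        Q = R + V        # step 3: forecast variance Q_t
        A = R / Q        # step 4: adaptive coefficient A_t
        e = y_t - f      # step 5: forecast error e_t
        m = m + A * e    # step 6: posterior mean m_t
        C = A * V        # step 7: posterior variance C_t
        rows.append({"Q": Q, "f": f, "A": A, "e": e, "m": m, "C": C})
    return rows

rows = filter_random_walk([16.48, 14.33, 16.14], m0=16.0, C0=1.0, V=4.0, W=0.25)
for r in rows:
    lo = r["f"] - 1.96 * r["Q"] ** 0.5   # lower 95% band
    hi = r["f"] + 1.96 * r["Q"] ** 0.5   # upper 95% band
    print(round(r["Q"], 2), round(r["A"], 3), round(r["m"], 2), round(lo, 2), round(hi, 2))
```

With the prior values $m_0 = 16$, $C_0 = 1$, $V = 4$ and $W = 0.25$, the first rows agree with Table 1 up to rounding.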
It can be seen that the adaptive coefficient converges rapidly to 0.221. One interpretation of the adaptive coefficient is as the prior regression coefficient of $\theta_t$ upon $y_t$. This rapid convergence implies that, after the first few steps, the prior information from the previous step is given about the same weight in forecasting the next point at every time $t$.
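The limiting value of the adaptive coefficient can be checked by hand. The following derivation is a sketch that assumes the recurrences settle to a fixed point $(A, C, R)$; it uses only the relations $R_t = C_{t-1} + W$, $A_t = R_t / (R_t + V)$ and $C_t = A_t V$ from the algorithm above.

```latex
\begin{align*}
A &= \frac{R}{R + V}, \qquad R = C + W, \qquad C = A V \\
\Rightarrow\quad A\,(AV + W + V) &= AV + W \\
\Rightarrow\quad V A^2 + W A - W &= 0 \\
\Rightarrow\quad A &= \frac{-W + \sqrt{W^2 + 4VW}}{2V}
\end{align*}
% With V = 4 and W = 0.25:
% A = (-0.25 + \sqrt{0.0625 + 4}) / 8 \approx 0.2207, \qquad C = A V \approx 0.883
```

These limits are consistent with the values $A_t = 0.221$ and $C_t = 0.88$ reached in Table 1.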

Figure 2: Simulated Random Walk with forecasted values and 95% confidence bands.

The above analysis relied on the fact that the prior information consisted of the true values used to simulate the data, so it is not surprising that the forecast was relatively good and the confidence bands relatively narrow. A natural question is how these initial values affect the subsequent forecast, particularly the errors. A second analysis of the same random walk data was therefore performed, but instead of using the true simulation values as the initial values, the known values in the model were estimated from the data. The initial forecast mean, $m_0$, was taken to be the first observed value, $m_0 = 6.41$. The value of $C_0$ was again taken to be 1. The variance of all 50 values from the random walk was computed to be $s^2 \approx 19$. The observation error variance, $V$, and the system error variance, $W$, were each taken to be half of this estimated variance, so $V = 9.5$ and $W = 9.5$ were used for the second analysis.

Of initial interest in comparing the two DLMs with differing initial values are the forecast errors, $e_t$, that is, how well each model predicts new values. The errors for each model are shown against time in Figure 3. Here only the first

25 values are plotted, both to make the differences visible and because the pattern continues for the remaining 25 time points. At time $t = 1$, the forecast error of the model with estimated prior information is 0, since the first value of the realized sequence was taken to be the prior mean. The errors of the two methods are dissimilar for the first five time points, and then converge and are similar for the remaining 20. This can possibly be explained by the values of the adaptive coefficient. At time $t = 8$, the coefficient of $m_0$ in the computation of $m_8$ is $(1 - A_t)(1 - A_{t-1}) \cdots (1 - A_1)$. Therefore, the initial value contributes only about 17% to the estimated value at step 8, and this contribution decreases as time increases.

The noticeable difference between using the known values for the prior and those estimated from the data can be seen by comparing Figure 4 to Figure 2. When the data are used to estimate the initial values, the forecast means react more to the fluctuations of the observations, and the uncertainty associated with the forecast distribution is much larger. This could be because in reality the observation error variance was 16 times larger than the system error variance, whereas in the naive estimation procedure these variances were taken to be equal.

3 Dynamic straight line through (0,0)

The model in the previous section was a time-invariant random walk. The observation and system error variances were constant throughout the process, and $G_t$ and $F_t$ were both identically 1. In this section a slightly more complex model is explored, one that models the local relationship as a straight line through the origin with a slope that varies with time. Again, a constant error variance model will be used.
Therefore the quadruple that describes this model is $\{F_t, 1, V, W\}$, which results in the following observation and system equations:

$$y_t = F_t \beta_t + \nu_t, \qquad \nu_t \sim (0, V)$$
$$\beta_t = \beta_{t-1} + \omega_t, \qquad \omega_t \sim (0, W)$$

The covariate used here is time, so $F_t = t$ for $t = 1, 2, \ldots, n$. To simulate from this model, an algorithm similar to that used for the random walk is used. The only difference occurs in the equation in Step 5, which becomes $y_t = t\beta_t + \nu_t$. The initial slope used to simulate the data was $\beta_0 = 3.2$. The observation error variance was taken to be $V = 4$ and the system error variance $W = 0.5$. The results of simulating 50 values in this way are shown in Figure 5. The increasing trend is explained by the use of time as the covariate.

The algorithm to estimate the DLM for a straight line through the origin is similar to that used for the random walk, in that the same quantities need to be computed. However, for the random walk $F_t$ did not depend on time, so some of the formulas must be modified to accommodate this changing value. The updated algorithm is shown below.

Figure 3: Comparison of forecast errors using the known values for the prior information versus using values estimated from the data.

1. Start with initial values for the distribution of the slope, with mean $m_0$ (the initial estimate of $\beta_0$) and variance $C_0$, and with values for the observation error variance $V$ and the system error variance $W$.
2. Compute the forecast mean $f_1 = F_1 m_0$.
3. Compute the forecast variance $Q_1 = F_1^2 R_1 + V = F_1^2 (C_0 + W) + V$.
4. Compute the adaptive coefficient $A_1 = R_1 F_1 / Q_1 = (C_0 + W) F_1 / Q_1$.
5. Compute the forecast error $e_1 = Y_1 - f_1$, where $Y_1$ is the first value in the simulated sequence.
6. Compute the posterior mean $m_1 = m_0 + A_1 e_1$.
7. Compute the posterior variance $C_1 = R_1 V / Q_1$.
8. Repeat steps 2-7 a further $n-1$ times.
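The updated algorithm can be sketched in the same way as for the random walk, now with a time-dependent covariate. This is an illustrative sketch only; the function name `filter_dynamic_slope` is hypothetical, and the example observations are the first three $y_t$ values reported in Table 2.

```python
def filter_dynamic_slope(y, F, m0, C0, V, W):
    """One-step forecasting and updating for the DLM {F_t, 1, V, W}."""
    rows = []
    m, C = m0, C0
    for y_t, F_t in zip(y, F):
        R = C + W              # prior variance R_t
        f = F_t * m            # step 2: forecast mean
        Q = F_t ** 2 * R + V   # step 3: forecast variance
        A = R * F_t / Q        # step 4: adaptive coefficient
        e = y_t - f            # step 5: forecast error
        m = m + A * e          # step 6: posterior mean
        C = R * V / Q          # step 7: posterior variance
        rows.append({"f": f, "Q": Q, "A": A, "e": e, "m": m, "C": C})
    return rows

y = [5.18, 6.92, 8.25]
dynamic = filter_dynamic_slope(y, F=[1, 2, 3], m0=3.2, C0=1.0, V=4.0, W=0.5)
static = filter_dynamic_slope(y, F=[1, 1, 1], m0=3.2, C0=1.0, V=4.0, W=0.5)
```

Holding $F_t$ fixed at 1, as in the second call, gives the static model that the dynamic model is compared against in Table 2.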

Figure 4: Random walk with predictions and 95% error bands using values estimated from the data as the prior information.

The algorithm was used to compute the forecasting and recurrence distributions for the simulated data in Figure 5. Cheating was again implemented, as the prior information was taken to be that used to simulate the data; therefore, $\beta_0 = 3.2$, $C_0 = 1$, $V = 4$, and $W = 0.5$ were used in the algorithm above. The resulting predictions and 95% confidence bands are shown in Figure 6. The prediction means are similar to those seen with the random walk, in that the fluctuations of the forecast means lag behind the observed values by one point in time. A big difference between the confidence bands of the random walk and those seen here is that the width of the confidence bands increases with time. This is because the variance of the one-step forecast distribution, $Q_t$, depends on the value of the covariate, as seen in step 3 of the algorithm above. Since time is the covariate, this pattern makes sense.

Another difference between the predictions from the first analysis of the random walk and those of the dynamic straight line is that the forecast mean responds more to the apparent random fluctuations of the observations. Where

the DLM analysis for the random walk appeared to be a smoothed version of the observations, the dynamic straight line analysis does not appear to smooth the original data at all. This could be at least partially explained by comparing the adaptive coefficient, $A_t$, and the posterior variance, $C_t$, in Table 1 for the random walk with those in the first five columns of Table 2. The $A_t$ values for the random walk analysis range between 0.22 and 0.24, while for the dynamic straight line analysis they fall from about 0.3 to 0.02. This implies that for the analysis of this section much less weight is placed on the forecast error when computing the posterior mean. The forecast itself, however, is $f_{t+1} = F_{t+1} m_t$, so a forecast error $e_t$ moves the next forecast by $F_{t+1} A_t e_t$; at $t = 49$ this factor is roughly $50 \times 0.02 = 1$, compared with about 0.22 for the random walk. Also, the posterior variance for the random walk analysis ranged only from 0.88 to 1, while for the dynamic straight line it ranged from 0.002 to 1.09. This implies that for the straight line analysis the forecast variance becomes dominated by the fixed system variance, with little contribution from the posterior variance of $\beta_t$ as time increases.

Figure 5: Dynamic straight line through (0,0)

Instead of examining what the results would have been if the prior information had been estimated from the data, this analysis is compared to that of a static model. Table 2 compares the posterior distribution

for $\beta_t$ through the mean $m_t$ and variance $C_t$. Also included in the table is the adaptive coefficient, $A_t$, for each analysis. These values are shown for the first five observations as well as the last five observations.

Figure 6: Simulated dynamic straight line through (0,0) with forecasted values and 95% confidence bands.

The most obvious difference between the two models is in the values of the posterior mean, $m_t$, at large values of $t$. Because time was used as the covariate, the observations increased as time increased. The dynamic model accounted for this by incorporating the increasing covariate into the estimation. The static model, however, held $F_t$ fixed at 1, so to account for the increasing trend, the estimate of $\beta_t$ increased. In the static model, the posterior variance of $\beta$ held steady throughout at around 1.1-1.2, while that of the dynamic model decreased to 0.002. This difference can easily be seen in the formulas for computing the posterior variance in each model, with the dynamic model taking the increasing covariate into consideration. Lastly, the adaptive coefficient for the dynamic model decreases to about 0.02 by $t = 50$, while that of the static model converges to a value an order of magnitude larger, 0.297. Because the adaptive coefficient in the dynamic model is closer to 0, this implies that the prior distribution is more concentrated than the likelihood. So,

the static model is more sensitive to the latest value of the observation. With the decreasing values of $A_t$ and $C_t$ for the dynamic model, this implies that the slope estimate responds less and less to the most recent data point.

               Dynamic model              Static model
F_t   y_t      m_t     C_t     A_t        m_t      C_t     A_t
1     5.18     3.74    1.091   0.273      3.74     1.091   0.273
2     6.92     3.57    0.614   0.307      4.65     1.138   0.285
3     8.25     2.98    0.318   0.238      5.69     1.162   0.291
4     11.89    2.98    0.191   0.191      7.51     1.174   0.294
5     20.07    3.82    0.13    0.162      11.22    1.18    0.295
...
46    157.63   3.42    0.002   0.022      135.51   1.186   0.297
47    181.03   3.85    0.002   0.021      149      1.186   0.297
48    189.32   3.94    0.002   0.021      160.96   1.186   0.297
49    158.77   3.24    0.002   0.02       160.31   1.186   0.297
50    191.79   3.83    0.002   0.02       169.64   1.186   0.297

Table 2: Analysis of dynamic straight line through (0,0) using both a dynamic and a static model.

4 Dynamic Linear Regression

The last DLM analyzed in this report has a slope and an intercept that are both time dependent. Therefore, $\theta_t = (\alpha_t, \beta_t)$ is two-dimensional, and the observation and state equations become

$$y_t = \alpha_t + t\beta_t + \nu_t, \qquad \nu_t \sim N(0, V)$$
$$\alpha_t = \alpha_{t-1} + \omega_{\alpha,t}, \qquad \omega_{\alpha,t} \sim N(0, W_{\alpha,t})$$
$$\beta_t = \beta_{t-1} + \omega_{\beta,t}, \qquad \omega_{\beta,t} \sim N(0, W_{\beta,t})$$

The above equations imply that the quadruple that characterizes the model is $\{F_t, I, V, W_t\}$, where $F_t = (1, t)$, $I$ is the 2x2 identity matrix, and

$$W_t = \begin{pmatrix} W_{\alpha,t} & 0 \\ 0 & W_{\beta,t} \end{pmatrix}$$

Here the observation error variance is again assumed constant, but the system error variances now vary with time. An algorithm similar to that used to simulate the observations for the random walk and the dynamic line with intercept 0 is used to simulate a dynamic linear regression data set. The main difference is that two independent system errors must be simulated to obtain one observation. Here again, time

will be used as the only covariate. Fifty observations were simulated using $\alpha_0 = 12$, $\beta_0 = -1.6$, $V = 3.8$, $W_{\alpha,t} = t0.35$ and $W_{\beta,t} = 0.03/t$. The resulting values are shown in Figure 7. As with the values simulated in the previous section, the trend follows from using time as the covariate; here it is decreasing because the slope is negative.

Figure 7: Simulated dynamic linear regression.

The algorithm used to forecast this time series is again similar to those above, with added computations due to the two parameters that must be estimated. The necessary calculations are shown below.

1. Start with initial values of the intercept, $\alpha_0$, and slope, $\beta_0$; the initial covariance matrix of their joint distribution, $C_0$, a 2x2 matrix with 0s on the off-diagonal; an initial value of the observation error variance, $V$; and a 2x2 matrix of the system error variances, $W_t$, again with 0s on the off-diagonal.
2. Compute the forecast mean $f_1 = F_1 a_1 = \alpha_0 + (1)\beta_0$.
3. Compute the prior variance $R_1 = C_0 + W_1$.

4. Compute the forecast variance $Q_1 = F_1 R_1 F_1^\top + V$.
5. Compute the adaptive coefficient $A_1 = R_1 F_1^\top Q_1^{-1}$.
6. Compute the forecast error $e_1 = y_1 - f_1$.
7. Compute the posterior means $\alpha_1 = \alpha_0 + A_{1(1,1)} e_1$ and $\beta_1 = \beta_0 + A_{1(2,1)} e_1$.
8. Compute the posterior variance $C_1 = R_1 - A_1 Q_1 A_1^\top$.

The algorithm was applied to the previously simulated data using the known values as the prior information. The results can be seen in Figure 8. For the forecast means, displayed by the solid red line, a pattern similar to that of the dynamic straight line with no intercept is seen. The predicted means appear very similar to the original data, but shifted to the right by one time step. The predicted lines seen here and in the previous section are so similar to the original data that they raise the question of whether this is a result of using the known values of the variances and initial estimates in the algorithms, or whether there is something wrong with the prediction calculations. Although the latter cannot be ruled out, the former certainly applies to the analysis of the dynamic linear regression model, since the true structure of the system variance, i.e. $W_{\alpha,t} = t0.35$ and $W_{\beta,t} = 0.03/t$, was used in the calculations. Not surprisingly, the confidence bands around the forecasted means are very narrow, and they should be: the true state of nature is known and was used in the estimation procedure, so it follows that there is little doubt about the estimates. With a model as complicated as the one used in this section, one may wonder how these initial values would be determined with access only to the observed sequence. Currently, this is an area in which the author does not have any insight.

5 Conclusions

Three different types of models were simulated and a Bayesian analysis of the resulting time series was attempted using dynamic linear models.
The three types of models were a random walk, a dynamic straight line through the origin, and a dynamic linear regression, where the slope and intercept were allowed to vary with time, which was also the covariate. The main problem encountered with the analysis was that prior information about various parameters was needed in order to perform the necessary calculations. Since the data were simulated and did not represent any scientific process, the prior information available to the modeler was the actual values used to simulate the data. Unsurprisingly, this led to very accurate results, but this information is typically unknown, and thus this procedure would not translate well to non-simulated data.

The issue of not using the known simulation values for estimation purposes was explored in the random walk example. Here the initial information was estimated from the data values themselves. It was seen that there was not a loss

in terms of the prediction errors; however, the variance of the one-step forecast distribution was significantly larger.

In addition to the process of producing initial estimates, there are many other issues that could be addressed with respect to the Bayesian analysis of dynamic linear models. For all simulations in this report the observation variance, $V$, was assumed to be known, although this is the value that typically must be estimated. An analysis that also has to model this parameter is one extension of the work in this report. A model that uses a covariate other than time may also be of interest. Lastly, analyzing non-simulated data with this procedure would be the true test of a modeler's ability.

Figure 8: Simulated dynamic linear regression with forecasted values and 95% confidence bands.

6 Bibliography

1. Petris, Giovanni. An R Package for Dynamic Linear Models. Journal of Statistical Software 36(12) (2010): 1-16. http://www.jstatsoft.org/v36/i12/.
2. West, Mike and Jeff Harrison. Bayesian Forecasting and Dynamic Models. New York: Springer-Verlag, 1997.
3. West, Mike, Jeff Harrison, and Helio Migon. Dynamic Generalized Linear Models and Bayesian Forecasting. Journal of the American Statistical Association 80 (1985): 73-97.