Joint Statistical Meetings - Bayesian Statistical Science

TIME SERIES ANALYSIS OF COMPOSITIONAL DATA USING A DYNAMIC LINEAR MODEL APPROACH

AMITABHA BHAUMIK, DIPAK K. DEY AND NALINI RAVISHANKER
Department of Statistics, University of Connecticut, U-4120, 215 Glenbrook Road, Storrs, CT 06269

Abstract. Compositional time series data comprise multivariate observations that, at each time point, are essentially proportions of a whole quantity. This kind of data occurs frequently in many disciplines such as economics, geology and ecology. The usual multivariate statistical procedures available in the literature are not applicable to such data, since they ignore the inherently constrained nature of these observations as parts of a whole. This article describes new techniques for modeling compositional time series data in a hierarchical Bayesian framework. Modified dynamic linear models are fit to compositional data via Markov chain Monte Carlo techniques. The distribution of the underlying errors is assumed to be a scale mixture of multivariate normals, of which the multivariate normal, multivariate t, multivariate logistic, etc., are special cases. In particular, multivariate normal and Student-t error structures are considered and compared through predictive distributions. The approach is illustrated on a data set.

Keywords. Box-Cox transformation; Compositional data; Hierarchical Bayesian model; Scale mixture of normals distribution; Student-t distribution.

1. INTRODUCTION

Compositional time series occur in many application areas including geology, manufacturing design, ecology, biology, etc. Such a series is characterized by components that are positive and sum to one at each time point. In many cases, such compositions also change over time. In general, a $G$-variate compositional time series consists of a $G$-dimensional vector of non-negative components $W_{t1}, \ldots, W_{tG}$ such that $W_{t1} + \cdots + W_{tG} = 1$ for each $t$ (Aitchison, 1982).
Because of this unit sum constraint at each time $t$, the composition is completely determined by any $g = G - 1$ of its components, so that the vector $W_t = (W_{t1}, \ldots, W_{tG})$ lies in a $g$-dimensional simplex, defined as

\[
S_t^g = \{(W_{t1}, \ldots, W_{tG}) : W_{t1} \geq 0, \ldots, W_{tG} \geq 0;\; W_{t1} + \cdots + W_{tG} = 1\}. \tag{1}
\]

Statistical analysis of such constrained multivariate time series in the presence of covariate information is very useful in practice. However, the biggest hindrance to the analysis of compositional data arises from the two inherent constraints mentioned above. Standard techniques such as regression with vector autoregressive moving average (VARMA) errors, or Kalman filtering (Kalman, 1969), etc., are not directly applicable to raw compositional time series data. To circumvent this problem, Aitchison's (1982) proposal was to transform the data from the $g$-dimensional simplex to the $g$-dimensional real space $R^g$, and to employ an additive error model. The most popular transformation is the Additive Log Ratio (ALR) transformation, which is of the form $Y_{ti} = \log(W_{ti}/W_{tG})$, $i = 1, \ldots, g$. Another possibility is the Box-Cox transformation (Box and Cox, 1964), which was used for cross-sectional compositional data by Rayens and Srinivasan (1991a, b) in a frequentist setup and by Iyenger and Dey (1998) in a Bayesian framework. In both cases, the transformed variable may be assumed to follow a multivariate normal distribution. The Box-Cox transformation is more general in the sense that the ALR transformation is a special case.

The literature on compositional analysis of multivariate time series incorporating covariate information is sparse. Quintana and West (1988) used an ALR transformation with multivariate dynamic linear models. A state space model based on the Dirichlet distribution was developed by Grunwald, Raftery and Guttorp (1993). In this article, we analyze compositional time series data after a Box-Cox transformation.
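As a minimal sketch of the two transformations just mentioned, the Python fragment below maps a composition to $R^g$ using ratios to the last component, with the ALR transformation recovered when every Box-Cox parameter is zero. It assumes the ratio form of the Box-Cox transformation that this paper defines in Section 2; the function names and the example composition are hypothetical and chosen only for illustration.

```python
import math

def box_cox_composition(w, lam):
    """Map a composition w (length G, positive, summing to 1) to R^(G-1).

    Each of the first G-1 components is divided by the last one and then
    Box-Cox transformed with its own parameter lam[i]; lam[i] == 0 gives
    the additive log ratio (ALR) transformation.
    """
    *head, last = w
    if len(lam) != len(head):
        raise ValueError("need one Box-Cox parameter per free component")
    y = []
    for w_i, lam_i in zip(head, lam):
        ratio = w_i / last
        if lam_i == 0.0:
            y.append(math.log(ratio))                 # ALR special case
        else:
            y.append((ratio ** lam_i - 1.0) / lam_i)  # Box-Cox on the ratio
    return y

def inverse_box_cox_composition(y, lam):
    """Map a point of R^(G-1) back to a composition of length G."""
    ratios = []
    for y_i, lam_i in zip(y, lam):
        if lam_i == 0.0:
            ratios.append(math.exp(y_i))
        else:
            ratios.append((lam_i * y_i + 1.0) ** (1.0 / lam_i))
    total = 1.0 + sum(ratios)
    return [r / total for r in ratios] + [1.0 / total]

# Hypothetical three-part composition (G = 3, g = 2).
w = [0.2, 0.3, 0.5]
y_alr = box_cox_composition(w, [0.0, 0.0])   # ALR coordinates
y_bc = box_cox_composition(w, [0.5, 0.5])    # Box-Cox coordinates
w_back = inverse_box_cox_composition(y_bc, [0.5, 0.5])
```

The inverse map divides each back-transformed ratio by one plus the sum of all ratios, so the recovered components are positive and sum to one by construction.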
Inference is carried out in the Bayesian framework using a very rich class of scale mixtures of multivariate normals (SMMVN) for modeling the errors. This class of distributions includes the multivariate normal, Student-t, logistic, stable, etc. (see Dey and Chen, 1998). In this article, we investigate the performance of the multivariate normal
and multivariate Student-t distributions in particular. Model performance is assessed via predictive distributions; the Conditional Predictive Ordinate (CPO) is computed and is used for model assessment.

The format of the paper is as follows. In Section 2 we discuss the proposed model. Predictive distributions and their Monte Carlo estimates are also discussed. Section 3 presents an illustration based on the composition of vehicles produced by the US, Japan, and other countries over the time period 1947-1987. Section 4 contains some concluding remarks.

2. A MODEL FOR TRANSFORMED COMPOSITIONS

Let $W_t$ denote the $G$-dimensional composition at time $t$ as defined in Section 1. We assume that $W_t$ depends on some unknown state of nature $\eta_t = (\eta_{t1}, \ldots, \eta_{tG})$, a vector of true proportions. Due to the unit sum constraint, the $W_t$ are points on a $g = G - 1$ dimensional simplex. The Box-Cox transformation of $W_t$ maps the vector to $Y_t$ in $R^g$:

\[
Y_{ti} =
\begin{cases}
\dfrac{(W_{ti}/W_{tG})^{\lambda_{ti}} - 1}{\lambda_{ti}} & \text{if } \lambda_{ti} \neq 0, \\[2ex]
\log(W_{ti}/W_{tG}) & \text{if } \lambda_{ti} = 0,
\end{cases}
\tag{2}
\]

where $\lambda_{ti} \in R$ is an unknown parameter known as the Box-Cox parameter, $t = 1, \ldots, T$, $i = 1, \ldots, g$ and $g = G - 1$. This transformation is denoted in this paper as $Y_t = BC(W_t, \lambda_t)$; the ALR transformation is the special case $\lambda_{ti} = 0$ for all $i, t$.

2.1. The Dynamic Linear Model

Let $X_t$ denote the $l^{*}$-dimensional covariate vector at time $t$. The linear regression model for the $g$-dimensional time series $Y_t$ has the form

\[
Y_t^{\lambda_t} = \alpha_t + X_t \beta_t + e_t, \tag{3}
\]

where the unknown parameters $\alpha_t$ and $\beta_t$ are $g$-dimensional vectors, $\lambda_t$ denotes the vector of Box-Cox parameters, and $e_t$ is the random error. This model is investigated under the scale mixture of multivariate normals (SMMVN) error distribution. Suppose that, conditional on a mixing parameter $\tau$, $e_t$ is normally distributed with mean 0 and unknown variance $k(\tau) V_t$, where $k(\tau)$ is a positive function of $\tau$ and $\tau$ has pdf $g(\tau)$. When $k(\tau) = \tau$ and $\tau$ has a degenerate distribution at 1, $e_t$ has a multivariate normal distribution with mean 0 and variance $V_t$.
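The scale-mixture construction of the errors can be sketched in a few lines of Python. This is an illustrative sampler under stated assumptions, not the authors' code: the helper names, the example matrix `V` and the value of `nu` are hypothetical, and the Student-t case uses the standard Gamma mixing distribution with shape and rate both equal to $\nu/2$, which is the second error structure considered in this paper.

```python
import math
import random

def cholesky(v):
    """Lower-triangular L with L L' = v, for a small symmetric positive definite v."""
    n = len(v)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(v[i][i] - s)
            else:
                low[i][j] = (v[i][j] - s) / low[j][j]
    return low

def draw_smmvn_error(v, k, draw_tau, rng):
    """One draw of e, where e given tau is N(0, k(tau) * v) and tau comes from draw_tau."""
    tau = draw_tau(rng)
    low = cholesky(v)
    z = [rng.gauss(0.0, 1.0) for _ in v]
    scale = math.sqrt(k(tau))
    return [scale * sum(low[i][j] * z[j] for j in range(i + 1))
            for i in range(len(v))]

def normal_error(v, rng):
    # k(tau) = tau with tau degenerate at 1: plain multivariate normal errors.
    return draw_smmvn_error(v, lambda tau: tau, lambda r: 1.0, rng)

def student_t_error(v, nu, rng):
    # k(tau) = 1 / tau with tau ~ Gamma(shape nu/2, rate nu/2).
    # random.gammavariate takes (shape, scale), so the scale is 2 / nu.
    return draw_smmvn_error(v, lambda tau: 1.0 / tau,
                            lambda r: r.gammavariate(nu / 2.0, 2.0 / nu), rng)

rng = random.Random(0)
V = [[1.0, 0.3], [0.3, 2.0]]          # hypothetical 2 x 2 scale matrix
normal_draws = [normal_error(V, rng) for _ in range(5)]
t_draws = [student_t_error(V, 10.0, rng) for _ in range(5)]
```

Only the mixing step differs between the two error models, which is why a single routine covers both: the heavier tails of the Student-t errors come entirely from occasional small values of $\tau$.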
When $k(\tau) = 1/\tau$ and the mixing density $g(\tau)$ is Gamma with shape and rate parameters both equal to $\nu/2$, the marginal distribution of $e_t$ becomes Student-t, with location parameter 0, scale matrix $V_t$ and $\nu$ degrees of freedom. While the multivariate normal error model is simple to fit, the multivariate Student-t distribution handles extreme observations better.

The observation equation can be rewritten as

\[
Y_t^{\lambda_t} = Z_t \eta_t + e_t, \tag{4}
\]

where $Z_t = (1 \;\; X_t)$ is the $g \times l$ matrix of covariates at time $t$ and $\eta_t = (\alpha_t, \beta_t)'$ is an $l$-dimensional vector, for $l = l^{*} + 1$. The essential difference between the dynamic linear model and the static linear model is that in the former, the regression coefficients are not assumed to be constant, but may change with time. This dynamic feature is incorporated via the system equation

\[
\eta_t = G \eta_{t-1} + w_t, \tag{5}
\]

where $G$ is a known matrix, and we assume that $p(w_t \mid W_t)$ is normal, i.e.,

\[
w_t \sim N(0, W_t). \tag{6}
\]

It is assumed that the unknown, time-dependent Box-Cox parameter follows the system equation

\[
\lambda_t = H \lambda_{t-1} + u_t, \tag{7}
\]

where $p(u_t \mid U_t)$ is normal, i.e.,

\[
u_t \sim N(0, U_t). \tag{8}
\]

Under this modeling framework, the system equation and the equation on the Box-Cox parameter can be reparametrized as

\[
\eta_t = \sum_{i=1}^{t} G^{t-i} w_i, \tag{9}
\]

\[
\lambda_t = \sum_{i=1}^{t} H^{t-i} u_i, \tag{10}
\]

where $G^{t-i}$ and $H^{t-i}$ are the $(t-i)$th powers of $G$ and $H$ respectively. Rewriting the observation equation using these transformed parametric expressions, the model has the form
\[
Y_t^{\lambda_t} = Z_t \sum_{i=1}^{t} G^{t-i} w_i + e_t, \tag{11}
\]

\[
\lambda_t = \sum_{i=1}^{t} H^{t-i} u_i, \tag{12}
\]

where $Y_t^{\lambda_t} = BC(W_t, \lambda_t)$. In addition to the usual linear model assumptions regarding the error terms, we also postulate that $e_t$ and $u_t$ are independent of $w_t$. Let $\mu_t = Z_t \sum_{i=1}^{t} G^{t-i} w_i$, $Y = (Y_1, \ldots, Y_T)$, $V = (V_1, \ldots, V_T)$, $W = (W_1, \ldots, W_T)$ and $U = (U_1, \ldots, U_T)$.

2.2. Posterior Distribution and Model Fit

The exact likelihood has the form

\[
\begin{aligned}
L(Y; V, W, U, \tau) \propto \prod_{t=1}^{T} & \exp\Big[-\frac{1}{2k(\tau)} \big(BC(W_t, u_1, \ldots, u_t) - \mu_t\big)' V_t^{-1} \big(BC(W_t, u_1, \ldots, u_t) - \mu_t\big)\Big] \, g(\tau) \\
& \times \exp\Big[-\frac{1}{2} u_t' U_t^{-1} u_t\Big] \exp\Big[-\frac{1}{2} w_t' W_t^{-1} w_t\Big] \, |V_t|^{-1/2} |U_t|^{-1/2} |W_t|^{-1/2}.
\end{aligned}
\]

In the special case when the mixing parameter has a Gamma distribution with shape and rate parameters both equal to $\nu/2$ for known $\nu$, the likelihood function becomes

\[
\begin{aligned}
L(Y; V, W, U) \propto \prod_{t=1}^{T} & \Big[1 + \big(BC(W_t, u_1, u_2, \ldots, u_t) - \mu_t\big)' V_t^{-1} \big(BC(W_t, u_1, u_2, \ldots, u_t) - \mu_t\big)/\nu\Big]^{-(\nu+g)/2} \\
& \times \exp\Big[-\frac{1}{2} u_t' U_t^{-1} u_t\Big] \exp\Big[-\frac{1}{2} w_t' W_t^{-1} w_t\Big] \, |V_t|^{-1/2} |U_t|^{-1/2} |W_t|^{-1/2}.
\end{aligned}
\]

For the prior specification, we assume that $p(w_1)$ and $p(u_1)$ are $N(a_w, R_w)$ and $N(a_u, R_u)$ distributions respectively, with pre-specified hyperparameters $a_w$, $a_u$, $R_w$ and $R_u$. The prior specifications on the covariance matrices are assumed to be of the form

\[
\pi(V_t, W_t, U_t) = \pi(V_t)\,\pi(W_t)\,\pi(U_t), \tag{13}
\]

where $\pi(V_t)$, $\pi(W_t)$ and $\pi(U_t)$ are inverse Wishart distributions with known hyperparameters. A routine sensitivity analysis may be carried out in order to investigate the effect of these hyperparameters on the model fit. By Bayes' theorem, the joint posterior density is then

\[
\pi(u, V, W, U, \tau \mid Y) \propto L(Y; V, W, U, \tau) \prod_{t=1}^{T} \pi(V_t, W_t, U_t) \prod_{t=2}^{T} p(w_t \mid W_t)\, p(u_t \mid U_t) \; p(w_1)\, p(u_1)\, g(\tau).
\]

The resulting joint posterior is analytically intractable. The full conditional distributions of the parameters are proportional to the joint posterior and do not have standard forms. We have employed a sampling based approach via Gibbs sampling (Gelfand and Smith, 1990), with the Metropolis-Hastings algorithm (Hastings, 1970) using a Gaussian proposal.
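When a full conditional has no standard form, one common way to update it inside a Gibbs sweep is a random-walk Metropolis step with a Gaussian proposal. The sketch below illustrates that idea on a toy scalar target; it is not the sampler used for the model above. The function names, the toy log density (a Student-t kernel for one observation combined with a normal prior) and the tuning value `step_sd` are all hypothetical.

```python
import math
import random

def metropolis_step(x, log_target, step_sd, rng):
    """One random-walk Metropolis update with a symmetric Gaussian proposal."""
    proposal = x + rng.gauss(0.0, step_sd)
    log_ratio = log_target(proposal) - log_target(x)
    # Accept with probability min(1, exp(log_ratio)); the proposal density
    # cancels because it is symmetric.
    if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
        return proposal, True
    return x, False

def run_chain(log_target, x0, step_sd, n_iter, seed=0):
    """Run the sampler and return the draws and the acceptance rate."""
    rng = random.Random(seed)
    x, draws, accepted = x0, [], 0
    for _ in range(n_iter):
        x, moved = metropolis_step(x, log_target, step_sd, rng)
        accepted += moved
        draws.append(x)
    return draws, accepted / n_iter

def toy_log_conditional(x, y=1.5, nu=5.0, prior_var=4.0):
    """Unnormalized log density: Student-t kernel for y given x, normal prior on x."""
    log_lik = -0.5 * (nu + 1.0) * math.log(1.0 + (y - x) ** 2 / nu)
    log_prior = -0.5 * x * x / prior_var
    return log_lik + log_prior

draws, acc_rate = run_chain(toy_log_conditional, x0=0.0, step_sd=1.0, n_iter=5000)
kept = draws[1000:]                      # discard an initial burn-in stretch
posterior_mean = sum(kept) / len(kept)
```

In a Metropolis-within-Gibbs scheme, a step of this kind would be applied in turn to each parameter block, holding the other blocks at their current values; the proposal standard deviation is usually tuned so that the acceptance rate is neither close to 0 nor close to 1.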
Maximum likelihood estimates of the parameters provide the initial values for the sampler. Let $(\eta^{(s)}, \lambda^{(s)})$, $s = 1, \ldots, B$, denote convergent samples generated from the joint posterior. The means and standard deviations of these samples corresponding to each parameter provide summary features of the marginal posterior distributions. Samples of estimated expected composition proportions can also be calculated at each time point for given covariates. For $s = 1, \ldots, B$, these are

\[
\eta_{ti}^{(s)} = \frac{\big[\lambda_{ti}^{(s)} \hat{Y}_{ti}^{(s)} + 1\big]^{1/\lambda_{ti}^{(s)}}}{1 + \sum_{j=1}^{g} \big[\lambda_{tj}^{(s)} \hat{Y}_{tj}^{(s)} + 1\big]^{1/\lambda_{tj}^{(s)}}}, \quad i = 1, \ldots, g, \tag{14}
\]

and for $i = G$,

\[
\eta_{tG}^{(s)} = \frac{1}{1 + \sum_{j=1}^{g} \big[\lambda_{tj}^{(s)} \hat{Y}_{tj}^{(s)} + 1\big]^{1/\lambda_{tj}^{(s)}}}, \tag{15}
\]

where $\lambda^{(s)}$ and $\hat{Y}^{(s)}$ are calculated from the given model. Estimates of these expected proportions are obtained as the Monte Carlo averages of the expressions in (14) and (15) over the $B$ samples.

Convex credible regions for proportions can also be constructed. The first step is to generate a sample of $B$ proportions at each time point $t = 1, \ldots, T$. At time point $t$, the $s$th sample for the $i$th proportion is given by equations (14) and (15). These $100(1-\alpha)\%$ credible regions indicate how the shapes of the simplexes corresponding to the true proportions change over time. These regions are obtained using the S-Plus function chull, which returns the indices of
the points belonging to the hull. The peel option permits peeling off from the convex hull successively until either all the planar points are assigned or a user-specified limit is reached.

2.3. Cross Validation

In this article, model assessment is based on the cross-validation predictive density. It is well known that the marginal density $f(Y)$ is equivalent to the set $\{f(Y_r \mid Y_{(r)}) : r = 1, 2, \ldots, T\}$, where $f(Y_r \mid Y_{(r)})$ is the predictive distribution of $Y_r$ when the $r$th observation is deleted. The expression for $f(Y_r \mid Y_{(r)})$ is

\[
f(Y_r \mid Y_{(r)}) = \frac{f(Y)}{f(Y_{(r)})} = \int f(Y_r \mid \eta, Y_{(r)}, Z)\, \pi(\eta \mid Y_{(r)})\, d\eta, \tag{16}
\]

which is called the Conditional Predictive Ordinate (CPO). Using the samples $\eta^{(s)}$, $s = 1, 2, \ldots, B$, generated from the posterior, the Monte Carlo estimate of the CPO can be calculated as (Dey, Chang and Ray, 1996)

\[
\mathrm{CPO}_r = B \left( \sum_{s=1}^{B} \frac{1}{f(Y_r \mid \eta^{(s)}, Y_{(r)})} \right)^{-1}. \tag{17}
\]

For the purpose of model checking, the presence of many small CPOs casts doubt on the model. A useful summary statistic of the CPOs is the logarithm of the pseudo-Bayes factor (Geisser and Eddy, 1979 and Dey, Chen and Chang, 1997), defined as

\[
\mathrm{PsBF} = \sum_{r=1}^{T} \log(\mathrm{CPO}_r). \tag{18}
\]

The advantage of PsBF over the Bayes factor (Gelfand and Dey, 1992) as a model assessment tool is that the latter is not well defined with improper priors, and is generally quite sensitive to vague proper priors. The evaluation of $f(Y_r \mid \eta^{(s)}, Y_{(r)})$ is complicated in time series models. The calculation is simplified as follows (Pai and Ravishanker, 1996):

\[
f(Y_r \mid \eta^{(s)}, Y_{(r)}) \propto \prod_{i=r}^{T} f_i(y_i \mid y_{i-1}, \ldots, y_1, \eta), \tag{19}
\]

which facilitates model assessment.

3. ILLUSTRATION

We analyze trivariate compositional data on motor vehicle production, consisting of the number of motor vehicles produced by the US, Japan, and other countries between 1947 and 1987 (see Grunwald, Raftery and Guttorp, 1993). The movement of the US economy, measured by the percentage change in US Gross National Product (GNP), is used as a covariate.
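The Monte Carlo estimate of the CPO in (17) and the summary in (18) from Section 2.3 can be sketched as follows. This is a minimal illustration that assumes the log densities $\log f(Y_r \mid \eta^{(s)}, Y_{(r)})$ have already been evaluated for each posterior draw; the function names and the small input table are hypothetical. Working on the log scale with a log-sum-exp step is a numerical convenience added here, not something specified in the paper.

```python
import math

def log_cpo(log_densities):
    """log CPO_r from the values log f(Y_r | eta^(s), Y_(r)), s = 1, ..., B.

    Equation (17) is the harmonic mean of the B density values; it is
    evaluated here on the log scale to avoid underflow.
    """
    b = len(log_densities)
    neg = [-v for v in log_densities]          # logs of the reciprocals 1 / f
    m = max(neg)
    log_sum_inv = m + math.log(sum(math.exp(v - m) for v in neg))
    return math.log(b) - log_sum_inv

def log_pseudo_bayes_factor(log_density_table):
    """Sum of log CPO_r over the deleted observations r, as in (18)."""
    return sum(log_cpo(row) for row in log_density_table)

# Hypothetical table: one row per observation r, one column per posterior draw s.
table = [
    [math.log(0.5), math.log(0.25)],
    [math.log(0.2), math.log(0.2)],
]
cpo_first = math.exp(log_cpo(table[0]))   # harmonic mean of 0.5 and 0.25
psbf = log_pseudo_bayes_factor(table)
```

Two competing models can then be compared by computing `psbf` for each from the same data; the model with the larger value is preferred under this criterion.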
Figure 1. Proportion of motor vehicle production from three sources over time (cumulative proportions plotted against year).

Figure 1 shows the changes in the proportions of total motor vehicle production in the three categories over time. Of the three sources, the change in Japanese production is the most marked. Japan accounted for around one third of the total production in 1987, while its contribution in 1947 was only around 2% of the total. The range of the US production is (4796, 12900), indicating growth in production over time, but with a sharp decrease in its contribution to total production. Overall, motor vehicle production appears to be correlated with the US growth rate, and there is evidence of some nonlinear relationship. Visual examination of the scatter plots of production from the three sources versus movement of the US economy suggests that whereas US production is positively correlated with the covariate, the relationship between the covariate and production from Japan and from the other countries is only slightly positive, with a trace of nonlinearity. The observed behavioral differences in the three sources indicate the usefulness of a compositional time series analysis.

Details from fitting the dynamic regression model which describes the effect of US growth rate on the Box-Cox transformed motor vehicle production data from three different
sources are presented here. Model 1 corresponds to the dynamic regression model with normal errors, while Model 2 corresponds to the dynamic regression model with Student-t errors. Details of the posterior estimates for t = 41 are given in Table I.

Figure 3. Convex hull for the intercept and US GNP coefficient under Model 1 (panels plot coefficients of intercept against coefficients of US GNP).

The estimated convex credible regions at two selected time points for the true proportions corresponding to Model 1 are shown in Figure 2. It indicates that the relationship between the proportions of vehicle production in the US, Japan, and the other countries does not change considerably. Figure 3 shows the 95% joint credible sets for the intercept and US GNP under Model 1 at three selected time points. Column 1 corresponds to and column 2 corresponds to. The figure indicates the presence of a slow change in the shape of the credible sets over time. We observed that this change is less severe for Model 2, corresponding to Student-t errors.

4. CONCLUDING REMARKS

A regression model with a dynamic structure for parameter evolution over time is used to analyze compositional time series data. The inherent constraint in compositional data is overcome by the Box-Cox transformation, which is a more general version of the ALR transformation. In our modeling framework, we incorporate flexibility in two ways. The first is via a more general class of transformations from the simplex, and the second is through the dynamic linear model framework. Our model thereby accommodates a possible change in the underlying mechanism.
The Bayesian methodology facilitates parametric inference for the resulting complex model.

It is clear that the Box-Cox transformed vehicle production has a significant negative relationship with the US economic growth rate. Values of PsBF are 528.656 for the normal error model and -576.29 for the Student-t error model, indicating that the normal error assumption provides a better fit. This is substantiated by a plot of the logarithm of CPO ratios; all points in the plot of log(CPO from Model 1/CPO from Model 2) are above 0.

Table I. Posterior Means and Standard Deviations of Parameters under Normal and Student-t errors

Parameter   Model 1 Mean   Model 1 Stdev   Model 2 Mean   Model 2 Stdev
λ           917            0.1699          889            0.1777
λ           393            0.1717          425            0.1748
α           -59            09              62             09
α           -25            09              03             09
β           352            0.2007          356            07
β           593            0.2031          503            0.2151

Figure 2. Credible regions for true proportions from Model 1 (one panel is labelled Time = 20).
REFERENCES

Aitchison, J. (1982) The Statistical Analysis of Compositional Data (with discussion). J. R. Statist. Soc. B, 44, 139-177.
Aitchison, J. (1986) The Statistical Analysis of Compositional Data. London: Chapman and Hall.
Box, G. E. P. and Cox, D. R. (1964) An Analysis of Transformations. J. R. Statist. Soc. B, 26, 211-252.
Chen, M. H. and Dey, D. K. (1998) Bayesian modeling of correlated binary responses via scale mixture of multivariate normal link functions. Sankhya, 60, 322-343.
Dey, D. K. and Birmiwal, L. R. (1991) On Identifying Mixing Density of Scale Mixtures of Normal Distributions. Technical Report, Department of Statistics, University of Connecticut.
Dey, D. K., Chang, H. and Ray, S. C. (1996) A Bayesian Approach in Model Selection for the Binary Response Data. Advances in Econometrics, 11, 145-175.
Dey, D. K., Chen, M. H. and Chang, H. (1997) Bayesian approach for the nonlinear random effects models. Biometrics, 53, 1239-1252.
Geisser, S. (1993) Predictive Inference: An Introduction. London: Chapman and Hall.
Geisser, S. and Eddy, W. (1979) A predictive approach to model selection. J. Amer. Statist. Assoc., 74, 153-160.
Gelfand, A. E. and Dey, D. K. (1994) Bayesian model choice: asymptotics and exact calculations. J. R. Statist. Soc. B, 56, 501-514.
Gelfand, A. E., Dey, D. K. and Chang, H. (1992) Model Determination using Predictive Distribution with Implementation via Sampling-Based Methods. Bayesian Statistics 4: J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith (Eds.), Oxford University Press, 147-167.
Gelfand, A. E. and Smith, A. F. M. (1990) Sampling based approaches to calculating marginal densities. J. Amer. Statist. Assoc., 85, 398-409.
Grunwald, G. K., Raftery, A. E. and Guttorp, P. (1993) Time Series of Continuous Proportions. J. R. Statist. Soc. B, 55, 103-116.
Hastings, W. K. (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97-109.
Iyenger, M. and Dey, D. K. (1998) Box-Cox transformation in Bayesian analysis of compositional data. Environmetrics, 9, 657-671.
Kalman, R. E. (1969) A New Approach to Linear Filtering and Prediction Problems. J. of Basic Engineering, 82, 34-45.
Pai, J. S. and Ravishanker, N. (1996) Bayesian Modelling of ARFIMA Processes by Markov Chain Monte Carlo Methods. J. of Forecasting, 15, 63-82.
Quintana, J. M. and West, M. (1988) Time Series Analysis of Compositional Data. Bayesian Statistics 3: J. M. Bernardo, M. H. DeGroot, D. V. Lindley and A. F. M. Smith (Eds.), Oxford University Press.
Rayens, W. S. and Srinivasan, C. (1991a) Box-Cox Transformations in the Analysis of Compositional Data. J. of Chemometrics, 5, 227-239.
Rayens, W. S. and Srinivasan, C. (1991b) Estimation in Compositional Data Analysis. J. of Chemometrics, 5, 361-374.
West, M. and Harrison, P. J. (1997) Bayesian Forecasting and Dynamic Models, Second edition. New York: Springer.