Compression and Aggregation of Bayesian Estimates for Data Intensive Computing


Under consideration for publication in Knowledge and Information Systems

Ruibin Xi 1, Nan Lin 2, Yixin Chen 3 and Youngjin Kim 4

1 Center for Biomedical Informatics, Harvard Medical School, Boston, MA, USA; 2 Department of Mathematics, Washington University, St. Louis, MO, USA; 3 Department of Computer Science, Washington University, St. Louis, MO, USA; 4 Google Inc., Mountain View, CA, USA.

Abstract. Bayesian estimation is a major and robust estimator for many advanced statistical models. Being able to incorporate prior knowledge in statistical inference, Bayesian methods have been successfully applied in many different fields such as business, computer science, economics, epidemiology, genetics, imaging and political science. However, due to its high computational complexity, Bayesian estimation has been deemed difficult, if not impractical, for large-scale databases, stream data, data warehouses, and data in the cloud. In this paper, we propose a novel compression and aggregation scheme (C&A) that enables distributed, parallel, or incremental computation of Bayesian estimates. Assuming partitioning of a large dataset, the C&A scheme compresses each partition into a synopsis and aggregates the synopses into an overall Bayesian estimate without accessing the raw data. Such a C&A scheme can find applications in OLAP for data cubes, stream data mining, and cloud computing. It saves tremendous computing time since it processes each partition only once, enables fast incremental updates, and allows parallel processing. We prove that the compression is asymptotically lossless in the sense that the aggregated estimator deviates from the true model by an error that is bounded and approaches zero as the data size increases. The results show that the proposed C&A scheme makes OLAP of Bayesian estimates in a data cube feasible. Further, it supports real-time Bayesian analysis of stream data, which can only be scanned once and cannot be permanently retained. Experimental results validate our theoretical analysis and demonstrate that our method can dramatically save time and space costs with almost no degradation of modeling accuracy.

Received xxx Revised xxx Accepted xxx

Keywords: Bayesian estimation; data cubes; OLAP; stream data mining; compression; aggregation.

1. Introduction

In the last few years, there has been active research on compression and aggregation (C&A) schemes for advanced statistical analysis on structured and large-scale data [6, 7, 8, 13, 14, 17, 20, 21, 23, 34]. For a given statistical model, a general C&A scheme partitions a large dataset into segments, processes each segment separately to generate a compressed representation, and aggregates the compressed results into a final model. Key benefits of such a scheme include its support for multi-dimensional data cube analysis, online processing, and distributed processing. The C&A scheme is useful in the following scenarios.

The techniques developed in this paper are useful for data warehousing and the associated on-line analytical processing (OLAP) computing. With our C&A scheme, a Bayesian statistical model for a given data cell can be obtained by aggregating the compressed synopses of relevant lower-level cells, without building the model from the raw data from scratch. Such a scheme allows for fast interactive analysis of multidimensional data to facilitate effective data mining at multiple levels of abstraction.

The proposed C&A scheme enables online Bayesian analysis of real-time data streams. It is challenging to build online statistical models for high-speed data streams, since it is typically not practical to rebuild a complex model every time a new segment of data is received, due to high computational costs and the fact that raw data are not stored in many stream data applications. Our C&A scheme solves this problem by retaining only synopses, instead of raw data, in the system. For each new data segment, we compress it and use our aggregation scheme to efficiently update the model online.

Cloud computing is a major trend for data intensive computing, as it enables scalable processing of massive amounts of data. It is a promising next-generation computing paradigm given its many advantages such as scalability, elasticity, reliability, high availability, and low cost. As data localization is important for efficiency, it is desirable that each processing unit is only responsible for its own segment of local data. The proposed C&A scheme is well suited for performing Bayesian analysis on massive datasets in the cloud. For example, the compression and aggregation phases in a C&A scheme match the mapping and reducing phases, respectively, of the well-known MapReduce algorithmic framework for cloud computing. The C&A scheme allows partitioning and parallel processing of data, and thus enables high-performance statistical analysis in the cloud.

Although there are earlier works to support C&A schemes for statistical inference, most of them are based on maximum likelihood estimation (MLE). In this paper, we propose a C&A scheme for Bayesian estimation, another major estimation approach that is considered superior to MLE in many contexts. The premise of Bayesian statistics is to incorporate prior knowledge, along with a given set of current observations, in order to make statistical inferences. The prior

information could come from previous comparable experiments, from experiences of some experts, or from existing theories. However, it is often very expensive to compute Bayesian estimates, as there generally exists no closed-form solution, and Markov chain Monte Carlo (MCMC) methods such as Gibbs samplers and Metropolis algorithms are often employed. Hence, to process large-scale data (possibly in parallel) and online stream data using Bayesian estimation, fast and effective C&A schemes are desired. C&A schemes for Bayesian estimation have not been studied before.

Earlier works in data cubes [13] support aggregation of simple measures such as sum() and average(). However, the fast development of OLAP technology has led to high demand for more sophisticated data analysis capabilities, such as prediction, trend monitoring, and exception detection of multidimensional data. Oftentimes, simple measures such as sum() and average() become insufficient, and more sophisticated statistical models are desired in OLAP. Recently, some researchers have developed aggregation schemes for more advanced statistical analyses, including parametric models such as linear regression [8, 14], general multiple linear regression [7, 20], logistic regression analysis [34] and predictive filters [7], as well as nonparametric statistical models such as naive Bayesian classifiers [6] and linear discriminant analysis [23]. Along this line, we develop a C&A scheme to support Bayesian estimation.

Bayesian methods are statistical approaches to parameter estimation and statistical inference which use prior distributions over parameters. Bayes' rule provides the framework for combining prior information with sample data. Suppose that f(D|θ) is the probability model of the data D with parameter (vector) θ ∈ Θ and π(θ) is the prior probability density function (pdf) on the parameter space Θ. The posterior distribution of θ given the data D, using Bayes' rule, is given by
$$ f(\theta \mid D) = \frac{f(D \mid \theta)\,\pi(\theta)}{\int_{\Theta} f(D \mid \theta)\,\pi(\theta)\,d\theta}. $$
The posterior mean θ̃ = ∫_Θ θ f(θ|D) dθ is then a Bayesian estimate of the parameter θ. While it is easy to write down the formula for the posterior mean θ̃, a closed form exists only in a few simple cases, such as a normal sample with a normal prior. In practice, MCMC methods are often used to evaluate the posterior mean. However, these algorithms are usually slow, especially for large data sets, making OLAP processing based on them impractical. Furthermore, these MCMC algorithms require the complete data set. In many data mining applications, such as stream data applications and distributed analysis in the cloud, we often encounter the difficulty of not having the complete set of data in advance. One-scan algorithms are required for such applications.

In this paper, we propose a C&A scheme and its associated theory to support high-quality aggregation of Bayesian estimation for statistical models. In the proposed approach, we compress each data segment by retaining only the model parameters and some auxiliary measures. We then develop an aggregation formula that allows us to reconstruct the Bayesian estimate from partitioned segments with a small and asymptotically diminishing approximation error. We further show that the Bayesian estimates and the aggregated Bayesian estimates are asymptotically equivalent.

This paper is organized as follows. In Section 2, we introduce the research problem in the context of data cubes, noting that the general theory and C&A scheme can be applied in other contexts such as stream data mining as well. In Section 3, we review the basics of Bayesian statistics. We develop the C&A scheme and its theory in Section 4 and report experimental results in Section 5. Then, we discuss related works in Section 6 and give conclusions in Section 7.

2. Concepts and Problem Definition

We develop our theory and algorithms for the C&A scheme in the context of data cubes and OLAP. The proposed theory and algorithms can also be used in other contexts, such as stream data mining and cloud computing. We present our results in a data cube context since it assumes a clear and simple structure of data, which facilitates our discussion. In our empirical study, we show results in both data cube and data stream contexts. In this section, we introduce the basic concepts related to data cubes and define our research problem.

2.1. Data cubes

Data cubes and OLAP tools are based on a multidimensional data model. The model views data in the form of a data cube. A data cube is defined by dimensions and facts. Dimensions are the perspectives or entities with respect to which an organization wants to keep records. Usually each dimension has multiple levels of abstraction formed by conceptual hierarchies. For example, country, state, city, and street are four levels of abstraction in a dimension for location.

To perform multidimensional, multi-level analysis, we need to introduce some basic terms related to data cubes. Let D be a relational table, called the base table, of a given cube. The set of all attributes A in D is partitioned into two subsets, the dimensional attributes DIM and the measure attributes M (so DIM ∪ M = A and DIM ∩ M = ∅). The measure attributes depend on the dimensional attributes in D and are defined in the context of a data cube using some typical aggregate functions, such as count(), sum(), avg(), or the Bayesian estimators to be studied here.

A tuple with schema A in a multi-dimensional data cube space is called a cell. Given two distinct cells c_1 and c_2, c_1 is an ancestor of c_2, and c_2 a descendant of c_1, if on every dimensional attribute either c_1 and c_2 share the same value, or c_1's value is a generalized value of c_2's in the dimension's concept hierarchy. A tuple c ∈ D is called a base cell. A base cell does not have any descendant. A cell c is an aggregated cell if it is an ancestor of some base cells. For each aggregated cell, the values of its measure attributes are derived from the set of its descendant cells.

2.2. Aggregation and classification of data cube measures

A data cube measure is a numerical or categorical quantity that can be evaluated at each cell in the data cube space. A measure value is computed for a given cell by aggregating the data corresponding to the respective dimension-value pairs

defining the given cell. Measures can be classified into several categories based on the difficulty of aggregation.

1) An aggregate function is distributive if it can be computed in a distributed manner as follows. Suppose the data is partitioned into n sets. The computation of the function on each partition derives one aggregate value. If the result derived by applying the function to the n aggregate values is the same as that derived by applying the function to all the data without partitioning, the function can be computed in a distributive manner. For example, count() can be computed for a data cube by first partitioning the cube into a set of subcubes, computing count() for each subcube, and then summing up the counts obtained for each subcube. Hence, count() is a distributive aggregate function. For the same reason, sum(), min(), and max() are distributive aggregate functions.

2) An aggregate function is algebraic if it can be computed by an algebraic function with several arguments, each of which is obtained by applying a distributive aggregate function. For example, avg() (average) can be computed by sum()/count(), where both sum() and count() are distributive aggregate functions. min_N(), max_N() and standard_dev() are algebraic aggregate functions.

3) An aggregate function is holistic if there is no constant bound on the storage size needed to describe a sub-aggregate. That is, there does not exist an algebraic function with M arguments (where M is a constant) that characterizes the computation. Common examples of holistic functions include median(), mode(), and rank().

Except for some simple special cases, like a normal sample with a normal prior, Bayesian estimates seem to be holistic measures, since they require the information of all the data points in an aggregated cell for the computation. In this paper, we show that Bayesian estimates are compressible measures [7, 34]. An aggregate function is compressible if it can be computed by a procedure with a number of arguments from lower-level cells, and the number of arguments is independent of the number of tuples in the data cell. In other words, for compressible aggregate functions, we can compress each cell, regardless of its size (i.e., the number of tuples), into a constant number of arguments, and aggregate the function based on the compressed representation. The data compression technique should satisfy the following requirements: (1) the compressed data should support efficient lossless or asymptotically lossless aggregation of statistical measures in a multidimensional data cube environment; and (2) the space complexity of the compressed data should be low and independent of the number of tuples in each cell, as the number of tuples in each cell may be huge. In this paper, we develop a compression and aggregation scheme for Bayesian estimates that supports asymptotically lossless aggregation.
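As a simple illustration of the distinction between these categories, the algebraic measure avg() can be assembled exactly from the distributive synopses (sum, count) kept for each partition. The following minimal Python sketch is our own toy example (not from the paper); it is the simplest instance of the compressed-representation idea used throughout this work.

```python
def partition_synopsis(values):
    """Distributive synopsis of one partition: (sum, count)."""
    return sum(values), len(values)

def aggregate_avg(synopses):
    """avg() is algebraic: it is recovered exactly from the per-partition synopses."""
    total = sum(s for s, _ in synopses)
    count = sum(n for _, n in synopses)
    return total / count

# Example: three partitions of a cell give the same average as the raw data.
parts = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
print(aggregate_avg([partition_synopsis(p) for p in parts]))  # 3.5
```

For Bayesian estimates no such exact reconstruction is available in general, which is why the scheme developed below is only asymptotically lossless.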

3. Bayesian Statistics

Suppose that x_1, ..., x_n are n observations from a probability model f(x|θ), where θ ∈ Θ is the parameter (vector) of the probability model f(x|θ). The prior information in Bayesian statistics is given by a prior distribution π(θ) on the parameter space Θ. Then, under the independence assumption of the observations x_1, ..., x_n given the parameter θ, the posterior distribution f(θ|x_1, ..., x_n) of the parameter θ can be calculated using Bayes' rule,
$$ f(\theta \mid x_1,\ldots,x_n) = \frac{f(x_1,\ldots,x_n \mid \theta)\,\pi(\theta)}{\int_{\Theta} f(x_1,\ldots,x_n \mid \theta)\,\pi(\theta)\,d\theta} = \frac{\prod_{i=1}^{n} f(x_i \mid \theta)\,\pi(\theta)}{\int_{\Theta} \prod_{i=1}^{n} f(x_i \mid \theta)\,\pi(\theta)\,d\theta}, \qquad (1) $$
where f(x_1, ..., x_n|θ) is the joint distribution of x_1, ..., x_n given the parameter θ. Then, we could use the posterior mean θ̃_n as an estimate of the parameter θ, i.e.
$$ \tilde\theta_n = \int_{\Theta} \theta\, f(\theta \mid x_1,\ldots,x_n)\,d\theta = \left(\int_{\Theta} \prod_{i=1}^{n} f(x_i \mid \theta)\,\pi(\theta)\,d\theta\right)^{-1} \int_{\Theta} \theta \prod_{i=1}^{n} f(x_i \mid \theta)\,\pi(\theta)\,d\theta. \qquad (2) $$

MCMC methods are often employed to evaluate the formula (2) because direct evaluation is difficult. These algorithms are based on constructing a Markov chain that has the posterior distribution (1) as its equilibrium distribution. After running the Markov chain for a large number of steps, called burn-in steps, a sample from the Markov chain can be viewed as a sample from the posterior distribution (1). We can then approximate the posterior mean θ̃_n to any accuracy we wish by taking a large enough sample from the posterior distribution (1). We consider the following example [10, 25, 32] to illustrate the Gibbs sampler.

Example 1: 197 animals are distributed multinomially into four categories and the observed data are y = (y_1, y_2, y_3, y_4) = (125, 18, 20, 34). A genetic model specifies cell probabilities
$$ \left(\tfrac{1}{2} + \tfrac{\theta}{4},\ \tfrac{1-\theta}{4},\ \tfrac{1-\theta}{4},\ \tfrac{\theta}{4}\right). $$
Assume that the prior distribution is Beta(1, 1), which is also the uniform distribution on the interval (0, 1) and therefore is a non-informative prior. The posterior distribution of θ is
$$ f(\theta \mid y) \propto (2+\theta)^{y_1} (1-\theta)^{y_2+y_3}\, \theta^{y_4}. $$
It is difficult, though not impossible, to calculate the posterior mean. However, a Gibbs sampler can be easily developed by augmenting the data y. Specifically, let x = (x_1, x_2, x_3, x_4, x_5) such that y_1 = x_1 + x_2, y_2 = x_3, y_3 = x_4 and y_4 = x_5. Assume the cell probabilities for x are
$$ \left(\tfrac{1}{2},\ \tfrac{\theta}{4},\ \tfrac{1-\theta}{4},\ \tfrac{1-\theta}{4},\ \tfrac{\theta}{4}\right). $$
Then, the distribution of y is the marginal distribution of x. The full conditional distribution of θ is f(θ|x_2, y) ∝ θ^{x_2+y_4} (1-θ)^{y_2+y_3}, which is Beta(x_2 + y_4 + 1, y_2 + y_3 + 1). The full conditional distribution of x_2 is f(x_2|y, θ) ∝ (2/(2+θ))^{y_1-x_2} (θ/(2+θ))^{x_2}, i.e. the binomial distribution Binom(y_1, θ/(2+θ)). The Gibbs sampler starts with any value θ^(0) ∈ (0, 1) and iterates the following two steps.

1. Generate x_2^(k) from the full conditional distribution f(x_2|y, θ^(k-1)), i.e. from Binom(125, θ^(k-1)/(2 + θ^(k-1))).
2. Generate θ^(k) from the full conditional distribution f(θ|x_2^(k), y), i.e. from Beta(x_2^(k) + 35, 39).
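As an illustration, a minimal Python (numpy) sketch of this Gibbs sampler is given below. The function name and defaults are ours, and the averaging over post-burn-in draws described in the next paragraph is performed inside the function with s = 1.

```python
import numpy as np

def gibbs_genetic_linkage(y=(125, 18, 20, 34), theta0=0.5,
                          burn_in=1000, n_samples=5000, seed=0):
    """Gibbs sampler for the genetic linkage model of Example 1.

    y      : observed multinomial counts (y1, y2, y3, y4)
    theta0 : starting value theta^(0) in (0, 1)
    Returns the posterior-mean estimate of theta.
    """
    rng = np.random.default_rng(seed)
    y1, y2, y3, y4 = y
    theta = theta0
    draws = []
    for k in range(burn_in + n_samples):
        # Step 1: x2 | y, theta  ~  Binom(y1, theta / (2 + theta))
        x2 = rng.binomial(y1, theta / (2.0 + theta))
        # Step 2: theta | x2, y  ~  Beta(x2 + y4 + 1, y2 + y3 + 1)
        theta = rng.beta(x2 + y4 + 1, y2 + y3 + 1)
        if k >= burn_in:
            draws.append(theta)
    return np.mean(draws)

if __name__ == "__main__":
    print(gibbs_genetic_linkage())  # approximates the posterior mean of theta
```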

Then, we can take the average over θ^(b+s), ..., θ^(b+sn) to get an estimate of θ, where b is a large positive integer and s is a positive integer. The first b iterations are burn-in iterations, and b is usually chosen large enough such that the Markov chain converges after b iterations. When n is large enough, this average will be a very good approximation to the posterior mean. The integer s is used to reduce the correlation between two successive samples and is usually chosen to be small. In our experiment, where we set θ^(0) = 0.5, s = 1, b = 1000 and n = 5,000, the sample average we obtained is very close to the true posterior mean.

4. Compression and Aggregation of Bayesian Estimates

Since the computation of the Bayesian estimate θ̃_n often involves MCMC algorithms, the compression and aggregation of Bayesian estimates are more difficult than for the maximum likelihood estimates (MLE) of coefficients in regression models. In general, it is very difficult to achieve lossless compression for Bayesian estimates, and we have to resort to the asymptotic theory of Bayesian estimation to derive an asymptotically lossless compression scheme. We first review the notion of asymptotically lossless compression representation (ALCR) introduced in [34].

Definition 4.1. In data cube analysis, a cell function g is a function that takes the data records of any cell with an arbitrary size as inputs and maps them into a fixed-length vector as an output. That is,
$$ g(c) = v, \quad \text{for any data cell } c, \qquad (3) $$
where the output vector v has a fixed size.

Suppose that we have a probability model f(x|θ), where x are attributes and θ is the parameter of the probability model. Suppose c_a is a cell aggregated from the component cells c_1, ..., c_k. We define a cell function g_2 to obtain m_i = g_2(c_i), i = 1, ..., k, and use an aggregation function g_1 to obtain an estimate of the parameter θ for c_a by
$$ \tilde\theta = g_1(m_1, \ldots, m_k). \qquad (4) $$
We say that θ̂, an estimate of θ, is an asymptotically losslessly compressible measure if we can find an aggregation function g_1 and a cell function g_2 such that (a) the difference between θ̃ = g_1(m_1, ..., m_k) and θ̂(c_a) tends to zero in probability as the number of tuples in c_a goes to infinity, where m_i = g_2(c_i), i = 1, ..., k; (b) θ̂(c_a) = g_1(g_2(c_a)); and (c) the dimension of m_i is independent of the number of tuples in c_i. The measures m_i are called an ALCR of the cell c_i, i = 1, ..., k.

In the following, we develop an ALCR for the Bayesian estimate in (2) based on its asymptotic properties. We show that the asymptotic distributions of the estimates obtained

from aggregation of the ALCRs of the component cells and of the Bayesian estimates in the aggregated cell are the same, and further show that the difference between them approaches zero in probability as the number of tuples in c_a goes to infinity. Further, the space complexity of the ALCR is independent of the number of tuples. Therefore, Bayesian estimates are asymptotically losslessly compressible measures.

4.1. Compression and aggregation scheme

Consider aggregating K cells at a lower level into one aggregated cell at a higher level. Suppose that there are n_k observations in the k-th component cell c_k. Let {x_{k,1}, ..., x_{k,n_k}} be the observations in the component cell c_k. Note that the observations x_{k,j} (j = 1, ..., n_k) could be multidimensional. Based on the observations in the k-th component cell c_k, we have the Bayesian estimate
$$ \tilde\theta_{k,n_k} = \left(\int_{\Theta} \prod_{j=1}^{n_k} f(x_{k,j} \mid \theta)\,\pi(\theta)\,d\theta\right)^{-1} \int_{\Theta} \theta \prod_{j=1}^{n_k} f(x_{k,j} \mid \theta)\,\pi(\theta)\,d\theta. \qquad (5) $$
We propose the following asymptotically lossless compression technique for Bayesian estimation.

Compression into ALCR. For each base cell c_k, k = 1, ..., K, at the lowest level of the data cube, calculate the Bayesian estimate θ̃_{k,n_k} using (5). Save ALCR = (θ̃_{k,n_k}, n_k) in each component cell c_k.

Aggregation of ALCR. Calculate the aggregated ALCR (θ̃_a, n_a) using the following formula:
$$ n_a = \sum_{k=1}^{K} n_k, \qquad \tilde\theta_a = n_a^{-1} \sum_{k=1}^{K} n_k\, \tilde\theta_{k,n_k}. $$

Such a process can be used to aggregate base cells at the lowest level as well as cells at intermediate levels. For any non-base cell, however, the aggregated estimate θ̃ is used in place of θ̃_{k,n_k} in its ALCR.
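To make the scheme concrete, here is a minimal Python sketch of the two steps above. The function names and the generic posterior_mean routine are ours, not from the paper; any estimator that returns the posterior mean of a partition, such as the Gibbs sampler sketched in Section 3, could be plugged in.

```python
import numpy as np

def compress(partition, posterior_mean):
    """Compress one partition (cell) into its ALCR = (theta_k, n_k).

    partition      : array of raw observations for this cell
    posterior_mean : routine returning the Bayesian estimate for the partition
                     (e.g. an MCMC-based posterior-mean estimator)
    """
    theta_k = np.asarray(posterior_mean(partition))
    return theta_k, len(partition)

def aggregate(alcrs):
    """Aggregate ALCRs of component cells: theta_a = (1/n_a) * sum_k n_k * theta_k."""
    n_a = sum(n_k for _, n_k in alcrs)
    theta_a = sum(n_k * theta_k for theta_k, n_k in alcrs) / n_a
    return theta_a, n_a
```

The aggregated pair (θ̃_a, n_a) has the same form as a base-cell ALCR, so it can itself be aggregated further up the cube or merged with the ALCR of a newly arriving data segment.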

4.2. Compressibility of Bayesian estimation

We now show that (θ̃_a, n_a) is an ALCR. We denote the Bayesian estimate for the aggregated cell by θ̃_{n_a} and the corresponding estimate derived from the ALCR compression and aggregation by θ̃_a. We will show that the asymptotic distributions of θ̃_{n_a} and θ̃_a are the same and that their difference tends to zero in probability.

Suppose that Θ ⊆ R^p is an open subset of R^p. We give a detailed proof of the theorem only for the case p = 1 and briefly describe the proof for the multidimensional case. We make the following regularity assumptions on f_θ(·) = f(·|θ) before giving the main theorem.

(C1) {x : f_θ(x) > 0} is the same for all θ ∈ Θ.

(C2) L(θ, x) = log f_θ(x) is thrice differentiable with respect to θ in a neighborhood U_{δ_0}(θ_0) = (θ_0 − δ_0, θ_0 + δ_0) of θ_0 ∈ Θ. If L′, L″ and L^{(3)} stand for the first, second and third derivatives, then E_{θ_0} L′(θ_0, X) and E_{θ_0} L″(θ_0, X) are both finite, and
$$ \sup_{\theta \in U_{\delta_0}(\theta_0)} |L^{(3)}(\theta, x)| \le M(x) \quad \text{with} \quad E_{\theta_0} M(X) < \infty. $$

(C3) Interchange of the order of expectation with respect to θ_0 and differentiation at θ_0 is justified, so that E_{θ_0} L′(θ_0, X) = 0 and E_{θ_0} L″(θ_0, X) = −E_{θ_0}[L′(θ_0, X)]².

(C4) I_{θ_0} = E_{θ_0}[L′(θ_0, X)]² > 0.

(C5) If X_1, ..., X_n are random variables sampled from f_{θ_0} and L_n(θ) = Σ_{i=1}^n L(θ, X_i), then for any δ > 0, there exists an ε > 0 such that
$$ P_{\theta_0}\Big\{ \sup_{|\theta - \theta_0| > \delta} [L_n(\theta) - L_n(\theta_0)] \le -\varepsilon \Big\} \to 1. $$

(C6) The prior has a density π(θ) with respect to the Lebesgue measure, which is continuous and positive at θ_0. Furthermore, π(θ) satisfies ∫_Θ |θ| π(θ) dθ < ∞.

These conditions guarantee the consistency and asymptotic normality of the posterior mean and are the same as the conditions in [12].

Theorem 1. Suppose {f_θ : θ ∈ Θ} satisfies Conditions (C1)-(C5) and the prior distribution satisfies Condition (C6). Let X_{k,1}, ..., X_{k,n_k} (k = 1, ..., K) be random variables from the distribution f_{θ_0}, let θ̃_{k,n_k} be the posterior mean (2) based on the random variables X_{k,1}, ..., X_{k,n_k}, and let θ̃_a = n_a^{-1} Σ_{k=1}^K n_k θ̃_{k,n_k} be the aggregated Bayesian estimate. Then we have
$$ \sqrt{n_a}\,(\tilde\theta_a - \theta_0) \xrightarrow{d} N(0, I_{\theta_0}^{-1}) \quad \text{as} \quad m_K = \min\{n_1, \ldots, n_K\} \to \infty. $$

Proof. Since {f_θ : θ ∈ Θ} and π(θ) satisfy Conditions (C1)-(C6), from the asymptotic normality theorem in [12] we have
$$ \sqrt{n_k}\,(\tilde\theta_{k,n_k} - \theta_0) \xrightarrow{d} N(0, I_{\theta_0}^{-1}) \quad \text{as} \quad n_k \to \infty. $$
Let Z_{k,n_k} = √n_k (θ̃_{k,n_k} − θ_0) and let φ_{k,n_k}(t) = E[e^{itZ_{k,n_k}}] be its characteristic function. Denote v² = I_{θ_0}^{-1}. Then, by Lévy's Continuity Theorem (see, for example, [9] and [31] among others), φ_{k,n_k}(t) converges to exp(−v²t²/2) uniformly in every finite interval, where exp(−v²t²/2) is the characteristic function of the normal distribution N(0, v²). On the other hand, the characteristic function of

the random variable Z_{n_a} = √n_a (θ̃_a − θ_0) is
$$ \varphi_{n_a}(t) = E\big[\exp\{it\sqrt{n_a}\,(\tilde\theta_a - \theta_0)\}\big] = E\Big[\exp\Big\{it\sqrt{n_a}\, n_a^{-1} \sum_{k=1}^{K} n_k(\tilde\theta_{k,n_k} - \theta_0)\Big\}\Big] = \prod_{k=1}^{K} E\big[\exp\{it\, n_a^{-1/2} n_k (\tilde\theta_{k,n_k} - \theta_0)\}\big] = \prod_{k=1}^{K} \varphi_{k,n_k}(n_k^{1/2} n_a^{-1/2} t). $$
Then, we have
$$ \big|\log[\varphi_{n_a}(t)] + v^2 t^2/2\big| = \Big|\sum_{k=1}^{K} \Big\{\log[\varphi_{k,n_k}(n_k^{1/2} n_a^{-1/2} t)] + \frac{n_k v^2 t^2}{2 n_a}\Big\}\Big| \le \sum_{k=1}^{K} \Big|\log[\varphi_{k,n_k}(n_k^{1/2} n_a^{-1/2} t)] + \frac{v^2 (n_k^{1/2} n_a^{-1/2} t)^2}{2}\Big|. $$
Since φ_{k,n_k}(t) converges to exp(−v²t²/2) uniformly in every finite interval, log[φ_{k,n_k}(t)] converges to −v²t²/2 uniformly in every finite interval. Then for any ε > 0, there exists an N_k(ε) > 0 such that when n_k > N_k(ε), we have |log[φ_{k,n_k}(τ)] + v²τ²/2| ≤ ε/K for all |τ| ≤ |t|. Take M_K(ε) = max{N_1(ε), ..., N_K(ε)}. Since n_k^{1/2} n_a^{-1/2} |t| ≤ |t|, we have |log[φ_{n_a}(t)] + v²t²/2| ≤ K · ε/K = ε for m_K ≥ M_K(ε). Therefore, φ_{n_a}(t) converges to exp(−v²t²/2) for all t ∈ R, and the theorem follows by using Lévy's Continuity Theorem again.

To prove a similar result for p > 1, we need to replace Conditions (C2)-(C4) with the following conditions.

(C2′) L(θ, x) = log f_θ(x) is thrice differentiable with respect to θ in a neighborhood U_{δ_0}(θ_0) = {θ : |θ − θ_0| < δ_0} of θ_0 ∈ Θ. If L′_i, L″_{ij} and L^{(3)}_{ijk} stand for the first, second and third partial derivatives with respect to the i-th, j-th and k-th components of θ, then E_{θ_0} L′_i(θ_0, X) and E_{θ_0} L″_{ij}(θ_0, X) are both finite and
$$ \sup_{\theta \in U_{\delta_0}(\theta_0)} |L^{(3)}_{ijk}(\theta, x)| \le M(x) \quad \text{with} \quad E_{\theta_0} M(X) < \infty. $$

(C3′) Interchange of the order of expectation with respect to θ_0 and differentiation at θ_0 is justified, so that E_{θ_0} L′(θ_0, X) = 0 and E_{θ_0} L″(θ_0, X) = −E_{θ_0}[L′(θ_0, X) L′(θ_0, X)^T], where L′ is the gradient vector with i-th component L′_i, L″ is the Hessian matrix with L″_{ij} as its (i, j)-th component, and L′(θ_0, X)^T is the transpose of the column vector L′(θ_0, X).

(C4′) I_{θ_0} = E_{θ_0}[L′(θ_0, X) L′(θ_0, X)^T] is a positive definite matrix.

Table 1. Success rates for different groups of stone size.

                Treatment A       Treatment B
Small Stone     93% (81/87)       87% (234/270)
Large Stone     73% (192/263)     69% (55/80)
Both            78% (273/350)     83% (289/350)

Theorem 2. Under Conditions (C1), (C2′)-(C4′), (C5) and (C6), we have
$$ \sqrt{n_a}\,(\tilde\theta_a - \theta_0) \xrightarrow{d} N(0, I_{\theta_0}^{-1}) \quad \text{as} \quad m_K = \min\{n_1, \ldots, n_K\} \to \infty. $$

Proof. Let θ̂_{k,n_k} be the MLE of the parameter θ based on the data X_{k,1}, ..., X_{k,n_k}. From Theorem 5.1 in [18], we have
$$ \sqrt{n_k}\,(\hat\theta_{k,n_k} - \theta_0) \xrightarrow{d} N(0, I_{\theta_0}^{-1}) \quad \text{as} \quad n_k \to \infty. $$
On the other hand, from Theorem 2.1 in [4], the difference between the Bayesian estimator θ̃_{k,n_k} and the MLE satisfies n_k^{1/2}(θ̃_{k,n_k} − θ̂_{k,n_k}) → 0 almost surely. Hence, we have
$$ \sqrt{n_k}\,(\tilde\theta_{k,n_k} - \theta_0) \xrightarrow{d} N(0, I_{\theta_0}^{-1}) \quad \text{as} \quad n_k \to \infty. $$
The remaining part of the proof is similar to the proof of Theorem 1 and is omitted.

Corollary 1. Under the conditions of Theorem 1 or 2, the difference between the estimates θ̃_{n_a} and θ̃_a approaches 0 in probability.

Proof. From Theorem 1, θ̃_a approaches θ_0 in probability as m_K goes to infinity. The Bayesian estimate θ̃_{n_a} also approaches θ_0 in probability. Therefore, the difference between θ̃_{n_a} and θ̃_a converges to 0 in probability.

Corollary 1 means that the difference between θ̃_{n_a} and θ̃_a becomes smaller as more data become available. Hence, the estimate θ̃_a is a good approximation to θ̃_{n_a}, with a diminishing error when the dataset is large.

4.3. Detection of non-homogeneous data

Theorem 1 and Corollary 1 rely on the assumption that the data from different subcubes come from the same probability model, i.e. that the data are homogeneous. Aggregation of non-homogeneous data can lead to misleading results, and Simpson's paradox [1] may occur. Therefore, it is important to develop tools for testing non-homogeneity. The test of non-homogeneity should support OLAP analysis and hence should depend only on the compressed measures, or the ALCRs, of the subcubes. The ALCR defined in Section 4.1 is insufficient for the test of non-homogeneity, and one additional measure is needed. Let v_{k,n_k} be the posterior variance matrix based on the observations in the k-th component cell, i.e.
$$ v_{k,n_k} = C^{-1} \int_{\Theta} (\theta - \tilde\theta_{k,n_k})(\theta - \tilde\theta_{k,n_k})^T \prod_{j=1}^{n_k} f(x_{k,j} \mid \theta)\,\pi(\theta)\,d\theta, \qquad (6) $$

where C = ∫_Θ ∏_{j=1}^{n_k} f(x_{k,j}|θ) π(θ) dθ is the normalizing constant. If the parameter θ is p-dimensional, the measure v_{k,n_k} is a p × p matrix. We propose the following modified compression and aggregation scheme.

Compression into ALCR. For each base cell c_k, k = 1, ..., K, at the lowest level of the data cube, calculate the Bayesian estimate θ̃_{k,n_k} using (5) and the posterior variance v_{k,n_k} using (6). Save ALCR = (θ̃_{k,n_k}, v_{k,n_k}, n_k) in each component cell c_k.

Aggregation of ALCR. Calculate the aggregated ALCR (θ̃_a, ṽ_a, n_a) using the following formula:
$$ n_a = \sum_{k=1}^{K} n_k, \qquad \tilde\theta_a = n_a^{-1} \sum_{k=1}^{K} n_k\, \tilde\theta_{k,n_k}, \qquad \tilde v_a = n_a^{-2} \sum_{k=1}^{K} n_k^2\, v_{k,n_k}. $$
For any non-base cell, θ̃_a and ṽ_a are used in place of θ̃_{k,n_k} and v_{k,n_k} in its ALCR.

Suppose that c_1 and c_2 are two subcubes and (θ̃_1, ṽ_1, n_1) and (θ̃_2, ṽ_2, n_2) are their ALCRs, respectively. By Theorem 1, √n_k (θ̃_k − θ_0) approximately follows the normal distribution N(0, I_{θ_0}^{-1}), or equivalently θ̃_k − θ_0 (k = 1, 2) approximately follows the normal distribution N(0, n_k^{-1} I_{θ_0}^{-1}). Using ṽ_k as the estimate of n_k^{-1} I_{θ_0}^{-1}, it follows that
$$ t = (\tilde\theta_1 - \tilde\theta_2)^T (\tilde v_1 + \tilde v_2)^{-1} (\tilde\theta_1 - \tilde\theta_2) $$
approximately follows a χ²_p distribution. Hence, we can use the statistic t to test non-homogeneity.
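As an illustration, a minimal Python sketch of this test for two ALCRs is given below (it assumes numpy and scipy are available; the function name is ours).

```python
import numpy as np
from scipy.stats import chi2

def homogeneity_test(theta1, v1, theta2, v2):
    """Chi-square test of non-homogeneity between two subcubes.

    theta1, theta2 : aggregated Bayesian estimates (length-p vectors)
    v1, v2         : aggregated posterior variance matrices (p x p)
    Returns the statistic t and its approximate p-value under chi^2 with p d.o.f.
    """
    diff = np.asarray(theta1) - np.asarray(theta2)
    t = float(diff @ np.linalg.inv(np.asarray(v1) + np.asarray(v2)) @ diff)
    p_value = chi2.sf(t, df=diff.size)  # survival function of chi^2_p
    return t, p_value
```

A small p-value suggests that the two subcubes are inhomogeneous and should not be aggregated.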

We use the kidney stone data considered in [34] as an example of the test of non-homogeneity. The data are from a medical study [5, 16] comparing the success rates of two treatments for kidney stones. The two treatments are open surgery (treatment A) and percutaneous nephrolithotomy (treatment B). Table 1 shows the effects of both treatments under different conditions. It reveals that treatment A has a higher success rate than treatment B for both the small stone and large stone groups. However, after aggregating over the two groups, treatment A has a lower success rate than treatment B.

Let S be a binary random variable that indicates whether a treatment succeeds or not, and let T be the type of treatment that a patient receives. We use p_A and p_B to denote the success rates of treatments A and B, respectively, and α_A the probability that a patient receives treatment A. We have the probability model
$$ Pr(S, T \mid p_A, p_B, \alpha_A) = \big[p_A^{S} (1-p_A)^{1-S} \alpha_A\big]^{I(T=A)} \big[p_B^{S} (1-p_B)^{1-S} (1-\alpha_A)\big]^{I(T=B)}, $$
where I(·) is the indicator function. Set the priors for p_A, p_B, α_A as the non-informative prior Beta(1, 1). Given observations D = {(s_1, t_1), ..., (s_n, t_n)}, the posterior distribution of (p_A, p_B, α_A) is
$$ f(p_A, p_B, \alpha_A \mid D) \propto p_A^{n_{As}} (1-p_A)^{n_{Af}}\, p_B^{n_{Bs}} (1-p_B)^{n_{Bf}}\, \alpha_A^{n_A} (1-\alpha_A)^{n_B}, $$
where n_{As} = Σ_{i=1}^n s_i I(t_i = A), n_{Af} = Σ_{i=1}^n (1 − s_i) I(t_i = A), n_{Bs} = Σ_{i=1}^n s_i I(t_i = B), n_{Bf} = Σ_{i=1}^n (1 − s_i) I(t_i = B), n_A = Σ_{i=1}^n I(t_i = A) and n_B = Σ_{i=1}^n I(t_i = B). Therefore, the posterior distribution is the product of three independent Beta distributions.

Denote θ = (p_A, p_B, α_A). Based on the small stone group data, the Bayesian estimate is θ̃_s = (0.92, 0.86, 0.24) with a diagonal posterior variance matrix v_s. Based on the large stone group, the Bayesian estimate is θ̃_l = (0.73, 0.68, 0.77) with a diagonal posterior variance matrix v_l. Using these results, the test statistic t = (θ̃_s − θ̃_l)^T (v_s + v_l)^{-1} (θ̃_s − θ̃_l) is highly significant. Therefore, it is highly likely that the two data sets are inhomogeneous and we should not aggregate them together.

5. Experimental Evaluation

We perform experimental studies to evaluate the proposed scheme. We first evaluate the accuracy of the proposed C&A scheme. Then, we report the time and quality performance of the C&A scheme in data cube and stream data mining contexts. Finally, we apply the C&A scheme to a real data set and show that the aggregated Bayesian estimates can closely approximate the Bayesian estimates.

5.1. Quality of the proposed compression and aggregation scheme

In this subsection, we use the mixture of transition models to evaluate the quality of the proposed C&A scheme. The mixture of transition models has been used to model users visiting a website [3, 26, 27], unsupervised training of robots [24] and the dynamics of a military scenario [30].

Transition models are useful in describing time series that have only a finite number of states. The observations of a transition model are finite-state Markov chains of finite length. For example, the sequence (A, B, A, C, B, B, C) could be a realization of a 3-state first-order Markov chain, where the transition probability at time t only depends on the state of the Markov chain at time t but not on the previous history. If all the observations are realizations from the same transition model, one can readily get a closed form of the posterior mean of the parameters. However, the set of sequences may be heterogeneous and the sequences may come from several different transition models, in which case the mixture of transition models is useful in estimating the transition matrices and clustering the observed sequences.

Consider a data set of N sequences, D = {x_1, ..., x_N}, that are realizations from some s-state discrete first-order Markov process. The sequences are possibly of different lengths. Assume that each sequence comes from one of m transition models. Let (l)P_{ij} be the element (i, j) of the l-th probability transition matrix, i.e. the transition probability from state i to state j for a process in cluster l. Let (l)p_i be the i-th element of the initial state distribution of processes from cluster l. Further assume that α_l is the probability that a process is from cluster l. Denote x_k^0 as the initial state of the sequence x_k and n_{ij}^{(k)} as the number of times that the process x_k transitioned from state i to state j. The mixture of transition models is
$$ f(x_k \mid \theta) = \sum_{l=1}^{m} \alpha_l \prod_{i=1}^{s} {}^{(l)}p_i^{\,I(x_k^0 = i)} \prod_{i=1}^{s} \prod_{j=1}^{s} {}^{(l)}P_{ij}^{\,n_{ij}^{(k)}}, $$
where θ is the parameter vector consisting of (l)P_{ij}, (l)p_i and α_l as its elements, and I(·) is the indicator function. The prior distributions for the parameter vectors α = (α_1, ..., α_m), (l)p = ((l)p_1, ..., (l)p_s) and (l)P_i = ((l)P_{i1}, ..., (l)P_{is}) are Dirichlet priors with all parameters equal to 1. The Dirichlet priors used here are non-informative priors. The posterior mean has no closed form for this Bayesian model. However, by introducing missing data δ_l^{(k)}, a 0/1 unobserved indicator for whether process k belongs to cluster l, one can readily develop a Gibbs sampler [26, 27].

We apply the C&A scheme to a mixture of transition models. In the experiment, the number of clusters is set to 3 and the Markov chains are 2-state chains. We generated 10,000 chains from the mixture of transition models and each chain is of length 30. The underlying true parameters are set as:
1. initial probabilities: (1)p = (0.2, 0.8), (2)p = (0.9, 0.1), (3)p = (0.4, 0.6);
2. transition matrices (1)P, (2)P and (3)P (2 × 2 matrices);
3. the probability vector α = (0.2, 0.5, 0.3).

We partition the entire data set into K = 1, 10, 20, 100 cells with equal numbers of observations and then use our C&A scheme to approximately compute the posterior mean for the entire data set. We run the Gibbs sampler for 11,000 iterations and set the number of burn-in iterations to 1,000. Note that the

estimate corresponding to K = 1 is just the posterior mean. Let (l)p̃, (l)P̃ and α̃ be the estimates of (l)p, (l)P and α (l = 1, 2, 3), respectively. We define the maximum absolute deviations (MAD) as D(p̃, p) = max{|(l)p̃_i − (l)p_i| : l = 1, 2, 3, i = 1, 2}, D(P̃, P) = max{|(l)P̃_{ij} − (l)P_{ij}| : l = 1, 2, 3, i, j = 1, 2} and D(α̃, α) = max{|α̃_l − α_l| : l = 1, 2, 3}.

Figure 1 shows the MAD of the aggregated estimates from the different partitions. The solid line is for the MAD D(p̃, p), the dashed line is for D(P̃, P), and the dotted line is for D(α̃, α). As seen from the low MAD values (all at most 0.005), the estimates under the various numbers of partitions all have very small errors. The evaluation shows that the accuracy of the aggregated estimates from our C&A scheme is almost as good as the accuracy of the original Bayesian estimates.

Fig. 1. MAD of the aggregated estimates with a varying number of partitions K, where the solid, dashed and dotted lines correspond to the initial probabilities (l)p, the transition matrices (l)P, and the parameter α in the mixture of transition models, respectively.

5.2. Performance on data streams

In this experiment, we apply our aggregation method to data streams. The Bayesian model under consideration is the linear model with 5 predictors x_1, ..., x_5, i.e.
$$ y = \beta_0 + \sum_{i=1}^{5} \beta_i x_i + \varepsilon, $$
where ε is the error term. In the experiment, we set the true parameters β = (β_0, ..., β_5) = (0, 1, 2, 3, 4, 5) and the total number of observations N to 5 million. We generate the covariates x_i (i = 1, ..., 5) from the standard normal distribution, generate the error term ε from N(0, σ² = 4), and calculate the response y from the above equation. The priors of the parameters in the Bayesian model are the flat priors, i.e. π(β_i) ∝ 1 (i = 0, ..., 5) and π(σ²) ∝ 1/σ². The Gibbs sampler can then be easily developed.

We update our model for every 1000 new data records. In our method, whenever we receive 1000 new data records, we compute their ALCR, update the Bayesian linear model by aggregating the ALCR with the previous ALCRs, and discard the raw data. We compare the performance of our method to a naive method, which stores all the stream data and uses the raw data to update the model for every 1000 new data records. We run the Gibbs sampler with 1000 burn-in iterations and set s to 5.
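Although the paper does not spell out the sampler, a minimal numpy sketch for this flat-prior linear model is given below; it uses the standard full conditionals for this model (β given σ² and y is normal around the least-squares fit, and σ² given β and y is inverse-gamma). The function name and defaults are ours, and the thinning parameter s is omitted for brevity.

```python
import numpy as np

def gibbs_blm(X, y, n_iter=6000, burn_in=1000, seed=0):
    """Gibbs sampler for y = X beta + eps, eps ~ N(0, sigma^2), under the
    flat priors pi(beta) proportional to 1 and pi(sigma^2) proportional to 1/sigma^2.

    X : (n, p) design matrix (include a column of ones for the intercept)
    y : (n,) response vector
    Returns the posterior-mean estimate of beta.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y          # least-squares estimate
    sigma2 = np.var(y - X @ beta_hat)     # starting value
    draws = []
    for it in range(n_iter):
        # beta | sigma^2, y  ~  N(beta_hat, sigma^2 (X^T X)^{-1})
        beta = rng.multivariate_normal(beta_hat, sigma2 * XtX_inv)
        # sigma^2 | beta, y  ~  Inverse-Gamma(n/2, RSS(beta)/2)
        rss = np.sum((y - X @ beta) ** 2)
        sigma2 = 1.0 / rng.gamma(shape=n / 2.0, scale=2.0 / rss)
        if it >= burn_in:
            draws.append(beta)
    return np.mean(draws, axis=0)
```

In the streaming setting, such a sampler is run only on each new 1000-record segment; the segment's posterior mean and size form its ALCR, which is then merged with the previous ALCRs by the weighted average of Section 4.1.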

Figure 2a shows the MADs between the aggregated estimate β̃ (dashed line), the estimate from the naive method β̂ (solid line) and the true parameter β. Figure 2b shows the MAD between β̃ and β̂. Figure 2c gives the computational time used for updating the parameter estimates using our C&A scheme (dashed line) and the naive method (solid line). We see that, compared to the naive method, the C&A method gives almost identical accuracy, but saves a tremendous amount of computing time. In fact, from Figure 2c, we see that the C&A method uses a nearly constant time to perform each online update, while the naive method uses more time as more data accumulate. It is clear that the C&A method is more suitable for stream data mining.

Fig. 2. Comparison of the C&A method (dotted lines) and direct method (solid lines) in data streams: (a) MAD between β̃, β̂ and β_0; (b) MAD between β̃ and β̂; (c) update time.

5.3. Performance on data cubes

Experiment 1. In this experiment, we study the efficiency and quality of the compression and aggregation scheme for aggregated cells in data cubes. The Bayesian model under consideration is again the mixture of transition models. The two dimensions are time and location. Since the MCMC algorithm for the mixture of transition models is highly time-consuming even for moderate-sized data, we consider a relatively small data cube in this experiment. We have 20 months of records in the time dimension and 50 states in the location dimension. In practice, the data can be the records of a website that logs users' visits to the website. The location dimension can be the IP address of the user. For each state in each month, we have 500 observations, i.e. we have 500 users' records. Hence, we have 500,000 observations in total. The observations are sequences that record users' visiting paths in the website. As in Section 5.1, the number of clusters is set to 3 and the Markov chains are 2-state chains. The underlying true parameters are also set as in Section 5.1.

We compare our ALCR method to the direct Bayesian estimation method, which directly uses raw data to calculate the Bayesian estimates of the parameters, by comparing their computing time for handling 100 randomly generated queries. To save the computing time of the direct Bayesian estimation method, the aggregated cells that the queries ask for can have at most 200 base cells. More specifically, to generate a query, we first randomly select a number D from {1, ..., 200}, and then we randomly select D cells from the 1000 base cells (t_i, l_j) (i = 1, ..., 20, j = 1, ..., 50). The corresponding query asks for parameter estimates of the mixture of transition models based on the data of the selected D base cells. For example, assume that D is randomly selected as 3 and the base cells are chosen as c_1 = (t_1, l_1), c_5 = (t_5, l_5), c_30 = (t_30, l_30). Then, the aggregated estimate of the aggregated cell c_a = c_1 ∪ c_5 ∪ c_30 is calculated by aggregating the ALCRs of the base cells c_1, c_5 and c_30; the Bayesian estimate is directly calculated with Gibbs sampling based on the raw data of the aggregated cell c_a. We run the Gibbs sampler for 6,000 iterations and set the number of burn-in iterations to 1,000. Table 2 shows the time with and without using compression, respectively.

Table 2. Comparison of the computational time in Experiment 1.

                    C&A method       direct method
Compression         1,403 minutes    N/A
Query processing    0.1 minute       19,049 minutes

The first row shows the computational time for compression and the second row shows the aggregation time for all these 100 queries. Without ALCR compression, the aggregation time is the time to compute Bayesian estimates directly from the raw data in the selected cells. It is obvious that our method saves a huge amount of computational time when handling OLAP queries in a data cube.

Figure 3 compares the MADs of the estimates for each query from the ALCR method and the direct method. The dotted lines are for the ALCR method and the solid lines are for the direct method. Figures 3(a), (b) and (c) are the MADs of the estimates for the initial probabilities (l)p, the transition matrices (l)P, and the probabilities α, respectively. The queries are ordered by their sizes, i.e. by the number of base cells in the queries. Figure 3 shows that the estimates from the ALCR method tend to have larger MAD than the estimates from the direct method when the size of the query is large, especially for the estimates of the initial probabilities, although in general all the MADs for both methods are very small.

Fig. 3. Comparison of the C&A method (dotted lines) and direct method (solid lines) in Experiment 1: (a) MADs for (l)p; (b) MADs for (l)P; (c) MADs for α.

Figure 4 shows the MAD between the original Bayesian estimates and the ALCR-based estimates. The queries are ordered by their sizes. The differences of the initial probability estimates are generally larger compared to the estimates of the other parameters, but overall the two estimates are close.

Experiment 2. In this experiment, we consider the Bayesian estimator of the linear regression model and compare the computational efficiency and the accuracy of the Bayesian estimator and the C&A estimator in data cubes. The model under consideration is the same as in Section 5.2, but the underlying true parameter β was set as (1, 2, 3, 4, 5, 6). The data cube is a 6-dimensional data cube and the dimension sizes are 50, 120, 5, 4, 3 and 2, respectively. Thus, the data cube contains 50 × 120 × 5 × 4 × 3 × 2 = 720,000 base cells. To introduce more variation, the number of observations in each base cell was sampled uniformly from {100, ..., 1000} and the variance of the error term was sampled from the chi-square distribution with 2 degrees of freedom. In total, the data cube has around 50 GB of raw data. We randomly generated 2000 queries and set a maximum number of base cells per query. The procedure for generating the queries is similar to that in Experiment 1. We fixed the total number of iterations and the number of burn-in iterations of the Gibbs sampler. Table 3 shows the computation time for the two methods. Again, we see that the aggregation method saves a large amount of computational time compared with the direct Bayesian estimation

method. The accuracy of the estimates based on the ALCR method is also similar to that of the original Bayesian estimates (Figure 5).

Table 3. Comparison of the computational time in Experiment 2.

                    C&A method     direct method
Compression         645 minutes    N/A
Query processing    1 minute       1,779 minutes

Fig. 4. MAD between the original and the aggregated estimates in Experiment 1. The solid, dashed and dotted lines are MADs for the initial probabilities (l)p, the transition matrices (l)P, and the parameter α in the mixture of transition models, respectively.

Fig. 5. Comparison of the accuracy of the Bayesian estimates and the C&A estimates in Experiment 2: (a) MAD of the Bayesian estimates; (b) MAD of the ALCR estimates. The queries are ordered by the base cell number.

5.4. Application to a real data set

In this section, we apply our compression and aggregation scheme to the Behavioral Risk Factor Surveillance System (BRFSS) survey data [11]. The BRFSS, administered by the Centers for Disease Control and Prevention, is an ongoing data collection program designed to measure behavioral risk factors in the adult population. The BRFSS collects surveillance data on risk behaviors through monthly telephone interviews of people in the 50 states and 5 districts of the United States of America. After filtering records with missing data, this data set has around 1.2 million data points.

We are interested in modeling the variable body mass index (BMI4). The variable BMI4 can take values from 1 to 9998 and we view it as a continuous variable. The explanatory variables are SEX, AGE, EXERANY2, DIABETE2, DRNKANY4, RFSMOK3 and EDUCAG. The variables EXERANY2, DIABETE2, DRNKANY4 and RFSMOK3 describe whether an interviewee has had any kind of exercise during the past month, was told by a doctor that he/she has diabetes, has been drinking alcoholic beverages during the past month, and is a smoker, respectively. The variable EDUCAG indicates the completed education

level of an interviewee. We stratify this variable into two levels, high school or lower and above high school. The variables SEX and AGE are the sex and age of an interviewee. For notational simplicity, denote Y as the response variable BMI4, and X_1, ..., X_7 as the seven explanatory variables. The model under consideration is the following linear model
$$ Y = \beta_0 + \sum_{i=1}^{7} \beta_i X_i + \varepsilon, $$
where ε is the error term. We compare the original Bayesian parameter estimate β̂ with the aggregated estimate β̃. To accommodate the different magnitudes of the estimates β̂_i, we use the mean relative difference
$$ \frac{1}{8} \sum_{i=0}^{7} |\tilde\beta_i - \hat\beta_i| / |\hat\beta_i| $$
as a measure of the accuracy of the aggregated estimate β̃. The data set in each year can be partitioned into 12 subsets by month, and each subset can be further partitioned by state. We compute the ALCR for each state in each month, and then aggregate the ALCRs. For the data in each month, we can get the aggregated estimates over the states. Figure 6 shows the mean relative difference of these aggregated estimates. The relative differences are always less than 0.04, which suggests the high accuracy of the ALCR method.

Fig. 6. The mean relative difference between the Bayesian estimates β̂ and the aggregated estimates β̃ for the BRFSS data, by year.

6. Discussion of Related Work

We discuss some related works and compare them to ours. Statistical models can be put into two categories: parametric models, such as linear regression and logistic regression, and nonparametric models, such as probability-based ensembles, naive Bayesian classifiers and kernel-density-based classifiers. In parametric models, emphasis is often put on parameter estimation, such as how accurate an estimator is. On the other hand, prediction accuracy is more important in evaluating the performance of a nonparametric model.

The framework of regression cubes [7, 8] develops a lossless compression and aggregation scheme for general multiple linear regression. Another closely related work is that on prediction cubes [6], which supports

OLAP of prediction models including probability-based ensembles, the naive Bayesian classifier, and kernel-density classifiers. Prediction cubes bear similar ideas to regression cubes in that both aim at deriving high-level models from lower-level models instead of accessing the raw data and rebuilding the models from scratch. A key difference is that the prediction cube only supports models that are distributively decomposable or algebraically decomposable [6], whereas the Bayesian models in our study are not. Also, prediction cubes deal with the prediction accuracy of nonparametric statistical models, whereas our compression theory is developed for parameter reconstruction of Bayesian models.

The above developments all focus on lossless computation for data cubes. Alternatively, asymptotically lossless computation that provides good approximations to the desired results is also acceptable in many applications when efficient storage and computation are attainable. Recently, a nearly lossless compression and aggregation scheme has been developed for logistic regression, a nonlinear parametric model [34]. An approximation technique called the quasi-cube uses the loglinear model, a parametric model, to characterize regions of a data cube [2]. Efficient storage and fast computation are achieved by storing the parameters of the loglinear models instead of the original data. In quasi-cubes, the desired computation is done based on approximations to the original data provided by the loglinear model. However, it is difficult to quantify the approximation errors in a quasi-cube.

Our paper considers aggregation operations without accessing the raw data. Palpanas, Koudas, and Mendelzon [22] have considered the reverse problem, which is to derive the original raw data from the aggregates. An approximate estimation algorithm based on maximum information entropy is proposed in [22]. It will be interesting to study the interactions of these two complementary approaches.

Safarinejadian et al. [28] recently proposed a distributed EM algorithm for estimating parameters in finite mixture models. The EM algorithm and the MCMC

algorithm are two alternative approaches for estimating parameters in finite mixture models. EM algorithms are generally faster than MCMC algorithms, but MCMC algorithms are generally easier to develop and implement, and MCMC algorithms can readily provide interval estimates. A comparison of our aggregated Bayesian estimate with their distributed EM algorithm would be interesting.

Dimension hierarchies, cubes, and cube operations were formally introduced by Vassiliadis [33]. Lenz and Thalheim [19] proposed to classify OLAP aggregation functions into distributive, algebraic, and holistic ones. In data warehousing and OLAP, much progress has been made on the efficient support of standard and advanced OLAP queries in data cubes, including selective cube materialization [15] and intelligent roll-up [29]. However, the measures studied in previous OLAP systems are usually single values or simple statistics, not sophisticated statistical models such as the Bayesian models studied in this paper.

Our work is related to database engine architectures such as Netezza (www.netezza.com) and Infobright, where the synopses computed for the partitioned data blocks are used for query optimization and execution. Infobright has recently introduced the notion of a rough query, which is an approximate query based only on the synopsis without drilling down to full details. Our method matches this framework. We plan to extend the open source version of Infobright with the proposed Bayesian synopses in order to enrich querying with elements of Bayesian modeling.

7. Conclusions

In this paper, we have proposed an asymptotically lossless compression and aggregation technique to support efficient Bayesian estimation of statistical models. We have developed a compression and aggregation scheme that compresses a data segment into a compressed representation whose size is independent of the size of the data segment. Under regularity conditions, we have proved that the aggregated estimator is strongly consistent and asymptotically error-free. We have further proposed a compression and aggregation scheme that enables detection of non-homogeneous data.

Our experimental studies on data cubes and data streams show that our compression and aggregation method can significantly reduce computational time with little loss of accuracy. Moreover, the aggregation error diminishes as the size of the data increases. Therefore, the proposed scheme is widely applicable as it enables efficient and accurate construction of Bayesian statistical models in a distributed fashion. It can be used in the contexts of data cubes and OLAP, stream data mining, and cloud computing. For data cubes, it allows us to quickly perform OLAP operations and compute Bayesian statistics at any level in a data cube without retrieving or storing the raw data. For stream data mining, it enables efficient one-scan online computation of Bayesian statistics, without requiring the raw data to be retained. For cloud computing, it facilitates analysis of distributed large datasets under parallel processing paradigms such as MapReduce.

The proposed scheme works best for scenarios of homogeneous data. Since inhomogeneous data are more likely in data streams, we would expect the convergence in stream data to be worse than in OLAP.
However, as statistical analysis without realizing the inhomogeneity would be misleading, our method also provides a way to detect this inhomogeneity without accessing the raw data.

Acknowledgements. This research is partly supported by an NSF NeTS grant and a Microsoft Research New Faculty Fellowship to Y.C., and by an NSF DMS grant to N.L.

References

[1] A. Agresti. Categorical Data Analysis. John Wiley and Sons, New Jersey, 2nd edition.
[2] D. Barbara and X. Wu. Loglinear-based quasi cubes. Journal of Intelligent Information Systems, 16.
[3] I. Cadez, D. Heckerman, P. Smyth, C. Meek, and S. White. Visualization of navigation patterns on a web site using model-based clustering. Technical report, Microsoft Research.
[4] M. T. Chao. The asymptotic behavior of Bayes estimators. The Annals of Mathematical Statistics, 41(2).
[5] C. R. Charig, D. R. Webb, S. R. Payne, and O. E. Wickham. Comparison of treatment of renal calculi by operative surgery, percutaneous nephrolithotomy, and extracorporeal shock wave lithotripsy. British Medical Journal, 292.
[6] B. Chen, L. Chen, Y. Lin, and R. Ramakrishnan. Prediction cubes. In Proceedings of the 31st VLDB Conference.
[7] Y. Chen, G. Dong, J. Han, J. Pei, B. Wah, and J. Wang. Regression cubes with lossless compression and aggregation. IEEE Transactions on Knowledge and Data Engineering, 18.
[8] Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-dimensional regression analysis of time-series data streams.
[9] K. L. Chung. A Course in Probability Theory. Elsevier, San Diego, California, 3rd edition.
[10] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Ser. B, 39:1-38.
[11] Centers for Disease Control and Prevention. Behavioral Risk Factor Surveillance System Survey Data. U.S. Department of Health and Human Services, Centers for Disease Control and Prevention.
[12] J. K. Ghosh and R. V. Ramamoorthi. Bayesian Nonparametrics. Springer, New Jersey.
[13] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54.
[14] J. Han, Y. Chen, G. Dong, J. Pei, B. W. Wah, J. Wang, and Y. Cai. Stream cube: An architecture for multi-dimensional analysis of data streams. Distributed and Parallel Databases, 18(2).
[15] V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. In Proceedings of the ACM SIGMOD International Conference on Management of Data.
[16] S. A. Julious and M. A. Mullee. Confounding and Simpson's paradox. British Medical Journal, 309.
[17] A. Khoshgozaran, A. Khodaei, M. Sharifzadeh, and C. Shahabi. A hybrid aggregation and compression technique for road network databases. Knowledge and Information Systems, 17(3).
[18] E. L. Lehmann and G. Casella. Theory of Point Estimation. Springer, New Jersey, 2nd edition.
[19] H. Lenz and B. Thalheim. OLAP databases and aggregation functions. In Proceedings of the 13th International Conference on Scientific and Statistical Database Management.
[20] C. Liu, M. Zhang, M. Zheng, and Y. Chen. Step-by-step regression: A more efficient alternative for polynomial multiple linear regression in stream cube. In Proceedings of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining.
[21] H. Liu, Y. Lin, and J. Han. Methods for mining frequent items in data streams: an overview. Knowledge and Information Systems, pages 1-30, 2011.

[22] T. Palpanas, N. Koudas, and A. O. Mendelzon. Using datacube aggregates for approximate querying and deviation detection. IEEE Transactions on Knowledge and Data Engineering, 17(11).
[23] S. Pang, S. Ozawa, and N. Kasabov. Incremental linear discriminant analysis for classification of data streams. IEEE Transactions on Systems, Man and Cybernetics, Part B, 35(5):905-914.
[24] M. Ramoni, P. Sebastiani, and P. Cohen. Bayesian clustering by dynamics. Machine Learning, 47(1):99-121.
[25] C. R. Rao. Linear Statistical Inference and Its Applications. John Wiley, New York.
[26] G. Ridgeway. Finite discrete Markov process clustering. Technical report, Microsoft Research, MSR-TR.
[27] G. Ridgeway and S. Altschuler. Clustering finite discrete Markov chains. In Proceedings of the Section on Physical and Engineering Sciences.
[28] B. Safarinejadian, M. B. Menhaj, and M. Karrari. A distributed EM algorithm to estimate the parameters of a finite mixture of components. Knowledge and Information Systems, 23(3).
[29] G. Sathe and S. Sarawagi. Intelligent rollups in multidimensional OLAP data. In Proceedings of the 27th VLDB Conference.
[30] P. Sebastiani, M. Ramoni, P. Cohen, J. Warwick, and J. Davis. Discovering dynamics using Bayesian clustering. In Advances in Intelligent Data Analysis, Lecture Notes in Computer Science. Springer.
[31] A. N. Shiryaev. Probability. Springer, New Jersey, 2nd edition.
[32] M. A. Tanner and W. H. Wong. The calculation of posterior distribution by data augmentation. Journal of the American Statistical Association, 82.
[33] P. Vassiliadis. Modeling multidimensional databases, cubes and cube operations. In Proceedings of the 10th International Conference on Scientific and Statistical Database Management, pages 53-62.
[34] R. Xi, N. Lin, and Y. Chen. Compression and aggregation for logistic regression analysis in data cubes. IEEE Transactions on Knowledge and Data Engineering, 21(4).

Author Biographies

Ruibin Xi is currently a Research Associate at the Center for Biomedical Informatics, Harvard Medical School. He received his Ph.D. from the Department of Mathematics, Washington University in St. Louis. His research interests include statistical analysis of next-generation sequencing data, copy number and structural variation, statistical computing, massive data analysis, variable selection methods, and Bayesian statistics.

Nan Lin is an Associate Professor of Mathematics and Biostatistics at Washington University in St. Louis. He received his Ph.D. in Statistics from the University of Illinois at Urbana-Champaign and worked as a Postdoctoral Associate at Yale University School of Medicine starting in 2003. His research interests include statistical computing, massive data analysis, robust statistics, bioinformatics, and psychometrics. He is a member of the American Statistical Association and the International Chinese Statistical Association.

Yixin Chen is an Associate Professor of Computer Science at Washington University in St. Louis. He received his Ph.D. in Computing Science from the University of Illinois at Urbana-Champaign. His research interests include nonlinear optimization, constrained search, planning and scheduling, data mining, and data warehousing. His work on constraint partitioning and planning won First Prizes in the optimal and satisficing tracks of the International Planning Competitions (2004 and 2006), the Best Paper Award at the International Conference on Tools with AI (2005), and the Best Paper Award at the AAAI Conference (2010). His work on data clustering won the Best Paper Award at the International Conference on Machine Learning and Cybernetics (2004) and a Best Paper nomination at the International Conference on Intelligent Agent Technology (2004). He is partially funded by an Early Career Principal Investigator Award (2006) from the Department of Energy and a Microsoft Research New Faculty Fellowship (2007).

Youngjin Kim is a Software Engineer at Google Inc. in Mountain View, California. He received his Master's degree from the Department of Computer Science at Washington University in St. Louis. His research interests include machine learning and data mining from huge and noisy real-world data.

Correspondence and offprint requests to: Yixin Chen, Department of Computer Science, Washington University, St. Louis, MO, USA. Email: chen@cse.wustl.edu
