Compression and Aggregation of Bayesian Estimates for Data Intensive Computing


Under consideration for publication in Knowledge and Information Systems

Ruibin Xi 1, Nan Lin 2, Yixin Chen 3 and Youngjin Kim 4

1 Center for Biomedical Informatics, Harvard Medical School, Boston, MA, USA; 2 Department of Mathematics, Washington University, St. Louis, MO, USA; 3 Department of Computer Science, Washington University, St. Louis, MO, USA; 4 Google Inc., Mountain View, CA, USA.

Abstract. Bayesian estimation is a major and robust estimator for many advanced statistical models. Being able to incorporate prior knowledge in statistical inference, Bayesian methods have been successfully applied in many different fields such as business, computer science, economics, epidemiology, genetics, imaging and political science. However, due to its high computational complexity, Bayesian estimation has been deemed difficult, if not impractical, for large-scale databases, stream data, data warehouses, and data in the cloud. In this paper, we propose a novel compression and aggregation scheme (C&A) that enables distributed, parallel, or incremental computation of Bayesian estimates. Assuming partitioning of a large dataset, the C&A scheme compresses each partition into a synopsis and aggregates the synopses into an overall Bayesian estimate without accessing the raw data. Such a C&A scheme can find applications in OLAP for data cubes, stream data mining, and cloud computing. It saves tremendous computing time since it processes each partition only once, enables fast incremental updates, and allows parallel processing. We prove that the compression is asymptotically lossless in the sense that the aggregated estimator deviates from the true model by an error that is bounded and approaches zero as the data size increases. The results show that the proposed C&A scheme makes OLAP of Bayesian estimates in a data cube feasible. Further, it supports real-time Bayesian analysis of stream data, which can only be scanned once and cannot be permanently retained. Experimental results validate our theoretical analysis and demonstrate that our method can dramatically save time and space costs with almost no degradation of modeling accuracy.

Received xxx Revised xxx Accepted xxx

Keywords: Bayesian estimation; data cubes; OLAP; stream data mining; compression; aggregation.

1. Introduction

In the last few years, there has been active research on compression and aggregation (C&A) schemes for advanced statistical analysis on structured and large-scale data [6, 7, 8, 13, 14, 17, 20, 21, 23, 34]. For a given statistical model, a general C&A scheme partitions a large dataset into segments, processes each segment separately to generate a compressed representation, and aggregates the compressed results into a final model. Key benefits of such a scheme include its support for multi-dimensional data cube analysis, online processing, and distributed processing. The C&A scheme is useful in the following scenarios.

The techniques developed in this paper are useful for data warehousing and the associated on-line analytical processing (OLAP) computing. With our C&A scheme, a Bayesian statistical model for a given data cell can be obtained by aggregating the compressed synopses of relevant lower-level cells, without building the model from the raw data from scratch. Such a scheme allows for fast interactive analysis of multidimensional data to facilitate effective data mining at multiple levels of abstraction.

The proposed C&A scheme enables online Bayesian analysis of real-time data streams. It is challenging to build online statistical models for high-speed data streams, since it is typically not practical to rebuild a complex model every time a new segment of data is received, due to high computational costs and the fact that raw data are not stored in many stream data applications. Our C&A scheme solves this problem by retaining only synopses, instead of raw data, in the system. For each new data segment, we compress it and use our aggregation scheme to efficiently update the model online.

Cloud computing is a major trend for data intensive computing, as it enables scalable processing of massive amounts of data. It is a promising next-generation computing paradigm given its many advantages such as scalability, elasticity, reliability, high availability, and low cost. As data localization is important for efficiency, it is desirable that each processing unit is only responsible for its own segment of local data. The proposed C&A scheme is well suited for performing Bayesian analysis on massive datasets in the cloud. For example, the compression and aggregation phases in a C&A scheme match the mapping and reducing phases, respectively, of the well-known MapReduce algorithmic framework for cloud computing. The C&A scheme allows partitioning and parallel processing of data, and thus enables high-performance statistical analysis in the cloud.

Although there are earlier works to support C&A schemes for statistical inference, most of them are based on maximum likelihood estimation (MLE). In this paper, we propose a C&A scheme for Bayesian estimation, another major estimation approach that is considered superior to MLE in many contexts. The premise of Bayesian statistics is to incorporate prior knowledge, along with a given set of current observations, in order to make statistical inferences. The prior

information could come from previous comparable experiments, from experiences of some experts, or from existing theories. However, it is often very expensive to compute Bayesian estimates, as there generally exists no closed-form solution, and Markov chain Monte Carlo (MCMC) methods such as Gibbs samplers and Metropolis algorithms are often employed. Hence, to process large-scale data (possibly in parallel) and online stream data using Bayesian estimation, fast and effective C&A schemes are desired. C&A schemes for Bayesian estimation have not been studied before.

Earlier works in data cubes [13] support aggregation of simple measures such as sum() and average(). However, the fast development of OLAP technology has led to high demand for more sophisticated data analysis capabilities, such as prediction, trend monitoring, and exception detection of multidimensional data. Oftentimes, simple measures such as sum() and average() become insufficient, and more sophisticated statistical models are desired in OLAP. Recently, some researchers have developed aggregation schemes for more advanced statistical analyses, including parametric models such as linear regression [8, 14], general multiple linear regression [7, 20], logistic regression analysis [34] and predictive filters [7], as well as nonparametric statistical models such as naive Bayesian classifiers [6] and linear discriminant analysis [23]. Along this line, we develop a C&A scheme to support Bayesian estimation.

Bayesian methods are statistical approaches to parameter estimation and statistical inference which use prior distributions over parameters. Bayes' rule provides the framework for combining prior information with sample data. Suppose that f(D|θ) is the probability model of the data D with parameter (vector) θ ∈ Θ and π(θ) is the prior probability density function (pdf) on the parameter space Θ. The posterior distribution of θ given the data D, using Bayes' rule, is given by
$$ f(\theta \mid D) = \frac{f(D \mid \theta)\,\pi(\theta)}{\int_{\Theta} f(D \mid \theta)\,\pi(\theta)\,d\theta}. $$
The posterior mean θ̃ = ∫_Θ θ f(θ|D) dθ is then a Bayesian estimate of the parameter θ. While it is easy to write down the formula for the posterior mean θ̃, a closed form exists only in a few simple cases, such as a normal sample with a normal prior. In practice, MCMC methods are often used to evaluate the posterior mean. However, these algorithms are usually slow, especially for large data sets, making OLAP processing based on them impractical. Furthermore, these MCMC algorithms require the complete data set. In many data mining applications, such as stream data applications and distributed analysis in the cloud, we often encounter the difficulty of not having the complete set of data in advance. One-scan algorithms are required for such applications.

In this paper, we propose a C&A scheme and its associated theory to support high-quality aggregation of Bayesian estimation for statistical models. In the proposed approach, we compress each data segment by retaining only the model parameters and some auxiliary measures. We then develop an aggregation formula that allows us to reconstruct the Bayesian estimate from partitioned segments with a small and asymptotically diminishing approximation error. We further show that the Bayesian estimates and the aggregated Bayesian estimates are asymptotically equivalent.

This paper is organized as follows. In Section 2, we introduce the research problem in the context of data cubes, noting that the general theory and C&A scheme can be applied in other contexts such as stream data mining as well. In Section 3, we review the basics of Bayesian statistics. We develop the C&A scheme and its theory in Section 4 and report experimental results in Section 5. Then, we discuss related works in Section 6 and give conclusions in Section 7.

2. Concepts and Problem Definition

We develop our theory and algorithms for the C&A scheme in the context of data cubes and OLAP. The proposed theory and algorithms can also be used in other contexts, such as stream data mining and cloud computing. We present our results in a data cube context since it assumes a clear and simple structure of data, which facilitates our discussion. In our empirical study, we show results in both data cube and data stream contexts. In this section, we introduce the basic concepts related to data cubes and define our research problem.

2.1. Data cubes

Data cubes and OLAP tools are based on a multidimensional data model. The model views data in the form of a data cube. A data cube is defined by dimensions and facts. Dimensions are the perspectives or entities with respect to which an organization wants to keep records. Usually each dimension has multiple levels of abstraction formed by conceptual hierarchies. For example, country, state, city, and street are four levels of abstraction in a dimension for location.

To perform multidimensional, multi-level analysis, we need to introduce some basic terms related to data cubes. Let D be a relational table, called the base table, of a given cube. The set of all attributes A in D is partitioned into two subsets, the dimensional attributes DIM and the measure attributes M (so DIM ∪ M = A and DIM ∩ M = ∅). The measure attributes depend on the dimensional attributes in D and are defined in the context of a data cube using some typical aggregate functions, such as count(), sum(), avg(), or the Bayesian estimators to be studied here.

A tuple with schema A in a multi-dimensional data cube space is called a cell. Given two distinct cells c_1 and c_2, c_1 is an ancestor of c_2, and c_2 a descendant of c_1, if on every dimensional attribute either c_1 and c_2 share the same value, or c_1's value is a generalized value of c_2's in the dimension's concept hierarchy. A tuple c ∈ D is called a base cell. A base cell does not have any descendant. A cell c is an aggregated cell if it is an ancestor of some base cells. For each aggregated cell, the values of its measure attributes are derived from the set of its descendant cells.

2.2. Aggregation and classification of data cube measures

A data cube measure is a numerical or categorical quantity that can be evaluated at each cell in the data cube space. A measure value is computed for a given cell by aggregating the data corresponding to the respective dimension-value pairs

defining the given cell. Measures can be classified into several categories based on the difficulty of aggregation.

1) An aggregate function is distributive if it can be computed in a distributed manner as follows. Suppose the data is partitioned into n sets. The computation of the function on each partition derives one aggregate value. If the result derived by applying the function to the n aggregate values is the same as that derived by applying the function to all the data without partitioning, the function can be computed in a distributive manner. For example, count() can be computed for a data cube by first partitioning the cube into a set of subcubes, computing count() for each subcube, and then summing up the counts obtained for each subcube. Hence, count() is a distributive aggregate function. For the same reason, sum(), min(), and max() are distributive aggregate functions.

2) An aggregate function is algebraic if it can be computed by an algebraic function with several arguments, each of which is obtained by applying a distributive aggregate function. For example, avg() (average) can be computed by sum()/count(), where both sum() and count() are distributive aggregate functions. min_N(), max_N() and standard_dev() are algebraic aggregate functions.

3) An aggregate function is holistic if there is no constant bound on the storage size needed to describe a sub-aggregate. That is, there does not exist an algebraic function with M arguments (where M is a constant) that characterizes the computation. Common examples of holistic functions include median(), mode(), and rank().

Except for some simple special cases, like a normal sample with a normal prior, Bayesian estimates seem to be holistic measures, since they require the information of all the data points in an aggregated cell for the computation. In this paper, we show that Bayesian estimates are compressible measures [7, 34]. An aggregate function is compressible if it can be computed by a procedure with a number of arguments from lower-level cells, and the number of arguments is independent of the number of tuples in the data cell. In other words, for compressible aggregate functions, we can compress each cell, regardless of its size (i.e., the number of tuples), into a constant number of arguments, and aggregate the function based on the compressed representation. The data compression technique should satisfy the following requirements: (1) the compressed data should support efficient lossless or asymptotically lossless aggregation of statistical measures in a multidimensional data cube environment; and (2) the space complexity of the compressed data should be low and independent of the number of tuples in each cell, as the number of tuples in each cell may be huge. In this paper, we develop a compression and aggregation scheme for Bayesian estimates that supports asymptotically lossless aggregation.
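As a simple illustration of the distinction between these categories, the algebraic measure avg() can be assembled exactly from the distributive synopses (sum, count) kept for each partition. The following minimal Python sketch is our own toy example (not from the paper); it is the simplest instance of the compressed-representation idea used throughout this work.

```python
def partition_synopsis(values):
    """Distributive synopsis of one partition: (sum, count)."""
    return sum(values), len(values)

def aggregate_avg(synopses):
    """avg() is algebraic: it is recovered exactly from the per-partition synopses."""
    total = sum(s for s, _ in synopses)
    count = sum(n for _, n in synopses)
    return total / count

# Example: three partitions of a cell give the same average as the raw data.
parts = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
print(aggregate_avg([partition_synopsis(p) for p in parts]))  # 3.5
```

For Bayesian estimates no such exact reconstruction is available in general, which is why the scheme developed below is only asymptotically lossless.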

3. Bayesian Statistics

Suppose that x_1, ..., x_n are n observations from a probability model f(x|θ), where θ ∈ Θ is the parameter (vector) of the probability model f(x|θ). The prior information in Bayesian statistics is given by a prior distribution π(θ) on the parameter space Θ. Then, under the independence assumption of the observations x_1, ..., x_n given the parameter θ, the posterior distribution f(θ|x_1, ..., x_n) of the parameter θ can be calculated using Bayes' rule,
$$ f(\theta \mid x_1,\ldots,x_n) = \frac{f(x_1,\ldots,x_n \mid \theta)\,\pi(\theta)}{\int_{\Theta} f(x_1,\ldots,x_n \mid \theta)\,\pi(\theta)\,d\theta} = \frac{\prod_{i=1}^{n} f(x_i \mid \theta)\,\pi(\theta)}{\int_{\Theta} \prod_{i=1}^{n} f(x_i \mid \theta)\,\pi(\theta)\,d\theta}, \qquad (1) $$
where f(x_1, ..., x_n|θ) is the joint distribution of x_1, ..., x_n given the parameter θ. Then, we could use the posterior mean θ̃_n as an estimate of the parameter θ, i.e.
$$ \tilde\theta_n = \int_{\Theta} \theta\, f(\theta \mid x_1,\ldots,x_n)\,d\theta = \left(\int_{\Theta} \prod_{i=1}^{n} f(x_i \mid \theta)\,\pi(\theta)\,d\theta\right)^{-1} \int_{\Theta} \theta \prod_{i=1}^{n} f(x_i \mid \theta)\,\pi(\theta)\,d\theta. \qquad (2) $$

MCMC methods are often employed to evaluate the formula (2) because direct evaluation is difficult. These algorithms are based on constructing a Markov chain that has the posterior distribution (1) as its equilibrium distribution. After running the Markov chain for a large number of steps, called burn-in steps, a sample from the Markov chain can be viewed as a sample from the posterior distribution (1). We can then approximate the posterior mean θ̃_n to any accuracy we wish by taking a large enough sample from the posterior distribution (1). We consider the following example [10, 25, 32] to illustrate the Gibbs sampler.

Example 1: 197 animals are distributed multinomially into four categories and the observed data are y = (y_1, y_2, y_3, y_4) = (125, 18, 20, 34). A genetic model specifies cell probabilities
$$ \left(\tfrac{1}{2} + \tfrac{\theta}{4},\ \tfrac{1-\theta}{4},\ \tfrac{1-\theta}{4},\ \tfrac{\theta}{4}\right). $$
Assume that the prior distribution is Beta(1, 1), which is also the uniform distribution on the interval (0, 1) and therefore is a non-informative prior. The posterior distribution of θ is
$$ f(\theta \mid y) \propto (2+\theta)^{y_1} (1-\theta)^{y_2+y_3}\, \theta^{y_4}. $$
It is difficult, though not impossible, to calculate the posterior mean. However, a Gibbs sampler can be easily developed by augmenting the data y. Specifically, let x = (x_1, x_2, x_3, x_4, x_5) such that y_1 = x_1 + x_2, y_2 = x_3, y_3 = x_4 and y_4 = x_5. Assume the cell probabilities for x are
$$ \left(\tfrac{1}{2},\ \tfrac{\theta}{4},\ \tfrac{1-\theta}{4},\ \tfrac{1-\theta}{4},\ \tfrac{\theta}{4}\right). $$
Then, the distribution of y is the marginal distribution of x. The full conditional distribution of θ is f(θ|x_2, y) ∝ θ^{x_2+y_4} (1-θ)^{y_2+y_3}, which is Beta(x_2 + y_4 + 1, y_2 + y_3 + 1). The full conditional distribution of x_2 is f(x_2|y, θ) ∝ (2/(2+θ))^{y_1-x_2} (θ/(2+θ))^{x_2}, i.e. the binomial distribution Binom(y_1, θ/(2+θ)). The Gibbs sampler starts with any value θ^(0) ∈ (0, 1) and iterates the following two steps.

1. Generate x_2^(k) from the full conditional distribution f(x_2|y, θ^(k-1)), i.e. from Binom(125, θ^(k-1)/(2 + θ^(k-1))).
2. Generate θ^(k) from the full conditional distribution f(θ|x_2^(k), y), i.e. from Beta(x_2^(k) + 35, 39).
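As an illustration, a minimal Python (numpy) sketch of this Gibbs sampler is given below. The function name and defaults are ours, and the averaging over post-burn-in draws described in the next paragraph is performed inside the function with s = 1.

```python
import numpy as np

def gibbs_genetic_linkage(y=(125, 18, 20, 34), theta0=0.5,
                          burn_in=1000, n_samples=5000, seed=0):
    """Gibbs sampler for the genetic linkage model of Example 1.

    y      : observed multinomial counts (y1, y2, y3, y4)
    theta0 : starting value theta^(0) in (0, 1)
    Returns the posterior-mean estimate of theta.
    """
    rng = np.random.default_rng(seed)
    y1, y2, y3, y4 = y
    theta = theta0
    draws = []
    for k in range(burn_in + n_samples):
        # Step 1: x2 | y, theta  ~  Binom(y1, theta / (2 + theta))
        x2 = rng.binomial(y1, theta / (2.0 + theta))
        # Step 2: theta | x2, y  ~  Beta(x2 + y4 + 1, y2 + y3 + 1)
        theta = rng.beta(x2 + y4 + 1, y2 + y3 + 1)
        if k >= burn_in:
            draws.append(theta)
    return np.mean(draws)

if __name__ == "__main__":
    print(gibbs_genetic_linkage())  # approximates the posterior mean of theta
```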

Then, we can take the average over θ^(b+s), ..., θ^(b+sn) to get an estimate of θ, where b is a large positive integer and s is a positive integer. The first b iterations are burn-in iterations, and b is usually chosen large enough such that the Markov chain converges after b iterations. When n is large enough, this average will be a very good approximation to the posterior mean. The integer s is used to reduce the correlation between two successive samples and is usually chosen to be small. In our experiment, where we set θ^(0) = 0.5, s = 1, b = 1000 and n = 5,000, the sample average we obtained is very close to the true posterior mean.

4. Compression and Aggregation of Bayesian Estimates

Since the computation of the Bayesian estimate θ̃_n often involves MCMC algorithms, the compression and aggregation of Bayesian estimates are more difficult than for the maximum likelihood estimates (MLE) of coefficients in regression models. In general, it is very difficult to achieve lossless compression for Bayesian estimates, and we have to resort to the asymptotic theory of Bayesian estimation to derive an asymptotically lossless compression scheme. We first review the notion of asymptotically lossless compression representation (ALCR) introduced in [34].

Definition 4.1. In data cube analysis, a cell function g is a function that takes the data records of any cell with an arbitrary size as inputs and maps them into a fixed-length vector as an output. That is,
$$ g(c) = v, \quad \text{for any data cell } c, \qquad (3) $$
where the output vector v has a fixed size.

Suppose that we have a probability model f(x|θ), where x are attributes and θ is the parameter of the probability model. Suppose c_a is a cell aggregated from the component cells c_1, ..., c_k. We define a cell function g_2 to obtain m_i = g_2(c_i), i = 1, ..., k, and use an aggregation function g_1 to obtain an estimate of the parameter θ for c_a by
$$ \tilde\theta = g_1(m_1, \ldots, m_k). \qquad (4) $$
We say that θ̂, an estimate of θ, is an asymptotically losslessly compressible measure if we can find an aggregation function g_1 and a cell function g_2 such that (a) the difference between θ̃ = g_1(m_1, ..., m_k) and θ̂(c_a) tends to zero in probability as the number of tuples in c_a goes to infinity, where m_i = g_2(c_i), i = 1, ..., k; (b) θ̂(c_a) = g_1(g_2(c_a)); and (c) the dimension of m_i is independent of the number of tuples in c_i. The measures m_i are called an ALCR of the cell c_i, i = 1, ..., k.

In the following, we develop an ALCR for the Bayesian estimate in (2) based on its asymptotic properties. We show that the asymptotic distributions of the estimates obtained

from aggregation of the ALCRs of the component cells and of the Bayesian estimates in the aggregated cell are the same, and further show that the difference between them approaches zero in probability as the number of tuples in c_a goes to infinity. Further, the space complexity of the ALCR is independent of the number of tuples. Therefore, Bayesian estimates are asymptotically losslessly compressible measures.

4.1. Compression and aggregation scheme

Consider aggregating K cells at a lower level into one aggregated cell at a higher level. Suppose that there are n_k observations in the k-th component cell c_k. Let {x_{k,1}, ..., x_{k,n_k}} be the observations in the component cell c_k. Note that the observations x_{k,j} (j = 1, ..., n_k) could be multidimensional. Based on the observations in the k-th component cell c_k, we have the Bayesian estimate
$$ \tilde\theta_{k,n_k} = \left(\int_{\Theta} \prod_{j=1}^{n_k} f(x_{k,j} \mid \theta)\,\pi(\theta)\,d\theta\right)^{-1} \int_{\Theta} \theta \prod_{j=1}^{n_k} f(x_{k,j} \mid \theta)\,\pi(\theta)\,d\theta. \qquad (5) $$
We propose the following asymptotically lossless compression technique for Bayesian estimation.

Compression into ALCR. For each base cell c_k, k = 1, ..., K, at the lowest level of the data cube, calculate the Bayesian estimate θ̃_{k,n_k} using (5). Save ALCR = (θ̃_{k,n_k}, n_k) in each component cell c_k.

Aggregation of ALCR. Calculate the aggregated ALCR (θ̃_a, n_a) using the following formula:
$$ n_a = \sum_{k=1}^{K} n_k, \qquad \tilde\theta_a = n_a^{-1} \sum_{k=1}^{K} n_k\, \tilde\theta_{k,n_k}. $$

Such a process can be used to aggregate base cells at the lowest level as well as cells at intermediate levels. For any non-base cell, however, the aggregated estimate θ̃ is used in place of θ̃_{k,n_k} in its ALCR.
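To make the scheme concrete, here is a minimal Python sketch of the two steps above. The function names and the generic posterior_mean routine are ours, not from the paper; any estimator that returns the posterior mean of a partition, such as the Gibbs sampler sketched in Section 3, could be plugged in.

```python
import numpy as np

def compress(partition, posterior_mean):
    """Compress one partition (cell) into its ALCR = (theta_k, n_k).

    partition      : array of raw observations for this cell
    posterior_mean : routine returning the Bayesian estimate for the partition
                     (e.g. an MCMC-based posterior-mean estimator)
    """
    theta_k = np.asarray(posterior_mean(partition))
    return theta_k, len(partition)

def aggregate(alcrs):
    """Aggregate ALCRs of component cells: theta_a = (1/n_a) * sum_k n_k * theta_k."""
    n_a = sum(n_k for _, n_k in alcrs)
    theta_a = sum(n_k * theta_k for theta_k, n_k in alcrs) / n_a
    return theta_a, n_a
```

The aggregated pair (θ̃_a, n_a) has the same form as a base-cell ALCR, so it can itself be aggregated further up the cube or merged with the ALCR of a newly arriving data segment.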

4.2. Compressibility of Bayesian estimation

We now show that (θ̃_a, n_a) is an ALCR. We denote the Bayesian estimate for the aggregated cell by θ̃_{n_a} and the corresponding estimate derived from the ALCR compression and aggregation by θ̃_a. We will show that the asymptotic distributions of θ̃_{n_a} and θ̃_a are the same and that their difference tends to zero in probability.

Suppose that Θ ⊆ R^p is an open subset of R^p. We give a detailed proof of the theorem only for the case p = 1 and briefly describe the proof for the multidimensional case. We make the following regularity assumptions on f_θ(·) = f(·|θ) before giving the main theorem.

(C1) {x : f_θ(x) > 0} is the same for all θ ∈ Θ.

(C2) L(θ, x) = log f_θ(x) is thrice differentiable with respect to θ in a neighborhood U_{δ_0}(θ_0) = (θ_0 − δ_0, θ_0 + δ_0) of θ_0 ∈ Θ. If L′, L″ and L^{(3)} stand for the first, second and third derivatives, then E_{θ_0} L′(θ_0, X) and E_{θ_0} L″(θ_0, X) are both finite, and
$$ \sup_{\theta \in U_{\delta_0}(\theta_0)} |L^{(3)}(\theta, x)| \le M(x) \quad \text{with} \quad E_{\theta_0} M(X) < \infty. $$

(C3) Interchange of the order of expectation with respect to θ_0 and differentiation at θ_0 is justified, so that E_{θ_0} L′(θ_0, X) = 0 and E_{θ_0} L″(θ_0, X) = −E_{θ_0}[L′(θ_0, X)]².

(C4) I_{θ_0} = E_{θ_0}[L′(θ_0, X)]² > 0.

(C5) If X_1, ..., X_n are random variables sampled from f_{θ_0} and L_n(θ) = Σ_{i=1}^n L(θ, X_i), then for any δ > 0, there exists an ε > 0 such that
$$ P_{\theta_0}\Big\{ \sup_{|\theta - \theta_0| > \delta} [L_n(\theta) - L_n(\theta_0)] \le -\varepsilon \Big\} \to 1. $$

(C6) The prior has a density π(θ) with respect to the Lebesgue measure, which is continuous and positive at θ_0. Furthermore, π(θ) satisfies ∫_Θ |θ| π(θ) dθ < ∞.

These conditions guarantee the consistency and asymptotic normality of the posterior mean and are the same as the conditions in [12].

Theorem 1. Suppose {f_θ : θ ∈ Θ} satisfies Conditions (C1)-(C5) and the prior distribution satisfies Condition (C6). Let X_{k,1}, ..., X_{k,n_k} (k = 1, ..., K) be random variables from the distribution f_{θ_0}, let θ̃_{k,n_k} be the posterior mean (2) based on the random variables X_{k,1}, ..., X_{k,n_k}, and let θ̃_a = n_a^{-1} Σ_{k=1}^K n_k θ̃_{k,n_k} be the aggregated Bayesian estimate. Then we have
$$ \sqrt{n_a}\,(\tilde\theta_a - \theta_0) \xrightarrow{d} N(0, I_{\theta_0}^{-1}) \quad \text{as} \quad m_K = \min\{n_1, \ldots, n_K\} \to \infty. $$

Proof. Since {f_θ : θ ∈ Θ} and π(θ) satisfy Conditions (C1)-(C6), from the asymptotic normality theorem in [12] we have
$$ \sqrt{n_k}\,(\tilde\theta_{k,n_k} - \theta_0) \xrightarrow{d} N(0, I_{\theta_0}^{-1}) \quad \text{as} \quad n_k \to \infty. $$
Let Z_{k,n_k} = √n_k (θ̃_{k,n_k} − θ_0) and let φ_{k,n_k}(t) = E[e^{itZ_{k,n_k}}] be its characteristic function. Denote v² = I_{θ_0}^{-1}. Then, by Lévy's Continuity Theorem (see, for example, [9] and [31] among others), φ_{k,n_k}(t) converges to exp(−v²t²/2) uniformly in every finite interval, where exp(−v²t²/2) is the characteristic function of the normal distribution N(0, v²). On the other hand, the characteristic function of

the random variable Z_{n_a} = √n_a (θ̃_a − θ_0) is
$$ \varphi_{n_a}(t) = E\big[\exp\{it\sqrt{n_a}\,(\tilde\theta_a - \theta_0)\}\big] = E\Big[\exp\Big\{it\sqrt{n_a}\, n_a^{-1} \sum_{k=1}^{K} n_k(\tilde\theta_{k,n_k} - \theta_0)\Big\}\Big] = \prod_{k=1}^{K} E\big[\exp\{it\, n_a^{-1/2} n_k (\tilde\theta_{k,n_k} - \theta_0)\}\big] = \prod_{k=1}^{K} \varphi_{k,n_k}(n_k^{1/2} n_a^{-1/2} t). $$
Then, we have
$$ \big|\log[\varphi_{n_a}(t)] + v^2 t^2/2\big| = \Big|\sum_{k=1}^{K} \Big\{\log[\varphi_{k,n_k}(n_k^{1/2} n_a^{-1/2} t)] + \frac{n_k v^2 t^2}{2 n_a}\Big\}\Big| \le \sum_{k=1}^{K} \Big|\log[\varphi_{k,n_k}(n_k^{1/2} n_a^{-1/2} t)] + \frac{v^2 (n_k^{1/2} n_a^{-1/2} t)^2}{2}\Big|. $$
Since φ_{k,n_k}(t) converges to exp(−v²t²/2) uniformly in every finite interval, log[φ_{k,n_k}(t)] converges to −v²t²/2 uniformly in every finite interval. Then for any ε > 0, there exists an N_k(ε) > 0 such that when n_k > N_k(ε), we have |log[φ_{k,n_k}(τ)] + v²τ²/2| ≤ ε/K for all |τ| ≤ |t|. Take M_K(ε) = max{N_1(ε), ..., N_K(ε)}. Since n_k^{1/2} n_a^{-1/2} |t| ≤ |t|, we have |log[φ_{n_a}(t)] + v²t²/2| ≤ K · ε/K = ε for m_K ≥ M_K(ε). Therefore, φ_{n_a}(t) converges to exp(−v²t²/2) for all t ∈ R, and the theorem follows by using Lévy's Continuity Theorem again.

To prove a similar result for p > 1, we need to replace Conditions (C2)-(C4) with the following conditions.

(C2′) L(θ, x) = log f_θ(x) is thrice differentiable with respect to θ in a neighborhood U_{δ_0}(θ_0) = {θ : |θ − θ_0| < δ_0} of θ_0 ∈ Θ. If L′_i, L″_{ij} and L^{(3)}_{ijk} stand for the first, second and third partial derivatives with respect to the i-th, j-th and k-th components of θ, then E_{θ_0} L′_i(θ_0, X) and E_{θ_0} L″_{ij}(θ_0, X) are both finite and
$$ \sup_{\theta \in U_{\delta_0}(\theta_0)} |L^{(3)}_{ijk}(\theta, x)| \le M(x) \quad \text{with} \quad E_{\theta_0} M(X) < \infty. $$

(C3′) Interchange of the order of expectation with respect to θ_0 and differentiation at θ_0 is justified, so that E_{θ_0} L′(θ_0, X) = 0 and E_{θ_0} L″(θ_0, X) = −E_{θ_0}[L′(θ_0, X) L′(θ_0, X)^T], where L′ is the gradient vector with i-th component L′_i, L″ is the Hessian matrix with L″_{ij} as its (i, j)-th component, and L′(θ_0, X)^T is the transpose of the column vector L′(θ_0, X).

(C4′) I_{θ_0} = E_{θ_0}[L′(θ_0, X) L′(θ_0, X)^T] is a positive definite matrix.

Table 1. Success rates for different groups of stone size.

                Treatment A       Treatment B
Small Stone     93% (81/87)       87% (234/270)
Large Stone     73% (192/263)     69% (55/80)
Both            78% (273/350)     83% (289/350)

Theorem 2. Under Conditions (C1), (C2′)-(C4′), (C5) and (C6), we have
$$ \sqrt{n_a}\,(\tilde\theta_a - \theta_0) \xrightarrow{d} N(0, I_{\theta_0}^{-1}) \quad \text{as} \quad m_K = \min\{n_1, \ldots, n_K\} \to \infty. $$

Proof. Let θ̂_{k,n_k} be the MLE of the parameter θ based on the data X_{k,1}, ..., X_{k,n_k}. From Theorem 5.1 in [18], we have
$$ \sqrt{n_k}\,(\hat\theta_{k,n_k} - \theta_0) \xrightarrow{d} N(0, I_{\theta_0}^{-1}) \quad \text{as} \quad n_k \to \infty. $$
On the other hand, from Theorem 2.1 in [4], the difference between the Bayesian estimator θ̃_{k,n_k} and the MLE satisfies n_k^{1/2}(θ̃_{k,n_k} − θ̂_{k,n_k}) → 0 almost surely. Hence, we have
$$ \sqrt{n_k}\,(\tilde\theta_{k,n_k} - \theta_0) \xrightarrow{d} N(0, I_{\theta_0}^{-1}) \quad \text{as} \quad n_k \to \infty. $$
The remaining part of the proof is similar to the proof of Theorem 1 and is omitted.

Corollary 1. Under the conditions of Theorem 1 or 2, the difference between the estimates θ̃_{n_a} and θ̃_a approaches 0 in probability.

Proof. From Theorem 1, θ̃_a approaches θ_0 in probability as m_K goes to infinity. The Bayesian estimate θ̃_{n_a} also approaches θ_0 in probability. Therefore, the difference between θ̃_{n_a} and θ̃_a converges to 0 in probability.

Corollary 1 means that the difference between θ̃_{n_a} and θ̃_a becomes smaller as more data become available. Hence, the estimate θ̃_a is a good approximation to θ̃_{n_a}, with a diminishing error when the dataset is large.

4.3. Detection of non-homogeneous data

Theorem 1 and Corollary 1 rely on the assumption that the data from different subcubes come from the same probability model, i.e. that the data are homogeneous. Aggregation of non-homogeneous data can lead to misleading results, and Simpson's paradox [1] may occur. Therefore, it is important to develop tools for testing non-homogeneity. The test of non-homogeneity should support OLAP analysis and hence should depend only on the compressed measures, or the ALCRs, of the subcubes. The ALCR defined in Section 4.1 is insufficient for the test of non-homogeneity, and one additional measure is needed. Let v_{k,n_k} be the posterior variance matrix based on the observations in the k-th component cell, i.e.
$$ v_{k,n_k} = C^{-1} \int_{\Theta} (\theta - \tilde\theta_{k,n_k})(\theta - \tilde\theta_{k,n_k})^T \prod_{j=1}^{n_k} f(x_{k,j} \mid \theta)\,\pi(\theta)\,d\theta, \qquad (6) $$

where C = ∫_Θ ∏_{j=1}^{n_k} f(x_{k,j}|θ) π(θ) dθ is the normalizing constant. If the parameter θ is p-dimensional, the measure v_{k,n_k} is a p × p matrix. We propose the following modified compression and aggregation scheme.

Compression into ALCR. For each base cell c_k, k = 1, ..., K, at the lowest level of the data cube, calculate the Bayesian estimate θ̃_{k,n_k} using (5) and the posterior variance v_{k,n_k} using (6). Save ALCR = (θ̃_{k,n_k}, v_{k,n_k}, n_k) in each component cell c_k.

Aggregation of ALCR. Calculate the aggregated ALCR (θ̃_a, ṽ_a, n_a) using the following formula:
$$ n_a = \sum_{k=1}^{K} n_k, \qquad \tilde\theta_a = n_a^{-1} \sum_{k=1}^{K} n_k\, \tilde\theta_{k,n_k}, \qquad \tilde v_a = n_a^{-2} \sum_{k=1}^{K} n_k^2\, v_{k,n_k}. $$
For any non-base cell, θ̃_a and ṽ_a are used in place of θ̃_{k,n_k} and v_{k,n_k} in its ALCR.

Suppose that c_1 and c_2 are two subcubes and (θ̃_1, ṽ_1, n_1) and (θ̃_2, ṽ_2, n_2) are their ALCRs, respectively. By Theorem 1, √n_k (θ̃_k − θ_0) approximately follows the normal distribution N(0, I_{θ_0}^{-1}), or equivalently θ̃_k − θ_0 (k = 1, 2) approximately follows the normal distribution N(0, n_k^{-1} I_{θ_0}^{-1}). Using ṽ_k as the estimate of n_k^{-1} I_{θ_0}^{-1}, it follows that
$$ t = (\tilde\theta_1 - \tilde\theta_2)^T (\tilde v_1 + \tilde v_2)^{-1} (\tilde\theta_1 - \tilde\theta_2) $$
approximately follows a χ²_p distribution. Hence, we can use the statistic t to test non-homogeneity.
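As an illustration, a minimal Python sketch of this test for two ALCRs is given below (it assumes numpy and scipy are available; the function name is ours).

```python
import numpy as np
from scipy.stats import chi2

def homogeneity_test(theta1, v1, theta2, v2):
    """Chi-square test of non-homogeneity between two subcubes.

    theta1, theta2 : aggregated Bayesian estimates (length-p vectors)
    v1, v2         : aggregated posterior variance matrices (p x p)
    Returns the statistic t and its approximate p-value under chi^2 with p d.o.f.
    """
    diff = np.asarray(theta1) - np.asarray(theta2)
    t = float(diff @ np.linalg.inv(np.asarray(v1) + np.asarray(v2)) @ diff)
    p_value = chi2.sf(t, df=diff.size)  # survival function of chi^2_p
    return t, p_value
```

A small p-value suggests that the two subcubes are inhomogeneous and should not be aggregated.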

We use the kidney stone data considered in [34] as an example of the test of non-homogeneity. The data are from a medical study [5, 16] comparing the success rates of two treatments for kidney stones. The two treatments are open surgery (treatment A) and percutaneous nephrolithotomy (treatment B). Table 1 shows the effects of both treatments under different conditions. It reveals that treatment A has a higher success rate than treatment B for both the small stone and large stone groups. However, after aggregating over the two groups, treatment A has a lower success rate than treatment B.

Let S be a binary random variable that indicates whether a treatment succeeds or not, and let T be the type of treatment that a patient receives. We use p_A and p_B to denote the success rates of treatments A and B, respectively, and α_A the probability that a patient receives treatment A. We have the probability model
$$ Pr(S, T \mid p_A, p_B, \alpha_A) = \big[p_A^{S} (1-p_A)^{1-S} \alpha_A\big]^{I(T=A)} \big[p_B^{S} (1-p_B)^{1-S} (1-\alpha_A)\big]^{I(T=B)}, $$
where I(·) is the indicator function. Set the priors for p_A, p_B, α_A as the non-informative prior Beta(1, 1). Given observations D = {(s_1, t_1), ..., (s_n, t_n)}, the posterior distribution of (p_A, p_B, α_A) is
$$ f(p_A, p_B, \alpha_A \mid D) \propto p_A^{n_{As}} (1-p_A)^{n_{Af}}\, p_B^{n_{Bs}} (1-p_B)^{n_{Bf}}\, \alpha_A^{n_A} (1-\alpha_A)^{n_B}, $$
where n_{As} = Σ_{i=1}^n s_i I(t_i = A), n_{Af} = Σ_{i=1}^n (1 − s_i) I(t_i = A), n_{Bs} = Σ_{i=1}^n s_i I(t_i = B), n_{Bf} = Σ_{i=1}^n (1 − s_i) I(t_i = B), n_A = Σ_{i=1}^n I(t_i = A) and n_B = Σ_{i=1}^n I(t_i = B). Therefore, the posterior distribution is the product of three independent Beta distributions.

Denote θ = (p_A, p_B, α_A). Based on the small stone group data, the Bayesian estimate is θ̃_s = (0.92, 0.86, 0.24) with a diagonal posterior variance matrix v_s. Based on the large stone group, the Bayesian estimate is θ̃_l = (0.73, 0.68, 0.77) with a diagonal posterior variance matrix v_l. Using these results, the test statistic t = (θ̃_s − θ̃_l)^T (v_s + v_l)^{-1} (θ̃_s − θ̃_l) is highly significant. Therefore, it is highly likely that the two data sets are inhomogeneous and we should not aggregate them together.

5. Experimental Evaluation

We perform experimental studies to evaluate the proposed scheme. We first evaluate the accuracy of the proposed C&A scheme. Then, we report the time and quality performance of the C&A scheme in data cube and stream data mining contexts. Finally, we apply the C&A scheme to a real data set and show that the aggregated Bayesian estimates can closely approximate the Bayesian estimates.

5.1. Quality of the proposed compression and aggregation scheme

In this subsection, we use the mixture of transition models to evaluate the quality of the proposed C&A scheme. The mixture of transition models has been used to model users visiting a website [3, 26, 27], unsupervised training of robots [24] and the dynamics of a military scenario [30].

Transition models are useful in describing time series that have only a finite number of states. The observations of a transition model are finite-state Markov chains of finite length. For example, the sequence (A, B, A, C, B, B, C) could be a realization of a 3-state first-order Markov chain, where the transition probability at time t only depends on the state of the Markov chain at time t but not on the previous history. If all the observations are realizations from the same transition model, one can readily get a closed form of the posterior mean of the parameters. However, the set of sequences may be heterogeneous and the sequences may come from several different transition models, in which case the mixture of transition models is useful in estimating the transition matrices and clustering the observed sequences.

Consider a data set of N sequences, D = {x_1, ..., x_N}, that are realizations from some s-state discrete first-order Markov process. The sequences are possibly of different lengths. Assume that each sequence comes from one of m transition models. Let (l)P_{ij} be the element (i, j) of the l-th probability transition matrix, i.e. the transition probability from state i to state j for a process in cluster l. Let (l)p_i be the i-th element of the initial state distribution of processes from cluster l. Further assume that α_l is the probability that a process is from cluster l. Denote x_k^0 as the initial state of the sequence x_k and n_{ij}^{(k)} as the number of times that the process x_k transitioned from state i to state j. The mixture of transition models is
$$ f(x_k \mid \theta) = \sum_{l=1}^{m} \alpha_l \prod_{i=1}^{s} {}^{(l)}p_i^{\,I(x_k^0 = i)} \prod_{i=1}^{s} \prod_{j=1}^{s} {}^{(l)}P_{ij}^{\,n_{ij}^{(k)}}, $$
where θ is the parameter vector consisting of (l)P_{ij}, (l)p_i and α_l as its elements, and I(·) is the indicator function. The prior distributions for the parameter vectors α = (α_1, ..., α_m), (l)p = ((l)p_1, ..., (l)p_s) and (l)P_i = ((l)P_{i1}, ..., (l)P_{is}) are Dirichlet priors with all parameters equal to 1. The Dirichlet priors used here are non-informative priors. The posterior mean has no closed form for this Bayesian model. However, by introducing missing data δ_l^{(k)}, a 0/1 unobserved indicator for whether process k belongs to cluster l, one can readily develop a Gibbs sampler [26, 27].

We apply the C&A scheme to a mixture of transition models. In the experiment, the number of clusters is set to 3 and the Markov chains are 2-state chains. We generated 10,000 chains from the mixture of transition models and each chain is of length 30. The underlying true parameters are set as:
1. initial probabilities: (1)p = (0.2, 0.8), (2)p = (0.9, 0.1), (3)p = (0.4, 0.6);
2. transition matrices (1)P, (2)P and (3)P (2 × 2 matrices);
3. the probability vector α = (0.2, 0.5, 0.3).

We partition the entire data set into K = 1, 10, 20, 100 cells with equal numbers of observations and then use our C&A scheme to approximately compute the posterior mean for the entire data set. We run the Gibbs sampler for 11,000 iterations and set the number of burn-in iterations to 1,000. Note that the

estimate corresponding to K = 1 is just the posterior mean. Let (l)p̃, (l)P̃ and α̃ be the estimates of (l)p, (l)P and α (l = 1, 2, 3), respectively. We define the maximum absolute deviations (MAD) as D(p̃, p) = max{|(l)p̃_i − (l)p_i| : l = 1, 2, 3, i = 1, 2}, D(P̃, P) = max{|(l)P̃_{ij} − (l)P_{ij}| : l = 1, 2, 3, i, j = 1, 2} and D(α̃, α) = max{|α̃_l − α_l| : l = 1, 2, 3}.

Figure 1 shows the MAD of the aggregated estimates from the different partitions. The solid line is for the MAD D(p̃, p), the dashed line is for D(P̃, P), and the dotted line is for D(α̃, α). As seen from the low MAD values (all at most 0.005), the estimates under the various numbers of partitions all have very small errors. The evaluation shows that the accuracy of the aggregated estimates from our C&A scheme is almost as good as the accuracy of the original Bayesian estimates.

Fig. 1. MAD of the aggregated estimates with a varying number of partitions K, where the solid, dashed and dotted lines correspond to the initial probabilities (l)p, the transition matrices (l)P, and the parameter α in the mixture of transition models, respectively.

5.2. Performance on data streams

In this experiment, we apply our aggregation method to data streams. The Bayesian model under consideration is the linear model with 5 predictors x_1, ..., x_5, i.e.
$$ y = \beta_0 + \sum_{i=1}^{5} \beta_i x_i + \varepsilon, $$
where ε is the error term. In the experiment, we set the true parameters β = (β_0, ..., β_5) = (0, 1, 2, 3, 4, 5) and the total number of observations N to 5 million. We generate the covariates x_i (i = 1, ..., 5) from the standard normal distribution, generate the error term ε from N(0, σ² = 4), and calculate the response y from the above equation. The priors of the parameters in the Bayesian model are the flat priors, i.e. π(β_i) ∝ 1 (i = 0, ..., 5) and π(σ²) ∝ 1/σ². The Gibbs sampler can then be easily developed.

We update our model for every 1000 new data records. In our method, whenever we receive 1000 new data records, we compute their ALCR, update the Bayesian linear model by aggregating the ALCR with the previous ALCRs, and discard the raw data. We compare the performance of our method to a naive method, which stores all the stream data and uses the raw data to update the model for every 1000 new data records. We run the Gibbs sampler with 1000 burn-in iterations and set s to 5.
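Although the paper does not spell out the sampler, a minimal numpy sketch for this flat-prior linear model is given below; it uses the standard full conditionals for this model (β given σ² and y is normal around the least-squares fit, and σ² given β and y is inverse-gamma). The function name and defaults are ours, and the thinning parameter s is omitted for brevity.

```python
import numpy as np

def gibbs_blm(X, y, n_iter=6000, burn_in=1000, seed=0):
    """Gibbs sampler for y = X beta + eps, eps ~ N(0, sigma^2), under the
    flat priors pi(beta) proportional to 1 and pi(sigma^2) proportional to 1/sigma^2.

    X : (n, p) design matrix (include a column of ones for the intercept)
    y : (n,) response vector
    Returns the posterior-mean estimate of beta.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y          # least-squares estimate
    sigma2 = np.var(y - X @ beta_hat)     # starting value
    draws = []
    for it in range(n_iter):
        # beta | sigma^2, y  ~  N(beta_hat, sigma^2 (X^T X)^{-1})
        beta = rng.multivariate_normal(beta_hat, sigma2 * XtX_inv)
        # sigma^2 | beta, y  ~  Inverse-Gamma(n/2, RSS(beta)/2)
        rss = np.sum((y - X @ beta) ** 2)
        sigma2 = 1.0 / rng.gamma(shape=n / 2.0, scale=2.0 / rss)
        if it >= burn_in:
            draws.append(beta)
    return np.mean(draws, axis=0)
```

In the streaming setting, such a sampler is run only on each new 1000-record segment; the segment's posterior mean and size form its ALCR, which is then merged with the previous ALCRs by the weighted average of Section 4.1.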

Figure 2a shows the MADs between the aggregated estimate β̃ (dashed line), the estimate from the naive method β̂ (solid line) and the true parameter β. Figure 2b shows the MAD between β̃ and β̂. Figure 2c gives the computational time used for updating the parameter estimates using our C&A scheme (dashed line) and the naive method (solid line). We see that, compared to the naive method, the C&A method gives almost identical accuracy, but saves a tremendous amount of computing time. In fact, from Figure 2c, we see that the C&A method uses a nearly constant time to perform each online update, while the naive method uses more time as more data accumulate. It is clear that the C&A method is more suitable for stream data mining.

Fig. 2. Comparison of the C&A method (dotted lines) and direct method (solid lines) in data streams: (a) MAD between β̃, β̂ and β_0; (b) MAD between β̃ and β̂; (c) update time.

5.3. Performance on data cubes

Experiment 1. In this experiment, we study the efficiency and quality of the compression and aggregation scheme for aggregated cells in data cubes. The Bayesian model under consideration is again the mixture of transition models. The two dimensions are time and location. Since the MCMC algorithm for the mixture of transition models is highly time-consuming even for moderate-sized data, we consider a relatively small data cube in this experiment. We have 20 months of records in the time dimension and 50 states in the location dimension. In practice, the data can be the records of a website that logs users' visits to the website. The location dimension can be the IP address of the user. For each state in each month, we have 500 observations, i.e. we have 500 users' records. Hence, we have 500,000 observations in total. The observations are sequences that record users' visiting paths in the website. As in Section 5.1, the number of clusters is set to 3 and the Markov chains are 2-state chains. The underlying true parameters are also set as in Section 5.1.

We compare our ALCR method to the direct Bayesian estimation method, which directly uses raw data to calculate the Bayesian estimates of the parameters, by comparing their computing time for handling 100 randomly generated queries. To save the computing time of the direct Bayesian estimation method, the aggregated cells that the queries ask for can have at most 200 base cells. More specifically, to generate a query, we first randomly select a number D from {1, ..., 200}, and then we randomly select D cells from the 1000 base cells (t_i, l_j) (i = 1, ..., 20, j = 1, ..., 50). The corresponding query asks for parameter estimates of the mixture of transition models based on the data of the selected D base cells. For example, assume that D is randomly selected as 3 and the base cells are chosen as c_1 = (t_1, l_1), c_5 = (t_5, l_5), c_30 = (t_30, l_30). Then, the aggregated estimate of the aggregated cell c_a = c_1 ∪ c_5 ∪ c_30 is calculated by aggregating the ALCRs of the base cells c_1, c_5 and c_30; the Bayesian estimate is directly calculated with Gibbs sampling based on the raw data of the aggregated cell c_a. We run the Gibbs sampler for 6,000 iterations and set the number of burn-in iterations to 1,000. Table 2 shows the time with and without using compression, respectively.

Table 2. Comparison of the computational time in Experiment 1.

                    C&A method       direct method
Compression         1,403 minutes    N/A
Query processing    0.1 minute       19,049 minutes

The first row shows the computational time for compression and the second row shows the aggregation time for all these 100 queries. Without ALCR compression, the aggregation time is the time to compute Bayesian estimates directly from the raw data in the selected cells. It is obvious that our method saves a huge amount of computational time when handling OLAP queries in a data cube.

Figure 3 compares the MADs of the estimates for each query from the ALCR method and the direct method. The dotted lines are for the ALCR method and the solid lines are for the direct method. Figures 3(a), (b) and (c) are the MADs of the estimates for the initial probabilities (l)p, the transition matrices (l)P, and the probabilities α, respectively. The queries are ordered by their sizes, i.e. by the number of base cells in the queries. Figure 3 shows that the estimates from the ALCR method tend to have larger MAD than the estimates from the direct method when the size of the query is large, especially for the estimates of the initial probabilities, although in general all the MADs for both methods are very small.

Fig. 3. Comparison of the C&A method (dotted lines) and direct method (solid lines) in Experiment 1: (a) MADs for (l)p; (b) MADs for (l)P; (c) MADs for α.

Figure 4 shows the MAD between the original Bayesian estimates and the ALCR-based estimates. The queries are ordered by their sizes. The differences of the initial probability estimates are generally larger compared to the estimates of the other parameters, but overall the two estimates are close.

Experiment 2. In this experiment, we consider the Bayesian estimator of the linear regression model and compare the computational efficiency and the accuracy of the Bayesian estimator and the C&A estimator in data cubes. The model under consideration is the same as in Section 5.2, but the underlying true parameter β was set as (1, 2, 3, 4, 5, 6). The data cube is a 6-dimensional data cube and the dimension sizes are 50, 120, 5, 4, 3 and 2, respectively. Thus, the data cube contains 50 × 120 × 5 × 4 × 3 × 2 = 720,000 base cells. To introduce more variation, the number of observations in each base cell was sampled uniformly from {100, ..., 1000} and the variance of the error term was sampled from the chi-square distribution with 2 degrees of freedom. In total, the data cube has around 50 GB of raw data. We randomly generated 2000 queries and set a maximum number of base cells per query. The procedure for generating the queries is similar to that in Experiment 1. We fixed the total number of iterations and the number of burn-in iterations of the Gibbs sampler. Table 3 shows the computation time for the two methods. Again, we see that the aggregation method saves a large amount of computational time compared with the direct Bayesian estimation

method. The accuracy of the estimates based on the ALCR method is also similar to that of the original Bayesian estimates (Figure 5).

Table 3. Comparison of the computational time in Experiment 2.

                    C&A method     direct method
Compression         645 minutes    N/A
Query processing    1 minute       1,779 minutes

Fig. 4. MAD between the original and the aggregated estimates in Experiment 1. The solid, dashed and dotted lines are MADs for the initial probabilities (l)p, the transition matrices (l)P, and the parameter α in the mixture of transition models, respectively.

Fig. 5. Comparison of the accuracy of the Bayesian estimates and the C&A estimates in Experiment 2: (a) MAD of the Bayesian estimates; (b) MAD of the ALCR estimates. The queries are ordered by the base cell number.

5.4. Application to a real data set

In this section, we apply our compression and aggregation scheme to the Behavioral Risk Factor Surveillance System (BRFSS) survey data [11]. The BRFSS, administered by the Centers for Disease Control and Prevention, is an ongoing data collection program designed to measure behavioral risk factors in the adult population. The BRFSS collects surveillance data on risk behaviors through monthly telephone interviews of people in the 50 states and 5 districts of the United States of America. After filtering records with missing data, this data set has around 1.2 million data points.

We are interested in modeling the variable body mass index (BMI4). The variable BMI4 can take values from 1 to 9998 and we view it as a continuous variable. The explanatory variables are SEX, AGE, EXERANY2, DIABETE2, DRNKANY4, RFSMOK3 and EDUCAG. The variables EXERANY2, DIABETE2, DRNKANY4 and RFSMOK3 describe whether an interviewee has had any kind of exercise during the past month, was told by a doctor that he/she has diabetes, has been drinking alcoholic beverages during the past month, and is a smoker, respectively. The variable EDUCAG indicates the completed education

level of an interviewee. We stratify this variable into two levels, high school or lower and above high school. The variables SEX and AGE are the sex and age of an interviewee. For notational simplicity, denote Y as the response variable BMI4, and X_1, ..., X_7 as the seven explanatory variables. The model under consideration is the following linear model
$$ Y = \beta_0 + \sum_{i=1}^{7} \beta_i X_i + \varepsilon, $$
where ε is the error term. We compare the original Bayesian parameter estimate β̂ with the aggregated estimate β̃. To accommodate the different magnitudes of the estimates β̂_i, we use the mean relative difference
$$ \frac{1}{8} \sum_{i=0}^{7} |\tilde\beta_i - \hat\beta_i| / |\hat\beta_i| $$
as a measure of the accuracy of the aggregated estimate β̃. The data set in each year can be partitioned into 12 subsets by month, and each subset can be further partitioned by state. We compute the ALCR for each state in each month, and then aggregate the ALCRs. For the data in each month, we can get the aggregated estimates over the states. Figure 6 shows the mean relative difference of these aggregated estimates. The relative differences are always less than 0.04, which suggests the high accuracy of the ALCR method.

Fig. 6. The mean relative difference between the Bayesian estimates β̂ and the aggregated estimates β̃ for the BRFSS data, by year.

6. Discussion of Related Work

We discuss some related works and compare them to ours. Statistical models can be put into two categories: parametric models, such as linear regression and logistic regression, and nonparametric models, such as probability-based ensembles, naive Bayesian classifiers and kernel-density-based classifiers. In parametric models, emphasis is often put on parameter estimation, such as how accurate an estimator is. On the other hand, prediction accuracy is more important in evaluating the performance of a nonparametric model.

The framework of regression cubes [7, 8] develops a lossless compression and aggregation scheme for general multiple linear regression. Another closely related work is that on prediction cubes [6], which supports

OLAP of prediction models including probability-based ensembles, the naive Bayesian classifier, and kernel-density classifiers. Prediction cubes bear similar ideas to regression cubes in that both aim at deriving high-level models from lower-level models instead of accessing the raw data and rebuilding the models from scratch. A key difference is that the prediction cube only supports models that are distributively decomposable or algebraically decomposable [6], whereas the Bayesian models in our study are not. Also, prediction cubes deal with the prediction accuracy of nonparametric statistical models, whereas our compression theory is developed for parameter reconstruction of Bayesian models.

The above developments all focus on lossless computation for data cubes. Alternatively, asymptotically lossless computation that provides good approximations to the desired results is also acceptable in many applications when efficient storage and computation are attainable. Recently, a nearly lossless compression and aggregation scheme has been developed for logistic regression, a nonlinear parametric model [34]. An approximation technique called the quasi-cube uses the loglinear model, a parametric model, to characterize regions of a data cube [2]. Efficient storage and fast computation are achieved by storing the parameters of the loglinear models instead of the original data. In quasi-cubes, the desired computation is done based on approximations to the original data provided by the loglinear model. However, it is difficult to quantify the approximation errors in a quasi-cube.

Our paper considers aggregation operations without accessing the raw data. Palpanas, Koudas, and Mendelzon [22] have considered the reverse problem, which is to derive the original raw data from the aggregates. An approximate estimation algorithm based on maximum information entropy is proposed in [22]. It will be interesting to study the interactions of these two complementary approaches.

Safarinejadian et al. [28] recently proposed a distributed EM algorithm for estimating parameters in finite mixture models. The EM algorithm and the MCMC

algorithm are two alternative approaches for estimating parameters in finite mixture models. EM algorithms are generally faster than MCMC algorithms, but MCMC algorithms are generally easier to develop and implement, and MCMC algorithms can readily provide interval estimates. A comparison of our aggregated Bayesian estimate with their distributed EM algorithm would be interesting.

Dimension hierarchies, cubes, and cube operations were formally introduced by Vassiliadis [33]. Lenz and Thalheim [19] proposed to classify OLAP aggregation functions into distributive, algebraic, and holistic ones. In data warehousing and OLAP, much progress has been made on the efficient support of standard and advanced OLAP queries in data cubes, including selective cube materialization [15] and intelligent roll-up [29]. However, the measures studied in previous OLAP systems are usually single values or simple statistics, not sophisticated statistical models such as the Bayesian models studied in this paper.

Our work is related to database engine architectures such as Netezza (www.netezza.com) and Infobright, where the synopses computed for the partitioned data blocks are used for query optimization and execution. Infobright has recently introduced the notion of a rough query, which is an approximate query based only on the synopsis without drilling down to full details. Our method matches this framework. We plan to extend the open source version of Infobright with the proposed Bayesian synopses in order to enrich querying with elements of Bayesian modeling.

7. Conclusions

In this paper, we have proposed an asymptotically lossless compression and aggregation technique to support efficient Bayesian estimation of statistical models. We have developed a compression and aggregation scheme that compresses a data segment into a compressed representation whose size is independent of the size of the data segment. Under regularity conditions, we have proved that the aggregated estimator is strongly consistent and asymptotically error-free. We have further proposed a compression and aggregation scheme that enables detection of non-homogeneous data.

Our experimental studies on data cubes and data streams show that our compression and aggregation method can significantly reduce computational time with little loss of accuracy. Moreover, the aggregation error diminishes as the size of the data increases. Therefore, the proposed scheme is widely applicable as it enables efficient and accurate construction of Bayesian statistical models in a distributed fashion. It can be used in the contexts of data cubes and OLAP, stream data mining, and cloud computing. For data cubes, it allows us to quickly perform OLAP operations and compute Bayesian statistics at any level in a data cube without retrieving or storing the raw data. For stream data mining, it enables efficient one-scan online computation of Bayesian statistics, without requiring the raw data to be retained. For cloud computing, it facilitates analysis of distributed large datasets under parallel processing paradigms such as MapReduce.

The proposed scheme works best for scenarios of homogeneous data. Since inhomogeneous data are more likely in data streams, we would expect the convergence in stream data to be worse than in OLAP.
However, as statistical analysis without realizing the inhomogeneity would be misleading, our method also provides a way to detect this inhomogeneity without accessing the raw data.

Acknowledgements. This research is partly supported by an NSF NeTS grant and a Microsoft Research New Faculty Fellowship to Y.C., and by an NSF DMS grant to N.L.

References

[1] A. Agresti. Categorical Data Analysis. John Wiley and Sons, New Jersey, 2nd edition.
[2] D. Barbara and X. Wu. Loglinear-based quasi cubes. Journal of Intelligent Information Systems, 16.
[3] I. Cadez, D. Heckerman, P. Smyth, C. Meek, and S. White. Visualization of navigation patterns on a web site using model-based clustering. Technical report, Microsoft Research.
[4] M. T. Chao. The asymptotic behavior of Bayes estimators. The Annals of Mathematical Statistics, 41(2).
[5] C. R. Charig, D. R. Webb, S. R. Payne, and O. E. Wickham. Comparison of treatment of renal calculi by operative surgery, percutaneous nephrolithotomy, and extracorporeal shock wave lithotripsy. British Medical Journal, 292.
[6] B. Chen, L. Chen, Y. Lin, and R. Ramakrishnan. Prediction cubes. In Proceedings of the 31st VLDB Conference.
[7] Y. Chen, G. Dong, J. Han, J. Pei, B. Wah, and J. Wang. Regression cubes with lossless compression and aggregation. IEEE Transactions on Knowledge and Data Engineering, 18.
[8] Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-dimensional regression analysis of time-series data streams.
[9] K. L. Chung. A Course in Probability Theory. Elsevier, San Diego, California, 3rd edition.
[10] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Ser. B, 39:1-38.
[11] Centers for Disease Control and Prevention. Behavioral Risk Factor Surveillance System Survey Data. U.S. Department of Health and Human Services, Centers for Disease Control and Prevention.
[12] J. K. Ghosh and R. V. Ramamoorthi. Bayesian Nonparametrics. Springer, New Jersey.
[13] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54.
[14] J. Han, Y. Chen, G. Dong, J. Pei, B. W. Wah, J. Wang, and Y. Cai. Stream cube: An architecture for multi-dimensional analysis of data streams. Distributed and Parallel Databases, 18(2).
[15] V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. In Proceedings of the ACM SIGMOD International Conference on Management of Data.
[16] S. A. Julious and M. A. Mullee. Confounding and Simpson's paradox. British Medical Journal, 309.
[17] A. Khoshgozaran, A. Khodaei, M. Sharifzadeh, and C. Shahabi. A hybrid aggregation and compression technique for road network databases. Knowledge and Information Systems, 17(3).
[18] E. L. Lehmann and G. Casella. Theory of Point Estimation. Springer, New Jersey, 2nd edition.
[19] H. Lenz and B. Thalheim. OLAP databases and aggregation functions. In Proceedings of the 13th International Conference on Scientific and Statistical Database Management.
[20] C. Liu, M. Zhang, M. Zheng, and Y. Chen. Step-by-step regression: A more efficient alternative for polynomial multiple linear regression in stream cube. In Proceedings of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining.
[21] H. Liu, Y. Lin, and J. Han. Methods for mining frequent items in data streams: an overview. Knowledge and Information Systems, pages 1-30, 2011.

[22] T. Palpanas, N. Koudas, and A. O. Mendelzon. Using datacube aggregates for approximate querying and deviation detection. IEEE Transactions on Knowledge and Data Engineering, 17(11).
[23] S. Pang, S. Ozawa, and N. Kasabov. Incremental linear discriminant analysis for classification of data streams. IEEE Transactions on Systems, Man and Cybernetics, Part B, 35(5):905-914.
[24] M. Ramoni, P. Sebastiani, and P. Cohen. Bayesian clustering by dynamics. Machine Learning, 47(1):99-121.
[25] C. R. Rao. Linear Statistical Inference and Its Applications. John Wiley, New York.
[26] G. Ridgeway. Finite discrete Markov process clustering. Technical report, Microsoft Research, MSR-TR.
[27] G. Ridgeway and S. Altschuler. Clustering finite discrete Markov chains. In Proceedings of the Section on Physical and Engineering Sciences.
[28] B. Safarinejadian, M. B. Menhaj, and M. Karrari. A distributed EM algorithm to estimate the parameters of a finite mixture of components. Knowledge and Information Systems, 23(3).
[29] G. Sathe and S. Sarawagi. Intelligent rollups in multidimensional OLAP data. In Proceedings of the 27th VLDB Conference.
[30] P. Sebastiani, M. Ramoni, P. Cohen, J. Warwick, and J. Davis. Discovering dynamics using Bayesian clustering. In Advances in Intelligent Data Analysis, Lecture Notes in Computer Science. Springer.
[31] A. N. Shiryaev. Probability. Springer, New Jersey, 2nd edition.
[32] M. A. Tanner and W. H. Wong. The calculation of posterior distribution by data augmentation. Journal of the American Statistical Association, 82.
[33] P. Vassiliadis. Modeling multidimensional databases, cubes and cube operations. In Proceedings of the 10th International Conference on Scientific and Statistical Database Management, pages 53-62.
[34] R. Xi, N. Lin, and Y. Chen. Compression and aggregation for logistic regression analysis in data cubes. IEEE Transactions on Knowledge and Data Engineering, 21(4).

Author Biographies

Ruibin Xi is currently a Research Associate at the Center for Biomedical Informatics, Harvard Medical School. He received his Ph.D. from the Department of Mathematics, Washington University in St. Louis. His research interests include statistical analysis of next-generation sequencing data, copy number and structural variation, statistical computing, massive data analysis, variable selection methods, and Bayesian statistics.

Nan Lin is an Associate Professor of Mathematics and Biostatistics at Washington University in St. Louis. He received his Ph.D. in Statistics from the University of Illinois at Urbana-Champaign and worked as a Postdoctoral Associate at Yale University School of Medicine starting in 2003. His research interests include statistical computing, massive data analysis, robust statistics, bioinformatics, and psychometrics. He is a member of the American Statistical Association and the International Chinese Statistical Association.

Yixin Chen is an Associate Professor of Computer Science at Washington University in St. Louis. He received his Ph.D. in Computing Science from the University of Illinois at Urbana-Champaign. His research interests include nonlinear optimization, constrained search, planning and scheduling, data mining, and data warehousing. His work on constraint partitioning and planning won First Prizes in the optimal and satisficing tracks of the International Planning Competitions (2004 and 2006), the Best Paper Award at the International Conference on Tools with AI (2005), and the Best Paper Award at the AAAI Conference (2010). His work on data clustering won the Best Paper Award at the International Conference on Machine Learning and Cybernetics (2004) and a Best Paper nomination at the International Conference on Intelligent Agent Technology (2004). He is partially funded by an Early Career Principal Investigator Award (2006) from the Department of Energy and a Microsoft Research New Faculty Fellowship (2007).

Youngjin Kim is a Software Engineer at Google Inc. in Mountain View, California. He received his Master's degree from the Department of Computer Science at Washington University in St. Louis. His research interests include machine learning and data mining from huge and noisy real-world data.

Correspondence and offprint requests to: Yixin Chen, Department of Computer Science, Washington University, St. Louis, MO, USA. Email: chen@cse.wustl.edu
