A Bootstrap Metropolis-Hastings Algorithm for Bayesian Analysis of Big Data
1 A Bootstrap Metropolis-Hastings Algorithm for Bayesian Analysis of Big Data
Faming Liang, University of Florida
August 9, 2015
2 Abstract
MCMC methods have proven to be a very powerful tool for analyzing data of complex structures. However, they are compute-intensive: they typically require a large number of iterations and a complete scan of the full dataset at each iteration, which precludes their use for big data analysis. We propose the so-called bootstrap Metropolis-Hastings (BMH) algorithm, which provides a general framework for taming powerful MCMC methods for big data analysis; that is, the full-data log-likelihood is replaced by a Monte Carlo average of log-likelihoods calculated in parallel from multiple bootstrap samples. The BMH algorithm possesses an embarrassingly parallel structure and avoids repeated scans of the full dataset across iterations, and is thus feasible for big data problems. Compared to the popular divide-and-combine method, BMH can be more efficient in general, as it can asymptotically integrate the whole data information into a single simulation run. The BMH algorithm is also very flexible: like the MH algorithm, it can serve as a basic building block for developing advanced MCMC algorithms that are feasible for big data problems.
3 Big Data
Big Data: data too large to handle easily on a single server, or too time-consuming to analyze using traditional statistical methods.
Examples of big data:
- Genome data: using big data to find better treatments for patients through genomic sequencing technologies.
- Atmospheric science data: rapidly ballooning observations (e.g., radar, satellites, sensor networks), climate data, ensemble data.
- Social science data: social networks (Facebook, LinkedIn), social media data (news, telephone calls).
- Finance data, image data, etc.
4 Big Data Challenges
- Accessing, using, and visualizing data.
- Server-side processing and distributed storage.
- Limited number of statistical methods: from the view of statistical inference, it is unclear how current statistical methodology can be transported to the paradigm of big data.
- Modeling: with growing size typically comes a growing complexity of data structures and of the models needed to account for those structures.
- Missing data.
5 Strategies Used in Big Data Analysis
- Split and merge: Lin and Xi (2011, SII), aggregated estimating equation; Xie (2013), high-dimensional variable selection; Song and Liang (2015, JRSSB), Bayesian high-dimensional variable selection.
- Online learning: stream data.
- Data/model reduction: using low-rank models for approximate inference on massive data.
- Subsampling: Liang et al. (2013, JASA); bag of little bootstraps (Kleiner et al., 2012), which provides an efficient way of bootstrapping for big data estimators by combining the results of bootstrapping multiple small subsets of the big original dataset.
6 Aggregated Estimating Equation
The aggregated estimating equation (Lin and Xi, 2011) employs a divide-and-combine strategy: first, the raw data of each partition of the full dataset are compressed into low-dimensional statistics; then an approximation to the estimating equation estimator, the aggregated estimating equation estimator, is obtained by solving an equation aggregated from the saved low-dimensional statistics of all partitions.
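As a minimal sketch of this strategy (my own illustration, not the authors' code): for linear least squares the compression is exact, since each partition only needs to report X'X and X'y, and solving the aggregated normal equations reproduces the full-data estimator.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 20_000, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(size=n)

# Compress each partition into the low-dimensional statistics (X'X, X'y),
# then solve the aggregated normal equations; no partition's raw data is revisited.
XtX = np.zeros((p, p))
Xty = np.zeros(p)
for Xc, yc in zip(np.array_split(X, 10), np.array_split(y, 10)):
    XtX += Xc.T @ Xc
    Xty += Xc.T @ yc
beta_agg = np.linalg.solve(XtX, Xty)  # identical to full-data OLS here
```

For general (nonlinear) estimating equations the aggregation is only approximate, which is the case the AEE paper analyzes.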
7 Resampling-Based Stochastic Approximation
Liang et al. (2013, JASA) proposed a new parameter estimator for big data problems, the maximum mean log-likelihood estimator, together with a resampling-based stochastic approximation method for computing it. Instead of solving the full-data score equation
∇_θ l_n(θ | X_n) = 0,
it targets the mean score equation
E[∇_θ l_m(θ | X_m)] = 0,
approximated by averaging the score over the size-m subsamples,
(n choose m)^{-1} Σ_i ∇_θ l_{n,m,i}(θ | X_m) = 0.
The resampling-based stochastic approximation method successfully avoids some difficulties of big data computation, such as scanning the whole dataset.
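The idea can be illustrated with a toy sketch (my own illustration under an assumed N(θ, 1) model, not the authors' implementation): each stochastic approximation step draws a small subsample, evaluates the score on it, and moves θ along the score with a decreasing gain, so the full dataset is never scanned.

```python
import numpy as np

rng = np.random.default_rng(6)
data = rng.normal(1.5, 1.0, size=100_000)  # big dataset from N(theta, 1), true theta = 1.5

theta, m = 0.0, 1_000
for t in range(1, 501):
    xs = rng.choice(data, size=m, replace=False)  # small subsample; no full-data scan
    score = np.mean(xs - theta)                   # mean score of N(theta, 1) on the subsample
    theta += score / t ** 0.7                     # Robbins-Monro step with gain a_t = t^(-0.7)
```

Under standard stochastic approximation conditions the iterates converge to the root of the mean score equation, here the sample mean of the data.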
8 Bag of Little Bootstraps
The bootstrap is a resampling-based method that has been widely used in applied statistics for assessing the quality of estimators since it was proposed by Efron (1979). The bag of little bootstraps (Kleiner et al., 2012) provides an efficient way of bootstrapping for big data estimators: it combines the results of bootstrapping multiple small subsets of the big original dataset.
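A minimal sketch of the bag of little bootstraps (my own illustration; the subset size, counts of subsets, and resamples are arbitrary choices here): each small subset is inflated to a size-n resample via multinomial weights, the estimator's quality measure is computed per subset, and the results are averaged.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)  # full dataset
n = x.size

# Goal: estimate the standard error of the sample mean (true value 2 / sqrt(n)).
s, r = 20, 50                  # number of little subsets, resamples per subset
b = int(n ** 0.6)              # subset size b = n^gamma with gamma = 0.6
se_per_subset = []
for _ in range(s):
    subset = rng.choice(x, size=b, replace=False)
    stats = []
    for _ in range(r):
        # a size-n resample of the subset, represented compactly by multinomial counts
        counts = rng.multinomial(n, np.full(b, 1.0 / b))
        stats.append(counts @ subset / n)         # mean of the size-n resample
    se_per_subset.append(np.std(stats, ddof=1))   # bootstrap SE from this subset
se_blb = float(np.mean(se_per_subset))            # combine across subsets
```

The multinomial-count trick is what makes BLB cheap: each "size-n" resample touches only the b distinct points of its subset.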
9 Bootstrap MH Algorithm: Motivation
Markov chain Monte Carlo (MCMC) methods have been widely used in statistical data analysis, and they have proven to be a very powerful, and often the only practical, computational tool for analyzing data of complex structures. However, MCMC methods are compute-intensive: they typically require a large number of iterations and a complete scan of the full dataset at each iteration. This feature precludes their use for big data analysis. We aim to develop a framework under which powerful MCMC methods can be tamed for use in big data analysis, for tasks such as parameter estimation, optimization, and model selection.
10 Bootstrap MH Algorithm: Basic Idea
The bootstrap Metropolis-Hastings (BMH) algorithm works by replacing the full-data log-likelihood with a Monte Carlo average of log-likelihoods calculated in parallel from multiple bootstrap samples, where a bootstrap sample is a small set of observations drawn at random from the full dataset, with or without replacement. In this way, BMH avoids repeated scans of the full dataset across iterations, while still producing sensible solutions, such as parameter estimates or posterior samples, to the problem under consideration. BMH is feasible for big data and workable on parallel and distributed architectures.
11 BMH Algorithm: Notation
Let D_i denote a bootstrap sample of D, resampled from the full dataset at random, with or without replacement. Let m denote the size of D_i = {x_{ij} : j = 1, 2, ..., m}. If resampling is done without replacement, D_i is called a subsample or (n choose m)-bootstrap sample; otherwise, D_i is called an m-out-of-n bootstrap sample or m/n-bootstrap sample.
12 BMH Algorithm: Notation
Let f(D_i | θ) denote a likelihood-like function of D_i, and define
l_{m,n,k}(D_s | θ) = (1/k) Σ_{i=1}^k log f(D_i | θ),   (1)
where k denotes the number of bootstrap samples drawn from D, and D_s = {D_1, ..., D_k} is the collection of the bootstrap samples. The definition of f(D_i | θ) depends on the features of D. If the observations in D are independently and identically distributed (i.i.d.), then, regardless of whether D_i is an (n choose m)- or m/n-bootstrap sample, we define
f(D_i | θ) = Π_{j=1}^m f(x_{ij} | θ).   (2)
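Definitions (1) and (2) can be computed directly; the sketch below assumes, purely for illustration, an i.i.d. N(θ, 1) model (not part of the slides). In the parallel implementation each of the k terms would be evaluated on its own node.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(0.5, 1.0, size=50_000)   # full dataset D, i.i.d. N(theta, 1)
m, k = 500, 25                             # bootstrap sample size and count

def log_f(theta, d_i):
    # log f(D_i | theta) under (2): sum of N(theta, 1) log-densities
    return -0.5 * np.sum((d_i - theta) ** 2) - 0.5 * d_i.size * np.log(2 * np.pi)

def l_mnk(theta):
    # Monte Carlo average (1) over k m/n-bootstrap samples;
    # replace=False would give the (n choose m)-bootstrap version instead
    d_s = [rng.choice(data, size=m, replace=True) for _ in range(k)]
    return np.mean([log_f(theta, d_i) for d_i in d_s])
```

Each call touches only k·m observations rather than the full dataset.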
13 BMH Algorithm: Algorithm
1. Draw ϑ from a proposal distribution Q(θ_t, ϑ).
2. Draw k bootstrap samples D_1, ..., D_k via (n choose m)- or m/n-bootstrapping. Let D_s = {D_1, ..., D_k}.
3. Calculate the BMH ratio:
r(θ_t, D_s, ϑ) = exp{l_{m,n,k}(D_s | ϑ) − l_{m,n,k}(D_s | θ_t)} [π(ϑ) Q(ϑ, θ_t)] / [π(θ_t) Q(θ_t, ϑ)].
4. Set θ_{t+1} = ϑ with probability α(θ_t, D_s, ϑ) = min{1, r(θ_t, D_s, ϑ)}, and set θ_{t+1} = θ_t with the remaining probability.
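The four steps above can be put together in a minimal single-threaded sketch for a toy N(θ, 1) mean problem (my own illustration: a flat prior and a symmetric random-walk proposal are assumed, so the BMH ratio reduces to its likelihood part; constants in the log-likelihood are dropped since they cancel).

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(2.0, 1.0, size=100_000)  # big dataset, N(theta, 1), true theta = 2
m, k, n_iter = 200, 25, 2000

chain, theta = [], 0.0
for _ in range(n_iter):
    prop = theta + rng.normal(scale=0.2)               # step 1: draw from Q(theta_t, .)
    d_s = rng.choice(data, size=(k, m), replace=True)  # step 2: k m/n-bootstrap samples
    ll_cur = np.mean(-0.5 * np.sum((d_s - theta) ** 2, axis=1))
    ll_prop = np.mean(-0.5 * np.sum((d_s - prop) ** 2, axis=1))
    # steps 3-4: flat prior, symmetric Q, so r = exp{l(D_s|prop) - l(D_s|theta_t)}
    if np.log(rng.uniform()) < ll_prop - ll_cur:
        theta = prop
    chain.append(theta)

post_mean = np.mean(chain[500:])  # should settle near the true value 2.0
```

Note that both log-likelihoods in the ratio are evaluated on the same fresh D_s, as step 2 of the algorithm requires, and each iteration costs O(km) rather than O(n).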
14 BMH Algorithm: Remarks
In BMH, {θ_t} forms a Markov chain with transition kernel
P_{m,n,k}(θ, dϑ) = Σ_{D_s ∈ 𝒟} α(θ, D_s, ϑ) Q(θ, dϑ) ψ(D_s) + δ_θ(dϑ) [1 − Σ_{D_s ∈ 𝒟} ∫_Θ α(θ, D_s, ϑ') Q(θ, dϑ') ψ(D_s)],   (3)
where 𝒟 denotes the space of D_s, ψ(D_s) denotes the probability of drawing D_s, and δ_θ(·) is an indicator function. For (n choose m)-bootstrapping, ψ(D_s) = (n choose m)^{-k}, and for m/n-bootstrapping, ψ(D_s) = 1/n^{mk}.
15 BMH Algorithm: Remarks
When the observations in D are i.i.d., both resampling schemes, (n choose m)- and m/n-bootstrapping, lead to the same stationary distribution of BMH.
Since BMH is designed for simulation on parallel computers, the parameter k specifies the number of processors/nodes used in computing the averaged log-likelihood function. Theoretically, a large value of k is preferred; however, an extremely large k may slow down the computation due to increased inter-node communication. In our experience, k does not need to be very large for BMH to perform well.
The choice of m can depend on the complexity of the model under consideration, in particular the dimension of θ. In general, m should increase with the complexity of the model.
16 BMH Algorithm: Convergence
Let g_m(D | θ) = exp{E[log f(D_i | θ)]}, where E[·] denotes the expectation. Define the transition kernel
P_m(θ, dϑ) = α(θ, ϑ) Q(θ, ϑ) + δ_θ(dϑ) [1 − ∫_Θ α(θ, ϑ') Q(θ, ϑ') dϑ'],   (4)
which is induced by the proposal Q(·, ·) for an MH move with invariant distribution
π_m(θ | D) ∝ g_m(D | θ) π(θ).   (5)
17 BMH Algorithm: Convergence
Assume the following conditions hold:
(A) sup_{θ ∈ Θ} E|log f(X_i | θ)| < ∞.
(B) P_m defines an irreducible and aperiodic Markov chain such that π_m(·) P_m = π_m(·). Therefore, for any starting point θ_0 ∈ Θ, lim_{t→∞} ||P_m^t(θ_0, ·) − π_m(·)|| = 0, where ||·|| denotes the total variation norm.
(C) For any (θ, ϑ) ∈ Θ × Θ,
0 < exp{l_{m,n,k}(D_s | ϑ) − l_{m,n,k}(D_s | θ)} / [g_m(D | ϑ) / g_m(D | θ)] < ∞, ψ(D_s)-almost surely,
where ψ(D_s) is the resampling probability of D_s from D.
18 BMH Algorithm: Convergence
Lemma 1. Assume that condition (A) holds and m = O(n^γ). If γ < 1/2, then
U_{m,n}(D | θ) − log g_m(D | θ) →_p 0, as n → ∞.   (6)
19 BMH Algorithm: Convergence
Theorem 1 ((n choose m)-bootstrapping). Assume the observations in D are i.i.d. and conditions (A), (B), and (C) hold. Then for any ɛ ∈ (0, 1] and any θ_0 ∈ Θ, there exist N(ɛ, θ_0) ∈ ℕ, K(ɛ, θ_0, n) ∈ ℕ, and T(ɛ, θ_0, n, k) ∈ ℕ such that for any n > N(ɛ, θ_0), k > K(ɛ, θ_0, n), and t > T(ɛ, θ_0, n, k),
||P_{m,n,k}^t(θ_0, ·) − π_m(·)|| ≤ ɛ,
where π_m(·) is the stationary distribution of P_m as defined in (5).
Theorem 2 (m/n-bootstrapping). Under conditions similar to those of Theorem 1, BMH with m/n-bootstrapping has the same stationary distribution as with (n choose m)-bootstrapping.
20 BMH Algorithm: Bayesian Inference
Some key points:
It follows from the asymptotic normality of posterior distributions (see, e.g., Chen, 1985) that
π_n(θ | D) →_L N(µ_n, Σ_n),   (7)
where µ_n denotes the mode of π_n(θ | D) and Σ_n = {−∂² log(π(θ) L(D | θ)) / ∂θ ∂θ^T}^{-1}.
Under regularity conditions, we show that as m → ∞,
π_m(θ | D) →_L N(µ_n, (n/m) Σ_n).   (8)
The properties of π_n(θ | D) can therefore be conveniently inferred from BMH samples.
21 Simulated Example
Consider the normal linear regression
y_i = β_0 + β_1 x_{i1} + β_2 x_{i2} + β_3 x_{i3} + ɛ_i, i = 1, 2, ..., n,
where (β_0, β_1, β_2, β_3) = (2, 0.25, 0.25, 0) are regression coefficients, and ɛ_1, ..., ɛ_n are i.i.d. normal random errors with mean 0 and variance σ². In the simulations, we set n = 10^5 and σ² = 0.25, generate both x_1 = (x_11, ..., x_n1)^T and x_2 = (x_12, ..., x_n2)^T from the multivariate normal distribution N(0, I_n), and set x_3 = (x_13, ..., x_n3)^T = 0.7x + z, where z is also generated from N(0, I_n). Let θ = (β_0, β_1, β_2, β_3, σ²), with true value θ* = (2, 0.25, 0.25, 0, 0.25).
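This setup can be reproduced as follows (a sketch: the subscript in the slide's collinearity term did not survive transcription, so x_3 = 0.7·x_1 + z is an assumption here), with full-data OLS as a baseline that any BMH run can be checked against.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma2 = 100_000, 0.25
beta = np.array([2.0, 0.25, 0.25, 0.0])        # (beta_0, beta_1, beta_2, beta_3)

x1, x2, z = rng.normal(size=(3, n))
x3 = 0.7 * x1 + z                               # assumed form of the collinear predictor
X = np.column_stack([np.ones(n), x1, x2, x3])
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # full-data OLS, close to the truth
```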
22 Simulated Example
Table: Parameter estimation results of MH and BMH for the simulated example. MH uses the full data ((k, m) = (1, 10^5), resampling rate 100%); BMH uses (n choose m)-bootstrapping with (k, m) = (25, 200), (25, 500), and (25, 1000), i.e., resampling rates m/n of 0.2%, 0.5%, and 1%. Columns: β_0, β_1, β_2, β_3, log σ², with standard errors in parentheses. [Numeric entries not recovered from the transcription.]
23 Simulated Example
Table: Parameter estimation results of MH and BMH for the simulated example. MH uses the full data ((k, m) = (1, 10^5), resampling rate 100%); BMH uses m/n-bootstrapping with (k, m) = (25, 200), (25, 500), and (25, 1000), i.e., resampling rates of 0.2%, 0.5%, and 1%. Columns: β_0, β_1, β_2, β_3, log σ², with standard errors in parentheses. [Numeric entries not recovered from the transcription.]
24 Simulated Example
Table: Comparison of BMH (k = 50, m = 200) with the divide-and-combine (D&C) method and AMHT (approximate MH test; Korattikara et al., 2014) for parameter estimation. For each algorithm the table reports the estimate and SD (in parentheses) of β_0, β_1, β_2, β_3, and log(σ²). [Numeric entries not recovered from the transcription.]
25 Variance Estimation
Table: MH, BMH, D&C, and AMHT estimates of σ²_11, ..., σ²_55, σ²_12, σ²_34, and ρ_{β_2,β_3} obtained with pooled samples, where σ²_ij denotes the (i, j)th element of Σ_0, and ρ_{β_2,β_3} denotes the correlation coefficient of β_2 and β_3. [Numeric entries not recovered from the transcription.]
26 QQ-Plot of Posterior Samples
Figure: QQ-plots of the posterior samples generated by MH, BMH (with k = 50, m = 200 and (n choose m)-bootstrapping), and AMHT for the simulation example. Panels (a)-(e) plot BMH versus MH, and panels (f)-(j) plot AMHT versus MH, for β_0, β_1, β_2, β_3, and log σ², respectively.
27 Histogram of Posterior Samples
Figure: Histograms of the posterior samples of β_1 generated by (a) MH, (b) AMHT, and (c) BMH for the simulation example.
28 Comments on BMH
- It can asymptotically integrate the whole data information into a single simulation run.
- Like the MH algorithm, it can serve as a basic building block for developing advanced MCMC algorithms.
29 Neural Network
Universal approximation ability: a feed-forward network with a single hidden layer containing a finite number of hidden units is a universal approximator for continuous functions on compact subsets, under mild assumptions on the activation function. It is potentially a good tool for big data modeling!
30 Neural Network
Figure: A fully connected one-hidden-layer MLP network, with input units I_1-I_4, hidden units H_1-H_3, output units O_1-O_3, and a bias unit B.
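The network in the figure can be written as a single forward pass; the sketch below is my own illustration (tanh hidden units and a linear output layer are assumptions, since the slides do not specify the activation).

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """One-hidden-layer MLP: 4 inputs -> 3 tanh hidden units -> 3 linear outputs."""
    h = np.tanh(x @ W1 + b1)   # hidden layer; the bias unit B enters via b1 and b2
    return h @ W2 + b2

rng = np.random.default_rng(7)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 3)), rng.normal(size=3)
out = mlp_forward(rng.normal(size=(5, 4)), W1, b1, W2, b2)  # batch of 5 inputs
```

In the Bayesian treatment that follows, the weights (W1, b1, W2, b2) play the role of θ and the negative log-posterior plays the role of the energy function.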
31 Tempering BMH for Learning BNN with Big Data
1. Draw k bootstrap samples, D_1, ..., D_k, with or without replacement from the entire training dataset D.
2. Try to update each sample of the current population (θ_t^1, ..., θ_t^Π) by the local updating operators, where t indexes iterations and the energy function is calculated by averaging the energy values computed from the bootstrap samples D_1, ..., D_k.
3. Try to exchange θ_t^i with θ_t^j for a number of pairs (i, j), with i sampled uniformly on {1, ..., Π} and j = i ± 1 with probability ω_{i,j}, where ω_{i,i+1} = ω_{i,i-1} = 0.5 for 1 < i < Π and ω_{1,2} = ω_{Π,Π-1} = 1.
32 Tempering BMH for Learning BNN with Big Data
Figure: Parallel implementation of tempering BMH: the flowchart of the tempering BMH algorithm with 3 processors. The master (rank 1) broadcasts the current parameters θ_t^i to the slaves (ranks 2 and 3); each processor generates its own bootstrap sample D_j and computes the energy E(θ_t^i | D_j); the energies are reduced to the master, which forms the average (1/3) Σ_{j=1}^3 E(θ_t^i | D_j), updates the parameters, and repeats until the end of the simulation.
33 Forest Cover Type Data
The goal of this study is to predict forest cover types from cartographic variables in forested areas with minimal human-caused disturbance. The data were taken from four wilderness areas located in the Roosevelt National Forest of northern Colorado and consist of 581,012 observations. Each observation was derived from US Geological Survey (USGS) digital elevation model data based on 30 m × 30 m raster cells, and comprises 54 cartographic explanatory variables: 10 quantitative variables, 4 binary wilderness-area variables, and 40 binary soil-type variables. The observations are classified into seven classes according to their cover types, with respective class sizes 211840, 283301, 35754, 2747, 9493, 17367, and 20510.
34 Forest Cover Type Data
Table 4: BMH results for forest cover type data. The resampling rates for the 7 types of observations are 0.5%, 0.5%, 1%, 2.5%, 1.5%, 1%, and 1%, respectively; the aggregated resampling rate for the training data is 0.59%. For both m/n- and (n choose m)-bootstrapping, the training rates are about 72.2%-72.3% and the prediction rates about 72.4% (standard errors in parentheses in the original table), at a cost of roughly 28-34 CPU hours. [The k values and average network sizes were not recovered from the transcription.]
35 Forest Cover Type Data: Efficiency of BMH
For comparison, we applied parallel tempering to train the BNN in a single-threaded simulation on an Intel Nehalem server. At each local updating step of parallel tempering, the whole training dataset is scanned once, so the algorithm runs extremely slowly: the first 5536 iterations took 688 CPU hours, even though the Intel Nehalem processor is considerably faster (approximately 2.5 times) than the processors of the cluster machine. Finishing all iterations would take about 3000 CPU hours (125 days) on the Intel Nehalem server. Comparing 3000 hours with about 30 hours shows the great advantage of the parallelized BMH algorithm for big data problems.
36 Resampling on Distributed Architectures
Let S_i denote the ith subset of the data, stored on node i, i = 1, ..., k. For each i:
1. Set j = i − 1 or i + 1 with equal probability. If j = 0, reset j = k; and if j = k + 1, reset j = 1.
2. Exchange M randomly selected observations between S_i and S_j, where M can be a pre-specified or random number.
It follows from the standard theory of MCMC (see, e.g., Geyer, 1991) that the above procedure ensures that each subset stored on a single node is a random subset of the whole dataset.
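A serial sketch of the exchange sweep (my own illustration; in practice the S_i live on different nodes and each swap is a message exchange, not a list operation):

```python
import random

def exchange_step(subsets, M, rng):
    # One sweep over nodes: node i picks neighbor j = i - 1 or i + 1 at random
    # (wrapping at the ends, as in step 1) and swaps M random observations with it.
    k = len(subsets)
    for i in range(k):
        j = (i + rng.choice([-1, 1])) % k     # wrap-around neighbor
        for _ in range(M):
            a = rng.randrange(len(subsets[i]))
            b = rng.randrange(len(subsets[j]))
            subsets[i][a], subsets[j][b] = subsets[j][b], subsets[i][a]

# Example: 4 nodes, 8 observations each; repeated sweeps mix the data across nodes
rng = random.Random(4)
subsets = [list(range(8 * i, 8 * (i + 1))) for i in range(4)]
for _ in range(20):
    exchange_step(subsets, M=3, rng=rng)
```

Each sweep preserves the total dataset and every subset's size while progressively randomizing which observations sit on which node.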
37 Discussion
We have proposed the BMH algorithm as a basic MCMC algorithm for Bayesian analysis of big data. The BMH algorithm is workable on parallel and distributed architectures, avoids repeated scans of the full dataset across iterations, and is thus feasible for big data problems. Compared to the popular divide-and-combine method, BMH is generally more efficient, as it can asymptotically integrate the whole data information into a single simulation run.
38 Discussion (continued)
The BMH algorithm is very flexible. Like the Metropolis-Hastings algorithm, it can serve as a basic building block for developing advanced MCMC algorithms that are feasible for big data problems:
- Sampling: tempering BMH, which combines BMH with parallel tempering.
- Model selection: reversible jump BMH, which combines BMH with reversible jump MCMC.
- Optimization: simulated annealing BMH, which combines BMH with simulated annealing.
Compared to existing methods, such as the aggregated estimating equation, resampling-based stochastic approximation, and the bag of little bootstraps, BMH has the unique ability to tame powerful MCMC methods for use in big data analysis.
39 Acknowledgments
- NSF grants
- KAUST grant
- Student: Jinsu Kim, for parallel programming
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
More informationExploiting the Statistics of Learning and Inference
Exploiting the Statistics of Learning and Inference Max Welling Institute for Informatics University of Amsterdam Science Park 904, Amsterdam, Netherlands m.welling@uva.nl Abstract. When dealing with datasets
More informationPrinciple of Data Reduction
Chapter 6 Principle of Data Reduction 6.1 Introduction An experimenter uses the information in a sample X 1,..., X n to make inferences about an unknown parameter θ. If the sample size n is large, then
More informationWeb-based Supplementary Materials for Bayesian Effect Estimation. Accounting for Adjustment Uncertainty by Chi Wang, Giovanni
1 Web-based Supplementary Materials for Bayesian Effect Estimation Accounting for Adjustment Uncertainty by Chi Wang, Giovanni Parmigiani, and Francesca Dominici In Web Appendix A, we provide detailed
More informationThe Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
More informationData Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
More informationBig Data, Statistics, and the Internet
Big Data, Statistics, and the Internet Steven L. Scott April, 4 Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 / 39 Summary Big data live on more than one machine. Computing takes
More informationA Practical Scheme for Wireless Network Operation
A Practical Scheme for Wireless Network Operation Radhika Gowaikar, Amir F. Dana, Babak Hassibi, Michelle Effros June 21, 2004 Abstract In many problems in wireline networks, it is known that achieving
More informationApplications of R Software in Bayesian Data Analysis
Article International Journal of Information Science and System, 2012, 1(1): 7-23 International Journal of Information Science and System Journal homepage: www.modernscientificpress.com/journals/ijinfosci.aspx
More informationStatistical machine learning, high dimension and big data
Statistical machine learning, high dimension and big data S. Gaïffas 1 14 mars 2014 1 CMAP - Ecole Polytechnique Agenda for today Divide and Conquer principle for collaborative filtering Graphical modelling,
More informationAPPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder
APPM4720/5720: Fast algorithms for big data Gunnar Martinsson The University of Colorado at Boulder Course objectives: The purpose of this course is to teach efficient algorithms for processing very large
More informationPackage EstCRM. July 13, 2015
Version 1.4 Date 2015-7-11 Package EstCRM July 13, 2015 Title Calibrating Parameters for the Samejima's Continuous IRT Model Author Cengiz Zopluoglu Maintainer Cengiz Zopluoglu
More informationCI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.
CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes
More informationDecision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
More informationConstrained Bayes and Empirical Bayes Estimator Applications in Insurance Pricing
Communications for Statistical Applications and Methods 2013, Vol 20, No 4, 321 327 DOI: http://dxdoiorg/105351/csam2013204321 Constrained Bayes and Empirical Bayes Estimator Applications in Insurance
More informationMonte Carlo Simulation
1 Monte Carlo Simulation Stefan Weber Leibniz Universität Hannover email: sweber@stochastik.uni-hannover.de web: www.stochastik.uni-hannover.de/ sweber Monte Carlo Simulation 2 Quantifying and Hedging
More informationCompression and Aggregation of Bayesian Estimates for Data Intensive Computing
Under consideration for publication in Knowledge and Information Systems Compression and Aggregation of Bayesian Estimates for Data Intensive Computing Ruibin Xi 1, Nan Lin 2, Yixin Chen 3 and Youngjin
More informationClass #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines
More informationMore details on the inputs, functionality, and output can be found below.
Overview: The SMEEACT (Software for More Efficient, Ethical, and Affordable Clinical Trials) web interface (http://research.mdacc.tmc.edu/smeeactweb) implements a single analysis of a two-armed trial comparing
More informationApplied Multivariate Analysis - Big data analytics
Applied Multivariate Analysis - Big data analytics Nathalie Villa-Vialaneix nathalie.villa@toulouse.inra.fr http://www.nathalievilla.org M1 in Economics and Economics and Statistics Toulouse School of
More informationIntroduction to Support Vector Machines. Colin Campbell, Bristol University
Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.
More informationMicroclustering: When the Cluster Sizes Grow Sublinearly with the Size of the Data Set
Microclustering: When the Cluster Sizes Grow Sublinearly with the Size of the Data Set Jeffrey W. Miller Brenda Betancourt Abbas Zaidi Hanna Wallach Rebecca C. Steorts Abstract Most generative models for
More informationEfficiency and the Cramér-Rao Inequality
Chapter Efficiency and the Cramér-Rao Inequality Clearly we would like an unbiased estimator ˆφ (X of φ (θ to produce, in the long run, estimates which are fairly concentrated i.e. have high precision.
More informationDirichlet Processes A gentle tutorial
Dirichlet Processes A gentle tutorial SELECT Lab Meeting October 14, 2008 Khalid El-Arini Motivation We are given a data set, and are told that it was generated from a mixture of Gaussian distributions.
More informationClustering in the Linear Model
Short Guides to Microeconometrics Fall 2014 Kurt Schmidheiny Universität Basel Clustering in the Linear Model 2 1 Introduction Clustering in the Linear Model This handout extends the handout on The Multiple
More informationComparison of Data Mining Techniques used for Financial Data Analysis
Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract
More informationPHASE ESTIMATION ALGORITHM FOR FREQUENCY HOPPED BINARY PSK AND DPSK WAVEFORMS WITH SMALL NUMBER OF REFERENCE SYMBOLS
PHASE ESTIMATION ALGORITHM FOR FREQUENCY HOPPED BINARY PSK AND DPSK WAVEFORMS WITH SMALL NUM OF REFERENCE SYMBOLS Benjamin R. Wiederholt The MITRE Corporation Bedford, MA and Mario A. Blanco The MITRE
More informationMessage-passing sequential detection of multiple change points in networks
Message-passing sequential detection of multiple change points in networks Long Nguyen, Arash Amini Ram Rajagopal University of Michigan Stanford University ISIT, Boston, July 2012 Nguyen/Amini/Rajagopal
More informationNeural Network Add-in
Neural Network Add-in Version 1.5 Software User s Guide Contents Overview... 2 Getting Started... 2 Working with Datasets... 2 Open a Dataset... 3 Save a Dataset... 3 Data Pre-processing... 3 Lagging...
More informationA Basic Introduction to Missing Data
John Fox Sociology 740 Winter 2014 Outline Why Missing Data Arise Why Missing Data Arise Global or unit non-response. In a survey, certain respondents may be unreachable or may refuse to participate. Item
More informationProbabilistic Methods for Time-Series Analysis
Probabilistic Methods for Time-Series Analysis 2 Contents 1 Analysis of Changepoint Models 1 1.1 Introduction................................ 1 1.1.1 Model and Notation....................... 2 1.1.2 Example:
More informationAdaptive Search with Stochastic Acceptance Probabilities for Global Optimization
Adaptive Search with Stochastic Acceptance Probabilities for Global Optimization Archis Ghate a and Robert L. Smith b a Industrial Engineering, University of Washington, Box 352650, Seattle, Washington,
More informationMATH4427 Notebook 2 Spring 2016. 2 MATH4427 Notebook 2 3. 2.1 Definitions and Examples... 3. 2.2 Performance Measures for Estimators...
MATH4427 Notebook 2 Spring 2016 prepared by Professor Jenny Baglivo c Copyright 2009-2016 by Jenny A. Baglivo. All Rights Reserved. Contents 2 MATH4427 Notebook 2 3 2.1 Definitions and Examples...................................
More information2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR)
2DI36 Statistics 2DI36 Part II (Chapter 7 of MR) What Have we Done so Far? Last time we introduced the concept of a dataset and seen how we can represent it in various ways But, how did this dataset came
More informationReject Inference in Credit Scoring. Jie-Men Mok
Reject Inference in Credit Scoring Jie-Men Mok BMI paper January 2009 ii Preface In the Master programme of Business Mathematics and Informatics (BMI), it is required to perform research on a business
More informationIncorporating cost in Bayesian Variable Selection, with application to cost-effective measurement of quality of health care.
Incorporating cost in Bayesian Variable Selection, with application to cost-effective measurement of quality of health care University of Florida 10th Annual Winter Workshop: Bayesian Model Selection and
More informationMACHINE LEARNING IN HIGH ENERGY PHYSICS
MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!
More informationLeast Squares Estimation
Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David
More informationVisualization of Collaborative Data
Visualization of Collaborative Data Guobiao Mei University of California, Riverside gmei@cs.ucr.edu Christian R. Shelton University of California, Riverside cshelton@cs.ucr.edu Abstract Collaborative data
More informationDesigning a learning system
Lecture Designing a learning system Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x4-8845 http://.cs.pitt.edu/~milos/courses/cs750/ Design of a learning system (first vie) Application or Testing
More informationService courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.
Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are
More informationStatistical Machine Learning from Data
Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Gaussian Mixture Models Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique
More informationIntroduction to Online Learning Theory
Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent
More informationImperfect Debugging in Software Reliability
Imperfect Debugging in Software Reliability Tevfik Aktekin and Toros Caglar University of New Hampshire Peter T. Paul College of Business and Economics Department of Decision Sciences and United Health
More informationNumerical methods for American options
Lecture 9 Numerical methods for American options Lecture Notes by Andrzej Palczewski Computational Finance p. 1 American options The holder of an American option has the right to exercise it at any moment
More information