Bootstrapping Big Data

Ariel Kleiner  Ameet Talwalkar  Purnamrita Sarkar  Michael I. Jordan
Computer Science Division, University of California, Berkeley
{akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu

Abstract

The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large datasets, the computation of bootstrap-based quantities can be prohibitively computationally demanding. As an alternative, we introduce the Bag of Little Bootstraps (BLB), a new procedure which incorporates features of both the bootstrap and subsampling to obtain a more computationally efficient, though still robust, means of quantifying the quality of estimators. BLB shares the generic applicability and statistical efficiency of the bootstrap and is furthermore well suited for application to very large datasets using modern distributed computing architectures, as it uses only small subsets of the observed data at any point during its execution. We provide both empirical and theoretical results which demonstrate the efficacy of BLB.

1 Introduction

Assessing the quality of estimates based upon finite data is a task of fundamental importance in data analysis. For example, when estimating a vector of model parameters from a training dataset, it is useful to be able to quantify the uncertainty in that estimate (e.g., via a confidence region), its bias, and its risk. Such quality assessments provide far more information than the point estimate itself and can be used to improve human interpretation of inferential outputs, perform hypothesis testing, correct bias, make more efficient use of available resources (e.g., by ceasing to process data once the confidence region is sufficiently small), perform active learning, and do feature selection, among many other potential uses. Accurate assessment of estimate quality has been a longstanding concern in statistics.
A great deal of classical work in this vein has proceeded via asymptotic analysis, which relies on deep study of particular classes of estimators in particular settings to obtain analytic approximations to particular quality measures (such as confidence regions) [4]. While this approach ensures asymptotic correctness and allows analytic computation, it is limited to cases in which the relevant asymptotic analysis is tractable and has actually been performed. In contrast, recent decades have seen greater focus on more automatic methods, which generally require significantly less analysis at the expense of more computation. The bootstrap [2, 3] is perhaps the best known and most widely used of these methods, due to its simplicity and generic applicability. Efforts to ameliorate statistical shortcomings of the bootstrap in turn led to the development of related methods such as the m out of n bootstrap and subsampling [1, 4]. Despite the computational demands of this more automatic methodology, advancements have been driven primarily by a desire to remedy shortcomings in statistical correctness, while largely ignoring computational considerations such as processing time, space, communication, and parallelism. However, with the advent of increasingly large datasets and diverse sets of often complex and exploratory queries, computational considerations and automation (i.e., lack of reliance on deep analysis of the specific estimator and setting of interest) are increasingly important. Indeed, even as the amount of available data grows, the number of parameters to be estimated and the number of potential sources of bias often also grow, leading to a need to tractably assess estimator quality in the large-data setting. Thus, unlike previous work on estimator quality assessment, here we directly study both the accuracy and the computational costs of state-of-the-art automatic methods such as the bootstrap and its relatives. The bootstrap, despite its
simplicity of implementation and very general statistical correctness, has relatively high computational costs. We also find that its relatives, such as the m out of n bootstrap and subsampling, have lower computational costs, as expected, but have correctness which is sensitive to hyperparameters (such as the number of subsampled data points) that must be specified a priori; additionally, these methods generally require more prior information (such as rates of convergence of estimators) than the bootstrap. Motivated by these observations, we introduce a new procedure, the Bag of Little Bootstraps (BLB), which functions by combining the results of bootstrapping multiple small subsets of a larger original dataset. As we demonstrate below, BLB has a significantly more favorable computational profile than the bootstrap while being more robust than existing alternatives such as the m out of n bootstrap and subsampling. Additionally, our procedure maintains the generic applicability and favorable statistical properties of the bootstrap and is well suited to implementation on modern distributed and parallel computing architectures.

2 Setting, Notation, and a Bit of Related Work

We assume that we observe a sample X_1, ..., X_n drawn i.i.d. from some true (unknown) underlying distribution P. Based only on this observed data, we estimate a quantity θ̂_n = θ(X_1, ..., X_n) = θ(P_n), where P_n = n⁻¹ Σ_{i=1}^n δ_{X_i} is the empirical distribution of X_1, ..., X_n. The true (unknown) population value of the estimated quantity is θ(P). For example, θ might compute a measure of correlation, a classifier's parameters, or the prediction accuracy of a trained model. Noting that θ̂_n is a random quantity because it is based on n random observations, we define Q_n(P) ∈ Q as the true underlying distribution of θ̂_n, which is determined by both P and the form of θ.
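When P is known, as in a simulation study, Q_n(P) can be approximated directly by drawing many independent size-n datasets and computing θ on each. The following is a minimal sketch under illustrative assumptions not taken from the paper: θ is the sample mean and P is a standard normal distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def theta(x):
    # Estimator of interest; here simply the sample mean (an
    # illustrative assumption, not the paper's general setting).
    return x.mean()

# Draw many independent datasets of size n from a known P (standard
# normal here) and compute theta on each; the empirical distribution
# of these values approximates Q_n(P).
n, num_datasets = 100, 2000
theta_hats = np.array([theta(rng.normal(size=n)) for _ in range(num_datasets)])

# A metric xi summarizing Q_n(P): here the standard error of theta_hat_n,
# whose exact value for this P and theta is 1/sqrt(n) = 0.1.
xi_approx = theta_hats.std(ddof=1)
```

In practice P is unknown, which is precisely why the plugin approximations discussed next are needed.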
Our end goal is the computation of some metric ξ(Q_n(P)) : Q → Ξ, for Ξ a vector space, which informatively summarizes Q_n(P). For instance, ξ might compute a confidence region, a standard error, or a bias. In practice, we do not have direct knowledge of P or Q_n(P), and so we must estimate ξ(Q_n(P)) itself based only on the observed data and knowledge of θ. Using this notation, the bootstrap [2, 3] simply forms the data-driven plugin approximation ξ(Q_n(P)) ≈ ξ(Q_n(P_n)). While ξ(Q_n(P_n)) cannot be computed exactly in most cases, it is generally amenable to straightforward Monte Carlo approximation via the following simple algorithm: repeatedly resample n points i.i.d. from P_n, compute θ on each resample, form the empirical distribution Q*_n of the computed θ's, and approximate ξ(Q_n(P)) ≈ ξ(Q*_n). Though conceptually simple and powerful, this procedure requires repeated computation of θ on resamples having size comparable to that of the original dataset. If the original dataset is large, then this repeated computation can be prohibitively costly.

3 Bag of Little Bootstraps (BLB)

The Bag of Little Bootstraps (Algorithm 1) functions by averaging the results of bootstrapping multiple small subsets of X_1, ..., X_n. As a result, it inherits the statistical correctness of the bootstrap, while in most cases having the reduced computational costs of the m out of n bootstrap and subsampling. Additionally, as we show in experiments below, BLB is significantly more robust than these procedures to the choice of subset size. BLB proceeds by repeatedly subsampling b < n points from X_1, ..., X_n. Each subsample defines an empirical distribution P^(j) based on those b points, which is used to compute a Monte Carlo approximation to ξ(Q_n(P^(j))) via repeated resampling of n points from P^(j). Having obtained one such Monte Carlo approximation per subsample, we then average them to obtain a final, improved estimate of ξ(Q_n(P)).
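For concreteness, the bootstrap's Monte Carlo approximation described in Section 2 can be sketched as follows; the particular choices of θ (sample mean) and ξ (standard error) are illustrative assumptions, not prescribed by the method.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap(data, theta, xi, num_resamples=200):
    # Repeatedly resample n points i.i.d. from the empirical
    # distribution P_n, compute theta on each resample, and apply xi
    # to the empirical distribution of the computed theta's.
    n = len(data)
    theta_stars = np.array([
        theta(data[rng.integers(0, n, size=n)])  # size-n resample from P_n
        for _ in range(num_resamples)
    ])
    return xi(theta_stars)

# Illustrative usage: bootstrap standard error of the sample mean.
data = rng.normal(size=500)
se_hat = bootstrap(data, np.mean, lambda t: t.std(ddof=1))
```

Note that each of the `num_resamples` calls to θ operates on a full size-n resample, which is exactly the cost that BLB is designed to avoid.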
As we show both empirically and theoretically, BLB's output converges (generally quickly) as we increase s, the number of subsamples (or chunks); furthermore, fairly modest values of r, the number of Monte Carlo resamples, appear to be sufficient in our experiments. Similarly to the bootstrap, each outer iteration of BLB computes an approximation to ξ(Q_n(P)) based on a plugin estimate ξ(Q_n(P^(j))). In contrast to the bootstrap, the plugin estimator used by each outer iteration of BLB is based on P^(j), which is more compact and hence generally less computationally demanding than the full empirical distribution P_n. The empirical distribution P^(j) is, however, inferior to P_n as an approximation to the true underlying distribution P. Therefore, BLB averages across multiple different realizations of P^(j) to improve the quality of the final result.

Algorithm 1: Bag of Little Bootstraps (BLB)
Input: Data X_1, ..., X_n
  θ: estimator of interest
  ξ: estimator quality assessment
  b: subsample size
  s: number of subsamples
  r: number of Monte Carlo iterations
Output: An estimate of ξ(Q_n(P))
for j = 1 to s do
  // Subsample the data
  Randomly sample a set I of b indices from {1, ..., n} without replacement
  [or, choose I to be a disjoint chunk of size b from a predefined random partition of {1, ..., n}]
  P^(j) ← b⁻¹ Σ_{i ∈ I} δ_{X_i}
  // Approximate ξ(Q_n(P^(j)))
  for k = 1 to r do
    Sample X*_{k,1}, ..., X*_{k,n} ~ P^(j) i.i.d.
    θ̂*_{n,k} ← θ(X*_{k,1}, ..., X*_{k,n})
  end
  Q*_{n,j} ← r⁻¹ Σ_{k=1}^r δ_{θ̂*_{n,k}}
  ξ*_{n,j} ← ξ(Q*_{n,j})
end
// Average the estimates of ξ(Q_n(P)) computed on the different subsamples
return s⁻¹ Σ_{j=1}^s ξ*_{n,j}

Delving more deeply into the computational requirements of BLB, note that each size-n resample from P^(j) contains at most b distinct data points, each of which may be replicated multiple times. Thus, we can represent each resample using only O(b) space by simply maintaining the b < n distinct points with associated counts. If θ can compute directly on this weighted representation (i.e., if θ's computational requirements scale only in the number of distinct data points presented to it), then we achieve corresponding savings in required space and computation time. This property does indeed hold for many if not most commonly used θ's, including M-estimators such as linear and kernel regression, logistic regression, and Support Vector Machines, among many others. As we show in our experiments and analysis, b can generally be chosen to be much smaller than n: for example, we might take b = n^γ where γ ∈ [0.5, 1]. As a result, each computation of θ by BLB can be significantly faster than each computation of θ by the bootstrap. Indeed, a simple and standard calculation [3] shows that each resample generated by the bootstrap contains approximately 0.632n distinct points, which is quite large if n is large. More concretely, if, for instance, n = 1,000,000, then a bootstrap resample would contain approximately 632,000 distinct points; in contrast, with b = n^0.6, each BLB subsample and resample would contain at most 3,981 distinct points.
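A minimal sketch of Algorithm 1 that exploits the weighted representation: a size-n resample from P^(j) is generated as multinomial counts over the b distinct subsample points, so θ only ever touches b points. The choices of θ (a weighted mean) and ξ (a standard error), and the settings of s and r, are illustrative assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def blb(data, theta_weighted, xi, b, s=10, r=50):
    # theta_weighted(points, counts) must compute theta on b distinct
    # points with integer counts summing to n: the O(b) weighted
    # representation discussed above.
    n = len(data)
    xi_estimates = []
    for _ in range(s):
        # Subsample b points without replacement; this defines P^(j).
        subsample = data[rng.choice(n, size=b, replace=False)]
        theta_stars = []
        for _ in range(r):
            # A size-n i.i.d. resample from P^(j), represented as
            # multinomial counts over the b distinct points.
            counts = rng.multinomial(n, np.full(b, 1.0 / b))
            theta_stars.append(theta_weighted(subsample, counts))
        xi_estimates.append(xi(np.array(theta_stars)))
    # Average the per-subsample estimates of xi(Q_n(P)).
    return float(np.mean(xi_estimates))

# Illustrative usage: BLB standard error of the mean, with b = n^0.6.
data = rng.normal(size=5000)
b = int(len(data) ** 0.6)
se_hat = blb(data,
             lambda x, w: np.average(x, weights=w),  # theta on weighted data
             lambda t: t.std(ddof=1),
             b)
```

Because each inner computation sees only b distinct points, the per-resample cost scales with b rather than n whenever θ respects weights in this way.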
Assuming a data point size of 1 MB, the full dataset would occupy 1 TB of storage space, a bootstrap resample would occupy approximately 632 GB, and each BLB subsample or resample would occupy approximately 4 GB. Thus, BLB only requires repeated computation on small subsets of the original dataset and avoids the bootstrap's crippling need for repeated computation of θ on resamples having size comparable to that of the original dataset. Due to its much smaller subsample and resample space requirements, BLB is also significantly more amenable to distribution of different subsamples and resamples, and their associated computations, to independent compute nodes or racks of nodes, thus allowing for simple distributed and parallel implementations.

4 Experiments

We investigate the empirical performance characteristics of BLB and compare to existing methods via experiments on simulated data. Use of simulated data is necessary here because it allows knowledge of Q_n(P) and hence of ξ(Q_n(P)) (i.e., ground truth). For different datasets and estimation tasks, we study the convergence properties of BLB as well as the bootstrap, the m out of n bootstrap, and subsampling. We consider two different tasks and forms for θ: regression and classification. For both settings, the data have the form X_i = (X̃_i, Y_i) ~ P, i.i.d. for i = 1, ..., n, where X̃_i ∈ R^d; Y_i ∈ R for regression, whereas Y_i ∈ {0, 1} for classification. We use n = 20,000 for the plots shown, and d is set to 100 for regression and 10 for classification.
[Figure 1 appears here: four plots of relative error versus processing time, with trajectories for b = n^0.5 through n^0.9.]

Figure 1: Relative error vs. processing time for BLB, the bootstrap, and the b out of n bootstrap (BOFN) on two different problems. For both BLB and BOFN, b = n^γ, with the value of γ for each trajectory given in the legend. (Two lefthand plots) Regression setting with linear data generating distribution and Gamma X̃_i distribution; the leftmost plot shows BLB together with the bootstrap, and the other shows BOFN. (Two righthand plots) Classification setting with linear data generating distribution and Gamma X̃_i distribution; the lefthand plot shows BLB together with the bootstrap, and the other shows BOFN.

In each case, θ estimates a linear parameter vector in R^d (via either least squares or logistic regression) for a model of the mapping between X̃_i and Y_i. We define ξ as computing a set of marginal 95% confidence intervals, one for each element of the estimated parameter vector (averaging across ξ's consists of averaging element-wise interval boundaries). To evaluate the various quality assessment procedures on a given estimation task and true underlying data distribution P, we first compute the ground truth ξ(Q_n(P)) based on 2,000 realizations of datasets of size n from P. Then, for an independent dataset realization of size n from the true underlying distribution, we run each quality assessment procedure and record the estimate of ξ(Q_n(P)) produced after each iteration (e.g., after each bootstrap resample or BLB subsample is processed), as well as the cumulative time required to produce that estimate. Every such estimate is evaluated based on the average (across dimensions) relative deviation of its component-wise confidence interval widths from the corresponding true widths.
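The evaluation metric just described can be made concrete as follows; the interval endpoints below are hypothetical values for illustration, not results from the experiments.

```python
import numpy as np

def relative_ci_error(est_intervals, true_intervals):
    # Average, across dimensions, of the relative deviation of the
    # estimated confidence-interval widths from the true widths.
    # Both arguments have shape (d, 2): rows of (lower, upper) bounds.
    est_widths = est_intervals[:, 1] - est_intervals[:, 0]
    true_widths = true_intervals[:, 1] - true_intervals[:, 0]
    return float(np.mean(np.abs(est_widths - true_widths) / true_widths))

# Hypothetical d = 2 example: widths 2.2 vs 2.0 and 3.6 vs 4.0, each a
# 10% relative deviation, so the average error is 0.1.
true_ci = np.array([[-1.0, 1.0], [-2.0, 2.0]])
est_ci = np.array([[-1.1, 1.1], [-1.8, 1.8]])
err = relative_ci_error(est_ci, true_ci)  # 0.1
```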
We repeat this process on five independent dataset realizations of size n and average the resulting relative errors and corresponding times across these five datasets to obtain a trajectory of relative error versus time for each quality assessment procedure (the trajectories' variances are relatively small). To maintain consistency of notation, we henceforth refer to the m out of n bootstrap as the b out of n bootstrap. For BLB, the b out of n bootstrap, and subsampling, we consider subsample sizes b = n^γ with γ ∈ {0.5, 0.6, 0.7, 0.8, 0.9}; we use r = 100 in all runs of BLB. Figure 1 shows results for both the regression and classification settings. Here the true underlying distribution P generates each coordinate of each X̃_i independently from a different Gamma distribution, and it generates Y_i from X̃_i via a noisy linear model. As seen in the figure, in the regression setting BLB (first plot) succeeds in converging to low relative error significantly more quickly than the bootstrap, for all values of b considered. In contrast, the b out of n bootstrap (second plot) fails to converge to low relative error for smaller values of b. In the classification setting, BLB (third plot) converges to relative error comparable to that of the bootstrap for b > n^0.6, while converging to higher relative errors for the smallest values of b considered. For larger values of b, which are still significantly smaller than n, BLB again converges to low relative error more quickly than the bootstrap. It is also once again more robust than the b out of n bootstrap (fourth plot), which fails to converge to low relative error for b ≤ n^0.7. In fact, even for b = n^0.6, BLB's performance is superior to that of the b out of n bootstrap. For the aforementioned cases in which BLB does not match the relative error of the bootstrap, additional empirical results (not shown here) and our theoretical analysis indicate that this discrepancy in relative error diminishes as n increases.
Identical evaluation of subsampling (plots not shown) shows that it performs strictly worse than the b out of n bootstrap. Qualitatively similar results also hold in both the regression and classification settings when P generates X̃_i from either Normal or Student's t distributions, and when P uses a non-linear noisy mapping between X̃_i and Y_i (so that θ learns a misspecified model). These experiments demonstrate the improved computational efficiency of BLB relative to the bootstrap, the fact that BLB maintains statistical correctness, and the improved robustness of BLB to the choice of b. Note that all speedups observed for BLB over the bootstrap in our experiments are achieved without parallelization, though better parallelizability is indeed another significant advantage of BLB. See the extended version of this paper for theoretical results showing that BLB is consistent and shares the higher-order correctness (i.e., favorable statistical convergence rate) of the bootstrap, for additional experimental results, and for further discussion of BLB and related prior work.
References

[1] P. Bickel, F. Götze, and W. van Zwet. Resampling fewer than n observations: gains, losses, and remedies for losses. Statistica Sinica, 1997.
[2] B. Efron. Bootstrap methods: another look at the jackknife. Annals of Statistics, 1979.
[3] B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, 1993.
[4] D. Politis, J. Romano, and M. Wolf. Subsampling. Springer, 1999.