The Fundamental Incompatibility of Scalable Hamiltonian Monte Carlo and Naive Data Subsampling

Michael Betancourt
Department of Statistics, University of Warwick, Coventry, UK CV4 7AL
betanalpha@gmail.com

Abstract

Leveraging the coherent exploration of Hamiltonian flow, Hamiltonian Monte Carlo produces computationally efficient Monte Carlo estimators, even with respect to complex and high-dimensional target distributions. When confronted with data-intensive applications, however, the algorithm may be too expensive to implement, leaving us to consider the utility of approximations such as data subsampling. In this paper I demonstrate how data subsampling fundamentally compromises the scalability of Hamiltonian Monte Carlo.

With the preponderance of applications featuring enormous data sets, methods of inference requiring only subsamples of data are becoming more and more appealing. Subsampled Markov Chain Monte Carlo algorithms (Neiswanger et al., 2013; Welling & Teh, 2011) are particularly desired for their potential applicability to most statistical models. Unfortunately, careful analysis of these algorithms reveals unavoidable biases unless the data are tall, or highly redundant (Bardenet et al., 2014; Teh et al., 2014; Vollmer et al., 2015). Because redundancy can be defined only relative to a given model, the utility of these subsampled algorithms is then a consequence not only of the desired accuracy but also of the particular model and data under consideration, severely restricting their practicality.

Recently Chen et al. (2014) considered subsampling within Hamiltonian Monte Carlo (Duane et al., 1987; Neal, 2011; Betancourt et al., 2014b) and demonstrated that naive subsampling induces unacceptably large biases.
Ultimately the authors rectified this bias by sacrificing the coherent exploration of Hamiltonian flow for a diffusive correction, fundamentally compromising the scalability of the algorithm with respect to the complexity of the target distribution. An algorithm scalable with respect to both the size of the data and the complexity of the target distribution would have to maintain the coherent exploration of Hamiltonian flow while subsampling and, unfortunately, these objectives are mutually exclusive in general. In this paper I review the elements of Hamiltonian Monte Carlo critical to its robust and scalable performance in practice and demonstrate how different subsampling strategies all compromise those properties and consequently induce poor performance.

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

1. Hamiltonian Monte Carlo in Theory

Hamiltonian Monte Carlo utilizes deterministic, measure-preserving maps to generate efficient Markov transitions (Betancourt et al., 2014b). Formally, we begin by complementing a target distribution,

$$\pi \propto \exp[-V(q)]\, \mathrm{d}^n q,$$

with a conditional distribution over auxiliary momenta parameters,

$$\pi(p \mid q) \propto \exp[-T(q, p)]\, \mathrm{d}^n p.$$

Together these define a joint distribution,

$$\varpi_H \propto \exp\left[-\left(T(q, p) + V(q)\right)\right] \mathrm{d}^n q \, \mathrm{d}^n p \propto \exp[-H(q, p)]\, \mathrm{d}^n q \, \mathrm{d}^n p,$$

and a Hamiltonian system corresponding to the Hamiltonian, $H(q, p)$. We refer to $T(q, p)$ and $V(q)$ as the kinetic energy and potential energy, respectively.

The Hamiltonian immediately defines a Hamiltonian vector field,

$$\vec{H} = \frac{\partial H}{\partial p} \frac{\partial}{\partial q} - \frac{\partial H}{\partial q} \frac{\partial}{\partial p},$$

and an application of the exponential map yields a Hamiltonian flow on the joint space, $\phi^H_\tau = e^{\tau \vec{H}}$ (Lee, 2013), which exactly preserves the joint distribution under a pullback,

$$\left(\phi^H_t\right)^* \varpi_H = \varpi_H.$$
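As a concrete illustration of level-set preservation (my own sketch, not from the paper), consider a one-dimensional standard Gaussian target with Euclidean kinetic energy. The exact Hamiltonian flow is then a rotation of phase space, so $H$ is conserved exactly along trajectories:

```python
import numpy as np

# Illustrative example: V(q) = q^2/2 and T(p) = p^2/2, so that
# H(q, p) = (q^2 + p^2)/2 and the joint density is a standard Gaussian.
def V(q): return 0.5 * q**2
def T(p): return 0.5 * p**2
def H(q, p): return T(p) + V(q)

# For this Hamiltonian the exact flow phi^H_tau is a rotation of (q, p),
# which keeps trajectories confined to level sets of H.
def exact_flow(q, p, tau):
    return (q * np.cos(tau) + p * np.sin(tau),
            -q * np.sin(tau) + p * np.cos(tau))

q0, p0 = 1.0, 0.5
q1, p1 = exact_flow(q0, p0, tau=2.3)
print(H(q0, p0), H(q1, p1))  # identical up to round-off
```

The conservation of $H$ here is exact because the flow is available in closed form; the next sections address what happens when it is not.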
Consequently, we can compose a Markov chain by sampling the auxiliary momenta,

$$(q, p) \to (q, p'), \quad p' \sim \pi(p \mid q),$$

applying the Hamiltonian flow,

$$(q, p') \to \phi^H_t(q, p'),$$

and then projecting back down to the target space, $(q, p) \to q$.

By construction, the trajectories generated by the Hamiltonian flow explore the level sets of the Hamiltonian function. Because these level sets can also span large volumes of the joint space, sufficiently-long trajectories can yield transitions far away from the initial state of the Markov chain, drastically reducing autocorrelations and producing computationally efficient Monte Carlo estimators.

When the kinetic energy does not depend on position we say that the Hamiltonian is separable, $H(q, p) = T(p) + V(q)$, and the Hamiltonian vector field decouples into a kinetic vector field, $\vec{T}$, and a potential vector field, $\vec{V}$,

$$\vec{H} = \frac{\partial T}{\partial p} \frac{\partial}{\partial q} - \frac{\partial V}{\partial q} \frac{\partial}{\partial p} = \vec{T} + \vec{V}.$$

In this paper I consider only separable Hamiltonians, although the conclusions also carry over to non-separable Hamiltonians, for example those arising in Riemannian Hamiltonian Monte Carlo (Girolami & Calderhead, 2011).

2. Hamiltonian Monte Carlo in Practice

The biggest challenge of implementing Hamiltonian Monte Carlo is that the exact Hamiltonian flow is rarely calculable in practice and we must instead resort to approximate integration. Symplectic integrators, which yield numerical trajectories that closely track the true trajectories, are of particular importance to any high-performance implementation.

An especially transparent strategy for constructing symplectic integrators is to split the Hamiltonian into terms with soluble flows which can then be composed together (Leimkuhler & Reich, 2004; Hairer et al., 2006). For example, consider the symmetric Strang splitting,

$$\phi^V_{\epsilon/2} \circ \phi^T_{\epsilon} \circ \phi^V_{\epsilon/2} = e^{\frac{\epsilon}{2} \vec{V}} \, e^{\epsilon \vec{T}} \, e^{\frac{\epsilon}{2} \vec{V}},$$

where $\epsilon$ is a small interval of time known as the step size.
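For a Euclidean kinetic energy, $T(p) = p \cdot p / 2$, the Strang splitting above is exactly the familiar leapfrog integrator. A minimal sketch (function names are my own, not the paper's):

```python
import numpy as np

def leapfrog(q, p, grad_V, eps, n_steps):
    """Strang splitting: half step of the V flow, full step of the T flow,
    half step of the V flow, composed n_steps times."""
    for _ in range(n_steps):
        p = p - 0.5 * eps * grad_V(q)  # phi^V_{eps/2}
        q = q + eps * p                # phi^T_{eps}, assuming T(p) = p.p/2
        p = p - 0.5 * eps * grad_V(q)  # phi^V_{eps/2}
    return q, p

# For a standard Gaussian target, V(q) = q^2/2, the energy error of the
# numerical trajectory shrinks with the step size at fixed integration time.
grad_V = lambda q: q
H = lambda q, p: 0.5 * (q**2 + p**2)
errs = []
for eps in [0.2, 0.1, 0.05]:
    q1, p1 = leapfrog(1.0, 0.0, grad_V, eps, int(round(2.0 / eps)))
    errs.append(abs(H(q1, p1) - H(1.0, 0.0)))
print(errs)  # decreasing with eps
```

Halving the step size here shrinks the energy error roughly fourfold, consistent with the second-order accuracy derived next.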
Appealing to the Baker-Campbell-Hausdorff formula, this symmetric composition yields

$$\begin{aligned}
\phi^V_{\epsilon/2} \circ \phi^T_{\epsilon} \circ \phi^V_{\epsilon/2}
&= e^{\frac{\epsilon}{2} \vec{V}} \, e^{\epsilon \vec{T}} \, e^{\frac{\epsilon}{2} \vec{V}} \\
&= e^{\frac{\epsilon}{2} \vec{V}} \exp\!\left( \epsilon \vec{T} + \frac{\epsilon}{2} \vec{V} + \frac{\epsilon^2}{4} \left[ \vec{T}, \vec{V} \right] + \mathcal{O}(\epsilon^3) \right) \\
&= \exp\!\left( \frac{\epsilon}{2} \vec{V} + \epsilon \vec{T} + \frac{\epsilon}{2} \vec{V} + \frac{\epsilon^2}{4} \left[ \vec{T}, \vec{V} \right]
+ \frac{1}{2} \left[ \frac{\epsilon}{2} \vec{V},\, \epsilon \vec{T} + \frac{\epsilon}{2} \vec{V} + \frac{\epsilon^2}{4} \left[ \vec{T}, \vec{V} \right] \right] \right) + \mathcal{O}(\epsilon^3) \\
&= \exp\!\left( \epsilon \vec{H} + \frac{\epsilon^2}{4} \left[ \vec{T}, \vec{V} \right] + \frac{\epsilon^2}{4} \left[ \vec{V}, \vec{T} \right] + \frac{\epsilon^2}{8} \left[ \vec{V}, \vec{V} \right] \right) + \mathcal{O}(\epsilon^3) \\
&= e^{\epsilon \vec{H}} + \mathcal{O}(\epsilon^3),
\end{aligned}$$

since $[\vec{V}, \vec{V}] = 0$ and $[\vec{T}, \vec{V}] + [\vec{V}, \vec{T}] = 0$. Composing this symmetric composition with itself $L = \tau/\epsilon$ times results in a symplectic integrator accurate to second-order in the step size for any finite integration time, $\tau$,

$$\begin{aligned}
\phi^H_{\epsilon, \tau} \equiv \left( \phi^V_{\epsilon/2} \circ \phi^T_{\epsilon} \circ \phi^V_{\epsilon/2} \right)^{L}
&= \left( e^{\epsilon \vec{H}} + \mathcal{O}(\epsilon^3) \right)^{L} \\
&= e^{L \epsilon \vec{H}} + (L \epsilon)\, \mathcal{O}(\epsilon^2) \\
&= e^{\tau \vec{H}} + \tau\, \mathcal{O}(\epsilon^2) \\
&= e^{\tau \vec{H}} + \mathcal{O}(\epsilon^2).
\end{aligned}$$

Remarkably, the resulting numerical trajectories are confined to the level sets of a modified Hamiltonian given by an $\mathcal{O}(\epsilon^2)$ perturbation of the exact Hamiltonian (Hairer et al., 2006; Betancourt et al., 2014a).

Although such symplectic integrators are highly accurate, they still introduce an error into the trajectories that can bias the Markov chain and any resulting Monte Carlo estimators. In practice this error is typically compensated with the application of a Metropolis correction, accepting a point along the numerical trajectory only with probability

$$a(q, p) = \min\!\left( 1,\, \exp\!\left( H(q, p) - H\!\left( \phi^H_{\epsilon, \tau}(q, p) \right) \right) \right).$$

A critical reason for the scalable performance of such an implementation of Hamiltonian Monte Carlo is that the error in a symplectic integrator scales with the step size, $\epsilon$. Consequently a small bias or a large acceptance probability can be maintained by reducing the step size, regardless of the complexity or dimension of the target distribution (Betancourt et al., 2014a). If the symplectic integrator is compromised, however, then this scalability and generality is lost.
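Putting these pieces together, a complete Hamiltonian Monte Carlo transition with the Metropolis correction can be sketched as follows. This is my own minimal illustration for a one-dimensional target with unit-Gaussian momenta, not the paper's reference implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def hmc_transition(q, V, grad_V, eps, n_steps):
    # Resample the momentum, integrate a leapfrog trajectory, and accept
    # or reject with probability min(1, exp(H(q0, p0) - H(q1, p1))),
    # which compensates for the O(eps^2) integrator error.
    p = rng.normal()
    H0 = V(q) + 0.5 * p**2
    q_new, p_new = q, p
    for _ in range(n_steps):
        p_new -= 0.5 * eps * grad_V(q_new)
        q_new += eps * p_new
        p_new -= 0.5 * eps * grad_V(q_new)
    H1 = V(q_new) + 0.5 * p_new**2
    return q_new if np.log(rng.uniform()) < H0 - H1 else q

# Usage sketch: a standard Gaussian target.
V = lambda q: 0.5 * q**2
grad_V = lambda q: q
q, draws = 0.0, []
for _ in range(2000):
    q = hmc_transition(q, V, grad_V, eps=0.3, n_steps=10)
    draws.append(q)
```

Because the acceptance probability depends only on the energy error, shrinking `eps` drives the acceptance rate toward one; this is exactly the scalability property that subsampling will destroy below.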
3. Hamiltonian Monte Carlo With Subsampling

A common criticism of Hamiltonian Monte Carlo is that in data-intensive applications the evaluation of the potential vector field,

$$\vec{V} = -\frac{\partial V}{\partial q} \frac{\partial}{\partial p},$$

and hence the simulation of numerical trajectories, can become infeasible given the expense of the gradient calculations. This expense has fueled a variety of modifications of the algorithm aimed at reducing the cost of the potential energy, often by any means necessary.

An increasingly popular strategy targets Bayesian applications where the data are independently and identically distributed. In this case the posterior can be manipulated into a product of contributions from each subset of data,

$$\pi(\theta \mid y) \propto \pi(\theta) \prod_{j=1}^{J} \pi(y_j \mid \theta),$$

and the potential energy likewise decomposes into a sum,

$$V(q) = \sum_{j=1}^{J} V_j(q) = \sum_{j=1}^{J} \left( -\frac{1}{J} \log \pi(\theta) - \log \pi(y_j \mid \theta) \right),$$

where each $V_j$ depends on only a single subset. This decomposition suggests algorithms which consider not the entirety of the data and the full potential energy, $V$, but rather only a few subsets at a time.

Using only part of the data to generate a trajectory, however, compromises the structure-preserving properties of the symplectic integrator and hence the scalability of its accuracy. Consequently the performance of any such subsampling method depends critically on the details of the implementation and the structure of the data itself. Here I consider the performance of two immediate implementations, one based on subsampling the data in between Hamiltonian trajectories and one based on subsampling the data within a single trajectory. As expected, the performance of both methods leaves much to be desired.

3.1. Subsampling Data In Between Trajectories

Given any subset of the data, we can approximate the potential energy as $V \approx J V_j$ and then generate trajectories corresponding to the flow of the approximate Hamiltonian, $H_j = T + J V_j$.
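This decomposition is straightforward to verify numerically. The following sketch (my own, with illustrative variable names) uses a Gaussian likelihood and prior, matching the conjugate example introduced below, and checks that the batch potentials sum to the full potential up to an additive constant:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(1.0, 2.0, size=500)  # hypothetical i.i.d. data
sigma, m, s = 2.0, 0.0, 1.0         # likelihood and prior scales
J = 25
batches = np.split(y, J)

def V_full(mu):
    # negative log posterior, dropping mu-independent constants
    return 0.5 * (mu - m)**2 / s**2 + 0.5 * np.sum((y - mu)**2) / sigma**2

def V_j(mu, y_j):
    # each batch potential carries 1/J of the prior plus its own data terms
    return 0.5 * (mu - m)**2 / (J * s**2) + 0.5 * np.sum((y_j - mu)**2) / sigma**2

mu = 0.7
print(sum(V_j(mu, b) for b in batches), V_full(mu))  # equal up to round-off
```

Splitting the prior evenly across the batches is what makes the decomposition exact; forgetting that factor of $1/J$ would over-count the prior $J$ times.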
In order to avoid parsing the entirety of the data, the Metropolis correction at the end of each trajectory can be neglected and the corresponding samples left biased.

Unlike the numerical trajectories from the full Hamiltonian, these subsampled trajectories are biased away from the exact trajectories regardless of the chosen step size. In particular, the bias of each step,

$$e^{\frac{\epsilon}{2} J \vec{V}_j} \, e^{\epsilon \vec{T}} \, e^{\frac{\epsilon}{2} J \vec{V}_j}
= e^{\epsilon \vec{H}_j} + \mathcal{O}(\epsilon^3)
= e^{\epsilon \left( \vec{H} - \delta \vec{V}_j \right)} + \mathcal{O}(\epsilon^3),$$

where

$$\delta \vec{V}_j \equiv \vec{V} - J \vec{V}_j, \tag{1}$$

persists over the entire trajectory,

$$\left( e^{\frac{\epsilon}{2} J \vec{V}_j} \, e^{\epsilon \vec{T}} \, e^{\frac{\epsilon}{2} J \vec{V}_j} \right)^{\tau/\epsilon}
= e^{\tau \left( \vec{H} - \delta \vec{V}_j \right)} + \mathcal{O}(\epsilon^2).$$

As the dimension of the target distribution grows, the subsampled gradient, $J\, \partial V_j / \partial q$, drifts away from the true gradient, $\partial V / \partial q$, unless the data become increasingly redundant. Consequently the resulting trajectory introduces an irreducible bias into the algorithm, similar in nature to the asymptotic bias seen in subsampled Langevin Monte Carlo (Teh et al., 2014; Vollmer et al., 2015), which then induces either a vanishing Metropolis acceptance probability or highly-biased expectations if the Metropolis correction is neglected (Figure 1). Unfortunately, the only way to decrease the dependency on redundant data is to increase the size of each subsample, which immediately undermines any computational benefits.

Consider, for example, a simple application where we target a one-dimensional posterior distribution,

$$\pi(\mu \mid y) \propto \pi(y \mid \mu)\, \pi(\mu), \tag{2}$$

with the likelihood and prior

$$\pi(y \mid \mu) = \prod_{n=1}^{N} \mathcal{N}\!\left( y_n \mid \mu, \sigma^2 \right), \qquad
\pi(\mu) = \mathcal{N}\!\left( \mu \mid m, s^2 \right).$$

Separating the data into $J = N/B$ batches of size $B$ and decomposing the prior into $J$ individual terms then gives

$$V_j(\mu) = \text{const} + \frac{B \left( \sigma^2 + N s^2 \right)}{2 N \sigma^2 s^2}
\left( \mu - \frac{\bar{y}_j N s^2 + m \sigma^2}{\sigma^2 + N s^2} \right)^2,
\qquad \bar{y}_j = \frac{1}{B} \sum_{n=(j-1)B+1}^{jB} y_n.$$

Here I take $\sigma = 2$, $m = 0$, $s = 1$, and generate $N = 500$ data points assuming $\mu = 1$.
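The irreducibility of this bias is easy to observe numerically. In the following sketch (my own illustration, not the paper's code) the data are sorted before batching so that the first batch is deliberately unrepresentative; shrinking the step size at fixed integration time does not shrink the gap between the subsampled and full-data trajectories, in contrast to ordinary integrator error:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, m, s, N = 2.0, 0.0, 1.0, 500
y = np.sort(rng.normal(1.0, sigma, size=N))  # sorting makes batch 0 unrepresentative
J, B = 25, 20
batches = np.split(y, J)

def make_grad(data, prior_frac, scale):
    # gradient of scale * (prior_frac of the prior potential + data terms)
    return lambda mu: scale * (prior_frac * (mu - m) / s**2
                               + np.sum(mu - data) / sigma**2)

grad_full = make_grad(y, 1.0, 1.0)              # dV/dmu
grad_sub  = make_grad(batches[0], 1.0 / J, J)   # J * dV_0/dmu

def endpoint(grad, eps, t=1.0):
    # leapfrog trajectory of fixed integration time t
    mu, p = 0.0, 1.0
    for _ in range(int(round(t / eps))):
        p -= 0.5 * eps * grad(mu); mu += eps * p; p -= 0.5 * eps * grad(mu)
    return mu

gaps = [abs(endpoint(grad_sub, eps) - endpoint(grad_full, eps))
        for eps in (0.1, 0.05, 0.025)]
print(gaps)  # gap does not vanish as eps shrinks
```

Both trajectories are integrated accurately here; the gap comes entirely from the drift $\delta \vec{V}_j$ between the subsampled and true gradients, so no amount of step-size tuning removes it.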
Figure 1. The bias induced by subsampling data in Hamiltonian Monte Carlo depends on how precisely the gradients of the subsampled potential energies integrate to the gradient of the true potential energy. (a) When the subsampled gradient is close to the true gradient, the stochastic trajectory will follow the true trajectory and the bias will be small. (b) Conversely, if the subsampled gradient is not close to the true gradient then the stochastic trajectory will drift away from the true trajectory and induce a bias. Subsampling between trajectories requires that each subsampled gradient approximate the true gradient, while subsampling within a single trajectory requires only that the average of the subsampled gradients approximates the true gradient. As the dimension of the target distribution grows, however, an accurate approximation in either case becomes increasingly more difficult unless the data become correspondingly more redundant relative to the complexity of the target distribution.

When the full data are used, numerical trajectories generated by the second-order symplectic integrator constructed above closely follow the true trajectories (Figure 2a). Approximating the potential with a subsample of the data introduces the aforementioned bias, which shifts the stochastic trajectory away from the exact trajectory despite negligible error from the symplectic integrator itself (Figure 2b). Only when the size of each subsample approaches the full data set, and the computational benefit of subsampling fades, does the stochastic trajectory provide a reasonable approximation to the exact trajectory (Figure 2c).

As noted above, geometric considerations suggest that this bias should grow with the dimensionality of the target distribution.
To see this, consider running subsampled Hamiltonian Monte Carlo on the multivariate generalization of (2),

$$\prod_{d=1}^{D} \pi(\mu_d \mid y_d), \tag{3}$$

where the true $\mu_d$ are sampled from $\mu_d \sim \mathcal{N}(0, 1)$ and trajectories are generated using a subsampled integrator with step size $\epsilon$, a random integration time $\tau \sim U(0, 2\pi)$, and no Metropolis correction. As a surrogate for the accuracy of the resulting samples I will use the average Metropolis acceptance probability of each new state using the full data.

When the full data are used in this model, the step size of the symplectic integrator can be tuned to maintain constant accuracy as the dimensionality of the target distribution, $D$, increases. The bias induced by subsampling between trajectories, however, is invariant to the step size of the integrator and rapidly increases with the dimension of the target distribution. Here the data were partitioned into $J = 25$ batches of $B = 20$ data, the subsample used for each trajectory is randomly selected from the first five batches, and the step size of the subsampled trajectory is reduced by a factor of $N/(5B) = 5$ to equalize the computational cost with full-data trajectories (Figure 3).

3.2. Subsampling Data Within a Single Trajectory

Given that using a single subsample for an entire trajectory introduces an irreducible bias, we might next consider subsampling at each step within a single trajectory, hoping that the bias from each subsample cancels in expectation. Ignoring any Metropolis correction, this is exactly the naive stochastic gradient Hamiltonian Monte Carlo of Chen et al. (2014).

To understand the accuracy of this strategy consider building up such a stochastic trajectory one step at a time. Given the first two randomly-selected subsamples, $V_i$ and then $V_j$, the first two steps of the resulting integrator are given by

$$\begin{aligned}
\phi^{H_j}_{\epsilon} \circ \phi^{H_i}_{\epsilon}
&= e^{\epsilon \left( \vec{H} - \delta \vec{V}_j \right)} \, e^{\epsilon \left( \vec{H} - \delta \vec{V}_i \right)} + \mathcal{O}(\epsilon^3) \\
&= \exp\!\left( \epsilon \left( 2 \vec{H} - \delta \vec{V}_i - \delta \vec{V}_j \right)
+ \frac{\epsilon^2}{2} \left[ \vec{H} - \delta \vec{V}_j,\, \vec{H} - \delta \vec{V}_i \right] \right) + \mathcal{O}(\epsilon^3) \\
&= \exp\!\left( \epsilon \left( 2 \vec{H} - \delta \vec{V}_i - \delta \vec{V}_j \right)
- \frac{\epsilon^2}{2} \left( \left[ \vec{H}, \delta \vec{V}_i \right] + \left[ \delta \vec{V}_j, \vec{H} \right] \right) \right) + \mathcal{O}(\epsilon^3),
\end{aligned}$$
Figure 3. When the full data are used, high accuracy of Hamiltonian Monte Carlo samples, here represented by the average Metropolis acceptance probability using the full data, can be maintained even as the dimensionality of the target distribution grows. The biases induced when the data are subsampled, either between or within trajectories, cannot be controlled and quickly devastate the accuracy of the algorithm. Here the step size of the subsampled algorithms has been decreased relative to the full-data algorithm in order to equalize the computational cost; even in this simple example, a proper implementation of Hamiltonian Monte Carlo can achieve a given accuracy much more efficiently than subsampling.

Figure 2. Even for the simple posterior (2), subsampling data in between trajectories introduces significant pathologies. (a) When the full data are used ($J = 1$, $B = 500$), numerical trajectories (dashed line) closely track the exact trajectories (solid line). Subsampling of the data introduces a bias in both the exact trajectories and corresponding numerical trajectories. (b) If the size of each subsample is small ($J = 50$, $B = 10$) then this bias is large. (c) Only when the size of the subsamples approaches the size of the full data ($J = 2$, $B = 250$), and any computational benefits from subsampling wane, do the stochastic trajectories provide a reasonable emulation of the true trajectories.

Returning to the stochastic trajectory, in the composition above we have used the fact that the vector fields $\{ \delta \vec{V}_j \}$ commute with each other. Similarly, the first three steps are given by
$$\begin{aligned}
\phi^{H_k}_{\epsilon} \circ \phi^{H_j}_{\epsilon} \circ \phi^{H_i}_{\epsilon}
&= \exp\!\left( \epsilon \left( 3 \vec{H} - \delta \vec{V}_i - \delta \vec{V}_j - \delta \vec{V}_k \right)
- \frac{\epsilon^2}{2} \left( \left[ \vec{H}, \delta \vec{V}_i \right] + \left[ \delta \vec{V}_j, \vec{H} \right] \right)
+ \frac{\epsilon^2}{2} \left[ \vec{H} - \delta \vec{V}_k,\, 2 \vec{H} - \delta \vec{V}_i - \delta \vec{V}_j \right] \right) + \mathcal{O}(\epsilon^3) \\
&= \exp\!\left( \epsilon \left( 3 \vec{H} - \delta \vec{V}_i - \delta \vec{V}_j - \delta \vec{V}_k \right)
- \epsilon^2 \left( \left[ \vec{H}, \delta \vec{V}_i \right] + \left[ \delta \vec{V}_k, \vec{H} \right] \right) \right) + \mathcal{O}(\epsilon^3),
\end{aligned}$$

and, letting $j_l$ denote the subsample chosen at the $l$-th step,
the composition over an entire trajectory becomes

$$\begin{aligned}
\prod_{l=1}^{L} \phi^{H_{j_l}}_{\epsilon}
&= \exp\!\left( \epsilon L \left( \vec{H} - \frac{1}{L} \sum_{l=1}^{L} \delta \vec{V}_{j_l} \right)
- \frac{(L-1)\,\epsilon^2}{2} \left( \left[ \vec{H}, \delta \vec{V}_{j_1} \right] + \left[ \delta \vec{V}_{j_L}, \vec{H} \right] \right) \right)
+ (\epsilon L)\, \mathcal{O}(\epsilon^2) \\
&= \exp\!\left( \tau \left( \vec{H} - B_1 + B_2 \right) \right) + \mathcal{O}(\epsilon^2),
\end{aligned}$$

where

$$B_1 = \frac{1}{L} \sum_{l=1}^{L} \delta \vec{V}_{j_l}, \qquad
B_2 = -\frac{(L-1)\,\epsilon}{2L} \left( \left[ \vec{H}, \delta \vec{V}_{j_1} \right] + \left[ \delta \vec{V}_{j_L}, \vec{H} \right] \right).$$

Once again, subsampling the data introduces bias into the numerical trajectories. Although the second source of bias, $B_2$, is immediately rectified by appending the stochastic trajectory with an update from the initial subsample such that $j_L = j_1$, the first source of bias, $B_1$, is not so easily remedied. Expanding,

$$\frac{1}{L} \sum_{l=1}^{L} \delta \vec{V}_{j_l}
= \frac{1}{L} \sum_{l=1}^{L} \left( \vec{V} - J \vec{V}_{j_l} \right)
= \vec{V} - \frac{J}{L} \sum_{l=1}^{L} \vec{V}_{j_l},$$

we see that $B_1$ vanishes only when the average gradient of the selected subsamples yields the gradient of the full potential. Averaging over subsamples may reduce the bias compared to using a single subsample over the entire trajectory (1), but the bias still scales poorly with the dimensionality of the target distribution (Figure 1).

In order to ensure that the bias vanishes identically, independent of the redundancy of the data, we have to use each subsample the same number of times within a single trajectory. In particular, both biases vanish if we use each subsample twice in a symmetric composition of the form

$$\prod_{l=1}^{J} \phi^{H_{j_l}}_{\epsilon} \circ \prod_{l=1}^{J} \phi^{H_{j_{J+1-l}}}_{\epsilon}.$$

Because this composition requires using all of the subsamples it does not provide any computational savings, and it seems rather at odds with the original stochastic subsampling motivation. Indeed, this symmetric composition is not stochastic at all and actually corresponds to a rather elaborate symplectic integrator where the potential energy from each subsample generates its own flow, equivalent to the integrator in Split Hamiltonian Monte Carlo (Shahbaba et al., 2014) with the larger step size $J\epsilon$. Removing intermediate steps from this symmetric, stochastic trajectory (Figure 4a) reveals the level set of the corresponding modified Hamiltonian (Figure 4b).
Because this symmetric composition integrates the full Hamiltonian system, the error is once again controllable and vanishes as the step size is decreased (Figure 4c). Limiting the number of subsamples, however, leaves an irreducible bias in the trajectories that cannot be controlled by tuning the step size (Figures 3, 5). Once more we are left dependent on the redundancy of the data for any hope of improved performance with subsampling.

4. Conclusion

The efficacy of Markov Chain Monte Carlo for complex, high-dimensional target distributions depends on the ability of the sampler to explore the intricate and often meandering neighborhoods on which the probability is distributed. Symplectic integrators admit a structure-preserving implementation of Hamiltonian Monte Carlo that is amazingly robust to this complexity and capable of efficiently exploring the most complex target distributions.

Subsampled data, however, do not in general carry enough information to enable such efficient exploration. This lack of information manifests as an irreducible bias that devastates the scalable performance of Hamiltonian Monte Carlo. Consequently, without access to the full data there is no immediate way of engineering a well-behaved implementation of Hamiltonian Monte Carlo applicable to most statistical models. As with so many other subsampling algorithms, the adequacy of a subsampled Hamiltonian Monte Carlo implementation is at the mercy of the redundancy of the data relative to the complexity of the target model, and not in the control of the user.

Unfortunately many of the problems at the frontiers of applied statistics are in the wide data regime, where data are sparse relative to model complexity. Here subsampling methods have little hope of success; we must focus our efforts not on modifying Hamiltonian Monte Carlo but rather on improving its implementation with, for example, better memory management and efficiently parallelized gradient calculations.
Figure 4. The symmetric composition of flows from each subsample of the data eliminates all bias in the stochastic trajectory because it implicitly reconstructs a symplectic integrator. Refining (a) all intermediate steps in a stochastic trajectory (b) to only those occurring after a symmetric sweep of the subsamples reveals the level set of the modified Hamiltonian corresponding to the implicit symplectic integrator. Because of the vanishing bias, (c) the error in the stochastic trajectory can be controlled by taking the step size to zero.

Figure 5. (a) Utilizing only a few subsamples within a trajectory yields numerical trajectories biased away from the exact trajectories. (b) Unlike the error introduced by a full symplectic integrator, this bias is irreducible and cannot be controlled by tuning the step size. The performance of such an algorithm is limited by the size of the bias, which itself depends on the redundancy of the data relative to the target model.

[Panel annotations show step sizes of ε = 0.05 for the large-step panels of Figures 4 and 5, and ε = 0.0005 and ε = 0.002 for the respective small-step panels.]
Acknowledgements

It is my pleasure to thank Simon Byrne, Mark Girolami, Matt Hoffman, Matt Johnson, and Sam Livingstone for thoughtful discussion and helpful comments, as well as three anonymous reviewers for constructive assessments. I am supported under EPSRC grant EP/J016934/1.

References

Bardenet, Rémi, Doucet, Arnaud, and Holmes, Chris. An adaptive subsampling approach for MCMC inference in large datasets. In Proceedings of The 31st International Conference on Machine Learning, pp. 405–413, 2014.

Betancourt, Michael, Byrne, Simon, and Girolami, Mark. Optimizing the integrator step size for Hamiltonian Monte Carlo. ArXiv e-prints, 1411.6669, 11 2014a.

Shahbaba, Babak, Lan, Shiwei, Johnson, Wesley O., and Neal, Radford M. Split Hamiltonian Monte Carlo. Statistics and Computing, 24(3):339–349, 2014.

Teh, Yee Whye, Thiéry, Alexandre, and Vollmer, Sebastian. Consistency and fluctuations for stochastic gradient Langevin dynamics. ArXiv e-prints, 1409.0578, 09 2014.

Vollmer, Sebastian J., Zygalakis, Konstantinos C., and Teh, Yee Whye. (Non-)asymptotic properties of stochastic gradient Langevin dynamics. ArXiv e-prints, 1501.00438, 01 2015.

Welling, Max and Teh, Yee W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688, 2011.

Betancourt, Michael, Byrne, Simon, Livingstone, Samuel, and Girolami, Mark. The geometric foundations of Hamiltonian Monte Carlo. ArXiv e-prints, 1410.5110, 10 2014b.

Chen, Tianqi, Fox, Emily B., and Guestrin, Carlos. Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of The 31st International Conference on Machine Learning, pp. 1683–1691, 2014.

Duane, Simon, Kennedy, A. D., Pendleton, Brian J., and Roweth, Duncan. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.

Girolami, Mark and Calderhead, Ben. Riemann manifold Langevin and Hamiltonian Monte Carlo methods.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.

Hairer, E., Lubich, C., and Wanner, G. Geometric Numerical Integration: Structure-Preserving Algorithms for Ordinary Differential Equations. Springer, New York, 2006.

Lee, John M. Introduction to Smooth Manifolds. Springer, 2013.

Leimkuhler, B. and Reich, S. Simulating Hamiltonian Dynamics. Cambridge University Press, New York, 2004.

Neal, R. M. MCMC using Hamiltonian dynamics. In Brooks, Steve, Gelman, Andrew, Jones, Galin L., and Meng, Xiao-Li (eds.), Handbook of Markov Chain Monte Carlo. CRC Press, New York, 2011.

Neiswanger, Willie, Wang, Chong, and Xing, Eric. Asymptotically exact, embarrassingly parallel MCMC. ArXiv e-prints, 1311.4780, 2013.