The Fundamental Incompatibility of Scalable Hamiltonian Monte Carlo and Naive Data Subsampling





The Fundamental Incompatibility of Scalable Hamiltonian Monte Carlo and Naive Data Subsampling

Michael Betancourt
Department of Statistics, University of Warwick, Coventry, UK CV4 7AL
BETANALPHA@GMAIL.COM

Abstract

Leveraging the coherent exploration of Hamiltonian flow, Hamiltonian Monte Carlo produces computationally efficient Monte Carlo estimators, even with respect to complex and high-dimensional target distributions. When confronted with data-intensive applications, however, the algorithm may be too expensive to implement, leaving us to consider the utility of approximations such as data subsampling. In this paper I demonstrate how data subsampling fundamentally compromises the scalability of Hamiltonian Monte Carlo.

With the preponderance of applications featuring enormous data sets, methods of inference requiring only subsamples of data are becoming more and more appealing. Subsampled Markov Chain Monte Carlo algorithms (Neiswanger et al., 2013; Welling & Teh, 2011) are particularly desired for their potential applicability to most statistical models. Unfortunately, careful analysis of these algorithms reveals unavoidable biases unless the data are tall, or highly redundant (Bardenet et al., 2014; Teh et al., 2014; Vollmer et al., 2015). Because redundancy can be defined only relative to a given model, the utility of these subsampled algorithms is a consequence of not only the desired accuracy but also the particular model and data under consideration, severely restricting their practicality.

Recently Chen et al. (2014) considered subsampling within Hamiltonian Monte Carlo (Duane et al., 1987; Neal, 2011; Betancourt et al., 2014b) and demonstrated that naive subsampling induces unacceptably large biases.
Ultimately the authors rectified this bias by sacrificing the coherent exploration of Hamiltonian flow for a diffusive correction, fundamentally compromising the scalability of the algorithm with respect to the complexity of the target distribution. An algorithm scalable with respect to both the size of the data and the complexity of the target distribution would have to maintain the coherent exploration of Hamiltonian flow while subsampling and, unfortunately, these objectives are mutually exclusive in general. In this paper I review the elements of Hamiltonian Monte Carlo critical to its robust and scalable performance in practice and demonstrate how different subsampling strategies all compromise those properties and consequently induce poor performance.

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

1. Hamiltonian Monte Carlo in Theory

Hamiltonian Monte Carlo utilizes deterministic, measure-preserving maps to generate efficient Markov transitions (Betancourt et al., 2014b). Formally, we begin by complementing a target distribution,

\pi \propto \exp[-V(q)] \, \mathrm{d}^n q,

with a conditional distribution over auxiliary momenta parameters,

\pi(p \mid q) \propto \exp[-T(q, p)] \, \mathrm{d}^n p.

Together these define a joint distribution,

\varpi_H \propto \exp[-(T(q, p) + V(q))] \, \mathrm{d}^n q \, \mathrm{d}^n p = \exp[-H(q, p)] \, \mathrm{d}^n q \, \mathrm{d}^n p,

and a Hamiltonian system corresponding to the Hamiltonian, H(q, p). We refer to T(q, p) and V(q) as the kinetic energy and potential energy, respectively. The Hamiltonian immediately defines a Hamiltonian vector field,

\vec{H} = \frac{\partial H}{\partial p} \frac{\partial}{\partial q} - \frac{\partial H}{\partial q} \frac{\partial}{\partial p},

and an application of the exponential map yields a Hamiltonian flow on the joint space,

\phi^H_\tau = e^{\tau \vec{H}}

(Lee, 2013), which exactly preserves the joint distribution under a pullback,

\left( \phi^H_t \right)^{*} \pi_H = \pi_H.
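To make this transition scheme concrete, consider a target for which the Hamiltonian flow is available in closed form: with V(q) = q^2/2 and T(p) = p^2/2 the flow is a rigid rotation of the (q, p) plane. The following sketch (the function names are my own, not from the paper) implements the three-step transition of sampling momenta, flowing, and projecting:

```python
import numpy as np

def exact_flow(q, p, t):
    """Exact Hamiltonian flow for H(q, p) = q^2/2 + p^2/2: a rigid rotation
    of the (q, p) plane that preserves H, and hence the joint density."""
    return q * np.cos(t) + p * np.sin(t), -q * np.sin(t) + p * np.cos(t)

def transition(q, t, rng):
    """One Markov transition: sample momenta, apply the flow, project to q."""
    p = rng.standard_normal()        # p ~ pi(p | q) = N(0, 1)
    q_new, _ = exact_flow(q, p, t)   # flow for time t, then drop the momenta
    return q_new
```

Because the exact flow preserves the joint distribution, iterating this transition leaves the target distribution, here a standard Gaussian, invariant.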

Consequently, we can compose a Markov chain by sampling the auxiliary momenta,

(q, p) \to (q, p'), \quad p' \sim \pi(p \mid q),

applying the Hamiltonian flow,

(q, p') \to \phi^H_t(q, p'),

and then projecting back down to the target space, (q, p) \to q. By construction, the trajectories generated by the Hamiltonian flow explore the level sets of the Hamiltonian function. Because these level sets can also span large volumes of the joint space, sufficiently-long trajectories can yield transitions far away from the initial state of the Markov chain, drastically reducing autocorrelations and producing computationally efficient Monte Carlo estimators.

When the kinetic energy does not depend on position we say that the Hamiltonian is separable,

H(q, p) = T(p) + V(q),

and the Hamiltonian vector field decouples into a kinetic vector field, \vec{T}, and a potential vector field, \vec{V},

\vec{H} = \vec{T} + \vec{V} = \frac{\partial T}{\partial p} \frac{\partial}{\partial q} - \frac{\partial V}{\partial q} \frac{\partial}{\partial p}.

In this paper I consider only separable Hamiltonians, although the conclusions also carry over to non-separable Hamiltonians, for example those arising in Riemannian Hamiltonian Monte Carlo (Girolami & Calderhead, 2011).

2. Hamiltonian Monte Carlo in Practice

The biggest challenge of implementing Hamiltonian Monte Carlo is that the exact Hamiltonian flow is rarely calculable in practice and we must instead resort to approximate integration. Symplectic integrators, which yield numerical trajectories that closely track the true trajectories, are of particular importance to any high-performance implementation. An especially transparent strategy for constructing symplectic integrators is to split the Hamiltonian into terms with soluble flows which can then be composed together (Leimkuhler & Reich, 2004; Hairer et al., 2006). For example, consider the symmetric Strang splitting,

\phi^{\epsilon/2}_{V} \circ \phi^{\epsilon}_{T} \circ \phi^{\epsilon/2}_{V} = e^{\frac{\epsilon}{2} \vec{V}} \, e^{\epsilon \vec{T}} \, e^{\frac{\epsilon}{2} \vec{V}},

where \epsilon is a small interval of time known as the step size.
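In code, this symmetric splitting is the familiar leapfrog integrator. A minimal sketch, assuming a Euclidean kinetic energy T(p) = p.p/2 and fusing the back-to-back momentum half-steps of adjacent splittings:

```python
import numpy as np

def leapfrog(q, p, grad_V, eps, n_steps):
    """Numerical trajectory from n_steps applications of the Strang splitting
    phi_{V, eps/2} o phi_{T, eps} o phi_{V, eps/2}, assuming T(p) = p.p/2."""
    q, p = np.copy(q), np.copy(p)
    p -= 0.5 * eps * grad_V(q)       # leading momentum half step, exp((eps/2) V)
    for _ in range(n_steps - 1):
        q += eps * p                 # full position step, exp(eps T)
        p -= eps * grad_V(q)         # two fused momentum half steps
    q += eps * p
    p -= 0.5 * eps * grad_V(q)       # trailing momentum half step
    return q, p
```

Because the composition is symmetric, the integrator is time-reversible: flipping the momenta and integrating again returns the initial state up to floating-point roundoff.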
Appealing to the Baker-Campbell-Hausdorff formula, this symmetric composition yields

e^{\frac{\epsilon}{2} \vec{V}} e^{\epsilon \vec{T}} e^{\frac{\epsilon}{2} \vec{V}}
= e^{\frac{\epsilon}{2} \vec{V}} \exp\left( \epsilon \vec{T} + \frac{\epsilon}{2} \vec{V} + \frac{\epsilon^2}{4} \left[ \vec{T}, \vec{V} \right] + \mathcal{O}(\epsilon^3) \right)
= \exp\left( \frac{\epsilon}{2} \vec{V} + \epsilon \vec{T} + \frac{\epsilon}{2} \vec{V} + \frac{\epsilon^2}{4} \left[ \vec{T}, \vec{V} \right] + \frac{1}{2} \left[ \frac{\epsilon}{2} \vec{V}, \, \epsilon \vec{T} + \frac{\epsilon}{2} \vec{V} + \frac{\epsilon^2}{4} \left[ \vec{T}, \vec{V} \right] \right] + \mathcal{O}(\epsilon^3) \right)
= \exp\left( \epsilon \vec{H} + \frac{\epsilon^2}{4} \left[ \vec{T}, \vec{V} \right] + \frac{\epsilon^2}{4} \left[ \vec{V}, \vec{T} \right] + \frac{\epsilon^2}{8} \left[ \vec{V}, \vec{V} \right] + \mathcal{O}(\epsilon^3) \right)
= e^{\epsilon \vec{H} + \mathcal{O}(\epsilon^3)},

because [\vec{T}, \vec{V}] + [\vec{V}, \vec{T}] = 0 and [\vec{V}, \vec{V}] = 0. Composing this symmetric composition with itself L = \tau / \epsilon times results in a symplectic integrator accurate to second order in the step size for any finite integration time, \tau,

\phi^H_{\epsilon, \tau} \equiv \left( \phi^{\epsilon/2}_{V} \circ \phi^{\epsilon}_{T} \circ \phi^{\epsilon/2}_{V} \right)^L = \left( e^{\epsilon \vec{H} + \mathcal{O}(\epsilon^3)} \right)^L = e^{(L\epsilon) \vec{H} + (L\epsilon) \mathcal{O}(\epsilon^2)} = e^{\tau \vec{H} + \tau \mathcal{O}(\epsilon^2)} = e^{\tau \vec{H}} + \mathcal{O}(\epsilon^2).

Remarkably, the resulting numerical trajectories are confined to the level sets of a modified Hamiltonian given by an \mathcal{O}(\epsilon^2) perturbation of the exact Hamiltonian (Hairer et al., 2006; Betancourt et al., 2014a).

Although such symplectic integrators are highly accurate, they still introduce an error into the trajectories that can bias the Markov chain and any resulting Monte Carlo estimators. In practice this error is typically compensated with the application of a Metropolis correction, accepting a point along the numerical trajectory only with probability

a(q, p) = \min\left( 1, \exp\left( H(q, p) - H\left( \phi^H_{\epsilon, \tau}(q, p) \right) \right) \right).

A critical reason for the scalable performance of such an implementation of Hamiltonian Monte Carlo is that the error in a symplectic integrator scales with the step size, \epsilon. Consequently a small bias or a large acceptance probability can be maintained by reducing the step size, regardless of the complexity or dimension of the target distribution (Betancourt et al., 2014a). If the symplectic integrator is compromised, however, then this scalability and generality is lost.
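Putting the pieces together, one full transition can be sketched as follows (the function and variable names are my own, assuming unit-Gaussian momenta): resample the momenta, run the symplectic integrator, and apply the Metropolis correction above.

```python
import numpy as np

def hmc_transition(q, V, grad_V, eps, n_steps, rng):
    """One Hamiltonian Monte Carlo transition with unit-Gaussian momenta:
    resample p, integrate with the leapfrog (Strang) integrator, and apply
    the Metropolis correction a = min(1, exp(H(q, p) - H(q', p')))."""
    p = rng.standard_normal(q.shape)                     # momentum refreshment
    H0 = V(q) + 0.5 * p @ p
    q_new, p_new = q.copy(), p - 0.5 * eps * grad_V(q)   # leading half step
    for _ in range(n_steps - 1):
        q_new = q_new + eps * p_new
        p_new = p_new - eps * grad_V(q_new)
    q_new = q_new + eps * p_new
    p_new = p_new - 0.5 * eps * grad_V(q_new)            # trailing half step
    H1 = V(q_new) + 0.5 * p_new @ p_new
    if rng.random() < np.exp(min(0.0, H0 - H1)):         # prob min(1, e^{H0-H1})
        return q_new
    return q                                             # reject: stay put
```

With the correction in place the chain targets the exact posterior; the integrator error affects only the acceptance rate, which can always be recovered by shrinking eps.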

3. Hamiltonian Monte Carlo With Subsampling

A common criticism of Hamiltonian Monte Carlo is that in data-intensive applications the evaluation of the potential vector field, \vec{V} = -\frac{\partial V}{\partial q} \frac{\partial}{\partial p}, and hence the simulation of numerical trajectories, can become infeasible given the expense of the gradient calculations. This expense has fueled a variety of modifications of the algorithm aimed at reducing the cost of the potential energy, often by any means necessary. An increasingly popular strategy targets Bayesian applications where the data are independently and identically distributed. In this case the posterior can be manipulated into a product of contributions from each subset of data,

\pi(\theta \mid y) \propto \pi(\theta) \prod_{j=1}^{J} \pi(y_j \mid \theta),

and the potential energy likewise decomposes into a sum,

V(\theta) = \sum_{j=1}^{J} V_j(\theta) = \sum_{j=1}^{J} \left( -\frac{1}{J} \log \pi(\theta) - \log \pi(y_j \mid \theta) \right),

where each V_j depends on only a single subset. This decomposition suggests algorithms which consider not the entirety of the data and the full potential energy, V, but rather only a few subsets at a time. Using only part of the data to generate a trajectory, however, compromises the structure-preserving properties of the symplectic integrator and hence the scalability of its accuracy. Consequently the performance of any such subsampling method depends critically on the details of the implementation and the structure of the data itself. Here I consider the performance of two immediate implementations, one based on subsampling the data in between Hamiltonian trajectories and one based on subsampling the data within a single trajectory. As expected, the performance of both methods leaves much to be desired.

3.1. Subsampling Data In Between Trajectories

Given any subset of the data, we can approximate the potential energy as V \approx J V_j and then generate trajectories corresponding to the flow of the approximate Hamiltonian, H_j = T + J V_j.
In order to avoid parsing the entirety of the data, the Metropolis correction at the end of each trajectory can be neglected and the corresponding samples left biased. Unlike the numerical trajectories from the full Hamiltonian, these subsampled trajectories are biased away from the exact trajectories regardless of the chosen step size. In particular, the bias of each step,

e^{\frac{\epsilon}{2} J \vec{V}_j} e^{\epsilon \vec{T}} e^{\frac{\epsilon}{2} J \vec{V}_j} = e^{\epsilon \vec{H}_j + \mathcal{O}(\epsilon^3)} = e^{\epsilon (\vec{H} - \Delta \vec{V}_j) + \mathcal{O}(\epsilon^3)},

where

\Delta \vec{V}_j \equiv \vec{V} - J \vec{V}_j,    (1)

persists over the entire trajectory,

\left( e^{\frac{\epsilon}{2} J \vec{V}_j} e^{\epsilon \vec{T}} e^{\frac{\epsilon}{2} J \vec{V}_j} \right)^{\tau/\epsilon} = e^{\tau (\vec{H} - \Delta \vec{V}_j)} + \mathcal{O}(\epsilon^2).

As the dimension of the target distribution grows, the subsampled gradient, J \, \partial V_j / \partial q, drifts away from the true gradient, \partial V / \partial q, unless the data become increasingly redundant. Consequently the resulting trajectory introduces an irreducible bias into the algorithm, similar in nature to the asymptotic bias seen in subsampled Langevin Monte Carlo (Teh et al., 2014; Vollmer et al., 2015), which then induces either a vanishing Metropolis acceptance probability or highly-biased expectations if the Metropolis correction is neglected (Figure 1). Unfortunately, the only way to decrease the dependency on redundant data is to increase the size of each subsample, which immediately undermines any computational benefits.

Consider, for example, a simple application where we target a one-dimensional posterior distribution,

\pi(\mu \mid y) \propto \pi(y \mid \mu) \, \pi(\mu),    (2)

with the likelihood and prior

\pi(y \mid \mu) = \prod_{n=1}^{N} \mathcal{N}\left( y_n \mid \mu, \sigma^2 \right), \quad \pi(\mu) = \mathcal{N}\left( \mu \mid m, s^2 \right).

Separating the data into J = N/B batches of size B and decomposing the prior into J individual terms then gives

V_j = \text{const} + \frac{B \left( \sigma^2 + N s^2 \right)}{2 N \sigma^2 s^2} \left( \mu - \frac{\bar{y}_j N s^2 + m \sigma^2}{\sigma^2 + N s^2} \right)^2, \quad \bar{y}_j = \frac{1}{B} \sum_{n=(j-1)B+1}^{jB} y_n.

Here I take \sigma = 2, m = 0, s = 1, and generate N = 500 data points assuming \mu = 1.
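For this example the batch potentials can be written down directly. A small sketch (the variable names are my own) verifying that the batch gradients sum exactly to the full gradient, while any single scaled batch gradient J dV_j/dmu can be far from it, which is precisely the bias of equation (1):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, m, s = 2.0, 0.0, 1.0
N, mu_true = 500, 1.0
y = rng.normal(mu_true, sigma, size=N)

J, B = 25, 20                 # J batches of size B = N / J
batches = y.reshape(J, B)

def grad_V(mu):
    # gradient of the full potential V(mu) = -log pi(mu | y) + const
    return N * (mu - y.mean()) / sigma**2 + (mu - m) / s**2

def grad_V_j(mu, j):
    # gradient of the batch potential V_j, with the prior split into J pieces
    return B * (mu - batches[j].mean()) / sigma**2 + (mu - m) / (J * s**2)
```

The decomposition is exact, sum_j grad_V_j = grad_V, but each J * grad_V_j fluctuates around the truth with the batch means, and that fluctuation is what the subsampled trajectories integrate.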

Figure 1. The bias induced by subsampling data in Hamiltonian Monte Carlo depends on how precisely the gradients of the subsampled potential energies integrate to the gradient of the true potential energy. (a) When the subsampled gradient is close to the true gradient, the stochastic trajectory will follow the true trajectory and the bias will be small. (b) Conversely, if the subsampled gradient is not close to the true gradient then the stochastic trajectory will drift away from the true trajectory and induce a bias. Subsampling between trajectories requires that each subsampled gradient approximate the true gradient, while subsampling within a single trajectory requires only that the average of the subsampled gradients approximates the true gradient. As the dimension of the target distribution grows, however, an accurate approximation in either case becomes increasingly more difficult unless the data become correspondingly more redundant relative to the complexity of the target distribution.

When the full data are used, numerical trajectories generated by the second-order symplectic integrator constructed above closely follow the true trajectories (Figure 2a). Approximating the potential with a subsample of the data introduces the aforementioned bias, which shifts the stochastic trajectory away from the exact trajectory despite negligible error from the symplectic integrator itself (Figure 2b). Only when the size of each subsample approaches the full data set, and the computational benefit of subsampling fades, does the stochastic trajectory provide a reasonable approximation to the exact trajectory (Figure 2c).

As noted above, geometric considerations suggest that this bias should grow with the dimensionality of the target distribution.
To see this, consider running subsampled Hamiltonian Monte Carlo on the multivariate generalization of (2),

\prod_{d=1}^{D} \pi\left( \mu_d \mid y_d \right),    (3)

where the true \mu_d are sampled from \mu_d \sim \mathcal{N}(0, 1) and trajectories are generated using a subsampled integrator with step size \epsilon, a random integration time \tau \sim U(0, 2\pi), and no Metropolis correction. As a surrogate for the accuracy of the resulting samples I will use the average Metropolis acceptance probability of each new state using the full data. When the full data are used in this model, the step size of the symplectic integrator can be tuned to maintain constant accuracy as the dimensionality of the target distribution, D, increases. The bias induced by subsampling between trajectories, however, is invariant to the step size of the integrator and rapidly increases with the dimension of the target distribution. Here the data were partitioned into J = 25 batches of B = 20 data, the subsample used for each trajectory is randomly selected from the first five batches, and the step size of the subsampled trajectory is reduced by a factor of N/(5B) = 5 to equalize the computational cost with full data trajectories (Figure 3).

3.2. Subsampling Data within a Single Trajectory

Given that using a single subsample for an entire trajectory introduces an irreducible bias, we might next consider subsampling at each step within a single trajectory, hoping that the bias from each subsample cancels in expectation. Ignoring any Metropolis correction, this is exactly the naive stochastic gradient Hamiltonian Monte Carlo of Chen et al. (2014). To understand the accuracy of this strategy consider building up such a stochastic trajectory one step at a time. Given the first two randomly-selected subsamples, V_i and then V_j, the first two steps of the resulting integrator are given by

\phi^{\epsilon}_{H_j} \circ \phi^{\epsilon}_{H_i}
= e^{\epsilon (\vec{H} - \Delta \vec{V}_j)} e^{\epsilon (\vec{H} - \Delta \vec{V}_i)} + \mathcal{O}(\epsilon^3)
= \exp\left( \epsilon \left( 2 \vec{H} - \Delta \vec{V}_i - \Delta \vec{V}_j \right) + \frac{\epsilon^2}{2} \left[ \vec{H} - \Delta \vec{V}_j, \, \vec{H} - \Delta \vec{V}_i \right] \right) + \mathcal{O}(\epsilon^3)
= \exp\left( \epsilon \left( 2 \vec{H} - \Delta \vec{V}_i - \Delta \vec{V}_j \right) + \frac{\epsilon^2}{2} \left( \left[ \Delta \vec{V}_i, \vec{H} \right] - \left[ \Delta \vec{V}_j, \vec{H} \right] \right) \right) + \mathcal{O}(\epsilon^3),

where we have used the fact that the vector fields \{ \Delta \vec{V}_j \} commute with each other.

Figure 3. When the full data are used, high accuracy of Hamiltonian Monte Carlo samples, here represented by the average Metropolis acceptance probability using the full data, can be maintained even as the dimensionality of the target distribution grows. The biases induced when the data are subsampled, however, cannot be controlled and quickly devastate the accuracy of the algorithm. Here the step size of the subsampled algorithms has been decreased relative to the full data algorithm in order to equalize the computational cost. Even in this simple example, a proper implementation of Hamiltonian Monte Carlo can achieve a given accuracy much more efficiently than subsampling.

Figure 2. Even for the simple posterior (2), subsampling data in between trajectories introduces significant pathologies. (a) When the full data are used (J = 1, B = 500), numerical trajectories (dashed line) closely track the exact trajectories (solid line). Subsampling of the data introduces a bias in both the exact trajectories and corresponding numerical trajectories. (b) If the size of each subsample is small (J = 50, B = 10) then this bias is large. (c) Only when the size of the subsamples approaches the size of the full data (J = 2, B = 250), and any computational benefits from subsampling wane, do the stochastic trajectories provide a reasonable emulation of the true trajectories.

Similarly, the first three steps are given by
\phi^{\epsilon}_{H_k} \circ \phi^{\epsilon}_{H_j} \circ \phi^{\epsilon}_{H_i}
= \exp\left( \epsilon \left( 3 \vec{H} - \Delta \vec{V}_i - \Delta \vec{V}_j - \Delta \vec{V}_k \right) + \frac{\epsilon^2}{2} \left( \left[ \Delta \vec{V}_i, \vec{H} \right] - \left[ \Delta \vec{V}_j, \vec{H} \right] \right) + \frac{\epsilon^2}{2} \left[ \vec{H} - \Delta \vec{V}_k, \, 2 \vec{H} - \Delta \vec{V}_i - \Delta \vec{V}_j \right] \right) + \mathcal{O}(\epsilon^3)
= \exp\left( \epsilon \left( 3 \vec{H} - \Delta \vec{V}_i - \Delta \vec{V}_j - \Delta \vec{V}_k \right) + \epsilon^2 \left( \left[ \Delta \vec{V}_i, \vec{H} \right] - \left[ \Delta \vec{V}_k, \vec{H} \right] \right) \right) + \mathcal{O}(\epsilon^3),

and, letting j_l denote the subsample chosen at the l-th step,

the composition over an entire trajectory becomes

\circ_{l=1}^{L} \phi^{\epsilon}_{H_{j_l}}
= \exp\left( \tau \vec{H} - \frac{\tau}{L} \sum_{l=1}^{L} \Delta \vec{V}_{j_l} + \frac{\tau \epsilon}{2} \left( \left[ \Delta \vec{V}_{j_1}, \vec{H} \right] - \left[ \Delta \vec{V}_{j_L}, \vec{H} \right] \right) \right) + \mathcal{O}(\epsilon^2)
= \exp\left( \tau \left( \vec{H} - B_1 + B_2 \right) \right) + \mathcal{O}(\epsilon^2),

where

B_1 = \frac{1}{L} \sum_{l=1}^{L} \Delta \vec{V}_{j_l}, \qquad B_2 = \frac{\epsilon}{2} \left( \left[ \Delta \vec{V}_{j_1}, \vec{H} \right] - \left[ \Delta \vec{V}_{j_L}, \vec{H} \right] \right).

Once again, subsampling the data introduces bias into the numerical trajectories. Although the second source of bias, B_2, is immediately rectified by appending the stochastic trajectory with an update from the initial subsample such that j_L = j_1, the first source of bias, B_1, is not so easily remedied. Expanding,

B_1 = \frac{1}{L} \sum_{l=1}^{L} \Delta \vec{V}_{j_l} = \frac{1}{L} \sum_{l=1}^{L} \left( \vec{V} - J \vec{V}_{j_l} \right) = \vec{V} - \frac{J}{L} \sum_{l=1}^{L} \vec{V}_{j_l},

we see that B_1 vanishes only when the average gradient of the selected subsamples yields the gradient of the full potential. Averaging over subsamples may reduce the bias compared to using a single subsample over the entire trajectory (1), but the bias still scales poorly with the dimensionality of the target distribution (Figure 1).

In order to ensure that the bias vanishes identically, independent of the redundancy of the data, we have to use each subsample the same number of times within a single trajectory. In particular, both biases vanish if we use each subsample twice in a symmetric composition of the form

\left( \circ_{l=1}^{J} \phi^{\epsilon}_{H_l} \right) \circ \left( \circ_{l=1}^{J} \phi^{\epsilon}_{H_{J+1-l}} \right).

Because this composition requires using all of the subsamples it does not provide any computational savings, and it seems rather at odds with the original stochastic subsampling motivation. Indeed, this symmetric composition is not stochastic at all and actually corresponds to a rather elaborate symplectic integrator where the potential energy from each subsample generates its own flow, equivalent to the integrator in Split Hamiltonian Monte Carlo (Shahbaba et al., 2014) with the larger step size J\epsilon. Removing intermediate steps from this symmetric, stochastic trajectory (Figure 4a) reveals the level set of the corresponding modified Hamiltonian (Figure 4b).
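The irreducible bias B_1 is easy to observe numerically. A small sketch (reusing the Gaussian example (2), with my own parameter choices) comparing a full-data leapfrog trajectory against a naive stochastic-gradient trajectory that draws a fresh random batch at every step:

```python
import numpy as np

rng = np.random.default_rng(7)
sigma, s, N = 2.0, 1.0, 500
y = rng.normal(1.0, sigma, size=N)   # synthetic data, true mu = 1
J, B = 25, 20
batches = y.reshape(J, B)

def grad_full(mu):
    # gradient of the full potential, with prior N(0, s^2)
    return N * (mu - y.mean()) / sigma**2 + mu / s**2

def grad_batch(mu, j):
    # J * grad V_j: the scaled single-batch surrogate gradient
    return N * (mu - batches[j].mean()) / sigma**2 + mu / s**2

def H(mu, p):
    # Hamiltonian up to an additive constant
    return N * (mu - y.mean())**2 / (2 * sigma**2) + mu**2 / (2 * s**2) + 0.5 * p**2

def trajectory(mu, p, eps, L, grad):
    """Leapfrog trajectory; grad(mu) may be exact or a stochastic surrogate."""
    p = p - 0.5 * eps * grad(mu)
    for _ in range(L - 1):
        mu = mu + eps * p
        p = p - eps * grad(mu)
    mu = mu + eps * p
    p = p - 0.5 * eps * grad(mu)
    return mu, p
```

The full-data trajectory stays on a level set of the modified Hamiltonian, while the stochastic trajectory drifts off it by an amount that no reduction of the step size can control.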
Because this symmetric composition integrates the full Hamiltonian system, the error is once again controllable and vanishes as the step size is decreased (Figure 4c). Limiting the number of subsamples, however, leaves an irreducible bias in the trajectories that cannot be controlled by tuning the step size (Figures 3, 5). Once more we are left dependent on the redundancy of the data for any hope of improved performance with subsampling.

4. Conclusion

The efficacy of Markov Chain Monte Carlo for complex, high-dimensional target distributions depends on the ability of the sampler to explore the intricate and often meandering neighborhoods on which the probability is distributed. Symplectic integrators admit a structure-preserving implementation of Hamiltonian Monte Carlo that is amazingly robust to this complexity and capable of efficiently exploring the most complex target distributions. Subsampled data, however, do not in general carry enough information to enable such efficient exploration. This lack of information manifests as an irreducible bias that devastates the scalable performance of Hamiltonian Monte Carlo. Consequently, without access to the full data there is no immediate way of engineering a well-behaved implementation of Hamiltonian Monte Carlo applicable to most statistical models.

As with so many other subsampling algorithms, the adequacy of a subsampled Hamiltonian Monte Carlo implementation is at the mercy of the redundancy of the data relative to the complexity of the target model, and not in the control of the user. Unfortunately many of the problems at the frontiers of applied statistics are in the wide data regime, where data are sparse relative to model complexity. Here subsampling methods have little hope of success; we must focus our efforts not on modifying Hamiltonian Monte Carlo but rather on improving its implementation with, for example, better memory management and efficiently parallelized gradient calculations.

Figure 4. The symmetric composition of flows from each subsample of the data eliminates all bias in the stochastic trajectory because it implicitly reconstructs a symplectic integrator. Refining (a) all intermediate steps in a stochastic trajectory (b) to only those occurring after a symmetric sweep of the subsamples reveals the level set of the modified Hamiltonian corresponding to the implicit symplectic integrator. Because of the vanishing bias, (c) the error in the stochastic trajectory can be controlled by taking the step size to zero.

Figure 5. (a) Utilizing only a few subsamples within a trajectory yields numerical trajectories biased away from the exact trajectories. (b) Unlike the error introduced by a full symplectic integrator, this bias is irreducible and cannot be controlled by tuning the step size. The performance of such an algorithm is limited by the size of the bias, which itself depends on the redundancy of the data relative to the target model.

Acknowledgements

It is my pleasure to thank Simon Byrne, Mark Girolami, Matt Hoffman, Matt Johnson, and Sam Livingstone for thoughtful discussion and helpful comments, as well as three anonymous reviewers for constructive assessments. I am supported under EPSRC grant EP/J016934/1.

References

Bardenet, Rémi, Doucet, Arnaud, and Holmes, Chris. An adaptive subsampling approach for MCMC inference in large datasets. In Proceedings of The 31st International Conference on Machine Learning, pp. 405-413, 2014.

Betancourt, Michael, Byrne, Simon, and Girolami, Mark. Optimizing the integrator step size for Hamiltonian Monte Carlo. ArXiv e-prints, 1411.6669, 11 2014a.

Betancourt, Michael, Byrne, Simon, Livingstone, Samuel, and Girolami, Mark. The geometric foundations of Hamiltonian Monte Carlo. ArXiv e-prints, 1410.5110, 10 2014b.

Chen, Tianqi, Fox, Emily B., and Guestrin, Carlos. Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of The 31st International Conference on Machine Learning, pp. 1683-1691, 2014.

Duane, Simon, Kennedy, A.D., Pendleton, Brian J., and Roweth, Duncan. Hybrid Monte Carlo. Physics Letters B, 195(2):216-222, 1987.

Girolami, Mark and Calderhead, Ben. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123-214, 2011.

Hairer, E., Lubich, C., and Wanner, G. Geometric Numerical Integration: Structure-Preserving Algorithms for Ordinary Differential Equations. Springer, New York, 2006.

Lee, John M. Introduction to Smooth Manifolds. Springer, 2013.

Leimkuhler, B. and Reich, S. Simulating Hamiltonian Dynamics. Cambridge University Press, New York, 2004.

Neal, R.M. MCMC using Hamiltonian dynamics. In Brooks, Steve, Gelman, Andrew, Jones, Galin L., and Meng, Xiao-Li (eds.), Handbook of Markov Chain Monte Carlo. CRC Press, New York, 2011.

Neiswanger, Willie, Wang, Chong, and Xing, Eric. Asymptotically exact, embarrassingly parallel MCMC. ArXiv e-prints, 1311.4780, 2013.

Shahbaba, Babak, Lan, Shiwei, Johnson, Wesley O., and Neal, Radford M. Split Hamiltonian Monte Carlo. Statistics and Computing, 24(3):339-349, 2014.

Teh, Yee Whye, Thiéry, Alexandre, and Vollmer, Sebastian. Consistency and fluctuations for stochastic gradient Langevin dynamics. ArXiv e-prints, 1409.0578, 09 2014.

Vollmer, Sebastian J., Zygalakis, Konstantinos C., and Teh, Yee Whye. (Non-)asymptotic properties of stochastic gradient Langevin dynamics. ArXiv e-prints, 1501.00438, 01 2015.

Welling, Max and Teh, Yee W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681-688, 2011.