Computational Statistics for Big Data
Lancaster University

Computational Statistics for Big Data

Author:

Supervisors: Paul Fearnhead (Lancaster University), Emily Fox (The University of Washington)

September 1, 2015

Abstract

The amount of data stored by organisations and individuals is growing at an astonishing rate. As statistical models grow in complexity and size, traditional machine learning algorithms are struggling to scale to the large datasets required for model fitting. Markov chain Monte Carlo (MCMC) is one algorithm that has been left behind, even though it has proven to be an invaluable tool for training complex statistical models. This report discusses a number of possible solutions that enable MCMC to scale more effectively to large datasets. We focus on two particular families of solutions: batch methods and stochastic gradient Monte Carlo methods. Batch methods split the full dataset into disjoint subsets and run traditional MCMC on each subset; the difficulty lies in recombining the MCMC output from the subsets, the idea being that the combination will closely approximate the posterior under the full dataset. Stochastic gradient Monte Carlo samples approximately from the full posterior while using only a subsample of the data at each iteration. It does this by combining two key ideas: stochastic optimization, an algorithm that finds the mode of the posterior using only a subset of the data at each iteration, and Hamiltonian Monte Carlo, a method that provides efficient proposals for Metropolis-Hastings algorithms with high acceptance rates. After discussing the methods and important extensions, we perform a simulation study that compares the methods and examines how they are affected by various model properties.
Contents

1 Introduction
  1.1 An overview of methods
  1.2 Report outline
2 Batch methods
  2.1 Introduction
  2.2 Splitting the data
  2.3 Efficiently sampling from products of Gaussian mixtures
  2.4 Parametric recombination methods
  2.5 Nonparametric methods
  2.6 Semiparametric methods
  2.7 Conclusion
3 Stochastic gradient methods
  3.1 Introduction
  3.2 Stochastic optimization
  3.3 Hamiltonian Monte Carlo
  3.4 Stochastic gradient Langevin Monte Carlo
  3.5 Stochastic gradient Hamiltonian Monte Carlo
  3.6 Conclusion
4 Simulation study
  4.1 Introduction
  4.2 Batch methods
  4.3 Stochastic gradient methods
  4.4 Conclusion
5 Future Work
  5.1 Introduction
  5.2 Further comparison of batch methods
  5.3 Tuning guidance for stochastic gradient methods
  5.4 Using batch methods to analyse complex hierarchical models
1 Introduction

As the amount of data stored by individuals and organisations grows, statistical models have advanced in complexity and size. Historically, much statistical methodology focused on fitting models with limited data. Now we face the opposite problem: we have so much data that traditional statistical methods struggle to cope and run exceptionally slowly. This has led to a rapidly evolving area of statistics and machine learning that develops algorithms which remain scalable as the size of the data increases. The size of data generally refers to one of two things: the dimensionality of the data or the number of observations. In this report we focus on methods designed to scale well as the number of observations increases. Data with a large number of observations is often referred to as tall data.

Currently, large scale machine learning models are trained mainly using optimization methods such as stochastic optimization. These algorithms are favoured for their speed: they train models quickly even when a huge number of observations is available, since at each iteration they use only a subset of the data. The downside is that these methods only find local maxima of the posterior distribution, meaning they produce only a point estimate, which can lead to overfitting. A key appeal of Bayesian methods is that they produce a whole distribution of possible parameter values, which allows uncertainty to be quantified, reducing the risk of overfitting. While parameter uncertainty can be approximated from the output of stochastic optimization, for complex models this approximation can be very poor. Generally the Bayesian posterior distribution is simulated from using statistical algorithms known as Markov chain Monte Carlo (MCMC). The problem is that these algorithms require calculations over the whole dataset at each iteration, making them slow for large datasets. Therefore the next generation of MCMC algorithms, which scale to large datasets, needs to be developed.

1.1 An overview of methods

We begin this section with a more formal statement of the problem. Suppose we wish to train a model with probability density $p(x \mid \theta)$, where $\theta$ is an unknown parameter vector and $x$ is the model data. Let the likelihood of the model be $p(x \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$ and the prior for the parameter be $p(\theta)$. Our interest is in the posterior $p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta)$, which quantifies the most likely values of $\theta$ given the data $x$.

Commonly we simulate from the posterior using the Metropolis-Hastings (MH) algorithm, arguably the most popular MCMC algorithm. At each iteration, given a current state $\theta$, the algorithm proposes some new state $\theta'$ from some proposal $q(\cdot)$. This new state is then accepted as part of the sample with probability

$$\alpha = \frac{q(\theta)\, p(\theta' \mid x)}{q(\theta')\, p(\theta \mid x)} = \frac{q(\theta)\, p(x \mid \theta')\, p(\theta')}{q(\theta')\, p(x \mid \theta)\, p(\theta)}.$$

Notice that at each iteration, the MH algorithm requires calculation of the likelihood at the new state $\theta'$. This requires a computation over the whole dataset, which is infeasibly slow when $N$ is large. This is the key bottleneck in Metropolis-Hastings, and other MCMC algorithms, when they are used with large datasets.
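To make the bottleneck concrete, here is a minimal sketch of a random-walk MH sampler (the proposal is symmetric, so the $q$ terms cancel in $\alpha$); `log_prior` and `log_lik` are hypothetical user-supplied model functions. The full-data likelihood evaluation on every iteration is the $O(N)$ cost discussed above.

```python
import numpy as np

def metropolis_hastings(log_prior, log_lik, x, theta0, prop_sd, T):
    """Random-walk Metropolis-Hastings. log_lik(x, theta) is evaluated over
    the FULL dataset x at every iteration -- the O(N) bottleneck."""
    theta = np.atleast_1d(np.array(theta0, dtype=float))
    lp = log_prior(theta) + log_lik(x, theta)
    samples = []
    for _ in range(T):
        prop = theta + prop_sd * np.random.standard_normal(theta.shape)
        lp_prop = log_prior(prop) + log_lik(x, prop)  # full O(N) pass
        if np.log(np.random.rand()) < lp_prop - lp:   # accept/reject
            theta, lp = prop, lp_prop
        samples.append(theta.copy())
    return np.array(samples)
```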
A number of solutions have been proposed for this problem, and they can generally be divided into three categories, which we refer to as batch methods, stochastic gradient methods and subsampling methods.

Batch methods aim to make use of recent hardware developments which make the parallelisation of computational work more accessible. They split the dataset $x$ into disjoint batches $x_{B_1}, \dots, x_{B_S}$. The structure of the posterior allows separate MCMC algorithms to be run on these batches in parallel in order to simulate from each subposterior $p(\theta \mid x_{B_s}) \propto p(\theta)^{1/S}\, p(x_{B_s} \mid \theta)$. These simulations must then be combined in order to generate a sample which approximates the full posterior $p(\theta \mid x)$. This is where the main challenge lies.

Stochastic gradient methods make use of sophisticated proposals that have been suggested for MCMC. These proposals use gradients of the log posterior in order to suggest new states which have very high acceptance rates. When the free constants of these proposals are tuned in a certain way, the acceptance rates can be so high that we can discard the acceptance step altogether and still sample from a good approximation to the posterior. However the gradient calculation still requires a computation over the whole dataset, so the gradients of the log posterior are estimated using only a subsample of the data, which introduces extra noise.

Subsampling methods keep the MCMC algorithm largely as it is, but use only a subset of the data in the acceptance step at each iteration. Certain methods allow this to be done while still sampling from the true posterior distribution; however, this advantage often comes at the cost of poor mixing. Other methods achieve the result by introducing controlled biases, and these methods often mix better.

1.2 Report outline

This report provides a review of the batch methods and stochastic gradient methods outlined in Section 1.1. The reviewed methods are then implemented and compared under a variety of scenarios. In Section 2 we discuss batch methods, including parametric contributions by Scott et al. (2013) and Neiswanger et al. (2013), nonparametric and semiparametric methods introduced by Neiswanger et al. (2013), as well as more recent developments. Section 3 reviews stochastic gradient methods, including the stochastic gradient Langevin dynamics (SGLD) algorithm of Welling and Teh (2011) and the stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm of Chen et al. (2014). We also consider the stochastic optimization methods currently employed to train models which rely on large datasets, and provide an introduction to Hamiltonian Monte Carlo, which is used to produce proposals for the SGHMC algorithm. Finally we examine the literature which provides further theoretical results for the algorithms, as well as proposed improvements. In Section 4 the algorithms reviewed in the report are compared; code for the implementations is available on GitHub. A relatively simple model, a multivariate t-distribution, is used for comparison; therefore, in order to really test the methods, the number of observations is kept small. First the effect of bandwidth choice for the nonparametric and semiparametric methods is investigated. The effect on performance of the number of observations and of the dimensionality of the target is then compared for all the methods.
The batch size for the batch methods, and the subsample size for the stochastic gradient methods, are considered too.

2 Batch methods

2.1 Introduction

In order to speed up MCMC, it is natural to consider parallelisation. Advances in hardware allow many jobs to be run in parallel over separate cores, and these advances have been used to speed up many other computationally intensive algorithms. Parallelising MCMC has proven difficult, however, since MCMC is inherently sequential in nature, while effective parallelisation requires minimal communication between machines. A natural way to parallelise MCMC is to split the data into different subsets; MCMC for each subset is then run separately on different machines. The main problem is then how to recombine the MCMC samples from each subset while ensuring the final sample is as close as possible to the true posterior. In this section, we discuss parametric and nonparametric methods suggested to do this.

2.2 Splitting the data

Suppose we have $N$ i.i.d. data points $x$. We wish to investigate a model with probability density $p(x \mid \theta)$, where $\theta$ is an unknown parameter vector. Let the likelihood be $p(x \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$ and the prior we assign to $\theta$ be $p(\theta)$. Then the full posterior for the data is given by

$$p(\theta \mid x) \propto p(\theta)\, p(x \mid \theta). \qquad (2.1)$$

Let $B_1, \dots, B_S$ be a partition of $\{1, \dots, N\}$, and let $x_{B_s} = \{x_i : i \in B_s\}$ be the corresponding set of data points. We refer to $x_{B_s}$ as the $s$th batch of data. We can rewrite (2.1) as

$$p(\theta \mid x) \propto p(\theta) \prod_{s=1}^{S} p(x_{B_s} \mid \theta) = \prod_{s=1}^{S} p(\theta)^{1/S}\, p(x_{B_s} \mid \theta).$$

For brevity we will write the $S$ batches of data as $x_1, \dots, x_S$ from now on. Let us define the subposterior $p(\theta \mid x_s)$ by

$$p(\theta \mid x_s) \propto p(\theta)^{1/S}\, p(x_s \mid \theta).$$

Therefore we have that $p(\theta \mid x) \propto \prod_{s=1}^{S} p(\theta \mid x_s)$. The idea of batch methods for big data is to run MCMC separately to sample from each subposterior. These samples are then combined in some way so that the final sample follows the full posterior $p(\theta \mid x)$ as closely as possible.
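As a small illustration, the following sketch splits a dataset into $S$ batches and evaluates a subposterior log-density; `log_prior` and `log_lik` are hypothetical user-supplied functions, and `x` is assumed to be a NumPy array. Each batch could then be handed to an ordinary MCMC sampler on a separate machine.

```python
import numpy as np

def make_batches(x, S):
    """Randomly partition the data into S disjoint batches."""
    idx = np.random.permutation(len(x))
    return [x[i] for i in np.array_split(idx, S)]

def log_subposterior(theta, x_s, S, log_prior, log_lik):
    """log p(theta | x_s) up to a constant: (1/S) log p(theta) + log p(x_s | theta)."""
    return log_prior(theta) / S + log_lik(x_s, theta)
```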
2.3 Efficiently sampling from products of Gaussian mixtures

Before we outline recombination methods in more detail, we discuss certain important properties of the multivariate Normal distribution which will prove useful later.

Suppose we have $S$ multivariate Normal densities $\mathcal{N}(\theta \mid \mu_s, \Sigma_s)$ for $s \in \{1, \dots, S\}$. Wu (2004) shows that their product can be written, up to a constant of proportionality, as

$$\prod_{s=1}^{S} \mathcal{N}(\theta \mid \mu_s, \Sigma_s) \propto \mathcal{N}(\theta \mid \mu, \Sigma), \quad \text{where} \quad \Sigma = \left( \sum_{s=1}^{S} \Sigma_s^{-1} \right)^{-1}, \quad \mu = \Sigma \left( \sum_{s=1}^{S} \Sigma_s^{-1} \mu_s \right). \qquad (2.2)$$

Now suppose we have a set of $S$ Gaussian mixtures $\{p_s(\theta)\}_{s=1}^{S}$,

$$p_s(\theta) = \sum_{m=1}^{M} \omega_{m,s}\, \mathcal{N}(\theta \mid \mu_{m,s}, \Sigma_s),$$

where the $\omega_{m,s}$ denote the mixture weights. For simplicity we assume that the number of components in each mixture is the same and that each Gaussian component in a given mixture shares a common variance which is diagonal. We wish to sample from the product of these Gaussian mixtures,

$$p(\theta) \propto \prod_{s=1}^{S} p_s(\theta). \qquad (2.3)$$

It can be shown using induction that

$$\prod_{s=1}^{S} \sum_{m=1}^{M} \omega_{m,s}\, \mathcal{N}(\theta \mid \mu_{m,s}, \Sigma_s) = \sum_{l_1} \cdots \sum_{l_S} \prod_{s=1}^{S} \omega_{l_s,s}\, \mathcal{N}(\theta \mid \mu_{l_s,s}, \Sigma_s),$$

where we label each component of the sum using $L = (l_1, \dots, l_S)$, with $l_s \in \{1, \dots, M\}$. It follows from this, and the results above about products of Gaussians, that (2.3) is equivalent to a Gaussian mixture with $M^S$ mixture components. Therefore sampling from this product can be performed exactly in two steps: first we sample one of the $M^S$ components of the mixture according to its weight, then we draw a sample from the corresponding Gaussian component (Ihler et al., 2004). The parameters of the $L$th Gaussian component can be calculated using (2.2) and are given by

$$\Sigma_L = \left( \sum_{s=1}^{S} \Sigma_s^{-1} \right)^{-1}, \qquad \mu_L = \Sigma_L \left( \sum_{s=1}^{S} \Sigma_s^{-1} \mu_{l_s,s} \right).$$

The unnormalised weight of the $L$th mixture component is given by

$$\omega_L \propto \frac{\prod_{s=1}^{S} \omega_{l_s,s}\, \mathcal{N}(\theta \mid \mu_{l_s,s}, \Sigma_s)}{\mathcal{N}(\theta \mid \mu_L, \Sigma_L)}$$

(Ihler et al., 2004).
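For intuition, here is a sketch of the exact two-step sampler under the assumptions above (a covariance $\Sigma$ shared within and, for simplicity here, across mixtures). It enumerates all $M^S$ components, so it is only feasible for small $M$ and $S$, anticipating the cost discussion that follows. The weights are evaluated at $\theta = \mu_L$, where the denominator $\mathcal{N}(\mu_L \mid \mu_L, \Sigma_L)$ is the same for every $L$ and so can be dropped.

```python
import itertools
import numpy as np
from scipy.stats import multivariate_normal

def sample_mixture_product(mus, weights, Sigma, n_draws):
    """Exact sampling from prod_s p_s(theta): enumerate all M**S components.
    mus: (S, M, d) component means; weights: (S, M); Sigma: shared (d, d)."""
    S, M, d = mus.shape
    P = np.linalg.inv(Sigma)
    Sigma_L = np.linalg.inv(S * P)         # product covariance, common to all L
    labels = list(itertools.product(range(M), repeat=S))
    mu_Ls, log_ws = [], []
    for L in labels:
        comp = mus[np.arange(S), list(L)]  # one selected mean per batch
        mu_L = Sigma_L @ (P @ comp.sum(axis=0))
        # unnormalised log weight, evaluated at theta = mu_L
        # (the constant denominator N(mu_L | mu_L, Sigma_L) is omitted)
        log_w = sum(np.log(weights[s, L[s]]) +
                    multivariate_normal.logpdf(mu_L, comp[s], Sigma)
                    for s in range(S))
        mu_Ls.append(mu_L)
        log_ws.append(log_w)
    w = np.exp(np.array(log_ws) - max(log_ws))
    w /= w.sum()                           # normalising constant Z
    picks = np.random.choice(len(labels), size=n_draws, p=w)
    return np.array([np.random.multivariate_normal(mu_Ls[k], Sigma_L) for k in picks])
```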
In order to use this exact method we need to calculate the normalising constant for the weights, $Z = \sum_L \omega_L$. As $M$ and $S$ grow, this exact sampling method becomes computationally infeasible, since the calculation of $Z$ and the drawing of a sample from $p(\cdot)$ both take $O(M^S)$ time. This fact, along with memory requirements, means that sampling from $p(\theta)$ using the exact method quickly becomes impossible.

In cases where exact sampling from the mixture is infeasible, a number of methods have been proposed; for a review the reader is referred to Ihler et al. (2004). A common approach is in the style of Gibbs sampling. At each iteration, $S - 1$ of the labels are fixed, while one label, call it $l_j$, is sampled from the corresponding conditional density $p(l_j \mid l_{-j})$, where $l_{-j}$ denotes $\{l_i : i \in \{1, \dots, S\},\ i \neq j\}$. After a fixed number of new label values have been drawn, a sample is drawn from the mixture component indicated by the current label values. While this approach often produces good results, it can require a large number of samples before it accurately represents the true mixture density, due to multimodality. A number of suggestions have been made to improve this standard Gibbs sampling approach, for example using multiscale sampling (Ihler et al., 2004) and parallel tempering (Rudoy and Wolfe, 2007).

2.4 Parametric recombination methods

There are a number of methods proposed to recombine subposterior samples which exactly target the full posterior $p(\theta \mid x)$ when it is Normally distributed. We refer to these methods as parametric. Intuition for why this assumption might be valid for a large class of models comes from the Bernstein-von Mises theorem (Le Cam, 2012), which is a central limit theorem for Bayesian statistics. Assuming suitable regularity conditions, and that the data is realised from a unique true parameter value $\theta_0$, the theorem states that the posterior for the data tends to a Normal distribution centred around $\theta_0$. In particular, for large $N$ the posterior is well approximated by $\mathcal{N}(\theta_0, I^{-1}(\theta_0))$, where $I(\theta)$ is Fisher's information matrix. Since we are aiming to efficiently sample from models with large amounts of data, this approximation appears to be particularly relevant.

Neiswanger et al. (2013) propose to combine samples by approximating each subposterior using a Normal distribution, and then using the results for products of Gaussians in order to combine these approximations. Let $\hat\mu_s$ and $\hat\Sigma_s$ denote the sample mean and sample variance of the MCMC output for batch $s$. Then we can approximate the distribution of each subposterior by $\mathcal{N}(\hat\mu_s, \hat\Sigma_s)$. Using (2.2), the full posterior can be estimated by simply multiplying these subposterior estimates together. It follows that the estimate will be multivariate Gaussian with mean $\hat\mu$ and variance $\hat\Sigma$ given by

$$\hat\Sigma = \left( \sum_{s=1}^{S} \hat\Sigma_s^{-1} \right)^{-1}, \qquad \hat\mu = \hat\Sigma \left( \sum_{s=1}^{S} \hat\Sigma_s^{-1} \hat\mu_s \right). \qquad (2.4)$$

Scott et al. (2013) propose a similar method, where samples are combined using averaging. Their method is known as consensus Monte Carlo. Denote the $j$th sample from subposterior $s$ by $\theta_{sj}$, and suppose each subposterior is assigned a weight $W_s$ (a matrix in the multivariate case). The $j$th draw $\hat\theta_j$ from the consensus approximation to the full posterior is given by

$$\hat\theta_j = \left( \sum_{s=1}^{S} W_s \right)^{-1} \sum_{s=1}^{S} W_s \theta_{sj}.$$
When each subposterior is Normal, the full posterior is also Normal, and when we set the weights to be $W_s = \mathrm{Var}(\theta \mid x_s)^{-1}$, the $\hat\theta_j$ are exact draws from the full posterior. The idea is that even when the subposteriors are non-Gaussian, the draws $\hat\theta_j$ will still be a close approximation to the posterior. In practice, Scott et al. (2013) suggest using the inverse of the sample variance of each batch as the weight, motivated by the exact results in the Normal case.

Key advantages of the two approximations outlined above are that they are fast, and that they are accurate when models are close to Gaussian. However, they only target the full posterior exactly if either each subposterior is Normally distributed, or the size of each batch tends to infinity. Therefore the methods' performance on non-Gaussian targets should be explored, especially multi-modal ones, since the methods may conceivably struggle in these cases.

Rabinovich et al. (2015) suggest extending the consensus Monte Carlo algorithm of Scott et al. (2013) by relaxing the restriction of aggregation using averaging. Suppose we pick a draw from each subposterior, $\theta_1, \dots, \theta_S$, and let us refer to the function used to aggregate these draws as $F(\theta_1, \dots, \theta_S)$; in the case of consensus Monte Carlo we have $F(\theta_1, \dots, \theta_S) = ( \sum_{s=1}^{S} W_s )^{-1} \sum_{s=1}^{S} W_s \theta_s$. Rabinovich et al. (2015) suggest trying to adaptively choose the best aggregation function $F(\cdot)$, motivated by the fact that the averaging function used in Scott et al. (2013) is only known to be exact in the case of Gaussian posteriors. In order to adaptively choose $F(\cdot)$, Rabinovich et al. (2015) use variational Bayes. However, the method requires the introduction of an optimization step, and it would be interesting to investigate the relative improvement in the approximation from using the method versus the increase in computation time.
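A sketch of both recombination rules is given below; `sub_samples` is a hypothetical list of per-batch MCMC outputs, each an (M, d) array with d >= 2, and the weights follow the inverse-sample-variance suggestion above.

```python
import numpy as np

def parametric_combine(sub_samples):
    """Gaussian recombination, eq. (2.4): fit N(mu_s, Sigma_s) to each batch
    and multiply the approximations together."""
    precisions = [np.linalg.inv(np.cov(s, rowvar=False)) for s in sub_samples]
    Sigma_hat = np.linalg.inv(sum(precisions))
    mu_hat = Sigma_hat @ sum(P @ s.mean(axis=0)
                             for P, s in zip(precisions, sub_samples))
    return mu_hat, Sigma_hat

def consensus_combine(sub_samples):
    """Consensus Monte Carlo: weighted average of the j-th draws, with
    W_s the inverse sample covariance of batch s."""
    Ws = [np.linalg.inv(np.cov(s, rowvar=False)) for s in sub_samples]
    W_total_inv = np.linalg.inv(sum(Ws))
    M = len(sub_samples[0])
    return np.array([W_total_inv @ sum(W @ s[j]
                     for W, s in zip(Ws, sub_samples)) for j in range(M)])
```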
2.5 Nonparametric methods

While the methods outlined above work relatively well when subposteriors are approximately Gaussian, it is not clear how they behave when models are far from Gaussian, or when batch sizes are small. Neiswanger et al. (2013) therefore suggest an alternative method based on kernel density estimation, which can be shown to target the full posterior asymptotically as the number of samples drawn from each subposterior tends to infinity.

Let $x_1, \dots, x_N$ be a sample from a distribution of dimension $d$ with density $f$. Kernel density estimation provides an estimate $\hat f$ of the density. The kernel density estimate of $f$ at a point $x$ is

$$\hat f(x) = \frac{1}{N} \sum_{i=1}^{N} K_H(x - x_i),$$

where $H$ is a $d \times d$ symmetric, positive-definite matrix known as the bandwidth, and $K$ is the unscaled kernel, which is a symmetric, $d$-dimensional density. $K_H$ is related to $K$ by $K_H(x) = |H|^{-1/2} K(H^{-1/2} x)$. Commonly the kernel function $K$ is chosen to be Gaussian, since this leads to smooth density estimates and simplifies mathematical analysis (Duong, 2004). The bandwidth is an important factor in determining the accuracy of a kernel density estimate, as it controls the smoothing of the estimate.

Suppose we have a sample $\{\theta_{m,s}\}_{m=1}^{M}$ from each subposterior $s \in \{1, \dots, S\}$. Neiswanger et al. (2013) suggest approximating each subposterior using a kernel density estimate with Gaussian kernel and diagonal bandwidth matrix $h^2 I$, where $I$ is the $d$-dimensional identity matrix. Denoting this estimate by $\hat p_s(\theta)$, we can write it as

$$\hat p_s(\theta) = \frac{1}{M} \sum_{m=1}^{M} \mathcal{N}(\theta \mid \theta_{m,s}, h^2 I),$$

where $\mathcal{N}(\cdot \mid \theta_{m,s}, h^2 I)$ denotes a $d$-dimensional Gaussian density with mean $\theta_{m,s}$ and variance $h^2 I$. The estimate for the full posterior, $\hat p(\theta \mid x)$, is then defined to be the product of the estimates for each batch:

$$\hat p(\theta \mid x) = \prod_{s=1}^{S} \hat p_s(\theta) = \frac{1}{M^S} \prod_{s=1}^{S} \sum_{m=1}^{M} \mathcal{N}(\theta \mid \theta_{m,s}, h^2 I). \qquad (2.5)$$

Therefore the estimate for the full posterior becomes a product of Gaussian mixtures, as discussed in Section 2.3. By introducing a similar labelling system $L = (l_1, \dots, l_S)$ with $l_s \in \{1, \dots, M\}$, we can again derive an explicit expression for the resulting mixture. While Neiswanger et al. (2013) use a common variance $h^2 I$ for each kernel, we suggest it might be better to use a general diagonal matrix $\Lambda$, since different parameters may differ considerably in variance. In either case, assuming a common, diagonal variance $\Lambda$ across the kernel estimates for each batch, the weights in the product (2.5) simplify to

$$\omega_L \propto \prod_{s=1}^{S} \mathcal{N}(\theta_{l_s,s} \mid \bar\theta_L, \Lambda), \qquad \bar\theta_L = \frac{1}{S} \sum_{s=1}^{S} \theta_{l_s,s}. \qquad (2.6)$$

The $L$th component of the mixture simplifies to $\mathcal{N}(\theta \mid \bar\theta_L, \Lambda/S)$.

Given that this method is designed for use with large datasets, the number of components of the resulting Gaussian mixture will be very large, so efficiently sampling from it is an important issue to consider. Neiswanger et al. (2013) recommend sampling from the full posterior estimate using a method similar to the Gibbs sampling approach outlined in Section 2.3. In order to avoid calculating the conditional distribution of the weights, however, they use a Metropolis-within-Gibbs approach as follows. Holding all labels except the current one, $l_s$, fixed, we randomly sample a new value for $l_s$, and accept this new label with probability given by the ratio of the corresponding weights. The full algorithm is detailed in Algorithm 1.
Algorithm 1: Combining batches using kernel density estimation.

Data: Samples from each subposterior $s \in \{1, \dots, S\}$, $\{\theta_{m,s}\}_{m=1}^{M}$.
Result: Sample from an estimate of the full posterior $p(\theta \mid x)$.
Draw an initial label $L$ by simulating $l_s \sim \mathrm{Unif}(\{1, \dots, M\})$ for $s \in \{1, \dots, S\}$.
for $i = 1$ to $T$ do
    $h \leftarrow h(i)$
    for $s = 1$ to $S$ do
        Create a new label $C := (c_1, \dots, c_S)$ and set $C \leftarrow L$
        Draw a new value for index $s$ in $C$: $c_s \sim \mathrm{Unif}(\{1, \dots, M\})$
        Simulate $u \sim \mathrm{Unif}(0, 1)$
        if $u < \omega_C / \omega_L$ then $L \leftarrow C$
    end
    Simulate $\theta_i \sim \mathcal{N}(\bar\theta_L, \frac{h^2}{S} I)$
end

Notice that in the algorithm, $h$ is changed as a function of the iteration $i$. In particular, Neiswanger et al. (2013) specify the function $h(i) = i^{-1/(4+d)}$. This causes the bandwidth to decrease at each iteration and is referred to as annealing. The properties of annealing are investigated further in Section 4. In their paper, Neiswanger et al. (2013) assume that the number of iterations is the same as the size of the sample from each subposterior. However, this is not necessary; in fact, when we are trying to sample from a mixture with a large number of components, we may need to simulate more times than this in order to ensure the sample accurately represents the true KDE approximation.

While this algorithm may improve results as models move away from Gaussianity, kernel density estimation is known to perform poorly in high dimensions, so the algorithm will deteriorate as the dimensionality of $\theta$ increases. The algorithm also suffers from the curse of dimensionality in the number of batches and the size of the MCMC sample simulated from each subposterior, which suggests that as the number of batches increases, the accuracy and mixing of the algorithm will be affected. The algorithm requires the user to choose a bandwidth; the sensitivity of the algorithm to different bandwidth choices would therefore be interesting to investigate. In the original paper, Neiswanger et al. (2013) suggest using a Gaussian kernel with bandwidth $h^2 I$. However, as mentioned earlier, different parameters may have different variances, and the algorithm would probably perform better using a more general diagonal matrix $\Lambda$, especially as this does not particularly increase the complexity of the algorithm. Using a common bandwidth parameter across batches eases computation, but it may negatively affect the performance of the algorithm; note that when discussing products of Gaussian mixtures in Section 2.3, the variances across different mixtures did not need to be assumed common. Therefore further improvements might be made by varying bandwidths across batches, though this would increase computational expense. Finally, improvements could be gained by using more sophisticated methods to sample from the product of kernel density estimates (Ihler et al., 2004; Rudoy and Wolfe, 2007).
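A compact Python sketch of Algorithm 1 is given below; `sub_samples` is a hypothetical (S, M, d) array of subposterior draws, and the weight ratio follows (2.6) with $\Lambda = h^2 I$ (the normalising constants cancel in the ratio).

```python
import numpy as np

def unnorm_log_weight(sub_samples, labels, h):
    """log of the unnormalised weight in (2.6) with Lambda = h^2 I."""
    S = sub_samples.shape[0]
    pts = sub_samples[np.arange(S), labels]   # selected draw from each batch
    theta_bar = pts.mean(axis=0)
    return -0.5 * ((pts - theta_bar) ** 2).sum() / h ** 2

def combine_kde(sub_samples, T):
    """Metropolis-within-Gibbs over mixture labels (Algorithm 1)."""
    S, M, d = sub_samples.shape
    labels = np.random.randint(0, M, size=S)
    out = np.empty((T, d))
    for i in range(T):
        h = (i + 1.0) ** (-1.0 / (4 + d))     # annealed bandwidth h(i)
        for s in range(S):
            cand = labels.copy()
            cand[s] = np.random.randint(M)
            log_ratio = (unnorm_log_weight(sub_samples, cand, h)
                         - unnorm_log_weight(sub_samples, labels, h))
            if np.log(np.random.rand()) < log_ratio:
                labels = cand
        theta_bar = sub_samples[np.arange(S), labels].mean(axis=0)
        out[i] = theta_bar + (h / np.sqrt(S)) * np.random.standard_normal(d)
    return out
```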
A number of developments have been proposed for Algorithm 1. Wang and Dunson (2013) note that the algorithm performs poorly when samples from each subposterior do not overlap. In order to improve this, they suggest smoothing each subposterior using a Weierstrass transform, which simply takes the convolution of the density with a Gaussian function. The transformed function can be seen as a smoothed version of the original, which tends to increase the overlap between subposteriors. They then approximate the full posterior as a product of the Weierstrass transforms of the subposteriors. However, since in general the approximation to each subposterior will be empirical, its Weierstrass transform corresponds to a kernel density estimator. Therefore this method is, for all intents and purposes, the same as the original algorithm by Neiswanger et al. (2013), and so still suffers from many of the same problems. An alternative way to improve the overlap between the supports of the subposteriors is to use heavier tailed kernels in the kernel density estimation. Implementing this, however, will require some work in order to be able to sample from the resulting product of mixtures, since the nice properties available for products of Gaussians may not hold for these heavier tailed distributions, and alternative methods for sampling will need to be developed.

Rather than using kernel density estimation, Wang et al. (2015) use space partitioning methods to partition the space into disjoint subsets, and produce counts of the number of points contained in each of these subsets. This gives an estimate of each subposterior akin to a multi-dimensional histogram. An estimate of the full posterior can then be made by multiplying the subposterior estimates together and normalizing. This algorithm helps solve the explosion of mixture components that affects Algorithm 1. Despite this, the algorithm will still suffer when the supports of the subposteriors do not overlap; moreover, it is more complicated to implement and will be affected by the choice of partitioning used.

Alternatively, there have been suggestions to introduce suitable metrics which allow summaries of a set of probability measures to be defined; batches can then be recombined in terms of these summaries. For example, Minsker et al. (2014) use a metric known as the Wasserstein distance in order to define the median posterior of a set of subposteriors. Similarly, Srivastava et al. (2015) use the Wasserstein distance to calculate a summary of the subposteriors known as the barycenter, which allows them to produce an estimate of the full posterior that they refer to as the Wasserstein posterior, or WASP. However, the statistical properties of these summaries are unclear and need to be investigated further.

2.6 Semiparametric methods

In order to account for the fact that the nonparametric method of Algorithm 1 is slow to converge, Neiswanger et al. (2013) suggest producing a semiparametric estimator (Hjort and Glad, 1995) of each subposterior. This estimator combines the parametric estimator characterised by (2.4) and the nonparametric estimator detailed by Algorithm 1. More specifically, each subposterior is estimated by (Hjort and Glad, 1995)

$$\hat p_s(\theta) = \hat f_s(\theta)\, \hat r(\theta),$$
where $\hat f_s(\theta) = \mathcal{N}(\theta \mid \hat\mu_s, \hat\Sigma_s)$ and $\hat r(\theta)$ is a nonparametric estimator of the correction function $r(\theta) = p_s(\theta) / \hat f_s(\theta)$. Assuming a Gaussian kernel for $\hat r(\theta)$, Neiswanger et al. (2013) write down an explicit expression for $\hat p_s(\theta)$:

$$\hat p_s(\theta) = \frac{1}{M} \sum_{m=1}^{M} \frac{\mathcal{N}(\theta \mid \theta_{m,s}, h^2 I)\, \mathcal{N}(\theta \mid \hat\mu_s, \hat\Sigma_s)}{\hat f_s(\theta_{m,s})} = \frac{1}{M} \sum_{m=1}^{M} \frac{\mathcal{N}(\theta \mid \theta_{m,s}, h^2 I)\, \mathcal{N}(\theta \mid \hat\mu_s, \hat\Sigma_s)}{\mathcal{N}(\theta_{m,s} \mid \hat\mu_s, \hat\Sigma_s)}.$$

Similarly to the nonparametric method, we can produce an estimate for the full posterior $\hat p(\theta \mid x)$ as the product of the estimates for each subposterior. Once again this results in a mixture of Gaussians with $M^S$ components. Using the label $L = (l_1, \dots, l_S)$, the $L$th mixture weight $W_L$ and component $c_L$ are given by

$$W_L \propto \frac{\omega_L\, \mathcal{N}(\bar\theta_L \mid \hat\mu, \hat\Sigma + \frac{h^2}{S} I)}{\prod_{s=1}^{S} \mathcal{N}(\theta_{l_s,s} \mid \hat\mu_s, \hat\Sigma_s)}, \qquad c_L = \mathcal{N}(\theta \mid \mu_L, \Sigma_L),$$

where $\omega_L$ and $\bar\theta_L$ are as defined in (2.6), and the parameters of the mixture component are

$$\Sigma_L = \left( \frac{S}{h^2} I + \hat\Sigma^{-1} \right)^{-1}, \qquad \mu_L = \Sigma_L \left( \frac{S}{h^2} I\, \bar\theta_L + \hat\Sigma^{-1} \hat\mu \right),$$

where $\hat\Sigma$ and $\hat\mu$ are as defined in (2.4). Sampling from this mixture can be performed using Algorithm 1, replacing weights and parameters where appropriate.

As $h \to 0$, the semiparametric component parameters $\Sigma_L$ and $\mu_L$ approach the corresponding nonparametric component parameters. This motivates Neiswanger et al. (2013) to suggest an alternative semiparametric algorithm in which the nonparametric component weights $\omega_L$ are used instead of $W_L$. Their reasoning is that the resulting algorithm may have a higher acceptance probability, and it is still asymptotically exact as the batch size tends to infinity.

As in Section 2.5, a bandwidth matrix with identical diagonal elements, $h^2 I$, will not necessarily be the best choice if different dimensions of the parameter have different scales or variances. However, the algorithm can easily be extended to use a diagonal bandwidth matrix $\Lambda$, in a similar way to the nonparametric method. While this method may solve the problem that the nonparametric method is slow to converge in high dimensions, the performance of the algorithm is not well understood; for example, it is unclear how the parametric term affects performance as models move away from Gaussianity. Moreover, the method still suffers from the curse of dimensionality in the number of mixture components, and it will also be affected by the choice of bandwidth.

2.7 Conclusion

In this section we outlined batch methods. Batch methods split a large dataset into smaller subsets, run MCMC on these subsets in parallel, and then combine the MCMC output to obtain an approximation to the full posterior. A couple of methods appealed to the Bernstein-von Mises theorem in order to approximate each subposterior by a Normal distribution.
The resulting approximation to the full posterior could then be found using standard results for products of Gaussians. However, these methods are only exact if each subposterior is Normal, or in the limit as the number of observations in each batch tends to infinity. The performance of the methods when these assumptions are violated needs to be investigated. Alternative methods used kernel density estimation, or a combination of a Normal estimate and a kernel density estimate, to approximate each subposterior. These estimates could then be combined using results for products of mixtures of Gaussians. However, the resulting approximation was a mixture with $M^S$ components, which is difficult to sample from efficiently; moreover, kernel density estimation is known to deteriorate as dimensionality increases, and it requires the choice of a bandwidth. To conclude, each of the batch methods has either undesirable qualities or properties which are not well understood. These issues need reviewing before the methods can be used with confidence in practice. Batch methods are particularly suited to models which exhibit structure, for example hierarchical models.

3 Stochastic gradient methods

3.1 Introduction

The methods currently employed in large scale machine learning are generally optimization based. One method employed frequently in training machine learning models is stochastic optimization (Robbins and Monro, 1951). This method optimizes a likelihood (or posterior) function in a similar way to traditional gradient ascent; the key difference is that at each iteration only a subset of the data is used, rather than the whole dataset. While the method produces impressive results at low computational cost, it has a number of downsides. Parameter uncertainty is not captured, since the method only produces a point estimate; though uncertainty can be estimated using a Normal approximation, for more complex models this estimate may be poor. This means models fitted using stochastic optimization can suffer from overfitting. Moreover, since the method does not sample from the posterior as in traditional MCMC, the algorithm can get stuck in local maxima.

The methods outlined in this section aim to combine the subsampling approach of stochastic optimization with posterior sampling, which helps capture the uncertainty in parameter estimates. The section begins by outlining stochastic optimization, before introducing stochastic gradient Langevin dynamics (SGLD) and stochastic gradient Hamiltonian Monte Carlo (SGHMC), the two key algorithms for big data discussed in this section. Hamiltonian Monte Carlo (HMC), a technique on which SGHMC relies extensively, is also reviewed.

3.2 Stochastic optimization

Let $x_1, \dots, x_N$ be data observed from a model with probability density function $p(x \mid \theta)$, where $\theta$ denotes an unknown parameter vector.
Assigning a prior $p(\theta)$ to $\theta$, our interest as usual is in the posterior

$$p(\theta \mid x) \propto p(\theta) \prod_{i=1}^{N} p(x_i \mid \theta),$$

where we define $p(x \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$ to be the likelihood. Stochastic optimization (Robbins and Monro, 1951) aims to find the mode $\theta^*$ of the posterior distribution, otherwise known as the MAP estimate of $\theta$. The idea of finding the mode of the posterior rather than of the likelihood is that the prior $p(\theta)$ regularizes the parameters: it acts as a penalty for model complexity, which helps prevent overfitting. At each iteration $t$, stochastic optimization takes a subset $s_t$ of the data and updates the parameters as follows (Welling and Teh, 2011):

$$\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2} \left( \nabla \log p(\theta_t) + \frac{N}{n} \sum_{x_i \in s_t} \nabla \log p(x_i \mid \theta_t) \right),$$

where $\epsilon_t$ is the stepsize at iteration $t$ and $|s_t| = n$. The idea is that over the long run the noise introduced by using a subset of the data averages out, and the algorithm tends towards standard gradient ascent on the log posterior. Clearly, when the number of observations $N$ is large, using only a subset of the data is much less computationally expensive; this is a key advantage of stochastic optimization. Provided that

$$\sum_{t=1}^{\infty} \epsilon_t = \infty, \qquad \sum_{t=1}^{\infty} \epsilon_t^2 < \infty, \qquad (3.1)$$

and $p(\theta \mid x)$ satisfies certain technical conditions, the algorithm is guaranteed to converge to a local maximum.

A common extension of stochastic optimization, which will be needed later, is stochastic optimization with momentum. This is commonly employed when the likelihood surface exhibits a particular structure; one example where the method is employed extensively is the training of deep neural networks. In this case we introduce a variable $\nu$, referred to as the velocity of the trajectory. The parameter updates then proceed as follows:

$$\nu_{t+1} = (1 - \alpha) \nu_t + \eta \left( \nabla \log p(\theta_t) + \frac{N}{n} \sum_{x_i \in s_t} \nabla \log p(x_i \mid \theta_t) \right), \qquad \theta_{t+1} = \theta_t + \nu_{t+1}, \qquad (3.2)$$

where $\alpha$ and $\eta$ are free parameters to be tuned.

While stochastic optimization is used frequently by large scale machine learning practitioners, it does not capture parameter uncertainty, since it only produces a point estimate of $\theta$. This means that models fit using stochastic optimization can often suffer from overfitting and require some form of regularization. One common method to provide an approximation to the true posterior is to fit a Gaussian approximation at the point estimate.
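A minimal sketch of both updates is given below; `grad_log_prior` and `grad_log_lik` are hypothetical user-supplied gradient functions, with `grad_log_lik(batch, theta)` assumed to return the gradient summed over the batch.

```python
import numpy as np

def stochastic_opt(grad_log_prior, grad_log_lik, x, theta0, eps, n, T,
                   momentum=False, alpha=0.01, eta=1e-4):
    """Stochastic optimization towards the MAP estimate, with an optional
    momentum variant as in eq. (3.2). eps is a stepsize function eps(t)."""
    N = len(x)
    theta = np.array(theta0, dtype=float)
    nu = np.zeros_like(theta)
    for t in range(T):
        batch = x[np.random.choice(N, n, replace=False)]
        grad = grad_log_prior(theta) + (N / n) * grad_log_lik(batch, theta)
        if momentum:
            nu = (1 - alpha) * nu + eta * grad   # velocity update, eq. (3.2)
            theta = theta + nu
        else:
            theta = theta + 0.5 * eps(t) * grad  # plain stochastic ascent step
    return theta
```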
Suppose $\theta_0$ is the true mode of the posterior $p(\theta \mid x)$. Then, using a Taylor expansion about $\theta_0$, we find (Bishop, 2006)

$$\log p(\theta \mid x) \approx \log p(\theta_0 \mid x) + (\theta - \theta_0)^T \nabla \log p(\theta_0 \mid x) + \frac{1}{2} (\theta - \theta_0)^T H[\log p(\theta_0 \mid x)] (\theta - \theta_0) = \log p(\theta_0 \mid x) + \frac{1}{2} (\theta - \theta_0)^T H[\log p(\theta_0 \mid x)] (\theta - \theta_0),$$

where $H[g(\cdot)]$ is the Hessian matrix of the function $g(\cdot)$, and we have used the fact that the gradient of the log posterior at $\theta_0$ is 0. Let us denote the negative Hessian by $-H[\log p(\theta \mid x)] := V^{-1}[\theta]$; then, taking the exponential of both sides, we find

$$p(\theta \mid x) \approx A \exp \left\{ -\frac{1}{2} (\theta - \theta_0)^T V^{-1}[\theta_0] (\theta - \theta_0) \right\},$$

where $A$ is some constant. This is the kernel of a Gaussian density, suggesting an approximation to the posterior of the form $\mathcal{N}(\theta^*, V[\theta^*])$, where $\theta^*$ is an estimate of the mode to be found. This is often referred to as a Laplace approximation. By the Bernstein-von Mises theorem, this approximation is expected to become increasingly accurate as the number of observations increases. However, since the approximation is based only on distributional aspects at one point, it can miss important properties of the distribution (Bishop, 2006); distributions which are multimodal, in particular, will be approximated very poorly. Therefore, while the approximation may work well for less complex distributions when plenty of data is available, it may struggle for more complex models.
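As a small illustration, assuming a user-supplied Hessian function (a hypothetical name), the Laplace approximation can be computed as follows.

```python
import numpy as np

def laplace_approx(theta_map, hessian_log_post):
    """Laplace approximation N(theta_map, V): V is the inverse of the
    negative Hessian of the log posterior, evaluated at the mode."""
    V = np.linalg.inv(-hessian_log_post(theta_map))
    return theta_map, V
```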
This motivates us to consider methods which aim to combine the scalability of stochastic optimization with the ability to account for parameter uncertainty.

3.3 Hamiltonian Monte Carlo

Hamiltonian dynamics was originally developed as an important reformulation of Newtonian dynamics, and serves as a vital tool in statistical physics. More recently, Hamiltonian dynamics has been used to produce proposals for the Metropolis-Hastings algorithm which explore the parameter space rapidly and have very high acceptance rates. The acceptance calculation in the Metropolis-Hastings algorithm is computationally intensive when a lot of data is available. However, as outlined later, by combining ideas from stochastic optimization and Hamiltonian dynamics we are able to approximately simulate from the posterior distribution without using an acceptance calculation at all. In light of this, we review Hamiltonian Monte Carlo, a method which produces efficient proposals for the Metropolis-Hastings algorithm.

3.3.1 Hamiltonian dynamics

Hamiltonian dynamics was traditionally developed to describe the motion of objects under a system of forces. In two dimensions, a common analogy used to visualise the dynamics is a frictionless puck sliding over a surface of varying height (Neal, 2010). The state of the system consists of the puck's position $\theta$ and its momentum (mass times velocity) $r$, both of which are 2-dimensional vectors. The state of the system is governed by its potential energy $U(\theta)$ and its kinetic energy $K(r)$. If the puck is moving on a flat part of the space, it has constant velocity. As the puck begins to gain height, its kinetic energy decreases and its potential energy increases as it slows; if its kinetic energy reaches zero, the puck moves back down the hill, its potential energy decreasing as its kinetic energy increases.

More formally, Hamiltonian dynamics is described by a Hamiltonian function $H(r, \theta)$, where $r$ and $\theta$ are both $d$-dimensional. The Hamiltonian determines how $r$ and $\theta$ change over time, as follows:

$$\frac{d\theta_i}{dt} = \frac{\partial H}{\partial r_i}, \qquad \frac{dr_i}{dt} = -\frac{\partial H}{\partial \theta_i}. \qquad (3.3)$$

Hamiltonian dynamics has a number of properties which are crucial for its use in constructing MCMC proposals. Firstly, Hamiltonian dynamics is reversible, meaning that the mapping from the state $(r(t), \theta(t))$ at time $t$ to the state $(r(t+s), \theta(t+s))$ at time $t+s$ is one-to-one. A second property is that the dynamics keeps the Hamiltonian invariant, or conserved. This is easily shown using (3.3):

$$\frac{dH}{dt} = \sum_{i=1}^{d} \left( \frac{d\theta_i}{dt} \frac{\partial H}{\partial \theta_i} + \frac{dr_i}{dt} \frac{\partial H}{\partial r_i} \right) = \sum_{i=1}^{d} \left( \frac{\partial H}{\partial r_i} \frac{\partial H}{\partial \theta_i} - \frac{\partial H}{\partial \theta_i} \frac{\partial H}{\partial r_i} \right) = 0.$$

In order to use Hamiltonian dynamics to simulate from a distribution, we need to translate the density function into a potential energy function and introduce artificial momentum variables to go with the position variables of interest. A Markov chain can then be simulated in which, at each iteration, we resample the momentum variables, simulate Hamiltonian dynamics for a number of steps, and then perform a Metropolis-Hastings acceptance step with the new variables obtained from the simulation. In light of this, for Hamiltonian Monte Carlo we generally define the Hamiltonian $H(r, \theta)$ to be of the form

$$H(r, \theta) = U(\theta) + K(r),$$

where $\theta$ is the vector we are simulating and the momentum vector $r$ is constructed artificially. Using the notation of Section 3.2, the potential energy is then defined to be

$$U(\theta) = -\log \left[ p(\theta) \prod_{i=1}^{N} p(x_i \mid \theta) \right] = -\log p(\theta) - \sum_{i=1}^{N} \log p(x_i \mid \theta). \qquad (3.4)$$

The kinetic energy is defined as

$$K(r) = \frac{1}{2} r^T M^{-1} r, \qquad (3.5)$$

where $M$ is a symmetric, positive definite mass matrix.
3.3.2 Using Hamiltonian dynamics in MCMC

In order to relate the potential and kinetic energy functions to the distribution of interest, we can use the concept of a canonical distribution. Given some energy function $E(x)$ defined over the states of $x$, the canonical distribution over the states of $x$ is defined to be

$$P(x) = \frac{1}{Z} \exp\{-E(x) / (k_B T)\}, \qquad (3.6)$$

where $Z$ is a normalizing constant, $k_B$ is Boltzmann's constant, and $T$ is the temperature of the system. The Hamiltonian is an energy function defined over the joint state of $r$ and $\theta$, so we can write down the joint distribution defined by the function as

$$P(r, \theta) \propto \exp\{-H(r, \theta) / (k_B T)\}.$$

If we now assume the Hamiltonian is of the form described by (3.4) and (3.5), and that $k_B T = 1$, then we find that

$$P(r, \theta) \propto \exp\{-U(\theta)\} \exp\{-K(r)\} \propto p(\theta \mid x)\, \mathcal{N}(r \mid 0, M),$$

so that the distributions of $r$ and $\theta$ defined by the Hamiltonian are independent, and the marginal distribution of $\theta$ is its posterior distribution. This relationship enables us to describe Hamiltonian Monte Carlo (HMC), which can be used to simulate from continuous distributions whose density can be evaluated up to a normalizing constant. A requirement of HMC is that we can calculate the derivatives of the log of the target density. HMC samples from the joint distribution of $(\theta, r)$; by discarding the samples for $r$ we obtain a sample from the posterior $p(\theta \mid x)$. Generally we choose the components $r_i$ of $r$ to be independent, each with variance $m_i$. This allows us to write the kinetic energy as

$$K(r) = \sum_{i=1}^{d} \frac{r_i^2}{2 m_i}.$$

In order to approximate Hamilton's equations computationally, we need to discretize time using a small stepsize $\epsilon$. There are a number of ways to do this, but in practice the leapfrog method often produces good results. The method works as follows:

1. $r_i(t + \epsilon/2) = r_i(t) - \frac{\epsilon}{2} \frac{\partial U}{\partial \theta_i}(\theta(t))$,
2. $\theta_i(t + \epsilon) = \theta_i(t) + \epsilon \frac{\partial K}{\partial r_i}(r(t + \epsilon/2))$,
3. $r_i(t + \epsilon) = r_i(t + \epsilon/2) - \frac{\epsilon}{2} \frac{\partial U}{\partial \theta_i}(\theta(t + \epsilon))$.

The leapfrog method has a number of desirable properties, including that it is reversible and volume preserving. A consequence is that at the acceptance step the proposal distributions cancel, so that the acceptance probability is simply a ratio of the canonical distributions at the proposed and current states. Since we must discretize the equations in order to simulate from them, the posterior $p(\theta \mid x)$ is not invariant under the approximate dynamics. This is why the acceptance step is required: it corrects for this error.
As the stepsize $\epsilon$ tends to zero, the acceptance rate of the leapfrog method tends to 1, as the approximation moves closer to true Hamiltonian dynamics.

Now that we have outlined how to approximate the Hamiltonian equations, we can outline Hamiltonian Monte Carlo. HMC is performed in two steps, as follows:

1. Simulate new values for the momentum variables, $r \sim \mathcal{N}(0, M)$.
2. Simulate Hamiltonian dynamics for $L$ steps with stepsize $\epsilon$ using the leapfrog method. The momentum variables are then negated, and the new state $(\theta^*, r^*)$ is accepted with probability $\min\{1, \exp\{H(\theta, r) - H(\theta^*, r^*)\}\}$.
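Putting these pieces together, a minimal sketch of a single HMC transition is given below, assuming an identity mass matrix (so $K(r) = \frac{1}{2} r^T r$); `log_post` and `grad_log_post` are hypothetical user-supplied functions for the log posterior and its gradient.

```python
import numpy as np

def hmc_step(theta, log_post, grad_log_post, eps, L):
    """One HMC transition with identity mass matrix (U = -log_post)."""
    r = np.random.standard_normal(theta.shape)       # resample momentum
    theta_new = theta.copy()
    r_new = r + 0.5 * eps * grad_log_post(theta_new)  # half momentum step
    for _ in range(L - 1):
        theta_new = theta_new + eps * r_new           # full position step
        r_new = r_new + eps * grad_log_post(theta_new)
    theta_new = theta_new + eps * r_new
    r_new = r_new + 0.5 * eps * grad_log_post(theta_new)  # final half step
    # accept with probability min{1, exp(H(theta, r) - H(theta*, r*))}
    log_accept = (log_post(theta_new) - 0.5 * r_new @ r_new) \
               - (log_post(theta) - 0.5 * r @ r)
    if np.log(np.random.rand()) < log_accept:
        return theta_new
    return theta
```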
3.3.3 Developments in HMC and tuning

HMC allows the state space to be explored rapidly and has high acceptance rates. However, in order to gain these benefits we need to ensure that $L$ and $\epsilon$ are properly tuned. Generally it is recommended to use trial values for $L$ and $\epsilon$, and to use traceplots and autocorrelation plots to judge how quickly the resulting algorithm converges and how well it explores the state space. The presence of multiple modes can be an issue for HMC and requires special treatment (Neal, 2010); it is therefore recommended that the algorithm is run from different starting points to check whether multimodality is present. Suppose we have an estimate of the variance matrix of $\theta$: if the variables appear to be correlated, then HMC may not explore the parameter space effectively. One way to improve the performance of HMC in this case is to set $M = \hat\Sigma^{-1}$, where $\hat\Sigma$ is our estimate of $\mathrm{Var}(\theta \mid x)$.

The selection of the stepsize $\epsilon$ is very important in HMC: too large a stepsize results in a low acceptance rate, while too small a stepsize results in slow exploration of the space. Selecting $\epsilon$ too large can be particularly problematic, as it can cause instability in the Hamiltonian error, which leads to very low acceptance. When the mass matrix $M$ is diagonal, the stability limit for $\epsilon$ is given by the width of the distribution in its most constrained direction; for a Gaussian distribution, this is the square root of the smallest eigenvalue of the covariance matrix of $\theta$. The value of $L$ is also an important quantity when tuning HMC. Selecting $L$ too small means HMC explores the space with inefficient random walk behaviour, as the next state will still be correlated with the previous state; selecting $L$ too large wastes computation and lowers acceptance rates.

There have been a number of important developments to HMC. Girolami and Calderhead (2011) introduced Riemannian manifold Hamiltonian Monte Carlo, which simulates HMC in a Riemannian space rather than a Euclidean one. This effectively enables the use of position-dependent mass matrices $M$, allowing the algorithm to sample more efficiently from distributions where the parameters of interest exhibit strong correlations. A recent development by Hoffman and Gelman (2014) is the No-U-Turn Sampler, which enables the automatic and adaptive tuning of the stepsize $\epsilon$ and the trajectory length $L$. This is an important development, since the tuning of HMC algorithms is a non-trivial task. Alternative methods to the leapfrog method for simulating Hamiltonian dynamics have also been developed; these enable us to handle constraints on the variables, or to exploit partially analytic solutions (Neal, 2010). As mentioned earlier, HMC can have considerable difficulty moving between the modes of a distribution, and a number of schemes have been developed to solve this problem, including tempered transitions (Neal, 1996) and annealed importance sampling (Neal, 2001).

3.4 Stochastic gradient Langevin Monte Carlo

A special case of HMC, known as Langevin Monte Carlo, arises when we use only a single leapfrog step to propose a new state. Its name comes from its similarity to the theory of Langevin dynamics in physics. Welling and Teh (2011) noticed that the discretized form of Langevin Monte Carlo has a structure comparable to that of stochastic optimization, outlined in Section 3.2. This motivated them to develop an algorithm based on Langevin Monte Carlo which uses only a subsample of the dataset to calculate the gradient of the potential energy $U$. They show that by using a stepsize that decreases with time, the algorithm smoothly transitions from a stochastic gradient descent to sampling approximately from the posterior distribution, without the need for an acceptance step. This result, along with the fact that only a subsample of the data is used at each iteration, means that the algorithm is scalable to large datasets.

3.4.1 Stochastic gradient Langevin Monte Carlo

Langevin Monte Carlo arises from HMC when we use only one leapfrog step to generate a new state $(r, \theta)$. In this case we can remove any explicit mention of the momentum variables and propose a new value for $\theta$ as follows (Neal, 2010):

$$\theta_{t+1} = \theta_t - \frac{a^2}{2} \frac{\partial U}{\partial \theta}(\theta_t) + \eta,$$

where $\eta \sim \mathcal{N}(0, a^2)$ and $a$ is some constant. Using our particular expression (3.4) for the potential energy, we can write

$$\theta_{t+1} = \theta_t + \frac{\epsilon}{2} \left( \nabla \log p(\theta_t) + \sum_{i=1}^{N} \nabla \log p(x_i \mid \theta_t) \right) + \eta = \theta_t - \frac{\epsilon}{2} \nabla U(\theta_t) + \eta, \qquad (3.7)$$

where $\epsilon = a^2$. While it is a special case of Hamiltonian Monte Carlo, the properties of Langevin dynamics are somewhat different: we cannot typically set $a$ very large, so the state space is normally explored much more slowly than with HMC.

The proposal for Langevin Monte Carlo is a particular discretization of a stochastic differential equation (SDE) known as Langevin dynamics. Writing this discretization as an SDE, we obtain

$$d\theta = -\frac{1}{2} \nabla U(\theta)\, dt + dW = -\frac{1}{2} \nabla U(\theta)\, dt + \mathcal{N}(0, dt), \qquad (3.8)$$
where $W$ is a Wiener process and we have informally written $dW$ as $\mathcal{N}(0, dt)$. A Wiener process is a stochastic process with the following properties:

1. $W(0) = 0$ with probability 1;
2. $W(t + h) - W(t) \sim \mathcal{N}(0, h)$ and is independent of $W(\tau)$ for $\tau \leq t$.

It can be shown that, under certain conditions, the posterior distribution $p(\theta \mid x)$ is the stationary distribution of (3.8). This motivates the Metropolis-adjusted Langevin algorithm (MALA), which uses (3.7) as a proposal for the Metropolis-Hastings algorithm.

When a large number of observations is available, $\nabla U(\theta)$ is expensive to calculate at each iteration, since it requires the evaluation of the gradient of the full log likelihood. Welling and Teh (2011) therefore suggest introducing an unbiased estimator of $U(\theta)$ which uses only a subset $s_t$ of the data at each iteration. The estimator $\tilde U(\theta)$ is given by

$$\tilde U(\theta) = -\log p(\theta) - \frac{N}{n} \sum_{x_i \in s_t} \log p(x_i \mid \theta). \qquad (3.9)$$

We write

$$\nabla \tilde U(\theta) = \nabla U(\theta) + \nu, \qquad (3.10)$$

where $\nu$ is a noise term which we refer to as the stochastic gradient noise. Using this estimator in place of $U(\theta)$ in a Langevin Monte Carlo update, we obtain

$$\theta_{t+1} = \theta_t + \frac{\epsilon}{2} \left( \nabla \log p(\theta_t) + \frac{N}{n} \sum_{x_i \in s_t} \nabla \log p(x_i \mid \theta_t) \right) + \eta = \theta_t - \frac{\epsilon}{2} \nabla U(\theta_t) - \frac{\epsilon}{2} \nu_t + \eta. \qquad (3.11)$$

If we assume that the stochastic gradient noise $\nu_t$ has variance $V(\theta_t)$, then the term $\frac{\epsilon}{2} \nu_t$ has variance $\frac{\epsilon^2}{4} V(\theta_t)$. Therefore for small $\epsilon$, the injected noise $\eta$, which has variance $\epsilon$, will dominate. As we send $\epsilon \to 0$, (3.11) will approximate Langevin dynamics and sample approximately from $p(\theta \mid x)$, without the need for an acceptance step. This result motivates Welling and Teh (2011) to suggest an algorithm that uses (3.11) to update $\theta_t$, but decreases the stepsize $\epsilon$ to 0 as the number of iterations $t$ increases, leading to the SGLD update

$$\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2} \left( \nabla \log p(\theta_t) + \frac{N}{n} \sum_{x_i \in s_t} \nabla \log p(x_i \mid \theta_t) \right) + \eta_t. \qquad (3.12)$$

Noting the similarity between (3.12) and stochastic optimization, they suggest decreasing $\epsilon_t$ according to the conditions (3.1), to ensure that the noise in the stochastic gradients averages out. The result is an algorithm that transitions smoothly between stochastic gradient descent and approximately sampling from the posterior using an increasingly accurate discretization of Langevin dynamics. Since the stepsize must decrease to zero, the mixing rate of the algorithm will slow as the number of iterations increases.
Putting this all together, we outline the full SGLD procedure in Algorithm 2.

Algorithm 2: Stochastic gradient Langevin dynamics (SGLD).

Input: Initial estimate $\theta_1$, stepsize function $\epsilon(t)$, subsample size $|s_t| = n$, likelihood and prior gradients $\nabla \log p(x \mid \theta)$ and $\nabla \log p(\theta)$.
Result: Approximate sample from the full posterior $p(\theta \mid x)$.
for $t = 1$ to $T$ do
    $\epsilon \leftarrow \epsilon(t)$
    Sample $s_t$ from the full dataset $x$
    $\eta \sim \mathcal{N}(0, \epsilon)$
    $\theta \leftarrow \theta + \frac{\epsilon}{2} \left( \nabla \log p(\theta) + \frac{N}{n} \sum_{x_i \in s_t} \nabla \log p(x_i \mid \theta) \right) + \eta$
    if $\epsilon$ small enough then
        Store $\theta$ as part of the sample
    end
end
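A sketch of Algorithm 2 in Python is given below, using the polynomial stepsize $\epsilon_t = a(b + t)^{-\alpha}$ discussed in the next subsection; the gradient functions are hypothetical user-supplied callables, and the stepsizes are stored alongside the draws for the weighted estimator described later.

```python
import numpy as np

def sgld(grad_log_prior, grad_log_lik, x, theta0, a, b, alpha, n, T):
    """Stochastic gradient Langevin dynamics (Algorithm 2), with
    polynomial stepsize eps_t = a * (b + t)**(-alpha)."""
    N = len(x)
    theta = np.array(theta0, dtype=float)
    out = []
    for t in range(1, T + 1):
        eps = a * (b + t) ** (-alpha)
        batch = x[np.random.choice(N, n, replace=False)]
        grad = grad_log_prior(theta) + (N / n) * grad_log_lik(batch, theta)
        eta = np.sqrt(eps) * np.random.standard_normal(theta.shape)
        theta = theta + 0.5 * eps * grad + eta
        out.append((eps, theta.copy()))   # keep eps for weighted estimates
    return out
```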
3.4.2 Discussion and tuning

Teh et al. (2014) study SGLD theoretically and show that, given regularity conditions, estimators derived from an SGLD sample are consistent and satisfy a central limit theorem. They reveal that for polynomial stepsizes of the form $\epsilon_t = a(b + t)^{-\alpha}$, the optimal choice of $\alpha$ is $1/3$. The rate of convergence of SGLD is shown to be $T^{-1/3}$, where $T$ is the number of iterations; this is slower than the traditional Monte Carlo rate of $T^{-1/2}$, and is due to the decreasing stepsizes.

In tuning the algorithm, the key constants that need to be chosen are those used in the stepsize, $a$ and $b$, and the subsample size $n$. To avoid divergence it is important to keep the stochastic gradient noise under control, especially as $N$ gets large. This can be done in two ways: one is to increase the subsample size $n$, the other is to keep the stepsize small. However, in order to keep the algorithm efficient, the subsample size needs to be kept relatively small; Welling and Teh (2011) suggest keeping it in the hundreds. Therefore the main constant to consider in tuning is $a$. Set $a$ too large and the stochastic gradient noise dominates for too long, so the algorithm never moves to posterior sampling; set $a$ too small and the parameter space is not explored efficiently enough.

One problem with the method is that it is important for the stepsizes to decrease to zero so that the acceptance step is not needed; this means the mixing rate of the algorithm slows down as the number of iterations increases. There are a few ways around this. One is to stop decreasing the stepsize once it falls below a threshold at which the rejection rate is negligible, though in this case the posterior will still be explored slowly. Another is to use this algorithm initially for burn-in, then switch to an alternative, more efficient MCMC method later. However, both these solutions require significant hand-tuning beforehand. The decelerating mixing rate also makes it less clear how the algorithm compares to other samplers: while it requires only a fraction of the dataset per iteration, this is offset by the fact that more iterations are required to reach the accuracy of other samplers (Bardenet et al., 2015). Another problem with the method is that it often explores the state space inefficiently, because Langevin dynamics explores the state space less efficiently than more general HMC. This is the motivation for stochastic gradient HMC (Chen et al., 2014), which is discussed in Section 3.5.

Note that, similarly to HMC, certain parameters may have a much higher variance than others. In this case we can use a preconditioning matrix $M$ to bring all the parameters onto a similar scale, allowing the algorithm to explore the space more efficiently. The algorithm including preconditioning can simply be written as

$$\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2} M \left( \nabla \log p(\theta_t) + \frac{N}{n} \sum_{x_i \in s_t} \nabla \log p(x_i \mid \theta_t) \right) + \eta_t,$$

where $\eta_t \sim \mathcal{N}(0, \epsilon_t M)$.

Provided the size of the subset $n$ is large enough, we can use the central limit theorem to approximate $V(\theta_t)$ by its empirical covariance:

$$V(\theta_t) \approx \frac{N^2}{n^2} \sum_{x_i \in s_t} \left( y(x_i, \theta_t) - \bar y(\theta_t) \right) \left( y(x_i, \theta_t) - \bar y(\theta_t) \right)^T = \frac{N^2}{n} V_s, \qquad (3.13)$$

where $y(x_i, \theta_t) = \nabla \log p(x_i \mid \theta_t) + \frac{1}{N} \nabla \log p(\theta_t)$ and $\bar y(\theta_t) = \frac{1}{n} \sum_{x_i \in s_t} y(x_i, \theta_t)$. From (3.13) we determine that the variance contributed by a stochastic gradient step can be estimated by $\frac{\epsilon_t^2 N^2}{4n} M V_s M$ (Welling and Teh, 2011), so that for the injected noise (with variance $\epsilon_t M$) to dominate, denoting the largest eigenvalue of $M V_s M$ by $\lambda$, we require

$$\alpha = \frac{\epsilon_t N^2}{4n} \lambda \ll 1.$$

Therefore, using the facts that Fisher's information $I \approx N V_s$ and that the posterior variance $\Sigma_\theta \approx I^{-1}$ for large $N$, we can find the approximate stepsize at which the injected noise dominates. Denoting the smallest eigenvalue of $\Sigma_\theta$ by $\lambda_\theta$, this stepsize is given by $\epsilon_t \approx \frac{4 \alpha n \lambda_\theta}{N}$, which is generally small.

Suppose we have a sample $\theta_1, \dots, \theta_T$ output from the algorithm. Since the mixing of the algorithm decelerates, standard Monte Carlo estimates will overemphasize parts of the sample where the stepsize is small. This increases the variance of the estimate, though it remains consistent. Therefore Welling and Teh (2011) suggest instead using the estimate

$$E(f(\theta)) \approx \frac{\sum_{t=1}^{T} \epsilon_t f(\theta_t)}{\sum_{t=1}^{T} \epsilon_t},$$

which is also consistent.
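Using the (stepsize, draw) pairs returned by the `sgld` sketch above, this weighted estimator is a few lines:

```python
import numpy as np

def weighted_estimate(f, sgld_output):
    """Stepsize-weighted Monte Carlo estimate of E[f(theta)] from SGLD
    output, a list of (eps_t, theta_t) pairs."""
    eps = np.array([e for e, _ in sgld_output])
    vals = np.array([f(th) for _, th in sgld_output])
    return (eps * vals).sum() / eps.sum()
```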
3.4.3 Further developments

A number of extensions to the original SGLD algorithm of Welling and Teh (2011) have been suggested. Ahn et al. (2012) aim to improve the mixing of the algorithm by appealing to the Bernstein-von Mises theorem. Their method samples from the Bernstein-von Mises Normal approximation to the posterior when the stepsizes are high; Fisher's information, used in the Normal approximation, is estimated from the data. When stepsizes are small, the method employs Langevin dynamics to sample from a non-Gaussian approximation to the posterior. The idea of this approach is to trade bias for computational gain, since the mixing rate of this algorithm will be higher. However, theoretical guarantees for this algorithm are not well understood, and the biases in the sample could be large when models are complex.

Ahn et al. (2014) propose a distributed, or parallelised, algorithm based on SGLD. This works by dividing batches of data across machines. SGLD is run on each worker for a number of iterations; the last observation of this trajectory is then passed to another worker, which carries on the trajectory using its own local dataset. In order to limit the time spent waiting for the slowest workers, the number of iterations of SGLD run on each worker is set to depend on the speed of the worker. However, the consistency results of Teh et al. (2014) may no longer hold for this method, and this needs to be checked.

Patterson and Teh (2013) develop an algorithm inspired by SGLD intended for models where the parameters of interest are discrete probability distributions over $K$ items; common models with these properties include, for example, latent Dirichlet allocation. Sato and Nakagawa (2014) analyse the properties of SGLD with a constant stepsize and find that the algorithm is weakly convergent. Vollmer et al. (2015) point out that while favourable results have been obtained for the SGLD algorithm, most assume that the stepsize decreases to zero, which is not true in practice. This motivates them to calculate the biases explicitly, including their dependence on the stepsize and the stochastic gradient noise. Using these results they propose a modified SGLD algorithm which reduces the bias of the original algorithm due to stochastic gradient noise.

3.5 Stochastic gradient Hamiltonian Monte Carlo

We have seen that Hamiltonian Monte Carlo provides an efficient proposal for the Metropolis-Hastings algorithm which has a high acceptance rate and explores the space rapidly. Welling and Teh (2011) proposed combining Langevin Monte Carlo with stochastic optimization in order to develop a scalable MCMC algorithm which uses only a subset of the data at each iteration. However, due to the restriction to just one leapfrog step at each iteration, Langevin Monte Carlo can explore the state space inefficiently, and it would be beneficial to extend the result to enable subsampling in Hamiltonian Monte Carlo. This extension is non-trivial: Betancourt (2015) discusses how naively subsampling within Hamiltonian Monte Carlo can lead to unacceptable biases. Chen et al. (2014) discuss a potential solution to the problem by appealing to the dynamics of HMC itself, referred to as stochastic gradient Hamiltonian Monte Carlo (SGHMC). However, in doing so they make the assumption that the stochastic gradient noise, as defined in (3.10), is Gaussian; Bardenet et al. (2015) show that when this assumption is violated, poor performance can result.
3.5.1 Stochastic gradient Hamiltonian Monte Carlo

SGHMC can be implemented naively by simply using a subset of the data to calculate the gradient of the potential energy $U$ at each iteration. This is considered in Chen et al. (2014). They use the same unbiased estimator of the potential energy gradient adopted by Welling and Teh (2011) and defined in (3.9). The key assumption made by Chen et al. (2014) in developing their algorithm is that the stochastic gradient noise $\nu$, defined in (3.10), is Normal. To argue the validity of this assumption they appeal to the central limit theorem, though the use of the assumption needs further verification. They therefore write
$$\nabla \tilde{U}(\theta) \approx \nabla U(\theta) + N(0, V(\theta)),$$
where $V$ is the covariance matrix of the stochastic gradient noise. This assumption allows Chen et al. (2014) to approximately write the dynamics of the naive approach as
$$d\theta = M^{-1} r \, dt, \qquad dr = -\nabla U(\theta) \, dt + N(0, 2B \, dt), \qquad (3.14)$$
where $B = \frac{1}{2} \epsilon V(\theta)$. For brevity we write $B$ rather than $B(\theta)$, despite its dependence on $\theta$. These dynamics can then be discretized using the leapfrog method outlined in Section 3.3.

Chen et al. (2014) show that the posterior distribution $p(\theta \mid x)$ is no longer invariant under the dynamics of (3.14). Further to this, Betancourt (2015) shows that naively subsampling in this way when implementing HMC can lead to unacceptable biases. This is due to the stochastic gradient noise now present in the dynamics. In order to try to limit this noise, Chen et al. (2014) therefore introduce a friction term to the dynamics. This involves adding the term $-B M^{-1} r$ to the momentum dynamics, leading to the full dynamics
$$d\theta = M^{-1} r \, dt, \qquad dr = -\nabla U(\theta) \, dt - B M^{-1} r \, dt + N(0, 2B \, dt).$$
The friction term acts by reducing the energy $H(\theta, r)$, which in turn reduces the influence of the noise (Chen et al., 2014). However, in practice we rarely know $B$ analytically, and instead simply have an estimate $\hat{B}$. In this case Chen et al. (2014) suggest introducing a freely chosen friction term $C$ such that $C \succeq \hat{B}$, meaning that $C - \hat{B}$ is positive semidefinite. They then introduce the following dynamics:
$$d\theta = M^{-1} r \, dt, \qquad dr = -\nabla \tilde{U}(\theta) \, dt - C M^{-1} r \, dt + N(0, 2(C - \hat{B}) \, dt) + N(0, 2B \, dt). \qquad (3.15)$$
Chen et al. (2014) make two suggestions when discussing the value of $\hat{B}$. One is simply to ignore the stochastic gradient noise and set $\hat{B} = 0$. While this is not technically correct, as the stepsize tends to 0 so will $B$, so eventually the terms involving $C$ will dominate the dynamics. An alternative is to set $\hat{B} = \frac{1}{2} \epsilon \hat{V}$, where $\hat{V}$ is an estimate of the stochastic gradient noise covariance found using an estimate of the Fisher information (Ahn et al., 2014).
3.5.2 Tuning SGHMC

While we now have an algorithm which does not depend on knowing the stochastic noise model $B$ precisely, it is not obvious how to pick $C$ when tuning the algorithm. In order to gain more insight into best practices for tuning, Chen et al. (2014) appeal to the connection of SGHMC to stochastic optimization with momentum, outlined in Section 3.2. By setting $\nu = \epsilon M^{-1} r$ and discretizing the dynamics, we can rewrite (3.15) as
$$\nu_{t+1} = \nu_t - \epsilon^2 M^{-1} \nabla \tilde{U}(\theta_t) - \epsilon M^{-1} C \nu_t + N(0, 2\epsilon^3 M^{-1} (C - \hat{B}) M^{-1}), \qquad \theta_{t+1} = \theta_t + \nu_{t+1}.$$
Next, setting $\eta = \epsilon^2 M^{-1}$, $\alpha = \epsilon M^{-1} C$ and $\hat{\beta} = \epsilon M^{-1} \hat{B}$, we obtain
$$\nu_{t+1} = (1 - \alpha) \nu_t - \eta \nabla \tilde{U}(\theta) + N(0, 2(\alpha - \hat{\beta}) \eta), \qquad \theta_{t+1} = \theta_t + \nu_{t+1}. \qquad (3.16)$$
Notice the similarity between (3.2) and (3.16). In fact, when the noise is removed ($C = \hat{B} = 0$), SGHMC naturally reduces to stochastic optimization with momentum. Chen et al. (2014) therefore suggest appealing to this similarity and choosing the constants $\eta$ and $\alpha$ rather than the matrix $C$ and the stepsize $\epsilon$. This simplifies tuning, since we can use results from the stochastic optimization with momentum literature (Sutskever et al., 2013), though it is not obvious to what extent these results are applicable to SGHMC.

Chen et al. (2014) recommend using the suggestions for $\hat{B}$ when selecting $\hat{\beta}$: they advise setting either $\hat{\beta} = 0$ or $\hat{\beta} = \eta \hat{V} / 2$, where $\hat{V}$ is an estimate of the Fisher information. This leaves three parameters to be tuned: the learning rate $\eta$, the momentum decay $\alpha$ and the subsample size $n$. A key principle to keep in mind is that the stochastic gradient noise needs to be kept small, especially as $N$ gets large. This can be done in one of two ways: using a larger subsample size $n$, or a smaller learning rate $\eta$. Clearly, to keep the speed of the algorithm we want to keep the subsample size small. In light of this, Chen et al. (2014) suggest keeping $\eta$ small, since large values can cause the algorithm to diverge. More specifically, they suggest setting $\eta = \gamma / n$, where $\gamma$ is a constant which we refer to as the batch learning rate, generally set to around 0.1 or 0.01. They suggest, from previous implementations, keeping the subsample size in the hundreds, for example $n = 500$. Finally, appealing to practices in SGD with momentum, they suggest setting $\alpha$ to be small, at around 0.01 or 0.1.

Now that we have some guidance on choosing the algorithm constants, we outline the full SGHMC procedure in Algorithm 3.

Algorithm 3: Stochastic gradient Hamiltonian Monte Carlo (SGHMC).
Input: Initial estimate $\theta_1$, batch learning rate $\gamma$, momentum decay $\alpha$, subsample size $n$, trajectory length $L$, likelihood and prior gradients $\nabla \log p(x \mid \theta)$ and $\nabla \log p(\theta)$.
Result: Approximate sample from the full posterior $p(\theta \mid x)$.
for $t = 1$ to $T$ do
    Sample subsample $s_t$ from the full dataset $x$
    Generate new momentum variables $r \sim N(0, M)$
    Reparameterise $\nu \leftarrow \epsilon M^{-1} r$
    for $l = 1$ to $L$ do
        $a \sim N(0, 2(\alpha - \hat{\beta})\eta)$
        $\nu \leftarrow (1 - \alpha)\nu - \eta \nabla \tilde{U}(\theta) + a$
        $\theta \leftarrow \theta + \nu$
    end
    Store $\theta$ as part of the sample
end
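Below is a minimal Julia sketch of Algorithm 3 under the simplifying choices $\hat{\beta} = 0$ and $M = I$ (so that $\epsilon = \sqrt{\eta}$ in the reparameterisation above); grad_log_prior and grad_log_lik are the same hypothetical user-supplied gradient functions as in the SGLD sketch. It should be read as an illustration of the procedure, not the implementation used in the study.

```julia
# Minimal SGHMC sketch (Algorithm 3) with β̂ = 0 and M = I.
using Random

function sghmc(x::Matrix{Float64}, θ0::Vector{Float64},
               grad_log_prior, grad_log_lik;
               T::Int = 10_000, n::Int = 500, γ::Float64 = 0.1,
               α::Float64 = 0.01, L::Int = 10)
    N = size(x, 1)
    η = γ / n                            # learning rate η = γ/n, as in the text
    θ = copy(θ0)
    d = length(θ)
    samples = Vector{Vector{Float64}}()
    for t in 1:T
        s = rand(1:N, n)                 # subsample for this trajectory
        ν = sqrt(η) .* randn(d)          # ν = εM⁻¹r with r ~ N(0, I), ε = √η
        for l in 1:L
            # ∇Ũ(θ) = -∇log p(θ) - (N/n) Σ ∇log p(x_i|θ)
            gU = -grad_log_prior(θ)
            for i in s
                gU .-= (N / n) .* grad_log_lik(x[i, :], θ)
            end
            ν .= (1 - α) .* ν .- η .* gU .+ sqrt(2 * α * η) .* randn(d)
            θ .+= ν
        end
        push!(samples, copy(θ))          # store θ once per trajectory
    end
    return samples
end
```

Resampling the momentum at the start of each trajectory and storing $\theta$ once per trajectory mirrors Algorithm 3; with the noise term removed, the inner loop is exactly stochastic gradient descent with momentum, as in (3.16).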
3.5.3 Discussion and extensions

This method can be seen as an extension of stochastic gradient Langevin dynamics. The key advantage it adds is the efficient exploration of the state space inherent in Hamiltonian dynamics. However, while favourable theoretical results, such as a central limit theorem and consistency results, have been found for the SGLD algorithm, in adding Hamiltonian dynamics to stochastic optimization Chen et al. (2014) have had to rely on assuming the stochastic gradient noise is Gaussian. Relying on this assumption can lead to arbitrarily poor performance when it is violated (Bardenet et al., 2015). Therefore the behaviour of the algorithm when simulating from complex models needs to be explored. An alternative to relying on a Gaussian noise assumption would be, rather than dispensing with an acceptance step completely, to use the results of Bardenet et al. (2015) and perform an acceptance step using only a subset of the data. Another problem with the algorithm is that there are a large number of parameters to be tuned, with few results discussing best practices for doing so. It follows that guidance for tuning the algorithm, or an SGHMC algorithm which tunes adaptively, should be developed.

More recently, Ma et al. (2015) proposed a general framework for producing stochastic gradient MCMC methods which leave the target distribution invariant. They show that this framework is complete, meaning that a large class of Markov processes with the desired stationary distribution, including HMC, SGLD and SGHMC, can be written in terms of it. Using the framework, Ma et al. (2015) introduce a stochastic gradient Riemann Hamiltonian Monte Carlo (SGRHMC) algorithm. This algorithm combines the scalability of SGHMC with work by Girolami and Calderhead (2011) on adaptively tuning the mass matrix $M$ in Hamiltonian Monte Carlo. Ma et al. (2015) identify that using the framework to choose dynamics well suited to the desired target is an important future direction.
A problem addressed by Ding et al. (2014) is that the stochastic noise $B$ is not easy to estimate. Appealing to developments in molecular simulation, Ding et al. (2014) consider the theory of a canonical ensemble, which represents the possible states of a system in thermal equilibrium with a heat bath at fixed temperature $T$. The probability of states in a canonical ensemble follows the canonical distribution (3.6) outlined in Section 3.3. Ding et al. (2014) assert that a critical characteristic of the canonical ensemble is that the system temperature, defined as the mean kinetic energy, satisfies the equilibrium condition
$$\frac{k_B T}{2} = \frac{1}{n} E[K(r)], \qquad (3.17)$$
where we have used the notation of Section 3.3. In dynamics-based Monte Carlo methods (for example HMC), the canonical ensemble is approximated in order to generate samples. This approximation is outlined for HMC in Section 3.3. However, in order to correctly simulate from the canonical ensemble, the dynamics must maintain the thermal equilibrium condition (3.17) (Ding et al., 2014). While it can be shown that the dynamics of HMC maintain thermal equilibrium, the presence of the stochastic gradient noise in SGHMC means the algorithm no longer satisfies it. To account for this, Ding et al. (2014) introduce a thermostat, which controls the mean kinetic energy adaptively. They introduce a new variable $\xi$, and propose the following dynamics as an alternative to SGHMC:
$$d\theta = r \, dt, \qquad dr = -\nabla \tilde{U}(\theta) \, dt - \xi r \, dt + N(0, 2A \, dt), \qquad d\xi = \left( \frac{1}{n} r^T r - 1 \right) dt,$$
where $A$ is a constant to be chosen. Due to its similarity to the Nosé-Hoover thermostat in statistical physics, Ding et al. (2014) refer to this algorithm as the stochastic gradient Nosé-Hoover thermostat (SGNHT). They also introduce a more general method which is able to handle non-isotropic noise from $\nabla \tilde{U}(\theta)$. However, discretizing this system introduces biases which need to be studied.
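As a sketch only, one possible Euler-type discretization of these dynamics with stepsize $h$ is given below, assuming an isotropic unit mass, a scalar thermostat $\xi$ initialised at $A$, and a hypothetical function grad_U_tilde returning $\nabla \tilde{U}(\theta)$. As noted above, any such discretization introduces biases.

```julia
# Minimal SGNHT sketch: Euler-type discretization of the dynamics above.
using Random, LinearAlgebra

function sgnht(θ0::Vector{Float64}, grad_U_tilde;
               T::Int = 10_000, h::Float64 = 1e-3, A::Float64 = 1.0)
    d = length(θ0)
    θ = copy(θ0)
    r = randn(d)                         # momentum; r ~ N(0, I) at equilibrium
    ξ = A                                # thermostat variable, initialised at A
    samples = Vector{Vector{Float64}}()
    for t in 1:T
        r .= r .- h .* grad_U_tilde(θ) .- h .* ξ .* r .+
             sqrt(2 * A * h) .* randn(d)
        θ .+= h .* r
        ξ += h * (dot(r, r) / d - 1)     # drives the mean kinetic energy to
                                         # its equilibrium value, as in dξ above
        push!(samples, copy(θ))
    end
    return samples
end
```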
3.6 Conclusion

In this section we outlined stochastic gradient Monte Carlo. First we discussed stochastic optimization, a technique used to estimate the mode of the posterior distribution while using only a subset of the data at each iteration. A downside of this method is that it only produces a point estimate of $\theta$, which can often lead to overfitting. We then discussed Hamiltonian Monte Carlo, a method used to produce efficient proposals for a Metropolis-Hastings algorithm with high acceptance rates, along with the complexities of tuning the algorithm. With this machinery in place we were able to introduce stochastic gradient Langevin dynamics, which combines stochastic optimization with posterior sampling, aiming to improve upon the overfitting issues of stochastic optimization. However, the algorithm mixes slowly for two reasons: one is its decreasing stepsize; the other is that Langevin dynamics does not explore the state space efficiently, since it takes just one step at a time. The optimal convergence speed, along with results proving consistency and a central limit theorem, has been found for the algorithm; an optimal scaling result would be an important addition to this work.

The mixing issues of SGLD are intended to be resolved by stochastic gradient Hamiltonian Monte Carlo, which extends the SGLD algorithm to use general Hamiltonian dynamics in order to approximately sample from the posterior, so that the parameter space is explored more efficiently. However, in doing so Chen et al. (2014) assume that the stochastic gradient noise is Gaussian, and the effects of this assumption for complex models are not clear. The method also relies on a number of tuning parameters, and results on best tuning practices for these are limited.

4 Simulation study

4.1 Introduction

Overview

In this section we investigate the performance of the discussed methods in various scenarios. The study is divided into two main parts: the first compares the performance of the batch methods, the second the performance of the stochastic gradient methods. The batch methods were generally coded from scratch in R, though some methods were implemented using the parallelMCMCcombine package developed by Miroshnikov and Conlon (2015). The stochastic gradient methods were all implemented from scratch in Julia, chosen for its speed in iterative tasks where vectorization is not possible. Implementations of each method are available on GitHub: 9ZGHP2.

In each study the target is a multivariate t-distribution with location $\theta$, scale $\Sigma$ and degrees of freedom $\nu$. In all cases $\Sigma$ and $\nu$ are assumed known, and the algorithms are used to estimate $\theta$. Suppose the target has dimension $d$; then the true values of the parameters are
$$\theta = (0, \ldots, 0)^T, \qquad \Sigma = \text{diag}\{5, \ldots, 5\}, \qquad \nu = 3,$$
where $\theta$ is a vector of length $d$ and $\Sigma$ is a $d \times d$ diagonal matrix. Since the model we use to test the methods is relatively simple, the number of observations simulated from the target is kept small. The idea behind this is to emulate more complex models where the ratio of parameters to estimate to the number of observations is high. When we are not investigating the dimensionality of the target, its dimension is kept at 2.
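For concreteness, the following Julia sketch simulates observations from this target and evaluates the likelihood gradient $\nabla_\theta \log p(x_i \mid \theta)$ needed by the stochastic gradient methods. The function names are our own, but the multivariate t density, and hence its gradient, is standard; the simulator uses the usual Gaussian scale-mixture representation and assumes integer $\nu$.

```julia
# Multivariate-t target: simulation and the log-likelihood gradient in θ.
using Random, LinearAlgebra

# One draw from t(θ, Σ, ν) via x = θ + z / sqrt(u/ν), z ~ N(0, Σ), u ~ χ²_ν.
function rand_mvt(θ::Vector{Float64}, Σ::Matrix{Float64}, ν::Int)
    z = cholesky(Symmetric(Σ)).L * randn(length(θ))
    u = sum(abs2, randn(ν))              # χ²_ν as a sum of ν squared normals
    return θ .+ z ./ sqrt(u / ν)
end

# For log p(x|θ) = const − ((ν+d)/2) log(1 + (x−θ)ᵀΣ⁻¹(x−θ)/ν), the gradient is
# ∇_θ log p(x|θ) = (ν + d) Σ⁻¹(x − θ) / (ν + (x − θ)ᵀΣ⁻¹(x − θ)).
function grad_log_lik_t(x::Vector{Float64}, θ::Vector{Float64},
                        Σinv::Matrix{Float64}, ν::Int)
    u = Σinv * (x .- θ)
    return (ν + length(θ)) .* u ./ (ν + dot(x .- θ, u))
end
```

A function of this form can be passed to the SGLD and SGHMC sketches above; with a flat prior, grad_log_prior(θ) would simply return zeros(length(θ)).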
The batch methods section compares the performance of the algorithms as the following scenarios change:

- The choice of bandwidth (nonparametric/semiparametric);
- The number of observations;
- The size of the batches;
- The dimension of the target.

The stochastic gradient methods section compares performance for the following scenarios:

- The subsample size;
- The number of observations;
- The dimension of the target.

The choice of bandwidth aims to compare the nonparametric/semiparametric methods' sensitivity to the bandwidth choice, and to see whether there appears to be a clear optimal bandwidth for the methods under this scenario. When studying how the algorithms behave as the number of observations varies, we are particularly interested in the behaviour of the parametric methods as the Bernstein-von Mises theorem no longer holds for the full posterior. Similarly, as the size of each batch varies, we are interested in how the parametric methods behave as the Bernstein-von Mises theorem no longer holds for each subposterior. Since the number of observations is fixed when the batch size is investigated, the number of batches also varies in this investigation; we can therefore examine how the semiparametric/nonparametric methods behave as the number of batches gets large. The subsample size investigation allows us to examine whether there is a point at which the performance improvement from using a larger subsample size is offset by the extra computational cost. Finally, when comparing how the methods perform as the dimension varies, we are particularly interested in the semiparametric/nonparametric methods, since KDE is known to scale poorly with dimension.

Performance summaries

The performance of the algorithms is compared by calculating the Kullback-Leibler (KL) divergence between the empirical distribution of the approximate sample and the empirical distribution of a sample obtained from a standard MH algorithm. The KL divergence is a measure of distance between two distributions. Given two continuous distributions $P$ and $Q$ with densities $p$ and $q$, the Kullback-Leibler divergence $D_{KL}(P \| Q)$ is defined to be
$$D_{KL}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx.$$
Clearly, since we only have samples from each distribution, the KL divergence needs to be approximated. We estimate it using a method introduced by Boltz et al. (2007), implemented in the R package FNN (Beygelzimer et al., 2013). The method works by calculating an empirical density function using k-nearest neighbours, and then comparing the KL divergence of these empirical densities. For reference, we also computed the estimated KL divergence between two samples from the same posterior, each obtained using an MH sampler; this provides a baseline value for the comparisons that follow.

In each case the KL divergence between a sample from the full posterior found using Metropolis-Hastings and a Laplace approximation calculated using stochastic optimization, which we refer to as a stochastic approximation, is plotted for comparative purposes. This serves as a baseline method, and so we are particularly interested in the performance of the algorithms when the performance of the stochastic approximation is poor.
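As an illustration of this style of estimator, the sketch below computes a simple 1-nearest-neighbour estimate of $D_{KL}(P \| Q)$ from samples $X \sim P$ and $Y \sim Q$. It implements the standard nearest-neighbour estimator rather than being a port of the FNN code, and uses brute-force distance computations for clarity.

```julia
# 1-NN Kullback-Leibler divergence estimate D(P‖Q) from samples X (n×d) ~ P
# and Y (m×d) ~ Q: D̂ = (d/n) Σᵢ log(νᵢ/ρᵢ) + log(m/(n−1)), where ρᵢ is the
# distance from Xᵢ to its nearest neighbour in X \ {Xᵢ} and νᵢ its distance
# to the nearest neighbour in Y. Assumes no duplicate points.
using LinearAlgebra

function kl_knn(X::Matrix{Float64}, Y::Matrix{Float64})
    n, d = size(X)
    m = size(Y, 1)
    total = 0.0
    for i in 1:n
        xi = X[i, :]
        ρ = minimum(norm(xi - X[j, :]) for j in 1:n if j != i)
        ν = minimum(norm(xi - Y[j, :]) for j in 1:m)
        total += d * log(ν / ρ)
    end
    return total / n + log(m / (n - 1))
end
```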
4.2 Batch methods

Choice of bandwidth

First we look at the effect the choice of bandwidth has on the quality of the sample obtained from the nonparametric and semiparametric methods. In this case we are interested in the effect that annealing the bandwidth has on the performance of the methods. We also wish to determine the sensitivity of each algorithm to the choice of bandwidth.

Figure 1: Plots of the KL divergence from a standard MH sample for different batch MCMC methods against the ratio of bandwidth to the largest batch standard deviation.

Observations are simulated from the target distribution and allocated randomly across 50 batches. MCMC is run separately on each batch, and the resulting samples are combined using various nonparametric and semiparametric algorithms. The results are plotted in Figure 1. The plot compares three methods, abbreviated as follows:

- Npara: the nonparametric recombination method introduced in Section 2.5;
- Semipara_full: the semiparametric method introduced in Section 2.6;
- Semipara: the alternative semiparametric method introduced in Section 2.6. This method uses the same mixture weights as the nonparametric method, which increases the acceptance rate of the MCMC.

"Annealed" indicates that the bandwidth is set to tend closer to zero at each iteration, as opposed to using a constant bandwidth. In the annealed case, the x-axis indicates the ratio of the bandwidth at the first iteration of the algorithm to the maximum standard deviation of a batch.

Notice first that the performance of the nonparametric method shows considerable sensitivity to the bandwidth choice, while the semiparametric methods only show sensitivity if the bandwidth is set too small. This sensitivity could be due to two things: either the MCMC gets stuck in small modes when the bandwidth is too small, or the supports of each batch do not overlap well. To examine this further, we pick a bandwidth whose ratio to the maximum standard deviation of a batch is 0.4. We implement the Npara method in two ways. First, we implement it in the standard way, simulating from the mixture using MCMC. Second, we fit a KDE to each batch at a grid of points and then take the product of these estimates; we refer to this as the grid method. In this case we find that the problem is the MCMC getting stuck, since the grid method performs quite well at low bandwidths. At higher bandwidths, however, the grid method exhibits a similarly poor fit to the standard method.

When the bandwidth is annealed, the optimal bandwidth appears to be about 1.2 standard deviations (sds) for the nonparametric method, while it is about 0.7 sds when it is not. Other examples were tried and similar values for the optimal bandwidth were found. The performance of the two semiparametric methods is quite similar, except at low bandwidths, where the semiparametric method using nonparametric weights seems to outperform the standard method. This is probably due to low MCMC acceptance for the method using the standard weights. The optimal bandwidth for the semiparametric methods seems to be about 1.5 sds when the bandwidth is not annealed and 2 sds when it is. There appears to be no harm in assigning a bandwidth that is quite large when using the semiparametric methods.

Number of observations

We investigate how the number of observations in each batch affects the performance of the different algorithms. We have a particular interest in how the parametric methods perform as the Bernstein-von Mises theorem no longer applies and the posterior moves away from Gaussianity. This time the number of observations simulated from the target varies, and these are divided into 10 batches. MCMC is run on each batch and the samples are combined using the following batch methods:

- NeisNonparaAn: nonparametric method with annealing;
- NeisPara: parametric method introduced in Section 2.4;
- NeisSemiparaAn: semiparametric method with annealing, using nonparametric weights;
- NeisSemiparaFullAn: semiparametric method with annealing;
- Scott: parametric method introduced in Section 2.4;
- stochastic_approx: sample from a stochastic approximation.
Since it was found that annealing does not affect the performance of the algorithms too much, provided a good value is chosen for the bandwidth, only the annealed methods are included in the plot.

Figure 2: Plots of the KL divergence from a standard MH sample for different batch MCMC methods against the total number of observations, divided among 10 batches.

In this example, the best performance is attained by the parametric methods Scott and NeisPara. These methods perform reasonably well even when the stochastic approximation to the true posterior is poor. This comes as a surprise, since the theoretical justification for these methods relies on the Bernstein-von Mises theorem. The methods are only adversely affected in the extreme case of just 10 observations, that is, one observation per batch. The worst performance in this example is by NeisSemiparaFullAn, whose KL divergence is very high for a small number of observations. Since there are only a few observations per batch in these cases, the batch means are very different, and the semiparametric method using the standard weights tends to get stuck at components far away from the true mean, probably because of its low acceptance rate.

Figure 2 also shows that the stochastic optimization method performs poorly when there are few observations. This is because the method struggles to get close to the true mode of the posterior, suggesting the presence of multiple modes. This is good news for the parametric methods, as it was questionable how they would perform in the presence of multiple modes. All the methods except NeisSemiparaFullAn appear to outperform a stochastic approximation when there are a small number of observations. However, as the number of observations gets larger, the approximation begins to outperform the methods NeisNonparaAn and NeisSemiparaAn. NeisSemiparaAn appears to perform somewhat worse than the other methods across the board. This is possibly due to using weights in the MCMC that are only asymptotically valid as $h \to 0$; since the methods are working with only a few observations, the bandwidth $h$ is probably quite high in this case.
Batch size

We investigate how the size of each batch affects the performance of the different algorithms. 800 observations are simulated from the target; the data are then divided up randomly into sets of different numbers of batches. MCMC is run on each batch and the samples are combined using the discussed batch methods. The results are plotted in Figure 3.

Figure 3: Plots of the KL divergence from a standard MH sample for different batch MCMC methods against different batch sizes.

Figure 3 shows that the semiparametric and parametric methods perform well across a variety of batch sizes. This again is somewhat surprising, since for non-Gaussian targets the parametric methods are only theoretically justified as the size of each batch tends to infinity. The fact that an approximation using stochastic optimization also performs well suggests it may be instructive to try a more complex example to see how the methods perform then. The nonparametric method performs very poorly when batch sizes are small. Comparing to the grid method, this is mainly due to inefficient MCMC, probably because the number of mixture components is large when there are many batches. The grid approximation to the full posterior is not perfect, however, perhaps because the subposterior supports do not overlap well when batch sizes are small.

Dimensionality

We investigate how the dimensionality of the target affects the performance of the different algorithms. We simulate 800 observations from the target; the data are then divided up randomly into 40 batches, each with 20 observations. MCMC is run on each batch and the samples are combined using the batch methods. The results are plotted in Figure 4.
Figure 4: Plots of the KL divergence from a sample obtained using a standard MH algorithm for different batch MCMC methods against the dimension of the target.

Once again the best performance in this example is attained by the parametric methods Scott and NeisPara, despite the stochastic approximation performing badly in high dimensions. It was found that in high-dimensional cases the stochastic approximation performed poorly because its approximation of the posterior variance was inadequate. This is in contrast to the case of a low number of observations, where the approximation struggled to find the mode. As expected, the nonparametric method based on kernel density estimation performs very badly in high dimensions, due mainly to the poor performance of KDE in high dimensions. The efficiency of the nonparametric method might be improved by choosing a kernel with heavier tails. To a lesser extent, both semiparametric methods perform progressively worse as the dimensionality increases; again this is probably due to the known poor performance of kernel density estimation in high dimensions.

4.3 Stochastic gradient methods

Tuning the algorithms

In the case of stochastic gradient Langevin dynamics (SGLD) and stochastic gradient Hamiltonian Monte Carlo (SGHMC), there are certain free constants we may choose, and we need to make good choices for them in order for the algorithms to sample closely from the posterior distribution. For the SGLD algorithm, as discussed earlier, it is recommended by Welling and Teh (2011) to keep the subsample size $n$ in the hundreds; the effect of a change in subsample size is investigated later.
On the recommendations of Teh et al. (2014), discussed in Section 3.4, we set the stepsize to be of the form $\epsilon_t = a(1 + Nt)^{-1/3}$, where $N$ is the total number of observations and $t$ is the iteration. Therefore the only constant that needs to be chosen when tuning the algorithm is $a$. There is little in the literature on the choice of this constant; however, we had the luxury of being able to run a standard MH sampler on our chosen examples. Accordingly, we chose this constant by running the algorithm with a number of different values of $a$, and choosing the value of $a$ which minimised the KL divergence between the SGLD sample and the sample from a standard MH sampler. Empirically, the quantity which had the most effect on the choice of $a$ was the number of observations. The subsample size $n$ appeared to have some effect on the choice of $a$, while dimensionality appeared to have the least effect.

In the case of the SGHMC algorithm, as recommended by Chen et al. (2014), we appealed to its connection with stochastic gradient descent with momentum. We therefore reparameterised in terms of a learning rate $\eta$, a momentum decay $\alpha$, the subsample size $n$ and the trajectory length $L$. It was found that, provided it was set relatively high, the trajectory length $L$ had limited effect on the quality of the sample; we therefore fixed $L = 10$. We set $\eta = \gamma / n$ and tuned $\gamma$, known as the batch learning rate. In their paper, Chen et al. (2014) recommend setting $\alpha$ quite small, to 0.1 or 0.01, and setting $\gamma$ to 0.1 or 0.01 too. We checked this recommendation by varying $\alpha$ and $\gamma$ and finding the constant choices which minimised the KL divergence from a sample produced by a standard MH sampler. We found that for our examples the recommended choices for $\eta$ and $\alpha$ were not the best in general: it was often appropriate to choose a value for $\alpha$ in the interval $[1, 2]$ and for $\gamma$ in the interval $[0, 10]$. Chen et al. (2014) recommend a subsample size $n$ of approximately 500, and we explore the effect of choosing different subsample sizes later in the section.
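To illustrate the tuning procedure just described, the sketch below selects the SGLD constant $a$ by grid search, scoring each candidate by the estimated KL divergence from a reference MH sample (for example with the kl_knn sketch given earlier). Here run_sgld and mh_reference are hypothetical stand-ins for the samplers used in the study, and the grid values are illustrative only.

```julia
# Grid search over the SGLD constant a, scored against an MH reference sample.
# run_sgld(a) is assumed to return an (n_samples × d) matrix of SGLD draws
# using stepsizes ϵ_t = a(1 + Nt)^(-1/3); mh_reference holds MH draws.
function tune_sgld_a(run_sgld, mh_reference::Matrix{Float64},
                     grid::Vector{Float64})
    scores = [kl_knn(run_sgld(a), mh_reference) for a in grid]
    return grid[argmin(scores)], scores
end

# Illustrative usage (assuming run_sgld and mh_reference are defined):
# a_best, scores = tune_sgld_a(run_sgld, mh_reference, [0.1, 1.0, 10.0, 50.0])
```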
Subsample size

In this investigation we compared the effect of different subsample sizes on sample quality. Our interest is in whether there is a clear point at which obtaining a larger subsample is no longer worth the increase in computational expense. Observations were simulated from the target distribution, and after a burn-in of $10^3$ iterations, the SGLD and SGHMC algorithms were run for $10^4$ steps with different subsample sizes $n$. For comparison, a Normal approximation calculated using stochastic optimization with the same subsample sizes is also included. The results are plotted in Figure 5.

Figure 5: Plots of the KL divergence from a standard MH sample for different stochastic gradient Monte Carlo methods against different subsample sizes.

Something immediately obvious from Figure 5 is that there is a clear point at which larger subsample sizes provide limited extra value; a moderate subsample size appears close to optimal for both methods. Notice that for small subsample sizes the performance of the SGLD algorithm appears somewhat poorer than that of SGHMC; in some cases it even performs worse than a stochastic approximation. This may be due to the large stochastic gradient noise when the subsample size is this small. Where the stochastic gradient noise is very large, the injected noise may never have a chance to dominate the sampling, so that the posterior sample is of poor quality. While the same argument might be made for SGHMC, the presence of its friction term may be the reason it is less affected.

Both algorithms outperform the stochastic approximations for reasonable subsample sizes, even though the number of observations is high. Despite having no theoretical guarantees, SGHMC also generally appears to outperform SGLD, though at the cost of a slower run time and more constants to tune. In general, SGHMC was found to run a little less than $L$ times slower than SGLD, where $L$ is the length of the trajectory. In this example, the best values of $a$ for the SGLD algorithm ranged from 0.8 for a small subsample size of 10, up to 40 for larger subsample sizes. This suggests that as the subsample size increases relative to the number of observations, we are able to increase the stepsize and so make more confident moves. The best values of $\alpha$ for the SGHMC algorithm ranged from 2 for small values of $n$ to 0.5 for large values, and the best values of $\gamma$ ranged from 6 for small values of $n$ to 0.5 for large values. While Chen et al. (2014) suggest that as the number of observations $N$ gets larger we can either set a small learning rate $\eta$ or use a larger subsample size $n$, this example appears to suggest differently. As $n$ increases with the number of observations fixed, we would expect $\eta$ to increase to compensate; however, in this case the opposite occurs, and the best constant choices appear to be to decrease $\eta$ as we also decrease $\alpha$.

Number of observations

We compare the effect of different numbers of observations on the performance of each of the stochastic gradient methods. Contrasting numbers of observations were simulated from the target distribution and, after a burn-in of $10^3$ iterations, the SGLD and SGHMC algorithms were run for $10^4$ iterations. In all cases the subsample size was fixed at 10; while this is small, it allowed us to test the methods at a very low number of observations. Results are plotted in Figure 6. The results for the batch method Scott, which performed particularly well, are also plotted for comparison.
Figure 6: Plots of the KL divergence from a standard MH sample for different stochastic gradient Monte Carlo methods against different numbers of observations.

Notice that the sample quality of the SGLD and SGHMC algorithms is not affected by an excessively small number of observations in the way the batch methods are. The performance of the SGHMC algorithm in these cases is exceptional, with an estimated KL divergence of the order 0.001, outperforming all the batch methods. For larger numbers of observations, the performance of the SGLD algorithm appears rather poor, as it is outperformed by the stochastic approximation and many of the batch methods. This poor performance is probably due to the excessively small subsample size of 10; as seen in the subsample size investigation, the method is less resilient to large stochastic gradient noise than SGHMC.

The best choices of $a$ when applying the SGLD algorithm in this case were found to range from 200 for small numbers of observations to 50 for the larger numbers. Therefore in this case we find that as we increase the number of observations, keeping the subsample size fixed, the corresponding stepsize should decrease. This occurs because as we increase the number of observations the stochastic gradient noise increases, and we want to make less confident moves to compensate. When applying the SGHMC algorithm, however, the best values of $\gamma$ and $\alpha$ showed no trend with the number of observations. That being said, the best values found were quite variable in magnitude: values of $\gamma$ ranged from 0.01 to 6, while values of $\alpha$ ranged from 0.01 to 1.9. Choosing parameters away from these generally led to an estimated KL divergence of the order 0.1 rather than 0.001. This suggests that tuning this algorithm may be somewhat of a fine art, and that finding good rules for choosing $\gamma$ and $\alpha$ will probably be difficult.

Dimensionality

We investigate how the dimensionality of the target affects the performance of the different stochastic gradient algorithms. We simulate 800 observations from the target distribution with varying dimensionality $d$. SGLD and SGHMC are then used to approximately sample from the posterior with a fixed subsample size of $n = 200$.
The results are plotted in Figure 7. Again the KL divergence of the batch method Scott is plotted for comparison.

Figure 7: Plots of the KL divergence from a sample obtained using a standard MH algorithm for different stochastic gradient Monte Carlo methods against the dimension of the target.

Once again the SGHMC algorithm substantially outperforms the SGLD algorithm at all dimensions, and both methods perform relatively well at high dimensions. At high dimensions, the SGLD algorithm performs about on par with the NeisPara algorithm and slightly worse than the algorithm of Scott, while the SGHMC algorithm again appears to perform the best of all the methods considered in the investigation. The SGLD algorithm worsens with dimensionality quite quickly at first, but this then levels off. The best values of $a$ when applying the SGLD algorithm in this example ranged from 90 at low dimensions to 70 at higher dimensions. This suggests that at higher dimensions a slightly lower stepsize may be required for the algorithm to explore the space most effectively. Once again we find little trend in the best choices of $\alpha$ and $\gamma$ with dimension: values of $\gamma$ ranged from 0.1 to 9, while the best values of $\alpha$ again varied considerably in magnitude.

4.4 Conclusion

We find that the parametric methods are surprisingly robust to a variety of scenarios for our relatively simple model, including small batch sizes, high dimensionality and a small number of observations. This encourages us to explore the methods' properties in more complex, multimodal models. The nonparametric and semiparametric methods did not perform so well; many of their issues seemed to result from the MCMC simulation getting stuck. More efficient ways of combining the subposteriors, or of simulating from the resulting mixture, are required. The nonparametric method was particularly poor at high dimensions; this may be improved by investigating heavier-tailed kernels.
Stochastic gradient methods seemed to be robust to a variety of scenarios. However, considerable time was spent tuning these algorithms, even with an MH sample from the full posterior available; results offering tuning guidance for these methods are therefore required. It was found that SGHMC was just under $L$ times slower to run than SGLD, where $L$ is the trajectory length. With three constants to tune, compared to one for SGLD, SGHMC took longer to tune than SGLD; however, its performance was generally much better.

5 Future Work

5.1 Introduction

As a new field, avenues for research are growing rapidly. All of the methods mentioned in this report are relatively recent, so in many cases their theoretical properties have not been adequately explored. Biases often exist in the methods which might be improved upon; alternatively, the methods have flaws which need to be addressed. In this section we outline three areas of open research that have come to light from the review and simulation study, and that I am likely to pursue as part of the PhD project.

5.2 Further comparison of batch methods

Most of the batch methods either rely on potentially quite restrictive assumptions or have undesirable properties. While the simulation study shed some light on the practical performance of the algorithms, the model used in the testing was relatively simple, so issues may have been missed. Moreover, the simulation study itself opened up a few more questions which are important to answer. This leads us to our first area of research, which is to provide a more comprehensive comparison of the various batch methods. For the rest of this section we outline what we are looking for in particular.

Parametric recombination methods rely heavily on the Bernstein-von Mises theorem, a central limit theorem for Bayesian statistics. In particular, both methods are only exact when each subposterior is Gaussian, or as the size of each batch tends to infinity. This immediately begs the question of when this assumption is valid. The simulation study showed that the two parametric algorithms were surprisingly robust to a number of different scenarios, including cases where the number of observations was small compared to the dimensionality of the target, a commonly occurring scenario in practice. Of particular interest was the fact that the methods would regularly outperform the stochastic approximation. There is no underlying theory to suggest why this might be, so working to develop theoretical results of this sort would be a valuable contribution.

The strong performance of the parametric methods may be a result of the relatively simple model used for the comparison. Therefore an investigation into the performance of parametric methods when used to train a more complex target is required. The desire is for this to be a standard model from the machine learning literature which exhibits multimodality. Common models fitting these properties include Bayesian logistic regression, neural networks and latent Dirichlet allocation.
Nonparametric and semiparametric recombination methods suffer from a number of disadvantages, which came to light during the simulation study. The nonparametric algorithm performed poorly at high dimensions and small batch sizes, and was sensitive to the bandwidth choice. The semiparametric methods performed poorly at high dimensions and when there were a small number of observations. Many issues with the algorithms resulted from the MCMC used to simulate from the Gaussian mixture getting stuck, so work on recombining the subposteriors more efficiently is needed. As outlined in Section 2.3, efficient methods for simulating from Gaussian mixtures have been developed (Ihler et al., 2004; Rudoy and Wolfe, 2007). Potential improvements to the posterior estimate from using these methods to simulate from the Gaussian mixture could be studied.

By changing the kernel used in the KDE to be heavier tailed, the algorithms may work more effectively at high dimensions and at smaller batch sizes, since the subposterior densities are more likely to overlap. Implementing a different kernel will require an alternative strategy when recombining the density estimates: the fact that the product of two Gaussian densities is itself a Gaussian density is key to the nonparametric method introduced by Neiswanger et al. (2013). Results regarding the choice of bandwidth when estimating each subposterior could also be a useful direction; the choice of bandwidth for each subposterior is a balance between the precision of the estimate and the amount of subposterior overlap. Finally, reviewing the methods' performance when targets are multimodal would be useful; one might expect these methods to perform better than the parametric methods in this case.

To summarise, we aim to compare the methods on more complex models, with a particular interest in multimodal targets. We plan to review the performance of the nonparametric and semiparametric methods when heavier-tailed kernels are used. Theoretical results on the performance of the parametric methods compared with a stochastic approximation could be developed. Finally, developing more efficient ways to simulate from the nonparametric approximation to the posterior could be an area to pursue.

5.3 Tuning guidance for stochastic gradient methods

A particularly non-trivial part of implementing the stochastic gradient Monte Carlo methods SGLD and SGHMC was tuning them. The absence of an acceptance step means that tuning cannot simply be performed by looking at traceplots or by checking the acceptance rate. While we had the luxury of comparing with a sample produced using Metropolis-Hastings, in general this will clearly not be the case. While Teh et al. (2014) have found optimal convergence rates for SGLD, there are no results on the choice of the constant $a$; similarly, there have been no results on good choices of $\alpha$ and $\gamma$ for SGHMC. Therefore both methods require guidance on tuning.

Optimal scaling results for either algorithm would be useful. In order to find optimal scaling results for MCMC algorithms, a relevant measure of efficiency is required. In Roberts et al. (2001), the optimal scaling of various MH algorithms is reviewed, including the Metropolis adjusted Langevin algorithm (MALA), which uses the same proposal as SGLD, though without the decreasing stepsize. However, the measure of efficiency used in this case
is the reciprocal of the integrated autocorrelation time, which is found to be related to the acceptance rate. This measure is probably not applicable to the SGLD algorithm, since there is no acceptance rate: a move which would normally be rejected by a MALA algorithm, giving high autocorrelation between the new state and the current state, would simply be an overconfident move for the SGLD algorithm, with low autocorrelation between the new and current states. A good measure of efficiency for stochastic gradient based algorithms is therefore required in order to allow optimal scaling results to be obtained. Such an efficiency measure is also required for the development of methods which tune adaptively, similar to the No-U-Turn Sampler developed by Hoffman and Gelman (2014) for HMC.

5.4 Using batch methods to analyse complex hierarchical models

A form of model which could benefit considerably from the scalability improvements introduced by batch methods is the hierarchical model. This type of model is very common across a wide range of applications, from computer science to ecology. In this case the structure of the model can be taken advantage of. Suppose we have a nested hierarchical model of the form
$$y_{ij} \sim f(y \mid x_j), \qquad x_j \sim g(x \mid \theta);$$
then we might split the data according to the value of $j$ (Scott et al., 2013). Batch methods can then be used to estimate $\theta$, and this estimate can be used to simulate values of $x_j$ in parallel. Similar methods can be used for more complex structures. Clearly one key benefit of this approach is that it allows the problem to be parallelised naturally in groups. Another reason we might expect this approach to work particularly well is that generally only a few parameters need to be approximated by combining batches; the rest are estimated in parallel, given these parameters, using standard MCMC. In the rest of this section we outline a particular form of hierarchical model which could benefit from this approach.

A common statistical problem, especially in medical applications, is the need to model the relationship between a single time-independent response and a functional predictor. It is generally impossible to observe this functional predictor exactly; the data therefore tend to consist of noisy observations of the function. For example, we may wish to predict whether an individual is infected with a particular disease based on various measurements that have been recorded over time. A particular example is determining the relationship between magnetic resonance imaging (MRI) data and health outcomes (Goldsmith et al., 2012).

This type of problem is considered by Woodard et al. (2013). For each subject $i$, we assume we have noisy observations $\{W_i(x_{ik})\}_{k=1}^K$ from a function $f_i(x)$. In order to fully account for uncertainty in the estimation of each function, Woodard et al. (2013) introduce a hierarchical model. The model first produces an estimate $\hat{f}_i(x)$ of each function using the noisy observations; the response variable $Y_i$ is then regressed against statistical summaries of these estimates. Suppose that in producing an estimate of $f_i(x)$ we are required to estimate a set of parameters $\omega_i$ for each subject, and that regressing $Y_i$ on summaries of $\omega_i$ requires the estimation of a further set of parameters $\phi$. Let the vector of noisy observations for each subject, $\{W_i(x_{ik})\}_{k=1}^K$,
be denoted by $W_i$, and the matrix of all observations be denoted $W$. Woodard et al. (2013) take a Bayesian approach and aim to sample from the posterior distribution given by
$$p(\{\omega_i\}_{i=1}^n, \phi \mid W, Y) \propto p(\phi) \prod_{i=1}^n p(\omega_i) \, p(Y_i \mid \omega_i, \phi) \, p(W_i \mid \omega_i),$$
where $Y$ is the vector of responses. Typically, accurate estimation of $f_i(x)$ requires a large number of observations, so when there are a large number of subjects this problem becomes infeasible due to computational expense. In order to account for this, Woodard et al. (2013) suggest decomposing the posterior as follows:
$$p(\{\omega_i\}_{i=1}^n, \phi \mid W, Y) = p(\{\omega_i\}_{i=1}^n \mid W, Y) \, p(\phi \mid \{\omega_i\}_{i=1}^n, Y).$$
They then suggest, assuming independence across subjects $i$, applying an approximation known as modularization (Liu et al., 2009) to obtain
$$p(\{\omega_i\}_{i=1}^n \mid W, Y) \approx p(\{\omega_i\}_{i=1}^n \mid W) = \prod_{i=1}^n p(\omega_i \mid W_i).$$
This approximation allows us to estimate the function $f_i(x)$ for each subject in parallel. Functional estimation is the slower part of the algorithm, so this speeds up the algorithm considerably. However, splitting the posterior in this way loses valuable information. An alternative approach would be to split the posterior distribution into subposteriors $p(\omega_i, \phi \mid W_i, Y_i)$ by subject, as follows:
$$p(\{\omega_i\}_{i=1}^n, \phi \mid W, Y) \propto \prod_{i=1}^n p^{1/n}(\phi) \, p(\omega_i) \, p(Y_i \mid \omega_i, \phi) \, p(W_i \mid \omega_i) \propto \prod_{i=1}^n p(\omega_i, \phi \mid W_i, Y_i).$$
These subposteriors can then be simulated from in parallel using the methodology outlined in Woodard et al. (2013). Batch methods could then be used to produce an estimate for $\phi$, the parameter of interest. This is a particularly suitable use for batch methods, since the dimensionality of $\phi$ is likely to be considerably lower than that of the $\omega_i$. A review of the performance of this idea would be useful, especially if its performance is found to give a significant improvement over solutions which use modularization.
References

Ahn, S., Korattikara, A., and Welling, M. (2012). Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proceedings of the 29th International Conference on Machine Learning (ICML-12).

Ahn, S., Shahbaba, B., and Welling, M. (2014). Distributed stochastic gradient MCMC. In Proceedings of the 31st International Conference on Machine Learning (ICML-14).

Bardenet, R., Doucet, A., and Holmes, C. (2015). On Markov chain Monte Carlo methods for tall data. arXiv preprint.

Betancourt, M. (2015). The fundamental incompatibility of scalable Hamiltonian Monte Carlo and naive data subsampling. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15).

Beygelzimer, A., Kakadet, S., Langford, J., Arya, S., Mount, D., and Li, S. (2013). FNN: fast nearest neighbor search algorithms and applications. R package version 1.1.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Boltz, S., Debreuve, E., and Barlaud, M. (2007). kNN-based high-dimensional Kullback-Leibler distance for tracking. In Eighth International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS-07). IEEE.

Chen, T., Fox, E. B., and Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. arXiv preprint.

Ding, N., Fang, Y., Babbush, R., Chen, C., Skeel, R. D., and Neven, H. (2014). Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems.

Duong, T. (2004). Bandwidth selectors for multivariate kernel density estimation. PhD thesis, University of Western Australia.

Girolami, M. and Calderhead, B. (2011). Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2).

Goldsmith, J., Bobb, J., Crainiceanu, C. M., Caffo, B., and Reich, D. (2012). Penalized functional regression. Journal of Computational and Graphical Statistics.

Hjort, N. L. and Glad, I. K. (1995). Nonparametric density estimation with a parametric start. The Annals of Statistics.

Hoffman, M. D. and Gelman, A. (2014). The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. The Journal of Machine Learning Research, 15(1).
Ihler, A. T., Sudderth, E. B., Freeman, W. T., and Willsky, A. S. (2004). Efficient multiscale sampling from products of Gaussian mixtures. Advances in Neural Information Processing Systems, 16:1-8.

Le Cam, L. (2012). Asymptotic Methods in Statistical Decision Theory. Springer Science & Business Media.

Liu, F., Bayarri, M., Berger, J., et al. (2009). Modularization in Bayesian analysis, with emphasis on analysis of computer models. Bayesian Analysis, 4(1).

Ma, Y.-A., Chen, T., and Fox, E. B. (2015). A complete recipe for stochastic gradient MCMC. arXiv preprint.

Minsker, S., Srivastava, S., Lin, L., and Dunson, D. (2014). Scalable and robust Bayesian inference via the median posterior. In Proceedings of the 31st International Conference on Machine Learning (ICML-14).

Miroshnikov, A. and Conlon, E. (2015). parallelMCMCcombine: Methods for combining independent subset MCMC posterior samples to estimate a full posterior density. R package version 1.0.

Neal, R. M. (1996). Sampling from multimodal distributions using tempered transitions. Statistics and Computing, 6(4).

Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2).

Neal, R. M. (2010). MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo. Chapman & Hall.

Neiswanger, W., Wang, C., and Xing, E. (2013). Asymptotically exact, embarrassingly parallel MCMC. arXiv preprint.

Patterson, S. and Teh, Y. W. (2013). Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems.

Rabinovich, M., Angelino, E., and Jordan, M. I. (2015). Variational consensus Monte Carlo. arXiv preprint.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics.

Roberts, G. O., Rosenthal, J. S., et al. (2001). Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science, 16(4).

Rudoy, D. and Wolfe, P. J. (2007). Multi-scale MCMC methods for sampling from products of Gaussian mixtures. In Acoustics, Speech and Signal Processing (ICASSP 2007), IEEE International Conference on, volume 3. IEEE.
Sato, I. and Nakagawa, H. (2014). Approximation analysis of stochastic gradient Langevin dynamics by using Fokker-Planck equation and Ito process. In Proceedings of the 31st International Conference on Machine Learning (ICML-14).

Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H., George, E., and McCulloch, R. (2013). Bayes and big data: The consensus Monte Carlo algorithm. In EFaBBayes 250 conference, volume 16.

Srivastava, S., Cevher, V., Tran-Dinh, Q., and Dunson, D. B. (2015). WASP: Scalable Bayes via barycenters of subset posteriors. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics.

Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13).

Teh, Y. W., Thiéry, A., and Vollmer, S. (2014). Consistency and fluctuations for stochastic gradient Langevin dynamics. arXiv preprint.

Vollmer, S. J., Zygalakis, K. C., et al. (2015). (Non-)asymptotic properties of stochastic gradient Langevin dynamics. arXiv preprint.

Wang, X. and Dunson, D. B. (2013). Parallelizing MCMC via Weierstrass sampler. arXiv preprint.

Wang, X., Guo, F., Heller, K. A., and Dunson, D. B. (2015). Parallelizing MCMC with random partition trees. arXiv preprint.

Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11).

Woodard, D. B., Crainiceanu, C., and Ruppert, D. (2013). Hierarchical adaptive regression kernels for regression with functional predictors. Journal of Computational and Graphical Statistics, 22(4).

Wu, J. (2004). Some properties of the Gaussian distribution. Georgia Institute of Technology.
2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR)
2DI36 Statistics 2DI36 Part II (Chapter 7 of MR) What Have we Done so Far? Last time we introduced the concept of a dataset and seen how we can represent it in various ways But, how did this dataset came
Spatial Statistics Chapter 3 Basics of areal data and areal data modeling
Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data
Markov Chain Monte Carlo Simulation Made Simple
Markov Chain Monte Carlo Simulation Made Simple Alastair Smith Department of Politics New York University April2,2003 1 Markov Chain Monte Carlo (MCMC) simualtion is a powerful technique to perform numerical
Bayesian Statistics in One Hour. Patrick Lam
Bayesian Statistics in One Hour Patrick Lam Outline Introduction Bayesian Models Applications Missing Data Hierarchical Models Outline Introduction Bayesian Models Applications Missing Data Hierarchical
CSCI567 Machine Learning (Fall 2014)
CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /
15.062 Data Mining: Algorithms and Applications Matrix Math Review
.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop
Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh
Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem
Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering
Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering Department of Industrial Engineering and Management Sciences Northwestern University September 15th, 2014
Centre for Central Banking Studies
Centre for Central Banking Studies Technical Handbook No. 4 Applied Bayesian econometrics for central bankers Andrew Blake and Haroon Mumtaz CCBS Technical Handbook No. 4 Applied Bayesian econometrics
11. Time series and dynamic linear models
11. Time series and dynamic linear models Objective To introduce the Bayesian approach to the modeling and forecasting of time series. Recommended reading West, M. and Harrison, J. (1997). models, (2 nd
Dealing with large datasets
Dealing with large datasets (by throwing away most of the data) Alan Heavens Institute for Astronomy, University of Edinburgh with Ben Panter, Rob Tweedie, Mark Bastin, Will Hossack, Keith McKellar, Trevor
Principle of Data Reduction
Chapter 6 Principle of Data Reduction 6.1 Introduction An experimenter uses the information in a sample X 1,..., X n to make inferences about an unknown parameter θ. If the sample size n is large, then
Inference on Phase-type Models via MCMC
Inference on Phase-type Models via MCMC with application to networks of repairable redundant systems Louis JM Aslett and Simon P Wilson Trinity College Dublin 28 th June 202 Toy Example : Redundant Repairable
Imperfect Debugging in Software Reliability
Imperfect Debugging in Software Reliability Tevfik Aktekin and Toros Caglar University of New Hampshire Peter T. Paul College of Business and Economics Department of Decision Sciences and United Health
A Learning Based Method for Super-Resolution of Low Resolution Images
A Learning Based Method for Super-Resolution of Low Resolution Images Emre Ugur June 1, 2004 [email protected] Abstract The main objective of this project is the study of a learning based method
Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University [email protected]
Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University [email protected] 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian
Component Ordering in Independent Component Analysis Based on Data Power
Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals
Parallelization Strategies for Multicore Data Analysis
Parallelization Strategies for Multicore Data Analysis Wei-Chen Chen 1 Russell Zaretzki 2 1 University of Tennessee, Dept of EEB 2 University of Tennessee, Dept. Statistics, Operations, and Management
Supplement to Call Centers with Delay Information: Models and Insights
Supplement to Call Centers with Delay Information: Models and Insights Oualid Jouini 1 Zeynep Akşin 2 Yves Dallery 1 1 Laboratoire Genie Industriel, Ecole Centrale Paris, Grande Voie des Vignes, 92290
Marketing Mix Modelling and Big Data P. M Cain
1) Introduction Marketing Mix Modelling and Big Data P. M Cain Big data is generally defined in terms of the volume and variety of structured and unstructured information. Whereas structured data is stored
An Introduction to Machine Learning
An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia [email protected] Tata Institute, Pune,
BayesX - Software for Bayesian Inference in Structured Additive Regression
BayesX - Software for Bayesian Inference in Structured Additive Regression Thomas Kneib Faculty of Mathematics and Economics, University of Ulm Department of Statistics, Ludwig-Maximilians-University Munich
Linear Classification. Volker Tresp Summer 2015
Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong
Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget
Anoop Korattikara [email protected] School of Information & Computer Sciences, University of California, Irvine, CA 92617, USA Yutian Chen [email protected] Department of Engineering, University of
PS 271B: Quantitative Methods II. Lecture Notes
PS 271B: Quantitative Methods II Lecture Notes Langche Zeng [email protected] The Empirical Research Process; Fundamental Methodological Issues 2 Theory; Data; Models/model selection; Estimation; Inference.
Maximum Likelihood Estimation
Math 541: Statistical Theory II Lecturer: Songfeng Zheng Maximum Likelihood Estimation 1 Maximum Likelihood Estimation Maximum likelihood is a relatively simple method of constructing an estimator for
Big Data need Big Model 1/44
Big Data need Big Model 1/44 Andrew Gelman, Bob Carpenter, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, Allen Riddell,... Department of Statistics,
Parameter estimation for nonlinear models: Numerical approaches to solving the inverse problem. Lecture 12 04/08/2008. Sven Zenker
Parameter estimation for nonlinear models: Numerical approaches to solving the inverse problem Lecture 12 04/08/2008 Sven Zenker Assignment no. 8 Correct setup of likelihood function One fixed set of observation
A Coefficient of Variation for Skewed and Heavy-Tailed Insurance Losses. Michael R. Powers[ 1 ] Temple University and Tsinghua University
A Coefficient of Variation for Skewed and Heavy-Tailed Insurance Losses Michael R. Powers[ ] Temple University and Tsinghua University Thomas Y. Powers Yale University [June 2009] Abstract We propose a
Bootstrapping Big Data
Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu
Department of Economics
Department of Economics On Testing for Diagonality of Large Dimensional Covariance Matrices George Kapetanios Working Paper No. 526 October 2004 ISSN 1473-0278 On Testing for Diagonality of Large Dimensional
Methods of Data Analysis Working with probability distributions
Methods of Data Analysis Working with probability distributions Week 4 1 Motivation One of the key problems in non-parametric data analysis is to create a good model of a generating probability distribution,
How To Understand The Theory Of Probability
Graduate Programs in Statistics Course Titles STAT 100 CALCULUS AND MATR IX ALGEBRA FOR STATISTICS. Differential and integral calculus; infinite series; matrix algebra STAT 195 INTRODUCTION TO MATHEMATICAL
CS 688 Pattern Recognition Lecture 4. Linear Models for Classification
CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(
An Introduction to Using WinBUGS for Cost-Effectiveness Analyses in Health Economics
Slide 1 An Introduction to Using WinBUGS for Cost-Effectiveness Analyses in Health Economics Dr. Christian Asseburg Centre for Health Economics Part 1 Slide 2 Talk overview Foundations of Bayesian statistics
Nonparametric adaptive age replacement with a one-cycle criterion
Nonparametric adaptive age replacement with a one-cycle criterion P. Coolen-Schrijner, F.P.A. Coolen Department of Mathematical Sciences University of Durham, Durham, DH1 3LE, UK e-mail: [email protected]
Multivariate Normal Distribution
Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues
Detection of changes in variance using binary segmentation and optimal partitioning
Detection of changes in variance using binary segmentation and optimal partitioning Christian Rohrbeck Abstract This work explores the performance of binary segmentation and optimal partitioning in the
Monte Carlo Simulation
1 Monte Carlo Simulation Stefan Weber Leibniz Universität Hannover email: [email protected] web: www.stochastik.uni-hannover.de/ sweber Monte Carlo Simulation 2 Quantifying and Hedging
Gaussian Processes in Machine Learning
Gaussian Processes in Machine Learning Carl Edward Rasmussen Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany [email protected] WWW home page: http://www.tuebingen.mpg.de/ carl
Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression
Logistic Regression Department of Statistics The Pennsylvania State University Email: [email protected] Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max
Making Sense of the Mayhem: Machine Learning and March Madness
Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University [email protected] [email protected] I. Introduction III. Model The goal of our research
Globally Optimal Crowdsourcing Quality Management
Globally Optimal Crowdsourcing Quality Management Akash Das Sarma Stanford University [email protected] Aditya G. Parameswaran University of Illinois (UIUC) [email protected] Jennifer Widom Stanford
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written
BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I
BNG 202 Biomechanics Lab Descriptive statistics and probability distributions I Overview The overall goal of this short course in statistics is to provide an introduction to descriptive and inferential
Credit Risk Models: An Overview
Credit Risk Models: An Overview Paul Embrechts, Rüdiger Frey, Alexander McNeil ETH Zürich c 2003 (Embrechts, Frey, McNeil) A. Multivariate Models for Portfolio Credit Risk 1. Modelling Dependent Defaults:
CHAPTER 2 Estimating Probabilities
CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a
Model-based Synthesis. Tony O Hagan
Model-based Synthesis Tony O Hagan Stochastic models Synthesising evidence through a statistical model 2 Evidence Synthesis (Session 3), Helsinki, 28/10/11 Graphical modelling The kinds of models that
The Optimality of Naive Bayes
The Optimality of Naive Bayes Harry Zhang Faculty of Computer Science University of New Brunswick Fredericton, New Brunswick, Canada email: hzhang@unbca E3B 5A3 Abstract Naive Bayes is one of the most
Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler
Machine Learning and Data Mining Regression Problem (adapted from) Prof. Alexander Ihler Overview Regression Problem Definition and define parameters ϴ. Prediction using ϴ as parameters Measure the error
Monte Carlo and Empirical Methods for Stochastic Inference (MASM11/FMS091)
Monte Carlo and Empirical Methods for Stochastic Inference (MASM11/FMS091) Magnus Wiktorsson Centre for Mathematical Sciences Lund University, Sweden Lecture 5 Sequential Monte Carlo methods I February
Stochastic Gradient Method: Applications
Stochastic Gradient Method: Applications February 03, 2015 P. Carpentier Master MMMEF Cours MNOS 2014-2015 114 / 267 Lecture Outline 1 Two Elementary Exercices on the Stochastic Gradient Two-Stage Recourse
How I won the Chess Ratings: Elo vs the rest of the world Competition
How I won the Chess Ratings: Elo vs the rest of the world Competition Yannis Sismanis November 2010 Abstract This article discusses in detail the rating system that won the kaggle competition Chess Ratings:
Exploratory Data Analysis
Exploratory Data Analysis Johannes Schauer [email protected] Institute of Statistics Graz University of Technology Steyrergasse 17/IV, 8010 Graz www.statistics.tugraz.at February 12, 2008 Introduction
1 Maximum likelihood estimation
COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N
Statistics in Retail Finance. Chapter 6: Behavioural models
Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin
Maximum likelihood estimation of mean reverting processes
Maximum likelihood estimation of mean reverting processes José Carlos García Franco Onward, Inc. [email protected] Abstract Mean reverting processes are frequently used models in real options. For
Linear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S
Linear smoother ŷ = S y where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S 2 Online Learning: LMS and Perceptrons Partially adapted from slides by Ryan Gabbard
Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus
Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives
Generating Random Numbers Variance Reduction Quasi-Monte Carlo. Simulation Methods. Leonid Kogan. MIT, Sloan. 15.450, Fall 2010
Simulation Methods Leonid Kogan MIT, Sloan 15.450, Fall 2010 c Leonid Kogan ( MIT, Sloan ) Simulation Methods 15.450, Fall 2010 1 / 35 Outline 1 Generating Random Numbers 2 Variance Reduction 3 Quasi-Monte
Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)
Summary of Formulas and Concepts Descriptive Statistics (Ch. 1-4) Definitions Population: The complete set of numerical information on a particular quantity in which an investigator is interested. We assume
Using simulation to calculate the NPV of a project
Using simulation to calculate the NPV of a project Marius Holtan Onward Inc. 5/31/2002 Monte Carlo simulation is fast becoming the technology of choice for evaluating and analyzing assets, be it pure financial
Using Mixtures-of-Distributions models to inform farm size selection decisions in representative farm modelling. Philip Kostov and Seamus McErlean
Using Mixtures-of-Distributions models to inform farm size selection decisions in representative farm modelling. by Philip Kostov and Seamus McErlean Working Paper, Agricultural and Food Economics, Queen
1 Prior Probability and Posterior Probability
Math 541: Statistical Theory II Bayesian Approach to Parameter Estimation Lecturer: Songfeng Zheng 1 Prior Probability and Posterior Probability Consider now a problem of statistical inference in which
Optimizing Prediction with Hierarchical Models: Bayesian Clustering
1 Technical Report 06/93, (August 30, 1993). Presidencia de la Generalidad. Caballeros 9, 46001 - Valencia, Spain. Tel. (34)(6) 386.6138, Fax (34)(6) 386.3626, e-mail: [email protected] Optimizing Prediction
Introduction to General and Generalized Linear Models
Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby
Introduction to Monte Carlo. Astro 542 Princeton University Shirley Ho
Introduction to Monte Carlo Astro 542 Princeton University Shirley Ho Agenda Monte Carlo -- definition, examples Sampling Methods (Rejection, Metropolis, Metropolis-Hasting, Exact Sampling) Markov Chains
Statistics 104: Section 6!
Page 1 Statistics 104: Section 6! TF: Deirdre (say: Dear-dra) Bloome Email: [email protected] Section Times Thursday 2pm-3pm in SC 109, Thursday 5pm-6pm in SC 705 Office Hours: Thursday 6pm-7pm SC
