# Computational Statistics for Big Data


Lancaster University

Author: ¹

Supervisors: Paul Fearnhead¹, Emily Fox²

¹ Lancaster University, ² The University of Washington

September 1, 2015

## Abstract

The amount of data stored by organisations and individuals is growing at an astonishing rate. As statistical models grow in complexity and size, traditional machine learning algorithms are struggling to scale to the large datasets required for model fitting. Markov chain Monte Carlo (MCMC) is one algorithm that has been left behind, even though it has proven to be an invaluable tool for training complex statistical models. This report discusses a number of possible solutions that enable MCMC to scale more effectively to large datasets. We focus on two particular solutions to this problem: batch methods and stochastic gradient Monte Carlo methods.

Batch methods split the full dataset into disjoint subsets and run traditional MCMC on each subset. The difficulty with these methods lies in recombining the MCMC output from each subset so that the result is a close approximation to the posterior given the full dataset. Stochastic gradient Monte Carlo approximately samples from the full posterior but uses only a subsample of the data at each iteration. It does this by combining two key ideas: stochastic optimization, an algorithm that finds the mode of the posterior using only a subset of the data at each iteration; and Hamiltonian Monte Carlo, a method that provides efficient proposals with high acceptance rates for Metropolis-Hastings algorithms. After discussing the methods and important extensions, we perform a simulation study, which compares the methods and how they are affected by various model properties.

## Contents

1. Introduction
   - An overview of methods
   - Report outline
2. Batch methods
   - Introduction
   - Splitting the data
   - Efficiently sampling from products of Gaussian mixtures
   - Parametric recombination methods
   - Nonparametric methods
   - Semiparametric methods
   - Conclusion
3. Stochastic gradient methods
   - Introduction
   - Stochastic optimization
   - Hamiltonian Monte Carlo
   - Stochastic gradient Langevin Monte Carlo
   - Stochastic gradient Hamiltonian Monte Carlo
   - Conclusion
4. Simulation study
   - Introduction
   - Batch methods
   - Stochastic gradient methods
   - Conclusion
5. Future Work
   - Introduction
   - Further comparison of batch methods
   - Tuning guidance for stochastic gradient methods
   - Using batch methods to analyse complex hierarchical models

## 1 Introduction

As the amount of data stored by individuals and organisations grows, statistical models have advanced in complexity and size. Much statistical methodology has historically focussed on fitting models with limited data. Now we face the opposite problem: we have so much data that traditional statistical methods struggle to cope and run exceptionally slowly. These problems have led to a rapidly evolving area of statistics and machine learning that develops algorithms which remain scalable as the size of the data increases.

The size of data is generally used to mean one of two things: the dimensionality of the data or the number of observations. In this report we focus on methods designed to scale as the number of observations increases. Data with a large number of observations is often referred to as tall data.

Currently, large scale machine learning models are mainly trained using optimization methods such as stochastic optimization. These algorithms are used mainly for their speed: they train models quickly even when a huge number of observations is available. The speed comes from the fact that, at each iteration, the algorithms use only a subset of the available data. The downside is that these methods only find local maxima of the posterior distribution, meaning they only produce a point estimate, which can lead to overfitting. A key appeal of Bayesian methods is that they produce a whole distribution of possible parameter values, which allows uncertainty to be quantified and reduces the risk of overfitting. While parameter uncertainty can be approximated using stochastic optimization, for complex models this approximation can be very poor.

Generally the Bayesian posterior distribution is simulated from using statistical algorithms known as Markov chain Monte Carlo (MCMC).
The problem is that these algorithms require calculations over the whole dataset at each iteration, which makes them slow for large datasets. The next generation of MCMC algorithms, which scale to large datasets, therefore needs to be developed.

### 1.1 An overview of methods

We begin this section with a more formal statement of the problem. Suppose we wish to train a model with probability density $p(x \mid \theta)$, where $\theta$ is an unknown parameter vector and $x$ is the data. Let the likelihood of the model be $p(x \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$ and the prior for the parameter be $p(\theta)$. Our interest is in the posterior $p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta)$, which quantifies the most likely values of $\theta$ given the data $x$.

Commonly we simulate from the posterior using the Metropolis-Hastings (MH) algorithm, arguably the most popular MCMC algorithm. At each iteration, given a current state $\theta$, the algorithm proposes a new state $\theta'$ from some proposal $q(\cdot)$. This new state is then accepted as part of the sample with probability

$$\alpha = \min\left\{1, \frac{q(\theta)\, p(\theta' \mid x)}{q(\theta')\, p(\theta \mid x)}\right\} = \min\left\{1, \frac{q(\theta)\, p(x \mid \theta')\, p(\theta')}{q(\theta')\, p(x \mid \theta)\, p(\theta)}\right\}.$$

Notice that at each iteration, the MH algorithm requires calculation of the likelihood at the new state $\theta'$. This requires a computation over the whole dataset, which is infeasibly slow when $N$ is large. This is the key bottleneck in Metropolis-Hastings, and other MCMC algorithms, when they are used with large datasets.

A number of solutions have been proposed for this problem, and they can generally be divided into three categories. We refer to these categories as batch methods, stochastic gradient methods and subsampling methods.

Batch methods aim to make use of recent hardware developments which make the parallelisation of computational work more accessible. They split the dataset $x$ into disjoint batches $x_{B_1}, \dots, x_{B_S}$. The structure of the posterior allows separate MCMC algorithms to be run on these batches in parallel in order to simulate from each subposterior $p(\theta \mid x_{B_s}) \propto p(\theta)^{1/S}\, p(x_{B_s} \mid \theta)$. These simulations must then be combined in order to generate a sample which approximates the full posterior $p(\theta \mid x)$. This is where the main challenge lies.

Stochastic gradient methods make use of sophisticated proposals that have been suggested for MCMC. These methods use gradients of the log posterior in order to suggest new states which have very high acceptance rates. When the free constants of these proposals are tuned in a certain way, the acceptance rates can be so high that we can drop the acceptance step and still sample from a good approximation to the posterior. However, the gradient calculation still requires a computation over the whole dataset. The gradients of the log posterior therefore need to be estimated using only a subsample of the data, which introduces extra noise.

Subsampling methods keep the MCMC algorithm largely as it is but use only a subset of the data in the acceptance step at each iteration. Certain methods allow this to be done while still sampling from the true posterior distribution, but this advantage often comes at the cost of poor mixing. Other methods achieve the result by introducing controlled biases; these methods often mix better.
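
To make the bottleneck in the Metropolis-Hastings acceptance step concrete, the following is a minimal sketch rather than the implementation used in the report. It assumes a scalar parameter, a Gaussian likelihood with unit variance, an illustrative $\mathcal{N}(0, 10^2)$ prior and a symmetric random-walk proposal (so that $q$ cancels in the acceptance ratio). The function names and default values are hypothetical.

```python
import math
import random


def log_likelihood(theta, data):
    """Gaussian log-likelihood (unit variance, up to a constant): one term per observation."""
    return sum(-0.5 * (x - theta) ** 2 for x in data)


def log_prior(theta):
    """Illustrative N(0, 10^2) prior, up to a constant (an assumption of this sketch)."""
    return -0.5 * (theta / 10.0) ** 2


def metropolis_hastings(data, n_iter=2000, step=0.2, seed=0):
    """Random-walk Metropolis-Hastings for a scalar parameter."""
    rng = random.Random(seed)
    theta = 0.0
    # One O(N) pass over the data for the initial state.
    log_post = log_prior(theta) + log_likelihood(theta, data)
    samples = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)  # symmetric proposal
        # Every iteration needs another O(N) pass for the proposed state.
        log_post_prop = log_prior(proposal) + log_likelihood(proposal, data)
        # Accept with probability min(1, posterior ratio), computed on the log scale.
        if math.log(1.0 - rng.random()) < log_post_prop - log_post:
            theta, log_post = proposal, log_post_prop
        samples.append(theta)
    return samples
```

The call to `log_likelihood` inside the loop is the step whose cost grows linearly with $N$; the batch, stochastic gradient and subsampling methods are different ways of avoiding it.
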

### 1.2 Report outline

This report provides a review of the batch methods and stochastic gradient methods outlined in Section 1.1. The reviewed methods are then implemented and compared under a variety of scenarios.

In Section 2 we discuss batch methods, including parametric contributions by Scott et al. (2013) and Neiswanger et al. (2013), nonparametric and semiparametric methods introduced by Neiswanger et al. (2013), as well as more recent developments.

Section 3 reviews stochastic gradient methods, including the stochastic gradient Langevin dynamics (SGLD) algorithm of Welling and Teh (2011) and the stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm of Chen et al. (2014). We consider the stochastic optimization methods currently employed to train algorithms which rely on large datasets, and provide an introduction to Hamiltonian Monte Carlo, which is used to produce proposals for the SGHMC algorithm. Finally we examine the literature which provides further theoretical results for the algorithms, as well as proposed improvements.

In Section 4 the algorithms reviewed in the report are compared; code for the implementations is available on GitHub. A relatively simple model is used for comparison, a multivariate t-distribution. Therefore, in order to really test the methods, the number of observations is kept small. First the effect of bandwidth choice for the nonparametric and semiparametric methods is investigated. The effects of the number of observations and of the dimensionality of the target on performance are then compared for all the methods. The batch size for the batch methods and the subsample size for the stochastic gradient methods are considered too.

## 2 Batch methods

### 2.1 Introduction

In order to speed up MCMC, it is natural to consider parallelisation. Advances in hardware allow many jobs to be run in parallel over separate cores, and these advances have been used to speed up many other computationally intensive algorithms. Parallelising MCMC has proven difficult, however, since MCMC is inherently sequential in nature and parallelisation requires minimal communication between machines.

A natural way to parallelise MCMC is to split the data into different subsets. MCMC for each subset is then run separately on different machines. In this case the main problem is how to recombine the MCMC samples from each subset while ensuring the final sample is as close as possible to the true posterior. In this section, we discuss parametric and nonparametric methods suggested to do this.

### 2.2 Splitting the data

Suppose we have $N$ i.i.d. data points $x$. We wish to investigate a model with probability density $p(x \mid \theta)$, where $\theta$ is an unknown parameter vector. Let the likelihood be $p(x \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$ and the prior we assign to $\theta$ be $p(\theta)$. Then the full posterior for the data $p(\theta \mid x)$ is given by

$$p(\theta \mid x) \propto p(\theta)\, p(x \mid \theta). \qquad (2.1)$$

Let $B_1, \dots, B_S$ be a partition of $\{1, \dots, N\}$, and let $x_{B_i}$ be the corresponding set of data points $x_{B_i} = \{x_i : i \in B_i\}$. We refer to $x_{B_i}$ as the $i$-th batch of data. We can rewrite (2.1) as

$$p(\theta \mid x) \propto p(\theta) \prod_{s=1}^{S} p(x_{B_s} \mid \theta) = \prod_{s=1}^{S} p(\theta)^{1/S}\, p(x_{B_s} \mid \theta).$$

For brevity we will write the $S$ batches of data as $x_1, \dots, x_S$ from now on. Let us define the subposterior $p(\theta \mid x_s)$ by

$$p(\theta \mid x_s) \propto p(\theta)^{1/S}\, p(x_s \mid \theta).$$

Therefore we have that $p(\theta \mid x) \propto \prod_{s=1}^{S} p(\theta \mid x_s)$. The idea of batch methods for big data is to run MCMC separately to sample from each subposterior.
These samples are then combined in some way so that the final sample follows the full posterior $p(\theta \mid x)$ as closely as possible.

### 2.3 Efficiently sampling from products of Gaussian mixtures

Before we outline recombination methods in more detail, we discuss certain important properties of the multivariate Normal distribution which will prove useful later.

Suppose we have $S$ multivariate Normal densities $\mathcal{N}(\theta \mid \mu_s, \Sigma_s)$ for $s \in \{1, \dots, S\}$. Then Wu (2004) shows that their product can be written, up to a constant of proportionality, as

$$\prod_{s=1}^{S} \mathcal{N}(\theta \mid \mu_s, \Sigma_s) \propto \mathcal{N}(\theta \mid \mu, \Sigma),$$

where

$$\Sigma = \left( \sum_{s=1}^{S} \Sigma_s^{-1} \right)^{-1}, \qquad \mu = \Sigma \left( \sum_{s=1}^{S} \Sigma_s^{-1} \mu_s \right). \qquad (2.2)$$

Now suppose we have a set of $S$ Gaussian mixtures $\{p_s(\theta)\}_{s=1}^{S}$,

$$p_s(\theta) = \sum_{m=1}^{M} \omega_{m,s}\, \mathcal{N}(\theta \mid \mu_{m,s}, \Sigma_s),$$

where $\omega_{m,s}$ denote the mixture weights. For simplicity we assume that the number of components in each mixture is the same and that the Gaussian components within a mixture share a common variance which is diagonal. We wish to sample from the product of these Gaussian mixtures,

$$p(\theta) \propto \prod_{s=1}^{S} p_s(\theta). \qquad (2.3)$$

It can be shown using induction that

$$\prod_{s=1}^{S} \sum_{m=1}^{M} \omega_{m,s}\, \mathcal{N}(\theta \mid \mu_{m,s}, \Sigma_s) = \sum_{l_1} \cdots \sum_{l_S} \prod_{s=1}^{S} \omega_{l_s,s}\, \mathcal{N}(\theta \mid \mu_{l_s,s}, \Sigma_s),$$

where we label each component of the sum using $L = (l_1, \dots, l_S)$, with $l_s \in \{1, \dots, M\}$. It follows from this, and from the results above about products of Gaussians, that (2.3) is equivalent to a Gaussian mixture with $M^S$ mixture components. Sampling from this product can therefore be performed exactly in two steps. First we sample one of the $M^S$ components of the mixture according to its weight, then we draw a sample from the corresponding Gaussian component (Ihler et al., 2004).

The parameters of the $L$-th Gaussian component can be calculated using (2.2) and are given by

$$\Sigma_L = \left( \sum_{s=1}^{S} \Sigma_s^{-1} \right)^{-1}, \qquad \mu_L = \Sigma_L \left( \sum_{s=1}^{S} \Sigma_s^{-1} \mu_{l_s,s} \right).$$

The unnormalised weight of the $L$-th mixture component is given by (Ihler et al., 2004)

$$\omega_L \propto \frac{\prod_{s=1}^{S} \omega_{l_s,s}\, \mathcal{N}(\theta \mid \mu_{l_s,s}, \Sigma_s)}{\mathcal{N}(\theta \mid \mu_L, \Sigma_L)}.$$
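
As an illustration of (2.2) and of the exact enumeration of the product mixture, here is a small sketch for the one-dimensional case, where covariance matrices reduce to scalar variances. The function names are hypothetical, and the weight ratio is evaluated at $\theta = \mu_L$, which is permissible because the ratio does not depend on $\theta$.

```python
import itertools
import math


def normal_pdf(x, mean, var):
    """Density of a one-dimensional Normal with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gaussian_product(means, variances):
    """Equation (2.2) in one dimension: precisions add, means are precision-weighted."""
    precision = sum(1.0 / v for v in variances)
    var = 1.0 / precision
    mean = var * sum(m / v for m, v in zip(means, variances))
    return mean, var


def mixture_product(mixtures):
    """Enumerate all M^S components of a product of S Gaussian mixtures.

    Each mixture is a list of (weight, mean, variance) triples; the result is a
    list of (normalised weight, mean, variance) triples.
    """
    components = []
    for combo in itertools.product(*mixtures):  # one label L = (l_1, ..., l_S)
        mu, var = gaussian_product([c[1] for c in combo], [c[2] for c in combo])
        # Unnormalised weight of component L, evaluated at theta = mu.
        weight = math.prod(w * normal_pdf(mu, m, v) for w, m, v in combo)
        weight /= normal_pdf(mu, mu, var)
        components.append((weight, mu, var))
    total = sum(w for w, _, _ in components)  # normalising constant Z
    return [(w / total, mu, var) for w, mu, var in components]
```

The loop over `itertools.product` visits every label, so the cost of this sketch grows as $M^S$, which is why it is only practical for small examples.
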

In order to use this exact method we need to calculate the normalising constant for the weights, $Z = \sum_L \omega_L$. As $M$ and $S$ grow this exact sampling method becomes computationally infeasible, as both the calculation of $Z$ and the drawing of a sample from $p(\cdot)$ take $O(M^S)$ time. This fact, along with the memory requirements, means that sampling from $p(\theta)$ using the exact method quickly becomes impossible.

A number of methods have been proposed for cases where exact sampling from the mixture is infeasible. For a review, see Ihler et al. (2004). A common approach is to use a Gibbs sampling style approach. At each iteration, $S - 1$ of the labels $l_i$ are fixed, while one label, call it $l_j$, is sampled from the corresponding conditional density $p(l_j \mid l_{-j})$. The notation $l_{-j}$ refers to $\{l_i : i \in \{1, \dots, S\},\ i \neq j\}$. After a fixed number of new label values have been drawn, a sample is drawn from the mixture component indicated by the current label values. While this approach often produces good results, multimodality means it can require a large number of samples before it accurately represents the true mixture density. A number of suggestions have been made to improve this standard Gibbs sampling approach, for example using multiscale sampling (Ihler et al., 2004) and parallel tempering (Rudoy and Wolfe, 2007).

### 2.4 Parametric recombination methods

A number of methods have been proposed to recombine subposterior samples which exactly target the full posterior $p(\theta \mid x)$ when it is Normally distributed. We refer to these methods as parametric. Intuition for why this assumption might be valid for a large class of models comes from the Bernstein-von Mises Theorem (Le Cam, 2012), which is a central limit theorem for Bayesian statistics. Assuming suitable regularity conditions, and that the data is realised from a unique true parameter value $\theta_0$, the theorem states that the posterior for the data tends to a Normal distribution centred around $\theta_0$.
In particular, for large $N$ the posterior is found to be well approximated by $\mathcal{N}(\theta_0, I^{-1}(\theta_0))$, where $I(\theta)$ is Fisher's information matrix. Since we are aiming to efficiently sample from models with large amounts of data, this approximation appears to be particularly relevant.

Neiswanger et al. (2013) propose to combine samples by approximating each subposterior using a Normal distribution, and then using results for products of Gaussians in order to combine these approximations. Let $\hat\mu_s$ and $\hat\Sigma_s$ denote the sample mean and sample variance of the MCMC output for batch $s$. Then we can approximate the distribution of each subposterior by $\mathcal{N}(\hat\mu_s, \hat\Sigma_s)$. Using (2.2), the full posterior can be estimated by simply multiplying these subposterior estimates together. It follows that the estimate will be multivariate Gaussian with mean $\hat\mu$ and variance $\hat\Sigma$ given by

$$\hat\Sigma = \left( \sum_{s=1}^{S} \hat\Sigma_s^{-1} \right)^{-1}, \qquad \hat\mu = \hat\Sigma \left( \sum_{s=1}^{S} \hat\Sigma_s^{-1} \hat\mu_s \right). \qquad (2.4)$$

Scott et al. (2013) propose a similar method, where samples are combined using averaging. Their method is known as consensus Monte Carlo. Denote the $j$-th sample from subposterior $s$ by $\theta_{sj}$. Suppose each subposterior is assigned a weight denoted by $W_s$ (a matrix in the multivariate case). Then the $j$-th draw $\hat\theta_j$ from the consensus approximation to the full posterior is given by

$$\hat\theta_j = \left( \sum_{s=1}^{S} W_s \right)^{-1} \sum_{s=1}^{S} W_s \theta_{sj}.$$

When each subposterior is Normal, the full posterior is also Normal, and when we set the weights to be the inverse subposterior variances, $W_s = \mathrm{Var}(\theta \mid x_s)^{-1}$, the $\hat\theta_j$ will be exact draws from the full posterior. The idea is that even when the subposteriors are non-Gaussian, the draw $\hat\theta_j$ will still be a close approximation to a draw from the posterior. Because of the exact results in the Normal case, Scott et al. (2013) suggest using the inverse of the sample variance of each batch as the weight in practice.

Key advantages of the two approximations outlined above are that they are fast and relatively quick to converge when models are close to Gaussian. However, they only target the full posterior exactly if either each subposterior is Normally distributed, or the size of each batch tends to infinity. The methods' performance on non-Gaussian targets should therefore be explored, especially when the targets are multi-modal, since the methods may conceivably struggle in these cases.

Rabinovich et al. (2015) suggest extending the consensus Monte Carlo algorithm of Scott et al. (2013) by relaxing the restriction that aggregation uses averaging. Suppose we pick a draw from each subposterior, $\theta_1, \dots, \theta_S$, and refer to the function used to aggregate these draws as $F(\theta_1, \dots, \theta_S)$. In the case of consensus Monte Carlo we have

$$F(\theta_1, \dots, \theta_S) = \left( \sum_{s=1}^{S} W_s \right)^{-1} \sum_{s=1}^{S} W_s \theta_s.$$

Rabinovich et al. (2015) suggest adaptively choosing the best aggregation function $F(\cdot)$. The motivation is that the averaging function used in Scott et al. (2013) is only known to be exact in the case of Gaussian posteriors. In order to adaptively choose $F(\cdot)$, Rabinovich et al. (2015) use variational Bayes. However, the method requires the introduction of an optimization step, and it would be interesting to investigate the relative improvement in the approximation against the increase in computation time.
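
The two parametric recombination rules can be sketched as follows for a scalar parameter, where matrix inverses reduce to reciprocals. This is an illustrative sketch under that assumption, not the code used in the report; the function names are hypothetical and the consensus weights are taken to be inverse sample variances.

```python
import random
import statistics


def parametric_combine(subposterior_samples):
    """Equation (2.4), scalar case: multiply Normal approximations of the subposteriors."""
    means = [statistics.fmean(s) for s in subposterior_samples]
    variances = [statistics.variance(s) for s in subposterior_samples]
    precision = sum(1.0 / v for v in variances)
    var = 1.0 / precision
    mean = var * sum(m / v for m, v in zip(means, variances))
    return mean, var


def consensus_draws(subposterior_samples):
    """Consensus Monte Carlo, scalar case, with weights W_s = 1 / sample variance."""
    weights = [1.0 / statistics.variance(s) for s in subposterior_samples]
    total = sum(weights)
    # The j-th consensus draw is the weighted average of the j-th draw from each batch.
    return [
        sum(w * draw for w, draw in zip(weights, draws)) / total
        for draws in zip(*subposterior_samples)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-ins for MCMC output from two batches (simulated directly for illustration).
    batches = [
        [rng.gauss(1.0, 1.0) for _ in range(5000)],
        [rng.gauss(3.0, 1.0) for _ in range(5000)],
    ]
    print(parametric_combine(batches))
    print(statistics.fmean(consensus_draws(batches)))
```

The first function returns the parameters of a single Normal estimate of the full posterior, whereas the second returns a sample, one combined draw per iteration index.
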

### 2.5 Nonparametric methods

While the methods outlined above work relatively well when subposteriors are approximately Gaussian, it is not clear how they behave when models are far from Gaussian, or when batch sizes are small. Neiswanger et al. (2013) therefore suggest an alternative method based on kernel density estimation, which can be shown to target the full posterior asymptotically as the number of samples drawn from each subposterior tends to infinity.

Let $x_1, \dots, x_N$ be a sample from a distribution of dimension $d$ with density $f$. Kernel density estimation is a method for providing an estimate $\hat f$ of the density. The kernel density estimate of $f$ at a point $x$ is

$$\hat f(x) = \frac{1}{N} \sum_{i=1}^{N} K_H(x - x_i),$$

where $H$ is a $d \times d$ symmetric, positive-definite matrix known as the bandwidth and $K$ is the unscaled kernel, which is a symmetric, $d$-dimensional density. $K_H$ is related to $K$ by $K_H(x) = |H|^{-1/2} K(H^{-1/2} x)$. Commonly the kernel function $K$ is chosen to be Gaussian since it leads to smooth density estimates and it simplifies mathematical analysis (Duong, 2004). The bandwidth is an important factor in determining the accuracy of a kernel density estimate as it controls the smoothing of the estimate.

Suppose we have a sample $\{\theta_{m,s}\}_{m=1}^{M}$ from each subposterior $s \in \{1, \dots, S\}$. Neiswanger et al. (2013) suggest approximating each subposterior using a kernel density estimate with Gaussian kernel and diagonal bandwidth matrix $h^2 I$, where $I$ is the $d$-dimensional identity matrix. Denote this estimate by $\hat p_s(\theta)$; then we can write it as

$$\hat p_s(\theta) = \frac{1}{M} \sum_{m=1}^{M} \mathcal{N}(\theta \mid \theta_{m,s}, h^2 I),$$

where $\mathcal{N}(\cdot \mid \theta_{m,s}, h^2 I)$ denotes a $d$-dimensional Gaussian density with mean $\theta_{m,s}$ and variance $h^2 I$. The estimate for the full posterior $\hat p(\theta \mid x)$ is then defined to be the product of the estimates for each batch,

$$\hat p(\theta \mid x) = \prod_{s=1}^{S} \hat p_s(\theta) = \frac{1}{M^S} \prod_{s=1}^{S} \sum_{m=1}^{M} \mathcal{N}(\theta \mid \theta_{m,s}, h^2 I). \qquad (2.5)$$

The estimate for the full posterior therefore becomes a product of Gaussian mixtures, as discussed in Section 2.3. By introducing a similar labelling system $L = (l_1, \dots, l_S)$ with $l_s \in \{1, \dots, M\}$, we can again derive an explicit expression for the resulting mixture. While Neiswanger et al. (2013) use a common variance $h^2 I$ for each kernel, we suggest it might be better to use a diagonal matrix $\Lambda$, since different parameters may differ considerably in variance. In either case, assuming a common, diagonal variance $\Lambda$ across the kernel estimates for each batch, the weights in the product (2.5) simplify to

$$\omega_L \propto \prod_{s=1}^{S} \mathcal{N}(\theta_{l_s,s} \mid \bar\theta_L, \Lambda), \qquad \bar\theta_L = \frac{1}{S} \sum_{s=1}^{S} \theta_{l_s,s}. \qquad (2.6)$$

The $L$-th component of the mixture simplifies to $\mathcal{N}(\theta \mid \bar\theta_L, \Lambda / S)$.
Given that this method is designed for use with large datasets, the number of components of the resulting Gaussian mixture will be very large. Efficiently sampling from it is therefore an important issue to consider. Neiswanger et al. (2013) recommend sampling from the full posterior estimate using a method similar to the Gibbs sampling approach outlined in Section 2.3. In order to avoid calculating the conditional distribution of the weights, however, they use a Metropolis within Gibbs approach as follows. Keeping all labels except the current one, $l_s$, fixed, we randomly sample a new value for $l_s$. We then accept this new label with probability given by the ratio of the weight of the proposed label to the weight of the current label (capped at one). The full algorithm is detailed in Algorithm 1.

Algorithm 1: Combining Batches Using Kernel Density Estimation.

- Data: samples $\{\theta_{m,s}\}_{m=1}^{M}$ from each subposterior $s \in \{1, \dots, S\}$.
- Result: sample from an estimate of the full posterior $p(\theta \mid x)$.

1. Draw an initial label $L$ by simulating $l_s \sim \mathrm{Unif}(\{1, \dots, M\})$ for $s \in \{1, \dots, S\}$.
2. For $i = 1$ to $T$:
   1. Set $h \leftarrow h(i)$.
   2. For $s = 1$ to $S$:
      1. Create a new label $C := (c_1, \dots, c_S)$ and set $C \leftarrow L$.
      2. Draw a new value for index $s$ in $C$, $c_s \sim \mathrm{Unif}(\{1, \dots, M\})$.
      3. Simulate $u \sim \mathrm{Unif}(0, 1)$.
      4. If $u < \omega_C / \omega_L$, set $L \leftarrow C$.
   3. Simulate $\theta_i \sim \mathcal{N}(\bar\theta_L, \frac{h^2}{S} I)$, in agreement with the mixture component $\mathcal{N}(\theta \mid \bar\theta_L, \Lambda / S)$ of Section 2.5 with $\Lambda = h^2 I$.

Notice that in the algorithm, $h$ is changed as a function of the iteration $i$. In particular Neiswanger et al. (2013) specify the function $h(i) = i^{-1/(4+d)}$. This causes the bandwidth to decrease at each iteration and is referred to as annealing. The properties of annealing are investigated further in Section 4. In their paper Neiswanger et al. (2013) assume that the number of iterations is the same as the size of the sample from each subposterior. This is not necessary, however. In fact, when we are trying to sample from a mixture with a large number of components, we may need to simulate more times than this in order to ensure the sample accurately represents the true KDE approximation.

While this algorithm may improve results as models move away from Gaussianity, kernel density estimation is known to perform poorly in high dimensions, so the algorithm will deteriorate as the dimensionality of $\theta$ increases. The algorithm also suffers from the curse of dimensionality in the number of batches and the size of the MCMC sample simulated from each subposterior. This suggests that as the number of batches increases the accuracy and mixing of the algorithm will be affected.

The algorithm requires the user to choose a bandwidth, so the sensitivity of the algorithm to different bandwidth choices would be interesting to investigate. In the original paper by Neiswanger et al. (2013), it is suggested to use a Gaussian kernel with bandwidth $h^2 I$.
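
The steps of Algorithm 1 can be sketched as below for a scalar parameter ($d = 1$), under the assumption that the final draw uses variance $h^2 / S$, matching the mixture component in Section 2.5. The function names are hypothetical, weights are handled on the log scale for numerical stability, and normalising constants common to all labels at a fixed bandwidth are dropped because they cancel in the acceptance ratio.

```python
import math
import random


def log_weight(label, samples, h):
    """Log of the unnormalised weight (2.6) for a scalar parameter and bandwidth h."""
    chosen = [samples[s][l] for s, l in enumerate(label)]
    centre = sum(chosen) / len(chosen)  # theta-bar_L
    return -0.5 * sum((t - centre) ** 2 for t in chosen) / h ** 2


def combine_kde(samples, n_iter, seed=0):
    """Metropolis-within-Gibbs sampler over labels, following Algorithm 1 with d = 1."""
    rng = random.Random(seed)
    S, M = len(samples), len(samples[0])
    d = 1
    label = [rng.randrange(M) for _ in range(S)]  # initial label L
    draws = []
    for i in range(1, n_iter + 1):
        h = i ** (-1.0 / (4 + d))  # annealed bandwidth h(i)
        for s in range(S):
            candidate = list(label)
            candidate[s] = rng.randrange(M)  # propose a new value for index s
            log_ratio = log_weight(candidate, samples, h) - log_weight(label, samples, h)
            # Accept when u < w_C / w_L, with u ~ Unif(0, 1], on the log scale.
            if math.log(1.0 - rng.random()) < log_ratio:
                label = candidate
        centre = sum(samples[s][l] for s, l in enumerate(label)) / S
        draws.append(rng.gauss(centre, h / math.sqrt(S)))  # variance h^2 / S
    return draws
```

Here `samples` is a list of $S$ lists, each holding $M$ draws from one subposterior, and `n_iter` plays the role of $T$.
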
As mentioned earlier, however, different parameters may have different variances. The algorithm would probably perform better with a more general diagonal matrix $\Lambda$, especially as this does not particularly increase the complexity of the algorithm. Using a common bandwidth parameter across batches eases computation, but it may negatively affect the performance of the algorithm. Note that when discussing products of Gaussian mixtures in Section 2.3, the variances across different mixtures did not need to be assumed common. Further improvements might therefore be made by varying bandwidths across batches, though this would increase computational expense. Finally, improvements could be gained by using more sophisticated methods to sample from the product of kernel density estimates (Ihler et al., 2004; Rudoy and Wolfe, 2007).

A number of developments have been proposed for Algorithm 1. Wang and Dunson (2013) note that the algorithm performs poorly when samples from each subposterior do not overlap. In order to improve this they suggest smoothing each subposterior using a Weierstrass transform, which simply takes the convolution of the density with a Gaussian function. The transformed function can be seen as a smoothed version of the original, which tends to increase the overlap between subposteriors. They then approximate the full posterior as a product of the Weierstrass transforms of each subposterior. However, since in general the approximation to each subposterior will be empirical, its Weierstrass transform corresponds to a kernel density estimator. This method is therefore, for all intents and purposes, the same as the original algorithm by Neiswanger et al. (2013), and so still suffers from many of the same problems.

An alternative way to improve overlap between the supports of each subposterior is to use heavier tailed kernels in the kernel density estimation. Implementing this will require some work, however, in order to be able to sample from the resulting product of mixtures, since the nice properties of products of Gaussians may not hold for these heavier tailed distributions. Alternative methods for sampling will therefore need to be developed.

Rather than using kernel density estimation, Wang et al. (2015) use space partitioning methods to partition the space into disjoint subsets, and produce counts of the number of points contained in each of these subsets. This produces an estimate of each subposterior akin to a multi-dimensional histogram. An estimate of the full posterior can then be made by multiplying the subposterior estimates together and normalizing. This algorithm helps solve the explosion of mixture components that affects Algorithm 1.
Despite this, the algorithm will still suffer when the supports of each subposterior do not overlap. Moreover the algorithm is more complicated to implement and will be affected by the choice of partitioning used.

Alternatively, there have been suggestions to introduce suitable metrics which allow summaries of a set of probability measures to be defined. This allows batches to be recombined in terms of these summaries. For example, Minsker et al. (2014) use a metric known as the Wasserstein distance in order to define the median posterior from a set of subposteriors. Similarly, Srivastava et al. (2015) use the Wasserstein distance to calculate a summary of the subposteriors known as the barycenter. This allows them to produce an estimate for the full posterior which they refer to as the Wasserstein posterior or WASP. However, the statistical properties of these measures are unclear and need to be investigated further.

### 2.6 Semiparametric methods

In order to account for the fact that the nonparametric method, Algorithm 1, is slow to converge, Neiswanger et al. (2013) suggest producing a semiparametric estimator (Hjort and Glad, 1995) of each subposterior. This estimator combines the parametric estimator characterised by (2.4) and the nonparametric estimator detailed by Algorithm 1. More specifically, each subposterior is estimated by (Hjort and Glad, 1995)

$$\hat p_s(\theta) = \hat f_s(\theta)\, \hat r(\theta),$$

where $\hat f_s(\theta) = \mathcal{N}(\theta \mid \hat\mu_s, \hat\Sigma_s)$ and $\hat r(\theta)$ is a nonparametric estimator of the correction function $r(\theta) = p_s(\theta) / \hat f_s(\theta)$. Assuming a Gaussian kernel for $\hat r(\theta)$, Neiswanger et al. (2013) write down an explicit expression for $\hat p_s(\theta)$,

$$\hat p_s(\theta) = \frac{1}{M} \sum_{m=1}^{M} \frac{\mathcal{N}(\theta \mid \theta_{m,s}, h^2 I)\, \mathcal{N}(\theta \mid \hat\mu_s, \hat\Sigma_s)}{\hat f_s(\theta_{m,s})} = \frac{1}{M} \sum_{m=1}^{M} \frac{\mathcal{N}(\theta \mid \theta_{m,s}, h^2 I)\, \mathcal{N}(\theta \mid \hat\mu_s, \hat\Sigma_s)}{\mathcal{N}(\theta_{m,s} \mid \hat\mu_s, \hat\Sigma_s)}.$$

Similarly to the nonparametric method, we can produce an estimate for the full posterior $\hat p(\theta \mid x)$ as the product of the estimates for each subposterior. Once again this results in a mixture of Gaussians with $M^S$ components. Using the label $L = (l_1, \dots, l_S)$, the $L$-th mixture weight $W_L$ and component $c_L$ are given by

$$W_L \propto \frac{\omega_L\, \mathcal{N}(\bar\theta_L \mid \hat\mu, \hat\Sigma + \frac{h}{S} I)}{\prod_{s=1}^{S} \mathcal{N}(\theta_{l_s,s} \mid \hat\mu_s, \hat\Sigma_s)}, \qquad c_L = \mathcal{N}(\theta \mid \mu_L, \Sigma_L),$$

where $\omega_L$ and $\bar\theta_L$ are as defined in (2.6), and the parameters of the mixture component are

$$\Sigma_L = \left( \frac{S}{h} I + \hat\Sigma^{-1} \right)^{-1}, \qquad \mu_L = \Sigma_L \left( \frac{S}{h} I\, \bar\theta_L + \hat\Sigma^{-1} \hat\mu \right),$$

where $\hat\Sigma$ and $\hat\mu$ are as defined in (2.4). Sampling from this mixture can be performed by using Algorithm 1, replacing weights and parameters where appropriate.

As $h \to 0$, the semiparametric component parameters $\Sigma_L$ and $\mu_L$ approach the corresponding nonparametric component parameters. This motivates Neiswanger et al. (2013) to suggest an alternative semiparametric algorithm where the nonparametric component weights $\omega_L$ are used instead of $W_L$. Their reasoning is that the resulting algorithm may have a higher acceptance probability and is still asymptotically exact as the batch size tends to infinity.

As in Section 2.5, a bandwidth matrix with identical diagonal elements, $hI$, will not necessarily be the best choice if different dimensions of the parameters have different scales or variances. However, the algorithm can easily be extended to use a diagonal bandwidth matrix $\Lambda$ in a similar way to the nonparametric method.
While this method may solve the problem that the nonparametric method is slow to converge in high dimensions, the performance of the algorithm is not well understood. For example, as models move away from Gaussianity, how will the algorithm perform when it includes this parametric term? Moreover the method still suffers from the curse of dimensionality in terms of the number of mixture components, and it will also be affected by bandwidth choice.

### 2.7 Conclusion

In this section we outlined batch methods. Batch methods split a large dataset up into smaller subsets, run parallel MCMC on these subsets, and then combine the MCMC output to obtain an approximation to the full posterior. A couple of methods appealed to the Bernstein-von Mises theorem in order to approximate each subposterior by a Normal distribution.

The resulting approximation to the full posterior could be found using standard results for products of Gaussians. However, these methods are only exact if each subposterior is Normal, or as the number of observations in each batch tends to infinity. Performance of the methods when these assumptions are violated needs to be investigated.

Alternative methods used kernel density estimation, or a mixture of a Normal estimate and a kernel density estimate, to approximate each subposterior. These estimates could then be combined by using results for the product of mixtures of Gaussians. However, the resulting approximation was a mixture of $M^S$ components, which is difficult to sample from efficiently. Moreover kernel density estimation is known to deteriorate as dimensionality increases and requires the choice of a bandwidth.

To conclude, each of the batch methods has either undesirable qualities or properties which are not well understood. These issues need reviewing before the methods can be used with confidence in practice. Batch methods are particularly suited to models which exhibit structure, for example hierarchical models.

## 3 Stochastic gradient methods

### 3.1 Introduction

Methods currently employed in large scale machine learning are generally optimization based. One method employed frequently in training machine learning models is known as stochastic optimization (Robbins and Monro, 1951). This method is used to optimize a likelihood function in a similar way to traditional gradient ascent. The key difference is that at each iteration only a subset of the data is used rather than the whole dataset. While the method produces impressive results at low computational cost, it has a number of downsides. Parameter uncertainty is not captured using this method, since it only produces a point estimate. Though uncertainty can be estimated using a Normal approximation, for more complex models this estimate may be poor.
This means models fitted using stochastic optimization can suffer from overfitting. Since the method does not sample from the posterior as in traditional MCMC, the algorithm can also get stuck in local maxima.

The methods outlined in this section aim to combine the subsampling approach of stochastic optimization with posterior sampling, which helps capture uncertainty in parameter estimates. The section begins by outlining stochastic optimization, before introducing stochastic gradient Langevin dynamics (SGLD) and stochastic gradient Hamiltonian Monte Carlo (SGHMC), the two key algorithms for big data discussed in this section. Hamiltonian Monte Carlo (HMC), a technique used extensively by SGHMC, is also reviewed.

3.2 Stochastic optimization

Let $x_1, \dots, x_N$ be data observed from a model with probability density function $p(x \mid \theta)$, where $\theta$ denotes an unknown parameter vector. Assigning a prior $p(\theta)$ to $\theta$, as usual our interest is in the posterior

$$p(\theta \mid x) \propto p(\theta) \prod_{i=1}^{N} p(x_i \mid \theta),$$

where we define $p(x \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$ to be the likelihood. Stochastic optimization (Robbins and Monro, 1951) aims to find the mode $\theta^*$ of the posterior distribution, otherwise known as the MAP estimate of $\theta$. The idea of finding the mode of the posterior rather than the likelihood is that the prior $p(\theta)$ regularizes the parameters, meaning it acts as a penalty for model complexity which helps prevent overfitting. At each iteration $t$, stochastic optimization takes a subset of the data $s_t$ and updates the parameters as follows (Welling and Teh, 2011)

$$\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2} \left( \nabla \log p(\theta_t) + \frac{N}{n} \sum_{x_i \in s_t} \nabla \log p(x_i \mid \theta_t) \right),$$

where $\epsilon_t$ is the stepsize at iteration $t$ and $|s_t| = n$. The idea is that over the long run the noise from using a subset of the data averages out, and the algorithm behaves like standard gradient ascent on the full dataset. Clearly when the number of observations $N$ is large, using only a subset of the data is much less computationally expensive. This is a key advantage of stochastic optimization. Provided that

$$\sum_{t=1}^{\infty} \epsilon_t = \infty, \qquad \sum_{t=1}^{\infty} \epsilon_t^2 < \infty, \tag{3.1}$$

and $p(\theta \mid x)$ satisfies certain technical conditions, this algorithm is guaranteed to converge to a local maximum.

A common extension of stochastic optimization, which will be needed later, is known as stochastic optimization with momentum. This is commonly employed when the likelihood surface exhibits a particular structure; one example where the method is used extensively is the training of deep neural networks. In this case we introduce a variable $\nu$, which is referred to as the velocity of the trajectory. The parameter updates then proceed as follows

$$\nu_{t+1} = (1 - \alpha)\nu_t + \eta \, \frac{\epsilon_t}{2} \left( \nabla \log p(\theta_t) + \frac{N}{n} \sum_{x_i \in s_t} \nabla \log p(x_i \mid \theta_t) \right), \qquad \theta_{t+1} = \theta_t + \nu_{t+1}, \tag{3.2}$$

where $\alpha$ and $\eta$ are free parameters to be tuned.

While stochastic optimization is used frequently by large scale machine learning practitioners, it does not capture parameter uncertainty since it only produces a point estimate of $\theta$.
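The two updates above can be sketched on a toy problem. The following is a minimal illustration, not a reference implementation: it assumes a hypothetical model $x_i \sim N(\theta, 1)$ with a $N(0, 10^2)$ prior, for which the MAP estimate is available in closed form, and an assumed stepsize schedule $\epsilon_t = t^{-0.55}/N$ chosen to satisfy (3.1). Setting `alpha = eta = 1` recovers the plain update, while other values give the momentum update (3.2). All function names and constants are illustrative.

```python
import random


def grad_log_prior(theta, prior_sd=10.0):
    # Gradient of log N(theta | 0, prior_sd^2) with respect to theta.
    return -theta / prior_sd ** 2


def grad_log_lik(x, theta):
    # Gradient of log N(x | theta, 1) with respect to theta.
    return x - theta


def stochastic_optimization(data, n, iters, alpha=1.0, eta=1.0, seed=0):
    """Minibatch gradient ascent on the log posterior.

    alpha = eta = 1 gives the plain update; other values give the
    momentum update (3.2) with velocity nu.
    """
    rng = random.Random(seed)
    N = len(data)
    theta, nu = 0.0, 0.0
    for t in range(1, iters + 1):
        eps = (1.0 / N) * t ** -0.55      # assumed schedule satisfying (3.1)
        batch = rng.sample(data, n)       # subset s_t of size n
        grad = grad_log_prior(theta) + (N / n) * sum(
            grad_log_lik(x, theta) for x in batch)
        nu = (1.0 - alpha) * nu + eta * 0.5 * eps * grad
        theta = theta + nu
    return theta


# Simulated data (hypothetical): 1000 draws from N(2, 1).
data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(1000)]

theta_map = sum(data) / (len(data) + 1.0 / 10.0 ** 2)   # closed-form MAP
theta_sgd = stochastic_optimization(data, n=10, iters=2000)
theta_mom = stochastic_optimization(data, n=10, iters=2000,
                                    alpha=0.1, eta=0.1)
```

Each iteration touches only `n = 10` of the 1000 observations, yet both runs should end close to `theta_map`. Neither returns any measure of uncertainty.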
Models fit using stochastic optimization can therefore often suffer from overfitting and require some form of regularization. One common way to approximate the true posterior is to fit a Gaussian approximation at the point estimate.
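As a minimal sketch of such a Gaussian approximation, consider a hypothetical model with `k` successes in `m` Bernoulli trials, a logit-scale parameter $\theta$ and a $N(0, 10^2)$ prior. The mode is found here by Newton's method rather than by stochastic optimization, purely to keep the example short, and the variance is taken as the negative inverse of the second derivative of the log posterior at the mode, as justified by the expansion that follows. The function name and the numbers are assumptions made for illustration.

```python
import math


def laplace_approximation(k, m, prior_sd=10.0, iters=50):
    """Gaussian approximation N(mode, var) to the posterior of a logit
    parameter theta, given k successes in m trials and a N(0, prior_sd^2)
    prior. Newton's method is used to locate the mode.
    """
    theta = 0.0
    for _ in range(iters):
        s = 1.0 / (1.0 + math.exp(-theta))              # success probability
        grad = -theta / prior_sd ** 2 + k - m * s       # first derivative
        hess = -1.0 / prior_sd ** 2 - m * s * (1 - s)   # second derivative (< 0)
        theta -= grad / hess                            # Newton step
    s = 1.0 / (1.0 + math.exp(-theta))
    hess = -1.0 / prior_sd ** 2 - m * s * (1 - s)
    return theta, -1.0 / hess                           # variance = -1 / Hessian


mode, var = laplace_approximation(k=30, m=100)
```

The returned pair summarises the posterior by a single Normal distribution centred at the mode, which is the approximation discussed next.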

Suppose $\theta_0$ is the true mode of the posterior $p(\theta \mid x)$. Then using a Taylor expansion about $\theta_0$, we find (Bishop, 2006)

$$\begin{aligned} \log p(\theta \mid x) &\approx \log p(\theta_0 \mid x) + (\theta - \theta_0)^T \nabla \log p(\theta_0 \mid x) + \frac{1}{2} (\theta - \theta_0)^T H[\log p(\theta_0 \mid x)] (\theta - \theta_0) \\ &= \log p(\theta_0 \mid x) + \frac{1}{2} (\theta - \theta_0)^T H[\log p(\theta_0 \mid x)] (\theta - \theta_0), \end{aligned}$$

where $H[g(\cdot)]$ is the Hessian matrix of the function $g(\cdot)$, and we have used the fact that the gradient of the log posterior at $\theta_0$ is 0. Let us denote the negative Hessian by $V^{-1}[\theta] := -H[\log p(\theta \mid x)]$, assumed positive definite at $\theta_0$. Then taking the exponential of both sides we find

$$p(\theta \mid x) \approx A \exp\left\{ -\frac{1}{2} (\theta - \theta_0)^T V^{-1}[\theta_0] (\theta - \theta_0) \right\},$$

where $A$ is some constant. This is the kernel of a Gaussian density, suggesting an approximation to the posterior of the form $N(\theta^*, V[\theta^*])$, where $\theta^*$ is an estimate of the mode. This is often referred to as a Laplace approximation. By the Bernstein-von Mises theorem, this approximation is expected to become increasingly accurate as the number of observations increases. However, since the approximation is based only on properties of the distribution at one point, it can miss important global properties of the distribution (Bishop, 2006). Moreover, multimodal distributions will be approximated very poorly. Therefore, while the approximation may work well for less complex distributions when plenty of data is available, it may struggle for more complex models. This motivates us to consider methods which aim to combine the performance of stochastic optimization with the ability to account for parameter uncertainty.

3.3 Hamiltonian Monte Carlo

Hamiltonian dynamics was originally developed as an important reformulation of Newtonian dynamics, and serves as a vital tool in statistical physics. More recently, Hamiltonian dynamics has been used to produce proposals for the Metropolis-Hastings algorithm which explore the parameter space rapidly and have very high acceptance rates.
The acceptance calculation in the Metropolis-Hastings algorithm is computationally intensive when a lot of data is available. However, as outlined later, by combining ideas from stochastic optimization and Hamiltonian dynamics we are able to approximately simulate from the posterior distribution without using an acceptance calculation. In light of this, we review Hamiltonian Monte Carlo, a method which produces efficient proposals for the Metropolis-Hastings algorithm.

3.3.1 Hamiltonian dynamics

Hamiltonian dynamics was traditionally developed to describe the motion of objects under a system of forces. In two dimensions a common analogy used to visualise the dynamics is a frictionless puck sliding over a surface of varying height (Neal, 2010). The state of the system consists of the puck's position $\theta$ and its momentum (mass times velocity) $r$, both of which are 2-dimensional vectors. The state of the system is governed by its potential energy $U(\theta)$ and its kinetic energy $K(r)$. If the puck is moving on a flat part of the surface, then it will have constant velocity. However, as the puck begins to gain height, its kinetic energy decreases and its potential energy increases as it slows. If its kinetic energy reaches zero the puck moves back down the hill, and its potential energy decreases as its kinetic energy increases.

More formally, Hamiltonian dynamics is described by a Hamiltonian function $H(r, \theta)$, where $r$ and $\theta$ are both $d$-dimensional. The Hamiltonian determines how $r$ and $\theta$ change over time as follows

$$\frac{d\theta_i}{dt} = \frac{\partial H}{\partial r_i}, \qquad \frac{dr_i}{dt} = -\frac{\partial H}{\partial \theta_i}. \tag{3.3}$$

Hamiltonian dynamics has a number of properties which are crucial for its use in constructing MCMC proposals. Firstly, Hamiltonian dynamics is reversible, meaning that the mapping from the state $(r(t), \theta(t))$ at time $t$ to the state $(r(t+s), \theta(t+s))$ at time $t+s$ is one-to-one. A second property is that the dynamics keeps the Hamiltonian invariant, or conserved. This can be easily shown using (3.3) as follows

$$\frac{dH}{dt} = \sum_{i=1}^{d} \left( \frac{d\theta_i}{dt} \frac{\partial H}{\partial \theta_i} + \frac{dr_i}{dt} \frac{\partial H}{\partial r_i} \right) = \sum_{i=1}^{d} \left( \frac{\partial H}{\partial r_i} \frac{\partial H}{\partial \theta_i} - \frac{\partial H}{\partial \theta_i} \frac{\partial H}{\partial r_i} \right) = 0.$$

In order to use Hamiltonian dynamics to simulate from a distribution we need to translate the density function to a potential energy function, and introduce artificial momentum variables to go with the position variables of interest. A Markov chain can then be simulated where at each iteration we resample the momentum variables, simulate Hamiltonian dynamics for a number of steps, and then perform a Metropolis-Hastings acceptance step with the new variables obtained from the simulation.
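The reversibility and conservation properties described above can be checked numerically on a standard one-dimensional example where Hamilton's equations have an exact solution. The sketch below assumes $H(r, \theta) = \theta^2/2 + r^2/2$, for which (3.3) gives $d\theta/dt = r$ and $dr/dt = -\theta$, so the exact flow is a rotation in the $(\theta, r)$ plane. The function names are illustrative.

```python
import math


def hamiltonian(theta, r):
    # H(r, theta) = U(theta) + K(r) with U = theta^2 / 2 and K = r^2 / 2.
    return 0.5 * theta ** 2 + 0.5 * r ** 2


def exact_flow(theta, r, s):
    """Exact solution of d(theta)/dt = r, dr/dt = -theta after time s."""
    c, sn = math.cos(s), math.sin(s)
    return theta * c + r * sn, r * c - theta * sn


theta0, r0 = 1.0, 0.5
theta1, r1 = exact_flow(theta0, r0, 2.3)

# Conservation: the Hamiltonian is unchanged along the trajectory.
energy_change = hamiltonian(theta1, r1) - hamiltonian(theta0, r0)

# Reversibility: negate the momentum, flow for the same time, and the
# starting position is recovered (with the momentum negated).
theta_back, r_back = exact_flow(theta1, -r1, 2.3)
```

Here `energy_change` is zero up to floating point error, and `(theta_back, r_back)` equals `(theta0, -r0)`.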
For Hamiltonian Monte Carlo we therefore generally define the Hamiltonian $H(r, \theta)$ to be of the form

$$H(r, \theta) = U(\theta) + K(r),$$

where $\theta$ is the vector we are simulating from and the momentum vector $r$ is constructed artificially. Using the notation in Section 3.2, the potential energy is then defined to be

$$U(\theta) = -\log\left( p(\theta) \prod_{i=1}^{N} p(x_i \mid \theta) \right) = -\log p(\theta) - \sum_{i=1}^{N} \log p(x_i \mid \theta). \tag{3.4}$$

The kinetic energy is defined as

$$K(r) = \frac{1}{2} r^T M^{-1} r, \tag{3.5}$$

where $M$ is a symmetric, positive definite mass matrix.
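As a worked instance of (3.4) and (3.5), consider an assumed toy model that is not part of the original text: $x_i \mid \theta \sim N(\theta, 1)$ with prior $\theta \sim N(0, \tau^2)$ and a scalar momentum with mass $m$. The energies and the gradient needed later by the leapfrog method are then:

```latex
% Assumed toy model: x_i | \theta ~ N(\theta, 1), \theta ~ N(0, \tau^2), scalar mass m.
U(\theta) = \frac{\theta^2}{2\tau^2} + \frac{1}{2}\sum_{i=1}^{N} (x_i - \theta)^2 + \text{const},
\qquad
\nabla U(\theta) = \frac{\theta}{\tau^2} - \sum_{i=1}^{N} (x_i - \theta),
\qquad
K(r) = \frac{r^2}{2m}.
```

Note that every evaluation of $\nabla U(\theta)$ involves a sum over all $N$ observations, which is the cost that the stochastic gradient methods of this section aim to avoid.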

3.3.2 Using Hamiltonian dynamics in MCMC

In order to relate the potential and kinetic energy functions to the distribution of interest, we can use the concept of a canonical distribution. Given some energy function $E(x)$, defined over the state $x$, the canonical distribution over the states of $x$ is defined to be

$$P(x) = \frac{1}{Z} \exp\{-E(x)/(k_B T)\}, \tag{3.6}$$

where $Z$ is a normalizing constant, $k_B$ is Boltzmann's constant, and $T$ is the temperature of the system. The Hamiltonian is an energy function defined over the joint state of $r$ and $\theta$, so that we can write down the joint distribution defined by the function as

$$P(r, \theta) \propto \exp\{-H(r, \theta)/(k_B T)\}.$$

If we now assume the Hamiltonian is of the form described by (3.4) and (3.5), and that $k_B T = 1$, then we find that

$$P(r, \theta) \propto \exp\{-U(\theta)\} \exp\{-K(r)\} \propto p(\theta \mid x) \, N(r \mid 0, M),$$

so that under the distribution defined by the Hamiltonian, $r$ and $\theta$ are independent and the marginal distribution of $\theta$ is its posterior distribution. This relationship enables us to describe Hamiltonian Monte Carlo (HMC), which can be used to simulate from continuous distributions whose density can be evaluated up to a normalizing constant. A requirement of HMC is that we can calculate the derivatives of the log of the target density.

HMC samples from the joint distribution for $(\theta, r)$. Therefore, by discarding the samples for $r$ we obtain a sample from the posterior $p(\theta \mid x)$. Generally we choose the components $r_i$ of $r$ to be independent, each with variance $m_i$. This allows us to write the kinetic energy as

$$K(r) = \sum_{i=1}^{d} \frac{r_i^2}{2 m_i}.$$

In order to approximate Hamilton's equations computationally, we need to discretize time using a small stepsize $\epsilon$. There are a number of ways to do this, however in practice the leapfrog method often produces good results. The method works as follows:

1. $r_i(t + \epsilon/2) = r_i(t) - \dfrac{\epsilon}{2} \dfrac{\partial U}{\partial \theta_i}(\theta(t))$,
2. $\theta_i(t + \epsilon) = \theta_i(t) + \epsilon \dfrac{\partial K}{\partial r_i}(r(t + \epsilon/2))$,
3. $r_i(t + \epsilon) = r_i(t + \epsilon/2) - \dfrac{\epsilon}{2} \dfrac{\partial U}{\partial \theta_i}(\theta(t + \epsilon))$.
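The three leapfrog steps can be sketched directly in code. The example below is a minimal scalar version, assuming the kinetic energy $K(r) = r^2/(2m)$ so that $\partial K / \partial r = r/m$, and using the illustrative potential $U(\theta) = \theta^2/2$. The function names and the test values are assumptions chosen for the example.

```python
def leapfrog(theta, r, eps, n_steps, grad_U, mass=1.0):
    """Apply n_steps leapfrog updates with stepsize eps (scalar case)."""
    for _ in range(n_steps):
        r = r - 0.5 * eps * grad_U(theta)   # step 1: half step for momentum
        theta = theta + eps * r / mass      # step 2: full step for position
        r = r - 0.5 * eps * grad_U(theta)   # step 3: half step for momentum
    return theta, r


def grad_U(theta):
    # Gradient of the illustrative potential U(theta) = theta^2 / 2.
    return theta


def hamiltonian(theta, r, mass=1.0):
    return 0.5 * theta ** 2 + 0.5 * r ** 2 / mass


theta0, r0 = 1.0, 0.5
theta1, r1 = leapfrog(theta0, r0, eps=0.1, n_steps=20, grad_U=grad_U)

# Discretization error in the Hamiltonian (small but not zero).
energy_error = hamiltonian(theta1, r1) - hamiltonian(theta0, r0)

# Negating the momentum and integrating again retraces the path.
theta_back, r_back = leapfrog(theta1, -r1, eps=0.1, n_steps=20, grad_U=grad_U)
```

Unlike the exact dynamics, the discretized trajectory changes the Hamiltonian slightly, while the retraced path returns to the starting position.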
The leapfrog method has a number of desirable properties, including that it is reversible and volume preserving. An effect of this is that at the acceptance step the proposal distributions cancel, so that the acceptance probability is simply a ratio of the canonical distributions at the proposed and current states. Since we must discretize the equations in order to simulate from them, the posterior $p(\theta \mid x)$ is not invariant under the approximate dynamics. This is why the acceptance step is required, as it corrects for this error. As the stepsize $\epsilon$ tends to zero, the acceptance rate of the leapfrog method tends to 1 as the approximation moves closer to true Hamiltonian dynamics.

Now that we have outlined how to approximate Hamilton's equations, we can outline Hamiltonian Monte Carlo. HMC is performed in two steps as follows:

1. Simulate new values for the momentum variables $r \sim N(0, M)$.
2. Simulate Hamiltonian dynamics for $L$ steps with stepsize $\epsilon$ using the leapfrog method. The momentum variables are then negated, and the new state $(\theta^*, r^*)$ is accepted with probability

$$\min\left\{1, \exp\{H(r, \theta) - H(r^*, \theta^*)\}\right\}.$$

3.3.3 Developments in HMC and tuning

HMC allows the state space to be explored rapidly and has high acceptance rates. However, in order to gain these benefits we need to ensure that $L$ and $\epsilon$ are properly tuned. Generally it is recommended to use trial values for $L$ and $\epsilon$, and to use traceplots and autocorrelation plots to judge how quickly the resulting algorithm converges and how well it explores the state space. The presence of multiple modes can be an issue for HMC, and requires special treatment (Neal, 2010). It is therefore recommended that the algorithm is run from different starting points to check for multimodality.

Suppose we have an estimate of the variance matrix for $\theta$. If the variables appear to be correlated then HMC may not explore the parameter space effectively. One way to improve the performance of HMC in this case is to set $M = \hat{\Sigma}^{-1}$, where $\hat{\Sigma}$ is our estimate of $\mathrm{Var}(\theta \mid x)$.

The selection of the stepsize $\epsilon$ is very important in HMC, since selecting a stepsize that is too big will result in a low acceptance rate, while selecting one that is too small will result in slow exploration of the space. Selecting $\epsilon$ too large can be particularly problematic as it can cause instability in the Hamiltonian error, which leads to very low acceptance.
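The two-step HMC procedure and the effect of the stepsize can be illustrated with a short sketch. It assumes a standard Normal target, so that $U(\theta) = \theta^2/2$, with unit mass. The function name `hmc`, the number of samples and the two stepsizes are assumptions chosen for illustration, not tuning recommendations.

```python
import math
import random


def U(theta):
    # Potential energy for an assumed standard Normal target.
    return 0.5 * theta ** 2


def grad_U(theta):
    return theta


def hmc(n_samples, eps, L, seed=0):
    """Scalar HMC with unit mass; returns the draws and the acceptance rate."""
    rng = random.Random(seed)
    theta, draws, accepted = 0.0, [], 0
    for _ in range(n_samples):
        r = rng.gauss(0.0, 1.0)               # step 1: resample momentum
        theta_new, r_new = theta, r
        for _ in range(L):                    # step 2: L leapfrog steps
            r_new -= 0.5 * eps * grad_U(theta_new)
            theta_new += eps * r_new
            r_new -= 0.5 * eps * grad_U(theta_new)
        r_new = -r_new                        # negate the momentum
        h_old = U(theta) + 0.5 * r ** 2
        h_new = U(theta_new) + 0.5 * r_new ** 2
        # Accept with probability min{1, exp(H_old - H_new)}.
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            theta = theta_new
            accepted += 1
        draws.append(theta)
    return draws, accepted / n_samples


draws, acc_small = hmc(n_samples=5000, eps=0.2, L=10)
_, acc_large = hmc(n_samples=5000, eps=2.5, L=10)

sample_mean = sum(draws) / len(draws)
sample_var = sum((d - sample_mean) ** 2 for d in draws) / len(draws)
```

With the small stepsize almost every proposal is accepted and the draws have mean and variance close to those of the target. With the large stepsize the leapfrog trajectory is unstable for this target and almost every proposal is rejected.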
In situations where the mass matrix $M$ is the identity matrix, the stability limit for $\epsilon$ is given by the width of the distribution in its most constrained direction. For a Gaussian distribution, this is the square root of the smallest eigenvalue of the covariance matrix for $\theta$.

The value of $L$ is also an important quantity to choose when tuning the HMC algorithm. Selecting $L$ too small will mean that HMC explores the space with inefficient random walk behaviour, as the next state will still be correlated with the previous state. On the other hand, selecting $L$ too large will waste computation and lower acceptance rates.

There have been a number of important developments to HMC. Girolami and Calderhead (2011) introduced Riemannian Manifold Hamiltonian Monte Carlo, which simulates HMC in a Riemannian space rather than a Euclidean one. This effectively enables the use of position-dependent mass matrices $M$. Using this result, the algorithm will sample more efficiently from distributions where parameters of interest exhibit strong correlations. A more recent development by Hoffman and Gelman (2014) is the No-U-Turn Sampler, which enables the automatic and adaptive tuning of the stepsize $\epsilon$ and the trajectory length $L$. This is an important development since the tuning of HMC algorithms is a non-trivial task. Alternative methods to the leapfrog method for simulating Hamiltonian dynamics have been developed. These enable us to handle constraints on the variables, or to exploit partially analytic solutions (Neal, 2010). As mentioned earlier, HMC can have considerable difficulty moving between the modes of a distribution. A number of schemes have been developed to solve this problem, including tempered transitions (Neal, 1996) and annealed importance sampling (Neal, 2001).

3.4 Stochastic gradient Langevin Monte Carlo

A special case of HMC, known as Langevin Monte Carlo, arises when we use only a single leapfrog step to propose a new state. Its name comes from its similarity to the theory of Langevin dynamics in physics. Welling and Teh (2011) noticed that the discretized form of Langevin Monte Carlo has a comparable structure to that of stochastic optimization, outlined in Section 3.2. This motivated them to develop an algorithm based on Langevin Monte Carlo which uses only a subsample of the dataset to calculate the gradient of the potential energy $U$. They show that by using a stepsize that decreases with time, the algorithm will smoothly transition from stochastic gradient descent to sampling approximately from the posterior distribution, without the need for an acceptance step. This result, along with the fact that only a subsample of the data is used at each iteration, means that the algorithm is scalable to large datasets.

3.4.1 Stochastic gradient Langevin Monte Carlo

Langevin Monte Carlo arises from HMC when we use only one leapfrog step in generating a new state $(r, \theta)$. In this case we can remove any explicit mention of momentum variables and propose a new value for $\theta$ as follows (Neal, 2010)

$$\theta_{t+1} = \theta_t - \frac{a^2}{2} \nabla U(\theta_t) + \eta,$$

where $\eta \sim N(0, a^2)$ and $a$ is some constant. Using our particular expression for the potential energy (3.4), we can write

$$\begin{aligned} \theta_{t+1} &= \theta_t + \frac{\epsilon}{2} \left( \nabla \log p(\theta_t) + \sum_{i=1}^{N} \nabla \log p(x_i \mid \theta_t) \right) + \eta \\ &= \theta_t - \frac{\epsilon}{2} \nabla U(\theta_t) + \eta, \end{aligned} \tag{3.7}$$

where $\epsilon = a^2$. While being a special case of Hamiltonian Monte Carlo, the properties of Langevin dynamics are somewhat different.
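The step from a single leapfrog update to the Langevin proposal can be written out explicitly. The short derivation below assumes an identity mass matrix, a leapfrog stepsize $a$ and a freshly sampled momentum $r \sim N(0, I)$:

```latex
% One leapfrog step of size a from (\theta_t, r), with M = I and r ~ N(0, I).
r\left(\tfrac{a}{2}\right) = r - \frac{a}{2} \nabla U(\theta_t),
\qquad
\theta_{t+1} = \theta_t + a \, r\left(\tfrac{a}{2}\right)
             = \theta_t - \frac{a^2}{2} \nabla U(\theta_t) + a r .
```

Since $r \sim N(0, I)$, the term $\eta = a r$ has distribution $N(0, a^2 I)$, which gives the proposal above. The final half step for the momentum affects only the acceptance probability, not the proposed value of $\theta$.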
We cannot typically set $a$ to be very large, so the state space is normally explored much more slowly than with HMC. The proposal for Langevin Monte Carlo is a particular discretization of a stochastic differential equation (SDE) known as Langevin dynamics. Writing this as an SDE we obtain

$$d\theta = -\frac{1}{2} \nabla U(\theta) \, dt + dW = -\frac{1}{2} \nabla U(\theta) \, dt + N(0, dt), \tag{3.8}$$

where $W$ is a Wiener process and we have informally written $dW$ as $N(0, dt)$. A Wiener process is a stochastic process with the following properties:

1. $W(0) = 0$ with probability 1;
2. $W(t + h) - W(t) \sim N(0, h)$ and is independent of $W(\tau)$ for $\tau \le t$.

It can be shown that, under certain conditions, the posterior distribution $p(\theta \mid x)$ is the stationary distribution of (3.8). This motivates the Metropolis-adjusted Langevin algorithm (MALA), which uses (3.7) as a proposal for the Metropolis-Hastings algorithm.

When there are a large number of observations available, $\nabla U(\theta)$ is expensive to calculate at each iteration, since it requires the evaluation of the log likelihood gradient over the whole dataset. Welling and Teh (2011) therefore suggest introducing an unbiased estimator of $\nabla U(\theta)$ which uses only a subset $s_t$ of the data at each iteration. The estimator $\nabla \tilde{U}(\theta)$ is given as follows

$$\nabla \tilde{U}(\theta) = -\nabla \log p(\theta) - \frac{N}{n} \sum_{x_i \in s_t} \nabla \log p(x_i \mid \theta). \tag{3.9}$$

We use that

$$\nabla \tilde{U}(\theta) = \nabla U(\theta) + \nu, \tag{3.10}$$

where $\nu$ is some noise term which we refer to as the stochastic gradient noise. Using this estimator in place of $\nabla U(\theta)$ in a Langevin Monte Carlo update we obtain the following

$$\begin{aligned} \theta_{t+1} &= \theta_t + \frac{\epsilon}{2} \left( \nabla \log p(\theta_t) + \frac{N}{n} \sum_{x_i \in s_t} \nabla \log p(x_i \mid \theta_t) \right) + \eta \\ &= \theta_t - \frac{\epsilon}{2} \nabla U(\theta_t) - \frac{\epsilon}{2} \nu_t + \eta. \end{aligned} \tag{3.11}$$

If we assume that the stochastic gradient noise $\nu_t$ has variance $V(\theta_t)$, then the term $\frac{\epsilon}{2} \nu_t$ has variance $\frac{\epsilon^2}{4} V(\theta_t)$. Therefore for small $\epsilon$ the injected noise $\eta$, which has variance $\epsilon$, will dominate. As we send $\epsilon \to 0$, (3.11) will approximate Langevin dynamics and sample approximately from $p(\theta \mid x)$, without the need for an acceptance step. This result motivates Welling and Teh (2011) to suggest an algorithm that uses (3.11) to update $\theta_t$, but decreases the stepsize $\epsilon$ to 0 as the number of iterations $t$ increases.
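The update (3.11) with a decreasing stepsize can be sketched on a toy problem where the exact posterior is known. The example below is a minimal illustration under stated assumptions: a hypothetical model $x_i \sim N(\theta, 1)$ with a $N(0, 10^2)$ prior, an assumed stepsize schedule $\epsilon_t = a t^{-\gamma}$ with $a = 0.01$ and $\gamma = 0.55$, a batch size of 10, and a burn-in of 1000 iterations. None of these values come from the original text.

```python
import math
import random


def sgld(data, n, iters, a=0.01, gamma=0.55, prior_sd=10.0, seed=0):
    """Stochastic gradient Langevin updates for a scalar parameter.

    Assumed model: x_i ~ N(theta, 1) with a N(0, prior_sd^2) prior.
    """
    rng = random.Random(seed)
    N = len(data)
    theta, draws = 0.0, []
    for t in range(1, iters + 1):
        eps = a * t ** -gamma                 # stepsize decreasing to zero
        batch = rng.sample(data, n)           # subset s_t of size n
        # Minibatch estimate of the gradient of the log posterior.
        grad = -theta / prior_sd ** 2 + (N / n) * sum(x - theta for x in batch)
        # Half-stepsize gradient move plus injected noise with variance eps.
        theta += 0.5 * eps * grad + rng.gauss(0.0, math.sqrt(eps))
        draws.append(theta)
    return draws


# Simulated data (hypothetical): 100 draws from N(2, 1).
data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(100)]

draws = sgld(data, n=10, iters=20000)[1000:]   # discard an assumed burn-in

post_mean = sum(data) / (len(data) + 0.01)     # exact posterior mean
post_var = 1.0 / (len(data) + 0.01)            # exact posterior variance
sgld_mean = sum(draws) / len(draws)
sgld_var = sum((d - sgld_mean) ** 2 for d in draws) / len(draws)
```

No acceptance step is used, and each iteration touches only 10 of the 100 observations. The sample mean and variance of the draws can be compared with `post_mean` and `post_var` to judge how close the approximation is for this toy model.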
Decreasing the stepsize in this way leads to the SGLD algorithm update

$$\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2} \left( \nabla \log p(\theta_t) + \frac{N}{n} \sum_{x_i \in s_t} \nabla \log p(x_i \mid \theta_t) \right) + \eta_t, \tag{3.12}$$

where $\eta_t \sim N(0, \epsilon_t)$. Noting the similarity between (3.12) and stochastic optimization, they suggest decreasing $\epsilon_t$ according to the conditions (3.1) to ensure that the noise in the stochastic gradients averages out. The result is an algorithm that transitions smoothly between stochastic gradient descent and approximately sampling from the posterior using an increasingly accurate discretization of Langevin dynamics. Since the stepsize must decrease to zero, the mixing rate of the

### Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data

Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data (Oxford) in collaboration with: Minjie Xu, Jun Zhu, Bo Zhang (Tsinghua) Balaji Lakshminarayanan (Gatsby) Bayesian

### Tutorial on Markov Chain Monte Carlo

Tutorial on Markov Chain Monte Carlo Kenneth M. Hanson Los Alamos National Laboratory Presented at the 29 th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Technology,

### Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about

### Section 5. Stan for Big Data. Bob Carpenter. Columbia University

Section 5. Stan for Big Data Bob Carpenter Columbia University Part I Overview Scaling and Evaluation data size (bytes) 1e18 1e15 1e12 1e9 1e6 Big Model and Big Data approach state of the art big model

### MCMC Using Hamiltonian Dynamics

5 MCMC Using Hamiltonian Dynamics Radford M. Neal 5.1 Introduction Markov chain Monte Carlo (MCMC) originated with the classic paper of Metropolis et al. (1953), where it was used to simulate the distribution

### Gaussian Processes to Speed up Hamiltonian Monte Carlo

Gaussian Processes to Speed up Hamiltonian Monte Carlo Matthieu Lê Murray, Iain http://videolectures.net/mlss09uk_murray_mcmc/ Rasmussen, Carl Edward. "Gaussian processes to speed up hybrid Monte Carlo

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

### Posterior probability!

Posterior probability! P(x θ): old name direct probability It gives the probability of contingent events (i.e. observed data) for a given hypothesis (i.e. a model with known parameters θ) L(θ)=P(x θ):

### Big Data, Statistics, and the Internet

Big Data, Statistics, and the Internet Steven L. Scott April, 4 Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 / 39 Summary Big data live on more than one machine. Computing takes

### Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation

Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2015 CS 551, Fall 2015

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### Basics of Statistical Machine Learning

CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

### A Bootstrap Metropolis-Hastings Algorithm for Bayesian Analysis of Big Data

A Bootstrap Metropolis-Hastings Algorithm for Bayesian Analysis of Big Data Faming Liang University of Florida August 9, 2015 Abstract MCMC methods have proven to be a very powerful tool for analyzing

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

### Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

### Christfried Webers. Canberra February June 2015

c Statistical Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 829 c Part VIII Linear Classification 2 Logistic

### Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

### Introduction to Markov chain Monte Carlo with examples from Bayesian statistics

Introduction to Markov chain Monte Carlo with examples from Bayesian statistics First winter school in escience Geilo, Wednesday January 31st 2007 Håkon Tjelmeland Department of Mathematical Sciences Norwegian

### Markov chains and Markov Random Fields (MRFs)

Markov chains and Markov Random Fields (MRFs) 1 Why Markov Models We discuss Markov models now. This is the simplest statistical model in which we don t assume that all variables are independent; we assume

Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

### Markov Chain Monte Carlo Simulation Made Simple

Markov Chain Monte Carlo Simulation Made Simple Alastair Smith Department of Politics New York University April2,2003 1 Markov Chain Monte Carlo (MCMC) simualtion is a powerful technique to perform numerical

### Markov Chain Monte Carlo

CS731 Spring 2011 Advanced Artificial Intelligence Markov Chain Monte Carlo Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu A fundamental problem in machine learning is to generate samples from a distribution:

### EC 6310: Advanced Econometric Theory

EC 6310: Advanced Econometric Theory July 2008 Slides for Lecture on Bayesian Computation in the Nonlinear Regression Model Gary Koop, University of Strathclyde 1 Summary Readings: Chapter 5 of textbook.

### Centre for Central Banking Studies

Centre for Central Banking Studies Technical Handbook No. 4 Applied Bayesian econometrics for central bankers Andrew Blake and Haroon Mumtaz CCBS Technical Handbook No. 4 Applied Bayesian econometrics

### Analysis of Bayesian Dynamic Linear Models

Analysis of Bayesian Dynamic Linear Models Emily M. Casleton December 17, 2010 1 Introduction The main purpose of this project is to explore the Bayesian analysis of Dynamic Linear Models (DLMs). The main

### Distributed Bayesian Posterior Sampling via Moment Sharing

Distributed Bayesian Posterior Sampling via Moment Sharing Minjie Xu 1, Balaji Lakshminarayanan 2, Yee Whye Teh 3, Jun Zhu 1, and Bo Zhang 1 1 State Key Lab of Intelligent Technology and Systems; Tsinghua

### Bayesian Techniques for Parameter Estimation. He has Van Gogh s ear for music, Billy Wilder

Bayesian Techniques for Parameter Estimation He has Van Gogh s ear for music, Billy Wilder Statistical Inference Goal: The goal in statistical inference is to make conclusions about a phenomenon based

### Bayesian Statistics in One Hour. Patrick Lam

Bayesian Statistics in One Hour Patrick Lam Outline Introduction Bayesian Models Applications Missing Data Hierarchical Models Outline Introduction Bayesian Models Applications Missing Data Hierarchical

### Bayesian Statistics: Indian Buffet Process

Bayesian Statistics: Indian Buffet Process Ilker Yildirim Department of Brain and Cognitive Sciences University of Rochester Rochester, NY 14627 August 2012 Reference: Most of the material in this note

### CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not

### These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

Music and Machine Learning (IFT6080 Winter 08) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher

### Introduction to Markov Chain Monte Carlo

Introduction to Markov Chain Monte Carlo Monte Carlo: sample from a distribution to estimate the distribution to compute max, mean Markov Chain Monte Carlo: sampling using local information Generic problem

### Statistical Machine Learning from Data

Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Gaussian Mixture Models Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique

### Lab 8: Introduction to WinBUGS

40.656 Lab 8 008 Lab 8: Introduction to WinBUGS Goals:. Introduce the concepts of Bayesian data analysis.. Learn the basic syntax of WinBUGS. 3. Learn the basics of using WinBUGS in a simple example. Next

### CSCI567 Machine Learning (Fall 2014)

CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /

### Spatial Statistics Chapter 3 Basics of areal data and areal data modeling

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data

### Dirichlet Processes A gentle tutorial

Dirichlet Processes A gentle tutorial SELECT Lab Meeting October 14, 2008 Khalid El-Arini Motivation We are given a data set, and are told that it was generated from a mixture of Gaussian distributions.

### Parameter estimation for nonlinear models: Numerical approaches to solving the inverse problem. Lecture 12 04/08/2008. Sven Zenker

Parameter estimation for nonlinear models: Numerical approaches to solving the inverse problem Lecture 12 04/08/2008 Sven Zenker Assignment no. 8 Correct setup of likelihood function One fixed set of observation

### MAN-BITES-DOG BUSINESS CYCLES ONLINE APPENDIX

MAN-BITES-DOG BUSINESS CYCLES ONLINE APPENDIX KRISTOFFER P. NIMARK The next section derives the equilibrium expressions for the beauty contest model from Section 3 of the main paper. This is followed by

### Chapter 4: Non-Parametric Classification

Chapter 4: Non-Parametric Classification Introduction Density Estimation Parzen Windows Kn-Nearest Neighbor Density Estimation K-Nearest Neighbor (KNN) Decision Rule Gaussian Mixture Model A weighted combination

### Principle of Data Reduction

Chapter 6 Principle of Data Reduction 6.1 Introduction An experimenter uses the information in a sample X 1,..., X n to make inferences about an unknown parameter θ. If the sample size n is large, then

### Neural Networks. CAP5610 Machine Learning Instructor: Guo-Jun Qi

Neural Networks CAP5610 Machine Learning Instructor: Guo-Jun Qi Recap: linear classifier Logistic regression Maximizing the posterior distribution of class Y conditional on the input vector X Support vector

### Imperfect Debugging in Software Reliability

Imperfect Debugging in Software Reliability Tevfik Aktekin and Toros Caglar University of New Hampshire Peter T. Paul College of Business and Economics Department of Decision Sciences and United Health

### Hierarchical models in Stan. Daniel Lee Columbia University, Statistics Department

Hierarchical models in Stan Daniel Lee Columbia University, Statistics Department bearlee@alum.mit.edu BayesComp mc-stan.org Stan: Help User Guide: http://mc-stan.org/manual.html Homepage: http://mc-stan.org

### Dealing with large datasets

Dealing with large datasets (by throwing away most of the data) Alan Heavens Institute for Astronomy, University of Edinburgh with Ben Panter, Rob Tweedie, Mark Bastin, Will Hossack, Keith McKellar, Trevor

### 11. Time series and dynamic linear models

11. Time series and dynamic linear models Objective To introduce the Bayesian approach to the modeling and forecasting of time series. Recommended reading West, M. and Harrison, J. (1997). models, (2 nd

### 2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR)

2DI36 Statistics 2DI36 Part II (Chapter 7 of MR) What Have we Done so Far? Last time we introduced the concept of a dataset and seen how we can represent it in various ways But, how did this dataset came

### Component Ordering in Independent Component Analysis Based on Data Power

Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals

### 15.062 Data Mining: Algorithms and Applications Matrix Math Review

.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

### Robert Collins CSE598G. More on Mean-shift. R.Collins, CSE, PSU CSE598G Spring 2006

More on Mean-shift R.Collins, CSE, PSU Spring 2006 Recall: Kernel Density Estimation Given a set of data samples x i ; i=1...n Convolve with a kernel function H to generate a smooth function f(x) Equivalent

### Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem

### Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering

Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering Department of Industrial Engineering and Management Sciences Northwestern University September 15th, 2014

### PS 271B: Quantitative Methods II. Lecture Notes

PS 271B: Quantitative Methods II Lecture Notes Langche Zeng zeng@ucsd.edu The Empirical Research Process; Fundamental Methodological Issues 2 Theory; Data; Models/model selection; Estimation; Inference.

### A Coefficient of Variation for Skewed and Heavy-Tailed Insurance Losses. Michael R. Powers[ 1 ] Temple University and Tsinghua University

A Coefficient of Variation for Skewed and Heavy-Tailed Insurance Losses Michael R. Powers[ 1 ] Temple University and Tsinghua University Thomas Y. Powers Yale University [June 2009] Abstract We propose a

### Senior Secondary Australian Curriculum

Senior Secondary Australian Curriculum Mathematical Methods Glossary Unit 1 Functions and graphs Asymptote A line is an asymptote to a curve if the distance between the line and the curve approaches zero

### Parallelization Strategies for Multicore Data Analysis

Parallelization Strategies for Multicore Data Analysis Wei-Chen Chen 1 Russell Zaretzki 2 1 University of Tennessee, Dept of EEB 2 University of Tennessee, Dept. Statistics, Operations, and Management

### A Learning Based Method for Super-Resolution of Low Resolution Images

A Learning Based Method for Super-Resolution of Low Resolution Images Emre Ugur June 1, 2004 emre.ugur@ceng.metu.edu.tr Abstract The main objective of this project is the study of a learning based method

### Constrained Bayes and Empirical Bayes Estimator Applications in Insurance Pricing

Communications for Statistical Applications and Methods 2013, Vol. 20, No. 4, 321-327 DOI: http://dx.doi.org/10.5351/csam.2013.20.4.321 Constrained Bayes and Empirical Bayes Estimator Applications in Insurance

### Maximum Likelihood Estimation

Math 541: Statistical Theory II Lecturer: Songfeng Zheng Maximum Likelihood Estimation 1 Maximum Likelihood Estimation Maximum likelihood is a relatively simple method of constructing an estimator for

### BAYESIAN ECONOMETRICS

BAYESIAN ECONOMETRICS VICTOR CHERNOZHUKOV Bayesian econometrics employs Bayesian methods for inference about economic questions using economic data. In the following, we briefly review these methods and

### Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com

Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian

### Calculating Interval Forecasts

Calculating Interval Forecasts Chapter 7 (Chatfield) Monika Turyna & Thomas Hrdina Department of Economics, University of Vienna Summer Term 2009 Terminology An interval forecast consists of an upper and a lower limit between

### Marketing Mix Modelling and Big Data P. M Cain

1) Introduction Marketing Mix Modelling and Big Data P. M Cain Big data is generally defined in terms of the volume and variety of structured and unstructured information. Whereas structured data is stored

### Cheng Soon Ong & Christfried Webers. Canberra February June 2016

© Cheng Soon Ong & Christfried Webers Research Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1 of 31 c Part I

### Inference on Phase-type Models via MCMC

Inference on Phase-type Models via MCMC with application to networks of repairable redundant systems Louis JM Aslett and Simon P Wilson Trinity College Dublin 28th June 202 Toy Example : Redundant Repairable

### An Introduction to Machine Learning

An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune,

### bayesian inference: principles and practice

bayesian inference: principles and practice http://jakehofman.com july 9, 2009 bayesian inference: principles and practice motivation background bayes theorem bayesian probability bayesian inference would like models that: provide predictive

### Credit Risk Models: An Overview

Credit Risk Models: An Overview Paul Embrechts, Rüdiger Frey, Alexander McNeil ETH Zürich © 2003 (Embrechts, Frey, McNeil) A. Multivariate Models for Portfolio Credit Risk 1. Modelling Dependent Defaults:

### 4. Introduction to Statistics

Statistics for Engineers 4-1 4. Introduction to Statistics Descriptive Statistics Types of data A variate or random variable is a quantity or attribute whose value may vary from one unit of investigation

### CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities Machine Learning Copyright © 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR'S PERMISSION* This is a

### Methods of Data Analysis Working with probability distributions

Methods of Data Analysis Working with probability distributions Week 4 1 Motivation One of the key problems in non-parametric data analysis is to create a good model of a generating probability distribution,

### Supplement to Call Centers with Delay Information: Models and Insights

Supplement to Call Centers with Delay Information: Models and Insights Oualid Jouini 1 Zeynep Akşin 2 Yves Dallery 1 1 Laboratoire Genie Industriel, Ecole Centrale Paris, Grande Voie des Vignes, 92290

### Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget

Anoop Korattikara AKORATTI@UCI.EDU School of Information & Computer Sciences, University of California, Irvine, CA 92617, USA Yutian Chen YUTIAN.CHEN@ENG.CAM.EDU Department of Engineering, University of

### Lecture 20: Clustering

Lecture 20: Clustering Wrap-up of neural nets (from last lecture) Introduction to unsupervised learning K-means clustering COMP-424, Lecture 20 - April 3, 2013 1 Unsupervised learning In supervised learning,

### CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach p(C_k | x) = p(x | C_k) p(C_k) / p(

### Bootstrapping Big Data

Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu

### Big Data need Big Model 1/44

Big Data need Big Model 1/44 Andrew Gelman, Bob Carpenter, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, Allen Riddell,... Department of Statistics,

### BayesX - Software for Bayesian Inference in Structured Additive Regression

BayesX - Software for Bayesian Inference in Structured Additive Regression Thomas Kneib Faculty of Mathematics and Economics, University of Ulm Department of Statistics, Ludwig-Maximilians-University Munich

### Linear Classification. Volker Tresp Summer 2015

Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class does the object belong

### IRTPRO is an entirely new application for item calibration and test scoring using IRT.

IRTPRO 3 FEATURES... 1 ORGANIZATION OF THE USERS GUIDE... 2 MONTE CARLO-MARKOV CHAIN (MCMC) ESTIMATION... 4 MCMC GRAPHICS... 4 Autocorrelations... 4 Trace Plots... 5 Running Means... 6 Posterior Densities...

### An Introduction to Using WinBUGS for Cost-Effectiveness Analyses in Health Economics

Slide 1 An Introduction to Using WinBUGS for Cost-Effectiveness Analyses in Health Economics Dr. Christian Asseburg Centre for Health Economics Part 1 Slide 2 Talk overview Foundations of Bayesian statistics

### Improving Generalization

Improving Generalization Introduction to Neural Networks : Lecture 10 John A. Bullinaria, 2004 1. Improving Generalization 2. Training, Validation and Testing Data Sets 3. Cross-Validation 4. Weight Restriction

### Detection of changes in variance using binary segmentation and optimal partitioning

Detection of changes in variance using binary segmentation and optimal partitioning Christian Rohrbeck Abstract This work explores the performance of binary segmentation and optimal partitioning in the

### BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

BNG 202 Biomechanics Lab Descriptive statistics and probability distributions I Overview The overall goal of this short course in statistics is to provide an introduction to descriptive and inferential

### Estimating the evidence for statistical models

Estimating the evidence for statistical models Nial Friel University College Dublin nial.friel@ucd.ie March, 2011 Introduction Bayesian model choice Given data y and competing models: m_1, ..., m_l, each

### Exploiting the Statistics of Learning and Inference

Exploiting the Statistics of Learning and Inference Max Welling Institute for Informatics University of Amsterdam Science Park 904, Amsterdam, Netherlands m.welling@uva.nl Abstract. When dealing with datasets

### Summary of Probability

Summary of Probability Mathematical Physics I Rules of Probability The probability of an event is called P(A), which is a positive number less than or equal to 1. The total probability for all possible

### Nonparametric adaptive age replacement with a one-cycle criterion

Nonparametric adaptive age replacement with a one-cycle criterion P. Coolen-Schrijner, F.P.A. Coolen Department of Mathematical Sciences University of Durham, Durham, DH1 3LE, UK e-mail: Pauline.Schrijner@durham.ac.uk

### Department of Economics

Department of Economics On Testing for Diagonality of Large Dimensional Covariance Matrices George Kapetanios Working Paper No. 526 October 2004 ISSN 1473-0278 On Testing for Diagonality of Large Dimensional

### Class #6: Non-linear classification. ML4Bio 2012 February 17th, 2012 Quaid Morris

Class #6: Non-linear classification ML4Bio 2012 February 17th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

### SYSM 6304: Risk and Decision Analysis Lecture 3 Monte Carlo Simulation

SYSM 6304: Risk and Decision Analysis Lecture 3 Monte Carlo Simulation M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu September 19, 2015 Outline

### Monte Carlo Simulation

1 Monte Carlo Simulation Stefan Weber Leibniz Universität Hannover email: sweber@stochastik.uni-hannover.de web: www.stochastik.uni-hannover.de/ sweber Monte Carlo Simulation 2 Quantifying and Hedging

### Stochastic Gradient Method: Applications

Stochastic Gradient Method: Applications February 03, 2015 P. Carpentier Master MMMEF Cours MNOS 2014-2015 114 / 267 Lecture Outline 1 Two Elementary Exercises on the Stochastic Gradient Two-Stage Recourse

### Exploratory Data Analysis

Exploratory Data Analysis Johannes Schauer johannes.schauer@tugraz.at Institute of Statistics Graz University of Technology Steyrergasse 17/IV, 8010 Graz www.statistics.tugraz.at February 12, 2008 Introduction

### C: LEVEL 800 {MASTERS OF ECONOMICS( ECONOMETRICS)}

C: LEVEL 800 {MASTERS OF ECONOMICS( ECONOMETRICS)} 1. EES 800: Econometrics I Simple linear regression and correlation analysis. Specification and estimation of a regression model. Interpretation of regression

### Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression Jia Li Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

### Gaussian Processes in Machine Learning

Gaussian Processes in Machine Learning Carl Edward Rasmussen Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany carl@tuebingen.mpg.de WWW home page: http://www.tuebingen.mpg.de/ carl