# Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach

## Transcription

Rémi Bardenet, Arnaud Doucet, Chris Holmes. Department of Statistics, University of Oxford, Oxford OX1 3TG, UK

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, JMLR: W&CP volume 32. Copyright 2014 by the author(s).

### Abstract

Markov chain Monte Carlo (MCMC) methods are often deemed far too computationally intensive to be of any practical use for large datasets. This paper describes a methodology that aims to scale up the Metropolis-Hastings (MH) algorithm in this context. We propose an approximate implementation of the accept/reject step of MH that only requires evaluating the likelihood of a random subset of the data, yet is guaranteed to coincide with the accept/reject step based on the full dataset with a probability above a user-specified tolerance level. This adaptive subsampling technique is an alternative to the recent approach developed in (Korattikara et al., 2014), and it allows us to establish rigorously that the resulting approximate MH algorithm samples from a perturbed version of the target distribution of interest, whose total variation distance to this very target is controlled explicitly. We explore the benefits and limitations of this scheme on several examples.

### 1. Introduction

Consider a dataset $\mathcal{X} = \{x_1, \dots, x_n\}$ and denote by $p(x_1, \dots, x_n \mid \theta)$ the associated likelihood for parameter $\theta \in \Theta$. Henceforth we assume that the data are conditionally independent, so that $p(x_1, \dots, x_n \mid \theta) = \prod_{i=1}^n p(x_i \mid \theta)$. Given a prior $p(\theta)$, Bayesian inference relies on the posterior $\pi(\theta) \propto p(x_1, \dots, x_n \mid \theta)\, p(\theta)$. In most applications, this posterior is intractable and we need to rely on Bayesian computational tools to approximate it. A standard approach to sample approximately from $\pi(\theta)$ is the Metropolis-Hastings algorithm (MH; Robert & Casella (2004, Chapter 7.3)). MH consists in building an ergodic Markov chain of invariant distribution $\pi(\theta)$.
Given a proposal $q(\theta' \mid \theta)$, the MH algorithm starts its chain at a user-defined $\theta_0$; then, at iteration $k+1$, it proposes a candidate state $\theta' \sim q(\cdot \mid \theta_k)$ and sets $\theta_{k+1}$ to $\theta'$ with probability

$$\alpha(\theta_k, \theta') = 1 \wedge \frac{\pi(\theta')\, q(\theta_k \mid \theta')}{\pi(\theta_k)\, q(\theta' \mid \theta_k)} = 1 \wedge \frac{p(\theta')\, q(\theta_k \mid \theta')}{p(\theta_k)\, q(\theta' \mid \theta_k)} \prod_{i=1}^n \frac{p(x_i \mid \theta')}{p(x_i \mid \theta_k)}, \tag{1}$$

while $\theta_{k+1}$ is otherwise set to $\theta_k$. When the dataset is large ($n \gg 1$), evaluating the likelihood ratio appearing in the MH acceptance ratio (1) is too costly an operation and rules out the applicability of such a method. The aim of this paper is to propose an approximate implementation of this ideal MH sampler, the maximal approximation error being pre-specified by the user. To achieve this, we first present the ideal MH sampler in a slightly non-standard way. In practice, the accept/reject step of MH is implemented by sampling a uniform random variable $u \sim \mathcal{U}_{(0,1)}$ and accepting the candidate if and only if

$$u < \frac{\pi(\theta')\, q(\theta_k \mid \theta')}{\pi(\theta_k)\, q(\theta' \mid \theta_k)}. \tag{2}$$

In our specific context, it follows from (2) and the independence assumption that the candidate is accepted if and only if

$$\Lambda_n(\theta_k, \theta') > \psi(u, \theta_k, \theta'), \tag{3}$$

where for $\theta, \theta' \in \Theta$ we define the average log likelihood ratio $\Lambda_n(\theta, \theta')$ by

$$\Lambda_n(\theta, \theta') = \frac{1}{n} \sum_{i=1}^n \log\left[\frac{p(x_i \mid \theta')}{p(x_i \mid \theta)}\right] \tag{4}$$

and where

$$\psi(u, \theta, \theta') = \frac{1}{n} \log\left[u\, \frac{q(\theta' \mid \theta)\, p(\theta)}{q(\theta \mid \theta')\, p(\theta')}\right].$$
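As a minimal sketch of this reformulation, the following Python snippet checks that comparing the average log likelihood ratio $\Lambda_n$ with the threshold $\psi$ gives the same decision as the usual test (2). The unit-variance Gaussian likelihood, the flat prior, the symmetric proposal, and all function names are assumptions made for illustration only; they are not taken from the paper.

```python
import math
import random


def log_lik(x, theta):
    """Log likelihood of one point under an assumed N(theta, 1) model (constant dropped)."""
    return -0.5 * (x - theta) ** 2


def avg_log_lik_ratio(data, theta, theta_new):
    """Lambda_n(theta, theta') from (4): mean of log p(x_i|theta') - log p(x_i|theta)."""
    return sum(log_lik(x, theta_new) - log_lik(x, theta) for x in data) / len(data)


def psi(u, n, log_prior_ratio=0.0, log_proposal_ratio=0.0):
    """Threshold (1/n) log[u p(theta) q(theta'|theta) / (p(theta') q(theta|theta'))].

    log_prior_ratio    = log p(theta) - log p(theta')
    log_proposal_ratio = log q(theta'|theta) - log q(theta|theta')
    Both default to 0, i.e. a flat prior and a symmetric proposal (assumption).
    """
    return (math.log(u) + log_prior_ratio + log_proposal_ratio) / n


def accept_via_lambda(data, theta, theta_new, u):
    """Accept/reject test written in the form (3)."""
    return avg_log_lik_ratio(data, theta, theta_new) > psi(u, len(data))


def accept_standard(data, theta, theta_new, u):
    """Usual test (2), written on the log scale for numerical stability."""
    log_ratio = sum(log_lik(x, theta_new) - log_lik(x, theta) for x in data)
    return math.log(u) < log_ratio


rng = random.Random(0)
data = [rng.gauss(0.5, 1.0) for _ in range(1000)]
# Random (theta, theta', u) triples; u lies in (0, 1] so that log(u) is defined.
triples = [
    (rng.gauss(0.5, 0.1), rng.gauss(0.5, 0.1), 1.0 - rng.random()) for _ in range(200)
]
agree = all(
    accept_via_lambda(data, th, th_new, u) == accept_standard(data, th, th_new, u)
    for th, th_new, u in triples
)
print(agree)
```

Both tests need all $n$ likelihood evaluations; the point of writing the decision as (3) is that $\Lambda_n$ is an average, which can later be estimated from a subsample.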

The pseudocode of MH is given in Figure 1, unusually formulated using the expression (3). The advantage of this presentation is that it clearly shows that the accept/reject step of MH requires checking whether or not (3) holds.

    MH(p(x|θ), p(θ), q(θ'|θ), θ_0, N_iter, X)
    1  for k ← 1 to N_iter
    2      θ ← θ_{k−1}
    3      θ' ∼ q(·|θ), u ∼ U(0,1)
    4      ψ(u, θ, θ') ← (1/n) log[ u p(θ) q(θ'|θ) / (p(θ') q(θ|θ')) ]
    5      Λ_n(θ, θ') ← (1/n) Σ_{i=1..n} log[ p(x_i|θ') / p(x_i|θ) ]
    6      if Λ_n(θ, θ') > ψ(u, θ, θ')
    7          θ_k ← θ'    ▷ Accept
    8      else θ_k ← θ    ▷ Reject
    9  return (θ_k)_{k=1,...,N_iter}

Figure 1. The pseudocode of the MH algorithm targeting the posterior $\pi(\theta) \propto p(x_1, \dots, x_n \mid \theta)\, p(\theta)$. The formulation singling out $\Lambda_n(\theta, \theta')$ departs from conventions (Robert & Casella, 2004, Chapter 7.3) but serves the introduction of our main algorithm MHSUBLHD in Figure 3.

To save computational effort, we would like to be able to decide whether (3) holds using only a Monte Carlo approximation of $\Lambda_n(\theta, \theta')$ based on a subset of the data. There is obviously no hope of guaranteeing the correct decision with probability 1, but we would like to control the probability of an erroneous decision. Korattikara et al. (2014) propose, in a similar large-dataset context, to control this error using an approximate confidence interval for the Monte Carlo estimate. Similar ideas have actually appeared earlier in the operations research literature. Bulgak & Sanders (1988), Alkhamis et al. (1999) and Wang & Zhang (2006) consider maximizing a target distribution whose logarithm is given by an intractable expectation; in the large dataset scenario this expectation is w.r.t. the empirical measure of the data. They propose to perform this maximization using simulated annealing, a non-homogeneous version of MH.
This is implemented practically by approximating the MH ratio $\log \pi(\theta')/\pi(\theta)$ through Monte Carlo and by determining an approximate confidence interval for the resulting estimate; see also (Singh et al., 2012) for a similar idea developed in the context of inference in large-scale factor graphs. All these approaches rely on approximate confidence intervals, so they do not allow rigorous control of the approximation error. Moreover, the use of approximate confidence intervals can yield seriously erroneous inference results, as demonstrated in Section 4.1. The method presented in this paper is a more robust alternative to these earlier proposals, which can be analyzed theoretically and whose properties can be better quantified.

As shown in Section 2, it is possible to devise an adaptive sampling strategy which guarantees that we take the correct decision, i.e., whether (3) holds or not, with at worst a user-specified maximum probability of error. This sampling strategy allows us to establish in Section 3 various quantitative convergence results for the associated Markov kernel. In Section 4, we compare our approach to the one by Korattikara et al. (2014) on a toy example and demonstrate the performance of our methodology on a large-scale Bayesian logistic regression problem. All proofs are in the supplemental paper.

### 2. A Metropolis-Hastings algorithm with subsampled likelihoods

In this section, we use concentration bounds to obtain exact confidence intervals for Monte Carlo approximations of the log likelihood ratio (4). We then show how such bounds can be exploited to build an adaptive sampling strategy with the desired guarantees.

#### 2.1. MC approximation of the log likelihood ratio

Let $\theta, \theta' \in \Theta$. For any integer $t \geq 1$, a Monte Carlo approximation $\Lambda^*_t(\theta, \theta')$ of $\Lambda_n(\theta, \theta')$ is given by

$$\Lambda^*_t(\theta, \theta') = \frac{1}{t} \sum_{i=1}^t \log\left[\frac{p(x^*_i \mid \theta')}{p(x^*_i \mid \theta)}\right], \tag{5}$$

where $x^*_1, \dots, x^*_t$ are drawn uniformly over $\{x_1, \dots, x_n\}$ without replacement.
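The estimator in (5) can be sketched as follows. This is only an illustration under stated assumptions: the per-datum log likelihood ratios are synthetic numbers stored in a list (in practice only the sampled ones would be evaluated), and the function name `subsampled_avg` is hypothetical.

```python
import random


def subsampled_avg(log_ratios, t, rng):
    """Average of t log likelihood ratios drawn uniformly without replacement, as in (5)."""
    picked = rng.sample(range(len(log_ratios)), t)  # t distinct indices
    return sum(log_ratios[i] for i in picked) / t


rng = random.Random(1)
# Hypothetical values of log p(x_i|theta') - log p(x_i|theta), i = 1..n.
log_ratios = [rng.gauss(0.01, 0.05) for _ in range(10_000)]
full = sum(log_ratios) / len(log_ratios)  # the full-data average Lambda_n

for t in (10, 100, 1_000, 10_000):
    est = subsampled_avg(log_ratios, t, rng)
    print(t, abs(est - full))
```

Because sampling is without replacement, taking $t = n$ uses every data point exactly once, so the estimate then coincides with $\Lambda_n$ up to floating-point rounding.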
We can quantify the precision of our estimate $\Lambda^*_t(\theta, \theta')$ of $\Lambda_n(\theta, \theta')$ through concentration inequalities, i.e., a statement that for $\delta_t > 0$,

$$\mathbb{P}\left(|\Lambda^*_t(\theta, \theta') - \Lambda_n(\theta, \theta')| \leq c_t\right) \geq 1 - \delta_t, \tag{6}$$

for a given $c_t$. Hoeffding's inequality for sampling without replacement (Serfling, 1974), for instance, uses

$$c_t = C_{\theta,\theta'} \sqrt{\frac{2 (1 - f_t) \log(2/\delta_t)}{t}}, \tag{7}$$

where

$$C_{\theta,\theta'} = \max_{1 \leq i \leq n} |\log p(x_i \mid \theta') - \log p(x_i \mid \theta)| \tag{8}$$

and $f_t = \frac{t-1}{n}$ is approximately the fraction of used samples. The term $(1 - f_t)$ in (7) decreases to $\frac{1}{n}$ as the number $t$ of samples used approaches $n$, which is a feature of bounds corresponding to sampling without replacement. Let us add that $C_{\theta,\theta'}$ typically grows slowly with $n$. For instance, if the likelihood is Gaussian, then $C_{\theta,\theta'}$ is proportional to $\max_{i=1}^n |x_i|$, so that if the data actually were sampled from a Gaussian, $C_{\theta,\theta'}$ would grow in $\sqrt{\log(n)}$ (Cesa-Bianchi & Lugosi, 2009, Lemma A.12).

If the empirical standard deviation $\hat\sigma_t$ of $\{\log p(x_i \mid \theta') - \log p(x_i \mid \theta)\}$ is small, a tighter bound known as the empirical Bernstein bound (Audibert et al., 2009),

$$c_t = \hat\sigma_t \sqrt{\frac{2 \log(3/\delta_t)}{t}} + \frac{6\, C_{\theta,\theta'} \log(3/\delta_t)}{t}, \tag{9}$$

applies. While the bound of Audibert et al. (2009) originally covers the case where the $x_i$ are drawn with replacement, it was remarked early on (Hoeffding, 1963) that Chernoff bounds, such as the empirical Bernstein bound, still hold when considering sampling without replacement. Finally, we will also consider the recent Bernstein bound of Bardenet & Maillard (2013, Theorem 3), designed specifically for the case of sampling without replacement.

#### 2.2. Stopping rule construction

The concentration bounds given above are helpful as they allow us to decide whether (3) holds or not. Indeed, on the event $\{|\Lambda^*_t(\theta, \theta') - \Lambda_n(\theta, \theta')| \leq c_t\}$, we can decide whether or not $\Lambda_n(\theta, \theta') > \psi(u, \theta, \theta')$ if $|\Lambda^*_t(\theta, \theta') - \psi(u, \theta, \theta')| > c_t$ additionally holds. This is illustrated in Figure 2. Combined with the concentration inequality (6), we thus take the correct decision with probability at least $1 - \delta_t$ if $|\Lambda^*_t(\theta, \theta') - \psi(u, \theta, \theta')| > c_t$. In case $|\Lambda^*_t(\theta, \theta') - \psi(u, \theta, \theta')| \leq c_t$, we want to increase $t$ until the condition $|\Lambda^*_t(\theta, \theta') - \psi(u, \theta, \theta')| > c_t$ is satisfied. Let $\delta \in (0, 1)$ be a user-specified parameter. We provide a construction which ensures that at the first random time $T$ such that $|\Lambda^*_T(\theta, \theta') - \psi(u, \theta, \theta')| > c_T$, the correct decision is taken with probability at least $1 - \delta$. This adaptive stopping rule, adapted from (Mnih et al., 2008), is inspired by bandit algorithms, Hoeffding races (Maron & Moore, 1993) and procedures developed to scale up boosting algorithms to large datasets (Domingo & Watanabe, 2000). Formally, we set the stopping time

$$T = n \wedge \inf\{t \geq 1 : |\Lambda^*_t(\theta, \theta') - \psi(u, \theta, \theta')| > c_t\}, \tag{10}$$

where $a \wedge b$ denotes the minimum of $a$ and $b$.
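The two concentration bounds (7) and (9) that feed this stopping rule can be written as short functions. This is a sketch under assumptions: the function names and the numerical values of $n$, $C_{\theta,\theta'}$, $\hat\sigma_t$ and $\delta_t$ are illustrative choices, not values from the paper.

```python
import math


def hoeffding_serfling(t, n, c_range, delta_t):
    """Bound (7): C * sqrt(2 (1 - f_t) log(2/delta_t) / t), with f_t = (t - 1) / n."""
    f_t = (t - 1) / n
    return c_range * math.sqrt(2.0 * (1.0 - f_t) * math.log(2.0 / delta_t) / t)


def empirical_bernstein(t, sigma_hat, c_range, delta_t):
    """Bound (9): sigma_hat * sqrt(2 log(3/delta_t) / t) + 6 C log(3/delta_t) / t."""
    log_term = math.log(3.0 / delta_t)
    return sigma_hat * math.sqrt(2.0 * log_term / t) + 6.0 * c_range * log_term / t


# Illustrative values only.
n, c_range, sigma_hat, delta_t = 100_000, 1.0, 0.05, 0.01
for t in (100, 1_000, 10_000, 100_000):
    print(
        t,
        hoeffding_serfling(t, n, c_range, delta_t),
        empirical_bernstein(t, sigma_hat, c_range, delta_t),
    )
```

The Hoeffding-type bound depends on the data only through the range $C_{\theta,\theta'}$, whereas the empirical Bernstein bound has a leading term driven by $\hat\sigma_t$, which is why it can be tighter when the log likelihood ratios have small spread.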
In other words, if the infimum in (10) is larger than $n$, then we stop, as our sampling without replacement procedure ensures $\Lambda^*_n(\theta, \theta') = \Lambda_n(\theta, \theta')$. Letting $p > 1$ and selecting $\delta_t = \frac{p-1}{p\, t^p}\, \delta$, we have $\sum_{t \geq 1} \delta_t \leq \delta$. Setting $(c_t)_{t \geq 1}$ such that (6) holds, the event

$$\mathcal{E} = \bigcap_{t \geq 1} \{|\Lambda^*_t(\theta, \theta') - \Lambda_n(\theta, \theta')| \leq c_t\} \tag{11}$$

has probability larger than $1 - \delta$ under sampling without replacement, by a union bound argument. Now by definition of $T$, if $\mathcal{E}$ holds then $\Lambda^*_T(\theta, \theta')$ yields the correct decision, as pictured in Figure 2.

A slight variation of this procedure is actually implemented in practice; see Figure 3. The sequence $(\delta_t)$ is decreasing, and each time we check in Step 19 whether or not we should break out of the while loop, we have to use a smaller $\delta_t$, which inflates the $\log(2/\delta_t)$ factor in $c_t$. Every check of Step 19 thus makes the next check less likely to succeed. It therefore appears natural not to perform Step 19 systematically after each new $x_i$ has been drawn, but rather to draw several new subsamples $x_i$ between each check of Step 19. This is why we introduce the variable $t_{\text{look}}$ in Steps 6, 16, and 17 of Figure 3. This variable simply counts the number of times the check in Step 19 was performed. Finally, as recommended in a related setting in (Mnih et al., 2008; Mnih, 2008), we augment the size of the subsample geometrically by a user-input factor $\gamma > 1$ in Step 18. This modification does not impact the fact that the correct decision is taken with probability at least $1 - \delta$.

    MHSUBLHD(p(x|θ), p(θ), q(θ'|θ), θ_0, N_iter, X, (δ_t), C_{θ,θ'})
    1  for k ← 1 to N_iter
    2      θ ← θ_{k−1}
    3      θ' ∼ q(·|θ), u ∼ U(0,1)
    4      ψ(u, θ, θ') ← (1/n) log[ u p(θ) q(θ'|θ) / (p(θ') q(θ|θ')) ]
    5      t ← 0
    6      t_look ← 0
    7      Λ* ← 0
    8      X* ← ∅    ▷ Keeping track of points already used
    9      b ← 1    ▷ Initialize batchsize to 1
    10     DONE ← FALSE
    11     while DONE == FALSE do
    12         x*_{t+1}, ..., x*_b ∼ w/o repl. X \ X*    ▷ Sample new batch without replacement
    13         X* ← X* ∪ {x*_{t+1}, ..., x*_b}
    14         Λ* ← (1/b) ( t Λ* + Σ_{i=t+1..b} log[ p(x*_i|θ') / p(x*_i|θ) ] )
    15         t ← b
    16         c ← 2 C_{θ,θ'} sqrt( (1 − f_t) log(2/δ_{t_look}) / (2t) )
    17         t_look ← t_look + 1
    18         b ← n ∧ ⌈γt⌉    ▷ Increase batchsize geometrically
    19         if |Λ* − ψ(u, θ, θ')| ≥ c or b > n
    20             DONE ← TRUE
    21     if Λ* > ψ(u, θ, θ')
    22         θ_k ← θ'    ▷ Accept
    23     else θ_k ← θ    ▷ Reject
    24 return (θ_k)_{k=1,...,N_iter}

Figure 3. Pseudocode of the MH algorithm with subsampled likelihoods. Step 16 uses a Hoeffding bound, but other choices of concentration inequalities are possible. See main text for details.
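The inner loop of Figure 3 can be sketched in Python as below. This is a hedged illustration, not the authors' implementation: the function and argument names are hypothetical, the look counter is incremented before it is used so that the $\delta$ schedule is never evaluated at index 0, the loop stops explicitly once all $n$ points have been used, and the whole index set is shuffled up front for simplicity (an $O(n)$ step that a careful implementation would avoid).

```python
import math
import random


def delta_schedule(k, delta, p):
    """delta_k = (p - 1) / (p k^p) * delta, so that the delta_k sum to at most delta."""
    return (p - 1.0) / (p * k ** p) * delta


def subsampled_decision(log_ratio, n, psi, c_range, delta=0.01, p=2.0, gamma=2.0, rng=random):
    """Decide whether Lambda_n > psi from a geometrically growing subsample.

    log_ratio(i) returns log p(x_i|theta') - log p(x_i|theta) for data index i.
    c_range is an upper bound on |log_ratio(i)| differences, as in (8).
    Returns (accept, number_of_points_used).
    """
    order = list(range(n))
    rng.shuffle(order)  # reading a uniform permutation in order = sampling w/o replacement
    t, lam, looks, b = 0, 0.0, 0, 1
    while True:
        # Running average over the first b points of the permutation.
        lam = (t * lam + sum(log_ratio(i) for i in order[t:b])) / b
        t = b
        looks += 1
        f_t = (t - 1) / n
        c = c_range * math.sqrt(
            2.0 * (1.0 - f_t) * math.log(2.0 / delta_schedule(looks, delta, p)) / t
        )
        if abs(lam - psi) >= c or t == n:
            return lam > psi, t
        b = min(n, math.ceil(gamma * t))  # grow the batch geometrically


n = 10_000
# Hypothetical log likelihood ratios, all in [0.9, 1.1], so 1.1 bounds their magnitude.
ratio = lambda i: 0.9 + 0.2 * ((37 * i) % 100) / 99
accept, used = subsampled_decision(ratio, n, psi=0.0, c_range=1.1, rng=random.Random(0))
print(accept, used)
```

In this deliberately easy case the average log likelihood ratio is far from the threshold, so the loop accepts after looking at only a small fraction of the $n$ points; when the margin $|\Lambda_n - \psi|$ is small, the same loop falls back to the full dataset.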

Figure 2. Schematic view of the acceptance mechanism of MHSUBLHD given in Figure 3: if $|\Lambda^*_t(\theta, \theta') - \psi(u, \theta, \theta')| > c_t$, then MHSUBLHD takes the acceptance decision based on $\Lambda^*_t(\theta, \theta')$, without requiring the computation of $\Lambda_n(\theta, \theta')$.

### 3. Analysis

In Section 3.1 we provide an upper bound on the total variation norm between the iterated kernel of the approximate MH kernel and the target distribution $\pi$ of MH applied to the full dataset, while Section 3.2 focuses on the number $T$ of subsamples required by a given iteration of MHSUBLHD. We establish a probabilistic bound on $T$ and give a heuristic to determine whether a user can expect a substantial gain in terms of number of samples needed for the problem at hand.

#### 3.1. Properties of the approximate transition kernel

For $\theta, \theta' \in \Theta$, we denote by

$$P(\theta, d\theta') = \alpha(\theta, \theta')\, q(\theta' \mid \theta)\, d\theta' + \delta_\theta(d\theta') \left(1 - \int \alpha(\theta, \vartheta)\, q(\vartheta \mid \theta)\, d\vartheta\right) \tag{12}$$

the ideal MH kernel targeting $\pi$ with proposal $q$, where the acceptance probability $\alpha(\theta, \theta')$ is defined in (1). Denote the acceptance probability of MHSUBLHD in Figure 3 by

$$\tilde\alpha(\theta, \theta') = \mathbb{E}\, \mathbb{1}_{\{\Lambda^*_T(\theta, \theta') > \psi(u, \theta, \theta')\}}, \tag{13}$$

where the expectation in (13) is with respect to $u \sim \mathcal{U}_{(0,1)}$ and the variables $T$ and $x^*_1, \dots, x^*_T$ described in Section 2. Finally, denote by $\tilde P$ the MHSUBLHD kernel, obtained by substituting $\tilde\alpha$ for $\alpha$ in (12). The following Lemma states that the absolute difference between $\alpha$ and $\tilde\alpha$ is bounded by the user-defined parameter $\delta > 0$.

Lemma 3.1. For any $\theta, \theta' \in \Theta$, we have $|\alpha(\theta, \theta') - \tilde\alpha(\theta, \theta')| \leq \delta$.

Lemma 3.1 can be used to establish Proposition 3.2, which states that the chain output by the algorithm MHSUBLHD in Figure 3 is a controlled approximation to the original target $\pi$. For any signed measure $\nu$ on $(\Theta, \mathcal{B}(\Theta))$, let $\|\nu\|_{TV} = \frac{1}{2} \sup_{f : \Theta \to [-1, 1]} \nu(f)$ denote the total variation norm, where $\nu(f) = \int_\Theta f(\theta)\, \nu(d\theta)$. For any Markov kernel $Q$ on $(\Theta, \mathcal{B}(\Theta))$, we denote by $Q^k$ the $k$-th iterate kernel, defined by induction for $k \geq 2$ through $Q^k(\theta, d\theta') = \int_\Theta Q(\theta, d\vartheta)\, Q^{k-1}(\vartheta, d\theta')$ with $Q^1 = Q$.

Proposition 3.2.
Let $P$ be uniformly geometrically ergodic, i.e., there exists an integer $m$ and a probability measure $\nu$ on $(\Theta, \mathcal{B}(\Theta))$ such that for all $\theta \in \Theta$, $P^m(\theta, \cdot) \geq (1 - \rho)\, \nu(\cdot)$. Hence there exists $A < \infty$ such that

$$\forall \theta \in \Theta,\ \forall k > 0, \quad \|P^k(\theta, \cdot) - \pi\|_{TV} \leq A \rho^{\lfloor k/m \rfloor}. \tag{14}$$

Then there exists $B < \infty$ and a probability distribution $\tilde\pi$ on $(\Theta, \mathcal{B}(\Theta))$ such that for all $\theta \in \Theta$ and $k > 0$,

$$\|\tilde P^k(\theta, \cdot) - \tilde\pi\|_{TV} \leq B \left[1 - (1 - \delta)^m (1 - \rho)\right]^{\lfloor k/m \rfloor}. \tag{15}$$

Furthermore, $\tilde\pi$ satisfies

$$\|\pi - \tilde\pi\|_{TV} \leq \frac{A m \delta}{1 - \rho}. \tag{16}$$

One may obtain tighter bounds and ergodicity results by weakening the uniform geometric ergodicity assumption and using recent results on perturbed Markov kernels (Ferré et al., 2013), but this is out of the scope of this paper.

#### 3.2. On the stopping time T

##### 3.2.1. A bound for fixed θ, θ'

The following Proposition gives a probabilistic upper bound for the stopping time $T$, conditionally on $\theta, \theta' \in \Theta$ and $u \in [0, 1]$, when $c_t$ is defined by (7). A similar bound holds for the empirical Bernstein bound in (9).

Proposition 3.3. Let $\delta > 0$ and $\delta_t = \frac{p-1}{p\, t^p}\, \delta$. Let $\theta, \theta' \in \Theta$ be such that $C_{\theta,\theta'} \neq 0$, and let $u \in [0, 1]$. Let

$$\Delta = \frac{|\Lambda_n(\theta, \theta') - \psi(u, \theta, \theta')|}{2\, C_{\theta,\theta'}} \tag{17}$$

and assume $\Delta \neq 0$. Then if $p > 1$, with probability at least $1 - \delta$,

$$T \leq \left\lceil \frac{4}{\Delta^2} \left\{ p \log\left[\frac{4p}{\Delta^2}\right] + \log\left[\frac{2p}{\delta (p - 1)}\right] \right\} \right\rceil \vee 1. \tag{18}$$

The relative distance $\Delta$ in (18) characterizes the difficulty of the step. Intuitively, at equilibrium, i.e., when $(\theta, \theta') \sim \pi(\theta)\, q(\theta' \mid \theta)$ and $u \sim \mathcal{U}_{[0,1]}$, if the log likelihood $\log p(x \mid \theta)$ is smooth in $\theta$, the proposal could be chosen so that $\Lambda_n(\theta, \theta')$ has positive expectation and a small variance, thus leading to high values of $\Delta$ and small values of $T$.

##### 3.2.2. A heuristic at equilibrium

Integrating (18) with respect to $\theta, \theta'$ to obtain an informative quantitative bound on the average number of samples required by MHSUBLHD at equilibrium would be desirable but proved difficult. However, the following heuristic can help the user figure out whether our algorithm will yield important gains for a given problem. For large $n$, standard asymptotics (van der Vaart, 2000) yield that the log likelihood is approximately a quadratic form $\log p(x \mid \theta) \approx -(\theta - \theta^*)^T H_n (\theta - \theta^*)$ with $H_n$ of order $n$. Assume the proposal $q(\cdot \mid \theta)$ is a Gaussian random walk $\mathcal{N}(\cdot \mid \theta, \Gamma)$ of covariance $\Gamma$; then the expected log likelihood ratio under $\pi(\theta)\, q(\theta' \mid \theta)$ is approximately $\mathrm{Trace}(H_n \Gamma)$ in absolute value. According to (Roberts & Rosenthal, 2001), an efficient random walk Metropolis requires $\Gamma$ to be of the same order as $H_n^{-1}$, that is, of order $1/n$. Finally, the expected $\Lambda_n(\theta, \theta')$ at equilibrium is of order $1/n$, and can thus be compared to $\psi(u, \theta, \theta') = \log(u)/n$ in Line 19 of MHSUBLHD in Figure 3. The RHS of the first inequality in Step 19 is the concentration bound $c_t$, which has a leading term in $\hat\sigma_t / \sqrt{t}$ in the case of (9). In the applications we consider in Section 4, $\hat\sigma_t$ is typically proportional to $\|\theta - \theta'\|$, which is of order $n^{-1/2}$ since $\Gamma \approx H_n^{-1}$. Thus, to break out of the while loop in Line 19, we need $t \propto n$. At equilibrium, we thus should not expect gains of several orders of magnitude: gains are fixed by the constants in the proportionality relations above, which usually depend on the empirical distribution of the data.
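To get a feel for the bound of Proposition 3.3, the right-hand side of (18) can be evaluated numerically. The sketch below assumes the form $\lceil \frac{4}{\Delta^2}\{p \log[4p/\Delta^2] + \log[2p/(\delta(p-1))]\} \rceil \vee 1$; the function name and the chosen values of $\Delta$ and $\delta$ are illustrative assumptions.

```python
import math


def stopping_time_bound(rel_dist, delta, p=2.0):
    """Right-hand side of (18) for a relative distance Delta = rel_dist > 0 and p > 1."""
    inv = 4.0 / rel_dist ** 2  # the factor 4 / Delta^2
    value = inv * (p * math.log(p * inv) + math.log(2.0 * p / (delta * (p - 1.0))))
    return max(math.ceil(value), 1)


for rel_dist in (1.0, 0.3, 0.1, 0.03, 0.01):
    print(rel_dist, stopping_time_bound(rel_dist, delta=0.01))
```

The bound grows roughly like $\Delta^{-2} \log(\Delta^{-2})$ as the relative distance shrinks, and it is only informative when it falls below $n$, since $T \leq n$ always holds by (10).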
We provide a detailed analysis for a simple example in Section 4.3.

### 4. Experiments

All experiments were conducted using the empirical Bernstein-Serfling bound of Bardenet & Maillard (2013), which proved equivalent to the empirical Bernstein bound in (9), and much tighter in our experience with MHSUBLHD than Hoeffding's bound in (7). All MCMC runs are adaptive Metropolis (Haario et al., 2001; Andrieu & Thoms, 2008) with target acceptance 25% when the dimension is larger than 2 and 50% otherwise (Roberts & Rosenthal, 2001). Hyperparameters of MHSUBLHD were set to $p = 2$, $\gamma = 2$, and $\delta = $ [value missing in the source]. The first two were found to work well with all experiments. We found empirically that the algorithm is very robust to the choice of $\delta$.

#### 4.1. On the use of asymptotic confidence intervals

MCMC algorithms based on subsampling and asymptotic confidence intervals experimentally lead to efficient optimization procedures (Bulgak & Sanders, 1988; Alkhamis et al., 1999; Wang & Zhang, 2006), and perform well in terms of classification error when used, e.g., in logistic regression (Korattikara et al., 2014). However, in terms of approximating the original posterior, they come with no guarantee and can provide unreliable results. To illustrate this, we consider the following setting. $\mathcal{X}$ is a synthetic sample of size $10^5$ drawn from a Gaussian $\mathcal{N}(0, 0.1^2)$, and we estimate the parameters $\mu, \sigma$ of a $\mathcal{N}(\mu, \sigma^2)$ model, with flat priors. Analytically, we know that the posterior has its maximum at the empirical mean and variance of $\mathcal{X}$. Running the approximate MH algorithm of Korattikara et al. (2014), using a level for the test $\epsilon = 0.05$, and starting each iteration by looking at $t = 2$ points so as to be able to compute the variance of the log likelihood ratios, leads to the marginal of $\sigma$ shown in Figure 4(a). The empirical variance of $\mathcal{X}$ is denoted by a red triangle, and the maximum of the marginal is off by 7% from this value.
Relaunching the algorithm, but starting each iteration with a minimum of $t = 500$ points, leads to better agreement and a smaller support for the marginal, as depicted in Figure 4(b). However, while $t = 500$ works better for this example, it fails dramatically if $\mathcal{X}$ are samples from a log-normal $\log\mathcal{N}(0, 2)$, as depicted in Figure 4(c). The asymptotic regime, in which the studentized statistic used in (Korattikara et al., 2014) actually follows a Student distribution, depends on the problem at hand and is left to the user to specify. In each of the three examples of Figure 4, our algorithm produces significantly better estimates of the marginal, though at the price of a significantly larger average number of samples used per MCMC iteration. In particular, the case $\mathcal{X} \sim \log\mathcal{N}(0, 2)$ in Figure 4(c) essentially requires using the whole dataset.

[Figure 4 panels. (a) $\mathcal{X} \sim \mathcal{N}(0, 0.1^2)$, starting at $t = 2$: Approx. CI, 3% of n; MHSubLhd, 54% of n. (b) $\mathcal{X} \sim \mathcal{N}(0, 0.1^2)$, starting at $t = 500$: Approx. CI, 16% of n; MHSubLhd, 54% of n. (c) $\mathcal{X} \sim \log\mathcal{N}(0, 2)$, starting at $t = 500$: Approx. CI, 21% of n; MHSubLhd, 99% of n. (d) Synthetic dataset.]

Figure 4. (a,b,c) Estimates of the marginal posteriors of $\sigma$ obtained respectively by the algorithm of Korattikara et al. (2014) using approximate confidence intervals and our algorithm MHSUBLHD given in Figure 3, for $\mathcal{X}$ sampled from each of the distributions indicated below the plots, and with different starting points for the number $t$ of samples initially drawn from $\mathcal{X}$ at each MH iteration. On each plot, a red triangle indicates the true maximum of the posterior, and the legend indicates the proportion of $\mathcal{X}$ used on average by each algorithm. (d) The synthetic dataset used in Section 4.2.2. The dash-dotted line indicates the Bayes classifier.

#### 4.2. Large-scale Bayesian logistic regression

In logistic regression, an accurate approximation of the posterior is often needed rather than minimizing the classification error, for instance, when performing Bayesian variable selection. This makes logistic regression for large datasets a natural application for our algorithm, since the constant $C_{\theta,\theta'}$ in concentration inequalities such as (9) can be computed as follows. The log likelihood

$$\log p(y \mid x, \theta) = -\log(1 + e^{-\theta^T x}) - (1 - y)\, \theta^T x \tag{19}$$

is $L$-Lipschitz in $\theta$ with $L = \|x\|$, so that we can set $C_{\theta,\theta'} = \|\theta - \theta'\| \max_{1 \leq j \leq n} \|x_j\|$. We expect the Lipschitz inequality to be tight as (19) is almost linear in $\theta$.

##### 4.2.1. The covtype dataset

We consider the dataset covtype.binary (available at cjlin/libsvmtools/datasets/binary.html) described in (Collobert et al., 2002). The dataset consists of 581,012 points, of which we pick $n = 400{,}000$ as a training set, following the maximum training size in (Collobert et al., 2002). The original dimension of the problem is 54, with the first 10 attributes being quantitative. To illustrate our point without requiring a more complex sampler than MH, we consider a simple variant of the classification problem using the first $q = 2$ attributes only. We use the preprocessing and Cauchy prior recommended by Gelman et al. (2008). We draw four random starting points and launch independent runs of both the traditional MH in Figure 1 and our MHSUBLHD in Figure 3 at each of these four starting points.
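The Lipschitz argument used for logistic regression, $C_{\theta,\theta'} = \|\theta - \theta'\| \max_j \|x_j\|$ with the log likelihood (19), can be checked numerically with the sketch below. The synthetic features, labels, and parameter values are made up for illustration and are unrelated to the covtype data; all function names are hypothetical.

```python
import math
import random


def log_lik(y, x, theta):
    """Log likelihood (19): -log(1 + exp(-theta.x)) - (1 - y) * theta.x, with y in {0, 1}."""
    z = sum(t * xi for t, xi in zip(theta, x))
    # log(1 + exp(-z)) computed stably for both signs of z.
    softplus = max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return -softplus - (1 - y) * z


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(vi * vi for vi in v))


def lipschitz_range(theta, theta_new, xs):
    """C = ||theta - theta'|| * max_j ||x_j||, the bound on per-datum log likelihood ratios."""
    diff = [a - b for a, b in zip(theta, theta_new)]
    return norm(diff) * max(norm(x) for x in xs)


rng = random.Random(2)
xs = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(2_000)]
ys = [rng.randint(0, 1) for _ in xs]
theta, theta_new = [0.3, -0.2], [0.35, -0.1]

c_range = lipschitz_range(theta, theta_new, xs)
largest = max(
    abs(log_lik(y, x, theta_new) - log_lik(y, x, theta)) for x, y in zip(xs, ys)
)
print(largest, c_range)
```

The bound only needs $\max_j \|x_j\|$, which can be computed once for the dataset, so $C_{\theta,\theta'}$ is available at each iteration without touching the likelihood.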
Figure 5 shows the results: plain lines indicate traditional MH runs, while dashed lines indicate runs of MHSUBLHD. Figures 5(c) and 5(d) confirm that MHSUBLHD accurately approximates the target posterior. In all Figures 5(a) to 5(d), MHSUBLHD reproduces the behaviour of MH, but converges up to 3 times faster. However, the most significant gains in number of samples used happen in the initial transient phase. This allows fast progress of the chains towards the mode but, once in the mode, the average number of samples required by MHSUBLHD is close to $n$. We observed the same behaviour when considering all $q = 10$ quantitative attributes of the dataset, as depicted by the train error in Figure 7(a).

[Figure 5 panels. (a) train error vs. time; (b) test error vs. time; (c) 30% quantile of the chain vs. time; (d) mean of the chain vs. time.]

Figure 5. Results of 4 independent runs of MH (plain lines) and MHSUBLHD (dashed lines) for the first 2 attributes of the covtype dataset. The legend indicates the average number of samples required as a proportion of $n$.

##### 4.2.2. Synthetic data

To investigate the role of $n$ in the gain, we generate a 2D binary classification dataset of size $n = $ [value missing in the source]. Given the label, both classes are sampled from unit Gaussians centered on the x-axis, and a subsample of $\mathcal{X}$ is shown in Figure 4(d). The results are shown in Figure 6. The setting appears more favorable than in Section 4.2.1, and MHSUBLHD chains converge up to 5 times faster. The average number of samples used is smaller, but it is still around 70% after the transient phase for all approximate chains.

[Figure 6 panels. (a) train error vs. time; (b) test error vs. time; (c) 30% quantile of the chain vs. time; (d) mean of the chain vs. time.]

Figure 6. Results of 4 independent runs of MH (plain lines) and MHSUBLHD (dashed lines) for the synthetic dataset described in Section 4.2.2. The legend indicates the average number of samples required as a proportion of $n$. In Figures 6(a) and 6(b), a dash-dotted line indicates the error obtained by the Bayes classifier.

[Figure 7 panels. (a) train error vs. time; (b) running proportion of samples needed vs. iteration number.]

Figure 7. (a) Results of 4 independent runs of MH (plain lines) and MHSUBLHD (dashed lines) for the 10 quantitative attributes of the covtype dataset. The legend indicates the average number of samples required as a proportion of $n$. (b) Running proportion of samples needed vs. iteration number for different values of $n$, for the Gaussian experiment of Section 4.3.

#### 4.3. A Gaussian example

To further investigate when gains are made at equilibrium, we now consider inferring the mean $\theta$ of a $\mathcal{N}(\theta, 1)$ model, using a sample $\mathcal{X} \sim \mathcal{N}(\frac{1}{2}, \sigma_X^2)$ of size $n$. Although simple, this setting allows analytic considerations. The log likelihood ratio is

$$\log \frac{p(x \mid \theta')}{p(x \mid \theta)} = (\theta' - \theta) \left(x - \frac{\theta + \theta'}{2}\right),$$

so that we can set

$$C_{\theta,\theta'} = 2\, |\theta' - \theta| \left(\max_{1 \leq i \leq n} |x_i| + \frac{|\theta + \theta'|}{2}\right).$$

We also remark that

$$\sqrt{\mathbb{V}_x \log \frac{p(x \mid \theta')}{p(x \mid \theta)}} = |\theta' - \theta|\, \sigma_X. \tag{20}$$

Under the equilibrium assumptions of Section 3.2.2, $|\theta' - \theta|$ is of order $n^{-1/2}$, so that the leading term $t^{-1/2} \hat\sigma_t$ of the concentration inequality (9) is of order $\sigma_X n^{-1/2} t^{-1/2}$. Thus, to break out of the while loop in Line 19 in Figure 3, we need $t \propto \sigma_X^2 n$. In a nutshell, larger gains are thus to be expected when data are clustered in terms of log likelihood. To illustrate this phenomenon, we set $\sigma_X = 0.1$. To investigate the behavior at equilibrium, all runs were started at the mode of a subsampled likelihood, using a proposal covariance matrix proportional to the covariance of the target. In Figure 7(b), we show the running average number of samples needed for 6 runs of MHSUBLHD, with $n$ ranging from $10^5$ to [upper value missing in the source]. With increasing $n$, the number of samples needed progressively drops to 25% of the total $n$. This is satisfying, as the number of samples required at equilibrium should be less than 50% to actually improve on usual MH, since a careful implementation of MH in Figure 1 only requires evaluating one single full likelihood per iteration, while methods based on subsampling require two.

### 5. Conclusion

We have presented an approximate MH algorithm to perform Bayesian inference for large datasets. This is a robust alternative to the technique in (Korattikara et al., 2014), and this robustness comes at an increased computational price. We have obtained theoretical guarantees on the resulting chain, including a user-controlled error in total variation, and we have demonstrated the methodology on several applications. Experimentally, the resulting approximate chains achieve fast burn-in, requiring on average only a fraction of the full dataset. At equilibrium, the performance of the method is strongly problem-dependent. Loosely speaking, if the expectation w.r.t. $\pi(\theta)\, q(\theta' \mid \theta)$ of the variance of $\log p(x \mid \theta')/p(x \mid \theta)$ w.r.t. the empirical distribution of the observations is low, then one can expect significant gains. If this expectation is high, then the algorithm is of limited interest, as the Monte Carlo estimate $\Lambda^*_t(\theta, \theta')$ requires many samples $t$ to reach a reasonable variance. It would be desirable to use variance reduction techniques, but this is highly challenging in this context. Finally, the algorithm and analysis provided here can be straightforwardly extended to scenarios where $\pi$ is such that $\log \pi(\theta')/\pi(\theta)$ is intractable, as long as a concentration inequality for a Monte Carlo estimator of this log ratio is available. For models with an intractable likelihood, it is often possible to obtain such an estimator with low variance, so the methodology discussed here may prove useful.

### Acknowledgments

Rémi Bardenet acknowledges his research fellowship through the 2020 Science programme, funded by EPSRC grant number EP/I017909/1. Chris Holmes is supported by an EPSRC programme grant and a Medical Research Council Programme Leader's award.

### References

Alkhamis, T. M., Ahmed, M. A., and Tuan, V. K. Simulated annealing for discrete optimization with estimation. European Journal of Operational Research, 116, 1999.

Andrieu, C. and Thoms, J. A tutorial on adaptive MCMC. Statistics and Computing, 18, 2008.

Audibert, J.-Y., Munos, R., and Szepesvári, Cs. Exploration-exploitation trade-off using variance estimates in multi-armed bandits. Theoretical Computer Science, 2009.

Bardenet, R. and Maillard, O.-A. Concentration inequalities for sampling without replacement. Preprint, 2013. URL arxiv.org/abs/

Bulgak, A. A. and Sanders, J. L. Integrating a modified simulated annealing algorithm with the simulation of a manufacturing system to optimize buffer sizes in automatic assembly systems. In Proceedings of the 20th Winter Simulation Conference, 1988.

Cesa-Bianchi, N. and Lugosi, G. Combinatorial bandits. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.

Collobert, R., Bengio, S., and Bengio, Y. A parallel mixture of SVMs for very large scale problems. Neural Computation, 14(5), 2002.

Domingo, C. and Watanabe, O. MadaBoost: a modification of AdaBoost. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory (COLT), 2000.

Ferré, D., Hervé, L., and Ledoux, J. Regular perturbation of V-geometrically ergodic Markov chains. Journal of Applied Probability, 50(1), 2013.

Gelman, A., Jakulin, A., Pittau, M. G., and Su, Y.-S. A weakly informative default prior distribution for logistic and other regression models. Annals of Applied Statistics, 2008.

Haario, H., Saksman, E., and Tamminen, J. An adaptive Metropolis algorithm. Bernoulli, 7, 2001.

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13-30, 1963.

Korattikara, A., Chen, Y., and Welling, M. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proceedings of the International Conference on Machine Learning (ICML), 2014.

Maron, O. and Moore, A. Hoeffding races: Accelerating model selection search for classification and function approximation. In Advances in Neural Information Processing Systems. The MIT Press, 1993.

Mnih, V. Efficient stopping rules. Master's thesis, University of Alberta, 2008.

Mnih, V., Szepesvári, Cs., and Audibert, J.-Y. Empirical Bernstein stopping. In Proceedings of the 25th International Conference on Machine Learning (ICML), 2008.

Robert, C. P. and Casella, G. Monte Carlo Statistical Methods. Springer-Verlag, New York, 2004.

Roberts, G. O. and Rosenthal, J. S. Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science, 16, 2001.

Serfling, R. J. Probability inequalities for the sum in sampling without replacement. The Annals of Statistics, 2(1):39-48, 1974.

Singh, S., Wick, M., and McCallum, A. Monte Carlo MCMC: Efficient inference by approximate sampling. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012.

van der Vaart, A. W. Asymptotic Statistics. Cambridge University Press, 2000.

Wang, L. and Zhang, L. Stochastic optimization using simulated annealing with hypothesis test. Applied Mathematics and Computation, 174, 2006.


### EC 6310: Advanced Econometric Theory

EC 6310: Advanced Econometric Theory July 2008 Slides for Lecture on Bayesian Computation in the Nonlinear Regression Model Gary Koop, University of Strathclyde 1 Summary Readings: Chapter 5 of textbook.

### UCB REVISITED: IMPROVED REGRET BOUNDS FOR THE STOCHASTIC MULTI-ARMED BANDIT PROBLEM

UCB REVISITED: IMPROVED REGRET BOUNDS FOR THE STOCHASTIC MULTI-ARMED BANDIT PROBLEM PETER AUER AND RONALD ORTNER ABSTRACT In the stochastic multi-armed bandit problem we consider a modification of the

### Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

### More details on the inputs, functionality, and output can be found below.

Overview: The SMEEACT (Software for More Efficient, Ethical, and Affordable Clinical Trials) web interface (http://research.mdacc.tmc.edu/smeeactweb) implements a single analysis of a two-armed trial comparing

### ABC methods for model choice in Gibbs random fields

ABC methods for model choice in Gibbs random fields Jean-Michel Marin Institut de Mathématiques et Modélisation Université Montpellier 2 Joint work with Aude Grelaud, Christian Robert, François Rodolphe,

### Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### The Optimality of Naive Bayes

The Optimality of Naive Bayes Harry Zhang Faculty of Computer Science University of New Brunswick Fredericton, New Brunswick, Canada email: hzhang@unbca E3B 5A3 Abstract Naive Bayes is one of the most

### Parallelization Strategies for Multicore Data Analysis

Parallelization Strategies for Multicore Data Analysis Wei-Chen Chen 1 Russell Zaretzki 2 1 University of Tennessee, Dept of EEB 2 University of Tennessee, Dept. Statistics, Operations, and Management

### Basics of Statistical Machine Learning

CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

### The Artificial Prediction Market

The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory

### Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

### PS 271B: Quantitative Methods II. Lecture Notes

PS 271B: Quantitative Methods II Lecture Notes Langche Zeng zeng@ucsd.edu The Empirical Research Process; Fundamental Methodological Issues 2 Theory; Data; Models/model selection; Estimation; Inference.

### Parameter estimation for nonlinear models: Numerical approaches to solving the inverse problem. Lecture 12 04/08/2008. Sven Zenker

Parameter estimation for nonlinear models: Numerical approaches to solving the inverse problem Lecture 12 04/08/2008 Sven Zenker Assignment no. 8 Correct setup of likelihood function One fixed set of observation

### Regression Using Support Vector Machines: Basic Foundations

Regression Using Support Vector Machines: Basic Foundations Technical Report December 2004 Aly Farag and Refaat M Mohamed Computer Vision and Image Processing Laboratory Electrical and Computer Engineering

### Model-based Synthesis. Tony O Hagan

Model-based Synthesis Tony O Hagan Stochastic models Synthesising evidence through a statistical model 2 Evidence Synthesis (Session 3), Helsinki, 28/10/11 Graphical modelling The kinds of models that

### Christfried Webers. Canberra February June 2015

c Statistical Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 829 c Part VIII Linear Classification 2 Logistic

### Fairfield Public Schools

Mathematics Fairfield Public Schools AP Statistics AP Statistics BOE Approved 04/08/2014 1 AP STATISTICS Critical Areas of Focus AP Statistics is a rigorous course that offers advanced students an opportunity

### Big Data, Statistics, and the Internet

Big Data, Statistics, and the Internet Steven L. Scott April, 4 Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 / 39 Summary Big data live on more than one machine. Computing takes

### Probability and Statistics

CHAPTER 2: RANDOM VARIABLES AND ASSOCIATED FUNCTIONS 2b - 0 Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be

Graduate Programs in Statistics Course Titles STAT 100 CALCULUS AND MATR IX ALGEBRA FOR STATISTICS. Differential and integral calculus; infinite series; matrix algebra STAT 195 INTRODUCTION TO MATHEMATICAL

### A crash course in probability and Naïve Bayes classification

Probability theory A crash course in probability and Naïve Bayes classification Chapter 9 Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s

### Bayesian sample size determination of vibration signals in machine learning approach to fault diagnosis of roller bearings

Bayesian sample size determination of vibration signals in machine learning approach to fault diagnosis of roller bearings Siddhant Sahu *, V. Sugumaran ** * School of Mechanical and Building Sciences,

### CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(

### Data Mining Practical Machine Learning Tools and Techniques

Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

### Predict Influencers in the Social Network

Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

### Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

### Combining information from different survey samples - a case study with data collected by world wide web and telephone

Combining information from different survey samples - a case study with data collected by world wide web and telephone Magne Aldrin Norwegian Computing Center P.O. Box 114 Blindern N-0314 Oslo Norway E-mail:

### 2014-2015 The Master s Degree with Thesis Course Descriptions in Industrial Engineering

2014-2015 The Master s Degree with Thesis Course Descriptions in Industrial Engineering Compulsory Courses IENG540 Optimization Models and Algorithms In the course important deterministic optimization

### Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

### Inference on Phase-type Models via MCMC

Inference on Phase-type Models via MCMC with application to networks of repairable redundant systems Louis JM Aslett and Simon P Wilson Trinity College Dublin 28 th June 202 Toy Example : Redundant Repairable

### Markov Chain Monte Carlo Simulation Made Simple

Markov Chain Monte Carlo Simulation Made Simple Alastair Smith Department of Politics New York University April2,2003 1 Markov Chain Monte Carlo (MCMC) simualtion is a powerful technique to perform numerical

### Lecture 6: The Bayesian Approach

Lecture 6: The Bayesian Approach What Did We Do Up to Now? We are given a model Log-linear model, Markov network, Bayesian network, etc. This model induces a distribution P(X) Learning: estimate a set

### Nan Kong, Andrew J. Schaefer. Department of Industrial Engineering, Univeristy of Pittsburgh, PA 15261, USA

A Factor 1 2 Approximation Algorithm for Two-Stage Stochastic Matching Problems Nan Kong, Andrew J. Schaefer Department of Industrial Engineering, Univeristy of Pittsburgh, PA 15261, USA Abstract We introduce

### Un point de vue bayésien pour des algorithmes de bandit plus performants

Un point de vue bayésien pour des algorithmes de bandit plus performants Emilie Kaufmann, Telecom ParisTech Rencontre des Jeunes Statisticiens, Aussois, 28 août 2013 Emilie Kaufmann (Telecom ParisTech)

### A Bootstrap Metropolis-Hastings Algorithm for Bayesian Analysis of Big Data

A Bootstrap Metropolis-Hastings Algorithm for Bayesian Analysis of Big Data Faming Liang University of Florida August 9, 2015 Abstract MCMC methods have proven to be a very powerful tool for analyzing

### Chapter 6. The stacking ensemble approach

82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

### Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget

Anoop Korattikara AKORATTI@UCI.EDU School of Information & Computer Sciences, University of California, Irvine, CA 92617, USA Yutian Chen YUTIAN.CHEN@ENG.CAM.EDU Department of Engineering, University of

### Learning outcomes. Knowledge and understanding. Competence and skills

Syllabus Master s Programme in Statistics and Data Mining 120 ECTS Credits Aim The rapid growth of databases provides scientists and business people with vast new resources. This programme meets the challenges

### Notes from Week 1: Algorithms for sequential prediction

CS 683 Learning, Games, and Electronic Markets Spring 2007 Notes from Week 1: Algorithms for sequential prediction Instructor: Robert Kleinberg 22-26 Jan 2007 1 Introduction In this course we will be looking

### 5. Convergence of sequences of random variables

5. Convergence of sequences of random variables Throughout this chapter we assume that {X, X 2,...} is a sequence of r.v. and X is a r.v., and all of them are defined on the same probability space (Ω,

### Simple and efficient online algorithms for real world applications

Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,

### Gaussian Processes to Speed up Hamiltonian Monte Carlo

Gaussian Processes to Speed up Hamiltonian Monte Carlo Matthieu Lê Murray, Iain http://videolectures.net/mlss09uk_murray_mcmc/ Rasmussen, Carl Edward. "Gaussian processes to speed up hybrid Monte Carlo

### Online Classification on a Budget

Online Classification on a Budget Koby Crammer Computer Sci. & Eng. Hebrew University Jerusalem 91904, Israel kobics@cs.huji.ac.il Jaz Kandola Royal Holloway, University of London Egham, UK jaz@cs.rhul.ac.uk

### Multiple Testing. Joseph P. Romano, Azeem M. Shaikh, and Michael Wolf. Abstract

Multiple Testing Joseph P. Romano, Azeem M. Shaikh, and Michael Wolf Abstract Multiple testing refers to any instance that involves the simultaneous testing of more than one hypothesis. If decisions about

### HT2015: SC4 Statistical Data Mining and Machine Learning

HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric

### On the Path to an Ideal ROC Curve: Considering Cost Asymmetry in Learning Classifiers

On the Path to an Ideal ROC Curve: Considering Cost Asymmetry in Learning Classifiers Francis R. Bach Computer Science Division University of California Berkeley, CA 9472 fbach@cs.berkeley.edu Abstract

### Big Data - Lecture 1 Optimization reminders

Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Schedule Introduction Major issues Examples Mathematics

### Monotonicity Hints. Abstract

Monotonicity Hints Joseph Sill Computation and Neural Systems program California Institute of Technology email: joe@cs.caltech.edu Yaser S. Abu-Mostafa EE and CS Deptartments California Institute of Technology

### CSC2420 Fall 2012: Algorithm Design, Analysis and Theory

CSC2420 Fall 2012: Algorithm Design, Analysis and Theory Allan Borodin November 15, 2012; Lecture 10 1 / 27 Randomized online bipartite matching and the adwords problem. We briefly return to online algorithms

### Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

### Chapter 14 Managing Operational Risks with Bayesian Networks

Chapter 14 Managing Operational Risks with Bayesian Networks Carol Alexander This chapter introduces Bayesian belief and decision networks as quantitative management tools for operational risks. Bayesian

### Estimation and comparison of multiple change-point models

Journal of Econometrics 86 (1998) 221 241 Estimation and comparison of multiple change-point models Siddhartha Chib* John M. Olin School of Business, Washington University, 1 Brookings Drive, Campus Box

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

### Final Exam, Spring 2007

10-701 Final Exam, Spring 2007 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 16 numbered pages in this exam (including this cover sheet). 3. You can use any material you brought:

### Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano

### The CUSUM algorithm a small review. Pierre Granjon

The CUSUM algorithm a small review Pierre Granjon June, 1 Contents 1 The CUSUM algorithm 1.1 Algorithm............................... 1.1.1 The problem......................... 1.1. The different steps......................

### Java Modules for Time Series Analysis

Java Modules for Time Series Analysis Agenda Clustering Non-normal distributions Multifactor modeling Implied ratings Time series prediction 1. Clustering + Cluster 1 Synthetic Clustering + Time series

### Probabilistic user behavior models in online stores for recommender systems

Probabilistic user behavior models in online stores for recommender systems Tomoharu Iwata Abstract Recommender systems are widely used in online stores because they are expected to improve both user

### Identifying Market Price Levels using Differential Evolution

Identifying Market Price Levels using Differential Evolution Michael Mayo University of Waikato, Hamilton, New Zealand mmayo@waikato.ac.nz WWW home page: http://www.cs.waikato.ac.nz/~mmayo/ Abstract. Evolutionary

### Adaptive Search with Stochastic Acceptance Probabilities for Global Optimization

Adaptive Search with Stochastic Acceptance Probabilities for Global Optimization Archis Ghate a and Robert L. Smith b a Industrial Engineering, University of Washington, Box 352650, Seattle, Washington,

### IN THIS PAPER, we study the delay and capacity trade-offs

IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 15, NO. 5, OCTOBER 2007 981 Delay and Capacity Trade-Offs in Mobile Ad Hoc Networks: A Global Perspective Gaurav Sharma, Ravi Mazumdar, Fellow, IEEE, and Ness

### Monte Carlo testing with Big Data

Monte Carlo testing with Big Data Patrick Rubin-Delanchy University of Bristol & Heilbronn Institute for Mathematical Research Joint work with: Axel Gandy (Imperial College London) with contributions from:

### Introduction to Online Learning Theory

Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent

### Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - ECML-PKDD,

### Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

### A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,

### Bayes and Naïve Bayes. cs534-machine Learning

Bayes and aïve Bayes cs534-machine Learning Bayes Classifier Generative model learns Prediction is made by and where This is often referred to as the Bayes Classifier, because of the use of the Bayes rule

### Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk

Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk Ensembles 2 Learning Ensembles Learn multiple alternative definitions of a concept using different training

### Linear regression methods for large n and streaming data

Linear regression methods for large n and streaming data Large n and small or moderate p is a fairly simple problem. The sufficient statistic for β in OLS (and ridge) is: The concept of sufficiency is

### SAS Certificate Applied Statistics and SAS Programming

SAS Certificate Applied Statistics and SAS Programming SAS Certificate Applied Statistics and Advanced SAS Programming Brigham Young University Department of Statistics offers an Applied Statistics and

### 1 Limiting distribution for a Markov chain

Copyright c 2009 by Karl Sigman Limiting distribution for a Markov chain In these Lecture Notes, we shall study the limiting behavior of Markov chains as time n In particular, under suitable easy-to-check

### Offline sorting buffers on Line

Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

### Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

Foundations of Machine Learning On-Line Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.

### Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary

Shape, Space, and Measurement- Primary A student shall apply concepts of shape, space, and measurement to solve problems involving two- and three-dimensional shapes by demonstrating an understanding of:

### Machine learning and optimization for massive data

Machine learning and optimization for massive data Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Eric Moulines - IHES, May 2015 Big data revolution?

### Load Balancing and Switch Scheduling

EE384Y Project Final Report Load Balancing and Switch Scheduling Xiangheng Liu Department of Electrical Engineering Stanford University, Stanford CA 94305 Email: liuxh@systems.stanford.edu Abstract Load

### Master s Theory Exam Spring 2006

Spring 2006 This exam contains 7 questions. You should attempt them all. Each question is divided into parts to help lead you through the material. You should attempt to complete as much of each problem

### Faculty of Science School of Mathematics and Statistics

Faculty of Science School of Mathematics and Statistics MATH5836 Data Mining and its Business Applications Semester 1, 2014 CRICOS Provider No: 00098G MATH5836 Course Outline Information about the course

### A Perspective on Statistical Tools for Data Mining Applications

A Perspective on Statistical Tools for Data Mining Applications David M. Rocke Center for Image Processing and Integrated Computing University of California, Davis Statistics and Data Mining Statistics

### SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

### Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

### Semi-Supervised Support Vector Machines and Application to Spam Filtering

Semi-Supervised Support Vector Machines and Application to Spam Filtering Alexander Zien Empirical Inference Department, Bernhard Schölkopf Max Planck Institute for Biological Cybernetics ECML 2006 Discovery

### Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max