To appear, Annals of Applied Probability 2005, Vol. 0, No. 0, 1-8.

ON THE ERGODICITY PROPERTIES OF SOME ADAPTIVE MCMC ALGORITHMS

By Christophe Andrieu and Éric Moulines

University of Bristol and École Nationale Supérieure des Télécommunications

In this paper we study the ergodicity properties of some adaptive Markov chain Monte Carlo (MCMC) algorithms that have recently been proposed in the literature. We prove that under a set of verifiable conditions, ergodic averages calculated from the output of a so-called adaptive MCMC sampler converge to the required value and can even, under more stringent assumptions, satisfy a central limit theorem. We prove that the required conditions are satisfied for the Independent Metropolis-Hastings algorithm and the Random Walk Metropolis algorithm with symmetric increments. Finally we propose an application of these results to the case where the proposal distribution of the Metropolis-Hastings update is a mixture of distributions from a curved exponential family.

1. Introduction. Markov chain Monte Carlo (MCMC), introduced by Metropolis et al. (1953), is a popular computational method for generating samples from virtually any distribution $\pi$. In particular there is no need for the normalising constant to be known, and the space $\mathsf{X} \subset \mathbb{R}^{n_x}$ (for some integer $n_x$) on which it is defined can be high dimensional. We will hereafter denote $\mathcal{B}(\mathsf{X})$ the associated countably generated $\sigma$-field. The method consists of simulating an ergodic Markov chain $\{X_k, k \ge 0\}$ on $\mathsf{X}$ with transition probability $P$ such that $\pi$ is a stationary distribution for this chain, i.e. $\pi P = \pi$. Such samples can be used, e.g., to compute integrals $\pi(f) \stackrel{\text{def}}{=} \int_{\mathsf{X}} f(x)\,\pi(dx)$ for some $\pi$-integrable function $f : \mathsf{X} \to \mathbb{R}$, through the ergodic averages

$$S_n(f) = \frac{1}{n} \sum_{k=1}^{n} f(X_k). \qquad (1)$$

In general the transition probability $P$ of the Markov chain depends on some tuning parameter, say $\theta$, defined on some space $\Theta \subset \mathbb{R}^{n_\theta}$ for some integer $n_\theta$, and the convergence properties of the Monte Carlo averages in Eq.
(1) might highly depend on a proper choice of these parameters.

The authors would like to thank the Royal Society/CNRS partnership and the Royal Society International Exchange visit programme. An initial version of this work was presented at the First Cape Cod Monte Carlo Workshop, September.

AMS 2000 subject classifications: Primary 65C05, 65C40, 60J27, 60J35; secondary 93E35.

Keywords and phrases: Adaptive Markov chain Monte Carlo, Self-tuning algorithm, Metropolis-Hastings algorithm, Stochastic approximation, State-dependent noise, Randomly varying truncation, Martingale, Poisson method.
We illustrate this here with the Metropolis-Hastings (MH) update, but it should be stressed at this point that the results presented in this paper apply to much more general settings, including in particular hybrid samplers and sequential or population Monte Carlo samplers. The MH algorithm requires the choice of a proposal distribution $q$. In order to simplify the discussion, we will here assume that $\pi$ and $q$ admit densities with respect to the Lebesgue measure $\lambda^{\text{Leb}}$, denoted with an abuse of notation $\pi$ and $q$ hereafter. The rôle of the distribution $q$ consists of proposing potential transitions for the Markov chain $\{X_k\}$. Given that the chain is currently at $x$, a candidate $y$ is accepted with probability $\alpha(x,y)$ defined as

$$\alpha(x,y) = \begin{cases} 1 \wedge \dfrac{\pi(y)\,q(y,x)}{\pi(x)\,q(x,y)} & \text{if } \pi(x)\,q(x,y) > 0, \\ 1 & \text{otherwise,} \end{cases}$$

where $a \wedge b \stackrel{\text{def}}{=} \min(a,b)$. Otherwise it is rejected and the Markov chain stays at its current location $x$. The transition kernel $P$ of this Markov chain takes the form, for $(x,A) \in \mathsf{X} \times \mathcal{B}(\mathsf{X})$,

$$P(x,A) = \int_{A-x} \alpha(x,x+z)\,q(x,x+z)\,\lambda^{\text{Leb}}(dz) + \mathbb{1}_A(x) \int_{\mathsf{X}-x} \big(1 - \alpha(x,x+z)\big)\,q(x,x+z)\,\lambda^{\text{Leb}}(dz), \qquad (2)$$

where $A - x \stackrel{\text{def}}{=} \{z,\; x+z \in A\}$. The Markov chain with kernel $P$ is reversible with respect to $\pi$, and therefore admits $\pi$ as invariant distribution. Conditions on the proposal distribution $q$ that guarantee irreducibility and positive recurrence are mild, and many satisfactory choices are possible; for the purpose of illustration, we concentrate in this introduction on the symmetric increments random-walk MH algorithm (hereafter SRWM), in which $q(x,y) = q(y-x)$ for some symmetric probability density $q$ on $\mathbb{R}^{n_x}$, referred to as the increment distribution. The transition kernel $P^{\text{SRW}}_q$ of the Metropolis algorithm is then given for $(x,A) \in \mathsf{X} \times \mathcal{B}(\mathsf{X})$ by

$$P^{\text{SRW}}_q(x,A) = \int_{A-x} \left(1 \wedge \frac{\pi(x+z)}{\pi(x)}\right) q(z)\,\lambda^{\text{Leb}}(dz) + \mathbb{1}_A(x) \int_{\mathsf{X}-x} \left(1 - 1 \wedge \frac{\pi(x+z)}{\pi(x)}\right) q(z)\,\lambda^{\text{Leb}}(dz). \qquad (3)$$

A classical choice for $q$ is the multivariate normal distribution with zero mean and covariance matrix $\Gamma$, $\mathcal{N}(0,\Gamma)$. We will later on refer to this algorithm as the N-SRWM.
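A one-dimensional SRWM move, as in Eq. (3), can be sketched in a few lines; the target below (a standard normal), the proposal scale and the function `srwm_step` are illustrative choices, not part of the paper.

```python
import math
import random

def srwm_step(x, log_pi, scale, rng):
    """One symmetric random-walk Metropolis (SRWM) move: propose
    y = x + z with z ~ N(0, scale^2) and accept with probability
    1 ∧ pi(y)/pi(x), computed via log densities for stability."""
    y = x + rng.gauss(0.0, scale)
    if math.log(rng.random()) < min(0.0, log_pi(y) - log_pi(x)):
        return y  # move accepted
    return x      # move rejected: the chain stays at x

# Usage: target pi = N(0, 1); estimate pi(f) for f(x) = x^2 (true value 1)
# via the ergodic average S_n(f) of Eq. (1).
log_pi = lambda t: -0.5 * t * t
rng = random.Random(42)
x, acc, n = 0.0, 0.0, 50_000
for _ in range(n):
    x = srwm_step(x, log_pi, 2.4, rng)
    acc += x * x
print(abs(acc / n - 1.0) < 0.1)
```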
It is well known that either too small or too large a covariance matrix will result in highly positively correlated Markov chains, and therefore in estimators $S_n(f)$ with a large variance. Gelman et al. have shown (under restrictive technical conditions not given here) that the optimal covariance matrix for the N-SRWM is $(2.38)^2/n_x\,\Gamma_\pi$, where $\Gamma_\pi$ is the true covariance matrix of the target distribution. In practice this covariance matrix $\Gamma$ is determined by trial and error, using several realisations of the Markov chain. This hand-tuning requires some expertise and can be time-consuming. In order to circumvent this problem, Haario et al. (2001) have proposed to learn $\Gamma$ on the fly. The Haario et al.
(2001) algorithm can be summarized as follows: for $k \ge 0$,

$$\mu_{k+1} = \mu_k + \gamma_{k+1}\,(X_{k+1} - \mu_k), \qquad (4)$$
$$\Gamma_{k+1} = \Gamma_k + \gamma_{k+1}\,\big((X_{k+1} - \mu_k)(X_{k+1} - \mu_k)^T - \Gamma_k\big),$$

where, denoting $\mathcal{C}^{n_x}_+$ the cone of positive $n_x \times n_x$ matrices, $X_{k+1}$ is drawn from $P_{\theta_k}(X_k, \cdot)$, where for $\theta = (\mu, \Gamma) \in \Theta = \mathbb{R}^{n_x} \times \mathcal{C}^{n_x}_+$, $P_\theta = P^{\text{SRW}}_{\mathcal{N}(0, \lambda\Gamma)}$ is here the kernel of a symmetric random walk MH algorithm with Gaussian increment distribution $\mathcal{N}(0, \lambda\Gamma)$, $\lambda > 0$ being a constant scaling factor depending only on the dimension $n_x$ of the state space and kept constant across the iterations. Here $\{\gamma_k\}$ is a non-increasing sequence of positive stepsizes such that $\sum \gamma_k = \infty$ and $\sum \gamma_k^{1+\delta} < \infty$ for some $\delta > 0$; Haario et al. (2001) have suggested the choice $\gamma_k = 1/k$. It was realised in Andrieu and Robert (2001) that such a scheme is a particular case of a more general framework. More precisely, for $\theta = (\mu, \Gamma) \in \Theta$, define $H : \Theta \times \mathsf{X} \to \mathbb{R}^{n_x} \times \mathbb{R}^{n_x \times n_x}$ as

$$H(\theta, x) \stackrel{\text{def}}{=} \big(x - \mu, \; (x-\mu)(x-\mu)^T - \Gamma\big)^T. \qquad (5)$$

With this notation, the recursion in (4) may be written as

$$\theta_{k+1} = \theta_k + \gamma_{k+1} H(\theta_k, X_{k+1}), \qquad k \ge 0, \qquad (6)$$

with $X_{k+1} \sim P_{\theta_k}(X_k, \cdot)$. This recursion is at the core of most classical stochastic approximation algorithms (see e.g. Benveniste et al. (1990), Duflo (1997), Kushner and Yin (1997) and the references therein). This algorithm is designed to solve the equation $h(\theta) = 0$, where $\theta \mapsto h(\theta)$ is the so-called mean field defined as

$$h(\theta) \stackrel{\text{def}}{=} \int_{\mathsf{X}} H(\theta, x)\,\pi(dx). \qquad (7)$$

For the present example, assuming that $\int_{\mathsf{X}} |x|^2\,\pi(dx) < \infty$, one can easily check that

$$h(\theta) = \int_{\mathsf{X}} H(\theta, x)\,\pi(dx) = \big(\mu_\pi - \mu, \; (\mu_\pi - \mu)(\mu_\pi - \mu)^T + \Gamma_\pi - \Gamma\big)^T, \qquad (8)$$

with $\mu_\pi$ and $\Gamma_\pi$ the mean and covariance of the target distribution,

$$\mu_\pi = \int_{\mathsf{X}} x\,\pi(dx) \quad \text{and} \quad \Gamma_\pi = \int_{\mathsf{X}} (x - \mu_\pi)(x - \mu_\pi)^T\,\pi(dx). \qquad (9)$$

One can rewrite (6) as $\theta_{k+1} = \theta_k + \gamma_{k+1} h(\theta_k) + \gamma_{k+1} \xi_{k+1}$, where $\{\xi_k = H(\theta_{k-1}, X_k) - h(\theta_{k-1}),\; k \ge 1\}$ is generally referred to as the noise. The general theory of stochastic approximation (SA) provides us with conditions under which this recursion eventually converges to the set $\{\theta \in \Theta,\; h(\theta) = 0\}$. These issues are discussed in Sections 3 and 5.
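The recursion (4) can be sketched in one dimension as follows; this is a minimal illustrative sketch (the standard-normal target, the seed and the function name are assumptions), not the paper's general construction.

```python
import math
import random

def adaptive_srwm_1d(log_pi, n_iters, lam=2.4 ** 2, seed=0):
    """1-d sketch of the recursion (4): mu and gamma_var track the mean
    and variance of the chain with stepsizes gamma_{k+1} = 1/(k+1), and
    the SRWM increment distribution is N(0, lam * gamma_var), lam being
    the fixed scaling factor."""
    rng = random.Random(seed)
    x, mu, gamma_var = 0.0, 0.0, 1.0
    for k in range(1, n_iters + 1):
        # SRWM move X_{k+1} ~ P_{theta_k}(X_k, .), theta_k = (mu, gamma_var)
        y = x + rng.gauss(0.0, math.sqrt(lam * gamma_var))
        if math.log(rng.random()) < min(0.0, log_pi(y) - log_pi(x)):
            x = y
        g = 1.0 / (k + 1)                     # stepsize gamma_{k+1}
        d = x - mu                            # X_{k+1} - mu_k
        mu += g * d                           # Eq. (4), mean update
        gamma_var += g * (d * d - gamma_var)  # Eq. (4), covariance update
    return mu, gamma_var

# Toy usage: for a N(0, 1) target, (mu, gamma_var) should approach
# (mu_pi, Gamma_pi) = (0, 1), the root of the mean field h in Eq. (8).
mu, var = adaptive_srwm_1d(lambda t: -0.5 * t * t, 30_000)
print(abs(mu) < 0.2 and abs(var - 1.0) < 0.3)
```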
In the context of adaptive MCMC, the convergence of the parameter is not the central issue; the focus is rather on the approximation of $\pi(f)$ by the sample mean $S_n(f)$. However there is a difficulty with the adaptive approach: as the parameter estimate $\theta_k = \theta_k(X_0, \ldots, X_k)$ depends on the whole past, the successive draws $\{X_k\}$ do not define a homogeneous Markov chain, and standard arguments for the consistency and asymptotic normality of $S_n(f)$ do not apply in this framework. Note that this is despite the fact that for any $\theta \in \Theta$, $\pi P_\theta = \pi$. This is illustrated by the following example. Let $\mathsf{X} = \{1, 2\}$ and consider, for $\theta, \theta(1), \theta(2) \in \Theta = (0,1)$, the following Markov transition probability matrices:

$$P_\theta = \begin{bmatrix} 1-\theta & \theta \\ \theta & 1-\theta \end{bmatrix}, \qquad P = \begin{bmatrix} 1-\theta(1) & \theta(1) \\ \theta(2) & 1-\theta(2) \end{bmatrix}.$$

One can check that for any $\theta \in \Theta$, $\pi = (1/2, 1/2)$ satisfies $\pi P_\theta = \pi$. However if we let $\theta_k$ be a given function $\theta : \mathsf{X} \to (0,1)$ of the current state, i.e. $\theta_k = \theta(X_k)$, one defines a new Markov chain with transition probability $P$, with now $\big(\theta(2)/(\theta(1)+\theta(2)),\; \theta(1)/(\theta(1)+\theta(2))\big)$ as invariant distribution. One recovers $\pi$ when the dependence on the current state $X_k$ is removed or vanishes with the iterations. With this example in mind, the problems that we address in the present paper, and our main general results, are in words the following:

1. In situations where $\theta_{k+1} - \theta_k \to 0$ as $k \to +\infty$ w.p. 1, we prove a strong law of large numbers for $S_n(f)$ (see Theorem 8) under mild additional conditions. Such a consistency result may arise even in situations where the parameter sequence $\{\theta_k\}$ does not converge.
2. In situations where $\theta_k$ converges w.p. 1, we prove an invariance principle for $\sqrt{n}\,(S_n(f) - \pi(f))$; the limiting distribution is in general a mixture of Gaussian distributions (see Theorem 9).

Note that Haario et al. (2001) have proved the consistency of Monte Carlo averages for the algorithm described by (4).
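The two-state example can be checked numerically. The sketch below solves for the invariant distribution of a general $2 \times 2$ transition matrix in closed form; the helper name and the particular values $\theta = 0.3$, $\theta(1) = 0.2$, $\theta(2) = 0.6$ are illustrative choices, not from the paper.

```python
def invariant_distribution(P):
    """Invariant distribution of a 2x2 stochastic matrix
    [[1-a, a], [b, 1-b]]: solving pi P = pi gives pi = (b, a) / (a + b)."""
    a, b = P[0][1], P[1][0]
    return (b / (a + b), a / (a + b))

# Fixed theta: P_theta is doubly stochastic, so pi is close to (1/2, 1/2).
theta = 0.3
print(invariant_distribution([[1 - theta, theta], [theta, 1 - theta]]))

# State-dependent theta(x): the invariant distribution shifts to
# (theta(2), theta(1)) / (theta(1) + theta(2)), no longer (1/2, 1/2).
t1, t2 = 0.2, 0.6
print(invariant_distribution([[1 - t1, t1], [t2, 1 - t2]]))
```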
Our result applies to more general settings and relies on assumptions which are less restrictive than those used in Haario et al. (2001). The second point above, the invariance principle, has to the best of our knowledge not been addressed for adaptive MCMC algorithms. We point out that Atchadé and Rosenthal (2003) have independently extended the consistency result of Haario et al. (2001) to the case where $\mathsf{X}$ is unbounded, using Haario et al.'s mixingale technique. Our technique of proof is different, and our algorithm allows for an unbounded parameter $\theta$ to be considered, as opposed to Atchadé and Rosenthal (2003). The paper is organized as follows. In Section 2 we detail our general procedure and introduce some notation. In Section 3 we establish the consistency (i.e. a strong law of large numbers) of $S_n(f)$ (Theorem 8). In Section 4 we strengthen the conditions required to ensure the law of large numbers (LLN) for $S_n(f)$ and establish an invariance principle (Theorem 9). In Section 5 we focus on the classical Robbins-Monro implementation of our procedure and introduce further conditions that allow us to prove that $\{\theta_k\}$ converges w.p. 1 (Theorem 11). In Section 6 we establish general properties of the generic SRWM required to ensure a LLN and an invariance principle. For pedagogical purposes we show how to apply these results to the N-SRWM of Haario et al. (2001) (Theorem 15). In Section 7 we present another application of our theory. We focus on the Independent Metropolis-Hastings algorithm (IMH) and establish general properties required for the LLN and the
invariance principle. We then go on to propose and analyse an algorithm that matches the so-called proposal distribution of the IMH to the target distribution $\pi$, in the case where the proposal distribution is a mixture of distributions from the exponential family. The main result of this section is Theorem 21. We conclude with the remark that this latter result equally applies to a generalisation of the N-SRWM, where the proposal is again a mixture of distributions. Application to samplers which consist of a mixture of SRWM and IMH is straightforward.

2. Algorithm description and main definitions. Before describing the procedure under study, it is necessary to introduce some notation and definitions. Let $\mathsf{T}$ be a separable space and let $\mathcal{B}(\mathsf{T})$ be a countably generated $\sigma$-field on $\mathsf{T}$. For a Markov chain with transition probability $P : \mathsf{T} \times \mathcal{B}(\mathsf{T}) \to [0,1]$ and any non-negative measurable function $f : \mathsf{T} \to [0, +\infty)$, we denote $Pf(t) = P(t, f) \stackrel{\text{def}}{=} \int_{\mathsf{T}} P(t, dt')\,f(t')$ and, for any integer $k$, $P^k$ the $k$-th iterate of the kernel. For a probability measure $\mu$, we define, for any $A \in \mathcal{B}(\mathsf{T})$, $\mu P(A) \stackrel{\text{def}}{=} \int_{\mathsf{T}} \mu(dt)\,P(t, A)$. A Markov chain on a state space $\mathsf{T}$ is said to be $\mu$-irreducible if there exists a measure $\mu$ on $\mathcal{B}(\mathsf{T})$ such that, whenever $\mu(A) > 0$, $\sum_{k=0}^{\infty} P^k(t, A) > 0$ for all $t \in \mathsf{T}$. Denote by $\mu$ a maximal irreducibility measure for $P$ (see Meyn and Tweedie (1993), Chapter 4, for the definition and the construction of such a measure). If $P$ is $\mu$-irreducible, aperiodic and has an invariant probability measure $\pi$, then $\pi$ is unique and is a maximal irreducibility measure. Two main ingredients are required for the definition of our adaptive MCMC algorithms:

1. A family of Markov transition kernels on $\mathsf{X}$, $\{P_\theta, \theta \in \Theta\}$, indexed by a finite-dimensional parameter $\theta \in \Theta \subset \mathbb{R}^{n_\theta}$, an open set. For each $\theta \in \Theta$, it is assumed that $P_\theta$ is $\pi$-irreducible and that $\pi P_\theta = \pi$, i.e. $\pi$ is the invariant distribution for $P_\theta$.
2. A family of update functions $\{H(\theta, \cdot) : \mathsf{X} \to \mathbb{R}^{n_\theta}, \theta \in \Theta\}$, which are used to adapt the value of the tuning parameter.
The adaptive algorithm studied in this paper, which corresponds to the process $\{Z_k\}$ defined below, requires for both its definition and its study the introduction of an intermediate stopped process, which we now define. First, in order to take into account potential jumps outside the space $\Theta$, we extend the parameter space with a cemetery point $\theta_c \notin \Theta$ and define $\bar\Theta \stackrel{\text{def}}{=} \Theta \cup \{\theta_c\}$. It is convenient to introduce the family of transition kernels $\{Q_\gamma, \gamma \ge 0\}$ such that for any $\gamma \ge 0$, $(x, \theta) \in \mathsf{X} \times \Theta$, $A \in \mathcal{B}(\mathsf{X})$ and $B \in \mathcal{B}(\bar\Theta)$,

$$Q_\gamma(x, \theta; A \times B) = \int_A P_\theta(x, dy)\,\mathbb{1}\{\theta + \gamma H(\theta, y) \in B\} + \delta_{\theta_c}(B) \int_{\mathsf{X}} P_\theta(x, dy)\,\mathbb{1}\{\theta + \gamma H(\theta, y) \notin \Theta\}, \qquad (10)$$

where $\delta_\theta$ denotes the Dirac mass at $\theta \in \bar\Theta$. In its general form, the basic version of the adaptive MCMC algorithms considered here may be written as follows. Set $\theta_0 = \theta \in \Theta$, $X_0 = x$, and for $k \ge 0$ define recursively the sequence $\{(X_k, \theta_k), k \ge 0\}$: if $\theta_k = \theta_c$, then set $\theta_{k+1} = \theta_c$ and $X_{k+1} = x$; otherwise draw $(X_{k+1}, \theta_{k+1}) \sim Q_{\rho_{k+1}}(X_k, \theta_k; \cdot)$, where $\rho = \{\rho_k\}$
is a sequence of stepsizes. The sequence $\{(X_k, \theta_k)\}$ is a non-homogeneous Markov chain on the product space $\mathsf{X} \times \bar\Theta$. This non-homogeneous Markov chain defines a probability measure on the canonical state space $(\mathsf{X} \times \bar\Theta)^{\mathbb{N}}$ equipped with the canonical product $\sigma$-algebra. We denote $\mathcal{F} = \{\mathcal{F}_k, k \ge 0\}$ the natural filtration of this Markov chain, and $P^\rho_{x,\theta}$ and $E^\rho_{x,\theta}$ the probability and the expectation associated to this Markov chain starting from $(x, \theta) \in \mathsf{X} \times \Theta$. Because of the interaction, with feedback, between $X_k$ and $\theta_k$, the stability of this inhomogeneous Markov chain is often difficult to establish. This is a long-standing problem in the field of stochastic optimization: known practical cures to this problem include reprojections onto a fixed set (see Kushner and Yin (1997)) or the more recent reprojections onto randomly varying boundaries proposed in Chen and Zhu (1986), Chen et al. (1988) and generalized in Andrieu et al. (2005). More precisely, we first define the notion of compact coverage of $\Theta$. A family of compact subsets $\{\mathcal{K}_q, q \ge 0\}$ of $\Theta$ is said to be a compact coverage if

$$\bigcup_{q \ge 0} \mathcal{K}_q = \Theta, \quad \text{and} \quad \mathcal{K}_q \subset \operatorname{int}(\mathcal{K}_{q+1}), \; q \ge 0, \qquad (11)$$

where $\operatorname{int}(A)$ denotes the interior of the set $A$. Let $\gamma \stackrel{\text{def}}{=} \{\gamma_k\}$ be a monotone non-increasing sequence of positive numbers and let $\mathsf{K}$ be a subset of $\mathsf{X}$. For a sequence $a = \{a_k\}$ and an integer $l$, we define the shifted sequence $a^{(l)}$ as follows: for any $k \ge 1$, $a^{(l)}_k \stackrel{\text{def}}{=} a_{k+l}$. Let $\Pi : \mathsf{X} \times \Theta \to \mathsf{K} \times \mathcal{K}_0$ be a measurable function. Define the homogeneous Markov chain $Z_k = \{(X_k, \theta_k, \kappa_k, \nu_k), k \ge 0\}$ on the product space $\mathsf{Z} \stackrel{\text{def}}{=} \mathsf{X} \times \bar\Theta \times \mathbb{N} \times \mathbb{N}$ with transition probability $R : \mathsf{Z} \times \mathcal{B}(\mathsf{Z}) \to [0,1]$, algorithmically defined as follows (note that, in order to alleviate notation, the dependence of $R$ on both $\gamma$ and $\{\mathcal{K}_q, q \ge 0\}$ is implicit throughout the paper): for any $(x, \theta, \kappa, \nu) \in \mathsf{Z}$,

1. If $\nu = 0$, then draw $(X', \theta') \sim Q_{\gamma_\kappa}(\Pi(x, \theta); \cdot)$; otherwise, draw $(X', \theta') \sim Q_{\gamma_{\kappa+\nu}}(x, \theta; \cdot)$.
2. If $\theta' \in \mathcal{K}_\kappa$, then set $\kappa' = \kappa$ and $\nu' = \nu + 1$; otherwise, set $\kappa' = \kappa + 1$ and $\nu' = 0$.
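The mechanics of the chain $Z_k$ can be sketched as follows. This is a schematic one-dimensional sketch under hypothetical choices: truncation sets $\mathcal{K}_q = [-(q+1), q+1]$, stepsizes $\gamma_k = 1/k$ (with simplified indexing) and reinitialisation point $\Pi(x, \theta) = (0, 0)$; none of these particular choices come from the paper.

```python
import random

def truncated_adaptive_run(sample_x, H, n_iters, seed=0):
    """Schematic sketch of Z_k = (X_k, theta_k, kappa_k, nu_k):
    `sample_x(x, theta, rng)` stands for a draw from P_theta(x, .)
    and H(theta, x) is the parameter update function."""
    rng = random.Random(seed)
    x, theta, kappa, nu = 0.0, 0.0, 0, 0
    for _ in range(n_iters):
        if nu == 0:
            x, theta = 0.0, 0.0               # restart from Pi(x, theta)
        gamma = 1.0 / max(1, kappa + nu)      # stepsize (simplified indexing)
        x = sample_x(x, theta, rng)           # X' ~ P_theta(x, .)
        theta = theta + gamma * H(theta, x)
        if abs(theta) <= kappa + 1:           # theta' stays in K_kappa
            nu += 1
        else:                                 # theta' leaves K_kappa: grow the
            kappa, nu = kappa + 1, 0          # set and restart next iteration
    return theta, kappa

# Toy run: a deliberately trivial "kernel" drawing i.i.d. N(3, 1), and the
# Robbins-Monro update H(theta, x) = x - theta, whose mean field
# h(theta) = 3 - theta has its root theta = 3 outside K_0 and K_1, so at
# least two reinitialisations must occur before the iterates settle.
theta, kappa = truncated_adaptive_run(
    lambda x, th, rng: rng.gauss(3.0, 1.0),
    lambda th, x: x - th,
    20_000,
)
print(abs(theta - 3.0) < 0.5 and kappa >= 2)
```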
In words, $\kappa$ and $\nu$ are counters: $\kappa$ is the index of the current active truncation set, and $\nu$ counts the number of iterations since the last reinitialisation. The event $\{\nu_k = 0\}$ indicates that a reinitialisation occurs: the algorithm is restarted at iteration $k$ from a point in $\mathsf{K} \times \mathcal{K}_0$ with the smaller sequence of stepsizes $\gamma^{(\kappa_k)}$. Note the important fact, at the heart of our analysis, that between reinitialisations this process coincides with the basic version of the algorithm described earlier, with $\rho = \gamma^{(\kappa_k)}$. This is formalized in Lemma 1 below. This algorithm is reminiscent of the projection onto randomly varying boundaries proposed in Chen and Zhu (1986), Chen et al. (1988): whenever the current iterate wanders outside the active truncation set, the algorithm is reinitialised with a smaller initial value of the stepsize and a larger truncation set. The homogeneous Markov chain $\{Z_k, k \ge 0\}$ defines a probability measure on the canonical state space $\mathsf{Z}^{\mathbb{N}}$ equipped with the canonical product $\sigma$-algebra. We denote $\mathcal{G} = \{\mathcal{G}_k, k \ge 0\}$, $\bar{P}_{x,\theta,k,l}$ and $\bar{E}_{x,\theta,k,l}$ the filtration, probability and expectation associated to this Markov chain started from $(x, \theta, k, l) \in \mathsf{Z}$. For simplicity we will use the shorthand
notation $\bar{E}$ and $\bar{P}$ for $\bar{E}_{x,\theta} \stackrel{\text{def}}{=} \bar{E}_{x,\theta,0,0}$ and $\bar{P}_{x,\theta} \stackrel{\text{def}}{=} \bar{P}_{x,\theta,0,0}$, for all $(x, \theta) \in \mathsf{X} \times \Theta$. These probability measures depend upon the deterministic sequence $\gamma$; the dependence will be implicit hereafter. We define recursively the sequence $\{T_n, n \ge 0\}$ of successive reinitialisation times,

$$T_{n+1} = \inf\{k \ge T_n + 1,\; \nu_k = 0\}, \quad \text{with } T_0 = 0, \qquad (12)$$

where by convention $\inf\{\emptyset\} = \infty$. It may be shown that under mild conditions on $\{P_\theta, \theta \in \Theta\}$, $\{H(\theta, x), (\theta, x) \in \Theta \times \mathsf{X}\}$ and the sequence $\gamma$,

$$\bar{P}\Big(\sup_n \kappa_n < \infty\Big) = \bar{P}\Big(\bigcup_{n=0}^{\infty} \{T_n = \infty\}\Big) = 1,$$

i.e., the number of reinitialisations of the procedure described above is finite $\bar{P}$-a.s. We postpone the presentation and the discussion of simple sufficient conditions that ensure that this holds in concrete situations to Sections 5, 6 and 7. We will however assume this property to hold in Sections 3 and 4. Again, we stress the fact that our analysis of the homogeneous Markov chain $\{Z_k\}$ (the algorithm), for a given sequence $\gamma$, relies on the study of the inhomogeneous Markov chain defined earlier (the stopped process), for the sequences $\{\rho_k \stackrel{\text{def}}{=} \gamma^{(\kappa)}_k\}$ of stepsizes. It is therefore important to precisely and probabilistically relate these two processes. This is the aim of the lemma below, adapted from Andrieu et al. (2005), Lemma 4.1. Define, for $\mathcal{K} \subset \Theta$,

$$\sigma(\mathcal{K}) = \inf\{k \ge 1,\; \theta_k \notin \mathcal{K}\}. \qquad (13)$$

Lemma 1. For any $m \ge 1$, any non-negative measurable function $\Psi_m : (\mathsf{X} \times \Theta)^m \to \mathbb{R}_+$, any integer $k \ge 0$ and $(x, \theta) \in \mathsf{X} \times \Theta$,

$$\bar{E}_{x,\theta,k,0}\big[\Psi_m(X_1, \theta_1, \ldots, X_m, \theta_m)\,\mathbb{1}\{T_1 \ge m\}\big] = E^{\gamma^{(k)}}_{\Pi(x,\theta)}\big[\Psi_m(X_1, \theta_1, \ldots, X_m, \theta_m)\,\mathbb{1}\{\sigma(\mathcal{K}_k) \ge m\}\big].$$

3. Law of large numbers. Hereafter, for a probability distribution $P$, the various kinds of convergence (in probability, almost-sure and weak/in distribution) are denoted respectively $\stackrel{\text{prob.}}{\longrightarrow}_P$, $\stackrel{\text{a.s.}}{\longrightarrow}_P$ and $\stackrel{\mathcal{D}}{\longrightarrow}_P$.

3.1. Assumptions. As pointed out in the introduction, a LLN has been obtained for a particular adaptive MCMC algorithm by Haario et al.
(2001), using mixingale theory (see McLeish). Our approach is more in line with the martingale proof of the LLN for Markov chains, and is based on the existence and regularity of the solutions of Poisson's equation and on martingale limit theory. The existence and appropriate properties of those solutions can easily be established under a uniform (in the parameter $\theta$) version of the $V$-uniform ergodicity of the transition kernels $P_\theta$; see condition (A1) below and Proposition 3.
We will use the following notation throughout the paper. For $W : \mathsf{X} \to [1, \infty)$ and a measurable function $f : \mathsf{X} \to \mathbb{R}$, define

$$\|f\|_W = \sup_{x \in \mathsf{X}} \frac{|f(x)|}{W(x)} \quad \text{and} \quad \mathcal{L}_W = \{f : \|f\|_W < \infty\}. \qquad (14)$$

We will also consider functions $f : \Theta \times \mathsf{X} \to \mathbb{R}$. We will often use the shorthand notation $f_\theta(x) = f(\theta, x)$ for all $(\theta, x) \in \Theta \times \mathsf{X}$ in order to avoid ambiguities. We will assume that $f_\theta \equiv 0$ whenever $\theta \notin \Theta$, except when $f_\theta$ does not depend on $\theta$, i.e. $f_\theta \equiv f_{\theta'}$ for any $(\theta, \theta') \in \Theta \times \Theta$. Let $W : \mathsf{X} \to [1, \infty)$. We say that the family of functions $\{f_\theta : \mathsf{X} \to \mathbb{R}, \theta \in \Theta\}$ is $W$-Lipschitz if, for any compact subset $\mathcal{K} \subset \Theta$,

$$\sup_{\theta \in \mathcal{K}} \|f_\theta\|_W < \infty \quad \text{and} \quad \sup_{(\theta, \theta') \in \mathcal{K} \times \mathcal{K},\, \theta \ne \theta'} |\theta - \theta'|^{-1}\,\|f_\theta - f_{\theta'}\|_W < \infty. \qquad (15)$$

(A1) For any $\theta \in \Theta$, $P_\theta$ has $\pi$ as stationary distribution. In addition there exists a function $V : \mathsf{X} \to [1, \infty)$ such that $\sup_{x \in \mathsf{K}} V(x) < \infty$ (with $\mathsf{K}$ defined in Section 2) and such that, for any compact subset $\mathcal{K} \subset \Theta$:

(i) Minorization condition. There exist $C \in \mathcal{B}(\mathsf{X})$, $\epsilon > 0$ and a probability measure $\varphi$ (all three depending on $\mathcal{K}$) such that $\varphi(C) > 0$ and, for all $A \in \mathcal{B}(\mathsf{X})$, $\theta \in \mathcal{K}$ and $x \in C$,

$$P_\theta(x, A) \ge \epsilon\,\varphi(A).$$

(ii) Drift condition. There exist constants $\lambda \in [0, 1)$ and $b \in (0, \infty)$ (depending on $V$, $C$ and $\mathcal{K}$) satisfying, for all $\theta \in \mathcal{K}$,

$$P_\theta V(x) \le \begin{cases} \lambda V(x), & x \notin C, \\ b, & x \in C. \end{cases}$$

(A2) For any compact subset $\mathcal{K} \subset \Theta$ and any $r \in [0, 1]$, there exists a constant $C$ (depending on $\mathcal{K}$ and $r$) such that, for any $(\theta, \theta') \in \mathcal{K} \times \mathcal{K}$ and $f \in \mathcal{L}_{V^r}$, where $V$ is given in (A1),

$$\|P_\theta f - P_{\theta'} f\|_{V^r} \le C\,\|f\|_{V^r}\,|\theta - \theta'|.$$

(A3) $\{H_\theta, \theta \in \Theta\}$ is $V^\beta$-Lipschitz for some $\beta \in [0, 1/2]$, with $V$ defined in (A1).

Remark 1. Note that, for the sake of clarity and simplicity, we restrict our results to the case where $\{P_\theta, \theta \in \Theta\}$ satisfy one-step drift and minorisation conditions. As shown in Andrieu et al. (2005), the more general case where either an $m$-step drift or minorisation condition is assumed, for $m > 1$, requires one to modify the algorithm in order to prevent large jumps in the parameter space (see Andrieu and Moulines (2003), Andrieu et al. (2005)). This mainly leads to substantial notational complications, but the arguments remain essentially unchanged.
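To make the drift condition (ii) concrete, the sketch below numerically evaluates the ratio $P_\theta V(x)/V(x)$ for a 1-d N-SRWM targeting a standard normal; the choice $V(x) = e^{|x|/2}$ and the grid quadrature are illustrative assumptions, not the paper's construction. Far in the tail the ratio falls below one (geometric drift), while near the mode it is merely bounded.

```python
import math

def srwm_PV_over_V(x, scale=1.0, half_width=12.0, n_grid=4001):
    """Evaluate P_theta V(x) / V(x) by grid quadrature over the
    increment z, for the 1-d SRWM with target pi = N(0, 1), increment
    distribution N(0, scale^2) and drift function V(t) = exp(|t|/2)."""
    log_pi = lambda t: -0.5 * t * t
    V = lambda t: math.exp(abs(t) / 2.0)
    q = lambda z: math.exp(-0.5 * (z / scale) ** 2) / (scale * math.sqrt(2 * math.pi))
    h = 2 * half_width / (n_grid - 1)
    pv = 0.0
    for i in range(n_grid):
        z = -half_width + i * h
        alpha = min(1.0, math.exp(log_pi(x + z) - log_pi(x)))  # accept prob.
        # accepted move contributes V(x + z), rejected move contributes V(x)
        pv += q(z) * (alpha * V(x + z) + (1.0 - alpha) * V(x)) * h
    return pv / V(x)

# In the tail the chain drifts inward: P_theta V(x) <= lambda V(x), lambda < 1.
print(srwm_PV_over_V(6.0) < 1.0)
# Near the mode the ratio is only bounded (the small-set part of (ii)).
print(srwm_PV_over_V(0.0) < 20.0)
```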
Conditions of the type (A1), used to establish geometric ergodicity, have been extensively studied over the last decade for Metropolis-Hastings algorithms. Typically the required drift function depends on the target distribution $\pi$, which makes our requirement of uniformity in $\theta \in \mathcal{K}$ in (A1) reasonable and relatively easy to establish (see Sections 6 and 7). The following theorem, due to Meyn and Tweedie (1994) and recently improved by Baxendale (2005), converts information about the drift and minorization conditions into information about the long-term behaviour of the chain.

Theorem 2. Assume (A1). Then, for all $\theta \in \Theta$, $P_\theta$ admits $\pi$ as its unique stationary probability measure, and $\pi(V) < \infty$. Let $\mathcal{K} \subset \Theta$ be a compact subset and $r \in [0, 1]$. There exists $\underline\rho < 1$, depending only and explicitly on the constants $r$, $\epsilon$, $\varphi(C)$, $\lambda$ and $b$ given in (A1), such that, whenever $\rho \in (\underline\rho, 1)$, there exists $C < \infty$, depending only and explicitly on $r$, $\rho$, $\epsilon$, $\varphi(C)$ and $b$, such that for any $f \in \mathcal{L}_{V^r}$, all $\theta \in \mathcal{K}$ and $k \ge 0$,

$$\|P^k_\theta f - \pi(f)\|_{V^r} \le C\,\|f\|_{V^r}\,\rho^k. \qquad (16)$$

Formulas for $\underline\rho$ and $C$ are given in Meyn and Tweedie (1994), Theorem 2.3, and have later been improved in Baxendale (2005), Section 2.1. This theorem automatically ensures the existence of solutions of Poisson's equation. More precisely, for all $(\theta, x) \in \Theta \times \mathsf{X}$ and $f \in \mathcal{L}_{V^r}$, $\sum_{k=0}^{\infty} |P^k_\theta f(x) - \pi(f)| < \infty$ and

$$\hat{f}_\theta \stackrel{\text{def}}{=} \sum_{k=0}^{\infty} \big(P^k_\theta f - \pi(f)\big)$$

is a solution of Poisson's equation,

$$\hat{f}_\theta - P_\theta \hat{f}_\theta = f - \pi(f). \qquad (17)$$

Poisson's equation has proven to be a fundamental tool for the analysis of additive functionals, in particular to establish limit theorems such as the functional central limit theorem (see e.g. Benveniste et al. (1990), Nummelin (1991), Meyn and Tweedie (1993), Chapter 17, Glynn and Meyn (1996), Duflo (1997)). The Lipschitz continuity of the transition kernel $P_\theta$ as a function of $\theta$ (Assumption (A2)) does not seem to have been studied for the Metropolis-Hastings algorithm. We establish this continuity for the SRWM algorithm and the independent MH algorithm (IMH) in Sections 6 and 7.
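For a finite state space the series defining $\hat{f}_\theta$ can be summed directly and Eq. (17) checked numerically; the sketch below does this for the two-state kernel $P_\theta$ of the introduction (the truncation length and the test function $f$ are illustrative choices).

```python
def poisson_solution(P, pi, f, n_terms=200):
    """Approximate f_hat = sum_k (P^k f - pi(f)) for a finite-state
    kernel P with stationary distribution pi, by truncating the series;
    f_hat then (approximately) satisfies f_hat - P f_hat = f - pi(f)."""
    n = len(P)
    pif = sum(pi[i] * f[i] for i in range(n))
    f_hat = [0.0] * n
    g = list(f)                        # g holds P^k f, starting at k = 0
    for _ in range(n_terms):
        for i in range(n):
            f_hat[i] += g[i] - pif
        g = [sum(P[i][j] * g[j] for j in range(n)) for i in range(n)]
    return f_hat, pif

theta = 0.3
P = [[1 - theta, theta], [theta, 1 - theta]]   # pi = (1/2, 1/2)
f = [1.0, 4.0]
f_hat, pif = poisson_solution(P, [0.5, 0.5], f)
# Residual of Poisson's equation (17): f_hat - P f_hat - (f - pi(f))
res = [f_hat[i] - sum(P[i][j] * f_hat[j] for j in range(2)) - (f[i] - pif)
       for i in range(2)]
print(max(abs(r) for r in res))
```

Since $P^k f - \pi(f)$ here decays like $(1-2\theta)^k$, 200 terms suffice for the residual to be negligible.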
This assumption, used in conjunction with (A1), allows one to establish the Lipschitz continuity of the solutions of Poisson's equation.

Proposition 3. Assume (A1) and suppose that the family of functions $\{f_\theta, \theta \in \Theta\}$ is $V^r$-Lipschitz for some $r \in [0, 1]$. Define, for any $\theta \in \Theta$, $\hat{f}_\theta \stackrel{\text{def}}{=} \sum_{k=0}^{\infty} \big(P^k_\theta f_\theta - \pi(f_\theta)\big)$. Then, for any compact set $\mathcal{K} \subset \Theta$, there exists a constant $C$ such that, for any $(\theta, \theta') \in \mathcal{K} \times \mathcal{K}$,

$$\|\hat{f}_\theta\|_{V^r} + \|P_\theta \hat{f}_\theta\|_{V^r} \le C \sup_{\theta \in \mathcal{K}} \|f_\theta\|_{V^r}, \qquad (18)$$

$$\|\hat{f}_\theta - \hat{f}_{\theta'}\|_{V^r} + \|P_\theta \hat{f}_\theta - P_{\theta'} \hat{f}_{\theta'}\|_{V^r} \le C\,|\theta - \theta'|\,\sup_{\theta \in \mathcal{K}} \|f_\theta\|_{V^r}. \qquad (19)$$

The proof is given in Appendix B.

Remark 2. The regularity of the solutions of Poisson's equation has been studied, under various ergodicity and regularity conditions on the mapping $\theta \mapsto P_\theta$, by, for instance,
Benveniste et al. (1990) and Bartusek (2000) (for regularity under conditions implying $V$-uniform geometric ergodicity). The results of the proposition above are sharper than those reported in the literature because all the transition kernels $P_\theta$ share the same limiting distribution $\pi$, a property which plays a key role in the proof. We finish this section with a convergence result for the chain $\{X_k\}$ under the probability $P^{\bar\rho}_{x,\theta}$, which is an important direct byproduct of the property just mentioned in the remark above. This result improves on Holden (1998), Haario et al. (2001) and Atchadé and Rosenthal (2003).

Proposition 4. Assume (A1)-(A3), let $\rho \in (0, 1)$ be as in Eq. (16), and let $\bar\rho = \{\bar\rho_k\}$ be a positive, finite and non-increasing sequence such that $\limsup_k \bar\rho_{k - n(k)}/\bar\rho_k < +\infty$, where $n(k) := \lfloor \log(\bar\rho_k/\check\rho)/\log(\rho) \rfloor + 1$ for $k \ge k_0$, for some integer $k_0$, and $n(k) = 0$ otherwise, with $\check\rho \in [\rho^{-1}, +\infty)$. Let $f \in \mathcal{L}_{V^{1-\beta}}$, where $V$ is defined in (A1) and $\beta$ is defined in (A3). Let $\mathcal{K}$ be a compact subset of $\Theta$. Then there exists a constant $C \in (0, +\infty)$, depending only on $\mathcal{K}$, the constants in (A1) and $\bar\rho$, such that for any $(x, \theta) \in \mathsf{X} \times \mathcal{K}$,

$$\big|E^{\bar\rho}_{x,\theta}\big\{\big(f(X_k) - \pi(f)\big)\,\mathbb{1}\{\sigma(\mathcal{K}) \ge k\}\big\}\big| \le C\,\|f\|_{V^{1-\beta}}\,\bar\rho_k\,V(x).$$

The proof is in Appendix B.

3.2. Law of large numbers. We prove in this section a law of large numbers (LLN) under $\bar{P}$ for $n^{-1} \sum_{k=1}^n f_{\theta_k}(X_k)$, where $\{f_\theta, \theta \in \Theta\}$ is a set of sufficiently regular functions. It is worth noticing here that the sequence $\{\theta_k\}$ is not required to converge in order to establish our result. The proof is based on the identity

$$f_{\theta_k}(X_k) - \int_{\mathsf{X}} \pi(dx)\,f_{\theta_k}(x) = \hat{f}_{\theta_k}(X_k) - P_{\theta_k} \hat{f}_{\theta_k}(X_k),$$

where $\hat{f}_\theta$ is a solution of Poisson's equation, Eq. (17). The decomposition

$$\hat{f}_{\theta_k}(X_k) - P_{\theta_k} \hat{f}_{\theta_k}(X_k) = \big(\hat{f}_{\theta_{k-1}}(X_k) - P_{\theta_{k-1}} \hat{f}_{\theta_{k-1}}(X_{k-1})\big) + \big(\hat{f}_{\theta_k}(X_k) - \hat{f}_{\theta_{k-1}}(X_k)\big) + \big(P_{\theta_{k-1}} \hat{f}_{\theta_{k-1}}(X_{k-1}) - P_{\theta_k} \hat{f}_{\theta_k}(X_k)\big) \qquad (20)$$

evidences the different terms that need to be controlled to prove the LLN. The first term in the decomposition is, except at the times of a jump, a martingale difference sequence.
As we shall see, this is the leading term in the decomposition, and the other terms are remainders which are easily dealt with thanks to the regularity of the solutions of Poisson's equation under (A1). The term $\hat{f}_{\theta_k}(X_k) - \hat{f}_{\theta_{k-1}}(X_k)$ can be interpreted as the perturbation introduced by the adaptation. We preface our main result, Theorem 8, with two intermediate propositions concerned with the control of the fluctuations of the sum $\sum_{k=1}^n \big(f_{\theta_k}(X_k) - \int_{\mathsf{X}} \pi(dx)\,f_{\theta_k}(x)\big)$ for the inhomogeneous chain $\{(X_k, \theta_k)\}$ under the probability $P^\rho_{x,\theta}$. The following lemma, whose proof is given in Appendix A, is required in order to prove Propositions 6 and 7.
Lemma 5. Assume (A1). Let $\mathcal{K} \subset \Theta$ be a compact set and $r \in [0, 1]$ be a constant. There exists a constant $C$, depending only on $r$, $\mathcal{K}$ and the constants in (A1), such that, for any sequences $\rho = \{\rho_k\}$ and $a = \{a_k\}$ of positive numbers and for any $(x, \theta) \in \mathsf{X} \times \mathcal{K}$,

$$E^\rho_{x,\theta}\big[V^r(X_k)\,\mathbb{1}\{\sigma(\mathcal{K}) \ge k\}\big] \le C\,V^r(x), \qquad (21)$$

$$E^\rho_{x,\theta}\Big[\max_{1 \le m \le k} a_m V^r(X_m)\,\mathbb{1}\{\sigma(\mathcal{K}) \ge m\}\Big] \le C_r \sum_{m=1}^{k} a_m\,V^r(x), \qquad (22)$$

$$E^\rho_{x,\theta}\Big[\max_{m \le n} \mathbb{1}\{\sigma(\mathcal{K}) \ge m\} \sum_{k=1}^{m} V^r(X_k)\Big] \le C\,n\,V^r(x). \qquad (23)$$

Proposition 6. Assume (A1)-(A3). Let $\{f_\theta, \theta \in \Theta\}$ be a $V^\alpha$-Lipschitz family of functions for some $\alpha \in [0, 1-\beta)$, where $V$ is defined in (A1) and $\beta$ is defined in (A3). Let $\mathcal{K}$ be a compact subset of $\Theta$. Then for any $p \in (1, 1/(\alpha+\beta)]$ there exists a constant $C$, depending only on $p$, $\mathcal{K}$ and the constants in (A1), such that, for any sequence $\rho = \{\rho_k\}$ of positive numbers satisfying $\sum_k k^{-1} \rho_k < \infty$, we have for all $(x, \theta) \in \mathsf{X} \times \mathcal{K}$, $\delta > 0$, $\gamma > \alpha$ and integer $l \ge 1$,

$$P^\rho_{x,\theta}\Big[\sup_{m \ge l} m^{-1}\,|M_m| \ge \delta\Big] \le C \delta^{-p} \sup_{\theta \in \mathcal{K}} \|f_\theta\|^p_{V^\alpha}\, l^{-\{p/2 \wedge (p-1)\}}\,V^{\alpha p}(x), \qquad (24)$$

$$P^\rho_{x,\theta}\Big[\sup_{m \ge l} m^{-\gamma} \Big|\mathbb{1}\{\sigma(\mathcal{K}) > m\} \sum_{k=1}^m \Big(f_{\theta_k}(X_k) - \int_{\mathsf{X}} f_{\theta_k}(x)\,\pi(dx)\Big) - M_m\Big| \ge \delta\Big] \le C \delta^{-p} \sup_{\theta \in \mathcal{K}} \|f_\theta\|^p_{V^\alpha} \Big\{\Big(\sum_{k \ge l} k^{-\gamma} \rho_k\Big)^p V^{(\alpha+\beta)p}(x) + l^{-(\gamma - \alpha)p}\,V^{\alpha p}(x)\Big\}, \qquad (25)$$

where $\sigma(\mathcal{K})$ is given in Eq. (13), $\hat{f}_\theta = \sum_{k=0}^{\infty} \big(P^k_\theta f_\theta - \pi(f_\theta)\big)$ is a solution of Poisson's equation, Eq. (17), and

$$M_m \stackrel{\text{def}}{=} \mathbb{1}\{\sigma(\mathcal{K}) > m\} \sum_{k=1}^m \big[\hat{f}_{\theta_{k-1}}(X_k) - P_{\theta_{k-1}} \hat{f}_{\theta_{k-1}}(X_{k-1})\big]. \qquad (26)$$

Remark 3. The result provides us with some useful insights into the properties of MCMC with vanishing adaptation. First, whenever $\{\theta_k\} \subset \mathcal{K} \subset \Theta$ for a deterministic compact set $\mathcal{K}$, the bounds above give explicit rates of convergence for the ergodic averages (1). The price to pay for adaptation is apparent on the right-hand side of Eq. (25), as reported in Andrieu. The constraints on $p$, $\beta$ and $\alpha$ illustrate the tradeoff between the rate of convergence, the smoothness of the adaptation and the class of functions covered by our result.
Scenarios of interest include the case where assumption (A3) is satisfied with $\beta = 0$ (in other words, for any compact subset $\mathcal{K} \subset \Theta$, $\sup_{\theta \in \mathcal{K}} \sup_{x \in \mathsf{X}} |H_\theta(x)| < \infty$), and the case where (A3) holds for any $\beta \in (0, 1/2]$, which both imply that the results of the proposition hold for any $\alpha < 1$ (see Theorem 15 for an application).

Proof of Proposition 6. For notational simplicity we set $\sigma \stackrel{\text{def}}{=} \sigma(\mathcal{K})$. Let $p \in (1, 1/(\alpha+\beta)]$ and let $\mathcal{K} \subset \Theta$ be a compact set. In this proof $C$ is a constant which depends only upon
the constants in (A1)-(A3), $p$ and $\mathcal{K}$; this constant may take different values upon each appearance. Theorem 2 implies that for any $\theta \in \Theta$, $\hat{f}_\theta$ exists and is a solution of Poisson's equation, Eq. (17). We decompose the sum $\mathbb{1}\{\sigma > m\} \sum_{k=1}^m \big(f_{\theta_k}(X_k) - \int_{\mathsf{X}} f_{\theta_k}(x)\,\pi(dx)\big)$ as $M_m + R^{(1)}_m + R^{(2)}_m$, where

$$R^{(1)}_m \stackrel{\text{def}}{=} \mathbb{1}\{\sigma > m\} \sum_{k=1}^m \big(\hat{f}_{\theta_k}(X_k) - \hat{f}_{\theta_{k-1}}(X_k)\big),$$

$$R^{(2)}_m \stackrel{\text{def}}{=} \mathbb{1}\{\sigma > m\} \big(P_{\theta_0} \hat{f}_{\theta_0}(X_0) - P_{\theta_m} \hat{f}_{\theta_m}(X_m)\big).$$

We consider these terms separately. First, since $\mathbb{1}\{\sigma > m\} = \mathbb{1}\{\sigma > m\}\,\mathbb{1}\{\sigma \ge k\}$ for $0 \le k \le m$,

$$M_m = \mathbb{1}\{\sigma > m\} \sum_{k=1}^m \big[\hat{f}_{\theta_{k-1}}(X_k) - P_{\theta_{k-1}} \hat{f}_{\theta_{k-1}}(X_{k-1})\big]\,\mathbb{1}\{\sigma \ge k\} = \mathbb{1}\{\sigma > m\}\,\bar{M}_m,$$

where

$$\bar{M}_m \stackrel{\text{def}}{=} \sum_{k=1}^m \big[\hat{f}_{\theta_{k-1}}(X_k) - P_{\theta_{k-1}} \hat{f}_{\theta_{k-1}}(X_{k-1})\big]\,\mathbb{1}\{\sigma \ge k\}.$$

By Proposition 3, Eq. (18), there exists a constant $C$ such that for all $\theta \in \mathcal{K}$, $\|\hat{f}_\theta\|_{V^\alpha} + \|P_\theta \hat{f}_\theta\|_{V^\alpha} \le C \sup_{\theta \in \mathcal{K}} \|f_\theta\|_{V^\alpha}$. Since $0 \le \alpha p \le 1$, by using Eq. (21) in Lemma 5 we have for all $(x, \theta) \in \mathsf{X} \times \mathcal{K}$,

$$E^\rho_{x,\theta}\big\{\big(|\hat{f}_{\theta_{k-1}}(X_k)|^p + |P_{\theta_{k-1}} \hat{f}_{\theta_{k-1}}(X_{k-1})|^p\big)\,\mathbb{1}\{\sigma \ge k\}\big\} \le C\,V^{\alpha p}(x)\,\sup_{\theta \in \mathcal{K}} \|f_\theta\|^p_{V^\alpha}. \qquad (27)$$

Since

$$E^\rho_{x,\theta}\big\{\big[\hat{f}_{\theta_{k-1}}(X_k) - P_{\theta_{k-1}} \hat{f}_{\theta_{k-1}}(X_{k-1})\big]\,\mathbb{1}\{\sigma \ge k\} \,\big|\, \mathcal{F}_{k-1}\big\} = \big(P_{\theta_{k-1}} \hat{f}_{\theta_{k-1}}(X_{k-1}) - P_{\theta_{k-1}} \hat{f}_{\theta_{k-1}}(X_{k-1})\big)\,\mathbb{1}\{\sigma \ge k\} = 0,$$

$\{\bar{M}_m\}$ is a $(P^\rho_{x,\theta}, \{\mathcal{F}_k\})$-adapted martingale with increments bounded in $L^p$. Using Burkholder's inequality for $p > 1$ (Hall and Heyde (1980), Theorem 2.10), we have

$$E^\rho_{x,\theta}\big\{|\bar{M}_m|^p\big\} \le C_p\,E^\rho_{x,\theta}\Big\{\Big(\sum_{k=1}^m \big|\hat{f}_{\theta_{k-1}}(X_k) - P_{\theta_{k-1}} \hat{f}_{\theta_{k-1}}(X_{k-1})\big|^2\,\mathbb{1}\{\sigma \ge k\}\Big)^{p/2}\Big\}, \qquad (28)$$

where $C_p$ is a universal constant. For $p \ge 2$, by Minkowski's inequality,

$$E^\rho_{x,\theta}\big\{|\bar{M}_m|^p\big\} \le C_p \Big\{\sum_{k=1}^m \Big(E^\rho_{x,\theta}\big\{\big|\hat{f}_{\theta_{k-1}}(X_k) - P_{\theta_{k-1}} \hat{f}_{\theta_{k-1}}(X_{k-1})\big|^p\,\mathbb{1}\{\sigma \ge k\}\big\}\Big)^{2/p}\Big\}^{p/2}. \qquad (29)$$

For $1 \le p \le 2$, we have

$$E^\rho_{x,\theta}\Big\{\Big(\sum_{k=1}^m \big|\hat{f}_{\theta_{k-1}}(X_k) - P_{\theta_{k-1}} \hat{f}_{\theta_{k-1}}(X_{k-1})\big|^2\,\mathbb{1}\{\sigma \ge k\}\Big)^{p/2}\Big\} \le E^\rho_{x,\theta}\Big\{\sum_{k=1}^m \big|\hat{f}_{\theta_{k-1}}(X_k) - P_{\theta_{k-1}} \hat{f}_{\theta_{k-1}}(X_{k-1})\big|^p\,\mathbb{1}\{\sigma \ge k\}\Big\}. \qquad (30)$$
By combining the two cases above and using Eq. (27), we obtain for any $(x, \theta) \in \mathsf{X} \times \mathcal{K}$ and $p > 1$,

$$E^\rho_{x,\theta}\big\{|\bar{M}_m|^p\big\} \le C\,m^{\{p/2 \vee 1\}}\,\sup_{\theta \in \mathcal{K}} \|f_\theta\|^p_{V^\alpha}\,V^{\alpha p}(x). \qquad (31)$$

Let $l \ge 1$. By Birnbaum and Marshall (1961)'s inequality (a straightforward adaptation of Birnbaum and Marshall (1961)'s result is given in Proposition 22, Appendix A) and Eq. (31), there exists a constant $C$ such that

$$P^\rho_{x,\theta}\Big[\sup_{m \ge l} m^{-1}\,|\bar{M}_m| \ge \delta\Big] \le \delta^{-p} \Big\{\sum_{m=l}^{\infty} \big(m^{-p} - (m+1)^{-p}\big)\,E^\rho_{x,\theta}\big\{|\bar{M}_m|^p\big\}\Big\} \le C \delta^{-p} \sup_{\theta \in \mathcal{K}} \|f_\theta\|^p_{V^\alpha} \sum_{m=l}^{\infty} m^{-p-1+\{p/2 \vee 1\}}\,V^{\alpha p}(x) \le C \delta^{-p} \sup_{\theta \in \mathcal{K}} \|f_\theta\|^p_{V^\alpha}\, l^{-\{p/2 \wedge (p-1)\}}\,V^{\alpha p}(x),$$

which proves Eq. (24). We now consider $R^{(1)}_m$. Eq. (19) shows that there exists a constant $C$ such that, for any $(\theta, \theta') \in \mathcal{K} \times \mathcal{K}$, $\|\hat{f}_\theta - \hat{f}_{\theta'}\|_{V^\alpha} \le C\,|\theta - \theta'|\,\sup_{\theta \in \mathcal{K}} \|f_\theta\|_{V^\alpha}$. On the other hand, by construction, $\theta_k - \theta_{k-1} = \rho_k H(\theta_{k-1}, X_k)$ for $\sigma \ge k$, and under assumption (A3) there exists a constant $C$ such that, for any $(x, \theta) \in \mathsf{X} \times \mathcal{K}$, $|H(\theta, x)| \le C\,V^\beta(x)$. Therefore there exists $C$ such that, for any $m \ge l$ and $\gamma > \alpha$,

$$m^{-\gamma}\,|R^{(1)}_m| = m^{-\gamma}\,\mathbb{1}\{\sigma > m\}\,\Big|\sum_{k=1}^m \big(\hat{f}_{\theta_k}(X_k) - \hat{f}_{\theta_{k-1}}(X_k)\big)\Big| \le C \sup_{\theta \in \mathcal{K}} \|f_\theta\|_{V^\alpha} \sum_{k=1}^m (k \vee l)^{-\gamma}\,\rho_k\,V^{\alpha+\beta}(X_k)\,\mathbb{1}\{\sigma \ge k\}.$$

Hence, using Minkowski's inequality and Eq. (21), one deduces that there exists $C$ such that, for $(x, \theta) \in \mathsf{X} \times \mathcal{K}$, $l \ge 1$ and $\gamma > \alpha$,

$$E^\rho_{x,\theta}\Big\{\sup_{m \ge l} m^{-\gamma p}\,|R^{(1)}_m|^p\Big\} \le C \sup_{\theta \in \mathcal{K}} \|f_\theta\|^p_{V^\alpha} \Big(\sum_{k \ge 1} (k \vee l)^{-\gamma}\,\rho_k\Big)^p\,V^{(\alpha+\beta)p}(x). \qquad (32)$$

Consider now $R^{(2)}_m$. The term $P_{\theta_0} \hat{f}_{\theta_0}(X_0)$ does not pose any problem. From Lemma 5, Eq. (22), there exists a constant $C$ such that, for all $(x, \theta) \in \mathsf{X} \times \mathcal{K}$ and $0 < \alpha < \gamma$,

$$E^\rho_{x,\theta}\Big\{\sup_{m \ge l} m^{-\gamma p}\,\big|P_{\theta_m} \hat{f}_{\theta_m}(X_m)\big|^p\,\mathbb{1}\{\sigma > m\}\Big\} \le C \sup_{\theta \in \mathcal{K}} \|f_\theta\|^p_{V^\alpha}\,E^\rho_{x,\theta}\Big\{\Big(\sup_{m \ge l} m^{-\gamma/\alpha}\,V^\alpha(X_m)\,\mathbb{1}\{\sigma > m\}\Big)^{\alpha p}\Big\} \le C \sup_{\theta \in \mathcal{K}} \|f_\theta\|^p_{V^\alpha} \Big(\sum_{k=l}^{\infty} k^{-\gamma/\alpha}\Big)^{\alpha p}\,V^{\alpha p}(x). \qquad (33)$$

The case $\alpha = 0$ is straightforward. From Markov's inequality, Eqs. (32) and (33), one deduces Eq. (25).
We can now apply the results of Proposition 6, established for the inhomogeneous Markov chain defined below Eq. (10), to the homogeneous chain $\{Z_k\}$, under the assumption that the number of reinitialisations $\kappa_n$ is $\bar{P}$-almost surely finite. Note that the very general form of the result will allow us to prove a central limit theorem in Section 4.

Proposition 7. Let $\{\mathcal{K}_q, q \ge 0\}$ be a compact coverage of $\Theta$ and let $\gamma = \{\gamma_k\}$ be a non-increasing positive sequence such that $\sum_k k^{-1} \gamma_k < \infty$. Consider the homogeneous Markov chain $\{Z_k\}$ on $\mathsf{Z}$ with transition probability $R$ as defined in Section 2. Assume (A1)-(A3) and let $F \stackrel{\text{def}}{=} \{f_\theta, \theta \in \Theta\}$ be a $V^\alpha$-Lipschitz family of functions for some $\alpha \in [0, 1-\beta)$, with $\beta$ as in (A3) and $V$ as in (A1). Assume in addition that $\bar{P}\{\lim_n \kappa_n < \infty\} = 1$. Then for any $f_\theta \in F$,

$$n^{-1} \sum_{k=1}^n \Big(f_{\theta_k}(X_k) - \int_{\mathsf{X}} f_{\theta_k}(x)\,\pi(dx)\Big)\,\mathbb{1}\{\nu_k \ne 0\} \;\stackrel{\text{a.s.}}{\longrightarrow}_{\bar{P}}\; 0. \qquad (34)$$

Proof. Without loss of generality, we may assume that, for any $\theta \in \Theta$, $\int_{\mathsf{X}} f_\theta(x)\,\pi(dx) = 0$. Let $p \in (1, 1/(\alpha+\beta)]$ and denote $\kappa \stackrel{\text{def}}{=} \lim_n \kappa_n$. By construction of the $T_k$'s (see Eq. (12)), since $\bar{P}$-a.s. we have $\kappa < \infty$, then $T_\kappa < \infty$, $\bar{P}$-a.s. We now decompose $S_n \stackrel{\text{def}}{=} \sum_{k=1}^n f_{\theta_k}(X_k)\,\mathbb{1}\{\nu_k \ne 0\}$ as $S_n = S^{(1)}_n + S^{(2)}_n$, where

$$S^{(1)}_n \stackrel{\text{def}}{=} \sum_{k=1}^{T_\kappa \wedge n} f_{\theta_k}(X_k)\,\mathbb{1}\{\nu_k \ne 0\} \quad \text{and} \quad S^{(2)}_n \stackrel{\text{def}}{=} \sum_{k=T_\kappa + 1}^{n} f_{\theta_k}(X_k).$$

Since $T_\kappa < \infty$ $\bar{P}$-a.s., $\sum_{k=1}^{T_\kappa} |f_{\theta_k}(X_k)|\,\mathbb{1}\{\nu_k \ne 0\} < \infty$, $\bar{P}$-a.s., showing that $n^{-1} S^{(1)}_n \to 0$, $\bar{P}$-a.s. We now bound the second term. For any integers $n$ and $K$,

$$\bar{P}\Big[\sup_{m \ge n} m^{-1}\,|S^{(2)}_m| \ge \delta,\; \kappa \le K\Big] \le \sum_{i=0}^{K} \bar{P}\Big[\sup_{m \ge n} m^{-1}\Big|\sum_{k=T_i+1}^m f_{\theta_k}(X_k)\Big| \ge \delta,\; \kappa = i,\; T_i \le n/2\Big] + \bar{P}[T_\kappa > n/2]. \qquad (35)$$

Since for $\kappa = i$, $T_{i+1} = \infty$,

$$\bar{P}\Big[\sup_{m \ge n} m^{-1}\Big|\sum_{k=T_i+1}^m f_{\theta_k}(X_k)\Big| \ge \delta,\; \kappa = i,\; T_i \le n/2\Big] \le \bar{P}\Big[\sup_{m \ge n} \mathbb{1}\{T_{i+1} > m\}\,m^{-1}\Big|\sum_{k=T_i+1}^m f_{\theta_k}(X_k)\Big| \ge \delta,\; T_i \le n/2\Big] \le \bar{P}\Big[\sup_{m \ge n/2} \Big\{\mathbb{1}\{\sigma(\mathcal{K}_i) > m\}\,m^{-1}\Big|\sum_{k=1}^m f_{\theta_k}(X_k)\Big|\Big\} \circ \tau_{T_i} \ge \delta\Big],$$

where we have used that $T_{i+1} = T_i + \sigma(\mathcal{K}_i) \circ \tau_{T_i}$, with $\tau$ the shift operator on the canonical space of the chain $\{Z_n\}$. As a consequence, by applying Lemma 1 (noting that $\mathbb{1}\{\sigma(\mathcal{K}_i) \ge m\}\,\mathbb{1}\{\sigma(\mathcal{K}_i) > m\} = \mathbb{1}\{\sigma(\mathcal{K}_i) > m\}$ and $\mathbb{1}\{\sigma(\mathcal{K}_i) > m\} \in \mathcal{F}_m$), the strong Markov property, and Proposition 6 with $\gamma = 1$, using the fact that $\{\gamma_k\}$ is non-increasing and that
15 ERGODICITY OF ADAPTIVE MCMC 15 sup x K V x <, we have for 0 i K, P sup m 1 m f θk k m n k=t i +1 δ, κ = i, T i n/2 G T i [ m ] P γ i Π Ti,θ Ti sup 1{σK i > m}m 1 f θk k m n/2 δ { p C δ p sup f θ p n p1 α n } {p/2 p 1} V [ n/2 k] 1 γ α k + +. θ K i 2 2 By Kronecker s lemma, the condition k 1 γ k < implies that n 1 n γ k 0 as n, showing that n/2 [ n/2 k] 1 γ k n/2 1 γ k + k= n/2 +1 k 1 γ k 0, as n. Combining this together with 35 and 36 shows that, for any K, δ, η > 0 there exists N such that for n N, [ ] P sup m 1 S 2 m δ, κ K η. m n Now for K large enough so that P [κ > K] η the result above shows that there exists an N such that for any n N P [ ] sup m n m 1 S 2 m δ 2η, which concludes the proof. Remark 4. Checking P lim n κ n < = 1 depends on the particular algorithm used to update the parameters. Verifiable conditions have been established in Andrieu et al to check the stability of the algorithm; see Sections 5, 6 and 7. We may now state our main consistency result. Theorem 8. Let {K q, q 0} be a compact coverage of Θ and let γ = {γ k } be a nonincreasing positive sequence such that k 1 γ k <. Consider the time homogeneous Markov chain {Z k } on Z with transition probability R as defined in Section 2. Assume A1-3 and let f : R be a function such that f V α < for some α [0, 1 β with β as in A3 and V as in A1. Assume in addition that P {lim n κ n < } = 1. Then, n 1 n 36 [f k πf] a.s P Proof. We may assume that πf = 0. From Proposition 7 it is sufficient to prove that n 1 κ n j=1 f T j a.s P 0. Since κ < P -a.s., κ j=1 f Tj < P -a.s.. The proof follows from κ n j=1 f T j κ j=1 f Tj.
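To make the content of Theorem 8 concrete, the following is a purely illustrative sketch (in Python; the toy target, stepsizes and variable names are ours, not the paper's) of a one-dimensional Haario-type adaptive symmetric random-walk Metropolis sampler. The proposal scale is driven by stochastic-approximation updates of the running mean and variance with stepsizes $\gamma_k = (k+1)^{-0.7}$, which satisfy A5, and we form the ergodic average $n^{-1}\sum_{k=1}^n f(X_k)$ for $f(x) = x$:

```python
import math
import random

random.seed(0)

def log_pi(x):
    """Unnormalised log-density of a toy target: standard normal."""
    return -0.5 * x * x

def adaptive_srwm(n_iter):
    x, mu, sigma2 = 0.0, 0.0, 1.0     # current state and adapted moments
    samples = []
    for k in range(n_iter):
        # symmetric random-walk proposal whose scale uses the adapted variance
        # (2.38 is the classical one-dimensional scaling of Haario et al.)
        y = x + 2.38 * math.sqrt(sigma2) * random.gauss(0.0, 1.0)
        if random.random() < math.exp(min(0.0, log_pi(y) - log_pi(x))):
            x = y
        g = (k + 1) ** -0.7           # stepsizes gamma_k compatible with A5
        # theta_{k+1} = theta_k + gamma_{k+1} H(theta_k, X_{k+1})
        mu += g * (x - mu)
        sigma2 += g * ((x - mu) ** 2 - sigma2)
        sigma2 = max(sigma2, 1e-6)    # crude safeguard keeping the parameter away from 0
        samples.append(x)
    return samples, mu, sigma2

samples, mu, sigma2 = adaptive_srwm(20000)
erg_avg = sum(samples) / len(samples)  # estimates pi(f) = 0 for f(x) = x
```

The point of the theorem is precisely that `erg_avg` is a consistent estimate of $\pi(f)$ even though the transition kernel changes at every iteration.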
16 16 C. ANDRIEU & É. MOULINES 4. Invariance principle. We now prove an invariance principle. As it is the case for homogeneous Markov chains, more stringent conditions are required here than for the simple LLN. In particular we will require here that the series {θ k } converges P -a.s.. This is in contrast with simple consistency for which boundedness of {θ k } was sufficient. The main idea of the proof consists of approximating n 1/2 n {f k πf} with a triangular array of martingale differences sequence, and then apply an invariance principle for martingale differences to show the desired result. Theorem 9. Let {K q, q 0} be a compact coverage of Θ and let γ = {γ k } be a nonincreasing positive sequence such that k 1/2 γ k <. Consider the time homogeneous Markov chain {Z k } on Z with transition probability R as defined in Section 2. Assume A1-3 and let f : R satisfying f L V α where V is defined in A1 and α [0, 1 β/2 with β as in A3. Denote for any θ Θ, [ ] σ 2 θ, f def 2 def = π ˆfθ P θ ˆfθ with ˆfθ = f πf]. 38 Assume in addition that there exists a random variable θ Θ, such that πdx ˆf 2 θ x < and πdxp θ ˆfθ x 2 < P -a.s. and Then, k=0 lim sup θ n θ = 0, P a.s. n n 1/2 n [f k πf] D P Z, [ where the random variable Z has characteristic function Ē exp 1 2 σ2 θ, ft 2 ]. If in addition σθ, f > 0, P -a.s., then 1 n f k πf D P N 0, nσθ, f Proof. Without loss of generality, we suppose that πf = 0. The proof relies again on a martingale approximation. Set for k 1, [ ] def ξ k = ˆfθk 1 k P θk 1 ˆf θk 1 k 1 1{ν k 1 0}. 40 Since f is V α -Lipschitz, Proposition 3 shows that { ˆf θ, θ Θ} and {P θ ˆfθ, θ Θ} are V α -Lipschitz. Since 2α < 1, this implies that {P θ ˆf 2 θ, θ Θ} and {P θ ˆfθ 2, θ Θ} are V 2α - Lipschitz. We deduce that {ξ k } is a P, {G k, k 0}-adapted square-integrable martingale difference sequence, i.e. for all k 1, Ē [ξk 2] < and Ē [ξ k G k 1 [ ] = 0, P -a.s.. We are going to prove that with Z a r.v. 
with characteristic function $\bar{\mathbb{E}}\left[\exp\left(-\tfrac{1}{2}\sigma^2(\theta_\star, f)\,t^2\right)\right]$,
$$ n^{-1/2}\sum_{k=1}^{n}\xi_k \xrightarrow{\;\mathcal{D},\,\bar{\mathbb{P}}\;} Z, \tag{41} $$
$$ n^{-1/2}\sum_{k=1}^{n} f(X_k) \;-\; n^{-1/2}\sum_{k=1}^{n}\xi_k \xrightarrow{\;\text{a.s.},\,\bar{\mathbb{P}}\;} 0. \tag{42} $$
17 ERGODICITY OF ADAPTIVE MCMC 17 To show 41, we use Hall and Heyde, 1980, Corollary 3.1 of Theorem We need to establish that a the sequence n 1 n Ē [ξ 2 k G k 1] converges in P -probability to σ 2 θ, f, b the conditional Lindeberg condition is satisfied, for all ε > 0, n 1 n Ē [ξ 2 k 1{ ξ k ε n} G k 1 ] prob. P We first prove a. Note that [ ] 2 Ē [ξk 2 G 2 k 1] = P θk 1 ˆf θk 1 k 1 P θk 1 ˆfθk 1 k 1 1{ν k 1 0}. Since {P θ ˆf 2 θ, θ Θ} and {P θ ˆfθ 2, θ Θ} are V 2α -Lipschitz and 2α [0, 1 β, we may apply Proposition 7 to prove that 1 n n [ ] 2 2 P θk 1 ˆf θk 1 k 1 P θk 1 ˆfθk 1 k 1 1{ν k 1 0} 1 n n 1 k=0 [ ] 2 2 πdx P θk ˆf θk x P θk ˆfθk x 1{ν k 0} a.s P 0. For any j 0 and κ = j, {θ k, k > T j } K j, which together with the V 2α -Lipschitz property and the dominated convergence theorem imply that, [ ] 2 lim 1{κ 2 = j} πdx P θk ˆf θk x P θk ˆfθk x k [ ] 2 2 = 1{κ = j} πdx P θ ˆf θ x P θ ˆfθ x, P a.s. By the dominated convergence theorem and since P κ < = 1, [ ] 2 2 πdx P θk ˆf θk x P θk ˆfθk x 1{ν k 0} lim k = lim = k j=0 { j=0 { lim k πdx [ ]} 2 2 P θk ˆf θk x P θk ˆfθk x 1{ν k 0}1{κ = j} [ ]} 2 2 πdx P θk ˆf θk x P θk ˆfθk x 1{κ = j}, P a.s. and the Cesàro convergence theorem finally shows that 1 lim n n n 1 k=0 [ ] 2 2 πdx P θk ˆf θk x P θk ˆfθk x 1{ν k 0} = [ ] 2 2 πdx P θ ˆf θ x P θ ˆfθ x, P a.s.
18 18 C. ANDRIEU & É. MOULINES We now establish the conditional Lindeberg condition in b. We use the following Lemma, which is a conditional version of Dvoretzky, 1972, Lemma 3.3. Lemma 10. Let G be a σ-field and a random variable such that E [ 2 G ] <. Then, for any ε > 0, 4E [ 2 1{ ε} G ] E [ E [ G] 2 1{ E [ G] 2ε} G ]. Using Dvoretzky s lemma, we have for any ε, M > 0 and n large enough, n 1 n Ē [ξk 2 1{ ξ k ε n} G k 1 ] 4n 1 n 1 k=0 P θk k, dx ˆf 2 θ k x1{ ˆf θk x M} 1{ν k 0}, P a.s. Proceeding as above, the right-hand side of the previous display converges P -a.s. to πdx ˆf θ 2 x1{ ˆf θ x M}, where we have used that, for any θ Θ, πp θ = π. Since πdx ˆf θ 2 x < P -a.s., the monotone convergence theorem implies that πdx ˆf θ 2 x1{ ˆf θ x M} = 0, P a.s. lim M showing that the conditional Lindeberg condition b holds. In order to prove Eq. 42, we proceed along the lines of the proof of Proposition 7. Firstly, since κ <, P -a.s., T κ f k + T κ ξ k <, P -a.s., which implies that T n 1/2 κ T κ f k ξ k a.s P Secondly, proceeding as in the proof of 36 and using Eq. 25 of Proposition 6 with γ = 1/2 and some p 1, 1/α + β] since f is V α -Lipschitz, we have that for any 0 i K for some K > 0 and n > 0, P sup m 1/2 m m f k ξ k m n k=t i +1 k=t i +1 δ, κ = i, T i n/2 G T i { p C δ p f p n } p1/2 α V [ n/2 k] 1/2 γ α k Under the assumption k 1/2 γ k <, proceeding as below Eq. 36, one can show that lim n [ n/2 k] 1/2 γ k = 0. Arguing as in 35 we conclude that m 1/2 m k=t i +1 f k m k=t i +1 ξ k a.s P 0. 46
19 ERGODICITY OF ADAPTIVE MCMC 19 The proof of 42 follows from 44 and 46. The proof of 39 follows from Hall and Heyde, 1980, Corollary 3.2 of Theorem Stability and convergence of the stochastic approximation process. In order to conclude the part of this paper dedicated to the general theory of adaptive MCMC algorithm, we now present generally verifiable conditions under which the number of reinitializations of the algorithm that produces the Markov chain {Z k } described in Section 2 is P -a.e. finite. This is a difficult problem per se, which has been worked out in a companion paper, Andrieu et al We here briefly introduce the conditions under which this key property is satisfied and give without proof the main stability result. The reader should refer to Andrieu et al for more details. As mentioned in the introduction, the convergence of the stochastic approximation procedure is closely related to the stability of the noiseless sequence θ k+1 = θ k + γ k+1 h θ k. A practical technique to prove the stability of the noiseless sequence consists, when possible, of determining a Lyapunov function w : Θ [0, such that wθ, hθ 0, where w denotes the gradient of w with respect to θ and for u, v R n, u, v is their Euclidian inner product we will later on also use the notation v = v, v to denote the Euclidean norm of v. This indeed shows that the noiseless sequence {w θ k } eventually decreases, showing that lim k w θ k exists. It should therefore not be surprising if such a Lyapunov function can play an important rôle in showing the stability of the noisy sequence {θ k }. With this in mind, we can now detail the conditions required to prove our convergence result. A4 Θ is an open subset of R n θ. 
The mean field h : Θ R n θ is continuous and there exists a continuously differentiable function w : Θ [0, with the convention wθ = when θ Θ such that: i For any M > 0, the level set W M def = {θ Θ, wθ M} Θ is compact, ii The set of stationary points L def = {θ Θ, wθ, hθ = 0} belongs to the interior of Θ, iii For any θ Θ, wθ, hθ 0 and the closure of wl has an empty interior. Finally we require some conditions on the sequence of stepsizes γ = {γ k }. A5 The sequence γ = {γ k } is non-increasing, positive and γ k = and } {γ k 2 + k 1/2 γ k <. The following theorem is a straightforward simplification of Andrieu et al., 2005, Theorems 5.4 and 5.5 and shows that the tail probability of the number of reinitialisations decreases faster than any exponential and that the parameter sequence {θ k } converges to the stationary set L. For a point x and a set A we denote dx, A def = inf{ x y : y A}. Theorem 11. Let {K q, q 0} be a compact coverage of Θ and let γ = {γ k } be a real valued sequence. Consider the time homogeneous Markov chain {Z k } on Z with transition
20 20 C. ANDRIEU & É. MOULINES probability R as defined in Section 2. Assume A1-5. Then, lim sup k 1 log k [ P x,θ inf x,θ Θ sup x,θ Θ P x,θ [sup κ n k] =, n 0 ] lim dθ k, L = 0 = 1. k 6. Consistency and invariance principle for the adaptive N-SRW kernel. In this section we show how our results can be applied to the adaptive N-SRWM algorithm proposed by Haario et al and described in Section 1. We first illustrate how the conditions required to prove the LLN in Haario et al can be alleviated. In particular no boundedness condition is required on the parameter set Θ, but rather conditions on the tails of the target distribution π. We then extend these results further and prove a central limit theorem Theorem 15. In view of the results proved above it is required a to prove the ergodicity and regularity conditions for the Markov kernels outlined in assumption A1 b to prove that the reinitializations occur finitely many times stability and that {θ k } eventually converges. Note again that the convergence property is only required for the CLT. We first focus on a. The geometric ergodicity of the SRWM kernel has been studied by Roberts and Tweedie 1996 and refined by Jarner and Hansen 2000; the regularity of the SRWM kernel has, to the best of our knowledge, not been considered in the literature. The geometric ergodicity of the SRWM kernel mainly depends on the tail properties of the target distribution π. We will therefore restrict our discussion to target distributions that satisfy the following set of conditions. These are not minimal but easy to check in practice see Jarner and Hansen 2000 for details. M The probability density π is defined on = R nx for some integer n x and has the following properties: i It is bounded, bounded away from zero on every compact set and continuously differentiable. ii It is super-exponential, i.e. lim x + x, log π x =. x iii The contours A x = {y : πy = πx} are asymptotically regular, i.e. x lim sup π x, < 0. 
that is, $\limsup_{|x| \to \infty} \left\langle \frac{x}{|x|}, \frac{\nabla \pi(x)}{|\nabla \pi(x)|} \right\rangle < 0$. We now establish uniform minorisation and drift conditions for the SRWM algorithm defined in Eq. 3. Let $\mathcal{M}$ denote the set of probability densities w.r.t. the Lebesgue
21 ERGODICITY OF ADAPTIVE MCMC 21 measure λ Leb. For any a, b > 0, define Q a,b M, Q a,b def = { q M, qx = q x and } inf qx b x a. 47 Proposition 12. Assume M. For any η 0, 1, set V = π η /sup x πx η. Then, 1. For any non-empty compact set C there exists a > 0 such that for any b > 0 such that Q a,b there exists ɛ > 0 such that C is a 1,ɛ-small set for the elements of {Pq SRW : q Q a,b }, with minorisation probability distribution ϕ such that for any A B, ϕa = λ Leb A C/λ Leb C, i.e. inf q Q a,b P SRW q x, A ɛϕa for all x C and A B Furthermore, for any a > 0 and b > 0 such that Q a,b, sup lim sup q Q a,b x + sup x,q Q a,b Pq SRW V x V x Pq SRW V x V x < 1, 49 < Let q, q M be two symmetric probability distributions. Then, for any r [0, 1] and any f L V r we have P q SRW f Pq SRW f V r 2 f V r qx q x λ Leb dx. 51 The proof is in Appendix C. As an example of application one can again consider the adaptive N-SRWM introduced earlier in Section 1, where the proposal distribution is N 0, Γ. In the following lemma, we show that the mapping Γ PN SRW 0,Γ is Lipschitz continuous. This result can be generalized to distributions in the curved exponential family see Proposition 16. Lemma 13. Let K be a convex compact subset of C nx + and set V = π η /sup π η for some η 0, 1. For any r [0, 1], any Γ, Γ K K, f L V r, we have P SRW N 0,Γ f P SRW N 0,Γ f V r 2n x λ min K f V r Γ Γ, where for Γ C nx +, Γ 2 = Tr[ΓΓ T ] and λ min K is the minimum possible eigenvalue for matrices in K. The proof is in Appendix D. We now turn on to proving that the stochastic approximation procedure outlined by Haario et al is ultimately pathwise bounded and eventually converges. In the case of the algorithm proposed by Haario et al. 2001, the parameter estimates µ k and Γ k take the form of maximum likelihood estimates under the i.i.d. multivariate Gaussian model. It therefore comes as no surprise if the appropriate
22 22 C. ANDRIEU & É. MOULINES Lyapunov w function to check A4 is the Kullback-Leibler divergence between the target density π and the normal density N µ, Γ, wµ, Γ = log detγ + µ µ π T Γ 1 µ µ π + TrΓ 1 Γ π, 52 where µ π and Γ π are the mean and covariance of the target distribution, defined in 9. Using straightforward algebra and the definition 8 of the mean field h, one can check that wµ, Γ, hµ, Γ = 2µ µ π T Γ 1 µ µ π TrΓ 1 Γ Γ π Γ 1 Γ Γ π µ µ π T Γ 1 µ µ π 2, 53 that is wθ, hθ 0 for any θ def = µ, Γ Θ, with equality if and only if Γ = Γ π and µ = µ π. The situation is in this case simple as the set of stationary points {θ Θ, hθ = 0} is reduced to a single point, and the Lyapunov function w goes to infinity as µ or Γ goes to the boundary of the cone of positive matrices. Now it can be shown that these results lead to the following intermediate lemma, see Andrieu et al for details. Lemma 14. Assume M, let H and h be as in Eq. 5 and Eq. 8. Then, A3-4 are satisfied with V = π 1 /sup π 1 for any β 0, 1/2] and w as in Eq. 52. In addition, the set of stationary points L def = {θ Θ def = R nx C+ nx, wθ, hθ = 0} is reduced to a single point θ π = µ π, Γ π whose components are respectively the mean and covariance of the distribution π. From Proposition 12 and Lemmas 13, we deduce our main theorem for this section, concerned with the adaptive N-SRWM of Haario et al as described in Section 1 but with reprojections as in Section 2. Theorem 15. Consider the process {Z k } with {P θ, θ = µ, Γ Θ def = R nx C+ nx, q θ = N 0, λγ, λ > 0} as in Eq. 3, {H θ, θ Θ} as in Eq. 5, π satisfying M, γ = {γ k, k 0} satisfying A5 and K a compact set. Let W def = π 1 /sup π 1. Then for any α [0, 1, 1. For any f LW α a strong LLN holds, i.e. n n 1 a.s f k fxπdx P For any f LW α/2 a CLT holds, i.e. 1 n [f k πf] D n P N 0, σ 2 θ π, f, if σθ π, f > 0, 1 nσθπ, f n f k πf D P N 0, 1, where θ π = µ π, Γ π and σ 2 θ π, f are defined in 38.
23 ERGODICITY OF ADAPTIVE MCMC 23 The proof is immediate. We refer the reader to Haario et al for applications of this type of algorithm to various settings. 7. Application: matching π with mixtures Setup. The independence Metropolis-Hastings algorithm IMH corresponds to the case where the proposal distribution used in a MH transition probability does not depend on the current state of the MCMC chain, i.e. qx, y = qy for some density q M. The transition kernel of the Metropolis algorithm is then given for x and A B by Pq IMH x, A = α q x, yqy λ Leb dy + 1 A x 1 α q x, y qy λ Leb dy A with α q x, y = 1 πyqx πxqy. 55 Irreducibility of Markov chains built on this model naturally require that qx > 0 whenever πx > 0. In fact the performance of the IMH is known to depend on how well the proposal distribution mimics the target distribution and this can be quantified in several ways. For example it has been shown in Mengersen and Tweedie 1996 that the IMH sampler is geometrically ergodic if and only if there exists ε > 0 such that q Q ε,π M, where { } Q ε,π = q M : λ Leb {x : qx/πx < ε} = This condition implies that the whole state space is a 1, ε-small set, which in turn implies that convergence occurs uniformly, at a geometric rate upper bounded by 1 ε. Given a family of candidate proposal distributions {q θ M, θ Θ} it seems therefore natural to maximise θ inf x πx/q θ x. However, although theoretically attractive, the optimisation of this uniform criterion might be a very ambitious task in practice. Furthermore it might not necessarily be a good choice for a given parametric family of proposal distributions: one might in this case try to optimise the transition probability for pathological features of π with small probability under π, at the expense of more fundamental characteristics of the target, such as its global shape. Additionally, such pathological features can very often be taken care of by other specialised MCMC updates. 
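A single IMH transition, Eq. 55, can be sketched as follows (Python; the toy target and proposal are our own illustration): propose $Y \sim q$ independently of the current state and accept with probability $\min\{1, \pi(Y)q(X)/(\pi(X)q(Y))\}$. The proposal here has heavier tails than the target, so that $\inf_x q(x)/\pi(x) > 0$ and the geometric-ergodicity condition $q \in Q_{\varepsilon,\pi}$ of Eq. 56 holds.

```python
import math
import random

random.seed(1)

def imh_step(x, log_pi, log_q, sample_q):
    """One Independence Metropolis-Hastings transition (Eq. 55)."""
    y = sample_q()                      # proposal ignores the current state x
    log_ratio = (log_pi(y) + log_q(x)) - (log_pi(x) + log_q(y))
    if random.random() < math.exp(min(0.0, log_ratio)):
        return y                        # accept
    return x                            # reject: stay at x

# toy example: target N(0,1), proposal N(0, 2^2); here q(x)/pi(x) >= 1/2 for
# all x, so the whole space is a (1, eps)-small set as discussed above
log_pi = lambda x: -0.5 * x * x
log_q = lambda x: -0.5 * (x / 2.0) ** 2
sample_q = lambda: random.gauss(0.0, 2.0)

x, xs = 0.0, []
for _ in range(20000):
    x = imh_step(x, log_pi, log_q, sample_q)
    xs.append(x)
mean = sum(xs) / len(xs)
var = sum(v * v for v in xs) / len(xs) - mean ** 2
```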
Instead of this uniform criterion we suggest the optimisation of an average property of the ratio πx/q θ x under π, which possesses the advantage of being more amenable to computations. It is argued in Gasemyr 2003 that minimising the total variation distance π q θ T V is a sensible criterion to optimise, since it can be proved that the expected acceptance probability is bounded below by 1 π q θ T V and that for a bounded function f the first covariance coefficient of the Markov chain in the stationary regime is bounded as follows: cov π f k, f k+1 5/2 2 sup x f π q θ T V. However no systematic way of effectively minimising this criterion is described. We propose here to use the Kullback- Leibler divergence between the target distribution π and an auxiliary distribution q θ close in some sense to q θ, Kπ q θ = πx log πx q θ x λleb dx. 57
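Because $\pi$ is typically known only up to a normalising constant, the criterion in Eq. 57 can be estimated, up to an additive constant that does not depend on $\theta$, as an average of $\log \tilde{\pi}(X_i) - \log q_\theta(X_i)$ over (possibly approximate) draws from $\pi$; minimising this average over $\theta$ is then equivalent to minimising $K(\pi \| q_\theta)$. A minimal sketch under the simplifying assumption of i.i.d. draws from $\pi$ (function names are illustrative):

```python
import math
import random

random.seed(2)

def kl_estimate(samples, log_pi_tilde, log_q):
    """Monte Carlo estimate of K(pi || q) + const (Eq. 57); the constant is
    the log normaliser of pi, irrelevant when comparing candidate q's."""
    return sum(log_pi_tilde(x) - log_q(x) for x in samples) / len(samples)

# toy check: pi = N(0,1); the better-matched of two Gaussian candidates
# should receive the smaller estimated divergence
log_pi_tilde = lambda x: -0.5 * x * x                                 # unnormalised
log_q_good = lambda x: -0.5 * x * x - 0.5 * math.log(2 * math.pi)     # N(0,1)
log_q_bad = (lambda x: -0.5 * (x / 3.0) ** 2 - math.log(3.0)
             - 0.5 * math.log(2 * math.pi))                           # N(0,9)

samples = [random.gauss(0.0, 1.0) for _ in range(20000)]
kl_good = kl_estimate(samples, log_pi_tilde, log_q_good)
kl_bad = kl_estimate(samples, log_pi_tilde, log_q_bad)
```

Note that `kl_good` equals the constant $\tfrac{1}{2}\log 2\pi$ exactly (the true divergence is zero), while `kl_bad` exceeds it by $K(\mathcal{N}(0,1)\,\|\,\mathcal{N}(0,9)) = \log 3 + \tfrac{1}{18} - \tfrac{1}{2} \approx 0.65$.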
24 24 C. ANDRIEU & É. MOULINES The proposal distribution q θ of the IMH algorithm is then constructed from q θ : as we shall see this offers an additional degree of freedom which, in particular, will be a simple way of ensuring that {q θ, θ Θ} Q ε,π, defined in 56, for some ε > 0 see Remark 7. The use of this criterion possesses several advantages. First, invoking Pinsker s inequality, it is possible to repeat Gasemyr 2003 s arguments. Secondly it formalises several ideas that have been proposed in the literature cf. Gasemyr 2003 and Gelman and Rubin 1992 among others. In Gelman and Rubin 1992 it is suggested to use the EM Expectation- Minimisation algorithm in order to fit a mixture of normals in the, possibly penalised, maximum likelihood sense to samples from a preliminary run of an MCMC algorithm. This mixture can then be used to define the proposal distribution of an IMH. As we shall see, the choice of the Kullback-Leibler KL divergence corresponds precisely to this choice and naturally leads to an on-line EM algorithm that allows us to adjust q θ to π as samples from π become available from the MCMC sampler. Finally, we point out at this stage that although we restrict here our discussion to the IMH algorithm, the KL criterion can equally be used for other updates, such as the SRWM algorithm. The algorithm proposed by Haario et al is in this case a particular instance of the algorithm hereafter. In order to allow for flexibility and the description of a general class of algorithms, we consider here mixtures of distributions in the exponential family for the auxiliary proposal distribution. More precisely, let Ξ R n ξ, Z R nz for some integers n ξ and n z and define the following family of exponential probability densities defined with respect to the product measure λ Leb µ for some measure µ on Z E c = {f : f ξ x, z = exp { ψξ + T x, z, φξ } ; ξ, x, z Ξ Z}, where ψ : Ξ R, φ : Ξ R n θ and T : Z R n θ. 
Define E the set of densities q ξ that are marginals of densities from E c, i.e. such that for any ξ, x Ξ we have q ξ x = f ξ x, zµdz. 58 Z This family of densities covers in particular finite mixtures of multivariate normal distributions. Here, the variable z plays the role of the label of the class, which is not observed see e.g. Titterington et al Using standard missing data terminology, f ξ x, z is the complete data likelihood and q ξ is the associated incomplete data likelihood, which is the marginal of the complete data likelihood with respect to the class labels. When the number of observations is fixed, a classical approach to estimate the parameters of a mixture distribution consists of using the EM algorithm Classical EM algorithm. The classical EM algorithm is an iterative procedure which consists of two steps. Given n independent samples 1,..., n distributed marginally according to π: 1 Expectation step: calculate the conditional expectation of the complete data log-likelihood given the observations and ξ k the estimate of ξ at iteration k ξ Qξ, ξ k def = n E [logf ξ i, Z i i, ξ k ]. i=1
25 ERGODICITY OF ADAPTIVE MCMC 25 2 Maximisation step: maximise the function ξ Qξ, ξ k with respect to ξ. The new estimate for ξ is ξ k+1 = argmax ξ Ξ Qξ, ξ k provided that it exists and is unique. The key property at the core of the EM algorithm is that the incomplete data likelihood n i=1 q ξ k+1 i n i=1 q ξ k i is increased at each iteration, with equality if and only if ξ k is a stationary point i.e. a local or global minimum or a saddle point: under mild additional conditions see e.g. Wu. 1983, the EM algorithm therefore converges to stationary points of the marginal likelihood. Note that, when n, under appropriate conditions, the renormalized incomplete data log-likelihood n 1 n i=1 log q ξ i converges to E π [log q ξ ] which is equal, up to a constant and a sign, to the Kullback-Leibler divergence between π and q ξ. In our particular setting the classical batch form of the algorithm is as follows. First define for ξ Ξ the conditional distribution ν ξ x, z def = f ξx, z, 59 q ξ x where q ξ is given by 58. Now, assuming that Z T x, z ν ξx, zµdz <, one can define for x and ξ Ξ ν ξ T x def = T x, zν ξ x, zµdz, 60 and check that for f ξ E c and any ξ, ξ Ξ Ξ where L : Θ Ξ R is defined as Z E{logf ξ i, Z i i, ξ } = Lν ξ T i ; ξ, Lθ; ξ def = ψξ + θ, φξ, with Θ def = T, Z. 61 From this, one easily deduces that for n samples, 1 n Qξ, ξ k = nl ν ξk T i ; ξ. n Assuming now for simplicity that for all θ Θ, the function ξ Lθ; ξ reaches its maximum at a single point denoted ˆξθ, i.e. Lθ; ˆξθ Lθ; ξ for all ξ Ξ, the EM recursion can then be simply written as ξ k+1 = ˆξ 1 n ν ξk T i. n i=1 The condition on the existence and uniqueness of ˆξθ is not restrictive: it is for example satisfied for finite mixtures of normal distributions. More sophisticated generalisations of the EM algorithm have been developed in order to deal with situations where this condition is not satisfied, see e.g. Meng and Van Dyk Our scenario differs from the classical setup above in two respects. 
First, the number of samples considered evolves with time, so that ξ must be estimated on the fly. Secondly, the samples $\{X_i\}$ are generated by a transition probability with invariant distribution π and are therefore not independent. We address the first problem in Subsection 7.3, then both problems simultaneously in Subsection 7.4, where we describe our particular adaptive MCMC algorithm.
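For concreteness, the classical E and M steps above specialise as follows for a two-component univariate Gaussian mixture (a minimal Python sketch with illustrative names; the responsibilities `r` play the role of the conditional label law $\nu_\xi$, and the weighted maximum-likelihood update plays the role of $\hat{\xi}(\theta)$):

```python
import math
import random

random.seed(3)

def norm_pdf(x, m, v):
    return math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)

def em_step(xs, w, m, v):
    """One EM iteration for a 2-component Gaussian mixture.
    E-step: responsibilities r[i][j] = P(Z_i = j | X_i, current parameters).
    M-step: weighted MLE of the weights, means and variances."""
    r = []
    for x in xs:
        p = [w[j] * norm_pdf(x, m[j], v[j]) for j in range(2)]
        s = p[0] + p[1]
        r.append([p[0] / s, p[1] / s])
    n = len(xs)
    w_new, m_new, v_new = [], [], []
    for j in range(2):
        nj = sum(r[i][j] for i in range(n))
        mj = sum(r[i][j] * xs[i] for i in range(n)) / nj
        vj = sum(r[i][j] * (xs[i] - mj) ** 2 for i in range(n)) / nj
        w_new.append(nj / n)
        m_new.append(mj)
        v_new.append(max(vj, 1e-6))
    return w_new, m_new, v_new

# toy data from the well-separated mixture 0.5 N(-3,1) + 0.5 N(3,1)
xs = [random.gauss(-3.0 if random.random() < 0.5 else 3.0, 1.0)
      for _ in range(2000)]
w, m, v = [0.5, 0.5], [-1.0, 1.0], [4.0, 4.0]
for _ in range(50):
    w, m, v = em_step(xs, w, m, v)
```

Each iteration increases the incomplete-data likelihood, and on this toy data the means converge to approximately $\pm 3$.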
26 26 C. ANDRIEU & É. MOULINES 7.3. Sequential EM algorithm. Sequential implementations of the EM algorithm for estimating the parameters of a mixture when the data are observed sequentially in time have been considered by several authors see Titterington et al., 1985, Chapter 6, Arcidiacono and Bailey Jones 2003 and the references therein. The version presented here is in many respect a standard adaptation of these algorithms and consists of recursively and jointly estimating and maximising with respect to ξ the function θξ = E π [log q ξ ] = π{νt ξ }, which, as pointed out earlier, is the Kullback-Leibler divergence between π and q ξ, up to an additive constant and a sign. At iteration k + 1, given an estimate θ k of θ and ξ k = ˆξθ k, sample k+1 π and calculate θ k+1 = 1 γ k+1 θ k + γ k+1 ν ξk T k+1 = θ k + γ k+1 ν ξk T k+1 θ k, 62 where {γ k } is a sequence of stepsizes and γ k [0, 1]. This can be interpreted as a stochastic approximation algorithm θ k+1 = θ k + γ k+1 Hθ k, k+1 with for θ Θ, Hθ, x = νˆξθ T x θ and hθ = π νˆξθ T θ. 63 It is possible to introduce at this stage a set of simple conditions on the distributions in E c that ensures the convergence of the sequence {θ k } defined above. By convergence we mean here that {θ k } converges to the set of stationary points of the Kullback-Leibler divergence between π and qˆξθ, i.e. L def = {θ Θ : wθ = 0}, where for θ Θ wθ = Kπ qˆξθ, 64 and K and q ξ are defined in 57 and 58, respectively. It is worth noticing that these very same conditions will be used to prove the convergence of our adaptive MCMC algorithm. E1 i The sets Ξ and Θ are open subsets of R n ξ and R n θ respectively. Z is a compact subset of R nz. ii For any x, T x def = inf{m : µ{z : T x, z M} = 0} <. iii The functions ψ : Ξ R and φ : Ξ R n θ are twice continuously differentiable on Ξ. iv There exists a continuously differentiable function ˆξ : Θ Ξ such that for all θ, ξ Θ Ξ, Lθ; ˆξθ Lθ; ξ. Remark 5. 
For many models the function $\xi \mapsto L(\theta; \xi)$ admits a unique global maximum for any $\theta \in \Theta$, and the existence and differentiability of $\theta \mapsto \hat{\xi}(\theta)$ follow from the implicit function theorem under mild regularity conditions. E2 i The level sets $\{\theta \in \Theta,\ w(\theta) \leq M\}$, $M > 0$, are compact;
27 ERGODICITY OF ADAPTIVE MCMC 27 ii The set L def = {θ Θ, wθ = 0} of stationary points is included in a compact subset of Θ; iii The closure of wl has an empty interior. Remark 6. Assumption E2 depends on both the properties of π and qˆξθ and should therefore be checked on a case by case basis. Note however that a these assumptions are satisfied for finite mixtures of distributions in the exponential family under classical technical conditions on the parametrization beyond the scope of the present paper see, among others Titterington et al., 1985, chapter 6 and Arcidiacono and Bailey Jones 2003 for details b the third assumption in E2 can very often be checked using Sard s theorem. We first prove here an intermediate proposition concerned with estimates of the variation q ξ q ξ under E1 in various senses. Note that most of these results are not used in this section, but will be useful in the following. Proposition 16. Let { q ξ, ξ Ξ} E be a family of distributions satisfying E1. Then for any convex compact set K Ξ 1. There exists a constant C < such that sup ξ log q ξ x C1 + T x. 65 ξ K 2. For any ξ, ξ, x K 2 there exists a constant C < such that q ξ x q ξ x < C ξ ξ 1 + T x sup q ξ x. 66 ξ K 3. For W [1, such that sup ξ K q ξx[1 + T x]w xλ Leb dx < and any ξ, ξ K there exists a constant C < such that q ξ x q ξ x W xλ Leb dx C ξ ξ. 67 The proof is in Appendix E. The key to establish the convergence of the stochastic approximation procedure here consists of proving that wθ = Kπ qˆξθ plays the rôle of a Lyapunov function. This is hardly surprising as the algorithm aims at minimizing sequentially in time the incomplete data likelihood. More precisely, we have Proposition 17. Assume E1. Then, for all θ Θ, wθ, hθ 0 and L = {θ Θ : wθ, hθ = 0} = {θ Θ : wθ = 0}, 68 ˆξ L = {ξ Ξ : ξ Kπ q ξ = 0}, 69 where θ hθ is given in 63. The proof is in Appendix E. Another important result to prove convergence is the regularity of the field θ H θ. We have
28 28 C. ANDRIEU & É. MOULINES Proposition 18. Assume E1. Then {H θ, θ Θ} is 1 + T 2 -Lipschitz, where H θ is defined in Eq. 63. The proof is in Appendix E. From this and standard results on the convergence of SA, one may show that the SA procedure converges pointwise under E On-line EM for IMH adaptation. We now consider the combination of the sequential EM algorithm described earlier with the IMH sampler. As we shall see in Proposition 20, using qˆξθ as a proposal distribution for the IMH transition is not sufficient to ensure the convergence of the algorithm, and it will be necessary to use a mixture of a fixed distribution ζ which will not be updated during the successive iterations and an adaptive component, here qˆξθ. More precisely we define the following family of parametrized IMH transition probabilities {P θ, θ Θ}. For e 0, 1] let ζ Q e,π assumed nonempty be a density which does not depend on θ Θ, let ι 0, 1 and define the family of IMH transition probabilities P e,ζ def = {P θ def = P IMH q θ, θ Θ} with {q θ def = 1 ι qˆξθ + ι ζ, θ Θ}. 70 The following properties on ζ and E c will be required in order to ensure that P e,ζ satisfies A1: E3 i There exist e > 0 and ζ Q e,π such that for any compact K Ξ { } sup inf M : λ Leb qξ x1 + T x M = 0 <. 71 ξ K ζx ii There exists W [1, such that for any compact subset K Ξ, W x1 + T xζxλ Leb dx + sup W x1 + T x q ξ xλ Leb dx <, ξ K and sup x K W x <, where K is defined in Section 2. Remark 7. It is worth pointing out that the above choice for q θ and the condition ζ Q e,π automatically ensure that {q θ, θ Θ} Q ε,π for ε = eι. The basic version see Section 2 of our algorithm now proceeds as follows. Set θ 0 Θ, ξ 0 = ˆξθ 0 and draw 0 according to some initial distribution. At iteration k + 1 for k 0, draw k+1 P θk k, where P θ is given in 70. Compute θ k+1 = θ k + γ k+1 ν ξk T k+1 θ k and ξ k+1 = ˆξθ k+1. 
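The basic recursion just described (without the reprojections studied below) can be sketched as follows in Python. To keep the sketch short we take a degenerate one-component "mixture", so that $\nu_{\xi}T(x)$ reduces to the sufficient statistic $T(x) = (x, x^2)$ and $\hat{\xi}(\theta)$ is moment matching; the target, the safeguard density $\zeta$, the mixing weight and all names are our own illustrative choices, not prescriptions from the paper.

```python
import math
import random

random.seed(4)

log_pi = lambda x: -0.5 * (x - 1.0) ** 2       # toy target: N(1,1), unnormalised

def log_norm(x, m, v):
    return -0.5 * (x - m) ** 2 / v - 0.5 * math.log(2 * math.pi * v)

IOTA = 0.2        # weight iota of the fixed safeguard component zeta
ZETA_V = 25.0     # zeta = N(0,25): tails heavy enough that zeta is in Q_{e,pi}

def log_q(x, m, v):
    """log of q_theta = (1 - iota) * q_{xi_hat(theta)} + iota * zeta."""
    a = math.log(1 - IOTA) + log_norm(x, m, v)
    b = math.log(IOTA) + log_norm(x, 0.0, ZETA_V)
    hi = max(a, b)
    return hi + math.log(math.exp(a - hi) + math.exp(b - hi))

def sample_q(m, v):
    if random.random() < IOTA:
        return random.gauss(0.0, math.sqrt(ZETA_V))
    return random.gauss(m, math.sqrt(v))

x, theta = 0.0, [0.0, 4.0]                     # theta tracks pi-averages of T(x)
for k in range(20000):
    m = theta[0]
    v = max(theta[1] - m * m, 1e-3)            # xi_hat(theta): moment matching
    y = sample_q(m, v)                         # IMH step with adapted proposal
    la = (log_pi(y) + log_q(x, m, v)) - (log_pi(x) + log_q(y, m, v))
    if random.random() < math.exp(min(0.0, la)):
        x = y
    g = (k + 2) ** -0.7                        # stepsizes compatible with A5
    theta[0] += g * (x - theta[0])             # theta_{k+1} = theta_k + gamma (T(X_{k+1}) - theta_k)
    theta[1] += g * (x * x - theta[1])

mu_hat = theta[0]
var_hat = theta[1] - mu_hat ** 2
```

In this toy run the adapted proposal drifts toward the target's own mean and variance, which is the fixed point of the mean field $h$ in the one-component case.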
We will study here the corresponding algorithm with reprojections which results in the homogeneous Markov chain {Z k, k 0} as described in Section 2. We now establish intermediate results about P e,ζ and {H θ, θ Θ} which will lead to the proof that A1-3 are satisfied. We start with a general proposition about the properties of IMH transition probabilities, relevant to check A1. Proposition 19. Let V : [1, + and let q Q ε,π for some ε > 0. Then,
29 ERGODICITY OF ADAPTIVE MCMC is a 1, ε-small set with minorisation distribution ϕ = q and Pq IMH V x 1 εv x + qv, where qv = qxv xλ Leb dx. 2. For any f L V and any proposal distributions q, q Q ε,π 2 f V 1 P IMH q f Pq IMH f V qx q x V xλ Leb dx+ [qv q V ] 1 1 q 1 q q 1 q The proof is in Appendix F. In contrast with the SRWM, the V -norm P IMH q f Pq IMH f V can be large even in situations where qx q x V xλ Leb dx is small. This stems from the fact that the ratio of densities q/q enters the upper bound above. As we shall see in Proposition 20 below, this is what motivates our definition of the proposal distributions in 70 as a mixture of q ξ E and a non-adaptive distribution ζ which satisfies E3. Proposition 20. Assume that the family of distributions { q ξ, ξ Ξ} E satisfies E1 and E3. Then the family of transition kernels P e,ζ given in 70 satisfies A1 with ɛ = eι, V = W, ϕ = ζ and if W is bounded then C =, λ = 0, otherwise choose ε 0, eι such that C = {x : V x < ε 1 sup θ K q θ V } is such that ζc > 0 and set λ = 1 eι + ε. The proof is in Appendix F. We are now in a position to present our final result: Theorem 21. Let π M and { q ξ, ξ Ξ} E be a family of distributions. Define Θ := T, Z. Consider the following family of transition probabilities and functions, i {P θ, θ Θ} as in Eq. 70, where ζ Q e,π for some e > 0, { q ξ, ξ Ξ} is further assumed to satisfy E1, E3 with V such that T L V β/2 for some β [0, 1 and E2, ii {H θ, θ Θ} as in Eq. 63. Let {K q, q 0} be a compact coverage of Θ, let K be a compact set and let γ = {γ k } satisfy A5. Consider the time homogeneous Markov chain {Z k } on Z with transition probability R as defined in Section 2. Then for any x, θ K K 0 and any α < 1 β, 1. for any f L V α, n 1 n f k πf 0 P a.s. 2. there P -a.s. exists a random variable θ {θ Θ : θ Kπ qˆξθ = 0} such that provided that σθ, f > 0 and for any f L V α/2, 1 n σθ, f n f k πf D P N 0, 1, where σθ, f is given in Eq. 38.
Proof. The application of the propositions above together with Proposition 20 shows that A1-3 are satisfied, which together with E2 and A5 implies Theorem 11. We then conclude with Theorems 8 and 9.

Remark 8. It is worth noticing that, provided that $\pi\in\mathcal{M}$ satisfies M, the results of Propositions 12, 16, 17 and 18 proved in this paper easily allow one to establish a result similar to Theorem 21 for a generalization of the N-SRWM of Haario et al. (2001), described here in Section 1 and studied in Section 5, to the case where the proposal distribution belongs to $\mathcal{E}$, i.e. when the proposal is a mixture of distributions.

8. Acknowledgements. The authors would like to thank Randal Douc, David Leslie, the anonymous reviewers and in particular Gersende Fort for very helpful comments that have helped improve an earlier version of the paper.

APPENDIX A: STABILITY OF THE INHOMOGENEOUS CHAIN

Proof of Lemma 5. Under A1-2 we have, for $(x,\theta)\in\mathsf{X}\times\Theta$ and $k\ge1$,
$$E^\rho_{x,\theta}\big[V(X_k)1\{\sigma(\mathcal{K})\ge k\}\big]=E^\rho_{x,\theta}\big[E^\rho\big[V(X_k)\,\big|\,\mathcal{F}_{k-1}\big]1\{\sigma(\mathcal{K})\ge k\}\big]\le\lambda\,E^\rho_{x,\theta}\big[V(X_{k-1})1\{\sigma(\mathcal{K})\ge k-1\}\big]+b.$$
Now a straightforward induction leads to
$$E^\rho_{x,\theta}\big[V(X_k)1\{\sigma(\mathcal{K})\ge k\}\big]\le\lambda^kV(x)+\frac{b}{1-\lambda},$$
which shows that there exists a constant $C$, depending only on $\mathcal{K}$ and the constants appearing in A1, such that for all $k\ge0$,
$$E^\rho_{x,\theta}\big[V(X_k)1\{\sigma(\mathcal{K})\ge k\}\big]\le CV(x).\qquad(73)$$
Now, for any $r\in[0,1]$, by Jensen's inequality we have for any $k\ge0$,
$$E^\rho_{x,\theta}\big[V^r(X_k)1\{\sigma(\mathcal{K})\ge k\}\big]\le\big(E^\rho_{x,\theta}\big[V(X_k)1\{\sigma(\mathcal{K})\ge k\}\big]\big)^r\le C^rV^r(x),\qquad(74)$$
showing (21). Similarly, using again Jensen's inequality,
$$E^\rho_{x,\theta}\Big[\max_{1\le m\le k}a_mV^r(X_m)1\{\sigma(\mathcal{K})\ge m\}\Big]\le\Big(E^\rho_{x,\theta}\Big[\max_{1\le m\le k}a_mV(X_m)1\{\sigma(\mathcal{K})\ge m\}\Big]\Big)^r\le\Big(\sum_{m=1}^{k}a_m\,E^\rho_{x,\theta}\big[V(X_m)1\{\sigma(\mathcal{K})\ge m\}\big]\Big)^r\le C^r\Big(\sum_{m=1}^{k}a_m\Big)^rV^r(x),$$
showing (22). Finally, since $1\{\sigma(\mathcal{K})\ge m\}\le1\{\sigma(\mathcal{K})\ge k\}$ for $k\le m$, we have
$$E^\rho_{x,\theta}\Big[\max_{1\le m\le n}1\{\sigma(\mathcal{K})\ge m\}\sum_{k=1}^{m}V^r(X_k)\Big]\le E^\rho_{x,\theta}\Big[\sum_{k=1}^{n}V^r(X_k)1\{\sigma(\mathcal{K})\ge k\}\Big]\le CnV^r(x),$$
showing (23).

The following proposition is a direct adaptation of Birnbaum and Marshall's (1961) inequality:

Proposition 22. Let $\{S_k,\mathcal{F}_k,\ k\ge0\}$ be a submartingale, i.e. $E[S_k\mid\mathcal{F}_{k-1}]\ge S_{k-1}$ a.e. Let $\{a_k>0,\ 1\le k\le n\}$ be a non-increasing real-valued sequence. If $p\ge1$ is such that $E|S_k|^p<\infty$ for $k\in\{1,\ldots,n\}$, then for $m\le n$,
$$P\Big(\max_{m\le k\le n}a_k|S_k|\ge1\Big)\le\sum_{k=m}^{n-1}\big(a_k^p-a_{k+1}^p\big)\,E|S_k|^p+a_n^p\,E|S_n|^p.$$

APPENDIX B: PROOF OF PROPOSITIONS 3 AND 4

In the sequel, $C$ is a generic constant which may take different values upon each appearance.

Proof of Proposition 3. Let $\mathcal{K}\subset\Theta$ be a compact set and $r\in[0,1]$. For any $(\theta,\theta')\in\mathcal{K}\times\mathcal{K}$ and $f\in\mathcal{L}_{V^r}$,
$$P_\theta^nf-P_{\theta'}^nf=\sum_{j=0}^{n-1}P_\theta^j\big(P_\theta-P_{\theta'}\big)P_{\theta'}^{n-j-1}f=\sum_{j=0}^{n-1}\big(P_\theta^j-\pi\big)\big(P_\theta-P_{\theta'}\big)\big(P_{\theta'}^{n-j-1}f-\pi f\big),$$
where we have used that $\pi P_\theta=\pi P_{\theta'}=\pi$ for any $\theta,\theta'$. Theorem 2 shows that there exist a constant $C$ and $\rho\in(0,1)$ such that for any $\theta\in\mathcal{K}$, $l\ge0$ and any $f\in\mathcal{L}_{V^r}$,
$$\|P_\theta^lf-\pi f\|_{V^r}\le C\,\|f\|_{V^r}\,\rho^l.\qquad(75)$$
Under assumption A2, for any $(\theta,\theta')\in\mathcal{K}\times\mathcal{K}$, $0\le j\le n-1$ and any $f\in\mathcal{L}_{V^r}$,
$$\big\|\big(P_\theta^j-\pi\big)\big(P_\theta-P_{\theta'}\big)\big(P_{\theta'}^{n-j-1}f-\pi f\big)\big\|_{V^r}\le C\rho^j\,\big\|\big(P_\theta-P_{\theta'}\big)\big(P_{\theta'}^{n-j-1}f-\pi f\big)\big\|_{V^r}\le C\rho^j\,|\theta-\theta'|\,\big\|P_{\theta'}^{n-j-1}f-\pi f\big\|_{V^r}\le C\,|\theta-\theta'|\,\|f\|_{V^r}\,\rho^{n-1},$$
showing that there exists a constant $C<\infty$ such that for any $(\theta,\theta')\in\mathcal{K}\times\mathcal{K}$ and $f\in\mathcal{L}_{V^r}$,
$$\|P_\theta^nf-P_{\theta'}^nf\|_{V^r}\le Cn\rho^n\,|\theta-\theta'|\,\|f\|_{V^r}.\qquad(76)$$
Now consider $\{f_\theta,\ \theta\in\Theta\}$, a family of $V^r$-Lipschitz functions. From Eq. (75), for any $\theta\in\mathcal{K}$, $\sum_{k=0}^\infty\|P_\theta^kf_\theta-\pi f_\theta\|_{V^r}<\infty$ and $\hat f_\theta\overset{\mathrm{def}}{=}\sum_{k=0}^\infty\big(P_\theta^kf_\theta-\pi f_\theta\big)$ belongs to $\mathcal{L}_{V^r}$. Now we consider the difference
$$\hat f_\theta-\hat f_{\theta'}=\sum_{k=0}^\infty\big[\big(P_\theta^kf_\theta-\pi f_\theta\big)-\big(P_{\theta'}^kf_{\theta'}-\pi f_{\theta'}\big)\big]=\sum_{k=0}^\infty\big(P_\theta^k-P_{\theta'}^k\big)f_\theta+\sum_{k=0}^\infty\big[P_{\theta'}^k\big(f_\theta-f_{\theta'}\big)-\pi\big(f_\theta-f_{\theta'}\big)\big],$$
which, using Eqs. (75) and (76), shows that
$$\|\hat f_\theta-\hat f_{\theta'}\|_{V^r}\le C\,|\theta-\theta'|\sum_{k=0}^\infty k\rho^k\,\|f_\theta\|_{V^r}+C\sum_{k=0}^\infty\rho^k\,\|f_\theta-f_{\theta'}\|_{V^r},$$
and we conclude by using the fact that $\{f_\theta,\ \theta\in\Theta\}$ is a $V^r$-Lipschitz family of functions. Using the same arguments, one can prove a similar bound for $\|P_\theta\hat f_\theta-P_{\theta'}\hat f_{\theta'}\|_{V^r}$.

Proof of Proposition 4. For simplicity we set $\sigma:=\sigma(\mathcal{K})$, and in what follows $C$ is a finite constant whose value might change upon each appearance. Let $(x,\theta)\in\mathsf{X}\times\Theta$. For $k\ge k_0$ we introduce the following decomposition:
$$\big|E^\rho_{x,\theta}\big[(f(X_k)-\pi f)\,1\{\sigma\ge k\}\big]\big|\le\big|E^\rho_{x,\theta}\big[\big(f(X_k)-P^{n(k)}_{\theta_{k-n(k)}}f(X_{k-n(k)})\big)1\{\sigma\ge k\}\big]\big|+\big|E^\rho_{x,\theta}\big[\big(P^{n(k)}_{\theta_{k-n(k)}}f(X_{k-n(k)})-\pi f\big)1\{\sigma\ge k\}\big]\big|.$$
By Theorem 2, the last term is bounded by $C\,\|f\|_{V^{1-\beta}}\,V^{1-\beta}(x)\,\rho^{n(k)}\le C\rho^{-1}\,\|f\|_{V^{1-\beta}}\,\rho^{k}\,V(x)$. We consider the first term and use the following new decomposition of this bias term (compare with Haario et al. 2001):
$$\big|E^\rho_{x,\theta}\big[\big(f(X_k)-P^{n(k)}_{\theta_{k-n(k)}}f(X_{k-n(k)})\big)1\{\sigma\ge k\}\big]\big|\le\sum_{j=2}^{n(k)}\big|E^\rho_{x,\theta}\big[\big(P^{j-1}_{\theta_{k-j+1}}f(X_{k-j+1})-P^{j}_{\theta_{k-j}}f(X_{k-j})\big)1\{\sigma>k-j\}\big]\big|$$
$$\le\sum_{j=2}^{n(k)}\big|E^\rho_{x,\theta}\big[E^\rho_{x,\theta}\big[\big(P^{j-1}_{\theta_{k-j+1}}-P^{j-1}_{\theta_{k-j}}\big)f(X_{k-j+1})\,\big|\,\mathcal{F}_{k-j}\big]1\{\sigma>k-j\}\big]\big|$$
$$\le C\,\|f\|_{V^{1-\beta}}\sum_{j=2}^{n(k)}(j-1)\,\rho^{j-1}\,E^\rho_{x,\theta}\big[|\theta_{k-j+1}-\theta_{k-j}|\,V^{1-\beta}(X_{k-j+1})\,1\{\sigma>k-j\}\big]\le C\,\|f\|_{V^{1-\beta}}\sum_{j=1}^{n(k)-1}j\,\rho^{j}\,\rho_{k-j}\,E^\rho_{x,\theta}\big[V(X_{k-j})\,1\{\sigma\ge k-j\}\big]\le C\,\|f\|_{V^{1-\beta}}\,V(x)\,\rho_{k-n(k)+1},$$
where we have successively used Eq. (76) with $r=1-\beta$, A3, the fact that the sequence $\{\rho_k\}$ is assumed non-increasing, and Eq. (74). We conclude with the additional condition on $\{\rho_k\}$.

APPENDIX C: PROOF OF PROPOSITION 12

For any $x\in\mathsf{X}$, define the acceptance region $A(x)=\{z:\ \pi(x+z)\ge\pi(x)\}$ and the rejection region $R(x)=\{z:\ \pi(x+z)<\pi(x)\}$. From the definition (47) of $\mathcal{Q}_{a,b}$, (Roberts and Tweedie, 1996, Theorem 2.2) applies for any $q\in\mathcal{Q}_{a,b}$, and we can conclude that (48) is satisfied. Noting that the two sets $A(x)$ and $R(x)$ do not depend on the proposal distribution $q$ and using the conclusion of the proof of Theorem 4.3 of Jarner and Hansen (2000), we have
$$\inf_{q\in\mathcal{Q}_{a,b}}\liminf_{|x|\to+\infty}\int_{A(x)}q(z)\,\lambda^{\mathrm{Leb}}(dz)>0,$$
so that, from the conclusion of the proof of Theorem 4.1 of Jarner and Hansen (2000),
$$\sup_{q\in\mathcal{Q}_{a,b}}\limsup_{|x|\to+\infty}\frac{P_q^{\mathrm{SRW}}V(x)}{V(x)}=1-\inf_{q\in\mathcal{Q}_{a,b}}\liminf_{|x|\to+\infty}\int_{A(x)}q(z)\,\lambda^{\mathrm{Leb}}(dz)<1,$$
which proves (49). Finally, for any $q\in\mathcal{K}_{a,b}$,
$$P_q^{\mathrm{SRW}}V(x)=V(x)+\int_{A(x)}\Big[\Big(\frac{\pi(x+z)}{\pi(x)}\Big)^{-\eta}-1\Big]V(x)\,q(z)\,\lambda^{\mathrm{Leb}}(dz)+\int_{R(x)}\frac{\pi(x+z)}{\pi(x)}\Big[\Big(\frac{\pi(x+z)}{\pi(x)}\Big)^{-\eta}-1\Big]V(x)\,q(z)\,\lambda^{\mathrm{Leb}}(dz),$$
so that
$$\frac{P_q^{\mathrm{SRW}}V(x)}{V(x)}\le\int_{A(x)}\Big(\frac{\pi(x+z)}{\pi(x)}\Big)^{-\eta}q(z)\,\lambda^{\mathrm{Leb}}(dz)+\int_{R(x)}\Big[1-\frac{\pi(x+z)}{\pi(x)}+\Big(\frac{\pi(x+z)}{\pi(x)}\Big)^{1-\eta}\Big]q(z)\,\lambda^{\mathrm{Leb}}(dz)\le\sup_{0\le u\le1}\big(1-u+u^{1-\eta}\big),$$
which proves (50). Now notice that
$$P_q^{\mathrm{SRW}}f(x)-P_{q'}^{\mathrm{SRW}}f(x)=\int\alpha(x,x+z)\big(q(z)-q'(z)\big)f(x+z)\,\lambda^{\mathrm{Leb}}(dz)+f(x)\int\alpha(x,x+z)\big(q'(z)-q(z)\big)\,\lambda^{\mathrm{Leb}}(dz).$$
We therefore focus, for $r\in[0,1]$ and $f\in\mathcal{L}_{V^r}$, on the term
$$\Big|\int\alpha(x,x+z)\big(q(z)-q'(z)\big)f(x+z)\,\lambda^{\mathrm{Leb}}(dz)\Big|\le\|f\|_{V^r}\,V^r(x)\Big[\int_{A(x)}\Big(\frac{\pi(x+z)}{\pi(x)}\Big)^{-r\eta}\big|q(z)-q'(z)\big|\,\lambda^{\mathrm{Leb}}(dz)+\int_{R(x)}\Big(\frac{\pi(x+z)}{\pi(x)}\Big)^{1-r\eta}\big|q(z)-q'(z)\big|\,\lambda^{\mathrm{Leb}}(dz)\Big]\le\|f\|_{V^r}\,V^r(x)\int\big|q(z)-q'(z)\big|\,\lambda^{\mathrm{Leb}}(dz),$$
since both exponentiated ratios are bounded by one on the corresponding regions. We now conclude that for any $x\in\mathsf{X}$ and any $f\in\mathcal{L}_{V^r}$,
$$\big|P_q^{\mathrm{SRW}}f(x)-P_{q'}^{\mathrm{SRW}}f(x)\big|\le2\,\|f\|_{V^r}\,V^r(x)\int\big|q(z)-q'(z)\big|\,\lambda^{\mathrm{Leb}}(dz).$$

APPENDIX D: PROOF OF LEMMA 13

For notational simplicity, we write $q_\Gamma$ for $\mathcal{N}(0,\Gamma)$. We have
$$\int\big|q_\Gamma(z)-q_{\Gamma'}(z)\big|\,\lambda^{\mathrm{Leb}}(dz)=\int\Big|\int_0^1\frac{d}{dv}\,q_{\Gamma+v(\Gamma'-\Gamma)}(z)\,dv\Big|\,\lambda^{\mathrm{Leb}}(dz),$$
and let $\Gamma_v=\Gamma+v(\Gamma'-\Gamma)$, so that
$$\frac{d}{dv}\log q_{\Gamma_v}(z)=-\frac12\,\mathrm{Tr}\big[\Gamma_v^{-1}(\Gamma'-\Gamma)\big]+\frac12\,\mathrm{Tr}\big[\Gamma_v^{-1}zz^T\Gamma_v^{-1}(\Gamma'-\Gamma)\big],$$
and consequently
$$\int\Big|\int_0^1\frac{d}{dv}\,q_{\Gamma_v}(z)\,dv\Big|\,\lambda^{\mathrm{Leb}}(dz)\le|\Gamma'-\Gamma|\int_0^1\mathrm{Tr}\big[\Gamma_v^{-1}\big]\,dv\le n_x\,\lambda_{\min}^{-1}(\mathcal{K})\,|\Gamma'-\Gamma|,$$
where we have used the following inequality:
$$\mathrm{Tr}\big[\Gamma_v^{-1}zz^T\Gamma_v^{-1}(\Gamma'-\Gamma)\big]\le|\Gamma'-\Gamma|\,\mathrm{Tr}\big[\Gamma_v^{-1}\Gamma_v^{-1}zz^T\big].$$

APPENDIX E: PROOF OF PROPOSITIONS 16, 17 AND 18

Hereafter, for a scalar function $s$, $\nabla s$ is a column vector, and for a column-vector-valued function $v$ with scalar entries $v^1,v^2,\ldots$ we use the convention that $\nabla v$ is the matrix with $\nabla v^i$ as its $i$-th column.

Proof of Proposition 16. We first note that from Fisher's identity we have, for all $\xi\in\Xi$,
$$\nabla_\xi\log q_\xi(x)=\int_{\mathsf{Z}}\nabla_\xi\log f_\xi(x,z)\,\nu_\xi(x,z)\,\mu(dz)=\nabla_\xi\psi(\xi)+\nabla_\xi\phi(\xi)\,\nu_\xi(T)(x),$$
and from E1 we conclude that Eq. (65) holds. Eq. (66) is a direct consequence of (65).

Now we prove Eq. (67). With $\bar\psi\overset{\mathrm{def}}{=}\psi(\xi')-\psi(\xi)$ and $\bar\phi\overset{\mathrm{def}}{=}\phi(\xi')-\phi(\xi)$ for $\xi,\xi'\in\Xi$, and denoting by $f_v$ the density with parameters $(\psi(\xi)+v\bar\psi,\ \phi(\xi)+v\bar\phi)$ and by $q_v$ the corresponding marginal,
$$\int\big|q_{\xi'}(x)-q_\xi(x)\big|\,W(x)\,\lambda^{\mathrm{Leb}}(dx)=\int\Big|\int_0^1\int_{\mathsf{Z}}\big[\bar\psi+\bar\phi^T\,T(x,z)\big]\,f_v(x,z)\,\mu(dz)\,dv\Big|\,W(x)\,\lambda^{\mathrm{Leb}}(dx)\le\big(|\bar\psi|+|\bar\phi|\big)\sup_{0\le v\le1}\int\int_{\mathsf{Z}}\big(1+|T(x,z)|\big)\,f_v(x,z)\,W(x)\,\mu(dz)\,\lambda^{\mathrm{Leb}}(dx),$$
and we conclude from the assumptions on $W$, $\phi$ and $\psi$.

Proof of Proposition 17. We first note that from Fisher's identity we have, for all $\xi\in\Xi$,
$$\nabla_\xi\log q_\xi(x)=\int_{\mathsf{Z}}\nabla_\xi\log f_\xi(x,z)\,\nu_\xi(x,z)\,\mu(dz)=\nabla_\xi\psi(\xi)+\nabla_\xi\phi(\xi)\,\nu_\xi(T)(x).$$
From (65) and E1 we may differentiate under the integral sign to show that
$$\nabla_\xi\int\pi(x)\log q_\xi(x)\,\lambda^{\mathrm{Leb}}(dx)=\int\pi(x)\,\nabla_\xi\log q_\xi(x)\,\lambda^{\mathrm{Leb}}(dx)=\nabla_\xi\psi(\xi)+\nabla_\xi\phi(\xi)\,\pi\big(\nu_\xi(T)\big),$$
and thus, by the chain rule of derivation,
$$\nabla_\theta w(\theta)=\nabla_\theta\hat\xi(\theta)\,\big[\nabla_\xi\psi(\hat\xi(\theta))+\nabla_\xi\phi(\hat\xi(\theta))\,\pi\big(\nu_{\hat\xi(\theta)}(T)\big)\big].$$
For any $\theta\in\Theta$, $\hat\xi(\theta)$ is a stationary point of the mapping $\xi\mapsto L(\theta,\xi)$, and thus
$$\nabla_\xi L(\theta,\hat\xi(\theta))=\nabla_\xi\psi(\hat\xi(\theta))+\nabla_\xi\phi(\hat\xi(\theta))\,\theta=0.$$
Consequently, (63) implies that
$$\nabla_\theta w(\theta)=\nabla_\theta\hat\xi(\theta)\,\nabla_\xi\phi(\hat\xi(\theta))\,h(\theta).$$
We also notice that $\nabla_\theta\nabla_\xi L(\theta,\xi)=\nabla_\xi\phi(\xi)^T$. Differentiation with respect to $\theta$ of the mapping $\theta\mapsto\nabla_\xi L(\theta,\hat\xi(\theta))$ yields
$$\nabla_\xi\phi(\hat\xi(\theta))^T+\nabla_\theta\hat\xi(\theta)\,\nabla^2_\xi L(\theta,\hat\xi(\theta))=0.$$
We finally have
$$\big\langle\nabla_\theta w(\theta),h(\theta)\big\rangle=-h(\theta)^T\,\nabla_\theta\hat\xi(\theta)\,\nabla^2_\xi L(\theta,\hat\xi(\theta))\,\nabla_\theta\hat\xi(\theta)^T\,h(\theta),$$
which concludes the proof since, under E1, $\nabla^2_\xi L(\theta,\hat\xi(\theta))\le0$ for any $\theta\in\Theta$.

Proof of Proposition 18. For any $x\in\mathsf{X}$,
$$\big|H_\theta(x)-H_{\theta'}(x)\big|\le|T(x)|\int_{\mathsf{Z}}\big|\nu_{\hat\xi(\theta)}(x,z)-\nu_{\hat\xi(\theta')}(x,z)\big|\,\mu(dz)+|\theta-\theta'|.\qquad(77)$$
From Proposition 16, one has that for any compact set $\mathcal{K}\subset\Xi$ there exists a constant $C$ such that for all $(\xi,z)\in\mathcal{K}\times\mathsf{Z}$,
$$\big|\nabla_\xi\log f_\xi(x,z)\big|\le C\big(1+|T(x)|\big)\quad\text{and}\quad\big|\nabla_\xi\log q_\xi(x)\big|\le C\big(1+|T(x)|\big).$$
Thus
$$\big|\nabla_\xi\log\nu_\xi(x,z)\big|\le\big|\nabla_\xi\log f_\xi(x,z)\big|+\big|\nabla_\xi\log q_\xi(x)\big|\le2C\big(1+|T(x)|\big).$$
Hence, for all $\xi,\xi'\in\mathcal{K}$ and $z\in\mathsf{Z}$,
$$\big|\nu_\xi(x,z)-\nu_{\xi'}(x,z)\big|\le2C\big(1+|T(x)|\big)\,|\xi-\xi'|,$$
which together with Eq. (77) concludes the proof.
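The repeated use of Fisher's identity above can be checked numerically on the simplest missing-data model, a two-component Gaussian mixture: the identity states that the gradient of the log marginal density equals the posterior expectation of the complete-data gradient. The code below is a hedged illustration with our own toy parametrization, not the paper's family $\mathcal{E}$.

```python
import math

# Numerical check of Fisher's identity for a toy two-component Gaussian
# mixture (our example, not the paper's): the gradient of log q_xi(x) with
# respect to the component mean m equals the posterior expectation, under
# nu_xi(x, .), of the complete-data gradient d/dm log f_xi(x, z).


def normal_pdf(x, m):
    return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2.0 * math.pi)


def q(x, w, m):
    """Marginal density of the mixture w * N(m, 1) + (1 - w) * N(0, 1)."""
    return w * normal_pdf(x, m) + (1.0 - w) * normal_pdf(x, 0.0)


def grad_m_fisher(x, w, m):
    """Fisher identity: responsibility of component 1 times (x - m)."""
    r1 = w * normal_pdf(x, m) / q(x, w, m)  # posterior P(z = 1 | x)
    return r1 * (x - m)  # d/dm log f_xi(x, z) is (x - m) for z = 1, else 0


def grad_m_numeric(x, w, m, h=1e-6):
    """Central finite difference of log q_xi(x) in m, for comparison."""
    return (math.log(q(x, w, m + h)) - math.log(q(x, w, m - h))) / (2.0 * h)


x, w, m = 1.3, 0.4, 2.0
exact = grad_m_fisher(x, w, m)
numeric = grad_m_numeric(x, w, m)
```

The two values agree to finite-difference accuracy, which is exactly the first display in the proofs of Propositions 16 and 17 specialized to this toy family.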
APPENDIX F: PROOF OF PROPOSITIONS 19, 20

Proof of Proposition 19. The minorization condition is a classical result; see Mengersen and Tweedie (1996). Now notice that
$$P_q^{\mathrm{IMH}}V(x)=\int\alpha_q(x,y)\,V(y)\,q(y)\,\lambda^{\mathrm{Leb}}(dy)+V(x)\int\big[1-\alpha_q(x,y)\big]\,q(y)\,\lambda^{\mathrm{Leb}}(dy)\le\Big(1-\int\Big(\frac{q(x)}{\pi(x)}\wedge\frac{q(y)}{\pi(y)}\Big)\pi(y)\,\lambda^{\mathrm{Leb}}(dy)\Big)V(x)+qV,$$
where $\alpha_q$ is given in (55). The drift condition follows. From the definition of the transition probability, for any $f\in\mathcal{L}_V$,
$$\big|P_q^{\mathrm{IMH}}f(x)-P_{q'}^{\mathrm{IMH}}f(x)\big|\le\|f\|_V\Big\{\int\big|\alpha_q(x,y)q(y)-\alpha_{q'}(x,y)q'(y)\big|\,V(y)\,\lambda^{\mathrm{Leb}}(dy)+V(x)\int\big|\alpha_{q'}(x,y)q'(y)-\alpha_q(x,y)q(y)\big|\,\lambda^{\mathrm{Leb}}(dy)\Big\}\le2\,\|f\|_V\,V(x)\int\big|\alpha_q(x,y)q(y)-\alpha_{q'}(x,y)q'(y)\big|\,V(y)\,\lambda^{\mathrm{Leb}}(dy).$$
We therefore bound
$$I=\int\Big|\frac{q(y)}{\pi(y)}\wedge\frac{q(x)}{\pi(x)}-\frac{q'(y)}{\pi(y)}\wedge\frac{q'(x)}{\pi(x)}\Big|\,\pi(y)\,V(y)\,\lambda^{\mathrm{Leb}}(dy).$$
We introduce the following sets:
$$A_q(x)=\Big\{y:\ \frac{q(y)}{\pi(y)}\le\frac{q(x)}{\pi(x)}\Big\}\quad\text{and}\quad B_q(x)=\Big\{y:\ \frac{q(y)}{\pi(y)}\le\frac{q'(x)}{\pi(x)}\Big\},$$
and notice that the following inequalities hold:
$$y\in A_q^c(x)\cap A_{q'}^c(x)\ \Rightarrow\ \pi(y)<\frac{\pi(x)}{q(x)}\,q(y)\quad\text{and}\quad\pi(y)<\frac{\pi(x)}{q'(x)}\,q'(y),$$
and
$$y\in A_q^c(x)\cap B_q^c(x)\ \Rightarrow\ \pi(y)<\frac{\pi(x)}{q'(x)}\,q(y).\qquad(78)$$
We now decompose $I$ into four terms, $I\overset{\mathrm{def}}{=}\sum_{i=1}^4I_i$, where
$$I_1=\int_{A_q\cap A_{q'}}\Big|\frac{q(y)}{\pi(y)}-\frac{q'(y)}{\pi(y)}\Big|\,\pi(y)V(y)\,\lambda^{\mathrm{Leb}}(dy),\qquad I_2=\int_{A_q^c\cap A_{q'}^c}\Big|\frac{q(x)}{\pi(x)}-\frac{q'(x)}{\pi(x)}\Big|\,\pi(y)V(y)\,\lambda^{\mathrm{Leb}}(dy),$$
$$I_3=\int_{A_q\cap A_{q'}^c}\Big|\frac{q(y)}{\pi(y)}-\frac{q'(x)}{\pi(x)}\Big|\,\pi(y)V(y)\,\lambda^{\mathrm{Leb}}(dy),\qquad I_4=\int_{A_q^c\cap A_{q'}}\Big|\frac{q(x)}{\pi(x)}-\frac{q'(y)}{\pi(y)}\Big|\,\pi(y)V(y)\,\lambda^{\mathrm{Leb}}(dy),$$
where we have dropped $x$ in the set notation for simplicity. We now determine bounds for $I_i$, $i=2,3$. Notice that since $y\in A_q^c\cap A_{q'}^c$,
$$I_2\le\Big|1-\frac{q'(x)}{q(x)}\Big|\int_{A_q^c\cap A_{q'}^c}V(y)\,q(y)\,\lambda^{\mathrm{Leb}}(dy)\ \wedge\ \Big|1-\frac{q(x)}{q'(x)}\Big|\int_{A_q^c\cap A_{q'}^c}V(y)\,q'(y)\,\lambda^{\mathrm{Leb}}(dy),$$
and it can easily be checked that
$$\Big|1-\frac{q'(x)}{q(x)}\Big|\wedge\Big|1-\frac{q(x)}{q'(x)}\Big|\le\Big\|1-\frac{q'}{q}\Big\|_\infty\wedge\Big\|1-\frac{q}{q'}\Big\|_\infty.$$
The term $I_3$ can be bounded as follows:
$$I_3\le\int_{A_q\cap A_{q'}^c\cap B_q^c}\Big|\frac{q(x)}{\pi(x)}-\frac{q'(x)}{\pi(x)}\Big|\,\pi(y)V(y)\,\lambda^{\mathrm{Leb}}(dy)+\int_{A_q\cap A_{q'}^c\cap B_q}\Big(\frac{q'(y)}{\pi(y)}-\frac{q(y)}{\pi(y)}\Big)\,V(y)\,\pi(y)\,\lambda^{\mathrm{Leb}}(dy),$$
and using (78) we find that
$$I_3\le\Big|1-\frac{q'(x)}{q(x)}\Big|\int_{A_q\cap A_{q'}^c\cap B_q^c}q(y)\,V(y)\,\lambda^{\mathrm{Leb}}(dy)+\int_{A_q\cap A_{q'}^c\cap B_q}\big|q'(y)-q(y)\big|\,V(y)\,\lambda^{\mathrm{Leb}}(dy).$$
The bound for $I_4$ follows from that of $I_3$ by swapping $q$ and $q'$.

Proof of Proposition 20. The first claim is direct from Proposition 19 and the assumptions. Now denote
$$\Upsilon_{\xi,\xi',\alpha}(x)\overset{\mathrm{def}}{=}\frac{(1-\alpha)\,q_\xi(x)+\alpha\,\zeta(x)}{(1-\alpha)\,q_{\xi'}(x)+\alpha\,\zeta(x)}=1+\frac{q_\xi(x)-q_{\xi'}(x)}{\zeta(x)\big[q_{\xi'}(x)/\zeta(x)+\alpha/(1-\alpha)\big]}.$$
Therefore, from (65), for any convex compact set $\mathcal{K}\subset\Xi$ there exists $C<\infty$ such that for any $(\xi,\xi',x)\in\mathcal{K}^2\times\mathsf{X}$,
$$\big|1-\Upsilon_{\xi,\xi',\alpha}(x)\big|\le\frac{1-\alpha}{\alpha}\,\frac{\big|q_\xi(x)-q_{\xi'}(x)\big|}{\zeta(x)}\le C\,|\xi-\xi'|\,\frac{\sup_{\xi\in\mathcal{K}}q_\xi(x)\,\big(1+|T(x)|\big)}{\zeta(x)},$$
which with (71) implies that for all $\xi,\xi'\in\mathcal{K}$ and for $\lambda^{\mathrm{Leb}}$-almost all $x$ there exists $C<\infty$ such that
$$\big|1-\Upsilon_{\xi,\xi',\alpha}(x)\big|\vee\big|1-\Upsilon_{\xi',\xi,\alpha}(x)\big|\le C\,|\xi-\xi'|.$$
Now, as a direct consequence of Eq. (67), one can show that there exists $C$ such that for any $\xi,\xi'\in\mathcal{K}$ and $r\in[0,1]$,
$$\int\big|q_\xi(x)-q_{\xi'}(x)\big|\,W^r(x)\,\lambda^{\mathrm{Leb}}(dx)\le C\,|\xi-\xi'|.$$
The proof is concluded by application of Proposition 19.
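The drift and minorization estimates of Proposition 19 are easy to observe in simulation: with a proposal satisfying $q\ge\varepsilon\pi$ (here a Cauchy proposal for a standard normal target, so that $\inf q/\pi>0$), the IMH chain is uniformly ergodic and plain Monte Carlo averages settle quickly. A minimal sketch (ours, not the paper's code):

```python
import math
import random

# Minimal sketch of the setting of Proposition 19: independent
# Metropolis-Hastings with a Cauchy proposal q for a standard normal target
# pi. Since inf q/pi > 0, q belongs to some Q_{eps,pi}, and the whole space is
# a (1, eps)-small set, so the chain is uniformly ergodic.


def log_pi(x):
    return -0.5 * x * x  # target: standard normal, up to a constant


def log_q(x):
    return -math.log(1.0 + x * x)  # Cauchy proposal, up to a constant


def sample_q():
    return math.tan(math.pi * (random.random() - 0.5))  # inverse-CDF draw


def imh(n_steps, x0=0.0):
    x, out = x0, []
    for _ in range(n_steps):
        y = sample_q()
        # IMH acceptance ratio pi(y) q(x) / (pi(x) q(y)), in log form
        log_alpha = (log_pi(y) - log_pi(x)) + (log_q(x) - log_q(y))
        if math.log(random.random()) < min(0.0, log_alpha):
            x = y
        out.append(x)
    return out


random.seed(0)
chain = imh(20000)
mean = sum(chain) / len(chain)
var = sum((c - mean) ** 2 for c in chain) / len(chain)
```

The ergodic averages of the first and second moments approach those of the target, consistent with the geometric drift $P_q^{\mathrm{IMH}}V\le(1-\varepsilon)V+qV$ established above.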
REFERENCES

Andrieu, C. Discussion on the meeting on statistical approaches to inverse problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Andrieu, C. and Moulines, É. On the ergodicity property of some adaptive MCMC algorithms. Technical report, University of Bristol.
Andrieu, C., Moulines, É. and Priouret, P. Stability of stochastic approximation under verifiable conditions. SIAM Journal on Control and Optimization.
Andrieu, C. and Robert, C. Controlled MCMC for optimal sampling. Cahiers du Cérémade.
Arcidiacono, P. and Bailey Jones, J. Finite mixture distributions, sequential likelihood and the EM algorithm. Econometrica.
Atchadé, Y. and Rosenthal, J. On adaptive Markov chain Monte Carlo algorithms. Technical report.
Bartusek, J. Stochastic Approximation and Optimization of Markov Chains. Ph.D. thesis.
Baxendale, P. H. Renewal theory and computable convergence rates for geometrically ergodic Markov chains. Annals of Applied Probability.
Benveniste, A., Métivier, M. and Priouret, P. Adaptive Algorithms and Stochastic Approximations. Applications of Mathematics, Springer-Verlag, New York.
Birnbaum, Z. W. and Marshall, A. W. (1961). Some multivariate Chebyshev inequalities with extensions to continuous parameter processes. The Annals of Mathematical Statistics.
Chen, H., Guo, L. and Gao, A. Convergence and robustness of the Robbins-Monro algorithm truncated at randomly varying bounds. Stochastic Processes and their Applications.
Chen, H. and Zhu, Y.-M. Stochastic approximation procedures with randomly varying truncations. Scientia Sinica.
Duflo, M. Random Iterative Systems. Applications of Mathematics, Springer-Verlag.
Dvoretzky, A. Asymptotic normality for sums of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability (Univ. California, Berkeley, Calif., 1970/1971), Vol. II: Probability Theory. Univ. California Press, Berkeley, Calif.
Gåsemyr, J. On an adaptive version of the Metropolis-Hastings algorithm with independent proposal distribution. Scandinavian Journal of Statistics.
Gelman, A., Roberts, G. and Gilks, W. Efficient Metropolis jumping rules. In Bayesian Statistics (J. O. Berger, J. M. Bernardo, A. P. Dawid and A. F. M. Smith, eds.), vol. V. Oxford University Press.
Gelman, A. and Rubin, D. B. Inference from iterative simulation using multiple sequences. Statistical Science.
Glynn, P. W. and Meyn, S. P. A Liapounov bound for solutions of the Poisson equation. Annals of Probability.
Haario, H., Saksman, E. and Tamminen, J. (2001). An adaptive Metropolis algorithm. Bernoulli.
Hall, P. and Heyde, C. Martingale Limit Theory and its Application. Academic Press, New York, London.
Holden, L. Adaptive chains. Technical report, Norwegian Computing Center.
Jarner, S. and Hansen, E. (2000). Geometric ergodicity of Metropolis algorithms. Stochastic Processes and Their Applications.
Kushner, H. and Yin, G. Stochastic Approximation Algorithms and Applications. Applications of Mathematics, Springer-Verlag, New York.
McLeish, D. A maximal inequality and dependent strong laws. Annals of Probability.
Meng, X. L. and Van Dyk, D. The EM algorithm: an old folk song sung to a fast new tune. Journal of the Royal Statistical Society, Series B (Methodological).
Mengersen, K. and Tweedie, R. (1996). Rates of convergence of the Hastings and Metropolis algorithms. Annals of Statistics.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A. and Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics.
Meyn, S. and Tweedie, R. Markov Chains and Stochastic Stability. Communication and Control Engineering Series, Springer-Verlag, London.
Meyn, S. and Tweedie, R. Computable bounds for convergence rates of Markov chains. Annals of Applied Probability.
Nummelin, E. On the Poisson equation in the potential theory of a single kernel. Mathematica Scandinavica.
Roberts, G. and Tweedie, R. (1996). Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika.
Titterington, D., Smith, A. F. M. and Makov, U. Statistical Analysis of Finite Mixture Distributions. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics, John Wiley and Sons, Chichester.
Wu, C. On the convergence properties of the EM algorithm. Annals of Statistics.

School of Mathematics, University of Bristol, University Walk, BS8 1TW, UK. url: maxca
ENST, UMR, rue Barrault, Paris Cedex 13, France. url: moulines/
Lecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
THE CENTRAL LIMIT THEOREM TORONTO
THE CENTRAL LIMIT THEOREM DANIEL RÜDT UNIVERSITY OF TORONTO MARCH, 2010 Contents 1 Introduction 1 2 Mathematical Background 3 3 The Central Limit Theorem 4 4 Examples 4 4.1 Roulette......................................
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 50, NO. 5, MAY 2005 577. Least Mean Square Algorithms With Markov Regime-Switching Limit
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 50, NO. 5, MAY 2005 577 Least Mean Square Algorithms With Markov Regime-Switching Limit G. George Yin, Fellow, IEEE, and Vikram Krishnamurthy, Fellow, IEEE
Analysis of a Production/Inventory System with Multiple Retailers
Analysis of a Production/Inventory System with Multiple Retailers Ann M. Noblesse 1, Robert N. Boute 1,2, Marc R. Lambrecht 1, Benny Van Houdt 3 1 Research Center for Operations Management, University
A Bootstrap Metropolis-Hastings Algorithm for Bayesian Analysis of Big Data
A Bootstrap Metropolis-Hastings Algorithm for Bayesian Analysis of Big Data Faming Liang University of Florida August 9, 2015 Abstract MCMC methods have proven to be a very powerful tool for analyzing
Lectures on Stochastic Processes. William G. Faris
Lectures on Stochastic Processes William G. Faris November 8, 2001 2 Contents 1 Random walk 7 1.1 Symmetric simple random walk................... 7 1.2 Simple random walk......................... 9 1.3
Mathematical Methods of Engineering Analysis
Mathematical Methods of Engineering Analysis Erhan Çinlar Robert J. Vanderbei February 2, 2000 Contents Sets and Functions 1 1 Sets................................... 1 Subsets.............................
No: 10 04. Bilkent University. Monotonic Extension. Farhad Husseinov. Discussion Papers. Department of Economics
No: 10 04 Bilkent University Monotonic Extension Farhad Husseinov Discussion Papers Department of Economics The Discussion Papers of the Department of Economics are intended to make the initial results
Parameter estimation for nonlinear models: Numerical approaches to solving the inverse problem. Lecture 12 04/08/2008. Sven Zenker
Parameter estimation for nonlinear models: Numerical approaches to solving the inverse problem Lecture 12 04/08/2008 Sven Zenker Assignment no. 8 Correct setup of likelihood function One fixed set of observation
arxiv:1112.0829v1 [math.pr] 5 Dec 2011
How Not to Win a Million Dollars: A Counterexample to a Conjecture of L. Breiman Thomas P. Hayes arxiv:1112.0829v1 [math.pr] 5 Dec 2011 Abstract Consider a gambling game in which we are allowed to repeatedly
On the Traffic Capacity of Cellular Data Networks. 1 Introduction. T. Bonald 1,2, A. Proutière 1,2
On the Traffic Capacity of Cellular Data Networks T. Bonald 1,2, A. Proutière 1,2 1 France Telecom Division R&D, 38-40 rue du Général Leclerc, 92794 Issy-les-Moulineaux, France {thomas.bonald, alexandre.proutiere}@francetelecom.com
Notes on Factoring. MA 206 Kurt Bryan
The General Approach Notes on Factoring MA 26 Kurt Bryan Suppose I hand you n, a 2 digit integer and tell you that n is composite, with smallest prime factor around 5 digits. Finding a nontrivial factor
Gaussian Processes to Speed up Hamiltonian Monte Carlo
Gaussian Processes to Speed up Hamiltonian Monte Carlo Matthieu Lê Murray, Iain http://videolectures.net/mlss09uk_murray_mcmc/ Rasmussen, Carl Edward. "Gaussian processes to speed up hybrid Monte Carlo
Increasing for all. Convex for all. ( ) Increasing for all (remember that the log function is only defined for ). ( ) Concave for all.
1. Differentiation The first derivative of a function measures by how much changes in reaction to an infinitesimal shift in its argument. The largest the derivative (in absolute value), the faster is evolving.
A Coefficient of Variation for Skewed and Heavy-Tailed Insurance Losses. Michael R. Powers[ 1 ] Temple University and Tsinghua University
A Coefficient of Variation for Skewed and Heavy-Tailed Insurance Losses Michael R. Powers[ ] Temple University and Tsinghua University Thomas Y. Powers Yale University [June 2009] Abstract We propose a
Statistics Graduate Courses
Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.
Mathematics Course 111: Algebra I Part IV: Vector Spaces
Mathematics Course 111: Algebra I Part IV: Vector Spaces D. R. Wilkins Academic Year 1996-7 9 Vector Spaces A vector space over some field K is an algebraic structure consisting of a set V on which are
How To Know If A Domain Is Unique In An Octempo (Euclidean) Or Not (Ecl)
Subsets of Euclidean domains possessing a unique division algorithm Andrew D. Lewis 2009/03/16 Abstract Subsets of a Euclidean domain are characterised with the following objectives: (1) ensuring uniqueness
Convex analysis and profit/cost/support functions
CALIFORNIA INSTITUTE OF TECHNOLOGY Division of the Humanities and Social Sciences Convex analysis and profit/cost/support functions KC Border October 2004 Revised January 2009 Let A be a subset of R m
SUBGROUPS OF CYCLIC GROUPS. 1. Introduction In a group G, we denote the (cyclic) group of powers of some g G by
SUBGROUPS OF CYCLIC GROUPS KEITH CONRAD 1. Introduction In a group G, we denote the (cyclic) group of powers of some g G by g = {g k : k Z}. If G = g, then G itself is cyclic, with g as a generator. Examples
