Monte Carlo-based statistical methods (MASM11/FMS091)
Jimmy Olsson, Centre for Mathematical Sciences, Lund University, Sweden
Lecture 5: Sequential Monte Carlo methods I
February 5, 2013
J. Olsson, Monte Carlo-based statistical methods, L5
Plan of today's lecture
1 Variance reduction reconsidered
We are here
1 Variance reduction reconsidered
Last time: Variance reduction
Last time we discussed how to reduce the variance of the standard MC sampler by introducing correlation between the variables of the sample. More specifically, we used
1 a control variate Y such that E(Y) = m is known: Z = φ(X) + α(Y − m), where α was tuned optimally to α∗ = −C(φ(X), Y)/V(Y);
2 antithetic variables V and Ṽ such that E(V) = E(Ṽ) = τ and C(V, Ṽ) < 0: W = (V + Ṽ)/2.
Last time: Variance reduction
The following theorem turned out to be useful when constructing antithetic variables.
Theorem. Let V = ϕ(U), where ϕ : R → R is a monotone function. Moreover, assume that there exists a non-increasing transform T : R → R such that T(U) =d U. Then V = ϕ(U) and Ṽ = ϕ(T(U)) are identically distributed and
C(V, Ṽ) = C(ϕ(U), ϕ(T(U))) ≤ 0.
An important application of this theorem is the following: Let F be a distribution function. Then, letting U ∼ U(0, 1), T(u) = 1 − u, and ϕ = F⁻¹, we obtain, for X = F⁻¹(U) and X̃ = F⁻¹(1 − U), X̃ =d X (with distribution function F) and C(X, X̃) ≤ 0.
Last time: Variance reduction
τ = ∫₀^{π/2} exp(cos²(x)) dx, V = (π/2) exp(cos²(X)), Ṽ = (π/2) exp(sin²(X)), W = (V + Ṽ)/2.
[Figure: running estimates vs. sample size N (up to 1000); antithetic sampling fluctuates far less than standard MC. The comparison is at matched cost, N_V = 2 N_W.]
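The antithetic estimator above can be sketched in a few lines. This is a Python transcription (the course examples use MATLAB); the function name is mine, and the comment value of τ follows from τ = (π/2) e^{1/2} I₀(1/2).

```python
import math
import random

def antithetic_estimate(n_pairs, seed=0):
    """Estimate tau = integral of exp(cos^2 x) over (0, pi/2) with
    antithetic pairs. X ~ U(0, pi/2) and X~ = pi/2 - X are antithetic;
    since cos^2(pi/2 - x) = sin^2(x), the pair (V, V~) is negatively
    correlated and W = (V + V~)/2 has reduced variance."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        x = (math.pi / 2) * rng.random()
        v = (math.pi / 2) * math.exp(math.cos(x) ** 2)
        v_anti = (math.pi / 2) * math.exp(math.sin(x) ** 2)
        total += 0.5 * (v + v_anti)
    return total / n_pairs

# The true value is tau = (pi/2) * exp(1/2) * I0(1/2), approximately 2.754.
```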
Control variates reconsidered
A problem with the control variate approach is that the optimal α, i.e.
α∗ = −C(φ(X), Y)/V(Y),
is generally not known explicitly. Thus, it was suggested to
1 draw (X_i)_{i=1}^N,
2 draw (Y_i)_{i=1}^N,
3 estimate, via MC, α∗ using the drawn samples, and
4 use this estimate to optimally construct (Z_i)_{i=1}^N.
This yields a so-called batch estimator of α∗. However, this procedure is computationally somewhat complex.
An online approach to optimal control variates
The estimators
C_N := (1/N) ∑_{i=1}^N φ(X_i)(Y_i − m),
V_N := (1/N) ∑_{i=1}^N (Y_i − m)²
of C(φ(X), Y) and V(Y), respectively, can be implemented recursively according to
C_{l+1} = (l/(l+1)) C_l + (1/(l+1)) φ(X_{l+1})(Y_{l+1} − m)
and
V_{l+1} = (l/(l+1)) V_l + (1/(l+1)) (Y_{l+1} − m)²,
with C_0 = V_0 = 0.
An online approach to optimal control variates (cont.)
Inspired by this we set, for l = 0, 1, 2, ..., N − 1,
Z_{l+1} = φ(X_{l+1}) + α_l(Y_{l+1} − m),
τ_{l+1} = (l/(l+1)) τ_l + (1/(l+1)) Z_{l+1},   (∗)
where α_0 := 1, α_l := −C_l/V_l for l > 0, and τ_0 := 0, yielding an online estimator. One may then establish the following convergence results.
Theorem. Let τ_N be obtained through (∗). Then, as N → ∞,
(i) τ_N → τ (a.s.),
(ii) √N (τ_N − τ) →d N(0, σ∗²),
where σ∗² := V(φ(X)){1 − ρ(φ(X), Y)²} is the optimal variance.
Example: the tricky integral again
We estimate
τ = ∫₀^{π/2} exp(cos²(x)) dx = ∫₀^{π/2} (π/2) exp(cos²(x)) · (2/π) dx = E_f(φ(X)),
with φ(x) = (π/2) exp(cos²(x)) and f(x) = 2/π the density of U(0, π/2), using
Z = φ(X) + α∗(Y − m),
where Y = cos²(X) is a control variate with
m = E(Y) = ∫₀^{π/2} cos²(x) · (2/π) dx = {use integration by parts} = 1/2.
However, the optimal coefficient α∗ is not known explicitly.
Example: the tricky integral again
  cos2 = @(x) cos(x).^2;
  phi = @(x) (pi/2)*exp(cos2(x));
  m = 1/2;
  X = (pi/2)*rand;
  Y = cos2(X);
  c = phi(X)*(Y - m);
  v = (Y - m)^2;
  tau_CV = phi(X) + (Y - m);   % alpha_0 = 1
  alpha = -c/v;
  for k = 2:N,
    X = (pi/2)*rand;
    Y = cos2(X);
    Z = phi(X) + alpha*(Y - m);
    tau_CV = (k-1)*tau_CV/k + Z/k;
    c = (k-1)*c/k + phi(X)*(Y - m)/k;
    v = (k-1)*v/k + (Y - m)^2/k;
    alpha = -c/v;
  end
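The same adaptive scheme can be transcribed to Python as a self-contained sketch; the function name is mine, and the initial coefficient α_0 = 1 and m = 1/2 follow the example above.

```python
import math
import random

def online_cv_estimate(n, seed=0):
    """Online control-variate estimator for tau = integral of
    exp(cos^2 x) over (0, pi/2), with phi(X) = (pi/2) exp(cos^2 X),
    X ~ U(0, pi/2), and control variate Y = cos^2(X), m = 1/2.
    The coefficient alpha_l = -C_l/V_l is updated recursively
    alongside the running mean tau_l."""
    rng = random.Random(seed)
    m = 0.5
    c = v = tau = 0.0
    alpha = 1.0  # arbitrary initial coefficient before C, V estimates exist
    for l in range(n):
        x = (math.pi / 2) * rng.random()
        y = math.cos(x) ** 2
        phi = (math.pi / 2) * math.exp(y)
        z = phi + alpha * (y - m)
        tau += (z - tau) / (l + 1)           # running mean of Z
        c += (phi * (y - m) - c) / (l + 1)   # estimate of C(phi(X), Y)
        v += ((y - m) ** 2 - v) / (l + 1)    # estimate of V(Y)
        alpha = -c / v
    return tau
```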
Example: the tricky integral again
[Figure: running estimates vs. sample size N (up to 1500) for standard MC, the adaptive algorithm, and the batch algorithm; both control-variate algorithms fluctuate far less than standard MC.]
We will now (and for the coming two lectures) extend the principal goal of the course to the problem of estimating sequentially sequences (τ_n)_{n≥0} of expectations
τ_n = E_{f_n}(φ(X_{0:n})) = ∫ φ(x_{0:n}) f_n(x_{0:n}) dx_{0:n}
over spaces of increasing dimension, where again the densities (f_n)_{n≥0} are known up to normalizing constants only; i.e. for every n ≥ 0,
f_n(x_{0:n}) = z_n(x_{0:n}) / c_n,
where c_n is an unknown constant and z_n is a known positive function. As we will see, such sequences appear in many applications in statistics and numerical analysis.
Markov chains
A Markov chain on X ⊆ R^d is a family of random variables (= stochastic process) (X_k)_{k≥0} taking values in X such that
P(X_{k+1} ∈ A | X_0, X_1, ..., X_k) = P(X_{k+1} ∈ A | X_k)
for all A ⊆ X. We call the chain time homogeneous if the conditional distribution of X_{k+1} given X_k does not depend on k. The distribution of X_{k+1} given X_k = x determines completely the dynamics of the process, and the density q of this distribution is called the transition density of (X_k). Consequently,
P(X_{k+1} ∈ A | X_k = x_k) = ∫_A q(x_{k+1} | x_k) dx_{k+1}.
Markov chains (cont.)
The following theorem provides the joint density f_n(x_0, x_1, ..., x_n) of X_0, X_1, ..., X_n.
Theorem. Let (X_k) be Markov with initial distribution χ. Then, for n > 0,
f_n(x_0, x_1, ..., x_n) = χ(x_0) ∏_{k=0}^{n−1} q(x_{k+1} | x_k).
Corollary (Chapman–Kolmogorov equation). Let (X_k) be Markov. Then, for n > 1,
f_n(x_n | x_0) = ∫ ··· ∫ ( ∏_{k=0}^{n−1} q(x_{k+1} | x_k) ) dx_1 ··· dx_{n−1}.
Example: The AR(1) process
As a first example we consider a first-order autoregressive process (AR(1)) in R. Set X_0 = 0 and
X_{k+1} = αX_k + ε_{k+1},
where α is a constant and the variables (ε_k)_{k≥1} of the noise sequence are i.i.d. with density function f. In this case,
P(X_{k+1} ≤ x_{k+1} | X_k = x_k) = P(αX_k + ε_{k+1} ≤ x_{k+1} | X_k = x_k) = P(ε_{k+1} ≤ x_{k+1} − αx_k | X_k = x_k) = P(ε_{k+1} ≤ x_{k+1} − αx_k),
implying that
q(x_{k+1} | x_k) = ∂/∂x_{k+1} P(X_{k+1} ≤ x_{k+1} | X_k = x_k) = ∂/∂x_{k+1} P(ε_{k+1} ≤ x_{k+1} − αx_k) = f(x_{k+1} − αx_k).
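Such a chain is straightforward to simulate. Below is a Python sketch (function name and the choice of standard Gaussian noise are mine); with N(0, 1) noise, f above is the standard normal density and q(x' | x) = f(x' − αx).

```python
import random

def simulate_ar1(n, alpha, seed=0):
    """Simulate n steps of the AR(1) chain X_{k+1} = alpha*X_k + eps_{k+1},
    started at X_0 = 0, with i.i.d. standard Gaussian noise."""
    rng = random.Random(seed)
    x, path = 0.0, [0.0]
    for _ in range(n):
        x = alpha * x + rng.gauss(0.0, 1.0)
        path.append(x)
    return path
```

For |α| < 1 the chain has a stationary N(0, 1/(1 − α²)) distribution, so the sample variance of a long trajectory should be close to 1/(1 − α²).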
Simulation of rare events for Markov chains
Let (X_k) be a Markov chain on X = R and consider the rectangle
A = A_0 × A_1 × ··· × A_n,
where every A_l = (a_l, b_l) is an interval. Here A can be a possibly extreme event. Say that we wish to compute, sequentially as n increases, some expectation under the conditional distribution f_{n|A} of the states X_{0:n} = (X_0, X_1, ..., X_n) given X_{0:n} ∈ A, i.e.
τ_n = E_{f_n}(φ(X_{0:n}) | X_{0:n} ∈ A) = E_{f_{n|A}}(φ(X_{0:n})) = ∫_A φ(x_{0:n}) f_n(x_{0:n}) / P(X_{0:n} ∈ A) dx_{0:n},
where f_{n|A}(x_{0:n}) = z_n(x_{0:n})/c_n. Here the unknown probability c_n = P(X_{0:n} ∈ A) of the rare event A is often the quantity of interest.
Simulation of rare events for Markov chains (cont.)
As
c_n = P(X_{0:n} ∈ A) = ∫ 1_A(x_{0:n}) f_n(x_{0:n}) dx_{0:n},
a first naive approach could of course be to use standard MC and simply
1 simulate the Markov chain N times, yielding trajectories (X^i_{0:n})_{i=1}^N,
2 count the number N_A of trajectories that fall into A, and
3 estimate c_n using the MC estimator ĉ_n^N = N_A/N.
Problem: if c_n = 10⁻⁹ we may expect to produce a billion draws before obtaining a single draw belonging to A! As we will see, SMC methods solve the problem efficiently.
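The naive estimator of steps 1–3 can be sketched as follows for a Gaussian AR(1) chain with A_l = (a, b) for all l (the function name and these concrete choices are illustrative, not from the slides). The sketch works for moderate probabilities but, as noted above, is useless when c_n is genuinely small.

```python
import random

def naive_event_prob(n, alpha, a, b, n_traj, seed=0):
    """Naive MC estimate of c_n = P(X_{0:n} in A) for a Gaussian AR(1)
    chain started at X_0 = 0, where A = (a, b)^{n+1} requires every
    state of the trajectory to stay inside the interval (a, b)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_traj):
        x, inside = 0.0, a < 0.0 < b   # check X_0 = 0 as well
        for _ in range(n):
            x = alpha * x + rng.gauss(0.0, 1.0)
            if not (a < x < b):
                inside = False
                break
        hits += inside
    return hits / n_traj
```

As expected, the estimated probability decreases as the trajectory length n grows.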
Estimation in general hidden Markov models (HMMs)
A hidden Markov model (HMM) comprises two stochastic processes:
1 A Markov chain (X_k)_{k≥0} with transition density q:
X_{k+1} | X_k = x_k ∼ q(x_{k+1} | x_k).
The Markov chain is not seen by us (hidden) but partially observed through
2 an observation process (Y_k)_{k≥0} such that, conditionally on the chain (X_k)_{k≥0},
(i) the Y_k's are independent, with
(ii) the conditional distribution of each Y_k depending on the corresponding X_k only.
The density of the conditional distribution Y_k | (X_k)_{k≥0} =d Y_k | X_k will be denoted by g(y_k | x_k).
Estimation in general HMMs (cont.)
Graphically:
... Y_{k−1}  Y_k  Y_{k+1} ...   (Observations)
... X_{k−1} → X_k → X_{k+1} ...   (Markov chain)
Y_k | X_k = x_k ∼ g(y_k | x_k)   (Observation density)
X_{k+1} | X_k = x_k ∼ q(x_{k+1} | x_k)   (Transition density)
X_0 ∼ χ(x_0)   (Initial distribution)
Example HMM: A stochastic volatility model
The following dynamical system is used in financial economics (see e.g. Jacquier et al., 1994). Let
X_{k+1} = αX_k + σε_{k+1},
Y_k = β exp(X_k/2) ε̃_k,
where α ∈ (0, 1), σ > 0, and β > 0 are constants and (ε_k)_{k≥1} and (ε̃_k)_{k≥0} are sequences of i.i.d. standard normally distributed noise variables. In this model the values of the observation process (Y_k) are observed daily log-returns (from e.g. the Swedish OMXS30 index) and the hidden chain (X_k) is the unobserved log-volatility (modeled by a stationary AR(1) process). The strength of this model is that it allows for volatility clustering, a phenomenon that is often observed in real financial time series.
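Simulating this model is a two-line recursion per step. A Python sketch (function name is mine; drawing X_0 from the stationary distribution is an assumption, since the slide does not specify the initial distribution):

```python
import math
import random

def simulate_sv(n, alpha=0.975, sigma=0.16, beta=0.63, seed=0):
    """Simulate the stochastic volatility HMM: the hidden log-volatility X
    is a stationary AR(1) chain and the observed log-return is
    Y_k = beta * exp(X_k / 2) * eps_k with standard Gaussian eps_k.
    X_0 ~ N(0, sigma^2 / (1 - alpha^2)), the stationary distribution."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, sigma / math.sqrt(1.0 - alpha ** 2))
    xs, ys = [], []
    for _ in range(n):
        xs.append(x)
        ys.append(beta * math.exp(x / 2.0) * rng.gauss(0.0, 1.0))
        x = alpha * x + sigma * rng.gauss(0.0, 1.0)
    return xs, ys
```

Since the ε̃_k are mean-zero and independent of the chain, the simulated log-returns are uncorrelated with mean zero, while their conditional variance β² exp(X_k) varies over time, producing the clustering visible in the figures below.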
Example HMM: A stochastic volatility model
A realization of the model looks as follows (here α = 0.975, σ = 0.16, and β = 0.63).
[Figure: simulated log-returns and log-volatility process over 1000 days.]
Example HMM: A stochastic volatility model
Daily log-returns from the Swedish stock index OMXS30, from 2005-03-30 to 2007-03-30.
[Figure: observed log-returns y_k for k = 0, ..., 500.]
The smoothing distribution
When operating on HMMs, one is most often interested in the smoothing distribution f_n(x_{0:n} | y_{0:n}), i.e. the conditional distribution of a set X_{0:n} of hidden states given Y_{0:n} = y_{0:n}.
Theorem (Smoothing distribution).
f_n(x_{0:n} | y_{0:n}) = χ(x_0) g(y_0 | x_0) ∏_{k=1}^n g(y_k | x_k) q(x_k | x_{k−1}) / L_n(y_{0:n}),
where L_n(y_{0:n}) is the likelihood function given by
L_n(y_{0:n}) = density of the observations y_{0:n} = ∫ ··· ∫ χ(x_0) g(y_0 | x_0) ∏_{k=1}^n g(y_k | x_k) q(x_k | x_{k−1}) dx_0 ··· dx_n.
Estimation of smoothed expectations
Being a high-dimensional (say n ≈ 1000 or 10,000) integral over complicated integrands, L_n(y_{0:n}) is in general unknown. However, by writing
τ_n = E(φ(X_{0:n}) | Y_{0:n} = y_{0:n}) = ∫ ··· ∫ φ(x_{0:n}) f_n(x_{0:n} | y_{0:n}) dx_0 ··· dx_n = ∫ ··· ∫ φ(x_{0:n}) z_n(x_{0:n}) / c_n dx_0 ··· dx_n,
with
z_n(x_{0:n}) = χ(x_0) g(y_0 | x_0) ∏_{k=1}^n g(y_k | x_k) q(x_k | x_{k−1}),
c_n = L_n(y_{0:n}),
we may cast the problem of computing τ_n into the framework of self-normalized IS. In particular we would like to update the approximation sequentially in n as new data (Y_k) appear.
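To fix ideas, here is a minimal (non-sequential) self-normalized IS sketch in Python; the sequential version is the topic of the coming lectures. The toy target z(x) = exp(−x²/2) (a standard normal whose normalizing constant we pretend not to know) and the instrumental N(0, 2²) distribution are illustrative choices of mine, not from the slides.

```python
import math
import random

def snis_estimate(phi, n, seed=0):
    """Self-normalized IS: estimate E_f[phi(X)] for f(x) proportional to
    z(x) = exp(-x^2 / 2), drawing from the instrumental N(0, 2^2).
    The unknown normalizing constants of target and instrumental cancel
    in the ratio of weighted sums."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 2.0)
        # w proportional to z(x) / g(x); constant factors drop out below
        w = math.exp(-x * x / 2.0) / math.exp(-x * x / 8.0)
        num += w * phi(x)
        den += w
    return num / den
```

For instance, snis_estimate(lambda x: x * x, 200000) should return a value close to E_f[X²] = 1, even though c = ∫ z(x) dx was never used.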
Self-avoiding walks (SAWs)
Denote by N(x) the set of neighbors of a point x in Z². Let
S_n := {x_{0:n} ∈ (Z²)^{n+1} : x_0 = 0, |x_{k+1} − x_k| = 1, x_k ≠ x_l for 0 ≤ l < k ≤ n}
be the set of n-step self-avoiding walks in Z².
[Figure: a self-avoiding walk of length 50.]
Application of SAWs
In addition, let c_n = |S_n| = the number of possible SAWs of length n. SAWs are used in
- polymer science, for describing long chain polymers, with the self-avoidance condition modeling the excluded volume effect;
- statistical mechanics and the theory of critical phenomena in equilibrium.
However, computing c_n (and analyzing how c_n depends on n) is known to be a very challenging combinatorial problem!
An MC approach to SAWs
Trick: let f_n(x_{0:n}) be the uniform distribution on S_n:
f_n(x_{0:n}) = (1/c_n) 1_{S_n}(x_{0:n}), x_{0:n} ∈ (Z²)^{n+1},
with z_n(x_{0:n}) = 1_{S_n}(x_{0:n}). We may thus cast the problem of computing the number c_n (= the normalizing constant of f_n) into the framework of self-normalized IS based on some convenient instrumental distribution g_n on (Z²)^{n+1}. In addition, solving this problem for n = 1, 2, 3, ... calls for a sequential implementation of IS. This will be the topic of HA2!
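The simplest instrumental distribution is the uniform random walk itself: since there are 4^n walks of length n, c_n = 4^n P(uniform walk is self-avoiding). A Python sketch of this naive estimator (function name is mine); like the rare-event example earlier, the acceptance probability decays quickly in n, which is why HA2 uses a sequential implementation instead.

```python
import random

def saw_count_naive(n, n_walks, seed=0):
    """Naive MC estimate of c_n, the number of n-step self-avoiding
    walks in Z^2: draw uniform nearest-neighbor random walks from the
    origin and count the fraction with no repeated site, so that
    c_n is approximated by 4^n * (fraction self-avoiding)."""
    rng = random.Random(seed)
    steps = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    hits = 0
    for _ in range(n_walks):
        pos, visited, ok = (0, 0), {(0, 0)}, True
        for _ in range(n):
            dx, dy = rng.choice(steps)
            pos = (pos[0] + dx, pos[1] + dy)
            if pos in visited:
                ok = False
                break
            visited.add(pos)
        hits += ok
    return (4 ** n) * hits / n_walks
```

For small n the estimate can be checked against exact enumeration; e.g. c_5 = 284.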