Monte Carlo Methods and Importance Sampling

Transcription

Lecture Notes for Stat 578C, Statistical Genetics
© Eric C. Anderson
20 October 1999 (subbin' for E. A. Thompson)

History and definition: The term "Monte Carlo" was apparently first used by Ulam and von Neumann as a Los Alamos code word for the stochastic simulations they applied to building better atomic bombs. Their methods, involving the laws of chance, were aptly named after the international gaming destination; the moniker stuck and soon after the War a wide range of sticky problems yielded to the new techniques. Despite the widespread use of the methods, and numerous descriptions of them in articles and monographs, it is virtually impossible to find a succinct definition of "Monte Carlo method" in the literature. Perhaps this is owing to the intuitive nature of the topic, which spawns many definitions by way of specific examples. Some authors prefer to use the term "stochastic simulation" for almost everything, reserving "Monte Carlo" only for Monte Carlo Integration and Monte Carlo Tests (cf. Ripley 1987). Others seem less concerned about blurring the distinction between simulation studies and Monte Carlo methods.

Be that as it may, a suitable definition can be good to have, if for nothing other than to avoid the awkwardness of trying to define the Monte Carlo method by appealing to a whole bevy of examples of it. Since I am (so Elizabeth claims!) unduly influenced by my advisor's ways of thinking, I like to define Monte Carlo in the spirit of definitions she has used before. In particular, I use:

Definition: Monte Carlo is the art of approximating an expectation by the sample mean of a function of simulated random variables.

We will find that this definition is broad enough to cover everything that has been called Monte Carlo, and yet makes clear its essence in very familiar terms: Monte Carlo is about invoking laws of large numbers to approximate expectations.
While most Monte Carlo simulations are done by computer today, there were many applications of Monte Carlo methods using coin-flipping, card-drawing, or needle-tossing (rather than computer-generated pseudo-random numbers) as early as the turn of the century, long before the name "Monte Carlo" arose.

In more mathematical terms: Consider a (possibly multidimensional) random variable $X$ having probability mass function or probability density function $f_X(x)$ which is greater than zero on a set of values $\mathcal{X}$. Then the expected value of a function $g$ of $X$ is

$$E(g(X)) = \sum_{x \in \mathcal{X}} g(x) f_X(x) \qquad (1)$$

if $X$ is discrete, and

$$E(g(X)) = \int_{x \in \mathcal{X}} g(x) f_X(x)\,dx \qquad (2)$$

if $X$ is continuous. Now, if we were to take an $n$-sample of $X$'s, $(x_1, \ldots, x_n)$, and we computed the mean of $g(x)$ over the sample, then we would have the Monte Carlo estimate

$$\tilde{g}_n(x) = \frac{1}{n} \sum_{i=1}^{n} g(x_i)$$

of $E(g(X))$. [1]

[1] This applies when the simulated variables are independent of one another, and might apply when they are correlated with one another (for example, if they are states visited by an ergodic Markov chain). For now we will just deal with independent simulated random variables, but all of this extends to samples from Markov chains via the weak law of large numbers for the number of passages through a recurrent state in an ergodic Markov chain (see Feller 1957). You will encounter this later when talking about MCMC.

© Eric C. Anderson. You may duplicate this freely for pedagogical purposes.

We could, alternatively, speak of the random variable

$$\tilde{g}_n(X) = \frac{1}{n} \sum_{i=1}^{n} g(X_i)$$

which we call the Monte Carlo estimator of $E(g(X))$. If $E(g(X))$ exists, then the weak law of large numbers tells us that for any arbitrarily small $\epsilon$,

$$\lim_{n \to \infty} P\left( \left| \tilde{g}_n(X) - E(g(X)) \right| \geq \epsilon \right) = 0.$$

This tells us that as $n$ gets large, there is small probability that $\tilde{g}_n(X)$ deviates much from $E(g(X))$. For our purposes, the strong law of large numbers says much the same thing, the important part being that so long as $n$ is large enough, $\tilde{g}_n(x)$ arising from a Monte Carlo experiment shall be close to $E(g(X))$, as desired. One other thing to note at this point is that $\tilde{g}_n(X)$ is unbiased for $E(g(X))$:

$$E(\tilde{g}_n(X)) = E\left( \frac{1}{n} \sum_{i=1}^{n} g(X_i) \right) = \frac{1}{n} \sum_{i=1}^{n} E(g(X_i)) = E(g(X)).$$

Making this useful: The preceding section comes to life and becomes useful when one realizes that very many quantities of interest may be cast as expectations. Most importantly for applications in statistical genetics, it is possible to express all probabilities, integrals, and summations as expectations:

Probabilities: Let $Y$ be a random variable. The probability that $Y$ takes on some value in a set $A$ can be expressed as an expectation using the indicator function:

$$P(Y \in A) = E(I_{\{A\}}(Y)) \qquad (3)$$

where $I_{\{A\}}(Y)$ is the indicator function that takes the value 1 when $Y \in A$ and 0 when $Y \notin A$.

Integrals: Consider a problem now which is completely deterministic: integrating a function $q(x)$ from $a$ to $b$ (as in high-school calculus). So we have $\int_a^b q(x)\,dx$. This can be expressed as an expectation with respect to a uniformly distributed, continuous random variable $U$ between $a$ and $b$. $U$ has density function $f_U(u) = 1/(b-a)$, so if we rewrite the integral we get

$$\int_a^b q(x)\,dx = (b-a) \int_a^b q(x) \frac{1}{b-a}\,dx = (b-a) \int_a^b q(x) f_U(x)\,dx = (b-a) E(q(U))$$

...voila!

Discrete Sums: The discrete version of the above is just the sum of a function $q(x)$ over the finitely many values of $x$ in a set $A$.
If we have a random variable $W$ which takes values in $A$ all with equal probability $p$ (so that $\sum_{w \in A} p = 1$), then the sum may be cast as the expectation

$$\sum_{x \in A} q(x) = \frac{1}{p} \sum_{x \in A} q(x)\,p = \frac{1}{p} E(q(W)).$$

The immediate consequence of this is that all probabilities, integrals, and summations can be approximated by the Monte Carlo method. A crucial thing to note, however, is that there is no restriction that says $U$ or $W$ above must have uniform distributions. This is just for easy illustration of the points above. We will explore this point more while considering importance sampling.
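The recipes above for integrals and probabilities can be sketched in a few lines of C. This is only an illustrative sketch under stated assumptions: the function names (`unif01`, `mc_integrate`, `mc_probability`, `square`), the seed, and the choice of a splitmix64 generator as the source of uniform pseudo-random numbers are not from the notes, and any other uniform generator could be substituted.

```c
#include <stdint.h>

/* Assumed helper (not from the notes): splitmix64 generator,
   returning a pseudo-random double in [0, 1). */
static uint64_t mc_state = 12345u;

static double unif01(void)
{
    uint64_t z = (mc_state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    z ^= z >> 31;
    return (double)(z >> 11) * (1.0 / 9007199254740992.0); /* divide by 2^53 */
}

/* Integral of q over [a, b], estimated as (b - a) times the sample
   mean of q(U) with U ~ Uniform(a, b). */
double mc_integrate(double (*q)(double), double a, double b, long n)
{
    double sum = 0.0;
    long i;
    for (i = 0; i < n; i++) {
        double u = a + (b - a) * unif01();
        sum += q(u);
    }
    return (b - a) * sum / (double)n;
}

/* P(Y in A), estimated as the sample mean of the indicator of A,
   here for Y ~ Uniform(0, 1) and A = [lo, hi). */
double mc_probability(double lo, double hi, long n)
{
    long hits = 0;
    long i;
    for (i = 0; i < n; i++) {
        double y = unif01();
        if (y >= lo && y < hi) {
            hits++;
        }
    }
    return (double)hits / (double)n;
}

/* Example integrand: q(x) = x^2, whose integral over [0, 1] is 1/3. */
double square(double x)
{
    return x * x;
}
```

With a large `n`, a call such as `mc_integrate(square, 0.0, 1.0, 200000)` should land close to 1/3, and `mc_probability(0.25, 0.75, 200000)` close to 0.5; how close depends on `n`, which is the subject of the variance discussion later in these notes.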

© Eric C. Anderson. Corrections, comments?

Example I: Approximating probabilities by Monte Carlo. Consider a Wright-Fisher population of size $N$ diploid individuals in which $X_t$ counts the number of copies of a certain allelic type in the population at time $t$. At time zero, there are $x_0$ copies of the allele. Given this, what is the probability that the allele will be lost from the population in $t$ generations, i.e., $P(X_t = 0 \mid X_0 = x_0)$? This can be computed exactly by multiplying transition probability matrices together, or by employing the Baum (1972) algorithm (which you will learn about later), but it can also be approximated by Monte Carlo. It is simple to simulate genetic drift in a Wright-Fisher population; thus we can easily simulate values for $X_t$ given $X_0 = x_0$. Then,

$$P(X_t = 0 \mid X_0 = x_0) = E(I_{\{0\}}(X_t) \mid X_0 = x_0)$$

where $I_{\{0\}}(X_t)$ takes the value 1 when $X_t = 0$ and 0 otherwise. Denoting the $i$th simulated value of $X_t$ by $x_t^{(i)}$, our Monte Carlo estimate would be

$$P(X_t = 0 \mid X_0 = x_0) \approx \frac{1}{n} \sum_{i=1}^{n} I_{\{0\}}(x_t^{(i)}).$$

Example II: Monte Carlo approximations to distributions. A simple extension of the above example is to approximate the whole probability distribution $P(X_t \mid X_0 = x_0)$ by Monte Carlo. Consider the histogram below:

[Figure: histogram. Horizontal axis: Number of Copies of Allele. Vertical axis: Monte Carlo Estimate of Probability.]

It represents the results of simulations in which $n = 50{,}000$, $x_0 = 60$, $t = 14$, the Wright-Fisher population size $N = 100$ diploids, and each rectangle represents a Monte Carlo approximation to $P(a \leq X_{14} < a+2 \mid X_0 = 60)$, $a = 0, 2, 4, \ldots, 200$. For each such probability, the approximation follows from

$$P(a \leq X_{14} < a+2 \mid X_0 = 60) = E(I_{\{a \leq X < a+2\}}(X_{14})) \approx \frac{1}{n} \sum_{i=1}^{n} I_{\{a \leq X < a+2\}}(x_{14}^{(i)}), \quad a = 0, 2, \ldots, 200.$$

Example III: A discrete sum over latent variables. In many applications in statistical genetics, the probability $P(Y)$ of an observed event $Y$ must be computed as the sum over very many latent variables $X$ of the joint probability $P(Y, X)$. In such a case, $Y$ is typically fixed, i.e., we have observed $Y = y$ so we are interested in $P(Y = y)$, but we can't observe the values of the latent variables, which may take values in the space $\mathcal{X}$.
Though it follows from the laws of probability that $P(Y = y) = \sum_{x \in \mathcal{X}} P(Y = y, X = x)$, quite often $\mathcal{X}$ is such a large space (contains so many elements) that it is impossible to compute the sum. Application of the law of conditional probability, however, gives

$$P(Y = y) = \sum_{x \in \mathcal{X}} P(Y = y, X = x) = \sum_{x \in \mathcal{X}} P(Y = y \mid X = x) P(X = x). \qquad (4)$$

The term following the last equals sign in (4) is the sum over all $x$ of a function of $x$ [namely, $P(Y = y \mid X = x)$], weighted by the marginal probabilities $P(X = x)$. Clearly this is an expectation, and therefore may be approximated by Monte Carlo, giving us

$$P(Y = y) \approx \frac{1}{n} \sum_{i=1}^{n} P(Y = y \mid X = x_i)$$

where $x_i$ is the $i$th realization from the marginal distribution of $X$. You will see this sort of thing many times again in Stat 578C.

OK, these examples have all been presented as if the application of Monte Carlo to practically any problem is a soporific and trivial exercise. However, nothing could be further from the truth! Though it is typically easy to formulate a quantity as an expectation and to propose a "naive" Monte Carlo estimator, it is quite another thing to actually have the Monte Carlo estimator provide you with good estimates in a reasonable amount of computer time. For most problems, a number of Monte Carlo estimators may be proposed; however, some Monte Carlo estimators are clearly better than others. Typically, a better Monte Carlo estimator has smaller variance (for the same amount of computational effort) than its competitors. Thus we turn to matters of...

Monte Carlo variance: Going back to our original notation, we have the random variable $\tilde{g}_n(X)$, a Monte Carlo estimator of $E(g(X))$. Like all random variables, we may compute its variance (if it exists) by the standard formulas:

$$\mathrm{Var}(\tilde{g}_n(X)) = \mathrm{Var}\left( \frac{1}{n} \sum_{i=1}^{n} g(X_i) \right) = \frac{1}{n} \mathrm{Var}(g(X)) = \frac{1}{n} \sum_{x \in \mathcal{X}} [g(x) - E(g(X))]^2 f_X(x) \qquad (5)$$

if $X$ is discrete, and

$$\mathrm{Var}(\tilde{g}_n(X)) = \mathrm{Var}\left( \frac{1}{n} \sum_{i=1}^{n} g(X_i) \right) = \frac{1}{n} \mathrm{Var}(g(X)) = \frac{1}{n} \int_{x \in \mathcal{X}} [g(x) - E(g(X))]^2 f_X(x)\,dx \qquad (6)$$

if $X$ is continuous. From here on out, let us do everything in terms of integrals over continuous variables, but it all applies equally well to sums over discrete random variables.

There are numerous ways to reduce the variance of Monte Carlo estimators. Of these "variance-reduction" techniques, the one called "importance sampling" is particularly useful.
I find that importance sampling is best introduced by describing its antithesis, which I call "irrelevance sampling" or "barely relevant sampling," which we will turn to after a short digression.

Digression 1: Estimating $\mathrm{Var}(\tilde{g}_n(X))$: Since, typically, $E(g(X))$ in (5) or (6) is unknown and the sum or the integral is not feasibly computed (that is why we would be doing Monte Carlo in the first place), the formulas in (5) and (6) are not useful for estimating the variance associated with your Monte Carlo estimate when you are actually doing Monte Carlo. Instead, just like approximating the variance from a sample à la our earliest statistics classes, we have an unbiased estimator for $\mathrm{Var}(g(X))$:

$$\widehat{\mathrm{Var}}(g(X)) = \frac{1}{n-1} \sum_{i=1}^{n} (g(x_i) - \tilde{g}_n(x))^2 \qquad (7)$$

(This is just the familiar $s^2$ from Statistics 101.) The unbiased estimate of the variance of $\tilde{g}_n(X)$ is $1/n$ of that:

$$\widehat{\mathrm{Var}}(\tilde{g}_n(X)) = \frac{1}{n} \widehat{\mathrm{Var}}(g(X)) = \frac{1}{n(n-1)} \sum_{i=1}^{n} (g(x_i) - \tilde{g}_n(x))^2. \qquad (8)$$
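As a small worked step (added here for bookkeeping, using only the symbols already defined), the sum of squared deviations appearing in (7) and (8) can be expanded so that it depends on the data only through the sample mean and the running total of squares:

```latex
\begin{aligned}
\sum_{i=1}^{n} \bigl(g(x_i) - \tilde{g}_n(x)\bigr)^2
  &= \sum_{i=1}^{n} g(x_i)^2
     - 2\,\tilde{g}_n(x) \sum_{i=1}^{n} g(x_i)
     + n\,\tilde{g}_n(x)^2 \\
  &= \sum_{i=1}^{n} g(x_i)^2 - n\,\tilde{g}_n(x)^2 ,
\end{aligned}
\qquad \text{since } \sum_{i=1}^{n} g(x_i) = n\,\tilde{g}_n(x).
```

Dividing the last line by $n(n-1)$ reproduces the estimate in (8) without any need to store the individual values $g(x_i)$.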

The form given in (8) is not particularly satisfying if one does not want to wait until the end of the simulation (until $n$ is reached) to compute the variance. To this end, the following formulas are extremely useful: The mean can be computed on the fly, recursively, by:

$$\tilde{g}_{n+1}(x) = \frac{1}{n+1} \left( n\,\tilde{g}_n(x) + g(x_{n+1}) \right). \qquad (9)$$

If we also expend the effort to record the sum of the squares of the $g(x)$'s, $SS_g = \sum_{i=1}^{n} [g(x_i)]^2$, then a simple calculation gives $\widehat{\mathrm{Var}}(\tilde{g}_n(X))$:

$$\widehat{\mathrm{Var}}(\tilde{g}_n(X)) = \frac{1}{n-1} \left( \frac{SS_g}{n} - (\tilde{g}_n(x))^2 \right). \qquad (10)$$

When programming in a language like C, using zero-base subscripting, I often confuse myself trying to implement the above recursion and formulas. So, primarily for my own benefit (though this may be a useful reference for you if you program in C or C++), I include a code snippet for implementing the above:

    // variable declarations
    double i;               // the index for counting number of reps (declaring as double
                            // avoids having to cast it to a double all the time)
    double n;               // the total number of reps; assumed to be set before the loop
    double gx;              // the values that will be realized
    double mean_gx;         // the Monte Carlo average that we will accumulate (our Monte Carlo estimate)
    double ss_gx;           // the sum of squares of gx
    double var_of_mean_gx;  // the estimate of our Monte Carlo variance

    // variable initializations:
    mean_gx = 0.0;
    ss_gx = 0.0;

    // loop over the index i. these are repetitions in the simulation:
    for(i=0.0;i<n;i+=1.0) {
        // realize_gx() is a hypothetical placeholder for
        // "the value for gx realized on this repetition"
        gx = realize_gx();
        mean_gx += (gx - mean_gx)/(i+1.0);  // the current Monte Carlo estimate
        ss_gx += gx * gx;                   // the current sum of squares
        if(i>0.0) {
            var_of_mean_gx = (ss_gx - (i+1.0)*(mean_gx * mean_gx) ) / (i * (i+1.0) );
            // above is the current estimate of the variance of our Monte Carlo estimator
        }
    }

Barely relevant sampling: Back to the task at hand. To introduce importance sampling we consider its opposite. Imagine that we want a Monte Carlo approximation to $\int_0^1 g(x)\,dx$ for $g(x)$ shown in the figure below. Note that $g(x) = 0$ for $x < 0$ and $x > 1$.

[Figure: the curve $g(x)$ plotted together with the density of a Uniform(0, 1) r.v. and the density of a Uniform(0, 5) r.v.]

If we have $U \sim \mathrm{Uniform}(0, 1)$, then we can cast the integral as the expectation with respect to $U$:

$$\int_0^1 g(x)\,dx = E(g(U)),$$

so we may approximate it by Monte Carlo:

$$\int_0^1 g(x)\,dx \approx \frac{1}{n} \sum_{i=1}^{n} g(u_i).$$

This would work reasonably well. The figure, however, suggests another possibility. One could use $W \sim \mathrm{Uniform}(0, 5)$, giving

$$\int_0^1 g(x)\,dx = 5\,E(g(W))$$

and hence the Monte Carlo estimator $\frac{5}{n} \sum_{i=1}^{n} g(w_i)$. Obviously such a course would make no sense at all because, on average, 80% of the realized $w_i$'s would tell you nothing substantial about the integral of $g(x)$, since $g(x) = 0$ for $1 < x < 5$. This would be barely relevant sampling, and no one in their right mind would willingly do it. It does make clear that one's choice of distribution from which to draw their random variables will affect the quality of their Monte Carlo estimator.

Importance sampling: Importance sampling is choosing a good distribution from which to simulate one's random variables. It involves multiplying the integrand by 1 (usually dressed up in a "tricky fashion") to yield an expectation of a quantity that varies less than the original integrand over the region of integration. For example, let $h(x)$ be a density for the random variable $X$ which takes values only in $A$, so that $\int_{x \in A} h(x)\,dx = 1$. Then $h(x)/h(x) = 1$ and

$$\int_{x \in A} g(x)\,dx = \int_{x \in A} g(x) \frac{h(x)}{h(x)}\,dx = \int_{x \in A} \frac{g(x)}{h(x)} h(x)\,dx = E_h\left( \frac{g(X)}{h(X)} \right), \qquad (11)$$

so long as $h(x) \neq 0$ for any $x \in A$ for which $g(x) \neq 0$, and where $E_h$ denotes the expectation with respect to the density $h$. This gives a Monte Carlo estimator:

$$\tilde{g}_n^h(X) = \frac{1}{n} \sum_{i=1}^{n} \frac{g(X_i)}{h(X_i)} \quad \text{where } X_i \sim h(x). \qquad (12)$$

Using (6) and the Cauchy-Schwarz inequality, it can be shown that $\mathrm{Var}(\tilde{g}_n^h(X))$ is minimized when $h(x) \propto |g(x)|$ (see Rubinstein 1981, p. 123). If we restrict our attention to what for most of our purposes is the truly relevant case, [2] that is, $g(x) \geq 0 \;\forall x \in A$, then it is immediately apparent that the choice of the density $h(x)$ which minimizes Monte Carlo variance is proportional to $g(x)$, i.e., if $\alpha h(x) = g(x)$ where $\alpha$ is some constant of proportionality, then clearly we have $g(x)/h(x) = \alpha \;\forall \{x : h(x) > 0\}$, so $E_h(g(X)/h(X)) = \alpha$ and hence the Monte Carlo variance would be zero by (6). Wonderful! All we need to do to have a Monte Carlo estimator with zero variance is use (12) and make sure that our density $h$ is proportional to the function $g$.
The absurdity of this wishful thinking is that the ability to simulate independent random variables from $h(x)$, or the ability to compute the density $h(x)$ itself, implies that the normalizing constant of the distribution is computable, which in turn would imply that the original integral involving $g(x)$ is computable, and there would hence be no reason to do Monte Carlo at all! Ultimately, however, it makes clear that a good importance sampling function (as $h$ is called) will be one that is as close as possible to being proportional to $g(x)$, a point made by the following contrived example.

Example V: A contrived demonstration. Let us approximate by Monte Carlo the area under a Normal(0, 1) density curve from -5 to 5. This quantity will, of course, be extremely close to 1 (and we may as well call it 1). This is contrived because no one would ever do this in practice... Nonetheless, we will use a series of importance sampling functions: (a) a Uniform(-5, 5) density, (b) a Cauchy density (a $t$ random variable on 1 df) truncated at -5 and 5, and (c) a truncated $t$ random variable on 3 df. Figure 1 shows each importance sampling function as a dashed curve next to the normal curve. Below each of these is the histogram of 5,000 Monte Carlo estimates (using $n$ = 1,000) of the area under the normal curve. As is clear from the progression from (a) to (c), the Monte Carlo estimates are less variable when the importance sampling function is closer to the shape of the normal density. (Note: the important feature is that the shape of the curves is closer. Obviously $g(x)$ and $h(x)$ will, in general, not be similar in height.)

[2] This is typically the relevant case because we are interested in non-negative quantities like probabilities.

[Figure 1 panels: (a) Uniform, (b) $t_1$ Dist., (c) $t_3$ Dist. Histogram axes: "Estimated Area" (horizontal), "# out of 5,000" (vertical).]

Figure 1: Three different importance sampling functions (dotted lines) used to integrate the standard normal density (solid line) from -5 to 5. Top panels are the density curves and bottom panels are histograms of 5,000 Monte Carlo estimates of the area (which is exactly 1) using $n$ = 1,000.

In summary, a good importance sampling function $h(x)$ should have the following properties:

1. $h(x) > 0$ whenever $g(x) \neq 0$
2. $h(x)$ should be close to being proportional to $|g(x)|$
3. it should be easy to simulate values from $h(x)$
4. it should be easy to compute the density $h(x)$ for any value $x$ that you might realize.

Fulfilling this wish-list in high dimensional space (where Monte Carlo techniques are most useful) is quite often a tall task, and can provide hours of entertainment, not to mention dissertation chapters, etc.

Note also that $g(x)$ is any arbitrary function, so it certainly includes the integrand of a standard expectation. For example, with $X \sim f_X$ we might be interested in $E(r(X))$ for some function $r$, so we could use

$$E(r(X)) = \int r(x) f_X(x)\,dx = \int \frac{r(x) f_X(x)}{h(x)} h(x)\,dx = E_h\left( \frac{r(X) f_X(X)}{h(X)} \right)$$

and then go searching about for a suitable $h(x)$ that is close to proportional to $r(x) f_X(x)$.

Example VI: Latent variables and importance sampling. Going back to Example III with the discrete sum over latent variables $X$, it is clear that the optimal importance sampling function would be the conditional distribution of $X$ given $Y$, i.e.,

$$P(Y = y) = \sum_{x \in \mathcal{X}} P(Y = y, X = x) = \sum_{x \in \mathcal{X}} \frac{P(Y = y, X = x)}{P(X = x \mid Y = y)} P(X = x \mid Y = y).$$

Note that the right side is a conditional expectation of a function of $X$. As before, $P(X \mid Y)$ is not computable. So one must turn to finding some other distribution, say $P^*(X)$, that is close to $P(X \mid Y)$ but which is more easily sampled from and computed.

A common pitfall of importance sampling: As a final word on importance sampling, it should be pointed out that the tails of the distributions matter! While $h(x)$ might be roughly the same shape as $g(x)$, serious difficulties arise if $h(x)$ gets small much faster than $g(x)$ out in the tails. In such a case, though it is improbable (by definition) that you will realize a value $x_i$ from the far tails of $h(x)$, if you do, then your Monte Carlo estimator will take a jolt: $g(x_i)/h(x_i)$ for such an improbable $x_i$ may be orders of magnitude larger than the typical values $g(x)/h(x)$ that you see. As an example of this phenomenon, investigate Figure 2, which shows the histogram of 5,000 Monte Carlo estimates of the area between -5 and 5 of a $t_1$ density (truncated at -5 and 5 and renormalized so the exact area is 1). The importance sampling function used for this was a standard, unit normal density, which obviously gets small in the tails much faster than a Cauchy (see Figure 1(b)). Note in particular that about 5 of the 5,000 Monte Carlo estimates were greater than 1.5!

[Figure 2 axes: "Estimated Area" (horizontal), "# out of 5,000" (vertical).]

Figure 2: Histogram of 5,000 Monte Carlo estimates of the area under a truncated $t$ distribution with one df using the standard normal density as the importance sampling function. The true area is 1. Note the several very high values (> 1.5).

Further reading: A classic reference on Monte Carlo is Hammersley and Handscomb (1964). They describe several other variance-reduction techniques that you might find interesting.

References

Baum, L. E., 1972 An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. In O. Shisha (Ed.), Inequalities III: Proceedings of the Third Symposium on Inequalities Held at the University of California, Los Angeles, September 1-9, 1969, pp. 1-8. New York: Academic Press.

Feller, W., 1957 An Introduction to Probability Theory and Its Applications, 2nd Edition. New York: John Wiley & Sons.

Hammersley, J. M. and D. C. Handscomb, 1964 Monte Carlo Methods. London: Methuen & Co Ltd.

Ripley, B. D., 1987 Stochastic Simulation. New York: Wiley & Sons.

Rubinstein, R. Y., 1981 Simulation and the Monte Carlo Method. New York: Wiley & Sons.