Rice University STAT 631 / ELEC 639: Graphical Models
Laplace Approximation
Instructor: Dr. Volkan Cevher
Thursday, September 11, 2008
Scribe: Ryan Guerra
Reviewers: Beth Bower and Terrance Savitsky

Index Terms: Laplace Approximation, Approximate Inference, Taylor Series, Chi Distribution, Normal Distribution, Probability Density Function

When we left off with the Junction Tree Algorithm and the Max-Sum Algorithm last class, we had crafted messages to traverse a tree-structured graphical model in order to calculate marginal and joint distributions. We are now interested in finding p(z | x) when p(x | z) is given, as shown below.

Figure 1. Graph representing both the hidden (clear) variable z and the observed (shaded) variable x, with their conditional dependence indicated by the arrow z → x.

In this case, x is our observed variable and z is the hidden variable about which we wish to make some inference. While we would normally wish to make some sort of exact inference about p(z | x), this problem is often either impossible to solve or the required algorithm is intractable. The next few lectures will focus on deterministic approximations to a pdf, and then we will move on to stochastic approximations. The general hierarchy of approximation techniques is given here for reference.

Approximate Inference
  Deterministic Approximations:
    (1) Laplace (local)
    (2) Variational Bayes (global)
    (3) Expectation Propagation (both)
  Stochastic Approximations:
    (1) Metropolis-Hastings / Gibbs
    (2) SIS

1. Laplace Approximations to a PDF

1.1. Motivation for Representation. The idea here is that we wish to approximate any pdf, such as the one given below, with a nice, simple representation.

Figure 2. An example multi-modal distribution that we want to approximate.

The Laplace approximation is a method for using a Gaussian N(μ, σ²) to represent a given pdf. This is obviously more effective for a single-mode distribution (see footnote 1), as many popular distributions can be roughly represented with a Gaussian.

Footnote 1: A mode is a concentration of mass in a pdf. We could imagine a non-nice distribution, like a U-quadratic, that does not have its mass concentrated in a single area and would therefore be poorly represented by the Laplace approximation.

As an example of what we mean by "represent," consider that we have some function g(x) of a variable distributed according to the density function p(x), and we wish to get the expected value E{g(x)} through sampling:

E\{g(x)\} = \int g(x)\, p(x)\, dx

We wish to calculate this expected value by sampling discrete values from p(x), and thus get an estimate \hat{E}\{g(x)\} for E\{g(x)\} that can be calculated as:

\hat{E}\{g(x)\} = \frac{1}{L} \sum_{i=1}^{L} g(x_i)
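To make this estimator concrete, here is a minimal numerical sketch (added for illustration; it is not part of the original notes). It assumes an arbitrary test function g(x) = x² and takes p(x) to be a standard normal, so the exact value E{g(x)} = 1 is known:

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    return x**2          # arbitrary test function; E[x^2] = 1 under a standard normal

L = 100_000
x = rng.standard_normal(L)        # draws x_i ~ p(x) = N(0, 1)
E_hat = np.mean(g(x))             # (1/L) * sum_i g(x_i)

print(f"Monte Carlo estimate: {E_hat:.4f}  (exact: 1.0)")
```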

Figure 3. The original distribution we want to represent (blue), with its Gaussian approximation (red) obtained by using the Laplace approximation method. Note that we were only able to capture one of the original distribution's modes.

Often we will find that p(x) cannot be easily sampled, and we wish to find an alternative way to draw samples from p(x). This is the subject matter of Chapter 11 of Bishop [2], but suffice it to say that we can instead draw samples from another, nicer distribution q(x), where q(x) is some known pdf and q(x) ≥ 0:

\hat{E}\{g(x)\} = \sum_i g(x_i)\, p(x_i) = \sum_i g(x_i)\, \frac{p(x_i)}{q(x_i)}\, q(x_i)

where the ratio p(x_i)/q(x_i) acts as an importance weight when the samples x_i are drawn from q(x). So far we haven't said anything about the choice of q(x) we could use to represent our pdf, but we'd like to use something simple and computable, because as the dimensionality of the problem increases, the required computational memory increases dramatically. This is why we need approximations.
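Below is a minimal sketch of this idea (not from the original notes). The two-component Gaussian-mixture target p(x), the wide normal proposal q(x), and the test function g(x) = x are all assumptions made purely for illustration:

```python
import numpy as np
from scipy import stats

np.random.seed(0)

# Assumed target p(x): a two-component Gaussian mixture (illustrative only).
def p(x):
    return 0.7 * stats.norm.pdf(x, loc=0.0, scale=1.0) + 0.3 * stats.norm.pdf(x, loc=4.0, scale=0.5)

def g(x):
    return x                      # estimate the mean of p; exact value is 0.7*0 + 0.3*4 = 1.2

q = stats.norm(loc=1.0, scale=3.0)   # easy-to-sample proposal q(x)

L = 200_000
x = q.rvs(size=L)                 # x_i ~ q(x)
w = p(x) / q.pdf(x)               # importance weights p(x_i) / q(x_i)
E_hat = np.mean(g(x) * w)         # (1/L) * sum_i g(x_i) * p(x_i)/q(x_i)

print(f"importance-sampling estimate of E[g(x)]: {E_hat:.3f}  (exact: 1.200)")
```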

Thus far we have introduced the motivation behind approximation schemes, in particular the method of Laplace approximation. We will proceed by deriving the Laplace approximation using Taylor series expansions. Then we move to a paper by Tierney and Kadane [3] and describe the use of the Laplace approximation to estimate a posterior mean. We conclude with an example of approximating the Chi distribution with a Normal distribution and demonstrate the quality of the approximation graphically.

1.2. Derivation of the Laplace Approximation. Suppose we wish to approximate p(x) = f(x)/Z, with f(x) ≥ 0. Let's look at the Taylor series expansion of ln f(x).

Definition:

f(x) = f(x)\big|_{x=x_0} + \frac{f'(x)\big|_{x=x_0}}{1!}(x - x_0) + \frac{f''(x)\big|_{x=x_0}}{2!}(x - x_0)^2 + \dots + \frac{f^{(n)}(x)\big|_{x=x_0}}{n!}(x - x_0)^n + \dots

Applying this to ln f(x) gives

(1)  \ln f(x) = \ln f(x_0) + \underbrace{\frac{\partial \ln f(x)}{\partial x}\Big|_{x=x_0}(x - x_0)}_{\text{second term}} + \frac{1}{2}\,\frac{\partial^2 \ln f(x)}{\partial x^2}\Big|_{x=x_0}(x - x_0)^2 + \text{h.o.t.}

Let's assume that the higher-order terms are negligible and focus, for now, on the second term in (1):

(2)  \frac{\partial \ln f(x)}{\partial x}\Big|_{x=x_0}(x - x_0) = \underbrace{\frac{1}{f(x)}\,\frac{\partial f(x)}{\partial x}\Big|_{x=x_0}}_{(*)}(x - x_0)

We notice that (*) is zero at a local maximum of the pdf. If we find this local maximum and choose to expand our Taylor series around this point x_max, we ensure that the second term on the RHS of (1) is always zero. This is done by setting ∂f(x)/∂x equal to zero and solving for x_max, the local maximum of the pdf. Taking the first three terms of the Taylor series expansion around x_0 = x_max, (1) becomes:

(3)  \ln f(x) = \ln f(x_{max}) + \frac{1}{2}\,\frac{\partial^2 \ln f(x)}{\partial x^2}\Big|_{x=x_{max}}(x - x_{max})^2

(4)  e^{\ln f(x)} = \exp\left[\ln f(x_{max}) + \frac{1}{2}\,\frac{\partial^2 \ln f(x)}{\partial x^2}\Big|_{x=x_{max}}(x - x_{max})^2\right]

(5)  \int e^{\ln f(x)}\,dx = \underbrace{e^{\ln f(x_{max})}}_{\text{constant}} \int \exp\left[\underbrace{\frac{1}{2}\,\frac{\partial^2 \ln f(x)}{\partial x^2}\Big|_{x=x_{max}}}_{\text{constant}}(x - x_{max})^2\right] dx

where we exponentiated in (4) and integrated both sides in (5). We see that the RHS of (5) contains a bunch of constants and a single term inside the exponent that is quadratic in x. For the sake of simplicity, let L(x) = ln f(x). Then if we let σ² = −1/L''(x_max), we get a result that looks remarkably like a Gaussian!

(6)  \int e^{L(x)}\,dx \approx e^{L(x_{max})} \int \exp\left[-\frac{(x - x_{max})^2}{2\sigma^2}\right] dx

This is the result of the Laplace method for integrals, though it is cited with an additional term n and with the traditional notation x_max = \hat{x} in Tierney & Kadane [3] as:

(7)  \int e^{nL(x)}\,dx \approx e^{nL(\hat{x})} \int \exp\left[-\frac{n(x - \hat{x})^2}{2\sigma^2}\right] dx = \sqrt{2\pi}\,\sigma\, n^{-1/2}\, e^{nL(\hat{x})}
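As a quick numerical sanity check of (6) (added for illustration, with an arbitrarily chosen single-mode L(x); not part of the original notes), we can compare the exact integral of f(x) = e^{L(x)} against the Laplace estimate √(2πσ²) e^{L(x_max)}, where σ² = −1/L''(x_max):

```python
import numpy as np
from scipy import integrate, optimize

# Arbitrary single-mode log-density L(x) = ln f(x), chosen only for illustration.
def L(x):
    return -0.25 * x**4 - 0.5 * (x - 1.0)**2

def f(x):
    return np.exp(L(x))

# Find x_max by maximizing L(x) (the optimizer minimizes, so negate).
x_max = optimize.minimize_scalar(lambda x: -L(x)).x

# sigma^2 = -1 / L''(x_max), with L'' estimated by a central finite difference.
h = 1e-4
L_dd = (L(x_max + h) - 2.0 * L(x_max) + L(x_max - h)) / h**2
sigma2 = -1.0 / L_dd

exact, _ = integrate.quad(f, -np.inf, np.inf)               # left-hand side of (6)
laplace = np.sqrt(2.0 * np.pi * sigma2) * np.exp(L(x_max))  # right-hand side of (6)

print(f"exact integral   : {exact:.5f}")
print(f"Laplace estimate : {laplace:.5f}")
```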

5 (7) e nl(x) dx e nl(x ) exp [ n(x x ) σ ] dx = πσn 1 e nl(x ) 1.3. Application. To elaborate further on what (7) gets us, I m going to borrow from Tierney & Kadane s paper. Given a smooth, positive function g, we wish to approximate the posterior mean of g(x). (8) E n [g(x)] = E[g(x) Y (n) ] = P r[g, Y (n) ] P r[y (n) ] = g(x)e L(x) π(x)dx e L(x) π(x)dx Where L(x) is the log-likelihood function ln p(x), π(x) is the prior density, and Y (n) is the observed set of data. As we can see, the forms of the integrals in (8) are very similar to the forms seen in (7), and are then easily estimated by the Laplace approximation of a pdf. But you can read the paper if you re interested. We now have a step-by-step process for using the Laplace approximation to approximate a single-mode pdf with a Gaussian: (1) find a local maximum x max of the given pdf f(x) () calculate the variance σ = 1 f (x max) (3) approximate the pdf with p(x) N [x max, σ ] 1.4. Example: Chi distribution. p(x) = xk 1 e x z, z = k 1 Γ( k ), x > 0 Note that z is a normalization constant that doesn t depend on x. Apparently most books don t even bother with it, and we ll ignore it here for the sake of convenience. Remember that we re working with the log-likelihood here, so f(x) = ln p(x) f(x) = ln p(x) = ln x k 1 + ln e x ln p(x) = x x [ln x k 1 x = 1 x k 1 (k 1)xk x = k 1 x = 0 x x = k 1 Now that we ve found the local maximum x, we compute the variance σ = 1 f (x ) ]

1.4. Example: Chi distribution.

p(x) = \frac{x^{k-1}\, e^{-x^2/2}}{z}, \qquad z = 2^{k/2 - 1}\,\Gamma(k/2), \qquad x > 0

Note that z is a normalization constant that doesn't depend on x. Apparently most books don't even bother with it, and we'll ignore it here for the sake of convenience. Remember that we're working with the log-likelihood here, so f(x) = ln p(x):

f(x) = \ln p(x) = \ln x^{k-1} + \ln e^{-x^2/2} = \ln x^{k-1} - \frac{x^2}{2}

Setting the derivative to zero to find the local maximum:

\frac{\partial}{\partial x} \ln p(x) = \frac{1}{x^{k-1}}(k-1)x^{k-2} - x = \frac{k-1}{x} - x = 0 \;\Longrightarrow\; x^{*2} = k - 1 \;\Longrightarrow\; x^{*} = \sqrt{k-1}

Now that we've found the local maximum x*, we compute the variance σ² = −1/f''(x*):

f''(x^{*}) = \frac{\partial^2 f(x)}{\partial x^2}\Big|_{x=x^{*}} = -\frac{k-1}{x^{*2}} - 1 = -2 \;\Longrightarrow\; \sigma^2 = \frac{1}{2}

Now all we need is to create the Normal distribution:

\hat{p}(x) \approx \mathcal{N}\!\left[\sqrt{k-1},\; \tfrac{1}{2}\right]

Figure 4. A plot of four Chi distributions (solid) and the corresponding Normal approximations (dashed).
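As a quick check of this result (a sketch added for illustration, not in the original notes; it uses scipy.stats.chi for the exact Chi density), we can compare the Chi pdf with its Laplace approximation N(√(k−1), 1/2) for a few degrees of freedom, in the spirit of Figure 4:

```python
import numpy as np
from scipy import stats

# Laplace approximation to the Chi(k) density derived above: N(sqrt(k-1), 1/2).
for k in [2, 4, 8, 16]:
    x_star = np.sqrt(k - 1)                               # mode of the Chi(k) pdf
    approx = stats.norm(loc=x_star, scale=np.sqrt(0.5))
    exact = stats.chi(df=k)
    xs = np.linspace(max(x_star - 2.0, 0.05), x_star + 2.0, 5)
    err = np.max(np.abs(exact.pdf(xs) - approx.pdf(xs)))
    print(f"k={k:2d}: mode={x_star:.3f}, max |p(x) - p_hat(x)| near the mode = {err:.4f}")
```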

1.5. Application: A Medical Inference Problem. In the paper "Laplace's Method Approximations for Probabilistic Inference in Belief Networks with Continuous Variables" [1], the authors introduce a medical inference problem. The figure below shows the Bayesian graphical model for the problem. There are two different experimental treatments for a disease, and the goal is to estimate the posterior mean of the increase in one-year survival probability. In previous work, the authors ran a Monte Carlo simulation to sample from the posterior to determine the posterior mean. In this example, we compare the Laplace approximation to the posterior against the Monte Carlo sampling method. Below is a graph of the Monte Carlo sampling and the Laplace approximation superimposed.

Figure 5. The graphical model.

Figure 6. Laplace approximation vs. Monte Carlo sampling.

The Laplace approximation appears to approximate the posterior well. The authors note that the Monte Carlo method takes 20 times longer computationally than the Laplace approximation, making the Laplace approximation well suited to this example.

References

[1] Adriano Azevedo-Filho and Ross D. Shachter. Laplace's method approximations for probabilistic inference in belief networks with continuous variables. Uncertainty in Artificial Intelligence, 1994.

[2] Christopher Bishop. Pattern Recognition and Machine Learning. Springer, 2007.

[3] Luke Tierney and Joseph Kadane. Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81(393), 1986.