Variational Mean Field for Graphical Models

Similar documents

Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Introduction to Deep Learning Variational Inference, Mean Field Theory

STA 4273H: Statistical Machine Learning

The Basics of Graphical Models

Approximating the Partition Function by Deleting and then Correcting for Model Edges

Course: Model, Learning, and Inference: Lecture 5

Statistical Machine Learning

Dirichlet Processes A gentle tutorial

Deterministic Sampling-based Switching Kalman Filtering for Vehicle Tracking

Master s thesis tutorial: part III

Exponential Random Graph Models for Social Network Analysis. Danny Wyatt 590AI March 6, 2009

Linear Threshold Units

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University

Basics of Statistical Machine Learning

Structured Learning and Prediction in Computer Vision. Contents

Graphical Models, Exponential Families, and Variational Inference

Load Balancing and Switch Scheduling

Discrete Frobenius-Perron Tracking

Stock Option Pricing Using Bayes Filters

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

HT2015: SC4 Statistical Data Mining and Machine Learning

Bayesian networks - Time-series models - Apache Spark & Scala

3. The Junction Tree Algorithms

Semi-Supervised Support Vector Machines and Application to Spam Filtering

Random graphs with a given degree sequence

Distributed Structured Prediction for Big Data

Cell Phone based Activity Detection using Markov Logic Network

Statistics Graduate Courses

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

Social Media Mining. Data Mining Essentials

Adaptive Online Gradient Descent

Introduction to Markov Chain Monte Carlo

Lecture 3: Linear methods for classification

Pricing and calibration in local volatility models via fast quantization

THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE. Alexander Barvinok

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Section 5. Stan for Big Data. Bob Carpenter. Columbia University

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu

Tutorial on Markov Chain Monte Carlo

Model-based Synthesis. Tony O Hagan

3. Linear Programming and Polyhedral Combinatorics

2.3 Convex Constrained Optimization Problems

Message-passing sequential detection of multiple change points in networks

NEURAL NETWORKS A Comprehensive Foundation

Towards running complex models on big data

Finding the M Most Probable Configurations Using Loopy Belief Propagation

MAP-Inference for Highly-Connected Graphs with DC-Programming

Monte Carlo and Empirical Methods for Stochastic Inference (MASM11/FMS091)

Christfried Webers. Canberra February June 2015

Predict Influencers in the Social Network

Linear Programming. Widget Factory Example. Linear Programming: Standard Form. Widget Factory Example: Continued.

Follow the Perturbed Leader

A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Lecture 6: Logistic Regression

Message-passing for graph-structured linear programs: Proximal methods and rounding schemes

Lecture 3. Linear Programming. 3B1B Optimization Michaelmas 2015 A. Zisserman. Extreme solutions. Simplex method. Interior point method

Introduction to Monte Carlo. Astro 542 Princeton University Shirley Ho

Learning Gaussian process models from big data. Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu

Statistical machine learning, high dimension and big data

Bayesian Statistics: Indian Buffet Process

Monte Carlo-based statistical methods (MASM11/FMS091)

A Stochastic 3MG Algorithm with Application to 2D Filter Identification

An Introduction to Machine Learning

Lecture 8 February 4

Parallelization Strategies for Multicore Data Analysis

Unsupervised Learning

Generating Random Numbers Variance Reduction Quasi-Monte Carlo. Simulation Methods. Leonid Kogan. MIT, Sloan , Fall 2010

Introduction to Mobile Robotics Bayes Filter Particle Filter and Monte Carlo Localization

Monte Carlo Simulation

Coding and decoding with convolutional codes. The Viterbi Algor

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

Several Views of Support Vector Machines

Gaussian Processes to Speed up Hamiltonian Monte Carlo

Classification Problems

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

Perron vector Optimization applied to search engines

Linear Classification. Volker Tresp Summer 2015

Solving NP Hard problems in practice lessons from Computer Vision and Computational Biology

Probabilistic Latent Semantic Analysis (plsa)

Two-Stage Stochastic Linear Programs

Monte Carlo-based statistical methods (MASM11/FMS091)

Machine Learning and Pattern Recognition Logistic Regression

Big Data, Machine Learning, Causal Models

Bayesian Statistics in One Hour. Patrick Lam

Statistical Machine Learning from Data

Inference on Phase-type Models via MCMC

Proximal mapping via network optimization

Compression algorithm for Bayesian network modeling of binary systems

BayesX - Software for Bayesian Inference in Structured Additive Regression

The Advantages and Disadvantages of Online Linear Optimization

Inference of Probability Distributions for Trust and Security applications

Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics

Lecture 9: Introduction to Pattern Analysis

Transcription:

Variational Mean Field for Graphical Models CS/CNS/EE 155 Baback Moghaddam Machine Learning Group baback @ jpl.nasa.gov

Approximate Inference Consider general UGs (i.e., not tree-structured) All basic computations are intractable (for large G) - likelihoods & partition function - marginals & conditionals - finding modes

Taxonomy of Inference Methods Inference Exact VE JT BP Stochastic Gibbs, M-H (MC) MC, SA Approximate Deterministic Cluster ~MP LBP EP Variational

Approximate Inference Stochastic (Sampling) - Metropolis-Hastings, Gibbs, (Markov Chain) Monte Carlo, etc - Computationally expensive, but is exact (in the limit) Deterministic (Optimization) - Mean Field (MF), Loopy Belief Propagation (LBP) - Variational Bayes (VB), Expectation Propagation (EP) - Computationally cheaper, but is not exact (gives bounds)

General idea Mean Field : Overview - approximate p(x) by a simpler factored distribution q(x) - minimize distance D(p q) - e.g., Kullback-Liebler original G (Naïve) MF H 0 structured MF H s p ( x) φ ( ) c c x c q ( x) q i ( x i ) i q( x) qa( xa) qb( xb)

Mean Field : Overview Naïve MF has roots in Statistical Mechanics (1890s) - physics of spin glasses (Ising), ferromagnetism, etc - why is it called Mean Field? with full factorization : E[x i x j ] = E[x i ] E[x j ] Structured MF is more modern Coupled HMM Structured MF approximation (with tractable chains)

KL Projection D(Q P) Infer hidden h given visible v (clamp v nodes with δ s) Variational : optimize KL globally the right density form for Q falls out KL is easier since we re taking E[.] wrt simpler Q Q seeks mode with the largest mass (not height) so it will tend to underestimate the support of P P = 0 forces Q = 0

KL Projection D(P Q) Infer hidden h given visible v (clamp v nodes with δ s) Expectation Propagation (EP) : optimize KL locally this KL is harder since we re taking E[.] wrt P no nice global solution for Q falls out must sequentially tweak each q c (match moments) Q covers all modes so it overestimates support P > 0 forces Q > 0

α - divergences The 2 basic KL divergences are special cases of D α (p q) is non-negative and 0 iff p = q when α - 1 we get KL(P Q) when α +1 we get KL(Q P) when α = 0 D 0 (P Q) is proportional to Hellinger s distance (metric) So many variational approximations must exist, one for each α!

for more on α - divergences Shun-ichi Amari

for specific examples of α = ±1 See Chapter 10 Variational Single Gaussian Variational Linear Regression Variational Mixture of Gaussians Variational Logistic Regression Expectation Propagation (α = -1)

Hierarchy of Algorithms (based on α and structuring) Power EP exp family D α (p q) Structured MF exp family KL(q p) FBP fully factorized D α (p q) EP exp family KL(p q) MF fully factorized KL(q p) TRW fully factorized D α (p q) α > 1 BP fully factorized KL(p q) by Tom Minka

Variational MF p( x) 1 1 ψ ( x) = γ c( xc ) = e = Z c Z ) c ψ ( x log( γ ( )) c x c log Z = log e ψ ( x) dx = log Q( x) ψ ( x) e Q( x) dx Jensen s E Q log[ e ψ ( x) / Q( x)] = sup Q E Q log[ e ψ ( x) / Q( x)] = sup { E [ ψ Q Q ( x)] + H [ Q( x)]}

Variational MF log Z ψ sup { E [ ( x)] + Q Q H [ Q( x)]} Equality is obtained for Q(x) = P(x) (all Q admissible) Using any other Q yields a lower bound on log Z The slack in this bound is KL-divergence D(Q P) Goal: restrict Q to a tractable subclass Q optimize with sup Q to tighten this bound note we re (also) maximizing entropy H[Q]

Variational MF log Z ψ sup { E [ ( x)] + Q Q H [ Q( x)]} Most common specialized family : log-linear models = T ψ ( x) θ φ ( x ) = θ φ( x) linear in parameters θ (natural parameters of EFs) c c c c clique potentials φ(x) (sufficient statistics of EFs) Fertile ground for plowing Convex Analysis

Convex Analysis The Old Testament The New Testament

Variational MF for EF log Z ψ sup { E [ ( x)] + Q Q H [ Q( x)]} log Z sup { E [ θ φ( x)] + Q Q T H [ Q( x)]} log Z T sup { θ E [ φ( x)] + Q Q H [ Q( x)]} A( θ ) sup { θ µ A*( µ ) } µ M T EF notation M = set of all moment parameters realizable under subclass Q

Variational MF for EF So it looks like we are just optimizing a concave function (linear term + negative-entropy) over a convex set Yet it is hard... Why? 1. graph probability (being a measure) requires a very large number of marginalization constraints for consistency (leads to a typically beastly marginal polytope M in the discrete case) e.g., a complete 7-node graph s polytope has over 10 8 facets! In fact, optimizing just the linear term alone can be hard 2. exact computation of entropy -A*(µ) is highly non-trivial (hence the famed Bethe & Kikuchi approximations)

Gibbs Sampling for Ising Binary MRF G = (V,E) with pairwise clique potentials 1. pick a node s at random 2. sample u ~ Uniform(0,1) 3. update node s : 4. goto step 1 a slower stochastic version of ICM

Naive MF for Ising use a variational mean parameter at each site 1. pick a node s at random 2. update its parameter : 3. goto step 1 deterministic loopy message-passing how well does it work? depends on θ

Graphical Models as EF G(V,E) with nodes sufficient stats : clique potentials likewise for θ st probability log-partition mean parameters

Variational Theorem for EF For any mean parameter µ where θ (µ) is the corresponding natural parameter in relative interior of M not in the closure of M the log-partition function has this variational representation this supremum is achieved at the moment-matching value of µ

Legendre-Fenchel Duality Main Idea: (convex) functions can be supported (lower-bounded) by a continuum of lines (hyperplanes) whose intercepts create a conjugate dual of the original function (and vice versa) conjugate dual of A conjugate dual of A* Note that A** = A (iff A is convex)

Dual Map for EF Two equivalent parameterizations of the EF Bijective mapping between Ω and the interior of M Mapping is defined by the gradients of A and its dual A* Shape & complexity of M depends on X and size and structure of G

Marginal Polytope G(V,E) = graph with discrete nodes Then M = convex hull of all φ(x) equivalent to intersecting half-spaces a T µ > b difficult to characterize for large G hence difficult to optimize over interior of M is 1-to-1 with Ω

The Simplest Graph x G(V,E) = a single Bernoulli node φ(x) = x density log-partition (of course we knew this) we know A* too, but let s solve for it variationally differentiate stationary point rearrange to, substitute into A* Note: we found both the mean parameter and the lower bound using the variational method

The 2 nd Simplest Graph x 1 x 2 G(V,E) = 2 connected Bernoulli nodes moments moment constraints variational problem solve (it s still easy!)

The 3 rd Simplest Graph x 1 x 2 x 3 3 nodes 16 constraints # of constraints blows up real fast: 7 nodes 200,000,000+ constraints hard to keep track of valid µ s (i.e., the full shape and extent of M ) no more checking our results against closed-forms expressions that we already knew in advance! unless G remains a tree, entropy A* will not decompose nicely, etc

tractable subgraph H = (V,0) fully-factored distribution moment space Variational MF for Ising entropy is additive : - variational problem for A(θ ) using coordinate ascent :

Variational MF for Ising M tr is a non-convex inner approximation M tr M what causes this funky curvature? optimizing over M tr must then yield a lower bound

Factorization with Trees suppose we have a tree G = (V,T) useful factorization for trees entropy becomes - singleton terms - pairwise terms Mutual Information

Variational MF for Loopy Graphs pretend entropy factorizes like a tree (Bethe approximation) define pseudo marginals must impose these normalization and marginalization constraints define local polytope L(G) obeying these constraints M ( G) L( G) note that for any G with equality only for trees : M(G) = L(G)

Variational MF for Loopy Graphs L(G) is an outer polyhedral approximation solving this Bethe Variational Problem we get the LBP eqs! so fixed points of LBP are the stationary points of the BVP this not only illuminates what was originally an educated hack (LBP) but suggests new convergence conditions and improved algorithms (TRW)

see ICML 2008 Tutorial

Summary SMF can also be cast in terms of Free Energy etc Tightening the var bound = min KL divergence Other schemes (e.g, Variational Bayes ) = SMF - with additional conditioning (hidden, visible, parameter) Solving variational problem gives both µ and A(θ) Helps to see problems through lens of Var Analysis

Matrix of Inference Methods Exact Deterministic approximation Stochastic approximation Chain (online) Low treewidth High treewidth Discrete BP = forwards Boyen-Koller (ADF), beam search VarElim, Jtree, recursive conditioning Loopy BP, mean field, structured variational, EP, graph-cuts Gibbs Gaussian BP = Kalman filter Jtree = sparse linear Loopy BP algebra Gibbs Other EKF, UKF, moment EP, EM, VB, EP, variational EM, VB, matching (ADF) NBP, Gibbs NBP, Gibbs Particle filter BP = Belief Propagation, EP = Expectation Propagation, ADF = Assumed Density Filtering, EKF = Extended Kalman Filter, UKF = unscented Kalman filter, VarElim = Variable Elimination, Jtree= Junction Tree, EM = Expectation Maximization, VB = Variational Bayes, NBP = Non-parametric BP by Kevin Murphy

T H E E N D