Variational Methods. Zoubin Ghahramani.

Variational Methods Zoubin Ghahramani zoubin@cs.cmu.edu http://www.gatsby.ucl.ac.uk Statistical Approaches to Learning and Discovery Carnegie Mellon University April 23

The Expectation Maximization (EM) algorithm

Given observed/visible variables y, unobserved/hidden/latent/missing variables x, and model parameters θ, maximize the likelihood w.r.t. θ:

L(\theta) = \log p(y|\theta) = \log \int p(x, y|\theta)\, dx,

where we have written the marginal for the visibles as an integral over the joint distribution of hidden and visible variables. Using Jensen's inequality, any distribution over hidden variables q(x) (such that q(x) > 0 wherever p(x, y|\theta) > 0) gives:

L(\theta) = \log \int q(x)\, \frac{p(x, y|\theta)}{q(x)}\, dx \;\geq\; \int q(x) \log \frac{p(x, y|\theta)}{q(x)}\, dx = F(q, \theta),

defining the functional F(q, θ), which is a lower bound on the log likelihood. In the EM algorithm, we alternately optimize F(q, θ) w.r.t. q and θ, and we can prove that this never decreases L(θ).

The E and M steps of EM

The lower bound on the log likelihood:

F(q, \theta) = \int q(x) \log \frac{p(x, y|\theta)}{q(x)}\, dx = \int q(x) \log p(x, y|\theta)\, dx + H(q),

where H(q) = -\int q(x) \log q(x)\, dx is the entropy of q. We iteratively alternate:

E step: maximize F(q, θ) w.r.t. the distribution over hidden variables, given the parameters:

q^{(k)}(x) := \arg\max_{q(x)} F\big(q(x), \theta^{(k-1)}\big).

M step: maximize F(q, θ) w.r.t. the parameters, given the hidden distribution:

\theta^{(k)} := \arg\max_{\theta} F\big(q^{(k)}(x), \theta\big) = \arg\max_{\theta} \int q^{(k)}(x) \log p(x, y|\theta)\, dx,

which is equivalent to optimizing the expected complete-data log likelihood, log p(x, y|θ), since the entropy of q(x) does not depend on θ.
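
As a concrete illustration (not part of the original slides), here is a minimal EM implementation for a two-component 1-D Gaussian mixture, where the exact E step q(x) = p(x|y, θ) is just the responsibilities; the toy data, initialisation and iteration count are arbitrary choices. The assertion checks numerically that L(θ) never decreases:

```python
# Minimal EM for a two-component 1-D Gaussian mixture (illustrative sketch,
# not code from the course). The E step computes q(x) = p(x | y, theta);
# the M step re-estimates theta from the expected sufficient statistics.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])  # toy data

pi, mu, sd = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
prev_ll = -np.inf
for it in range(100):
    # E step: responsibilities q(x_n = k) for each data point
    dens = pi * norm.pdf(y[:, None], mu, sd)          # shape (N, 2)
    ll = np.log(dens.sum(axis=1)).sum()               # current log likelihood L(theta)
    assert ll >= prev_ll - 1e-9                       # EM never decreases L(theta)
    prev_ll = ll
    r = dens / dens.sum(axis=1, keepdims=True)
    # M step: maximize the expected complete-data log likelihood
    Nk = r.sum(axis=0)
    pi = Nk / len(y)
    mu = (r * y[:, None]).sum(axis=0) / Nk
    sd = np.sqrt((r * (y[:, None] - mu) ** 2).sum(axis=0) / Nk)

print(f"log likelihood {ll:.2f}, weights {pi.round(2)}, means {mu.round(2)}")
```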

EM as Coordinate Ascent in F

The EM algorithm never decreases the log likelihood

The difference between the log likelihood and the bound:

L(\theta) - F(q, \theta) = \log p(y|\theta) - \int q(x) \log \frac{p(x, y|\theta)}{q(x)}\, dx
= \log p(y|\theta) - \int q(x) \log \frac{p(x|y, \theta)\, p(y|\theta)}{q(x)}\, dx
= -\int q(x) \log \frac{p(x|y, \theta)}{q(x)}\, dx = KL\big(q(x)\,\|\,p(x|y, \theta)\big).

This is the Kullback-Leibler divergence; it is non-negative, and zero if and only if q(x) = p(x|y, θ) (which is therefore the optimal E step). Although we are working with a bound on the likelihood, the likelihood is non-decreasing in every iteration:

L\big(\theta^{(k-1)}\big) \overset{\text{E step}}{=} F\big(q^{(k)}, \theta^{(k-1)}\big) \overset{\text{M step}}{\leq} F\big(q^{(k)}, \theta^{(k)}\big) \overset{\text{Jensen}}{\leq} L\big(\theta^{(k)}\big),

where the equality holds because of the E step, the first inequality comes from the M step, and the final inequality from Jensen. Usually EM converges to a local optimum of L(θ) (although there are exceptions).

A Generative Model for Generative Models

[Figure: a graph relating many models — Gaussian, Factor Analysis (PCA), ICA, Mixture of Gaussians (VQ), Mixture of Factor Analyzers, Cooperative Vector Quantization, HMM, Factorial HMM, Mixture of HMMs, Linear Dynamical Systems (SSMs), Mixture of LDSs, Switching State-space Models, SBN / Boltzmann Machines, Nonlinear Gaussian Belief Nets, Nonlinear Dynamical Systems — connected by the operations mix (mixture), red-dim (reduced dimension), dyn (dynamics), distrib (distributed representation), hier (hierarchical), nonlin (nonlinear) and switch (switching).]

Example: Dynamic Bayesian Networks

Factorial Hidden Markov Models, Switching State-Space Models, Nonlinear Dynamical Systems, ...

[Figure: graphical models for a factorial HMM (hidden chains X^(1), ..., X^(M) with observations Y), a switching state-space model (state chains S^(1), S^(2), S^(3) and switch S with observations Y), and a dynamic Bayesian network with coupled chains A, B, C, D over time.]

Intractability

For many models of interest, exact inference is not computationally feasible. This occurs for two (main) reasons:
- distributions may have complicated forms (non-linearities in the generative model)
- "explaining away" causes coupling from observations: observing the value of a child induces dependencies amongst its parents

[Figure: five parents X1, ..., X5 with a common child Y, where Y = X1 + 2 X2 + X3 + X4 + 3 X5.]

We can still work with such models by using approximate inference techniques to estimate the latent variables.
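
The coupling induced by explaining away is easy to verify by brute force. The sketch below (not from the slides; it assumes uniform priors P(X_i = 1) = 0.5 and an observed value Y = 3) enumerates the posterior for the little network in the figure and shows that X1 and X5, independent a priori, become dependent given Y:

```python
# Explaining away, by brute-force enumeration (illustrative; the priors and the
# observed value of Y are assumptions). Five independent binary parents X1..X5
# with P(Xi=1)=0.5 and a deterministic child Y = X1 + 2*X2 + X3 + X4 + 3*X5.
from itertools import product

w = [1, 2, 1, 1, 3]
y_obs = 3
post = {x: 1.0 for x in product([0, 1], repeat=5)
        if sum(wi * xi for wi, xi in zip(w, x)) == y_obs}
Z = sum(post.values())
post = {x: p / Z for x, p in post.items()}   # p(x | y): uniform over consistent x

def marginal(post, i, vi):
    return sum(p for x, p in post.items() if x[i] == vi)

def joint(post, i, vi, j, vj):
    return sum(p for x, p in post.items() if x[i] == vi and x[j] == vj)

# X1 and X5 are independent a priori, but not once Y is observed:
print(joint(post, 0, 1, 4, 1))                       # p(X1=1, X5=1 | y)
print(marginal(post, 0, 1) * marginal(post, 4, 1))   # p(X1=1|y) * p(X5=1|y)
```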

Variational Approximations

Assume your goal is to maximize the likelihood ln p(y|θ). Any distribution q(x) over the hidden variables defines a lower bound on ln p(y|θ):

\ln p(y|\theta) \geq \sum_x q(x) \ln \frac{p(x, y|\theta)}{q(x)} = F(q, \theta).

Constrain q(x) to be of a particular tractable form (e.g. factorised) and maximise F subject to this constraint.

E step: maximise F w.r.t. q with θ fixed, subject to the constraint on q; equivalently, minimize:

\ln p(y|\theta) - F(q, \theta) = \sum_x q(x) \ln \frac{q(x)}{p(x|y, \theta)} = KL(q\,\|\,p).

The inference step therefore tries to find the q closest to the exact posterior distribution.

M step: maximise F w.r.t. θ with q fixed (related to mean-field approximations).

Variational Approximations for Bayesian Learning

Model structure and overfitting: a simple example

[Figure: polynomial fits of orders M = 0 through M = 7 to the same small data set.]

Learning Model Structure

Conditional Independence Structure: what is the structure of the graph (i.e. what relations hold)? [Figure: a graph over variables A, B, C, D, E.]

Feature Selection: is some input relevant to predicting some output?

Cardinality of Discrete Latent Variables: how many clusters in the data? How many states in a hidden Markov model (e.g. for a protein sequence such as SVYDAAAQLTADVKKDLRDSWKVIGSDKKGNGVALMTTY)?

Dimensionality of Real-Valued Latent Vectors: what choice of dimensionality in a PCA/FA model of the data? How many state variables in a linear-Gaussian state-space model?

Using Bayesian Occam's Razor to Learn Model Structure

Select the model class, m, with the highest probability given the data, y:

p(m|y) = \frac{p(y|m)\, p(m)}{p(y)}, \qquad p(y|m) = \int p(y|\theta, m)\, p(\theta|m)\, d\theta.

Interpretation of the marginal likelihood ("evidence"): the probability that randomly selected parameters from the prior would generate y. Model classes that are too simple are unlikely to generate the data set. Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random.

[Figure: P(Y|M_i) plotted over all possible data sets Y for a model that is too simple, one that is "just right", and one that is too complex.]

Bayesian Model Selection: Occam's Razor at Work

[Figure: polynomial fits of orders M = 0 through M = 7 to the data (left), and the model evidence P(Y|M) as a function of M (right).]

demo: polybayes

Subtleties of Occam's Hill
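
The polybayes demo itself is not reproduced here, but the idea is easy to sketch: for Bayesian polynomial regression with a zero-mean Gaussian prior on the weights and known noise variance, the marginal likelihood p(y|M) is a Gaussian in y and can be evaluated in closed form for each order M. The prior scale, noise level and toy data below are arbitrary assumptions; the evidence typically peaks at an intermediate order:

```python
# Closed-form model evidence for Bayesian polynomial regression (a sketch in
# the spirit of the polybayes demo, not the original code).  With weights
# w ~ N(0, alpha*I) and noise N(0, sigma2), y ~ N(0, alpha*Phi Phi^T + sigma2*I).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 20)
y = 0.5 - x + 2 * x**2 + rng.normal(0, 0.1, x.size)   # data from a quadratic

alpha, sigma2 = 1.0, 0.1**2                           # prior and noise variances (assumed)
for M in range(8):
    Phi = np.vander(x, M + 1, increasing=True)        # features 1, x, ..., x^M
    cov = alpha * Phi @ Phi.T + sigma2 * np.eye(x.size)
    log_ev = multivariate_normal(mean=np.zeros(x.size), cov=cov).logpdf(y)
    print(f"M = {M}: log p(y | M) = {log_ev:.1f}")
```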

Computing Marginal Likelihoods can be Computationally Intractable

p(y|m) = \int p(y|\theta, m)\, p(\theta|m)\, d\theta

This can be a very high-dimensional integral. The presence of latent variables results in additional dimensions that need to be marginalized out:

p(y|m) = \int\!\!\int p(y, x|\theta, m)\, p(\theta|m)\, dx\, d\theta

The likelihood term can be complicated.

Practical Bayesian approaches

- Laplace approximations: appeal to asymptotic normality to make a Gaussian approximation about the posterior mode of the parameters.
- Large-sample approximations (e.g. BIC).
- Markov chain Monte Carlo methods (MCMC): converge to the desired distribution in the limit, but many samples are required to ensure accuracy, and it is sometimes hard to assess convergence and to reliably compute the marginal likelihood.
- Variational approximations...

Note: other deterministic approximations are also available now, e.g. Bethe/Kikuchi approximations, Expectation Propagation, tree-based reparameterizations.

Lower Bounding the Marginal Likelihood: Variational Bayesian Learning

Let the latent variables be x, the data y and the parameters θ. We can lower bound the marginal likelihood (by Jensen's inequality):

\ln p(y|m) = \ln \int p(y, x, \theta|m)\, dx\, d\theta
= \ln \int q(x, \theta)\, \frac{p(y, x, \theta|m)}{q(x, \theta)}\, dx\, d\theta
\geq \int q(x, \theta) \ln \frac{p(y, x, \theta|m)}{q(x, \theta)}\, dx\, d\theta.

Use a simpler, factorised approximation q(x, θ) ≈ q_x(x) q_θ(θ):

\ln p(y|m) \geq \int q_x(x)\, q_\theta(\theta) \ln \frac{p(y, x, \theta|m)}{q_x(x)\, q_\theta(\theta)}\, dx\, d\theta = F_m(q_x(x), q_\theta(\theta), y).

Variational Bayesian Learning...

Maximizing this lower bound, F_m, leads to EM-like iterative updates:

q_x^{(t+1)}(x) \propto \exp\left[\int \ln p(x, y|\theta, m)\, q_\theta^{(t)}(\theta)\, d\theta\right]   (E-like step)

q_\theta^{(t+1)}(\theta) \propto p(\theta|m)\, \exp\left[\int \ln p(x, y|\theta, m)\, q_x^{(t+1)}(x)\, dx\right]   (M-like step)

Maximizing F_m is equivalent to minimizing the KL divergence between the approximate posterior, q_θ(θ) q_x(x), and the true posterior, p(θ, x|y, m):

\ln p(y|m) - F_m(q_x(x), q_\theta(\theta), y) = \int q_x(x)\, q_\theta(\theta) \ln \frac{q_x(x)\, q_\theta(\theta)}{p(\theta, x|y, m)}\, dx\, d\theta = KL(q\,\|\,p).

In the limit as n → ∞, for identifiable models, the variational lower bound approaches Schwarz's (1978) BIC criterion.

Conjugate-Exponential models

Let's focus on conjugate-exponential (CE) models, which satisfy (1) and (2):

Condition (1). The joint probability over variables is in the exponential family:

p(x, y|\theta) = f(x, y)\, g(\theta) \exp\{\phi(\theta)^\top u(x, y)\},

where φ(θ) is the vector of natural parameters and u are the sufficient statistics.

Condition (2). The prior over parameters is conjugate to this joint probability:

p(\theta|\eta, \nu) = h(\eta, \nu)\, g(\theta)^\eta \exp\{\phi(\theta)^\top \nu\},

where η and ν are hyperparameters of the prior. Conjugate priors are computationally convenient and have an intuitive interpretation:
η: number of pseudo-observations
ν: values of pseudo-observations
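
For concreteness (an example added here, not on the slide), a univariate Gaussian with known variance σ² and unknown mean µ, with no hidden variables, fits this template:

```latex
% Univariate Gaussian with known variance \sigma^2 and unknown mean \mu,
% written in the conjugate-exponential form of Conditions (1) and (2).
p(y \mid \mu) =
  \underbrace{(2\pi\sigma^2)^{-1/2} e^{-y^2/2\sigma^2}}_{f(y)}\;
  \underbrace{e^{-\mu^2/2\sigma^2}}_{g(\mu)}\;
  \exp\{\underbrace{(\mu/\sigma^2)}_{\phi(\mu)}\,\underbrace{y}_{u(y)}\},
\qquad
p(\mu \mid \eta, \nu) \propto g(\mu)^{\eta} \exp\{\phi(\mu)\,\nu\}
 = \exp\!\Big\{-\frac{\eta\,\mu^2}{2\sigma^2} + \frac{\nu\,\mu}{\sigma^2}\Big\},
```

which is a Gaussian prior with mean ν/η and variance σ²/η: η plays the role of a pseudo-count and ν of a pseudo-sum of observations.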

Conjugate-Exponential examples

In the CE family:
- Gaussian mixtures
- factor analysis, probabilistic PCA
- hidden Markov models and factorial HMMs
- linear dynamical systems and switching models
- discrete-variable belief networks
Other as yet undreamt-of models can combine Gaussian, Gamma, Poisson, Dirichlet, Wishart, Multinomial and others.

Not in the CE family:
- Boltzmann machines, MRFs (no conjugacy)
- logistic regression (no conjugacy)
- sigmoid belief networks (not exponential)
- independent components analysis (not exponential)

Note: one can often approximate these models with models in the CE family.

The Variational Bayesian EM algorithm

EM for MAP estimation
Goal: maximize p(θ|y, m) w.r.t. θ
E step: compute q_x^{(t+1)}(x) = p(x|y, \theta^{(t)})
M step: \theta^{(t+1)} = \arg\max_\theta \int q_x^{(t+1)}(x) \ln p(x, y, \theta)\, dx

Variational Bayesian EM
Goal: lower bound p(y|m)
VB-E step: compute q_x^{(t+1)}(x) = p(x|y, \bar{\phi}^{(t)})
VB-M step: q_\theta^{(t+1)}(\theta) \propto \exp\left[\int q_x^{(t+1)}(x) \ln p(x, y, \theta)\, dx\right]

Properties:
- Reduces to the EM algorithm if q_θ(θ) = δ(θ − θ*).
- F_m increases monotonically, and incorporates the model complexity penalty.
- Analytical parameter distributions (but not constrained to be Gaussian).
- The VB-E step has the same complexity as the corresponding E step.
- We can use the junction tree, belief propagation, Kalman filter, etc. algorithms in the VB-E step of VB-EM, but using expected natural parameters, \bar{\phi}.
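
For conjugate-exponential models (assuming n i.i.d. observations, a detail added here for clarity), the VB-M step can be written out explicitly: the approximate posterior over the parameters stays in the conjugate family, with the expected sufficient statistics playing the role of observed ones:

```latex
% VB-M step for a conjugate-exponential model with n i.i.d. data points:
% the posterior keeps the conjugate form, with updated hyperparameters.
q_\theta(\theta) \;\propto\; p(\theta \mid \eta, \nu)\,
   \exp\Big\{ \sum_{i=1}^{n} \big\langle \ln p(x_i, y_i \mid \theta) \big\rangle_{q_x} \Big\}
 \;\propto\; g(\theta)^{\eta + n}\,
   \exp\Big\{ \phi(\theta)^\top \Big( \nu + \sum_{i=1}^{n} \big\langle u(x_i, y_i) \big\rangle_{q_x} \Big) \Big\},
```

so the VB-M step simply adds the data count n to η and the summed expected sufficient statistics to ν.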

Variational Bayesian EM

The Variational Bayesian EM algorithm has been used to approximate Bayesian learning in a wide range of models, such as:
- probabilistic PCA and factor analysis
- mixtures of Gaussians and mixtures of factor analysers
- hidden Markov models
- state-space models (linear dynamical systems)
- independent components analysis (ICA) and mixtures
- discrete graphical models
- ...

The main advantage is that it can be used to automatically do model selection and does not suffer from overfitting to the same extent as ML methods do. Also, it is about as computationally demanding as the usual EM algorithm. See: www.variational-bayes.org

demos: mixture of Gaussians, hidden Markov models
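
A minimal VB-EM sketch in this spirit (not the course demo code): a 1-D mixture of Gaussians with a known, shared variance, a symmetric Dirichlet prior on the mixing weights and Gaussian priors on the component means; all hyperparameter values below are assumptions. The VB-E step uses expected natural parameters (the digamma terms), and the VB-M step updates the conjugate hyperparameters:

```python
# VB-EM for a 1-D mixture of Gaussians with *known, shared* variance sigma2,
# Dirichlet prior on the mixing weights, Gaussian priors on the means.
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2, 0.5, 150), rng.normal(2, 0.5, 150)])
K, sigma2 = 4, 0.25                      # deliberately more components than needed
alpha0, m0, s0_2 = 1.0, 0.0, 10.0        # hyperparameters (assumed values)

alpha = np.full(K, alpha0)               # q(pi) = Dirichlet(alpha)
m = rng.normal(0, 1, K)                  # q(mu_k) = N(m_k, s2_k)
s2 = np.full(K, s0_2)

for it in range(200):
    # VB-E step: responsibilities from expected natural parameters
    Elog_pi = digamma(alpha) - digamma(alpha.sum())
    Elog_lik = -0.5 * np.log(2 * np.pi * sigma2) \
               - 0.5 * ((y[:, None] - m) ** 2 + s2) / sigma2
    log_r = Elog_pi + Elog_lik
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)
    # VB-M step: update q(pi) (Dirichlet counts) and q(mu_k) (Gaussian posteriors)
    Nk = r.sum(axis=0)
    alpha = alpha0 + Nk
    s2 = 1.0 / (1.0 / s0_2 + Nk / sigma2)
    m = s2 * (m0 / s0_2 + (r * y[:, None]).sum(axis=0) / sigma2)

print("expected mixing weights:", (alpha / alpha.sum()).round(2))
print("posterior means:", m.round(2))
```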

! " # $ % & ' & & & ( & & & & & ) ( * & ( & ( ) ( * & ( &

[Figure: marginal likelihood estimated by annealed importance sampling (AIS), plotted against the duration of annealing (number of samples).]

[Figure: rank given to the true structure by AIS, VB and BIC, plotted against the number of data points n.]

Comparison to Cheeseman-Stutz and BIC

[Figure: two panels comparing BIC, BIC with prior, CS, CSnew and VB — the fraction of runs in which the true structure receives posterior probability above a threshold, plotted against the size of the data set, averaged over repeated samples.]

CS is much better than BIC, and under some measures as good as VB. Note: BIC and CS require estimates of the effective number of parameters, which can be difficult to compute. We estimate the effective number of parameters using a variant of the procedure described in Geiger, Heckerman and Meek (1996).

Summary and Conclusions

- EM can be interpreted as a lower bound maximization algorithm.
- For many models of interest the E step is intractable. For such models, an approximate E step can be used in a variational lower bound optimization algorithm.
- This lower bound idea can also be used to do variational Bayesian learning.
- Bayesian learning embodies an automatic Occam's razor via the marginal likelihood. This makes it possible to avoid overfitting and select models.

Appendix

Example: A Multiple Cause Model

[Figure: the shapes problem — training data images, each a superposition of simple shapes; the architecture with binary causes s1, ..., sK and weight matrices W1, W2, W3 generating the observed image y; and the learned output weight matrices.]

Example: A Multiple Cause Model

[Figure: binary causes s1, ..., sK with a common child y.]

Model with binary latent variables s_i ∈ {0, 1}, real-valued observed vector y, and parameters θ = {{µ_i, π_i}_{i=1}^K, σ²}:

p(s_1, \ldots, s_K|\pi) = \prod_{i=1}^{K} p(s_i|\pi_i) = \prod_{i=1}^{K} \pi_i^{s_i} (1 - \pi_i)^{(1 - s_i)}

p(y|s_1, \ldots, s_K, \mu, \sigma^2) = \mathcal{N}\Big(\textstyle\sum_{i=1}^{K} s_i \mu_i,\; \sigma^2 I\Big)

EM optimizes a lower bound on the likelihood:

F(q, \theta) = \langle \log p(s, y|\theta) \rangle_{q(s)} - \langle \log q(s) \rangle_{q(s)},

where ⟨·⟩_{q(s)} denotes expectation under q. The optimum E step, q(s) = p(s|y, θ), is exponential in K.
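
To make the generative process concrete, here is a short sampling sketch (the parameter values are made up for illustration): draw binary causes s_i ~ Bernoulli(π_i), then y ~ N(Σ_i s_i µ_i, σ² I).

```python
# Sampling from the multiple cause model (illustrative parameter values).
import numpy as np

rng = np.random.default_rng(0)
K, D, sigma = 3, 5, 0.1
pi = np.array([0.3, 0.5, 0.7])            # prior probabilities of each cause
mu = rng.normal(0, 1, size=(K, D))        # one mean vector per binary cause

s = rng.binomial(1, pi)                   # binary latent causes s_i
y = s @ mu + rng.normal(0, sigma, D)      # observed vector y
print(s, y.round(2))
```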

Example: A Multiple Cause Model (cont.)

F(q, \theta) = \langle \log p(s, y|\theta) \rangle_{q(s)} - \langle \log q(s) \rangle_{q(s)}    (1)

\log p(s, y|\theta) + c = \sum_{i=1}^{K} \big[ s_i \log \pi_i + (1 - s_i) \log(1 - \pi_i) \big] - D \log \sigma - \frac{1}{2\sigma^2}\Big(y - \sum_i s_i \mu_i\Big)^{\!\top} \Big(y - \sum_i s_i \mu_i\Big)

= \sum_{i=1}^{K} \big[ s_i \log \pi_i + (1 - s_i) \log(1 - \pi_i) \big] - D \log \sigma - \frac{1}{2\sigma^2}\Big( y^\top y - 2 \sum_i s_i \mu_i^\top y + \sum_i \sum_j s_i s_j \mu_i^\top \mu_j \Big)

We therefore need ⟨s_i⟩ and ⟨s_i s_j⟩ to compute F. These are the expected sufficient statistics of the hidden variables.

Example: A Multiple Cause Model (cont.)

Variational approximation:

q(s) = \prod_i q_i(s_i) = \prod_{i=1}^{K} \lambda_i^{s_i} (1 - \lambda_i)^{(1 - s_i)}    (2)

Under this approximation, ⟨s_i⟩ = λ_i and ⟨s_i s_j⟩ = λ_i λ_j + δ_{ij}(λ_i − λ_i²).

F(\lambda, \theta) = \sum_i \Big[ \lambda_i \log \frac{\pi_i}{\lambda_i} + (1 - \lambda_i) \log \frac{(1 - \pi_i)}{(1 - \lambda_i)} \Big] - D \log \sigma - \frac{1}{2\sigma^2}\Big(y - \sum_i \lambda_i \mu_i\Big)^{\!\top} \Big(y - \sum_i \lambda_i \mu_i\Big) + C(\lambda, \mu) + c    (3)

where C(\lambda, \mu) = -\frac{1}{2\sigma^2} \sum_i (\lambda_i - \lambda_i^2)\, \mu_i^\top \mu_i, and c = -\frac{D}{2} \log(2\pi) is a constant.

Fixed point equations for multiple cause model

Taking derivatives w.r.t. λ_i:

\frac{\partial F}{\partial \lambda_i} = \log \frac{\pi_i}{1 - \pi_i} - \log \frac{\lambda_i}{1 - \lambda_i} + \frac{1}{\sigma^2}\Big(y - \sum_{j \neq i} \lambda_j \mu_j\Big)^{\!\top} \mu_i - \frac{1}{2\sigma^2} \mu_i^\top \mu_i    (4)

Setting this to zero, we get the fixed point equations:

\lambda_i = f\!\left( \log \frac{\pi_i}{1 - \pi_i} + \frac{1}{\sigma^2}\Big(y - \sum_{j \neq i} \lambda_j \mu_j\Big)^{\!\top} \mu_i - \frac{1}{2\sigma^2} \mu_i^\top \mu_i \right)    (5)

where f(x) = 1/(1 + exp(−x)) is the logistic (sigmoid) function.

Learning algorithm:
E step: run the fixed point equations until convergence of λ for each data point.
M step: re-estimate θ given the λs.
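
A direct implementation of this E step (a sketch; the parameters and data point below are made up for illustration) iterates the fixed point equations (5) for λ on a single data point until convergence:

```python
# Mean-field E step for the multiple cause model: iterate the fixed-point
# equations for lambda until convergence (illustrative parameter values).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field_e_step(y, pi, mu, sigma2, n_iters=50, tol=1e-6):
    """y: (D,) observation; pi: (K,) priors on s_i = 1; mu: (K, D) cause means."""
    K = len(pi)
    lam = np.full(K, 0.5)                          # initial q(s_i = 1)
    for _ in range(n_iters):
        lam_old = lam.copy()
        for i in range(K):                         # update each lambda_i in turn
            resid = y - lam @ mu + lam[i] * mu[i]  # y - sum_{j != i} lambda_j mu_j
            a = (np.log(pi[i] / (1 - pi[i]))
                 + resid @ mu[i] / sigma2
                 - mu[i] @ mu[i] / (2 * sigma2))
            lam[i] = sigmoid(a)
        if np.max(np.abs(lam - lam_old)) < tol:
            break
    return lam

rng = np.random.default_rng(0)
mu = rng.normal(0, 1, size=(3, 5))                 # K = 3 causes, D = 5 dimensions
pi = np.array([0.3, 0.5, 0.7])
y = mu[0] + mu[2] + rng.normal(0, 0.1, 5)          # data generated by causes 1 and 3
print(mean_field_e_step(y, pi, mu, sigma2=0.01).round(2))
```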

Structured Variational Approximations

q(s) need not be completely factorized. For example, suppose you can partition s into sets s_1 and s_2 such that computing the expected sufficient statistics under q(s_1) and q(s_2) is tractable. Then q(s) = q(s_1) q(s_2) is tractable. If you have a graphical model, you may want to factorize q(s) into a product of trees, which are tractable distributions.

[Figure: a dynamic Bayesian network with four coupled chains A_t, B_t, C_t, D_t over time; a structured approximation can factorize q over the chains.]

Scaling the parameter priors

How the parameter priors are scaled determines whether an Occam's hill is present or not.

[Figure: polynomial fits of increasing order and the corresponding posterior over model order, for unscaled versus scaled parameter priors.]

(Rasmussen & Ghahramani, 2000) back

The Cheeseman-Stutz (CS) Approximation

The Cheeseman-Stutz approximation is based on:

p(y|m) = p(z|m)\, \frac{p(y|m)}{p(z|m)} = p(z|m)\, \frac{\int d\theta\, p(\theta|m)\, p(y|\theta, m)}{\int d\theta\, p(\theta|m)\, p(z|\theta, m)},

which is true for any completion of the data, z = {ŝ, y}. We use the BIC approximation for both the top and bottom integrals (this can be corrected for d ≠ d'):

\ln p(y|m) \approx \ln p(\hat{s}, y|m) + \ln p(\hat{\theta}|m) + \ln p(y|\hat{\theta}) - \frac{d}{2} \ln n - \ln p(\hat{\theta}|m) - \ln p(\hat{s}, y|\hat{\theta}) + \frac{d'}{2} \ln n
= \ln p(\hat{s}, y|m) + \ln p(y|\hat{\theta}) - \ln p(\hat{s}, y|\hat{\theta}).

Cheeseman-Stutz: run MAP to get θ̂, complete the data with expectations under θ̂, and compute the CS approximation as above.