Approximate Inference


Approximate Inference. 9.520 Class 19. Ruslan Salakhutdinov, BCS and CSAIL, MIT.

Plan
1. Introduction/Notation.
2. Examples of successful Bayesian models.
3. Laplace and Variational Inference.
4. Basic Sampling Algorithms.
5. Markov chain Monte Carlo algorithms.

References/Acknowledgements
- Chris Bishop's book: Pattern Recognition and Machine Learning, chapter 11 (many figures are borrowed from this book).
- David MacKay's book: Information Theory, Inference, and Learning Algorithms, chapters 29-32.
- Radford Neal's technical report on Probabilistic Inference Using Markov Chain Monte Carlo Methods.
- Zoubin Ghahramani's ICML tutorial on Bayesian Machine Learning: http://www.gatsby.ucl.ac.uk/~zoubin/icml04-tutorial.html
- Iain Murray's tutorial on Sampling Methods: http://www.cs.toronto.edu/~murray/teaching/

Basic Notation

$P(x)$: probability of $x$.
$P(x \mid \theta)$: conditional probability of $x$ given $\theta$.
$P(x, \theta)$: joint probability of $x$ and $\theta$.

Bayes' Rule:
$$P(\theta \mid x) = \frac{P(x \mid \theta)\, P(\theta)}{P(x)}$$
where
$$P(x) = \int P(x, \theta)\, d\theta \quad \text{(marginalization)}$$

I will use probability distribution and probability density interchangeably. It should be obvious from the context.

Inference Problem

Given a dataset $D = \{x_1, \ldots, x_n\}$, Bayes' Rule:
$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$$
$P(D \mid \theta)$: likelihood function of $\theta$.
$P(\theta)$: prior probability of $\theta$.
$P(\theta \mid D)$: posterior distribution over $\theta$.

Computing the posterior distribution is known as the inference problem. But:
$$P(D) = \int P(D, \theta)\, d\theta$$
This integral can be very high-dimensional and difficult to compute.

Prediction

$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$$
$P(D \mid \theta)$: likelihood function of $\theta$; $P(\theta)$: prior probability of $\theta$; $P(\theta \mid D)$: posterior distribution over $\theta$.

Prediction: Given $D$, computing the conditional probability of $x$ requires computing the following integral:
$$P(x \mid D) = \int P(x \mid \theta, D)\, P(\theta \mid D)\, d\theta = \mathbb{E}_{P(\theta \mid D)}\big[ P(x \mid \theta, D) \big]$$
which is sometimes called the predictive distribution. Computing the predictive distribution requires the posterior $P(\theta \mid D)$.

Model Selection

Compare model classes, e.g. $M_1$ and $M_2$. We need to compute the posterior probabilities given $D$:
$$P(M \mid D) = \frac{P(D \mid M)\, P(M)}{P(D)}$$
where
$$P(D \mid M) = \int P(D \mid \theta, M)\, P(\theta \mid M)\, d\theta$$
is known as the marginal likelihood or evidence.

Computational Challenges

Computing marginal likelihoods often requires computing very high-dimensional integrals. Computing posterior distributions (and hence predictive distributions) is often analytically intractable. In this class, we will concentrate on Markov chain Monte Carlo (MCMC) methods for performing approximate inference.

First, let us look at some specific examples:
- Bayesian Probabilistic Matrix Factorization
- Bayesian Neural Networks
- Dirichlet Process Mixtures (last class)

Bayesian PMF

[Figure: a sparse user-by-movie rating matrix $R$ with missing entries marked "?", factorized as $R \approx U^\top V$ into user features $U$ and movie features $V$.]

We have $N$ users, $M$ movies, and integer rating values from 1 to $K$. Let $r_{ij}$ be the rating of user $i$ for movie $j$, and let $U \in \mathbb{R}^{D \times N}$, $V \in \mathbb{R}^{D \times M}$ be the latent user and movie feature matrices.

Goal: Predict missing ratings.

Bayesian PMF

[Figure: graphical model with hyperpriors $\alpha_U$, $\alpha_V$ over the hyperparameters $\Theta_U$, $\Theta_V$, which govern the latent vectors $U_i$, $V_j$ that generate the observed ratings $R_{ij}$ with noise $\sigma$.]

Probabilistic linear model with Gaussian observation noise. Likelihood:
$$p(r_{ij} \mid u_i, v_j, \sigma^2) = \mathcal{N}(r_{ij} \mid u_i^\top v_j, \sigma^2)$$

Gaussian priors over parameters:
$$p(U \mid \mu_U, \Sigma_U) = \prod_{i=1}^N \mathcal{N}(u_i \mid \mu_U, \Sigma_U), \qquad p(V \mid \mu_V, \Sigma_V) = \prod_{j=1}^M \mathcal{N}(v_j \mid \mu_V, \Sigma_V).$$

Conjugate Gaussian-inverse-Wishart priors are placed on the user and movie hyperparameters $\Theta_U = \{\mu_U, \Sigma_U\}$ and $\Theta_V = \{\mu_V, \Sigma_V\}$: a hierarchical prior.

Bayesian PMF

Predictive distribution: Consider predicting a rating $r_{ij}^*$ for user $i$ and query movie $j$:
$$p(r_{ij}^* \mid R) = \int p(r_{ij}^* \mid u_i, v_j)\, \underbrace{p(U, V, \Theta_U, \Theta_V \mid R)}_{\text{posterior over parameters and hyperparameters}}\, d\{U, V\}\, d\{\Theta_U, \Theta_V\}$$

Exact evaluation of this predictive distribution is analytically intractable. The posterior distribution $p(U, V, \Theta_U, \Theta_V \mid R)$ is complicated and does not have a closed-form expression. We need to approximate.

Bayesian Neural Nets

Regression problem: Given a set of i.i.d. observations $X = \{x_n\}_{n=1}^N$ with corresponding targets $D = \{t_n\}_{n=1}^N$.

Likelihood:
$$p(D \mid X, w) = \prod_{n=1}^N \mathcal{N}\big(t_n \mid y(x_n, w), \beta^2\big)$$

The mean is given by the output of the neural network:
$$y_k(x, w) = \sum_{j=0}^M w_{kj}^{(2)}\, \sigma\Big( \sum_{i=0}^D w_{ji}^{(1)} x_i \Big)$$
where $\sigma(x)$ is the sigmoid function.

Gaussian prior over the network parameters: $p(w) = \mathcal{N}(0, \alpha^2 I)$.

Bayesian Neural Nets

Likelihood:
$$p(D \mid X, w) = \prod_{n=1}^N \mathcal{N}\big(t_n \mid y(x_n, w), \beta^2\big)$$
Gaussian prior over parameters: $p(w) = \mathcal{N}(0, \alpha^2 I)$.

The posterior is analytically intractable:
$$p(w \mid D, X) = \frac{p(D \mid w, X)\, p(w)}{\int p(D \mid w, X)\, p(w)\, dw}$$

Remark: Radford Neal (1994) showed that, under certain conditions, as the number of hidden units goes to infinity, a Gaussian prior over parameters results in a Gaussian process prior over functions.

Undirected Models

$x$ is a binary random vector with $x_i \in \{+1, -1\}$:
$$p(x) = \frac{1}{Z} \exp\Big( \sum_{(i,j) \in E} \theta_{ij} x_i x_j + \sum_{i \in V} \theta_i x_i \Big)$$
where $Z$ is known as the partition function:
$$Z = \sum_x \exp\Big( \sum_{(i,j) \in E} \theta_{ij} x_i x_j + \sum_{i \in V} \theta_i x_i \Big).$$

If $x$ is 100-dimensional, we need to sum over $2^{100}$ terms. The sum might decompose (e.g. junction tree); otherwise we need to approximate.

Remark: Compare to the marginal likelihood.

Inference

For most situations we will be interested in evaluating the expectation:
$$\mathbb{E}[f] = \int f(z)\, p(z)\, dz$$
We will use the following notation: $p(z) = \tilde{p}(z)/Z$. We can evaluate $\tilde{p}(z)$ pointwise, but cannot evaluate $Z$.

Posterior distribution: $P(\theta \mid D) = \frac{1}{P(D)}\, P(D \mid \theta)\, P(\theta)$
Markov random fields: $P(z) = \frac{1}{Z} \exp(-E(z))$

Laplace Approximation

[Figure: a non-Gaussian density and a Gaussian approximation centered on its mode.]

Consider: $p(z) = \tilde{p}(z)/Z$.

Goal: Find a Gaussian approximation $q(z)$ which is centered on a mode of the distribution $p(z)$. At a stationary point $z_0$ the gradient $\nabla \tilde{p}(z)$ vanishes. Consider a Taylor expansion of $\ln \tilde{p}(z)$:
$$\ln \tilde{p}(z) \approx \ln \tilde{p}(z_0) - \frac{1}{2}(z - z_0)^T A\, (z - z_0)$$
where $A$ is the (negative log-density) Hessian matrix:
$$A = -\nabla\nabla \ln \tilde{p}(z)\big|_{z = z_0}$$

Laplace Approximation

Consider: $p(z) = \tilde{p}(z)/Z$. Goal: Find a Gaussian approximation $q(z)$ which is centered on a mode of the distribution $p(z)$.

Exponentiating both sides:
$$\tilde{p}(z) \approx \tilde{p}(z_0)\, \exp\Big( -\frac{1}{2}(z - z_0)^T A\, (z - z_0) \Big)$$

We get a multivariate Gaussian approximation:
$$q(z) = \frac{|A|^{1/2}}{(2\pi)^{D/2}} \exp\Big( -\frac{1}{2}(z - z_0)^T A\, (z - z_0) \Big)$$

Laplace Approximation

Remember $p(z) = \tilde{p}(z)/Z$, where we approximate:
$$Z = \int \tilde{p}(z)\, dz \approx \tilde{p}(z_0) \int \exp\Big( -\frac{1}{2}(z - z_0)^T A\, (z - z_0) \Big) dz = \tilde{p}(z_0)\, \frac{(2\pi)^{D/2}}{|A|^{1/2}}$$

Bayesian inference: $P(\theta \mid D) = \frac{1}{P(D)}\, P(D \mid \theta)\, P(\theta)$. Identify $\tilde{p}(\theta) = P(D \mid \theta)\, P(\theta)$ and $Z = P(D)$.

The posterior is approximately Gaussian around the MAP estimate $\theta_{MAP}$:
$$p(\theta \mid D) \approx \frac{|A|^{1/2}}{(2\pi)^{D/2}} \exp\Big( -\frac{1}{2}(\theta - \theta_{MAP})^T A\, (\theta - \theta_{MAP}) \Big)$$
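To make the construction concrete, here is a minimal 1-D sketch of the Laplace estimate of $Z$. The target (an unnormalized Gamma shape), bounds, and step size are illustrative choices, not from the slides; the Gamma family is picked only because its true normalizer is known, so the approximation can be checked.

```python
import math
import numpy as np
from scipy import optimize

# Unnormalized target: log p~(z) = (a-1) ln z - b z, z > 0 (a Gamma(a, b) shape).
a, b = 5.0, 1.0
log_p_tilde = lambda z: (a - 1) * np.log(z) - b * z

# 1. Find the mode z0 by minimizing -log p~(z); the analytic mode is (a-1)/b = 4.
res = optimize.minimize_scalar(lambda z: -log_p_tilde(z),
                               bounds=(1e-6, 50), method="bounded")
z0 = res.x

# 2. A = -d^2/dz^2 log p~(z) at z0, via a central finite difference.
h = 1e-4
A = -(log_p_tilde(z0 + h) - 2 * log_p_tilde(z0) + log_p_tilde(z0 - h)) / h**2

# 3. Laplace estimate: Z ~= p~(z0) * (2 pi)^(D/2) / |A|^(1/2), with D = 1.
Z_laplace = np.exp(log_p_tilde(z0)) * np.sqrt(2 * np.pi / A)
Z_true = math.gamma(a) / b**a          # exact Gamma normalizer, for comparison
print(f"mode={z0:.3f}  Z_laplace={Z_laplace:.3f}  Z_true={Z_true:.3f}")
```

For this target the estimate comes out near 23.5 against the exact value 24: the approximation is good because the density is fairly close to Gaussian around its mode.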

Laplace Approximation

Remember $p(z) = \tilde{p}(z)/Z$, where we approximate:
$$Z = \int \tilde{p}(z)\, dz \approx \tilde{p}(z_0)\, \frac{(2\pi)^{D/2}}{|A|^{1/2}}$$

Bayesian inference: $P(\theta \mid D) = \frac{1}{P(D)}\, P(D \mid \theta)\, P(\theta)$. Identify $\tilde{p}(\theta) = P(D \mid \theta)\, P(\theta)$ and $Z = P(D)$.

We can approximate the model evidence $P(D) = \int P(D \mid \theta)\, P(\theta)\, d\theta$ using the Laplace approximation:
$$\ln P(D) \approx \ln P(D \mid \theta_{MAP}) + \underbrace{\ln P(\theta_{MAP}) + \frac{D}{2} \ln 2\pi - \frac{1}{2} \ln |A|}_{\text{Occam factor: penalizes model complexity}}$$

Bayesian Information Criterion

BIC can be obtained from the Laplace approximation:
$$\ln P(D) \approx \ln P(D \mid \theta_{MAP}) + \ln P(\theta_{MAP}) + \frac{D}{2} \ln 2\pi - \frac{1}{2} \ln |A|$$
by taking the large-sample limit ($N \to \infty$), where $N$ is the number of data points:
$$\ln P(D) \approx \ln P(D \mid \theta_{MAP}) - \frac{1}{2} D \ln N$$

Quick and easy; does not depend on the prior. One can use the maximum likelihood estimate of $\theta$ instead of the MAP estimate. $D$ denotes the number of well-determined parameters.
Danger: counting parameters can be tricky (e.g. infinite models).

Variational Inference

Key Idea: Approximate the intractable distribution $p(\theta \mid D)$ with a simpler, tractable distribution $q(\theta)$.

We can lower-bound the marginal likelihood using Jensen's inequality:
$$\ln p(D) = \ln \int p(D, \theta)\, d\theta = \ln \int q(\theta)\, \frac{p(D, \theta)}{q(\theta)}\, d\theta$$
$$\geq \int q(\theta) \ln \frac{p(D, \theta)}{q(\theta)}\, d\theta = \int q(\theta) \ln p(D, \theta)\, d\theta + \underbrace{\int q(\theta) \ln \frac{1}{q(\theta)}\, d\theta}_{\text{entropy functional}}$$
$$= \ln p(D) - KL\big(q(\theta)\, \|\, p(\theta \mid D)\big) = \mathcal{L}(q) \quad \text{(variational lower bound)}$$
where $KL(q \| p)$ is the Kullback-Leibler divergence: a non-symmetric measure of the difference between two probability distributions $q$ and $p$.

The goal of variational inference is to maximize the variational lower bound w.r.t. the approximate distribution $q$, or equivalently to minimize $KL(q \| p)$.

Variational Inference

Key Idea: Approximate the intractable distribution $p(\theta \mid D)$ with a simpler, tractable distribution $q(\theta)$ by minimizing $KL\big(q(\theta)\, \|\, p(\theta \mid D)\big)$.

We can choose a fully factorized distribution $q(\theta) = \prod_{i=1}^D q_i(\theta_i)$, also known as a mean-field approximation. The variational lower bound takes the form:
$$\mathcal{L}(q) = \int q(\theta) \ln p(D, \theta)\, d\theta + \int q(\theta) \ln \frac{1}{q(\theta)}\, d\theta$$
$$= \int q_j(\theta_j) \underbrace{\Big[ \int \ln p(D, \theta) \prod_{i \neq j} q_i(\theta_i)\, d\theta_i \Big]}_{\mathbb{E}_{i \neq j}[\ln p(D, \theta)]}\, d\theta_j + \sum_i \int q_i(\theta_i) \ln \frac{1}{q_i(\theta_i)}\, d\theta_i$$

Suppose we keep $\{q_{i \neq j}\}$ fixed and maximize $\mathcal{L}(q)$ w.r.t. all possible forms for the distribution $q_j(\theta_j)$.

Variational Approximation

[Figure: the original distribution (yellow), along with the Laplace (red) and variational (green) approximations.]

By maximizing $\mathcal{L}(q)$ w.r.t. all possible forms for the distribution $q_j(\theta_j)$ we obtain a general expression:
$$q_j^*(\theta_j) = \frac{\exp\big( \mathbb{E}_{i \neq j}[\ln p(D, \theta)] \big)}{\int \exp\big( \mathbb{E}_{i \neq j}[\ln p(D, \theta)] \big)\, d\theta_j}$$

Iterative procedure: Initialize all $q_j$ and then iterate through the factors, replacing each in turn with a revised estimate. Convergence is guaranteed because the bound is convex w.r.t. each of the factors $q_j$ (see Bishop, chapter 10).
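As a concrete instance of these coordinate updates, here is a minimal sketch of mean-field updates for a 2-D Gaussian target, the classic example from Bishop, Section 10.1.2, where the optimal factors are themselves Gaussian and the update for each mean has closed form. The target's precision matrix and the initialization below are illustrative.

```python
import numpy as np

# Target: p(z) = N(mu, inv(Lam)); approximation: q(z) = q1(z1) q2(z2).
# Optimal factors (Bishop 10.1.2): q1 = N(m1, 1/Lam[0,0]) with
#   m1 = mu1 - (Lam[0,1] / Lam[0,0]) * (m2 - mu2),  and symmetrically for q2.
mu = np.array([0.0, 0.0])
Lam = np.array([[2.0, 1.2],           # precision matrix of the target
                [1.2, 2.0]])

m1, m2 = 5.0, -5.0                    # deliberately bad initialization
for it in range(20):                  # coordinate ascent; each step raises L(q)
    m1 = mu[0] - (Lam[0, 1] / Lam[0, 0]) * (m2 - mu[1])
    m2 = mu[1] - (Lam[1, 0] / Lam[1, 1]) * (m1 - mu[0])

print(f"q1 = N({m1:.4f}, {1/Lam[0,0]:.3f}),  q2 = N({m2:.4f}, {1/Lam[1,1]:.3f})")
# The factor means converge to the true means, but the factor variances
# 1/Lam[i,i] are smaller than the true marginal variances: mean-field
# systematically underestimates posterior uncertainty.
```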

Inference: Recap

For most situations we will be interested in evaluating the expectation:
$$\mathbb{E}[f] = \int f(z)\, p(z)\, dz$$
We will use the following notation: $p(z) = \tilde{p}(z)/Z$. We can evaluate $\tilde{p}(z)$ pointwise, but cannot evaluate $Z$.

Posterior distribution: $P(\theta \mid D) = \frac{1}{P(D)}\, P(D \mid \theta)\, P(\theta)$
Markov random fields: $P(z) = \frac{1}{Z} \exp(-E(z))$

Simple Monte Carlo

General Idea: Draw independent samples $\{z^{(1)}, \ldots, z^{(N)}\}$ from the distribution $p(z)$ to approximate the expectation:
$$\mathbb{E}[f] = \int f(z)\, p(z)\, dz \approx \frac{1}{N} \sum_{n=1}^N f(z^{(n)}) = \hat{f}$$

Note that $\mathbb{E}[f] = \mathbb{E}[\hat{f}]$, so the estimator $\hat{f}$ has the correct mean (unbiased). The variance:
$$\mathrm{var}[\hat{f}] = \frac{1}{N}\, \mathbb{E}\big[ (f - \mathbb{E}[f])^2 \big]$$

Remark: The accuracy of the estimator does not depend on the dimensionality of $z$.
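A minimal sketch of this estimator, assuming an illustrative choice of $f(z) = z^2$ under $p(z) = \mathcal{N}(0, 1)$, where the exact expectation is 1:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda z: z**2

for N in (10, 1_000, 100_000):
    z = rng.standard_normal(N)            # independent draws z_n ~ p(z)
    f_hat = f(z).mean()                   # (1/N) * sum_n f(z_n)
    # The standard error shrinks like 1/sqrt(N), regardless of dimension.
    stderr = f(z).std(ddof=1) / np.sqrt(N)
    print(f"N={N:>6}:  f_hat={f_hat:.4f}  (stderr ~ {stderr:.4f})")
```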

Simple Monte Carlo

In general:
$$\int f(z)\, p(z)\, dz \approx \frac{1}{N} \sum_{n=1}^N f(z^{(n)}), \quad z^{(n)} \sim p(z)$$

Predictive distribution:
$$P(x \mid D) = \int P(x \mid \theta, D)\, P(\theta \mid D)\, d\theta \approx \frac{1}{N} \sum_{n=1}^N P(x \mid \theta^{(n)}, D), \quad \theta^{(n)} \sim p(\theta \mid D)$$

Problem: It is hard to draw exact samples from $p(z)$.

Basic Sampling Algorithm

How to generate samples from a simple non-uniform distribution, assuming we can generate samples from the uniform distribution. Define the cumulative:
$$h(y) = \int_{-\infty}^{y} p(\hat{y})\, d\hat{y}$$
Sample $z \sim U[0, 1]$. Then $y = h^{-1}(z)$ is a sample from $p(y)$.

Problem: Computing the cumulative $h(y)$ is just as hard!
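A minimal sketch of this inverse-CDF trick for one of the few cases where $h^{-1}$ is available in closed form: the exponential distribution $p(y) = \lambda e^{-\lambda y}$, where $h(y) = 1 - e^{-\lambda y}$ and so $y = -\ln(1 - z)/\lambda$. The rate $\lambda$ and sample size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0

z = rng.uniform(size=100_000)        # z ~ U[0, 1]
y = -np.log(1.0 - z) / lam           # y = h^{-1}(z) ~ Exponential(lam)

print(f"sample mean {y.mean():.4f}  vs  exact 1/lam = {1/lam:.4f}")
```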

Rejection Sampling

Sampling from the target distribution $p(z) = \tilde{p}(z)/Z_p$ is difficult. Suppose we have an easy-to-sample proposal distribution $q(z)$ such that $k\, q(z) \geq \tilde{p}(z)$ for all $z$.

Sample $z_0$ from $q(z)$. Sample $u_0$ from $\mathrm{Uniform}[0, k\, q(z_0)]$. The pair $(z_0, u_0)$ has uniform distribution under the curve of $k\, q(z)$. If $u_0 > \tilde{p}(z_0)$, the sample is rejected; otherwise $z_0$ is accepted.

Rejection Sampling

The probability that a sample is accepted is:
$$p(\mathrm{accept}) = \int \frac{\tilde{p}(z)}{k\, q(z)}\, q(z)\, dz = \frac{1}{k} \int \tilde{p}(z)\, dz$$

The fraction of accepted samples depends on the ratio of the areas under $\tilde{p}(z)$ and $k\, q(z)$. It is hard to find an appropriate $q(z)$ with an optimal $k$. Rejection sampling is a useful technique in one or two dimensions, and is typically applied as a subroutine in more advanced algorithms.
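A minimal sketch under illustrative assumptions: the bimodal target $\tilde{p}$, the Gaussian proposal, and the grid-based choice of the envelope constant $k$ below are all choices made for this example, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized bimodal target and a broad Gaussian proposal q = N(0, 2^2).
p_tilde = lambda z: np.exp(-0.5 * (z - 1.5)**2) + 0.6 * np.exp(-0.5 * (z + 1.5)**2)
sigma_q = 2.0
q = lambda z: np.exp(-0.5 * (z / sigma_q)**2) / (sigma_q * np.sqrt(2 * np.pi))

# Envelope constant k so that k*q(z) >= p~(z); a grid check with 10% slack.
grid = np.linspace(-10, 10, 10_001)
k = 1.1 * np.max(p_tilde(grid) / q(grid))

z0 = rng.normal(0.0, sigma_q, size=100_000)   # z0 ~ q(z)
u0 = rng.uniform(0.0, k * q(z0))              # u0 ~ Uniform[0, k*q(z0)]
samples = z0[u0 <= p_tilde(z0)]               # keep points under the curve p~

Z_p = np.sqrt(2 * np.pi) * (1.0 + 0.6)        # exact area under p~ for this target
print(f"accept rate {samples.size / z0.size:.3f}  (theory Z_p/k = {Z_p / k:.3f})")
```

The empirical acceptance rate matches the $\frac{1}{k} \int \tilde{p}(z)\, dz$ formula above, and it degrades quickly as the envelope gets loose, which is exactly the curse in higher dimensions.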

Importance Sampling

Suppose we have an easy-to-sample proposal distribution $q(z)$, such that $q(z) > 0$ whenever $p(z) > 0$.
$$\mathbb{E}[f] = \int f(z)\, p(z)\, dz = \int f(z)\, \frac{p(z)}{q(z)}\, q(z)\, dz \approx \frac{1}{N} \sum_n \frac{p(z^{(n)})}{q(z^{(n)})}\, f(z^{(n)}), \quad z^{(n)} \sim q(z)$$

The quantities $w^{(n)} = p(z^{(n)})/q(z^{(n)})$ are known as importance weights. Unlike rejection sampling, all samples are retained.

But wait: we cannot compute $p(z)$, only $\tilde{p}(z)$.

Importance Sampling

Let our proposal be of the form $q(z) = \tilde{q}(z)/Z_q$:
$$\mathbb{E}[f] = \int f(z)\, p(z)\, dz = \frac{Z_q}{Z_p} \int f(z)\, \frac{\tilde{p}(z)}{\tilde{q}(z)}\, q(z)\, dz \approx \frac{Z_q}{Z_p}\, \frac{1}{N} \sum_n \tilde{w}^{(n)} f(z^{(n)}), \quad z^{(n)} \sim q(z)$$
where $\tilde{w}^{(n)} = \tilde{p}(z^{(n)})/\tilde{q}(z^{(n)})$.

But we can use the same importance weights to approximate $\frac{Z_p}{Z_q}$:
$$\frac{Z_p}{Z_q} = \frac{1}{Z_q} \int \tilde{p}(z)\, dz = \int \frac{\tilde{p}(z)}{\tilde{q}(z)}\, q(z)\, dz \approx \frac{1}{N} \sum_n \tilde{w}^{(n)}$$

Hence:
$$\mathbb{E}[f] \approx \frac{\sum_n \tilde{w}^{(n)} f(z^{(n)})}{\sum_n \tilde{w}^{(n)}}$$
This estimator is consistent but biased.
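A minimal sketch of this self-normalized estimator. The target (an unnormalized $\mathcal{N}(3, 1)$ shape), the proposal $\mathcal{N}(0, 3^2)$, and the effective-sample-size diagnostic are illustrative choices; working with log-weights avoids numerical overflow.

```python
import numpy as np

rng = np.random.default_rng(0)

log_p_tilde = lambda z: -0.5 * (z - 3.0)**2          # unnormalized log target
sigma_q = 3.0
log_q = lambda z: -0.5 * (z / sigma_q)**2 - np.log(sigma_q * np.sqrt(2 * np.pi))

z = rng.normal(0.0, sigma_q, size=100_000)           # z_n ~ q(z)
log_w = log_p_tilde(z) - log_q(z)                    # log importance weights
w = np.exp(log_w - log_w.max())                      # stabilize before exp

estimate = np.sum(w * z) / np.sum(w)                 # sum_n w_n f(z_n) / sum_n w_n
ess = w.sum()**2 / np.sum(w**2)                      # effective sample size
print(f"E[z] ~ {estimate:.3f} (exact 3.0),  ESS ~ {ess:.0f} of {z.size}")
```

The effective sample size is one standard diagnostic for the failure mode on the next slide: when $q$ matches $p$ poorly, a few huge weights dominate and the ESS collapses.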

Problems

If our proposal distribution $q(z)$ poorly matches our target distribution $p(z)$, then:
- Rejection sampling: almost always rejects.
- Importance sampling: has large, possibly infinite, variance (an unreliable estimator).

For high-dimensional problems, finding good proposal distributions is very hard. What can we do? Markov chain Monte Carlo.

Markov Chains

A first-order Markov chain is a series of random variables $\{z_1, \ldots, z_N\}$ such that the following conditional independence property holds for $n \in \{1, \ldots, N-1\}$:
$$p(z_{n+1} \mid z_1, \ldots, z_n) = p(z_{n+1} \mid z_n)$$

We can specify a Markov chain by:
- the probability distribution for the initial state, $p(z_1)$;
- the conditional probabilities for subsequent states, in the form of transition probabilities $T(z_{n+1} \mid z_n) \equiv p(z_{n+1} \mid z_n)$.

Remark: $T(z_{n+1} \mid z_n)$ is sometimes called a transition kernel.

Markov Chains

The marginal probability of a particular state can be computed as:
$$p(z_{n+1}) = \sum_{z_n} T(z_{n+1} \mid z_n)\, p(z_n)$$

A distribution $\pi(z)$ is said to be invariant (or stationary) with respect to a Markov chain if each step in the chain leaves $\pi(z)$ invariant:
$$\pi(z) = \sum_{z'} T(z \mid z')\, \pi(z')$$

A given Markov chain may have many stationary distributions. For example, if $T(z \mid z') = I\{z = z'\}$ is the identity transformation, then any distribution is invariant.

Detailed Balance

A sufficient (but not necessary) condition for ensuring that $\pi(z)$ is invariant is to choose a transition kernel that satisfies the detailed balance property:
$$\pi(z')\, T(z \mid z') = \pi(z)\, T(z' \mid z)$$

A transition kernel that satisfies detailed balance will leave that distribution invariant:
$$\sum_{z'} \pi(z')\, T(z \mid z') = \sum_{z'} \pi(z)\, T(z' \mid z) = \pi(z) \sum_{z'} T(z' \mid z) = \pi(z)$$

A Markov chain that satisfies detailed balance is said to be reversible.
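A minimal numeric sketch of this condition on a hand-built 3-state chain; the target distribution and the Metropolis-style kernel below are illustrative.

```python
import numpy as np

pi = np.array([0.2, 0.3, 0.5])                 # target stationary distribution

# Metropolis kernel w.r.t. a uniform proposal over the other states:
# accept a move i -> j with probability min(1, pi_j / pi_i).
n = len(pi)
T = np.zeros((n, n))                           # T[i, j] = prob of moving i -> j
for i in range(n):
    for j in range(n):
        if i != j:
            T[i, j] = (1.0 / (n - 1)) * min(1.0, pi[j] / pi[i])
    T[i, i] = 1.0 - T[i].sum()                 # rejected mass stays at i

# Detailed balance: pi_i T[i, j] == pi_j T[j, i], i.e. the flow matrix is symmetric.
flow = pi[:, None] * T
assert np.allclose(flow, flow.T)

# Invariance follows: pi is a left eigenvector of T with eigenvalue 1.
print(np.allclose(pi @ T, pi))                 # True
```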

Recap

We want to sample from a target distribution $\pi(z) = \tilde{\pi}(z)/Z$ (e.g. a posterior distribution). Obtaining independent samples is difficult.

- Set up a Markov chain with a transition kernel $T(z' \mid z)$ that leaves our target distribution $\pi(z)$ invariant.
- If the chain is ergodic, i.e. it is possible to go from every state to any other state (not necessarily in one move), then the chain will converge to this unique invariant distribution $\pi(z)$.
- We obtain dependent samples, drawn approximately from $\pi(z)$, by simulating the Markov chain for some time.

Ergodicity: there exists $K$ such that, for any starting $z$, $T^K(z' \mid z) > 0$ for all $z'$ with $\pi(z') > 0$.

Metropolis-Hastings Algorithm

A Markov chain transition operator from the current state $z$ to a new state $z'$ is defined as follows:
- A new candidate state $z^*$ is proposed according to some proposal distribution $q(z^* \mid z)$, e.g. $\mathcal{N}(z, \sigma^2)$.
- The candidate state $z^*$ is accepted with probability:
$$\min\left( 1,\; \frac{\pi(z^*)}{\pi(z)}\, \frac{q(z \mid z^*)}{q(z^* \mid z)} \right)$$
- If accepted, set $z' = z^*$. Otherwise $z' = z$: the next state is a copy of the current state.

Note: there is no need to know the normalizing constant $Z$, since it cancels in the ratio.

Metropolis-Hastings Algorithm

We can show that the M-H transition kernel leaves $\pi(z)$ invariant by showing that it satisfies detailed balance:
$$\pi(z)\, T(z' \mid z) = \pi(z)\, q(z' \mid z) \min\left( 1,\; \frac{\pi(z')}{\pi(z)}\, \frac{q(z \mid z')}{q(z' \mid z)} \right)$$
$$= \min\big( \pi(z)\, q(z' \mid z),\; \pi(z')\, q(z \mid z') \big)$$
$$= \pi(z')\, q(z \mid z') \min\left( \frac{\pi(z)}{\pi(z')}\, \frac{q(z' \mid z)}{q(z \mid z')},\; 1 \right) = \pi(z')\, T(z \mid z')$$

Note that whether the chain is ergodic will depend on the particulars of $\pi$ and the proposal distribution $q$.

Metropolis-Hastings Algorithm

[Figure: using the Metropolis algorithm to sample from a Gaussian distribution with proposal $q(z' \mid z) = \mathcal{N}(z, 0.04)$; accepted moves shown in green, rejected moves in red.]
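A minimal sketch of the experiment in the figure: a random-walk Metropolis chain targeting a standard Gaussian with the same proposal variance 0.04 (the chain length, starting point, and burn-in length are illustrative). The proposal is symmetric, so the $q$ ratio cancels in the acceptance probability.

```python
import numpy as np

rng = np.random.default_rng(0)

log_pi_tilde = lambda z: -0.5 * z**2         # unnormalized log target: N(0, 1)
sigma = 0.2                                   # proposal std, so variance 0.04

z = 3.0                                       # deliberately poor starting point
samples, n_accept = [], 0
for t in range(50_000):
    z_star = rng.normal(z, sigma)             # propose z* ~ q(z*|z) = N(z, sigma^2)
    # Accept with probability min(1, pi~(z*) / pi~(z)); compare in log space.
    if np.log(rng.uniform()) < log_pi_tilde(z_star) - log_pi_tilde(z):
        z, n_accept = z_star, n_accept + 1    # accept: move to z*
    samples.append(z)                         # a rejection duplicates the state

samples = np.array(samples[5_000:])           # discard burn-in
print(f"accept rate {n_accept/50_000:.2f},  mean {samples.mean():.3f},  std {samples.std():.3f}")
```

With such a small step size the acceptance rate is high but the chain diffuses slowly, which is precisely the trade-off discussed on the next slide.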

Choice of Proposal

Proposal distribution: $q(z' \mid z) = \mathcal{N}(z, \rho^2)$.
- $\rho$ large: many rejections.
- $\rho$ small: the chain moves too slowly.

The specific choice of proposal can greatly affect the performance of the algorithm.

Gibbs Sampler

Consider sampling from $p(z_1, \ldots, z_N)$.

Initialize $z_i$, $i = 1, \ldots, N$. For $t = 1, \ldots, T$:
- Sample $z_1^{t+1} \sim p(z_1 \mid z_2^t, \ldots, z_N^t)$
- Sample $z_2^{t+1} \sim p(z_2 \mid z_1^{t+1}, z_3^t, \ldots, z_N^t)$
- ...
- Sample $z_N^{t+1} \sim p(z_N \mid z_1^{t+1}, \ldots, z_{N-1}^{t+1})$

The Gibbs sampler is a particular instance of the M-H algorithm with proposals $p(z_n \mid z_{i \neq n})$, which are accepted with probability 1. We apply a series of these (componentwise) operators.

Gibbs Sampler

The applicability of the Gibbs sampler depends on how easy it is to sample from the conditional probabilities $p(z_n \mid z_{i \neq n})$.

For discrete random variables with a few discrete settings:
$$p(z_n \mid z_{i \neq n}) = \frac{p(z_n, z_{i \neq n})}{\sum_{z_n} p(z_n, z_{i \neq n})}$$
The sum can be computed analytically.

For continuous random variables:
$$p(z_n \mid z_{i \neq n}) = \frac{p(z_n, z_{i \neq n})}{\int p(z_n, z_{i \neq n})\, dz_n}$$
The integral is univariate and is often analytically tractable or amenable to standard sampling methods.
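A minimal sketch of a case where both conditionals are tractable: a bivariate Gaussian with correlation $\rho$, where $z_1 \mid z_2 \sim \mathcal{N}(\rho z_2, 1 - \rho^2)$ and symmetrically for $z_2 \mid z_1$. The target correlation, chain length, and burn-in are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8

z1, z2 = 4.0, -4.0                     # arbitrary initialization
samples = np.empty((20_000, 2))
for t in range(len(samples)):
    # Componentwise updates, each an exact draw from a full conditional.
    z1 = rng.normal(rho * z2, np.sqrt(1 - rho**2))
    z2 = rng.normal(rho * z1, np.sqrt(1 - rho**2))
    samples[t] = z1, z2

samples = samples[2_000:]              # discard burn-in
print("empirical correlation:", np.corrcoef(samples.T)[0, 1].round(3))  # ~0.8
```

Note that as $\rho \to 1$ the conditional variances shrink and the sampler takes tiny axis-aligned steps: strong correlation between components is the classic failure mode of componentwise Gibbs updates.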

Bayesian PMF

Remember the predictive distribution? Consider predicting a rating $r_{ij}^*$ for user $i$ and query movie $j$:
$$p(r_{ij}^* \mid R) = \int p(r_{ij}^* \mid u_i, v_j)\, \underbrace{p(U, V, \Theta_U, \Theta_V \mid R)}_{\text{posterior over parameters and hyperparameters}}\, d\{U, V\}\, d\{\Theta_U, \Theta_V\}$$

Use a Monte Carlo approximation:
$$p(r_{ij}^* \mid R) \approx \frac{1}{N} \sum_{n=1}^N p\big(r_{ij}^* \mid u_i^{(n)}, v_j^{(n)}\big).$$
The samples $(u_i^{(n)}, v_j^{(n)})$ are generated by running a Gibbs sampler whose stationary distribution is the posterior distribution of interest.

Bayesian PMF

Monte Carlo approximation:
$$p(r_{ij}^* \mid R) \approx \frac{1}{N} \sum_{n=1}^N p\big(r_{ij}^* \mid u_i^{(n)}, v_j^{(n)}\big).$$

The conditional distributions over the user and movie feature vectors are Gaussians, and thus easy to sample from:
$$p(u_i \mid R, V, \Theta_U, \alpha) = \mathcal{N}\big(u_i \mid \mu_i^*, \Sigma_i^*\big)$$
$$p(v_j \mid R, U, \Theta_V, \alpha) = \mathcal{N}\big(v_j \mid \mu_j^*, \Sigma_j^*\big)$$

The conditional distributions over the hyperparameters also have closed form, and are easy to sample from. On the Netflix dataset, Bayesian PMF can handle over 100 million ratings.

MCMC: Main Problems

Main problems of MCMC:
- Hard to diagnose convergence (burn-in).
- Sampling from isolated modes.

More advanced MCMC methods for sampling from distributions with isolated modes:
- Parallel tempering
- Simulated tempering
- Tempered transitions

Hamiltonian Monte Carlo methods (which make use of gradient information), nested sampling, coupling from the past, and many others.