Inference and Representation
1 Inference and Representation
David Sontag, New York University
Lecture 11, Nov. 24, 2015
2 Approximate marginal inference
Given the joint p(x_1, ..., x_n) represented as a graphical model, how do we perform marginal inference, e.g. to compute p(x_1 | e)?
We showed in Lecture 4 that doing this exactly is NP-hard.
Nearly all approximate inference algorithms are either:
1. Monte Carlo methods (e.g., Gibbs sampling, likelihood weighting, MCMC)
2. Variational algorithms (e.g., mean-field, loopy belief propagation)
3 Variational methods
Goal: Approximate a difficult distribution p(x | e) with a new distribution q(x) such that:
1. p(x | e) and q(x) are close
2. Computation on q(x) is easy
How should we measure distance between distributions?
The Kullback-Leibler divergence (KL-divergence) between two distributions p and q is defined as
    D(p || q) = Σ_x p(x) log [ p(x) / q(x) ]
(it measures the expected number of extra bits required to describe samples from p(x) using a code based on q instead of p)
D(p || q) ≥ 0 for all p, q, with equality if and only if p = q
Notice that KL-divergence is asymmetric
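The definition and its two properties (non-negativity and asymmetry) can be checked numerically. A minimal sketch with two hypothetical 3-state distributions (natural log, so the "bits" above become nats):

```python
import math

def kl(p, q):
    # D(p || q) = sum_x p(x) log(p(x)/q(x)); terms with p(x) = 0 contribute 0
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.4, 0.1]
q = [1/3, 1/3, 1/3]
d_pq = kl(p, q)   # extra nats to describe samples from p using a code built for q
d_qp = kl(q, p)   # generally different: the divergence is asymmetric
```

Both directions are non-negative and zero only when the arguments coincide, but d_pq and d_qp differ.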
4 KL-divergence (see Murphy)
    D(p || q) = Σ_x p(x) log [ p(x) / q(x) ]
Suppose p is the true distribution we wish to do inference with.
What is the difference between the solution to
    arg min_q D(p || q)    (called the M-projection of p onto Q)
and
    arg min_q D(q || p)    (called the I-projection)?
These two will differ only when q is minimized over a restricted set of probability distributions Q = {q_1, ...}, and in particular when p ∉ Q
5 KL-divergence: M-projection
    q* = arg min_{q ∈ Q} D(p || q) = arg min_{q ∈ Q} Σ_x p(x) log [ p(x) / q(x) ]
For example, suppose that p(z) is a 2D Gaussian and Q is the set of all Gaussian distributions with diagonal covariance matrices:
(Figure: p = green, q* = red)
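For this diagonal-Gaussian family the M-projection can be checked numerically: the minimizer keeps p's mean and per-coordinate variances and simply drops the correlation. A sketch using the standard closed-form KL between multivariate Gaussians; the particular p and the perturbed candidates are assumptions made for illustration:

```python
import numpy as np

def kl_gauss(mu0, S0, mu1, S1):
    # D(N(mu0, S0) || N(mu1, S1)) for multivariate Gaussians (standard formula)
    k = len(mu0)
    S1inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1inv @ S0) + diff @ S1inv @ diff - k
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

mu_p = np.array([0.0, 0.0])
S_p = np.array([[1.0, 0.8], [0.8, 1.0]])        # a correlated 2D Gaussian p

# moment-matched diagonal candidate: same mean, same marginal variances
d_match = kl_gauss(mu_p, S_p, mu_p, np.diag(np.diag(S_p)))

# a few perturbed diagonal candidates, all worse than the moment-matched one
others = [(mu_p, np.diag([0.5, 1.0])), (mu_p, np.diag([1.0, 2.0])),
          (np.array([0.3, 0.0]), np.diag([1.0, 1.0]))]
d_others = [kl_gauss(mu_p, S_p, mu1, S1) for mu1, S1 in others]
```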
6 KL-divergence: I-projection
    q* = arg min_{q ∈ Q} D(q || p) = arg min_{q ∈ Q} Σ_x q(x) log [ q(x) / p(x) ]
For example, suppose that p(z) is a 2D Gaussian and Q is the set of all Gaussian distributions with diagonal covariance matrices:
(Figure: p = green, q* = red)
7 KL-divergence (single Gaussian)
In this simple example, both the M-projection and I-projection find an approximate q(x) that has the correct mean (i.e., E_p[z] = E_q[z]).
(Figure: panels (a) and (b) show the two projections)
What if p(x) is multi-modal?
8 KL-divergence: M-projection (mixture of Gaussians)
    q* = arg min_{q ∈ Q} D(p || q) = arg min_{q ∈ Q} Σ_x p(x) log [ p(x) / q(x) ]
Now suppose that p(x) is a mixture of two 2D Gaussians and Q is the set of all 2D Gaussian distributions (with arbitrary covariance matrices):
(Figure: p = blue, q* = red)
The M-projection yields a distribution q(x) with the correct mean and covariance.
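Since Q here is all 2D Gaussians, the M-projection just matches the mixture's overall mean and covariance, which are available in closed form: E[x] = Σ_k w_k μ_k and Cov[x] = Σ_k w_k (Σ_k + μ_k μ_kᵀ) − E[x]E[x]ᵀ. A sketch with an assumed symmetric two-component mixture:

```python
import numpy as np

# Mixture of two 2D Gaussians: weights w, means mus, covariances Ss (all assumed)
w = np.array([0.5, 0.5])
mus = np.array([[-2.0, 0.0], [2.0, 0.0]])
Ss = np.array([[[1.0, 0.0], [0.0, 1.0]],
               [[1.0, 0.0], [0.0, 1.0]]])

# M-projection onto all Gaussians = moment matching on the mixture:
mean = (w[:, None] * mus).sum(axis=0)
second_moment = sum(wk * (Sk + np.outer(mk, mk)) for wk, mk, Sk in zip(w, mus, Ss))
cov = second_moment - np.outer(mean, mean)
```

For this mixture the matched Gaussian is centered between the two modes with inflated variance along the axis separating them.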
9 KL-divergence: I-projection (mixture of Gaussians)
    q* = arg min_{q ∈ Q} D(q || p) = arg min_{q ∈ Q} Σ_x q(x) log [ q(x) / p(x) ]
(Figure: p = blue, q* = red; two local minima!)
Unlike the M-projection, the I-projection does not always yield the correct moments.
Q: D(p || q) is convex, so why are there local minima?
A: We are using a parametric form for q (i.e., a Gaussian), and the objective is not convex in μ, Σ.
10 M-projection does moment matching
Recall that the M-projection is:
    q* = arg min_{q ∈ Q} D(p || q) = arg min_{q ∈ Q} Σ_x p(x) log [ p(x) / q(x) ]
Suppose that Q is an exponential family (p(x) can be arbitrary) and that we perform the M-projection, finding q*.
Theorem: The expected sufficient statistics, with respect to q*(x), are exactly the marginals of p(x):
    E_{q*}[f(x)] = E_p[f(x)]
Thus, solving for the M-projection (exactly) is just as hard as the original inference problem.
11 M-projection does moment matching
Recall that the M-projection is:
    q* = arg min_{q(x;η) ∈ Q} D(p || q) = arg min_{q(x;η) ∈ Q} Σ_x p(x) log [ p(x) / q(x) ]
Theorem: E_{q*}[f(x)] = E_p[f(x)].
Proof: Look at the first-order optimality conditions. Only the −Σ_x p(x) log q(x; η) term depends on η, so
    ∇_{η_i} D(p || q) = −∇_{η_i} Σ_x p(x) log q(x; η)
                      = −∇_{η_i} Σ_x p(x) log [ h(x) exp{ η·f(x) − ln Z(η) } ]
                      = −∇_{η_i} Σ_x p(x) [ η·f(x) − ln Z(η) ]
                      = −Σ_x p(x) f_i(x) + ∇_{η_i} ln Z(η)
                      = −E_p[f_i(x)] + E_{q(x;η)}[f_i(x)] = 0
(since ∇_{η_i} ln Z(η) = E_q[f_i(x)]).
Corollary: Even computing the gradients is hard (we can't do gradient descent).
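The gradient identity in the proof can be sanity-checked numerically on the simplest exponential family, a Bernoulli with natural parameter η (q(x; η) = exp(ηx − ln(1 + e^η)), f(x) = x). A hedged sketch comparing a central finite difference of D(p || q) against the closed form −E_p[f] + E_q[f]:

```python
import math

def log_q(x, eta):
    # Bernoulli in exponential-family form: log q(x; eta) = eta*x - ln(1 + e^eta)
    return eta * x - math.log(1.0 + math.exp(eta))

def D(p1, eta):
    # D(p || q) for p = Bernoulli(p1); terms with p(x) = 0 contribute nothing
    d = 0.0
    for x, px in [(0, 1.0 - p1), (1, p1)]:
        if px > 0:
            d += px * (math.log(px) - log_q(x, eta))
    return d

p1, eta, h = 0.3, 0.7, 1e-6
grad_fd = (D(p1, eta + h) - D(p1, eta - h)) / (2 * h)   # numeric gradient
E_q = math.exp(eta) / (1.0 + math.exp(eta))             # E_q[f(x)] = sigmoid(eta)
grad_closed = -p1 + E_q                                 # -E_p[f(x)] + E_q[f(x)]
```

Setting the closed-form gradient to zero recovers moment matching: sigmoid(η*) = p1.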
12 Most variational inference algorithms make use of the I-projection.
13 Variational methods
Suppose that we have an arbitrary graphical model:
    p(x; θ) = (1 / Z(θ)) Π_{c ∈ C} φ_c(x_c) = exp( Σ_{c ∈ C} θ_c(x_c) − ln Z(θ) )
All of the approaches begin as follows:
    D(q || p) = Σ_x q(x) ln [ q(x) / p(x) ]
              = −Σ_x q(x) ln p(x) − Σ_x q(x) ln [ 1 / q(x) ]
              = −Σ_x q(x) ( Σ_{c ∈ C} θ_c(x_c) − ln Z(θ) ) − H(q(x))
              = −Σ_{c ∈ C} Σ_x q(x) θ_c(x_c) + Σ_x q(x) ln Z(θ) − H(q(x))
              = −Σ_{c ∈ C} E_q[θ_c(x_c)] + ln Z(θ) − H(q(x)).
14 Mean field algorithms for variational inference
    max_{q ∈ Q} Σ_{c ∈ C} Σ_{x_c} q(x_c) θ_c(x_c) + H(q(x))
Although this function is concave and thus in theory should be easy to optimize, we need some compact way of representing q(x).
Mean field algorithms assume a factored representation of the joint distribution, e.g.
    q(x) = Π_{i ∈ V} q_i(x_i)
(called naive mean field; figure: fully factorized variational distribution, from Eric Xing, CMU)
15 Naive mean-field
Suppose that Q consists of all fully factored distributions, of the form q(x) = Π_{i ∈ V} q_i(x_i).
We can use this to simplify
    max_{q ∈ Q} Σ_{c ∈ C} Σ_{x_c} q(x_c) θ_c(x_c) + H(q)
First, note that q(x_c) = Π_{i ∈ c} q_i(x_i).
Next, notice that the joint entropy decomposes as a sum of local entropies:
    H(q) = −Σ_x q(x) ln q(x)
         = −Σ_x q(x) ln Π_{i ∈ V} q_i(x_i)
         = −Σ_x q(x) Σ_{i ∈ V} ln q_i(x_i)
         = −Σ_{i ∈ V} Σ_x q(x) ln q_i(x_i)
         = −Σ_{i ∈ V} Σ_{x_i} q_i(x_i) ln q_i(x_i) Σ_{x_{V∖i}} q(x_{V∖i} | x_i)
         = Σ_{i ∈ V} H(q_i).
16 Naive mean-field
Suppose that Q consists of all fully factored distributions, of the form q(x) = Π_{i ∈ V} q_i(x_i).
We can use this to simplify
    max_{q ∈ Q} Σ_{c ∈ C} Σ_{x_c} q(x_c) θ_c(x_c) + H(q)
First, note that q(x_c) = Π_{i ∈ c} q_i(x_i).
Next, notice that the joint entropy decomposes as H(q) = Σ_{i ∈ V} H(q_i).
Putting these together, we obtain the following variational objective:
    max_q Σ_{c ∈ C} Σ_{x_c} θ_c(x_c) Π_{i ∈ c} q_i(x_i) + Σ_{i ∈ V} H(q_i)
subject to the constraints
    q_i(x_i) ≥ 0    ∀ i ∈ V, x_i ∈ Val(X_i)
    Σ_{x_i ∈ Val(X_i)} q_i(x_i) = 1    ∀ i ∈ V
17 Naive mean-field for pairwise MRFs
How do we maximize the variational objective?
    max_q Σ_{ij ∈ E} Σ_{x_i, x_j} θ_ij(x_i, x_j) q_i(x_i) q_j(x_j) − Σ_{i ∈ V} Σ_{x_i} q_i(x_i) ln q_i(x_i)    (*)
This is a non-concave optimization problem, with many local maxima!
Nonetheless, we can greedily maximize it using block coordinate ascent:
1. Iterate over each of the variables i ∈ V. For variable i,
2. fully maximize (*) with respect to {q_i(x_i), x_i ∈ Val(X_i)}.
3. Repeat until convergence.
Constructing the Lagrangian, taking the derivative, setting it to zero, and solving yields the update (shown on blackboard):
    q_i(x_i) ← (1 / Z_i) exp{ θ_i(x_i) + Σ_{j ∈ N(i)} Σ_{x_j} q_j(x_j) θ_ij(x_i, x_j) }
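The block coordinate ascent above can be sketched end to end on a small pairwise MRF. Everything concrete here (a 4-node binary cycle, random potentials, 100 sweeps) is an illustrative assumption; exact marginals by brute-force enumeration are computed alongside for comparison:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]                # a cycle of binary variables
theta_i = rng.normal(size=(n, 2))                        # unary log-potentials theta_i(x_i)
theta_ij = {e: rng.normal(size=(2, 2)) for e in edges}   # pairwise theta_ij(x_i, x_j)

q = np.full((n, 2), 0.5)                                 # initialize each q_i uniformly
for _ in range(100):                                     # block coordinate ascent sweeps
    for i in range(n):
        s = theta_i[i].copy()
        for (a, b), th in theta_ij.items():
            if a == i:
                s += th @ q[b]                           # sum_{x_j} theta_ij(x_i, x_j) q_j(x_j)
            elif b == i:
                s += th.T @ q[a]
        q[i] = np.exp(s - s.max())                       # the update from the slide,
        q[i] /= q[i].sum()                               # normalized by Z_i

# exact marginal of x_0 by enumeration, for comparison
logp = np.zeros((2,) * n)
for x in itertools.product(range(2), repeat=n):
    logp[x] = sum(theta_i[i][x[i]] for i in range(n)) \
            + sum(theta_ij[(a, b)][x[a], x[b]] for (a, b) in edges)
p = np.exp(logp - logp.max())
p /= p.sum()
exact_m0 = p.sum(axis=(1, 2, 3))
```

Each sweep only ever improves the objective, but which local maximum it lands in depends on the initialization.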
18 How accurate will the approximation be?
Consider a distribution which is an XOR of two binary variables A and B:
    p(a, b) = 0.5 − ε if a ≠ b,    p(a, b) = ε if a = b
(Figure: contour plot of the variational objective as a function of Q(a = 1) and Q(b = 1))
Even for a single edge, mean field can give very wrong answers!
Interestingly, once ε > 0.1, mean field has a single maximum point at the uniform distribution (thus, exact).
19 Structured mean-field approximations
Rather than assuming a fully-factored distribution for q, we can use a structured approximation, such as a spanning tree.
For example, for a factorial HMM, a good approximation may be a product of chain-structured models.
(Fig. 5.4: Structured mean field approximation for a factorial HMM. The original model consists of a set of hidden Markov models, defined on chains, coupled at each time step.)
20 Recall our starting place for variational methods...
Suppose that we have an arbitrary graphical model:
    p(x; θ) = (1 / Z(θ)) Π_{c ∈ C} φ_c(x_c) = exp( Σ_{c ∈ C} θ_c(x_c) − ln Z(θ) )
All of the approaches begin as follows:
    D(q || p) = Σ_x q(x) ln [ q(x) / p(x) ]
              = −Σ_x q(x) ln p(x) − Σ_x q(x) ln [ 1 / q(x) ]
              = −Σ_x q(x) ( Σ_{c ∈ C} θ_c(x_c) − ln Z(θ) ) − H(q(x))
              = −Σ_{c ∈ C} Σ_x q(x) θ_c(x_c) + Σ_x q(x) ln Z(θ) − H(q(x))
              = −Σ_{c ∈ C} E_q[θ_c(x_c)] + ln Z(θ) − H(q(x)).
21 The log-partition function
Since D(q || p) ≥ 0, we have
    −Σ_{c ∈ C} E_q[θ_c(x_c)] + ln Z(θ) − H(q(x)) ≥ 0,
which implies that
    ln Z(θ) ≥ Σ_{c ∈ C} E_q[θ_c(x_c)] + H(q(x)).
Thus, any approximating distribution q(x) gives a lower bound on the log-partition function (for a BN, this is the log probability of the observed variables).
Recall that D(q || p) = 0 if and only if p = q. Thus, if we allow ourselves to optimize over all distributions, we have:
    ln Z(θ) = max_q Σ_{c ∈ C} E_q[θ_c(x_c)] + H(q(x)).
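The bound can be checked by brute force on a tiny model. A sketch with a hypothetical two-variable pairwise MRF: any random q lower-bounds ln Z, and the bound is tight exactly at q = p:

```python
import itertools
import math
import random

random.seed(0)
states = list(itertools.product(range(2), repeat=2))
th1 = [0.5, -0.3]                       # theta_1(x_1), assumed values
th2 = [0.2, 0.9]                        # theta_2(x_2)
th12 = [[1.0, -1.0], [-1.0, 1.0]]       # theta_12(x_1, x_2)

def score(x):
    return th1[x[0]] + th2[x[1]] + th12[x[0]][x[1]]

lnZ = math.log(sum(math.exp(score(x)) for x in states))

def bound(q):
    # sum_c E_q[theta_c(x_c)] + H(q)
    e = sum(qx * score(x) for qx, x in zip(q, states))
    h = -sum(qx * math.log(qx) for qx in q if qx > 0)
    return e + h

raw = [random.random() for _ in states]
q_rand = [r / sum(raw) for r in raw]                    # an arbitrary distribution
p = [math.exp(score(x) - lnZ) for x in states]          # the true distribution
```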
22 Re-writing objective in terms of moments
    ln Z(θ) = max_q Σ_{c ∈ C} E_q[θ_c(x_c)] + H(q(x))
            = max_q Σ_{c ∈ C} Σ_x q(x) θ_c(x_c) + H(q(x))
            = max_q Σ_{c ∈ C} Σ_{x_c} q(x_c) θ_c(x_c) + H(q(x)).
Now assume that p(x) is in the exponential family, and let f(x) be its sufficient statistic vector.
Define μ_q = E_q[f(x)] to be the marginals of q(x).
We can re-write the objective as
    ln Z(θ) = max_{μ ∈ M} max_{q : E_q[f(x)] = μ} Σ_{c ∈ C} Σ_{x_c} θ_c(x_c) μ_c(x_c) + H(q(x)),
where M, the marginal polytope, consists of all valid marginal vectors.
23 Re-writing objective in terms of moments
Next, push the max over q inside to obtain:
    ln Z(θ) = max_{μ ∈ M} Σ_{c ∈ C} Σ_{x_c} θ_c(x_c) μ_c(x_c) + H(μ),    where H(μ) = max_{q : E_q[f(x)] = μ} H(q)
Does this look familiar?
For discrete random variables, the marginal polytope M is given by
    M = { μ ∈ R^d : μ = Σ_{x ∈ X^m} p(x) f(x) for some p(x) ≥ 0 with Σ_{x ∈ X^m} p(x) = 1 }
      = conv{ f(x), x ∈ X^m }
(conv denotes the convex hull operation).
For a discrete-variable MRF, the sufficient statistic vector f(x) is simply the concatenation of indicator functions for each clique of variables that appear together in a potential function.
For example, if we have a pairwise MRF on binary variables with m = |V| variables and |E| edges, d = 2m + 4|E|.
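The indicator construction and the dimension count d = 2m + 4|E| are easy to verify in code. A small sketch for an assumed 3-node binary MRF with all three edges (each assignment's vector is a vertex of M):

```python
import itertools

def f(x, n, edges):
    # Sufficient statistics for a pairwise binary MRF:
    # 2 node indicators per variable, 4 edge indicators per edge.
    v = []
    for i in range(n):
        v += [int(x[i] == 0), int(x[i] == 1)]
    for (a, b) in edges:
        for s, t in itertools.product(range(2), repeat=2):
            v.append(int(x[a] == s and x[b] == t))
    return v

n, edges = 3, [(0, 1), (1, 2), (0, 2)]
d = len(f((0, 1, 0), n, edges))   # should equal 2*n + 4*|E|
```

Exactly one indicator fires per variable and per edge, so every vertex f(x) sums to |V| + |E|.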
24 Marginal polytope for discrete MRFs (Wainwright & Jordan, '03)
(Figure 2-1: Illustration of the marginal polytope for a Markov random field with three nodes that have states in {0, 1}. The vertices correspond one-to-one with the sufficient statistic vectors of global assignments to (X_1, X_2, X_3); each vector concatenates the node indicators and the edge indicators for X_1X_2, X_1X_3, and X_2X_3. A convex combination such as μ = ½ μ¹ + ½ μ² of two vertices is a valid marginal vector.)
25 Relaxation
    ln Z(θ) = max_{μ ∈ M} Σ_{c ∈ C} Σ_{x_c} θ_c(x_c) μ_c(x_c) + H(μ)
We still haven't achieved anything, because:
1. The marginal polytope M is complex to describe (in general, exponentially many vertices and facets)
2. H(μ) is very difficult to compute or optimize over
We now make two approximations:
1. We replace M with a relaxation of the marginal polytope, e.g. the local consistency constraints M_L
2. We replace H(μ) with a function H̃(μ) which approximates H(μ)
26 Local consistency constraints
Force every cluster of variables to choose a local assignment:
    μ_i(x_i) ≥ 0    ∀ i ∈ V, x_i
    Σ_{x_i} μ_i(x_i) = 1    ∀ i ∈ V
    μ_ij(x_i, x_j) ≥ 0    ∀ ij ∈ E, x_i, x_j
    Σ_{x_i, x_j} μ_ij(x_i, x_j) = 1    ∀ ij ∈ E
Enforce that these local assignments are globally consistent:
    μ_i(x_i) = Σ_{x_j} μ_ij(x_i, x_j)    ∀ ij ∈ E, x_i
    μ_j(x_j) = Σ_{x_i} μ_ij(x_i, x_j)    ∀ ij ∈ E, x_j
The local consistency polytope M_L is defined by these constraints.
Theorem: The local consistency constraints exactly define the marginal polytope for a tree-structured MRF.
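On graphs with cycles, locally consistent pseudomarginals need not come from any joint distribution. The classic example is perfect anti-correlation on a triangle: uniform node marginals with each edge insisting its endpoints disagree. A sketch (the consistency checker is a hypothetical helper, not from the lecture):

```python
import itertools
import numpy as np

n, edges = 3, [(0, 1), (1, 2), (0, 2)]
mu_i = {i: np.array([0.5, 0.5]) for i in range(n)}
mu_ij = {e: np.array([[0.0, 0.5], [0.5, 0.0]]) for e in edges}  # endpoints must disagree

def locally_consistent(mu_i, mu_ij, tol=1e-9):
    # check the nonnegativity, normalization, and marginalization constraints
    ok = all(abs(m.sum() - 1) < tol and (m >= -tol).all() for m in mu_i.values())
    for (a, b), m in mu_ij.items():
        ok = ok and abs(m.sum() - 1) < tol and (m >= -tol).all()
        ok = ok and np.allclose(m.sum(axis=1), mu_i[a]) \
               and np.allclose(m.sum(axis=0), mu_i[b])
    return ok

def realizable(mu_ij):
    # any joint q with these marginals must put all its mass on assignments
    # where every edge marginal is positive; here no such assignment exists,
    # since x0 != x1, x1 != x2, x0 != x2 cannot all hold for binary variables
    return any(all(mu_ij[(a, b)][x[a], x[b]] > 0 for (a, b) in edges)
               for x in itertools.product(range(2), repeat=n))
```

So these pseudomarginals lie in M_L but not in M, which is exactly the gap the relaxation introduces.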
27 Entropy for tree-structured models
Suppose that p is a tree-structured distribution, so that we are optimizing only over marginals μ_ij(x_i, x_j) for ij ∈ T.
The solution to arg max_{q : E_q[f(x)] = μ} H(q) is a tree-structured MRF (c.f. Lecture 10, maximum entropy estimation).
The entropy of q as a function of its marginals can be shown to be
    H(μ) = Σ_{i ∈ V} H(μ_i) − Σ_{ij ∈ T} I(μ_ij)
where
    H(μ_i) = −Σ_{x_i} μ_i(x_i) log μ_i(x_i)
    I(μ_ij) = Σ_{x_i, x_j} μ_ij(x_i, x_j) log [ μ_ij(x_i, x_j) / (μ_i(x_i) μ_j(x_j)) ]
Can we use this for non-tree structured models?
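The decomposition H = Σ node entropies − Σ edge mutual informations is exact on trees, and can be verified on a random chain p(x) = p(x_0) p(x_1|x_0) p(x_2|x_1); the construction below is an illustrative assumption:

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(2)
p0 = rng.dirichlet([1, 1])              # p(x0)
p10 = rng.dirichlet([1, 1], size=2)     # p(x1 | x0), one row per value of x0
p21 = rng.dirichlet([1, 1], size=2)     # p(x2 | x1)

p = np.zeros((2, 2, 2))
for x in itertools.product(range(2), repeat=3):
    p[x] = p0[x[0]] * p10[x[0], x[1]] * p21[x[1], x[2]]

H_exact = -sum(px * math.log(px) for px in p.ravel() if px > 0)

# node and edge marginals of the chain
mu = [p.sum(axis=(1, 2)), p.sum(axis=(0, 2)), p.sum(axis=(0, 1))]
mu01, mu12 = p.sum(axis=2), p.sum(axis=0)

def H(m):
    return -sum(v * math.log(v) for v in np.ravel(m) if v > 0)

def I(m2, ma, mb):
    return sum(m2[i, j] * math.log(m2[i, j] / (ma[i] * mb[j]))
               for i in range(2) for j in range(2) if m2[i, j] > 0)

H_tree = sum(H(m) for m in mu) - I(mu01, mu[0], mu[1]) - I(mu12, mu[1], mu[2])
```

On a graph with cycles the same expression is no longer exact, which is precisely the Bethe approximation of the next slide.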
28 Bethe free energy approximation
The Bethe entropy approximation is (for any graph)
    H_bethe(μ) = Σ_{i ∈ V} H(μ_i) − Σ_{ij ∈ E} I(μ_ij)
This gives the following variational approximation:
    max_{μ ∈ M_L} Σ_{c ∈ C} Σ_{x_c} θ_c(x_c) μ_c(x_c) + H_bethe(μ)
For non tree-structured models this is not concave, and is hard to maximize.
Loopy belief propagation, if it converges, finds a saddle point!
29 Concave relaxation
Let H̃(μ) be an upper bound on H(μ), i.e. H̃(μ) ≥ H(μ).
As a result, we obtain the following upper bound on the log-partition function:
    ln Z(θ) ≤ max_{μ ∈ M_L} Σ_{c ∈ C} Σ_{x_c} θ_c(x_c) μ_c(x_c) + H̃(μ)
An example of a concave entropy upper bound is the tree-reweighted approximation (Wainwright, Jaakkola & Willsky, '05), given by specifying a distribution over spanning trees of the graph.
(Figure 1: Illustration of the spanning tree polytope T(G). Probability 1/3 is assigned to each of three spanning trees of the original graph. Edge b is a bridge, so it must appear in every spanning tree, giving ρ_b = 1; edges e and f appear in two and one of the spanning trees respectively, giving edge appearance probabilities ρ_e = 2/3 and ρ_f = 1/3.)
Letting {ρ_ij} denote edge appearance probabilities, we have:
    H_TRW(μ) = Σ_{i ∈ V} H(μ_i) − Σ_{ij ∈ E} ρ_ij I(μ_ij)
30 Comparison of LBP and TRW
We showed two approximation methods, both making use of the local consistency constraints M_L on the marginal polytope:
1. Bethe free energy approximation (for pairwise MRFs):
    max_{μ ∈ M_L} Σ_{ij ∈ E} Σ_{x_i, x_j} μ_ij(x_i, x_j) θ_ij(x_i, x_j) + Σ_{i ∈ V} H(μ_i) − Σ_{ij ∈ E} I(μ_ij)
   Not concave; can use the concave-convex procedure to find local optima.
   Loopy BP, if it converges, finds a saddle point (often a local maximum).
2. Tree-reweighted approximation (for pairwise MRFs):
    max_{μ ∈ M_L} Σ_{ij ∈ E} Σ_{x_i, x_j} μ_ij(x_i, x_j) θ_ij(x_i, x_j) + Σ_{i ∈ V} H(μ_i) − Σ_{ij ∈ E} ρ_ij I(μ_ij)    (**)
   {ρ_ij} are edge appearance probabilities (must be consistent with some set of spanning trees).
   This is concave! Find the global maximum using projected gradient ascent.
   Provides an upper bound on the log-partition function, i.e. ln Z(θ) ≤ (**).
31 Two types of variational algorithms: mean-field and relaxation
    max_{q ∈ Q} Σ_{c ∈ C} Σ_{x_c} q(x_c) θ_c(x_c) + H(q(x))
Although this function is concave and thus in theory should be easy to optimize, we need some compact way of representing q(x).
Relaxation algorithms work directly with pseudomarginals, which may not be consistent with any joint distribution.
Mean-field algorithms assume a factored representation of the joint distribution, e.g.
    q(x) = Π_{i ∈ V} q_i(x_i)
(called naive mean field; figure: fully factorized variational distribution, from Eric Xing, CMU)
32 Naive mean-field
Using the same notation as in the rest of the lecture, naive mean-field is:
    max_μ Σ_{c ∈ C} Σ_{x_c} θ_c(x_c) μ_c(x_c) + Σ_{i ∈ V} H(μ_i)
subject to
    μ_i(x_i) ≥ 0    ∀ i ∈ V, x_i ∈ Val(X_i)
    Σ_{x_i ∈ Val(X_i)} μ_i(x_i) = 1    ∀ i ∈ V
    μ_c(x_c) = Π_{i ∈ c} μ_i(x_i)
Corresponds to optimizing over an inner bound on the marginal polytope.
We obtain a lower bound on the log-partition function, i.e. the optimal value is ≤ ln Z(θ).
(Fig. 5.3: Cartoon illustration of the set M_F(G) of mean parameters that arise from tractable distributions, a nonconvex inner bound on M(G). Illustrated is the case of discrete random variables, where M(G) is a polytope; the circles correspond to mean parameters that arise from delta distributions, and belong to both M(G) and M_F(G).)
More informationDirichlet Processes A gentle tutorial
Dirichlet Processes A gentle tutorial SELECT Lab Meeting October 14, 2008 Khalid El-Arini Motivation We are given a data set, and are told that it was generated from a mixture of Gaussian distributions.
More informationPerron vector Optimization applied to search engines
Perron vector Optimization applied to search engines Olivier Fercoq INRIA Saclay and CMAP Ecole Polytechnique May 18, 2011 Web page ranking The core of search engines Semantic rankings (keywords) Hyperlink
More informationModern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh
Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem
More informationDefinition 11.1. Given a graph G on n vertices, we define the following quantities:
Lecture 11 The Lovász ϑ Function 11.1 Perfect graphs We begin with some background on perfect graphs. graphs. First, we define some quantities on Definition 11.1. Given a graph G on n vertices, we define
More informationProximal mapping via network optimization
L. Vandenberghe EE236C (Spring 23-4) Proximal mapping via network optimization minimum cut and maximum flow problems parametric minimum cut problem application to proximal mapping Introduction this lecture:
More informationConditional mean field
Conditional mean field Peter Carbonetto Department of Computer Science University of British Columbia Vancouver, BC, Canada V6T 1Z4 pcarbo@cs.ubc.ca Nando de Freitas Department of Computer Science University
More information5 Directed acyclic graphs
5 Directed acyclic graphs (5.1) Introduction In many statistical studies we have prior knowledge about a temporal or causal ordering of the variables. In this chapter we will use directed graphs to incorporate
More informationTail inequalities for order statistics of log-concave vectors and applications
Tail inequalities for order statistics of log-concave vectors and applications Rafał Latała Based in part on a joint work with R.Adamczak, A.E.Litvak, A.Pajor and N.Tomczak-Jaegermann Banff, May 2011 Basic
More informationWhy? A central concept in Computer Science. Algorithms are ubiquitous.
Analysis of Algorithms: A Brief Introduction Why? A central concept in Computer Science. Algorithms are ubiquitous. Using the Internet (sending email, transferring files, use of search engines, online
More informationThe Expectation Maximization Algorithm A short tutorial
The Expectation Maximiation Algorithm A short tutorial Sean Borman Comments and corrections to: em-tut at seanborman dot com July 8 2004 Last updated January 09, 2009 Revision history 2009-0-09 Corrected
More informationLecture 1: Course overview, circuits, and formulas
Lecture 1: Course overview, circuits, and formulas Topics in Complexity Theory and Pseudorandomness (Spring 2013) Rutgers University Swastik Kopparty Scribes: John Kim, Ben Lund 1 Course Information Swastik
More information1. Prove that the empty set is a subset of every set.
1. Prove that the empty set is a subset of every set. Basic Topology Written by Men-Gen Tsai email: b89902089@ntu.edu.tw Proof: For any element x of the empty set, x is also an element of every set since
More informationEquilibrium computation: Part 1
Equilibrium computation: Part 1 Nicola Gatti 1 Troels Bjerre Sorensen 2 1 Politecnico di Milano, Italy 2 Duke University, USA Nicola Gatti and Troels Bjerre Sørensen ( Politecnico di Milano, Italy, Equilibrium
More informationTutorial on Markov Chain Monte Carlo
Tutorial on Markov Chain Monte Carlo Kenneth M. Hanson Los Alamos National Laboratory Presented at the 29 th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Technology,
More informationProbabilistic user behavior models in online stores for recommender systems
Probabilistic user behavior models in online stores for recommender systems Tomoharu Iwata Abstract Recommender systems are widely used in online stores because they are expected to improve both user
More informationMonte Carlo-based statistical methods (MASM11/FMS091)
Monte Carlo-based statistical methods (MASM11/FMS091) Jimmy Olsson Centre for Mathematical Sciences Lund University, Sweden Lecture 5 Sequential Monte Carlo methods I February 5, 2013 J. Olsson Monte Carlo-based
More informationLecture 3: Finding integer solutions to systems of linear equations
Lecture 3: Finding integer solutions to systems of linear equations Algorithmic Number Theory (Fall 2014) Rutgers University Swastik Kopparty Scribe: Abhishek Bhrushundi 1 Overview The goal of this lecture
More informationα = u v. In other words, Orthogonal Projection
Orthogonal Projection Given any nonzero vector v, it is possible to decompose an arbitrary vector u into a component that points in the direction of v and one that points in a direction orthogonal to v
More informationGraphical Models, Exponential Families, and Variational Inference
Foundations and Trends R in Machine Learning Vol. 1, Nos. 1 2 (2008) 1 305 c 2008 M. J. Wainwright and M. I. Jordan DOI: 10.1561/2200000001 Graphical Models, Exponential Families, and Variational Inference
More informationSampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data
Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data (Oxford) in collaboration with: Minjie Xu, Jun Zhu, Bo Zhang (Tsinghua) Balaji Lakshminarayanan (Gatsby) Bayesian
More informationAdaptive Online Gradient Descent
Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650
More information5.1 Bipartite Matching
CS787: Advanced Algorithms Lecture 5: Applications of Network Flow In the last lecture, we looked at the problem of finding the maximum flow in a graph, and how it can be efficiently solved using the Ford-Fulkerson
More informationSeveral Views of Support Vector Machines
Several Views of Support Vector Machines Ryan M. Rifkin Honda Research Institute USA, Inc. Human Intention Understanding Group 2007 Tikhonov Regularization We are considering algorithms of the form min
More informationWalrasian Demand. u(x) where B(p, w) = {x R n + : p x w}.
Walrasian Demand Econ 2100 Fall 2015 Lecture 5, September 16 Outline 1 Walrasian Demand 2 Properties of Walrasian Demand 3 An Optimization Recipe 4 First and Second Order Conditions Definition Walrasian
More informationSection 5. Stan for Big Data. Bob Carpenter. Columbia University
Section 5. Stan for Big Data Bob Carpenter Columbia University Part I Overview Scaling and Evaluation data size (bytes) 1e18 1e15 1e12 1e9 1e6 Big Model and Big Data approach state of the art big model
More information1 Introduction. 2 Prediction with Expert Advice. Online Learning 9.520 Lecture 09
1 Introduction Most of the course is concerned with the batch learning problem. In this lecture, however, we look at a different model, called online. Let us first compare and contrast the two. In batch
More informationMATH10212 Linear Algebra. Systems of Linear Equations. Definition. An n-dimensional vector is a row or a column of n numbers (or letters): a 1.
MATH10212 Linear Algebra Textbook: D. Poole, Linear Algebra: A Modern Introduction. Thompson, 2006. ISBN 0-534-40596-7. Systems of Linear Equations Definition. An n-dimensional vector is a row or a column
More information954 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 3, MARCH 2005
954 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 3, MARCH 2005 Using Linear Programming to Decode Binary Linear Codes Jon Feldman, Martin J. Wainwright, Member, IEEE, and David R. Karger, Associate
More informationIntroduction to General and Generalized Linear Models
Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby
More information8.1 Min Degree Spanning Tree
CS880: Approximations Algorithms Scribe: Siddharth Barman Lecturer: Shuchi Chawla Topic: Min Degree Spanning Tree Date: 02/15/07 In this lecture we give a local search based algorithm for the Min Degree
More informationLecture 7: NP-Complete Problems
IAS/PCMI Summer Session 2000 Clay Mathematics Undergraduate Program Basic Course on Computational Complexity Lecture 7: NP-Complete Problems David Mix Barrington and Alexis Maciel July 25, 2000 1. Circuit
More information1 Maximum likelihood estimation
COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N
More informationExponential Random Graph Models for Social Network Analysis. Danny Wyatt 590AI March 6, 2009
Exponential Random Graph Models for Social Network Analysis Danny Wyatt 590AI March 6, 2009 Traditional Social Network Analysis Covered by Eytan Traditional SNA uses descriptive statistics Path lengths
More informationLecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs
CSE599s: Extremal Combinatorics November 21, 2011 Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs Lecturer: Anup Rao 1 An Arithmetic Circuit Lower Bound An arithmetic circuit is just like
More informationCSCI567 Machine Learning (Fall 2014)
CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /
More informationScheduling Home Health Care with Separating Benders Cuts in Decision Diagrams
Scheduling Home Health Care with Separating Benders Cuts in Decision Diagrams André Ciré University of Toronto John Hooker Carnegie Mellon University INFORMS 2014 Home Health Care Home health care delivery
More informationCHAPTER 2 Estimating Probabilities
CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a
More information2.1 Complexity Classes
15-859(M): Randomized Algorithms Lecturer: Shuchi Chawla Topic: Complexity classes, Identity checking Date: September 15, 2004 Scribe: Andrew Gilpin 2.1 Complexity Classes In this lecture we will look
More informationAdaptive Linear Programming Decoding
Adaptive Linear Programming Decoding Mohammad H. Taghavi and Paul H. Siegel ECE Department, University of California, San Diego Email: (mtaghavi, psiegel)@ucsd.edu ISIT 2006, Seattle, USA, July 9 14, 2006
More informationLecture 11: 0-1 Quadratic Program and Lower Bounds
Lecture : - Quadratic Program and Lower Bounds (3 units) Outline Problem formulations Reformulation: Linearization & continuous relaxation Branch & Bound Method framework Simple bounds, LP bound and semidefinite
More informationLoad Balancing and Switch Scheduling
EE384Y Project Final Report Load Balancing and Switch Scheduling Xiangheng Liu Department of Electrical Engineering Stanford University, Stanford CA 94305 Email: liuxh@systems.stanford.edu Abstract Load
More informationLABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014
LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING ----Changsheng Liu 10-30-2014 Agenda Semi Supervised Learning Topics in Semi Supervised Learning Label Propagation Local and global consistency Graph
More informationStatistical Machine Learning from Data
Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Gaussian Mixture Models Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique
More informationBest Monotone Degree Bounds for Various Graph Parameters
Best Monotone Degree Bounds for Various Graph Parameters D. Bauer Department of Mathematical Sciences Stevens Institute of Technology Hoboken, NJ 07030 S. L. Hakimi Department of Electrical and Computer
More information