Graphical Models and Message passing
1 Graphical Models and Message passing Yunshu Liu ASPITRG Research Group
2 References: [1] Steffen Lauritzen, Graphical Models, Oxford University Press, 1996. [2] Michael I. Jordan, Graphical Models, Statistical Science, Vol. 19, No. 1, Feb. 2004. [3] Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag New York, Inc. [4] Daphne Koller and Nir Friedman, Probabilistic Graphical Models: Principles and Techniques, The MIT Press, 2009. [5] Kevin P. Murphy, Machine Learning: A Probabilistic Perspective, The MIT Press, 2012.
3 Outline Preliminaries on Graphical Models Directed graphical model Undirected graphical model Directed graph and Undirected graph Message passing and sum-product algorithm Factor graphs Sum-product algorithms Junction tree algorithm
4 Outline Preliminaries on Graphical Models Message passing and sum-product algorithm
5 Preliminaries on Graphical Models Motivation: curse of dimensionality in matroid enumeration, polyhedron computation, the entropic region, and machine learning (computing likelihoods and marginal probabilities). Graphical models help capture complex dependencies among random variables, build large-scale statistical models, and design efficient algorithms for inference.
6 Preliminaries on Graphical Models Definition of graphical models: a graphical model is a probabilistic model for which a graph denotes the conditional dependence structure between random variables. Example: suppose MIT and Stanford accept undergraduate students based only on GPA. MIT: accepted by MIT; Stanford: accepted by Stanford. Given Alice's GPA as GPA_Alice, P(MIT | Stanford, GPA_Alice) = P(MIT | GPA_Alice). We say MIT is conditionally independent of Stanford given GPA_Alice, sometimes written with the symbol (MIT ⊥ Stanford | GPA_Alice).
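The admissions example can be checked numerically. A minimal sketch, with made-up probabilities and a joint distribution constructed so that (MIT ⊥ Stanford | GPA) holds by design:

```python
p_gpa = {"high": 0.3, "low": 0.7}    # P(GPA) -- invented numbers
p_mit = {"high": 0.8, "low": 0.1}    # P(MIT accepts | GPA)
p_stan = {"high": 0.7, "low": 0.2}   # P(Stanford accepts | GPA)

def joint(gpa, mit, stanford):
    """P(GPA, MIT, Stanford), built so (MIT ⊥ Stanford | GPA) holds."""
    pm = p_mit[gpa] if mit else 1 - p_mit[gpa]
    ps = p_stan[gpa] if stanford else 1 - p_stan[gpa]
    return p_gpa[gpa] * pm * ps

# P(MIT | Stanford = accept, GPA = high) equals P(MIT | GPA = high):
num = joint("high", True, True)
den = sum(joint("high", m, True) for m in (True, False))
assert abs(num / den - p_mit["high"]) < 1e-12
```

Conditioning on Stanford's decision changes nothing once GPA is known, which is exactly the identity stated on the slide.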
7 Bayesian networks: directed graphical model Bayesian networks A Bayesian network consists of a collection of probability distributions P over x = {x_1, ..., x_K} that factorize over a directed acyclic graph (DAG) in the following way: p(x) = p(x_1, ..., x_K) = ∏_{k=1}^{K} p(x_k | pa_k) where pa_k denotes the set of direct parent nodes of x_k. Aliases of Bayesian networks: probabilistic directed graphical model (via a directed acyclic graph), belief network, causal network (directed arrows represent causal relations).
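The DAG factorization can be sanity-checked in code: a product of locally normalized conditional tables over any DAG always defines a normalized joint distribution. A sketch with an invented three-node binary structure and random CPTs:

```python
import itertools
import random

random.seed(0)

# An invented DAG over three binary variables: x1 -> x2, x1 -> x3, x2 -> x3.
parents = {0: [], 1: [0], 2: [0, 1]}

# Random CPTs: cpt[k][pa] = P(x_k = 1 | parents take values pa).
cpt = {k: {pa: random.random()
           for pa in itertools.product([0, 1], repeat=len(ps))}
       for k, ps in parents.items()}

def joint(x):
    """p(x) = prod_k p(x_k | pa_k): the Bayesian-network factorization."""
    p = 1.0
    for k, ps in parents.items():
        p1 = cpt[k][tuple(x[j] for j in ps)]
        p *= p1 if x[k] == 1 else 1 - p1
    return p

# Locally normalized CPTs over a DAG always give a normalized joint:
total = sum(joint(x) for x in itertools.product([0, 1], repeat=3))
assert abs(total - 1.0) < 1e-12
```

No global normalization constant is needed, in contrast with the undirected case introduced later.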
8 Bayesian networks: directed graphical model Examples of Bayesian networks Consider an arbitrary joint distribution p(x) = p(x_1, x_2, x_3) over three variables; we can write: p(x_1, x_2, x_3) = p(x_3 | x_1, x_2) p(x_1, x_2) = p(x_3 | x_1, x_2) p(x_2 | x_1) p(x_1) which can be expressed as the fully connected directed graph over x_1, x_2, x_3.
9 Bayesian networks: directed graphical model Examples Similarly, if we change the order of x_1, x_2 and x_3 (i.e., consider all permutations of them), we can express p(x_1, x_2, x_3) in five other different ways, for example: p(x_1, x_2, x_3) = p(x_1 | x_2, x_3) p(x_2, x_3) = p(x_1 | x_2, x_3) p(x_2 | x_3) p(x_3) which corresponds to another fully connected directed graph over x_1, x_2, x_3.
10 Bayesian networks: directed graphical model Examples Recall the previous example about how MIT and Stanford accept undergraduate students. If we assign x_1 to GPA, x_2 to accepted by MIT and x_3 to accepted by Stanford, then since p(x_3 | x_1, x_2) = p(x_3 | x_1), we have p(x_1, x_2, x_3) = p(x_3 | x_1, x_2) p(x_2 | x_1) p(x_1) = p(x_3 | x_1) p(x_2 | x_1) p(x_1) which corresponds to the directed graph with arrows x_1 (GPA) → x_2 (MIT) and x_1 (GPA) → x_3 (Stanford).
11 Markov random fields: undirected graphical model In the undirected case, the probability distribution factorizes according to functions defined on the cliques of the graph. A clique is a subset of nodes in a graph such that there exists a link between all pairs of nodes in the subset. A maximal clique is a clique such that it is not possible to include any other node from the graph in the set without it ceasing to be a clique. We can consider functions on maximal cliques without loss of generality, because the other cliques must be subsets of maximal cliques: if {x_1, x_2, x_3} is a maximal clique and we define a factor over this clique, then including another factor defined over a subset of it would be redundant. Example (graph with edges x_1-x_2, x_1-x_3, x_2-x_3, x_2-x_4, x_3-x_4): the cliques are {x_1,x_2}, {x_2,x_3}, {x_3,x_4}, {x_2,x_4}, {x_1,x_3}, {x_1,x_2,x_3}, {x_2,x_3,x_4}; the maximal cliques are {x_1,x_2,x_3} and {x_2,x_3,x_4}. Note that {x_1,x_4} is not a clique because of the missing link from x_1 to x_4.
12 Markov random fields: undirected graphical model Markov random fields: Definition Denote by C a clique, by x_C the set of variables in clique C, and by ψ_C(x_C) a nonnegative potential function associated with clique C. Then a Markov random field is a collection of distributions that factorize as a product of potential functions ψ_C(x_C) over the maximal cliques of the graph: p(x) = (1/Z) ∏_C ψ_C(x_C) where the normalization constant Z = Σ_x ∏_C ψ_C(x_C) is sometimes called the partition function.
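For small graphs this factorization can be checked by direct enumeration. A brute-force sketch on the two-maximal-clique example above ({x_1,x_2,x_3} and {x_2,x_3,x_4}, binary variables), with invented potential values:

```python
import itertools
import math

# Invented positive potentials on the two maximal cliques of the example.
def psi_123(x1, x2, x3):
    return math.exp(0.5 * x1 * x2 + 0.3 * x2 * x3 - 0.2 * x1 * x3)

def psi_234(x2, x3, x4):
    return math.exp(0.7 * x2 * x4 - 0.1 * x3 * x4)

# Partition function Z = sum_x prod_C psi_C(x_C), by brute force.
Z = sum(psi_123(x1, x2, x3) * psi_234(x2, x3, x4)
        for x1, x2, x3, x4 in itertools.product([0, 1], repeat=4))

def p(x1, x2, x3, x4):
    """p(x) = (1/Z) psi_123(x_1,x_2,x_3) psi_234(x_2,x_3,x_4)."""
    return psi_123(x1, x2, x3) * psi_234(x2, x3, x4) / Z

total = sum(p(*x) for x in itertools.product([0, 1], repeat=4))
assert abs(total - 1.0) < 1e-12
```

The enumeration over all 2^4 configurations is exactly what the sum-product algorithm later avoids.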
13 Markov random fields: undirected graphical model Factorization of undirected graphs Question: how to write the joint distribution for the undirected graph with edges x_1-x_2 and x_1-x_3, for which (x_2 ⊥ x_3 | x_1) holds? Answer: p(x) = (1/Z) ψ_12(x_1, x_2) ψ_13(x_1, x_3) where ψ_12(x_1, x_2) and ψ_13(x_1, x_3) are the potential functions and Z is the partition function that makes sure p(x) satisfies the conditions to be a probability distribution.
14 Markov random fields: undirected graphical model Markov property Given an undirected graph G = (V, E) and a set of random variables X = (X_a)_{a∈V} indexed by V, we have the following Markov properties: Pairwise Markov property: any two non-adjacent variables are conditionally independent given all other variables: X_a ⊥ X_b | X_{V∖{a,b}} if {a, b} ∉ E. Local Markov property: a variable is conditionally independent of all other variables given its neighbors: X_a ⊥ X_{V∖(nb(a)∪{a})} | X_{nb(a)}, where nb(a) is the set of neighbors of node a. Global Markov property: any two subsets of variables are conditionally independent given a separating subset: X_A ⊥ X_B | X_S, where every path from a node in A to a node in B passes through S (meaning when we remove all the nodes in S, there are no paths connecting any node in A to any node in B).
15 Markov random fields: undirected graphical model Examples of the pairwise, local and global Markov properties, illustrated on a 2-d lattice UGM in which the red node X_8 is independent of the other black nodes given its neighbors (the blue nodes).
16 Markov random fields: undirected graphical model Relationship between the different Markov properties and the factorization property. (F): factorization property; (G): global Markov property; (L): local Markov property; (P): pairwise Markov property. In general (F) ⇒ (G) ⇒ (L) ⇒ (P); if we assume strictly positive p(·), also (P) ⇒ (F), which gives us the Hammersley-Clifford theorem.
17 Markov random fields: undirected graphical model The Hammersley-Clifford theorem (see Koller and Friedman 2009, p. 131 for a proof): Consider a graph G; for strictly positive p(·), the following Markov property and factorization property are equivalent: Markov property: any two subsets of variables are conditionally independent given a separating subset (X_A ⊥ X_B | X_S), where every path from a node in A to a node in B passes through S. Factorization property: the distribution p factorizes according to G if it can be expressed as a product of potential functions over maximal cliques.
18 Markov random fields: undirected graphical model Example: Gaussian Markov random fields A multivariate normal distribution forms a Markov random field w.r.t. a graph G = (V, E) if the missing edges correspond to zeros in the concentration matrix (the inverse covariance matrix). Consider x = {x_1, ..., x_K} ~ N(0, Σ) with Σ regular and K = Σ^{-1}. The concentration matrix of the conditional distribution of (x_i, x_j) given x_{∖{i,j}} is K_{{i,j}} = [ k_ii  k_ij ; k_ji  k_jj ] Hence x_i ⊥ x_j | x_{∖{i,j}} ⟺ k_ij = 0 ⟺ {i, j} ∉ E.
19 Markov random fields: undirected graphical model Example: Gaussian Markov random fields The joint distribution of a Gaussian Markov random field factorizes as: log p(x) = const − (1/2) Σ_i k_ii x_i² − Σ_{{i,j}∈E} k_ij x_i x_j The zero entries in K are called structural zeros since they represent the absent edges in the undirected graph.
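This can be illustrated numerically. A sketch with an invented tridiagonal precision matrix for the chain x_1 - x_2 - x_3: the structural zero K[0, 2] = 0 encodes x_1 ⊥ x_3 | x_2, even though the corresponding covariance entry is nonzero (x_1 and x_3 are marginally dependent):

```python
import numpy as np

# Invented precision (concentration) matrix for the chain x1 - x2 - x3.
# The (1,3) entry is a structural zero: no edge between x1 and x3.
K = np.array([[2.0, -1.0,  0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0,  2.0]])
Sigma = np.linalg.inv(K)

assert K[0, 2] == 0.0            # structural zero <=> x1 ⊥ x3 | x2
assert abs(Sigma[0, 2]) > 1e-6   # but the covariance entry is nonzero
```

Zeros live in the precision matrix, not the covariance matrix, which is the content of the conditional-independence equivalence on the previous slide.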
20 Directed graph and Undirected graph Moral graph: from directed graph to undirected graph. Moralization ("marrying the parents", so to speak) forces the parents of each node to form a completed sub-graph, then drops the edge directions. Motivation: why not simply convert a directed graph to an undirected graph by dropping the arrows? That works fine for a Markov chain, but not when there are nodes that have more than one parent (as in the previous example).
21 Directed graph and Undirected graph Markov blanket In a directed graph, the Markov blanket of a given node x_i is the set of nodes comprising the parents, the children and the co-parents (other parents of its children) of x_i. In an undirected graph, the Markov blanket of a given node x_i is the set of its neighboring nodes. (Illustration from Murphy, Fig. 19.1: a 2-d lattice represented as a DAG, where X_8 is independent of all other nodes given its Markov blanket, which includes its parents, children and co-parents; the same model represented as a UGM, where X_8 is independent of the other nodes given its neighbors.)
22 Directed graph and Undirected graph Markov blanket in a directed graph Let x = {x_1, ..., x_K}. Then p(x_i | x_∖i) = p(x_1, ..., x_K) / Σ_{x_i} p(x_1, ..., x_K) = ∏_k p(x_k | pa_k) / Σ_{x_i} ∏_k p(x_k | pa_k) = f(x_i, mb(x_i)) where pa_k is the set of parent nodes of x_k and mb(x_i) is the Markov blanket of node x_i: every factor not involving x_i cancels between numerator and denominator. Example: p(X_8 | X_∖8) = p(X_8 | X_3, X_7) p(X_9 | X_4, X_8) p(X_13 | X_8, X_12) / Σ_{X_8} p(X_8 | X_3, X_7) p(X_9 | X_4, X_8) p(X_13 | X_8, X_12)
23 Directed graph and Undirected graph Markov blanket in an undirected graph Let x = {x_1, ..., x_K}; then p(x_i | x_∖i) = p(x_i | mb(x_i)), where mb(x_i) is the Markov blanket of node x_i. Example: p(X_8 | X_∖8) = p(X_8 | X_3, X_7, X_9, X_13). Notes: the Markov blanket of a node in a directed graph is the Markov blanket (neighbors) of that node in its moral graph.
24 Directed graph and Undirected graph Directed graph vs. undirected graph: the families of distributions representable by directed graphs and by undirected graphs are both strict subsets of all distributions, and their intersection corresponds to chordal graphs (illustrated on the slide by a Venn diagram with example graphs for each region). What is chordal?
25 Directed graph and Undirected graph What is chordal? Chordal (or decomposable): if we collapse together all the variables in each maximal clique to make mega-variables, the resulting graph will be a tree. Why are chordal graphs the intersection of directed graphs and undirected graphs? Hint: moral graph.
26 Directed graph and Undirected graph Directed graphs: dropping edges. Three examples on x_1, x_2, x_3: a graph encoding (x_2 ⊥ x_3 | x_1); a graph encoding (x_1 ⊥ x_3 | x_2); and the v-structure x_1 → x_3 ← x_2, which encodes (x_1 ⊥ x_2) but not (x_1 ⊥ x_2 | x_3).
27 Directed graph and Undirected graph Undirected graphs: removing edges. Examples of undirected graphs on x_1, x_2, x_3 with edges removed; for the graph with edges x_1-x_2 and x_1-x_3, the conditional independence (x_2 ⊥ x_3 | x_1) holds.
28 Directed graph and Undirected graph Example of a 4-atom distribution over (x_1, x_2, x_3, x_4): the four atoms have probabilities a, b−a, c−a, 1+a−b−c, which become αβ, α(1−β), (1−α)β, (1−α)(1−β) for the stated value of I(x_3; x_4). The distribution satisfies the following conditional independence relations: (1 ⊥ 2 | 3), (1 ⊥ 2 | 4), (1 ⊥ 2 | 34), (1 ⊥ 34), (2 ⊥ 34), (1 ⊥ 234), (2 ⊥ 134), (3 ⊥ 124), (4 ⊥ 123), (3 ⊥ 4).
29 Outline Preliminaries on Graphical Models Message passing and sum-product algorithm
30 Factor graphs Common to directed graphical models and undirected graphical models: both allow a global function to be expressed as a product of factors over subsets of variables. Factor graphs: Definition A factor graph is a bipartite graph that expresses the structure of the factorization (1): p(x) = ∏_s f_s(x_s)    (1) where x_s denotes a subset of the variables x. In a factor graph, the random variables are usually drawn as round nodes and the factors as square nodes. There is an edge between the factor node f_s(x_s) and the variable node x_k if and only if x_k ∈ x_s.
31 Factor graphs Example of a factor graph: connections exist only between variable nodes and factor nodes. Variable nodes x_1, ..., x_5 and factor nodes f_a, f_b, f_c, f_d, with p(x) = f_a(x_1, x_3) f_b(x_3, x_4) f_c(x_2, x_4, x_5) f_d(x_1, x_3) Note: the same subset of variables x_s can be repeated multiple times; for example, x_s = {x_1, x_3} appears in both f_a and f_d in the above factor graph.
32 Factor graphs From directed graph to factor graph: (a) a directed graph with factorization p(x_1) p(x_2) p(x_3 | x_1, x_2). (b) a factor graph with a single factor f(x_1, x_2, x_3) = p(x_1) p(x_2) p(x_3 | x_1, x_2). (c) a different factor graph with factors f_a(x_1) f_b(x_2) f_c(x_1, x_2, x_3) such that f_a(x_1) = p(x_1), f_b(x_2) = p(x_2) and f_c(x_1, x_2, x_3) = p(x_3 | x_1, x_2).
33 Factor graphs From undirected graph to factor graph (note the difference between (1/Z) ∏_C ψ_C(x_C) and ∏_s f_s(x_s)): (a) an undirected graph with a single clique potential ψ(x_1, x_2, x_3). (b) a factor graph with factor f(x_1, x_2, x_3) = ψ(x_1, x_2, x_3). (c) a different factor graph with factors f_a(x_1, x_2, x_3) f_b(x_1, x_2) = ψ(x_1, x_2, x_3). Note: when a clique is large, factorizing it further is usually very useful.
34 Factor graphs Why introduce factor graphs? Both directed graphs and undirected graphs can be converted to factor graphs. Factor graphs allow us to be more explicit about the details of the factorization. When connecting factors to exponential families, the product of factors becomes a sum of parameters under a single exp. Multiple different factor graphs can represent the same directed or undirected graph, so we can choose a factor graph that is better suited to our inference task.
35 Sum-product algorithms Compute the marginal probability p(x_i): p(x_i) = Σ_{x∖x_i} p(x) What if p(x) = (1/Z) ∏_C ψ_C(x_C) or ∏_s f_s(x_s)? Can we design more efficient algorithms by exploiting the conditional independence structure of the model?
36 Sum-product algorithms Yes, we can. The key idea of the sum-product algorithm: xy + xz = x(y + z) The left-hand side costs two multiplications and one addition; the right-hand side only one of each. This is visualized with the expression trees of xy + xz and x(y + z), where the leaf nodes represent variables and the internal nodes represent arithmetic operators (addition, multiplication, negation, etc.).
37 Sum-product algorithms Example of computing p(x_1) = Σ_{x∖x_1} p(x) on the undirected chain x_1 - x_2 - x_3 - x_4 with p(x) = (1/Z) ψ_12(x_1, x_2) ψ_23(x_2, x_3) ψ_34(x_3, x_4): p(x_1) = (1/Z) Σ_{x_2} Σ_{x_3} Σ_{x_4} ψ_12(x_1, x_2) ψ_23(x_2, x_3) ψ_34(x_3, x_4) = (1/Z) Σ_{x_2} ψ_12(x_1, x_2) Σ_{x_3} ψ_23(x_2, x_3) Σ_{x_4} ψ_34(x_3, x_4) = (1/Z) Σ_{x_2} ψ_12(x_1, x_2) Σ_{x_3} ψ_23(x_2, x_3) m_4(x_3) = (1/Z) Σ_{x_2} ψ_12(x_1, x_2) m_3(x_2) = (1/Z) m_2(x_1)
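The chain computation above reduces to a few matrix-vector products. A sketch with random positive pairwise potentials and a brute-force check over all joint configurations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Chain x1 - x2 - x3 - x4, each variable with 3 states. psi[i] is the
# pairwise potential on the (i+1)-th edge; values are random and invented.
S = 3
psi = [rng.uniform(0.5, 2.0, size=(S, S)) for _ in range(3)]

# Messages flow right to left along the chain:
m4 = psi[2].sum(axis=1)   # m4(x3) = sum_{x4} psi34(x3, x4)
m3 = psi[1] @ m4          # m3(x2) = sum_{x3} psi23(x2, x3) m4(x3)
m2 = psi[0] @ m3          # m2(x1) = sum_{x2} psi12(x1, x2) m3(x2)
p1 = m2 / m2.sum()        # p(x1), dividing by Z = sum_{x1} m2(x1)

# Brute-force check: sum the product over all 3^4 configurations at once.
brute = np.einsum('ab,bc,cd->a', psi[0], psi[1], psi[2])
assert np.allclose(p1, brute / brute.sum())
```

Message passing costs O(K S^2) for a chain of K variables, versus O(S^K) for the naive sum.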
38 Sum-product algorithms Example of computing p(x_1) = Σ_{x∖x_1} p(x) on the earlier factor graph with p(x) = f_a(x_1, x_3) f_b(x_3, x_4) f_c(x_2, x_4, x_5) f_d(x_1, x_3): p(x_1) = Σ_{x_2} Σ_{x_3} Σ_{x_4} Σ_{x_5} p(x) = Σ_{x_2, x_3} f_a(x_1, x_3) f_d(x_1, x_3) Σ_{x_4} f_b(x_3, x_4) Σ_{x_5} f_c(x_2, x_4, x_5)
39 Sum-product algorithms Sum-product algorithm The sum-product algorithm uses a message passing scheme to exploit the conditional independence properties of the model and change the order of sums and products.
40 Sum-product algorithms Message passing on a factor graph, variable x to factor f: m_{x→f}(x) = ∏_{k ∈ nb(x)∖f} m_{f_k→x}(x) where nb(x) denotes the neighboring factor nodes of x. If nb(x)∖f = ∅, then m_{x→f}(x) = 1.
41 Sum-product algorithms Message passing on a factor graph, factor f to variable x: m_{f→x}(x) = Σ_{x_1, ..., x_M} f(x, x_1, ..., x_M) ∏_{m ∈ nb(f)∖x} m_{x_m→f}(x_m) where nb(f) denotes the neighboring variable nodes of f. If nb(f)∖x = ∅, then m_{f→x}(x) = f(x).
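The two message types can be sketched as small numpy functions for discrete variables (the function names are our own, not a library API), verified on the tree f_a(x_1,x_2) f_b(x_2,x_3) f_c(x_2,x_4) used on the following slides:

```python
import numpy as np

def msg_var_to_factor(incoming):
    """m_{x->f}(x): product of messages from all neighboring factors except
    f; an empty list (leaf variable) gives the constant message 1."""
    out = np.ones_like(incoming[0]) if incoming else np.array(1.0)
    for m in incoming:
        out = out * m
    return out

def msg_factor_to_var(factor, axis, incoming):
    """m_{f->x}(x): multiply the factor by the messages from its other
    variables, then sum those variables out. `factor` has one axis per
    adjacent variable, `axis` is x's position, and `incoming` maps each
    other axis to its m_{x_m->f} vector."""
    t = np.asarray(factor, dtype=float)
    for ax, m in incoming.items():
        shape = [1] * t.ndim
        shape[ax] = len(m)
        t = t * np.reshape(m, shape)
    other = tuple(a for a in range(t.ndim) if a != axis)
    return t.sum(axis=other)

# Random positive factors for the tree (state sizes chosen arbitrarily).
rng = np.random.default_rng(1)
fa = rng.uniform(0.5, 2.0, (2, 3))   # f_a(x1, x2)
fb = rng.uniform(0.5, 2.0, (3, 2))   # f_b(x2, x3)
fc = rng.uniform(0.5, 2.0, (3, 2))   # f_c(x2, x4)

# Leaf variables send the all-ones message.
m_fa = msg_factor_to_var(fa, 1, {0: np.ones(2)})   # m_{f_a -> x2}
m_fb = msg_factor_to_var(fb, 0, {1: np.ones(2)})   # m_{f_b -> x2}
m_fc = msg_factor_to_var(fc, 0, {1: np.ones(2)})   # m_{f_c -> x2}
p2 = m_fa * m_fb * m_fc
p2 = p2 / p2.sum()

# Brute force: sum f_a f_b f_c over x1, x3, x4.
brute = np.einsum('ab,bc,bd->b', fa, fb, fc)
assert np.allclose(p2, brute / brute.sum())
```

An internal variable combines the first two steps, e.g. `msg_var_to_factor([m_fa, m_fc])` is the message m_{x_2→f_b} fed into `msg_factor_to_var(fb, 1, ...)` to obtain m_{f_b→x_3}.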
42 Sum-product algorithms Message passing on a tree-structured factor graph with factors f_a(x_1, x_2), f_b(x_2, x_3) and f_c(x_2, x_4): x_1 and x_4 are leaves, x_2 is the internal variable, x_3 is the root.
43 Sum-product algorithms Message passing on the tree-structured factor graph, down-up pass: m_{x_1→f_a}(x_1) = 1 m_{f_a→x_2}(x_2) = Σ_{x_1} f_a(x_1, x_2) m_{x_4→f_c}(x_4) = 1 m_{f_c→x_2}(x_2) = Σ_{x_4} f_c(x_2, x_4) m_{x_2→f_b}(x_2) = m_{f_a→x_2}(x_2) m_{f_c→x_2}(x_2) m_{f_b→x_3}(x_3) = Σ_{x_2} f_b(x_2, x_3) m_{x_2→f_b}(x_2)
44 Sum-product algorithms Message passing on the tree-structured factor graph, up-down pass: m_{x_3→f_b}(x_3) = 1 m_{f_b→x_2}(x_2) = Σ_{x_3} f_b(x_2, x_3) m_{x_2→f_a}(x_2) = m_{f_b→x_2}(x_2) m_{f_c→x_2}(x_2) m_{f_a→x_1}(x_1) = Σ_{x_2} f_a(x_1, x_2) m_{x_2→f_a}(x_2) m_{x_2→f_c}(x_2) = m_{f_a→x_2}(x_2) m_{f_b→x_2}(x_2) m_{f_c→x_4}(x_4) = Σ_{x_2} f_c(x_2, x_4) m_{x_2→f_c}(x_2)
45 Sum-product algorithms Calculating p(x_2) on the tree-structured factor graph: p(x_2) = (1/Z) m_{f_a→x_2}(x_2) m_{f_b→x_2}(x_2) m_{f_c→x_2}(x_2) = (1/Z) {Σ_{x_1} f_a(x_1, x_2)} {Σ_{x_3} f_b(x_2, x_3)} {Σ_{x_4} f_c(x_2, x_4)} = (1/Z) Σ_{x_1} Σ_{x_3} Σ_{x_4} f_a(x_1, x_2) f_b(x_2, x_3) f_c(x_2, x_4) = Σ_{x_1} Σ_{x_3} Σ_{x_4} p(x)
46 Junction tree algorithm Junction tree algorithm: message passing on general graphs 1. Moralize to an undirected graph if starting with a directed graph 2. Triangulate the graph to make it chordal 3. Build the junction tree 4. Run the message passing algorithm on the junction tree
47 Junction tree algorithm 1. Moralize to an undirected graph if starting with a directed graph. Moralization ("marrying the parents", so to speak) forces the parents of each node to form a completed sub-graph and then drops the edge directions (Murphy, Fig. 19.2: a DGM and its moralized version, represented as a UGM).
48 Junction tree algorithm 2. Triangulate the graph to make it chordal: find chord-less cycles containing four or more nodes and add extra links to eliminate such chord-less cycles. Example: the 5-cycle x_1 - x_2 - x_3 - x_4 - x_5 - x_1 is triangulated by adding two chords.
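Triangulation by vertex elimination can be sketched in a few lines: eliminate vertices in some order and connect each eliminated vertex's remaining neighbors pairwise; the added fill-in edges make the graph chordal. On the 5-cycle this adds two chords (the elimination order below is an arbitrary choice, not an optimal one):

```python
# The 5-cycle x1 - x2 - x3 - x4 - x5 - x1 as an adjacency map.
adj = {1: {2, 5}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {1, 4}}

remaining = {v: set(nb) for v, nb in adj.items()}
fill = []                               # fill-in chords added by elimination
for v in [1, 2, 3, 4, 5]:               # an arbitrary elimination order
    nbs = sorted(remaining.pop(v))      # v's not-yet-eliminated neighbors
    for i in range(len(nbs)):
        for j in range(i + 1, len(nbs)):
            a, b = nbs[i], nbs[j]
            if b not in remaining[a]:   # connect v's neighbors pairwise
                remaining[a].add(b)
                remaining[b].add(a)
                fill.append((a, b))
    for u in nbs:
        remaining[u].discard(v)

# Two chords suffice: x2-x5 and x3-x5, giving the maximal cliques
# {x1,x2,x5}, {x2,x3,x5}, {x3,x4,x5} used on the next slide.
assert sorted(fill) == [(2, 5), (3, 5)]
```

A different elimination order can produce more fill-in; finding the order that minimizes the largest clique is NP-hard in general, so heuristics are used in practice.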
49 Junction tree algorithm 3. Build the junction tree. The nodes of a junction tree are the maximal cliques of the triangulated graph, and the links connect pairs of nodes (maximal cliques) that have variables in common. The junction tree is built so that the sum of the edge weights in the tree is maximal, where the weight of an edge is the number of variables shared by the two maximal cliques it connects. Example: for the triangulated 5-cycle with maximal cliques {x_1,x_2,x_5}, {x_2,x_3,x_5}, {x_3,x_4,x_5}, the chain {x_1,x_2,x_5} - {x_2,x_3,x_5} - {x_3,x_4,x_5} with separators {x_2,x_5} and {x_3,x_5} is a valid junction tree, while a tree linking {x_1,x_2,x_5} directly to {x_3,x_4,x_5} (separator {x_5}) is not.
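The maximum-weight spanning tree over the clique graph can be found with a Kruskal-style sketch (clique list taken from the triangulated 5-cycle example; the union-find here is deliberately minimal):

```python
import itertools

# Maximal cliques of the triangulated 5-cycle.
cliques = [frozenset({1, 2, 5}), frozenset({2, 3, 5}), frozenset({3, 4, 5})]

# Candidate edges between cliques that share variables, heaviest first
# (edge weight = separator size, i.e. number of shared variables).
pairs = [(a, b) for a, b in itertools.combinations(cliques, 2) if a & b]
pairs.sort(key=lambda e: len(e[0] & e[1]), reverse=True)

# Kruskal: greedily add the heaviest edges that do not close a cycle.
parent = {c: c for c in cliques}
def find(c):
    while parent[c] != c:
        c = parent[c]
    return c

tree = []
for a, b in pairs:
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[ra] = rb
        tree.append((a, b, len(a & b)))

# The result is the weight-2 + weight-2 chain from the slide, total weight 4;
# the weight-1 edge {1,2,5}-{3,4,5} is rejected.
assert sum(w for _, _, w in tree) == 4
```

Maximizing total separator weight is exactly what guarantees the running intersection property for graphs of this kind, which is why the greedy spanning tree recovers the valid junction tree.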
50 Junction tree algorithm 4. Run the message passing algorithm on the junction tree. Message passing on the junction tree is similar to the sum-product algorithm on a tree; the result gives the joint distribution of each maximal clique of the triangulated graph. The computational complexity of the junction tree algorithm is determined by the size of the largest clique, which is lower bounded via the treewidth of the graph: the treewidth is one less than the size of the largest clique in the best possible triangulation.
51 Thanks! Questions!
More informationCMPSCI611: Approximating MAX-CUT Lecture 20
CMPSCI611: Approximating MAX-CUT Lecture 20 For the next two lectures we ll be seeing examples of approximation algorithms for interesting NP-hard problems. Today we consider MAX-CUT, which we proved to
More information. 0 1 10 2 100 11 1000 3 20 1 2 3 4 5 6 7 8 9
Introduction The purpose of this note is to find and study a method for determining and counting all the positive integer divisors of a positive integer Let N be a given positive integer We say d is a
More informationIncorporating Evidence in Bayesian networks with the Select Operator
Incorporating Evidence in Bayesian networks with the Select Operator C.J. Butz and F. Fang Department of Computer Science, University of Regina Regina, Saskatchewan, Canada SAS 0A2 {butz, fang11fa}@cs.uregina.ca
More informationProblem Set 7 Solutions
8 8 Introduction to Algorithms May 7, 2004 Massachusetts Institute of Technology 6.046J/18.410J Professors Erik Demaine and Shafi Goldwasser Handout 25 Problem Set 7 Solutions This problem set is due in
More informationThe Open University s repository of research publications and other research outputs
Open Research Online The Open University s repository of research publications and other research outputs The degree-diameter problem for circulant graphs of degree 8 and 9 Journal Article How to cite:
More informationBasics of Statistical Machine Learning
CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar
More informationE3: PROBABILITY AND STATISTICS lecture notes
E3: PROBABILITY AND STATISTICS lecture notes 2 Contents 1 PROBABILITY THEORY 7 1.1 Experiments and random events............................ 7 1.2 Certain event. Impossible event............................
More informationTopology-based network security
Topology-based network security Tiit Pikma Supervised by Vitaly Skachek Research Seminar in Cryptography University of Tartu, Spring 2013 1 Introduction In both wired and wireless networks, there is the
More informationTagging with Hidden Markov Models
Tagging with Hidden Markov Models Michael Collins 1 Tagging Problems In many NLP problems, we would like to model pairs of sequences. Part-of-speech (POS) tagging is perhaps the earliest, and most famous,
More informationLecture 17 : Equivalence and Order Relations DRAFT
CS/Math 240: Introduction to Discrete Mathematics 3/31/2011 Lecture 17 : Equivalence and Order Relations Instructor: Dieter van Melkebeek Scribe: Dalibor Zelený DRAFT Last lecture we introduced the notion
More informationPUTNAM TRAINING POLYNOMIALS. Exercises 1. Find a polynomial with integral coefficients whose zeros include 2 + 5.
PUTNAM TRAINING POLYNOMIALS (Last updated: November 17, 2015) Remark. This is a list of exercises on polynomials. Miguel A. Lerma Exercises 1. Find a polynomial with integral coefficients whose zeros include
More informationLarge-Sample Learning of Bayesian Networks is NP-Hard
Journal of Machine Learning Research 5 (2004) 1287 1330 Submitted 3/04; Published 10/04 Large-Sample Learning of Bayesian Networks is NP-Hard David Maxwell Chickering David Heckerman Christopher Meek Microsoft
More informationMining Social Network Graphs
Mining Social Network Graphs Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata November 13, 17, 2014 Social Network No introduc+on required Really? We s7ll need to understand
More informationOutline. NP-completeness. When is a problem easy? When is a problem hard? Today. Euler Circuits
Outline NP-completeness Examples of Easy vs. Hard problems Euler circuit vs. Hamiltonian circuit Shortest Path vs. Longest Path 2-pairs sum vs. general Subset Sum Reducing one problem to another Clique
More informationMining Social-Network Graphs
342 Chapter 10 Mining Social-Network Graphs There is much information to be gained by analyzing the large-scale data that is derived from social networks. The best-known example of a social network is
More informationChapter 7: Termination Detection
Chapter 7: Termination Detection Ajay Kshemkalyani and Mukesh Singhal Distributed Computing: Principles, Algorithms, and Systems Cambridge University Press A. Kshemkalyani and M. Singhal (Distributed Computing)
More informationVERTICES OF GIVEN DEGREE IN SERIES-PARALLEL GRAPHS
VERTICES OF GIVEN DEGREE IN SERIES-PARALLEL GRAPHS MICHAEL DRMOTA, OMER GIMENEZ, AND MARC NOY Abstract. We show that the number of vertices of a given degree k in several kinds of series-parallel labelled
More informationWhy? A central concept in Computer Science. Algorithms are ubiquitous.
Analysis of Algorithms: A Brief Introduction Why? A central concept in Computer Science. Algorithms are ubiquitous. Using the Internet (sending email, transferring files, use of search engines, online
More informationLecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs
CSE599s: Extremal Combinatorics November 21, 2011 Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs Lecturer: Anup Rao 1 An Arithmetic Circuit Lower Bound An arithmetic circuit is just like
More informationBayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com
Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian
More informationBayesian Network Modeling of Hangul Characters for On-line Handwriting Recognition
Bayesian Network Modeling of Hangul haracters for On-line Handwriting Recognition Sung-ung ho and in H. Kim S Div., EES Dept., KAIS, 373- Kusong-dong, Yousong-ku, Daejon, 305-70, KOREA {sjcho, jkim}@ai.kaist.ac.kr
More informationI. GROUPS: BASIC DEFINITIONS AND EXAMPLES
I GROUPS: BASIC DEFINITIONS AND EXAMPLES Definition 1: An operation on a set G is a function : G G G Definition 2: A group is a set G which is equipped with an operation and a special element e G, called
More informationLecture 4: BK inequality 27th August and 6th September, 2007
CSL866: Percolation and Random Graphs IIT Delhi Amitabha Bagchi Scribe: Arindam Pal Lecture 4: BK inequality 27th August and 6th September, 2007 4. Preliminaries The FKG inequality allows us to lower bound
More informationChapter 6: Graph Theory
Chapter 6: Graph Theory Graph theory deals with routing and network problems and if it is possible to find a best route, whether that means the least expensive, least amount of time or the least distance.
More informationBayesian Networks. Read R&N Ch. 14.1-14.2. Next lecture: Read R&N 18.1-18.4
Bayesian Networks Read R&N Ch. 14.1-14.2 Next lecture: Read R&N 18.1-18.4 You will be expected to know Basic concepts and vocabulary of Bayesian networks. Nodes represent random variables. Directed arcs
More informationMatrix Algebra. Some Basic Matrix Laws. Before reading the text or the following notes glance at the following list of basic matrix algebra laws.
Matrix Algebra A. Doerr Before reading the text or the following notes glance at the following list of basic matrix algebra laws. Some Basic Matrix Laws Assume the orders of the matrices are such that
More informationReliability. 26.1 Reliability Models. Chapter 26 Page 1
Chapter 26 Page 1 Reliability Although the technological achievements of the last 50 years can hardly be disputed, there is one weakness in all mankind's devices. That is the possibility of failure. What
More informationLearning Sum-Product Networks with Direct and Indirect Variable Interactions
Learning Sum-Product Networks with Direct and Indirect Variable Interactions Amirmohammad Rooshenas Daniel Lowd Department of Computer and Information Science, University of Oregon PEDRAM@CS.UOREGON.EDU
More informationHT2015: SC4 Statistical Data Mining and Machine Learning
HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric
More informationDefinition 11.1. Given a graph G on n vertices, we define the following quantities:
Lecture 11 The Lovász ϑ Function 11.1 Perfect graphs We begin with some background on perfect graphs. graphs. First, we define some quantities on Definition 11.1. Given a graph G on n vertices, we define
More informationTriangle deletion. Ernie Croot. February 3, 2010
Triangle deletion Ernie Croot February 3, 2010 1 Introduction The purpose of this note is to give an intuitive outline of the triangle deletion theorem of Ruzsa and Szemerédi, which says that if G = (V,
More informationCompression algorithm for Bayesian network modeling of binary systems
Compression algorithm for Bayesian network modeling of binary systems I. Tien & A. Der Kiureghian University of California, Berkeley ABSTRACT: A Bayesian network (BN) is a useful tool for analyzing the
More informationIE 680 Special Topics in Production Systems: Networks, Routing and Logistics*
IE 680 Special Topics in Production Systems: Networks, Routing and Logistics* Rakesh Nagi Department of Industrial Engineering University at Buffalo (SUNY) *Lecture notes from Network Flows by Ahuja, Magnanti
More informationCompact Representations and Approximations for Compuation in Games
Compact Representations and Approximations for Compuation in Games Kevin Swersky April 23, 2008 Abstract Compact representations have recently been developed as a way of both encoding the strategic interactions
More informationTU e. Advanced Algorithms: experimentation project. The problem: load balancing with bounded look-ahead. Input: integer m 2: number of machines
The problem: load balancing with bounded look-ahead Input: integer m 2: number of machines integer k 0: the look-ahead numbers t 1,..., t n : the job sizes Problem: assign jobs to machines machine to which
More information6.3 Conditional Probability and Independence
222 CHAPTER 6. PROBABILITY 6.3 Conditional Probability and Independence Conditional Probability Two cubical dice each have a triangle painted on one side, a circle painted on two sides and a square painted
More informationUnified Lecture # 4 Vectors
Fall 2005 Unified Lecture # 4 Vectors These notes were written by J. Peraire as a review of vectors for Dynamics 16.07. They have been adapted for Unified Engineering by R. Radovitzky. References [1] Feynmann,
More informationExtracting correlation structure from large random matrices
Extracting correlation structure from large random matrices Alfred Hero University of Michigan - Ann Arbor Feb. 17, 2012 1 / 46 1 Background 2 Graphical models 3 Screening for hubs in graphical model 4
More informationFinding the M Most Probable Configurations Using Loopy Belief Propagation
Finding the M Most Probable Configurations Using Loopy Belief Propagation Chen Yanover and Yair Weiss School of Computer Science and Engineering The Hebrew University of Jerusalem 91904 Jerusalem, Israel
More informationAlgebra Unpacked Content For the new Common Core standards that will be effective in all North Carolina schools in the 2012-13 school year.
This document is designed to help North Carolina educators teach the Common Core (Standard Course of Study). NCDPI staff are continually updating and improving these tools to better serve teachers. Algebra
More informationAn Application of Graphical Modeling to the Analysis of Intranet Benefits and Applications
Journal of Data Science 3(2005), 1-17 An Application of Graphical Modeling to the Analysis of Intranet Benefits and Applications Raffaella Settimi, Linda V. Knight, Theresa A. Steinbach and James D. White
More informationLecture 1 Introduction Properties of Probability Methods of Enumeration Asrat Temesgen Stockholm University
Lecture 1 Introduction Properties of Probability Methods of Enumeration Asrat Temesgen Stockholm University 1 Chapter 1 Probability 1.1 Basic Concepts In the study of statistics, we consider experiments
More informationGraph Theory Problems and Solutions
raph Theory Problems and Solutions Tom Davis tomrdavis@earthlink.net http://www.geometer.org/mathcircles November, 005 Problems. Prove that the sum of the degrees of the vertices of any finite graph is
More informationChapter 2 Introduction to Sequence Analysis for Human Behavior Understanding
Chapter 2 Introduction to Sequence Analysis for Human Behavior Understanding Hugues Salamin and Alessandro Vinciarelli 2.1 Introduction Human sciences recognize sequence analysis as a key aspect of any
More informationA Sublinear Bipartiteness Tester for Bounded Degree Graphs
A Sublinear Bipartiteness Tester for Bounded Degree Graphs Oded Goldreich Dana Ron February 5, 1998 Abstract We present a sublinear-time algorithm for testing whether a bounded degree graph is bipartite
More informationDiscuss the size of the instance for the minimum spanning tree problem.
3.1 Algorithm complexity The algorithms A, B are given. The former has complexity O(n 2 ), the latter O(2 n ), where n is the size of the instance. Let n A 0 be the size of the largest instance that can
More informationChapter 11. 11.1 Load Balancing. Approximation Algorithms. Load Balancing. Load Balancing on 2 Machines. Load Balancing: Greedy Scheduling
Approximation Algorithms Chapter Approximation Algorithms Q. Suppose I need to solve an NP-hard problem. What should I do? A. Theory says you're unlikely to find a poly-time algorithm. Must sacrifice one
More informationInner Product Spaces
Math 571 Inner Product Spaces 1. Preliminaries An inner product space is a vector space V along with a function, called an inner product which associates each pair of vectors u, v with a scalar u, v, and
More informationSolutions to Homework 6
Solutions to Homework 6 Debasish Das EECS Department, Northwestern University ddas@northwestern.edu 1 Problem 5.24 We want to find light spanning trees with certain special properties. Given is one example
More informationNOTES ON LINEAR TRANSFORMATIONS
NOTES ON LINEAR TRANSFORMATIONS Definition 1. Let V and W be vector spaces. A function T : V W is a linear transformation from V to W if the following two properties hold. i T v + v = T v + T v for all
More informationProperties of Real Numbers
16 Chapter P Prerequisites P.2 Properties of Real Numbers What you should learn: Identify and use the basic properties of real numbers Develop and use additional properties of real numbers Why you should
More informationEvaluation of Machine Learning Techniques for Green Energy Prediction
arxiv:1406.3726v1 [cs.lg] 14 Jun 2014 Evaluation of Machine Learning Techniques for Green Energy Prediction 1 Objective Ankur Sahai University of Mainz, Germany We evaluate Machine Learning techniques
More information