Introduction to Graphical Models

Robert Collins, CSE586

Credits: Several slides are from Christopher Bishop (MSR) and Andrew Moore (CMU).

Readings in Prince textbook: Chapters 10 and 11, but mainly only on directed graphs at this time.

Review: Probability Theory

Sum rule (marginal distributions): $p(X) = \sum_Y p(X, Y)$
Product rule: $p(X, Y) = p(Y \mid X)\, p(X)$
From these we have Bayes' theorem, $p(Y \mid X) = p(X \mid Y)\, p(Y) / p(X)$, with normalization factor $p(X) = \sum_Y p(X \mid Y)\, p(Y)$.

Review: Conditional Probability

Conditional probability (rewriting the product rule):
P(A | B) = P(A, B) / P(B)

Chain rule:
P(A,B,C,D) = P(A) [P(A,B) / P(A)] [P(A,B,C) / P(A,B)] [P(A,B,C,D) / P(A,B,C)]
           = P(A) P(B | A) P(C | A,B) P(D | A,B,C)

Conditional independence: P(A, B | C) = P(A | C) P(B | C)
Statistical independence: P(A, B) = P(A) P(B)

Overview of Graphical Models

- Graphical models model conditional dependence/independence.
- Graph structure specifies how the joint probability factors.
- Directed graphs. Example: HMM
- Undirected graphs. Example: MRF
- Inference by message passing: belief propagation
  - Sum-product algorithm
  - Max-product (Min-sum if using logs)

We will focus mainly on directed graphs right now.

The Joint Distribution

Recipe for making a joint distribution of M variables:

1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.

Example: Boolean variables A, B, C

A  B  C  Prob
0  0  0  0.30
0  0  1  0.05
0  1  0  0.10
0  1  1  0.05
1  0  0  0.05
1  0  1  0.10
1  1  0  0.25
1  1  1  0.10

[Figure: the same eight probabilities drawn as regions of a diagram over A, B, and C.]

Joint distributions

Good news: Once you have a joint distribution, you can answer all sorts of probabilistic questions involving combinations of attributes.

Using the Joint

$P(E) = \sum_{\text{rows matching } E} P(\text{row})$

Examples from the slides (the underlying joint table is not reproduced in this text):
P(Poor Male) = 0.4654
P(Poor) = 0.7604
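As a minimal sketch of "using the joint", the Python below stores the A, B, C truth table from the example and sums matching rows to get the probability of an event. The helper name `prob` and the lambda-based event encoding are illustrative choices, not part of the original slides.

```python
from itertools import product

# Joint distribution over Boolean variables (A, B, C), keyed by truth-table row.
# Values are the eight probabilities from the A, B, C example.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(event):
    """P(E): sum the probabilities of all rows for which event(a, b, c) is true."""
    return sum(p for (a, b, c), p in joint.items() if event(a, b, c))

# The table has one row per combination of 3 Boolean variables: 2**3 rows.
assert set(joint) == set(product([0, 1], repeat=3))

total = prob(lambda a, b, c: True)                    # all rows together
p_a = prob(lambda a, b, c: a == 1)                    # marginal P(A)
p_a_and_b = prob(lambda a, b, c: a == 1 and b == 1)   # P(A and B)
print(total, p_a, p_a_and_b)
```

Any question about combinations of A, B, and C reduces to choosing which rows to add up, which is why the full joint is so convenient when it is small enough to write down.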

Inference with the Joint

Computing conditional probabilities:

$P(E_1 \mid E_2) = \frac{P(E_1 \wedge E_2)}{P(E_2)} = \frac{\sum_{\text{rows matching } E_1 \text{ and } E_2} P(\text{row})}{\sum_{\text{rows matching } E_2} P(\text{row})}$

Example: P(Male | Poor) = 0.4654 / 0.7604 = 0.612

Joint distributions

Good news: Once you have a joint distribution, you can answer all sorts of probabilistic questions involving combinations of attributes.

Bad news: It is impossible to create a joint distribution for more than about ten attributes because so many numbers are needed when you build the thing. For 10 binary variables you need to specify 2^10 - 1 = 1023 numbers. (Question for class: why the -1?)

How to Use Fewer Numbers

- Factor the joint distribution into a product of distributions over subsets of variables.
- Identify (or just assume) independence between some subsets of variables.
- Use that independence to simplify some of the distributions.

Graphical models provide a principled way of doing this.

Factoring

Consider an arbitrary joint distribution. We can always factor it by application of the chain rule; for three variables, for example, $p(x, y, z) = p(x)\, p(y \mid x)\, p(z \mid x, y)$.

[Figure: what this factored form looks like as a graphical model.]

Directed versus Undirected Graphs

Directed graph examples: Bayes nets, HMMs
Undirected graph examples: MRFs

Note: The word "graphical" denotes the graph structure underlying the model, not the fact that you can draw a pretty picture of it (although that helps).
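The conditional-probability rule and the parameter count can be sketched as follows. The Poor/Male table is not available in this text, so the sketch reuses the A, B, C truth table instead; the function names `cond_prob` and `free_parameters` are hypothetical.

```python
# Joint over Boolean (A, B, C); values from the A, B, C truth-table example.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05,
    (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10,
    (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(event):
    """P(E): sum of the rows matching the event."""
    return sum(p for (a, b, c), p in joint.items() if event(a, b, c))

def cond_prob(event1, event2):
    """P(E1 | E2) = P(E1 and E2) / P(E2), both read off the joint table."""
    both = prob(lambda a, b, c: event1(a, b, c) and event2(a, b, c))
    return both / prob(event2)

def free_parameters(num_binary_vars):
    """Numbers needed for a full joint table over binary variables:
    one per row, minus one because the rows are constrained to sum to 1."""
    return 2 ** num_binary_vars - 1

p_a_given_b = cond_prob(lambda a, b, c: a == 1, lambda a, b, c: b == 1)
print(p_a_given_b)            # P(A | B) = 0.35 / 0.50
print(free_parameters(10))    # the 10-variable case discussed above
```

The count grows exponentially with the number of variables, which is the motivation for factoring the joint into smaller pieces.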

Graphical Model Concepts

Example network over Boolean variables S, M, L, R, T, with its conditional probability tables:

P(S) = 0.3              P(M) = 0.6
P(L | M ^ S) = 0.05     P(L | M ^ ~S) = 0.1
P(L | ~M ^ S) = 0.1     P(L | ~M ^ ~S) = 0.2
P(T | L) = 0.3          P(T | ~L) = 0.8
P(R | M) = 0.3          P(R | ~M) = 0.6

- Nodes represent random variables.
- Edges (or lack of edges) represent conditional dependence (or independence).
- Each node is annotated with a table of conditional probabilities with respect to its parents.

Directed Acyclic Graphs

Directed acyclic means we can't follow arrows around in a cycle.
Examples: chains; trees.
Also, things that look like this: [Figure: example directed acyclic graph.]

Factoring Examples

Joint distribution: $p(\mathbf{x}) = \prod_i p(x_i \mid \mathrm{pa}_i)$, where $\mathrm{pa}_i$ denotes the parents of $i$.

We can read the factored form of the joint distribution directly from a directed graph: each node contributes one factor conditioned on its parents, P(x | parents of x) P(y | parents of y).

Two-node examples:
- x is the parent of y: p(x,y) = p(x) p(y | x)
- y is the parent of x: p(x,y) = p(y) p(x | y)
- no edge between x and y: p(x,y) = p(x) p(y)

Factoring Examples

We can read the form of the joint distribution directly from a directed graph: $p(\mathbf{x}) = \prod_i p(x_i \mid \mathrm{pa}_i)$, where $\mathrm{pa}_i$ denotes the parents of $i$.

For three variables the general pattern is
p(x,y,z) = P(x | parents of x) P(y | parents of y) P(z | parents of z)

[Figures: several three-node directed graphs with their factorizations; the specific graphs are not recoverable from this text.]

How about this one? For the S, M, L, R, T network from Graphical Model Concepts:

P(L,M,R,S,T) = P(S) P(M) P(L | S,M) P(R | M) P(T | L)

Note: P(T | L) + P(T | ~L) need not sum to 1 (here 0.3 + 0.8 = 1.1); it is P(T | L) + P(~T | L) that must equal 1.

What about this one? [Figure: a graph containing a directed cycle.] No directed cycles are allowed.
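A short sketch of reading the joint off the directed graph, using the S, M, L, R, T factorization and the conditional probability values from the Graphical Model Concepts example. The helper names `bern` and `joint` and the dictionary layout are illustrative assumptions.

```python
from itertools import product

# CPT entries: probability that each variable is True, given its parents.
P_S = 0.3
P_M = 0.6
P_L = {(True, True): 0.05, (True, False): 0.1,    # keyed by (M, S)
       (False, True): 0.1, (False, False): 0.2}
P_T = {True: 0.3, False: 0.8}                     # keyed by L
P_R = {True: 0.3, False: 0.6}                     # keyed by M

def bern(p_true, value):
    """Probability that a Boolean variable equals `value`, given P(True)."""
    return p_true if value else 1.0 - p_true

def joint(l, m, r, s, t):
    """P(L,M,R,S,T) = P(S) P(M) P(L|S,M) P(R|M) P(T|L)."""
    return (bern(P_S, s) * bern(P_M, m) * bern(P_L[(m, s)], l)
            * bern(P_R[m], r) * bern(P_T[l], t))

# Summing the factored joint over all 2**5 configurations should give 1.
total = sum(joint(*values) for values in product([True, False], repeat=5))
p_all_true = joint(True, True, True, True, True)
print(total, p_all_true)
```

The network is specified by 1 + 1 + 4 + 2 + 2 = 10 numbers, whereas an unstructured joint table over five Boolean variables would need 2^5 - 1 = 31.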

Factoring Examples

How many probabilities do we have to specify/learn (assuming each x_i is a binary variable)?

- If fully connected, we would need 2^7 - 1 = 127.
- But for this connectivity, we need 1+1+1+8+4+2+4 = 21 (each node contributes 2^(number of parents) entries).

Note: If all nodes were independent, we would only need 7.

Important Case: Time Series

Consider modeling a time series of sequential data x1, x2, ..., xn. These could represent
- locations of a tracked object over time
- observations of the weather each day
- spectral coefficients of a speech signal
- joint angles during human motion

Modeling Time Series

The simplest model of a time series is that all observations are independent. This would be appropriate for modeling successive tosses {heads, tails} of an unbiased coin. However, it doesn't really treat the series as a sequence. That is, we could permute the ordering of the observations and not change a thing.

In the most general case, we could use the chain rule to state that any node is dependent on all previous nodes:

P(x1,x2,x3,x4,...) = P(x1) P(x2 | x1) P(x3 | x1,x2) P(x4 | x1,x2,x3) ...

Look for an intermediate model between these two extremes.

Markov assumption:

P(xn | x1,x2,...,xn-1) = P(xn | xn-1)

that is, assume all conditional distributions depend only on the most recent previous observation. The result is a first-order Markov chain:

P(x1,x2,x3,x4,...) = P(x1) P(x2 | x1) P(x3 | x2) P(x4 | x3) ...

[Figure: chain x1 -> x2 -> ... -> xn-1 -> xn -> xn+1.]

Generalization: State-Space Models

You have a Markov chain of latent (unobserved) states. Each state generates an observation.

[Figure: latent chain x1 -> x2 -> ... -> xn-1 -> xn -> xn+1, with each state xn emitting an observation yn.]

Goal: Given a sequence of observations, predict the sequence of unobserved states that maximizes the joint probability.
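The parameter counts above can be checked with a few lines of Python. The per-node parent counts below are inferred from the terms of the sum 1+1+1+8+4+2+4 (each term is 2 raised to the number of parents); the function names are hypothetical, and the Markov-chain count is an added illustration under the same binary-variable assumption.

```python
def cpt_size(num_parents):
    """Free numbers for one binary node: one P(node = true | parents)
    entry per configuration of its binary parents."""
    return 2 ** num_parents

# Number of parents of each of the 7 nodes, inferred from 1+1+1+8+4+2+4.
parent_counts = [0, 0, 0, 3, 2, 1, 2]

factored = sum(cpt_size(k) for k in parent_counts)   # this connectivity
fully_connected = 2 ** len(parent_counts) - 1        # unstructured joint
independent = len(parent_counts)                     # no edges at all

def markov_chain_params(n):
    """First-order Markov chain over n binary variables: 1 number for
    P(x1), then 2 numbers (one per parent state) for each later node."""
    return 1 + 2 * (n - 1)

print(factored, fully_connected, independent, markov_chain_params(7))
```

The first-order Markov chain sits between the two extremes: it grows linearly with the length of the sequence while still modeling dependence between neighbors.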

Modeling Time Series

Examples of state-space models:
- Hidden Markov model
- Kalman filter

[Figure: latent chain x1 -> x2 -> ... -> xn-1 -> xn -> xn+1, with each state xn emitting an observation yn.]

P(x1,x2,x3,x4,...,y1,y2,y3,y4,...) = P(x1) P(y1 | x1) P(x2 | x1) P(y2 | x2) P(x3 | x2) P(y3 | x3) P(x4 | x3) P(y4 | x4) ...

Example of a Tree-structured Model

[Figure: a tree-structured graphical model.]

Confusion alert: Our textbook uses w to denote a world state variable and x to denote a measurement. (We have been using x to denote world state and y as the measurement.)

Message Passing: Belief Propagation

Example: 1D chain. Find the marginal for a particular node.

- For K-state nodes, the cost of brute-force summation is exponential in the length of the chain.
- But we can exploit the graphical structure (conditional independences).

K is the number of discrete values a variable can take; N is the number of variables.

Applicable to both directed and undirected graphs.

Key Idea of Message Passing

Multiplication distributes over addition:

a * b + a * c = a * (b + c)

As a consequence, a factored expression can be evaluated with fewer operations than its expanded form.
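To make the distributive-law idea concrete, here is a small sketch that computes the same sum of products two ways and counts multiplications. The input numbers and function names are made up for illustration and do not come from the slides.

```python
def naive_sum(a, b):
    """Sum of all pairwise products a_i * b_j, counting multiplications."""
    total, mults = 0.0, 0
    for ai in a:
        for bj in b:
            total += ai * bj
            mults += 1
    return total, mults

def factored_sum(a, b):
    """Same quantity computed as (sum of a) * (sum of b): one multiplication."""
    return sum(a) * sum(b), 1

# Hypothetical inputs, chosen only to illustrate the operation counts.
a = [1.0, 2.0, 3.0]
b = [4.0, 5.0]

print(naive_sum(a, b))      # 6 multiplications
print(factored_sum(a, b))   # 1 multiplication, same total
```

Message passing applies the same trick to sums over the values of random variables: sums are pushed inside products so that each one ranges over as few factors as possible.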

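Example: Message Passing

In the slide's distributive-law example, the expanded expression costs 48 multiplications + 23 additions, while the factored form costs 5 multiplications + 6 additions. For message passing, this principle is applied to functions of random variables, rather than the variables as done here.

In the next several slides, we will consider an example of a simple, four-variable Markov chain.

Message Passing

Now consider computing the marginal distribution of variable x3:

$P(x_3) = \sum_{x_1} \sum_{x_2} \sum_{x_4} P(x_1)\, P(x_2 \mid x_1)\, P(x_3 \mid x_2)\, P(x_4 \mid x_3)$

Multiplication distributes over addition, so the sums can be pushed inside the products:

$P(x_3) = \Big[ \sum_{x_2} P(x_3 \mid x_2) \Big[ \sum_{x_1} P(x_2 \mid x_1)\, P(x_1) \Big] \Big] \Big[ \sum_{x_4} P(x_4 \mid x_3) \Big]$

Can view as sending/combining messages: the first bracket is a message arriving from the front of the chain, the second a message arriving from the back.

Message Passing, aka Forward-Backward Algorithm

Express marginals as a product of messages evaluated forward from the ancestors of xi and backwards from the descendants of xi:

$P(x_i) = \frac{1}{Z}\, \mu_\alpha(x_i)\, \mu_\beta(x_i)$   (Forw * Back)

Recursive evaluation of messages:

$\mu_\alpha(x_i) = \sum_{x_{i-1}} P(x_i \mid x_{i-1})\, \mu_\alpha(x_{i-1})$
$\mu_\beta(x_i) = \sum_{x_{i+1}} P(x_{i+1} \mid x_i)\, \mu_\beta(x_{i+1})$

Find Z by normalizing.

Works in both directed and undirected graphs.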

Confusion Alert!

This standard notation for defining message passing heavily overloads the notion of multiplication, e.g. the messages are not scalars. It is more appropriate to think of them as vectors, matrices, or even tensors depending on how many variables are involved, with "multiplication" defined accordingly. Not scalar multiplication!

Tables for the four-variable chain (each variable takes the value 1 or 2; rows index the parent's value, columns the child's value):

P(x1)      = [ .7  .3 ]

P(x2 | x1) = [ .4  .6 ]
             [ .5  .5 ]

P(x3 | x2) = [ .8  .2 ]
             [ .2  .8 ]

P(x4 | x3) = [ .5  .5 ]
             [ .7  .3 ]

Note: these are conditional probability tables, so values in each row must sum to one.

Interpretation: the entries of the first table are P(x1=1) and P(x1=2); the first row of the second table holds P(x2=1 | x1=1) and P(x2=2 | x1=1), and so on. For example, P(x2=2 | x1=1) = 0.6.

Sample computations:

P(x1=1, x2=1, x3=1, x4=1) = (.7)(.4)(.8)(.5) = .112
P(x1=2, x2=1, x3=2, x4=1) = (.3)(.5)(.2)(.7) = .021

Joint Probability, represented in a truth table

[Table: P(x1,x2,x3,x4) for all 16 configurations of x1, x2, x3, x4; the individual values are not preserved in this text.]

Compute the marginal of x3 by summing the matching rows:

P(x3=1) = 0.458
P(x3=2) = 0.542
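The brute-force route through the truth table can be sketched as below, assuming the chain's table values are P(x1) = [.7 .3], P(x2|x1) = [.4 .6; .5 .5], P(x3|x2) = [.8 .2; .2 .8], and P(x4|x3) = [.5 .5; .7 .3], which is what the worked arithmetic in this example implies. State values 1 and 2 are stored as indices 0 and 1; all variable names are illustrative.

```python
from itertools import product

# Chain x1 -> x2 -> x3 -> x4; state value 1 is index 0, value 2 is index 1.
p_x1 = [0.7, 0.3]
p_x2_given_x1 = [[0.4, 0.6], [0.5, 0.5]]   # row = value of x1
p_x3_given_x2 = [[0.8, 0.2], [0.2, 0.8]]   # row = value of x2
p_x4_given_x3 = [[0.5, 0.5], [0.7, 0.3]]   # row = value of x3

def joint(x1, x2, x3, x4):
    """P(x1,x2,x3,x4) = P(x1) P(x2|x1) P(x3|x2) P(x4|x3), 0-based indices."""
    return (p_x1[x1] * p_x2_given_x1[x1][x2]
            * p_x3_given_x2[x2][x3] * p_x4_given_x3[x3][x4])

# The full truth table: one entry per configuration, 2**4 = 16 rows.
table = {cfg: joint(*cfg) for cfg in product(range(2), repeat=4)}

# Marginal of x3: add up every row that has the given x3 value.
marg_x3 = [sum(p for cfg, p in table.items() if cfg[2] == v) for v in range(2)]

for cfg, p in sorted(table.items()):
    print([s + 1 for s in cfg], round(p, 4))
print("P(x3):", [round(p, 4) for p in marg_x3])
```

This is the exponential-cost baseline that message passing avoids; it is still useful as a check on the message-passing results for small chains.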

Now compute the marginal of x3 via message passing.

First combine P(x1) with P(x2 | x1):

P(x1=1) P(x2=1 | x1=1) = (.7)(.4)     P(x1=1) P(x2=2 | x1=1) = (.7)(.6)
P(x1=2) P(x2=1 | x1=2) = (.3)(.5)     P(x1=2) P(x2=2 | x1=2) = (.3)(.5)

Summing down the columns marginalizes out x1:

(.7)(.4) + (.3)(.5) = 0.43
(.7)(.6) + (.3)(.5) = 0.57

There is a simpler way to compute this:

[ .7  .3 ] [ .4  .6 ]  =  [ .43  .57 ]
           [ .5  .5 ]

i.e. matrix multiply can do the combining and marginalization all at once!

Now continue down the chain:

[ .43  .57 ] [ .8  .2 ]  =  [ .458  .542 ]
             [ .2  .8 ]

For the back part of the chain, compute the sum along the rows of P(x4 | x3). We can also do that with matrix multiply:

[ .5  .5 ] [ 1 ]  =  [ 1 ]
[ .7  .3 ] [ 1 ]     [ 1 ]

Note: this is not a coincidence.

Message Passing

Can view as sending/combining messages...

Forw = [ .458  .542 ]: belief that x3=1 and belief that x3=2, from the front part of the chain.
Back = [ 1  1 ]: belief that x3=1 and belief that x3=2, from the back part of the chain.

How to combine them? Multiply elementwise:

Forw * Back = [ (.458)(1)  (.542)(1) ] = [ .458  .542 ]

(after normalizing, but note that it was already normalized. Again, not a coincidence.)

These are the same values for the marginal P(x3) that we computed from the raw joint probability table. Whew!!!
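The matrix-multiply view of the messages for x3 can be sketched in plain Python as follows. The table values are those implied by the arithmetic in this example; `vec_mat` and `mat_vec` are hypothetical helper names written out by hand so the sketch needs no external libraries.

```python
p_x1 = [0.7, 0.3]
p_x2_given_x1 = [[0.4, 0.6], [0.5, 0.5]]
p_x3_given_x2 = [[0.8, 0.2], [0.2, 0.8]]
p_x4_given_x3 = [[0.5, 0.5], [0.7, 0.3]]

def vec_mat(v, m):
    """Row vector times matrix: out[j] = sum_i v[i] * m[i][j]."""
    return [sum(v[i] * m[i][j] for i in range(len(v)))
            for j in range(len(m[0]))]

def mat_vec(m, v):
    """Matrix times column vector: out[i] = sum_j m[i][j] * v[j]."""
    return [sum(m[i][j] * v[j] for j in range(len(v)))
            for i in range(len(m))]

# Forward messages: combine and marginalize in one multiply per link.
forw_x2 = vec_mat(p_x1, p_x2_given_x1)        # approximately [0.43, 0.57]
forw_x3 = vec_mat(forw_x2, p_x3_given_x2)     # approximately [0.458, 0.542]

# Backward message into x3: sum along the rows of P(x4 | x3).
back_x3 = mat_vec(p_x4_given_x3, [1.0, 1.0])  # approximately [1.0, 1.0]

# Combine by elementwise multiply, then normalize.
unnorm = [f * b for f, b in zip(forw_x3, back_x3)]
z = sum(unnorm)
marg_x3 = [u / z for u in unnorm]
print(marg_x3)
```

Each link costs one small matrix-vector product, so the work grows linearly with the chain length instead of exponentially.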

If we want to compute all marginals, we can do it in one shot by cascading, for a big computational savings. We need one cascaded forward pass, one separate cascaded backward pass, then a combination and normalization at each node: combine the forward and backward message at each node by elementwise multiply, and normalize.

             x1           x2            x3             x4
Forward:     [ .7  .3 ]   [ .43  .57 ]  [ .458  .542 ]  [ .6084  .3916 ]
Backward:    [ 1  1 ]     [ 1  1 ]      [ 1  1 ]        [ 1  1 ]
Combined
+ normalized [ .7  .3 ]   [ .43  .57 ]  [ .458  .542 ]  [ .6084  .3916 ]

Note: In this example, a directed Markov chain using true conditional probabilities (rows sum to one), only the forward pass is needed. This is true because the backward pass sums along rows, and always produces [ 1  1 ]. We didn't really need forward AND backward in this example: the forward messages are already the marginals, and the backward and combination steps could have been skipped.

Max Marginals

What if we want to know the most probable state (mode of the distribution)?

Since the marginal distributions can tell us which value of each variable yields the highest marginal probability (that is, which value is most likely), we might try to just take the argmax of each marginal distribution.

Marginals computed by belief propagation:

             x1           x2            x3             x4
marginals    [ .7  .3 ]   [ .43  .57 ]  [ .458  .542 ]  [ .6084  .3916 ]
argmax       x1 = 1       x2 = 2        x3 = 2          x4 = 1

Although that's correct in this example, it isn't always the case.
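One way to sketch the cascaded forward and backward passes for a chain is shown below, using the table values implied by this example. The function name `forward_backward` and its argument layout are assumptions made for illustration; it handles only simple chains of discrete variables.

```python
def forward_backward(prior, transitions):
    """Marginals of every node in a chain x1 -> x2 -> ... -> xN.

    prior: P(x1) as a list.
    transitions[k][i][j]: P(next = j | current = i) for link k.
    Returns (forward messages, backward messages, normalized marginals).
    """
    n_states = len(prior)

    # Forward pass: each message is the previous one times the link's table.
    forward = [list(prior)]
    for t in transitions:
        prev = forward[-1]
        forward.append([sum(prev[i] * t[i][j] for i in range(n_states))
                        for j in range(n_states)])

    # Backward pass: start with all ones at the last node, work toward x1.
    backward = [[1.0] * n_states]
    for t in reversed(transitions):
        nxt = backward[0]
        backward.insert(0, [sum(t[i][j] * nxt[j] for j in range(n_states))
                            for i in range(n_states)])

    # Combine by elementwise multiply and normalize at each node.
    marginals = []
    for f, b in zip(forward, backward):
        unnorm = [fi * bi for fi, bi in zip(f, b)]
        z = sum(unnorm)
        marginals.append([u / z for u in unnorm])
    return forward, backward, marginals

forward, backward, marginals = forward_backward(
    [0.7, 0.3],
    [[[0.4, 0.6], [0.5, 0.5]],    # P(x2 | x1)
     [[0.8, 0.2], [0.2, 0.8]],    # P(x3 | x2)
     [[0.5, 0.5], [0.7, 0.3]]],   # P(x4 | x3)
)
for name, m in zip(["x1", "x2", "x3", "x4"], marginals):
    print(name, [round(p, 4) for p in m])
```

The backward pass is kept even though it yields all ones here, because with observed evidence or unnormalized potentials the backward messages are generally not constant.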

Max Marginals can Fail to Find the MAP

However, the marginals find the most likely values of each variable treated individually, which may not be the combination of values that jointly maximizes the distribution.

[Figure: a two-variable joint distribution where the max marginals give w1=4, w2=2 but the actual MAP solution is w1=2, w2=4.]

Max-product Algorithm

Goal: find $\mathbf{x}^{\max} = \arg\max_{\mathbf{x}} p(\mathbf{x})$.

Define the max-marginal $\phi(x_i) = \max_{\mathbf{x} \setminus x_i} p(\mathbf{x})$; then $\max_{x_i} \phi(x_i) = \max_{\mathbf{x}} p(\mathbf{x})$.

This is the message passing algorithm with "sum" replaced by "max". It generalizes to any two operations forming a semiring.

Computing MAP Value

Can solve using the message passing algorithm with sum replaced by max. In our chain, we start at the end and work our way back to the root (x1) using the max-product algorithm, keeping track of the max value as we go.

Max marginal for x3:

stage 1: combine P(x1) with P(x2 | x1) and take the max over x1
  x2=1: max[ (.7)(.4), (.3)(.5) ] = 0.28
  x2=2: max[ (.7)(.6), (.3)(.5) ] = 0.42
  result: [ .28  .42 ]

stage 2: max-product of [ .28  .42 ] with P(x3 | x2)
  x3=1: max[ (.28)(.8), (.42)(.2) ] = .224
  x3=2: max[ (.28)(.2), (.42)(.8) ] = .336
  result: [ .224  .336 ]

Note that this is no longer matrix multiplication, since we are not summing down the columns but taking the max instead...

stage 3: compute the max along the rows of P(x4 | x3)
  result: [ .5  .7 ]
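The three max-product stages can be sketched as follows, again with the table values implied by this example. The helper name `max_product_step` is hypothetical; the last lines combine the stage 2 and stage 3 results elementwise to get the max-marginal of x3.

```python
p_x1 = [0.7, 0.3]
p_x2_given_x1 = [[0.4, 0.6], [0.5, 0.5]]
p_x3_given_x2 = [[0.8, 0.2], [0.2, 0.8]]
p_x4_given_x3 = [[0.5, 0.5], [0.7, 0.3]]

def max_product_step(msg, cpt):
    """out[j] = max_i msg[i] * cpt[i][j]: a vector-matrix product with
    the sum over i replaced by a max over i."""
    return [max(msg[i] * cpt[i][j] for i in range(len(msg)))
            for j in range(len(cpt[0]))]

stage1 = max_product_step(p_x1, p_x2_given_x1)      # approx. [0.28, 0.42]
stage2 = max_product_step(stage1, p_x3_given_x2)    # approx. [0.224, 0.336]
stage3 = [max(row) for row in p_x4_given_x3]        # [0.5, 0.7]

# Max-marginal of x3: elementwise product of front and back messages.
max_marginal_x3 = [a * b for a, b in zip(stage2, stage3)]
map_value = max(max_marginal_x3)                    # mode of the joint
x3_at_mode = max_marginal_x3.index(map_value) + 1   # 1-based value of x3
print(max_marginal_x3, map_value, x3_at_mode)
```

The structure is identical to the sum-product computation; only the reduction over the eliminated variable changes from sum to max.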

Combining the max-product messages at x3:

[ .224  .336 ] * [ .5  .7 ] = [ (.224)(.5), (.336)(.7) ] = [ .112  .2352 ]

max = .2352, argmax: x3 = 2

Interpretation: the mode of the joint distribution is .2352, and the value of variable x3, in the configuration that yields the mode value, is 2.

Joint Probability, represented in a truth table

[Table: P(x1,x2,x3,x4) for all 16 configurations. The largest value of the joint probability, .2352, is the mode, i.e. the MAP value.]

Computing Arg-Max of MAP Value

Message Passing: MAP Estimate

Chris Bishop, PRML: "At this point, we might be tempted simply to continue with the message passing algorithm [sending forward-backward messages and combining to compute argmax for each variable node]. However, because we are now maximizing rather than summing, it is possible that there may be multiple configurations of x all of which give rise to the maximum value for p(x). In such cases, this strategy can fail because it is possible for the individual variable values obtained by maximizing the product of messages at each node to belong to different maximizing configurations, giving an overall configuration that no longer corresponds to a maximum. The problem can be resolved by adopting a rather different kind of message passing..."

Essentially, the solution is to write a dynamic programming algorithm based on max-product.

DP State Space Trellis

[Figure: trellis with one column of two states per variable x1..x4; node and edge weights are the entries of P(x1), P(x2 | x1), P(x3 | x2), and P(x4 | x3).]

The largest value of the joint probability (the mode, i.e. the MAP estimate) is achieved for x1=1, x2=2, x3=2, x4=1.
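A minimal dynamic-programming sketch of the MAP estimate on the trellis is given below, in the style of the Viterbi algorithm: max-product messages plus backpointers that record which predecessor achieved each max. The function name `viterbi` is an illustrative choice, and this version sweeps forward from x1 and backtracks from x4, which is one of several equivalent orderings rather than necessarily the one drawn on the slides. Table values are those implied by this example.

```python
def viterbi(prior, transitions):
    """MAP configuration of a chain via max-product with backpointers.

    Returns (max joint probability, list of 1-based state values).
    """
    n_states = len(prior)
    score = list(prior)   # best probability of any path ending in each state
    backptrs = []         # backptrs[k][j]: best predecessor of state j on link k

    for t in transitions:
        new_score, ptr = [], []
        for j in range(n_states):
            cands = [score[i] * t[i][j] for i in range(n_states)]
            best_i = max(range(n_states), key=lambda i: cands[i])
            new_score.append(cands[best_i])
            ptr.append(best_i)
        backptrs.append(ptr)
        score = new_score

    # Pick the best final state, then follow backpointers to the start.
    state = max(range(n_states), key=lambda j: score[j])
    best = score[state]
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return best, [s + 1 for s in path]

best, path = viterbi(
    [0.7, 0.3],
    [[[0.4, 0.6], [0.5, 0.5]],    # P(x2 | x1)
     [[0.8, 0.2], [0.2, 0.8]],    # P(x3 | x2)
     [[0.5, 0.5], [0.7, 0.3]]],   # P(x4 | x3)
)
print(best, path)
```

Because the backpointers all come from a single maximizing path, the returned values are guaranteed to belong to one consistent configuration, which is exactly the failure mode of per-node argmax that the quoted passage warns about.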

Belief Propagation Summary

- Definition can be extended to general tree-structured graphs.
- Works for both directed AND undirected graphs.
- Efficiently computes marginals and MAP configurations.
- At each node:
  - form the product of incoming messages and local evidence
  - marginalize to give the outgoing message
  - one message in each direction across every link
- Gives the exact answer in any acyclic graph (no loops).

Loopy Belief Propagation

- BP applied to a graph that contains loops
  - needs a propagation schedule
  - needs multiple iterations
  - might not converge
- Typically works well, even though it isn't supposed to.
- State-of-the-art performance in error-correcting codes.