3. The Junction Tree Algorithms



A Short Course on Graphical Models
3. The Junction Tree Algorithms
Mark Paskin
mark@paskin.org

Review: conditional independence

Two random variables X and Y are independent (written $X \perp Y$) iff $p_X(\cdot) = p_{X \mid Y}(\cdot, y)$ for all $y$. If $X \perp Y$, then Y gives us no information about X.

X and Y are conditionally independent given Z (written $X \perp Y \mid Z$) iff $p_{X \mid Z}(\cdot, z) = p_{X \mid Y, Z}(\cdot, y, z)$ for all $y$ and $z$. If $X \perp Y \mid Z$, then Y gives us no new information about X once we know Z.

We can obtain compact, factorized representations of densities by using the chain rule in combination with conditional independence assumptions.

The Variable Elimination algorithm uses the distributivity of $\times$ over $+$ to perform inference efficiently in factorized densities.
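
To make the definitions concrete, here is a minimal numeric check of conditional independence on a made-up three-variable joint table (the probabilities and helper names below are illustrative, not from the slides):

```python
import itertools

# Construct p(x, y, z) = p(z) p(x|z) p(y|z); X and Y are conditionally
# independent given Z by construction.
pz = {0: 0.3, 1: 0.7}
px_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}   # px_z[z][x]
py_z = {0: {0: 0.2, 1: 0.8}, 1: {0: 0.5, 1: 0.5}}   # py_z[z][y]
p = {(x, y, z): pz[z] * px_z[z][x] * py_z[z][y]
     for x, y, z in itertools.product((0, 1), repeat=3)}

def p_x_given_yz(x, y, z):
    """p(X = x | Y = y, Z = z), computed from the joint table."""
    return p[(x, y, z)] / sum(p[(xx, y, z)] for xx in (0, 1))

def p_x_given_z(x, z):
    """p(X = x | Z = z), computed from the joint table."""
    num = sum(p[(x, yy, z)] for yy in (0, 1))
    den = sum(p[(xx, yy, z)] for xx in (0, 1) for yy in (0, 1))
    return num / den

# X is conditionally independent of Y given Z iff p(x|y,z) = p(x|z) everywhere.
for x, y, z in itertools.product((0, 1), repeat=3):
    assert abs(p_x_given_yz(x, y, z) - p_x_given_z(x, z)) < 1e-12
print("verified: X is conditionally independent of Y given Z")
```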

Review: graphical models

[figure: a Bayesian network over variables A through F with factorization $p_A \, p_{B \mid A} \, p_{C \mid A} \, p_{D \mid B} \, p_{E \mid C} \, p_{F \mid B,E}$, and the corresponding undirected graphical model with factorization $\frac{1}{Z} \, \psi_A \, \psi_{AB} \, \psi_{AC} \, \psi_{BD} \, \psi_{CE} \, \psi_{BEF}$]

In Bayesian networks, d-separation implies conditional independence; in undirected graphical models, graph separation implies conditional independence.

Moralization converts a Bayesian network into an undirected graphical model (but it does not preserve all of the conditional independence properties).

A notation for sets of random variables

It is helpful when working with large, complex models to have a good notation for sets of random variables.

Let $X = (X_i : i \in V)$ be a vector random variable with density $p$.

For each $A \subseteq V$, let $X_A = (X_i : i \in A)$.

For $A, B \subseteq V$, let $p_A = p_{X_A}$ and $p_{A \mid B} = p_{X_A \mid X_B}$.

Example. If $V = \{a, b, c\}$ and $A = \{a, c\}$, then $X = (X_a, X_b, X_c)$ and $X_A = (X_a, X_c)$, where $X_a$, $X_b$, and $X_c$ are random variables.

A notation for assignments

We also need a notation for dealing flexibly with functions of many arguments.

An assignment to $A$ is a set of index-value pairs $u = \{(i, x_i) : i \in A\}$, one per index $i \in A$, where $x_i$ is in the range of $X_i$. Let $\mathcal{X}_A$ be the set of assignments to $X_A$ (with $\mathcal{X} = \mathcal{X}_V$).

Building new assignments from given assignments:

- Given assignments $u$ and $v$ to disjoint subsets $A$ and $B$, respectively, their union $u \cup v$ is an assignment to $A \cup B$.
- If $u$ is an assignment to $A$, then the restriction of $u$ to $B \subseteq V$ is $u_B = \{(i, x_i) \in u : i \in B\}$, an assignment to $A \cap B$.
- If $u = \{(i, x_i) : i \in A\}$ is an assignment and $f$ is a function, then $f(u) = f(x_i : i \in A)$.
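
This notation maps directly onto code. Here is a small sketch that represents assignments as frozensets of (index, value) pairs and implements union and restriction (the representation is my own choice):

```python
def union(u, v):
    """u ∪ v: combine assignments to disjoint index sets A and B."""
    assert not ({i for i, _ in u} & {i for i, _ in v}), "index sets must be disjoint"
    return u | v

def restrict(u, B):
    """u_B: the pairs of u whose index lies in B (an assignment to A ∩ B)."""
    return frozenset((i, x) for i, x in u if i in B)

u = frozenset({("a", 0), ("c", 1)})     # an assignment to A = {a, c}
v = frozenset({("b", 2)})               # an assignment to B = {b}
print(sorted(union(u, v)))              # an assignment to {a, b, c}
print(sorted(restrict(u, {"a", "b"})))  # [("a", 0)], an assignment to {a}
```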

Examples of the assignment notation

1. If $p$ is the joint density of $X$, then the marginal density of $X_A$ is
   $$p_A(v) = \sum_{u \in \mathcal{X}_{\bar{A}}} p(v \cup u), \quad v \in \mathcal{X}_A$$
   where $\bar{A} = V \setminus A$ is the complement of $A$.

2. If $p$ takes the form of a normalized product of potentials, we can write it as
   $$p(u) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(u_C), \quad u \in \mathcal{X}$$
   where $\mathcal{C}$ is a set of subsets of $V$, and each $\psi_C$ is a potential function that depends only upon $X_C$. The Markov graph of $p$ has clique set $\mathcal{C}$.

Review: the inference problem

Input:
- a vector random variable $X = (X_i : i \in V)$;
- a joint density for $X$ of the form $p(u) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(u_C)$;
- an evidence assignment $w$ to $E$; and
- some query variables $X_Q$.

Output: $p_{Q \mid E}(\cdot, w)$, the conditional density of $X_Q$ given the evidence $w$.

Dealing with evidence

From the definition of conditional probability, we have:
$$p_{\bar{E} \mid E}(u, w) = \frac{p(u \cup w)}{p_E(w)} = \frac{\frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(u_C \cup w_C)}{p_E(w)}$$

For fixed evidence $w$ on $X_E$, this is another normalized product of potentials:
$$p_{E=w}(u) = \frac{1}{Z'} \prod_{C' \in \mathcal{C}'} \psi'_{C'}(u_{C'})$$
where $Z' = Z \, p_E(w)$, each $C' = C \setminus E$, and $\psi'_{C'}(u) = \psi_C(u \cup w_C)$.

Thus, to deal with evidence, we simply instantiate it in all clique potentials.
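
A sketch of evidence instantiation on a tabular potential, using the frozenset assignment representation from above (the table values and function names are made up):

```python
import itertools

def instantiate(psi, evidence):
    """Instantiate evidence in a tabular potential: keep the entries that
    agree with the evidence and drop the instantiated variables, giving
    psi'_{C'}(u) = psi_C(u ∪ w_C) with C' = C \\ E."""
    out = {}
    for u, value in psi.items():
        ud = dict(u)
        if all(ud[i] == x for i, x in evidence.items() if i in ud):
            out[frozenset(t for t in u if t[0] not in evidence)] = value
    return out

# A potential over {a, b} (the table values here are arbitrary).
psi_ab = {frozenset({("a", x), ("b", y)}): 1.0 + x + 2 * y
          for x, y in itertools.product((0, 1), repeat=2)}
print(instantiate(psi_ab, {"b": 1}))   # a potential over {a} alone
```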

The reformulated inference problem

Given a joint density for $X = (X_i : i \in V)$ of the form
$$p(u) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(u_C)$$
compute the marginal density of $X_Q$:
$$p_Q(v) = \sum_{u \in \mathcal{X}_{\bar{Q}}} p(v \cup u) = \frac{1}{Z} \sum_{u \in \mathcal{X}_{\bar{Q}}} \prod_{C \in \mathcal{C}} \psi_C(v_C \cup u_C)$$

Review: Variable Elimination

For each $i \in \bar{Q}$, push in the sum over $X_i$ and compute it:

$$p_Q(v) = \frac{1}{Z} \sum_{u \in \mathcal{X}_{\bar{Q}}} \prod_{C \in \mathcal{C}} \psi_C(v_C \cup u_C)$$
$$= \frac{1}{Z} \sum_{u \in \mathcal{X}_{\bar{Q} \setminus \{i\}}} \sum_{w \in \mathcal{X}_{\{i\}}} \prod_{C \not\ni i} \psi_C(v_C \cup u_C) \prod_{C \ni i} \psi_C(v_C \cup u_C \cup w_C)$$
$$= \frac{1}{Z} \sum_{u \in \mathcal{X}_{\bar{Q} \setminus \{i\}}} \prod_{C \not\ni i} \psi_C(v_C \cup u_C) \sum_{w \in \mathcal{X}_{\{i\}}} \prod_{C \ni i} \psi_C(v_C \cup u_C \cup w_C)$$
$$= \frac{1}{Z} \sum_{u \in \mathcal{X}_{\bar{Q} \setminus \{i\}}} \prod_{C \not\ni i} \psi_C(v_C \cup u_C) \cdot \psi_{E_i}(v_{E_i} \cup u_{E_i})$$

This creates a new elimination clique $E_i = \left(\bigcup_{C \ni i} C\right) \setminus \{i\}$. At the end we have $p_Q = \frac{1}{Z} \psi_Q$, and we normalize to obtain $p_Q$ (and $Z$).
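
Here is a compact, runnable sketch of Variable Elimination over tabular potentials (the dict-of-assignments representation and the tiny chain model are my own; any factorized density would do):

```python
import itertools
from functools import reduce

# Hypothetical binary variables; the chain a - b - c is made up for the demo.
domain = {"a": (0, 1), "b": (0, 1), "c": (0, 1)}

def factor(variables, fn):
    """Tabular potential: map each assignment (a frozenset of (var, value)
    pairs) to fn(assignment-as-dict)."""
    table = {}
    for combo in itertools.product(*(domain[v] for v in variables)):
        u = dict(zip(variables, combo))
        table[frozenset(u.items())] = fn(u)
    return table

def scope(psi):
    return sorted({i for u in psi for i, _ in u})

def multiply(psi1, psi2):
    """Pointwise product on the union of the two scopes."""
    def lookup(psi, u):
        return psi[frozenset((i, u[i]) for i in scope(psi))]
    vs = sorted(set(scope(psi1)) | set(scope(psi2)))
    return factor(vs, lambda u: lookup(psi1, u) * lookup(psi2, u))

def sum_out(psi, i):
    """Marginalize variable i out of psi (the 'push in the sum' step)."""
    out = {}
    for u, value in psi.items():
        rest = frozenset(t for t in u if t[0] != i)
        out[rest] = out.get(rest, 0.0) + value
    return out

def variable_elimination(factors, query):
    """Compute the normalized marginal p_Q by eliminating each non-query
    variable: gather the factors that mention it, multiply, sum it out."""
    factors = list(factors)
    for i in [v for v in domain if v not in query]:
        related = [f for f in factors if i in scope(f)]
        if not related:
            continue
        factors = [f for f in factors if i not in scope(f)]
        factors.append(sum_out(reduce(multiply, related), i))
    joint = reduce(multiply, factors)
    z = sum(joint.values())
    return {u: value / z for u, value in joint.items()}

psi_ab = factor(("a", "b"), lambda u: 1.0 if u["a"] == u["b"] else 0.5)
psi_bc = factor(("b", "c"), lambda u: 2.0 if u["b"] == u["c"] else 1.0)
print(variable_elimination([psi_ab, psi_bc], query={"c"}))
```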

From Variable Elimination to the junction tree algorithms

Variable Elimination is query-sensitive: we must specify the query variables in advance, so each time we run a different query, we must re-run the entire algorithm.

The junction tree algorithms generalize Variable Elimination to avoid this; they compile the density into a data structure that supports the simultaneous execution of a large class of queries.

Junction trees

[figure: a graph G over nodes a through f, and a cluster tree T with clusters {a,b,c}, {b,c,e}, {b,e,f}, {b,d} joined by separators {b,c}, {b,e}, {b}]

A cluster graph T is a junction tree for G if it has these three properties:
1. singly connected: there is exactly one path between each pair of clusters.
2. covering: for each clique A of G, there is some cluster C such that $A \subseteq C$.
3. running intersection: for each pair of clusters B and C that contain i, each cluster on the unique path between B and C also contains i.
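
These three properties are easy to check mechanically. Below is a sketch of a checker, run on the junction tree from the figure (the data representation and function names are my own):

```python
def is_junction_tree(clusters, edges, cliques):
    """Check the three junction tree properties from this slide."""
    nbrs = {c: set() for c in clusters}
    for b, c in edges:
        nbrs[b].add(c)
        nbrs[c].add(b)

    def reachable(start, allowed):
        """Clusters reachable from `start` moving only through `allowed`."""
        seen, stack = set(), [start]
        while stack:
            c = stack.pop()
            if c not in seen and c in allowed:
                seen.add(c)
                stack.extend(nbrs[c])
        return seen

    # 1. Singly connected: connected with |clusters| - 1 edges, i.e. a tree.
    if len(edges) != len(clusters) - 1:
        return False
    if len(reachable(clusters[0], set(clusters))) != len(clusters):
        return False
    # 2. Covering: every clique of G lies inside some cluster.
    if not all(any(a <= c for c in clusters) for a in cliques):
        return False
    # 3. Running intersection: for each variable i, the clusters containing i
    #    must form a connected subgraph of the tree.
    for i in {x for c in clusters for x in c}:
        holders = {c for c in clusters if i in c}
        if reachable(next(iter(holders)), holders) != holders:
            return False
    return True

clusters = [frozenset("abc"), frozenset("bce"), frozenset("bef"), frozenset("bd")]
edges = [(clusters[0], clusters[1]), (clusters[1], clusters[2]),
         (clusters[1], clusters[3])]
cliques = [frozenset("ab"), frozenset("ac"), frozenset("bd"),
           frozenset("ce"), frozenset("bef")]
print(is_junction_tree(clusters, edges, cliques))   # True
```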

Building junction trees

To build a junction tree:
1. Choose an ordering of the nodes and use Node Elimination to obtain a set of elimination cliques.
2. Build a complete cluster graph over the maximal elimination cliques.
3. Weight each edge $\{B, C\}$ by $|B \cap C|$ and compute a maximum-weight spanning tree.

This spanning tree is a junction tree for G (see Cowell et al., 1999).

Different junction trees are obtained with different elimination orders and different maximum-weight spanning trees. Finding the junction tree with the smallest clusters is an NP-hard problem.
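
A sketch of this construction: Node Elimination to get the cliques, then Kruskal's algorithm on the complete cluster graph with weights $|B \cap C|$ (the graph and helper names are my own; the elimination order matches the example on the next slide):

```python
import itertools

def elimination_cliques(adj, order):
    """Node Elimination: remove each node in `order`, record its clique
    (the node plus its remaining neighbors), and connect those neighbors."""
    adj = {v: set(ns) for v, ns in adj.items()}
    cliques = []
    for v in order:
        nbrs = adj.pop(v)
        cliques.append(frozenset(nbrs | {v}))
        for a in nbrs:
            adj[a] = (adj[a] | nbrs) - {a, v}
    return cliques

def maximal(cliques):
    return [c for c in cliques if not any(c < d for d in cliques)]

def junction_tree(cliques):
    """Kruskal's algorithm on the complete cluster graph with edge weight
    |B ∩ C|; any maximum-weight spanning tree is a junction tree."""
    edges = sorted(((len(b & c), b, c) for b, c in
                    itertools.combinations(cliques, 2)),
                   key=lambda e: e[0], reverse=True)
    parent = {c: c for c in cliques}
    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c
    tree = []
    for _, b, c in edges:
        rb, rc = find(b), find(c)
        if rb != rc:
            parent[rb] = rc
            tree.append((b, c))
    return tree

# The moral graph of the earlier six-variable example, eliminated in the
# order used on the next slide: f, d, e, c, b, a.
adj = {"a": {"b", "c"}, "b": {"a", "d", "e", "f"}, "c": {"a", "e"},
       "d": {"b"}, "e": {"b", "c", "f"}, "f": {"b", "e"}}
cliques = maximal(elimination_cliques(adj, list("fdecba")))
for b, c in junction_tree(cliques):
    print(sorted(b), "--", sorted(c))
```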

An example of building junction trees

1. Compute the elimination cliques (the order here is f, d, e, c, b, a).

[figure: the graph after each elimination step; the elimination cliques produced are {b,e,f}, {b,d}, {b,c,e}, {a,b,c}, {a,b}, {a}]

2. Form the complete cluster graph over the maximal elimination cliques and find a maximum-weight spanning tree.

[figure: the complete cluster graph over {a,b,c}, {b,c,e}, {b,e,f}, {b,d} with edge weights 1 and 2, and a resulting junction tree with separators {b,c}, {b,e}, {b}]

Decomposable densities

A factorized density $p(u) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(u_C)$ is decomposable if there is a junction tree with cluster set $\mathcal{C}$.

To convert a factorized density $p$ to a decomposable density:
1. Build a junction tree T for the Markov graph of $p$.
2. Create a potential $\psi_C$ for each cluster C of T and initialize it to unity.
3. Multiply each potential $\psi$ of $p$ into the cluster potential of one cluster that covers its variables.

Note: this is possible only because of the covering property.
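
A sketch of step 3, assigning each factor of $p$ to one covering cluster (the cluster and scope sets follow the running example; the function name is my own):

```python
def assign_factors(clusters, factor_scopes):
    """Map each factor of p to one cluster that covers its scope; the
    covering property guarantees such a cluster exists."""
    assignment = {c: [] for c in clusters}
    for k, s in enumerate(factor_scopes):
        cover = next(c for c in clusters if s <= c)   # exists by covering
        assignment[cover].append(k)
    return assignment

clusters = [frozenset("abc"), frozenset("bce"), frozenset("bef"), frozenset("bd")]
scopes = [frozenset("ab"), frozenset("ac"), frozenset("bd"),
          frozenset("ce"), frozenset("bef")]
for cluster, ks in assign_factors(clusters, scopes).items():
    print(sorted(cluster), "gets factors", ks)
```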

The junction tree inference algorithms

The junction tree algorithms take as input a decomposable density and its junction tree. They have the same distributed structure:
- Each cluster starts out knowing only its local potential and its neighbors.
- Each cluster sends one message (potential function) to each neighbor.
- By combining its local potential with the messages it receives, each cluster is able to compute the marginal density of its variables.

The message passing protocol

The junction tree algorithms obey the message passing protocol: cluster B is allowed to send a message to a neighbor C only after it has received messages from all neighbors except C.

One admissible schedule is obtained by choosing one cluster R to be the root, so the junction tree is directed. Execute Collect(R) and then Distribute(R):
1. Collect(C): For each child B of C, recursively call Collect(B) and then pass a message from B to C.
2. Distribute(C): For each child B of C, pass a message to B and then recursively call Distribute(B).

[figure: COLLECT messages flow from the leaves toward the root; DISTRIBUTE messages flow from the root back out to the leaves]
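
A sketch that derives one admissible schedule by rooting the tree and running Collect and then Distribute (the tree representation and names are my own):

```python
def schedule(tree, root):
    """Return messages in a protocol-respecting order: Collect(root), then
    Distribute(root)."""
    nbrs = {}
    for b, c in tree:
        nbrs.setdefault(b, []).append(c)
        nbrs.setdefault(c, []).append(b)
    order = []
    def collect(c, parent):
        for b in nbrs[c]:
            if b != parent:
                collect(b, c)
                order.append((b, c))   # message b -> c after b hears its children
    def distribute(c, parent):
        for b in nbrs[c]:
            if b != parent:
                order.append((c, b))   # message c -> b before b's children
                distribute(b, c)
    collect(root, None)
    distribute(root, None)
    return order

tree = [("ABC", "BCE"), ("BCE", "BEF"), ("BCE", "BD")]
for msg in schedule(tree, root="BCE"):
    print("%s -> %s" % msg)
```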

The Shafer-Shenoy Algorithm

The message sent from B to C is defined as
$$\mu_{B \to C}(u) = \sum_{v \in \mathcal{X}_{B \setminus C}} \psi_B(u \cup v) \prod_{(A, B) \in E,\, A \neq C} \mu_{A \to B}(u_A \cup v_A)$$

Procedurally, cluster B computes the product of its local potential $\psi_B$ and the messages from all clusters except C, marginalizes out all variables that are not in C, and then sends the result to C.

Note: $\mu_{B \to C}$ is well-defined because the junction tree is singly connected.

The cluster belief at C is defined as
$$\beta_C(u) = \psi_C(u) \prod_{(B, C) \in E} \mu_{B \to C}(u_B)$$

This is the product of the cluster's local potential and the messages received from all of its neighbors. We will show that $\beta_C \propto p_C$.
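
Here is a self-contained sketch of Shafer-Shenoy message passing on the four-cluster junction tree from the running example, with arbitrary positive cluster potentials (all names and tables are my own). The final line prints the total mass of each cluster belief; they agree because every belief is proportional to a marginal of the same unnormalized density:

```python
import itertools
from functools import reduce

DOM = {v: (0, 1) for v in "abcdef"}   # hypothetical binary variables

def factor(vs, fn):
    return {frozenset(zip(vs, xs)): fn(dict(zip(vs, xs)))
            for xs in itertools.product(*(DOM[v] for v in vs))}

def mul(p, q):
    """Pointwise product on the union of the two scopes."""
    def lookup(psi, u):
        s = {i for w in psi for i, _ in w}
        return psi[frozenset((i, u[i]) for i in s)]
    vs = sorted({i for u in p for i, _ in u} | {i for u in q for i, _ in u})
    return factor(vs, lambda u: lookup(p, u) * lookup(q, u))

def marg(p, keep):
    """Sum a tabular potential down to the variables in `keep`."""
    out = {}
    for u, x in p.items():
        k = frozenset(t for t in u if t[0] in keep)
        out[k] = out.get(k, 0.0) + x
    return out

clusters = {"ABC": frozenset("abc"), "BCE": frozenset("bce"),
            "BEF": frozenset("bef"), "BD": frozenset("bd")}
nbrs = {"ABC": ["BCE"], "BEF": ["BCE"], "BD": ["BCE"],
        "BCE": ["ABC", "BEF", "BD"]}
psi = {n: factor(sorted(clusters[n]), lambda u: 1.0 + sum(u.values()))
       for n in clusters}

mu = {}
def send(B, C):
    """mu_{B->C}: multiply psi_B with messages from neighbors other than C,
    then marginalize down to the separator B ∩ C."""
    p = reduce(mul, (mu[(A, B)] for A in nbrs[B] if A != C), psi[B])
    mu[(B, C)] = marg(p, clusters[B] & clusters[C])

# A protocol-respecting schedule: Collect to the root BCE, then Distribute.
for B, C in [("ABC", "BCE"), ("BEF", "BCE"), ("BD", "BCE"),
             ("BCE", "ABC"), ("BCE", "BEF"), ("BCE", "BD")]:
    send(B, C)

def belief(C):
    """beta_C = psi_C times all incoming messages; proportional to p_C."""
    return reduce(mul, (mu[(B, C)] for B in nbrs[C]), psi[C])

print({n: round(sum(belief(n).values()), 6) for n in clusters})
```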

Correctness: Shafer-Shenoy is Variable Elimination in all directions at once

The cluster belief $\beta_C$ is computed by alternately multiplying cluster potentials together and summing out variables. This computation is of the same basic form as Variable Elimination.

To prove that $\beta_C \propto p_C$, we must prove that no sum is pushed in too far. This follows directly from the running intersection property: a variable $i$ is summed out when computing a message that leaves the set of clusters containing $i$, and the running intersection property guarantees that the clusters containing $i$ constitute a connected subgraph, so $i$ is summed out exactly once on the way to any cluster.

[figure: the clusters containing i form a connected subtree; messages within it still carry i, messages leaving it do not]

The HUGIN Algorithm

Give each cluster C and each separator S a potential function over its variables. Initialize:
$$\phi_C(u) = \psi_C(u) \qquad \phi_S(u) = 1$$

To pass a message from B to C over separator S, update
$$\phi^*_S(u) = \sum_{v \in \mathcal{X}_{B \setminus S}} \phi_B(u \cup v) \qquad \phi^*_C(u) = \phi_C(u) \, \frac{\phi^*_S(u_S)}{\phi_S(u_S)}$$

After all messages have been passed, $\phi_C \propto p_C$ for all clusters C.
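
A sketch of a single HUGIN update $B \to C$ over separator $S$ on tabular potentials (the representation mirrors the earlier sketches, and the names and table values are my own):

```python
import itertools

def marg(phi, keep):
    """Sum a tabular potential down to the variables in `keep`."""
    out = {}
    for u, x in phi.items():
        k = frozenset(t for t in u if t[0] in keep)
        out[k] = out.get(k, 0.0) + x
    return out

def hugin_update(phi_B, phi_S, phi_C, S):
    """phi*_S = sum over B\\S of phi_B;  phi*_C = phi_C * phi*_S / phi_S."""
    new_S = marg(phi_B, S)
    new_C = {}
    for u, x in phi_C.items():
        s = frozenset(t for t in u if t[0] in S)
        # Convention: 0/0 = 0; phi_S starts at 1 so this rarely triggers.
        ratio = new_S[s] / phi_S[s] if phi_S[s] != 0 else 0.0
        new_C[u] = x * ratio
    return new_S, new_C

# Tiny demo: clusters {a, b} and {b, c} with separator {b}.
phi_B = {frozenset({("a", x), ("b", y)}): 1.0 + x + 2 * y
         for x, y in itertools.product((0, 1), repeat=2)}
phi_C = {frozenset({("b", y), ("c", z)}): 1.0
         for y, z in itertools.product((0, 1), repeat=2)}
phi_S = {frozenset({("b", y)}): 1.0 for y in (0, 1)}
new_S, new_C = hugin_update(phi_B, phi_S, phi_C, {"b"})
print(new_S)
print(new_C)
```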

Correctness: HUGIN is a time-efficient version of Shafer-Shenoy

Each time the Shafer-Shenoy algorithm sends a message or computes its cluster belief, it multiplies together messages. To avoid performing these multiplications repeatedly, the HUGIN algorithm caches in $\phi_C$ the running product of $\psi_C$ and the messages received so far. When B sends a message to C, it divides out the message C sent to B from this running product.

Summary: the junction tree algorithms

Compile time:
1. Build the junction tree T:
   (a) Obtain a set of maximal elimination cliques with Node Elimination.
   (b) Build a weighted, complete cluster graph over these cliques.
   (c) Choose T to be a maximum-weight spanning tree.
2. Make the density decomposable with respect to T.

Run time:
1. Instantiate evidence in the potentials of the density.
2. Pass messages according to the message passing protocol.
3. Normalize the cluster beliefs/potentials to obtain conditional densities.

Complexity of junction tree algorithms

Junction tree algorithms represent, multiply, and marginalize potentials. With $k$ values per variable:

- storing $\psi_C$: $O(k^{|C|})$ tabular, $O(|C|^2)$ Gaussian
- computing $\psi_{B \cup C} = \psi_B \psi_C$: $O(k^{|B \cup C|})$ tabular, $O(|B \cup C|^2)$ Gaussian
- computing $\psi_{C \setminus B}(u) = \sum_{v \in \mathcal{X}_B} \psi_C(u \cup v)$: $O(k^{|C|})$ tabular, $O(|B|^3 |C|^2)$ Gaussian

The number of clusters in a junction tree, and therefore the number of messages computed, is $O(|V|)$. Thus, the time and space complexity is dominated by the size of the largest cluster in the junction tree, called the width of the junction tree:

- In tabular densities, the complexity is exponential in the width.
- In Gaussian densities, the complexity is cubic in the width.
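
A trivial computation of the width and the implied tabular table size for the running example (this follows the slide's convention that the width is the size of the largest cluster; some authors subtract 1):

```python
def width(clusters):
    """Size of the largest cluster in the junction tree."""
    return max(len(c) for c in clusters)

clusters = [frozenset("abc"), frozenset("bce"), frozenset("bef"), frozenset("bd")]
k = 2   # values per variable
print("width:", width(clusters))
print("largest tabular potential has", k ** width(clusters), "entries")
```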

Generalized Distributive Law

The general problem solved by the junction tree algorithms is the sum-of-products problem: compute
$$p_Q(v) \propto \sum_{u \in \mathcal{X}_{\bar{Q}}} \prod_{C \in \mathcal{C}} \psi_C(v_C \cup u_C)$$

The property used by the junction tree algorithms is the distributivity of $\times$ over $+$; more generally, we need a commutative semiring:

- $[0, \infty)$ with $(+, 0)$ and $(\times, 1)$: sum-product
- $[0, \infty)$ with $(\max, 0)$ and $(\times, 1)$: max-product
- $(-\infty, \infty]$ with $(\min, \infty)$ and $(+, 0)$: min-sum
- $\{T, F\}$ with $(\vee, F)$ and $(\wedge, T)$: Boolean

Many other problems are of this form, including maximum a posteriori inference, the Hadamard transform, and matrix chain multiplication.
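
Swapping the semiring changes what elimination computes. A sketch on one made-up tabular potential: the same routine performs sum-product marginalization with $+$ and max-product elimination with $\max$ (the function and table are my own):

```python
import operator

def eliminate(psi, var, combine):
    """Eliminate `var` from a tabular potential using the semiring's
    "addition" (sum, max, min, or logical-or)."""
    out = {}
    for u, x in psi.items():
        k = frozenset(t for t in u if t[0] != var)
        out[k] = combine(out[k], x) if k in out else x
    return out

psi = {frozenset({("a", x), ("b", y)}): [[0.9, 0.1], [0.4, 0.6]][x][y]
       for x in (0, 1) for y in (0, 1)}

print(eliminate(psi, "a", operator.add))   # sum-product elimination
print(eliminate(psi, "a", max))            # max-product elimination
```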

Summary

- The junction tree algorithms generalize Variable Elimination to the efficient, simultaneous execution of a large class of queries.
- The algorithms take the form of message passing on a graph called a junction tree, whose nodes are clusters, or sets, of variables.
- Each cluster starts with one potential of the factorized density. By combining this potential with the potentials it receives from its neighbors, it can compute the marginal over its variables.
- Two junction tree algorithms are the Shafer-Shenoy algorithm and the HUGIN algorithm, which avoids repeated multiplications.
- The complexity of the algorithms scales with the width of the junction tree.
- The algorithms can be generalized to solve other problems by using other commutative semirings.