Probabilistic Graphical Models

Similar documents

3. The Junction Tree Algorithms

The Basics of Graphical Models

Markov random fields and Gibbs measures

5 Directed acyclic graphs

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling

Distributed Structured Prediction for Big Data

Approximating the Partition Function by Deleting and then Correcting for Model Edges

Complexity Theory. IE 661: Scheduling Theory Fall 2003 Satyaki Ghosh Dastidar

1. Nondeterministically guess a solution (called a certificate) 2. Check whether the solution solves the problem (called verification)

Course: Model, Learning, and Inference: Lecture 5

Lecture 4: BK inequality 27th August and 6th September, 2007

Life of A Knowledge Base (KB)

5.1 Bipartite Matching

Programming Tools based on Big Data and Conditional Random Fields

Part 2: Community Detection

Bayesian Networks. Mausam (Slides by UW-AI faculty)

A Sublinear Bipartiteness Tester for Bounded Degree Graphs

Triangle deletion. Ernie Croot. February 3, 2010

DATA ANALYSIS II. Matrix Algorithms

Large-Sample Learning of Bayesian Networks is NP-Hard

Handout #Ch7 San Skulrattanakulchai Gustavus Adolphus College Dec 6, Chapter 7: Digraphs

Generating models of a matched formula with a polynomial delay

CS 598CSC: Combinatorial Optimization Lecture date: 2/4/2010

Bayesian Networks. Read R&N Ch Next lecture: Read R&N

Definition Given a graph G on n vertices, we define the following quantities:

Network (Tree) Topology Inference Based on Prüfer Sequence

Structured Learning and Prediction in Computer Vision. Contents

Message-passing sequential detection of multiple change points in networks

Clique coloring B 1 -EPG graphs

Online Model Predictive Control of a Robotic System by Combining Simulation and Optimization

6.2 Permutations continued

Fairness in Routing and Load Balancing

Tools for parsimonious edge-colouring of graphs with maximum degree three. J.L. Fouquet and J.M. Vanherpe. Rapport n o RR

CSC 373: Algorithm Design and Analysis Lecture 16

Informatics 2D Reasoning and Agents Semester 2,

Exponential time algorithms for graph coloring

Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs

Cycles and clique-minors in expanders

Private Approximation of Clustering and Vertex Cover

Determination of the normalization level of database schemas through equivalence classes of attributes

Maximum Hitting Time for Random Walks on Graphs. Graham Brightwell, Cambridge University Peter Winkler*, Emory University

Approximation Algorithms

SEQUENCES OF MAXIMAL DEGREE VERTICES IN GRAPHS. Nickolay Khadzhiivanov, Nedyalko Nenov

6.852: Distributed Algorithms Fall, Class 2

Satisfiability Checking

Guessing Game: NP-Complete?

GENERATING LOW-DEGREE 2-SPANNERS

Proximal mapping via network optimization

Graph Mining and Social Network Analysis

COMBINATORIAL PROPERTIES OF THE HIGMAN-SIMS GRAPH. 1. Introduction

(67902) Topics in Theory and Complexity Nov 2, Lecture 7

Best Monotone Degree Bounds for Various Graph Parameters

Finding and counting given length cycles

CMSC 858T: Randomized Algorithms Spring 2003 Handout 8: The Local Lemma

Lecture 16 : Relations and Functions DRAFT

On an anti-ramsey type result

Why? A central concept in Computer Science. Algorithms are ubiquitous.

On the Unique Games Conjecture

Outline. NP-completeness. When is a problem easy? When is a problem hard? Today. Euler Circuits

Viral Marketing and the Diffusion of Trends on Social Networks

Math 319 Problem Set #3 Solution 21 February 2002

Discrete Mathematics & Mathematical Reasoning Chapter 10: Graphs

On the independence number of graphs with maximum degree 3

Small Maximal Independent Sets and Faster Exact Graph Coloring

Introduction to Markov Chain Monte Carlo

An Approximation Algorithm for Bounded Degree Deletion

SHORT CYCLE COVERS OF GRAPHS WITH MINIMUM DEGREE THREE

Finding the M Most Probable Configurations Using Loopy Belief Propagation

Gibbs Sampling and Online Learning Introduction

Distributed Computing over Communication Networks: Maximal Independent Set

On three zero-sum Ramsey-type problems

Introduction to Algorithms. Part 3: P, NP Hard Problems

JUST-IN-TIME SCHEDULING WITH PERIODIC TIME SLOTS. Received December May 12, 2003; revised February 5, 2004

Using Adversary Structures to Analyze Network Models,

A simpler and better derandomization of an approximation algorithm for Single Source Rent-or-Buy

Lecture 1: Course overview, circuits, and formulas

OPTIMAL DESIGN OF DISTRIBUTED SENSOR NETWORKS FOR FIELD RECONSTRUCTION

A Partition-Based Efficient Algorithm for Large Scale. Multiple-Strings Matching

A SOCIAL NETWORK ANALYSIS APPROACH TO ANALYZE ROAD NETWORKS INTRODUCTION

Lecture 8 The Subjective Theory of Betting on Theories

An Introduction to the Use of Bayesian Network to Analyze Gene Expression Data

1 The Line vs Point Test

Labeling outerplanar graphs with maximum degree three

Network Algorithms for Homeland Security

Midterm Practice Problems

Scheduling Shop Scheduling. Tim Nieberg

Lecture 7: NP-Complete Problems

Decentralized Utility-based Sensor Network Design

Lecture 2: Complexity Theory Review and Interactive Proofs

3/25/ /25/2014 Sensor Network Security (Simon S. Lam) 1

Mean Ramsey-Turán numbers

Social Media Mining. Graph Essentials

Applied Algorithm Design Lecture 5

KEYWORD SEARCH OVER PROBABILISTIC RDF GRAPHS

Reductions & NP-completeness as part of Foundations of Computer Science undergraduate course

Tree-representation of set families and applications to combinatorial decompositions

OHJ-2306 Introduction to Theoretical Computer Science, Fall

INFORMATION from a single node (entity) can reach other

Tiers, Preference Similarity, and the Limits on Stable Partners

Graphical Models, Exponential Families, and Variational Inference

Transcription:

Probabilistic Graphical Models Raquel Urtasun and Tamir Hazan TTI Chicago April 4, 2011 Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 4, 2011 1 / 22

Bayesian Networks and independences Not every distribution independencies can be captured by a directed graph Regularity in the parameterization of the distribution that cannot be captured in the graph structure, e.g., XOR example { 1/12 if x y z = false P(x, y, z) = 1/6 if x y z = true (X Y ) I(P) Z is not independent of X given Y or Y given X. An I-map is the network X Z Y. This is not a perfect map as (X Z) I(P) Symmetric variable-level independencies that are not naturally expressed with a Bayesian network. Independence assumptions imposed by the structure of the DBN are not appropriate, e.g., misconception example Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 4, 2011 2 / 22

Misconception example (a) (b) (c) (a) Two independencies: (A C D, B) and (B D A, C) Can we encode this with a BN? (b) First attempt: encodes (A C D, B) but it also implies that (B D A) but dependent given both A, C (c) Second attempt: encodes (A C D, B), but also implies that B and D are marginally independent. Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 4, 2011 3 / 22

Undirected graphical models I So far we have seen directed graphical models or Bayesian networks BN do not captured all the independencies, e.g., misconception example, We want a representation that does not require directionality of the influences. We do this via an undirected graph. Undirected graphical models, which are useful in modeling phenomena where the interaction between variables does not have a clear directionality. Often simpler perspective on directed models, in terms of the independence structure and of inference. Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 4, 2011 4 / 22

Undirected graphical models II As in BN, the nodes in the graph represent the variables The edges represent direct probabilistic interaction between the neighboring variables How to parametrize the graph? In BN we used CPD (conditional probabilities) to represent distribution of a node given others For undirected graphs, we use a more symmetric parameterization that captures the affinities between related variables. Given a set of random variables X we define a factor as a function from Val(X) to R. The set of variables X is called the scope of the factor. Factors can be negative. In general, we restrict the discussion to positive factors Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 4, 2011 5 / 22

Misconception example once more... We can write the joint probability as p(a, B, C, D) = 1 Z φ 1(A, B)φ 2 (B, C)φ 3 (C, D)φ 4 (A, D) Z is the partition function and is used to normalized the probabilities Z = φ 1 (A, B)φ 2 (B, C)φ 3 (C, D)φ 4 (A, D) A,B,C,D It is called function as it depends on the parameters: important for learning. For positive factors, the higher the value of φ, the higher the compatibility. This representation is very flexible. Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 4, 2011 6 / 22

Query about probabilities What s the p(b 0 )? Marginalize the other variables! Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 4, 2011 7 / 22

Misconception example once more... We can write the joint probability as p(a, B, C, D) = 1 Z φ 1(A, B)φ 2 (B, C)φ 3 (C, D)φ 4 (A, D) Use the joint distribution to query about conditional probabilities by summing out the other variables. Tight connexion between the factorization and the independence properties X Y Z iff p(x, Y, Z) = φ 1 (X, Z)φ 2 (Y, Z) We see that in the example, (A C D, B) and (B D A, C) Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 4, 2011 8 / 22

Factors A factor can represent a joint distribution over D by defining φ(d). A factor can represent a CPD p(x D) by defining φ(d X ) But joint and CPD are more restricted, i.e., normalization constraints. Associating parameters over edges is not enough We need to associate factors over sets of nodes, i.e., higher order terms Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 4, 2011 9 / 22

Factor product Given 3 disjoint set of variables X, Y, Z, and factors φ 1 (X, Y), φ 2 (Y, Z), the factor product is defined as ψ(x, Y, Z) = φ 1 (X, Y)φ 2 (Y, Z) Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 4, 2011 10 / 22

Gibbs distributions and Markov networks A distribution P φ is a Gibbs distribution parameterized with a set of factors φ 1 (D 1 ),, φ m (D m ) if it is defined as P φ (X 1,, X n ) = 1 Z φ 1(D 1 ) φ m (D m ) and the partition function is defined as Z = φ 1 (D 1 ) φ m (D m ) X 1,,,X n The factors do NOT represent marginal probabilities of the variables of their scope. A factor is only one contribution to the joint. A distribution P φ with φ 1 (D 1 ),, φ m (D m ) factorizes over a Markov network H if each D i is a complete subgraph of H The factors that parameterize a Markov network are called clique potentials Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 4, 2011 11 / 22

Maximal cliques One can reduce the number of factors by using factors of the maximal cliques This obscures the structure What s the P φ on the left? And on the right? What s the relationship between the factors? Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 4, 2011 12 / 22

Example: Pairwise MRF Undirected graphical model very popular in applications such as computer vision: segmentation, stereo, de-noising The graph has only node potentials φ i (X i ) and pairwise potentials φ i,j (X i, X j ) Grids are particularly popular, e.g., pixels in an image with 4-connectivity Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 4, 2011 13 / 22

Reduced Markov Networks I Conditioning on an assignment u to a subset of variables U can be done by Eliminating all entries that are inconsistent with the assignment Re-normalizing the remaining entries so that they sum to 1 (Original) (Cond. on c 1 ) Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 4, 2011 14 / 22

Reduced Markov Networks Let H be a Markov network over X and let U = u be the context. The reduced network H[u] is a Markov network over the nodes W = X U where we have an edge between X and Y if there is an edge between then in H If U = Grade? If U = {Grade, SAT }? Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 4, 2011 15 / 22

Markov Network Independencies I As in BN, the graph encodes a set of independencies. Probabilistic influence flows along the undirected paths in the graph. It is blocked if we condition on the intervening nodes A path X 1 X k is active given the observed variables E X if none of the X i is in E. A set of nodes Z separates X and Y in H, i.e., sep H (X; Y Z), if there exists no active path between any node X X and Y Y given Z. The definition of separation is monotonic if sep H (X; Y Z) then sep H (X; Y Z ) for any Z Z Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 4, 2011 16 / 22

Markov Network Independencies II If P is a Gibbs distribution that factorizes over H, then H is an I-map for P, i.e., I (H) I (P) (soundness of separation) Proof: Suppose Z separates X from Y. Then we can write p(x 1,, X n ) = 1 f (X, Z)g(Y, Z) Z A distribution is positive if P(x) > 0 for all x. Hammersley-Clifford theorem: If P is a positive distribution over X and H is an I-map for P, then P is a Gibbs distribution that factorizes over H p(x) = 1 φ c (x c ) Z It is not the case that every pair of nodes that are not separated in H are dependent in every distribution which factorizes over H If X and Y are not separated given Z in H, then X and Y are dependent given Z in some distribution that factorizes over H. Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 4, 2011 17 / 22 c

Independence assumptions I In a BN we specify local Markov assumptions and d-separation. In Markov networks we have 1 Global assumption: A set of nodes Z separates X and Y if there is no active path between any node X X and Y Y given Z. I(H) = {(X Y Z) : sep H (X; Y Z)} 2 Pairwise Markov assumption: X and Y are independent give all the other nodes in the graph if no direct connection exists between them I p (H) = {(X Y X {X, Y }) : X Y / X } 3 Markov blanket assumption: X is independent of the rest of the nodes given its neighbors I l (H) = {(X X {X } MB H (X ) MB H (X )) : X X } A set U is a Markov blanket of X if X / U and if U is a minimal set of nodes such that (X X {X } U U) I Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 4, 2011 18 / 22

Independence assumptions II (Markov blanket) In general I(H) I l (H) I p (H) If P satisfies I(H), then it satisfies I l (H) If P satisfies I l (H), then it satisfies I p (H) If P is a positive distribution and satisfies I p (H) then it satisfies I l (H) Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 4, 2011 19 / 22

From distributions to graphs I The notion of I-map is not enough: as in BN the complete graph is an I-map for any distribution, but does not imply any independencies For a given distribution, we want to construct a minimal I-map based on the local indep. assumptions 1 Pairwise: Add an edge between all pairs of nodes that do NOT satisfy (X Y X {X, Y }) 2 Markov blanket: For each variable X we define the neighbors of X all the nodes that render X independent of the rest of nodes. Define a graph by introducing and edge for all X and all Y MB P (X ). If P is positive distribution, there is a unique Markov blanket of X in I(P), denoted MB P (X ) Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 4, 2011 20 / 22

From distributions to graphs II If P is a positive distribution, and let H be the graph defined by introducing an edge {X, Y } for which P does NOT satisfied (X Y X {X, Y }), then H is the unique minimal I-map for P. Minimal I-map is the one that if we remove one edge is not an I-map. Proof: H is an I-map for P since P by construction satisfies I p (P) which for positive distributions equals I(P). To prove that it s minimal, if we eliminate an edge {X, Y } the graph would imply (X Y X {X, Y }), which is false for P, otherwise edge omitted when constructing H. To prove that it s unique: any other I-map must contain the same edges, and it s either equal or contain additional edges, and thus it is not minimal If P is a positive distribution, and for each node let MB P (X ) be a minimal set of nodes U satisfying (X X {X } U U) I. Define H by introducing an edge {X, Y } for all X and all Y MB P (X ). Then H is the unique minimal I-map of P. Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 4, 2011 21 / 22

Examples of deterministic relations Not every distribution has a perfect map Even for positive distributions Example is the V-structure, where minimal I-map is the fully connected graph It fails to capture the marginal independence (X Y ) that holds in P Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 4, 2011 22 / 22