STAT 535 Lecture 4
Graphical representations of conditional independence
Part II: Bayesian Networks
© Marina Meilă
mmp@stat.washington.edu

1  A limitation of Markov networks

Between two variables (or subsets of V) A and B, there can be four combinations of independence statements.

    Conditional Independence    Marginal Independence    Example MRF
    A ⊥ B | C                   A ⊥ B                    A   B   C  (no edges)
    A ⊥ B | C                   A ⊥̸ B                    A - C - B
    A ⊥̸ B | C                   A ⊥̸ B                    A - B
    A ⊥̸ B | C                   A ⊥ B                    not representable by MRF

The table above gives an example of MRF structure for each of the three cases that are representable by MRFs. For the fourth case, an I-map is possible, and this is the graph that contains an edge A - B. Note that this I-map is trivial, since it does not represent the independence A ⊥ B. This is a limitation of MRFs.

We will show how serious this limitation is by an example. Suppose that a variable called burglar alarm (A) becoming true can have two causes: a burglary (B) or an earthquake (E). A reasonable assumption is that burglaries and earthquakes occur independently (B ⊥ E). But, given that the alarm sounds (A = 1), hearing that an earthquake has taken place (E = 1) will make most people believe that the burglary is less likely to have taken place than if there had been no earthquake (E = 0). Hence E carries information about B when A = 1, and therefore B, E are dependent given A. In other words, it is reasonable to assume that

    B ⊥ E          (1)
    B ⊥̸ E | A.     (2)

Markov nets cannot represent this situation.

Exercise Find other examples of the kind (i) X ⊥ Y and X ⊥̸ Y | Z, i.e. two independent causes which can produce the same effect. Find examples that fit the other three possible combinations of marginal and conditional (in)dependence, i.e. (ii) X ⊥̸ Y, X ⊥ Y | Z, (iii) X ⊥̸ Y, X ⊥̸ Y | Z, (iv) X ⊥ Y, X ⊥ Y | Z. Can you describe them in words?

The case illustrated above is important enough to warrant the introduction of another formalism for encoding independencies in graphs, called d-separation and based on directed graphs.

2  Directed Acyclic Graphs (DAGs)

A Directed Acyclic Graph (DAG) is a directed graph G = (V, E) which contains no directed cycles.

[Figure: two small directed graphs; the first ("this is a DAG") has no directed cycle, the second ("this is not a DAG") contains one.]

Below is a somewhat more complicated textbook example (the chest-clinic example).

Figure 1: The chest clinic DAG (nodes: Asia, Smoker, Tuberculosis, Lung cancer, Bronchitis, X-ray, Dyspnoea).

In this example, the arrows coincide with causal links between variables (i.e. Smoking causes Lung cancer). This is not completely accidental. Bayesian networks are particularly fit for representing domains where there are causal relationships. It is therefore useful to think of causal relationships when we try to build a Bayes net that represents a problem. But the formalism we study here is not necessarily tied to causality. One does not have to interpret the arrows as cause-effect relations, and we will not do so.

Terminology:
    parent         Asia is a parent of Tuberculosis
    pa(variable)   the set of parents of a variable; pa(X-ray) = {Lung cancer, Tuberculosis}
    child          Lung cancer is a child of Smoker
    ancestor       Smoker is an ancestor of Dyspnoea
    descendant     Dyspnoea is a descendant of Smoker
    family         a node and its parents; {Dyspnoea, Tuberculosis, Lung cancer, Bronchitis} are a family

But perhaps the most important concept in DAGs is the V-structure, which denotes a variable having two parents which are not connected by an edge. The Burglar-Earthquake-Alarm example of the previous section is a V-structure.

[Figure: Burglar (B) and Earthquake (E) both pointing into Alarm (A) - a V-structure; Landslide (L) and Earthquake (E) pointing into Alarm (A) with L and E themselves connected - not a V-structure.]

In Figure 1, (T, X, L), (T, D, L), (T, D, B) are V-structures.

3  d-separation

In a DAG, independence is encoded by the relation of d-separation, defined below.

[Figure: node sets A, B, C, with A d-separated from B by C.]

d-separation: A is d-separated from B by C if all the paths between sets A and B are blocked by elements of C.

The three cases of d-separation:

    X → Z → Y    if Z ∈ C the path is blocked, otherwise open
    X ← Z → Y    if Z ∈ C the path is blocked, otherwise open
    X → Z ← Y    if Z or one of its descendants is in C the path is open, otherwise blocked

The directed Markov property: X ⊥ (non-descendants of X) | pa(X), for every variable X.
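
The third case above (the collider X → Z ← Y) is exactly the Burglar-Earthquake-Alarm situation: conditioning on the common child opens the path between its parents. As a quick numerical check, here is a minimal Python sketch; the probability values (0.01, 0.02, 0.95, ...) are made-up assumptions for illustration, not numbers from the lecture.

```python
import itertools

# Hypothetical numbers for the Burglar-Earthquake-Alarm example (assumptions, not from the lecture).
p_B = {1: 0.01, 0: 0.99}                      # P(Burglary)
p_E = {1: 0.02, 0: 0.98}                      # P(Earthquake)
p_A = {(1, 1): 0.95, (1, 0): 0.90,            # P(Alarm = 1 | B, E)
       (0, 1): 0.30, (0, 0): 0.01}

# Joint distribution P(B, E, A) = P(B) P(E) P(A | B, E): a Bayes net with a single V-structure.
joint = {}
for b, e, a in itertools.product([0, 1], repeat=3):
    pa1 = p_A[(b, e)]
    joint[(b, e, a)] = p_B[b] * p_E[e] * (pa1 if a == 1 else 1 - pa1)

def prob(pred):
    # Sum the joint over all configurations (b, e, a) satisfying the predicate.
    return sum(p for bea, p in joint.items() if pred(*bea))

# Marginal independence: P(B=1 | E=1) equals P(B=1).
print(prob(lambda b, e, a: b == 1 and e == 1) / prob(lambda b, e, a: e == 1))   # 0.01
print(prob(lambda b, e, a: b == 1))                                             # 0.01

# Conditional dependence given A=1: hearing about the earthquake "explains away" the alarm.
p_b_given_a = prob(lambda b, e, a: b == 1 and a == 1) / prob(lambda b, e, a: a == 1)
p_b_given_ae = prob(lambda b, e, a: b == 1 and e == 1 and a == 1) / prob(lambda b, e, a: e == 1 and a == 1)
print(p_b_given_a, p_b_given_ae)   # about 0.37 versus 0.03
```

With these numbers, P(B = 1 | A = 1) ≈ 0.37 while P(B = 1 | A = 1, E = 1) ≈ 0.03: learning about the earthquake makes the burglary much less likely, even though B and E are marginally independent.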

Theorem For any DAG G = (V, E) and any probability distribution P_V over V, if P_V satisfies the directed Markov property w.r.t. the DAG G, then any d-separation relationship that is true in G corresponds to a true independence relationship in P_V. (In other words, G is an I-map of P_V.)

3.1  The Markov blanket

In a DAG, the Markov blanket of a node X is the union of all parents, children and other parents of X's children:

    Markov blanket(X) = pa(X) ∪ ch(X) ∪ ( ∪_{Y ∈ ch(X)} pa(Y) \ {X} )

For example, in the DAG in Figure 1, the Markov blanket of L is the set {S (parent), X, D (children), T, B (parents of the children)}, and the Markov blanket of A is {T}.

3.2  Equivalent DAGs

Two DAGs are said to be (likelihood) equivalent if they encode the same set of independencies. For example, the two graphs below are equivalent (encoding no independencies).

[Figure: the two-node graphs A → B and B → A.]

If the arrows represented causal links, then the two above graphs would not be equivalent! Another example of equivalence between DAGs was seen in lecture 3: a Markov chain or an HMM can be represented by a directed graph with forward arrows, as well as by one with backward arrows. (Note, in passing, that an HMM can also be represented as an undirected graph, demonstrating equivalence between a DAG and an MRF.)

Yet another example is below. On the left, the chest clinic DAG. In the middle, an equivalent DAG. On the right, another DAG which is not equivalent with the chest clinic example. [Exercise: verify that the graphs are/are not equivalent by looking at what independence relationships hold in the three graphs.]

[Figure: three DAGs, G1, G2, G3.]

Theorem 1 (Chickering) Two DAGs are equivalent iff they have the same undirected skeleton and the same V-structures.

Consequently, we can invert an arrow in a DAG and preserve the same independencies only if that arrow is not part of a V-structure. In other words, the arrow A → B can be reversed iff for any arc C → A pointing into A there is an arc C → B pointing into B, and vice versa. Or, pa(B) = {A} ∪ pa(A). It can be proved that one can traverse the class of all equivalent DAGs by successive arrow reversals.

3.3  d-separation as separation in an undirected graph

Here we show that d-separation in a DAG is equivalent to separation in an undirected graph obtained from the DAG and the variables we are interested in. First, two definitions, whose meaning will become clear shortly.

Moralization is the graph operation of connecting the parents of a V-structure. A DAG is moralized if all nodes that share a child have been connected. After a graph is moralized, all edges, be they original edges or new edges added by moralization, are considered as undirected. If G is a DAG, the graph obtained by moralizing G is denoted by G^m and is called the moral graph of G.

[Figure: moralizing a DAG - "marrying the parents", then "dropping the directions".]

For any variable X, the set an(X) denotes the ancestors of X (including X itself). Similarly, if A is a set of nodes, an(A) denotes the set of all ancestors of variables in A:

    an(A) = ∪_{X ∈ A} an(X)

The ancestral graph of a set of nodes A ⊆ V is the graph G_{an(A)} = (an(A), E') obtained from G by removing all nodes not in an(A).

Now we can state the main result.

Theorem 2 Let A, B, C ⊆ V be three disjoint sets of nodes in a DAG G. Then A, B are d-separated by C in G iff they are separated by C in the moral ancestral graph of A, B, C:

    A ⊥ B | C in G   iff   A ⊥ B | C in (G_{an(A ∪ B ∪ C)})^m

The intuition is that observing/conditioning on a variable creates a dependence between its parents (if it has any). Moralization represents this link. Now why the ancestral graph? Note that an unobserved descendant cannot produce dependencies between its ancestors (i.e. cannot open a path in a directed graph). So we can safely remove all descendants of A, B that are not in C. The descendants of C itself that are not in A, B, C, and all the nodes that are not ancestors of A, B, C, can be removed by a similar reasoning. Hence, first the graph is pruned, then dependencies between parents are added by moralization. Now directions on edges can be removed, because DAGs are just like undirected graphs if it weren't for the V-structures, and we have already dealt with those.

The Theorem immediately suggests an algorithm for testing d-separation using undirected graph separation:

1. remove all nodes not in an(A ∪ B ∪ C) to get G_{an(A ∪ B ∪ C)}
2. moralize the remaining graph to get (G_{an(A ∪ B ∪ C)})^m
3. remove all nodes in C from (G_{an(A ∪ B ∪ C)})^m to get G'
4. test if there is a path between A and B in G'

For example, these steps can be used to answer a d-separation query in the chest clinic DAG; a small code sketch of the procedure is given below, and the figure after it shows the four stages on the chest clinic graph.
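
The following is a minimal Python sketch of the four steps; the dict representation of the graph, the helper names (ancestors, d_separated) and the example queries are my own choices for illustration, not part of the lecture.

```python
from collections import deque

def ancestors(dag, nodes):
    """All ancestors of `nodes` in the DAG, including the nodes themselves.
    `dag` maps each node to the set of its parents."""
    result, stack = set(nodes), list(nodes)
    while stack:
        for p in dag[stack.pop()]:
            if p not in result:
                result.add(p)
                stack.append(p)
    return result

def d_separated(dag, A, B, C):
    """True iff A and B are d-separated by C, via the moral ancestral graph (Theorem 2)."""
    # 1. Keep only the ancestral set an(A ∪ B ∪ C).
    keep = ancestors(dag, set(A) | set(B) | set(C))
    # 2. Moralize: keep the original edges as undirected, and marry parents that share a child.
    adj = {v: set() for v in keep}
    for child in keep:
        parents = dag[child] & keep
        for p in parents:
            adj[child].add(p)
            adj[p].add(child)
        for p in parents:
            for q in parents:
                if p != q:
                    adj[p].add(q)
    # 3. Remove the conditioning nodes C.
    C = set(C)
    adj = {v: nbrs - C for v, nbrs in adj.items() if v not in C}
    # 4. Breadth-first search: is any node of B reachable from A in what is left?
    frontier = deque(set(A) & set(adj))
    seen = set(frontier)
    while frontier:
        v = frontier.popleft()
        if v in set(B):
            return False            # path found: not d-separated
        for w in adj[v] - seen:
            seen.add(w)
            frontier.append(w)
    return True

# The chest clinic DAG with the short variable names used above, as node -> set of parents.
chest = {"A": set(), "S": set(), "B": set(),
         "T": {"A"}, "L": {"S"}, "X": {"T", "L"}, "D": {"T", "L", "B"}}
print(d_separated(chest, {"A"}, {"S"}, set()))   # True: A and S are marginally independent
print(d_separated(chest, {"A"}, {"S"}, {"D"}))   # False: observing D opens the path A → T → D ← L ← S
```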

[Figure: the four steps applied to a d-separation query in the chest clinic DAG - show nodes of interest; ancestral graph; moralize; eliminate conditioning nodes.]

4  Factorization

Now we construct joint probability distributions that have the independencies specified by a given DAG. Assume the set of discrete variables is V = {X_1, X_2, ..., X_n} and that we are given a DAG G = (V, E). The goal is to construct the family of distributions that are represented by the graph. This family is given by

    P(X_1, X_2, ..., X_n) = ∏_{i=1}^{n} P(X_i | pa(X_i))        (3)

In the above, P(X_i | pa(X_i)) represents the conditional distribution of variable X_i given its parents. Because the factors P(X_i | pa(X_i)) involve a variable and its parents, that is, nodes closely connected in the graph structure, we often call them local probability tables (or local distributions). Note that the parameters of each local table are (functionally) independent of the parameters in the other tables. We can choose them separately, and the set of all such choices of conditional probability tables forms the family of distributions for which the DAG G is an I-map.

If a distribution can be written in the form (3), we say that the distribution factors according to the graph G. A joint distribution that factors according to some DAG G is called a Bayes net. Note that any distribution is a Bayes net in a trivial way: by taking G to be the complete graph, with no missing edges. In general, we want a Bayes net to be as sparse as possible, because representing independencies explicitly has many computational advantages.

Figure 2: The chest clinic DAG, with shorter variable names (A, S, T, L, B, X, D).

The Bayes net described by this graph is

    P(A, S, T, L, B, X, D) = P(A) P(S) P(T | A) P(L | S) P(B) P(X | T, L) P(D | T, L, B)

A way of obtaining this decomposition starting from the graph is:

1. Construct a topological ordering of the variables. A topological ordering is an ordering of the variables where the parents of each variable are always before the variable itself in the ordering. A, S, T, L, B, X, D is a topological ordering for the graph above. This ordering is not unique; another possible topological ordering is A, B, T, S, L, X, D. In general, there can be exponentially many topological orderings for a given DAG.

2. Apply the chain rule following the topological ordering:

    P(A, S, T, L, B, X, D) = P(A) P(S | A) P(T | A, S) P(L | A, S, T) P(B | A, S, T, L) P(X | A, S, T, L, B) P(D | A, S, T, L, B, X)

3. Use the directed Markov property to simplify the factors:

    P(S | A) = P(S),  P(T | A, S) = P(T | A),  P(L | A, S, T) = P(L | S),  P(B | A, S, T, L) = P(B),  etc.

Let us now look at the number of parameters in such a model. Assume that in the example above all variables are binary. Then the number of unconstrained parameters in the model is

    1 + 1 + 2 + 2 + 1 + 4 + 8 = 19

The number of parameters in a 7-way contingency table is 2^7 - 1 = 127, so we are saving 108 parameters. As we shall see, there are also other computational advantages to joint distribution representations of this form.
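
To make the parameter count and the factorization (3) concrete, here is a small Python sketch. The dict layout, the uniform placeholder tables (all conditional probabilities set to 0.5) and the function name joint_prob are assumptions for illustration only; a real model would fill in the 19 actual probabilities.

```python
import itertools

# Parents of each variable in the chest clinic DAG of Figure 2 (all variables binary).
parents = {"A": [], "S": [], "B": [],
           "T": ["A"], "L": ["S"], "X": ["T", "L"], "D": ["T", "L", "B"]}

# A binary local table P(X | pa(X)) has one free parameter per joint configuration of the parents.
free_params = {v: 2 ** len(pa) for v, pa in parents.items()}
print(free_params)                 # {'A': 1, 'S': 1, 'B': 1, 'T': 2, 'L': 2, 'X': 4, 'D': 8}
print(sum(free_params.values()))   # 19 parameters in the Bayes net
print(2 ** len(parents) - 1)       # 127 parameters in a full 7-way contingency table

# Evaluating equation (3): P(x_1, ..., x_n) = prod_i P(x_i | pa(x_i)).
# Placeholder tables, keyed by (value of the child, values of its parents): every entry is 0.5.
tables = {v: {cfg: 0.5 for cfg in itertools.product([0, 1], repeat=len(pa) + 1)}
          for v, pa in parents.items()}

def joint_prob(assign):
    """Probability of a full assignment under the factorization (3)."""
    p = 1.0
    for v, pa in parents.items():
        p *= tables[v][(assign[v],) + tuple(assign[u] for u in pa)]
    return p

print(joint_prob({"A": 0, "S": 1, "T": 0, "L": 1, "B": 0, "X": 1, "D": 0}))   # 0.5**7 = 0.0078125
```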