Lecture 2: Introduction to belief (Bayesian) networks




COMP-526, January 7, 2008
Outline: conditional independence; what is a belief network?; independence maps (I-maps).

Recall from last time: Conditional probabilities
Our probabilistic models will compute and manipulate conditional probabilities. Given two random variables X, Y, we denote by p(X = x | Y = y) the probability of X taking value x given that we know that Y is certain to have value y. This fits the situation when we observe something and want to make an inference about something related but unobserved:
p(cancer recurs | tumor measurements)
p(gene expressed > 1.3 | transcription factor concentrations)
p(collision with obstacle | sensor readings)
p(word uttered | sound wave)

Recall from last time: Bayes rule
Bayes rule is very simple but very important for relating conditional probabilities:
p(x | y) = p(y | x) p(x) / p(y)
Bayes rule is a useful tool for inferring the posterior probability of a hypothesis based on evidence and a prior belief in the probability of different hypotheses.

Using Bayes rule for inference
Often we want to form a hypothesis about the world based on observable variables. Bayes rule is fundamental when viewed as stating the belief given to a hypothesis H given evidence e:
p(H | e) = p(e | H) p(H) / p(e)
p(H | e) is sometimes called the posterior probability
p(H) is called the prior probability
p(e | H) is called the likelihood of the evidence (data)
p(e) is just a normalizing constant, which can be computed from the requirement that Σ_h p(H = h | e) = 1:
p(e) = Σ_h p(e | h) p(h)
Sometimes we write p(H | e) ∝ p(e | H) p(H).

Example: Medical diagnosis
A doctor knows that pneumonia causes a fever 95% of the time. She knows that if a person is selected randomly from the population, there is a 10^-7 chance of the person having pneumonia. 1 in 100 people suffer from fever. You go to the doctor complaining about the symptom of having a fever (evidence). What is the probability that pneumonia is the cause of this symptom (hypothesis)?
p(pneumonia | fever) = p(fever | pneumonia) p(pneumonia) / p(fever) = (0.95 × 10^-7) / 0.01 = 9.5 × 10^-6

Computing conditional probabilities
Typically, we are interested in the posterior joint distribution of some query variables Y given specific values e for some evidence variables E. Let the hidden variables be Z = X \ (Y ∪ E), i.e. all the remaining variables. If we have the joint probability distribution, we can compute the answer by using the definition of conditional probability and marginalizing out the hidden variables:
p(Y | e) = p(Y, e) / p(e), where p(Y, e) = Σ_z p(Y, e, z)
This yields the same big problem as before: the joint distribution is too big to handle.
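As a quick sanity check on this calculation (and on Bayes rule more generally), here is a minimal Python sketch; the variable names are just illustrative and the numbers are the ones from the pneumonia example.

```python
# Bayes rule for the pneumonia example: p(H | e) = p(e | H) p(H) / p(e).
prior = 1e-7        # p(pneumonia)
likelihood = 0.95   # p(fever | pneumonia)
evidence = 0.01     # p(fever)

posterior = likelihood * prior / evidence
print(posterior)    # 9.5e-06
```

The tiny posterior reflects the tiny prior: even a fairly reliable symptom cannot overcome a 10^-7 base rate.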

Independence of random variables revisited
We said that two random variables X and Y are independent, denoted X ⊥ Y, if p(x, y) = p(x) p(y). But we also know that p(x, y) = p(x | y) p(y). Hence, two random variables are independent if and only if:
p(x | y) = p(x) (and vice versa), for all x ∈ Ω_X, y ∈ Ω_Y
This means that knowledge about Y does not change the uncertainty about X, and vice versa. Is there a similar requirement, but less stringent?

Conditional independence
Two random variables X and Y are conditionally independent given Z if:
p(x | y, z) = p(x | z), for all x, y, z
This means that knowing the value of Y does not change the prediction about X if the value of Z is known. We denote this by X ⊥ Y | Z.
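To make the definition concrete computationally, here is a minimal Python sketch that checks conditional independence directly from a joint table by comparing p(x | y, z) with p(x | z) for every value combination. The joint below is invented purely for illustration; it was constructed so that X ⊥ Y | Z holds even though X and Y are not marginally independent.

```python
from itertools import product

# Illustrative joint p(x, y, z) over three binary variables (made-up numbers).
joint = {
    (0, 0, 0): 0.24, (0, 1, 0): 0.16, (1, 0, 0): 0.06, (1, 1, 0): 0.04,
    (0, 0, 1): 0.04, (0, 1, 1): 0.16, (1, 0, 1): 0.06, (1, 1, 1): 0.24,
}

def p(**fixed):
    """Marginal probability of the given assignments, e.g. p(x=1, z=0)."""
    idx = {"x": 0, "y": 1, "z": 2}
    return sum(pr for values, pr in joint.items()
               if all(values[idx[k]] == v for k, v in fixed.items()))

def cond_indep_given_z():
    """Check p(x | y, z) == p(x | z) for every value combination."""
    for x, y, z in product([0, 1], repeat=3):
        lhs = p(x=x, y=y, z=z) / p(y=y, z=z)
        rhs = p(x=x, z=z) / p(z=z)
        if abs(lhs - rhs) > 1e-9:
            return False
    return True

print(cond_indep_given_z())                       # True: X ⊥ Y | Z
print(abs(p(x=1, y=1) - p(x=1) * p(y=1)) < 1e-9)  # False: X, Y not marginally independent
```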

Example
Consider the medical diagnosis problem with three random variables: P (patient has pneumonia), F (patient has a fever), C (patient has a cough). The full joint distribution has 2^3 - 1 = 7 independent entries. If someone has pneumonia, we can assume that the probability of a cough does not depend on whether they have a fever:
p(C = 1 | P = 1, F) = p(C = 1 | P = 1)   (1)
The same equality holds if the patient does not have pneumonia:
p(C = 1 | P = 0, F) = p(C = 1 | P = 0)   (2)
Hence, C and F are conditionally independent given P.

Example (continued)
The joint distribution can now be written as:
p(C, P, F) = p(C | P, F) p(F | P) p(P) = p(C | P) p(F | P) p(P)
Hence, the joint can be described using 2 + 2 + 1 = 5 numbers instead of 7. Much more important savings happen with more variables.
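To make the counting argument concrete, here is a small Python sketch that reconstructs the full 8-entry joint table from the 5 numbers of the factorized form p(C | P) p(F | P) p(P). The particular probability values are made up for illustration; only the structure matters.

```python
from itertools import product

# The 5 numbers of the factorized model (illustrative values, not from the slides).
p_P1 = 0.001                       # p(P = 1)
p_F1_given_P = {0: 0.01, 1: 0.9}   # p(F = 1 | P)
p_C1_given_P = {0: 0.05, 1: 0.8}   # p(C = 1 | P)

def bernoulli(p1, value):
    """Probability of a binary variable taking `value` when p(value = 1) = p1."""
    return p1 if value == 1 else 1.0 - p1

# Reconstruct the full joint p(C, P, F) = p(C | P) p(F | P) p(P).
joint = {}
for c, p, f in product([0, 1], repeat=3):
    joint[(c, p, f)] = (bernoulli(p_C1_given_P[p], c)
                        * bernoulli(p_F1_given_P[p], f)
                        * bernoulli(p_P1, p))

print(sum(joint.values()))  # sums to 1 (up to floating-point rounding)
```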

Naive Bayesian model
A common assumption in early diagnosis is that the symptoms are independent of each other given the disease. Let s_1, ..., s_n be the symptoms exhibited by a patient (e.g. fever, headache, etc.) and let D be the patient's disease. Then, using the naive Bayes assumption, we get:
p(D, s_1, ..., s_n) = p(D) p(s_1 | D) ··· p(s_n | D)
The conditional probability of the disease given the symptoms is:
p(D | s_1, ..., s_n) = p(D, s_1, ..., s_n) / p(s_1, ..., s_n) ∝ p(D) p(s_1 | D) ··· p(s_n | D)
because the denominator is just a normalization constant.

Recursive Bayesian updating
The naive Bayes assumption also allows for a very nice, incremental updating of beliefs as more evidence is gathered. Suppose that after observing symptoms s_1, ..., s_n the (unnormalized) probability of D is:
p(D, s_1, ..., s_n) = p(D) Π_{i=1}^{n} p(s_i | D)
What happens if a new symptom s_{n+1} appears?
p(D, s_1, ..., s_n, s_{n+1}) = p(D) Π_{i=1}^{n+1} p(s_i | D) = p(D, s_1, ..., s_n) p(s_{n+1} | D)
An even nicer formula can be obtained by taking logs:
log p(D, s_1, ..., s_n, s_{n+1}) = log p(D, s_1, ..., s_n) + log p(s_{n+1} | D)

A graphical representation of the naive Bayesian model
[Figure: a node Diagnosis with an arc to each of the symptom nodes Sympt_1, Sympt_2, ..., Sympt_n.]
The nodes represent random variables. The arcs represent influences. The lack of arcs represents conditional independence relationships.
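The incremental update is easy to express in code. Below is a minimal Python sketch: the diseases, symptoms, and probabilities are invented for illustration, and the `observe` / `posterior` helpers are hypothetical names, not part of any library.

```python
import math

# Illustrative naive Bayes model: two diseases, symptom likelihoods p(s = 1 | D).
prior = {"flu": 0.1, "cold": 0.9}
likelihood = {
    "flu":  {"fever": 0.9, "headache": 0.6, "cough": 0.7},
    "cold": {"fever": 0.2, "headache": 0.3, "cough": 0.8},
}

# Start from the log-prior: log p(D).
log_score = {d: math.log(p) for d, p in prior.items()}

def observe(symptom):
    """Recursive update: add log p(s_{n+1} | D) to each disease's running score."""
    for d in log_score:
        log_score[d] += math.log(likelihood[d][symptom])

def posterior():
    """Normalize the running scores to recover p(D | observed symptoms)."""
    weights = {d: math.exp(s) for d, s in log_score.items()}
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}

observe("fever")
print(posterior())   # belief after one symptom
observe("cough")
print(posterior())   # updated incrementally, without revisiting earlier symptoms
```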

More generally: Bayesian networks
[Figure: a network over five binary variables, Earthquake (E), Burglary (B), Radio report (R), Alarm (A) and Call (C), with arcs E → R, E → A, B → A and A → C, and the following conditional probability tables:
p(E=1) = 0.005, p(B=1) = 0.01
p(R=1 | E=0) = 0.0001, p(R=1 | E=1) = 0.65
p(A=1 | B=0,E=0) = 0.001, p(A=1 | B=0,E=1) = 0.3, p(A=1 | B=1,E=0) = 0.8, p(A=1 | B=1,E=1) = 0.95
p(C=1 | A=0) = 0.05, p(C=1 | A=1) = 0.7]
Bayesian networks are a graphical representation of conditional independence relations, using graphs.

A graphical representation for probabilistic models
Suppose the world is described by a set of random variables X_1, ..., X_n. Let us define a directed acyclic graph such that each node i corresponds to a random variable X_i. Since this is a one-to-one mapping, we will use X_i to denote both the node in the graph and the corresponding random variable. Let X_{π_i} be the set of parents of node X_i in the graph. We associate with each node the conditional probability distribution of the random variable X_i given its parents: p(X_i | X_{π_i}).

Example: A Bayesian (belief) network
[Figure: the alarm network shown above, with its conditional probability tables.]
The nodes represent random variables. The arcs represent influences. At each node, we have a conditional probability distribution (CPD) for the corresponding variable given its parents.

Factorization
Let G be a DAG over variables X_1, ..., X_n. We say that a joint probability distribution p factorizes according to G if p can be expressed as a product:
p(x_1, ..., x_n) = Π_{i=1}^{n} p(x_i | x_{π_i})
The individual factors p(x_i | x_{π_i}) are called local probabilistic models or conditional probability distributions (CPDs).

Bayesian network definition
A Bayesian network is a DAG G over variables X_1, ..., X_n, together with a distribution p that factorizes over G. p is specified as the set of conditional probability distributions associated with G's nodes.
[Figure: the alarm network shown above, with its conditional probability tables.]

Using a Bayes net for reasoning (1)
Computing any entry in the joint probability table is easy because of the factorization property:
p(B=1, E=0, A=1, C=1, R=0) = p(B=1) p(E=0) p(A=1 | B=1, E=0) p(C=1 | A=1) p(R=0 | E=0) = 0.01 × 0.995 × 0.8 × 0.7 × 0.9999 ≈ 0.0056
Computing marginal probabilities is also easy. E.g., what is the probability that a neighbor calls?
p(C=1) = Σ_{e,b,r,a} p(C=1, e, b, r, a) = ...
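For a network this small, both computations can be reproduced by brute force. The sketch below encodes the CPTs as read off the figure above (each entry is the probability that the variable equals 1 given its parents); the function names are just illustrative.

```python
from itertools import product

def p1(var, v):
    """p(var = 1 | parents), read off the alarm network's CPTs."""
    if var == "E":
        return 0.005
    if var == "B":
        return 0.01
    if var == "R":
        return 0.65 if v["E"] == 1 else 0.0001
    if var == "A":
        return {(0, 0): 0.001, (0, 1): 0.3,
                (1, 0): 0.8, (1, 1): 0.95}[(v["B"], v["E"])]
    if var == "C":
        return 0.7 if v["A"] == 1 else 0.05

def joint(v):
    """Factorization property: p(e, b, r, a, c) = product of p(x_i | parents)."""
    prob = 1.0
    for var in "EBRAC":
        q = p1(var, v)
        prob *= q if v[var] == 1 else 1.0 - q
    return prob

# The joint-table entry computed on the slide:
print(joint({"B": 1, "E": 0, "A": 1, "C": 1, "R": 0}))   # about 0.0056

# The marginal p(C = 1), summing the joint over all the other variables:
p_call = 0.0
for bits in product([0, 1], repeat=5):
    v = dict(zip("EBRAC", bits))
    if v["C"] == 1:
        p_call += joint(v)
print(p_call)   # roughly 0.057 with these CPT values
```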

Using a Bayes net for reasoning (2)
One might want to compute the conditional probability of a variable given evidence that is upstream from it in the graph. E.g., what is the probability of a call in case of a burglary?
p(C=1 | B=1) = p(C=1, B=1) / p(B=1) = Σ_{e,r,a} p(C=1, B=1, e, r, a) / Σ_{c,e,r,a} p(c, B=1, e, r, a)
This is called causal reasoning or prediction.

Using a Bayes net for reasoning (3)
We might have some evidence and need an explanation for it. In this case, we compute a conditional probability based on evidence that is downstream in the graph. E.g., suppose we got a call. What is the probability of a burglary? What is the probability of an earthquake?
p(B=1 | C=1) = p(C=1 | B=1) p(B=1) / p(C=1) = ...
p(E=1 | C=1) = p(C=1 | E=1) p(E=1) / p(C=1) = ...
This is evidential reasoning or explanation.
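Both kinds of query are ratios of sums of joint entries, so for this five-variable network they can be answered by plain enumeration. The sketch below is self-contained (it repeats the CPT encoding from the previous sketch); `query_prob` is a hypothetical helper, not a library function. It also computes the two conditionals needed for the "explaining away" comparison on the next slide.

```python
from itertools import product

CPT1 = {  # p(X = 1 | parents), read off the alarm network's figure
    "E": lambda v: 0.005,
    "B": lambda v: 0.01,
    "R": lambda v: 0.65 if v["E"] else 0.0001,
    "A": lambda v: {(0, 0): 0.001, (0, 1): 0.3,
                    (1, 0): 0.8, (1, 1): 0.95}[(v["B"], v["E"])],
    "C": lambda v: 0.7 if v["A"] else 0.05,
}

def joint(v):
    """p(e, b, r, a, c) via the factorization property."""
    prob = 1.0
    for var, f in CPT1.items():
        q = f(v)
        prob *= q if v[var] == 1 else 1.0 - q
    return prob

def query_prob(query, evidence):
    """p(query | evidence): sum matching joint entries and normalize by the evidence."""
    num = den = 0.0
    for bits in product([0, 1], repeat=5):
        v = dict(zip("EBRAC", bits))
        if all(v[k] == val for k, val in evidence.items()):
            den += joint(v)
            if all(v[k] == val for k, val in query.items()):
                num += joint(v)
    return num / den

# Causal reasoning (prediction): from the burglary down to the call.
print(query_prob({"C": 1}, {"B": 1}))            # about 0.57

# Evidential reasoning (explanation): from the call back up to its causes.
print(query_prob({"B": 1}, {"C": 1}))            # about 0.10
print(query_prob({"E": 1}, {"C": 1}))            # about 0.02

# Adding the radio report (used for "explaining away" on the next slide):
print(query_prob({"E": 1}, {"C": 1, "R": 1}))    # close to 1: earthquake now very likely
print(query_prob({"B": 1}, {"C": 1, "R": 1}))    # about 0.03: burglary is explained away
```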

Using a Bayes net for reasoning (4)
Suppose that you now gather more evidence: e.g., the radio announces an earthquake. What happens to the probabilities?
p(E=1 | C=1, R=1) > p(E=1 | C=1) and p(B=1 | C=1, R=1) < p(B=1 | C=1)
The earthquake now accounts for the call, so the probability of a burglary drops. This is called explaining away.

I-Maps
A directed acyclic graph (DAG) G whose nodes represent random variables X_1, ..., X_n is an I-map (independence map) of a distribution p if p satisfies the independence assumptions:
X_i ⊥ Nondescendants(X_i) | X_{π_i}, for i = 1, ..., n
where X_{π_i} are the parents of X_i.

Example
Consider all possible DAG structures over two variables. Which graph is an I-map for the following distribution?

x  y  p(x, y)
0  0  0.08
0  1  0.32
1  0  0.32
1  1  0.28

What about the following distribution?

x  y  p(x, y)
0  0  0.08
0  1  0.12
1  0  0.32
1  1  0.48

Example (continued)
In the first example, X and Y are not independent, so the only I-maps are the graphs X → Y and Y → X, which assume no independence. In the second example, we have p(x = 0) = 0.2, p(y = 0) = 0.4, and for all entries p(x, y) = p(x) p(y). Hence, X ⊥ Y, and there are three I-maps for this distribution: the graph in which X and Y are not connected, and both graphs above. Note that independence maps may have extra arcs!
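The independence claims in this example are easy to verify mechanically from the two tables; here is a minimal sketch (the helper name is just illustrative).

```python
def is_independent(joint, tol=1e-9):
    """Return True if p(x, y) = p(x) p(y) for every entry of the joint table."""
    px = {x: sum(pr for (a, b), pr in joint.items() if a == x) for x in (0, 1)}
    py = {y: sum(pr for (a, b), pr in joint.items() if b == y) for y in (0, 1)}
    return all(abs(joint[(x, y)] - px[x] * py[y]) <= tol
               for x in (0, 1) for y in (0, 1))

first  = {(0, 0): 0.08, (0, 1): 0.32, (1, 0): 0.32, (1, 1): 0.28}
second = {(0, 0): 0.08, (0, 1): 0.12, (1, 0): 0.32, (1, 1): 0.48}

print(is_independent(first))   # False: only X -> Y and Y -> X are I-maps
print(is_independent(second))  # True: the empty graph is also an I-map
```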