13.3 Inference Using Full Joint Distribution




191  The probability distribution on a single variable must sum to 1. It is also true that any joint probability distribution on any set of variables must sum to 1.

Recall that any proposition a is equivalent to the disjunction of all the atomic events in which a holds; call this set of events e(a). Atomic events are mutually exclusive, so the conjunction of any two distinct atomic events is false and has probability zero by axiom 2. Hence, from axiom 3,

  P(a) = Σ_{e_i ∈ e(a)} P(e_i)

Given a full joint distribution that specifies the probabilities of all atomic events, this equation provides a simple method for computing the probability of any proposition.

192  13.3 Inference Using Full Joint Distribution

              toothache           ¬toothache
           catch    ¬catch     catch    ¬catch
  cavity   0.108    0.012      0.072    0.008
 ¬cavity   0.016    0.064      0.144    0.576

E.g., there are six atomic events for cavity ∨ toothache:
  0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28

Extracting the distribution over a single variable (or some subset of the variables), the marginal probability, is obtained by adding the entries in the corresponding rows or columns. E.g.,
  P(cavity) = 0.108 + 0.012 + 0.072 + 0.008 = 0.2

We can write the following general marginalization (summing out) rule for any sets of variables Y and Z:
  P(Y) = Σ_{z ∈ Z} P(Y, z),
where the sum ranges over all possible combinations of values z of the variables Z.
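The same computation can be written down directly. Below is a minimal Python sketch (not from the slides; the dict-based table layout is an assumption) that stores the full joint distribution of Cavity, Toothache and Catch and sums the atomic events in which a proposition holds.

# A minimal sketch (assumed representation): the full joint distribution as a
# dict mapping atomic events (cavity, toothache, catch) to their probabilities.
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def prob(holds):
    # P(a) = sum over the atomic events e(a) in which proposition a holds
    return sum(p for event, p in joint.items() if holds(event))

print(prob(lambda e: e[0] or e[1]))   # P(cavity or toothache) ≈ 0.28
print(prob(lambda e: e[0]))           # P(cavity) ≈ 0.2  (marginalization)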

193  For example:

  P(Cavity) = Σ_{z ∈ {Catch, Toothache}} P(Cavity, z)

A variant of this rule involves conditional probabilities instead of joint probabilities, using the product rule:

  P(Y) = Σ_z P(Y | z) P(z)

This rule is called conditioning. Marginalization and conditioning turn out to be useful rules for all kinds of derivations involving probability expressions.

              toothache           ¬toothache
           catch    ¬catch     catch    ¬catch
  cavity   0.108    0.012      0.072    0.008
 ¬cavity   0.016    0.064      0.144    0.576

194  Computing a conditional probability:

  P(cavity | toothache) = P(cavity ∧ toothache) / P(toothache)
                        = (0.108 + 0.012) / (0.108 + 0.012 + 0.016 + 0.064)
                        = 0.12 / 0.2 = 0.6

Respectively,

  P(¬cavity | toothache) = (0.016 + 0.064) / 0.2 = 0.4

The two probabilities sum to one, as they should.
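As a check, here is a small Python sketch (same assumed dict representation as above) that computes P(cavity | toothache) and P(¬cavity | toothache) from the full joint table.

# Minimal sketch: conditional probability P(a | b) = P(a AND b) / P(b).
joint = {  # (cavity, toothache, catch) -> probability
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def prob(holds):
    return sum(p for e, p in joint.items() if holds(e))

def cond_prob(a, b):
    return prob(lambda e: a(e) and b(e)) / prob(b)

toothache = lambda e: e[1]
print(cond_prob(lambda e: e[0], toothache))       # P(cavity | toothache)  ≈ 0.6
print(cond_prob(lambda e: not e[0], toothache))   # P(¬cavity | toothache) ≈ 0.4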

195  1/P(toothache) = 1/0.2 = 5 is a normalization constant ensuring that the distribution P(Cavity | toothache) adds up to 1. Let α denote the normalization constant:

  P(Cavity | toothache) = α P(Cavity, toothache)
                        = α (P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch))
                        = α ([0.108, 0.016] + [0.012, 0.064])
                        = α [0.12, 0.08] = [0.6, 0.4]

In other words, we can calculate the conditional probability distribution without knowing P(toothache), using normalization.

196  More generally: we need to find the distribution of the query variable X (Cavity); the evidence variables E (Toothache) have observed values e; and the remaining unobserved variables are Y (Catch). Evaluation of a query:

  P(X | e) = α P(X, e) = α Σ_y P(X, e, y),

where the summation is over all possible y, i.e., over all possible combinations of values of the unobserved variables Y.
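A minimal Python sketch of this query procedure is given below (the representation and helper names are assumptions, not from the slides): it sums out the unobserved variables and normalizes at the end.

# Minimal sketch: P(X | e) = α Σ_y P(X, e, y), computed by enumerating the
# joint table over the unobserved variables and normalizing afterwards.
joint = {  # (cavity, toothache, catch) -> probability
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def query(var, evidence):
    # var: index of the query variable; evidence: dict {index: observed value}
    unnormalized = {
        value: sum(p for e, p in joint.items()
                   if e[var] == value and all(e[i] == v for i, v in evidence.items()))
        for value in (True, False)
    }
    alpha = 1.0 / sum(unnormalized.values())
    return {value: alpha * p for value, p in unnormalized.items()}

print(query(0, {1: True}))   # P(Cavity | toothache) ≈ {True: 0.6, False: 0.4}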

197  The terms P(X, e, y) are simply a subset of the entries in the joint probability distribution of the variables X, E, and Y, because X, E, and Y together constitute the complete set of variables for the domain.

Given the full joint distribution to work with, the equation on the previous slide can answer probabilistic queries for discrete variables. However, it does not scale well: for a domain described by n Boolean variables, it requires an input table of size O(2^n) and takes O(2^n) time to process the table. In realistic problems the approach is completely impractical.

198  13.4 Independence

If we expand the previous example with a fourth random variable Weather, which has four possible values, we have to copy the table of joint probabilities four times to get 32 entries in total.

Dental problems have no influence on the weather, hence:

  P(Weather = cloudy | toothache, catch, cavity) = P(Weather = cloudy)

By this observation and the product rule,

  P(toothache, catch, cavity, Weather = cloudy) = P(Weather = cloudy) P(toothache, catch, cavity)
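The saving can be made concrete with a small sketch, shown below. The weather probabilities are made-up illustrative numbers (the slides do not give a distribution for Weather); the dental table is the one from the earlier slides.

# Minimal sketch: with Weather independent of the dental variables, the
# 32-entry joint table factors into an 8-entry table and a 4-entry table.
dental = {  # P(cavity, toothache, catch), 8 entries (from the slides)
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}
weather = {"sunny": 0.6, "rain": 0.1, "cloudy": 0.29, "snow": 0.01}  # assumed numbers

def joint_entry(cavity, toothache, catch, w):
    # P(cavity, toothache, catch, Weather = w)
    #   = P(cavity, toothache, catch) * P(Weather = w)
    return dental[(cavity, toothache, catch)] * weather[w]

print(joint_entry(True, True, True, "cloudy"))   # 0.108 * 0.29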

199  A similar equation holds for the other values of the variable Weather, and hence

  P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather)

The required joint distribution tables have 8 and 4 elements.

Propositions a and b are independent if

  P(a | b) = P(a),  P(b | a) = P(b),  or equivalently  P(a ∧ b) = P(a) P(b)

Respectively, variables X and Y are independent of each other if

  P(X | Y) = P(X),  P(Y | X) = P(Y),  or  P(X, Y) = P(X) P(Y)

Independent coin flips: P(C_1, …, C_n) can be represented as the product of n single-variable distributions P(C_i).

200  13.5 Bayes Rule and Its Use

By the product rule, P(a ∧ b) = P(a | b) P(b), and by the commutativity of conjunction, P(a ∧ b) = P(b | a) P(a). Equating the two right-hand sides and dividing by P(a), we get the Bayes rule

  P(b | a) = P(a | b) P(b) / P(a)

The more general case of multivalued variables X and Y, conditionalized on some background evidence e:

  P(Y | X, e) = P(X | Y, e) P(Y | e) / P(X | e)

Using normalization, Bayes rule can be written as

  P(Y | X) = α P(X | Y) P(Y)
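The definition of independence can be tested mechanically against a joint table; the short sketch below (assumed representation, not from the slides) checks whether P(X, Y) = P(X) P(Y) for every pair of values. For the dental table, Cavity and Toothache fail the test, as expected.

# Minimal sketch: test P(X, Y) = P(X) P(Y) for two Boolean variables
# of the joint table (indices: 0 = cavity, 1 = toothache, 2 = catch).
from math import isclose

joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.012,
    (True,  False, True):  0.072, (True,  False, False): 0.008,
    (False, True,  True):  0.016, (False, True,  False): 0.064,
    (False, False, True):  0.144, (False, False, False): 0.576,
}

def prob(holds):
    return sum(p for e, p in joint.items() if holds(e))

def independent(i, j):
    return all(
        isclose(prob(lambda e: e[i] == x and e[j] == y),
                prob(lambda e: e[i] == x) * prob(lambda e: e[j] == y))
        for x in (True, False) for y in (True, False))

print(independent(0, 1))   # False: Cavity and Toothache are not independent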

201  Half of meningitis patients have a stiff neck: P(s | m) = 0.5. The prior probability of meningitis is 1/50 000: P(m) = 1/50 000. Every 20th patient complains about a stiff neck: P(s) = 1/20.

What is the probability that a patient complaining about a stiff neck has meningitis?

  P(m | s) = P(s | m) P(m) / P(s) = 20 / (2 × 50 000) = 0.0002

202  Perhaps the doctor knows that a stiff neck implies meningitis in 1 out of 5 000 cases. The doctor, hence, has quantitative information in the diagnostic direction, from symptoms to causes, and no need to use Bayes rule.

Unfortunately, diagnostic knowledge is often more fragile than causal knowledge. If there is a sudden epidemic of meningitis, the unconditional probability of meningitis P(m) will go up. The conditional probability P(s | m), however, stays the same. The doctor who derived the diagnostic probability P(m | s) directly from statistical observation of patients before the epidemic will have no idea how to update the value. The doctor who computes P(m | s) from the other three values will see P(m | s) go up proportionally with P(m).
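The arithmetic is a one-liner; a tiny Python check of the numbers on the slide:

# Bayes rule with the numbers from the slide.
p_s_given_m = 0.5          # P(s | m)
p_m = 1 / 50_000           # P(m)
p_s = 1 / 20               # P(s)

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)         # ≈ 0.0002, i.e. 1 in 5 000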

203  All modern probabilistic inference systems are based on the use of Bayes rule. On the surface the relatively simple rule does not seem very useful. However, as the previous example illustrates, Bayes rule gives us a chance to apply existing knowledge.

We can avoid assessing the probability of the evidence, P(s), by instead computing a posterior probability for each value of the query variable, m and ¬m, and then normalizing the result:

  P(M | s) = α [ P(s | m) P(m), P(s | ¬m) P(¬m) ]

Thus, we need to estimate P(s | ¬m) instead of P(s). Sometimes easier, sometimes harder.

204  When a probabilistic query has more than one piece of evidence, the approach based on the full joint probability will not scale up:

  P(Cavity | toothache ∧ catch)

Neither will applying Bayes rule scale up in general:

  P(Cavity | toothache ∧ catch) = α P(toothache ∧ catch | Cavity) P(Cavity)

We would need the variables to be independent, but the variables Toothache and Catch obviously are not: if the probe catches in the tooth, the tooth probably has a cavity, and that cavity probably causes a toothache. Each is directly caused by the cavity, but neither has a direct effect on the other: catch and toothache are conditionally independent given Cavity.
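A short sketch of the normalization trick, continuing the meningitis example. P(s | ¬m) is not given on the slides, so the value below is an assumed one, chosen only for illustration.

# Minimal sketch: P(M | s) = α [ P(s | m) P(m), P(s | ¬m) P(¬m) ],
# avoiding a direct estimate of P(s).
p_s_given_m     = 0.5          # P(s | m), from the slide
p_m             = 1 / 50_000   # P(m), from the slide
p_s_given_not_m = 0.05         # P(s | ¬m): assumed value, for illustration only

unnormalized = [p_s_given_m * p_m, p_s_given_not_m * (1 - p_m)]
alpha = 1 / sum(unnormalized)
print([alpha * u for u in unnormalized])   # [P(m | s), P(¬m | s)]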

205  Conditional independence:

  P(toothache ∧ catch | Cavity) = P(toothache | Cavity) P(catch | Cavity)

Plugging this into Bayes rule yields

  P(Cavity | toothache ∧ catch) = α P(Cavity) P(toothache | Cavity) P(catch | Cavity)

Now we only need three separate distributions.

The general definition of conditional independence of variables X and Y, given a third variable Z, is

  P(X, Y | Z) = P(X | Z) P(Y | Z)

Equivalently, P(X | Y, Z) = P(X | Z) and P(Y | X, Z) = P(Y | Z).

206  If all effects are conditionally independent given a single cause, the exponential size of the knowledge representation is cut to linear.

A probability distribution is called a naïve Bayes model if all effects E_1, …, E_n are conditionally independent given a single cause C. The full joint probability distribution can be written as

  P(C, E_1, …, E_n) = P(C) Π_i P(E_i | C)

It is often used as a simplifying assumption even in cases where the effect variables are not conditionally independent given the cause variable. In practice, naïve Bayes systems can work surprisingly well, even when the independence assumption is not true.
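A minimal naïve Bayes sketch for the dental example is shown below; the code layout is an illustration, not from the slides. The conditional probabilities are the ones implied by the joint table on the earlier slides: P(toothache | cavity) = 0.6, P(catch | cavity) = 0.9, P(toothache | ¬cavity) = 0.1, P(catch | ¬cavity) = 0.2.

# Minimal sketch: naïve Bayes query P(C | e_1, ..., e_n) = α P(C) Π_i P(e_i | C).
p_cavity = 0.2
p_toothache_given = {True: 0.6, False: 0.1}   # P(toothache = true | Cavity)
p_catch_given     = {True: 0.9, False: 0.2}   # P(catch = true | Cavity)

def posterior(toothache, catch):
    unnormalized = {}
    for c in (True, False):
        prior = p_cavity if c else 1 - p_cavity
        pt = p_toothache_given[c] if toothache else 1 - p_toothache_given[c]
        pc = p_catch_given[c] if catch else 1 - p_catch_given[c]
        unnormalized[c] = prior * pt * pc
    alpha = 1 / sum(unnormalized.values())
    return {c: alpha * v for c, v in unnormalized.items()}

print(posterior(True, True))   # P(Cavity | toothache, catch) ≈ {True: 0.87, False: 0.13}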

207  14 PROBABILISTIC REASONING

A Bayesian network is a directed graph in which each node is annotated with quantitative probability information:

1. A set of random variables makes up the nodes of the network. Variables may be discrete or continuous.
2. A set of directed links (arrows) connects pairs of nodes. If there is an arrow from node X to node Y, then X is said to be a parent of Y.
3. Each node X_i has a conditional probability distribution P(X_i | Parents(X_i)).
4. The graph has no directed cycles and hence is a directed acyclic graph (DAG).

208  The intuitive meaning of an arrow X → Y is that X has a direct influence on Y.

The topology of the network (the set of nodes and links) specifies the conditional independence relations that hold in the domain. Once the topology of the Bayesian network has been laid out, we need only specify a conditional probability distribution for each variable, given its parents. The combination of the topology and the conditional distributions implicitly specifies the full joint distribution for all the variables.

[Network diagram: Weather stands alone; Cavity is the parent of both Toothache and Catch.]

209  Example (from LA)

A new burglar alarm (A) has been installed at home. It is fairly reliable at detecting a burglary (B), but also responds on occasion to minor earthquakes (E). Neighbors John (J) and Mary (M) have promised to call you at work when they hear the alarm. John always calls when he hears the alarm, but sometimes confuses the telephone ringing with the alarm and calls then, too. Mary, on the other hand, likes loud music and sometimes misses the alarm altogether. Given the evidence of who has or has not called, we would like to estimate the probability of a burglary.

210  [Network diagram: Burglary → Alarm ← Earthquake; Alarm → JohnCalls and Alarm → MaryCalls, with the following tables.]

  P(B) = .001        P(E) = .002

  B  E | P(A)
  T  T | .95
  T  F | .94
  F  T | .29
  F  F | .001

  A | P(J)           A | P(M)
  T | .90            T | .70
  F | .05            F | .01
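In code, the whole network can be written down as a handful of small tables; a minimal sketch (the dict layout is an assumption, the numbers are the ones on the slide):

# Minimal sketch: the burglary network as prior and conditional probability tables.
P_B = 0.001                         # P(Burglary = true)
P_E = 0.002                         # P(Earthquake = true)
P_A = {                             # P(Alarm = true | Burglary, Earthquake)
    (True,  True):  0.95, (True,  False): 0.94,
    (False, True):  0.29, (False, False): 0.001,
}
P_J = {True: 0.90, False: 0.05}     # P(JohnCalls = true | Alarm)
P_M = {True: 0.70, False: 0.01}     # P(MaryCalls = true | Alarm)

print(P_A[(True, False)])           # P(alarm | burglary, ¬earthquake) = 0.94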

211  The topology of the network indicates that burglaries and earthquakes affect the probability of the alarm's going off, whereas whether John and Mary call depends only on the alarm. They do not perceive any burglaries directly, they do not notice minor earthquakes, and they do not confer before calling.

Mary listening to loud music and John confusing the telephone ringing with the sound of the alarm can be read from the network only implicitly, as uncertainty associated with calling at work. The probabilities actually summarize a potentially infinite set of circumstances: the alarm might fail to go off due to high humidity, a power failure, a dead battery, cut wires, a dead mouse stuck inside the bell, etc., and John and Mary might fail to call and report an alarm because they are out to lunch, on vacation, temporarily deaf, a helicopter is passing overhead, etc.

212  The conditional probability tables in the network give the probabilities for the values of a random variable depending on the combination of values of its parent nodes. Each row must sum to 1, because its entries represent an exhaustive set of cases for the variable.

Above, all variables are Boolean, and therefore it is enough to know that if the probability of the value true is p, the probability of false must be 1 − p. In general, a table for a Boolean variable with k parents contains 2^k independently specifiable probabilities. A variable with no parents has only one row, representing the prior probabilities of each possible value of the variable.

213  14.2 The Semantics of Bayesian Networks

Every entry in the full joint probability distribution can be calculated from the information in a Bayesian network. A generic entry in the joint distribution is the probability of a conjunction of particular assignments to each variable, P(X_1 = x_1 ∧ … ∧ X_n = x_n), abbreviated as P(x_1, …, x_n). The value of this entry is

  P(x_1, …, x_n) = Π_{i=1..n} P(x_i | parents(X_i)),

where parents(X_i) denotes the specific values of the variables Parents(X_i).

  P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j | a) P(m | a) P(a | ¬b ∧ ¬e) P(¬b) P(¬e)
                         = .90 × .70 × .001 × .999 × .998 ≈ .000628

214  Constructing Bayesian networks

We can rewrite an entry in the joint distribution P(x_1, …, x_n), using the product rule, as P(x_n | x_{n-1}, …, x_1) P(x_{n-1}, …, x_1). Then we repeat the process, reducing each conjunctive probability to a conditional probability and a smaller conjunction. We end up with one big product:

  P(x_1, …, x_n) = P(x_n | x_{n-1}, …, x_1) P(x_{n-1} | x_{n-2}, …, x_1) … P(x_2 | x_1) P(x_1)
                 = Π_{i=1..n} P(x_i | x_{i-1}, …, x_1)

This identity holds true for any set of random variables and is called the chain rule. The specification of the joint distribution is thus equivalent to the general assertion that, for every variable X_i in the network,

  P(X_i | X_{i-1}, …, X_1) = P(X_i | Parents(X_i)),

provided that Parents(X_i) ⊆ {X_{i-1}, …, X_1}.
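The numerical example can be reproduced directly from the tables of the alarm network; a minimal sketch (same assumed dict layout as above):

# Minimal sketch: one entry of the joint distribution from the network,
# P(j, m, a, ¬b, ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e).
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}

b, e, a, j, m = False, False, True, True, True
entry = ((P_B if b else 1 - P_B) *
         (P_E if e else 1 - P_E) *
         (P_A[(b, e)] if a else 1 - P_A[(b, e)]) *
         (P_J[a] if j else 1 - P_J[a]) *
         (P_M[a] if m else 1 - P_M[a]))
print(entry)   # ≈ 0.000628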

215  The last condition is satisfied by labeling the nodes in any order that is consistent with the partial order implicit in the graph structure. Each node is required to be conditionally independent of its predecessors in the node ordering, given its parents. Hence, we need to choose as parents of a node all those nodes that directly influence the value of the variable.

For example, MaryCalls is certainly influenced by whether there is a Burglary or an Earthquake, but it is not directly influenced by them. We believe the following conditional independence statement to hold:

  P(MaryCalls | JohnCalls, Alarm, Earthquake, Burglary) = P(MaryCalls | Alarm)

216  A Bayesian network is a complete and nonredundant representation of the domain. In addition, it is a more compact representation than the full joint distribution, due to its locally structured properties: each subcomponent interacts directly with only a bounded number of other components, regardless of the total number of components. Local structure is usually associated with linear rather than exponential growth in complexity.

In the domain of a Bayesian network, if each of the n random variables is influenced by at most k others, then specifying each conditional probability table requires at most 2^k numbers, and altogether n·2^k numbers; the full joint distribution has 2^n cells. E.g., for n = 30 and k = 5: n·2^k = 960, while 2^n > 10^9.

217  In a Bayesian network we can trade accuracy against complexity. It may be wiser to leave out very weak dependencies from the network in order to restrict the complexity, but this yields lower accuracy.

Choosing the right topology for the network is a hard problem. The correct order in which to add nodes is to add the root causes first, then the variables they influence, and so on, until we reach the leaves, which have no direct causal influence on the other variables. Adding nodes in the wrong order makes the network unnecessarily complex and unintuitive.

218  [Network diagram: the burglary network constructed with the non-causal node ordering MaryCalls, JohnCalls, Earthquake, Burglary, Alarm, illustrating the unnecessarily complex structure that results.]