Probabilistic Graphical Models


Probabilistic Graphical Models
David Sontag, New York University
Lecture 1, January 26, 2012

One of the most exciting advances in machine learning (AI, signal processing, coding, control, ...) of recent decades

How can we gain global insight based on local observations?

Key idea
1. Represent the world as a collection of random variables X_1, ..., X_n with joint distribution p(X_1, ..., X_n)
2. Learn the distribution from data
3. Perform inference (compute conditional distributions p(X_i | X_1 = x_1, ..., X_m = x_m))

Reasoning under uncertainty
As humans, we are continuously making predictions under uncertainty
Classical AI and ML research ignored this phenomenon
Many of the most recent advances in technology are possible because of this new, probabilistic, approach

Applications: Deep question answering

Applications: Machine translation

Applications: Speech recognition

Applications: Stereo vision (input: two images; output: disparity)

Key challenges
1. Represent the world as a collection of random variables X_1, ..., X_n with joint distribution p(X_1, ..., X_n)
   - How does one compactly describe this joint distribution?
   - Directed graphical models (Bayesian networks)
   - Undirected graphical models (Markov random fields, factor graphs)
2. Learn the distribution from data
   - Maximum likelihood estimation. Other estimation methods?
   - How much data do we need? How much computation does it take?
3. Perform inference (compute conditional distributions p(X_i | X_1 = x_1, ..., X_m = x_m))

Syllabus overview
We will study Representation, Inference & Learning
First in the simplest case:
- Only discrete variables
- Fully observed models
- Exact inference & learning
Then generalize:
- Continuous variables
- Partially observed data during learning (hidden variables)
- Approximate inference & learning
Learn about algorithms, theory & applications

Logistics
Class webpage: http://cs.nyu.edu/~dsontag/courses/pgm12/
- Sign up for the mailing list!
- Draft slides posted before each lecture
Book: Probabilistic Graphical Models: Principles and Techniques by Daphne Koller and Nir Friedman, MIT Press (2009)
Office hours: Tuesday 5-6pm and by appointment. 715 Broadway, 12th floor, Room 1204
Grading: problem sets (70%) + final exam (30%)
- Grader is Chris Alberti (chris.alberti@gmail.com)
- 6-7 assignments (every 2 weeks). Both theory and programming
- First homework out today, due Feb. 9 at 5pm
- See collaboration policy on class webpage

Quick review of probability
Reference: Chapter 2 and Appendix A
What are the possible outcomes?
- Coin toss: Ω = {heads, tails}
- Die: Ω = {1, 2, 3, 4, 5, 6}
An event is a subset of outcomes S ⊆ Ω
- Examples for the die: {1, 2, 3}, {2, 4, 6}, ...
We measure each event using a probability function

Probability function
Assign a non-negative weight p(ω) to each outcome such that Σ_{ω ∈ Ω} p(ω) = 1
- Coin toss: p(heads) + p(tails) = 1
- Die: p(1) + p(2) + p(3) + p(4) + p(5) + p(6) = 1
Probability of an event S ⊆ Ω: p(S) = Σ_{ω ∈ S} p(ω)
- Example for the die: p({2, 4, 6}) = p(2) + p(4) + p(6)
Claim: p(S_1 ∪ S_2) = p(S_1) + p(S_2) - p(S_1 ∩ S_2)
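As a quick sanity check, here is a minimal Python sketch of outcomes, events, and the union claim, assuming a fair die:

```python
from fractions import Fraction

# Fair die: weight 1/6 on each outcome; events are subsets of the outcome space.
omega = {1, 2, 3, 4, 5, 6}
p = {w: Fraction(1, 6) for w in omega}

def prob(event):
    """p(S) = sum of p(w) over outcomes w in S."""
    return sum(p[w] for w in event)

S1, S2 = {1, 2, 3}, {2, 4, 6}
assert prob(omega) == 1
assert prob(S1 | S2) == prob(S1) + prob(S2) - prob(S1 & S2)  # inclusion-exclusion claim
print(prob(S1 | S2))  # 5/6
```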

Independence of events
Two events S_1, S_2 are independent if p(S_1 ∩ S_2) = p(S_1) p(S_2)

Conditional probability
Let S_1, S_2 be events with p(S_2) > 0. Then
  p(S_1 | S_2) = p(S_1 ∩ S_2) / p(S_2)
Claim 1: Σ_{ω ∈ S} p(ω | S) = 1
Claim 2: If S_1 and S_2 are independent, then p(S_1 | S_2) = p(S_1)

Two important rules
1. Chain rule: Let S_1, ..., S_n be events with p(S_i) > 0. Then
   p(S_1 ∩ S_2 ∩ ... ∩ S_n) = p(S_1) p(S_2 | S_1) ... p(S_n | S_1, ..., S_{n-1})
2. Bayes' rule: Let S_1, S_2 be events with p(S_1) > 0 and p(S_2) > 0. Then
   p(S_1 | S_2) = p(S_1 ∩ S_2) / p(S_2) = p(S_2 | S_1) p(S_1) / p(S_2)
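A worked Bayes' rule example in Python; the disease prevalence, test sensitivity, and false-positive rate below are made-up numbers, chosen only to exercise the formula:

```python
# Hypothetical numbers: a rare condition and a noisy test.
p_disease = 0.01            # p(S1): prior probability of the condition
p_pos_given_disease = 0.90  # p(S2 | S1): test sensitivity
p_pos_given_healthy = 0.05  # p(S2 | not S1): false positive rate

# Total probability: p(S2) = p(S2 | S1) p(S1) + p(S2 | not S1) p(not S1)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' rule: p(S1 | S2) = p(S2 | S1) p(S1) / p(S2)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # 0.154: a positive test still leaves the condition unlikely
```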

Discrete random variables
Often each outcome corresponds to a setting of various attributes (e.g., age, gender, hasPneumonia, hasDiabetes)
A random variable X is a mapping X : Ω → D
- D is some set (e.g., the integers)
- Induces a partition of all outcomes Ω
For some x ∈ D, we say p(X = x) = p({ω ∈ Ω : X(ω) = x}), the probability that variable X assumes state x
Notation: Val(X) = the set D of all values assumed by X (we will interchangeably call these the values or states of variable X)
p(X) is a distribution: Σ_{x ∈ Val(X)} p(X = x) = 1

Multivariate distributions
Instead of one random variable, we have a random vector X(ω) = [X_1(ω), ..., X_n(ω)]
X_i = x_i is an event. The joint distribution p(X_1 = x_1, ..., X_n = x_n) is simply defined as p(X_1 = x_1 ∩ ... ∩ X_n = x_n)
We will often write p(x_1, ..., x_n) instead of p(X_1 = x_1, ..., X_n = x_n)
Conditioning, the chain rule, Bayes' rule, etc. all apply

Working with random variables
For example, the conditional distribution
  p(X_1 | X_2 = x_2) = p(X_1, X_2 = x_2) / p(X_2 = x_2)
This notation means p(X_1 = x_1 | X_2 = x_2) = p(X_1 = x_1, X_2 = x_2) / p(X_2 = x_2) for every x_1 ∈ Val(X_1)
Two random variables are independent, X_1 ⊥ X_2, if
  p(X_1 = x_1, X_2 = x_2) = p(X_1 = x_1) p(X_2 = x_2)
for all values x_1 ∈ Val(X_1) and x_2 ∈ Val(X_2).

Example
Consider three binary-valued random variables X_1, X_2, X_3 with Val(X_i) = {0, 1}
Let the outcome space Ω be the cross-product of their states: Ω = Val(X_1) × Val(X_2) × Val(X_3)
X_i(ω) is the value for X_i in the assignment ω ∈ Ω
Specify p(ω) for each outcome ω ∈ Ω by a big table:

  x_1  x_2  x_3   p(x_1, x_2, x_3)
  0    0    0     0.11
  0    0    1     0.02
  ...
  1    1    1     0.05

How many parameters do we need to specify? 2^3 - 1
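The same big-table idea in Python; only the 0.11, 0.02, and 0.05 entries come from the table above, and the remaining five entries are made up so that the eight probabilities sum to 1:

```python
from itertools import product

# Joint distribution over three binary variables, stored as one big table.
joint = {
    (0, 0, 0): 0.11, (0, 0, 1): 0.02, (0, 1, 0): 0.15, (0, 1, 1): 0.20,
    (1, 0, 0): 0.12, (1, 0, 1): 0.18, (1, 1, 0): 0.17, (1, 1, 1): 0.05,
}
assert set(joint) == set(product([0, 1], repeat=3))
assert abs(sum(joint.values()) - 1.0) < 1e-9

# 2^3 entries, but they must sum to 1, so only 2^3 - 1 = 7 free parameters.
print(len(joint) - 1)  # 7
```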

Marginalization
Suppose X and Y are random variables with distribution p(X, Y)
- X: Intelligence, Val(X) = {Very High, High}
- Y: Grade, Val(Y) = {a, b}
Joint distribution specified by:

           X = vh   X = h
  Y = a    0.7      0.15
  Y = b    0.1      0.05

p(Y = a) = ? = 0.85
More generally, suppose we have a joint distribution p(X_1, ..., X_n). Then
  p(X_i = x_i) = Σ_{x_1} ... Σ_{x_{i-1}} Σ_{x_{i+1}} ... Σ_{x_n} p(x_1, ..., x_n)
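The marginalization above, carried out in Python on the same table:

```python
# Joint distribution from the slide: X = Intelligence (vh, h), Y = Grade (a, b).
joint = {
    ('a', 'vh'): 0.7, ('a', 'h'): 0.15,
    ('b', 'vh'): 0.1, ('b', 'h'): 0.05,
}

# Marginalize out X: p(Y = a) = sum over x of p(Y = a, X = x).
p_Y_a = sum(p for (y, x), p in joint.items() if y == 'a')
print(round(p_Y_a, 2))  # 0.85
```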

Conditioning
Suppose X and Y are random variables with distribution p(X, Y)
- X: Intelligence, Val(X) = {Very High, High}
- Y: Grade, Val(Y) = {a, b}

           X = vh   X = h
  Y = a    0.7      0.15
  Y = b    0.1      0.05

We can compute the conditional probability
  p(Y = a | X = vh) = p(Y = a, X = vh) / p(X = vh)
                    = p(Y = a, X = vh) / (p(Y = a, X = vh) + p(Y = b, X = vh))
                    = 0.7 / (0.7 + 0.1) = 0.875
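The same table again in Python, now conditioning on X = vh:

```python
# Joint distribution from the slide: X = Intelligence (vh, h), Y = Grade (a, b).
joint = {
    ('a', 'vh'): 0.7, ('a', 'h'): 0.15,
    ('b', 'vh'): 0.1, ('b', 'h'): 0.05,
}

# Condition: p(Y = a | X = vh) = p(Y = a, X = vh) / p(X = vh).
p_X_vh = sum(p for (y, x), p in joint.items() if x == 'vh')   # 0.7 + 0.1 = 0.8
print(round(joint[('a', 'vh')] / p_X_vh, 3))                  # 0.875
```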

Example: Medical diagnosis
- Variable for each symptom (e.g., fever, cough, fast breathing, shaking, nausea, vomiting)
- Variable for each disease (e.g., pneumonia, flu, common cold, bronchitis, tuberculosis)
Diagnosis is performed by inference in the model:
  p(pneumonia = 1 | cough = 1, fever = 1, vomiting = 0)
One famous model, Quick Medical Reference (QMR-DT), has 600 diseases and 4000 findings

Representing the distribution
Naively, we could represent multivariate distributions with a table of probabilities for each outcome (assignment)
How many outcomes are there in QMR-DT? 2^4600
- Estimation of the joint distribution would require a huge amount of data
- Inference of conditional probabilities, e.g. p(pneumonia = 1 | cough = 1, fever = 1, vomiting = 0), would require summing over exponentially many variables' values
- Moreover, this defeats the purpose of probabilistic modeling, which is to make predictions with previously unseen observations

Structure through independence
If X_1, ..., X_n are independent, then
  p(x_1, ..., x_n) = p(x_1) p(x_2) ... p(x_n)
2^n entries can be described by just n numbers (if |Val(X_i)| = 2)!
However, this is not a very useful model: observing a variable X_i cannot influence our predictions of X_j
If X_1, ..., X_n are conditionally independent given Y, denoted X_i ⊥ X_{-i} | Y, then
  p(y, x_1, ..., x_n) = p(y) p(x_1 | y) ∏_{i=2}^{n} p(x_i | x_1, ..., x_{i-1}, y)
                      = p(y) p(x_1 | y) ∏_{i=2}^{n} p(x_i | y)
This is a simple, yet powerful, model
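A quick sketch of the parameter counts involved, for an arbitrarily chosen n = 20 binary variables (and a binary class Y in the conditionally independent case):

```python
# Parameter counting for n binary variables X_1..X_n.
n = 20

full_joint = 2 ** n - 1        # arbitrary joint over X_1..X_n
fully_independent = n          # one number p(X_i = 1) per variable
naive_bayes = 1 + 2 * n        # p(Y = 1), plus p(X_i = 1 | Y = y) for y in {0, 1}

print(full_joint, fully_independent, naive_bayes)  # 1048575 20 41
```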

Example: naive Bayes for classification
Classify e-mails as spam (Y = 1) or not spam (Y = 0)
- Let 1 : n index the words in our vocabulary (e.g., English)
- X_i = 1 if word i appears in an e-mail, and 0 otherwise
- E-mails are drawn according to some distribution p(Y, X_1, ..., X_n)
Suppose that the words are conditionally independent given Y. Then,
  p(y, x_1, ..., x_n) = p(y) ∏_{i=1}^{n} p(x_i | y)
Estimate the model with maximum likelihood. Predict with:
  p(Y = 1 | x_1, ..., x_n) = p(Y = 1) ∏_{i=1}^{n} p(x_i | Y = 1) / Σ_{y ∈ {0,1}} p(Y = y) ∏_{i=1}^{n} p(x_i | Y = y)
Are the independence assumptions made here reasonable?
Philosophy: Nearly all probabilistic models are "wrong", but many are nonetheless useful
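A minimal naive Bayes sketch in Python; the four-word vocabulary and four training e-mails are made up, and Laplace smoothing (not mentioned above) is added to avoid zero counts:

```python
import numpy as np

# Tiny made-up dataset: X[j, i] = 1 if word i appears in e-mail j; y[j] = 1 means spam.
X = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 1, 1],
              [0, 1, 0, 1]])
y = np.array([1, 1, 0, 0])
alpha = 1.0  # Laplace smoothing

p_y1 = y.mean()                                                   # estimate of p(Y = 1)
p_x1_given_y = {c: (X[y == c].sum(axis=0) + alpha) /
                   ((y == c).sum() + 2 * alpha) for c in (0, 1)}  # p(X_i = 1 | Y = c)

def posterior_spam(x):
    """p(Y = 1 | x_1, ..., x_n) under the naive Bayes factorization."""
    lik = {c: np.prod(np.where(x == 1, p_x1_given_y[c], 1 - p_x1_given_y[c]))
           for c in (0, 1)}
    prior = {1: p_y1, 0: 1 - p_y1}
    joint = {c: prior[c] * lik[c] for c in (0, 1)}
    return joint[1] / (joint[0] + joint[1])

print(round(posterior_spam(np.array([1, 1, 0, 0])), 3))  # 0.9: classified as spam
```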

Bayesian networks
Reference: Chapter 3
A Bayesian network is specified by a directed acyclic graph G = (V, E) with:
1. One node i ∈ V for each random variable X_i
2. One conditional probability distribution (CPD) per node, p(x_i | x_Pa(i)), specifying the variable's probability conditioned on its parents' values
This corresponds 1-1 with a particular factorization of the joint distribution:
  p(x_1, ..., x_n) = ∏_{i ∈ V} p(x_i | x_Pa(i))
Powerful framework for designing algorithms to perform probability computations

Example
Consider the following Bayesian network over Difficulty (D), Intelligence (I), Grade (G), SAT (S), and Letter (L), with edges D → G, I → G, I → S, G → L:

  p(d):              d0      d1
                     0.6     0.4

  p(i):              i0      i1
                     0.7     0.3

  p(g | i, d):       i0,d0   i0,d1   i1,d0   i1,d1
              g1     0.3     0.05    0.9     0.5
              g2     0.4     0.25    0.08    0.3
              g3     0.3     0.7     0.02    0.2

  p(l | g):          g1      g2      g3
              l0     0.1     0.4     0.99
              l1     0.9     0.6     0.01

  p(s | i):          i0      i1
              s0     0.95    0.2
              s1     0.05    0.8

What is its joint distribution?
  p(x_1, ..., x_n) = ∏_{i ∈ V} p(x_i | x_Pa(i))
  p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)
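Computing the probability of one full assignment in Python, using the CPD values above (the function name joint is just for illustration):

```python
# CPDs of the student network, transcribed from the slide.
p_d = {'d0': 0.6, 'd1': 0.4}
p_i = {'i0': 0.7, 'i1': 0.3}
p_g = {('i0', 'd0'): {'g1': 0.3, 'g2': 0.4, 'g3': 0.3},
       ('i0', 'd1'): {'g1': 0.05, 'g2': 0.25, 'g3': 0.7},
       ('i1', 'd0'): {'g1': 0.9, 'g2': 0.08, 'g3': 0.02},
       ('i1', 'd1'): {'g1': 0.5, 'g2': 0.3, 'g3': 0.2}}
p_s = {'i0': {'s0': 0.95, 's1': 0.05}, 'i1': {'s0': 0.2, 's1': 0.8}}
p_l = {'g1': {'l0': 0.1, 'l1': 0.9}, 'g2': {'l0': 0.4, 'l1': 0.6},
       'g3': {'l0': 0.99, 'l1': 0.01}}

def joint(d, i, g, s, l):
    """p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)."""
    return p_d[d] * p_i[i] * p_g[(i, d)][g] * p_s[i][s] * p_l[g][l]

# Multiplying the CPDs is all it takes to evaluate one assignment.
print(round(joint('d0', 'i1', 'g1', 's1', 'l1'), 5))  # 0.6 * 0.3 * 0.9 * 0.8 * 0.9 = 0.11664
```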

More examples
- Naive Bayes: the label Y is the parent of the features X_1, X_2, X_3, ..., X_n
- Medical diagnosis: the diseases d_1, ..., d_n are the parents of the findings f_1, ..., f_m
Evidence is denoted by shading in a node
We can interpret a Bayesian network as a generative process. For example, to generate an e-mail, we
1. Decide whether it is spam or not spam, by sampling y ~ p(Y)
2. For each word i = 1 to n, sample x_i ~ p(x_i | Y = y)
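A sketch of this generative process in Python, with made-up parameters for a four-word vocabulary:

```python
import random

# Made-up naive Bayes parameters for illustration.
p_y1 = 0.4                                    # p(Y = 1), i.e. probability of spam
p_x1_given_y = {1: [0.8, 0.6, 0.1, 0.2],      # p(X_i = 1 | Y = 1) for each word i
                0: [0.1, 0.2, 0.5, 0.6]}      # p(X_i = 1 | Y = 0)

def generate_email():
    y = 1 if random.random() < p_y1 else 0           # step 1: sample the label
    x = [1 if random.random() < q else 0              # step 2: sample each word given the label
         for q in p_x1_given_y[y]]
    return y, x

print(generate_email())  # e.g. (1, [1, 1, 0, 0])
```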

Bayesian network structure implies conditional independencies!
(The student network: Difficulty, Intelligence, Grade, SAT, Letter.)
The joint distribution corresponding to the above BN factors as
  p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)
However, by the chain rule, any distribution can be written as
  p(d, i, g, s, l) = p(d) p(i | d) p(g | i, d) p(s | i, d, g) p(l | d, i, g, s)
Thus, we are assuming the following additional independencies:
  D ⊥ I,    S ⊥ {D, G} | I,    L ⊥ {I, D, S} | G.    What else?

Bayesian network structure implies conditional independencies!
Local structures and their independencies:
- Common parent (A ← B → C): fixing B decouples A and C ("given the level of gene B, the levels of A and C are independent")
- Cascade (A → B → C): knowing B decouples A and C ("given the level of gene B, the level of gene A provides no extra prediction value for the level of gene C")
- V-structure (A → C ← B): knowing C couples A and B ("if A correlates to C, then the chance for B to also correlate to C will decrease")
This important phenomenon is called "explaining away" and is what makes Bayesian networks so powerful
Generalizing the above arguments, we obtain that a variable is independent of its non-descendants given its parents

A simple justification (for the common parent structure)
We'll show that p(A, C | B) = p(A | B) p(C | B) for any distribution p(A, B, C) that factors according to this graph structure, i.e.
  p(A, B, C) = p(B) p(A | B) p(C | B)
Proof:
  p(A, C | B) = p(A, B, C) / p(B) = p(B) p(A | B) p(C | B) / p(B) = p(A | B) p(C | B)
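A numerical check of the same claim in Python, for one made-up distribution over binary A, B, C that factors as p(B) p(A | B) p(C | B):

```python
from itertools import product

# Made-up CPDs for a common-parent structure A <- B -> C.
p_b = {0: 0.3, 1: 0.7}
p_a_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}   # p_a_given_b[b][a]
p_c_given_b = {0: {0: 0.2, 1: 0.8}, 1: {0: 0.5, 1: 0.5}}   # p_c_given_b[b][c]

joint = {(a, b, c): p_b[b] * p_a_given_b[b][a] * p_c_given_b[b][c]
         for a, b, c in product([0, 1], repeat=3)}

for b in (0, 1):
    pb = sum(p for (a, bb, c), p in joint.items() if bb == b)
    for a, c in product([0, 1], repeat=2):
        lhs = joint[(a, b, c)] / pb                                     # p(A = a, C = c | B = b)
        rhs = (sum(joint[(a, b, cc)] for cc in (0, 1)) / pb) * \
              (sum(joint[(aa, b, c)] for aa in (0, 1)) / pb)            # p(A = a | B = b) p(C = c | B = b)
        assert abs(lhs - rhs) < 1e-12
print("A and C are conditionally independent given B")
```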

D-separation ("directed separated") in Bayesian networks
Algorithm to calculate whether X ⊥ Z | Y by looking at graph separation
Look to see if there is an active path between X and Z when the variables Y are observed
(Figures (a) and (b): two three-node examples over X, Y, and Z, with Y observed.)

D-separation ("directed separated") in Bayesian networks
Algorithm to calculate whether X ⊥ Z | Y by looking at graph separation
Look to see if there is an active path between X and Z when the variables Y are observed
(Figures (a) and (b): two more three-node examples over X, Y, and Z, with Y observed.)

D-separation ("directed separated") in Bayesian networks
Algorithm to calculate whether X ⊥ Z | Y by looking at graph separation
Look to see if there is an active path between X and Z when the variables Y are observed
(Figures (a) and (b): two more three-node examples over X, Y, and Z, with Y observed.)
If there is no such path, then X and Z are d-separated with respect to Y
d-separation reduces statistical independencies (hard) to connectivity in graphs (easy)
Important because it allows us to quickly prune the Bayesian network, finding just the relevant variables for answering a query
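A sketch of a d-separation test in Python. It does not trace active paths directly; instead it uses the equivalent ancestral moral graph criterion (restrict to the query variables and their ancestors, marry co-parents, drop edge directions, delete the observed nodes, and test connectivity). The function name and the parents dictionary, which encodes the student network from the earlier example, are ours:

```python
from collections import deque

def d_separated(parents, X, Z, Y):
    """True iff X and Z are d-separated given Y, via the ancestral moral graph."""
    # 1. Ancestral set of the query variables.
    relevant, stack = set(X) | set(Z) | set(Y), list(X) + list(Z) + list(Y)
    while stack:
        for p in parents.get(stack.pop(), []):
            if p not in relevant:
                relevant.add(p)
                stack.append(p)
    # 2. Moralize: undirected edges node-parent and between co-parents.
    adj = {v: set() for v in relevant}
    for v in relevant:
        ps = [p for p in parents.get(v, []) if p in relevant]
        for p in ps:
            adj[v].add(p); adj[p].add(v)
        for a in ps:
            for b in ps:
                if a != b:
                    adj[a].add(b)
    # 3. Remove observed nodes and check by BFS whether X can reach Z.
    blocked = set(Y)
    frontier, seen = deque(x for x in X if x not in blocked), set(X)
    while frontier:
        v = frontier.popleft()
        if v in Z:
            return False                    # found an active connection
        for u in adj[v] - blocked - seen:
            seen.add(u)
            frontier.append(u)
    return True

# Student network: D -> G <- I, G -> L, I -> S.
parents = {'G': ['D', 'I'], 'L': ['G'], 'S': ['I'], 'D': [], 'I': []}
print(d_separated(parents, {'D'}, {'I'}, set()))   # True:  D and I are marginally independent
print(d_separated(parents, {'D'}, {'I'}, {'G'}))   # False: conditioning on G couples them (explaining away)
print(d_separated(parents, {'L'}, {'I'}, {'G'}))   # True:  L is independent of I given G
```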

D-separation example 1
(Figure: a Bayesian network over X_1, ..., X_6.)

D-separation example 2
(Figure: a Bayesian network over X_1, ..., X_6.)

Summary
- Bayesian networks are given by (G, P), where P is specified as a set of local conditional probability distributions associated with G's nodes
- One interpretation of a BN is as a generative model, where variables are sampled in topological order
- Local and global independence properties are identifiable via d-separation criteria
- Computing the probability of any assignment is obtained by multiplying CPDs
- Bayes' rule is used to compute conditional probabilities
- Marginalization or inference is often computationally difficult