Introduction to Graphical Models


Robert Collins, CSE586
Credits: several slides are from other Introduction to Graphical Models materials.
Readings in Prince textbook: Chapters 10 and 11, but mainly only on directed graphs at this time.

Review: Probability Theory
Sum rule (marginal distributions): p(x) = Σ_y p(x, y)
Product rule: p(x, y) = p(y | x) p(x)
From these we have Bayes' theorem, p(y | x) = p(x | y) p(y) / p(x), with normalization factor p(x) = Σ_y p(x | y) p(y).

Review: Conditional Probability
Conditional probability (rewriting the product rule): P(A | B) = P(A, B) / P(B)
Chain rule: P(A, B, C, D) = P(A) P(B | A) P(C | A, B) P(D | A, B, C)
Conditional independence: P(A, B | C) = P(A | C) P(B | C)
Statistical independence: P(A, B) = P(A) P(B)
Christopher Bishop, MSR

Overview of Graphical Models
Graphical models model conditional dependence/independence.
Graph structure specifies how the joint probability factors.
Directed graphs (example: HMM); undirected graphs (example: MRF).
The Joint Distribution: recipe for making a joint distribution of M variables; example: Boolean variables A, B, C.
Inference by message passing: belief propagation — the sum-product algorithm and max-product (min-sum if using logs).
We will focus mainly on directed graphs right now.
Andrew Moore, CMU
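These rules are easy to sanity-check numerically. Below is a minimal NumPy sketch (the 2×2 joint table is invented for illustration) verifying the sum rule, the product rule, and Bayes' theorem:

```python
import numpy as np

# Joint distribution p(A, B) over two binary variables (made-up numbers).
p = np.array([[0.3, 0.1],
              [0.2, 0.4]])   # rows: A = 0, 1; columns: B = 0, 1

p_A = p.sum(axis=1)          # sum rule: p(A) = sum_B p(A, B)
p_B = p.sum(axis=0)          # sum rule: p(B) = sum_A p(A, B)

# Product rule: p(A, B) = p(A | B) p(B)
p_A_given_B = p / p_B        # column-wise division
assert np.allclose(p_A_given_B * p_B, p)

# Bayes' theorem: p(B | A) = p(A | B) p(B) / p(A)
p_B_given_A = (p_A_given_B * p_B) / p_A[:, None]
assert np.allclose(p_B_given_A.sum(axis=1), 1.0)  # each row normalizes to one
```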

The Joint Distribution
Example: Boolean variables A, B, C

Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.

A B C Prob
0 0 0 0.30
0 0 1 0.05
0 1 0 0.10
0 1 1 0.05
1 0 0 0.05
1 0 1 0.10
1 1 0 0.25
1 1 1 0.10

[Figure: the same eight probabilities drawn as an area diagram over regions A, B, C.]

Joint distributions
Good news: once you have a joint distribution, you can answer all sorts of probabilistic questions involving combinations of attributes.

Using the Joint
P(E) = Σ_{rows matching E} P(row)
P(Poor ∧ Male) = 0.4654
P(Poor) = 0.7604
Andrew Moore, CMU
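A small Python sketch of this recipe, using the slide's eight probabilities; the `prob` helper implements P(E) = Σ_{rows matching E} P(row):

```python
from itertools import product

# The 8-row joint distribution over Boolean A, B, C from the slide.
probs = [0.30, 0.05, 0.10, 0.05, 0.05, 0.10, 0.25, 0.10]
joint = {combo: p for combo, p in zip(product([0, 1], repeat=3), probs)}

assert abs(sum(joint.values()) - 1.0) < 1e-9  # axiom: entries sum to one

def prob(event):
    """P(E) = sum of P(row) over all rows matching the event predicate."""
    return sum(p for row, p in joint.items() if event(row))

# Example: P(A = 1) -- sum the four rows where A is true.
print(prob(lambda r: r[0] == 1))   # 0.5
```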

Inference with the Joint
Computing conditional probabilities:
P(E1 | E2) = P(E1 ∧ E2) / P(E2) = Σ_{rows matching E1 and E2} P(row) / Σ_{rows matching E2} P(row)
P(Male | Poor) = 0.4654 / 0.7604 = 0.612
Andrew Moore, CMU

Joint distributions
Good news: once you have a joint distribution, you can answer all sorts of probabilistic questions involving combinations of attributes.
Bad news: it is impossible to create a joint distribution for more than about ten attributes, because there are so many numbers needed when you build the thing. For 10 binary variables you need to specify 2^10 − 1 = 1023 numbers. (Question for class: why the −1?)

How to Use Fewer Numbers
Factor the joint distribution into a product of distributions over subsets of variables.
Identify (or just assume) independence between some subsets of variables.
Use that independence to simplify some of the distributions.
Graphical models provide a principled way of doing this.

Factoring: Directed versus Undirected Graphs
Consider an arbitrary joint distribution p(a, b, c). We can always factor it by application of the chain rule, e.g. p(a, b, c) = p(a) p(b | a) p(c | a, b), and ask what this factored form looks like as a graphical model.
Directed graph examples: Bayes nets, HMMs.
Undirected graph examples: MRFs.
Note: the word "graphical" denotes the graph structure underlying the model, not the fact that you can draw a pretty picture of it (although that helps).
Christopher Bishop, MSR
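Continuing the truth-table sketch above (reusing its `joint` dict and `prob` helper), a conditional query is just a ratio of two row sums:

```python
def cond_prob(event1, event2):
    """P(E1 | E2) = P(E1 and E2) / P(E2), both computed by summing rows."""
    return prob(lambda r: event1(r) and event2(r)) / prob(event2)

# Example: P(C = 1 | A = 1) from the same 8-row joint.
print(cond_prob(lambda r: r[2] == 1, lambda r: r[0] == 1))  # 0.4
```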

Graphical Model Concepts
P(S) = 0.3, P(M) = 0.6
P(L | M ∧ S) = 0.05, P(L | M ∧ ¬S) = 0.1, P(L | ¬M ∧ S) = 0.1, P(L | ¬M ∧ ¬S) = 0.2
P(T | L) = 0.3, P(T | ¬L) = 0.8
P(R | M) = 0.3, P(R | ¬M) = 0.6

Nodes represent random variables. Edges (or lack of edges) represent conditional dependence (or independence). Each node is annotated with a table of conditional probabilities with respect to its parents.
Note: the word "graphical" denotes the graph structure underlying the model, not the fact that you can draw a pretty picture of it using graphics.

Directed Acyclic Graphs
"Directed acyclic" means we can't follow arrows around in a cycle. Examples: chains; trees. Also, things that look like this: [figure of a more general DAG].

Factoring Examples
Joint distribution p(x) = Π_i p(x_i | pa_i), where pa_i denotes the parents of x_i. We can read the factored form of the joint distribution immediately from a directed graph: each node contributes a factor p(x_i | parents of x_i).
Two-node examples:
p(x, y) = p(x) p(y | x)   (edge x → y)
p(x, y) = p(y) p(x | y)   (edge y → x)
p(x, y) = p(x) p(y)       (no edge)
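As a sketch of how such a network encodes a joint distribution, the code below multiplies out the CPTs above under the structure S → L ← M, M → R, L → T (the factorization read off on the following slides). The variable letters follow the tables; what each stands for in the lecture's story is not recoverable here, and only the structure matters. The check that everything sums to one is the point:

```python
from itertools import product

# CPTs from the slide, each giving P(var = true | parent values).
pS = {True: 0.3, False: 0.7}
pM = {True: 0.6, False: 0.4}
pL = {(True, True): 0.05, (True, False): 0.1,    # keyed by (M, S)
      (False, True): 0.1, (False, False): 0.2}
pR = {True: 0.3, False: 0.6}                      # keyed by M
pT = {True: 0.3, False: 0.8}                      # keyed by L

def bern(p, v):
    """P(X = v) for a Bernoulli variable with P(X = true) = p."""
    return p if v else 1 - p

def joint_prob(s, m, l, r, t):
    """P(S,M,L,R,T) = P(S) P(M) P(L|S,M) P(R|M) P(T|L)."""
    return (pS[s] * pM[m] * bern(pL[(m, s)], l)
            * bern(pR[m], r) * bern(pT[l], t))

# Only 10 numbers specify the model, yet the induced joint over 2^5
# configurations is a valid distribution: it sums to one.
total = sum(joint_prob(*vals) for vals in product([True, False], repeat=5))
print(round(total, 10))   # 1.0
```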

Factoring Examples
We can read the form of the joint distribution directly from the directed graph, where pa_i denotes the parents of x_i. For a three-node chain, for example:
p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x2)
In general, each node contributes one factor, p(x_i | parents of x_i).

Graphical Model Concepts
(The same network as before: P(S) = 0.3, P(M) = 0.6, the table for P(L | M, S), P(T | L), and P(R | M).)
How about this one: P(L, R, M, S, T)?

Factoring Examples
Reading the factorization off the graph:
P(L, R, M, S, T) = P(S) P(M) P(L | S, M) P(R | M) P(T | L)
There are no directed cycles.
Note: point out that P(T | L) + P(T | ¬L) need not sum to 1 — they come from different rows of the conditional probability table.
Christopher Bishop, MSR

Factoring Examples
How many probabilities do we have to specify/learn (assuming each x_i is a binary variable)?
If the seven-node graph were fully connected, we would need 2^7 − 1 = 127.
But for this connectivity, we need 1 + 1 + 1 + 8 + 4 + 2 + 4 = 21: each node needs 2^k numbers, where k is its number of parents (here 2^0, 2^0, 2^0, 2^3, 2^2, 2^1, 2^2).
Note: if all nodes were independent, we would only need 7.

Important Case: Time Series
Consider modeling a time series of sequential data x1, x2, ..., xN. These could represent: locations of a tracked object over time; observations of the weather each day; spectral coefficients of a speech signal; joint angles during human motion.

Modeling Time Series
The simplest model of a time series is that all observations are independent. This would be appropriate for modeling successive tosses {heads, tails} of an unbiased coin. However, it doesn't really treat the series as a sequence: we could permute the ordering of the observations and not change a thing.
In the most general case, we could use the chain rule to state that any node is dependent on all previous nodes:
p(x1, x2, x3, x4, ...) = p(x1) p(x2 | x1) p(x3 | x1, x2) p(x4 | x1, x2, x3) ...
Look for an intermediate model between these two extremes.

Modeling Time Series
Markov assumption: p(xn | x1, x2, ..., xn−1) = p(xn | xn−1); that is, assume all conditional distributions depend only on the most recent previous observation.
The result is a first-order Markov chain:
p(x1, x2, x3, x4, ...) = p(x1) p(x2 | x1) p(x3 | x2) p(x4 | x3) ...

Modeling Time Series
Generalization: state-space models. You have a Markov chain of latent (unobserved) states x1, x2, ..., xn, and each state generates an observation y1, y2, ..., yn.
Goal: given a sequence of observations, predict the sequence of unobserved states that maximizes the joint probability.
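A few lines of Python make the parameter counting concrete. The parent counts below are an assumed reading of the seven-node graph on the slide (three roots, then nodes with 3, 2, 1, and 2 parents), consistent with the sum 1+1+1+8+4+2+4:

```python
# Each binary node X with k parents needs 2^k numbers:
# one P(X = 1 | parent values) per parent configuration.
parents_per_node = [0, 0, 0, 3, 2, 1, 2]   # assumed reading of the graph

factored = sum(2 ** k for k in parents_per_node)
full     = 2 ** 7 - 1     # fully connected: the entire joint table
indep    = 7              # all nodes independent: one number each

print(factored, full, indep)   # 21 127 7
```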

Modeling Time Series
Examples of state-space models: the hidden Markov model and the Kalman filter.
[Figure: latent chain x1, x2, ..., xn−1, xn, xn+1 with an observation yi hanging off each xi.]
The joint factors as:
p(x1, x2, x3, x4, ..., y1, y2, y3, y4, ...) = p(x1) p(y1 | x1) p(x2 | x1) p(y2 | x2) p(x3 | x2) p(y3 | x3) p(x4 | x3) p(y4 | x4) ...

Example of a Tree-structured Model
Confusion alert: our textbook uses w to denote a world state variable and x to denote a measurement (we have been using x to denote the world state and y as the measurement).

Message Passing: Belief Propagation
Example: 1D chain. Find the marginal for a particular node. For M-state nodes, the naive cost is exponential in the length of the chain; but we can exploit the graphical structure (conditional independences), where M is the number of discrete values a variable can take and N is the number of variables.
Applicable to both directed and undirected graphs.

Key Idea of Message Passing
Multiplication distributes over addition:
a * b + a * c = a * (b + c)
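A minimal sketch of sampling from such a state-space model (all numbers invented; two latent states, two observation symbols):

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy two-state state-space model: a latent Markov chain x_1..x_N,
# where each state generates one observation y_n.
p_x1    = np.array([0.5, 0.5])                  # initial state distribution
p_trans = np.array([[0.9, 0.1], [0.2, 0.8]])    # p(x_n | x_{n-1})
p_emit  = np.array([[0.8, 0.2], [0.3, 0.7]])    # p(y_n | x_n)

def sample(N):
    xs, ys = [], []
    x = rng.choice(2, p=p_x1)
    for _ in range(N):
        xs.append(int(x))
        ys.append(int(rng.choice(2, p=p_emit[x])))  # state emits observation
        x = rng.choice(2, p=p_trans[x])             # first-order Markov step
    return xs, ys

print(sample(10))
```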

Example
Computed naively, the example expression costs 48 multiplications + 23 additions; using distributivity, it costs only 5 multiplications + 6 additions. For message passing, this principle is applied to functions of random variables, rather than to the scalar variables as done here.

Message Passing
In the next several slides, we will consider an example of a simple, four-variable Markov chain.
Now consider computing the marginal distribution of variable x3. Multiplication distributes over addition, so the sums over x1, x2, and x4 can be pushed inside the product of factors.

Message Passing, aka Forward-Backward Algorithm
Can view this as sending/combining messages.
Express marginals as a product of messages evaluated forward from the ancestors of x_i and backward from the descendants of x_i:
p(x_i) ∝ α(x_i) β(x_i)   (Forward * Backward)
Messages are evaluated recursively. Find the normalization constant Z by normalizing the combined product. Works in both directed and undirected graphs.
Christopher Bishop, MSR
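Here is the same savings in code for the four-variable chain (CPTs reconstructed from the arithmetic on the following slides): the naive quadruple loop touches every configuration of the joint, while the distributed version pushes each sum inside and reduces to a few small products:

```python
import numpy as np

# Chain CPTs from the running example: p(x1), p(x2|x1), p(x3|x2), p(x4|x3).
p1 = np.array([0.7, 0.3])
A2 = np.array([[0.4, 0.6], [0.5, 0.5]])
A3 = np.array([[0.8, 0.2], [0.2, 0.8]])
A4 = np.array([[0.5, 0.5], [0.7, 0.3]])

# Naive: sum the full joint over x1, x2, x4 -- exponential in chain length.
naive = np.zeros(2)
for i in range(2):
    for j in range(2):
        for k in range(2):
            for l in range(2):
                naive[k] += p1[i] * A2[i, j] * A3[j, k] * A4[k, l]

# Distributed: push each sum inside -- linear in chain length.
smart = (p1 @ A2 @ A3) * (A4 @ np.ones(2))
print(naive, smart)   # both [0.458 0.542]
```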

Confusion Alert!
The standard notation for defining message passing heavily overloads the notion of multiplication: the messages are not scalars. It is more appropriate to think of them as vectors, matrices, or even tensors, depending on how many variables are involved, with multiplication defined accordingly.

The four-variable chain uses the following tables:
p(x1) = [.7 .3]
p(x2 | x1): rows [.4 .6] and [.5 .5]
p(x3 | x2): rows [.8 .2] and [.2 .8]
p(x4 | x3): rows [.5 .5] and [.7 .3]
Note: these are conditional probability tables, so the values in each row must sum to one. Combining a message with a table is not scalar multiplication!

Interpretation: for example, p(x2 = 2 | x1 = 1) = 0.6.
Sample computations:
p(x1 = 1, x2 = 1, x3 = 1, x4 = 1) = (.7)(.4)(.8)(.5) = .112
p(x1 = 2, x2 = 1, x3 = 2, x4 = 1) = (.3)(.5)(.2)(.7) = .021

Joint probability, represented in a truth table over x1, x2, x3, x4 (16 rows of p(x1, x2, x3, x4)).
Compute the marginal of x3 by summing the matching rows:
p(x3 = 1) = 0.458, p(x3 = 2) = 0.542

Now compute p(x3) via message passing.
Forward message from x1 to x2:
p(x1=1) p(x2=1 | x1=1) + p(x1=2) p(x2=1 | x1=2) = (.7)(.4) + (.3)(.5) = 0.43
p(x1=1) p(x2=2 | x1=1) + p(x1=2) p(x2=2 | x1=2) = (.7)(.6) + (.3)(.5) = 0.57

A simpler way to compute this: multiply [.7 .3] by the table p(x2 | x1) as a matrix product to get [.43 .57]; i.e., matrix multiplication can do the combining and marginalization all at once!

Continuing forward: [.43 .57] times the table p(x3 | x2) gives [.458 .542].

For the backward message, compute the sum along the rows of p(x4 | x3); this can also be done with a matrix multiply, and the row sums are [1 1]. (Note: this is not a coincidence — each row of a conditional probability table sums to one.)

Message Passing
Can view this as sending/combining messages:
Forward [.458 .542]: belief that x3 = 1 vs. x3 = 2, from the front part of the chain.
Backward [1 1]: belief that x3 = 1 vs. x3 = 2, from the back part of the chain.
How to combine them? Multiply elementwise:
Forward * Backward = [(.458)(1) (.542)(1)] = [.458 .542]
(after normalizing, but note that it was already normalized — again, not a coincidence).
These are the same values for the marginal p(x3) that we computed from the raw joint probability table. Whew!

If we want to compute all marginals, we can do it in one shot by cascading, for a big computational savings. We need one cascaded forward pass, one separate cascaded backward pass, then a combination and normalization at each node.

Forward pass: [.7 .3], [.43 .57], [.458 .542], [.6084 .3916]
Backward pass: [1 1], [1 1], [1 1], [1 1]
Combined + normalized: [.7 .3], [.43 .57], [.458 .542], [.6084 .3916]

Note: in this example — a directed Markov chain using true conditional probabilities (rows sum to one) — only the forward pass is needed. This is true because the backward pass sums along rows, and always produces [1 1]. We didn't really need forward AND backward here: the forward values are already the marginals.

Max Marginals
What if we want to know the most probable state (the mode of the distribution)? Since the marginal distributions can tell us which value of each variable yields the highest marginal probability (that is, which value is most likely), we might try to just take the arg max of each marginal distribution.

Marginals computed by belief propagation: [.7 .3], [.43 .57], [.458 .542], [.6084 .3916]
arg max: x1 = 1, x2 = 2, x3 = 2, x4 = 1
Although that's correct in this example, it isn't always the case.
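A compact NumPy version of the cascaded passes (same CPTs as above), reproducing the slide's forward values and the all-ones backward messages:

```python
import numpy as np

p1 = np.array([0.7, 0.3])
A  = [np.array([[0.4, 0.6], [0.5, 0.5]]),   # p(x2 | x1)
      np.array([[0.8, 0.2], [0.2, 0.8]]),   # p(x3 | x2)
      np.array([[0.5, 0.5], [0.7, 0.3]])]   # p(x4 | x3)

# Cascaded forward pass: alpha_{n+1} = alpha_n @ A_n
alphas = [p1]
for An in A:
    alphas.append(alphas[-1] @ An)

# Cascaded backward pass: beta_n = A_n @ beta_{n+1}, starting from ones.
betas = [np.ones(2)]
for An in reversed(A):
    betas.append(An @ betas[-1])
betas.reverse()

# Combine elementwise and normalize at each node.
for a, b in zip(alphas, betas):
    m = a * b
    print(m / m.sum())
# [0.7 0.3] [0.43 0.57] [0.458 0.542] [0.6084 0.3916]
```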

Max Marginals can Fail to Find the MAP
The marginals find the most likely value of each variable treated individually, which may not be the combination of values that jointly maximizes the distribution. (In the textbook's two-variable example, the max marginals give w1 = 4 and w2 = 2, while the actual MAP solution is w1 = 2, w2 = 4.)

Max-product Algorithm
Goal: find x^max = arg max_x p(x). Define the max-marginal, then run the message passing algorithm with sum replaced by max. This generalizes to any two operations forming a semiring.
Christopher Bishop, MSR

Computing the MAP Value
Can solve using the message passing algorithm with sum replaced by max. In our chain, we start at the end and work our way back to the root (x1) using the max-product algorithm, keeping track of the max value as we go.

Stage 1 to 2:
max[(.7)(.4), (.3)(.5)] = 0.28
max[(.7)(.6), (.3)(.5)] = 0.42
Stage 2 to 3: from [.28 .42], take the columnwise maxima of (.28)(.8), (.28)(.2), (.42)(.2), (.42)(.8):
[.28 .42] → [.224 .336]
Note that this is no longer matrix multiplication, since we are not summing down the columns but taking the max instead.
For the last node, compute the max along the rows of p(x4 | x3): [.5 .7].
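The same computation in NumPy: a max-product message update is an ordinary matrix product with the column sums replaced by column maxima:

```python
import numpy as np

p1 = np.array([0.7, 0.3])
A2 = np.array([[0.4, 0.6], [0.5, 0.5]])
A3 = np.array([[0.8, 0.2], [0.2, 0.8]])
A4 = np.array([[0.5, 0.5], [0.7, 0.3]])

def max_prod(msg, An):
    """Like msg @ An, but with max in place of sum down the columns."""
    return (msg[:, None] * An).max(axis=0)

m2 = max_prod(p1, A2)           # [0.28 0.42]
m3 = max_prod(m2, A3)           # [0.224 0.336]
mode = max_prod(m3, A4).max()   # 0.2352, the MAP value of the joint
print(m2, m3, mode)
```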

Combining at x3: take the elementwise max-product of the forward message [.224 .336] with the backward message [.5 .7]:
max[(.224)(.5), (.336)(.7)] = max[.112, .2352] = .2352, with arg max = 2.
Checking against the joint probability truth table over x1..x4: the largest value of the joint probability — the mode, i.e. the MAP value — is indeed .2352.
Interpretation: the mode of the joint distribution is .2352, and the value of variable x3 in the configuration that yields the mode value is 2.

Computing the Arg-Max of the MAP Value
Chris Bishop, PRML: "At this point, we might be tempted simply to continue with the message passing algorithm [sending forward-backward messages and combining to compute the arg max for each variable node]. However, because we are now maximizing rather than summing, it is possible that there may be multiple configurations of x all of which give rise to the maximum value for p(x). In such cases, this strategy can fail, because it is possible for the individual variable values obtained by maximizing the product of messages at each node to belong to different maximizing configurations, giving an overall configuration that no longer corresponds to a maximum. The problem can be resolved by adopting a rather different kind of message passing..."
Essentially, the solution is to write a dynamic programming algorithm based on max-product.

DP State Space Trellis
[Figure: trellis over x1..x4 with two states per node; edge weights are the transition probabilities .4/.6/.5/.5, .8/.2/.2/.8, .5/.5/.7/.3.]
The largest value of the joint probability (the mode, the MAP value) is achieved for x1 = 1, x2 = 2, x3 = 2, x4 = 1.
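A sketch of that dynamic programming fix (Viterbi-style) for our chain: keep an argmax back-pointer at each max-product step, then backtrack from the best final state to recover one consistent maximizing configuration:

```python
import numpy as np

p1 = np.array([0.7, 0.3])
A  = [np.array([[0.4, 0.6], [0.5, 0.5]]),
      np.array([[0.8, 0.2], [0.2, 0.8]]),
      np.array([[0.5, 0.5], [0.7, 0.3]])]

# Forward max-product sweep, remembering which predecessor won.
msg, back = p1, []
for An in A:
    scores = msg[:, None] * An        # score of each (prev, next) state pair
    back.append(scores.argmax(axis=0))
    msg = scores.max(axis=0)

# Backtrack from the best final state to recover one MAP configuration.
state = int(msg.argmax())
path = [state]
for bp in reversed(back):
    state = int(bp[state])
    path.append(state)
path.reverse()

print(msg.max(), [s + 1 for s in path])   # 0.2352, [1, 2, 2, 1]
```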

Belief Propagation Summary
The definition can be extended to general tree-structured graphs.
Works for both directed AND undirected graphs.
Efficiently computes marginals and MAP configurations.
At each node: form the product of incoming messages and local evidence; marginalize to give the outgoing message; one message flows in each direction across every link.

Loopy Belief Propagation
BP applied to a graph that contains loops: needs a propagation schedule, needs multiple iterations, and might not converge.
Typically works well, even though it isn't supposed to. State-of-the-art performance in error-correcting codes.
Gives the exact answer in any acyclic graph (no loops).
Christopher Bishop, MSR