# Logic, Probability and Learning


1 Logic, Probability and Learning Luc De Raedt

2 Overview Logic Learning Probabilistic Learning Probabilistic Logic Learning

3 Closely following: Russell and Norvig, AI: A Modern Approach; Freiburg AI course slides (Burgard, Nebel, De Raedt). Probabilistic Learning

4 Probabilistic Learning 1. Probability theory 2. Bayesian Networks 3. Probabilistic Learning 4. Alternative models

5 Probabilistic Learning 1. Probability theory 2. Bayesian Networks 3. Probabilistic Learning 4. Alternative models

6 Daughter in Florence

7 Degree of belief and Probability Theory We (and other agents) are convinced of facts and rules only up to a certain degree. One possibility for expressing a degree of belief is to use probabilities. The agent is 90% (or 0.9) convinced by its sensor information, i.e., in 9 out of 10 cases the information is correct (the agent believes). Probabilities summarize the uncertainty that stems from lack of knowledge.

8 Unconditional Probabilities P(A) denotes the unconditional probability or prior probability that A will appear in the absence of any other information, for example: P(Cavity) = 0.1 Cavity is a proposition. We obtain prior probabilities from statistical analysis or general rules.

9 Unconditional Probabilities (2) In general, a random variable can take on not only the values true and false but also other values: P(Weather=Sunny) = 0.7, P(Weather=Rain) = 0.2, P(Weather=Cloudy) = 0.08, P(Weather=Snow) = 0.02, P(Headache=TRUE) = 0.1. Propositions can contain equalities over random variables. Logical connectives can be used to build propositions, e.g. P(Cavity & Injured) = 0.06.

10 Unconditional Probabilities P(X) is the vector of probabilities for the (ordered) domain of the random variable X: P(Headache) = ⟨0.1, 0.9⟩ and P(Weather) = ⟨0.7, 0.2, 0.08, 0.02⟩ define the probability distributions for the random variables Headache and Weather. P(Headache, Weather) is a 4x2 table of probabilities of all combinations of the values of this set of random variables:

|                  | Headache = TRUE          | Headache = FALSE          |
|------------------|--------------------------|---------------------------|
| Weather = Sunny  | P(W = Sunny & Headache)  | P(W = Sunny & ¬Headache)  |
| Weather = Rain   | P(W = Rain & Headache)   | P(W = Rain & ¬Headache)   |
| Weather = Cloudy | P(W = Cloudy & Headache) | P(W = Cloudy & ¬Headache) |
| Weather = Snow   | P(W = Snow & Headache)   | P(W = Snow & ¬Headache)   |
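Such a table maps directly onto a dictionary keyed by value combinations. A minimal sketch (the joint values below are illustrative, chosen only so that the marginals match the slide's P(Headache) and P(Weather)):

```python
# Joint distribution P(Weather, Headache) as a dict; entries are
# illustrative, chosen so the marginals match the slide's numbers.
joint = {
    ("sunny",  True): 0.070, ("sunny",  False): 0.630,
    ("rain",   True): 0.020, ("rain",   False): 0.180,
    ("cloudy", True): 0.008, ("cloudy", False): 0.072,
    ("snow",   True): 0.002, ("snow",   False): 0.018,
}

# Summing over Headache recovers the marginal P(Weather).
p_weather = {}
for (w, h), p in joint.items():
    p_weather[w] = p_weather.get(w, 0.0) + p

print(round(p_weather["sunny"], 3))   # 0.7
print(round(sum(joint.values()), 3))  # 1.0 -- the atomic events are exhaustive
```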

11 Conditional Probabilities (1) New information can change the probability. Example: the probability of a cavity increases if we know the patient has a toothache. If additional information is available, we can no longer use the prior probabilities! P(A | B) is the conditional or posterior probability of A given that all we know is B: P(Cavity | Toothache) = 0.8. P(X | Y) is the table of all conditional probabilities over all values of X and Y.

12 Conditional Probabilities (2) P(Weather | Headache) is a 4x2 table of conditional probabilities of all combinations of the values of this set of random variables:

|                  | Headache = TRUE           | Headache = FALSE           |
|------------------|---------------------------|----------------------------|
| Weather = Sunny  | P(W = Sunny \| Headache)  | P(W = Sunny \| ¬Headache)  |
| Weather = Rain   | P(W = Rain \| Headache)   | P(W = Rain \| ¬Headache)   |
| Weather = Cloudy | P(W = Cloudy \| Headache) | P(W = Cloudy \| ¬Headache) |
| Weather = Snow   | P(W = Snow \| Headache)   | P(W = Snow \| ¬Headache)   |

Conditional probabilities are obtained from unconditional probabilities (for P(B) > 0) by definition: P(A | B) = P(A & B) / P(B).

14 Conditional Probabilities (4) By definition, P(A | B) = P(A & B) / P(B). Product rule: P(A & B) = P(A | B) P(B). Analogously: P(A & B) = P(B | A) P(A). A and B are marginally independent if P(A | B) = P(A) (equivalently, P(B | A) = P(B)). Then (and only then) it holds that P(A & B) = P(A) P(B).

15 Joint Probability The agent assigns probabilities to every proposition in the domain. An atomic event is an assignment of values to all random variables X_1, ..., X_n (= a complete specification of a state). Example: let X and Y be boolean variables. Then we have the following 4 atomic events: X & Y, X & ¬Y, ¬X & Y, ¬X & ¬Y. The joint probability distribution P(X_1, ..., X_n) assigns a probability to every atomic event:

|         | Toothache | ¬Toothache |
|---------|-----------|------------|
| Cavity  | 0.04      | 0.06       |
| ¬Cavity | 0.01      | 0.89       |

Since all atomic events are disjoint (the conjunction of any two of them is necessarily false) and exhaustive, the sum of all fields is 1.

16 Working with Joint Probability All relevant probabilities can be computed from the joint probability by expressing them as disjunctions of atomic events. Examples: P(Cavity ∨ Toothache) = P(Cavity & Toothache) + P(¬Cavity & Toothache) + P(Cavity & ¬Toothache). We obtain unconditional probabilities by adding across a row or column (marginalization): P(Cavity) = P(Cavity & Toothache) + P(Cavity & ¬Toothache). P(Cavity | Toothache) = P(Cavity & Toothache) / P(Toothache) = 0.04 / 0.05 = 0.80.
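These computations can be checked mechanically. A minimal sketch, using the classic textbook values for the 2x2 joint over Cavity and Toothache (consistent with the 0.04 and 0.80 on this slide):

```python
# Joint distribution over (Cavity, Toothache).
joint = {
    (True,  True):  0.04,  # Cavity & Toothache
    (True,  False): 0.06,  # Cavity & ¬Toothache
    (False, True):  0.01,  # ¬Cavity & Toothache
    (False, False): 0.89,  # ¬Cavity & ¬Toothache
}

# Marginalization: sum a row or column.
p_toothache = sum(p for (c, t), p in joint.items() if t)
# Disjunction as a sum of disjoint atomic events.
p_cavity_or_toothache = sum(p for (c, t), p in joint.items() if c or t)
# Conditioning: P(Cavity | Toothache) = P(Cavity & Toothache) / P(Toothache).
p_cavity_given_toothache = joint[(True, True)] / p_toothache

print(round(p_toothache, 3))               # 0.05
print(round(p_cavity_or_toothache, 3))     # 0.11
print(round(p_cavity_given_toothache, 3))  # 0.8
```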

17 Problems with Joint Distribution We can easily obtain all probabilities from the joint probability. The joint probability, however, involves k^n values if there are n random variables with k values each. Difficult to represent, difficult to assess. Questions: 1. Is there a more compact way of representing joint probabilities? 2. Is there an efficient method to work with this representation? Not in general, but it can work in many cases. Modern systems work directly with conditional probabilities and make assumptions about the independence of variables in order to simplify calculations.

18 Bayes Rule We know (product rule): P(A & B) = P(A | B) P(B) and P(A & B) = P(B | A) P(A). Equating the right-hand sides gives P(A | B) P(B) = P(B | A) P(A), and hence

P(A | B) = P(B | A) P(A) / P(B)

For multi-valued variables (a set of equalities):

P(Y | X) = P(X | Y) P(Y) / P(X)

Generalization (conditioning on background evidence E):

P(Y | X, E) = P(X | Y, E) P(Y | E) / P(X | E)

19 Applying Bayes Rule P(Toothache | Cavity) = 0.4, P(Cavity) = 0.1, P(Toothache) = 0.05, so P(Cavity | Toothache) = 0.4 x 0.1 / 0.05 = 0.8. Why don't we try to assess P(Cavity | Toothache) directly? P(Toothache | Cavity) (causal) is more robust than P(Cavity | Toothache) (diagnostic): P(Toothache | Cavity) is independent of the prior probabilities P(Toothache) and P(Cavity). If there is a cavity epidemic and P(Cavity) increases, P(Toothache | Cavity) does not change, but P(Toothache) and P(Cavity | Toothache) will change proportionally.
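Bayes rule is one line of code; the sketch below applies it to this slide's numbers:

```python
# Bayes rule: P(A | B) = P(B | A) * P(A) / P(B).
def bayes(p_b_given_a, p_a, p_b):
    return p_b_given_a * p_a / p_b

# P(Cavity | Toothache) from P(Toothache | Cavity) = 0.4,
# P(Cavity) = 0.1 and P(Toothache) = 0.05.
print(round(bayes(0.4, 0.1, 0.05), 3))  # 0.8
```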

20 Multiple Evidence (1) A dentist's probe catches in the aching tooth of a patient. Using Bayes rule, we can calculate: P(Cavity | Catch) = 0.95. But how does the combined evidence help? Using Bayes rule, the dentist could establish:

P(Cav | Tooth & Catch) = P(Tooth & Catch | Cav) x P(Cav) / P(Tooth & Catch)
                       = α P(Tooth & Catch | Cav) x P(Cav)

21 Multiple Evidence (2) Problem: the dentist needs P(Tooth & Catch | Cav), i.e. diagnostic knowledge of all combinations of symptoms in the general case. It would be nice if Tooth and Catch were independent, but they are not: if a probe catches in the tooth, the tooth probably has a cavity, which probably causes toothache. They are, however, independent given that we know whether the tooth has a cavity: P(Tooth & Catch | Cav) = P(Tooth | Cav) P(Catch | Cav). Each is directly caused by the cavity, but neither has a direct effect on the other.

22 Conditional Independence The general definition of conditional independence of two variables X and Y given a third variable Z is: P(X, Y | Z) = P(X | Z) P(Y | Z). Thus our diagnostic problem turns into: P(Cav | Tooth & Catch) = α P(Tooth | Cav) P(Catch | Cav) P(Cav).
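A small sketch of this normalized combination of evidence. The conditional probabilities below are illustrative assumptions, not values from the slides; α is computed by normalizing over both values of Cav:

```python
# P(Cav | Tooth, Catch) = α P(Tooth | Cav) P(Catch | Cav) P(Cav).
# All numbers are illustrative assumptions.
p_cav = 0.1
p_tooth_given_cav = {True: 0.4, False: 0.05}   # P(Tooth | Cav)
p_catch_given_cav = {True: 0.9, False: 0.2}    # P(Catch | Cav)

unnorm = {c: p_tooth_given_cav[c] * p_catch_given_cav[c]
             * (p_cav if c else 1.0 - p_cav)
          for c in (True, False)}
alpha = 1.0 / sum(unnorm.values())             # normalization constant
posterior = {c: alpha * p for c, p in unnorm.items()}
print(round(posterior[True], 3))  # 0.8
```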

23 Probabilistic Learning 1. Probability theory 2. Bayesian Networks 3. Probabilistic Learning 4. Alternative models

24 Different Types of Models Graphical models: directed (Bayesian networks, and HMMs as a special case) and undirected (Markov networks, conditional random fields).

25 Bayesian Networks (also belief networks, probabilistic networks, causal networks) 1. The random variables are the nodes. 2. Directed edges between nodes represent direct influence. 3. A conditional probability table (CPT) is associated with every node, quantifying the effect of its parent nodes. 4. The graph is acyclic (a DAG). Remark: Burglary and Earthquake are called the parents of Alarm.

26 Bayesian Networks

27 The Meaning of Bayesian Networks Alarm depends on Burglary and Earthquake. MaryCalls only depends on Alarm: P(MaryCalls | Alarm, Burglary) = P(MaryCalls | Alarm). A Bayesian network can be considered as a set of independence assumptions.

28 Conditional Independence Relations in Bayesian Networks (1) A node is conditionally independent of its nondescendants given its parents.

29 Example JohnCalls is conditionally independent of Burglary and Earthquake given the value of Alarm.

30 Bayesian Networks and the Joint Probability Bayesian networks can be seen as a compact representation of joint probabilities. Let all nodes X_1, ..., X_n be ordered topologically according to the arrows in the network, and let x_1, ..., x_n be values of the variables. Then, by the chain rule,

P(x_1, ..., x_n) = P(x_n | x_{n-1}, ..., x_1) ... P(x_2 | x_1) P(x_1) = ∏_{i=1}^{n} P(x_i | x_{i-1}, ..., x_1)

From the independence assumptions, this is equivalent to

P(x_1, ..., x_n) = ∏_{i=1}^{n} P(x_i | parents(X_i))

We can calculate the joint probability from the network topology and the CPTs!

31 Example Only the probabilities for positive events are given in the CPTs. The probabilities of negative events can be found using P(¬X) = 1 - P(X). For the alarm network:

P(J, M, A, ¬B, ¬E) = P(J | A) P(M | A) P(A | ¬B, ¬E) P(¬B) P(¬E) = 0.9 x 0.7 x 0.001 x 0.999 x 0.998 ≈ 0.00063
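The factorized joint can be sketched in a few lines, using the standard alarm-network CPTs from Russell and Norvig:

```python
# CPTs of the alarm network (each entry is P(var = true); the
# negative case is 1 - p).
p_b = 0.001                                         # P(Burglary)
p_e = 0.002                                         # P(Earthquake)
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(Alarm | B, E)
p_j = {True: 0.90, False: 0.05}                     # P(JohnCalls | Alarm)
p_m = {True: 0.70, False: 0.01}                     # P(MaryCalls | Alarm)

def ev(p_true, v):
    """Probability of observing value v when P(var = true) = p_true."""
    return p_true if v else 1.0 - p_true

def joint(b, e, a, j, m):
    """P(b, e, a, j, m) as a product of CPT entries."""
    return (ev(p_b, b) * ev(p_e, e) * ev(p_a[(b, e)], a)
            * ev(p_j[a], j) * ev(p_m[a], m))

# P(J, M, A, ¬B, ¬E) as on this slide.
print(round(joint(b=False, e=False, a=True, j=True, m=True), 5))  # 0.00063
```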

32 Compactness of Bayesian Networks For the explicit representation of the joint distribution, we need a table of size 2^n, where n is the number of (boolean) variables. If every node in the network has at most k parents, we only need n tables of size 2^k. Example: n = 20 and k = 5 gives 2^20 = 1,048,576 versus 20 x 2^5 = 640 explicitly represented probabilities! In the worst case, a Bayesian network can become exponentially large, for example if every variable is directly influenced by all the others. The size depends on the application domain (local vs. global interaction) and the skill of the designer.

33 Inference in Bayesian Networks Instantiating evidence variables and sending queries to nodes: what is P(Burglary | JohnCalls) or P(Burglary | JohnCalls, MaryCalls)?

34 Conditional Independence Relations in Bayesian Networks (1) A node is conditionally independent of its nondescendants given its parents.

35 Example JohnCalls is independent of Burglary and Earthquake given the value of Alarm.

36 Conditional Independence Relations in Bayesian Networks (2) A node is conditionally independent of all other nodes in the network given its Markov blanket, i.e. its parents, children and children's parents.

37 Example Burglary is independent of JohnCalls and MaryCalls, given the values of Alarm and Earthquake.

38 Various problems for BNs Find P(state). Find P(X | e) -- the typical query. Find the most likely state: arg max_x P(X = x | e). Learning.

39 Exact Inference in Bayesian Networks Compute the posterior probability distribution for a set of query variables X given an observation, i.e. the values of a set of evidence variables E. The complete set of variables is X ∪ E ∪ Y, where Y are called the hidden variables. Typical query: P(X | e), where e are the observed values of E. In the remainder, X is a singleton. Example: P(Burglary | JohnCalls = true, MaryCalls = true) = ⟨0.284, 0.716⟩.

40 Inference by Enumeration P(X | e) = α P(X, e) = α Σ_y P(X, e, y). The network gives a complete representation of the full joint distribution, so a query can be answered by computing sums of products of conditional probabilities from the network.

41 Example Consider P(Burglary | JohnCalls = true, MaryCalls = true). The hidden variables are Earthquake and Alarm. We have: P(B | j, m) = α P(B, j, m) = α Σ_e Σ_a P(B, j, m, e, a). Exploiting the independence of variables, we obtain for B = b: P(b | j, m) = α Σ_e Σ_a P(j | a) P(m | a) P(a | e, b) P(e) P(b).
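The enumeration can be written directly as nested sums over the hidden variables. A self-contained sketch with the standard alarm-network CPTs:

```python
from itertools import product

# Standard alarm-network CPTs (P(var = true) entries).
p_b = 0.001
p_e = 0.002
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
p_j = {True: 0.90, False: 0.05}
p_m = {True: 0.70, False: 0.01}

def ev(p_true, v):
    return p_true if v else 1.0 - p_true

def joint(b, e, a, j, m):
    return (ev(p_b, b) * ev(p_e, e) * ev(p_a[(b, e)], a)
            * ev(p_j[a], j) * ev(p_m[a], m))

# P(B | j, m): sum out the hidden variables E and A, then normalize.
unnorm = {b: sum(joint(b, e, a, True, True)
                 for e, a in product((True, False), repeat=2))
          for b in (True, False)}
alpha = 1.0 / sum(unnorm.values())
print(round(alpha * unnorm[True], 3))  # 0.284
```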

42 Complexity of Exact Inference If the network is singly connected (a polytree), the time and space complexity of exact inference is linear in the size of the network. The burglary example is a typical singly connected network. For multiply connected networks, exact inference is NP-hard. There are approximate inference methods for multiply connected networks, such as sampling techniques or Markov chain Monte Carlo.

43 Probabilistic Learning 1. Probability theory 2. Bayesian Networks 3. Probabilistic Learning 4. Alternative models

44 Probabilistic Learning Two problems parameter estimation structure learning

45 Parameter Estimation Given a set of examples E, a probabilistic model M = (S, λ) with structure S and parameters λ, a probabilistic coverage relation P(e | M) that computes the probability of observing the example e given the model M, and a scoring function score(E, M) that employs the probabilistic coverage relation P(e | M): find the parameters λ* that maximize score(E, M), i.e., λ* = arg max_λ score(E, (S, λ)).

46 Scoring functions The maximum a posteriori hypothesis H_MAP:

H_MAP = arg max_H P(H | E) = arg max_H P(E | H) P(H) / P(E)

This is simplified into the maximum likelihood hypothesis H_ML by assuming all hypotheses equally likely a priori:

H_ML = arg max_H P(E | H)

Assuming independently and identically distributed examples:

H_ML = arg max_H ∏_{e_i ∈ E} P(e_i | H)

It is often easier to maximize the log-likelihood:

H_LL = arg max_H log ∏_{e_i ∈ E} P(e_i | H) = arg max_H Σ_{e_i ∈ E} log P(e_i | H)

47 Toy Example

48 Parameter Estimation A complete data set: simply counting.

| A1    | A2    | A3    | A4    | A5    | A6    |
|-------|-------|-------|-------|-------|-------|
| true  | true  | false | true  | false | false |
| false | true  | true  | true  | false | false |
| true  | false | false | false | true  | true  |

49 Parameter Estimation An incomplete data set with hidden/latent values (? = missing value):

| A1   | A2    | A3 | A4    | A5    | A6    |
|------|-------|----|-------|-------|-------|
| true | true  | ?  | true  | false | false |
| ?    | true  | ?  | ?     | false | false |
| true | false | ?  | false | true  | ?     |

Real-world data: the states of some random variables are missing, e.g. in medical diagnosis not all patients are subjected to all tests. Missing values also arise from parameter reduction, e.g. clustering.

50 On the Alarm Net

51 Derivation (fully obs) Let λ_0 = P(a), λ_1 = P(m | a) and λ_2 = P(m | ¬a), and write n(x) for the number of examples in which x holds. For instance, P(¬a, ¬m) = (1 - λ_0)(1 - λ_2). Generalizing over the whole data set:

P(E | M) = λ_0^{n(a)} (1 - λ_0)^{n(¬a)} λ_1^{n(a,m)} (1 - λ_1)^{n(a,¬m)} λ_2^{n(¬a,m)} (1 - λ_2)^{n(¬a,¬m)}

Use the log-likelihood:

L = log P(E | M) = n(a) log λ_0 + n(¬a) log(1 - λ_0) + n(a,m) log λ_1 + n(a,¬m) log(1 - λ_1) + n(¬a,m) log λ_2 + n(¬a,¬m) log(1 - λ_2)

52 Derivation (fully obs) (2) Setting the derivatives to zero:

∂L/∂λ_0 = n(a)/λ_0 - n(¬a)/(1 - λ_0) = 0
∂L/∂λ_1 = n(a,m)/λ_1 - n(a,¬m)/(1 - λ_1) = 0
∂L/∂λ_2 = n(¬a,m)/λ_2 - n(¬a,¬m)/(1 - λ_2) = 0

yields the solutions

λ_0 = n(a) / (n(a) + n(¬a))
λ_1 = n(a,m) / (n(a,m) + n(a,¬m))
λ_2 = n(¬a,m) / (n(¬a,m) + n(¬a,¬m))

So, simply counting!
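"Simply counting" is literally what the maximum-likelihood estimator does. A minimal sketch on an invented, fully observed set of (Alarm, MaryCalls) pairs:

```python
# Fully observed (Alarm, MaryCalls) examples -- invented data.
data = [(True, True), (True, True), (True, False),
        (False, False), (False, True), (False, False)]

n_a = sum(1 for a, m in data if a)                               # n(a)
lam0 = n_a / len(data)                                           # P(a)
lam1 = sum(1 for a, m in data if a and m) / n_a                  # P(m | a)
lam2 = sum(1 for a, m in data if not a and m) / (len(data) - n_a)  # P(m | ¬a)

print(lam0)            # 0.5
print(round(lam1, 3))  # 0.667
print(round(lam2, 3))  # 0.333
```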

53 Parameter Estimation (part. obs) Idea: Expectation Maximization. Incomplete data? 1. Complete the data (imputation) with the most probable / average / ... value. 2. Count. 3. Iterate.

54 EM Idea: complete the incomplete data, using the current model P(m | a) = 0.6 and P(m | ¬a) = 0.2:

| A     | M     |
|-------|-------|
| true  | true  |
| true  | ?     |
| false | true  |
| true  | false |
| false | ?     |

Completing the data yields expected counts, e.g. n(a, m) = 1 + 0.6 = 1.6. Re-estimating from the expected counts gives λ_1 = P(m | a) = 1.6 / (1.6 + 1.4) ≈ 0.53 and λ_2 = P(m | ¬a) = 1.2 / (1.2 + 0.8) = 0.6, and then we iterate.
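One EM iteration on this toy data can be sketched as follows (the data and the current model P(m | a) = 0.6, P(m | ¬a) = 0.2 are taken from the slide):

```python
# (A, M) pairs; None marks a missing value of M.
data = [(True, True), (True, None), (False, True),
        (True, False), (False, None)]
p_m_given_a = {True: 0.6, False: 0.2}  # current model P(m | a), P(m | ¬a)

# E-step: distribute each missing value over both outcomes,
# weighted by the current model (expected counts).
counts = {(a, m): 0.0 for a in (True, False) for m in (True, False)}
for a, m in data:
    if m is None:
        counts[(a, True)] += p_m_given_a[a]
        counts[(a, False)] += 1.0 - p_m_given_a[a]
    else:
        counts[(a, m)] += 1.0

# M-step: re-estimate the parameters from the expected counts.
new_p = {a: counts[(a, True)] / (counts[(a, True)] + counts[(a, False)])
         for a in (True, False)}

print(round(counts[(True, True)], 3))  # 1.6 (expected count n(a, m))
print(round(new_p[False], 3))          # 0.6 (new P(m | ¬a))
```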

55 EM - formulae Q(λ), the expected log-likelihood, is a function of λ. Writing x_i for the observed part of example e_i, y_i for its hidden part, and λ_j for the current parameters:

Q(λ) = E[L(λ)] = E[log P(E | M(λ))]
     = E[ Σ_{e_i ∈ E} log P(e_i | M(λ)) ]
     = Σ_{e_i ∈ E} E[ log P(x_i, y_i | M(λ)) ]
     = Σ_{e_i ∈ E} Σ_{y_i} P(y_i | x_i, M(λ_j)) log P(x_i, y_i | M(λ))

56 EM

57 Parameter Tying With the additional parameters λ_3 = P(j | a) and λ_4 = P(j | ¬a):

P(E | M) = λ_0^{n(a)} (1 - λ_0)^{n(¬a)} λ_1^{n(a,m)} (1 - λ_1)^{n(a,¬m)} λ_2^{n(¬a,m)} (1 - λ_2)^{n(¬a,¬m)} λ_3^{n(a,j)} (1 - λ_3)^{n(a,¬j)} λ_4^{n(¬a,j)} (1 - λ_4)^{n(¬a,¬j)}

Assume λ_1 = λ_3 and λ_2 = λ_4. The likelihood then simplifies to

P(E | M) = λ_0^{n(a)} (1 - λ_0)^{n(¬a)} λ_1^{n(a,m)+n(a,j)} (1 - λ_1)^{n(a,¬m)+n(a,¬j)} λ_2^{n(¬a,m)+n(¬a,j)} (1 - λ_2)^{n(¬a,¬m)+n(¬a,¬j)}

58 Structure Learning Given a set of examples E, a language L_M of possible models of the form M = (S, λ) with structure S and parameters λ, a probabilistic coverage relation P(e | M) that computes the probability of observing the example e given the model M, and a scoring function score(E, M) that employs the probabilistic coverage relation P(e | M): find the model M* = (S, λ) that maximizes score(E, M), i.e., M* = arg max_M score(E, M).

59 Structure learning

H_MAP = arg max_H P(H | E)
      = arg max_H P(E | H) P(H) / P(E)
      = arg max_H P(E | H) P(H)
      = arg max_H log P(E | H) + log P(H)

Often a heuristic search through the space of models is used. For Bayesian nets: add, delete or reverse arcs -- a kind of refinement operator.

60 Probabilistic Learning 1. Probability theory 2. Bayesian Networks 3. Probabilistic Learning 4. Alternative models

61 Markov Models Markov models define a probability distribution over sequences. Assume stationarity and the Markov property:

∀ i, j ≥ 1: P(X_i | X_{i-1}) = P(X_j | X_{j-1})
∀ i ≥ 1: P(X_i | X_{i-1}, X_{i-2}, ..., X_0) = P(X_i | X_{i-1})

The Markov model then factorizes as

P(X_0, ..., X_n) = P(X_0) ∏_{i>0} P(X_i | X_{i-1})
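Under this factorization, the probability of a concrete sequence is just a product of transition probabilities. A sketch with an illustrative two-state weather chain (the initial distribution and transition matrix are assumptions):

```python
# Illustrative first-order Markov chain over {rain, sun}.
p0 = {"rain": 0.5, "sun": 0.5}                  # P(X_0)
trans = {"rain": {"rain": 0.7, "sun": 0.3},     # P(X_i | X_{i-1})
         "sun":  {"rain": 0.3, "sun": 0.7}}

def seq_prob(states):
    """P(X_0 = s_0, ..., X_n = s_n) = P(s_0) * prod P(s_i | s_{i-1})."""
    p = p0[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= trans[prev][cur]
    return p

print(round(seq_prob(["sun", "sun", "rain"]), 3))  # 0.5 * 0.7 * 0.3 = 0.105
```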

62 Hidden Markov Model Assume also that the observation model is stationary:

∀ i, j ≥ 0: P(Y_i | X_i) = P(Y_j | X_j)

The hidden Markov model factorizes as

P(Y_0, ..., Y_n, X_0, ..., X_n) = P(X_0) P(Y_0 | X_0) ∏_{i=1}^{n} P(Y_i | X_i) P(X_i | X_{i-1})

63 Example Possible states: {rain, sun}. Observations: {umbrella, no umbrella}. Questions: Given a sequence of umbrella / no umbrella observations, what was the weather outside? Is it raining now (given a sequence of observations)? Will there be sun tomorrow, given a sequence of observations? Next week?

64 Typical Problems Decoding: compute the probability P(Y_0 = y_0, ..., Y_n = y_n) of a particular observation sequence. Most likely state sequence: compute the most likely hidden state sequence corresponding to a particular observation sequence, that is, arg max_{x_0,...,x_n} P(X_0 = x_0, ..., X_n = x_n | Y_0 = y_0, ..., Y_n = y_n). State prediction: compute the most likely state at time t based on a particular observation sequence: arg max_{x_t} P(X_t = x_t | Y_0 = y_0, ..., Y_n = y_n). If t < n, this is called smoothing; for t = n, filtering; and for t > n, prediction. Parameter estimation. For Markov models, all of these are EFFICIENT.
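Filtering, for instance, runs in time linear in the sequence length via the forward algorithm. A sketch on the rain/umbrella example; the transition and observation probabilities below are assumptions chosen for illustration:

```python
# Forward algorithm (filtering) on a rain/umbrella HMM.
# Transition and observation probabilities are illustrative assumptions.
trans = {True: {True: 0.7, False: 0.3},   # P(Rain_t | Rain_{t-1})
         False: {True: 0.3, False: 0.7}}
obs = {True: 0.9, False: 0.2}             # P(Umbrella | Rain)

def filter_rain(evidence, prior=0.5):
    """P(Rain_t = true | y_0..y_t) for a list of umbrella observations."""
    belief = {True: prior, False: 1.0 - prior}
    for u in evidence:
        # Predict: push the belief through the transition model.
        pred = {x: sum(trans[xp][x] * belief[xp] for xp in belief)
                for x in belief}
        # Update: weight by the observation likelihood, then normalize.
        like = {x: (obs[x] if u else 1.0 - obs[x]) * pred[x] for x in pred}
        z = sum(like.values())
        belief = {x: p / z for x, p in like.items()}
    return belief[True]

print(round(filter_rain([True, True]), 3))  # ≈ 0.883 after two umbrella days
```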

65 Conclusions Principles of Probability Theory & Graphical Models A wide variety of such models exist They can be learned from data


### CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a

### An Empirical Evaluation of Bayesian Networks Derived from Fault Trees

An Empirical Evaluation of Bayesian Networks Derived from Fault Trees Shane Strasser and John Sheppard Department of Computer Science Montana State University Bozeman, MT 59717 {shane.strasser, john.sheppard}@cs.montana.edu

### Discovering Markov Blankets: Finding Independencies Among Variables

Discovering Markov Blankets: Finding Independencies Among Variables Motivation: Algorithm: Applications: Toward Optimal Feature Selection. Koller and Sahami. Proc. 13 th ICML, 1996. Algorithms for Large

### 15-381: Artificial Intelligence. Probabilistic Reasoning and Inference

5-38: Artificial Intelligence robabilistic Reasoning and Inference Advantages of probabilistic reasoning Appropriate for complex, uncertain, environments - Will it rain tomorrow? Applies naturally to many

### Statistical Machine Translation: IBM Models 1 and 2

Statistical Machine Translation: IBM Models 1 and 2 Michael Collins 1 Introduction The next few lectures of the course will be focused on machine translation, and in particular on statistical machine translation

### Real Time Traffic Monitoring With Bayesian Belief Networks

Real Time Traffic Monitoring With Bayesian Belief Networks Sicco Pier van Gosliga TNO Defence, Security and Safety, P.O.Box 96864, 2509 JG The Hague, The Netherlands +31 70 374 02 30, sicco_pier.vangosliga@tno.nl

### Question 2 Naïve Bayes (16 points)

Question 2 Naïve Bayes (16 points) About 2/3 of your email is spam so you downloaded an open source spam filter based on word occurrences that uses the Naive Bayes classifier. Assume you collected the

### Bayesian Updating with Discrete Priors Class 11, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom

1 Learning Goals Bayesian Updating with Discrete Priors Class 11, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom 1. Be able to apply Bayes theorem to compute probabilities. 2. Be able to identify

### Incorporating Evidence in Bayesian networks with the Select Operator

Incorporating Evidence in Bayesian networks with the Select Operator C.J. Butz and F. Fang Department of Computer Science, University of Regina Regina, Saskatchewan, Canada SAS 0A2 {butz, fang11fa}@cs.uregina.ca

### Discovering Structure in Continuous Variables Using Bayesian Networks

In: "Advances in Neural Information Processing Systems 8," MIT Press, Cambridge MA, 1996. Discovering Structure in Continuous Variables Using Bayesian Networks Reimar Hofmann and Volker Tresp Siemens AG,

### A Bayesian Approach for on-line max auditing of Dynamic Statistical Databases

A Bayesian Approach for on-line max auditing of Dynamic Statistical Databases Gerardo Canfora Bice Cavallo University of Sannio, Benevento, Italy, {gerardo.canfora,bice.cavallo}@unisannio.it ABSTRACT In

### 5 Directed acyclic graphs

5 Directed acyclic graphs (5.1) Introduction In many statistical studies we have prior knowledge about a temporal or causal ordering of the variables. In this chapter we will use directed graphs to incorporate

### Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

### Introduction to Markov Chain Monte Carlo

Introduction to Markov Chain Monte Carlo Monte Carlo: sample from a distribution to estimate the distribution to compute max, mean Markov Chain Monte Carlo: sampling using local information Generic problem

### ML, MAP, and Bayesian The Holy Trinity of Parameter Estimation and Data Prediction. Avinash Kak Purdue University. January 4, :19am

ML, MAP, and Bayesian The Holy Trinity of Parameter Estimation and Data Prediction Avinash Kak Purdue University January 4, 2017 11:19am An RVL Tutorial Presentation originally presented in Summer 2008

### Part III: Machine Learning. CS 188: Artificial Intelligence. Machine Learning This Set of Slides. Parameter Estimation. Estimation: Smoothing

CS 188: Artificial Intelligence Lecture 20: Dynamic Bayes Nets, Naïve Bayes Pieter Abbeel UC Berkeley Slides adapted from Dan Klein. Part III: Machine Learning Up until now: how to reason in a model and

### Probabilistic Networks An Introduction to Bayesian Networks and Influence Diagrams

Probabilistic Networks An Introduction to Bayesian Networks and Influence Diagrams Uffe B. Kjærulff Department of Computer Science Aalborg University Anders L. Madsen HUGIN Expert A/S 10 May 2005 2 Contents

### An Introduction to Bioinformatics Algorithms Hidden Markov Models

Hidden Markov Models Outline CG-islands The Fair Bet Casino Hidden Markov Model Decoding Algorithm Forward-Backward Algorithm Profile HMMs HMM Parameter Estimation Viterbi training Baum-Welch algorithm

### 3. Markov property. Markov property for MRFs. Hammersley-Clifford theorem. Markov property for Bayesian networks. I-map, P-map, and chordal graphs

Markov property 3-3. Markov property Markov property for MRFs Hammersley-Clifford theorem Markov property for ayesian networks I-map, P-map, and chordal graphs Markov Chain X Y Z X = Z Y µ(x, Y, Z) = f(x,

### Knowledge-Based Probabilistic Reasoning from Expert Systems to Graphical Models

Knowledge-Based Probabilistic Reasoning from Expert Systems to Graphical Models By George F. Luger & Chayan Chakrabarti {luger cc} @cs.unm.edu Department of Computer Science University of New Mexico Albuquerque

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### Study Manual. Probabilistic Reasoning. Silja Renooij. August 2015

Study Manual Probabilistic Reasoning 2015 2016 Silja Renooij August 2015 General information This study manual was designed to help guide your self studies. As such, it does not include material that is

### Lecture 11: Graphical Models for Inference

Lecture 11: Graphical Models for Inference So far we have seen two graphical models that are used for inference - the Bayesian network and the Join tree. These two both represent the same joint probability

### MATH 56A SPRING 2008 STOCHASTIC PROCESSES 31

MATH 56A SPRING 2008 STOCHASTIC PROCESSES 3.3. Invariant probability distribution. Definition.4. A probability distribution is a function π : S [0, ] from the set of states S to the closed unit interval

### MS&E 226: Small Data

MS&E 226: Small Data Lecture 16: Bayesian inference (v3) Ramesh Johari ramesh.johari@stanford.edu 1 / 35 Priors 2 / 35 Frequentist vs. Bayesian inference Frequentists treat the parameters as fixed (deterministic).

### Chapter 28. Bayesian Networks

Chapter 28. Bayesian Networks The Quest for Artificial Intelligence, Nilsson, N. J., 2009. Lecture Notes on Artificial Intelligence, Spring 2012 Summarized by Kim, Byoung-Hee and Lim, Byoung-Kwon Biointelligence

### Bayesian Tutorial (Sheet Updated 20 March)

Bayesian Tutorial (Sheet Updated 20 March) Practice Questions (for discussing in Class) Week starting 21 March 2016 1. What is the probability that the total of two dice will be greater than 8, given that

### Join Bayes Nets: A New Type of Bayes net for Relational Data

Join Bayes Nets: A New Type of Bayes net for Relational Data Bayes nets, statistical-relational learning, knowledge discovery in databases Abstract Many real-world data are maintained in relational format,

### Andrew Blake and Pushmeet Kohli

1 Introduction to Markov Random Fields Andrew Blake and Pushmeet Kohli This book sets out to demonstrate the power of the Markov random field (MRF) in vision. It treats the MRF both as a tool for modeling

### Deterministic Problems

Chapter 2 Deterministic Problems 1 In this chapter, we focus on deterministic control problems (with perfect information), i.e, there is no disturbance w k in the dynamics equation. For such problems there

### Cell Phone based Activity Detection using Markov Logic Network

Cell Phone based Activity Detection using Markov Logic Network Somdeb Sarkhel sxs104721@utdallas.edu 1 Introduction Mobile devices are becoming increasingly sophisticated and the latest generation of smart

### Concept Learning. Machine Learning 1

Concept Learning Inducing general functions from specific training examples is a main issue of machine learning. Concept Learning: Acquiring the definition of a general category from given sample positive

### Compression algorithm for Bayesian network modeling of binary systems

Compression algorithm for Bayesian network modeling of binary systems I. Tien & A. Der Kiureghian University of California, Berkeley ABSTRACT: A Bayesian network (BN) is a useful tool for analyzing the

### Markov Chains. Ben Langmead. Department of Computer Science

Markov Chains Ben Langmead Department of Computer Science You are free to use these slides. If you do, please sign the guestbook (www.langmead-lab.org/teaching-materials), or email me (ben.langmead@gmail.com)

### Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Models vs. Patterns Models A model is a high level, global description of a