# Introduction to Discrete Probability Theory and Bayesian Networks


Dr Michael Ashcroft — September 15, 2011

This document remains the property of Inatas. Reproduction in whole or in part without the written permission of Inatas is strictly forbidden.

## Contents

1. Introduction to Discrete Probability
   - Discrete Probability Spaces
     - Sample Spaces, Outcomes and Events
     - Probability Functions
   - The Probabilities of Events
   - Random Variables
   - Combinations of Events
   - Conditional Probability
   - Independence
   - Conditional Independence
   - The Chain Rule
   - Bayes' Theorem
2. Introduction to Bayesian Networks
   - Bayesian Networks
   - D-Separation, the Markov Blanket and Markov Equivalence
   - Potentials
   - Exact Inference on a Bayesian Network: The Variable Elimination Algorithm
   - Exact Inference on a Bayesian Network: The Junction Tree Algorithm
   - Inexact Inference on a Bayesian Network: Likelihood Sampling

## 1 Introduction to Discrete Probability

### 1.1 Discrete Probability Spaces

A discrete probability space is a pair ⟨S, P⟩, where S is a sample space and P is a probability function.

#### Sample Spaces, Outcomes and Events

An outcome is a value that the stochastic system we are modeling can take. The sample space of our model is the set of all outcomes. An event is a subset of the sample space. (So events are also sets of outcomes.)

**Example 1.** The sample space corresponding to rolling a die is {1, 2, 3, 4, 5, 6}. The outcomes of this sample space are 1, 2, 3, 4, 5 and 6. The events of this sample space are the members of its power set: ∅, {1}, {2}, {3}, {4}, {5}, {6}, {1, 2}, {1, 3}, {1, 4}, {1, 5}, {1, 6}, {2, 3}, {2, 4}, {2, 5}, {2, 6}, {3, 4}, {3, 5}, {3, 6}, {4, 5}, {4, 6}, {5, 6}, {1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 2, 6}, {1, 3, 4}, {1, 3, 5}, {1, 3, 6}, {1, 4, 5}, {1, 4, 6}, {1, 5, 6}, {2, 3, 4}, {2, 3, 5}, {2, 3, 6}, {2, 4, 5}, {2, 4, 6}, {2, 5, 6}, {3, 4, 5}, {3, 4, 6}, {3, 5, 6}, {4, 5, 6}, {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6}, {1, 2, 4, 5}, {1, 2, 4, 6}, {1, 2, 5, 6}, {1, 3, 4, 5}, {1, 3, 4, 6}, {1, 3, 5, 6}, {1, 4, 5, 6}, {2, 3, 4, 5}, {2, 3, 4, 6}, {2, 3, 5, 6}, {2, 4, 5, 6}, {3, 4, 5, 6}, {1, 2, 3, 4, 5}, {1, 2, 3, 4, 6}, {1, 2, 3, 5, 6}, {1, 2, 4, 5, 6}, {1, 3, 4, 5, 6}, {2, 3, 4, 5, 6}, {1, 2, 3, 4, 5, 6}.

#### Probability Functions

A probability function is a function p from S to the real numbers such that:

(i) 0 ≤ p(s) ≤ 1, for each s ∈ S.

(ii) Σ_{s∈S} p(s) = 1.

**Example 2.** If a six-sided die is fair, the sample space associated with the result of a single throw is {1, 2, 3, 4, 5, 6} and the probability function is:

p(1) = p(2) = p(3) = p(4) = p(5) = p(6) = 1/6

**Example 3.** Now imagine a die that is not fair: it is twice as likely to come up six as to come up any other number. The probability function for such a die would be:

p(1) = p(2) = p(3) = p(4) = p(5) = 1/7, p(6) = 2/7

### 1.2 The Probabilities of Events

The probability of an event E is defined:

p(E) = Σ_{o∈E} p(o)

**Example 4.** Continuing Example 2 (the fair die), the event E, that the roll of the die produces an odd number, is {1, 3, 5}. Therefore:

p(E) = Σ_{o∈E} p(o) = p(1) + p(3) + p(5) = 1/6 + 1/6 + 1/6 = 1/2

**Example 5.** Likewise, for Example 3 (the biased die), the event F, that the roll of the die produces 5 or 6, is {5, 6}. Therefore:

p(F) = Σ_{o∈F} p(o) = p(5) + p(6) = 1/7 + 2/7 = 3/7

Notice that an event represents the disjunctive claim that one of the outcomes that are its members occurred. So the event E ({1, 3, 5}) represents the event that the roll of the die produced 1 or 3 or 5.

### 1.3 Random Variables

Note: a random variable is neither random nor a variable!

Often we are interested in numerical values that are connected with our outcomes. We use random variables to model these. A random variable is a function from a sample space to the real numbers.

**Example 6.** Suppose a coin is flipped three times. Let X(t) be the random variable that equals the number of heads that appear when t is the outcome. Then X(t) takes the following values:

X(HHH) = 3
X(HHT) = X(HTH) = X(THH) = 2
X(HTT) = X(THT) = X(TTH) = 1
X(TTT) = 0

Notice that a random variable divides the sample space into a disjoint and exhaustive set of events, each mapped to a unique real number r. Let us term this set of events E_r. The probability distribution of a random variable X on a sample space S is the set of ordered pairs ⟨r, p(X = r)⟩ for all r ∈ X(S), where p(X = r) = Σ_{o∈E_r} p(o). This is the probability that an outcome o occurred such that X(o) = r, and is often characterized by saying that X took the value r. As we would expect:

(i) 0 ≤ p(E_r) ≤ 1, for each r ∈ X(S).

(ii) Σ_{r∈X(S)} p(E_r) = 1.
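The definitions above are easy to check numerically. A minimal sketch in Python (the variable names and structure are ours, not the document's), covering the fair-die event of Example 4 and the coin-flip random variable of Example 6:

```python
from fractions import Fraction
from itertools import product

# Fair die: each outcome has probability 1/6.
p = {o: Fraction(1, 6) for o in range(1, 7)}

# The probability of an event is the sum of the probabilities of its outcomes.
def p_event(event):
    return sum(p[o] for o in event)

odd = {1, 3, 5}                              # event E from Example 4
assert p_event(odd) == Fraction(1, 2)

# Random variable: number of heads in three fair coin flips (Example 6).
outcomes = list(product("HT", repeat=3))     # the sample space {H, T}^3
X = lambda t: t.count("H")                   # X maps outcomes to reals
dist = {}
for t in outcomes:
    dist[X(t)] = dist.get(X(t), Fraction(0)) + Fraction(1, 8)
# dist is the probability distribution of X: {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}
```

As expected, the distribution of X sums to 1, illustrating property (ii) above.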

Placing this in table form gives us a familiar discrete probability distribution.

Points to note about random variables:

1. A random variable and its probability distribution together constitute a probability space.
2. A function from the codomain of a random variable to the real numbers is itself a random variable.

### 1.4 Combinations of Events

Some theorems:

1. p(Ē) = 1 − p(E)
2. p(E and F) = p(E, F) = p(E ∩ F)
3. p(E or F) = p(E ∪ F) = p(E) + p(F) − p(E ∩ F)

Theorem 1 should be obvious! Note, though, that combined with the definition of a probability distribution, it entails that p(∅) = 0. Let's look at an example for theorem 3.

**Example 7.** Returning to Example 4 with the fair die, let the event E be {1, 3, 5} (that the roll of the die produces an odd number), and the event F be {5, 6} (that the roll of the die produces 5 or 6). We want to calculate the probability that one of these events occurs, which is to say that the event E ∪ F occurs. Using theorem 3 we see:

p(E ∪ F) = p(E) + p(F) − p(E ∩ F)
= (p(1) + p(3) + p(5)) + (p(5) + p(6)) − p(5)
= p(1) + p(3) + p(5) + p(6)
= 4/6 = 2/3

Which is as it should be!
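Theorems 1 and 3 can be verified on the fair die directly; a short sketch (our own code, with the events of Example 7):

```python
from fractions import Fraction

# Fair die.
p = {o: Fraction(1, 6) for o in range(1, 7)}

def p_event(event):
    return sum(p[o] for o in event)

E, F = {1, 3, 5}, {5, 6}

# Theorem 3 (inclusion-exclusion): p(E or F) = p(E) + p(F) - p(E and F)
lhs = p_event(E | F)
rhs = p_event(E) + p_event(F) - p_event(E & F)
assert lhs == rhs == Fraction(2, 3)

# Theorem 1: p(complement of E) = 1 - p(E)
assert p_event(set(range(1, 7)) - E) == 1 - p_event(E)
```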

### 1.5 Conditional Probability

The conditional probability of one event, E, given another, F, is denoted p(E | F):

p(E | F) = p(E ∩ F) / p(F)

**Example 8.** Continuing Example 7, we can calculate the probability that the roll of the die produces 5 or 6 given that it produces an odd number:

p(F | E) = p(F ∩ E) / p(E) = p(5) / (p(1) + p(3) + p(5)) = (1/6) / (3/6) = 1/3

### 1.6 Independence

Two events, E and F, are independent if and only if p(E, F) = p(E)p(F). That is to say, the probability of both events occurring is simply the probability of the first event occurring multiplied by the probability of the second event occurring. Likewise, two random variables X and Y are independent if and only if p(X(s) = r₁, Y(s) = r₂) = p(X(s) = r₁)p(Y(s) = r₂), for all real numbers r₁ and r₂.

Independence is of great practical importance: it significantly simplifies working out complex probabilities. Where a number of events are independent, we can quickly calculate their joint probability distribution from their individual probabilities.

**Example 9.** Imagine we are examining the results of the (ordered) tosses of three coins. Given that the possible results of each coin are {H, T}, the sample space for our model will be {H, T}³. Let us define three random variables, X₁, X₂, X₃. X₁ maps outcomes to 1 if the first coin lands heads, and 0 otherwise. X₂ and X₃ do likewise for the second and third coins. Now assume we are given the following information:

1. p(X₁ = 1) = p(X₂ = 1) = p(X₃ = 1) = 0.1

If we also know that these random variables are independent, then we can immediately calculate the joint probability distribution for the three random variables from these three values alone (remembering that p(Ēₙ) = 1 − p(Eₙ)):

1. p(X₁ = 1, X₂ = 1, X₃ = 1) = 0.1 × 0.1 × 0.1 = 0.001
2. p(X₁ = 1, X₂ = 1, X₃ = 0) = 0.1 × 0.1 × 0.9 = 0.009
3. p(X₁ = 1, X₂ = 0, X₃ = 1) = 0.1 × 0.9 × 0.1 = 0.009
4. p(X₁ = 1, X₂ = 0, X₃ = 0) = 0.1 × 0.9 × 0.9 = 0.081
5. p(X₁ = 0, X₂ = 1, X₃ = 1) = 0.9 × 0.1 × 0.1 = 0.009
6. p(X₁ = 0, X₂ = 1, X₃ = 0) = 0.9 × 0.1 × 0.9 = 0.081
7. p(X₁ = 0, X₂ = 0, X₃ = 1) = 0.9 × 0.9 × 0.1 = 0.081
8. p(X₁ = 0, X₂ = 0, X₃ = 0) = 0.9 × 0.9 × 0.9 = 0.729

If we do not know that these random variables are independent, we require much more information. In fact, we will need the values for each of the entries in the joint probability distribution. Notice that:

- Our storage requirements have jumped from linear in the number of random variables to exponential. (Very bad.)
- Our computational complexity has fallen from linear in the number of random variables to constant. (Good, but we could obtain this in the earlier case as well, if we kept the probabilities after we calculated them.)

Typically, the probability distributions that are of interest to us are such that this exponential storage complexity renders them intractable. Some methods for dealing with this, such as the naive Bayes classifier, simply assume independence among the random variables they are modeling. But this can lead to significantly lower accuracy from the model.

### 1.7 Conditional Independence

Analogously to independence, we say that two events, E and F, are conditionally independent given a third, G, if and only if p(G) ≠ 0 and one of the following holds:

1. p(E | F ∩ G) = p(E | G), with p(E | G) ≠ 0 and p(F | G) ≠ 0; or
2. p(E | G) = 0 or p(F | G) = 0.

Equivalently: p(E ∩ F | G) = p(E | G)p(F | G).
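The joint table of Example 9 can be generated from the three marginals alone; a sketch (our code) of exactly that calculation:

```python
from itertools import product

p1 = {1: 0.1, 0: 0.9}   # p(X_i = 1) = 0.1 for each coin

# Under independence, the joint factorizes into the marginals.
joint = {(x1, x2, x3): p1[x1] * p1[x2] * p1[x3]
         for x1, x2, x3 in product((0, 1), repeat=3)}

assert abs(joint[(1, 1, 1)] - 0.001) < 1e-12
assert abs(joint[(0, 0, 0)] - 0.729) < 1e-12
assert abs(sum(joint.values()) - 1.0) < 1e-12   # a valid distribution
```

Three stored numbers suffice here; without independence, all eight entries would have to be stored — the linear-versus-exponential gap described above.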

**Example 10.** Say we have 13 objects. Each object is either black (B) or white (W), each has either a 1 or a 2 written on it, and each is either a square (□) or a diamond (◇). The objects are:

B1□, B1□, B2□, B2□, B2□, B2□, B1◇, B2◇, B2◇
W1□, W2□, W1◇, W2◇

If we are interested in the characteristics of a randomly drawn object and assume all objects have an equal chance of being drawn, then using the techniques we have already looked at, we can see that the event E₁, that a randomly selected object has a 1 written on it, is not independent of the event E□, that such an object is square. But they are conditionally independent given the event E_B that the object is black (and, in fact, also given the event that the object is white):

p(E₁) = 5/13
p(E₁ | E□) = 3/8
p(E₁ | E_B) = 3/9 = 1/3
p(E₁ | E□ ∩ E_B) = 2/6 = 1/3

There is little more to say about conditional independence at this point, but soon it will take center stage as a means of obtaining the accuracy of using the full joint distribution of the random variables we are modeling while avoiding the complexity issues that accompany it.

### 1.8 The Chain Rule

The chain rule for events says that given n events, E₁, E₂, ..., Eₙ, defined on the same sample space S:

p(E₁, E₂, ..., Eₙ) = p(Eₙ | Eₙ₋₁, Eₙ₋₂, ..., E₁) ... p(E₂ | E₁)p(E₁)

Applied to random variables, this gives us that for n random variables X₁, X₂, ..., Xₙ, defined on the same sample space S:

p(X₁ = x₁, X₂ = x₂, ..., Xₙ = xₙ) = p(Xₙ = xₙ | Xₙ₋₁ = xₙ₋₁, Xₙ₋₂ = xₙ₋₂, ..., X₁ = x₁) ... p(X₂ = x₂ | X₁ = x₁)p(X₁ = x₁)

It is straightforward to prove this rule using the rule for conditional probability.
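The chain rule holds for any joint distribution, since each conditional in the product is a ratio of marginals and the product telescopes. A sketch checking the identity on an arbitrary joint over three binary variables (the numbers below are our own illustration, not from the text):

```python
# An arbitrary joint distribution over (x1, x2, x3); entries sum to 1.
joint = {(0,0,0): .02, (0,0,1): .08, (0,1,0): .10, (0,1,1): .20,
         (1,0,0): .05, (1,0,1): .15, (1,1,0): .12, (1,1,1): .28}

def marg(keep):
    """Marginal distribution over the variable indices in `keep`."""
    out = {}
    for xs, pr in joint.items():
        key = tuple(xs[i] for i in keep)
        out[key] = out.get(key, 0.0) + pr
    return out

p1  = marg([0])       # p(x1)
p12 = marg([0, 1])    # p(x1, x2)

for xs, pr in joint.items():
    # chain rule: p(x1,x2,x3) = p(x3 | x2,x1) p(x2 | x1) p(x1)
    chain = (pr / p12[xs[:2]]) * (p12[xs[:2]] / p1[xs[:1]]) * p1[xs[:1]]
    assert abs(chain - pr) < 1e-12
```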

### 1.9 Bayes' Theorem

Bayes' theorem is:

p(F | E) = p(E | F)p(F) / (p(E | F)p(F) + p(E | F̄)p(F̄))

Proof:

1. By the definition of conditional probability, p(F | E) = p(E ∩ F)/p(E) and p(E | F) = p(E ∩ F)/p(F).
2. Therefore, p(E ∩ F) = p(F | E)p(E) = p(E | F)p(F).
3. Therefore, p(F | E) = p(E | F)p(F)/p(E).
4. p(E) = p(E ∩ S) = p(E ∩ (F ∪ F̄)) = p((E ∩ F) ∪ (E ∩ F̄)).
5. (E ∩ F) and (E ∩ F̄) are disjoint (since F ∩ F̄ = ∅), so p(E) = p((E ∩ F) ∪ (E ∩ F̄)) = p(E | F)p(F) + p(E | F̄)p(F̄).
6. Therefore p(F | E) = p(E | F)p(F)/p(E) = p(E | F)p(F)/(p(E | F)p(F) + p(E | F̄)p(F̄)). (Bayes' theorem)

**Example 11.** Suppose 1 person in 100,000 has a particular rare disease. There exists a diagnostic test for this disease that is accurate 99% of the time when given to those who have the disease and 99.5% of the time when given to those who do not. Given this information, we can find the probability that someone who tests positive for the disease actually has the disease.

Let E be the event that someone tests positive for the disease and F be the event that a person has the disease. We want to find p(F | E). We know that p(F) = 0.00001 and so p(F̄) = 0.99999. We also know that p(E | F) = 99/100, so p(Ē | F) = 0.01. Likewise we know that p(Ē | F̄) = 0.995, so p(E | F̄) = 0.005. So by Bayes' theorem:

p(F | E) = p(E | F)p(F) / (p(E | F)p(F) + p(E | F̄)p(F̄)) = (0.99)(0.00001) / ((0.99)(0.00001) + (0.005)(0.99999)) ≈ 0.002

Notice that the result is not intuitively obvious. Most people, if told only the information we had available, assume that testing positive means a very high probability of having the disease.

## 2 Introduction to Bayesian Networks

### 2.1 Bayesian Networks

A Bayesian Network is a model of a system characterized by a number of random variables. It consists of:
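Example 11's surprising answer takes three lines to reproduce; a sketch (our code) with the numbers from the example:

```python
# Numbers from Example 11: prevalence 1 in 100,000; test accuracy 0.99 on
# the diseased and 0.995 on the healthy (so false-positive rate 0.005).
p_F      = 0.00001   # p(has disease)
p_E_F    = 0.99      # p(tests positive | disease)
p_E_notF = 0.005     # p(tests positive | no disease)

# Bayes' theorem, exactly as derived above.
posterior = (p_E_F * p_F) / (p_E_F * p_F + p_E_notF * (1 - p_F))
assert abs(posterior - 0.002) < 1e-4   # roughly 0.002, as in the text
```

The false positives among the vast healthy population swamp the true positives, which is why the posterior stays tiny despite the test's high accuracy.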

1. A directed acyclic graph (DAG), within which each random variable is represented by a node. The topology of this DAG must meet the Markov Condition: each node must be conditionally independent of its non-descendants given its parents.

2. A set of conditional probability distributions, one for each node, which give the probability of the random variable represented by the given node taking particular values given the values the random variables represented by the node's parents take.

Examine the DAG in Figure 1 and the information in Table 1. From the chain rule, we know that the joint probability distribution of the random variables is p(A, B, C, D, E) = p(E | D, C, B, A)p(D | C, B, A)p(C | B, A)p(B | A)p(A). But given the conditional independencies present in P, we know that:

p(C | B, A) = p(C | A)
p(D | C, B, A) = p(D | C, B)
p(E | D, C, B, A) = p(E | C)

So we know that p(A, B, C, D, E) = p(E | C)p(D | C, B)p(C | A)p(B | A)p(A). This may not seem a huge improvement, but it is. It means we can calculate the full joint distribution from the (normally much, much smaller) conditional probability tables associated with each node. As the networks get bigger, the advantages of such a method become crucial.

What we have done is pull the joint probability distribution apart by its conditional independencies. We now have a means of obtaining tractable calculations using the full joint distribution. It has been proven that every discrete probability distribution (and many continuous ones) can be represented by a Bayesian Network, and that every Bayesian Network represents some probability distribution. Of course, if there are no conditional independencies in the joint probability distribution, representing it with a Bayesian Network gains us nothing. But in practice, while independence relationships between random variables in a system we are interested in modeling are rare (and assumptions of such independence dangerous), conditional independencies are plentiful.

Some important points about Bayesian Networks:

- Bayesian Networks provide much more information than simple classifiers (like neural networks, support vector machines, etc.). Most importantly, when used to predict the value a random variable will take, they return a probability distribution rather than simply specifying which value is most probable. Obviously, there are many advantages to this.
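The factorization p(A, B, C, D, E) = p(E | C)p(D | C, B)p(C | A)p(B | A)p(A) can be exercised directly in code. A sketch with illustrative CPT numbers of our own (the text does not supply any), for binary variables with the structure implied above (A → B, A → C, B → D ← C, C → E):

```python
from itertools import product

# Illustrative CPTs, stored as p(node = 1 | parent values); ours, not the text's.
pA = {1: 0.6, 0: 0.4}
pB_A  = {(1, 1): 0.7, (1, 0): 0.2}                 # p(B=1 | A=a)
pC_A  = {(1, 1): 0.5, (1, 0): 0.9}                 # p(C=1 | A=a)
pD_BC = {(1, 1, 1): 0.9, (1, 1, 0): 0.6,           # p(D=1 | B=b, C=c)
         (1, 0, 1): 0.3, (1, 0, 0): 0.1}
pE_C  = {(1, 1): 0.8, (1, 0): 0.25}                # p(E=1 | C=c)

def bern(table, key, val):
    """Look up p(node = val | parents = key) for a binary node."""
    q = table[(1,) + key]
    return q if val == 1 else 1 - q

# p(a,b,c,d,e) = p(e|c) p(d|c,b) p(c|a) p(b|a) p(a)
def joint(a, b, c, d, e):
    return (bern(pE_C, (c,), e) * bern(pD_BC, (b, c), d) *
            bern(pC_A, (a,), c) * bern(pB_A, (a,), b) * pA[a])

total = sum(joint(*v) for v in product((0, 1), repeat=5))
assert abs(total - 1.0) < 1e-12   # the factorization yields a valid joint
```

Ten stored conditional probabilities reconstruct all 32 joint entries; for large networks this gap is what makes Bayesian Networks tractable.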

- Bayesian Networks have an easily understandable and informative physical interpretation (unlike neural networks, support vector machines, etc., which are effectively black boxes to all but experts). We will see one advantage of this in the next section.

- We can use Bayesian Networks simply to model the correlations and conditional independencies between the random variables of systems. But generally we are interested in inferring the probability distributions of a subset of the random variables of the network given knowledge of the values taken by another (possibly empty) subset.

- Bayesian Networks can also be extended to Influence Diagrams, with decision and utility nodes, in order to perform automated decision making.

### 2.2 D-Separation, the Markov Blanket and Markov Equivalence

The Markov Condition also entails other conditional independencies. Because of the Markov Condition, these conditional independencies have a graph-theoretic criterion called D-Separation (which we will not define, as it is difficult). Accordingly, when one set of random variables, Γ, is conditionally independent of another, Δ, given a third, Θ, then we will say that the nodes representing the random variables in Γ are D-Separated from Δ by Θ.

The most important case of D-Separation/conditional independence is this: a node is D-Separated from the rest of the graph given its parents, its children, and the other parents of its children. Because of this, the parents, children and other parents of a node's children are called the Markov Blanket of the node.

This is important. Imagine we have a node, α (which is associated with a random variable), whose probability distribution we wish to predict, and whose Markov Blanket is the set of nodes Γ. If we know the value of (the random variables associated with) every node in Γ, then there is no more information to be had regarding the value taken by (the random variable associated with) α. In this way, if we are confident that we can always establish the values of some of the random variables our network is modeling, we can often see that certain of the random variables are superfluous, and we need not continue to include them in the network nor collect information on them. Since, in practice, collecting data on random variables can be costly, this can be very helpful.

We will also say that two DAGs are Markov Equivalent if they have the same D-Separations.
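Computing a Markov blanket from a DAG is mechanical: union the parents, the children, and the children's other parents. A sketch (our code) on the five-node structure used in section 2.1 (A → B, A → C, B → D, C → D, C → E):

```python
# DAG given as parent sets per node.
parents = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}, "E": {"C"}}

def markov_blanket(node):
    children = {n for n, ps in parents.items() if node in ps}
    co_parents = set()
    for c in children:
        co_parents |= parents[c]        # other parents of each child
    co_parents -= {node}
    return parents[node] | children | co_parents

# C's blanket: parent A, children D and E, and B (D's other parent).
assert markov_blanket("C") == {"A", "B", "D", "E"}
assert markov_blanket("E") == {"C"}
```

Here C's blanket is the entire rest of the graph, while E's is just {C}: once C is known, no other variable carries information about E.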

*Figure 1: A DAG with five nodes (A, B, C, D, E).*

| Node | Conditional Independencies |
|------|----------------------------|
| A    | —                          |
| B    | C and E, given A           |
| C    | B, given A                 |
| D    | A and E, given B and C     |
| E    | A, B and D, given C        |

*Table 1: Conditional independencies required of the random variables in the DAG in Figure 1 for it to be a Bayesian Network.*

*Figure 2: The Markov Blanket of node L (in a 23-node DAG with nodes A–W).*

### 2.3 Potentials

Where V is a set of random variables {v₁, ..., vₙ}, let Γ_V be the Cartesian product of the co-domains of the random variables in V. So Γ_V consists of all the possible combinations of values that the random variables of V can take.

Let φ_V be a mapping V × Γ_V → R, such that φ_V(vᵢ, x) = the i-th term of x, where x ∈ Γ_V. I.e. φ_V gives us the value assigned to a particular member of V by a particular member of Γ_V.

If W ⊆ V, let ψ_W^V be a mapping Γ_V → Γ_W, such that φ_W(x, ψ_W^V(y)) = φ_V(x, y), for all x ∈ W, y ∈ Γ_V. So ψ_W^V gives us the member of Γ_W in which all the members of W are assigned the same values as in a particular member of Γ_V.

A potential is an ordered pair ⟨V, F⟩, where V is a set of random variables and F is a mapping Γ_V → R.

Given a set of potentials {⟨V₁, F₁⟩, ..., ⟨Vₙ, Fₙ⟩}, the multiplication of these potentials is itself a potential ⟨V_α, F_α⟩, where:

V_α = ∪ᵢ₌₁ⁿ Vᵢ

F_α(x) = ∏ᵢ₌₁ⁿ Fᵢ(ψ_{Vᵢ}^{V_α}(x))

This is simpler than it appears. We call the set of random variables in a potential the potential's scheme. The scheme of a product of a set of potentials is the union of the schemes of the factors. Likewise, the value assigned by the function in the product to a particular value combination of the random variables is the product of the values assigned by the functions of the factors to the same value combination (restricted to those random variables present in each factor).

**Example 12.** Take the multiplication of two potentials pot₁ = ⟨{X₁, X₂}, f⟩ and pot₂ = ⟨{X₁, X₃}, g⟩, where all random variables are binary:

| x₁ | x₂ | f(X₁ = x₁, X₂ = x₂) |
|----|----|---------------------|
| 0  | 0  | f(0, 0)             |
| 0  | 1  | f(0, 1)             |
| 1  | 0  | f(1, 0)             |
| 1  | 1  | f(1, 1)             |

*Table 2: pot₁*

| x₁ | x₃ | g(X₁ = x₁, X₃ = x₃) |
|----|----|---------------------|
| 0  | 0  | g(0, 0)             |
| 0  | 1  | g(0, 1)             |
| 1  | 0  | g(1, 0)             |
| 1  | 1  | g(1, 1)             |

*Table 3: pot₂*

Where pot₃ = pot₁ × pot₂, we have:

| x₁ | x₂ | x₃ | h(X₁ = x₁, X₂ = x₂, X₃ = x₃) |
|----|----|----|------------------------------|
| 0  | 0  | 0  | f(0, 0) · g(0, 0)            |
| 0  | 0  | 1  | f(0, 0) · g(0, 1)            |
| 0  | 1  | 0  | f(0, 1) · g(0, 0)            |
| 0  | 1  | 1  | f(0, 1) · g(0, 1)            |
| 1  | 0  | 0  | f(1, 0) · g(1, 0)            |
| 1  | 0  | 1  | f(1, 0) · g(1, 1)            |
| 1  | 1  | 0  | f(1, 1) · g(1, 0)            |
| 1  | 1  | 1  | f(1, 1) · g(1, 1)            |

*Table 4: pot₃*

Given a potential ⟨V, F⟩, the marginalization of some random variable v ∈ V out of this potential is itself a potential ⟨V_α, F_α⟩, where:

V_α = V \ {v}

F_α(x) = Σ_{y∈Γ_V : ψ_{V_α}^V(y) = x} F(y)

**Example 13.** If pot₄ is the result of marginalizing X₁ out of pot₁ from Example 12, then:

| x₂ | i(X₂ = x₂)        |
|----|-------------------|
| 0  | f(0, 0) + f(1, 0) |
| 1  | f(0, 1) + f(1, 1) |

*Table 5: pot₄*

Some points:

- Potentials are simply generalizations of probability distributions: the latter are necessarily the former, but not vice versa. In fact, a conditional probability table is a potential, not a distribution.
- Unlike distributions, potentials need not sum to 1.
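Potential multiplication and marginalization can be implemented in a few lines. A sketch (our own representation and numbers) mirroring Examples 12 and 13 for binary variables:

```python
from itertools import product

# A potential as (scheme, table): scheme is a tuple of variable names and
# table maps value tuples to reals. The numeric entries are our own.
def multiply(p1, p2):
    (s1, t1), (s2, t2) = p1, p2
    scheme = s1 + tuple(v for v in s2 if v not in s1)   # union of schemes
    def project(x, sub):
        # psi: restrict a value combination for `scheme` to `sub`
        return tuple(x[scheme.index(v)] for v in sub)
    table = {x: t1[project(x, s1)] * t2[project(x, s2)]
             for x in product((0, 1), repeat=len(scheme))}
    return scheme, table

def marginalize_out(p, var):
    scheme, table = p
    i = scheme.index(var)
    new_scheme = scheme[:i] + scheme[i + 1:]
    new_table = {}
    for x, val in table.items():                        # sum over `var`
        key = x[:i] + x[i + 1:]
        new_table[key] = new_table.get(key, 0.0) + val
    return new_scheme, new_table

pot1 = (("X1", "X2"), {(0, 0): .3, (0, 1): .7, (1, 0): .9, (1, 1): .1})
pot2 = (("X1", "X3"), {(0, 0): .2, (0, 1): .8, (1, 0): .5, (1, 1): .5})

pot3 = multiply(pot1, pot2)                   # scheme {X1, X2, X3}
pot4 = marginalize_out(pot1, "X1")            # scheme {X2}
```

Note that pot3's entries are products of the factors' entries, and pot4 need not sum to 1 — exactly the points made above.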

### 2.4 Exact Inference on a Bayesian Network: The Variable Elimination Algorithm

Let Γ be a subset of the random variables in our network. Let f be a function that assigns to each random variable v ∈ Γ a particular value, f(v), from those that v can take. To obtain the probability that the random variables in Γ take the values assigned to them by f:

1. Perform a topological sort on the DAG. This gives us an ordering where all nodes occur before their descendants. From the definition of a DAG, this is always possible.

2. For each node, n, construct a bucket, b_n. Also construct a null bucket, b_∅.

3. For each conditional probability distribution in the network:

   (a) Create a list of the random variables present in the conditional probability distribution.

   (b) For each random variable v ∈ Γ, eliminate all rows corresponding to values other than f(v), and eliminate v from the associated list.

   (c) Associate this list with the resulting potential, and place the potential in the bucket associated with the highest-ordered random variable remaining in the list. If there are no random variables remaining, place the potential in the null bucket.

4. Proceed in the given order through the buckets:

   (a) Create a new potential by multiplying all the potentials in the bucket. Associate with this potential a list of random variables that includes all random variables on the lists associated with the original potentials in the bucket.

   (b) In this potential, marginalize out (i.e. sum over) the random variable associated with the bucket. Remove that random variable from the associated list.

   (c) Place the resulting potential in the bucket associated with the highest-ordered random variable remaining in the list. If there are no random variables remaining, place the potential in the null bucket.

5. Multiply together the potentials in the null bucket (this is simply scalar multiplication). The result is the required probability.

To obtain the a posteriori probability that a subset of random variables, Γ, takes particular values given the observation that a second subset, Δ, has taken particular values, we run the algorithm twice: first on Γ ∪ Δ, then on Δ, and we divide the first result by the second.

Some points to note:
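Bucket bookkeeping aside, the core of variable elimination is just multiplying potentials and summing out variables. A minimal sketch on a two-node chain A → B, with illustrative numbers of our own:

```python
# Chain A -> B: p(A) and p(B | A); values are our own illustration.
pA = {0: 0.4, 1: 0.6}
pB_A = {(0, 0): 0.7, (1, 0): 0.3,    # (b, a): p(B=b | A=a)
        (0, 1): 0.1, (1, 1): 0.9}

# p(B = 1): multiply the two potentials, then eliminate (sum out) A.
pB1 = sum(pA[a] * pB_A[(1, a)] for a in (0, 1))
assert abs(pB1 - (0.4 * 0.3 + 0.6 * 0.9)) < 1e-12        # = 0.66

# A posteriori, per the two-run recipe: run on {A, B} (numerator),
# run on {B} (denominator), and divide.
post = (pA[1] * pB_A[(1, 1)]) / pB1                      # p(A=1 | B=1)
assert abs(post - 0.54 / 0.66) < 1e-12
```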

- The algorithm can be extended to obtain good estimates of error bars for our probability estimates, and wishing to do so is the main reason for using the algorithm.

- The complexity of the algorithm is dominated by the largest potential, which will be at least the size of the largest conditional probability table but which is, in practice, much, much smaller than the full joint distribution.

- When used to calculate a large number of probabilities (such as the a posteriori probability distributions for each unobserved random variable), the algorithm is relatively inefficient, since, if f is a function from the random variables in the network to the number of values each can take, it must be run f(v) − 1 times for each unobserved random variable v.

- The algorithm can be run on the smallest sub-graph containing (the nodes representing) the variables whose a posteriori probabilities we wish to find that is D-Separated from the remainder of the network by (nodes representing) random variables whose values we know.

### 2.5 Exact Inference on a Bayesian Network: The Junction Tree Algorithm

The Junction Tree algorithm is the workhorse of Bayesian Network inference algorithms, permitting efficient exact inference. It does not, though, permit the calculation of error bars for our probability estimates. Since the Junction Tree algorithm is a generalization of the Variable Elimination algorithm, there is hope that the extension to the latter that permits us to obtain such error bars can likewise be generalized so as to be utilized in the former. Whether, and if so how, this can be done is an open research question.

This algorithm utilizes a secondary structure formed from the Bayesian Network called a Junction Tree or Join Tree. We first show how to create this structure.

Some definitions:

- A cluster is a maximal complete sub-graph.
- The weight of a node is the number of values its associated random variable has.
- The weight of a cluster is the product of the weights of its constituent nodes.

The Create (an Optimal) Junction Tree algorithm:

1. Take a copy, G, of the DAG, join all unconnected parents of each node, and undirect all edges.

*Figure 3: A simple Bayesian Network with nine nodes (A–I).*

2. While there are still nodes left in G:

   (a) Select a node, n, from G, such that n causes the least number of edges to be added in step 2b, breaking ties by choosing the node which induces the cluster with the least weight.

   (b) Form a cluster, C, from the node and its neighbors, adding edges as required.

   (c) If C is not a sub-graph of a previously stored cluster, store C as a clique.

   (d) Remove n from G.

3. Create n trees, each consisting of a single stored clique. Also create a set, S, of candidate sepsets: one for each pair of cliques, consisting of the intersection of the pair. Repeat until n − 1 sepsets have been inserted into the forest:

   (a) Select from S the sepset, s, that has the largest number of variables in it, breaking ties by calculating the product of the numbers of values of the random variables in each sepset and choosing the sepset with the lowest product. Further ties can be broken arbitrarily.

   (b) Delete s from S.

   (c) Insert s between the cliques X and Y it was formed from, but only if X and Y are on different trees in the forest. (This merges their two trees into a larger tree, until you are left with a single tree: the Junction Tree.)

Before explaining how to perform inference using a Junction Tree, we require some definitions:

**Evidence Potentials**

*Figure 4: The Junction Tree constructed from Figure 3, with cliques {B, E}, {D, E, G}, {G, I}, {F, H}, {C, D, F}, {A, C, D} joined by sepsets {E}, {G}, {D}, {F}, {C, D}.*

| Variable | Value 1 | Value 2 | Value 3 | Notes                                                   |
|----------|---------|---------|---------|---------------------------------------------------------|
| A        | 1       | 1       | 1       | Nothing known                                           |
| B        | 1       | 0       | 0       | Observed to be value 1                                  |
| C        | 1       | 0       | 1       | Observed to not be value 2                              |
| D        | a₁      | a₂      | a₃      | Soft evidence, with actual probabilities                |
| E        | c·a₁    | c·a₂    | c·a₃    | Soft evidence, assigns same probabilities as D (c > 0)  |

*Table 6: Evidence potentials*

An evidence potential has a singleton set of random variables, and maps the random variable's values to real numbers. If working with hard evidence, it maps values which evidence has ruled out to 0, and all other values to 1 (where at least one value must be mapped to 1). Where all values are mapped to 1, nothing is known about the random variable. Where all values except one are mapped to 0, it is known that the random variable takes the specified value. If working with soft evidence, values can be mapped to any non-negative real number, but the sum of these must be non-zero. Such a potential assigns values the probabilities specified by its normalization.

**Message Pass**

We pass a message from one clique, c₁, to another, c₂, via the intervening sepset, s, by:

1. Saving the potential associated with s.
2. Assigning a new potential to s, obtained by marginalizing out of c₁'s potential all variables not in s.
3. Assigning a new potential to c₂, such that:

   pot(c₂)_new = pot(c₂)_old × (pot(s)_new / pot(s)_old)

**Collect Evidence**

When called on a clique, c, Collect Evidence does the following:

1. Marks c.
2. Calls Collect Evidence recursively on the unmarked neighbors of c, if any.
3. Passes a message from c to the clique that called Collect Evidence on it, if any.

**Disperse Evidence**

When called on a clique, c, Disperse Evidence does the following:

1. Marks c.
2. Passes a message to each of the unmarked neighbors of c, if any.
3. Calls Disperse Evidence recursively on the unmarked neighbors of c, if any.

To perform inference on a Junction Tree, we use the following algorithm:

1. Associate with each clique and sepset a potential, whose random variables are those of the clique/sepset, and which assigns the value 1 to every value combination of these random variables.

2. For each node:

   (a) Associate with the node an evidence potential representing current knowledge.

   (b) Find a clique containing the node and its parents (it is certain to exist) and multiply the node's conditional probability table into the clique's potential. (By "multiply in" is meant: multiply the node's conditional probability table and the clique's potential, and replace the clique's potential with the result.)

   (c) Multiply in the evidence potential associated with the node.

3. Pick an arbitrary root clique, and call Collect Evidence and then Disperse Evidence on this clique.

4. For each node you wish to obtain a posteriori probabilities for:

   (a) Select the smallest clique containing this node.

   (b) Create a copy of the potential associated with this clique.

   (c) Marginalize all other nodes out of the copy.

   (d) Normalize the resulting potential. This is the random variable's a posteriori probability distribution.

Some points to note:

- The complexity of the algorithm is dominated by the largest potential associated with a clique, which will be at least the size of, and probably much larger than, the largest conditional probability table. But it is, in practice, much smaller than the full joint distribution. When cliques are relatively small, the algorithm is comparatively efficient.

- There are also numerous techniques to improve efficiency available in the literature.

- A Junction Tree can be formed from the smallest sub-graph containing (the nodes representing) the variables whose a posteriori probabilities we wish to find that is D-Separated from the remainder of the network by (nodes representing) random variables whose values we know.

### 2.6 Inexact Inference on a Bayesian Network: Likelihood Sampling

If the network is sufficiently complex, exact inference algorithms will become intractable. In such cases we turn to likelihood sampling. Using this algorithm, given a set of random variables, E, whose values we know (or are assuming), we can estimate a posteriori probabilities for the other random variables, U, in the network:

1. Perform a topological sort on the DAG.

2. Set all random variables in E to the value they are known/assumed to take.

3. For each random variable in U, create a score card, with a number for each value the random variable can take. Initially set all numbers to zero.

4. Repeat:

   (a) In the order generated in step 1, for each node in U, randomly assign a value to its random variable by sampling from its conditional probability table, given the values already assigned to its parents.

   (b) Given the values assigned, calculate p(E = e) from the conditional probability tables of the random variables in E. I.e., where Par(v) is the set of random variables associated with the parents of the node associated with random variable v, par(v) are the values these parents have been assigned, and E = {E₁, ..., Eₙ}, calculate:

   p(E = e) = ∏_{Eₙ∈E} p(Eₙ = eₙ | Par(Eₙ) = par(Eₙ))

   (c) For each random variable in U, add p(E = e) to the score for the value it was assigned in this sample.

5. For each random variable in U, normalize its score card. This is an estimate of the random variable's a posteriori probability distribution.
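The five steps above can be sketched on the smallest possible example: a chain A → B with evidence on B. The network and its numbers are our own illustration:

```python
import random

random.seed(0)

# Tiny network A -> B, illustrative numbers (not from the text):
# p(A=1) = 0.3, p(B=1 | A=1) = 0.9, p(B=1 | A=0) = 0.2.  Evidence: B = 1.
pA1 = 0.3
pB1_A = {1: 0.9, 0: 0.2}

scores = {0: 0.0, 1: 0.0}                  # step 3: score card for A
for _ in range(20_000):                    # step 4
    a = 1 if random.random() < pA1 else 0  # (a) sample the unobserved A
    weight = pB1_A[a]                      # (b) p(E = e | parents)
    scores[a] += weight                    # (c) add the weight to A's score

est = scores[1] / (scores[0] + scores[1])  # step 5: normalize -> p(A=1 | B=1)

exact = (0.3 * 0.9) / (0.3 * 0.9 + 0.7 * 0.2)   # = 27/41, about 0.659
assert abs(est - exact) < 0.02
```

Weighting each sample by the likelihood of the evidence, rather than discarding samples that contradict it, is what keeps the method usable when the evidence is improbable.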


### ECE302 Spring 2006 HW1 Solutions January 16, 2006 1

ECE302 Spring 2006 HW1 Solutions January 16, 2006 1 Solutions to HW1 Note: These solutions were generated by R. D. Yates and D. J. Goodman, the authors of our textbook. I have added comments in italics

### Logic, Probability and Learning

Logic, Probability and Learning Luc De Raedt luc.deraedt@cs.kuleuven.be Overview Logic Learning Probabilistic Learning Probabilistic Logic Learning Closely following : Russell and Norvig, AI: a modern

### A crash course in probability and Naïve Bayes classification

Probability theory A crash course in probability and Naïve Bayes classification Chapter 9 Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s

### MAT 1000. Mathematics in Today's World

MAT 1000 Mathematics in Today's World We talked about Cryptography Last Time We will talk about probability. Today There are four rules that govern probabilities. One good way to analyze simple probabilities

### What Is Probability?

1 What Is Probability? The idea: Uncertainty can often be "quantified" i.e., we can talk about degrees of certainty or uncertainty. This is the idea of probability: a higher probability expresses a higher

### Introduction to probability theory in the Discrete Mathematics course

Introduction to probability theory in the Discrete Mathematics course Jiří Matoušek (KAM MFF UK) Version: Oct/18/2013 Introduction This detailed syllabus contains definitions, statements of the main results

### Approximation Algorithms

Approximation Algorithms or: How I Learned to Stop Worrying and Deal with NP-Completeness Ong Jit Sheng, Jonathan (A0073924B) March, 2012 Overview Key Results (I) General techniques: Greedy algorithms

### E3: PROBABILITY AND STATISTICS lecture notes

E3: PROBABILITY AND STATISTICS lecture notes 2 Contents 1 PROBABILITY THEORY 7 1.1 Experiments and random events............................ 7 1.2 Certain event. Impossible event............................

### Probability, Conditional Independence

Probability, Conditional Independence June 19, 2012 Probability, Conditional Independence Probability Sample space Ω of events Each event ω Ω has an associated measure Probability of the event P(ω) Axioms

### Worked examples Basic Concepts of Probability Theory

Worked examples Basic Concepts of Probability Theory Example 1 A regular tetrahedron is a body that has four faces and, if is tossed, the probability that it lands on any face is 1/4. Suppose that one

### Combinatorics: The Fine Art of Counting

Combinatorics: The Fine Art of Counting Week 7 Lecture Notes Discrete Probability Continued Note Binomial coefficients are written horizontally. The symbol ~ is used to mean approximately equal. The Bernoulli

### Discrete Structures for Computer Science

Discrete Structures for Computer Science Adam J. Lee adamlee@cs.pitt.edu 6111 Sennott Square Lecture #20: Bayes Theorem November 5, 2013 How can we incorporate prior knowledge? Sometimes we want to know

### ST 371 (IV): Discrete Random Variables

ST 371 (IV): Discrete Random Variables 1 Random Variables A random variable (rv) is a function that is defined on the sample space of the experiment and that assigns a numerical variable to each possible

### Markov random fields and Gibbs measures

Chapter Markov random fields and Gibbs measures 1. Conditional independence Suppose X i is a random element of (X i, B i ), for i = 1, 2, 3, with all X i defined on the same probability space (.F, P).

### The Joint Probability Distribution (JPD) of a set of n binary variables involve a huge number of parameters

DEFINING PROILISTI MODELS The Joint Probability Distribution (JPD) of a set of n binary variables involve a huge number of parameters 2 n (larger than 10 25 for only 100 variables). x y z p(x, y, z) 0

### Basic Probability Theory I

A Probability puzzler!! Basic Probability Theory I Dr. Tom Ilvento FREC 408 Our Strategy with Probability Generally, we want to get to an inference from a sample to a population. In this case the population

### Bayesian Tutorial (Sheet Updated 20 March)

Bayesian Tutorial (Sheet Updated 20 March) Practice Questions (for discussing in Class) Week starting 21 March 2016 1. What is the probability that the total of two dice will be greater than 8, given that

### Course: Model, Learning, and Inference: Lecture 5

Course: Model, Learning, and Inference: Lecture 5 Alan Yuille Department of Statistics, UCLA Los Angeles, CA 90095 yuille@stat.ucla.edu Abstract Probability distributions on structured representation.

### Lecture 11: Graphical Models for Inference

Lecture 11: Graphical Models for Inference So far we have seen two graphical models that are used for inference - the Bayesian network and the Join tree. These two both represent the same joint probability

### People have thought about, and defined, probability in different ways. important to note the consequences of the definition:

PROBABILITY AND LIKELIHOOD, A BRIEF INTRODUCTION IN SUPPORT OF A COURSE ON MOLECULAR EVOLUTION (BIOL 3046) Probability The subject of PROBABILITY is a branch of mathematics dedicated to building models

### Discrete Mathematics for CS Fall 2006 Papadimitriou & Vazirani Lecture 22

CS 70 Discrete Mathematics for CS Fall 2006 Papadimitriou & Vazirani Lecture 22 Introduction to Discrete Probability Probability theory has its origins in gambling analyzing card games, dice, roulette

### Data Modeling & Analysis Techniques. Probability & Statistics. Manfred Huber 2011 1

Data Modeling & Analysis Techniques Probability & Statistics Manfred Huber 2011 1 Probability and Statistics Probability and statistics are often used interchangeably but are different, related fields

### 1. The sample space S is the set of all possible outcomes. 2. An event is a set of one or more outcomes for an experiment. It is a sub set of S.

1 Probability Theory 1.1 Experiment, Outcomes, Sample Space Example 1 n psychologist examined the response of people standing in line at a copying machines. Student volunteers approached the person first

### Question 2 Naïve Bayes (16 points)

Question 2 Naïve Bayes (16 points) About 2/3 of your email is spam so you downloaded an open source spam filter based on word occurrences that uses the Naive Bayes classifier. Assume you collected the

### Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Models vs. Patterns Models A model is a high level, global description of a

### Chapter 3: The basic concepts of probability

Chapter 3: The basic concepts of probability Experiment: a measurement process that produces quantifiable results (e.g. throwing two dice, dealing cards, at poker, measuring heights of people, recording

### Bayesian Networks Chapter 14. Mausam (Slides by UW-AI faculty & David Page)

Bayesian Networks Chapter 14 Mausam (Slides by UW-AI faculty & David Page) Bayes Nets In general, joint distribution P over set of variables (X 1 x... x X n ) requires exponential space for representation

### Statistical Inference. Prof. Kate Calder. If the coin is fair (chance of heads = chance of tails) then

Probability Statistical Inference Question: How often would this method give the correct answer if I used it many times? Answer: Use laws of probability. 1 Example: Tossing a coin If the coin is fair (chance

### Distributed Computing over Communication Networks: Maximal Independent Set

Distributed Computing over Communication Networks: Maximal Independent Set What is a MIS? MIS An independent set (IS) of an undirected graph is a subset U of nodes such that no two nodes in U are adjacent.

### The Set Data Model CHAPTER 7. 7.1 What This Chapter Is About

CHAPTER 7 The Set Data Model The set is the most fundamental data model of mathematics. Every concept in mathematics, from trees to real numbers, is expressible as a special kind of set. In this book,

### The Union-Find Problem Kruskal s algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge,

The Union-Find Problem Kruskal s algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge, cheapest first, we had to determine whether its two endpoints

### Joint Probability Distributions and Random Samples (Devore Chapter Five)

Joint Probability Distributions and Random Samples (Devore Chapter Five) 1016-345-01 Probability and Statistics for Engineers Winter 2010-2011 Contents 1 Joint Probability Distributions 1 1.1 Two Discrete

### The sample space for a pair of die rolls is the set. The sample space for a random number between 0 and 1 is the interval [0, 1].

Probability Theory Probability Spaces and Events Consider a random experiment with several possible outcomes. For example, we might roll a pair of dice, flip a coin three times, or choose a random real

### Lecture Note 1 Set and Probability Theory. MIT 14.30 Spring 2006 Herman Bennett

Lecture Note 1 Set and Probability Theory MIT 14.30 Spring 2006 Herman Bennett 1 Set Theory 1.1 Definitions and Theorems 1. Experiment: any action or process whose outcome is subject to uncertainty. 2.

### Probabilities. Probability of a event. From Random Variables to Events. From Random Variables to Events. Probability Theory I

Victor Adamchi Danny Sleator Great Theoretical Ideas In Computer Science Probability Theory I CS 5-25 Spring 200 Lecture Feb. 6, 200 Carnegie Mellon University We will consider chance experiments with

### 1 Maximum likelihood estimation

COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N

### I. WHAT IS PROBABILITY?

C HAPTER 3 PROBABILITY Random Experiments I. WHAT IS PROBABILITY? The weatherman on 0 o clock news program states that there is a 20% chance that it will snow tomorrow, a 65% chance that it will rain and

### Bayesian networks - Time-series models - Apache Spark & Scala

Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly

### Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 13. Random Variables: Distribution and Expectation

CS 70 Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 3 Random Variables: Distribution and Expectation Random Variables Question: The homeworks of 20 students are collected

### CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) 3 4 4 7 5 9 6 16 7 8 8 4 9 8 10 4 Total 92.

Name: Email ID: CSE 326, Data Structures Section: Sample Final Exam Instructions: The exam is closed book, closed notes. Unless otherwise stated, N denotes the number of elements in the data structure

### Basics of Probability

Basics of Probability 1 Sample spaces, events and probabilities Begin with a set Ω the sample space e.g., 6 possible rolls of a die. ω Ω is a sample point/possible world/atomic event A probability space

### Lecture 1 Introduction Properties of Probability Methods of Enumeration Asrat Temesgen Stockholm University

Lecture 1 Introduction Properties of Probability Methods of Enumeration Asrat Temesgen Stockholm University 1 Chapter 1 Probability 1.1 Basic Concepts In the study of statistics, we consider experiments

### Artificial Intelligence Mar 27, Bayesian Networks 1 P (T D)P (D) + P (T D)P ( D) =

Artificial Intelligence 15-381 Mar 27, 2007 Bayesian Networks 1 Recap of last lecture Probability: precise representation of uncertainty Probability theory: optimal updating of knowledge based on new information

### Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 10

CS 70 Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 10 Introduction to Discrete Probability Probability theory has its origins in gambling analyzing card games, dice,

### Examples and Proofs of Inference in Junction Trees

Examples and Proofs of Inference in Junction Trees Peter Lucas LIAC, Leiden University February 4, 2016 1 Representation and notation Let P(V) be a joint probability distribution, where V stands for a

### Probability and Statistics

CHAPTER 2: RANDOM VARIABLES AND ASSOCIATED FUNCTIONS 2b - 0 Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be

### ! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. !-approximation algorithm.

Approximation Algorithms Chapter Approximation Algorithms Q Suppose I need to solve an NP-hard problem What should I do? A Theory says you're unlikely to find a poly-time algorithm Must sacrifice one of

### , each of which contains a unique key value, say k i , R 2. such that k i equals K (or to determine that no such record exists in the collection).

The Search Problem 1 Suppose we have a collection of records, say R 1, R 2,, R N, each of which contains a unique key value, say k i. Given a particular key value, K, the search problem is to locate the

### V. RANDOM VARIABLES, PROBABILITY DISTRIBUTIONS, EXPECTED VALUE

V. RANDOM VARIABLES, PROBABILITY DISTRIBUTIONS, EXPETED VALUE A game of chance featured at an amusement park is played as follows: You pay \$ to play. A penny and a nickel are flipped. You win \$ if either

### ! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. #-approximation algorithm.

Approximation Algorithms 11 Approximation Algorithms Q Suppose I need to solve an NP-hard problem What should I do? A Theory says you're unlikely to find a poly-time algorithm Must sacrifice one of three

### 1 if 1 x 0 1 if 0 x 1

Chapter 3 Continuity In this chapter we begin by defining the fundamental notion of continuity for real valued functions of a single real variable. When trying to decide whether a given function is or

### Mining Social-Network Graphs

342 Chapter 10 Mining Social-Network Graphs There is much information to be gained by analyzing the large-scale data that is derived from social networks. The best-known example of a social network is

### Basic concepts in probability. Sue Gordon

Mathematics Learning Centre Basic concepts in probability Sue Gordon c 2005 University of Sydney Mathematics Learning Centre, University of Sydney 1 1 Set Notation You may omit this section if you are

### L10: Probability, statistics, and estimation theory

L10: Probability, statistics, and estimation theory Review of probability theory Bayes theorem Statistics and the Normal distribution Least Squares Error estimation Maximum Likelihood estimation Bayesian

### Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

Summary of Formulas and Concepts Descriptive Statistics (Ch. 1-4) Definitions Population: The complete set of numerical information on a particular quantity in which an investigator is interested. We assume

### Definition and Calculus of Probability

In experiments with multivariate outcome variable, knowledge of the value of one variable may help predict another. For now, the word prediction will mean update the probabilities of events regarding the

### Probability, statistics and football Franka Miriam Bru ckler Paris, 2015.

Probability, statistics and football Franka Miriam Bru ckler Paris, 2015 Please read this before starting! Although each activity can be performed by one person only, it is suggested that you work in groups

### Bayesian Updating with Discrete Priors Class 11, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom

1 Learning Goals Bayesian Updating with Discrete Priors Class 11, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom 1. Be able to apply Bayes theorem to compute probabilities. 2. Be able to identify

### Lecture 3: Linear Programming Relaxations and Rounding

Lecture 3: Linear Programming Relaxations and Rounding 1 Approximation Algorithms and Linear Relaxations For the time being, suppose we have a minimization problem. Many times, the problem at hand can

### Examination 110 Probability and Statistics Examination

Examination 0 Probability and Statistics Examination Sample Examination Questions The Probability and Statistics Examination consists of 5 multiple-choice test questions. The test is a three-hour examination

### A Few Basics of Probability

A Few Basics of Probability Philosophy 57 Spring, 2004 1 Introduction This handout distinguishes between inductive and deductive logic, and then introduces probability, a concept essential to the study

### STAT 319 Probability and Statistics For Engineers PROBABILITY. Engineering College, Hail University, Saudi Arabia

STAT 319 robability and Statistics For Engineers LECTURE 03 ROAILITY Engineering College, Hail University, Saudi Arabia Overview robability is the study of random events. The probability, or chance, that

### 1 Introduction to Counting

1 Introduction to Counting 1.1 Introduction In this chapter you will learn the fundamentals of enumerative combinatorics, the branch of mathematics concerned with counting. While enumeration problems can

### Statistics 100A Homework 8 Solutions

Part : Chapter 7 Statistics A Homework 8 Solutions Ryan Rosario. A player throws a fair die and simultaneously flips a fair coin. If the coin lands heads, then she wins twice, and if tails, the one-half

### Ch. 13.3: More about Probability

Ch. 13.3: More about Probability Complementary Probabilities Given any event, E, of some sample space, U, of a random experiment, we can always talk about the complement, E, of that event: this is the

### Lesson 1. Basics of Probability. Principles of Mathematics 12: Explained! www.math12.com 314

Lesson 1 Basics of Probability www.math12.com 314 Sample Spaces: Probability Lesson 1 Part I: Basic Elements of Probability Consider the following situation: A six sided die is rolled The sample space

### CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a

### Linear Codes. Chapter 3. 3.1 Basics

Chapter 3 Linear Codes In order to define codes that we can encode and decode efficiently, we add more structure to the codespace. We shall be mainly interested in linear codes. A linear code of length

### Bayesian Networks. Read R&N Ch. 14.1-14.2. Next lecture: Read R&N 18.1-18.4

Bayesian Networks Read R&N Ch. 14.1-14.2 Next lecture: Read R&N 18.1-18.4 You will be expected to know Basic concepts and vocabulary of Bayesian networks. Nodes represent random variables. Directed arcs

### Chapter 7. Hierarchical cluster analysis. Contents 7-1

7-1 Chapter 7 Hierarchical cluster analysis In Part 2 (Chapters 4 to 6) we defined several different ways of measuring distance (or dissimilarity as the case may be) between the rows or between the columns

### (x + a) n = x n + a Z n [x]. Proof. If n is prime then the map

22. A quick primality test Prime numbers are one of the most basic objects in mathematics and one of the most basic questions is to decide which numbers are prime (a clearly related problem is to find

### Analysis of Algorithms I: Binary Search Trees

Analysis of Algorithms I: Binary Search Trees Xi Chen Columbia University Hash table: A data structure that maintains a subset of keys from a universe set U = {0, 1,..., p 1} and supports all three dictionary

### Basics of Statistical Machine Learning

CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

### Pattern Recognition. Probability Theory

Pattern Recognition Probability Theory Probability Space 2 is a three-tuple with: the set of elementary events algebra probability measure -algebra over is a system of subsets, i.e. ( is the power set)

### 4. Joint Distributions

Virtual Laboratories > 2. Distributions > 1 2 3 4 5 6 7 8 4. Joint Distributions Basic Theory As usual, we start with a random experiment with probability measure P on an underlying sample space. Suppose

### Random variables P(X = 3) = P(X = 3) = 1 8, P(X = 1) = P(X = 1) = 3 8.

Random variables Remark on Notations 1. When X is a number chosen uniformly from a data set, What I call P(X = k) is called Freq[k, X] in the courseware. 2. When X is a random variable, what I call F ()

### Gibbs Sampling and Online Learning Introduction

Statistical Techniques in Robotics (16-831, F14) Lecture#10(Tuesday, September 30) Gibbs Sampling and Online Learning Introduction Lecturer: Drew Bagnell Scribes: {Shichao Yang} 1 1 Sampling Samples are

### Machine Learning in Spam Filtering

Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov kt@ut.ee Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.

### Compression algorithm for Bayesian network modeling of binary systems

Compression algorithm for Bayesian network modeling of binary systems I. Tien & A. Der Kiureghian University of California, Berkeley ABSTRACT: A Bayesian network (BN) is a useful tool for analyzing the

### Application. Outline. 3-1 Polynomial Functions 3-2 Finding Rational Zeros of. Polynomial. 3-3 Approximating Real Zeros of.

Polynomial and Rational Functions Outline 3-1 Polynomial Functions 3-2 Finding Rational Zeros of Polynomials 3-3 Approximating Real Zeros of Polynomials 3-4 Rational Functions Chapter 3 Group Activity:

### m (t) = e nt m Y ( t) = e nt (pe t + q) n = (pe t e t + qe t ) n = (qe t + p) n

1. For a discrete random variable Y, prove that E[aY + b] = ae[y] + b and V(aY + b) = a 2 V(Y). Solution: E[aY + b] = E[aY] + E[b] = ae[y] + b where each step follows from a theorem on expected value from

### CMPSCI611: Approximating MAX-CUT Lecture 20

CMPSCI611: Approximating MAX-CUT Lecture 20 For the next two lectures we ll be seeing examples of approximation algorithms for interesting NP-hard problems. Today we consider MAX-CUT, which we proved to

### Classification algorithm in Data mining: An Overview

Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department

### 6 PROBABILITY GENERATING FUNCTIONS

6 PROBABILITY GENERATING FUNCTIONS Certain derivations presented in this course have been somewhat heavy on algebra. For example, determining the expectation of the Binomial distribution (page 5.1 turned

### 1. Nondeterministically guess a solution (called a certificate) 2. Check whether the solution solves the problem (called verification)

Some N P problems Computer scientists have studied many N P problems, that is, problems that can be solved nondeterministically in polynomial time. Traditionally complexity question are studied as languages: