Introduction to Discrete Probability Theory and Bayesian Networks


Dr Michael Ashcroft
September 15, 2011

This document remains the property of Inatas. Reproduction in whole or in part without the written permission of Inatas is strictly forbidden.
Contents

1 Introduction to Discrete Probability
  1.1 Discrete Probability Spaces
      Sample Spaces, Outcomes and Events
      Probability Functions
  1.2 The probabilities of events
  1.3 Random Variables
  1.4 Combinations of events
  1.5 Conditional Probability
  1.6 Independence
  1.7 Conditional Independence
  1.8 The Chain Rule
  1.9 Bayes Theorem
2 Introduction to Bayesian Networks
  2.1 Bayesian Networks
  2.2 D-Separation, The Markov Blanket and Markov Equivalence
  2.3 Potentials
  2.4 Exact Inference on a Bayesian Network: The Variable Elimination Algorithm
  2.5 Exact Inference on a Bayesian Network: The Junction Tree Algorithm
  2.6 Inexact Inference on a Bayesian Network: Likelihood Sampling
1 Introduction to Discrete Probability

1.1 Discrete Probability Spaces

A discrete probability space is a pair ⟨S, P⟩, where S is a Sample Space and P is a probability function.

Sample Spaces, Outcomes and Events

An outcome is a value that the stochastic system we are modeling can take. The sample space of our model is the set of all outcomes. An event is a subset of the sample space (so events are also sets of outcomes).

Example 1. The sample space corresponding to rolling a die is {1, 2, 3, 4, 5, 6}. The outcomes of this sample space are 1, 2, 3, 4, 5 and 6. The events of this sample space are the members of its power set: ∅, {1}, {2}, {3}, {4}, {5}, {6}, {1, 2}, {1, 3}, {1, 4}, {1, 5}, {1, 6}, {2, 3}, {2, 4}, {2, 5}, {2, 6}, {3, 4}, {3, 5}, {3, 6}, {4, 5}, {4, 6}, {5, 6}, {1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 2, 6}, {1, 3, 4}, {1, 3, 5}, {1, 3, 6}, {1, 4, 5}, {1, 4, 6}, {1, 5, 6}, {2, 3, 4}, {2, 3, 5}, {2, 3, 6}, {2, 4, 5}, {2, 4, 6}, {2, 5, 6}, {3, 4, 5}, {3, 4, 6}, {3, 5, 6}, {4, 5, 6}, {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6}, {1, 2, 4, 5}, {1, 2, 4, 6}, {1, 2, 5, 6}, {1, 3, 4, 5}, {1, 3, 4, 6}, {1, 3, 5, 6}, {1, 4, 5, 6}, {2, 3, 4, 5}, {2, 3, 4, 6}, {2, 3, 5, 6}, {2, 4, 5, 6}, {3, 4, 5, 6}, {1, 2, 3, 4, 5}, {1, 2, 3, 4, 6}, {1, 2, 3, 5, 6}, {1, 2, 4, 5, 6}, {1, 3, 4, 5, 6}, {2, 3, 4, 5, 6}, {1, 2, 3, 4, 5, 6}.

Probability Functions

A probability function is a function from S to the real numbers, such that:

(i) 0 ≤ p(s) ≤ 1, for each s ∈ S.
(ii) Σ_{s ∈ S} p(s) = 1.

Example 2. If a six-sided die is fair, the sample space associated with the result of a single throw is {1, 2, 3, 4, 5, 6} and the probability function is:

p(1) = 1/6
p(2) = 1/6
p(3) = 1/6
p(4) = 1/6
p(5) = 1/6
p(6) = 1/6

Example 3. Now imagine a die that is not fair: it has twice the probability of coming up six as it does of coming up any other number. The probability function for such a die would be:

p(1) = 1/7
p(2) = 1/7
p(3) = 1/7
p(4) = 1/7
p(5) = 1/7
p(6) = 2/7

1.2 The probabilities of events

The probability of an event E is defined:

p(E) = Σ_{o ∈ E} p(o)

Example 4. Continuing Example 2 (the fair die), the event E, that the roll of the die produces an odd number, is {1, 3, 5}. Therefore:

p(E) = Σ_{o ∈ E} p(o) = p(1) + p(3) + p(5) = 1/6 + 1/6 + 1/6 = 1/2

Example 5. Likewise, for Example 3 (the biased die), the event F, that the roll of the die produces 5 or 6, is {5, 6}. Therefore:

p(F) = Σ_{o ∈ F} p(o) = p(5) + p(6) = 1/7 + 2/7 = 3/7

Notice that an event represents the disjunctive claim that one of the outcomes that are its members occurred. So event E ({1, 3, 5}) represents the event that the roll of the die produced 1 or 3 or 5.

1.3 Random Variables

Note: a random variable is neither random nor a variable!

Often we are interested in numerical values that are connected with our outcomes. We use random variables to model these. A random variable is a function from a sample space to the real numbers.

Example 6. Suppose a coin is flipped three times. Let X(t) be the random variable that equals the number of heads that appear when t is the outcome. Then X(t) takes the following values:

X(HHH) = 3
X(HHT) = X(HTH) = X(THH) = 2
X(HTT) = X(THT) = X(TTH) = 1
X(TTT) = 0

Notice that a random variable divides the sample space into a disjoint and exhaustive set of events, each mapped to a unique real number r. Let us term this set of events E_r. The probability distribution of a random variable X on a sample space S is the set of ordered pairs ⟨r, p(X = r)⟩ for all r ∈ X(S), where p(X = r) = Σ_{o ∈ E_r} p(o). This is the probability that an outcome o occurred such that X(o) = r, and it is often characterized by saying that X took the value r. As we would expect:

(i) 0 ≤ p(E_r) ≤ 1, for each r ∈ X(S).
(ii) Σ_{r ∈ X(S)} p(E_r) = 1.

Placing this in table form gives us a familiar discrete probability distribution.

Points to note about random variables:

1. A random variable and its probability distribution together constitute a probability space.
2. A function from the codomain of a random variable to the real numbers is itself a random variable.
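The definitions above translate directly into code. The following minimal Python sketch (not part of the original notes; all names are illustrative) represents the fair die of Example 2 as a dictionary, computes the event probability of Example 4, and builds the distribution of the random variable from Example 6.

```python
from itertools import product
from fractions import Fraction
from collections import defaultdict

# Probability space for a fair die: outcome -> probability (Example 2).
die = {s: Fraction(1, 6) for s in range(1, 7)}

def event_probability(p, event):
    """p(E) = sum of p(o) over the outcomes o in the event E."""
    return sum(p[o] for o in event)

print(event_probability(die, {1, 3, 5}))  # 1/2, as in Example 4

# Random variable of Example 6: number of heads in three fair coin flips.
coin_space = {outcome: Fraction(1, 8) for outcome in product("HT", repeat=3)}
X = lambda outcome: outcome.count("H")

def distribution(p, rv):
    """Collect p(X = r) for every value r that the random variable takes."""
    dist = defaultdict(Fraction)
    for outcome, prob in p.items():
        dist[rv(outcome)] += prob
    return dict(dist)

print(distribution(coin_space, X))  # X=3: 1/8, X=2: 3/8, X=1: 3/8, X=0: 1/8
```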
1.4 Combinations of events

Some theorems:

1. p(Ē) = 1 − p(E)
2. p(E and F) = p(E, F) = p(E ∩ F)
3. p(E or F) = p(E ∪ F) = p(E) + p(F) − p(E ∩ F)

Theorem 1 should be obvious! Note, though, that combined with the definition of a probability distribution, it entails that p(∅) = 0. Let's look at an example for theorem 3.

Example 7. Returning to Example 4 with the fair die, let the event E be {1, 3, 5} (that the roll of the die produces an odd number), and the event F be {5, 6} (that the roll of the die produces 5 or 6). We want to calculate the probability that at least one of these events occurs, which is to say that the event E ∪ F occurs. Using theorem 3 we see:

p(E ∪ F) = p(E) + p(F) − p(E ∩ F)
         = (p(1) + p(3) + p(5)) + (p(5) + p(6)) − p(5)
         = p(1) + p(3) + p(5) + p(6)
         = 4/6

Which is as it should be!
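A quick numeric check of theorem 3 on the fair die, reusing the dictionary representation from the earlier snippet (illustrative code, not from the notes):

```python
from fractions import Fraction

die = {s: Fraction(1, 6) for s in range(1, 7)}
p = lambda event: sum(die[o] for o in event)

E, F = {1, 3, 5}, {5, 6}
# Theorem 3: p(E or F) = p(E) + p(F) - p(E and F)
assert p(E | F) == p(E) + p(F) - p(E & F) == Fraction(4, 6)
```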
1.5 Conditional Probability

The conditional probability of one event, E, given another, F, is denoted p(E | F), and is defined:

p(E | F) = p(E ∩ F) / p(F)

Example 8. Continuing Example 7, we can calculate the probability that the roll of the die produces 5 or 6 given that it produces an odd number:

p(F | E) = p(E ∩ F) / p(E)
         = p(5) / (p(1) + p(3) + p(5))
         = (1/6) / (3/6)
         = 1/3
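The same calculation as a short Python sketch (illustrative names, not part of the original notes):

```python
from fractions import Fraction

die = {s: Fraction(1, 6) for s in range(1, 7)}
p = lambda event: sum(die[o] for o in event)

def conditional(event, given):
    """p(E | F) = p(E and F) / p(F)."""
    return p(event & given) / p(given)

E, F = {1, 3, 5}, {5, 6}
print(conditional(F, E))  # 1/3, as in Example 8
```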
1.6 Independence

Two events, E and F, are independent if and only if p(E, F) = p(E)p(F). That is to say, the probability of both events occurring is simply the probability of the first event occurring multiplied by the probability that the second event occurs. Likewise, two random variables X and Y are independent if and only if p(X(s) = r1, Y(s) = r2) = p(X(s) = r1) p(Y(s) = r2), for all real numbers r1 and r2.

Independence is of great practical importance: it significantly simplifies working out complex probabilities. Where a number of events are independent, we can quickly calculate their joint probability distribution from their individual probabilities.

Example 9. Imagine we are examining the results of the (ordered) tosses of three coins. Given that the possible results of each coin are {H, T}, the sample space for our model will be {H, T}^3. Let us define three random variables, X1, X2, X3. X1 maps outcomes to 1 if the first coin lands heads, and 0 otherwise. X2 and X3 do likewise for the second and third coins. Now assume we are given the following information:

1. p(X1 = 1) = p(X2 = 1) = p(X3 = 1) = 0.1

If we also know that these random variables are independent, then we can immediately calculate the joint probability distribution for the three random variables from these three values alone (remembering that p(Xn = 0) = 1 − p(Xn = 1)):

1. p(X1 = 1, X2 = 1, X3 = 1) = 0.1 × 0.1 × 0.1 = 0.001
2. p(X1 = 1, X2 = 1, X3 = 0) = 0.1 × 0.1 × 0.9 = 0.009
3. p(X1 = 1, X2 = 0, X3 = 1) = 0.1 × 0.9 × 0.1 = 0.009
4. p(X1 = 1, X2 = 0, X3 = 0) = 0.1 × 0.9 × 0.9 = 0.081
5. p(X1 = 0, X2 = 1, X3 = 1) = 0.9 × 0.1 × 0.1 = 0.009
6. p(X1 = 0, X2 = 1, X3 = 0) = 0.9 × 0.1 × 0.9 = 0.081
7. p(X1 = 0, X2 = 0, X3 = 1) = 0.9 × 0.9 × 0.1 = 0.081
8. p(X1 = 0, X2 = 0, X3 = 0) = 0.9 × 0.9 × 0.9 = 0.729

If we do not know that these random variables are independent, we require much more information. In fact, we will need to have the values for each of the entries in the joint probability distribution. Notice that:

- our storage requirements have jumped from linear in the number of random variables to exponential. (Very bad.)
- our computational complexity has fallen from linear in the number of random variables to constant. (Good, but we could obtain this in the earlier case as well, if we kept the probabilities after we calculated them.)

Typically, the probability distributions that are of interest to us are such that this exponential storage complexity renders them intractable. Some methods for dealing with this, such as the naive Bayes classifier, simply assume independence among the random variables they are modeling. But this can lead to significantly lower accuracy from the model.
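A small sketch (illustrative code, not from the notes) that reproduces the eight joint probabilities of Example 9 from the three marginals, assuming independence:

```python
from itertools import product

p_one = {"X1": 0.1, "X2": 0.1, "X3": 0.1}  # p(Xi = 1) for each coin

# Under independence the joint factorizes into a product of marginals.
joint = {}
for values in product([1, 0], repeat=3):
    prob = 1.0
    for name, v in zip(p_one, values):
        prob *= p_one[name] if v == 1 else 1 - p_one[name]
    joint[values] = prob

for values, prob in joint.items():
    print(values, round(prob, 6))   # e.g. (1, 1, 1) 0.001 ... (0, 0, 0) 0.729
```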
1.7 Conditional Independence

Analogously to independence, we say that two events, E and F, are conditionally independent given another, G, if and only if p(G) ≠ 0 and one of the following holds:

1. p(E | F ∩ G) = p(E | G), with p(E | G) ≠ 0 and p(F | G) ≠ 0; or
2. p(E | G) = 0 or p(F | G) = 0.

Example 10. Say we have 13 objects. Each object is either black (B) or white (W), each object has either a 1 or a 2 written on it, and each object is either a square (□) or a diamond (◊). The objects are:

B1□, B1□, B2□, B2□, B2□, B2□, B1◊, B2◊, B2◊
W1□, W2□, W1◊, W2◊

If we are interested in the characteristics of a randomly drawn object and assume all objects have an equal chance of being drawn, then using the techniques we have already looked at, we can see that the event E1, that a randomly selected object has a 1 written on it, is not independent of the event E□, that such an object is a square. But they are conditionally independent given the event E_B that the object is black (and, in fact, also given the event that the object is white):

p(E1) = 5/13
p(E1 | E□) = 3/8
p(E1 | E_B) = 3/9 = 1/3
p(E1 | E□ ∩ E_B) = 2/6 = 1/3

There is little more to say about conditional independence at this point, but soon it will take center stage as a means of obtaining the accuracy of using the full joint distribution of the random variables we are modeling while avoiding the complexity issues that accompany this.
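The conditional independence in Example 10 can be checked by counting. The sketch below (illustrative, not from the notes) encodes each object as a (colour, number, shape) triple matching the listing above, where the shape assignment is the reconstruction consistent with the stated probabilities, and verifies the four values:

```python
from fractions import Fraction

# (colour, number, shape) for the 13 objects of Example 10.
objects = [
    ("B", 1, "sq"), ("B", 1, "sq"), ("B", 2, "sq"), ("B", 2, "sq"),
    ("B", 2, "sq"), ("B", 2, "sq"), ("B", 1, "di"), ("B", 2, "di"),
    ("B", 2, "di"),
    ("W", 1, "sq"), ("W", 2, "sq"), ("W", 1, "di"), ("W", 2, "di"),
]

def p(event, given=lambda o: True):
    """Probability of an event under uniform draws, optionally conditioned."""
    pool = [o for o in objects if given(o)]
    return Fraction(sum(1 for o in pool if event(o)), len(pool))

one    = lambda o: o[1] == 1
square = lambda o: o[2] == "sq"
black  = lambda o: o[0] == "B"

print(p(one))                                          # 5/13
print(p(one, given=square))                            # 3/8
print(p(one, given=black))                             # 1/3
print(p(one, given=lambda o: square(o) and black(o)))  # 1/3
```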
1.8 The Chain Rule

The chain rule for events says that given n events, E_1, E_2, ..., E_n, defined on the same sample space S:

p(E_1, E_2, ..., E_n) = p(E_n | E_{n−1}, E_{n−2}, ..., E_1) ... p(E_2 | E_1) p(E_1)

Applied to random variables, this gives us that for n random variables, X_1, X_2, ..., X_n, defined on the same sample space S:

p(X_1 = x_1, X_2 = x_2, ..., X_n = x_n) = p(X_n = x_n | X_{n−1} = x_{n−1}, X_{n−2} = x_{n−2}, ..., X_1 = x_1) ... p(X_2 = x_2 | X_1 = x_1) p(X_1 = x_1)

It is straightforward to prove this rule using the rule for conditional probability.

1.9 Bayes Theorem

Bayes theorem is:

p(F | E) = p(E | F)p(F) / (p(E | F)p(F) + p(E | F̄)p(F̄))

Proof:

1. By the definition of conditional probability, p(F | E) = p(E ∩ F)/p(E) and p(E | F) = p(E ∩ F)/p(F).
2. Therefore, p(E ∩ F) = p(F | E)p(E) = p(E | F)p(F).
3. Therefore, p(F | E) = p(E | F)p(F) / p(E).
4. p(E) = p(E ∩ S) = p(E ∩ (F ∪ F̄)) = p((E ∩ F) ∪ (E ∩ F̄)).
5. (E ∩ F) and (E ∩ F̄) are disjoint (otherwise some x would belong to F ∩ F̄ = ∅), so p(E) = p((E ∩ F) ∪ (E ∩ F̄)) = p(E ∩ F) + p(E ∩ F̄) = p(E | F)p(F) + p(E | F̄)p(F̄).
6. Therefore p(F | E) = p(E | F)p(F)/p(E) = p(E | F)p(F) / (p(E | F)p(F) + p(E | F̄)p(F̄)). (Bayes theorem)

Example 11. Suppose 1 person in 100,000 has a particular rare disease. There exists a diagnostic test for this disease that is accurate 99% of the time when given to those who have the disease and 99.5% of the time when given to those who do not. Given this information, we can find the probability that someone who tests positive for the disease actually has the disease.

Let E be the event that someone tests positive for the disease and F be the event that a person has the disease. We want to find p(F | E). We know that p(F) = 0.00001 and so p(F̄) = 0.99999. We also know that p(E | F) = 99/100 = 0.99, so p(Ē | F) = 0.01. Likewise we know that p(Ē | F̄) = 0.995, so p(E | F̄) = 0.005. So by Bayes theorem:

p(F | E) = p(E | F)p(F) / (p(E | F)p(F) + p(E | F̄)p(F̄))
         = (0.99)(0.00001) / ((0.99)(0.00001) + (0.005)(0.99999))
         ≈ 0.002

Notice that the result was not intuitively obvious. Most people, if told only the information we had available, assume that testing positive means a very high probability of having the disease.
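Example 11 as a short computation (a sketch; the variable names are illustrative):

```python
p_disease = 0.00001           # p(F): 1 person in 100,000
p_pos_given_disease = 0.99    # p(E | F): accuracy on those with the disease
p_pos_given_healthy = 0.005   # p(E | not F): false positive rate

# Bayes theorem: p(F | E) = p(E|F)p(F) / (p(E|F)p(F) + p(E|~F)p(~F))
numerator = p_pos_given_disease * p_disease
denominator = numerator + p_pos_given_healthy * (1 - p_disease)
print(numerator / denominator)  # about 0.002
```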
2 Introduction to Bayesian Networks

2.1 Bayesian Networks

A Bayesian Network is a model of a system, which in turn consists of a number of random variables. It consists of:

1. A directed acyclic graph (DAG), within which each random variable is represented by a node. The topology of this DAG must meet the Markov Condition: each node must be conditionally independent of its non-descendants given its parents.

2. A set of conditional probability distributions, one for each node, which give the probability of the random variable represented by the given node taking particular values given the values the random variables represented by the node's parents take.

Examine the DAG in Figure 1 and the information in Table 1. From the chain rule, we know that the joint probability distribution of the random variables is

p(A, B, C, D, E) = p(E | D, C, B, A) p(D | C, B, A) p(C | B, A) p(B | A) p(A).

But given the conditional independencies present in P, we know that:

p(C | B, A) = p(C | A)
p(D | C, B, A) = p(D | C, B)
p(E | D, C, B, A) = p(E | C)

So we know that p(A, B, C, D, E) = p(E | C) p(D | C, B) p(C | A) p(B | A) p(A).

This may not seem a huge improvement, but it is. It means we can calculate the full joint distribution from the (normally much, much smaller) conditional probability tables associated with each node. As the networks get bigger, the advantages of such a method become crucial.

What we have done is pull the joint probability distribution apart by its conditional independencies. We now have a means of obtaining tractable calculations using the full joint distribution.

It has been proven that every discrete probability distribution (and many continuous ones) can be represented by a Bayesian Network, and that every Bayesian Network represents some probability distribution. Of course, if there are no conditional independencies in the joint probability distribution, representing it with a Bayesian Network gains us nothing. But in practice, while independence relationships between random variables in a system we are interested in modeling are rare (and assumptions regarding such independence dangerous), conditional independencies are plentiful.

Some important points about Bayesian Networks:

- Bayesian Networks provide much more information than simple classifiers (like neural networks, or support vector machines, etc). Most importantly, when used to predict the value a random variable will take, they return a probability distribution rather than simply specifying which value is most probable. Obviously, there are many advantages to this.
- Bayesian Networks have an easily understandable and informative physical interpretation (unlike neural networks, or support vector machines, etc, which are effectively black boxes to all but experts). We will see one advantage of this in the next section.
- We can use Bayesian Networks simply to model the correlations and conditional independencies between the random variables of systems. But generally we are interested in inferring the probability distributions of a subset of the random variables of the network given knowledge of the values taken by another (possibly empty) subset.
- Bayesian Networks can also be extended to Influence Diagrams, with decision and utility nodes, in order to perform automated decision making.

2.2 D-Separation, The Markov Blanket and Markov Equivalence

The Markov Condition also entails other conditional independencies. Because of the Markov Condition, these conditional independencies have a graph-theoretic criterion called D-Separation (which we will not define, as it is difficult). Accordingly, when one set of random variables, Γ, is conditionally independent of another, Δ, given a third, Θ, then we will say that the nodes representing the random variables in Γ are D-Separated from Δ by Θ.

The most important case of D-Separation/conditional independence is: a node is D-Separated from the rest of the graph given its parents, its children, and the other parents of its children. Because of this, the parents, children and other parents of a node's children are called the Markov Blanket of the node.

This is important. Imagine we have a node, α, (which is associated with a random variable) whose probability distribution we wish to predict and whose Markov Blanket is the set of nodes, Γ. If we know the values of (the random variables associated with) every node in Γ, then no further information is relevant to the value taken by (the random variable associated with) α. In this way, if we are confident that we can always establish the values of some of the random variables our network is modeling, we can often see that certain of the random variables are superfluous, and we need not continue to include them in the network nor collect information on them. Since, in practice, collecting data on random variables can be costly, this can be very helpful.

We will also say that two DAGs are Markov Equivalent if they have the same D-Separations.
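As a small illustration (a hypothetical helper, not from the notes), the Markov Blanket can be read directly off a map from each node to its parents:

```python
def markov_blanket(node, parents):
    """Markov Blanket: the node's parents, its children, and its children's other parents."""
    children = {c for c, ps in parents.items() if node in ps}
    blanket = set(parents[node]) | children
    for child in children:
        blanket |= set(parents[child])
    blanket.discard(node)
    return blanket

# Parent sets for the five-node DAG of Figure 1: B <- A, C <- A, D <- {B, C}, E <- C.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["C"]}
print(markov_blanket("C", parents))  # {'A', 'B', 'D', 'E'}
```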
[Figure 1: A DAG with five nodes, A, B, C, D and E.]

Node | Conditional Independencies
A    | (none)
B    | C and E, given A
C    | B, given A
D    | A and E, given B and C
E    | A, B and D, given C

Table 1: Conditional independencies required of the random variables in the DAG of Figure 1 for it to be a Bayesian Network

[Figure 2: The Markov Blanket of Node L, in a larger DAG with nodes A through W.]
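To make the factorization concrete, here is a small sketch (not from the notes; the CPT numbers are invented purely for illustration, since the original does not give any) that computes p(A, B, C, D, E) for the structure of Figure 1 as p(E | C) p(D | C, B) p(C | A) p(B | A) p(A), treating every variable as binary:

```python
from itertools import product

# Hypothetical CPTs for a binary version of the Figure 1 network.
pA1 = 0.4                                   # p(A = 1)
pB1 = {0: 0.3, 1: 0.8}                      # p(B = 1 | A = a)
pC1 = {0: 0.5, 1: 0.1}                      # p(C = 1 | A = a)
pD1 = {(0, 0): 0.2, (0, 1): 0.6,            # p(D = 1 | B = b, C = c)
       (1, 0): 0.7, (1, 1): 0.9}
pE1 = {0: 0.4, 1: 0.95}                     # p(E = 1 | C = c)

def bern(p1, value):
    """Probability of a binary variable taking `value`, given p(value = 1)."""
    return p1 if value == 1 else 1 - p1

def joint(a, b, c, d, e):
    """p(A,B,C,D,E) = p(E|C) p(D|C,B) p(C|A) p(B|A) p(A)."""
    return (bern(pE1[c], e) * bern(pD1[(b, c)], d) *
            bern(pC1[a], c) * bern(pB1[a], b) * bern(pA1, a))

# Sanity check: the factorized joint sums to 1 over all 32 assignments.
print(sum(joint(*v) for v in product([0, 1], repeat=5)))  # 1.0 (up to float rounding)
```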
2.3 Potentials

Where V is a set of random variables {v_1, ..., v_n}, let Γ_V be the Cartesian product of the codomains of the random variables in V. So Γ_V consists of all the possible combinations of values that the random variables of V can take.

Let φ_V be a mapping V × Γ_V → R, such that φ_V(v_i, x) = the ith term of x, where x ∈ Γ_V. I.e. φ_V gives us the value assigned to a particular member of V by a particular member of Γ_V.

If W ⊆ V, let ψ_W^V be a mapping Γ_V → Γ_W, such that φ_W(x, ψ_W^V(y)) = φ_V(x, y), for all x ∈ W, y ∈ Γ_V. So ψ_W^V gives us the member of Γ_W in which all the members of W are assigned the same values as in a particular member of Γ_V.

A potential is an ordered pair ⟨V, F⟩, where V is a set of random variables, and F is a mapping Γ_V → R.

Given a set of potentials, {⟨V_1, F_1⟩, ..., ⟨V_n, F_n⟩}, the multiplication of these potentials is itself a potential, ⟨V_α, F_α⟩, where:

V_α = ∪_{i=1}^n V_i
F_α(x) = ∏_{i=1}^n F_i(ψ_{V_i}^{V_α}(x))

This is simpler than it appears. We call the set of random variables in a potential the potential's scheme. The scheme of a product of a set of potentials is the union of the schemes of the factors. Likewise, the value assigned by the function in the product to a particular value combination of the random variables is the product of the values assigned by the functions of the factors to the same value combination (restricted to those random variables present in each factor).

Example 12. Take the multiplication of two potentials pot_1 = ⟨{X1, X2}, f⟩ and pot_2 = ⟨{X1, X3}, g⟩, where all random variables are binary.

[Table 2: pot_1 — the value f(X1 = x1, X2 = x2) for each combination of x1 and x2.]

[Table 3: pot_2 — the value g(X1 = x1, X3 = x3) for each combination of x1 and x3.]

Where pot_3 = pot_1 × pot_2, we have:

[Table 4: pot_3 — the value h(X1 = x1, X2 = x2, X3 = x3) = f(x1, x2) g(x1, x3) for each combination of x1, x2 and x3.]

Given a potential ⟨V, F⟩, the marginalization of some random variable v ∈ V out of this potential is itself a potential, ⟨V_α, F_α⟩, where:

V_α = V \ {v}
F_α(x) = Σ_{y ∈ Γ_V : ψ_{V_α}^V(y) = x} F(y)

Example 13. If pot_4 is the result of marginalizing X1 out of pot_1 from Example 12, then:

[Table 5: pot_4 — the value i(X2 = x2) = Σ_{x1} f(x1, x2) for each value x2.]

Some points:

- Potentials are simply generalizations of probability distributions: the latter are necessarily the former, but not vice versa. In fact, a conditional probability table is a potential, not a distribution.
- Unlike distributions, potentials need not sum to 1.
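The following Python sketch (illustrative, not from the notes) implements potential multiplication and marginalization as described above, representing a potential as a scheme plus a table mapping value combinations to numbers. The example values for pot_1 and pot_2 are invented, since the original tables did not survive transcription.

```python
from itertools import product

class Potential:
    """A potential: a scheme (tuple of variable names) and a table mapping
    each combination of binary values to a real number."""
    def __init__(self, scheme, table):
        self.scheme, self.table = tuple(scheme), dict(table)

    def __mul__(self, other):
        scheme = self.scheme + tuple(v for v in other.scheme if v not in self.scheme)
        table = {}
        for values in product([0, 1], repeat=len(scheme)):   # binary variables only
            assign = dict(zip(scheme, values))
            proj = lambda pot: tuple(assign[v] for v in pot.scheme)
            table[values] = self.table[proj(self)] * other.table[proj(other)]
        return Potential(scheme, table)

    def marginalize(self, var):
        scheme = tuple(v for v in self.scheme if v != var)
        table = {}
        for values, val in self.table.items():
            key = tuple(v for name, v in zip(self.scheme, values) if name != var)
            table[key] = table.get(key, 0) + val
        return Potential(scheme, table)

# Invented values for pot_1 (over X1, X2) and pot_2 (over X1, X3).
pot1 = Potential(("X1", "X2"), {(0, 0): 0.2, (0, 1): 0.8, (1, 0): 0.5, (1, 1): 0.5})
pot2 = Potential(("X1", "X3"), {(0, 0): 0.3, (0, 1): 0.7, (1, 0): 0.6, (1, 1): 0.4})

pot3 = pot1 * pot2                      # scheme (X1, X2, X3), as in Example 12
pot4 = pot1.marginalize("X1")           # scheme (X2,), as in Example 13
print(pot3.table[(0, 1, 1)])            # 0.8 * 0.7 = 0.56
print(pot4.table)                       # {(0,): 0.7, (1,): 1.3}
```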
2.4 Exact Inference on a Bayesian Network: The Variable Elimination Algorithm

Let Γ be a subset of the random variables in our network. Let f be a function that assigns to each random variable v ∈ Γ a particular value, f(v), from those that v can take. To obtain the probability that the random variables in Γ take the values assigned to them by f:

1. Perform a topological sort on the DAG. This gives us an ordering where all nodes occur before their descendants. From the definition of a DAG, this is always possible.

2. For each node, n, construct a bucket, b_n. Also construct a null bucket, b_∅.

3. For each conditional probability distribution in the network:

   (a) Create a list of the random variables present in the conditional probability distribution.
   (b) For each random variable v ∈ Γ, eliminate all rows corresponding to values other than f(v), and eliminate v from the associated list.
   (c) Associate this list with the resulting potential and place this potential in the bucket associated with the highest-ordered random variable remaining in the associated list. If there are no random variables remaining, place the potential in the null bucket.

4. Proceed in the given order through the buckets:

   (a) Create a new potential by multiplying all potentials in the bucket. Associate with this potential a list of random variables that includes all random variables on the lists associated with the original potentials in the bucket.
   (b) In this potential, marginalize out (i.e. sum over) the random variable associated with the bucket. Remove the random variable associated with the bucket from the associated list.
   (c) Place the resulting potential in the bucket associated with the highest-ordered random variable remaining in the associated list. If there are no random variables remaining, place the potential in the null bucket.

5. Multiply together the potentials in the null bucket (this is simply scalar multiplication).

To obtain the a posteriori probability that a subset of random variables, Γ, takes particular values given the observation that a second subset, Δ, has taken particular values, we run the algorithm twice: first on Γ ∪ Δ (the query variables together with the observed ones), then on Δ alone, and we divide the first result by the second.
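A compact sketch of the underlying idea (multiply together the factors mentioning a variable, then sum that variable out) is given below. This is a simplified sum-product elimination, not the full bucket bookkeeping described above, and the three-node chain network and its CPT numbers are made up for illustration.

```python
from itertools import product

# Factors as (scheme, table) pairs for a hypothetical binary chain A -> B -> C.
factors = [
    (("A",),     {(0,): 0.7, (1,): 0.3}),                               # p(A)
    (("A", "B"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.4, (1, 1): 0.6}),  # p(B | A)
    (("B", "C"), {(0, 0): 0.8, (0, 1): 0.2, (1, 0): 0.5, (1, 1): 0.5}),  # p(C | B)
]

def multiply(f1, f2):
    s1, t1 = f1; s2, t2 = f2
    scheme = s1 + tuple(v for v in s2 if v not in s1)
    table = {}
    for values in product([0, 1], repeat=len(scheme)):
        a = dict(zip(scheme, values))
        table[values] = t1[tuple(a[v] for v in s1)] * t2[tuple(a[v] for v in s2)]
    return scheme, table

def sum_out(f, var):
    scheme, table = f
    new_scheme = tuple(v for v in scheme if v != var)
    out = {}
    for values, p in table.items():
        key = tuple(v for name, v in zip(scheme, values) if name != var)
        out[key] = out.get(key, 0) + p
    return new_scheme, out

def eliminate(factors, order):
    """Repeatedly multiply the factors mentioning a variable, then sum it out."""
    for var in order:
        touching = [f for f in factors if var in f[0]]
        rest = [f for f in factors if var not in f[0]]
        prod = touching[0]
        for f in touching[1:]:
            prod = multiply(prod, f)
        factors = rest + [sum_out(prod, var)]
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result

# Marginal p(C): eliminate A, then B.
print(eliminate(factors, ["A", "B"]))   # p(C=0) about 0.725, p(C=1) about 0.275
```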
Some points to note:

- The algorithm can be extended to obtain good estimates of error bars for our probability estimates, and wishing to do so is the main reason for using the algorithm.
- The complexity of the algorithm is dominated by the largest potential, which will be at least the size of the largest conditional probability table and which is, in practice, much, much smaller than the full joint distribution.
- When used to calculate a large number of probabilities (such as the a posteriori probability distributions for each unobserved random variable), the algorithm is relatively inefficient, since, if f is a function from the random variables in the network to the number of values each can take, it must be run f(v) − 1 times for each unobserved random variable, v.
- The algorithm can be run on the smallest subgraph containing (the nodes representing) the variables whose a posteriori probabilities we wish to find that is D-Separated from the remainder of the network by (nodes representing) random variables whose values we know.

2.5 Exact Inference on a Bayesian Network: The Junction Tree Algorithm

The Junction Tree algorithm is the workhorse of Bayesian Network inference algorithms, permitting efficient exact inference. It does not, though, permit the calculation of error bars for our probability estimates. Since the Junction Tree algorithm is a generalization of the Variable Elimination algorithm, there is hope that the extension to the latter that permits us to obtain such error bars can likewise be generalized so as to be utilized in the former. Whether, and if so how, this can be done is an open research question.

This algorithm utilizes a secondary structure formed from the Bayesian Network called a Junction Tree or Join Tree. We first show how to create this structure.

[Figure 3: A simple Bayesian Network with nodes A through I.]

Some definitions:

- A cluster is a maximally connected subgraph.
- The weight of a node is the number of values its associated random variable has.
- The weight of a cluster is the product of the weights of its constituent nodes.

The Create (an Optimal) Junction Tree Algorithm:

1. Take a copy, G, of the DAG, join all unconnected parents and undirect all edges.
2. While there are still nodes left in G:

   (a) Select a node, n, from G, such that n causes the least number of edges to be added in step 2b, breaking ties by choosing the node which induces the cluster with the least weight.
   (b) Form a cluster, C, from the node and its neighbors, adding edges as required.
   (c) If C is not a subgraph of a previously stored cluster, store C as a clique.
   (d) Remove n from G.

3. Create n trees, each consisting of a single stored clique. Also create a set, S, of candidate sepsets (a candidate sepset for a pair of stored cliques is their intersection). Repeat until n − 1 sepsets have been inserted into the forest:

   (a) Select from S the sepset, s, that has the largest number of variables in it, breaking ties by calculating the product of the number of values of the random variables in the sets, and choosing the set with the lowest. Further ties can be broken arbitrarily.
   (b) Delete s from S.
   (c) Insert s between the cliques X and Y whose intersection it is, but only if X and Y are on different trees in the forest. (This merges their two trees into a larger tree, until you are left with a single tree: the Junction Tree.)
[Figure 4: The Junction Tree constructed from Figure 3, with cliques {B, E}, {D, E, G}, {G, I}, {F, H}, {C, D, F} and {A, C, D}, joined by the sepsets {E}, {G}, {D}, {F} and {C, D}.]

Before explaining how to perform inference using a Junction Tree, we require some definitions.
Evidence Potentials

An evidence potential has a singleton set of random variables, and maps the random variable's values to real numbers. If working with hard evidence, it will map values which evidence has ruled out to 0, and all other values to 1 (where at least one value must be mapped to 1). Where all values are mapped to 1, nothing is known about the random variable. Where all values except one are mapped to 0, it is known that the random variable takes the remaining value. If working with soft evidence, values can be mapped to any non-negative real number, but the sum of these must be non-zero. Such a potential assigns the values probabilities as specified by its normalization.

Variable | Value 1 | Value 2 | Value 3 | Notes
A        | 1       | 1       | 1       | Nothing known
B        | 1       | 0       | 0       | Observed to be value 1
C        | 1       | 0       | 1       | Observed to not be value 2
D        |         |         |         | Soft evidence, with actual probabilities
E        |         |         |         | Soft evidence, assigns same probabilities as D

Table 6: Evidence potentials

Message Pass

We pass a message from one clique, c_1, to another, c_2, via the intervening sepset, s, by:

1. Save the potential associated with s.
2. Marginalize out of c_1's potential a new potential for s, containing only those variables in s.
3. Assign a new potential to c_2, such that:

   pot(c_2)_new = pot(c_2)_old × (pot(s)_new / pot(s)_old)

Collect Evidence

When called on a clique, c, Collect Evidence does the following:

1. Marks c.
2. Calls Collect Evidence recursively on the unmarked neighbors of c, if any.
3. Passes a message from c to the clique that called Collect Evidence, if any.

Disperse Evidence

When called on a clique, c, Disperse Evidence does the following:

1. Marks c.
2. Passes a message to each of the unmarked neighbors of c, if any.
3. Calls Disperse Evidence recursively on the unmarked neighbors of c, if any.
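A minimal sketch of a single message pass (illustrative only; potentials are the (scheme, table) pairs used in the earlier snippets over binary variables, and the element-wise division treats 0/0 as 0, a common convention that the notes do not spell out):

```python
from itertools import product

def multiply(f1, f2):
    """Pointwise product of two (scheme, table) potentials over binary variables."""
    s1, t1 = f1; s2, t2 = f2
    scheme = s1 + tuple(v for v in s2 if v not in s1)
    table = {}
    for values in product([0, 1], repeat=len(scheme)):
        a = dict(zip(scheme, values))
        table[values] = t1[tuple(a[v] for v in s1)] * t2[tuple(a[v] for v in s2)]
    return scheme, table

def marginalize_to(f, keep):
    """Sum a potential down onto the variables in `keep` (the sepset's variables)."""
    scheme, table = f
    new_scheme = tuple(v for v in scheme if v in keep)
    out = {}
    for values, p in table.items():
        key = tuple(v for name, v in zip(scheme, values) if name in keep)
        out[key] = out.get(key, 0) + p
    return new_scheme, out

def pass_message(c1, sepset_vars, old_sepset, c2):
    """pot(c2)_new = pot(c2)_old * pot(s)_new / pot(s)_old, with 0/0 taken as 0."""
    new_sepset = marginalize_to(c1, sepset_vars)
    s_scheme, s_new = new_sepset
    _, s_old = old_sepset
    ratio = {k: (0 if s_new[k] == 0 else s_new[k] / s_old[k]) for k in s_new}
    c2_new = multiply(c2, (s_scheme, ratio))
    return new_sepset, c2_new
```

In the full algorithm, Collect Evidence and Disperse Evidence would call a function like pass_message along every edge of the junction tree, inward to the chosen root and then outward again.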
To perform inference on a Junction Tree, we use the following algorithm:

1. Associate with each clique and sepset a potential whose random variables are those of the clique/sepset, and which associates the value 1 with every value combination of these random variables.

2. For each node:

   (a) Associate with the node an evidence potential representing current knowledge.
   (b) Find a clique containing the node and its parents (it is certain to exist) and multiply in the node's conditional probability table to the clique's potential. (By "multiply in" is meant: multiply the node's conditional probability table and the clique's potential, and replace the clique's potential with the result.)
   (c) Multiply in the evidence potential associated with the node.

3. Pick an arbitrary root clique, and call Collect Evidence and then Disperse Evidence on this clique.

4. For each node you wish to obtain a posteriori probabilities for:

   (a) Select the smallest clique containing this node.
   (b) Create a copy of the potential associated with this clique.
   (c) Marginalize all other nodes out of this copy.
   (d) Normalize the resulting potential. This is the random variable's a posteriori probability distribution.

Some points to note:

- The complexity of the algorithm is dominated by the largest potential associated with a clique, which will be at least the size of, and probably much larger than, the largest conditional probability table. But it is, in practice, much smaller than the full joint distribution.
- When cliques are relatively small, the algorithm is comparatively efficient. There are also numerous techniques to improve efficiency available in the literature.
- A Junction Tree can be formed from the smallest subgraph containing (the nodes representing) the variables whose a posteriori probabilities we wish to find that is D-Separated from the remainder of the network by (nodes representing) random variables whose values we know.

2.6 Inexact Inference on a Bayesian Network: Likelihood Sampling

If the network is sufficiently complex, exact inference algorithms will become intractable. In such cases we turn to likelihood sampling. Using this algorithm, given a set of random variables, E, whose values we know (or are assuming), we can estimate a posteriori probabilities for the other random variables, U, in the network:

1. Perform a topological sort on the DAG.
2. Set all random variables in E to the values they are known/assumed to take.

3. For each random variable in U, create a score card, with a number for each value the random variable can take. Initially set all numbers to zero.

4. Repeat:

   (a) In the order generated in step 1, for each node in U, randomly assign a value to its random variable using the variable's conditional probability table.
   (b) Given the values assigned, calculate p(E = e) from the conditional probability tables of the random variables in E. I.e., where Par(v) is the set of random variables associated with the parents of the node associated with random variable v, par(v) are the values these parents have been assigned, and E = {E_1, ..., E_n}, calculate:

       p(E = e) = ∏_{E_n ∈ E} p(E_n = e_n | Par(E_n) = par(E_n))

   (c) For each random variable in U, add p(E = e) to the score for the value it was assigned in this sample.

5. For each random variable in U, normalize its score card. This is an estimate of the random variable's a posteriori probability distribution.
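A self-contained sketch of the sampler (often called likelihood weighting) for the small hypothetical chain network A → B → C used earlier, estimating the a posteriori distributions of A and B given C = 1; the CPT numbers and names are invented for illustration:

```python
import random

# Hypothetical binary chain network A -> B -> C.
parents = {"A": (), "B": ("A",), "C": ("B",)}
cpt = {                      # p(node = 1 | parent values)
    "A": {(): 0.3},
    "B": {(0,): 0.1, (1,): 0.6},
    "C": {(0,): 0.2, (1,): 0.5},
}
order = ["A", "B", "C"]      # a topological order (step 1)
evidence = {"C": 1}          # E: the observed values (step 2)

def p_value(node, value, assignment):
    p1 = cpt[node][tuple(assignment[p] for p in parents[node])]
    return p1 if value == 1 else 1 - p1

scores = {n: {0: 0.0, 1: 0.0} for n in order if n not in evidence}   # step 3

random.seed(0)
for _ in range(100_000):                                             # step 4
    assignment = dict(evidence)
    for node in order:                                               # (a) sample U
        if node not in evidence:
            p1 = cpt[node][tuple(assignment[p] for p in parents[node])]
            assignment[node] = 1 if random.random() < p1 else 0
    weight = 1.0                                                     # (b) p(E = e)
    for node, value in evidence.items():
        weight *= p_value(node, value, assignment)
    for node in scores:                                              # (c) score
        scores[node][assignment[node]] += weight

for node, card in scores.items():                                    # step 5
    total = sum(card.values())
    print(node, {v: s / total for v, s in card.items()})  # estimated a posteriori distributions
```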