PROBABILITY AND LIKELIHOOD, A BRIEF INTRODUCTION IN SUPPORT OF A COURSE ON MOLECULAR EVOLUTION (BIOL 3046)

Probability

The subject of PROBABILITY is a branch of mathematics dedicated to building models to describe conditions of uncertainty and providing tools to make decisions or draw conclusions on the basis of such models. In the broad sense, a PROBABILITY is a measure of the degree to which an occurrence is certain [or uncertain].

A statistical definition of probability

People have thought about, and defined, probability in different ways. It is important to note the consequences of the definition:

1. All definitions agree on the algebraic and arithmetic procedures that must be followed; hence, the definition does not influence the outcome.
2. The definition has a fundamental impact on the meaning of the result!

We will consider the frequentist definition of probability, as it is currently the most widely held. To do this we need to define two concepts: (i) sample space, and (ii) relative frequency.

1. Sample space, S, is the collection [sometimes called the universe] of all possible outcomes. For a stochastic system, or an experiment, the sample space is a set where each outcome comprises one element of the set.
2. Relative frequency is the proportion of the sample space on which an event E occurs. In an experiment with 100 outcomes, where E occurs 81 times, the relative frequency is 81/100 = 0.81.

The frequentist approach is based on the notion of statistical regularity; i.e., in the long run, over replicates, the cumulative relative frequency of an event (E) stabilizes. The best way to illustrate this is with an example experiment that we run many times, measuring the cumulative relative frequency (crf). The crf is simply the relative frequency computed cumulatively over some number of replicates of samples, each with a space S.

Let's take a look at an example of statistical regularity. Suppose we have a treatment for high blood pressure.
The event, E, we are interested in is successfully controlling the blood pressure. So, we want to be able to make a prediction about the probability that a patient treated in the future will have
blood pressure under control, P(E). To estimate this probability we conduct an experiment that is replicated over time in months. The data are presented in the table below.

Month   Subjects (S)   Controlled (E)   Cumulative S   Cumulative E    crf
  1         100              80              100             80       0.800
  2         100              88              200            168       0.840
  3         100              75              300            243       0.810
  4         100              77              400            320       0.800
  5         100              80              500            400       0.800
  6         100              76              600            476       0.793
  7         100              82              700            558       0.797
  8         100              79              800            637       0.796
  9         100              80              900            717       0.797
 10         100              76             1000            793       0.793
 11         100              77             1100            870       0.791
 12         100              78             1200            948       0.790

[data for example is after McColl (1995)]

The crf values down the right-most column fluctuate the most in the beginning, but rapidly stabilize. Statistical regularity is the stabilization of the crf in the face of individual fluctuations from month to month in the relative frequency of E.

Finally, we are in a position where we can obtain a definition of probability. Here goes: in words, the probability of an event E, written as P(E), is the long-run (cumulative) relative frequency of E. More formally, we define P(E) as follows:

    P(E) = lim (n -> infinity) of crf_n(E)

We can get an idea of this by using an example with nearly infinite replications.

[Figure: hypothetical plot of the crf of an event over roughly 10,000 replicates; after early fluctuation the crf settles to a stable value.]
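The crf column above can be reproduced in a few lines of code. This is a sketch of mine (Python, not part of the original notes), using the monthly counts from the table:

```python
# Cumulative relative frequency (crf) for the blood-pressure example.
# Monthly "controlled" counts are taken from the table above (after McColl 1995).
controlled = [80, 88, 75, 77, 80, 76, 82, 79, 80, 76, 77, 78]

cum_s = 0  # cumulative number of subjects
cum_e = 0  # cumulative number of times event E (controlled) occurred
for month, e in enumerate(controlled, start=1):
    cum_s += 100          # 100 subjects enrolled each month
    cum_e += e
    crf = cum_e / cum_s
    print(f"month {month:2d}: crf = {crf:.3f}")
# The final line prints crf = 0.790, matching the table.
```

Running it shows exactly the behaviour the notes describe: the crf swings early (0.800, 0.840, 0.810, ...) and then flattens out near 0.79.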
Probability models

For all probability models to give consistent results about the outcomes of future events they need to obey four simple axioms (Kolmogorov 1933).

Probability axioms:

1. Probabilities are on a scale from 0 to 1. Hence, 0 <= P(E) <= 1.
2. Probabilities are derived from the relative frequency of an event (E) in the space of all possible outcomes (S), where P(S) = 1. Hence, if the probability of an event (E) is P(E), then the probability that E does not occur is 1 - P(E).
3. When events E and F are disjoint, they cannot occur together. The probability of disjoint events E or F is P(E or F) = P(E) + P(F).
4. Axiom 3 above deals with a finite sequence of events. Axiom 4 is an extension of axiom 3 to an infinite sequence of events.

For the purpose of modelling in molecular evolution, we need to assume these probability axioms and just one additional theorem, the multiplication theorem. I will not provide a detailed explanation of this theorem. However, a consequence of this theorem is what is sometimes referred to as the "product rule" or "multiplication rule"; see the box below for an explanation.

Product rule: The product rule applies when two events E1 and E2 are independent. E1 and E2 are independent if the occurrence or non-occurrence of E1 does not change the probability of E2 [and vice versa]. [A further statistical definition requires the use of the multiplication theorem.] It is important to note that a proof of statistical independence for a specific case by using the multiplication theorem is rarely possible; hence, most models incorporate independence as a model assumption. When E1 and E2 occur together they are joint events. The joint probability of the independent events E1 and E2 is P(E1, E2) = P(E1) x P(E2).

Typically, probability refers to the occurrence of some future event: for example, the probability that a tossed [fair] coin will be heads is 1/2.
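The product rule can be checked directly by enumerating a tiny sample space. This is a sketch of mine (not from the notes), using two fair coin tosses as the independent events:

```python
# Product rule for independent events, illustrated with two fair coin tosses.
# E1 = first toss is heads, E2 = second toss is heads (independent events).
p_e1 = 0.5
p_e2 = 0.5
p_joint = p_e1 * p_e2  # P(E1, E2) = P(E1) x P(E2) = 0.25

# Cross-check by enumerating the sample space S = {HH, HT, TH, TT}:
S = [(a, b) for a in "HT" for b in "HT"]
rel_freq = sum(1 for a, b in S if a == "H" and b == "H") / len(S)
print(p_joint, rel_freq)  # both are 0.25
```

The enumeration makes the point of the box concrete: multiplying the marginal probabilities gives the same answer as counting the joint outcome over the full sample space, precisely because the two tosses are independent.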
Hence the term "product rule" or "multiplication principle", or whatever you call it. (For example: what is the probability of getting 5H and 6T if the coin is fair? Questions of this kind are answered with the binomial model introduced below.)

Conditional probability is very useful as it allows us to express a probability given some further information; specifically, it is the probability of event E2 assuming that event E1 has already occurred. We assume the events E1 and E2 are in a given sample space, S, and that P(E1) > 0. We write the conditional probability as P(E2 | E1); the vertical bar is read as "given".
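A conditional probability can also be computed by counting. This is an illustrative sketch of mine (not from the notes): with two fair coin tosses, let E1 = "at least one head" and E2 = "both heads"; then P(E2 | E1) = P(E1 and E2) / P(E1) = (1/4)/(3/4) = 1/3.

```python
from fractions import Fraction

# P(E2 | E1) = P(E1 and E2) / P(E1), computed by counting outcomes in the
# sample space of two fair coin tosses.
S = [(a, b) for a in "HT" for b in "HT"]        # {HH, HT, TH, TT}
E1 = [o for o in S if "H" in o]                 # at least one head: 3 outcomes
E1_and_E2 = [o for o in E1 if o == ("H", "H")]  # both heads as well: 1 outcome

p_e1 = Fraction(len(E1), len(S))                # 3/4
p_e1_and_e2 = Fraction(len(E1_and_E2), len(S))  # 1/4
p_e2_given_e1 = p_e1_and_e2 / p_e1
print(p_e2_given_e1)  # 1/3
```

Note how the "given E1" part simply shrinks the sample space from four outcomes to the three containing at least one head.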
Let's look at an example of a probability model. The familiar binomial distribution provides the appropriate model for describing the probability of the outcomes of flipping a coin. The binomial model is as follows:

    P = C(n, k) p^k (1 - p)^(n - k),   where C(n, k) = n! / (k!(n - k)!)

If we had a fair coin we could predict the probability of specific outcomes (e.g., 1 head & 1 tail in two tosses) by setting the p parameter equal to 0.5. Note that the model does not require this. In the case of the coin toss, we are interested in a conditional probability; i.e., what is the probability of obtaining, say, 5 heads given a fair coin (p = 0.5) and 12 tosses, or P(k=5 | p=0.5, n=12).

Probability and likelihood are inverted

Probability refers to the occurrence of some future outcome. For example: If I toss a fair coin 12 times, what is the probability that I will obtain 5 heads and 7 tails?

Likelihood refers to a past event with a known outcome. For example: What is the probability that my coin is fair if I tossed it 12 times and observed 5 heads and 7 tails?

Let's continue to use the familiar coin-tossing experiment to examine this inversion. For a fair coin (p = 1/2):

    P = C(n, k) (1/2)^k (1/2)^(n - k),   where C(n, k) = n! / (k!(n - k)!)

Here n is the number of flips and k is the number of successes.
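The binomial model translates directly into code. This sketch of mine reproduces the conditional probability P(k=5 | p=0.5, n=12) used in the text and, anticipating axiom 2, checks that the probabilities of all 13 mutually exclusive outcomes (0 to 12 heads) sum to 1:

```python
from math import comb

def binomial_prob(k: int, n: int, p: float) -> float:
    """P(k successes in n trials, each with success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# The conditional probability from the text: P(k=5 | p=0.5, n=12)
p_5_heads = binomial_prob(5, 12, 0.5)
print(round(p_5_heads, 6))  # 0.193359

# The 13 mutually exclusive outcomes (0..12 heads) sum to 1, per axiom 2.
total = sum(binomial_prob(k, 12, 0.5) for k in range(13))
print(total)  # 1.0
```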
CASE 1: PROBABILITY. The question is the same: If I toss a fair coin 12 times, what is the probability that I will obtain 5 heads and 7 tails? The answer comes directly from the above formula, where n = 12 and k = 5. The probability of such a future event is 0.193359.

From the probability perspective we can look at the distribution of all possible outcomes.

[Figure: the binomial distribution of outcomes (0 to 12 heads) under p = 0.5; our outcome of 5 heads & 7 tails has probability 0.1933.]

This is the distribution of mutually exclusive outcomes that comprise the set of all possible outcomes under the model where p = 0.5. Remember probability axiom 2, where P(S) = 1; the probabilities of the outcomes (i.e., 0 to 12 heads) sum to 1.

CASE 2: LIKELIHOOD. The second question is: What is the probability that my coin is fair if I tossed it 12 times and observed 5 heads and 7 tails? We have inverted the problem. In the previous case (1) we were interested in the probability of a future outcome given that my coin is fair. In this case (2) we are interested in the probability that my coin is fair, given a particular outcome. So, in the likelihood framework we have inverted the question such that the hypothesis (H) is variable, and the outcome (let's call it the data, D) is constant.

A problem: What we want to measure is P(H | D). The problem is that we can't work with the probability of a hypothesis, only the relative frequencies of outcomes. The
solution comes from the knowledge that there is a relationship between P(H | D) and P(D | H):

    P(H | D) = a P(D | H),   where a is a constant of proportionality

The likelihood of the hypothesis given the data, L(H | D), is proportional to the probability of the data given the hypothesis, P(D | H). As long as we stick to comparing hypotheses on the same data and probability model, the constant remains the same, and we can compare the likelihood scores. We cannot make comparisons on different data using likelihoods. Just remember: with likelihoods, the hypotheses are the variables!

Let's use the binomial model to look at the application of probability as compared with likelihood.

PROBABILITIES

                            Data
Hypotheses           D1: 1H & 1T    D2: 2H
H1: p(h) = 1/4          0.375       0.0625
H2: p(h) = 1/2          0.5         0.25

Following the probability axioms, and as we saw in the binomial distribution above, given a single hypothesis (i.e., H2: p(h) = 0.5), the different outcomes can be summed. For example, P(D1 or D2 | H2) = P(D1 | H2) + P(D2 | H2), a well-known result, with all possible outcomes summing to 1. However, we cannot use the addition axiom over different hypotheses H1 and H2; i.e., P(D1 | H1 or D2 | H2) is NOT P(D1 | H1) + P(D2 | H2).

LIKELIHOODS

                            Data
Hypotheses           D1: 1H & 1T      D2: 2H
H1: p(h) = 1/4        a1 x 0.375    a2 x 0.0625
H2: p(h) = 1/2        a1 x 0.5      a2 x 0.25

Under likelihood we can work with different hypotheses as long as we stick to the same dataset. Take the likelihoods of H1 and H2 under D1: we can infer that H1 is 3/4 as likely as H2. Note that when working with likelihoods, we compute the probabilities and drop the constant for convenience. The likelihoods do not sum to 1 because the probability terms are for the same outcome drawn from different distributions [probabilities for the total set of outcomes S in the same distribution sum to 1].

An example of likelihood in action

Let's use likelihood to follow through on our question of the probability that the coin is fair given 12 tosses with 5 heads and 7 tails.
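Before working the 12-toss example, the likelihood table above can be sketched in code (an illustration of mine, with the proportionality constant dropped, as the notes do):

```python
from math import comb

def binomial_prob(k, n, p):
    """Binomial probability of k successes in n trials with parameter p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Data D1 (1 head & 1 tail in 2 tosses) under both hypotheses:
L_H1 = binomial_prob(1, 2, 0.25)  # H1: p(h) = 1/4 -> 0.375
L_H2 = binomial_prob(1, 2, 0.5)   # H2: p(h) = 1/2 -> 0.5

# On the same data the constant cancels, so the likelihood ratio is meaningful:
print(L_H1 / L_H2)  # 0.75 -> H1 is 3/4 as likely as H2
```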
As always, our tosses are independent.

    L(p=0.5 | 12, 5) = a P(12, 5 | p=0.5)
[It's easy to use the binomial formula to get the probability term.]

    L = a x 0.193   [we drop the constant for convenience]
    L = 0.193

Perhaps there is an alternative hypothesis, i.e., one where p is not 0.5, that has a higher likelihood. To explore this possibility we take the binomial formula as our likelihood function and evaluate the resulting likelihoods with respect to various values of p and the given data. The results can be plotted as a curve; this curve is sometimes called the likelihood surface. The curve for our data (12, 5) is shown below.

[Figure: the likelihood curve for the data (12, 5), plotting likelihood against p from 0 to 1; the maximum likelihood score is 0.228, at the ML estimate of p = 0.42.]

IMPORTANT NOTE: It looks like a distribution, but don't be fooled: the area under the curve does not sum to 1. The curve reflects the probabilities of different values of p (a parameter of the model) under the same data, and these are not mutually exclusive outcomes within a single set of all possible outcomes.
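The likelihood surface can be traced numerically. This sketch of mine uses a simple grid search over p (rather than calculus) and recovers the ML estimate p = 5/12, about 0.417, and a maximum likelihood score of about 0.2286 (the figure reports these as 0.42 and 0.228):

```python
from math import comb

n, k = 12, 5  # the data: 12 tosses, 5 heads

def likelihood(p: float) -> float:
    # Binomial likelihood function; constant of proportionality dropped.
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Evaluate the likelihood over a grid of p values and take the maximum.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=likelihood)

print(p_hat)                        # 0.417, close to the exact ML value 5/12
print(round(likelihood(p_hat), 4))  # about 0.2286
```

The grid search also makes the IMPORTANT NOTE easy to verify: summing likelihood(p) across the grid gives a value far from 1, because the p values are model parameters, not mutually exclusive outcomes.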