
PROBABILITY AND LIKELIHOOD, A BRIEF INTRODUCTION IN SUPPORT OF A COURSE ON MOLECULAR EVOLUTION (BIOL 3046)

Probability

The subject of PROBABILITY is a branch of mathematics dedicated to building models that describe conditions of uncertainty and to providing tools for making decisions or drawing conclusions on the basis of such models. In the broad sense, a PROBABILITY is a measure of the degree to which an occurrence is certain [or uncertain].

A statistical definition of probability

People have thought about, and defined, probability in different ways. It is important to note the consequences of the definition:

1. All definitions agree on the algebraic and arithmetic procedures that must be followed; hence, the definition does not influence the outcome.

2. The definition has a fundamental impact on the meaning of the result!

We will consider the frequentist definition of probability, as it is currently the most widely held. To do this we need to define two concepts: (i) sample space, and (ii) relative frequency.

1. The sample space, S, is the collection [sometimes called the universe] of all possible outcomes. For a stochastic system, or an experiment, the sample space is a set in which each outcome comprises one element.

2. The relative frequency is the proportion of the sample space on which an event E occurs. In an experiment with 100 outcomes in which E occurs 81 times, the relative frequency is 81/100, or 0.81.

The frequentist approach is based on the notion of statistical regularity; i.e., in the long run, over replicates, the cumulative relative frequency of an event (E) stabilizes. The best way to illustrate this is with an example experiment that we run many times, measuring the cumulative relative frequency (crf). The crf is simply the relative frequency computed cumulatively over some number of replicate samples, each with a sample space S.

Let's take a look at an example of statistical regularity. Suppose we have a treatment for high blood pressure. The event, E, we are interested in is successfully controlling the blood pressure. So, we want to be able to make a prediction about the probability, P(E), that a patient treated in the future will have their blood pressure under control.

To estimate this probability we conduct an experiment that is replicated over time, in months. The data are presented in the table below.

Month   Number of subjects (S)   Number controlled (E)   Cumulative S   Cumulative E   crf
  1              100                      80                  100              80      0.800
  2              100                      88                  200             168      0.840
  3              100                      75                  300             243      0.810
  4              100                      77                  400             320      0.800
  5              100                      80                  500             400      0.800
  6              100                      76                  600             476      0.793
  7              100                      82                  700             558      0.797
  8              100                      79                  800             637      0.796
  9              100                      80                  900             717      0.797
 10              100                      76                 1000             793      0.793
 11              100                      77                 1100             870      0.791
 12              100                      78                 1200             948      0.790

[data for the example after McColl (1995)]

The crf values down the rightmost column fluctuate the most at the beginning, but rapidly stabilize. Statistical regularity is the stabilization of the crf in the face of individual fluctuations from month to month in the relative frequency of E.

Finally, we are in a position to give a definition of probability. Here goes:

In words, the probability of an event E, written P(E), is the long-run (cumulative) relative frequency of E. More formally, we define P(E) as follows:

P(E) = \lim_{n \to \infty} crf_n(E)

We can get an idea of this by using an example with a very large number of replications.

[Figure: hypothetical plot of the crf of an event over 10,000 replicates; the crf settles toward a stable value between 0 and 1.]
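A minimal Python sketch of statistical regularity, assuming an underlying success rate of 0.80 and monthly batches of 100 simulated patients (both illustrative choices, not the table's actual data):

```python
import random

def simulate_crf(true_p=0.80, months=12, per_month=100, seed=1):
    """Print the cumulative relative frequency (crf) of E, month by month."""
    random.seed(seed)
    cum_e, cum_s = 0, 0
    for month in range(1, months + 1):
        # Number of successes (E) in this month's batch of patients.
        controlled = sum(random.random() < true_p for _ in range(per_month))
        cum_e += controlled
        cum_s += per_month
        print(f"month {month:2d}  E = {controlled:3d}  cum E = {cum_e:4d}  crf = {cum_e / cum_s:.3f}")

simulate_crf()
```

With different seeds the early months fluctuate, but as the number of replicates grows the crf settles ever closer to 0.80.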

Probability models

For all probability models to give consistent results about the outcomes of future events, they need to obey four simple axioms (Kolmogorov 1933).

Probability axioms:

1. The probability scale runs from 0 to 1. Hence, 0 ≤ P(E) ≤ 1.

2. Probabilities are derived from the relative frequency of an event (E) in the space of all possible outcomes (S), where P(S) = 1. Hence, if the probability of an event (E) is P(E), then the probability that E does not occur is 1 - P(E).

3. When events E and F are disjoint, they cannot occur together. The probability of disjoint events E or F is P(E or F) = P(E) + P(F).

4. Axiom 3 above deals with a finite sequence of events. Axiom 4 is the extension of axiom 3 to an infinite sequence of events.

For the purpose of modelling in molecular evolution, we need to assume these probability axioms and just one additional theorem, the multiplication theorem. I will not provide a detailed explanation of this theorem. However, a consequence of this theorem is what is sometimes referred to as the "product rule" or "multiplication rule"; see the box below for an explanation.

Product rule: The product rule applies when two events E1 and E2 are independent. E1 and E2 are independent if the occurrence or non-occurrence of E1 does not change the probability of E2 [and vice versa]. When E1 and E2 occur together they are joint events. The joint probability of the independent events E1 and E2 is P(E1, E2) = P(E1) × P(E2). Hence the term "product rule", or "multiplication principle", or whatever you call it. [A further statistical definition of independence requires the use of the multiplication theorem.] It is important to note that a proof of statistical independence for a specific case by using the multiplication theorem is rarely possible; hence, most models incorporate independence as a model assumption.

Typically, probability refers to the occurrence of some future event. For example, the probability that a tossed [fair] coin will be heads is 1/2. What is the probability of getting 5H and 6T if the coin is fair?

Conditional probability is very useful, as it allows us to express a probability given some further information; specifically, it is the probability of event E2 assuming that event E1 has already occurred. We assume the events E1 and E2 are in a given sample space, S, and that P(E1) > 0. We write the conditional probability as P(E2 | E1); the vertical bar is read as "given".
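A minimal Python sketch of the product rule and of conditional probability, using two tosses of a fair coin as the sample space (an illustrative choice):

```python
from itertools import product

# Enumerate the sample space S for two tosses of a fair coin: 4 equally likely outcomes.
S = list(product("HT", repeat=2))

def P(event):
    """Relative frequency on S of an event, given as a predicate on outcomes."""
    return sum(1 for o in S if event(o)) / len(S)

p_e1 = P(lambda o: o[0] == "H")         # P(E1): first toss is heads  = 0.5
p_e2 = P(lambda o: o[1] == "H")         # P(E2): second toss is heads = 0.5
p_joint = P(lambda o: o == ("H", "H"))  # P(E1, E2)                   = 0.25

print(p_joint == p_e1 * p_e2)  # True: E1 and E2 are independent, so the product rule holds
print(p_joint / p_e1)          # P(E2 | E1) = P(E1, E2) / P(E1) = 0.5, the same as P(E2)
```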

Let's look at an example of a probability model. The familiar binomial distribution provides the appropriate model for describing the probability of the outcomes of flipping a coin. The binomial model is as follows:

P(k) = \binom{n}{k} p^k (1-p)^{n-k}, where \binom{n}{k} = \frac{n!}{k!(n-k)!}

If we had a fair coin, we could predict the probability of specific outcomes (e.g., 1 head & 1 tail in two tosses) by setting the parameter p equal to 0.5. Note that the model does not require this. In the case of the coin toss, we are interested in a conditional probability; i.e., what is the probability of obtaining, say, 5 heads given a fair coin (p = 0.5) and 12 tosses, or P(k=5 | p=0.5, n=12)?

Probability and likelihood are inverted

Probability refers to the occurrence of some future outcome. For example: If I toss a fair coin 12 times, what is the probability that I will obtain 5 heads and 7 tails?

Likelihood refers to a past event with a known outcome. For example: What is the probability that my coin is fair if I tossed it 12 times and observed 5 heads and 7 tails?

Let's continue to use the familiar coin-tossing experiment to examine this inversion. For a fair coin (p = 1/2) the model becomes:

P(k) = \binom{n}{k} (1/2)^k (1/2)^{n-k}

Here n is the number of flips and k is the number of successes.
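A minimal Python sketch of the binomial model, evaluating the conditional probability P(k=5 | p=0.5, n=12) posed above:

```python
from math import comb

def binomial_probability(k, n, p):
    """P(k successes | n trials, success probability p) = C(n, k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of 5 heads in 12 tosses of a fair coin.
print(binomial_probability(5, 12, 0.5))   # 0.193359...
```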

CASE 1: PROBABILITY. The question is the same: If I toss a fair coin 12 times, what is the probability that I will obtain 5 heads and 7 tails? The answer comes directly from the above formula with n = 12 and k = 5. The probability of such a future event is 0.193359.

From the probability perspective we can look at the distribution of all possible outcomes.

[Figure: the distribution of the number of heads in 12 tosses of a fair coin; the bar for our outcome of 5 heads & 7 tails has probability 0.1933.]

This is the distribution of mutually exclusive outcomes that comprise the set of all possible outcomes under the model where p = 0.5. Remember probability axiom 2, where P(S) = 1; the probabilities of each outcome (i.e., 0 to 12 heads) sum to 1.
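A minimal Python sketch of this point: under p = 0.5 and n = 12, the probabilities of the mutually exclusive outcomes k = 0 to 12 heads sum to 1:

```python
from math import comb

# Full outcome distribution under the fair-coin model: a partition of S, so P(S) = 1 (axiom 2).
n, p = 12, 0.5
dist = {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

print(round(dist[5], 6))   # 0.193359, our outcome of 5 heads & 7 tails
print(sum(dist.values()))  # 1.0 (up to floating-point rounding)
```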

CASE 2: LIKELIHOOD. The second question is: What is the probability that my coin is fair if I tossed it 12 times and observed 5 heads and 7 tails? We have inverted the problem. In the previous case (1) we were interested in the probability of a future outcome, given that my coin is fair. In this case (2) we are interested in the probability that my coin is fair, given a particular outcome. So, in the likelihood framework we have inverted the question such that the hypothesis (H) is variable and the outcome (let's call it the data, D) is constant.

A problem: What we want to measure is P(H | D). The problem is that we can't work with the probability of a hypothesis, only with the relative frequencies of outcomes. The solution comes from the knowledge that there is a relationship between P(H | D) and P(D | H):

P(H | D) = α P(D | H), where α is a constant of proportionality

The likelihood of the hypothesis given the data, L(H | D), is proportional to the probability of the data given the hypothesis, P(D | H). As long as we stick to comparing hypotheses on the same data and the same probability model, the constant remains the same, and we can compare the likelihood scores. We cannot make comparisons across different data using likelihoods. Just remember: with likelihoods, the hypotheses are the variables!

Let's use the binomial model to look at the application of probability as compared with likelihood.

PROBABILITIES

                          Data
Hypotheses                D1: 1H & 1T     D2: 2H
H1: p(h) = 1/4            0.375           0.0625
H2: p(h) = 1/2            0.5             0.25

Following the probability axioms, and as we saw in the binomial distribution above, given a single hypothesis (i.e., H2: p(h) = 0.5), the probabilities of different outcomes can be summed. For example, P(D1 or D2 | H2) = P(D1 | H2) + P(D2 | H2), a well-known result, with all possible outcomes summing to 1. However, we cannot use the addition axiom over different hypotheses H1 and H2; i.e., P(D1 | H1 or D2 | H2) ≠ P(D1 | H1) + P(D2 | H2).

LIKELIHOODS

                          Data
Hypotheses                D1: 1H & 1T     D2: 2H
H1: p(h) = 1/4            α1 × 0.375      α2 × 0.0625
H2: p(h) = 1/2            α1 × 0.5        α2 × 0.25

Under likelihood we can work with different hypotheses as long as we stick to the same dataset. Take the likelihoods of H1 and H2 under D1: we can infer that H1 is 3/4 as likely as H2. Note that when working with likelihoods, we compute the probabilities and drop the constant for convenience. The likelihoods do not sum to 1, because the probability terms are for the same outcome drawn from different distributions [the probabilities for the total set of outcomes S within a single distribution sum to 1].
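A minimal Python sketch of the comparison in the likelihood table, dropping the constant of proportionality as in the text:

```python
from math import comb

def binomial_probability(k, n, p):
    """P(k successes | n trials, success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Same data D1 (1 head & 1 tail in 2 tosses), two hypotheses about p(heads).
L_H1 = binomial_probability(1, 2, 0.25)   # L(H1 | D1) ∝ 0.375
L_H2 = binomial_probability(1, 2, 0.50)   # L(H2 | D1) ∝ 0.5

print(L_H1 / L_H2)   # 0.75: on these data, H1 is 3/4 as likely as H2
```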

An example of likelihood in action

Let's use likelihood to follow through on our question of the probability that the coin is fair, given 12 tosses with 5 heads and 7 tails. As always, our tosses are independent.

L(p=0.5 | n=12, k=5) = α P(n=12, k=5 | p=0.5)

[it's easy to use the binomial formula to get the probability term]

L = α × 0.193

[we drop the constant for convenience]

L = 0.193

Perhaps there is an alternative hypothesis, i.e., one where p ≠ 0.5, that has a higher likelihood. To explore this possibility we take the binomial formula as our likelihood function and evaluate the resulting likelihoods for various values of p and the given data. The results can be plotted as a curve; this curve is sometimes called the likelihood surface. The curve for our data (n=12, k=5) is shown below.

[Figure: the likelihood curve for p over the interval 0 to 1; the maximum likelihood score is 0.228, at the ML estimate p = 0.42.]

IMPORTANT NOTE: It looks like a distribution, but don't be fooled; the area under the curve does not sum to 1. The curve reflects the probabilities of different values of p (a parameter of the model) under the same data, and these are not mutually exclusive outcomes within a single set of all possible outcomes.
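A minimal Python sketch of the likelihood surface, evaluating the binomial likelihood on a grid of p values (the 0.01 grid step is an illustrative choice):

```python
from math import comb

# Observed data: n = 12 tosses, k = 5 heads.
n, k = 12, 5

def likelihood(p):
    """Binomial likelihood of p for the fixed data (constant dropped)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Evaluate the likelihood over a grid of p values and take the maximum.
grid = [i / 100 for i in range(1, 100)]
ml_p = max(grid, key=likelihood)

print(ml_p, round(likelihood(ml_p), 3))   # 0.42 0.228 (the exact MLE is k/n = 5/12)
```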