A short introduction to Bayesian statistics, part I
Math 218, Mathematical Statistics
D Joyce, Spring 2016

I'll try to make this introduction to Bayesian statistics clear and short. First we'll look at a specific example, then the general setting, then Bayesian statistics for the Bernoulli process, for the Poisson process, and for normal distributions.

1 A simple example

Suppose we have two identical urns: urn A with 5 red balls and 10 green balls, and urn B with 10 red balls and 5 green balls. We'll select one of the two urns at random, then sample with replacement from that urn to help determine whether we chose A or B.

Before sampling we'll suppose that we have prior probabilities of 1/2, that is, P(A) = 1/2 and P(B) = 1/2. Let's take a sample X = (X_1, X_2, ..., X_n) of size n, and suppose that k of the n balls we select with replacement are red. We want to use that information to help determine which of the two urns, A or B, we chose. That is, we'll compute P(A | X) and P(B | X). In order to find those conditional probabilities, we'll use Bayes' formula. We can easily compute the reverse probabilities

    P(X | A) = (1/3)^k (2/3)^(n-k)
    P(X | B) = (2/3)^k (1/3)^(n-k)

so by Bayes' formula we derive the posterior probabilities

    P(A | X) = P(X | A) P(A) / (P(X | A) P(A) + P(X | B) P(B))
             = (1/3)^k (2/3)^(n-k) / ((1/3)^k (2/3)^(n-k) + (2/3)^k (1/3)^(n-k))
             = 2^(n-k) / (2^(n-k) + 2^k)

    P(B | X) = 1 - P(A | X) = 2^k / (2^(n-k) + 2^k).

For example, suppose that in n = 10 trials we got k = 4 red balls. The posterior probabilities would become P(A | X) = 4/5 and P(B | X) = 1/5.
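To check that arithmetic, here is a small Python sketch of my own (not part of the original notes); the function name urn_posterior and the use of exact fractions are just one convenient way to set it up.

    from fractions import Fraction

    def urn_posterior(n, k):
        # Posterior P(A|X) and P(B|X) for the two-urn example: prior 1/2 each,
        # k red balls observed in n draws with replacement.
        p_red_A = Fraction(1, 3)   # urn A: 5 red out of 15 balls
        p_red_B = Fraction(2, 3)   # urn B: 10 red out of 15 balls
        like_A = p_red_A**k * (1 - p_red_A)**(n - k)    # P(X|A)
        like_B = p_red_B**k * (1 - p_red_B)**(n - k)    # P(X|B)
        prior_A = prior_B = Fraction(1, 2)
        evidence = like_A * prior_A + like_B * prior_B  # P(X)
        return like_A * prior_A / evidence, like_B * prior_B / evidence

    print(urn_posterior(10, 4))    # (Fraction(4, 5), Fraction(1, 5))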

Before the experiment we chose the two urns each with probability 1/2, that is, the probability of choosing a red ball was either p = 1/3 or p = 2/3, each with probability 1/2. That's shown in the prior graph on the left. After drawing n = 10 balls out of that urn (with replacement) and getting k = 4 red balls, we update the probabilities. That's shown in the posterior graph on the right.

How this example generalizes. In the example we had a discrete distribution on p, the probability that we'd choose a red ball. This parameter p could take two values: p could be 1/3 with probability 1/2 when we chose urn A, or p could be 2/3 with probability 1/2 when we chose urn B. We actually had a prior distribution on the parameter p. After taking into consideration the outcome k of an experiment, we had a different distribution on p. It was a conditional distribution, p given k. In general, we won't have only two different values of a parameter, but infinitely many; we'll have a continuous distribution on the parameter instead of a discrete one.

2 The basic principle

The setting for Bayesian statistics is a family of distributions parametrized by one or more parameters along with a prior distribution for those parameters. In the example above we had a Bernoulli process parametrized by one parameter p, the probability of success. In the example the prior distribution for p was discrete and had only two values, 1/3 and 2/3, each with probability 1/2. A sample X is taken, and a posterior distribution for the parameters is computed.

Let's clarify the situation and introduce terminology and notation in the general case where X is a discrete random variable and there is only one discrete parameter θ. (In practice θ is a continuous parameter, but in the example above it was discrete, and for this introduction let's take θ to be discrete.) In statistics we don't know what the value of θ is; our job is to make inferences about θ. The way to find out about θ is to perform many trials and see what happens, that is, to select a random sample from the distribution, X = (X_1, X_2, ..., X_n), where each random variable X_i has the given distribution. The actual outcomes that are observed I'll denote x = (x_1, x_2, ..., x_n).

Now, different values of θ lead to different probabilities of the outcome x, that is, P(X=x | θ) varies with θ. In so-called classical statistics, this probability is called the likelihood of θ given the outcome x, denoted L(θ | x). The reason the word likelihood is used is the suggestion that the real value of θ is likely to be one with a higher probability P(X=x | θ). But this likelihood L(θ | x) is not a probability about θ. (Note that classical statistics is much younger than Bayesian statistics and probably should have some other name.)

What Bayesian statistics does is replace this concept of likelihood by a real probability. In order to do that, we'll treat the parameter θ as a random variable rather than an unknown constant. Since it's a random variable, I'll use an uppercase Θ. This random variable Θ itself has a probability mass function, which I'll denote f_Θ(θ) = P(Θ=θ). This f_Θ is called the prior distribution on Θ. It's the probability you have before considering the information in X, the results of an observation. The symbol P(X=x | θ) really is a conditional probability now, and it should properly be written P(X=x | Θ=θ), but I'll abbreviate it simply as P(x | θ) and leave out the references to the random variables when the context is clear. Using Bayes' law we can invert this conditional probability. In full, it says

    P(Θ=θ | X=x) = P(X=x | Θ=θ) P(Θ=θ) / P(X=x),

but we can abbreviate that as

    P(θ | x) = P(x | θ) P(θ) / P(x).

This conditional probability P(θ | x) is called the posterior distribution on Θ. It's the probability you have after taking into consideration new information from an observation.
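In code, that inversion is just "multiply the prior by the likelihood and divide by the sum." The sketch below is my own generic version for a discrete parameter (the function name and the dictionary representation of the prior are choices of this illustration, not anything from the notes); applied to the urn example it reproduces the 4/5 and 1/5 found earlier.

    def discrete_posterior(prior, likelihood):
        # prior: dict mapping each parameter value theta to P(theta).
        # likelihood: function returning P(x | theta) for the observed data x.
        # Returns a dict mapping theta to P(theta | x) = P(x|theta) P(theta) / P(x).
        unnormalized = {theta: likelihood(theta) * p for theta, p in prior.items()}
        evidence = sum(unnormalized.values())    # P(x), the normalizing constant
        return {theta: u / evidence for theta, u in unnormalized.items()}

    # Urn example: theta is the chance of a red ball, 1/3 for urn A, 2/3 for urn B.
    n, k = 10, 4
    print(discrete_posterior(
        prior={1/3: 0.5, 2/3: 0.5},
        likelihood=lambda theta: theta**k * (1 - theta)**(n - k),
    ))    # approximately {0.333...: 0.8, 0.666...: 0.2}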

Note that the denominator P(x) is a constant, so the last equation says that the posterior distribution P(θ | x) is proportional to P(x | θ) P(θ). I'll write proportions with the traditional symbol ∝, so that the last statement can be written as

    P(θ | x) ∝ P(x | θ) P(θ).

Using proportions saves a lot of symbols, and we don't lose any information, since the constant of proportionality P(x) is known.

When we discuss the three settings (Bernoulli, Poisson, and normal), the random variable X will be either discrete or continuous, but our parameters will all be continuous, not discrete (unlike the simple example above, where our parameter p was discrete and took only the two values 1/3 and 2/3). That means we'll be working with probability density functions instead of probability mass functions. In the continuous case there are analogous statements. In particular, analogous to the last statement, we have

    f(θ | x) ∝ f(x | θ) f(θ),

where f(θ) is the prior density function on the parameter Θ, f(θ | x) is the posterior density function on Θ, and f(x | θ) is a conditional probability or a conditional density depending on whether X is a continuous or discrete random variable.

3 The Bernoulli process

A single trial X for a Bernoulli process, called a Bernoulli trial, ends with one of two outcomes: success, where X = 1, and failure, where X = 0. Success occurs with probability p, while failure occurs with probability q = 1 - p. The term Bernoulli process is just another name for a random sample from a Bernoulli population. Thus, it consists of repeated independent Bernoulli trials X = (X_1, X_2, ..., X_n) with the same parameter p.

The problem for statistics is determining the value of this parameter p. All we know is that it lies between 0 and 1. We also expect the ratio k/n of the number of successes k to the number of trials n to approach p as n approaches infinity, but that's a theoretical result that doesn't say much about what p is when n is small. Let's see what the Bayesian approach says here.

We start with a prior density function f(p) on p and take a random sample x = (x_1, x_2, ..., x_n). Then the posterior density function is proportional to a conditional probability times the prior density function:

    f(p | x) ∝ P(X=x | p) f(p).

Suppose now that k successes occur among the n trials x. With our convention that X_i = 1 means the trial X_i ended in success, that means that k = x_1 + x_2 + ... + x_n. Then

    P(X=x | p) = p^k (1 - p)^(n-k).

Therefore,

    f(p | x) ∝ p^k (1 - p)^(n-k) f(p).

Thus, we have a formula for determining the posterior density function f(p | x) from the prior density function f(p). (In order to know a density function, it's enough to know what it's proportional to, because we also know that the integral of a density function is 1.)

But what should the prior distribution be? That depends on your state of knowledge. You may already have some knowledge about what p might be. But if you don't, maybe the best thing to do is assume that all values of p are equally probable. Let's do that and see what happens. So assume now that the prior density function f(p) is uniform on the interval [0, 1]: f(p) = 1 on the interval and 0 off it. Then we can determine the posterior density function. On the interval [0, 1],

    f(p | x) ∝ p^k (1 - p)^(n-k) f(p) = p^k (1 - p)^(n-k).
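Since the posterior density is known only up to the constant of proportionality, one practical way to work with it is to evaluate p^k (1 - p)^(n-k) on a grid of p values and normalize numerically. The following pure-Python sketch is my own illustration of that idea; the data n = 10, k = 6 are chosen only for concreteness.

    n, k = 10, 6                     # illustrative data: 6 successes in 10 trials
    step = 0.001
    grid = [i * step for i in range(1001)]                # p values from 0 to 1
    unnorm = [p**k * (1 - p)**(n - k) for p in grid]      # p^k (1-p)^(n-k), uniform prior
    # Normalize with the trapezoid rule so the density integrates to 1.
    area = sum((unnorm[i] + unnorm[i + 1]) / 2 * step for i in range(1000))
    posterior = [u / area for u in unnorm]
    print(posterior[600])            # approximate posterior density at p = 0.6

A non-uniform prior would be handled the same way: multiply each unnormalized value by the prior density at that grid point before normalizing.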

The proportionality f(p | x) ∝ p^k (1 - p)^(n-k) is enough to tell us that this posterior is the beta distribution Beta(k + 1, n + 1 - k), because the probability density function for a beta distribution Beta(α, β) is

    f(x) = x^(α-1) (1 - x)^(β-1) / B(α, β)   for 0 ≤ x ≤ 1,

where B(α, β) is a constant, namely the beta function B evaluated at the arguments α and β. Note that the prior distribution f(p) we chose was uniform on [0, 1], and that's actually the beta distribution Beta(1, 1).

Let's suppose you have a large number of balls in an urn, every one of which is either red or green, but you have no idea how many there are or what the fraction p of red balls is. They could even be all red or all green. You decide to make your prior distribution on p uniform, that is, Beta(1, 1). This uniform prior density is shaded green in the first figure.

Now you choose one ball at random and put it back. If it was red, your new distribution on p is Beta(2, 1). The density function of this distribution is f_P(p) = 2p. It's shaded pink in the figure. The probability is now more dense near 1 and less near 0.

Let's suppose we do it again and get a green ball. Now we've got Beta(2, 2). So far, one red and one green, and the probability is shifted back towards the center. Now f_P(p) = 6p(1 - p). It's shaded blue in the figure.

Suppose the next three are red, red, and green, in that order. After each one we can update the distribution: next will be Beta(3, 2), then Beta(4, 2), and then Beta(4, 3). They appear in the next figure, the first green, the second pink, and the third blue. With each new piece of information the distribution slowly narrows.

We can't say much yet with only 5 drawings, 3 reds and 2 greens. A sample of size 5 doesn't say much. Even with so little information, we can still pretty much rule out p being less than 0.05 or greater than 0.99.

Let's see what the distribution would look like with more data. Take three more cases. First, when n = 10 and we've drawn red balls 6 times. Then, when n = 20 and we've gotten 14 red balls. And finally, when n = 50 and we've gotten 33 reds. Those have the three distributions Beta(7, 5), graphed in green, Beta(15, 7), graphed in red, and Beta(34, 18), graphed in blue. These are much skinnier distributions, so we'll squeeze the vertical scale.
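Because each draw simply adds 1 to one of the two beta parameters, this updating is easy to mirror in code. Here is a minimal sketch of my own, assuming SciPy is available (the variable names and the interval from 0.4 to 0.85 anticipate the remark below); it steps through the five draws above and then looks at the n = 50 posterior.

    from scipy.stats import beta

    # Start from the uniform prior Beta(1, 1) and update one draw at a time:
    # a red ball adds 1 to alpha, a green ball adds 1 to beta.
    a, b = 1, 1
    for ball in ["red", "green", "red", "red", "green"]:
        if ball == "red":
            a += 1
        else:
            b += 1
        print(f"after {ball}: Beta({a}, {b})")   # Beta(2,1), Beta(2,2), ..., Beta(4,3)

    # With 33 reds in n = 50 draws the posterior is Beta(34, 18). The posterior
    # probability that p lies between 0.4 and 0.85 is the area under that density.
    posterior = beta(34, 18)
    print(posterior.cdf(0.85) - posterior.cdf(0.4))   # a value very close to 1

The same rule gives the Beta(7, 5) and Beta(15, 7) posteriors for the n = 10 and n = 20 samples: add one to the first parameter for each red ball and one to the second for each green ball.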

Even after 50 trials, about all we can say is that p is with high probability between 0.4 and 0.85. We can actually compute that high probability as well, since we have a distribution on p: it is the area under the Beta(34, 18) density between 0.4 and 0.85.

Math 218 Home Page at http://math.clarku.edu/~djoyce/ma218/