Bayesian probability: P. State of the World: X. P(X | your information I)

Bayesian probability: P. State of the World: X. P(X | your information I) 1

First example: bag of balls Every probability is conditional on your background knowledge I: P(A | I). What is your probability that there are r red balls in a bag? (Assume N balls, each of which can be red or white.) Before any data, you might select your prior probability as P(r) = 1/(N+1) for all possible r (0, 1, 2, ..., N). Here r is the unknown parameter, and your data will be the balls that are drawn and observed. 2

First example: bag of balls Given that the proportion of red balls is i/N, you might say: the probability of blindly picking one red ball is P(X = red | r = i/N) = i/N. This is your (subjective) model choice. Calculate the posterior probability P(r = i/N | X = red). 3

First example: bag of balls Remember some probability calculus: P(A,B) = P(A | B)P(B) = P(B | A)P(A) = P(B,A). The joint probability in this example is P(X = red, r = i/N) = (i/N)·(1/(N+1)). Calculate: P(r = i/N | X = red) = (i/N)·(1/(N+1)) / P(X = red). Here P(X = red) is just a normalizing constant: P(X = red) = Σ_{i=0}^{N} P(X = red | r = i/N) P(r = i/N) = 1/2. 4

First example: bag of balls The posterior probability is therefore P(r = i/N | X = red) = 2i/(N(N+1)). What have we learned from the observation X = red? Compare with the prior probability. 5

First example: bag of balls Our new prior is 2i/(N(N+1)). After observing a second red ball (two red balls in total): P(r = i/N | X = 2 red) = (i/N) · 2i/(N(N+1)) / c = 2i²/(N²(N+1)) / c. The normalizing constant is c = (2N+1)/(3N), so P(r = i/N | X = 2 red) = 6i²/(N(N+1)(2N+1)). 6
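As a quick check (not on the original slides), a minimal Python sketch reproduces both updates numerically; the bag size N = 10 is an arbitrary assumption.

```python
# Numerical check of the bag-of-balls updates (assumed N = 10).
import numpy as np

N = 10
i = np.arange(N + 1)                      # possible numbers of red balls
prior = np.full(N + 1, 1.0 / (N + 1))     # flat prior P(r = i/N) = 1/(N+1)

# Update after observing one red ball: posterior is proportional to (i/N) * prior
post1 = (i / N) * prior
post1 /= post1.sum()
print(np.allclose(post1, 2 * i / (N * (N + 1))))                    # True

# Update after a second red ball: posterior is proportional to (i/N) * post1
post2 = (i / N) * post1
post2 /= post2.sum()
print(np.allclose(post2, 6 * i**2 / (N * (N + 1) * (2 * N + 1))))   # True
```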

First example: bag of balls The result is the same whether we (a) start with the original prior and use the probability of observing two red balls, or (b) start with the posterior obtained after observing one red ball and use the probability of observing one red ball again. The model would be different if we assumed that balls are not replaced in the bag. 7

First example: bag of balls The prior (and posterior) probability P(r) can be said to describe epistemic uncertainty. The conditional probability P(X | r) can be said to describe aleatoric uncertainty. Where do these come from? Background information. Model choice. 8

Elicitation of a prior from an expert P(A) should describe the expert's beliefs. Consider two options: you get 300 if A is true, or you get a ticket in a lottery where n tickets out of 100 win 300. Which option do you choose? For a small n you prefer the first option and for a large n the second, so n_small/100 < P(A | your information) < n_large/100. By narrowing the interval you can find n/100 ≈ P(A | your information). 9

Elicitation of a prior from an expert Also, in terms of betting odds w against A (you win wR if A is true and lose your stake R otherwise), a fair bet is such that P(A)wR + (1−P(A))(−R) = 0. Solving gives P(A) = 1/(1+w). Probability densities are more difficult to elicit, and multivariate densities even more so. Beware of psychological biases. 10

Elicitation of a prior from an expert Assume we have elicited densities p_i(x) from experts i = 1, ..., N. How to combine them? Mixture density: p(x) = (1/N) Σ_{i=1}^{N} p_i(x). Product of densities: p(x) = Π_{i=1}^{N} p_i(x)^{1/N} / c (needs a normalizing constant c). 11

The height of the Eiffel Tower? What is your minimum and maximum? p_i = U(min_i, max_i) 12
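A minimal sketch of how such elicited priors could be combined, using three hypothetical expert intervals for the height and the mixture and product formulas from the previous slide:

```python
# Combine hypothetical expert priors p_i = U(min_i, max_i) on a grid.
import numpy as np

grid = np.linspace(100, 500, 2001)                   # metres, assumed range
experts = [(250, 400), (280, 350), (200, 450)]       # hypothetical (min_i, max_i)

densities = np.array([
    np.where((grid >= lo) & (grid <= hi), 1.0 / (hi - lo), 0.0)
    for lo, hi in experts
])

dx = grid[1] - grid[0]
mixture = densities.mean(axis=0)                     # (1/N) * sum_i p_i(x)
pool = np.prod(densities ** (1.0 / len(experts)), axis=0)
pool /= pool.sum() * dx                              # normalizing constant c

print(mixture.sum() * dx, pool.sum() * dx)           # both integrate to ~1
```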

Choice of prior Subjective expert knowledge can be important when we have little data, when it is the only source of information, or when data would be too expensive. Difficult problems never have sufficient data. Alternatively: uninformative, flat priors. Objective Bayes & subjective Bayes. 13

An example from schoolbook genetics Assume parents with unknown genotypes: Aa, aa or AA. Assume a child is observed to be of type AA. Question 1: what is now the probability for the genotypes of the parents? Question 2: what is the probability that the next child will also be of type AA? 14

Graphically: there is a conditional probability for the genotype of each child, given the genotypes of the parents. [Diagram: parents X_mom = ? and X_dad = ?, with children X_1 = AA and X_2 = ?] This is the data model. 15

Now, given the prior AND the observed child, we calculate the probability of the 2nd child. [Diagram: X_mom = ?, X_dad = ?, X_1 = AA, X_2 = ?] Information about the 1st child tells us something about the parents, and hence about the 2nd child. 16

The posterior probability is 1/4 for each of the parental combinations [AA,AA], [Aa,Aa], [AA,Aa], [Aa,AA] (assuming a priori that each parental allele is independently A or a with probability 1/2, so that P(AA) = P(aa) = 1/4 and P(Aa) = 1/2 for each parent). This results in P(AA) = 9/16 for the 2nd child. Compare this with the prior probability P(AA) = 1/4: the new evidence changed it. 17

Using Bayes: P(A | B) = P(B | A)P(A) / Σ_A P(B | A)P(A), the posterior probability for the parents can be calculated as P(X_mom, X_dad | X_1 = AA) = P(X_1 = AA | X_mom, X_dad) P(X_mom, X_dad) / Σ_{X_mom, X_dad} P(X_1 = AA | X_mom, X_dad) P(X_mom, X_dad). This describes our final degree of uncertainty. 18

The posterior probability is 1/4 for each of the parental combinations [AA,AA], [Aa,Aa], [AA,Aa], [Aa,AA]. Notice that aa is no longer a possible type for either parent. The prediction for the next child is thus P(X_2 = AA | X_1 = AA) = Σ_{X_mom, X_dad} P(X_2 = AA | X_mom, X_dad) P(X_mom, X_dad | X_1 = AA), where the second factor is the posterior. The result is 9/16. Compare this with the prior probability P(AA) = 1/4. 19
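A short Python sketch of the whole genetics example, assuming (as above) that each parental allele is independently A or a with probability 1/2; it reproduces the 1/4 posterior probabilities and the 9/16 prediction:

```python
# Enumerate parental genotype combinations for the schoolbook genetics example.
# Assumed prior per parent: P(AA) = 1/4, P(Aa) = 1/2, P(aa) = 1/4.
from itertools import product

prior = {'AA': 0.25, 'Aa': 0.5, 'aa': 0.25}

def p_child_AA(mom, dad):
    """Probability that a child is AA, given the parents' genotypes."""
    p_A = {'AA': 1.0, 'Aa': 0.5, 'aa': 0.0}   # chance of passing on allele A
    return p_A[mom] * p_A[dad]

# Posterior over parental combinations given that the first child is AA
joint = {(m, d): p_child_AA(m, d) * prior[m] * prior[d]
         for m, d in product(prior, repeat=2)}
norm = sum(joint.values())
posterior = {k: v / norm for k, v in joint.items() if v > 0}
print(posterior)   # 1/4 each for (AA,AA), (AA,Aa), (Aa,AA), (Aa,Aa)

# Prediction for the second child
p_next = sum(p_child_AA(m, d) * w for (m, d), w in posterior.items())
print(p_next)      # 0.5625 = 9/16
```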

The previous example had all the essential elements. The same idea is just repeated in various forms. 20

Binomial model Recall Bayes' original example. X ~ Binomial(N, θ), p(θ) = prior density, e.g. U(0,1) or Beta(α, β). Find the posterior p(θ | X). 21

Binomial model Posterior density: p(θ | X) = P(X | θ)p(θ)/c. Assuming a uniform prior: p(θ | x) = C(N, x) θ^x (1−θ)^(N−x) · 1{0 < θ < 1} / c. Look at this as a function of θ, with N, x, and c as fixed constants. What probability density function can be seen? Hint: compare to the beta density p(θ | α, β) = Γ(α+β)/(Γ(α)Γ(β)) θ^(α−1) (1−θ)^(β−1). 22

Binomial model The posterior can be written, up to a constant term, as p(θ | N, x) ∝ θ^(x+1−1) (1−θ)^(N−x+1−1), which is the same as Beta(x+1, N−x+1). If the uniform prior is replaced by Beta(α, β), we get Beta(x+α, N−x+β). 23

Binomial model The uniform prior corresponds to having two pseudo-observations: one red ball and one white ball. The posterior mean is (1+X)/(2+N), or in general (α+X)/(α+β+N). This can be expressed as w·α/(α+β) + (1−w)·X/N with w = (α+β)/(α+β+N). See what happens if N → ∞, or if N → 0. 24
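A minimal sketch of the conjugate Beta-Binomial update; the prior parameters α = β = 2 and the data N = 20, x = 7 are arbitrary assumptions, used only to illustrate the weighting formula above.

```python
# Conjugate Beta-Binomial update and the prior/data weighting of the posterior mean.
from scipy import stats

alpha, beta_, N, x = 2.0, 2.0, 20, 7          # hypothetical prior and data

post = stats.beta(alpha + x, beta_ + N - x)   # posterior Beta(x+alpha, N-x+beta)

w = (alpha + beta_) / (alpha + beta_ + N)     # weight on the prior mean
prior_mean, data_mean = alpha / (alpha + beta_), x / N
print(post.mean(), w * prior_mean + (1 - w) * data_mean)   # equal: 0.375
print(post.interval(0.95))                    # 95% posterior interval for theta
```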

Binomial model [Figure: simulated sample from the joint distribution p(θ, X) = P(X | N, θ)p(θ); the panels show P(X | N, θ) and p(θ | X).] 25

Binomial model The binomial distribution (likelihood function) and the beta prior are said to be conjugate. A conjugate choice of prior leads to closed-form solutions (the posterior density is in the same family as the prior density). A conjugate prior can also be interpreted as pseudo data comparable with the real data. Only a few conjugate solutions exist! 26

Binomial model & priors The uniform prior U(0,1) for θ was uninformative. In what sense? What if we study the density of θ² or log(θ), assuming θ ~ U(0,1)? Jeffreys' prior is uninformative in the sense that it is transformation invariant: p(θ) ∝ J(θ)^(1/2) with J(θ) = E[ (d/dθ log P(X | θ))² | θ ]. 27

Binomial model & priors J(θ) is known as the Fisher information for θ. With Jeffreys' prior for θ we get, for any one-to-one smooth transformation φ = h(θ), using the transformation-of-variables rule: p(φ) = p(θ) |dθ/dφ| ∝ ( E[(d log(L)/dθ)²] (dθ/dφ)² )^(1/2) = ( E[(d log(L)/dφ)²] )^(1/2) = J(φ)^(1/2), where L = P(X | parameter). So the Jeffreys prior for φ is again a Jeffreys prior. 28

Binomial model & priors For the binomial model, Jeffreys' prior is Beta(1/2, 1/2). But in general: Jeffreys' prior can lead to improper densities (the integral is infinite), it is difficult to generalize to higher dimensions, and it violates the likelihood principle, which states that inferences should be the same when the likelihood function is the same. 29
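This can be checked numerically: the sketch below computes the Fisher information J(θ) of the Binomial(N, θ) model by summing over all possible X and verifies that its square root is proportional to the Beta(1/2, 1/2) kernel θ^(−1/2)(1−θ)^(−1/2); N = 10 is an arbitrary assumption.

```python
# Fisher information of Binomial(N, theta) and the Jeffreys prior Beta(1/2, 1/2).
import numpy as np
from scipy import stats

N = 10
theta = np.linspace(0.05, 0.95, 19)
X = np.arange(N + 1)

# Score: d/dtheta log P(X | theta) = X/theta - (N - X)/(1 - theta)
score = X[None, :] / theta[:, None] - (N - X[None, :]) / (1 - theta[:, None])
pmf = stats.binom.pmf(X[None, :], N, theta[:, None])
J = np.sum(pmf * score**2, axis=1)               # E[score^2 | theta]

print(np.allclose(J, N / (theta * (1 - theta))))                                # True
print(np.allclose(np.sqrt(J) / np.sqrt(N), theta**-0.5 * (1 - theta)**-0.5))    # True
```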

Binomial model & priors Also Haldane's prior Beta(0,0) is uninformative. (How? Think of pseudo data.) But it is improper. Can a prior be an improper density? Yes, but the likelihood needs to be such that the posterior still integrates to one. With Haldane's prior, this works only when the binomial data X is both > 0 and < N. 30

Binomial model & priors For the binomial model P(X | θ), when computing the posterior p(θ | X), we have at least 3 different uninformative priors: p(θ) = U(0,1) = Beta(1,1) (Bayes-Laplace), p(θ) = Beta(1/2, 1/2) (Jeffreys), and p(θ) = Beta(0,0) (Haldane). Each of them is uninformative in a different way! A unique definition of "uninformative" does not exist. 31

Binomial model & priors Example: estimate the mortality. THIRD DEATH: "The expanded warning came as Yosemite announced that a third person had died of the disease and the number of confirmed cases rose to eight, all of them among U.S. visitors to the park." OK, it is a small data set, but we try: with the uniform prior, p(r | data) = Beta(3+1, 8−3+1). Try also the other priors (Haldane's shown in red in the plot). 32
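A sketch comparing the three priors on these data (3 deaths out of 8 cases); the Haldane posterior is proper here because 0 < X < N.

```python
# Posterior for the mortality proportion under the three "uninformative" priors.
from scipy import stats

x, N = 3, 8
priors = {'Bayes-Laplace Beta(1,1)': (1.0, 1.0),
          'Jeffreys Beta(1/2,1/2)': (0.5, 0.5),
          'Haldane Beta(0,0)': (0.0, 0.0)}   # improper prior, proper posterior here

for name, (a, b) in priors.items():
    post = stats.beta(a + x, b + N - x)
    print(name, round(post.mean(), 3),
          [round(q, 3) for q in post.interval(0.95)])
```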

Binomial model & N? In the previous slides, N was fixed (known). We can also think of situations where θ is known and X is known, but N is unknown. Exercise: solve P(N | θ, X) = P(X | N, θ)P(N)/c with a suitable choice of prior. Try e.g. a discrete uniform prior over a range of values, or P(N) ∝ 1/N. With Bayes' rule we can compute probabilities of any unknowns, given the knowns, the prior, and the likelihood (model). 33
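One possible sketch of this exercise on a discrete grid of N values with P(N) ∝ 1/N; the values θ = 0.3, X = 4 and the grid range are arbitrary assumptions.

```python
# Posterior for an unknown binomial N, with theta and X known, prior P(N) ~ 1/N.
import numpy as np
from scipy import stats

theta, X = 0.3, 4                      # hypothetical known values
N_grid = np.arange(X, 51)              # N must be at least X; assumed upper limit 50

prior = 1.0 / N_grid
prior /= prior.sum()

likelihood = stats.binom.pmf(X, N_grid, theta)   # P(X | N, theta)
posterior = likelihood * prior
posterior /= posterior.sum()                     # the normalizing constant c

print(N_grid[np.argmax(posterior)])              # posterior mode of N
print(np.sum(N_grid * posterior))                # posterior mean of N
```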

Poisson model Widely applicable: counts of disease cases, accidents, faults, births, deaths over a period of time or within an area, etc. P(X | λ) = λ^X e^(−λ) / X!, where λ = Poisson intensity = E(X). Aim to get p(λ | X). Bayes: p(λ | X) = P(X | λ)p(λ)/c

Poisson model Conjugate prior? Try the gamma density p(λ | α, β) = β^α/Γ(α) λ^(α−1) e^(−βλ). Then p(λ | X) = [λ^X e^(−λ)/X!] · [β^α/Γ(α) λ^(α−1) e^(−βλ)] / c. Simplify the expression: what density do you see (up to a normalizing constant)?

Poisson model The posterior density is Gamma(X+α, 1+β). The posterior mean is (X+α)/(1+β), which can be written as a weighted sum of the data mean X and the prior mean α/β: (X+α)/(1+β) = [1/(1+β)]·X + [β/(1+β)]·(α/β).

Poisson model With a set of observations X_1, ..., X_N: P(X_1, ..., X_N | λ) = Π_{i=1}^{N} λ^(X_i) e^(−λ) / X_i!. With the Gamma(α, β) prior we get the posterior Gamma(X_1+...+X_N+α, N+β). The posterior mean is (Σ X_i + α)/(N+β) = [N/(N+β)]·(Σ X_i / N) + [β/(N+β)]·(α/β). What happens if N → ∞, or N → 0?

Poisson model The gamma prior can be informative or uninformative. In the limit (α, β) → (0, 0), the posterior is Gamma(X_1+...+X_N, N). Compare this conjugate analysis with the binomial model and note the similarities.

Poisson model Parameterizing with exposure. Type of problems: rate of cases per year, or per 100,000 persons per year. Model: X_i ~ Poisson(λE_i), where E_i is the exposure, e.g. the population of the i-th city (in a year), and X_i is the observed number of cases.

Poisson model Example: 64 lung cancer cases in 1968-1971 in Fredericia, Denmark, population 6264. Estimate the incidence per 100,000? P(λ | X, E) = Gamma(α + ΣX_i, β + ΣE_i). With a non-informative prior, X = 64, E = 6264, we get Gamma(64, 6264) (plot of 10⁵λ).
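A minimal sketch of this calculation with scipy; the non-informative limit (α, β) → (0, 0) gives the Gamma(64, 6264) posterior quoted above, reported here on the 10⁵λ scale.

```python
# Fredericia example: Poisson counts with exposure, Gamma posterior for the rate.
from scipy import stats

cases, exposure = 64, 6264
post = stats.gamma(a=cases, scale=1.0 / exposure)     # Gamma(64, rate=6264)

print(1e5 * post.mean())                              # posterior mean of 1e5*lambda, ~1022
print([1e5 * q for q in post.interval(0.95)])         # 95% interval for 1e5*lambda
```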

Exponential model Applicable for event times, concentrations, positive measurements, ... p(X | θ) = θe^(−θX), mean E(X) = 1/θ. Aim to get p(θ | X), or p(θ | X_1, ..., X_N). Conjugate prior: Gamma(α, β). Posterior: Gamma(α+1, β+X) or Gamma(α+N, β+X_1+...+X_N).

Exponential model The posterior mean is (α+N)/(β+X_1+...+X_N). What happens if N → ∞, or N → 0? Uninformative prior: (α, β) → (0, 0). Similarities again.

Exponential model Example: the lifetimes of 10 light bulbs were T = 4.1, 0.8, 2.0, 1.5, 5.0, 0.7, 0.1, 4.2, 0.4, 1.8 years. Estimate the failure rate? (true = 0.5) T_i ~ Exp(θ). A non-informative prior gives p(θ | T) = Gamma(10, 20.6). One could also parameterize with 1/θ and use an inverse-gamma prior.
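The same posterior can be computed and summarized directly, for example:

```python
# Light-bulb example: exponential lifetimes, Gamma posterior for the failure rate.
# Non-informative limit (alpha, beta) -> (0, 0) gives Gamma(N, sum(T)) = Gamma(10, 20.6).
import numpy as np
from scipy import stats

T = np.array([4.1, 0.8, 2.0, 1.5, 5.0, 0.7, 0.1, 4.2, 0.4, 1.8])
post = stats.gamma(a=len(T), scale=1.0 / T.sum())     # Gamma(10, rate=20.6)

print(post.mean())            # posterior mean of the failure rate, ~0.49
print(post.interval(0.95))    # 95% posterior interval (true rate was 0.5)
```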

Exponential model Some observations may be censored, so we only know that T_i < c_i or T_i > c_i. The probability of the whole data is then of the form P(data | θ) = Π p(T_i | θ) · Π P(T_i < c_i | θ) · Π P(T_i > c_i | θ), with one factor per observation. Here we need cumulative probability functions, but Bayes' theorem still applies; the computation is just more complicated.
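A sketch of such a censored-data log-likelihood for the exponential model; the exact, right-censored, and left-censored observation values below are hypothetical.

```python
# Log-likelihood for exponential data with censoring: density for exact lifetimes,
# P(T > c) for right-censored observations, P(T < c) for left-censored ones.
import numpy as np
from scipy import stats

exact = np.array([4.1, 0.8, 2.0])        # fully observed lifetimes
right_c = np.array([5.0, 5.0])           # still working at time c (T > c)
left_c = np.array([0.5])                 # already failed at first check (T < c)

def log_lik(theta):
    d = stats.expon(scale=1.0 / theta)
    return (d.logpdf(exact).sum()        # densities for exact observations
            + d.logsf(right_c).sum()     # log P(T > c) for right-censored
            + d.logcdf(left_c).sum())    # log P(T < c) for left-censored

print(log_lik(0.5))   # combine with a prior on a grid of theta to get the posterior
```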

Binomial, Poisson, Exponential These are the simplest one-parameter models. Conjugate priors are available. The prior can be seen as pseudo data comparable with the actual data. It is easy to see how the new data update the prior density to the posterior density. Posterior means, variances, modes, and quantiles can be used as summaries.