Part 2: One-parameter models




Bernoulli/binomial models
Return to iid $Y_1,\ldots,Y_n \sim \mathrm{Bin}(1,\theta)$. The sampling model/likelihood is
$p(y_1,\ldots,y_n \mid \theta) = \theta^{\sum y_i} (1-\theta)^{n - \sum y_i}$
When combined with a prior $p(\theta)$, Bayes' rule gives the posterior
$p(\theta \mid y_1,\ldots,y_n) = \dfrac{\theta^{\sum y_i} (1-\theta)^{n - \sum y_i}\, p(\theta)}{p(y_1,\ldots,y_n)}$

Posterior odds ratio
If we compare the relative probabilities of any two $\theta$-values, say $\theta_a$ and $\theta_b$, we see that
$\dfrac{p(\theta_a \mid y_1,\ldots,y_n)}{p(\theta_b \mid y_1,\ldots,y_n)} = \dfrac{\theta_a^{\sum y_i}(1-\theta_a)^{n-\sum y_i}\, p(\theta_a)/p(y_1,\ldots,y_n)}{\theta_b^{\sum y_i}(1-\theta_b)^{n-\sum y_i}\, p(\theta_b)/p(y_1,\ldots,y_n)} = \left(\dfrac{\theta_a}{\theta_b}\right)^{\sum y_i} \left(\dfrac{1-\theta_a}{1-\theta_b}\right)^{n-\sum y_i} \dfrac{p(\theta_a)}{p(\theta_b)}$
This shows that the probability density at $\theta_a$ relative to that at $\theta_b$ depends on $y_1,\ldots,y_n$ only through $\sum y_i$

Sufficient statistics
From this it is possible to show that
$P(\theta \in A \mid Y_1 = y_1,\ldots,Y_n = y_n) = P\left(\theta \in A \,\middle|\, \textstyle\sum_{i=1}^n Y_i = \sum_{i=1}^n y_i\right)$
We interpret this as meaning that $\sum y_i$ contains all the information about $\theta$ available from the data; we say that $\sum y_i$ is a sufficient statistic for $\theta$ and $p(y_1,\ldots,y_n \mid \theta)$

The Binomial model
When $Y_1,\ldots,Y_n \mid \theta$ are iid $\mathrm{Bin}(1,\theta)$, we saw that the sufficient statistic $Y = \sum_{i=1}^n Y_i$ has a $\mathrm{Bin}(n,\theta)$ distribution
Having observed $\{Y = y\}$ our task is to obtain the posterior distribution of $\theta$:
$p(\theta \mid y) = \dfrac{p(y \mid \theta)\, p(\theta)}{p(y)} = \dfrac{\binom{n}{y} \theta^y (1-\theta)^{n-y}\, p(\theta)}{p(y)} = c(y)\, \theta^y (1-\theta)^{n-y}\, p(\theta)$
where $c(y)$ is a function of $y$ and not $\theta$

A uniform prior
The parameter $\theta$ is some unknown number between 0 and 1
Suppose our prior information for $\theta$ is such that all subintervals of $[0,1]$ having the same length also have the same probability:
$P(a \le \theta \le b) = P(a + c \le \theta \le b + c), \quad \text{for } 0 \le a < b < b + c \le 1$
This condition implies a uniform density for $\theta$: $p(\theta) = 1$ for all $\theta \in [0,1]$

Normalizing constant
Using the following result from calculus
$\int_0^1 \theta^{a-1}(1-\theta)^{b-1}\, d\theta = \dfrac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$
where $\Gamma(n) = (n-1)!$ for integer $n$ and is evaluated in R as gamma(n), we can find out what $c(y)$ is under the uniform prior:
$\int_0^1 p(\theta \mid y)\, d\theta = 1 \;\Rightarrow\; c(y) = \dfrac{\Gamma(n+2)}{\Gamma(y+1)\Gamma(n-y+1)}$
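As a quick numerical sanity check of this calculus result, here is a minimal R sketch (the values a = 3, b = 9 are arbitrary choices of mine):

a <- 3; b <- 9
lhs <- integrate(function(theta) theta^(a - 1) * (1 - theta)^(b - 1), 0, 1)$value
rhs <- gamma(a) * gamma(b) / gamma(a + b)  # same as beta(a, b) in R
c(lhs, rhs)                                # both approximately 0.00202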

Beta posterior
So the posterior distribution is
$p(\theta \mid y) = \dfrac{\Gamma(n+2)}{\Gamma(y+1)\Gamma(n-y+1)}\, \theta^y (1-\theta)^{n-y} = \dfrac{\Gamma(n+2)}{\Gamma(y+1)\Gamma(n-y+1)}\, \theta^{(y+1)-1} (1-\theta)^{(n-y+1)-1}$
i.e. $\theta \mid y \sim \mathrm{Beta}(y+1,\, n-y+1)$
To wrap up: a uniform prior, plus a Bernoulli/binomial likelihood (sampling model), gives a Beta posterior

Example: Happiness
Each female of age 65 or over in the 1998 General Social Survey was asked whether or not they were generally happy
Let $Y_i = 1$ if respondent $i$ reported being generally happy, and $Y_i = 0$ otherwise, for $i = 1,\ldots,n = 129$ individuals
Since $129 \ll N$, the total size of the female senior citizen population, our joint beliefs about $Y_1,\ldots,Y_{129}$ are well approximated by (i) our beliefs about $\theta = \sum_{i=1}^N Y_i / N$ and (ii) the sampling model that, conditional on $\theta$, the $Y_i$'s are iid Bernoulli RVs with expectation $\theta$

Example: Happiness, IID Bernoulli likelihood
This sampling model says that the probability for any potential outcome $\{y_1,\ldots,y_{129}\}$, conditional on $\theta$, is given by
$p(y_1,\ldots,y_{129} \mid \theta) = \theta^{\sum_{i=1}^{129} y_i} (1-\theta)^{129 - \sum_{i=1}^{129} y_i}$
The survey revealed that 118 individuals report being generally happy (91%) and 11 individuals do not (9%)
So the probability of these data for a given value of $\theta$ is
$p(y_1,\ldots,y_{129} \mid \theta) = \theta^{118} (1-\theta)^{11}$

Example: Happiness, binomial likelihood
We could work with this IID Bernoulli likelihood directly. But for a sense of the bigger picture, we'll use the fact that we know
$Y = \sum_{i=1}^n Y_i \sim \mathrm{Bin}(n, \theta)$
In this case, our observed data is $y = 118$
So (now) the probability of these data for a given value of $\theta$ is
$p(y \mid \theta) = \binom{129}{118}\, \theta^{118} (1-\theta)^{11}$
Notice that this is within a constant factor of the previous Bernoulli likelihood

Example: Happiness, posterior
If we take a uniform prior for $\theta$, expressing our ignorance, then we know the posterior is
$p(\theta \mid y) = \mathrm{Beta}(y+1,\, n-y+1)$
For $n = 129$ and $y = 118$, this gives
$\theta \mid \{Y = 118\} \sim \mathrm{Beta}(119, 12)$
[Figure: posterior Beta(119, 12) density and the flat prior, plotted over theta in [0, 1].]
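A figure like this can be reproduced with a short R sketch (purely illustrative):

theta <- seq(0, 1, length.out = 500)
plot(theta, dbeta(theta, 119, 12), type = "l", ylab = "density")  # posterior Beta(119, 12)
lines(theta, dbeta(theta, 1, 1), lty = 2)                         # uniform prior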

Uniform is Beta
The uniform prior has $p(\theta) = 1$ for all $\theta \in [0,1]$
This can be thought of as a Beta distribution with parameters $a = 1, b = 1$:
$p(\theta) = \dfrac{\Gamma(2)}{\Gamma(1)\Gamma(1)}\, \theta^{1-1} (1-\theta)^{1-1} = 1$
(under the convention that $\Gamma(1) = 1$), i.e. $\theta \sim \mathrm{Beta}(1, 1)$
We saw that if $Y \mid \theta \sim \mathrm{Bin}(n,\theta)$ then $\theta \mid \{Y = y\} \sim \mathrm{Beta}(1+y,\, 1+n-y)$

General Beta prior
Suppose $\theta \sim \mathrm{Beta}(a,b)$ and $Y \mid \theta \sim \mathrm{Bin}(n,\theta)$. Then
$p(\theta \mid y) \propto \theta^{a+y-1} (1-\theta)^{b+n-y-1}$
i.e. $\theta \mid y \sim \mathrm{Beta}(a+y,\, b+n-y)$
In other words, the constant of proportionality must be
$c(n, y, a, b) = \dfrac{\Gamma(a+b+n)}{\Gamma(a+y)\Gamma(b+n-y)}$
How do I know? $p(\theta \mid y)$ (and the beta density) must integrate to 1

Posterior proportionality
We will use this trick over and over again! I.e., if we recognize that the posterior distribution is proportional to a known probability density, then it must be identical to that density
But careful! The constant of proportionality must be constant with respect to $\theta$

Conjugacy
We have shown that a beta prior distribution and a binomial sampling model lead to a beta posterior distribution
To reflect this, we say that the class of beta priors is conjugate for the binomial sampling model
Formally, we say that a class $\mathcal{P}$ of prior distributions for $\theta$ is conjugate for a sampling model $p(y \mid \theta)$ if
$p(\theta) \in \mathcal{P} \;\Rightarrow\; p(\theta \mid y) \in \mathcal{P}$
Conjugate priors make posterior calculations easy, but might not actually represent our prior information

Posterior summaries
If we are interested in point-summaries of our posterior inference, then the (full) distribution gives us many points to choose from
E.g., if $\theta \mid \{Y = y\} \sim \mathrm{Beta}(a+y,\, b+n-y)$ then
$E\{\theta \mid y\} = \dfrac{a+y}{a+b+n}$
$\mathrm{mode}(\theta \mid y) = \dfrac{a+y-1}{a+b+n-2}$
$\mathrm{Var}[\theta \mid y] = \dfrac{E\{\theta \mid y\}\, E\{1-\theta \mid y\}}{a+b+n+1}$
not including quantiles, etc.
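For concreteness, a small R sketch computing these summaries for the happiness data (a = b = 1, n = 129, y = 118); the helper name beta_summaries is mine, not from the slides:

beta_summaries <- function(a, b, n, y) {
  A <- a + y; B <- b + n - y  # posterior is Beta(A, B)
  c(mean = A / (A + B),
    mode = (A - 1) / (A + B - 2),
    var  = A * B / ((A + B)^2 * (A + B + 1)))
}
beta_summaries(1, 1, 129, 118)  # mean ~0.908, mode ~0.915, var ~0.00063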

Combining information
The posterior expectation $E\{\theta \mid y\}$ is easily recognized as a combination of prior and data information:
$E\{\theta \mid y\} = \dfrac{a+b}{a+b+n} \cdot \dfrac{a}{a+b} + \dfrac{n}{a+b+n} \cdot \dfrac{y}{n}$
(prior expectation times its weight, plus the data average times its weight)
I.e., for this model and prior distribution, the posterior mean is a weighted average of the prior mean and the sample average, with weights proportional to $a+b$ and $n$ respectively

Prior sensitivity
This leads to the interpretation of $a$ and $b$ as prior data:
$a \approx$ prior number of 1's
$b \approx$ prior number of 0's
$a + b \approx$ prior sample size
If $n \gg a+b$, then it seems reasonable that most of our information should be coming from the data as opposed to the prior
Indeed: $\dfrac{a+b}{a+b+n} \approx 0$, so $E\{\theta \mid y\} \approx \dfrac{y}{n}$ and $\mathrm{Var}[\theta \mid y] \approx \dfrac{1}{n}\,\dfrac{y}{n}\left(1 - \dfrac{y}{n}\right)$

Prediction
An important feature of Bayesian inference is the existence of a predictive distribution for new observations
Revert, for the moment, to our notation for Bernoulli data. Let $y_1,\ldots,y_n$ be the outcomes of a sample of $n$ binary RVs
Let $\tilde{Y} \in \{0,1\}$ be an additional outcome from the same population that has yet to be observed
The predictive distribution of $\tilde{Y}$ is the conditional distribution of $\tilde{Y}$ given $\{Y_1 = y_1,\ldots,Y_n = y_n\}$

Predictive distribution
For conditionally IID binary variables the predictive distribution can be derived from the distribution of $\tilde{Y}$ given $\theta$ and the posterior distribution of $\theta$:
$p(\tilde{y} = 1 \mid y_1,\ldots,y_n) = E\{\theta \mid y_1,\ldots,y_n\} = \dfrac{a + \sum_{i=1}^n y_i}{a+b+n}$
$p(\tilde{y} = 0 \mid y_1,\ldots,y_n) = 1 - p(\tilde{y} = 1 \mid y_1,\ldots,y_n) = \dfrac{b + n - \sum_{i=1}^n y_i}{a+b+n}$

Two points about prediction
1. The predictive distribution does not depend upon any unknown quantities; if it did, we would not be able to use it to make predictions
2. The predictive distribution depends on our observed data, i.e. $\tilde{Y}$ is not independent of $Y_1,\ldots,Y_n$; this is because observing $Y_1,\ldots,Y_n$ gives information about $\theta$, which in turn gives information about $\tilde{Y}$

Example: Posterior predictive
Consider a uniform prior, $\mathrm{Beta}(1,1)$, and let $Y = \sum_{i=1}^n y_i$. Then
$P(\tilde{Y} = 1 \mid Y = y) = E\{\theta \mid Y = y\} = \dfrac{y+1}{n+2} = \dfrac{2}{2+n}\cdot\dfrac{1}{2} + \dfrac{n}{2+n}\cdot\dfrac{y}{n}$
but $\mathrm{mode}(\theta \mid Y = y) = \dfrac{y}{n}$
Does this discrepancy between these two posterior summaries of our information make sense?
Consider the case in which $Y = 0$, for which $\mathrm{mode}(\theta \mid Y = 0) = 0$ but $P(\tilde{Y} = 1 \mid Y = 0) = 1/(2+n)$
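A quick numerical illustration of the $Y = 0$ case in R (n = 10 is an arbitrary choice of mine):

n <- 10; y <- 0
(y + 1) / (n + 2)  # P(Y.tilde = 1 | Y = 0) = posterior mean, about 0.083
y / n              # posterior mode = 0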

Confidence regions
An interval $[l(y), u(y)]$, based on the observed data $Y = y$, has 95% Bayesian coverage for $\theta$ if
$P(l(y) < \theta < u(y) \mid Y = y) = 0.95$
The interpretation of this interval is that it describes your information about the true value of $\theta$ after you have observed $Y = y$
Such intervals are typically called credible intervals, to distinguish them from frequentist confidence intervals, which are intervals (based upon $\hat{\theta}$) constructed so that they cover the true $\theta_0$ 95% of the time
Both, confusingly, use the acronym CI

Quantile-based (Bayesian) CI
Perhaps the easiest way to obtain a credible interval is to use the posterior quantiles
To make a $100 \times (1-\alpha)$% quantile-based CI, find numbers $\theta_{\alpha/2} < \theta_{1-\alpha/2}$ such that
(1) $P(\theta < \theta_{\alpha/2} \mid Y = y) = \alpha/2$
(2) $P(\theta > \theta_{1-\alpha/2} \mid Y = y) = \alpha/2$
The numbers $\theta_{\alpha/2}$, $\theta_{1-\alpha/2}$ are the $\alpha/2$ and $1-\alpha/2$ posterior quantiles of $\theta$

Example: Binomial sampling and uniform prior
Suppose out of $n = 10$ conditionally independent draws of a binary random variable we observe $Y = 2$ ones
Using a uniform prior distribution for $\theta$, the posterior distribution is $\theta \mid \{Y = 2\} \sim \mathrm{Beta}(1+2,\, 1+8)$
A 95% CI can be obtained from the 0.025 and 0.975 quantiles of this beta distribution
These quantiles are 0.06 and 0.52, respectively, and so the posterior probability that $\theta \in [0.06, 0.52]$ is 95%
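These quantiles come straight from qbeta in R:

qbeta(c(0.025, 0.975), 1 + 2, 1 + 8)  # approximately 0.06 and 0.52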

Example: a quantile-based CI drawback
Notice that there are $\theta$-values outside the quantile-based CI that have higher probability density than some points inside the interval
[Figure: Beta(3, 9) posterior density with the 95% quantile-based CI marked.]

HPD region
The quantile-based drawback suggests a more restrictive type of interval may be warranted
A $100 \times (1-\alpha)$% high posterior density (HPD) region consists of a subset $s(y)$ of the parameter space such that
1. $P(\theta \in s(y) \mid Y = y) = 1 - \alpha$
2. If $\theta_a \in s(y)$ and $\theta_b \notin s(y)$, then $p(\theta_a \mid Y = y) > p(\theta_b \mid Y = y)$
All points in an HPD region have a higher posterior density than points outside the region

Constructing an HPD region
However, an HPD region might not be an interval if the posterior density is multimodal
When it is an interval, the basic idea behind its construction is as follows:
1. Gradually move a horizontal line down across the density, including in the HPD region all of the $\theta$-values having a density above the horizontal line
2. Stop when the posterior probability of the $\theta$-values in the region reaches $1 - \alpha$

Example: HPD region
Numerically, we may collect the highest-density points whose cumulative probability is at least $1 - \alpha$
[Figure: Beta(3, 9) posterior density with the 95% quantile-based CI and the 50%, 75%, and 95% HPD regions marked.]
The 95% HPD region, approximately [0.04, 0.48], is narrower than the quantile-based CI, yet both contain 95% probability
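One way to implement the "horizontal line" construction numerically on a grid; this is my own R sketch, not code from the slides:

theta <- seq(0, 1, length.out = 10000)
dens  <- dbeta(theta, 3, 9)                          # Beta(3, 9) posterior
ord   <- order(dens, decreasing = TRUE)              # highest-density points first
keep  <- ord[cumsum(dens[ord]) / sum(dens) <= 0.95]  # accumulate 95% of the mass
range(theta[keep])                                   # roughly [0.04, 0.48]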

Example: Estimating the probability of a female birth
The proportion of births that are female has long been a topic of interest both scientifically and to the lay public
E.g., Laplace, perhaps the first Bayesian, analysed Parisian birth data from 1745 with a uniform prior and a binomial sampling model:
241,945 girls and 251,527 boys

Example: female birth posterior
The posterior distribution for $\theta$ is ($n = 493472$)
$\theta \mid \{Y = 241945\} \sim \mathrm{Beta}(241945 + 1,\, 251527 + 1)$
Now, we can use the posterior CDF to calculate
$P(\theta > 0.5 \mid Y = 241945) =$ pbeta(0.5, 241945+1, 251527+1, lower.tail=FALSE) $= 1.15 \times 10^{-42}$

Example: female birth with placenta previa
Placenta previa is an unusual condition of pregnancy in which the placenta is implanted very low in the uterus, obstructing the fetus from a normal vaginal delivery
An early study concerning the sex of placenta previa births in Germany found that of a total of 980 births, 437 were female
How much evidence does this provide for the claim that the proportion of female births in the population of placenta previa births is less than 0.485, the proportion of female births in the general population?

Example: female birth with placenta previa
The posterior distribution is $\theta \mid \{Y = 437\} \sim \mathrm{Beta}(437+1,\, 543+1)$
The summary statistics are
$E\{\theta \mid Y = 437\} = 0.446$
$\mathrm{sd}[\theta \mid Y = 437] = 0.016$
$\mathrm{med}(\theta \mid Y = 437) = 0.446$
The central 95% CI is $[0.415, 0.476]$, and $P(\theta > 0.485 \mid Y = 437) = 0.007$
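These summaries can be reproduced in R:

a <- 437 + 1; b <- 543 + 1               # posterior Beta(438, 544)
a / (a + b)                              # mean, about 0.446
sqrt(a * b / ((a + b)^2 * (a + b + 1)))  # sd, about 0.016
qbeta(c(0.5, 0.025, 0.975), a, b)        # median and central 95% CI
pbeta(0.485, a, b, lower.tail = FALSE)   # P(theta > 0.485 | y), about 0.007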

The Poisson model
Some measurements, such as a person's number of children or number of friends, have values that are whole numbers
In these cases the sample space is $\mathcal{Y} = \{0, 1, 2, \ldots\}$
Perhaps the simplest probability model on $\mathcal{Y}$ is the Poisson model
A RV $Y$ has a Poisson distribution with mean $\theta$ if
$P(Y = y \mid \theta) = \dfrac{\theta^y e^{-\theta}}{y!} \quad \text{for } y \in \{0, 1, 2, \ldots\}$

Mean-variance relationship
Poisson RVs have the property that
$E\{Y \mid \theta\} = \mathrm{Var}[Y \mid \theta] = \theta$
People sometimes say that the Poisson family of distributions has a mean-variance relationship because if one Poisson distribution has a larger mean than another, it will have a larger variance as well

IID sampling model
If we take $Y_1,\ldots,Y_n$ iid $\mathrm{Pois}(\theta)$ then the joint PMF is
$P(Y_1 = y_1,\ldots,Y_n = y_n \mid \theta) = \prod_{i=1}^n p(y_i \mid \theta) = \prod_{i=1}^n \dfrac{1}{y_i!}\, \theta^{y_i} e^{-\theta} = c(y_1,\ldots,y_n)\, \theta^{\sum y_i} e^{-n\theta}$

Posterior odds ratio
Comparing two values of $\theta$ a posteriori, we have
$\dfrac{p(\theta_a \mid y_1,\ldots,y_n)}{p(\theta_b \mid y_1,\ldots,y_n)} = \dfrac{c(y_1,\ldots,y_n)\, e^{-n\theta_a}\, \theta_a^{\sum y_i}\, p(\theta_a)}{c(y_1,\ldots,y_n)\, e^{-n\theta_b}\, \theta_b^{\sum y_i}\, p(\theta_b)} = \dfrac{e^{-n\theta_a}}{e^{-n\theta_b}} \left(\dfrac{\theta_a}{\theta_b}\right)^{\sum y_i} \dfrac{p(\theta_a)}{p(\theta_b)}$
As in the case of the IID Bernoulli model, $\sum_{i=1}^n Y_i$ is a sufficient statistic
Furthermore $\sum_{i=1}^n Y_i \sim \mathrm{Pois}(n\theta)$

Conjugate prior
Recall that a class of priors is conjugate for a sampling model if the posterior is also in that class
For the Poisson model, our posterior distribution for $\theta$ has the following form
$p(\theta \mid y_1,\ldots,y_n) \propto p(y_1,\ldots,y_n \mid \theta)\, p(\theta) \propto \theta^{\sum y_i} e^{-n\theta}\, p(\theta)$
This means that whatever our conjugate class of densities is, it will have to include terms like $\theta^{c_1} e^{-c_2 \theta}$ for numbers $c_1$ and $c_2$

Conjugate gamma prior
The simplest class of probability distributions matching this description is the gamma family
$p(\theta) = \dfrac{b^a}{\Gamma(a)}\, \theta^{a-1} e^{-b\theta} \quad \text{for } \theta, a, b > 0$
Some properties include
$E\{\theta\} = a/b$
$\mathrm{Var}[\theta] = a/b^2$
$\mathrm{mode}(\theta) = (a-1)/b$ if $a > 1$, and $0$ if $a \le 1$
We shall denote this family as $G(a, b)$

Posterior distribution
Suppose $Y_1,\ldots,Y_n \mid \theta$ iid $\mathrm{Pois}(\theta)$ and $\theta \sim G(a, b)$. Then
$p(\theta \mid y_1,\ldots,y_n) \propto \theta^{a + \sum y_i - 1}\, e^{-(b+n)\theta}$
This is evidently a gamma distribution, and we have confirmed the conjugacy of the gamma family for the Poisson sampling model:
$\theta \sim G(a, b)$ and $Y_1,\ldots,Y_n \mid \theta \sim \mathrm{Pois}(\theta)$ give $\{\theta \mid Y_1,\ldots,Y_n\} \sim G\left(a + \textstyle\sum_{i=1}^n Y_i,\; b + n\right)$
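A quick check of this update in R with made-up values (a = 2, b = 1 and a small simulated Poisson sample; all numbers here are illustrative):

set.seed(1)
y <- rpois(20, lambda = 2)                    # simulated data with true theta = 2
a <- 2; b <- 1                                # prior G(a, b)
c(shape = a + sum(y), rate = b + length(y))   # posterior G(a + sum(y), b + n)
(a + sum(y)) / (b + length(y))                # posterior mean, should be near 2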

Combining information
Estimation and prediction proceed in a manner similar to that in the binomial model
The posterior expectation of $\theta$ is a convex combination of the prior expectation and the sample average:
$E\{\theta \mid y_1,\ldots,y_n\} = \dfrac{b}{b+n}\cdot\dfrac{a}{b} + \dfrac{n}{b+n}\cdot\dfrac{\sum y_i}{n}$
$b$ is the number of prior observations; $a$ is the sum of the counts from $b$ prior observations
For large $n \gg b$, the information in the data dominates:
$E\{\theta \mid y_1,\ldots,y_n\} \approx \bar{y}, \quad \mathrm{Var}[\theta \mid y_1,\ldots,y_n] \approx \dfrac{\bar{y}}{n}$

Posterior predictive
Predictions about additional data can be obtained with the posterior predictive distribution
$p(\tilde{y} \mid y_1,\ldots,y_n) = \dfrac{(b+n)^{a + \sum y_i}}{\Gamma(\tilde{y}+1)\,\Gamma(a + \sum y_i)} \int_0^\infty \theta^{a + \sum y_i + \tilde{y} - 1}\, e^{-(b+n+1)\theta}\, d\theta = \dfrac{\Gamma(a + \sum y_i + \tilde{y})}{\Gamma(\tilde{y}+1)\,\Gamma(a + \sum y_i)} \left(\dfrac{b+n}{b+n+1}\right)^{a + \sum y_i} \left(\dfrac{1}{b+n+1}\right)^{\tilde{y}}$
for $\tilde{y} \in \{0, 1, 2, \ldots\}$

Posterior predictive
This is a negative binomial distribution with parameters $(a + \sum y_i,\; b + n)$, for which
$E\{\tilde{Y} \mid y_1,\ldots,y_n\} = \dfrac{a + \sum y_i}{b+n} = E\{\theta \mid y_1,\ldots,y_n\}$
$\mathrm{Var}[\tilde{Y} \mid y_1,\ldots,y_n] = \dfrac{a + \sum y_i}{b+n}\cdot\dfrac{b+n+1}{b+n} = \mathrm{Var}[\theta \mid y_1,\ldots,y_n]\,(b+n+1) = E\{\theta \mid y_1,\ldots,y_n\}\cdot\dfrac{b+n+1}{b+n}$
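In R this predictive distribution can be evaluated with dnbinom, using size = a + Σ y_i and prob = (b+n)/(b+n+1); the numbers below are hypothetical prior/data summaries, just for illustration:

a <- 2; b <- 1; n <- 20; sy <- 40        # hypothetical a, b, n and sum(y)
size <- a + sy; prob <- (b + n) / (b + n + 1)
dnbinom(0:5, size = size, prob = prob)   # P(Y.tilde = 0), ..., P(Y.tilde = 5)
size * (1 - prob) / prob                 # predictive mean = (a + sy)/(b + n) = 2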

Example: birth rates and education
Over the course of the 1990s the General Social Survey gathered data on the educational attainment and number of children of 155 women who were 40 years of age at the time of their participation in the survey
These women were in their 20s in the 1970s, a period of historically low fertility rates in the United States
In this example we will compare the women with college degrees to those without in terms of their numbers of children

Example: sampling model(s)
Let $Y_{1,1},\ldots,Y_{n_1,1}$ denote the numbers of children for the $n_1$ women without college degrees and $Y_{1,2},\ldots,Y_{n_2,2}$ be the data for the $n_2$ women with degrees
For this example, we will use the following sampling models
$Y_{1,1},\ldots,Y_{n_1,1}$ iid $\mathrm{Pois}(\theta_1)$
$Y_{1,2},\ldots,Y_{n_2,2}$ iid $\mathrm{Pois}(\theta_2)$
The (sufficient) data are
no bachelor's: $n_1 = 111$, $\sum_{i=1}^{n_1} y_{i,1} = 217$, $\bar{y}_1 = 1.95$
bachelor's: $n_2 = 44$, $\sum_{i=1}^{n_2} y_{i,2} = 66$, $\bar{y}_2 = 1.50$

Example: prior(s) and posterior(s)
In the case where $\theta_1, \theta_2$ iid $G(a = 2, b = 1)$, we have the following posterior distributions:
$\theta_1 \mid \{n_1 = 111,\ \textstyle\sum Y_{i,1} = 217\} \sim G(2 + 217,\; 1 + 111)$
$\theta_2 \mid \{n_2 = 44,\ \textstyle\sum Y_{i,2} = 66\} \sim G(2 + 66,\; 1 + 44)$
Posterior means, modes and 95% quantile-based CIs for $\theta_1$ and $\theta_2$ can be obtained from their gamma posterior distributions, as in the sketch below
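A short R sketch for these summaries and the comparison on the next slide (the Monte Carlo sample size of 10^5 is my choice):

a <- 2; b <- 1
qgamma(c(0.025, 0.5, 0.975), a + 217, b + 111)  # no bachelor's: 95% CI and median
qgamma(c(0.025, 0.5, 0.975), a + 66,  b + 44)   # bachelor's: 95% CI and median
th1 <- rgamma(1e5, a + 217, b + 111)
th2 <- rgamma(1e5, a + 66,  b + 44)
mean(th1 > th2)                                 # P(theta1 > theta2 | data), about 0.97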

Example: prior(s) and posterior(s)
[Figure: the common gamma prior and the two posterior densities ("post none" and "post bach") for theta.]
The posterior(s) provide substantial evidence that $\theta_1 > \theta_2$. E.g.,
$P(\theta_1 > \theta_2 \mid \textstyle\sum Y_{i,1} = 217,\ \sum Y_{i,2} = 66) = 0.97$

Example: posterior predictive(s)
Consider a randomly sampled individual from each population. To what extent do we expect the uneducated one to have more children than the other?
[Figure: posterior predictive distributions of the number of children for the two groups ("none" and "bach").]
$P(\tilde{Y}_1 > \tilde{Y}_2 \mid \textstyle\sum Y_{i,1} = 217,\ \sum Y_{i,2} = 66) = 0.48$
$P(\tilde{Y}_1 = \tilde{Y}_2 \mid \textstyle\sum Y_{i,1} = 217,\ \sum Y_{i,2} = 66) = 0.22$
The distinction between $\{\theta_1 > \theta_2\}$ and $\{\tilde{Y}_1 > \tilde{Y}_2\}$ is important: strong evidence of a difference between two populations does not mean the difference is large
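These predictive probabilities can be approximated by Monte Carlo in R, drawing each $\tilde{Y}$ from its negative binomial posterior predictive (again with 10^5 draws, my choice):

y1 <- rnbinom(1e5, size = 2 + 217, prob = (1 + 111) / (1 + 111 + 1))
y2 <- rnbinom(1e5, size = 2 + 66,  prob = (1 + 44)  / (1 + 44 + 1))
mean(y1 > y2)    # about 0.48
mean(y1 == y2)   # about 0.22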

Non-informative priors
Priors can be difficult to construct
There has long been a desire for prior distributions that can be guaranteed to play a minimal role in the posterior distribution
Such distributions are sometimes called reference priors, and described as vague, flat, diffuse, or non-informative
The rationale for using non-informative priors is often said to be to "let the data speak for themselves", so that inferences are unaffected by information external to the current data

Jeffreys invariance principle
One approach that is sometimes used to define a non-informative prior distribution was introduced by Jeffreys (1946), based on considering one-to-one transformations of the parameter
Jeffreys' general principle is that any rule for determining the prior density should yield an equivalent posterior if applied to the transformed parameter
Naturally, one must take the sampling model into account in order to study this invariance

The Jeffreys prior
Jeffreys' principle leads to defining the non-informative prior as $p(\theta) \propto [i(\theta)]^{1/2}$, where $i(\theta)$ is the Fisher information for $\theta$
Recall that
$i(\theta) = E_\theta\left\{\left(\dfrac{d}{d\theta} \log p(Y \mid \theta)\right)^2\right\} = -E_\theta\left\{\dfrac{d^2}{d\theta^2} \log p(Y \mid \theta)\right\}$
A prior so constructed is called the Jeffreys prior for the sampling model $p(y \mid \theta)$

Example: Jeffreys prior for the binomial sampling model
Consider the binomial distribution $Y \sim \mathrm{Bin}(n, \theta)$, which has log-likelihood
$\log p(y \mid \theta) = c + y \log \theta + (n - y) \log(1 - \theta)$
Therefore, since $\frac{d^2}{d\theta^2} \log p(y \mid \theta) = -\frac{y}{\theta^2} - \frac{n-y}{(1-\theta)^2}$ and $E_\theta\{Y\} = n\theta$,
$i(\theta) = -E_\theta\left\{\dfrac{d^2}{d\theta^2} \log p(Y \mid \theta)\right\} = \dfrac{n}{\theta(1-\theta)}$
So the Jeffreys prior density is then $p(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}$, which is the same as $\theta \sim \mathrm{Beta}(\tfrac{1}{2}, \tfrac{1}{2})$

Example: non-informative?
This $\theta \sim \mathrm{Beta}(\tfrac{1}{2}, \tfrac{1}{2})$ Jeffreys prior is non-informative in some sense, but not in all senses
The posterior expectation would give $a + b = 1$ sample's worth of weight to the prior mean $\tfrac{a}{a+b} = \tfrac{1}{2}$
This Jeffreys prior is less informative than the uniform prior, which gives a weight of 2 samples to the prior mean (also $\tfrac{1}{2}$)
In this sense, the least informative prior is $\theta \sim \mathrm{Beta}(0, 0)$
But is this a valid prior distribution?

Propriety of priors
We call a prior density $p(\theta)$ proper if it does not depend on data and is integrable, i.e., $\int_\Theta p(\theta)\, d\theta < \infty$
Proper priors lead to valid joint probability models $p(\theta, y)$ and proper posteriors ($\int_\Theta p(\theta \mid y)\, d\theta < \infty$)
Improper priors do not provide valid joint probability models, but they can lead to proper posteriors by proceeding with the Bayesian algebra $p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$

Example: continued
The $\theta \sim \mathrm{Beta}(0, 0)$ prior is improper since $\int_0^1 \theta^{-1}(1-\theta)^{-1}\, d\theta = \infty$
However, the $\mathrm{Beta}(0 + y,\, 0 + n - y)$ posterior that results is proper as long as $y \ne 0, n$
A $\theta \sim \mathrm{Beta}(\epsilon, \epsilon)$ prior, for $0 < \epsilon \le 1$, is always proper, and (in the limit $\epsilon \to 0$) non-informative
In practice, the difference between these alternatives ($\mathrm{Beta}(0,0)$, $\mathrm{Beta}(\tfrac12,\tfrac12)$, $\mathrm{Beta}(1,1)$) is small; all three allow the data (likelihood) to dominate in the posterior

Example: Jeffreys prior for the Poisson sampling model
Consider the Poisson distribution $Y \sim \mathrm{Pois}(\theta)$, which has log-likelihood
$\log p(y \mid \theta) = c + y \log \theta - \theta$
Therefore, since $\frac{d^2}{d\theta^2} \log p(y \mid \theta) = -\frac{y}{\theta^2}$ and $E_\theta\{Y\} = \theta$,
$i(\theta) = -E_\theta\left\{\dfrac{d^2}{d\theta^2} \log p(Y \mid \theta)\right\} = \dfrac{1}{\theta}$
So the Jeffreys prior density is then $p(\theta) \propto \theta^{-1/2}$

Example: Improper Jeffreys prior
Since $\int_0^\infty \theta^{-1/2}\, d\theta = \infty$, this prior is improper
However, we can interpret it as a $G(\tfrac{1}{2}, 0)$, i.e., within the conjugate gamma family, to see that the posterior is
$G\left(\tfrac{1}{2} + \textstyle\sum_{i=1}^n y_i,\; 0 + n\right)$
This is proper as long as we have observed at least one data point, i.e., $n \ge 1$, since $\int_0^\infty \theta^{\alpha - 1} e^{-\beta\theta}\, d\theta < \infty$ for all $\alpha > 0$ and $\beta > 0$ (here $\beta = n \ge 1$)

Example: non-informative?
Again, this $G(\tfrac{1}{2}, 0)$ Jeffreys prior is non-informative in some sense, but not in all senses
The (improper) prior $p(\theta) \propto \theta^{-1}$, or $\theta \sim G(0, 0)$, is in some sense equivalent since it gives the same weight ($b = 0$) to the prior mean (although a different prior mean)
It can be shown that, in the limit as the prior parameters $(a, b) \to (0, 0)$, the posterior expectation approaches the MLE; $p(\theta) \propto \theta^{-1}$ is a typical default for this reason
As before, the practical difference between these choices is small

Notes of caution
The search for non-informative priors has several problems, including:
it can be misguided: if the likelihood (data) is truly dominant, then the choice among relatively flat priors cannot matter
for many problems there is no clear choice
difficulties arise when averaging over competing models (Lindley's paradox)
Bottom line: if so few data are available that the choice of non-informative prior matters, then one should put relevant information into the prior

Example: Heart transplant mortality rate
For a particular hospital, we observe
$n$: the number of transplant surgeries
$y$: the number of deaths within 30 days of surgery
In addition, one can predict the probability of death for each patient based upon a model that uses information such as the patient's medical condition before the surgery, gender, and race. From this, one can obtain
$e$: the expected number of deaths (exposure)
A standard model assumes that the number of deaths follows a Poisson distribution with mean $e\lambda$, and the objective is to estimate the mortality rate per unit exposure $\lambda$

Example: problems with the MLE The standard estimate of λ is the MLE ˆλ = y/e. Unfortunately, this estimate can be poor when the number of deaths y is close to zero In this situation, when small death counts are possible, it is desirable to use a Bayesian estimate that uses prior knowledge about the size of the mortality rate

Example: prior information
A convenient source of prior information is heart transplant data from a small group of hospitals that we believe has the same rate of mortality as the hospital of interest
Suppose we observe
$z_j$: the number of deaths
$o_j$: the exposure
where $Z_j$ iid $\mathrm{Pois}(o_j \lambda)$ for ten hospitals ($j = 1,\ldots,10$)
If we assign $\lambda$ the non-informative prior $p(\lambda) \propto \lambda^{-1}$ then the posterior is $G\left(\textstyle\sum_{j=1}^{10} z_j,\; \sum_{j=1}^{10} o_j\right)$ (check this)

Example: the prior from data
Suppose that, in this example, we have the prior data
$\sum_{j=1}^{10} z_j = 16, \quad \sum_{j=1}^{10} o_j = 15174$
So for our prior (to analyse the data for our hospital) we shall use $\lambda \sim G(16, 15174)$
Now, we consider inference about the heart transplant death rate for two hospitals
A. small exposure ($e_a = 66$) and $y_a = 1$ death
B. larger exposure ($e_b = 1767$) and $y_b = 4$ deaths

Example: posterior
The posterior distributions are as follows
A. $\lambda_a \mid y_a, e_a \sim G(16 + 1,\; 15174 + 66)$
B. $\lambda_b \mid y_b, e_b \sim G(16 + 4,\; 15174 + 1767)$
[Figure: prior and posterior densities ("post A" and "post B") for lambda.]
$P(\lambda_a < \lambda_b \mid y_a, e_a, y_b, e_b) = 0.57$
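The comparison probability can be checked by Monte Carlo in R (10^5 draws, my choice):

lambda_a <- rgamma(1e5, 16 + 1, 15174 + 66)    # posterior draws for hospital A
lambda_b <- rgamma(1e5, 16 + 4, 15174 + 1767)  # posterior draws for hospital B
mean(lambda_a < lambda_b)                      # about 0.57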