Bayesian Methods for Regression in R


Bayesian Methods for Regression in R Nels Johnson Lead Collaborator, Laboratory for Interdisciplinary Statistical Analysis 03/14/2011 Nels Johnson (LISA) Bayesian Regression 03/14/2011 1 / 26

Outline What is Linear Regression? Intro to Bayesian Statistics More on Priors Bayesian Estimation and Inference Computing Issues Examples Nels Johnson (LISA) Bayesian Regression 03/14/2011 2 / 26

Regression Suppose you have a variable Y whose outcome is considered random. Examples: the amount of soil erosion at a particular site; how much time it takes for a fungus to develop resistance to a fungicide; how many feet a moving car needs to come to a complete stop. Suppose you also have some other variables X which are treated as fixed. Examples: the kind of plants growing in the soil; the composition of the fungicide; how fast the car is traveling. If you think the observed value of Y depends on X, then regression is a tool for modeling that dependence. Nels Johnson (LISA) Bayesian Regression 03/14/2011 3 / 26

Linear Regression Linear regression assumes that the underlying structure of Y is a linear combination of the variables X. Example: Y_dist = β₀ + X_speed β_speed + ε. Often the formula for linear regression is condensed using matrix algebra: Y = Xβ + ε. Since Y is considered a random variable, we naturally expect it not to follow the underlying structure exactly. ε is the term of the model that describes Y's random variation around its underlying structure Xβ. The most common way to describe this random variation is with ε ~ N(0, σ²). Nels Johnson (LISA) Bayesian Regression 03/14/2011 4 / 26

Estimating Parameters Since we don't know the values of β and σ², we'll need to estimate them from our data (Y, X). Traditionally we do so by finding the β and σ² that maximize the likelihood: L(Y | X, β, σ²) = ∏_{i=1}^n N(Y_i | X_i β, σ²). Note: the likelihood is also the joint distribution of the data. The estimates β̂ = (XᵀX)⁻¹XᵀY and s² = (Y − Xβ̂)ᵀ(Y − Xβ̂)/(n − p) are themselves random variables because they are functions of the random variable Y. Nels Johnson (LISA) Bayesian Regression 03/14/2011 5 / 26
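
As a brief aside (not from the original slides), these formulas can be computed directly with matrix algebra in R; the built-in cars data set is assumed here as a stand-in for the stopping-distance example:

# Sketch: MLE for linear regression via matrix algebra ('dist' plays Y, 'speed' plays X).
X <- model.matrix(dist ~ speed, data = cars)    # design matrix with intercept
Y <- cars$dist
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% Y    # (X'X)^{-1} X'Y
resid    <- Y - X %*% beta_hat
s2       <- sum(resid^2) / (nrow(X) - ncol(X))  # (Y - Xb)'(Y - Xb)/(n - p)
beta_hat; s2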

Interpretation of Parameters We might compute a confidence interval for β or perform a test to see if it is significantly different from zero. Confidence intervals are interpreted in terms of the proportion (i.e. relative frequency) of times they should capture the true parameter in the long run. This is because it is the endpoints of the interval that are random variables; the parameter is fixed. The interval either captures the fixed parameter or it doesn't. Because of this interpretation, this paradigm of statistics is called the frequentist paradigm (or classical paradigm). Nels Johnson (LISA) Bayesian Regression 03/14/2011 6 / 26

Example 1.2 This is an example of regression using the lm() function in R. Go to Example 1.2 in Bayes Reg 2011.r Nels Johnson (LISA) Bayesian Regression 03/14/2011 7 / 26
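
A minimal sketch of the kind of lm() fit Example 1.2 works through (the cars data and formula are assumptions, since the original script Bayes Reg 2011.r is not reproduced here):

# Sketch of a frequentist fit of the kind used in Example 1.2 (data set assumed).
fit <- lm(dist ~ speed, data = cars)
summary(fit)        # coefficient estimates, standard errors, t-tests
confint(fit)        # 95% confidence intervals for beta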

Bayes Theorem The Bayesian paradigm is named after Rev. Thomas Bayes for its use of his theorem. Take the rule for conditional probability for two events A and B: P(A | B) = P(A ∩ B) / P(B). Bayes discovered that this is equivalent to: P(A | B) = P(B | A)P(A) / P(B) = P(B | A)P(A) / ∫ P(B | A)P(A) dA. This is known as Bayes Theorem or Bayes Rule. Nels Johnson (LISA) Bayesian Regression 03/14/2011 8 / 26
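
As a small added illustration, Bayes Rule for two discrete events can be checked numerically (all probabilities below are made up):

# Sketch: Bayes' Rule with made-up numbers for two events A and B.
p_A     <- 0.01          # prior P(A)
p_B_gA  <- 0.95          # P(B | A)
p_B_gnA <- 0.10          # P(B | not A)
p_B     <- p_B_gA * p_A + p_B_gnA * (1 - p_A)   # total probability P(B)
p_A_gB  <- p_B_gA * p_A / p_B                   # posterior P(A | B)
p_A_gB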

Bayesian Paradigm The mathematician Pierre-Simon Laplace had the idea that, in addition to defining probability on variables, we could define probability on parameters too. By using Bayes Rule we can then make inference on parameters, effectively treating the parameters as random variables. In our regression example, let θ = {β, σ²} and D = the data. Using Bayes Rule we get: P(θ | D) = P(D | θ)P(θ) / P(D). P(θ | D) is called the posterior distribution; it is what we will use to make inference about the parameters θ = {β, σ²}. P(D | θ) is the likelihood we discussed previously; it contains all the information about θ we can learn from the data. P(θ) is called the prior distribution for θ; it contains the information we know about θ before we observe the data. P(D) is the normalizing constant of the function P(D | θ)P(θ), ensuring that P(θ | D) is a proper probability distribution. Nels Johnson (LISA) Bayesian Regression 03/14/2011 9 / 26

Proportionality Often knowing the normalizing constant P(D) is not necessary for Bayesian inference. When it is removed we get: P(θ | D) ∝ P(D | θ)P(θ), i.e. posterior ∝ likelihood × prior. The symbol ∝ means that P(θ | D) is proportional to (i.e. a scalar multiple of) P(D | θ)P(θ); in this case that scalar is 1/P(D). Usually P(D | θ) and P(θ) only need to be known up to proportionality as well, because their normalizing constants get lumped in with P(D). Nels Johnson (LISA) Bayesian Regression 03/14/2011 10 / 26

How It Works: The advantage of Bayesian analysis is that we can sequentially update beliefs about θ. Process: prior belief about θ → observe data → updated belief about θ. That is, P(θ) → P(D | θ)P(θ) ∝ P(θ | D). Now that we've done one experiment and have P(θ | D₁), we can conduct another experiment using P(θ | D₁) as the prior for θ. This leads to: P(θ | D₂, D₁) ∝ P(D₂ | θ)P(θ | D₁). This process can be repeated indefinitely. Example: last year you did a study relating levels of heavy metals in streams to the size of mussels in the stream. Use the posterior from last year's study as your prior for this year's study. Nels Johnson (LISA) Bayesian Regression 03/14/2011 11 / 26
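
A minimal sketch of this sequential updating, added for illustration: a normal mean with known variance and a conjugate normal prior, updated with two batches of simulated data (all settings are assumptions):

# Sketch: sequential conjugate updating of a normal mean (known variance).
# Prior: mu ~ N(m0, v0). Data: y ~ N(mu, s2), with s2 assumed known.
update_normal <- function(m0, v0, y, s2) {
  n  <- length(y)
  v1 <- 1 / (1 / v0 + n / s2)              # posterior variance
  m1 <- v1 * (m0 / v0 + sum(y) / s2)       # posterior mean
  c(mean = m1, var = v1)
}
set.seed(1)
s2 <- 4
y1 <- rnorm(20, mean = 3, sd = sqrt(s2))   # last year's data
y2 <- rnorm(20, mean = 3, sd = sqrt(s2))   # this year's data
post1 <- update_normal(m0 = 0, v0 = 100, y = y1, s2 = s2)    # vague prior
post2 <- update_normal(post1["mean"], post1["var"], y2, s2)  # last posterior as new prior
post1; post2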

Prior Distributions Just as you need to specify the likelihood for the problem, you need to specify the prior distribution. This is not an intuitive process for many people at first, mainly because the prior is unfamiliar. What helped me the most: think of the prior as a measure of our uncertainty about θ's true value. Very often it is easiest to write the joint prior as a product of independent univariate priors, e.g. P(θ) = P(β)P(σ²). Some important terminology that comes up when talking about priors: informative and noninformative priors; proper and improper priors; conjugate priors; reference priors. Nels Johnson (LISA) Bayesian Regression 03/14/2011 12 / 26

MLE Posteriors If we use maximum likelihood to estimate β, with σ² known, then β̂ = (XᵀX)⁻¹XᵀY and V(β̂) = σ²(XᵀX)⁻¹. We also know the sampling distribution of β̂ is normal. It turns out that if we pick the prior π(β) ∝ 1, then π(β | X, Y) is N((XᵀX)⁻¹XᵀY, σ²(XᵀX)⁻¹). This is an example of an improper prior, since it isn't a valid probability distribution, but the posterior π(β | X, Y) is. It is also considered uninformative. In fact, it is so uninformative it's called the reference prior, since it's the least informative prior we could pick (under a criterion I don't want to get into). Nels Johnson (LISA) Bayesian Regression 03/14/2011 13 / 26
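
A minimal sketch (again assuming the cars data as the example): under the flat prior π(β) ∝ 1, with σ² replaced by a plug-in estimate, posterior draws for β come straight from the normal distribution above and agree closely with lm():

# Sketch: posterior for beta under a flat prior, sigma^2 plugged in as known.
library(MASS)                                  # for mvrnorm()
X  <- model.matrix(dist ~ speed, data = cars)
Y  <- cars$dist
XtX_inv  <- solve(t(X) %*% X)
beta_hat <- XtX_inv %*% t(X) %*% Y
s2    <- sum((Y - X %*% beta_hat)^2) / (nrow(X) - ncol(X))  # plug-in for sigma^2
draws <- mvrnorm(5000, mu = drop(beta_hat), Sigma = s2 * XtX_inv)
colMeans(draws)            # close to coef(lm(dist ~ speed, data = cars))
apply(draws, 2, sd)        # close to the lm() standard errors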

Example 1.3 This example will allow us to see how the prior is updated after we see data. Key points: as the sample size increases, the impact of the prior on the posterior decreases; as the variance of the prior increases (i.e. a diffuse prior), the impact of the prior on the posterior decreases; where the prior is centered is not important if the prior is diffuse enough. Nels Johnson (LISA) Bayesian Regression 03/14/2011 14 / 26
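
The key points can be checked numerically; this sketch reuses the update_normal() helper defined in the earlier sequential-updating sketch, with made-up numbers:

# Sketch: how prior variance and sample size affect the posterior mean.
set.seed(2)
y_small <- rnorm(5,   mean = 3, sd = 2)    # small sample
y_large <- rnorm(500, mean = 3, sd = 2)    # large sample
# Prior deliberately centered far from the truth at m0 = -10.
update_normal(m0 = -10, v0 = 0.5, y = y_small, s2 = 4)   # tight prior, small n: pulled toward -10
update_normal(m0 = -10, v0 = 100, y = y_small, s2 = 4)   # diffuse prior: close to the sample mean
update_normal(m0 = -10, v0 = 0.5, y = y_large, s2 = 4)   # large n: the prior barely matters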

Conjugacy If P(θ) is a conjugate prior for θ with likelihood P(D | θ), then P(θ | D) is in the same family of distributions as the prior. This is nice because it means that in some cases we can pick a prior such that the posterior distribution will be a distribution we understand well (e.g. normal, gamma). This means computational methods might not be needed for inference, or that the computational methods will be fast and simple to implement. For a list of some conjugate priors, see: http://en.wikipedia.org/wiki/conjugate_prior Nels Johnson (LISA) Bayesian Regression 03/14/2011 15 / 26

Common Conjugate Examples Y = Xβ + ε, where ε ~ N(0, σ²I) or ε ~ N(0, Σ).

Parameter   Prior            Likelihood   Posterior
β           Normal           Normal       Normal
σ²          Inverse-Gamma    Normal       Inverse-Gamma
Σ           Inverse-Wishart  Normal       Inverse-Wishart
σ⁻²         Gamma            Normal       Gamma
Σ⁻¹         Wishart          Normal       Wishart

Nels Johnson (LISA) Bayesian Regression 03/14/2011 16 / 26

To R The rest of this presentation will be in R. I've left all my slides from an older version of this talk for those who may be interested in reading them later. Nels Johnson (LISA) Bayesian Regression 03/14/2011 17 / 26

Posterior Distributions Bayesian inference is usually done on the joint posterior distribution of all the parameters, P(θ | D). Sometimes it is done on the marginal distribution of a single parameter, such as: P(β | D) = ∫ P(θ | D) dσ². Because the posterior is a probability distribution for θ, performing inference on θ is as simple as finding relevant summaries based on its posterior. Nels Johnson (LISA) Bayesian Regression 03/14/2011 18 / 26

Point Estimation Two very popular point estimators for θ: the mean of the posterior distribution (the posterior mean), and the maximum a posteriori estimator, also known as the MAP estimator. The MAP estimator is arg max_θ P(θ | D). You could use any measure of center that makes sense for P(θ | D). Nels Johnson (LISA) Bayesian Regression 03/14/2011 19 / 26
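
Both point estimates are easy to approximate from posterior samples; the sketch below uses the slope draws from the earlier flat-prior sketch, and the density-based MAP is only one rough way to estimate the mode:

# Sketch: point estimates from a vector of posterior samples.
theta_draws <- draws[, 2]                     # slope draws from the earlier flat-prior sketch
post_mean <- mean(theta_draws)                # posterior mean
d <- density(theta_draws)                     # kernel density estimate of the posterior
post_map  <- d$x[which.max(d$y)]              # approximate MAP (posterior mode)
c(mean = post_mean, MAP = post_map)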

Illustration of MAP Estimator Nels Johnson (LISA) Bayesian Regression 03/14/2011 20 / 26

Interval Estimation Bayesian confidence intervals are called credible intervals. So a 95% Bayesian confidence interval is called a 95% credible interval. There are two popular ways to find them: highest posterior density, often shortened to HPD, and equal tail probability. A 95% HPD credible interval is the smallest interval with probability 0.95. A 95% equal tail probability interval uses the values with cumulative probability 0.025 and 0.975 as its endpoints. The equal tail interval is not preferred when the posterior is highly skewed or multimodal. Nels Johnson (LISA) Bayesian Regression 03/14/2011 21 / 26
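
Both kinds of interval are easy to compute from posterior draws; a minimal sketch, assuming the coda package is installed and reusing theta_draws from the previous sketch:

# Sketch: equal-tail and HPD credible intervals from posterior samples.
quantile(theta_draws, probs = c(0.025, 0.975))   # 95% equal-tail interval
library(coda)
HPDinterval(as.mcmc(theta_draws), prob = 0.95)   # 95% HPD interval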

HPD and Equal Tail Credible Intervals Nels Johnson (LISA) Bayesian Regression 03/14/2011 22 / 26

Hypothesis Testing Since we can talk about the probability of θ being in some interval, this makes the interpretation of certain kinds of hypothesis tests much easier. For instance, H₀: β ≤ 0 vs. Hₐ: β > 0. You can find P_{β|D}(β ≤ 0), and if it is sufficiently small then reject the null hypothesis in favor of the alternative. We could also test H₀: a ≤ β ≤ b vs. Hₐ: β < a or β > b, where a and b are chosen such that if H₀ were true then β would have no practical effect. Again, compute P_{β|D}(H₀) and, if it is sufficiently small, reject H₀. Some special steps need to be taken when trying to test H₀: β = 0, which are beyond the scope of this course. Also, instead of rejecting H₀ or accepting Hₐ, you could just report the evidence (i.e. the probability) for the reader to decide. Nels Johnson (LISA) Bayesian Regression 03/14/2011 23 / 26
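
These posterior probabilities are just proportions of the posterior draws; a minimal sketch reusing theta_draws, with made-up endpoints a and b:

# Sketch: posterior probabilities of hypotheses from posterior samples.
mean(theta_draws <= 0)                          # P(beta <= 0 | D)
a <- 0.5; b <- 1.5                              # hypothetical "no practical effect" range
mean(theta_draws >= a & theta_draws <= b)       # P(a <= beta <= b | D)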

Sampling From the Joint Posterior As stated previously, we want to perform inference on θ using the joint posterior P(θ | D). Summarizing the joint posterior may be difficult (or impossible). A solution to this problem is to use a computer algorithm to sample observations from the joint posterior. If we have a large enough sample from P(θ | D), then the distribution of the sample should approximate P(θ | D). This means we can summarize the sample from P(θ | D) to make inference on θ instead of using the distribution P(θ | D) directly. Nels Johnson (LISA) Bayesian Regression 03/14/2011 24 / 26

One (Three) Important Algorithm(s) The big three algorithms are as follows: Gibbs sampling, the Metropolis algorithm, and the Metropolis-Hastings algorithm. All three of these algorithms are Markov chain Monte Carlo algorithms, commonly known as MCMC algorithms. What that means for us is that after you run the algorithm long enough, you will eventually start sampling from P(θ | D). The Metropolis-Hastings algorithm is a generalization of the other two, and it is so similar to the Metropolis algorithm that the two names are often used interchangeably. Nels Johnson (LISA) Bayesian Regression 03/14/2011 25 / 26

More on Gibbs The Gibbs sampler is a much simplified version of the Metropolis-Hastings algorithm. So simplified, in fact, that it isn't obvious at all that they are related. All three algorithms require you to know the conditional posterior distribution of each parameter in the model (e.g. P(β | D, σ²) and P(σ² | D, β)). The Gibbs sampler requires that the user can sample from each of the conditional posterior distributions using computer software. This is where conjugate priors become very popular. If the conditional posterior distributions are all normal, gamma, or some other distribution we can readily sample from, then the Gibbs algorithm makes sampling from the joint posterior very easy. Other details about this algorithm are beyond the scope of this course. The data analysis examples we will discuss both use a Gibbs sampler. Nels Johnson (LISA) Bayesian Regression 03/14/2011 26 / 26
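
To make the idea concrete, here is a minimal Gibbs sampler sketch for the regression model above, with a vague normal prior on β and an inverse-gamma prior on σ². The cars data, prior settings, and iteration count are all assumptions for illustration; this is not the sampler used in the original data analysis examples.

# Sketch: Gibbs sampler for Y = X beta + eps, eps ~ N(0, sigma^2 I),
# with priors beta ~ N(0, tau2 * I) and sigma^2 ~ Inverse-Gamma(a, b).
library(MASS)                                  # for mvrnorm()
X <- model.matrix(dist ~ speed, data = cars)   # example data (assumption)
Y <- cars$dist
n <- nrow(X); p <- ncol(X)
tau2 <- 1e6; a <- 0.01; b <- 0.01              # vague prior settings (assumptions)
n_iter <- 5000
beta <- rep(0, p); sigma2 <- 1                 # starting values
beta_out   <- matrix(NA, n_iter, p)
sigma2_out <- numeric(n_iter)
XtX <- crossprod(X); XtY <- crossprod(X, Y)
set.seed(3)
for (it in 1:n_iter) {
  # Draw beta | sigma2, Y from its normal full conditional.
  V <- solve(XtX / sigma2 + diag(p) / tau2)
  m <- V %*% (XtY / sigma2)
  beta <- drop(mvrnorm(1, mu = drop(m), Sigma = V))
  # Draw sigma2 | beta, Y from its inverse-gamma full conditional.
  rss    <- sum((Y - X %*% beta)^2)
  sigma2 <- 1 / rgamma(1, shape = a + n / 2, rate = b + rss / 2)
  beta_out[it, ] <- beta
  sigma2_out[it] <- sigma2
}
colMeans(beta_out[-(1:1000), ])                # posterior means after discarding burn-in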