The Dirichlet-Multinomial and Dirichlet-Categorical models for Bayesian inference



Stephen Tu
tu.stephenl@gmail.com

1 Introduction

This document collects in one place various results for both the Dirichlet-multinomial and the Dirichlet-categorical likelihood models. Both models, while simple, are actually a source of confusion because the terminology has been very sloppily overloaded on the internet. We'll clear up the confusion in this writeup.

2 Preliminaries

Let's standardize terminology and outline a few preliminaries.

2.1 Dirichlet distribution

The Dirichlet distribution, which we denote Dir(α_1, ..., α_K), is parameterized by positive scalars α_i > 0 for i = 1, ..., K, where K ≥ 2. The support of the Dirichlet distribution is the (K−1)-dimensional simplex S_K; that is, all K-dimensional vectors which form a valid probability distribution. The probability density of x = (x_1, ..., x_K) when x ∈ S_K is

    f(x_1, ..., x_K; α_1, ..., α_K) = [Γ(Σ_{i=1}^K α_i) / Π_{i=1}^K Γ(α_i)] Π_{i=1}^K x_i^{α_i − 1}

There are some pedagogical concerns to be noted. First, the pdf f(x) is technically defined only in (K−1)-dimensional space and not K-dimensional space; this is because S_K has zero measure in K dimensions.[1] Second, the support is actually the open simplex, meaning all x_i > 0 (which implies no x_i = 1). The Dirichlet distribution is a generalization of the Beta distribution, which is the conjugate prior for coin flipping.

[1] If you read the Wikipedia entry on the Dirichlet distribution, this is what the mathematical jargon referring to the Lebesgue measure is saying.
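To make the definition concrete, here is a quick numerical sketch using SciPy (the α values are arbitrary, chosen only for illustration):

```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([2.0, 3.0, 5.0])  # arbitrary example parameters, K = 3

# Draws lie on the open 2-dimensional simplex: positive entries summing to 1.
x = dirichlet.rvs(alpha, size=4, random_state=0)
print(x.sum(axis=1))  # each row sums to 1

# Density of a point in the simplex, per the formula above.
print(dirichlet.pdf([0.2, 0.3, 0.5], alpha))
```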

2.2 Multinomial distribution

The Multinomial distribution, which we denote Mult(p_1, ..., p_K, n), is a discrete distribution over K-dimensional non-negative integer vectors x ∈ Z_+^K where Σ_{i=1}^K x_i = n. Here, p = (p_1, ..., p_K) is an element of S_K and n ≥ 1. The probability mass function is given as

    f(x_1, ..., x_K; p_1, ..., p_K, n) = [Γ(n + 1) / Π_{i=1}^K Γ(x_i + 1)] Π_{i=1}^K p_i^{x_i}

The interpretation is simple: a draw x from Mult(p_1, ..., p_K, n) can be interpreted as drawing n iid values from a Categorical distribution with pmf f(x = i) = p_i. Each entry x_i counts the number of times value i was drawn. The strange-looking gamma functions in the pmf account for the combinatorics of a draw; their ratio is simply the number of ways of placing n balls in K bins. Indeed, since Γ(n + 1) = n! for non-negative integer n,

    Γ(n + 1) / Π_{i=1}^K Γ(x_i + 1) = n! / (x_1! ··· x_K!)

The Multinomial distribution is a generalization of the Binomial distribution.

2.3 Categorical distribution

The Categorical distribution, which we denote as Cat(p_1, ..., p_K), is a discrete distribution with support {1, ..., K}. Once again, p = (p_1, ..., p_K) ∈ S_K. It has a probability mass function given as

    f(x = i) = p_i

The source of confusion with the Multinomial distribution is that the popular 1-of-K encoding is often used to encode a value drawn from the Categorical distribution, and in this case we can actually see that the Categorical distribution is just a Multinomial with n = 1. In the scalar form, the Categorical distribution is a generalization of the Bernoulli distribution (coin flipping).
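To see the n = 1 equivalence concretely, here is a small numerical check (a sketch under arbitrary probabilities, using SciPy's multinomial):

```python
import numpy as np
from scipy.stats import multinomial

p = np.array([0.2, 0.3, 0.5])  # arbitrary categorical probabilities, K = 3

# Under the 1-of-K encoding, drawing value i corresponds to the indicator
# vector e_i, and Cat(p) evaluated at i equals Mult(p, n = 1) evaluated at e_i.
for i in range(3):
    e_i = np.eye(3, dtype=int)[i]
    print(p[i], multinomial.pmf(e_i, n=1, p=p))  # the two columns agree
```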

2.4 Conjugate priors

Much ink has been spilled on conjugate priors, so we won't attempt to provide any lengthy discussion. What we will mention now is that, when doing Bayesian inference, we are often interested in a few key distributions. Below, D refers to the dataset we have at hand, and we typically assume each y_i ∈ D is drawn iid from the same distribution f(y; θ), where θ parameterizes the likelihood model. When people say they are being Bayesian, what this often means is that they treat θ as unknown, but postulate that θ follows some prior distribution f(θ; α), where α parameterizes the prior distribution (and is often called the hyper-parameter). Of course you can keep playing this game and put a prior on α (called the hyper-prior), but we won't go there.

While so far this seems reasonable, Bayesian analysis is often intractable since we have to integrate out the unknown θ, and for many problems the integral cannot be solved analytically (so one has to resort to numerical techniques). Conjugate priors, however, are the (tiny[2]) class of models where we can analytically compute the distributions of interest. These distributions are:

[2] This class is essentially just the exponential family of distributions.

Posterior predictive. This is the distribution denoted notationally by f(y|D), where y is a new datapoint of interest. By the iid assumption, we have

    f(y|D) = ∫ f(y, θ|D) dθ = ∫ f(y|θ) f(θ|D) dθ

The distribution f(θ|D), often called the posterior distribution, can be broken down further:

    f(θ|D) = f(θ, D) / f(D) ∝ f(θ, D) = f(θ; α) Π_{y_i ∈ D} f(y_i|θ)

Marginal distribution of the data. This is denoted f(D), and is derived by integrating out the model parameter:

    f(D) = ∫ f(D, θ) dθ = ∫ f(θ; α) Π_{y_i ∈ D} f(y_i|θ) dθ

Dirichlet distribution as a prior. It turns out (to further the confusion) that the Dirichlet distribution is the conjugate prior for both the Categorical and the Multinomial distributions! For the remainder of this document, we list the results for both the posterior predictive and the marginal distribution in both the Dirichlet-Categorical and Dirichlet-Multinomial models.

3 Dirichlet-Multinomial

Model.

    p_1, ..., p_K ~ Dir(α_1, ..., α_K)
    y^(1), ..., y^(N) | p_1, ..., p_K ~iid Mult(p_1, ..., p_K, n)
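Before deriving the posterior, it may help to simulate the generative process (a minimal sketch; K, N, n, and α are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

K, N, n = 3, 100, 10               # arbitrary sizes for illustration
alpha = np.array([1.0, 2.0, 3.0])  # arbitrary Dirichlet hyper-parameters

p = rng.dirichlet(alpha)           # p ~ Dir(alpha)
Y = rng.multinomial(n, p, size=N)  # y^(i) ~ Mult(p, n) for i = 1, ..., N
print(Y.sum(axis=1))               # each draw allocates n balls into K bins
```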

Posterior.

    f(θ|D) ∝ f(θ, D)
           = f(p_1, ..., p_K; α_1, ..., α_K) Π_{i=1}^N f(y^(i) | p_1, ..., p_K)
           ∝ Π_{j=1}^K p_j^{α_j − 1} Π_{i=1}^N Π_{j=1}^K p_j^{y_j^(i)}
           = Π_{j=1}^K p_j^{α_j − 1 + Σ_{i=1}^N y_j^(i)}

This density is exactly that of a Dirichlet distribution, except we have

    α′_j = α_j + Σ_{i=1}^N y_j^(i)

That is, f(θ|D) = Dir(α′_1, ..., α′_K).

Posterior Predictive.

    f(y|D) = ∫ f(y|θ) f(θ|D) dθ
           = ∫ f(y | p_1, ..., p_K) f(p_1, ..., p_K | D) dp
           = ∫ [Γ(n + 1) / Π_j Γ(y_j + 1)] Π_j p_j^{y_j} · [Γ(Σ_j α′_j) / Π_j Γ(α′_j)] Π_j p_j^{α′_j − 1} dp
           = [Γ(n + 1) / Π_j Γ(y_j + 1)] [Γ(Σ_j α′_j) / Π_j Γ(α′_j)] ∫ Π_j p_j^{y_j + α′_j − 1} dp
           = [Γ(n + 1) / Π_j Γ(y_j + 1)] [Γ(Σ_j α′_j) / Π_j Γ(α′_j)] [Π_j Γ(y_j + α′_j) / Γ(n + Σ_j α′_j)]    (1)

where ∫ · dp denotes integrating (p_1, ..., p_K) with respect to the (K−1)-simplex; the last step uses the Dirichlet normalization constant, ∫ Π_j p_j^{β_j − 1} dp = Π_j Γ(β_j) / Γ(Σ_j β_j).
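Equation (1) is easiest to evaluate in log space. Below is a minimal sketch (the helper name log_pp and all the numbers are mine; gammaln computes log Γ):

```python
import numpy as np
from scipy.special import gammaln

def log_pp(y, alpha_post):
    """Log of equation (1): posterior predictive of a new count vector y."""
    n = y.sum()
    return (gammaln(n + 1) - gammaln(y + 1).sum()
            + gammaln(alpha_post.sum()) - gammaln(alpha_post).sum()
            + gammaln(y + alpha_post).sum() - gammaln(n + alpha_post.sum()))

alpha = np.array([1.0, 2.0, 3.0])   # prior hyper-parameters (arbitrary)
Y = np.array([[2, 1, 3],            # observed count vectors y^(i), n = 6 each
              [0, 4, 2]])
alpha_post = alpha + Y.sum(axis=0)  # alpha'_j = alpha_j + sum_i y_j^(i)
print(np.exp(log_pp(np.array([1, 2, 3]), alpha_post)))
```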

Marginal. This derivation is almost the same as the posterior predictive:

    f(D) = ∫ f(θ; α) Π_{i=1}^N f(y^(i)|θ) dθ
         = ∫ f(p_1, ..., p_K; α_1, ..., α_K) Π_{i=1}^N f(y^(i) | p_1, ..., p_K) dp
         = [Γ(Σ_j α_j) / Π_j Γ(α_j)] [Π_{i=1}^N Γ(n + 1) / Π_j Γ(y_j^(i) + 1)] ∫ Π_j p_j^{Σ_i y_j^(i) + α_j − 1} dp
         = [Γ(Σ_j α_j) / Π_j Γ(α_j)] [Π_{i=1}^N Γ(n + 1) / Π_j Γ(y_j^(i) + 1)] [Π_j Γ(Σ_{i=1}^N y_j^(i) + α_j) / Γ(|D| n + Σ_j α_j)]    (2)

4 Dirichlet-Categorical

The derivations here are almost identical to before (with some minor syntactic differences).

Model.

    p_1, ..., p_K ~ Dir(α_1, ..., α_K)
    y_1, ..., y_N | p_1, ..., p_K ~iid Cat(p_1, ..., p_K)

Posterior.

    f(θ|D) ∝ f(θ, D)
           = f(p_1, ..., p_K; α_1, ..., α_K) Π_{i=1}^N f(y_i | p_1, ..., p_K)
           ∝ Π_{j=1}^K p_j^{α_j − 1} Π_{i=1}^N Π_{j=1}^K p_j^{1{y_i = j}}
           = Π_{j=1}^K p_j^{α_j − 1 + Σ_{i=1}^N 1{y_i = j}}

This density is exactly that of a Dirichlet distribution, except we have

    α′_j = α_j + Σ_{i=1}^N 1{y_i = j}

That is, f(θ|D) = Dir(α′_1, ..., α′_K).
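The Dirichlet-Categorical posterior update is just counting; a minimal sketch (the data and hyper-parameters are arbitrary):

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])    # arbitrary prior hyper-parameters, K = 3
data = np.array([0, 2, 2, 1, 2, 0])  # observations y_i, coded 0, ..., K-1 here

counts = np.bincount(data, minlength=3)  # sum_i 1{y_i = j} for each category j
alpha_post = alpha + counts              # alpha'_j = alpha_j + count of j
print(alpha_post)                        # the posterior is Dir(alpha')
```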

Posterior Predictive.

    f(y = x | D) = ∫ f(y = x | θ) f(θ|D) dθ
                 = ∫ f(y = x | p_1, ..., p_K) f(p_1, ..., p_K | D) dp
                 = ∫ p_x [Γ(Σ_j α′_j) / Π_j Γ(α′_j)] Π_j p_j^{α′_j − 1} dp
                 = [Γ(Σ_j α′_j) / Π_j Γ(α′_j)] ∫ Π_j p_j^{1{x = j} + α′_j − 1} dp
                 = [Γ(Σ_j α′_j) / Π_j Γ(α′_j)] [Π_j Γ(1{x = j} + α′_j) / Γ(1 + Σ_j α′_j)]
                 = α′_x / Σ_j α′_j    (3)

where we used the fact that Γ(n + 1) = n Γ(n) to simplify the second-to-last line.

Marginal.

    f(D) = ∫ f(θ; α) Π_{i=1}^N f(y_i|θ) dθ
         = ∫ f(p_1, ..., p_K; α_1, ..., α_K) Π_{i=1}^N f(y_i | p_1, ..., p_K) dp
         = [Γ(Σ_j α_j) / Π_j Γ(α_j)] ∫ Π_j p_j^{Σ_i 1{y_i = j} + α_j − 1} dp
         = [Γ(Σ_j α_j) / Π_j Γ(α_j)] [Π_j Γ(Σ_{i=1}^N 1{y_i = j} + α_j) / Γ(|D| + Σ_j α_j)]    (4)
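The clean closed form in (3) can be sanity-checked by Monte Carlo, since the posterior predictive is just E[p_x] under Dir(α′) (a sketch with arbitrary α′):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_post = np.array([2.0, 4.0, 6.0])  # arbitrary posterior parameters alpha'

closed_form = alpha_post / alpha_post.sum()  # equation (3)

samples = rng.dirichlet(alpha_post, size=200_000)
print(closed_form)           # [1/6, 1/3, 1/2]
print(samples.mean(axis=0))  # should be close to the closed form
```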