Lecture 24: Dirichlet distribution and Dirichlet Process

Stat260: Bayesian Modeling and Inference. Lecture Date: April 28, 2010.
Lecturer: Michael I. Jordan. Scribe: Vivek Ramamurthy.

1 Introduction

In the last couple of lectures, in our study of Bayesian nonparametric approaches, we considered the Chinese Restaurant Process, Bayesian mixture models, stick breaking, and the Dirichlet process. Today, we will try to gain some insight into the connection between the Dirichlet process and the Dirichlet distribution.

2 The Dirichlet distribution and Pólya urn

First, we note an important relation between the Dirichlet distribution and the Gamma distribution, which is used to generate random vectors that are Dirichlet distributed. If, for $i \in \{1, 2, \ldots, K\}$, $Z_i \sim \mathrm{Gamma}(\alpha_i, \beta)$ independently, then

$$S = \sum_{i=1}^K Z_i \sim \mathrm{Gamma}\left(\sum_{i=1}^K \alpha_i, \beta\right)$$

and

$$V = (V_1, \ldots, V_K) = (Z_1/S, \ldots, Z_K/S) \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_K).$$

Now, consider the following Pólya urn model. Suppose that

- $X_i$ = color of the $i$th draw
- $\mathcal{X}$ = space of colors (discrete)
- $\alpha(k)$ = number of balls of color $k$ initially in the urn.

We then have that

$$p(X_i = k \mid X_1, \ldots, X_{i-1}) = \frac{\alpha(k) + \sum_{j<i} \delta_{X_j}(k)}{\alpha(\mathcal{X}) + i - 1}$$

where $\delta_{X_j}(k) = 1$ if $X_j = k$ and $0$ otherwise, and $\alpha(\mathcal{X}) = \sum_k \alpha(k)$. It may then be shown that

$$p(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \frac{\alpha(x_1)}{\alpha(\mathcal{X})} \prod_{i=2}^n \frac{\alpha(x_i) + \sum_{j<i} \delta_{x_j}(x_i)}{\alpha(\mathcal{X}) + i - 1}$$

$$= \frac{\alpha(1)[\alpha(1)+1]\cdots[\alpha(1)+m_1-1]\;\alpha(2)[\alpha(2)+1]\cdots[\alpha(2)+m_2-1]\cdots\alpha(C)[\alpha(C)+1]\cdots[\alpha(C)+m_C-1]}{\alpha(\mathcal{X})[\alpha(\mathcal{X})+1]\cdots[\alpha(\mathcal{X})+n-1]}$$

where $1, 2, \ldots, C$ are the distinct colors that appear in $x_1, \ldots, x_n$ and $m_k = \sum_{i=1}^n 1\{x_i = k\}$. In particular, the joint probability depends on the sequence only through the counts $m_k$, so the draws are exchangeable.
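Both constructions in this section are easy to check by simulation. First, the Gamma-normalization recipe from the start of the section: a minimal sketch (assuming NumPy; the parameters $\alpha = (2, 3, 5)$ and $\beta = 1.7$ are illustrative) comparing the normalized Gamma draws against NumPy's direct Dirichlet sampler.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])  # Dirichlet parameters (illustrative)
beta = 1.7                         # shared Gamma rate; it cancels when we normalize

# Z_i ~ Gamma(alpha_i, beta) independently; NumPy's gamma takes shape and scale = 1/rate.
Z = rng.gamma(shape=alpha, scale=1.0 / beta, size=(100_000, 3))
V = Z / Z.sum(axis=1, keepdims=True)  # V = (Z_1/S, ..., Z_K/S) ~ Dir(alpha)

# The mean of Dir(alpha) is alpha / sum(alpha) = (0.2, 0.3, 0.5); compare both samplers.
print(V.mean(axis=0))
print(rng.dirichlet(alpha, size=100_000).mean(axis=0))
```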

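Second, the urn itself. This sketch (again assuming NumPy; the two colors and their initial counts are illustrative) implements the predictive rule directly; by the exchangeability just noted, each draw is marginally red with probability $\alpha(\mathrm{red})/\alpha(\mathcal{X}) = 2/3$.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = {"red": 2.0, "blue": 1.0}   # alpha(k): initial ball counts (illustrative)
colors = list(alpha)
alpha_total = sum(alpha.values())   # alpha(X)

def polya_draws(n):
    """Draw X_1, ..., X_n from the Polya urn predictive rule."""
    counts = dict.fromkeys(colors, 0.0)  # sum_{j<i} delta_{X_j}(k)
    seq = []
    for i in range(1, n + 1):
        probs = [(alpha[k] + counts[k]) / (alpha_total + i - 1) for k in colors]
        k = rng.choice(colors, p=probs)
        counts[k] += 1.0
        seq.append(k)
    return seq

# By exchangeability, even the 10th draw is red with probability 2/3.
print(np.mean([polya_draws(10)[-1] == "red" for _ in range(5_000)]))
```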
3 The Pitman-Yor process

This section is a small aside on the Pitman-Yor process, a process related to the Dirichlet process. Recall that, in the stick-breaking construction for the Dirichlet process, we define an infinite sequence of Beta random variables as follows:

$$\beta_i \sim \mathrm{Beta}(1, \alpha_0), \qquad i = 1, 2, \ldots$$

Then, we define an infinite sequence of mixing proportions as follows:

$$\pi_1 = \beta_1, \qquad \pi_k = \beta_k \prod_{j<k} (1 - \beta_j), \qquad k = 2, 3, \ldots$$

The Pitman-Yor process $PY(d, \alpha, G_0)$ is a related probability distribution over distributions. The parameters of this process are a discount parameter $0 \le d < 1$, a strength parameter $\alpha > -d$, and a base distribution $G_0$. In the special case $d = 0$, the Pitman-Yor process is equivalent to the Dirichlet process. Under the Pitman-Yor process, the infinite sequence of Beta random variables is defined as

$$\beta_k \sim \mathrm{Beta}(1 - d, \alpha + kd), \qquad k = 1, 2, \ldots$$

As in the Dirichlet process, we complete the description of the Pitman-Yor process via

$$G = \sum_{k=1}^\infty \pi_k \delta_{\phi_k}, \qquad \theta_i \mid G \sim G.$$

Hence, due to the way the $\beta_k$ are drawn, the Pitman-Yor process has a longer tail than the Dirichlet process; that is, more sticks have non-trivial amounts of mass. The Pitman-Yor process may be seen as a Chinese restaurant process with a rule that favors starting a new table. In the homework, you will be required to show that the expectation of the total number of occupied tables in the Chinese restaurant scales as $O(n^d)$ under the Pitman-Yor process $PY(d, \alpha, G_0)$. This is known as a power law, and it is in contrast to the logarithmic growth for the Dirichlet process that we discuss in the next section. Many natural phenomena follow power-law distributions, and in these cases the Pitman-Yor process may be a better choice of prior than the Dirichlet process.
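The longer tail is easy to see numerically. Here is a sketch (assuming NumPy; $\alpha = 1$ and $d = 0.5$ are illustrative) that draws stick-breaking weights under both processes and counts how many sticks are needed to cover 99% of the mass.

```python
import numpy as np

rng = np.random.default_rng(2)

def stick_weights(alpha, d=0.0, K=2000):
    """First K stick-breaking weights; d = 0 gives the DP, d > 0 the PY process."""
    k = np.arange(1, K + 1)
    betas = rng.beta(1.0 - d, alpha + k * d)                # beta_k ~ Beta(1-d, alpha+kd)
    stick_left = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * stick_left                               # pi_k = beta_k prod_{j<k}(1-beta_j)

pi_dp = stick_weights(alpha=1.0, d=0.0)
pi_py = stick_weights(alpha=1.0, d=0.5)

# Sticks needed to cover 99% of the mass: typically a handful for the DP,
# far more under PY(0.5, 1, G0), reflecting the power-law tail.
print(np.searchsorted(np.cumsum(pi_dp), 0.99) + 1)
print(np.searchsorted(np.cumsum(pi_py), 0.99) + 1)
```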

4 Expected number of occupied tables in the Chinese Restaurant Process

Going back to the Dirichlet process: the stick-breaking construction with $\beta_i \sim \mathrm{Beta}(1, \alpha)$ yields the GEM distribution over the mixing proportions (named so by Ewens (1990) after Griffiths, Engen and McCloskey). In the resulting Chinese Restaurant Process, we have

$$P(\text{new table on the } i\text{th draw}) = \frac{\alpha}{\alpha + i - 1}.$$

We define

$$W_i = \begin{cases} 1 & \text{if the } i\text{th draw is a new table} \\ 0 & \text{otherwise.} \end{cases}$$

The expected number of occupied tables is then given by

$$E\left[\sum_{i=1}^n W_i\right] = \sum_{i=1}^n \frac{\alpha}{\alpha + i - 1} \approx \alpha \log\left(\frac{\alpha + n}{\alpha}\right).$$

Moreover, we can also easily compute the probability of any sequence of draws. Consider, for instance, the sequence of draws given by $W = (1, 1, 0, 0, 1)$. The probability of this sequence is given by

$$P(W = (1,1,0,0,1)) = \frac{\alpha}{\alpha} \cdot \frac{\alpha}{\alpha+1} \cdot \frac{2}{\alpha+2} \cdot \frac{3}{\alpha+3} \cdot \frac{\alpha}{\alpha+4} = \alpha^3 (1 \cdot 1 \cdot 2 \cdot 3 \cdot 1)\, \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)}$$

with $n = 5$. In general, it may be shown that the probability of drawing $k$ tables in $n$ draws is given by

$$P(k \text{ tables}, n) = S(n, k)\, \alpha^k\, \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)}$$

where $S(n,k)$ is an unsigned Stirling number of the first kind. The following identity may be shown to hold for these Stirling numbers:

$$S(n+1, m+1) = S(n, m) + n\, S(n, m+1).$$

5 Gibbs sampling for Dirichlet process mixtures

Our goal now is to sample $\alpha$ in a Gibbs sampler for Dirichlet process mixtures. The sampling distributions used are

$$G \sim DP(\alpha, G_0), \qquad \theta_i \mid G \sim G, \qquad x_{ij} \mid \theta_i \sim F_{\theta_i}.$$

In addition, we also need to determine a prior on $\alpha$. Toward this end, let $k$ be the number of occupied tables in a Chinese restaurant process. By the last computation, we have that

$$p(k \mid \alpha) \propto \alpha^k \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)}.$$

We also have that the Beta function is

$$B(\alpha_1, \alpha_2) = \int_0^1 x^{\alpha_1 - 1} (1 - x)^{\alpha_2 - 1}\, dx = \frac{\Gamma(\alpha_1)\Gamma(\alpha_2)}{\Gamma(\alpha_1 + \alpha_2)}.$$

Using this,

$$\frac{\Gamma(\alpha)}{\Gamma(\alpha + n)} = \frac{(\alpha + n)\, B(\alpha + 1, n)}{\alpha\, \Gamma(n)}.$$

The above identity may be verified by checking that

$$B(\alpha + 1, n) = \frac{\Gamma(\alpha + 1)\Gamma(n)}{\Gamma(\alpha + 1 + n)} = \frac{\alpha\, \Gamma(\alpha)\Gamma(n)}{(\alpha + n)\Gamma(\alpha + n)}.$$

Hence, it then follows that

$$p(\alpha \mid k) \propto p(\alpha)\, \alpha^{k-1} (\alpha + n) \int_0^1 x^\alpha (1 - x)^{n-1}\, dx.$$

Introducing an auxiliary variable $x \sim \mathrm{Beta}(\alpha + 1, n)$, we observe that $p(\alpha)$ must be chosen to be conjugate to the Gamma distribution. Since the Gamma distribution is conjugate to itself, it follows that $p(\alpha)$ should be chosen to be a Gamma distribution.
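Both computations above can be sketched in code (assuming NumPy; the prior hyperparameters $a, b$ and the observed table count $k$ are illustrative). The first two prints check the expected-table formula from Section 4 against its logarithmic approximation. The rest carries the Section 5 derivation to its standard conclusion: with a $\mathrm{Gamma}(a, b)$ prior on $\alpha$ and the auxiliary variable $x \sim \mathrm{Beta}(\alpha+1, n)$, the conditional $p(\alpha \mid x, k)$ is a two-component mixture of Gammas; the mixture weights below are taken from Escobar and West (1995), not derived in these notes.

```python
import numpy as np

rng = np.random.default_rng(3)

# --- Section 4: expected number of occupied tables --------------------------
alpha, n = 2.0, 1000
i = np.arange(1, n + 1)
print((alpha / (alpha + i - 1)).sum())         # exact E[sum_i W_i]
print(alpha * np.log((alpha + n) / alpha))     # logarithmic approximation

# --- Section 5: one Gibbs update for alpha ----------------------------------
a, b = 1.0, 1.0   # Gamma(a, b) prior on alpha, shape/rate (illustrative)
k = 15            # observed number of occupied tables (illustrative)

x = rng.beta(alpha + 1.0, n)                   # auxiliary variable x ~ Beta(alpha+1, n)
odds = (a + k - 1.0) / (n * (b - np.log(x)))   # pi_x / (1 - pi_x), per Escobar & West
shape = a + k if rng.random() < odds / (1.0 + odds) else a + k - 1.0
alpha = rng.gamma(shape, 1.0 / (b - np.log(x)))  # NumPy gamma takes scale = 1/rate
print(alpha)
```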

6 The Dirichlet Process

In this section, we discuss why the Dirichlet process is named the Dirichlet process. Consider a set $\Phi$ and a partition $A_1, A_2, \ldots$ of $\Phi$ such that $\bigcup_k A_k = \Phi$. We would like to construct a random probability measure on $\Phi$, i.e., a random probability measure $G$ such that for all $i$, $G(A_i)$ is a random variable. For this, we need to specify a joint distribution for $(G(A_1), G(A_2), \ldots, G(A_k))$ for any $k$ and any partition. Ferguson (1973) showed that $G$ is a Dirichlet process with parameters $\alpha_0$ and $G_0$, i.e., $G \sim DP(\alpha_0, G_0)$, if for any partition $A_1, \ldots, A_k$ we have that

$$(G(A_1), \ldots, G(A_k)) \sim \mathrm{Dir}[\alpha_0 G_0(A_1), \ldots, \alpha_0 G_0(A_k)].$$

Sethuraman (1994) showed that the Dirichlet process is an infinite sum of the form $G = \sum_{k=1}^\infty \pi_k \delta_{\phi_k}$ that obeys the definition of the stick-breaking process. We wish to give an overview of why these definitions are equivalent. First, we present three facts, two of which you need to prove in the homework and one which we simply state as a fact.

1. (Homework) Suppose that $U$ and $V$ are independent $k$-vectors, with $U \sim \mathrm{Dir}(\alpha)$ and $V \sim \mathrm{Dir}(\gamma)$. Let $W \sim \mathrm{Beta}(\sum_i \alpha_i, \sum_i \gamma_i)$, independently of $U$ and $V$. It may then be shown that
$$WU + (1 - W)V \sim \mathrm{Dir}(\alpha + \gamma).$$

2. (Homework) Let $e_j$ denote a unit basis vector, and let $\beta_j = \gamma_j / \sum_i \gamma_i$. In this case, it may be shown that
$$\sum_j \beta_j\, \mathrm{Dir}(\gamma + e_j) = \mathrm{Dir}(\gamma).$$

3. Let $(W, U)$ be a pair of random variables where $W$ takes values in $[-1, 1]$ and $U$ takes values in a linear space. Suppose $V$ is a random variable taking values in the same linear space as $U$, independent of $(W, U)$, which satisfies the distributional equation
$$V \overset{st}{=} U + WV,$$
where $\overset{st}{=}$ stands for "has the same distribution". If $P(|W| = 1) \neq 1$, then there is a unique distribution for $V$ that satisfies this equation.

By the stick-breaking construction,

$$G = \sum_{k=1}^\infty \pi_k \delta_{\phi_k} = \pi_1 \delta_{\phi_1} + (1 - \pi_1) \sum_{k=2}^\infty \tilde{\pi}_k \delta_{\phi_k},$$

where $\tilde{\pi}_k = \pi_k / (1 - \pi_1)$; the rescaled weights $(\tilde{\pi}_2, \tilde{\pi}_3, \ldots)$ are again stick-breaking weights, so the second sum is distributed as $G$. This gives us the distributional equation

$$G \overset{st}{=} \pi_1 \delta_{\phi_1} + (1 - \pi_1) G.$$

Evaluating the LHS and RHS on a partition $(A_1, \ldots, A_k)$ gives us

$$V \overset{st}{=} \pi_1 X + (1 - \pi_1) V \tag{1}$$

where $\pi_1 \sim \mathrm{Beta}(1, \alpha_0)$, $X$ is a $k$-vector that takes the value $e_j$ with probability $G_0(A_j)$, and $V$ is independent of $X$ and $\pi_1$.
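Before completing the argument, here is a Monte Carlo sanity check of the claimed equivalence (a sketch assuming NumPy; the three-cell partition masses and $\alpha_0 = 2$ are illustrative): a truncated stick-breaking draw of $G$, evaluated on the partition, should match the mean and variance of $\mathrm{Dir}(\alpha_0 G_0(A_1), \ldots, \alpha_0 G_0(A_3))$.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha0 = 2.0
G0_A = np.array([0.2, 0.3, 0.5])   # G0(A_1), G0(A_2), G0(A_3) (illustrative)
trials, K = 10_000, 300            # K = truncation level for the infinite sum

# pi_k = beta_k prod_{j<k}(1 - beta_j), with beta_k ~ Beta(1, alpha0).
betas = rng.beta(1.0, alpha0, size=(trials, K))
left = np.cumprod(np.hstack([np.ones((trials, 1)), 1.0 - betas[:, :-1]]), axis=1)
pis = betas * left

# Each atom phi_k ~ G0 lands in cell A_j with probability G0(A_j).
cells = rng.choice(3, p=G0_A, size=(trials, K))
GA = np.stack([np.where(cells == j, pis, 0.0).sum(axis=1) for j in range(3)], axis=1)

print(GA.mean(axis=0), rng.dirichlet(alpha0 * G0_A, trials).mean(axis=0))
# For Dir(alpha0 * G0_A): Var G(A_j) = G0(A_j)(1 - G0(A_j)) / (alpha0 + 1).
print(GA.var(axis=0), G0_A * (1.0 - G0_A) / (alpha0 + 1.0))
```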

We show that the $k$-dimensional Dirichlet distribution $V \sim \mathrm{Dir}(\alpha_0 G_0(A_1), \ldots, \alpha_0 G_0(A_k))$ satisfies Equation (1) and therefore, by Fact 3, is the unique distribution that satisfies it. Therefore, $V$ constructed via the stick-breaking construction and $V$ as defined in the Dirichlet process are equivalent.

Now to show that the $k$-dimensional Dirichlet distribution satisfies Equation (1). Let $V$ on the RHS of Equation (1) have the $k$-dimensional Dirichlet distribution $\mathrm{Dir}(\alpha_0 G_0(A_1), \ldots, \alpha_0 G_0(A_k))$. By definition, the $k$-dimensional Dirichlet distribution $\mathrm{Dir}(e_j)$ assigns probability 1 to the cell $A_j$. Now, conditioning on $X = e_j$, the distribution of $\pi_1 X + (1 - \pi_1) V$ is the combination $\pi_1\, \mathrm{Dir}(e_j) + (1 - \pi_1)\, \mathrm{Dir}(\alpha_0 G_0(A_1), \ldots, \alpha_0 G_0(A_k))$; since $\pi_1 \sim \mathrm{Beta}(1, \alpha_0)$, Fact 1 applies, and this is distributed $\mathrm{Dir}\big((\alpha_0 G_0(A_1), \ldots, \alpha_0 G_0(A_k)) + e_j\big)$. Now, integrating over the distribution of $X$, where

$$P(X = e_j) = G_0(A_j) = \frac{\alpha_0 G_0(A_j)}{\sum_i \alpha_0 G_0(A_i)},$$

and using Fact 2, we see that the RHS is distributed $\mathrm{Dir}(\alpha_0 G_0(A_1), \ldots, \alpha_0 G_0(A_k))$. Therefore, the stick-breaking construction and the mathematical definition of the Dirichlet process are equivalent.

References

[1] Ferguson, T. S. 1973. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics 1(2): 209-230.

[2] Sethuraman, J. 1994. A Constructive Definition of Dirichlet Priors. Statistica Sinica 4: 639-650.