Hierarchical Beta Processes and the Indian Buffet Process



Romain Thibaux
Dept. of EECS
University of California, Berkeley
Berkeley, CA 94720

Michael I. Jordan
Dept. of EECS and Dept. of Statistics
University of California, Berkeley
Berkeley, CA 94720

Abstract

We show that the beta process is the de Finetti mixing distribution underlying the Indian buffet process of [2]. This result shows that the beta process plays the role for the Indian buffet process that the Dirichlet process plays for the Chinese restaurant process, a parallel that guides us in deriving analogs for the beta process of the many known extensions of the Dirichlet process. In particular we define Bayesian hierarchies of beta processes and use the connection to the beta process to develop posterior inference algorithms for the Indian buffet process. We also present an application to document classification, exploring a relationship between the hierarchical beta process and smoothed naive Bayes models.

1 Introduction

Mixture models provide a well-known probabilistic approach to clustering in which the data are assumed to arise from an exchangeable set of choices among a finite set of mixture components. Dirichlet process mixture models provide a nonparametric Bayesian approach to mixture modeling that does not require the number of mixture components to be known in advance [1]. The basic idea is that the Dirichlet process induces a prior distribution over partitions of the data, a distribution that is readily combined with a prior distribution over parameters and a likelihood. The distribution over partitions can be generated incrementally using a simple scheme known as the Chinese restaurant process.

As an alternative to the multinomial representation underlying classical mixture models, factorial models associate to each data point a set of latent Bernoulli variables. The factorial representation has several advantages. First, the Bernoulli variables may have a natural interpretation as featural descriptions of objects. Second, the representation of objects in terms of sets of Bernoulli variables provides a natural way to define interesting topologies on clusters (e.g., as the number of features that two clusters have in common). Third, the number of clusters representable with $m$ features is $2^m$, and thus the factorial approach may be appropriate for situations involving large numbers of clusters.

As in the mixture model setting, it is desirable to consider nonparametric Bayesian approaches to factorial modeling that remove the assumption that the cardinality of the set of features is known a priori. An important first step in this direction has been provided by Griffiths and Ghahramani [2], who defined a stochastic process on features that can be viewed as a factorial analog of the Chinese restaurant process. This process, referred to as the Indian buffet process, involves the metaphor of a sequence of customers tasting dishes in an infinite buffet. Let $Z_i$ be a binary vector where $Z_{i,k} = 1$ if customer $i$ tastes dish $k$. Customer $i$ tastes dish $k$ with probability $m_k/i$, where $m_k$ is the number of customers that have previously tasted dish $k$; that is, $Z_{i,k} \sim \mathrm{Ber}(m_k/i)$. Having sampled from the dishes previously sampled by other customers, customer $i$ then goes on to taste an additional number of new dishes determined by a draw from a $\mathrm{Poisson}(\alpha/i)$ distribution. Modulo a reordering of the features, the Indian buffet process can be shown to generate an exchangeable distribution over binary matrices; that is, $P(Z_1, \ldots, Z_n) = P(Z_{\sigma(1)}, \ldots, Z_{\sigma(n)})$ for any permutation $\sigma$.
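To make the metaphor concrete, here is a minimal sketch (not from the paper) of the generative scheme just described; the function name and the choice of Python/NumPy are incidental.

```python
import numpy as np

def indian_buffet_process(n_customers, alpha, rng=None):
    """Sample a binary feature matrix Z from the one-parameter IBP.

    Customer i tastes a previously tasted dish k with probability m_k / i,
    then tastes Poisson(alpha / i) brand-new dishes.
    """
    rng = np.random.default_rng(rng)
    dish_counts = []          # m_k for each dish seen so far
    rows = []                 # set of dish indices tasted by each customer
    for i in range(1, n_customers + 1):
        tasted = {k for k, m_k in enumerate(dish_counts) if rng.random() < m_k / i}
        n_new = rng.poisson(alpha / i)
        new_dishes = range(len(dish_counts), len(dish_counts) + n_new)
        tasted.update(new_dishes)
        dish_counts.extend([0] * n_new)
        for k in tasted:
            dish_counts[k] += 1
        rows.append(tasted)
    # Assemble the binary matrix Z (customers x dishes).
    Z = np.zeros((n_customers, len(dish_counts)), dtype=int)
    for i, tasted in enumerate(rows):
        Z[i, list(tasted)] = 1
    return Z

Z = indian_buffet_process(n_customers=10, alpha=2.0, rng=0)
```

Up to a reordering of the columns, the rows of $Z$ are exchangeable, which is the property that motivates the search for the de Finetti mixing distribution below.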
Given such an exchangeability result, it is natural to inquire as to the underlying distribution that renders the sequence conditionally independent. Indeed, de Finetti's theorem states that the distribution of any

infinitely exchangeable sequence can be written
$$P(Z_1, \ldots, Z_n) = \int \left[ \prod_{i=1}^{n} P(Z_i \mid B) \right] dP(B),$$
where $B$ is the random element that renders the variables $\{Z_i\}$ conditionally independent and where we will refer to the distribution $P(B)$ as the de Finetti mixing distribution. For the Chinese restaurant process, the underlying de Finetti mixing distribution is known: it is the Dirichlet process. As this result suggests, identifying the de Finetti mixing distribution behind a given exchangeable sequence is important; it greatly extends the range of statistical applications of the exchangeable sequence.

In this paper we make the following three contributions:

1. We identify the de Finetti mixing distribution behind the Indian buffet process. In particular, in Sec. 4 we show that this distribution is the beta process. We also show that this connection motivates a two-parameter generalization of the Indian buffet process proposed in [3]. While the beta process has been previously studied for its applications in survival analysis, this result shows that it is also the natural object of study in nonparametric Bayesian factorial modeling.

2. In Sec. 5 we exploit the link between the beta process and the Indian buffet process to provide a new algorithm to sample beta processes.

3. In Sec. 6 we define the hierarchical beta process, an analog for factorial modeling of the hierarchical Dirichlet process [11]. The hierarchical beta process makes it possible to specify models in which features are shared among a number of groups. We present an example of such a model in an application to document classification in Sec. 7, where we also explore the relationship of the hierarchical beta process to naive Bayes models.

2 The beta process

The beta process was defined by Hjort [4] for applications in survival analysis. In those applications, the beta process plays the role of a distribution on functions (cumulative hazard functions) defined on the positive real line. In our applications, the sample paths of the beta process need to be defined on more general spaces. We thus develop a nomenclature that is more suited to these more general applications.

A positive random measure $B$ on a space $\Omega$ (e.g., $\mathbb{R}$) is a Lévy process, or independent increment process, if the masses $B(S_1), \ldots, B(S_k)$ assigned to disjoint subsets $S_1, \ldots, S_k$ of $\Omega$ are independent.¹ The Lévy-Khinchine theorem [5, 8] implies that a positive Lévy process is uniquely characterized by its Lévy measure (or compensator), a measure on $\Omega \times \mathbb{R}_+$.

Figure 1: Top: a measure $B$ sampled from a beta process (blue), along with its corresponding cumulative distribution function (red). The horizontal axis is $\Omega$, here $[0, 1]$. The tips of the blue segments are drawn from a Poisson process with base measure the Lévy measure. Bottom: samples from a Bernoulli process with base measure $B$, one per line. Samples are sets of points, obtained by including each point $\omega$ independently with probability $B(\{\omega\})$.

Definition. A beta process $B \sim \mathrm{BP}(c, B_0)$ is a positive Lévy process whose Lévy measure depends on two parameters: $c$ is a positive function over $\Omega$ that we call the concentration function, and $B_0$ is a fixed measure on $\Omega$, called the base measure. In the special case where $c$ is a constant it will be called the concentration parameter. We also call $\gamma = B_0(\Omega)$ the mass parameter.

If $B_0$ is continuous, the Lévy measure of the beta process is
$$\nu(d\omega, dp) = c(\omega)\, p^{-1} (1 - p)^{c(\omega) - 1}\, dp\, B_0(d\omega) \qquad (1)$$
on $\Omega \times [0, 1]$. As a function of $p$, it is a degenerate beta distribution, justifying the name. $\nu$ has the following elegant interpretation. To draw $B \sim \mathrm{BP}(c, B_0)$, draw a set of points $(\omega_i, p_i) \in \Omega \times [0, 1]$ from a Poisson process with base measure $\nu$ (see Fig. 1), and let
$$B = \sum_i p_i \delta_{\omega_i}, \qquad (2)$$
where $\delta_\omega$ is a unit point mass (atom) at $\omega$.

¹Positivity is not required to define Lévy processes but greatly simplifies their study, and positive Lévy processes are sufficient here. On $\Omega = \mathbb{R}$, positive Lévy processes are also called subordinators.

This implies $B(S) = \sum_{i : \omega_i \in S} p_i$ for all $S \subseteq \Omega$. As this representation shows, $B$ is discrete, and the pairs $(\omega_i, p_i)$ correspond to the locations $\omega_i \in \Omega$ and weights $p_i \in [0, 1]$ of its atoms. Since $\nu(\Omega \times [0, 1]) = \infty$, the Poisson process generates infinitely many points, making (2) a countably infinite sum. Nonetheless, as shown in the appendix, its expectation is finite if $B_0$ is finite.

If $B_0$ is discrete², of the form $B_0 = \sum_i q_i \delta_{\omega_i}$, then $B$ has atoms at the same locations, $B = \sum_i p_i \delta_{\omega_i}$, with
$$p_i \sim \mathrm{Beta}\big(c(\omega_i)\, q_i,\; c(\omega_i)(1 - q_i)\big). \qquad (3)$$
This requires $q_i \in [0, 1]$. If $B_0$ is mixed discrete-continuous, $B$ is the sum of the two independent contributions.

3 The Bernoulli process

Definition. Let $B$ be a measure on $\Omega$. We define a Bernoulli process with hazard measure $B$, written $X \sim \mathrm{BeP}(B)$, as the Lévy process with Lévy measure
$$\mu(dp, d\omega) = \delta_1(dp)\, B(d\omega). \qquad (4)$$
If $B$ is continuous, $X$ is simply a Poisson process with intensity $B$: $X = \sum_{i=1}^{N} \delta_{\omega_i}$, where $N \sim \mathrm{Poi}(B(\Omega))$ and the $\omega_i$ are independent draws from the distribution $B / B(\Omega)$. If $B$ is discrete, of the form $B = \sum_i p_i \delta_{\omega_i}$, then $X = \sum_i b_i \delta_{\omega_i}$, where the $b_i$ are independent Bernoulli variables with the probability that $b_i = 1$ equal to $p_i$. If $B$ is mixed discrete-continuous, $X$ is the sum of the two independent contributions.

In summary, a Bernoulli process is similar to a Poisson process, except that it can give weight at most 1 to singletons, even if the base measure $B$ is discontinuous, for instance if $B$ is itself drawn from a beta process. We can intuitively think of $\Omega$ as a space of potential features, and $X$ as an object defined by the features it possesses. The random measure $B$ encodes the probability that $X$ possesses each particular feature. In the Indian buffet metaphor, $X$ is a customer and its features are the dishes it tastes.

Conjugacy. Let $B \sim \mathrm{BP}(c, B_0)$, and let $X_i \mid B \sim \mathrm{BeP}(B)$ for $i = 1, \ldots, n$ be $n$ independent Bernoulli process draws from $B$. Let $X_{1 \ldots n}$ denote the set of observations $\{X_1, \ldots, X_n\}$. Applying Theorem 3.3 of Kim [6] to the beta and Bernoulli processes, the posterior distribution of $B$ after observing $X_{1 \ldots n}$ is still a beta process with modified parameters:
$$B \mid X_{1 \ldots n} \sim \mathrm{BP}\!\left(c + n,\; \frac{c}{c + n} B_0 + \frac{1}{c + n} \sum_{i=1}^{n} X_i\right). \qquad (5)$$
That is, the beta process is conjugate to the Bernoulli process. This result can also be derived from Corollary 4.1 of Hjort [4].

²The Lévy measure still exists, but is not as useful as in the continuous case: $\nu(dp, d\omega) = \sum_i \mathrm{Beta}\big(c(\omega_i) q_i, c(\omega_i)(1 - q_i)\big)(dp)\, \delta_{\omega_i}(d\omega)$.

4 Connection to the Indian buffet process

We now present the connection between the beta process and the Indian buffet process. The first step is to marginalize out $B$ to obtain the marginal distribution of $X_1$. Independence of $X_1$ on disjoint intervals is preserved, so $X_1$ is a Lévy process. Since it gives mass 0 or 1 to singletons, its Lévy measure is of the form (4), so $X_1$ is still a Bernoulli process. Its expectation is $E X_1 = E[E[X_1 \mid B]] = E B = B_0$, so its hazard measure is $B_0$.³ Combining this with Eq. 5 and using $P(X_{n+1} \mid X_{1 \ldots n}) = E_{B \mid X_{1 \ldots n}} P(X_{n+1} \mid B)$ gives us the following formula, which we rewrite using the notation $m_{n,j}$, the number of customers among $X_{1 \ldots n}$ having tried dish $\omega_j$:
$$X_{n+1} \mid X_{1 \ldots n} \sim \mathrm{BeP}\!\left(\frac{c}{c + n} B_0 + \frac{1}{c + n} \sum_{i=1}^{n} X_i\right) = \mathrm{BeP}\!\left(\frac{c}{c + n} B_0 + \sum_j \frac{m_{n,j}}{c + n}\, \delta_{\omega_j}\right). \qquad (6)$$
To make the connection to the Indian buffet process, let us first assume that $c$ is a constant and $B_0$ is continuous with finite total mass $B_0(\Omega) = \gamma$. Observe what happens when we generate $X_{1 \ldots n}$ sequentially using Eq. 6. Since $X_1 \sim \mathrm{BeP}(B_0)$ and $B_0$ is continuous, $X_1$ is a Poisson process with intensity $B_0$. In particular, the total number of features of $X_1$ is $X_1(\Omega) \sim \mathrm{Poi}(\gamma)$. This corresponds to the first customer trying a $\mathrm{Poi}(\gamma)$ number of dishes.
Separating the base measure of Eq. 6 into its continuous and discrete parts, we see that $X_{n+1}$ is the sum of two independent Bernoulli processes, $X_{n+1} = U + V$, where $U \sim \mathrm{BeP}(\sum_j \frac{m_{n,j}}{c + n} \delta_{\omega_j})$ has an atom at $\omega_j$ (tastes dish $j$) with independent probability $m_{n,j}/(c + n)$, and $V \sim \mathrm{BeP}(\frac{c}{c + n} B_0)$ is a Poisson process with intensity $\frac{c}{c + n} B_0$, generating a $\mathrm{Poi}(\frac{c\gamma}{c + n})$ number of new features (new dishes).

³We show that $E B = B_0$ in the Appendix.
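The predictive rule of Eq. (6) can be simulated directly. The sketch below (not from the paper) does so for a constant concentration c and a uniform base measure on [0, 1] standing in for B_0, which is enough to see the two-parameter behaviour discussed next.

```python
import numpy as np

def sample_from_predictive(n, c, gamma, rng=None):
    """Sequentially sample X_1, ..., X_n via the predictive rule of Eq. (6).

    A previously seen atom omega is kept with probability m_omega / (c + i),
    where m_omega is the number of earlier draws containing it and i is the
    number of earlier draws; a Poisson(c * gamma / (c + i)) number of new
    atoms is then added, with locations drawn from B_0 / gamma (here uniform).
    """
    rng = np.random.default_rng(rng)
    counts = {}                               # atom location -> m_omega
    draws = []
    for i in range(n):
        X_i = [w for w, m in counts.items() if rng.random() < m / (c + i)]
        X_i += list(rng.uniform(0.0, 1.0, size=rng.poisson(c * gamma / (c + i))))
        for w in X_i:
            counts[w] = counts.get(w, 0) + 1
        draws.append(sorted(X_i))
    return draws

draws = sample_from_predictive(n=20, c=1.0, gamma=5.0, rng=0)
```

Setting (c, γ) = (1, α) recovers the Indian buffet process sketched in the introduction.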

This is a two-parameter $(c, \gamma)$ generalization of the Indian buffet process, which we recover when we let $(c, \gamma) = (1, \alpha)$. Griffiths and Ghahramani propose such an extension in [3]. The customers together try a $\mathrm{Poi}(n\gamma)$ number of dishes, but because they tend to try the same dishes, the number of unique dishes is only
$$\mathrm{Poi}\!\left(\gamma \sum_{i=0}^{n-1} \frac{c}{c + i}\right), \quad \text{roughly} \quad \mathrm{Poi}\!\left(\gamma + c\gamma \log \frac{c + n}{c + 1}\right). \qquad (7)$$
This quantity becomes $\mathrm{Poi}(\gamma)$ if all customers share the same dishes, or $\mathrm{Poi}(n\gamma)$ if there is no sharing, justifying the names concentration parameter for $c$ and mass parameter for $\gamma$. The effect of $c$ and $\gamma$ is illustrated in Fig. 2.

Figure 2: Draws from a beta process with concentration $c$ and uniform base measure with mass $\gamma$, as we vary $c$ and $\gamma$. For each draw, samples are shown from the corresponding Bernoulli process, one per line.

5 An algorithm to generate beta processes

Eq. 5, when $n = 1$ and $B_0$ is continuous, shows that $B \mid X_1$ is the sum of two independent beta processes $F_1 \sim \mathrm{BP}(c + 1, \frac{1}{c+1} X_1)$ and $G_1 \sim \mathrm{BP}(c + 1, \frac{c}{c+1} B_0)$. In particular, this lets us sample $B$ by first sampling $X_1$, which is a Poisson process with base measure $B_0$, then sampling $F_1$ and $G_1$. Let $X_1 = \sum_{j=1}^{K_1} \delta_{\omega_j}$. Since the base measure of $F_1$ is discrete, we can apply Eq. 3, with $c(\omega_i) = c + 1$ and $q_i = 1/(c+1)$. We get $F_1 = \sum_{j=1}^{K_1} p_j \delta_{\omega_j}$, where $p_j \sim \mathrm{Beta}(1, c)$. $G_1$ has concentration $c + 1$ and mass $c\gamma/(c+1)$. Since its base measure is continuous, we can further decompose it via Eq. 5 into two independent beta processes $F_2$ and $G_2$. By induction we get, for any $n$, $B \overset{d}{=} \hat{B}_n + G_n$ where $\hat{B}_n \overset{d}{=} \sum_{i=1}^{n} F_i$. Since $\lim_{n \to \infty} \hat{B}_n \overset{d}{=} B$, we can use $\hat{B}_n$ as an approximation of $B$. The following iterative algorithm constructs $\hat{B}_n$, starting with $\hat{B}_0 = 0$. At each step $n \geq 1$: sample $K_n \sim \mathrm{Poi}(\frac{c\gamma}{c + n - 1})$; sample $K_n$ new locations $\omega_j$ from $\frac{1}{\gamma} B_0$ independently; sample their weights $p_j \sim \mathrm{Beta}(1, c + n - 1)$ independently; and set $\hat{B}_n = \hat{B}_{n-1} + \sum_{j=1}^{K_n} p_j \delta_{\omega_j}$. The expected mass added at step $n$ is $E F_n(\Omega) = \frac{c\gamma}{(c + n)(c + n - 1)}$ and the expected remaining mass after step $n$ is $E G_n(\Omega) = \frac{c\gamma}{c + n}$.

Other algorithms exist to build approximations of beta processes. The Inverse Lévy Measure algorithm of Wolpert and Ickstadt [12] is very general and can generate atoms in decreasing order of weight, but requires inverting the incomplete beta function at each step, which is computationally intensive. The algorithm of Lee and Kim [7] bypasses this difficulty by approximating the beta process by a compound Poisson process, but requires a fixed approximation level. This means that their algorithm only converges in distribution. Our algorithm is a simple and efficient alternative. It is closely related to the stick-breaking construction of Dirichlet processes [10], in that it generates the atoms of $B$ in a size-biased order.

6 The hierarchical beta process

The parallel with the Dirichlet process leads us to consider hierarchies of beta processes in a manner akin to the hierarchical Dirichlet processes of [11]. To motivate our construction, let us consider the following application to document classification, to which we return in Sec. 7. Suppose that our training data $X$ is a list of documents, where each document is classified by one of $n$ categories. We model a document by the set of words it contains. In particular, we do not take the number of appearances of each word into account. We assume that document $X_{i,j}$ is generated by including each word $\omega$ independently with a probability $p_j(\omega)$ specific to category $j$. These probabilities form a discrete measure $A_j$ over the space of words $\Omega$, and we put a beta process $\mathrm{BP}(c_j, B)$ prior on $A_j$.
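A minimal sketch (not from the paper) of forward-simulating this hierarchy: B is approximated by a fixed number of steps of the construction of Sec. 5, each A_j is obtained atom by atom via Eq. (3), and documents are Bernoulli process draws. The uniform base measure, the shared per-category concentration c_j, and all numerical values are placeholders.

```python
import numpy as np

def approx_beta_process(c, gamma, n_steps, rng):
    """N-step approximation of B ~ BP(c, B_0) from Sec. 5, with B_0 uniform
    on [0, 1] and mass gamma.  Returns atom locations and weights."""
    locs, weights = [], []
    for step in range(1, n_steps + 1):
        k = rng.poisson(c * gamma / (c + step - 1))     # K_n new atoms
        locs.extend(rng.uniform(0.0, 1.0, size=k))      # locations from B_0 / gamma
        weights.extend(rng.beta(1.0, c + step - 1, size=k))
    return np.array(locs), np.array(weights)

def sample_hierarchy(n_groups, n_docs, c0, gamma, c_j, n_steps=200, seed=0):
    """Forward sample: B ~ BP(c0, B_0), A_j ~ BP(c_j, B), X_ij ~ BeP(A_j)."""
    rng = np.random.default_rng(seed)
    locs, b = approx_beta_process(c0, gamma, n_steps, rng)
    docs = []
    for j in range(n_groups):
        # Eq. (3): at each atom of the discrete base measure B, the weight of
        # A_j is Beta(c_j * b, c_j * (1 - b)).
        a_j = rng.beta(c_j * b, c_j * (1.0 - b))
        # Bernoulli process draws: include each atom independently with prob a_j.
        docs.append([rng.random(len(locs)) < a_j for _ in range(n_docs)])
    return locs, b, docs

locs, b, docs = sample_hierarchy(n_groups=3, n_docs=5, c0=2.0, gamma=10.0, c_j=5.0)
```

Truncating the construction at a finite number of steps leaves out atoms of total expected mass cγ/(c + N), per the remark in Sec. 5.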

Figure 3: Left: graphical model for the hierarchical beta process. Right: example draws from this model with $B_0$ uniform on $[0, 1]$; from top to bottom are shown a sample of $B$, of $A_j$, and 25 samples $X_{1,j}, \ldots, X_{25,j}$.

If $B$ is a continuous measure, implying that $\Omega$ is infinite, with probability one the $A_j$'s will share no atoms, an undesirable result in the practical application to documents. For categories to share words, $B$ must be discrete. On the other hand, $B$ is unknown a priori, so it must be random. This suggests that $B$ should itself be generated as a realization of a beta process. We thus put a beta process prior $\mathrm{BP}(c_0, B_0)$ on $B$. This allows sharing of statistical strength among categories. In summary, we have the following model, whose graphical representation is shown in Fig. 3:

Baseline: $B \sim \mathrm{BP}(c_0, B_0)$
Categories: $A_j \sim \mathrm{BP}(c_j, B)$, for $j \leq n$   (8)
Documents: $X_{i,j} \sim \mathrm{BeP}(A_j)$, for $i \leq n_j$

To classify a new document $Y$ we need to compare its probability under each category, $P(X_{n_j+1,j} = Y \mid X)$, where $X_{n_j+1,j}$ is a new document in category $j$. The next two subsections give a Monte Carlo inference algorithm to do this, for hierarchies of arbitrarily many levels.

6.1 The discrete part

Since all elements of our model are Lévy processes, we can partition the space $\Omega$ and perform inference separately on each part. We choose the partition that has cells $\{\omega\}$ for each of the features $\omega$ that have been observed at least once, and has a single large cell containing the rest of the space. We first consider inference over the singletons $\{\omega\}$, and return to inference over the remaining cell in the following subsection.

Inference for $\{\omega\}$ deals only with the values $b_0 = B_0(\{\omega\})$, $b = B(\{\omega\})$, $a_j = A_j(\{\omega\})$ and $x_{ij} = X_{i,j}(\{\omega\})$. Let $x$ denote the set of all $x_{ij}$ and let $a$ denote the set of all $a_j$. These variables form a slice of the hierarchy of their respective processes, and they have the following distributions:
$$b \sim \mathrm{Beta}(c_0 b_0, c_0 (1 - b_0)), \qquad a_j \sim \mathrm{Beta}(c_j b, c_j (1 - b)), \qquad x_{ij} \sim \mathrm{Ber}(a_j). \qquad (9)$$
Strictly speaking, if $B_0$ is continuous, $b_0 = 0$ and so the prior over $b$ is improper. We treat this by considering $b_0$ to be non-zero and taking the limit as $b_0$ approaches zero. This is justified under the limit construction of the beta process (see Theorem 3.1 of [4]).

For a fixed value of $b$, we can average over $a$ using conjugacy. To average over $b$, we use rejection sampling: we sample $b$ from an approximation of its conditional distribution and correct for the difference by rejection. Let $m_j = \sum_{i=1}^{n_j} x_{ij}$. Marginalizing out $a$ in Eq. 9 and using $\Gamma(x + 1) = x\Gamma(x)$, the log posterior distribution of $b$ given $x$ is, up to a constant,
$$f(b) = (c_0 b_0 - 1) \log b + (c_0 (1 - b_0) - 1) \log(1 - b) + \sum_j \left[ \sum_{i=0}^{m_j - 1} \log(c_j b + i) + \sum_{i=0}^{n_j - m_j - 1} \log(c_j (1 - b) + i) \right]. \qquad (10)$$
This posterior is log concave and has a maximum at $b^* \in (0, 1)$, which we can obtain by binary search. Using the concavity of $\log(c_j b + i)$ for $i > 0$, of $\log(c_j (1 - b) + i)$ for $i \geq 0$ and of $\log(1 - b)$, bounding each tangentially at $b^*$, we obtain the following upper bound on $f$ (up to a constant): $g(b) = (\alpha - 1) \log b - b/\beta$, where
$$\alpha = c_0 b_0 + \sum_j 1_{m_j > 0}, \qquad \frac{1}{\beta} = \frac{c_0 (1 - b_0) - 1}{1 - b^*} + \sum_j \sum_{i=0}^{n_j - m_j - 1} \frac{c_j}{c_j (1 - b^*) + i} - \sum_j 1_{m_j > 0} \sum_{i=1}^{m_j - 1} \frac{c_j}{c_j b^* + i}.$$
$g$ is, up to a constant, the log density of a $\mathrm{Gamma}(\alpha, \beta)$ variable, which we can use as a rejection sampling proposal distribution. Setting $u = 1 - b$ in Eq. 10 maintains the form of $f$, only exchanging the coefficients. Therefore we can choose instead to approximate $1 - b$ by a gamma variable. Among these two possible approximations we choose the one with the lowest variance $\alpha \beta^2$.
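As a concrete check on Eq. (10), the sketch below (not the paper's sampler) evaluates the unnormalized log posterior f on a grid and draws b by inverse-CDF from that grid; the paper instead corrects the Gamma(α, β) proposal above by rejection. All numerical values are placeholders.

```python
import numpy as np

def log_f(b, c0, b0, c, m, n_docs):
    """Unnormalized log posterior of Eq. (10) for a single feature omega.

    c, m and n_docs are sequences over categories j: concentrations c_j,
    counts m_j of documents containing the feature, and category sizes n_j.
    """
    val = (c0 * b0 - 1.0) * np.log(b) + (c0 * (1.0 - b0) - 1.0) * np.log(1.0 - b)
    for c_j, m_j, n_j in zip(c, m, n_docs):
        for i in range(m_j):
            val = val + np.log(c_j * b + i)
        for i in range(n_j - m_j):
            val = val + np.log(c_j * (1.0 - b) + i)
    return val

def sample_b_grid(c0, b0, c, m, n_docs, n_grid=10000, rng=None):
    """Draw b from exp(f) by inverse-CDF on a regular grid over (0, 1)."""
    rng = np.random.default_rng(rng)
    grid = (np.arange(n_grid) + 0.5) / n_grid
    logp = log_f(grid, c0, b0, c, m, n_docs)
    p = np.exp(logp - logp.max())
    cdf = np.cumsum(p) / p.sum()
    return grid[min(np.searchsorted(cdf, rng.random()), n_grid - 1)]

# Placeholder values: 3 categories, a prior close to the improper limit b0 -> 0.
b = sample_b_grid(c0=1.0, b0=0.01, c=[10.0, 10.0, 10.0], m=[2, 0, 5], n_docs=[10, 4, 20])
```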

In the experiments of Sec. 7 the acceptance rate was above 90%, showing that $g$ is a good approximation of $f$.

This algorithm gives us samples from $P(b \mid x)$. In particular, with $T$ samples $b_1, \ldots, b_T$, we can compute the following approximation:
$$P(x_{n_j+1,j} = 1 \mid x) = E[E[a_j \mid b, x] \mid x] = \frac{c_j E[b \mid x] + m_j}{c_j + n_j}, \qquad \text{where } E[b \mid x] \approx \frac{1}{T} \sum_t b_t.$$
In the end, if we want samples of $a$ we can obtain them easily. We can sample from the conditional distribution of $a_j$, which is, by conjugacy,
$$a_j \mid b, x \sim \mathrm{Beta}\big(c_j b + m_j,\; c_j (1 - b) + n_j - m_j\big).$$

6.2 The continuous part

Let us now look at the rest of the space, where all observations are equal to zero. We approximate $B$ on that part of the space by $\hat{B}_N$, obtained after $N$ steps of the algorithm of Sec. 5. $\hat{B}_N$ consists of $K_k \sim \mathrm{Poi}(\frac{c\gamma}{c + k - 1})$ atoms at each level $k = 1, \ldots, N$. For each atom $(\omega, b)$ out of the $K_k$ atoms of level $k$, we have the following hierarchy. It is similar to Eq. 9 except that it refers to a random location $\omega$ chosen in size-biased order:
$$b \sim \mathrm{Beta}(1, c + k - 1), \qquad a_j \sim \mathrm{Beta}(c_j b, c_j (1 - b)), \qquad x_{ij} \sim \mathrm{Ber}(a_j). \qquad (11)$$
We want to infer the distribution of the next observation $X_{n_j+1,j}$ from group (or category) $j$, given that all other observations from all groups are zero, that is, $X = 0$. The locations of the atoms of $X_{n_j+1,j}$ will be distributed as $B_0/\gamma$, so we only need to know the distribution of $X_{n_j+1,j}(\Omega)$, the number of atoms.

$X = 0$ implies that all levels $k$ have generated zero observations. Since each level is independent, we can reason on each level separately, where we want the posterior distribution of $K_k$ and of the variables in Eq. 11. Let $x = \{x_{ij} : j \leq n, i \leq n_j\}$ and $a = \{a_j : j \leq n\}$. Let $P_k$ be the probability over $b$, $a$ and $x$ defined by Eq. 11 and let $q_k = P_k(x = 0)$. The posterior distribution of $K_k$ is
$$P(K_k = m \mid X = 0) \propto q_k^m\, \mathrm{Poi}\!\left(m \,\Big|\, \frac{c\gamma}{c + k - 1}\right), \qquad \text{so} \qquad K_k \mid X = 0 \sim \mathrm{Poi}\!\left(\frac{c\gamma}{c + k - 1}\, q_k\right).$$
Let $p_k = P_k(x_{n_j+1,j} = 1, x = 0)$; then $P_k(x_{n_j+1,j} = 1 \mid x = 0) = p_k / q_k$. Let $D_k$ be the number of atoms of $X_{n_j+1,j}$ from level $k$:
$$D_k \sim \mathrm{Poi}\!\left(\frac{c\gamma}{c + k - 1}\, q_k \cdot \frac{p_k}{q_k}\right) = \mathrm{Poi}\!\left(\frac{c\gamma}{c + k - 1}\, p_k\right).$$
Adding the contributions of all levels we get the following result, which is exact for $N = \infty$:
$$X_{n_j+1,j}(\Omega) \sim \mathrm{Poi}\!\left(\sum_{k=1}^{N} \frac{c\gamma}{c + k - 1}\, p_k\right).$$
Using $T$ samples $b_{k,1}, \ldots, b_{k,T}$ from Eq. 11 we can compute $p_k$. Let
$$r(b) = E_k\!\left[a_j \prod_{j'=1}^{n} (1 - a_{j'})^{n_{j'}} \,\Big|\, b\right] = \frac{c_j b}{c_j + n_j} \prod_{j'=1}^{n} \frac{\Gamma(c_{j'})\, \Gamma(c_{j'}(1 - b) + n_{j'})}{\Gamma(c_{j'}(1 - b))\, \Gamma(c_{j'} + n_{j'})};$$
then
$$p_k = E_k[r(b)] \approx \frac{1}{T} \sum_{t=1}^{T} r(b_{k,t}). \qquad (12)$$

6.3 Larger hierarchies

We now extend the model of Eq. 8 to larger hierarchies such as:

Baseline: $B \sim \mathrm{BP}(c_0, B_0)$
Categories: $A_j \sim \mathrm{BP}(c_j, B)$, for $j \leq n$
Subcategories: $S_{l,j} \sim \mathrm{BP}(c_{l,j}, A_j)$, for $l \leq n_j$
Documents: $X_{i,l,j} \sim \mathrm{BeP}(S_{l,j})$, for $i \leq n_{l,j}$

To extend the algorithm of Sec. 6.2 we draw samples of $a$ from Eq. 11 and replace $b$ by $a$ in Eq. 12. Extending the algorithm of Sec. 6.1 is less immediate, since we can no longer use conjugacy to integrate out $a$. The Markov chain must now instantiate $a$ and $b$. Sec. 6.1 lets us sample $a \mid b, x$, leaving us with the task of sampling $b \mid b_0, a$. Up to a constant, the log conditional probability of $b$ is
$$f_2(b) = (c_0 b_0 - 1) \log b + (c_0 (1 - b_0) - 1) \log(1 - b) - \sum_j \left[\log \Gamma(c_j b) + \log \Gamma(c_j (1 - b))\right] + \sum_j c_j b \log\frac{a_j}{1 - a_j}. \qquad (13)$$
Let $s(x) = -\log \Gamma(x) - \log x$. Since $s$ is concave, $f_2$ itself is concave, with a maximum $b^*$ in $(0, 1)$ which we can obtain by binary search.
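That binary search can be written directly on the derivative of Eq. (13), which is decreasing because f_2 is concave. The sketch below (not from the paper) does this with SciPy's digamma; the inputs are placeholders.

```python
import numpy as np
from scipy.special import digamma

def f2_prime(b, c0, b0, c, a):
    """Derivative of the log conditional of Eq. (13) with respect to b."""
    c, a = np.asarray(c), np.asarray(a)
    val = (c0 * b0 - 1.0) / b - (c0 * (1.0 - b0) - 1.0) / (1.0 - b)
    val -= np.sum(c * digamma(c * b)) - np.sum(c * digamma(c * (1.0 - b)))
    val += np.sum(c * np.log(a / (1.0 - a)))
    return val

def argmax_f2(c0, b0, c, a, tol=1e-10):
    """Binary search for the maximizer b* of the concave f_2 on (0, 1)."""
    lo, hi = tol, 1.0 - tol
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f2_prime(mid, c0, b0, c, a) > 0.0:   # still increasing: move right
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

b_star = argmax_f2(c0=1.0, b0=0.01, c=[10.0, 10.0, 10.0], a=[0.2, 0.05, 0.1])
```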

Bounding $s$ and $\log(1 - b)$ by their tangents at $b^*$ yields the following upper bound $g_2$ on $f_2$ (omitting the constant): $g_2(b) = (\alpha - 1) \log b - b/\beta$, where
$$\alpha = c_0 b_0 + n, \qquad \frac{1}{\beta} = \frac{c_0 (1 - b_0) - 1}{1 - b^*} + \frac{n}{b^*} + \sum_j c_j \left[\psi(c_j b^*) - \psi(c_j (1 - b^*))\right] - \sum_j c_j \log\frac{a_j}{1 - a_j}.$$
We can use this $\mathrm{Gamma}(\alpha, \beta)$ variable, or the one obtained by setting $u = 1 - b$ in Eq. 13, as a proposal distribution as in Sec. 6.1. The case of $b \mid b_0, a$ is the general case for nodes of large hierarchies, so this algorithm can handle hierarchies of arbitrary depth.

7 Application to document classification

Naive Bayes is a very simple yet powerful probabilistic model used for classification. It models documents as lists of features, and assumes that features are independent given the category. Bayes' rule can then be used on new documents to infer their category. This method has been used successfully in many domains despite its simplistic assumptions.

Naive Bayes does suffer, however, from several known shortcomings. Consider a binary feature $\omega$ and let $p_{j,\omega}$ be the probability that a document from category $j$ has feature $\omega$. Estimating $p_{j,\omega}$ by its maximum likelihood value $m_{j,\omega}/n_j$ leads to many features having probability 0 or 1, making inference impossible. To prevent such extreme values, $p_{j,\omega}$ is generally estimated with Laplace smoothing, which can be interpreted as placing a common $\mathrm{Beta}(a, b)$ prior on $p_{j,\omega}$:
$$\hat{p}_{j,\omega} = \frac{m_{j,\omega} + a}{n_j + a + b}.$$
Laplace smoothing also corrects for unbalanced training data by imposing greater smoothing on the probabilities of small categories, for which we have low confidence. Nonetheless, Laplace smoothing can lead to paradoxes with unbalanced data [9]. Consider the situation where most categories $u$ have enough data to show with confidence that $p_{u,\omega}$ is close to a very small value $p_0$. The impact on classification of $p_{j,\omega}$ is relative to $p_{u,\omega}$ for $u \neq j$, so if category $j$ has little data, we expect $p_{j,\omega}$ to be close to $p_0$ for it to have little impact. Laplace smoothing, however, brings it close to $\frac{a}{a+b}$, very far from $p_0$, where it will have an enormous impact. This inconsistency makes rare features hurt performance, and leads to the practice of combining naive Bayes with feature selection, potentially wasting information.

We propose instead to use a hierarchical beta process (hBP) as a prior over the probabilities $p_{j,\omega}$. Such a hierarchical Bayesian model allows sharing among categories by shrinking the maximum likelihood probabilities towards each other rather than towards $\frac{a}{a+b}$. Such an effect could in principle be achieved using a finite model with a hierarchical beta prior; however, such an approach would not permit new features that do not appear in the training data. The model in Eq. 8 allows the number of known features to grow with the data, and the number of unknown features to serve as evidence for belonging to a poorly known category, one for which we have little training data. A hBP gives a consistent prior for varying amounts of data, whereas Laplace smoothing amounts to changing the prior every time a new feature appears.

We compared the performance of hBP and naive Bayes on 1000 posts from the 20 Newsgroups dataset⁴, grouped into 20 categories. We chose an unbalanced dataset, with the number of documents per category decreasing linearly from 100 to 2. We randomly selected 40% of these posts as a test set for a classification task. We encoded the documents $X_{ij}$ as sets of binary features representing the presence of words. All words were used without any pruning or feature selection. We used the model in Eq. 8, setting the parameters $c_0$, $\gamma$ and $c_j$ a priori in the following way. Since we expect a lot of commonalities between categories, with differences concentrated on a few words only, we take $c_j$ to be small.
Therefore documents drawn from Eq. 8 are close to being drawn from a common $\mathrm{BP}(c_0, B_0)$ prior, under which the expected number of features per document is $\gamma$. Estimating this value from the data gives $\gamma = 150$. Knowing $\gamma$ we can solve for $c_0$ by matching the expectation of Eq. 7 to the number of unique features $N$ in the data. This can be done by interpreting Eq. 7 as a fixed point equation,
$$c_0 \leftarrow \frac{N - \gamma}{\gamma \log\frac{c_0 + n}{c_0 + 1}},$$
leading to $c_0 = 70$. Finally we selected the value of $c_j$ by cross-validation on a held-out portion of the training set, giving us $c_j = 10^{-4}$. We then ran our Monte Carlo inference procedure and classified each document $Y$ of the test set by finding the class for which $P(X_{n_j+1,j} = Y \mid X)$ was highest, obtaining 58% accuracy. By comparison, we performed a grid search over the space of Laplace smoothing parameters $a, b \in [10^{-11}, 10^{-7}]$ for the best naive Bayes model. For $a = 10^{-8}$ and $b = 10^{-7}$, naive Bayes reached 50% accuracy.

⁴The data are available at http://people.csail.mit.edu/jrennie/20newsgroups/.
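To make the contrast concrete, here is a small sketch (not from the paper, with illustrative rather than experimental values) of the two per-feature estimates: Laplace smoothing, which pulls a data-poor category toward a/(a+b), versus the posterior predictive of Sec. 6.1, which pulls it toward the shared value E[b | x] inferred for that feature.

```python
import numpy as np

def laplace_estimate(m, n, a, b):
    """Laplace-smoothed estimate of p_{j,omega}: shrinks toward a / (a + b)."""
    return (m + a) / (n + a + b)

def hbp_estimate(m, n, c_j, e_b):
    """Posterior predictive of Sec. 6.1: (c_j E[b|x] + m_j) / (c_j + n_j),
    which shrinks toward the feature-specific shared value E[b|x]."""
    return (c_j * e_b + m) / (c_j + n)

# A rare feature: three large categories almost never contain it,
# while a fourth, tiny category has no data about it at all.
m = np.array([1, 0, 2, 0])           # documents containing the feature
n = np.array([500, 400, 600, 3])     # documents per category
e_b = 0.002                           # shared probability inferred for this feature (placeholder)

print(laplace_estimate(m, n, a=1.0, b=1.0))   # tiny category jumps to ~0.2
print(hbp_estimate(m, n, c_j=10.0, e_b=e_b))  # tiny category stays near E[b|x]
```

The value c_j = 10.0 here is only to make the shrinkage visible; the experiment above uses a much smaller c_j.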

The classification of documents is often tackled with multinomial models under the bag-of-words assumption. The advantage of a feature-based model is that it becomes natural to add other non-text features, such as the presence of a header or whether the text is right-justified. Since the hBP handles rare features appropriately, we can safely include a much larger set of features than would be possible for a naive Bayes model.

8 Conclusions

In this paper we have shown that the beta process, originally developed for applications in survival analysis, is the natural object of study for nonparametric Bayesian factorial modeling. Representing data points in terms of sets of features, factorial models provide substantial flexibility relative to the multinomial representations associated with classical mixture models. We have shown that the beta process is the de Finetti mixing distribution underlying the Indian buffet process, a distribution on sparse binary matrices. This result parallels the relationship between the Dirichlet process and the Chinese restaurant process.

We have also shown that the beta process can be extended to a recursively-defined hierarchy of beta processes. This representation makes it possible to develop nonparametric Bayesian models in which unbounded sets of features can be shared among multiple nested groups of data.

Compared to the Dirichlet process, the beta process has the advantage of being an independent increments process (Brownian motion is another example of an independent increments process). However, some of the simplifying features of the Dirichlet process do not carry over to the beta process, and in particular we have needed to design new inference algorithms for beta process and hierarchical beta process mixture models rather than simply borrowing from Dirichlet process methods. Our inference methods are elementary, and additional work on inference algorithms will be necessary to fully exploit beta process models.

9 Appendix

Hjort [4] derives the following moments for any set $S \subseteq \Omega$. If $B_0$ is continuous,
$$E[B(S)] = \int_{S \times [0,1]} b\, \nu(d\omega, db) = B_0(S).$$
If $B_0$ is discrete,
$$E[B(S)] = \sum_{i : \omega_i \in S} E[b_i] = \sum_{i : \omega_i \in S} q_i = B_0(S), \qquad \mathrm{Var}(B(S)) = \int_S \frac{B_0(d\omega)\,(1 - B_0(d\omega))}{c(\omega) + 1}.$$
Thus despite being discrete, $B$ can be viewed as an approximation of $B_0$, with fluctuations going to zero as $c \to \infty$.

References

[1] C. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152-1174, 1974.

[2] T. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems (NIPS) 18, 2005.

[3] T. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. Technical Report 2005-001, Gatsby Computational Neuroscience Unit, 2005.

[4] N. L. Hjort. Nonparametric Bayes estimators based on beta processes in models for life history data. The Annals of Statistics, 18(3):1259-1294, 1990.

[5] A. Y. Khinchine. Korrelationstheorie der stationären stochastischen Prozesse. Mathematische Annalen, 109:604-615, 1934.

[6] Y. Kim. Nonparametric Bayesian estimators for counting processes. The Annals of Statistics, 27(2):562-588, 1999.

[7] J. Lee and Y. Kim. A new algorithm to generate beta processes. Computational Statistics and Data Analysis, 47:441-453, 2004.

[8] P. Lévy. Théorie de l'addition des variables aléatoires. Gauthier-Villars, Paris, 1937.

[9] J. Rennie, L. Shih, J. Teevan, and D. Karger. Tackling the poor assumptions of naïve Bayes text classifiers. In International Conference on Machine Learning (ICML), 2003.

[10] J. Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639-650, 1994.

[11] Y. W. Teh, M. I. Jordan, M. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101:1566-1581, 2006.

[12] R. Wolpert and K. Ickstadt. Simulation of Lévy random fields. In D. Dey, P. Müller, and D. Sinha, editors, Practical Nonparametric and Semiparametric Bayesian Statistics. Springer-Verlag, New York, 1998.