Appendix for Hierarchical Dirichlet Scaling Process for Multi-labeled Data

Dongwoo Kim, KAIST, Daejeon, Korea
Alice Oh, KAIST, Daejeon, Korea

This appendix gives readers additional information about the HDSP: the detailed derivation of variational inference, the posterior word count analysis, and more examples from the Wikipedia and OHSUMED corpora.

A. Variational inference for HDSP

In this section, we provide the detailed derivation of mean-field variational inference for HDSP. First, the evidence lower bound (ELBO) for HDSP is obtained by applying Jensen's inequality to the marginal log likelihood of the observed data,

$$\ln \int p(D, \Theta)\, d\Theta \;\ge\; \int Q(\Psi) \ln \frac{p(D, \Theta)}{Q(\Psi)}\, d\Theta, \qquad (1)$$

where $D$ is the training set of documents and labels, $\Psi$ denotes the set of variational parameters, and $\Theta$ denotes the set of model parameters. We define a fully factorized variational distribution $Q$ as follows:

$$Q := \prod_{k=1}^{T} \Big[ q(\phi_k)\, q(V_k) \prod_{j=1}^{J} q(w_{kj}) \Big] \prod_{m=1}^{M} \Big[ \prod_{k=1}^{T} q(\pi_{mk}) \prod_{n=1}^{N_m} q(z_{mn}) \Big], \qquad (2)$$

where

$$\begin{aligned}
q(z_{mn}) &= \mathrm{Multinomial}(z_{mn} \mid \gamma_{mn1}, \gamma_{mn2}, \ldots, \gamma_{mnT}) \\
q(\pi_{mk}) &= \mathrm{Gamma}(\pi_{mk} \mid a^{\pi}_{mk}, b^{\pi}_{mk}) \\
q(w_{kj}) &= \mathrm{InvGamma}(w_{kj} \mid a^{w}_{kj}, b^{w}_{kj}) \\
q(\phi_k) &= \mathrm{Dirichlet}(\phi_k \mid \eta_{k1}, \eta_{k2}, \ldots, \eta_{kI}) \\
q(V_k) &= \delta_{V_k}.
\end{aligned} \qquad (3)$$

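The evidence lower bound (ELBO) is

$$\begin{aligned}
\mathcal{L}(D, \Psi) &= \mathbb{E}_q[\ln p(D, \Theta)] + H[Q] \\
&= \mathbb{E}_q\Big[\sum_{m=1}^{M}\sum_{n=1}^{N_m} \ln p(x_{mn} \mid z_{mn}, \Phi)\Big] + \mathbb{E}_q\Big[\sum_{m=1}^{M}\sum_{n=1}^{N_m} \ln p(z_{mn} \mid \pi_m)\Big] + \mathbb{E}_q\Big[\sum_{m=1}^{M}\sum_{k=1}^{T} \ln p(\pi_{mk} \mid V_k, w_k, r_m)\Big] \\
&\quad + \mathbb{E}_q\Big[\sum_{k=1}^{T} \ln p(V_k \mid \alpha)\Big] + \mathbb{E}_q\Big[\sum_{k=1}^{T}\sum_{j=1}^{J} \ln p(w_{kj} \mid a^w, b^w)\Big] + \mathbb{E}_q\Big[\sum_{k=1}^{T} \ln p(\phi_k \mid \eta)\Big] - \mathbb{E}_q[\ln Q] \\
&= \sum_{m=1}^{M}\sum_{n=1}^{N_m}\sum_{k=1}^{T} \gamma_{mnk}\, \mathbb{E}_q[\ln p(x_{mn} \mid \phi_k)] + \sum_{m=1}^{M}\sum_{n=1}^{N_m}\sum_{k=1}^{T} \gamma_{mnk}\, \mathbb{E}_q[\ln p(z_{mn} = k \mid \pi_m)] + \sum_{m=1}^{M}\sum_{k=1}^{T} \mathbb{E}_q[\ln p(\pi_{mk} \mid V_k, w_k, r_m)] \\
&\quad + \sum_{k=1}^{T} \mathbb{E}_q[\ln p(V_k \mid \alpha)] + \sum_{k=1}^{T}\sum_{j=1}^{J} \mathbb{E}_q[\ln p(w_{kj} \mid a^w, b^w)] + \sum_{k=1}^{T} \mathbb{E}_q[\ln p(\phi_k \mid \eta)] - \mathbb{E}_q[\ln Q], \qquad (4)
\end{aligned}$$

where $a^w$ and $b^w$ without subscripts denote the prior hyperparameters of $w_{kj}$, and $a^w_{kj}$, $b^w_{kj}$ denote its variational parameters. The expectations of the latent variables under the variational distribution $Q$ are

$$\begin{aligned}
\mathbb{E}_q[\pi_{mk}] &= a^{\pi}_{mk} / b^{\pi}_{mk} \\
\mathbb{E}_q[\ln \pi_{mk}] &= \psi(a^{\pi}_{mk}) - \ln b^{\pi}_{mk} \\
\mathbb{E}_q[w_{kj}] &= b^{w}_{kj} / (a^{w}_{kj} - 1) \\
\mathbb{E}_q[w_{kj}^{-1}] &= a^{w}_{kj} / b^{w}_{kj} \\
\mathbb{E}_q[\ln w_{kj}] &= \ln b^{w}_{kj} - \psi(a^{w}_{kj}) \\
\mathbb{E}_q[\ln \phi_{ki}] &= \psi(\eta_{ki}) - \psi\Big(\sum_{d} \eta_{kd}\Big).
\end{aligned}$$

Then, we expand the bound further:

$$\begin{aligned}
\mathcal{L}(D, \Psi) &= \sum_{m=1}^{M}\sum_{n=1}^{N_m}\sum_{k=1}^{T} \gamma_{mnk} \Big\{ \psi(\eta_{k x_{mn}}) - \psi\Big(\sum_{d} \eta_{kd}\Big) \Big\} \\
&\quad + \sum_{m=1}^{M}\sum_{n=1}^{N_m}\sum_{k=1}^{T} \gamma_{mnk} \Big\{ \mathbb{E}_q[\ln \pi_{mk}] - \mathbb{E}_q\Big[\ln \sum_{k'=1}^{T} \pi_{mk'}\Big] \Big\} \\
&\quad + \sum_{m=1}^{M}\sum_{k=1}^{T} \Big\{ -\beta p_k \sum_{j=1}^{J} r_{mj} \big( \ln b^{w}_{kj} - \psi(a^{w}_{kj}) \big) + (\beta p_k - 1) \big( \psi(a^{\pi}_{mk}) - \ln b^{\pi}_{mk} \big) - \prod_{j=1}^{J} \Big( \frac{a^{w}_{kj}}{b^{w}_{kj}} \Big)^{r_{mj}} \frac{a^{\pi}_{mk}}{b^{\pi}_{mk}} - \ln \Gamma(\beta p_k) \Big\} \\
&\quad + \sum_{k=1}^{T} \Big\{ \ln \Gamma(\alpha + 1) - \ln \Gamma(\alpha) + (\alpha - 1) \ln(1 - V_k) \Big\} \\
&\quad + \sum_{k=1}^{T}\sum_{j=1}^{J} \Big\{ a^w \ln b^w - \ln \Gamma(a^w) - (a^w + 1) \big( \ln b^{w}_{kj} - \psi(a^{w}_{kj}) \big) - b^w \frac{a^{w}_{kj}}{b^{w}_{kj}} \Big\} \\
&\quad + \sum_{k=1}^{T} \mathbb{E}_q[\ln p(\phi_k \mid \eta)] - \mathbb{E}_q[\ln Q]. \qquad (5)
\end{aligned}$$

Taking the derivatives of this lower bound with respect to each variational parameter, we obtain the coordinate ascent updates. Equivalently, the optimal form of each variational factor is obtained by exponentiating the expected log joint, where the expectation is taken over all variables except the one of interest (Bishop & Nasrabadi, 2006).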
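The closed-form expectations listed in this section are the quantities that every update reuses. The following is a minimal Python sketch of them, assuming NumPy and SciPy are available; the function names are hypothetical and do not come from the paper. The Gamma factor is parameterized by shape and rate, and the inverse-Gamma factor by shape and scale, which is what the expectations above imply.

```python
import numpy as np
from scipy.special import digamma


def gamma_expectations(a, b):
    """Return E[pi] and E[ln pi] for pi ~ Gamma(shape=a, rate=b)."""
    return a / b, digamma(a) - np.log(b)


def inv_gamma_expectations(a, b):
    """Return E[w], E[1/w], E[ln w] for w ~ InvGamma(shape=a, scale=b).

    E[w] is only finite when a > 1.
    """
    return b / (a - 1.0), a / b, np.log(b) - digamma(a)


def dirichlet_expected_log(eta):
    """Return E[ln phi_ki] for each row phi_k ~ Dirichlet(eta_k).

    eta has shape (T, I): one row of Dirichlet parameters per topic.
    """
    return digamma(eta) - digamma(eta.sum(axis=1, keepdims=True))
```

All three functions broadcast over arrays, so they can be applied to the full $(M, T)$ or $(T, J)$ parameter arrays at once.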

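For $\pi_{mk}$, we can derive the optimal form of the variational distribution as follows:

$$\begin{aligned}
q(\pi_{mk}) &\propto \exp \mathbb{E}_{q_{-\pi_{mk}}}\big[\ln p(\pi_{mk} \mid z, \pi_{-mk}, r_m, V, w)\big] \\
&\propto \exp \mathbb{E}_{q_{-\pi_{mk}}}\big[\ln p(z \mid \pi_m) + \ln p(\pi_m \mid r_m, V, w)\big] \\
&\propto \pi_{mk}^{\beta p_k + \sum_{n=1}^{N_m} \gamma_{mnk} - 1} \exp\Big( -\Big( \prod_{j=1}^{J} \mathbb{E}_q[w_{kj}^{-1}]^{r_{mj}} + \frac{N_m}{\xi_m} \Big) \pi_{mk} \Big), \qquad (6)
\end{aligned}$$

where $\xi_m$ is the auxiliary parameter of the first-order bound $-\mathbb{E}_q[\ln \sum_{k} \pi_{mk}] \ge -\ln \xi_m - (\sum_{k=1}^{T} \mathbb{E}_q[\pi_{mk}] - \xi_m)/\xi_m$, which is tight at $\xi_m = \sum_{k=1}^{T} \mathbb{E}_q[\pi_{mk}]$. Therefore, the optimal form of the variational distribution for $\pi_{mk}$ is

$$q(\pi_{mk}) = \mathrm{Gamma}\Big( \beta p_k + \sum_{n=1}^{N_m} \gamma_{mnk},\; \prod_{j=1}^{J} \mathbb{E}_q[w_{kj}^{-1}]^{r_{mj}} + \frac{N_m}{\xi_m} \Big). \qquad (7)$$

We take the same approach as Paisley et al. (2012); the only difference comes from the product of the inverse distance terms.

For $w_{kj}$, we can derive the optimal form of the variational distribution as follows:

$$\begin{aligned}
q(w_{kj}) &\propto \exp \mathbb{E}_{q_{-w_{kj}}}\big[\ln p(w_{kj} \mid \pi, w_{-kj}, a^w, b^w)\big] \\
&\propto \exp \mathbb{E}_{q_{-w_{kj}}}\Big[\sum_{m=1}^{M} \ln p(\pi_{mk} \mid \beta p_k, w_k, r_m) + \ln p(w_{kj} \mid a^w, b^w)\Big] \\
&\propto \exp \mathbb{E}_{q_{-w_{kj}}}\Big[\sum_{m=1}^{M} \Big( -\beta p_k r_{mj} \ln w_{kj} - \prod_{j'=1}^{J} w_{kj'}^{-r_{mj'}} \pi_{mk} \Big) - (a^w + 1) \ln w_{kj} - \frac{b^w}{w_{kj}}\Big] \\
&\propto w_{kj}^{-\mathbb{E}_q[\beta p_k] \sum_{m=1}^{M} r_{mj} - a^w - 1} \exp\Big( -\frac{1}{w_{kj}} \Big( \sum_{m'} \prod_{j'/j} \mathbb{E}_q[w_{kj'}^{-1}]\, \mathbb{E}_q[\pi_{m'k}] + b^w \Big) \Big). \qquad (8)
\end{aligned}$$

Therefore, the optimal form of the variational distribution for $w_{kj}$ is

$$q(w_{kj}) = \mathrm{InvGamma}\Big( \mathbb{E}_q[\beta p_k] \sum_{m=1}^{M} r_{mj} + a^w,\; \sum_{m'} \prod_{j'/j} \mathbb{E}_q[w_{kj'}^{-1}]\, \mathbb{E}_q[\pi_{m'k}] + b^w \Big), \qquad (9)$$

where $m' = \{m : r_{mj} = 1\}$ and $j'/j = \{j' : r_{m'j'} = 1,\ j' \neq j\}$.

B. Posterior Word Count

Like the HDP and other nonparametric topic models, our model uses only a few topics even though we set the truncation level to 200. Figure 1 shows the posterior word count for different values of the Dirichlet topic parameter $\eta$. As the result indicates, our model uses 50 to 0 topics. The HDSP tends to use more topics than the HDP.

C. More Results

Figure 2 shows more examples of the expected topic distribution given a label from the Wikipedia and OHSUMED corpora. We show the top 20 topics, each with its top words.

References

Bishop, Christopher M. and Nasrabadi, Nasser M. Pattern Recognition and Machine Learning, volume 1. Springer New York, 2006.

Paisley, John, Wang, Chong, and Blei, David M. The discrete infinite logistic normal distribution. Bayesian Analysis, 7(4):997-1034, 2012.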
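As a rough illustration of Eqs. (7) and (9) in Section A, the sketch below evaluates their right-hand sides with NumPy, given the current expectations. It is a sketch under stated assumptions, not the authors' implementation: the function names, argument names, and array layout are all hypothetical. In particular, `gamma_sum[m, k]` is assumed to hold $\sum_n \gamma_{mnk}$, `r` is the binary $M \times J$ label matrix, `p` holds the stick weights $p_k$, and `a0`, `b0` stand for the prior hyperparameters $a^w$, $b^w$.

```python
import numpy as np


def update_pi(beta, p, gamma_sum, doc_len, r, e_inv_w, e_pi_old):
    """Evaluate Eq. (7): shape and rate of q(pi_mk), both of shape (M, T)."""
    # xi_m = sum_k E[pi_mk], computed from the previous expectations.
    xi = e_pi_old.sum(axis=1)
    # scale[m, k] = prod_j E[1 / w_kj] ** r_mj, computed in log space.
    scale = np.exp(r @ np.log(e_inv_w).T)
    a_pi = beta * p[None, :] + gamma_sum
    b_pi = scale + (doc_len / xi)[:, None]
    return a_pi, b_pi


def update_w(beta, p, r, e_inv_w, e_pi, a0, b0):
    """Evaluate Eq. (9): shape and scale of q(w_kj), both of shape (T, J).

    All (k, j) entries are computed from the same old expectations; a strict
    coordinate ascent scheme would refresh e_inv_w after each single update.
    """
    scale = np.exp(r @ np.log(e_inv_w).T)  # (M, T), product over all labels
    a_w = beta * p[:, None] * r.sum(axis=0)[None, :] + a0
    # Dividing by E[1 / w_kj] removes label j from the product for the
    # documents that carry label j (the only ones selected by r).
    b_w = ((e_pi * scale).T @ r) / e_inv_w + b0
    return a_w, b_w
```

Working in log space for the product over labels keeps the computation a single matrix product, at the cost of requiring every entry of `e_inv_w` to be strictly positive, which holds for any valid inverse-Gamma factor.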

Figure 1. Posterior number of words for (a) HDP and (b) HDSP. Each panel plots the posterior word count against the topic number (sorted) for η = 0.5, 0.25, and 0.1. We set the truncation level to 200, but only a few topics are used during inference.

Figure 2. Expected topic distributions from (a) Wikipedia and (b) OHSUMED. From top to bottom, we compute the expected topic distribution given a label, starting with no label. The labels shown include 2005_albums, 2009_novels, and alumni_of_the_university_of_oxford for Wikipedia, and mice, infant, aged, brain, and drug for OHSUMED.

