Discriminative Improvements to Distributional Sentence Similarity

Yangfeng Ji
School of Interactive Computing
Georgia Institute of Technology
jyfeng@gatech.edu

Jacob Eisenstein
School of Interactive Computing
Georgia Institute of Technology
jacobe@gatech.edu

Abstract

Matrix and tensor factorization have been applied to a number of semantic relatedness tasks, including paraphrase identification. The key idea is that similarity in the latent space implies semantic relatedness. We describe three ways in which labeled data can improve the accuracy of these approaches on paraphrase classification. First, we design a new discriminative term-weighting metric called TF-KLD, which outperforms TF-IDF. Next, we show that using the latent representation from matrix factorization as features in a classification algorithm substantially improves accuracy. Finally, we combine latent features with fine-grained n-gram overlap features, yielding performance that is 3% more accurate than the prior state-of-the-art.

1 Introduction

Measuring the semantic similarity of short units of text is fundamental to many natural language processing tasks, from evaluating machine translation (Kauchak and Barzilay, 2006) to grouping redundant event mentions in social media (Petrović et al., 2010). The task is challenging because of the infinitely diverse set of possible linguistic realizations for any idea (Bhagat and Hovy, 2013), and because of the short length of individual sentences, which means that standard bag-of-words representations will be hopelessly sparse.

Distributional methods address this problem by transforming the high-dimensional bag-of-words representation into a lower-dimensional latent space. This can be accomplished by factoring a matrix or tensor of term-context counts (Turney and Pantel, 2010); proximity in the induced latent space has been shown to correlate with semantic similarity (Mihalcea et al., 2006). However, factoring the term-context matrix means throwing away a considerable amount of information, as the original matrix of size M × N (number of instances by number of features) is factored into two smaller matrices of size M × K and N × K, with K ≪ M, N. If the factorization does not take into account labeled data about semantic similarity, important information can be lost.

In this paper, we show how labeled data can considerably improve distributional methods for measuring semantic similarity. First, we develop a new discriminative term-weighting metric called TF-KLD, which is applied to the term-context matrix before factorization. On a standard paraphrase identification task (Dolan et al., 2004), this method improves on both traditional TF-IDF and Weighted Textual Matrix Factorization (WTMF; Guo and Diab, 2012). Next, we convert the latent representations of each sentence pair into a feature vector, which is used as input to a linear SVM classifier. This yields further improvements and substantially outperforms the current state-of-the-art on paraphrase classification. We then add fine-grained features about the lexical similarity of the sentence pair. The combination of latent and fine-grained features yields further improvements in accuracy, demonstrating that these feature sets provide complementary information on semantic similarity.
2 Related Work

Without attempting to do justice to the entire literature on paraphrase identification, we note three high-level approaches: (1) string similarity metrics such as n-gram overlap and BLEU score (Wan et al., 2006; Madnani et al., 2012), as well as string kernels (Bu et al., 2012); (2) syntactic operations on the parse structure (Wu, 2005; Das and Smith, 2009); and (3) distributional methods, such as latent semantic analysis (LSA; Landauer et al., 1998), which are most relevant to our work.

One application of distributional techniques is to replace individual words with distributionally similar alternatives (Kauchak and Barzilay, 2006). Alternatively, Blacoe and Lapata (2012) show that latent word representations can be combined with simple elementwise operations to identify the semantic similarity of larger units of text. Socher et al. (2011) propose a syntactically-informed approach to combining word representations, using a recursive auto-encoder to propagate meaning through the parse tree.

We take a different approach: rather than representing the meanings of individual words, we directly obtain a distributional representation for the entire sentence. This is inspired by Mihalcea et al. (2006) and Guo and Diab (2012), who treat sentences as pseudo-documents in an LSA framework, and identify paraphrases using similarity in the latent space. We show that the performance of such techniques can be improved dramatically by using supervised information to (1) reweight the individual distributional features and (2) learn the importance of each latent dimension.

3 Discriminative feature weighting

Distributional representations (Turney and Pantel, 2010) can be induced from a co-occurrence matrix W ∈ R^{M×N}, where M is the number of instances and N is the number of distributional features. For paraphrase identification, each instance is a sentence; features may be unigrams, or may include higher-order n-grams or dependency pairs. By decomposing the matrix W, we hope to obtain a latent representation in which semantically-related sentences are similar. Singular value decomposition (SVD) is traditionally used to perform this factorization. However, recent work has demonstrated the robustness of non-negative matrix factorization (NMF; Lee and Seung, 2001) for text mining tasks (Xu et al., 2003; Arora et al., 2012); the difference from SVD is the addition of a non-negativity constraint on the latent representation, which is defined over a non-orthogonal basis.

While W may simply contain counts of distributional features, prior work has demonstrated the utility of reweighting these counts (Turney and Pantel, 2010). TF-IDF is a standard approach, as the inverse document frequency (IDF) term increases the importance of rare words, which may be more discriminative. Guo and Diab (2012) show that applying a special weight to unseen words can further improve performance on paraphrase identification.
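As a concrete reference point for this baseline weighting, the following is a minimal sketch of building a TF-IDF-weighted term-context matrix over unigram and bigram features, assuming scikit-learn; the `sentences` variable is a placeholder, and the dependency-pair features used later in the paper (obtained from MaltParser) are omitted here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# sentences: list of raw sentence strings (both sentences of every pair)
vectorizer = CountVectorizer(ngram_range=(1, 2), lowercase=True)
counts = vectorizer.fit_transform(sentences)        # sparse M x N term-frequency matrix
W_tfidf = TfidfTransformer().fit_transform(counts)  # TF-IDF reweighting of the same counts
```

Either the raw counts or a reweighted version of them can then be factored as described next.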
We present a new weighting scheme, TF-KLD, based on supervised information. The key idea is to increase the weights of distributional features that are discriminative, and to decrease the weights of features that are not. Conceptually, this is similar to Linear Discriminant Analysis, a supervised feature weighting scheme for continuous data (Murphy, 2012).

More formally, we assume labeled sentence pairs of the form <w^(1), w^(2), r>, where w^(1) is the vector of distributional features for the first sentence, w^(2) is the vector of distributional features for the second sentence, and r ∈ {0, 1} indicates whether they are labeled as a paraphrase pair. Assuming the order of the sentences within the pair is irrelevant, then for the k-th distributional feature, we define two Bernoulli distributions:

p_k = P(w_k^(1) | w_k^(2) = 1, r = 1). This is the probability that sentence w^(1) contains feature k, given that k appears in w^(2) and the two sentences are labeled as paraphrases, r = 1.

q_k = P(w_k^(1) | w_k^(2) = 1, r = 0). This is the probability that sentence w^(1) contains feature k, given that k appears in w^(2) and the two sentences are labeled as not paraphrases, r = 0.

The Kullback-Leibler divergence KL(p_k || q_k) = Σ_x p_k(x) log (p_k(x) / q_k(x)) is then a measure of the discriminability of feature k, and is guaranteed to be non-negative.¹

¹ We obtain very similar results with the opposite divergence KL(q_k || p_k). However, the symmetric Jensen-Shannon divergence performs poorly.
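A minimal sketch of estimating these per-feature KL-divergence weights from binary feature-presence indicators, assuming NumPy; the symmetrized counting over both orientations of each pair and the additive smoothing constant are illustrative choices, not the exact implementation used in the paper.

```python
import numpy as np

def tf_kld_weights(X1, X2, labels, eps=0.05):
    """Per-feature KL(p_k || q_k) weights from labeled sentence pairs.

    X1, X2 : (n_pairs, n_features) arrays; nonzero means the feature occurs
             in the first / second sentence of the pair.
    labels : (n_pairs,) array with 1 for paraphrase pairs, 0 otherwise.
    eps    : additive smoothing, since many features are rare.
    """
    X1, X2 = (X1 > 0).astype(float), (X2 > 0).astype(float)

    def cond_prob(mask):
        A, B = X1[mask], X2[mask]
        both = (A * B).sum(axis=0)            # feature appears in both sentences
        cond = A.sum(axis=0) + B.sum(axis=0)  # conditioning events, both orientations
        return (2.0 * both + eps) / (cond + 2.0 * eps)

    p = cond_prob(labels == 1)  # paraphrase pairs
    q = cond_prob(labels == 0)  # non-paraphrase pairs
    # KL divergence between the two per-feature Bernoulli distributions
    return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))
```

The term frequencies in W are then scaled column-wise by these weights before factorization.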
We use this divergence to reweight the features in W before performing the matrix factorization. This has the effect of increasing the weights of features whose likelihood of appearing in a pair of sentences is strongly influenced by the paraphrase relationship between the two sentences. On the other hand, if p_k = q_k, then the KL-divergence will be zero, and the feature will be ignored in the matrix factorization. We name this weighting scheme TF-KLD, since it includes the term frequency and the KL-divergence.

Taking the unigram feature "not" as an example, we have p_k = [0.66, 0.34] and q_k = [0.31, 0.69], for a KL-divergence of 0.25: the likelihood of this word being shared between two sentences is strongly dependent on whether the sentences are paraphrases. In contrast, the feature "then" has p_k = [0.33, 0.67] and q_k = [0.32, 0.68], for a KL-divergence of 3.9 × 10^-4. Figure 1 shows the distributions of these and other unigram features with respect to p_k and 1 - q_k. The diagonal line running through the middle of the plot indicates zero KL-divergence, so features on this line will be ignored.

[Figure 1: Conditional probabilities for a few hand-selected unigram features (e.g., not, but, nor, fear, shares, neither, off, then, study, same), plotted as p_k versus 1 - q_k, with lines showing contours of identical KL-divergence. The probabilities are estimated from the MSRPC training set (Dolan et al., 2004).]

4 Supervised classification

While previous work has performed paraphrase classification using distance or similarity in the latent space (Guo and Diab, 2012; Socher et al., 2011), more direct supervision can be applied. Specifically, we convert the latent representations of a pair of sentences v_1 and v_2 into a sample vector

    s(v_1, v_2) = [v_1 + v_2, |v_1 - v_2|],    (1)

concatenating the element-wise sum v_1 + v_2 and absolute difference |v_1 - v_2|. Note that s(·, ·) is symmetric, since s(v_1, v_2) = s(v_2, v_1). Given this representation, we can use any supervised classification algorithm.

A further advantage of treating paraphrase as a supervised classification problem is that we can apply additional features besides the latent representation. We consider a subset of the features identified by Wan et al. (2006), listed in Table 1. These features mainly capture fine-grained similarity between sentences, for example by counting specific unigram and bigram overlap.

1. unigram recall
2. unigram precision
3. bigram recall
4. bigram precision
5. dependency relation recall
6. dependency relation precision
7. BLEU recall
8. BLEU precision
9. difference of sentence length
10. tree-editing distance

Table 1: Fine-grained features for paraphrase classification, selected from prior work (Wan et al., 2006).
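Equation 1 and the downstream classifier can be sketched as follows, assuming NumPy and scikit-learn's LinearSVC (a wrapper around LIBLINEAR); the latent matrix V and the `pairs` index list are placeholders for whatever factorization produced the sentence representations.

```python
import numpy as np
from sklearn.svm import LinearSVC

def sample_vector(v1, v2):
    """Symmetric pair representation of Equation 1: [v1 + v2, |v1 - v2|]."""
    return np.concatenate([v1 + v2, np.abs(v1 - v2)])

def build_dataset(V, pairs):
    """V: (n_sentences, K) latent vectors; pairs: list of (i, j, label) triples."""
    X = np.vstack([sample_vector(V[i], V[j]) for i, j, _ in pairs])
    y = np.array([label for _, _, label in pairs])
    return X, y

# The regularization parameter C is tuned on a held-out development set.
clf = LinearSVC(C=1.0)
```

Training is then a single call to clf.fit(X, y), followed by clf.predict on the sample vectors of the test pairs.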
5 Experiments

Our experiments test the utility of the TF-KLD weighting for paraphrase classification, using the Microsoft Research Paraphrase Corpus (Dolan et al., 2004). The training set contains 2753 true paraphrase pairs and 1323 false paraphrase pairs; the test set contains 1147 and 578 pairs, respectively.

The TF-KLD weights are constructed from only the training set, while matrix factorizations are performed on the entire corpus. Matrix factorization on both training and (unlabeled) test data can be viewed as a form of transductive learning (Gammerman et al., 1998), where we assume access to unlabeled test set instances.² We also consider an inductive setting, where we construct the basis of the latent space from only the training set, and then project the test set onto this basis to find the corresponding latent representation. The performance differences between the transductive and inductive settings were generally between 0.5% and 1%, as noted in detail below. We reiterate that the TF-KLD weights are never computed from test set data.

² Another example of transductive learning in NLP is when Turian et al. (2010) induced word representations from a corpus that included both training and test data for their downstream named entity recognition task.

Prior work on this dataset is described in Section 2. To our knowledge, the current state-of-the-art is a supervised system that combines several machine translation metrics (Madnani et al., 2012), but we also compare with state-of-the-art unsupervised matrix factorization work (Guo and Diab, 2012).

5.1 Similarity-based classification

In the first experiment, we predict whether a pair of sentences is a paraphrase by measuring their cosine similarity in latent space, using a threshold for the classification boundary. As in prior work (Guo and Diab, 2012), the threshold is tuned on held-out training data. We consider two distributional feature sets: FEAT1, which includes unigrams; and FEAT2, which also includes bigrams and unlabeled dependency pairs obtained from MaltParser (Nivre et al., 2007). To compare with Guo and Diab (2012), we set the latent dimensionality to K = 100, the same value used in their paper. Both SVD and NMF factorizations are evaluated; in both cases, we minimize the Frobenius norm of the reconstruction error.

Table 2 compares the accuracy of a number of different configurations. The transductive TF-KLD weighting yields the best overall accuracy, achieving 72.75% when combined with non-negative matrix factorization. While NMF performs slightly better than SVD in both comparisons, the major difference is the performance of discriminative TF-KLD weighting, which outperforms TF-IDF regardless of the factorization technique. When we perform the matrix factorization on only the training data, the accuracy on the test set is 73.58%, with F1 score 80.55%.

Factorization  Feature set  Weighting  K    Measure      Accuracy (%)  F1
SVD            unigrams     TF-IDF     100  cosine sim.  68.92         80.33
NMF            unigrams     TF-IDF     100  cosine sim.  68.96         80.14
WTMF           unigrams     TF-IDF     100  cosine sim.  71.51         not reported
SVD            unigrams     TF-KLD     100  cosine sim.  72.23         81.19
NMF            unigrams     TF-KLD     100  cosine sim.  72.75         81.48

Table 2: Similarity-based paraphrase identification accuracy. Results for WTMF are reprinted from the paper by Guo and Diab (2012).

5.2 Supervised classification

Next, we apply supervised classification, constructing sample vectors from the latent representation as shown in Equation 1. For classification, we choose a Support Vector Machine with a linear kernel (Fan et al., 2008), leaving a thorough comparison of classifiers for future work. The classifier parameter C is tuned on a development set comprising 20% of the original training set.

Figure 2 presents results for a range of latent dimensionalities. Supervised learning identifies the important dimensions in the latent space, yielding significantly better performance than the similarity-based classification from the previous experiment.

[Figure 2: Accuracy of the feature and weighting combinations (FEAT1/FEAT2 with TF-IDF or TF-KLD, each with an SVM) in the classification framework, for latent dimensionality K from 50 to 400.]

In Table 3, we compare against prior published work, using the held-out development set to select the best value of K (again, K = 400). The best result is from TF-KLD with distributional features FEAT2, achieving 79.76% accuracy and 85.87% F1. This is well beyond all known prior results on this task.
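The inductive numbers reported next require a basis estimated from training sentences only, onto which held-out sentences are projected; the following is a minimal sketch of that projection step, assuming scikit-learn's NMF with its default Frobenius objective (W_train and W_test are placeholders for the TF-KLD-weighted matrices).

```python
from sklearn.decomposition import NMF

# Learn a K-dimensional non-negative basis from the training sentences only,
# then project held-out sentences onto that fixed basis (the inductive setting).
nmf = NMF(n_components=100, init="nndsvd", max_iter=500)
V_train = nmf.fit_transform(W_train)  # latent vectors for training sentences
V_test = nmf.transform(W_test)        # held-out sentences, same basis
```

In the transductive setting, by contrast, the factorization is simply run on the concatenation of the training and test matrices.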
When we induce the latent basis from only the training data, we get 78.55% accuracy and 84.59% F1, also better than the previous state of the art.

Finally, we augment the distributional representation, concatenating the ten fine-grained features in Table 1 to the sample vectors described in Equation 1.
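A few of the Table 1 features can be sketched directly; the snippet below computes type-level unigram and bigram precision and recall for a tokenized sentence pair, a simplification of the features of Wan et al. (2006) (BLEU, dependency-relation overlap, length difference, and tree-edit distance are omitted).

```python
def ngram_set(tokens, n):
    """Set of n-grams (as tuples) in a token sequence."""
    return set(zip(*[tokens[i:] for i in range(n)]))

def overlap_features(tokens1, tokens2):
    """Unigram and bigram recall/precision, treating sentence 1 as the reference."""
    feats = []
    for n in (1, 2):
        ref, hyp = ngram_set(tokens1, n), ngram_set(tokens2, n)
        common = len(ref & hyp)
        feats.append(common / max(len(ref), 1))  # recall of the reference n-grams
        feats.append(common / max(len(hyp), 1))  # precision of the hypothesis n-grams
    return feats
```

Each such value is simply appended to the sample vector of Equation 1 before training the SVM.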
As shown in Table 3, the accuracy now improves to 80.41%, with an F1 score of 85.96%. When the latent representation is induced from only the training data, the corresponding results are 79.94% accuracy and 85.36% F1, again better than the previous state of the art. These results show that the information captured by the distributional representation can still be augmented by more fine-grained traditional features.

                                           Acc.   F1
Most common class                          66.5   79.9
Wan et al. (2006)                          75.6   83.0
Das and Smith (2009)                       73.9   82.3
Das and Smith (2009), with 18 features     76.1   82.7
Bu et al. (2012)                           76.3   not reported
Socher et al. (2011)                       76.8   83.6
Madnani et al. (2012)                      77.4   84.1
FEAT2, TF-KLD, SVM                         79.76  85.87
FEAT2, TF-KLD, SVM, fine-grained features  80.41  85.96

Table 3: Supervised classification. Results from prior work are reprinted.

6 Conclusion

We have presented three ways in which labeled data can improve distributional measures of semantic similarity at the sentence level. The main innovation is TF-KLD, which discriminatively reweights the distributional features before factorization, so that discriminability impacts the induction of the latent representation. We then transform the latent representation into a sample vector for supervised learning, obtaining results that strongly outperform the prior state-of-the-art; adding fine-grained lexical features further increases performance. These ideas may have applicability in other semantic similarity tasks, and we are also eager to apply them to new, large-scale automatically-induced paraphrase corpora (Ganitkevitch et al., 2013).

Acknowledgments

We thank the reviewers for their helpful feedback, and Weiwei Guo for quickly answering questions about his implementation. This research was supported by a Google Faculty Research Award to the second author.

References

Sanjeev Arora, Rong Ge, and Ankur Moitra. 2012. Learning Topic Models - Going beyond SVD. In FOCS, pages 1-10.

Rahul Bhagat and Eduard Hovy. 2013. What Is a Paraphrase? Computational Linguistics.

William Blacoe and Mirella Lapata. 2012. A Comparison of Vector-based Representations for Semantic Composition. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 546-556, Stroudsburg, PA, USA. Association for Computational Linguistics.

Fan Bu, Hang Li, and Xiaoyan Zhu. 2012. String Rewriting Kernel. In Proceedings of ACL, pages 449-458. Association for Computational Linguistics.

Dipanjan Das and Noah A. Smith. 2009. Paraphrase identification as probabilistic quasi-synchronous recognition. In Proceedings of the Joint Conference
of the Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing, pages 468-476, Stroudsburg, PA, USA. Association for Computational Linguistics.

Bill Dolan, Chris Quirk, and Chris Brockett. 2004. Unsupervised Construction of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources. In COLING.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9:1871-1874.

Alexander Gammerman, Volodya Vovk, and Vladimir Vapnik. 1998. Learning by transduction. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages 148-155. Morgan Kaufmann Publishers Inc.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The Paraphrase Database. In Proceedings of NAACL, pages 758-764. Association for Computational Linguistics.

Weiwei Guo and Mona Diab. 2012. Modeling Sentences in the Latent Space. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 864-872, Stroudsburg, PA, USA. Association for Computational Linguistics.

David Kauchak and Regina Barzilay. 2006. Paraphrasing for automatic evaluation. In Proceedings of NAACL, pages 455-462. Association for Computational Linguistics.

Thomas Landauer, Peter W. Foltz, and Darrell Laham. 1998. Introduction to Latent Semantic Analysis. Discourse Processes, 25:259-284.

Daniel D. Lee and H. Sebastian Seung. 2001. Algorithms for Non-Negative Matrix Factorization. In Advances in Neural Information Processing Systems (NIPS).

Nitin Madnani, Joel R. Tetreault, and Martin Chodorow. 2012. Re-examining Machine Translation Metrics for Paraphrase Identification. In HLT-NAACL, pages 182-190. The Association for Computational Linguistics.

Rada Mihalcea, Courtney Corley, and Carlo Strapparava. 2006. Corpus-based and knowledge-based measures of text semantic similarity. In AAAI.

Kevin P. Murphy. 2012. Machine Learning: A Probabilistic Perspective. The MIT Press.

Joakim Nivre, Johan Hall, Jens Nilsson, Atanas Chanev, Gülşen Eryiğit, Sandra Kübler, Svetoslav Marinov, and Erwin Marsi. 2007. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(2):95-135.

Saša Petrović, Miles Osborne, and Victor Lavrenko. 2010. Streaming first story detection with application to Twitter. In Proceedings of HLT-NAACL, pages 181-189. Association for Computational Linguistics.

Richard Socher, Eric H. Huang, Jeffrey Pennington, Andrew Y. Ng, and Christopher D. Manning. 2011. Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. In Advances in Neural Information Processing Systems (NIPS).

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word Representations: A Simple and General Method for Semi-Supervised Learning. In ACL, pages 384-394.

Peter D. Turney and Patrick Pantel. 2010. From Frequency to Meaning: Vector Space Models of Semantics. JAIR, 37:141-188.

Stephen Wan, Mark Dras, Robert Dale, and Cécile Paris. 2006. Using Dependency-based Features to Take the Para-farce out of Paraphrase. In Proceedings of the Australasian Language Technology Workshop.

Dekai Wu. 2005. Recognizing paraphrases and textual entailment using inversion transduction grammars. In Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, pages 25-30. Association for Computational Linguistics.

Wei Xu, Xin Liu, and Yihong Gong. 2003. Document Clustering based on Non-Negative Matrix Factorization. In SIGIR, pages 267-273.