Hybrid Selection of Language Model Training Data Using Linguistic Information and Perplexity

Antonio Toral
School of Computing
Dublin City University
Dublin, Ireland
atoral@computing.dcu.ie

Abstract

We explore the selection of training data for language models using perplexity. We introduce three novel models that make use of linguistic information and evaluate them on three different corpora and two languages. In four out of the six scenarios a linguistically motivated method outperforms the purely statistical state-of-the-art approach. Finally, a method which combines surface forms and the linguistically motivated methods outperforms the baseline in all the scenarios, selecting data whose perplexity is between 3.49% and 8.17% (depending on the corpus and language) lower than that of the baseline.

1 Introduction

Language models (LMs) are a fundamental piece in statistical applications that produce natural language text, such as machine translation and speech recognition. In order to perform optimally, a LM should be trained on data from the same domain as the data that it will be applied to. This poses a problem, because in the majority of applications, the amount of domain-specific data is limited.

A popular strand of research in recent years to tackle this problem is that of training data selection. Given a limited domain-specific corpus and a larger non-domain-specific corpus, the task consists of finding suitable data for the specific domain in the non-domain-specific corpus. The underlying assumption is that a non-domain-specific corpus, if broad enough, contains sentences similar to a domain-specific corpus, which would therefore be useful for training models for that domain.

This paper focuses on the approach that uses perplexity for the selection of training data. The first works in this regard (Gao et al., 2002; Lin et al., 1997) use the perplexity according to a domain-specific LM to rank the text segments (e.g. sentences) of non-domain-specific corpora. The text segments with perplexity less than a given threshold are selected.
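The threshold-based selection just described can be sketched as follows. This is a minimal illustration only: the add-one-smoothed unigram LM is a stand-in for a real n-gram model, and the function names and toy sentences are ours, not from the cited works.

```python
import math
from collections import Counter

def train_unigram(corpus):
    # Add-one-smoothed unigram probabilities; a stand-in for a real n-gram LM.
    counts = Counter(w for s in corpus for w in s.split())
    total, vocab = sum(counts.values()), len(counts) + 1
    return lambda w: (counts[w] + 1) / (total + vocab)

def perplexity(prob, sentence):
    # Per-word perplexity of a sentence under the model `prob`.
    words = sentence.split()
    return math.exp(-sum(math.log(prob(w)) for w in words) / len(words))

def select_below_threshold(pool, in_domain_prob, threshold):
    # Keep the segments whose perplexity according to the domain-specific LM
    # is below the threshold (as in Gao et al., 2002; Lin et al., 1997).
    return [s for s in pool if perplexity(in_domain_prob, s) < threshold]

# Toy illustration: an in-domain LM keeps the similar sentence from the pool.
lm = train_unigram(["the session resumed", "the session"])
pool = ["the session", "quarterly oil report"]
print(select_below_threshold(pool, lm, 5.0))
```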
A more recent method, which can be considered the state of the art, is Moore-Lewis (Moore and Lewis, 2010). It considers not only the cross-entropy[1] according to the domain-specific LM but also the cross-entropy according to a LM built on a random subset (equal in size to the domain-specific corpus) of the non-domain-specific corpus. The additional use of a LM from the non-domain-specific corpus allows selecting a subset of the non-domain-specific corpus which is both better (a test set of the specific domain has lower perplexity on a LM trained on this subset) and smaller compared to the previous approaches. The experiment was carried out for English, using Europarl (Koehn, 2005) as the domain-specific corpus and LDC Gigaword[2] as the non-domain-specific one.

In this paper we study whether the use of two types of linguistic knowledge (lemmas and named entities) can contribute to obtaining better results within the perplexity-based approach.

2 Methodology

We explore the use of linguistic information for the selection of data to train domain-specific LMs from non-domain-specific corpora. Our hypothesis is that ranking by perplexity on n-grams that represent linguistic patterns (rather than n-grams that represent surface forms) captures additional information, and thus may select valuable data that is not selected according solely to surface forms.

[1] Note that using cross-entropy is equivalent to using perplexity, since they are monotonically related.
[2] http://www.ldc.upenn.edu/catalog/catalogentry.jsp?catalogid=ldc2007t07

Proceedings of the Second Workshop on Hybrid Approaches to Translation, pages 8-12, Sofia, Bulgaria, August 8, 2013. © 2013 Association for Computational Linguistics

We use two types of linguistic information at
word level: lemmas and named entity categories. We experiment with the following models:

Forms (hereafter F) uses surface forms. This model replicates the Moore-Lewis approach and is to be considered the baseline.

Forms and named entities (hereafter FN) uses surface forms, with the exception of any word detected as a named entity, which is substituted by its type (e.g. person, organisation).

Lemmas (hereafter L) uses lemmas.

Lemmas and named entities (hereafter LN) uses lemmas, with the exception of any word detected as a named entity, which is substituted by its type.

A sample sentence, according to each of these models, follows:

F:  I declare resumed the session of the European Parliament
FN: I declare resumed the session of the NP00O00
L:  i declare resume the session of the european_parliament
LN: i declare resume the session of the NP00O00

Table 1 shows the number of n-grams in LMs built on the English side of News Commentary v8 (hereafter NC) for each of the models. Regarding 1-grams, compared to F, the substitution of named entities by their categories (FN) results in a smaller vocabulary size (-24.79%). Similarly, the vocabulary is reduced for the models L (-8.39%) and LN (-44.18%). Although not a result in itself, this might be an indication that using linguistically motivated models could be useful to deal with data sparsity.

n    F        FN       L        LN
1    65076    48945    59619    36326
2    981077   847720   835825   702118
3    2624800  2382629  2447759  2212709
4    3633724  3412719  3523888  3325311
5    3929751  3780064  3856917  3749813

Table 1: Number of n-grams in LMs built using the different models

Our procedure follows that of the Moore-Lewis method. We build LMs for the domain-specific corpus and for a random subset of the non-domain-specific corpus of the same size (number of sentences) as the domain-specific corpus. Each sentence s in the non-domain-specific corpus is then scored according to Equation 1, where PP_I(s) is the perplexity of s according to the domain-specific LM and PP_O(s) is the perplexity of s according to the non-domain-specific LM.
score(s) = PP_I(s) / PP_O(s)    (1)

We build LMs for the domain-specific and non-domain-specific corpora using the four models previously introduced. Then we rank the sentences of the non-domain-specific corpus for each of these models and keep the highest-ranked sentences according to a threshold. Finally, we build a LM on the set of sentences selected[3] and compute the perplexity of the test set on this LM.

We also investigate the combination of the four models. The procedure is fairly straightforward: given the sentences selected by all the models for a given threshold, we iterate through these sentences following the ranking order, keeping all the distinct sentences selected until we obtain a set of sentences whose size is the one indicated by the threshold. That is, we add to our distinct set of sentences first the top-ranked sentence by each of the methods, then the sentence ranked second by each method, and so on.

3 Experiments

3.1 Setting

We use corpora from the translation task at WMT13.[4] Our domain-specific corpus is NC, and we carry out experiments with three non-domain-specific corpora: a subset of Common Crawl[5] (hereafter CC), Europarl version 7 (hereafter EU), and United Nations (Eisele and Chen, 2010) (hereafter UN). We use the test data from WMT12 (newstest2012) as our test set. We carry out experiments on two languages for which these corpora are available: English (referred to as en in tables) and Spanish (es in tables).

[3] For the linguistic methods we replace the sentences selected (which contain lemmas and/or named entities) with the corresponding sentences in the original corpus (containing only word forms).
[4] http://www.statmt.org/wmt13/translation-task.html
[5] http://commoncrawl.org/

We test the methods on three very different non-domain-specific corpora, both in terms of the topics that they cover (text crawled from the web in CC, parliamentary speeches in EU and official documents from the United Nations in UN) and their size
(around 2 million sentences both for CC and EU, and around 11 million for UN). This can be considered a contribution of this paper, since previous works such as Moore and Lewis (2010) and, more recently, Axelrod et al. (2011) test the Moore-Lewis method on only one non-domain-specific corpus: LDC Gigaword and an unpublished general-domain corpus, respectively.

All the LMs are built with IRSTLM 5.80.01 (Federico et al., 2008), use up to 5-grams and are smoothed using a simplified version of the improved Kneser-Ney method (Chen and Goodman, 1996). For lemmatisation and named entity recognition we use FreeLing 3.0 (Padró and Stanilovsky, 2012). The corpora are tokenised and truecased using scripts from the Moses toolkit (Koehn et al., 2007).

3.2 Experiments with Different Models

Figures 1, 2 and 3 show the perplexities obtained by each method on different subsets selected from the English corpora CC, EU and UN, respectively. We obtain these subsets according to different thresholds, i.e. percentages of sentences selected from the non-domain-specific corpus. These are the first 1/64 ranked sentences, 1/32, 1/16, 1/8, 1/4, 1/2 and 1.[6] Corresponding figures for Spanish are omitted due to the limited space available and also because the trends in those figures are very similar.

Figure 1: Results of the different methods on CC

Figure 2: Results of the different methods on EU

Figure 3: Results of the different methods on UN

In all the figures, the results are very similar regardless of the use of lemmas. The use of named entities, however, produces substantially different results. The models that do not use named entity categories obtain the best results for lower thresholds (up to 1/32 for CC, and up to 1/16 both for EU and UN).

[6] An additional threshold, 1/128, is used for the United Nations corpus.
If the best perplexity is obtained with a lower threshold than this (the case of EU, 1/32, and UN, 1/64), then the methods that do not use named entities obtain the best result. However, if the optimal perplexity is obtained with a higher threshold (the case of CC, 1/2), then using named entities yields the best result.

Table 2 presents the results for each model. For each scenario (corpus and language combination), we show the threshold for which the best result is obtained (column best). The perplexity obtained on data selected by each model is shown in the subsequent columns. For the linguistic methods, we also show the comparison of their performance to the baseline (as percentages, columns diff). The perplexity when using the full corpus is shown (column full) together with the comparison of this result to the best method (last column diff).

The results, as previously seen in Figures 1, 2 and 3, differ with respect to the corpus but follow similar trends across languages. For CC we obtain the best results using named entities. The model LN obtains the best result for English (5.54% lower
perplexity than the baseline), while the model FN obtains the best result for Spanish (3.82%), although in both cases the difference between these two models is rather small. For the other corpora, the best results are obtained without named entities. In the case of EU, the baseline obtains the best result, although the model L is not very far (1.18% higher perplexity for English and 1.63% for Spanish). This trend is reversed for UN, the model L obtaining the best scores but close to the baseline (-0.51%, -0.35%).

corpus  best  F        FN       diff    L        diff   LN       diff    full     diff
cc en   1/2   660.77   625.62   -5.32   660.58   -0.03  624.19   -5.54   638.24   -2.20
eu en   1/32  1072.98  1151.13  7.28    1085.66  1.18   1170.00  9.04    1462.61  -26.64
un en   1/64  984.08   1127.55  14.58   979.06   -0.51  1121.45  13.96   1939.44  -49.52
cc es   1/2   499.22   480.17   -3.82   498.93   -0.06  480.45   -3.76   481.96   -0.37
eu es   1/16  788.62   813.32   3.13    801.50   1.63   825.13   4.63    960.06   -17.86
un es   1/32  725.93   773.89   6.61    723.37   -0.35  771.25   6.24    1339.78  -46.01

Table 2: Results for the different models

3.3 Experiments with the Combination of Models

Table 3 shows the perplexities obtained by the method that combines the four models (column comb) for the threshold that yielded the best result in each scenario (see Table 2), compares these results (column diff) to those obtained by the baseline (column F) and shows the percentage of sentences that this method inspected from the sentences selected by the individual methods (column perc).

corpus  F        comb     diff    perc
cc en   660.77   613.83   -7.10   76.90
eu en   1072.98  1035.51  -3.49   70.51
un en   984.08   908.47   -7.68   74.58
cc es   499.22   478.87   -4.08   74.61
eu es   788.62   748.22   -5.12   68.05
un es   725.93   666.62   -8.17   74.32

Table 3: Results of the combination method

The combination method outperforms the baseline and any of the individual linguistic models in all the scenarios. The perplexity obtained by combining the models is substantially lower than that obtained by the baseline (ranging from 3.49% to 8.17%). In all the scenarios, the combination method takes its sentences from roughly the top 70% of the sentences ranked by the individual methods.
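As a sketch of the overall procedure, the following toy code pairs the Equation 1 ratio score with the round-robin combination described above. This is our own illustration, not the paper's implementation: the unigram LM stands in for the 5-gram IRSTLM models actually used, and the example data is invented.

```python
import math
from collections import Counter

def unigram_lm(corpus):
    # Add-one-smoothed unigram log-probabilities (stand-in for a 5-gram LM).
    counts = Counter(w for s in corpus for w in s.split())
    total, vocab = sum(counts.values()), len(counts) + 1
    return lambda w: math.log((counts[w] + 1) / (total + vocab))

def perplexity(logprob, sentence):
    # Per-word perplexity of a sentence under the model `logprob`.
    words = sentence.split()
    return math.exp(-sum(logprob(w) for w in words) / len(words))

def score(sentence, lm_in, lm_out):
    # Equation 1: lower score = more like the domain-specific corpus.
    return perplexity(lm_in, sentence) / perplexity(lm_out, sentence)

def combine(rankings, size):
    # Round-robin over the per-model rankings (e.g. F, FN, L, LN), keeping
    # distinct sentences until `size` sentences have been collected.
    selected, seen = [], set()
    for rank in range(max(len(r) for r in rankings)):
        for ranking in rankings:
            if rank < len(ranking) and ranking[rank] not in seen:
                seen.add(ranking[rank])
                selected.append(ranking[rank])
                if len(selected) == size:
                    return selected
    return selected
```

Sorting a candidate pool by score(s, lm_in, lm_out) puts in-domain-like sentences first; applying combine to the four per-model rankings then yields the hybrid selection.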
4 Conclusions and Future Work

This paper has explored the use of linguistic information (lemmas and named entities) for the task of training data selection for LMs. We have introduced three linguistically motivated models, and compared them to the state-of-the-art method for perplexity-based data selection across three different corpora and two languages. In four out of these six scenarios a linguistically motivated method outperforms the state-of-the-art approach.

We have also presented a method which combines surface forms and the three linguistically motivated methods. This combination outperforms the baseline in all the scenarios, selecting data whose perplexity is between 3.49% and 8.17% (depending on the corpus and language) lower than that of the baseline.

Regarding future work, we have several plans. One interesting experiment would be to apply these models to a morphologically rich language, to check if, as hypothesised, these models deal better with sparse data. Another strand regards the application of these models to filter parallel corpora, e.g. following the extension of the Moore-Lewis method (Axelrod et al., 2011) or in combination with other methods which are deemed to be more suitable for parallel data, e.g. (Mansour et al., 2011). We have used one type of linguistic information in each LM, but another possibility is to combine different pieces of linguistic information in a single LM, e.g. following a hybrid LM that uses words and tags depending on the frequency of each type (Ruiz et al., 2012). Given the fact that the best result is obtained with different models depending on the corpus, it would be worth investigating whether, given a new corpus, one could predict the best method to be applied and the threshold for which one could expect to obtain the minimum perplexity.
Acknowledgments

We would like to thank Raphaël Rubino for insightful conversations. The research leading to these results has received funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreements PIAP-GA-2012-324414 and FP7-ICT-2011-296347.

References

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 355-362, Stroudsburg, PA, USA. Association for Computational Linguistics.

Stanley F. Chen and Joshua Goodman. 1996. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, ACL '96, pages 310-318, Stroudsburg, PA, USA. Association for Computational Linguistics.

Andreas Eisele and Yu Chen. 2010. MultiUN: A multilingual corpus from United Nation documents. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias, editors, LREC. European Language Resources Association.

Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. 2008. IRSTLM: an open source toolkit for handling large scale language models. In INTERSPEECH, pages 1618-1621. ISCA.

Jianfeng Gao, Joshua Goodman, Mingjing Li, and Kai-Fu Lee. 2002. Toward a unified approach to statistical language modeling for Chinese. 1(1):3-33, March.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pages 177-180, Stroudsburg, PA, USA. Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Conference Proceedings: the Tenth Machine Translation Summit, pages 79-86, Phuket, Thailand. AAMT.

Sung-Chien Lin, Chi-Lung Tsai, Lee-Feng Chien, Keh-Jiann Chen, and Lin-Shan Lee. 1997. Chinese language model adaptation based on document classification and multiple domain-specific language models. In George Kokkinakis, Nikos Fakotakis, and Evangelos Dermatas, editors, EUROSPEECH. ISCA.

Saab Mansour, Joern Wuebker, and Hermann Ney. 2011. Combining translation and language model scoring for domain-specific data filtering. In International Workshop on Spoken Language Translation, pages 222-229, San Francisco, California, USA, December.

Robert C. Moore and William Lewis. 2010. Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers, ACLShort '10, pages 220-224, Stroudsburg, PA, USA. Association for Computational Linguistics.

Lluís Padró and Evgeny Stanilovsky. 2012. FreeLing 3.0: Towards wider multilinguality. In Proceedings of the Language Resources and Evaluation Conference (LREC 2012), Istanbul, Turkey, May. ELRA.

Nick Ruiz, Arianna Bisazza, Roldano Cattoni, and Marcello Federico. 2012. FBK's machine translation systems for IWSLT 2012's TED lectures. In Proceedings of the 9th International Workshop on Spoken Language Translation (IWSLT).