Hybrid Selection of Language Model Training Data Using Linguistic Information and Perplexity


Antonio Toral
School of Computing
Dublin City University
Dublin, Ireland
atoral@computing.dcu.ie

Abstract

We explore the selection of training data for language models using perplexity. We introduce three novel models that make use of linguistic information and evaluate them on three different corpora and two languages. In four out of the six scenarios a linguistically motivated method outperforms the purely statistical state-of-the-art approach. Finally, a method which combines surface forms and the linguistically motivated methods outperforms the baseline in all the scenarios, selecting data whose perplexity is between 3.49% and 8.17% (depending on the corpus and language) lower than that of the baseline.

1 Introduction

Language models (LMs) are a fundamental piece in statistical applications that produce natural language text, such as machine translation and speech recognition. In order to perform optimally, a LM should be trained on data from the same domain as the data that it will be applied to. This poses a problem, because in the majority of applications the amount of domain-specific data is limited.

A popular strand of research in recent years to tackle this problem is that of training data selection. Given a limited domain-specific corpus and a larger non-domain-specific corpus, the task consists in finding suitable data for the specific domain in the non-domain-specific corpus. The underlying assumption is that a non-domain-specific corpus, if broad enough, contains sentences similar to those of a domain-specific corpus, which would therefore be useful for training models for that domain.

This paper focuses on the approach that uses perplexity for the selection of training data. The first works in this regard (Gao et al., 2002; Lin et al., 1997) use the perplexity according to a domain-specific LM to rank the text segments (e.g. sentences) of non-domain-specific corpora. The text segments with perplexity less than a given threshold are selected.
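The threshold-based selection used in those first works can be illustrated with a minimal sketch. It assumes an add-one-smoothed unigram LM as a stand-in for a real n-gram toolkit, and the function names (train_unigram, perplexity, select_below) are hypothetical, not taken from the cited papers.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return a probability function for an add-one-smoothed unigram LM.

    Assumption: a toy stand-in for a real n-gram LM toolkit.
    """
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot shared by all unseen tokens
    return lambda tok: (counts[tok] + 1) / (total + vocab)


def perplexity(prob, sentence):
    """Per-token perplexity of a non-empty, whitespace-tokenised sentence."""
    toks = sentence.split()
    log_prob = sum(math.log(prob(t)) for t in toks)
    return math.exp(-log_prob / len(toks))


def select_below(prob, candidates, threshold):
    """Keep the candidates whose perplexity under the domain LM is below threshold."""
    return [s for s in candidates if perplexity(prob, s) < threshold]


if __name__ == "__main__":
    domain_lm = train_unigram([
        "the parliament resumed the session",
        "the session of the parliament",
    ])
    pool = ["the session resumed", "buy cheap watches online"]
    for sentence in pool:
        print(round(perplexity(domain_lm, sentence), 2), sentence)
    print(select_below(domain_lm, pool, threshold=10.0))
```

Only the domain-specific LM is consulted here, which is the limitation that later methods address by also modelling the non-domain-specific corpus.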
A more recent method, which can be considered the state of the art, is Moore-Lewis (Moore and Lewis, 2010). It considers not only the cross-entropy [1] according to the domain-specific LM but also the cross-entropy according to a LM built on a random subset (equal in size to the domain-specific corpus) of the non-domain-specific corpus. The additional use of a LM from the non-domain-specific corpus allows the selection of a subset of the non-domain-specific corpus which is better (a test set of the specific domain has lower perplexity on a LM trained on this subset) and smaller compared to the previous approaches. The experiment was carried out for English, using Europarl (Koehn, 2005) as the domain-specific corpus and LDC Gigaword [2] as the non-domain-specific one.

In this paper we study whether the use of two types of linguistic knowledge (lemmas and named entities) can contribute to obtaining better results within the perplexity-based approach.

[1] Note that using cross-entropy is equivalent to using perplexity, since they are monotonically related.
[2] http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2007T07

Proceedings of the Second Workshop on Hybrid Approaches to Translation, pages 8-12, Sofia, Bulgaria, August 8, 2013. (c) 2013 Association for Computational Linguistics

2 Methodology

We explore the use of linguistic information for the selection of data to train domain-specific LMs from non-domain-specific corpora. Our hypothesis is that ranking by perplexity on n-grams that represent linguistic patterns (rather than n-grams that represent surface forms) captures additional information, and thus may select valuable data that is not selected according solely to surface forms.

We use two types of linguistic information at word level: lemmas and named entity categories. We experiment with the following models:

- Forms (hereafter f), uses surface forms. This model replicates the Moore-Lewis approach and is to be considered the baseline.
- Forms and named entities (hereafter fn), uses surface forms, with the exception of any word detected as a named entity, which is substituted by its type (e.g. person, organisation).
- Lemmas (hereafter l), uses lemmas.
- Lemmas and named entities (hereafter ln), uses lemmas, with the exception of any word detected as a named entity, which is substituted by its type.

A sample sentence, according to each of these models, follows:

f:  I declare resumed the session of the European Parliament
fn: I declare resumed the session of the NP00O00
l:  i declare resume the session of the european_parliament
ln: i declare resume the session of the NP00O00

Table 1 shows the number of n-grams on LMs built on the English side of News Commentary v8 (hereafter NC) for each of the models. Regarding 1-grams, compared to f, the substitution of named entities by their categories (fn) results in a smaller vocabulary size (-24.79%). Similarly, the vocabulary is reduced for the models l (-8.39%) and ln (-44.18%). Although not a result in itself, this might be an indication that using linguistically motivated models could be useful to deal with data sparsity.

n    f          fn         l          ln
1    65076      48945      59619      36326
2    981077     847720     835825     702118
3    2624800    2382629    2447759    2212709
4    3633724    3412719    3523888    3325311
5    3929751    3780064    3856917    3749813

Table 1: Number of n-grams in LMs built using the different models

Our procedure follows that of the Moore-Lewis method. We build LMs for the domain-specific corpus and for a random subset of the non-domain-specific corpus of the same size (number of sentences) as the domain-specific corpus. Each sentence s in the non-domain-specific corpus is then scored according to equation 1, where PP_I(s) is the perplexity of s according to the domain-specific LM and PP_O(s) is the perplexity of s according to the non-domain-specific LM.
score(s) = PP_I(s) - PP_O(s)    (1)

We build LMs for the domain-specific and non-domain-specific corpora using the four models previously introduced. Then we rank the sentences of the non-domain-specific corpus for each of these models and keep the highest ranked sentences according to a threshold. Finally, we build a LM on the set of sentences selected [3] and compute the perplexity of the test set on this LM.

We also investigate the combination of the four models. The procedure is fairly straightforward: given the sentences selected by all the models for a given threshold, we iterate through these sentences following the ranking order and keep all the distinct sentences selected until we obtain a set of sentences whose size is the one indicated by the threshold. That is, we add to our distinct set of sentences first the top ranked sentence by each of the methods, then the sentence ranked second by each method, and so on.

[3] For the linguistic methods we replace the sentences selected (which contain lemmas and/or named entities) with the corresponding sentences in the original corpus (containing only word forms).

3 Experiments

3.1 Setting

We use corpora from the translation task at WMT13 [4]. Our domain-specific corpus is NC, and we carry out experiments with three non-domain-specific corpora: a subset of Common Crawl [5] (hereafter CC), Europarl version 7 (hereafter EU), and United Nations (Eisele and Chen, 2010) (hereafter UN). We use the test data from WMT12 (newstest2012) as our test set. We carry out experiments on two languages for which these corpora are available: English (referred to as en in tables) and Spanish (es in tables).

We test the methods on three very different non-domain-specific corpora, both in terms of the topics that they cover (text crawled from the web in CC, parliamentary speeches in EU and official documents from the United Nations in UN) and their size (around 2 million sentences both for CC and EU, and around 11 million for UN). This can be considered a contribution of this paper, since previous works such as Moore and Lewis (2010) and, more recently, Axelrod et al. (2011) test the Moore-Lewis method on only one non-domain-specific corpus: LDC Gigaword and an unpublished general-domain corpus, respectively.

All the LMs are built with IRSTLM 5.80.01 (Federico et al., 2008), use up to 5-grams and are smoothed using a simplified version of the improved Kneser-Ney method (Chen and Goodman, 1996). For lemmatisation and named entity recognition we use Freeling 3.0 (Padró and Stanilovsky, 2012). The corpora are tokenised and truecased using scripts from the Moses toolkit (Koehn et al., 2007).

[4] http://www.statmt.org/wmt13/translation-task.html
[5] http://commoncrawl.org/

3.2 Experiments with Different Models

Figures 1, 2 and 3 show the perplexities obtained by each method on different subsets selected from the English corpora CC, EU and UN, respectively. We obtain these subsets according to different thresholds, i.e. percentages of sentences selected from the non-domain-specific corpus. These are the first 1/64 of the ranked sentences, then 1/32, 1/16, 1/8, 1/4, 1/2 and 1 (the full corpus) [6]. Corresponding figures for Spanish are omitted due to the limited space available and also because the trends in those figures are very similar.

[6] An additional threshold, 1/128, is used for the United Nations corpus.

[Figure 1: Results of the different methods on CC]
[Figure 2: Results of the different methods on EU]
[Figure 3: Results of the different methods on UN]

In all the figures, the results are very similar regardless of the use of lemmas. The use of named entities, however, produces substantially different results. The models that do not use named entity categories obtain the best results for lower thresholds (up to 1/32 for CC, and up to 1/16 both for EU and UN).
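The scoring of Section 2 and the threshold sweep described above can be sketched as follows. This is a toy illustration under stated assumptions: an add-one-smoothed unigram LM stands in for the 5-gram IRSTLM models, the score is taken as the cross-entropy difference of Moore and Lewis (2010) rather than a difference of raw perplexities, and all function names are hypothetical.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Add-one-smoothed unigram LM (assumption: stand-in for a 5-gram LM)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot shared by all unseen tokens
    return lambda tok: (counts[tok] + 1) / (total + vocab)


def cross_entropy(prob, sentence):
    """Per-token cross-entropy in bits of a non-empty sentence."""
    toks = sentence.split()
    return -sum(math.log2(prob(t)) for t in toks) / len(toks)


def rank_moore_lewis(in_lm, out_lm, pool):
    """Sort the pool so that the most in-domain-like sentences come first.

    A lower score means the sentence is better explained by the
    domain-specific LM than by the non-domain-specific one.
    """
    return sorted(
        pool,
        key=lambda s: cross_entropy(in_lm, s) - cross_entropy(out_lm, s),
    )


def top_fraction(ranked, denominator):
    """Keep the first 1/denominator of the ranking (at least one sentence)."""
    return ranked[:max(1, len(ranked) // denominator)]


if __name__ == "__main__":
    in_lm = train_unigram(["the economy grew last year", "the government cut taxes"])
    out_lm = train_unigram(["click here to buy shoes", "free shipping on shoes"])
    pool = ["buy shoes here", "the shoes", "the government grew"]
    ranked = rank_moore_lewis(in_lm, out_lm, pool)
    for denominator in (64, 32, 16, 8, 4, 2, 1):
        print(f"1/{denominator}:", top_fraction(ranked, denominator))
```

In a real experiment each selected subset would then be used to train a LM whose perplexity on the test set is measured, giving one point per threshold on curves like those in Figures 1 to 3.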
If the best perplexity is obtained at a low threshold, within the ranges where the models without named entities lead (the case of EU, 1/32, and UN, 1/64), then methods that do not use named entities obtain the best result. However, if the optimal perplexity is obtained with a higher threshold (the case of CC, 1/2), then using named entities yields the best result.

Table 2 presents the results for each model. For each scenario (corpus and language combination), we show the threshold for which the best result is obtained (column best). The perplexity obtained on data selected by each model is shown in the subsequent columns. For the linguistic methods, we also show the comparison of their performance to the baseline (as percentages, columns diff). The perplexity when using the full corpus is shown (column full) together with the comparison of this result to the best method (last column diff).

corpus  best  f        fn       diff   l        diff   ln       diff   full     diff
cc en   1/2   660.77   625.62   -5.32  660.58   -0.03  624.19   -5.54  638.24   -2.20
eu en   1/32  1072.98  1151.13  7.28   1085.66  1.18   1170.00  9.04   1462.61  -26.64
un en   1/64  984.08   1127.55  14.58  979.06   -0.51  1121.45  13.96  1939.44  -49.52
cc es   1/2   499.22   480.17   -3.82  498.93   -0.06  480.45   -3.76  481.96   -0.37
eu es   1/16  788.62   813.32   3.13   801.50   1.63   825.13   4.63   960.06   -17.86
un es   1/32  725.93   773.89   6.61   723.37   -0.35  771.25   6.24   1339.78  -46.01

Table 2: Results for the different models

The results, as previously seen in Figures 1, 2 and 3, differ with respect to the corpus but follow similar trends across languages. For CC we obtain the best results using named entities. The model ln obtains the best result for English (5.54% lower perplexity than the baseline), while the model fn obtains the best result for Spanish (3.82% lower), although in both cases the difference between these two models is rather small. For the other corpora, the best results are obtained without named entities. In the case of EU, the baseline obtains the best result, although the model l is not far behind (1.18% higher perplexity for English and 1.63% for Spanish). This trend is reversed for UN, where the model l obtains the best scores but remains close to the baseline (-0.51% for English, -0.35% for Spanish).

3.3 Experiments with the Combination of Models

Table 3 shows the perplexities obtained by the method that combines the four models (column comb) for the threshold that yielded the best result in each scenario (see Table 2), compares these results (column diff) to those obtained by the baseline (column f) and shows the percentage of sentences that this method inspected from the sentences selected by the individual methods (column perc).

corpus  f        comb     diff   perc
cc en   660.77   613.83   -7.10  76.90
eu en   1072.98  1035.51  -3.49  70.51
un en   984.08   908.47   -7.68  74.58
cc es   499.22   478.87   -4.08  74.61
eu es   788.62   748.22   -5.12  68.05
un es   725.93   666.62   -8.17  74.32

Table 3: Results of the combination method

The combination method outperforms the baseline and any of the individual linguistic models in all the scenarios. The perplexity obtained by combining the models is substantially lower than that obtained by the baseline (by 3.49% to 8.17%). In all the scenarios, the combination method takes its sentences from roughly the top 70% (between 68.05% and 76.90%) of the sentences ranked by the individual methods.
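The combination procedure described in Section 2 amounts to a round-robin merge of the individual rankings. The sketch below is one possible reading of it, not the author's implementation: the function name is hypothetical, all rankings are assumed to order the same pool of sentences, and the returned depth (how many rank positions were consulted) is only an assumed interpretation of what the perc column measures.

```python
def combine_rankings(rankings, size):
    """Merge rankings round-robin, keeping distinct sentences until `size` are kept.

    rankings: list of lists, each ordering the same sentences (best first).
    Returns (selected sentences, number of rank positions consulted).
    """
    selected, seen = [], set()
    depth = 0
    # zip(*rankings) yields the sentences ranked 1st by every method,
    # then the sentences ranked 2nd by every method, and so on.
    for depth, tier in enumerate(zip(*rankings), start=1):
        for sentence in tier:
            if sentence not in seen:
                seen.add(sentence)
                selected.append(sentence)
                if len(selected) == size:
                    return selected, depth
    return selected, depth


if __name__ == "__main__":
    rankings = [
        ["a", "b", "c", "d"],  # e.g. ranking by surface forms
        ["a", "c", "b", "d"],  # e.g. ranking by lemmas
        ["b", "a", "d", "c"],  # e.g. ranking with named entity categories
    ]
    selected, depth = combine_rankings(rankings, size=3)
    print(selected, depth)
```

Because the rankings overlap only partially, the merge reaches the target size before exhausting the top `size` positions of each ranking, which is consistent with the combination drawing on only part of each individual selection.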
4 Conclusions and Future Work

This paper has explored the use of linguistic information (lemmas and named entities) for the task of training data selection for LMs. We have introduced three linguistically motivated models, and compared them to the state-of-the-art method for perplexity-based data selection across three different corpora and two languages. In four out of these six scenarios a linguistically motivated method outperforms the state-of-the-art approach.

We have also presented a method which combines surface forms and the three linguistically motivated methods. This combination outperforms the baseline in all the scenarios, selecting data whose perplexity is between 3.49% and 8.17% (depending on the corpus and language) lower than that of the baseline.

Regarding future work, we have several plans. One interesting experiment would be to apply these models to a morphologically rich language, to check whether, as hypothesised, these models deal better with sparse data.

Another strand regards the application of these models to filter parallel corpora, e.g. following the extension of the Moore-Lewis method (Axelrod et al., 2011) or in combination with other methods which are deemed to be more suitable for parallel data, e.g. (Mansour et al., 2011).

We have used one type of linguistic information in each LM, but another possibility is to combine different pieces of linguistic information in a single LM, e.g. following a hybrid LM that uses words and tags, depending on the frequency of each type (Ruiz et al., 2012).

Given that the best result is obtained with different models depending on the corpus, it would be worth investigating whether, given a new corpus, one could predict the best method to be applied and the threshold for which one could expect to obtain the minimum perplexity.
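For reference, the four sentence representations compared in this paper (surface forms, lemmas, and each of the two with named entities replaced by their category) can be derived from one analysed sentence as in the sketch below. The Token type and the view function are hypothetical; the analysis is hard-coded here, whereas in the experiments it would come from a lemmatiser and named entity recogniser such as Freeling.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    form: str               # surface form (a multiword entity is one token)
    lemma: str              # lemma as produced by the analyser
    ne_tag: Optional[str]   # named entity category tag, or None


def view(tokens, use_lemmas, use_entities):
    """Render a sentence under one of the four representations."""
    out = []
    for tok in tokens:
        if use_entities and tok.ne_tag is not None:
            out.append(tok.ne_tag)  # substitute the entity by its category
        else:
            out.append(tok.lemma if use_lemmas else tok.form)
    return " ".join(out)


# Hand-written analysis of the sample sentence from Section 2 (assumption:
# this mirrors, but was not produced by, the analyser used in the paper).
SAMPLE = [
    Token("I", "i", None),
    Token("declare", "declare", None),
    Token("resumed", "resume", None),
    Token("the", "the", None),
    Token("session", "session", None),
    Token("of", "of", None),
    Token("the", "the", None),
    Token("European Parliament", "european_parliament", "NP00O00"),
]

if __name__ == "__main__":
    for use_lemmas in (False, True):
        for use_entities in (False, True):
            print(view(SAMPLE, use_lemmas, use_entities))
```

Each representation would be used to build its own pair of LMs and its own ranking, with the selected sentences mapped back to their original word forms before the final LM is trained.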

Acknowledgments

We would like to thank Raphaël Rubino for insightful conversations. The research leading to these results has received funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreements PIAP-GA-2012-324414 and FP7-ICT-2011-296347.

References

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 355-362, Stroudsburg, PA, USA. Association for Computational Linguistics.

Stanley F. Chen and Joshua Goodman. 1996. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th annual meeting on Association for Computational Linguistics, ACL '96, pages 310-318, Stroudsburg, PA, USA. Association for Computational Linguistics.

Andreas Eisele and Yu Chen. 2010. MultiUN: A multilingual corpus from United Nation documents. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias, editors, LREC. European Language Resources Association.

Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. 2008. IRSTLM: an open source toolkit for handling large scale language models. In INTERSPEECH, pages 1618-1621. ISCA.

Jianfeng Gao, Joshua Goodman, Mingjing Li, and Kai-Fu Lee. 2002. Toward a unified approach to statistical language modeling for Chinese. 1(1):3-33, March.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pages 177-180, Stroudsburg, PA, USA. Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Conference Proceedings: the tenth Machine Translation Summit, pages 79-86, Phuket, Thailand. AAMT.

Sung-Chien Lin, Chi-Lung Tsai, Lee-Feng Chien, Keh-Jiann Chen, and Lin-Shan Lee. 1997. Chinese language model adaptation based on document classification and multiple domain-specific language models. In George Kokkinakis, Nikos Fakotakis, and Evangelos Dermatas, editors, EUROSPEECH. ISCA.

Saab Mansour, Joern Wuebker, and Hermann Ney. 2011. Combining translation and language model scoring for domain-specific data filtering. In International Workshop on Spoken Language Translation, pages 222-229, San Francisco, California, USA, December.

Robert C. Moore and William Lewis. 2010. Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers, ACLShort '10, pages 220-224, Stroudsburg, PA, USA. Association for Computational Linguistics.

Lluís Padró and Evgeny Stanilovsky. 2012. Freeling 3.0: Towards wider multilinguality. In Proceedings of the Language Resources and Evaluation Conference (LREC 2012), Istanbul, Turkey, May. ELRA.

Nick Ruiz, Arianna Bisazza, Roldano Cattoni, and Marcello Federico. 2012. FBK's Machine Translation Systems for IWSLT 2012's TED Lectures. In Proceedings of the 9th International Workshop on Spoken Language Translation (IWSLT).