Statistical Machine Translation for Automobile Marketing Texts
Samuel Läubli (1), Mark Fishel (1), Manuela Weibel (2), Martin Volk (1)

(1) Institute of Computational Linguistics, University of Zurich, Binzmühlestrasse 14, CH-8050 Zürich
(2) SemioticTransfer AG, Bruggerstrasse 37, CH-5400 Baden

Abstract

We describe a project on introducing an in-house statistical machine translation system for marketing texts from the automobile industry, with the final aim of replacing manual translation with post-editing based on the translation system. The focus of the paper is the suitability of such texts for SMT; we present experiments in domain adaptation and decompounding that improve the baseline translation systems, the results of which are evaluated using automatic metrics as well as manual evaluation.

1 Introduction

As machine translation and post-editing are gaining popularity on an industrial level, global companies turn to using their accumulated translations for setting up in-house SMT services. Substantial cost and time savings have been reported in various sectors, such as software localization (Flournoy, 2011; Zhechev, 2012) or film and television subtitling (Volk et al., 2010).

In a joint project between the University of Zurich and SemioticTransfer AG, we target a new domain: automobile marketing texts. We show that even limited amounts of in-domain translation material allow for building domain-specific SMT systems (see Section 3), and that translation quality can be significantly improved by using out-of-domain material and language-specific preprocessing (see Section 4). Our aim is to give an example of how even small language service providers with distinct areas of specialization can successfully incorporate machine translation and post-editing into their translation workflow. In the case of our project, this claim is tested on two language pairs and translation directions: German (DE) to French (FR) and to Italian (IT).
A closer description of our domain is presented in the next section.

Simaan, K., Forcada, M.L., Grasmick, D., Depraetere, H., Way, A. (eds.) Proceedings of the XIV Machine Translation Summit (Nice, September 2-6, 2013). (c) 2013 The authors. This article is licensed under a Creative Commons 3.0 licence, no derivative works, attribution, CC-BY-ND.

2 Domain Description

SemioticTransfer has specialized in translating marketing materials for the automobile industry. This includes print products such as brochures or price lists as well as electronic materials such as websites or newsletters, but no technical documentation such as manuals. As a whole, the domain covers a very broad spectrum of language, ranging from highly emotional and metaphorical to very dry and technical. As an example, consider the following two segments, both stemming from the same high-gloss catalogue:

(a) The R8 e-tron combines the genes of a sports coupé and a race car in an athlete's body which expresses technological progress and superiority.

(b) size 8.5 J x 19 at front, size 11 J x 19 at rear, with 235/35 R 19 tires at front and 295/30 R 19 tires at rear

Tailoring SMT systems to producing good pre-translations for our entire domain is thus challenging. Before describing our corresponding experiments in Section 4, we outline the technical setup and report on the performance of simple baseline systems in our use case.

3 Baseline Translation Systems

3.1 Technical Setup

The technical setup of all experiments closely resembles the setup of the baseline systems in the shared task of the Workshop on Statistical Machine Translation (Callison-Burch et al., 2012). The only difference is that we used IRSTLM for language modeling (Federico et al., 2008) because of licensing issues. We used a test set of 500 in-domain segments for automatic evaluation; these were randomly drawn from contracts that were processed after compiling the training and development sets (see Section 3.2). Using a moderate test set size enabled detailed manual inspection and categorization of the machine translations, e.g., identifying the number of compounds in out-of-vocabulary (OOV) types. All automatic metric scores were calculated using multeval (Clark et al., 2011). Unless otherwise stated, the scores are averages over five MERT runs (Och, 2003; Bertoldi et al., 2009). METEOR scores are only given for DE-FR systems since Italian is not fully supported (Denkowski and Lavie, 2011). As SemioticTransfer's translation workflow is based on the Across workbench, we implemented an RPC layer that allows for integrating Moses server instances into Across.
In this way, we were able to apply various pre- and post-processing methods to translations while ensuring seamless integration of the Moses systems as a pre-translation service into SemioticTransfer's existing infrastructure.

3.2 Baseline Systems

SemioticTransfer has been using Across for several years, through which it has accumulated a large amount of quality-checked translations in its translation memories. We extracted all of this material to train in-domain baseline SMT systems for both language pairs; 2,000 in-domain segments were used as a development set for tuning. Note that unlike regular parallel corpora, our in-domain corpus contains each translation only once, i.e., no sentence-level frequencies are available.

Using in-domain translation memory data turned out to be a promising starting point for producing good-quality pre-translations (see Table 1). Despite the small amount of training data, the systems score 33.5 (DE-FR) and 32.3 (DE-IT) BLEU points. However, the number of OOV words is high. Besides inconsistencies in punctuation and number formatting, untranslated words were found to be particularly disturbing in manual inspection of the results. As a consequence, we applied several techniques aimed at reducing the number of unknown words in our baseline systems.

4 Improved Translation Systems

As mentioned in the previous section, we focused our efforts on improving translation
quality by reducing the rate of OOV input types. This was done by adding general-domain corpora and combining them with our in-domain data via domain adaptation, as well as by decompounding methods on both the in-domain data and the mixed-domain set.

                 |             DE-FR                      |             DE-IT
                 | In-Domain  Europarl    OpenSubtitles   | In-Domain  Europarl    OpenSubtitles
Tokens DE        | 2,011,872  48,405,406  16,858,070      | 1,413,452  48,419,389  15,642,379
Tokens FR/IT     | 2,632,256  56,372,702  16,370,845      | 1,731,219  50,689,987  15,458,666
Tokens OOV       | 4.3 %      12.5 %      14.1 %          | 4.7 %      9.6 %       11.1 %
Segments         | 166,957    1,903,628   2,852,          | ,166       1,805,792   2,131,004
Tokens/Sg. DE    |
Tokens/Sg. FR/IT |
Types DE         | 86,        ,           ,408            | 59,        ,           ,020
Types FR/IT      | 47,        ,           ,774            | 34,        ,           ,736
Types OOV        | 9.6 %      24.2 %      26.8 %          | 11.8 %     19.8 %      24.4 %
Avg. BLEU        |
Avg. METEOR      |
Avg. TER         |

Table 1: Training data for in- and out-of-domain language and translation models. OOV rates and automatic evaluation metrics refer to a test set of 500 in-domain segments (see Section 3.1).

4.1 Related Work

Domain adaptation has been applied to most components of statistical machine translation: language models (Clarkson and Robinson, 1997; Koehn and Schroeder, 2007), word alignment (Hua et al., 2005), and translation models (Foster and Kuhn, 2007; Sennrich, 2012). Combinations of these methods often show that there is an overlap in the translation problems that the methods fix, which leads to one method stealing part of the effect of the others (Koehn and Schroeder, 2007; Sennrich, 2012). We perform domain adaptation with language and translation models, following the mixture-modeling approach of Foster and Kuhn (2007) and Sennrich (2012).

A common method to handle complex compounding in languages such as German is to split unknown compounds into their parts, if these occur in the training data, and to join the translated parts on the output side of the translation system. The splitting can be motivated morphologically (Stymne, 2009; Hardmeier et al., 2010) or empirically (Koehn and Knight, 2003; Dyer, 2009).
Also, instead of making a final decision on whether to split a compound or not, both alternatives can be passed to the translation system by representing the input text with a lattice of phrases instead of just a single sentence (Dyer et al., 2008; Dyer, 2009; Wuebker and Ney, 2012). Our approach lies between the splitting and lattice-based approaches and is especially tailored for translation between languages with heavy compounding and word order differences.

4.2 Domain Adaptation

The moderate size of our in-domain data set and the relatively high OOV rates suggest adding bigger general-domain corpora to the training material. The specific nature of the in-domain data makes it necessary to use domain adaptation techniques, to avoid having the bigger general-domain data override the original
domain-specific translations. We used two freely available out-of-domain corpora for these experiments: Europarl v7 (Koehn, 2005) and OpenSubtitles (Tiedemann, 2009). Stand-alone systems trained on these corpora result in unusable translations with very low scores (see Table 1).

For domain adaptation, we used multi-domain mixture-modeling (Foster and Kuhn, 2007) for language and translation models. The main distinctive feature of this approach is that instead of a binary in-domain/out-of-domain treatment of the data sets, each separate domain is assigned a weight reflecting its similarity to in-domain (parallel) texts. Mixture-modeling and optimization of these weights is implemented in IRSTLM for language models (Federico et al., 2008) and in tmcombine for translation models (Sennrich, 2012).

The weight distribution discovered by the optimization step was very similar for both kinds of models and both language pairs: around 93% of the distribution mass went to the in-domain data, 5-6% to Europarl, and 1-2% to OpenSubtitles. The resulting translation quality scores are presented in Table 2; all scores for both language pairs show a stable and significant improvement. The OOV rate for types dropped from 9.6% to 5.8% for DE-FR and from 11.8% to 6.1% for DE-IT in the adapted model combination. Manual evaluation and a more detailed description of the domain-adaptation experiments can be found in Läubli et al. (2013).

                                    |        DE-FR                  |        DE-IT
Metric  Mode                        | Avg  s_sel  s_Test  p-value   | Avg  s_sel  s_Test  p-value
BLEU    In-Domain Only              |
        Domain Adaptation (DA)      |
        DA + Decompounding A        |
        DA + Decompounding B        |
TER     In-Domain Only              |
        Domain Adaptation (DA)      |
        DA + Decompounding A        |
        DA + Decompounding B        |
METEOR  In-Domain Only              |
        Domain Adaptation (DA)      |
        DA + Decompounding A        |
        DA + Decompounding B        |

Table 2: Automatic evaluation of the baseline and combined systems. p-values are relative to the preceding system and indicate whether a score improves (+) or decreases (-) significantly.
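The effect of such mixture weights can be illustrated with a minimal sketch of linearly interpolated unigram language models. This is our own toy code, not the IRSTLM or tmcombine implementation; the names `train_unigram` and `mixture_model` and the sample tokens are assumptions for illustration only:

```python
from collections import Counter

def train_unigram(corpus):
    """Estimate a unigram language model from a list of tokens."""
    counts = Counter(corpus)
    total = sum(counts.values())
    return lambda w: counts[w] / total if total else 0.0

def mixture_model(models, weights):
    """Linear interpolation of per-domain models:
    p(w) = sum_d lambda_d * p_d(w), with the lambdas summing to 1
    (e.g. ~0.93 in-domain, ~0.06 Europarl, ~0.01 OpenSubtitles)."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return lambda w: sum(lam * m(w) for lam, m in zip(weights, models))

# A high in-domain weight lets domain-specific estimates dominate:
in_domain = train_unigram(["der", "neue", "Sportwagen", "der"])
general = train_unigram(["der", "das", "Haus", "ist"])
mix = mixture_model([in_domain, general], [0.93, 0.07])
```

Because the in-domain weight dominates, domain-specific terms keep a high mixture probability even though they are rare or absent in the general-domain data.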
4.3 Decompounding

Categorizing the remaining German OOV types in our DE-FR test set revealed that 65% of them were either nominal or adjectival compounds. While adding out-of-domain data reduced the number of normal OOV words by 81%, only 24% of the OOV compounds could be translated in this way. (We defined normal words as the set of types that do not belong to any of the following categories: compounds, numbers, foreign words, spelling errors, casing errors, tokenization errors.) In other words, our categorization confirmed that even big parallel corpora contain few domain-specific German compounds such as Fahrzeugmodells (car type, genitive), and also few general-domain compounds such as Winterbeginn (onset of winter).
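The OOV bookkeeping behind these figures can be sketched as follows. This is a minimal illustration under our own naming (`oov_type_rate` is a hypothetical helper); the reported rates were of course computed on the full corpora:

```python
def oov_type_rate(train_tokens, test_tokens):
    """Share of test-set types (unique tokens) that never occur
    in the training data, as reported in the Types OOV rows of Table 1."""
    train_vocab = set(train_tokens)
    test_types = set(test_tokens)
    oov_types = test_types - train_vocab
    return len(oov_types) / len(test_types)
```

Adding out-of-domain corpora simply enlarges the training vocabulary passed in as `train_tokens`, which is why the type-level OOV rate drops after domain adaptation.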
4.3.1 Basic Approach

Koehn and Knight (2003) have addressed compounding in machine translation with an empirical method. Their approach builds upon the fact that even if a compound is out-of-vocabulary, the training data or phrase table might still contain its distinct parts. Koehn and Knight thus consider all splits S of an unknown compound C into known parts p (substrings) that entirely cover C. Given the counts of words in the training corpus, the best split ŝ has the highest geometric mean of word frequencies of its n parts:

    ŝ = argmax_S ( ∏_{p_i ∈ S} count(p_i) )^(1/n)

This leaves compounds unbroken if they appear more frequently in the training material than their parts. For example, Fahrzeugmodell (car type) is preserved in our DE-FR model, while Fahrzeugmodells (car type, genitive) is split into Fahrzeug Modells.

We applied Koehn and Knight's decompounding method to our domain-adapted DE-FR and DE-IT systems. Using the corresponding Moses implementation, we split compounds on the German side of each corpus (in-domain, Europarl, and OpenSubtitles) to train new translation models. This leads to a significant improvement of all metric scores for DE-FR, but not for DE-IT (see Table 2, DA + Decompounding A).

The actual translations of compounds in our test set did not meet our expectations. Despite the metric improvements, we identified two severe problems: missing function words and word order. In both French and Italian, the parts of a compound are often connected by function words such as de (FR) or di (IT). Moreover, their order is usually reversed, at least in compounds with only two stems. Consider for example the German compound Fahrzeugmodells (car type, genitive), which translates into modèle de véhicule (FR) and modello di veicolo (IT), respectively.
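The frequency-based splitting criterion above can be sketched as follows. This is our own simplified reimplementation, not the Moses code; lowercased input and the fixed minimum part length `min_len` are assumptions:

```python
import math

def best_split(compound, counts, min_len=3):
    """Return the best segmentation of `compound` into known parts,
    scored by the geometric mean of part frequencies (Koehn & Knight, 2003)."""
    def candidate_splits(word):
        yield [word]  # the unsplit word is always a candidate
        for i in range(min_len, len(word) - min_len + 1):
            head = word[:i]
            if counts.get(head, 0) > 0:  # only split at known prefixes
                for rest in candidate_splits(word[i:]):
                    yield [head] + rest

    def geometric_mean(parts):
        freqs = [counts.get(p, 0) for p in parts]
        if 0 in freqs:  # any unknown part disqualifies the split
            return 0.0
        return math.exp(sum(math.log(f) for f in freqs) / len(freqs))

    return max(candidate_splits(compound), key=geometric_mean)
```

With counts for "fahrzeug" and "modells" in the corpus, "fahrzeugmodells" is split into its two parts; a compound that is itself more frequent than its parts stays unbroken, exactly as described above.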
As function words and reordering are crucial for translating German compounds into Romance languages such as French or Italian, we propose a modified approach to decompounding in the following section.

4.3.2 Using Ambiguous Input

Dyer (2009) and Hardmeier et al. (2010) have used word segmentation lattices to pass multiple analyses of an input segment to an SMT decoder. The intuition behind this concept is to avoid the propagation of errors caused by selecting one single best hypothesis, and to let the multiple splits be evaluated against each other at runtime. The lattice-based approach, however, does not address the changed order or the insertion of function words, which is what we propose.

Our updated input lattice includes the original compound, a path with its two split parts, a similar path with the order of the compound parts switched, and finally the switched-order path with several function word alternatives inserted in between. The latter are inserted in the source language (German) to avoid forcing certain translation variants. For Fahrzeugmodells, the modified input lattice contains the following paths:

    Fahrzeugmodells
    Fahrzeug Modells
    Modells Fahrzeug
    Modells von/des/... Fahrzeug

At this experimental stage, we have weighted all paths equally. Also, the choice of function words is naturally arbitrary. To test the modified approach, we first ran a pilot experiment on in-domain data only for DE-FR (see Table 3). Although the automatic
scores of the Koehn and Knight (2003) approach (Decompounding A) and our lattice-based extension (Decompounding B) did not differ significantly, we found a number of compounds in our test set that were now translated correctly (modèle de véhicule instead of véhicule modèle) or in a more comprehensible way (la liste des amis instead of les amis liste). When adding more data through domain adaptation (see Section 4.2), however, Decompounding B is outperformed by Decompounding A (see Table 2). In order to assess whether this difference is also perceptible to an actual translator, we conducted a human evaluation experiment.

Mode               BLEU   METEOR   TER
In-Domain Only
Decompounding A
Decompounding B

Table 3: Effects of decompounding techniques (see Sections 4.3.1 and 4.3.2) on DE-FR systems using in-domain training material only.

4.3.3 Human Evaluation

Our human evaluation was primarily aimed at testing whether translators or post-editors prefer machine translations that involve decompounding over those that leave unknown compounds untranslated. The human evaluator, fluent in the target languages (FR, IT) as well as familiar with the specific domain terminology, compared the decompounding setups (Decompounding A and B) to No Decompounding in both language pairs. This resulted in four tasks, each of which consisted in comparing 150 segments. This procedure, commonly referred to as pairwise ranking, is well established in MT research (Callison-Burch et al., 2012). We included between four and twenty duplicates per task in order to measure intra-annotator agreement, which turned out to be κ = 0.93, i.e., almost perfect according to Landis and Koch (1977). We also calculated a p-value for each task in order to quantify genuine differences between each two systems. As in Callison-Burch et al. (2012), we ignored ties and applied the Sign Test for paired observations.
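For reproducibility, the exact two-sided sign test on win/loss counts (ties already removed) can be sketched as follows. This is a standard textbook formulation in our own code, not the original evaluation script:

```python
from math import comb

def sign_test(wins, losses):
    """Exact two-sided sign test for paired comparisons under the null
    hypothesis that each non-tied comparison is a fair coin flip."""
    n = wins + losses
    k = min(wins, losses)
    # probability, under p = 0.5, of an outcome at least as lopsided
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For instance, 8 wins against 2 losses yields p ≈ 0.109, i.e., not significant at the 5% level, which illustrates why clear preferences on small numbers of non-tied segments can still fail the test.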
In contrast to the automatic evaluation results, the human evaluation shows a better performance of Decompounding B, which outperformed No Decompounding in both language pairs, whereas Decompounding A only performed better for German-Italian. However, the differences are not statistically significant. Due to the domain-specific training material, frequently used automobile and marketing terms such as LED-Leseleuchten (LED reading lights) → lampes de lecture à DEL (DE-FR) or Bremsenergie-Rückgewinnung (recovery of braking energy) → il recupero dell'energia di frenata (DE-IT) were translated correctly even by the systems with no decompounding; only unknown compounds could not be translated. Thus, the human evaluator's preference for translations that involve decompounding is obvious only in cases of successful translations of unknown compounds, such as Datenschutzerklärung (privacy statement) → déclaration de la protection des données (DE-FR; Decompounding B).

5 Conclusion

In this paper, we have shown how translation memories of limited size can be used to build domain-specific SMT systems for automobile marketing texts. Our work is targeted at enabling language service providers with expertise in distinct domains and language pairs to incorporate post-editing into their translation workflow.
                 |          DE-FR                 |          DE-IT
Mode             | Sg.  loss  tie  win  p-val.    | Sg.  loss  tie  win  p-val.
Decompounding A  |
Decompounding B  |

Table 4: Human evaluation. Both decompounding methods (see Sections 4.3.1 and 4.3.2) are evaluated against the No Decompounding baseline (see Section 4.2). p-values indicate significant differences between two systems (win > loss). Intra-annotator agreement: κ = 0.93.

In particular, we have used freely available out-of-domain data to reduce the number of untranslated (OOV) words in our translation models. Combining in- and out-of-domain resources into weighted mixture-models (Sennrich, 2012) ensures that domain-specific terminology prevails over alternative translations. The OOV rate was further reduced by splitting compounds in German source segments; in our German-French test set, the number of OOV types dropped from 194 to 60 (-69%) through domain adaptation and decompounding. Altogether, our combined systems significantly outperform the in-domain-only baselines in both language pairs in terms of BLEU, METEOR, and TER.

Acknowledgements

This work was supported by Grant No PFES-ES from the Swiss Federal Commission for Technology and Innovation.

References

Nicola Bertoldi, Barry Haddow, and Jean-Baptiste Fouet. Improved minimum error rate training in Moses. The Prague Bulletin of Mathematical Linguistics, 91(1):7-16, 2009.

Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. Findings of the 2012 workshop on statistical machine translation. In Proceedings of WMT, pages 10-51, Montréal, Canada, 2012.

Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of ACL/HLT, Portland, USA, 2011.

Philip R. Clarkson and Anthony J. Robinson. Language model adaptation using mixtures and an exponentially decaying cache. In Proceedings of ICASSP, volume 2, Munich, Germany, 1997.

Michael Denkowski and Alon Lavie.
Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of WMT, pages 85-91, Edinburgh, UK, 2011.

Chris Dyer. Using a maximum entropy model to build segmentation lattices for MT. In Proceedings of HLT-NAACL, Boulder, USA, 2009.

Christopher Dyer, Smaranda Muresan, and Philip Resnik. Generalizing word lattice translation. In Proceedings of ACL, Columbus, USA, 2008.

Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. IRSTLM: An open source toolkit for handling large scale language models. In Proceedings of Interspeech, Brisbane, Australia, 2008.
Raymond Flournoy. MT use within the enterprise: Encouraging adoption via a unified MT API. In Proceedings of MT Summit XIII, Xiamen, China, 2011.

George Foster and Roland Kuhn. Mixture-model adaptation for SMT. In Proceedings of WMT, Prague, Czech Republic, 2007.

Christian Hardmeier, Arianna Bisazza, and Marcello Federico. FBK at WMT 2010: Word lattices for morphological reduction and chunk-based reordering. In Proceedings of WMT/MetricsMATR, pages 88-92, Uppsala, Sweden, 2010.

Wu Hua, Wang Haifeng, and Liu Zhanyi. Alignment model adaptation for domain-specific word alignment. In Proceedings of ACL, Ann Arbor, USA, 2005.

Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of MT Summit X, pages 79-86, Phuket, Thailand, 2005.

Philipp Koehn and Kevin Knight. Empirical methods for compound splitting. In Proceedings of EACL, Budapest, Hungary, 2003.

Philipp Koehn and Josh Schroeder. Experiments in domain adaptation for statistical machine translation. In Proceedings of WMT, Prague, Czech Republic, 2007.

J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data. Biometrics, 1977.

Samuel Läubli, Mark Fishel, Martin Volk, and Manuela Weibel. Combining domain-specific translation memories with general-domain parallel corpora in statistical machine translation systems. In Proceedings of NoDaLiDa, Oslo, Norway, 2013.

Franz Josef Och. Minimum error rate training in statistical machine translation. In Proceedings of ACL, Sapporo, Japan, 2003.

Rico Sennrich. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of EACL, Trento, Italy, 2012.

Sara Stymne. Compound processing for phrase-based statistical machine translation. Master's thesis, Linköping University, Linköping, Sweden, 2009.

Jörg Tiedemann. News from OPUS - A collection of multilingual parallel corpora with tools and interfaces.
In Proceedings of RANLP, volume V, Borovets, Bulgaria, 2009.

Martin Volk, Rico Sennrich, Christian Hardmeier, and Frida Tidström. Machine translation of TV subtitles for large scale production. In Second Joint EM+/CNGL Workshop, pages 53-62, Denver, USA, 2010.

Joern Wuebker and Hermann Ney. Phrase model training for statistical machine translation with word lattices of preprocessing alternatives. In Proceedings of WMT, Montreal, Canada, 2012.

Ventsislav Zhechev. Machine translation infrastructure and post-editing performance at Autodesk. In Proceedings of WPTP, pages 87-96, San Diego, USA, 2012.
Cache-based Online Adaptation for Machine Translation Enhanced Computer Assisted Translation
Cache-based Online Adaptation for Machine Translation Enhanced Computer Assisted Translation Nicola Bertoldi Mauro Cettolo Marcello Federico FBK - Fondazione Bruno Kessler via Sommarive 18 38123 Povo,
The XMU Phrase-Based Statistical Machine Translation System for IWSLT 2006
The XMU Phrase-Based Statistical Machine Translation System for IWSLT 2006 Yidong Chen, Xiaodong Shi Institute of Artificial Intelligence Xiamen University P. R. China November 28, 2006 - Kyoto 13:46 1
UEdin: Translating L1 Phrases in L2 Context using Context-Sensitive SMT
UEdin: Translating L1 Phrases in L2 Context using Context-Sensitive SMT Eva Hasler ILCC, School of Informatics University of Edinburgh [email protected] Abstract We describe our systems for the SemEval
Machine translation techniques for presentation of summaries
Grant Agreement Number: 257528 KHRESMOI www.khresmoi.eu Machine translation techniques for presentation of summaries Deliverable number D4.6 Dissemination level Public Delivery date April 2014 Status Author(s)
Appraise: an Open-Source Toolkit for Manual Evaluation of MT Output
Appraise: an Open-Source Toolkit for Manual Evaluation of MT Output Christian Federmann Language Technology Lab, German Research Center for Artificial Intelligence, Stuhlsatzenhausweg 3, D-66123 Saarbrücken,
SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 统
SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems Jin Yang, Satoshi Enoue Jean Senellart, Tristan Croiset SYSTRAN Software, Inc. SYSTRAN SA 9333 Genesee Ave. Suite PL1 La Grande
An Online Service for SUbtitling by MAchine Translation
SUMAT CIP-ICT-PSP-270919 An Online Service for SUbtitling by MAchine Translation Annual Public Report 2012 Editor(s): Contributor(s): Reviewer(s): Status-Version: Arantza del Pozo Mirjam Sepesy Maucec,
SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 CWMT2011 技 术 报 告
SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 Jin Yang and Satoshi Enoue SYSTRAN Software, Inc. 4444 Eastgate Mall, Suite 310 San Diego, CA 92121, USA E-mail:
Collaborative Machine Translation Service for Scientific texts
Collaborative Machine Translation Service for Scientific texts Patrik Lambert [email protected] Jean Senellart Systran SA [email protected] Laurent Romary Humboldt Universität Berlin
Computer Aided Translation
Computer Aided Translation Philipp Koehn 30 April 2015 Why Machine Translation? 1 Assimilation reader initiates translation, wants to know content user is tolerant of inferior quality focus of majority
Systematic Comparison of Professional and Crowdsourced Reference Translations for Machine Translation
Systematic Comparison of Professional and Crowdsourced Reference Translations for Machine Translation Rabih Zbib, Gretchen Markiewicz, Spyros Matsoukas, Richard Schwartz, John Makhoul Raytheon BBN Technologies
TAPTA: Translation Assistant for Patent Titles and Abstracts
TAPTA: Translation Assistant for Patent Titles and Abstracts https://www3.wipo.int/patentscope/translate A. What is it? This tool is designed to give users the possibility to understand the content of
Factored Translation Models
Factored Translation s Philipp Koehn and Hieu Hoang [email protected], [email protected] School of Informatics University of Edinburgh 2 Buccleuch Place, Edinburgh EH8 9LW Scotland, United Kingdom
Application of Machine Translation in Localization into Low-Resourced Languages
Application of Machine Translation in Localization into Low-Resourced Languages Raivis Skadiņš 1, Mārcis Pinnis 1, Andrejs Vasiļjevs 1, Inguna Skadiņa 1, Tomas Hudik 2 Tilde 1, Moravia 2 {raivis.skadins;marcis.pinnis;andrejs;inguna.skadina}@tilde.lv,
TRANSREAD LIVRABLE 3.1 QUALITY CONTROL IN HUMAN TRANSLATIONS: USE CASES AND SPECIFICATIONS. Projet ANR 201 2 CORD 01 5
Projet ANR 201 2 CORD 01 5 TRANSREAD Lecture et interaction bilingues enrichies par les données d'alignement LIVRABLE 3.1 QUALITY CONTROL IN HUMAN TRANSLATIONS: USE CASES AND SPECIFICATIONS Avril 201 4
Adaptive Development Data Selection for Log-linear Model in Statistical Machine Translation
Adaptive Development Data Selection for Log-linear Model in Statistical Machine Translation Mu Li Microsoft Research Asia [email protected] Dongdong Zhang Microsoft Research Asia [email protected]
UM-Corpus: A Large English-Chinese Parallel Corpus for Statistical Machine Translation
UM-Corpus: A Large English-Chinese Parallel Corpus for Statistical Machine Translation Liang Tian 1, Derek F. Wong 1, Lidia S. Chao 1, Paulo Quaresma 2,3, Francisco Oliveira 1, Yi Lu 1, Shuo Li 1, Yiming
Convergence of Translation Memory and Statistical Machine Translation
Convergence of Translation Memory and Statistical Machine Translation Philipp Koehn and Jean Senellart 4 November 2010 Progress in Translation Automation 1 Translation Memory (TM) translators store past
HIERARCHICAL HYBRID TRANSLATION BETWEEN ENGLISH AND GERMAN
HIERARCHICAL HYBRID TRANSLATION BETWEEN ENGLISH AND GERMAN Yu Chen, Andreas Eisele DFKI GmbH, Saarbrücken, Germany May 28, 2010 OUTLINE INTRODUCTION ARCHITECTURE EXPERIMENTS CONCLUSION SMT VS. RBMT [K.
The Transition of Phrase based to Factored based Translation for Tamil language in SMT Systems
The Transition of Phrase based to Factored based Translation for Tamil language in SMT Systems Dr. Ananthi Sheshasaayee 1, Angela Deepa. V.R 2 1 Research Supervisior, Department of Computer Science & Application,
Quantifying the Influence of MT Output in the Translators Performance: A Case Study in Technical Translation
Quantifying the Influence of MT Output in the Translators Performance: A Case Study in Technical Translation Marcos Zampieri Saarland University Saarbrücken, Germany [email protected] Mihaela Vela
Domain Adaptation for Medical Text Translation Using Web Resources
Domain Adaptation for Medical Text Translation Using Web Resources Yi Lu, Longyue Wang, Derek F. Wong, Lidia S. Chao, Yiming Wang, Francisco Oliveira Natural Language Processing & Portuguese-Chinese Machine
The Prague Bulletin of Mathematical Linguistics NUMBER 93 JANUARY 2010 7 16
The Prague Bulletin of Mathematical Linguistics NUMBER 93 JANUARY 21 7 16 A Productivity Test of Statistical Machine Translation Post-Editing in a Typical Localisation Context Mirko Plitt, François Masselot
Building a Web-based parallel corpus and filtering out machinetranslated
Building a Web-based parallel corpus and filtering out machinetranslated text Alexandra Antonova, Alexey Misyurev Yandex 16, Leo Tolstoy St., Moscow, Russia {antonova, misyurev}@yandex-team.ru Abstract
Choosing the best machine translation system to translate a sentence by using only source-language information*
Choosing the best machine translation system to translate a sentence by using only source-language information* Felipe Sánchez-Martínez Dep. de Llenguatges i Sistemes Informàtics Universitat d Alacant
Cross-lingual Sentence Compression for Subtitles
Cross-lingual Sentence Compression for Subtitles Wilker Aziz and Sheila C. M. de Sousa University of Wolverhampton Stafford Street, WV1 1SB Wolverhampton, UK [email protected] [email protected]
The Prague Bulletin of Mathematical Linguistics NUMBER 96 OCTOBER 2011 49 58. Ncode: an Open Source Bilingual N-gram SMT Toolkit
The Prague Bulletin of Mathematical Linguistics NUMBER 96 OCTOBER 2011 49 58 Ncode: an Open Source Bilingual N-gram SMT Toolkit Josep M. Crego a, François Yvon ab, José B. Mariño c c a LIMSI-CNRS, BP 133,
Hybrid Machine Translation Guided by a Rule Based System
Hybrid Machine Translation Guided by a Rule Based System Cristina España-Bonet, Gorka Labaka, Arantza Díaz de Ilarraza, Lluís Màrquez Kepa Sarasola Universitat Politècnica de Catalunya University of the
Statistical Machine Translation Lecture 4. Beyond IBM Model 1 to Phrase-Based Models
p. Statistical Machine Translation Lecture 4 Beyond IBM Model 1 to Phrase-Based Models Stephen Clark based on slides by Philipp Koehn p. Model 2 p Introduces more realistic assumption for the alignment
The Impact of Morphological Errors in Phrase-based Statistical Machine Translation from English and German into Swedish
The Impact of Morphological Errors in Phrase-based Statistical Machine Translation from English and German into Swedish Oscar Täckström Swedish Institute of Computer Science SE-16429, Kista, Sweden [email protected]
The CMU-Avenue French-English Translation System
The CMU-Avenue French-English Translation System Michael Denkowski Greg Hanneman Alon Lavie Language Technologies Institute Carnegie Mellon University Pittsburgh, PA, 15213, USA {mdenkows,ghannema,alavie}@cs.cmu.edu
Translation Solution for
Translation Solution for Case Study Contents PROMT Translation Solution for PayPal Case Study 1 Contents 1 Summary 1 Background for Using MT at PayPal 1 PayPal s Initial Requirements for MT Vendor 2 Business
Statistical Machine Translation. Some of the content of this lecture is taken from previous lectures and presentations given by Philipp Koehn and Andy Way. Dr. Jennifer Foster, National Centre for Language
Machine Translation Evaluation. Why Evaluation? How good is a given system? Which one is the best system for our purpose? How much did we improve our system? How can we tune our system to become better?
Machine Translation Evaluation. Why Evaluation? How good is a given system? Which one is the best system for our purpose? How much did we improve our system? How can we tune our system to become better? But MT evaluation is a difficult problem!
Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast. Hassan Sawaf, Science Applications International Corporation (SAIC), 7990
Integration of human and machine translation. Marcello Federico, Fondazione Bruno Kessler. MT Marathon, Prague, Sept 2013. Motivation: Human translation (HT): worldwide demand for translation services has accelerated,
The Prague Bulletin of Mathematical Linguistics, NUMBER 93, JANUARY 2010, 37-46. Training Phrase-Based Machine Translation Models on the Cloud: Open Source Machine Translation Toolkit Chaski. Qin Gao, Stephan
Jane 2: Open Source Phrase-based and Hierarchical Statistical Machine Translation. Joern Wuebker, Matthias Huck, Stephan Peitz, Malte Nuhn, Markus Freitag, Jan-Thorsten Peter, Saab Mansour, Hermann Ney
Machine Translation for Human Translators. Michael Denkowski. CMU-LTI-15-004, Language Technologies Institute, School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213. www.lti.cs.cmu.edu
THUTR: A Translation Retrieval System. Chunyang Liu, Qi Liu, Yang Liu, and Maosong Sun; Department of Computer Science and Technology, State Key Lab on Intelligent Technology and Systems, National Lab for
Large-scale multiple language translation accelerator at the United Nations. Bruno Pouliquen, Cecilia Elizalde, Marcin Junczys-Dowmunt, Christophe Mazenc, José García-Verdugo; World Intellectual Property
A New Input Method for Human Translators: Integrating Machine Translation Effectively and Imperceptibly. Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015).
Parmesan: Meteor without Paraphrases with Paraphrased References. Petra Barančíková, Institute of Formal and Applied Linguistics, Charles University in Prague, Faculty of Mathematics and Physics, Malostranské
An Iteratively-Trained Segmentation-Free Phrase Translation Model for Statistical Machine Translation. Robert C. Moore, Chris Quirk; Microsoft Research, Redmond, WA 98052, USA. {bobmoore,chrisq}@microsoft.com
Segmentation and Punctuation Prediction in Speech Language Translation Using a Monolingual Translation System. Eunah Cho, Jan Niehues and Alex Waibel; International Center for Advanced Communication Technologies
Human in the Loop Machine Translation of Medical Terminology. John J. Morgan. ARL-MR-0743, April 2010. Approved for public release; distribution unlimited.
The "noisier channel": translation from morphologically complex languages. Christopher J. Dyer, Department of Linguistics, University of Maryland, College Park, MD 20742. [email protected] Abstract: This paper
Turker-Assisted Paraphrasing for English-Arabic Machine Translation. Michael Denkowski, Hassan Al-Haj and Alon Lavie; Language Technologies Institute, School of Computer Science, Carnegie Mellon University
On Statistical Machine Translation and Translation Theory. Christian Hardmeier, Uppsala University, Department of Linguistics and Philology, Box 635, 751 26 Uppsala, Sweden. [email protected] Abstract
On the practice of error analysis for machine translation evaluation. Sara Stymne, Lars Ahrenberg; Linköping University, Linköping, Sweden. {sara.stymne,lars.ahrenberg}@liu.se Abstract: Error analysis is a
Towards Application of User-Tailored Machine Translation in Localization. Andrejs Vasiļjevs, Tilde SIA, [email protected]; Raivis Skadiņš, Tilde SIA, [email protected]; Inguna Skadiņa, Tilde SIA, inguna.skadina@
The Prague Bulletin of Mathematical Linguistics, NUMBER 95, APRIL 2011, 5-18. A Guide to Jane, an Open Source Hierarchical Translation Toolkit. Daniel Stein, David Vilar, Stephan Peitz, Markus Freitag, Matthias
Convergence of Translation Memory and Statistical Machine Translation. Philipp Koehn, University of Edinburgh, 10 Crichton Street, Edinburgh, EH8 9AB, Scotland, United Kingdom. [email protected]; Jean Senellart
Applying Statistical Post-Editing to English-to-Korean Rule-based Machine Translation System. Ki-Young Lee and Young-Gil Kim; Natural Language Processing Team, Electronics and Telecommunications Research
A Joint Sequence Translation Model with Integrated Reordering. Nadir Durrani, Helmut Schmid and Alexander Fraser; Institute for Natural Language Processing, University of Stuttgart. Introduction: Generation
7-2 Speech-to-Speech Translation System Field Experiments All Over Japan. We explain field experiments conducted during the 2009 fiscal year in five areas of Japan. We also show the experiments of evaluation
Chapter 5: Phrase-based models. Statistical Machine Translation. Motivation: Word-Based Models translate words as atomic units; Phrase-Based Models translate phrases as atomic units. Advantages: many-to-many
PROMT Technologies for Translation and Big Data: Overview and Use Cases. Julia Epiphantseva, PROMT. About PROMT: EXPERIENCED, founded in 1991, one of the world's leading machine translation providers. DIVERSIFIED
LTC-Communicator: A web-based multilingual help desk. Nigel Goffe, The Language Technology Centre Ltd, Kingston upon Thames. Abstract: Software vendors operating in international markets face two problems:
Machine Translation and the Translator. Philipp Koehn, 8 April 2015. About me: Professor at Johns Hopkins University (US) and University of Edinburgh (Scotland); author of textbook on statistical machine translation
The United Nations Parallel Corpus v1.0. Michał Ziemski, Marcin Junczys-Dowmunt, Bruno Pouliquen; United Nations, DGACM, New York, United States of America; Adam Mickiewicz University, Poznań, Poland; World
Dual Subtitles as Parallel Corpora. Shikun Zhang, Wang Ling, Chris Dyer; Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA. [email protected], [email protected], [email protected]
SUMAT: Data Collection and Parallel Corpus Compilation for Machine Translation of Subtitles. Volha Petukhova, Rodrigo Agerri, Mark Fishel, Yota Georgakopoulou, Sergio Penkale, Arantza del Pozo
Report on the embedding and evaluation of the second MT pilot. Quality translation by deep language engineering approaches. DELIVERABLE D3.10, VERSION 1.6, 2015-11-02. QTLeap: Machine translation is a computational
System Combination for Machine Translation of Spoken and Written Language. Evgeny Matusov, Student Member, IEEE. IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 16, NO. 7, SEPTEMBER 2008, 1222.
IRIS - English-Irish Translation System. Mihael Arcan, Unit for Natural Language Processing of the Insight Centre for Data Analytics at the National University of Ireland, Galway. Introduction: about me,
Translating the Penn Treebank with an Interactive-Predictive MT System. Martha Alicia Rocha and Joan. IJCLA VOL. 2, NO. 1-2, JAN-DEC 2011, PP. 225-237. Received 31/10/10, accepted 26/11/10, final 11/02/11.
Statistical Machine Translation prototype using UN parallel documents. Bruno Pouliquen, Christophe Mazenc; World Intellectual Property. Proceedings of the 16th EAMT Conference, 28-30 May 2012, Trento, Italy.
Improving MT System Using Extracted Parallel Fragments of Text from Comparable Corpora. Rajdeep Gupta, Santanu Pal, Sivaji Bandyopadhyay; Department of Computer Science & Engineering, Jadavpur University, Kolkata
Customizing an English-Korean Machine Translation System for Patent Translation.* Sung-Kwon Choi, Young-Gil Kim; Natural Language Processing Team, Electronics and Telecommunications Research Institute,
Factored bilingual n-gram language models for statistical machine translation. Josep M. Crego, François Yvon. Mach Translat, DOI 10.1007/s10590-010-9082-5. Received: 2 November 2009 / Accepted: 12 June 2010
Domain-specific terminology extraction for Machine Translation. Mihael Arcan. Outline: PhD topic; introduction; resources; tools; multi-word expression (MWE) extraction; projection of MWE; evaluation; future work
Collecting Polish German Parallel Corpora in the Internet. Monika Rosińska. Proceedings of the International Multiconference on Computer Science and Information Technology, ISSN 1896-7094, pp. 285-292, 2007 PIPS.
Glossary of translation tool types (columns: tool type, description, French equivalent). Active terminology recognition tools; bilingual concordancers. Active terminology recognition (ATR) tools automatically analyze
Bridges Across the Language Divide (EU-BRIDGE), www.eu-bridge.eu. The goal: EU-BRIDGE aims to develop speech and machine translation capabilities that exceed the state-of-the-art
