Convergence of Translation Memory and Statistical Machine Translation

Philipp Koehn and Jean Senellart
4 November 2010

Progress in Translation Automation

Translation Memory (TM)
- translators store past translations in a database
- when translating new text, consult the database for similar segments
- fuzzy match score defines similarity
- widely used by translation agencies

Statistical Machine Translation (SMT)
- collect large quantities of translated text
- automatically extract probabilistic translation rules
- when translating new text, find the most probable translation given the rules
- wide use of free web-based services
- not yet used by many translation agencies

TM vs. SMT

TM                                         SMT
used by human translator                   used by target language information seeker
restricted domain (e.g. product manual)    open domain translation (e.g. news)
very repetitive content                    huge diversity (esp. web)
corpus size: 1 million words               corpus size: 100-1000 million words
commercial developers (e.g., SDL Trados)   academic/commercial research (e.g., Google)

Our Goal

Better TM using SMT methods

Main Idea

Input: The second paragraph of Article 21 is deleted.
Fuzzy match in translation memory: The second paragraph of Article 5 is deleted.

- part of the translation comes from the TM fuzzy match
- part of the translation comes from SMT

The second paragraph of Article 21 is deleted.

Related Work

Work inspired by EBMT
- [Smith and Clark, 2009], [Zhechev and van Genabith, 2010]
- uses syntactic information in alignment
- lower performance than reported here

Encode fuzzy match as rule with gaps
- [Biçici and Dymetman, 2008]
- similar to our second method
- impressive improvements, but weak baseline SMT

Two Solutions

- XML frames
- Very large hierarchical rules

Example

Input sentence:
  The second paragraph of Article 21 is deleted.

Fuzzy match in translation memory:
  The second paragraph of Article 5 is deleted.
  = À l' article 5, le texte du deuxième alinéa est supprimé.

Processing:
- detect mismatch (string edit distance): "21" vs. "5"
- align mismatch (using word alignment from GIZA++)
- output word(s) taken from the target TM: "À l' article" and ", le texte du deuxième alinéa est supprimé."
- input word(s) that still need to be translated by SMT: "21"

XML frame (input to Moses):
  <xml translation="À l' article"/> 21 <xml translation=", le texte du deuxième alinéa est supprimé."/>

More compact formalism for the purposes of this presentation:
  <À l' article> 21 <, le texte du deuxième alinéa est supprimé.>

Steps

Fuzzy matching (straightforward)
- based on string edit distance on words
  \[ \mathrm{FMS} = 1 - \frac{\text{edit-distance}(\text{source}, \text{tm-source})}{\max(|\text{source}|, |\text{tm-source}|)} \]
- string edit distance on letters as tie breaker
- for details see [Koehn and Senellart, AMTA 2010]

Word alignment of TM source and target (standard method)

Construction of XML frame (can be tricky)
- = linking mismatch(input, TM source) to TM target
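
The fuzzy match score can be traced with a short Python sketch. It assumes whitespace tokenization and a plain Levenshtein distance; the function names (`edit_distance`, `fuzzy_match_score`, `best_fuzzy_match`) are illustrative and not taken from the paper, and the letter-level tie breaker is implemented here as a letter-level FMS, which is one possible reading of the slide.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (words or letters)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def fuzzy_match_score(source, tm_source):
    """FMS = 1 - edit-distance / max(|source|, |tm-source|)."""
    longest = max(len(source), len(tm_source))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(source, tm_source) / longest


def best_fuzzy_match(source, tm_sources):
    """Pick the TM source with the highest word-level FMS.

    Assumption: ties are broken by the same score computed on letters.
    """
    def key(tm_source):
        word_score = fuzzy_match_score(source.split(), tm_source.split())
        letter_score = fuzzy_match_score(list(source), list(tm_source))
        return (word_score, letter_score)
    return max(tm_sources, key=key)


if __name__ == "__main__":
    src = "The second paragraph of Article 21 is deleted ."
    tm = "The second paragraph of Article 5 is deleted ."
    print(round(fuzzy_match_score(src.split(), tm.split()), 3))
```

With nine tokens per sentence and one substituted word, the example pair scores 1 - 1/9, which falls in the 80-89% fuzzy match range.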

Construction of XML Frame

Basic principles
- start with fully specified XML frame
- all mismatched source words have to be translated by SMT
- all TM target words aligned to mismatched TM source words are removed
- if the alignment to the TM target words fails, go to the previous TM source word and follow its alignment

See paper for algorithm

Example

Source:     The second paragraph of Article 21 is deleted.
            (string edit)
TM Source:  The second paragraph of Article 5 is deleted.
            (word alignment)
TM Target:  À l' article 5, le texte du deuxième alinéa est supprimé.
XML Frame:  <À l' article> 21 <, le texte du deuxième alinéa est supprimé.>

Special Case: Insertion

Source:     the big fish
TM Source:  the fish
TM Target:  les poissons
XML Frame:  <les> big <poissons>

Special Case: Deletion

Source:     the fish
TM Source:  the big fish
TM Target:  les gros poissons
XML Frame:  <les> <poissons>

Special Case: Unaligned Mismatch

Source:     the green fish
TM Source:  the big fish
TM Target:  les poissons
XML Frame:  <les> green <poissons>

Special Case: Disconnected Alignments

Source:     joe will eat
TM Source:  joe does not eat
TM Target:  joe ne mange pas
XML Frame:  <joe> will <mange>
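
The basic principles and the special cases above can be reproduced with a simplified Python sketch. This is not the algorithm from the paper: it assumes pre-tokenized input, takes the word alignment as a hypothetical set of (TM source index, TM target index) pairs, and uses `difflib.SequenceMatcher` opcodes as a stand-in for the string edit alignment (these opcodes are not guaranteed to be a minimal edit script). The names `build_frame` and `render` are illustrative.

```python
from difflib import SequenceMatcher


def build_frame(source, tm_source, tm_target, alignment):
    """Return a list of ("tm", tokens) and ("smt", tokens) segments.

    alignment: set of (tm_source_index, tm_target_index) pairs (assumed input).
    """
    removed = set()   # TM target positions aligned to mismatched TM source words
    inserts = {}      # TM target position -> input words left for SMT
    matcher = SequenceMatcher(a=tm_source, b=source, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        aligned = sorted(t for s, t in alignment if i1 <= s < i2)
        removed.update(aligned)
        if j1 == j2:
            continue  # deletion: nothing left for SMT to translate
        if aligned:
            pos = aligned[0]
        else:
            # Alignment fails: walk back to the previous aligned TM source
            # word and place the SMT words right after its target word.
            pos = 0
            for s in range(i1 - 1, -1, -1):
                prev = [t for s2, t in alignment if s2 == s]
                if prev:
                    pos = max(prev) + 1
                    break
        inserts.setdefault(pos, []).extend(source[j1:j2])

    frame, current = [], []
    for t in range(len(tm_target) + 1):
        if (t in inserts or t in removed) and current:
            frame.append(("tm", current))   # close the running TM segment
            current = []
        if t in inserts:
            frame.append(("smt", inserts[t]))
        if t < len(tm_target) and t not in removed:
            current.append(tm_target[t])
    if current:
        frame.append(("tm", current))
    return frame


def render(frame):
    """Render in the compact <...> notation used in this presentation."""
    return " ".join(
        "<" + " ".join(toks) + ">" if kind == "tm" else " ".join(toks)
        for kind, toks in frame
    )


if __name__ == "__main__":
    frame = build_frame("joe will eat".split(), "joe does not eat".split(),
                        "joe ne mange pas".split(),
                        {(0, 0), (2, 1), (2, 3), (3, 2)})
    print(render(frame))
```

The sketch places the SMT words at the leftmost removed TM target position, which is a simplification chosen to match the four special cases; real alignments can call for a different placement.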

Experiments

- Baseline 1: Unmodified TM matches
- Baseline 2: SMT system trained on TM data
- Our XML frame method

Corpora: Size

Acquis           Corpus        Test
segments         1,165,867     4,107
English words    24,069,452    129,261
French words     25,533,259    135,224

Product          Corpus        Test
segments         83,461        2,000
English words    1,038,762     24,643
French words     1,110,284     26,248

Corpora: Fuzzy Matches

Acquis     Sentences    Words     W/S
100%       1395         14,559    10.4
90-99%     433          12,775    29.5
80-89%     154          5,347     34.7
70-79%     250          6,767     27.1

Product    Sentences    Words     W/S
95-99%     230          3,006     13.1
90-94%     225          2,968     13.2
85-89%     177          2,000     11.3
80-84%     185          1,950     10.5
75-79%     152          1,350     8.9
70-74%     98           987       10.1

Results: Acquis

Results: Product

Recap

- TM provides fuzzy matches
- SMT translates mismatched words
- TM match encoded in XML frame
- ... but is that not just a very large translation rule?

Background: Hierarchical Phrase Rules

Given: sentence pair with monotone 1-to-1 alignment
  the big fish = les gros poissons

Phrase translation rules
  ( the ; les )
  ( the big ; les gros )
  ( the big fish ; les gros poissons )
  ( big ; gros )
  ( big fish ; gros poissons )
  ( fish ; poissons )

Hierarchical phrase-based rules are constructed by subtraction
- large rule: ( the big fish ; les gros poissons )
- small rule: ( big ; gros ) (contained in large rule)
- hierarchical rule: ( the x fish ; les x poissons )
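
Rule construction by subtraction can be illustrated with a minimal Python sketch. It assumes exactly the setting of the slide, a monotone 1-to-1 alignment (source word i aligns to target word i), and allows only one gap per rule; general extraction from arbitrary alignments needs consistency checks that are omitted here. The function names are illustrative.

```python
def phrase_pairs(src, tgt):
    """All contiguous phrase pairs under a monotone 1-to-1 alignment."""
    n = len(src)
    return [(" ".join(src[i:j]), " ".join(tgt[i:j]))
            for i in range(n) for j in range(i + 1, n + 1)]


def hierarchical_rules(src, tgt, gap="x"):
    """Subtract a contained small rule from a large rule, leaving one gap."""
    n = len(src)
    rules = set()
    for i in range(n):
        for j in range(i + 1, n + 1):            # large rule covers [i, j)
            for k in range(i, j):
                for m in range(k + 1, j + 1):    # small rule covers [k, m)
                    if (k, m) == (i, j):
                        continue                 # would leave only the gap
                    lhs = src[i:k] + [gap] + src[m:j]
                    rhs = tgt[i:k] + [gap] + tgt[m:j]
                    rules.add((" ".join(lhs), " ".join(rhs)))
    return rules


if __name__ == "__main__":
    src, tgt = "the big fish".split(), "les gros poissons".split()
    for pair in phrase_pairs(src, tgt):
        print(pair)
    print(("the x fish", "les x poissons") in hierarchical_rules(src, tgt))
```

For the three-word example this yields the six phrase pairs listed above, and the rule ( the x fish ; les x poissons ) appears among the hierarchical rules.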

XML Frame as Very Large Rule

XML frame
  <À l' article> 21 <, le texte du deuxième alinéa est supprimé.>
for input
  The second paragraph of Article 21 is deleted.

Very large rule
  ( The second paragraph of Article x is deleted. ; À l' article x, le texte du deuxième alinéa est supprimé. )
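
As a small illustration of this correspondence, the following Python sketch turns a TM entry into a rule with one gap. It assumes the mismatched source span and its aligned target span are already known as token index ranges; the function name `very_large_rule` and the span arguments are hypothetical.

```python
def very_large_rule(tm_source, tm_target, src_span, tgt_span, gap="x"):
    """Replace the mismatched span on both sides of a TM entry with a gap.

    src_span and tgt_span are (start, end) token index ranges, end exclusive.
    """
    (s1, s2), (t1, t2) = src_span, tgt_span
    lhs = tm_source[:s1] + [gap] + tm_source[s2:]
    rhs = tm_target[:t1] + [gap] + tm_target[t2:]
    return " ".join(lhs), " ".join(rhs)


if __name__ == "__main__":
    tm_source = "The second paragraph of Article 5 is deleted .".split()
    tm_target = "À l' article 5 , le texte du deuxième alinéa est supprimé .".split()
    # "5" is token 5 on the source side and token 3 on the target side.
    print(very_large_rule(tm_source, tm_target, (5, 6), (3, 4)))
```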

Very Large Rules in SMT

Rule size limited in SMT
- maximum number of words, e.g. 5
- maximum number of non-terminals (x), e.g. 2
- ... but only due to storage limitations for large rule tables

Rules may be generated on the fly [Lopez, 2007]

Advantage over XML Method

Choices
1. multiple fuzzy matches in TM with same score
2. same TM source with multiple translations
3. multiple SMT translations

Decisions in XML frame method
1. randomly chosen
2. most frequent
3. highest model score (tried others, see paper)

Decisions for very large rules
1. all
2. all
3. integrated scoring of very large rules and basic translation rules (tunable)
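
The first two decisions of the XML frame method can be sketched in a few lines of Python. The candidate format, a list of (score, TM source, TM target) triples, and the function name `choose_tm_entry` are assumptions made for illustration; the third decision is left to the SMT decoder and is not modeled.

```python
import random
from collections import Counter


def choose_tm_entry(candidates, rng=random):
    """candidates: list of (fms, tm_source, tm_target) triples."""
    best = max(score for score, _, _ in candidates)
    top_sources = sorted({src for score, src, _ in candidates if score == best})
    source = rng.choice(top_sources)           # decision 1: random among ties
    targets = Counter(tgt for _, src, tgt in candidates if src == source)
    target, _count = targets.most_common(1)[0]  # decision 2: most frequent
    return source, target


if __name__ == "__main__":
    candidates = [
        (0.9, "the big fish", "les gros poissons"),
        (0.9, "the big fish", "le gros poisson"),
        (0.9, "the big fish", "les gros poissons"),
        (0.8, "the fish", "les poissons"),
    ]
    print(choose_tm_entry(candidates, random.Random(0)))
```

The very large rule method avoids both picks by passing every tied match and every translation to the decoder as competing rules.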

Result: Acquis

Future Work: User Studies

- Significant increases in BLEU
- To do: validation in user studies
- Additional benefit: possible to highlight mismatch in translation

input:                  The second paragraph of Article 21 is deleted.
suggested translation:  À l' article 21, le texte du deuxième alinéa est supprimé.

Thank You

Questions?