Domain-specific terminology extraction for Machine Translation. Mihael Arcan

Similar documents

IRIS - English-Irish Translation System

Collaborative Machine Translation Service for Scientific texts

A Joint Sequence Translation Model with Integrated Reordering

Domain Adaptation for Medical Text Translation Using Web Resources

Adaptation to Hungarian, Swedish, and Spanish

ACCURAT Analysis and Evaluation of Comparable Corpora for Under Resourced Areas of Machine Translation Project no.

Convergence of Translation Memory and Statistical Machine Translation

The TCH Machine Translation System for IWSLT 2008

LIUM s Statistical Machine Translation System for IWSLT 2010

The Prague Bulletin of Mathematical Linguistics NUMBER 96 OCTOBER Ncode: an Open Source Bilingual N-gram SMT Toolkit

Machine Translation. Agenda

The KIT Translation system for IWSLT 2010

Semantics in Statistical Machine Translation

Wikipedia and Web document based Query Translation and Expansion for Cross-language IR

Leveraging ASEAN Economic Community through Language Translation Services

Translation and Localization Services

Improving MT System Using Extracted Parallel Fragments of Text from Comparable Corpora

TRANSREAD LIVRABLE 3.1 QUALITY CONTROL IN HUMAN TRANSLATIONS: USE CASES AND SPECIFICATIONS. Projet ANR CORD 01 5

Towards Using Web-Crawled Data for Domain Adaptation in Statistical Machine Translation

Cache-based Online Adaptation for Machine Translation Enhanced Computer Assisted Translation

Adapting General Models to Novel Project Ideas

Hybrid Machine Translation Guided by a Rule Based System

Parallel FDA5 for Fast Deployment of Accurate Statistical Machine Translation Systems

Integra(on of human and machine transla(on. Marcello Federico Fondazione Bruno Kessler MT Marathon, Prague, Sept 2013

The XMU Phrase-Based Statistical Machine Translation System for IWSLT 2006

SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 SYSTRAN 混合策略汉英和英汉机器翻译系 CWMT2011 技术报告

THUTR: A Translation Retrieval System

PBML. logo. The Prague Bulletin of Mathematical Linguistics NUMBER??? JANUARY Grammar based statistical MT on Hadoop

Computer Aided Translation

Human in the Loop Machine Translation of Medical Terminology

Chinese-Japanese Machine Translation Exploiting Chinese Characters

Domain Adaptation of Statistical Machine Translation using Web-Crawled Resources: A Case Study

HIERARCHICAL HYBRID TRANSLATION BETWEEN ENGLISH AND GERMAN

Question template for interviews

SYSTRAN 混合策略汉英和英汉机器翻译系统

The SYSTRAN Linguistics Platform: A Software Solution to Manage Multilingual Corporate Knowledge

An Online Service for SUbtitling by MAchine Translation

Automatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines

Effective Self-Training for Parsing

Machine Translation at the European Commission

Customizing an English-Korean Machine Translation System for Patent Translation *

Comparison of Data Selection Techniques for the Translation of Video Lectures

Glossary of translation tool types

An Online Service for SUbtitling by MAchine Translation

Interactive alignment of Parallel Texts a cross browser experience. (standards in practice)

Micro blogs Oriented Word Segmentation System

7-2 Speech-to-Speech Translation System Field Experiments in All Over Japan

Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection

Building a Web-based parallel corpus and filtering out machinetranslated

Arabic Recognition and Translation System

SYSTRAN v6 Quick Start Guide

Scaling Shrinkage-Based Language Models

Main Points. File layout Directory layout

Systematic Comparison of Professional and Crowdsourced Reference Translations for Machine Translation

Word Completion and Prediction in Hebrew

Statistical Machine Translation

Factored bilingual n-gram language models for statistical machine translation

2-3 Automatic Construction Technology for Parallel Corpora

The CMU Syntax-Augmented Machine Translation System: SAMT on Hadoop with N-best alignments

The University of Maryland Statistical Machine Translation System for the Fifth Workshop on Machine Translation

Modelling Pronominal Anaphora in Statistical Machine Translation

Phrase-Based MT. Machine Translation Lecture 7. Instructor: Chris Callison-Burch TAs: Mitchell Stern, Justin Chiu. Website: mt-class.

Overview of MT techniques. Malek Boualem (FT)

Language and Computation

Mining a Corpus of Job Ads

T U R K A L A T O R 1

D4.3: TRANSLATION PROJECT- LEVEL EVALUATION

7 Ways To Save On Legal Translations

Machine translation techniques for presentation of summaries

Visualizing Data Structures in Parsing-based Machine Translation. Jonathan Weese, Chris Callison-Burch

IBM WebSphere Operational Decision Management Improve business outcomes with real-time, intelligent decision automation

SCHOOL OF ENGINEERING AND INFORMATION TECHNOLOGIES GRADUATE PROGRAMS

Statistical Pattern-Based Machine Translation with Statistical French-English Machine Translation

UEdin: Translating L1 Phrases in L2 Context using Context-Sensitive SMT

Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features

Statistical Machine Translation prototype using UN parallel documents

Detecting Forum Authority Claims in Online Discussions

Collecting Polish German Parallel Corpora in the Internet

Audio Indexing on a Medical Video Database: the AVISON Project

Schema documentation for types1.2.xsd

70-299: Implementing and Administering Security in a Microsoft Windows Server 2003 Network (Corso MS-2823)

Getting Off to a Good Start: Best Practices for Terminology

XLIFF 2.0. David Filip Secretary & Editor OASIS XLIFF TC

Survey Results: Requirements and Use Cases for Linguistic Linked Data

A Comparative Study of Online Translation Services for Cross Language Information Retrieval

Dutch Parallel Corpus

UNSUPERVISED MORPHOLOGICAL SEGMENTATION FOR STATISTICAL MACHINE TRANSLATION

SemWeB Semantic Web Browser Improving Browsing Experience with Semantic and Personalized Information and Hyperlinks

Joint Research Centre

UM-Corpus: A Large English-Chinese Parallel Corpus for Statistical Machine Translation

Translating the Penn Treebank with an Interactive-Predictive MT System

A Study of Using an Out-Of-Box Commercial MT System for Query Translation in CLIR

Chapter 6. Decoding. Statistical Machine Translation

The Adjunct Network Marketing Model

HOW CAN I TRANSFORM AN ONLINE SERVICE TO TRULY MULTILINGUAL?

Deciphering Foreign Language

Real-Time Identification of MWE Candidates in Databases from the BNC and the Web

An Empirical Study on Web Mining of Parallel Data

Factored Translation Models

Document downloaded from: This paper must be cited as:

Transcription:

Domain-specific terminology extraction for Machine Translation Mihael Arcan

Outline Phd topic Introduction Resources Tools Multi Word Extraction (MWE) extraction Projection of MWE Evaluation Future Work Conclusion

Phd topic Translating of domain-specific terms Issue: Out-Of-Vocabulary, specific structure of terms, OOV brings along ambiguity Possible solutions: Generating new parallel resources Using specific vocabulary and embed it correctly into the SMT system

Introduction Domain-specific terminology extraction for MT How to extract domain-specific terminology? what is a domain-specific term? what is a domain? How to embed the extracted terminology into SMT? how much data we need to improve MT using existing resources adding terminology to general resources

Introduction Permission is hereby granted, free of charge, to any person obtaining a copy of the Unicode data files to deal in them without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, and/or sell copies. A chiunque ottenga una copia dei file di dati Unicode è concesso il permesso, senza oneri, di distribuirli senza restrizioni, ivi incluso senza limitazione dei diritti di uso, copia, modifica, unione, pubblicazione, distribuzione e/o vendita di copie. KDE is available free of charge, but costs are incurred and assets formed in its creation. Thus, the KDE community...

JRC Acquis Resources http://ipsc.jrc.ec.europa.eu/index.php?id=198 Gnome https://l10n.gnome.org 66 documents (10 dev / 56 test set) 166 sentences for extracting keyphrases KDE4 http://opus.lingfil.uu.se/ Top 1000 sentences with most MWE overlap

kx Toolkit 1 for Multilingual MWE extraction kx Toolkit: n-gram based extraction POS annotation tf-idf information (Parallel) corpus for the source and target language In-domain data (gnome in English and Italian) Out-domain data (JRC Acquis in English and Italian) (1) http://dl.acm.org/citation.cfm?id=1859700

kx Toolkit 1 for Multilingual MWE extraction English keyphrase extraction: accelerates, activation requests, active applications, active, antialiasing, application manager, available zoom levels, desktop, detailed metadata list, development packages,... Italian keyphrase extraction: accelera, richiesta di attivazione, applicazioni attive, attivo, antialiasing, gestore applicazioni, livelli di ingrandimento disponibili, elenco dettagliato dei metadati, pacchetti di sviluppo,...

kx keyphrase extraction - Evaluation Annotation of 10 gnome files A {1 string} specifying how parts of overlong {2 file names} should be replaced by {3 ellipses}, depending on the {4 zoom level}. A {1 string} specifying how parts of overlong {2 file names} should be replaced by ellipses, depending on the {3 zoom level}. F-score 0.38 (p0.35/r0.42) for English and 0.40 (p0.46/r0.35) for Italian

Projection Options Word alignment Giza++ PB-SMT alignment Moses decoder SRILM Toolkit

Word Alignment Projection Options Option 1: Word Alignment Option 2: Option 1 + Contiguous Span Option 3: Option 1 + Sentence Lookup Option 4: Option 1 + kx lookup

Word Alignment Projection Options Option 1: Word Alignment titlebar-font-size option opzione titlebar-font-size size used dimensione usata email systems sistemi di email executable text files file testo eseguibili *order of subpixel elements ordine elementi subpixel

Word Alignment Projection Options Option 2: Option 1 + Contiguous Span order of subpixel elements ordine degli elementi subpixel protection of personal data protezione dati personali

Word Alignment Projection Options Option 3: Option 1 + Sentence Lookup error errore embed incorporata straight-line method of depreciation metodo di ammortamento lineare given number of lines numero di righe indicato

Word Alignment Projection Options Option 4: Option 1 + kx lookup specified zoom level livello di ingrandimento specificato activation requests richiesta di attivazione webmmux webmmux shortcuts *predefinite (scorciatoia, but we get default)

PB-SMT Projection Options Option 1: PB-SMT best translation Option 2: Option 1 + Sentence Lookup Option 3: Option 1 + kx Lookup Option 4: n-best PB-SMT + Sentence Lookup Option 5: n-best PB-SMT + kx Lookup (n=10)

Bilingual Projections In-domain Excluded by Out Word Alignment (WA) 356 190 WA + Contiguous Span 352 190 WA + Sentence lookup 284 186 WA + kx lookup 115 29 PB-SMT 1015 126 PB-SMT + Sent. lookup 549 42 PB-SMT + kx lookup 143 20 Nbest PB-SMT + Sent. 549 80 Nbest PB-SMT + kx lookup. 157 46

Bilingual Projections In-domain Precision Word Alignment (WA) 356 0.75 WA + Contiguous Span 352 0.65 WA + Sentence lookup 284 0.8 WA + kx lookup 115 0.85 PB-SMT 1015 0.35 PB-SMT + Sent. lookup 549 0.7 PB-SMT + kx lookup 143 0.9 Nbest PB-SMT + Sent. 549 0.7 Nbest PB-SMT + kx lookup. 157 0.9

Embedding Terminology into SMT JRC Acquis Acquis + Gnome parallel data Acquis + kx keyphrases Xml markup Xml markup with CacheBased LM JRC Acquis Cache Based TM + Cache Based LM

Translation Models- Evaluation BLEU JRC Acquis 14.62 XML Markup WA with kx lookup 15.23 XML Markup WA with sentence 13.39 XML Markup WA with contiguous span 11.99 XML Markup n-best PB-SMT w. kx lookup 14.94 XML Markup n-best PB-SMT w. sentence lookup 14.69 XML Markup n-best PB-SMT without proving 14.31

Translation Models- Evaluation BLEU JRC Acquis 14.62 CBXL Markup WA with kx lookup 15.26 CBXL Markup WA with sentence 14.29 CBXL Markup WA with contiguous span 14.53 CBXL Markup n-best PB-SMT w. kx lookup 14.93 CBXL Markup n-best PB-SMT w. sentence lookup 14.61 CBXL Markup n-best PB-SMT without proving 14.14

Future Work Analyzing the results Applying the experiment to English->French and English->German Extended bilingual evaluation Experiments on other domains Additional experiments on close domains Using comparable data for keyphrase extraction (e.g. Wikipedia)

Conclusion Extraction of domain-specific MWE Bilingual Projection of MWE Novel results of embedding the terminological knowledge into SMT

Thank you