Machine Translation and the Translator

Machine Translation and the Translator
Philipp Koehn
8 April 2015

About me
Professor at Johns Hopkins University (US) and University of Edinburgh (Scotland).
Author of textbook on statistical machine translation.
Leading development of the open source Moses toolkit: developed since 2006; reference implementation of state-of-the-art methods; used in academia as benchmark and testbed; extensive commercial deployment (20% of MT market).

Recent Projects
Speech translation.
Computer aided translation: development of an open source toolkit tightly integrated with machine translation; novel types of assistance for translators; adaptation of machine translation to user needs.
Open source infrastructure (MOSES CORE).

how good is machine translation?

Machine Translation: Chinese

Machine Translation: French

Quality
HTER assessment on a scale from 0% to 50%. Quality levels, in order of increasing HTER: publishable, editable, gistable, triagable.
(scale developed in preparation of DARPA GALE programme)

Applications
HTER assessment and application examples, in order of increasing HTER:
0%: seamless bridging of language divide
publishable: automatic publication of official announcements
10%, editable: increased productivity of human translators
20%: access to official publications; multi-lingual communication (chat, social networks)
30%, gistable: information gathering; trend spotting
40%, triagable: identifying relevant documents
50%

Current State of the Art
HTER assessment by language pair and domain, in order of increasing HTER:
0%, publishable: French-English restricted domain
10%: French-English news stories
editable: German-English news stories
20%: Chinese-English news stories; English-Czech open domain
30%, gistable: English-Japanese open domain
40%, triagable
50%
(informal rough estimates by presenter)

big picture

A Clear Plan
[Figure: translation pyramid between Source and Target. Analysis leads up the source side and Generation leads down the target side; the transfer levels, from bottom to top, are Lexical Transfer, Syntactic Transfer, Semantic Transfer, and Interlingua.]

Learning from Data
[Figure: foreign/English parallel text goes through statistical analysis to give the Translation Model; English text goes through statistical analysis to give the Language Model; both models feed the Decoding Algorithm.]

Finding the Best Translation
$e_{\text{BEST}} = \operatorname{argmax}_e \; p(e \mid f)$
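
The search for the best translation can be sketched in a few lines of Python. This is only an illustration, assuming Bayes' rule is used to split p(e | f) into a translation model p(f | e) and a language model p(e), in line with the two models named on the preceding slide. The candidate list and all probabilities are invented toy values, and `best_translation` is a hypothetical helper, not a function of any toolkit.

```python
import math

def best_translation(candidates, tm_prob, lm_prob):
    """Return the candidate e that maximizes log p(f|e) + log p(e).

    By Bayes' rule p(e|f) is proportional to p(f|e) * p(e); the
    denominator p(f) is the same for every candidate, so it is dropped.
    """
    def score(e):
        return math.log(tm_prob[e]) + math.log(lm_prob[e])
    return max(candidates, key=score)

# Invented toy numbers for one fixed foreign sentence f.
candidates = ["he goes home", "he go home", "home he goes"]
tm_prob = {"he goes home": 0.20, "he go home": 0.25, "home he goes": 0.20}     # p(f|e)
lm_prob = {"he goes home": 0.010, "he go home": 0.001, "home he goes": 0.002}  # p(e)

best = best_translation(candidates, tm_prob, lm_prob)
print(best)
```

A real decoder cannot enumerate all candidates like this; it searches a very large space of possible translations.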

why is that a good plan?

Word Translation Problems
Words are ambiguous:
He deposited money in a bank account with a high interest rate.
Sitting on the bank of the Mississippi, a passing ship piqued his interest.
How do we find the right meaning, and thus translation?
Context should be helpful.

Phrase Translation Problems
Idiomatic phrases are not compositional:
It's raining cats and dogs.
Es schüttet aus Eimern. (It pours from buckets.)
How can we translate such larger units?

Syntactic Translation Problems
Languages have different sentence structure:
das behaupten sie wenigstens
Word glosses: das = this / the; behaupten = claim; sie = they / she; wenigstens = at least.
Convert from object-verb-subject (OVS) to subject-verb-object (SVO).
Ambiguities can be resolved through syntactic analysis:
the meaning "the" of das is not possible (not a noun phrase);
the meaning "she" of sie is not possible (subject-verb agreement).

Semantic Translation Problems
Pronominal anaphora:
I saw the movie and it is good.
How to translate "it" into German (or French)?
"it" refers to "movie"; "movie" translates to Film; Film has masculine gender; ergo: "it" must be translated into the masculine pronoun er.
We are not handling this very well [Le Nagard and Koehn, 2010].

Semantic Translation Problems
Coreference:
Whenever I visit my uncle and his daughters, I can't decide who is my favorite cousin.
How to translate "cousin" into German? Male or female?
Complex inference required.

No Single Right Answer
Israeli officials are responsible for airport security.
Israel is in charge of the security at this airport.
The security work for this airport is the responsibility of the Israel government.
Israeli side was in charge of the security of this airport.
Israel is responsible for the airport's security.
Israel is responsible for safety work at this airport.
Israel presides over the security of the airport.
Israel took charge of the airport security.
The safety of this airport is taken charge of by Israel.
This airport's security is the responsibility of the Israeli security officials.

Learning from Data
What is the best translation? Counts in European Parliament corpus:
Sicherheit → security: 14,516
Sicherheit → safety: 10,015
Sicherheit → certainty: 334
Phrasal rules:
Sicherheitspolitik → security policy: 1,580
Sicherheitspolitik → safety policy: 13
Sicherheitspolitik → certainty policy: 0
Lebensmittelsicherheit → food security: 51
Lebensmittelsicherheit → food safety: 1,084
Lebensmittelsicherheit → food certainty: 0
Rechtssicherheit → legal security: 156
Rechtssicherheit → legal safety: 5
Rechtssicherheit → legal certainty: 723
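
One way to turn such counts into translation probabilities is relative frequency estimation: divide each count by the total for its source word. The slide does not state the estimator, so treat that choice as an assumption. The sketch below uses a subset of the counts from the slide; `translation_probs` is a hypothetical helper name.

```python
from collections import defaultdict

# Counts copied from the slide (European Parliament corpus).
counts = {
    ("Sicherheit", "security"): 14516,
    ("Sicherheit", "safety"): 10015,
    ("Sicherheit", "certainty"): 334,
    ("Lebensmittelsicherheit", "food security"): 51,
    ("Lebensmittelsicherheit", "food safety"): 1084,
    ("Lebensmittelsicherheit", "food certainty"): 0,
    ("Rechtssicherheit", "legal security"): 156,
    ("Rechtssicherheit", "legal safety"): 5,
    ("Rechtssicherheit", "legal certainty"): 723,
}

def translation_probs(counts):
    """Relative frequency estimate: count(src, tgt) / sum of counts for src."""
    totals = defaultdict(int)
    for (src, _), c in counts.items():
        totals[src] += c
    probs = defaultdict(dict)
    for (src, tgt), c in counts.items():
        probs[src][tgt] = c / totals[src]
    return probs

probs = translation_probs(counts)
for src, table in probs.items():
    best = max(table, key=table.get)
    print(f"{src} -> {best} ({table[best]:.2f})")
```

The same component, Sicherheit, ends up with a different preferred translation depending on the compound it appears in, which is the point of the phrasal rules.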

better models

Phrase-Based Model
natürlich hat John Spaß am Spiel
of course John has fun with the game
Foreign input is segmented in phrases.
Each phrase is translated into English.
Phrases are reordered.
Workhorse of today's statistical machine translation.
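
The three steps (segment, translate, reorder) can be illustrated with a toy sketch. The phrase table below is a guess at how the example sentence might be segmented, the greedy longest-match segmentation is a simplification, and the reordering is supplied by hand; a real decoder scores many segmentations and orderings with its models.

```python
# Hypothetical phrase table for the slide's example sentence.
PHRASE_TABLE = {
    ("natürlich",): "of course",
    ("hat",): "has",
    ("John",): "John",
    ("Spaß", "am"): "fun with the",
    ("Spiel",): "game",
}

def segment(words, table, max_len=3):
    """Greedy left-to-right segmentation into the longest known phrases."""
    phrases, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = tuple(words[i:i + n])
            if phrase in table:
                phrases.append(phrase)
                i += n
                break
        else:
            # No phrase of any length starts at position i.
            raise KeyError(f"no phrase covers {words[i]!r}")
    return phrases

def translate(sentence, table, order):
    """Segment, translate each phrase, then emit phrases in the given order."""
    phrases = segment(sentence.split(), table)
    translated = [table[p] for p in phrases]
    return " ".join(translated[i] for i in order)

# The order swaps the second and third phrases ("hat" and "John").
result = translate("natürlich hat John Spaß am Spiel", PHRASE_TABLE, order=[0, 2, 1, 3, 4])
print(result)
```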

Synchronous Grammar Rules
Nonterminal rules: NP → DET_1 NN_2 JJ_3 | DET_1 JJ_3 NN_2
Terminal rules: N → maison | house; NP → la maison bleue | the blue house
Mixed rules: NP → la maison JJ_1 | the JJ_1 house
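
A synchronous rule pairs a source-side sequence with a target-side sequence over the same indexed nonterminals. The sketch below applies the nonterminal NP rule from the slide to "la maison bleue". The data layout, the `apply_rule` helper, and the lexicon entries for "la" and "bleue" are assumptions made for illustration; the noun is also tagged NN here to match the NP rule.

```python
# Terminal rules: (category, French word) -> English word.
# Only the entry for "maison" is backed by the slide; the others are assumed.
LEXICON = {
    ("DET", "la"): "the",
    ("NN", "maison"): "house",
    ("JJ", "bleue"): "blue",
}

# NP -> DET_1 NN_2 JJ_3 | DET_1 JJ_3 NN_2, stored as the source-side
# category sequence plus the target-side order of the (1-based) indices.
NP_RULE = {"source": ["DET", "NN", "JJ"], "target_order": [1, 3, 2]}

def apply_rule(rule, words, lexicon):
    """Translate `words` with one synchronous rule plus terminal rules."""
    if len(words) != len(rule["source"]):
        raise ValueError("rule does not match input length")
    # Translate each source child with its terminal rule.
    children = [lexicon[(cat, w)] for cat, w in zip(rule["source"], words)]
    # Emit the children in target-side order.
    return " ".join(children[i - 1] for i in rule["target_order"])

np_translation = apply_rule(NP_RULE, ["la", "maison", "bleue"], LEXICON)
print(np_translation)
```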

Learning Rules
[Figure: English parse tree (nodes S, VP, PP, NP; tags PRP MD VB VBG RP TO PRP DT NNS) word-aligned to the German sentence.]
English: I shall be passing on to you some comments
German: Ich werde Ihnen die entsprechenden Anmerkungen aushändigen
Extracted rule: VP → X_1 X_2 aushändigen | passing on PP_1 NP_2

Syntax Decoding
[Figure: decoding a German sentence into an English syntax tree (nodes S, VP, NP, PP).]
German input with part-of-speech tags: Sie/PPER will/VAFIN eine/ART Tasse/NN Kaffee/NN trinken/VVINF
English output (tree leaves): she/PRO wants/VBZ to/TO drink/VB a/DET cup/NN of/IN coffee/NN

New State of the Art
Good results for German-English [WMT 2014]:
language pair: syntax preferred
German-English: 57%
English-German: 55%
Mixed for other language pairs:
language pair: syntax preferred
Czech-English: 44%
Russian-English: 44%
Hindi-English: 54%
Also very successful for Chinese-English.

better machine learning

Sparse Data
Statistical estimation often suffers from sparse data.
Zipf's law: most words are extremely rare; $\text{frequency} \times \text{rank} = \text{constant}$.
[Figure: plot of frequency against rank.]
Statistics from Europarl: "the" occurs 1,929,379 times; large tail of words that occur once: 33,447 words, for instance cornflakes, mathematicians, or Tazhikhistan.
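
The rank-frequency relation and the tail of words that occur once can be inspected on any tokenized text. The sketch below is a minimal illustration on a tiny invented sample; the regularity only shows up clearly on a large corpus such as Europarl. The function names are hypothetical.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows

def singletons(tokens):
    """Words that occur exactly once (the long tail)."""
    return sorted(w for w, c in Counter(tokens).items() if c == 1)

# Tiny invented sample, far too small to exhibit Zipf's law properly.
tokens = "the cat saw the dog and the dog saw the mathematicians".split()
rows = rank_frequency(tokens)
for row in rows:
    print(row)
print(singletons(tokens))
```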

Brown Clusters
Main idea: share evidence with similar words.
Cluster words to reduce sparsity. Example clusters:
presented, pursued, aired, commissioned, published
the, these, that, this
laconic, pompous, melancholic, bouncy, incompletable
message, lesson, letter, counterfactuals, stunner
For instance, use in language modeling:
$p(\text{cluster}(\textit{message}) \mid \text{cluster}(\textit{presented}), \text{cluster}(\textit{the}), \text{cluster}(\textit{laconic}))$
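
A language model over cluster ids rather than words can be sketched as follows. The cluster names (`C_VERB` and so on) are invented labels, since Brown clusters carry no names, and grouping the example words into four clusters is one reading of the slide. The estimate is a plain relative frequency without smoothing, which is a simplifying assumption.

```python
from collections import Counter

# Hypothetical cluster ids for the example words on the slide.
CLUSTER = {}
for cid, words in {
    "C_VERB": ["presented", "pursued", "aired", "commissioned", "published"],
    "C_DET": ["the", "these", "that", "this"],
    "C_ADJ": ["laconic", "pompous", "melancholic", "bouncy", "incompletable"],
    "C_NOUN": ["message", "lesson", "letter", "counterfactuals", "stunner"],
}.items():
    for w in words:
        CLUSTER[w] = cid

def cluster_ngrams(sentences, n=4):
    """Count n-grams and their (n-1)-gram histories over cluster ids."""
    ngrams, histories = Counter(), Counter()
    for sent in sentences:
        ids = [CLUSTER[w] for w in sent.split()]
        for i in range(len(ids) - n + 1):
            gram = tuple(ids[i:i + n])
            ngrams[gram] += 1
            histories[gram[:-1]] += 1
    return ngrams, histories

def prob(word, history, ngrams, histories):
    """Relative frequency estimate of p(cluster(word) | clusters of history)."""
    h = tuple(CLUSTER[w] for w in history)
    seen = histories[h]
    return ngrams[h + (CLUSTER[word],)] / seen if seen else 0.0

# Toy training data: two of the many possible word combinations.
train = ["pursued these pompous lesson", "aired that melancholic letter"]
ngrams, histories = cluster_ngrams(train)
# These exact words never occur together in `train`, but their clusters do.
p = prob("message", ["presented", "the", "laconic"], ngrams, histories)
print(p)
```

A word-level 4-gram model would give the unseen word sequence no direct evidence; the cluster-level model shares the evidence from the two training sentences.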

Word Embeddings

Deep Learning
Autoencoders: first learn embeddings unsupervised, then supervised learning of task.
Neural network language models: several implementations, some integrated in Moses.
Neural networks everywhere: translation model, reordering model, operation sequence model.

data

Big Data
For many language pairs, lots of text available.
Text you read in your lifetime: 300 million words.
Translated text available: billions of words.
English text available: trillions of words.

Mining the Web
Largest source for text: the World Wide Web.
Common Crawl: publicly available crawl of the web; hosted by Amazon Web Services, but can be downloaded; regularly updated (semi-annual); 2-4 billion web pages per crawl.
Currently filling up our hard drives.

Monolingual Data
Starting point: 35TB of text.
Processing pipeline [Buck et al., 2014]: language detection; deduplication; normalization of Unicode characters; sentence splitting.
Obtained corpora:
Language | Lines (B) | Tokens (B) | Bytes | BLEU (WMT)
English | 59.13 | 975.63 | 5.14 TB | -
German | 3.87 | 51.93 | 317.46 GB | +0.5
Spanish | 3.50 | 62.21 | 337.16 GB | -
French | 3.04 | 49.31 | 273.96 GB | +0.6
Russian | 1.79 | 21.41 | 220.62 GB | +1.2
Czech | 0.47 | 5.79 | 34.67 GB | +0.6
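
Three of the pipeline steps (removing duplicate lines, Unicode normalization, sentence splitting) can be sketched with the Python standard library. This is not the pipeline of Buck et al.; it is a simplified illustration. Language detection is left out because it needs a trained classifier, NFKC is an assumed choice of normalization form, and the regular-expression sentence splitter is deliberately naive.

```python
import re
import unicodedata

def deduplicate(lines):
    """Drop exact duplicate lines, keeping first occurrences in order."""
    seen, unique = set(), []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique

def normalize(line):
    """Apply NFKC Unicode normalization (an assumed choice) and collapse whitespace."""
    return " ".join(unicodedata.normalize("NFKC", line).split())

def split_sentences(line):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", line) if s]

def process(lines):
    """Run the steps in the order listed on the slide (minus language detection)."""
    cleaned = [normalize(line) for line in deduplicate(lines)]
    return [s for line in cleaned for s in split_sentences(line)]

raw = [
    "Hello world. How are you?",
    "Hello world. How are you?",   # exact duplicate, removed
    "\ufb01ne,  thanks.",          # 'fi' ligature and a double space
]
sentences = process(raw)
print(sentences)
```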

Parallel Data
Basic processing pipeline [Smith et al., 2013]: find parallel web pages (based on URL only); align document by HTML structure; sentence splitting and tokenization; sentence alignment; filtering (remove boilerplate).
Obtained corpora:
 | French | German | Spanish | Russian | Japanese | Chinese
Segments | 10.2M | 7.50M | 5.67M | 3.58M | 1.70M | 1.42M
Foreign Tokens | 128M | 79.9M | 71.5M | 34.7M | 9.91M | 8.14M
English Tokens | 118M | 87.5M | 67.6M | 36.7M | 19.1M | 14.8M

 | Bengali | Farsi | Telugu | Somali | Kannada | Pashto
Segments | 59.9K | 44.2K | 50.6K | 52.6K | 34.5K | 28.0K
Foreign Tokens | 573K | 477K | 336K | 318K | 305K | 208K
English Tokens | 537K | 459K | 358K | 325K | 297K | 218K
Much more work needed!
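
The first pipeline step, finding candidate parallel pages from URLs alone, could look roughly like the sketch below. It assumes the language code appears as its own path segment (for example /en/ or /fr/), which is only one of many conventions used on real sites; the pattern, the helper names, and the example URLs are all hypothetical and are not taken from Smith et al.

```python
import re
from collections import defaultdict

# Assumed convention: the language code is a separate path segment.
LANG_SEGMENT = re.compile(r"/(en|fr|de|es|ru|ja|zh)(?=/|$)")

def url_key(url):
    """Return (language, URL with the language segment replaced by a placeholder)."""
    match = LANG_SEGMENT.search(url)
    if match is None:
        return None, url
    return match.group(1), url[:match.start()] + "/*" + url[match.end():]

def find_parallel_pages(urls, source="en", target="fr"):
    """Pair URLs that are identical except for the language segment."""
    by_key = defaultdict(dict)
    for url in urls:
        lang, key = url_key(url)
        if lang is not None:
            by_key[key][lang] = url
    return [(v[source], v[target]) for v in by_key.values()
            if source in v and target in v]

urls = [
    "http://example.com/en/about",
    "http://example.com/fr/about",
    "http://example.com/en/contact",   # no French counterpart
    "http://example.com/news",         # no language segment
]
pairs = find_parallel_pages(urls)
print(pairs)
```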

computer aided translation

Post-Editing Machine Translation
(source: Autodesk)

Interactivity
Traditional professional translation approaches: translation from scratch; post-editing translation memory match; post-editing machine translation output.
More interactive collaboration between machine and professional?

Interactive Machine Translation
Input Sentence: Er hat seit Monaten geplant, im April einen Vortrag in New York zu halten.
Professional Translator (successive states of the translation):
(empty)
He
He has
He has for months
He planned
He planned for months
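
The interaction shown in these slides, where the system proposes a continuation and adapts when the translator types something else, can be approximated by matching the typed prefix against a list of candidate translations. This is a simplified sketch: interactive systems typically search the decoder's search graph rather than a short list, and the candidates, scores, and `suggest_next` helper below are invented for illustration.

```python
def suggest_next(prefix, nbest, num_words=1):
    """Suggest the next words after `prefix` from the best matching candidate.

    `nbest` is a list of (translation, score) pairs; a higher score is better.
    Returns "" if no candidate starts with the prefix.
    """
    prefix_words = prefix.split()
    for translation, _ in sorted(nbest, key=lambda pair: pair[1], reverse=True):
        words = translation.split()
        if words[:len(prefix_words)] == prefix_words:
            start = len(prefix_words)
            return " ".join(words[start:start + num_words])
    return ""

# Invented candidates and scores for the slide's input sentence.
nbest = [
    ("He has for months planned to give a lecture in New York in April", 0.6),
    ("He planned for months to give a lecture in New York in April", 0.4),
]
print(suggest_next("He", nbest))             # has
print(suggest_next("He planned", nbest, 2))  # for months
```

When the translator overrides the top suggestion ("planned" instead of "has"), the sketch falls back to the best candidate that is still compatible with what has been typed.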

Word Alignment Visualization
Input Sentence: Er hat seit Monaten geplant, im April einen Vortrag in New York zu halten.
Professional Translator: He planned for months to give a lecture in New York in

Shading off Translated Material
Input Sentence: Er hat seit Monaten geplant, im April einen Vortrag in New York zu halten.
Professional Translator: He planned for months to give a lecture in New York in

Choices
Trigger the passive vocabulary.
Display multiple translations for words and phrases:
er hat seit Monaten geplant, im April einen Vortrag...
he has for months the plan in April a lecture...
it has for months now planned, in April a presentation...
he was for several months planned to in the April a speech...
he has made since months the pipeline in April of a statement...
he did for many months scheduled the April a general...
Rank and color-highlight by probability of each translation.
Prefer diversity.

Instant Feedback Loop
[Figure: the MT engine translates the source text into an MT translation; post-editing turns it into the human translation, which is used to re-train the MT engine.]

CASMACAT Home Edition
Available as open source software.
Features: installation on any desktop machine; allows training of MT engines; all new types of assistance; incremental updating of models.
Warning: still in development stage (help welcome!).

summary

Summary
Machine translation is not perfect, but useful.
Better models (esp. syntax).
Better machine learning (esp. neural networks).
More data.
Closer integration with target application (e.g., computer aided translation).

Thank You
questions?