Machine Translation and the Translator Philipp Koehn 8 April 2015
About me 1
- Professor at Johns Hopkins University (US) and University of Edinburgh (Scotland)
- Author of textbook on statistical machine translation
- Leading development of the open source Moses toolkit
  - developed since 2006
  - reference implementation of state-of-the-art methods
  - used in academia as benchmark and testbed
  - extensive commercial deployment (20% of MT market)
Recent Projects 2
- Speech translation
- Computer aided translation
  - Development of an open source toolkit tightly integrated with machine translation
  - Novel types of assistance for translators
  - Adaptation of machine translation to user needs
- Open source infrastructure (MOSES CORE)
3 how good is machine translation?
Machine Translation: Chinese 4
Machine Translation: French 5
Quality 6
  HTER       assessment
  0%-10%     publishable
  10%-30%    editable
  30%-40%    gistable
  40%-50%    triagable
(scale developed in preparation of DARPA GALE programme)
Applications 7
  HTER       assessment    application examples
  0%-10%     publishable   Seamless bridging of language divide
                           Automatic publication of official announcements
  10%-20%    editable      Increased productivity of human translators
  20%-30%    editable      Access to official publications
                           Multi-lingual communication (chat, social networks)
  30%-40%    gistable      Information gathering
                           Trend spotting
  40%-50%    triagable     Identifying relevant documents
Current State of the Art 8
  HTER       assessment    language pairs and domains
  0%-10%     publishable   French-English restricted domain
  10%-20%    editable      French-English news stories
                           German-English news stories
  20%-30%    editable      Chinese-English news stories
                           English-Czech open domain
  30%-40%    gistable      English-Japanese open domain
  40%-50%    triagable
(informal rough estimates by presenter)
9 big picture
A Clear Plan 10-13
[Figure, built up over four slides: translation pyramid with Source at the bottom left, Target at the bottom right, and Interlingua at the apex. Analysis runs up the source side and Generation down the target side. Transfer levels from bottom to top: Lexical Transfer, Syntactic Transfer, Semantic Transfer.]
Learning from Data 14
[Figure: training pipeline]
- foreign/English parallel text -> statistical analysis -> Translation Model
- English text -> statistical analysis -> Language Model
- Translation Model + Language Model -> Decoding Algorithm
Finding the Best Translation 15
$e_{\text{BEST}} = \operatorname{argmax}_e \; p(e \mid f)$
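A minimal sketch of this search, assuming the usual decomposition by Bayes' rule: the e that maximizes p(e | f) is also the e that maximizes p(f | e) · p(e), i.e. translation model times language model. The candidate list and every probability below are hypothetical toy values, not taken from any trained model.

```python
import math

def best_translation(f, candidates, tm, lm):
    """Return the candidate e that maximizes p(f | e) * p(e).

    By Bayes' rule this is the same e that maximizes p(e | f),
    because p(f) is constant across candidates.
    """
    def score(e):
        # Sum log-probabilities instead of multiplying small numbers.
        return math.log(tm[(f, e)]) + math.log(lm[e])
    return max(candidates, key=score)

# Hypothetical translation model p(f | e) and language model p(e).
tm = {("Haus", "house"): 0.8, ("Haus", "home"): 0.7, ("Haus", "building"): 0.2}
lm = {"house": 0.05, "home": 0.03, "building": 0.04}

best = best_translation("Haus", ["house", "home", "building"], tm, lm)
print(best)
```

A real decoder cannot enumerate all candidates like this; it searches a very large space of possible output sentences, which is why the slides treat the decoding algorithm as a separate component.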
16 why is that a good plan?
Word Translation Problems 17
- Words are ambiguous
    He deposited money in a bank account with a high interest rate.
    Sitting on the bank of the Mississippi, a passing ship piqued his interest.
- How do we find the right meaning, and thus translation?
- Context should be helpful
Phrase Translation Problems 18
- Idiomatic phrases are not compositional
    It's raining cats and dogs.
    Es schüttet aus Eimern. (it pours from buckets.)
- How can we translate such larger units?
Syntactic Translation Problems 19
- Languages have different sentence structure
    das     behaupten   sie     wenigstens
    this    claim       they    at least
    the                 she
- Convert from object-verb-subject (OVS) to subject-verb-object (SVO)
- Ambiguities can be resolved through syntactic analysis
  - the meaning "the" of "das" is not possible (not a noun phrase)
  - the meaning "she" of "sie" is not possible (subject-verb agreement)
Semantic Translation Problems 20
- Pronominal anaphora
    I saw the movie and it is good.
- How to translate "it" into German (or French)?
  - "it" refers to "movie"
  - "movie" translates to "Film"
  - "Film" has masculine gender
  - ergo: "it" must be translated into the masculine pronoun "er"
- We are not handling this very well [Le Nagard and Koehn, 2010]
Semantic Translation Problems 21
- Coreference
    Whenever I visit my uncle and his daughters, I can't decide who is my favorite cousin.
- How to translate "cousin" into German? Male or female?
- Complex inference required
No Single Right Answer 22
- Israeli officials are responsible for airport security.
- Israel is in charge of the security at this airport.
- The security work for this airport is the responsibility of the Israel government.
- Israeli side was in charge of the security of this airport.
- Israel is responsible for the airport's security.
- Israel is responsible for safety work at this airport.
- Israel presides over the security of the airport.
- Israel took charge of the airport security.
- The safety of this airport is taken charge of by Israel.
- This airport's security is the responsibility of the Israeli security officials.
Learning from Data 24
- What is the best translation?
- Counts in European Parliament corpus
    Sicherheit    security     14,516
    Sicherheit    safety       10,015
    Sicherheit    certainty       334
Learning from Data 25
- What is the best translation?
- Phrasal rules
    Sicherheit                security            14,516
    Sicherheit                safety              10,015
    Sicherheit                certainty              334
    Sicherheitspolitik        security policy      1,580
    Sicherheitspolitik        safety policy           13
    Sicherheitspolitik        certainty policy         0
    Lebensmittelsicherheit    food security           51
    Lebensmittelsicherheit    food safety          1,084
    Lebensmittelsicherheit    food certainty           0
    Rechtssicherheit          legal security         156
    Rechtssicherheit          legal safety             5
    Rechtssicherheit          legal certainty        723
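The counts above can be turned into translation probabilities by relative frequency. The sketch below assumes that the listed translations are the only options for each German word, which is a simplification; the function name is illustrative.

```python
def relative_frequencies(counts):
    """Estimate p(e | f) as count(f, e) / sum of counts over all listed e."""
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()}

# Counts copied from the slide (European Parliament corpus).
sicherheit = {"security": 14516, "safety": 10015, "certainty": 334}
lebensmittelsicherheit = {"food security": 51, "food safety": 1084, "food certainty": 0}

p_word = relative_frequencies(sicherheit)
p_phrase = relative_frequencies(lebensmittelsicherheit)

best_word = max(p_word, key=p_word.get)
best_phrase = max(p_phrase, key=p_phrase.get)
print(best_word, round(p_word[best_word], 3))
print(best_phrase, round(p_phrase[best_phrase], 3))
```

The word on its own prefers "security", while the compound prefers "food safety": larger units carry the context needed to pick the right sense.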
26 better models
Phrase-Based Model 27
    natürlich hat John Spaß am Spiel
    of course John has fun with the game
- Foreign input is segmented in phrases
- Each phrase is translated into English
- Phrases are reordered
- Workhorse of today's statistical machine translation
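The three steps can be sketched on the example sentence. The segmentation, the phrase table entries, and the reordering below are assumptions chosen to reproduce the English output shown on the slide; a real system scores many competing segmentations, translations, and orders.

```python
# Hypothetical phrase table for the example sentence.
PHRASE_TABLE = {
    "natürlich": "of course",
    "hat": "has",
    "John": "John",
    "Spaß am": "fun with the",
    "Spiel": "game",
}

def translate(segments, order, table):
    """Translate each source phrase, then emit the phrases in the given order."""
    translated = [table[segment] for segment in segments]
    return " ".join(translated[i] for i in order)

# Step 1: segment the foreign input into phrases (assumed segmentation).
segments = ["natürlich", "hat", "John", "Spaß am", "Spiel"]
# Step 3: reorder, here swapping "hat" and "John" to get English word order.
order = [0, 2, 1, 3, 4]

output = translate(segments, order, PHRASE_TABLE)
print(output)
```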
Synchronous Grammar Rules 28
- Nonterminal rules
    NP → DET_1 NN_2 JJ_3 | DET_1 JJ_3 NN_2
- Terminal rules
    N → maison | house
    NP → la maison bleue | the blue house
- Mixed rules
    NP → la maison JJ_1 | the JJ_1 house
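A small sketch of how a synchronous rule rewrites both languages at once, using the mixed rule from the slide. The data representation (a pair of symbol lists, with a tuple marking a linked nonterminal) and the JJ rule for "bleue" are assumptions made for illustration.

```python
# A rule is (source side, target side); a tuple (label, index) is a linked nonterminal.
MIXED_NP = (["la", "maison", ("JJ", 1)], ["the", ("JJ", 1), "house"])
JJ_BLEUE = (["bleue"], ["blue"])  # hypothetical terminal rule: JJ → bleue | blue

def expand(rule, fillers):
    """Replace each linked nonterminal on both sides with its (source, target) filler."""
    def side(symbols, pick):
        out = []
        for symbol in symbols:
            if isinstance(symbol, tuple):
                out.extend(fillers[symbol[1]][pick])  # same index, both languages
            else:
                out.append(symbol)
        return out
    return side(rule[0], 0), side(rule[1], 1)

jj = expand(JJ_BLEUE, {})
source, target = expand(MIXED_NP, {1: jj})
print(" ".join(source), "|", " ".join(target))
```

The shared index is what keeps the two sides in step: the adjective is filled in once and lands after the noun in French but before it in English.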
Learning Rules 29
[Figure: parse tree over the English sentence, word-aligned to the German sentence]
    I shall be passing on to you some comments
    Ich werde Ihnen die entsprechenden Anmerkungen aushändigen
Extracted rule: VP → X_1 X_2 aushändigen | passing on PP_1 NP_2
Syntax Decoding 30
[Figure: decoding chart that builds an English parse tree over the tagged German input]
    German input: Sie/PPER will/VAFIN eine/ART Tasse/NN Kaffee/NN trinken/VVINF
    English output: she wants to drink a cup of coffee
New State of the Art 31
- Good results for German-English [WMT 2014]
    language pair      syntax preferred
    German-English     57%
    English-German     55%
- Mixed for other language pairs
    language pair      syntax preferred
    Czech-English      44%
    Russian-English    44%
    Hindi-English      54%
- Also very successful for Chinese-English
32 better machine learning
Sparse Data 33
- Statistical estimation often suffers from sparse data
- Zipf's law: most words are extremely rare
    frequency × rank = constant
- Statistics from Europarl
  - "the" occurs 1,929,379 times
  - large tail of words that occur once: 33,447 words, for instance cornflakes, mathematicians, or Tazhikhistan
[Figure: plot of frequency against rank]
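A short sketch of how such statistics can be collected: count words, rank them by frequency, and list the words that occur only once. The toy sentence is a stand-in for a real corpus and is far too small to show the Zipf relationship itself.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]

def singletons(tokens):
    """Return the words that occur exactly once, in order of first appearance."""
    counts = Counter(tokens)
    return [word for word, count in counts.items() if count == 1]

tokens = "the cat and the dog and the bird".split()  # toy stand-in for a corpus
rows = rank_frequency(tokens)
once = singletons(tokens)
for row in rows:
    print(row)
print(len(once), "words occur once")
```

On a large corpus, the last column (rank times frequency) is the quantity that Zipf's law predicts to stay roughly constant.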
Brown Clusters 34
- Main idea: share evidence with similar words
- Cluster words to reduce sparsity
    presented       the      laconic          message
    pursued         these    pompous          lesson
    aired           that     melancholic      letter
    commissioned    this     bouncy           counterfactuals
    published                incompletable    stunner
- For instance: use in language modeling
    $p(\mathrm{cluster}(\text{message}) \mid \mathrm{cluster}(\text{presented}), \mathrm{cluster}(\text{the}), \mathrm{cluster}(\text{laconic}))$
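A minimal sketch of a cluster-based language model, assuming the four columns above are the four clusters. The cluster ids, the three training sentences, and the unsmoothed estimate are all illustrative choices; this does not show how Brown clustering finds the clusters in the first place.

```python
from collections import Counter

# Hypothetical cluster ids 0-3, one per column of the slide.
COLUMNS = [
    ["presented", "pursued", "aired", "commissioned", "published"],
    ["the", "these", "that", "this"],
    ["laconic", "pompous", "melancholic", "bouncy", "incompletable"],
    ["message", "lesson", "letter", "counterfactuals", "stunner"],
]
CLUSTER = {word: cid for cid, words in enumerate(COLUMNS) for word in words}

def train(sentences):
    """Count cluster 4-grams and their 3-cluster contexts."""
    ngrams, contexts = Counter(), Counter()
    for sentence in sentences:
        ids = [CLUSTER[word] for word in sentence.split()]
        for i in range(3, len(ids)):
            context = tuple(ids[i - 3:i])
            ngrams[context + (ids[i],)] += 1
            contexts[context] += 1
    return ngrams, contexts

def prob(word, history, ngrams, contexts):
    """p(cluster(word) | clusters of the three history words), unsmoothed."""
    context = tuple(CLUSTER[h] for h in history)
    if contexts[context] == 0:
        return 0.0
    return ngrams[context + (CLUSTER[word],)] / contexts[context]

ngrams, contexts = train([
    "presented the laconic message",
    "pursued these pompous lesson",
    "aired that melancholic letter",
])
# None of these four words was seen in training, but their clusters were.
p = prob("stunner", ["commissioned", "this", "bouncy"], ngrams, contexts)
print(p)
```

A word-level model would have no evidence for this sequence at all; the cluster-level model borrows it from similar words.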
Word Embeddings 35
Word Embeddings 36
Deep Learning 37
- Autoencoders
  - first: learn embeddings unsupervised
  - then: supervised learning of task
- Neural network language models
  - several implementations
  - some integrated in Moses
- Neural networks everywhere
  - translation model
  - reordering model
  - operation sequence model
38 data
Big Data 39
- For many language pairs, lots of text available.
    Text you read in your lifetime    300 million words
    Translated text available         billions of words
    English text available            trillions of words
Mining the Web 40
- Largest source for text: the World Wide Web
- Common Crawl
  - publicly available crawl of the web
  - hosted by Amazon Web Services, but can be downloaded
  - regularly updated (semi-annual)
  - 2-4 billion web pages per crawl
- Currently filling up our hard drives
Monolingual Data 41
- Starting point: 35TB of text
- Processing pipeline [Buck et al., 2014]
  - language detection
  - deduplication
  - normalization of Unicode characters
  - sentence splitting
- Obtained corpora
    Language    Lines (B)    Tokens (B)    Bytes        BLEU (WMT)
    English     59.13        975.63        5.14 TB      -
    German       3.87         51.93        317.46 GB    +0.5
    Spanish      3.50         62.21        337.16 GB    -
    French       3.04         49.31        273.96 GB    +0.6
    Russian      1.79         21.41        220.62 GB    +1.2
    Czech        0.47          5.79         34.67 GB    +0.6
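A rough sketch of three of the pipeline steps using only the Python standard library. This is not the implementation from [Buck et al., 2014]: language detection is left out, the sentence splitter is a naive regular expression, and the sketch normalizes before deduplicating so that differently encoded copies of a line collapse into one.

```python
import re
import unicodedata

def normalize(line):
    """Apply Unicode NFC normalization and trim surrounding whitespace."""
    return unicodedata.normalize("NFC", line).strip()

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def clean_corpus(lines):
    """Normalize, drop empty and duplicate lines, then split into sentences."""
    seen = set()
    sentences = []
    for line in lines:
        line = normalize(line)
        if not line or line in seen:
            continue
        seen.add(line)
        sentences.extend(split_sentences(line))
    return sentences

raw = [
    "Cafe\u0301 is open. Come in!",  # 'e' followed by a combining accent
    "Caf\u00e9 is open. Come in!",   # same text with a precomposed character
    "",
]
cleaned = clean_corpus(raw)
print(cleaned)
```

At web scale the set of seen lines would not fit in memory, so a production pipeline would need hashing or sorting on disk instead.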
Parallel Data 42
- Basic processing pipeline [Smith et al., 2013]
  - find parallel web pages (based on URL only)
  - align document by HTML structure
  - sentence splitting and tokenization
  - sentence alignment
  - filtering (remove boilerplate)
- Obtained corpora
                      French    German    Spanish    Russian    Japanese    Chinese
    Segments          10.2M     7.50M     5.67M      3.58M      1.70M       1.42M
    Foreign Tokens    128M      79.9M     71.5M      34.7M      9.91M       8.14M
    English Tokens    118M      87.5M     67.6M      36.7M      19.1M       14.8M

                      Bengali   Farsi     Telugu     Somali     Kannada     Pashto
    Segments          59.9K     44.2K     50.6K      52.6K      34.5K       28.0K
    Foreign Tokens    573K      477K      336K       318K       305K        208K
    English Tokens    537K      459K      358K       325K       297K        218K
- Much more work needed!
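One plausible way to find parallel pages from URLs alone is to look for URLs that differ only in a language code. The sketch below assumes that approach; the language code list, the separator set, and the example URLs are hypothetical, and the matching rules in [Smith et al., 2013] may differ.

```python
import re
from collections import defaultdict

LANG_CODES = {"en", "fr", "de", "es"}  # hypothetical, deliberately small

def url_key(url):
    """Replace the first language code in the URL with '*'; return (key, language)."""
    parts = re.split(r"([/._=-])", url)  # capturing group keeps the separators
    for i, part in enumerate(parts):
        if part.lower() in LANG_CODES:
            language = part.lower()
            parts[i] = "*"
            return "".join(parts), language
    return None, None

def find_pairs(urls, source="en"):
    """Pair each source-language URL with URLs that share its key."""
    groups = defaultdict(dict)
    for url in urls:
        key, language = url_key(url)
        if key is not None:
            groups[key][language] = url
    pairs = []
    for by_language in groups.values():
        if source in by_language:
            for language, url in sorted(by_language.items()):
                if language != source:
                    pairs.append((by_language[source], url))
    return pairs

urls = [
    "http://example.com/en/news.html",
    "http://example.com/fr/news.html",
    "http://example.com/en/about.html",
]
pairs = find_pairs(urls)
print(pairs)
```

This toy version would misfire on country-code domains such as ".de", which is one reason the later pipeline steps (structure alignment and filtering) are needed.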
43 computer aided translation
Post-Editing Machine Translation 44 (source: Autodesk)
Interactivity 45
- Traditional professional translation approaches
  - translation from scratch
  - post-editing translation memory match
  - post-editing machine translation output
- More interactive collaboration between machine and professional?
Interactive Machine Translation 46-51
Input Sentence: Er hat seit Monaten geplant, im April einen Vortrag in New York zu halten.
Professional Translator, as the typed translation grows from slide to slide:
    (empty)
    He
    He has
    He has for months
    He planned
    He planned for months
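The interaction can be sketched as prefix-constrained prediction: the system proposes the continuation of its best translation that agrees with what the translator has typed so far. The n-best list and its scores below are hypothetical, and a real system would search the decoder's full search graph rather than a short list.

```python
def suggest_completion(prefix, nbest):
    """Return the rest of the highest-scoring candidate that starts with the typed prefix."""
    prefix_tokens = prefix.split()
    for candidate, _score in sorted(nbest, key=lambda item: item[1], reverse=True):
        tokens = candidate.split()
        if tokens[:len(prefix_tokens)] == prefix_tokens:
            return " ".join(tokens[len(prefix_tokens):])
    return None  # no candidate is compatible with the prefix

# Hypothetical machine translation candidates with made-up scores.
NBEST = [
    ("He has for months planned to give a lecture in New York in April.", 0.5),
    ("He planned for months to give a lecture in New York in April.", 0.3),
]

after_he = suggest_completion("He", NBEST)
after_he_planned = suggest_completion("He planned", NBEST)
print(after_he)
print(after_he_planned)
```

When the translator overrides the first suggestion by typing "planned", the prediction switches to a candidate that is consistent with the new prefix.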
Word Alignment Visualization 52 Input Sentence Er hat seit Monaten geplant, im April einen Vortrag in New York zu halten. Professional Translator He planned for months to give a lecture in New York in
Shading off Translated Material 54 Input Sentence Er hat seit Monaten geplant, im April einen Vortrag in New York zu halten. Professional Translator He planned for months to give a lecture in New York in
Choices 55
- Trigger the passive vocabulary
- Display multiple translations for words and phrases
    er hat seit Monaten geplant, im April einen Vortrag...
    he has for months the plan in April a lecture...
    it has for months now planned, in April a presentation...
    he was for several months planned to in the April a speech...
    he has made since months the pipeline in April of a statement...
    he did for many months scheduled the April a general...
- Rank and color-highlight by probability of each translation
- Prefer diversity
Instant Feedback Loop 56
[Figure: cycle] source text -> translate (MT engine) -> MT translation -> post-edit -> human translation -> re-train -> MT engine
CASMACAT Home Edition 57
- Available as open source software
- Features
  - installation on any desktop machine
  - allows training of MT engines
  - all new types of assistance
  - incremental updating of models
- Warning: still in development stage (help welcome!)
58 summary
Summary 59
- Machine translation is not perfect, but useful
- Better models (esp. syntax)
- Better machine learning (esp. neural networks)
- More data
- Closer integration with target application (e.g., computer aided translation)
Thank You 60 questions?