Statistical Machine Translation

Size: px
Start display at page:

Download "Statistical Machine Translation"

Transcription

1 Statistical Machine Translation Some of the content of this lecture is taken from previous lectures and presentations given by Philipp Koehn and Andy Way. Dr. Jennifer Foster National Centre for Language Technology School of Computing Dublin City University Week Ten/Eleven 17th April 2014

2 Lecture Plan Part One: Recap of decoding Part Two: More on decoding Part Three: Encoding linguistic knowledge

3 Part One Recap of decoding

4 Questions 1. What is decoding? 2. What methods are used to make the decoding problem tractable? 3. What is the difference between histogram pruning and threshold pruning?

5 What is decoding? Decoding is the process of searching for the most likely translation within the space of candidate translations.

6 Making decoding tractable 1. Hypothesis recombination 2. Pruning

7 Histogram Pruning Keep a maximum of m hypotheses

8 Threshold Pruning Keep only those hypotheses that are within a threshold α of the best hypothesis.

9 Histogram Pruning How many hypotheses will be pruned if m = 1?

10 Threshold Pruning How many hypotheses will be pruned if we prune all hypotheses that are at least 0.5 times worse than the best hypothesis? 0.09

11 Threshold Pruning 0.01 Threshold pruning is more flexible than histogram pruning since it takes into account the difference between the scores of the best and worst hypothesis

12 Part Two More on decoding

13 Stack Decoding Algorithm Partial translations or hypotheses are grouped into stacks based on how many words in the input sentence they translate.

14 Stack Decoding Algorithm Partial translations or hypotheses are grouped into stacks based on how many words in the input sentence they translate. Working left to right through the stacks, the decoder takes a hypothesis from a stack and a phrase from the input sentence and tries to construct a new hypothesis.

15 Stack Decoding Algorithm Partial translations or hypotheses are grouped into stacks based on how many words in the input sentence they translate. Working left to right through the stacks, the decoder takes a hypothesis from a stack and a phrase from the input sentence and tries to construct a new hypothesis. The output is the highest scoring hypothesis that translates the entire input sentence.

16 Stack Decoding Algorithm Partial translations or hypotheses are grouped into stacks based on how many words in the input sentence they translate. Working left to right through the stacks, the decoder takes a hypothesis from a stack and a phrase from the input sentence and tries to construct a new hypothesis. The output is the highest scoring hypothesis that translates the entire input sentence. Stacks are kept to a manageable size using pruning.

17 Stack Decoding Pseudocode 1:place empty hypothesis into stack 0 2: for all stacks 0...n 1 do 3: for all hypotheses in stack do 4: for all translation options do 5: if applicable then 6: create new hypothesis 7: place in stack 8: recombine with existing hypothesis if possible 9: prune stack if too big 10: end if 11: end for 12: end for 13: end for

18 Stack Decoding Example Input sentence: maison bleu Goal: Translate it into English using the stack decoding algorithm with histogram pruning (maximum number of hypotheses per stack = 3)

19 Stack Decoding Example Phrase pairs in training data: maison house maison home bleu blue bleu turquoise maison bleu blue house

20 Stack Decoding Example All possible translations: house blue, house turquoise, home blue, home turquoise, blue house, blue home, turquoise home, turquoise house

21 Stack Decoding Example Insert the empty hypothesis in Stack

22 Stack Decoding Example Construct new hypotheses compatible with empty hypothesis turquoise blue home house blue house 0 1 2

23 Stack Decoding Example Prune Stack 1 (assume turquoise has lowest probability) turquoise blue home house blue house 0 1 2

24 Stack Decoding Example Construct new hypotheses compatible with hypotheses in Stack 1 home turqoise house turquoise home blue blue home house house blue blue home blue house blue house 0 1 2

25 Stack Decoding Example Perform hypothesis recombination (assume in this case that the shorter path hypothesis is more likely) home turqoise house turquoise home blue blue home house house blue blue home blue house 0 1 2

26 Stack Decoding Example Prune stack 2 (assume home blue, home turquoise and house turquoise have the lowest scores) blue home house home turqoise housee turquoise home blue house blue blue home blue house 0 1 2

27 Stack Decoding Example Output most likely translation in Stack 2 blue house blue home blue home house blue house 0 1 2

28 Limitation of Stack Decoding Algorithm Relatively low-scoring hypotheses that are part of a good translation may be pruned.

29 Limitation of Stack Decoding Algorithm the tourism initiative addresses this for the first time Die Tm:-0.19, lm:-0.4, d:0,all:-0.59 touristische Tm:-1.16, lm:-2.93, d:0,all:-4.09 initiative Tm:-1.21, lm:-4.67, d:0,all:-5.88 das erste mal Tm:-0.56, lm:-2.81, d:-0.74,all:-4.11

30 Limitation of Stack Decoding Algorithm the tourism initiative addresses this for the first time Die Tm:-0.19, lm:-0.4, d:0,all:-0.59 touristische Tm:-1.16, lm:-2.93, d:0,all:-4.09 initiative Tm:-1.21, lm:-4.67, d:0,all:-5.88 das erste mal Tm:-0.56, lm:-2.81, d:-0.74,all:-4.11 Both hypotheses translate 3 words (therefore, in the same stack so will be compared to each other when deciding what to prune)

31 Limitation of Stack Decoding Algorithm the tourism initiative addresses this for the first time Die Tm:-0.19, lm:-0.4, d:0,all:-0.59 touristische Tm:-1.16, lm:-2.93, d:0,all:-4.09 initiative Tm:-1.21, lm:-4.67, d:0,all:-5.88 das erste mal Tm:-0.56, lm:-2.81, d:-0.74,all:-4.11 Both hypotheses translate 3 words (therefore, in the same stack so will be compared to each other when deciding what to prune) The worse hypothesis (i.e. the one starting with das erste mal) has a better score at this stage because it contains common words that are easier to translate.

32 Limitation of Stack Decoding Algorithm How to overcome this limitation?

33 Limitation of Stack Decoding Algorithm How to overcome this limitation? Estimate the future cost associated with a hypothesis by looking at the rest of the input string.

34 Future Cost A hypothesis score is some combination of its probability and its future cost. Why is the exact future cost not calculated?

35 Future Cost A hypothesis score is some combination of its probability and its future cost. Why is the exact future cost not calculated? Because calculating the exact future cost of all hypotheses would amount to evaluating the translation cost of all hypotheses the very thing we use pruning to avoid.

36 Depth-first versus breadth-first search A distinction is made between decoding algorithms that perform a breadth-first search and those that perform a depth-first search.

37 Depth-first versus breadth-first search A distinction is made between decoding algorithms that perform a breadth-first search and those that perform a depth-first search. A breadth-first search explores many hypotheses in parallel, e.g. stack decoding. A depth-first search fully explores a hypothesis before moving on to the next one, e.g. A* decoding.

38 Search and Model Errors 1. A search error occurs when the decoder misses a translation with a higher probability than the translation it returns.

39 Search and Model Errors 1. A search error occurs when the decoder misses a translation with a higher probability than the translation it returns. 2. A model error occurs when the model assigns a higher probability to an incorrect translation than to a correct one.

40 Part Three Making Use of Linguistic Knowledge

41 The story so far The SMT approaches we have studied so far are language-independent.

42 The story so far The SMT approaches we have studied so far are language-independent. Input: 1. A monolingual corpus for training a language model 2. A sentence-aligned parallel corpus for training a translation model

43 The story so far The SMT approaches we have studied so far are language-independent. Input: 1. A monolingual corpus for training a language model 2. A sentence-aligned parallel corpus for training a translation model Output: A translation (of variable quality)

44 The old way of doing things Traditional rule-based MT systems tried to translate using hand-coded rules which encode linguistic knowledge or generalisations.

45 Examples of linguistic knowledge 1. In English, the noun comes after the adjective. In French, the noun usually precedes the adjective.

46 Examples of linguistic knowledge 1. In English, the noun comes after the adjective. In French, the noun usually precedes the adjective. 2. In German, the main verb often occurs at the end of a clause.

47 Examples of linguistic knowledge 1. In English, the noun comes after the adjective. In French, the noun usually precedes the adjective. 2. In German, the main verb often occurs at the end of a clause. 3. In Irish, prepositions are inflected for number and person.

48 Examples of linguistic knowledge 1. In English, the noun comes after the adjective. In French, the noun usually precedes the adjective. 2. In German, the main verb often occurs at the end of a clause. 3. In Irish, prepositions are inflected for number and person. 4. In many languages, the subject and verb agree in person and number.

49 Why bother with linguistic knowledge? Why would we want to encode linguistic knowledge in an SMT system?

50 Why bother with linguistic knowledge? Why would we want to encode linguistic knowledge in an SMT system? 1. Data-driven SMT works well but there is room for improvement (especially for some language pairs).

51 Why bother with linguistic knowledge? Why would we want to encode linguistic knowledge in an SMT system? 1. Data-driven SMT works well but there is room for improvement (especially for some language pairs). 2. Data-driven SMT relies on the availability of large corpora. For minority languages, this can be problematic.

52 Using linguistic knowledge Where linguistic knowledge might be useful:

53 Using linguistic knowledge Where linguistic knowledge might be useful: 1. More global view of fluency than an n-gram language model: the house is the man is small

54 Using linguistic knowledge Where linguistic knowledge might be useful: 1. More global view of fluency than an n-gram language model: the house is the man is small 2. Long-distance dependencies: make a tremendous amount of sense

55 Using linguistic knowledge Where linguistic knowledge might be useful: 1. More global view of fluency than an n-gram language model: the house is the man is small 2. Long-distance dependencies: make a tremendous amount of sense 3. Word compounding: Knowing that Aktionsplan is made up of the words Aktion and plan

56 Using linguistic knowledge Where linguistic knowledge might be useful: 1. More global view of fluency than an n-gram language model: the house is the man is small 2. Long-distance dependencies: make a tremendous amount of sense 3. Word compounding: Knowing that Aktionsplan is made up of the words Aktion and plan 4. Better handling of function words: role of de in entreprises de transports (haulage companies)

57 How to encode linguistic knowledge How can linguistic knowledge be encoded in an SMT system?

58 How to encode linguistic knowledge How can linguistic knowledge be encoded in an SMT system? 1. Before translation (pre-processing)

59 How to encode linguistic knowledge How can linguistic knowledge be encoded in an SMT system? 1. Before translation (pre-processing) 2. During translation

60 How to encode linguistic knowledge How can linguistic knowledge be encoded in an SMT system? 1. Before translation (pre-processing) 2. During translation 3. After translation (post-processing)

61 Where do we get this linguistic knowledge? How do we obtain linguistic knowledge automatically?

62 Where do we get this linguistic knowledge? How do we obtain linguistic knowledge automatically? 1. Automatic morphological analysis 2. Part-of-speech tagging 3. Syntactic parsing

63 Where do we get this linguistic knowledge? How do we obtain linguistic knowledge automatically? 1. Automatic morphological analysis 2. Part-of-speech tagging 3. Syntactic parsing Can you spot a potential problem here (hint: automatic)?

64 Syntactic Parsing S NP VP dobj NNP VBP NNP John loves Mary NNP VBP NNP John loves Mary nsubj

65 Pre-processing Encoding linguistic knowledge before translation 1. Word segmentation 2. Word lemmatisation 3. Re-ordering

66 Word Segmentation The notion of what a word is varies from language to language.

67 Word Segmentation The notion of what a word is varies from language to language. In some languages, e.g. Chinese, words are not separated by spaces.

68 Word Segmentation The notion of what a word is varies from language to language. In some languages, e.g. Chinese, words are not separated by spaces. Why is this a problem for machine translation?

69 Word Segmentation The notion of what a word is varies from language to language. In some languages, e.g. Chinese, words are not separated by spaces. Why is this a problem for machine translation? Potential solution: try to perform automatic word segmentation before translation.

70 Word Segmentation The notion of what a word is varies from language to language. In some languages, e.g. Chinese, words are not separated by spaces. Why is this a problem for machine translation? Potential solution: try to perform automatic word segmentation before translation. A related challenge is word compounding in German.

71 Word Lemmatisation Languages with a rich inflectional system can have several different variants of the same root form or lemma.

72 Word Lemmatisation Languages with a rich inflectional system can have several different variants of the same root form or lemma. Example: Latin: bellum, belli, bello, bella, bellis, bellorum English: war, wars

73 Word Lemmatisation Languages with a rich inflectional system can have several different variants of the same root form or lemma. Example: Latin: bellum, belli, bello, bella, bellis, bellorum English: war, wars Why is this a problem for translation from an MRL (morphologically rich language) into English?

74 Word Lemmatisation Languages with a rich inflectional system can have several different variants of the same root form or lemma. Example: Latin: bellum, belli, bello, bella, bellis, bellorum English: war, wars Why is this a problem for translation from an MRL (morphologically rich language) into English? Potential solution: before translation, replace full form with lemma

75 Word Lemmatisation Languages with a rich inflectional system can have several different variants of the same root form or lemma. Example: Latin: bellum, belli, bello, bella, bellis, bellorum English: war, wars Why is this a problem for translation from an MRL (morphologically rich language) into English? Potential solution: before translation, replace full form with lemma

76 Syntactic Re-ordering It is harder to translate language pairs with very different word order.

77 Syntactic Re-ordering It is harder to translate language pairs with very different word order. Language model and phrase-based translation model can only do so much.

78 Syntactic Re-ordering It is harder to translate language pairs with very different word order. Language model and phrase-based translation model can only do so much. Potential solution: transform the source sentence so that its word order more closely resembles the word order of the target language.

79 Syntactic Re-ordering It is harder to translate language pairs with very different word order. Language model and phrase-based translation model can only do so much. Potential solution: transform the source sentence so that its word order more closely resembles the word order of the target language. Consider translating from English to German: I walked to the shop yesterday. I yesterday to the shop walked.

80 During Translation Encoding linguistic knowledge during translation

81 During Translation Encoding linguistic knowledge during translation 1. Log-linear models

82 During Translation Encoding linguistic knowledge during translation 1. Log-linear models 2. Tree-based models

83 During Translation Encoding linguistic knowledge during translation 1. Log-linear models 2. Tree-based models 3. Syntactic language models

84 Log-linear models The log-linear model of translation allows several sources of information to be employed simultaneously during the translation process: p x = exp n i=1 λ i h i (x)

85 Log-linear models The log-linear model of translation allows several sources of information to be employed simultaneously during the translation process: p x = exp n i=1 λ i h i (x) Some of this information can be linguistic knowledge, e.g. instead of just calculating p(ocras hungry) we can also calculate p(noun adj).

86 Tree-based models A more radical approach is to replace the phrasebased translation model with a different type of translation model entirely.

87 Tree-based models A more radical approach is to replace the phrasebased translation model with a different type of translation model entirely. One such model is a tree-based model which is based on the notion of a syntax tree and which allows one, in theory, to better capture structural similarities and divergences between languages.

88 Tree-based models A more radical approach is to replace the phrasebased translation model with a different type of translation model entirely. One such model is a tree-based model which is based on the notion of a syntax tree and which allows one, in theory, to better capture structural similarities and divergences between languages. During decoding, syntax trees for the source and language pairs are built up simultaneously (or one side only)

89 Syntactic language models N-gram language models are not the only language models.

90 Syntactic language models N-gram language models are not the only language models. People have experimented with other kinds of language models, including those based on the notion of a syntax tree.

91 Syntactic language models N-gram language models are not the only language models. People have experimented with other kinds of language models, including those based on the notion of a syntax tree. Instead of computing the probability of a sentence by multiplying n-gram probabilities, the probability is computed by multiplying the probabilities of the rules that are used to build the syntax tree(s).

92 Post-processing SMT systems return a ranked list of candidate translations. Linguistic information (of all kinds) can be employed in re-ranking this list using machine learning techniques.

Chapter 6. Decoding. Statistical Machine Translation

Chapter 6. Decoding. Statistical Machine Translation Chapter 6 Decoding Statistical Machine Translation Decoding We have a mathematical model for translation p(e f) Task of decoding: find the translation e best with highest probability Two types of error

More information

Statistical Machine Translation Lecture 4. Beyond IBM Model 1 to Phrase-Based Models

Statistical Machine Translation Lecture 4. Beyond IBM Model 1 to Phrase-Based Models p. Statistical Machine Translation Lecture 4 Beyond IBM Model 1 to Phrase-Based Models Stephen Clark based on slides by Philipp Koehn p. Model 2 p Introduces more realistic assumption for the alignment

More information

Phrase-Based MT. Machine Translation Lecture 7. Instructor: Chris Callison-Burch TAs: Mitchell Stern, Justin Chiu. Website: mt-class.

Phrase-Based MT. Machine Translation Lecture 7. Instructor: Chris Callison-Burch TAs: Mitchell Stern, Justin Chiu. Website: mt-class. Phrase-Based MT Machine Translation Lecture 7 Instructor: Chris Callison-Burch TAs: Mitchell Stern, Justin Chiu Website: mt-class.org/penn Translational Equivalence Er hat die Prüfung bestanden, jedoch

More information

Hybrid Machine Translation Guided by a Rule Based System

Hybrid Machine Translation Guided by a Rule Based System Hybrid Machine Translation Guided by a Rule Based System Cristina España-Bonet, Gorka Labaka, Arantza Díaz de Ilarraza, Lluís Màrquez Kepa Sarasola Universitat Politècnica de Catalunya University of the

More information

Convergence of Translation Memory and Statistical Machine Translation

Convergence of Translation Memory and Statistical Machine Translation Convergence of Translation Memory and Statistical Machine Translation Philipp Koehn and Jean Senellart 4 November 2010 Progress in Translation Automation 1 Translation Memory (TM) translators store past

More information

The Transition of Phrase based to Factored based Translation for Tamil language in SMT Systems

The Transition of Phrase based to Factored based Translation for Tamil language in SMT Systems The Transition of Phrase based to Factored based Translation for Tamil language in SMT Systems Dr. Ananthi Sheshasaayee 1, Angela Deepa. V.R 2 1 Research Supervisior, Department of Computer Science & Application,

More information

The XMU Phrase-Based Statistical Machine Translation System for IWSLT 2006

The XMU Phrase-Based Statistical Machine Translation System for IWSLT 2006 The XMU Phrase-Based Statistical Machine Translation System for IWSLT 2006 Yidong Chen, Xiaodong Shi Institute of Artificial Intelligence Xiamen University P. R. China November 28, 2006 - Kyoto 13:46 1

More information

Factored Translation Models

Factored Translation Models Factored Translation s Philipp Koehn and Hieu Hoang [email protected], [email protected] School of Informatics University of Edinburgh 2 Buccleuch Place, Edinburgh EH8 9LW Scotland, United Kingdom

More information

Customizing an English-Korean Machine Translation System for Patent Translation *

Customizing an English-Korean Machine Translation System for Patent Translation * Customizing an English-Korean Machine Translation System for Patent Translation * Sung-Kwon Choi, Young-Gil Kim Natural Language Processing Team, Electronics and Telecommunications Research Institute,

More information

Learning Translation Rules from Bilingual English Filipino Corpus

Learning Translation Rules from Bilingual English Filipino Corpus Proceedings of PACLIC 19, the 19 th Asia-Pacific Conference on Language, Information and Computation. Learning Translation s from Bilingual English Filipino Corpus Michelle Wendy Tan, Raymond Joseph Ang,

More information

A Joint Sequence Translation Model with Integrated Reordering

A Joint Sequence Translation Model with Integrated Reordering A Joint Sequence Translation Model with Integrated Reordering Nadir Durrani, Helmut Schmid and Alexander Fraser Institute for Natural Language Processing University of Stuttgart Introduction Generation

More information

Chapter 5. Phrase-based models. Statistical Machine Translation

Chapter 5. Phrase-based models. Statistical Machine Translation Chapter 5 Phrase-based models Statistical Machine Translation Motivation Word-Based Models translate words as atomic units Phrase-Based Models translate phrases as atomic units Advantages: many-to-many

More information

Why Evaluation? Machine Translation. Evaluation. Evaluation Metrics. Ten Translations of a Chinese Sentence. How good is a given system?

Why Evaluation? Machine Translation. Evaluation. Evaluation Metrics. Ten Translations of a Chinese Sentence. How good is a given system? Why Evaluation? How good is a given system? Machine Translation Evaluation Which one is the best system for our purpose? How much did we improve our system? How can we tune our system to become better?

More information

Machine Translation. Why Evaluation? Evaluation. Ten Translations of a Chinese Sentence. Evaluation Metrics. But MT evaluation is a di cult problem!

Machine Translation. Why Evaluation? Evaluation. Ten Translations of a Chinese Sentence. Evaluation Metrics. But MT evaluation is a di cult problem! Why Evaluation? How good is a given system? Which one is the best system for our purpose? How much did we improve our system? How can we tune our system to become better? But MT evaluation is a di cult

More information

Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast

Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast Hassan Sawaf Science Applications International Corporation (SAIC) 7990

More information

Translation Solution for

Translation Solution for Translation Solution for Case Study Contents PROMT Translation Solution for PayPal Case Study 1 Contents 1 Summary 1 Background for Using MT at PayPal 1 PayPal s Initial Requirements for MT Vendor 2 Business

More information

tance alignment and time information to create confusion networks 1 from the output of different ASR systems for the same

tance alignment and time information to create confusion networks 1 from the output of different ASR systems for the same 1222 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 16, NO. 7, SEPTEMBER 2008 System Combination for Machine Translation of Spoken and Written Language Evgeny Matusov, Student Member,

More information

THUTR: A Translation Retrieval System

THUTR: A Translation Retrieval System THUTR: A Translation Retrieval System Chunyang Liu, Qi Liu, Yang Liu, and Maosong Sun Department of Computer Science and Technology State Key Lab on Intelligent Technology and Systems National Lab for

More information

An Approach to Handle Idioms and Phrasal Verbs in English-Tamil Machine Translation System

An Approach to Handle Idioms and Phrasal Verbs in English-Tamil Machine Translation System An Approach to Handle Idioms and Phrasal Verbs in English-Tamil Machine Translation System Thiruumeni P G, Anand Kumar M Computational Engineering & Networking, Amrita Vishwa Vidyapeetham, Coimbatore,

More information

The Prague Bulletin of Mathematical Linguistics NUMBER 96 OCTOBER 2011 49 58. Ncode: an Open Source Bilingual N-gram SMT Toolkit

The Prague Bulletin of Mathematical Linguistics NUMBER 96 OCTOBER 2011 49 58. Ncode: an Open Source Bilingual N-gram SMT Toolkit The Prague Bulletin of Mathematical Linguistics NUMBER 96 OCTOBER 2011 49 58 Ncode: an Open Source Bilingual N-gram SMT Toolkit Josep M. Crego a, François Yvon ab, José B. Mariño c c a LIMSI-CNRS, BP 133,

More information

UNSUPERVISED MORPHOLOGICAL SEGMENTATION FOR STATISTICAL MACHINE TRANSLATION

UNSUPERVISED MORPHOLOGICAL SEGMENTATION FOR STATISTICAL MACHINE TRANSLATION UNSUPERVISED MORPHOLOGICAL SEGMENTATION FOR STATISTICAL MACHINE TRANSLATION by Ann Clifton B.A., Reed College, 2001 a thesis submitted in partial fulfillment of the requirements for the degree of Master

More information

A New Input Method for Human Translators: Integrating Machine Translation Effectively and Imperceptibly

A New Input Method for Human Translators: Integrating Machine Translation Effectively and Imperceptibly Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015) A New Input Method for Human Translators: Integrating Machine Translation Effectively and Imperceptibly

More information

Comma checking in Danish Daniel Hardt Copenhagen Business School & Villanova University

Comma checking in Danish Daniel Hardt Copenhagen Business School & Villanova University Comma checking in Danish Daniel Hardt Copenhagen Business School & Villanova University 1. Introduction This paper describes research in using the Brill tagger (Brill 94,95) to learn to identify incorrect

More information

Towards a RB-SMT Hybrid System for Translating Patent Claims Results and Perspectives

Towards a RB-SMT Hybrid System for Translating Patent Claims Results and Perspectives Towards a RB-SMT Hybrid System for Translating Patent Claims Results and Perspectives Ramona Enache and Adam Slaski Department of Computer Science and Engineering Chalmers University of Technology and

More information

Hybrid Strategies. for better products and shorter time-to-market

Hybrid Strategies. for better products and shorter time-to-market Hybrid Strategies for better products and shorter time-to-market Background Manufacturer of language technology software & services Spin-off of the research center of Germany/Heidelberg Founded in 1999,

More information

The Impact of Morphological Errors in Phrase-based Statistical Machine Translation from English and German into Swedish

The Impact of Morphological Errors in Phrase-based Statistical Machine Translation from English and German into Swedish The Impact of Morphological Errors in Phrase-based Statistical Machine Translation from English and German into Swedish Oscar Täckström Swedish Institute of Computer Science SE-16429, Kista, Sweden [email protected]

More information

Word Completion and Prediction in Hebrew

Word Completion and Prediction in Hebrew Experiments with Language Models for בס"ד Word Completion and Prediction in Hebrew 1 Yaakov HaCohen-Kerner, Asaf Applebaum, Jacob Bitterman Department of Computer Science Jerusalem College of Technology

More information

Building a Web-based parallel corpus and filtering out machinetranslated

Building a Web-based parallel corpus and filtering out machinetranslated Building a Web-based parallel corpus and filtering out machinetranslated text Alexandra Antonova, Alexey Misyurev Yandex 16, Leo Tolstoy St., Moscow, Russia {antonova, misyurev}@yandex-team.ru Abstract

More information

This page intentionally left blank

This page intentionally left blank This page intentionally left blank Statistical Machine Translation The field of machine translation has recently been energized by the emergence of statistical techniques, which have brought the dream

More information

SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 统

SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 统 SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems Jin Yang, Satoshi Enoue Jean Senellart, Tristan Croiset SYSTRAN Software, Inc. SYSTRAN SA 9333 Genesee Ave. Suite PL1 La Grande

More information

Machine Translation. Agenda

Machine Translation. Agenda Agenda Introduction to Machine Translation Data-driven statistical machine translation Translation models Parallel corpora Document-, sentence-, word-alignment Phrase-based translation MT decoding algorithm

More information

Paraphrasing controlled English texts

Paraphrasing controlled English texts Paraphrasing controlled English texts Kaarel Kaljurand Institute of Computational Linguistics, University of Zurich [email protected] Abstract. We discuss paraphrasing controlled English texts, by defining

More information

Introduction. Philipp Koehn. 28 January 2016

Introduction. Philipp Koehn. 28 January 2016 Introduction Philipp Koehn 28 January 2016 Administrativa 1 Class web site: http://www.mt-class.org/jhu/ Tuesdays and Thursdays, 1:30-2:45, Hodson 313 Instructor: Philipp Koehn (with help from Matt Post)

More information

Factored bilingual n-gram language models for statistical machine translation

Factored bilingual n-gram language models for statistical machine translation Mach Translat DOI 10.1007/s10590-010-9082-5 Factored bilingual n-gram language models for statistical machine translation Josep M. Crego François Yvon Received: 2 November 2009 / Accepted: 12 June 2010

More information

SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 CWMT2011 技 术 报 告

SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 CWMT2011 技 术 报 告 SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 Jin Yang and Satoshi Enoue SYSTRAN Software, Inc. 4444 Eastgate Mall, Suite 310 San Diego, CA 92121, USA E-mail:

More information

Computer Aided Translation

Computer Aided Translation Computer Aided Translation Philipp Koehn 30 April 2015 Why Machine Translation? 1 Assimilation reader initiates translation, wants to know content user is tolerant of inferior quality focus of majority

More information

Machine Translation and the Translator

Machine Translation and the Translator Machine Translation and the Translator Philipp Koehn 8 April 2015 About me 1 Professor at Johns Hopkins University (US), University of Edinburgh (Scotland) Author of textbook on statistical machine translation

More information

Learning Translations of Named-Entity Phrases from Parallel Corpora

Learning Translations of Named-Entity Phrases from Parallel Corpora Learning Translations of Named-Entity Phrases from Parallel Corpora Robert C. Moore Microsoft Research Redmond, WA 98052, USA [email protected] Abstract We develop a new approach to learning phrase

More information

Building a Question Classifier for a TREC-Style Question Answering System

Building a Question Classifier for a TREC-Style Question Answering System Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given

More information

Effective Self-Training for Parsing

Effective Self-Training for Parsing Effective Self-Training for Parsing David McClosky [email protected] Brown Laboratory for Linguistic Information Processing (BLLIP) Joint work with Eugene Charniak and Mark Johnson David McClosky - [email protected]

More information

PROMT Technologies for Translation and Big Data

PROMT Technologies for Translation and Big Data PROMT Technologies for Translation and Big Data Overview and Use Cases Julia Epiphantseva PROMT About PROMT EXPIRIENCED Founded in 1991. One of the world leading machine translation provider DIVERSIFIED

More information

L130: Chapter 5d. Dr. Shannon Bischoff. Dr. Shannon Bischoff () L130: Chapter 5d 1 / 25

L130: Chapter 5d. Dr. Shannon Bischoff. Dr. Shannon Bischoff () L130: Chapter 5d 1 / 25 L130: Chapter 5d Dr. Shannon Bischoff Dr. Shannon Bischoff () L130: Chapter 5d 1 / 25 Outline 1 Syntax 2 Clauses 3 Constituents Dr. Shannon Bischoff () L130: Chapter 5d 2 / 25 Outline Last time... Verbs...

More information

Empirical Machine Translation and its Evaluation

Empirical Machine Translation and its Evaluation Empirical Machine Translation and its Evaluation EAMT Best Thesis Award 2008 Jesús Giménez (Advisor, Lluís Màrquez) Universitat Politècnica de Catalunya May 28, 2010 Empirical Machine Translation Empirical

More information

An Iteratively-Trained Segmentation-Free Phrase Translation Model for Statistical Machine Translation

An Iteratively-Trained Segmentation-Free Phrase Translation Model for Statistical Machine Translation An Iteratively-Trained Segmentation-Free Phrase Translation Model for Statistical Machine Translation Robert C. Moore Chris Quirk Microsoft Research Redmond, WA 98052, USA {bobmoore,chrisq}@microsoft.com

More information

Training and evaluation of POS taggers on the French MULTITAG corpus

Training and evaluation of POS taggers on the French MULTITAG corpus Training and evaluation of POS taggers on the French MULTITAG corpus A. Allauzen, H. Bonneau-Maynard LIMSI/CNRS; Univ Paris-Sud, Orsay, F-91405 {allauzen,maynard}@limsi.fr Abstract The explicit introduction

More information

Context Grammar and POS Tagging

Context Grammar and POS Tagging Context Grammar and POS Tagging Shian-jung Dick Chen Don Loritz New Technology and Research New Technology and Research LexisNexis LexisNexis Ohio, 45342 Ohio, 45342 [email protected] [email protected]

More information

Natural Language Database Interface for the Community Based Monitoring System *

Natural Language Database Interface for the Community Based Monitoring System * Natural Language Database Interface for the Community Based Monitoring System * Krissanne Kaye Garcia, Ma. Angelica Lumain, Jose Antonio Wong, Jhovee Gerard Yap, Charibeth Cheng De La Salle University

More information

ANNLOR: A Naïve Notation-system for Lexical Outputs Ranking

ANNLOR: A Naïve Notation-system for Lexical Outputs Ranking ANNLOR: A Naïve Notation-system for Lexical Outputs Ranking Anne-Laure Ligozat LIMSI-CNRS/ENSIIE rue John von Neumann 91400 Orsay, France [email protected] Cyril Grouin LIMSI-CNRS rue John von Neumann 91400

More information

Statistical Machine Translation

Statistical Machine Translation Statistical Machine Translation What works and what does not Andreas Maletti Universität Stuttgart [email protected] Stuttgart May 14, 2013 Statistical Machine Translation A. Maletti 1 Main

More information

Rule based Sentence Simplification for English to Tamil Machine Translation System

Rule based Sentence Simplification for English to Tamil Machine Translation System Volume 25 No8, July 2011 Rule based Sentence Simplification for English to Tamil Machine Translation System Poornima C, Dhanalakshmi V Computational Engineering and Networking Amrita Vishwa Vidyapeetham

More information

a Chinese-to-Spanish rule-based machine translation

a Chinese-to-Spanish rule-based machine translation Chinese-to-Spanish rule-based machine translation system Jordi Centelles 1 and Marta R. Costa-jussà 2 1 Centre de Tecnologies i Aplicacions del llenguatge i la Parla (TALP), Universitat Politècnica de

More information

Improving the Performance of English-Tamil Statistical Machine Translation System using Source-Side Pre-Processing

Improving the Performance of English-Tamil Statistical Machine Translation System using Source-Side Pre-Processing Proc. of Int. Conf. on Advances in Computer Science, AETACS Improving the Performance of English-Tamil Statistical Machine Translation System using Source-Side Pre-Processing Anand Kumar M 1, Dhanalakshmi

More information

Language Model of Parsing and Decoding

Language Model of Parsing and Decoding Syntax-based Language Models for Statistical Machine Translation Eugene Charniak ½, Kevin Knight ¾ and Kenji Yamada ¾ ½ Department of Computer Science, Brown University ¾ Information Sciences Institute,

More information

Text Generation for Abstractive Summarization

Text Generation for Abstractive Summarization Text Generation for Abstractive Summarization Pierre-Etienne Genest, Guy Lapalme RALI-DIRO Université de Montréal P.O. Box 6128, Succ. Centre-Ville Montréal, Québec Canada, H3C 3J7 {genestpe,lapalme}@iro.umontreal.ca

More information

MORPHOLOGY BASED PROTOTYPE STATISTICAL MACHINE TRANSLATION SYSTEM FOR ENGLISH TO TAMIL LANGUAGE

MORPHOLOGY BASED PROTOTYPE STATISTICAL MACHINE TRANSLATION SYSTEM FOR ENGLISH TO TAMIL LANGUAGE MORPHOLOGY BASED PROTOTYPE STATISTICAL MACHINE TRANSLATION SYSTEM FOR ENGLISH TO TAMIL LANGUAGE A Thesis Submitted for the Degree of Doctor of Philosophy in the School of Engineering by ANAND KUMAR M CENTER

More information

Example-based Translation without Parallel Corpora: First experiments on a prototype

Example-based Translation without Parallel Corpora: First experiments on a prototype -based Translation without Parallel Corpora: First experiments on a prototype Vincent Vandeghinste, Peter Dirix and Ineke Schuurman Centre for Computational Linguistics Katholieke Universiteit Leuven Maria

More information

Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing

Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing 1 Symbiosis of Evolutionary Techniques and Statistical Natural Language Processing Lourdes Araujo Dpto. Sistemas Informáticos y Programación, Univ. Complutense, Madrid 28040, SPAIN (email: [email protected])

More information

Machine Learning for natural language processing

Machine Learning for natural language processing Machine Learning for natural language processing Introduction Laura Kallmeyer Heinrich-Heine-Universität Düsseldorf Summer 2016 1 / 13 Introduction Goal of machine learning: Automatically learn how to

More information

Identifying Focus, Techniques and Domain of Scientific Papers

Identifying Focus, Techniques and Domain of Scientific Papers Identifying Focus, Techniques and Domain of Scientific Papers Sonal Gupta Department of Computer Science Stanford University Stanford, CA 94305 [email protected] Christopher D. Manning Department of

More information

Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words

Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words , pp.290-295 http://dx.doi.org/10.14257/astl.2015.111.55 Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words Irfan

More information

Phrase-Based Statistical Translation of Programming Languages

Phrase-Based Statistical Translation of Programming Languages Phrase-Based Statistical Translation of Programming Languages Svetoslav Karaivanov ETH Zurich [email protected] Veselin Raychev ETH Zurich [email protected] Martin Vechev ETH Zurich [email protected]

More information

TS3: an Improved Version of the Bilingual Concordancer TransSearch

TS3: an Improved Version of the Bilingual Concordancer TransSearch TS3: an Improved Version of the Bilingual Concordancer TransSearch Stéphane HUET, Julien BOURDAILLET and Philippe LANGLAIS EAMT 2009 - Barcelona June 14, 2009 Computer assisted translation Preferred by

More information

Ling 201 Syntax 1. Jirka Hana April 10, 2006

Ling 201 Syntax 1. Jirka Hana April 10, 2006 Overview of topics What is Syntax? Word Classes What to remember and understand: Ling 201 Syntax 1 Jirka Hana April 10, 2006 Syntax, difference between syntax and semantics, open/closed class words, all

More information

PoS-tagging Italian texts with CORISTagger

PoS-tagging Italian texts with CORISTagger PoS-tagging Italian texts with CORISTagger Fabio Tamburini DSLO, University of Bologna, Italy [email protected] Abstract. This paper presents an evolution of CORISTagger [1], an high-performance

More information

Segmentation and Punctuation Prediction in Speech Language Translation Using a Monolingual Translation System

Segmentation and Punctuation Prediction in Speech Language Translation Using a Monolingual Translation System Segmentation and Punctuation Prediction in Speech Language Translation Using a Monolingual Translation System Eunah Cho, Jan Niehues and Alex Waibel International Center for Advanced Communication Technologies

More information

Applying Statistical Post-Editing to. English-to-Korean Rule-based Machine Translation System

Applying Statistical Post-Editing to. English-to-Korean Rule-based Machine Translation System Applying Statistical Post-Editing to English-to-Korean Rule-based Machine Translation System Ki-Young Lee and Young-Gil Kim Natural Language Processing Team, Electronics and Telecommunications Research

More information

COMPUTATIONAL DATA ANALYSIS FOR SYNTAX

COMPUTATIONAL DATA ANALYSIS FOR SYNTAX COLING 82, J. Horeck~ (ed.j North-Holland Publishing Compa~y Academia, 1982 COMPUTATIONAL DATA ANALYSIS FOR SYNTAX Ludmila UhliFova - Zva Nebeska - Jan Kralik Czech Language Institute Czechoslovak Academy

More information

Integration of Content Optimization Software into the Machine Translation Workflow. Ben Gottesman Acrolinx

Integration of Content Optimization Software into the Machine Translation Workflow. Ben Gottesman Acrolinx Integration of Content Optimization Software into the Machine Translation Workflow Ben Gottesman Acrolinx What is Acrolinx? Acrolinx is Content Optimization Software. It helps authors make their text!

More information

Syntax: Phrases. 1. The phrase

Syntax: Phrases. 1. The phrase Syntax: Phrases Sentences can be divided into phrases. A phrase is a group of words forming a unit and united around a head, the most important part of the phrase. The head can be a noun NP, a verb VP,

More information

Comprendium Translator System Overview

Comprendium Translator System Overview Comprendium System Overview May 2004 Table of Contents 1. INTRODUCTION...3 2. WHAT IS MACHINE TRANSLATION?...3 3. THE COMPRENDIUM MACHINE TRANSLATION TECHNOLOGY...4 3.1 THE BEST MT TECHNOLOGY IN THE MARKET...4

More information

How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD.

How the Computer Translates. Svetlana Sokolova President and CEO of PROMT, PhD. Svetlana Sokolova President and CEO of PROMT, PhD. How the Computer Translates Machine translation is a special field of computer application where almost everyone believes that he/she is a specialist.

More information

HIERARCHICAL HYBRID TRANSLATION BETWEEN ENGLISH AND GERMAN

HIERARCHICAL HYBRID TRANSLATION BETWEEN ENGLISH AND GERMAN HIERARCHICAL HYBRID TRANSLATION BETWEEN ENGLISH AND GERMAN Yu Chen, Andreas Eisele DFKI GmbH, Saarbrücken, Germany May 28, 2010 OUTLINE INTRODUCTION ARCHITECTURE EXPERIMENTS CONCLUSION SMT VS. RBMT [K.

More information

Accelerating and Evaluation of Syntactic Parsing in Natural Language Question Answering Systems

Accelerating and Evaluation of Syntactic Parsing in Natural Language Question Answering Systems Accelerating and Evaluation of Syntactic Parsing in Natural Language Question Answering Systems cation systems. For example, NLP could be used in Question Answering (QA) systems to understand users natural

More information

GRASP: Grammar- and Syntax-based Pattern-Finder for Collocation and Phrase Learning

GRASP: Grammar- and Syntax-based Pattern-Finder for Collocation and Phrase Learning PACLIC 24 Proceedings 357 GRASP: Grammar- and Syntax-based Pattern-Finder for Collocation and Phrase Learning Mei-hua Chen a, Chung-chi Huang a, Shih-ting Huang b, and Jason S. Chang b a Institute of Information

More information

TREC 2003 Question Answering Track at CAS-ICT

TREC 2003 Question Answering Track at CAS-ICT TREC 2003 Question Answering Track at CAS-ICT Yi Chang, Hongbo Xu, Shuo Bai Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China [email protected] http://www.ict.ac.cn/

More information

Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata

Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata Generating SQL Queries Using Natural Language Syntactic Dependencies and Metadata Alessandra Giordani and Alessandro Moschitti Department of Computer Science and Engineering University of Trento Via Sommarive

More information

A Joint Sequence Translation Model with Integrated Reordering

A Joint Sequence Translation Model with Integrated Reordering A Joint Sequence Translation Model with Integrated Reordering Nadir Durrani Advisors: Alexander Fraser and Helmut Schmid Institute for Natural Language Processing University of Stuttgart Machine Translation

More information

English Descriptive Grammar

English Descriptive Grammar English Descriptive Grammar 2015/2016 Code: 103410 ECTS Credits: 6 Degree Type Year Semester 2500245 English Studies FB 1 1 2501902 English and Catalan FB 1 1 2501907 English and Classics FB 1 1 2501910

More information

Morphological Knowledge in Statistical Machine Translation. Kristina Toutanova Microsoft Research

Morphological Knowledge in Statistical Machine Translation. Kristina Toutanova Microsoft Research Morphological Knowledge in Statistical Machine Translation Kristina Toutanova Microsoft Research Morphology Morpheme the smallest unit of language that carries information about meaning or function Word

More information

Chapter 8. Final Results on Dutch Senseval-2 Test Data

Chapter 8. Final Results on Dutch Senseval-2 Test Data Chapter 8 Final Results on Dutch Senseval-2 Test Data The general idea of testing is to assess how well a given model works and that can only be done properly on data that has not been seen before. Supervised

More information

M LTO Multilingual On-Line Translation

M LTO Multilingual On-Line Translation O non multa, sed multum M LTO Multilingual On-Line Translation MOLTO Consortium FP7-247914 Project summary MOLTO s goal is to develop a set of tools for translating texts between multiple languages in

More information