Statistical Machine Translation Some of the content of this lecture is taken from previous lectures and presentations given by Philipp Koehn and Andy Way. Dr. Jennifer Foster National Centre for Language Technology School of Computing Dublin City University http://nclt.computing.dcu.ie/~jfoster Week Ten/Eleven 17th April 2014
Lecture Plan Part One: Recap of decoding Part Two: More on decoding Part Three: Encoding linguistic knowledge
Part One Recap of decoding
Questions 1. What is decoding? 2. What methods are used to make the decoding problem tractable? 3. What is the difference between histogram pruning and threshold pruning?
What is decoding? Decoding is the process of searching for the most likely translation within the space of candidate translations.
Making decoding tractable 1. Hypothesis recombination 2. Pruning
Histogram Pruning Keep a maximum of m hypotheses
Threshold Pruning Keep only those hypotheses that are within a threshold α of the best hypothesis.
Histogram Pruning Hypothesis scores in the stack: 0.01, 0.08, 0.09. How many hypotheses will be pruned if m = 1?
Threshold Pruning Hypothesis scores in the stack: 0.01, 0.08, 0.09. How many hypotheses will be pruned if we prune all hypotheses that are at least 0.5 times worse than the best hypothesis?
Threshold Pruning Hypothesis scores in the stack: 0.01, 0.08, 0.09. Threshold pruning is more flexible than histogram pruning since it takes into account how far each hypothesis's score falls below the score of the best hypothesis, rather than simply keeping a fixed number of hypotheses.
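A minimal sketch (in Python) of the two pruning strategies, assuming a hypothesis is simply a (translation, score) pair; the function names and the toy stack are illustrative only.

def histogram_prune(hypotheses, m):
    # Keep at most the m highest-scoring hypotheses.
    return sorted(hypotheses, key=lambda h: h[1], reverse=True)[:m]

def threshold_prune(hypotheses, alpha):
    # Keep only hypotheses whose score is within a factor alpha of the best score.
    best = max(score for _, score in hypotheses)
    return [(t, s) for t, s in hypotheses if s >= alpha * best]

# The toy stack from the slides above: scores 0.01, 0.08 and 0.09.
stack = [("hyp-a", 0.01), ("hyp-b", 0.08), ("hyp-c", 0.09)]
print(histogram_prune(stack, m=1))        # only the 0.09 hypothesis survives
print(threshold_prune(stack, alpha=0.5))  # 0.08 and 0.09 survive; only 0.01 is pruned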
Part Two More on decoding
Stack Decoding Algorithm Partial translations or hypotheses are grouped into stacks based on how many words in the input sentence they translate.
Stack Decoding Algorithm Partial translations or hypotheses are grouped into stacks based on how many words in the input sentence they translate. Working left to right through the stacks, the decoder takes a hypothesis from a stack and a phrase from the input sentence and tries to construct a new hypothesis.
Stack Decoding Algorithm Partial translations or hypotheses are grouped into stacks based on how many words in the input sentence they translate. Working left to right through the stacks, the decoder takes a hypothesis from a stack and a phrase from the input sentence and tries to construct a new hypothesis. The output is the highest scoring hypothesis that translates the entire input sentence.
Stack Decoding Algorithm Partial translations or hypotheses are grouped into stacks based on how many words in the input sentence they translate. Working left to right through the stacks, the decoder takes a hypothesis from a stack and a phrase from the input sentence and tries to construct a new hypothesis. The output is the highest scoring hypothesis that translates the entire input sentence. Stacks are kept to a manageable size using pruning.
Stack Decoding Pseudocode
place empty hypothesis into stack 0
for all stacks 0 ... n-1 do
  for all hypotheses in stack do
    for all translation options do
      if applicable then
        create new hypothesis
        place in stack
        recombine with existing hypothesis if possible
        prune stack if too big
      end if
    end for
  end for
end for
Stack Decoding Example Input sentence: maison bleu Goal: Translate it into English using the stack decoding algorithm with histogram pruning (maximum number of hypotheses per stack = 3)
Stack Decoding Example Phrase pairs in training data: maison → house, maison → home, bleu → blue, bleu → turquoise, maison bleu → blue house
Stack Decoding Example All possible translations: house blue, house turquoise, home blue, home turquoise, blue house, blue home, turquoise home, turquoise house
Stack Decoding Example Insert the empty hypothesis into Stack 0 (Stacks 1 and 2 are still empty).
Stack Decoding Example Construct new hypotheses compatible with the empty hypothesis. Stack 1: house, home, blue, turquoise. Stack 2: blue house.
Stack Decoding Example Prune Stack 1 (assume turquoise has the lowest probability). Stack 1: house, home, blue. Stack 2: blue house.
Stack Decoding Example Construct new hypotheses compatible with the hypotheses in Stack 1. Stack 2: house blue, house turquoise, home blue, home turquoise, blue home, blue house, plus the existing blue house (built directly from the phrase pair maison bleu → blue house).
Stack Decoding Example Perform hypothesis recombination: the two blue house hypotheses are merged (assume in this case that the shorter path hypothesis, built directly from the phrase pair maison bleu → blue house, is more likely). Stack 2: house blue, house turquoise, home blue, home turquoise, blue home, blue house.
Stack Decoding Example Prune Stack 2 (assume home blue, home turquoise and house turquoise have the lowest scores). Stack 2: house blue, blue home, blue house.
Stack Decoding Example Output the most likely translation in Stack 2 (house blue, blue home, blue house), in this example blue house.
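A heavily simplified Python sketch of the pseudocode and the worked example above: hypotheses are scored with phrase translation probabilities only (no language model or distortion cost), recombination merges identical partial translations with identical coverage, and each stack is histogram-pruned. The phrase table mirrors the phrase pairs above, but its probabilities are invented purely for illustration.

# Toy phrase table; the probabilities are invented for illustration.
PHRASES = {
    ("maison",): [("house", 0.6), ("home", 0.4)],
    ("bleu",): [("blue", 0.7), ("turquoise", 0.2)],
    ("maison", "bleu"): [("blue house", 0.5)],
}

def decode(source, max_stack_size=3):
    n = len(source)
    # stacks[k] holds hypotheses covering k source words:
    # (covered source positions, partial translation, score)
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append((frozenset(), "", 1.0))

    for k in range(n):                                    # work through the stacks left to right
        for covered, target, score in stacks[k]:
            for start in range(n):                        # try every applicable translation option
                for end in range(start + 1, n + 1):
                    span = frozenset(range(start, end))
                    phrase = tuple(source[start:end])
                    if span & covered or phrase not in PHRASES:
                        continue                          # overlaps the coverage, or unknown phrase
                    for tgt, p in PHRASES[phrase]:        # create a new hypothesis and place it
                        new_cov = covered | span
                        new_hyp = (new_cov, (target + " " + tgt).strip(), score * p)
                        stacks[len(new_cov)].append(new_hyp)

        for j in range(k + 1, n + 1):
            # Recombination: identical partial translations with the same coverage
            # are merged, keeping only the higher-scoring derivation.
            best = {}
            for cov, tgt, sc in stacks[j]:
                key = (cov, tgt)
                if key not in best or sc > best[key][2]:
                    best[key] = (cov, tgt, sc)
            # Histogram pruning: keep at most max_stack_size hypotheses per stack.
            stacks[j] = sorted(best.values(), key=lambda h: h[2],
                               reverse=True)[:max_stack_size]

    return max(stacks[n], key=lambda h: h[2])             # best complete hypothesis

print(decode(["maison", "bleu"]))   # (frozenset({0, 1}), 'blue house', 0.5)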
Limitation of Stack Decoding Algorithm Relatively low-scoring hypotheses that are part of a good translation may be pruned.
Limitation of Stack Decoding Algorithm Input: the tourism initiative addresses this for the first time. Hypothesis 1: die (tm: -0.19, lm: -0.4, d: 0, all: -0.59), touristische (tm: -1.16, lm: -2.93, d: 0, all: -4.09), initiative (tm: -1.21, lm: -4.67, d: 0, all: -5.88). Hypothesis 2: das erste mal (tm: -0.56, lm: -2.81, d: -0.74, all: -4.11).
Limitation of Stack Decoding Algorithm Input: the tourism initiative addresses this for the first time. Hypothesis 1: die touristische initiative (all: -5.88). Hypothesis 2: das erste mal (all: -4.11). Both hypotheses translate 3 words of the input, so they sit in the same stack and will be compared to each other when deciding what to prune.
Limitation of Stack Decoding Algorithm Input: the tourism initiative addresses this for the first time. Hypothesis 1: die touristische initiative (all: -5.88). Hypothesis 2: das erste mal (all: -4.11). Both hypotheses translate 3 words of the input, so they sit in the same stack and will be compared to each other when deciding what to prune. The worse hypothesis (the one starting with das erste mal) has a better score at this stage because it contains common words that are easier to translate.
Limitation of Stack Decoding Algorithm How to overcome this limitation?
Limitation of Stack Decoding Algorithm How to overcome this limitation? Estimate the future cost associated with a hypothesis by looking at the rest of the input string.
Future Cost A hypothesis score is some combination of its probability and its future cost. Why is the exact future cost not calculated?
Future Cost A hypothesis score is some combination of its probability and its future cost. Why is the exact future cost not calculated? Because calculating the exact future cost of every hypothesis would amount to evaluating the full translation cost of all hypotheses, which is the very thing we use pruning to avoid.
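One common strategy, sketched below under the simplifying assumption that only translation-model probabilities are used (real decoders also fold in a language-model estimate), is to precompute with dynamic programming a cheapest-cost estimate for every source span:

import math

def future_cost_table(source, phrases):
    # cost[i][j] = estimated cost (negative log probability) of covering source[i:j]
    n = len(source)
    cost = [[math.inf] * (n + 1) for _ in range(n + 1)]
    # Cheapest single translation option for each span.
    for i in range(n):
        for j in range(i + 1, n + 1):
            options = phrases.get(tuple(source[i:j]), [])
            if options:
                cost[i][j] = min(-math.log(p) for _, p in options)
    # A span may also be covered more cheaply by splitting it into two sub-spans.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                cost[i][j] = min(cost[i][j], cost[i][k] + cost[k][j])
    return cost

phrases = {("maison",): [("house", 0.6)], ("bleu",): [("blue", 0.7)],
           ("maison", "bleu"): [("blue house", 0.5)]}
table = future_cost_table(["maison", "bleu"], phrases)
print(table[0][2])   # cheapest estimated cost of translating the whole sentence

A hypothesis can then be ranked by its own score plus the estimated cost of the source spans it has not yet covered, so a hypothesis that has so far translated only the easy words is not unfairly favoured.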
Depth-first versus breadth-first search A distinction is made between decoding algorithms that perform a breadth-first search and those that perform a depth-first search.
Depth-first versus breadth-first search A distinction is made between decoding algorithms that perform a breadth-first search and those that perform a depth-first search. A breadth-first search explores many hypotheses in parallel, e.g. stack decoding. A depth-first search fully explores a hypothesis before moving on to the next one, e.g. A* decoding.
Search and Model Errors 1. A search error occurs when the decoder misses a translation with a higher probability than the translation it returns.
Search and Model Errors 1. A search error occurs when the decoder misses a translation with a higher probability than the translation it returns. 2. A model error occurs when the model assigns a higher probability to an incorrect translation than to a correct one.
Part Three Making Use of Linguistic Knowledge
The story so far The SMT approaches we have studied so far are language-independent.
The story so far The SMT approaches we have studied so far are language-independent. Input: 1. A monolingual corpus for training a language model 2. A sentence-aligned parallel corpus for training a translation model
The story so far The SMT approaches we have studied so far are language-independent. Input: 1. A monolingual corpus for training a language model 2. A sentence-aligned parallel corpus for training a translation model Output: A translation (of variable quality)
The old way of doing things Traditional rule-based MT systems tried to translate using hand-coded rules which encode linguistic knowledge or generalisations.
Examples of linguistic knowledge 1. In English, the noun comes after the adjective. In French, the noun usually precedes the adjective.
Examples of linguistic knowledge 1. In English, the noun comes after the adjective. In French, the noun usually precedes the adjective. 2. In German, the main verb often occurs at the end of a clause.
Examples of linguistic knowledge 1. In English, the noun comes after the adjective. In French, the noun usually precedes the adjective. 2. In German, the main verb often occurs at the end of a clause. 3. In Irish, prepositions are inflected for number and person.
Examples of linguistic knowledge 1. In English, the noun comes after the adjective. In French, the noun usually precedes the adjective. 2. In German, the main verb often occurs at the end of a clause. 3. In Irish, prepositions are inflected for number and person. 4. In many languages, the subject and verb agree in person and number.
Why bother with linguistic knowledge? Why would we want to encode linguistic knowledge in an SMT system?
Why bother with linguistic knowledge? Why would we want to encode linguistic knowledge in an SMT system? 1. Data-driven SMT works well but there is room for improvement (especially for some language pairs).
Why bother with linguistic knowledge? Why would we want to encode linguistic knowledge in an SMT system? 1. Data-driven SMT works well but there is room for improvement (especially for some language pairs). 2. Data-driven SMT relies on the availability of large corpora. For minority languages, this can be problematic.
Using linguistic knowledge Where linguistic knowledge might be useful:
Using linguistic knowledge Where linguistic knowledge might be useful: 1. More global view of fluency than an n-gram language model: the house is the man is small
Using linguistic knowledge Where linguistic knowledge might be useful: 1. More global view of fluency than an n-gram language model: the house is the man is small 2. Long-distance dependencies: make a tremendous amount of sense
Using linguistic knowledge Where linguistic knowledge might be useful: 1. More global view of fluency than an n-gram language model: the house is the man is small 2. Long-distance dependencies: make a tremendous amount of sense 3. Word compounding: Knowing that Aktionsplan is made up of the words Aktion and plan
Using linguistic knowledge Where linguistic knowledge might be useful: 1. More global view of fluency than an n-gram language model: the house is the man is small 2. Long-distance dependencies: make a tremendous amount of sense 3. Word compounding: Knowing that Aktionsplan is made up of the words Aktion and plan 4. Better handling of function words: role of de in entreprises de transports (haulage companies)
How to encode linguistic knowledge How can linguistic knowledge be encoded in an SMT system?
How to encode linguistic knowledge How can linguistic knowledge be encoded in an SMT system? 1. Before translation (pre-processing)
How to encode linguistic knowledge How can linguistic knowledge be encoded in an SMT system? 1. Before translation (pre-processing) 2. During translation
How to encode linguistic knowledge How can linguistic knowledge be encoded in an SMT system? 1. Before translation (pre-processing) 2. During translation 3. After translation (post-processing)
Where do we get this linguistic knowledge? How do we obtain linguistic knowledge automatically?
Where do we get this linguistic knowledge? How do we obtain linguistic knowledge automatically? 1. Automatic morphological analysis 2. Part-of-speech tagging 3. Syntactic parsing
Where do we get this linguistic knowledge? How do we obtain linguistic knowledge automatically? 1. Automatic morphological analysis 2. Part-of-speech tagging 3. Syntactic parsing Can you spot a potential problem here (hint: automatic)?
Syntactic Parsing Example: the sentence John loves Mary shown as a constituency tree (S → NP VP; NNP John, VBP loves, NNP Mary) and as a dependency tree (nsubj: loves → John, dobj: loves → Mary).
Pre-processing Encoding linguistic knowledge before translation 1. Word segmentation 2. Word lemmatisation 3. Re-ordering
Word Segmentation The notion of what a word is varies from language to language.
Word Segmentation The notion of what a word is varies from language to language. In some languages, e.g. Chinese, words are not separated by spaces.
Word Segmentation The notion of what a word is varies from language to language. In some languages, e.g. Chinese, words are not separated by spaces. Why is this a problem for machine translation?
Word Segmentation The notion of what a word is varies from language to language. In some languages, e.g. Chinese, words are not separated by spaces. Why is this a problem for machine translation? Potential solution: try to perform automatic word segmentation before translation.
Word Segmentation The notion of what a word is varies from language to language. In some languages, e.g. Chinese, words are not separated by spaces. Why is this a problem for machine translation? Potential solution: try to perform automatic word segmentation before translation. A related challenge is word compounding in German.
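As an illustration only (this is not the segmenter used by any particular SMT system), greedy maximum matching against a word list is one of the simplest ways to insert word boundaries; the tiny vocabulary below is invented.

def max_match(text, vocab, max_len=4):
    # Greedy left-to-right maximum matching segmentation.
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:   # fall back to single characters
                words.append(piece)
                i += length
                break
    return words

vocab = {"北京", "大学", "北京大学"}              # toy vocabulary
print(max_match("北京大学", vocab))              # ['北京大学']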
Word Lemmatisation Languages with a rich inflectional system can have several different variants of the same root form or lemma.
Word Lemmatisation Languages with a rich inflectional system can have several different variants of the same root form or lemma. Example: Latin: bellum, belli, bello, bella, bellis, bellorum English: war, wars
Word Lemmatisation Languages with a rich inflectional system can have several different variants of the same root form or lemma. Example: Latin: bellum, belli, bello, bella, bellis, bellorum English: war, wars Why is this a problem for translation from an MRL (morphologically rich language) into English?
Word Lemmatisation Languages with a rich inflectional system can have several different variants of the same root form or lemma. Example: Latin: bellum, belli, bello, bella, bellis, bellorum English: war, wars Why is this a problem for translation from an MRL (morphologically rich language) into English? Potential solution: before translation, replace full form with lemma
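A minimal sketch of pre-translation lemmatisation, assuming a hand-written lookup table for the Latin example above; a real system would use an automatic morphological analyser.

# Hypothetical lemma table for the Latin example above.
LEMMAS = {"bellum": "bellum", "belli": "bellum", "bello": "bellum",
          "bella": "bellum", "bellis": "bellum", "bellorum": "bellum"}

def lemmatise(tokens):
    # Replace each full form with its lemma before handing it to the SMT system.
    return [LEMMAS.get(tok, tok) for tok in tokens]

print(lemmatise(["bellorum", "et", "belli"]))   # ['bellum', 'et', 'bellum']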
Syntactic Re-ordering It is harder to translate language pairs with very different word order.
Syntactic Re-ordering It is harder to translate language pairs with very different word order. Language model and phrase-based translation model can only do so much.
Syntactic Re-ordering It is harder to translate language pairs with very different word order. Language model and phrase-based translation model can only do so much. Potential solution: transform the source sentence so that its word order more closely resembles the word order of the target language.
Syntactic Re-ordering It is harder to translate language pairs with very different word order. Language model and phrase-based translation model can only do so much. Potential solution: transform the source sentence so that its word order more closely resembles the word order of the target language. Consider translating from English to German: the source sentence I walked to the shop yesterday could be re-ordered to I yesterday to the shop walked before translation.
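As a toy illustration of source-side re-ordering, the rule below simply moves the first verb of a POS-tagged English clause to clause-final position; real pre-ordering systems derive much richer rules from parse trees, and this toy rule does not reproduce the adverb placement in the example above.

def verb_final(tagged):
    # Move the first verb of the clause to the end, mimicking German verb-final order.
    verbs = [i for i, (_, pos) in enumerate(tagged) if pos.startswith("VB")]
    if not verbs:
        return tagged
    i = verbs[0]
    return tagged[:i] + tagged[i + 1:] + [tagged[i]]

sentence = [("I", "PRP"), ("walked", "VBD"), ("to", "TO"),
            ("the", "DT"), ("shop", "NN"), ("yesterday", "RB")]
print(" ".join(w for w, _ in verb_final(sentence)))
# I to the shop yesterday walked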
During Translation Encoding linguistic knowledge during translation
During Translation Encoding linguistic knowledge during translation 1. Log-linear models
During Translation Encoding linguistic knowledge during translation 1. Log-linear models 2. Tree-based models
During Translation Encoding linguistic knowledge during translation 1. Log-linear models 2. Tree-based models 3. Syntactic language models
Log-linear models The log-linear model of translation allows several sources of information to be employed simultaneously during the translation process: p(x) = exp(∑_{i=1}^{n} λ_i h_i(x))
Log-linear models The log-linear model of translation allows several sources of information to be employed simultaneously during the translation process: p(x) = exp(∑_{i=1}^{n} λ_i h_i(x)) Some of this information can be linguistic knowledge, e.g. instead of just calculating p(ocras | hungry) we can also calculate p(noun | adj).
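A minimal sketch of how feature functions h_i(x) and weights λ_i combine in the log-linear score; the feature names, values and weights below are hypothetical, and in practice the weights are tuned on held-out data (e.g. with minimum error rate training).

import math

def loglinear_score(features, weights):
    # Combine feature function values h_i(x) with weights lambda_i.
    return math.exp(sum(weights[name] * value for name, value in features.items()))

# Hypothetical feature values for one candidate translation: log probabilities from
# the translation model, the language model and a POS-based feature such as p(noun | adj).
features = {"tm": math.log(0.08), "lm": math.log(0.05), "pos": math.log(0.3)}
weights  = {"tm": 1.0, "lm": 0.8, "pos": 0.5}
print(loglinear_score(features, weights))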
Tree-based models A more radical approach is to replace the phrase-based translation model with a different type of translation model entirely.
Tree-based models A more radical approach is to replace the phrase-based translation model with a different type of translation model entirely. One such model is a tree-based model, which is based on the notion of a syntax tree and which, in theory, allows one to better capture structural similarities and divergences between languages.
Tree-based models A more radical approach is to replace the phrase-based translation model with a different type of translation model entirely. One such model is a tree-based model, which is based on the notion of a syntax tree and which, in theory, allows one to better capture structural similarities and divergences between languages. During decoding, syntax trees for the source and target sentences are built up simultaneously (or for one side only).
Syntactic language models N-gram language models are not the only language models.
Syntactic language models N-gram language models are not the only language models. People have experimented with other kinds of language models, including those based on the notion of a syntax tree.
Syntactic language models N-gram language models are not the only language models. People have experimented with other kinds of language models, including those based on the notion of a syntax tree. Instead of computing the probability of a sentence by multiplying n-gram probabilities, the probability is computed by multiplying the probabilities of the rules that are used to build the syntax tree(s).
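As a sketch, assuming a probabilistic context-free grammar with invented rule probabilities, the score of a sentence under such a syntactic language model could be computed by multiplying the probabilities of the rules in its parse tree:

import math

# Hypothetical PCFG rule probabilities (the numbers are invented).
RULE_PROB = {
    ("S", ("NP", "VP")): 0.9,
    ("NP", ("NNP",)): 0.3,
    ("VP", ("VBP", "NP")): 0.4,
    ("NNP", ("John",)): 0.01,
    ("VBP", ("loves",)): 0.05,
    ("NNP", ("Mary",)): 0.01,
}

def tree_log_prob(tree):
    # Probability of a sentence = product over the rules used in its parse tree.
    label, children = tree
    if all(isinstance(c, str) for c in children):          # pre-terminal rule
        return math.log(RULE_PROB[(label, tuple(children))])
    rhs = tuple(child[0] for child in children)
    return math.log(RULE_PROB[(label, rhs)]) + sum(tree_log_prob(c) for c in children)

parse = ("S", [("NP", [("NNP", ["John"])]),
               ("VP", [("VBP", ["loves"]), ("NP", [("NNP", ["Mary"])])])])
print(tree_log_prob(parse))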
Post-processing SMT systems return a ranked list of candidate translations. Linguistic information (of all kinds) can be employed in re-ranking this list using machine learning techniques.
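A minimal sketch of re-ranking an n-best list with a simple linear combination of the decoder score and an additional (e.g. syntactic) feature; the candidates, feature names and weights below are hypothetical, and a real re-ranker would learn its weights with a machine learning method.

def rerank(nbest, weights):
    # Re-score each candidate as a weighted sum of its features and sort by that score.
    def score(candidate):
        _, feats = candidate
        return sum(weights[k] * v for k, v in feats.items())
    return sorted(nbest, key=score, reverse=True)

nbest = [
    ("blue house", {"decoder": -2.1, "parse_prob": -4.0}),
    ("house blue", {"decoder": -1.9, "parse_prob": -9.5}),
]
weights = {"decoder": 1.0, "parse_prob": 0.3}
print(rerank(nbest, weights)[0][0])    # 'blue house'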