Chapter 5. Phrase-based models. Statistical Machine Translation




Motivation

- Word-based models translate words as atomic units
- Phrase-based models translate phrases as atomic units
- Advantages:
  - many-to-many translation can handle non-compositional phrases
  - use of local context in translation
  - the more data, the longer the phrases that can be learned
- Standard model, used by Google Translate and others

Phrase-Based Model

- Foreign input is segmented into phrases
- Each phrase is translated into English
- Phrases are reordered

Phrase Translation Table

- Main knowledge source: table with phrase translations and their probabilities
- Example: phrase translations for natuerlich

  Translation     Probability φ(ē|f̄)
  of course       0.5
  naturally       0.3
  of course ,     0.15
  , of course ,   0.05

Real Example

- Phrase translations for den Vorschlag learned from the Europarl corpus:

  English            φ(ē|f̄)    English            φ(ē|f̄)
  the proposal       0.6227    the suggestions    0.0114
  's proposal        0.1068    the proposed       0.0114
  a proposal         0.0341    the motion         0.0091
  the idea           0.0250    the idea of        0.0091
  this proposal      0.0227    the proposal ,     0.0068
  proposal           0.0205    its proposal       0.0068
  of the proposal    0.0159    it                 0.0068
  the proposals      0.0159    ...                ...

- lexical variation (proposal vs suggestions)
- morphological variation (proposal vs proposals)
- included function words (the, a, ...)
- noise (it)

Linguistic Phrases?

- Model is not limited to linguistic phrases (noun phrases, verb phrases, prepositional phrases, ...)
- Example non-linguistic phrase pair: spass am / fun with the
- Prior noun often helps with translation of the preposition
- Experiments show that limitation to linguistic phrases hurts quality

Probabilistic Model

- Bayes rule:

  e_best = argmax_e p(e|f)
         = argmax_e p(f|e) p_LM(e)

  with translation model p(f|e) and language model p_LM(e)

- Decomposition of the translation model:

  p(f̄_1^I | ē_1^I) = ∏_{i=1}^I φ(f̄_i|ē_i) d(start_i - end_{i-1} - 1)

  with phrase translation probability φ and reordering probability d

Distance-Based Reordering

- Example (figure: four English phrases translating foreign positions 1-7 out of order):

  phrase  translates  movement            distance
  1       1-3         start at beginning    0
  2       6           skip over 4-5        +2
  3       4-5         move back over 4-6   -3
  4       7           skip over 6          +1

- Scoring function: d(x) = α^|x| (exponential with distance)
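A minimal sketch of the distance-based reordering score d(x) = α^|x|, where the movement distance is x = start_i - end_{i-1} - 1. The function names and the decay value alpha = 0.75 are illustrative choices, not from any particular toolkit.

```python
def movement_distance(start_i, end_prev):
    """Distance skipped between consecutive phrases (0 = monotone order)."""
    return start_i - end_prev - 1

def reordering_score(start_i, end_prev, alpha=0.75):
    """d(x) = alpha^|x|: exponential penalty in the movement distance."""
    return alpha ** abs(movement_distance(start_i, end_prev))

# The slide's example: phrases covering foreign positions 1-3, 6, 4-5, 7
spans = [(1, 3), (6, 6), (4, 5), (7, 7)]
distances = []
end_prev = 0
for start, end in spans:
    distances.append(movement_distance(start, end_prev))
    end_prev = end
# distances reproduces the table above: [0, 2, -3, 1]
```

Note that monotone translation (distance 0) gets the maximal score 1, and the penalty decays symmetrically for forward and backward jumps.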

Learning a Phrase Translation Table

- Task: learn the model from a parallel corpus
- Three stages:
  - word alignment: using IBM models or other method
  - extraction of phrase pairs
  - scoring phrase pairs

Word Alignment

- Example sentence pair (alignment matrix figure):
  michael geht davon aus , dass er im haus bleibt
  michael assumes that he will stay in the house

Extracting Phrase Pairs

- (alignment matrix figure for: michael geht davon aus , dass er im haus bleibt / michael assumes that he will stay in the house)
- Extract phrase pair consistent with word alignment: assumes that / geht davon aus , dass

Consistent

- (figure: three example phrase pairs)
  - ok: all alignment points of the pair lie inside it
  - violated: one alignment point outside the pair
  - ok: an unaligned word inside the pair is fine
- All words of the phrase pair have to align to each other.

Consistent

Phrase pair (ē, f̄) is consistent with an alignment A if all words f_1, ..., f_n in f̄ that have alignment points in A have these with words e_1, ..., e_n in ē and vice versa:

(ē, f̄) consistent with A ⇔
  ∀ e_i ∈ ē : (e_i, f_j) ∈ A ⇒ f_j ∈ f̄
  and ∀ f_j ∈ f̄ : (e_i, f_j) ∈ A ⇒ e_i ∈ ē
  and ∃ e_i ∈ ē, f_j ∈ f̄ : (e_i, f_j) ∈ A
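The definition above can be sketched directly in code. This assumes the phrase pair is given as 0-based index spans and the alignment A as a set of (e, f) index pairs; the function name is illustrative.

```python
def consistent(A, e_start, e_end, f_start, f_end):
    """True iff no alignment point links a word inside the phrase pair to a
    word outside it, and at least one alignment point lies inside the pair."""
    has_inside_point = False
    for e, f in A:
        e_in = e_start <= e <= e_end
        f_in = f_start <= f <= f_end
        if e_in != f_in:          # point crosses the phrase-pair boundary
            return False
        if e_in:                  # here e_in == f_in, so the point is inside
            has_inside_point = True
    return has_inside_point

# Toy alignment: e0-f0, and e1 aligned to both f1 and f2
A = {(0, 0), (1, 1), (1, 2)}
```

The three figure cases map onto this check: a pair covering all points of its words is consistent, a pair cutting off one point is violated, and a pair that merely contains an unaligned word is still consistent.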

Phrase Pair Extraction

- Smallest phrase pairs:
  - michael → michael
  - assumes → geht davon aus / geht davon aus ,
  - that → dass / , dass
  - he → er
  - will stay → bleibt
  - in the → im
  - house → haus
- Unaligned words (here: the German comma) lead to multiple translations

Larger Phrase Pairs

- michael assumes → michael geht davon aus / michael geht davon aus ,
- assumes that → geht davon aus , dass
- assumes that he → geht davon aus , dass er
- that he → dass er / , dass er
- in the house → im haus
- michael assumes that → michael geht davon aus , dass
- michael assumes that he → michael geht davon aus , dass er
- michael assumes that he will stay in the house → michael geht davon aus , dass er im haus bleibt
- assumes that he will stay in the house → geht davon aus , dass er im haus bleibt
- that he will stay in the house → dass er im haus bleibt / , dass er im haus bleibt
- he will stay in the house → er im haus bleibt
- will stay in the house → im haus bleibt
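The extraction loop behind these slides can be sketched as follows: enumerate target spans, project each onto the source side, and keep only spans consistent with the alignment. For brevity this sketch does not expand pairs over neighbouring unaligned words (the step that produces the multiple comma variants above); names and the sentence data follow the slides' michael/geht example.

```python
E = "michael assumes that he will stay in the house".split()
F = "michael geht davon aus , dass er im haus bleibt".split()
# Word alignment from the slides, as (english_index, foreign_index) pairs;
# the German comma (index 4) is unaligned.
A = {(0, 0), (1, 1), (1, 2), (1, 3), (2, 5), (3, 6),
     (4, 9), (5, 9), (6, 7), (7, 7), (8, 8)}

def extract_phrases(e_words, f_words, alignment, max_len=7):
    """All phrase pairs (up to max_len words) consistent with the alignment."""
    pairs = set()
    for e1 in range(len(e_words)):
        for e2 in range(e1, min(e1 + max_len, len(e_words))):
            aligned_f = [f for (e, f) in alignment if e1 <= e <= e2]
            if not aligned_f:
                continue
            f1, f2 = min(aligned_f), max(aligned_f)
            if f2 - f1 + 1 > max_len:
                continue
            # consistency: no source word in [f1, f2] aligns outside [e1, e2]
            if any(f1 <= f <= f2 and not e1 <= e <= e2 for (e, f) in alignment):
                continue
            pairs.add((" ".join(e_words[e1:e2 + 1]),
                       " ".join(f_words[f1:f2 + 1])))
    return pairs

pairs = extract_phrases(E, F, A)
```

Single words like "will" are not extracted on their own here, because the alignment point (will stay → bleibt) would link inside the pair to outside it.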

Scoring Phrase Translations

- Phrase pair extraction: collect all phrase pairs from the data
- Phrase pair scoring: assign probabilities to phrase translations
- Score by relative frequency:

  φ(f̄|ē) = count(ē, f̄) / Σ_{f̄_i} count(ē, f̄_i)
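Relative-frequency scoring can be sketched as counting extracted pairs and normalizing by the count of the English side. The toy pair counts here are made up for illustration, not real Europarl statistics.

```python
from collections import Counter

def score_phrases(extracted):
    """phi(fbar | ebar) = count(ebar, fbar) / sum over fbar_i of count(ebar, fbar_i)."""
    pair_counts = Counter(extracted)
    e_counts = Counter(e for e, f in extracted)
    return {(e, f): n / e_counts[e] for (e, f), n in pair_counts.items()}

# Toy extracted-pair multiset (counts are invented)
extracted = [("the proposal", "den Vorschlag")] * 3 + \
            [("the proposal", "dem Vorschlag")]
phi = score_phrases(extracted)
```

By construction the scores for a fixed English phrase sum to one, which is what makes φ a conditional probability distribution.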

Size of the Phrase Table

- Phrase translation table typically bigger than corpus
  ... even with limits on phrase lengths (e.g., max 7 words)
- Too big to store in memory?
- Solution for training: extract to disk, sort, construct for one source phrase at a time
- Solutions for decoding:
  - on-disk data structures with index for quick look-ups
  - suffix arrays to create phrase pairs on demand

Weighted Model

- The standard model described so far consists of three sub-models:
  - phrase translation model φ(f̄|ē)
  - reordering model d
  - language model p_LM(e)

  e_best = argmax_e ∏_{i=1}^I φ(f̄_i|ē_i) d(start_i - end_{i-1} - 1) ∏_{i=1}^{|e|} p_LM(e_i | e_1 ... e_{i-1})

- Some sub-models may be more important than others
- Add weights λ_φ, λ_d, λ_LM:

  e_best = argmax_e ∏_{i=1}^I φ(f̄_i|ē_i)^{λ_φ} d(start_i - end_{i-1} - 1)^{λ_d} ∏_{i=1}^{|e|} p_LM(e_i | e_1 ... e_{i-1})^{λ_LM}

Log-Linear Model

- Such a weighted model is a log-linear model:

  p(x) = exp Σ_{i=1}^n λ_i h_i(x)

- Our feature functions:
  - number of feature functions n = 3
  - random variable x = (e, f, start, end)
  - feature function h_1 = log φ
  - feature function h_2 = log d
  - feature function h_3 = log p_LM

Weighted Model as Log-Linear Model

  p(e, a|f) = exp( λ_φ Σ_{i=1}^I log φ(f̄_i|ē_i)
                 + λ_d Σ_{i=1}^I log d(a_i - b_{i-1} - 1)
                 + λ_LM Σ_{i=1}^{|e|} log p_LM(e_i | e_1 ... e_{i-1}) )
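The identity on this slide amounts to scoring the weighted product model in log space as a dot product of weights with log feature values. The following sketch uses made-up feature values and weights purely for illustration.

```python
import math

def log_linear(feature_logs, weights):
    """log p(x) (up to normalization) = sum_i lambda_i * h_i(x)."""
    return sum(weights[k] * h for k, h in feature_logs.items())

# Invented log feature values for one candidate translation
feature_logs = {
    "phi": math.log(0.5),    # h1 = log phi(fbar | ebar)
    "d":   math.log(0.75),   # h2 = log d(distance)
    "lm":  math.log(0.02),   # h3 = log p_LM(e)
}
weights = {"phi": 1.0, "d": 0.5, "lm": 1.5}
score = log_linear(feature_logs, weights)
# exp(score) equals phi^1.0 * d^0.5 * p_LM^1.5, the weighted product model
```

Working in log space also avoids numerical underflow when many small probabilities are multiplied during decoding.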

More Feature Functions

- Bidirectional alignment probabilities: φ(ē|f̄) and φ(f̄|ē)
- Rare phrase pairs have unreliable phrase translation probability estimates
  → lexical weighting with word translation probabilities
- Example (alignment figure): NULL does not assume / geht nicht davon aus

  lex(ē|f̄, a) = ∏_{i=1}^{length(ē)} 1/|{j | (i, j) ∈ a}| Σ_{(i,j) ∈ a} w(e_i|f_j)
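The lexical weighting formula can be sketched as below. The word translation probabilities w(e|f) here are invented numbers; target words without an alignment point are scored against NULL, as in the slide's "does → NULL" example.

```python
def lexical_weight(e_words, f_words, a, w):
    """lex(ebar | fbar, a): for each target word, average w(e_i|f_j) over its
    alignment points (or use w(e_i|NULL) if unaligned), then multiply."""
    total = 1.0
    for i, e in enumerate(e_words):
        js = [j for (i2, j) in a if i2 == i]
        if js:
            total *= sum(w.get((e, f_words[j]), 0.0) for j in js) / len(js)
        else:
            total *= w.get((e, "NULL"), 0.0)
    return total

e = "does not assume".split()
f = "geht nicht davon aus".split()
a = {(1, 1), (2, 0), (2, 2), (2, 3)}   # "does" is unaligned -> NULL
w = {("does", "NULL"): 0.1, ("not", "nicht"): 0.8,          # invented values
     ("assume", "geht"): 0.3, ("assume", "davon"): 0.2, ("assume", "aus"): 0.4}
lex = lexical_weight(e, f, a, w)
# 0.1 * 0.8 * (0.3 + 0.2 + 0.4) / 3 = 0.024
```

Because it decomposes into word probabilities, this score stays informative even for phrase pairs seen only once, which is exactly the rare-pair problem it addresses.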

More Feature Functions

- Language model has a bias towards short translations
  → word count: wc(e) = log |e|^ω
- We may prefer finer or coarser segmentation
  → phrase count: pc(e) = log I^ρ
- Multiple language models
- Multiple translation models
- Other knowledge sources

Lexicalized Reordering

- Distance-based reordering model is weak
  → learn a reordering preference for each phrase pair
- Three orientation types: (m) monotone, (s) swap, (d) discontinuous

  orientation ∈ {m, s, d}
  p_o(orientation | f̄, ē)

Learning Lexicalized Reordering?? Collect orientation information during phrase pair extraction if word alignment point to the top left exists monotone if a word alignment point to the top right exists swap if neither a word alignment point to top left nor to the top right exists neither monotone nor swap discontinuous Chapter 5: Phrase-Based Models 23

Learning Lexicalized Reordering

- Estimation by relative frequency:

  p_o(orientation) = Σ_{f̄} Σ_{ē} count(orientation, ē, f̄) / ( Σ_o Σ_{f̄} Σ_{ē} count(o, ē, f̄) )

- Smoothing with the unlexicalized orientation model p(orientation) to avoid zero probabilities for unseen orientations:

  p_o(orientation | f̄, ē) = ( σ p(orientation) + count(orientation, ē, f̄) ) / ( σ + Σ_o count(o, ē, f̄) )
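The smoothed estimate above can be sketched as follows. Here `counts` maps (orientation, ē, f̄) to an observed count, σ is the smoothing strength (the value 0.5 and all counts are made up), and p(orientation) is the unlexicalized relative frequency.

```python
def orientation_model(counts, sigma=0.5):
    """Return p_o(orientation | e, f) with add-sigma smoothing toward the
    unlexicalized orientation distribution p(orientation)."""
    orientations = ("m", "s", "d")
    total = sum(counts.values())
    p_global = {o: sum(n for (o2, _, _), n in counts.items() if o2 == o) / total
                for o in orientations}

    def p_o(o, e, f):
        pair_total = sum(counts.get((o2, e, f), 0) for o2 in orientations)
        return (sigma * p_global[o] + counts.get((o, e, f), 0)) / (sigma + pair_total)

    return p_o

# Toy orientation counts (invented)
counts = {("m", "the", "le"): 8, ("s", "the", "le"): 2,
          ("d", "big", "grand"): 1, ("m", "house", "maison"): 1}
p_o = orientation_model(counts)
```

The smoothing term guarantees that an orientation never seen with a given phrase pair, such as discontinuous for the/le here, still receives a small nonzero probability.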

EM Training of the Phrase Model

- We presented a heuristic set-up to build the phrase translation table (word alignment, phrase extraction, phrase scoring)
- Alternative: align phrase pairs directly with the EM algorithm
  - initialization: uniform model, all φ(ē, f̄) are the same
  - expectation step: estimate the likelihood of all possible phrase alignments for all sentence pairs
  - maximization step: collect counts for phrase pairs (ē, f̄), weighted by alignment probability, and update the phrase translation probabilities p(ē, f̄)
- However: the method easily overfits (learns very large phrase pairs, spanning entire sentences)

Summary

- Phrase model
- Training the model:
  - word alignment
  - phrase pair extraction
  - phrase pair scoring
- Log-linear model:
  - sub-models as feature functions
  - lexical weighting
  - word and phrase count features
- Lexicalized reordering model
- EM training of the phrase model