Statistical Machine Translation




Statistical Machine Translation, Lecture 6: Phrase-Based Models (April 19, 2010)

Brief Outline
- Introduction
- Mathematical Definition
- Phrase Translation Table
- Consistency
- Phrase Extraction
- Translation Probabilities
- Reordering Models

Introduction

Question: Is the word model the right one?

1. Often a group of words taken together is translated differently, e.g.:
My daughter is the apple of my eye. >> mia figlia è mela dei miei occhi
He was beaten black and blue. >> di essere stato picchiato nero e blu
2. A polysemic word (e.g. fan, bat, bank) may be translated better if it is translated in context.

All these translations break down at the word level.

Introduction

The solution is a phrase-based model.

Note: A phrase here is NOT what a parser defines by syntactic role (NP, VP). Rather, it is a multi-word unit: a sequence of words. When such sets of consecutive words occur frequently and statistics about their translations are collected, this may lead to better translations. Still there can be problems, as we see in the following examples.

Introduction

English >> Italian:
The fan is running >> Il ventilatore è in funzione
The fan is on >> Il ventilatore è in
The fan is on table >> Il ventilatore è in tavola
The fan is on the table >> La ventola è sul tavolo
The fan is on the desk >> Il ventilatore è sulla sedia
The fan is on test >> La ventola è in prova
The fan is waving a flag >> la ventola è sventolare una bandiera

English >> Hindi:
The fan is >> prashansak hai
The fan is running >> pankhaa chal rahaa hai
The fan is on >> prashansak par hai
The fan is on the table >> prashansak mej par hai
The fan is on the chair >> prashansak kursii par hai
The fan is on test >> prashansak parikshaa par hai
The fan is waiving a flag >> prashansak ek jhandaa lahraate hai

What are the problems?
- Semantics is not always preserved.
- The translation changes when an article is introduced.
- The preposition changes with the article.
- Identifying phrase boundaries is difficult.

Still, a phrase-based approach makes more sense than a purely word-based one.

Mathematical Definition

Here we improve upon the Bayesian model:

$$e_{\text{best}} = \arg\max_e p(e \mid f) = \arg\max_e p(f \mid e)\, p(e)$$

For the phrase-based model, the term $p(f \mid e)$ is broken down further. Let $f$ be split into $I$ phrases $\bar{f}_1, \bar{f}_2, \ldots, \bar{f}_I$. The segmentation itself is not modeled explicitly, so all segmentations are taken to be equally likely.

Mathematical Definition

Each of the foreign phrases $\bar{f}_1, \bar{f}_2, \ldots, \bar{f}_I$ is translated into a corresponding English phrase $\bar{e}_i$. Hence we get:

$$p(f \mid e) = \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)\, pd(d_i)$$

- $pd$ is the relative distortion probability.
- $d_i$ is the relative distortion of the $i$-th phrase. This is needed because the phrases need not appear in the same order in $f$ as in $e$.

Mathematical Definition

Example:
he has gone into the room >> woh kamre mein chalaa gayaa
he peeped into the room >> woh kamre mein jhankaa

There should be a way to handle this reordering. Often $pd$ is taken as a decaying function, $pd(x) = \alpha^{|x|}$ with $\alpha \in (0, 1)$, although it is possible to estimate the distortion distribution from a corpus. However, that is not easy.

Mathematical Definition

A simple measure for $d_i$ is $d_i = \text{start}_i - \text{end}_{i-1} - 1$.
- It counts the number of $f$ words skipped when phrases are taken out of sequence.
- It is computed on the basis of the $e$ phrases and the $f$ word positions.

Note: without any reordering, $\text{start}_i = \text{end}_{i-1} + 1$, so $d_i = 0$.

Example: natuerlich hat john spass am spiel (DE) >> of course john has fun with the game
d1 = 0, d2 = 1, d3 = -2, d4 = 1, d5 = 0

Calculation of d_i

Phrase No. | Translates f word(s) | Skips                   | d_i
1          | 1                    | start at the beginning  | 0
2          | 3                    | phrase 2                | 1
3          | 2                    | moves back over 3 & 2   | -2
4          | 4-5                  | phrase 3                | 1
5          | 6                    | start normally          | 0

Exercise: try computing the d_i's for the following:
the tall boy riding a bicycle is singing loud songs >> Il ragazzo alto in sella a una bicicletta è cantare le canzoni ad alto volume
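As an illustration (a minimal Python sketch, not part of the original slides), the d_i values can be computed directly from the foreign-side spans covered by each phrase; the span list below encodes the German example above and reproduces d1..d5:

```python
def distortion_values(foreign_spans):
    """Compute d_i = start_i - end_{i-1} - 1 for each phrase.

    foreign_spans: list of (start, end) positions (1-based, inclusive) of the
    foreign words covered by the i-th English phrase, in the order the
    English phrases are produced.
    """
    ds = []
    prev_end = 0          # so the first phrase gets d_1 = start_1 - 1
    for start, end in foreign_spans:
        ds.append(start - prev_end - 1)
        prev_end = end
    return ds

# natuerlich(1) hat(2) john(3) spass(4) am(5) spiel(6)
# English phrase order: of course / john / has / fun with the / game
spans = [(1, 1), (3, 3), (2, 2), (4, 5), (6, 6)]
print(distortion_values(spans))   # [0, 1, -2, 1, 0]
```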

Phrase Translation Table

The table gives information about how phrases are translated. The phrase translations are extracted from word alignments, which can be done in many ways. We shall use an algorithm based on consistency. The following algorithm is built on top of word alignments in a parallel corpus; it extracts phrase pairs that are consistent with the word alignment.

Q: Can we develop a good algorithm for phrase alignment directly?

What is Consistency?

Definition: A phrase pair ($\bar{f}$, $\bar{e}$) is said to be consistent with respect to an alignment A if all words in $\bar{f}$ that have alignment points are aligned only to words in $\bar{e}$, and vice versa, and the pair contains at least one alignment point:

$$(\bar{e}, \bar{f}) \text{ consistent with } A \Leftrightarrow
\forall e_i \in \bar{e}: (e_i, f_j) \in A \Rightarrow f_j \in \bar{f}
\;\wedge\;
\forall f_j \in \bar{f}: (e_i, f_j) \in A \Rightarrow e_i \in \bar{e}
\;\wedge\;
\exists e_i \in \bar{e}, f_j \in \bar{f}: (e_i, f_j) \in A$$

What is Consistency?

Example: The boy is singing loud songs >> Il ragazzo sta cantando le canzoni ad alto volume

The boy >> Il ragazzo
Is singing >> sta cantando
Loud >> forte
Songs >> canzoni
Loud songs >> canzoni ad alto volume
Singing loud songs >> cantando canzoni ad alto volume

Is "loud songs >> canzoni ad alto volume" inconsistent?
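The definition can be checked mechanically. Below is a minimal Python sketch (the function and variable names are my own, not from the lecture) that tests whether a candidate span pair is consistent with an alignment:

```python
def is_consistent(e_span, f_span, alignment):
    """Check whether the phrase pair (e_span, f_span) is consistent with A.

    e_span, f_span: (start, end) index ranges (inclusive) on the English and
    foreign side; alignment: set of (e_index, f_index) alignment points.
    """
    es, ee = e_span
    fs, fe = f_span
    inside = [(e, f) for (e, f) in alignment
              if es <= e <= ee and fs <= f <= fe]
    if not inside:                     # at least one alignment point required
        return False
    for e, f in alignment:
        # an English word inside the pair aligned to a foreign word outside
        if es <= e <= ee and not (fs <= f <= fe):
            return False
        # a foreign word inside the pair aligned to an English word outside
        if fs <= f <= fe and not (es <= e <= ee):
            return False
    return True
```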

Phrase Extraction

We need an algorithm to extract consistent phrases:
- It should go over all possible English sub-spans (candidate phrases ē).
- For each, it extracts the minimal foreign phrase.
- Matching is done by identifying all alignment points for ē and then identifying the shortest f̄ that includes all those alignment points.

Notes:
- If ē contains only unaligned words, then there is no f̄.
- If the matched f̄ has additional alignment points outside ē, then the pair cannot be extracted.
- If the minimal phrase borders unaligned words, then the extended phrase is also extracted.

Phrase Extraction Example (from Philipp Koehn)

English: michael assumes that he will stay in the house
German: michael geht davon aus, dass er im haus bleibt

[Word alignment matrix between the two sentences]

Phrase Extraction Example

The extracted phrases are (from Philipp Koehn):
michael >> michael
michael assumes >> michael geht davon aus ; michael geht davon aus ,
michael assumes that >> michael geht davon aus , dass
michael assumes that he >> michael geht davon aus , dass er
michael assumes that he will stay in the house >> michael geht davon aus , dass er im haus bleibt
assumes >> geht davon aus ; geht davon aus ,
assumes that >> geht davon aus , dass

Phrase Extraction Example (continued)

assumes that he >> geht davon aus , dass er
assumes that he will stay in the house >> geht davon aus , dass er im haus bleibt
that >> dass ; , dass
that he >> dass er ; , dass er
that he will stay in the house >> dass er im haus bleibt ; , dass er im haus bleibt
he >> er
he will stay in the house >> er im haus bleibt
will stay >> bleibt
will stay in the house >> im haus bleibt
in the >> im
in the house >> im haus
house >> haus

Phrase Extraction Algorithm

Input: word alignment A for sentence pair (e, f)
Output: set of phrase pairs PP

for es = 1 to m                  // m is the length of e, n is the length of f
  for ee = es to m
    // find the minimally matching phrase of f
    fs = n; fe = 0
    for all (e, f) in A
      if es <= e <= ee then
        fs = min(f, fs)
        fe = max(f, fe)
    PP = PP ∪ extractFrom(fs, fe, es, ee)

Phrase Extraction Algorithm

function extractFrom(fs, fe, es, ee)
  if fe = 0 then return {}                        // no aligned word in e[es..ee]
  for all (e, f) in A                             // consistency check
    if fs <= f <= fe and (e < es or e > ee) then return {}
  S = {}
  ffs = fs
  repeat                                          // extend left over unaligned words
    ffe = fe
    repeat                                        // extend right over unaligned words
      S = S ∪ { (e[es..ee], f[ffs..ffe]) }
      ffe = ffe + 1
    until ffe is aligned or ffe > n
    ffs = ffs - 1
  until ffs is aligned or ffs < 1
  return S
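For reference, the two pseudocode fragments above translate into the following Python sketch. The variable names are my own, and a max_len cap is added as an optional parameter, anticipating the restriction discussed on the next slide; it is a sketch of the standard consistency-based extraction, not the lecture's exact code:

```python
def extract_phrases(alignment, e_len, f_len, max_len=7):
    """Extract all phrase pairs consistent with the word alignment.

    alignment: set of (e, f) alignment points, 1-based indices.
    Returns a set of ((es, ee), (fs, fe)) span pairs.
    """
    f_aligned = {f for (_, f) in alignment}
    phrase_pairs = set()
    for es in range(1, e_len + 1):
        for ee in range(es, min(es + max_len, e_len + 1)):
            # minimal foreign span covering all alignment points of e[es..ee]
            fs, fe = f_len + 1, 0
            for e, f in alignment:
                if es <= e <= ee:
                    fs, fe = min(fs, f), max(fe, f)
            phrase_pairs |= extract_from(alignment, f_aligned,
                                         fs, fe, es, ee, f_len)
    return phrase_pairs


def extract_from(alignment, f_aligned, fs, fe, es, ee, f_len):
    if fe == 0:                                   # no aligned word in e[es..ee]
        return set()
    for e, f in alignment:                        # consistency check
        if fs <= f <= fe and (e < es or e > ee):
            return set()
    pairs = set()
    ffs = fs
    while True:                                   # extend left over unaligned words
        ffe = fe
        while True:                               # extend right over unaligned words
            pairs.add(((es, ee), (ffs, ffe)))
            ffe += 1
            if ffe > f_len or ffe in f_aligned:
                break
        ffs -= 1
        if ffs < 1 or ffs in f_aligned:
            break
    return pairs
```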

Phrase Extraction Example

Some statistics:
- 9 English words vs. 10 German words
- 11 alignment points
- 45 contiguous English phrases
- 55 contiguous German phrases
- 24 phrase pairs extracted

Phrase Extraction

Points to note about phrase extraction:
- Unaligned words lead to multiple matches (consider, for example, the effect of the comma).
- There is no restriction on phrase length, which leads to a huge number of extracted pairs.
- Most long phrases of the training data are unlikely to occur in the test data.
- Often a restriction is therefore placed on the maximum length of a phrase.
- Extracting a huge number of phrases is a computational burden.
- It is NOT clear whether this restriction affects output quality.

Translation Process

Translation Process

Let us now look at the most practical aspect: how is the translation generated for a given new sentence f?

For illustration, consider the Italian sentence: non gli piace il cibo indiano

We expect a person not conversant with the language to proceed as follows.

Translation Process (word by word)

Source: non gli piace il cibo indiano

1. non >> No:                   "No"
2. indiano >> Indian:           "No Indian"
3. il >> the:                   "No the Indian"
4. cibo >> food:                "No the food Indian"
5. Apply reordering
6. piace >> like:               "No like the food Indian"
7. gli >> he:                   "No he like the food Indian"
8. Apply reordering:            "He no like the food Indian"
9. Apply language modeling:     "He does not like the Indian food"
10. Final translation:          "He does not like Indian food"

Comments

This example was done at the word level. But suppose we have a phrase translation table that gives the following translations directly:
il cibo indiano >> Indian food
non gli piace >> he does not like

Then the whole translation process becomes much easier. Hence phrase-based methods are pursued.

Difficulties

In each case we may have different options, e.g.:
non >> not, none
gli >> the, him, it
piace >> like
cibo >> food, grub, viands, meat, fare, scoff, cheer
indiano >> Indian, Hindu

So the task is NOT as easy as we thought!

Difficulties

If we look at phrases, we can also get multiple translations, e.g.:
non gli piace >> not like (from Google), dislikes, does not like, is not fond of

Hence most of our decisions will be probabilistic.

Divergence Observation

Simple sentence translations of this structure from English to Italian:
He does not read >> egli non legge
He does not sing >> egli non canta
He does not cry >> egli non piange

But:
He does not like >> non gli piace
He does not speak >> non parla
He does not write >> non scrive

Computing Translation Probability

To make this mathematically sound we resort to the following model. The best translation is obtained from:

$$e_{\text{best}} = \arg\max_e \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)\, pd(d_i)\; p_{LM}(e)$$

How do we get the probabilities? Note that there are three components:
- $\phi$: matching the SL phrase with the TL phrase.
- $pd$: the phrases are rearranged appropriately.
- $p_{LM}$: the output is fluent as far as the TL is concerned.

They are now used to compute sentence probabilities.

Computing Translation Probability

Note: each of the individual probabilities is computed separately.
- $\phi(\cdot)$: from the phrase translation table.
- $pd(\cdot)$: as a phrase is translated, one stores its end position; this is used as end_{i-1}. For the next phrase we note its start position start_i. Knowing these numbers, one can compute the distortion probability.
- $p_{LM}$: gives the probability of a sentence based on n-grams.

As the translation is being constructed, the partial scores are built up incrementally.
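To make the scoring concrete, here is a minimal Python sketch of how the three components could be combined in log space when scoring one candidate segmentation; the data structures and the decaying distortion model pd(d) = alpha^|d| are illustrative assumptions, not the lecture's code:

```python
import math

def score_candidate(segmentation, phrase_table, lm_logprob, alpha=0.75):
    """Score one candidate translation built from a phrase segmentation.

    segmentation: list of (f_phrase, e_phrase, f_start, f_end), in the order
                  the English phrases are produced (1-based f positions).
    phrase_table: dict mapping (f_phrase, e_phrase) -> phi(f|e).
    lm_logprob:   function mapping a list of English words -> log p_LM.
    alpha:        base of the decaying distortion model pd(d) = alpha ** |d|.
    """
    log_score = 0.0
    prev_end = 0
    e_words = []
    for f_phrase, e_phrase, f_start, f_end in segmentation:
        log_score += math.log(phrase_table[(f_phrase, e_phrase)])  # log phi
        d = f_start - prev_end - 1                                 # distortion
        log_score += abs(d) * math.log(alpha)                      # log pd(d)
        prev_end = f_end
        e_words.extend(e_phrase.split())
    return log_score + lm_logprob(e_words)                         # add log p_LM
```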

Phrase Translation Probabilities

Phrase Translation Probabilities

The huge number of extracted phrases prohibits generative modeling. Note that in word-level modeling we could estimate mathematically well-defined probabilities on the basis of word alignment (between input and output text) and counting. Here the situation is different:
- there are finer phrases and coarser phrases;
- we do not know how useful each of them is;
- we do not want to eliminate any of them;
- counting alone does not give the solution.

Let us illustrate with Google:

Phrase Translation Probabilities

he >> egli ; does >> fa ; not >> non ; like >> piacere
he does >> lo fa ; does not >> non ; not like >> non come
he does not >> egli non ; does not like >> non piace
he does not like >> non gli piace

How do we store phrases in the phrase translation table? As a consequence, we have to approach the estimation of phrase translation probabilities in a different way.

Phrase Translation Probabilities

- For each sentence pair, extract a number of phrase pairs ($\bar{f}$, $\bar{e}$).
- Count in how many sentence pairs a particular phrase pair is extracted.
- The estimate for $\phi(\bar{f} \mid \bar{e})$ is then the relative frequency:

$$\phi(\bar{f} \mid \bar{e}) = \frac{\text{count}(\bar{e}, \bar{f})}{\sum_{\bar{f}_k} \text{count}(\bar{e}, \bar{f}_k)}$$

Note:
- For a large corpus the table may be several GBs.
- This makes it difficult to use for future translation.
- It is typically kept sorted and only partially loaded into RAM.
- It can then be used in translation with better performance.
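A minimal sketch of this relative-frequency estimate, assuming the extracted phrase pairs are available as a flat list of (ē, f̄) occurrences (names are my own):

```python
from collections import Counter

def estimate_phi(extracted_pairs):
    """Relative-frequency estimate phi(f|e) = count(e, f) / sum_f' count(e, f').

    extracted_pairs: iterable of (e_phrase, f_phrase) tuples, one entry per
    sentence pair from which that phrase pair was extracted.
    """
    extracted_pairs = list(extracted_pairs)
    pair_counts = Counter(extracted_pairs)                 # count(e, f)
    e_counts = Counter(e for (e, _) in extracted_pairs)    # sum over f
    return {(e, f): count / e_counts[e]
            for (e, f), count in pair_counts.items()}
```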

Extension to the Translation Model

We started with the following equation:

$$e_{\text{best}} = \arg\max_e p(e \mid f) = \arg\max_e p(f \mid e)\, p(e)$$

Then $p(f \mid e)$ is further reduced to

$$\prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)\, pd(d_i)$$

Thus our translation model has three components:
1. The phrase translation table $\phi(\bar{f}, \bar{e})$.
2. The language model $p(e)$.
3. The reordering model $pd(d_i)$, with $d_i = \text{start}_i - \text{end}_{i-1} - 1$.

This gives the following translation model:

Extension to the Translation Model

$$e_{\text{best}} = \arg\max_e \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)\, pd(d_i) \prod_{i=1}^{m} p(e_i \mid e_1 \ldots e_{i-1})$$

However, not all components are equally important! Often the words are well translated but the overall translation is still not OK. Hence we prefer to give more weight to the language model. By introducing three weighting constants, we have:

$$e_{\text{best}} = \arg\max_e \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)^{\lambda_\phi}\, pd(d_i)^{\lambda_d} \prod_{i=1}^{m} p(e_i \mid e_1 \ldots e_{i-1})^{\lambda_{LM}}$$

Log-Linear Model

The above model, however, is not mathematically straightforward to work with. So we maximize exp(log p) instead of the original function p. Hence our function is:

$$\exp\Big( \lambda_\phi \sum_{i=1}^{I} \log \phi(\bar{f}_i \mid \bar{e}_i) + \lambda_d \sum_{i=1}^{I} \log pd(\text{start}_i - \text{end}_{i-1} - 1) + \lambda_{LM} \sum_{i=1}^{m} \log p(e_i \mid e_1 \ldots e_{i-1}) \Big)$$

Log-Linear Model

- Each translation is considered to be a data point.
- Each data point is considered to be a vector of features (similar to the word space model).
- A model has a corresponding set of feature functions.
- The feature functions are trained separately, assuming they are independent.
- This allows us to experiment with different weights for different functions.
- It also allows us to add additional models (e.g. lexical weighting, word penalty, phrase length penalty) if deemed necessary.
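Viewed this way, scoring a candidate is just a weighted sum of log feature values. A minimal sketch (the feature names and numbers below are hypothetical, chosen only for illustration):

```python
import math

def log_linear_score(features, weights):
    """Log-linear model score: sum_k lambda_k * log h_k(candidate)."""
    return sum(weights[name] * math.log(value)
               for name, value in features.items())

# hypothetical feature values for one candidate translation
features = {"phrase_translation": 0.02, "distortion": 0.5, "language_model": 1e-3}
weights = {"phrase_translation": 1.0, "distortion": 0.6, "language_model": 1.4}
print(log_linear_score(features, weights))
```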

Reordering Model

Reordering Model

Reordering seems to be the most difficult of the tasks. As most reordering patterns depend on the language pair, it is difficult to design a general reordering model. E.g.:

Adj NN (English) >> NN Adj (French): I am eating Indian food >> Je mange la nourriture indienne
Aux V + Main V (English) >> Main V + Aux V (Hindi): I am eating Indian food >> main bharatiya khaana khaa rahaa hoon

Consequently, full reordering requires much more information!

Reordering Model

The reordering model we discussed is based only on the movement distance d_i; it does not take the underlying word (or its class) into account. Hence there is a lot of scope to improve reordering, e.g. hierarchical reordering or reordering with syntactic knowledge. We discuss one important reordering model: lexicalized reordering.

Reordering Model (Lexicalized)

Here we observe that some phrases switch positions more often than others:
studente di dottorato (It) >> doctoral student
traduzione automatica >> machine translation
Stati Uniti d'America >> United States of America
European Parliament >> Parlamento europeo
Parlement européen (Fr) >> European Parliament
bombe atomique >> atom bomb
America yuktarashtra (Bengali) >> United States of America

Lexicalized Reordering

Reordering is done on the basis of the actual phrases. Here we collect statistics on three possible orientations of reordering:
- monotone (m): two successive phrases in f translate into successive phrases in e.
- swap (s): two successive phrases ($\bar{f}_j$, $\bar{f}_{j+1}$) in f translate into two successive phrases ($\bar{e}_i$, $\bar{e}_{i+1}$) of e, but the word alignment is in reverse order.
- discontinuous (d): the translations of two successive phrases in f are not successive in e.

How do we obtain these statistics?

Lexicalized Reordering example: non mi piace cibo indiano >> I do not like Indian food
[Alignment grid showing the orientation of each phrase]

Lexicalized Reordering example: Michael ne kahaa ki wah iss ghar mein rahegaa >> Michael said that he would stay in this house
[Alignment grid showing the orientation of each phrase]

Lexicalized Reordering

Note (reading the alignment grid for each phrase):
a) If there is an alignment point at the top left of the cell, the orientation is monotone.
b) If there is an alignment point at the bottom left, it is swap.
c) Otherwise it is discontinuous.

While doing phrase alignment we collect statistics on orientation:

$$p_o(x \mid \bar{f}, \bar{e}) = \frac{\text{count}(x, \bar{f}, \bar{e})}{\sum_{o} \text{count}(o, \bar{f}, \bar{e})}, \quad x \in \{m, s, d\}$$

Because of data sparseness, modified formulas and techniques are often used.

Lexicalized Reordering

For example, one can make use of the overall probability of a particular orientation. Let

$$p_o(x) = \frac{\sum_{\bar{f}} \sum_{\bar{e}} \text{count}(x, \bar{f}, \bar{e})}{\sum_{o} \sum_{\bar{f}} \sum_{\bar{e}} \text{count}(o, \bar{f}, \bar{e})}, \quad x \in \{m, s, d\}$$

Then, for a given constant $\alpha$, one can use:

$$p_o(x \mid \bar{f}, \bar{e}) = \frac{\text{count}(x, \bar{f}, \bar{e}) + \alpha\, p_o(x)}{\sum_{o} \text{count}(o, \bar{f}, \bar{e}) + \alpha}$$
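A minimal Python sketch of this estimation, assuming the orientations observed during phrase extraction are available as (orientation, f̄, ē) tuples (names and the value of alpha are illustrative assumptions):

```python
from collections import Counter

ORIENTATIONS = ("m", "s", "d")   # monotone, swap, discontinuous

def reordering_probs(observations, alpha=0.5):
    """Smoothed lexicalized reordering probabilities p_o(x | f, e).

    observations: list of (orientation, f_phrase, e_phrase) tuples.
    alpha: smoothing constant; backs off to the overall distribution p_o(x).
    """
    overall = Counter(o for (o, _, _) in observations)
    total = sum(overall.values())
    p_overall = {x: overall[x] / total for x in ORIENTATIONS}

    joint = Counter(observations)                        # count(x, f, e)
    per_pair = Counter((f, e) for (_, f, e) in observations)

    probs = {}
    for (f, e), n in per_pair.items():
        for x in ORIENTATIONS:
            probs[(x, f, e)] = (joint[(x, f, e)] + alpha * p_overall[x]) / (n + alpha)
    return probs
```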

Other Issues in Reordering

Reordering needs to be dealt with more judiciously:
- It has been noticed that in reordering some groups of words move together.
- This happens depending upon the role of the phrase.
- There may be local and global reordering, e.g. intra-phrase and inter-phrase.
  * Hence distance is not the best measurement, as it penalizes large movements,
  * but some language pairs demand large movements (e.g. SOV vs. SVO).
- Typically reordering is guided by the language model.
- The question is: can a typical 3-gram / 4-gram model guide long-distance reordering?

Other Issues

Since reordering is difficult, people advocate:
(1) Monotone translation
- It does not give perfect translation.
- But it reduces search complexity from exponential to polynomial.
(2) Limited reordering
- Allowing only local reordering, typically intra-phrase as suggested by the SL-TL pair.
- Controlled by a small window.
- Often gives better translation than unrestricted reordering.

Other Issues: Lexical Weighting

- Infrequent phrase pairs often cause problems, particularly if collected from noisy data.
- If both $\bar{f}$ and $\bar{e}$ occur only once, we get $\phi(\bar{f} \mid \bar{e}) = 1 = \phi(\bar{e} \mid \bar{f})$.
- Lexical weighting is a smoothing method where we back off to more reliable probabilities. One possible formula (based on the word alignment a):

$$\text{lex}(\bar{e} \mid \bar{f}, a) = \prod_{i=1}^{\text{length}(\bar{e})} \frac{1}{|\{ j : (i, j) \in a \}|} \sum_{(i, j) \in a} w(e_i \mid f_j)$$

The word translation probabilities $w(e_i \mid f_j)$ are calculated from the word-aligned corpus.
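A minimal sketch of the lexical weighting computation, assuming w(e|f) is available as a dictionary and that English words with no alignment point are scored against a NULL token (a common convention, stated here as an assumption):

```python
def lexical_weight(e_words, f_words, alignment, w):
    """lex(e | f, a) for one phrase pair.

    e_words, f_words: lists of words in the English / foreign phrase.
    alignment: set of (i, j) points, 0-based within the phrase pair.
    w: dict mapping (e_word, f_word) -> word translation probability w(e|f).
    """
    score = 1.0
    for i, e_word in enumerate(e_words):
        linked = [j for (ii, j) in alignment if ii == i]
        if not linked:                                   # unaligned: use NULL
            score *= w.get((e_word, "NULL"), 1e-7)
        else:                                            # average over linked f words
            score *= sum(w.get((e_word, f_words[j]), 1e-7)
                         for j in linked) / len(linked)
    return score
```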

Other Issues: Word Penalty and Phrase Penalty

These come into consideration if we try to model the length of the translation. Note that the language model prefers shorter translations, since fewer n-grams need to be scored.
- The word penalty controls the length of the translation by adding a factor w for each produced word:
  w < 1: shorter translations are preferred
  w > 1: longer translations are preferred
- The phrase penalty tries to control the number of phrases using a factor ρ:
  ρ < 1: fewer phrases are preferred
  ρ > 1: more phrases are preferred
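A minimal sketch of how these penalties could enter the (log) score of a candidate, assuming the factors w and ρ are applied once per produced word and per used phrase respectively (the default values are illustrative):

```python
import math

def length_penalty_score(num_words, num_phrases, w=0.9, rho=1.0):
    """Log-score contribution of the word penalty and phrase penalty.

    Each produced English word contributes a factor w and each used phrase a
    factor rho, i.e. num_words * log(w) + num_phrases * log(rho) in log space.
    """
    return num_words * math.log(w) + num_phrases * math.log(rho)
```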

Concluding Remarks

Phrase-based models seem more intuitive than word-based ones, but the extraction algorithm is still based on word alignment. Hence two questions arise:
1) Can't we extract phrases directly? A joint model of phrase alignment has been proposed, which uses the EM algorithm.
2) How do we combine the phrase translations? This requires the development of a decoder.

In the next class we shall focus on these issues.

Thank you