A Joint Sequence Translation Model with Integrated Reordering Nadir Durrani, Helmut Schmid and Alexander Fraser Institute for Natural Language Processing University of Stuttgart
Introduction
- Generation of a bilingual sentence pair through a sequence of operations
- Operation: Translate or Reorder
- P(E,F,A) = probability of the operation sequence required to generate the bilingual sentence pair
- Extension of N-gram based SMT:
  - a sequence of operations rather than tuples
  - integrated reordering rather than source linearization + rule extraction
Example
Er hat eine Pizza gegessen ~ He has eaten a pizza
- Simultaneous generation of source and target
- Generation is done in the order of the target sentence
- Reorder when the source words are not in the same order
Example
Er hat eine Pizza gegessen ~ He has eaten a pizza
Operations:
- Generate (Er, He)
- Generate (hat, has)
- Insert Gap
- Generate (gegessen, eaten)
- Jump Back (1)
- Generate (eine, a)
- Generate (Pizza, pizza)
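As a minimal sketch (not the authors' decoder; the data structures and operation encoding are hypothetical), the operation sequence above can be replayed to reconstruct both sides of the sentence pair simultaneously:

```python
# A minimal sketch of how the example's operation sequence generates
# source and target together. "Insert gap" leaves a hole in the source;
# "Jump back (N)" returns to the Nth closest open gap to fill it.

GAP = object()  # placeholder for an open source-side gap

def replay(ops):
    target = []   # target side, generated strictly left to right
    source = []   # source side, with placeholders for open gaps
    pos = 0       # index in `source` where the next word is placed
    for op in ops:
        if op[0] == "GENERATE":            # Generate (X, Y)
            _, src, tgt = op
            if pos < len(source) and source[pos] is GAP:
                source[pos] = src          # fill the gap we jumped back to
            else:
                source.insert(pos, src)
            pos += 1
            target.extend(tgt.split())
        elif op[0] == "INSERT_GAP":        # leave a hole, to be filled later
            source.insert(pos, GAP)
            pos += 1
        elif op[0] == "JUMP_BACK":         # Jump Back (N)
            open_gaps = [i for i, w in enumerate(source) if w is GAP]
            pos = open_gaps[-op[1]]        # Nth closest open gap
    return " ".join(source), " ".join(target)

ops = [
    ("GENERATE", "Er", "He"),
    ("GENERATE", "hat", "has"),
    ("INSERT_GAP",),
    ("GENERATE", "gegessen", "eaten"),
    ("JUMP_BACK", 1),
    ("GENERATE", "eine", "a"),
    ("GENERATE", "Pizza", "pizza"),
]
```

Replaying this sequence yields the pair ("Er hat eine Pizza gegessen", "He has eaten a pizza").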
Lexical Trigger
Er hat gegessen ~ He has eaten
- Generate (Er, He)
- Generate (hat, has)
- Insert Gap
- Generate (gegessen, eaten)
- Jump Back
Generalizing to Unseen Context
Er hat einen Erdbeerkuchen gegessen ~ He has eaten a strawberry cake
- Generate (Er, He)
- Generate (hat, has)
- Insert Gap
- Generate (gegessen, eaten)
- Jump Back (1)
- Generate (einen, a)
- Generate (Erdbeerkuchen, strawberry cake)
Generalizing to Unseen Context
Er hat einen Erdbeerkuchen und eine Menge Butterkekse gegessen ~ He has eaten a strawberry cake and a lot of butter cookies
- Generate (Er, He)
- Generate (hat, has)
- Insert Gap
- Generate (gegessen, eaten)
- Jump Back (1)
- Generate (einen, a)
- Generate (Erdbeerkuchen, strawberry cake)
- Generate (und, and)
- Generate (eine, a)
- Generate (Menge, lot of)
- Generate (Butterkekse, butter cookies)
Key Ideas / Contributions
- Reordering integrated into the translation model
  - Translation and reordering decisions influence each other
  - Handles local and long-distance reorderings in a unified manner
- An operation model that accounts for translation, reordering, source-side gaps and source word deletion
- Joint model with bilingual information (like N-gram SMT)
- No spurious phrasal segmentation (like N-gram SMT)
- No distortion limit
List of Operations
4 Translation Operations:
- Generate (X, Y)
- Continue Source Cept
- Generate Identical
- Generate Source Only (X)
3 Reordering Operations:
- Insert Gap
- Jump Back (N)
- Jump Forward
Examples:
- Generate (X, Y): Generate (gegessen, eaten); also multi-word targets, e.g. Generate (Inflationsraten, inflation rate)
- Continue Source Cept: kehrten zurück ~ returned is generated as Generate (kehrten zurück, returned), Insert Gap, Continue Source Cept
- Generate Identical: used instead of Generate (Portland, Portland) if count(Portland) = 1
- Generate Source Only (X): kommen Sie mit ~ come with me uses Generate Source Only (Sie)
Reordering Example
über konkrete Zahlen nicht verhandeln wollen . ~ do not want to negotiate on specific figures .
- Insert Gap (gap #1, over über konkrete Zahlen)
- Generate (nicht, do not)
- Insert Gap (gap #2, over verhandeln)
- Generate (wollen, want to)
- Jump Back (1): jumps to gap #2
- Generate (verhandeln, negotiate)
- Jump Back (1): jumps to gap #1
- Generate (über, on), Generate (konkrete, specific), Generate (Zahlen, figures)
- Jump Forward: jumps past the rightmost covered source word
- Generate (., .)
Learning Phrases through Operation Sequences
über konkrete Zahlen nicht verhandeln wollen ~ do not want to negotiate on specific figures
Phrase pair: nicht verhandeln wollen ~ do not want to negotiate
- Generate (nicht, do not)
- Insert Gap
- Generate (wollen, want to)
- Jump Back (1)
- Generate (verhandeln, negotiate)
Model Joint-probability model over operation sequences
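In symbols (following the N-gram formulation the slides extend, with o_1 ... o_J the operation sequence that generates the aligned pair), the joint model can be written as an n-gram model over operations:

```latex
P(E, F, A) = \prod_{j=1}^{J} p\left(o_j \mid o_{j-n+1}, \ldots, o_{j-1}\right)
```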
Search
- Search is defined over operation sequences, incorporating the language model
- 5-gram language model (p_LM)
- 9-gram for the operation model and prior probability (p_pr)
- Stack-based beam decoder which uses operations
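A sketch of the search criterion implied by this slide (notation taken from the slide: the decoder looks for the target string maximizing the product of the language model and the operation/prior model over the generating sequence o_1 ... o_J):

```latex
\hat{E} = \arg\max_{E} \; p_{LM}(E) \cdot p_{pr}(o_1, \ldots, o_J)
```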
Other Features
- Length Penalty: counts the number of target words produced
- Deletion Penalty: counts the number of source words deleted
- Gap Penalty: counts the number of gaps inserted
- Open Gap Penalty: number of open gaps, paid once per translation operation
- Reordering Distance: distance from the last translated tuple
- Gap Width: distance from the first open gap
- Lexical Probabilities: source-to-target and target-to-source lexical translation probabilities
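A hypothetical sketch (all names invented) of how penalty features like those above could enter a standard log-linear model score; the feature weights would be tuned, e.g. with Z-MERT:

```python
# Combine per-hypothesis feature counts with tuned weights into a
# single log-linear score. Counts and weights are dicts keyed by
# feature name; this layout is illustrative, not the authors' code.

def feature_score(counts, weights):
    return sum(weights[name] * value for name, value in counts.items())

counts = {
    "length_penalty": 5,      # target words produced
    "deletion_penalty": 1,    # source words deleted
    "gap_penalty": 2,         # gaps inserted
    "open_gap_penalty": 3,    # open gaps, charged per translation operation
    "reordering_distance": 4, # distance from the last translated tuple
    "gap_width": 2,           # distance from the first open gap
}
weights = {name: -0.1 for name in counts}  # uniform dummy weights
```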
Experimental Setup
Language pairs: German, Spanish and French to English
Data: 4th version of the Europarl Corpus
- Bilingual data: 200K parallel sentences (reduced version of WMT 09): ~74K News Commentary + ~126K Europarl
- Monolingual data: 500K sentences = 300K from the monolingual corpus (News Commentary) + 200K from the English side of the bilingual corpus
- Standard WMT 2009 sets for tuning and testing
Training & Tuning
- GIZA++ for word alignment
- Heuristic modification of alignments to remove target-side gaps and unaligned target words (see the paper for details)
- Conversion of the word-aligned bilingual corpus into an operation corpus (see the paper for details)
- SRI toolkit to train n-gram language models with Kneser-Ney smoothing
- Parameter tuning with Z-MERT
Results
Baseline: Moses (with lexicalized reordering) with default settings
- A 5-gram language model (same as ours)
- Two baselines: one with no distortion limit, one with a reordering limit of 6
Two variations of our system:
- No reordering limit
- Gap width of 6 as a reordering limit
Using Non-Gappy Source Cepts

System     German  Spanish  French
Bl no-rl    17.41    19.85   19.39
Bl rl-6     18.57    21.67   20.84
Tw no-rl    18.97    22.17   20.92
Tw rl-6     19.03    21.88   20.72

- The Moses score without a reordering limit drops by more than a BLEU point
- Our best system, Tw no-rl, gives statistically significant improvements over Bl rl-6 for German and Spanish, and comparable results for French
Gappy + Non-Gappy Source Cepts

System        German  Spanish  French
Tw no-rl       18.97    22.17   20.92
Tw rl-6        19.03    21.88   20.72
Tw asg-no-rl   18.61    21.60   20.59
Tw asg-rl-6    18.65    21.40   20.47
Why didn't Gappy Cepts Improve Performance?
Using all source gaps explodes the search space:

          German     Spanish    French
Gaps      965,515  1,705,156  1,473,798
No Gaps   256,992    313,690    343,220

Number of tuples using 10-best translations
Why didn't Gappy Cepts Improve Performance? (continued)
- Future cost is incorrectly estimated in the case of gappy cepts
- The dynamic programming algorithm for computing the cost of bigger spans no longer applies
- Our modification helps, but is still problematic when gappy cepts interleave
Heuristic
Use only the gappy cepts whose scores are better than the sum of their parts:
log p(habe gemacht ~ made) > log p(habe ~ have) + log p(gemacht ~ made)

            German     Spanish    French
Gaps        965,515  1,705,156  1,473,798
No Gaps     256,992    313,690    343,220
Heuristic   281,618    346,993    385,869
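The pruning criterion above can be sketched as follows (the data layout and the probability values are hypothetical):

```python
import math

# Keep a gappy cept only if its log probability beats the sum of the
# log probabilities of translating its parts separately, as in:
# log p(habe gemacht ~ made) > log p(habe ~ have) + log p(gemacht ~ made)

def keep_gappy_cept(logp_cept, logp_parts):
    return logp_cept > sum(logp_parts)

# Illustrative values only: a gappy cept with p = 0.4 beats parts
# with p = 0.5 each (joint 0.25), so it is kept.
keep = keep_gappy_cept(math.log(0.4), [math.log(0.5), math.log(0.5)])
```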
With Gappy Source Cepts + Heuristic

System        German  Spanish  French
Tw asg-no-rl   18.61    21.60   20.59
Tw asg-rl-6    18.65    21.40   20.47
Tw hsg-no-rl   18.91    21.93   20.87
Tw hsg-rl-6    19.23    21.79   20.75
Summary
- Translation and reordering are combined into a single generative story
- Handles long- and short-distance reordering identically
- Ability to learn phrases through operation sequences
- All possible reorderings (in contrast with N-gram SMT)
- Uses bilingual context (like N-gram SMT)
- No spurious phrasal segmentation (like N-gram SMT)
- No distortion limit
- Compared with the state-of-the-art Moses system: comparable results for French-to-English, significantly better results for German-to-English and Spanish-to-English
Thank you - Questions? Decoder and Corpus Conversion Algorithm available at: http://www.ims.uni-stuttgart.de/~durrani/resources.html
Future Work
- Improving the future cost estimate
  - Using phrases instead of tuples for future cost estimation (N-gram model with phrase-based decoding)
  - Future cost estimation with gappy units
- Source-side discontinuities: improve the model to better handle source gaps (gappy phrases)
- Target-side discontinuities: target-unaligned words (a Generate Target Only (Y) operation)
- Generalizing the operation model using a combination of POS tags and lexical items
Search and Future Cost Estimation
- The search problem is much harder than in PBSMT
- A larger beam is needed to produce translations similar to PBSMT
  - Example: zum Beispiel ~ for example vs. zum ~ for, Beispiel ~ example
- Problems with future cost estimation:
  - Language model probability: phrase-based uses p(for) * p(example | for); our model uses p(for) * p(example)
  - Future cost for reordering operations
  - Future cost for the features gap penalty, gap width and reordering distance
Future Cost Estimation with Source-Side Gaps
Future cost estimation with source-side gaps is problematic.
Future cost for bigger spans:
cost(I,K) = min( cost(I,J) + cost(J+1,K) ) for all J in I..K
cost(1,8) = min( cost(1,1) + cost(2,8), cost(1,2) + cost(3,8), ..., cost(1,7) + cost(8,8) )
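The contiguous-span recurrence on the slide above can be sketched as a standard dynamic program (the `unit_cost` table, mapping a span to the cheapest single translation unit covering it, is an assumed input):

```python
# Standard future-cost DP over contiguous spans: the cost of span
# (i..k) is the cheapest of covering it with one unit or splitting it
# at some j, i.e. cost(i,k) = min over j of cost(i,j) + cost(j+1,k).
# This is the recurrence that breaks once cepts may contain gaps.

def future_cost(unit_cost, n):
    INF = float("inf")
    cost = [[INF] * n for _ in range(n)]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            k = i + length - 1
            best = unit_cost.get((i, k), INF)  # cover (i..k) with one unit
            for j in range(i, k):              # or split (i..k) at j
                best = min(best, cost[i][j] + cost[j + 1][k])
            cost[i][k] = best
    return cost

# Tiny demo: two words, with a cheaper two-word unit available.
cost = future_cost({(0, 0): 1.0, (1, 1): 2.0, (0, 1): 2.5}, 2)
```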
Future Cost Estimation with Source-Side Gaps
This does not work for cepts with gaps.
Example: the best way to cover words 1, 4 and 8 is through the cept 1..4..8, so cost(1,8) = ?
Modification: after computing cost(1,8), do another pass to find
min( cost(1,8), cost(2,3) + cost(5,7) + cost_of_cept(1..4..8) )
Future Cost Estimation with Source-Side Gaps
With several gappy cepts, the pass takes the minimum over all of them:
min( cost(1,8), cost(2,3) + cost(5,7) + cost_of_cept(1..4..8), cost(3,7) + cost_of_cept(1..2..8) )
Future Cost Estimation with Source-Side Gaps
Still problematic when gappy cepts interleave.
Example: suppose the best way to cover 1 and 5 is through the cept 1..5. The modification cannot capture that the best cost is
cost_of_cept(1..5) + cost_of_cept(2..4..8) + cost(3,3) + cost(6,7)
Future Cost Estimation with Source-Side Gaps
The estimate is also incorrect if the coverage vector already covers a word between the positions of a gappy cept.
Example: the decoder has covered word 3.
- The future cost estimate cost(1,2) + cost(4,8) is wrong
- The correct estimate is cost_of_cept(1..4..8) + cost(2,2) + cost(5,7)
- There is no efficient way to cover all possible permutations
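The extra pass described on the preceding slides can be sketched as follows (the data layout is hypothetical): after the contiguous-span DP, each gappy cept is combined with the costs of the uncovered runs between its words. As the slides note, this still fails when gappy cepts interleave.

```python
# Post-pass over gappy cepts: for a cept covering a set of positions,
# add the DP cost of every uncovered contiguous run between them, and
# update the cost of the span delimited by its outermost positions.

def gappy_post_pass(cost, gappy_cepts):
    # cost: n x n matrix from the contiguous-span DP
    # gappy_cepts: {frozenset(word positions): cost of that cept}
    for positions, cept_cost in gappy_cepts.items():
        i, k = min(positions), max(positions)
        total, run_start = cept_cost, None
        for p in range(i, k + 1):
            if p in positions:
                if run_start is not None:        # close an uncovered run
                    total += cost[run_start][p - 1]
                    run_start = None
            elif run_start is None:
                run_start = p                    # open an uncovered run
        cost[i][k] = min(cost[i][k], total)
    return cost

# Tiny demo: a cept covering positions 0 and 2 (cost 1.0) plus the
# run {1} (cost 2.0) beats the contiguous estimate of 5.0.
cost = [[5.0, 5.0, 5.0], [0.0, 2.0, 5.0], [0.0, 0.0, 5.0]]
gappy_post_pass(cost, {frozenset({0, 2}): 1.0})
```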
Target-Side Gaps & Unaligned Words
Our model does not allow target-side gaps or target-unaligned words.
Post-editing of alignments is a 3-step process.
Step I: remove all target-side gaps
- For a gappy alignment, the link to the least frequent target word is identified
- The group of links containing this word is retained
Example: A B C D ~ U V W X Y Z (target-side discontinuity!)
After Step I: A B C D ~ U V W X Y Z (no target-side gaps, but target-unaligned words remain!)
Continued
After Step I: A B C D ~ U V W X Y Z
Step II: counting over the training corpus to find the attachment preference of each unaligned word
- count(U,V) = 1
- count(W,X) = 1
- count(X,Y) = 0.5
- count(Y,Z) = 0.5
Continued
Step III: attach target-unaligned words to the right or left based on the collected counts
After Step III: A B C D ~ U V W X Y Z
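Step III can be sketched as follows (the format of the Step II counts is hypothetical): each unaligned word is attached to whichever neighbour it was grouped with more often over the training corpus.

```python
# Decide the attachment direction of a target-unaligned word at index
# i: compare the Step II count of pairing it with its left neighbour
# against the count of pairing it with its right neighbour.

def attachment_direction(words, i, counts):
    left = counts.get((words[i - 1], words[i]), 0.0) if i > 0 else -1.0
    right = counts.get((words[i], words[i + 1]), 0.0) if i + 1 < len(words) else -1.0
    return "left" if left >= right else "right"

# Illustrative counts in the spirit of the previous slide.
counts = {("U", "V"): 1.0, ("W", "X"): 1.0, ("X", "Y"): 0.5, ("Y", "Z"): 0.5}
direction = attachment_direction(["U", "V", "W"], 1, counts)
```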