A Joint Sequence Translation Model with Integrated Reordering

1 A Joint Sequence Translation Model with Integrated Reordering. Nadir Durrani, Helmut Schmid and Alexander Fraser, Institute for Natural Language Processing, University of Stuttgart

2 Introduction. Generation of a bilingual sentence pair through a sequence of operations. Operation: Translate or Reorder. P (E,F,A) = probability of the operation sequence required to generate the bilingual sentence pair. Extension of N-gram based SMT: a sequence of operations rather than tuples; integrated reordering rather than source linearization + rule extraction

3 Example: Er hat eine Pizza gegessen / He has eaten a pizza

4 Example: Er hat eine Pizza gegessen / He has eaten a pizza. Simultaneous generation of source and target. Generation is done in the order of the target sentence. Reorder when the source words are not in the same order

5 Example: Er hat eine Pizza gegessen / He has eaten a pizza. Operations: Generate (Er, He). Generated so far: Er / He

6 Example: Er hat eine Pizza gegessen / He has eaten a pizza. Operations: Generate (Er, He); Generate (hat, has). Generated so far: Er hat / He has

7 Example: Er hat eine Pizza gegessen / He has eaten a pizza. Operations: Generate (Er, He); Generate (hat, has); Insert Gap. Generated so far: Er hat [gap] / He has

8 Example: Er hat eine Pizza gegessen / He has eaten a pizza. Operations: Generate (Er, He); Generate (hat, has); Insert Gap; Generate (gegessen, eaten). Generated so far: Er hat [gap] gegessen / He has eaten

9 Example: Er hat eine Pizza gegessen / He has eaten a pizza. Operations: Generate (Er, He); Generate (hat, has); Insert Gap; Generate (gegessen, eaten); Jump Back. Generated so far: Er hat [gap] gegessen / He has eaten

10 Example: Er hat eine Pizza gegessen / He has eaten a pizza. Operations: Generate (Er, He); Generate (hat, has); Insert Gap; Generate (gegessen, eaten); Jump Back; Generate (eine, a). Generated so far: Er hat eine [gap] gegessen / He has eaten a

11 Example: Er hat eine Pizza gegessen / He has eaten a pizza. Operations: Generate (Er, He); Generate (hat, has); Insert Gap; Generate (gegessen, eaten); Jump Back; Generate (eine, a); Generate (Pizza, pizza). Generated so far: Er hat eine Pizza gegessen / He has eaten a pizza
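
The walk-through above can be replayed in code. The following is a minimal sketch, not the authors' decoder: it assumes operations are encoded as tuples, that a gap is a placeholder in the partially built source sentence, and that Jump Back (N) moves to the N-th open gap counted from the right. The function name `run_operations` and the tuple encoding are hypothetical.

```python
GAP = object()  # sentinel marking an open gap on the source side


def run_operations(ops):
    """Replay an operation sequence; return (source_words, target_words)."""
    source, target = [], []
    pos = 0  # index in `source` where the next source word is placed
    for op in ops:
        kind = op[0]
        if kind == "generate":        # ("generate", "Er", "He")
            _, src, tgt = op
            src_words = src.split()
            source[pos:pos] = src_words   # insert at the current position
            pos += len(src_words)
            target.extend(tgt.split())    # target is always built left to right
        elif kind == "insert_gap":    # ("insert_gap",)
            source.insert(pos, GAP)
            pos += 1
        elif kind == "jump_back":     # ("jump_back", n)
            n = op[1] if len(op) > 1 else 1
            gap_positions = [i for i, w in enumerate(source) if w is GAP]
            pos = gap_positions[-n]   # n-th open gap counted from the right
            del source[pos]           # assumption: the gap is consumed by the jump
        else:
            raise ValueError(f"unknown operation: {kind}")
    if any(w is GAP for w in source):
        raise ValueError("operation sequence left an open gap")
    return source, target


pizza_ops = [
    ("generate", "Er", "He"),
    ("generate", "hat", "has"),
    ("insert_gap",),
    ("generate", "gegessen", "eaten"),
    ("jump_back", 1),
    ("generate", "eine", "a"),
    ("generate", "Pizza", "pizza"),
]
source, target = run_operations(pizza_ops)
print(" ".join(source))  # Er hat eine Pizza gegessen
print(" ".join(target))  # He has eaten a pizza
```

The target side is only ever appended to, while the source side is filled out of order; that asymmetry is what "generation is done in the order of the target sentence" amounts to in this sketch.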

12 Lexical Trigger: Er hat gegessen / He has eaten. Operations: Generate (Er, He); Generate (hat, has); Insert Gap; Generate (gegessen, eaten); Jump Back

13 Generalizing to Unseen Context: Er hat einen Erdbeerkuchen gegessen / He has eaten a strawberry cake. Operations: Generate (Er, He); Generate (hat, has); Insert Gap; Generate (gegessen, eaten); Jump Back (1); Generate (einen, a); Generate (Erdbeerkuchen, strawberry cake)

14 Generalizing to Unseen Context: Er hat einen Erdbeerkuchen und eine Menge Butterkekse gegessen / He has eaten a strawberry cake and a lot of butter cookies. Operations: Generate (Er, He); Generate (hat, has); Insert Gap; Generate (gegessen, eaten); Jump Back (1); Generate (einen, a); Generate (Erdbeerkuchen, strawberry cake); Generate (und, and); Generate (eine, a); Generate (Menge, lot of); Generate (Butterkekse, butter cookies)

15 Key Ideas - Contributions. Reordering integrated into the translation model. Translation and reordering decisions influence each other. Handles local and long-distance reorderings in a unified manner. An operation model that accounts for: translation, reordering, source-side gaps, source word deletion. Joint model with bilingual information (like N-gram SMT). No spurious phrasal segmentation (like N-gram SMT). No distortion limit

16 List of Operations. 4 Translation Operations: Generate (X,Y); Continue Source Cept; Generate Identical; Generate Source Only (X). 3 Reordering Operations: Insert Gap; Jump Back (N); Jump Forward

17 List of Operations. Example: Generate (gegessen, eaten)

18 List of Operations. Example: Generate (Inflationsraten, inflation rate) for Inflationsraten / inflation rate

19 List of Operations. Example: kehrten zurück / returned. Operations: Generate (kehrten zurück, returned); Insert Gap; Continue Source Cept

20 List of Operations. Example: Generate Identical instead of Generate (Portland, Portland) if count (Portland) = 1

21 List of Operations. Example: kommen Sie mit / come with me. Generate Source Only (Sie)

22 List of Operations. Example: über konkrete Zahlen nicht verhandeln wollen / do not want to negotiate on specific figures. Generated so far: [Gap #1] nicht / do not

23 List of Operations. Example: über konkrete Zahlen nicht verhandeln wollen / do not want to negotiate on specific figures. Generated so far: [Gap #2] nicht [Gap #1] wollen / do not want to. Next: Jump Back (1)

24 List of Operations. Example: über konkrete Zahlen nicht verhandeln wollen / do not want to negotiate on specific figures. Generated so far: [Gap #1] nicht verhandeln wollen / do not want to negotiate

25 List of Operations. Example: über konkrete Zahlen nicht verhandeln wollen / do not want to negotiate on specific figures. Generated so far: [Gap #1] nicht verhandeln wollen / do not want to negotiate. Next: Jump Back (1)

26 List of Operations. Generated so far: über konkrete Zahlen nicht verhandeln wollen / do not want to negotiate on specific figures

27 List of Operations. Generated so far: über konkrete Zahlen nicht verhandeln wollen / do not want to negotiate on specific figures. Next: Jump Forward

28 List of Operations. Generated so far: über konkrete Zahlen nicht verhandeln wollen. / do not want to negotiate on specific figures.
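
The gap numbering in the example above is relative: Jump Back (1) always targets the open gap closest to the right-most generated source word, which is why both jumps in the example use N = 1. A minimal sketch of that bookkeeping over a coverage vector follows; the helper names `open_gaps` and `jump_back_target` and the 0-based indexing are assumptions made for illustration.

```python
def open_gaps(coverage):
    """Return (start, end) index pairs of uncovered runs that lie to the left
    of the right-most covered source word, ordered left to right."""
    covered = [i for i, c in enumerate(coverage) if c]
    if not covered:
        return []
    rightmost = covered[-1]
    gaps, start = [], None
    for i in range(rightmost):
        if not coverage[i] and start is None:
            start = i                      # a new gap begins here
        elif coverage[i] and start is not None:
            gaps.append((start, i - 1))    # the gap ended at the previous word
            start = None
    if start is not None:
        gaps.append((start, rightmost - 1))
    return gaps


def jump_back_target(coverage, n):
    """Start index of the n-th closest open gap, counted from the right."""
    return open_gaps(coverage)[-n][0]


# über(0) konkrete(1) Zahlen(2) nicht(3) verhandeln(4) wollen(5)
coverage = [False, False, False, True, False, True]  # nicht and wollen generated
print(open_gaps(coverage))            # [(0, 2), (4, 4)]
print(jump_back_target(coverage, 1))  # 4 -> verhandeln

coverage[4] = True                    # verhandeln generated
print(jump_back_target(coverage, 1))  # 0 -> über
```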

29 Learning Phrases through Operation Sequences: über konkrete Zahlen nicht verhandeln wollen / do not want to negotiate on specific figures. Phrase pair: nicht verhandeln wollen ~ do not want to negotiate. Operations: Generate (nicht, do not); Insert Gap; Generate (wollen, want to); Jump Back (1); Generate (verhandeln, negotiate)

30 Model Joint-probability model over operation sequences
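
As a sketch, a joint model over operation sequences with an n-gram (Markov) factorization could be written as below. The symbols are notation assumed here rather than taken from the slide: o_j is the j-th operation, J the number of operations, and n the model order (the Search slide mentions a 9-gram operation model).

```latex
P(E, F, A) = P(o_1, \dots, o_J) \approx \prod_{j=1}^{J} P(o_j \mid o_{j-n+1}, \dots, o_{j-1})
```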

31 Search. Search is defined as: Incorporating the language model: 5-gram for the language model (p_LM); 9-gram for the operation model and the prior probability (p_pr). Stack-based beam decoder which uses operations

32 Other Features

33 Other Features. Length Penalty: counts the number of target words produced

34 Other Features. Length Penalty: counts the number of target words produced. Deletion Penalty: counts the number of source words deleted

35 Other Features. Length Penalty: counts the number of target words produced. Deletion Penalty: counts the number of source words deleted. Gap Penalty: counts the number of gaps inserted

36 Other Features. Length Penalty: counts the number of target words produced. Deletion Penalty: counts the number of source words deleted. Gap Penalty: counts the number of gaps inserted. Open Gap Penalty: number of open gaps, paid once per translation operation

37 Other Features. Length Penalty: counts the number of target words produced. Deletion Penalty: counts the number of source words deleted. Gap Penalty: counts the number of gaps inserted. Open Gap Penalty: number of open gaps, paid once per translation operation. Reordering Distance: distance from the last translated tuple

38 Other Features. Length Penalty: counts the number of target words produced. Deletion Penalty: counts the number of source words deleted. Gap Penalty: counts the number of gaps inserted. Open Gap Penalty: number of open gaps, paid once per translation operation. Reordering Distance: distance from the last translated tuple. Gap Width: distance from the first open gap

39 Other Features. Length Penalty: counts the number of target words produced. Deletion Penalty: counts the number of source words deleted. Gap Penalty: counts the number of gaps inserted. Open Gap Penalty: number of open gaps, paid once per translation operation. Reordering Distance: distance from the last translated tuple. Gap Width: distance from the first open gap. Lexical Probabilities: source-to-target and target-to-source lexical translation probabilities
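
The four count-based features can be read directly off an operation sequence. The sketch below is illustrative only: the tuple encoding of operations and the function name `count_features` are hypothetical, and it assumes that a Jump Back closes the gap it jumps into. Reordering distance, gap width and the lexical probabilities need source positions and translation tables, so they are left out.

```python
def count_features(ops):
    """Compute length, deletion, gap and open-gap counts for an operation list."""
    feats = {"length": 0, "deletion": 0, "gap": 0, "open_gap": 0}
    open_gaps = 0
    for op in ops:
        kind = op[0]
        if kind == "insert_gap":
            feats["gap"] += 1
            open_gaps += 1
        elif kind == "jump_back":
            open_gaps -= 1                 # assumption: jumping into a gap closes it
        elif kind == "generate":           # ("generate", source, target)
            feats["length"] += len(op[2].split())
            feats["open_gap"] += open_gaps  # paid once per translation operation
        elif kind == "generate_source_only":  # ("generate_source_only", source)
            feats["deletion"] += 1
            feats["open_gap"] += open_gaps
    return feats


pizza_ops = [
    ("generate", "Er", "He"),
    ("generate", "hat", "has"),
    ("insert_gap",),
    ("generate", "gegessen", "eaten"),
    ("jump_back", 1),
    ("generate", "eine", "a"),
    ("generate", "Pizza", "pizza"),
]
print(count_features(pizza_ops))
# {'length': 5, 'deletion': 0, 'gap': 1, 'open_gap': 1}
```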

40 Experimental Setup. Language pairs: German, Spanish and French to English. Data: 4th version of the Europarl Corpus. Bilingual data: 200K parallel sentences (reduced version of WMT 09): ~74K News Commentary + ~126K Europarl. Monolingual data: 500K = 300K from the monolingual corpus (News Commentary) + 200K English side of the bilingual corpus. Standard WMT 2009 sets for tuning and testing

41 Training & Tuning. GIZA++ for word alignment. Heuristic modification of alignments to remove target-side gaps and unaligned target words (see the paper for details). Convert the word-aligned bilingual corpus into an operation corpus (see the paper for details). SRILM toolkit to train n-gram language models with Kneser-Ney smoothing. Parameter tuning with Z-MERT

42 Results. Baseline: Moses (with lexicalized reordering) with default settings and a 5-gram language model (same as ours). Two baselines: one with no distortion limit and one with a reordering limit of 6. Two variations of our system: one with no reordering limit and one with a gap-width of 6 as the reordering limit

43 Using Non-Gappy Source Cepts. BLEU table (columns: German, Spanish, French; rows: Bl no-rl, Bl rl-6, Tw no-rl, Tw rl-6). Moses score without reordering limit drops by more than a BLEU point. Our best system, Tw no-rl, gives statistically significant improvements over Bl rl-6 for German and Spanish, and comparable results for French

44 Gappy + Non-Gappy Source Cepts. BLEU table (columns: German, Spanish, French; rows: Tw no-rl, Tw rl-6, Tw asg-no-rl, Tw asg-rl-6)

45 Why didn't Gappy-Cepts improve performance? Using all source gaps explodes the search space. Number of tuples using 10-best translations (German | Spanish | French): Gaps 965,515 | 1,705,156 | 1,473,798; No Gaps 256, , ,220

46 Why didn't Gappy-Cepts improve performance? Using all source gaps explodes the search space. Number of tuples using 10-best translations (German | Spanish | French): Gaps 965,515 | 1,705,156 | 1,473,798; No Gaps 256, , ,220. Future cost is incorrectly estimated in the case of gappy cepts. The dynamic programming algorithm for calculating the cost of bigger spans doesn't apply anymore. A modification helps but is still problematic when gappy cepts interleave

47 Heuristic. Use only the gappy cepts with scores better than the sum of their parts: log p(habe ... gemacht, made) > log p(habe, have) + log p(gemacht, made). Number of tuples (German | Spanish | French): Gaps 965,515 | 1,705,156 | 1,473,798; No Gaps 256, , ,220; Heuristic 281, , ,869
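
The filtering rule can be stated in a few lines. This is a minimal sketch; the function name `keep_gappy_cept` and the probabilities used in the example are invented for illustration and are not values from the experiments.

```python
import math


def keep_gappy_cept(p_gappy, part_probs):
    """Keep a gappy cept only if its log probability exceeds the summed
    log probabilities of translating its parts separately."""
    return math.log(p_gappy) > sum(math.log(p) for p in part_probs)


# Hypothetical probabilities for (habe ... gemacht, made) versus
# (habe, have) and (gemacht, made).
print(keep_gappy_cept(0.2, [0.5, 0.3]))  # True:  log 0.2 > log 0.15
print(keep_gappy_cept(0.1, [0.5, 0.3]))  # False: log 0.1 < log 0.15
```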

48 With Gappy Source Cepts + Heuristic. BLEU table (columns: German, Spanish, French; rows: Tw asg-no-rl, Tw asg-rl-6, Tw hsg-no-rl, Tw hsg-rl-6)

49 Summary. Translation and reordering are combined into a single generative story. Handles long- and short-distance reordering identically. Ability to learn phrases through operation sequences. All possible reorderings (in contrast with N-gram SMT). Uses bilingual context (like N-gram SMT). No spurious phrasal segmentation (like N-gram SMT). No distortion limit. Compared with the state-of-the-art Moses system: comparable results for French-to-English; significantly better results for German-to-English and Spanish-to-English

50 Thank you - Questions? Decoder and Corpus Conversion Algorithm available at:

51 Future Work. Improving the future cost estimate: using phrases instead of tuples for future cost estimation; N-gram model and phrase-based decoding. Source-side discontinuities: future cost estimation with gappy units; gappy phrases; improve the model to better handle source gaps. Target-side discontinuities. Target unaligned words (Generate Target Only (Y) operation). Generalizing the operation model using a combination of POS tags and lexical items

52 Search and Future Cost Estimation. The search problem is much harder than in PBSMT: a larger beam is needed to produce translations similar to PBSMT. Example: zum Beispiel / for example vs. zum / for, Beispiel / example. Problem with future cost estimation of the language model probability: phrase-based: p(for) * p(example | for); our model: p(for) * p(example). Future cost for reordering operations. Future cost for the features gap penalty, gap-width and reordering distance

53 Future Cost Estimation with Source-Side Gaps. Future cost estimation with source-side gaps is problematic. Future cost for bigger spans: cost (I,K) = min( cost (I,J) + cost (J+1,K) ) for all J with I <= J < K. cost (1,8) = min ( {cost (1,1) + cost (2,8)}, {cost (1,2) + cost (3,8)}, ..., {cost (1,7) + cost (8,8)} )
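
The span recurrence is a standard bottom-up dynamic program. The sketch below assumes costs are values to be minimized (for example negative log probabilities) and, as an assumption beyond the recurrence stated on the slide, that a span may also be covered directly by a single contiguous unit with a known cost. The function name `span_costs` and the example costs are hypothetical.

```python
def span_costs(unit_costs):
    """unit_costs maps (i, k) to the cost of covering source span i..k
    (1-based, inclusive) with one contiguous unit. Returns the cheapest
    cost for every span, combining adjacent sub-spans where that is better."""
    n = max(k for _, k in unit_costs)
    inf = float("inf")
    cost = {}
    for length in range(1, n + 1):            # shorter spans first
        for i in range(1, n - length + 2):
            k = i + length - 1
            best = unit_costs.get((i, k), inf)
            for j in range(i, k):             # split point: i..j and j+1..k
                best = min(best, cost[(i, j)] + cost[(j + 1, k)])
            cost[(i, k)] = best
    return cost


# Hypothetical costs for a three-word source sentence.
units = {(1, 1): 2.0, (2, 2): 3.0, (3, 3): 1.0, (1, 2): 4.0}
table = span_costs(units)
print(table[(1, 2)])  # 4.0 (the two-word unit beats 2.0 + 3.0)
print(table[(1, 3)])  # 5.0 (cost(1,2) + cost(3,3))
```

A gappy cept covering, say, words 1, 4 and 8 has no entry in this table because it is not a contiguous span, which is the difficulty the following slides discuss.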

56 Future Cost Estimation with Source-Side Gaps. Does not work for cepts with gaps. Best way to cover words 1, 4 and 8 is through a gappy cept. cost (1,8) = ? After computation of cost (1,8) we do another pass to find min (cost (1,8), cost (2,3) + cost (5,7) + cost_of_cept ( ))

59 Future Cost Estimation with Source-Side Gaps. Does not work for cepts with gaps. Best way to cover words 1, 4 and 8 is through a gappy cept. cost (1,8) = ? min (cost (1,8), cost (2,3) + cost (5,7) + cost_of_cept ( ), cost (3,7) + cost_of_cept ( ))

60 Future Cost Estimation with Source-Side Gaps. Still problematic when gappy cepts interleave. Example: consider that the best way to cover 1 and 5 is through cept 1..5. The modification cannot capture that best cost = cost_of_cept (1..5) + cost_of_cept ( ) + cost (3,3) + cost (6,7)

61 Future Cost Estimation with Source-Side Gaps. Gives an incorrect cost if the coverage vector already covers a word between the parts of the gappy cept. Decoder has covered 3. Future cost estimate cost (1,2) + cost (4,8) is wrong. The correct estimate is cost_of_cept (1 4 8) + cost (2,2) + cost (5,7). No efficient way to cover all possible permutations

62 Target Side Gaps & Unaligned Words. Our model does not allow target-side gaps and target unaligned words. Post-editing of alignments is a 3-step process. Step-I: Remove all target-side gaps. For a gappy alignment, the link to the least frequent target word is identified; the group of links that contains this word is retained. Example: A B C D / U V W X Y Z. Target-side discontinuity!

63 Target Side Gaps & Unaligned Words. Our model does not allow target-side gaps and target unaligned words. Post-editing of alignments is a 3-step process. Step-I: Remove all target-side gaps. For a gappy alignment, the link to the least frequent target word is identified; the group of links that contains this word is retained. Example: A B C D / U V W X Y Z

64 Target Side Gaps & Unaligned Words. Our model does not allow target-side gaps and target unaligned words. Post-editing of alignments is a 3-step process. Step-I: Remove all target-side gaps. For a gappy alignment, the link to the least frequent target word is identified; the group of links that contains this word is retained. Example: A B C D / U V W X Y Z. No target-side gaps, but target unaligned words!
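
Step-I can be sketched for a single source word as follows. The function name `remove_target_gap`, the word frequencies, and the alignment links in the example are hypothetical; the target words U to Z are borrowed from the slide only as labels.

```python
def remove_target_gap(target_positions, target_words, freq):
    """For one source word linked to non-adjacent target positions, keep only
    the contiguous group of links that contains the least frequent target word."""
    positions = sorted(target_positions)
    anchor = min(positions, key=lambda p: freq[target_words[p]])
    groups, current = [], [positions[0]]
    for p in positions[1:]:
        if p == current[-1] + 1:
            current.append(p)        # still contiguous with the current group
        else:
            groups.append(current)   # a target-side gap ends the group
            current = [p]
    groups.append(current)
    return next(g for g in groups if anchor in g)


target_words = ["U", "V", "W", "X", "Y", "Z"]
freq = {"U": 50, "V": 5, "W": 9, "X": 20, "Y": 7, "Z": 3}  # made-up counts
# Hypothetical source word linked to U, V and X (positions 0, 1, 3).
print(remove_target_gap({0, 1, 3}, target_words, freq))  # [0, 1]
```

Dropping the link to X leaves X unaligned, which is the situation Steps II and III then repair.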

65 Continued. After Step-I: A B C D / U V W X Y Z. Step-II: Counting over the training corpus to find the attachment preference of a word: Count (U,V) = 1, Count (W,X) = 1, Count (W,X) = 1, Count (X,Y) = 0.5, Count (Y,Z) = 0.5

66 Continued. Step-III: Attach target-unaligned words to the right or left based on the collected counts. After Step-III: A B C D / U V W X Y Z