TS3: an Improved Version of the Bilingual Concordancer TransSearch
Stéphane HUET, Julien BOURDAILLET and Philippe LANGLAIS
EAMT 2009 - Barcelona, June 14, 2009
Computer assisted translation
  Preferred by professional translators
  Exploits a translation memory
  One of these tools: bilingual concordancer
  Retrieves from a bitext parts associated with a query
  Currently operates at the sentence level
  TransSearch: a web-based concordancer with 177,000 queries/month
Current version: www.tsrali.com
  [Screenshot annotations: highlighted query; sentence alignment]
Prototype version: TS3
  [Screenshot annotations: highlighted query; sentence alignment; translation of the query]
Prototype version: TS3
  [Screenshot annotations: several translations of the query; context of use]
Outline
  Spotting of the query translation
  Refinement of translation spotting
  Translation variants merging
  Corpora
  Experimental results
Translation spotting (or transpotting)
  Identification in a sentence of the translation of a query
  This is in keeping with that strategy.
  La présente mesure est conforme à cette stratégie.
  Query: in keeping with
  Transpot: conforme à
Word alignment
  This is in keeping with that strategy.
  La présente mesure est conforme à cette stratégie.
  Use of an IBM2 model
  Discontinuous transpots
  Not the best method to transpot
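A minimal sketch of how a word alignment yields a transpot, and why the result can be discontinuous. The alignment links below are invented for illustration (they are not the output of a trained IBM2 model), and the helper names `find_query`, `transpot_from_alignment` and `is_contiguous` are hypothetical.

```python
def find_query(source, query):
    """Return the set of source positions covered by the first occurrence of the query."""
    for start in range(len(source) - len(query) + 1):
        if source[start:start + len(query)] == query:
            return set(range(start, start + len(query)))
    return set()


def transpot_from_alignment(alignment, query_positions):
    """Target positions whose aligned source word belongs to the query."""
    return [j for j, i in enumerate(alignment) if i in query_positions]


def is_contiguous(positions):
    """True when the positions form one unbroken span."""
    return bool(positions) and positions[-1] - positions[0] + 1 == len(positions)


source = ["This", "is", "in", "keeping", "with", "that", "strategy", "."]
target = ["La", "présente", "mesure", "est", "conforme", "à", "cette", "stratégie", "."]

# alignment[j] = index of the source word linked to target word j.
# Hypothetical links: "mesure" is wrongly linked to "with" to show a discontinuity.
alignment = [0, 0, 4, 1, 3, 4, 5, 6, 7]

query_positions = find_query(source, ["in", "keeping", "with"])
positions = transpot_from_alignment(alignment, query_positions)
transpot = [target[j] for j in positions]
print(transpot, "contiguous:", is_contiguous(positions))
```

With these invented links, a single stray alignment is enough to break the transpot into two pieces, which is the weakness a contiguity constraint is meant to address.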
Transpotting algorithm
  This is in keeping with that strategy.
  La présente mesure est conforme à cette stratégie.
  Algorithm of [Simard 03]
  Contiguous transpots
  Best performance among several tested methods
The need for post-processing
  Query: in keeping with
  Proposed transpots
    conforme à (45)
    conformément à (29)
    à (21)
    dans (20)
    conforme aux (18)
    de (14)
    conforme (13)
    conformément aux (13)
    conforme au (12)
    conformes à (11)
    d'actualité (1)
    gestes en (1)
    correspond à (1)
    respectent (1)
  Proposed transpots after filtering and merging
    conforme à (45)
    conformément à (29)
    conforme aux (18)
    conforme (13)
    conformément aux (13)
    conforme au (12)
    conformes à (11)
    correspond à (1)
    respectent (1)
Filtering bad transpots
  At the level of a pair of sentences
  Computation of 3 sets of features
    Size of the transpot, size of the query
    Statistical word alignment features: min and max likelihood, Viterbi scores...
    Linguistic features: grammatical word ratio, article counts, preposition counts...
  Training of various classifiers
    Voted-perceptron, SVM, decision tree, voting
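A minimal sketch of the size and linguistic features listed above, together with a threshold rule on the grammatical word ratio of the kind reported as a baseline in the classification results (ratio > 0.75). The word lists are tiny hypothetical stand-ins for a real lexicon, the function names are illustrative, and the statistical word alignment features are left out because they require a trained alignment model.

```python
# Hypothetical, deliberately small word lists; a real system would use a full lexicon.
ARTICLES = {"le", "la", "les", "l'", "un", "une", "des", "du", "au", "aux"}
PREPOSITIONS = {"à", "de", "d'", "dans", "en", "pour", "sur", "avec"}
GRAMMATICAL_WORDS = ARTICLES | PREPOSITIONS


def grammatical_ratio(tokens):
    """Share of grammatical words in a transpot (an empty transpot counts as 1.0)."""
    if not tokens:
        return 1.0
    return sum(t.lower() in GRAMMATICAL_WORDS for t in tokens) / len(tokens)


def features(transpot, query):
    """Size and linguistic features for one (query, transpot) pair."""
    return {
        "transpot_size": len(transpot),
        "query_size": len(query),
        "grammatical_ratio": grammatical_ratio(transpot),
        "article_count": sum(t.lower() in ARTICLES for t in transpot),
        "preposition_count": sum(t.lower() in PREPOSITIONS for t in transpot),
    }


def is_bad_baseline(transpot, threshold=0.75):
    """Baseline rule: reject transpots made up almost only of grammatical words."""
    return grammatical_ratio(transpot) > threshold


query = ["in", "keeping", "with"]
for transpot in (["conforme", "à"], ["à"], ["dans"], ["gestes", "en"]):
    print(transpot, features(transpot, query), is_bad_baseline(transpot))
```

The rule rejects "à" and "dans" but lets "gestes en" through, which suggests why a trained classifier over all three feature sets is used rather than a single threshold.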
Merging translation variants
  At the level of the transpot list found for a query
  High complexity when building all possible clusters
  Neighbor-joining method of [Saitou and Nei 87]
    Builds a distance matrix Q between all pairs
    Is a greedy algorithm that at each step
      Merges the two closest transpots
      Updates Q
  Uses a word-based distance
    Minimal cost between 2 inflected forms of a lemma
    Edit costs smaller for grammatical words
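A minimal sketch of the two ingredients above: a word-level edit distance with reduced costs, and the greedy neighbor-joining loop. It uses the standard neighbor-joining criterion Q(i,j) = (n-2) d(i,j) - sum_k d(i,k) - sum_k d(j,k). The lemma table, the grammatical word list and the cost values (0.1, 0.5, 1.0) are assumptions chosen for illustration, not the values used in TS3.

```python
from itertools import combinations

# Hypothetical resources and costs, for illustration only.
GRAMMATICAL = {"à", "au", "aux", "de", "des", "le", "l'", "dans"}
LEMMAS = {"aux": "au", "des": "de", "l'": "le", "conformes": "conforme"}


def lemma(word):
    return LEMMAS.get(word, word)


def word_cost(word):
    """Insertion/deletion cost: cheaper for grammatical words."""
    return 0.5 if word in GRAMMATICAL else 1.0


def substitution_cost(a, b):
    if a == b:
        return 0.0
    if lemma(a) == lemma(b):
        return 0.1  # minimal cost between two inflected forms of a lemma
    return max(word_cost(a), word_cost(b))


def transpot_distance(s, t):
    """Word-level edit distance between two transpots (dynamic programming)."""
    s, t = s.split(), t.split()
    prev = [0.0]
    for w in t:
        prev.append(prev[-1] + word_cost(w))
    for a in s:
        cur = [prev[0] + word_cost(a)]
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + word_cost(a),                   # delete a
                           cur[j - 1] + word_cost(b),                # insert b
                           prev[j - 1] + substitution_cost(a, b)))   # substitute
        prev = cur
    return prev[-1]


def neighbor_joining(items):
    """Return the joins in the order in which neighbor-joining performs them."""
    nodes = list(items)
    d = {frozenset(p): transpot_distance(*p) for p in combinations(nodes, 2)}
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        new = (i, j)
        for k in nodes:
            if k not in (i, j):  # distance from the merged node to every other node
                d[frozenset((new, k))] = (d[frozenset((i, k))] + d[frozenset((j, k))]
                                          - d[frozenset((i, j))]) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [new]
        joins.append(new)
    joins.append(tuple(nodes))
    return joins


transpots = ["conforme au", "conforme aux", "correspondant au",
             "dans le sens des", "dans le sens de l'"]
joins = neighbor_joining(transpots)
print(joins[0])
```

Under these assumed costs the first join groups the two "dans le sens" variants. When four nodes remain, complementary pairs always tie on Q, so the order of the later joins depends on tie-breaking.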
Example for the merging process
  [Figure shown over several build steps; transpots as ordered on the final step]
  dans le sens des
  dans le sens de l'
  correspondant au
  conforme au
  conforme aux
Detection of similar variants
  [Figure over the same transpots]
  dans le sens des
  dans le sens de l'
  correspondant au
  conforme au
  conforme aux
Corpus used in the experiments
  [Diagram: 5,000 most frequent queries; Canadian Hansard, 8.3 M pairs of sentences; retrieved pairs of sentences]
Reference corpus for filtering
  Annotation of 530 queries (23 translations per query)
Results for classification of transpots
  Trained on the annotated queries
  Tested by 10-fold cross-validation

                              Correct classification   F-measure for bad transpots
  All good                              62                          0
  Grammatical ratio > 0.75              78                         63
  Best classifier                       84                         77

  Similar results for the 4 tested classifiers: voted-perceptron, SVM, decision stump, AdaBoost
  Most informative features: grammatical and statistical word alignment
Reference corpus for transpotting
  [Diagram: retrieved pairs of sentences; bilingual lexicon; transpotted pairs of sentences]
  Reference = 1.4 M pairs of sentences
Metrics for transpotting
  Precision: 2/4
    Je crois qu'il est tout à fait conforme à l'esprit du projet de loi.
  Recall: 1/2
    Cela n'est pas conforme aux normes des Nations Unies.
  [In each example, the suggested transpot and the reference are marked on the sentence]
  Averaged for each query, then averaged on the overall corpus
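A minimal sketch of word-level precision and recall between a suggested transpot and its reference, each given as a set of token positions in the target sentence. The spans below are assumptions, since the markup on the slides is not recoverable; they were chosen only so that the first example gives a precision of 2/4 and the second a recall of 1/2. The F-measure is taken here as the usual harmonic mean.

```python
def precision_recall(suggested, reference):
    """Word-level precision and recall between two sets of token positions."""
    suggested, reference = set(suggested), set(reference)
    overlap = len(suggested & reference)
    precision = overlap / len(suggested) if suggested else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Je(0) crois(1) qu'(2) il(3) est(4) tout(5) à(6) fait(7) conforme(8) à(9) l'(10) esprit(11) ...
# Assumed spans: suggested = "fait conforme à l'", reference = "conforme à".
p1, r1 = precision_recall({7, 8, 9, 10}, {8, 9})

# Cela(0) n'(1) est(2) pas(3) conforme(4) aux(5) normes(6) ...
# Assumed spans: suggested = "conforme", reference = "conforme aux".
p2, r2 = precision_recall({4}, {4, 5})

print(p1, r1, f_measure(p1, r1))
print(p2, r2, f_measure(p2, r2))
```

Working on positions rather than word strings keeps the two occurrences of "à" in the first sentence apart.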
Results for transpotting and filtering

                              precision   recall   F-measure
  Transpotting                    79        84        81
  Transpotting + filtering        82        90        86

  Filtering of 7.9% of pairs of sentences
  Improvement of F-measure, in particular of recall
Evaluation of variant merging
  Significant reduction of the number of translations proposed for a query: from 164 to 86
  Higher diversity among the top translations
  Example: query as described

  rank     1         2         3           4                  5
  before   décrits   décrite   décrit      tel que décrit     comme l'a
  after    décrits   prévu     comme l'a   tel que prescrit   comme le propose
Evaluation of variant merging
  Task: retrieving the 5 best transpots of a query
  Example
    {décrits, décrite, décrit, tel que décrit, comme l'a}
    bag of unigrams: {décrits, décrite, décrit, tel, que, comme, l, a}
    grammatical words removed: {décrits, décrite, décrit}
    lemmatization: {décrit}
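The three normalization steps can be sketched as follows. The grammatical word list and the lemma table are hypothetical and cover only this example; the lemma "décrit" is chosen to match the slide rather than a dictionary form.

```python
import re

# Hypothetical resources covering only the example above.
GRAMMATICAL_WORDS = {"tel", "que", "comme", "l", "a"}
LEMMAS = {"décrits": "décrit", "décrite": "décrit"}


def bag_of_unigrams(transpots):
    """Set of lowercased word tokens over all transpots (apostrophes split words)."""
    return {tok for t in transpots for tok in re.findall(r"\w+", t.lower())}


def remove_grammatical(words):
    return {w for w in words if w not in GRAMMATICAL_WORDS}


def lemmatize(words):
    return {LEMMAS.get(w, w) for w in words}


top5 = ["décrits", "décrite", "décrit", "tel que décrit", "comme l'a"]
unigrams = bag_of_unigrams(top5)
content = remove_grammatical(unigrams)
lemmas = lemmatize(content)
print(sorted(unigrams), sorted(content), sorted(lemmas))
```

After normalization the five transpots collapse to a single lemma, so a top-5 list made of inflectional variants can cover only a small part of a diverse reference.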
Results for variant merging
  Task: retrieving the 5 best transpots of a query
  Experiments done on the manually annotated corpus

                    precision   recall   F-measure
  Before merging       90         43        58
  After merging        86         54        66

  Slight decrease of precision and significant improvement of recall => higher diversity
Conclusion
  Use of word alignment in a bilingual concordancer
  Quantitative evaluation of a transpotting algorithm
  Two new issues
    Filtering erroneous transpots
    Merging similar variants of translations
Future work
  Improvement of word alignment
    Higher level IBM models
    Phrase-based models
  Use of pseudo-relevance feedback to improve transpotting
  Evaluation with end users
Thank you for your attention