TS3: an Improved Version of the Bilingual Concordancer TransSearch




Stéphane HUET, Julien BOURDAILLET and Philippe LANGLAIS
EAMT 2009 - Barcelona, June 14, 2009

Computer assisted translation
- Preferred by professional translators
- Exploits a translation memory
- One of these tools: the bilingual concordancer
  - Retrieves from a bitext the parts associated with a query
  - Currently operates at the sentence level
- TransSearch: a web-based concordancer with 177,000 queries/month

Current version: www.tsrali.com
- Highlighted query
- Sentence alignment

Prototype version: TS3
- Highlighted query
- Sentence alignment
- Translation of the query
- Several translations of the query
- Context of use

Outline
- Spotting of the query translation
- Refinement of translation spotting
- Translation variants merging
- Corpora
- Experimental results

Translation spotting (or transpotting)
- Identification, in a sentence, of the translation of a query
- Example sentence pair:
    This is in keeping with that strategy.
    La présente mesure est conforme à cette stratégie.
- Query: in keeping with
- Transpot: conforme à

Word alignment
- Example sentence pair:
    This is in keeping with that strategy.
    La présente mesure est conforme à cette stratégie.
- Use of an IBM2 model
- Discontinuous transpots
- Not the best method to transpot
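To make the idea of transpotting by word alignment concrete, here is a minimal sketch. It is not the IBM2 model of the slides: it keeps only a lexical table `t(f | e)` and drops the position term that IBM2 adds. The probability values, the function names and the tokenization are all assumptions invented for illustration.

```python
# Toy lexical translation probabilities t(f | e).
# ASSUMPTION: these values are invented for illustration only.
T = {
    ("in", "à"): 0.3,
    ("in", "conforme"): 0.1,
    ("keeping", "conforme"): 0.6,
    ("with", "à"): 0.4,
    ("with", "cette"): 0.1,
}

TARGET = ["la", "présente", "mesure", "est", "conforme", "à", "cette", "stratégie"]
QUERY = ["in", "keeping", "with"]


def align_query(query, target, t, floor=1e-6):
    """Link each query word to its most probable target word.

    Returns the sorted list of target positions that received a link.
    Unknown word pairs get the small probability `floor`.
    """
    positions = set()
    for e in query:
        best = max(range(len(target)), key=lambda j: t.get((e, target[j]), floor))
        positions.add(best)
    return sorted(positions)


def is_contiguous(positions):
    """True when the aligned positions form one unbroken span."""
    return positions == list(range(positions[0], positions[-1] + 1))


positions = align_query(QUERY, TARGET, T)
print(" ".join(TARGET[j] for j in positions), is_contiguous(positions))
```

Because every query word is linked independently, nothing forces the selected positions to be adjacent, which is how discontinuous transpots arise with this kind of method.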

Transpotting algorithm
- Example sentence pair:
    This is in keeping with that strategy.
    La présente mesure est conforme à cette stratégie.
- Algorithm of [Simard 03]
- Contiguous transpots
- Best performance among several tested methods

The need for post-processing
- Query: in keeping with
- Proposed transpots: conforme à (45), conformément à (29), à (21), dans (20), conforme aux (18), de (14), conforme (13), conformément aux (13), conforme au (12), conformes à (11), d'actualité (1), gestes en (1), correspond à (1), respectent (1)
- After filtering and merging: conforme à (45), conformément à (29), conforme aux (18), conforme (13), conformément aux (13), conforme au (12), conformes à (11), correspond à (1), respectent (1)

Filtering bad transpots
- At the level of a pair of sentences
- Computation of 3 sets of features
  - Size of the transpot, size of the query
  - Statistical word alignment features: min and max likelihood, Viterbi scores...
  - Linguistic features: grammatical word ratio, article counts, preposition counts...
- Training of various classifiers
  - Voted-perceptron, SVM, decision tree, voting

Merging translation variants
- At the level of the transpot list found for a query
- High complexity when building all possible clusters
- Neighbor-joining method of [Saitou and Nei 87]
  - Builds a distance matrix Q between all pairs
  - Is a greedy algorithm that at each step
    - Merges the two closest transpots
    - Updates Q
- Uses a word-based distance
  - Minimal cost between 2 inflected forms of a lemma
  - Edit costs smaller for grammatical words
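The word-based distance and the greedy merging loop can be sketched as follows. This is a simplified stand-in, not the neighbor-joining method itself: it merges the two closest clusters by single linkage and stops at a threshold, without neighbor-joining's Q criterion. The cost values, the threshold, the word list and the lemma table are all assumptions chosen for illustration.

```python
# ASSUMPTION: toy resources; a real system would use a full lexicon and lemmatizer.
GRAMMATICAL = {"à", "au", "aux", "de", "des", "du", "le", "la", "les", "l'", "dans"}
LEMMAS = {"aux": "au", "des": "de", "conformes": "conforme"}


def lemma(word):
    return LEMMAS.get(word, word)


def sub_cost(a, b):
    """Substitution cost: minimal between two inflected forms of one lemma."""
    if a == b:
        return 0.0
    if lemma(a) == lemma(b):
        return 0.1
    if a in GRAMMATICAL and b in GRAMMATICAL:
        return 0.3
    return 1.0


def indel_cost(word):
    """Insertion/deletion cost: smaller for grammatical words."""
    return 0.3 if word in GRAMMATICAL else 1.0


def distance(s1, s2):
    """Weighted edit distance computed over words instead of characters."""
    a, b = s1.split(), s2.split()
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = d[i - 1][0] + indel_cost(a[i - 1])
    for j in range(1, len(b) + 1):
        d[0][j] = d[0][j - 1] + indel_cost(b[j - 1])
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(
                d[i - 1][j] + indel_cost(a[i - 1]),               # delete a word
                d[i][j - 1] + indel_cost(b[j - 1]),               # insert a word
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),   # substitute
            )
    return d[len(a)][len(b)]


def merge_variants(variants, threshold=0.5):
    """Greedily merge the two closest clusters until none is closer than threshold."""
    clusters = [[v] for v in variants]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = min(distance(x, y) for x in clusters[i] for y in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        if best[0] > threshold:
            break
        _, i, j = best
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


variants = ["conforme au", "conforme aux", "correspondant au",
            "dans le sens des", "dans le sens de l'"]
for cluster in merge_variants(variants):
    print(cluster)
```

With these toy costs, "conforme au" and "conforme aux" merge first, then the two "dans le sens ..." variants, while "correspondant au" stays alone.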

Example for the merging process (tree figure; leaf order on successive slides)
1. conforme au, conforme aux, correspondant au, dans le sens de l', dans le sens des
2. conforme au, conforme aux, correspondant au, dans le sens des, dans le sens de l'
3. correspondant au, conforme au, conforme aux, dans le sens des, dans le sens de l'
4. dans le sens des, dans le sens de l', correspondant au, conforme au, conforme aux

Detection of similar variants (same tree): dans le sens des, dans le sens de l', correspondant au, conforme au, conforme aux

Corpus used in the experiments
- 5,000 most frequent queries
- Canadian Hansard: 8.3 M pairs of sentences
- Retrieved pairs of sentences

Reference corpus for filtering
- Annotation of 530 queries (23 translations per query)

Results for classification of transpots
- Trained on the annotated queries
- Tested by 10-fold cross-validation

                              Correct classification   F-measure for bad transpots
  All good                              62                          0
  Grammatical ratio > 0.75              78                         63
  Best classifier                       84                         77

- Similar results for the 4 tested classifiers: voted-perceptron, SVM, decision stump, AdaBoost
- Most informative features: grammatical and statistical word alignment
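The "Grammatical ratio > 0.75" baseline can be sketched in a few lines. The slides do not spell out the decision rule, so this assumes a transpot is flagged as bad when the share of grammatical words in it exceeds the threshold. The word list below is a small hypothetical sample, not the one used in the experiments.

```python
# ASSUMPTION: tiny illustrative list of French grammatical (function) words.
GRAMMATICAL_WORDS = {"à", "au", "aux", "de", "des", "du", "dans", "en",
                     "le", "la", "les", "un", "une"}


def grammatical_ratio(transpot):
    """Share of the transpot's words that are grammatical words."""
    words = transpot.lower().split()
    if not words:
        return 0.0
    return sum(w in GRAMMATICAL_WORDS for w in words) / len(words)


def is_bad_transpot(transpot, threshold=0.75):
    """Baseline rule (assumed direction): bad when the ratio exceeds the threshold."""
    return grammatical_ratio(transpot) > threshold


for candidate in ["conforme à", "à", "dans", "de", "gestes en"]:
    print(candidate, grammatical_ratio(candidate), is_bad_transpot(candidate))
```

Under this rule "à", "dans" and "de" are rejected, but "gestes en" is not, which suggests why trained classifiers using alignment features do better than the ratio alone.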

Reference corpus for transpotting
- Retrieved pairs of sentences
- Bilingual lexicon
- Transpotted pairs of sentences
- Reference = 1.4 M pairs of sentences

Metrics for transpotting
- Precision: 2/4
    Je crois qu'il est tout à fait conforme à l'esprit du projet de loi.
- Recall: 1/2
    Cela n'est pas conforme aux normes des Nations Unies.
- In each example, the slide marks the suggested transpot and the reference on the sentence
- Averaged for each query, then averaged on the overall corpus
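A minimal sketch of these metrics, assuming a transpot is represented as a set of word positions in the target sentence. The exact spans marked on the slide are not recoverable from the transcript, so the positions below are hypothetical ones chosen only to reproduce the 2/4 and 1/2 values. All function names are illustrative.

```python
def precision_recall(suggested, reference):
    """Word-level precision and recall of a suggested transpot against a reference.

    Both arguments are sets of word positions in the target sentence.
    """
    overlap = len(suggested & reference)
    precision = overlap / len(suggested) if suggested else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(per_query):
    """Average (P, R) within each query, then across queries."""
    query_means = []
    for scores in per_query.values():
        p = sum(s[0] for s in scores) / len(scores)
        r = sum(s[1] for s in scores) / len(scores)
        query_means.append((p, r))
    n = len(query_means)
    return (sum(p for p, _ in query_means) / n,
            sum(r for _, r in query_means) / n)


# HYPOTHETICAL spans: 4 suggested words, 2 of them inside the reference.
print(precision_recall({5, 6, 7, 8}, {7, 8}))
# HYPOTHETICAL spans: 1 suggested word out of a 2-word reference.
print(precision_recall({3}, {3, 4}))
```

Averaging per query first keeps very frequent queries from dominating the corpus-level score.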

Results for transpotting and filtering

                              precision   recall   F-measure
  Transpotting                    79        84        81
  Transpotting + filtering        82        90        86

- Filtering of 7.9% of pairs of sentences
- Improvement of F-measure, in particular of recall

Evaluation of variant merging
- Significant reduction of the number of translations proposed for a query: 164 → 86
- Higher diversity among the top translations
- Example: query "as described"

  rank     1         2         3           4                  5
  before   décrits   décrite   décrit      tel que décrit     comme l'a
  after    décrits   prévu     comme l'a   tel que prescrit   comme le propose

Evaluation of variant merging
- Task: retrieving the 5 best transpots of a query
- Example: {décrits, décrite, décrit, tel que décrit, comme l'a}
  - bag of unigrams: {décrits, décrite, décrit, tel, que, comme, l, a}
  - grammatical words removed: {décrits, décrite, décrit}
  - lemmatization: {décrit}
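The three normalization steps of this example can be sketched as a small pipeline. The grammatical word list and the lemma table are hypothetical stand-ins, sized just for this example; the lemma "décrit" follows the slide rather than a real lemmatizer.

```python
import re

# ASSUMPTION: minimal resources covering only the slide's example.
GRAMMATICAL = {"tel", "que", "comme", "l", "a", "le", "la", "les", "de", "à"}
LEMMAS = {"décrits": "décrit", "décrite": "décrit"}


def normalize(transpots):
    """Bag of unigrams, minus grammatical words, then lemmatized."""
    unigrams = {w for t in transpots for w in re.findall(r"\w+", t.lower())}
    content = unigrams - GRAMMATICAL
    return {LEMMAS.get(w, w) for w in content}


top5 = ["décrits", "décrite", "décrit", "tel que décrit", "comme l'a"]
print(normalize(top5))
```

After normalization the five transpots collapse to a single lemma, which makes visible how little diversity the unmerged top 5 offers.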

Results for variant merging
- Task: retrieving the 5 best transpots of a query
- Experiments done on the manually annotated corpus

                    precision   recall   F-measure
  Before merging        90        43        58
  After merging         86        54        66

- Slight decrease of precision and significant improvement of recall => higher diversity
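As a quick consistency check, the F-measures in the two results tables match the harmonic mean of the rounded precision and recall shown beside them. This sketch assumes the standard balanced F-measure; the slides do not say whether F was computed from the averaged scores or averaged per query.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) as given on the slides.
rows = {
    "Transpotting": (79, 84, 81),
    "Transpotting + filtering": (82, 90, 86),
    "Before merging": (90, 43, 58),
    "After merging": (86, 54, 66),
}

for name, (p, r, reported) in rows.items():
    print(f"{name}: computed {f_measure(p, r):.1f}, reported {reported}")
```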

Conclusion
- Use of word alignment in a bilingual concordancer
- Quantitative evaluation of a transpotting algorithm
- Two new issues
  - Filtering erroneous transpots
  - Merging similar variants of translations

Future work
- Improvement of word alignment
  - Higher level IBM models
  - Phrase-based models
- Use of pseudo-relevance feedback to improve transpotting
- Evaluation with end users

Thank you for your attention