Evaluation of Machine Translation
Based on Philipp Koehn's slides from Chapter 8




Why Evaluation?
How good is a given system?
Which one is the best system for our purpose?
How much did we improve our system?
How can we tune our system to become better?
But MT evaluation is a difficult problem!

Evaluation Metrics
subjective judgments by human evaluators
automatic evaluation metrics
task-based evaluation, e.g.: how much post-editing effort? does the information come across?

Ten Translations of a Chinese Sentence
Israeli officials are responsible for airport security.
Israel is in charge of the security at this airport.
The security work for this airport is the responsibility of the Israel government.
Israeli side was in charge of the security of this airport.
Israel is responsible for the airport's security.
Israel is responsible for safety work at this airport.
Israel presides over the security of the airport.
Israel took charge of the airport security.
The safety of this airport is taken charge of by Israel.
This airport's security is the responsibility of the Israeli security officials.
(a typical example from the 2001 NIST evaluation set)

Adequacy and Fluency
Human judgement
given: machine translation output
given: source and/or reference translation
task: assess the quality of the machine translation output
Metrics
Adequacy: Does the output convey the same meaning as the input sentence? Is part of the message lost, added, or distorted?
Fluency: Is the output good, fluent English? This involves both grammatical correctness and idiomatic word choices.

Human vs. Automatic Evaluation
Human evaluation is
ultimately what we are interested in, but
very time consuming
not re-usable
Automatic evaluation is
cheap and re-usable, but
not necessarily reliable

Human Evaluation
Source: Estos tejidos están analizados, transformados y congelados antes de ser almacenados en Héma-Québec, que gestiona también el único banco público de sangre del cordón umbilical en Quebec.
Reference: These tissues are analyzed, processed and frozen before being stored at Héma-Québec, which manages also the only bank of placental blood in Quebec.
Translation: These weavings are analyzed, transformed and frozen before being stored in Hema-Quebec, that negotiates also the public only bank of blood of the umbilical cord in Quebec.
What is your judgement in terms of adequacy and fluency?

Fluency and Adequacy: Scales
Adequacy: 5 all meaning, 4 most meaning, 3 much meaning, 2 little meaning, 1 none
Fluency: 5 flawless English, 4 good English, 3 non-native English, 2 disfluent English, 1 incomprehensible

Measuring Agreement between Evaluators
Kappa coefficient:
K = (p(A) - p(E)) / (1 - p(E))
p(A): proportion of times that the evaluators agree
p(E): proportion of times that they would agree by chance (5-point scale, so p(E) = 1/5)
Example: inter-evaluator agreement in the WMT 2007 evaluation campaign

Evaluation type   P(A)   P(E)   K
Fluency           .400   .2     .250
Adequacy          .380   .2     .226

Evaluators Disagree
[Figure: histogram of adequacy judgments by different human evaluators (from the WMT 2006 evaluation)]

Rank Sentences
(Screenshot of the WMT09 Spanish-English News Corpus ranking interface; annotator: ccb)
Source: Estos tejidos están analizados, transformados y congelados antes de ser almacenados en Héma-Québec, que gestiona también el único banco público de sangre del cordón umbilical en Quebec.
Reference: These tissues are analyzed, processed and frozen before being stored at Héma-Québec, which manages also the only bank of placental blood in Quebec.
Candidate translations to be ranked:
1. These weavings are analyzed, transformed and frozen before being stored in Hema-Quebec, that negotiates also the public only bank of blood of the umbilical cord in Quebec.
2. These tissues analysed, processed and before frozen of stored in Hema-Québec, which also operates the only public bank umbilical cord blood in Quebec.
3. These tissues are analyzed, processed and frozen before being stored in Hema-Québec, which also manages the only public bank umbilical cord blood in Quebec.
4. These tissues are analyzed, processed and frozen before being stored in Hema-Quebec, which also operates the only public bank of umbilical cord blood in Quebec.
5. These fabrics are analyzed, are transformed and are frozen before being stored in Hema-Québec, who manages also the only public bank of blood of the umbilical cord in Quebec.
Instructions: rank each translation from Best to Worst relative to the other choices (ties are allowed). These are not interpreted as absolute scores; they are relative scores.

Ranking Translations
Task for evaluator: Is translation X better than translation Y? (choices: better, worse, equal)
Evaluators are more consistent:

Evaluation type     P(A)   P(E)   K
Fluency             .400   .2     .250
Adequacy            .380   .2     .226
Sentence ranking    .582   .333   .373
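As a concrete illustration of the kappa formula above, here is a minimal Python sketch (not from the slides; the function and the toy judgment lists are invented) that computes K for two evaluators' 5-point judgments under the uniform-chance assumption used on the slide:

```python
def kappa(judgments_a, judgments_b, num_categories=5):
    """Kappa coefficient: K = (p(A) - p(E)) / (1 - p(E))."""
    assert len(judgments_a) == len(judgments_b)
    n = len(judgments_a)

    # p(A): observed agreement between the two evaluators
    p_a = sum(a == b for a, b in zip(judgments_a, judgments_b)) / n

    # p(E): chance agreement. The slides assume a uniform 1/5 for the
    # 5-point scale; Cohen's original definition instead uses the
    # evaluators' empirical label distributions.
    p_e = 1 / num_categories

    return (p_a - p_e) / (1 - p_e)

# Toy example with made-up adequacy judgments on a 5-point scale
eval_1 = [5, 4, 4, 2, 3, 5, 1, 4]
eval_2 = [5, 3, 4, 2, 2, 5, 2, 4]
print(round(kappa(eval_1, eval_2), 3))  # 5/8 agreement -> K = 0.531
```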

General Goals for Evaluation Metrics
Correct: metric must rank better systems higher
Meaningful: score should give an intuitive interpretation of translation quality
Low cost: reduce time and money spent on carrying out evaluation
Useful for tuning: automatically optimize system parameters towards the metric
Consistent: repeated use of the metric should give the same results

Automatic Evaluation Metrics
Goal: a computer program that computes the quality of translations
Advantages: low cost, fast, re-usable
Basic strategy
given: machine translation output
given: human reference translation
task: compute similarity between them

Precision and Recall of Words
SYSTEM A:  Israeli officials responsibility of airport safety
REFERENCE: Israeli officials are responsible for airport security
Precision = correct / output-length = 3/6 = 50%
Recall = correct / reference-length = 3/7 = 43%
F-measure = precision * recall / ((precision + recall) / 2) = .5 * .43 / ((.5 + .43) / 2) = 46%
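The word-level precision, recall and F-measure above can be reproduced with a few lines of Python; this is my own minimal sketch, not code from the slides:

```python
def word_prf(output, reference):
    """Word precision/recall/F-measure against a single reference.

    'correct' counts output words that also occur in the reference,
    clipped so a word is not rewarded more often than it appears there.
    """
    out, ref = output.split(), reference.split()
    correct = 0
    remaining = list(ref)
    for word in out:
        if word in remaining:
            correct += 1
            remaining.remove(word)  # clip repeated matches
    precision = correct / len(out)
    recall = correct / len(ref)
    f = precision * recall / ((precision + recall) / 2) if correct else 0.0
    return precision, recall, f

system_a = "Israeli officials responsibility of airport safety"
reference = "Israeli officials are responsible for airport security"
print(word_prf(system_a, reference))  # ~ (0.50, 0.43, 0.46)
```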

Precision and Recall
SYSTEM A:  Israeli officials responsibility of airport safety
REFERENCE: Israeli officials are responsible for airport security
SYSTEM B:  airport security Israeli officials are responsible

Metric      System A   System B
precision   50%        100%
recall      43%        86%
f-measure   46%        92%

Flaw: no penalty for reordering

Word Error Rate
Minimum number of editing steps to transform output to reference
match: words match, no cost
substitution: replace one word with another
insertion: add word
deletion: drop word
Levenshtein distance:
WER = (substitutions + insertions + deletions) / reference-length

Example
[Figure: Levenshtein edit-distance matrices aligning System A ("Israeli officials responsibility of airport safety") and System B ("airport security Israeli officials are responsible") against the reference]

Metric                  System A   System B
word error rate (WER)   57%        71%
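The WER computation is a standard dynamic program over words; the following minimal sketch (my own illustration, not code from the slides) reproduces the 57% and 71% figures above:

```python
def wer(output, reference):
    """Word error rate: word-level edit distance / reference length."""
    out, ref = output.split(), reference.split()
    # dp[i][j] = minimum edits to turn out[:i] into ref[:j]
    dp = [[0] * (len(ref) + 1) for _ in range(len(out) + 1)]
    for i in range(len(out) + 1):
        dp[i][0] = i                      # delete all output words
    for j in range(len(ref) + 1):
        dp[0][j] = j                      # insert all reference words
    for i in range(1, len(out) + 1):
        for j in range(1, len(ref) + 1):
            sub = 0 if out[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub,  # match / substitution
                           dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1)        # insertion
    return dp[len(out)][len(ref)] / len(ref)

reference = "Israeli officials are responsible for airport security"
print(wer("Israeli officials responsibility of airport safety", reference))  # ~0.57
print(wer("airport security Israeli officials are responsible", reference))  # ~0.71
```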

BLEU
N-gram overlap between machine translation output and reference translation
Compute geometric mean of n-gram precisions (typically sizes 1 to 4):
p = (precision_1 * precision_2 * ... * precision_n)^(1/n)
Add a brevity penalty for short translations:
BP = 1               if output-length c > reference-length r
BP = exp(1 - r/c)    if output-length c <= reference-length r

Example
SYSTEM A:  Israeli officials responsibility of airport safety
           ("Israeli officials" is a 2-gram match, "airport" a 1-gram match)
REFERENCE: Israeli officials are responsible for airport security
SYSTEM B:  airport security Israeli officials are responsible
           ("airport security" is a 2-gram match, "Israeli officials are responsible" a 4-gram match)

Metric               System A   System B
precision (1-gram)   3/6        6/6
precision (2-gram)   1/5        4/5
precision (3-gram)   0/4        2/4
precision (4-gram)   0/3        1/3
brevity penalty      6/7        6/7
BLEU                 0%         52%

BLEU
More efficient:
p = (prod_{i=1..n} precision_i)^(1/n) = exp( (1/n) * sum_{i=1..n} log(precision_i) )
Putting everything together (for 1- to 4-grams):
BLEU = min(1, exp(1 - reference-length / output-length)) * exp( (1/4) * sum_{n=1..4} log(precision_n) )
Typically computed over the entire test corpus, not single sentences
Can you figure out why?

Multiple Reference Translations
To account for variability, use multiple reference translations
n-grams may match in any of the references
closest reference length used
Example
SYSTEM: Israeli officials responsibility of airport safety
("Israeli officials" matches the first reference, "responsibility of" the third, "airport" any of them)
REFERENCES:
Israeli officials are responsible for airport security
Israel is in charge of the security at this airport
The security work for this airport is the responsibility of the Israel government
Israeli side was in charge of the security of this airport

Modified N-gram Precision
Avoid counting correct n-grams more often than they appear in any reference translation!
count_clip = min(count_candidate, max_count_reference)
Candidate:   the the the the the the the
Reference 1: The cat is on the mat
Reference 2: There is a cat on the mat
count_clip(the) = 2
modified unigram precision = 2/7
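The pieces above (clipped n-gram counts, geometric mean, brevity penalty) fit together as in the minimal sentence-level sketch below. This is my own illustration rather than a reference implementation (real toolkits such as sacreBLEU add smoothing and corpus-level aggregation). With the single reference it gives about 0.51 for System B; the 52% in the table is consistent with taking the brevity penalty as the plain length ratio 6/7 instead of exp(1 - r/c).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(output, references, max_n=4):
    """Sentence-level BLEU: clipped n-gram precisions, geometric mean,
    exponential brevity penalty. No smoothing, so any zero n-gram
    precision zeroes the score (one reason BLEU is usually computed
    over a whole corpus rather than single sentences)."""
    out = output.split()
    refs = [r.split() for r in references]

    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        out_counts = Counter(ngrams(out, n))
        # clip each n-gram count by its maximum count in any reference
        max_ref = Counter()
        for ref in refs:
            for ng, cnt in Counter(ngrams(ref, n)).items():
                max_ref[ng] = max(max_ref[ng], cnt)
        correct = sum(min(cnt, max_ref[ng]) for ng, cnt in out_counts.items())
        total = sum(out_counts.values())
        if correct == 0 or total == 0:
            return 0.0
        log_prec_sum += math.log(correct / total)

    # brevity penalty, using the closest reference length
    c = len(out)
    r = min((abs(len(ref) - c), len(ref)) for ref in refs)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_prec_sum / max_n)

ref = ["Israeli officials are responsible for airport security"]
print(bleu("airport security Israeli officials are responsible", ref))  # ~0.51
print(bleu("Israeli officials responsibility of airport safety", ref))  # 0.0 (no 3-gram match)
```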

Correlation with Human Judgement
[Figure: automatic metric scores plotted against human judgements]

Typical BLEU Scores
BLEU scores for 110 statistical machine translation systems (Koehn 2005)
[Table: 11 x 11 matrix of BLEU scores between Danish (da), German (de), Greek (el), English (en), Spanish (es), French (fr), Finnish (fi), Italian (it), Dutch (nl), Portuguese (pt) and Swedish (sv)]

Critique of Automatic Metrics
Ignore relevance of words (names and core concepts are more important than determiners and punctuation)
Operate on the local level (do not consider overall grammaticality of the sentence or sentence meaning)
Scores are meaningless (very test-set specific; the absolute value is not informative)
Human translators score low on BLEU (possibly because of higher variability, different word choices)

METEOR: Flexible Matching
Partial credit for matching stems
  system:    Jim went home
  reference: Joe goes home
Partial credit for matching (near) synonyms
  system:    Jim walks home
  reference: Joe goes home
Use of paraphrases
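To make the flexible-matching idea concrete, here is a toy sketch of partial credit for stem matches. It is only an illustration of the concept, not the actual METEOR metric; the crude suffix-stripping stemmer, the 0.5 partial-credit weight, and the example sentences are all invented:

```python
def crude_stem(word):
    # Invented toy stemmer: strip a few common English suffixes
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def flexible_unigram_matches(system, reference):
    """Score each system word: 1.0 for an exact match, 0.5 for a stem match."""
    sys_words = system.lower().split()
    ref_words = reference.lower().split()
    score = 0.0
    for w in sys_words:
        if w in ref_words:
            score += 1.0                                   # exact match
        elif crude_stem(w) in map(crude_stem, ref_words):
            score += 0.5                                   # partial credit: stems match
    return score / len(sys_words)

# "home" matches exactly (1.0); "walked" gets stem credit against "walks" (0.5)
print(flexible_unigram_matches("Jim walked home", "Joe walks home"))  # (0 + 0.5 + 1.0) / 3 = 0.5
```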

Evidence of Shortcomings of Automatic Metrics
[Figure: post-edited output vs. statistical systems (NIST 2005); BLEU score plotted against human adequacy score]
[Figure: rule-based system (Systran) vs. SMT systems; BLEU score plotted against human adequacy and fluency scores]

Automatic Metrics: Conclusions
Automatic metrics are an essential tool for system development
Not fully suited to rank systems of different types
Evaluation metrics are still an open challenge

Metric Research
Active development of new metrics
syntactic similarity
semantic equivalence or entailment
metrics targeted at reordering
trainable metrics
etc.
Evaluation campaigns that rank metrics

Post-Editing Machine Translation
Measuring time spent on producing translations
baseline: translation from scratch
vs.: post-editing machine translation
But: time consuming, depends on the skills of translator and post-editor
Metrics inspired by this task
TER: based on the number of editing steps; Levenshtein operations (insertion, deletion, substitution) plus movement
HTER: manually construct a reference translation for the output, then apply TER (very time consuming; used in the DARPA GALE program 2005-2011)

Task-Oriented Evaluation
Does machine translation output help accomplish a task?
Browsing quality: Is the translation understandable in its context? (its main content is clear, I can find the information I need)
Post-editing quality: How many edit operations are required to turn it into a good translation?
Publishing quality: How many human interventions are necessary to make the entire document ready for printing?

Other Evaluation Criteria
When deploying systems, considerations go beyond the quality of translations
Speed: we prefer faster machine translation systems
Size: fits into the memory of available machines (e.g., handheld devices)
Integration: can it be integrated into the existing workflow?
Customization: can it be adapted to the user's needs?

Content Understanding Tests
Given machine translation output, can a monolingual target-side speaker answer questions about it?
1. basic facts: who? where? when? names, numbers, and dates
2. actors and events: relationships, temporal and causal order
3. nuance and author intent: emphasis and subtext
Very hard to devise questions
Sentence editing task (WMT 2009-2010)
person A edits the translation to make it fluent (with no access to source or reference)
person B checks whether the edit is correct
→ did person A understand the translation correctly?

Summary
MT evaluation is important
system development
parameter tuning
task-oriented performance
MT evaluation is difficult
human evaluators are expensive and disagree
automatic metrics are not always reliable
→ Be careful when arguing about MT quality!