Evaluation of Machine Translation
(based on Philipp Koehn's slides from Chapter 8)

Why Evaluation?
- How good is a given system?
- Which one is the best system for our purpose?
- How much did we improve our system?
- How can we tune our system to become better?
But MT evaluation is a difficult problem!

Evaluation Metrics
- subjective judgments by human evaluators
- automatic evaluation metrics
- task-based evaluation, e.g.: how much post-editing effort? does information come across?

Ten Translations of a Chinese Sentence
- Israeli officials are responsible for airport security.
- Israel is in charge of the security at this airport.
- The security work for this airport is the responsibility of the Israel government.
- Israeli side was in charge of the security of this airport.
- Israel is responsible for the airport's security.
- Israel is responsible for safety work at this airport.
- Israel presides over the security of the airport.
- Israel took charge of the airport security.
- The safety of this airport is taken charge of by Israel.
- This airport's security is the responsibility of the Israeli security officials.
(a typical example from the NIST evaluation set)
Adequacy and Fluency
Human judgment:
- given: machine translation output
- given: source and/or reference translation
- task: assess the quality of the machine translation output
Metrics:
- Adequacy: Does the output convey the same meaning as the input sentence? Is part of the message lost, added, or distorted?
- Fluency: Is the output good, fluent English? This involves both grammatical correctness and idiomatic word choices.

Human vs. Automatic Evaluation
Human evaluation is:
- ultimately what we are interested in, but
- very time consuming
- not re-usable
Automatic evaluation is:
- cheap and re-usable, but
- not necessarily reliable

Human Evaluation
Source: Estos tejidos están analizados, transformados y congelados antes de ser almacenados en Héma-Québec, que gestiona también el único banco público de sangre del cordón umbilical en Quebec.
Reference: These tissues are analyzed, processed and frozen before being stored at Héma-Québec, which manages also the only bank of placental blood in Quebec.
Translation: These weavings are analyzed, transformed and frozen before being stored in Hema-Quebec, that negotiates also the public only bank of blood of the umbilical cord in Quebec.
What is your judgment in terms of adequacy and fluency?

Fluency and Adequacy: Scales
    Adequacy              Fluency
  5 all meaning         5 flawless English
  4 most meaning        4 good English
  3 much meaning        3 non-native English
  2 little meaning      2 disfluent English
  1 none                1 incomprehensible
Measuring Agreement between Evaluators
Kappa coefficient:

    K = (p(A) - p(E)) / (1 - p(E))

- p(A): proportion of times that the evaluators agree
- p(E): proportion of times that they would agree by chance (5-point scale: p(E) = 1/5)
Example: inter-evaluator agreement in the WMT 2007 evaluation campaign

    Evaluation type    P(A)    P(E)    K
    Fluency            .400    .2      .250
    Adequacy           .380    .2      .226

Evaluators Disagree
[Figure: histogram of adequacy judgments by different human evaluators, from a WMT evaluation campaign]

Rank Sentences
[Screenshot of the ranking tool: "You have judged … sentences for WMT09 Spanish-English News Corpus, … sentences total taking … seconds per sentence."]
Source: Estos tejidos están analizados, transformados y congelados antes de ser almacenados en Héma-Québec, que gestiona también el único banco público de sangre del cordón umbilical en Quebec.
Reference: These tissues are analyzed, processed and frozen before being stored at Héma-Québec, which manages also the only bank of placental blood in Quebec.
Translations to rank:
- These weavings are analyzed, transformed and frozen before being stored in Hema-Quebec, that negotiates also the public only bank of blood of the umbilical cord in Quebec.
- These tissues analysed, processed and before frozen of stored in Hema-Québec, which also operates the only public bank umbilical cord blood in Quebec.
- These tissues are analyzed, processed and frozen before being stored in Hema-Québec, which also manages the only public bank umbilical cord blood in Quebec.
- These tissues are analyzed, processed and frozen before being stored in Hema-Quebec, which also operates the only public bank of umbilical cord blood in Quebec.
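The kappa computation is a one-liner; here is a quick sketch with illustrative numbers (a raw agreement of 0.40 on a 5-point scale, where chance agreement is 1/5):

```python
def kappa(p_a, p_e):
    # Kappa coefficient: agreement corrected for chance agreement.
    # K = (p(A) - p(E)) / (1 - p(E))
    return (p_a - p_e) / (1 - p_e)

# Illustrative: evaluators agree 40% of the time on a 5-point scale,
# where agreeing purely by chance happens 1/5 of the time.
print(round(kappa(0.40, 0.20), 3))  # 0.25
```

A kappa around 0.25 is usually read as only "fair" agreement, which is why sentence ranking (below) replaced absolute fluency/adequacy scores in later WMT campaigns.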
- These fabrics are analyzed, are transformed and are frozen before being stored in Hema-Québec, who manages also the only public bank of blood of the umbilical cord in Quebec.
Annotator: ccb    Task: WMT09 Spanish-English News Corpus
Instructions: Rank each translation from Best to Worst relative to the other choices (ties are allowed). These are not interpreted as absolute scores; they are relative scores.

Ranking Translations
Task for evaluator: Is translation X better than translation Y? (choices: better, worse, equal)
Evaluators are more consistent:

    Evaluation type      P(A)    P(E)    K
    Fluency              .400    .2      .250
    Adequacy             .380    .2      .226
    Sentence ranking     .582    .333    .373
General Goals for Evaluation Metrics
- Correct: metric must rank better systems higher
- Meaningful: score should give intuitive interpretation of translation quality
- Low cost: reduce time and money spent on carrying out evaluation
- Useful for tuning: automatically optimize system parameters towards metric
- Consistent: repeated use of metric should give same results

Precision and Recall of Words
SYSTEM A:  Israeli officials responsibility of airport safety
REFERENCE: Israeli officials are responsible for airport security

    precision = correct / output-length    = 3/6 = 50%
    recall    = correct / reference-length = 3/7 = 43%
    f-measure = precision * recall / ((precision + recall) / 2)
              = .50 * .43 / ((.50 + .43) / 2) = 46%

Automatic Evaluation Metrics
Goal: a computer program that computes the quality of translations
Advantages: low cost, fast, re-usable
Basic strategy:
- given: machine translation output
- given: human reference translation
- task: compute similarity between them
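A minimal sketch of these word-level scores (bag-of-words matching with clipped counts; the function and variable names are mine):

```python
from collections import Counter

def word_prf(candidate, reference):
    """Word-level precision, recall, and F-measure against one reference."""
    cand, ref = candidate.split(), reference.split()
    # Clipped matches: a candidate word counts at most as often
    # as it appears in the reference.
    correct = sum((Counter(cand) & Counter(ref)).values())
    precision = correct / len(cand)
    recall = correct / len(ref)
    f = precision * recall / ((precision + recall) / 2) if correct else 0.0
    return precision, recall, f

p, r, f = word_prf("Israeli officials responsibility of airport safety",
                   "Israeli officials are responsible for airport security")
print(round(p, 2), round(r, 2), round(f, 2))  # 0.5 0.43 0.46
```

System A matches 3 of its 6 words (Israeli, officials, airport), reproducing the 50% / 43% / 46% figures above.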
Word Error Rate
Minimum number of editing steps to transform output to reference:
- match: words match, no cost
- substitution: replace one word with another
- insertion: add word
- deletion: drop word
Levenshtein distance:

    wer = (substitutions + insertions + deletions) / reference-length

Precision and Recall
SYSTEM A:  Israeli officials responsibility of airport safety
REFERENCE: Israeli officials are responsible for airport security
SYSTEM B:  airport security Israeli officials are responsible

    Metric       System A    System B
    precision    50%         100%
    recall       43%         86%
    f-measure    46%         92%

Flaw: no penalty for reordering!

Example

    Metric                   System A    System B
    word error rate (wer)    57%         71%

[Figure: Levenshtein alignment matrices for System A and System B against the reference]

BLEU
N-gram overlap between machine translation output and reference translation.
Compute geometric mean of n-gram precisions (typically up to size 4):

    p = (precision_1 * precision_2 * ... * precision_n)^(1/n)

Add brevity penalty for short translations:

    BP = 1               if output-length c >  reference-length r
    BP = exp(1 - r/c)    if output-length c <= reference-length r
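WER as defined above is just word-level Levenshtein distance normalized by reference length; a minimal dynamic-programming sketch:

```python
def wer(candidate, reference):
    """Word error rate: word-level edit distance / reference length."""
    hyp, ref = candidate.split(), reference.split()
    # d[i][j] = edits to turn the first j hypothesis words
    # into the first i reference words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # match / substitution
                          d[i - 1][j] + 1,        # one word missing
                          d[i][j - 1] + 1)        # one word extra
    return d[-1][-1] / len(ref)

reference = "Israeli officials are responsible for airport security"
print(round(wer("Israeli officials responsibility of airport safety", reference), 2))  # 0.57
print(round(wer("airport security Israeli officials are responsible", reference), 2))  # 0.71
```

System B contains only correct words, yet scores worse than System A: its reordering forces the aligner to delete "airport security" at the front and re-insert it at the end, which is exactly the flaw the slide's table illustrates.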
Example
SYSTEM A:  Israeli officials responsibility of airport safety
REFERENCE: Israeli officials are responsible for airport security
SYSTEM B:  airport security Israeli officials are responsible

    Metric               System A    System B
    precision (1-gram)   3/6         6/6
    precision (2-gram)   1/5         4/5
    precision (3-gram)   0/4         2/4
    precision (4-gram)   0/3         1/3
    brevity penalty      6/7         6/7
    BLEU                 0%          52%

BLEU
More efficient:

    p = (prod_{i=1}^{n} precision_i)^(1/n)
      = exp( (1/n) * sum_{i=1}^{n} log(precision_i) )

Putting everything together (for 1- to 4-grams):

    BLEU = min(1, exp(1 - reference-length / output-length))
           * exp( sum_{n=1}^{4} (1/4) * log(precision_n) )

Typically computed over the entire test corpus, not single sentences. Can you figure out why?

Multiple Reference Translations
To account for variability, use multiple reference translations:
- n-grams may match in any of the references
- closest reference length used
Example:
SYSTEM:     Israeli officials responsibility of airport safety
REFERENCES: Israeli officials are responsible for airport security
            Israel is in charge of the security at this airport
            The security work for this airport is the responsibility of the Israel government
            Israeli side was in charge of the security of this airport

Modified N-gram Precision
Avoid counting correct n-grams more often than they appear in any reference translation:

    count_clip = min(count_candidate, max_count_reference)

Candidate:   the the the the the the the
Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.
count_clip(the) = 2, so unigram precision = 2/7
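The pieces above (clipped n-gram precisions, geometric mean, brevity penalty) fit together in a few lines. This is a sentence-level sketch with a single reference; real BLEU is computed over a whole test corpus, usually against multiple references:

```python
import math
from collections import Counter

def ngram_counts(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngram_counts(cand, n)
        # Clipped counts: an n-gram matches at most as often as it
        # occurs in the reference.
        matches = sum((cand_ngrams & ngram_counts(ref, n)).values())
        total = sum(cand_ngrams.values())
        precisions.append(matches / total if total else 0.0)
    if min(precisions) == 0.0:
        return 0.0  # any zero n-gram precision zeroes the geometric mean
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity_penalty * geo_mean

reference = "Israeli officials are responsible for airport security"
print(bleu("Israeli officials responsibility of airport safety", reference))  # 0.0
```

System A scores 0 because its 3-gram precision is 0 (one reason BLEU is computed over a corpus, not per sentence). System B scores about 0.51 here; the 52% in the table corresponds to using the plain length ratio 6/7 as the brevity penalty rather than exp(1 - r/c).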
Correlation with Human Judgement
[Figure: correlation between automatic metric scores and human judgments]

Typical BLEU Scores
BLEU scores for 110 statistical machine translation systems, trained between all pairs of the languages da, de, el, en, es, fr, fi, it, nl, pt, and sv (Koehn 2005).
[Table: the 11x11 matrix of BLEU scores; the numeric values were corrupted in extraction]

Critique of Automatic Metrics
- Ignore relevance of words (names and core concepts more important than determiners and punctuation)
- Operate on local level (do not consider overall grammaticality of the sentence or sentence meaning)
- Scores are meaningless (scores very test-set specific, absolute value not informative)
- Human translators score low on BLEU (possibly because of higher variability, different word choices)

METEOR: Flexible Matching
Partial credit for matching stems:
    system:    Jim went home
    reference: Joe goes home
Partial credit for matching (near) synonyms:
    system:    Jim walks home
    reference: Joe goes home
Use of paraphrases
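A toy illustration of the stem-matching idea only; the suffix stripper below is a crude stand-in of my own, nothing like METEOR's actual stemmer, synonym tables, or paraphrase machinery:

```python
def crude_stem(word):
    # Toy stemmer: strip a few common English suffixes (illustration only).
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

def stem_matches(system, reference):
    """Count system words that match a reference word after stemming."""
    ref_stems = [crude_stem(w.lower()) for w in reference.split()]
    hits = 0
    for word in system.split():
        stem = crude_stem(word.lower())
        if stem in ref_stems:
            hits += 1
            ref_stems.remove(stem)  # each reference word matches at most once
    return hits

# "goes" and "going" both reduce to "go", so they earn partial credit:
print(stem_matches("Jim goes home", "Joe going home"))  # 2
```

Note that stemming alone cannot credit "walks" against "goes"; that is why METEOR additionally matches synonyms and paraphrases.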
Evidence of Shortcomings of Automatic Metrics
[Figure: BLEU score vs. human adequacy score, post-edited output vs. statistical systems (NIST)]
[Figure: BLEU score vs. human adequacy and fluency scores, rule-based system (Systran) vs. SMT systems]

Automatic Metrics: Conclusions
- Automatic metrics are an essential tool for system development
- Not fully suited to rank systems of different types
- Evaluation metrics are still an open challenge

Metric Research
Active development of new metrics:
- syntactic similarity
- semantic equivalence or entailment
- metrics targeted at reordering
- trainable metrics
- etc.
Evaluation campaigns that rank metrics
Post-Editing Machine Translation
Measuring time spent on producing translations:
- baseline: translation from scratch
- post-editing machine translation
But: time consuming, depends on the skills of the translator and post-editor.
Metrics inspired by this task:
- TER: based on the number of editing steps; Levenshtein operations (insertion, deletion, substitution) plus movement
- HTER: manually construct a reference translation for the output, then apply TER (very time consuming; used in the DARPA GALE program)

Task-Oriented Evaluation
Does machine translation output help accomplish a task?
- browsing quality: Is the translation understandable in its context? (its main content is clear enough to find the information I need)
- post-editing quality: How many edit operations are required to turn it into a good translation?
- publishing quality: How many human interventions are necessary to make the entire document ready for printing?

Other Evaluation Criteria
When deploying systems, considerations go beyond quality of translations:
- Speed: we prefer faster machine translation systems
- Size: fits into memory of available machines (e.g., handheld devices)
- Integration: can be integrated into existing workflow
- Customization: can be adapted to user's needs

Content Understanding Tests
Given machine translation output, can a monolingual target-side speaker answer questions about it?
1. basic facts: who? where? when? names, numbers, and dates
2. actors and events: relationships, temporal and causal order
3. nuance and author intent: emphasis and subtext
Very hard to devise questions.
Sentence editing task (WMT 2009-2010):
- person A edits the translation to make it fluent (with no access to source or reference)
- person B checks if the edit is correct
-> did person A understand the translation correctly?
Summary
MT evaluation is important:
- system development
- parameter tuning
- task-oriented performance
MT evaluation is difficult:
- human evaluators are expensive and disagree
- automatic metrics are not always reliable
Be careful when arguing about MT quality!