A Three-Layer Architecture for Automatic Post-Editing System Using Rule-Based Paradigm


Mahsa Mohaghegh (1), Abdolhossein Sarrafzadeh (1), Mehdi Mohammadi (2)
(1) Unitec, (2) Sheikhbahaee University

Content: Introduction, Literature Review, Description of the System, Experiments and Results, Conclusion, References

Introduction
Statistical machine translation (SMT) is one of the most popular machine translation (MT) approaches.
MT output is often seriously ungrammatical, and this happens more often with SMT than with other approaches.
Our previous work focused on English-Persian SMT.

Introduction
Certain MT systems can carry out the task of an automatic post-editing (APE) component.
Machine translation mistakes are repetitive.
The APE process is very similar to the machine translation process.
Types of rule-based MT (RBMT) systems: direct systems, transfer systems, and interlingual systems.

Introduction: The Grafix APE System
It follows the pattern of transfer-based systems.
It corrects some grammatical errors that frequently occur in SMT output for English-to-Persian translation.

Related Works
The most popular combination of an MT system and an APE module is an RBMT main system linked with phrase-based SMT (PB-SMT): Simard et al. (2007), Lagarda et al. (2009), Isabelle et al. (2007).
More recently, RBMT has been used as the base for the APE system itself: Marecek et al. (2011), Rosa et al. (2012).

Description of the System
Our previous PeEn-SMT system is coupled with an RBMT-based APE component.
RBMT is a strong candidate for bringing linguistic knowledge to bear on the translation output, because encoding such knowledge is its main strength.
There are three levels of transformation: lexical transformers, shallow transformers, and deep transformers.

Description of the System: Overview of the APE System
Input comes from the SMT system.
Flow: input -> tokenization -> lexical transformation -> shallow transformation -> deep transformation -> output generation.
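The flow on this slide can be sketched as a chain of stages applied in a fixed order. This is a minimal illustration and not the Grafix implementation: the `Sentence` container, `make_stage`, `post_edit` and the placeholder stages are hypothetical names, and each stage here only records that it ran.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Sentence:
    """Hypothetical container passed from one layer to the next."""
    tokens: List[str]
    applied: List[str] = field(default_factory=list)  # names of stages that ran


Stage = Callable[[Sentence], Sentence]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""
    def stage(sentence: Sentence) -> Sentence:
        sentence.applied.append(name)
        return sentence
    return stage


def post_edit(smt_output: str, stages: List[Stage]) -> str:
    """Tokenize SMT output, run every layer in order, and rebuild the text."""
    sentence = Sentence(tokens=smt_output.split())
    for stage in stages:
        sentence = stage(sentence)
    return " ".join(sentence.tokens)


# Order follows the slide: lexical, then shallow, then deep transformation.
PIPELINE = [make_stage("lexical"), make_stage("shallow"), make_stage("deep")]
```

Keeping every layer behind the same signature means a real transformer can replace a placeholder without touching the driver loop.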

Description of the System: The Underlying SMT System
Persian is a morphologically rich language, and incorrect word order is a common problem in MT output.
Joshua is a hierarchical SMT system that takes syntax into account, using phrases to learn word reordering.
We conducted our study on Joshua after numerous experiments with Moses.

Description of the System: Training Data Source
We used dependency structures to represent sentences.
Our main source of training data for both POS tagging and dependency parsing was the Persian Dependency Treebank.
It contains over 12,500 sentences and 189,000 tokens.
Its data format is based on the CoNLL Shared Task on Dependency Parsing.
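For readers unfamiliar with the CoNLL format mentioned above, the sketch below reads the ten tab-separated columns defined by the CoNLL-X shared task, with a blank line between sentences. The function name `read_conll` and the toy English sample (including its tags and relation labels) are made up for illustration; the treebank's own files may deviate from this layout in details.

```python
from typing import Dict, List

# Column names of the CoNLL-X shared task format, in order.
CONLL_COLUMNS = ["id", "form", "lemma", "cpostag", "postag",
                 "feats", "head", "deprel", "phead", "pdeprel"]


def read_conll(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-style text into sentences of token dictionaries."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # a blank line closes the sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(dict(zip(CONLL_COLUMNS, line.split("\t"))))
    if current:                        # last sentence without trailing blank
        sentences.append(current)
    return sentences


# Toy sample; tags and labels are illustrative, not treebank data.
SAMPLE = (
    "1\tshe\tshe\tPR\tPR\t_\t2\tSBJ\t_\t_\n"
    "2\tsleeps\tsleep\tV\tV\t_\t0\tROOT\t_\t_\n"
    "\n"
    "1\tyes\tyes\tADV\tADV\t_\t0\tROOT\t_\t_\n"
)
```

Each token's `head` column holds the `id` of its governing word, with `0` marking the root of the sentence.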

Description of the System: Pre-Processing and Tagging
To tokenize the text, we split it on whitespace, semi-space, tab and newline characters and discard them.
Maximum Likelihood Estimation (MLE) is used for POS tagging.
This approach has yielded promising results for tagging Persian (Raja et al., 2007).
The tagger is trained on the data of the Persian Dependency Treebank.
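A minimal sketch of these two steps follows. It rests on assumptions the slide does not confirm: the semi-space is taken to be the zero-width non-joiner (U+200C), and MLE tagging is shown in its simplest unigram form, where each word receives the tag it carried most often in training and unseen words receive the overall most frequent tag. All function names and the toy training data are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Separators named on the slide; U+200C for "semi-space" is an assumption.
SEPARATORS = re.compile(r"[ \t\n\u200c]+")


def tokenize(text):
    """Split on the separator characters and drop empty pieces."""
    return [piece for piece in SEPARATORS.split(text) if piece]


def train_mle_tagger(tagged_sentences):
    """Count (word, tag) pairs and keep the most frequent tag per word."""
    word_tag_counts = defaultdict(Counter)
    tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            word_tag_counts[word][tag] += 1
            tag_counts[tag] += 1
    default_tag = tag_counts.most_common(1)[0][0]   # fallback for unseen words
    best_tag = {word: counts.most_common(1)[0][0]
                for word, counts in word_tag_counts.items()}
    return best_tag, default_tag


def tag(tokens, best_tag, default_tag):
    """Assign each token its most likely tag under the unigram estimate."""
    return [(token, best_tag.get(token, default_tag)) for token in tokens]


# Toy training data, not taken from the treebank.
TRAINING = [[("book", "N"), ("table", "N")],
            [("book", "N"), ("book", "V"), ("read", "V")]]
```

Note that splitting on the semi-space separates affixes that Persian orthography attaches with it, so a production tokenizer may need to treat it differently.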

Description of the System: Dependency Parsing
In a dependency parse tree, words are linked to their arguments by dependency relations.
We used MSTParser, an implementation of dependency parsing.
Its main algorithm is Maximum Spanning Tree (MST): the best parse tree is obtained by finding the maximum spanning tree.
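To make the idea concrete, the sketch below scores every possible head assignment for a three-word sentence and keeps the highest-scoring one that forms a tree. This exhaustive search is only an illustration of the objective; MSTParser relies on efficient spanning-tree algorithms (McDonald et al., 2005), not on enumeration. The arc scores are toy numbers, and `is_tree` and `best_parse` are hypothetical names.

```python
from itertools import product


def is_tree(heads):
    """heads[i] is the head of word i+1; node 0 is the artificial root.

    The assignment is a tree if every word reaches the root without a cycle.
    """
    for start in range(1, len(heads) + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:          # revisiting a node means a cycle
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def best_parse(scores):
    """scores[h][d] is the score of the arc from head h to dependent d."""
    n = len(scores) - 1               # number of real words
    best_total, best_heads = float("-inf"), None
    for heads in product(range(n + 1), repeat=n):
        if not is_tree(heads):
            continue
        total = sum(scores[h][d] for d, h in enumerate(heads, start=1))
        if total > best_total:
            best_total, best_heads = total, heads
    return list(best_heads), best_total


# Toy scores for a three-word sentence (row = head, column = dependent).
SCORES = [
    [0, 9, 10, 9],    # arcs leaving the root
    [0, 0, 20, 3],    # arcs leaving word 1
    [0, 30, 0, 30],   # arcs leaving word 2
    [0, 11, 0, 0],    # arcs leaving word 3
]
```

With these scores, picking the best head for each word independently would link words 1 and 2 in a cycle, which is why the search must be restricted to trees. The enumeration grows exponentially with sentence length and allows several words to attach to the root.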

Description of the System: Sentence Dependency Tree, an Example
[Figure: dependency tree of an example sentence.]

Description of the System: Rule-based Transformers
Transformation rules are acquired manually by:
identifying the incorrect and incomplete translation outputs;
determining frequent wrong patterns in the dependency parser output for those incorrect or incomplete translations.
Word-level corrections are handled by the Lexical Transformers.
Patterns identified in the POS sequences are handled by the Shallow Transformers.
Patterns identified in the dependency trees are handled by the Deep Transformers.

Description of the System: Lexical Transformation
OOV Remover
A substitution rule of the form E -> F1, F2, ..., Fn.
The first meaning found for the English word in the dictionary (often the most frequent one) is used.
Transliterator
Even after the OOV Remover, some words remain in English.
The Transliterator is trained on prepared data to produce the most likely Persian word for an English word.
The training data set contains over 4,600 of the most frequently used Persian words and named entities.
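The OOV Remover rule can be sketched as a dictionary lookup that keeps the first listed meaning. The function name `remove_oov`, the lower-casing of lookup keys and the one-entry toy dictionary are assumptions made for illustration; the Transliterator is not sketched, since the slide does not describe its model.

```python
def remove_oov(tokens, dictionary):
    """Replace tokens found in the bilingual dictionary by their first meaning.

    Tokens without an entry are left untouched, which is where a
    transliteration step would take over.
    """
    result = []
    for token in tokens:
        meanings = dictionary.get(token.lower())
        result.append(meanings[0] if meanings else token)
    return result


# Toy dictionary entry of the form E -> F1, F2; not the system's dictionary.
TOY_DICTIONARY = {"book": ["کتاب", "دفتر"]}
```

Always taking the first meaning ignores context, so a word with several senses can receive the wrong translation.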

Description of the System: Shallow Transformers
IncompleteDependentTransformer
The tag <SUBR> in the POS sequence without a subsequent verb indicates an incomplete dependent clause: a verb is needed to complete the sentence.
IncompleteEndedPREMTransformer
Pre-modifiers (tagged PREM) are noun modifiers and should precede nouns. A POS sequence in which a pre-modifier is located at the end of a sentence is deemed incorrect.
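Both checks operate on the POS sequence alone, which the sketch below illustrates. Only detection is shown, because the slide does not say how the sentence is repaired. `SUBR` and `PREM` come from the slide; the verb tag `V`, the punctuation tag `PUNC` and both function names are assumptions.

```python
VERB_TAG = "V"            # assumed tag for verbs
PUNCTUATION_TAG = "PUNC"  # assumed tag for punctuation


def has_incomplete_dependent(tags):
    """True if some SUBR tag is not followed by any verb tag."""
    for index, tag in enumerate(tags):
        if tag == "SUBR" and VERB_TAG not in tags[index + 1:]:
            return True
    return False


def ends_with_premodifier(tags):
    """True if the last non-punctuation tag of the sentence is PREM."""
    content = [tag for tag in tags if tag != PUNCTUATION_TAG]
    return bool(content) and content[-1] == "PREM"
```

For example, the sequence `N SUBR N PUNC` is flagged by the first check, while `N SUBR N V PUNC` is not.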

Description of the System: Shallow Transformers
AdjectiveArrangementTransformer
In Farsi, adjectives usually come after the nouns they describe, except for superlative adjectives, which are identified by the suffix ترين (/tarin/).
A non-superlative adjective appearing before the noun it describes indicates incorrect composition.
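A minimal sketch of this rule on a tagged sentence follows. The slide only states what counts as incorrect; repairing the error by swapping the two words is an assumption, as are the tags `ADJ` and `N` and the function name. The suffix is checked in both common spellings of the final letters.

```python
# The superlative suffix, spelled with Persian yeh and with Arabic yeh.
SUPERLATIVE_SUFFIXES = ("ترین", "ترين")


def fix_adjective_order(tagged):
    """Swap a non-superlative adjective with the noun that directly follows it."""
    result = list(tagged)
    index = 0
    while index < len(result) - 1:
        word, tag = result[index]
        _, next_tag = result[index + 1]
        if (tag == "ADJ" and next_tag == "N"
                and not word.endswith(SUPERLATIVE_SUFFIXES)):
            result[index], result[index + 1] = result[index + 1], result[index]
            index += 2            # skip past the pair that was just swapped
        else:
            index += 1
    return result
```

A pattern such as noun, adjective, noun would also trigger the swap even when the adjective already modifies the first noun, so a real rule needs more context than this sketch uses.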

Description of the System: Deep Transformers
The input is parsed by a dependency parser.
The rules here examine the dependency tree of each sentence to verify that it satisfies certain syntactic and grammatical constraints.

Description of the System: Deep Transformers
NoSubjectSentenceTransformer
In some cases, what was parsed as the object of the sentence is actually the subject.
Indication: the sentence has a third-person verb, no definite subject, and an object marked by a word tagged POSTP (postposition) in the POS sequence.
This transformer revises the sentence by removing the postposition را (/rã/), which marks the direct object.
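The condition and the repair can be sketched over a list of parsed tokens. The token layout is hypothetical: the `person` field on verbs, the relation label `SBJ` and the verb tag `V` are assumptions, and the sketch does not relabel the former object as the subject.

```python
RA = "را"  # the direct-object postposition named on the slide


def fix_no_subject(tokens):
    """Drop the postposition when a third-person sentence has no subject.

    Each token is a dict with 'form', 'postag', 'deprel' and, for verbs,
    an optional 'person' value (a hypothetical layout).
    """
    has_subject = any(t["deprel"] == "SBJ" for t in tokens)
    has_third_person_verb = any(
        t["postag"] == "V" and t.get("person") == 3 for t in tokens)
    has_ra = any(t["postag"] == "POSTP" and t["form"] == RA for t in tokens)
    if has_subject or not has_third_person_verb or not has_ra:
        return list(tokens)       # the rule does not apply
    return [t for t in tokens
            if not (t["postag"] == "POSTP" and t["form"] == RA)]
```

Sentences that already have a subject are returned unchanged, so the rule cannot remove a legitimate object marker from a complete sentence.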

Description of the System: Deep Transformers
PluralNounsTransformer
In Farsi, the noun following a number is always singular.
Incorrect cases in the SMT output are corrected by removing the Persian plural suffix ها (/hã/).
VerbArrangementTransformer
The dominant word order in Farsi is SOV.
The error case: the main verb (the root) does not occur immediately before the sentence-final period.
The sentence is reordered by moving the root verb (and its non-verbal element, in the case of compound verbs) to the end of the sentence, just before the period.
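The two rules on this slide can be sketched as follows. The tags `NUM`, `N` and `V`, the relation label `NVE`, the token layout and both function names are assumptions made for the example, and token ids are not renumbered after the move.

```python
ZWNJ = "\u200c"        # zero-width non-joiner, often written before the suffix
PLURAL_SUFFIX = "ها"


def singularize_after_numbers(tagged):
    """Strip the plural suffix from a noun that directly follows a number."""
    result = []
    previous_tag = None
    for word, tag in tagged:
        if (previous_tag == "NUM" and tag == "N"
                and word.endswith(PLURAL_SUFFIX)
                and len(word) > len(PLURAL_SUFFIX)):
            word = word[:-len(PLURAL_SUFFIX)].rstrip(ZWNJ)
        result.append((word, tag))
        previous_tag = tag
    return result


def move_root_verb_to_end(tokens):
    """Move the root verb and its non-verbal elements before the final period.

    Each token is a dict with 'id', 'form', 'postag', 'head' and 'deprel'.
    """
    root = next((t for t in tokens
                 if t["head"] == 0 and t["postag"] == "V"), None)
    if root is None:
        return list(tokens)
    has_period = bool(tokens) and tokens[-1]["form"] == "."
    body = tokens[:-1] if has_period else list(tokens)
    tail = tokens[-1:] if has_period else []
    moved_ids = {root["id"]} | {
        t["id"] for t in body
        if t["head"] == root["id"] and t["deprel"] == "NVE"}
    moved = [t for t in body if t["id"] in moved_ids]
    rest = [t for t in body if t["id"] not in moved_ids]
    return rest + moved + tail
```

The suffix test is purely orthographic, so a singular noun that happens to end in the same two letters would be shortened by mistake; a real rule would need a lexicon or morphological analysis to rule this out.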

Description of the System: Deep Transformers
MissingVerbTransformer
SMT output can be missing verbs, especially compound verbs.
The Persian Verb Valency Lexicon is our source for choosing the proper verb for a non-verbal element in the sentence. All obligatory and optional non-verbal elements (main-verb dependents) are listed in this lexicon.
The correct root of the verb (past or present) is found by looking at the other verbs in the sentence.
The last word in the sentence is looked up in the Verb Valency Lexicon as a candidate non-verbal element.

Experiments and Results
Our baseline system is Joshua 4.0 with default settings.
The training data consisted of almost 85,000 sentence pairs (1.4 million words).
The data come mostly from bilingual news sites and cover different domains such as literature, science and conversation.
The language model was built from text of the IRNA website, comprising over 66 million words.

Experiments and Results: Test Data Set
Eight test sets were extracted from bilingual websites.
Test sentences were randomly selected and cover different domains such as art, medicine, economics and news.
Size of the test sets: 100 to 600 words (one paragraph to a complete page).
The total number of test sentences is 153.

Experiments and Results: Automatic Evaluation (BLEU)
[Figure: bar chart of BLEU scores (axis 0 to 1) before and after APE for test sets #1 to #8.]

Experiments and Results: Automatic Evaluation (NIST)
[Figure: bar chart of NIST scores (axis 0 to 8) before and after APE for test sets #1 to #8.]

Experiments and Results: Automatic Evaluation
The results generally show increases in both metrics.
For certain test sets, the metrics report a decrease in output quality. The main reason is a lack of training data for the Transliterator module.
The quality of the statistical translation (in terms of BLEU score) directly affects the APE module.
In test set #4, from the religious genre, the decrease in both BLEU and NIST is attributed to a lack of data in this genre.

Experiments and Results: Manual Evaluation
The same test sets used in the automatic evaluation are evaluated here.
Two annotators separately annotated the APE output as an improvement, a decrease, or no change in fluency.
Manual Evaluation for Grafix (percent):
              Improved   No Change   Weakened
Average %     29.40%     63.40%      7.20%
Agreement %   25%        59%         3%

Experiments and Results: Manual Evaluation
The APE system improved the quality of the baseline SMT output in 29.4% of the cases.
The rules developed so far for the APE system take effect on about 37% of the SMT-translated sentences.
Counting only the cases on which the annotators agree, the APE system improved 25% and weakened 3%.
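The figures on this slide follow from the manual-evaluation percentages (average: 29.4% improved, 63.4% no change, 7.2% weakened; agreement: 25%, 59%, 3%), as the short check below shows. Reading the sum of the agreement values as the overall share of sentences on which both annotators gave the same label is an interpretation, not something the slides state; the variable names are hypothetical.

```python
# Percentages reported for the manual evaluation.
average = {"improved": 29.4, "no_change": 63.4, "weakened": 7.2}
agreement = {"improved": 25.0, "no_change": 59.0, "weakened": 3.0}

# Share of sentences that the rules changed in either direction.
affected = round(average["improved"] + average["weakened"], 1)

# The three average categories should cover all sentences.
total = round(sum(average.values()), 1)

# Sum of the agreement row (interpretation: sentences labelled identically).
agreed = sum(agreement.values())

print(f"affected by the rules: {affected}%")   # about 37% on the slide
print(f"sum of categories: {total}%")
print(f"sum of agreement row: {agreed}%")
```

The 37% quoted for rule coverage is therefore the rounded sum of the improved and weakened shares.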

Conclusion
The automatic evaluation results show that, in terms of BLEU, 75% of the test sets improved after APE.
The SMT output improved by up to 0.15 BLEU.
Some decreases in translation quality occurred where the OOVRemover and Transliterator produced a new unknown (incorrect) word.
Automatic evaluation could not adequately reveal the quality of grammatical changes.
Manual evaluation scores show that a rule-based approach to APE improves at least 25% of the translations, with a loss of at most 7%.

Thanks for your attention. Questions?

References
Ahsan, A., Kolachina, P., Kolachina, S., Sharma, D. M., & Sangal, R. (2010). Coupling statistical machine translation with rule-based transfer and generation. In Proceedings of the 9th Conference of the Association for Machine Translation in the Americas. Denver, Colorado.
Allen, J., & Hogan, C. (2000). Toward the Development of a Post-editing Module for Raw Machine Translation Output: A Controlled Language Perspective. In Third International Controlled Language Applications Workshop (CLAW-00) (pp. 62-71).
Ambati, V. (2008). Dependency Structure Trees in Syntax Based Machine Translation. In Advanced MT Seminar Course Report.
Buchholz, S., & Marsi, E. (2006). CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning (pp. 149-164). New York City, USA: Association for Computational Linguistics.
Callison-Burch, C., Koehn, P., Monz, C., & Zaidan, O. F. (2011). Findings of the 2011 workshop on statistical machine translation.
Chiang, D. (2005). A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (pp. 263-270). University of Michigan, USA: Association for Computational Linguistics.
Chiang, D. (2007). Hierarchical phrase-based translation. Computational Linguistics, 33(2), 201-228.
Crystal, D. (2004). The Cambridge Encyclopedia of English Language. Ernst Klett Sprachen.
Ding, Y., & Palmer, M. (2005). Machine translation using probabilistic synchronous dependency insertion grammars. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (pp. 541-548). Ann Arbor, USA: Association for Computational Linguistics.
Hudson, R. A. (1984). Word grammar. Blackwell Oxford.
Isabelle, P., Goutte, C., & Simard, M. (2007). Domain adaptation of MT systems through automatic post-editing. In Proceedings of Machine Translation Summit XI (pp. 255-261). Phuket, Thailand.
Kübler, S., McDonald, R., & Nivre, J. (2009). Dependency parsing. Synthesis Lectures on Human Language Technologies (Vol. 1, pp. 1-127).
Lagarda, A. L., Alabau, V., Casacuberta, F., Silva, R., & Diaz-de-Liano, E. (2009). Statistical post-editing of a rule-based machine translation system. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers (pp. 217-220). Boulder, Colorado: Association for Computational Linguistics.
Li, Z., Callison-Burch, C., Khudanpur, S., & Thornton, W. (2009). Decoding in Joshua. Prague Bulletin of Mathematical Linguistics, 91, 47-56.
Marecek, D., Rosa, R., & Bojar, O. (2011). Two-step translation with grammatical post-processing. In Proceedings of the Sixth Workshop on Statistical Machine Translation (pp. 426-432). Stroudsburg, PA, USA: Association for Computational Linguistics.
McDonald, R., Pereira, F., Ribarov, K., & Hajic, J. (2005). Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing (pp. 523-530). Vancouver, B.C., Canada: Association for Computational Linguistics.
Mohaghegh, M., & Sarrafzadeh, A. (2012). A hierarchical phrase-based model for English-Persian statistical machine translation.
Mohaghegh, M., Sarrafzadeh, A., & Moir, T. (2011). Improving Persian-English Statistical Machine Translation: Experiments in Domain Adaptation.
Pilevar, A. H. (2011). Using statistical post-editing to improve the output of rule-based machine translation system. Training, 330, 330,000.
Raja, F., Amiri, H., Tasharofi, S., Sarmadi, M., Hojjat, H., & Oroumchian, F. (2007). Evaluation of part of speech tagging on Persian text. University of Wollongong in Dubai-Papers, 8.
Rasooli, M. S., Moloodi, A., Kouhestani, M., & Minaei-Bidgoli, B. (2011). A syntactic valency lexicon for Persian verbs: The first steps towards Persian dependency treebank. In Proceedings of 5th Language & Technology Conference (LTC): Human Language Technologies as a Challenge for Computer Science and Linguistics (pp. 227-231).
Rosa, R., Marecek, D., & Dušek, O. (2012). DEPFIX: A System for Automatic Correction of Czech MT Outputs. In Proceedings of the Seventh Workshop on Statistical Machine Translation. Montreal, Canada: Association for Computational Linguistics.
Simard, M., Goutte, C., & Isabelle, P. (2007). Statistical phrase-based post-editing. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference (pp. 508-515). Rochester, USA.
Weese, J., Ganitkevitch, J., Callison-Burch, C., Post, M., & Lopez, A. (2011). Joshua 3.0: Syntax-based machine translation with the thrax grammar extractor.
Xuan, H., Li, W., & Tang, G. (2011). An Advanced Review of Hybrid Machine Translation (HMT). Procedia Engineering, 29, 3017-3022.
Zollmann, A., & Venugopal, A. (2006). Syntax augmented machine translation via chart parsing. In Proceedings of the Workshop on Statistical Machine Translation (pp. 138-141). New York City, USA: Association for Computational Linguistics.