
Projet ANR 2012 CORD 015 TRANSREAD: Lecture et interaction bilingues enrichies par les données d'alignement (enriched bilingual reading and interaction based on alignment data)

Deliverable 3.1: Quality Control in Human Translations: Use Cases and Specifications

April 2014

Benoit Le Ny

Abstract

The third task of the TransRead project aims at developing new methods and tools to control the quality of human translations and translation memories. In this document, we introduce the different use cases we plan to consider during the project and summarize the solutions we plan to study. We also briefly describe existing solutions for quality control in machine and human translation.

Quality Control in Human Translations: Use Cases and Specifications
Benoit Le Ny (bleny@softissimo.com)
April 2014

Contents

1 Introduction
2 State of the Art: Benchmark on Translation Quality Estimation, Terminology Management and CAT Tools
  2.1 CAT tools and quality assurance tools (tested)
    2.1.1 Across
    2.1.2 Trados
    2.1.3 XBench QA
  2.2 CAT tools and quality assurance tools (not tested)
    2.2.1 QA Distiller
    2.2.2 ErrorSpy
    2.2.3 QuEst
3 Interaction scenarios
  3.1 Bilingual reading and foreign language learning
  3.2 Human translation revision
  3.3 Automatic translation post-editing
4 Format and visualization of quality indicators
5 Conclusion

1 Introduction

Assessing the quality of bitext alignments is crucial in any translation task, whether the translation is machine- or human-generated. Yet it is a difficult issue to address, and one that has often been overlooked. In response to increasing demand from the translation industry, driven by the growth

of machine translation, many translation companies have invested in developing formalized metrics for classifying different types of errors (terminology, spelling, mistranslations, etc.). In the context of the TransRead project, we focus on identifying key indicators for assessing the quality of bitext alignments (parallel corpora, translation memories, etc.). This document starts by analyzing existing tools used by both academic and industry users for quality estimation and control, then recalls the main user scenarios that benefit from such indicators, and finally focuses on the most important and promising kinds of monolingual and bilingual indicators for bitext visualization and alignment quality estimation. The formats used to represent monolingual and bilingual indicators, both in the files and to the user, are described in the last section.

2 State of the Art: Benchmark on Translation Quality Estimation, Terminology Management and CAT Tools

A study of existing tools for both translation quality estimation and terminology management was conducted at the beginning of the project. It helped us determine which indicators should be used for the TransRead project. The tools studied include open-source, free, and commercial products. Not all of them could be tested, but all were analyzed.

2.1 CAT tools and quality assurance tools (tested)

2.1.1 Across

Across is a free CAT tool which assists translators in reusing content, controlling processes, and ensuring a high level of quality. The most interesting features offered by Across (as by most CAT tools) are terminology control and translation consistency control, which consist of checking whether the same source is always translated the same way throughout a document, both against a translation memory and against a glossary. For terminology control, there is no advanced mechanism for handling source or target inflection.
For example, belle is not found as a form of beau, but both fleurs and fleuri are treated as forms of fleur. Nonetheless, these tools let translators work with a translation memory and offer the added benefit of fuzzy matching: correspondences between segments of a text and entries in a database of previous translations may be less than 100% perfect, and matching usually operates on sentence-level segments. There is also a long list of additional quality assurance criteria, which can be split into:

Monolingual criteria

- Bounding spaces: consistent use of white space at the beginning and end of a paragraph.
- Date, time, and number format.
- First capitalization: checks whether the first word of a paragraph is capitalized.

- Consistent brackets.
- Multiple spaces check: checks whether the target document contains multiple consecutive white spaces.

As we can see, these checks are close to common spell-checking features.

Bilingual criteria

- 100% match check against the translation memory.
- Crossterm check: verifies whether the source-language terms in the source text have been duly translated with the corresponding target-language terms.
- Field match and field types match: tag matching.
- Format style sequence and usage check: compares whether the same formatting (such as boldface and italics) is used in the source and in the translation.
- Identical segments check: checks whether segments are identical in the source and target documents.
- Identical punctuation check: checks whether the number and sequence of punctuation marks are identical in the source and target documents.
- Empty content check: looks for empty paragraphs.
- Segment count check.
- Structure check: for structured resources (SGML, XML, .NET), performs a validity check.

2.1.2 Trados

Very similar functionality is available in other CAT tools, such as Trados:

- Segments to exclude: allows skipping certain segments, for example if a perfect match has already been found.
- Segment verification: checks that segments are completed correctly and that no translations are missing.
- Inconsistencies: checks for inconsistencies when the same text is rendered in several different ways.
- Punctuation: checks that punctuation is correct; for example, in some countries certain punctuation rules have to be observed.
- Numbers: checks that numbers, times, and dates are correct.
- Word list: the user can specify the correct form of a word, for example that only website should be used and never web site.
- Regular expressions: users can configure regular expressions.

- Trademark check: checks that trademarks are used correctly.
- Length verification: checks that the translation is no longer than a specified number of characters.
- Terminology verifier: checks documents to ensure that the target terms contained in the term base have been used in the translation, or to verify whether black-listed target terms have been used.
- Tag verification.
- XML validation.

2.1.3 XBench QA

XBench provides quality assurance and terminology management in a single package. The list of quality indicators is divided into four groups:

1. Basic criteria
   - Untranslated segments.
   - Inconsistency in source: different sources with the same translation.
   - Inconsistency in target: same source with different translations.
   - Source = target.
2. Content criteria
   - Tag mismatch.
   - Numeric mismatch.
   - Double blank.
   - Repeated word.
   - Key term mismatch.
   - CamelCase mismatch.
   - ALLUPPERCASE mismatch.
3. Checklists: customized filters based on regular expressions.
4. Spell-checking.

2.2 CAT tools and quality assurance tools (not tested)

This section deals with other translation quality tools that were not tested but were nevertheless analyzed: features are reported here as announced in their documentation.
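Many of the checks catalogued in this section reduce to simple string-level rules. As an illustration only, and not the actual implementation of any of the tools above, a few of them can be sketched in a dozen lines of Python:

```python
import re

def check_segment(source: str, target: str) -> list:
    """Run a few of the rule-based QA checks described above.

    Illustrative sketch only; real tools (Across, Trados, XBench)
    implement many more checks plus language-specific rules.
    """
    issues = []
    if not target.strip():
        issues.append("empty target")
    if target.strip() and source.strip() == target.strip():
        issues.append("source identical to target")
    # Number check: the same numbers should appear on both sides.
    if sorted(re.findall(r"\d+(?:[.,]\d+)?", source)) != \
       sorted(re.findall(r"\d+(?:[.,]\d+)?", target)):
        issues.append("number mismatch")
    # Multiple consecutive spaces in the target.
    if re.search(r"  +", target):
        issues.append("multiple spaces")
    # Punctuation count check, a simplification of the
    # "identical punctuation" criterion.
    for mark in ".?!":
        if source.count(mark) != target.count(mark):
            issues.append("punctuation mismatch: " + mark)
    return issues
```

A segment pair passing all checks yields an empty list; each violated rule adds one entry, which a reviser interface could then attach to the segment as a warning.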

2.2.1 QA Distiller

QA Distiller allows automatic detection and correction of format errors in human or machine translation and in translation memories. The quality indicators available with this tool are:

- Omissions: empty translations, (partially) forgotten translations, skipped translations, incomplete translations.
- Inconsistencies: translation inconsistencies, source-language inconsistencies.
- Language-independent formatting: spacing, punctuation, brackets, tab characters, capitalization.
- Language-dependent formatting: corrupt characters, spacing, number values, number formatting, quotation marks, measurement system.
- Terminology: usage, consistency.
- Regular expressions: fully customizable checks.

2.2.2 ErrorSpy

ErrorSpy is another commercial quality assurance tool for translations. Supported checks are:

- Terminology check: using a terminology list, ErrorSpy checks whether the translator has used the correct terminology.
- Consistency check: ErrorSpy notes all sentences or segments which are identical in the source language; if their translations differ, the reviser is informed of these differences.
- Number check: number errors are among the most serious. ErrorSpy not only checks whether numbers are identical in the source and target languages, but also whether the decimal signs comply with the specification.
- Completeness check: it is possible to specify a minimum length for the translation; all sentences which are too short are submitted to the reviser for checking, so that missing translations can be found.
- Tag check: ErrorSpy reports all segments containing tag errors.
- Acronym check: a further source of errors is wrongly adopted acronyms. ErrorSpy offers the option of defining acronym patterns and checking whether they correspond in both languages.
- Typography check: the correctness of punctuation marks is checked in a language-dependent way.
- Missing translations: scans for missing translations.
Identical source and target segments are reported to the person verifying the translation.
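The consistency checks shared by ErrorSpy and QA Distiller (same source, diverging translations) amount to grouping segment pairs by source text. A minimal sketch, not either tool's actual implementation:

```python
from collections import defaultdict

def consistency_report(segments):
    """Group identical source segments and report diverging translations,
    in the spirit of the ErrorSpy / QA Distiller consistency checks.

    `segments` is a list of (source, target) pairs. Returns a dict
    mapping each source that has more than one distinct translation
    to the sorted list of those translations."""
    by_source = defaultdict(set)
    for src, tgt in segments:
        by_source[src.strip()].add(tgt.strip())
    return {src: sorted(tgts)
            for src, tgts in by_source.items() if len(tgts) > 1}
```

Running it on a translation memory returns only the sources whose translations diverge, which is exactly the set of segments the reviser needs to inspect.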

2.2.3 QuEst

QuEst is an open-source translation quality estimation tool: a software toolkit for building statistical quality estimation systems. Much like Moses [1] for machine translation, it provides software both for feature extraction and for training a machine learning model on those features. The toolkit has two main modules: a Java module that extracts a number of sentence-level features (and a few word-level features), and a Python module that interacts with the scikit-learn toolkit for machine learning. To create a working translation quality estimation system, one needs to train it on data specific to one's project. It is also possible to use resources provided on the tool's website (based on WMT shared task datasets): language models and training corpora. The quality features that can be extracted with this tool fall into three categories: fluency, adequacy, and confidence [3]. Some examples of features include:

- Fluency: number of tokens in the target segment; average number of occurrences of each target word within the target segment; language model probability of the target segment, using a language model built from a large corpus of the target language.
- Adequacy: ratio of the number of tokens in the source and target segments; ratio of brackets and punctuation symbols in the source and target segments; proportion of dependency relations between (aligned) constituents in the source and target segments.
- Confidence: features and global score of the SMT system; number of distinct hypotheses in the n-best list; average size of the target phrases.

3 Interaction scenarios

Defining precisely which cues are useful for translation quality, and which other characteristics of bilingual texts matter, requires a study of the various possible interaction scenarios between the user and the future tool.
Based on the partners' experience in education and learning, in the translation industry, and in user interfaces, three possible use-case scenarios for the tool prototype have been suggested and studied.
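As a concrete illustration of the feature-based quality estimation surveyed in Section 2.2.3, the sketch below computes a few sentence-level features in the spirit of QuEst's fluency and adequacy families. The feature names and exact definitions are our own simplifications, not the toolkit's API; a real system would extract many more features (language model probabilities, decoder confidence) and feed them to a scikit-learn regressor trained on project-specific data.

```python
def qe_features(source: str, target: str) -> dict:
    """Toy sentence-level quality estimation features, loosely
    modeled on QuEst's fluency and adequacy feature families.
    Sketch only; not the QuEst implementation."""
    src_tokens = source.split()
    tgt_tokens = target.split()
    punct = ",.;:!?()"
    return {
        # Fluency: target length and word-repetition statistics.
        "target_length": len(tgt_tokens),
        "avg_token_occurrences": (
            len(tgt_tokens) / len(set(tgt_tokens)) if tgt_tokens else 0.0),
        # Adequacy: source/target ratios.
        "length_ratio": len(src_tokens) / max(len(tgt_tokens), 1),
        "punct_ratio": (
            sum(source.count(c) for c in punct)
            / max(sum(target.count(c) for c in punct), 1)),
    }
```

Feature vectors of this kind, paired with human quality judgments, are what the machine learning module is trained on.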

3.1 Bilingual reading and foreign language learning

One possible use case for the tool is to facilitate the reading of bilingual texts (literary works, scientific articles, and other domain-specific texts) by non-native readers. In this situation the TransRead tool serves as a means of improving the reading experience and enriching it with various cues for foreign language learning. The bilingual texts used in this case are assumed to be well translated, so translation quality indicators are of minor importance. On the other hand, it is important in this scenario to show the user the alignment links between source and target at different levels. As opposed to the minimal-unit alignment computed by machine translation models, we are interested here in a more linguistic, knowledge-based alignment, one that allows the user to identify larger units corresponding to idioms and collocations. Highlighting idiomatic expressions would facilitate text analysis as well as external dictionary look-ups, which is yet another useful functionality for all of the usage scenarios. The last section shows how these complex bilingual quality indicators will be used to display language quality information to the user.

3.2 Human translation revision

Another possible use case for the bilingual text device, and one that is particularly important for the professional translation industry, is human translation revision. In this scenario, a translation produced by one translator is revised by another. The quality of the texts presented for this task may vary widely depending on the translator's skills and experience. A typical interaction scenario would include various indicators of translation quality and would highlight the regions of the translated text containing different types of problems and errors.
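Both the bilingual-reading scenario and the revision scenario rely on exposing alignment links between source and target, from single words up to larger highlighted regions. A minimal sketch of the underlying operation, assuming alignments are given as word-index pairs (the actual TransRead alignment format is defined in deliverable 1.1):

```python
def aligned_target_span(links, src_span):
    """Given word-alignment links [(i, j), ...] and a selected source
    span (start, end), inclusive word indices, return the smallest
    target span covering every word aligned to the selection. This is
    a simple way to highlight multi-word units such as idioms during
    bilingual reading. Sketch only, under the assumption of word-level
    index pairs."""
    lo, hi = src_span
    targets = [j for i, j in links if lo <= i <= hi]
    if not targets:
        return None  # the selected source words are unaligned
    return (min(targets), max(targets))
```

For instance, with "kick the bucket" aligned crosswise to "casser sa pipe", selecting any part of the idiom on the source side yields the whole idiom's span on the target side, which the interface can then highlight as one unit.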
Because the calculation of the various bilingual and monolingual cues, as well as of the overall quality estimation measure, is very time- and space-consuming, it is not feasible for the moment to build a tool that would let the reviser make corrections and recalculate scores on the fly. The TransRead bilingual text tool is mainly intended to allow quick visualization of translation quality information at various levels of precision and from various viewpoints. Different levels of precision should allow the user to visualize the quality of the whole text at document level and then zoom in on any particular problematic part at phrase or word level. Different viewpoints will allow separate visualization of the different aspects and indicators of the bilingual text and its translation quality (most of these cues are also used to calculate the overall translation quality estimation metric). Thus, for different revision scenarios, different viewpoints on translation quality and different kinds of information may be of use, and it is important to distinguish the errors likely to appear in a human translation from typical automatic translation system errors (the subject of the third use case for our tool). Although translation quality varies greatly from one translator to another, we assume that we are dealing with translations by native or near-native speakers of the target language, and that it is therefore highly unlikely that a human translator

produces an ungrammatical sentence in the target language. On the other hand, human translation may contain specific errors that are unlikely to appear in automatic translation, such as spelling errors, completely untranslated or partially translated segments, repeated segments, segments in inverted order, etc. Consequently, some specific features and quality indicators will probably only be useful in the human translation scenario: all the simple translation consistency controls, monolingual and bilingual statistics, and the spell-checking feature. Indeed, one of the most important features for human translation revision is terminology consistency verification. For technical translation it is often crucial that the translation respect a given terminology (domain terminology, company terminology, etc.) and that the translation of terms be consistent throughout the whole document. It is accordingly important to include terminology verification in the TransRead tool for the human translation scenario. Conversely, some indicators that are particularly helpful for automatic translation quality estimation, such as language models measuring the grammaticality of the target sentence, are unlikely to be of great use for human translation. There are also certain types of errors common to both automatic translation systems and human translators. For instance, idioms may be translated literally in both cases, if they are unknown either to the human translator or to the machine translation system. Similarly, polysemy and homonymy may be a source of errors in both human and machine translations. Thus, one of the goals of the TransRead project is to adapt the statistical quality estimation systems developed for machine translation, which use a variety of monolingual and bilingual cues, to the human translation revision case.
To do so, sufficiently large corpora of human revisions must be collected, containing both the original and the revised version of the translated text.

3.3 Automatic translation post-editing

Finally, another possible interaction scenario for the TransRead tool is the post-editing of automatic translation output by a human translator. Post-editing, long confined to purely academic purposes, has recently become an increasingly popular and promising technique in the professional translation industry. Advances in machine translation technology have brought better translation quality and wider acceptance of its use as a basis for human translation in a post-editing process. The difficulty of satisfying, by purely manual translation, the growing demand for localization of the huge amounts of text produced on the web and by international institutions and companies is yet another reason why post-editing has gained so much importance in the modern translation community. Post-editing of machine translation allows a considerable gain in the speed and productivity of human translation. Automatic translation quality estimation, and in particular a reliable phrase-level translation quality measure, is of particular importance for the post-editing scenario, since it speeds up revision by indicating to the post-editor the relative quality of the various segments and the post-editing effort their revision requires. Segments that receive high quality estimates may be considered well translated, and the post-editor may decide that they need no modification at all, or just a quick cross-check. A low translation quality estimate may indicate that a segment is so poorly translated that it is not worth the post-editing effort and should be retranslated from scratch. All other segments should be considered candidates for partial revision, and the quality estimation system should highlight the problematic regions within them. Useful features for the post-editing scenario may thus include a general segment-level translation quality estimation score based on various combinations of features, as well as local quality indicators. In this scenario, the complex monolingual and bilingual cues are the most useful: unknown (OOV) words, language model scores, alignment confidence estimation, non-aligned words. As stated above, machine translation is less likely than human translation to contain spelling errors or untranslated or partially translated segments. Note that automatic translation errors vary greatly according to the translation system used, and the quality estimation models should be adapted accordingly.

4 Format and visualization of quality indicators

Based on the user scenarios described previously, this section explains how the quality indicators can be implemented and visualized by the user. Quality indicators are all calculated from information available in the alignment file. A specific alignment format was defined between all TransRead partners in order to meet all the needs of representation and exchange of bilingual alignments during the whole project. It is an XML file containing all alignment information: source and target segment alignments, source and target word alignments, POS tags, etc. The alignment format used for the TransRead project is described in deliverable 1.1, "Formats de représentation des alignements" (alignment representation formats).
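To make this concrete, here is a sketch of how a tool could read confidence scores from such an alignment file and triage segments for post-editing, as described in Section 3.3. The element and attribute names in the sample, as well as the thresholds, are invented for illustration; the actual schema is specified in deliverable 1.1 and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical alignment-file fragment; the real TransRead schema
# (deliverable 1.1) may use different element and attribute names.
SAMPLE = """<alignments>
  <link id="1" confidence="0.92">
    <src>That's Mike's room there</src>
    <tgt>La chambre de Mike est la</tgt>
  </link>
  <link id="2" confidence="0.31">
    <src>Room</src>
    <tgt>prevu le</tgt>
  </link>
</alignments>"""

def triage(xml_text, low=0.4, high=0.8):
    """Sort segment links into post-editing classes by confidence:
    'keep' (high confidence), 'retranslate' (low confidence),
    'revise' (in between). Thresholds are illustrative."""
    classes = {"keep": [], "revise": [], "retranslate": []}
    for link in ET.fromstring(xml_text).iter("link"):
        c = float(link.get("confidence"))
        key = "keep" if c >= high else "retranslate" if c < low else "revise"
        classes[key].append(link.get("id"))
    return classes
```

The same per-link confidence attribute could drive the warnings and result reordering discussed below for the Context interface.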
In order to show how the quality indicators will be used to display quality information on bilingual alignments to bilingual readers, human translation revisers, and post-editors, we propose here implementations of some of these indicators. In all three user scenarios described in the previous section, consulting external dictionaries allows the user to better understand the meaning of a particular source expression (including contextual disambiguation for ambiguous words) and to validate a particular translation choice. External dictionary search for terms and expressions selected by the user is the most basic and most important functionality required by all usage scenarios. However, basic dictionary search can sometimes be tiresome and fruitless. For one thing, words and expressions sometimes have many distinct meanings (homonyms and polysemes), not to mention several parts of speech. Finding the right meaning for the context may take a lot of time, especially for a user who is not experienced in working with dictionaries or not familiar with both languages. For another, it is not easy to find a good translation for idiomatic expressions, collocations, and language clichés in a standard dictionary, since such expressions rarely constitute dictionary keys (at best they appear in long lists of examples or idioms), and

since dictionary coverage is not extensive enough. To address both of these issues, we propose to enrich the user's dictionary look-up experience with both contextual disambiguation and contextual dictionaries. Contextual disambiguation takes the user directly to the dictionary article corresponding to the correct meaning of the word (conjectured from the context). A particularly useful resource for this purpose is an innovative dictionary such as BabelNet [2], which combines a multilingual encyclopedic dictionary with an ontology connecting concepts and named entities in a very large network of semantic relations. Each entry has a unique ID, represents exactly one meaning, and contains all the synonyms which express that meaning in a range of different languages. This information will be stored in the alignment file used by the TransRead tool. The user can directly access the relevant meaning of an ambiguous word and see only the definitions, translations, and even pictures for that meaning.

Figure 1: Example of BabelNet results for "room"

The contextual dictionary is a new type of dictionary currently being developed at Reverso-Softissimo for the TransRead project. It implements a bilingual concordancer and allows the user to search for words or expressions over a large amount of text. The results contain both the aligned hits in their context and suggestions of the most frequent translations of the source expression, calculated from those alignments.

Figure 2: Example of Context results for "room"

As Figure 2 shows, aligned source and target words are highlighted in yellow to help the user visualize the results. When reading a text, it would be interesting to use a Context-like interface so that the user can visualize aligned source and target segments, but also expression and word alignments. For the TransRead project we can implement this feature for the whole text: while the user is reading a sentence of the source text containing word1 word2 word3, if he selects word1, then word1 and its translation are highlighted. This can be done for each word, each expression, and finally for a whole segment. An improvement to Context that needs to be developed for this project is a confidence score on the alignments. If we take the fifth result of the search for "room", we can see a wrong alignment between "Room" in English and "prévu le" in French. Here we can imagine two ways to display quality information on this alignment: 1) add a warning on this segment saying that a wrong alignment was detected, based on the alignment confidence score stored in the XML

alignment file; 2) display this result at the end of the result list, in an "Other results" section. For reviewing machine translation output, we can also think of using this confidence score indicator. We can imagine using and improving Localize, a Reverso tool developed during the FLAVIUS project that allows website and application editors to generate multilingual versions of their content easily, quickly, and without technical knowledge. Using Localize, the user will be able to review translated texts in the following way: a confidence score is displayed in a column to the right of the translated texts, and a filter on the confidence score can help the user review the least confident segments first, for example, and skip the revision of confident segments in order to save time.

Figure 3: Visualization of the Localize Reverso tool with source and target segments

An overall quality estimation score combining all the indicators defined for this use case can also be displayed in Localize above the filter panel, to give an estimate of the translation quality for the whole project, and filters can be added to help the user work only on errors or badly translated text. In the case of human translation revision, two features can be added to help the user identify translation errors. First, if we use an external dictionary look-up such as Context, another possibility is to use the aligned segments to check the consistency of translated texts. For example, if we keep a sentence from the "room" example above, "That's Mike's room there", aligned with "La chambre de Mike est là", and the translation to review had the same source segment but a different

target, the user can check the consistency of this segment thanks to Context and review it against the translated text found there. This is another kind of consistency control, based on a concordancer. Second, a comparison with a machine translation output can provide a further confidence score relating translation, revision, and MT output. We can imagine a user interface like that of the MTcompare Reverso tool (see figure below): the user reviews the translation by comparing it with a given MT output, and can easily spot unknown words (text in red) and differences between the translated texts (text in blue).

Figure 4: MTcompare user interface to review and compare two translations for a given source text.

5 Conclusion

Thanks to the benchmark of existing quality estimation tools and of CAT tools offering a quality control feature, we selected the appropriate quality clues for the TransRead project. They are to be computed from the information stored in the alignment file. The visualization of these indicators will follow the presentation given in the last section, but may evolve depending on the tool used for TransRead and on the experiments with the different scenarios (part of task 2).