Improving Pronoun Translation for Statistical Machine Translation (SMT)

Liane Guillou

Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2011

Abstract

Machine Translation is a well established field, yet the majority of current systems perform the translation of sentences in complete isolation, losing valuable contextual information from previously translated sentences in the discourse. One such class of contextual information concerns who or what it is that a reduced referring expression such as a pronoun is meant to refer to. The use of inappropriate referring expressions in a target language text can seriously affect its ability to be understood by the reader.

This project follows on from two recent research papers that focussed on improving the translation of pronouns in Statistical Machine Translation (SMT). The approach taken is to annotate the pronouns in the source language with the morphological properties of the antecedent translation in the target language prior to translation using a phrase-based English-Czech SMT system. The project makes use of a number of manually annotated corpora in order to factor out the effects arising from poor coreference resolution, wherein selecting the wrong antecedent for a pronoun in the source language text will wrongly bias its translation. The aim of this work is to discover whether perfect coreference resolution in the source language text can reduce the incidence of inappropriate referring expressions in the target language text.

The annotated translation system developed as part of this project makes only a marginal improvement over the baseline system, as measured using a bespoke automated evaluation metric. These results are supported by a manual evaluation conducted by a native Czech speaker. The reason for a lack of substantial improvement over the baseline may be attributed to many factors, not least of which concern the highly inflective nature of the Czech language.

Acknowledgements

I would like to thank my supervisor, Professor Bonnie Webber, for her continued guidance and support from the conception of this project through to its realisation. I am deeply grateful for the patience that she has shown in explaining to me those concepts that were difficult to grasp, for setting me on the correct path when I became lost and most of all, for infecting me with her enthusiasm for this work. I have thoroughly enjoyed my time spent working on this project and I couldn't have asked for anything more in terms of the supervision I have received in my first foray into the field of Machine Translation.

Special thanks are owed to Dr. Markéta Lopatková and Dr. Ondřej Bojar at Charles University. I am indebted to Markéta for her suggestions, enthusiasm and assistance with the analysis of results at every stage of this project. Her expertise in Czech Natural Language Processing has proved invaluable and I can honestly say as a monolingual speaker that without her help, this project would not have been possible. I am also extremely grateful to Ondřej for his recommendations with respect to the stemming of the English and Czech data to obtain shared word alignments for the translation models and his suggestions regarding the automated evaluation of the translation output.

Thanks also to Christian Hardmeier for his patience in answering my many questions in relation to his previous work on pronoun translation and evaluation. Credit is also owed to David Mareček at Charles University, who created the PCEDT 2.0 alignment file used in this project. Finally, I would like to thank my colleagues for their company during the long days spent in the computer labs and their assistance in peer reviewing this document.

The PCEDT 2.0 corpus, which is not yet publicly available, has been used with permission from the Institute of Formal and Applied Linguistics, Charles University, Prague.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Liane Guillou)

I dedicate this thesis to my mother, Anna Guillou, who instilled in me from an early age the importance of education and made sacrifices to ensure that I received the very best. Her love, encouragement and unwavering support have been instrumental throughout my life, and have given me the confidence that I needed to embark upon this course of further study. Words alone cannot convey my gratitude.

Table of Contents

1 Introduction
    Definition of the Problem
    Background
    Previous Work
    Focus on Pronoun Translation in Machine Translation
    English-Czech Machine Translation
    Example of Poor Pronoun Translation
    Hypothesis and Contributions
    Chapter Summary
2 Concepts
    Anaphora and Coreference
    Coreference Resolution
    Czech Language
    Phrase-based Statistical Machine Translation
    Moses
    Evaluation in Machine Translation
    Automated Evaluation
    Manual Evaluation
    Chapter Summary
3 Data
    BBN Pronoun Coreference and Entity Type Corpus
    Penn Treebank 3.0 Corpus
    PCEDT 2.0 Corpus
    Chapter Summary
4 Methodology
    Overview
    Assumptions
    Datasets
    Constructing the Language Model
    Combining the Corpora
    Identification of Coreferential Pronouns and their Antecedents
    Extraction of the Antecedent Head Noun
    Extraction of Morphological Properties from the PCEDT 2.0 Corpus
    Training the Translation Models
    Computing the Word Alignments
    Tuning the Translation System Weights: Minimum Error Rate Training (MERT)
    Annotation of the Training Set Data
    The Annotated Translation Process
    Annotation and Translation System Architecture
    Evaluation
    Automated Evaluation: Assessing the Accuracy of Pronoun Translations
    Manual Evaluation: Error Analysis and Human Judgements
    Chapter Summary
5 Results and Discussion
    Automated Evaluation
    Manual Evaluation
    Critical Evaluation of the Approach and Potential Sources of Error
    Chapter Summary
6 Conclusion and Future Work
    Conclusion
    Future Work
A Czech Pronouns Used in the Automated Evaluation
Bibliography

Chapter 1 Introduction

The primary aim of this project is to produce more accurate coreferring expressions in the target language within English to Czech Statistical Machine Translation (SMT). To date there have been few attempts to integrate coreference resolution methods into Machine Translation. Notable exceptions include two recently published articles, focussing on English to French/German translation of third person personal pronouns. This project considers the translation of pronouns in English-Czech SMT, which is a more complex issue due to certain properties of the Czech language. Czech is a highly inflective language (as with German) that exhibits subject pro-drop and has a free word-order, i.e. the word order reflects the information structure of discourse.

Whilst considerable progress has been made in Machine Translation research, little attention has been paid to cross-sentence coreference (Le Nagard and Koehn, 2010). The recent work of both Le Nagard and Koehn (2010) and Hardmeier and Federico (2010), focussing on third-person personal pronoun translation for SMT, represents a realisation of the need to address this gap. In particular, it represents an acknowledgement that the appropriate translation of discourse-level phenomena, including pronominal reference, is essential to ensure that the translated text makes sense to its intended audience. As Le Nagard and Koehn (2010) state, current Machine Translation methods treat sentences as mutually independent and therefore do not handle the cross-sentence dependencies that can arise due to the use of anaphoric reference.

The recent work of Le Nagard and Koehn (2010) and Hardmeier and Federico (2010) demonstrates an interest within the research community in improving overall translation quality via the accurate translation of pronouns. Whilst the method proposed by Le Nagard and Koehn (2010) showed little improvement, the method presented by Hardmeier and Federico (2010) showed a small but significant improvement as measured by their bespoke automated scoring metric that incorporates precision and recall.

This project investigates whether the approach used by Le Nagard and Koehn (2010) can improve pronoun translation in English-Czech SMT. This method was selected in preference to that used by Hardmeier and Federico (2010) due to its simplicity. A major difference between this project and previous work is the use of manually annotated corpora in place of coreference resolution algorithms to extract pronoun antecedents and automated methods to identify antecedent head nouns. These corpora provide coreference annotation and noun phrases from which the head noun can be extracted with little effort. This marks the first attempt to assess the potential for source language coreference to improve pronoun translation in SMT by exploiting perfect manual source language coreference annotation. Furthermore it is also the first attempt to apply the technique of source language pronoun annotation to the English-Czech language pair.

The motivation for using the English-Czech language pair is threefold. Firstly, the availability of the PCEDT 2.0 parallel English-Czech corpus, as provided by the Institute of Formal and Applied Linguistics at Charles University, Prague, coincided with the start of this project. Secondly, as the author is a monolingual speaker, the choice of the second language in the pair is fairly arbitrary, but dependent on the availability of a native speaker to assist in the evaluation of the translation system output and to provide language specific assistance during the development of such a system. This project benefited enormously from the expert advice of Dr. Markéta Lopatková at Charles University, Prague. The third, and perhaps most salient reason for choosing Czech as the second language in the translation pair is that Czech is a subject pro-drop language. That is, in Czech, an explicit subject pronoun may be omitted if its antecedent can be predicted on the grounds of saliency and/or verb morphology. It was initially envisaged that the system developed as part of this project would be designed to explicitly handle this phenomenon. However, due to the complexity of designing a pronoun-focussed translation system and devising a strategy for evaluating the system output, this has been left as a future extension to this project.

This document describes in detail the approach taken in the investigation of whether source language annotation may improve pronoun translation in English-Czech SMT. The remainder of this chapter defines the problem, introduces the concept of anaphora resolution and its application in Machine Translation and presents the hypothesis upon which this project is based. Chapter 2 introduces the key concepts and chapter 3, the corpora used in the project. Chapter 4 describes the approach taken in the development of the annotation and translation system and the evaluation of its output. The results of the evaluation are presented and discussed in chapter 5 and the project is concluded in chapter 6. Possible options for future continuation of this work are also included in chapter 6, with suggestions reflecting some of the key issues highlighted in the preceding chapters.

1.1 Definition of the Problem

Pronouns can be used as anaphoric expressions. When a pronoun is used anaphorically, it is called a coreferential pronoun. In Czech, as with many other languages, the number and gender of a personal pronoun must agree with the number and gender of its antecedent. This is the phenomenon known as anaphora. When observing this phenomenon in discourse it is common for the pronoun's antecedent to appear in an earlier sentence to the pronoun itself, presenting a problem for current state of the art Machine Translation systems which translate sentences in isolation.

When sentences are translated in isolation, the contextual information present in the preceding sentences becomes lost. In the case of a coreferential pronoun, if its antecedent appears in a previous sentence, information about that antecedent will be lost by the time the sentence in which the pronoun occurs is considered for translation. The translation of the pronoun is then carried out with no knowledge of the number and gender of the pronoun's antecedent. Consider the translation of the English pronoun "it" into Czech for the following simple examples [1]:

1. The dog has a ball. I can see it playing outside.
2. The cow is in the field. I can see it grazing.
3. The car is in the garage. I will drive it to school later.

In each of the examples, the English pronoun "it" refers to an entity that has a different gender in Czech. In order to translate the pronoun correctly in Czech it is necessary to identify the gender (and number) of the entity to which the pronoun refers and ensure that the gender (and number) of the pronoun agrees. In example 1 "it" refers to the dog ("pes", masculine) and should be translated as "jeho/ho/jej". In example 2, "it" refers to the cow ("kráva", feminine) and should be translated as "ji". In the final example, 3, "it" refers to the car ("auto", neuter) and should be translated as "je/jej/ho".
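The agreement requirement in the three examples above can be made concrete with a minimal sketch. The tables below contain only the noun genders and pronoun forms stated in the examples; the dictionary and function names are illustrative assumptions and do not describe any component of the system developed in this project.

```python
# Czech forms of the English object pronoun "it", keyed by the grammatical
# gender of the Czech antecedent. The forms are exactly those given in
# examples 1-3; all names here are illustrative assumptions.
IT_TRANSLATIONS = {
    "masculine": ["jeho", "ho", "jej"],  # e.g. pes (dog)
    "feminine": ["ji"],                  # e.g. kráva (cow)
    "neuter": ["je", "jej", "ho"],       # e.g. auto (car)
}

# Gender of the Czech antecedent nouns used in the examples.
ANTECEDENT_GENDER = {"pes": "masculine", "kráva": "feminine", "auto": "neuter"}


def candidate_translations(czech_antecedent):
    """Return the forms of "it" that agree in gender with the antecedent."""
    gender = ANTECEDENT_GENDER[czech_antecedent]
    return IT_TRANSLATIONS[gender]


for noun in ("pes", "kráva", "auto"):
    print(noun, "->", "/".join(candidate_translations(noun)))
```

Without knowledge of the antecedent, a system translating the second sentence of each example in isolation has no basis for choosing between these sets of forms.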
In Czech, within the masculine gender, a distinction is made between animate objects (e.g. people and animals) and inanimate objects (e.g. buildings). In many cases the same pronoun may be used for both animate and inanimate masculine genders, but there are a number of cases in which different pronouns must be used. For example, in the case of possessive reflexive pronouns in the accusative case, "svého" is used to refer to a dog (masculine animate, singular) that belongs to someone, e.g. "I admired my (own) dog": "Obdivoval jsem svého psa". This is in contrast with "svůj" which is used to refer to a castle (masculine inanimate, singular) that belongs to someone, e.g. "I admired my (own) castle": "Obdivoval jsem svůj hrad".

[1] Examples adapted from information from Local Lingo - an online Czech language resource:

The problem of identifying the entity to which a pronoun refers is termed anaphora resolution. Section 1.2 outlines a brief history of anaphora resolution with particular reference to its incorporation in the field of Machine Translation. The concept of Anaphora and the closely related concept of Coreference are described in greater detail in chapter 2.

1.2 Background

Anaphora resolution involves the identification of the antecedent of a referent, typically a pronominal or noun phrase expression that is used to refer to something that has been previously mentioned in the discourse (the antecedent). In the case where multiple referents refer to the same antecedent, these referents are said to be coreferential; these relationships can be represented using coreference chains. Mitkov et al. (1995) assert that the identification of an anaphor's antecedent is often crucial to ensure a correct translation, especially in cases in which the target language of the translation marks the gender of pronouns.

The problems of anaphora resolution and the related task of coreference resolution have sparked considerable research within the field of Natural Language Processing (NLP). Strube (2007) charts the changes from early techniques that modelled linguistic knowledge algorithmically such as Hobbs's Algorithm (Hobbs, 1978), the Centering model (Grosz et al., 1995) and Lappin and Leass's algorithm (1994), through to the Supervised and Semi-Supervised Machine Learning methods commonly used today. Even within the sphere of Machine Learning, there is still much debate as to which method provides the best results. Early methods include that to which Strube (2007) credits Soon et al. (2001) - the recasting of coreference resolution as a binary classification task to which Machine Learning techniques can be applied. In contrast, Linh et al. (2009) argue that ranking based models are more suited to the task of anaphora resolution. Ng (2010) also argues in favour of ranking models that allow for the identification of the most probable candidate antecedents, claiming that they outperform other classes of supervised Machine Learning methods.

In order to improve methods for anaphora resolution based on supervised Machine Learning, as well as to serve as Gold standards for evaluation, parallel efforts have been pursued to manually annotate large corpora with coreference chains. The OntoNotes 3.0 corpus (Weischedel et al., 2009) and the BBN Pronoun Coreference and Entity Type corpus (Weischedel and Brunstein, 2005) (used in this project) are examples of such corpora.

Despite continued efforts into providing methods for anaphora resolution, there has been little work focusing on the integration of anaphora resolution and SMT systems. Le Nagard and Koehn (2010) argue that work on SMT has not moved beyond sentence-level translation. Furthermore they assert that the translation ambiguity arising from the use of pronouns cannot be resolved within the context of a single sentence if a pronoun refers to an antecedent from a previous sentence. Hardmeier and Federico (2010) present a case study of the performance of one of their SMT systems on personal pronouns to illustrate that improved handling of pronominal anaphora may lead to improvements in translation quality. They report that the SMT system is unable to find a suitable translation for anaphoric pronouns in 39% of cases and that while choosing the wrong pronoun does not generally affect important content words, it can make the output translations difficult to understand.

1.3 Previous Work

Focus on Pronoun Translation in Machine Translation

Early work on the integration of anaphora resolution with Machine Translation includes that of Mitkov et al. (1995), Lappin and Leass (1994) and Saggion and Carvalho (1994). Mitkov et al. (1995) focussed on intersentential anaphora resolution, conjoining sentences to simulate the intersententiality that could be handled by the rule-based CAT2 Machine Translation system. They provided example output from their system showing instances where pronouns are translated correctly from English to German. However, they provided only the details of their approach and several examples, offering no information relating to the evaluation of their method. Lappin and Leass (1994) integrated their RAP algorithm into a logic-based Machine Translation system, but the core focus of their work was on anaphora resolution and not on Machine Translation. Saggion and Carvalho (1994) used a transfer approach combined with Artificial Intelligence techniques and focussed on both intersentential and intrasentential anaphora resolution for the translation of pronouns in Portuguese to English translation.
This interest in the 1990s culminated in the publication of a special issue on anaphora resolution in Machine Translation with an introduction provided by Mitkov (1999). No further evidence of work on the integration of anaphora resolution and Machine Translation systems is available until 2010, in which papers on the subject were published by Le Nagard and Koehn (2010) and Hardmeier and Federico (2010). This resurgence of interest in anaphora resolution for Machine Translation systems follows advances in the field since the 1990s which have made the application of these new approaches possible.

The approach taken by Le Nagard and Koehn (2010) involves the identification of the antecedent of each coreferential occurrence of "it" and "they" in the source language (English) together with the identification of the antecedent's translation into the target language (French) and its grammatical gender. Based on the gender of the noun in the target language, the occurrence of "it" in the source language text is replaced by "it-masculine", "it-feminine" or "it-neutral". The same is applied for occurrences of "they". Using the Moses toolkit (Hoang et al., 2007), they trained an SMT system on annotated training data composed using the annotation method previously described, before applying the same process to the test data as part of the translation process. In the training of the annotation system the French translation of the English antecedent is extracted from the parallel corpus using the word alignment obtained as part of the process of training their baseline system. When running test translations, they first translate the test text using the baseline system to extract the French translations of the English antecedents. They then use the gender of the French word to annotate the English pronoun before translating the annotated test text using the system trained on annotated training data. This approach treats the annotation of pronouns as a separate task which is performed outside of the translation process. The authors report little change in the BLEU score of their system over the baseline and instead resort to manually counting the number of correctly translated pronouns. Whilst they attribute the lack of improvement of their system to the poor quality of their coreference resolution system, they claim that the process works well when the coreference resolution system provides accurate results.

The approach taken by Hardmeier and Federico (2010) differs in that it provides a single-step process whereby the identification of a pronoun's antecedent in the source language and the extraction of its target language translation's morphological properties is integrated in the translation process as an additional model in their SMT system.
This additional model maintains a mapping of each source language pronoun and the number and gender of its antecedent. Translation is achieved by first processing the source language test text using a coreference resolution system to identify coreferential pronouns and their antecedents. The output of the coreference resolution system is used as input to a decoder driver module which runs a number of Moses decoder processes in parallel. The decoder driver then feeds individual sentences to the decoder processes using a priority queue to order sentences according to how many pronoun antecedents they contain. Thus sentences that contain a greater number of antecedents are translated first, ensuring a high throughput of the system. The authors report no significant improvement in BLEU score between their system and the baseline, but they do report a small but significant improvement in pronoun translation recall against a single reference translation.

The approach used in this project is similar to that taken by Le Nagard and Koehn (2010). Whilst their project required the use of a coreference resolution system to build coreference chains, the provision of a source language corpus with manually annotated coreference information allowed this project to focus on the translation problem. This project also accommodates a wider range of English pronouns than the study by Le Nagard and Koehn (2010), which only considered the translation of "it" and "they".

English-Czech Machine Translation

Much of the recent work in English-Czech SMT has been conducted at the Institute of Formal and Applied Linguistics at Charles University, Prague. Research has been conducted in many areas including the development of parallel corpora suitable for the development of Machine Translation systems such as the PCEDT 2.0 corpus used in this project and its predecessor, the PCEDT 1.0 corpus (Čmejrek et al., 2004). Another area of research has concentrated on the development of both phrase-based and dependency-based SMT systems. In a comparative study of phrase-based and dependency-based SMT systems Bojar and Hajič (2008) concluded that their best phrase-based system outperformed the experimental dependency-based system, but work continues in both directions.

The decision to focus on phrase-based SMT in this project is due to its simplicity, which given the relatively short time-scale, is an important factor. That phrase-based systems currently outperform dependency-based systems in English-Czech SMT is an added bonus.

1.4 Example of Poor Pronoun Translation

As an example of poor pronoun translation, consider the following English sentence from the Wall Street Journal corpus and its translation (by a Machine Translation system) in Czech:

he said mexico could be one of the next countries to be removed from the priority list because of its efforts to craft a new patent law.

řekl, že mexiko by mohl být jeden z dalších zemí, aby byl odvolán z prioritou seznam, protože její snahy podpořit nové patentový zákon.

In this example, the English pronoun "its", which refers to "mexico", is translated in Czech as "její" (feminine, singular) and "mexico" is translated as "mexiko" (neuter, singular). Here, the Czech translation of the pronoun and its antecedent disagree in gender. A more correct translation of the pronoun would be "jeho" (neuter, singular possessive pronoun) or "své" (possessive pronoun) depending on the overall structure of the translated sentence.
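The source-side annotation idea outlined in section 1.3 can be illustrated on this sentence. What follows is a minimal sketch, assuming that the position of the pronoun and the gender and number of the antecedent's Czech translation are already known. The label format (`its.neut.sg`) and the function name are hypothetical; they are not taken from Le Nagard and Koehn (2010), nor do they describe the annotation format used by the system developed in this project.

```python
def annotate_pronoun(tokens, pronoun_index, gender, number):
    """Return a copy of tokens with one pronoun tagged by target-side morphology.

    The "pronoun.gender.number" label format is a hypothetical choice made
    for this illustration only.
    """
    annotated = list(tokens)
    annotated[pronoun_index] = f"{tokens[pronoun_index]}.{gender}.{number}"
    return annotated


source = (
    "he said mexico could be one of the next countries to be removed "
    "from the priority list because of its efforts to craft a new patent law ."
).split()

# "its" refers to "mexico", whose Czech translation "mexiko" is neuter, singular.
pronoun_index = source.index("its")
annotated = annotate_pronoun(source, pronoun_index, "neut", "sg")
print(" ".join(annotated))
```

A translation system trained on text annotated in this way sees a distinct source token for each gender and number combination, which gives it the opportunity to learn a separate translation for each.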

1.5 Hypothesis and Contributions

The work of Hardmeier and Federico (2010) focussed on English to German translation whilst Le Nagard and Koehn (2010) focussed on English to French translation. This project considers the translation of pronouns in English to Czech SMT and builds on the work of Le Nagard and Koehn (2010) and Hardmeier and Federico (2010). By factoring out the problems of automated coreference resolution, parsing and part of speech (POS) tagging and morphological tagging, this project attempts to assess how well an approach to explicitly annotating pronouns in the source language could work when applied to English-Czech SMT if conditions were assumed to be perfect.

Where French (a Romance language) and German (a Germanic language) share a similar root to English, the differences between English and Czech are even greater. Therefore, not only does this project assess the suitability of a pronoun annotation approach in improving the translation of pronouns into another language, but into a language that is very different from English. It is believed that this project is the first attempt made to explicitly handle the problem of pronoun translation in Czech SMT.

This project makes three major contributions:

1. A prototype system for the annotation and translation of pronouns in English-Czech SMT.
2. Automated and manual evaluations of the output of the system as compared against a baseline.
3. An annotated aligned parallel corpus which could be used in future investigations into pronoun translation in English-Czech SMT.

1.6 Chapter Summary

This chapter introduced the specific problem of pronoun translation in SMT, discussed previous work in relation to anaphora resolution, pronoun-focussed Machine Translation and English-Czech SMT and outlined the hypothesis on which this work is based. The next chapter will describe in detail many of the concepts that are essential to the understanding of the problem as well as the approach taken in the development of the annotation and translation system and its evaluation.

Chapter 2 Concepts

2.1 Anaphora and Coreference

Anaphora is a discourse level phenomenon in which the interpretation of one expression is dependent on another previously mentioned expression, also known as the antecedent. For example in the text below, the word "He" at the start of the second sentence refers to "J.P. Bolduc" at the start of the first sentence. In order to understand the meaning of the second sentence, the reader must first identify the referent of the pronoun "He" (which in this example is "J.P. Bolduc").

J.P. Bolduc, vice chairman of W.R. Grace & Co., which holds a 83.4% interest in this energy-services company, was elected a director. He succeeds Terrence D. Daniels, formerly a W.R. Grace vice chairman, who resigned. [1]

Where anaphora is concerned with referring to a previously mentioned expression in the discourse, coreference is the act of referring to the same referent (Mitkov et al., 2000), such that multiple expressions that refer to the same expression are said to be coreferential. Coreferential chains may be established in order to link multiple referring expressions to the same antecedent expression.

This project focuses on the translation of already resolved instances of nominal anaphora, in which a referring expression - a pronoun, definite Noun Phrase (NP) or proper name, has a non-pronominal NP as its antecedent (Mitkov et al., 2000). The project makes use of manually annotated corpora from which instances of coreferential (and anaphoric) pronouns and their antecedents are identified, in order to annotate training data with which to train an SMT system.

[1] Example taken from the Wall Street Journal corpus
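The notion of a coreferential chain introduced in this section can be sketched as a small data structure. This is a minimal illustration built around the J.P. Bolduc example; the class and field names are assumptions made for the sketch and do not reflect the annotation format of any corpus used in this project.

```python
from dataclasses import dataclass, field


@dataclass
class Mention:
    text: str
    sentence: int      # index of the sentence containing the mention
    is_pronoun: bool


@dataclass
class CoreferenceChain:
    """Links referring expressions to a single antecedent expression."""
    antecedent: Mention
    referring: list = field(default_factory=list)

    def add(self, mention):
        self.referring.append(mention)

    def pronouns(self):
        """Return only the pronominal referring expressions in the chain."""
        return [m for m in self.referring if m.is_pronoun]


# The antecedent is in the first sentence, the pronoun in the second.
chain = CoreferenceChain(Mention("J.P. Bolduc", 0, False))
chain.add(Mention("He", 1, True))

for pronoun in chain.pronouns():
    print(f"{pronoun.text} (sentence {pronoun.sentence}) -> {chain.antecedent.text}")
```

Because the pronoun and its antecedent sit in different sentences, a chain of this kind carries exactly the information that is lost when sentences are translated in isolation.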

2.2 Coreference Resolution

Coreference Resolution is the process of identifying the referent to which a referring expression refers. In this project, the pronouns are the referring expressions and the antecedents are the referents. As discussed in chapter 1, there has been much research into the development of automated methods to provide coreference and anaphora resolution. Such automated methods were used by both Le Nagard and Koehn (2010) and Hardmeier and Federico (2010), but it is well documented that these methods do not achieve perfect accuracy. Indeed, Le Nagard and Koehn (2010) cite the poor performance of their coreference resolution as a possible reason for their lack of improvement in pronoun translation. In this project, a manually annotated coreference corpus (the BBN Pronoun Coreference and Entity Type corpus) is used to identify coreferential pronouns and their antecedents. As the corpus has been manually annotated, the coreference annotation is assumed to be highly accurate.

2.3 Czech Language

Czech is a member of the western group of Slavic languages. Like other Slavic languages it is highly inflective, with seven cases and four grammatical genders: masculine animate (for people and animals), masculine inanimate (for inanimate objects), feminine and neuter. In the case of the feminine and neuter genders, animacy is not grammatically marked. Czech is a free word-order language, in which word order reflects the information structure of the sentence within the current discourse. In addition, Czech is a pro-drop language; an explicit subject pronoun may be omitted if it may be inferred based on some other grammatical feature, for example verb morphology. [2]

In contrast with Czech, English is neither a highly inflectional nor a pro-drop language. Furthermore, English follows a Subject-Verb-Object (SVO) pattern for word order and lacks grammatical gender.
[2] Information provided by The Czech Language - an online guide:

2.4 Phrase-based Statistical Machine Translation

Phrase-based models are currently the best performing SMT models (Koehn, 2009). The concept behind these models is the decomposition of the translation problem into a number of smaller word sequences, called phrases, which are translated one at a time in order to build the complete translation. It is important to note that a phrase may be any sequence of words of arbitrary length and that there is no deep linguistic motivation behind the choice of segmentation.

Phrase-based models have several advantages over word-based models in which words are translated in isolation. Firstly, phrase-based models provide a simple solution to the problem where a single word in the source language translates into multiple words in the target language or vice versa. Secondly, translating phrases rather than single words can help to resolve translation ambiguities. Finally, with phrase-based models, the notions of insertion and deletion that are present in word-based models are no longer necessary, leading to a model that is conceptually simpler.

The three components that make up a phrase-based model are the translation model, language model and reordering model. The translation model takes the form of a phrase translation table which provides a mapping between the source and target language phrases and the probabilities associated with each mapping. The phrase translation table is learned by creating word alignments between the aligned sentence pairs of a parallel training corpus. The word alignments are collected for both translation directions, the alignment points are merged and then those phrases that are consistent with the word alignment are extracted. The probabilities that are assigned to each phrase mapping in the table are calculated by counting the number of (parallel) sentence pairs a particular phrase pair appears in, and then computing the relative frequency of this count compared with the count of the source phrase translating as any other phrase in the target language.

The language model ensures the fluency of the translations output by the model - providing a means to score and hence identify the best output translation from a list of candidate translations.
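The relative-frequency estimate of the phrase translation probabilities described above can be illustrated with a minimal sketch. It assumes that phrase pairs have already been extracted from the word-aligned corpus and, for simplicity, counts each extracted occurrence rather than each sentence pair. The function name and the toy phrase pairs are invented for illustration and are not drawn from the corpora used in this project.

```python
from collections import Counter, defaultdict


def estimate_phrase_table(phrase_pairs):
    """Estimate phi(target | source) by relative frequency.

    phrase_pairs: iterable of (source_phrase, target_phrase) tuples, one per
    extracted occurrence (a simplifying assumption of this sketch).
    """
    pair_counts = Counter(phrase_pairs)

    # Total number of times each source phrase was seen with any translation.
    source_totals = Counter()
    for (source, _), count in pair_counts.items():
        source_totals[source] += count

    # Relative frequency: count(source, target) / count(source, any target).
    table = defaultdict(dict)
    for (source, target), count in pair_counts.items():
        table[source][target] = count / source_totals[source]
    return table


# Toy extracted phrase pairs, invented for illustration.
extracted = [
    ("the dog", "pes"),
    ("the dog", "pes"),
    ("the dog", "psa"),
    ("a ball", "míč"),
]
table = estimate_phrase_table(extracted)
for source, translations in table.items():
    print(source, translations)
```

The probabilities for each source phrase sum to one, so a translation observed more often in the training data receives a proportionally larger share.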
The language models used in SMT are typically n-gram language models, which consist of n-grams in the target language together with probabilities based on maximum likelihood estimation. A language model is usually constructed from the target side of the parallel corpus used in the training of the translation model, and may be augmented by additional in-domain target data, or weighted with a separate out-of-domain language model. Smoothing is often applied to improve the reliability of the probability estimates, with modified Kneser-Ney smoothing commonly used in SMT (Kneser and Ney, 1995).

The reordering model allows phrases in the source language to be taken out of sequence when building the translation in the target language, thereby allowing phrase-level reordering. Allowing unlimited reordering can have a detrimental effect on translation quality, and so it is usual for a penalty to be associated with any reordering that takes place. Penalties are assigned such that a larger cost is associated with the movement of a phrase that skips more word positions than with one that skips fewer.

In phrase-based SMT, these three models are combined as a linear model. The best translation $\arg\max_c p(c \mid e)$ is computed using Bayes' rule, which combines the three components of the phrase-based model as in the equation below: the translation model $\phi(e \mid c)$, the language model $P_{LM}$ and the reordering model $\Omega(e \mid c)$.

$$\arg\max_c p(c \mid e) = \arg\max_c \phi(e \mid c) \, P_{LM} \, \Omega(e \mid c)$$

where e is an English sentence and c is the Czech translation of that sentence.

Once the components of the phrase-based model have been constructed, their weights are tuned to optimise the overall model performance. Tuning is carried out using a dataset that is kept separate from the main training dataset for this specific purpose. Minimum Error Rate Training (MERT) (Och, 2003) is a commonly used tuning technique in SMT. MERT tunes the model weights to optimise performance as measured using BLEU scores calculated against one or more reference translations. BLEU will be described in more detail in section 2.6.

In Machine Translation, the process of finding the best scoring translation according to the model is referred to as decoding (Koehn, 2009). Using a phrase-based translation model, decoding is carried out by starting with a source sentence and building the translation from left to right, extracting source phrases in any order. The phrases are translated into the target language and then stitched together to make a complete translation. The source words covered by each phrase are then marked as translated, and the process continues until all of the source words have been covered. As there are many possible valid translations of a single source language sentence, these variations must be captured. This is achieved using a search graph, from which the single best translation (or an N-best list) may be derived using a scoring method that uses a language model and the phrase table probabilities.

2.5 Moses

Moses (Hoang et al., 2007) is an open source SMT toolkit that provides automated training of translation models and may be used with any language pair, given a parallel training corpus.
Moses may be used to construct both tree-based and phrase-based translation models, but for the purpose of this project only the phrase-based training was required. The automated training process produces a phrase translation table and a lexicalised reordering model. The language model is created separately using the target side of the parallel corpus together with additional in-domain corpus data as required. The training process consists of a number of steps, which include data preparation, the creation of word alignments using Giza++ (Och and Ney, 2003), the extraction and scoring of phrases, and the building of the generation and lexicalised reordering models (see footnote 3). The generation model contains probabilities for both directions of translation.

During testing, in which a sentence or collection of sentences from the test corpus (which are not also included in the training corpus) is translated, the Moses decoder constructs a search graph and uses a beam search algorithm to select the translation with the highest probability from that graph. The search graph is constructed using the process of hypothesis expansion. Hypothesis combination and pruning are then employed to reduce the search space. In the Moses implementation of beam search, hypotheses that cover the same number of source (foreign) words are compared, and those with high cost (low probability) are pruned. The cost of each hypothesis is calculated using a combination of the cost of translation so far and the estimated future cost of translating the remaining source text for the current sentence. Whilst the decoder may be used to output an N-best list of translations for an input sentence, in this project only the best translation is required and therefore only a single translation is requested from the decoder.

2.6 Evaluation in Machine Translation

Evaluation in Machine Translation typically falls into one of two categories: manual or automated. Whilst automated methods are used to ascertain improvements during the development of a Machine Translation system, manual methods using either monolingual or bilingual human judges are typically used to provide the final evaluation.

Currently there are no standard automated metrics available for the evaluation of pronoun translation in SMT. Hardmeier and Federico (2010) developed their own bespoke automated metric incorporating precision and recall measured against a single reference translation. In contrast, Le Nagard and Koehn (2010) relied on manually counting the number of correctly translated pronouns in their system output.
Manual evaluation of the results is slow and therefore not a practical solution for large volumes of text. Furthermore, for a monolingual SMT system developer, manual evaluation must be outsourced to a third party, which further hinders the development process.

In this project, the Czech translations output by the phrase-based SMT system were evaluated using a combination of manual and automated methods. The manual methods focussed on human judgements as to whether pronouns in the Machine Translation output were correctly used or dropped and, if they were incorrectly used, whether a native Czech speaker would still be able to understand the meaning of the sentence as a whole. BLEU, an automated metric widely used in the evaluation of SMT systems, was used during system development as a preliminary check to confirm that the system output was valid Czech, before a more detailed automated analysis of the results was conducted. The evaluation methods used in this project are discussed in more detail in a later chapter.

Footnote 3: A full description of the Moses translation system training process can be found at:

Automated Evaluation

BLEU (Papineni et al., 2002) is an automated evaluation metric widely used in SMT to assess the overall quality of the output translations. It provides an efficient and low cost alternative to human judgements for measuring system improvement across development cycles. It computes a document-level score of the translated output against a single reference translation or a set of reference translations (Koehn, 2009). The BLEU score is based on a combination of n-gram precision and a brevity penalty.

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)$$

The n-gram precision ($p_n$) is the ratio of the number of n-grams of order n in the output translation that are also present in the reference translation to the total number of n-grams of order n in the output translation, and the $w_n$ are positive weights that sum to one. The brevity penalty (BP) ensures that the output translation is not too short compared with the reference translation. The effect of the brevity penalty is that the BLEU score is reduced if the output translation is shorter than the reference translation, i.e. where words are dropped in the output translation. The BLEU score is applied at the document level in order to allow some freedom in translation output length at the sentence level, for example where a single source sentence may be translated into two sentences in the target language, or vice versa.

BLEU has been widely criticised (Koehn, 2009), yet remains one of the most popular automated evaluation metrics in use with SMT systems due to its high correlation with human judgements of quality (Papineni et al., 2002). With respect to the specific problem of pronoun translation evaluation in Czech, two further criticisms apply.
Firstly, as the sole focus of this project is pronoun translation, only a small number of words are expected to change between the translations produced by the baseline and annotated translation systems. Therefore, the variation in BLEU score is expected to be very small. Observations regarding the shortcomings of BLEU in relation to the evaluation of pronoun translation have been made previously by both Le Nagard and Koehn (2010) and Hardmeier and Federico (2010). Secondly, Czech is a highly inflective language with four genders and seven cases, so with only a single reference translation provided in the PCEDT 2.0 corpus it is not reasonable to evaluate the output of the translation systems using a recall-based method. Bojar and Kos (2010) are critical of the use of BLEU scores in the evaluation of English-Czech SMT, claiming that BLEU scores correlate poorly with human judgements. It is for these reasons that BLEU was not used in the evaluation of the systems developed as part of this project.

Manual Evaluation

The manual evaluation of Machine Translation output can be rather complex. Human judges are typically required to rate a single target language text using a five point scale, or to rank several target language texts, based on fluency (whether the text is fluent) and adequacy (whether the meaning of the source language text has been captured) (Koehn, 2009). Evaluation based on fluency and adequacy judgements suffers from a number of problems. Firstly, it can be slow and unreliable (Callison-Burch et al., 2008). Secondly, the scores assigned by human judges in the measurement of fluency and adequacy are often very close, suggesting that the judges may find it difficult to make a clear distinction between the two criteria. Thirdly, there are concerns that without explicit instructions, many human judges develop their own rules or misinterpret the intended use of an absolute scale and instead score the output of multiple systems relative to one another (Callison-Burch et al., 2007). Finally, manual evaluation using such criteria tends to be subjective, which can lead to poor agreement between a group of human judges. Like BLEU, these manual methods tend to focus on sentences as a whole and are therefore not wholly applicable to the more specific problem of evaluating pronoun translation.

2.7 Chapter Summary

This chapter introduced the concepts of anaphora and coreference resolution and provided an introduction to phrase-based SMT, the Moses toolkit and the methods currently used in the evaluation of Machine Translation output.
In particular, the various issues associated with automated and manual evaluation methods were highlighted with respect to their application to the more specific problem of evaluating pronoun translation. The next chapter will introduce the manually annotated corpora used in this project.
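As a concrete illustration of the BLEU formula from section 2.6, the following Python sketch computes a document-level score against a single reference translation. It is a minimal sketch rather than the scorer used in this project: the function names `ngrams` and `bleu` are hypothetical, and the uniform weights w_n = 1/N with N = 4 and the use of clipped n-gram counts are assumptions taken from the usual presentation of the metric.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(outputs, references, max_n=4):
    """Document-level BLEU with one reference per output sentence.

    outputs, references: parallel lists of tokenised sentences.
    Assumption of this sketch: uniform weights w_n = 1 / max_n.
    """
    matched = [0] * max_n  # clipped n-gram matches, summed over the document
    total = [0] * max_n    # n-grams in the output, summed over the document
    out_len = ref_len = 0
    for out, ref in zip(outputs, references):
        out_len += len(out)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            out_counts = ngrams(out, n)
            ref_counts = ngrams(ref, n)
            # An output n-gram is credited at most as often as it occurs
            # in the reference sentence.
            matched[n - 1] += sum(min(c, ref_counts[g]) for g, c in out_counts.items())
            total[n - 1] += sum(out_counts.values())
    if out_len == 0 or min(matched) == 0:
        return 0.0  # log p_n is undefined when any precision is zero
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty: only outputs no longer than the reference are penalised.
    bp = 1.0 if out_len > ref_len else math.exp(1.0 - ref_len / out_len)
    return bp * math.exp(log_precision)
```

Because counts are pooled over the whole document before the precisions are computed, replacing a single pronoun token changes only the handful of n-grams that contain it, which is consistent with the expectation that BLEU varies very little between the baseline and annotated systems.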


Chapter 3 Data

In the development of the annotation and translation process, a number of manually annotated corpora in both English and Czech are used: the BBN Pronoun Coreference and Entity Type corpus, for the English (source) side of the parallel corpus and the identification of coreferential pronouns and their antecedents, and the PCEDT 2.0 corpus, for the Czech (target) side of the parallel corpus. Each corpus contains text, or a translation of the original text, taken from a subset of the Wall Street Journal (WSJ). It is the provision of these manually annotated corpora that allowed the project to focus solely on the translation problem without the need for automated methods for coreference or anaphora resolution.

In addition, the annotation of the WSJ files within the Penn Treebank 3.0 corpus is used to identify a single antecedent head word in the case where the antecedent extracted from the BBN Pronoun Coreference and Entity Type corpus spans multiple words. This is particularly important because, in order to extract the number and gender of a Czech word, it is necessary to first identify the head of the English antecedent. The corpora are described in detail in the following sections.

3.1 BBN Pronoun Coreference and Entity Type Corpus

The BBN Pronoun Coreference and Entity Type corpus (Weischedel and Brunstein, 2005) provides annotations of the WSJ file texts with pronoun coreference and entity types, together with the raw English text. For the purpose of this project, two files from the corpus are used: the WSJ.sent file, which contains the raw English sentences, and the WSJ.pron pronoun coreference file, which contains a list of coreferential pronouns together with their antecedents. In the pronoun coreference file, coreferential pronouns and their antecedents are indexed using sentence and word token numbers.

The WSJ.sent file has the format:

    (WSJ0005
    S1: J.P. Bolduc, vice chairman of W.R. Grace & Co., which...
    S2: He succeeds Terrence D. Daniels, formerly a W.R. Grace...
    S3: W.R. Grace holds three of Grace Energy's seven board seats.
    )

For each file in the corpus collection, the sentences are numbered and listed in the order in which they appear in the text.

The WSJ.pron file has the format:

    (WSJ0005
    (
    Antecedent -> S1:1-2 -> J.P. Bolduc
    Pronoun -> S2:1-1 -> He
    )

For each WSJ file in the collection, each antecedent and the pronouns that refer to it are listed, together with the number of the sentence in which they appear and the start and end positions of the word(s) within the sentence.

It was initially envisaged that the OntoNotes 3.0 corpus (Weischedel et al., 2009) would be used to identify coreferential pronouns and their antecedents. However, the annotation in the BBN Pronoun Coreference and Entity Type corpus allows for a simpler method of identification and extraction than the OntoNotes 3.0 corpus. The OntoNotes 3.0 corpus is therefore left as an alternative source of coreference information. Because the two corpora differ in the types of coreference that they annotate, using the OntoNotes 3.0 corpus as an alternative or additional source of coreference information would allow for an investigation into the translation of "it", "this" and "that" when marked as event coreference.

3.2 Penn Treebank 3.0 Corpus

The Penn Treebank 3.0 corpus contains manually annotated parse trees of the sentences within the WSJ corpus. The merged files within the corpus contain both parse and part of speech annotation and as such may be used to identify Noun Phrases (NPs) and, through the use of simple rules, the head of an NP. The corpus contains separate merged files for each WSJ file. Within each file, a parse is provided for each sentence, with part of speech tags provided for each word or token.
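To illustrate how a simple rule can pick out the head of an NP from part of speech annotation, the Python sketch below reads a bracketed NP and returns one head word. The rule used here (take the rightmost noun-tagged token, otherwise the rightmost token) is an assumption made for illustration and is not necessarily the rule set used in this project; the function name `np_head` and the bracketed input strings are likewise hypothetical and are not copied from the corpus files.

```python
import re

# Matches a leaf of a bracketed parse, e.g. "(NNP Bolduc)" -> ("NNP", "Bolduc").
LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def np_head(bracketed_np):
    """Return a head word for a flat (base) NP given as a bracketed string.

    Assumed rule: the head is the rightmost token whose tag starts with
    "NN" (NN, NNS, NNP, NNPS); if there is none, fall back to the
    rightmost token. Returns None when no leaves are found.
    """
    leaves = LEAF.findall(bracketed_np)
    if not leaves:
        return None
    for tag, word in reversed(leaves):
        if tag.startswith("NN"):
            return word
    return leaves[-1][1]


# Example: the two-word antecedent "J.P. Bolduc" reduces to a single head word.
head = np_head("(NP (NNP J.P.) (NNP Bolduc))")
```

This sketch is only suitable for NPs without nested phrases: in an NP that contains a prepositional phrase, the rightmost noun may belong to the embedded phrase rather than to the head, so a fuller rule set would need to inspect the tree structure as well as the tags.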


More information
