Improving Pronoun Translation for Statistical Machine Translation (SMT)

Liane Guillou

Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2011

Abstract

Machine Translation is a well-established field, yet the majority of current systems perform the translation of sentences in complete isolation, losing valuable contextual information from previously translated sentences in the discourse. One such class of contextual information concerns who or what it is that a reduced referring expression such as a pronoun is meant to refer to. The use of inappropriate referring expressions in a target language text can seriously affect its ability to be understood by the reader.

This project follows on from two recent research papers that focussed on improving the translation of pronouns in Statistical Machine Translation (SMT). The approach taken is to annotate the pronouns in the source language with the morphological properties of the antecedent translation in the target language prior to translation using a phrase-based English-Czech SMT system. The project makes use of a number of manually annotated corpora in order to factor out the effects arising from poor coreference resolution, wherein selecting the wrong antecedent for a pronoun in the source language text will wrongly bias its translation. The aim of this work is to discover whether perfect coreference resolution in the source language text can reduce the incidence of inappropriate referring expressions in the target language text.

The annotated translation system developed as part of this project makes only a marginal improvement over the baseline system, as measured using a bespoke automated evaluation metric. These results are supported by a manual evaluation conducted by a native Czech speaker. The lack of substantial improvement over the baseline may be attributed to many factors, not least of which is the highly inflective nature of the Czech language.

Acknowledgements

I would like to thank my supervisor, Professor Bonnie Webber, for her continued guidance and support from the conception of this project through to its realisation. I am deeply grateful for the patience that she has shown in explaining to me those concepts that were difficult to grasp, for setting me on the correct path when I became lost and most of all, for infecting me with her enthusiasm for this work. I have thoroughly enjoyed my time spent working on this project and I couldn't have asked for anything more in terms of the supervision I have received in my first foray into the field of Machine Translation.

Special thanks are owed to Dr. Markéta Lopatková and Dr. Ondřej Bojar at Charles University. I am indebted to Markéta for her suggestions, enthusiasm and assistance with the analysis of results at every stage of this project. Her expertise in Czech Natural Language Processing has proved invaluable and I can honestly say as a monolingual speaker that without her help, this project would not have been possible. I am also extremely grateful to Ondřej for his recommendations with respect to the stemming of the English and Czech data to obtain shared word alignments for the translation models and his suggestions regarding the automated evaluation of the translation output.

Thanks also to Christian Hardmeier for his patience in answering my many questions in relation to his previous work on pronoun translation and evaluation. Credit is also owed to David Mareček at Charles University, who created the PCEDT 2.0 alignment file used in this project. Finally, I would like to thank my colleagues for their company during the long days spent in the computer labs and their assistance in peer reviewing this document.

The PCEDT 2.0 corpus, which is not yet publicly available, has been used with permission from the Institute of Formal and Applied Linguistics, Charles University, Prague.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Liane Guillou)

I dedicate this thesis to my mother, Anna Guillou, who instilled in me from an early age the importance of education and made sacrifices to ensure that I received the very best. Her love, encouragement and unwavering support have been instrumental throughout my life, and have given me the confidence that I needed to embark upon this course of further study. Words alone cannot convey my gratitude.

Table of Contents

1 Introduction
  1.1 Definition of the Problem
  1.2 Background
  1.3 Previous Work
    1.3.1 Focus on Pronoun Translation in Machine Translation
    1.3.2 English-Czech Machine Translation
  1.4 Example of Poor Pronoun Translation
  1.5 Hypothesis and Contributions
  1.6 Chapter Summary
2 Concepts
  2.1 Anaphora and Coreference
  2.2 Coreference Resolution
  2.3 Czech Language
  2.4 Phrase-based Statistical Machine Translation
  2.5 Moses
  2.6 Evaluation in Machine Translation
    2.6.1 Automated Evaluation
    2.6.2 Manual Evaluation
  2.7 Chapter Summary
3 Data
  3.1 BBN Pronoun Coreference and Entity Type Corpus
  3.2 Penn Treebank 3.0 Corpus
  3.3 PCEDT 2.0 Corpus
  3.4 Chapter Summary
4 Methodology
  4.1 Overview
  4.2 Assumptions
  4.3 Datasets
  4.4 Constructing the Language Model
  4.5 Combining the Corpora
    4.5.1 Identification of Coreferential Pronouns and their Antecedents
    4.5.2 Extraction of the Antecedent Head Noun
    4.5.3 Extraction of Morphological Properties from the PCEDT 2.0 Corpus
  4.6 Training the Translation Models
    4.6.1 Computing the Word Alignments
    4.6.2 Tuning the Translation System Weights: Minimum Error Rate Training (MERT)
    4.6.3 Annotation of the Training Set Data
  4.7 The Annotated Translation Process
  4.8 Annotation and Translation System Architecture
  4.9 Evaluation
    4.9.1 Automated Evaluation: Assessing the Accuracy of Pronoun Translations
    4.9.2 Manual Evaluation: Error Analysis and Human Judgements
  4.10 Chapter Summary
5 Results and Discussion
  5.1 Automated Evaluation
  5.2 Manual Evaluation
  5.3 Critical Evaluation of the Approach and Potential Sources of Error
  5.4 Chapter Summary
6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work
A Czech Pronouns Used in the Automated Evaluation
Bibliography

Chapter 1 Introduction

The primary aim of this project is to produce more accurate coreferring expressions in the target language within English to Czech Statistical Machine Translation (SMT). To date there have been few attempts to integrate coreference resolution methods into Machine Translation. Notable exceptions include two recently published articles, focussing on English to French/German translation of third-person personal pronouns. This project considers the translation of pronouns in English-Czech SMT, which is a more complex issue due to certain properties of the Czech language. Czech is a highly inflective language (as is German) that exhibits subject pro-drop and has a free word order, i.e. the word order reflects the information structure of the discourse.

Whilst considerable progress has been made in Machine Translation research, little attention has been paid to cross-sentence coreference (Le Nagard and Koehn, 2010). The recent work of both Le Nagard and Koehn (2010) and Hardmeier and Federico (2010), focussing on third-person personal pronoun translation for SMT, represents a realisation of the need to address this gap. In particular, it represents an acknowledgement that the appropriate translation of discourse-level phenomena, including pronominal reference, is essential to ensure that the translated text makes sense to its intended audience.

As Le Nagard and Koehn (2010) state, current Machine Translation methods treat sentences as mutually independent and therefore do not handle the cross-sentence dependencies that can arise due to the use of anaphoric reference. The recent work of Le Nagard and Koehn (2010) and Hardmeier and Federico (2010) demonstrates an interest within the research community in improving overall translation quality via the accurate translation of pronouns.
Whilst the method proposed by Le Nagard and Koehn (2010) showed little improvement, the method presented by Hardmeier and Federico (2010) showed a small but significant improvement as measured by their bespoke automated scoring metric that incorporates precision and recall.

This project investigates whether the approach used by Le Nagard and Koehn (2010) can improve pronoun translation in English-Czech SMT. This method was selected in preference to that used by Hardmeier and Federico (2010) due to its simplicity. A major difference between this project and previous work is the use of manually annotated corpora in place of coreference resolution algorithms for extracting pronoun antecedents and automated methods for identifying antecedent head nouns. These corpora provide coreference annotation and noun phrases from which the head noun can be extracted with little effort. This marks the first attempt to assess the potential for source language coreference to improve pronoun translation in SMT by exploiting perfect manual source language coreference annotation. Furthermore, it is also the first attempt to apply the technique of source language pronoun annotation to the English-Czech language pair.

The motivation for using the English-Czech language pair is threefold. Firstly, the availability of the PCEDT 2.0 parallel English-Czech corpus, as provided by the Institute of Formal and Applied Linguistics at Charles University, Prague, coincided with the start of this project. Secondly, as the author is a monolingual speaker, the choice of the second language in the pair was fairly arbitrary, but dependent on the availability of a native speaker to assist in the evaluation of the translation system output and to provide language-specific assistance during the development of such a system. This project benefited enormously from the expert advice of Dr. Markéta Lopatková at Charles University, Prague. The third, and perhaps most salient, reason for choosing Czech as the second language in the translation pair is that Czech is a subject pro-drop language. That is, in Czech, an explicit subject pronoun may be omitted if its antecedent can be predicted on the grounds of saliency and/or verb morphology.
It was initially envisaged that the system developed as part of this project would be designed to explicitly handle this phenomenon. However, due to the complexity of designing a pronoun-focussed translation system and devising a strategy for evaluating the system output, this has been left as a future extension to this project. This document describes in detail the approach taken in the investigation of whether source language annotation may improve pronoun translation in English-Czech SMT. The remainder of this chapter defines the problem, introduces the concept of anaphora resolution and its application in Machine Translation and presents the hypothesis upon which this project is based. Chapter 2 introduces the key concepts and chapter 3, the corpora used in the project. Chapter 4 describes the approach taken in the development of the annotation and translation system and the evaluation of its output. The results of the evaluation are presented and discussed in chapter 5 and the project is concluded in chapter 6. Possible options for future continuation of this work are also included in chapter 6, with suggestions reflecting some of the key issues highlighted in the preceding chapters.

1.1 Definition of the Problem

Pronouns can be used as anaphoric expressions. When a pronoun is used anaphorically, it is called a coreferential pronoun. In Czech, as with many other languages, the number and gender of a personal pronoun must agree with the number and gender of its antecedent. This use of a pronoun to refer back to an antecedent is the phenomenon known as anaphora. When observing this phenomenon in discourse it is common for the pronoun's antecedent to appear in an earlier sentence than the pronoun itself, presenting a problem for current state-of-the-art Machine Translation systems, which translate sentences in isolation.

When sentences are translated in isolation, the contextual information present in the preceding sentences becomes lost. In the case of a coreferential pronoun, if its antecedent appears in a previous sentence, information about that antecedent will be lost by the time the sentence in which the pronoun occurs is considered for translation. The translation of the pronoun is then carried out with no knowledge of the number and gender of the pronoun's antecedent.

Consider the translation of the English pronoun "it" into Czech for the following simple examples [1]:

1. The dog has a ball. I can see it playing outside.
2. The cow is in the field. I can see it grazing.
3. The car is in the garage. I will drive it to school later.

In each of the examples, the English pronoun "it" refers to an entity that has a different gender in Czech. In order to translate the pronoun correctly in Czech it is necessary to identify the gender (and number) of the entity to which the pronoun refers and ensure that the gender (and number) of the pronoun agrees. In example 1, "it" refers to the dog ("pes", masculine) and should be translated as "jeho/ho/jej". In example 2, "it" refers to the cow ("kráva", feminine) and should be translated as "ji". In the final example, 3, "it" refers to the car ("auto", neuter) and should be translated as "je/jej/ho".
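The agreement requirement illustrated by the three examples can be sketched in a few lines of Python. This is a minimal illustration only: the lexicon and the function name are hypothetical, and the pronoun forms are just those quoted in the examples above, not a complete Czech paradigm (case and number are ignored).

```python
from typing import List

# Candidate Czech translations of "it", keyed by the gender of the antecedent.
# Only the forms quoted in the three examples are listed (assumption: object
# position, singular; this is not a full paradigm).
CANDIDATE_FORMS = {
    "masculine": ["jeho", "ho", "jej"],
    "feminine": ["ji"],
    "neuter": ["je", "jej", "ho"],
}

# Hypothetical mini-lexicon: English antecedent -> (Czech noun, Czech gender).
ANTECEDENT_LEXICON = {
    "dog": ("pes", "masculine"),
    "cow": ("kráva", "feminine"),
    "car": ("auto", "neuter"),
}


def candidate_translations(antecedent: str) -> List[str]:
    """Return the Czech forms of "it" that agree with the antecedent's gender."""
    _czech_noun, gender = ANTECEDENT_LEXICON[antecedent]
    return CANDIDATE_FORMS[gender]


for noun in ("dog", "cow", "car"):
    print(noun, "->", "/".join(candidate_translations(noun)))
```

The point of the sketch is that the choice cannot be made from the pronoun alone: the lookup needs the antecedent, which may sit in an earlier sentence.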
In Czech, within the masculine gender, a distinction is made between animate objects (e.g. people and animals) and inanimate objects (e.g. buildings). In many cases the same pronoun may be used for both animate and inanimate masculine genders, but there are a number of cases in which different pronouns must be used. For example, in the case of possessive reflexive pronouns in the accusative case, "svého" is used to refer to a dog (masculine animate, singular) that belongs to someone, e.g. "I admired my (own) dog": "Obdivoval jsem svého psa". This is in contrast with "svůj", which is used to refer to a castle (masculine inanimate, singular) that belongs to someone, e.g. "I admired my (own) castle": "Obdivoval jsem svůj hrad".

The problem of identifying the entity to which a pronoun refers is termed anaphora resolution. Section 1.2 outlines a brief history of anaphora resolution with particular reference to its incorporation in the field of Machine Translation. The concept of Anaphora and the closely related concept of Coreference are described in greater detail in chapter 2.

[1] Examples adapted from information from Local Lingo - an online Czech language resource: http://www.locallingo.com/

1.2 Background

Anaphora resolution involves the identification of the antecedent of a referent, typically a pronominal or noun phrase expression that is used to refer to something that has been previously mentioned in the discourse (the antecedent). In the case where multiple referents refer to the same antecedent, these referents are said to be coreferential; these relationships can be represented using coreference chains. Mitkov et al. (1995) assert that the identification of an anaphor's antecedent is often crucial to ensure a correct translation, especially in cases in which the target language of the translation marks the gender of pronouns.

The problems of anaphora resolution and the related task of coreference resolution have sparked considerable research within the field of Natural Language Processing (NLP). Strube (2007) charts the changes from early techniques that modelled linguistic knowledge algorithmically, such as Hobbs's Algorithm (Hobbs, 1978), the Centering model (Grosz et al., 1995) and Lappin and Leass's algorithm (1994), through to the Supervised and Semi-Supervised Machine Learning methods commonly used today. Even within the sphere of Machine Learning, there is still much debate as to which method provides the best results. Early methods include that which Strube (2007) credits to Soon et al. (2001) - the recasting of coreference resolution as a binary classification task to which Machine Learning techniques can be applied. In contrast, Linh et al.
(2009) argue that ranking-based models are more suited to the task of anaphora resolution. Ng (2010) also argues in favour of ranking models that allow for the identification of the most probable candidate antecedents, claiming that they outperform other classes of supervised Machine Learning methods.

In order to improve methods for anaphora resolution based on supervised Machine Learning, as well as to serve as gold standards for evaluation, parallel efforts have been pursued to manually annotate large corpora with coreference chains. The OntoNotes 3.0 corpus (Weischedel et al., 2009) and the BBN Pronoun Coreference and Entity Type corpus (Weischedel and Brunstein, 2005) (used in this project) are examples of such corpora.

Despite continued efforts to provide methods for anaphora resolution, there has been little work focusing on the integration of anaphora resolution and SMT systems. Le Nagard and Koehn (2010) argue that work on SMT has not moved beyond sentence-level translation. Furthermore, they assert that the translation ambiguity arising from the use of pronouns cannot be resolved within the context of a single sentence if a pronoun refers to an antecedent from a previous sentence. Hardmeier and Federico (2010) present a case study of the performance of one of their SMT systems on personal pronouns to illustrate that improved handling of pronominal anaphora may lead to improvements in translation quality. They report that the SMT system is unable to find a suitable translation for anaphoric pronouns in 39% of cases and that while choosing the wrong pronoun does not generally affect important content words, it can make the output translations difficult to understand.

1.3 Previous Work

1.3.1 Focus on Pronoun Translation in Machine Translation

Early work on the integration of anaphora resolution with Machine Translation includes that of Mitkov et al. (1995), Lappin and Leass (1994) and Saiggon and Carvalho (1994). Mitkov et al. (1995) focussed on intersentential anaphora resolution, conjoining sentences to simulate the intersententiality that could be handled by the rule-based CAT2 Machine Translation system. They provided example output from their system showing instances where pronouns are translated correctly from English to German. However, they provided only the details of their approach and several examples, offering no information relating to the evaluation of their method. Lappin and Leass (1994) integrated their RAP algorithm into a logic-based Machine Translation system, but the core focus of their work was on anaphora resolution and not on Machine Translation. Saiggon and Carvalho (1994) used a transfer approach combined with Artificial Intelligence techniques and focussed on both intersentential and intrasentential anaphora resolution for the translation of pronouns in Portuguese to English translation.
This interest in the 1990s culminated in the publication of a special issue on anaphora resolution in Machine Translation with an introduction provided by Mitkov (1999). No further evidence of work on the integration of anaphora resolution and Machine Translation systems is available until 2010, when papers on the subject were published by Le Nagard and Koehn (2010) and Hardmeier and Federico (2010). This resurgence of interest in anaphora resolution for Machine Translation systems follows advances in the field since the 1990s which have made the application of these new approaches possible.

The approach taken by Le Nagard and Koehn (2010) involves the identification of the antecedent of each coreferential occurrence of "it" and "they" in the source language (English) together with the identification of the antecedent's translation into the target language (French) and its grammatical gender. Based on the gender of the noun in the target language, the occurrence of "it" in the source language text is replaced by "it-masculine", "it-feminine" or "it-neutral". The same is applied for occurrences of "they". Using the Moses toolkit (Hoang et al., 2007), they trained an SMT system on annotated training data composed using the annotation method previously described, before applying the same process to the test data as part of the translation process. In the training of the annotation system, the French translation of the English antecedent is extracted from the parallel corpus using the word alignment obtained as part of the process of training their baseline system. When running test translations, they first translate the test text using the baseline system to extract the French translations of the English antecedents. They then use the gender of the French word to annotate the English pronoun before translating the annotated test text using the system trained on annotated training data. This approach treats the annotation of pronouns as a separate task which is performed outside of the translation process. The authors report little change in the BLEU score of their system over the baseline and instead resort to manually counting the number of correctly translated pronouns. Whilst they attribute the lack of improvement of their system to the poor quality of their coreference resolution system, they claim that the process works well when the coreference resolution system provides accurate results.

The approach taken by Hardmeier and Federico (2010) differs in that it provides a single-step process whereby the identification of a pronoun's antecedent in the source language and the extraction of its target language translation's morphological properties are integrated in the translation process as an additional model in their SMT system.
This additional model maintains a mapping of each source language pronoun and the number and gender of its antecedent. Translation is achieved by first processing the source language test text using a coreference resolution system to identify coreferential pronouns and their antecedents. The output of the coreference resolution system is used as input to a decoder driver module which runs a number of Moses decoder processes in parallel. The decoder driver then feeds individual sentences to the decoder processes using a priority queue to order sentences according to how many pronoun antecedents they contain. Thus sentences that contain a greater number of antecedents are translated first, ensuring a high throughput of the system. The authors report no significant improvement in BLEU score between their system and the baseline, but they do report a small but significant improvement in pronoun translation recall against a single reference translation.

The approach used in this project is similar to that taken by Le Nagard and Koehn (2010). Whilst their project required the use of a coreference resolution system to build coreference chains, the provision of a source language corpus with manually annotated coreference information allowed this project to focus on the translation problem. This project also accommodates a wider range of English pronouns than the study by Le Nagard and Koehn (2010), which only considered the translation of "it" and "they".

1.3.2 English-Czech Machine Translation

Much of the recent work in English-Czech SMT has been conducted at the Institute of Formal and Applied Linguistics at Charles University, Prague. Research has been conducted in many areas, including the development of parallel corpora suitable for the development of Machine Translation systems, such as the PCEDT 2.0 corpus used in this project and its predecessor, the PCEDT 1.0 corpus (Čmejrek et al., 2004). Another area of research has concentrated on the development of both phrase-based and dependency-based SMT systems. In a comparative study of phrase-based and dependency-based SMT systems, Bojar and Hajič (2008) concluded that their best phrase-based system outperformed the experimental dependency-based system, but work continues in both directions.

The decision to focus on phrase-based SMT in this project is due to its simplicity, which, given the relatively short time-scale, is an important factor. That phrase-based systems currently outperform dependency-based systems in English-Czech SMT is an added bonus.

1.4 Example of Poor Pronoun Translation

As an example of poor pronoun translation, consider the following English sentence from the Wall Street Journal corpus and its translation (by a Machine Translation system) in Czech:

he said mexico could be one of the next countries to be removed from the priority list because of its efforts to craft a new patent law.

řekl, že mexiko by mohl být jeden z dalších zemí, aby byl odvolán z prioritou seznam, protože její snahy podpořit nové patentový zákon.

In this example, the English pronoun "its", which refers to "mexico", is translated in Czech as "její" (feminine, singular) and "mexico" is translated as "mexiko" (neuter, singular). Here, the Czech translation of the pronoun and its antecedent disagree in gender.
A more correct translation of the pronoun would be jeho (neuter, singular possessive pronoun) or své (possessive pronoun) depending on the overall structure of the translated sentence.
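One way to make the antecedent's gender visible to a sentence-level translation system is to mark it on the source pronoun itself, in the spirit of the it-masculine / it-feminine / it-neutral annotation of Le Nagard and Koehn (2010) described in Section 1.3.1. The Python sketch below applies that idea to the example sentence. It is illustrative only: the function name, the token-index input and the "its-neuter" label format are assumptions made for this sketch and are not the annotation scheme used in this project.

```python
def annotate_pronouns(tokens, antecedent_gender):
    """Append a gender label to each pronoun token that has a resolved antecedent.

    tokens: list of source-language tokens.
    antecedent_gender: dict mapping a pronoun's token index to the gender of
        the target-language translation of its antecedent (assumed to be
        supplied by coreference annotation plus a target-side lookup).
    """
    annotated = []
    for index, token in enumerate(tokens):
        gender = antecedent_gender.get(index)
        annotated.append(f"{token}-{gender}" if gender else token)
    return annotated


sentence = (
    "he said mexico could be one of the next countries to be removed from "
    "the priority list because of its efforts to craft a new patent law ."
).split()

# "its" refers to "mexico", whose Czech translation "mexiko" is neuter.
its_index = sentence.index("its")
annotated = annotate_pronouns(sentence, {its_index: "neuter"})
print(" ".join(annotated))
```

A system trained on text annotated in this way can learn separate translations for each labelled pronoun form, so the gender decision no longer depends on context the decoder cannot see.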

1.5 Hypothesis and Contributions

The work of Hardmeier and Federico (2010) focussed on English to German translation whilst Le Nagard and Koehn (2010) focussed on English to French translation. This project considers the translation of pronouns in English to Czech SMT and builds on the work of Le Nagard and Koehn (2010) and Hardmeier and Federico (2010). By factoring out the problems of automated coreference resolution, parsing, part-of-speech (POS) tagging and morphological tagging, this project attempts to assess how well an approach to explicitly annotating pronouns in the source language could work when applied to English-Czech SMT if conditions were assumed to be perfect.

Where French (a Romance language) and German (a Germanic language) share a similar root to English, the differences between English and Czech are even greater. Therefore, not only does this project assess the suitability of a pronoun annotation approach in improving the translation of pronouns into another language, but into a language that is very different from English. It is believed that this project is the first attempt made to explicitly handle the problem of pronoun translation in Czech SMT.

This project makes three major contributions:

1. A prototype system for the annotation and translation of pronouns in English-Czech SMT.
2. Automated and manual evaluations of the output of the system as compared against a baseline.
3. An annotated aligned parallel corpus which could be used in future investigations into pronoun translation in English-Czech SMT.

1.6 Chapter Summary

This chapter introduced the specific problem of pronoun translation in SMT, discussed previous work in relation to anaphora resolution, pronoun-focussed Machine Translation and English-Czech SMT and outlined the hypothesis on which this work is based.
The next chapter will describe in detail many of the concepts that are essential to the understanding of the problem as well as the approach taken in the development of the annotation and translation system and its evaluation.

Chapter 2 Concepts

2.1 Anaphora and Coreference

Anaphora is a discourse-level phenomenon in which the interpretation of one expression is dependent on another previously mentioned expression, also known as the antecedent. For example, in the passage below, the word "He" at the start of the second sentence refers to "J.P. Bolduc" at the start of the first sentence. In order to understand the meaning of the second sentence, the reader must first identify the referent of the pronoun "He" (which in this example is "J.P. Bolduc").

J.P. Bolduc, vice chairman of W.R. Grace & Co., which holds a 83.4% interest in this energy-services company, was elected a director. He succeeds Terrence D. Daniels, formerly a W.R. Grace vice chairman, who resigned. [1]

Where anaphora is concerned with referring to a previously mentioned expression in the discourse, coreference is the act of referring to the same referent (Mitkov et al., 2000), such that multiple expressions that refer to the same referent are said to be coreferential. Coreferential chains may be established in order to link multiple referring expressions to the same antecedent expression.

This project focuses on the translation of already resolved instances of nominal anaphora, in which a referring expression - a pronoun, definite Noun Phrase (NP) or proper name - has a non-pronominal NP as its antecedent (Mitkov et al., 2000). The project makes use of manually annotated corpora from which instances of coreferential (and anaphoric) pronouns and their antecedents are identified, in order to annotate training data with which to train an SMT system.

[1] Example taken from the Wall Street Journal corpus
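A coreferential chain can be pictured as an ordered list of mentions of the same entity. The Python sketch below represents the J.P. Bolduc example in that way and looks up the non-pronominal antecedent of the pronoun. The class and field names are hypothetical and chosen for this illustration; they do not reflect the format of any corpus used in this project.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Mention:
    text: str
    sentence: int            # index of the sentence containing the mention
    is_pronoun: bool = False


@dataclass
class CoreferenceChain:
    # Mentions of one entity, in the order in which they occur in the discourse.
    mentions: List[Mention] = field(default_factory=list)

    def antecedent_of(self, mention: Mention) -> Optional[Mention]:
        """Return the closest preceding non-pronominal mention, if any."""
        position = self.mentions.index(mention)
        for earlier in reversed(self.mentions[:position]):
            if not earlier.is_pronoun:
                return earlier
        return None


bolduc = Mention("J.P. Bolduc", sentence=0)
he = Mention("He", sentence=1, is_pronoun=True)
chain = CoreferenceChain([bolduc, he])

antecedent = chain.antecedent_of(he)
print(antecedent.text if antecedent else None)
```

Here the antecedent lies in the sentence before the pronoun, which is exactly the case that sentence-by-sentence translation cannot see.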

2.2 Coreference Resolution

Coreference Resolution is the process of identifying the referent to which a referring expression refers. In this project, the pronouns are the referring expressions and the antecedents are the referents. As discussed in chapter 1, there has been much research into the development of automated methods to provide coreference and anaphora resolution. Such automated methods were used by both Le Nagard and Koehn (2010) and Hardmeier and Federico (2010), but it is well documented that these methods do not achieve perfect accuracy. Indeed, Le Nagard and Koehn (2010) cite the poor performance of their coreference resolution as a possible reason for their lack of improvement in pronoun translation. In this project, a manually annotated coreference corpus (the BBN Pronoun Coreference and Entity Type corpus) is used to identify coreferential pronouns and their antecedents. As the corpus has been manually annotated, the coreference annotation is assumed to be highly accurate.

2.3 Czech Language

Czech is a member of the western group of Slavic languages. Like other Slavic languages it is highly inflective, with seven cases and four grammatical genders: masculine animate (for people and animals), masculine inanimate (for inanimate objects), feminine and neuter. In the case of the feminine and neuter genders, animacy is not grammatically marked. Czech is a free word-order language, in which word order reflects the information structure of the sentence within the current discourse. In addition, Czech is a pro-drop language; an explicit subject pronoun may be omitted if it may be inferred based on some other grammatical feature, for example verb morphology [2]. In contrast with Czech, English is neither a highly inflectional nor a pro-drop language. Furthermore, English follows a Subject-Verb-Object (SVO) pattern for word order and lacks grammatical gender.
[2] Information provided by The Czech Language - an online guide: http://www.czech-language.cz

2.4 Phrase-based Statistical Machine Translation

Phrase-based models are currently the best performing SMT models (Koehn, 2009). The concept behind these models is the decomposition of the translation problem into a number of smaller word sequences, called phrases, which are translated one at a time in order to build the complete translation. It is important to note that a phrase may be any sequence of words of arbitrary length and that there is no deep linguistic motivation behind the choice of segmentation.

Phrase-based models have several advantages over word-based models, in which words are translated in isolation. Firstly, phrase-based models provide a simple solution to the problem where a single word in the source language translates into multiple words in the target language, or vice versa. Secondly, translating phrases rather than single words can help to resolve translation ambiguities. Finally, with phrase-based models, the notions of insertion and deletion that are present in word-based models are no longer necessary, leading to a model that is conceptually simpler.

The three components that make up a phrase-based model are the translation model, the language model and the reordering model.

The translation model takes the form of a phrase translation table, which provides a mapping between source and target language phrases together with the probability associated with each mapping. The phrase translation table is learned by creating word alignments between the aligned sentence pairs of a parallel training corpus. The word alignments are collected for both translation directions, the alignment points are merged, and those phrases that are consistent with the word alignment are then extracted. The probability assigned to each phrase mapping in the table is calculated by counting the number of (parallel) sentence pairs in which a particular phrase pair appears, and then computing the relative frequency of this count compared with the count of the source phrase translating as any other phrase in the target language.

The language model ensures the fluency of the translations output by the model, providing a means to score and hence identify the best output translation from a list of candidate translations.
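The relative-frequency estimate used for the phrase translation table, as described above, can be illustrated with a minimal Python sketch. This is not the Moses implementation: it assumes that phrase pairs have already been extracted, with one list entry per sentence pair in which the pair appears, and the function name and the toy English-Czech pairs are hypothetical.

```python
from collections import Counter


def phrase_translation_probs(extracted_pairs):
    """Relative-frequency estimate for each (source, target) phrase pair.

    extracted_pairs: iterable of (source_phrase, target_phrase) tuples,
    one entry per sentence pair in which the phrase pair was extracted
    (an assumption made for this sketch).
    """
    pair_counts = Counter(extracted_pairs)

    # Total count of each source phrase over all of its translations.
    source_totals = Counter()
    for (source, _target), count in pair_counts.items():
        source_totals[source] += count

    # Probability = count(source, target) / count(source, any target).
    return {
        (source, target): count / source_totals[source]
        for (source, target), count in pair_counts.items()
    }


# Toy data for illustration only.
pairs = [("he", "on"), ("he", "on"), ("he", "ten"), ("it", "to")]
probs = phrase_translation_probs(pairs)
```

In this toy example the source phrase "he" is seen three times, twice with "on" and once with "ten", so the two mappings receive probabilities of 2/3 and 1/3 respectively.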
The language models used in SMT are typically n-gram language models, which consist of n-grams in the target language together with probabilities based on maximum likelihood estimation. A language model is usually constructed from the target side of the parallel corpus used to train the translation model, and may be augmented with additional in-domain target data or weighted with a separate out-of-domain language model. Smoothing is often applied to improve the reliability of the probability estimates, with modified Kneser-Ney smoothing commonly used in SMT (Kneser and Ney, 1995).

The reordering model allows phrases in the source language to be taken out of sequence when building the translation in the target language, thereby allowing phrase-level reordering. Allowing unlimited reordering can have a detrimental effect on translation quality, and so it is usual for a penalty to be associated with any reordering that takes place. Penalties are assigned such that a larger cost is associated with moving a phrase that skips more word positions than with moving one that skips fewer.

In phrase-based SMT, these three models are combined as a linear model. The best translation $\arg\max_{c} p(c \mid e)$ is computed using Bayes' Rule, which combines the three components of the phrase-based model as in the equation below: the translation model $\phi(e \mid c)$, the language model $P_{LM}$ and the reordering model $\Omega(e \mid c)$.

\[
\arg\max_{c} \, p(c \mid e) = \arg\max_{c} \, \phi(e \mid c) \, P_{LM} \, \Omega(e \mid c)
\]

Here e is an English sentence and c is the Czech translation of that sentence.

Once the components of the phrase-based model have been constructed, their weights are tuned to optimise the overall model performance. Tuning is carried out using a dataset that is kept separate from the main training dataset for this specific purpose. Minimum Error Rate Training (MERT) (Och, 2003) is a commonly used tuning technique in SMT. MERT tunes the model weights to optimise performance as measured using BLEU scores calculated against one or more reference translations. BLEU is described in more detail in section 2.6.

In Machine Translation, the process of finding the best scoring translation according to the model is referred to as decoding (Koehn, 2009). Using a phrase-based translation model, decoding is carried out by starting with a source sentence and building the translation from left to right, extracting source phrases in any order. The phrases are translated into the target language and then stitched together to make a complete translation. The source words covered by each phrase are marked as translated, and the process continues until all of the source words have been covered. As there are many possible valid translations of a single source language sentence, these variations must be captured. This is achieved using a search graph, from which the single best translation (or an N-best list) may be derived using a scoring method that uses a language model and the phrase table probabilities.

2.5 Moses

Moses (Hoang et al., 2007) is an open source SMT toolkit that provides automated training of translation models and may be used with any language pair, given a parallel training corpus.
Moses may be used to construct both tree-based and phrase-based translation models, but for the purpose of this project only the phrase-based training was required. The automated training process produces a phrase translation table and a lexicalised reordering model. The language model is created separately using the target side of the parallel corpus together with additional in-domain corpus data as required. The training process consists of a number of steps, which include data preparation, the creation of word alignments using GIZA++ (Och and Ney, 2003), the extraction and scoring of phrases, and the building of the generation and lexicalised reordering models. [3] The generation model contains probabilities for both directions of translation.

During testing, in which a sentence or collection of sentences from the test corpus (which are not also included in the training corpus) are translated, the Moses decoder constructs a search graph and uses a beam search algorithm to select the translation with the highest probability from that graph. The search graph is constructed using the process of hypothesis expansion. Hypothesis combination and pruning are then employed to reduce the search space. In the Moses implementation of beam search, hypotheses that cover the same number of foreign words are compared and those with high cost (low probability) are pruned. The cost of each hypothesis is calculated using a combination of the cost of translation and the estimated future cost of translating the remaining source text for the current sentence. Whilst the decoder may be used to output an N-best list of translations for an input sentence, in this project only the best translation is required and therefore only a single translation is requested from the decoder.

2.6 Evaluation in Machine Translation

Evaluation in Machine Translation typically falls into one of two categories: manual or automated. Whilst automated methods are used to ascertain improvements during the development of a Machine Translation system, manual methods using either monolingual or bilingual human judges are typically used to provide the final evaluation.

Currently there are no standard automated metrics available for the evaluation of pronoun translation in SMT. Hardmeier and Federico (2010) developed their own bespoke automated metric incorporating precision and recall measured against a single reference translation. In contrast, Le Nagard and Koehn (2010) relied on manually counting the number of correctly translated pronouns in their system output.
Manual evaluation of the results is slow and therefore not a practical solution for large volumes of text. Furthermore, for a monolingual SMT system developer, manual evaluation must be outsourced to a third party, adding a further hindrance to the development process.

In this project, the Czech translations output by the phrase-based SMT system were evaluated using a combination of manual and automated methods. The manual methods focussed on human judgements as to whether pronouns in the Machine Translation output were correctly used or dropped and, if they were incorrectly used, whether a native Czech speaker would be able to understand the meaning of the sentence as a whole. BLEU, an automated metric widely used in the evaluation of SMT systems, was used during system development as a preliminary check to confirm that the system output was valid Czech, before a more detailed automated analysis of the results was conducted. The evaluation methods used in this project are discussed in more detail in chapter 4.

[3] A full description of the Moses translation system training process can be found at: http://www.statmt.org/moses/

2.6.1 Automated Evaluation

BLEU (Papineni et al., 2002) is an automated evaluation metric widely used in SMT to assess the overall quality of the output translations. It provides an efficient and low-cost alternative to human judgements for measuring system improvement across development iterations. It computes a document-level score of the translated output against a single reference translation or a set of reference translations (Koehn, 2009). The BLEU score is based on a combination of n-gram precision and a brevity penalty.

\[
\mathrm{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)
\]

The n-gram precision ($p_n$) is the ratio of the number of n-grams of order n in the output translation that are also present in the reference translation to the total number of n-grams of order n in the output translation, and the $w_n$ are positive weights that sum to one. The brevity penalty (BP) ensures that the output translation is not too short compared with the reference translation. The effect of the brevity penalty is that the BLEU score is reduced if the output translation is shorter than the reference translation, i.e. where words are dropped in the output translation. The BLEU score is applied at the document level in order to allow some freedom in translation output length at the sentence level, for example where a single source sentence may be translated into two sentences in the target language, or vice versa.

BLEU has been widely criticised (Koehn, 2009), yet remains one of the most popular automated evaluation metrics in use with SMT systems due to its high correlation with human judgements of quality (Papineni et al., 2002).
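As an illustration of the BLEU formula above, the following is a minimal Python sketch of a document-level score against a single reference translation. It is not the scoring script used in this project. It assumes uniform weights w_n = 1/N with N = 4, clipped n-gram counts, and a brevity penalty of exp(1 - r/c) when the output length c does not exceed the reference length r, following the definitions of Papineni et al. (2002); the function names are chosen for illustration only.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypotheses, references, max_n=4):
    """Document-level BLEU with one reference per sentence.

    hypotheses, references: parallel lists of tokenised sentences.
    """
    matches = [0] * max_n  # clipped n-gram matches, summed over the document
    totals = [0] * max_n   # n-grams in the output, summed over the document
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp, n)
            ref_ngrams = ngram_counts(ref, n)
            # Clip each n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
            totals[n - 1] += sum(hyp_ngrams.values())

    if min(matches) == 0:
        return 0.0  # log p_n is undefined when some precision is zero

    log_precision = sum(math.log(m / t) / max_n for m, t in zip(matches, totals))
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


reference = [["a", "b", "c", "d", "e"]]
score_identical = bleu(reference, reference)
score_short = bleu([["a", "b", "c", "d"]], reference)
```

In the second call every n-gram of the output is found in the reference, so all precisions equal one and the score is reduced only by the brevity penalty, exp(1 - 5/4).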
With respect to the specific problem of pronoun translation evaluation in Czech, two further criticisms apply. Firstly, as the sole focus of this project is pronoun translation, only a small number of words are expected to change between the translations produced by the baseline and annotated translation systems. Therefore, the variation in BLEU score is expected to be very small. Observations regarding the shortcomings of BLEU in relation to the evaluation of pronoun translation have been made previously by both Le Nagard and Koehn (2010) and Hardmeier and Federico (2010). Secondly, Czech is a highly inflective language with four genders and seven cases, so with only a single reference translation provided in the PCEDT 2.0 corpus it is not reasonable to evaluate the output of the translation systems using a recall-based method. Bojar and Kos (2010) are critical of the use of BLEU scores in the evaluation of English-Czech SMT, claiming that BLEU scores correlate poorly with human judgements. It is for these reasons that BLEU was not used in the evaluation of the systems developed as part of this project.

2.6.2 Manual Evaluation

The manual evaluation of Machine Translation output can be rather complex. Human judges are typically required to rate a single target language text using a five-point scale, or to rank several target language texts, based on fluency (whether the text is fluent) and adequacy (whether the meaning of the source language text has been captured) (Koehn, 2009).

Evaluation based on fluency and adequacy judgements suffers from a number of problems. Firstly, it can be slow and unreliable (Callison-Burch et al., 2008). Secondly, the scores assigned by human judges for fluency and adequacy are often very close, suggesting that the judges may find it difficult to make a clear distinction between the two criteria. Thirdly, there are concerns that without explicit instructions, many human judges develop their own rules or misinterpret the intended use of an absolute scale and instead score the output of multiple systems relative to one another (Callison-Burch et al., 2007). Finally, manual evaluation using such criteria tends to be subjective, which can lead to poor agreement between human judges. Moreover, these manual methods tend to focus on sentences as a whole and are therefore not wholly applicable to the more specific problem of evaluating pronoun translation.

2.7 Chapter Summary

This chapter introduced the concepts of anaphora and coreference resolution and provided an introduction to phrase-based SMT, the Moses toolkit and the methods currently used in the evaluation of Machine Translation output.
In particular, the various issues associated with automated and manual evaluation methods were highlighted with respect to their application to the more specific problem of evaluating pronoun translation. The next chapter will introduce the manually annotated corpora used in this project.

Chapter 3 Data

In the development of the annotation and translation process, a number of manually annotated corpora in both English and Czech are used. The BBN Pronoun Coreference and Entity Type corpus provides the English (source) side of the parallel corpus and is used to identify coreferential pronouns and their antecedents. The PCEDT 2.0 corpus provides the Czech (target) side of the parallel corpus. Each corpus contains text, or a translation of the original text, taken from a subset of the Wall Street Journal (WSJ). It is the provision of these manually annotated corpora that allowed the project to focus solely on the translation problem, without the need for automated methods for coreference or anaphora resolution.

In addition, the annotation of the WSJ files within the Penn Treebank 3.0 corpus is used to identify a single antecedent head word in cases where the antecedent extracted from the BBN Pronoun Coreference and Entity Type corpus spans multiple words. This is particularly important because, in order to extract the number and gender of a Czech word, it is necessary to first identify the head of the English antecedent. The corpora are described in detail in the following sections.

3.1 BBN Pronoun Coreference and Entity Type Corpus

The BBN Pronoun Coreference and Entity Type corpus (Weischedel and Brunstein, 2005) provides annotations of the WSJ file texts with pronoun coreference and entity types, together with the raw English text. For the purpose of this project, two files from the corpus are used: the WSJ.sent file, which contains the raw English sentences, and the WSJ.pron pronoun coreference file, which contains a list of coreferential pronouns together with their antecedents. In the pronoun coreference file, coreferential pronouns and their antecedents are indexed using sentence and word token numbers.

The WSJ.sent file has the format:

    (WSJ0005
    S1: J.P. Bolduc, vice chairman of W.R. Grace & Co., which...
    S2: He succeeds Terrence D. Daniels, formerly a W.R. Grace...
    S3: W.R. Grace holds three of Grace Energy's seven board seats.
    )

For each file in the corpus collection, the sentences are numbered and listed in the order in which they appear in the text.

The WSJ.pron file has the format:

    (WSJ0005
    (
    Antecedent -> S1:1-2 -> J.P. Bolduc
    Pronoun -> S2:1-1 -> He
    )

For each WSJ file in the collection, each antecedent and the pronouns that refer to it are listed, together with the number of the sentence in which they appear and the start and end positions of the word(s) within the sentence.

It was initially envisaged that the OntoNotes 3.0 corpus (Weischedel et al., 2009) would be used to identify coreferential pronouns and their antecedents. However, the annotation in the BBN Pronoun Coreference and Entity Type corpus allows for a simpler method of identification and extraction than the OntoNotes 3.0 corpus. The OntoNotes 3.0 corpus is therefore left as an alternative source of coreference information. Because the two corpora differ in which types of coreference are annotated, using the OntoNotes 3.0 corpus as an alternative or additional source of coreference information would allow for an investigation into the translation of "it", "this" and "that" marked as event coreference.

3.2 Penn Treebank 3.0 Corpus

The Penn Treebank 3.0 corpus contains manually annotated parse trees of the sentences within the WSJ corpus. The merged files within the corpus contain both parse and part-of-speech annotation and as such may be used to identify Noun Phrases (NPs) and, through the use of simple rules, the head of an NP. The corpus contains a separate merged file for each WSJ file. Within each file, a parse is provided for each sentence, with part-of-speech tags provided for each word or token.
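To illustrate how the head of an NP might be read off a bracketed parse, a minimal Python sketch follows. The rule used here is an assumption made for illustration and is not necessarily the rule set used in this project: the head is taken to be the rightmost noun-tagged token directly under the NP, falling back to the head of the first nested NP. The function names are hypothetical, and the input is assumed to be a single NP subtree rather than a complete merged file.

```python
import re

# Penn Treebank noun tags (singular, plural, proper, proper plural).
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def parse_bracketed(text):
    """Parse one bracketed subtree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    position = 0

    def read_node():
        nonlocal position
        position += 1                      # skip "("
        label = tokens[position]
        position += 1
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(read_node())
            else:
                children.append(tokens[position])  # a word (leaf)
                position += 1
        position += 1                      # skip ")"
        return (label, children)

    return read_node()


def np_head(np_node):
    """Return the head word of an NP under the assumed rule, or None."""
    _label, children = np_node
    # Assumed rule 1: rightmost noun-tagged child of the NP.
    for child in reversed(children):
        if isinstance(child, tuple) and child[0] in NOUN_TAGS:
            return child[1][0]
    # Assumed rule 2: otherwise, the head of the first nested NP.
    for child in children:
        if isinstance(child, tuple) and child[0] == "NP":
            return np_head(child)
    return None


head = np_head(parse_bracketed("(NP (NNP J.P.) (NNP Bolduc))"))  # "Bolduc"
```

For the antecedent J.P. Bolduc from the earlier example, this rule selects Bolduc as the single head word; an NP that contains no noun and no nested NP, such as a bare pronoun, yields no head.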