Improving Pronoun Translation for Statistical Machine Translation (SMT)
Liane Guillou
Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2011
Abstract

Machine Translation is a well established field, yet the majority of current systems perform the translation of sentences in complete isolation, losing valuable contextual information from previously translated sentences in the discourse. One such class of contextual information concerns who or what it is that a reduced referring expression such as a pronoun is meant to refer to. The use of inappropriate referring expressions in a target language text can seriously affect its ability to be understood by the reader. This project follows on from two recent research papers that focussed on improving the translation of pronouns in Statistical Machine Translation (SMT). The approach taken is to annotate the pronouns in the source language with the morphological properties of the antecedent translation in the target language prior to translation using a phrase-based English-Czech SMT system. The project makes use of a number of manually annotated corpora in order to factor out the effects arising from poor coreference resolution, wherein selecting the wrong antecedent for a pronoun in the source language text will wrongly bias its translation. The aim of this work is to discover whether perfect coreference resolution in the source language text can reduce the incidence of inappropriate referring expressions in the target language text. The annotated translation system developed as part of this project makes only a marginal improvement over the baseline system, as measured using a bespoke automated evaluation metric. These results are supported by a manual evaluation conducted by a native Czech speaker. The lack of substantial improvement over the baseline may be attributed to many factors, not least of which is the highly inflective nature of the Czech language.
Acknowledgements

I would like to thank my supervisor, Professor Bonnie Webber, for her continued guidance and support from the conception of this project through to its realisation. I am deeply grateful for the patience that she has shown in explaining to me those concepts that were difficult to grasp, for setting me on the correct path when I became lost and most of all, for infecting me with her enthusiasm for this work. I have thoroughly enjoyed my time spent working on this project and I couldn't have asked for anything more in terms of the supervision I have received in my first foray into the field of Machine Translation.

Special thanks are owed to Dr. Markéta Lopatková and Dr. Ondřej Bojar at Charles University. I am indebted to Markéta for her suggestions, enthusiasm and assistance with the analysis of results at every stage of this project. Her expertise in Czech Natural Language Processing has proved invaluable and I can honestly say as a monolingual speaker that without her help, this project would not have been possible. I am also extremely grateful to Ondřej for his recommendations with respect to the stemming of the English and Czech data to obtain shared word alignments for the translation models and his suggestions regarding the automated evaluation of the translation output.

Thanks also to Christian Hardmeier for his patience in answering my many questions in relation to his previous work on pronoun translation and evaluation. Credit is also owed to David Mareček at Charles University, who created the PCEDT 2.0 alignment file used in this project.

Finally, I would like to thank my colleagues for their company during the long days spent in the computer labs and their assistance in peer reviewing this document.

The PCEDT 2.0 corpus, which is not yet publicly available, has been used with permission from the Institute of Formal and Applied Linguistics, Charles University, Prague.
Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Liane Guillou)
I dedicate this thesis to my mother, Anna Guillou, who instilled in me from an early age the importance of education and made sacrifices to ensure that I received the very best. Her love, encouragement and unwavering support have been instrumental throughout my life, and have given me the confidence that I needed to embark upon this course of further study. Words alone cannot convey my gratitude.
Table of Contents

1 Introduction
    Definition of the Problem
    Background
    Previous Work
        Focus on Pronoun Translation in Machine Translation
        English-Czech Machine Translation
    Example of Poor Pronoun Translation
    Hypothesis and Contributions
    Chapter Summary
2 Concepts
    Anaphora and Coreference
    Coreference Resolution
    Czech Language
    Phrase-based Statistical Machine Translation
    Moses
    Evaluation in Machine Translation
    Automated Evaluation
    Manual Evaluation
    Chapter Summary
3 Data
    BBN Pronoun Coreference and Entity Type Corpus
    Penn Treebank 3.0 Corpus
    PCEDT 2.0 Corpus
    Chapter Summary
4 Methodology
    Overview
    Assumptions
    Datasets
    Constructing the Language Model
    Combining the Corpora
    Identification of Coreferential Pronouns and their Antecedents
    Extraction of the Antecedent Head Noun
    Extraction of Morphological Properties from the PCEDT 2.0 Corpus
    Training the Translation Models
    Computing the Word Alignments
    Tuning the Translation System Weights: Minimum Error Rate Training (MERT)
    Annotation of the Training Set Data
    The Annotated Translation Process
    Annotation and Translation System Architecture
    Evaluation
    Automated Evaluation: Assessing the Accuracy of Pronoun Translations
    Manual Evaluation: Error Analysis and Human Judgements
    Chapter Summary
5 Results and Discussion
    Automated Evaluation
    Manual Evaluation
    Critical Evaluation of the Approach and Potential Sources of Error
    Chapter Summary
6 Conclusion and Future Work
    Conclusion
    Future Work
A Czech Pronouns Used in the Automated Evaluation
Bibliography
Chapter 1 Introduction

The primary aim of this project is to produce more accurate coreferring expressions in the target language within English to Czech Statistical Machine Translation (SMT). To date there have been few attempts to integrate coreference resolution methods into Machine Translation. Notable exceptions include two recently published articles, focussing on English to French/German translation of third-person personal pronouns. This project considers the translation of pronouns in English-Czech SMT, which is a more complex issue due to certain properties of the Czech language. Czech is a highly inflective language (as is German) that exhibits subject pro-drop and has a free word-order, i.e. the word order reflects the information structure of discourse.

Whilst considerable progress has been made in Machine Translation research, little attention has been paid to cross-sentence coreference (Le Nagard and Koehn, 2010). The recent work of both Le Nagard and Koehn (2010) and Hardmeier and Federico (2010), focussing on third-person personal pronoun translation for SMT, represents a realisation of the need to address this gap. In particular, it represents an acknowledgement that the appropriate translation of discourse-level phenomena, including pronominal reference, is essential to ensure that the translated text makes sense to its intended audience. As Le Nagard and Koehn (2010) state, current Machine Translation methods treat sentences as mutually independent and therefore do not handle the cross-sentence dependencies that can arise due to the use of anaphoric reference. The recent work of Le Nagard and Koehn (2010) and Hardmeier and Federico (2010) demonstrates an interest within the research community in improving overall translation quality via the accurate translation of pronouns.
Whilst the method proposed by Le Nagard and Koehn (2010) showed little improvement, the method presented by Hardmeier and Federico (2010) showed a small but significant improvement as measured by their bespoke automated scoring metric that incorporates precision and recall.
This project investigates whether the approach used by Le Nagard and Koehn (2010) can improve pronoun translation in English-Czech SMT. This method was selected in preference to that used by Hardmeier and Federico (2010) due to its simplicity. A major difference between this project and previous work is the use of manually annotated corpora in place of coreference resolution algorithms to extract pronoun antecedents and automated methods to identify antecedent head nouns. These corpora provide coreference annotation and noun phrases from which the head noun can be extracted with little effort. This marks the first attempt to assess the potential for source language coreference to improve pronoun translation in SMT by exploiting perfect manual source language coreference annotation. Furthermore it is also the first attempt to apply the technique of source language pronoun annotation to the English-Czech language pair.

The motivation for using the English-Czech language pair is threefold. Firstly, the availability of the PCEDT 2.0 parallel English-Czech corpus, as provided by the Institute of Formal and Applied Linguistics at Charles University, Prague, coincided with the start of this project. Secondly, as the author is a monolingual speaker, the choice of the second language in the pair is fairly arbitrary, but dependent on the availability of a native speaker to assist in the evaluation of the translation system output and to provide language specific assistance during the development of such a system. This project benefited enormously from the expert advice of Dr. Markéta Lopatková at Charles University, Prague. The third, and perhaps most salient, reason for choosing Czech as the second language in the translation pair is that Czech is a subject pro-drop language. That is, in Czech, an explicit subject pronoun may be omitted if its antecedent can be predicted on the grounds of saliency and/or verb morphology.
It was initially envisaged that the system developed as part of this project would be designed to explicitly handle this phenomenon. However, due to the complexity of designing a pronoun-focussed translation system and devising a strategy for evaluating the system output, this has been left as a future extension to this project. This document describes in detail the approach taken in the investigation of whether source language annotation may improve pronoun translation in English-Czech SMT. The remainder of this chapter defines the problem, introduces the concept of anaphora resolution and its application in Machine Translation and presents the hypothesis upon which this project is based. Chapter 2 introduces the key concepts and chapter 3, the corpora used in the project. Chapter 4 describes the approach taken in the development of the annotation and translation system and the evaluation of its output. The results of the evaluation are presented and discussed in chapter 5 and the project is concluded in chapter 6. Possible options for future continuation of this work are also included in chapter 6, with suggestions reflecting some of the key issues highlighted in the preceding chapters.
1.1 Definition of the Problem

Pronouns can be used as anaphoric expressions; this is the phenomenon known as anaphora. When a pronoun is used anaphorically, it is called a coreferential pronoun. In Czech, as with many other languages, the number and gender of a personal pronoun must agree with the number and gender of its antecedent. When observing anaphora in discourse it is common for the pronoun's antecedent to appear in an earlier sentence than the pronoun itself, presenting a problem for current state of the art Machine Translation systems which translate sentences in isolation. When sentences are translated in isolation, the contextual information present in the preceding sentences becomes lost. In the case of a coreferential pronoun, if its antecedent appears in a previous sentence, information about that antecedent will be lost by the time the sentence in which the pronoun occurs is considered for translation. The translation of the pronoun is then carried out with no knowledge of the number and gender of the pronoun's antecedent. Consider the translation of the English pronoun "it" into Czech for the following simple examples (footnote 1):

1. The dog has a ball. I can see it playing outside.
2. The cow is in the field. I can see it grazing.
3. The car is in the garage. I will drive it to school later.

In each of the examples, the English pronoun "it" refers to an entity that has a different gender in Czech. In order to translate the pronoun correctly in Czech it is necessary to identify the gender (and number) of the entity to which the pronoun refers and ensure that the gender (and number) of the pronoun agrees. In example 1, "it" refers to the dog (pes, masculine) and should be translated as jeho/ho/jej. In example 2, "it" refers to the cow (kráva, feminine) and should be translated as ji. In the final example, 3, "it" refers to the car (auto, neuter) and should be translated as je/jej/ho.
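The dependency between antecedent gender and pronoun form in the three examples can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lexicon and the function name are hypothetical, and the pronoun forms are just those quoted in the examples, not a full Czech paradigm.

```python
# Accusative forms quoted in the examples above; NOT a complete paradigm.
ACCUSATIVE_FORMS = {
    "masculine": ["jeho", "ho", "jej"],
    "feminine": ["ji"],
    "neuter": ["je", "jej", "ho"],
}

# Hypothetical mini-lexicon: English antecedent -> (Czech noun, gender).
ANTECEDENT_LEXICON = {
    "dog": ("pes", "masculine"),
    "cow": ("kráva", "feminine"),
    "car": ("auto", "neuter"),
}


def candidate_translations(antecedent):
    """Return the Czech forms of 'it' that agree with the antecedent's gender."""
    _czech_noun, gender = ANTECEDENT_LEXICON[antecedent]
    return ACCUSATIVE_FORMS[gender]


for noun in ("dog", "cow", "car"):
    print(noun, "->", "/".join(candidate_translations(noun)))
```

The point of the sketch is that the lookup needs the antecedent, which a system translating one sentence at a time no longer has access to when it reaches the pronoun.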
Footnote 1: Examples adapted from information from Local Lingo, an online Czech language resource.

In Czech, within the masculine gender, a distinction is made between animate objects (e.g. people and animals) and inanimate objects (e.g. buildings). In many cases the same pronoun may be used for both animate and inanimate masculine genders, but there are a number of cases in which different pronouns must be used. For example, in the case of possessive reflexive pronouns in the accusative case, svého is used to refer to a dog (masculine animate, singular) that belongs to someone, e.g. "I admired my (own) dog": Obdivoval jsem svého psa. This is in contrast with svůj, which is used to refer to a castle (masculine inanimate, singular) that
belongs to someone, e.g. "I admired my (own) castle": Obdivoval jsem svůj hrad. The problem of identifying the entity to which a pronoun refers is termed anaphora resolution. Section 1.2 outlines a brief history of anaphora resolution with particular reference to its incorporation in the field of Machine Translation. The concept of Anaphora and the closely related concept of Coreference are described in greater detail in chapter 2.

1.2 Background

Anaphora resolution involves the identification of the antecedent of a referent, typically a pronominal or noun phrase expression that is used to refer to something that has been previously mentioned in the discourse (the antecedent). In the case where multiple referents refer to the same antecedent, these referents are said to be coreferential; these relationships can be represented using coreference chains. Mitkov et al. (1995) assert that the identification of an anaphor's antecedent is often crucial to ensure a correct translation, especially in cases in which the target language of the translation marks the gender of pronouns.

The problems of anaphora resolution and the related task of coreference resolution have sparked considerable research within the field of Natural Language Processing (NLP). Strube (2007) charts the changes from early techniques that modelled linguistic knowledge algorithmically, such as Hobbs's Algorithm (Hobbs, 1978), the Centering model (Grosz et al., 1995) and Lappin and Leass's algorithm (1994), through to the Supervised and Semi-Supervised Machine Learning methods commonly used today. Even within the sphere of Machine Learning, there is still much debate as to which method provides the best results. Early methods include that which Strube (2007) credits to Soon et al. (2001): the recasting of coreference resolution as a binary classification task to which Machine Learning techniques can be applied. In contrast, Linh et al.
(2009) argue that ranking based models are more suited to the task of anaphora resolution. Ng (2010) also argues in favour of ranking models that allow for the identification of the most probable candidate antecedents, claiming that they outperform other classes of supervised Machine Learning methods. In order to improve methods for anaphora resolution based on supervised Machine Learning, as well as to serve as Gold standards for evaluation, parallel efforts have been pursued to manually annotate large corpora with coreference chains. The OntoNotes 3.0 corpus (Weischedel et al., 2009) and the BBN Pronoun Coreference and Entity Type corpus (Weischedel and Brunstein, 2005) (used in this project) are examples of such corpora. Despite continued efforts into providing methods for anaphora resolution, there has been little work focusing on the integration of anaphora resolution and SMT systems. Le Nagard and
Koehn (2010) argue that work on SMT has not moved beyond sentence-level translation. Furthermore they assert that the translation ambiguity arising from the use of pronouns cannot be resolved within the context of a single sentence if a pronoun refers to an antecedent from a previous sentence. Hardmeier and Federico (2010) present a case study of the performance of one of their SMT systems on personal pronouns to illustrate that improved handling of pronominal anaphora may lead to improvements in translation quality. They report that the SMT system is unable to find a suitable translation for anaphoric pronouns in 39% of cases and that while choosing the wrong pronoun does not generally affect important content words, it can make the output translations difficult to understand.

1.3 Previous Work

1.3.1 Focus on Pronoun Translation in Machine Translation

Early work on the integration of anaphora resolution with Machine Translation includes that of Mitkov et al. (1995), Lappin and Leass (1994) and Saggion and Carvalho (1994). Mitkov et al. (1995) focussed on intersentential anaphora resolution, conjoining sentences to simulate the intersententiality that could be handled by the rule-based CAT2 Machine Translation system. They provided example output from their system showing instances where pronouns are translated correctly from English to German. However, they provided only the details of their approach and several examples, offering no information relating to the evaluation of their method. Lappin and Leass (1994) integrated their RAP algorithm into a logic-based Machine Translation system, but the core focus of their work was on anaphora resolution and not on Machine Translation. Saggion and Carvalho (1994) used a transfer approach combined with Artificial Intelligence techniques and focussed on both intersentential and intrasentential anaphora resolution for the translation of pronouns in Portuguese to English translation.
This interest in the 1990s culminated in the publication of a special issue on anaphora resolution in Machine Translation with an introduction provided by Mitkov (1999). No further evidence of work on the integration of anaphora resolution and Machine Translation systems is available until 2010, in which papers on the subject were published by Le Nagard and Koehn (2010) and Hardmeier and Federico (2010). This resurgence of interest in anaphora resolution for Machine Translation systems follows advances in the field since the 1990s which have made the application of these new approaches possible.

The approach taken by Le Nagard and Koehn (2010) involves the identification of the antecedent of each coreferential occurrence of "it" and "they" in the source language (English) together with the identification of the antecedent's translation into the target language (French)
and its grammatical gender. Based on the gender of the noun in the target language, the occurrence of "it" in the source language text is replaced by it-masculine, it-feminine or it-neutral. The same is applied for occurrences of "they". Using the Moses toolkit (Hoang et al., 2007), they trained an SMT system on annotated training data composed using the annotation method previously described, before applying the same process to the test data as part of the translation process. In the training of the annotation system the French translation of the English antecedent is extracted from the parallel corpus using the word alignment obtained as part of the process of training their baseline system. When running test translations, they first translate the test text using the baseline system to extract the French translations of the English antecedents. They then use the gender of the French word to annotate the English pronoun before translating the annotated test text using the system trained on annotated training data. This approach treats the annotation of pronouns as a separate task which is performed outside of the translation process. The authors report little change in the BLEU score of their system over the baseline and instead resort to manually counting the number of correctly translated pronouns. Whilst they attribute the lack of improvement of their system to the poor quality of their coreference resolution system, they claim that the process works well when the coreference resolution system provides accurate results.

The approach taken by Hardmeier and Federico (2010) differs in that it provides a single-step process whereby the identification of a pronoun's antecedent in the source language and the extraction of its target language translation's morphological properties are integrated in the translation process as an additional model in their SMT system.
This additional model maintains a mapping of each source language pronoun and the number and gender of its antecedent. Translation is achieved by first processing the source language test text using a coreference resolution system to identify coreferential pronouns and their antecedents. The output of the coreference resolution system is used as input to a decoder driver module which runs a number of Moses decoder processes in parallel. The decoder driver then feeds individual sentences to the decoder processes using a priority queue to order sentences according to how many pronoun antecedents they contain. Thus sentences that contain a greater number of antecedents are translated first, ensuring a high throughput of the system. The authors report no significant improvement in BLEU score between their system and the baseline, but they do report a small but significant improvement in pronoun translation recall against a single reference translation. The approach used in this project is similar to that taken by Le Nagard and Koehn (2010). Whilst their project required the use of a coreference resolution system to build coreference chains, the provision of a source language corpus with manually annotated coreference information allowed this project to focus on the translation problem. This project also accommodates a wider range of English pronouns than the study by Le Nagard and Koehn (2010), which
only considered the translation of "it" and "they".

1.3.2 English-Czech Machine Translation

Much of the recent work in English-Czech SMT has been conducted at the Institute of Formal and Applied Linguistics at Charles University, Prague. Research has been conducted in many areas including the development of parallel corpora suitable for the development of Machine Translation systems such as the PCEDT 2.0 corpus used in this project and its predecessor, the PCEDT 1.0 corpus (Čmejrek et al., 2004). Another area of research has concentrated on the development of both phrase-based and dependency-based SMT systems. In a comparative study of phrase-based and dependency-based SMT systems Bojar and Hajič (2008) concluded that their best phrase-based system outperformed the experimental dependency-based system, but work continues in both directions. The decision to focus on phrase-based SMT in this project is due to its simplicity, which, given the relatively short time-scale, is an important factor. That phrase-based systems currently outperform dependency-based systems in English-Czech SMT is an added bonus.

1.4 Example of Poor Pronoun Translation

As an example of poor pronoun translation, consider the following English sentence from the Wall Street Journal corpus and its translation (by a Machine Translation system) in Czech:

he said mexico could be one of the next countries to be removed from the priority list because of its efforts to craft a new patent law.

řekl, že mexiko by mohl být jeden z dalších zemí, aby byl odvolán z prioritou seznam, protože její snahy podpořit nové patentový zákon.

In this example, the English pronoun "its", which refers to "mexico", is translated in Czech as "její" (feminine, singular) and "mexico" is translated as "mexiko" (neuter, singular). Here, the Czech translation of the pronoun and its antecedent disagree in gender.
A more correct translation of the pronoun would be jeho (neuter, singular possessive pronoun) or své (possessive pronoun) depending on the overall structure of the translated sentence.
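To make the link with the source-side annotation idea of section 1.3.1 concrete, the sketch below tags the pronoun in this example with the gender and number of its antecedent's Czech translation. It is a minimal illustration, not the system built in this project: the function name and the `its.neut.sg` label format are assumptions, loosely modelled on the it-masculine style labels of Le Nagard and Koehn (2010).

```python
def annotate_pronoun(tokens, pronoun_index, gender, number):
    """Return a copy of tokens with one pronoun tagged with target-side
    gender and number. The 'word.gender.number' format is hypothetical."""
    annotated = list(tokens)
    annotated[pronoun_index] = f"{tokens[pronoun_index]}.{gender}.{number}"
    return annotated


sentence = "because of its efforts to craft a new patent law".split()

# 'its' refers to 'mexico', translated as 'mexiko' (neuter, singular).
annotated = annotate_pronoun(sentence, sentence.index("its"), "neut", "sg")
print(" ".join(annotated))
```

A translation model trained on text annotated in this way can, in principle, learn separate translations for each tagged form, which is what gives the decoder a chance to avoid a feminine form such as "její" for a neuter antecedent.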
1.5 Hypothesis and Contributions

The work of Hardmeier and Federico (2010) focussed on English to German translation whilst Le Nagard and Koehn (2010) focussed on English to French translation. This project considers the translation of pronouns in English to Czech SMT and builds on the work of Le Nagard and Koehn (2010) and Hardmeier and Federico (2010). By factoring out the problems of automated coreference resolution, parsing, part of speech (POS) tagging and morphological tagging, this project attempts to assess how well an approach to explicitly annotating pronouns in the source language could work when applied to English-Czech SMT if conditions were assumed to be perfect. Where French (a Romance language) and German (a Germanic language) share a similar root to English, the differences between English and Czech are even greater. Therefore, not only does this project assess the suitability of a pronoun annotation approach in improving the translation of pronouns into another language, but into a language that is very different from English. It is believed that this project is the first attempt made to explicitly handle the problem of pronoun translation in Czech SMT.

This project makes three major contributions:

1. A prototype system for the annotation and translation of pronouns in English-Czech SMT.
2. Automated and manual evaluations of the output of the system as compared against a baseline.
3. An annotated aligned parallel corpus which could be used in future investigations into pronoun translation in English-Czech SMT.

1.6 Chapter Summary

This chapter introduced the specific problem of pronoun translation in SMT, discussed previous work in relation to anaphora resolution, pronoun-focussed Machine Translation and English-Czech SMT and outlined the hypothesis on which this work is based.
The next chapter will describe in detail many of the concepts that are essential to the understanding of the problem as well as the approach taken in the development of the annotation and translation system and its evaluation.
Chapter 2 Concepts

2.1 Anaphora and Coreference

Anaphora is a discourse level phenomenon in which the interpretation of one expression is dependent on another previously mentioned expression, also known as the antecedent. For example, in the passage below, the word "He" at the start of the second sentence refers to "J.P. Bolduc" at the start of the first sentence. In order to understand the meaning of the second sentence, the reader must first identify the referent of the pronoun "He" (which in this example is "J.P. Bolduc").

J.P. Bolduc, vice chairman of W.R. Grace & Co., which holds a 83.4% interest in this energy-services company, was elected a director. He succeeds Terrence D. Daniels, formerly a W.R. Grace vice chairman, who resigned. (footnote 1)

Where anaphora is concerned with referring to a previously mentioned expression in the discourse, coreference is the act of referring to the same referent (Mitkov et al., 2000), such that multiple expressions that refer to the same expression are said to be coreferential. Coreferential chains may be established in order to link multiple referring expressions to the same antecedent expression.

This project focuses on the translation of already resolved instances of nominal anaphora, in which a referring expression (a pronoun, definite Noun Phrase (NP) or proper name) has a non-pronominal NP as its antecedent (Mitkov et al., 2000). The project makes use of manually annotated corpora from which instances of coreferential (and anaphoric) pronouns and their antecedents are identified, in order to annotate training data with which to train an SMT system.

Footnote 1: Example taken from the Wall Street Journal corpus.
2.2 Coreference Resolution

Coreference Resolution is the process of identifying the referent to which a referring expression refers. In this project, the pronouns are the referring expressions and the antecedents are the referents. As discussed in chapter 1, there has been much research into the development of automated methods to provide coreference and anaphora resolution. Such automated methods were used by both Le Nagard and Koehn (2010) and Hardmeier and Federico (2010), but it is well documented that these methods do not achieve perfect accuracy. Indeed, Le Nagard and Koehn (2010) cite the poor performance of their coreference resolution as a possible reason for their lack of improvement in pronoun translation. In this project, a manually annotated coreference corpus (the BBN Pronoun Coreference and Entity Type corpus) is used to identify coreferential pronouns and their antecedents. As the corpus has been manually annotated, the coreference annotation is assumed to be highly accurate.

2.3 Czech Language

Czech is a member of the western group of Slavic languages. Like other Slavic languages it is highly inflective, with seven cases and four grammatical genders: masculine animate (for people and animals), masculine inanimate (for inanimate objects), feminine and neuter. In the case of the feminine and neuter genders, animacy is not grammatically marked. Czech is a free word-order language, in which word order reflects the information structure of the sentence within the current discourse. In addition, Czech is a pro-drop language; an explicit subject pronoun may be omitted if it can be inferred from some other grammatical feature, for example verb morphology (footnote 2). In contrast with Czech, English is neither a highly inflectional nor a pro-drop language. Furthermore, English follows a Subject-Verb-Object (SVO) pattern for word order and lacks grammatical gender.
Footnote 2: Information provided by The Czech Language, an online guide.

2.4 Phrase-based Statistical Machine Translation

Phrase-based models are currently the best performing SMT models (Koehn, 2009). The concept behind these models is the decomposition of the translation problem into a number of smaller word sequences, called phrases, which are translated one at a time in order to build the complete translation. It is important to note that a phrase may be any sequence of words
of arbitrary length and that there is no deep linguistic motivation behind the choice of segmentation. Phrase-based models have several advantages over word-based models in which words are translated in isolation. Firstly, phrase-based models provide a simple solution to the problem where a single word in the source language translates into multiple words in the target language or vice versa. Secondly, translating phrases rather than single words can help to resolve translation ambiguities. Finally, with phrase-based models, the notions of insertion and deletion that are present in word-based models are no longer necessary, leading to a model that is conceptually simpler.

The three components that make up a phrase-based model are the translation model, language model and reordering model.

The translation model takes the form of a phrase translation table which provides a mapping between the source and target language phrases and the probabilities associated with each mapping. The phrase translation table is learned by creating word alignments between the aligned sentence pairs of a parallel training corpus. The word alignments are collected for both translation directions, the alignment points are merged and then those phrases that are consistent with the word alignment are extracted. The probabilities that are assigned to each phrase mapping in the table are calculated by counting the number of (parallel) sentence pairs a particular phrase pair appears in, and then computing the relative frequency of this count compared with the count of the source phrase translating as any other phrase in the target language.

The language model ensures the fluency of the translations output by the model, providing a means to score and hence identify the best output translation from a list of candidate translations.
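The relative-frequency estimate used for the phrase translation table can be illustrated with a short sketch. It assumes the phrase pairs have already been extracted from the word-aligned corpus, with one list entry per occurrence; the function name and the toy phrase pairs are hypothetical, and the sketch does not reproduce the extraction step itself.

```python
from collections import Counter


def phrase_translation_probs(phrase_pairs):
    """Estimate phi(target | source) by relative frequency.

    phrase_pairs: list of (source_phrase, target_phrase) tuples,
    one entry per extracted occurrence (assumed input format).
    """
    pair_counts = Counter(phrase_pairs)
    source_counts = Counter(source for source, _target in phrase_pairs)
    return {
        (source, target): count / source_counts[source]
        for (source, target), count in pair_counts.items()
    }


# Toy data: 'it' was seen aligned to 'ho' twice and to 'ji' once.
pairs = [("it", "ho"), ("it", "ji"), ("it", "ho"), ("the dog", "pes")]
probs = phrase_translation_probs(pairs)
for pair, prob in sorted(probs.items()):
    print(pair, round(prob, 3))
```

Because the counts are taken over the training corpus as a whole, an unannotated pronoun such as "it" ends up with its probability mass spread over all the genders it was seen translated into, regardless of the antecedent in any particular sentence.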
The language models used in SMT are typically n-gram language models which consist of n-grams in the target language together with probabilities based on maximum likelihood estimation. A language model is usually constructed from the target side of the parallel corpus used in the training of the translation model, and may be augmented by additional in-domain target data, or weighted with a separate out-of-domain language model. Smoothing is often applied to improve the reliability of the probability estimates, with modified Kneser-Ney smoothing commonly used in SMT (Kneser and Ney, 1995).

The reordering model allows phrases in the source language to be taken out of sequence when building the translation in the target language, thereby allowing phrase-level reordering. Allowing unlimited reordering can have a detrimental effect on translation quality, and so it is usual for a penalty to be associated with any reordering that takes place. Penalties are assigned such that a larger cost is associated with the movement of a phrase that skips more word positions than with one that skips fewer word positions.

In phrase-based SMT, these three models are combined as a linear model. The best translation $\arg\max_{c} p(c \mid e)$ is computed using Bayes' rule, which combines the three components of the phrase-based model as in the equation below: the translation model $\phi(e \mid c)$, the language model $P_{LM}$ and the reordering model $\Omega(e \mid c)$.

\[ \arg\max_{c} \, p(c \mid e) = \arg\max_{c} \, \phi(e \mid c) \, P_{LM} \, \Omega(e \mid c) \]

where $e$ is an English sentence and $c$ is the Czech translation of that sentence.

Once the components of the phrase-based model have been constructed, their weights are tuned to optimise the overall model performance. Tuning is carried out using a dataset that is kept separate from the main training dataset for this specific purpose. Minimum Error Rate Training (MERT) (Och, 2003) is a commonly used tuning technique in SMT. MERT tunes the model weights to optimise performance as measured using BLEU scores calculated against one or more reference translations. BLEU will be described in more detail in section 2.6.

In Machine Translation, the process of finding the best scoring translation according to the model is referred to as decoding (Koehn, 2009). Using a phrase-based translation model, decoding is carried out by starting with a source sentence and building the translation from left to right, extracting source phrases in any order. The phrases are translated into the target language and then stitched together to make a complete translation. The source words covered by each phrase are then marked as translated and the process continues until all of the source words have been covered. As there are many possible valid translations of a single source language sentence, these variations must be captured. This is achieved using a search graph from which the single best translation (or an N-best list) may be derived using a scoring method that uses a language model and the phrase table probabilities.

2.5 Moses

Moses (Hoang et al., 2007) is an open source SMT toolkit that provides automated training of translation models and may be used with any language pair, given a parallel training corpus.
Moses may be used to construct both tree-based and phrase-based translation models, but for the purpose of this project only the phrase-based training was required. The automated training process produces a phrase translation table and a lexicalised reordering model. The language model is created separately using the target side of the parallel corpus together with additional in-domain corpus data as required. The training process consists of a number of steps which include data preparation, the creation of word alignments using Giza++ (Och and Ney, 2003), extraction and scoring of phrases and building the generation and lexicalised reordering models (footnote 3). The generation model contains probabilities for both directions of translation.

During testing, in which a sentence or collection of sentences from the test corpus (which are not also included in the training corpus) are translated, the Moses decoder constructs a search graph and uses a beam search algorithm to select the translation with the highest probability from that graph. The search graph is constructed using the process of hypothesis expansion. Hypothesis combination and pruning are then employed to reduce the search space. In the Moses implementation of beam search, hypotheses that cover the same number of foreign words are compared and those with high cost (low probability) are pruned. The cost of each hypothesis is calculated using a combination of the cost of translation and the estimated future cost of translating the remaining source text for the current sentence. Whilst the decoder may be used to output an N-best list of translations for an input sentence, in this project only the best translation is required and therefore only a single translation is requested from the decoder.

2.6 Evaluation in Machine Translation

Evaluation in Machine Translation typically falls into one of two categories: manual or automated. Whilst automated methods are used to ascertain improvements during the development of a Machine Translation system, manual methods using either monolingual or bilingual human judges are typically used to provide the final evaluation.

Currently there are no standard automated metrics available for the evaluation of pronoun translation in SMT. Hardmeier and Federico (2010) developed their own bespoke automated metric incorporating precision and recall measured against a single reference translation. In contrast, Le Nagard and Koehn (2010) relied on manually counting the number of correctly translated pronouns in their system output.
Manual evaluation of the results is slow and therefore not a practical solution for large volumes of text. Furthermore, for a monolingual SMT system developer, manual evaluation must be outsourced to a third party, adding an additional hindrance to the development process.

In this project, the Czech translations output by the phrase-based SMT system were evaluated using a combination of manual and automated methods. The manual methods used focussed on human judgements as to whether pronouns in the Machine Translation output were correctly used or dropped and, if they were incorrectly used, whether a native Czech speaker would be able to understand the meaning of the sentence as a whole. BLEU, an automated metric widely used in the evaluation of SMT systems, was used during system development as a preliminary check to confirm that the system output was valid Czech, before a more detailed automated analysis of the results was conducted. The evaluation methods used in this project are discussed in more detail in a later chapter.

3 A full description of the Moses translation system training process can be found at:

Automated Evaluation

BLEU (Papineni et al., 2002) is an automated evaluation metric widely used in SMT to assess the overall quality of the output translations. It provides an efficient and low cost alternative to human judgements during iterations of development cycles to measure system improvement. It computes a document-level score of the translated output against a single reference translation or a set of reference translations (Koehn, 2009). The BLEU score is based on a combination of n-gram precision and a brevity penalty.

\[ \mathrm{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right) \]

The n-gram precision ($p_n$) is the ratio of the number of n-grams of order n in the output translation that are present in the reference translation to the total number of n-grams of order n in the output translation, and $w_n$ are positive weights that sum to one. The brevity penalty ($BP$) ensures that the length of the output translation is not too short, as compared with the length of the reference translation. The effect of the brevity penalty is that the BLEU score is reduced if the output translation is shorter than the reference translation, i.e. where words are dropped in the output translation. The BLEU score is applied at the document level in order to allow some freedom in translation output length at the sentence level, for example where a single source sentence may be translated into two sentences in the target language, or vice versa.

BLEU has been widely criticised (Koehn, 2009), yet remains one of the most popular automated evaluation metrics in use with SMT systems due to its high correlation with human judgements of quality (Papineni et al., 2002). With respect to the specific problem of pronoun translation evaluation in Czech, two further criticisms apply.
Firstly, as the sole focus of this project is pronoun translation, only a small number of words are expected to change between the translations produced by the baseline and annotated translation systems. Therefore, the variation in BLEU score is expected to be very small. Observations regarding the shortcomings of BLEU in relation to the evaluation of pronoun translation have been made previously by both Le Nagard and Koehn (2010) and Hardmeier and Federico (2010). Secondly, Czech is a highly inflective language with four genders and seven cases, so with only a single reference translation provided in the PCEDT 2.0 corpus it is not reasonable to evaluate the output of the translation systems using a recall-based method. Bojar and Kos (2010) are critical of the use of BLEU scores in the evaluation of English-Czech SMT, claiming that BLEU scores correlate poorly with human judgements. It is for these reasons that BLEU was not used in the evaluation of the systems developed as part of this project.

Manual Evaluation

The manual evaluation of Machine Translation output can be rather complex. Human judges are typically required to rate a single target language text using a five point scale or to rank several target language texts based on fluency (whether the text is fluent), and adequacy (whether the meaning of the source language text has been captured) (Koehn, 2009).

Evaluation based on fluency and adequacy judgements suffers from a number of problems. Firstly, it can be slow and unreliable (Callison-Burch et al., 2008). Secondly, the scores assigned by human judges in the measurement of fluency and adequacy are often very close, suggesting that the judges may find it difficult to make a clear distinction between the two criteria. Thirdly, there are concerns that without explicit instructions, many human judges develop their own rules or misinterpret the intended use of an absolute scale and instead score the output of multiple systems relative to one another (Callison-Burch et al., 2007). Finally, manual evaluation using such criteria tends to be subjective, which can lead to poor agreement between a group of human judges. Again, these manual methods tend to focus on sentences as a whole and are therefore not wholly applicable to the more specific problem of evaluating pronoun translation.

2.7 Chapter Summary

This chapter introduced the concepts of anaphora and coreference resolution and provided an introduction to phrase-based SMT, the Moses toolkit and the methods currently used in the evaluation of Machine Translation output.
In particular, the various issues associated with automated and manual evaluation methods were highlighted with respect to their application to the more specific problem of evaluating pronoun translation. The next chapter will introduce the manually annotated corpora used in this project.
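As a concrete illustration of the BLEU definition given in section 2.6, the following minimal Python sketch computes a document-level score against a single reference translation. It is a sketch under stated assumptions and not the implementation used in this project or in Moses: the function names are hypothetical, the weights are assumed to be uniform ($w_n = 1/N$ with $N = 4$), matching n-gram counts are clipped by the reference counts as in Papineni et al. (2002), and the brevity penalty is taken to be $\exp(1 - r/c)$ when the output length $c$ does not exceed the reference length $r$.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate_sents, reference_sents, max_n=4):
    """Document-level BLEU against one reference per sentence.

    Both arguments are lists of tokenised sentences (lists of strings).
    Uniform weights 1/max_n are an assumption of this sketch.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    cand_len = ref_len = 0

    # Accumulate n-gram statistics over the whole document.
    for cand, ref in zip(candidate_sents, reference_sents):
        cand_len += len(cand)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            cand_counts = ngrams(cand, n)
            ref_counts = ngrams(ref, n)
            # Clip each n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
            totals[n - 1] += sum(cand_counts.values())

    # Any zero precision makes the geometric mean zero.
    if cand_len == 0 or min(matches) == 0:
        return 0.0

    log_precision = sum((1.0 / max_n) * math.log(m / t) for m, t in zip(matches, totals))
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1.0 - ref_len / cand_len)
    return brevity_penalty * math.exp(log_precision)
```

When the output is a word-for-word prefix of a longer reference, every n-gram precision is one and the score is determined entirely by the brevity penalty, which shows how dropped words lower the score.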
Chapter 3 Data

In the development of the annotation and translation process a number of manually annotated corpora in both English and Czech are used: the BBN Pronoun Coreference and Entity Type corpus for the English (source) side of the parallel corpus and the identification of coreferential pronouns and their antecedents, and the PCEDT 2.0 corpus for the Czech (target) side of the parallel corpus. Each corpus contains text or a translation of the original text taken from a subset of the Wall Street Journal (WSJ). It is the provision of these manually annotated corpora that allowed the project to focus solely on the translation problem without the need for automated methods for coreference or anaphora resolution.

In addition, the annotation of the WSJ files within the Penn Treebank 3.0 corpus is used to identify a single antecedent head word in the case where the antecedent extracted from the BBN Pronoun Coreference and Entity Type corpus spans multiple words. This is particularly important as in order to extract the number and gender of a Czech word it is necessary to first identify the head of the English antecedent. The corpora are described in detail in the following sections.

3.1 BBN Pronoun Coreference and Entity Type Corpus

The BBN Pronoun Coreference and Entity Type corpus (Weischedel and Brunstein, 2005) provides annotations of the WSJ file texts with pronoun coreference and entity types together with the raw English text. For the purpose of this project, two files from the corpus are used: the WSJ.sent file that contains the raw English sentences and the WSJ.pron pronoun coreference file that contains a list of coreferential pronouns together with their antecedents. In the pronoun coreference file, coreferential pronouns and their antecedents are indexed using sentence and word token numbers.
The WSJ.sent file has the format:

    (WSJ0005
    S1: J.P. Bolduc, vice chairman of W.R. Grace & Co., which...
    S2: He succeeds Terrence D. Daniels, formerly a W.R. Grace...
    S3: W.R. Grace holds three of Grace Energy's seven board seats.
    )

For each file in the corpus collection, the sentences are numbered and listed in the order in which they appear in the text.

The WSJ.pron file has the format:

    (WSJ0005
    (
    Antecedent -> S1:1-2 -> J.P. Bolduc
    Pronoun -> S2:1-1 -> He
    )

For each WSJ file in the collection, each antecedent and the pronouns that refer to it are listed, together with the number of the sentence in which they appear and the start and end positions of the word(s) within the sentence.

It was initially envisaged that the OntoNotes 3.0 corpus (Weischedel et al., 2009) would be used to identify coreferential pronouns and their antecedents. However, the annotation in the BBN Pronoun Coreference and Entity Type corpus allows for a simpler method of identification and extraction than the OntoNotes 3.0 corpus. The OntoNotes 3.0 corpus is therefore left as an alternative source of coreference information. Due to differences in the choice of which types of coreference are annotated in these corpora, the use of the OntoNotes 3.0 corpus as an alternative or additional source of coreference information would allow for an investigation into the translation of "it", "this" and "that" marked as event coreference.

3.2 Penn Treebank 3.0 Corpus

The Penn Treebank 3.0 corpus contains manually annotated parse trees of the sentences within the WSJ corpus. The merged files within the corpus contain both parse and part of speech annotation and as such may be used to identify Noun Phrases (NPs) and, through the use of simple rules, the head of an NP. The corpus contains separate merged files for each WSJ file. Within each file, a parse is provided for each sentence, with part of speech tags provided for each word or token.
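The WSJ.pron entry format described in section 3.1 lends itself to simple pattern matching. The following Python sketch shows one possible way of reading the antecedent and pronoun entries; it assumes that each entry sits on its own line, which is a guess about the file layout based on the example above. The names `Mention`, `ENTRY` and `parse_mentions` are hypothetical and do not come from the corpus documentation or from this project's code.

```python
import re
from typing import NamedTuple


class Mention(NamedTuple):
    kind: str      # "Antecedent" or "Pronoun"
    sentence: int  # sentence number, e.g. 1 for S1
    start: int     # position of the first word within the sentence
    end: int       # position of the last word within the sentence
    text: str      # surface string of the mention


# Matches lines such as "Antecedent -> S1:1-2 -> J.P. Bolduc".
ENTRY = re.compile(
    r"^\s*(Antecedent|Pronoun)\s*->\s*S(\d+):(\d+)-(\d+)\s*->\s*(.+?)\s*$"
)


def parse_mentions(lines):
    """Return the mentions found in an iterable of WSJ.pron-style lines."""
    mentions = []
    for line in lines:
        match = ENTRY.match(line)
        if match:  # Lines such as "(WSJ0005", "(" and ")" are skipped.
            kind, sent, start, end, text = match.groups()
            mentions.append(Mention(kind, int(sent), int(start), int(end), text))
    return mentions


sample = """(WSJ0005
(
Antecedent -> S1:1-2 -> J.P. Bolduc
Pronoun -> S2:1-1 -> He
)""".splitlines()

mentions = parse_mentions(sample)
```

Keeping the sentence number and word positions alongside the surface text means that a multi-word antecedent such as "J.P. Bolduc" can later be located in the corresponding Penn Treebank parse in order to identify its head word.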
More informationAn Iteratively-Trained Segmentation-Free Phrase Translation Model for Statistical Machine Translation
An Iteratively-Trained Segmentation-Free Phrase Translation Model for Statistical Machine Translation Robert C. Moore Chris Quirk Microsoft Research Redmond, WA 98052, USA {bobmoore,chrisq}@microsoft.com
More informationREPORT ON THE WORKBENCH FOR DEVELOPERS
REPORT ON THE WORKBENCH FOR DEVELOPERS for developers DELIVERABLE D3.2 VERSION 1.3 2015 JUNE 15 QTLeap Machine translation is a computational procedure that seeks to provide the translation of utterances
More informationPolish - English Statistical Machine Translation of Medical Texts.
Polish - English Statistical Machine Translation of Medical Texts. Krzysztof Wołk, Krzysztof Marasek Department of Multimedia Polish Japanese Institute of Information Technology kwolk@pjwstk.edu.pl Abstract.
More informationTRANSREAD LIVRABLE 3.1 QUALITY CONTROL IN HUMAN TRANSLATIONS: USE CASES AND SPECIFICATIONS. Projet ANR 201 2 CORD 01 5
Projet ANR 201 2 CORD 01 5 TRANSREAD Lecture et interaction bilingues enrichies par les données d'alignement LIVRABLE 3.1 QUALITY CONTROL IN HUMAN TRANSLATIONS: USE CASES AND SPECIFICATIONS Avril 201 4
More informationCINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation.
CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation. Miguel Ruiz, Anne Diekema, Páraic Sheridan MNIS-TextWise Labs Dey Centennial Plaza 401 South Salina Street Syracuse, NY 13202 Abstract:
More informationExaminer s report F8 Audit & Assurance September 2015
Examiner s report F8 Audit & Assurance September 2015 General Comments There were two sections to the examination paper and all the questions were compulsory. Section A consisted of 12 multiple-choice
More informationIMPLEMENTATION NOTE. Validating Risk Rating Systems at IRB Institutions
IMPLEMENTATION NOTE Subject: Category: Capital No: A-1 Date: January 2006 I. Introduction The term rating system comprises all of the methods, processes, controls, data collection and IT systems that support
More informationSearch and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov
Search and Data Mining: Techniques Text Mining Anya Yarygina Boris Novikov Introduction Generally used to denote any system that analyzes large quantities of natural language text and detects lexical or
More informationThe European Financial Reporting Advisory Group (EFRAG) and the Autorité des Normes Comptables (ANC) jointly publish on their websites for
The European Financial Reporting Advisory Group (EFRAG) and the Autorité des Normes Comptables (ANC) jointly publish on their websites for information purpose a Research Paper on the proposed new Definition
More informationTuning Methods in Statistical Machine Translation
A thesis submitted in partial fulfilment for the degree of Master of Science in the science of Artificial Intelligence Tuning Methods in Statistical Machine Translation Author: Anne Gerard Schuth aschuth@science.uva.nl
More informationIMPLEMENTING BUSINESS CONTINUITY MANAGEMENT IN A DISTRIBUTED ORGANISATION: A CASE STUDY
IMPLEMENTING BUSINESS CONTINUITY MANAGEMENT IN A DISTRIBUTED ORGANISATION: A CASE STUDY AUTHORS: Patrick Roberts (left) and Mike Stephens (right). Patrick Roberts: Following early experience in the British
More information7-2 Speech-to-Speech Translation System Field Experiments in All Over Japan
7-2 Speech-to-Speech Translation System Field Experiments in All Over Japan We explain field experiments conducted during the 2009 fiscal year in five areas of Japan. We also show the experiments of evaluation
More informationLeveraging ASEAN Economic Community through Language Translation Services
Leveraging ASEAN Economic Community through Language Translation Services Hammam Riza Center for Information and Communication Technology Agency for the Assessment and Application of Technology (BPPT)
More informationRubrics for Assessing Student Writing, Listening, and Speaking High School
Rubrics for Assessing Student Writing, Listening, and Speaking High School Copyright by the McGraw-Hill Companies, Inc. All rights reserved. Permission is granted to reproduce the material contained herein
More informationText Analytics Illustrated with a Simple Data Set
CSC 594 Text Mining More on SAS Enterprise Miner Text Analytics Illustrated with a Simple Data Set This demonstration illustrates some text analytic results using a simple data set that is designed to
More informationBCS HIGHER EDUCATION QUALIFICATIONS Level 6 Professional Graduate Diploma in IT. March 2013 EXAMINERS REPORT. Knowledge Based Systems
BCS HIGHER EDUCATION QUALIFICATIONS Level 6 Professional Graduate Diploma in IT March 2013 EXAMINERS REPORT Knowledge Based Systems Overall Comments Compared to last year, the pass rate is significantly
More informationBuilding a Web-based parallel corpus and filtering out machinetranslated
Building a Web-based parallel corpus and filtering out machinetranslated text Alexandra Antonova, Alexey Misyurev Yandex 16, Leo Tolstoy St., Moscow, Russia {antonova, misyurev}@yandex-team.ru Abstract
More informationModeling coherence in ESOL learner texts
University of Cambridge Computer Lab Building Educational Applications NAACL 2012 Outline 1 2 3 4 The Task: Automated Text Scoring (ATS) ATS systems Discourse coherence & cohesion The Task: Automated Text
More informationENHANCING INTELLIGENCE SUCCESS: DATA CHARACTERIZATION Francine Forney, Senior Management Consultant, Fuel Consulting, LLC May 2013
ENHANCING INTELLIGENCE SUCCESS: DATA CHARACTERIZATION, Fuel Consulting, LLC May 2013 DATA AND ANALYSIS INTERACTION Understanding the content, accuracy, source, and completeness of data is critical to the
More informationThe multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2
2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of
More informationFacebook Friend Suggestion Eytan Daniyalzade and Tim Lipus
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information
More informationWRITING A CRITICAL ARTICLE REVIEW
WRITING A CRITICAL ARTICLE REVIEW A critical article review briefly describes the content of an article and, more importantly, provides an in-depth analysis and evaluation of its ideas and purpose. The
More informationMaking Sense of the Mayhem: Machine Learning and March Madness
Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research
More informationBuilding a Question Classifier for a TREC-Style Question Answering System
Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given
More informationSemantic Class Induction and Coreference Resolution
Semantic Class Induction and Coreference Resolution Vincent Ng Human Language Technology Research Institute University of Texas at Dallas Richardson, TX 75083-0688 vince@hlt.utdallas.edu Abstract This
More informationLinguistic Universals
Armin W. Buch 1 2012/11/28 1 Relying heavily on material by Gerhard Jäger and David Erschler Linguistic Properties shared by all languages Trivial: all languages have consonants and vowels More interesting:
More informationOverview of MT techniques. Malek Boualem (FT)
Overview of MT techniques Malek Boualem (FT) This section presents an standard overview of general aspects related to machine translation with a description of different techniques: bilingual, transfer,
More informationThe Open University s repository of research publications and other research outputs
Open Research Online The Open University s repository of research publications and other research outputs Using LibQUAL+ R to Identify Commonalities in Customer Satisfaction: The Secret to Success? Journal
More informationQualitative Critique: Missed Nursing Care. Kalisch, B. (2006). Missed Nursing Care A Qualitative Study. J Nurs Care Qual, 21(4), 306-313.
Qualitative Critique: Missed Nursing Care 1 Qualitative Critique: Missed Nursing Care Kalisch, B. (2006). Missed Nursing Care A Qualitative Study. J Nurs Care Qual, 21(4), 306-313. Gina Gessner RN BSN
More informationThe Transition of Phrase based to Factored based Translation for Tamil language in SMT Systems
The Transition of Phrase based to Factored based Translation for Tamil language in SMT Systems Dr. Ananthi Sheshasaayee 1, Angela Deepa. V.R 2 1 Research Supervisior, Department of Computer Science & Application,
More informationSample Size and Power in Clinical Trials
Sample Size and Power in Clinical Trials Version 1.0 May 011 1. Power of a Test. Factors affecting Power 3. Required Sample Size RELATED ISSUES 1. Effect Size. Test Statistics 3. Variation 4. Significance
More informationLog-Linear Models. Michael Collins
Log-Linear Models Michael Collins 1 Introduction This note describes log-linear models, which are very widely used in natural language processing. A key advantage of log-linear models is their flexibility:
More informationApplied Mathematical Sciences, Vol. 7, 2013, no. 112, 5591-5597 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2013.
Applied Mathematical Sciences, Vol. 7, 2013, no. 112, 5591-5597 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2013.38457 Accuracy Rate of Predictive Models in Credit Screening Anirut Suebsing
More informationPublishers Note. Anson Reed Limited 145-157 St John Street London EC1V 4PY United Kingdom. Anson Reed Limited and InterviewGold.
Publishers Note Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act, this publication may only be
More informationAutomatic Text Analysis Using Drupal
Automatic Text Analysis Using Drupal By Herman Chai Computer Engineering California Polytechnic State University, San Luis Obispo Advised by Dr. Foaad Khosmood June 14, 2013 Abstract Natural language processing
More information