Hybrid Approaches in Machine Translation

WDS'08 Proceedings of Contributed Papers, Part I, MATFYZPRESS

M. Týnovský
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Prague, Czech Republic.

Abstract. In this paper we briefly summarize the main paradigms of machine translation. We describe the principles of the successful phrase-based statistical approach and show some recent improvements that hybridize it in both the example-based and the rule-based direction. We also describe the experiments on English-to-Czech translation carried out at our department, comment on their weak points, and suggest possible improvements.

Introduction

Machine translation typology

Historically, there are three main paradigms for solving the task of machine translation:

- rule-based machine translation (RBMT)
- example-based machine translation (EBMT)
- statistical machine translation (SMT)

Rule-based systems use rules discovered and formulated by linguists that say how to transfer words, sequences of words, or other structures from the source language to the target language. The latter two approaches use translation rules automatically extracted from parallel corpora (and are therefore often referred to as data-driven approaches).

Given a translation system, it may be hard to decide which of these categories it actually belongs to. Imagine a rule-based system that transfers syntactic trees of source sentences into syntactic trees of target sentences and uses a statistical parser to obtain the source trees. The transfer rules are written by linguists, but the system as a whole is not free of statistics. Similarly, SMT and EBMT systems often use indirect knowledge explicitly defined by linguists.
Therefore, we can theoretically distinguish six hybrid directions:

- SMT affected by EBMT
- SMT affected by RBMT
- EBMT affected by SMT
- EBMT affected by RBMT
- RBMT affected by SMT
- RBMT affected by EBMT

In this paper we describe some systems of the first two categories. First, we mention the basic SMT principles (noisy channel model, IBM models), paying a little more attention to the recently successful phrase-based models. Then we give two examples of approaches that push the phrase-based models towards hybridity, either by using more general phrase concepts (EBMT direction) or by using morphological or syntactic information (RBMT direction). Finally, we compare the previously described approaches with an approach that translates from English into Czech using treelets. At the end we discuss its weak points and give an account of current work on enhancing the system by using Martin Čmejrek's implementation of a tree-to-tree transducer for the alignment of treelets.

Combining the outputs of several translation systems is often called hybridization too. In this paper we do not describe this kind of hybrid translation system; instead, we give examples of methods lying in between the main machine translation paradigms.

Phrase-based machine translation

In the early 1990s the first statistical methods of machine translation were proposed. They were based on the idea of noisy channel decoding: a French sentence is seen as an English sentence that was garbled into French. The task is to find the most probable original sentence for which the noisy channel returns the input French sentence:

$$\hat{e} = \arg\max_{e} p(e \mid f)$$

This formula is often used in a slightly changed version following Bayes' rule:

$$\hat{e} = \arg\max_{e} p(f \mid e)\, p(e)$$

This variant allows us to divide the task into two models: the translation model $p(f \mid e)$ for translating from French to English, which ensures translation accuracy, and the language model $p(e)$ of English, which ensures the fluency of the English output.

In (Brown et al., 1993) a series of five models (IBM models) was proposed. They are based on this noisy channel principle and use word alignment as a hidden variable. They did not produce very good translations, but they can be, and are, successfully used for extracting word alignments.

Since then, two fundamental improvements have been presented in the field of SMT. First, the more general log-linear translation model was proposed. It uses the following formula to find the most probable English translation (Och and Ney, 2002):

$$\hat{e} = \arg\max_{e} \sum_{m} \lambda_m h_m(e, f)$$

where $h_m$ are feature functions scoring various variables that describe the translation quality and $\lambda_m$ are their weights. A special case of this formula, with $h_1 = \log p(f \mid e)$ and $h_2 = \log p(e)$ and equal weights, matches the noisy channel model.

Second, phrase translation is used instead of word-based translation. A phrase is defined simply as any sequence of consecutive words.
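The log-linear decision rule described above can be illustrated with a small sketch. It is only a toy under stated assumptions: the candidate translations, the probabilities, and the names `loglinear_score` and `best_translation` are invented for illustration, and a real decoder searches a far larger hypothesis space than a fixed candidate list.

```python
import math


def loglinear_score(features, weights):
    """Weighted sum of feature values: sum over m of lambda_m * h_m(e, f)."""
    return sum(weights[name] * value for name, value in features.items())


def best_translation(candidates, weights):
    """Return the candidate with the highest log-linear score (the argmax over e)."""
    return max(candidates, key=lambda c: loglinear_score(c["features"], weights))


# Hypothetical candidates with invented probabilities.
# "tm" plays the role of h_1 = log p(f|e), "lm" the role of h_2 = log p(e).
candidates = [
    {"text": "the house is small",
     "features": {"tm": math.log(0.20), "lm": math.log(0.010)}},
    {"text": "small is the house",
     "features": {"tm": math.log(0.25), "lm": math.log(0.001)}},
]

# Equal weights reproduce the noisy channel model p(f|e) * p(e).
noisy_channel = {"tm": 1.0, "lm": 1.0}
best = best_translation(candidates, noisy_channel)

# Switching the language model off changes the decision.
tm_only = best_translation(candidates, {"tm": 1.0, "lm": 0.0})
```

With equal weights the score of each candidate is the logarithm of the noisy channel product, so the fluent candidate wins; with the language model weight set to zero, the candidate with the higher translation model probability is chosen instead.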
In the first step, phrase pairs are extracted from an n-to-n word alignment according to the following rule: all English words aligned to any French word in the phrase must be included, and all French words aligned to any English word in the phrase must be included. An example of a word alignment and the extracted phrase pairs is shown in Figure 1.

Figure 1. An example of phrase extraction using word alignment. The black squares define the word alignment, the boundaries define the extracted phrases. The leftmost picture shows a valid phrase extraction; the other two show invalid extractions, as they are inconsistent with the word alignment. From (Koehn et al., 2003).

Once the phrase pairs have been collected, their probabilities are estimated by maximum likelihood and stored in a phrase table, which is later used for translation. The translation consists of two steps:

- extract the relevant phrases from the phrase table (those with a non-empty intersection with the input sentence to be translated)
- find the most probable coverage of the input sentence by these phrases in terms of phrase probabilities and the target language model (or other models included in the log-linear model).

Searching the whole hypothesis space (a hypothesis being a particular partial coverage and phrase reordering) is NP-complete, so a beam search is used in practice. As a consequence, the future costs of phrases must be estimated in order to avoid preferring coverages of the simple parts of a sentence over the more complex parts. In principle, however, phrase-based translation works as described above.

The main advantage of the phrase-based approach is that it can capture local reorderings of words, and in its factored variant (Koehn et al., 2007) it can handle, for example, local morphological relations. Its weak point is that it fails on more complex changes of word order and ignores relations between distant words. Hybrid approaches can partially solve these problems. In the following section we describe two examples of such hybrid systems.

Hybridity

The Hiero system

The first hybrid system improving on the standard phrase-based model that we mention is Hiero, described in (Chiang, 2005). It uses no explicit linguistic knowledge; it just generalizes the concept of a phrase, much as EBMT approaches generalize the concept of an example. It is based on the simple idea of reusing the benefit of phrases once more: if phrases are good at local reordering of words, they might also be good at reordering sub-phrases. This idea is realized in a more general definition of a phrase: a hierarchical phrase is any sequence of words and/or sub-phrases.
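One way to picture this definition is to store a hierarchical phrase pair as two token sequences in which indexed gaps stand for sub-phrases. The sketch below is an illustration only and not Hiero's implementation: the list representation, the function name `apply_rule`, and the sub-phrase fillers are assumptions made here. The rule itself is the example rule from (Chiang, 2005).

```python
def apply_rule(side, fillers):
    """Replace every integer gap in `side` with the tokens of its filler sub-phrase."""
    out = []
    for symbol in side:
        if isinstance(symbol, int):
            out.extend(fillers[symbol])  # a gap: splice in the sub-phrase
        else:
            out.append(symbol)           # a plain word: copy it
    return out


# One hierarchical phrase pair; integers are gaps, and equal integers on the
# two sides mark aligned sub-phrases.
rule_source = ["yu", 1, "you", 2]
rule_target = ["have", 2, "with", 1]

# Illustrative sub-phrase pairs that fill the gaps (assumed for this example).
source_fillers = {1: ["Bei", "Han"], 2: ["bangjiao"]}
target_fillers = {1: ["North", "Korea"], 2: ["diplomatic", "relations"]}

source = apply_rule(rule_source, source_fillers)
target = apply_rule(rule_target, target_fillers)
```

The two sub-phrases swap their order on the target side purely because the gap indices are arranged differently in the rule, which is how a single rule can encode a reordering of whole sub-phrases.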
Formally, a hierarchical phrase pair is a rule of a synchronous context-free grammar that looks like this:

$$X \rightarrow \langle \text{yu } X_1 \text{ you } X_2,\ \text{have } X_2 \text{ with } X_1 \rangle$$

where $X$ is the single non-terminal symbol for any phrase and the indices encode the alignment of sub-phrases. These rules for hierarchical phrase pairs, together with two additional simple rules that generate the initial sequence of phrases,

$$S \rightarrow \langle X_1,\ X_1 \rangle$$

$$S \rightarrow \langle S_1 X_2,\ S_1 X_2 \rangle$$

constitute the whole grammar. Translation is then seen as a synchronous parsing process.

Dependency treelet translation

The second example of a hybrid SMT approach is the dependency treelet translation described in (Quirk and Menezes, 2007). This approach uses a dependency parser on the source side to create syntactic dependency trees from the source sentences. Using the word alignment, it also creates syntactic dependency trees for the target sentences, analogous to the source-side ones. Treelet pairs are extracted from this syntactically enhanced corpus. They are pairs of arbitrary subtrees, up to a given size, that agree with the word alignment in the same way as phrases do. The model is trained over these treelets instead of phrases, while everything else remains the same. The only exception is the reordering model, which is based on the syntactic trees (the probability of an ordering of a given tree is computed as the product, over all nodes, of the probabilities of the ordering of each node's descendants).

English-to-Czech treelet translation

The system of English-to-Czech treelet translation (Bojar, Cinková and Ptáček, 2007) goes even further in using linguistics. It uses dependency parsers on both the source and the target sentences. A synchronous tree substitution grammar (STSG) is then trained from the resulting dependency tree pairs. To explain what an STSG is, we first define a treelet: it is a subtree of a dependency tree containing two types of nodes, internals and slots. Every internal keeps all its children from the original tree. A slot is always a leaf node (in the treelet, not necessarily in the original tree): it has no lexical label and is only a placeholder saying which type of treelet can be substituted for it (e.g. the syntactic function of the detached constituent). A rule of an STSG is a treelet pair with aligned slots. An example of such rules is shown in Figure 2.

Figure 2. An example of a set of rules of a synchronous tree substitution grammar. From (Bojar, Cinková and Ptáček, 2007).

The probability model used for translation assigns to each rule a conditional probability saying how probable the rule is given the pair of slot types it is plugged into. The treelet alignment is then seen as a set of rules from which a pair of trees is built in a parallel top-down procedure until all slots are filled.

The process of translation consists of covering a source tree with the left-hand sides of the most probable rules and generating the target tree from the right-hand sides of the same rules. This process is implemented by an adapted chart parsing algorithm described in (Bojar and Čmejrek, 2007); the training is done by an expectation-maximization algorithm.

Discussion

The first two approaches described give significant improvements in translation quality (measured by the standard BLEU metric (Papineni et al., 2002)) over the phrase-based model, while the English-to-Czech experiments unfortunately give poor results. This may have two causes.

First, the current implementation does not exactly correspond to the procedure described above. It differs both in the estimation of the STSG model parameters and in the decoding procedure. The model parameters are estimated by a heuristic method based on n-to-n word alignment: all treelet pairs consistent with the word alignment, up to a given size, are extracted, and the parameters are estimated by maximum likelihood from the treelet pair counts.
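The maximum likelihood estimate from counts mentioned here amounts to a relative frequency. A minimal sketch, assuming that every extracted treelet pair has been reduced to a hypothetical record consisting of the slot-type pair it plugs into and a rule identifier; the function name, the slot labels, and the data are invented for illustration and do not come from the described system.

```python
from collections import Counter


def estimate_rule_probs(observations):
    """Relative-frequency estimate of p(rule | slot types) from extracted rule occurrences."""
    rule_counts = Counter(observations)                       # count(slot types, rule)
    slot_totals = Counter(slots for slots, _ in observations)  # count(slot types)
    return {
        (slots, rule): count / slot_totals[slots]
        for (slots, rule), count in rule_counts.items()
    }


# Hypothetical extracted treelet pairs: (source slot type, target slot type), rule id.
observations = [
    (("Sb", "Sb"), "rule_a"),
    (("Sb", "Sb"), "rule_a"),
    (("Sb", "Sb"), "rule_b"),
    (("Obj", "Obj"), "rule_c"),
]

probs = estimate_rule_probs(observations)
```

For every slot-type pair the estimated probabilities sum to one; rules that were never extracted get no entry at all, which is one reason why such counts are usually smoothed.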
Decoding is done by a top-down beam search. At present, we are working on adopting the more appropriate methods described in the previous section.

Second, using parsers on both sides can aggravate the data sparseness problem when the parses are incompatible. Although back-off models are used for smoothing, the influence of the dissimilarity of the parser outputs cannot be entirely suppressed.

Conclusion

In this paper we described three approaches to machine translation that enhance statistical methods with features of the example-based and rule-based paradigms. Two of them give promising results; the third is under ongoing development. In the discussion, we suggested possible improvements to its current state.

Acknowledgments. The work on this project was supported by the FP6-IST STP (EuroMatrix) grant.

References

Bojar O., S. Cinková and J. Ptáček, Towards English-to-Czech MT via Tectogrammatical Layer, Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories, Bergen, Norway.

Bojar O. and M. Čmejrek, Mathematical Model of Tree Transformations, Project Euromatrix Deliverable 3.2, Prague, Czech Republic.

Brown P. F. et al., The mathematics of statistical machine translation: parameter estimation, Computational Linguistics, Cambridge, MA, USA.

Chiang D., A hierarchical phrase-based model for statistical machine translation, ACL 05: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, Ann Arbor, Michigan.

Koehn P., F. J. Och and D. Marcu, Statistical phrase-based translation, NAACL 03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, 48-54, Edmonton, Canada.

Koehn P. et al., Moses: Open Source Toolkit for Statistical Machine Translation, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, Prague, Czech Republic.

Och F. J. and H. Ney, Discriminative training and maximum entropy models for statistical machine translation, ACL 02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, Pennsylvania.

Papineni K., S. Roukos, T. Ward and W.-J. Zhu, BLEU: a Method for Automatic Evaluation of Machine Translation, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania.

Quirk C. and A. Menezes, Dependency treelet translation: the convergence of statistical and example-based machine-translation?, Machine Translation, 43-65, Redmond.