Adaptive Development Data Selection for Log-linear Model in Statistical Machine Translation
Mu Li
Microsoft Research Asia

Dongdong Zhang
Microsoft Research Asia

Abstract

This paper addresses the problem of dynamic model parameter selection for log-linear model based statistical machine translation (SMT) systems. We propose a principled method for this task by transforming it into a test data dependent development set selection problem. We present two algorithms for automatic development set construction, and evaluate our method on several NIST data sets for the Chinese-English translation task. Experimental results show that our method can effectively adapt log-linear model parameters to different test data, and consistently achieves good translation performance compared with conventional methods that use a fixed model parameter setting across different data sets.

1 Introduction

In recent years, the log-linear model (Och and Ney, 2002) has become a mainstream method for formulating statistical models for machine translation. In this formulation, various relevant properties and data statistics used in the translation process, on either the monolingual or the bilingual side, are encoded as real-valued feature functions; the log-linear model thus provides an effective mathematical framework that accommodates a large variety of SMT formalisms with different computational linguistic motivations.

This work was done while the author was visiting Microsoft Research Asia.
Yinggong Zhao
Nanjing University
[email protected]

Ming Zhou
Microsoft Research Asia
[email protected]

Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), Beijing, August 2010

Formally, in a log-linear SMT model, given a source sentence f, we are to find the translation e* with the largest posterior probability among all possible translations:

    e* = argmax_e Pr(e|f)

and the posterior probability distribution Pr(e|f) is directly approximated by a log-linear formulation:

    Pr(e|f) = p_λ(e|f) = exp(Σ_{m=1..M} λ_m h_m(e, f)) / Σ_{e'} exp(Σ_{m=1..M} λ_m h_m(e', f))    (1)

in which the h_m are feature functions and λ = (λ_1, ..., λ_M) are model parameters (feature weights). A successful practical log-linear SMT model is usually the combined result of several efforts:

- Construction of well-motivated SMT models
- Accurate estimation of feature functions
- Appropriate scaling of log-linear model features (feature weight tuning)

In this paper, we focus on the last issue: parameter tuning for the log-linear model. In general, log-linear model parameters are optimized on a held-out development data set. Using this method, as in many machine learning tasks, the model parameters are tuned solely on the development data, and the optimality of the obtained model on unseen test data relies on the assumption that both development and test data follow an identical probabilistic distribution,
which often does not hold for real-world data. The goal of this paper is to investigate novel methods for test data dependent model parameter selection. We begin by discussing the principle of parameter learning for log-linear SMT models, and explain the rationale of transforming the task from parameter selection to development data selection. We describe two algorithms for automatic development set construction, and evaluate our method on several NIST MT evaluation data sets. Experimental results show that our method can effectively adapt log-linear model parameters to different test data and achieves consistently good translation performance compared with conventional methods that use a group of fixed model parameters across different data sets.

2 Model Learning for SMT with Log-linear Models

Model learning refers to the task of estimating a group of suitable log-linear model parameters λ = (λ_1, ..., λ_M) for use in Equation 1. It is often formulated as an optimization problem that finds the parameters maximizing a certain goodness of the translations generated by the learnt model on a development corpus D. The goodness can be measured with either the translation likelihood or specific machine translation evaluation metrics such as TER or BLEU. More specifically, let e* be the most probable translation of D with respect to model parameters λ, and let E(e*, λ, D) be a score function indicating the goodness of translation e*; a parameter estimation algorithm will then try to find the λ* which satisfies:

    λ* = argmax_λ E(e*, λ, D)    (2)

Note that when the goodness scoring function E(·) is specified, the parameter learning criterion in Equation 2 indicates that the derivation of model parameters λ* depends only on the development data D, and does not require any knowledge of the test data T. The underlying rationale for this rule is that if the test data T follows the same distribution as D, λ* will be optimal for both of them.
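As a toy illustration of Equations 1 and 2 (not the actual MERT procedure), the sketch below scores candidate translations drawn from an n-best list and searches a small weight grid for the weights that maximize a corpus-level goodness score. All function names, the grid search, and the per-candidate quality credits are illustrative assumptions:

```python
import itertools
import math

def posterior(nbest_features, weights):
    """Equation 1 over an n-best approximation of the translation space:
    p_lambda(e|f) is proportional to exp(sum_m lambda_m * h_m(e, f))."""
    scores = [sum(l * h for l, h in zip(weights, hv)) for hv in nbest_features]
    mx = max(scores)                          # shift scores for numerical stability
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    return [x / z for x in exps]

def decode(nbest_features, weights):
    """argmax_e Pr(e|f): index of the highest-scoring candidate translation."""
    return max(range(len(nbest_features)),
               key=lambda i: sum(l * h for l, h in zip(weights, nbest_features[i])))

def tune(dev, quality, grid):
    """Toy stand-in for Equation 2: exhaustive search for the weight vector
    maximizing a goodness score E on development data.  Real systems use
    MERT (Och, 2003) rather than a grid.

    dev:     list of sentences, each a list of candidate feature vectors
    quality: quality[s][i] is the goodness credit of candidate i of sentence s
    grid:    candidate values tried for every weight dimension
    """
    best_w, best_e = None, float("-inf")
    for w in itertools.product(grid, repeat=len(dev[0][0])):
        e = sum(quality[s][decode(cands, w)] for s, cands in enumerate(dev))
        if e > best_e:
            best_w, best_e = w, e
    return best_w
```

With a BLEU-like score plugged in as the quality credit, this is exactly the dependence on D alone that the discussion above describes: the test data never enters the objective.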
On the other hand, when there are mismatches between development and test data, the translation performance on test data will be suboptimal, which is very common for real-world data. Due to the differences between data sets, there is generally no single λ that is optimal for multiple data sets at the same time. Table 1 shows some empirical evidence when two data sets are mutually used as development and test data. In this setting, we used a hierarchical phrase-based decoder and two years' evaluation data of the NIST Chinese-to-English machine translation task (for the year 2008 only the newswire subset was used, because we want to limit both data sets to the same domain to show that data mismatch exists even when there is no domain difference), and report results using BLEU scores. Model parameters were tuned using the MERT algorithm (Och, 2003) optimized for the BLEU metric.

Dev data    MT05    MT08-nw
MT05
MT08-nw

Table 1: Translation performance of cross development/test on two NIST evaluation data sets.

In our work, we present a solution to this problem by using test data dependent model parameters for test data translation. As discussed above, since model parameters are solely determined by the development data D, selection of log-linear model parameters is basically equivalent to selecting a set of development data D. However, automatic development data selection remains a relatively open issue in current SMT research. Manual selection based on human experience and observation is still common practice.

3 Adaptive Model Parameter Selection

An important heuristic behind manual development data selection is to use the data set which is as similar to the test set as possible, in order to work around the data mismatch problem to the maximal extent. There is also empirical evidence supporting this heuristic. For instance, it is generally perceived that data set MT03 is more similar to MT05, while MT06-nw is closer to MT08-nw.
Table 2 shows experimental results using model parameters induced from MT03 and MT06-nw as development sets, with the same settings as in Table 1. As expected, MT06-nw is far more suitable than MT03 as the development data for MT08-nw; yet for test set MT05, the situation is just the opposite.

Dev data    MT05    MT08-nw
MT03
MT06-nw

Table 2: Translation performance on different test sets when using different development sets.

In this work, this heuristic is further exploited for automatic development data selection when no prior knowledge of the test data is available. In the following discussion, we assume the availability of a set of candidate source sentences, together with translation references, that are qualified for the log-linear model parameter learning task. Let D_F be the full candidate set. Given a test set T, the task of selecting a set of development data which can optimize the translation quality on T can be transformed into searching for a suitable subset of D_F which is most similar to T:

    D* = argmax_{D ⊆ D_F} Sim(D, T)

To achieve this goal, we need to address the following key issues:

- How to define and compute Sim(D, T), the similarity between different data sets;
- How to extract development data sets from a full candidate set for unseen test data.

3.1 Dataset Similarity

Computing document similarity is a classical task in many research areas such as information retrieval and document classification. However, typical methods for computing document similarity may not be suitable for our purpose. The reasons are two-fold:

1. The sizes of both development and test data are small in usual circumstances, and similarity measures such as cosine or the Dice coefficient based on term vectors will suffer from severe data sparseness problems. As a result, the obtained similarity measure will not be statistically reliable.
2. More importantly, what we care about here is not surface string similarity. Instead, we need a method to measure how similar two data sets are from the view of a log-linear SMT model.
Next we start by discussing the similarity between sentences. Given a source sentence f, we denote its possible translation space by H(f). In a log-linear SMT model, every translation e ∈ H(f) is essentially a feature vector h(e) = (h_1, ..., h_M). Accordingly, the similarity between two sentences f_1 and f_2 should be defined on the feature space of the model in use. Let V(f) = {h(e) : e ∈ H(f)} be the set of feature vectors for all translations in H(f); we have

    Sim(f_1, f_2) = Sim(V(f_1), V(f_2))    (3)

Because it is not practical to compute Equation 3 directly by enumerating all translations in H(f_1) and H(f_2), due to the huge search space in SMT tasks, we need to resort to some approximation. A viable solution is to use a single feature vector h(f) to represent V(f); Equation 3 can then be computed simply with existing vector similarity measures. One reasonable method to derive h(f) is a feature vector based on the average principle: each dimension of the vector is set to the expectation of its corresponding feature value over all translations:

    h(f) = Σ_{e ∈ H(f)} P(e|f) h(e)    (4)

An alternative and much simpler way to compute h(f) is to employ the max principle, in which we just use the feature vector of the best translation in H(f):

    h(f) = h(e*)    (5)

where e* = argmax_e P(e|f). Note that in both Equation 4 and Equation 5 we make use of the posterior probability P(e|f).
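Given an n-best approximation of H(f) with renormalized posteriors, Equations 4 and 5 can be sketched as follows; the pair-list representation of the n-best list is an illustrative assumption:

```python
def avg_feature_vector(nbest):
    """Average principle (Equation 4): per-dimension expectation of the
    feature values over an n-best approximation of H(f).

    nbest: list of (posterior, feature_vector) pairs, with the posteriors
    assumed to be renormalized over the n-best list.
    """
    dim = len(nbest[0][1])
    h = [0.0] * dim
    for p, hv in nbest:
        for m in range(dim):
            h[m] += p * hv[m]
    return h

def max_feature_vector(nbest):
    """Max principle (Equation 5): the feature vector of the single best
    translation e* = argmax_e P(e|f)."""
    return list(max(nbest, key=lambda ph: ph[0])[1])
```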
Since the true distribution is unknown, a pre-learnt model M has to be used to assign approximate probabilities to translations, which means that the obtained similarity depends on a specific model. As a convention, we use Sim_M(f_1, f_2) to denote the similarity between f_1 and f_2 based on M, and call M the reference model of the computed similarity. To avoid unexpected bias caused by a single reference model, multiple reference models can be used simultaneously, and the similarity is defined to be the maximum of all model-dependent similarity values:

    Sim(f_1, f_2) = max_M Sim_M(f_1, f_2)    (6)

where M belongs to {M_1, ..., M_n}, the set of reference models under consideration. To generalize this method to the data set level, we compute the vector h(S) for a data set S = (f_1, ..., f_|S|) as follows:

    h(S) = Σ_{i=1..|S|} h(f_i)    (7)

3.2 Development Sets Pre-construction

In the following, we sketch a method for automatically building a set of development data based on the full candidate set D_F before seeing any test data. Theoretically, a subset of D_F containing randomly sampled sentences will not meet our requirement well, because it is very likely to follow a distribution similar to that of D_F itself. What we expect is that the pre-built development sets approximate as many as possible of the typical data distributions that can be estimated from subsets of D_F. Our solution is based on the assumption that D_F can be described by some mixture model; hence we can use classical clustering methods such as k-means to partition D_F into subsets with different distributions. Let S_DF be the set of development data extracted from D_F. The construction of S_DF proceeds as follows:

1. Train a log-linear model M_F using D_F as development data;
2. Compute a feature vector h(d)¹ for each sentence d ∈ D_F using M_F as the reference model;
3. Cluster the sentences in D_F using h(d)/|d| as feature vectors;
4. Add the obtained sentence clusters to S_DF as candidate development sets.
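The pre-construction steps above can be sketched as follows, assuming the length-normalized sentence feature vectors have already been computed with the reference model M_F. The minimal k-means implementation and all names are illustrative (any standard clustering toolkit would serve); with cluster counts 2 to 5 and 4 passes per count, the pool holds at most 56 clusters, matching the setup described in Section 4.3:

```python
import random

def kmeans(vectors, k, seed=0, iters=20):
    """Minimal Lloyd's k-means over length-normalized sentence feature
    vectors h(d)/|d| (steps 2-3).  Returns a cluster index per sentence."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)          # starting point for each cluster
    assign = [0] * len(vectors)
    for _ in range(iters):
        for i, v in enumerate(vectors):       # assign to nearest center
            assign[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(v, centers[c])))
        for c in range(k):                    # recompute centers
            members = [vectors[i] for i in range(len(vectors)) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def build_candidate_dev_sets(vectors, ks=(2, 3, 4, 5), passes=4):
    """Cluster D_F under several initialization configurations (number of
    clusters and starting points) and pool the non-empty clusters into S_DF
    (step 4).  Each cluster is returned as a list of sentence indices."""
    s_df = []
    for k in ks:
        for seed in range(passes):
            assign = kmeans(vectors, k, seed=seed)
            for c in range(k):
                cluster = [i for i, a in enumerate(assign) if a == c]
                if cluster:
                    s_df.append(cluster)
    return s_df
```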
In the third step, since the feature vector h(d) is defined at the sentence level, it is divided by the number of words in d so that it is independent of sentence length. Considering that the outputs of unsupervised data clustering methods are usually sensitive to initial conditions, we include in S_DF sentence clusters based on different initialization configurations to remove such random effects. An initialization configuration for sentence clustering in our work includes the starting point for each cluster and the total number of clusters. In fact, the inclusion of more sentence clusters also increases the diversity of the resulting S_DF.

At decoding time, when a test set T is presented, we compute the similarity between T and each development set D ∈ S_DF, and choose the one with the largest similarity score as the development set for T:

    D* = argmax_{D ∈ S_DF} Sim(T, D)    (8)

When a single reference model is used to compute Sim(T, D), M_F is a natural choice. In the multi-model setting of Equation 6, the models learnt from the development sets in S_DF can serve this purpose. Note that in this method model learning is not required for every new test set, because the model parameters for each development set in S_DF can also be pre-learnt and ready to be used for decoding.

¹ Throughout this paper, a development sentence d generally refers to its source part if there is no extra explanation.

3.3 Dynamic Development Set Construction

In the previous method, the test data T is only involved in the process of choosing a development set from a list of candidates, not in the process of development set construction. Next we present a
method for building a development set on demand based on the test data T. Let D_F = (d_1, ..., d_n) be the data set containing all candidate sentences for development data selection. The method is an iterative process in which the development data and the learnt model are alternately updated. The detailed steps are as follows:

1. Let i = 0, D_0 = D_F;
2. Train a model M_i based on D_i;
3. For each d_k ∈ D_F, compute the similarity score Sim_{M_i}(T, d_k) between T and d_k based on model M_i;
4. Select the top n candidate sentences with the highest similarity scores from D_F to form D_{i+1};
5. Repeat steps 2 to 4 until the similarity between T and the latest selected development data converges (the increase in the similarity measure is less than a specified threshold compared to the last round) or the specified iteration limit is reached.

In step 4, D_{i+1} is greedily extracted from D_F, and there is no guarantee that Sim_{M_i}(T, D_{i+1}) will increase or decrease after a new sentence is added to D_{i+1}. The number of selected sentences n therefore needs to be determined empirically. If n is too small, neither the selected data nor the learnt model parameters will be statistically reliable; if n is too large, we may have to include sentences in the development data that are not suitable for the test data, and miss the opportunity to extract the most desirable development set. One drawback of this method is its relatively high computational cost, because it requires multiple parameter training passes whenever a test set is presented to the system for translation.

4 Experiments

4.1 Data

Experiments were conducted on the data sets used for the NIST Chinese-English machine translation evaluation tasks. The MT03 and MT06 data sets, which contain 919 and 1,664 sentences respectively, were used as development data in various settings. The MT04, MT05 and MT08 data sets were used for testing.
In some settings, we also used a test set MT0x, containing 1,000 sentences randomly sampled from the above three data sets. All translation performance results were measured in terms of case-insensitive BLEU scores. For all experiments, all parallel corpora available to the constrained track of the NIST 2008 Chinese-English MT evaluation task were used for translation model training, consisting of around 5.1M bilingual sentence pairs. GIZA++ was used for word alignment in both directions, which was further refined with the intersect-diag-grow heuristics. We used a 5-gram language model trained on the Xinhua portion of the English Gigaword corpus version 3.0 from LDC and the English part of the parallel corpora.

4.2 Machine Translation System

We used an in-house implementation of the hierarchical phrase-based decoder described in Chiang (2005). In addition to the standard features used in Chiang (2005), we also used a lexicon feature indicating how many word pairs in the translation are found in a conventional Chinese-English lexicon. Phrasal rules were extracted from all the parallel data, but hierarchical rules were only extracted from the FBIS part of the parallel data, which contains around 128,000 sentence pairs. For all development data, the feature weights of the decoder were tuned using the MERT algorithm (Och, 2003).

4.3 Results of Development Data Pre-construction

In the following we first present some overall results using the development data pre-construction method, then dive into more detailed settings of the experiments. Table 3 shows the results using three different data sets for log-linear model parameter tuning. Elements in the first column indicate the data sets used for parameter tuning, and the other columns contain evaluation results on different test sets. In the
third row of the table, MT03+MT06 means combining the data sets of MT03 and MT06 to form a larger tuning set. The first number in each cell denotes the BLEU score when using the tuning set as a standard development set D, and the second when using the tuning set as a candidate set D_F.

Tuning set      MT04    MT05    MT08    MT0x
MT03
MT06
MT03+MT06
Oracle cluster
Self-training

Table 3: Translation performance using different methods and data sets for parameter tuning.

For all experiment settings in the table, we used the cosine value between feature vectors to measure the similarity between data sets, and feature vectors were computed according to Equation 5 and Equation 7 using a reference model trained with the corresponding candidate set D_F as development set.² We adopted the k-means algorithm for data clustering, with the number of clusters iterating from 2 to 5. In each iteration, we ran 4 passes of clustering using different initial values. Therefore, in total 56 sentence clusters were generated in each S_DF.³

² For example, in all the experiments in the row with MT03 as D_F, we use the same reference model trained with MT03 as development set.
³ Sometimes some clusters are empty or contain too few sentences, so the actual number may be smaller.

From the table it can be seen that given the same set of sentences (MT03, MT06 and MT03+MT06), when they are used as the candidate set D_F for the development set pre-construction method, the translation performance is generally better than when they are used as development sets as a whole. Using the MT03 data set as D_F is an exception: there is a slight performance drop on test sets MT04 and MT05, but it also helps reduce the performance see-saw problem on different test sets shown in Table 1. Meanwhile, in the other two settings of D_F, we observed significant BLEU score increases on all test sets except MT0x (on which the performance was almost unchanged). In addition, the fact that using MT03+MT06 as D_F achieves the best (or almost best) performance on all test sets implies that it should be a better choice to include data as diverse as possible in D_F.

We also appended two oracle BLEU numbers for each test set in Table 3 for reference. One is denoted "oracle cluster", which is the highest possible BLEU that can be achieved on the test set when the development set must be chosen from the sentence clusters in S_{MT03+MT06}. The other is labeled "self-training", which is the BLEU score obtained when the test data itself is used as development data. This number can serve as an actual performance upper bound on the test set.

Next we investigated the impact of the different ways of computing feature vectors presented in Section 3.1. We re-ran some of the previous experiments on test sets MT04, MT05 and MT08 using MT03+MT06 as D_F. Most settings were kept unchanged, except that the feature vector of each sentence was computed according to Equation 4. A 20-best translation list was used to approximate H(f). The results are shown in Table 4.

Test set    average    max
MT04
MT05
MT08

Table 4: Translation performance when using averaged feature values for similarity computation.

The numbers in the second column are based on Equation 4. Numbers based on Equation 5 are also listed in the third column for comparison. In none of the experiment settings did we observe a consistent or significant advantage of Equation 4 over Equation 5. Since Equation 5
is much simpler, it is a good choice in practice. We therefore conducted all the following experiments based on Equation 5.

We are also interested in the correlation between two measures: the similarity between development and test data, and the actual translation performance on the test data. First we would like to echo the motivating experiment presented in Section 3. Table 5 shows the similarity between the data sets used in that experiment, with M_{MT03+MT06} as the reference model. The results in Table 2 and Table 5 clearly fit each other very well.

Dev data    MT05    MT08-nw
MT03
MT06-nw

Table 5: Similarity between NIST data sets.

Figure 1 shows the results of a set of more comprehensive experiments on the MT05 data set concerning the similarity between development and test sets.

Figure 1: Correlation between similarity and BLEU on the MT05 data set (BLEU against the rank of the development set, for the "multiple", MT03+MT06 and MT06 reference model settings).

In the figure, every data line shows how the BLEU score changes when a different pre-built development set in S_{MT03+MT06} is used for model learning. The data points in each line are sorted by the rank of similarity between the development set in use and the MT05 data set. We also compared results based on three reference model settings. In the first one ("multiple"), the similarity was computed using Equation 6, and the reference model set contains all models learnt from the development sets in S_{MT03+MT06}. The other two settings use reference models learnt from the MT06 and MT03+MT06 data sets respectively. We can observe from the figure that the correlation between BLEU scores and data set similarity can only be identified at macro scales for all three similarity settings. Although using data similarity may not select the perfect development data set from S_DF, by picking a development set with the highest similarity score, we can usually (almost always) get good enough BLEU scores in our experiments.
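The data-set similarity used throughout these experiments (set-level feature vectors built from Equations 5 and 7, compared with the cosine measure, and the selection rule of Equation 8) can be sketched as follows; the names are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two set-level feature vectors
    (the sums of Equation 7)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def select_dev_set(test_vector, candidate_sets):
    """Equation 8: choose the pre-built development set in S_DF whose
    feature vector is most similar to the test set's.

    candidate_sets: list of (name, set_feature_vector) pairs; the names
    are illustrative.
    """
    best, _ = max(candidate_sets, key=lambda nv: cosine(test_vector, nv[1]))
    return best
```

Because the set-level vectors and the per-cluster model parameters can both be computed offline, this selection step is the only work needed per test set.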
4.4 Results of Development Data Dynamic Generation

We ran two sets of experiments for the dynamic development data construction method. The first was designed to investigate how the size of the extracted development data affects translation performance. Using MT05 and MT08 as test sets and MT03+MT06 as D_F, we ran experiments for the algorithm presented in Section 3.3 with n = 200 to n = 1,000. In these experiments we did not observe significant changes in BLEU scores: the difference between the highest and lowest numbers was generally marginal. The second set aimed at examining how the BLEU numbers change as the extracted development data is iteratively updated. Figure 2 shows one set of results on test sets MT05 and MT08, using the MT03+MT06 data set as D_F and n set to 400.

Figure 2: BLEU score as a function of iteration in dynamic development data extraction (MT05 and MT08).

The similarity usually converged after 2 to 3 iterations, which is consistent with the trend of the BLEU scores on the test sets. However, in all our experimental settings, we did not observe any results significantly better than those of the development set pre-construction method.

5 Discussions

Some of the previous work related to building adaptive SMT systems was discussed in the domain adaptation context, in which one fundamental idea is to estimate a more suitable domain-specific translation model or language model. When the target domain is already known, adding a small amount of in-domain data (both monolingual and bilingual) to the existing training corpora has been shown to be very effective in practice. But model adaptation is required in more scenarios than explicitly defined domains: as shown by the results in Table 2, even for data from the same domain, distribution mismatch can be a problem.

There have also been considerable efforts to deal with the unknown distribution of the text to be translated, with the research focused on translation and language model adaptation. Typical methods in this direction include dynamic data selection (Lü et al., 2007; Zhao et al., 2004; Hildebrand et al., 2005) and data weighting (Foster and Kuhn, 2007; Matsoukas et al., 2009). All the mentioned methods use information retrieval techniques to identify relevant training data from the entire training corpora. The work presented here also makes no assumption about the distribution of the test data, but it differs significantly from the previous methods from a log-linear model's perspective. Adjusting translation and language models based on test data can be viewed as adaptation of feature values, while our method is essentially adaptation of feature weights. This difference makes the two kinds of methods complementary to each other: it is possible to make further improvements by using both of them in one task.
To our knowledge, there has been no dedicated discussion of principled methods for development data selection in previous research. In Lü et al. (2007), log-linear model parameters can also be adjusted at decoding time, but in their approach the adjustment was based on heuristic rules and a re-weighted training data distribution. In addition, compared with training data selection, the computational cost of development data selection is much smaller.

From a machine learning perspective, both proposed methods can be viewed as a form of transductive learning applied to the SMT task (Ueffing et al., 2007). But our methods do not rely on surface similarities between training and development sentences, and the development/test sentences are not used to re-train the SMT sub-models.

6 Conclusions and Future Work

In this paper, we addressed the data mismatch issue between the training and decoding time of log-linear SMT models, and presented principled methods for dynamically inferring test data dependent model parameters through development set selection. We described two algorithms for this task, development set pre-construction and dynamic construction, and evaluated our method on the NIST data sets for the Chinese-English translation task. Experimental results show that our methods are capable of consistently achieving good translation performance on multiple test sets with different data distributions, without manual tweaking of log-linear model parameters. Though theoretically the dynamic construction method could bring better results, the pre-construction method performs comparably well in our experimental settings. Considering that the pre-construction method is also computationally cheaper, it should be a better choice in practice.

In the future, we are interested in two directions. One is to explore the possibility of performing data clustering on the test set as well, and choosing suitable model parameters for each cluster separately.
The other involves dynamic SMT model selection: for example, some parts of the test data may fit a phrase-based model better, while other parts can be better translated using a syntax-based model.
References

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proc. of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL). Ann Arbor, Michigan.

George Foster and Roland Kuhn. 2007. Mixture-Model Adaptation for SMT. In Proc. of the Second ACL Workshop on Statistical Machine Translation. Prague, Czech Republic.

Michel Galley, Mark Hopkins, Kevin Knight and Daniel Marcu. 2004. What's in a translation rule? In Proc. of the Human Language Technology Conf. (HLT-NAACL). Boston, Massachusetts.

Almut Hildebrand, Matthias Eck, Stephan Vogel, and Alex Waibel. 2005. Adaptation of the Translation Model for Statistical Machine Translation Based on Information Retrieval. In Proc. of EAMT. Budapest, Hungary.

Philipp Koehn, Franz Och and Daniel Marcu. 2003. Statistical Phrase-Based Translation. In Proc. of the Human Language Technology Conf. (HLT-NAACL). Edmonton, Canada.

Yang Liu, Qun Liu, and Shouxun Lin. Tree-to-string alignment template for statistical machine translation. In Proc. of the 45th Annual Meeting of the Association for Computational Linguistics (ACL). Prague, Czech Republic.

Yajuan Lü, Jin Huang and Qun Liu. 2007. Improving Statistical Machine Translation Performance by Training Data Selection and Optimization. In Proc. of the Conference on Empirical Methods in Natural Language Processing. Prague, Czech Republic.

Spyros Matsoukas, Antti-Veikko I. Rosti and Bing Zhang. 2009. Discriminative Corpus Weight Estimation for Machine Translation. In Proc. of the Conference on Empirical Methods in Natural Language Processing. Singapore.

Franz Och and Hermann Ney. 2002. Discriminative Training and Maximum Entropy Models for Statistical Machine Translation. In Proc. of the 40th Annual Meeting of the Association for Computational Linguistics (ACL). Philadelphia, PA.

Franz Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proc. of the 41st Annual Meeting of the Association for Computational Linguistics (ACL). Sapporo, Japan.

Nicola Ueffing, Gholamreza Haffari and Anoop Sarkar. 2007. Transductive learning for statistical machine translation. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL). Prague, Czech Republic.

Bing Zhao, Matthias Eck, and Stephan Vogel. 2004. Language Model Adaptation for Statistical Machine Translation with Structured Query Models. In Proc. of COLING. Geneva, Switzerland.
International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT
Statistical Machine Translation
Statistical Machine Translation Some of the content of this lecture is taken from previous lectures and presentations given by Philipp Koehn and Andy Way. Dr. Jennifer Foster National Centre for Language
Jane 2: Open Source Phrase-based and Hierarchical Statistical Machine Translation
Jane 2: Open Source Phrase-based and Hierarchical Statistical Machine Translation Joern Wuebker M at thias Huck Stephan Peitz M al te Nuhn M arkus F reitag Jan-Thorsten Peter Saab M ansour Hermann N e
The Prague Bulletin of Mathematical Linguistics NUMBER 93 JANUARY 2010 37 46. Training Phrase-Based Machine Translation Models on the Cloud
The Prague Bulletin of Mathematical Linguistics NUMBER 93 JANUARY 2010 37 46 Training Phrase-Based Machine Translation Models on the Cloud Open Source Machine Translation Toolkit Chaski Qin Gao, Stephan
Clustering Technique in Data Mining for Text Documents
Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor
An End-to-End Discriminative Approach to Machine Translation
An End-to-End Discriminative Approach to Machine Translation Percy Liang Alexandre Bouchard-Côté Dan Klein Ben Taskar Computer Science Division, EECS Department University of California at Berkeley Berkeley,
Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast
Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast Hassan Sawaf Science Applications International Corporation (SAIC) 7990
The Prague Bulletin of Mathematical Linguistics NUMBER 95 APRIL 2011 5 18. A Guide to Jane, an Open Source Hierarchical Translation Toolkit
The Prague Bulletin of Mathematical Linguistics NUMBER 95 APRIL 2011 5 18 A Guide to Jane, an Open Source Hierarchical Translation Toolkit Daniel Stein, David Vilar, Stephan Peitz, Markus Freitag, Matthias
Statistical Machine Translation Lecture 4. Beyond IBM Model 1 to Phrase-Based Models
p. Statistical Machine Translation Lecture 4 Beyond IBM Model 1 to Phrase-Based Models Stephen Clark based on slides by Philipp Koehn p. Model 2 p Introduces more realistic assumption for the alignment
A New Input Method for Human Translators: Integrating Machine Translation Effectively and Imperceptibly
Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015) A New Input Method for Human Translators: Integrating Machine Translation Effectively and Imperceptibly
Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016
Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with
Chapter 5. Phrase-based models. Statistical Machine Translation
Chapter 5 Phrase-based models Statistical Machine Translation Motivation Word-Based Models translate words as atomic units Phrase-Based Models translate phrases as atomic units Advantages: many-to-many
Factored Translation Models
Factored Translation s Philipp Koehn and Hieu Hoang [email protected], [email protected] School of Informatics University of Edinburgh 2 Buccleuch Place, Edinburgh EH8 9LW Scotland, United Kingdom
An Iterative Link-based Method for Parallel Web Page Mining
An Iterative Link-based Method for Parallel Web Page Mining Le Liu 1, Yu Hong 1, Jun Lu 2, Jun Lang 2, Heng Ji 3, Jianmin Yao 1 1 School of Computer Science & Technology, Soochow University, Suzhou, 215006,
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
Factored Markov Translation with Robust Modeling
Factored Markov Translation with Robust Modeling Yang Feng Trevor Cohn Xinkai Du Information Sciences Institue Computing and Information Systems Computer Science Department The University of Melbourne
The Transition of Phrase based to Factored based Translation for Tamil language in SMT Systems
The Transition of Phrase based to Factored based Translation for Tamil language in SMT Systems Dr. Ananthi Sheshasaayee 1, Angela Deepa. V.R 2 1 Research Supervisior, Department of Computer Science & Application,
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
Micro blogs Oriented Word Segmentation System
Micro blogs Oriented Word Segmentation System Yijia Liu, Meishan Zhang, Wanxiang Che, Ting Liu, Yihe Deng Research Center for Social Computing and Information Retrieval Harbin Institute of Technology,
Subspace Analysis and Optimization for AAM Based Face Alignment
Subspace Analysis and Optimization for AAM Based Face Alignment Ming Zhao Chun Chen College of Computer Science Zhejiang University Hangzhou, 310027, P.R.China [email protected] Stan Z. Li Microsoft
Data Mining Project Report. Document Clustering. Meryem Uzun-Per
Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...
Genetic Algorithm-based Multi-Word Automatic Language Translation
Recent Advances in Intelligent Information Systems ISBN 978-83-60434-59-8, pages 751 760 Genetic Algorithm-based Multi-Word Automatic Language Translation Ali Zogheib IT-Universitetet i Goteborg - Department
Hybrid Machine Translation Guided by a Rule Based System
Hybrid Machine Translation Guided by a Rule Based System Cristina España-Bonet, Gorka Labaka, Arantza Díaz de Ilarraza, Lluís Màrquez Kepa Sarasola Universitat Politècnica de Catalunya University of the
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information
2014/02/13 Sphinx Lunch
2014/02/13 Sphinx Lunch Best Student Paper Award @ 2013 IEEE Workshop on Automatic Speech Recognition and Understanding Dec. 9-12, 2013 Unsupervised Induction and Filling of Semantic Slot for Spoken Dialogue
The Impact of Morphological Errors in Phrase-based Statistical Machine Translation from English and German into Swedish
The Impact of Morphological Errors in Phrase-based Statistical Machine Translation from English and German into Swedish Oscar Täckström Swedish Institute of Computer Science SE-16429, Kista, Sweden [email protected]
Search Result Optimization using Annotators
Search Result Optimization using Annotators Vishal A. Kamble 1, Amit B. Chougule 2 1 Department of Computer Science and Engineering, D Y Patil College of engineering, Kolhapur, Maharashtra, India 2 Professor,
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
Improving MT System Using Extracted Parallel Fragments of Text from Comparable Corpora
mproving MT System Using Extracted Parallel Fragments of Text from Comparable Corpora Rajdeep Gupta, Santanu Pal, Sivaji Bandyopadhyay Department of Computer Science & Engineering Jadavpur University Kolata
A Systematic Comparison of Various Statistical Alignment Models
A Systematic Comparison of Various Statistical Alignment Models Franz Josef Och Hermann Ney University of Southern California RWTH Aachen We present and compare various methods for computing word alignments
ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING
ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING BY OMID ROUHANI-KALLEH THESIS Submitted as partial fulfillment of the requirements for the degree of
Building a Web-based parallel corpus and filtering out machinetranslated
Building a Web-based parallel corpus and filtering out machinetranslated text Alexandra Antonova, Alexey Misyurev Yandex 16, Leo Tolstoy St., Moscow, Russia {antonova, misyurev}@yandex-team.ru Abstract
Clustering Connectionist and Statistical Language Processing
Clustering Connectionist and Statistical Language Processing Frank Keller [email protected] Computerlinguistik Universität des Saarlandes Clustering p.1/21 Overview clustering vs. classification supervised
Performance Metrics for Graph Mining Tasks
Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical
The CMU Syntax-Augmented Machine Translation System: SAMT on Hadoop with N-best alignments
The CMU Syntax-Augmented Machine Translation System: SAMT on Hadoop with N-best alignments Andreas Zollmann, Ashish Venugopal, Stephan Vogel interact, Language Technology Institute School of Computer Science
So, how do you pronounce. Jilles Vreeken. Okay, now we can talk. So, what kind of data? binary. * multi-relational
Simply Mining Data Jilles Vreeken So, how do you pronounce Exploratory Data Analysis Jilles Vreeken Jilles Yill less Vreeken Fray can 17 August 2015 Okay, now we can talk. 17 August 2015 The goal So, what
Machine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY
SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY G.Evangelin Jenifer #1, Mrs.J.Jaya Sherin *2 # PG Scholar, Department of Electronics and Communication Engineering(Communication and Networking), CSI Institute
Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach
Outline Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach Jinfeng Yi, Rong Jin, Anil K. Jain, Shaili Jain 2012 Presented By : KHALID ALKOBAYER Crowdsourcing and Crowdclustering
PULLING OUT OPINION TARGETS AND OPINION WORDS FROM REVIEWS BASED ON THE WORD ALIGNMENT MODEL AND USING TOPICAL WORD TRIGGER MODEL
Journal homepage: www.mjret.in ISSN:2348-6953 PULLING OUT OPINION TARGETS AND OPINION WORDS FROM REVIEWS BASED ON THE WORD ALIGNMENT MODEL AND USING TOPICAL WORD TRIGGER MODEL Utkarsha Vibhute, Prof. Soumitra
Generating Chinese Classical Poems with Statistical Machine Translation Models
Proceedings of the Twenty-Sixth AAA Conference on Artificial ntelligence Generating Chinese Classical Poems with Statistical Machine Translation Models Jing He* Ming Zhou Long Jiang Tsinghua University
Make Better Decisions with Optimization
ABSTRACT Paper SAS1785-2015 Make Better Decisions with Optimization David R. Duling, SAS Institute Inc. Automated decision making systems are now found everywhere, from your bank to your government to
Categorical Data Visualization and Clustering Using Subjective Factors
Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,
A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE
STUDIA UNIV. BABEŞ BOLYAI, INFORMATICA, Volume LIX, Number 1, 2014 A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE DIANA HALIŢĂ AND DARIUS BUFNEA Abstract. Then
Large-Scale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
Comparison of K-means and Backpropagation Data Mining Algorithms
Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and
STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and
Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table
Globally Optimal Crowdsourcing Quality Management
Globally Optimal Crowdsourcing Quality Management Akash Das Sarma Stanford University [email protected] Aditya G. Parameswaran University of Illinois (UIUC) [email protected] Jennifer Widom Stanford
Language Model of Parsing and Decoding
Syntax-based Language Models for Statistical Machine Translation Eugene Charniak ½, Kevin Knight ¾ and Kenji Yamada ¾ ½ Department of Computer Science, Brown University ¾ Information Sciences Institute,
D-optimal plans in observational studies
D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational
Translating the Penn Treebank with an Interactive-Predictive MT System
IJCLA VOL. 2, NO. 1 2, JAN-DEC 2011, PP. 225 237 RECEIVED 31/10/10 ACCEPTED 26/11/10 FINAL 11/02/11 Translating the Penn Treebank with an Interactive-Predictive MT System MARTHA ALICIA ROCHA 1 AND JOAN
Collaborative Machine Translation Service for Scientific texts
Collaborative Machine Translation Service for Scientific texts Patrik Lambert [email protected] Jean Senellart Systran SA [email protected] Laurent Romary Humboldt Universität Berlin
tance alignment and time information to create confusion networks 1 from the output of different ASR systems for the same
1222 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 16, NO. 7, SEPTEMBER 2008 System Combination for Machine Translation of Spoken and Written Language Evgeny Matusov, Student Member,
TS3: an Improved Version of the Bilingual Concordancer TransSearch
TS3: an Improved Version of the Bilingual Concordancer TransSearch Stéphane HUET, Julien BOURDAILLET and Philippe LANGLAIS EAMT 2009 - Barcelona June 14, 2009 Computer assisted translation Preferred by
Emotion Detection in Code-switching Texts via Bilingual and Sentimental Information
Emotion Detection in Code-switching Texts via Bilingual and Sentimental Information Zhongqing Wang, Sophia Yat Mei Lee, Shoushan Li, and Guodong Zhou Natural Language Processing Lab, Soochow University,
Regression Clustering
Chapter 449 Introduction This algorithm provides for clustering in the multiple regression setting in which you have a dependent variable Y and one or more independent variables, the X s. The algorithm
A Systematic Cross-Comparison of Sequence Classifiers
A Systematic Cross-Comparison of Sequence Classifiers Binyamin Rozenfeld, Ronen Feldman, Moshe Fresko Bar-Ilan University, Computer Science Department, Israel [email protected], [email protected],
The Prague Bulletin of Mathematical Linguistics NUMBER 96 OCTOBER 2011 49 58. Ncode: an Open Source Bilingual N-gram SMT Toolkit
The Prague Bulletin of Mathematical Linguistics NUMBER 96 OCTOBER 2011 49 58 Ncode: an Open Source Bilingual N-gram SMT Toolkit Josep M. Crego a, François Yvon ab, José B. Mariño c c a LIMSI-CNRS, BP 133,
Turker-Assisted Paraphrasing for English-Arabic Machine Translation
Turker-Assisted Paraphrasing for English-Arabic Machine Translation Michael Denkowski and Hassan Al-Haj and Alon Lavie Language Technologies Institute School of Computer Science Carnegie Mellon University
W6.B.1. FAQs CS535 BIG DATA W6.B.3. 4. If the distance of the point is additionally less than the tight distance T 2, remove it from the original set
http://wwwcscolostateedu/~cs535 W6B W6B2 CS535 BIG DAA FAQs Please prepare for the last minute rush Store your output files safely Partial score will be given for the output from less than 50GB input Computer
Making Sense of the Mayhem: Machine Learning and March Madness
Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University [email protected] [email protected] I. Introduction III. Model The goal of our research
Machine translation techniques for presentation of summaries
Grant Agreement Number: 257528 KHRESMOI www.khresmoi.eu Machine translation techniques for presentation of summaries Deliverable number D4.6 Dissemination level Public Delivery date April 2014 Status Author(s)
Mining Signatures in Healthcare Data Based on Event Sequences and its Applications
Mining Signatures in Healthcare Data Based on Event Sequences and its Applications Siddhanth Gokarapu 1, J. Laxmi Narayana 2 1 Student, Computer Science & Engineering-Department, JNTU Hyderabad India 1
DEA implementation and clustering analysis using the K-Means algorithm
Data Mining VI 321 DEA implementation and clustering analysis using the K-Means algorithm C. A. A. Lemos, M. P. E. Lins & N. F. F. Ebecken COPPE/Universidade Federal do Rio de Janeiro, Brazil Abstract
Exploiting Comparable Corpora and Bilingual Dictionaries. the Cross Language Text Categorization
Exploiting Comparable Corpora and Bilingual Dictionaries for Cross-Language Text Categorization Alfio Gliozzo and Carlo Strapparava ITC-Irst via Sommarive, I-38050, Trento, ITALY {gliozzo,strappa}@itc.it
Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data
CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear
