Identifying Japanese-Chinese Bilingual Synonymous Technical Terms from Patent Families

Size: px
Start display at page:

Download "Identifying Japanese-Chinese Bilingual Synonymous Technical Terms from Patent Families"

Transcription

1 Identifying Japanese-Chinese Bilingual Synonymous Technical Terms from Patent Families Zi Long Lijuan Dong Takehito Utsuro Tomoharu Mitsuhashi Mikio Yamamoto Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba, , Japan Japan Patent Information Organization, 4-1-7, Toyo, Koto-ku, Tokyo, , Japan Abstract In the task of acquiring Japanese-Chinese technical term translation equivalent pairs from parallel patent documents, this paper considers situations where a technical term is observed in many parallel patent sentences and is translated into many translation equivalents and studies the issue of identifying synonymous translation equivalent pairs. First, we collect candidates of synonymous translation equivalent pairs from parallel patent sentences. Then, we apply the Support Vector Machines (SVMs) to the task of identifying bilingual synonymous technical terms, and achieve the performance of over 85% precision and over 60% F-measure. We further examine two types of segmentation of Chinese sentences, i.e., by characters and by morphemes, and integrate those two types of segmentation in the form of the intersection of SVM judgments, which achieved over 90% precision. Keywords: synonymous technical terms, patent families, technical term translation 1. Introduction For both high quality machine and human translation, a large scale and high quality bilingual lexicon is the most important key resource. Since manual compilation of bilingual lexicon requires plenty of time and huge manual labor, in the research area of knowledge acquisition from natural language text, automatic bilingual lexicon compilation have been studied. Techniques invented so far include translation term pair acquisition based on statistical co-occurrence measure from parallel sentences (Matsumoto and Utsuro, 2000), translation term pair acquisition from comparable corpora (Fung and Yee, 1998), compositional translation generation based on an existing bilingual lexicon for human use (Tonoike et al., 2006), and translation term pair acquisition by collecting partially bilingual texts through the search engine (Huang et al., 2005). Among those efforts of acquiring bilingual lexicon from text, Morishita et al. (2008) studied to acquire Japanese- English technical term translation lexicon from phrase tables, which are trained by a phrase-based SMT model with parallel sentences automatically extracted from parallel patent documents. In more recent studies, they require the acquired technical term translation equivalents to be consistent with word alignment in parallel sentences and achieved 91.9% precision with almost 70% recall. Furthermore, based on the achievement above, Liang et al. (2011a) considered situations where a technical term is observed in many parallel patent sentences and is translated into many translation equivalents. More specifically, in the task of acquiring Japanese-English technical term translation equivalent pairs, Liang et al. (2011a) studied the issue of identifying Japanese-English synonymous translation equivalent pairs. First, they collect candidates of synonymous translation equivalent pairs from parallel patent sentences. Then, they apply the Support Vector Machines (SVMs) (Vapnik, 1998) to the task of identifying bilingual synonymous technical terms. Based on the technique and the results of identifying Japanese-English synonymous translation equivalent pairs in Liang et al. (2011a), we aim at identifying Japanese- Chinese synonymous translation equivalent pairs from Japanese-Chinese patent families. We especially examine two types of segmentation of Chinese sentences, namely, by characters and by morphemes. Although both types of segmentation achieved almost similar performance around 95 97% (in recall / precision / f-measure) in the task of acquiring Japanese-Chinese technical term translation pairs, they have different types of errors. Also in the task of identifying Japanese-Chinese synonymous technical terms, both types of segmentation achieved almost similar performance, while they have different types of errors. Thus, we integrate those two types of segmentation in the form of the intersection of SVM judgments, and show that this achieves over 90% precision. 2. Japanese-Chinese Parallel Patent Documents Japanese-Chinese parallel patent documents are collected from the Japanese patent documents published by the Japanese Patent Office (JPO) in and the Chinese patent documents published by State Intellectual Property Office of the People s Republic of China (SIPO) in From them, we extract 312,492 patent families, and the method of Utiyama and Isahara (2007) is applied 1 to the text of those patent families, and Japanese and Chinese sentences are aligned. In this paper, we use 3.6M parallel patent sentences with the highest scores of sentence alignment. 3. Phrase Table of an SMT Model As a toolkit of a phrase-based SMT model, we use Moses (Koehn et al., 2007) and apply it to the whole 3.6M parallel patent sentences. Before applying Moses, Japanese sentences are segmented into a sequence of morphemes by the Japanese morphological analyzer MeCab 2 with the 1 Here, we used a Japanese-Chinese translation lexicon consisting of about 170,000 Chinese head words. 2

2 Figure 1: Developing a Reference Set of Bilingual Synonymous Technical Terms morpheme lexicon IPAdic 3. For Chinese sentences, we examine two types of segmentation, i.e., segmentation by characters 4 and segmentation by morphemes 5. As the result of applying Moses, we have a phrase table in the direction of Japanese to Chinese translation, and another one in the opposite direction of Chinese to Japanese translation. In the direction of Japanese to Chinese translation, we finally obtain 108M (Chinese sentences segmented by morphemes) / 274M (Chinese sentences segmented by characters) translation pairs with 75M / 197M unique Japanese phrases with Japanese to Chinese phrase translation probabilities P (p C p J ) of translating a Japanese phrase p J into a Chinese phrase p C. For each Japanese phrase, those multiple translation candidates in the phrase table are ranked in descending order of Japanese to Chinese phrase translation probabilities. In the similar way, in the phrase table in the opposite direction of Chinese to Japanese translation, for each Chinese phrase, multiple Japanese translation candidates are ranked in descending order of Chinese to Japanese phrase translation probabilities. Those two phrase tables are then referred to when identifying a bilingual technical term pair, given a parallel sen A consecutive sequence of numbers as well as a consecutive sequence of alphabetical characters are segmented into a token. 5 Chinese sentences are segmented into a sequence of morphemes by the Chinese morphological analyzer Stanford Word Segment (Tseng et al., 2005) trained with Chinese Penn Treebank. tence pair S J,S C and a Japanese technical term t J,or a Chinese technical term t C. In the direction of Japanese to Chinese, given a parallel sentence pair S J,S C containing a Japanese technical term t J, Chinese translation candidates collected from the Japanese to Chinese phrase table are matched against the Chinese sentence S C of the parallel sentence pair. Among those found in S C, ˆt C with the largest translation probability P (t C t J ) is selected and the bilingual technical term pair t J, ˆt C is identified. Similarly, in the opposite direction of Chinese to Japanese, given a parallel sentence pair S J,S C containing a Chinese technical term t C, the Chinese to Japanese phrase table is referred to when identifying a bilingual technical term pair. 4. Developing a Reference Set of Bilingual Synonymous Technical Terms When developing a reference set of bilingual synonymous technical terms (detailed procedure to be found in Liang et al. (2011a)), starting from a seed bilingual term pair s JC = s J,s C, we repeat the translation estimation procedure of the previous section six times and generate the set CBP(s J ) of candidates of bilingual synonymous technical term pairs. Figure 1 illustrates the whole procedure. Then, we manually divide the set CBP(s J ) into SBP(s JC ), those of which are synonymous with s JC, and the remaining NSBP(s JC ). As in Table 1, we collect 114 seeds, where the number of bilingual technical terms included in SBP(s JC ) in total for all of the 114 seed bilin-

3 Table 1: Number of Bilingual Technical Terms: Candidates and Reference of Synonyms Candidates of Synonyms s J CBP(s J ) Reference of Synonyms s JC SBP(s JC ) Candidates of Synonyms s J CBP(s J ) Reference of Synonyms s JC SBP(s JC ) (a) With the Phrase Table based on Chinese Sentences Segmented by Characters # of bilingual technical terms for the total 114 seeds in the set (a) in the set (a) 8, ,563 13, ,496 2, (b) With the Phrase Table based on Chinese Sentences Segmented by Morphemes # of bilingual technical terms for the total 114 seeds in the set (b) in the set (b) 14, ,948 14, ,604 2, average per seed average per seed gual technical term pairs is around 2,500 to 2,600, which amounts to around 22 per seed on average. It can be also seen from Table 1 that although about 90% of reference of synonymous technical terms are shared by the two types of segmentation (by characters and by morphemes), only about 40% to 50% of candidates of synonymous technical terms are shared by the two types of segmentation. 5. Identifying Bilingual Synonymous Technical Terms by Machine Learning In this section, we apply the SVMs to the task of identifying bilingual synonymous technical terms. In this paper, we model the task of identifying bilingual synonymous technical terms by the SVMs as that of judging whether or not the input bilingual term pair t J,t C is synonymous with the seed bilingual technical term pair s JC = s J,s C The Procedure First, let CBP be the union of the sets CBP(s J ) of candidates of bilingual synonymous technical term pairs for all of the 114 seed bilingual technical term pairs. In the training and testing of the classifier for identifying bilingual synonymous technical terms, we first divide the set of 114 seed bilingual technical term pairs into 10 subsets. Here, for each i-th subset (i =1,...,10), we construct the union CBP i of the sets CBP(s J ) of candidates of bilingual synonymous technical term pairs, where CBP 1,...,CBP 10 are 10 disjoint subsets 6 of CBP. 6 Here, we divide the set of 114 seed bilingual technical term pairs into 10 subsets so that the numbers of positive (i.e., syn- As a tool for learning SVMs, we use TinySVM ( chasen.org/ taku/software/tinysvm/). As the kernel function, we use the polynomial (1st order) kernel 7. In the testing of a SVMs classifier, we regard the distance from the separating hyperplane to each test instance as a confidence measure, and return test instances satisfying confidence measures over a certain lower bound only as positive samples (i.e., synonymous with the seed). In the training of SVMs, we use 8 subsets out of the whole 10 subsets CBP 1,...,CBP 10. Then, we tune the lower bound of the confidence measure with one of the remaining two subsets. With this subset, we also tune the parameter of TinySVM for trade-off between training error and margin. Finally, we test the trained classifier against another one of the remaining two subsets. We repeat this procedure of training / tuning / testing 10 times, and average the 10 results of test performance Features Table 2 lists all the features used for training and testing of SVMs for identifying bilingual synonymous technical terms. Features are roughly divided into two types: those of the first type f 1,...,f 6 simply represent various characteristics of the input bilingual technical term t J,t C, while those of the second type f 7,...,f 16 represent relation of the input bilingual technical term t J,t C and the onymous with the seed) / negative (i.e., not synonymous with the seed) samples in each CBP i (i = 1,...,10) are comparative among the 10 subsets. 7 We compare the performance of the 1st order and 2nd order kernels, where we have almost comparative performance.

4 Table 2: Features for Identifying Bilingual Synonymous Technical Terms by Machine Learning class features for bilingual technical terms t J,t C features for the relation of bilingual technical terms t J,t C and the seed s J,s C definition feature ( where X denotes J or C, and s J,s C denotes the seed bilingual technical term pair ) f 1 : frequency log of the frequency of t J,t C within the whole parallel patent sentences f 2 : rank of the Chinese term given t J, log of the rank of t C with respect to the descending order of the conditional translation probability P(t C t J ) f 3 : rank of the Japanese term given t C, log of the rank of t J with respect to the descending order of the conditional translation probability P(t J t C ) f 4 : number of Japanese characters number of characters in t J f 5 : number of Chinese charac- number of characters in t C f 6 : ters number of times generating translation by applying the phrase tables the number of times repeating the procedure of generating translation by applying the phrase tables until generating t C or t J from s J, as in s C t J t C,or,s J t C t J f 7 : identity of Japanese terms returns 1 when t J = s J f 8 : identity of Chinese terms returns 1 when t C = s C ED(tX,sX) f 9 : edit distance similarity of f 9 (t X,s X )=1 max( t X, s X ) (where ED is the edit distance monolingual terms of t X and s X, and t denotes the number of characters of t.) bigram(tx ) bigram(sx ) f 10 : character bigram similarity f 10 (t X,s X )= max( t X, s X ) 1 (where bigram(t) is of monolingual terms the set of character bigrams of the term t.) const(tj ) const(sj ) f 11 : rate of identical morphemes f 11 (t J,s J )= max( const(t J ), const(s J ) ) (where const(t) is the (for Japanese terms) set of morphemes in the Japanese term t.) const(t f 12 : rate of identical characters f 11 (t C,s C ) = C) const(s C ) max( const(t C), const(s C) ) (where const(t) is (for Chinese terms) the set of Characters in the Chinese term t.) f 13 : subsumption relation of returns 1 when the difference of t J and s J is only in their suffixes, strings / variants relation of or only whether or not having the prolonged sound, or only in surface forms (for Japanese their hiragana parts. terms ) f 14 : identical stem (for Chinese returns 1 when the difference of t C and s C is only whether or not terms) haing the word which is not the prefix or suffix. f 15 : f 16 : rate of intersection in translation by the phrase table translation by the phrase table f 15 (t X,s X )= ( where trans(t) is the set of translation of term t from the phrase table.) returns 1 when s J can be generated by translating t E with the phrase table, or, s E can be generated by translating t J with the phrase table. trans(tx) trans(sx ) max( trans(t X), trans(s X) ) seed bilingual technical term pair s JC = s J,s C. Among the features of the first type are the frequency (f 1 ), ranks of terms with respect to the conditional translation probabilities (f 2 and f 3 ), length of terms (f 4 and f 5 ), and the number of times repeating the procedure of generating translation with the phrase tables until generating input terms t J and t C from the Japanese seed term s J (f 6 ). Among the features of the second type are identity of monolingual terms (f 7 and f 8 ), edit distance of monolingual terms (f 9 ), character bigram similarity of monolingual terms (f 10 ), rate of identical morphemes (in Japanese, f 11 ) / characters (in Chinese, f 12 ), string subsumption and variants for Japanese (f 13 ), identical stem for Chinese (f 14 ), rate of intersection in translation by the phrase table (f 15 ), and translation by the phrase tables (f 16 ) Evaluation Results Table 3 shows the evaluation results for a baseline as well as for SVMs. As the baseline, we simply judge the input bilingual term pair t J,t C as synonymous with the seed bilingual technical term pair s JC = s J,s C when t J and s J are identical, or, t C and s C are identical. When training / testing a SVMs classifier, we tune the lower bound of the confidence measure of the distance from the separating hyperplane in two ways: i.e., for maximizing precision and for maximizing F-measure. When maximizing precision, we achieve almost 87% precision where F-measure is over 40%. When maximizing F-measure, we achieve over 60% F-measure with around 71% precision and over 52% recall. As shown in Figure 2, the two types of segmentation of Chinese sentences, namely, by characters and by morphemes, tend to have different types of errors. So, we integrate those two types of segmentation in the form of the intersection of

5 Table 3: Evaluation Results (%) segmented by characters segmented by morphemes intersection precision recall f-measure precision recall f-measure precision recall f-measure baseline (t J and s J are identical, or, t C and s C are identical.) SVM maximum precision maximum f-measure Figure 2: Evaluating Intersection of Judgments by SVM based on Character/Morpheme based Segmentation of Chinese Sentences SVM judgments, where, for both types of segmentation, we tune the lower bound of the confidence measure of the distance from the separating hyperplane. We maximize precision while keeping recall over 25% with held-out data, and this achieves over 90% precision as shown in Table Related Work Among related works on acquiring bilingual lexicon from text, Itagaki et al. (2007) focused on automatic validation of translation pairs available in the phrase table trained by an SMT model. Lu and Tsou (2009) and Yasuda and Sumita (2013) also studied to extract bilingual terms from comparable patents, where, they first extract parallel sentences from comparable patents, and then extract bilingual terms from parallel sentences. Those studies differ from this paper in that those studies did not address the issue of acquiring bilingual synonymous technical terms. Tsunakawa and Tsujii (2008) is mostly related to our study, in that they also proposed to apply machine learning technique to the task of identifying bilingual synonymous technical terms. However, Tsunakawa and Tsujii (2008) studied the issue of identifying bilingual synonymous technical terms only within manually compiled bilingual technical term lexicon and thus are quite limited in its applicability. Our approach, on the other hand, is quite advantageous in that we start from parallel patent documents which continue to be published every year and then, that we can generate candidates of bilingual synonymous technical terms automatically. Our study in this paper is also different from previous works on identifying synonyms based on bilingual and monolingual resources (e.g. Lin and Zhao (2003)) in that we learn bilingual synonymous technical terms from phrase tables of a phrase-based SMT model trained with very large parallel sentences. Also in the context of SMT between Japanese and Chinese, Sun and Lepage (2012) pointed out that character-based segmentation of sentences contributed to improving machine translation performance compared to morpheme-based segmentation of sentences. 7. Conclusion In the task of acquiring Japanese-Chinese technical term translation equivalent pairs from parallel patent documents, this paper considered situations where a technical term is observed in many parallel patent sentences and is translated into many translation equivalents and studied the is-

6 sue of identifying synonymous translation equivalent pairs. We especially examined two types of segmentation of Chinese sentences, i.e., by characters and by morphemes, and integrated those two types of segmentation in the form of the intersection of SVM judgments, which achieved over 90% precision. One of the most important future works is definitely to improve recall. To do this, we plan to apply the semi-automatic framework (Liang et al., 2011b) which have been invented in the task of identifying Japanese- English synonymous translation equivalent pairs and have been proven to be effective in improving recall. We plan to examine whether this semi-automatic framework is also effective in the task of identifying Japanese-Chinese synonymous translation equivalent pairs. 8. References P. Fung and L. Y. Yee An IR approach for translating new words from nonparallel, comparable texts. In Proc. 17th COLING and 36th ACL, pages F. Huang, Y. Zhang, and S. Vogel Mining key phrase translations from Web corpora. In Proc. HLT/EMNLP, pages M. Itagaki, T. Aikawa, and X. He Automatic validation of terminology translation consistency with statistical method. In Proc. MT Summit XI, pages P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst Moses: Open source toolkit for statistical machine translation. In Proc. 45th ACL, Companion Volume, pages B. Liang, T. Utsuro, and M. Yamamoto. 2011a. Identifying bilingual synonymous technical terms from phrase tables and parallel patent sentences. Procedia - Social and Behavioral Sciences, 27: B. Liang, T. Utsuro, and M. Yamamoto. 2011b. Semiautomatic identification of bilingual synonymous technical terms from phrase tables and parallel patent sentences. In Proc. 25th PACLIC, pages D. Lin and S. Zhao Identifying synonyms among distributionally similar words. In Proc. 18th IJCAI, pages B. Lu and B. K. Tsou Towards bilingual term extraction in comparable patents. In Proc. 23rd PACLIC, pages Y. Matsumoto and T. Utsuro Lexical knowledge acquisition. In R. Dale, H. Moisl, and H. Somers, editors, Handbook of Natural Language Processing, chapter 24, pages Marcel Dekker Inc. Y. Morishita, T. Utsuro, and M. Yamamoto Integrating a phrase-based SMT model and a bilingual lexicon for human in semi-automatic acquisition of technical term translation lexicon. In Proc. 8th AMTA, pages J. Sun and Y. Lepage Can word segmentation be considered harmful for statistical machine translation tasks between Japanese and Chinese? In Proc. 26th PACLIC, pages M. Tonoike, M. Kida, T. Takagi, Y. Sasaki, T. Utsuro, and S. Sato A comparative study on compositional translation estimation using a domain/topic-specific corpus collected from the web. In Proc. 2nd Intl. Workshop on Web as Corpus, pages H. Tseng, P. Chang, G. Andrew, D. Jurafsky, and C. Manning A conditional random field word segmenter for Sighan bakeoff In Proc. 4th SIGHAN Workshop on Chinese Language Processing, pages T. Tsunakawa and J. Tsujii Bilingual synonym identification with spelling variations. In Proc. 3rd IJCNLP, pages M. Utiyama and H. Isahara A Japanese-English patent parallel corpus. In Proc. MT Summit XI, pages V. N. Vapnik Statistical Learning Theory. Wiley- Interscience. K. Yasuda and E. Sumita Building a bilingual dictionary from a Japanese-Chinese patent corpus. In Computational Linguistics and Intelligent Text Processing, volume 7817 of LNCS, pages Springer.

Domain Classification of Technical Terms Using the Web

Domain Classification of Technical Terms Using the Web Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using

More information

THUTR: A Translation Retrieval System

THUTR: A Translation Retrieval System THUTR: A Translation Retrieval System Chunyang Liu, Qi Liu, Yang Liu, and Maosong Sun Department of Computer Science and Technology State Key Lab on Intelligent Technology and Systems National Lab for

More information

SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 CWMT2011 技 术 报 告

SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 CWMT2011 技 术 报 告 SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems for CWMT2011 Jin Yang and Satoshi Enoue SYSTRAN Software, Inc. 4444 Eastgate Mall, Suite 310 San Diego, CA 92121, USA E-mail:

More information

7-2 Speech-to-Speech Translation System Field Experiments in All Over Japan

7-2 Speech-to-Speech Translation System Field Experiments in All Over Japan 7-2 Speech-to-Speech Translation System Field Experiments in All Over Japan We explain field experiments conducted during the 2009 fiscal year in five areas of Japan. We also show the experiments of evaluation

More information

SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 统

SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 统 SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems Jin Yang, Satoshi Enoue Jean Senellart, Tristan Croiset SYSTRAN Software, Inc. SYSTRAN SA 9333 Genesee Ave. Suite PL1 La Grande

More information

The TCH Machine Translation System for IWSLT 2008

The TCH Machine Translation System for IWSLT 2008 The TCH Machine Translation System for IWSLT 2008 Haifeng Wang, Hua Wu, Xiaoguang Hu, Zhanyi Liu, Jianfeng Li, Dengjun Ren, Zhengyu Niu Toshiba (China) Research and Development Center 5/F., Tower W2, Oriental

More information

Hybrid Machine Translation Guided by a Rule Based System

Hybrid Machine Translation Guided by a Rule Based System Hybrid Machine Translation Guided by a Rule Based System Cristina España-Bonet, Gorka Labaka, Arantza Díaz de Ilarraza, Lluís Màrquez Kepa Sarasola Universitat Politècnica de Catalunya University of the

More information

Automatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines

Automatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines , 22-24 October, 2014, San Francisco, USA Automatic Mining of Internet Translation Reference Knowledge Based on Multiple Search Engines Baosheng Yin, Wei Wang, Ruixue Lu, Yang Yang Abstract With the increasing

More information

Machine translation techniques for presentation of summaries

Machine translation techniques for presentation of summaries Grant Agreement Number: 257528 KHRESMOI www.khresmoi.eu Machine translation techniques for presentation of summaries Deliverable number D4.6 Dissemination level Public Delivery date April 2014 Status Author(s)

More information

2-3 Automatic Construction Technology for Parallel Corpora

2-3 Automatic Construction Technology for Parallel Corpora 2-3 Automatic Construction Technology for Parallel Corpora We have aligned Japanese and English news articles and sentences, extracted from the Yomiuri and the Daily Yomiuri newspapers, to make a large

More information

Adapting General Models to Novel Project Ideas

Adapting General Models to Novel Project Ideas The KIT Translation Systems for IWSLT 2013 Thanh-Le Ha, Teresa Herrmann, Jan Niehues, Mohammed Mediani, Eunah Cho, Yuqi Zhang, Isabel Slawik and Alex Waibel Institute for Anthropomatics KIT - Karlsruhe

More information

The KIT Translation system for IWSLT 2010

The KIT Translation system for IWSLT 2010 The KIT Translation system for IWSLT 2010 Jan Niehues 1, Mohammed Mediani 1, Teresa Herrmann 1, Michael Heck 2, Christian Herff 2, Alex Waibel 1 Institute of Anthropomatics KIT - Karlsruhe Institute of

More information

Turker-Assisted Paraphrasing for English-Arabic Machine Translation

Turker-Assisted Paraphrasing for English-Arabic Machine Translation Turker-Assisted Paraphrasing for English-Arabic Machine Translation Michael Denkowski and Hassan Al-Haj and Alon Lavie Language Technologies Institute School of Computer Science Carnegie Mellon University

More information

Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features

Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features , pp.273-280 http://dx.doi.org/10.14257/ijdta.2015.8.4.27 Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features Lirong Qiu School of Information Engineering, MinzuUniversity of

More information

Adaptation to Hungarian, Swedish, and Spanish

Adaptation to Hungarian, Swedish, and Spanish www.kconnect.eu Adaptation to Hungarian, Swedish, and Spanish Deliverable number D1.4 Dissemination level Public Delivery date 31 January 2016 Status Author(s) Final Jindřich Libovický, Aleš Tamchyna,

More information

Improving MT System Using Extracted Parallel Fragments of Text from Comparable Corpora

Improving MT System Using Extracted Parallel Fragments of Text from Comparable Corpora mproving MT System Using Extracted Parallel Fragments of Text from Comparable Corpora Rajdeep Gupta, Santanu Pal, Sivaji Bandyopadhyay Department of Computer Science & Engineering Jadavpur University Kolata

More information

Parallel FDA5 for Fast Deployment of Accurate Statistical Machine Translation Systems

Parallel FDA5 for Fast Deployment of Accurate Statistical Machine Translation Systems Parallel FDA5 for Fast Deployment of Accurate Statistical Machine Translation Systems Ergun Biçici Qun Liu Centre for Next Generation Localisation Centre for Next Generation Localisation School of Computing

More information

On-line and Off-line Chinese-Portuguese Translation Service for Mobile Applications

On-line and Off-line Chinese-Portuguese Translation Service for Mobile Applications On-line and Off-line Chinese-Portuguese Translation Service for Mobile Applications 12 12 2 3 Jordi Centelles ', Marta R. Costa-jussa ', Rafael E. Banchs, and Alexander Gelbukh Universitat Politecnica

More information

Factored Translation Models

Factored Translation Models Factored Translation s Philipp Koehn and Hieu Hoang pkoehn@inf.ed.ac.uk, H.Hoang@sms.ed.ac.uk School of Informatics University of Edinburgh 2 Buccleuch Place, Edinburgh EH8 9LW Scotland, United Kingdom

More information

POSBIOTM-NER: A Machine Learning Approach for. Bio-Named Entity Recognition

POSBIOTM-NER: A Machine Learning Approach for. Bio-Named Entity Recognition POSBIOTM-NER: A Machine Learning Approach for Bio-Named Entity Recognition Yu Song, Eunji Yi, Eunju Kim, Gary Geunbae Lee, Department of CSE, POSTECH, Pohang, Korea 790-784 Soo-Jun Park Bioinformatics

More information

TRANSREAD LIVRABLE 3.1 QUALITY CONTROL IN HUMAN TRANSLATIONS: USE CASES AND SPECIFICATIONS. Projet ANR 201 2 CORD 01 5

TRANSREAD LIVRABLE 3.1 QUALITY CONTROL IN HUMAN TRANSLATIONS: USE CASES AND SPECIFICATIONS. Projet ANR 201 2 CORD 01 5 Projet ANR 201 2 CORD 01 5 TRANSREAD Lecture et interaction bilingues enrichies par les données d'alignement LIVRABLE 3.1 QUALITY CONTROL IN HUMAN TRANSLATIONS: USE CASES AND SPECIFICATIONS Avril 201 4

More information

UM-Corpus: A Large English-Chinese Parallel Corpus for Statistical Machine Translation

UM-Corpus: A Large English-Chinese Parallel Corpus for Statistical Machine Translation UM-Corpus: A Large English-Chinese Parallel Corpus for Statistical Machine Translation Liang Tian 1, Derek F. Wong 1, Lidia S. Chao 1, Paulo Quaresma 2,3, Francisco Oliveira 1, Yi Lu 1, Shuo Li 1, Yiming

More information

UEdin: Translating L1 Phrases in L2 Context using Context-Sensitive SMT

UEdin: Translating L1 Phrases in L2 Context using Context-Sensitive SMT UEdin: Translating L1 Phrases in L2 Context using Context-Sensitive SMT Eva Hasler ILCC, School of Informatics University of Edinburgh e.hasler@ed.ac.uk Abstract We describe our systems for the SemEval

More information

Collaborative Machine Translation Service for Scientific texts

Collaborative Machine Translation Service for Scientific texts Collaborative Machine Translation Service for Scientific texts Patrik Lambert patrik.lambert@lium.univ-lemans.fr Jean Senellart Systran SA senellart@systran.fr Laurent Romary Humboldt Universität Berlin

More information

An Empirical Study on Web Mining of Parallel Data

An Empirical Study on Web Mining of Parallel Data An Empirical Study on Web Mining of Parallel Data Gumwon Hong 1, Chi-Ho Li 2, Ming Zhou 2 and Hae-Chang Rim 1 1 Department of Computer Science & Engineering, Korea University {gwhong,rim}@nlp.korea.ac.kr

More information

Building a Web-based parallel corpus and filtering out machinetranslated

Building a Web-based parallel corpus and filtering out machinetranslated Building a Web-based parallel corpus and filtering out machinetranslated text Alexandra Antonova, Alexey Misyurev Yandex 16, Leo Tolstoy St., Moscow, Russia {antonova, misyurev}@yandex-team.ru Abstract

More information

Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection

Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection Gareth J. F. Jones, Declan Groves, Anna Khasin, Adenike Lam-Adesina, Bart Mellebeek. Andy Way School of Computing,

More information

The XMU Phrase-Based Statistical Machine Translation System for IWSLT 2006

The XMU Phrase-Based Statistical Machine Translation System for IWSLT 2006 The XMU Phrase-Based Statistical Machine Translation System for IWSLT 2006 Yidong Chen, Xiaodong Shi Institute of Artificial Intelligence Xiamen University P. R. China November 28, 2006 - Kyoto 13:46 1

More information

ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURKISH CORPUS

ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURKISH CORPUS ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURKISH CORPUS Gürkan Şahin 1, Banu Diri 1 and Tuğba Yıldız 2 1 Faculty of Electrical-Electronic, Department of Computer Engineering

More information

Adaptive Development Data Selection for Log-linear Model in Statistical Machine Translation

Adaptive Development Data Selection for Log-linear Model in Statistical Machine Translation Adaptive Development Data Selection for Log-linear Model in Statistical Machine Translation Mu Li Microsoft Research Asia muli@microsoft.com Dongdong Zhang Microsoft Research Asia dozhang@microsoft.com

More information

A New Input Method for Human Translators: Integrating Machine Translation Effectively and Imperceptibly

A New Input Method for Human Translators: Integrating Machine Translation Effectively and Imperceptibly Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015) A New Input Method for Human Translators: Integrating Machine Translation Effectively and Imperceptibly

More information

Collecting Polish German Parallel Corpora in the Internet

Collecting Polish German Parallel Corpora in the Internet Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska

More information

Cross-Lingual Concern Analysis from Multilingual Weblog Articles

Cross-Lingual Concern Analysis from Multilingual Weblog Articles Cross-Lingual Concern Analysis from Multilingual Weblog Articles Tomohiro Fukuhara RACE (Research into Artifacts), The University of Tokyo 5-1-5 Kashiwanoha, Kashiwa, Chiba JAPAN http://www.race.u-tokyo.ac.jp/~fukuhara/

More information

Human in the Loop Machine Translation of Medical Terminology

Human in the Loop Machine Translation of Medical Terminology Human in the Loop Machine Translation of Medical Terminology by John J. Morgan ARL-MR-0743 April 2010 Approved for public release; distribution unlimited. NOTICES Disclaimers The findings in this report

More information

Segmentation and Punctuation Prediction in Speech Language Translation Using a Monolingual Translation System

Segmentation and Punctuation Prediction in Speech Language Translation Using a Monolingual Translation System Segmentation and Punctuation Prediction in Speech Language Translation Using a Monolingual Translation System Eunah Cho, Jan Niehues and Alex Waibel International Center for Advanced Communication Technologies

More information

The Prague Bulletin of Mathematical Linguistics NUMBER 93 JANUARY 2010 37 46. Training Phrase-Based Machine Translation Models on the Cloud

The Prague Bulletin of Mathematical Linguistics NUMBER 93 JANUARY 2010 37 46. Training Phrase-Based Machine Translation Models on the Cloud The Prague Bulletin of Mathematical Linguistics NUMBER 93 JANUARY 2010 37 46 Training Phrase-Based Machine Translation Models on the Cloud Open Source Machine Translation Toolkit Chaski Qin Gao, Stephan

More information

Getting Off to a Good Start: Best Practices for Terminology

Getting Off to a Good Start: Best Practices for Terminology Getting Off to a Good Start: Best Practices for Terminology Technologies for term bases, term extraction and term checks Angelika Zerfass, zerfass@zaac.de Tools in the Terminology Life Cycle Extraction

More information

Identifying Focus, Techniques and Domain of Scientific Papers

Identifying Focus, Techniques and Domain of Scientific Papers Identifying Focus, Techniques and Domain of Scientific Papers Sonal Gupta Department of Computer Science Stanford University Stanford, CA 94305 sonal@cs.stanford.edu Christopher D. Manning Department of

More information

Translating the Penn Treebank with an Interactive-Predictive MT System

Translating the Penn Treebank with an Interactive-Predictive MT System IJCLA VOL. 2, NO. 1 2, JAN-DEC 2011, PP. 225 237 RECEIVED 31/10/10 ACCEPTED 26/11/10 FINAL 11/02/11 Translating the Penn Treebank with an Interactive-Predictive MT System MARTHA ALICIA ROCHA 1 AND JOAN

More information

Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words

Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words , pp.290-295 http://dx.doi.org/10.14257/astl.2015.111.55 Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words Irfan

More information

ACCURAT Analysis and Evaluation of Comparable Corpora for Under Resourced Areas of Machine Translation www.accurat-project.eu Project no.

ACCURAT Analysis and Evaluation of Comparable Corpora for Under Resourced Areas of Machine Translation www.accurat-project.eu Project no. ACCURAT Analysis and Evaluation of Comparable Corpora for Under Resourced Areas of Machine Translation www.accurat-project.eu Project no. 248347 Deliverable D5.4 Report on requirements, implementation

More information

An Online Service for SUbtitling by MAchine Translation

An Online Service for SUbtitling by MAchine Translation SUMAT CIP-ICT-PSP-270919 An Online Service for SUbtitling by MAchine Translation Annual Public Report 2012 Editor(s): Contributor(s): Reviewer(s): Status-Version: Arantza del Pozo Mirjam Sepesy Maucec,

More information

Customizing an English-Korean Machine Translation System for Patent Translation *

Customizing an English-Korean Machine Translation System for Patent Translation * Customizing an English-Korean Machine Translation System for Patent Translation * Sung-Kwon Choi, Young-Gil Kim Natural Language Processing Team, Electronics and Telecommunications Research Institute,

More information

Micro blogs Oriented Word Segmentation System

Micro blogs Oriented Word Segmentation System Micro blogs Oriented Word Segmentation System Yijia Liu, Meishan Zhang, Wanxiang Che, Ting Liu, Yihe Deng Research Center for Social Computing and Information Retrieval Harbin Institute of Technology,

More information

Applying Statistical Post-Editing to. English-to-Korean Rule-based Machine Translation System

Applying Statistical Post-Editing to. English-to-Korean Rule-based Machine Translation System Applying Statistical Post-Editing to English-to-Korean Rule-based Machine Translation System Ki-Young Lee and Young-Gil Kim Natural Language Processing Team, Electronics and Telecommunications Research

More information

The Prague Bulletin of Mathematical Linguistics NUMBER 96 OCTOBER 2011 49 58. Ncode: an Open Source Bilingual N-gram SMT Toolkit

The Prague Bulletin of Mathematical Linguistics NUMBER 96 OCTOBER 2011 49 58. Ncode: an Open Source Bilingual N-gram SMT Toolkit The Prague Bulletin of Mathematical Linguistics NUMBER 96 OCTOBER 2011 49 58 Ncode: an Open Source Bilingual N-gram SMT Toolkit Josep M. Crego a, François Yvon ab, José B. Mariño c c a LIMSI-CNRS, BP 133,

More information

Convergence of Translation Memory and Statistical Machine Translation

Convergence of Translation Memory and Statistical Machine Translation Convergence of Translation Memory and Statistical Machine Translation Philipp Koehn and Jean Senellart 4 November 2010 Progress in Translation Automation 1 Translation Memory (TM) translators store past

More information

Reliable and Cost-Effective PoS-Tagging

Reliable and Cost-Effective PoS-Tagging Reliable and Cost-Effective PoS-Tagging Yu-Fang Tsai Keh-Jiann Chen Institute of Information Science, Academia Sinica Nanang, Taipei, Taiwan 5 eddie,chen@iis.sinica.edu.tw Abstract In order to achieve

More information

Polish - English Statistical Machine Translation of Medical Texts.

Polish - English Statistical Machine Translation of Medical Texts. Polish - English Statistical Machine Translation of Medical Texts. Krzysztof Wołk, Krzysztof Marasek Department of Multimedia Polish Japanese Institute of Information Technology kwolk@pjwstk.edu.pl Abstract.

More information

Term extraction for user profiling: evaluation by the user

Term extraction for user profiling: evaluation by the user Term extraction for user profiling: evaluation by the user Suzan Verberne 1, Maya Sappelli 1,2, Wessel Kraaij 1,2 1 Institute for Computing and Information Sciences, Radboud University Nijmegen 2 TNO,

More information

Chinese-Japanese Machine Translation Exploiting Chinese Characters

Chinese-Japanese Machine Translation Exploiting Chinese Characters Chinese-Japanese Machine Translation Exploiting Chinese Characters CHENHUI CHU, TOSHIAKI NAKAZAWA, DAISUKE KAWAHARA, and SADAO KUROHASHI, Kyoto University The Chinese and Japanese languages share Chinese

More information

Identifying Prepositional Phrases in Chinese Patent Texts with. Rule-based and CRF Methods

Identifying Prepositional Phrases in Chinese Patent Texts with. Rule-based and CRF Methods Identifying Prepositional Phrases in Chinese Patent Texts with Rule-based and CRF Methods Hongzheng Li and Yaohong Jin Institute of Chinese Information Processing, Beijing Normal University 19, Xinjiekou

More information

An Iteratively-Trained Segmentation-Free Phrase Translation Model for Statistical Machine Translation

An Iteratively-Trained Segmentation-Free Phrase Translation Model for Statistical Machine Translation An Iteratively-Trained Segmentation-Free Phrase Translation Model for Statistical Machine Translation Robert C. Moore Chris Quirk Microsoft Research Redmond, WA 98052, USA {bobmoore,chrisq}@microsoft.com

More information

Jane 2: Open Source Phrase-based and Hierarchical Statistical Machine Translation

Jane 2: Open Source Phrase-based and Hierarchical Statistical Machine Translation Jane 2: Open Source Phrase-based and Hierarchical Statistical Machine Translation Joern Wuebker M at thias Huck Stephan Peitz M al te Nuhn M arkus F reitag Jan-Thorsten Peter Saab M ansour Hermann N e

More information

The United Nations Parallel Corpus v1.0

The United Nations Parallel Corpus v1.0 The United Nations Parallel Corpus v1.0 Michał Ziemski, Marcin Junczys-Dowmunt, Bruno Pouliquen United Nations, DGACM, New York, United States of America Adam Mickiewicz University, Poznań, Poland World

More information

The Transition of Phrase based to Factored based Translation for Tamil language in SMT Systems

The Transition of Phrase based to Factored based Translation for Tamil language in SMT Systems The Transition of Phrase based to Factored based Translation for Tamil language in SMT Systems Dr. Ananthi Sheshasaayee 1, Angela Deepa. V.R 2 1 Research Supervisior, Department of Computer Science & Application,

More information

The Prague Bulletin of Mathematical Linguistics NUMBER 93 JANUARY 2010 7 16

The Prague Bulletin of Mathematical Linguistics NUMBER 93 JANUARY 2010 7 16 The Prague Bulletin of Mathematical Linguistics NUMBER 93 JANUARY 21 7 16 A Productivity Test of Statistical Machine Translation Post-Editing in a Typical Localisation Context Mirko Plitt, François Masselot

More information

The Impact of Morphological Errors in Phrase-based Statistical Machine Translation from English and German into Swedish

The Impact of Morphological Errors in Phrase-based Statistical Machine Translation from English and German into Swedish The Impact of Morphological Errors in Phrase-based Statistical Machine Translation from English and German into Swedish Oscar Täckström Swedish Institute of Computer Science SE-16429, Kista, Sweden oscar@sics.se

More information

Evaluation of Bayesian Spam Filter and SVM Spam Filter

Evaluation of Bayesian Spam Filter and SVM Spam Filter Evaluation of Bayesian Spam Filter and SVM Spam Filter Ayahiko Niimi, Hirofumi Inomata, Masaki Miyamoto and Osamu Konishi School of Systems Information Science, Future University-Hakodate 116 2 Kamedanakano-cho,

More information

Feature Selection for Electronic Negotiation Texts

Feature Selection for Electronic Negotiation Texts Feature Selection for Electronic Negotiation Texts Marina Sokolova, Vivi Nastase, Mohak Shah and Stan Szpakowicz School of Information Technology and Engineering, University of Ottawa, Ottawa ON, K1N 6N5,

More information

English to Arabic Transliteration for Information Retrieval: A Statistical Approach

English to Arabic Transliteration for Information Retrieval: A Statistical Approach English to Arabic Transliteration for Information Retrieval: A Statistical Approach Nasreen AbdulJaleel and Leah S. Larkey Center for Intelligent Information Retrieval Computer Science, University of Massachusetts

More information

Wikipedia and Web document based Query Translation and Expansion for Cross-language IR

Wikipedia and Web document based Query Translation and Expansion for Cross-language IR Wikipedia and Web document based Query Translation and Expansion for Cross-language IR Ling-Xiang Tang 1, Andrew Trotman 2, Shlomo Geva 1, Yue Xu 1 1Faculty of Science and Technology, Queensland University

More information

Revisiting Context-based Projection Methods for Term-Translation Spotting in Comparable Corpora

Revisiting Context-based Projection Methods for Term-Translation Spotting in Comparable Corpora Revisiting Context-based Projection Methods for Term-Translation Spotting in Comparable Corpora Audrey Laroche OLST Dép. de linguistique et de traduction Université de Montréal audrey.laroche@umontreal.ca

More information

T U R K A L A T O R 1

T U R K A L A T O R 1 T U R K A L A T O R 1 A Suite of Tools for Augmenting English-to-Turkish Statistical Machine Translation by Gorkem Ozbek [gorkem@stanford.edu] Siddharth Jonathan [jonsid@stanford.edu] CS224N: Natural Language

More information

Fast-Champollion: A Fast and Robust Sentence Alignment Algorithm

Fast-Champollion: A Fast and Robust Sentence Alignment Algorithm Fast-Champollion: A Fast and Robust Sentence Alignment Algorithm Peng Li and Maosong Sun Department of Computer Science and Technology State Key Lab on Intelligent Technology and Systems National Lab for

More information

TRANSREAD: Designing a Bilingual Reading Experience with Machine Translation Technologies

TRANSREAD: Designing a Bilingual Reading Experience with Machine Translation Technologies TRANSREAD: Designing a Bilingual Reading Experience with Machine Translation Technologies François Yvon and Yong Xu and Marianna Apidianaki LIMSI, CNRS, Université Paris-Saclay 91 403 Orsay {yvon,yong,marianna}@limsi.fr

More information

Cache-based Online Adaptation for Machine Translation Enhanced Computer Assisted Translation

Cache-based Online Adaptation for Machine Translation Enhanced Computer Assisted Translation Cache-based Online Adaptation for Machine Translation Enhanced Computer Assisted Translation Nicola Bertoldi Mauro Cettolo Marcello Federico FBK - Fondazione Bruno Kessler via Sommarive 18 38123 Povo,

More information

Enriching Morphologically Poor Languages for Statistical Machine Translation

Enriching Morphologically Poor Languages for Statistical Machine Translation Enriching Morphologically Poor Languages for Statistical Machine Translation Eleftherios Avramidis e.avramidis@sms.ed.ac.uk Philipp Koehn pkoehn@inf.ed.ac.uk School of Informatics University of Edinburgh

More information

Protein-protein Interaction Passage Extraction Using the Interaction Pattern Kernel Approach for the BioCreative 2015 BioC Track

Protein-protein Interaction Passage Extraction Using the Interaction Pattern Kernel Approach for the BioCreative 2015 BioC Track Protein-protein Interaction Passage Extraction Using the Interaction Pattern Kernel Approach for the BioCreative 2015 BioC Track Yung-Chun Chang 1,2, Yu-Chen Su 3, Chun-Han Chu 1, Chien Chin Chen 2 and

More information

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

More information

A Survey on Product Aspect Ranking Techniques

A Survey on Product Aspect Ranking Techniques A Survey on Product Aspect Ranking Techniques Ancy. J. S, Nisha. J.R P.G. Scholar, Dept. of C.S.E., Marian Engineering College, Kerala University, Trivandrum, India. Asst. Professor, Dept. of C.S.E., Marian

More information

Enriching the Crosslingual Link Structure of Wikipedia - A Classification-Based Approach -

Enriching the Crosslingual Link Structure of Wikipedia - A Classification-Based Approach - Enriching the Crosslingual Link Structure of Wikipedia - A Classification-Based Approach - Philipp Sorg and Philipp Cimiano Institute AIFB, University of Karlsruhe, D-76128 Karlsruhe, Germany {sorg,cimiano}@aifb.uni-karlsruhe.de

More information

Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology

Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology Extraction of Legal Definitions from a Japanese Statutory Corpus Toward Construction of a Legal Term Ontology Makoto Nakamura, Yasuhiro Ogawa, Katsuhiko Toyama Japan Legal Information Institute, Graduate

More information

Statistical Machine Translation prototype using UN parallel documents

Statistical Machine Translation prototype using UN parallel documents Proceedings of the 16th EAMT Conference, 28-30 May 2012, Trento, Italy Statistical Machine Translation prototype using UN parallel documents Bruno Pouliquen, Christophe Mazenc World Intellectual Property

More information

The University of Maryland Statistical Machine Translation System for the Fifth Workshop on Machine Translation

The University of Maryland Statistical Machine Translation System for the Fifth Workshop on Machine Translation The University of Maryland Statistical Machine Translation System for the Fifth Workshop on Machine Translation Vladimir Eidelman, Chris Dyer, and Philip Resnik UMIACS Laboratory for Computational Linguistics

More information

An Iterative approach to extract dictionaries from Wikipedia for under-resourced languages

An Iterative approach to extract dictionaries from Wikipedia for under-resourced languages An Iterative approach to extract dictionaries from Wikipedia for under-resourced languages Rohit Bharadwaj G SIEL, LTRC IIIT Hyd bharadwaj@research.iiit.ac.in Niket Tandon Databases and Information Systems

More information

LIUM s Statistical Machine Translation System for IWSLT 2010

LIUM s Statistical Machine Translation System for IWSLT 2010 LIUM s Statistical Machine Translation System for IWSLT 2010 Anthony Rousseau, Loïc Barrault, Paul Deléglise, Yannick Estève Laboratoire Informatique de l Université du Maine (LIUM) University of Le Mans,

More information

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter Gerard Briones and Kasun Amarasinghe and Bridget T. McInnes, PhD. Department of Computer Science Virginia Commonwealth University Richmond,

More information

Statistical Machine Translation

Statistical Machine Translation Statistical Machine Translation Some of the content of this lecture is taken from previous lectures and presentations given by Philipp Koehn and Andy Way. Dr. Jennifer Foster National Centre for Language

More information

SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY

SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY G.Evangelin Jenifer #1, Mrs.J.Jaya Sherin *2 # PG Scholar, Department of Electronics and Communication Engineering(Communication and Networking), CSI Institute

More information

A Knowledge-Poor Approach to BioCreative V DNER and CID Tasks

A Knowledge-Poor Approach to BioCreative V DNER and CID Tasks A Knowledge-Poor Approach to BioCreative V DNER and CID Tasks Firoj Alam 1, Anna Corazza 2, Alberto Lavelli 3, and Roberto Zanoli 3 1 Dept. of Information Eng. and Computer Science, University of Trento,

More information

End-to-End Sentiment Analysis of Twitter Data

End-to-End Sentiment Analysis of Twitter Data End-to-End Sentiment Analysis of Twitter Data Apoor v Agarwal 1 Jasneet Singh Sabharwal 2 (1) Columbia University, NY, U.S.A. (2) Guru Gobind Singh Indraprastha University, New Delhi, India apoorv@cs.columbia.edu,

More information

FOR IMMEDIATE RELEASE

FOR IMMEDIATE RELEASE FOR IMMEDIATE RELEASE Hitachi Developed Basic Artificial Intelligence Technology that Enables Logical Dialogue Analyzes huge volumes of text data on issues under debate, and presents reasons and grounds

More information

Hybrid Approach to English-Hindi Name Entity Transliteration

Hybrid Approach to English-Hindi Name Entity Transliteration Hybrid Approach to English-Hindi Name Entity Transliteration Shruti Mathur Department of Computer Engineering Government Women Engineering College Ajmer, India shrutimathur19@gmail.com Varun Prakash Saxena

More information

Open Domain Information Extraction. Günter Neumann, DFKI, 2012

Open Domain Information Extraction. Günter Neumann, DFKI, 2012 Open Domain Information Extraction Günter Neumann, DFKI, 2012 Improving TextRunner Wu and Weld (2010) Open Information Extraction using Wikipedia, ACL 2010 Fader et al. (2011) Identifying Relations for

More information

TS3: an Improved Version of the Bilingual Concordancer TransSearch

TS3: an Improved Version of the Bilingual Concordancer TransSearch TS3: an Improved Version of the Bilingual Concordancer TransSearch Stéphane HUET, Julien BOURDAILLET and Philippe LANGLAIS EAMT 2009 - Barcelona June 14, 2009 Computer assisted translation Preferred by

More information

Statistical Machine Translation: IBM Models 1 and 2

Statistical Machine Translation: IBM Models 1 and 2 Statistical Machine Translation: IBM Models 1 and 2 Michael Collins 1 Introduction The next few lectures of the course will be focused on machine translation, and in particular on statistical machine translation

More information

Chapter 5. Phrase-based models. Statistical Machine Translation

Chapter 5. Phrase-based models. Statistical Machine Translation Chapter 5 Phrase-based models Statistical Machine Translation Motivation Word-Based Models translate words as atomic units Phrase-Based Models translate phrases as atomic units Advantages: many-to-many

More information

An Iterative Link-based Method for Parallel Web Page Mining

An Iterative Link-based Method for Parallel Web Page Mining An Iterative Link-based Method for Parallel Web Page Mining Le Liu 1, Yu Hong 1, Jun Lu 2, Jun Lang 2, Heng Ji 3, Jianmin Yao 1 1 School of Computer Science & Technology, Soochow University, Suzhou, 215006,

More information

Overview of MT techniques. Malek Boualem (FT)

Overview of MT techniques. Malek Boualem (FT) Overview of MT techniques Malek Boualem (FT) This section presents an standard overview of general aspects related to machine translation with a description of different techniques: bilingual, transfer,

More information

Statistical Pattern-Based Machine Translation with Statistical French-English Machine Translation

Statistical Pattern-Based Machine Translation with Statistical French-English Machine Translation Statistical Pattern-Based Machine Translation with Statistical French-English Machine Translation Jin'ichi Murakami, Takuya Nishimura, Masato Tokuhisa Tottori University, Japan Problems of Phrase-Based

More information

Factored Markov Translation with Robust Modeling

Factored Markov Translation with Robust Modeling Factored Markov Translation with Robust Modeling Yang Feng Trevor Cohn Xinkai Du Information Sciences Institue Computing and Information Systems Computer Science Department The University of Melbourne

More information

Harvesting parallel text in multiple languages with limited supervision

Harvesting parallel text in multiple languages with limited supervision Harvesting parallel text in multiple languages with limited supervision Luciano Barbosa 1 V ivek Kumar Rangarajan Sridhar 1 Mahsa Y armohammadi 2 Srinivas Bangalore 1 (1) AT&T Labs - Research, 180 Park

More information

Language Modeling. Chapter 1. 1.1 Introduction

Language Modeling. Chapter 1. 1.1 Introduction Chapter 1 Language Modeling (Course notes for NLP by Michael Collins, Columbia University) 1.1 Introduction In this chapter we will consider the the problem of constructing a language model from a set

More information

Author Gender Identification of English Novels

Author Gender Identification of English Novels Author Gender Identification of English Novels Joseph Baena and Catherine Chen December 13, 2013 1 Introduction Machine learning algorithms have long been used in studies of authorship, particularly in

More information

Domain Adaptation of Statistical Machine Translation using Web-Crawled Resources: A Case Study

Domain Adaptation of Statistical Machine Translation using Web-Crawled Resources: A Case Study Proceedings of the 16th EAMT Conference, 28-30 May 2012, Trento, Italy Domain Adaptation of Statistical Machine Translation using Web-Crawled Resources: A Case Study Pavel Pecina 1, Antonio Toral 2, Vassilis

More information

Pre-processing of Bilingual Corpora for Mandarin-English EBMT

Pre-processing of Bilingual Corpora for Mandarin-English EBMT Pre-processing of Bilingual Corpora for Mandarin-English EBMT Ying Zhang, Ralf Brown, Robert Frederking, Alon Lavie Language Technologies Institute, Carnegie Mellon University NSH, 5000 Forbes Ave. Pittsburgh,

More information

A Systematic Cross-Comparison of Sequence Classifiers

A Systematic Cross-Comparison of Sequence Classifiers A Systematic Cross-Comparison of Sequence Classifiers Binyamin Rozenfeld, Ronen Feldman, Moshe Fresko Bar-Ilan University, Computer Science Department, Israel grurgrur@gmail.com, feldman@cs.biu.ac.il,

More information

Web based English-Chinese OOV term translation using Adaptive rules and Recursive feature selection

Web based English-Chinese OOV term translation using Adaptive rules and Recursive feature selection Web based English-Chinese OOV term translation using Adaptive rules and Recursive feature selection Jian Qu, Nguyen Le Minh, Akira Shimazu School of Information Science, JAIST Ishikawa, Japan 923-1292

More information

Sentiment analysis: towards a tool for analysing real-time students feedback

Sentiment analysis: towards a tool for analysing real-time students feedback Sentiment analysis: towards a tool for analysing real-time students feedback Nabeela Altrabsheh Email: nabeela.altrabsheh@port.ac.uk Mihaela Cocea Email: mihaela.cocea@port.ac.uk Sanaz Fallahkhair Email:

More information