Identifying Japanese-Chinese Bilingual Synonymous Technical Terms from Patent Families

Transcription

1 Identifying Japanese-Chinese Bilingual Synonymous Technical Terms from Patent Families Zi Long Lijuan Dong Takehito Utsuro Tomoharu Mitsuhashi Mikio Yamamoto Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba, , Japan Japan Patent Information Organization, 4-1-7, Toyo, Koto-ku, Tokyo, , Japan Abstract In the task of acquiring Japanese-Chinese technical term translation equivalent pairs from parallel patent documents, this paper considers situations where a technical term is observed in many parallel patent sentences and is translated into many translation equivalents and studies the issue of identifying synonymous translation equivalent pairs. First, we collect candidates of synonymous translation equivalent pairs from parallel patent sentences. Then, we apply the Support Vector Machines (SVMs) to the task of identifying bilingual synonymous technical terms, and achieve the performance of over 85% precision and over 60% F-measure. We further examine two types of segmentation of Chinese sentences, i.e., by characters and by morphemes, and integrate those two types of segmentation in the form of the intersection of SVM judgments, which achieved over 90% precision. Keywords: synonymous technical terms, patent families, technical term translation 1. Introduction For both high quality machine and human translation, a large scale and high quality bilingual lexicon is the most important key resource. Since manual compilation of bilingual lexicon requires plenty of time and huge manual labor, in the research area of knowledge acquisition from natural language text, automatic bilingual lexicon compilation have been studied. Techniques invented so far include translation term pair acquisition based on statistical co-occurrence measure from parallel sentences (Matsumoto and Utsuro, 2000), translation term pair acquisition from comparable corpora (Fung and Yee, 1998), compositional translation generation based on an existing bilingual lexicon for human use (Tonoike et al., 2006), and translation term pair acquisition by collecting partially bilingual texts through the search engine (Huang et al., 2005). Among those efforts of acquiring bilingual lexicon from text, Morishita et al. (2008) studied to acquire Japanese- English technical term translation lexicon from phrase tables, which are trained by a phrase-based SMT model with parallel sentences automatically extracted from parallel patent documents. In more recent studies, they require the acquired technical term translation equivalents to be consistent with word alignment in parallel sentences and achieved 91.9% precision with almost 70% recall. Furthermore, based on the achievement above, Liang et al. (2011a) considered situations where a technical term is observed in many parallel patent sentences and is translated into many translation equivalents. More specifically, in the task of acquiring Japanese-English technical term translation equivalent pairs, Liang et al. (2011a) studied the issue of identifying Japanese-English synonymous translation equivalent pairs. First, they collect candidates of synonymous translation equivalent pairs from parallel patent sentences. Then, they apply the Support Vector Machines (SVMs) (Vapnik, 1998) to the task of identifying bilingual synonymous technical terms. Based on the technique and the results of identifying Japanese-English synonymous translation equivalent pairs in Liang et al. (2011a), we aim at identifying Japanese- Chinese synonymous translation equivalent pairs from Japanese-Chinese patent families. We especially examine two types of segmentation of Chinese sentences, namely, by characters and by morphemes. Although both types of segmentation achieved almost similar performance around 95 97% (in recall / precision / f-measure) in the task of acquiring Japanese-Chinese technical term translation pairs, they have different types of errors. Also in the task of identifying Japanese-Chinese synonymous technical terms, both types of segmentation achieved almost similar performance, while they have different types of errors. Thus, we integrate those two types of segmentation in the form of the intersection of SVM judgments, and show that this achieves over 90% precision. 2. Japanese-Chinese Parallel Patent Documents Japanese-Chinese parallel patent documents are collected from the Japanese patent documents published by the Japanese Patent Office (JPO) in and the Chinese patent documents published by State Intellectual Property Office of the People s Republic of China (SIPO) in From them, we extract 312,492 patent families, and the method of Utiyama and Isahara (2007) is applied 1 to the text of those patent families, and Japanese and Chinese sentences are aligned. In this paper, we use 3.6M parallel patent sentences with the highest scores of sentence alignment. 3. Phrase Table of an SMT Model As a toolkit of a phrase-based SMT model, we use Moses (Koehn et al., 2007) and apply it to the whole 3.6M parallel patent sentences. Before applying Moses, Japanese sentences are segmented into a sequence of morphemes by the Japanese morphological analyzer MeCab 2 with the 1 Here, we used a Japanese-Chinese translation lexicon consisting of about 170,000 Chinese head words. 2

2 Figure 1: Developing a Reference Set of Bilingual Synonymous Technical Terms morpheme lexicon IPAdic 3. For Chinese sentences, we examine two types of segmentation, i.e., segmentation by characters 4 and segmentation by morphemes 5. As the result of applying Moses, we have a phrase table in the direction of Japanese to Chinese translation, and another one in the opposite direction of Chinese to Japanese translation. In the direction of Japanese to Chinese translation, we finally obtain 108M (Chinese sentences segmented by morphemes) / 274M (Chinese sentences segmented by characters) translation pairs with 75M / 197M unique Japanese phrases with Japanese to Chinese phrase translation probabilities P (p C p J ) of translating a Japanese phrase p J into a Chinese phrase p C. For each Japanese phrase, those multiple translation candidates in the phrase table are ranked in descending order of Japanese to Chinese phrase translation probabilities. In the similar way, in the phrase table in the opposite direction of Chinese to Japanese translation, for each Chinese phrase, multiple Japanese translation candidates are ranked in descending order of Chinese to Japanese phrase translation probabilities. Those two phrase tables are then referred to when identifying a bilingual technical term pair, given a parallel sen A consecutive sequence of numbers as well as a consecutive sequence of alphabetical characters are segmented into a token. 5 Chinese sentences are segmented into a sequence of morphemes by the Chinese morphological analyzer Stanford Word Segment (Tseng et al., 2005) trained with Chinese Penn Treebank. tence pair S J,S C and a Japanese technical term t J,or a Chinese technical term t C. In the direction of Japanese to Chinese, given a parallel sentence pair S J,S C containing a Japanese technical term t J, Chinese translation candidates collected from the Japanese to Chinese phrase table are matched against the Chinese sentence S C of the parallel sentence pair. Among those found in S C, ˆt C with the largest translation probability P (t C t J ) is selected and the bilingual technical term pair t J, ˆt C is identified. Similarly, in the opposite direction of Chinese to Japanese, given a parallel sentence pair S J,S C containing a Chinese technical term t C, the Chinese to Japanese phrase table is referred to when identifying a bilingual technical term pair. 4. Developing a Reference Set of Bilingual Synonymous Technical Terms When developing a reference set of bilingual synonymous technical terms (detailed procedure to be found in Liang et al. (2011a)), starting from a seed bilingual term pair s JC = s J,s C, we repeat the translation estimation procedure of the previous section six times and generate the set CBP(s J ) of candidates of bilingual synonymous technical term pairs. Figure 1 illustrates the whole procedure. Then, we manually divide the set CBP(s J ) into SBP(s JC ), those of which are synonymous with s JC, and the remaining NSBP(s JC ). As in Table 1, we collect 114 seeds, where the number of bilingual technical terms included in SBP(s JC ) in total for all of the 114 seed bilin-

3 Table 1: Number of Bilingual Technical Terms: Candidates and Reference of Synonyms Candidates of Synonyms s J CBP(s J ) Reference of Synonyms s JC SBP(s JC ) Candidates of Synonyms s J CBP(s J ) Reference of Synonyms s JC SBP(s JC ) (a) With the Phrase Table based on Chinese Sentences Segmented by Characters # of bilingual technical terms for the total 114 seeds in the set (a) in the set (a) 8, ,563 13, ,496 2, (b) With the Phrase Table based on Chinese Sentences Segmented by Morphemes # of bilingual technical terms for the total 114 seeds in the set (b) in the set (b) 14, ,948 14, ,604 2, average per seed average per seed gual technical term pairs is around 2,500 to 2,600, which amounts to around 22 per seed on average. It can be also seen from Table 1 that although about 90% of reference of synonymous technical terms are shared by the two types of segmentation (by characters and by morphemes), only about 40% to 50% of candidates of synonymous technical terms are shared by the two types of segmentation. 5. Identifying Bilingual Synonymous Technical Terms by Machine Learning In this section, we apply the SVMs to the task of identifying bilingual synonymous technical terms. In this paper, we model the task of identifying bilingual synonymous technical terms by the SVMs as that of judging whether or not the input bilingual term pair t J,t C is synonymous with the seed bilingual technical term pair s JC = s J,s C The Procedure First, let CBP be the union of the sets CBP(s J ) of candidates of bilingual synonymous technical term pairs for all of the 114 seed bilingual technical term pairs. In the training and testing of the classifier for identifying bilingual synonymous technical terms, we first divide the set of 114 seed bilingual technical term pairs into 10 subsets. Here, for each i-th subset (i =1,...,10), we construct the union CBP i of the sets CBP(s J ) of candidates of bilingual synonymous technical term pairs, where CBP 1,...,CBP 10 are 10 disjoint subsets 6 of CBP. 6 Here, we divide the set of 114 seed bilingual technical term pairs into 10 subsets so that the numbers of positive (i.e., syn- As a tool for learning SVMs, we use TinySVM ( chasen.org/ taku/software/tinysvm/). As the kernel function, we use the polynomial (1st order) kernel 7. In the testing of a SVMs classifier, we regard the distance from the separating hyperplane to each test instance as a confidence measure, and return test instances satisfying confidence measures over a certain lower bound only as positive samples (i.e., synonymous with the seed). In the training of SVMs, we use 8 subsets out of the whole 10 subsets CBP 1,...,CBP 10. Then, we tune the lower bound of the confidence measure with one of the remaining two subsets. With this subset, we also tune the parameter of TinySVM for trade-off between training error and margin. Finally, we test the trained classifier against another one of the remaining two subsets. We repeat this procedure of training / tuning / testing 10 times, and average the 10 results of test performance Features Table 2 lists all the features used for training and testing of SVMs for identifying bilingual synonymous technical terms. Features are roughly divided into two types: those of the first type f 1,...,f 6 simply represent various characteristics of the input bilingual technical term t J,t C, while those of the second type f 7,...,f 16 represent relation of the input bilingual technical term t J,t C and the onymous with the seed) / negative (i.e., not synonymous with the seed) samples in each CBP i (i = 1,...,10) are comparative among the 10 subsets. 7 We compare the performance of the 1st order and 2nd order kernels, where we have almost comparative performance.

4 Table 2: Features for Identifying Bilingual Synonymous Technical Terms by Machine Learning class features for bilingual technical terms t J,t C features for the relation of bilingual technical terms t J,t C and the seed s J,s C definition feature ( where X denotes J or C, and s J,s C denotes the seed bilingual technical term pair ) f 1 : frequency log of the frequency of t J,t C within the whole parallel patent sentences f 2 : rank of the Chinese term given t J, log of the rank of t C with respect to the descending order of the conditional translation probability P(t C t J ) f 3 : rank of the Japanese term given t C, log of the rank of t J with respect to the descending order of the conditional translation probability P(t J t C ) f 4 : number of Japanese characters number of characters in t J f 5 : number of Chinese charac- number of characters in t C f 6 : ters number of times generating translation by applying the phrase tables the number of times repeating the procedure of generating translation by applying the phrase tables until generating t C or t J from s J, as in s C t J t C,or,s J t C t J f 7 : identity of Japanese terms returns 1 when t J = s J f 8 : identity of Chinese terms returns 1 when t C = s C ED(tX,sX) f 9 : edit distance similarity of f 9 (t X,s X )=1 max( t X, s X ) (where ED is the edit distance monolingual terms of t X and s X, and t denotes the number of characters of t.) bigram(tx ) bigram(sx ) f 10 : character bigram similarity f 10 (t X,s X )= max( t X, s X ) 1 (where bigram(t) is of monolingual terms the set of character bigrams of the term t.) const(tj ) const(sj ) f 11 : rate of identical morphemes f 11 (t J,s J )= max( const(t J ), const(s J ) ) (where const(t) is the (for Japanese terms) set of morphemes in the Japanese term t.) const(t f 12 : rate of identical characters f 11 (t C,s C ) = C) const(s C ) max( const(t C), const(s C) ) (where const(t) is (for Chinese terms) the set of Characters in the Chinese term t.) f 13 : subsumption relation of returns 1 when the difference of t J and s J is only in their suffixes, strings / variants relation of or only whether or not having the prolonged sound, or only in surface forms (for Japanese their hiragana parts. terms ) f 14 : identical stem (for Chinese returns 1 when the difference of t C and s C is only whether or not terms) haing the word which is not the prefix or suffix. f 15 : f 16 : rate of intersection in translation by the phrase table translation by the phrase table f 15 (t X,s X )= ( where trans(t) is the set of translation of term t from the phrase table.) returns 1 when s J can be generated by translating t E with the phrase table, or, s E can be generated by translating t J with the phrase table. trans(tx) trans(sx ) max( trans(t X), trans(s X) ) seed bilingual technical term pair s JC = s J,s C. Among the features of the first type are the frequency (f 1 ), ranks of terms with respect to the conditional translation probabilities (f 2 and f 3 ), length of terms (f 4 and f 5 ), and the number of times repeating the procedure of generating translation with the phrase tables until generating input terms t J and t C from the Japanese seed term s J (f 6 ). Among the features of the second type are identity of monolingual terms (f 7 and f 8 ), edit distance of monolingual terms (f 9 ), character bigram similarity of monolingual terms (f 10 ), rate of identical morphemes (in Japanese, f 11 ) / characters (in Chinese, f 12 ), string subsumption and variants for Japanese (f 13 ), identical stem for Chinese (f 14 ), rate of intersection in translation by the phrase table (f 15 ), and translation by the phrase tables (f 16 ) Evaluation Results Table 3 shows the evaluation results for a baseline as well as for SVMs. As the baseline, we simply judge the input bilingual term pair t J,t C as synonymous with the seed bilingual technical term pair s JC = s J,s C when t J and s J are identical, or, t C and s C are identical. When training / testing a SVMs classifier, we tune the lower bound of the confidence measure of the distance from the separating hyperplane in two ways: i.e., for maximizing precision and for maximizing F-measure. When maximizing precision, we achieve almost 87% precision where F-measure is over 40%. When maximizing F-measure, we achieve over 60% F-measure with around 71% precision and over 52% recall. As shown in Figure 2, the two types of segmentation of Chinese sentences, namely, by characters and by morphemes, tend to have different types of errors. So, we integrate those two types of segmentation in the form of the intersection of

5 Table 3: Evaluation Results (%) segmented by characters segmented by morphemes intersection precision recall f-measure precision recall f-measure precision recall f-measure baseline (t J and s J are identical, or, t C and s C are identical.) SVM maximum precision maximum f-measure Figure 2: Evaluating Intersection of Judgments by SVM based on Character/Morpheme based Segmentation of Chinese Sentences SVM judgments, where, for both types of segmentation, we tune the lower bound of the confidence measure of the distance from the separating hyperplane. We maximize precision while keeping recall over 25% with held-out data, and this achieves over 90% precision as shown in Table Related Work Among related works on acquiring bilingual lexicon from text, Itagaki et al. (2007) focused on automatic validation of translation pairs available in the phrase table trained by an SMT model. Lu and Tsou (2009) and Yasuda and Sumita (2013) also studied to extract bilingual terms from comparable patents, where, they first extract parallel sentences from comparable patents, and then extract bilingual terms from parallel sentences. Those studies differ from this paper in that those studies did not address the issue of acquiring bilingual synonymous technical terms. Tsunakawa and Tsujii (2008) is mostly related to our study, in that they also proposed to apply machine learning technique to the task of identifying bilingual synonymous technical terms. However, Tsunakawa and Tsujii (2008) studied the issue of identifying bilingual synonymous technical terms only within manually compiled bilingual technical term lexicon and thus are quite limited in its applicability. Our approach, on the other hand, is quite advantageous in that we start from parallel patent documents which continue to be published every year and then, that we can generate candidates of bilingual synonymous technical terms automatically. Our study in this paper is also different from previous works on identifying synonyms based on bilingual and monolingual resources (e.g. Lin and Zhao (2003)) in that we learn bilingual synonymous technical terms from phrase tables of a phrase-based SMT model trained with very large parallel sentences. Also in the context of SMT between Japanese and Chinese, Sun and Lepage (2012) pointed out that character-based segmentation of sentences contributed to improving machine translation performance compared to morpheme-based segmentation of sentences. 7. Conclusion In the task of acquiring Japanese-Chinese technical term translation equivalent pairs from parallel patent documents, this paper considered situations where a technical term is observed in many parallel patent sentences and is translated into many translation equivalents and studied the is-

6 sue of identifying synonymous translation equivalent pairs. We especially examined two types of segmentation of Chinese sentences, i.e., by characters and by morphemes, and integrated those two types of segmentation in the form of the intersection of SVM judgments, which achieved over 90% precision. One of the most important future works is definitely to improve recall. To do this, we plan to apply the semi-automatic framework (Liang et al., 2011b) which have been invented in the task of identifying Japanese- English synonymous translation equivalent pairs and have been proven to be effective in improving recall. We plan to examine whether this semi-automatic framework is also effective in the task of identifying Japanese-Chinese synonymous translation equivalent pairs. 8. References P. Fung and L. Y. Yee An IR approach for translating new words from nonparallel, comparable texts. In Proc. 17th COLING and 36th ACL, pages F. Huang, Y. Zhang, and S. Vogel Mining key phrase translations from Web corpora. In Proc. HLT/EMNLP, pages M. Itagaki, T. Aikawa, and X. He Automatic validation of terminology translation consistency with statistical method. In Proc. MT Summit XI, pages P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst Moses: Open source toolkit for statistical machine translation. In Proc. 45th ACL, Companion Volume, pages B. Liang, T. Utsuro, and M. Yamamoto. 2011a. Identifying bilingual synonymous technical terms from phrase tables and parallel patent sentences. Procedia - Social and Behavioral Sciences, 27: B. Liang, T. Utsuro, and M. Yamamoto. 2011b. Semiautomatic identification of bilingual synonymous technical terms from phrase tables and parallel patent sentences. In Proc. 25th PACLIC, pages D. Lin and S. Zhao Identifying synonyms among distributionally similar words. In Proc. 18th IJCAI, pages B. Lu and B. K. Tsou Towards bilingual term extraction in comparable patents. In Proc. 23rd PACLIC, pages Y. Matsumoto and T. Utsuro Lexical knowledge acquisition. In R. Dale, H. Moisl, and H. Somers, editors, Handbook of Natural Language Processing, chapter 24, pages Marcel Dekker Inc. Y. Morishita, T. Utsuro, and M. Yamamoto Integrating a phrase-based SMT model and a bilingual lexicon for human in semi-automatic acquisition of technical term translation lexicon. In Proc. 8th AMTA, pages J. Sun and Y. Lepage Can word segmentation be considered harmful for statistical machine translation tasks between Japanese and Chinese? In Proc. 26th PACLIC, pages M. Tonoike, M. Kida, T. Takagi, Y. Sasaki, T. Utsuro, and S. Sato A comparative study on compositional translation estimation using a domain/topic-specific corpus collected from the web. In Proc. 2nd Intl. Workshop on Web as Corpus, pages H. Tseng, P. Chang, G. Andrew, D. Jurafsky, and C. Manning A conditional random field word segmenter for Sighan bakeoff In Proc. 4th SIGHAN Workshop on Chinese Language Processing, pages T. Tsunakawa and J. Tsujii Bilingual synonym identification with spelling variations. In Proc. 3rd IJCNLP, pages M. Utiyama and H. Isahara A Japanese-English patent parallel corpus. In Proc. MT Summit XI, pages V. N. Vapnik Statistical Learning Theory. Wiley- Interscience. K. Yasuda and E. Sumita Building a bilingual dictionary from a Japanese-Chinese patent corpus. In Computational Linguistics and Intelligent Text Processing, volume 7817 of LNCS, pages Springer.