Chinese Term Extraction from Web Pages Based on Compound word Productivity

Transcription

1 Chinese Term Extraction from Web Pages Based on Compound word Productivity Hiroshi Nakagawa Information Technology Center, The University of Tokyo 7-3- Hongou, Bunkyo Tokyo, JAPAN, Abstract In this paper, we propose an automatic term recognition system for Chinese. Our idea is based on the relation between a compound word and its constituents that are simple words or individual Chinese character. More precisely, we basically focus on how many words/characters adjoin the word/character in question to form compound words. We also take into account the frequency of term. We evaluated word based method and character based method with several Chinese Web pages, resulting in precision of 75% for top ten candidate terms. Introduction Automatic term recognition, ATR in short, aims at extracting domain specific terms from a corpus or Web pages. Domain specific terms are terms that express the concept specifically defined in the given domain. They are required to have a unique meaning in order for efficient communication about the topic of the domain. It is, however, difficult to decide automatically whether they are unique. So we put this issue aside. In terms of feasibility, their grammatical status is important, for instance part of speeches. Although they are not necessarily confined to simple words where simple word means a word which could not be further divided into shorter and more basic words, the majority of them are actually compound words,. Thus, we here focus on both of simple and compound words. In terms of text length, even one Web page which is not long gives us a number of domain specific vocabulary like national library, library policy if the Web page is about libraries. If we expand domain specific terms to this extent, the big portion of domain specific terms are compound words. Obviously, the majority of compound words consist of relatively small number of distinct simple words. In this situation, it is natural to pay Hiroyuki ojima *, Akira Maeda** Faculty of Economics, The University of Tokyo 7-3- Hongou, Bunkyo Tokyo, JAPAN, * kojima@e.u-tokyo.ac.jp, **maeda@lib.u-tokyo.ac.jp attention to the relation among compound words and their constituent simple words. (ageura and Umino 996) proposed an important feature of domain specific terms called termhood which refers to the degree that a linguistic unit is related to a domain-specific concept. Presumably, it is necessary to develop an ATR method that calculates termhood of each term candidate extracted from a domain corpus that usually consists of a number of documents. Many works of ATR use statistics of term candidate distribution in a corpus such as term frequency to calculate the termhood of every term candidate. This frequency based methods, however, heavily depend on the size of corpus. Thus we do not expect a good result if we extract domain specific terms from one or a few Web pages. If we shift our focus from a corpus based statistics like frequency to term space that consists of all term candidates, we expect better result of extracted terms even from one Web page because of the following reason: A set of term candidates has its own structure like relations between compound words and their constituent simple words as stated before. The statistical information about these relations comes from more microscopic structure than term frequency. Thus, if we utilize more information from term space, it is reasonable to expect better performance in extracting from a small text like one Web page. Without this kind of information, we will be suffering from the shortage of information for ATR. Now look at frequency based information and information inherent with term space more closely. Even though several kinds of statistics about actual use in a corpus such as term frequency give a good approximation of termhood. They are not necessarily meanings in a writer's mind. On the contrary, the statistics of term space can reflect the meaning in a writer s mind because it is up to a writer s decision how to make a compound word term to express a complicated concept using simple word terms as its components. More precisely, if a certain simple word, say N,

2 expresses the basic concept of a domain that the document treats, the writer of the document, we expect, uses N not only many times but in various ways. One of typical way of this kind is that he/she composes quite a few compound words using N and uses these compound words in documents he/she writes. For this reason, we have to focus on the relation among simple words and compound words when pursuing new ATR methods. One of the attempts to make use of this relation has been done by Nakagawa and Mori (2003). Their method is based on the number of distinct simple words that come left or right of a simple word term to make up compound word terms. In this paper, we apply their method to deal with Web pages written in Chinese. In this paper, section 2 gives the background of ATR methods. In section 3 we introduce ATR method developed by Nakagawa and Mori(2003). Section 4, 5 and 6 are for how to apply their method to Chinese language and evaluation of two proposed method: ) Word based method using Chinese morphological analyzer ICTCLAS(Zhang, Yu, Xiong and Liu. 2003), 2) Stop character based method. 2 Background 2. Typical Procedures of Automatic Term Recognition An ATR procedure consists of two procedures in general. The first one is a procedure of extracting term candidates from a corpus. The second procedure is to assign each term candidate a score that indicates how likely the term candidate is a term to be recognized. Then all candidates are ranked according to their scores. In the remaining part of this section, we describe the background of a candidate extraction procedure and a scoring procedure respectively. 2.2 Candidates Extraction In term candidates extraction from the given text corpus, we mainly focus on compound words as well as simple words. To extract compound words which are promising term candidates and at the same time to exclude undesirable strings such as is a or of the, the frequently used method is to filter out the words that are the member of a stopword-list. The structure of complex term is another important factor for automatic term candidate extraction. A syntactic structure that is the result of parsing is focused on in many works. Since we focus on these complex structures, the first task in extracting term candidates is a morphological analysis including part of speech (POS) tagging. There are no explicit word boundary marker in Chinese, we first have to do morphological analysis which segments out words from a sentence and does POS tagging at the same time. After POS tagging, the complex structures mentioned above are extracted as term candidates. Previous studies have proposed many promising ways for this purpose, for instance, Smadja and Mceown (990), and Frantzi and Ananiadou (996) tried to treat more general structures like collocations. 2.3 Scoring The next step of ATR is to assign each term candidate its score in order to rank them in descending order of termhood. Many researchers have sought the definition of term candidate s score which approximates termhood. In fact, many of those proposals make use of statistics of actual use in a corpus such as term frequency which is so powerful and simple that many researchers directly or indirectly have used it. The combination of term frequency and inverse document frequency is also well studied i.e. (Uchimoto et al 2000), (Fukushige and Noguchi 2000). On the other hand, several scoring methods that are neither directly nor heavily based on frequency of term candidates have been proposed. Among those, Ananiadou et al. proposed C-value (Frantzi and Ananiadou 996) which counts how independently the given compound word is used in the given corpus. Hisamitsu (2000) proposes a way to measure termhood which estimates how far the document containing given term is different from the distribution of documents not containing the given term. However, the method proposed by Nakagawa and Mori (2003) outperforms these methods in terms of NTCIR TMREC task(ageura, et al, 999). 2.4 Chinese Term Extraction As for Chinese language NLP, very many works about word segmentation were published i.e. (Ma and Xia 2003). Nevertheless the term Term extraction has not yet been used for Chinese NLP, key words extraction have been a target for a long time. For instance, key words extraction from news articles (Li. et al. 2003) is the recent result which uses frequency and length of character string for scoring. Max-duplicated string based method (Yang and Li. 2002) is also promising. In spite of previous research efforts, there have been no attempt so far to apply the relation between simple and compound words to Chinese term extraction, and that is exactly what we propose in this paper.

3 3 Scoring methods with Simple word Bigrams 3. Simple word Bigrams The relation between a simple word and complex words that include the simple word is very important in terms of term space structure. Nevertheless, to my knowledge, this relation has not been paid enough attention so far except for the method proposed by Nakagawa and Mori (2003). In this paper, taking over their works, we focus on compound words among the various types of complex terms. In technical documents, the majority of domain-specific terms are noun phrases or compound words consisting of small size vocabulary of simple words. This observation leads to a new scoring methods that measures how many distinct compound words contain the simple word in question as their part in a given document or corpus. Here, suppose the situation where simple word: N occurs with other simple words as a part of many compound words shown in Figure where [N M] means bigram of noun N and M. [LN N] (#L) [LN2 N] (#L2) : : [LNn N](#Ln) [N RN](#R) [N RN2](#R2) [N RNm](#Rm) Figure. Noun Bigram and their Frequency In Figure, [LNi N] (i=,..,n) and [N RNj] (j=,...,m) are simple word bigrams which make (a part of) compound words. #Li and #Rj (i=,..,n and j=,..,m) mean the frequency of the bigram [LNi N] and [N RNj] in the corpus respectively. Note that since we depict only bigrams, compound words like [LNi N RNj] which contains [LNi N] and/or [N RNj] as their parts might actually occur in a corpus. Note that this noun trigram might be a part of longer compound words. We show an example of a set of noun bigrams. Suppose that we extract compound words including trigram as term candidates from a corpus as shown in the following example. Example. trigram statistics, word trigram, class trigram, word trigram, trigram acquisition, word trigram statistics, character trigram Then, noun bigrams consisting of a simple word trigram are shown in Figure 2 where the number between ( and ) shows the frequency in the corpus. word trigram (3) trigram statistics (2) class trigram () trigram acquisition () character trigram() Figure 2. An example of noun bigram Now we focus on and utilize simple word bigrams to define the scoring function. Note that we are only concerned with simple word bigrams and not with a simple word per se because, as stated before, we are concerned with the relation between a compound word and its component simple words. 3.2 Scoring Function 3.2. Score of simple word Since there are infinite number of scoring functions based on [LNi N] or [N RNj], we here consider the following simple but representative scoring functions. #LDN(N) and #RDN(N) : These are the number of distinct simple words which directly precede or succeed N. These coincide with n and m in Figure respectively. For instance, in an example shown in Figure 2, #LDN(trigram)=3, #RDN(trigram)=2. Using #LDN and #RDN we define LN(N) and RN(N): These are based on the number of occurrence of each noun bigram, and defined for [LNi N] and [N RNj] as follows respectively. #LDN(N) LN(N) = (#Li) () i= #RDN(N) RN(N) = (#Rj) (2) j= LN(N) and RN(N) are the frequencies of nouns that directly precede or succeed N. For instance, in an example shown in Figure 2, LN(trigram)=5, and RN(trigram)=3. Let s think about the background of these scoring functions. #LDN(N) and #RDN(N), where we do not take into account the frequency of each noun bigram but take into account the number of distinct nouns that adjoin to N to make compound words. That indicates how linguistically and domain dependently productive the noun:n is in a given corpus. That means that if N presents a key and/or basic concept of the domain treated by the corpus, writers in that domain work out many distinct compound words with N to express more complicated concepts. On the other hand, as for LN(N) and RN(N), we also focus on frequency of each noun bigram as well. In other words, statistic

4 bias in actual use of noun:n is, this time, one of our main concern. For example, in Figure 2, LN(trigram,2)=, and RN(trigram,2)=5. In conclusion, since LN(N) and RN(N) convey more information than #LDN(N) and #RDN(N), we adopt LN(N) and RN(N) in this research Score of compound words The next thing to do is expanding those scoring functions for simple word to the scoring functions for compound words. We adopt a geometric mean for this purpose. Now think of a compound word : CN = N N 2 N L, where N i (i=,.., L) is a simple word. Then a geometric mean: LR of CN is defined as follows. 2L L LR(CN) = (LN(N) i )(RN(N) i ) + + i= (3) For instance, if we use LN(N) and RN(N) in example, LR(trigram) = ( 3+ ) (5+ ) = LR does not depend on the length of CN where length means the number of simple words that consist of CN. This is because since we have not yet had any idea about the relation between the importance of a compound word and a length of the compound word, it is fair to treat all compound words, including simple words, equally no matter how long or short each compound word is Combining LR and Frequency of Nouns We still have not fully utilized the information about statistics of actual use in a corpus in the bigram based methods described in 3.2. and Among various kinds of information about actual use, the important and basic one is the frequency of single-and compound words that occur independently. The term independently means that the left and right adjacent words are not nouns. For instance, word patterns occurs independently in we use the word patterns which occur in this sentence. Since the scoring functions proposed in 3.2. is noun bigram statistics, the number of this kind of independent occurrences of nouns themselves have not been used so far. If we take this information into account, the better results are expected. Thus, if a simple word or a compound word occurs independently, the score of the noun is simply multiplied by the number of its independent occurrences. We call this new scoring function as FLR(CN) which is defined as follows. if N occurs independently then F LR(CN) = LR(CN) f(cn) where f(cn) means the number of independent occurrences of noun CN (4) 4 Term Extraction for Chinese based on Morphological Analysis If we try to apply the scoring method proposed in section 3 directly to a Chinese text, every word should be POS tagged because we extract multiword unit of several types of POS tag sequences as candidates of domain specific terms. For this we need a Chinese morphological analyzer because Chinese is an agglutinative language. Actually, we use Chinese morphological analyzer: ICTCLAS(Zhang and Liu 2004). As term candidates, we extract compound word: MWU having the following POS tag sequence expressed in (5). A multi-word-unit: MWU is defined by the following CFG rules where the right hand sides are expressed as a regular expression. MWU <--[ag a]* [ng n nr ns nt nz nx vn an i j]+ MWU <-- MWU? b [ng n nr ns nt nz nx vn an i j] + MWU <-- [ag a] + [u k] MWU MWU <-- MWU (u k he-2 yu-3) MWU (5) where ag, a, n, u are all tags used in ICTCLAS. Roughly speaking (5) means an adjective followed by the repetition of [adjective noun particle] followed by a noun. The problem is the ambiguity of POS tagging because the same word is very often used verb as well as noun. In addition, unknown words like newly appeared proper names also impairs the accuracy. Due to this problem caused by morphological analyzer, the accuracy is degraded. Once we segment out word sequences conforming the above POS tag sequences, we calculate LN and RN of each component word. In calculation of LN and RN, a word whose POS is c, u or k is omitted. In other words, if a word sequence w w2 w3 where POS of w2 is c u or k, then we calculate RN of w and LN of w3 by regarding the word sequence as w w3. Then we combine LN and RN of each word to calculate FLR by definition of (3) and (4) to sort all extracted candidates in descending order of FLR. We apply the proposed methods to 30 Web pages from People s Daily news. The areas are social, international and IT related news. The average length is characters. Firstly, we extract relevant terms by hand from each news article and use it as the gold standard. The average number of gold standard terms per one news particle is 5.9 words. Secondly, we extract terms from each news article and sort them in descending

5 order by the proposed method and evaluate them by a precision of top N terms defined as follows. CT()= if th term is one of the gold standard terms. 0 otherwise CT ( i) i precision = = ( ) (6) where N is the number of the gold standard terms, and in our experiment, N=20. Precision(), where =,..,20, are shown in Figure 3 as Strict. We also use another precision rate precision which is not strict and defined as follows. CTpart()= if one of gold standard terms. is a part of th term 0 otherwise CTpart( i) i precision = = '( ) (7) These are also shown in Figure 3 as Partly. documents, we expect POS sequences of terminologies with high accuracy. However, if we consider terminology extraction from Web pages, the possibility of unexpected POS sequence may rise. The third reason is language independency. Currently proposed and/or used morphological analyzers heavily depend either upon the sophisticated linguistic knowledge about the target language or upon a big size corpus of the target language if machine learning is employed. These linguistic resources, however, are not always available. Due to these reasons, we also developed term candidate extraction system which does not use a morphological analyzer. Instead of morphological analyzer, we try to employ a stop word list. In Chinese, as stop words, we find many character unigrams and bigrams because one Chinese character conveys larger amount of information than a character of Latin alphabet. They are partly shown in Appendix A. As term candidates, we simply extract character strings between two stop words that are nearest each other within a sentence. Obviously, the character strings thus extracted are not necessarily meaningful compound words. Therefore we cannot directly use these strings as words to calculate LN and RN function. Back to the idea that Chinese characters are ideograms, we come up to the idea that we calculate LN and RN of each character appearing within every character strings extracted as candidates. An example is shown in Figure 4. Figure 3. Strict and partly precision of word based extraction method. From Figure 3, we see that If we pick up the ten highest ranked terms, about 75% of them meet the gold standard. The case we loosen the definition of precision shows better than the strict case of (6) but the difference is not so large. That means that the proposed word based ranking method works very well to extract important Chinese terms from news articles. 5 Character based Term Extraction There are several reasons why we would like to develop a term extraction system without morphological analyzer. The first reason is that the accuracy of morphological analyzer is, in spite of the great advancement of these years, still around 95% (GuoDong and Jian 2003). The second reason is that there possibly exist terminologies with unexpected POS sequences. If we deal only with academic papers or technical LN(zuo-4)=3 RN(zuo-4)=2 Figure 4. LN and RN of Chinese character zuo-4 In calculation of LN and RN, we neglect characters whose POS are c,u or k as same as we did in morphological analyzer based method. Once we calculate LN and RN of each character, FLR of every character string is calculated as defined by (3) and (4) to sort them in descending order of FLR. Actually this idea is very similar with left and right entropy used to extract two character Chinese words from a corpus (Luo and Sun. 2003). However what we would like to extract is a set of longer compound words or even phrases used in a Web page. Moreover we only use the Web page and do not use any other language resources such as a big corpus at all due to the reason described above in this section.

6 We evaluate the proposed character based extraction method against the same Web pages from People s Daily news used in Morphological Analysis based method described in Section 4. We also use the same gold standard terms described in Section 4 for evaluation. The strict and partly precision defined by (6) and (7) are used. The result is shown in Figure 5. Figure 5. Strict and partly precision of character based extraction method. Comparing Figure 3 with Figure 5, apparently the result of extracted terms of word based method is better than that of character based method. However, it does not necessarily mean that the character based term extraction is inferior. If you take a glance at the stop word list of Appendix A, it seems that several of the stop words are selected mainly from words in auxiliary verbs, pronouns, adverbs, particles, prepositions, conjunctions, exclamations, onomatopoeic words and punctuation marks. However, in reality, our selection is based rather on meaning, usage and generally frequency of use than parts of speech. Thus some of them are not function words but content words in order to exclude non-domainspecific words. Actually, the stop words are not only character unigram but character bigram. Because Chinese character is ideograph and each character may have plural meanings, it is difficult only to use character unigram as a stop word in Chinese. Our method based on these viewpoints resulted in getting an interesting consequence. We show an example of news article and extracted terms from it by this method in Appendix B and Appendix C. This news article is entitled The Culture of Tibetan Web Site is formally created. Let s take a look at an underlined sentence in this short article and underlined terms extracted from there. This sentence says: According to the introduction, The Culture of Tibetan Web Site is a site of special pure culture for the purpose of investigating the essence of Tibetan culture, showing the scale of Tibetan culture and raising the spirit of Tibetan culture. In the case of method based on stop word list, we can extract compound term of investigating the essence of Tibetan culture, showing the scale of Tibetan culture ( raising the spirit of Tibetan culture ( and so on from this sentence. On the contrary, by the term extraction method based on morphological analysis, gerund, for example, showing( ) and raising ( ), can not be extracted. We said that the majority of domain specific terms are noun phrases or compound words consisting of small size vocabulary of simple words as stated in section 3. So we especially have paid attention to relation among nouns. However most of Chinese nouns can also be used as verbs. Moreover inflection of Chinese verbs can hardly be recognized visually. It is difficult to distinguish verb from noun by morphological analysis. Certainly ICTCLAS classifies the character that has meaning of both verb and noun into the category of vn (verb and noun). But gerunds and verbal noun infinitives are not contained in vn. For instance, means not only write a letter but writing letter. Thus we have to pay attention to verbs in Chinese too. Only by tuning up stop word list, we can take gerunds and verbal noun infinitives into account to some extent. Appendix C shows one of the evidence of this observation. 6 Conclusion In this paper, we apply automatic term recognition system based on FLR proposed by Nakagawa and Mori (2003) to Chinese Web pages because the term extraction from small text like one Web page is the future oriented topic. We proposed two methods: word based and character based extraction and ranking using the compound word productivity of simple words. Since the accuracies of term recognition are around 60% for top,000 term candidates in NTCIR TMREC task(ageura et al 999), the result of 75% accuracy of top ten candidates is a good start. References Ananiadou, S A Methodology for Automatic Term Recognition. In Proceedings of 4th International Conference on Computational Linguistics : GuoDong Zhou and Jian Su A Chinese Efficient Analyser Integrating Word Segmentation, Part-Of- Speech Tagging, Partial Parsing and Full Parsing, In

7 Proceedings of The Second SIGHAN Workshop, on Chinese Language Processing.ACL2003 :78-83 Frantzi, T.. and Ananiadou, S Extracting nested collocations. In Proceedings of 6th International Conference on Computational Linguistics :4-46. Fukushige, Y. and N. Nogichi Statistical and linguistic approaches to automatic term recognition NTCIR experiments at Matsushita. Terminology 6(2) : Hua-PingZhang, Hong-uYu, De-Yi Xiong and Qun Liu HHMM-based Chinese Lexical Analyzer ICTCLAS. In Proceedings of The Second SIGHAN Workshop, on Chinese Language Processing.ACL2003 :84-87 Hisamitsu, T, A Method of Measuring Term Representativeness. In Proceedings of 8th International Conference on Computational Linguistics : ageura,. and Umino, B Methods of automatic term recognition: a review. Terminology 3(2) : ageura,. et al, 999. TMREC Task: Overview and Evaluation. In Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition : Shengfen Luo and Maosong Sun Two-Character Chinese Word Extraction Based on Hybrid of Internal and Contextual Measures. In Proceedings of The Second SIGHAN Workshop, on Chinese Language Processing.ACL2003 :24-30 Qing Ma and Fei Xia Proceedings of The Second SIGHAN Workshop, on Chinese Language Processing.ACL2003, Sappro Sujian Li,Houfeng Wang,Shiwen Yu and Chengsheng Xin News-Oriented Automatic Chinese eyword Indexing. In Proceedings of The Second SIGHAN Workshop, on Chinese Language Processing.ACL2003 :92-97 Nakagawa, H. and Tatsunori Mori Automaic Term Recognition based on Statistics of Compound words and their Components, Terminology 9(2) :20-29 Smadja, F.A. and Mckeown,.R Automatically extracting and representing collocations for language generation. In Proceedings of the 28th Annual Meetings of the Association for Computational Linguistics : Uchimoto,., S.Sekine, M. Murata, H.Ozaku and H. Isahara Term recognition using corpora from different fields. Terminology 6(2) : Wenfeng Yang and Xing Li Chinese keyword extraction based on max-duplicated strings of the documents. In Proceedings of the 25th annual international ACM SIGIR conference on Research and develop-ment in information retrieval, pp evin Zhang and Qun Liu ICTCLAS. /manual /node2.html Appendix A: A part of stop word list Appendix B: An example of news article Appendix C: Terms extracted from Appendix B. Word Based (Top 0 terms with score of equation (4) ) Character Based(Top terms with score equation (4) )