An Approach to Automatically Constructing Domain Ontology 1

Tingting He 1,2,3, Xiaopeng Zhang 1,3, Xinghuo Ye 1,3

1 Department of Computer Science, Huazhong Normal University, 430079 Wuhan, China
2 Software College of Tsinghua University, 102201 Beijing, China
3 National Language Resources Monitor & Research Center (Network Media), 430079 Wuhan, China
tthe@mail.ccnu.edu.cn, zhangxaopeng@mals.ccnu.edu.cn, yxhez@tom.com

Abstract. In this paper we present an approach to mining domain-dependent ontologies using term extraction and relationship discovery techniques. There are two main innovations in our approach. The first is term extraction using the log-likelihood ratio, which contrasts the probability of term occurrence in a domain corpus with that in a background corpus. The second is the fusion of information from multiple knowledge sources as evidence for discovering particular semantic relationships among terms. In our experiments, we also improve the traditional k-medoids algorithm for multi-level clustering. We have applied our approach to produce an ontology for the domain of computer science and obtained promising results.

1 Introduction

Ontologies have become an important means for structuring knowledge and building knowledge-intensive systems. They have shown their usefulness in application areas such as intelligent information integration, information retrieval and natural language processing, to name but a few. Accordingly, efforts have been made to facilitate the ontology engineering process, in particular the acquisition of ontologies from domain texts [1].

Constructing an ontology is an extremely laborious effort. Even with some reuse of core knowledge from an upper model, creating an ontology for a particular domain carries a high cost, incurred anew for each domain. Tools that could automate, or semi-automate, the construction of ontologies for different domains could dramatically reduce the cost of knowledge creation. One approach to developing such tools is to exploit the information implicit in collections of texts from a particular domain. If terms and their semantic relations could be extracted automatically from a text corpus, a domain ontology could be built conveniently. This would be more cost-effective than having a human develop the ontology from scratch [2][3].

Domain-specific ontologies are useful in systems involving artificial reasoning and information retrieval: they give such systems a vocabulary of terms and the concepts relating one term to another. In this paper, we present a method that automatically mines an ontology from any large corpus in a specific domain, to support data integration and information retrieval tasks. The induced ontology consists of domain concepts related only by parent-child links; more specialized relations are not included. Our approach comprises two main phases: term extraction and relationship discovery. In the former phase, meaningful terms are extracted using the LLR formula. In the latter, parent-child relations among terms are induced through multi-level clustering.

The remainder of the paper is organized as follows. Section 2 discusses related work in this field. Section 3 gives an overview of the overall system architecture and explains the process of ontology construction. Section 4 presents an example with some promising results we have obtained when applying our mechanisms to mining ontologies from text, together with our analysis. Section 5 concludes and mentions future work.

1 The paper is supported by the National Natural Science Foundation of China (NSFC) (Grant No. 60496323, Grant No. 60375016, Grant No. 10071028) and the Ministry of Education of China, Research Project for Science and Technology (Grant No. 105117).
2 Previous Work

Existing approaches to ontology induction include those that start from structured data, merging ontologies or database schemas (Doan et al. 2002). Other approaches use natural language data, either by analyzing the corpus directly (Sanderson and Croft 1999; Caraballo 1999) or by learning to expand WordNet with clusters of terms from a corpus (e.g., Girju et al. 2003). Information extraction approaches that infer labeled relations either require substantial hand-created linguistic or domain knowledge (e.g., Craven and Kumlien 1999; Hull and Gomez 1993) or require human-annotated training data with relation information for each domain (Craven et al. 1998).

For automatic ontology construction, Govind and Chakravarthi presented in 2001 an approach that extracts an ontology from text documents by Singular Value Decomposition (SVD), a purely statistical analysis method, in contrast to heuristic and rule-based methods. They adopt Latent Semantic Indexing (LSI), which attempts to capture term-term statistical relationships by replacing the document space with a lower-dimensional concept space. Their method is appealing for its simplicity, as it rests on a fairly precise mathematical foundation; it is effective but limited in precision. Moreover, it does not identify the exact relations among terms [4].

In 2001, Dekang Lin and Patrick Pantel proposed a method for discovering domain concepts based on a clustering algorithm called CBC (Clustering By Committee). They regard a concept simply as a cluster of terms, so their method addresses only one aspect of the whole process of ontology induction [5].

3 Automatic Ontology Construction

3.1 Architecture

The overall architecture for domain-independent ontology induction is shown in Fig. 1. The documents of the domain corpus are preprocessed to remove segments such as pictures and special symbols, and stopwords are then filtered out. Next, terms are extracted based on word segmentation and syntactic tagging. Subsequently, semantic relations between pairs of terms are derived using multi-level clustering, with evidence drawn from multiple knowledge sources.

Fig. 1. System Architecture (data cleanup: removing symbols, filtering stopwords; term discovery: parsing, tagging, term scoring, term extraction; relationship discovery: similarity computing with Chinese HowNet, multi-level clustering; ontology output)
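To make the dataflow of Fig. 1 concrete, the following is a minimal sketch of how the stages could be wired together. Every function name here is an illustrative placeholder rather than the authors' implementation; the individual pieces are filled in by Sections 3.2 and 3.3.

```python
# Hypothetical skeleton of the pipeline in Fig. 1; the helper functions
# are placeholders for the components described in Sections 3.2 and 3.3.

def build_ontology(domain_docs, background_docs, llr_threshold, depth=2):
    # Data cleanup: strip pictures/special symbols, then filter stopwords.
    cleaned = [filter_stopwords(remove_symbols(doc)) for doc in domain_docs]

    # Term discovery: word segmentation and tagging produce candidate
    # terms, kept only if their LLR score (Section 3.2) clears the threshold.
    candidates = segment_and_tag(cleaned)
    terms = [t for t in candidates
             if llr_for(t, cleaned, background_docs) > llr_threshold]

    # Relationship discovery: recursive clustering (Section 3.3) turns the
    # flat term list into a parent-child hierarchy of the given depth.
    return multilevel_cluster(terms, cleaned, depth)
```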
3.2 Term Discovery

In this section, our goal is to identify domain-relevant terms from a collection of domain-specific texts. Several approaches to term extraction exist. One is based on the PAT-tree, which has the advantage of extracting terms (phrases) of any length because it avoids word segmentation for Chinese; however, it needs a large number of documents to achieve good precision. Another is the C/NC-value method proposed by Frantzi and Ananiadou (1999), which combines linguistic and statistical methods and improves on purely statistical ones, but is limited in that its linguistic knowledge covers only English.

In this paper, we propose an approach to extracting domain-relevant terms in which terms are scored for domain relevance under the following assumption: if a term occurs significantly more often in a domain corpus than in a background corpus, the term is clearly domain-relevant. As an example, Table 1 compares the number of documents containing the terms 算法 (algorithm), 内存 (memory), 路由器 (router) and 数据库 (database) in a domain-dependent computer science corpus of 200 Chinese periodical papers with that in a larger Modern Chinese Corpus used as the background corpus.

Table 1. Document frequencies of terms in the domain and background corpora

                     算法   内存   路由器   数据库    Total
Domain Corpus          89     78       73       69      200
Background Corpus       7      0        4        6    12000
Difference             82     78       69       63   -11800

We can observe from Table 1 that the terms 算法 (algorithm), 内存 (memory), 路由器 (router) and 数据库 (database) occur significantly more often in the domain corpus than in the background corpus, so we consider them domain-dependent. To estimate the domain relevance of a term, we use the log-likelihood ratio (LLR) given by

$-2 \log_2 \left( \frac{H_0(p;\, k_1, n_1, k_2, n_2)}{H_a(p_1, p_2;\, n_1, k_1, n_2, k_2)} \right)$  (1)

LLR measures the extent to which a hypothesized model of the distribution of cell counts, $H_a$, differs from the null hypothesis, $H_0$ (namely, that the percentage of documents containing the term is the same in both corpora). We use a binomial model for $H_0$ and $H_a$. Here, $p = (k_1+k_2)/(n_1+n_2)$, $p_1 = k_1/n_1$ and $p_2 = k_2/n_2$, where $k_1$ is the number of documents containing the term in the domain corpus, $k_2$ is the number of documents containing the term in the background corpus, $n_1$ is the total number of documents in the domain corpus, and $n_2$ is the total number of documents in the background corpus [6].
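For illustration, formula (1) can be computed as below under the binomial model of Dunning (1993) [6]. This is a minimal sketch; the handling of the degenerate cases p = 0 and p = 1 is our own assumption, as the paper does not spell it out.

```python
import math

def log_l(p, k, n):
    """Binomial log-likelihood (base 2) of observing k successes in n trials."""
    # Guard the degenerate cases p = 0 and p = 1: a term with a zero
    # count contributes nothing (k * log p with k = 0 is taken as 0).
    ll = 0.0
    if k > 0:
        ll += k * math.log2(p)
    if n - k > 0:
        ll += (n - k) * math.log2(1.0 - p)
    return ll

def llr_score(k1, n1, k2, n2):
    """LLR of formula (1): k1/n1 are the domain-corpus document counts,
    k2/n2 the background-corpus counts."""
    p  = (k1 + k2) / (n1 + n2)   # shared rate under the null hypothesis H0
    p1 = k1 / n1                 # domain rate under Ha
    p2 = k2 / n2                 # background rate under Ha
    null_ll = log_l(p, k1, n1) + log_l(p, k2, n2)
    alt_ll  = log_l(p1, k1, n1) + log_l(p2, k2, n2)
    return -2.0 * (null_ll - alt_ll)

# E.g., for 路由器 (router) in Table 1: llr_score(73, 200, 4, 12000)
```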
3.3 Relationship Discovery

In this section, we mine the parent-child links among terms using multi-level clustering based on term similarity, fusing together information from multiple knowledge sources as evidence for particular semantic relationships among terms. For clustering we adopt an improved k-medoids algorithm, which is flat and effective. The process of inducing relations is as follows. First, the clustering algorithm is applied to obtain the top-level clusters. Then, within each top-level cluster, the algorithm is applied again to obtain the second-level clusters. Continuing in this way, we obtain multi-level clusters that imply parent-child relationships among terms. In our experiments the depth of the clustering hierarchy is restricted.

3.3.1 Term Similarity Computation

The preprocessed documents are analyzed, producing a frequency matrix known as the term-document matrix. For a term-document matrix M[m][n], each term can be represented as the vector

$\vec{T}_i = (M[i][0], M[i][1], \ldots, M[i][k], \ldots, M[i][n])$  (2)

We can then compute the cosine similarity between a pair of terms:

$\cos(\vec{T}_i, \vec{T}_j) = \frac{\vec{T}_i \cdot \vec{T}_j}{|\vec{T}_i| \, |\vec{T}_j|}$  (3)

The cosine similarities between several term pairs are shown in Table 2.

Table 2. Cosine similarities of term pairs

Term1              Term2              Cosine Similarity
缓冲区 (buffer)     磁盘 (disk)         0.436248
缓冲区 (buffer)     主存 (main store)   0.579771
内存 (memory)       磁盘 (disk)         0.434059
服务器 (server)     网络 (network)      0.17585

It can be observed that the cosine similarity largely tracks the co-occurrence frequency of the term pairs: it reflects the contextual correlation of terms but may be biased against their semantic correlation. For instance, the cosine similarity computed as above between 磁盘 (disk) and 数据 (data) is just 0.0663552, which deviates from what experience would suggest. To obtain a more accurate similarity, we use HowNet, a Chinese thesaurus by Zhendong Dong and Qiang Dong (2001), as an additional knowledge resource, and combine the HowNet similarity of terms with the cosine similarity to estimate the correlation of term pairs. Any HowNet similarity involving unknown words is assigned 0. Table 3 shows some HowNet similarities of term pairs.

Table 3. HowNet similarities of term pairs

Term1              Term2              HowNet Similarity
缓冲区 (buffer)     磁盘 (disk)         0.149333
缓冲区 (buffer)     主存 (main store)   0.000000
内存 (memory)       磁盘 (disk)         0.149333
服务器 (server)     网络 (network)      0.285714

The final similarity is computed as

$sim(\vec{T}_i, \vec{T}_j) = \frac{sim_a(\vec{T}_i, \vec{T}_j) + \alpha \, sim_b(\vec{T}_i, \vec{T}_j)}{2}$  (4)

where $sim_a(\vec{T}_i, \vec{T}_j)$ is the cosine similarity, $sim_b(\vec{T}_i, \vec{T}_j)$ is the HowNet similarity, and α is an adjustment parameter. The adjusted similarities between term pairs are given in Table 4 (α = 0.5).

Table 4. Adjusted similarities of term pairs

Term1              Term2              Similarity
缓冲区 (buffer)     磁盘 (disk)         0.510913
缓冲区 (buffer)     主存 (main store)   0.579771
内存 (memory)       磁盘 (disk)         0.508724
服务器 (server)     网络 (network)      0.318716
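Both similarity measures are straightforward to express in code. The sketch below assumes the HowNet similarity arrives as a precomputed lookup value (the paper does not detail its calculation, only that it is 0 for unknown words); combined_sim transcribes formula (4) as printed.

```python
import math

def cosine_sim(u, v):
    """Formula (3): cosine similarity between two rows of the
    term-document matrix."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def combined_sim(u, v, hownet_sim, alpha=0.5):
    """Formula (4) as printed: combine the corpus-based cosine similarity
    (sim_a) with the HowNet thesaurus similarity (sim_b), where hownet_sim
    is a precomputed value that is 0 for unknown words."""
    return (cosine_sim(u, v) + alpha * hownet_sim) / 2.0

# Hypothetical usage, with row_buffer and row_disk as term-document rows:
#   combined_sim(row_buffer, row_disk, hownet_sim=0.149333)
```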
3.3.2 Clustering Algorithm

To make the algorithm converge faster and yield clusters that better match the original distribution of the terms, we improved the traditional k-medoids algorithm. When deciding the new center of a cluster, we first choose the top-p terms closest to the old center and then take the term closest to the mean of those p terms as the new center. The improved algorithm is as follows (a sketch in code follows the steps):

1. Randomly choose m terms as the cluster centers $C_1, C_2, \ldots, C_j, \ldots, C_m$.
2. Compute the similarity of each term to each cluster center, and assign each term to the cluster of its closest center.
3. Determine the new center of each cluster as follows:
   - Compute the average similarity of the terms in cluster j:

     $avgsim[j] = \frac{1}{n_j} \sum_{k=1}^{n_j} sim(T_k, C_j)$  (5)

     where $n_j$ is the number of terms in cluster j.
   - Pick out the p terms closest to the cluster center, with p decided by avgsim[j]:

     $p = \frac{m \cdot avgsim[j]}{max\_avgsim}$  (6)

     where max_avgsim is the maximum of the m average similarities over the m clusters.
   - Compute the mean of these p terms and choose the term closest to it as the new center of the cluster.
4. Go to step 2 if the m cluster centers are not yet stable.
5. Otherwise stop, yielding m clusters.
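Below is a compact sketch of the improved k-medoids step, assuming a two-argument similarity function such as the cosine of formula (3) or formula (4) with the HowNet value folded in. The rounding in formula (6) and the handling of empty clusters are our assumptions, since the paper leaves them unspecified.

```python
import random

def assign(terms, centers, sim):
    """Step 2: put every term into the cluster of its most similar center."""
    clusters = {c: [] for c in centers}
    for t in range(len(terms)):
        best = max(centers, key=lambda c: sim(terms[t], terms[c]))
        clusters[best].append(t)
    return clusters

def improved_kmedoids(terms, sim, m, max_iter=100):
    """Sketch of Section 3.3.2: terms is a list of term vectors, sim a
    similarity function, m the number of clusters."""
    centers = random.sample(range(len(terms)), m)        # step 1
    for _ in range(max_iter):
        clusters = assign(terms, centers, sim)           # step 2
        # Formula (5): average similarity of each cluster's members
        # to its current center (empty clusters are dropped here).
        avgsim = {c: sum(sim(terms[t], terms[c]) for t in ms) / len(ms)
                  for c, ms in clusters.items() if ms}
        max_avgsim = max(avgsim.values())
        new_centers = []
        for c, ms in clusters.items():
            if not ms:
                continue
            # Formula (6), as we read it: more cohesive clusters use
            # more of their members when relocating the center.
            p = max(1, round(m * avgsim[c] / max_avgsim))
            top = sorted(ms, key=lambda t: sim(terms[t], terms[c]),
                         reverse=True)[:p]
            vecs = [terms[t] for t in top]
            mean = [sum(col) / len(vecs) for col in zip(*vecs)]
            # Step 3: the member closest to the mean becomes the new center.
            new_centers.append(max(ms, key=lambda t: sim(terms[t], mean)))
        if set(new_centers) == set(centers):             # step 4: stable
            return assign(terms, new_centers, sim)
        centers = new_centers
    return assign(terms, centers, sim)
```

The multi-level tree of Section 3.3 then follows by re-running improved_kmedoids inside each resulting cluster, down to the restricted depth.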
4 Experiments and Evaluation

We have applied our approach to produce an ontology in the computer science domain. The domain corpus contains 200 Chinese theses (3.12 MB) retrieved from professional publications. The background corpus is the Modern Chinese Corpus (1999-2001), which consists of 12000 documents (33.5 MB).

4.1 Term Discovery

In the experiments, we compared the effect of the two scoring methods, LLR and TF.IDF. Table 5 shows the scores of some terms as relative percentages; terms whose LLR score exceeds the chosen threshold are identified as domain-relevant. The experimental results suggest that LLR outperforms TF.IDF.

Table 5. Relative scores of LLR and TF.IDF

Term      Domain DF   Background DF   LLR score   TF.IDF score
路由器        73             4            98.6          66.8
总线          56             1            99.9          70.4
控制器        35             7            96.5          94.5
端口          25             5            93.1          65.3
缓存          36             2            94.6          53.5
请求          17           188            67.5          72.6
线路          31           232            56.2          65.7
质量          37          3725            42.4          99.9
屏蔽           7             8            15.3          48.2

A list of 286 domain-relevant terms (key terms) was constructed manually to support the subsequent evaluation. We evaluate the results using recall, precision and F-measure, defined as follows.

Recall (R): the ratio of the number of correct terms to the number of key terms in the manual list:

$R = \frac{correct}{key}$  (7)

Precision (P): the ratio of the number of correct terms to the number of extracted terms:

$P = \frac{correct}{correct + incorrect}$  (8)

F-measure (F): a combination of recall and precision:

$F = \frac{2RP}{R + P}$  (9)

These three values vary with the LLR threshold. Figure 2 plots F-measure, recall and precision against the threshold.

Fig. 2. F-measure, recall and precision as the LLR threshold varies
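The metrics and the threshold sweep behind Fig. 2 can be sketched as follows. The names are illustrative: `scores` maps each candidate term to its LLR score, and `key_terms` stands for the 286-term manual list.

```python
def prf(extracted, key_terms):
    """Recall, precision and F-measure of formulas (7)-(9),
    over sets of terms."""
    correct = len(extracted & key_terms)
    r = correct / len(key_terms)
    p = correct / len(extracted) if extracted else 0.0
    f = 2 * r * p / (r + p) if r + p > 0 else 0.0
    return r, p, f

def best_threshold(scores, key_terms, thresholds):
    """Sweep candidate LLR thresholds, as in Fig. 2, and keep the one
    that maximizes F-measure."""
    return max(thresholds,
               key=lambda th: prf({t for t, s in scores.items() if s > th},
                                  key_terms)[2])
```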
4.2 Relationship Discovery

According to Fig. 2, we chose 94 as the LLR threshold and obtained 216 domain-relevant terms. The depth of the clustering hierarchy was restricted to 2, yielding 5 top-level clusters and 15 second-level clusters. The precision of the top-level clusters reached 76.7%, and the average precision of the second-level clusters reached 70.3%. The parent-child relationships among terms and clusters are shown in Fig. 3; each cluster is named by the most frequent term in it.

Fig. 3. Parent-child relationships among terms (a two-level cluster tree; the top-level clusters include 网络 (network), 计算机 (computer), 软件 (software), 信息 (information) and 算法 (algorithm))
5 Conclusion

In this paper we have presented an approach to automatically mining domain-dependent ontologies from corpora, based on term extraction and relationship discovery techniques. There are two main innovations in our approach. The first is term extraction using the log-likelihood ratio, which contrasts the probability of term occurrence in a domain corpus with that in a background corpus. The second is the fusion of information from multiple knowledge sources as evidence for discovering particular semantic relationships among terms. In our experiments, we also improved the traditional k-medoids algorithm for multi-level clustering. We have applied the approach to produce an ontology for the domain of computer science and obtained promising results. In future work, we will consider developing heuristic methods or rules to label the exact relations among terms, and we will seek a more effective methodology for ontology evaluation.

References

1. Inderjeet Mani. Automatically Inducing Ontologies from Corpora. In Proceedings of CompuTerm 2004: 3rd International Workshop on Computational Terminology, COLING 2004, Geneva.
2. A. Maedche and S. Staab. Mining Ontologies from Text. In 12th International Workshop on Knowledge Engineering and Knowledge Management (EKAW 2000), 2000.
3. A. Maedche and S. Staab. The TEXT-TO-ONTO Ontology Learning Environment. Software demonstration at ICCS 2000, Eighth International Conference on Conceptual Structures, August 14-18, 2000, Darmstadt, Germany.
4. Sadanand Srivastava and James Gil de Lamadrid. Extracting an Ontology from a Document Using Singular Value Decomposition. ADMI 2001.
5. Lin, D. and Pantel, P. 2001. Induction of Semantic Classes from Natural Language Text. In Proceedings of SIGKDD-01, pp. 317-322, San Francisco, CA.
6. Dunning, T. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1):61-74, March 1993.
7. Chakravarthi S. Velvadapu. Document Ontology Extractor. Applied Research in Computer Science, Fall 2001.
8. A. Maedche and S. Staab. Discovering Conceptual Relations from Text. In Proceedings of ECAI 2000. IOS Press, Amsterdam, 2000.
9. D. Faure and C. Nedellec. A Corpus-based Conceptual Clustering Method for Verb Frames and Ontology Acquisition. In LREC Workshop on Adapting Lexical and Corpus Resources to Sublanguages and Applications, Granada, Spain, 1998.
10. Doan, A., Madhavan, J., Domingos, P. and Halevy, A. 2002. Learning to Map between Ontologies on the Semantic Web. WWW 2002.