An Approach to Automatically Constructing Domain Ontology 1

Tingting He 1,2,3, Xiaopeng Zhang 1,3, Xinghuo Ye 1,3

1 Department of Computer Science, Huazhong Normal University, 430079 Wuhan, China
2 Software College of Tsinghua University, 102201 Beijing, China
3 National Language Resources Monitor & Research Center (Network Media), 430079 Wuhan, China
tthe@mail.ccnu.edu.cn, zhangxaopeng@mals.ccnu.edu.cn, yxhez@tom.com

Abstract. In this paper we present an approach to mining domain-dependent ontologies using term extraction and relationship discovery techniques. There are two main innovations in our approach. The first is term extraction using the log-likelihood ratio, which contrasts the probability of term occurrence in a domain corpus with that in a background corpus. The second is the fusion of information from multiple knowledge sources as evidence for discovering particular semantic relationships among terms. In our experiments, we also improve the traditional k-medoids algorithm for multi-level clustering. We have applied our approach to produce an ontology for the domain of computer science and obtained promising results.

1 Introduction

Ontologies have become an important means for structuring knowledge and building knowledge-intensive systems. They have shown their usefulness in application areas such as intelligent information integration, information retrieval and natural language processing, to name but a few. Accordingly, efforts have been made to facilitate the ontology engineering process, in particular the acquisition of ontologies from domain texts [1].

Constructing an ontology is an extremely laborious effort. Even with some reuse of core knowledge from an upper model, creating an ontology for a particular domain carries a high cost, incurred anew for each domain. Tools that could automate, or semi-automate, the construction of ontologies for different domains could dramatically reduce the cost of knowledge creation. One approach to developing such tools is to exploit the information implicit in collections of texts from a particular domain. If terms and their semantic relations could be extracted automatically from a text corpus, a domain ontology could be built conveniently. This would be more cost-effective than having a human develop the ontology from scratch [2][3].

Domain-specific ontologies are useful in systems involving artificial reasoning and information retrieval: they give such systems a vocabulary of terms and the concepts relating one term to another. In this paper, we present a method that automatically mines an ontology from any large corpus in a specific domain, to support data integration and information retrieval tasks. The induced ontology consists of domain concepts related only by parent-child links; more specialized relations are not included. Our approach comprises two main phases: term extraction and relationship discovery. In the former phase, meaningful terms are extracted using the LLR formula. In the latter, parent-child relations among terms are induced through multi-level clustering.

The remainder of the paper is organized as follows. Section 2 discusses related work in this field. Section 3 gives an overview of the overall system architecture and explains the process of ontology construction. Section 4 presents an example with some promising results we have obtained when applying our mechanisms to mining ontologies from text, together with our analysis. Section 5 concludes and mentions future work.

1 The paper is supported by the National Natural Science Foundation of China (NSFC) (Grant No. 60496323, Grant No. 60375016, Grant No. 10071028) and the Ministry of Education of China, Research Project for Science and Technology (Grant No. 105117).
2 Previous Work

Existing approaches to ontology induction include those that start from structured data, merging ontologies or database schemas (Doan et al. 2002). Other approaches use natural language data, either by analyzing the corpus directly (Sanderson and Croft 1999; Caraballo 1999) or by learning to expand WordNet with clusters of terms from a corpus (e.g., Girju et al. 2003). Information extraction approaches that infer labeled relations either require substantial hand-created linguistic or domain knowledge (e.g., Craven and Kumlien 1999; Hull and Gomez 1993) or require human-annotated training data with relation information for each domain (Craven et al. 1998).

For automatic ontology construction, Govind and Chakravarthi presented in 2001 an approach that extracts an ontology from text documents by Singular Value Decomposition (SVD), a purely statistical analysis method, in contrast to heuristic and rule-based methods. They adopt Latent Semantic Indexing (LSI), which attempts to capture term-term statistical relationships by replacing the document space with a lower-dimensional concept space. Their method is appealing for its simplicity, as it rests on a fairly precise mathematical foundation; it is effective but limited in precision. Moreover, it does not identify the exact relations among terms [4].

In 2001, Dekang Lin and Patrick Pantel proposed a method for discovering domain concepts based on a clustering algorithm called CBC (Clustering By Committee). They regard a concept simply as a cluster of terms, so their method addresses only one aspect of the whole process of ontology induction [5].

3 Automatic Ontology Construction

3.1 Architecture

The overall architecture for domain-independent ontology induction is shown in Fig. 1. The documents of the domain corpus are preprocessed to remove segments such as pictures and special symbols, and stopwords are then filtered out. Next, terms are extracted based on word segmentation and syntactic tagging. Subsequently, semantic relations between pairs of terms are derived using multi-level clustering, with evidence drawn from multiple knowledge sources.

Fig. 1. System Architecture (data cleanup: removing symbols, filtering stopwords; term discovery: parsing, tagging, term scoring, term extraction; relationship discovery: similarity computing with Chinese HowNet, multi-level clustering; ontology output)
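To make the dataflow of Fig. 1 concrete, the following is a minimal sketch of how the stages could be wired together. Every function name here is an illustrative placeholder rather than the authors' implementation; the individual pieces are filled in by Sections 3.2 and 3.3.

```python
# Hypothetical skeleton of the pipeline in Fig. 1; the helper functions
# are placeholders for the components described in Sections 3.2 and 3.3.

def build_ontology(domain_docs, background_docs, llr_threshold, depth=2):
    # Data cleanup: strip pictures/special symbols, then filter stopwords.
    cleaned = [filter_stopwords(remove_symbols(doc)) for doc in domain_docs]

    # Term discovery: word segmentation and tagging produce candidate
    # terms, kept only if their LLR score (Section 3.2) clears the threshold.
    candidates = segment_and_tag(cleaned)
    terms = [t for t in candidates
             if llr_for(t, cleaned, background_docs) > llr_threshold]

    # Relationship discovery: recursive clustering (Section 3.3) turns the
    # flat term list into a parent-child hierarchy of the given depth.
    return multilevel_cluster(terms, cleaned, depth)
```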
3.2 Term Discovery

In this section, our goal is to identify domain-relevant terms from a collection of domain-specific texts. Several approaches to term extraction exist. One is based on the PAT-tree, which has the advantage of extracting terms (phrases) of any length because it avoids word segmentation for Chinese; however, it needs a large number of documents to achieve good precision. Another is the C/NC-value method proposed by Frantzi and Ananiadou (1999), which combines linguistic and statistical methods and improves on purely statistical ones, but is limited in that its linguistic knowledge covers only English.

In this paper, we propose an approach to extracting domain-relevant terms in which terms are scored for domain relevance under the following assumption: if a term occurs significantly more often in a domain corpus than in a background corpus, the term is clearly domain-relevant. As an example, Table 1 compares the number of documents containing the terms 算法 (algorithm), 内存 (memory), 路由器 (router) and 数据库 (database) in a domain-dependent computer science corpus of 200 Chinese periodical papers with that in a larger Modern Chinese Corpus used as the background corpus.

Table 1. Document frequencies of terms in the domain and background corpora

                     算法   内存   路由器   数据库    Total
Domain Corpus          89     78       73       69      200
Background Corpus       7      0        4        6    12000
Difference             82     78       69       63   -11800

We can observe from Table 1 that the terms 算法 (algorithm), 内存 (memory), 路由器 (router) and 数据库 (database) occur significantly more often in the domain corpus than in the background corpus, so we consider them domain-dependent. To estimate the domain relevance of a term, we use the log-likelihood ratio (LLR) given by

$-2 \log_2 \left( \frac{H_0(p;\, k_1, n_1, k_2, n_2)}{H_a(p_1, p_2;\, n_1, k_1, n_2, k_2)} \right)$  (1)

LLR measures the extent to which a hypothesized model of the distribution of cell counts, $H_a$, differs from the null hypothesis, $H_0$ (namely, that the percentage of documents containing the term is the same in both corpora). We use a binomial model for $H_0$ and $H_a$. Here, $p = (k_1+k_2)/(n_1+n_2)$, $p_1 = k_1/n_1$ and $p_2 = k_2/n_2$, where $k_1$ is the number of documents containing the term in the domain corpus, $k_2$ is the number of documents containing the term in the background corpus, $n_1$ is the total number of documents in the domain corpus, and $n_2$ is the total number of documents in the background corpus [6].
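For illustration, formula (1) can be computed as below under the binomial model of Dunning (1993) [6]. This is a minimal sketch; the handling of the degenerate cases p = 0 and p = 1 is our own assumption, as the paper does not spell it out.

```python
import math

def log_l(p, k, n):
    """Binomial log-likelihood (base 2) of observing k successes in n trials."""
    # Guard the degenerate cases p = 0 and p = 1: a term with a zero
    # count contributes nothing (k * log p with k = 0 is taken as 0).
    ll = 0.0
    if k > 0:
        ll += k * math.log2(p)
    if n - k > 0:
        ll += (n - k) * math.log2(1.0 - p)
    return ll

def llr_score(k1, n1, k2, n2):
    """LLR of formula (1): k1/n1 are the domain-corpus document counts,
    k2/n2 the background-corpus counts."""
    p  = (k1 + k2) / (n1 + n2)   # shared rate under the null hypothesis H0
    p1 = k1 / n1                 # domain rate under Ha
    p2 = k2 / n2                 # background rate under Ha
    null_ll = log_l(p, k1, n1) + log_l(p, k2, n2)
    alt_ll  = log_l(p1, k1, n1) + log_l(p2, k2, n2)
    return -2.0 * (null_ll - alt_ll)

# E.g., for 路由器 (router) in Table 1: llr_score(73, 200, 4, 12000)
```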
3.3 Relationship Discovery

In this section, we mine the parent-child links among terms using multi-level clustering based on term similarity, fusing together information from multiple knowledge sources as evidence for particular semantic relationships among terms. For clustering we adopt an improved k-medoids algorithm, which is flat and effective. The process of inducing relations is as follows. First, the clustering algorithm is applied to obtain the top-level clusters. Then, within each top-level cluster, the algorithm is applied again to obtain the second-level clusters. Continuing in this way, we obtain multi-level clusters that imply parent-child relationships among terms. In our experiments the depth of the clustering hierarchy is restricted.

3.3.1 Term Similarity Computation

The preprocessed documents are analyzed, producing a frequency matrix known as the term-document matrix. For a term-document matrix M[m][n], each term can be represented as the vector

$\vec{T}_i = (M[i][0], M[i][1], \ldots, M[i][k], \ldots, M[i][n])$  (2)

We can then compute the cosine similarity between a pair of terms:

$\cos(\vec{T}_i, \vec{T}_j) = \frac{\vec{T}_i \cdot \vec{T}_j}{|\vec{T}_i| \, |\vec{T}_j|}$  (3)

The cosine similarities between several term pairs are shown in Table 2.

Table 2. Cosine similarities of term pairs

Term1              Term2              Cosine Similarity
缓冲区 (buffer)     磁盘 (disk)         0.436248
缓冲区 (buffer)     主存 (main store)   0.579771
内存 (memory)       磁盘 (disk)         0.434059
服务器 (server)     网络 (network)      0.17585

It can be observed that the cosine similarity largely tracks the co-occurrence frequency of the term pairs: it reflects the contextual correlation of terms but may be biased against their semantic correlation. For instance, the cosine similarity computed as above between 磁盘 (disk) and 数据 (data) is just 0.0663552, which deviates from what experience would suggest. To obtain a more accurate similarity, we use HowNet, a Chinese thesaurus by Zhendong Dong and Qiang Dong (2001), as an additional knowledge resource, and combine the HowNet similarity of terms with the cosine similarity to estimate the correlation of term pairs. Any HowNet similarity involving unknown words is assigned 0. Table 3 shows some HowNet similarities of term pairs.

Table 3. HowNet similarities of term pairs

Term1              Term2              HowNet Similarity
缓冲区 (buffer)     磁盘 (disk)         0.149333
缓冲区 (buffer)     主存 (main store)   0.000000
内存 (memory)       磁盘 (disk)         0.149333
服务器 (server)     网络 (network)      0.285714

The final similarity is computed as

$sim(\vec{T}_i, \vec{T}_j) = \frac{sim_a(\vec{T}_i, \vec{T}_j) + \alpha \, sim_b(\vec{T}_i, \vec{T}_j)}{2}$  (4)

where $sim_a(\vec{T}_i, \vec{T}_j)$ is the cosine similarity, $sim_b(\vec{T}_i, \vec{T}_j)$ is the HowNet similarity, and α is an adjustment parameter. The adjusted similarities between term pairs are given in Table 4 (α = 0.5).

Table 4. Adjusted similarities of term pairs

Term1              Term2              Similarity
缓冲区 (buffer)     磁盘 (disk)         0.510913
缓冲区 (buffer)     主存 (main store)   0.579771
内存 (memory)       磁盘 (disk)         0.508724
服务器 (server)     网络 (network)      0.318716
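Both similarity measures are straightforward to express in code. The sketch below assumes the HowNet similarity arrives as a precomputed lookup value (the paper does not detail its calculation, only that it is 0 for unknown words); combined_sim transcribes formula (4) as printed.

```python
import math

def cosine_sim(u, v):
    """Formula (3): cosine similarity between two rows of the
    term-document matrix."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def combined_sim(u, v, hownet_sim, alpha=0.5):
    """Formula (4) as printed: combine the corpus-based cosine similarity
    (sim_a) with the HowNet thesaurus similarity (sim_b), where hownet_sim
    is a precomputed value that is 0 for unknown words."""
    return (cosine_sim(u, v) + alpha * hownet_sim) / 2.0

# Hypothetical usage, with row_buffer and row_disk as term-document rows:
#   combined_sim(row_buffer, row_disk, hownet_sim=0.149333)
```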
3.3.2 Clustering Algorithm

To make the algorithm converge faster and yield clusters that better match the original distribution of the terms, we improved the traditional k-medoids algorithm. When deciding the new center of a cluster, we first choose the top-p terms closest to the old center and then take the term closest to the mean of those p terms as the new center. The improved algorithm is as follows (a sketch in code follows the steps):

1. Randomly choose m terms as the cluster centers $C_1, C_2, \ldots, C_j, \ldots, C_m$.
2. Compute the similarity of each term to each cluster center, and assign each term to the cluster of its closest center.
3. Determine the new center of each cluster as follows:
   - Compute the average similarity of the terms in cluster j:

     $avgsim[j] = \frac{1}{n_j} \sum_{k=1}^{n_j} sim(T_k, C_j)$  (5)

     where $n_j$ is the number of terms in cluster j.
   - Pick out the p terms closest to the cluster center, with p decided by avgsim[j]:

     $p = \frac{m \cdot avgsim[j]}{max\_avgsim}$  (6)

     where max_avgsim is the maximum of the m average similarities over the m clusters.
   - Compute the mean of these p terms and choose the term closest to it as the new center of the cluster.
4. Go to step 2 if the m cluster centers are not yet stable.
5. Otherwise stop, yielding m clusters.
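Below is a compact sketch of the improved k-medoids step, assuming a two-argument similarity function such as the cosine of formula (3) or formula (4) with the HowNet value folded in. The rounding in formula (6) and the handling of empty clusters are our assumptions, since the paper leaves them unspecified.

```python
import random

def assign(terms, centers, sim):
    """Step 2: put every term into the cluster of its most similar center."""
    clusters = {c: [] for c in centers}
    for t in range(len(terms)):
        best = max(centers, key=lambda c: sim(terms[t], terms[c]))
        clusters[best].append(t)
    return clusters

def improved_kmedoids(terms, sim, m, max_iter=100):
    """Sketch of Section 3.3.2: terms is a list of term vectors, sim a
    similarity function, m the number of clusters."""
    centers = random.sample(range(len(terms)), m)        # step 1
    for _ in range(max_iter):
        clusters = assign(terms, centers, sim)           # step 2
        # Formula (5): average similarity of each cluster's members
        # to its current center (empty clusters are dropped here).
        avgsim = {c: sum(sim(terms[t], terms[c]) for t in ms) / len(ms)
                  for c, ms in clusters.items() if ms}
        max_avgsim = max(avgsim.values())
        new_centers = []
        for c, ms in clusters.items():
            if not ms:
                continue
            # Formula (6), as we read it: more cohesive clusters use
            # more of their members when relocating the center.
            p = max(1, round(m * avgsim[c] / max_avgsim))
            top = sorted(ms, key=lambda t: sim(terms[t], terms[c]),
                         reverse=True)[:p]
            vecs = [terms[t] for t in top]
            mean = [sum(col) / len(vecs) for col in zip(*vecs)]
            # Step 3: the member closest to the mean becomes the new center.
            new_centers.append(max(ms, key=lambda t: sim(terms[t], mean)))
        if set(new_centers) == set(centers):             # step 4: stable
            return assign(terms, new_centers, sim)
        centers = new_centers
    return assign(terms, centers, sim)
```

The multi-level tree of Section 3.3 then follows by re-running improved_kmedoids inside each resulting cluster, down to the restricted depth.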
4 Experiments and Evaluation

We have applied our approach to produce an ontology in the computer science domain. The domain corpus contains 200 Chinese theses (3.12 MB) retrieved from professional publications. The background corpus is the Modern Chinese Corpus (1999-2001), which consists of 12000 documents (33.5 MB).

4.1 Term Discovery

In the experiments, we compared the effect of the two scoring methods, LLR and TF.IDF. Table 5 shows the scores of some terms as relative percentages; terms whose LLR score exceeds the chosen threshold are identified as domain-relevant. The experimental results suggest that LLR outperforms TF.IDF.

Table 5. Relative scores of LLR and TF.IDF

Term      Domain DF   Background DF   LLR score   TF.IDF score
路由器        73             4            98.6          66.8
总线          56             1            99.9          70.4
控制器        35             7            96.5          94.5
端口          25             5            93.1          65.3
缓存          36             2            94.6          53.5
请求          17           188            67.5          72.6
线路          31           232            56.2          65.7
质量          37          3725            42.4          99.9
屏蔽           7             8            15.3          48.2

A list of 286 domain-relevant terms (key terms) was constructed manually to support the subsequent evaluation. We evaluate the results using recall, precision and F-measure, defined as follows.

Recall (R): the ratio of the number of correct terms to the number of key terms in the manual list:

$R = \frac{correct}{key}$  (7)

Precision (P): the ratio of the number of correct terms to the number of extracted terms:

$P = \frac{correct}{correct + incorrect}$  (8)

F-measure (F): a combination of recall and precision:

$F = \frac{2RP}{R + P}$  (9)

These three values vary with the LLR threshold. Figure 2 plots F-measure, recall and precision against the threshold.

Fig. 2. F-measure, recall and precision as the LLR threshold varies
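The metrics and the threshold sweep behind Fig. 2 can be sketched as follows. The names are illustrative: `scores` maps each candidate term to its LLR score, and `key_terms` stands for the 286-term manual list.

```python
def prf(extracted, key_terms):
    """Recall, precision and F-measure of formulas (7)-(9),
    over sets of terms."""
    correct = len(extracted & key_terms)
    r = correct / len(key_terms)
    p = correct / len(extracted) if extracted else 0.0
    f = 2 * r * p / (r + p) if r + p > 0 else 0.0
    return r, p, f

def best_threshold(scores, key_terms, thresholds):
    """Sweep candidate LLR thresholds, as in Fig. 2, and keep the one
    that maximizes F-measure."""
    return max(thresholds,
               key=lambda th: prf({t for t, s in scores.items() if s > th},
                                  key_terms)[2])
```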
4.2 Relationship Discovery

According to Fig. 2, we chose 94 as the LLR threshold and obtained 216 domain-relevant terms. The depth of the clustering hierarchy was restricted to 2, yielding 5 top-level clusters and 15 second-level clusters. The precision of the top-level clusters reached 76.7%, and the average precision of the second-level clusters reached 70.3%. The parent-child relationships among terms and clusters are shown in Fig. 3; each cluster is named by the most frequent term in it.

Fig. 3. Parent-child relationships among terms (a two-level cluster tree; the top-level clusters include 网络 (network), 计算机 (computer), 软件 (software), 信息 (information) and 算法 (algorithm))
5 Conclusion

In this paper we have presented an approach to automatically mining domain-dependent ontologies from corpora, based on term extraction and relationship discovery techniques. There are two main innovations in our approach. The first is term extraction using the log-likelihood ratio, which contrasts the probability of term occurrence in a domain corpus with that in a background corpus. The second is the fusion of information from multiple knowledge sources as evidence for discovering particular semantic relationships among terms. In our experiments, we also improved the traditional k-medoids algorithm for multi-level clustering. We have applied the approach to produce an ontology for the domain of computer science and obtained promising results. In future work, we will consider developing heuristic methods or rules to label the exact relations among terms, and we will seek a more effective methodology for ontology evaluation.

References

1. Inderjeet Mani. Automatically Inducing Ontologies from Corpora. In Proceedings of CompuTerm 2004: 3rd International Workshop on Computational Terminology, COLING 2004, Geneva.
2. A. Maedche and S. Staab. Mining Ontologies from Text. In 12th International Workshop on Knowledge Engineering and Knowledge Management (EKAW 2000), 2000.
3. A. Maedche and S. Staab. The TEXT-TO-ONTO Ontology Learning Environment. Software demonstration at ICCS 2000, Eighth International Conference on Conceptual Structures, August 14-18, 2000, Darmstadt, Germany.
4. Sadanand Srivastava and James Gil de Lamadrid. Extracting an Ontology from a Document Using Singular Value Decomposition. ADMI 2001.
5. Lin, D. and Pantel, P. 2001. Induction of Semantic Classes from Natural Language Text. In Proceedings of SIGKDD-01, pp. 317-322, San Francisco, CA.
6. Dunning, T. Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1):61-74, March 1993.
7. Chakravarthi S. Velvadapu. Document Ontology Extractor. Applied Research in Computer Science, Fall 2001.
8. A. Maedche and S. Staab. Discovering Conceptual Relations from Text. In Proceedings of ECAI 2000. IOS Press, Amsterdam, 2000.
9. D. Faure and C. Nedellec. A Corpus-based Conceptual Clustering Method for Verb Frames and Ontology Acquisition. In LREC Workshop on Adapting Lexical and Corpus Resources to Sublanguages and Applications, Granada, Spain, 1998.
10. Doan, A., Madhavan, J., Domingos, P. and Halevy, A. 2002. Learning to Map between Ontologies on the Semantic Web. WWW 2002.