Automated Alignment and Extraction of Bilingual Ontology for Cross-Language Domain-Specific Applications

Jui-Feng Yeh, Chung-Hsien Wu, Ming-Jun Chen and Liang-Chih Yu
Department of Computer Science and Information Engineering
National Cheng Kung University, Tainan, Taiwan, R.O.C.
{jfyeh, chwu, mjchen, lcyu}@csie.ncku.edu.tw

Abstract

In this paper we propose a novel approach for ontology alignment and domain ontology extraction from the existing knowledge bases, WordNet and HowNet. These two knowledge bases are aligned to construct a bilingual ontology based on the co-occurrence of the words in the sentence pairs of a parallel corpus. The bilingual ontology has the merit that it contains more structural and semantic information coverage from these two complementary knowledge bases. For domain-specific applications, the domain-specific ontology is further extracted from the bilingual ontology by the island-driven algorithm and the domain-specific corpus. Finally, the domain-dependent terminologies and some axioms between domain terminologies are integrated into the ontology. For ontology evaluation, experiments were conducted by comparison with a benchmark constructed by ontology engineers or experts. The experimental results show that the proposed approach can extract an aligned bilingual domain-specific ontology.

1 Introduction

In recent years, considerable effort has been invested in developing the conceptual bases for building technology that allows knowledge reuse and sharing. As information exchangeability and communication become increasingly global, multilingual lexical resources that can provide transnational services are becoming increasingly important. On the other hand, multilingual ontology is very important for natural language processing, such as machine translation (MT), web mining (Oyama et al. 2004) and cross-language information retrieval (CLIR). Generally, a multilingual ontology maps the keyword set of one language to another language, or computes the co-occurrence of the words among languages. In addition, a key merit of a multilingual ontology is that it can increase the relation and structural information coverage by aligning two or more language-dependent ontologies with different semantic features. Over the last few years, significant effort has been made to construct the ontology manually according to the domain expert's knowledge.
Manual ontology merging using conventional editing tools without intelligent support is difficult, labor intensive and error prone. Therefore, several systems and frameworks for supporting the knowledge engineer in the ontology merging task have recently been proposed (Noy and Musen 2000). To avoid the reiteration in ontology construction, the algorithms of ontology merging (UMLS http://umlsks.nlm.nih.gov/) (Langkilde and Knight 1998) and ontology alignment (Vossen and Peters 1997) (Weigard and Hoppenbrouwers 1998) (Asanoma 2001) were investigated. In merging, the final ontology is a merged version of the original ontologies. In alignment, the two original ontologies persist, with aligned links between them. Alignment usually is performed when the ontologies cover domains that are complementary to each other.

In the past, domain ontology was usually constructed manually according to the knowledge or experience of the experts or ontology engineers. Recently, automatic and semi-automatic methods have been developed. OntoExtract (Fensel et al. 2002) (Missikoff et al. 2002) provided an ontology engineering chain to construct the domain ontology from WordNet and SemCor.

Nowadays vast investment is made in ontology construction for domain applications. Finding an authoritative evaluation for ontology is becoming a critical issue. Some evaluations are integrated into the ontology tools to detect and prevent mistakes. The mistakes that might be made in developing taxonomies with frames are described in (Gómez-Pérez 2001), which defined three main types of mistakes: inconsistency, incompleteness, and redundancy. To deal with these mistakes and carry out the validation and verification of ontology, some ontology checkers, validators and parsers were developed. These tools provide an efficacious appraisal of correctness when developing a new ontology. However, they are disappointing in ontology integration, especially when the original ontologies are well defined.

For other approaches (Maedche and Staab 2002), the similarity measures are proposed in the earlier stage of the evaluation. The evaluation consists of two layers: lexical layer and conceptual layer. In the lexical layer, the edit distance is integrated into the lexical similarity measure. The measure is defined as:

$$SM(L_i, L_j) = \max\left(0, \frac{\min(|L_i|, |L_j|) - ed(L_i, L_j)}{\min(|L_i|, |L_j|)}\right) \in [0, 1] \tag{1}$$

where $SM(\cdot)$ denotes the lexicon similarity function and $ed(\cdot)$ is the Levenshtein edit distance function defined in (Levenshtein 1966). $L_i$ and $L_j$ are the words within the lexicons of the ontologies. The conceptual layer focuses on the conceptual structures of the ontologies, namely taxonomic and non-taxonomic relations.

In this paper, the WordNet and HowNet knowledge bases are aligned to construct a bilingual universal ontology based on the co-occurrence of the words in a parallel corpus. For domain-specific applications, the medical domain ontology is further extracted from the universal ontology using the island-driven algorithm and a medical domain corpus. Finally, the axioms between medical terminologies are derived. A benchmark constructed by ontology engineers and experts is introduced to evaluate the bilingual ontology constructed using the methods proposed in this paper. This paper defines two measures, taxonomic relation and non-taxonomic relation, as the quantitative metrics to evaluate the ontology.

The rest of the paper is organized as follows. Section 2 describes the ontology construction process and the web search system framework. Section 3 presents the experimental results for the evaluation of our approach. Section 4 gives some concluding remarks.

2 Methodologies

Figure 1 shows the block diagram for ontology construction. There are two major processes in the proposed system: bilingual ontology alignment and domain ontology extraction.

2.1 Bilingual Ontology Alignment

In this approach, the cross-lingual ontology is constructed by aligning the words in WordNet to their corresponding words in HowNet. The hierarchical taxonomy is actually a conversion of HowNet. One of the important portions of HowNet is the methodology of defining the lexical entries. In HowNet, each lexical entry is defined as a combination of one or more primary features and a sequence of secondary features. The primary features indicate the entry's category, namely, the relation "is-a", which is in a hierarchical taxonomy. Based on the category, the secondary features make the entry's sense more explicit, but they are non-taxonomic.
In total, 1,521 primary features are divided into 6 upper categories: Event, Entity, Attribute, Attribute Value, Quantity, and Quantity Value. These primary features are organized into a hierarchical taxonomy.

First, the Sinorama (Sinorama 2001) database is adopted as the bilingual parallel corpus to compute the conditional probability of the words in WordNet, given the words in HowNet. Second, a bottom-up algorithm is used for relation mapping.

In WordNet a word may be associated with many synsets, each corresponding to a different sense of the word. For finding a relation between two different words, all the synsets associated with each word are considered (Fellbaum 1998). In HowNet, each word is composed of primary features and secondary features. The primary features indicate the word's category. The purpose of this approach is to increase the relation and structural information coverage by aligning the above two language-dependent ontologies, WordNet and HowNet, with their semantic features.

Figure 1 Ontology construction framework

The relation "is-a" defined in WordNet corresponds to the primary feature defined in HowNet. Equation (2) shows the mapping between the words in HowNet and the synsets in WordNet.
Given a Chinese word, $CW_i$, the probability of the word being related to a synset, $synset^k$, can be obtained via its corresponding English synonyms, $EW_j^k$, $j = 1, \ldots, m$, which are the elements in $synset^k$. The probability is estimated as follows.

$$\Pr(synset^k \mid CW_i) = \sum_{j=1}^{m} \Pr(synset^k, EW_j^k \mid CW_i) = \sum_{j=1}^{m} \Pr(synset^k \mid EW_j^k, CW_i) \, \Pr(EW_j^k \mid CW_i) \tag{2}$$

where

$$\Pr(synset^k \mid EW_j^k, CW_i) = \frac{N(synset^k, EW_j^k, CW_i)}{\sum_{l} N(synset^l, EW_j^k, CW_i)} \tag{3}$$

In the above equation, $N(synset^k, EW_j^k, CW_i)$ represents the number of co-occurrences of $CW_i$, $EW_j^k$ and $synset^k$. The probability $\Pr(EW_j^k \mid CW_i)$ is set to one when at least one of the primary features, $PF_l(CW_i)$, of the Chinese word defined in HowNet matches one of the ancestor nodes of the synset, $synset(EW_j^k)$, except the root nodes in the hierarchical structures of the noun and verb. Otherwise the probability $\Pr(EW_j^k \mid CW_i)$ is set to zero.

$$\Pr(EW_j^k \mid CW_i) = \begin{cases} 1, & \text{if } \left( \bigcup_{l} PF_l(CW_i) - \{entity, event, act, play\} \right) \cap \left( \bigcup ancestor(synset(EW_j^k)) - \{entity, event, act, play\} \right) \neq \emptyset \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

where {entity, event, act, play} is the concept set in the root nodes of HowNet and WordNet.

Finally, the Chinese concept, $CW_i$, is integrated into the synset, $synset^k$, in WordNet as long as the probability, $\Pr(synset^k \mid CW_i)$, is not zero. Figure 2(a) shows the concept tree generated by aligning WordNet and HowNet.

2.2 Domain ontology extraction

There are two phases to construct the domain ontology: 1) extract the ontology from the cross-language ontology by the island-driven algorithm, and 2) integrate the terms and axioms defined in a medical encyclopaedia into the domain ontology.

2.2.1 Extraction by island-driven algorithm

Ontology provides consistent concepts and word representations necessary for clear communication within the knowledge domain. Even in domain-specific applications, the number of words can be expected to be numerous. Synonym pruning is an effective alternative to word sense disambiguation. This paper proposes a corpus-based statistical approach to extracting the domain ontology. The steps are listed as follows:

Step 1 Linearization: This step decomposes the tree structure in the universal ontology shown in Figure 2(a) into vertex lists, each of which is an ordered node sequence starting at a leaf node and ending at the root node.
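The linearization in Step 1 can be sketched as follows. This is a minimal illustration, assuming the ontology is stored as a parent-to-children mapping; the names `linearize`, `children` and the toy concepts are hypothetical and do not come from the paper.

```python
from typing import Dict, List


def linearize(children: Dict[str, List[str]], root: str) -> List[List[str]]:
    """Return one vertex list per leaf, ordered from the leaf up to the root."""
    vertex_lists: List[List[str]] = []
    # Each stack entry holds a node and the path from the root down to it.
    stack = [(root, [root])]
    while stack:
        node, path = stack.pop()
        kids = children.get(node, [])
        if not kids:
            # A leaf closes one vertex list; reverse so it runs leaf -> root.
            vertex_lists.append(path[::-1])
        # Push in reverse so children are visited in their listed order.
        for kid in reversed(kids):
            stack.append((kid, path + [kid]))
    return vertex_lists


# Toy taxonomy (illustrative only, not taken from WordNet or HowNet).
toy_tree = {
    "entity": ["organ", "disease"],
    "disease": ["diabetes", "flu"],
}
vertex_lists = linearize(toy_tree, "entity")
for vl in vertex_lists:
    print(" -> ".join(vl))
```

Each vertex list shares its upper nodes with other lists, which is why the later evaluation introduces normalization factors for concepts that occur in several lists.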
Step 2 Concept extraction from the corpus: A node is defined as an operative node when the Tf-idf value of word $W_i$ in the domain corpus is higher than that in its corresponding contrastive (out-of-domain) corpus. That is,

$$operative\_node(W_i) = \begin{cases} 1, & \text{if } \text{Tf-idf}_{Domain}(W_i) > \text{Tf-idf}_{Contrastive}(W_i) \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

where

$$\text{Tf-idf}_{Domain}(W_i) = freq_{i,Domain} \times \log \frac{n_{i,Domain} + n_{i,Contrastive}}{n_{i,Domain}}$$

$$\text{Tf-idf}_{Contrastive}(W_i) = freq_{i,Contrastive} \times \log \frac{n_{i,Domain} + n_{i,Contrastive}}{n_{i,Contrastive}}$$

In the above equations, $freq_{i,Domain}$ and $freq_{i,Contrastive}$ are the frequencies of word $W_i$ in the domain documents and its contrastive (out-of-domain) documents, respectively. $n_{i,Domain}$ and $n_{i,Contrastive}$ are the numbers of the documents containing word $W_i$ in the domain documents and its contrastive documents, respectively. The nodes with bold circles in Figure 2(a) represent the operative nodes.

Step 3 Relation expansion using the island-driven algorithm: Some domain concepts are not operative after the previous steps due to the problem of sparse data. From observation in ontology construction, most of the inoperative concept nodes have operative hypernym nodes and hyponym nodes. Therefore, the island-driven algorithm is adopted to activate these inoperative concept nodes if their ancestors and descendants are all operative. The nodes with gray background shown in Figure 2(a) are the activated operative nodes.
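Steps 2 and 3 can be sketched as below. The operative test follows Equation (5) directly. The activation rule is only one possible reading of the island-driven step: here an inoperative node is activated when it lies between two operative nodes on the same vertex list. All function names and the toy counts are hypothetical.

```python
import math
from typing import List, Set


def tf_idf(freq: int, n_own: int, n_other: int) -> float:
    """freq * log((n_own + n_other) / n_own), as in the Tf-idf terms of Eq. (5).

    Returns 0.0 when the word appears in no document of its own corpus
    (an assumption made here to avoid dividing by zero).
    """
    if n_own == 0:
        return 0.0
    return freq * math.log((n_own + n_other) / n_own)


def is_operative(freq_dom: int, n_dom: int, freq_con: int, n_con: int) -> bool:
    """A word is operative when its domain score exceeds its contrastive score."""
    return tf_idf(freq_dom, n_dom, n_con) > tf_idf(freq_con, n_con, n_dom)


def island_activate(vertex_list: List[str], operative: Set[str]) -> Set[str]:
    """Return inoperative nodes enclosed by operative nodes on one vertex list.

    Assumed reading of the island-driven step: the operative nodes act as
    islands and the gaps between the lowest and highest island are filled.
    """
    positions = [i for i, node in enumerate(vertex_list) if node in operative]
    if len(positions) < 2:
        return set()
    span = vertex_list[positions[0]:positions[-1] + 1]
    return {node for node in span if node not in operative}


# Toy counts: frequent in the domain corpus, rare in the contrastive corpus.
print(is_operative(freq_dom=30, n_dom=10, freq_con=2, n_con=2))

# "hormone" is inoperative but sits between two operative nodes.
leaf_to_root = ["insulin", "hormone", "substance", "entity"]
print(island_activate(leaf_to_root, {"insulin", "substance"}))
```

The logarithm base does not affect the comparison in Equation (5), so the natural logarithm is used here for convenience.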
Step 4 Domain ontology extraction: The final step is to merge the linear vertex list sequence into a hierarchical tree. However, some noisy concepts not belonging to this domain ontology are operative. These nodes with operative noisy concepts should be filtered out. Finally, the domain ontology is extracted and the final result is shown in Figure 2(b). After the above steps, a dummy node is added as the root node of the domain concept tree.

Figure 2(a) Concept tree generated by aligning WordNet and HowNet. The nodes with bold circles represent the operative nodes after concept extraction. The nodes with gray background represent the operative nodes after relation expansion.

Figure 2(b) The domain ontology after filtering out the isolated concepts

2.2.2 Axiom and terminology integration

In practice, specific domain terminologies and axioms should be derived and introduced into the ontology for domain-specific applications. There are two approaches to add the terminologies and axioms: the first is manual editing by the ontology engineers, and the other is to obtain them from a domain encyclopaedia.

For the medical domain, we obtain 23 axioms derived from a medical encyclopaedia about the terminologies related to diseases, syndromes, and clinical information. Figure 3 shows an example of the axioms. In this example, the disease "diabetes" is tagged as level "A", which represents that this disease occurs frequently. The degrees for the corresponding syndromes represent the causality between the disease and the syndromes. The axioms also provide two fields, "department of the clinical care" and "the category of the disease", for medical information retrieval or other medical applications.

Figure 3 One example of the axioms

3 Evaluation

For evaluation, a medical domain ontology is constructed. A medical web mining system is also implemented to evaluate the practicability of the bilingual ontology.

3.1 Conceptual Evaluation of Ontology

The benchmark ontologies are created to be test suites of reusable data which can be employed by ontology engineers or constructors for benchmarking purposes. The benchmark ontology was constructed by domain experts, including two doctors and one pharmacologist, based on UMLS.
The domain experts have integrated the Chinese concepts without changing the contents of UMLS.

Evaluation of ontology construction adopted the two-layer measures: lexical and conceptual layers (Eichmann et al. 1998). The evaluation in the conceptual layer seems to be more important than that in the lexical layer when the ontology is constructed by aligning or merging several well-defined source ontologies. There are two conceptual relation categories for evaluation: taxonomic and non-taxonomic evaluations.

3.1.1 Evaluation of the taxonomic relation

Step 1 Linearization: This step decomposes the tree structure into the vertex lists as described in Section 2.2.1. The target ontology, $O_T$, and the benchmark, $O_B$, are shown in Figure 4(a) and 4(b), respectively. After this linearization, the vertex list sets $VLS_{O_T}$ and $VLS_{O_B}$ are obtained as shown in Figure 4(c), where $VLS_{O_T} = \{VL_1^T, VL_2^T, VL_3^T, VL_4^T\}$ and $VLS_{O_B} = \{VL_1^B, VL_2^B, VL_3^B\}$.
(a) The taxonomic hierarchical representation of target ontology $O_T$
(b) The taxonomic hierarchical representation of benchmark ontology $O_B$
(c) The taxonomic vertex list set representation of target ontology and benchmark ontology, $VLS_{O_T}$ and $VLS_{O_B}$

Figure 4 Linearization of ontologies

Step 2 Normalization: Since the frequencies of concepts in the vertex list set are not equal, normalization factors are introduced to address this problem. For the target ontology, the factor vector for normalization is $NF_{O_T} = \{nf_1^T, nf_2^T, nf_3^T, nf_4^T, nf_5^T, nf_6^T, nf_7^T, nf_8^T\}$, and for the benchmark ontology it is $NF_{O_B} = \{nf_1^B, nf_2^B, nf_3^B, nf_4^B, nf_5^B, nf_6^B, nf_7^B, nf_8^B, nf_9^B\}$, where $nf_i^O$ is the normalization factor for the i-th concept of the ontology $O$. It is defined as the reciprocal of the frequency in the vertex list set:

$$nf_i^O = \frac{1}{\text{number of the vertex lists containing concept } i \text{ in ontology } O}$$

Step 3 Estimation of the vertex list similarity: The pairwise similarity of two vertex lists of the target ontology and benchmark ontology can be obtained using the Needleman/Wunsch technique, shown in the following steps:

Initialization: Create a matrix with m+1 columns and n+1 rows. m and n are the numbers of the concepts in the vertex lists of the target ontology and the benchmark ontology, respectively. The first row and first column of the matrix are initially set to 0. That is,

$$Sim(m, n) = 0, \quad \text{if } m = 0 \text{ or } n = 0 \tag{6}$$

Matrix filling: Assign the values to the remaining elements in the matrix using the following equation:

$$Sim(V_m^T, V_n^B) = \max \begin{cases} Sim(m-1, n-1) + \frac{1}{2}\left(nf_m^T + nf_n^B\right) Sim_{lexicon}(V_m^T, V_n^B) \\ Sim(m-1, n) + \frac{1}{2}\left(nf_m^T + nf_n^B\right) Sim_{lexicon}(V_m^T, V_n^B) \\ Sim(m, n-1) + \frac{1}{2}\left(nf_m^T + nf_n^B\right) Sim_{lexicon}(V_m^T, V_n^B) \end{cases} \tag{7}$$

There are some synonyms belonging to the same concept represented in one vertex. So the lexicon similarity can be described as

$$Sim_{lexicon}(V_m^T, V_n^B) = \frac{|\text{Synonyms defined in } V_m^T \text{ and } V_n^B|}{|\text{Synonyms defined in } V_m^T \text{ or } V_n^B|} \tag{8}$$

Traceback: Determine the actual alignment with the maximum score, $Sim(V_m^T, V_n^B)$. The pairwise similarity is then defined by the following equation:

$$Sim(VL_i^T, VL_j^B) = \max_{m, n} Sim(V_m^T, V_n^B) \tag{9}$$

Step 4 Pairwise similarity matrix estimation: The pairwise similarity matrix is obtained by performing Step 3 p × q times, where p and q are the numbers of the vertex lists of the target ontology and benchmark ontology. Each element of the pairwise similarity matrix in Equation (10) is obtained using Equation (9).
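The matrix filling and traceback of Step 3 can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes each cell takes the best of its three neighbouring cells and adds the lexicon similarity of the current vertex pair, weighted by the mean of the two normalization factors. The function names, the synonym dictionaries and the toy concepts are hypothetical.

```python
from typing import Dict, List, Set


def lexicon_similarity(a: Set[str], b: Set[str]) -> float:
    """Shared synonyms divided by all synonyms of the two vertices (Eq. 8)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def normalization_factors(vertex_lists: List[List[str]]) -> Dict[str, float]:
    """Reciprocal of the number of vertex lists that contain each concept."""
    counts: Dict[str, int] = {}
    for vl in vertex_lists:
        for concept in vl:
            counts[concept] = counts.get(concept, 0) + 1
    return {concept: 1.0 / k for concept, k in counts.items()}


def vertex_list_similarity(
    target: List[str],
    benchmark: List[str],
    syn_t: Dict[str, Set[str]],
    syn_b: Dict[str, Set[str]],
    nf_t: Dict[str, float],
    nf_b: Dict[str, float],
) -> float:
    """Fill the alignment matrix and return its maximum score."""
    m, n = len(target), len(benchmark)
    # Row 0 and column 0 stay at zero (initialization, Eq. 6).
    sim = [[0.0] * (n + 1) for _ in range(m + 1)]
    best = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            vt, vb = target[i - 1], benchmark[j - 1]
            weight = 0.5 * (nf_t[vt] + nf_b[vb])
            gain = weight * lexicon_similarity(syn_t[vt], syn_b[vb])
            sim[i][j] = max(sim[i - 1][j - 1], sim[i - 1][j], sim[i][j - 1]) + gain
            best = max(best, sim[i][j])
    return best


# Toy data (illustrative only).
target_lists = [["diabetes", "disease", "entity"], ["flu", "disease", "entity"]]
bench_lists = [["diabetes mellitus", "disease", "entity"]]
syn_t = {
    "diabetes": {"diabetes", "diabetes mellitus"},
    "flu": {"flu", "influenza"},
    "disease": {"disease", "illness"},
    "entity": {"entity"},
}
syn_b = {
    "diabetes mellitus": {"diabetes mellitus", "diabetes"},
    "disease": {"disease"},
    "entity": {"entity"},
}
nf_t = normalization_factors(target_lists)
nf_b = normalization_factors(bench_lists)
score = vertex_list_similarity(
    target_lists[0], bench_lists[0], syn_t, syn_b, nf_t, nf_b
)
print(score)
```

Running the function over every pair of target and benchmark vertex lists yields the p × q pairwise similarity values described in Step 4.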
Figure 5 Pairwise similarity between the target ontology and benchmark ontology
$$PSM(O_T, O_B) = \begin{bmatrix} Sim(VL_1^T, VL_1^B) & \cdots & Sim(VL_1^T, VL_q^B) \\ \vdots & \ddots & \vdots \\ Sim(VL_p^T, VL_1^B) & \cdots & Sim(VL_p^T, VL_q^B) \end{bmatrix}_{p \times q} \tag{10}$$

Step 5 Evaluation of the taxonomic hierarchy: The overall similarity between the target ontology and benchmark ontology can be represented as:

$$Sim_{taxonomic}(O_T, O_B) = \frac{1}{p} \sum_{i=1}^{p} \max_{1 \le j \le q} Sim(VL_i^T, VL_j^B) \tag{11}$$

3.1.2 Evaluation of the non-taxonomic relation

Some relations defined in the ontology are non-taxonomic sets, such as synonyms. The lexicon similarity is applied to measure the conceptual similarity. The lexicon similarity of sets can be defined by the following equation:

$$Sim_{lexicon}(V_s^T, V_t^B) = \frac{|\text{Words defined in } V_s^T \text{ and } V_t^B|}{|\text{Words defined in } V_s^T \text{ or } V_t^B|} \tag{12}$$

Therefore, the evaluation of the non-taxonomic relation is defined as

$$Sim_{non\text{-}taxonomic}(O_T, O_B) = \frac{1}{p \times q} \sum_{s=1}^{p} \sum_{t=1}^{q} Sim_{lexicon}(V_s^T, V_t^B) \tag{13}$$

3.1.3 Evaluation Results

Using the benchmark ontology and evaluation metrics described in the previous sections, the evaluation results are shown in Table 1.

Table 1 The similarity measure between the target ontology and benchmark ontology

Taxonomic relation similarity       0.57
Non-taxonomic relation similarity   0.68

According to the experimental results, one phenomenon is observed: the number of words mapped to the same concept in the upper layer of the ontology is larger than that in the lower layer, because the terminologies usually appear in the lower layer.

3.2 Evaluation of domain application

To assess the ontology performance, a medical web-mining system to search the desired page has been implemented. In this system the web pages were collected from several websites; in total, 2322 web pages for the medical domain and 833 web pages for the contrastive domain were collected. The training and test queries for training and evaluating the system performance were also collected. Forty users, who did not take part in the system development, were asked to provide a set of queries given the collected web pages. After postprocessing, the duplicate queries and the queries outside the medical domain were removed. Finally, 3207 test queries using natural language were obtained.

The baseline system is based on the Vector-Space Model (VSM) and synonym expansion. The conceptual relations and axioms defined in the medical ontology are integrated into the baseline to form the ontology-based system. The result is shown in Table 2.
The results show that the ontology-based system outperforms the baseline system with synonym expansion, especially in recall rate.

Table 2 Precision rate (%) at the 11-point recall levels

Recall level: 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1
Baseline system: 78 73 68 65 60 52 38 30 2 5
Ontology based: 87 86 82 77 73 7 68 62 5 40 32

4 Conclusion

A novel approach to automated ontology alignment and domain ontology extraction from two knowledge bases is presented in this paper. In this approach, a bilingual ontology is developed from two well-established language-dependent knowledge bases, WordNet and HowNet, according to the co-occurrence of the words in the parallel bilingual corpus. A domain-dependent ontology is further extracted from the universal ontology using the island-driven algorithm and a domain corpus with its contrastive corpus. In addition, domain-specific terms and axioms are added to the domain ontology. This paper also proposed an evaluation method, with a benchmark and metrics, for ontology construction. Besides, we applied the domain-specific ontology to web page search in the medical domain. The experimental results show that the proposed approach outperformed the synonym expansion approach. The overall performance of the information retrieval system is directly related to the ontology.
References

N. Asanoma, 2001. Alignment of Ontologies: WordNet and Goi-Taikei. WordNet and Other Lexical Resources Workshop Program, NAACL2001. 89-94.

D. Eichmann, M. Ruiz, and P. Srinivasan, 1998. Cross-language information retrieval with the UMLS Metathesaurus, Proceeding of ACM Special Interest Group on Information Retrieval (SIGIR), ACM Press, NY (1998), 72-80.

D. Fensel, C. Bussler, Y. Ding, V. Kartseva, M. Klein, M. Korotkiy, B. Omelayenko and R. Siebes, 2002. Semantic Web Application Areas, the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB02).

F. C. Fellbaum, 1998. WordNet an electronic Lexical Database, The MIT Press 1998. pp307-308.

A. Gómez-Pérez, 2001. Evaluating ontologies: Cases of Study. IEEE Intelligent Systems and their Applications: Special Issue on Verification and Validation of ontologies. Vol. 16, Number 3. March 2001. Pags: 391-409.

I. Langkilde and K. Knight, 1998. Generation that Exploits Corpus-Based Statistical Knowledge. In Proceedings of COLING-ACL 1998.

V. Levenshtein, 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10:707-710.

A. Maedche, and S. Staab, 2002. Measuring Similarities between Ontologies. In Proceedings of the 13th European Conference on Knowledge Engineering and Knowledge Management EKAW, Madrid, Spain 2002/10/04.

M. Missikoff, R. Navigli, and P. Velardi, 2002. Integrated approach to Web ontology learning and engineering, Computer, Volume: 35 Issue: 11. 60-63.

N. F. Noy, and M. Musen, 2000. PROMPT: Algorithm and Tool for Automated Ontology Merging and Alignment, Proceedings of the National Conference on Artificial Intelligence. AAAI2000. 450-455.

S. Oyama, T. Kokubo, and T. Ishida, 2004. Domain-Specific Web Search with Keyword Spices. IEEE Transactions on Knowledge and Data Engineering, Vol 16, NO. 1, 17-27.

Sinorama Magazine and Wordpedia.com Co., 2001. Multimedia CD-ROMs of Sinorama from 1976 to 2000, Taipei.

P. Vossen, and W. Peters, 1997. Multilingual design of EuroWordNet, Proceedings of the Delos workshop on Cross-language Information Retrieval.

H. Weigard, and S. Hoppenbrouwers, 1998. Experiences with a multilingual ontology-based lexicon for news filtering, Proceedings in the 9th International Workshop on Database and Expert Systems Applications. 160-165.