Automated Alignment and Extraction of Bilingual Domain Ontology for Medical Domain Web Search

Jui-Feng Yeh *+, Chung-Hsien Wu *, Ming-Jun Chen * and Liang-Chih Yu *

* Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.
{chwu, jfyeh, mjchen, lcyu}@csie.ncku.edu.tw
+ Department of Computer Application Engineering, Far East College, Tainan, Taiwan, R.O.C.

Abstract

This paper proposes an approach to automated ontology alignment and domain ontology extraction from two knowledge bases. First, the WordNet and HowNet knowledge bases are aligned to construct a bilingual universal ontology based on the co-occurrence of words in a parallel corpus. The bilingual universal ontology has the merit that it covers more structural and semantic information than either of the two complementary knowledge bases, WordNet and HowNet, alone. For domain-specific applications, a medical domain ontology is further extracted from the universal ontology using the island-driven algorithm and a medical domain corpus. Finally, domain-dependent terms and some axioms between medical terms, based on a medical encyclopaedia, are added to the ontology. For ontology evaluation, web search experiments were conducted using the constructed ontology. The experimental results show that the proposed approach can automatically align and extract the domain-specific ontology. In addition, the extracted ontology shows promising ability for medical web search.

1 Introduction

In intelligent mining, some knowledge base should be involved in the intelligent information system in order to avoid unnecessary keyword expansion. In recent years, considerable effort has been invested in developing the conceptual bases for building technology that allows knowledge reuse and sharing. As information exchange and communication become increasingly global, multiple-language lexical resources that can provide transnational services are becoming increasingly important. Over the last few years, significant effort has been made to construct ontologies manually according to the domain expert's knowledge. Manual ontology merging using conventional editing tools without intelligent support is difficult, labor intensive and error prone. Therefore, several systems and frameworks for supporting the knowledge engineer in the ontology merging task have recently been proposed.
To avoid repeated ontology construction, algorithms for ontology merging (UMLS http://umlsks.nlm.nih.gov/) (Langkilde and Knight 1998) and ontology alignment (Vossen and Peters 1997) (Weigard and Hoppenbrouwers 1998) (Asanoma 2001) have been investigated. In merging, the final ontology is a merged version of the original ontologies. In alignment, the two original ontologies persist, with links established between them. Alignment is usually performed when the ontologies cover domains that are complementary to each other. In the past, a domain ontology was usually constructed manually according to the knowledge or experience of experts or ontology engineers. Recently, automatic and semi-automatic methods have been developed. OntoExtract (Fensel et al. 2002) (Missikoff et al. 2002) provided an ontology engineering chain to construct the domain ontology from WordNet and SemCor.

On the other hand, a multi-lingual ontology is very important for natural language processing tasks such as machine translation (MT), web mining and cross language information retrieval (CLIR). Generally, a multi-lingual ontology maps the keyword set of one language to another language, or computes the co-occurrence of words among languages. In addition, a key merit of a multilingual ontology is that it can increase the relation and structural information coverage by aligning two or more language-dependent ontologies with different semantic features.

Nowadays large collections of information in various styles are available on the Internet, and finding desired information on the World Wide Web is becoming a critical issue. Some general-purpose search engines like Google (http://www.google.com) and Altavista (http://www.altavista.com/) provide the facility to mine the web. There are three major research areas in web mining: web content mining, web structure mining and web usage mining. This paper proposes a novel method for web content mining with unstructured web pages. There are many approaches from the viewpoint of natural language processing. According to the representation of web pages, there are several kinds of content: bag of words (with order or not) (Kargupta et al. 1997) (Nahm and Mooney, 2000), phrases (Ahonen et al. 1998) (Frank et al. 1999) (Yang et al. 1999), relational terms (Cohen, 1998) (Junker 1999) and concept categories. We propose an ontology-based web search approach. Unfortunately, some irrelevant pages are retrieved, and these pages result in low precision and recall rates due to the problem of polysemy. To solve this problem, domain knowledge becomes necessary. Domain-specific web miners like SPIRAL, Cora (Cohen, 1998), WebKB (Martin and Eklund 2000) and HelpfulMed (Chen et al. 2003) are employed as special search engines for a topic of interest. Engines dedicated to a single topic, such as recipes, are less likely to return irrelevant web pages when a query is entered.

In this paper, the WordNet and HowNet knowledge bases are aligned to construct a bilingual universal ontology based on the co-occurrence of words in a parallel corpus. For domain-specific applications, a medical domain ontology is further extracted from the universal ontology using the island-driven algorithm (Lee et al. 1995) and a medical domain corpus. Finally, the axioms between medical terms are derived based on semantic relations. A web search system for the medical domain based on the extracted domain ontology is realized to demonstrate the feasibility of the methods proposed in this paper.

The rest of the paper is organized as follows. Section 2 describes the ontology construction process and the web search system framework. Section 3 presents the experimental results for the evaluation of our approach. Section 4 gives some concluding remarks.

2 Methodologies

Figure 1 shows the block diagram for ontology construction and the framework of the domain-specific web search system. There are four major processes in the proposed system: bilingual ontology alignment, domain ontology extraction, knowledge representation and domain-specific web search.

2.1 Bilingual Ontology Alignment

In this approach, the cross-lingual ontology is constructed by aligning the words in WordNet with their corresponding words in HowNet. First, the Sinorama (Sinorama 2001) database is adopted as the bilingual parallel corpus to compute the conditional probability of the words in WordNet, given the words in HowNet. Second, a bottom-up algorithm is used for relation mapping.

Figure 1. Ontology construction framework and the domain-specific web search system.
In WordNet a word may be associated with many synsets, each corresponding to a different sense of the word. When we look for a relation between two different words we consider all the synsets associated with each word (Fellbaum 1998). In HowNet, each word is composed of primary features and secondary features. The primary features indicate the word's category. The purpose of this approach is to increase the relation and structural information coverage by aligning these two language-dependent ontologies, WordNet and HowNet, which have different semantic features. The is-a relation defined in WordNet corresponds to the primary feature defined in HowNet.

Equation (1) shows the mapping between the words in HowNet and the synsets in WordNet. Given a Chinese word, $CW_i$, the probability of the word being related to a synset, $synset_k$, can be obtained via its corresponding English synonyms, $EW_j^k$, $j = 1, \ldots, m$, which are the elements in $synset_k$. The probability is estimated as follows.

$$\Pr(synset_k \mid CW_i) = \sum_{j=1}^{m} \Pr(synset_k, EW_j^k \mid CW_i) = \sum_{j=1}^{m} \left( \Pr(synset_k \mid EW_j^k, CW_i) \times \Pr(EW_j^k \mid CW_i) \right) \tag{1}$$

where

$$\Pr(synset_k \mid EW_j^k, CW_i) = \frac{N(synset_k, EW_j^k, CW_i)}{\sum_{l} N(synset_l, EW_j^k, CW_i)} \tag{2}$$

In the above equation, $N(synset_k, EW_j^k, CW_i)$ represents the number of co-occurrences of $CW_i$, $EW_j^k$ and $synset_k$. The probability $\Pr(EW_j^k \mid CW_i)$ is set to one when at least one of the primary features, $PF_l(CW_i)$, of the Chinese word defined in HowNet matches one of the ancestor nodes of the synset, $synset_k(EW_j^k)$, excluding the root nodes in the hierarchical structures of the noun and verb; otherwise it is set to zero. That is,

$$\Pr(EW_j^k \mid CW_i) = \begin{cases} 1 & \text{if } \left( \bigcup_{l} PF_l(CW_i) - \{entity, event, act, play\} \right) \cap \left( \bigcup ancestor(synset_k(EW_j^k)) - \{entity, event, act, play\} \right) \neq \emptyset \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

where {entity, event, act, play} is the concept set in the root nodes of HowNet and WordNet.

Finally, the Chinese concept, $CW_i$, is integrated into the synset, $synset_k$, in WordNet as long as the probability, $\Pr(synset_k \mid CW_i)$, is not zero. Figure 2 shows the concept tree generated by aligning WordNet and HowNet.

Figure 2. Concept tree generated by aligning WordNet and HowNet. The nodes with a bold circle represent the operative nodes after concept extraction. The nodes with a gray background represent the operative nodes after relation expansion.
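The alignment score in equations (1) to (3) can be illustrated with a minimal Python sketch. It assumes the co-occurrence counts have already been collected from the parallel corpus; the function names, the toy synset identifiers, the pinyin key "yisheng" and all counts are hypothetical and are not taken from the paper. Because equation (3) is a 0/1 indicator, the summed value behaves as a score and may exceed 1; the text only tests whether it is non-zero.

```python
# Toy illustration of equations (1)-(3). All names and data are hypothetical.

ROOT_CONCEPTS = {"entity", "event", "act", "play"}


def pr_ew_given_cw(primary_features, synset_ancestors):
    """Equation (3): 1.0 if a non-root primary feature of the Chinese word
    matches a non-root ancestor of the synset, else 0.0."""
    features = set(primary_features) - ROOT_CONCEPTS
    ancestors = set(synset_ancestors) - ROOT_CONCEPTS
    return 1.0 if features & ancestors else 0.0


def pr_synset_given_ew_cw(synset, ew, cw, counts):
    """Equation (2): relative co-occurrence count of (synset, ew, cw)."""
    total = sum(n for (s, e, c), n in counts.items() if e == ew and c == cw)
    if total == 0:
        return 0.0
    return counts.get((synset, ew, cw), 0) / total


def pr_synset_given_cw(synset, cw, members, counts, features, ancestors):
    """Equation (1): sum over the English synonyms that belong to the synset."""
    indicator = pr_ew_given_cw(features[cw], ancestors[synset])
    return sum(
        pr_synset_given_ew_cw(synset, ew, cw, counts) * indicator
        for ew in members[synset]
    )


# Hypothetical toy data: two senses of "doctor" and one Chinese word.
members = {"doctor_medical": ["doctor", "physician"], "doctor_scholar": ["doctor"]}
ancestors = {
    "doctor_medical": {"entity", "human", "medical"},
    "doctor_scholar": {"entity", "human", "education"},
}
features = {"yisheng": {"human", "medical"}}
counts = {
    ("doctor_medical", "doctor", "yisheng"): 8,
    ("doctor_scholar", "doctor", "yisheng"): 2,
    ("doctor_medical", "physician", "yisheng"): 5,
}

score_medical = pr_synset_given_cw("doctor_medical", "yisheng", members, counts, features, ancestors)
score_scholar = pr_synset_given_cw("doctor_scholar", "yisheng", members, counts, features, ancestors)
```

With this toy data both senses receive a non-zero score, so the Chinese word would be attached to both synsets, with the medical sense scoring higher.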
2.2 Domain ontology extraction

There are two phases to construct the domain ontology: 1) extract the ontology from the cross-language ontology by the island-driven algorithm, and 2) integrate the terms and axioms defined in a medical encyclopaedia into the domain ontology.

2.2.1 Extraction by island-driven algorithm

An ontology provides the consistent concepts and world representations necessary for clear communication within the knowledge domain. Even in domain-specific applications, the number of words can be expected to be large. Synonym pruning is an effective alternative to word sense disambiguation. This paper proposes a corpus-based statistical approach to extracting the domain ontology. The steps are listed as follows:

Step 1: Linearization: This step decomposes the tree structure in the universal ontology shown in Figure 2 into vertex lists, each of which is an ordered node sequence starting at a leaf node and ending at the root node.

Step 2: Concept extraction from the corpus: A node is defined as an operative node when the Tf-idf value of word $W_i$ in the domain corpus is higher than that in its corresponding contrastive (out-of-domain) corpus. That is,

$$\mathrm{operative\_node}(W_i) = \begin{cases} 1, & \text{if } \mathrm{Tf\text{-}idf}_{Domain}(W_i) > \mathrm{Tf\text{-}idf}_{Contrastive}(W_i) \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

where

$$\mathrm{Tf\text{-}idf}_{Domain}(W_i) = freq_{i,Domain} \times \log \frac{n_{i,Domain} + n_{i,Contrastive}}{n_{i,Domain}}$$

$$\mathrm{Tf\text{-}idf}_{Contrastive}(W_i) = freq_{i,Contrastive} \times \log \frac{n_{i,Domain} + n_{i,Contrastive}}{n_{i,Contrastive}}$$

In the above equations, $freq_{i,Domain}$ and $freq_{i,Contrastive}$ are the frequencies of word $W_i$ in the domain documents and its contrastive (out-of-domain) documents, respectively. $n_{i,Domain}$ and $n_{i,Contrastive}$ are the numbers of documents containing word $W_i$ in the domain documents and its contrastive documents, respectively. The nodes with a bold circle in Figure 2 represent the operative nodes.

Step 3: Relational expansion using the island-driven algorithm: Some domain concepts are not operative after the previous steps due to the problem of insufficient data. From observation in ontology construction, most of the inoperative concept nodes have operative hypernym nodes and hyponym nodes. Therefore, the island-driven algorithm is adopted to activate these inoperative concept nodes if their ancestors and descendants are all operative. The nodes with a gray background shown in Figure 2 are the activated operative nodes.

Step 4: Domain ontology extraction: The final step is to merge the linear vertex list sequences into a hierarchical tree. However, some noisy concepts not belonging to this domain ontology are operative after Step 3. These nodes with operative noisy concepts should be filtered out automatically. Finally, the domain ontology is extracted and the final result is shown in Figure 3. After the above steps, a dummy node is added as the root node of the domain concept tree.

2.2.2 Axiom and terminology integration

In practice, specific domain terminologies and axioms should be derived and introduced into the ontology for domain-specific applications. In our approach, 23 axioms derived from a medical encyclopaedia have been integrated into the domain ontology. Figure 4 shows an example of an axiom. In this example, the disease diabetes is tagged as level A, which indicates that this disease occurs frequently. The degrees for the corresponding syndromes represent the causality between the disease and the syndromes. The axioms also provide two fields, the department of clinical care and the category of the disease, for medical information retrieval.
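Steps 2 and 3 of Section 2.2.1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names and the corpus counts are hypothetical, and the expansion rule uses one simplifying reading of Step 3, namely that an inoperative node on a leaf-to-root vertex list is activated when its direct descendant and direct ancestor on that list are both operative.

```python
import math


def tf_idf(freq, n_own, n_domain, n_contrastive):
    """Tf-idf of a word in one corpus, in the form given under equation (4).

    freq is the word frequency in that corpus, n_own the number of documents
    of that corpus containing the word, and n_domain / n_contrastive the
    document counts for the word in the two corpora.
    """
    if freq == 0 or n_own == 0:
        return 0.0
    return freq * math.log((n_domain + n_contrastive) / n_own)


def is_operative(freq_dom, n_dom, freq_con, n_con):
    """Equation (4): operative when the domain Tf-idf is strictly higher."""
    domain_score = tf_idf(freq_dom, n_dom, n_dom, n_con)
    contrastive_score = tf_idf(freq_con, n_con, n_dom, n_con)
    return domain_score > contrastive_score


def island_expand(path, operative):
    """Step 3 under the simplifying reading described in the lead-in.

    path is a vertex list ordered from leaf to root; operative is the set of
    nodes found in Step 2. Activation is decided against the original set, so
    newly activated nodes do not trigger further activations.
    """
    activated = set(operative)
    for i in range(1, len(path) - 1):
        below, node, above = path[i - 1], path[i], path[i + 1]
        if node not in operative and below in operative and above in operative:
            activated.add(node)
    return activated


# Hypothetical counts per word: (freq_dom, n_dom, freq_con, n_con).
stats = {
    "insulin": (30, 10, 2, 2),
    "hormone": (1, 1, 6, 2),
    "substance": (20, 5, 3, 3),
    "entity": (3, 3, 60, 20),
}
path = ["insulin", "hormone", "substance", "entity"]  # leaf to root

operative = {word for word in path if is_operative(*stats[word])}
expanded = island_expand(path, operative)
```

In this toy vertex list, "hormone" is too rare in the domain corpus to pass equation (4), but it lies between two operative nodes and is therefore activated by the expansion.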
Figure 4. Axiom example.

2.3 Domain-specific web search

This paper proposes a medical web search engine based on the constructed medical domain ontology. The engine consists of a natural language interface, a web crawler and indexer, a relation inference module and an axiom inference module. The functions and techniques of these modules are described as follows.

2.3.1 Natural language interface and web crawler and indexer

A natural language interface is generally considered an enticing prospect because it offers many advantages: it is easy to learn and easy to remember, because its structure and vocabulary are already familiar to the user; and it is particularly powerful because of the multitude of ways in which a search action can be accomplished using natural language input. A natural language query is transformed into the desired representation through word segmentation, stop word removal, stemming and tagging. The web crawler and indexer are designed to seek medical web pages from the Internet, extract the content and establish the indices automatically.

Figure 3. The domain ontology after filtering out the isolated concepts.
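The query transformation described in Section 2.3.1 can be pictured with a small Python sketch. It is only an illustration under loud assumptions: it handles English input only, the stop word list and the suffix-stripping rule are hypothetical stand-ins for a real stop list and stemmer, and the Chinese word segmentation and tagging steps mentioned in the text are omitted.

```python
import re

# Hypothetical stop word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "the", "of", "for", "is", "are", "what", "in", "to"}

# Crude suffix stripping used here in place of a real stemmer.
SUFFIXES = ("ing", "es", "s")


def stem(token):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(query):
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", query.lower())
    return [stem(token) for token in tokens if token not in STOP_WORDS]


terms = preprocess("What are the symptoms of diabetes?")
```

The resulting term list is what a later matching stage would compare against the indexed page content; the crude stemmer shortens "diabetes" to "diabet", which is acceptable as long as queries and pages are normalized the same way.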
2.3.2 Concept inference module

For semantic representation, traditional keyword-based systems introduce two problems. First, ambiguity usually results from the polysemy of words. The domain ontology gives a clear description of the concepts. In addition, not all the synonyms of a word should be expanded without constraints. Secondly, relations between the concepts should be expanded and weighted in order to include more semantic information for semantic inference. We treat each of the user's input and the content of web pages as a sequence of words, and each sequence of words is treated as a bag of words regardless of the word order. For the word sequence of the user's input, $q = W_q = q_1, q_2, \ldots, q_K$, and the word sequence of the web page, $A = W_A = A_1, A_2, \ldots, A_L$, the similarity between the input query and the page is defined as the similarity between the two bags of words. The similarity measure based on key concepts in the ontology is defined as follows.

$$Sim_{relation}(A, q) = Sim_{relation}(W_A, W_q) = Sim_{relation}(A_1, A_2, \ldots, A_L, q_1, q_2, \ldots, q_K) = \sum_{k=1,\, l=1}^{K,\, L} H_{kl} \tag{5}$$

where $H_{kl}$ is the concept similarity of $A_l$ and $q_k$. Most keyword expansion approaches extend the scope by synonyms. In this paper the similarity, $H_{kl}$, is defined as

$$H_{kl} = \begin{cases} 1 & A_l \text{ and } q_k \text{ are identical} \\ \left(\frac{1}{2}\right)^{r} & A_l \text{ and } q_k \text{ are hypernyms, } r \text{ is the number of levels in between} \\ 1 - \left(\frac{1}{2}\right)^{r} & A_l \text{ and } q_k \text{ are synonyms, } r \text{ is the number of their common concepts} \\ 0 & \text{others} \end{cases} \tag{6}$$

2.3.3 Axiom inference module

Some axioms, such as "result in" and "result from", that are expected to affect the performance of a web search system in a technical domain are defined to describe the relationship between syndromes and diseases. This aspect is the use of specific terms used in the medical domain. We collected the data about syndromes and diseases from a medical encyclopaedia and tagged the diseases with three levels according to their occurrence, and the syndromes with four levels according to their significance to the specific disease. The "result in" relation score is defined as $RI(A, q)$ if a disease occurs in the input query and its corresponding syndromes appear in the web page. Similarly, if a syndrome occurs in the input query and its corresponding disease appears in the web page, the "result from" relation score is defined as $RF(A, q)$. The relation score is estimated as follows.

$$Axiom(A, q) = \max\{RI(A, q), RF(A, q)\} = \max\{RI(A_1, A_2, \ldots, A_P, q_1, q_2, \ldots, q_R),\; RF(A_1, A_2, \ldots, A_P, q_1, q_2, \ldots, q_R)\} = \max\left\{ \sum_{p=1,\, r=1}^{P,\, R} d_{pr}^{RI},\; \sum_{p=1,\, r=1}^{P,\, R} d_{pr}^{RF} \right\} \tag{7}$$

where $d_{pr}^{RI} = (1/2)^{n-1}$ if disease $q_r$ results in syndrome $A_p$ and $A_p$ is the top-$n$ feature of $q_r$. Similarly, $d_{pr}^{RF} = (1/2)^{n-1}$ if syndrome $q_r$ results from disease $A_p$ and $q_r$ is the top-$n$ feature of $A_p$. The conditional probability of the $i$-th web page, $A_i$, with respect to this aspect and query $q$ is defined as

$$Sim_{axiom}(A_i, q) = \frac{Axiom(A_i, q)}{\sum_{i} Axiom(A_i, q)}$$

3 Evaluation

To evaluate the proposed approach, a medical web search system was constructed. The web pages were collected from several websites: in total, 2322 web pages for the medical domain and 833 web pages for the contrastive domain. In addition, training and test queries for training and evaluating the system were collected. Forty users, who did not take part in the system development, were asked to provide a set of queries given the collected web pages. After post-processing, duplicate queries and queries outside the medical domain were removed. Finally, 3207 natural language test queries mixing Chinese with English words were obtained.

3.1 Keyword-Based VSM Approach: A baseline system for comparison

In recent years, most information retrieval approaches have been based on the Vector-Space Model (VSM). Assume that the query is denoted as a vector $q = (q_1, q_2, \ldots, q_n)$ and the web page is represented as a vector $A = (a_1, a_2, \ldots, a_n)$. The Tf-idf measure is employed and the similarity can be measured by the cosine function defined as follows.

$$Sim_{keyword\text{-}based}(A, q) = \cos(A, q) = \frac{\sum_{i=1}^{n} a_i q_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\, \sqrt{\sum_{i=1}^{n} q_i^2}} \tag{8}$$

where $a_i$ and $q_i$ are the Tf-idf weights of the $i$-th term in the web page and in the query. Key term expansion based on the synonym set is also adopted in the baseline system. The results and discussions are described in the following sections.

3.2 Weight determination using 11-avgP score

The medical domain web search system is modeled by the linear combination of the relational inference model and the axiom inference model. The normalized weight factor, α, is employed for concept expansion as follows.

$$Sim(A, q) = (1 - \alpha)\, Sim_{relation}(A, q) + \alpha\, Sim_{axiom}(A, q) \tag{9}$$

This experiment estimates the combination weights for each model. The results are shown in Figure 5. The performance measure called 11-AvgP [Eichmann and Srinivasan 1998] was used to summarize the precision and recall rates. The best 11-AvgP score is obtained when the weight α = 0.428.

Figure 5. The 11-avgP score with different values of α.

3.3 Evaluation on different inference modules

In the following experiments, web pages were evaluated separately by focusing on one inference module based on the domain-specific ontology at a time. That is, the mixture weight is set to 1 for one inference module and to 0 for the other in each evaluation. For comparison, the keyword-based VSM approach and the ontology-based system are also evaluated and shown in Figure 6. The precision and recall rates are used as the evaluation measures. The ontology-based approach refers to the α-weighted combination of concept inference and axiom inference described in Section 3.2.

Figure 6. The precision rates (%) and recall rates (%) of the proposed method and the baseline system. Curves: ontology based; baseline + concept inference module; baseline + axiom inference module; baseline.

4 Conclusion

This paper has presented an approach to automated ontology alignment and domain ontology extraction from two knowledge bases. In this approach, a bilingual ontology is developed using a corpus-based statistical approach from two well established language-dependent knowledge bases, WordNet and HowNet.
A domain-dependent ontology is further extracted from the universal ontology using the island-driven algorithm and a domain corpus. In addition, domain-specific terms and axioms are also added to the domain ontology. We have applied the domain-specific ontology to web page search in the medical domain. The experimental results show that the proposed approach outperformed the keyword-based and synonym expansion approaches.

References

N. Asanoma. 2001. Alignment of Ontologies: WordNet and Goi-Taikei. WordNet and Other Lexical Resources Workshop Program, NAACL2001. 89-94.
Christiane Fellbaum. 1998. WordNet: an electronic Lexical Database. The MIT Press. pp307-308.
Fensel, D., Bussler, C., Ding, Y., Kartseva, V., Klein, M., Korotkiy, M., Omelayenko, B. and Siebes R. 2002. Semantic Web Application Areas. The 7th International Workshop on Applications of Natural Language to Information Systems (NLDB02).
M. Missikoff, R. Navigli, and P. Velardi. 2002. Integrated approach to Web ontology learning and engineering. Computer, Volume: 35 Issue: 11. 60-63.
N.F. Noy and M. Musen. 2000. PROMPT: Algorithm and Tool for Automated Ontology Merging and Alignment. Proceedings of the National Conference on Artificial Intelligence. AAAI2000. 450-455.
Sinorama Magazine and Wordpedia.com Co. 2001. Multimedia CD-ROMs of Sinorama from 1976 to 2000, Taipei.
P. Vossen and W. Peters. 1997. Multilingual design of EuroWordNet. Proceedings of the Delos workshop on Cross-language Information Retrieval.
H. Weigard and S. Hoppenbrouwers. 1998. Experiences with a multilingual ontology-based lexicon for news filtering. Proceedings in the 9th International Workshop on Database and Expert Systems Applications. 60-65.
H. Kargupta, I. Hamzaoglu, and B. Stafford. 1997. Distributed data mining using an agent based architecture. In Proceedings of Knowledge Discovery and Data Mining, 2-24.
U. Y. Nahm and R. J. Mooney. 2000. A mutually beneficial integration of data mining and information extraction. In Proceeding of the AAAI-00.
H. Ahonen, O. Heinonen, M. Klemettinen, and A. Verkamo. 1998. Applying data mining techniques for descriptive phrase extraction in digital document collections. In Advances in Digital Libraries.
E. Frank, G. W. Paynter, I. H. Witten, C. Gutwin, and C. G. Nevill-Manning. 1999. Domain-specific keyphrase extraction. In Proceeding of IJCAI-99, 668-673.
Y. Yang, J. Carbonell, R. Brown, T. Pierce, B. T. Archibald, and X. Liu. 1999. Learning approaches for detecting and tracking news events. IEEE Intelligent Systems, 14(4):32-43.
W. W. Cohen. 1998. A web-based information system that reasons with structured collections of text. In Proceedings of 2nd Agent'98.
M. Junker, M. Sintek, and M. Rinck. 1999. Learning for text categorization and information extraction with ILP. In Proceedings of the Workshop on Learning Language in Logic, Bled, Slovenia.
S. Oyama, T. Kokubo, and T. Ishida. 2004. Domain-Specific Web Search with Keyword Spices. IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 1, 17-27.
Sankar K. Pal, Varun Talwar, and Pabitra Mitra. 2002. Web Mining in Soft Computing Framework: Relevance, State of the Art and Future Directions. IEEE Transactions on Neural Networks, Vol. 13, No. 5.
T. Hofmann. 1999. The cluster-abstraction model: Unsupervised learning of topic hierarchies from text data. In Proceedings of 16th IJCAI, 682-687.
P. Martin and P. Eklund. 2000. Knowledge Indexation and Retrieval and the World Wide Web. IEEE Intelligent Systems, special issue "Knowledge Management and Knowledge Distribution over the Internet".
H. Chen, A. M. Lally, B. Zhu, and M. Chau. 2003. HelpfulMed: Intelligent Searching for Medical Information over the Internet. Journal of the American Society for Information Science and Technology, 54(7):683-694.
D. Eichmann, M. Ruiz, and P. Srinivasan. 1998. Cross-language information retrieval with the UMLS Metathesaurus. Proceeding of ACM Special Interest Group on Information Retrieval (SIGIR), ACM Press, NY (1998), 72-80.