2011 IEEE International Conference on Fuzzy Systems
June 27-30, 2011, Taipei, Taiwan

Developing a Fuzzy Search Engine Based on Fuzzy Ontology and Semantic Search

Lien-Fu Lai, Chao-Chin Wu, Pei-Ying Lin
Dept. of Computer Science and Information Engineering
National Changhua University of Education
Changhua, R.O.C.
{lflai,ccwu}@cc.ncue.edu.tw, ester636@gmail.com

Liang-Tsung Huang
Department of Biotechnology
Mingdao University
Changhua, R.O.C.
larry@mdu.edu.tw

Abstract: Most existing search engines retrieve web pages by finding exact keywords. Traditional keyword-based search engines suffer from several problems. First, synonyms and terms similar to keywords are not taken into consideration when searching web pages. Users may need to input several similar keywords individually to complete a search. Second, traditional search engines treat all keywords as equally important and cannot differentiate the importance of one keyword from that of another. Third, traditional search engines lack an applicable classification mechanism to reduce the search space and improve the search results. In this paper, we develop a fuzzy search engine, called Fuzzy-Go. First, a fuzzy ontology is constructed by using fuzzy logic to capture the similarities of terms in the ontology, which offers appropriate semantic distances between terms to accomplish the semantic search of keywords. The Fuzzy-Go search engine can thus automatically retrieve web pages that contain synonyms or terms similar to keywords. Second, users can input multiple keywords with different degrees of importance based on their needs. The total satisfaction degree of keywords can be aggregated based on their degrees of importance and degrees of satisfaction. Third, the domain classification of web pages lets users select the appropriate domain for searching web pages, which excludes web pages in inappropriate domains to reduce the search space and to improve the search results.

Keywords: Fuzzy Search Engine; Fuzzy Ontology; Semantic Search

I. INTRODUCTION

As the massive amount of data on the internet is increasing rapidly, internet search engines have become the essential way to find information. Most existing search engines, such as Google, Yahoo, MSN, ASK, Baidu, and Bing, retrieve web pages by finding exact keywords.
These keyword-based search engines collect and analyze web pages through web crawlers. When users input keywords to search web pages, web pages that contain the exact keywords are retrieved and ranked. For example, the Google search engine sorts the search results by their PageRank scores, relevance scores and local scores. However, traditional keyword-based search engines suffer from several problems:

- Synonyms and terms similar to keywords are not taken into consideration when searching web pages. Users may need to input several similar keywords individually to complete a search. The restriction to exact keywords makes it inconvenient for users to search web pages. Many valuable web pages would be omitted if users did not search for several similar keywords individually.
- When users input several keywords to search web pages, different keywords may have different degrees of importance in their opinions. Traditional search engines treat all keywords as equally important and cannot differentiate the importance of one keyword from that of another.
- The problem of information overload makes it difficult for users to find really useful information in a large amount of search results. Traditional search engines lack an applicable classification mechanism to reduce the search space and improve the search results.

To alleviate these problems, we have applied fuzzy logic theory and semantic search techniques to develop a fuzzy search engine, called Fuzzy-Go. First, a fuzzy ontology is constructed by using fuzzy logic [15] to capture the similarities of terms in the ontology, which offers appropriate semantic distances between terms to accomplish the semantic search of keywords. The Fuzzy-Go search engine can thus automatically search web pages that contain synonyms or terms similar to keywords. Second, users can input multiple keywords with different degrees of importance based on their needs. The total satisfaction degree of keywords can be aggregated based on their degrees of importance and degrees of satisfaction. Third, the domain classification of web pages lets users select the appropriate domain for searching web pages, which excludes web pages in inappropriate domains to reduce the search space and to improve the search results.

An overview of the Fuzzy-Go search engine is shown in Figure 1. The Web Crawler is developed to gather and classify web pages. Web pages are classified and stored based on their domains. The characteristic information of web pages is recorded in the Web Document. The Ontology Manager constructs and maintains a Fuzzy Ontology. A data mining approach is applied to the Web Document to calculate the fuzzy similarity between terms in the Fuzzy Ontology. The User Interface lets users select an appropriate domain and input multiple keywords with different degrees of
importance based on their needs. The Fuzzy Search Mechanism excludes web pages in inappropriate domains to reduce the search space and to improve the search results. The keywords are expanded by the Fuzzy Ontology to find synonyms and terms similar to keywords. The search results are ordered based on several fuzzy factors including the satisfaction degrees of keywords, the importance degrees of keywords, the relevance of domains, the page ranks [3], the page in-links [2], titles, and last modified dates.

Figure 1. An overview of the Fuzzy-Go search engine

II. CONSTRUCTING THE FUZZY ONTOLOGY

Domain dependency is an important characteristic of knowledge: the meaning of a term may be different in different domains. We construct a two-layered fuzzy ontology to organize terms that are elicited from WordNet [6]. WordNet is a large lexical database of English, built by Princeton University. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms, each expressing a distinct concept.

In the two-layered fuzzy ontology, the first layer forms a domain hierarchy (see Figure 2). We adopt the Suggested Upper Merged Ontology (SUMO) [5] to classify domains. SUMO is owned by the IEEE and defines a group of domain ontologies which have been mapped to the WordNet domains. In our approach, the domain hierarchy is treated as gen-spec relationships. The upper level in the domain hierarchy maintains terms in a more general domain, while the lower level maintains terms in a more specific domain. Each domain contains a term lattice in the second layer (see Figure 3). A sub-domain can inherit the term lattice from its super-domain, add new terms specific to the sub-domain itself, and override (i.e., redefine) the fuzzy similarity between terms in the super-domain.

Figure 2. The domain hierarchy in the two-layered fuzzy ontology

Figure 3. The term lattice in a domain

Church et al. [9] extended the notion of word association in psycholinguistics to provide the basis for a statistical description of the semantic relationship between words through the co-occurrence of words. To determine the similarity degree between terms, we use a data mining method to find the degree of co-occurrence of two terms in web pages.
The measure of mutual information [9] is adopted to calculate the degree to which a child node conforms to a parent node in the term lattice. For any two terms $x$ and $y$, the mutual information $I(x,y)$ is defined as

$$I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)} \tag{1}$$

where $P(x)$ and $P(y)$ are the probabilities of observing $x$ and $y$ independently in web pages, and $P(x,y)$ is the probability of observing $x$ and $y$ together in web pages. The stronger the semantic relationship between $x$ and $y$, the larger the co-occurrence degree (i.e., $P(x,y)$ is larger), and consequently the larger the mutual information $I(x,y)$.

To normalize the mutual information to a similarity degree in the interval [0,1], we adopt the maximizing set [16] to assign the fuzzy similarity degree between terms. For all mutual information values $x \in X$, let $f$ be a real-valued function on $X$. The maximizing set $\tilde{M} = \{(x, \mu_{\tilde{M}}(x))\},\ x \in X$, assigns the degrees of membership $\mu_{\tilde{M}}(x)$ for all $x \in X$ with

$$\mu_{\tilde{M}}(x) = \frac{f(x) - \inf(f)}{\sup(f) - \inf(f)} \tag{2}$$

where $\inf(f)$ is the minimum of $f$ and $\sup(f)$ is the maximum of $f$.
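
The two steps above, Eq. (1) followed by the normalization of Eq. (2), can be illustrated with a minimal sketch. It assumes that the probabilities are estimated from page counts (`n_x`, `n_y`, `n_xy`, `n_pages` are hypothetical names, not taken from the paper) and that a term pair co-occurs at least once; the function names are likewise illustrative.

```python
import math


def mutual_information(n_x, n_y, n_xy, n_pages):
    """Eq. (1): I(x, y) = log2(P(x, y) / (P(x) * P(y))).

    Assumption: probabilities are estimated as page counts divided by the
    total number of crawled pages.
    """
    if n_xy <= 0:
        raise ValueError("terms never co-occur; mutual information is undefined")
    p_x = n_x / n_pages
    p_y = n_y / n_pages
    p_xy = n_xy / n_pages
    return math.log2(p_xy / (p_x * p_y))


def maximizing_set(values):
    """Eq. (2): map each f(x) to (f(x) - inf f) / (sup f - inf f) in [0, 1].

    `values` maps a term pair to its mutual information.
    """
    lo = min(values.values())
    hi = max(values.values())
    if hi == lo:
        # Degenerate case (all values equal); returning 1.0 is an assumption.
        return {pair: 1.0 for pair in values}
    return {pair: (v - lo) / (hi - lo) for pair, v in values.items()}


# Hypothetical counts, for illustration only.
raw = {
    ("network", "router"): mutual_information(120, 80, 60, 10000),
    ("network", "protocol"): mutual_information(120, 200, 40, 10000),
    ("network", "garden"): mutual_information(120, 150, 2, 10000),
}
similarity = maximizing_set(raw)
```

Because Eq. (2) is invariant to a positive rescaling of $f$, the base of the logarithm in Eq. (1) does not change the resulting similarity degrees.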

The Ontology Manager is responsible for applying the data mining approach to assign the similarity degrees between terms in all term lattices based on the Web Document. Figure 4 shows a term lattice with similarity degrees. Note that the similarity degree between two synonyms defined by WordNet is assigned 1.0 directly.

Figure 4. A term lattice with similarity degrees

III. THE WEB CRAWLER

We develop a crawler agent to automatically extract web pages and inventory them by finding keywords and characteristic values. While the Web Crawler retrieves web pages, some important information about the web pages is analyzed and stored, including titles, keywords, the page ranks of web pages, the last modified dates, etc. Figure 5 shows the work flow of the Web Crawler.

The Web Crawler is also responsible for classifying retrieved web pages by determining their domains. Since the amount of retrieved web pages is huge, we adopt the notions of Fuzzy C-Means (FCM) [7] to cluster web pages based on their keywords. In our approach, each web page is translated into a vector of keywords. Each vector does not belong to only one cluster but belongs to several clusters with different degrees. By comparing with the terms within domains in the Fuzzy Ontology, the keywords in the center of a cluster are useful to determine which domain the cluster belongs to.

Keywords in web pages are regarded as the characteristic values, and each web page is translated into a vector of the keywords in the web page. Assume that all web pages in the Web Document are gathered into $k$ clusters. The Euclidean distance between two web pages $x_i$ and $x_j$ is

$$dist(x_i, x_j) = \sqrt{\sum_{d} (x_{id} - x_{jd})^2} \tag{3}$$

where $d$ ranges over the keyword dimensions of the vectors. The center of a cluster changes after each iteration. In each iteration, we calculate the total deviation for changing the center of the cluster. In the case of $k$ clusters, the total deviation $E$ can be obtained by

$$E = \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - C_i \rVert^2$$

where $C_i$ is the center of each cluster $S_i$ and $x$ is a web page in cluster $S_i$. If the total deviation after an iteration is smaller than that of the previous iteration, the center of the cluster is changed to the one obtained in the new iteration; otherwise the center remains unchanged (see Figure 6).

Figure 6.
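Changing of the FCM center in the clustering

Figure 5. The work flow of the Web Crawler

Since the centers of clusters are changed again and again, we use an objective function $J$ to check whether the result of clustering is stable:

$$J = \sum_{i=1}^{k} J_i = \sum_{i=1}^{k} \sum_{j=1}^{n} w_{ij}^{m} \lVert x_j - C_i \rVert^2 \tag{4}$$

where $w_{ij}$ is the degree to which the web page $x_j$ belongs to the cluster $S_i$, with

$$\sum_{i=1}^{k} w_{ij} = 1, \qquad w_{ij} = \frac{1}{\sum_{s=1}^{k} \left( \dfrac{\lVert x_j - C_i \rVert}{\lVert x_j - C_s \rVert} \right)^{2/(m-1)}},$$

the $w_{ij}$ form a $k \times n$ weighting matrix, $m > 1$ is the fuzzification exponent of FCM [7], and $C_i$ is the center of each cluster $S_i$ with

$$C_i = \frac{\sum_{j=1}^{n} w_{ij}^{m}\, x_j}{\sum_{j=1}^{n} w_{ij}^{m}}.$$

If the value of the objective function $J$ is stable, the result of clustering is stable.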
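
The clustering loop described above can be sketched as follows. This is a minimal pure-Python illustration of standard Fuzzy C-Means rather than the authors' implementation: the fuzzifier default `m=2.0`, the tolerance `tol`, the choice of the first `k` pages as initial centers, and all function names are assumptions made for the example.

```python
import math


def euclidean(a, b):
    """Eq. (3): Euclidean distance between two keyword vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def memberships(pages, centers, m=2.0):
    """w[i][j]: degree to which page j belongs to cluster i (columns sum to 1)."""
    w = [[0.0] * len(pages) for _ in centers]
    for j, x in enumerate(pages):
        d = [euclidean(x, c) for c in centers]
        zero = [i for i, di in enumerate(d) if di == 0.0]
        for i in range(len(centers)):
            if zero:
                # A page sitting exactly on a center belongs fully to it.
                w[i][j] = 1.0 / len(zero) if i in zero else 0.0
            else:
                w[i][j] = 1.0 / sum((d[i] / ds) ** (2.0 / (m - 1.0)) for ds in d)
    return w


def update_centers(pages, w, m=2.0):
    """C_i = sum_j w_ij^m x_j / sum_j w_ij^m."""
    dim = len(pages[0])
    centers = []
    for wi in w:
        denom = sum(wij ** m for wij in wi)
        centers.append([
            sum((wij ** m) * x[d] for wij, x in zip(wi, pages)) / denom
            for d in range(dim)
        ])
    return centers


def objective(pages, centers, w, m=2.0):
    """Eq. (4): J = sum_i sum_j w_ij^m * ||x_j - C_i||^2."""
    return sum((w[i][j] ** m) * euclidean(x, c) ** 2
               for i, c in enumerate(centers)
               for j, x in enumerate(pages))


def fuzzy_c_means(pages, k, m=2.0, tol=1e-6, max_iter=100):
    """Iterate until the objective J stops changing by more than `tol`."""
    centers = [list(p) for p in pages[:k]]  # assumption: naive initialization
    previous = float("inf")
    for _ in range(max_iter):
        w = memberships(pages, centers, m)
        centers = update_centers(pages, w, m)
        current = objective(pages, centers, w, m)
        if abs(previous - current) < tol:
            break
        previous = current
    return centers, memberships(pages, centers, m)
```

Here each page is a list of keyword weights of equal length. A page's dominant cluster is the row index `i` with the largest `w[i][j]`, while the remaining entries of that column keep its partial membership in the other clusters.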

Finally, the set of characteristic values of the center of the cluster is captured to compare with the terms of domains. We apply N-Gram [4] to find the degree of match between a cluster and a domain. Assume that $C_K$ is the keyword set extracted from a cluster $K$ and $C_D$ is the keyword set of a domain $D$. The degree of match between a cluster $K$ and a domain $D$ is

$$S(C_K, C_D) = \frac{|C_K \cap C_D|}{|C_K|}, \quad C_K \neq \emptyset \tag{5}$$

A web page may belong to several domains. When the degree of match between a cluster and a domain is above the threshold, we assign all web pages in the cluster to the domain. The domain classification mechanism can reduce the search space and improve the search results.

IV. THE FUZZY SEARCH MECHANISM

The User Interface of the fuzzy search engine lets users select an appropriate domain and input multiple keywords with different degrees of importance based on their needs (see Figure 7). On the one hand, web pages in inappropriate domains can be excluded to reduce the search space and to improve the search results, since all web pages are already classified by domains. On the other hand, a better search result can be obtained by differentiating between keywords according to their degrees of importance. We use linguistic degrees of importance [13] (i.e., don't care, unimportant, rather unimportant, moderately important, rather important, very important, and most important) to make it easier for users to grade relative importance. Each linguistic degree of importance can be mapped to a triangular fuzzy number as shown in Figure 8.

It is inconvenient for users to be restricted to inputting precise keywords. Users may need to input several similar keywords individually to complete a search. To alleviate the problem, we use term expansion to find synonyms and terms similar to keywords based on the semantics inherited from the selected domain. When the user selects an appropriate domain and inputs keywords to search web pages, the Fuzzy Search Mechanism expands the keywords to synonyms and terms similar to the keywords based on the term lattice of the selected domain. For the example in Figure 9, when the keyword is Computer Network and the α-cut is set to 0.475, we acquire the terms whose similarity degrees to Computer Network are greater than 0.475.

Figure 9.
The term expansion of keywords in the term lattice

The fuzzy similarity between two non-adjacent terms $T_A$ and $T_B$ can be calculated by the multiplication principle for a path from $T_A$ to $T_B$. That is, we can obtain the fuzzy similarity between $T_A$ and $T_B$ via another term $T_C$ in the term lattice of the domain $D$:

$$S(T_A, T_B) = \max_{T_C \in D} \{ S(T_A, T_C) \times S(T_C, T_B) \} \tag{6}$$

Figure 7. The user interface of the fuzzy search engine

Figure 8. Membership functions for degrees of importance

Therefore, the terms similar to a keyword with degree α-cut can be found through the term lattice of the selected domain. All synonyms and terms similar to keywords are taken into account when we calculate the degree of satisfaction between the keyword and a web page.

In our approach, the keyword density is used to determine the degrees of satisfaction between keywords and web pages. The keyword density is the percentage of times a keyword appears on a web page compared to the total number of words on the web page. Many SEO (Search Engine Optimization) experts consider the optimum keyword density to be 1 to 3 percent [1]. Others claim that the optimum keyword density would be 3%, 4%, or even as high as 2%~8%. We add the density of the keyword and all similar terms together, and adopt the largest range (i.e., 2 to 8 percent), S functions and Z functions [16] to establish the membership function of the optimum keyword density (see Figure 10). If the keyword density falls within 2% to 8%, the degree of satisfaction between the keyword and the web page is set to 1. In the case of 0% to 2%, the S function is applied to set the degree value, whereas for densities above 8% the Z function is applied to set the degree value.
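
The term expansion of Eq. (6) and the keyword-density membership function can be sketched together as below. Several points are assumptions made for illustration: Eq. (6) is generalized to paths of any length by a best-first search over max-product similarity, the S function is the standard quadratic form with its crossover at the midpoint, and the upper end of the Z range (`z_end`) is a placeholder value. All function and parameter names are hypothetical.

```python
import heapq


def undirected(edges):
    """Build a symmetric adjacency map from (term_a, term_b, similarity) triples."""
    lattice = {}
    for a, b, sim in edges:
        lattice.setdefault(a, {})[b] = sim
        lattice.setdefault(b, {})[a] = sim
    return lattice


def expand_term(lattice, keyword, alpha):
    """Return terms whose best path similarity to `keyword` exceeds the alpha-cut.

    Path similarity is the product of edge similarities (Eq. (6) applied
    repeatedly); the best path is found with a Dijkstra-like search, which is
    valid because multiplying by values in [0, 1] never increases a product.
    """
    best = {keyword: 1.0}
    heap = [(-1.0, keyword)]
    while heap:
        neg_sim, term = heapq.heappop(heap)
        sim = -neg_sim
        if sim < best.get(term, 0.0):
            continue  # stale heap entry
        for other, edge in lattice.get(term, {}).items():
            candidate = sim * edge
            if candidate > alpha and candidate > best.get(other, 0.0):
                best[other] = candidate
                heapq.heappush(heap, (-candidate, other))
    best.pop(keyword)
    return best


def s_function(x, a, c):
    """Standard S function rising from 0 at `a` to 1 at `c`."""
    if x <= a:
        return 0.0
    if x >= c:
        return 1.0
    b = (a + c) / 2.0
    if x <= b:
        return 2.0 * ((x - a) / (c - a)) ** 2
    return 1.0 - 2.0 * ((x - c) / (c - a)) ** 2


def density_satisfaction(density, low=2.0, high=8.0, z_end=12.0):
    """Degree of satisfaction for a keyword density given in percent.

    `z_end` is an assumed placeholder for where the Z function reaches 0.
    """
    if density < low:
        return s_function(density, 0.0, low)
    if density <= high:
        return 1.0
    return 1.0 - s_function(density, high, z_end)  # Z function = 1 - S


def expanded_density(densities, keyword, similar_terms):
    """Density of the keyword plus similarity-weighted densities of similar terms."""
    total = densities.get(keyword, 0.0)
    for term, sim in similar_terms.items():
        total += densities.get(term, 0.0) * sim
    return total
```

For instance, with a lattice in which "soft computing" links to "fuzzy logic" with similarity 0.623 and page densities of 0% and 2.1% respectively, `expanded_density` gives 1.3083% and `density_satisfaction` maps it to roughly 0.76 on the rising S branch.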

Figure 10. The membership function of the optimum keyword density

Since users may input multiple keywords with different degrees of importance to search web pages, the total satisfaction degree of multiple keywords should be aggregated based on their degrees of importance and degrees of satisfaction. The fuzzy weighted average (FWA) [8,11] is applied to calculate the total satisfaction degree of multiple keywords using triangular fuzzy numbers. In our approach, the degree of satisfaction between a keyword and a web page is the indicator ($x_i$), and the degrees of importance are weights ($w_i$) that act upon the indicators. Therefore, the fuzzy weighted average $y$ can be defined as

$$y = f(x_1, \ldots, x_n, w_1, \ldots, w_n) = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \tag{7}$$

where there are $n$ keywords, the degree of satisfaction $x_i$ is a crisp value, and the degree of importance $w_i$ is represented by a triangular fuzzy number. We adopt the approximate expressions for the $\times$ and $\div$ operators for the computation of L-R fuzzy numbers, as suggested by Dubois and Prade [10].

For example, a user may input two keywords to search web pages: "expert systems" with "moderately important" and "soft computing" with "very important". In the term lattice, the term "expert systems" links to a similar term "knowledge-based systems" with the similarity degree 0.786, and "soft computing" links to "fuzzy logic" with 0.623. Assume that the keyword densities of "expert systems", "knowledge-based systems", "soft computing" and "fuzzy logic" for the web page P are 1.5%, 1.2%, 0% and 2.1% respectively. The expanded keyword density of "expert systems" would be 1.5% + 1.2% × 0.786 = 2.4432%, and the degree of satisfaction between "expert systems" and P would be 1. The expanded keyword density of "soft computing" would be 0% + 2.1% × 0.623 = 1.3083%, and the degree of satisfaction between "soft computing" and P would be

$$S(1.3083) = 1 - 2\left(\frac{1.3083 - 2}{2 - 0}\right)^2 = 0.76.$$

FWA is applied to calculate P's total satisfaction degree of keywords as

$$y = \frac{(0.25, 0.5, 0.75) \times 1.0 + (0.75, 1, 1) \times 0.76}{(0.25, 0.5, 0.75) + (0.75, 1, 1)} = \frac{(0.82, 1.26, 1.51)}{(1, 1.5, 1.75)} = \frac{1.1517}{1.3536} = 0.8508$$

Applying the mathematical operations on fuzzy numbers [10,16], we get two fuzzy numbers (0.82, 1.26, 1.51) and (1, 1.5, 1.75). The center of gravity is adopted to defuzzify a fuzzy number [14], which is achieved by mathematical integration.
Therefore, P's total satisfaction degree of keywords is 0.8508.

The search results are ordered based on several fuzzy factors including the total satisfaction degrees of keywords, the page ranks, the page in-links, titles, and last modified dates. Through the mutual compensation of all fuzzy factors, an ordering of search results that takes every respect into account can be generated. The fuzzy weighted average [8,11] is utilized to aggregate all fuzzy factors based on their degrees of importance and degrees of satisfaction. The users can manually select appropriate fuzzy factors with different importance degrees to meet their specific requirements (see Figure 11); otherwise the importance degrees of the fuzzy factors take default values. Figure 12 shows the ordering of search results, in which the grade is generated by aggregating all fuzzy factors.

Figure 11. The selection of fuzzy factors with different importance degrees

Figure 12. The ordering of search results
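
The fuzzy weighted average used above, both for keywords and for the other fuzzy factors, can be sketched as follows for crisp indicators and triangular weights. This is an illustrative simplification: the numerator and denominator are computed component-wise, and defuzzification uses the plain centroid of a triangle, `(l + m + u) / 3`, which is an assumption and not the integral-based method of [14]. The function names are hypothetical.

```python
def fuzzy_weighted_sums(indicators, weights):
    """Return the triangular numerator and denominator of Eq. (7).

    indicators: crisp satisfaction degrees x_i in [0, 1]
    weights:    triangular fuzzy numbers (lower, mode, upper), one per indicator
    """
    numerator = tuple(
        sum(w[c] * x for x, w in zip(indicators, weights)) for c in range(3)
    )
    denominator = tuple(sum(w[c] for w in weights) for c in range(3))
    return numerator, denominator


def centroid(triangle):
    """Centroid of a triangular fuzzy number (assumed defuzzification)."""
    return sum(triangle) / 3.0


def fwa_score(indicators, weights):
    """Crisp score: ratio of the defuzzified numerator and denominator."""
    numerator, denominator = fuzzy_weighted_sums(indicators, weights)
    return centroid(numerator) / centroid(denominator)


def rank_pages(pages, weights):
    """Order pages by descending score; `pages` maps a page id to its indicators."""
    return sorted(pages, key=lambda p: fwa_score(pages[p], weights), reverse=True)


# The two importance levels used in the worked example of the paper.
MODERATELY_IMPORTANT = (0.25, 0.5, 0.75)
VERY_IMPORTANT = (0.75, 1.0, 1.0)

num, den = fuzzy_weighted_sums([1.0, 0.76], [MODERATELY_IMPORTANT, VERY_IMPORTANT])
score = fwa_score([1.0, 0.76], [MODERATELY_IMPORTANT, VERY_IMPORTANT])
```

With the example inputs the sketch reproduces the fuzzy numbers (0.82, 1.26, 1.51) and (1, 1.5, 1.75). Its crisp score is about 0.845 rather than 0.8508, because the paper defuzzifies the two fuzzy numbers to 1.1517 and 1.3536 with the method of [14] instead of the triangle centroid assumed here.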

However, the original setting is not optimal, and it needs continuous learning and adaptation to gradually fit users' expectations. We have applied the genetic algorithm to propose a self-adaptation approach for the Fuzzy-Go search engine [12]. For each search, the fuzzy search engine records the difference between the ordering of search results and the user's real behavior in clicking web pages. This feedback is gathered and analyzed to adjust the fuzzy similarities between terms in the fuzzy ontology, the domain classification of web pages, and the default importance degrees of fuzzy factors. The ordering of search results can thus be improved gradually by continuous learning and adaptation.

V. CONCLUSION

In this paper, we develop a fuzzy search engine, called Fuzzy-Go. First, the Ontology Manager constructs and maintains a Fuzzy Ontology. A data mining approach is applied to the Web Document to calculate the fuzzy similarity between terms in the Fuzzy Ontology. The Fuzzy Ontology is constructed by using fuzzy logic to capture the similarities of terms in the ontology, which offers appropriate semantic distances between terms to accomplish the semantic search of keywords. The Fuzzy-Go search engine can thus automatically retrieve web pages that contain synonyms or terms similar to keywords. Second, the Web Crawler is developed to gather and classify web pages. Web pages are classified and stored based on their domains. To deal with the huge amount of retrieved web pages, we propose a data mining approach to cluster web pages based on their keywords. Third, the User Interface lets users select an appropriate domain and input multiple keywords with different degrees of importance based on their needs. The total satisfaction degree of keywords can be aggregated based on their degrees of importance and degrees of satisfaction. Finally, the Fuzzy Search Mechanism excludes web pages in inappropriate domains to reduce the search space and to improve the search results. The keywords are expanded by the Fuzzy Ontology to find synonyms and terms similar to keywords.
The search results are ordered based on several fuzzy factors including the total satisfaction degrees of keywords, the page ranks, the page in-links, titles, and last modified dates.

The advantages of the proposed approach are as follows.

- Fuzzy logic is used to capture the similarities of terms in the ontology, which offers appropriate semantic distances between terms to accomplish the semantic search of keywords.
- The domain classification of web pages lets users select the appropriate domain to search web pages, which excludes web pages in inappropriate domains to reduce the search space and to improve the search results.
- Users can input multiple keywords with different degrees of importance based on their needs, and all fuzzy factors are aggregated based on their degrees of importance and degrees of satisfaction.
- Through the mutual compensation of all fuzzy factors, an ordering of search results that takes every respect into account can be generated.

ACKNOWLEDGMENT

This research is partially sponsored by National Science Council (Taiwan, R.O.C.) under the grant NSC 98--E- 08-008-MY.

REFERENCES

[1] Keyword Density, http://en.wikipedia.org/wiki/Keyword_density
[2] Page Inlink Analysis, http://ericmiraglia.com/inlink/
[3] Page Rank Check, http://www.prchecker.info/check_page_rank.php
[4] The Stochastic Language Models (N-Gram) Specification, http://www.w3.org/TR/ngram-spec/
[5] The Suggested Upper Merged Ontology (SUMO), http://www.ontologyportal.org/
[6] WordNet, http://wordnet.princeton.edu/
[7] J.C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum, 1981.
[8] P.T. Chang, K.C. Hung, K.P. Lin, and C.H. Chang. A Comparison of Discrete Algorithms for Fuzzy Weighted Average. IEEE Transactions on Fuzzy Systems, pp. 663-675, Oct. 2006.
[9] K.W. Church and P. Hanks. Word Association Norms, Mutual Information and Lexicography. Computational Linguistics, 16(1):22-29, Mar. 1990.
[10] D. Dubois and H. Prade. Fuzzy Sets and Systems: Theory and Applications. New York, London, 1980.
[11] L.F. Lai, C.C. Wu, M.Y. Shih, L.T. Huang, and W. Chou. Parallel Processing for Fuzzy Queries in Human Resources Websites. Journal of Internet Technology, 7():943-953, Dec. 2010.
[12] Y.C. Lin, L.F. Lai, C.C. Wu, and L.T. Huang.
A Self-Adaptation Approach to Fuzzy-Go Search Engine. The 2010 International Computer Symposium (ICS 2010), pp. 00-05, Dec. 2010.
[13] E.W.T. Ngai and F.K.T. Wat. Fuzzy Decision Support System for Risk Analysis in E-Commerce Development. Decision Support Systems, pp. 235-255, Aug. 2005.
[14] T.Y. Tseng and C.M. Klein. A New Algorithm for Fuzzy Multicriteria Decision Making. International Journal of Approximate Reasoning, 6:45-66, 1992.
[15] L.A. Zadeh. Fuzzy Sets. Information and Control, pp. 338-353, 1965.
[16] H.J. Zimmermann. Fuzzy Set Theory and Its Applications. 2nd revised edition, Kluwer Academic Publishers, 1991.