Journal of Computer Science 2 (1): 97-103, 2006
ISSN 1549-3636
© 2006 Science Publications

Towards an Effective Personalized Information Filter for P2P Based Focused Web Crawling

Fu Xiang-hua and Feng Bo-qin
Department of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, 710049, China

Abstract: Information access is one of the hottest topics of the information society and has become even more important since the advent of the Web, yet general Web search engines still cannot find correct and timely information for individuals. In this paper, we propose a Peer-to-Peer (P2P) based decentralized focused Web crawling system called PeerBridge to provide a user-centered, content-sensitive and personalized information search service over the Web. PeerBridge is built on our previous work on WebBridge, a focused crawling system that crawls the Web according to several specified topics. The most important function of PeerBridge is to identify interesting information, so we also present in detail an efficient personalized information filter, which combines several component neural networks to accomplish the filtering task. Performance evaluation in the experiments showed that PeerBridge is effective at crawling relevant information for specific topics and that the information filter is efficient, with precision better than that of support vector machine, naïve Bayesian and individual neural network classifiers.

Key words: PeerBridge, web crawling system, P2P based, artificial neural network

INTRODUCTION

Information access is one of the hottest topics of the information society and it has become even more important since the advent of the Web. On one side, our society depends more and more on information. Knowing the right information, at the right moment, as soon as it is available is essential for all of us. On the other side, the amount of available information, especially on the Web, is increasing tremendously over time and we are witnessing an information overload.
The process of extracting relevant information from the Web is still very difficult, time-consuming and in many cases practically infeasible, since it requires huge cognitive processing. Researchers try their best to address the challenging problem of locating correct information on the Web efficiently. They have developed many different techniques, such as centralized search engines, meta search engines, personalized web search systems and topic driven search systems [1,2].

The most conventional example is the centralized search engine (CSE). CSEs have several problems. One major problem is that they do not facilitate collaboration between human users, which has the potential to greatly improve Web search quality and efficiency. Without collaboration, users must start from scratch every time they perform a search task, even if other users have done similar or relevant searches. Another major problem with CSEs is that they completely ignore the interests and preferences of users. For the same query, different users are answered with the same list of results. In fact, a substantial amount of personal information can be obtained during a user's searching process and may be used to find suitable results for a particular user.

With the emergence of successful applications like Gnutella, Kazaa and Freenet, peer-to-peer (P2P) technology has received significant visibility over the past few years. P2P systems are massively distributed computing systems in which peers (nodes) communicate directly with one another to distribute tasks, exchange information or accomplish tasks. There are also a few projects, such as Apoidea [3], Edutella [4] and ODISSEA [5], that attempt to build a P2P based Web search or crawling system. Developing a P2P-based distributed paradigm brings several advantages that cannot be exploited in a centralized paradigm. Basically, they are ascribed to the fact that information has been collected, refined and stored among users according to their interests. The active contributions of users provide multiple advantages.
In effect, the creation of a special user profile allows search results to be filtered depending on the user's interests, introducing a certain degree of personalization in search. Further, if one considers users not only as isolated individuals but also as a community, then this social dimension could be exploited in order to access the expertise of people with similar interests. The social dimension of the community allows users to be clustered according to their interests and expertise and so to focus on interesting information by reducing the domain of interest.

In this study, we present a P2P based focused Web crawling system called PeerBridge, which is developed based on our WebBridge [4]. In PeerBridge, each node only searches and stores the part of the Web model that the user is interested in and the nodes interact in a peer-to-peer fashion in order to create a real distributed search engine. All users share these partial models, which globally create a consistent model of the web resource that is equivalent to its centralized counterpart. One key problem we must solve in PeerBridge is to search for information that is relevant to a particular node. To avoid getting irrelevant information, PeerBridge tries to guess exactly what kind of document the user desires, basing that guess not only on the key words provided by the user, but also on a profile of the user's background and interests and on evaluations of how the system satisfied or failed to satisfy the user's requests in the past. Moreover, it retrieves only the specific kind of documents defined by the user model component. A personalized information filter based on a heterogeneous neural network ensemble classifier (HNNE) [5] is used as the content filter to model the peer's preference and filter irrelevant information. Furthermore, the Topic Overlay Network Search algorithm (TONS) is developed to support complex queries on top of the existing structured network [6].

Corresponding Author: Fu Xiang-hua, Department of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, 710049, China

Design overview of PeerBridge: PeerBridge has five main components: a content filter, which makes relevance judgments on pages crawled from the Web and on query results returned by other nodes; a distiller, which determines a measure of centrality of crawled pages to determine visit priorities; a crawler with dynamically reconfigurable priority controls, which is governed by the content filter and the distiller; a P2P infrastructure, which supports the construction of a P2P overlay network with other nodes to share and search information with each other; and a user interface, with which the user can edit training samples, select a category taxonomy for training the classifier and query information from the personalized data resource base and from other nodes. A block diagram of the general architecture of PeerBridge is shown in Fig. 1.
Here we briefly outline the basic processes of each component.

The content filter: The content filter is a document classifier implemented by a heterogeneous neural network ensemble to determine whether the downloaded documents are useful. It is the central component that guarantees the quality of the search results. The representative features of the sample Web pages are extracted as inputs to train the HNNE content filter. The objective of training is to let the HNNE configure itself and adjust its weight parameters according to the training examples, so that it generalizes beyond the training samples. In our system, the training samples include a selected canonical taxonomy (such as Yahoo! or the Open Directory Project) and the examples specified by the user. Together, the training samples define which topics the user is interested in. We use the vector model of documents to represent the user model and compute the similarity between documents and interests. A pretrained HNNE classifier can be used to filter irrelevant information.

The distiller: The distiller analyzes the link structure of the downloaded Web pages and identifies pages containing large numbers of links to relevant pages, called hubs. Since citations signify deliberate judgment by the page author, most citations are to semantically related material. Intermittently, the system runs a topic distillation algorithm to identify hubs. The visit priorities of these pages and their immediate neighbors are raised. All of the page links distilled by the distiller are placed into the search list in order of their priorities.

The crawler: The function of the crawler is simple. It gets page links from the search list and then seeks and acquires the corresponding Web pages from the Web. Integrated with the distiller and the content filter, the crawler runs as a focused crawler that accesses only a narrow segment of the Web. We have presented a focused crawler with online-incremental adaptive learning ability in [6]. It entails a very small investment in hardware and network resources and yet achieves respectable coverage at a rapid rate.
In PeerBridge, several crawling threads crawl Web pages concurrently during the working process.

The P2P infrastructure: With the P2P infrastructure, the instances of PeerBridge running on many user computers form a P2P overlay network to share their information resources. DHT based distributed lookup and information-exchange protocols [7] are used to exchange vital information between the peers. Each peer maintains a small routing table. Given a key, these techniques guarantee that its value is located within a bounded number of hops in the network. A Bloom filter [8] is used to store the list of URLs already crawled by a peer. TONS is used to support complex queries [9]. Thus Web content is managed by a distributed team of peers, each of which specializes in one or a few topics. When a query is issued, each peer not only looks for it on the local host but also publishes it to the overlay network. With our effective P2P search algorithm, the relevant query results from the whole overlay network are returned to the user.

The user interface: The user interface mainly provides a convenient operation interface to the user. The user can use it to select a category taxonomy, edit and judge examples, query information, display ranked query results and so on. In our prototype, it has not yet been implemented completely.
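The Bloom filter [8] that each peer uses to remember already crawled URLs can be illustrated with a short Python sketch. This is a minimal sketch under stated assumptions, not the PeerBridge implementation: the class name `BloomFilter`, the helper `should_crawl`, the bit-array size, the number of hash functions and the use of SHA-256 with double hashing are all hypothetical choices made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, small false-positive rate."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        # Sizes are illustrative assumptions, not values from the paper.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive all bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for k in range(self.num_hashes):
            yield (h1 + k * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def should_crawl(url: str, seen: BloomFilter) -> bool:
    """Return True and record the URL if it has probably not been crawled yet."""
    if url in seen:
        return False
    seen.add(url)
    return True
```

A crawling thread would call `should_crawl(url, seen)` before fetching a page. Because a Bloom filter can report false positives, a small fraction of new URLs may be skipped; in exchange, the list of crawled URLs occupies a fixed amount of memory, which suits a peer that must stay lightweight.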
[Fig. 1: The general architecture of PeerBridge. Block diagram of two peers (Peer A and Peer B) connected through the peer-to-peer protocol; each peer combines a user interface, user model, taxonomy table, classifier (training and filtering), content filter, distiller, crawler, search lists and personalized data resource.]

Adaptive content filtering model: An information filtering system can use intelligent content analysis to classify documents automatically. If a document is judged not to belong to a user-specified class, it is irrelevant and should be discarded. Such methods include k-nearest neighbor classification, linear least squares fit, linear discriminant analysis and naïve Bayesian probabilistic classification [1,2,10]. However, because real-world data such as ours tend to be noisy and are not clearly defined, linear or low-order statistical models cannot always describe them. We use artificial neural networks because they are robust enough to fit a wide range of distributions accurately and can model any high-degree exponential models. Neural networks are also chosen for computational reasons since, once trained, they operate very fast. Moreover, such a learning and adaptation process can give semantic meaning to context-dependent words.

User model: To filter information for specific users according to their preferences and interests, a user model is created as an image of what users need. We define a user model as:

$$UM := (UMID, FD, FT, UI, UIV) \tag{1}$$

where $UMID$ is a user model identifier, $FD := \{d_1, d_2, \ldots, d_N\}$ is a set of sample documents, $FT := \{t_1, t_2, \ldots, t_M\}$ is a lexicon comprising all feature terms of $FD$, $UI := \{u_1, u_2, \ldots, u_T\}$ is a set of interests specified by users and $UIV := \{UIV_1, UIV_2, \ldots, UIV_T\}$ is a set of interest vectors of a particular user, in which every element corresponds to an interest $u_k$ ($1 \le k \le T$) and is defined as $UIV_k := \langle (t_1, w_{1k}), (t_2, w_{2k}), \ldots, (t_M, w_{Mk}) \rangle$, where $w_{ik}$ is the frequency of term $t_i$ ($1 \le i \le M$) in $UIV_k$.

According to the vector space model (VSM), $FD$ constitutes a term by document matrix $X := (d_1, d_2, \ldots, d_N)$, where a column $d_j := \langle (t_1, x_{1j}), (t_2, x_{2j}), \ldots, (t_M, x_{Mj}) \rangle$ is the document vector of the document $d_j$ and every element $x_{ij}$ is the frequency of the term $t_i$ in document $d_j$. The TFIDF frequency is used, which is defined as:

$$x_{ij} = tf_{ij} \cdot \log(N / df_i) \tag{2}$$

where $tf_{ij}$ is the number of times the term $t_i$ occurs in the document $d_j$ and $df_i$ is the number of documents in which the term $t_i$ occurs. The similarity between document vectors is defined as:

$$Sim(d_i, d_j) = \frac{d_i^{T} d_j}{\|d_i\|\,\|d_j\|} = \frac{\sum_{k=1}^{M} x_{ki}\, x_{kj}}{\sqrt{\sum_{k=1}^{M} x_{ki}^{2}}\;\sqrt{\sum_{k=1}^{M} x_{kj}^{2}}} \tag{3}$$

Equation (3) can also be used to compute the similarity between a document vector and an interest vector.

Neural networks-based content filtering: The neural networks-based adaptive content filter comprises two major processes: training and classification. During training, the filter learns from sample documents to form a knowledge base. It then classifies incoming documents according to their content. Before training or classification, a preprocessing procedure is needed to extract words and phrases from the documents using a specific feature selection algorithm. The neural networks contain an input layer, with as many elements as there are feature terms needed to describe the documents to be classified, as well as a middle layer, which organizes the training document set so that an individual processing element represents each input vector. Finally, they have an output layer, also called a summation layer, which has as many processing elements as there are user interests to be recognized.
Each element in this layer is combined via the processing elements within the middle layer that relate to the same class, and prepares that category for output. Figure 2 illustrates the form of a content filter based on a three-layer feedforward artificial neural network.

[Fig. 2: Adaptive content filter based on three layer feedforward artificial neural network. Input layer of feature terms (t_1, t_2, ..., t_M), layer of hidden neurons, layer of output neurons for the user interests (u_1, ..., u_k, ..., u_T).]
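The TFIDF weighting of Equation (2) and the similarity of Equation (3), which produce the numerical inputs compared against the user's interest vectors, can be sketched in Python as follows. This is an illustrative sketch only: the function names `tfidf_vectors` and `cosine_similarity` and the representation of a document as a list of tokens are assumptions, and the paper does not specify its own preprocessing or feature selection code.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (terms, vectors) with x_ij = tf_ij * log(N / df_i), as in Equation (2).

    `docs` is a list of tokenized documents (lists of strings); this input
    format is an assumption made for the sketch.
    """
    n_docs = len(docs)
    terms = sorted({t for doc in docs for t in doc})
    # df[t]: number of documents in which term t occurs at least once.
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in terms])
    return terms, vectors


def cosine_similarity(u, v):
    """Similarity of Equation (3); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

The same `cosine_similarity` call applies unchanged when one argument is an interest vector rather than a document vector, since both live in the same term space.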
In our content filter, the numerical input obtained from each document is a vector containing the frequency of appearance of terms. Owing to the possible appearance of thousands of terms, the dimension of the vectors can be reduced by singular value decomposition (SVD), Principal Component Analysis, Information Entropy Loss, a word frequency threshold [10], etc.

Heterogeneous neural networks ensemble classifier: A neural network ensemble (NNE) is a learning paradigm in which many neural networks are jointly used to solve a problem [11]. It originates from Hansen and Salamon's work [12], which shows that the generalization performance of a neural network system can be significantly improved by combining several individual networks on the same task. A neural network ensemble is constructed in two steps, the first being the judicious creation of the individual ensemble members and the second their appropriate combination to produce the ensemble output.

There has been much work on training NN ensembles [11-16]. However, all these methods are used to change the weights in an ensemble. The structure of the ensemble, e.g., the number of NNs in the ensemble, and the structure of the individual NNs, e.g., the number of hidden nodes, are all designed manually and fixed during the training process. While manual design of NNs and ensembles might be appropriate for problems where rich prior knowledge and an experienced expert exist, it often involves a tedious trial-and-error process for many real-world problems, because rich prior knowledge and experienced human experts are hard to get in practice.

In [17], we propose a new method to construct a heterogeneous neural network ensemble (HNNE) with negative correlation. It combines the ensemble's architecture design with cooperative training of the individual NNs in an ensemble. It determines automatically not only the number of NNs in an ensemble, but also the number of hidden nodes in the individual NNs. It uses incremental training based on negative correlation learning [10,13] in training individual NNs.
The main advantage of negative correlation learning is that it encourages different individual NNs to learn different aspects of the training data so that the ensemble can learn the whole training data better. It does not require any manual division of the training data to produce different training sets for the different individual NNs in an ensemble.

Theoretical foundation of neural network ensemble: Suppose a data set $D := \{(x_1, y_1), (x_2, y_2), \ldots, (x_P, y_P)\}$, where $x_p$ is the input sample and $y_p$ is the output result ($1 \le p \le P$). An ensemble comprises $N$ component neural networks and every component network is trained to approximate a function $f: \mathbb{R}^{m} \rightarrow C$, where $m$ is the dimension of the input and $C$ is the set of class labels. Suppose the weight of the $i$th component network is $w_i$ ($1 \le i \le N$) and all the weights satisfy $w_i \ge 0$ and $\sum_{i=1}^{N} w_i = 1$. When the input sample is $x_p$, the output of the $i$th component network is $f_i(x_p)$ and the output of the ensemble is:

$$f(x_p) = \sum_{j=1}^{N} w_j f_j(x_p)$$

Thus the generalization error of the ensemble on the whole data set is:

$$E = \sum_{p=1}^{P} \big(y_p - f(x_p)\big)^2 \tag{4}$$

The generalization error of the $i$th component network on the whole data set is:

$$E_i = \sum_{p=1}^{P} \big(y_p - f_i(x_p)\big)^2 \tag{5}$$

The weighted generalization error of the ensemble is:

$$\bar{E} = \sum_{i=1}^{N} w_i E_i \tag{6}$$

The diversity of the ensemble is:

$$A = \sum_{i=1}^{N} w_i \sum_{p=1}^{P} \big(f_i(x_p) - f(x_p)\big)^2$$

So the generalization error of the ensemble satisfies:

$$E = \bar{E} - A \tag{7}$$

Combining the outputs is clearly only relevant when they disagree on some or several of the inputs. This insight was formalized by [15], who showed that the squared error of the ensemble when predicting a single target is equal to the average squared error of the individual networks, minus the diversity, defined as the variance of the individual network outputs. Thus, to reduce the ensemble error, one tries to increase the diversity without increasing the individual network errors too much.

Construct neural network ensemble with negative correlation: Because all the component networks are trained with the samples of the same data set $D$ to approximate the same function, the outputs of the component networks are highly correlated, potentially leading to severe collinearity and reducing the robustness of the ensemble network [16].
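The decomposition in Equation (7), on which the argument for diverse and therefore less correlated component networks rests, can be checked numerically. The following Python sketch computes the ensemble error, the weighted average component error and the diversity directly from their definitions; the function name `ensemble_decomposition` and the list-of-lists layout of the outputs are assumptions made for illustration, and the component outputs in the usage note are random numbers rather than trained networks.

```python
import random


def ensemble_decomposition(outputs, targets, weights):
    """Return (E, E_bar, A) for component outputs, targets and weights.

    outputs[i][p] is the output of component network i on sample p,
    targets[p] is the desired output and weights[i] is the combination
    weight of network i (non-negative, summing to 1).
    """
    n_samples = len(targets)
    # Ensemble output on each sample: weighted sum of component outputs.
    ensemble = [
        sum(w * out[p] for w, out in zip(weights, outputs))
        for p in range(n_samples)
    ]
    # Equation (4): squared error of the ensemble over the data set.
    e = sum((y - f) ** 2 for y, f in zip(targets, ensemble))
    # Equations (5) and (6): weighted average of the component errors.
    e_bar = sum(
        w * sum((y - fi) ** 2 for y, fi in zip(targets, out))
        for w, out in zip(weights, outputs)
    )
    # Diversity: weighted spread of the components around the ensemble.
    a = sum(
        w * sum((fi - f) ** 2 for fi, f in zip(out, ensemble))
        for w, out in zip(weights, outputs)
    )
    return e, e_bar, a
```

For example, with `rng = random.Random(0)`, five lists of twenty random outputs and uniform weights of 0.2, the returned values satisfy `e == e_bar - a` up to floating-point rounding. Since the diversity is never negative, the ensemble error cannot exceed the weighted average error of its components.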
Define the correlation of the $i$th component network with the others as:

$$C_i = \sum_{p=1}^{P} \big(f_i(x_p) - f(x_p)\big) \sum_{j=1,\, j \ne i}^{N} \big(f_j(x_p) - f(x_p)\big) \tag{8}$$

To mitigate this potential collinearity problem, Equation (5) is modified by adding a decorrelation penalty to it. The new error function for an individual network is:

$$E_i = \sum_{p=1}^{P} \big(y_p - f_i(x_p)\big)^2 + \lambda C_i \tag{9}$$

where $\lambda$ ($\lambda \ge 0$) is an adjustable parameter used to adjust the strength of the penalty. So the individual networks attempt not only to minimize the error between the target and their output, but also to decorrelate their errors from those of previously trained networks. When the simple average weight is used to combine the component networks, namely $w_i = 1/N$, the deviations from the ensemble output sum to zero, so $\sum_{j \ne i} (f_j(x_p) - f(x_p)) = -(f_i(x_p) - f(x_p))$ and, with the squared-error term scaled by 1/2, Equation (9) can be modified as:

$$E_i = \sum_{p=1}^{P} \Big( \tfrac{1}{2}\big(y_p - f_i(x_p)\big)^2 - \lambda \big(f_i(x_p) - f(x_p)\big)^2 \Big) \tag{10}$$

The average value of all the component errors is:

$$E_{sum} = \frac{1}{N} \sum_{i=1}^{N} \sum_{p=1}^{P} \Big( \tfrac{1}{2}\big(y_p - f_i(x_p)\big)^2 - \lambda \big(f_i(x_p) - f(x_p)\big)^2 \Big) \tag{11}$$

The partial derivative of Equation (10) with respect to the output of network $i$ on the $p$th training sample, with the ensemble output $f(x_p)$ held fixed, is:

$$\frac{\partial E_i}{\partial f_i(x_p)} = -\big(y_p - f_i(x_p)\big) - 2\lambda \big(f_i(x_p) - f(x_p)\big) \tag{12}$$

When $\lambda = 1/2$, $E_{sum}$ equals one half of the ensemble error $E$ of Equation (4), by the decomposition of Equation (7), and Equation (12) reduces to $-(y_p - f(x_p))$, so we get:

$$\frac{\partial E}{\partial f_i(x_p)} \propto \frac{\partial E_i}{\partial f_i(x_p)} \tag{13}$$

According to Equation (13), the minimization of the empirical risk function of the ensemble is achieved by minimizing the error functions of the individual networks. From this view, negative correlation learning provides a novel way to decompose the learning task of the ensemble into a number of subtasks for different individual networks.

In [17], we provide a new method to incrementally construct a heterogeneous neural network ensemble with negative correlation. The new method includes two processes. First, Cascor [18] is modified to construct optimal individual heterogeneous networks with negative correlation learning; during this process, (1) all the individual networks are constructed sequentially with the same data set and (2) Equations (10) and (12) are used to guarantee that all of the individual networks are negatively correlated. Then the optimal individual heterogeneous networks are selected and combined into a heterogeneous neural network ensemble.

Performance evaluation: As one of the most important parts of our work on adaptive content filtering, we have implemented a P2P-based information search and discovery system called PeerBridge for user-centered, timely and incremental information search and extraction from the Web and from other peers. The infrastructure tools of PeerBridge include a Full-text Indexing and Retrieval Engine, Metadata Manager, User Model Manager, HNNE based Content Filter, Web Crawler, P2P Protocol and P2P Search Engine. PeerBridge is currently built on the Windows platform. Figure 3 shows a snapshot of WebBridge and Figure 4 shows a snapshot of PeerBridge.

[Fig. 3: A snapshot of the WebBridge]

Based on PeerBridge, we have evaluated the filtering performance of the Chinese Web page content filter with varying numbers of component neural networks in a Web search task. We find that the heterogeneous neural network ensemble classifier is efficient and feasible for adaptive information filtering in a distributed heterogeneous network environment. In our experiments, six different heterogeneous neural network ensembles are tested, with 1, 5, 10, 15, 20 and 25 component neural networks respectively, denoted HNNE1, HNNE5, HNNE10, HNNE15, HNNE20 and HNNE25. With the above content filters trained by the same
interest documents, PeerBridge searches for relevant web documents from Yahoo China (http://cn.yahoo.com). The evaluation results are shown in Fig. 5.

[Fig. 4: A snapshot of PeerBridge]

[Fig. 5: Precision of content filter with different number of component neural networks. Axes: precision (0 to 100) versus page number (×100, from 4 to 44); curves: HNNE1, HNNE5, HNNE10, HNNE15, HNNE20, HNNE25.]

Table 1: The document number of the training set and test set in six categories

                earn    acq     money-fx   crude   grain   trade
Training set    709     1488    460        349     394     337
Test set        1014    630     133        160     130     106

[Fig. 6: Comparison with HNNE20, SVM, Bayes, HNNE1 in Reuters-21578 collection. Axes: F1 (0 to 1) by category (earn, acq, money-fx, grain, crude, trade).]

The measurement

$$F_1 = \frac{2 R_p R_r}{R_p + R_r}$$

is used to evaluate the performance of the classifiers. If $a$ is the number of documents correctly assigned to a category, $b$ is the number of documents incorrectly assigned to this category and $c$ is the number of documents incorrectly rejected from this category, then precision is $R_p = \frac{a}{a+b}$ and recall is $R_r = \frac{a}{a+c}$. The experiment results are shown in Fig. 6.

Figure 5 shows that combining many component neural networks improved the content filtering precision of the Web search system. It is also obvious that increasing the number of component neural networks improves the precision greatly at the beginning, but when the number is sufficiently large, the improvement becomes small. Figure 6 shows that the heterogeneous neural network ensemble based classification algorithm performed better than the other classification algorithms. Once trained, a neural network ensemble operates very fast. Moreover, a neural network classifier makes far fewer assumptions about the distribution model of the problem than a naïve Bayes classifier, so it depends less on the problem, is robust enough to fit a wide range of distributions accurately and can model any high-degree exponential models.

CONCLUSION

Information access is one of the most important requirements of everybody nowadays.
Facing the information overload on the Web and the problems of CSEs, we provide a P2P based, content-sensitive, interest-related and personalized web crawling system. A new content filter based on the HNNE classifier is proposed to guarantee that each node crawls only personalized relevant information. Performance evaluation in the experiments showed that PeerBridge is effective at crawling relevant information for specific topics. Compared with other classifiers such as SVM, naïve Bayesian and an individual artificial neural network, the experiment results showed that the HNNE classifier is very efficient and feasible. In the future, we will take into account further issues in PeerBridge such as efficient information search, fault tolerance and access control.

REFERENCES

1. Arasu, A., J. Cho, H. Garcia-Molina and S. Raghavan, 2001. Searching the Web. ACM Trans. Internet Technol., 1: 2-43.
2. Baeza-Yates, R., 2003. Information retrieval in the Web: Beyond current search engines. Intl. J. Approx. Reasoning, 34: 97-104.
3. Singh, A., M. Srivatsa, L. Liu and T. Miller, 2003. Apoidea: A decentralized peer-to-peer architecture for crawling the World Wide Web. SIGIR 2003 Workshop on Distributed Information Retrieval.
4. Nejdl, W., B. Wolf, C. Qu, S. Decker, M. Sintek, A. Naeve, M. Nilsson, M. Palmer and T. Risch, 2003. Edutella: A networking infrastructure based on RDF. Proc. 12th Intl. Conf. World Wide Web, Hawaii, USA, pp: 604-15.
5. Suel, T., C. Mathur, J.W. Wu and J. Zhang, 2003. ODISSEA: A peer-to-peer architecture for scalable web search and information retrieval. In 6th Intl. Workshop on the Web and Databases.
6. Fu, X.H., B.Q. Feng, Z.F. Ma and M. He, 2004. Focused crawling method with online-incremental adaptive learning. J. Xi'An JiaoTong Univ., 38: 599-602.
7. Stoica, I., R. Morris, D. Karger, M.F. Kaashoek and H. Balakrishnan, 2001. Chord: A scalable peer-to-peer lookup service for internet application. Proc. SIGCOMM Ann. Conf. Data Communication.
8. Bloom, B., 1970. Space/time trade-off in hash coding with allowable errors. Commun. ACM, 13: 422-426.
9. Fu, X.H. and B.Q. Feng, 2005. Distributed information search based on topic segments in structured peer-to-peer networks. J. Xi'An JiaoTong Univ. (Accepted)
10. Sebastiani, F., 2002. Machine learning in automated text categorization. ACM Comp. Surveys, 34: 1-47.
11. Zhou, Z.H., J.X. Wu and W. Tang, 2002. Ensembling neural networks: Many could be better than all. Artificial Intelligence, 137: 239-263.
12. Hansen, L.K. and P. Salamon, 1990. Neural network ensembles. IEEE Trans. Pattern Analysis and Machine Intelligence, 12: 993-1001.
13. Liu, Y. and X. Yao, 2000. Evolutionary ensembles with negative correlation learning. IEEE Trans. Evolution. Comp., 4: 380-387.
14. Dietterich, T., 2000. Ensemble methods in machine learning. First Intl. Workshop on Multiple Classifier Systems, pp: 1-15.
15. Krogh, A. and J. Vedelsby, 1995. Neural network ensembles, cross validation and active learning. Advances in Neural Information Processing Systems, San Mateo, CA: Morgan Kaufman.
16. Rosen, B.E., 1996. Ensemble learning using decorrelated neural networks. Connection Sci., 8: 373-378.
17. Fu, X.H., B.Q. Feng, Z.F. Ma and M. He, 2004. Method of incremental construction of heterogeneous neural network ensemble with negative correlation. J. Xi'An JiaoTong Univ., 38: 796-799.
18. Fahlman, S.E. and C. Lebiere, 1990. The Cascade-correlation learning architecture. Advances in Neural Information Processing Systems, 2. Los Altos, USA: Morgan Kaufmann Publishers, pp: 524-532.
19. Joachims, T., 1998. Text categorization with support vector machines: Learning with many relevant features. Proc. ECML-98, 10th Eur. Conf. Machine Learning, pp: 137-142.