Towards an Effective Personalized Information Filter for P2P Based Focused Web Crawling




Journal of Computer Science 2 (1): 97-103, 2006
ISSN 1549-3636
© 2006 Science Publications

Fu Xiang-hua and Feng Bo-qin
Department of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, 710049, China

Abstract: Information access is one of the hottest topics of the information society, and it has become even more important since the advent of the Web, yet today's general Web search engines still cannot find correct and timely information for individuals. In this paper, we propose a Peer-to-Peer (P2P) based decentralized focused Web crawling system called PeerBridge to provide a user-centered, content-sensitive and personalized information search service for the Web. PeerBridge is built on the foundation of our previous work on WebBridge, a focused crawling system that crawls the Web according to several specified topics. The most important function of PeerBridge is to identify interesting information, so we further present an efficient personalized information filter in detail, which combines several component neural networks to accomplish the filtering task. Performance evaluation in the experiments showed that PeerBridge is effective at crawling relevant information for specific topics and that the information filter is efficient: its precision is better than that of a support vector machine, naïve Bayes and an individual neural network.

Key words: PeerBridge, web crawling system, P2P based, artificial neural network

INTRODUCTION

Information access is one of the hottest topics of the information society and it has become even more important since the advent of the Web. On one side, our society depends more and more on information. Knowing the right information, at the right moment, as soon as it is available is essential for all of us. On the other side, the amount of available information, especially on the Web, is increasing tremendously over time and we are witnessing an information overload.
The process of extracting relevant information from the Web is still very difficult, time-consuming and in many cases practically infeasible, since it requires huge cognitive processing. Researchers try their best to address the challenging problem of locating correct information on the Web efficiently. They have developed many different techniques, such as centralized search engines, meta search engines, personalized Web search systems and topic-driven search systems [1,2]. The most conventional example is the centralized search engine (CSE). CSEs have several problems. One major problem is that they do not facilitate collaboration among human users, which has the potential to greatly improve Web search quality and efficiency. Without collaboration, users must start from scratch every time they perform a search task, even if other users have done similar or relevant searches. Another major problem with CSEs is that they completely ignore the interests and preferences of users. For the same query, different users are answered with the same list of results. In fact, a substantial amount of personal information could be obtained during a user's searching process, which may be used to find suitable results for a particular user.

With the emergence of successful applications like Gnutella, Kazaa and Freenet, peer-to-peer (P2P) technology has received significant visibility over the past few years. P2P systems are massively distributed computing systems in which peers (nodes) communicate directly with one another to distribute tasks, exchange information or accomplish tasks. There are also a few projects, such as Apoidea [3], Edutella [4] and ODISSEA [5], that attempt to build a P2P based Web search or crawling system. Developing a P2P-based distributed paradigm brings several advantages that cannot be exploited in a centralized paradigm. Basically, they are ascribed to the fact that information has been collected, refined and stored among users according to their interests. The active contributions of users provide multiple advantages.
In effect, the creation of a specific user profile allows filtering search results depending on the user's interests, introducing a certain degree of personalization in search. Further, if one considers users not only as isolated individuals but also as a community, then this social dimension could be exploited in order to access the expertise of people with similar interests. The social dimension of the community allows clustering users according to their interests and expertise, and so focusing on interesting information by reducing the domain of interest.

In this study, we present a P2P based focused Web crawling system called PeerBridge, which is developed based on our WebBridge [4]. In PeerBridge, each node only searches and stores the part of the Web model that the user is interested in, and the nodes interact in a peer-to-peer fashion in order to create a real distributed search engine. All users share these partial models, which globally create a consistent model of the Web resource that is equivalent to its centralized counterpart. One key problem we must solve in PeerBridge is to search for information that is relevant to a specific node. To avoid getting irrelevant information, PeerBridge tries to guess exactly what kind of document the user desires, basing that guess not only on the key words provided by the user, but also on a profile of the user's background and interests and on evaluations of how the system satisfied or failed to satisfy the user's requests in the past. Moreover, it retrieves only the specific kind of documents defined by the user model component. A personalized information filter based on a heterogeneous neural network ensemble classifier (HNNE) [5] is used as the content filter to model the peer's preferences and filter irrelevant information. Furthermore, the Topic Overlay Network Search algorithm (TONS) is developed to support complex queries on top of the existing structured network [6].

Corresponding Author: Fu Xiang-hua, Department of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, 710049, China

Design overview of PeerBridge: PeerBridge has five main components: a content filter, which makes relevance judgments on pages crawled from the Web and on query results returned by other nodes; a distiller, which determines a measure of centrality of crawled pages to determine visit priorities; a crawler with dynamically reconfigurable priority controls, which is governed by the content filter and the distiller; a P2P infrastructure, which supports the construction of a P2P overlay network with other nodes to share and search information; and a user interface, with which the user can edit training samples, select a category taxonomy to train the classifier and query information from the personalized data resource base and from other nodes. A block diagram of the general architecture of PeerBridge is shown in Fig. 1.
Here we briefly outline the basic processes of each component.

The content filter: The content filter is a document classifier implemented by a heterogeneous neural network ensemble to determine whether the downloaded documents are useful. It is the central component that guarantees the quality of the search results. The representative features of the sample Web pages are extracted as inputs to train the HNNE content filter. The objective of training is to let the HNNE configure itself and adjust its weight parameters according to the training examples, to facilitate generalization beyond the training samples. In our system, the training samples include a selected canonical taxonomy (such as Yahoo! or the Open Directory Project) and the examples specified by the user. Together, the training samples define which topics the user is interested in. We use the vector model of documents to represent the user model and compute the similarity between documents and interests. Once trained, the HNNE classifier can be used to filter irrelevant information.

The distiller: The distiller analyzes the link structures of the downloaded Web pages and identifies pages containing large numbers of links to relevant pages, called hubs. Since citations signify deliberate judgment by the page author, most citations point to semantically related material. Intermittently, the system runs a topic distillation algorithm to identify hubs. The visit priorities of these pages and their immediate neighbors are raised. All of the page links distilled by the distiller are placed into the search list in order of their priorities.

The crawler: The function of the crawler is simple. It gets page links from the search list and then seeks and acquires the corresponding Web pages from the Web. Integrated with the distiller and the content filter, the crawler runs as a focused crawler that accesses only a narrow segment of the Web. We have presented a focused crawler with online-incremental adaptive learning ability in [6]. It entails a very small investment in hardware and network resources and yet achieves respectable coverage at a rapid rate.
In PeerBridge, several crawling threads crawl Web pages simultaneously during the working process.

The P2P infrastructure: With the P2P infrastructure, the instances of PeerBridge running on many users' computers form a P2P overlay network to share their information resources. DHT based distributed lookup and information-exchange protocols [7] are used to exchange vital information between the peers. Each peer maintains a small routing table. Given a key, these techniques guarantee that its value is located within a bounded number of hops in the network. A Bloom filter [8] is used to store the list of URLs already crawled by a peer. TONS is used to support complex queries [9]. Thus Web content is managed by a distributed team of peers, each of which specializes in one or a few topics. When a query is issued, each peer not only looks it up on the local host but also publishes it to the overlay network. With our effective P2P search algorithm, the relevant query results from the whole overlay network are returned to the user.

The user interface: The user interface mainly provides a convenient operation interface to the user. The user can use it to select a category taxonomy, edit and judge examples, query information, display ranked query results and so on. In our prototype, it has not yet been implemented completely.
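The Bloom filter that each peer uses to remember crawled URLs can be illustrated with a short Python sketch. It is only an assumption about how such a filter might look: the class name `BloomFilter`, the bit-array size, the number of hash functions and the SHA-256 based double hashing are illustrative choices, not details taken from PeerBridge.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter for remembering crawled URLs (illustrative)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        # num_bits and num_hashes are hypothetical defaults, not PeerBridge settings.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, url):
        # Derive num_hashes bit positions from two halves of one SHA-256 digest
        # (double hashing); forcing h2 to be odd keeps the step non-zero.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, url):
        # Mark the URL as crawled by setting all of its bit positions.
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))


crawled = BloomFilter()
crawled.add("http://cn.yahoo.com")
already_seen = "http://cn.yahoo.com" in crawled
```

A filter of this kind never reports a crawled URL as new, but it may occasionally report a new URL as already crawled, so a peer trades a small loss of coverage for a compact, fixed-size structure.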

Fig. 1: The general architecture of PeerBridge (block diagram of Peer A, with its user interface, taxonomy table, classifier, content filter, user model, distiller, crawler and personalized data resource, connected to the Web and, through the peer-to-peer protocol, to Peer B)

Adaptive content filtering model: An information filtering system can use intelligent content analysis to automatically classify documents. If a document is judged not to belong to a user-specified class, it is an irrelevant document and should be discarded. Such methods include k-nearest neighbor classification, linear least squares fit, linear discriminant analysis and naïve Bayesian probabilistic classification [1,2,10]. However, because real-world data such as ours tend to be noisy and are not clearly defined, linear or low-order statistical models cannot always describe them. We use artificial neural networks because they are robust enough to fit a wide range of distributions accurately and can model any high-degree exponential models. Neural networks are also chosen for computational reasons since, once trained, they operate very fast. Moreover, such a learning and adaptation process can give semantic meaning to context-dependent words.

User model: To filter information for specific users according to their preferences and interests, a user model is created as an image of what users need. We define a user model as:

$$UM := (UMID, FD, FT, UI, UIV) \tag{1}$$

where $UMID$ is a user model identifier, $FD := \{d_1, d_2, \ldots, d_N\}$ is a set of sample documents, $FT := \{t_1, t_2, \ldots, t_M\}$ is a lexicon comprising all feature terms of $FD$, $UI := \{u_1, u_2, \ldots, u_T\}$ is a set of interests specified by users and $UIV := \{UIV_1, UIV_2, \ldots, UIV_T\}$ is a set of interest vectors of a specific user, of which every element corresponds to an interest $u_k$ ($1 \le k \le T$) and is defined as $UIV_k := \langle (t_1, w_{1k}), (t_2, w_{2k}), \ldots, (t_M, w_{Mk}) \rangle$, where $w_{ik}$ is the frequency of term $t_i$ ($1 \le i \le M$) in $UIV_k$.

According to the vector space model (VSM), $FD$ constitutes a term-by-document matrix $X := (d_1, d_2, \ldots, d_N)$, where a column $d_j := \langle (t_1, x_{1j}), (t_2, x_{2j}), \ldots, (t_M, x_{Mj}) \rangle$ is the document vector of document $d_j$ and every element $x_{ij}$ is the frequency of term $t_i$ in document $d_j$. The TF-IDF frequency is used, which is defined as:

$$x_{ij} = tf_{ij} \cdot \log\left(N / df_i\right) \tag{2}$$

where $tf_{ij}$ is the number of times term $t_i$ occurs in document $d_j$ and $df_i$ is the number of documents in which term $t_i$ occurs. The similarity between document vectors is defined as:

$$Sim(d_i, d_j) = \frac{\sum_{k=1}^{M} x_{ki}\, x_{kj}}{\sqrt{\sum_{k=1}^{M} x_{ki}^2}\; \sqrt{\sum_{k=1}^{M} x_{kj}^2}} \tag{3}$$

Equation (3) can also be used to compute the similarity between a document vector and an interest vector.

Neural networks-based content filtering: The neural networks-based adaptive content filter comprises two major processes: training and classification. During training, the filter learns from sample documents to form a knowledge base. It then classifies incoming documents according to their content. Before training or classification, a preprocessing procedure is needed to extract words and phrases from the documents using a specific feature selection algorithm. The neural networks contain an input layer, with as many elements as there are feature terms needed to describe the documents to be classified, as well as a middle layer, which organizes the training document set so that an individual processing element represents each input vector. Finally, they have an output layer, also called a summation layer, which has as many processing elements as there are user interests to be recognized.
Each element in this layer is combined via processing elements within the middle layer, which relate to the same class and prepare that category for output. Figure 2 illustrates the form of a content filter based on a three-layer feedforward artificial neural network.

Fig. 2: Adaptive content filter based on a three-layer feedforward artificial neural network (input layer of feature terms $t_1, \ldots, t_M$; layer of hidden neurons; layer of output neurons for the user interests $u_1, \ldots, u_T$)
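The TF-IDF weighting of Equation (2) and the similarity of Equation (3), which produce the term vectors fed to the input layer, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: documents are taken to be already tokenized, and the function names `tfidf_vectors` and `cosine_similarity` as well as the toy documents are hypothetical, not part of PeerBridge.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (lexicon, vectors) with x_ij = tf_ij * log(N / df_i), as in Equation (2).

    docs is a list of token lists; each vector follows the order of the lexicon.
    """
    n_docs = len(docs)
    lexicon = sorted({term for doc in docs for term in doc})
    # df_i: number of documents in which term t_i occurs at least once.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # tf_ij: occurrences of term t_i in this document
        vectors.append([tf[term] * math.log(n_docs / df[term]) for term in lexicon])
    return lexicon, vectors


def cosine_similarity(u, v):
    """Similarity of Equation (3); defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy sample documents (assumed to be preprocessed into feature terms).
docs = [
    ["p2p", "crawler", "web"],
    ["p2p", "search", "web"],
    ["stock", "market"],
]
lexicon, vectors = tfidf_vectors(docs)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
```

The same `cosine_similarity` function applies unchanged when one argument is an interest vector instead of a document vector, since both share the lexicon order.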

In our content filter, the numerical input obtained from each document is a vector containing the frequency of appearance of terms. Owing to the possible appearance of thousands of terms, the dimension of the vectors can be reduced by singular value decomposition (SVD), principal component analysis, information entropy loss, a word frequency threshold [10], etc.

Heterogeneous neural network ensemble classifier: A neural network ensemble (NNE) is a learning paradigm where many neural networks are jointly used to solve a problem [11]. It originates from Hansen and Salamon's work [12], which shows that the generalization performance of a neural network system can be significantly improved by combining several individual networks trained on the same task. A neural network ensemble is constructed in two steps, the first being the judicious creation of the individual ensemble members and the second their appropriate combination to produce the ensemble output. There has been much work on training NN ensembles [11-16]. However, all these methods only change the weights in an ensemble. The structure of the ensemble, e.g., the number of NNs in the ensemble, and the structure of the individual NNs, e.g., the number of hidden nodes, are all designed manually and fixed during the training process. While manual design of NNs and ensembles might be appropriate for problems where rich prior knowledge and an experienced expert exist, it often involves a tedious trial-and-error process for many real-world problems, because rich prior knowledge and experienced human experts are hard to get in practice.

In [17], we propose a new method to construct a heterogeneous neural network ensemble (HNNE) with negative correlation. It combines the ensemble's architecture design with cooperative training of the individual NNs in an ensemble. It automatically determines not only the number of NNs in an ensemble, but also the number of hidden nodes in the individual NNs. It uses incremental training based on negative correlation learning [10,13] to train the individual NNs.
The main advantage of negative correlation learning is that it encourages different individual NNs to learn different aspects of the training data, so that the ensemble can learn the whole training data better. It does not require any manual division of the training data to produce different training sets for different individual NNs in an ensemble.

Theory foundation of neural network ensemble: Suppose a data set $D := \{(x_1, y_1), (x_2, y_2), \ldots, (x_P, y_P)\}$, where $x_p$ is the input sample and $y_p$ is the output result ($1 \le p \le P$). An ensemble comprises $N$ component neural networks, and every component network is trained to approximate a function $f: \mathbb{R}^m \to C$, where $C$ is the set of class labels. Suppose the weight of the $i$-th component network is $w_i$ ($1 \le i \le N$) and all the weights satisfy $w_i \ge 0$, $\sum_{i=1}^{N} w_i = 1$. When the input sample is $x_p$, the output of the $i$-th component network is $f_i(x_p)$ and the output of the ensemble is $\bar{f}(x_p) = \sum_{j=1}^{N} w_j f_j(x_p)$. Thus the generalization error of the ensemble on the whole data set is:

$$E = \frac{1}{2} \sum_{p=1}^{P} \left(y_p - \bar{f}(x_p)\right)^2 \tag{4}$$

The generalization error of the $i$-th component network on the whole data set is:

$$E_i = \frac{1}{2} \sum_{p=1}^{P} \left(y_p - f_i(x_p)\right)^2 \tag{5}$$

The weighted generalization error of the component networks is:

$$\bar{E} = \sum_{i=1}^{N} w_i E_i \tag{6}$$

The diversity of the ensemble is:

$$\bar{A} = \frac{1}{2} \sum_{i=1}^{N} w_i \sum_{p=1}^{P} \left(f_i(x_p) - \bar{f}(x_p)\right)^2$$

So the generalization error of the ensemble satisfies:

$$E = \bar{E} - \bar{A} \tag{7}$$

Combining the outputs is clearly only relevant when they disagree on some or several of the inputs. This insight was formalized in [15], which showed that the squared error of the ensemble when predicting a single target is equal to the average squared error of the individual networks, minus the diversity, defined as the variance of the individual network outputs. Thus, to reduce the ensemble error, one tries to increase the diversity without increasing the individual network errors too much.

Construct neural network ensemble with negative correlation: Because all the component networks are trained with the samples of the same data set $D$ to approximate the same function, the outputs of the component networks are highly correlated, potentially leading to severe collinearity and reducing the robustness of the ensemble network [16].
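Before turning to the penalty term, the decomposition in Equation (7), which motivates it, can be checked numerically. The Python sketch below is an illustration only: the function name `ensemble_terms`, the random component outputs and the uniform weights are assumptions made for the example, and the half squared error convention is assumed for all three quantities (the identity holds for any common scaling as long as the weights sum to one).

```python
import random


def ensemble_terms(outputs, targets, weights):
    """Return (E, E_bar, A_bar) for component outputs outputs[i][p].

    E is the ensemble error, E_bar the weighted component error and
    A_bar the diversity, all with the 1/2 squared error convention.
    """
    n_nets = len(outputs)
    n_samples = len(targets)
    # Ensemble output: weighted combination of the component outputs.
    f_bar = [sum(weights[i] * outputs[i][p] for i in range(n_nets))
             for p in range(n_samples)]
    e_ens = 0.5 * sum((targets[p] - f_bar[p]) ** 2 for p in range(n_samples))
    e_bar = sum(
        weights[i] * 0.5 * sum((targets[p] - outputs[i][p]) ** 2
                               for p in range(n_samples))
        for i in range(n_nets)
    )
    a_bar = sum(
        weights[i] * 0.5 * sum((outputs[i][p] - f_bar[p]) ** 2
                               for p in range(n_samples))
        for i in range(n_nets)
    )
    return e_ens, e_bar, a_bar


random.seed(0)
targets = [random.choice([0.0, 1.0]) for _ in range(50)]           # toy labels
outputs = [[random.random() for _ in targets] for _ in range(5)]   # 5 toy networks
weights = [0.2] * 5                                                # simple average
e_ens, e_bar, a_bar = ensemble_terms(outputs, targets, weights)
gap = abs(e_ens - (e_bar - a_bar))  # should vanish up to rounding error
```

Because the diversity term is never negative, the ensemble error cannot exceed the weighted component error, which is why increasing disagreement among accurate components is worthwhile.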
Define the correlation of the $i$-th component network with the others as:

$$C_i = \sum_{p=1}^{P} \left(f_i(x_p) - \bar{f}(x_p)\right) \sum_{j=1,\, j \ne i}^{N} \left(f_j(x_p) - \bar{f}(x_p)\right) \tag{8}$$

To mitigate this potential collinearity problem, Equation (5) is modified by adding a decorrelation penalty to it. The new error function for an individual network is:

$$E_i = \frac{1}{2} \sum_{p=1}^{P} \left(y_p - f_i(x_p)\right)^2 + \lambda C_i \tag{9}$$

where $\lambda$ ($\lambda \ge 0$) is an adjustable parameter used to adjust the strength of the penalty. So the individual networks attempt not only to minimize the error between the target and their output, but also to decorrelate their errors from those of previously trained networks. When the simple average weight is used to combine the component networks, namely $w_i = 1/N$, the deviations of the component outputs from the ensemble output sum to zero, so that $\sum_{j \ne i} (f_j(x_p) - \bar{f}(x_p)) = -(f_i(x_p) - \bar{f}(x_p))$ and Equation (9) can be rewritten as:

$$E_i = \sum_{p=1}^{P} \left( \frac{1}{2} \left(y_p - f_i(x_p)\right)^2 - \lambda \left(f_i(x_p) - \bar{f}(x_p)\right)^2 \right) \tag{10}$$

The average value of all the component errors is:

$$E_{sum} = \frac{1}{N} \sum_{i=1}^{N} \sum_{p=1}^{P} \left( \frac{1}{2} \left(y_p - f_i(x_p)\right)^2 - \lambda \left(f_i(x_p) - \bar{f}(x_p)\right)^2 \right) \tag{11}$$

The partial derivative of Equation (10) with respect to the output of network $i$ on the $p$-th training sample, with the ensemble output $\bar{f}(x_p)$ held fixed, is:

$$\frac{\partial E_i}{\partial f_i(x_p)} = -\left(y_p - f_i(x_p)\right) - 2\lambda \left(f_i(x_p) - \bar{f}(x_p)\right) \tag{12}$$

When $\lambda = 1/2$, $E = E_{sum}$, so we get

$$\frac{\partial E_i}{\partial f_i(x_p)} \propto \frac{\partial E}{\partial f_i(x_p)} \tag{13}$$

According to Equation (13), the minimization of the empirical risk function of the ensemble is achieved by minimizing the error functions of the individual networks. From this view, negative correlation learning provides a novel way to decompose the learning task of the ensemble into a number of subtasks for different individual networks.

In [17], we provide a new method to incrementally construct a heterogeneous neural network ensemble with negative correlation. The new method includes two processes. First, Cascor [18] is modified to construct optimal individual heterogeneous networks with negative correlation learning; during this process, (1) all the individual networks are constructed sequentially with the same data set and (2) Equations (10) and (12) are used to guarantee that all of the individual networks are negatively correlated. Then the optimal individual heterogeneous networks are selected and combined into a heterogeneous neural network ensemble.

Performance evaluation: As one of the most important parts of our work on adaptive content filtering, we have implemented a P2P-based information search and discovery system called PeerBridge for user-centered, timely and incremental information search and extraction from the Web and from other peers. The infrastructure tools of PeerBridge include a Full-text Indexing and Retrieval Engine, Metadata Manager, User Model Manager, HNNE based Content Filter, Web Crawler, P2P Protocol and P2P Search Engine. PeerBridge is currently built on the Windows platform. Figure 3 shows a snapshot of WebBridge and Figure 4 shows a snapshot of PeerBridge.

Fig. 3: A snapshot of the WebBridge

Fig. 4: A snapshot of PeerBridge

Based on PeerBridge, we have evaluated the filtering performance of the Chinese Web page content filter with varying numbers of component neural networks in a Web search task. We find that the heterogeneous neural network ensemble classifier is efficient and feasible for adaptive information filtering in a distributed heterogeneous network environment. In our experiments, six different heterogeneous neural network ensembles are tested, whose numbers of component neural networks are respectively 1, 5, 10, 15, 20 and 25; they are denoted HNNE1, HNNE5, HNNE10, HNNE15, HNNE20 and HNNE25. With these content filters trained on the same interest documents, PeerBridge searches for relevant Web documents from Yahoo China (http://cn.yahoo.com). The evaluation results are shown in Fig. 5.

Fig. 5: Precision of the content filter with different numbers of component neural networks (precision in % against page number ×100, one curve per ensemble from HNNE1 to HNNE25)

Table 1: The number of documents in the training set and test set in six categories

                earn    acq    money-fx   crude   grain   trade
Training set     709   1488      460       349     394     337
Test set        1014    630      133       160     130     106

The measure

$$F_1 = \frac{2 R_p R_r}{R_p + R_r}$$

is used to evaluate the performance of the classifiers. If $a$ is the number of documents correctly assigned to a category, $b$ is the number of documents incorrectly assigned to this category and $c$ is the number of documents incorrectly rejected from this category, then the precision is $R_p = a / (a + b)$ and the recall is $R_r = a / (a + c)$. The experiment results are shown in Fig. 6.

Fig. 6: Comparison of HNNE20, SVM, Bayes and HNNE1 on the Reuters-21578 collection (F1 per category: earn, acq, money-fx, grain, crude, trade)

Figure 5 shows that combining many component neural networks improves the content filtering precision of the Web search system. It is also obvious that increasing the number of component neural networks improves the precision considerably at the beginning, but when the number is sufficiently large, the improvement becomes small. Figure 6 shows that the heterogeneous neural network ensemble based classification algorithm performs better than the other classification algorithms. Once trained, a neural network ensemble operates very fast. Moreover, a neural network classifier makes far fewer assumptions about the distribution model of the problem than a naïve Bayes classifier, so it depends less on the problem, is robust enough to fit a wide range of distributions accurately and can model any high-degree exponential models.

CONCLUSION

Information access is one of the most important requirements of everybody nowadays.
Facing the information overload on the Web and the problems of CSEs, we provide a P2P based, content-sensitive, interest-related and personalized Web crawling system. A new content filter based on the HNNE classifier is proposed to guarantee that each node only crawls personalized relevant information. Performance evaluation in the experiments showed that PeerBridge is effective at crawling relevant information for specific topics. Compared with other classifiers such as SVM, naïve Bayes and an individual artificial neural network, the experiment results showed that the HNNE classifier is very efficient and feasible. In the future, we will take into account issues in PeerBridge such as efficient information search, fault tolerance and access control.

REFERENCES

1. Arasu, A., J. Cho, H. Garcia-Molina and S. Raghavan, 2001. Searching the Web. ACM Trans. Internet Technol., 1: 2-43.
2. Baeza-Yates, R., 2003. Information retrieval in the Web: Beyond current search engines. Intl. J. Approx. Reasoning, 34: 97-104.
3. Singh, A., M. Srivatsa, L. Liu and T. Miller, 2003. Apoidea: A decentralized peer-to-peer architecture for crawling the World Wide Web. SIGIR 2003 Workshop on Distributed Information Retrieval.

4. Nejdl, W., B. Wolf, C. Qu, S. Decker, M. Sintek, A. Naeve, M. Nilsson, M. Palmer and T. Risch, 2003. Edutella: A networking infrastructure based on RDF. Proc. 12th Intl. Conf. World Wide Web, Hawaii, USA, pp: 604-15.
5. Suel, T., C. Mathur, J.W. Wu and J. Zhang, 2003. ODISSEA: A peer-to-peer architecture for scalable web search and information retrieval. In 6th Intl. Workshop on the Web and Databases.
6. Fu, X.H., B.Q. Feng, Z.F. Ma and M. He, 2004. Focused crawling method with online-incremental adaptive learning. J. Xi'an JiaoTong Univ., 38: 599-602.
7. Stoica, I., R. Morris, D. Karger, M.F. Kaashoek and H. Balakrishnan, 2001. Chord: A scalable peer-to-peer lookup service for internet application. Proc. SIGCOMM Ann. Conf. Data Communication.
8. Bloom, B., 1970. Space/time trade-off in hash coding with allowable errors. Commun. ACM, 13: 422-426.
9. Fu, X.H. and B.Q. Feng, 2005. Distributed information search based on topic segments in structured peer-to-peer networks. J. Xi'an JiaoTong Univ. (Accepted)
10. Sebastiani, F., 2002. Machine learning in automated text categorization. ACM Comp. Surveys, 34: 1-47.
11. Zhou, Z.H., J.X. Wu and W. Tang, 2002. Ensembling neural networks: Many could be better than all. Artificial Intelligence, 137: 239-263.
12. Hansen, L.K. and P. Salamon, 1990. Neural network ensembles. IEEE Trans. Pattern Analysis and Machine Intelligence, 12: 993-1001.
13. Liu, Y. and X. Yao, 2000. Evolutionary ensembles with negative correlation learning. IEEE Trans. Evolution. Comp., 4: 380-387.
14. Dietterich, T., 2000. Ensemble methods in machine learning. First Intl. Workshop on Multiple Classifier Systems, pp: 1-15.
15. Krogh, A. and J. Vedelsby, 1995. Neural network ensembles, cross validation and active learning. Advances in Neural Information Processing Systems, San Mateo, CA: Morgan Kaufman.
16. Rosen, B.E., 1996. Ensemble learning using decorrelated neural networks. Connection Sci., 8: 373-378.
17. Fu, X.H., B.Q. Feng, Z.F. Ma and M. He, 2004. Method of incremental construction of heterogeneous neural network ensemble with negative correlation. J. Xi'an JiaoTong Univ., 38: 796-799.
18. Fahlman, S.E. and C. Lebiere, 1990. The Cascade-correlation learning architecture. Advances in Neural Information Processing Systems, 2. Los Altos, USA: Morgan Kaufmann Publishers, pp: 524-532.
19. Joachims, T., 1998. Text categorization with support vector machines: Learning with many relevant features. Proc. ECML-98, 10th Eur. Conf. Machine Learning, pp: 137-142.