
Journal of Computational Information Systems 10: 2 (2014) 655-664
Available at http://www.jofcis.com

Construction Algorithms for Index Model Based on Web Page Classification

Yangjie ZHANG 1,2, Chungang YAN 1,2, Pengwei WANG 1,2, Haichun SUN 1,2

1 Department of Computer Science and Technology, Tongji University, Shanghai 201804, China
2 The Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai 201804, China

Project supported by the National Basic Research Program of China 973 (No. 2010CB328101), the International Science & Technology Cooperation Program of China (No. 2013DFM10100), the National Natural Science Funds of China (No. 61173016), and the Shanghai Science & Technology Research Plan (No. 11231202804). Corresponding author. Email address: zyj177484@126.com (Yangjie ZHANG). 1553-9105 / Copyright 2014 Binary Information Press. DOI: 10.12733/jcis9017. January 15, 2014.

Abstract

Web pages on the Internet are massive, diverse, heterogeneous and redundant, and how to organize and manage them effectively is an urgent problem. In this paper, we propose a method to index web pages and build an index model based on web page classification and hyperlink analysis. First, an initialization algorithm is given to construct an index model for an initial set of web pages. Then, considering the dynamics of web pages on the Internet, we propose an incremental updating algorithm that updates an existing index model as new pages arrive. Theoretical analysis shows that the complexity of the proposed algorithms grows linearly with the scale of web pages and their growth. The experimental results show that the initialization algorithm can construct an index model for a fixed set of web pages relatively fast, while the incremental updating algorithm can keep pace with the update speed of web pages on the Internet. Thus, the proposed algorithms are feasible and effective. The constructed index model can support diversified information service systems, enabling them to make better use of web resources and provide more valuable services to users.

Keywords: Web Page Classification; Index Model; Hyperlinks; Naive Bayes

1 Introduction

With the rapid development of the Internet, network resources have become more and more abundant, and various kinds of information service systems have emerged to facilitate people's daily lives. However, owing to the openness, dynamics and complexity of the Internet, web pages are massive, diverse, heterogeneous and redundant, and existing methods do not give these systems a proper way to organize and manage them. As a popular information service system, the search engine solves the problem of acquiring web pages rapidly through its main technologies, including the inverted index, link analysis, and distributed storage [1].

A search engine helps users find relevant web pages through keyword matching. However, the rapid growth of web pages reduces search efficiency and increases the redundancy and inaccuracy of the returned results; moreover, keyword matching alone cannot meet the personalized demands of information retrieval services. The recommendation system is another popular representative. Since research on collaborative filtering appeared in the mid-1990s, diversified recommendation algorithms have been proposed and applied in many research fields [2-4], including cognitive science [5], information retrieval [6] and management science [7]. Web page recommendation systems recommend similar web pages to users with common preferences, and improve the intelligence of Internet information service systems. In general, there are two types of recommender systems in this field: one can only find web pages similar to the customer's interests by content-based filtering, while the other needs massive user comments; and as the number of web pages grows, the accuracy and speed of collaborative filtering degrade. Due to the lack of an effective method for organizing and managing web pages, the performance of web-based information systems is restricted [8].

To this end, this paper presents an index model for web pages, which can organize web pages and uncover the relationships among them. We make the assumption that hyperlinks among web pages reflect certain business relationships in the real world. Based on web page classification, two construction algorithms for the index model are designed and implemented: one is the initialization algorithm for a given set of web pages, and the other is an updating algorithm for an existing index model and an incremental set of web pages. The index model is then completed by analyzing the hyperlinks among web pages after classification. Experimental results show that the proposed algorithms are feasible and effective. First, to validate its effectiveness, an example index model is constructed from one million web pages crawled from the Internet, which shows that the generated relationships among web page classes can reflect real-world business associations. Second, for a given set of web pages, the initialization algorithm can complete the construction of the index model within a limited time. Last, the incremental updating algorithm can keep up with the actual growth of web pages.

2 Index Model Framework

In this paper, c denotes a web page class; C denotes a set of web page classes; p denotes a web page; D denotes a sample set of web pages; t denotes a feature, which is a word.

Definition 1. (Index Model) An index model is a weighted directed graph G = (V, E), where

a) V = {c_i | c_i ∈ C}, and C is the set of all web page classes in the index model;

b) E = {⟨c_i, c_j⟩ | c_i, c_j ∈ V and c_i ≠ c_j};

c) the weight w⟨c_i, c_j⟩ of edge ⟨c_i, c_j⟩ is

$$w\langle c_i, c_j \rangle = \sum_{p_t \in c_j,\ p_k \in c_i,\ u_k = url_t} \big( W_u \cdot P(c_j \mid p_t) \cdot P(c_i \mid p_k) \big), \qquad W_u = \frac{1}{n},$$

where u_k is a hyperlink on web page p_k, url_t is the Uniform Resource Locator (URL) of web page p_t, n is the number of hyperlinks on web page p_k, and P(c_j | p_t) is the probability that web page p_t belongs to web page class c_j.
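To make the edge weight in Definition 1 concrete, here is a small worked example with assumed numbers: suppose a page p_k ∈ c_i carries n = 4 hyperlinks, exactly one of which points to a page p_t ∈ c_j, and the classifier gives P(c_j | p_t) = 0.9 and P(c_i | p_k) = 0.8. Then that link's contribution to the edge weight is

$$W_u = \frac{1}{4} = 0.25, \qquad w\langle c_i, c_j \rangle = 0.25 \times 0.9 \times 0.8 = 0.18.$$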

The entire construction process of the index model is as follows:

1. Web page pretreatment. (1) Extract the content enclosed by HTML tags from a web page. (2) Extract the content text. (3) Combine the contents extracted in steps (1) and (2). (4) Segment the web page content. (5) Remove stop words. (6) Select features to represent each web page. (7) Extract the hyperlinks in web pages.

2. Training the classifier and web page classification. (1) Generate a classifier by learning from a sample set of web pages. (2) Classify all preprocessed web pages.

3. Compute the link relationships among web page classes. (1) Use the index initialization algorithm to complete the initialization of an index model. (2) Use the index incremental updating algorithm to update an existing index model.

The building process of the index model is shown in Fig. 1.

Fig. 1: Building process of index model based on web page classification (from the sample, initial, and updating sets of web pages, through web page pretreatment — extracting tags, body contents, segmentation, stop-word removal, feature selection, hyperlink extraction — and classification, to computing link relationships with the two algorithms)

3 Algorithms for Constructing Index Model

3.1 Web page preprocessing

A web page contains various HTML tags, and the contents within different tags make different contributions to its subject. In order to classify a web page and obtain the relationships among web page classes, we first preprocess each web page, which includes extracting tags and body contents, content segmentation, extracting hyperlinks, etc.

Step 1. Extract the contents enclosed in the Title, META, H1-H6, and a tags of an HTML file.

Step 2. Extract the content text of the web page using a statistical approach [9].

Step 3. Content segmentation. Process the contents extracted in Steps 1 and 2 using the Chinese word segmentation tool IKAnalyzer [10], and remove stop words.

Step 4. Weight each feature using TF*IDF [11].

Step 5. Extract the feature vector of each web page. Based on the values obtained in Step 4, select the top n feature items to constitute the feature vector (n is an experimental value). A web page is then represented as p = (t_1, t_2, ..., t_n).

Step 6. Extract the hyperlinks of the web page. All hyperlinks extracted from a web page are denoted as a vector P_out = (u_1, u_2, ..., u_m), where u_i is a hyperlink on this web page.

Step 7. Weight each hyperlink. The weight of a hyperlink u is calculated as W_u = 1/m, where m is the total number of hyperlinks on the page. The basic idea derives from the Random Surfer Model [12], which assumes that all hyperlinks on a web page are equally important.
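The feature-selection and link-weighting steps above can be summarized in code. Below is a minimal sketch, assuming pages arrive as already-segmented token lists (the paper uses IKAnalyzer for Chinese segmentation) and that the page being scored is itself part of the corpus; the function names and the simple TF formulation are illustrative, not from the paper.

```python
import math
from collections import Counter

def select_features(doc_tokens, corpus, n=20):
    # Steps 4-5: weight every token of one page by TF*IDF over the corpus,
    # then keep the top-n tokens as the page's feature vector.
    N = len(corpus)
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))                    # document frequency per token
    tf = Counter(doc_tokens)
    score = {t: (tf[t] / len(doc_tokens)) * math.log(N / df[t]) for t in tf}
    return sorted(score, key=score.get, reverse=True)[:n]

def weight_hyperlinks(out_links):
    # Steps 6-7: random-surfer weighting, every hyperlink gets W_u = 1/m.
    m = len(out_links)
    return {u: 1.0 / m for u in out_links}
```

For instance, a page with four outgoing links gives each link the weight 0.25, matching Step 7.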

3.2 Training classifier and web page classification

This paper uses the Naive Bayes method [13] to train the classifier and classify all of the preprocessed web pages. For dealing with massive numbers of web pages, compared with KNN [14] and other methods, Naive Bayes guarantees good classification accuracy and fast classification speed, and it is simple.

Step 1. Generate a classifier by learning from a sample set of web pages. In this process, we calculate the probability that each feature item belongs to each web page class based on the sample pages. Various methods exist for this calculation. McCallum and Nigam proposed a multinomial model [13] whose misclassification rate is lower than that of other models when the feature set is relatively large. In this paper, we use their multinomial model to calculate P(t_j | c), the probability of t_j belonging to c:

$$P(t_j \mid c) = \frac{1 + TF(t_j, c)}{|V| + \sum_{k=1}^{|V|} TF(t_k, c)},$$

where |V| is the total number of feature items in sample class c, and TF(t_j, c) is the total frequency of t_j appearing in c.

Step 2. Classify all preprocessed web pages. Calculate the probability that p belongs to each web page class, and classify p into the class c_i with the greatest probability:

$$i = \arg\max_{c_j \in C} P(c_j \mid p), \qquad P(c_j \mid p) = \frac{P(c_j) \prod_{i=1}^{n} P(t_i \mid c_j)}{P(p)},$$

where P(c_j) is the prior probability, estimated as the proportion of sample web pages belonging to c_j; P(p) is a constant; and n is the number of feature items in p.
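As a concrete reading of the two steps above, here is a minimal multinomial Naive Bayes sketch with the Laplace-smoothed estimate of P(t_j | c), computed in log space so that the constant P(p) can be dropped; the data layout (token lists paired with class labels) is an assumption for illustration.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    # samples: list of (tokens, cls). Estimate log P(c) and smoothed log P(t|c).
    prior, tf, vocab = Counter(), defaultdict(Counter), set()
    for tokens, cls in samples:
        prior[cls] += 1
        tf[cls].update(tokens)
        vocab.update(tokens)
    total, V = sum(prior.values()), len(vocab)
    model = {}
    for cls in prior:
        denom = V + sum(tf[cls].values())          # |V| + sum_k TF(t_k, c)
        logp = {t: math.log((1 + tf[cls][t]) / denom) for t in vocab}
        model[cls] = (math.log(prior[cls] / total), logp, math.log(1.0 / denom))
    return model

def classify(tokens, model):
    # Step 2: argmax_c [log P(c) + sum_i log P(t_i|c)]; tokens never seen in
    # training fall back to the smoothed zero-count estimate.
    def score(cls):
        logprior, logp, unseen = model[cls]
        return logprior + sum(logp.get(t, unseen) for t in tokens)
    return max(model, key=score)
```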

3.3 Computing link relationships among web page classes

3.3.1 Web page representation

After preprocessing and classification, we obtain the elements of a web page listed in Table 1.

Table 1: The representation of a web page p

Element | Symbol
feature vector | (t_1, t_2, ..., t_n)
web page class | c
probability of p belonging to c | P(c | p)
set of hyperlinks on p | P_out = (u_1, u_2, ..., u_m)
set of URLs which have hyperlinks to p | P_in = (u_1, u_2, ..., u_o)

P_in is initialized to empty before the link relationships are computed.

3.3.2 Computing relationships among web page classes

A directed edge ⟨c_i, c_j⟩ in an index model represents the direct link relationship from c_i to c_j, and its weight w⟨c_i, c_j⟩ is calculated as a conditional probability P(c_j | c_i). It indicates the probability of users visiting web page class c_j after they have browsed c_i. In this article, we propose a way to build the link relationships among web page classes based on the weights of hyperlinks and the probabilities of web pages belonging to classes. We suppose that a web page belongs to only one class. Then, P(c_j | c_i) is calculated as follows:

$$P(c_j \mid c_i) = \sum_{p_t \in c_j} P(c_j \mid p_t) \cdot P(p_t \mid c_i), \tag{1}$$

where P(c_j | p_t) and P(p_t | c_i) are computed as:

$$P(c_j \mid p_t) = \frac{P(c_j \mid p_t)}{1 + \sum_{p_k \in c_j} P(c_j \mid p_k)}, \tag{2}$$

$$P(p_t \mid c_i) = \sum_{p_k \in c_i} P(p_t \mid p_k) \cdot P(c_i \mid p_k), \qquad P(p_t \mid p_k) = \sum_{u_k = url_t} W_u. \tag{3}$$

In Eq. (2), the left-hand side P(c_j | p_t) is the probability of p_t belonging to c_j after normalization. In Eq. (3), P(p_t | c_i) is the probability of users visiting p_t after they have browsed c_i, which is calculated by summing over all hyperlinks from web page class c_i to page p_t; u_k denotes a hyperlink on p_k, url_t is the URL of p_t, and W_u is the weight of hyperlink u_k. To facilitate the calculation, Eq. (3) can be transformed into the following form:

$$P(p_t \mid c_i) = \sum_{p_k \in c_i,\ u_k = url_t} \big( W_u \cdot P(c_i \mid p_k) \big). \tag{4}$$

Then, P(c_j | c_i) is calculated as follows:

$$P(c_j \mid c_i) = \sum_{p_t \in c_j,\ p_k \in c_i,\ u_k = url_t} \big( W_u \cdot P(c_j \mid p_t) \cdot P(c_i \mid p_k) \big). \tag{5}$$

3.3.3 Two algorithms for constructing the index model

For a given set of web pages, Algorithm 1 completes the initial construction of an index model as defined in Def. 1.

Algorithm 1: Index Initialization Algorithm
Input: a set of web page classes C, a set of web pages S.
Output: index model matrix INCC, hash table HT.
1: for each c ∈ C
2:   for each p ∈ c
3:     calculate 1 + Σ_{p ∈ c} P(c | p)
4:     HT.put(url, c)  // url is the URL of p
5: for each p ∈ S
6:   for each u ∈ P_out  // P_out is the set of hyperlinks on p
7:     if HT.contains(u)
8:       calculate P(c_j | p_k) and P(c_i | p)  // u is the URL of p_k
9:       P(c_j | c_i) = P(c_j | c_i) + (1 / |P_out|) · P(c_j | p_k) · P(c_i | p)
10:    else P_in.put(url)  // P_in is the set of URLs which have hyperlinks to p

Two main factors influence the complexity of Algorithm 1: the number of web pages and the number of hyperlinks on each web page. Let |S| be the number of web pages and |P_out| the average number of hyperlinks per page. The time complexity of steps 1-4 is O(|S|). Steps 5-10 calculate the link relationships, and their time complexity is O(|S| · |P_out|). Thus, the total time complexity of Algorithm 1 is O(|S| + |S| · |P_out|) = O(|S| · |P_out|).
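A minimal executable reading of Algorithm 1 follows, under assumptions about the page representation (each page as a dict holding its url, assigned class cls, classifier probability p_c = P(cls | p), and outgoing links out); INCC is kept as a sparse dict of edge weights, and the normalization term of Eq. (2) is precomputed per class. This is a sketch, not the authors' implementation.

```python
from collections import defaultdict

def initialize_index(pages):
    ht = {p["url"]: p for p in pages}             # HT: url -> classified page (steps 1-4)
    norm = defaultdict(float)                     # sum_{p in c} P(c|p), per class
    for p in pages:
        norm[p["cls"]] += p["p_c"]
    incc = defaultdict(float)                     # (c_i, c_j) -> P(c_j | c_i)
    pending = defaultdict(list)                   # P_in lists for targets not yet indexed
    for p in pages:                               # steps 5-10
        for u in p["out"]:
            if u in ht:
                pk = ht[u]                        # link target p_k
                p_cj_pk = pk["p_c"] / (1.0 + norm[pk["cls"]])   # Eq. (2)
                incc[(p["cls"], pk["cls"])] += (1.0 / len(p["out"])) * p_cj_pk * p["p_c"]
            else:
                pending[u].append(p["url"])       # remember the dangling link
    return incc, ht, norm, pending
```

The two passes mirror the complexity analysis above: the class pass is O(|S|) and the link pass is O(|S| · |P_out|).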

Algorithm 1 completes the initial construction of an index model for a given set of web pages, but it cannot update the model in real time as the web changes. Considering the dynamic changes of web pages on the Internet, we further give an index incremental updating algorithm for updating an existing index model.

Algorithm 2: Index Incremental Updating Algorithm
Input: a set of web page classes C, a web page p, index model matrix INCC, hash table HT.
Output: updated index model matrix INCC and hash table HT.
1: update 1 + Σ_{p_k ∈ c} P(c | p_k)  // p belongs to web page class c
2: HT.put(url, c)  // url is the URL of p
3: for each c_i ∈ C
4:   P(c_i | c) = P(c_i | c) · (1 + Σ_{p_k ∈ c} P(c | p_k) − P(c | p)) / (1 + Σ_{p_k ∈ c} P(c | p_k))
5:   P(c | c_i) = P(c | c_i) · (1 + Σ_{p_k ∈ c} P(c | p_k) − P(c | p)) / (1 + Σ_{p_k ∈ c} P(c | p_k))
6: for each u ∈ P_in  // P_in is the set of URLs which have hyperlinks to p
7:   if HT.contains(u)
8:     calculate P(c_j | p_k) and P(c | p)  // u is the URL of p_k
9:     P(c | c_j) = P(c | c_j) + (1 / |P_in|) · P(c_j | p_k) · P(c | p)
10: for each u ∈ P_out  // P_out is the set of hyperlinks on p
11:   if HT.contains(u)
12:     calculate P(c_j | p_k) and P(c | p)  // u is the URL of p_k
13:     P(c_j | c) = P(c_j | c) + (1 / |P_out|) · P(c_j | p_k) · P(c | p)
14:   else P_in.put(url)  // add the URL of p to the P_in set of u

The main factors affecting the complexity of Algorithm 2 are the number of web page classes, the number of hyperlinks on web page p, and the number of web pages that have hyperlinks pointing to p. The time complexity of steps 3-5 is O(|C|), and that of steps 6-14 is O(|P_in| + |P_out|). Thus, the total time complexity of Algorithm 2 is O(|C| + |P_in| + |P_out|).
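Continuing the data layout of the previous sketch, here is one possible reading of Algorithm 2 in code; the rescaling in steps 3-5 is interpreted as renormalizing existing edges of the new page's class c, and previously dangling links are resolved through the pending table. This is an interpretation of the pseudocode under stated assumptions, not a verified reference implementation.

```python
def update_index(page, incc, ht, norm, pending, classes):
    c, p_c, url = page["cls"], page["p_c"], page["url"]
    old, new = 1.0 + norm[c], 1.0 + norm[c] + p_c
    norm[c] += p_c                                # step 1: grow class-c normalizer
    ht[url] = page                                # step 2
    for ci in classes:                            # steps 3-5: rescale edges touching c
        incc[(ci, c)] *= old / new
        incc[(c, ci)] *= old / new
    p_in = pending.pop(url, [])                   # steps 6-9: resolve incoming links
    for src in p_in:
        pk = ht[src]
        incc[(pk["cls"], c)] += (1.0 / len(p_in)) * pk["p_c"] * (p_c / new)
    for u in page["out"]:                         # steps 10-14: outgoing links
        if u in ht:
            pk = ht[u]
            incc[(c, pk["cls"])] += (1.0 / len(page["out"])) * \
                (pk["p_c"] / (1.0 + norm[pk["cls"]])) * p_c
        else:
            pending[u].append(url)
```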

4 Experimental Evaluation and Analysis

4.1 Experimental environment

The experiments in this paper run on 9 Sugon servers, each with 4 cores and 8 GB of memory. The experimental set of web pages contains one million Chinese web pages crawled from the Internet. Eight servers are used to classify the web pages with the classification tool Mahout [15] on a Hadoop distributed system, and one server is used to run the proposed construction algorithms for the index model.

4.2 Experimental results and analysis

In this paper, the result of web page classification directly affects the construction of the index model, so the accuracy of the classification algorithm must be guaranteed. We use 282 web page classes and their sample web pages from the open directory project dmoz [16] as the training data, and manually select some relevant web pages as a supplement. There are about 500 sample web pages in each web page class.

We randomly select 80% of the web pages in each sample class to train the classifier and use the remaining 20% as the test set. The experimental results show that the classification accuracy is 71.7%.

In order to evaluate and compare the efficiency of the two proposed algorithms, the crawled one million web pages are classified into the 282 web page classes and then used to construct an index model with each algorithm separately, as shown in Fig. 2. In particular, the updating algorithm builds its model by adding these web pages incrementally to an initially empty model. The comparative running times are shown in Fig. 2(a).

Fig. 2: Time comparisons of the two construction algorithms. (a) Building an index model: running time in hours versus the number of web pages (0.1 to 1 million) for the initialization and updating algorithms; (b) updating an index model: running time in hours of the two algorithms.

As can be seen from Fig. 2(a), the time efficiency of the index initialization algorithm is better than that of the index incremental updating algorithm. To build an index model with |S| pages, the index incremental updating algorithm requires O(|S| · |C| + |S| · (|P_in| + |P_out|)) time, while the initialization algorithm requires O(|S| · |P_out|) time. As mentioned previously, the time complexity of both algorithms increases linearly with the number of web pages. The experimental set of web pages is not self-contained from the perspective of the hyperlinks it contains, i.e., many web pages linked to from this set are not within it, which explains the shape of the growth curves in Fig. 2(a). When the number of web pages is sufficiently large, the time efficiency of the two algorithms will stabilize and tend to grow linearly.

To evaluate the efficiency of the two proposed algorithms when updating an existing index model, we update an index model built from half a million web pages with another half million web pages. The comparative running times are shown in Fig. 2(b). Compared to the index incremental updating algorithm, the index initialization algorithm builds an index model in a relatively short period of time. But as shown in Fig. 2(b), to update an index model, the initialization algorithm has to rebuild the whole model, while the updating algorithm only needs to add the new pages to the existing model. Therefore, the two algorithms apply to different situations: the initialization algorithm is used for the initialization of an index model, and the incremental updating algorithm is used to update an existing one. According to the annual statistical reports of the China Internet Network Information Center (CNNIC), the number of web pages grows steadily. By increasing the number of devices, the updating algorithm can keep pace with the update speed of web pages on the Internet, which illustrates its feasibility.

Further, we analyzed the structure of the index model constructed above. As the edge weights are generally small, each edge weight is multiplied by 10^8 for presentation. The resulting distribution of edge weights is shown in Fig. 3. Since the test data contain only one million web pages, many web pages linked to from this set are not within it, so the edge weights are generally small; edges whose weight is less than 1 may be caused by the heterogeneity of the Internet.

Fig. 3: Distribution of the index model edge weights: the percentage of all edges falling into each weight bucket (<1, 1-10, 10-20, ..., 90-100, >100).

The weight distribution shows that the index model can distinguish the strength of the link relationships among web page classes. Therefore, it can be said that the index model constructed by the algorithm satisfies the basic conditions of validity.

4.3 Exhibition example

The index model we have built is too large to show in full. In this section, we select 4 web page classes and their edges from the index model built above as an exhibition example. As shown in Fig. 4, an index model is a directed graph. Each outgoing edge of a web page class represents the probability of browsing from that class to the class it points to. For example, in Fig. 4, the directed edge from Estate to House Renting has weight 22.1. That is, if users are browsing web pages in Estate now, the probability that they will visit web pages in House Renting in the next step is 22.1%. Based on the index model, we can know which classes are more closely related to a specified class. Compared with House Renting, Estate is more closely related to Retail Market; and compared with Retail Market, House Renting is more closely related to Estate. As the web pages in those classes change, the corresponding edge values change as well. When most web pages on the Internet have been divided into their corresponding classes, the edge values of the index model tend to be stable.

Fig. 4: An index model instance with four classes (Shopping, Estate, Retail Market, House Renting) and weighted directed edges among them.
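As a usage illustration, a hypothetical query over the constructed edge matrix is sketched below: given the class a user is currently browsing, list the most strongly linked classes. Only the Estate → House Renting weight of 22.1 is confirmed by the text; the other edges and their pairings are made up for the example.

```python
def most_related(incc, cls, k=3):
    # Rank the classes reachable from `cls` by outgoing edge weight.
    edges = [(cj, w) for (ci, cj), w in incc.items() if ci == cls]
    return sorted(edges, key=lambda e: e[1], reverse=True)[:k]

incc = {("Estate", "House Renting"): 22.1,        # edge confirmed in the text
        ("Estate", "Retail Market"): 29.7,        # illustrative pairing of Fig. 4 values
        ("Shopping", "Estate"): 1.6}              # illustrative pairing of Fig. 4 values
print(most_related(incc, "Estate"))
# [('Retail Market', 29.7), ('House Renting', 22.1)]
```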

5 Conclusion

This paper presents an index model based on web page classification, and designs its initial construction algorithm through hyperlink analysis. Moreover, in order to reflect the dynamic changes of web pages on the Internet, we give an index incremental updating algorithm; the time complexity of both algorithms grows only linearly. Experimental results show that the index incremental updating algorithm can dynamically update an existing index model and can keep pace with the update speed of web pages on the Internet. The index model can organize and manage web pages, and the link relationships among classes can reflect real business connections. However, this article only gives a preliminary construction algorithm. We will further study how to make better use of an index model to serve users and how to build an index model based on other web page classification methods such as KNN.

References

[1] F. S. Hong, and H. Kun, Research on search engine technology and service and its enlightenment, Journal of the China Society for Scientific and Technical Information, 19(6) (2002) 628-636.
[2] U. Shardanand, and P. Maes, Social information filtering: algorithms for automating "word of mouth", in: Proc. Human Factors in Computing Systems, ACM Press, New York, 1995, pp. 210-217.
[3] W. Hill, L. Stead, M. Rosenstein, and G. Furnas, Recommending and evaluating choices in a virtual community of use, in: Proc. Human Factors in Computing Systems, ACM Press, New York, 1995, pp. 194-201.
[4] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl, GroupLens: an open architecture for collaborative filtering of netnews, in: Proc. Computer Supported Cooperative Work Conf., ACM Press, New York, 1994, pp. 175-186.
[5] E. Rich, User modeling via stereotypes, Cognitive Science, 3(4) (1979) 329-354.
[6] R. Baeza-Yates, and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, New York, 1999.
[7] B. P. S. Murthi, and S. Sarkar, The role of the management sciences in research on personalization, Management Science, 49(10) (2003) 1344-1362.
[8] G. Adomavicius, and A. Tuzhilin, Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions, IEEE Trans. on Knowledge and Data Engineering, 17(6) (2005) 734-749.
[9] S. C. Jie, and G. Yi, A statistical approach for content extraction from web page, Journal of Chinese Information Processing, 18(5) (2004) 19-21.
[10] H. Y. Biao, The comparative study of Chinese word segmentation of Lucene interface, Science & Technology Information, 12 (2012) 246-247.
[11] Y. Yang, and J. O. Pedersen, A comparative study on feature selection in text categorization, in: Proc. the 14th International Conference on Machine Learning (ICML '97), 1997, pp. 412-420.
[12] L. Page, S. Brin, and R. Motwani, The PageRank citation ranking: bringing order to the web, http://www.db.stanford.edu/~backub/pageranksub, 1998.
[13] A. McCallum, and K. Nigam, A comparison of event models for Naive Bayes text classification, in: Proc. AAAI/ICML-98 Workshop on Learning for Text Categorization, AAAI Press, Menlo Park, CA, 1998, pp. 41-48.
[14] T. Cover, and P. Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory, 13(1) (1967) 21-27.
[15] S. Owen, R. Anil, T. Dunning, and E. Friedman, Mahout in Action, Manning Publications, Shelter Island, 2010.
[16] M. Grobelnik, J. Brank, D. Mladenić, B. Novak, and B. Fortuna, Using dmoz for constructing ontology from data stream, in: Proc. the 28th International Conference on Information Technology Interfaces, Dubrovnik, 2006, pp. 439-444.