Journal of Computational Information Systems 10:2 (2014) 655-664
Available at http://www.jofcis.com

Construction Algorithms for Index Model Based on Web Page Classification

Yangjie ZHANG 1,2,*, Chungang YAN 1,2, Pengwei WANG 1,2, Haichun SUN 1,2

1 Department of Computer Science and Technology, Tongji University, Shanghai 201804, China
2 The Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai 201804, China

Abstract

Web pages on the Internet are massive, diverse, heterogeneous and redundant, and organizing and managing them effectively is an urgent problem. In this paper, we propose a method to index web pages and build an index model based on web page classification and hyperlink analysis. First, an initialization algorithm is given to construct an index model for an initial set of web pages. Then, considering the dynamics of web pages on the Internet, we propose an incremental updating algorithm which can update an existing index model incrementally. Theoretical analysis shows that the complexity of the proposed algorithms grows linearly with the scale of web pages and their growth. The experimental results show that the initialization algorithm can construct an index model for a fixed set of web pages relatively quickly, while the incremental updating algorithm can keep pace with the rate at which web pages change on the Internet. Thus, the proposed algorithms are feasible and effective. The constructed index model can support diversified information service systems, enabling them to make better use of web resources and provide more valuable services to users.

Keywords: Web Page Classification; Index Model; Hyperlinks; Naive Bayes

Project supported by the National Basic Research Program of China (973) (No. 2010CB328101), the International Science & Technology Cooperation Program of China (No. 2013DFM10100), the National Natural Science Foundation of China (No. 61173016), and the Shanghai Science & Technology Research Plan (No. 11231202804).
* Corresponding author. Email address: zyj177484@126.com (Yangjie ZHANG).
1553-9105 / Copyright 2014 Binary Information Press. DOI: 10.12733/jcis9017. January 15, 2014.

1 Introduction

With the rapid development of the Internet, network resources have become more and more abundant, and various kinds of information service systems have emerged to facilitate people's daily lives. However, owing to the openness, dynamics and complexity of the Internet, web pages on the Internet are massive, diverse, heterogeneous and redundant, and existing methods do not provide a proper way to organize and manage web pages for these systems. As a popular information service system, the search engine solves the problem of acquiring web pages rapidly by its main
technologies, including the inverted index, link analysis, and distributed storage [1]. A search engine helps users find relevant web pages through keyword matching. However, the rapid growth of web pages reduces search efficiency and increases the redundancy and inaccuracy of the returned results, and keyword matching alone cannot meet the personalized demands of information retrieval services. The recommendation system is another popular representative. Since research on collaborative filtering appeared in the mid-1990s, diversified recommendation algorithms have been proposed and used in many research fields [2-4], including cognitive science [5], information retrieval [6] and management science [7]. Web page recommendation systems recommend similar web pages to users having common preferences, and improve the intelligence of Internet information service systems. In general, there are two types of recommender systems in this field: one can only find web pages that are similar to the customer's interests by content-based filtering, while the other needs massive user comments. As the number of web pages increases, the accuracy and speed of collaborative filtering decrease. Due to the lack of an effective method for organizing and managing web pages, the performance of web-based information systems is restricted [8].

To this end, this paper presents an index model for web pages, which can organize web pages and discover the relationships among them. We assume that hyperlinks among web pages reflect certain business relationships in the real world. Based on web page classification, two construction algorithms for the index model are designed and implemented: one is the initialization algorithm for a given set of web pages, and the other is an updating algorithm for an existing index model and an incremental set of web pages. The index model is then completed by analyzing the hyperlinks among web pages after web page classification. Experimental results show that the proposed algorithms are feasible and effective. First, to validate its effectiveness, an example index model is constructed from one million web pages crawled from the Internet, which shows that the generated relationships among web page classes can reflect real-world business associations. Second, for a given set of web pages, the initialization algorithm can complete the construction of the index model within a limited time. Last, the incremental updating algorithm can keep up with the actual growth of web pages.

2 Index Model Framework

In this paper, c denotes a web page class; C denotes a set of web page classes; p denotes a web page; D denotes a sample set of web pages; t denotes a feature, which is a word.

Definition 1. (Index Model) An index model is defined as a weighted directed graph G = (V, E), where

a) V = {c_i | c_i ∈ C}, and C is the set of all web page classes in the index model;

b) E = {⟨c_i, c_j⟩ | c_i, c_j ∈ V and c_i ≠ c_j};

c) Let w⟨c_i, c_j⟩ be the weight of ⟨c_i, c_j⟩; then

$$w\langle c_i, c_j\rangle = \sum_{p_t \in c_j,\; p_k \in c_i,\; u_k = url_t} \big(W_u \cdot P(c_j \mid p_t) \cdot P(c_i \mid p_k)\big), \qquad W_u = \frac{1}{n},$$

where u_k is a hyperlink on web page p_k, url_t is the Uniform Resource Locator (URL) of web page p_t, n is the number of hyperlinks on web page p_k, and P(c_j | p_t) is the probability that web page p_t belongs to web page class c_j. A minimal data-structure sketch of this definition is given below.
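To make Definition 1 concrete, the following is a minimal sketch, in Python, of the index model as a weighted directed graph over web page classes. The class and method names are illustrative assumptions made for this sketch, not the authors' implementation.

```python
from collections import defaultdict

class IndexModel:
    """A minimal sketch of the weighted directed graph of Definition 1: nodes are
    web page classes and each edge weight accumulates terms of the form
    W_u * P(c_j | p_t) * P(c_i | p_k)."""

    def __init__(self, classes):
        self.classes = set(classes)            # V: the set of web page classes
        self.weights = defaultdict(float)      # E: (c_i, c_j) -> w<c_i, c_j>

    def add_link_contribution(self, c_i, c_j, w_u, p_cj_pt, p_ci_pk):
        """Add one term of the sum in Definition 1(c) for a hyperlink from a page
        of class c_i to a page of class c_j."""
        if c_i != c_j:                         # Definition 1(b) excludes self-loops
            self.weights[(c_i, c_j)] += w_u * p_cj_pt * p_ci_pk

    def weight(self, c_i, c_j):
        """Return w<c_i, c_j>, or 0 if there is no such edge yet."""
        return self.weights[(c_i, c_j)] if (c_i, c_j) in self.weights else 0.0
```

Under this sketch, each hyperlink from a page classified into c_i to a page classified into c_j, with hyperlink weight W_u = 1/n, contributes one call to add_link_contribution.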
The entire construction process of the index model is as follows (a minimal end-to-end skeleton is sketched after Fig. 1):

1. Web page pretreatment. (1) Extract the content enclosed by selected HTML tags from a web page. (2) Extract the content text. (3) Combine the contents extracted in steps (1) and (2). (4) Segment the combined content. (5) Remove stop words. (6) Select features to represent each web page. (7) Extract the hyperlinks in each web page.

2. Training the classifier and classifying web pages. (1) Generate a classifier by learning from a sample set of web pages. (2) Classify all preprocessed web pages.

3. Computing link relationships among web page classes. (1) Use the index initialization algorithm to complete the initialization of an index model. (2) Use the index incremental updating algorithm to update an existing index model.

The building process of the index model is shown in Fig. 1.

[Figure 1: flow from a sample set of web pages, an initial set of web pages, and updating web pages, through web page pretreatment (extract tags, extract body contents, segmentation, remove stop words, select features, extract hyperlinks), classifier training and web page classification, to the index initialization and incremental updating algorithms that compute link relationships among web page classes.]

Fig. 1: Building process of index model based on web page classification
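The skeleton below sketches how the three stages listed above could be wired together. The helper functions are hypothetical placeholders standing in for the steps of Sections 3.1-3.3; they are assumptions made for illustration, not functions from the paper.

```python
from typing import Dict, List

# Hypothetical helpers standing in for Sections 3.1-3.3; each would be filled in
# with the concrete steps described later in the paper.
def pretreat(html: str) -> dict:
    """Extract tagged/body text, segment it, drop stop words, select features,
    and collect hyperlinks (Section 3.1)."""
    raise NotImplementedError

def train_classifier(samples: Dict[str, List[dict]]):
    """Train a Naive Bayes classifier from preprocessed sample pages (Section 3.2)."""
    raise NotImplementedError

def classify_page(classifier, page: dict) -> dict:
    """Attach the most probable class and its probability to a page (Section 3.2)."""
    raise NotImplementedError

def build_index(classified_pages: List[dict], classes: List[str]):
    """Run the index initialization algorithm over classified pages (Section 3.3)."""
    raise NotImplementedError

def construct_index_model(sample_pages: Dict[str, List[str]], initial_pages: List[str]):
    """Wire together the three stages of Section 2."""
    samples = {c: [pretreat(h) for h in docs] for c, docs in sample_pages.items()}  # stage 1
    pages = [pretreat(h) for h in initial_pages]                                    # stage 1
    classifier = train_classifier(samples)                                          # stage 2
    classified = [classify_page(classifier, p) for p in pages]                      # stage 2
    return build_index(classified, classes=list(sample_pages))                      # stage 3
```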
3 Algorithms for Constructing Index Model

3.1 Web page preprocessing

A web page contains various HTML tags, and the contents within different tags make different contributions to its subject. In order to classify a web page and obtain relationships among web page classes, we first preprocess each web page, which includes extracting tagged and body contents, content segmentation, extracting hyperlinks, etc.

Step 1 Extract the contents enclosed in the Title, META, H1-H6 and anchor (a) tags of an HTML file.

Step 2 Extract the content text of the web page using a statistical approach [9].

Step 3 Content segmentation. Segment the contents extracted in Steps 1 and 2 using the Chinese word segmentation tool IKAnalyzer [10], and remove stop words.

Step 4 Weight each feature using TF*IDF [11].

Step 5 Extract the feature vector of each web page. Based on the values obtained in Step 4, select the top n feature items to constitute the feature vector (n is an experimental value). A feature vector of a web page is then expressed as p = (t_1, t_2, ..., t_n).

Step 6 Extract the hyperlinks of each web page. All hyperlinks extracted from a web page are denoted as a vector P_out = (u_1, u_2, ..., u_m), where u_i is a hyperlink on this web page.

Step 7 Weight each hyperlink. The weight of a hyperlink u is calculated as W_u = 1/m, where m is the total number of hyperlinks on this web page. The basic idea derives from the Random Surfer Model [12], which assumes that all hyperlinks on a web page are equally important.

3.2 Training classifier and web page classification

This paper uses the Naive Bayes method [13] to train the classifier and classify all of the web pages preprocessed above. Compared with KNN [14] and other methods for handling massive numbers of web pages, the Naive Bayes method is simple and offers good classification accuracy at a high classification speed.

Step 1 Generate a classifier by learning from a sample set of web pages. In this process, we calculate the probability that each feature item belongs to each web page class based on the sample pages. There are various methods for this calculation. McCallum and Nigam proposed a multinomial model [13] whose misclassification rate is lower than that of other models when the feature set is relatively large. In this paper, we use their multinomial model to calculate P(t_j | c), the probability of t_j belonging to c:

$$P(t_j \mid c) = \frac{1 + TF(t_j, c)}{|V| + \sum_{k=1}^{|V|} TF(t_k, c)}$$

where |V| is the total number of feature items in the sample class c, and TF(t_j, c) is the total frequency with which t_j appears in c.

Step 2 Classify all preprocessed web pages. Calculate the probabilities that p belongs to each web page class, and classify p into the class c_i with the greatest probability:

$$i = \arg\max_{c_j \in C} \{P(c_j \mid p)\}, \qquad P(c_j \mid p) = \frac{P(c_j) \prod_{i=1}^{n} P(t_i \mid c_j)}{P(p)}$$

where P(c_j) is the prior probability, whose value is the proportion of web pages in c_j within the sample set, P(p) is a constant, and n is the number of feature items in p. A minimal sketch of this training and classification procedure is given below.
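The following is a minimal sketch of the multinomial Naive Bayes training and classification steps described above. The data layout (pages as lists of terms grouped by class) and the function names are assumptions made for illustration, not the authors' code; the log-space scoring and the omission of the constant P(p) are standard implementation choices.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(samples):
    """samples: dict mapping class name -> list of pages, each page a list of terms.
    Returns (priors, cond_probs, vocabulary) following Section 3.2, Step 1."""
    vocab = {t for pages in samples.values() for page in pages for t in page}
    total_pages = sum(len(pages) for pages in samples.values())
    priors, cond = {}, defaultdict(dict)
    for c, pages in samples.items():
        priors[c] = len(pages) / total_pages                 # P(c): class proportion
        tf = Counter(t for page in pages for t in page)      # TF(t, c)
        denom = len(vocab) + sum(tf[t] for t in vocab)       # |V| + sum_k TF(t_k, c)
        for t in vocab:
            cond[c][t] = (1 + tf[t]) / denom                 # Laplace-smoothed P(t | c)
    return priors, cond, vocab

def classify(page_terms, priors, cond, vocab):
    """Assign the class with the largest posterior; scores are computed in log
    space to avoid underflow, and P(p) is dropped as it is constant across classes."""
    best_class, best_score = None, float("-inf")
    for c in priors:
        score = math.log(priors[c])
        for t in page_terms:
            if t in vocab:
                score += math.log(cond[c][t])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```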
3.3 Computing link relationships among web page classes

3.3.1 Web page representation

After the preprocessing and classification of a web page p, we obtain the elements of p listed in Table 1.

Table 1: The representation of a web page p

Element | Symbol
feature vector | (t_1, t_2, ..., t_n)
web page class | c
probability of p belonging to c | P(c | p)
the set of hyperlinks on p | P_out = (u_1, u_2, ..., u_m)
the set of URLs which have hyperlinks to p | P_in = (u_1, u_2, ..., u_o)

Here P_in is initialized to be empty before computing the link relationships.

3.3.2 Computing relationships among web page classes

A directed edge ⟨c_i, c_j⟩ in an index model represents the direct link relationship from c_i to c_j, and its weight w⟨c_i, c_j⟩ is calculated as a conditional probability P(c_j | c_i). It indicates the probability that users visit the web page class c_j after they have browsed c_i. In this article, we propose a way to build link relationships among web page classes based on the weights of hyperlinks and the probabilities of web pages belonging to classes. We suppose that a web page belongs to only one class. Then, P(c_j | c_i) is calculated as follows:

$$P(c_j \mid c_i) = \sum_{p_t \in c_j} P(c_j \mid p_t)\, P(p_t \mid c_i) \quad (1)$$

where P(c_j | p_t) and P(p_t | c_i) are computed as:

$$P(c_j \mid p_t) = \frac{P(c_j \mid p_t)}{1 + \sum_{p_k \in c_j} P(c_j \mid p_k)} \quad (2)$$

$$P(p_t \mid c_i) = \sum_{p_k \in c_i} P(p_t \mid p_k)\, P(c_i \mid p_k), \qquad P(p_t \mid p_k) = \sum_{u_k = url_t} W_u \quad (3)$$
In Eq. (2), P(c_j | p_t) is the probability of p_t belonging to c_j after normalization. In Eq. (3), P(p_t | c_i) is the probability of users visiting p_t after they have browsed c_i, which is calculated by summing over all hyperlinks from web page class c_i to page p_t; u_k represents a hyperlink on p_k, url_t represents the URL of p_t, and W_u is the weight of hyperlink u_k. To facilitate the calculation, Eq. (3) can be transformed into the following form:

$$P(p_t \mid c_i) = \sum_{p_k \in c_i,\; u_k = url_t} \big(W_u \cdot P(c_i \mid p_k)\big) \quad (4)$$

Then, P(c_j | c_i) is calculated as follows:

$$P(c_j \mid c_i) = \sum_{p_t \in c_j,\; p_k \in c_i,\; u_k = url_t} \big(W_u \cdot P(c_j \mid p_t) \cdot P(c_i \mid p_k)\big) \quad (5)$$

3.3.3 Two algorithms for constructing index model

For a given set of web pages, Algorithm 1 completes the initial construction of an index model as defined in Definition 1.

Algorithm 1 Index Initialization Algorithm
Input: a set of web page classes C, a set of web pages S.
Output: index model matrix INCC, hash table HT.
1: for each c ∈ C
2:   for each p ∈ c
3:     calculate 1 + Σ_{p ∈ c} P(c | p)
4:     HT.put(url, c)    // url is the URL of p
5: for each p ∈ S
6:   for each u ∈ P_out    // P_out is the set of hyperlinks on p
7:     if HT.contains(u)
8:       calculate P(c_j | p_k) and P(c_i | p)    // u is the URL of p_k
9:       P(c_j | c_i) = P(c_j | c_i) + (1 / |P_out|) · P(c_j | p_k) · P(c_i | p)
10:    else P_in.put(url)    // P_in is the set of URLs which have hyperlinks to p

There are two main factors that influence the complexity of Algorithm 1: the number of web pages and the number of hyperlinks on each web page. Assume that the number of web pages is |S| and the average number of hyperlinks per web page is |P_out|. The time complexity of Steps 1-4 is O(|S|). Steps 5-10 calculate the link relationships, and their time complexity is O(|S| · |P_out|). Thus, the total time complexity of Algorithm 1 is O(|S| + |S| · |P_out|) = O(|S| · |P_out|). Algorithm 1 can complete the initial construction of an index model from a given set of web pages, but it cannot keep the model up to date as web pages change. A minimal sketch of the initialization procedure is given below.
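The following Python sketch follows the structure of Algorithm 1 under the assumption that each classified page is a record with fields url, cls (its assigned class), prob (the classifier probability P(cls | p)), and out_urls (its hyperlinks); these field names and the record layout are assumptions made for this sketch, not the paper's data structures.

```python
from collections import defaultdict

def initialize_index(pages, classes):
    """A sketch of Algorithm 1 (Index Initialization)."""
    # Steps 1-4: per-class normalizers 1 + sum_{p in c} P(c|p), and a URL -> page table.
    norm = {c: 1.0 for c in classes}
    table = {}
    for p in pages:
        norm[p.cls] += p.prob
        table[p.url] = p

    weights = defaultdict(float)           # (c_i, c_j) -> P(c_j | c_i)
    pending_in_links = defaultdict(list)   # target URL -> URLs of pages linking to it

    # Steps 5-10: accumulate edge weights from hyperlinks between classified pages.
    for p in pages:
        if not p.out_urls:
            continue
        w_u = 1.0 / len(p.out_urls)        # W_u = 1/n (Random Surfer assumption)
        for u in p.out_urls:
            target = table.get(u)
            if target is not None:
                p_ci_p = p.prob / norm[p.cls]               # normalized P(c_i | p)
                p_cj_pk = target.prob / norm[target.cls]    # normalized P(c_j | p_k)
                weights[(p.cls, target.cls)] += w_u * p_cj_pk * p_ci_p
            else:
                pending_in_links[u].append(p.url)           # remember the link for later updates
    return weights, table, norm, pending_in_links
```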
Considering the dynamic changes of web pages on the Internet, we further give an index incremental updating algorithm for updating an existing index model.

Algorithm 2 Index Incremental Updating Algorithm
Input: a set of web page classes C, a web page p, index model matrix INCC, hash table HT.
Output: updated index model matrix INCC, hash table HT.
1: update 1 + Σ_{p_k ∈ c} P(c | p_k)    // p belongs to web page class c
2: HT.put(url, c)    // url is the URL of p
3: for each c_i ∈ C
4:   P(c_i | c) = P(c_i | c) · (1 + Σ_{p_k ∈ c} P(c | p_k) − P(c | p)) / (1 + Σ_{p_k ∈ c} P(c | p_k))
5:   P(c | c_i) = P(c | c_i) · (1 + Σ_{p_k ∈ c} P(c | p_k) − P(c | p)) / (1 + Σ_{p_k ∈ c} P(c | p_k))
6: for each u ∈ P_in    // P_in is the set of URLs which have hyperlinks to p
7:   if HT.contains(u)
8:     calculate P(c_j | p_k) and P(c | p)    // u is the URL of p_k
9:     P(c | c_j) = P(c | c_j) + (1 / |P_in|) · P(c_j | p_k) · P(c | p)
10: for each u ∈ P_out    // P_out is the set of hyperlinks on p
11:   if HT.contains(u)
12:     calculate P(c_j | p_k) and P(c | p)    // u is the URL of p_k
13:     P(c_j | c) = P(c_j | c) + (1 / |P_out|) · P(c_j | p_k) · P(c | p)
14:   else P_in.put(url)    // P_in is the set of URLs which have hyperlinks to u

The main factors affecting the complexity of Algorithm 2 are the number of web page classes, the number of hyperlinks on the web page p, and the number of web pages that have hyperlinks pointing to p. The time complexity of Steps 3-5 is O(|C|), and that of Steps 6-14 is O(|P_in| + |P_out|). Thus, the total time complexity of Algorithm 2 is O(|C| + |P_in| + |P_out|). A corresponding sketch of the incremental update, operating on the structures of the initialization sketch above, is given below.
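The sketch below mirrors Algorithm 2 for a single new page p, reusing the structures returned by the initialize_index sketch above (weights, table, norm, pending_in_links) and the same hypothetical page record. The rescaling of existing edges and the 1/|P_in| factor follow my reading of the paper's pseudocode and should be treated as an assumption.

```python
def incremental_update(p, classes, weights, table, norm, pending_in_links):
    """A sketch of Algorithm 2 (Index Incremental Updating) for one new page p."""
    c = p.cls
    old_norm = norm[c]
    norm[c] = old_norm + p.prob                  # step 1: update 1 + sum_{p_k in c} P(c|p_k)
    table[p.url] = p                             # step 2: register the page's URL

    # Steps 3-5: rescale existing edges touching class c to the new normalizer.
    scale = old_norm / norm[c]                   # (new - P(c|p)) / new == old / new
    for c_i in classes:
        if (c_i, c) in weights:
            weights[(c_i, c)] *= scale
        if (c, c_i) in weights:
            weights[(c, c_i)] *= scale

    # Steps 6-9: incoming links previously recorded from already-indexed pages.
    in_urls = pending_in_links.pop(p.url, [])
    for u in in_urls:
        src = table.get(u)
        if src is not None:
            weights[(src.cls, c)] += (1.0 / len(in_urls)) \
                * (src.prob / norm[src.cls]) * (p.prob / norm[c])

    # Steps 10-14: outgoing links of the new page.
    if p.out_urls:
        w_u = 1.0 / len(p.out_urls)
        for u in p.out_urls:
            target = table.get(u)
            if target is not None:
                weights[(c, target.cls)] += w_u \
                    * (target.prob / norm[target.cls]) * (p.prob / norm[c])
            else:
                pending_in_links[u].append(p.url)   # remember for when the target page arrives
```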
4 Experimental Evaluation and Analysis

4.1 Experimental environment

The experiments in this paper are based on 9 Sugon servers, each with 4 cores and 8 GB of memory. The experimental set of web pages contains one million Chinese web pages crawled from the Internet. Eight servers are used to classify web pages with the classification tool Mahout [15] on the Hadoop distributed platform, and one server is used to run the proposed index model construction algorithms.

4.2 Experimental results and analysis

The result of web page classification directly affects the construction of the index model, so the accuracy of the classification algorithm must be guaranteed. We use the 282 web page classes and their sample web pages in the open directory project dmoz [16] as the training data, and manually select some relevant web pages as a supplement. There are about 500 sample web pages in each web page class.

We randomly select 80% of the web pages in each sample class to train the classifier and use the remaining 20% as the test set. The experimental results show that the classification accuracy is 71.7%.

To evaluate and compare the efficiency of the two proposed algorithms, the one million crawled web pages are classified into the 282 web page classes and then used to construct an index model with each algorithm separately, as shown in Fig. 2. Specifically, the updating algorithm is used to update an initially empty model with these web pages. The comparative running times are shown in Fig. 2(a).

[Figure 2: (a) running time in hours versus the number of web pages (0.1 to 1 million) for the initialization and updating algorithms when building an index model; (b) running time of the two algorithms when updating an existing index model.]

Fig. 2: Time comparisons of two construction algorithms

As can be seen from Fig. 2(a), the time efficiency of the index initialization algorithm is better than that of the index incremental updating algorithm. To build an index model with |S| pages, the index incremental updating algorithm requires O(|S| · (|C| + |P_in| + |P_out|)) time, while the initialization algorithm requires O(|S| · |P_out|) time. The time complexity of both algorithms increases linearly with the number of web pages, as previously mentioned. Because the selected experimental set of web pages is not self-contained with respect to hyperlinks, i.e., many web pages linked to from this set are not within it, the running time grows as shown in Fig. 2(a). When the number of web pages is sufficiently large, the running time of the two algorithms stabilizes and tends to grow linearly.

To evaluate the efficiency of the two algorithms when updating an existing index model, we update an index model built from half a million web pages with another half million web pages. The comparative running times are shown in Fig. 2(b). Compared with the index incremental updating algorithm, the index initialization algorithm builds an index model in a relatively short time. However, as shown in Fig. 2(b), when an index model needs to be updated, the initialization algorithm has to rebuild the whole model, while the updating algorithm only needs to add new pages to the existing model. Therefore, the two algorithms apply to different situations: the initialization algorithm is used for the initialization of an index model, and the incremental updating algorithm is used to update an existing index model. According to the annual statistical report of the China Internet Network Information Center (CNNIC), the number of web pages grows steadily. By increasing the number of devices, the updating algorithm can keep pace with the update speed of web pages on the Internet, which illustrates its feasibility.

Further, we analyzed the structure of the index model constructed above. As the edge weights are generally small, each edge weight is multiplied by 1 × 10^8. The statistics of the edge weights are shown in Fig. 3. Since the test data contain only one million web pages and many of the pages linked to from this set are not within it, the edge weights are generally small, and edges whose weight is less than 1 may be caused by the heterogeneity of the Internet.
[Figure 3: histogram of the percentage of all edges whose scaled weight falls into the ranges <1, 1-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, 80-90, 90-100, and >100.]

Fig. 3: Statistics of the index model edge weights

The weight distribution shows that the index model can distinguish the strength of the link relationships among web page classes. Therefore, the index model constructed by the algorithm satisfies the basic conditions of validity.

4.3 Exhibition example

The index model we built is too large to show in full. In this section, we select 4 web page classes and their edges from the index model built above as an exhibition example. As shown in Fig. 4, an index model is a directed graph. Each outgoing edge of a web page class represents the probability of browsing from that class to the class it points to. For example, in Fig. 4, the directed edge from Estate to House Renting has weight 22.1; that is, if users are browsing web pages in Estate now, the probability that they visit web pages in House Renting in the next step is 22.1%. Based on the index model, we can know which classes are more closely related to a specified class. Compared with House Renting, Estate is more closely related to Retail Market; and compared with Retail Market, House Renting is more closely related to Estate. As the web pages in those classes change, the corresponding edge weights change too. When most web pages on the Internet have been assigned to their corresponding classes, the edge weights of the index model tend to become stable. A small sketch of querying such a model for the classes most related to a given class is given after Fig. 4.

[Figure 4: a directed graph over the four classes Shopping, Estate, Retail Market and House Renting with weighted edges among them, including the edge from Estate to House Renting with weight 22.1 discussed above.]

Fig. 4: An index model instance
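As a small illustration of how such a model could be queried, the sketch below ranks the classes most strongly linked from a given class, using the (c_i, c_j) -> weight mapping produced by the construction sketches above. Only the Estate to House Renting edge value of 22.1 is reported in the text; the example input therefore contains just that edge, and the function name is an illustrative assumption.

```python
def most_related_classes(weights, source_class, top_k=3):
    """Rank the classes most strongly linked from `source_class`."""
    out_edges = [(c_j, w) for (c_i, c_j), w in weights.items() if c_i == source_class]
    return sorted(out_edges, key=lambda e: e[1], reverse=True)[:top_k]

# Example using the single edge weight reported in Section 4.3.
example_weights = {("Estate", "House Renting"): 22.1}
print(most_related_classes(example_weights, "Estate"))   # [('House Renting', 22.1)]
```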
5 Conclusion

This paper presents an index model based on web page classification and designs its initial construction algorithm based on hyperlink analysis. Moreover, in order to reflect the dynamic changes of web pages on the Internet, we give an index incremental updating algorithm; the time complexity of both algorithms grows only linearly. Experimental results show that the index incremental updating algorithm can dynamically update an existing index model and can keep pace with the update speed of web pages on the Internet. The index model can organize and manage web pages, and the link relationships among classes can reflect real business connections. However, this article only gives a preliminary construction algorithm. We will further study how to make better use of an index model to serve users, and how to build an index model based on other web page classification methods such as KNN.

References

[1] F. S. Hong and H. Kun, Research on search engine technology and service and its enlightenment, Journal of the China Society for Scientific and Technical Information, 19(6) (2002) 628-636.
[2] U. Shardanand and P. Maes, Social information filtering: algorithms for automating "word of mouth", in: Proc. of the Conference on Human Factors in Computing Systems, ACM Press, New York, 1995, pp. 210-217.
[3] W. Hill, L. Stead, M. Rosenstein and G. Furnas, Recommending and evaluating choices in a virtual community of use, in: Proc. of the Conference on Human Factors in Computing Systems, ACM Press, New York, 1995, pp. 194-201.
[4] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom and J. Riedl, GroupLens: an open architecture for collaborative filtering of netnews, in: Proc. of the Computer Supported Cooperative Work Conference, ACM Press, New York, 1994, pp. 175-186.
[5] E. Rich, User modeling via stereotypes, Cognitive Science, 3(4) (1979) 329-354.
[6] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, New York, 1999.
[7] B. P. S. Murthi and S. Sarkar, The role of the management sciences in research on personalization, Management Science, 49(10) (2003) 1344-1362.
[8] G. Adomavicius and A. Tuzhilin, Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions, IEEE Transactions on Knowledge and Data Engineering, 17(6) (2005) 734-749.
[9] S. C. Jie and G. Yi, A statistical approach for content extraction from web page, Journal of Chinese Information Processing, 18(5) (2004) 19-21.
[10] H. Y. Biao, The comparative study of Chinese word segmentation of Lucene interface, Science & Technology Information, 12 (2012) 246-247.
[11] Y. Yang and J. O. Pedersen, A comparative study on feature selection in text categorization, in: Proc. of the 14th International Conference on Machine Learning (ICML '97), 1997, pp. 412-420.
[12] L. Page, S. Brin and R. Motwani, The PageRank citation ranking: bringing order to the web, http://www.db.stanford.edu/~backub/pageranksub, 1998.
[13] A. McCallum and K. Nigam, A comparison of event models for naive Bayes text classification, in: Proc. of the AAAI/ICML-98 Workshop on Learning for Text Categorization, AAAI Press, Menlo Park, CA, 1998, pp. 41-48.
[14] T. Cover and P. Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory, 13(1) (1967) 21-27.
[15] S. Owen, R. Anil, T. Dunning and E. Friedman, Mahout in Action, Manning Publications, Shelter Island, 2010.
[16] M. Grobelnik, J. Brank, D. Mladenić, B. Novak and B. Fortuna, Using dmoz for constructing ontology from data stream, in: Proc. of the 28th International Conference on Information Technology Interfaces, Dubrovnik, 2006, pp. 439-444.