INTERNATIONAL JOURNAL OF ADVANCES IN COMPUTING AND INFORMATION TECHNOLOGY An International online open access peer reviewed journal

Research Article
ISSN 2277 9140

Web page categorization based on characteristics of web page
Assistant Professor, CSE Department, Sharda University, Greater Noida, India
khushbootaneja@gmail.com
doi: 10.6088/ijacit.23.10006

ABSTRACT

The Internet is a vast source of information accessed by a large number of people every day. Nowadays the web comprises trillions of pages, and every day a tremendous number of requests are made to add more pages to the World Wide Web (WWW). The ability to create information has far exceeded the ability to manage it. This paper proposes an approach that categorizes web pages automatically on the basis of their characteristics, using a neural-network-based single discrete perceptron training algorithm that is easy to implement and use and categorizes web pages with high accuracy. Two major categories of web pages are considered: newspaper and education. The approach consists of three steps. In the first step, features are extracted automatically by analyzing the source web pages. The second step covers the implementation and training of the algorithm. The third step categorizes the source web pages into one of the two categories.

Keywords: Web page categorization; neural network; single discrete perceptron training algorithm

1. Introduction

The number of web pages on the WWW is growing exponentially. The data available on the web takes the form of text, images, audio, video, graphics and many other formats. The dynamic nature of the web and the large-scale explosion in the number of web pages threaten the efficiency of information retrieval tasks. Categorization is an intellectual task that is important, indeed essential, for organizing and understanding web content for different applications.
Web page categorization, also known as web page classification, is the process of assigning a web page to one or more predefined categories. Categorization is often treated as a supervised learning problem in which a labeled data set is used to train a classifier, which is then applied to classify and label the test data. The approach proposed in this paper uses characteristics of web pages, namely the number of links, the number of images and the number of words (the amount of text), to categorize each page into one of two categories. The source web pages come from two categories or domains: Newspaper and Education. An analysis of pages from newspaper sites and education sites showed that newspaper web pages contain more links, images and words than education web pages. This difference in characteristics is used for categorization. Figures 1 and 2 show a newspaper web page and an education web page respectively. The difference in the number of links, images and words can be seen clearly in the figures.

Received on September 2013, Published on October 2013
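The paper does not state how the three characteristics are counted, so the following is only a minimal sketch of one possible way to do it with Python's standard html.parser module. The names PageFeatureCounter and extract_features are hypothetical, and the counting rules (only anchors with an href count as links, text inside script and style elements is ignored) are assumptions rather than the author's method.

```python
from html.parser import HTMLParser


class PageFeatureCounter(HTMLParser):
    """Counts links, images and visible words in one HTML document."""

    # Assumption: text inside these elements is not part of the page content.
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.links = 0
        self.images = 0
        self.words = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Assumption: only <a> elements that carry an href count as links.
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.links += 1
        elif tag == "img":
            self.images += 1
        elif tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Words are whitespace-separated tokens outside script/style.
        if self._skip_depth == 0:
            self.words += len(data.split())


def extract_features(html):
    """Return (number of links, number of images, number of words)."""
    parser = PageFeatureCounter()
    parser.feed(html)
    parser.close()
    return parser.links, parser.images, parser.words
```

Calling extract_features on the HTML source of a page returns the three raw counts, which would then still need to be mapped to the network's input range.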

Figure 1: Newspaper Page

Figure 2: Education Web Page

2. Proposed approach

A considerable amount of research has been carried out in the field of web page categorization, and a number of text categorization algorithms have been applied to the problem. These include the K-Nearest Neighbor approach [4], Bayesian probabilistic models, inductive rule learning, decision trees, and support vector machines. The proposed approach categorizes web pages on the basis of their characteristics. The data set for the categorization is collected from the Yahoo! web directory and from different education and newspaper sites. The approach consists of the following steps:

1. Collect and analyze the data set.
2. Extract features such as the number of links, images and words from each page automatically.
3. Fix values for the input node of the network.
4. Train the algorithm.
5. Categorize the web pages.
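Steps 3 to 5 above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the author's implementation: the bin widths (100 links, 30 images, 1000 words) follow Tables 1 to 3 and the weight update follows equation (1), but the clamping of out-of-range counts, the constant bias input, the label encoding (+1 for newspaper, -1 for education), the zero initial weights and the toy page counts are all made up for the example. The function names are hypothetical.

```python
def to_input(count, width):
    """Map a raw count to an integer in [-2, 2] using equal-width bins.

    Assumption: counts below 1 or above the last bin are clamped.
    """
    index = (max(count, 1) - 1) // width
    return min(index, 4) - 2


def encode(links, images, words):
    # Bin widths follow Tables 1-3; the trailing 1 is an assumed bias input.
    return [to_input(links, 100), to_input(images, 30), to_input(words, 1000), 1]


def output(w, y):
    """Discrete (sign) activation: +1 or -1."""
    net = sum(wi * yi for wi, yi in zip(w, y))
    return 1 if net >= 0 else -1


def train(samples, c=1.0, max_epochs=100):
    """Apply w <- w + 0.5 * c * (d - o) * y until an epoch has no errors."""
    w = [0.0] * 4  # assumption: weights start at zero
    for _ in range(max_epochs):
        errors = 0
        for y, d in samples:
            o = output(w, y)
            if o != d:
                errors += 1
                w = [wi + 0.5 * c * (d - o) * yi for wi, yi in zip(w, y)]
        if errors == 0:
            break
    return w


# Toy (links, images, words, label) rows invented for illustration only.
# Assumed labels: +1 = newspaper, -1 = education.
toy_pages = [
    (450, 130, 4500, 1),
    (320, 100, 3200, 1),
    (80, 20, 900, -1),
    (150, 40, 1500, -1),
]
toy_samples = [(encode(l, i, n), d) for l, i, n, d in toy_pages]
weights = train(toy_samples)

# Categorize an unseen page from its raw counts.
label = output(weights, encode(400, 95, 3500))
print("newspaper" if label == 1 else "education")
```

On this toy data the loop stops once an epoch produces no misclassification, which is guaranteed to happen only when the two classes are linearly separable; the max_epochs cap is there for the case where they are not.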

3. Implementation

The stepwise implementation of web page categorization based on characteristics of the web page is explained below.

A. Feature Extraction

First, the features of the source web pages are extracted automatically by analyzing pages from the different source websites. The main features extracted are the number of links, the number of images and the amount of text present on each page. Analysis of these features shows that newspaper web pages contain more links, images and words than education web pages, which helps in differentiating the two types of page. From the values obtained for the extracted features, the mean and standard deviation are calculated, and each value is mapped to a value in the range [-2, 2] as shown in Tables 1 to 3.

Table 1: Fixing input values for number of links

No. of Links    Input value
1-100           -2
101-200         -1
201-300          0
301-400          1
401-500          2

Table 2: Fixing input values for number of images

No. of Images   Input value
1-30            -2
31-60           -1
61-90            0
91-120           1
121-150          2

Table 3: Fixing input values for number of words

No. of Words    Input value
1-1000          -2
1001-2000       -1
2001-3000        0
3001-4000        1
4001-5000        2

B. Training of Algorithm

The discrete perceptron training algorithm is used in the proposed approach. It is based on the concept of neural networks. During the training phase, the weights are modified repeatedly according to equation (1) until the actual output equals the desired output.

$$w \leftarrow w + 0.5\,c\,(d - o)\,y \qquad (1)$$

where c is the learning constant, d is the desired output, o is the actual output, y is the input vector, and w on the left-hand side is the modified weight.

C. Categorization of Web Pages

Once the weights are fixed, training is complete. The testing data set can then be applied to the categorizer to categorize the web pages.

4. Testing and Results

The system is tested with the help of a good number of training examples. The training data set should reflect the real-world situation, since the true performance of any system can be evaluated only on the basis of a high-quality training data set. The training data set is therefore collected from 120 home pages of different education and newspaper websites. The data obtained from the web pages is applied to the categorizer in the form of an input vector whose values lie in the range [-2, 2]. Out of 120 web pages with known categories, 108 are categorized correctly. The experimental and testing results are shown in Table 4.

Table 4: Experimental Results

Category     Number of correctly categorized pages   Number of wrongly categorized pages
Education    58                                      2
Newspaper    50                                      10
Total        108                                     12

According to these results, the accuracy is 90 % (108 of 120 pages).

5. Conclusion

Automated categorization is useful for efficient retrieval of web pages, focused crawling and maintenance of web directories. This paper described the categorization of web pages based on characteristics such as the number of links, the number of images and the number of words present on them, using the single discrete perceptron training algorithm. The algorithm is used as a binary categorizer, which categorized education web pages and newspaper web pages into the appropriate category with an accuracy of 90 %.

5.1 Future Scope

The proposed approach can be extended to the categorization of blog and non-blog web pages by analyzing highly discriminative features so that the pages are categorized accurately. Similarly, it can be applied to distinguish social networking sites from non-social networking sites.

6. References

1. J. M. Zurada, Introduction to Artificial Neural Systems, Chapter 3, 93-132.
2. J. M. Pierre, 2000, Practical Issues for Automated Categorization of Web Pages,
3. X. Qi and B. D. Davison, 2009, Web page classification: Features and algorithms, ACM Computing Surveys, 41(2).
4. O. Kwon and J. Lee, 2000, Web page classification based on k-nearest neighbor approach, IRAL, Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages.
5. A. McCallum and K. Nigam, 1998, A Comparison of Event Models for Naive Bayes Text Classification, In AAAI Workshop on Learning for Text Categorization.
6. D. Koller and M. Sahami, 1997, Hierarchically classifying documents using very few words, Proceedings of the Fourteenth International Conference on Machine Learning (ICML), 170-178.
7. D. D. Lewis and M. Ringuette, 1994, A comparison of two learning algorithms for text categorization, Third Annual Symposium on Document Analysis and Information Retrieval (SDAIR), 81-93.
8. C. Apte and F. Damerau, 1994, Automated Learning of Decision Rules for Text Categorization, ACM Transactions on Information Systems, Vol. 12, No. 3, 233-251.
9. S. T. Dumais et al., 1998, Inductive Learning Algorithms and Representations for Text Categorization, Proceedings of the Seventh International Conference on Information and Knowledge Management (CIKM), 148-155.
10. Yahoo!, http://www.yahoo.com. Accessed 25 April 2012.