INTERNATIONAL JOURNAL OF ADVANCES IN COMPUTING AND INFORMATION TECHNOLOGY An International online open access peer reviewed journal


Research Article, ISSN 2277-9140

Web page categorization based on characteristics of web page

Assistant Professor, CSE Department, Sharda University, Greater Noida, India
khushbootaneja@gmail.com
doi: 10.6088/ijacit.23.10006

ABSTRACT

The Internet is a primary source of a huge amount of information accessed by a large number of people every day. Nowadays the web comprises trillions of pages, and every day a tremendous number of requests are made to put more web pages on the World Wide Web (WWW). The ability to create information has far exceeded the ability to manage it. This paper proposes an approach to categorize web pages automatically on the basis of their characteristics, using a neural network based single discrete perceptron training algorithm that is easy to implement and use and that categorizes web pages with high accuracy. Two major categories of web pages are considered: newspaper and education. The approach consists of three steps. In the first step, features are extracted automatically by analyzing the source web pages. The second step covers the implementation and training of the algorithm. The third step categorizes the source web pages into one of the two categories.

Keywords: Web page categorization; neural network; single discrete perceptron training algorithm

1. Introduction

The number of web pages on the WWW is increasing exponentially day by day. The data available on the web takes the form of text, images, audio, video, graphics and many other media. The dynamic nature of the web and the large-scale explosion of web pages may threaten efficient information retrieval. Categorization is an intellectual task that is important, and indeed essential, for organizing and understanding web content for different applications.
Web page categorization, also known as web page classification, is the process of assigning a web page to one or more predefined category labels. Categorization is often treated as a supervised learning problem in which a labeled data set is used to train a classifier, which is then applied to classify and label the test data. In the approach proposed in this paper, characteristics of the web pages such as the number of links, the number of images and the number of words (the amount of text) are used to categorize each web page into one of two categories.

The source web pages belong to two categories or domains: Newspaper and Education. Analysis of pages from newspaper sites and education sites shows that newspaper web pages contain more links, images and words than education web pages. This difference in characteristics is used for categorization. Figures 1 and 2 show a newspaper web page and an education web page respectively; the difference in the number of links, images and words can be seen clearly in the figures.

Received on September 2013, Published on October 2013
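The three characteristics named above (links, images and words) can be counted from a page's HTML source. The paper does not say how its extractor works, so the following is only a minimal sketch using Python's standard `html.parser` module; the class name `PageFeatureCounter`, the function `extract_features`, and the choices to count only `<a>` tags that carry an `href` and to ignore text inside `<script>` and `<style>` are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class PageFeatureCounter(HTMLParser):
    """Counts links, images and words in one HTML page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = 0
        self.images = 0
        self.words = 0
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.links += 1          # assumption: only anchors with href count as links
        elif tag == "img":
            self.images += 1
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Assumption: a "word" is any whitespace-separated token of visible text.
        if self._skip_depth == 0:
            self.words += len(data.split())


def extract_features(html):
    """Return (number of links, number of images, number of words) for one page."""
    counter = PageFeatureCounter()
    counter.feed(html)
    counter.close()
    return counter.links, counter.images, counter.words
```

Calling `extract_features(page_source)` on the HTML of a home page yields the three raw counts that the later steps of the approach turn into network inputs.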

Figure 1: Newspaper Page

Figure 2: Education Web Page

2. Proposed approach

A considerable amount of research has been carried out in the field of web page categorization, and a number of text categorization algorithms have been applied to the problem. These include the K-Nearest Neighbor approach [4], Bayesian probabilistic models, inductive rule learning, decision trees, and support vector machines. The proposed approach categorizes web pages on the basis of their characteristics. The data set is collected from the Yahoo! web directory and from different education and newspaper sites. The approach consists of the following steps:

1. Collect and analyze the data set.
2. Extract features such as the number of links, images and words from each page automatically.
3. Fix values for the input nodes of the network.
4. Train the algorithm.
5. Categorize the web pages.
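Steps 3 to 5 above can be sketched in a few lines. The bucket widths (100 links, 30 images, 1000 words) come from Tables 1 to 3 in Section 3, and the weight update is Equation (1) from the same section. Everything else is an assumption made for illustration and is not stated in the paper: the bipolar class encoding (+1 for newspaper, -1 for education), the extra bias input fixed at 1, the zero initial weights, the clamping of counts that fall outside the tabulated ranges, and the made-up sample counts.

```python
# Bucket widths taken from Tables 1-3: links, images, words.
BUCKET_WIDTHS = (100, 30, 1000)


def to_input_level(count, bucket_width):
    """Map a raw count to one of the five input levels -2..2."""
    bucket = (count - 1) // bucket_width      # 0 for 1..width, 1 for the next range, ...
    bucket = max(0, min(4, bucket))           # clamp out-of-range counts (assumption)
    return bucket - 2


def page_to_input(links, images, words):
    """Build the augmented input vector y; the trailing 1 is an assumed bias input."""
    counts = (links, images, words)
    return [to_input_level(c, w) for c, w in zip(counts, BUCKET_WIDTHS)] + [1]


def output(w, y):
    """Discrete (bipolar) perceptron output: +1 or -1."""
    net = sum(wi * yi for wi, yi in zip(w, y))
    return 1 if net >= 0 else -1              # treating net == 0 as +1 is an assumption


def train_perceptron(samples, c=1.0, max_epochs=100):
    """samples: list of (y, d) pairs with d = +1 (newspaper) or -1 (education)."""
    w = [0.0] * len(samples[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for y, d in samples:
            o = output(w, y)
            if o != d:
                # Equation (1): w <- w + 0.5 * c * (d - o) * y
                w = [wi + 0.5 * c * (d - o) * yi for wi, yi in zip(w, y)]
                mistakes += 1
        if mistakes == 0:                     # actual output equals desired output everywhere
            break
    return w


# Made-up page counts, for illustration only (not the paper's data).
samples = [
    (page_to_input(450, 130, 4200), +1),
    (page_to_input(320, 95, 3100), +1),
    (page_to_input(40, 10, 600), -1),
    (page_to_input(150, 45, 900), -1),
]
weights = train_perceptron(samples)
```

Once training stops, a new page is categorized by calling `output(weights, page_to_input(links, images, words))` and reading +1 as newspaper and -1 as education under the encoding assumed here.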

3. Implementation

The step-wise implementation of web page categorization based on characteristics of web pages is explained below.

A. Feature Extraction

First, the features of the source web pages are extracted automatically by analyzing web pages from different source websites. The main features extracted are the number of links, the number of images and the amount of text present on each page. Analysis of these features shows that newspaper web pages contain more links, images and words than education web pages, which helps in differentiating the two types of web pages. After analyzing the values obtained for the extracted features, the mean and standard deviation are calculated and each value is mapped to a value in the range [-2, 2], as shown below.

Table 1: Fixing input values for number of links

No. of Links    Input value
1-100           -2
101-200         -1
201-300          0
301-400          1
401-500          2

Table 2: Fixing input values for number of images

No. of Images   Input value
1-30            -2
31-60           -1
61-90            0
91-120           1
121-150          2

Table 3: Fixing input values for number of words

No. of Words    Input value
1-1000          -2
1001-2000       -1
2001-3000        0
3001-4000        1
4001-5000        2

B. Training of Algorithm

The discrete perceptron training algorithm is used in the proposed approach. It is based on the concept of neural networks. During the training phase the weights are modified repeatedly according to the following equation until the actual output equals the desired output:

$$w \leftarrow w + 0.5\,c\,(d - o)\,y \qquad (1)$$

where c is the learning constant, d is the desired output, o is the actual output, y is the input vector and w is the weight vector being modified.

C. Categorization of Web Pages

Once the weights are fixed, training is complete. The testing data set can then be applied to the categorizer to categorize the web pages.

4. Testing and Results

Testing of the system is done with the help of a good number of training examples. The training data set should reflect the real-world situation, since the true performance of any system can only be evaluated on the basis of a high-quality training data set. The training data set is therefore collected from 120 home pages of different education websites and newspaper websites. The data obtained from the web pages are applied to the categorizer in the form of an input vector whose values lie in the range [-2, 2]. Out of 120 web pages with known categories, 108 are categorized correctly. The experimental and testing results are shown in Table 4.

Table 4: Experimental Results

Category     Correctly categorized pages    Incorrectly categorized pages
Education    58                             2
Newspaper    50                             10
Total        108                            12

According to these results, the accuracy is 90 % (108 of 120 pages).

5. Conclusion

Automated categorization is useful for efficient retrieval of web pages, focused crawling and maintenance of web directories. The proposed approach categorizes web pages based on characteristics such as the number of links, the number of images and the number of words present on them, using the single discrete perceptron training algorithm. The algorithm is used as a binary categorizer, which categorized education web pages and newspaper web pages into the appropriate category with an accuracy of 90 %.

5.1 Future Scope

The proposed approach can be extended to categorize blog and non-blog web pages by analyzing features that are highly discriminative, so that the pages are categorized accurately. Similarly, it can be applied to categorize social networking sites and non-social networking sites.

6. References

1. J. M. Zurada, Introduction to Artificial Neural Systems, Chapter 3, 93-132.
2. J. M. Pierre, 2000, Practical Issues for Automated Categorization of Web Pages.
3. Q. Xiaoguang and B. D. Davison, 2009, Web page classification: Features and algorithms, ACM Computing Surveys, 41(2).
4. O. Kwon and J. Lee, 2000, Web page classification based on k-Nearest Neighbor approach, IRAL Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages.
5. A. McCallum and K. Nigam, 1998, A Comparison of Event Models for Naive Bayes Text Classification, AAAI Workshop on Learning for Text Categorization.
6. D. Koller and M. Sahami, 1997, Hierarchically classifying documents using very few words, Proceedings of the Fourteenth International Conference on Machine Learning (ICML), 170-178.
7. D. D. Lewis and M. Ringuette, 1994, A Comparison of two learning algorithms for text categorization, Third Annual Symposium on Document Analysis and Information Retrieval (SDAIR), 81-93.
8. C. Apte and F. Damerau, 1994, Automated Learning of Decision Rules for Text Categorization, ACM Transactions on Information Systems, 12(3), 233-251.
9. S. T. Dumais et al., 1998, Inductive Learning Algorithms and Representations for Text Categorization, Proceedings of the Seventh International Conference on Information and Knowledge Management (CIKM), 148-155.
10. Yahoo!, http://www.yahoo.com. Accessed 25 April 2012.
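As a supplementary check on Section 4, the overall accuracy and a per-category breakdown can be re-derived from the counts in Table 4. The sketch below uses only the four counts reported there; the per-category figures are not stated in the paper and are simply what those counts imply, and the variable and function names are illustrative.

```python
# (correctly categorized, incorrectly categorized) per category, from Table 4.
results = {"Education": (58, 2), "Newspaper": (50, 10)}


def accuracy(correct, wrong):
    """Fraction of pages categorized correctly."""
    return correct / (correct + wrong)


per_category = {name: accuracy(c, w) for name, (c, w) in results.items()}
total_correct = sum(c for c, _ in results.values())
total_wrong = sum(w for _, w in results.values())
overall = accuracy(total_correct, total_wrong)

for name, value in per_category.items():
    print(f"{name}: {value:.1%}")
print(f"Overall: {overall:.1%}")
```

The breakdown makes visible that most of the 12 errors fall in the Newspaper category (10 of 60 pages) rather than in Education (2 of 60 pages).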