Cosine Similarity Measure and Genetic Algorithm for extracting main content from web documents

Similar documents
Blog Post Extraction Using Title Finding

Search Result Optimization using Annotators

Clustering Technique in Data Mining for Text Documents

ASSOCIATION RULE MINING ON WEB LOGS FOR EXTRACTING INTERESTING PATTERNS THROUGH WEKA TOOL

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Numerical Research on Distributed Genetic Algorithm with Redundant

Keywords: Information Retrieval, Vector Space Model, Database, Similarity Measure, Genetic Algorithm.

A Review: Image Retrieval Using Web Multimedia Mining

Mining Text Data: An Introduction

A PERSONALIZED WEB PAGE CONTENT FILTERING MODEL BASED ON SEGMENTATION

How To Cluster On A Search Engine

Social Business Intelligence Text Search System

Research and Implementation of View Block Partition Method for Theme-oriented Webpage

MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph

Management Science Letters

Adding New Level in KDD to Make the Web Usage Mining More Efficient

An Overview of Knowledge Discovery Database and Data mining Techniques

Data Mining in Web Search Engine Optimization and User Assisted Rank Results

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

A Study of Web Log Analysis Using Clustering Techniques

A Novel Framework for Personalized Web Search

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

SPATIAL DATA CLASSIFICATION AND DATA MINING

A Parallel Processor for Distributed Genetic Algorithm with Redundant Binary Number

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

A Comparison Framework of Similarity Metrics Used for Web Access Log Analysis

Visualizing e-government Portal and Its Performance in WEBVS

Alpha Cut based Novel Selection for Genetic Algorithm

A Dynamic Approach to Extract Texts and Captions from Videos

Efficient Query Optimizing System for Searching Using Data Mining Technique

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

A Method of Cloud Resource Load Balancing Scheduling Based on Improved Adaptive Genetic Algorithm

Natural Language Querying for Content Based Image Retrieval System



Importance of Domain Knowledge in Web Recommender Systems

Understanding Web personalization with Web Usage Mining and its Application: Recommender System

Domain Classification of Technical Terms Using the Web

Automatic Annotation Wrapper Generation and Mining Web Database Search Result

A Survey on Web Mining From Web Server Log

A SURVEY ON WEB MINING TOOLS

A LANGUAGE INDEPENDENT WEB DATA EXTRACTION USING VISION BASED PAGE SEGMENTATION ALGORITHM

DATA MINING TECHNOLOGY

A QoS-Aware Web Service Selection Based on Clustering

INVESTIGATIONS INTO EFFECTIVENESS OF GAUSSIAN AND NEAREST MEAN CLASSIFIERS FOR SPAM DETECTION

Model-Based Cluster Analysis for Web Users Sessions

BEHAVIOR PREDICTION VIA MINING SOCIAL DIMENSIONS

Design call center management system of e-commerce based on BP neural network and multifractal

LDA Based Security in Personalized Web Search

DATA PREPARATION FOR DATA MINING

Dynamic Data in terms of Data Mining Streams

Spam Detection Using Customized SimHash Function

A Detailed Study on Information Retrieval using Genetic Algorithm

Web Mining using Artificial Ant Colonies : A Survey

K-means Clustering Technique on Search Engine Dataset using Data Mining Tool

Mining Signatures in Healthcare Data Based on Event Sequences and its Applications

An Information Retrieval using weighted Index Terms in Natural Language document collections

Social Media Mining. Data Mining Essentials

Selection of Optimal Discount of Retail Assortments with Data Mining Approach

Inner Classification of Clusters for Online News

SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL

Profile Based Personalized Web Search and Download Blocker

Optimization of Search Results with Duplicate Page Elimination using Usage Data

A Survey on Intrusion Detection System with Data Mining Techniques


A Robust Method for Solving Transcendental Equations

Self Organizing Maps for Visualization of Categories

Wireless Sensor Networks Coverage Optimization based on Improved AFSA Algorithm

Memory Allocation Technique for Segregated Free List Based on Genetic Algorithm


User Behavior Analysis Based On Predictive Recommendation System for E-Learning Portal

A Genetic Algorithm-Evolved 3D Point Cloud Descriptor

Data Mining Algorithms Part 1. Dejan Sarka

AUTOMATION OF ENERGY DEMAND FORECASTING

A Survey on Association Rule Mining in Market Basket Analysis

A Virtual Machine Searching Method in Networks using a Vector Space Model and Routing Table Tree Architecture

Client Perspective Based Documentation Related Over Query Outcomes from Numerous Web Databases

A Review of Anomaly Detection Techniques in Network Intrusion Detection System

Bisecting K-Means for Clustering Web Log data

Web Mining Functions in an Academic Search Application

Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis

A Comparative Study of Different Log Analyzer Tools to Analyze User Behaviors

GA as a Data Optimization Tool for Predictive Analytics

Comparison of K-means and Backpropagation Data Mining Algorithms


UPS battery remote monitoring system in cloud computing

Automatically adapting web pages to heterogeneous devices

Data Mining System, Functionalities and Applications: A Radical Review

Web Log Based Analysis of User's Browsing Behavior

E-commerce recommendation system on cloud computing

Semantic Concept Based Retrieval of Software Bug Report with Feedback

Science Navigation Map: An Interactive Data Mining Tool for Literature Analysis


A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE

A New Method for Traffic Forecasting Based on the Data Mining Technology with Artificial Intelligent Algorithms

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

Verifying Business Processes Extracted from E-Commerce Systems Using Dynamic Analysis

Web Mining (Margherita Berardi, LACAM, Dipartimento di Informatica, Università degli Studi di Bari)


Cosine Similarity Measure and Genetic Algorithm for extracting main content from web documents

1 Digvijay B. Gautam, 2 Pradnya V. Kulkarni
1,2 Department of Computer Engineering, Maharashtra Institute of Technology, Pune 411038, India
Email: 1 anshul1389@gmail.com, 2 pradnya.kulkarni@mitpune.edu.in

Abstract

Because of the ever-growing volume of information, web mining has become a primary necessity, and research on it has received considerable interest from both industry and academia. Mining and predicting users' web browsing behavior and deducing the actual content of a web document is one of the active subjects. The information on the web is dirty: apart from useful content, it contains unwanted material such as copyright notices and navigation bars that are not part of the main content of web pages. These seriously harm web data mining and hence need to be eliminated. This paper studies similarity criteria based on cosine similarity to deduce which parts of the content are more important than others. Under the vector space model, information retrieval is based on the similarity between query and documents: documents with high similarity to the query are judged more relevant and should be retrieved first. Under genetic algorithms, each query is represented by a chromosome. These chromosomes are fed through the genetic operators (selection, crossover, and mutation) until we obtain an optimized query chromosome for document retrieval. Our test results show that information retrieval with 0.8 crossover probability and 0.01 mutation probability gives the highest precision, while 0.8 crossover probability and 0.3 mutation probability gives the highest recall.

Keywords: Cosine Similarity; Genetic Algorithm; Fitness Function; Web Content Mining.

I. INTRODUCTION

Information is expanding rapidly, and so is the web. The web is a collection of abundant information. The information on the web is dirty.
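The vector-space retrieval idea summarized in the abstract, ranking documents by the cosine of the angle between the query vector and each document vector, can be sketched as follows. This is an illustrative sketch with hypothetical term-frequency vectors, not the authors' implementation.

```python
import math

def cosine_similarity(q, d):
    """Cosine of the angle between query vector q and document vector d."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

# Hypothetical term-frequency vectors over a shared vocabulary.
query = [1, 1, 0, 0]
doc_a = [2, 1, 0, 1]   # shares both query terms
doc_b = [0, 0, 3, 1]   # shares no query terms

# Documents with higher similarity to the query are ranked first.
ranked = sorted([("doc_a", doc_a), ("doc_b", doc_b)],
                key=lambda nd: cosine_similarity(query, nd[1]),
                reverse=True)
```

In the paper's setting, the document vectors would be built from TF-IDF weights over the blocks of a page rather than raw term counts.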
Apart from useful information, it contains unwanted items such as copyright notices and navigation bars that are not part of the main content of web pages. Although these items are useful for human viewers and necessary for web site owners, they can seriously harm automated information collection and web data mining, e.g. web page clustering, web page classification, and information retrieval. How to extract the main content blocks therefore becomes very important. Web pages contain Div blocks, Table blocks, and other HTML blocks. This paper aims at extracting the main content of a web document that is relevant to the user's query. A new algorithm is proposed to extract the informative block from a web page based on DOM analysis and a Content Structure Tree (CST). We then apply the cosine similarity measure to identify and rank each block of the page. Further, we use the TF-IDF (Term Frequency and Inverse Document Frequency) scheme to calculate the weight of each node in the CST. Finally, we extract the information most relevant to the query. Two algorithms are discussed in this paper: one for generating the Content Structure Tree and another for extracting the main content from it.

II. LITERATURE SURVEY

Many extraction systems have been developed over time. In [1], Swe Swe Nyein proposed a method for mining the contents of a web page using cosine similarity. In [2], C. Li et al. propose a method to extract the informative block from a web page based on an analysis of both the layouts and the semantic information of the pages; they identify blocks occurring in a web collection using the Vision-based Page Segmentation algorithm. In [3], L. Yi et al. propose a new tree structure, called a Style Tree, to capture the actual contents and the common layouts (or presentation styles) of the web pages in a web site.
Their method has difficulty capturing the common presentation of web pages drawn from different web sites. In [4], Y. Fu et al. propose a method to discover the informative content block based on the DOM tree; they removed clutter using XPath, but could handle only web pages with similar layouts. In [5], P. S. Hiremath et al. propose an algorithm called VSAP (Visual Structure based Analysis of web Pages) to extract the data region based on visual-clue information of web pages (the on-screen locations at which data regions, data records, and data items are rendered). In [6], S. H. Lin et al. propose a system, InfoDiscoverer, to discover informative content blocks from web documents; it first partitions a web page into several content blocks according to the HTML tag <TABLE>. In [7], D. Cai et al. propose a Vision-based Page Segmentation (VIPS) algorithm that segments web

pages using the DOM tree together with human visual cues, including tag, color, and size cues. In [8], P. M. Joshi et al. propose a combination of HTML DOM analysis and Natural Language Processing (NLP) techniques for automated extraction of the main article, with associated images, from web pages; their approach requires no prior knowledge of website templates and extracts not only the text but also associated images, based on the semantic similarity of image captions to the main text. In [9], Y. Li et al. propose a content structure tree that captures the importance of blocks. In [10], R. R. Mehta et al. propose a page segmentation algorithm that uses both visual and content information to obtain semantically meaningful blocks, producing a semantic structure tree as output. In [11], S. Gupta et al. propose a content extraction technique that removes clutter without destroying the webpage layout; it not only extracts information from large logical units but also manipulates smaller units, such as specific links, within the structure of the DOM tree. Most existing approaches are based only on the DOM tree. In [12], Davis demonstrates the basic principles of Genetic Algorithms. In [13], Goldberg highlights the significance of Genetic Algorithms in search, optimization, and machine learning. In [14], Kraft uses genetic programming to build queries for information retrieval. In [15], Martin-Bautista describes an adaptive information retrieval agent using genetic algorithms with fuzzy-set genes.

III. WEB MINING

Web mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web. It has become a necessity as the information across the world, and hence the size of the web, increases tremendously.
According to the differences among the mining objects, there are roughly three knowledge discovery domains that pertain to web mining: Web Content Mining, Web Structure Mining, and Web Usage Mining. Web usage mining is the application of data mining technology to the data in web server log files. It can discover users' browsing patterns and correlations between web pages, and it provides support for web site design, personalization, and other business decision making. Web mining applies data mining, artificial intelligence, graph techniques, and so on to web data, traces users' visiting characteristics, and extracts their usage patterns. Web usage mining generally includes the following steps:

1. Data collection
2. Data preprocessing
3. Establishing an interestingness model
4. Pattern analysis

A web mining algorithm based on web usage mining also yields a design approach for e-commerce website algorithms. Such an algorithm is simple, effective, and easy to implement, and is suitable for the web usage mining needs of a low-cost Business-to-Customer (B2C) website. A web mining algorithm generally includes the following steps:

1. Collect and preprocess users' information
2. Establish the topology of the web site
3. Establish adjacency matrices of the pages users visit
4. Apply the result concretely

A. Web Content Mining: Web content mining refers to the detection and description of useful information in web contents / data / documents. There are two views on web content mining: the information retrieval view and the database view.
Under the information retrieval view, the aim of web content mining is to help filter or find data for the user, usually based on extraction or on users' demands; under the database view, it is an attempt to model the data on the web and integrate it so that most of the expert queries required for searching the information can be executed against this data model.

B. Web Structure Mining: Web structure mining tries to discover the model underlying the link structure of the web. This model can be used to categorize web pages and is useful for deriving information such as the similarity and relationships between different web sites.

C. Web Usage Mining: Web usage mining uses data derived from web usage to detect, automatically, the behavioral models of users accessing web services.

IV. PRINCIPLE OF GENETIC ALGORITHMS

A. BASIC PRINCIPLES

GAs are characterized by five basic components:

1) A chromosome representation for the feasible solutions to the optimization problem.
2) An initial population of feasible solutions.

3) A fitness function that evaluates each solution.
4) Genetic operators that generate a new population from the existing population.
5) Control parameters such as the population size, the probabilities of the genetic operators, the number of generations, etc.

B. PROCESS OF GENETIC ALGORITHMS

A GA is an iterative procedure that maintains a constant-size population of feasible solutions. During each iteration step, called a generation, the fitness of the current population is evaluated, and individuals are selected based on their fitness values. The higher-fitness chromosomes are selected for reproduction under the action of crossover and mutation to form the new population; the lower-fitness chromosomes are eliminated. The new population is evaluated, selected, and fed into the genetic operator process again, until we obtain an optimal solution (see Fig. 1).

Fig. 1. The process of Genetic Algorithms

V. FITNESS EVALUATION

A fitness function is a performance measure, or reward function, that evaluates how good each solution is. The information retrieval problem is how to retrieve the documents a user requires, and the fitness functions in Table 1 can be used to calculate the distance between a document and a query. Table 1 contains two types of fitness functions: those over weighted term vectors and those over binary term vectors. We define X = (x1, x2, x3, ..., xn), |X| = the number of terms occurring in X, and |X ∩ Y| = the number of terms occurring in both X and Y.

Table 1. Fitness Functions

VI. COSINE SIMILARITY

Cosine similarity is a measure of similarity between two high-dimensional vectors; in essence, it is the cosine of the angle between them. The similarity of content in web pages is estimated using the cosine similarity measure, i.e. the cosine of the angle between the query vector q and the document vector dj. Then the weight of each node (term) in the Content Structure Tree, which represents the contents as tree nodes such as Text nodes, Image nodes, and Link nodes, is calculated.
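Table 1 is an image in the original and is not reproduced in this transcription. Using the |X| and |X ∩ Y| notation of the fitness evaluation section, the three set-based coefficients later used as fitness functions (Jaccard, Dice, and cosine over binary term vectors) can be sketched as follows; treat this as an illustrative reading, not the paper's exact table.

```python
import math

def jaccard(x, y):
    """Jaccard coefficient: |X ∩ Y| / |X ∪ Y|."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y) if (x | y) else 0.0

def dice(x, y):
    """Dice coefficient: 2|X ∩ Y| / (|X| + |Y|)."""
    x, y = set(x), set(y)
    return 2 * len(x & y) / (len(x) + len(y)) if (x or y) else 0.0

def cosine_binary(x, y):
    """Cosine coefficient for binary term vectors: |X ∩ Y| / sqrt(|X| * |Y|)."""
    x, y = set(x), set(y)
    return len(x & y) / math.sqrt(len(x) * len(y)) if (x and y) else 0.0

# Hypothetical query and document term sets for illustration.
query_terms = {"web", "mining", "content"}
doc_terms = {"web", "content", "extraction", "block"}
```

All three coefficients grow with the term overlap between query and document, so any of them can serve as the fitness of a candidate query chromosome against a relevant document.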
The TF-IDF (Term Frequency and Inverse Document Frequency) scheme is used to calculate the weight. The term frequency of a term ti in document dj is the number of times ti appears in dj; from it, a normalized weight wij is defined. The weight of a content node is then obtained as

Content Weight = Text Weight + Image Weight + Link Weight

The content value computed from the children of an HtmlItem node is added to the element that is the first child of that HtmlItem node. Before adding to the parent node, the system checks whether there are HtmlItem nodes at the same level. If so, the similarity of these nodes is computed using their weights, and then

all HtmlItem nodes are eliminated except the node with the highest similarity. The documents are ranked by their similarity values; the top-ranked documents are regarded as more relevant to the query.

VII. EXPERIMENTATION

A. TEST CASE FORMULATION

The experiments test 21 queries with 3 different fitness functions: the Jaccard coefficient (F1), the cosine coefficient (F2), and the Dice coefficient (F3). Each fitness function is tested with crossover probability Pc = 0.8 and mutation probability Pm = 0.01, 0.10, or 0.30 to compare the efficiency of the retrieval system. Retrieval efficiency is measured by recall and precision. Recall is defined as the proportion of relevant documents that are retrieved (equation (1)), and precision as the proportion of retrieved documents that are relevant (equation (2)):

Recall = (number of relevant documents retrieved) / (total number of relevant documents)   (1)

Precision = (number of relevant documents retrieved) / (total number of documents retrieved)   (2)

B. EXPERIMENTAL RESULTS

Preliminary testing indicated the following:

1. Testing the 3 fitness functions shows that the optimized queries produced by all three are the same queries, but with different fitness values (F1, F2, and F3), as shown in Table 2. In Table 2, RetRel denotes the number of retrieved relevant documents and RetNRel the number of retrieved non-relevant documents.
2. Information retrieval with Pc = 0.8 and Pm = 0.01 yields the highest precision, 0.746, while Pm = 0.10 yields a moderate precision of 0.560 and Pm = 0.30 yields the lowest precision, 0.417, as shown in Fig. 2.
3. Information retrieval with Pc = 0.8 and Pm = 0.30 yields the highest recall, 0.976, while Pm = 0.01 yields a moderate recall and Pm = 0.10 yields the lowest recall, 0.786, as shown in Fig. 2.

Table 2. Information retrieval by the 3 fitness functions with Pc = 0.8 and Pm = 0.01

Fig. 2. Precision and Recall

VIII. CONCLUSION

The preliminary experiments indicate that precision and recall vary inversely. Which parameters to use depends on what the user would like to retrieve. If high-precision results are preferred, the parameters should be a high crossover probability and a low mutation probability; if more relevant documents (high recall) are preferred, the parameters should be a high mutation probability and a lower crossover probability. The preliminary experiments also indicate that GAs can be used in information retrieval.

REFERENCES

[1] Swe Swe Nyein, "Mining Contents in Web Page Using Cosine Similarity," IEEE, 2011.
[2] C. Li, J. Dong, and J. Chen, "Extraction of Informative Blocks from Web Pages Based on VIPS," 1553-9105, January 2010.
[3] L. Yi, B. Liu, and X. Li, "Eliminating Noisy Information in Web Pages for Data Mining," in Proc. ACM SIGKDD International Conference

on Knowledge Discovery & Data Mining (KDD-2003).
[4] Y. Fu, D. Yang, and S. Tang, "Using XPath to Discover Informative Content Blocks of Web Pages," IEEE, DOI 10.1109/SKG, 2007.
[5] P. S. Hiremath, S. S. Benchalli, S. P. Algur, and R. V. Udapudi, "Mining Data Regions from Web Pages," International Conference on Management of Data (COMAD), India, December 2005.
[6] S. H. Lin and J. M. Ho, "Discovering Informative Content Blocks from Web Documents," in Proc. ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 588-593, July 2002.
[7] D. Cai, S. Yu, J. R. Wen, and W. Y. Ma, "VIPS: a Vision-based Page Segmentation Algorithm," Technical Report, MSR-TR, Nov. 1, 2003.
[8] P. M. Joshi and S. Liu, "Web Document Text and Images Extraction using DOM Analysis and Natural Language Processing," ACM DocEng, 2009.
[9] Y. Li and J. Yang, "A Novel Method to Extract Informative Blocks from Web Pages," IEEE, DOI 10.1109/JCAI, 2009.
[10] R. R. Mehta, P. Mitra, and H. Karnick, "Extracting Semantic Structure of Web Documents Using Content and Visual Information," ACM, Chiba, Japan, May 2005.
[11] S. Gupta, G. Kaiser, D. Neistadt, and P. Grimm, "DOM-based Content Extraction of HTML Documents," in Proc. 12th International Conference on WWW, ISBN 1-58113-680-3, 2003.
[12] Davis, L., Handbook of Genetic Algorithms. New York: Van Nostrand Reinhold, 1991.
[13] Goldberg, D. E., Genetic Algorithms in Search, Optimization, and Machine Learning. New York: Addison-Wesley, 1989.
[14] Kraft, D. H., et al., "The Use of Genetic Programming to Build Queries for Information Retrieval," in Proceedings of the First IEEE Conference on Evolutionary Computation. New York: IEEE Press, 1994, pp. 468-473.
[15] Martin-Bautista, M. J., et al., "An Approach to an Adaptive Information Retrieval Agent using Genetic Algorithms with Fuzzy Set Genes," in Proceedings of the Sixth International Conference on Fuzzy Systems. New York: IEEE Press, 1997, pp. 1227-1232.