International Journal of Software and Web Sciences (IJSWS) Web Log Mining Based on Improved FCM Algorithm using Multiobjective

Similar documents
Introduction To Genetic Algorithms

Management Science Letters

A survey on Data Mining based Intrusion Detection Systems

Social Media Mining. Data Mining Essentials

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

A Study of Web Log Analysis Using Clustering Techniques

A Survey on Intrusion Detection System with Data Mining Techniques

Fuzzy Clustering Technique for Numerical and Categorical dataset

Comparison of K-means and Backpropagation Data Mining Algorithms

How To Solve The Cluster Algorithm

Bisecting K-Means for Clustering Web Log data

ISSN: ISO 9001:2008 Certified International Journal of Engineering Science and Innovative Technology (IJESIT) Volume 2, Issue 3, May 2013

Contents. Dedication List of Figures List of Tables. Acknowledgments

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Using Data Mining Techniques for Performance Analysis of Software Quality

Evolutionary Detection of Rules for Text Categorization. Application to Spam Filtering

GA as a Data Optimization Tool for Predictive Analytics

D A T A M I N I N G C L A S S I F I C A T I O N

An Overview of Knowledge Discovery Database and Data mining Techniques

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING

Social Business Intelligence Text Search System

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

Information Retrieval and Web Search Engines

Categorical Data Visualization and Clustering Using Subjective Factors

The Scientific Data Mining Process

Machine Learning using MapReduce

SOME CLUSTERING ALGORITHMS TO ENHANCE THE PERFORMANCE OF THE NETWORK INTRUSION DETECTION SYSTEM

QOS Based Web Service Ranking Using Fuzzy C-means Clusters

International Journal of Software and Web Sciences (IJSWS)

K Thangadurai P.G. and Research Department of Computer Science, Government Arts College (Autonomous), Karur, India.

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

Use of Data Mining Techniques to Improve the Effectiveness of Sales and Marketing

Predictive Analytics using Genetic Algorithm for Efficient Supply Chain Inventory Optimization

Alpha Cut based Novel Selection for Genetic Algorithm

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Clustering Data Streams

Quality Assessment in Spatial Clustering of Data Mining

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

uclust - A NEW ALGORITHM FOR CLUSTERING UNSTRUCTURED DATA

International Journal of Advance Research in Computer Science and Management Studies

How To Use Data Mining For Knowledge Management In Technology Enhanced Learning

A Genetic Algorithm Approach for Solving a Flexible Job Shop Scheduling Problem

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

Essential Components of an Integrated Data Mining Tool for the Oil & Gas Industry, With an Example Application in the DJ Basin.

ISSN: A Review: Image Retrieval Using Web Multimedia Mining

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Grid Density Clustering Algorithm

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier

A Novel Framework for Personalized Web Search

Memory Allocation Technique for Segregated Free List Based on Genetic Algorithm

processed parallely over the cluster nodes. Mapreduce thus provides a distributed approach to solve complex and lengthy problems

INVESTIGATIONS INTO EFFECTIVENESS OF GAUSSIAN AND NEAREST MEAN CLASSIFIERS FOR SPAM DETECTION

Genetic Algorithm. Based on Darwinian Paradigm. Intrinsically a robust search and optimization mechanism. Conceptual Algorithm

Using Data Mining for Mobile Communication Clustering and Characterization

Wireless Sensor Networks Coverage Optimization based on Improved AFSA Algorithm

Introduction to Data Mining

Advanced Web Usage Mining Algorithm using Neural Network and Principal Component Analysis

PLAANN as a Classification Tool for Customer Intelligence in Banking

Inventory Analysis Using Genetic Algorithm In Supply Chain Management

IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION

An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework

Overview. Swarms in nature. Fish, birds, ants, termites, Introduction to swarm intelligence principles Particle Swarm Optimization (PSO)

Chapter ML:XI (continued)

Segmentation of stock trading customers according to potential value

Web Mining using Artificial Ant Colonies : A Survey

Data Mining Part 5. Prediction

A SURVEY ON GENETIC ALGORITHM FOR INTRUSION DETECTION SYSTEM

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

LOAD BALANCING IN CLOUD COMPUTING

A New Approach For Estimating Software Effort Using RBFN Network

Clustering UE 141 Spring 2013

Original Article Survey of Recent Clustering Techniques in Data Mining

A Binary Model on the Basis of Imperialist Competitive Algorithm in Order to Solve the Problem of Knapsack 1-0

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Enhanced Customer Relationship Management Using Fuzzy Clustering

An evolutionary learning spam filter system

Robust Outlier Detection Technique in Data Mining: A Univariate Approach

Numerical Research on Distributed Genetic Algorithm with Redundant

International Journal of Emerging Technologies in Computational and Applied Sciences (IJETCAS)

Keywords: Information Retrieval, Vector Space Model, Database, Similarity Measure, Genetic Algorithm.

Neural Networks Lesson 5 - Cluster Analysis

Design of Web Ranking Module using Genetic Algorithm

Manjeet Kaur Bhullar, Kiranbir Kaur Department of CSE, GNDU, Amritsar, Punjab, India

Binary Coded Web Access Pattern Tree in Education Domain

A Content based Spam Filtering Using Optical Back Propagation Technique

So, how do you pronounce. Jilles Vreeken. Okay, now we can talk. So, what kind of data? binary. * multi-relational

Modeling and Design of Intelligent Agent System

Hadoop SNS. renren.com. Saturday, December 3, 11

Big Data with Rough Set Using Map- Reduce

Model-based Parameter Optimization of an Engine Control Unit using Genetic Algorithms

TECHNIQUES FOR OPTIMIZING THE RELATIONSHIP BETWEEN DATA STORAGE SPACE AND DATA RETRIEVAL TIME FOR LARGE DATABASES

Data Mining Algorithms and Techniques Research in CRM Systems

How To Use Neural Networks In Data Mining

User Behavior Analysis Based On Predictive Recommendation System for E-Learning Portal

A Comparative Study on Sentiment Classification and Ranking on Product Reviews

Transcription:

International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) ISSN (Print): 2279-0063 ISSN (Online): 2279-0071 International Journal of Software and Web Sciences (IJSWS) www.iasir.net Web Log Mining Based on Improved FCM Algorithm using Multiobjective Genetic Algorithm Ranu Singhal 1, Nirupama Tiwari 2 Shri Ram College of Engineering And Management Banmore 1, 2 Gwalior INDIA Abstract: Web log mining is important area of research in concern of user behavior, web cache improving and web security concern. For the purpose of mining various algorithms are used such as clustering, classification and rule mining technique in web log files. In this paper we discuss we log mining using improved fuzzy clustering algorithm using multi-objective genetic algorithm. For the optimization of cluster seed and controlling iteration of clustering technique we used multi-genetic algorithm. The proposed method minimized the error rate and generate accurate cluster for provided web log files. For the experimental process we used real time web log files and compared our technique with FCM algorithm. Our experimental evaluation result shows better clustering performance in compression of FCM algorithm. Keywords: Web logs, clustering, FCM, MOGA I. INTRODUCTION Web log mining refers to behavior analysis of user and rating of website. Web log mining process implied in web usage mining concern. For the web log mining, log files of web passes through various sate of data preprocessing such as data cleaning, data transformation and data validation. All these process implies the uncontrolled event in clustering process Clustering and classification[1] have been useful and active areas of machine learning research that promise to help us cope with the problem of information overload on the Internet. With clustering the goal is to separate a given group of data items (the data set) into groups called clusters such that items in the same cluster are similar to each other and dissimilar to the items in other clusters. In clustering methods no labeled examples are provided in advance for training (this is called unsupervised learning). Web mining methodologies [3] can generally be classified into one of three distinct categories: web usage mining, web structure mining, and web content mining a survey of techniques used in these areas. As an important web data processing technology, clustering analysis has been widely applied to various fields. Because many of the real world objects or the things themselves are uncertain, people have to use fuzzy theory to study the objections or the things, and we called this study to fuzzy clustering analysis. In this paper, we are using FCM in first phase then we applying improved genetic algorithm. The first phase, we uses C-means determine the number of cluster centers, which reduces the algorithm s dependence on the initial cluster centers and improves clustering performance greatly. In the clustering process, we simultaneously combine with the ideas of merger. In second phase, the initial clusters centers are found using C-means algorithm. These give us centers that are widely spread within the data. GA takes these centers as it initial variables and iterates to find the local maxima. Hence, we get clusters that are distributed well using C-means and clusters that are compact using GA. In the subsequent section, we present the proposed architecture and experimentation results of the improved genetic FCM clustering comparing with entropy based FCM.Some conclusions are provided towards the end. The rest of paper is organized as follows. In Section II discuss FCM algorithm. The Section III discusses proposed method IV experimental result followed by a conclusion in Section V. II. FCM ALGORITHM AND RELATED WORK Fuzzy clustering plays an important role in solving problems in the areas of pattern recognition and web log mining. A variety of fuzzy clustering methods have been proposed and most of them are based upon distance criteria [6]. One widely used algorithm is the fuzzy c-means (FCM) algorithm. It uses reciprocal distance to compute fuzzy weights. A more efficient algorithm is the new FCFM. It computes the cluster center using Gaussian weights, uses large initial prototypes, and adds processes of eliminating, clustering and merging. In the following sections we discuss and compare the FCM algorithm and FCFM algorithm.the fuzzy c-means (FCM) algorithm was introduced by J. C. Bezdek [2]. The idea of FCM is using the weights that minimize the total weighted mean-square error: J(w qk, z (k) ) = (k=1,k) (k=1,k) (w qk ) x (q) - z (k) 2 (1) (k=1,k) (w qk ) = 1 for each q w qk = (1/(D qk ) 2 ) 1/(p-1) / (k=1,k) (1/(D qk ) 2 ) 1/(p-1), p > 1 (2) IJSWS 13-224; 2013, IJSWS All Rights Reserved Page 59

The FCM allows each feature vector to belong to every cluster with a fuzzy truth value (between 0 and 1), which is computed using Equation (4). The algorithm assigns a feature vector to a cluster according to the maximum weight of the feature vector over all clusters. III. MODIFIED FCM ALGORITHM Genetic algorithms are inspired by Darwin's theory about evolution. Solution to a problem solved by genetic algorithms is evolved. Algorithm is started with a set of solutions (represented by chromosomes or also called string) called population. Solutions from one population are taken and used to form a new population. This is motivated by a hope, that the new population will be better than the old one. The process of k-means clustering and multi-objective function takes several process such as population, fitness function, mutation and crossover for propagation of clustering algorithm. Some steps are divided into six phase. Initializing population The initial population is done by determining the length of chromosome with size K x d. K is the number of chromosome a lot of d while d is the dimension of the cluster variables Fitness function K-Means clustering optimization with multi-objective genetic algorithm uses 2 objectives, i.e. minimizing error functions within each cluster (equation 1) and maximizing the centeid value between cluster (equation 2). The calculations used as follows. Where i = 1, 2... K; K is the number of cluster : Error in cluster : number of data in banyak data pada cluster : Data in i-cluster, variabe : i-cluster average in j-variable V : Number of variable The first function is to minimize error average in cluster which formulated as follow: V (w) : Error in cluster k : Number of cluster The second function is to maximize inter-cluster error which formulated as follow: dengan: V (b) : Error in cluster : i-cluster average in variable : Grand mean of variable A. Fitness Function with Pareto Ranking Approach Fitness function is calculated by using pareto ranking approach, where each individual datum is evaluated by the overall population based on the non-domination concept. Next, Pareto ranking approach is done by equations 3. Completion rank in iteration Solution number which dominate x completion in iteration B. Crossover The crossover is one genetic operator in generating new chromosome (offspring). In this study, single point crossover is used with crossover probability which is calculated by using equation 4. IJSWS 13-224; 2013, IJSWS All Rights Reserved Page 60

completion selection probability fitness value in solution E. Mutation Another genetic operator is mutation. This process is exploits against the possibility of modifications to the existing results. The selection process of chromosome mutations, as well as the position of a gene to be transformed will be done in a random. The number of affected offspring is determined by the mutation probability as in equation 5. Mutation probability in cluster jarak Euclidean distant between data and centre point of cluster. If group is empty then is defined as 0. F. Elitism The random selection doesn't guarantee the non-dominated solutions will survive in the next generation. The real step for elitism in multi-objective genetic algorithm is doubling the nondominated solution in population Pt, henceforth will be included in the population of Pt + 1 by selecting the non-dominated solution. Figure 1 shows proposed model of modified working process. IJSWS 13-224; 2013, IJSWS All Rights Reserved Page 61

IV. EXPERIMENTAL RESULT ANALYSIS. For the evaluation of web log data used FCM and FCM_MOGA clustering algorithm implement in MATLAB 7.8.0 and found the similar pattern of cluster for ip, session, time and data. The cluster found in color group. The formation of cluster gives the information of valid and invalid cluster according to cluster valid index[13]. The clustering validity criteria are classified into internal, external, and relative. The clustering work focus on the relative association of month, grade and student relative criteria is used as the validity measure. The criteria widely accepted for partitioning a data set into a number of clusters are separation of the clusters, and their compactness. Thus these criteria are obviously good candidates for checking the validity of clustering results. The process of cluster validation defines a relative validity index, for assessing the quality of partitioning for each set of the input values. The proposal formalize clustering validity index based on clusters compactness (in terms of cluster density), and clusters separation (combining the distance between clusters and the inter-cluster density). The evaluation of clustering performance used some standard parameter such as number of valid cluster generation and number of cluster along with mean absolute error of clustering process. The mean absolute error process induced the error rate of clustering technique. The process of clustering used some set of data as number of instant as row in fashion of 1000, 2000, 3000 and 4000 thousand for small data to large size of data. To test the validation of cluster each cluster property assigned the color label of data the unlabeled cluster shows that invalid cluster in the process of cluster generation[7,8]. For validity of cluster and measurement of error used some standard formula given below. In clustering the mean absolute error (MAE) is a quantity used to measure how close real or predictions are to the eventual outcomes. The mean absolute error is given by. (1) As the name suggests, the mean absolute error is an average of the absolute errors is the prediction and the true value., where Figure 2 shows that web log data for clustering process Figure 3 shows that web log cluster in color mode IJSWS 13-224; 2013, IJSWS All Rights Reserved Page 62

Size of data Number of clust Valid cluster Error Data1 8 6 4.607 Data2 7 6 7.890 Data3 8 8 2.304 Data4 5 4 10.34 Table 1 show that cluster generation of educational data and check number of cluster according to valid cluster 5000 4000 3000 2000 1000 Size of data Number of cluster 0 1 2 3 4 Figure 4 shows that comparative cluster generation according to data size 12 10 8 6 4 2 0 1 2 3 4 Valid cluster Error Figure 5 shows that comparative valid cluster generation and generation of error according to cluster size V. CONCLUSION AND FUTURE WORK In this paper we proposed a modified FCM algorithm using multi-objective genetic algorithm for web log clustering. proposed, to a certain degree, which overcomes the defects that FCM is sensitive to initial value and it is easily able to be trapped in a local optimum. The practicability of this approach is analyzed in principle, and its practical effect is confirmed by experiment in technical. The experiment results show that the global distribution characteristic of the space clustering centers which are found during the process of the fuzzy clustering analysis by utilizing improved GA is suitably kept, so clustering effect is more rational. REFERENCES:- [1]Athman Bouguettaya "On Line Clustering", IEEE Transaction on Knowledge and Data Engineering, Volume 8, No. 2, April 1996. [2]D. Beeferman, A. Berger, Agglomerative clustering of search engine query log, KDD 2000. [3]Qingtian Han, Xiaoyan Gao, Wenguo Wu, Study on Web Mining Algorithm Based on Usage Mining, Computer- Aided Industrial Design and Conceptual Design, 2008.CAID/CD 2008. 9th International Conference on 22-25 Nov. 2008. [4]Yi Dong, Huiying Zhang, Linnan Jiao, Research on Application of User Navigation Pattern Mining Recommendation, Intelligent Control and Automation, 2006. WCICA 2006, the Sixth World Congress, Volume 2. [5]Xuan SU,Xiaoye WANG,Zhuo WANG,Yingyuan XIAZO, An New Fuzzy Clustering Algorithm Based On Entropy Weighting, Journal of Computational Information Systems,3319-3326,2010. [6Mohanad Alata, Mohammad Molhim, and Abdullah Ramini, Optimizing of Fuzzy C-Means Clustering Algorithm Using GA, World Academy of Science, Engineering and Technology 39, 2008. [7]K.Suresh, R.Madana Mohana, A.RamaMohan Reddy, Improved FCM algorithm for Clustering on Web Usage Mining, IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 1, January 2011. IJSWS 13-224; 2013, IJSWS All Rights Reserved Page 63