Implementation of a New Approach to Mine Web Log Data Using Mater Web Log Analyzer




Mahadev Yadav 1, Prof. Arvind Upadhyay 2
1,2 Computer Science and Engineering, IES IPS Academy, Indore, India

Abstract - The Web is a large and dynamic domain of knowledge and discovery. Many researchers, technicians and research organizations collect and extract important data from the Web. In this paper, we emphasize a new approach to information gathering through the mining of web access logs, or web usage data. Our proposed framework is composed of five steps. In the first step, web usage mining is applied to the web log data to preprocess the log and extract the formal information from the log file. In the second step, user profiles are created by identifying the sessions that share the same source IP. In the third step, a web log classification algorithm is applied to the user-profiled data to predict the response of the server to a user request. In the fourth step, we analyze user behavior patterns using frequent pattern analysis algorithms and develop a new modified Apriori algorithm that gives a more efficient solution. In the final step, different classification and frequent pattern analysis algorithms are compared on different attributes to learn how the algorithms behave on different datasets.

Keywords - Web Access Log, Web Log, Web Usage Mining, Log Analyzer, User Profiling.

I. INTRODUCTION

The expansion of the World Wide Web (Web for short) has resulted in a large amount of data that is now, in general, freely available for user access. The different types of data have to be managed and organized so that different users can access them efficiently. Web mining has therefore developed into an autonomous research area. It involves a wide range of applications [2] that aim at discovering and extracting hidden information in data stored on the Web.
Another important purpose of Web mining is to provide a mechanism that makes data access more efficient and adequate. A third purpose is to discover information that can be derived from the activities of users, which are stored in log files, for example for predictive Web caching [4]. Web mining can thus be categorized into three classes based on which part of the Web is mined [2, 3, 4]: Web content mining, Web structure mining and Web usage mining. The decision tree algorithm [10, 11] is used to classify data into different classes. Association rule mining is defined in [12, 13] as the task of finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.

II. RELATED WORK

Many different tools have been designed and developed [1, 2] to extract important information from web log files.

Table 1: Different web log analyzer tools

Name | Firm | Type | Comments
Web log Parse | ACME Labs Software | Log file processing | Extracts specific fields from a web log file; supports different web log formats.
Web log | Darryl C. Burgdorf | Log file analysis tool | Keeps track of activity on your site by month, week, day, page views, bytes transferred, etc.
Analog | University of Cambridge | Log file analyzer | Tells which pages are most popular, which countries people are visiting from, etc.

Most of these tools extract the same data from the log. A new tool is therefore required with which an administrator can extract more, and different, information from the analysis of a web log file. Existing software has several problems. It does not support user personalization. It cannot predict which response the server will generate for a user request. It does not work on the frequent item sets of user navigation within the web site.

Existing tools also take a long time to process the data, because the analysis runs over a huge amount of data. To resolve these problems, a new analyzer was developed to extract more information from web usage data with more accuracy and in less time. The next section defines, step by step, the new approach to web log mining used by the Master Web Log Analyzer.

III. SYSTEM ARCHITECTURE

The diagram below shows the basic architecture of the system, which is a combination, or integration, of different subsystems. A subsystem can be defined as a group of interconnected parts that performs an important task as a component of a larger system. A server log [5] is a log file (or several files) automatically created and maintained by a server to record the activity performed by users. The W3C maintains a standard format (the Common Log Format) for web server log files, but other proprietary formats exist. A web log is the click-stream data of users navigating a web site. Information about each request, including client IP address, request date/time, page requested, HTTP code, bytes served, user agent, and referrer, is typically added.

Figure 1: Proposed architecture of the Master Web Log Analyzer

IV. STEP BY STEP PROCESSING OF THE PROPOSED SYSTEM

This section describes the details of the operation and performance of our proposed multi-purpose analyzer.

A. Import Log File

A web server is a software program or server computer equipped to offer World Wide Web access. Web servers allow you to serve content over the Internet using the Hyper Text Markup Language (HTML). The web server accepts requests from browsers such as Netscape and Internet Explorer and then returns the appropriate HTML documents.

Figure 2: Sample web server log (Common Log Format)

A typical entry in the access log might look as follows.
161.184.77.234 - - [18/Sep/2001:01:24:45 +0000] "GET /images/buy_now-a.gif HTTP/1.1" 200 1388 http://www.123loganalyzer.com/buy.htm "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"

The format above is from an Apache log. Depending on the type of server the site is on, the log entries may look different. Thousands (or even hundreds of thousands) of entries such as the one above are placed into a plain text file, called the server log. The above log entry includes the following information:

IP address of the requesting computer: 161.184.77.234. This is not the user's own IP address, but rather the address of the host machine they have connected through.
Date and time of the request: [18/Sep/2001:01:24:45 +0000]. That is September 18, 2001 at 1:24:45 am; the +0000 offset means the time is recorded in GMT (the offset reflects the time zone of the server, not of the user).
The full HTTP request: "GET /images/buy_now-a.gif HTTP/1.1"

a. Request method: GET
b. Requested file: /images/buy_now-a.gif
c. HTTP protocol version: HTTP/1.1
HTTP response code: 200. This particular code means the request was OK.
Response size: 1388 bytes. This is the size of the file that was returned.
Referring document: http://www.123loganalyzer.com/buy.htm
User-agent string (browser and operating system information): "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"

B. Preprocessing of Web Access Log Data

In preprocessing the web log data [7, 9], we remove the parts of the original web log data that are not relevant to our mining process. After that, we get the web log data shown in Figure 3.

Figure 3: Preprocessed web log data

161.184.77.234 is the client IP address, which can be used to mine personal usage; the result can be applied in Personalized Systems, Recommender Systems and Pre-fetched Systems. 18/Sep/2001:01:24:45 is the date and time, which is intended to support Web Site Maintainers. GET /images/buy_now.gif is the full HTTP request, which contains the HTTP request method and the file. It helps web site developers know which keywords are frequently used by users. http://www.123loganalyzer.com is the domain name; our analyzer uses it to determine which IP frequently uses which sites for which purposes. Mozilla/4.0 is the user agent; our analyzer uses it to determine which browser the user used.

C. Mining of Usage Information

The analyzer [1] identifies the different IP addresses making requests, the different methods, the web browsers used, the number of hits on the server, and so on. To fulfill our purposes, we propose our own procedures, which may be useful for various domain areas.

Procedure 1.1: find the different IP addresses that send requests to the server
Output: list of distinct IP addresses
    string str = "select distinct IPAddress from D";
    Ips.Capacity = i+1;
    Ips.Add(distinctIP);
    Return IPList;

The above procedure creates a list of the different IP addresses that navigate the web server. This list helps user profiling create a data set for a particular user.
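The field breakdown of the sample log entry and the distinct-IP step of Procedure 1.1 can be sketched in Python as follows. This is a minimal illustration rather than the analyzer's actual code: the regular expression, the function names parse_line and distinct_ips, and the second sample line are assumptions introduced here, and the sketch reads log lines directly instead of querying a database table.

```python
import re

# Assumed pattern for an Apache-style access log line.
# The referrer is accepted with or without surrounding quotes.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"?(?P<referrer>[^" ]*)"? '
    r'"(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of named fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def distinct_ips(records):
    """Return the distinct client IP addresses in first-seen order."""
    return list(dict.fromkeys(record["ip"] for record in records))


# The first line is the sample entry from the text; the second is hypothetical.
SAMPLE_LINES = [
    '161.184.77.234 - - [18/Sep/2001:01:24:45 +0000] '
    '"GET /images/buy_now-a.gif HTTP/1.1" 200 1388 '
    'http://www.123loganalyzer.com/buy.htm '
    '"Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"',
    '10.0.0.7 - - [18/Sep/2001:01:25:10 +0000] '
    '"GET /buy.htm HTTP/1.1" 200 5120 '
    '"http://www.123loganalyzer.com/" '
    '"Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"',
]

# Lines that do not match the pattern are skipped.
records = [r for r in map(parse_line, SAMPLE_LINES) if r is not None]
ips = distinct_ips(records)
```

Keeping the parsed records as plain dictionaries means the same list can feed the later profiling and mining steps without a database.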
Procedure 1.2: find the different agents used by users
Output: list of distinct agents
    string str = "select distinct agent from tblinfo1";
    Return AgentList;

The above procedure returns a list of the browsers used by different users.

Procedure 1.3: find the different sessions
Output: list of sessions
    string str = "select source from tblinfo1 where source='new'";
    Return session;

This procedure helps to identify the different user sessions, which in turn helps to create the user profile.
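Session identification can be illustrated with a short Python sketch. The paper's procedure selects rows whose source field is marked 'new'; the sketch below instead uses a common inactivity-timeout heuristic, which is an assumption and not necessarily the rule the analyzer applies. The 30-minute threshold and the names SESSION_TIMEOUT, parse_time and assign_sessions are likewise hypothetical.

```python
from datetime import datetime, timedelta

# Assumed inactivity threshold; the paper does not state one.
SESSION_TIMEOUT = timedelta(minutes=30)


def parse_time(stamp):
    """Parse an Apache-style timestamp such as '18/Sep/2001:01:24:45 +0000'."""
    return datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")


def assign_sessions(requests):
    """Label each (ip, timestamp) request with a per-IP session number.

    A request opens a new session when it is the first one seen for its IP
    or when the gap since that IP's previous request exceeds SESSION_TIMEOUT.
    Results are returned in chronological order.
    """
    last_seen = {}    # ip -> time of that IP's previous request
    session_no = {}   # ip -> current session number for that IP
    labelled = []
    for ip, stamp in sorted(requests, key=lambda r: parse_time(r[1])):
        when = parse_time(stamp)
        if ip not in last_seen or when - last_seen[ip] > SESSION_TIMEOUT:
            session_no[ip] = session_no.get(ip, 0) + 1
        last_seen[ip] = when
        labelled.append((ip, stamp, session_no[ip]))
    return labelled


# Hypothetical requests: three from one IP, one from another.
requests = [
    ("161.184.77.234", "18/Sep/2001:01:24:45 +0000"),
    ("161.184.77.234", "18/Sep/2001:01:30:00 +0000"),
    ("161.184.77.234", "18/Sep/2001:03:00:00 +0000"),
    ("10.0.0.1", "18/Sep/2001:01:25:00 +0000"),
]
sessions = assign_sessions(requests)
```

The session number produced here plays the role of the extra session column that the user profile adds to the preprocessed data.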

International Journal of Emerging Technology and Advanced Engineering

D. User Profiling

The user profile [6] contains the information related to a user's navigation of the web server. The different user sessions are maintained in the user profile. It contains all the fields of the user's preprocessed data and one column that represents the session number. It creates a dataset according to the IP address of the user, which helps to process the user's navigation across different web sites.

Figure 4: A general architecture for web usage mining

Procedure 1.4: find the different sessions of a user
Output: list of sessions
    string str = "select * from D where IPAddress='selected_IP'";
    dataset.capacity = i+1;
    dataset.add(str);
    Return dataset;

This procedure helps to identify the different sessions of a user, which helps to create the user profile.

E. Web Log Classification

A decision tree [7] is used to classify data using attributes. The tree consists of decision nodes and leaf nodes. A decision node has two or more branches, each representing a value of the attribute tested. A leaf node produces a homogeneous result (all in one class), which does not require additional classification testing. Two different decision tree algorithms, ID3 and C4.5 [11, 12], are used to classify the user-profiled data and predict the response of the server to a user request. They work on the entropy and information gain with respect to different attributes and generate a tree.

The working of ID3:

ID3 is a mathematical algorithm [11] for building the decision tree, invented by J. Ross Quinlan in 1979. It uses information theory, invented by Shannon in 1948. Information gain is used to select the most useful attribute for classification. First, the entropy of the total dataset is calculated. The dataset is then split on the different attributes. The entropy of each branch is calculated and then added proportionally to get the total entropy for the split.
The resulting entropy is subtracted from the entropy before the split. The result is the information gain (IG), or decrease in entropy. The attribute that yields the largest IG is chosen for the decision node. A branch set with an entropy of 0 is a leaf node. Otherwise, the branch needs further splitting to classify its dataset. The ID3 algorithm is run recursively on the non-leaf branches until all data is classified.

The formulas to calculate entropy and information gain:

$$\mathrm{Entropy}(S) = -\sum_{i=1}^{k} p_i \log_2 p_i$$

$$\mathrm{Gain}(A) = E(\text{current set}) - \sum E(\text{all child sets})$$

Here p_i is the proportion of samples in S that belong to class i, and the child-set entropies are added proportionally to the size of each child set.

The working of C4.5:

If S is any set of samples, let freq(C_i, S) stand for the number of samples in S that belong to class C_i (out of k possible classes), and let |S| denote the number of samples in the set S. Then the entropy of the set S is:

$$\mathrm{Info}(S) = -\sum_{i=1}^{k} \frac{\mathrm{freq}(C_i, S)}{|S|} \log_2 \frac{\mathrm{freq}(C_i, S)}{|S|}$$

After the set T has been partitioned in accordance with the n outcomes of one attribute test X, the test entropy is:

$$\mathrm{Info}_X(T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|} \, \mathrm{Info}(T_i)$$

The gain is then identified with respect to attribute X:

$$\mathrm{Gain}(X) = \mathrm{Info}(T) - \mathrm{Info}_X(T)$$

Criterion: select the attribute with the highest Gain value.

F. Mining Frequent Item Sets

This step discovers and extracts the frequent patterns of client usage with respect to request, method, web site and user agent. A web usage mining algorithm for frequent pattern analysis [12, 13] is used to identify the different frequent patterns of the user. First, the classical Apriori algorithm is applied to the user dataset. Because some problems with its performance were observed, a new modified Apriori algorithm was developed to increase the performance of the analyzer and reduce the time required.

Given an itemset X = {x_1, ..., x_k}, find all the rules X → Y with minimum support and confidence:

$$\mathrm{support}(X \rightarrow Y) = \frac{\text{number of tuples containing } X \text{ and } Y}{\text{total number of tuples}}$$

$$\mathrm{confidence}(X \rightarrow Y) = \frac{\text{number of tuples containing } X \text{ and } Y}{\text{number of tuples containing } X}$$

AYL Modify Apriori Algorithm

The traditional Apriori algorithm [12] is the one most frequently used by different researchers and groups to mine log data. This algorithm has a performance problem: we observe that as the number of item sets increases, the time and memory required increase exponentially. To overcome this problem, we propose a new Apriori algorithm.

Apriori(T, ms, input set)
    Initialize k ← 1, C_1 ← 1-itemsets
    L_1 ← frequent 1-itemsets
    k ← 2
    While L_{k-1} ≠ ∅
        C_k ← {c | c = a ∪ b, a ∈ L_{k-1}, b ∈ L_{k-1}, b ≠ a}
        For transactions t ∈ T
            If (t == input set)
                C_t ← {c | c ⊆ t ∧ |c| = k}
                For each c with c ∈ C_k and c ∈ C_t
                    Count[c] ← Count[c] + 1
        L_k ← {c | c ∈ C_k ∧ Count[c] ≥ ms}
        k ← k + 1
    Return ∪_k L_k

V. IMPLEMENTATION RESULTS

The work focused on extracting different hidden information from web log files. User profiling helps to analyze different users and their navigation patterns among the web sites with more accuracy and in less time. The use of enhanced versions of the ID3 and C4.5 decision tree algorithms for classification provides accurate prediction of the server's response to different requests.
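As a small numeric check of the entropy and gain formulas from the classification step (Section IV.E), the following Python sketch computes information gain on a hypothetical four-row dataset and selects the attribute with the highest gain. The rows, the attribute names (method, agent, status) and the function names are illustrative assumptions and are not taken from the analyzer or its data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)) over the classes present in labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Gain(X) = Info(T) - Info_X(T) for splitting rows on one attribute."""
    before = entropy([row[target] for row in rows])
    # Partition the target labels by the value of the tested attribute.
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    # Weighted ("proportional") sum of the child-set entropies.
    after = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return before - after


def best_attribute(rows, attributes, target):
    """Return the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical user-profiled rows: the request method fully determines
# the response code, while the user agent carries no information about it.
rows = [
    {"method": "GET", "agent": "Mozilla", "status": "200"},
    {"method": "GET", "agent": "Opera", "status": "200"},
    {"method": "POST", "agent": "Mozilla", "status": "404"},
    {"method": "POST", "agent": "Opera", "status": "404"},
]

gain_method = information_gain(rows, "method", "status")
gain_agent = information_gain(rows, "agent", "status")
chosen = best_attribute(rows, ["method", "agent"], "status")
```

On this toy data the split on method removes all uncertainty about the response code, so it is the attribute a gain-based tree builder would test first.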
Table 2: Comparison between ID3 and C4.5 with different attributes

The graphs compare the ID3 and C4.5 algorithms with respect to accuracy and search time.

Table 3: Comparison between the traditional Apriori and modified Apriori algorithms with different attributes

The graphs compare the traditional Apriori and modified Apriori algorithms with respect to accuracy and search time.

Figures 5 and 6: Model build time and search time comparison between the ID3 and C4.5 algorithms

The improved Apriori algorithm for finding frequently accessed patterns reduces time consumption and improves accuracy. The rules generated by the modified Apriori algorithm are:

1, 0.8752 - the user uses the GET method 87% of the time.
3, 0.751 - the user uses http://www.123loganalyzer.com 75% of the time.
4, 0.752 - the user uses Mozilla/4.0 75% of the time.
3 4, 0.75 - the user uses http://www.123loganalyzer.com and Mozilla/4.0 75% of the time.
1 2 4, 0.625 - the user uses the GET method, /images/buy_now.gif and Mozilla/4.0 62.5% of the time.
1 3 4, 0.752 - the user uses GET, http://www.123loganalyzer.com and Mozilla/4.0 75% of the time.

Figures 7 and 8: Accuracy and search time comparison between the traditional Apriori and modified Apriori algorithms
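The support and confidence definitions from Section IV.F, and the level-wise search they drive, can be sketched in Python as below. This is the classical Apriori procedure, shown only for orientation; it does not implement the modified algorithm proposed in this paper. The four transactions and the item labels are hypothetical and do not reproduce the reported rules.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """support(lhs and rhs) / support(lhs) for the rule lhs -> rhs."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    # Level 1: frequent single items.
    level = {frozenset([i]) for i in items
             if support([i], transactions) >= min_support}
    frequent = {}
    k = 2
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset, transactions)
        # Join step: unions of two frequent (k-1)-itemsets with exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
        level = {c for c in candidates
                 if support(c, transactions) >= min_support}
        k += 1
    return frequent


# Hypothetical transactions: one set of request attributes per log entry.
transactions = [
    {"GET", "site", "Mozilla"},
    {"GET", "site", "Mozilla"},
    {"GET", "site", "Mozilla"},
    {"GET", "Opera"},
]
frequent = apriori(transactions, min_support=0.5)
rule_conf = confidence({"GET"}, {"site"}, transactions)
```

Each returned itemset maps to its support, which corresponds to the numeric value listed beside each rule above.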

VI. CONCLUSION

The work focused on extracting different hidden information from web log files. The new approach to mining web log data using user profiling helps to analyze different users and their navigation patterns among the web sites with more accuracy and in less time. The use of enhanced versions of the ID3 and C4.5 decision tree algorithms for classification provides accurate prediction of the server's response to different requests. The improved Apriori algorithm for finding frequently accessed patterns reduces time consumption and improves accuracy. Our system provides analyzed data for Web Site Maintainers, Web Site Developers, Personalization Systems, Pre-fetched Systems, Recommendation Systems, Web Site Analysts and others, to improve the performance of the web with reference to the patterns navigated by regular, interested users.

REFERENCES

[1] Theint Theint Shwe, Thida Myint, "Framework for Multi-purpose Web Log Access Analyzer", 978-1-4244-6349-7/10/$26.00, 2010 IEEE, V3-289.
[2] Kosala, R., Blockeel, H. (2000). "Web Mining Research: A Survey", ACM SIGKDD Explorations 2(1):1-15.
[3] Cooley, R., Mobasher, B., Srivastava, J. (1997). "Web Mining: Information and Pattern Discovery on the World Wide Web", 9th International Conference on Tools with Artificial Intelligence (ICTAI 97), Newport Beach, CA, USA, IEEE Computer Society, 558-567.
[4] Brijendra Singh, Hemant Kumar Singh, "Web Data Mining Research: A Survey", 978-1-4244-5967-4/10/$26.00, 2010 IEEE.
[5] Karl Groves, "The Limitations of Server Log Files for Usability Analysis", 2007/10/24.
[6] Georgios Lappas, "From Web Mining to Social Multimedia Mining", 978-0-7695-4375-8/11 $26.00, 2011 IEEE, DOI 10.1109/ASONAM.2011.
[7] Tasawar Hussain, Dr. Sohail Asghar, Dr. Nayyer Masood, "Web Usage Mining: A Survey on Preprocessing of Web Log File", 978-1-4244-8003-6/10, 2010.
[8] Bamshad Mobasher, Robert Cooley, Jaideep Srivastava, "Personalization Based on Web Usage Mining", Communications of the ACM, Vol. 43, No. 8, August 2000.
[9] Theint Theint Aye, "Web Log Cleaning for Mining of Web Usage Patterns", 978-1-61284-840-2/11/$26.00, 2011 IEEE.
[10] Peng Zhu, Ming-sheng Zhao, "Session Identification Algorithm for Web Log Mining", 978-1-4244-5326-9/10/$26.00, 2010 IEEE.
[11] Stmik Amilkom Yogyakarta, "Implementation of C4.5 algorithm to evaluate the cancellation possibility of new student application at", ISBN 978-979-16338-0-2.
[12] Wei Peng, Juhua Chen, Haiping Zhou, "An Implementation of ID3 Decision Tree Learning Algorithm".
[13] Sandeep Singh Rawat, Lakshmi Rajamani, "Probability Apriori based Approach to Mine Rare Association Rules", 978-1-61284-212-7/11/$26.00, 2011 IEEE.
[14] Huiping Peng, "Discovery of Interesting Association Rules Based on Web Usage Mining", 978-0-7695-4136-5/10 $26.00, 2010 IEEE.
[15] Yanyu Zhang, Yonggong Ren, "A of Predicting Users' Behaviors Based on Inter-transaction Association Rules", 978-0-7695-3874-7/09 $25.00, 2009 IEEE.