Pre-Processing: Procedure on Web Log File for Web Usage Mining




Shaily Langhnoja 1, Mehul Barot 2, Darshak Mehta 3
1 Student M.E. (C.E.), L.D.R.P. ITR, Gandhinagar, India
2 Asst. Professor, C.E. Dept., L.D.R.P. ITR, Gandhinagar, India
3 Lecturer, Government Polytechnic, Gandhinagar, India

Abstract
The World Wide Web has become a very popular and interactive medium for transferring information. Web usage mining is the area of data mining that deals with the discovery and analysis of usage patterns from Web data, specifically web logs, in order to improve web-based applications. It consists of three phases: preprocessing, pattern discovery, and pattern analysis. After these three phases are complete, the user can find the required usage patterns and use this information for specific needs. The web access log file keeps a record of every request made by users. However, the data stored in the log files does not give accurate details of users' accesses to the Web site. Preprocessing of the web log data is therefore the first and an important phase before the web log file can be used for pattern discovery and pattern analysis. The preprocessed web log file is then suitable for the discovery and analysis of useful information, referred to as Web mining. This paper gives a detailed description of how pre-processing is done on a web log file before it is sent to the next stages of web usage mining.

Keywords: Web Mining, Web Usage Mining, Web Log file, Data cleansing, Preprocessing

I. INTRODUCTION
With the continued growth and proliferation of e-commerce, Web services, and Web-based information systems, the volume of clickstream and user data collected by Web-based organizations in their daily operations has reached astronomical proportions.
Analyzing such data can help these organizations determine the life-time value of clients, design cross-marketing strategies across products and services, evaluate the effectiveness of promotional campaigns, optimize the functionality of Web-based applications, provide more personalized content to visitors, and find the most effective logical structure for their Web space. This type of analysis involves the automatic discovery of meaningful patterns and relationships from a large collection of primarily semi-structured data, often stored in Web and application server access logs, as well as in related operational data sources.

Web usage mining refers to the automatic discovery and analysis of patterns in clickstream and associated data collected or generated as a result of user interactions with Web resources on one or more Web sites. The goal is to capture, model, and analyze the behavioral patterns and profiles of users interacting with a Web site. The discovered patterns are usually represented as collections of pages, objects, or resources that are frequently accessed by groups of users with common needs or interests. Following the standard data mining process, the overall Web usage mining process can be divided into three interdependent stages: data collection and pre-processing, pattern discovery, and pattern analysis.

This paper describes what a web log file is, where it is located, its different formats, and the preprocessing applied to it. Pre-processing of a web log file includes data cleansing, user identification, and session identification.

II. WEBLOG FILE
Web log files are files that contain information about website visitor activity. Log files are created automatically by web servers. Each time a visitor requests any file (page, image, etc.) from the site, information about the request is appended to the current log file. Most log files are in text format, and each log entry (hit) is saved as a line of text. Log file sizes range from 1 KB to 100 MB.

A. Location of weblog file:
A web log file can be located in three different places.

Web server logs: Web server log files provide the most accurate and complete usage data. However, the log file does not record visits to cached pages. The data in log files is sensitive, personal information, so the web server keeps it closed.

Web proxy server: A web proxy server takes the HTTP request from the user and passes it to the web server; the result is then passed back to the proxy server and returned to the user. The client sends its request to the web server via the proxy server.

The two disadvantages of this approach are:
Proxy-server construction is a difficult task; it requires advanced network programming, such as TCP/IP.
The request interception is limited.

Client browser: The log file can reside in the client's browser itself, using HTTP cookies. HTTP cookies are pieces of information generated by a web server and stored on the user's computer, ready for future access.

B. Type of web log file:
There are four types of server logs.

Access log file: Records data about all incoming requests and information about the clients of the server. The access log records all requests that are processed by the server.
Error log file: Lists internal errors. Whenever an error occurs while a page is being requested by a client from the web server, an entry is made in the error log.
Agent log file: Records information about the user's browser and browser version.
Referrer log file: Records information about the links that direct visitors to the site.

Access and error logs are the most commonly used; agent and referrer logs may or may not be enabled at the server.

C. Web log file format:
A web log file is a simple plain text file which records information about each user request. Log file data is presented in three different formats:
W3C Extended log file format
NCSA common log file format
IIS log file format

In the NCSA and IIS log file formats, the data logged for each request is fixed. The W3C format allows the user to choose the properties to be logged for each request.

1. W3C Extended log file format
The W3C log format is the default log file format on the IIS server. Fields are separated by spaces, and time is recorded as GMT (Greenwich Mean Time). The format can be customized: administrators can add or remove fields depending on what information they want to record. In the W3C format, the date is written as YYYY-MM-DD. Omitting unwanted fields is useful when log file size is limited [w3c]. In Fig. 1, #Software shows the version of IIS that is running, #Version shows the log file format version, and #Date shows the date and time at which the first log entry was recorded.
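Because the #Fields directive names the columns that follow, a W3C extended log can be read without hard-coding its layout. The following is a minimal Python sketch of that idea, not the tooling used in this paper. The function name `parse_w3c_log` and the shortened sample lines (including the IP address 192.0.2.10) are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Iterator, List


def parse_w3c_log(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per log entry, keyed by the names in the #Fields directive."""
    fields: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("#"):
            # Directive lines such as #Software, #Version, #Date and #Fields.
            name, _, value = line[1:].partition(":")
            if name.strip().lower() == "fields":
                fields = value.split()
            continue
        values = line.split()  # W3C fields are separated by whitespace
        # Only accept entries whose column count matches the declared fields.
        if fields and len(values) == len(fields):
            yield dict(zip(fields, values))


# Illustrative input only; these lines are not taken from the paper's log.
sample = [
    "#Software: Microsoft Internet Information Services 7.5",
    "#Version: 1.0",
    "#Date: 2012-12-05 08:25:10",
    "#Fields: date time c-ip cs-method cs-uri-stem sc-status",
    "2012-12-05 08:25:10 192.0.2.10 GET /index.html 200",
]
entries = list(parse_w3c_log(sample))
print(entries[0]["cs-uri-stem"], entries[0]["sc-status"])
```

Keying each entry by field name means the same code keeps working when an administrator adds or removes fields, which is the main practical benefit of the customizable W3C format.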
#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2012-12-05 08:25:10
#Fields: date time c-ip cs-username s-ip cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken cs-version cs(User-Agent) cs(Cookie) cs(Referer)
1998-11-19 22:48:39 206.175.82.5 - 208.201.133.173 GET /global/images/navlineboards.gif 200 540 324 157 HTTP/1.0 Mozilla/4.0+(compatible;+MSIE+4.01;+Windows+95) USERID=CustomerA;+IMPID=01234 http://www.loganalyzer.net

Fig.1. Example of W3C log file format

2. NCSA common log file format
The NCSA Common log file format is a fixed ASCII text-based format, so it cannot be customized. It is available for Web sites and for SMTP and NNTP services, but it is not available for FTP sites. Because HTTP.sys handles the NCSA Common log file format, this format records HTTP.sys kernel-mode cache hits. The NCSA Common log file format records the following data:
Remote host address
Remote log name (This value is always a hyphen.)
User name
Date, time, and Greenwich Mean Time (GMT) offset
Request and protocol version
Service status code (A value of 200 indicates that the request was fulfilled successfully.)
Bytes sent

216.67.1.91 - leon [01/Jul/2002:12:11:52 +0000] "GET /index.html HTTP/1.1" 200 431

Fig.2 Example of NCSA log file format

3. IIS log file format
The IIS log file format is a fixed ASCII text-based format, so it cannot be customized. Because HTTP.sys handles the IIS log file format, this format records HTTP.sys kernel-mode cache hits. The IIS log file format records the following data:
Client IP address
User name
Date
Time
Service and instance
Server name
Server IP address
Time taken
Client bytes sent
Server bytes sent
Service status code (A value of 200 indicates that the request was fulfilled successfully.)
Windows status code (A value of 0 indicates that the request was fulfilled successfully.)
Request type
Target of operation

172.16.255.255, anonymous, 03/20/01, 23:58:11, MSFTPSVC, SALES1, 172.16.255.255, 60, 275, 0, 0, 0, PASS, /Intro.htm

Fig.3 Example of IIS log file format

III. PHASE 1: PREPROCESSING
Several pre-processing tasks must be completed before data mining algorithms can be run on the web server logs. These include data cleansing, user identification, and session identification.

Fig.4 Data Pre-Processing Steps in Web Usage Mining

A. Data Cleansing
The purpose of data cleaning is to remove irrelevant items stored in the log files that are not useful for analysis. When a user accesses an HTML document, the embedded images, if any, are also automatically downloaded and recorded in the server log. Since the main objective of data preprocessing is to obtain only the usage data, file requests that the user did not explicitly make can be eliminated. This is done by checking the suffix of the URL name; for example, log entries with file name suffixes such as gif, jpeg, GIF, JPEG, jpg and JPG can be removed. In addition, erroneous requests can be removed by checking the status of the request (such as status code 404). Data cleaning also involves removing references that result from spider navigation, which can be done by maintaining a list of spiders or through heuristic identification of spiders and Web robots. The cleaned log represents the users' accesses to the Web site.

Algorithm for Data Cleansing
This paper uses a dedicated algorithm for the data cleansing step of the pre-processing stage. It retrieves useful information from the web log file and eliminates unnecessary data.
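The cleaning criteria described in this section (URL suffix, request status, and robot detection) can be illustrated with a short Python sketch. It is a minimal example under stated assumptions rather than the implementation used in this paper: entries are assumed to be dicts keyed by W3C field names, the list `ROBOT_MARKERS` is a hypothetical stand-in for a maintained spider list, and the application-specific domain check is omitted.

```python
from typing import Dict, List

PAGE_EXTENSIONS = (".aspx", ".html")          # page types to keep
ROBOT_MARKERS = ("bot", "crawler", "spider")  # hypothetical user-agent substrings


def is_relevant(entry: Dict[str, str]) -> bool:
    """Return True if a parsed log entry should survive data cleansing."""
    uri = entry.get("cs-uri-stem", "").lower()
    status = entry.get("sc-status", "")
    agent = entry.get("cs(User-Agent)", "").lower()
    if not uri.endswith(PAGE_EXTENSIONS):
        return False  # embedded resources such as gif, jpg, jpeg files
    if status[:1] in ("4", "5"):
        return False  # failed requests, for example status code 404
    if any(marker in agent for marker in ROBOT_MARKERS):
        return False  # requests that appear to come from spiders or robots
    return True


# Illustrative entries only; paths and agents are made up for this sketch.
browser = "Mozilla/5.0+(Windows+NT+6.1)"
sample: List[Dict[str, str]] = [
    {"cs-uri-stem": "/itinfo/default.aspx", "sc-status": "200", "cs(User-Agent)": browser},
    {"cs-uri-stem": "/itinfo/images/login.jpg", "sc-status": "200", "cs(User-Agent)": browser},
    {"cs-uri-stem": "/missing.html", "sc-status": "404", "cs(User-Agent)": browser},
    {"cs-uri-stem": "/index.html", "sc-status": "200", "cs(User-Agent)": "Googlebot/2.1"},
]
cleaned = [entry for entry in sample if is_relevant(entry)]
print(len(sample), "->", len(cleaned))  # prints: 4 -> 1
```

Keeping a whitelist of page extensions, rather than a blacklist of image suffixes, is one possible design choice: it also discards style sheets, scripts, and PDFs without listing each suffix, at the cost of dropping requests such as "/" that carry no extension.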
The input is the raw web log file. It is processed, and the output is the processed web log file, whose data is inserted into a database table.

Input: raw web log file.
Output: processed web log file.
1. for each line in web log file do
2.   if length of line is more than one character then   #Avoid blank lines
3.     if line does not start with # then   #Avoid comments
4.       if link name contains domain name then   #Consider application-specific links only
5.         if page extension is aspx or html then   #Eliminate non-page links like images, PDFs
             insert query for adding log data in database

B. User & Session Identification
To identify each user and session uniquely, we can use measures such as IP address, operating system, browser, and time-out period. Once the data cleansing step has been performed, all useful data records are available in the database and irrelevant entries have been removed. The remaining steps can therefore work on the database rows directly.

Algorithm for User & Session Identification
The algorithm for user and session identification is as follows:

Input: processed web log file
Output: identification of user & session.
1. for each record in dataset do
2.   if currentip is not in ListOfIP then add currentip in ListOfIP
3.   else if currentos is not in ListOfOS then add currentos in ListOfOS
4.   else if currentbrowser is not in ListOfBrowser then add currentbrowser in ListOfBrowser
5.   else if current record timestamp is more than 1800 seconds   #30 minutes * 60 seconds
6.   else mark current record with existing sessionid and userid
   end if
   end of loop

This algorithm marks each record in the database with its identified user and session group, which can then be used in the later stages of the web usage mining process. The resulting groups of records can be inserted into the database, and they yield useful figures such as the total number of users, the total number of sessions, and the difference between the number of records before and after preprocessing.

IV. EXPERIMENTAL RESULTS
We conducted several experiments on log files collected from the Government Polytechnic, Gandhinagar website. During the data cleansing step, all irrelevant entries are removed. A sample of the raw web log file is shown below:

#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2012-11-19 04:36:21
#Fields: date time s-sitename s-computername s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs-version cs(User-Agent) cs(Cookie) cs(Referer) cs-host sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken
2012-11-19 04:36:21 W3SVC1 DARSHAK 172.16.1.252 GET / - 80 - 172.16.1.247 HTTP/1.1 Mozilla/5.0+(Windows+NT+6.1)+AppleWebKit/537.11+(KHTML,+like+Gecko)+Chrome/23.0.1271.64+Safari/537.11 - - 172.16.1.252 200 0 0 1324 367 6334
2012-11-19 04:36:21 W3SVC1 DARSHAK 172.16.1.252 GET /itinfo/images/login.jpg - 80 - 172.16.1.247 HTTP/1.1 Mozilla/5.0+(Windows+NT+6.1)+AppleWebKit/537.11+(KHTML,+like+Gecko)+Chrome/23.0.1271.64+Safari/537.11 - http://172.16.1.252/ 172.16.1.252 200 0 0 20819 361 79

Fig.5. Sample Web Log File

The web log file is selected for the cleansing operation as shown below:

Fig.6. Data Cleansing Process

After data cleansing is complete, the web server log file is clean and its data is ready to be loaded into a relational database. Here the data is loaded and stored in MS SQL Server 2008.

Fig.7.
Processed Web Log File

Since the Government Polytechnic, Gandhinagar site is mostly accessed by students in the computer laboratories without passing through a proxy server, we simply use the machines' IP addresses to identify unique users. The results obtained after the pre-processing step are shown in Table 1.
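The user and session identification step of Section III.B can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: a user is taken to be a distinct combination of client IP address and user-agent string (which carries the operating system and browser), records are assumed to be sorted by time, and a new session is assumed to start whenever the same user has been idle for more than 1800 seconds. The function name `assign_sessions` and the sample records are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

SESSION_TIMEOUT = timedelta(seconds=1800)  # 30 minutes * 60 seconds


def assign_sessions(records: List[Dict[str, str]]) -> List[Tuple[int, int]]:
    """Return a (user_id, session_id) pair for each record, in input order."""
    user_ids: Dict[Tuple[str, str], int] = {}
    last_seen: Dict[int, datetime] = {}
    current_session: Dict[int, int] = {}
    next_session = 0
    labels: List[Tuple[int, int]] = []
    for rec in records:
        # Assumption: IP address plus user agent identifies one user.
        key = (rec["c-ip"], rec["cs(User-Agent)"])
        user = user_ids.setdefault(key, len(user_ids) + 1)
        stamp = datetime.strptime(rec["date"] + " " + rec["time"], "%Y-%m-%d %H:%M:%S")
        previous = last_seen.get(user)
        # Assumption: a first visit or an idle gap over the timeout opens a new session.
        if previous is None or stamp - previous > SESSION_TIMEOUT:
            next_session += 1
            current_session[user] = next_session
        last_seen[user] = stamp
        labels.append((user, current_session[user]))
    return labels


# Illustrative records only; the values are made up for this sketch.
agent = "Mozilla/5.0+(Windows+NT+6.1)"
records = [
    {"date": "2012-11-19", "time": "04:36:21", "c-ip": "172.16.1.247", "cs(User-Agent)": agent},
    {"date": "2012-11-19", "time": "04:40:00", "c-ip": "172.16.1.247", "cs(User-Agent)": agent},
    {"date": "2012-11-19", "time": "05:30:00", "c-ip": "172.16.1.247", "cs(User-Agent)": agent},
    {"date": "2012-11-19", "time": "05:31:00", "c-ip": "172.16.1.248", "cs(User-Agent)": agent},
]
labels = assign_sessions(records)
print(labels)  # prints: [(1, 1), (1, 1), (1, 2), (2, 3)]
```

Counting the distinct user and session identifiers in `labels` gives totals of the kind reported for the experiment. When only IP addresses are used to identify users, as in the laboratory setting described here, the key can be reduced to `rec["c-ip"]` alone.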

TABLE 1
RESULTS AFTER PRE-PROCESSING

Total No. of Users   Total No. of Sessions   Rows in Web Log File   Total Rows after pre-processing
18                   68                      1217                   411

V. CONCLUSION
Web usage mining is an emerging area of research and an important sub-domain of data mining. To take full advantage of web usage mining and its techniques, it is important to carry out the preprocessing stage efficiently and effectively. This paper covers the main areas of preprocessing, including data cleansing, user identification, and session identification. Once the preprocessing stage has been performed well, data mining techniques such as clustering, association, and classification can be applied for web usage mining applications such as business intelligence, e-commerce, e-learning, and personalization.

REFERENCES
[1] Theint Theint Aye. 2011. Web Log Cleaning for Mining of Web Usage Patterns. IEEE.
[2] K.R. Suneetha and Dr. R. Krishnamoorthi. 2009. Identifying User Behavior by Analyzing Web Server Access Log File. IJCSNS.
[3] R. Cooley, Bamshad Mobasher and Jaideep Srivastava, "Data Preparation for Mining World Wide Web Browsing Patterns." Knowledge and Information Systems, 1(1), 1999, 5-32.
R. Kosala and H. Blockeel, "Web Mining Research: A Survey." ACM SIGKDD Explorations, 2000, 1-15.
[4] R. Cooley, B. Mobasher and J. Srivastava, "Web mining: Information and pattern discovery on the World Wide Web." 9th IEEE International Conference on Tools with Artificial Intelligence. CA, 1997, 558-567.