Pre-Processing: Procedure on Web Log File for Web Usage Mining

Shaily Langhnoja (1), Mehul Barot (2), Darshak Mehta (3)
(1) Student M.E. (C.E.), L.D.R.P. ITR, Gandhinagar, India
(2) Asst. Professor, C.E. Dept., L.D.R.P. ITR, Gandhinagar, India
(3) Lecturer, Government Polytechnic, Gandhinagar, India

Abstract
The World Wide Web has become a very popular and interactive medium for transferring information. Web usage mining is the area of data mining that deals with the discovery and analysis of usage patterns from Web data, specifically web logs, in order to improve web-based applications. Web usage mining consists of three phases: preprocessing, pattern discovery, and pattern analysis. After completing these three phases, the user can find the required usage patterns and use this information for specific needs. The web access log file keeps a record of every request made by users. However, the data stored in the log files does not give accurate details of users' accesses to the Web site. Preprocessing of the Web log data is therefore the first and an important phase before the web log file can be used for pattern discovery and pattern analysis. The preprocessed Web log file is then suitable for the discovery and analysis of useful information, referred to as Web mining. This paper gives a detailed description of how pre-processing is done on a web log file before it is passed to the next stages of web usage mining.

Keywords
Web Mining, Web Usage Mining, Web Log file, Data cleansing, Preprocessing

I. INTRODUCTION
With the continued growth and proliferation of e-commerce, Web services, and Web-based information systems, the volumes of clickstream and user data collected by Web-based organizations in their daily operations have reached astronomical proportions.
Analyzing such data can help these organizations determine the life-time value of clients, design cross-marketing strategies across products and services, evaluate the effectiveness of promotional campaigns, optimize the functionality of Web-based applications, provide more personalized content to visitors, and find the most effective logical structure for their Web space. This type of analysis involves the automatic discovery of meaningful patterns and relationships from a large collection of primarily semi-structured data, often stored in Web and application server access logs, as well as in related operational data sources.

Web usage mining refers to the automatic discovery and analysis of patterns in clickstream and associated data collected or generated as a result of user interactions with Web resources on one or more Web sites. The goal is to capture, model, and analyze the behavioral patterns and profiles of users interacting with a Web site. The discovered patterns are usually represented as collections of pages, objects, or resources that are frequently accessed by groups of users with common needs or interests. Following the standard data mining process, the overall Web usage mining process can be divided into three interdependent stages: data collection and pre-processing, pattern discovery, and pattern analysis.

This paper describes what a web log file is, where it is located, its different formats, and the preprocessing applied to it. Pre-processing of a web log file includes data cleansing, user identification, and session identification.

II. WEBLOG FILE
Web log files are files that contain information about website visitor activity. Log files are created automatically by web servers. Each time a visitor requests any file (page, image, etc.) from the site, information about the request is appended to the current log file. Most log files are in text format, and each log entry (hit) is saved as a line of text. Log files range from 1 KB to 100 MB in size.

A. Location of weblog file
A web log file can be located in three different places.

Web server logs: Web server log files provide the most accurate and complete usage data for the web server. However, the log file does not record visits to cached pages. Log file data is sensitive personal information, so the web server keeps it closed.

Web proxy server: A web proxy server takes an HTTP request from the user and passes it to the web server; the result is then passed back to the proxy server, which returns it to the user. The client thus sends its requests to the web server via the proxy server. The two disadvantages are:
Proxy-server construction is a difficult task. Advanced network programming, such as TCP/IP, is required for this construction.
The request interception is limited.

Client browser: The log file can reside in the client's browser itself. HTTP cookies are used for this purpose. These cookies are pieces of information generated by a web server and stored on the user's computer, ready for future access.

B. Type of web log file
There are four types of server logs.

Access log file: Contains data on all incoming requests and information about the clients of the server. The access log records all requests that are processed by the server.
Error log file: Contains a list of internal errors. Whenever an error occurs while a page is being requested by a client from the web server, an entry is made in the error log.
Agent log file: Contains information about the user's browser and browser version.
Referrer log file: Contains information about the link that redirected the visitor to the site.

Access and error logs are the most commonly used; agent and referrer logs may or may not be enabled at the server.

C. Web log file format
A web log file is a simple plain text file that records information about each user request. Log file data is recorded in one of three formats:
W3C Extended log file format
NCSA common log file format
IIS log file format

In the NCSA and IIS log file formats, the data logged for each request is fixed. The W3C format allows the user to choose which properties to log for each request.

1. W3C Extended log file format
The W3C log format is the default log file format on the IIS server. Fields are separated by spaces, and time is recorded in GMT (Greenwich Mean Time). The format can be customized; that is, administrators can add or remove fields depending on what information they want to record. In the W3C format, the date is written as YYYY-MM-DD. Omitting unwanted attribute fields is useful when the log file size is limited [w3c]. In the example of Fig. 1, the directive lines have the following meaning:
#Software: the version of IIS that is running
#Version: the version of the log file format
#Date: the recording date and time of the first log entry
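Because the W3C Extended format names its own columns in a #Fields directive, a reader does not need to hard-code the column order. The following Python sketch illustrates this idea under stated assumptions: the function name parse_w3c_log and the shortened sample log are hypothetical and are not taken from the log files used in this paper.

```python
from typing import Dict, Iterable, Iterator, List


def parse_w3c_log(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per log entry, keyed by the names in the #Fields directive."""
    fields: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("#"):
            # Directive line; only #Fields changes how entries are read.
            if line.lower().startswith("#fields:"):
                fields = line.split(":", 1)[1].split()
            continue
        values = line.split()  # entry fields are separated by spaces
        if fields and len(values) == len(fields):
            yield dict(zip(fields, values))
        # Entries that do not match the declared field count are skipped.


# Hypothetical, shortened log used only to exercise the function.
sample = [
    "#Software: Microsoft Internet Information Services 7.5",
    "#Version: 1.0",
    "#Date: 2012-12-05 08:25:10",
    "#Fields: date time c-ip cs-method cs-uri-stem sc-status",
    "2012-12-05 08:25:10 172.16.1.247 GET /index.html 200",
]
records = list(parse_w3c_log(sample))
print(records[0]["cs-uri-stem"], records[0]["sc-status"])
```

Keying each entry by field name keeps later processing steps independent of which fields the administrator chose to log.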
#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2012-12-05 08:25:10
#Fields: date time c-ip cs-username s-ip cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken cs-version cs(User-Agent) cs(Cookie) cs(Referer)
1998-11-19 22:48:39 206.175.82.5 - 208.201.133.173 GET /global/images/navlineboards.gif 200 540 324 157 HTTP/1.0 Mozilla/4.0+(compatible;+MSIE+4.01;+Windows+95) USERID=CustomerA;+IMPID=01234 http://www.loganalyzer.net
Fig.1. Example of W3C log file format

2. NCSA common log file format
The NCSA Common log file format is a fixed ASCII text-based format, so it cannot be customized. It is available for Web sites and for SMTP and NNTP services, but it is not available for FTP sites. Because HTTP.sys handles the NCSA Common log file format, this format records HTTP.sys kernel-mode cache hits. The NCSA Common log file format records the following data:
Remote host address
Remote log name (this value is always a hyphen)
User name
Date, time, and Greenwich Mean Time (GMT) offset
Request and protocol version
Service status code (a value of 200 indicates that the request was fulfilled successfully)
Bytes sent

216.67.1.91 - leon [01/Jul/2002:12:11:52 +0000] "GET /index.html HTTP/1.1" 200 431
Fig.2. Example of NCSA log file format

3. IIS log file format
The IIS log file format is a fixed ASCII text-based format, so it cannot be customized. Because HTTP.sys handles the IIS log file format, this format records HTTP.sys kernel-mode cache hits. The IIS log file format records the following data:
Client IP address
User name
Date
Time
Service and instance
Server name
Server IP address
Time taken
Client bytes sent
Server bytes sent
Service status code (a value of 200 indicates that the request was fulfilled successfully)
Windows status code (a value of 0 indicates that the request was fulfilled successfully)
Request type
Target of operation

172.16.255.255, anonymous, 03/20/01, 23:58:11, MSFTPSVC, SALES1, 172.16.255.255, 60, 275, 0, 0, 0, PASS, /Intro.htm
Fig.3. Example of IIS log file format

III. PHASE 1: PREPROCESSING
Several pre-processing tasks must be completed before data mining algorithms can be applied to the web server logs. These include data cleansing, user identification, and session identification.

Fig.4. Data Pre-Processing Steps in Web Usage Mining

A. Data Cleansing
The purpose of data cleansing is to remove irrelevant items stored in the log files that are not useful for analysis. When a user accesses an HTML document, the embedded images, if any, are also automatically downloaded and recorded in the server log. Since the main objective of data preprocessing is to obtain only the usage data, file requests that the user did not explicitly make can be eliminated. This can be done by checking the suffix of the URL name; for example, log entries with file name suffixes such as gif, jpeg, GIF, JPEG, jpg and JPG can be removed. In addition, erroneous requests can be removed by checking the status of the request (such as status code 404). Data cleansing also involves removing references that result from spider navigation, which can be done by maintaining a list of spiders or through heuristic identification of spiders and Web robots. The cleaned log represents the user's accesses to the Web site.

Algorithm for Data Cleansing
The algorithm used in this paper for the data cleansing step of the pre-processing stage retains the useful information in the web log file and eliminates unnecessary data.
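Before turning to the steps of that algorithm, the general cleaning criteria described in this section (suffix check, status check, robot check) can be illustrated with a short Python sketch. It is an illustration only, not the implementation used in the experiments: the names is_relevant and clean, the robot marker list, and the choice to treat every status code of 400 or above as erroneous are assumptions, and the sample records are hypothetical.

```python
import posixpath
from typing import Dict, List

# Assumed lists for illustration only; they are not taken from the experiments.
IMAGE_SUFFIXES = {".gif", ".jpeg", ".jpg"}
ROBOT_MARKERS = ("bot", "spider", "crawler")


def is_relevant(record: Dict[str, str]) -> bool:
    """Return True if the record should be kept after cleaning."""
    # 1. Drop embedded images by checking the URL suffix (case-insensitive).
    suffix = posixpath.splitext(record["cs-uri-stem"])[1].lower()
    if suffix in IMAGE_SUFFIXES:
        return False
    # 2. Drop erroneous requests; the 400 threshold is an assumption (404 is one case).
    if int(record["sc-status"]) >= 400:
        return False
    # 3. Drop requests whose user agent contains a known robot marker.
    agent = record.get("cs(User-Agent)", "").lower()
    if any(marker in agent for marker in ROBOT_MARKERS):
        return False
    return True


def clean(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the records that pass all three cleaning checks."""
    return [r for r in records if is_relevant(r)]


# Hypothetical records keyed by W3C field names.
sample = [
    {"cs-uri-stem": "/index.html", "sc-status": "200", "cs(User-Agent)": "Mozilla/5.0"},
    {"cs-uri-stem": "/itinfo/images/login.JPG", "sc-status": "200", "cs(User-Agent)": "Mozilla/5.0"},
    {"cs-uri-stem": "/missing.html", "sc-status": "404", "cs(User-Agent)": "Mozilla/5.0"},
    {"cs-uri-stem": "/index.html", "sc-status": "200", "cs(User-Agent)": "ExampleBot/1.0"},
]
kept = clean(sample)
print(len(kept), kept[0]["cs-uri-stem"])
```

Lower-casing the suffix handles gif and GIF with a single entry in the suffix list, and matching robot markers against the user agent is one simple form of the list-based spider removal mentioned above.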
The input to the data cleansing algorithm is the raw web log file; the output is the processed web log file, whose data is inserted into a database table.

Input: raw web log file.
Output: processed web log file.
1. for each line in web log file do
2.   if length of line is more than one character then   #Avoid blank lines
3.     if line does not start with # then   #Avoid comments
4.       if link name contains domain name then   #Consider application-specific links only
5.         if page extension is aspx or html then   #Eliminate non-page links like images, pdfs
             insert query for adding log data in database

B. User & Session Identification
To identify each user and session uniquely, we can use attributes such as IP address, operating system, browser, and time-out period. Once the data cleansing step above has been performed, all useful data records are available in the database and irrelevant entries have been removed. The remaining processing can therefore work directly on the database rows.

Algorithm for User & Session Identification
The algorithm for user and session identification is as follows:

Input: processed web log file.
Output: identification of users and sessions.
1. for each record in dataset do
2.   if currentip is not in ListOfIP then add currentip to ListOfIP
3.   else if currentos is not in ListOfOS then add currentos to ListOfOS
4.   else if currentbrowser is not in ListOfBrowser then add currentbrowser to ListOfBrowser
5.   else if current record timestamp is more than 1800 seconds after the previous record then start a new session   #30 minutes * 60 seconds
6.   else mark current record with existing sessionid and userid
     end if
   end of loop

In steps 2 to 4, a previously unseen IP address, operating system, or browser is taken to indicate a new user; in step 5, a gap of more than 30 minutes is taken to indicate a new session for an existing user.

When applied, the above algorithm marks each record in the database with its user and session group, which can then be used in the later stages of the web usage mining process. The resulting groups of records can be inserted into the database and used to derive statistics such as the total number of users, the total number of sessions, and the difference between the number of records before and after preprocessing.

IV. EXPERIMENTAL RESULTS
We conducted several experiments on log files collected from the Government Polytechnic, Gandhinagar website. During the data cleansing step, all irrelevant entries are removed. A sample raw web log file is shown below:

#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2012-11-19 04:36:21
#Fields: date time s-sitename s-computername s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs-version cs(User-Agent) cs(Cookie) cs(Referer) cs-host sc-status sc-substatus sc-win32-status sc-bytes cs-bytes time-taken
2012-11-19 04:36:21 W3SVC1 DARSHAK 172.16.1.252 GET / - 80 - 172.16.1.247 HTTP/1.1 Mozilla/5.0+(Windows+NT+6.1)+AppleWebKit/537.11+(KHTML,+like+Gecko)+Chrome/23.0.1271.64+Safari/537.11 - - 172.16.1.252 200 0 0 1324 367 6334
2012-11-19 04:36:21 W3SVC1 DARSHAK 172.16.1.252 GET /itinfo/images/login.jpg - 80 - 172.16.1.247 HTTP/1.1 Mozilla/5.0+(Windows+NT+6.1)+AppleWebKit/537.11+(KHTML,+like+Gecko)+Chrome/23.0.1271.64+Safari/537.11 - http://172.16.1.252/ 172.16.1.252 200 0 0 20819 361 79
Fig.5. Sample Web Log File

The web log file is selected for the cleansing operation as shown below:

Fig.6. Data Cleansing Process

After data cleansing is complete, the web server log file is cleaned and ready to be loaded into a relational database. Here the data is loaded and stored in MS SQL Server 2008.

Fig.7. Processed Web Log File

Since the Government Polytechnic, Gandhinagar site is mostly accessed by students in the computer laboratories without passing through a proxy server, we simply use the machines' IP addresses to identify unique users. The results obtained after the pre-processing step are shown in Table 1.
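The IP-based identification used here, combined with the 1800-second session time-out of Section III-B, can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the implementation used in the experiments: the function name identify_sessions, the tuple layout of the records, and the sample requests (including the address 172.16.1.248) are hypothetical, and operating system and browser are ignored because only the IP address distinguishes users in this setting.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

SESSION_TIMEOUT = timedelta(seconds=1800)  # 30 minutes * 60 seconds


def identify_sessions(
    records: List[Tuple[str, datetime]]
) -> List[Tuple[str, datetime, int, int]]:
    """Label each (ip, timestamp) record with a user id and a session id."""
    user_ids: Dict[str, int] = {}          # ip -> user id
    last_seen: Dict[str, datetime] = {}    # ip -> timestamp of previous request
    current_session: Dict[str, int] = {}   # ip -> session id currently open
    next_session = 0
    labelled = []
    for ip, ts in sorted(records, key=lambda r: r[1]):
        if ip not in user_ids:
            # A previously unseen IP address is treated as a new user.
            user_ids[ip] = len(user_ids) + 1
            new_session = True
        else:
            # A gap longer than the time-out starts a new session.
            new_session = ts - last_seen[ip] > SESSION_TIMEOUT
        if new_session:
            next_session += 1
            current_session[ip] = next_session
        last_seen[ip] = ts
        labelled.append((ip, ts, user_ids[ip], current_session[ip]))
    return labelled


# Hypothetical requests used only to exercise the function.
sample = [
    ("172.16.1.247", datetime(2012, 11, 19, 4, 36, 21)),
    ("172.16.1.247", datetime(2012, 11, 19, 4, 40, 0)),
    ("172.16.1.248", datetime(2012, 11, 19, 4, 41, 0)),
    ("172.16.1.247", datetime(2012, 11, 19, 5, 30, 0)),
]
labelled = identify_sessions(sample)
total_users = len({user for _, _, user, _ in labelled})
total_sessions = len({session for _, _, _, session in labelled})
print(total_users, total_sessions)
```

Counting the distinct user ids and session ids in the labelled records gives totals of the kind reported in Table 1.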

TABLE 1
RESULTS AFTER PRE-PROCESSING

Total No. of Users    Total No. of Sessions    Rows in Web Log File    Total Rows after pre-processing
18                    68                       1217                    411

V. CONCLUSION
Web usage mining is an emerging area of research and an important sub-domain of data mining. To take full advantage of web usage mining and its techniques, it is important to carry out the preprocessing stage efficiently and effectively. This paper has described the main areas of preprocessing, including data cleansing, user identification, and session identification. Once the preprocessing stage is performed well, data mining techniques such as clustering, association, and classification can be applied for applications of web usage mining such as business intelligence, e-commerce, e-learning, and personalization.