Analyzing the Different Attributes of Web Log Files To Have An Effective Web Mining




Jaswinder Kaur #1, Dr. Kanwal Garg #2
#1 Ph.D. Scholar, Department of Computer Science & Applications, Kurukshetra University, Kurukshetra, India
#2 Supervisor, Department of Computer Science & Applications, Kurukshetra University, Kurukshetra, India

ABSTRACT
With technological advancement and the growing popularity of the Internet and the World Wide Web (WWW, W3), a large number of Web access log records are being collected in the form of Web log files. A Web access log contains a huge volume of raw data about client and server activity, such as IP address, date and time, HTTP request, user agent, referrer and bytes served. A Web log analyzer analyzes the Web access log to determine the navigational patterns of users. In this paper, the researcher applied the WebLog Expert tool to Web server log files to evaluate the significant attributes of Web log files based on the analyzer's reports. These reports provide information about visitors' behavior, traffic patterns, navigation paths, Web browsers, errors etc. The important attributes provide useful information for making the right decisions in business and market research.

Key words: Apache Web server, IIS Web server, Web server log files, log format, Web log analyzer.

INTRODUCTION
The Internet is a global system of interconnected networks in which computing devices use the Internet protocol suite (TCP/IP) to interact and communicate with one another. The World Wide Web (WWW) is the most popular part of the Internet and is based on HTTP (HyperText Transfer Protocol). The Internet and the WWW serve various application areas, such as marketing and advertising, direct online selling, research and development, and communication. Individual documents on the WWW are called Web pages; they may contain text, videos, images, multimedia and interactive content.

R S. Publication, rspublicationhouse@gmail.com Page 127

A Web server is a computer that runs a software application (HTTP server software) to deliver Web pages over the Internet or an intranet. A Web server log (sometimes referred to as the "raw data") is a simple text file that records the history of page requests on a server; each line in the log file represents one request. A typical Web log records fields such as IP address, date, time, HTTP request, referring page, browser and status code. The data from access logs provides an extensive view of Web servers and users, but it is difficult to analyze such a large amount of data systematically without tools. Such analysis enables server administrators and decision makers to characterize users and usage patterns. Access logs are also called transfer logs, because they store information about which files are requested from the Web server [1]. Thus, raw access logs can be very useful for statistical information. With an analysis tool, it is possible to obtain information about page views, visitors' behavior, traffic patterns, navigation paths, browsers used, errors, internal problems, performance problems and security problems of the Web application.

MOST WIDELY USED WEB SERVERS
Apache and IIS are the most widely used Web servers [2, 3].

A. Apache Web Server:
The first version of Apache was developed by Robert McCool in 1995 and is based on the NCSA (National Center for Supercomputing Applications) httpd Web server. The Apache Web server software is now maintained by the Apache Software Foundation.
Openness: The Apache Web server is open source software.
Extensibility: The source code of the software is modular in structure.
Portability: Apache operates on all major operating systems, such as Unix, Linux, Windows and OS/2.
Features: Apache supports a range of features, including Secure Sockets Layer (SSL) and Transport Layer Security (TLS).
Reliability: Apache has a large worldwide user community, through which bugs are found and fixed fairly quickly. Apache is a stable application.
Cost: Apache has an open source design; therefore this Web server is completely free.

B. IIS Web Server:
IIS (Internet Information Services) was formerly known as Internet Information Server. It is Microsoft's extensible Web server for the Windows NT family. IIS supports various protocols, including HTTP (Hypertext Transfer Protocol), HTTPS (Hypertext Transfer Protocol over Secure Socket Layer), FTP (File Transfer Protocol) and SMTP (Simple Mail Transfer Protocol). With IIS, a developer can build and run websites and associated Web-based applications without interference.
Low Cost of Deployment: End users can work with an IIS application using only a browser; no additional software is required for the application to run.
Familiar Development Environment: IIS offers the familiar Visual Basic programming environment. The user can add classes, modules or any Visual Basic ActiveX component to a project during development.

Access To A Wide Audience: IIS applications are compatible with different types of browsers and operating systems, so the user can easily reach a broad audience.
Object Model Provides Direct Access: The ASP (Active Server Pages) framework provides an object model through which the user can directly manipulate objects, which means dynamic content is created at the core of Internet Information Server. This allows the user to perform various operations (such as retrieving or sending data) on the browser and on the contents of a Web page.
Reusability: The user can easily access one Web class from another Web class.
Tools: IIS comes with a set of tools that includes Microsoft FrontPage (a user-friendly tool with a WYSIWYG interface for creating Web pages), Microsoft .NET, Visual Web Developer 2008, SQL Server 2008 and Silverlight tools for Visual Studio.

WEB LOG FORMAT
A. Apache Log Format:
The access log records hits and related information. In Apache, access logs can be set up in three ways: Common Log Format, Combined Log Format and multiple access logs. By default Apache uses the Common log format; however, the majority of hosting providers set the Combined log format for Apache on their servers. The log format can be configured by editing the "httpd.conf" file in the Apache conf directory (if you have access to this file) [4]. The Combined log format is the Common log format with two additional fields, User-Agent and Referer, so it contains more information than the Common log format. The configuration of the Combined log format is given below [5]:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined
CustomLog log/access_log combined

Each element of the format string that specifies the log format is described below:
%h: IP address of the client (remote host) accessing the server.
%l: RFC 1413 identity of the user as determined by identd (normally unavailable).
%u: User id of the user as determined by HTTP authentication.
%t: The date, time and time zone of the request (in current Apache versions, the time at which the request was received).
\"%r\": The request line sent by the client to the website.
%>s: Status code of the request that the server delivers to the client.
%b: Size in bytes of the server's response returned to the client.
\"%{Referer}i\": Referer is an HTTP request header field. It gives the address of the Web page from which the request originated.
\"%{User-agent}i\": User-Agent is an HTTP request header field. It identifies the Web browser (client software) that acts on the user's behalf.
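The field descriptions above can be made concrete with a short parsing sketch. The following Python example is only an illustration, not the method used by WebLog Expert: the regular expression, the function name `parse_combined` and the sample log line (including its IP address and URLs) are assumptions introduced here. It also assumes that quoted fields contain no escaped double quotes.

```python
import re
from datetime import datetime

# One named group per element of the Combined log format:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_combined(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Apache writes %t as day/month/year:hour:minute:second zone.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # %b is logged as "-" when no bytes were sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

# Hypothetical log line, made up for illustration only.
SAMPLE = ('192.0.2.10 - - [08/Apr/2012:07:04:34 +0530] '
          '"GET /index.html HTTP/1.1" 200 5120 '
          '"http://example.com/start.html" "Mozilla/5.0 (Windows NT 6.1)"')

record = parse_combined(SAMPLE)
print(record["host"], record["time"].hour, record["request"], record["status"])
```

Returning None for lines that do not match lets a caller count or skip malformed entries instead of failing on the first one.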

B. IIS Log Format:
IIS supports several log file formats for gathering information about client requests: IIS, W3C, NCSA and Custom. The selected format determines how client and server activity is written to the log file. The W3C Extended format is a customizable ASCII format. It is the default log file format in IIS and is commonly used; moreover, it lets the user select which fields are included in the log file, which makes it possible to limit the size of the log file while still obtaining detailed information. The W3C Extended log format is given below:

#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2011-11-10 06:44:16
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus sc-win32-status time-taken

The available fields in the W3C Extended log format are described as follows:
date: The date on which the HTTP request was received by the server.
time: The time at which the request occurred.
s-ip: Server IP address.
cs-method: The requested HTTP action (method).
cs-uri-stem: Stem (path) portion of the requested Uniform Resource Identifier.
cs-uri-query: Query portion of the requested Uniform Resource Identifier.
s-port: The server port number of the listener that is configured for the service.
cs-username: Name of the authenticated user who accessed the server. If the user was anonymous, a hyphen (-) is logged instead of a user name.
c-ip: Client IP address.
cs(User-Agent): The browser type used on the client.
cs(Referer): Identifies the URI that linked to the requested resource.
sc-status: The HTTP status code sent by the server.
sc-substatus: The substatus error code of the HTTP status.
sc-win32-status: The status of the action, expressed as a Windows system error code.
time-taken: The length of time taken to complete the request, in milliseconds.
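Because the W3C Extended format names its own columns in the #Fields directive, a reader can map each data line onto those names without hard-coding the layout. The Python sketch below illustrates this idea under stated assumptions: the function name `parse_w3c` and the sample data line (IP addresses, URL and user agent) are hypothetical, and fields are assumed to be separated by single spaces with no embedded spaces in values.

```python
def parse_w3c(lines):
    """Map each data line onto the names listed in the most recent #Fields directive."""
    fields = []
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directives start with '#'; only #Fields affects parsing.
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue
        values = line.split()
        if len(values) != len(fields):
            continue  # skip lines that do not fit the declared layout
        records.append(dict(zip(fields, values)))
    return records

# Header taken from the format shown above; the data line is made up.
SAMPLE_LOG = [
    "#Software: Microsoft Internet Information Services 7.5",
    "#Version: 1.0",
    "#Date: 2011-11-10 06:44:16",
    "#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port "
    "cs-username c-ip cs(User-Agent) cs(Referer) sc-status sc-substatus "
    "sc-win32-status time-taken",
    "2011-11-10 06:44:16 192.0.2.1 GET /default.htm - 80 - 198.51.100.7 "
    "Mozilla/5.0+(Windows+NT+6.1) http://example.com/ 200 0 0 125",
]

records = parse_w3c(SAMPLE_LOG)
print(records[0]["c-ip"], records[0]["cs-uri-stem"], records[0]["sc-status"])
```

Reading the column names from the log itself means the same code keeps working when an administrator selects a different set of fields.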
WHY WEB LOG ANALYSIS IS REQUIRED
Web log analysis is performed on Web server logs by using Web log analysis software (also called a Web log analyzer). This software is used to find out whether Web-based services are meeting their intended missions and goals. The analysis determines the navigational patterns of users, i.e. how visitors interact with the website, and reveals visitors' actions on the site.

WEB LOG ANALYSIS SOFTWARE
Web log analysis software (Web analytics software) tools are essential and can be categorized by their popularity, functionality and ease of use. A variety of Web log analyzers are available that take Web server logs as input and immediately generate graphical reports from the log files. In addition, a powerful Web log analyzer brings visibility into website access, which makes it an essential tool for business decision making and market research. Some of the available tools are: WebLog Expert Lite, AlterWind Log Analyzer, IIS and Apache Log Analyzer, Deep Log Analyzer and AWStats.

RESULTS AND INTERPRETATION
WebLog Expert Lite (a Web server log analyzer) is a free Web mining tool, a lightweight version of WebLog Expert for Windows-based computers. It can analyze Apache and IIS log files and report information about a site's visitors: general statistics, activity statistics, access statistics, referrers, search engines, browsers, operating systems, errors and more. The software supports the Apache Web server in the Common (default) and Combined log formats and the IIS Web server (versions 4/5/6/7/8) in the W3C Extended log format (the default log format). It can read logs in formats such as LOG, ZIP, GZ and BZ2. WebLog Expert Lite generates easy-to-read HTML reports in graphical and tabular formats. The researcher analyzed sample Apache and IIS log files with the help of the WebLog Expert tool. Time range of the sample log files: Apache: 8/Apr/2012 07:04:34 to 8/Jul/2012 22:48:23 and IIS: 3/Mar/2012 00:00:03 to 4/Mar/2012 23:59:51.

A. Analysis on Apache Log File
The following results were obtained to identify the behavior of the website users on the Apache Web server.
1) General Statistics: This shows the Web usage details (general information), including total hits and average hits per day, total page views and average page views per day, total visitors, total bandwidth etc.
2) Activity Statistics: These statistics give information about the users' activity by date and by hour of day. Date and time are essential attributes in the Web log file because activity by date and hour of day provides the details of hits, page views, visitors, bandwidth etc.
Fig 1: Bar graph displaying the users' activity by hour of the day.

According to Fig 1, the website receives the most hits at 05:00 hrs and the fewest at 03:00 hrs.
3) Access Statistics: These statistics show the most popular pages, most downloaded files, most requested images, most requested directories and top entry pages, as well as daily page access, image access, directory access and entry pages. An entry page is the first Web page visited by a visitor on the site. These statistics also give an idea of the navigational behaviour of visitors. The request line in the log file contains information about the page, image and directory requested, so this attribute has a significant role in access statistics.
4) Visitors: This shows the list of IP addresses/domain names of the hosts that accessed the website, along with the number of times the website was hit by each host. The visitors section analyzes the IP address field of the Web log file.
5) Referrers: This report displays the top referring sites, top referring URLs and top search engines. The referring sites and referring URLs are collected from the Referer field of the Apache Web log file.
Fig 2: Bar graph showing top referring sites while accessing the website.
6) Browser: This report helps the website owner to identify the Web browsers most used by visitors, so that the website can be made compatible with those browsers. It also lists the operating systems most used by visitors and the different versions of Internet Explorer. This report is generated from the User-Agent field of the Web log file.
Fig 3: Pie chart showing the most used Web browsers.

7) Errors: The last report shows the different kinds of errors that occurred while accessing the website.

B. Analysis on IIS Log File
The WebLog Expert tool analyzed the IIS Web server log in the same way as the Apache Web server log. Therefore, the researcher describes only the access statistics here.
1) Access Statistics: This shows the same information that is described in the access statistics of the Apache log file. The URI Stem field contains information about pages, images, files, directories etc. Thus, the WebLog Expert tool analyzes the data from the URI Stem attribute of the log files.
Fig 4: Bar graph showing access statistics of the Web log for most popular pages.

C. Important Attributes of Web Log Files
Based on the analysis of the sample Apache and IIS log files with the WebLog Expert Lite tool, the researcher examined the fields and identified the important attributes of Web log files: IP Address, Date, Time, Request Line or URI Stem, Status Code, Referer and User-Agent, as shown in Table 1.

Table 1. Analysis Report for Important Attributes of Web Log Files.
Web servers: Apache and IIS
Important attributes based on analysis: IP Address; Date; Time; Request Line / URI Stem; Status Code; Referer; User-Agent

These essential attributes are used to make the right decisions for business and market research, which saves valuable time and money.
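The reports discussed in this section each draw on one of the important attributes, which the following Python sketch illustrates. It is not how WebLog Expert computes its reports: the function name `summarize`, the record keys (`ip`, `hour`, `path`, `status`, `referer`) and the sample records are assumptions introduced here, standing in for entries already parsed from an Apache or IIS log. Counting distinct IP addresses is only a rough proxy for visitors.

```python
from collections import Counter

def summarize(records):
    """Aggregate simple counts from parsed log records (one dict per request)."""
    return {
        # Activity statistics: hits per hour of day (Date/Time attributes).
        "hits_by_hour": Counter(r["hour"] for r in records),
        # Visitors: hits per client host (IP Address attribute).
        "hits_by_ip": Counter(r["ip"] for r in records),
        # Access statistics: most requested pages (Request Line / URI Stem).
        "top_pages": Counter(r["path"] for r in records),
        # Referrers: "-" means no referrer was sent (Referer attribute).
        "referrers": Counter(r["referer"] for r in records if r["referer"] != "-"),
        # Errors: client and server error responses (Status Code attribute).
        "errors": Counter(r["status"] for r in records if r["status"] >= 400),
    }

# Hypothetical, already-parsed records used only for illustration.
SAMPLE_RECORDS = [
    {"ip": "192.0.2.10", "hour": 5, "path": "/index.html", "status": 200,
     "referer": "http://example.com/"},
    {"ip": "192.0.2.10", "hour": 5, "path": "/about.html", "status": 200,
     "referer": "-"},
    {"ip": "198.51.100.7", "hour": 3, "path": "/missing.html", "status": 404,
     "referer": "-"},
    {"ip": "198.51.100.7", "hour": 5, "path": "/index.html", "status": 200,
     "referer": "http://example.com/"},
]

report = summarize(SAMPLE_RECORDS)
busiest_hour, busiest_hits = report["hits_by_hour"].most_common(1)[0]
print(busiest_hour, busiest_hits, len(report["hits_by_ip"]))
```

Each counter corresponds to one report type, which shows why losing any single attribute from the log removes an entire class of statistics.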

CONCLUSION
A Web log analyzer is an essential tool for analyzing Web log files. The WebLog Expert tool was used to analyze Apache and IIS Web server log files and to obtain information about general statistics, activity statistics, access statistics, visitors, referrers, browsers and errors. In this paper, the researcher examined the results to identify the important attributes in Web log files. In conclusion, these attributes provide valuable data that any company operating a Web application needs in order to make the right decisions, which affect the growth and security of the business and relate directly or indirectly to sales and profit.

REFERENCE
[1] Naga Lakshmi, Raja Sekhara Rao and Sai Satyanarayana Reddy, "An Overview of Preprocessing on Web Log Data for Web Usage Analysis", International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Vol. 2(4), pp. 274-279, 2013.
[2] Sung-Whan Woo, Omar H. Alhazmi and Yashwant K. Malaiya, "Assessing Vulnerabilities in Apache and IIS HTTP Servers", DASC '06 Proceedings of the 2nd IEEE International Symposium on Dependable, Autonomic and Secure Computing, pp. 103-110, 2006.
[3] Omar H. Alhazmi and Yashwant K. Malaiya, "Measuring and Enhancing Prediction Capabilities of Vulnerability Discovery Models for Apache and IIS HTTP Servers", ISSRE '06 Proceedings of the 17th International Symposium on Software Reliability Engineering, ISSN: 1071-9458, pp. 343-352, 2006.
[4] V. Jayakumar and Dr. K. Alagarsamy, "Analysing Server Log File Using Web Log Expert In Web Data Mining", International Journal of Science, Environment and Technology, Vol. 2(5), pp. 1008-1016, 2013.
[5] L.K. Joshila Grace, V. Maheswari and Dhinaharan Nagamalai, "Analysis of Web Logs And Web User In Web Mining", International Journal of Network Security & Its Applications (IJNSA), Vol. 3(1), 2011.