AN OVERVIEW OF PREPROCESSING OF WEB LOG FILES FOR WEB USAGE MINING

N. M. Abo El-Yazeed
Demonstrator at High Institute for Management and Computer, Port Said University, Egypt
no3man_mohamed@himc.psu.edu.eg

Abstract: Web applications are growing at an enormous rate, and their user populations are growing exponentially. Evolutionary changes in technology have made it possible to capture users' interactions with web applications through the web server log file, which is saved as a plain text (.txt) file. Because a web log contains a large amount of irrelevant information, the original log file cannot be used directly in the web usage mining (WUM) process; preprocessing of the web log file is therefore imperative. Proper analysis of the web log file helps manage web sites effectively from both the administrative and the user perspective. Web log preprocessing is the initial, necessary step for improving the quality and efficiency of the later steps of WUM. A number of techniques are applied at the preprocessing level, such as data cleaning, data filtering, and data integration. Web usage mining, a category of web mining, is the application of data mining techniques to discover usage patterns from clickstream and associated data stored in one or more web servers. This paper presents an overview of the various steps involved in the preprocessing stage.

Keywords: Web Server, Web Log File, Data Cleaning, User Identification, Session Identification, Path Completion, Web Usage Mining, Clickstream Analysis.
1. INTRODUCTION
Web mining is one of the major and important fields of data mining. Data mining techniques are applied [1] to the contents, structures, and log files of web sites to improve performance, support web personalization, and guide schema modifications. Web mining is divided into three categories [2]: Web Content Mining, Web Structure Mining, and Web Usage Mining. In web content mining, we discover useful information from the contents of a web site, which may include text, hyperlinks, metadata, images, videos, and audio. Search engines and web spiders are used to gather data for content mining [1]. In web structure mining, we mine the structure of a web site on the basis of the hyperlinks and intra-links inside and outside its pages. In web usage mining (WUM), or web log mining, users' behavior and interests are revealed by applying data mining techniques to the web log file. Knowing the patterns of users' habits and interests informs the operational strategies of enterprises, and various applications can be built efficiently by understanding users' navigation through the web. Web mining is the application of data mining techniques to automatically retrieve, extract, and evaluate information for knowledge discovery from web documents and services. These applications may include:
- Modification of web site design.
- Schema modifications.
- Improved web site and web server performance.
- Improved web personalization.
- Recommender systems.
- Fraud detection and future prediction.
Srivastava et al. [3] proposed a framework for web usage mining. This process consists of four phases: the input stage, the preprocessing stage, the pattern discovery stage, and the pattern analysis stage:
1. Input stage. At the input stage, three types of raw web log files are retrieved (access logs, referrer logs, and agent logs), as well as registration information (if any) and information concerning the site topology.
2. Preprocessing stage. The raw web logs do not arrive in a format conducive to fruitful data mining, so substantial data preprocessing must be applied. The most common preprocessing tasks are (1) data cleaning and filtering, (2) de-spidering, (3) user identification, (4) session identification, and (5) path completion.
3. Pattern discovery stage. Once these tasks have been accomplished, the web data are ready for the application of statistical and data mining methods for the purpose of discovering patterns. These methods include (1) standard statistical analysis, (2) clustering algorithms, (3) association rules, (4) classification algorithms, and (5) sequential patterns.
4. Pattern analysis stage. Not all of the patterns uncovered in the pattern discovery stage would be considered interesting or useful. For example, an association rule for an online movie database that found "If Page = Sound of Music then Section = Musicals" would not be useful, even with 100% confidence, since this wonderful movie is, of course, a musical. Hence, in the pattern analysis stage, human analysts examine the output from the pattern discovery stage and glean the most interesting, useful, and actionable patterns.

2. Clickstream Analysis:
Web usage mining is sometimes referred to as clickstream analysis. A clickstream is the aggregate sequence of page visits executed by a particular user navigating through a web site. In addition to page views, clickstream data consist of logs, cookies, metatags, and other data used to transfer web pages from server to browser. When loading a particular web page, the browser also requests all the objects embedded in the page, such as .gif or .jpg graphics files. The problem is that each request is logged separately. All of these separate hits must be aggregated into page views at the preprocessing stage. Then a series of page views can be woven together into a session.
Thus, clickstream data require substantial preprocessing before user behavior can be analyzed.
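As an illustrative sketch of this aggregation step, raw hits can be reduced to page views by discarding requests for embedded objects. The record layout and the extension list below are assumptions chosen for demonstration, not part of any log standard:

```python
# Hedged sketch: collapse raw log hits into page views by dropping requests
# for embedded objects (the extension set is an illustrative assumption).
EMBEDDED = {".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico"}

def page_views(hits):
    """hits: iterable of (host, timestamp, uri) tuples; keep page requests only."""
    views = []
    for host, ts, uri in hits:
        dot = uri.rfind(".")
        ext = uri[dot:].lower() if dot != -1 else ""
        if ext not in EMBEDDED:
            views.append((host, ts, uri))
    return views

# Hits modeled loosely on the Figure 1 excerpt: one page plus embedded images.
hits = [
    ("wpbfl2-45.gate.net", 1, "/default.htm"),
    ("wpbfl2-45.gate.net", 2, "/icons/circle_logo_small.gif"),
    ("wpbfl2-45.gate.net", 3, "/logos/us-flag.gif"),
    ("wpbfl2-45.gate.net", 4, "/docs/browner/adminbio.html"),
]
views = page_views(hits)
```

Only the two .htm/.html requests survive as page views; in a real pipeline these would then be woven into sessions.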
3. Web Server Log Preprocessing:
Preprocessing is a preliminary and essential step, yet it is often neglected because of the variations and limitations of web log files. A web log file, the input to the preprocessing phase of WUM, is large, contains many raw and irrelevant entries, and was originally designed for debugging purposes [4]. Consequently, the web log file cannot be used directly in the WUM process. Preprocessing the log file is a complex and laborious job, and it takes about 80% of the total time of the web usage mining process as a whole [5]. Weighing the pros and cons, we conclude that the importance of the preprocessing step in web usage mining cannot be denied. Paying due attention to preprocessing improves the quality of the data [6]; furthermore, it improves the efficiency and effectiveness of the other two steps of WUM, pattern discovery and pattern analysis.

3.1. Web Log Files:
Web usage information takes the form of web server log files, or web logs. Each request from a user's browser to a web server is recorded automatically as a simple single-line transaction record appended to an ASCII text file on the web server, called the web log file, log file, or web log. This text file may be comma-delimited, space-delimited, or tab-delimited. A sample web log is the excerpt, shown in Figure 1, from the venerable EPA web log data available from the Internet Traffic Archive at http://ita.ee.lbl.gov/html/traces.html [7]. Each line in this file represents a particular action requested by a user's browser, received by the EPA web server in Research Triangle Park, North Carolina. Each line (record) contains the fields described below.
141.243.1.172 [29:23:53:25] "GET /Software.html HTTP/1.0" 200 1497
query2.lycos.cs.cmu.edu [29:23:53:36] "GET /Consumer.html HTTP/1.0" 200 1325
tanuki.twics.com [29:23:53:53] "GET /News.html HTTP/1.0" 200 1014
wpbfl2-45.gate.net [29:23:54:15] "GET /default.htm HTTP/1.0" 200 4889
wpbfl2-45.gate.net [29:23:54:16] "GET /icons/circle_logo_small.gif HTTP/1.0" 200 2624
wpbfl2-45.gate.net [29:23:54:18] "GET /logos/small_gopher.gif HTTP/1.0" 200 935
140.112.68.165 [29:23:54:19] "GET /logos/us-flag.gif HTTP/1.0" 200 2788
wpbfl2-45.gate.net [29:23:54:19] "GET /logos/small_ftp.gif HTTP/1.0" 200 124
wpbfl2-45.gate.net [29:23:54:19] "GET /icons/book.gif HTTP/1.0" 200 156
wpbfl2-45.gate.net [29:23:54:19] "GET /logos/us-flag.gif HTTP/1.0" 200 2788
tanuki.twics.com [29:23:54:19] "GET /docs/oswrcra/general/hotline HTTP/1.0" 302 -
wpbfl2-45.gate.net [29:23:54:20] "GET /icons/ok2-0.gif HTTP/1.0" 200 231
tanuki.twics.com [29:23:54:25] "GET /OSWRCRA/general/hotline/ HTTP/1.0" 200 991
tanuki.twics.com [29:23:54:37] "GET /docs/oswrcra/general/hotline/95report HTTP/1.0" 302 -
wpbfl2-45.gate.net [29:23:54:37] "GET /docs/browner/adminbio.html HTTP/1.0" 200 4217
tanuki.twics.com [29:23:54:40] "GET /OSWRCRA/general/hotline/95report/ HTTP/1.0" 200 1250
wpbfl2-45.gate.net [29:23:55:01] "GET /docs/browner/cbpress.gif HTTP/1.0" 200 51661
dd15-032.compuserve.com [29:23:55:21] "GET /Access/chapter1/s2-4.html HTTP/1.0" 200 4602

FIGURE 1: Sample Web Log File

i. Basic Log Format:

Remote Host Field
This field consists of the Internet IP address of the remote host making the request, such as 141.243.1.172. If the remote host name is available through a DNS lookup, this name is provided instead, such as wpbfl2-45.gate.net. To obtain the domain name of the remote host rather than the IP address, the server must submit a request, using the Internet Domain Name System (DNS), to resolve (i.e., translate) the IP address into a host name. Since humans prefer to work with domain names and computers are most efficient with IP addresses, the DNS system provides an important interface between humans and computers. For more information about DNS, see the Internet Systems Consortium, www.isc.org [8].

Date/Time Field
The EPA web log uses the following specialized date/time field format: [DD:HH:MM:SS], where DD represents the day of the month and HH:MM:SS represents the 24-hour time, given in EDT. In this particular data set, the DD portion represents the day in August 1995 on which the web log entry was made. However, it is more common for the date/time field to follow this format: DD/Mon/YYYY:HH:MM:SS offset, where the offset is a positive or negative constant indicating, in hours, how far ahead of or behind Greenwich Mean Time (GMT) the local server is. For example, a date/time field of 09/Jun/1988:03:27:00 -0500 indicates that a request was made to a server at 3:27 a.m. on June 9, 1988, and the server is 5 hours behind GMT.

HTTP Request Field
The HTTP request field consists of the information that the client's browser has requested from the web server. The entire HTTP request field is contained within quotation marks. Essentially, this field may be partitioned into four areas: (1) the request method, (2) the uniform resource identifier (URI), (3) the header, and (4) the protocol. The most common request method is GET, which represents a request to retrieve data identified by the URI. For example, the request field in the first record in Figure 1 is "GET /Software.html HTTP/1.0", representing a request from the client browser for the web server to provide the web page Software.html. Besides GET, other request methods include HEAD, PUT, and POST. For more information on these request methods, refer to the World Wide Web Consortium (W3C) at www.w3.org [9]. The uniform resource identifier contains the page or document name and the directory path requested by the client browser.
The URI can be used by web usage miners to analyze the frequency of visitor requests for pages and files. The header section contains optional information concerning the browser's request. This information can be used by the web usage miner to determine, for example, which keywords are being used by visitors in search engines that point to your site. The HTTP request field also includes the protocol section, which indicates which version of the Hypertext Transfer Protocol (HTTP) is being used by the client's browser. Based on the relative frequency of newer protocol versions (e.g., HTTP/1.1), the web developer may decide to take advantage of the greater functionality of the newer versions and provide more online features.

Status Code Field
Not all browser requests succeed. The status code field provides a three-digit response from the web server to the client's browser, indicating the status of the request: whether or not the request was a success and, if there was an error, which type of error occurred. Codes of the form 2xx indicate a success, and codes of the form 4xx indicate an error. Most of the status codes for the records in Figure 1 are 200, indicating that the request was fulfilled successfully. A sample of the possible status codes that a web server could send follows [9].

Successful transmission (200 series). Indicates that the request from the client was received, understood, and completed.
200: success
201: created
202: accepted
204: no content

Redirection (300 series). Indicates that further action is required to complete the client's request.
301: moved permanently
302: moved temporarily
303: see other
304: not modified (use cached document)

Client error (400 series). Indicates that the client's request cannot be fulfilled, due to incorrect syntax or a missing file.
400: bad request
401: unauthorized
403: forbidden
404: not found

Server error (500 series). Indicates that the web server failed to fulfill what was apparently a valid request.
500: internal server error
501: not implemented
502: bad gateway
503: service unavailable

Transfer Volume (Bytes) Field
The transfer volume field indicates the size of the file (web page, graphics file, etc.), in bytes, sent by the web server to the client's browser. Only GET requests that have completed successfully (status = 200) will have a positive value in this field. Otherwise, the field will consist of a hyphen or a value of zero. This field is useful for monitoring network traffic, the load carried by the network throughout the 24-hour cycle.

ii. Common Log Format
Web logs come in various formats, which vary depending on the configuration of the web server. The common log format (CLF, or "clog") is supported by a variety of web server applications and includes the following seven fields:
- Remote host field
- Identification field
- Authuser field
- Date/time field
- HTTP request field
- Status code field
- Transfer volume field

Identification Field
This field is used to store identity information provided by the client, but only if the web server is performing an identity check. However, this field is seldom used, because the identification information is provided in plain text rather than in a securely encrypted form. Therefore, this field usually contains a hyphen, indicating a null value.

Authuser Field
This field is used to store the authenticated client user name, if it is required. The authuser field was designed to contain the authenticated user name that a client must provide to gain access to directories that are password protected. If no such information is provided, the field defaults to a hyphen.

iii. Extended Common Log Format
The extended common log format (ECLF) is a variation of the common log format, formed by appending two additional fields onto the end of the record: the referrer field and the user agent field. Both the common log format and the extended common log format were created by the National Center for Supercomputing Applications, http://www.ncsa.uiuc.edu/ [10].

Referrer Field
The referrer field lists the URL of the previous site visited by the client, which linked to the current page. For images, the referrer is the web page on which the image is to be displayed. The referrer field contains important information for marketing purposes, since it can track how people found your site. Again, if the information is missing, a hyphen is used.

User Agent Field
The user agent field provides information about the client's browser, the browser version, and the client's operating system. Importantly, this field can also contain information regarding bots, such as web crawlers. Web developers can use this information to block certain sections of the web site from these web crawlers, in the interest of preserving bandwidth. Further, this field allows the web usage miner to determine whether a human or a bot has accessed the site, and thereby to omit the bot's visits from analysis, on the assumption that the developers are interested in the behavior of human visitors.
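As an illustrative sketch (not a production parser), a single CLF/ECLF record can be split into its fields with a regular expression. The group names and the sample line below are assumptions for demonstration; note that the EPA excerpt in Figure 1 omits the ident and authuser fields, shown here as hyphens:

```python
import re

# Hedged sketch of a common / extended common log format line parser.
# The named groups follow the seven CLF fields; referrer and agent are
# the optional ECLF additions.
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<datetime>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    m = CLF.match(line)
    if m is None:
        return None                      # malformed record
    rec = m.groupdict()
    # A hyphen in the transfer volume field means no bytes were sent.
    rec["bytes"] = 0 if rec["bytes"] == "-" else int(rec["bytes"])
    return rec

# Illustrative record in full CLF, adapted from the first line of Figure 1.
line = ('141.243.1.172 - - [29/Aug/1995:23:53:25 -0400] '
        '"GET /Software.html HTTP/1.0" 200 1497')
rec = parse_line(line)
```

For an ECLF record, the same pattern additionally captures the quoted referrer and user agent strings; for a plain CLF record those groups are simply None.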
iv. Microsoft IIS Log Format
There are other log file formats besides the common and extended common log formats. The Microsoft IIS log format includes the following fields [11]:
- Client IP address
- User name
- Date
- Time
- Service and instance
- Server name
- Server IP
- Elapsed time
- Client bytes sent
- Server bytes sent
- Service status code
- Windows status code
- Request type
- Target of operation
- Parameters
The IIS format records more fields than the other formats, so that more information can be uncovered. For example, the elapsed processing time is included, along with the bytes sent by the client to the server; also, the time recorded is local time. Note that web server administrators need not choose any of these formats; they are free to specify whichever fields they believe are most appropriate for their purposes.

3.2. Preprocessing Steps:

Data Cleaning
The first step of preprocessing is data cleaning. It is usually site-specific and involves tasks such as removing extraneous references to embedded objects that may not be important for the purpose of analysis, including references to style files, graphics, or sound files, as shown in Table 1. The cleaning process may also involve the removal of some of the data fields (e.g., number of bytes transferred, version of protocol used, etc.) that do not provide useful information for the analysis or data mining tasks [12].

No | Object Type | Unique Users | Requests | Bytes In | % of Total Bytes In
 1 | *.gif       | 1 |  46 |  89.00KB | 0.50%
 2 | *.js        | 1 |  37 | 753.95KB | 4.40%
 3 | *.aspx      | 1 |  34 | 397.05KB | 2.30%
 4 | *.png       | 1 |  31 | 137.67KB | 0.80%
 5 | *.jpg       | 1 |  20 | 224.72KB | 1.30%
 6 | Unknown     | 1 |  15 |  15.60KB | 0.10%
 7 | *.ashx      | 1 |  15 | 104.79KB | 0.60%
 8 | *.axd       | 1 |  13 | 274.81KB | 1.60%
 9 | *.css       | 1 |   8 |  71.78KB | 0.40%
10 | *.dll       | 1 |   7 |  26.41KB | 0.20%
11 | *.asp       | 1 |   4 |   1.26KB | 0.00%
12 | *.html      | 1 |   3 |   2.17KB | 0.00%
13 | *.htm       | 1 |   2 |  69.87KB | 0.40%
14 | *.pli       | 1 |   2 |  24.92KB | 0.10%

TABLE 1: Example of web log requests grouped by file extension.

User Identification
The task of user identification is to identify who accesses the web site and which pages are accessed. The analysis of web usage does not require knowledge of a user's identity; however, it is necessary to distinguish among different users. Since a user may visit a site more than once, the server logs record multiple sessions for each user. The term user activity record refers to the sequence of logged activities belonging to the same user.
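The partitioning of a cleaned log into per-user activity records can be sketched as follows. Keying on the (IP address, user agent) pair is one common heuristic when no authentication data are available; the dictionary record layout is an assumption for illustration:

```python
from collections import defaultdict

# Hedged sketch: group cleaned log records into user activity records,
# using the (host IP, user agent) pair as a heuristic user key.
def identify_users(records):
    """records: iterable of dicts with 'host', 'agent', 'ts', 'uri' keys."""
    users = defaultdict(list)
    for r in sorted(records, key=lambda r: r["ts"]):
        users[(r["host"], r["agent"])].append(r)
    return dict(users)

# Same IP but two different browsers -> treated as two distinct users.
records = [
    {"host": "1.2.3.4", "agent": "Mozilla/4.0", "ts": 1, "uri": "/A"},
    {"host": "1.2.3.4", "agent": "Mozilla/5.0", "ts": 2, "uri": "/B"},
    {"host": "1.2.3.4", "agent": "Mozilla/4.0", "ts": 3, "uri": "/C"},
]
users = identify_users(records)
```

Each value in the returned dictionary is one user activity record, ordered by timestamp and ready to be segmented into sessions.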
FIGURE 2: Example of User Identification

Consider, for instance, the example of Figure 2. The left side depicts a portion of a partly preprocessed log file. Using a combination of the IP address and user agent fields in the log file, one can partition the log into activity records for three separate users (depicted on the right).

Session Identification
Sessionization is the process of segmenting the user activity record of each user into sessions, each representing a single visit to the site. Web sites without the benefit of additional authentication information from users, and without mechanisms such as embedded session ids, must rely on heuristic methods for sessionization [12]. The goal of a sessionization heuristic is to reconstruct, from the clickstream data, the actual sequence of actions performed by one user during one visit to the site.
Generally, sessionization heuristics fall into two basic categories: time-oriented or structure-oriented. As an example, consider the time-oriented heuristic h1: the total session duration may not exceed a threshold θ. Given t0, the timestamp of the first request in a constructed session S, a request with timestamp t is assigned to S if and only if t − t0 ≤ θ. In Figure 3, the heuristic h1, with θ = 30 minutes, has been used to partition a user activity record into two separate sessions.

FIGURE 3: Example of Sessionization

Path Completion
Another potentially important preprocessing task, usually performed after sessionization, is path completion. Path completion is the process of adding page accesses that are not in the web log but that actually occurred. Client- or proxy-side caching can often result in missing access references to pages or objects that have been cached. For instance, if a user returns to a page A during the same session, the second access to A will likely result in viewing the previously downloaded version of A that was cached on the client side, and therefore no request is made to the server. This results in the second reference to A not being recorded in the server logs. Missing references due to caching can be heuristically inferred through path completion, which relies on knowledge of the site structure and the referrer information in the server logs. In the case of dynamically generated pages, form-based applications using the HTTP POST method result in all or part of the user input parameters not being appended to the URL accessed by the user. A simple example of missing references is given in Figure 4 [13].
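The time-oriented heuristic h1 described above can be sketched as follows. Timestamps are in seconds, and the list-of-tuples layout of the activity record is an assumption for illustration:

```python
# Hedged sketch of heuristic h1: a request with timestamp t joins the
# current session S iff t - t0 <= theta, where t0 is the timestamp of the
# first request in S; otherwise it opens a new session.
def sessionize(activity, theta=30 * 60):
    """activity: list of (timestamp, uri) tuples sorted by timestamp."""
    sessions = []
    for ts, uri in activity:
        if sessions and ts - sessions[-1][0][0] <= theta:
            sessions[-1].append((ts, uri))   # within theta of session start
        else:
            sessions.append([(ts, uri)])     # start a new session
    return sessions

# Requests at 0, 600, and 1500 s fall within the 30-minute window opened
# by the first request; the request at 3600 s does not.
activity = [(0, "/A"), (600, "/B"), (1500, "/C"), (3600, "/D")]
sessions = sessionize(activity)
```

A structure-oriented heuristic would instead consult the referrer field and site topology rather than timestamps alone.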
FIGURE 4: Identifying missing references in path completion

Data Integration
The above preprocessing tasks ultimately result in a set of user sessions, each corresponding to a delimited sequence of page views. However, in order to provide the most effective framework for pattern discovery, data from a variety of other sources must be integrated with the preprocessed clickstream data. This is particularly the case in e-commerce applications, where the integration of both user data (e.g., demographics, ratings, and purchase histories) and product attributes and categories from operational databases is critical. Such data, used in conjunction with usage data in the mining process, can allow for the discovery of important business intelligence metrics such as customer conversion ratios and lifetime values. In addition to user and product data, e-commerce data include various product-oriented events such as shopping cart changes, order and shipping information, impressions (when the user visits a page containing an item of interest), click-throughs (when the user actually clicks on an item of interest in the current page), and other basic metrics primarily used for data analysis. The successful integration of these types of data requires the creation of a site-specific event model, on the basis of which subsets of a user's clickstream are aggregated and mapped to specific events, such as the addition of a product to the shopping cart. Generally, the integrated e-commerce data are stored in the final transaction database. To enable full-featured web analytics applications, these data are usually stored in a data warehouse called an e-commerce data mart. The e-commerce data mart is a multi-dimensional database integrating data from various sources at different levels of aggregation. It can provide pre-computed e-metrics along multiple dimensions, and it is used as the primary data source for OLAP (Online Analytical Processing), for data visualization, and for data selection in a variety of data mining tasks.

4. Conclusion:
The data collected in the web server and other associated data sources do not precisely reflect the pages visited by the user during his or her interactions with the web. Owing to the presence of superfluous items, in addition to the inability to identify users and sessions directly, the log files must be preprocessed before the mining tasks can be undertaken. Data preprocessing is a significant and prerequisite phase in web mining. Various heuristics are employed in each step to remove irrelevant items and to identify users and sessions along with the browsing information. The output of this phase is a user session file. Nevertheless, the user session file may not exist in a format suitable as input data for the mining tasks to be performed. This paper has focused on a design that can be adopted for the preliminary formatting of a user session file so that it is suited to the various mining tasks of the subsequent pattern discovery phase.

5. Future Work:
In addition to the preprocessing and formatting tasks described above, future work involves various data transformation tasks that are likely to influence the quality of the patterns discovered by the mining algorithms. The discovered patterns can then be used for various web usage applications such as site improvement, business intelligence, and recommendations. There are a number of issues in the preprocessing of log data; the volume of requests in a single log file is the first challenge.
It is therefore important to eliminate irrelevant data. Cleaning is performed to speed up the analysis, as it reduces the number of records and increases the quality of the results in the analysis stage.
6. References:
[1] K. R. Suneetha and D. R. Krishnamoorthi, "Identifying User Behavior by Analyzing Web Server Access Log File," International Journal of Computer Science and Network Security (IJCSNS), Vol. 9, No. 4, April 2009.
[2] S. Alam, G. Dobbie, and P. Riddle, "Particle Swarm Optimization Based Clustering of Web Usage Data," IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, pp. 451-454, 2008.
[3] J. Srivastava, R. Cooley, M. Deshpande, and P. N. Tan, "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data," SIGKDD Explorations, Vol. 1, No. 2, Jan 2000.
[4] N. Khasawneh and C. C. Chan, "Active User-Based and Ontology-Based Web Log Data Preprocessing for Web Usage Mining," Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI'06), 2006.
[5] Z. Pabarskaite, "Implementing Advanced Cleaning and End-User Interpretability Technologies in Web Log Mining," 24th International Conference on Information Technology Interfaces (ITI 2002), Cavtat, Croatia, June 24-27, 2002.
[6] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers (an imprint of Elsevier), San Francisco, 2006.
[7] http://ita.ee.lbl.gov/html/traces.html.
[8] http://www.isc.org.
[9] http://www.w3.org.
[10] http://www.ncsa.uiuc.edu/.
[11] http://www.microsoft.com/.
[12] A. Scime, Web Mining: Applications and Techniques, Idea Group Publishing, ISBN 1-59140-414-2, 2005.
[13] M. Géry and H. Haddad, "Evaluation of Web Usage Mining Approaches for User's Next Request Prediction," WIDM '03: Proceedings of the 5th ACM International Workshop on Web Information and Data Management, New York, NY, USA, pp. 74-81, 2003.