Generalization of Web Log Data Using WUM Technique

1 M. SARAVANAN, 2 B. VALARAMATHI
1 Final Year M.E. Student, 2 Professor & Head
Department of Computer Science and Engineering, SKP Engineering College, Tiruvannamalai, INDIA
deivanai.saravanan@gmail.com, valar_mathi_2007@yahoo.co.in

ABSTRACT
This paper attempts to understand the behavioral patterns of a website's visitors with the aim of creating better and more effective websites. The behavioral patterns are understood by analyzing the web log files maintained by the respective websites. The analysis covers how many visitors browse the web site, which pages they view and which they ignore, how long they spend on the site, where they come from, and how frequently they visit. In this work, the web log files are analyzed to obtain the users' access patterns for the various pages of the web site. This information is then used to predict the preferences of the different users, and it yields reports on how many visitors accessed the website, how many unique IP addresses were used, how much bandwidth was consumed, and how many hits the site received. The number of hits is broken down by time increment: daily usage, day of the week, and hour of the day. To learn more about the information that visitors accessed, we can see how many web pages were viewed, how many files were downloaded, which directories were accessed, and which images were viewed. Referrer information includes the domains and URLs that the visitors came from.

General Terms: Human Factors, Measurement.
Keywords: Query log analysis, Web search measurement.

1 INTRODUCTION
1.1 BACKGROUND
Web users are increasing at a fast rate, and useful information can be obtained from the WWW (World Wide Web). Since the available data is growing explosively, techniques for the analysis and discovery of useful information are important. Information providers and web managers make an effort to construct effective web sites. If providers and administrators can determine users' browsing patterns from web access logs, they can use those patterns as one index for constructing an effective web site [2]. However, it is difficult to extract users' browsing patterns manually because web access logs are huge. Therefore, data mining techniques are adopted to solve this problem; data mining extracts patterns from large amounts of data. Web page complexity far exceeds the complexity of any traditional text document collection. The Web constitutes a highly dynamic information source and serves a broad spectrum of user communities [3]. Further, only a small portion of the Web's pages contain truly relevant or useful information.
Web mining is the mining of data related to the World Wide Web. This may be data actually present in web pages or data related to Web activity [4, 5]. Web data can be classified into the following classes:
- Content of the actual web pages.
- Intra-page structure, which includes the HTML or XML code for a page.
- Inter-page structure, which is the actual linkage structure between web pages.
- Usage data that describes how web pages are accessed by visitors.
- User profiles, which include demographic and registration information obtained about users; this can also include information found in cookies.
Whenever a visitor accesses the web server, it records the IP address, authenticated user ID, time/date, request method, status, bytes, referrer, agent, and so on. The available data fields are specified by the HTTP protocol. Web mining tasks can be divided into several classes; Figure 1.1 shows one taxonomy of web mining activities. General access pattern tracking is a type of usage mining that looks at the history of web pages visited. This usage may be general or may be targeted to specific usages or users.
Figure 1.1. Taxonomy of Web Mining.
Web Usage Mining is the part of Web Mining that deals with the extraction of knowledge from server log files. The source data mainly consist of the (textual) logs that are collected when users access web servers, and they may be represented in standard formats.

1.2 MOTIVATION
The aim of this work is to analyze the log files of a web site obtained from a web server using the WUM technique. Once the data warehouse has been created and populated, various statistical and data mining techniques are used to identify any web usage patterns that exist. An existing application that can assist with this pattern discovery phase is 123LogAnalyzer. The discovered patterns are then analyzed, interpreted, and used to determine how well the web site is being used. A graphical representation of these patterns is also created.

1.3 OBJECTIVES
Web usage mining is the type of web mining activity that involves the automatic discovery of user access patterns from one or more web servers. Organizations often generate and collect large volumes of data in their daily operations. Most of this information is generated automatically by web servers and collected in server access logs. Other sources of user information include referrer logs, which contain information about the referring pages for each page reference, and user registration or survey data gathered via tools such as CGI scripts [7]. Analyzing such data can help organizations determine the lifetime value of customers, cross-marketing strategies across products, and the effectiveness of promotional campaigns, among other things. Analysis of server access logs and user registration data can also provide valuable information on how to better structure a web site in order to create a more effective presence for the organization [8]. Finally, for organizations that sell advertising on the World Wide Web, analyzing user access patterns helps in targeting advertisements to specific groups of users.

1.4 CHALLENGES
The World Wide Web is a huge, diverse, and dynamic medium for the dissemination of information. There may be too much information to mine (information overload), and much of it is irrelevant and not indexed, so finding relevant information to mine is difficult. Personalization and mass customization are also difficult, and e-commerce businesses have to know what their customers want. Most web documents are in HTML format and contain many markup tags, used mainly for formatting. Traditional IR systems typically contain structured and well-written documents; this is not the case on the Web. Most documents in traditional IR systems tend to remain static over time, whereas web pages are much more dynamic. Web pages are hyperlinked to each other, and it is through hyperlinks that a web page author cites other web pages.
The size of the Web is larger than traditional data sources or document collections by several orders of magnitude.

2 PROPOSED SYSTEM
2.1 SYSTEM OVERVIEW
Data mining is a technique used to deduce useful and relevant information to guide professional decisions and other scientific research. It is a cost-effective way of analyzing large amounts of data, especially datasets that a human could not analyze. The massive growth in Internet use has made automatic knowledge extraction from web log files a necessity. Information providers are interested in techniques that can learn web users' information needs and preferences [9]. They can improve the effectiveness of their web sites by adapting the information structure of the sites to the users' behavior. Recently, the advent of data mining techniques for discovering usage patterns from Web data (Web Usage Mining) indicates that these techniques can be a viable alternative to traditional decision-making tools. Web Usage Mining is the process of applying data mining techniques to the discovery of usage patterns from Web data and is targeted towards applications. It mines the secondary data derived from the interactions of users during certain periods of Web sessions. This work explores the use of Web Usage Mining techniques to analyze web log records collected from web servers. Using a commercial Web mining tool (123LogAnalyzer), several web access patterns have been identified by applying well-known data mining techniques to the access log files.

2.2 SYSTEM REQUIREMENTS
123LogAnalyzer is a powerful online tool that turns web logs into a comprehensive analysis of customers and prospects [10]. It describes how visitors browse the web site, which pages they view (and ignore), how long they spend on the site, and where they come from. Its web server activity report displays the number of visitors, the number of unique IP addresses, the amount of bandwidth used, and the number of hits the site received, broken down by time increment, day of the week, and hour of the day. To learn more about the information that visitors accessed, you can see which web pages were viewed, which files were downloaded, which directories were accessed, and which images were viewed. Referrer information includes the domains and URLs that the visitors came from. The search engine performance report displays the search engines that referred visitors to the site, and the words and phrases that visitors searched for. 123LogAnalyzer also provides geographic information about the visitors, as well as the platforms and browsers people are using to visit the site. It can even identify missing files, broken links, and other errors that visitors encountered. Sample output of 123LogAnalyzer is given below.
Fig 2.1. Adding the log file
Fig 2.2. Daily Visit Report
Fig 2.3. Most Popular Day of Week Report
Fig 2.4. Hits by Hour of Day Report
Fig 2.5. Hits by Day of Week Report

3 DESIGN OF THE SYSTEM
3.1 DESIGN OF THE SYSTEM
Web usage mining mines web log records to discover the web access patterns of web pages. Analyzing and exploring these patterns helps identify potential customers for e-commerce, enhances the quality and delivery of Internet information services to the end user, and improves web server system performance [3].
Fig 3.1. Design of the web log system
The log file contents are retrieved from the text file and the tokens are separated using a string tokenizer. The contents are then stored in a database. Unwanted tuples are removed and stored in another table. Aggregate functions are used to extract the required tuples, and SQL queries are issued against the database.
LOG FILE: Log files are files that contain a record of website activity. Every time a person visits the website, the web server updates a log file with the visitor's information. These log files can be downloaded and used to generate useful statistics. An access of a web page or a file generates a "hit" on the web server. For example, if a web page contains 10 pictures, a visit to that page generates 11 hits on the web server: one hit for the web page and 10 hits for the pictures. If a visitor views 5 web pages on the web site, each containing 10 pictures, the web server will record 55 hits, 5 page views, and 1 visit.
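As a rough, hedged sketch of the storage step described under Section 3.1 above (log lines converted to database tuples, with unwanted tuples moved to a second table), the following Java fragment uses JDBC with an in-memory H2 database. The weblog table, its columns, and the status-based cleaning rule are assumptions introduced only for illustration; they are not the schema of the implemented system.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

// Minimal sketch of the "convert log file to database" step of Section 3.1.
// Assumes an H2 JDBC driver on the classpath; the weblog table, its columns and the
// status-based cleaning rule are illustrative assumptions, not the paper's actual schema.
public class LogToDatabase {
    public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection("jdbc:h2:mem:weblog");
        try (Statement st = con.createStatement()) {
            st.execute("CREATE TABLE weblog (ip VARCHAR(64), visit_date DATE, "
                     + "request VARCHAR(512), status INT, bytes BIGINT)");
        }
        // One row per tokenized log line (parsing of the raw line is omitted here).
        String sql = "INSERT INTO weblog (ip, visit_date, request, status, bytes) VALUES (?, ?, ?, ?, ?)";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, "217.13.12.209");
            ps.setDate(2, java.sql.Date.valueOf("2007-07-19"));
            ps.setString(3, "GET /meta_tags.htm HTTP/1.1");
            ps.setInt(4, 200);
            ps.setLong(5, 28950);
            ps.executeUpdate();
        }
        // "Unwanted tuples" (here: unsuccessful requests) are copied to a second table and removed.
        try (Statement st = con.createStatement()) {
            st.execute("CREATE TABLE removed_log AS SELECT * FROM weblog WHERE status <> 200");
            st.execute("DELETE FROM weblog WHERE status <> 200");
        }
        con.close();
    }
}

Aggregate SQL queries for the various reports (daily hits, bandwidth, and so on) can then be issued against this table, as sketched in Section 4.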
3.1.2 WEB LOG FILES
Web server log files are simple text files that are automatically generated every time someone accesses the website. Every "hit" on the web site, including each view of an HTML document, image, or other object, is logged. The raw web log file format is essentially one line of text for each hit to the website. It contains information about who was visiting the site, where they came from, and exactly what they were doing on the web site. There are up to four files: the access (or transfer) file, the error file, the agent (or browser) file, and the referrer file. Increasingly often, the transfer, agent, and referrer data are gathered into a single combined file.

3.1.3 SAMPLE LINE OF A WEB LOG FILE IN ITS RAW FORMAT
217.13.12.209 - - [19/JUL/2007:02:50:32-0400] "GET /meta_tags.htm HTTP/1.1" 200 28950 "http://www.google.com/search?q=meta+and+tag" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 2000; DigExt)"
This web server log file line tells us:
Visitor's IP address or hostname [217.13.12.209]
Login [-]
Authuser [-]
Date and time [19/JUL/2007:02:50:32-0400]
Request method [GET]
Request path [meta_tags.htm]
Request protocol [HTTP/1.1]
Response status [200]
Response content size [28950]
Referrer path [http://www.google.com/search?q=meta+and+tag]
User agent [Mozilla/4.0 (compatible; MSIE 5.0; Windows 2000; DigExt)]

3.2 SYSTEM ARCHITECTURE
As part of the system requirements and design activity, the system has to be modeled as a set of components and the relationships between these components. Figure 3.2 shows the major sub-systems of the software and the interconnections between them.
Fig 3.2. System architecture (conversion of log files to a database, generalization of log files, table generation, bar/line chart generation)

3.3 DETAILED PROCESS OF WUM
Figure 3.3. Activities of WUM
Step 1: Data Preprocessing
Data preprocessing has a fundamental role in Web Usage Mining applications. It has different tasks [12]:
(a) Data Cleaning - This step consists of removing all the data tracked in web logs that are useless for mining purposes.
(b) Session Identification and Reconstruction - This step consists of (i) identifying the different user sessions from the usually very poor information available in log files and (ii) reconstructing the users' navigation paths within the identified sessions.
(c) Content and Structure Retrieving - Web content retrieval refers to the discovery of useful information from web content, including text, images, audio, and video; structure retrieval analyzes the out-links of a web page and has been used for search-engine result ranking.
(d) Data Formatting - Once the previous phases have been successfully completed, the data are properly formatted before mining techniques are applied, and the data extracted from the web logs are stored in a relational database.
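To make the field breakdown above and the data-cleaning task (a) concrete, the following minimal sketch splits one combined-format log line with a regular expression and discards requests for inline images. The system described in this paper tokenizes the line with a string tokenizer; the regular expression and the image filter shown here are illustrative assumptions, not the exact rules of that implementation.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative parser for one combined-format log line (see Section 3.1.3).
public class LogLineParser {
    // ip ident authuser [date] "request" status bytes "referrer" "user agent"
    private static final Pattern COMBINED = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"");

    public static void main(String[] args) {
        String line = "217.13.12.209 - - [19/JUL/2007:02:50:32-0400] "
            + "\"GET /meta_tags.htm HTTP/1.1\" 200 28950 "
            + "\"http://www.google.com/search?q=meta+and+tag\" "
            + "\"Mozilla/4.0 (compatible; MSIE 5.0; Windows 2000; DigExt)\"";
        Matcher m = COMBINED.matcher(line);
        if (!m.find()) {
            System.out.println("Line does not match the combined format; discard it.");
            return;
        }
        String request = m.group(5);                     // e.g. "GET /meta_tags.htm HTTP/1.1"
        String[] parts = request.split(" ");
        String path = parts.length > 1 ? parts[1] : request;
        // Data cleaning (task a): drop entries that are useless for mining, e.g. inline images.
        if (path.endsWith(".gif") || path.endsWith(".jpg") || path.endsWith(".png")) {
            return;
        }
        System.out.println("IP address : " + m.group(1));
        System.out.println("Date/time  : " + m.group(4));
        System.out.println("Request    : " + request);
        System.out.println("Status     : " + m.group(6));
        System.out.println("Bytes      : " + m.group(7));
        System.out.println("Referrer   : " + m.group(8));
        System.out.println("User agent : " + m.group(9));
    }
}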
Fig 3.4. Phases of WUM
Step 2: Mining Algorithms
Process of the mining-algorithm (pattern discovery) phase:
(a) Statistical Analysis: Statistical techniques are the most common method of extracting knowledge about visitors to a web site. By analyzing the session file, one can perform different kinds of descriptive statistical analyses (frequency, mean, median, etc.) on variables such as page views, viewing time, and length of a navigational path.
(b) Clustering: Clustering is a technique for grouping together a set of items having similar characteristics. In the Web usage domain, there are two kinds of interesting clusters to be discovered: usage clusters and page clusters. Clustering of users tends to establish groups of users exhibiting similar browsing patterns. Such knowledge is especially useful for inferring user demographics in order to perform market segmentation in e-commerce applications or to provide personalized web content to users.
(c) Classification: Classification is the task of mapping a data item into one of several predefined classes. In the Web domain, one is interested in developing a profile of users belonging to a particular class or category. This requires the extraction and selection of features that best describe the properties of a given class or category.
(d) Association Rules: Association rule generation can be used to relate pages that are most often referenced together in a single server session. In the context of Web Usage Mining, association rules refer to sets of pages that are accessed together with a support value exceeding some specified threshold. These pages may not be directly connected to one another via hyperlinks [11].
(e) Sequential Patterns: The technique of sequential pattern discovery attempts to find inter-session patterns such that the presence of a set of items is followed by another item in a time-ordered set of sessions or episodes. Using this approach, web marketers can predict future visit patterns, which is helpful in placing advertisements aimed at certain user groups.
(f) Dependency Modeling: Dependency modeling is another useful pattern discovery task in Web Mining. The goal here is to develop a model capable of representing significant dependencies among the various variables in the Web domain.
Step 3: Pattern Analysis
Pattern analysis is the last step in the overall Web Usage Mining process, as described in Fig 3.4. The motivation behind pattern analysis is to filter out uninteresting rules or patterns from the set found in the pattern discovery phase [13]. The exact analysis methodology is usually governed by the application for which Web mining is done. The most common form of pattern analysis consists of a knowledge query mechanism such as SQL.
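As a hedged illustration of the support value used in the association-rule step (d) above, the toy example below counts, for every pair of pages, the fraction of sessions in which both pages occur and keeps the pairs whose support exceeds a threshold. The four sessions and the 30% minimum support are invented for the example; they are not data or results from this project.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy computation of pairwise support over user sessions (association-rule idea, step d).
public class PairSupport {
    public static void main(String[] args) {
        List<Set<String>> sessions = List.of(
            Set.of("/index.htm", "/products.htm", "/contact.htm"),
            Set.of("/index.htm", "/products.htm"),
            Set.of("/index.htm", "/about.htm"),
            Set.of("/products.htm", "/contact.htm"));
        double minSupport = 0.3;   // a pair is reported if it occurs in at least 30% of sessions

        Map<String, Integer> pairCounts = new HashMap<>();
        for (Set<String> session : sessions) {
            List<String> pages = new ArrayList<>(session);
            Collections.sort(pages);
            for (int i = 0; i < pages.size(); i++)
                for (int j = i + 1; j < pages.size(); j++)
                    pairCounts.merge(pages.get(i) + ", " + pages.get(j), 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : pairCounts.entrySet()) {
            double support = (double) e.getValue() / sessions.size();
            if (support >= minSupport)
                System.out.printf("{%s}  support = %.2f%n", e.getKey(), support);
        }
    }
}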
4 IMPLEMENTATION OF THE SYSTEM
4.1 METHODOLOGY OVERVIEW
The Web Usage Mining process is the major guideline for the project implementation. Fig 4.1 shows the general flow of the project methodology.
Fig 4.1. Flow of the project methodology
Server Log File: The server log file dated from January 2007 to September 2007 has been selected for further analysis. The server log files are retrieved from the (IIS) web server. The large amount of data is the most challenging problem to handle during the data preprocessing phase. The server log file consists of nine attributes in a single line of record, as shown below:
192.168.2.85 - - [21/Jun/2007:05:27:59 +0000] "GET / HTTP/1.0" 200 0 "-" "Microsoft-WebDAV-
192.168.10.82 - - [12/May/2007:05:40:57 +0000] "GET /sysvol HTTP/1.0" 404 0 "-" "Microsoft-WebDAV-
192.168.10.79 - - [23/Jul/2007:05:54:52 +0000] "GET /sysvol HTTP/1.0" 404 0 "-" "Microsoft-WebDAV-
192.168.10.75 - - [02/Aug/2007:06:14:07 +0000] "GET / HTTP/1.0" 200 0 "-" "Microsoft-WebDAV-
192.168.10.74 - - [20/May/2007:06:16:33 +0000] "GET /sysvol HTTP/1.0" 404 0 "-" "Microsoft-WebDAV-
192.168.10.72 - - [28/Sep/2007:06:27:33 +0000] "GET / HTTP/1.0" 200 0 "-" "Microsoft-WebDAV-
192.168.10.72 - - [23/Mar/2007:06:27:33 +0000] "GET /sysvol HTTP/1.0" 404 0 "-" "Microsoft-WebDAV-

4.2 DESCRIPTION OF THE MODULES WITH SCREEN SHOTS
4.2.1 Description of Modules
a. Extracting web log files: extracting the log files from different web servers with various formats.
b. Converting web log files: converting the information from the text files (created by the log analyzer) and storing the web data available in the files into the database.
c. Generalizing web log data: posting all data to the appropriate tuples.
d. Bar/line chart generation: based on the information available in the database from the log file, the required bar charts are built (e.g., daily hits, daily visits, daily bandwidth, daily page views, most popular day of week, weekly bandwidth, hits by day of week, visitors who viewed the site most, most viewed web pages, most viewed directories). See Figure 4.2.
e. Table generation: based on the information available in the database from the log file, the required tables are built in the database (e.g., daily hits, daily visits, daily bandwidth, daily page views, most popular day of week, weekly bandwidth, hits by day of week, visitors who viewed the site most, most viewed web pages, most viewed directories). See Table 4.1.

4.3 SAMPLE SCREEN SHOTS
Figure 4.2. Bar/line chart of the Daily Hits Report
Table 4.1. Generation of the Daily Hits Report
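As a sketch of how a report such as Table 4.1 can be generated from the stored records, the query below groups the assumed weblog table (from the earlier JDBC sketch) by date; the table and column names remain illustrative assumptions rather than the project's actual schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch of the table-generation step: daily hits and daily bandwidth per date.
// Assumes the illustrative weblog table already holds the cleaned log records;
// the connection URL must point at the database populated in the earlier sketch.
public class DailyHitsReport {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:h2:mem:weblog");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT visit_date, COUNT(*) AS hits, SUM(bytes) AS bandwidth "
               + "FROM weblog GROUP BY visit_date ORDER BY visit_date")) {
            while (rs.next()) {
                System.out.printf("%s  hits=%d  bandwidth=%d%n",
                    rs.getDate("visit_date"), rs.getLong("hits"), rs.getLong("bandwidth"));
            }
        }
    }
}

Grouping by a day-of-week or hour-of-day expression instead of the date would yield the corresponding day-of-week and hour-of-day reports shown earlier.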
5 CONCLUSION AND FUTURE ENHANCEMENTS
5.1 CONCLUSION
The Web Usage Mining modules were used to preprocess the log file, and various charts were generated depicting the daily, weekly, and monthly usage patterns; sample charts generated from the mining process are presented above. Web Usage Mining is an active field of research, and Web Usage Mining applications are being used on some well-known websites. This project presents an implementation of Web Usage Mining: web server log files are mined in order to analyze web usage patterns. The methodology employs data preprocessing, mining algorithms, and pattern analysis. The data preprocessing phase of Web Usage Mining is a challenging task. By applying mining algorithms to the web log file, the relationships between the accessed pages can be mined. The results from this project can be used by web administrators and web masters to improve web services and performance through the improvement of web sites, including their content, structure, presentation, and delivery.

5.2 APPLICATIONS
The results can be used to improve the web site from the users' viewpoint. Furthermore, the results produced by mining web logs can be used for various purposes:
- to personalize the delivery of web content,
- to improve user navigation through prefetching and caching,
- to improve web design, or
- in e-commerce, to improve customer satisfaction.
Personalization of Web Content: Web Usage Mining techniques can be used to provide a personalized web user experience. For instance, it is possible to predict, in real time, user behavior by comparing the current navigation pattern with typical patterns extracted from past web logs.
Prefetching and Caching: The results produced by Web Usage Mining can be exploited to improve the performance of web servers and web-based applications. Typically, Web Usage Mining can be used to develop proper prefetching and caching strategies so as to reduce the server response time.
Support to the Design: Usability is one of the major issues in the design and implementation of web sites. The results produced by Web Usage Mining techniques can provide guidelines for improving the design of web applications.
E-commerce: Mining business intelligence from web usage data is dramatically important for e-commerce web-based companies. Customer Relationship Management (CRM) can gain an effective advantage from the use of Web Usage Mining techniques. In this case, the focus is on business-specific issues such as customer attraction, customer retention, cross sales, and customer departure.

5.3 FUTURE ENHANCEMENT
As a future enhancement of this project, web pages can be pre-fetched depending on the usage patterns; pre-fetching can improve web performance to a great extent. Further, methods for analyzing sparse data can be applied to the study of web log access, using different similarity measures and association rules, to determine the most suitable alternatives for knowledge extraction from web log data. Finally, the project can be extended to access and process external web servers with appropriate access rights.

REFERENCES
[1] Abraham A., "Business Intelligence from Web Usage Mining", Journal of Information and Knowledge Management (JIKM), World Scientific Publishing Co., Singapore, Vol. 2, No. 4, pp. 1-15, 2003.
[2] Azizul Azhar bin Ramli, "Web Usage Mining Using Apriori Algorithm: UUM Learning Care Portal Case", in Proc. of the Int. Conf. on Knowledge Management, pp. 1-19, 2001.
[3] Cooley, R., Mobasher, B., Srivastava, J., "Web Mining: Information and Pattern Discovery on the World Wide Web", Proc. Ninth IEEE International Conference on Tools with Artificial Intelligence, pp. 558-567, 1997.
[4] Jiawei Han, Kevin Chen-Chuan Chang, "Data Mining for Web Intelligence", Computer, Vol. 35, No. 11, pp. 64-70, Nov. 2002.
[5] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition, Morgan Kaufmann Publishers, 2006.
[6] Kato, H., Hiraishi, H., Mizoguchi, F., "Log Summarizing Agent for Web Access Data Using Data Mining Techniques", Joint 9th IFSA World Congress and 20th NAFIPS International Conference, 25-28 July 2001, pp. 2642-2647, Vol. 5.
[7] Marquardt, C.G., Becker, K., Ruiz, D., "A Preprocessing Tool for Web Usage Mining in the Distance Education Domain", International Database Engineering and Applications Symposium, pp. 78-87, July 2004.
[8] Miriam Baglioni, U. Ferrara, Andrea Romei, Salvatore Ruggieri, Franco Turini, "Preprocessing and Mining Web Log Data for Web Personalization", Proc. of the 8th National Conf. of the Italian Association for Artificial Intelligence, 2003.
[9] Mukesh Mohania, A Min Tjoa (Eds.), Data Warehousing and Knowledge Discovery: First International Conference, DaWaK'99, Florence, Italy, 1999.
[10] F. van Harmelen, A. Kampman, H. Stuckenschmidt, and T. Vogele, "Knowledge-Based Meta-Data Validation: Analyzing a Web-Based Information System", in K. Greve, editor, 14th International Symposium Informatics for Environmental Protection, German Computer Society, 2000.
[11] Vinodkumar P. Kizhakke, "MIR: A Tool for Visual Presentation of Web Access Behavior", Master's thesis, University of Florida, Gainesville, 2000.
[12] Yang, T. Li and K. Wang, "Web-Log Cleaning for Constructing Sequential Classification", Applied Artificial Intelligence, Vol. 17, 2003.
[13] Abraham A., "Business Intelligence from Web Usage Mining", Journal of Information and Knowledge Management (JIKM), World Scientific Publishing Co., Singapore, Vol. 2, No. 4, pp. 1-15, 2003. http://citeseer.ist.psu.edu/abraham03business.html
[14] http://httpd.apache.org/docs/1.3/logs.html
[15] http://www.apacheweek.com/features/logfiles
[16] http://msdn2.microsoft.com/en-us/library/ms525807.aspx
[17] http://www.summary.net/manual/log_formats.html
[18] http://stream.bo.cnr.it/syshelp/config.htm
[19] http://webhosting.devshed.com/c/a/Web-Hosting-Articles/The-Top-Web-Servers-in-the-Market/2/
[20] http://www.lib.utexas.edu/dlp/imls/tools/logdb/atributedetails.html
[21] Kato, H., Hiraishi, H., Mizoguchi, F., "Log Summarizing Agent for Web Access Data Using Data Mining Techniques", Joint 9th IFSA World Congress and 20th NAFIPS International Conference, 25-28 July 2001, pp. 2642-2647, Vol. 5.
[22] Jiawei Han, Kevin Chen-Chuan Chang, "Data Mining for Web Intelligence", Computer, Vol. 35, No. 11, pp. 64-70, Nov. 2002.
[23] F. van Harmelen, A. Kampman, H. Stuckenschmidt, and T. Vogele, "Knowledge-Based Meta-Data Validation: Analyzing a Web-Based Information System", in K. Greve, editor, 14th International Symposium Informatics for Environmental Protection, German Computer Society, 2000.
[24] Miriam Baglioni, U. Ferrara, Andrea Romei, Salvatore Ruggieri, Franco Turini, "Preprocessing and Mining Web Log Data for Web Personalization", Proc. of the 8th National Conf. of the Italian Association for Artificial Intelligence, 2003.
[25] www.123loganalyzer.com/