ANALYSIS OF WEB SERVER LOG BY WEB USAGE MINING FOR EXTRACTING USERS PATTERNS

International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR)
ISSN 2249-6831, Vol. 3, Issue 2, Jun 2013, 123-136
TJPRC Pvt. Ltd.

OM KUMAR C. U. (1) & P. BHARGAVI (2)
(1) Department of Computer Science, Sree Vidyanikethan Engineering College, Tirupathi, Andhra Pradesh, India
(2) Senior Asst. Prof., Dept. of CSE, Sree Vidyanikethan Engineering College, Tirupathi, Andhra Pradesh, India

ABSTRACT

The World Wide Web (WWW) is a system of interlinked hypertext documents accessed via the Internet. Around 1,100 million people access the Internet daily, and the information available on the WWW is growing accordingly. With this continued growth of information and the proliferation of web services and web-based information systems, the web sites that host them are also growing. The log file data of these sites offer insight into website usage; they can be collected from web servers, proxy servers and web clients. Before such data can be analysed using data mining techniques, the server's web log needs to be preprocessed. Web usage mining applies data mining techniques to extract knowledge from these web log files. This paper discusses log files and uses web mining techniques to extract usage patterns with WEKA.

KEYWORDS: Pre-Processing, Web Usage Mining, Web Server Log Data, Classification, Clustering, Rule Based Mining, Pattern Discovery

INTRODUCTION

The WWW continues to grow at an astounding rate in terms of both information and users. The scale of information on the Internet is growing at an incomprehensible rate, comparable to the mystifying size of planets and stars. The Internet has become a place where a massive amount of information and data is generated every day. Every minute, YouTube users upload 48 hours of video, Facebook users share 684,748 pieces of content, Instagram users share 3,600 pictures and Tumblr users publish 27,778 new posts.
Over the last decade, with the continued increase in the usage of the WWW, web mining has been established as an important area of research. Web mining is used to analyse WWW users, who leave abundant information in web logs that are structurally complex and incremental in nature. A log file is a record of everything that goes in and out of a particular server. Analysing such data yields knowledge, but the data must be preprocessed before it is analysed. Once analysed, log files reveal the activities of users over a potentially long period of time. They can be collected from web servers, proxy servers and web clients. When mined properly, these logs provide useful information for decision making. They contain information such as username, IP address, timestamp, bytes transferred, referring URL and user agent.

Based on the research on web mining [8], it is classified into 3 domains:

- Web Content Mining
- Web Structure Mining
- Web Usage Mining

Web usage mining (Figure 2) is the process of extracting information from server logs, i.e., the users' history. It helps in finding out what users are looking for on the Internet.

Web structure mining (Figure 2) is the process of using graph theory to analyse the node and connection structure of web sites. It is of 2 types:

- Extracting patterns from hyperlinks in the web: hyperlinks are structural components that connect a web page to a different location.
- Mining document structure: analyses the tree-like structure of a page to describe HTML and XML tag usage.

Web content mining (Figure 2) is the mining of web contents to find patterns and extract knowledge. It is of 2 types:

- Information retrieval view
- Database view

RELATED WORK

Data Preprocessing

The log file contains immaterial attributes, so preprocessing needs to be done before mining it. The general preprocessing techniques applied to data are cleaning, integration, transformation and reduction. Applying these techniques removes incomplete attributes, noisy data that contain errors, and inconsistent data that contain discrepancies, and thereby improves the quality of the data. Once the cleaned data is transformed, user sessions may be tracked to identify the user, and from them the user patterns can be extracted.

Figure 1: Data Pre-Processing

The obtained data is now clean and can be analysed. The web mining process is considered for analysing the preprocessed log file.
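The cleaning and session-tracking steps described above could be sketched roughly as follows. This is a minimal illustration, not the procedure used in this paper: the record layout (dictionaries with host, time, path and status keys), the list of ignored file suffixes, the rule of keeping only 2xx/3xx responses and the 30-minute session timeout are all assumptions introduced here for the example.

```python
from datetime import datetime, timedelta

# Assumptions for illustration only: which requests count as noise and
# how long a gap ends a session are not specified in the paper.
IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")
SESSION_TIMEOUT = timedelta(minutes=30)


def clean(records):
    """Drop failed requests and requests for embedded resources."""
    return [
        r for r in records
        if 200 <= r["status"] < 400
        and not r["path"].lower().endswith(IGNORED_SUFFIXES)
    ]


def sessionize(records):
    """Group requests per host into sessions split by the timeout."""
    sessions = {}
    for r in sorted(records, key=lambda r: (r["host"], r["time"])):
        host_sessions = sessions.setdefault(r["host"], [])
        last = host_sessions[-1][-1] if host_sessions else None
        if last is not None and r["time"] - last["time"] <= SESSION_TIMEOUT:
            host_sessions[-1].append(r)   # continue the current session
        else:
            host_sessions.append([r])     # start a new session
    return sessions


def _rec(host, hhmm, path, status):
    # Hypothetical helper that builds one log record for the demo below.
    t = datetime.strptime("2012-10-10 " + hhmm, "%Y-%m-%d %H:%M")
    return {"host": host, "time": t, "path": path, "status": status}


sample = [
    _rec("A", "10:00", "/index.html", 200),
    _rec("A", "10:05", "/logo.gif", 200),
    _rec("A", "10:10", "/a.html", 200),
    _rec("A", "11:30", "/b.html", 200),
    _rec("A", "11:31", "/missing.html", 404),
    _rec("B", "10:00", "/index.html", 200),
]
sessions = sessionize(clean(sample))
```

In this sketch the host address stands in for the user identity; a real preprocessing pipeline would usually combine it with the user agent or a cookie, since several users can share one address.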

Figure 2: Web Mining

Web Usage Mining

The goal of web usage mining is to examine the server records (log files) that store the transactions performed on the web, in order to find patterns that reveal how customers use it [7][11]. Two kinds of tracking can be distinguished:

- General access pattern tracking. Here we combine the access patterns of a group rather than an individual to obtain a trend that allows us to organize the web structure in a way that makes it easier for the user.
- Customized access pattern tracking. Here we gather information regarding a client's behavior with the website. Based on the gathered information, suggestions and advice are provided to improve the quality.

Web Content Mining

Here information is gathered regarding the searches performed on the content to identify user patterns. The two main strategies of web content mining are as follows.

Information Retrieval View: R. Kosala et al. summarized the research works done for unstructured data and semi-structured data from the information retrieval view. Their study revealed that researchers use frequent words, generally single words. These single words are used as training corpora for ranking web content based on the number of referrals. For semi-structured data, all the works utilize the HTML structures inside the documents, and some utilize the hyperlink structure between the documents, for document representation.

Database View: In order to have better information management and querying on the web, the mining always tries to infer the structure of the web site so that the web site can be transformed into a database.

Web Structure Mining

This type of mining is usually done to reveal the structure of websites by gathering structure-related data. Typically it takes into account two types of links: static and dynamic.
SERVER LOG

The server responds to user requests, and the server log records all transactions from start-up to shutdown of the server. Each user request is time-stamped, and the log records the ID of the request together with the requested action.

These log files can be located in 3 places:

- Web server: a web server dispenses the web pages as they are requested.
- Proxy server: a proxy server is an intermediary computer that acts as a hub through which user requests are processed.
- Web client: a web client is a computer application, such as a web browser, that runs on a user's local computer or workstation and connects to a server as necessary.

Table 1: Types of Server Logs

S.No  Type of Server Log and Example

1     Transfer: this log records the remote hosts visiting the server, with their time stamps.
      Example: 120.236.0.14-2007-11-05

2     Agent: a remote host surfs through a browser. The agent log records the user agent (browser).
      Example: InternetExplorer/5.0(win 7;)

3     Error: this log records all the warnings and errors caused to the system, with the time and type of error.
      Example: 10/10/2012 10:07:09 unable to open file academics. No such file
      Example: 05/06/2008 13:04:10 could not load repository template extension.

4     Referrer: provides extra features such as user references as links. When used with the agent log, it provides details like the type of user and the user agent. It can track external hosts using your documents from your space.
      Example: http://myblaze.sez.html>/pictures/myspace/happy.gif

Contents of Log File

A server log file is a file that the server automatically creates and maintains to record the activities performed on it. It maintains a history of page requests and helps us understand how and when the pages and applications of a website are being accessed by web browsers. These log files contain information such as the IP address of the remote host, the content requested and the time of the request.

SYNTAX
IPaddress, logproprietor, Username, [DD/MMM/YYYY:Timestamp GMToffset], "req method", status code, byte size.

Ex
104.11.13.108 - - [13/Jan/2006:16:56:12 -0600] "GET /EDC/cell.htm HTTP/1.0" 200 4093

- IP address: the IP address of the HTTP request is recorded to identify the remote host. Ex: 204.31.113.138.
- log proprietor: the name of the owner making an HTTP request is recorded in this field. Servers usually do not expose this information, for security purposes; when it is not exposed it is denoted by (-).
- Username: this field records the name of the user when the server receives an HTTP request. Servers usually do not expose this information, for security purposes; when it is not exposed it is denoted by (-).
- date: the request date is recorded here in the format shown. Ex: 13/Jan/2006.
- time: the time of the HTTP request is recorded here in 24-hour (astronomical) format. Ex: 16:56:12.

- GMToffset: this field shows the time difference between the local request time and Greenwich Mean Time, so that requests from any corner of the world can be analysed in any other part of the world. Ex: -0600.
- Reqmethod: the request type of the request is stored. Ex: GET.

Types of Log Formats

- NCSA Log Formats
- W3C Extended Log Format
- Microsoft IIS Log Format
- Sun One Web Server Format

NCSA Log Formats

The National Center for Supercomputing Applications (NCSA), established in 1986, developed a web server called httpd at its centre. The log of this web server was later given several extensions.

NCSA Common Log or Access Log Format

Stores basic information about the request received.

Syntax
Host IP address, Proprietor, Username, date: time, request method, status code, byte size.

Ex
200.40.12.4, -, -, [10/Oct/2006:10:16:52 +0500], GET /svec.html http 1.0, 200, 1460.

NCSA Combined Log Format

Stores all the common log information with two additional fields: referrer (the history of the request) and user_agent (the type of browser). The syntax and example below also carry a cookie field.

Syntax
Host IP address, Proprietor, Username, date: time, request method, status code, byte size, referrer, User_agent, Cookie.

Ex
200.40.12.4, -, -, [10/Oct/2006:10:16:52 +0500], GET /svec.html http 1.0, 200, 1460, http://www.cbci.com/, Mozilla/5.0 (WIN:7), UserID=om123;Pwd=101112.

NCSA Separate Log Format

In this type the information is split into 3 log files instead of being stored in a single file. The three log files are:

a) access log
b) referral log
c) agent log
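As an illustration of how the fields described above map onto a raw entry, the following sketch parses one line in the space-separated layout of the example shown under Contents of Log File, with optional quoted referrer and user-agent fields as in the combined format. The regular expression, the function name parse_line and the dictionary keys are assumptions made for this example; they are not taken from any particular server's documentation.

```python
import re
from datetime import datetime

# Assumed layout: host ident user [date:time offset] "method path protocol"
# status bytes, optionally followed by "referrer" "user agent".
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Date, time and GMT offset are parsed into one timezone-aware value.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A "-" byte size means no body was sent; it is treated as 0 here.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


sample = ('104.11.13.108 - - [13/Jan/2006:16:56:12 -0600] '
          '"GET /EDC/cell.htm HTTP/1.0" 200 4093')
record = parse_line(sample)
```

Fields that are not exposed stay as "-" in the result, and the referrer and agent keys are None when a line carries only the common fields.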

W3C Extended Log Format

The World Wide Web Consortium (W3C) is an international standards organization. Its log format provides rich information, hence the name extended log format. The lines starting with # contain directives:

#version: <int>.<int>
#Software: the software which generated the log.
#date: <date> <time>
#fields: this directive lists a sequence of entries. Their prefixes are as follows.

Table 2: W3C Directives

Acronym  Description
C        Client
S        Server
CS       Client to Server
SC       Server to Client
r        Remote host
Sr       Server to Remote host
rs       Remote host to Server

Ex
#version: 2.0
#Software: Microsoft windows server1.0
#date: <date> <time>
#field: C-ip S-ip CS-username CS-method CS-uri CS-version CS-user-agent SC-status
2000-04-14 10:16:42, 192.16.14.1, 200.14.100.4 - GET /Mypictures.gif http1.0 Mozilla/4.0 200.

Microsoft IIS Log File

Internet Information Services (IIS) enables you to track or record the activities happening on your website through the File Transfer Protocol (FTP), Network News Transfer Protocol (NNTP) and Simple Mail Transfer Protocol (SMTP), and allows you to choose a log format that works in synchronization with your system environment. Some supplementary attributes provided are elapsed time, total bytes transferred and target file.

Syntax
IP add, date timestamp, Server name, Server IP, elapsed time, http request size, byte size, status code, error, request method.

Ex
192.16.10.1,-,10/4/01 14:02:10, svec, 170.42.14.2, 1604, 140, 4240, 200, 0, GET, /Mypicture.gif.

Sun One Web Server

This format is similar in functionality to the log formats mentioned above, but it provides more security in 2 ways:

- By using Secure Socket Layer (SSL) between client and server.
- The administrator can provide access controls or permissions to files and directories.

Request & Response by the Server

All the log formats specified above have fields that record the HTTP request in the form of elapsed time and the response in the form of a status code.

To Handle Request

- Authorization: checks the user ID and password.
- URI Translation: translates the Uniform Resource Identifier to a local system path.
- Checking: checks the correctness of the file path against the user's privileges.
- MIME type checking: checks the Multipurpose Internet Mail Extensions (MIME) type of the requested resource.
- Input: prepares the system for reading input.
- Output: prepares the output for the client.
- Service: generates the response to the client.
- Log Entry: records the activity in the log.
- Error: this step is used only if any of the above steps fails in its normal execution.

Errors are of 2 types. They are as follows.

Connection Errors

These happen when a connection established for communication with the web server drops. They are classified as follows:

- Void URL: this simply means that the format of the Uniform Resource Locator is invalid.

- Host Not Found: this error occurs when the server could not be found by its host/domain name.
- Time Out: this error occurs when a connection could not be established within a predetermined time. The default timeout is set to 90 seconds.
- Connection Refused: this error occurs when an identified host refuses a connection through its default port.
- No Response from Web Server: this error is said to have occurred when an identified web server fails to respond within a time period.
- Unexpected Error: these are errors that do not report themselves in an anticipated manner and cannot be classified into one of the predetermined categories.

To Handle Response

If a connection is established successfully with a web server, then the server responds with one of the following status codes.

Web Server Status Code & Messages

The Status-Code element is a 3-digit integer result code of the attempt to understand and satisfy the request [6]. Table 3 explains the status messages.

Table 3: Status Codes

Status Code  Message
1xx          Informational
100          Continue
101          Switching protocols
2xx          Success
201          Created
202          Accepted
203          Non-authoritative information
204          No content
205          Reset content
206          Partial content
3xx          Redirection
301          Moved permanently
302          Moved temporarily
303          See other
304          Not modified
305          Use proxy
4xx          Client error
400          Bad request
401          Unauthorized
402          Payment required
403          Forbidden
405          Method not allowed
406          Not acceptable
407          Proxy authentication required
408          Request timeout
409          Conflict
410          Gone
411          Length required
412          Precondition failed
413          Request entity too large
414          Request-URI too long
415          Unsupported media type
5xx          Server error
501          Not implemented
502          Bad gateway
503          Service unavailable
504          Gateway timeout
505          HTTP version not supported

EXPERIMENTAL RESULTS

A company's log file, shown in Figure 3, is analyzed using WEKA. We apply classification and clustering [9][15] techniques to it.

Figure 3: Log File (Username, ReqType and UserAgent of the Log File are Not Considered here for Mining)

By applying If-Then classification rules [10] we obtain:

Rule: 1
IF
    freemem <= 193.5
    freemem > 117.5
THEN
    usr = 93 + 0.0001 * outtime - 0.0016 * intime - 0.0019 * bytesize + 0.0022 * req - 1.9716 * exec

Rule: 2
IF
    freemem > 304.5
    fork <= 1.095
    bytesize > 1626.5
THEN
    usr = 88 + 0.152 * intime - 0.0009 * outtime - 3.0119 * req - 0.0011 * exec + 0.31 * freeswap

Similarly, around 17 rules can be mined.

Applying k-means, we obtained the prior probabilities of the clusters.

Mean Distribution

Attribute: UserId     Normal Distribution. Mean = 25.1373    StdDev = 60.3306
Attribute: In Time    Normal Distribution. Mean = 16.3085    StdDev = 33.4104
Attribute: Out Time   Normal Distribution. Mean = 2979.7651  StdDev = 1538.001
Attribute: Byte Size  Normal Distribution. Mean = 271.0738   StdDev = 215.9147
Attribute: Request    Normal Distribution. Mean = 2.3778     StdDev = 2.7902
Attribute: Exec       Normal Distribution. Mean = 3.7507     StdDev = 6.2205
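Returning to the If-Then rules listed above, each rule can be read as a linear formula that applies only when its conditions hold. The sketch below encodes Rule 1 and Rule 2 in that way. It assumes that the conditions of a rule are joined by AND and that rules are tried in order, which the listing does not state explicitly; the function name predict_usr and the dictionary-based record are hypothetical, and the input values are invented for the demonstration.

```python
def predict_usr(rec):
    """Apply Rule 1, then Rule 2; return None if neither rule fires.

    Assumption: the conditions within a rule are combined with AND.
    """
    # Rule 1: 117.5 < freemem <= 193.5
    if 117.5 < rec["freemem"] <= 193.5:
        return (93
                + 0.0001 * rec["outtime"]
                - 0.0016 * rec["intime"]
                - 0.0019 * rec["bytesize"]
                + 0.0022 * rec["req"]
                - 1.9716 * rec["exec"])
    # Rule 2: freemem > 304.5, fork <= 1.095, bytesize > 1626.5
    if (rec["freemem"] > 304.5
            and rec["fork"] <= 1.095
            and rec["bytesize"] > 1626.5):
        return (88
                + 0.152 * rec["intime"]
                - 0.0009 * rec["outtime"]
                - 3.0119 * rec["req"]
                - 0.0011 * rec["exec"]
                + 0.31 * rec["freeswap"])
    return None


# Invented records, chosen only so that each rule fires once.
rule1_input = {"freemem": 150, "outtime": 1000, "intime": 100,
               "bytesize": 200, "req": 2, "exec": 1}
rule2_input = {"freemem": 400, "fork": 1.0, "bytesize": 2000,
               "intime": 10, "outtime": 1000, "req": 1, "exec": 1,
               "freeswap": 10}
usr1 = predict_usr(rule1_input)
usr2 = predict_usr(rule2_input)
```

The remaining rules mentioned above would be added as further branches in the same pattern; they are not reproduced here because the listing shows only the first two.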

Cluster: 0  Prior probability: 0.6381
Cluster: 1  Prior probability: 0.3619

Figure 4: Kmeans Cluster-Distribution

From the above graph, Cluster 0 contains the maximum number of users.

Farthest First

Farthest first is a variant of k-means that places each cluster centre in turn at the point furthest from the existing cluster centres. This point must lie within the data area. This greatly speeds up the clustering in most cases, since less reassignment and adjustment is needed.

Cluster Centroids

Cluster 0  1098.0  rajesh      M  Other password  10/9/2010  10/9/2010  2
Cluster 1  123.0   Vaishnavee  F  Mobile Bill  9.940870123E9  9/1/2009  9/1/2009  227

Time taken to build model (full training data): 0.06 seconds

Clustered Instances

0  740 (95%)
1  42 (5%)

Figure 5: Centroids of Cluster
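The farthest-first placement of cluster centres described above can be sketched as follows. This is a simplified illustration on numeric points with Euclidean distance, not the WEKA implementation used for the results in this paper; the function name, the choice of the first centre and the sample points are assumptions made for the example, and the log data clustered in the paper also contain non-numeric attributes that this sketch does not handle.

```python
import math


def farthest_first(points, k, first=0):
    """Pick k centres by farthest-first traversal and label every point.

    Assumes numeric tuples, Euclidean distance and k <= len(points).
    """
    centres = [points[first]]
    # Distance from each point to its nearest centre chosen so far.
    nearest = [math.dist(p, centres[0]) for p in points]
    while len(centres) < k:
        # The next centre is the data point furthest from all centres.
        idx = max(range(len(points)), key=nearest.__getitem__)
        centres.append(points[idx])
        nearest = [min(d, math.dist(p, points[idx]))
                   for p, d in zip(points, nearest)]
    # Single assignment pass: each point joins its closest centre.
    labels = [min(range(k), key=lambda c: math.dist(p, centres[c]))
              for p in points]
    return centres, labels


points = [(0, 0), (1, 0), (10, 0), (10, 1), (5, 5)]
centres, labels = farthest_first(points, k=2)
```

Because every centre is itself a data point, the requirement that the centre lies within the data area holds by construction, and no iterative reassignment is needed after the centres are chosen.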

CONCLUSIONS AND FUTURE WORK

There is a growing trend among companies, organizations and individuals alike to gather information about users from log files, but fulfilling user needs remains a challenging task for them. Web mining has valuable uses in the marketing of businesses and a direct impact on the success of their promotional strategies and internet traffic. This information is gathered on a daily basis and continues to be analyzed consistently [15]. Analysis of this pertinent information will help companies to develop more effective promotions, internet accessibility, inter-company communication and structure, and productive marketing skills through web usage mining.

This paper gives a detailed look at servers, data mining, web mining, and the web server log file and its formats. Further, we extracted patterns of the user using clustering, decision trees and If-Then rules. The extension of this research work is to mine the log file based on user clicks. This requires deep insight into log files that store the clicks of the user.

REFERENCES

1. V. V. R. Maheswara Rao, Dr. V. Valli Kumari, An Enhanced Pre-Processing Research Framework For Web Log Data Using a Learning Algorithm, Journal of Computer Science & Information Technology, pp. 01-15, 2011.
2. Kobra Etminani, Mohammad-R. Akbarzadeh-T., Noorali Raeeji Yanehsari, Web Usage Mining: users navigational patterns extraction from web logs using Ant-based Clustering Method, International Fuzzy System Association & European Society of Fuzzy Logic Technology (IFSA-EUSFLAT), pp. 396-401, 2009.
3. Mohd Helmy Abd Wahab, Mohd Norzali Haji Mohd, Hafizul Fahri Hanafi, Mohamad Farhan Mohamad Mohsin, Data Pre-processing on Web Server Logs for Generalized Association Rules Mining Algorithm, Proc. of World Academy of Science, Engineering and Technology, Vol. 36, pp. 970-977, Dec 2008.
4. Ratnesh Kumar Jain, Dr. R. S. Kasana, Dr. Suresh Jain, Efficient Web Log Mining using Doubly Linked Tree, International Journal of Computer Science and Information Security, Vol. 3, No. 1, pp. 402-407, 2009.
5. Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava, Data Preparation for Mining World Wide Web Browsing Patterns.
6. K. R. Suneetha, Dr. R. Krishnamoorthi, Identifying User Behavior by Analyzing Web Server Access Log File, International Journal of Computer Science and Network Security, Vol. 9, No. 4, pp. 327-332, April 2009.
7. S. K. Pani, L. Panigrahy, V. H. Sankar, Bikram Keshari Ratha, A. K. Mandal, S. K. Padhi, Web Usage Mining: A Survey on Pattern Extraction from Web Logs, International Journal of Instrumentation, Control & Automation, Volume 1, Issue 1, pp. 15-23, 2011.
8. Arvind Kumar Sharma, Dr. P. C. Gupta, Exploration of Efficient Methodologies for the Improvement In Web Mining Techniques: A Survey, International Journal of Research in IT & Management, Vol. 1, Issue 3, pp. 85-95, July 2011.
9. Stavros Valsamidis, Sotirios Kontogiannis, Ioannis Kazanidis, Theodosios Theodosiou and Alexandros Karakos, A Clustering Methodology of Web Log Data for Learning Management Systems, pp. 154-167, 2012.

10. M. Malarvizhi, S. A. Sahaaya Arul Mary, Preprocessing of Educational Institution Web Log Data for Finding Frequent Patterns using Weighted Association Rule Mining Technique, European Journal of Scientific Research, Vol. 74, No. 4, pp. 617-633, 2012.
11. Sawan Bhawsar, Kshitij Pathak, Sourabh Mariya, Sunil Parihar, Extraction of Business Rules from Web logs to Improve Web Usage Mining, Vol. 2, Issue 8, Aug, pp. 333-340, 2012.
12. Vijay K. Gurbani, Eric Burger, Carol Davids, Tricha Anjal, SIP CLF: A Common Log Format (clf) For The Session Initiation Protocol, pp. 1-8, 2010.
13. Thanakorn Pamutha, Siriporn Chimphlee, Chom Kimpan, and Parinya Sanguansat, Data Preprocessing on Web Server Log Files for Mining Users Access Patterns, International Journal of Research and Reviews in Wireless Communications (IJRRWC), Vol. 2, No. 2, June, pp. 92-98, 2012.
14. Navin Kumar Tyagi, A. K. Solanki and Manoj Wadhwa, Analysis of Server Log by Web Usage Mining for Website Improvement, International Journal of Computer Science Issues (IJCSI), Vol. 7, Issue 4, No. 8, July, pp. 17-20, 2010.
15. Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann Publishers, 1999.