Web Log Mining: A Study of User Sessions




UNIVERSITY OF PADUA
Department of Information Engineering
Information Management Systems Research Group

PersDL 2007: 10th DELOS Thematic Workshop on Personalized Access, Profile Management, and Context Awareness in Digital Libraries
Corfu, Greece, 29-30 June 2007

Maristella Agosti (agosti@dei.unipd.it)
Giorgio Maria Di Nunzio (dinunzio@dei.unipd.it)

Outline
Motivations
Approach
Experimental Analysis
Current and Future Works

PersDL 2007 - Corfu, Greece, 29-30 June 2007

Motivations
The three main approaches to evaluating an information access service are:
studies based on test collection analysis, the so-called Cranfield approach [1];
user studies;
the analysis of log data.
Web log file analysis began as a way for Web site administrators to ensure adequate bandwidth and server capacity for their organization. It can also suggest better ways to offer Web content, and provide information about problems encountered by users and about security problems of the site.
[1] C. W. Cleverdon. The Cranfield Tests on Index Language Devices. In Readings in Information Retrieval, Morgan Kaufmann Publishers, Inc., San Francisco, California, pp. 47-60, 1997.

The European Library
Case study: The European Library [2], a service set up by the Conference of European National Librarians (CENL) [3].
It was created to offer access to the combined resources (such as books, magazines, and journals, both digital and non-digital) of 47 national libraries of Europe.
The study analyses the data contained in the logs of its Web servers.
[2] http://www.theeuropeanlibrary.org/
[3] http://www.nlib.ee/cenl/

Scope of the study
Evaluate the information access service in order to give recommendations for developing possible future personalization services.
Report initial findings on a specific aspect that is highly relevant for personalization: the study of user sessions.
The log data used for the present analysis refer to the collections of 27 of the 47 national libraries, namely those that were full partners at the time of the analysis.

Approach
Information extracted from Web logs has to be managed efficiently before it can be analysed.
The proposed solution is based on database management methods, which permit the definition of an application that maintains and manages the necessary data [4].
The database and the application that have been developed separate the different entities recorded and facilitate data mining and on-demand querying of the log data.
[4] M. Agosti, G. M. Di Nunzio and A. Niero. From Web Log Analysis to Web User Profiling. In DELOS Conference 2007, Working Notes, Pisa, Italy, 2007, pp. 121-132.

Methodology
Log files usually come in a plain text format.
The proposed methodology for acquiring data from Web log files addresses two problems, gathering the information and storing it:
gathering data with parsers;
storing data with databases.

Gathering Data
Web log files record activities in ordered fields:
date: the date, in the form yyyy-mm-dd.
time: the time, in the form hh:mm:ss.
s-ip: the IP address of the server.
cs-method: the requested action, usually GET for common users.
cs-uri-stem: the URI stem of the request.
cs-uri-query: the URI query, where requested.
s-port: the server port used for the transaction.
cs-username: the username identifying the user.
c-ip: the IP address of the client.
cs(user-agent): the user-agent of the client. For a standard user this means the browser and other information about the operating system.
cs(referer): the site where the link followed by the user was located.
sc-status: the HTTP status of the request, that is, the response of the server.
sc-substatus: the substatus error code.
sc-win32-status: the Windows status code.
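
A minimal parsing sketch for records with the fields listed above, not the parser used in the study. It assumes that fields are separated by single spaces and that spaces inside a field are encoded as "+", as the user-agent in the sample records suggests. The function name parse_line, the SAMPLE record, its client address and its user-agent string are hypothetical and serve only as an illustration.

```python
from datetime import datetime

# Field order as listed on the slide; a real log normally declares the order
# in a "#Fields:" directive, which should take precedence over this list.
FIELDS = [
    "date", "time", "s-ip", "cs-method", "cs-uri-stem", "cs-uri-query",
    "s-port", "cs-username", "c-ip", "cs(user-agent)", "cs(referer)",
    "sc-status", "sc-substatus", "sc-win32-status",
]

# Hypothetical record (made-up client address and user-agent).
SAMPLE = ("2005-11-30 23:00:37 192.87.31.35 GET /index.htm - 80 - "
          "203.0.113.7 Mozilla/4.0+(compatible;+MSIE+6.0) - 200 0 0")


def parse_line(line, fields=FIELDS):
    """Return a dict for one log record, or None for directives,
    blank lines, and lines with an unexpected number of fields."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    values = line.split()
    if len(values) != len(fields):
        return None
    record = dict(zip(fields, values))
    # Combine the separate date and time fields into one timestamp.
    record["timestamp"] = datetime.strptime(
        record["date"] + " " + record["time"], "%Y-%m-%d %H:%M:%S"
    )
    return record
```

Calling parse_line(SAMPLE) yields a dictionary keyed by field name, so that, for instance, the client address is available as record["c-ip"]; malformed lines are skipped rather than raising an error, which is one possible design choice for noisy logs.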

Storing Data
[Figure: entity-relationship schema of the database. Entities: HEURISTIC, SESSION, CLIENT, REQUEST, USERAGENT, SERVER, URISTEM. Relationships: R1 to R5, with cardinalities (1, 1), (1, N) and (0, N). Attributes shown: ID, IPaddress, Timestamp, Useragent, Uristem.]
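
A rough sketch of how the entities named in the schema could be stored in a relational database, here with Python's built-in sqlite3 module. Only the entity and attribute names come from the slide; the foreign keys are one plausible reading of the relationships R1 to R5, which cannot be recovered exactly from the transcription, and the column names, the inserted rows and the function demo are hypothetical.

```python
import sqlite3

# Assumed mapping of the entities to tables; the foreign keys are a guess.
SCHEMA = """
CREATE TABLE heuristic (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE client    (id INTEGER PRIMARY KEY, ipaddress TEXT NOT NULL);
CREATE TABLE useragent (id INTEGER PRIMARY KEY, useragent TEXT NOT NULL UNIQUE);
CREATE TABLE server    (id INTEGER PRIMARY KEY, ipaddress TEXT NOT NULL UNIQUE);
CREATE TABLE uristem   (id INTEGER PRIMARY KEY, uristem TEXT NOT NULL UNIQUE);
CREATE TABLE session (
    id           INTEGER PRIMARY KEY,
    heuristic_id INTEGER NOT NULL REFERENCES heuristic(id),
    client_id    INTEGER NOT NULL REFERENCES client(id)
);
CREATE TABLE request (
    id           INTEGER PRIMARY KEY,
    timestamp    TEXT NOT NULL,
    session_id   INTEGER NOT NULL REFERENCES session(id),
    useragent_id INTEGER NOT NULL REFERENCES useragent(id),
    server_id    INTEGER NOT NULL REFERENCES server(id),
    uristem_id   INTEGER NOT NULL REFERENCES uristem(id)
);
"""


def demo():
    """Load two made-up requests and count the requests per session."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO heuristic VALUES (1, 'ip + user-agent, 15 min')")
    conn.execute("INSERT INTO client VALUES (1, '203.0.113.7')")
    conn.execute("INSERT INTO useragent VALUES (1, 'Mozilla/4.0+(compatible;+MSIE+6.0)')")
    conn.execute("INSERT INTO server VALUES (1, '192.87.31.35')")
    conn.executemany("INSERT INTO uristem VALUES (?, ?)",
                     [(1, "/index.htm"), (2, "/portal/index.htm")])
    conn.execute("INSERT INTO session VALUES (1, 1, 1)")
    conn.executemany("INSERT INTO request VALUES (?, ?, 1, 1, 1, ?)",
                     [(1, "2005-11-30 23:00:37", 1),
                      (2, "2005-11-30 23:00:38", 2)])
    rows = conn.execute(
        "SELECT s.id, COUNT(r.id) FROM session s "
        "JOIN request r ON r.session_id = s.id GROUP BY s.id"
    ).fetchall()
    conn.close()
    return rows
```

Keeping user-agents, servers and URI stems in their own tables stores each distinct value once, and a query such as the one in demo is an example of the on-demand querying that a database-backed approach allows.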

Approach - HTTP Requests
We concentrate here only on deriving information about user sessions from the analysis of the HyperText Transfer Protocol (HTTP) requests made by clients, grouped into sessions using a specific heuristic.
A request represents the data of an HTTP request as recorded in the Web log files, for example:
2005-11-30 23:00:37 192.87.31.35 GET /index.htm - 80 - 152.xxx.xxx.xxx Mozilla/4.0+(compatible;...

Approach - Sessions
A session is a particular set of requests made in a certain interval of time by the same client, for example:
2005-11-30 23:00:37 192.87.31.35 GET /index.htm - 80 - 152.xxx.xxx.xxx Mozilla/4.0+(compatible;...
2005-11-30 23:00:38 192.87.31.35 GET /portal/index.htm - 80 - 152.xxx.xxx.xxx Mozilla/4.0+(...
2005-11-30 23:00:38 192.87.31.35 GET /portal/scripts/hashtable.js - 80 - 152.xxx.xxx.xxx Mozilla/4.0+.
2005-11-30 23:00:44 192.87.31.35 GET /portal/scripts/session.js - 80 - 152.xxx.xxx.xxx Mozilla/4.0+...
2005-11-30 23:00:46 192.87.31.35 GET /portal/scripts/query.js - 80 - 152.xxx.xxx.xxx Mozilla/4.0+...
2005-11-30 23:00:47 192.87.31.35 GET /portal/scripts/search.js - 80 - 152.xxx.xxx.xxx Mozilla/4.0+...
Grouping the HTTP requests into sessions gives a better view of the actions performed by visitors.
When information about sessions is not available, sessions are identified through empirical rules.

Approach - Sessions Reconstruction
Session reconstruction can be used to attribute the list of recorded activities to the individual visitors of the site.
Possible rules for adding a request to an existing session:
the IP address and the user-agent are the same as those of the requests already in the session [5];
the request is made less than fifteen minutes after the last request in the session [6].
[5] D. Nicholas, P. Huntington, A. Watkinson. Scholarly journal usage: the results of deep log analysis. Journal of Documentation, Vol. 61, No. 2, 2005.
[6] B. Berendt, B. Mobasher, M. Nakagawa, M. Spiliopoulou. The Impact of Site Structure and User Environment on Session Reconstruction in Web Usage Analysis. WEBKDD 2002, LNAI 2703, pp. 159-179, 2003.
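
The two rules above can be sketched as follows. This is an illustrative reading of the heuristic, not the implementation used in the study: a request joins the most recent session of the same (IP address, user-agent) pair if it arrives less than fifteen minutes after the last request of that session, and otherwise opens a new session. The function name reconstruct_sessions and the tuple layout of a request are assumptions.

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=15)  # maximum gap between consecutive requests


def reconstruct_sessions(requests, timeout=TIMEOUT):
    """Group requests, given as (timestamp, ip, user_agent) tuples,
    into sessions. Returns a list of sessions, each a list of requests
    in chronological order."""
    sessions = []
    latest = {}  # (ip, user_agent) -> index of that client's latest session
    for req in sorted(requests, key=lambda r: r[0]):
        timestamp, ip, agent = req
        key = (ip, agent)
        idx = latest.get(key)
        # Same client and less than `timeout` since the last request
        # in its latest session: extend that session.
        if idx is not None and timestamp - sessions[idx][-1][0] < timeout:
            sessions[idx].append(req)
        else:
            latest[key] = len(sessions)
            sessions.append([req])
    return sessions
```

Note that the gap is measured from the last request already in the session, not from the first one, so a session can last much longer than fifteen minutes as long as the client keeps sending requests.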

Experimental Analyses
The experimental analysis was performed on a sample of The European Library Web log files covering eleven months, from 31st October 2005 to 25th September 2006.
The structure of the log file records conforms to the W3C Extended Log File Format [7].
The analyses presented in the following cover the software tools used by clients, such as operating systems and browsers, and sessions in terms of daily distribution and of time intervals per number of HTTP requests.
The reported numbers include all requests and sessions, even those that may belong to automatic crawlers and spiders.
[7] http://www.w3.org/tr/wd-logfile.html

Experimental Analyses - HTTP Requests
A total of 25,881,469 HTTP requests were recorded in the log files of the eleven months.
The distribution of the HTTP methods present in the log files is the following:

HTTP method    Total number
CONNECT                   2
LINK                      6
PROPFIND                760
PUT                   3,640
OPTIONS               3,779
HEAD                 33,770
POST                844,058
GET              24,995,454
Total            25,881,469

Experimental Analyses - Clients, OS, Browsers
During this period, according to the chosen heuristic, 949,643 sessions were reconstructed, with an average of 27 accesses per session; 285,125 distinct pairs of IP address and user-agent were found.
[Figure: two pie charts. Left: operating systems used by clients (Windows, Linux, Mac, Others), with percentage labels 75%, 14%, 9%, 2%. Right: browsers used by clients (MSIE, Firefox, Mozilla, Opera, Gecko, AOL, Konqueror, Others), with percentage labels 56%, 14%, 13%, 5%, 4%, 3%, 3%, 2%. The labels are listed in decreasing order, not matched to the categories.]
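
The reported average can be checked against the figures given in the slides: the per-method request counts and the number of reconstructed sessions. The short sketch below only redoes that arithmetic; the variable names are illustrative.

```python
# Per-method request counts and session count as reported in the slides.
METHOD_COUNTS = {
    "CONNECT": 2,
    "LINK": 6,
    "PROPFIND": 760,
    "PUT": 3_640,
    "OPTIONS": 3_779,
    "HEAD": 33_770,
    "POST": 844_058,
    "GET": 24_995_454,
}
TOTAL_SESSIONS = 949_643

total_requests = sum(METHOD_COUNTS.values())          # all recorded requests
mean_per_session = total_requests / TOTAL_SESSIONS    # average accesses per session
get_share = METHOD_COUNTS["GET"] / total_requests     # fraction of GET requests
```

The per-method counts add up to the reported total of 25,881,469 requests, and dividing by the 949,643 sessions gives about 27.25 requests per session, consistent with the stated average of 27.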

Experimental Analyses - Times, Sessions Lengths
Number of sessions per hour of day, with the Web server set to CET (left). Number of sessions per HTTP request interval (right).
[Figure, left: bar chart of the number of sessions (x 10^4) against the hour of day. Figure, right: bar chart of the number of sessions (x 10^5) against requests per session, in the intervals <= 25; > 25, <= 50; > 50, <= 75; > 75, <= 100; > 100.]

Experimental Analyses - Sessions Breakdown
Breakdown of sessions according to both the number of requests and the length of the session: sessions with at most 25 requests (left) and with more than 25 requests (right).
16% of the sessions last more than 60 seconds, regardless of the number of requests per session.
12% of the sessions contain more than 50 requests.
[Figure, left: number of sessions (x 10^5) with <= 25 requests, by session length in seconds (<= 20; > 20, <= 40; > 40). Figure, right: number of sessions (x 10^4) by requests per session (> 25, <= 50; > 50, <= 75; > 75, <= 100; > 100) and by session length in seconds (<= 10; > 10, <= 20; > 20, <= 30; > 30, <= 40; > 40, <= 50; > 50, <= 60; > 60).]
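
A breakdown of this kind can be computed by assigning each session to a pair of intervals, one for the number of requests and one for the session length. The sketch below uses the interval boundaries that appear in the chart; the function names bin_label and breakdown, and the representation of a session as a (number of requests, length in seconds) pair, are assumptions made for illustration.

```python
from bisect import bisect_left
from collections import Counter

REQUEST_EDGES = [25, 50, 75, 100]        # requests per session
LENGTH_EDGES = [10, 20, 30, 40, 50, 60]  # session length in seconds


def bin_label(value, edges):
    """Return the label of the half-open interval (low, high] containing value."""
    i = bisect_left(edges, value)  # index of the first edge >= value
    if i == 0:
        return f"<= {edges[0]}"
    if i == len(edges):
        return f"> {edges[-1]}"
    return f"> {edges[i - 1]}, <= {edges[i]}"


def breakdown(sessions, request_edges=REQUEST_EDGES, length_edges=LENGTH_EDGES):
    """Count sessions per (request interval, length interval) pair.
    `sessions` is an iterable of (n_requests, length_seconds) pairs."""
    counts = Counter()
    for n_requests, length in sessions:
        key = (bin_label(n_requests, request_edges),
               bin_label(length, length_edges))
        counts[key] += 1
    return counts
```

For example, a session with 26 requests lasting 20 seconds is counted under ("> 25, <= 50", "> 10, <= 20"). Passing different edge lists, such as [20, 40] for the session length, reproduces the coarser intervals used for the sessions with at most 25 requests.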

Experimental Analyses - Focus on Lengthy Sessions
The sessions with more than 100 requests have been analysed separately.
The majority of sessions with a high number of requests last from 2 to 30 minutes.
[Figure: number of sessions by requests per session (> 100, <= 200; > 200, <= 300; > 300, <= 400; > 400, <= 500; > 500) and by session length in seconds (> 60, <= 120; > 120, <= 300; > 300, <= 600; > 600, <= 1800; > 1800, <= 3600; > 3600).]

Conclusions
A preliminary analysis of eleven months of The European Library Web log data was carried out according to a methodology for gathering and mining information from Web log files based on a DataBase Management System (DBMS) application.
Initial findings were reported on the study of user sessions, which were reconstructed by means of heuristic methods, since no personal data was available to track each user.
The heuristics used to identify users and sessions suggest that authentication would be needed, since it would allow Web servers to identify users, track their requests, and, more importantly, create more accurate profiles tailored to specific needs.
Authentication would also help to mitigate the problem of crawler accesses: granting access to some sections of the Web site only to registered users would block crawlers that use faked user agents.

Current and Future Works
As a follow-up to the cooperation with The European Library, the Office of The European Library has implemented the changes suggested by this work:
the use of cookies in the HTTP server logging system (September 2006);
a user authentication procedure (since August 2006).
A comparison of session reconstruction using the heuristic with session reconstruction using cookies should give hints about the use of heuristics.
A more accurate profile can be built for each user by studying nationality, language, and chosen collections.