Web Usage Mining: Structuring Semantically Enriched Clickstream Data

by Peter I. Hofgesang (Stud.nr. )

A thesis submitted to the Department of Computer Science in partial fulfilment of the requirements for the degree of Master of Computer Science at the Vrije Universiteit Amsterdam, The Netherlands.

August 2004
Supervisor: Dr. Wojtek Kowalczyk, Department of Computer Science, Faculty of Sciences, Vrije Universiteit Amsterdam

Second reader: Dr. Elena Marchiori, Department of Computer Science, Faculty of Sciences, Vrije Universiteit Amsterdam
Abstract

Web servers worldwide generate a vast amount of information on web users' browsing activities. Several researchers have studied these so-called clickstream or web access log data to better understand and characterize web users. Clickstream data can be enriched with information about the content of visited pages and the origin (e.g., geographic, organizational) of the requests. The goal of this project is to analyse user behaviour by mining enriched web access log data. We discuss the techniques and processes required for preparing, structuring and enriching web access logs. Furthermore we present several web usage mining methods for extracting useful features. Finally we employ all these techniques to cluster the users of the domain and to study their behaviours comprehensively. The contributions of this thesis are a content- and origin-based data enrichment and a tree-like visualization of frequent navigational sequences. This visualization allows for an easily interpretable tree-like view of patterns with the relevant information highlighted. The results of this project can be applied to diverse purposes, including marketing, web content advising, (re-)structuring of web sites and several other e-business processes, such as recommender and advertising systems.
Content

1 Introduction
2 Related research
3 Data preparation
   Data description
   Cleaning access log data
   Data integration
   Storing the log entries
   An overall picture
4 Data structuring
   User identification
   User groups
   Session identification
   An overall picture
5 Profile mining models
   Mining frequent itemsets
   The mixture model
   The global tree model
6 Analysing log files of the web server
   Input data
   Distribution of content-types within the VU-pages and access log entries
   Experiments on data structuring
   Mining frequent itemsets
   The mixture model
   The global tree model
7 Conclusion and future work
Acknowledgements
Bibliography
APPENDIX
   APPENDIX A. The uniform resource locator (URL)
   APPENDIX B. Input file structures
   APPENDIX C. Experimental details
   APPENDIX D. Implementation details
   APPENDIX E. Content of the CD-ROM
Structure

This master's thesis is organized as follows:

Chapter 1, Introduction. This chapter provides a high-level overview of the related research and the main goals of this project.

Chapter 2, Related research. Chapter 2 gives a comprehensive overview of the related research known so far.

Chapter 3, Data preparation. This chapter follows all steps of the data preparation process. It starts by describing the main characteristics of the input data, followed by a description of the data cleaning process. The section on data integration explains how the different data sources are merged together for data enrichment, while the next section concerns data loading. Finally an overall scheme and an experiments section are laid out.

Chapter 4, Data structuring. In chapter 4 we explain how the semantically enriched data are combined to form user sessions. It also discusses the process of user identification and gives a description of groups of users, both of which are preliminary requirements for the identification of sessions. The chapter ends with an overall scheme of data structuring followed by a section on experiments.

Chapter 5, Profile mining models. This chapter provides an overview of the theoretical background of the applied data mining models. First it explains the widely used frequent itemset mining algorithm. The following section describes the recently researched mixture model architecture. Finally a tree model is proposed for exploiting the hierarchical structure of session data.

Chapter 6, Analysing log files of the web server. Chapter 6 discusses experimental results of the mining models applied to the semantically enriched data. All the input data are related to a specific web domain.

Chapter 7, Conclusion and future work. Finally, in chapter 7 we present the conclusions of our research and explore avenues of future work.
1 Introduction

The extensive growth of the information reachable via the Internet makes this information increasingly difficult to manage. Numerous companies face the problem of publishing their product range or other information online in an efficient, easily manageable way. The exploration of web users' habits and behaviours plays a key role in dissecting and understanding this problem. Web mining is the application of data mining techniques to web data sets. The three major web mining methods are web content mining, web structure mining and web usage mining. Content mining applies mining methods to the content of web documents. Structure mining reveals hidden relations in web site and web document structures. In this thesis we employ web usage mining, which provides methods to discover useful usage patterns from web data.

Web servers are responsible for serving the available web content on user requests. They collect all the information on request activities into so-called log files. Log data are a rich source for web usage mining. Much scientific research targets the field of web usage mining and especially user behaviour exploration. Besides, there is a great demand in the business sector for personalized, custom-designed systems that conform closely to the requirements of users.

There is also a substantial amount of prior scientific work on modelling web user characteristics. Some of it presents a complete framework for the whole web usage mining task (e.g., Mobasher et al. (1996) [18] proposed WEBMINER). Many studies present page access frequency based models and modified association rule mining algorithms, such as [1, 31, 23]. Xing and Shen (2003) [30] proposed two algorithms (UAM and PNT) for predicting user navigational preferences, both based on page visit frequency and page viewing time. UAM is a URL-URL matrix providing page-to-page transition probabilities computed over all users' statistics, and PNT is a tree based algorithm for mining preferred navigation paths. Nanopoulos and Manolopoulos (2001) [21] present a graph based model for finding traversal patterns in web page access sequences. They introduce one level-wise and two non-level-wise algorithms for large paths exploiting the graph structure. While most of the models work at the global session level, an increasing number of studies show that the exploration of user groups or clusters is essential for better characterisation: Hay et al. (2003) [14] suggest the Sequence Alignment Method (SAM) for measuring the distance of sessions while incorporating structural information. The proposed distance is reflected by the number of operations required to transform sessions into one another. SAM distance based clusters form the basis of further examinations. Chevalier et al. (2003) [8] suggest rich navigation patterns consisting of frequent page set groups and web user groups based on demographical patterns. They show the correlation between the two types of data. Other studies point far beyond frequency based models: Cadez et al. (2003) [4] propose a finite mixture of Markov models on sequences of URL categories traversed by users. This complex probability based structure models the data generation process itself.

In this thesis we discuss the techniques and processes required for further analysis. Furthermore we present several web usage mining methods for extracting useful features. The overall process workflow can be seen in figure 1.
[Figure 1: The overall process workflow]

This thesis considers three separate data sets as input data. Access log data are generated by the web server of the specified domain and contain user access entries. The content-type mapping table contains relations between documents and their categories in the form of URL / content type pairs. Mapping tables can either be generated by classifier algorithms or by content providers. In the latter case, the contents of pages are given explicitly in the form of content categories (e.g., news, sport, weather, etc.). Geographical and organizational information makes it possible to determine different categories of users.

All data mining tasks start with data preparation, which prepares the input data for further examination. It consists of four main steps, as can be seen in figure 1. Data filtering strips out irrelevant entries, data integration enriches log data with content labels, and the enriched data are stored in a database. The user selection process sorts out the appropriate user entries of a specified group for session identification. The following step in the whole process is session identification. Related log entries are identified as unique user navigational sequences. Finally these sequences are written to output files in different formats depending on the application. The profile mining step applies several web usage mining methods to discover relevant patterns. It uses an association rules mining algorithm [1] for mining frequent page sets and for generating interesting rules. It also applies the mixture model proposed by Cadez et al. (2001) [5] to build a predictive model of the navigational behaviours of users. Finally it presents a tree model for representing and visualizing visiting patterns in a natural, easily interpretable way.

In the experimental part of this thesis we employ all these techniques to address the problem of defining clusters of the users of the web domain, and we study their behaviours comprehensively. The contributions of this thesis are content based data enrichment and visualization of frequent navigational sequences. Data enrichment augments users' transactional data with the content types of the visited pages and documents, and makes distinctions among users based on geographical and organizational information. The visualization presents a tree-like view of patterns that highlights relevant information and can be interpreted easily.
2 Related research

There are numerous commercial software packages usable to obtain statistical patterns from web logs, such as [11, 22, 37]. They focus mostly on reporting log data statistics and frequent navigation patterns, but in most cases they do not explore relationships among relevant features.

Some studies aim at proposing data structures to facilitate web log mining processes. Punin et al. (2001) [24] defined the XGMML and LOGML XML languages: XGMML is for graph description while the latter is for web log description. Other papers focus only (or mostly) on data preparation [6, 13, 15]. Furthermore there are complete frameworks for the whole web usage mining task (e.g., Mobasher et al. (1996) [18] proposed WEBMINER). Many studies, such as [1, 23, 31], present page access frequency based models and modified Apriori [1] (frequent itemset mining) algorithms.

Some papers (e.g., [32], [10], [9]) present online recommender systems to assist the user's browsing or purchasing activity. Yao et al. (2000) [32] use standard data mining and machine learning techniques (e.g., frequent itemset mining, the C4.5 classifier, etc.) combined with agent technologies to provide an agent based recommendation system for web pages, while Cho et al. (2002) [10] suggest a product recommendation method based on data mining techniques and product taxonomy. This method employs decision tree induction for selecting the users likely to buy the recommended products.

Hay et al. (2003) [14] apply the sequence alignment method (SAM) for clustering user navigational paths. SAM is a distance based measuring technique that considers the order of sequences. The SAM distance of two sequences reflects the number of transformations (i.e., delete, insert, reorder) required to equalize them. A distance matrix, which holds the SAM distance scores for all session pairs, is required for clustering. The analysis of the resulting clusters showed that the SAM based method outperforms conventional association distance based measuring. In their paper Runkler and Bezdek (2003) [27] use the relational alternating cluster estimation (RACE) algorithm for clustering web page sequences. RACE finds the centers of a specified number of clusters based on a page sequence distance matrix. The algorithm alternately computes the distance matrix and one of the cluster centers in each iteration. They propose the Levenshtein (a.k.a. edit) distance for measuring the distance between members (i.e., textual representations of the visited page number sequences within sessions). The Levenshtein distance counts the number of delete, insert or change steps necessary to transform one word into the other.

Pei et al. (2000) [23] propose a data structure called the web access pattern tree (WAP-tree) for efficient mining of access patterns from web logs. WAP-trees store all the frequent candidate sequences that have a support higher than a preset threshold. All the information stored by a WAP-tree is node labels and frequency counts. In order to mine useful patterns in WAP-trees they present the WAP-mine algorithm, which applies conditional search for finding frequent events. The WAP-tree structure and the WAP-mine algorithm together offer an alternative to Apriori-like algorithms. Smith and Ng (2003) [28] present a self-organizing map framework (LOGSOM) to mine web log data and a visualization tool for user assistance.

Jenamani et al. (2003) [16] use a semi-Markov process model for understanding e-customer behaviour.
The keys of the method are a transition probability matrix (P) and a mean holding time matrix (M). P is a stochastic matrix and its elements store the probabilities of transitions between states. M stores the average length of time a process remains in state i before moving to state j. In this way this probabilistic model is able to model the time elapsed between transitions.

Some papers present methods based on content assumptions. Baglioni et al. (2003) [2] use URL syntax to determine page categories and to explore the relation between users' sex and navigational behaviour. Cadez et al. (2003) [4] experiment on categorized data from Msnbc.com.

Visualization of frequent navigational patterns makes human perception easier. Cadez et al. (2003) [4] present the WebCanvas tool for visualizing Markov chain clusters. This tool represents all user navigational paths for each cluster, colour coded by page categories. Youssefi et al. (2003) [33] present 3D visualizations superimposing web log patterns on extracted web structure graphs.
3 Data preparation

Preparing the input data is the first step of all data and web usage mining tasks. The data in this case are, as mentioned above, the access log files of the web server of the examined domain and the content-type mapping table of the HTML pages within this domain. Data preparation consists of three main steps: data cleaning/filtering, data integration and data storing. Data cleaning is the task of removing all irrelevant entries from the access log data set. Data integration establishes the relation between log entries and content mappings. The last step is to store the enriched data in a convenient database. A comprehensive study of all these preprocessing tasks has been made by Cooley et al. (1999) [13].

This chapter starts with the description of the input data and the generation procedure, followed by the details of access log cleaning and of the integration of log entries with the mapping data. Finally it presents the database scheme for data storing, and an overall picture and description of the data preparation process.

3.1 Data description

This section describes the details of the access log and content-type mapping data.

Access log files

Visitors to a web site click on links and their browser in turn requests pages from the web server. Each request is recorded by the server in so-called access log files (there are other types of log files generated by the web server as well, but this project does not consider them). Access logs contain requests for a given period of time. The time interval used is normally an attribute of the web server. There is a log file present for each period and the old ones are archived or erased depending on usage and importance.

Most web server log files are stored in a common log file format (CLFF) [34] or in an extended log file format (ELFF) [35]. An extended log file contains a sequence of lines of ASCII characters terminated by either the sequence LF or CRLF. Entries consist of a sequence of fields relating to a single HTTP transaction. Fields are separated by white space. If a field is unused in a particular entry, a dash ("-") marks the omitted field. Web servers can be configured to write different fields into the log file in different formats. The most common fields used by web servers are the following: remotehost, rfc931, authuser, date, request, status, bytes, referrer, user_agent.
The meanings of all these fields are explained in the table below, with examples where available:

Table 1: The most commonly used fields of access log file entries by web servers

    remotehost : Remote hostname (or IP number if the DNS hostname is not available).
    rfc931 : The remote login name of the user. Example: -
    authuser : The username with which the user has authenticated himself. Example: -
    [date] : Date and time of the request, with the web server's time zone. Example: [20/Jan/2004:23:17: ]
    "request" : The request line exactly as it came from the client. It consists of three subfields: the request method, the resource to be transferred, and the protocol used. Example: "GET / HTTP/1.1"
    status : The HTTP status code returned to the client. Example: 200
    bytes : The content-length of the document transferred.
    "referer" : The URL the client was on before requesting the URL. Example: "-"
    "user_agent" : The software the client claims to be using. Example: "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"

Content types mapping table

A content types mapping table is a table containing URL/content type pair entries. URLs are file locator paths referring to documents, and content types are labels giving the types of documents (for more details about URLs refer to APPENDIX A). Content types can either be generated by an algorithm or provided by content providers where the contents of pages are given explicitly (e.g., sport pages refer to sport content, etc.). Generator algorithms can also be distinguished depending on whether they produce the content types automatically or are driven by human interaction.
We use an external algorithm [3], which attaches labels to all HTML documents in a collection of HTML pages based on their contents. The algorithm is based on the naive Bayes classifier supplemented by a smart example selector algorithm. It uses only the textual content of the HTML pages, stripping out the control tags. Some parts of the text, enclosed within special tags (e.g., title or header tags), are given a special bias (weight). The algorithm chooses the first 100 pages randomly to be categorized by humans. This initialization step is followed by an active learning method, which chooses the next examples by considering the ones already selected.

This thesis deals with other documents besides HTML as well (e.g., pdf, ps, doc, rtf, etc.). However, attaching labels to each of them based on their content would be a difficult process, because the structure of these files is format specific and most of the time very complex, and their size is usually very large. For these reasons a very simple technique is used to identify such documents. The label "documents" is attached to all pdf and ps files, which refer to scientific papers, e-books, documentation, etc., while the label "other documents" is attached to all other document types (e.g., doc, rtf, ppt, etc.). "Other documents" cover, e.g., administrative papers, forms, etc. Accordingly, the mapping table is completed with entries for these two labels. The following table presents an example of a content-types mapping table:

Table 2: An example of a content-type mapping table

    URL : content type identifier
    bi/courses-en.html : 4
    ci/datamine/diana/index.html : 6

3.2 Cleaning access log data

As described above, raw access log files contain a vast number of varied request entries. Each log entry can be informative for some application, but this project excludes most of them. Processing certain types of requests would lead to wrong conclusions (e.g., requests generated by spider engines). Besides, stripping the data has a positive effect on processing time and the required storage space. Since this project focuses only on documents themselves (like html, pdf, ps, doc files), all request entries for other file types should be stripped out. Furthermore, as the main goal is the characterization of users, robot transactions, i.e., web traffic generated automatically by robot programs, must also be filtered out. There are several other criteria for filtering. Detailed descriptions of the filtering criteria and methods follow below.

Filtering unsupported extensions

A typical web page is made up of many individual files. Beyond the HTML page itself, it consists of graphical elements, style files, mappings etc., all in separate files. Each user request for an
HTML file evokes hidden requests for all the files required for displaying that specific page. In this manner, access log files contain the traces of all these hidden requests as well. Extension filtering strips out all request entries for file types other than the predefined ones (for the structure of the extension list file refer to APPENDIX B4, Extension filter list file). The extensions of the requested files can be extracted from the request field of the log entries. An example of such a request field:

"GET /ai/kr/imgs/ibrow.jpg HTTP/1.0"

Filtering spider transactions

A significant portion of log file entries is generated by robot programs. These robots, also known as spider or crawler engines, automatically search through a specific range of the web. They index web content for search engines, prepare content for offline browsing, or serve several other purposes. The common point in all crawlers' activity is that, although they are mostly supervised by humans, they generate systematic, algorithmic requests. So without eliminating spider entries from log files, real users' characteristics would be distorted by the features of machines. Spiders can be identified by searching for specific spider patterns in the "user_agent" field of log entries. Most well-disposed spiders put their name or some kind of identifying pattern into this field. Once a pattern has been identified, the filter method ignores the examined log entry. Spider patterns can be collected by browsing the web for spiders: there are several pages covering spider activities and patterns, and there are lots of professional forums on the subject (mostly discussing how to avoid them) [29]. Spider patterns are collected in a separate spider list file (refer to APPENDIX B5). An example of such a user_agent field:

"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;.NET CLR )"

Filtering dynamic pages

Web pages generated dynamically on user requests are called dynamic pages. These pages cannot be located on the web server as an individual file, since they are built by a specific engine using several data sources. For this reason dynamic pages cannot be analyzed in a simple way. However, with the application of several tricks it is still possible to obtain useful information. Jacobs et al. (2001) in [15] use an inductive logic programming (ILP) framework to reveal usage patterns based on the dynamic page link parameters that are passed to the server. Since it is not an objective of this thesis to apply sophisticated methods for information recovery on dynamic pages, the filtering process simply eliminates all such references.
There is no standard for the structure of URL requests for dynamic pages, except that parameters appear after the "?" (question mark) in the URL and consist of name/value pairs. Therefore, dynamic pages can basically be filtered out by searching for the question mark in the request fields of log entries. Note that requests for a single dynamic page without any parameters, thus without the delimiter question mark, would be stripped out during extension filtering (e.g., *.jsp, *.php, *.asp pages). An example of such a dynamic page's request field:

"GET /obp/overview.php?lang=en HTTP/1.0"

Filtering HTTP request methods

HTTP/1.0 [25, 26] allows several methods to be used to indicate the purpose of a request. The most often used methods are GET, HEAD and POST. Since using the GET method is the only way of requesting a document that could be useful for this project, the request method filter ignores any other requests. The filter examines the request field of the log entry for the GET method identifier. An example of a filtered request field:

"POST /modules/coppermine/themes/default/theme.php HTTP/1.0"

Filtering and replacing escape characters

URL escape characters are special character sequences made up of a leading % character and two hexadecimal characters. They substitute special characters in URL requests that could be problematic while transferring requests to different types of servers. Special characters are simply replaced by sequences of standard characters. In most cases the task is only to replace these escape sequences with the characters they represent, but in certain instances URLs contain corrupted sequences that cannot be interpreted. In these cases the entries should be ignored. Corrupt sequences can be caused by typing errors of the users, automatically generated robot requests, etc.

Filtering unsuccessful requests

If a user requests a page that does not exist, the server replies with the well-known 404 "page not found" error message. In this case the user has to use the back button to navigate back to the previous page or type a different URL manually. Either way the user doesn't use the requested page to navigate through it, since the error page doesn't provide any link to follow. For this reason log entries of erroneous requests should also be ignored. These entries can be filtered by examining the status field. The status of corrupt requests mostly equals 404. In special cases the status field can take other values as well, such as 503, etc.
An example of such a log entry:

[16/May/2004:08:07: ] "POST /modules/coppermine/include/init.inc.php HTTP/1.0" "-" "Mozilla 4.0 (Linux)"

Filtering request URLs for a domain name

A URL of a page request consists of a domain name and the path of the requested document relative to the public directory of the domain. Since the domain name is unambiguous to the responsible web server, it stores only the relative path of the request in the access log files, without the domain name. In a few cases, however, log file entries contain the whole absolute path. This leads to mapping errors during data integration, since the mapping table contains only relative paths and the comparison is based on path similarity. For these reasons a URL in the request field has to be transformed to the relative format. An example of such a request field:

"GET / HTTP/1.1"

Path completion

When a user requests a public directory instead of a specific file, the web server tries to find the default page in that directory. The default page is index.html in most cases, but it varies between web servers. Thus the task is to complete the URL with the name of the default page in case a log entry contains a directory request. It is possible that the server does not contain the default page in the requested directory. In this case the log entry will be filtered out when it is looked up in the content-type mapping table (refer to section Content types mapping table). An example of such a request field:

original request field: "GET /pub/minix/ HTTP/1.1"
completed request field: "GET /pub/minix/index.html HTTP/1.1"

Filtering anchors

Anchors are special qualifiers for HTML link references. They act as reference points within a single web page. If a named anchor is placed somewhere in the HTML page's body, a link referring to the HTML page, completed with a hash mark and the name of the anchor (e.g., link + # + anchor name), will, when followed, scroll directly to the place where the anchor is put. Anchors should be stripped out from URLs, otherwise the HTML document cannot be found in the mapping table. An example of such a request field:

"GET /vakgroepen/ai/education/courses/micd/opgave_1.html#1c HTTP/1.1"
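As a minimal illustration of the cleaning rules above, the checks could be combined into a single filter pass. The class and method names, the hard-coded lists and the status threshold below are assumptions made for this sketch; the project itself drives these filters from the extension and spider list files.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.regex.Pattern;

/** Illustrative filter combining the cleaning rules described above. */
public class LogEntryFilterSketch {

    // In the project these come from extension.flt and spider.flt; hard-coded here for brevity.
    private static final Set<String> KEPT_EXTENSIONS =
            Set.of("html", "htm", "pdf", "ps", "doc", "rtf", "ppt");
    private static final Pattern SPIDER_PATTERN =
            Pattern.compile("(?i)(bot|crawler|spider|slurp)");

    /** Returns the cleaned relative URL, or null if the entry should be dropped. */
    public String cleanedUrl(String requestField, int status, String userAgent) {
        if (status >= 400) return null;                                // unsuccessful requests (404, 503, ...)
        if (SPIDER_PATTERN.matcher(userAgent).find()) return null;     // spider transactions
        String[] parts = requestField.split(" ");
        if (parts.length != 3 || !parts[0].equals("GET")) return null; // keep GET requests only
        String url;
        try {
            url = URLDecoder.decode(parts[1], StandardCharsets.UTF_8); // replace escape sequences
        } catch (IllegalArgumentException corrupted) {
            return null;                                               // corrupted escape sequence
        }
        if (url.contains("?")) return null;                            // dynamic pages
        int hash = url.indexOf('#');
        if (hash >= 0) url = url.substring(0, hash);                   // strip anchors
        if (url.endsWith("/")) url = url + "index.html";               // path completion
        String ext = url.substring(url.lastIndexOf('.') + 1).toLowerCase();
        return KEPT_EXTENSIONS.contains(ext) ? url : null;             // unsupported extensions
    }
}
```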
We don't filter frame pages. Frames are supported by the HTML specification and make it possible to split an HTML document into several sub-documents (e.g., a frame for the navigation menu, a frame for the content, etc.). Each frame refers to a specific HTML document, resulting in a separate page request. The main frame page mostly contains special tags for controlling all the subframes. This page is either labelled miscellaneous or labelled the same as its subframes by the text mining algorithm [3]. Either way there is no need to pay special attention to such pages while preparing the data.

3.3 Data integration

A novel approach in this project is to use the content types of the visited pages rather than URL references. Content types, as described earlier, are given in a special mapping table where each entry consists of a URL/content type pair (refer to section Content types mapping table). Data integration in this context means that a content type label should be attached to every single stored log entry. The simplest and most convenient method is to attach content labels to transactions during data cleaning (depending on the application; for continuous streaming data, a better solution would be to attach labels to entries online, probably also using the content identification model to identify unknown content besides a preset mapping table). This saves time, since it uses the same cycle for both processes. After cleaning and filtering a log entry, the data integration step looks up the entry's request URL in the mapping table. If the URL is present, the corresponding type label is attached to the entry. Otherwise the extension of the URL is checked for a valid document type other than HTML (refer to section Filtering unsupported extensions) and looked up in the table again (this step could be improved by using the original classifier model in case of a missing URL). If the extension indicates an HTML page, the entry is deleted.

3.4 Storing the log entries

The final step of the data preparation is to store the data in a convenient database. MySQL was chosen as the database server in spite of the fact that the current version does not support stored procedures. In most cases it would be easier and faster to use internal methods for manipulating the data inside the database, but no insurmountable difficulties occurred during the project in this respect. The advantages of MySQL are that it is fast, easy to maintain, free to use for research purposes and widely accepted. The database scheme for storing cleaned log entries can be seen in table 3.
Table 3: Database scheme of the cslog table

    column name : type name
    id : bigint
    remotehost : varchar
    rfc931 : varchar
    authuser : varchar
    transdate : datetime
    request : text
    content_type : tinyint
    status : smallint
    bytes : int
    referer : text
    user_agent : text

The column names correspond to the log field names mentioned in section Access log files, except for the content_type field, which refers to the attached content type described in the previous paragraph, and id, which is the unique identifier of the entries.

3.5 An overall picture

The following figure gives an overall picture of our data preparation scheme.

[Figure 2: An overall picture of the data preparation. Raw log entries (cslog.txt) are parsed by LogParser into Transaction objects, filtered by TransactionFilter (using extension.flt and spider.flt), mapped to content types by MappingTable (mapping_table.mtd) and loaded into the database by Log2Database, as configured in datahandling.prop.]
The first step in the data preparation process is to load the raw log files into memory line by line via the LogParser object. This object transforms all entries into suitable Transaction objects, which contain all the fields of the log file. Once a Transaction has been parsed, it goes through the TransactionFilter, which filters out useless entries (by simply ignoring them). After this step a content-type label is attached to all transactions by the MappingTable object. Finally, Log2Database loads the filtered transactions into the specified database.
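As a minimal, self-contained sketch of this pipeline (parse, filter, attach content type, store), the following condenses the four objects into one loop. The file names and the assumption that the mapping table is a two-column whitespace-separated file are illustrative only; the project's own classes are richer and write to MySQL instead of standard output.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Condensed sketch of the data preparation pipeline of figure 2. */
public class DataPreparationSketch {

    public static void main(String[] args) throws IOException {
        // MappingTable: URL -> content-type identifier
        Map<String, Integer> mapping = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("mapping_table.mtd"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.trim().split("\\s+");
                if (cols.length == 2) mapping.put(cols[0], Integer.valueOf(cols[1]));
            }
        }

        // LogParser + TransactionFilter + Log2Database rolled into one loop
        try (BufferedReader in = new BufferedReader(new FileReader("cslog.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] quoted = line.split("\"");          // the request is the first quoted field
                if (quoted.length < 2) continue;
                String[] request = quoted[1].split(" ");
                if (request.length != 3 || !request[0].equals("GET")) continue;  // filtering step
                String url = request[1].replaceFirst("^/", "");
                Integer contentType = mapping.get(url);      // data integration step
                if (contentType == null) continue;           // unknown URL: skipped in this sketch
                // Log2Database would INSERT the enriched entry into the cslog table here
                System.out.println(url + " -> content type " + contentType);
            }
        }
    }
}
```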
4 Data structuring

Sessions, a.k.a. transactions, constitute the basis of most web log mining processes. They are related to users and composed of the pages visited during a separate browsing activity. (A note on terminology: market basket analysis uses "transaction" for the items purchased at once, while the information technology (IT) sector uses "transaction" for a unique client-server request-response exchange. IT terminology also uses the term "session", which is analogous to a market basket, to denote consecutive user page visits, a.k.a. navigation sequences. To resolve the conflict, this thesis uses both terms for navigation sequences, except in chapter 3, Data preparation, where transaction refers to individual page accesses.)

This chapter starts with the description of user identification, which is essential for session identification. This is followed by details on the grouping of users, which is also a relevant topic, as their characterization is the main goal of this project. The next section deals with session identification methods and types, and also discusses how the selection method is restricted to groups of users. The final section presents a comprehensive overview of the data structuring process.

4.1 User identification

Identification of users is essential for efficient data mining. It makes it possible to distinguish user specific data within the whole data set. It is straightforward to identify users in intranet applications, since they are required to identify themselves by following a login process. It is much more complicated in the case of public domains. The reason is that Internet protocols (e.g., HTTP, TCP/IP) do not require user authorization from client applications (e.g., web browsers). The only private information exchanged is the machine (IP) address of the client. Identification based on this information is unreliable: multiple users may use the same machine (thus the same IP address) to connect to the Internet, and, on the other hand, a single user may use several machines to use the same service. Besides, proxy servers and firewalls hide the true IP address of the client. There are many solutions to this problem. Content providers can force users to register for their services. In this way users have to follow a login process each time they want to browse the content. To avoid explicit user authentication, servers can use so-called cookies. Cookies are user specific files stored on client machines. Each time a user visits the same service, the server can obtain user information from the stored cookies.

The most accurate identification based solely on access log files is to use both the IP address and the browser agent type as a unique user identification pair [13]. However, some papers use IP/cookie pairs [2]. The identification procedure proposed in this thesis takes place inside the database as a select query, which fills the users table from the cslog table. Table 4 shows the data scheme of the users table.
Table 4: Data scheme of the users table

    column name : type name
    id : bigint
    remotehost : varchar
    host_name : varchar
    TLD : varchar
    user_agent : text

The remotehost and user_agent fields correspond to the identification pair mentioned above, while host_name and TLD will be discussed in the next section (4.2 User groups).

4.2 User groups

Arranging users into specific user groups is essential for further examinations. All the statistics and models described later are based on sessions belonging to user groups. The advantage of user authenticated systems is the availability of personal information on registered users, which would help to form the most exact and diverse groups for them. In the case of public domains, the possibilities are restricted to the information that can be mined from access log files. In public domains, groups can be formed based on user IP addresses (network ranges), geographical data, visiting frequency, etc. Access log file entries contain either the IP address or the domain name in the remotehost field. For this reason, in both cases the IP address or the domain name should be looked up and updated in the users table. After this process the remotehost field refers to IP addresses, while the host_name field refers to the corresponding domain name in the users table.

Organizational groups

A natural grouping of users is present in most internal networks in the form of subnetwork address ranges. Subnetwork address ranges determine subnetwork domains within the whole network. There can be separate network ranges for user groups like staff, management, students, administration, etc. Using these ranges and the IP addresses of users, a variety of groups can be formed.

Geographical groups

Most network (IP) addresses or network ranges have a domain name registered to them. The domain name consists of level and sublevel names divided by dots. The rightmost name of the whole string refers to the top level domain (TLD). TLDs can be country codes like nl, hu, uk, etc., or other reserved names for public organizations such as com, org, gov, etc. The rest of the domain name could be built of organization names followed by department names etc., all in a hierarchical structure. Geographical distinctions among users can be set up using TLD names. A group can be formed, for example, based on the nl TLD. Users can be selected for this group by searching for the nl TLD in their corresponding domain names. No special geographical observations can be obtained from organizational TLDs, such as the network infrastructure (net) and commercial (com) top level domains. This is because these domains can be registered worldwide and thus have no clear relationship to countries.
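A minimal sketch of a geographical group selector, in the spirit of the UserCountryGroupSelector used later in section 4.4; the class and method names here are illustrative only, not the project's own.

```java
/** Illustrative selector that assigns users to a geographical group by TLD. */
public class CountryGroupSelectorSketch {

    private final String targetTld;   // e.g. "nl"

    public CountryGroupSelectorSketch(String targetTld) {
        this.targetTld = targetTld.toLowerCase();
    }

    /** Extracts the top-level domain, i.e. the rightmost label of the host name. */
    static String topLevelDomain(String hostName) {
        int dot = hostName.lastIndexOf('.');
        return (dot < 0) ? hostName.toLowerCase() : hostName.substring(dot + 1).toLowerCase();
    }

    /** A user belongs to the group if the host name ends in the target TLD. */
    public boolean belongsToGroup(String hostName) {
        if (hostName == null || hostName.isEmpty()) return false;
        return topLevelDomain(hostName).equals(targetTld);
    }

    public static void main(String[] args) {
        CountryGroupSelectorSketch dutch = new CountryGroupSelectorSketch("nl");
        System.out.println(dutch.belongsToGroup("somehost.cs.vu.nl")); // true
        System.out.println(dutch.belongsToGroup("example.com"));       // false
    }
}
```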
4.3 Session identification

Sessions constitute the basis of most web log mining processes. They are related to users and composed of the pages visited during a separate browsing activity. The visited pages belong to a specific domain and form a sequence in visiting order.

It is worth mentioning that not all requests are present in the log files. Most browsers use caching, which allows the reuse of previously visited pages instead of downloading them again. Besides, proxy servers also use page caching: they collect all the frequently visited pages within a company and store them to reduce bandwidth load. As a result, some pages in a visiting sequence are served in offline mode, which means that no entry in the log files refers to these accesses. This problem can be solved by setting the expiration timestamp of pages to a minimum, which forces clients to download expired pages. However, this solution assumes that we can change the structure of the documents. Several methods have been proposed (e.g., [13]) to offer algorithmic solutions for this problem. We believe that the main characteristics can be observed without the necessity of such data preparation techniques.

There are several session identification methods described in the scientific literature [6, 13, 20]. The most widely accepted methods are the so-called time frame (or time window) identification [13] and the maximal forward reference (MFR) identification [7]. Both methods work on pre-selected page accesses, so they work on data grouped by users and ordered by access time. The data consist of the user identification number (id field), the date and time of the page access (transdate field) and the content type of the visited page (content_type field). In addition, MFR requires the request URL (request field).

The time frame identifier method divides the page accesses of a user using a time window. This window or time interval is suggested to be approximately 30 minutes [13, 14, 30]. Most of the commercial products set a 30-minute timeout interval for splitting. The identifier iterates through the entries, and whenever an entry's access time (transdate) falls outside the time interval it starts a new session and measures the time interval from that entry again.

The maximal forward reference identifier adds page access entries to a session list up to the page before a backward reference is made. A backward reference is defined to be a page that is already contained in the set of pages of the current transaction. In that case it starts a new session list and goes on with the iteration. For example, an access sequence of A B C B D E E E F G would be broken into four transactions, i.e., A B C, B D E, E, and E F G. The drawback of this method is that it does not consider that some of the backward references may provide useful information. Besides, it may include entries in the same session even if a week elapsed between them.
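A minimal sketch of the time-frame identifier, assuming the page accesses of one user are already sorted by access time. The 30-minute window follows the value suggested above; the class and record names are illustrative, not those of the project's TimeFrameIdentifier.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative time-frame session identifier (30-minute window). */
public class TimeFrameIdentifierSketch {

    static final long WINDOW_MILLIS = 30 * 60 * 1000L;

    /** One page access: access time in milliseconds and the content-type label. */
    record PageAccess(long timeMillis, int contentType) {}

    /** Splits one user's time-ordered accesses into sessions of content types. */
    static List<List<Integer>> identifySessions(List<PageAccess> accesses) {
        List<List<Integer>> sessions = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long sessionStart = Long.MIN_VALUE;
        for (PageAccess a : accesses) {
            if (current.isEmpty() || a.timeMillis() - sessionStart > WINDOW_MILLIS) {
                current = new ArrayList<>();      // start a new session
                sessions.add(current);
                sessionStart = a.timeMillis();    // measure the window from this entry again
            }
            current.add(a.contentType());
        }
        return sessions;
    }

    public static void main(String[] args) {
        List<PageAccess> data = List.of(
                new PageAccess(0L, 4), new PageAccess(5 * 60_000L, 6),
                new PageAccess(50 * 60_000L, 4), new PageAccess(55 * 60_000L, 2));
        System.out.println(identifySessions(data)); // [[4, 6], [4, 2]]
    }
}
```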
4.4 An overall picture

The figure below represents the functional model of the session identification process.

[Figure 3: Functional model of the session identification process. TransactionMemoryIterator reads page access entries from the cslog and users tables, a UserGroupSelector (UserIPGroupSelector or UserCountryGroupSelector) selects the entries of the chosen user group, an Identifier (TimeFrameIdentifier or MFRIdentifier) groups them into sessions, and SessionFormatPrinter writes the identified sessions in the appropriate data format, as configured in webmining.prop.]

At the beginning, the TransactionMemoryIterator object retrieves all the log entries from the cslog table, ordered by id and sub-ordered by transdate. Note that although the number of log entries can be large, the memory requirement of the whole dataset is still manageable, because all the information needed for an entry is its id, content_type and transdate (and the URL for MFR identification). After fetching the data, TransactionMemoryIterator iterates through the user ids and for each id it asks the UserGroupSelector to decide whether the given user belongs to the group or not. More specifically, the UserGroupSelector can be a subnetwork range selector (UserIPGroupSelector) or a geographical group selector (UserCountryGroupSelector), depending on the settings in the webmining.prop properties file (for more information on group selection refer to section 4.2 User groups). When a user is selected by the group selector, it is passed forward to the Identifier, which groups the access entries into user sessions.
Note again that the Identifier can more specifically be, as described earlier in the session identification section (4.3 Session identification), a time frame identifier (TimeFrameIdentifier) or a maximal forward reference identifier (MFRIdentifier). Finally, the identified sessions of a user are appended to the output file by the SessionFormatPrinter in the appropriate format (e.g., association rule format, mixture model format, global tree model format, etc.).
5 Profile mining models

So far we discussed all the techniques and steps required for data preparation and data enrichment. This chapter discusses the data mining models used in this project for pattern discovery on the enriched data. It starts with an explanation of the widely used association rules mining technique and continues with the discussion of a recently proposed model, the mixture model. Finally it presents the global tree model, which represents session data in a natural way and makes it easy to mine session-specific statistics on the stored data. This model is also able to represent its structure in an easily interpretable graphical way.

Consider the following formal notion as the dataset representation for all the models described below. (Note that the notion is almost the same as proposed in [9], with the difference that transactions are considered not as sets of items but rather as ordered lists of the content types of the pages visited within a session.)

Notion 5.1
Let $D = \{D_1, D_2, \ldots, D_N\}$ be a transaction or session data set generated by $N$ individuals, where $D_i$ is the observed data on the $i$th user, $1 \le i \le N$. Each individual data set $D_i$ consists of a set of one or more transactions for that user, i.e., $D_i = \{y_{i1}, \ldots, y_{ij}, \ldots, y_{in_i}\}$, where $n_i$ is the total number of transactions observed for user $i$, and $y_{ij}$ is the $j$th transaction of user $i$, $1 \le j \le n_i$. An individual session $y_{ij}$ consists of the content-type references of the pages visited within a user session: $y_{ij} = \{n_{ij1}, \ldots, n_{ijk}, \ldots, n_{ijk_{ij}}\}$, where $k_{ij}$ is the length of the $i$th user's $j$th session, $k_{ij} \ge 1$. Each $n_{ijk}$ is a content-type reference, which can take values from the content-type reference range: $1 \le n_{ijk} \le K$. Each reference in the range $1 \ldots K$ refers to a content group (refer to section Content types mapping table).

5.1 Mining frequent itemsets

One of the most well known and popular data mining techniques is the association rules (AR) or frequent itemset mining algorithm. The algorithm was originally proposed by Agrawal et al. [1] for market basket analysis. Because of its wide applicability, many revised algorithms have been introduced since then, and AR mining is still a widely researched area.
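To make the notation of Notion 5.1 concrete, here is a small hypothetical example with invented values (not data from the project), for $N = 2$ users and $K = 3$ content types:

$$D = \{D_1, D_2\}, \qquad D_1 = \{y_{11}, y_{12}\}, \quad y_{11} = \{2, 2, 3\}, \quad y_{12} = \{1\}, \qquad D_2 = \{y_{21}\}, \quad y_{21} = \{3, 1, 1, 2\}$$

User 1 produced two sessions; in the first one, two pages of content type 2 were visited, followed by one page of type 3, so $k_{11} = 3$, while $k_{12} = 1$ and $k_{21} = 4$.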
The aim of association rule mining is to explore relations and important rules in large datasets, expressed in the "if premise then conclusion" implication form $X \Rightarrow Y$, with $X \cap Y = \emptyset$. A dataset is considered a sequence of entries consisting of attribute values, also known as items. A set of such items is called an itemset (entries themselves are itemsets). Formally, let $I = \{i_1, i_2, \ldots, i_n\}$ be the collection of all items, where $i_j$ ($j = 1 \ldots n$) is an item. An itemset is a collection of items in which each item can occur at most once. A transaction or session is an itemset. Using the notions (Notion 5.1) introduced at the beginning of this chapter, items refer to content-type references $n_{ijk}$, and an itemset is a user session $y_{ij}$, with the restriction that each item can occur at most once.

A problem with association rules is that for a given number of items $i$ there are $2^i$ itemsets, and for each $k$-itemset there are $2^k$ rules. This could result in an unacceptable number of rules. The solution is to consider only rules with a support and confidence higher than thresholds $s$ and $c$. Let $X \Rightarrow Y$, $X \cap Y = \emptyset$, be an association rule. It has support $s$ (in $D$) if $s\%$ of the transactions in $D$ contain $X \cup Y$. It has confidence $c$ if $c\%$ of the transactions in $D$ that contain $X$ also contain $Y$.

The problem of mining association rules can be decomposed into two major steps:
1. Find all frequent itemsets that have support greater than the threshold $s$, and
2. for each frequent itemset, generate all the rules that have confidence greater than the threshold $c$.

Apriori was the first association rule mining algorithm. Many improved algorithms (most of them Apriori-based) have been introduced since it was published. In the following we give the pseudo code of the Apriori algorithm [1].
Initial conditions
    Lk : set of large k-itemsets (itemsets having minimal support)
    Ck : set of candidate k-itemsets
    D : set of transactions (as described above), t ∈ D
    s : support threshold

Algorithm
    L1 = { frequent 1-itemsets };
    for ( k = 2; Lk-1 <> empty; k++ ) {
        step 1: Ck = set of new candidates = { p ∪ q | p, q ∈ Lk-1, |p ∪ q| = k }
        step 2: Ck = { c ∈ Ck | every (k-1)-subset of c is in Lk-1 }
        for all transactions t ∈ D
            for all k-subsets sub of t
                if ( sub ∈ Ck ) sub.count++;
        step 3: Lk = { c ∈ Ck | c.count ≥ s }
    }
    Set of all frequent itemsets = ∪k Lk

Rules can be generated incrementally, starting from 1-itemset conclusions, because of the following property of confidence: let $L$ be a frequent itemset and $A \subset L$ a subset; if the confidence of $(L - A) \Rightarrow A$ is $c$, then for any $B \subset A$ the confidence of $(L - B) \Rightarrow B$ is at least $c$.

5.2 The mixture model

In their paper, Cadez et al. (2001) [5] proposed a generative mixture model for predicting user profiles and behaviours based on historical transaction data. A mixture model is a way of representing a more complex probability distribution in terms of simpler models. It uses a Bayesian framework for parameter estimation; at the same time, the mixture model addresses
the heterogeneity of page visits: even if a user hasn't visited a page before, the model can predict it with a low probability. Cadez et al. (2001) presented both a global and an individual model; this thesis applies only the global mixture model. Transaction data consistently mean web page visits or sessions in this thesis, instead of the slightly different market basket data mentioned in [5]. While sessions are ordered sequences of visited pages, market baskets are sets of purchased items. However, session data can simply be transformed into the market basket data structure for applying the mixture model:

Notion 5.2 (alteration of Notion 5.1)
For the mixture model approach the transaction notion should be altered in the following way: an individual session $y_{ij}$ consists of the counts of the content-type references of the pages visited within a user navigational sequence: $y_{ij} = \{n_{ij1}, \ldots, n_{ijk}, \ldots, n_{ijK}\}$, where $n_{ijk}$ indicates how many pages of content type $k$ are in the $i$th user's $j$th session, $1 \le k \le K$, $n_{ijk} \ge 0$.

The global mixture model consists of $K$ components. Each of the components describes a prototype transaction forming a basis function. A component models a specific session prototype, which consists of visited page types with counts relatively higher than for other items. A $K$-component mixture model for modeling a user's site visit $y_{ij}$ is given below:

Notion 5.3 (K-component mixture model)
$$p(y_{ij}) = \sum_{k=1}^{K} \alpha_k P_k(y_{ij}) \qquad (1)$$
where $\alpha_k > 0$ is the component weight of the $k$th component, $\sum_{k=1}^{K} \alpha_k = 1$, and $P_k$, $1 \le k \le K$, is the $k$th mixture component.

As for modeling the components, [5] proposed a simple memoryless multinomial model. For every component there is a multinomial distribution $\Theta_k = (\Theta_{k1}, \ldots, \Theta_{kC})$, conditioned on $n_{ij}$, the total number of pages visited in the $i$th user's $j$th session. The mixture model (Notion 5.3, equation (1)) completed with multinomials can be written as

Notion 5.4 (Mixture model with multinomials)
$$p(y_{ij}) = \sum_{k=1}^{K} \alpha_k \prod_{c=1}^{C} \Theta_{kc}^{\,n_{ijc}} \qquad (2)$$
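As a purely hypothetical illustration of equation (2), with invented parameter values: take $K = 2$ components over $C = 2$ content types, weights $\alpha = (0.6, 0.4)$, multinomials $\Theta_1 = (0.9, 0.1)$ and $\Theta_2 = (0.2, 0.8)$, and a session with counts $y_{ij} = \{2, 1\}$. Then

$$p(y_{ij}) = \alpha_1\,\Theta_{11}^{2}\,\Theta_{12}^{1} + \alpha_2\,\Theta_{21}^{2}\,\Theta_{22}^{1} = 0.6 \cdot 0.9^2 \cdot 0.1 + 0.4 \cdot 0.2^2 \cdot 0.8 = 0.0486 + 0.0128 = 0.0614.$$

The first term dominates the sum, so this session would most plausibly be attributed to component 1 (posterior weight of about 0.79).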
The full data likelihood is presented below, under the assumption that individuals behave independently:

Notion 5.5 (Full data likelihood)
$$p(D \mid \Theta) = \prod_{i=1}^{N} p(D_i \mid \Theta) \qquad (3)$$
$\Theta$ represents the unknown parameters: both the parameters of the $K$ component multinomials, $\{\Theta_1, \ldots, \Theta_K\}$, and the vector of profile weights, $\{\alpha_1, \ldots, \alpha_K\}$. The unknown parameters $\{\Theta_1, \ldots, \Theta_K\}$ and $\{\alpha_1, \ldots, \alpha_K\}$ are estimated by an expectation maximization (EM) algorithm.

5.3 The global tree model

Pei et al. (2000) [23] propose the WAP-tree architecture for efficiently mining frequent itemsets. Besides the tree structure, this tree based model contains a link queue for each type of label; the queues connect all the nodes with the same label, forming chains. Xing and Shen (2003) in [30] present the so-called preferred navigation tree (PNT) for mining preferred navigation paths. The PNT stores the URL, the frequency of visits and the visiting time in its nodes.

In our approach we use a global tree model (GTM). The GTM provides a special representation of the session data of groups of users. The structure of the model is similar to that of the PNT presented in [30]. The model preserves the information obtained from the structure of sessions and it stores individual pages in visiting order. In this model, sessions with the same prefix share the same branch of the tree, which results in less storage required for the model. Also, the model was built to be able to visualize frequent navigational paths in a tree structure. Visualization helps to understand the patterns by highlighting relevant information.

Each node in a tree model registers four pieces of information: a content-type label, a frequency number, a reference to its parent node and references to its children nodes. The root of the tree model is a special virtual node with an optional title label and frequency 0. Every other node is labelled by one of the content-type labels and is associated with a frequency which stores the number of occurrences, in the original session database, of the corresponding prefix ending with that content type. A model consists of K branches (session trees), where K is the number of content types (refer to Notion 5.1), connected to the virtual root node. Each branch contains a root node labelled with a unique content-type identifier. A branch stores only those user sessions which start with a page labelled with the same content type as its root's. Figure 4 presents the visualization of a sample tree model.

An A → B path of a tree, from any node A to any node B (where the level number of A in the tree is not greater than that of B), represents one or more subsessions, where the frequency
number of the B node represents the total number of sessions containing this ordered subsequence pattern. A special case of the A → B path is when A is the root node (of a session tree). In this case the path represents one or more sessions or subsessions, depending on the frequency of the B node and the summed frequency of its children nodes: let $f_B$ be the frequency number of the B node and let $sum = \sum_{C \in \mathrm{children}(B)} f_C$ be the summed frequency of its children nodes. Let Root → B be the path from the root node to the B node; then Root → B represents at least one real session if $f_B > sum$, in which case the difference $f_B - sum$ gives the number of real Root → B sessions.

Building the tree model

Model building starts with the initialization of the K session trees. Each tree is initialized for a unique content type k. Then all sessions of the data set are added to their corresponding session trees: each session is examined for its first page type and a tree is selected according to the result. Adding a session to its tree can be implemented recursively. The recursive function takes a parent node and a subsession as parameters and updates or creates the child node of this parent with the content type given by the first element of the subsession. The recursive step is to pass the child node as the parent parameter, while the new subsession parameter arises from the removal of the first entry of the original subsession. The recursive process stops when the length of the subsession is equal to or less than one.

Algorithm to build the global tree model

Initial conditions
    s ∈ D : a session; s_i is the ith element of session s
    sessiontrees : array [1..K] of SessionTree
    SessionTree : tree object for content type k; consists of a root node and children nodes
    node : a SessionTree node containing
        ct : content type of the node
        freq : frequency of the node
        parent : node reference to the parent node
        children : array [1..L] of node references
Algorithm

    scheme of the algorithm:
        init sessiontrees;
        for all s ∈ D {
            sessiontrees[ s_1 ].add( s );
        }

    initialization of sessiontrees:
        init sessiontrees {
            for i = 1..K {
                sessiontrees[ i ] = SessionTree( i, 0 );
                    root.ct = i;
                    root.freq = 0;
                    root.parent = null;
                    root.children = null;
            }
        }

    adding a session to the corresponding SessionTree:
        sessiontrees[ content_type ].add( s : session ) {
            if s_1 <> content_type return;
            addsession( root, s );
        }

        addsession( parentnode : node, s : session ) {
            parentnode.freq++;
            if s.length > 1 {
                s.removeFirstElement();
                if a child for s.firstElement() exists {
                    addsession( child for s.firstElement(), s );
                } else {
                    addsession( create child for s.firstElement(), s );
                }
            }
        }
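For illustration, a compact Java rendering of this building step might look as follows. It assumes sessions are given as lists of content-type identifiers and uses hash maps instead of arrays, so the names and structure are not those of the project's implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative global tree model: one session tree per starting content type. */
public class GlobalTreeModelSketch {

    static class Node {
        final int contentType;
        int freq = 0;                                     // sessions containing the prefix ending here
        final Map<Integer, Node> children = new HashMap<>();
        Node(int contentType) { this.contentType = contentType; }
    }

    private final Map<Integer, Node> sessionTrees = new HashMap<>(); // branch roots, keyed by content type

    /** Adds one session, given as a non-empty list of content-type identifiers. */
    public void addSession(List<Integer> session) {
        if (session.isEmpty()) return;
        Node current = sessionTrees.computeIfAbsent(session.get(0), Node::new);
        current.freq++;                                   // branch root counts the session
        for (int i = 1; i < session.size(); i++) {
            current = current.children.computeIfAbsent(session.get(i), Node::new);
            current.freq++;                               // sessions sharing a prefix share the branch
        }
    }

    public static void main(String[] args) {
        GlobalTreeModelSketch gtm = new GlobalTreeModelSketch();
        gtm.addSession(List.of(4, 6, 6));
        gtm.addSession(List.of(4, 6));
        gtm.addSession(List.of(2));
        Node branch = gtm.sessionTrees.get(4);
        System.out.println(branch.freq);                  // 2: two sessions start with content type 4
        System.out.println(branch.children.get(6).freq);  // 2: both contain the prefix 4 -> 6
    }
}
```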
Mining preferred paths from the GTM

Preferred navigation paths can be mined directly from the tree model. (A, B) paths or sessions that have a higher support than a preset threshold value are the preferred navigation paths. The algorithm given below scans each level of all session trees for possible candidates, ignoring branches that have low support.

Mining preferred paths

initial conditions
    candidates : the list of candidate nodes
    candidatechildren : the list of candidate children nodes
    supported : the list of supported sessions and their support values
    s : support threshold

algorithm
    candidates = the root nodes of the session trees;
    while candidates.size() <> 0 do {
        candidatechildren = empty;
        for i = 1..candidates.size() do {
            if frequencyOf( root, candidates[i] ) ≥ s {
                supported.add( (root, candidates[i]), support );
                // gather possible child candidates
                for j = 1..candidates[i].numberOfChildren do {
                    if child_j.freq ≥ s {
                        candidatechildren.add( child_j );
                    }
                }
            }
        }
        candidates = candidatechildren;
    }

Trees similarity

In the following, a tree similarity measure is proposed for determining the distance between different tree models. By means of the similarity measure we can determine the likeness of trees. We expect that the similarity measure of two trees built on two distinct session data sets will be high if the data sets were generated by users with similar behaviours. The proposed distance measure considers both the structure of the trees and the frequencies of the tree nodes. The distance satisfies the following criteria:
Trees similarity

In the following a tree similarity measure is proposed for determining the distance between different tree models. By means of this similarity measure we can determine how alike two trees are. We expect the similarity of two trees built on two distinct session data sets to be high if the data sets were generated by users with similar behaviours. The proposed distance measure considers both the structure of the trees and the frequencies of the tree nodes. The distance satisfies the following criteria:

Assumptions
1. The similarity distance measures not only the structure of the trees but also (or rather) the frequencies of their nodes. Higher frequencies should be taken into account with higher weights.
2. The extra information that originates from the sessions should be exploited.
3. Considering two trees T1 and T2, the distance of T1 from T2 should be equal to the distance of T2 to T1. Formally, T1.dist(T2) = T2.dist(T1).

The distance measure proposed is a simple approach based on forming the intersection of the two trees' session data sets.

Trees similarity

initial conditions
T1, T2: the two tree models
candidates: the list of candidate nodes
candidatechildren: list of candidate children nodes
sum: registers the number of all common sessions

algorithm

    sum ← 0;
    candidates ← all the root nodes of the session trees that occur in both trees;
    while candidates.size <> 0 do {
        for i = 1..candidates.size do {
            if T1 and T2 both have the (root, candidates[i]) session(s) {
                sum ← sum + the number of common sessions;
            }
            put into candidatechildren all the children nodes of candidates[i] that are present in both trees;
        }
        candidates ← candidatechildren;
    }

The similarity proportion can then be calculated by dividing the sum value by the number of distinct sessions in the two trees (the summed number of all sessions in the two trees minus the sum value). Multiplying the resulting value by 100 gives the similarity percentage.
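As a small example with invented numbers: if T1 was built from 50 sessions, T2 from 40 sessions, and the scan finds sum = 30 common sessions, then the similarity proportion is 30 / (50 + 40 − 30) = 0.5, i.e. a similarity of 50%. The denominator is the number of distinct sessions occurring in either tree, so the measure is a Jaccard-style ratio of the common sessions to the union of the two session sets.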
Visualization of tree models

Frequent navigational paths are conventionally represented by text or tables, which are not easy to understand. Visualization of a tree model, however, makes it easy to interpret the patterns. A picture of a tree model consists of nodes with content-type labels and their colour codes. Nodes are connected with lines (edges) of varying thickness, marking the frequencies of the given paths. Besides the thickness, edges are annotated, for each child of a node, with proportional numbers measuring the distribution of frequencies over the given children nodes. In addition, the number of real sessions for that path of the tree is given in parentheses. The tree visualization contains only the supported sessions, based on a support threshold set for the model. Figure 4 presents a sample tree.

Figure 4: Visualization of a sample tree model

The sample tree above (figure 4) contains nine different content-type nodes. Its most frequent starting node is english/department. 62% of the visitors (that is, 9 visitors in this case) start on the department pages and then go on to the faculty pages. Faculty pages have a 100% visiting rate within this branch, which means that all of the users who went on from the department pages also visited faculty pages, etc.
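Purely as an illustration of the idea, and not the implementation used for the figures in this thesis, a supported subtree could also be exported to Graphviz DOT, with edge labels carrying each child's share of its parent's frequency and pen width growing with the frequency. All names and the width formula below are assumptions of this sketch:

    import java.util.function.IntFunction;

    // Illustrative only: export the supported part of a session tree to Graphviz DOT.
    // Uses the TreeNode sketch from above; typeName maps a content-type id to its label.
    class TreeDot {
        static String toDot(TreeNode root, int minFreq, IntFunction<String> typeName) {
            StringBuilder dot = new StringBuilder("digraph gtm {\n");
            appendEdges(dot, root, minFreq, typeName);
            return dot.append("}\n").toString();
        }

        private static void appendEdges(StringBuilder dot, TreeNode n, int minFreq,
                                        IntFunction<String> typeName) {
            for (TreeNode c : n.children.values()) {
                if (c.freq < minFreq) continue;                       // draw supported branches only
                long pct = Math.round(100.0 * c.freq / n.freq);       // child's share of the parent
                dot.append(String.format("  \"%s_%d\" -> \"%s_%d\" [label=\"%d%%\", penwidth=%.1f];%n",
                        typeName.apply(n.ct), System.identityHashCode(n),
                        typeName.apply(c.ct), System.identityHashCode(c),
                        pct, 1.0 + c.freq / 10.0));
                appendEdges(dot, c, minFreq, typeName);
            }
        }
    }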
6 Analysing log files of the web server

For the purposes of this thesis the discussion is restricted to the analysis of user behaviours for a single web domain; all the data used in the following experiments therefore relate to the web server of the Computer Science Department of the Vrije Universiteit. This chapter presents experimental results using all the techniques described earlier. The first section describes the details of the input access log files and the mapping table. This is followed by experimental results of the data preparation and data structuring techniques. Finally, the last sections present the results of the three profile mining models: AR, MM and GTM.

Results of association rules and frequent itemset mining can show which page sets users tend to visit within a session and what rules can be defined on the frequent itemsets. A mixture model can tell what distribution the data come from and how many components (based on different user behaviours) are likely to have generated the data. Both AR mining and the mixture model ignore the information that can be mined from the order of pages within sessions. The global tree model, in contrast, is based on the structure of sessions: it can answer the question which session sequences (or subsequences) are highly preferred by users, and it also provides a visualization of frequent navigational paths in the tree structure.

Most of the algorithms were implemented in the Java programming language. For further details on their implementation each section refers to the proper APPENDIX table. Only the most frequent and most important patterns will be presented in this section, but the CD-ROM accompanying this master thesis contains all the results and outputs of the experiments (refer to APPENDIX E).

6.1 Input data

The input data in this case are the access log files of the web server for a certain period of time, the content-type mapping table of the HTML pages of the domain, and the organizational and geographical information for user group identification.

6.1.1 Access log files

Four consecutive access log files were collected from the server and merged together. In total they sum up to one month of access log entries. The details are summarized in the table below:

Details on the merged access log entries
File name:          cs_access_log_
Size (MB):          1 533, 344
Period:             30 May to July 2004
Number of entries:
Table 5

The Apache web server of the domain writes the following fields, in the given sequence, into the log files: remotehost, rfc931, authuser, date, request, status, bytes, referrer, user_agent. For the accepted access log file structure refer to APPENDIX B.
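These fields correspond to Apache's combined log format. A minimal, illustrative Java parser for a single line of this format (the regular expression, class name and sample entry below are ours, not part of the thesis code):

    import java.util.regex.*;

    // Sketch of parsing one combined-log-format line into its fields.
    class LogLineParser {
        private static final Pattern LINE = Pattern.compile(
            "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"");

        static String[] parse(String line) {
            Matcher m = LINE.matcher(line);
            if (!m.find()) return null;               // malformed entries are skipped during cleaning
            String[] fields = new String[9];          // remotehost, rfc931, authuser, date, request,
            for (int i = 0; i < 9; i++)               // status, bytes, referrer, user_agent
                fields[i] = m.group(i + 1);
            return fields;
        }

        public static void main(String[] args) {
            String sample = "130.37.1.1 - - [30/May/2004:10:15:32 +0200] "
                    + "\"GET /index.html HTTP/1.1\" 200 2326 \"-\" \"Mozilla/4.0\"";
            System.out.println(String.join(" | ", parse(sample)));
        }
    }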
6.1.2 The mapping table

Data enrichment is partly based on the content information of the visited web pages. This information is given by a table with URL/content-type entries. The table was generated by a text mining algorithm that was developed in a different project [3]. The text mining algorithm attaches labels to all HTML pages of a document set based on their contents. The HTML pages (VU-pages) were downloaded by wget [36], invoked with the following parameters:

wget -l5 -r -t5 -A.htm,.html

These parameters force wget to download all the *.htm and *.html files from the domain recursively, to a depth of five levels. In case of a page access failure it retries the download four more times. This resulted in a collection of HTML pages (with a total size of about 90MB) that were subsequently assigned to 19 categories:

Description of the content-types (content-categories)

1. photo: This type refers to pages containing a negligible quantity of textual information with one or more images. It most likely refers to personal photo albums, lecture slides or informational pages with messages like "under construction" or "this page has been moved to".
2. miscellaneous: This type refers to pages with absent or insufficient content. It most likely refers to framesets, empty, file list, form, moved or redirected pages. It can contain photo pages as well, in case the page doesn't contain relevant textual information.
3. dutch/department: This type-group contains department pages in Dutch.
4. english/reference: This group most likely refers to pages containing e-books or manual pages for different systems or programs. It can be a manual for an operating system or an API reference for a programming language. It contains pages written in English.
5. english/activity: This group most likely refers to pages containing invitations to official or free time activities. These events can be science conferences, exhibitions, concerts, trips for international students or any other happening connected with the University. The group contains pages written in English.
6. english/department: This category contains department pages in English.
7. english/project: This type most likely refers to research projects of the computer science department, written in English.
8. english/person/faculty: This group most likely refers to pages of the faculty members. They are usually very formal and they mostly consist of fields of research, professional background, research projects and other information related to the member's research area or department. It contains pages written in English.
9. english/person/student: This group most likely refers to student pages. Student pages mostly contain personal information (e.g., hobby, lyrics, etc.) and links to pages of friends and courses. The group contains pages written in English.
10. english/person/faculty/publication: This category most likely refers to pages containing publications of faculty members, comprising at least the abstracts. It contains pages written in English.
11. english/course: This group most likely refers to course pages. They mostly contain the description of the course, lecture slides, recommended literature and set assignments, in English.
12. dutch/course: The same as the english/course group, but containing pages written in Dutch.
13. dutch/person/student: The same as the english/person/student group, but containing pages written in Dutch.
14. dutch/person/faculty: The same as the english/person/faculty group, but containing pages written in Dutch.
15. other_language: This type-group contains pages written in languages other than English or Dutch.
16. dutch/project: The same as the english/project group, but containing pages written in Dutch.
17. dutch/activity: The same as the english/activity group, but containing pages written in Dutch.
18. documents: This group contains documents in Adobe Acrobat (pdf) or Postscript (ps) format. They are most likely to be scientific papers, publications, e-books, etc.
19. other documents: This group contains documents in Microsoft Word (doc), Microsoft PowerPoint (ppt), Microsoft Excel (xls), Rich text (rtf) or plain text (txt) format. They are most likely to be administrative papers, forms, course materials, etc.

Table 6: Description of the content-types

The labelling algorithm (supported by a human) provided only an approximate categorization of the pages: roughly 74% of the pages got the right labels; see [3] for details. To reduce the length of type names, in some places we will use the letters E and D to refer to the English and Dutch groups (e.g., E/department refers to english/department). For the file structure of the mapping table accepted by the webmining package refer to APPENDIX B3.
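Conceptually, the enrichment step is a dictionary lookup: each cleaned log URL is matched against this table and labelled with one of the 19 category ids, while categories 18 and 19 are resolved by file extension (see the note with Table 8 in section 6.2). A minimal Java sketch, in which the file format and all names are assumptions rather than the actual webmining package:

    import java.io.*;
    import java.nio.file.*;
    import java.util.*;

    // Sketch: load a URL -> content-type mapping and label requested URLs with it.
    class ContentTypeMapper {
        private final Map<String, Integer> byUrl = new HashMap<>();

        // Assumed file format: one "url <TAB> typeId" pair per line.
        ContentTypeMapper(Path mappingFile) throws IOException {
            for (String line : Files.readAllLines(mappingFile)) {
                String[] parts = line.split("\t");
                if (parts.length == 2) byUrl.put(parts[0], Integer.parseInt(parts[1]));
            }
        }

        // Returns the category id, or -1 for a mapping error (page missing from the collection).
        int label(String url) {
            if (url.endsWith(".pdf") || url.endsWith(".ps"))  return 18;  // documents
            if (url.endsWith(".doc") || url.endsWith(".ppt")
                    || url.endsWith(".xls") || url.endsWith(".rtf")
                    || url.endsWith(".txt"))                   return 19;  // other documents
            return byUrl.getOrDefault(url, -1);
        }
    }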
6.1.3 Experiments on data preparation

The following table contains statistical results of the access log data filtering and the content-type data integration.

Statistics of the access log data filtering and content-type data integration
(per filtering method: number of filtered (bad) records and percentage of the total number of entries)
Unsupported extensions: ,04%
Spider transactions: ,00%
Dynamic pages: 2,88%
Unsupported HTTP request methods: ,78%
Corrupt escape characters: ,14%
Unsuccessful requests: ,35%
Domain filter: 728 records, 0,01%
Path completion or anchor stripping: ,92%
All methods (valid transactions): ,17%
Mapping errors (on valid transactions): ,88%
Total transactions stored (valid transactions with content type): ,29%
Table 7

All numbers in the table were compared to the total number of records, and for this reason the percentages do not sum to 100%. Most of the filtering methods above are required to obtain more accurate results on user behaviour; the elimination of dynamic pages is the exception, since it is not a necessity for this reason. The analysis of dynamic pages would require a much more sophisticated system. Since the targeted domain does not contain a significant amount of dynamic pages (2,88% of total accesses), and it can be assumed that static and dynamic pages would mostly not be mixed within a single user session, the corresponding filter simply ignores all dynamic pages appearing in the log files.

Not surprisingly, the statistics show that most of the entries contain requests for unsupported file types. A vast number of transactions are also generated by spiders. The mapping table does not contain content-type entries for 43,73% of the valid transactions; we assume this results from frequent changes in the site's pages. Mapping errors occur when a page referred to by a log entry is missing from the page collection (and therefore from the mapping table). The valid transactions that received a content type were stored in a database. For the implementation details on data cleaning and data integration, refer to APPENDIX D1 and APPENDIX D2.
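The filters of Table 7 can be thought of as a chain of predicates applied to each parsed log entry; a request is stored only if it passes all of them. The following simplified sketch illustrates the idea only: the extension list, the spider patterns and the ordering of the checks are our assumptions, not the exact rules of the thesis implementation:

    import java.util.*;

    // Sketch of the cleaning chain: each predicate mirrors one row of Table 7.
    class LogEntryFilter {
        private static final Set<String> PAGE_EXT = Set.of(
                "htm", "html", "pdf", "ps", "doc", "ppt", "xls", "rtf", "txt");

        static boolean keep(String url, String method, int status, String userAgent) {
            if (!method.equals("GET")) return false;                          // unsupported HTTP request methods
            if (status < 200 || status >= 400) return false;                  // unsuccessful requests
            if (url.contains("?") || url.contains("cgi-bin")) return false;   // dynamic pages
            if (userAgent.toLowerCase().matches(".*(bot|crawler|spider|slurp).*"))
                return false;                                                 // spider transactions
            int dot = url.lastIndexOf('.');
            String ext = dot < 0 ? "html" : url.substring(dot + 1).toLowerCase();
            if (!PAGE_EXT.contains(ext)) return false;                        // unsupported extensions
            return true;          // remaining filters (domain, escape characters, anchors) omitted here
        }
    }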
6.2 Distribution of content-types within the VU-pages and access log entries

The following table shows the frequencies of the content-types within the VU-pages and within the access log entries of the domain.

Distribution of content-types within the VU-pages and access log entries
(per category: frequency and percentage within the VU-pages; frequency within the access log entries, percentage of the total, and percentage of categories 1-17)
1 photo ,68% ,49% 7,74%
2 miscellaneous ,95% ,38% 14,99%
3 dutch/department 3 0,02% ,84% 3,95%
4 english/reference 966 7,42% ,62% 2,96%
5 english/activity 32 0,25% 640 0,18% 0,14%
6 english/department 269 2,07% ,33% 5,17%
7 english/project 441 3,39% ,10% 3,35%
8 english/person/faculty ,33% ,97% 15,48%
9 english/person/student 549 4,22% ,71% 3,84%
10 english/person/faculty/publications ,60% ,73% 8,75%
11 english/course 111 0,85% ,36% 4,37%
12 dutch/course 806 6,19% ,22% 2,62%
13 dutch/person/student ,29% ,68% 7,08%
14 dutch/person/faculty 10 0,08% 260 0,07% 0,06%
15 other_language 417 3,20% ,47% 0,38%
16 dutch/project 27 0,21% 212 0,06% 0,05%
17 dutch/activity 26 0,20% ,79% 0,65%
18 documents* ,91%
19 other documents* ,51% -
total without 18-19 ,00% % - -
total with all ,00%
* The mapping table entries contain only the extensions for categories 18 and 19.
Table 8

The two distributions of content types (table 8) reveal relevant information on user behaviour. Relative to their large proportion within the collection of HTML pages, the photo, E/reference and D/course categories are visited comparatively rarely. One would also not expect the relatively low proportion of course visits (Dutch and English together, 8,58%). Furthermore, the high proportion of visits from the Netherlands (refer to figure 5 in section 6.3.1) may indicate that students visit course pages mostly from home. Publications (E/person/faculty/publication) are mostly visited from foreign countries, as one may expect. On the other hand, the E/person/faculty and the E and D/department categories have higher rates in the log entries. The high proportion of the E/department category can be explained by the fact that pages of this class are placed at the top level of the VU-pages hierarchy, so they provide the links for reaching other pages; besides, many users within the VU set department pages as their starting page. The E/reference category is mostly visited from other countries, and documents are also mostly downloaded from foreign countries, as can be seen in figure 5. The summed proportion of English pages within the VU-pages is 41,13%, against 15,99% for Dutch pages; on the other hand, 54% of the log entries belong to English categories while 17,66% of the entries belong to Dutch page requests.
6.3 Experiments on data structuring

This section provides details on data structuring. It starts by presenting the user groups and their related statistics, and then gives details on the session identification.

6.3.1 The user groups formed for the users of the domain

The remotehost field of a log entry is given either in the form of an IP address or of a domain name. The IP address is required for grouping users by network ranges, while the domain names are important for the geographical sorting. The UpdateDBIPAddresses program was used to update all remotehost fields of the cslog table (refer to table 3 in section 3.4, Storing the log entries) that were given as domain names to the corresponding IP addresses. The next step is to select users into the users table using the updated remotehost fields and the user_agent fields from the cslog table. In the following step, the UpdateDBHostNames program fills in the host_name field for every corresponding remotehost address in the users table. While processing the domain names it also determines their top level domains (TLDs) and fills in the TLD field. For details on UpdateDBIPAddresses and UpdateDBHostNames refer to APPENDIX D1.

A total number of 118,141 users have been identified from the log entries, based on unique remotehost/user_agent pairs. The following groups make distinctions among these users. After identifying all the available IP addresses and domain names, the following demographic data can be obtained from the users' TLD field. The table below contains the details of the 20 most frequent TLDs, ranked by frequency; a summarized count for all the other top level domains is given in the last rows. A table containing all the details of the TLDs can be found in APPENDIX C3.

The 20 most frequent top level domains
rank  TLD   count   country
1     nl            Netherlands
2     net           network infrastructure
3     com           commercial
4     fr    3125    France
5     be    3058    Belgium
6     de    3001    Germany
7     ca    2133    Canada
8     it    2038    Italy
9     uk    1903    United Kingdom
10    au    1852    Australia
11    edu           educational establishments
12    jp    1532    Japan
13    br    1485    Brazil
14    ch     963    Switzerland
15    mx     935    Mexico
16    pl     878    Poland
17    at     635    Austria
18    fi     610    Finland
19    dk     553    Denmark
20    se     531    Sweden
      sum of all other countries
      number of users without geographical information: 33,460
Table 9
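The host name resolution and TLD extraction used to fill these fields can be sketched with the standard Java networking classes; this is our own illustration, not the UpdateDBHostNames code itself:

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    // Sketch: reverse DNS lookup for a remotehost IP and extraction of its TLD.
    class HostInfo {
        // Returns the host name, or null if no reverse mapping exists
        // (such users end up without geographical information).
        static String hostName(String ip) {
            try {
                String name = InetAddress.getByName(ip).getCanonicalHostName();
                return name.equals(ip) ? null : name;   // unresolved addresses are echoed back
            } catch (UnknownHostException e) {
                return null;
            }
        }

        // The TLD is simply the substring after the last dot of the host name.
        static String tld(String hostName) {
            int dot = hostName.lastIndexOf('.');
            return dot < 0 ? null : hostName.substring(dot + 1).toLowerCase();
        }
    }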
Not surprisingly, the table shows that most of the user visits come from the Netherlands. Besides the home country, users from nine particular other countries show keen interest in the computer science pages of the VU. Three of them are neighbouring (or nearby) countries, namely France, Belgium and Germany; among the visitors from these countries are probably students looking for further studies, or fellow researchers interested in project or member details. The other six countries are spread worldwide. There is a total number of 33,460 users without geographical information, because their IP addresses cannot be resolved to domain names.

The following user groups were formed according to the available geographical and organisational information.

Geographical groups

Groups formed by geographical information (by TLD) are described in the table below:

The description of the geographical groups
nl: Contains users identified by the nl top level domain.
other: All the other countries and organizations, i.e. TLDs different from nl. Note that we did not eliminate the com, org, net and edu TLDs from this category despite their undeterminable geographical origin: they form the basis and the most frequent part of this group, so eliminating them would result in the loss of many valuable user sessions. However, during the analysis we have to consider that a significant part of this group may belong to the nl group.
Table 10

Organizational groups

In the Computer Science Department of the Vrije Universiteit there are separate network ranges for user groups like staff, students, administration, etc. The groups identified from their address ranges are described in the table below:

The description of the organizational groups
staff (274 users): Contains users identified by the subnet network range addresses for teachers of the Computer Science department.
student (567 users): Contains users identified by the subnet network range addresses for student machines of the Computer Science department.
Table 11

Figure 5 shows the distribution of content-types for the user groups. The most popular group was the geographical other group, followed by the nl group (with proportions of 58,74% and 41,26% of all geographically labelled transactions). Not surprisingly, the organizational groups have a much lower visit rate compared to the geographical groups, since they contain far fewer users (the proportions of the staff and student groups are almost identical, 52,42% and 47,58%). For more details on figure 5 refer to the analysis of table 8 (section 6.2).
Figure 5: Distribution of content-types among user groups (visiting frequency per content-type identifier, for the nl, other, staff and student groups)
6.3.2 Session identification

Two session identification methods were described earlier in this thesis. The first table below summarizes statistics on time frame (TF) identified sessions for all user groups; the timeout parameter was set to the standard [13] length of 30 minutes.

User group session statistics for time frame identification (Table 12): for each group (all users; the geographical nl and other groups; the staff and student subnet network ranges) the table lists the total number of sessions and the minimum, average, maximum and standard deviation of the session length.

The table shows that users visit around 3 to 5 pages on average within a single session. The statistics also show that users within the VU tend to visit more pages per session than the average. The surprisingly large maximal session length among all users is likely to come from a spider transaction; however, checking the details in the raw transaction data shows no signs of spider activity: neither does the user_agent field contain any spider pattern, nor do the requested pages show systematic downloading. This may be because some spiders, for various reasons, pretend to be a real user.

The second table contains statistics on maximal forward reference (MFR) identified sessions for all user groups (Table 13, with the same structure as Table 12). It shows that all groups contain many more sessions than in the case of TF identification. This derives from the nature of MFR identification, which breaks a session whenever a page has already occurred in it.

The geographical groups do not sum up, in either table, to the total session number of all users. This is because there are many IP addresses with missing domain names in the database (host names cannot be looked up for them). Based on our observations we use time frame identified session data for the further experiments: TF identification seemed more realistic on the examined database entries, and most researchers also apply this method, e.g. [30]. All the following experiments are based on session data instead of raw log entries.
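The time frame heuristic itself is simple to state: consecutive requests of one user belong to the same session as long as the gap between them stays below the timeout. A Java sketch of this step (the 30-minute constant follows the text above; everything else is illustrative):

    import java.util.*;

    // Sketch of time frame (TF) session identification for one user's requests,
    // given as (timestampMillis, contentType) pairs sorted by time.
    class TimeFrameSessionizer {
        static final long TIMEOUT_MS = 30 * 60 * 1000;   // standard 30-minute timeout

        static List<List<Integer>> sessions(List<long[]> requests) {
            List<List<Integer>> sessions = new ArrayList<>();
            List<Integer> current = new ArrayList<>();
            long lastTime = Long.MIN_VALUE;
            for (long[] r : requests) {
                if (!current.isEmpty() && r[0] - lastTime > TIMEOUT_MS) {
                    sessions.add(current);                 // gap too large: close the session
                    current = new ArrayList<>();
                }
                current.add((int) r[1]);                   // keep only the content type of the page
                lastTime = r[0];
            }
            if (!current.isEmpty()) sessions.add(current);
            return sessions;
        }
    }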
44 6.4 Mining frequent itemsets This section will provide information on frequent page sets and association rules for all the user groups. The AR implementation used by this project for data analysis is an Apriori-T (Apriori Total) algorithm, developed by the LUCS-KDD research team, which makes use of a "reverse" set enumeration tree where each level of the tree is defined in terms of an array (i.e. the T-tree data structure is a form of Trie) 7 [12]. For further details on the implementation refer to APPENDIX D4. The support and confidence threshold values for the association rules mining algorithm were tuned to give as much important patterns as possible and to keep the percentage of useless information in a low level. Analysis of all sessions presents an overall picture of all the user sessions retrieved from the database. A more sophisticated characterisation will follow in the part for analysis of the geographical and organizational groups The analysis of all visits The analysis of frequent itemsets within sessions of all users gives an overall picture of user behaviour on the domain. Frequent one-itemsets with their supports are presented in the table below: Frequent one-itemsets of all visits items (content-type labels and category names) support 1 (8) E/person/faculty 51,10% 2 (10) E/person/faculty/publication 35,45% 3 (2) miscellaneous 32,81% 4 (6) E/department 18,87% 5 (11) E/course 16,89% 6 (13) D/person/student 16,54% 7 (4) E/reference 16,00% 8 (18) documents 15,56% 9 (1) photo 14,65% 10 (7) E/project 12,45% 11 (19) other documents 9,64% 12 (3) D/department 9,59% 13 (12) D/course 9,52% 14 (9) E/person/student 9,51% Table 14 Item (1) shows that more than half of the sessions contain pages of faculty members (in English) and 35,45% of them include publication pages of faculty members (2). The high support of miscellaneous pages (3) does not indicate any special custom. It shows probably that 7 The input data for the algorithm contain sessions with redundant elements removed and types in ascending order. Trivial sessions that contain only one page are also stripped out. 44
45 a great proportion of the pages contain frames 8. Department pages were used in 24,05% of the transactions as (probably for) starting points of user visits 9. Course pages were visited in approximately 26% of the sessions (in this case the co-occurrence of the two categories is negligible, approximately 0,5%). English course pages were almost twice as popular as Dutch course pages. Dutch student pages are more popular than English ones. The joint occurrence of English and Dutch student pages is 23,09% based on the same calculation. Table 15 shows the selected frequent two-itemsets. Frequent two-itemsets of all visits items (content-type labels and category names) support 1 (10) E/person/faculty/publication, (8) E/person/faculty 19,44% 2 (8) E/person/faculty, (6) E/department 12,69% 3 (8) E/person/faculty, (2) miscellaneous 11,72% 4 (8) E/person/faculty, (1) photo 9,19% 5 (10) E/person/faculty/publication, (2) miscellaneous 9,12% 6 (18) documents, (8) E/person/faculty 8,26% 7 (11) E/course, (8) E/person/faculty 8,13% 8 (8) E/person/faculty, (4) E/reference 7,99% 9 (13) D/person/student, (2) miscellaneous 7,83% 10 (18) documents, (10) E/person/faculty/publication 7,63% 11 (10) E/person/faculty/publication, (6) E/department 7,27% 12 (10) E/person/faculty/publication, (4) E/reference 6,96% 13 (11) E/course, (10) E/person/faculty/publication 6,81% 14 (13) D/person/student, (8) E/person/faculty 6,43% 15 (8) E/person/faculty, (7) E/project 6,17% 16 (10) E/person/faculty/publication, (7) E/project 5,16% 17 (9) E/person/student, (8) E/person/faculty 4,61% 18 (6) E/department, (3) D/department 4,41% 19 (13) D/person/student, (1) photo 4,08% 20 (8) E/person/faculty, (3) D/department 3,84% 21 (13) D/person/student, (12) D/course 3,70% 22 (7) E/project, (6) E/department 3,26% 23 (19) other documents, (11) E/course 3,23% 24 (10) E/person/faculty/publication, (3) D/department 3,07% 25 (19) other documents, (10) E/person/faculty/publication 2,97% 26 (13) D/person/student, (9) E/person/student 2,96% Table 15 We can set up some rough custom models based on two-itemsets. 19,44% of the visits show interest on information of faculty members and their research. Itemsets don t provide sequential information but presumably visits belonging to (1) consist of an entry page for a faculty member and a consequent publication page of that person. (2), (6), (10) and (24) may also belong to this custom group. (2) and (24) forecast that such visits start on the department pages and then go on to faculty member pages. Itemset (6) and (10) show that many of the users download scientific material from the pages of faculty members. (8), (12), (15) and (16) show special interest on faculty member pages for project information and references. Itemsets that contain 8 Frame pages, as it was described earlier, mostly do not contain valuable information for the content classifier algorithm. 9 This result comes from the sum of supports of the English and Dutch pages subtracted the support of their co-occurrence, refer to table 15 of two-itemsets 45
46 miscellaneous type indicate that pages are probably structured in framesets, such as pages in content categories 8, 10, 13 in itemsets (3), (5), (9). Itemset (4) can be interpreted as a primitive model for free time or photo viewer activities. It contains page visits for photo galleries of faculty members. These galleries mostly contain personal photos like travel etc. images. (19) also relates to this custom group with the difference that it contains student photo gallery pages. (7) (13) and (23) form a study custom group. Many persons of the scientific staff present all their professional information on a single web page. The content classifier algorithm in this case will probably choose a content-type that refers to the largest topic on it. This resulted presumably in the strange combination of itemset (13). (7) and (13) basically indicates the same consequence which is that they contain course page (in English) visits from faculty member pages. In all certainty this member and the teacher of the course is the same person or has a strong relation to the course. (23) shows that 3,23% of the visits result in the download of course materials. Frequent three-itemsets of all visits items (content-type labels and category names) support 1 (10) E/person/faculty/publication, (8) E/person/faculty, (6) E/department 4,96% 2 (18) documents, (10) E/person/faculty/publication, (8) E/person/faculty 4,21% 3 (10) E/person/faculty/publication, (8) E/person/faculty, (7) E/project 3,54% 4 (11) E/course, (10) E/person/faculty/publication, (8) E/person/faculty 3,05% 5 (10) E/person/faculty/publication, (8) E/person/faculty, (4) E/reference 2,68% Table 16 The table above contains the frequent three-itemsets. Itemsets (1), (2), (3) and (5) forms the previously described faculty member or research custom group. A possible scenario for a user visit based on these sets can be that a user starts the visit on the department pages. Then he goes to a faculty member page and visits the member s publication page. In the meantime he downloads materials from the member s pages. He would also with a great probability visit project or reference pages from faculty member pages. (4) represents the study custom group. Such visits start from a faculty member s or his publication s page (which probably also has a mixed type of content) and ends on course pages related to the member. Association rules of all visits premise conclusion confidence 1 (7) E/project, (10) E/person/faculty/publication (8) E/person/faculty 68.7% 2 (6) E/department, (10) E/person/faculty/publication (8) E/person/faculty 68.22% 3 (6) E/department (8) E/person/faculty 67.26% 4 (7) E/project, (8) E/person/faculty (10) E/person/faculty/ publication 57.39% 5 (10) E/person/faculty/publication, (18) documents (8) E/person/faculty 55.1% 6 (10) E/person/faculty/publication (8) E/person/faculty 54.82% 7 (18) documents (8) E/person/faculty 53.04% 8 (8) E/person/faculty, (18) documents (10) E/person/faculty /publication 50.94% Table 17 46
47 Association rules provide more information on frequent itemsets. Table 17 contains rules that have higher confidence than 50%. (1) indicates that if a user visits project and publication pages he will also visit faculty member pages with 68,7% confidence, etc. All the rules in the table belong to the research custom group. This fact consolidates the importance of this type of behaviour and indicates that it is the most significant among visiting behaviour types The analysis of the geographical groups Table 18 shows the selected frequent one-itemsets of the nl and other geographical groups. Frequent one-itemsets of the geographical groups items (content-type labels and category support names) nl group other group 1 (8) E/person/faculty 44,57% 53,34% 2 (13) D/person/student 36,13% 5,94% 3 (10) E/person/faculty/publication 22,81% 41,32% 4 (1) photo 20,17% 11,58% 5 (12) D/course 19,86% 3,84% 6 (6) E/department 17,90% 19,81% 7 (9) E/person/student 12,69% 7,51% 8 (18) documents 11,70% 16,74% 9 (7) E/project 11,38% 15,63% 10 (11) E/course 8,54% 21,49% 11 (4) E/reference 8,35% 18,63% Table 18 The research behaviour type is significant in both categories but considering the summed support values for itemsets (1), (3), (8), (9) and (11) shows that research pages have an almost 50% higher visit rate within group other. The summed support values for content-categories of type 8, 10, 4, 18 and 7 are 145,66% for other and 98,81% for nl user groups. Free time behaviour is more frequent in the nl group based on the support values for student and photo categories. While the support of the photo category is 20,17% in nl visits, the other group contains 11,58% of the photo visit rate. Student pages are also frequently visited within the nl group. The summed supports for Dutch and English student pages is 42,92% (subtracted their co-occerrence) while the same value in the other group is approximately 13,45% (their cooccurrence within this category is negligible). Not surprisingly, the study custom group is also more frequent in the nl than in the other group. Dutch and English course pages have 28,4% of summed support in the nl group while the same value in the other group is 25,33%. In case of the nl group it indicates that many students probably study and therefore visit course pages from home. The other group contains very few Dutch course visits, which is the second most frequent category among the nl visits, but has a surprisingly large amount of visits to English course pages. This fact indicates that English course pages contain useful information for foreign visitors. 47
48 Table 19 contains frequent two-itemsets of the geographical groups. Frequent two-itemsets of the geographical groups support items (content-type labels and category names) other nl group group 1 (10) E/person/faculty/publication, (8) E/person/faculty 14,17% 22,11% 2 (13) D/person/student, (8) E/person/faculty 12,04% 3,04% 3 (8) E/person/faculty, (6) E/department 9,42% 14,91% 4 (13) D/person/student, (1) photo 8,59% - * 5 (9) E/person/student, (8) E/person/faculty 8,55% - * 6 (8) E/person/faculty, (7) E/project 8,43% 5,31% 7 (13) D/person/student, (12) D/course 8,36% - * 8 (8) E/person/faculty, (1) photo 8,24% 9,50% 9 (10) E/person/faculty/publication, (6) E/department 7,02% 7,37% 10 (8) E/person/faculty, (4) E/reference 6,17% 7,85% 11 (18) documents, (8) E/person/faculty 5,91% 9,15% 12 (13) D/person/student, (9) E/person/student 5,90% - * 13 (12) D/course, (8) E/person/faculty 5,60% - * 14 (9) E/person/student, (1) photo 4,24% - * 15 (10) E/person/faculty/publication, (4) E/reference 4,17% 8,18% 16 (18) documents, (10) E/person/faculty/publication 4,11% 9,00% 17 (11) E/course, (8) E/person/faculty 3,56% 10,43% 18 (11) E/course, (10) E/person/faculty/publication 3,33% 8,43% * Not supported by the set support threshold value. Table 19 The table above shows that indeed the other group contains mostly official visits, such as itemsets (1), (3), (6), (9), (10), (11), (15) and (16). Visitors in this group most likely start on English department pages and from there they go on to faculty member pages and navigate to member s publication pages. A large percentage of them also visit reference and project pages following links from faculty members pages. Many users within this group download documents from faculty members. Official visits are also frequent in the nl group, but in contrast with the other group it also contains a great number of study visits. (7), (13), (17) and (18) support the assumption that most of the study visits start on the faculty member pages and then go on to the course pages. It is interesting to note that the Dutch and English pages are not mixing within sessions. Probably this is because the Vrije Universiteit provides bachelor and masters degrees and while the official language of bachelor education is Dutch, most of the courses are in English in case of the masters. The other group also contains a large number of course pages in English visited from faculty members pages. Such visits can be generated by interested teachers and students from abroad. (4), (8), (12) and (14) show free time visits. (4) and (14) contain Dutch and English student page visits and visits for their photo pages followed the links from them. (8) contains the same types of sessions for faculty member pages. (2), (5) and (7) indicate a mixed activity of free time and other behaviour types. 48
49 Frequent three-itemsets of the geographical groups are presented in table 20. Frequent three-itemsets of the geographical groups support items (content-type labels and category names) other nl group group 1 (10) E/person/faculty/publication, (8) E/person/faculty, (6) E/department 4,49% 5,13% 2 (10) E/person/faculty/publication, (8) E/person/faculty, (7) E/project 4,41% 3,25% 3 (13) D/person/student, (8) E/person/faculty, (1) photo 4,41% - * 4 (13) D/person/student, (9) E/person/student, (8) E/person/faculty 4,37% - * 5 (13) D/person/student, (9) E/person/student, (1) photo 3,68% - * 6 (10) E/person/faculty/publication, (8) E/person/faculty, (4) E/reference 3,30% - * 7 (18) documents, (10) E/person/faculty/publication, (8) E/person/faculty 2,63% 4,93% 8 (11) E/course, (10) E/person/faculty/publication, (8) E/person/faculty - * 3,59% * Not supported by the set support threshold value. Table 20 (1) shows the classic research visit. Users start on department pages, navigate to faculty members pages and to the members publication pages. (2), (6), (7) and probably (8) also belong to the research custom. (3), (4) and (5) present mostly free time visits. The study behaviour type is missing from the three-itemsets. An explanation can be that students of the University know the URLs of study pages exactly and go directly there instead of starting from department pages and following through the links The analysis of the organizational groups Table 21 shows the frequent one-itemsets of the staff and student organizational groups. Frequent one-itemsets of the organizational groups support items (content-type labels and category names) student staff group group 1 (8) E/person/faculty 68,36% 59,41% 2 (10) E/person/faculty/publication 31,75% 36,88% 3 (3) D/department 31,23% 16,31% 4 (6) E/department 26,58% 18,61% 5 (13) D/person/student 22,13% 16,38% 6 (1) photo 16,75% 5,88% 7 (18) documents 15,62% 18,40% 8 (12) D/course 13,34% 16,59% 9 (7) E/project 12,82% 41,50% 10 (4) E/reference 7,65% 19,03% 11 (9) E/person/student 6,83% 6,16% 12 (11) E/course - * 10,36% * Not supported by the set support threshold value. Table 21 49
50 One would expect higher differences among the support values for content categories within the staff and student organizational groups than the presented supports in table 21. One may think that categories like student, photo and course pages are visited at a significantly higher rate in the student than in the staff group. The opposite can be observed in case of itemsets (5), (6) and (11). This fact shows that for some reasons teachers are more interested in student and photo pages than students. A possible explanation for this phenomenon can be that Ph.D students within the staff group visit their fellow student pages. The table shows also that both groups are interested in research pages and members of the staff group don t visit the English course pages. (9) and (10) show that students are much more interested in project and reference pages than teachers. Two- and three-itemsets will probably provide more information on the afore-mentioned discrepancies. Frequent two-itemsets are presented in the table below. Frequent two-itemsets for the organizational groups support items (content-type labels and category names) staff group student group 1 (10) E/person/faculty/publication, (8) E/person/faculty 22,34% 27,08% 2 (8) E/person/faculty, (3) D/department 19,23% 5,95% 3 (8) E/person/faculty, (6) E/department 14,99% 9,45% 4 (13) D/person/student, (8) E/person/faculty 14,89% 5,11% 5 (8) E/person/faculty, (1) photo 10,55% - * 6 (18) documents, (8) E/person/faculty 10,55% 7,07% 7 (10) E/person/faculty/publication, (6) E/department 9,72% 7,84% 8 (8) E/person/faculty, (7) E/project 8,79% 37,16% 9 (8) E/person/faculty, (4) E/reference 5,89% 18,05% 10 (7) E/project, (4) E/reference - * 15,75% * Not supported by the set support threshold value. Table 22 (2) and (3) show that the research behaviour type is more general within staff group than it is in student. Teachers may look for contact information (e.g., telephone number, address etc.) of their colleagues via faculty member pages. Interesting is that photo galleries of faculty members (5) are only popular among teachers. (8) and (9) indicate that project and reference pages mostly contain study material for students. Project and reference pages probably cover useful information for course assignments that students have to do in groups. This would explain that a great proportion of students use the University s infrastructure to visit such pages. 50
51 Table 23 contains frequent three-itemsets of the organizational groups: Frequent three-itemsets of the organizational groups support items (content-type labels and category names) staff group student group 1 (8) E/person/faculty, (6) E/department, (3) D/department 6,83% 2,52% 2 (10) E/person/faculty/publication, (8) E/person/faculty (6) E/department 6,10% 4,48% 3 (13) D/person/student, (8) E/person/faculty (10) E/person/faculty/publication 5,79% - * 4 (10) E/person/faculty/publication, (8) E/person/faculty (3) D/department 5,17% - * 5 (10) E/person/faculty/publication, (6) E/department (3) D/department 5,07% - * 6 (10) E/person/faculty/publication, (8) E/person/faculty (7) E/project 4,45% 18,61% 7 (18) documents, (10) E/person/faculty/publication, (8) E/person/faculty 3,72% 3,43% 8 (8) E/person/faculty, (7) E/project, (4) E/reference - * 15,68% 9 (10) E/person/faculty/publication, (8) E/person/faculty (4) E/reference - * 12,32% 10 (10) E/person/faculty/publication, (7) E/project, (4) E/reference - * 11,41% 11 (12) D/course, (8) E/person/faculty, (6) E/department - * 2,52% * Not supported by the set support threshold value. Table 23 (2) and (4) show the classic research visit. Teachers tend to visit faculty pages in a sequence of department, faculty member and faculty member s publication pages. Itemsets (1) and (5) indicate that in most cases users change the language of department pages within their sessions. The study behaviour type is popular among student visits. They consist of pages in sequence of faculty member, publication and project or reference categories, such as (6), (8), (9) and (10) Conclusion The study of frequent itemsets indicated that the most significant behaviour type is research in almost every user group. More than 50% is the proportion of research visits among all sessions. In case of the geographical groups other also contains more than 50% of support for research pages while among the organizational groups staff has a higher visit rate for this type of pages. The study behaviour has a relatively low base among all visits. However, the geographical nl and the organizational student groups have a high visit rate for the study custom. The free time behaviour type has a base of approximately 20% within all sessions. This high visit rate is also typical within the nl geographical group but not apparent significantly among sessions belong to the organizational groups. However, the staff has a relatively large visit rate for photo galleries of faculty members. 51
6.5 The mixture model

We drew the basic inferences from the frequent itemsets in the previous section. In this section we try to refine the established custom characteristics with an analysis of mixture models (MM) for each session group. The mixture model implementation used for modelling the session data was developed in a different project [17].

A mixture model can be viewed as a clustering of all the users. Each cluster is characterized by a vector of frequencies (thetas) with which members of such a cluster visit specific pages. These frequencies can be visualized in a bar chart, a kind of "group profile". Additionally, the parameter alpha can be interpreted as the cluster size. To interpret the charts in the following sections it is necessary to look at the legend (refer to Table 6, Description of the content-types, in section 6.1.2, The mapping table).

In our experiments we ran the MM algorithm with 10 different settings for the number of mixture components (from 1-component up to 10-component mixtures) to build models on each data set. We set the algorithm to repeat the model building process 10 times for each mixture setting and to choose the most probable model for each component number. We use log-probability scores ("logp scores") to evaluate the predictive power of the models. Logp scores are calculated from the formula of Notion 5.5 (in chapter 5), transformed to the logarithm of the expression. Higher logp scores mean that the model is evaluated to be more probable on the data set. In most cases we include only the figure of the most probable mixture model in the thesis.
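The selection protocol just described amounts to two nested loops. The following schematic Java sketch uses hypothetical stand-ins (MixtureModel, Fitter) for the mixture model implementation of [17], whose actual interface is not part of this thesis:

    import java.util.List;

    // Schematic model selection: for each number of components, fit the mixture
    // several times and keep the restart with the highest logp score.
    class ModelSelection {
        interface MixtureModel { double logP(); }                      // assumed scoring interface
        interface Fitter { MixtureModel fit(List<int[]> sessions, int k); }

        static MixtureModel[] selectBest(List<int[]> sessions, Fitter fitter) {
            MixtureModel[] best = new MixtureModel[11];
            for (int k = 1; k <= 10; k++)                    // 1..10 mixture components
                for (int run = 0; run < 10; run++) {         // 10 restarts per component setting
                    MixtureModel m = fitter.fit(sessions, k);
                    if (best[k] == null || m.logP() > best[k].logP())
                        best[k] = m;                          // keep the most probable restart
                }
            return best;                                      // compare best[k].logP() across k afterwards
        }
    }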
The analysis of all visits

The figure below presents the logp scores of all 10 mixture component settings.

Figure 6: Logp scores of the 10 mixture component settings of all visits (log-likelihood, on the order of -7.5 x 10^5, as a function of the number of iterations and the number of clusters)

The mixture model with the trivial single component shows only data statistics similar to the frequent one-itemsets, which we already discussed in the previous section. We chose the maximal number of mixture components heuristically: the logp scores for models with more than 6 or 7 components tend to have the same characteristics and lie close to each other. Therefore we chose the most probable model from the analysis of the 2- to 7-component models.

Figure 7: Two-component mixture model of all visits

The histograms in the mixture model above present clusters of similar users. The alphas are their sizes, and the histogram values represent the interests of the members of these clusters. The first component of figure 7 shows the research and study behaviour types, and the second presents a mixture of free time and study activities. These mixtures within the base components indicate that the number of components is probably higher than two. The analysis of all the figures resulted in choosing the model with six mixture components as the most probable (figure 8).

The first base component refers to a research behaviour and has a very high (0,27) probability. The second component also has a high probability and shows a study behaviour type, with visits to faculty member, publication, reference and course pages and downloads of course materials. The third component refers to the student page visit custom, with visits to English and Dutch student pages; it also has some visits to Dutch course and faculty member pages. Component number four is present in almost all mixture models for all visits, regardless of the number of components (above two), and carries no interpretable information, given that the miscellaneous category mostly refers to frameset or empty pages. Component five refers to a determined research download behaviour, which means that users know exactly the URL of the material they want to download. The last component presents a free time visit model in which users visit photo galleries; the visited photos belong mostly to faculty members.
Figure 8: Six-component mixture model of all visits (component sizes α = 0.031, 0.1, 0.15, 0.21, 0.24 and 0.27)

The analysis of the geographical groups

We also chose six as the most probable number of components in the case of the geographical groups. The first component of the mixture model of the nl geographical group in figure 9 presents the free time behaviour through visits to Dutch student pages. The second most probable component (nr. 2) refers to study visits that probably start on Dutch department pages, go on to Dutch course pages, and finally download course materials. The research component (nr. 3) also has a high probability and refers to the classical sequence of department pages, faculty member pages and members' publication pages. In the case of component four, miscellaneous pages are combined with department, faculty member and student pages; this could mean that the structures of these pages are based on frames, but no visiting characteristics can be observed. The determined research download habit is present in component five, with a proportion of 7,6% of all sessions. The last component is a mixture of free time visits: it contains visits to faculty members' and students' photo pages as well as to activity pages.
Figure 9: Six-component mixture model of the geographical nl group (component sizes α = 0.054, 0.076, 0.16, 0.21, 0.22 and 0.28)

The probability of the research behaviour is much higher within the geographical other group than in the nl model. The first and second components of figure 10 together account for more than 50% of the probability for research pages. The first component can also model the study custom of foreign students. Component three models the interest in student pages; this component also has a relatively large probability. Determined research downloads account for approximately 10% of the sessions in this group. The photo viewing habit is present with a low probability in the last component.
Figure 10: Six-component mixture model of the geographical other group (component sizes α = 0.042, 0.1, 0.16, 0.17, 0.25 and 0.28)

The analysis of the organizational groups

Figure 11 shows the six-component mixture model of the organizational staff group. The high presence of English and Dutch department pages in the first component (from the top), without any other significant category, may imply that the web browsers on staff members' machines are set to show department pages as start pages. Component two also shows such a habit, with the difference that teachers may have set their own home page as start page (with 0,29 probability). The third component shows interest in faculty member and research pages; teachers may look for colleagues' contact information. Component four refers to a determined student page visit behaviour, and component five to a determined download habit. The last component shows that photo pages are also visited with direct requests for the pages, but the photo viewing behaviour is not popular within this group. Most of the components don't contain department pages, which indicates that most of the users within this group know the URLs of the required resources.
Figure 11: Six-component mixture model of the organizational staff group (component sizes α = 0.039, 0.08, 0.089, 0.18, 0.29 and 0.33)

In the six-component mixture model of the student group (figure 12), components one and five most probably refer to a study habit. The third component implies the classic research sequence. Component four represents the free time visit behaviour with Dutch student page visits. This group also contains the determined download habit, represented by component five. The last component also shows some kind of free time activity, with visits to activity pages and possible downloads of registration forms for free time events.
Figure 12: Six-component mixture model of the organizational student group (component sizes α = 0.068, 0.11, 0.15, 0.19, 0.23 and 0.25)

Conclusion

The results of the mixture model analysis show the same major characteristics as the results of frequent itemset mining. The research behaviour type is the most probable visit activity among all visits, followed by the study and free time habits. The geographical nl group contains more free time visits, while the other group has a higher visit rate for research pages. Sessions in the staff group are more likely to consist of research or start-up (department) pages, whilst the student group contains more visits of behaviour types like study, research and free time.
6.6 The global tree model

In contrast with the previous models, the global tree model (GTM) is based on the sequential information present in sessions (in terms of consecutive page visits). The tree model provides frequent navigational paths and a tree-like visualization of the relevant patterns. The analysis of all raw sessions for the user groups would result in large, only slightly informative trees. Since we want to analyse complex user navigational paths, we strip out one-length sessions. One-length sessions are generated mostly by users following links from search result pages, by start page settings of web clients and by direct visits; either way, these items distort the overall characteristics of user behaviour. We also eliminate consecutive redundant elements within sessions (e.g., a page type requested several times in a row is kept only once). This transformation gathers up all sessions with the same characteristics while preserving the ordering information. In the following experiments we include only partial trees, or trees referring to sessions with a relatively high support rate; the complete trees for each group are given in APPENDIX C4 to C8. The CD-ROM contains additional tree visualization figures in high resolution.
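The two preprocessing steps, dropping one-length sessions and collapsing consecutive repetitions of the same content type, are easy to express in code; a small illustrative Java sketch:

    import java.util.*;

    // Sketch of the GTM preprocessing: collapse consecutive duplicates, drop one-length sessions.
    class SessionPreprocessor {
        static List<List<Integer>> prepare(List<List<Integer>> sessions) {
            List<List<Integer>> result = new ArrayList<>();
            for (List<Integer> s : sessions) {
                List<Integer> collapsed = new ArrayList<>();
                for (Integer ct : s)
                    if (collapsed.isEmpty() || !collapsed.get(collapsed.size() - 1).equals(ct))
                        collapsed.add(ct);                       // keep only changes of content type
                if (collapsed.size() > 1) result.add(collapsed); // strip out one-length sessions
            }
            return result;
        }
    }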
The analysis of all visits

Figure 13 shows the tree visualization of all visits at a 3% support threshold:

Figure 13: The tree model of all visits (3% support threshold)

This partial tree shows that research is the most important behaviour type among all visits. 29% of the sessions start with faculty member pages and go on to publication pages. Table 24 (and the figure in APPENDIX C4) shows that, surprisingly, only a relatively low number of sessions start on the department pages. Most of the users go directly to the faculty members' pages and browse the members' publication pages from there. If a user starts from the department pages he mostly continues on faculty member pages, as shown in session type (8). It is interesting that 16% of the sessions that start on faculty members' pages go on to the department pages, which is the opposite of what one might expect. A relatively large proportion of sessions start directly on publication pages; 17% of these sessions end with downloading documents, whilst 19% of them end with visiting reference pages.

Frequent sessions of all visits at 1% support threshold
(per session type: frequency and percentage)
1  (8) E/person/faculty, (10) E/person/faculty/publication
2  (2) miscellaneous, (7) E/project, (2) miscellaneous, (7) E/project, (2) miscellaneous, (7) E/project, (2) miscellaneous, (7) E/project  887  2%
3  (8) E/person/faculty, (6) E/department  705  2%
4  (8) E/person/faculty, (4) E/reference  600  1%
5  (8) E/person/faculty, (11) E/course  582  1%
6  (8) E/person/faculty, (1) photo  571  1%
7  (10) E/person/faculty/publication, (18) documents  505  1%
8  (6) E/department, (8) E/person/faculty  478  1%
9  (13) D/person/student, (2) miscellaneous, (13) D/person/student  454  1%
10 (8) E/person/faculty, (18) documents  444  1%
11 (10) E/person/faculty/publication, (4) E/reference  439  1%
12 (11) E/course, (19) other documents  397  1%
Table 24

The analysis of the geographical groups

Figures 14 and 15, together with the figures in APPENDIX C5 and C6, contain the most frequent navigational paths of the geographical groups. As we stated earlier, users within the other group tend to visit research pages more frequently than those within the nl group: 34% of their visits start on faculty member pages, and the majority then go on to publication pages. Some of these visits end with downloading documents or with returning to member pages. The nl group also contains a large proportion (18%) of research pages; however, the corresponding branch in the nl tree is a mixture of contents, containing student and photo pages as well. 11% of the visitors in the other group use faculty member pages to reach E/course materials. No such behaviour can be observed in the nl group; quite the contrary, the nl group does not contain E/course pages at all among its frequent navigational paths. 13% of the visitors of faculty member pages go on to see the photo pages of the members; this proportion is twice as high as in the nl group. Visitors in the other group use the department pages more frequently. Most of these visits are likely to go on to the faculty members' pages, in both the other (55% of them) and the nl (26% of them) groups. In the case of the other group, publication and project pages are also frequent destinations from the department pages. The nl group tends to have more free time visits than the other group: 19% of the visits related to the nl group contain student pages and 20% of them include photo galleries. The study behaviour type does not appear as an individual (sub)branch of the nl tree, but study pages are spread around the tree.
61 Figure 14: The tree model of the nl group Figure 15: The tree model of the other group The analysis of the organizational groups Figures 16 and 17 and in the APPENDIX C7 and C8 contain the most frequent navigational paths for the organizational groups. Visits of the staff users start mostly on faculty member pages. They then navigate to publication pages, download materials or simply go to department pages. The most relevant session structure for the student group starts on miscellaneous pages, then goes to faculty member pages followed by project pages and finally ends either on publication or reference pages. Reference, project, faculty member, and publication pages are spread in the whole student tree mixed with other components. Both staff and student trees are a kind of mixtures. They don t contain clear user behaviour types, whereas trees for the geographical groups and for all sessions do. The reason probably is that organizational groups contain much less sessions. 61
Figure 16: The tree model of the staff group
Figure 17: The tree model of the student group

The similarity of tree models

Table 25 contains the similarity measures for all pairs of tree models of all groups. We equalized each pair of session data sets before measuring them; that is, we randomly stripped out sessions from the larger data set to make the number of sessions equal within each pair (a minimal sketch of this step is given after the discussion of Table 25). The diagonal from the upper left to the lower right corner contains 100% similarity, since its entries refer to the similarity of a group with itself. The similarity matrix is symmetrical because the similarity measure is commutative.
Similarity measures for tree models of all user groups
group                     all       geog. nl   geog. other   org. staff   org. student
all                       100%      40.36%     70.46%        20.29%       21.11%
geographical nl           40.36%    100%       27.76%        20.08%       27.78%
geographical other        70.46%    27.76%     100%          19.18%       16.59%
organizational staff      20.29%    20.08%     19.18%        100%         23.75%
organizational student    21.11%    27.78%     16.59%        23.75%       100%
Table 25

According to these figures, the other group is the most similar to the all sessions group. This is not surprising, since the other group makes up the largest share of all sessions. Comparing the nl and other groups results in 27.76% similarity, while measuring the distance between the staff and student groups yields 23.75% similarity.
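The equalization step mentioned above (randomly stripping sessions from the larger set of a pair) is simple enough to sketch in a few lines of Java. The class and method names below are illustrative only and are not part of the webmining package.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    // Illustrative sketch: before two tree models are compared, the two underlying
    // session sets are made equally large by randomly dropping sessions from the larger one.
    public class SessionEqualizer {

        // Returns a random subset of 'sessions' with exactly 'target' elements.
        static <T> List<T> randomSubset(List<T> sessions, int target, Random rnd) {
            List<T> copy = new ArrayList<>(sessions);
            Collections.shuffle(copy, rnd);                   // random order
            return new ArrayList<>(copy.subList(0, target));  // keep only the first 'target' sessions
        }

        // Equalizes a pair of session sets; the result contains two lists of equal size.
        static <T> List<List<T>> equalize(List<T> a, List<T> b, Random rnd) {
            int target = Math.min(a.size(), b.size());
            List<List<T>> pair = new ArrayList<>();
            pair.add(randomSubset(a, target, rnd));           // same elements if 'a' is already the smaller set
            pair.add(randomSubset(b, target, rnd));
            return pair;
        }
    }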
Conclusion

The analysis of the tree models mostly confirms our preliminary assumptions about the ordering of pages in typical sessions (for details refer to the AR and MM sections, 6.4 and 6.5). However, in some cases it turned out that the expected orderings are not realistic. Most of the groups contain the subsequence of faculty member pages followed by department pages more frequently than department pages followed by faculty member pages, whereas one would expect the opposite.

7 Conclusion and future work

In our work we have presented a methodology for web usage mining. We discussed the data preprocessing and data enrichment processes applied to the access log entries of web servers. Data enrichment integrates the content types of documents with the access log entries. In the next step, the enriched data is structured into user navigational sequences. With the help of geographical and organizational data we set up user groups and assigned the related sessions to them. We presented three data mining models for exploring user behaviour among groups of users: the association rules mining algorithm was used to explore frequent itemsets and rules derived from them; the mixture model provided a clustering of users based on the similar collections of pages they visit; and the global tree model was proposed for mining frequent navigational paths while preserving the sequential information of page visits. Visualization of the tree models facilitates human perception and in this way helped to identify the most important patterns.

Finally, we applied all the discussed techniques to the web site of the Computer Science Department of the Vrije Universiteit, The Netherlands (the cs.vu.nl domain). Analysing the experimental results, we discovered three significant types of user behaviour: research, study and free time. Sessions belonging to the research behaviour type consist mostly of faculty member pages, their publication pages, and reference and project pages; they include department pages for navigation and downloads of (scientific) documents. The study custom mostly involves Dutch and English course pages, but also contains reference and project pages in large numbers. The free time visits consist mostly of photo pages, activity pages, and Dutch and English student pages. Other, minor behaviour types are described within the analysis of the models.

In general, the research custom is by far the most popular among all sessions and among most of the session groups. Study pages are not as popular as research pages, but they still have a significant base within all sessions. The free time habit is the least popular among the behaviour types, yet it still has a relatively large support among all sessions.

We categorized all the user sessions into the four subcategories of the geographical and organizational group categories. The geographical categories are the nl and the other groups, where nl refers to sessions of users from the Netherlands and other consists of sessions of users from all other countries. The organizational categories are the staff and student categories, referring to the sessions of the staff and student users of the university. The research custom is the most frequent among users of the other group; approximately half of the sessions within this group relate to this custom. Within the organizational groups, staff has more research sessions. The study custom is the most frequent in the geographical nl and the organizational student groups. The other group also contains a large number of visits to English course pages. Free time visits are the most popular among nl sessions and have a significant base within student visits. The staff group also contains a significant proportion of sessions containing pages of faculty members' photos.

Surprisingly, department pages are infrequent among the starting pages of user sessions. This indicates that most users do not use department pages for page lookups.
However, a popular scenario is to start the visit on faculty member pages and then go on to department pages to navigate to the next destination. Another conclusion is that course pages are not particularly popular among students. This may indicate that students
visit course pages mostly from home. However, a significant proportion of students visit reference and project pages from the labs of the university, which may imply that they mostly use the facilities of the university for solving group assignments. It should be remarked that the Vrije Universiteit has a dedicated Intranet system (Blackboard) for managing all course information. Although most of the courses have informational pages within the VU-pages, this system may provide extra information for users; visits to the Blackboard system were not tracked within this project. Another important pattern is that users from abroad tend to visit English course pages. These sessions can be generated either by students looking for course materials for their studies or by foreign teachers reading up on course information.

The analysis covers only a short period (one month: June) and the observed patterns certainly change over time. This fact may explain some extreme patterns found by the data analysis. To smooth out periodical patterns it would be interesting to perform the data analysis automatically, e.g., once a month.

We tried to develop algorithms that are as accurate as possible, but there are some internal and external limitations that influence the experimental results. Web logs of public web domains provide insufficient information on users, so some of the identified users and their sessions may contain incorrect data despite the heuristics applied in the identification process. The accuracy can only be improved by using cookies or other external identification techniques (refer to section 6.3, Data structuring). The problem disappears entirely in the analysis of an Intranet (login required) application, because users are then identified automatically.

Another problem is the high number of mapping errors, which occur either because some requests refer to a deeper level of the HTML page structure than was crawled, or because the requested pages were removed in the meantime. The number of mapping errors can therefore partly be reduced by downloading the VU-pages to deeper levels, although the exponential growth in the number of pages would overload the content classifier algorithm. A much more sophisticated solution would be to build a separate content retrieval system in which every page of a website has at least one URL, content type, and timestamp entry. Each time a page changes, a new entry would record its new content type (in case it differs from the previous one). During the analysis of access log files, the timestamp of each request would be compared with the timestamps of the content entries and the suitable content label would be chosen (a minimal sketch of such a lookup is given below). The accuracy of content labelling is the most critical part of the whole process. The current average accuracy of 74% for the content categories assures the reliability of the major user characteristics that were observed; increasing the accuracy would, however, reduce the noise in the data and make the experiments even more reliable.

The web usage mining system proposed in this thesis was built for static analysis. This means that the access log files, the VU-pages and all the other input data were evaluated offline, independently of their generation time. A potential improvement of the system would be to process access log entries and to attach labels to the requested documents online, at the moment they are generated. This would allow us to analyse systems dynamically.
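To make the proposed content retrieval system more concrete, the sketch below keeps, for every URL, a time line of (timestamp, content type) entries and returns the label that was valid at the time of a request. It only illustrates the idea described above; the class name, the integer content-type codes and the millisecond timestamps are assumptions and not part of the implemented webmining package.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    // Hypothetical sketch of the proposed timestamped content retrieval system.
    public class ContentHistory {

        // For each URL: timestamp (milliseconds) -> content type code valid from that moment on.
        private final Map<String, NavigableMap<Long, Integer>> history = new HashMap<>();

        // Records that 'url' has content type 'contentType' from 'timestamp' on.
        public void addEntry(String url, long timestamp, int contentType) {
            history.computeIfAbsent(url, u -> new TreeMap<>()).put(timestamp, contentType);
        }

        // Returns the content type valid at 'requestTime', or -1 if the URL is unknown
        // or all its entries are newer than the request (which would count as a mapping error).
        public int labelFor(String url, long requestTime) {
            NavigableMap<Long, Integer> timeline = history.get(url);
            if (timeline == null) return -1;
            Map.Entry<Long, Integer> entry = timeline.floorEntry(requestTime); // newest entry not after the request
            return entry == null ? -1 : entry.getValue();
        }
    }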
I will continue my research within the DIANA project ( DIANA/), focusing on the real-time analysis of dynamic systems and on the development of new, adaptive algorithms for mining data streams.
Acknowledgements

I wish to express sincere appreciation to Dr. Wojtek Kowalczyk for his assistance and insight throughout the development of this project. I would also like to express sincere thanks to Dr. Elena Marchiori for her valuable advice and feedback. In addition, special thanks to Dr. Frits Daalmans for technical and non-technical advice, support, and editing. I also thank Krisztián Balog for the fruitful cooperation during the project and for providing the content label data for the VU-pages. I thank Dr. Elisabeth Hornung, my Mom, for her advice and suggestions, and many thanks to my family for supporting me during the year in the Netherlands. I would also like to say a special thanks to my lovely girlfriend Maya, for her patience and for pushing me over the finishing line. Finally, I would like to express my special thanks to the Vrije Universiteit, Amsterdam for the opportunity to participate in the one-year International Master Program.
Bibliography

1. Agrawal, R., Imielinski, T., and Swami, A. (1993), Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD Conference on Management of Data.
2. Baglioni, M., Ferrara, U., Romei, A., Ruggieri, S., and Turini, F. (2003), Preprocessing and Mining Web Log Data for Web Personalization. 8th Italian Conference on Artificial Intelligence, LNCS.
3. Balog, K. (2004), An Intelligent Support System for Developing Text Classifiers. MSc. Thesis, Vrije Universiteit Amsterdam, The Netherlands.
4. Cadez, I.V., Heckerman, D., Meek, C., Smyth, P., and White, S. (2003), Model-Based Clustering and Visualization of Navigation Patterns on a Web Site. Data Mining and Knowledge Discovery, vol. 7, no. 4.
5. Cadez, I.V., Smyth, P., Ip, E., and Mannila, H. (2001), Predictive Profiles for Transaction Data using Finite Mixture Models. Technical Report, Information and Computer Science Department, University of California, Irvine.
6. Chen, Z., Fu, A., and Tong, F. (2002), Optimal Algorithms for Finding User Access Sessions from Very Large Web Logs. Proceedings of the Sixth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Taipei.
7. Chen, M., Park, J.S., and Yu, P.S. (1998), Efficient data mining for path traversal patterns. IEEE Transactions on Knowledge and Data Engineering, vol. 10, no. 2.
8. Chevalier, K., Bothorel, C., and Corruble, V. (2003), Discovering rich navigation patterns on a web site. Proceedings of the 6th International Conference on Discovery Science, Hokkaido University Conference Hall, Sapporo, Japan.
9. Cho, Y.H., and Kim, J.K. (2004), Application of Web usage mining and product taxonomy to collaborative recommendations in e-commerce. Expert Systems with Applications, vol. 26.
10. Cho, Y.H., Kim, J.K., and Kim, S.H. (2002), A personalized recommender system based on web usage mining and decision tree induction. Expert Systems with Applications, vol. 23.
11. ClickTracks. Retrieved February 12, 2004.
12. Coenen, F. (2004), The LUCS-KDD Apriori-T Association Rule Mining Algorithm. Department of Computer Science, The University of Liverpool, UK.
13. Cooley, R., Mobasher, B., and Srivastava, J. (1999), Data Preparation for Mining World Wide Web Browsing Patterns. Knowledge and Information Systems, vol. 1(1).
14. Hay, B., Wets, G., and Vanhoof, K. (2003), Segmentation of visiting patterns on websites using a sequence alignment method. Journal of Retailing and Consumer Services, vol. 10.
15. Jacobs, N., Heylighen, A., and Blockeel, H. (2001), Dynamic Website Mining. Proceedings of the European Symposium on Intelligent Technologies, Hybrid Systems and their implementation on Smart Adaptive Systems, Tenerife, Spain.
16. Jenamani, M., Mohapatra, P.K.J., and Ghose, S. (2003), A stochastic model of e-customer behaviour. Electronic Commerce Research and Applications, vol. 2.
17. Mixture model implementation within the DIANA project.
18. Mobasher, B., Jain, N., Han, E., and Srivastava, J. (1996), Web Mining: Pattern discovery from World Wide Web transactions. Technical Report, University of Minnesota, Dept. of Computer Science, Minneapolis.
19. Zaki, M.J. (2002), Efficiently Mining Frequent Trees in a Forest. SIGKDD'02, Edmonton, Alberta, Canada.
20. Nanopoulos, A., and Manolopoulos, Y. (2000), Finding Generalized Path Patterns for Web Log Data Mining. J. Stuller et al. (Eds.): ADBIS-DASFAA, LNCS 1884.
21. Nanopoulos, A., and Manolopoulos, Y. (2001), Mining patterns from graph traversals. Data and Knowledge Engineering, no. 37.
22. OneStat.com. Retrieved February 12, 2004.
23. Pei, J., Han, J., Mortazavi-Asl, B., and Zhu, H. (2000), Mining Access Patterns Efficiently from Web Logs. Proceedings of the 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining.
24. Punin, J.R., Krishnamoorthy, M.S., and Zaki, M.J. (2001), LOGML: Log Markup Language for Web Usage Mining. Proceedings of the WEBKDD Workshop 2001: Mining Log Data Across All Customer TouchPoints (with SIGKDD'01), San Francisco.
25. Fielding, R., Gettys, J., Frystyk, H., Masinter, L., Leach, P., and Berners-Lee, T., Hypertext Transfer Protocol - HTTP/1.1. Network Working Group, RFC.
26. Berners-Lee, T., Fielding, R., and Frystyk, H., Hypertext Transfer Protocol - HTTP/1.0. Network Working Group, RFC.
27. Runkler, T.A., and Bezdek, J.C. (2003), Web mining with relational clustering. International Journal of Approximate Reasoning, vol. 32.
28. Smith, K.A., and Ng, A. (2003), Web page clustering using a self-organizing map of user navigation patterns. Decision Support Systems, vol. 35.
29. Spider pattern lists were verified against the services listed below. Retrieved March 17, 2004 from (list of well known robots, not up to date) and (list of spiders), and using a search engine for missing or uncertain spiders.
30. Xing, D., and Shen, J. (2004), Efficient data mining for web navigation patterns. Information and Software Technology, vol. 46.
31. Yang, Q., Li, T.I., and Wang, K. (2003), Web-log Cleaning for Constructing Sequential Classifiers. Applied Artificial Intelligence, vol. 17, iss. 5-6.
32. Yao, Y., Hamilton, H.J., and Wang, X.W. (2000), PagePrompter: An Intelligent Agent for Web Navigation Created Using Data Mining Techniques. Technical report, Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada.
33. Youssefi, A.H., Duke, D.J., Zaki, M.J., and Glinert, E.P. (2003), Towards Visual Web Mining. In Proceedings of Visual Data Mining at the IEEE International Conference on Data Mining (ICDM), Florida.
34. Luotonen, A. (1995), The Common Logfile Format; Extended Log File Format, W3C Working Draft WD-logfile.
35. GNU Software Foundation (1999), Wget.
36. Webtrends. Retrieved February 12, 2004.
APPENDIX

APPENDIX A. The uniform resource locator (URL)

Uniform resource locators (URLs) identify resources on the World Wide Web. The syntax of an HTTP URL is

    'http://' host.domain [':' port] [path ['?' query]]

where
- host.domain is the name of the web service (server)
- port is optional (the default is 80)
- path is the absolute location of the requested resource on the server (path + file name + extension with delimiter fields)
- query is a collection of parameters in the case of dynamic pages

APPENDIX B. Input file structures

1. The structure of the properties file

The properties file contains the adjustable properties of the webmining package in the form of key/value pairs, one per line. Key and value are delimited by the = character. The supported properties are described in the table below.

Supported properties of the properties file

Database properties
- JDBC_driver_name: name of the JDBC driver for the database connection (e.g., com.mysql.jdbc.driver)
- connection_name: name of the database connection (e.g., jdbc:mysql://localhost/test)
- user_name: name of the user for the specified database (e.g., TEST)
- user_password: password for the user (e.g., test)
- log_table_name: name of the access log table (e.g., cslog)
- log_users_table_name: name of the users table (e.g., users)

Properties for data handling
- access_log_path: path and file name of the (merged) access log file (e.g., c:\log.txt)

Properties for transaction filtering
- default_page_name: name of the default HTML page (e.g., index.html)
- accepted_extensions_list: path and file name of the extension list file
- spider_engines_list: path and file name of the spider list file

Properties for session identification
- time_frame_intervall: length of the time frame (in minutes) for time frame identification (e.g., 30)
- group_selector_type: type of group selector. Possible values: all, subnets, country. They refer to all sessions (all), only sessions generated by users specified by the file given by the network_range_file_name key (subnets), and sessions generated by users specified by the file given by the country_list_file_name key (country).
- network_range_file_name: path and file name of the file specifying a subnet group
- country_list_file_name: path and file name of the file specifying a country group

Properties for data integration
- mapping_table_path: path and file name of the content mapping table file
- generated_mapping_table_path: path and file name of the artificial content mapping table file to be generated

Properties for geographical statistics
- country_codes_file_name: path and file name of the country codes file containing the names and short names of most of the countries in the world

Table 26: Supported properties of the properties file
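For illustration, a minimal webmining.prop file using the keys described above could look as follows. The values (paths, database name, file names) are made-up examples and do not reproduce the configuration actually used in the project.

    JDBC_driver_name=com.mysql.jdbc.Driver
    connection_name=jdbc:mysql://localhost/webmining
    user_name=TEST
    user_password=test
    log_table_name=cslog
    log_users_table_name=users
    access_log_path=c:\log.txt
    default_page_name=index.html
    accepted_extensions_list=c:\webmining\extension.flt
    spider_engines_list=c:\webmining\spider.flt
    time_frame_intervall=30
    group_selector_type=all
    mapping_table_path=c:\webmining\mapping_table.mtd
    generated_mapping_table_path=c:\webmining\generated_mapping_table.mtd
    country_codes_file_name=c:\webmining\country_codes.txt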
2. The structure of access log files

The Apache web server of the cs.vu.nl domain uses the extended log file format [35] and writes the following fields into the log files, in order of appearance: remotehost, rfc931, authuser, date, request, status, bytes, referrer, user_agent. Log files can contain an arbitrary number of entries containing all the fields described above. Each line of the file should contain exactly one entry, and the fields are separated by one or more white spaces. The syntax of an entry is the following:

Syntax of an access log entry
- remotehost: character string
- rfc931: character string (e.g., -)
- authuser: character string (e.g., -)
- date: given in [dd/mmm/yyyy:hh:mm:ss Z] format (e.g., [30/May/2004:03:30: ])
- request: character string (e.g., "GET /~fbenmba/straatremixes/images/home.png HTTP/1.1")
- status: integer (e.g., 200)
- bytes: integer (e.g., 12193)
- referrer: character string
- user_agent: character string (e.g., Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1))
Table 27

3. The structure of the content types mapping table file

A content type mapping table contains URL / content type pairs. Three types of entries can occur in a mapping file:

Description of the file structure of the content types mapping table
- number of content types: the first valuable line should contain the number of distinct content types (n). Content types are considered integers in the range 0..n (all numbers inclusive).
- URL / content type pairs: after that, there can be an arbitrary number of URL/content type pair entries, where each line corresponds to one pair. The pair should be delimited by a white space.
- textual entries: the file may contain comment lines anywhere in the structure, provided that a hash mark (#) is the first character of the line.
Table 28
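As an illustration of this format, a small mapping table file could look as follows. The URLs and the content-type codes shown here are made-up examples (the codes follow the numbering used in the frequent-session tables, e.g. 8 for E/person/faculty), not entries of the real mapping_table.mtd file.

    # content types mapping table - illustrative entries only
    # first valuable line: number of distinct content types
    20
    # URL / content type pairs
    www.cs.vu.nl/~jdoe/index.html 8
    www.cs.vu.nl/~jdoe/publications.html 10
    www.cs.vu.nl/education/ai-course/index.html 11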
4. The structure of the extension filter list file

The extension filter list file contains extension entries for filtering out file types that are not allowed. The structure of the file is as follows:

Description of the file structure of the extension filter list file
- extension entries: each valuable line of the file should contain one file extension (without any dots or special marks, e.g., html).
- textual entries: the file may contain comment lines anywhere in the structure, provided that a hash mark (#) is the first character of the line.
Table 29

5. The structure of the spider filter pattern list file

Spider filtering is based on the spider pattern list file, which contains recognizable patterns of known spiders. The file was made by pre-examining the log files of the cs.vu.nl web server and filtering out suspicious user agents and extraordinary patterns. These patterns were tested against spider list provider pages like [29]. The file contains spider pattern entries, each in a separate line. The structure of the file is as follows:

Description of the file structure of the spider pattern list file
- spider pattern entries: each valuable line of the file should contain one spider pattern.
- textual entries: the file may contain comment lines anywhere in the structure, provided that a hash mark (#) is the first character of the line.
Table 30
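For illustration, short fragments of the two filter files could look like this. The concrete entries shown are examples picked from the ranked lists in Appendix C below; the comment lines are made up.

    # extension.flt - accepted file extensions (examples)
    html
    htm
    php
    pdf
    ps

    # spider.flt - user agent patterns of known spiders (examples)
    WGET
    SLURP
    HTTRACK
    IA_ARCHIVER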
APPENDIX C. Experimental details

1. Spider pattern list and rank

Table 31 contains the spider patterns ranked by their frequency counts during the access log file analysis.

Spider pattern list and rank (patterns in descending order of frequency; frequency counts in parentheses where available):
NET CLR, BOT, WGET, FUNWEBPRODUCTS, DIGEXT, CRAWLER, SLURP, HOTBAR, JEEVES, HTTRACK, IA_ARCHIVER, GRUB-CLIENT, YCOMP, LIBWWW-PERL, APPIE, AVSEARCH, SPIDER, TELEPORT, WEBCOPIER, WEBCOLLAGE, DVD OWNER, FREESURF, LYCOS (467), ROADRUNNER, YAHOO, SCOOTER, FLASHGET, INFOSEEK, WEBSEARCH, PITA, T-H-U-N-D-E-R-S-T-O-N-E, FLUFFY (4), NETNEWSWIRE (3), WEBDUP (2), WEBVAC (0), VIAS (0), ZYBORG (0), TEOMAAGENT (0), GULLIVER (0), ARCHITEXT (0), MERCATOR (0), ULTRASEEK (0), MANTRAAGENT (0), MOGET (0), MUSCATFERRET (0), SLEEK (0), KIT_FIREBALL (0)
Table 31: Spider pattern list and rank

2. Extension list and rank

Table 32 contains the extensions ranked by their frequency counts during the analysis of the access log files. We list here the top 100 most frequent extensions, leaving out some unknown and infrequent items.
Extension list and frequency (extensions in descending order of frequency; frequency counts in parentheses where available):
gif, jpg, html, js, png, pdf, css, php, htm, ico, pac, txt, ps, mp, php, gz, zip, doc, wrl, bmp, ppt, taz, tar, class, tgz, z, shtml, swf, misc, jpeg, xml, pl, asp, wma, mid, eps, java, tex, c, rdf, jar, dcr, cgi, wmv, xbm, mnx, hs, pas, announce, wav, xls, exe, tab, h, spf, xhtml, dvi, imp, bib (830), xsl, stdout, rss, bak, idx, smi, m, dtd, pn, avi, cur, nl, readme, wmz, old, fst, cpp, mpg, log, ref, owl, rtf, au, com, pps, fla, sgml, aux, ram, srt, mpeg, rar, bat
Table 32: Extension list and frequency

3. Geographical distribution of users visiting the cs.vu.nl domain

The table below contains the geographical distribution of the users visiting the cs.vu.nl domain during the observed period.
Geographical distribution of users
rank  TLD   count  country
1     nl           netherlands
2     net          network infrastructure
3     com          commercial
4     fr    3125   france
5     be    3058   belgium
6     de    3001   germany
7     ca    2133   canada
8     it    2038   italy
9     uk    1903   united kingdom
10    au    1852   australia
11    edu   1803   educational establishments (primarily us)
12    jp    1532   japan
13    br    1485   brazil
14    ch    963    switzerland
15    mx    935    mexico
16    pl    878    poland
17    at    635    austria
18    fi    610    finland
19    dk    553    denmark
20    se    531    sweden
21    ar    498    argentina
22    es    491    spain
23    gr    471    greece
24    hu    443    hungary
25    no    393    norway
26    us    374    united states
27    org   352    other organizations not clearly falling within the other gtlds
28    nz    341    new zealand
29    il    313    israel
30    ru    302    russian federation
31    pt    301    portugal
32    sg    275    singapore
33    mil   273    us military
34    cz    261    czech republic
35    gov   235    us government
36    tr    233    turkey
37    cl    228    chile
38    tw    226    taiwan, province of china
39    ro    210    romania
40    in    151    india
41    hr    128    croatia
42    sk    116    slovakia
43    za    114    south africa
44    ma    104    morocco
45    lt    102    lithuania
46    hk    101    hong kong
47    uy    94     uruguay
48    ie    93     ireland
49    th    90     thailand
50    ee    86     estonia
51    sa    83     saudi arabia
52    co    79     colombia
53    do    79     dominican republic
54    my    67     malaysia
55    id    65     indonesia
56    kr    60     korea, republic of
57    ua    54     ukraine
58    si    54     slovenia
59    yu    49     yugoslavia (now serbia and montenegro, iso code has changed to cs)
60    ph    49     philippines
61    is    47     iceland
62    cy    38     cyprus
63    bg    34     bulgaria
64    ve    32     venezuela
65    lu    28     luxembourg
66    mu    27     mauritius
67    int   26     null
68    cn    24     china
69    tt    22     trinidad tobago
70    lv    19     latvia
71    py    19     paraguay
72    ec    19     ecuador
73    cr    16     costa rica
74    np    14     nepal
75    pk    13     pakistan
76    pe    12     peru
77    lb    12     lebanon
78    md    11     moldova, republic of
79    nu    10     niue
80    by    9      belarus
81    fj    9      fiji
82    ni    9      nicaragua
83    ke    9      kenya
84    aw    8      aruba
85    mz    7      mozambique
86    mt    6      malta
87    jo    6      jordan
88    bn    5      brunei darussalam
89    bw    5      botswana
90    arpa  5      address and routing parameter area
91    cu    5      cuba
92    qa    5      qatar
93    na    5      namibia
94    zw    4      zimbabwe
95    aero  4      null
96    kh    4      cambodia
97    bm    4      bermuda
98    su    4      null
99    mk    3      macedonia, the former yugoslav republic of
100   kz    3      kazakhstan
101   fo    3      faroe islands
102   ir    3      iran, islamic republic of
103   tz    3      tanzania, united republic of
104   tv    3      tuvalu
105   to    3      tonga
106   tg    2      togo
107   biz   2      null
108   sv    2      el salvador
109   al    2      albania
110   uz    2      uzbekistan
111   ad    2      andorra
112   lk    2      sri lanka
113   om    2      oman
114   gl    2      greenland
115   jm    2      jamaica
116   cc    2      cocos (keeling) islands
117   mg    2      madagascar
118   sr    2      suriname
119   ba    1      bosnia and herzegovina
120   cx    1      christmas island
121   nc    1      new caledonia
122   am    1      armenia
123   sz    1      swaziland
124   pa    1      panama
125   vn    1      viet nam
126   ls    1      lesotho
127   ge    1      georgia
128   ae    1      united arab emirates
129   pg    1      papua new guinea
130   rw    1      rwanda
131   bs    1      bahamas
132   ao    1      angola
133   ky    1      cayman islands
134   sm    1      san marino
135   bt    1      bhutan
136   ug    1      uganda
137   st    1      sao tome and principe
138   zm    1      zambia
139   az    1      azerbaijan
Table 33: Geographical distribution of users
4. Global tree model of all visits at s = 1.3 support threshold
Figure 18
5. Global tree model of the nl group at s = 1.0 support threshold
Figure 19
6. Global tree model of the other group at s = 1.5 support threshold
Figure 20
7. Global tree model of the staff group at s = 1.0 support threshold
Figure 21
8. Global tree model of the student group at s = 0.8 support threshold
Figure 22
APPENDIX D. Implementation details

All the algorithms required for the tasks described in this thesis were implemented in the Java language. We used a MySQL database server for data storage and retrieval. Details on the implementation and the database are listed in the table below.

Technical details
- Implementation language: Java
- Package name: webmining (the package is database independent)
- Database: MySQL
- Note: MySQL does not support stored procedures up to version 5.0 (which, while this thesis was written, was only in beta stage and as such unstable). This makes data processing a bit more difficult and less effective, because some processing steps would ideally run directly inside the database. However, all the tasks could still be carried out with adequate efficiency.
Table 34

Our webmining package contains six major subpackages: datahandling, dataintegration, sessionidentification, patterndiscovery, stats and visualization. All the main classes belonging to these packages are listed below with brief descriptions.

1. Data preparation (cleaning, filtering, loading): the webmining.datahandling package

Main objects of the webmining.datahandling package
- DatabaseConnection: handles the database connection (based on the properties file).
- HostNameLookup: provides methods for IP address / domain name lookup.
- LoadLog: the main object which manages the cleaning and loading process.
- Log2Database: loads the prepared transactions into the database.
- LogParser: the parser object which parses the raw input log file into useful Transactions.
- Transaction: stores all information of a log entry in parsed format.
- TransactionFilter: filter object that can filter out useless or unsupported transactions.
- TransactionSimple: simplified transaction object for retrieving log information from the database.
- UpdateDBHostNames: updates the users table with host names for the corresponding remotehost fields.
- UpdateDBIPAddresses: updates the remotehost fields of the cslog table with IP addresses in case they contain host names.

Data files used by the package
- cslog.txt: text file containing log entries in raw format.
- webmining.prop: properties file that contains all the properties needed for the process (e.g., database properties, file paths and file names).
- extension.flt: contains all the file extensions of request URLs that are supported by this project.
- spider.flt: contains all known spider engine names or spider patterns for filtering out spider transactions.
Table 35
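As an illustration of the kind of work LogParser does, the sketch below splits one extended-log-format line into the nine fields listed in Appendix B.2. It assumes that the request, referrer and user_agent fields are enclosed in double quotes, as in the usual Apache combined format; the class and method names are illustrative and do not reproduce the actual implementation.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative sketch of parsing one extended-log-format entry into its nine fields.
    public class LogLineSketch {

        // remotehost rfc931 authuser [date] "request" status bytes "referrer" "user_agent"
        private static final Pattern ENTRY = Pattern.compile(
            "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"");

        // Returns the nine fields of a log line, or null for a malformed entry
        // (such entries would be discarded during the cleaning step).
        public static String[] parse(String line) {
            Matcher m = ENTRY.matcher(line);
            if (!m.find()) {
                return null;
            }
            String[] fields = new String[9];
            for (int i = 0; i < 9; i++) {
                fields[i] = m.group(i + 1); // remotehost, rfc931, authuser, date, request, status, bytes, referrer, user_agent
            }
            return fields;
        }
    }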
2. Data preparation (integration): the webmining.dataintegration package

Main objects of the webmining.dataintegration package
- GenerateAMT: the main process for generating an artificial mapping table using the GenerateArtificialMappingTable object.
- GenerateArtificialMappingTable: generates artificial mapping data from the specified access log file, with randomly added content types in the given interval.
- MappingTable: representation of the mapping table. It reads the mapping information, i.e., (URL, content type) entries, from the specified text file and stores them in an efficiently searchable HashTable.

Data files used by the package
- mapping_table.mtd: text file containing the mapping entries for the specific collection of documents (HTML pages).
- webmining.prop: the properties file which contains all the properties needed for the process (e.g., database properties, file paths and names).
Table 36

3. Data structuring: the webmining.sessionidentification package

Main objects of the webmining.sessionidentification package
- GetSessions: the main object which manages the identification process.
- Identifier: interface for all identifier objects. It specifies that an identifier should build sessions from a given set of user page access entries (an array of TransactionSimple objects).
- MFRIdentifier: an identifier object which identifies sessions by the maximal forward reference method.
- SessionFormatPrinter: provides methods for printing the identified user sessions in different output formats to the specified output file.
- TimeFrameIdentifier: an identifier object which identifies sessions using the time frame identification method.
- TransactionDBIterator (deprecated class): retrieves the page accesses of every user separately and invokes the specified identifier on the collected data, returning the identified sessions. (It is much slower than the memory iterator, so this class is no longer used.)
- TransactionMemoryIterator: retrieves all the page accesses (rather, their content types) of every user into memory and invokes the specified identifier on the collected data, returning the identified sessions.

Data files used by the package
- webmining.prop: the properties file which contains all the properties needed for the processes (e.g., database properties, file paths and names).
Table 37
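The following minimal sketch illustrates time-frame based session identification of the kind TimeFrameIdentifier performs: a user's page accesses, ordered by time, are split into sessions whenever the gap between two consecutive requests exceeds the configured time frame (e.g. 30 minutes, as set by time_frame_intervall). The class and field names are illustrative, and the real implementation may define the time frame differently.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch of time-frame based session identification for a single user.
    public class TimeFrameSketch {

        // One page access: content-type code plus request time in milliseconds.
        public static class Access {
            final int contentType;
            final long timestamp;
            public Access(int contentType, long timestamp) {
                this.contentType = contentType;
                this.timestamp = timestamp;
            }
        }

        // Splits the time-ordered accesses of one user into sessions of content-type codes.
        public static List<List<Integer>> identify(List<Access> accesses, long frameMinutes) {
            long frameMillis = frameMinutes * 60 * 1000;
            List<List<Integer>> sessions = new ArrayList<>();
            List<Integer> current = new ArrayList<>();
            long previous = Long.MIN_VALUE;
            for (Access a : accesses) {
                if (!current.isEmpty() && a.timestamp - previous > frameMillis) {
                    sessions.add(current);            // gap too large: close the running session
                    current = new ArrayList<>();
                }
                current.add(a.contentType);
                previous = a.timestamp;
            }
            if (!current.isEmpty()) {
                sessions.add(current);                // close the last session
            }
            return sessions;
        }
    }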
4. Profile mining models: the webmining.patterndiscovery package

This package contains two subpackages for the association rules mining (assoc) and global tree model (gtm) implementations.

Main objects of the webmining.patterndiscovery.assoc package
Note that the LUCS-KDD Apriori-T Association Rule Mining Algorithm implemented by Coenen, F. (2004) [12] was put into the webmining package structure without any modification. The following class descriptions are mainly taken from the documentation of the program.
- AprioriTapp: fundamental Apriori-T application.
- AprioriTsortedApp: Apriori-T application with the input data preprocessed so that it is ordered according to the frequency of single items; this serves to reduce the computation time.
- AprioriTsortedPrunedApp: Apriori-T application with the data ordered according to the frequency of single items and the columns representing unsupported 1-itemsets removed; again, this serves to enhance computational efficiency.
- AssocRuleMining: set of general ARM utility methods to allow: (i) data input and input error checking, (ii) data preprocessing, (iii) manipulation of records (e.g., operations such as subset, member, union etc.), and (iv) data and parameter output.
- RuleList: set of methods that allow the creation and manipulation (e.g., ordering) of a list of ARs.
- TotalSupportTree: methods to implement the "Apriori-T" algorithm using the "Total support" tree data structure (T-tree).
- TtreeNode: methods concerned with the structure of T-tree nodes. Arrays of these structures are used to store nodes at the same level of any sub-branch of the T-tree. Note that this is a separate class from the other T-tree classes, which are arranged in a class hierarchy.

Data files used by the package
- input_session.txt: plain text file containing the input user sessions in a special format. Page content types within a session are in ascending order, with redundant pages removed.
Table 38

Main objects of the webmining.patterndiscovery.gtm package
- GlobalTreeModel: provides the representation of the tree model.
- LoadGTM: initializes the tree model and loads all the user sessions into it. Besides that, it is also responsible for managing tree visualization.
- SessionTree: a tree structure containing all the sessions for a specific starting page. The whole model consists of as many SessionTrees as there are distinct content types.
- TreeNode: contains the information of one node, such as parent and children references, the content type and the frequency of the node, etc.

Data files used by the package
- input_session.txt: plain text file containing the input user sessions.
Table 39
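To make the roles of SessionTree and TreeNode more tangible, the following minimal sketch shows how sessions (sequences of content-type codes) can be loaded into one tree per starting content type while counting node frequencies. The class and field names here are illustrative; the actual GlobalTreeModel, SessionTree and TreeNode classes differ in their details (for example, they also keep parent references and support visualization).

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative sketch of loading sessions into a global tree model.
    public class SessionTreeSketch {

        static class Node {
            final int contentType;
            int frequency;                                   // number of sessions passing through this node
            final Map<Integer, Node> children = new HashMap<>();
            Node(int contentType) { this.contentType = contentType; }
        }

        // One root node per distinct starting content type.
        private final Map<Integer, Node> roots = new HashMap<>();

        // Adds one session (a sequence of content-type codes) to the model.
        public void addSession(List<Integer> session) {
            if (session.isEmpty()) return;
            Node node = roots.computeIfAbsent(session.get(0), Node::new);
            node.frequency++;
            for (int i = 1; i < session.size(); i++) {
                node = node.children.computeIfAbsent(session.get(i), Node::new);
                node.frequency++;
            }
        }
    }

Frequent navigational paths can then be read off by traversing each root and keeping only the nodes whose frequency exceeds the chosen support threshold.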
APPENDIX E. Content of the CD-ROM

The CD-ROM accompanying this master thesis contains all the input and data files as well as all the important results produced during the project. It also contains the source and binary code of the whole webmining package and this master thesis in electronic format. To make browsing easier, we made an HTML user interface for the provided content; it is accessible from the root of the CD by opening the index.html file.