A Comparison Framework of Similarity Metrics Used for Web Access Log Analysis




Yusuf Yaslan and Zehra Cataltepe
Istanbul Technical University, Computer Engineering Department, Maslak 34469 Istanbul, Turkey
{yyaslan,cataltepe}@itu.edu.tr

Abstract. In this paper, different types of web session similarity metrics are compared and combined for better web session clustering. Syntactic and co-occurrence information are used for similarity calculation. Syntactic information on a web page includes the place of the page in the directory hierarchy. Co-occurrence information is the number of times two web pages occur in the same sessions. Vector space representations of sessions, together with the cosine, Pearson and extended Jaccard similarities between them, are also used as similarity metrics. Clustering quality, given by the internal and external cluster similarity, is used as the goodness measure of a similarity metric. First, clustering quality is evaluated when different similarity metrics (Jaccard, Pearson, cosine, syntactic, co-occurrence) are used. Similarity calculation is performed for different numbers of clusters, from 1 to the number of sessions. For reasonable cluster numbers (15-100), syntactic similarity results in the best internal cluster similarity, followed by co-occurrence similarity. Finally, the best two methods and the others are used for similarity combination. It is found that linear combination of similarity metrics decreases the external cluster similarity. The similarity metrics are also evaluated using Hubert's Γ statistic. It is found that syntactic similarity alone gives the best results according to Hubert's Γ statistic, followed by the linear combination of syntactic and co-occurrence similarity.

1 Introduction

Every day, both the number of web pages on the web and the data produced by the people who browse those pages keep increasing. The vastness and usage opportunities of these data have given rise to web usage mining techniques [1]. Web usage mining is the processing of web usage data, based on client-server transactions on one or more web servers, for the purpose of customization for web users. Hence it can be used to improve web cache performance, personalize the browsing experience and improve search engines [3]. Previously, in [4, 5], web session logs were used to better understand user behavior in order to design better environments and applications. The transactions are grouped by sessions, where a session consists of the transactions of a web user within a specific time window.

In order to customize for users, similarities between web sessions are computed. Each session is then assigned to a group by using the similarity values together with different clustering techniques, such as graph partitioning. In this paper, syntactic, co-occurrence and vector space based features are obtained. Syntactic similarity gives information on the web site directory hierarchy. Co-occurrence information is the number of times web pages are visited together in sessions. Visited web page information is used for constructing the session vectors. Cosine, extended Jaccard and Pearson correlation similarity metrics are used for obtaining vector space based similarities. Finally, linear combinations of similarities are evaluated.

In [4, 5], multi-modal feature vectors are generated using the content, usage and topology of the analyzed web site. The content modality is the set of unique words that appear in the web page. The usage modality is obtained by tokenizing the URLs of the site using the / delimiter. The other modalities are obtained by using the inlinks and outlinks of the web page. Outlinks of a page represent the pages that are reachable from the page; similarly, inlinks of a page represent the pages that point to it. The topology of a site is represented using an adjacency matrix, where the rows correspond to the outlinks and the columns correspond to the inlinks. Previously, in [2], cosine, Pearson correlation and extended Jaccard similarity metrics were used in conjunction with different clustering methods (random, hyper-graph partitioning, self organizing feature maps) for web page clustering. It was shown that cosine and extended Jaccard similarities give the best results. In this paper we cluster not the web pages but the web sessions. In [2], similarities between TF.IDF vectors that correspond to occurrences of words in a document are used. In this study, we use similarities between sessions, i.e. each component of the vectors we use corresponds to a web page, as opposed to a word in a web page. Syntactic similarity is calculated on the tokens of the URLs visited within a session. Previously, in [6], syntactic similarities of web pages were combined using cosine similarity in order to obtain dissimilarity values for web log clustering. In this paper, we also combine different similarity metrics together with syntactic similarity to obtain similarity values between web sessions.

The rest of the paper is organized as follows: In Section 2, the way we represent sessions, the syntactic and co-occurrence similarities between them, and the other similarity metrics are introduced. In Section 3, details of the methods for combining different similarity metrics are given. In Sections 4 and 5, we present the experimental results and the conclusions.

2 Session Similarity Metrics

2.1 Session Representation Using Vector Space Model

Each session consists of a sequence of visited web pages. In our vector space representation, we represent a session with a vector whose size equals the total number of pages on the web site.

For example, let the web site consist of a total of 5 web pages, represented using letters A through E. If a session consists of A → B → D, then the vector space representation for the session could be (1,1,0,1,0). Different types of path weighting schemes can be applied [5]:

- Uniform: Each page has an equal weight in the session, e.g. A → B → D = (1,1,0,1,0).
- TF.IDF (Term Frequency Inverse Document Frequency): Each page receives a TF.IDF weighting. The number of occurrences of a specific web page within a session is divided by the total number of occurrences of the web page in the logs.
- Linear Order (or Position): The web pages are weighted using the order of page accesses in the session, e.g. A → B → D = (1,2,0,3,0).
- View Time: The web pages are weighted by the time spent on each page access in the session, e.g. A(10s) → B(20s) → D(15s) = (10,20,0,15,0).
- Various Combined Weightings: Each page is weighted with various combinations of the TF.IDF, Linear Order and View Time path weightings. For Linear Order + View Time, e.g. A(10s) → B(20s) → D(15s) = (10,40,0,45,0).

In this study, we have used the linear order and view time schemes for web session similarity calculation (see the sketch at the end of this subsection). We represent a session i using a vector s(i):

s(i) = (s(i,1), s(i,2), s(i,3), ..., s(i,N_i))    (1)

s(i) is an N_i dimensional vector of web pages s(i,j), where N_i is the number of pages visited during the session. The first method uses a vector space model similar to the one in document categorization. If page k has been visited during session i, then the clickstream vector Θ(i) that contains the page visit information can be computed as:

Θ(i,k) = number of visits to page p_k in session i, if there exists k' such that s(i,k') = p_k
Θ(i,k) = 0, otherwise    (2)

Θ(i) is an N dimensional vector, where N is the total number of pages on the web site we are interested in, and p_k is the kth web page. The number of visits to a web page is represented in the vector model. The cosine, Pearson correlation and extended Jaccard similarities, which we define below, all work on the vector space model.
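To make the weighting schemes concrete, here is a minimal Python sketch (ours, not the authors' code) that builds the clickstream vector Θ(i) under the uniform, linear order, view time and combined schemes. The session format, a list of (page index, view time) pairs, is an assumption for illustration.

```python
import numpy as np

# Assumed session format: list of (page_index, view_time_seconds) pairs,
# e.g. the session A -> B -> D on a 5-page site (pages indexed 0..4).
session = [(0, 10), (1, 20), (3, 15)]
N = 5  # total number of pages on the site

def build_vector(session, n_pages, scheme="uniform"):
    """Build the clickstream vector Theta for one session (a sketch)."""
    theta = np.zeros(n_pages)
    for order, (page, view_time) in enumerate(session, start=1):
        if scheme == "uniform":        # each visited page gets weight 1
            theta[page] += 1
        elif scheme == "linear_order": # weight = position of the access
            theta[page] += order
        elif scheme == "view_time":    # weight = seconds spent on the page
            theta[page] += view_time
        elif scheme == "order_x_time": # combined: linear order * view time
            theta[page] += order * view_time
    return theta

print(build_vector(session, N, "uniform"))       # [1. 1. 0. 1. 0.]
print(build_vector(session, N, "linear_order"))  # [1. 2. 0. 3. 0.]
print(build_vector(session, N, "view_time"))     # [10. 20. 0. 15. 0.]
print(build_vector(session, N, "order_x_time"))  # [10. 40. 0. 45. 0.]
```

The printed vectors reproduce the examples given for the weighting schemes above.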

2.2 Cosine Similarity

Cosine similarity captures a scale-invariant notion of similarity; it is defined in terms of the angle between the two vectors as follows [2]:

Sim_cos(s(i), s(j)) = Θ(i)^T Θ(j) / sqrt(Θ(i)^T Θ(i) · Θ(j)^T Θ(j))    (3)

2.3 Pearson Correlation Similarity

Pearson correlation reflects the degree of linear correlation between two sessions:

Sim_pear(s(i), s(j)) = (1/2) · ( (Θ(i) − Θ̄(i))^T (Θ(j) − Θ̄(j)) / ( ||Θ(i) − Θ̄(i)|| · ||Θ(j) − Θ̄(j)|| ) + 1 )    (4)

where Θ̄ refers to the mean value of the feature vector.

2.4 Extended Jaccard Similarity

Extended Jaccard similarity is a length-dependent measure of similarity [2]:

Sim_jaccard(s(i), s(j)) = Θ(i)^T Θ(j) / ( Θ(i)^T Θ(i) + Θ(j)^T Θ(j) − Θ(i)^T Θ(j) )    (5)

2.5 Syntactic Similarity

Syntactic similarity uses the dependencies between the web page directories. According to [6], the syntactic similarity between the kth and lth web pages of sessions s(i) and s(j) is calculated as follows:

Sim_u(s(i,k), s(j,l)) = min(1, |p(i,k) ∩ p(j,l)| / max(1, max(|p(i,k)|, |p(j,l)|) − 1))    (6)

where p(i,k) denotes the path traversed from the root node to the node corresponding to the kth URL, and |p(i,k)| denotes the length of this path. This similarity measures the amount of overlap between the paths of the two visited web pages. The similarity between sessions s(i) and s(j) is calculated as:

Sim_synt(s(i), s(j)) = Σ_{k=1..N} Σ_{l=1..N} Θ(i,k) Θ(j,l) Sim_u(s(i,k), s(j,l)) / ( Σ_{k=1..N} Θ(i,k) · Σ_{l=1..N} Θ(j,l) )    (7)

Note that even if two sessions are identical, the similarity might be small depending on the number of pages accessed, since cross terms between different pages contribute low values.

2.6 Co-Occurrence Similarity

Co-occurrence similarity is derived from document clustering [8]. The number of times two pages occur together is normalized by the minimum number of visits of the two pages, as follows:

Sim_o(s(i,k), s(j,l)) = #(p(i,k) ∧ p(j,l)) / min(#(p(i,k)), #(p(j,l)))    (8)

where #(p(i,k) ∧ p(j,l)) refers to the number of times the kth and lth web pages are visited together, and #(p(i,k)) refers to the total occurrence count of the kth web page. Similarly to the syntactic similarity, the co-occurrence similarity between sessions s(i) and s(j) is calculated over all pages in the two sessions as follows:

Sim_cooc(s(i), s(j)) = Σ_{k=1..N} Σ_{l=1..N} Θ(i,k) Θ(j,l) Sim_o(s(i,k), s(j,l)) / ( Σ_{k=1..N} Θ(i,k) · Σ_{l=1..N} Θ(j,l) )    (9)
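The three vector space metrics of Sections 2.2-2.4 are each a one-liner given the clickstream vectors. The following sketch is our reading of Equations (3)-(5), not the authors' implementation:

```python
import numpy as np

def sim_cos(a, b):
    """Cosine similarity, Eq. (3)."""
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

def sim_pear(a, b):
    """Pearson correlation similarity mapped into [0, 1], Eq. (4)."""
    ac, bc = a - a.mean(), b - b.mean()
    return 0.5 * ((ac @ bc) / np.sqrt((ac @ ac) * (bc @ bc)) + 1.0)

def sim_jaccard(a, b):
    """Extended Jaccard similarity, Eq. (5)."""
    return (a @ b) / (a @ a + b @ b - a @ b)

a = np.array([1.0, 2.0, 0.0, 3.0, 0.0])  # linear-order vector for A -> B -> D
b = np.array([1.0, 0.0, 2.0, 3.0, 0.0])  # linear-order vector for A -> C -> D
print(sim_cos(a, b), sim_pear(a, b), sim_jaccard(a, b))
```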

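The page-level similarities of Equations (6) and (8), and the session-level aggregation shared by Equations (7) and (9), could be sketched as follows. The tokenization of URL paths on '/' and the session format (sets of page indices) are our assumptions, not details from the paper:

```python
import numpy as np
from urllib.parse import urlparse

def sim_u(url_k, url_l):
    """Syntactic URL similarity, Eq. (6): length of the common path prefix,
    normalized by the longer path. Tokenizing on '/' is our assumption."""
    p_k = [t for t in urlparse(url_k).path.split("/") if t]
    p_l = [t for t in urlparse(url_l).path.split("/") if t]
    overlap = 0
    for tk, tl in zip(p_k, p_l):  # count the common path prefix
        if tk != tl:
            break
        overlap += 1
    return min(1.0, overlap / max(1, max(len(p_k), len(p_l)) - 1))

def cooccurrence_matrix(sessions, n_pages):
    """Page-level co-occurrence similarity, Eq. (8). `sessions` is a list of
    sets of page indices (an assumed format)."""
    count = np.zeros(n_pages)            # #(p_k): sessions containing page k
    both = np.zeros((n_pages, n_pages))  # #(p_k ^ p_l): sessions with both
    for s in sessions:
        for k in s:
            count[k] += 1
            for l in s:
                both[k, l] += 1
    denom = np.minimum.outer(count, count)
    return np.divide(both, denom, out=np.zeros_like(both), where=denom > 0)

def sim_session(theta_i, theta_j, page_sim):
    """Session-level aggregation, Eqs. (7) and (9): page-level similarities
    weighted by the page visit counts of the two sessions."""
    return (theta_i @ page_sim @ theta_j) / (theta_i.sum() * theta_j.sum())

print(sim_u("/news/sports/a.html", "/news/finance/b.html"))  # 0.5
```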
3 Combination of Similarity Metrics

In our study, we have combined the cosine, Pearson correlation and extended Jaccard similarity values with the syntactic similarity in order to obtain a similarity between two sessions. This combination is based on the intuition that the vector space representation and the URLs of the web pages carry complementary information about the visited pages.

3.1 Linear Combination

We have used linear combination in two settings: combining the best two similarity metrics, and combining all similarity metrics. Our aim is to gain balanced information from the vector space model and from the URL based syntactic information. The linear combination of syntactic and co-occurrence similarity is obtained as follows:

Sim_com = α · Sim_synt + β · Sim_cooc    (10)

where α + β = 1. In our experiments we set (α, β) = (0.5, 0.5). The second combination is obtained using all similarity values as follows:

Sim_com = α · Sim_synt + β · Sim_cooc + γ · Sim_cos + η · Sim_pear + δ · Sim_jaccard    (11)

where α + β + γ + η + δ = 1. In our experiments we set all five weights to 0.2.
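Given precomputed session-by-session similarity matrices, Equations (10) and (11) reduce to a weighted sum. A minimal sketch, where the matrix names are hypothetical placeholders for precomputed results:

```python
import numpy as np

def combine(sim_matrices, weights):
    """Linear combination of session similarity matrices, Eqs. (10)-(11).
    A sketch: the matrices are assumed precomputed and equally shaped."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * s for w, s in zip(weights, sim_matrices))

# Eq. (10) with (alpha, beta) = (0.5, 0.5); sim_synt and sim_cooc are
# hypothetical precomputed n_sessions x n_sessions matrices:
# sim_combined = combine([sim_synt, sim_cooc], [0.5, 0.5])
```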

4 Experimental Results

In our experiments we used a real dataset obtained from the web logs of a Turkish web portal. The portal provides e-mail and news as well as a web search service. Different types of domains are available, such as news, sports, cinema, finance, TV, fortune telling and real estate. More than 28000 web sessions were obtained from the web logs. In order to use the web server logs, data cleaning is first applied; Polliwog 0.6 (http://polliwog.sourceforge.net) is used for web session log data cleaning. In our experiments, we only use sessions whose lengths are greater than or equal to 7 pages. After log cleaning we obtained 1321 web sessions for clustering.

In order to compare the different similarity metrics and their combinations, we have used the Cluto clustering program (http://www-users.cs.umn.edu/~karypis/cluto). The output of the program is the internal cluster similarity, the external cluster similarity and their standard deviations. In order to compare the experimental results fairly, we have obtained plots of mean internal cluster similarity versus mean external cluster similarity. The means are computed from the internal and external similarity of each cluster, weighted by the number of data points in the cluster. The experiments are performed for ten different cluster sizes: 1, 5, 10, 15, 20, 50, 100, 500, 1000 and 1321.
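As an aside, the weighted means described above can be computed in a couple of lines. In the sketch below, the per-cluster arrays are assumed to have been parsed from Cluto's report, and the values shown are made up for illustration:

```python
import numpy as np

# Per-cluster statistics as reported by Cluto (values here are made up):
isim = np.array([0.42, 0.35, 0.51])   # internal similarity of each cluster
esim = np.array([0.12, 0.09, 0.15])   # external similarity of each cluster
size = np.array([400, 600, 321])      # number of sessions in each cluster

mean_isim = np.average(isim, weights=size)  # mean internal cluster similarity
mean_esim = np.average(esim, weights=size)  # mean external cluster similarity
print(mean_isim, mean_esim)
```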

In Figure 1, the mean internal cluster similarity versus mean external cluster similarity values for the syntactic, Jaccard, Pearson, cosine and co-occurrence similarities are given. Cluster numbers are in ascending order of internal similarity.

Fig. 1. Mean internal cluster similarity versus mean external cluster similarity for 1-1321 clusters obtained by using the different similarity metrics (series: jaccard, syntactic, pearson, cosine, co-occurrence).

4.1 Combining Similarity Metrics

In Figure 2, the mean internal cluster similarity versus mean external cluster similarity values are given for i) syntactic similarity, ii) the linear combination of syntactic, Jaccard, Pearson, cosine and co-occurrence similarity ("linear comb. all sim." in the figure) and iii) the linear combination of syntactic and co-occurrence similarity ("linear comb. of syntactic co-occurrence" in the figure). Cluster numbers are again in ascending order of internal similarity. As can be seen in Figure 2, the linear combination of all similarity metrics decreases the external cluster similarity when the number of clusters is greater than 15.

Fig. 2. Mean internal cluster similarity versus mean external cluster similarity for 1-1321 clusters obtained by using linear combinations of similarity metrics.

4.2 Clustering Evaluation Using Hubert's Γ Statistic

The goodness of a clustering can be measured using the correlation between the similarity matrix and an ideal version of the similarity matrix based on the clustering labels [9]. An ideal clustering can be defined as a clustering where the internal similarities between objects in the clusters are 1 and the external similarities are 0. An ideal similarity matrix for the clusters can be obtained by sorting the rows and columns of the similarity matrix; this ideal similarity matrix has a block diagonal structure, so that intra-cluster entries are 1 and all other entries are 0. Hubert's Γ statistic between the similarity matrix S_sim and the ideal similarity matrix S_ideal can be formulated as follows [10]:

Γ = (1/M) Σ_{i=1..N−1} Σ_{j=i+1..N} S_sim(i,j) · S_ideal(i,j)    (12)

where N is the dimension of the similarity matrix and M is the number of pairwise comparisons:

M = N(N − 1)/2    (13)

In Figure 3, the number of clusters versus Hubert's Γ statistic is shown for the different similarity metrics. As shown, the best Γ value is obtained for syntactic similarity, followed by the linear combination of syntactic and co-occurrence similarity. Note that if a web portal is hierarchically well designed, syntactic similarity forces objects to have higher similarity values; hence the Γ value for syntactic similarity surpasses the others. However, the combination of similarity metrics not only decreases the external cluster similarity but also increases the Γ value.
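Equation (12) is straightforward to compute. The sketch below is our reading of the formula, with the ideal matrix built directly from the cluster labels rather than by sorting the similarity matrix:

```python
import numpy as np

def hubert_gamma(s_sim, labels):
    """Hubert's Gamma statistic, Eqs. (12)-(13). The ideal matrix is 1 for
    pairs of objects in the same cluster and 0 otherwise."""
    labels = np.asarray(labels)
    s_ideal = (labels[:, None] == labels[None, :]).astype(float)
    n = len(labels)
    iu = np.triu_indices(n, k=1)   # pairs with i < j
    m = n * (n - 1) / 2            # number of pairwise comparisons, Eq. (13)
    return float((s_sim[iu] * s_ideal[iu]).sum() / m)

# Made-up example: two clusters of two sessions each
s_sim = np.array([[1.0, 0.9, 0.1, 0.2],
                  [0.9, 1.0, 0.2, 0.1],
                  [0.1, 0.2, 1.0, 0.8],
                  [0.2, 0.1, 0.8, 1.0]])
print(hubert_gamma(s_sim, [0, 0, 1, 1]))  # (0.9 + 0.8) / 6 = 0.283...
```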

Fig. 3. Number of clusters versus Hubert's Γ statistic for the different similarity metrics (series: jaccard, syntactic, pearson, cosine, co-occurrence, linear comb. all sim., linear comb. of syntactic co-occurrence).

5 Conclusions

In this paper, different similarity metrics and their linear combinations are evaluated according to internal and external similarities and Hubert's Γ statistic. We have evaluated different α and β values for the linear combination; however, most of them gave similar results, hence we give equal weight to all similarities (i.e. α = β = 0.5). Experimental results have shown that syntactic similarity gives the best Γ statistic because it forces similarities to approach their maximum. However, the combination of the syntactic and co-occurrence similarity metrics decreases the external similarity between clusters, increases the Γ statistic and gives satisfactory internal similarity results; thus it can be used for web session clustering with different numbers of clusters. Note that for syntactic similarity, an increase in the cluster number increases the external cluster similarity.

Acknowledgements. This work was partially supported by the Turkish Scientific and Technical Research Foundation (TUBITAK) EEEAG Project number 105E162. The authors would like to thank Murat Goksedef and Nildem Demir of ITU for providing the user session data, and Sule Gunduz-Oguducu and Sima Etaner-Uyar for useful discussions.

References

1. Pierrakos, D., Paliouras, G., Papatheodorou, C., Spyropoulos, C.D.: Web Usage Mining as a Tool for Personalization: A Survey. User Modeling and User-Adapted Interaction 13 (2003) 311-372.

2. Strehl, A., Ghosh, J., Mooney, R.: Impact of Similarity Measures on Web-page Clustering. In: Workshop of Artificial Intelligence for Web Search (2000).
3. Deshpande, M., Karypis, G.: Selective Markov Models for Predicting Web Page Accesses. ACM Transactions on Internet Technology 4(2) (2004) 163-184.
4. Heer, J., Chi, E.: Identification of Web User Traffic Composition using Multi-Modal Clustering and Information Scent. In: Proc. of the Workshop on Web Mining, SIAM Conference on Data Mining, Chicago (2001) 51-58.
5. Heer, J., Chi, E.: Mining the Structure of User Activity using Cluster Stability. In: Proc. of the Workshop on Web Analytics, SIAM Conference on Data Mining, Arlington, VA (2002).
6. Joshi, A., Krishnapuram, R.: On Mining Web Access Logs. In: Proc. of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (2000) 63-69.
7. Nasraoui, O., Krishnapuram, R.: An Evolutionary Approach to Mining Robust Multi-Resolution Web Profiles and Context Sensitive URL Associations. International Journal of Computational Intelligence and Applications 2(3) (2002) 1-10.
8. Pal, S., Talwar, V., Mitra, P.: Web Mining in Soft Computing Framework: Relevance, State of the Art and Future Directions. IEEE Transactions on Neural Networks 13(5) (2002) 1163-1177.
9. Tan, P.N., Steinbach, M., Kumar, V.: Introduction to Data Mining. Addison-Wesley (2005).
10. Theodoridis, S., Koutroumbas, K.: Pattern Recognition. Academic Press (1999).