The Development of Web Log Mining Based on Improve-K-Means Clustering Analysis




TingZhong Wang*
College of Information Technology, Luoyang Normal University, Luoyang, 471022, China
wangtingzhong2@sina.cn

Abstract. The main objective of Web log mining is to extract interesting patterns from Web access records. Web log mining has been successfully applied to improving personalized recommendation systems and to business intelligence. This paper presents the development of Web log mining based on improved K-Means clustering analysis. The K-Means clustering algorithm is analyzed, a K-Means clustering algorithm based on a validity index (the "effective index") is proposed and verified by experiment, and a method that automatically selects the initial cluster centers is proposed; this selection method reduces the number of outliers (isolated points) and improves the clustering results.

Keywords: web log mining, K-Means clustering, isolated point clustering.

1 Introduction

Web log mining, also known as Web usage mining, applies data mining technology to the large volume of usage data (user access records) and other relevant data collected by a site in order to obtain valuable knowledge about how the site is accessed. Its main objective is to extract interesting patterns from Web access records. Web log mining is mainly used in e-commerce: by analyzing and exploring the regularities in Web log records, it helps to identify potential customers, to enhance the quality of Internet information services for end users, and to improve the performance and structure of the Web server system. The techniques and tools currently studied in Web usage mining can be divided into two categories: pattern discovery and pattern analysis. Web log mining has two main research directions: tracking user access patterns and personalized use of the recorded tracks. Tracking user access patterns means analyzing usage records to understand users' access patterns and tendencies in order to improve the organizational structure of the site [1].

Analyzing these data therefore helps to understand user behavior, to improve the site structure, and to provide users with personalized services. The most common application of Web access mining is Web log mining, in which the server's log files are mined to derive user access patterns. This article is based on Web log mining for personalized recommendation and presents the development of Web log mining based on improved K-Means clustering analysis. Because the K-Means clustering algorithm is used here to cluster users, it is described in detail below.

* About the author: TingZhong Wang (born July 1973), male, Han, Master's degree from Henan University of Science and Technology. Research areas: web mining, data mining, clustering.

D. Jin and S. Lin (Eds.): Advances in CSIE, Vol. 2, AISC 169, pp. 613–618. springerlink.com © Springer-Verlag Berlin Heidelberg 2012

2 K-Means Clustering Algorithm and Improvement

Cluster analysis, which is used to discover the distribution and patterns of data, is an important research direction in data mining. The clustering problem can be described as follows: a collection of data points is divided into classes (called clusters) so that the points within each cluster are as similar as possible and the points in different clusters are as dissimilar as possible. Web logs can be clustered in two ways: user clustering and page clustering. User clustering groups user sessions according to the users' access actions, looking for users with similar behavior patterns. The results of user clustering can serve as the pattern library for the recommendation module of a smart Web site. For example, suppose that the Web server's analysis determines that user A and user B belong to the same group, and that the user profile of the group is {a.html, b.html, c.html}. If user A has visited a.html, the real-time recommendation module of the smart Web site recommends the two pages b.html and c.html to user A [2]. If user B has visited b.html, the recommendation module recommends a.html and c.html to user B, that is, Equation 1:

$$\beta_n = \kappa r \cos(\varphi - \varphi_n)\sin\theta \tag{1}$$

K-means clustering is a common, widely used partition-based clustering method. A clustering method optimizes an objective criterion for the classification (often referred to as the similarity function, for example a distance or a similarity coefficient). In this article we describe similarity by the distance between data objects, which is relatively simple and commonly used: the greater the distance, the smaller the similarity, and vice versa.
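The two ideas in this section, distance as an inverse measure of similarity and recommendation from a group profile, can be sketched in a few lines of Python. This is only an illustration: the function names `euclidean` and `recommend` are assumptions introduced here, and Euclidean distance is one possible choice of distance, which the paper does not fix.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors.
    A larger distance is read as a smaller similarity."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recommend(group_profile, visited):
    """Pages in the group's profile that the user has not visited yet.
    (Hypothetical helper mirroring the a.html/b.html/c.html example.)"""
    return sorted(set(group_profile) - set(visited))

profile = {"a.html", "b.html", "c.html"}
print(recommend(profile, {"a.html"}))  # ['b.html', 'c.html'] for user A
print(recommend(profile, {"b.html"}))  # ['a.html', 'c.html'] for user B
```

Representing the profile and the visit history as sets keeps the recommendation a plain set difference, which matches the example in the text.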
Its core idea is to cluster the data objects iteratively so that the objective function is minimized and the resulting clusters are as compact and as well separated as possible. The iterative relocation process is repeated until the objective function is minimized, that is, until the clusters no longer change. The objective function (error function) is generally the mean squared error, as in Equation 2:

$$E = \sum_{i=1}^{k} \sum_{l \in C_i} \lVert l - w_i \rVert^2 \tag{2}$$

where $C_i$ is the $i$-th cluster and $w_i$ is its center.

In general it is very difficult to determine the clustering parameter k in advance, so k should be obtained from the data set and the clustering criterion. Ray and Turi proposed a validity index based on the intra-cluster distance and the inter-cluster distance and applied it to image processing. The validity index is defined by (3), (4) and (5):

$$\mathrm{Validity}(k) = \frac{\mathrm{Intra}(k)}{\mathrm{Inter}(k)} \tag{3}$$

$$\mathrm{Intra}(k) = \frac{1}{N} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - z_i \rVert^2 \tag{4}$$

$$\mathrm{Inter}(k) = \min_{i \neq j} \lVert z_i - z_j \rVert^2 \tag{5}$$

where $N$ is the number of objects and $z_i$ is the center of cluster $C_i$.

This article combines the validity index with the K-Means clustering algorithm and proposes a K-Means clustering algorithm based on the validity index. The algorithm does not require the user to determine the clustering parameter k in advance; k is determined automatically, although an upper limit Kmax on the number of clusters is required. Under normal circumstances the number of clusters is much smaller than the number of objects (k << n). The algorithm is described as follows.

Idea: combine the validity index with the K-Means clustering algorithm; the validity index is computed from the cluster means and the clustering result.

Input: a data set of n objects, each with m attributes.

Output: the number of clusters k and the set of k clusters that minimize the validity index.

i. For k = 2 to Kmax, increasing k by a chosen step:
ii. randomly select k objects as the initial cluster centers $c_1(1), c_2(1), \ldots, c_k(1)$;
iii. reallocate each object to the cluster whose center is closest to it;
iv. update the cluster means, computing the mean of the objects in each cluster as in Equation 6:

$$c_i(t+1) = \frac{1}{N_i} \sum_{x \in C_i(t)} x \tag{6}$$

where $i = 1, 2, \ldots, k$, $t$ is the iteration number, and $N_i$ is the number of objects in $C_i(t)$;
v. repeat steps iii and iv until the cluster centers no longer change, that is, until $c_i(t+1) = c_i(t)$ for all $i = 1, 2, \ldots, k$; then go to the next step;
vi. compute the validity index Validity(k) for k clusters according to formulas (3)-(5);
vii. compare Validity(k) with the previous value Validity(k-1) and retain the k that gives the smaller validity value;
viii. end of the algorithm: output the most effective number of clusters k, the k cluster centers, and the clusters C1, C2, C3, ..., Ck.
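The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: objects are tuples of numbers, distance is squared Euclidean distance, an empty cluster simply keeps its previous center, and the validity value is treated as infinite when two centers coincide. The names `kmeans`, `validity` and `best_k` and the parameters `step`, `seed` and `max_iter` are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, rng, max_iter=100):
    """Steps ii-v: random initial centers, then reassign and update
    until the centers stop changing (or max_iter is reached)."""
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Equation (6): each new center is the mean of its cluster.
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

def validity(points, centers, clusters):
    """Equations (3)-(5): Intra(k) / Inter(k)."""
    intra = sum(sq_dist(p, centers[i])
                for i, cl in enumerate(clusters) for p in cl) / len(points)
    inter = min(sq_dist(centers[i], centers[j])
                for i in range(len(centers))
                for j in range(i + 1, len(centers)))
    return intra / inter if inter > 0 else float("inf")

def best_k(points, k_max, step=1, seed=0):
    """Steps i, vi-viii: try k = 2..k_max and keep the smallest validity."""
    rng = random.Random(seed)
    best = None
    for k in range(2, k_max + 1, step):
        centers, clusters = kmeans(points, k, rng)
        v = validity(points, centers, clusters)
        if best is None or v < best[0]:
            best = (v, k, centers, clusters)
    return best

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
v, k, centers, clusters = best_k(points, k_max=3)
print(k)  # 2: the two tight groups give the smallest validity value
```

Keeping a running minimum over k is equivalent to the pairwise comparison of Validity(k) with Validity(k-1) in step vii. A larger `step` trades precision in k for fewer K-Means runs.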

3 Web Log Mining Technology

Web log mining has been successfully applied to improving personalized recommendation systems and to business intelligence. Commercial sites in particular accumulate large volumes of user access log data, and businesses can use these data to provide users with personalized services and so improve customer trust and service quality [3]. The most common application of Web access mining is Web log mining, in which the server's log files are mined to derive user access patterns. This article is based on Web log mining for personalized recommendation, as shown by Equation 7:

$$u(0) = \sum_{i=1}^{N} \alpha_i e_i + \sum_{i=N+1}^{M} \alpha_i e_i \tag{7}$$

Web log mining can be divided into three phases: data preprocessing, pattern mining, and analysis of the mined patterns. A Web server access log generally includes the IP address, the request time, the method (e.g. GET, POST), the URL of the requested file, the HTTP version number, the return code, and the number of bytes transmitted. Table 1 lists several access log entries of the Web server http://lpqf.haust.edu.cn. The first entry in Table 1 shows that a user at IP address 192.168.2.174 sent a GET request for /comment/list4s.asp; the request transferred 93 bytes of data, and the return code 200 indicates that the response succeeded.

Table 1. Content of the Web server's access log

IP Address      Time                  Method/URL                     Status  Size
192.168.2.174   2006-10-16 00:23:40   GET /comment/list4s.asp        200     93
192.168.2.222   2006-10-16 00:25:02   GET /include/pagecount.asp     200     188
192.168.2.174   2006-10-16 00:25:48   POST /comment/comment.asp      200     242
192.168.2.233   2006-10-16 00:27:21   GET /include/functionht.asp    188     231

The automatic evaluation method rests on the following idea: if a user spends a long time on a site or page, or accesses it with high frequency, the user's interest in that site or page is high. Access time and access frequency can therefore be used as weights that measure interest. The algorithm is as follows: compute the access frequency of a URL by counting the number of users who access that URL.
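A minimal Python sketch of this counting step is given below. It assumes that each log entry is a whitespace-separated line in the column order of Table 1 and that a user is identified by IP address alone; the function names `parse_line` and `url_user_frequency` are hypothetical, and a real server log would need a parser matched to its actual format.

```python
from collections import defaultdict

def parse_line(line):
    """Split one access-log line laid out as in Table 1:
    IP, date, time, method, URL, status, size."""
    ip, date, time, method, url, status, size = line.split()
    return {
        "ip": ip,
        "time": f"{date} {time}",
        "method": method,
        "url": url,
        "status": int(status),
        "size": int(size),
    }

def url_user_frequency(lines):
    """Number of distinct users (here: distinct IPs) per URL."""
    users = defaultdict(set)
    for line in lines:
        record = parse_line(line)
        users[record["url"]].add(record["ip"])
    return {url: len(ips) for url, ips in users.items()}

SAMPLE = [
    "192.168.2.174 2006-10-16 00:23:40 GET /comment/list4s.asp 200 93",
    "192.168.2.222 2006-10-16 00:25:02 GET /include/pagecount.asp 200 188",
    "192.168.2.174 2006-10-16 00:25:48 POST /comment/comment.asp 200 242",
]
print(url_user_frequency(SAMPLE))
```

Collecting IP addresses in a set per URL means that repeated visits by the same user are counted once, which matches counting users rather than requests.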
Because the data cleaning stage should remove pages that are visited only occasionally, the number of users who access a URL within a fixed time period can be required to be greater than or equal to a set value.

4 Experimental Results and Analysis

To verify the validity of the algorithm, the log data were put through data cleaning, user identification, page recognition and other steps, which yielded two data sets. The first data set consists of 201 users and 81 links; the second consists of 792 different users and 1644 links. The validity index values are shown in Figure 1. The results on the two data sets show that the algorithm attains the minimum validity index at k = 62, where the clusters are compact and maximally separable from one another, so the experiment achieves the desired result. When using the clustering algorithm, the step for k should be chosen according to the size and number of the data objects to be clustered. When the amount of data is small, a smaller step improves the precision of the clustering; when the amount of data is large, a larger step reduces the computation, and its impact on the accuracy of the algorithm is negligible.

Fig. 1. Validity versus Cluster Number

A second test compared conventional k-means clustering with randomly selected initial points against the automatic initial cluster center selection algorithm, clustering 2658 test users into nine clusters. The results, shown in Figures 2 and 3, indicate that automatic selection of the initial cluster centers is superior to random selection.

Fig. 2. Stochastic selection of the clustering initialization points and result of clustering

Fig. 3. Automatic selection of the clustering initialization points and result of clustering

In the cluster analysis of the 2658 users, the results show that automatic selection of the initial cluster centers handles the problem of isolated points better; the comparison is shown in Figure 4. The figure shows clearly that randomly selected initial cluster centers produce more isolated points than the automatic cluster center selection algorithm.

Fig. 4. Comparisons of isolated point

This paper analyzes the k-means clustering algorithm and, to address the initial value problem of the traditional algorithm, proposes an improved K-Means clustering algorithm based on the validity index and validates it through experiments. Randomly selected initial points produce more isolated points; to reduce them, the initial cluster centers are selected automatically. The experiments show that this selection method reduces the number of outliers and improves the clustering result.

5 Summary

Web log mining has been successfully applied to improving personalized recommendation systems and to business intelligence. K-means clustering is a common, widely used partition-based clustering method. This paper presents the development of Web log mining based on improved K-Means clustering analysis. Because the K-Means clustering algorithm is used to cluster users, the algorithm is described in detail.

References

1. Mobasher, B., Cooley, R., Srivastava, J.: Automatic personalization based on web usage mining. Communications of the ACM 43(8), 142–151 (2000)
2. Huang, Z.: Extensions to the K-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery 2, 283–304 (1998)
3. Srivastava, J., Cooley, R., Deshpande, M., et al.: Web usage mining: discovery and application of usage patterns from web data. SIGKDD Explorations 1(2), 12–23 (2000)