Clustering Algorithm Analysis of Web Users with Dissimilarity and SOM Neural Networks




JOURNAL OF SOFTWARE, VOL. 7, NO. 11, NOVEMBER 2012

Xiao Qiang
School of Economics and Management, Lanzhou Jiaotong University, Lanzhou, China
Email: lzt_q@6.com

Qian Xiao-dong
Lanzhou Jiaotong University Graduate School, Lanzhou, China
Email: qiad@mail. lztu.c

Liao Hui
School of Economics and Management, Lanzhou Jiaotong University, Lanzhou, China
Email: lzt_liaohui@6.com

Abstract: To organize and analyze massive web information effectively, a clustering mining algorithm for web users is designed. The SOM neural network algorithm has several disadvantages for data clustering, so a new method, the D-SOM (Dissimilarity-Self Organizing Feature Mapping) algorithm, is proposed for clustering web users. The algorithm estimates the centers and the number of clusters of the data set by dissimilarity computation, which optimizes SOM neural network learning and improves the clustering effect. In the experiments, web data are collected and processed by the D-SOM algorithm. The experimental results verify that the D-SOM clustering algorithm has better clustering accuracy and is more efficient than the SOM neural network algorithm.

Index Terms: Clustering; Dissimilarity; Self Organizing Feature Mapping; E-commerce

Ⅰ. INTRODUCTION

With the development of information technology, E-commerce offers different forms of platform for various business activities over the Internet [1]. How to help users obtain exact information quickly is becoming an urgent problem, and web data mining technology in particular is a core research problem for network researchers. Web log files contain information about what customers browse. If we can analyze the web logs effectively and understand customers' behaviors, we can reveal the relation between web users and access paths, improve the web site, find the access behavior of users, and support personalized services for web users. In knowledge discovery in databases, the SOM neural network has developed rapidly in recent years and has solved many data mining problems, because a neural network can simulate human brain thinking and has a strong ability to learn [2].
We can optimize the clustering effect by iterative computation. However, SOM has many disadvantages, so in this paper we use an improved SOM neural network as the clustering method to design a web clustering mining system. This paper is organized as follows. In Section 2 we review web log data and build the web session matrix. In Section 3 we introduce the SOM neural network structure and its shortcomings in clustering data. In Section 4 we introduce the D-SOM neural network algorithm, followed by the experimental evaluation in Section 5. The conclusions are given in Section 6.

Ⅱ. BUILDING MATRIX OF WEB USER'S DIALOGUE

When a user accesses a website, a series of log entries is generated and recorded on the web server. The web log files include the date and time, IP address, method, status, size, agent and referrer [3]. In order to perform clustering analysis of the users of E-commerce websites, we need to obtain the users' browsing patterns and extract the user information from the server logs, namely:

P = <ip, (l-id, l-time)>

where P denotes the pages browsed in a certain period of time, ip denotes a user who accesses the E-commerce site, l-id denotes the page accessed, and l-time denotes the time the user spends on the page. If a user's page visits are not successful, or the access time is less than the threshold for visited links, the user is deleted. The final web user sessions are listed in Table I.

TABLE Ⅰ. WEB USER'S SESSION (columns: IP, L1, L2, L3, ..., Ln; rows: ip1, ip2, ..., ipn; binary entries)

© 2012 ACADEMY PUBLISHER doi:10.4304/jsw.7.11.2533-2537
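The filtering rule and the 0/1 session table described in this section can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the record layout `(ip, link_id, status, seconds)`, the function name `build_session_matrix`, the `min_seconds` threshold and the "2xx status means success" test are all assumptions made for the example. The sketch drops unsuccessful or too-short visits, so a user with no remaining visits disappears from the table.

```python
def build_session_matrix(records, links, min_seconds=1.0):
    """Return (users, matrix) where matrix[i][j] is 1 if users[i] clicked links[j].

    records: iterable of (ip, link_id, status, seconds) tuples (hypothetical layout).
    links:   ordered list of link identifiers that form the table columns.
    """
    visited = {}
    for ip, link_id, status, seconds in records:
        # Assumed filter: keep only successful (2xx) visits that lasted long enough.
        if not (200 <= status < 300) or seconds < min_seconds:
            continue
        visited.setdefault(ip, set()).add(link_id)
    users = sorted(visited)
    # One row per remaining user, one 0/1 column per link.
    matrix = [[1 if link in visited[ip] else 0 for link in links] for ip in users]
    return users, matrix


# Illustrative log records (made-up values).
records = [
    ("10.0.0.1", "L1", 200, 5.0),
    ("10.0.0.1", "L3", 200, 2.0),
    ("10.0.0.2", "L2", 404, 4.0),   # unsuccessful visit, dropped
    ("10.0.0.2", "L3", 200, 0.2),   # below the time threshold, dropped
    ("10.0.0.3", "L2", 200, 3.0),
]
users, matrix = build_session_matrix(records, ["L1", "L2", "L3"])
print(users)
print(matrix)
```

Each row of `matrix` corresponds to one user and each column to one link, which is the shape the later clustering steps take as input.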

As Table I shows, Ln denotes the links of the E-commerce website and ip denotes a user who accesses the site. An entry of 0 denotes that the user did not click the link, and an entry of 1 denotes that the user clicked the link. We build the matrix P of web user sessions from Table I:

$$P = \begin{bmatrix} ip_1 \\ ip_2 \\ \vdots \\ ip_n \end{bmatrix} = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1N} \\ p_{21} & p_{22} & \cdots & p_{2N} \\ \vdots & \vdots & & \vdots \\ p_{n1} & p_{n2} & \cdots & p_{nN} \end{bmatrix} \qquad (1)$$

where each entry $p_{ij}$ is the 0/1 value of Table I for user $ip_i$ and link $L_j$. The matrix P provides the input vectors that are processed by the SOM neural network to cluster the web users.

Ⅲ. THE SOM NEURAL NETWORK

The self-organizing feature mapping neural network, named the SOM neural network, is a numerical simulation method. It was presented by Professor Kohonen according to the characteristics of the human brain [5][6]. The SOM algorithm mainly includes competition, cooperation and weight adjustment, through which the network is trained by unsupervised self-organizing learning [9].

[Figure: SOM neural network structure]

As the figure shows, the network includes an input layer, an output layer and weights. The input layer includes the input nodes and the input vector, the output layer includes the output nodes and the output vector, and weights connect the input layer and the output layer [8][9]. Here $x_k$ denotes the input vector, $y_j$ denotes the output vector and $W_{ij}$ denotes the weights. The input vectors are clustered in the output layer by computing and adjusting $W_{ij}$, which gives the clustering result of the data set. However, the SOM neural network structure has disadvantages and its clustering effect is not satisfactory. The reasons include:

1) It is difficult to determine the number of output nodes, which affects the clustering of the data set.
2) The initial values selected for the weights linked to the output nodes may lead to different clustering results.

To overcome these disadvantages, we need to fix the input vectors and select suitable weights so as to obtain a better clustering effect. In Section 4, we propose a new algorithm to address these deficiencies.

Ⅳ. D-SOM NEURAL NETWORK ALGORITHM

Based on an analysis of the shortcomings of the SOM neural network, a new algorithm is presented. The input vectors are first clustered by dissimilarity calculation. According to the number of clusters and the cluster center vectors, we determine the number of output layer nodes and the link weights between the input nodes and the output nodes. The cluster initialization data are then entered into the input layer of the SOM network, so that a better clustering effect is obtained. On the basis of the D-SOM neural network, the system of web user clustering is designed as shown in the following figure.

[Figure: The system of the D-SOM algorithm]

A. Input Vectors Dissimilarity-calculated

Dissimilarity denotes the degree of similarity between one object and another. For binary variables, the dissimilarity is calculated by the Jaccard coefficient d(i, j) [4][7], given by:

$$d(i,j) = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}} \qquad (2)$$

where, for the two objects x and y being compared, $f_{11}$ denotes the number of variables on which x and y both take 1, $f_{10}$ the number on which x takes 1 and y takes 0, $f_{01}$ the number on which x takes 0 and y takes 1, and $f_{00}$ the number on which x and y both take 0. The greater the Jaccard coefficient, the more similar the two objects; the smaller the coefficient, the less similar they are. Dissimilarity can be expressed by a dissimilarity matrix [10][11], so we can build the dissimilarity matrix from the matrix of web users' dialogue, given by:

$$D(t) = \begin{bmatrix} d(2,1) & & & \\ d(3,1) & d(3,2) & & \\ \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & d(n,3) & \cdots \end{bmatrix} \qquad (3)$$

The clustering of the matrix D(t) proceeds as follows:

1) Starting from the initial matrix, select the largest element d(i, j) in D(t) and merge row i and row j into one class.
2) Calculate the dissimilarity between the new class and the other classes, and build a new dissimilarity matrix D(t+1).
3) If all samples have been clustered into one class, stop; otherwise set t = t+1 and return to the first step.
4) Set different thresholds to obtain different cluster centers.

To illustrate the dissimilarity matrix D, we take a sample session matrix P of size 6 × 7.
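The coefficient and the merge procedure of this subsection can be sketched in Python as below. This is a hedged illustration under stated assumptions, not the authors' code. The numerator `f11 + f00` follows the statement that a greater coefficient means more similar objects; the source does not specify how the value between a merged class and the other classes is computed, so the average over cross-class pairs is an assumption, as are the mean-vector cluster centers, the function names and the sample data.

```python
from itertools import combinations


def match_counts(x, y):
    """Count (f11, f10, f01, f00) over two equal-length 0/1 vectors."""
    pairs = list(zip(x, y))
    return (pairs.count((1, 1)), pairs.count((1, 0)),
            pairs.count((0, 1)), pairs.count((0, 0)))


def coefficient(x, y):
    """Share of positions on which x and y agree (assumed reading of Eq. (2))."""
    f11, f10, f01, f00 = match_counts(x, y)
    return (f11 + f00) / (f11 + f10 + f01 + f00)


def cluster_rows(rows, alpha):
    """Merge the two classes with the largest coefficient until it falls below alpha."""
    clusters = [[i] for i in range(len(rows))]

    def between(a, b):
        # Assumed linkage: average coefficient over all cross-class pairs.
        total = sum(coefficient(rows[i], rows[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: between(clusters[p[0]], clusters[p[1]]))
        if between(clusters[a], clusters[b]) < alpha:
            break  # the threshold decides how many classes remain
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    # Assumed cluster center: component-wise mean of the member rows.
    centers = [[sum(rows[i][k] for i in c) / len(c) for k in range(len(rows[0]))]
               for c in clusters]
    return clusters, centers


# Made-up 4 x 4 session matrix for illustration.
rows = [[1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1], [0, 0, 1, 0]]
clusters, centers = cluster_rows(rows, alpha=0.6)
print(clusters)
print(centers)
```

The number of classes left when the loop stops and their centers are the two quantities that the D-SOM algorithm later uses to size and initialize the SOM output layer.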

For the 6 × 7 sample matrix P, the dissimilarity matrix has the lower-triangular form

$$D = \begin{bmatrix} d(2,1) & & & & \\ d(3,1) & d(3,2) & & & \\ d(4,1) & d(4,2) & d(4,3) & & \\ d(5,1) & d(5,2) & d(5,3) & d(5,4) & \\ d(6,1) & d(6,2) & d(6,3) & d(6,4) & d(6,5) \end{bmatrix}$$

Applying the merge procedure to this matrix produces the classes c1, c2, c3 and c4 in turn, with a reduced dissimilarity matrix after each merge. Finally, c4 merges with the remaining class into one class, denoted by c5.

[Figure: The clustering dendrogram]

According to the clustering dendrogram, the dissimilarity threshold α is set to 0.6 and the cluster center vectors of the data set are determined.

B. Determine the Output Layer and Link Weight of SOM Neural Network

By calculating the dissimilarity matrix, we obtain the cluster center vectors $C_1, C_2, \ldots, C_n$ of the web users of the site and the number of clusters. The process is as follows:

Step 1: The number of output nodes of the SOM neural network is determined by the number of clusters.
Step 2: The initial weights of the SOM neural network are taken from the cluster center vectors obtained from the dissimilarity matrix, such as $W_1 = C_1$, $W_2 = C_2$, ..., $W_n = C_n$.
Step 3: The row vectors of the session matrix P, composed of the web users of the site, are used as the network input samples. One sample represents the links accessed by one user.
Step 4: Calculate the distance from the input vector at time t to all the output nodes:

$$d_j = \sum_{i=1}^{n} \left( x_i(t) - W_{ij}(t) \right)^2 \qquad (4)$$

where $d_j$ denotes the distance at time t and $x_i(t)$ is the input vector.
Step 5: Select the node with the minimum distance as the best matching neuron $j^*$, that is, $d_{j^*} = \min_j (d_j)$.
Step 6: Adjust the connection weight vector of the winning output node by the update formula

$$W_{ij}(t+1) = W_{ij}(t) + \eta(t)\left( x_i(t) - W_{ij}(t) \right) \qquad (5)$$

where $\eta(t)$ is the learning rate, an exponentially decaying function of the training time t.
Step 7: For successive training times t, repeat the steps above until the network weights stabilize, that is, until convergence.
Step 8: After the network converges, determine the cluster of each sample according to the node responses.

Ⅴ. EXPERIMENTS

To compare the SOM method and the D-SOM method in clustering web site data, this paper evaluates them by the cluster density and the average separation between clusters [12]. The cluster density measures how closely the data points of a cluster concentrate around its center; the higher the value, the better the clustering effect.

$$S = \frac{1}{m} \sum_{S \in N(p)} dist(c_i, S)$$

where $dist(c_i, S)$ denotes the distance between two points, which can be the Euclidean distance, the Manhattan distance or the Minkowski distance. If $dist(c_i, S) = 0$, then $c_i$ and S are the same point and are not counted as neighbors of each other. N(p) denotes the set of data points after clustering, and m is the number of clusters.

The average separation between clusters measures the degree of difference between the centers of different clusters; the higher the average value, the better the clustering effect.

$$ds = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} \sqrt{(c_{i1} - c_{j1})^2 + (c_{i2} - c_{j2})^2 + \cdots + (c_{im} - c_{jm})^2}$$

where $(c_{i1}, c_{i2}, \ldots, c_{im})$ and $(c_{j1}, c_{j2}, \ldots, c_{jm})$ denote the cluster centers.

In this experiment, the operating environment is a Pentium(R) Dual-Core CPU E5300 at 2.6 GHz with 1.98 GB RAM, and the experimental software is MATLAB 2007b. The experiments use the access log data provided by the UCI KDD Archive (http://kdd.ics.uci.edu/databases/msnbc/msnbc.data.html). The web log of user accesses is used to construct the session matrix P, and the two algorithms are compared in accuracy and running time. Sessions with a length of less than 4 and sessions with a length greater than 7 are removed from the web log, and 6 of the remaining users are selected as experimental data, with L denoting the links. Four data sets of increasing size were collected to assess the clustering of the two algorithms.
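For orientation, the D-SOM training procedure of Section IV.B that these experiments exercise can be sketched in Python as follows. It is a minimal sketch under assumptions rather than the authors' MATLAB implementation: only the winning node is updated, training runs for a fixed number of epochs instead of testing for weight convergence, and the learning rate form `eta0 * exp(-t / tau)` with its parameters is assumed, since the source only indicates an exponentially decaying rate. All function and parameter names are hypothetical.

```python
import math


def squared_distance(x, w):
    """Eq. (4): sum of squared differences between an input and a weight vector."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))


def best_node(x, weights):
    """Step 5: index of the output node whose weights are closest to x."""
    return min(range(len(weights)), key=lambda j: squared_distance(x, weights[j]))


def train_dsom(samples, centers, epochs=20, eta0=0.5, tau=10.0):
    """Train a SOM whose output nodes start at the given cluster centers."""
    # Steps 1-2: one output node per cluster, weights initialized to the centers.
    weights = [list(c) for c in centers]
    for t in range(epochs):
        eta = eta0 * math.exp(-t / tau)  # assumed exponential decay
        for x in samples:
            j = best_node(x, weights)
            # Step 6, Eq. (5): move the winning weights toward the sample.
            weights[j] = [w + eta * (xi - w) for xi, w in zip(x, weights[j])]
    return weights


def assign(samples, weights):
    """Step 8: label each sample with its best matching output node."""
    return [best_node(x, weights) for x in samples]


# Made-up session rows and centers (for example, from a prior dissimilarity clustering).
samples = [[1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1], [0, 0, 1, 0]]
centers = [[1, 1, 0, 0.5], [0, 0, 1, 0.5]]
weights = train_dsom(samples, centers)
labels = assign(samples, weights)
print(labels)
```

Starting the weights at the cluster centers removes the two sources of variation named in Section 3, namely the choice of the number of output nodes and the random initial weights.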

TABLE Ⅱ. WEB USER'S SESSION DATA MATRIX (columns: web users, l1 to l7; rows: Ip1, Ip2, Ip3, ..., Ip6; binary entries)

The cluster density assessment results are shown in the following figure.

[Figure: Cluster density assessment of the SOM clustering algorithm and the improved SOM algorithm]

The average separation between clusters is shown in Fig. 3.

[Fig. 3: Average separation assessment of the SOM algorithm and the improved SOM algorithm]

The cluster density figure shows that, for different amounts of data, the clustering effects of the SOM algorithm and the improved SOM algorithm are not the same. With a small amount of data the two algorithms are similar, but as the amount of data increases, the improved SOM algorithm clusters significantly better than the SOM algorithm. Fig. 3 shows that, for the different data sets, the improved SOM algorithm achieves better average separation than the SOM algorithm. This is mainly because the improved SOM algorithm creates a more accurate number of SOM output nodes and initializes the weights closer to the cluster centers. Clustering web site accesses with the improved SOM algorithm groups users with the same access interests together. This makes it easier to improve the link structure of the website, to provide specific services to different users when their IP accesses the site again, to improve click-through rates, and to increase the efficiency of purchases by web site users.

Ⅵ. CONCLUSION

In this paper, the proposed method is shown to be valid for clustering Web user access patterns. The experimental data show that the SOM neural network algorithm is itself defective and does not perform well in data mining applications. The improved SOM algorithm presented in this paper remedies these shortcomings and can be applied well to Web log data mining. It has broad application prospects for improving the design of personalized business websites. Further work will combine the user's registration information, such as age, gender, income and region, with the access time to extend this algorithm.

ACKNOWLEDGEMENT

This work is supported by the National Funds of Social Science (NO. 8XTQ) to Qian Xiao-dong and by a project of the Young Scholars Science Foundation of Lanzhou Jiaotong University (NO. 44).

REFERENCES

[1] Zhou Hua, Huang Li-Ping. C-means clustering algorithm based on SOM neural network. Computer Application, 2007, VOL.7 NO.6, Page 5-5.
[2] Guo Wei-ye, Zhao Xiao-da, Pang Ying-zhi, et al. Research on Clustering Algorithm Based on SOM Neural Network in Data Mining. Information Science, 2009, VOL.6 NO.6, Page 874-876.
[3] Li Gang, An Lu. Clustering analysis of E-commerce transactions with self-organizing map. New Technology of Library and Information Service, 2008, VOL.69 NO.9, Page 7-77.
[4] Dong Yi-Hong, Zhuang Yue-Ting. Web log mining based on a novel competitive neural network. Journal of Computer Research and Development, 2003, VOL.4 NO.5, Page 66-667.
[5] Krishna K, Murty M N. Genetic k-means algorithm. IEEE Transactions on Systems, Man and Cybernetics, Part B, 1999, VOL.29 NO.3: 433-439.
[6] Kohonen T. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 1982, VOL.43 NO.1: 59-69.
[7] G A Carpenter, S Grossberg. A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics and Image Processing, 1987, VOL.37: 54-115.
[8] J Kangas, T Kohonen, et al. Variants of self-organizing maps. IEEE Trans on Neural Networks, 1990, VOL.1 NO.1: 93-99.
[9] Ding C, Patra J C. User Modeling for Personalized Web Search with Self-Organizing Map. Journal of the American Society for Information Science and Technology, 2007, VOL.58 NO.4: 494-507.
[10] Zhao Ming-Qing, Jiang Chang-Jun, Tao Shu-feng. Dissimilarity matrix based on equivalence of cluster. Computer Science, 2004, VOL.3 NO.7: 83-84.
[11] Gu Zongwei, i Huiua. Graph clustering method based on the dissimilarity measure. Shanxi Agricultural University, 2009, VOL.9 NO.3: 84-88.
[12] Bing Liu; Yu Chung, Xue Guirong, et al. (translation). Web Data Mining. Tsinghua University Press, 2009, NO.4: 58-66.