JOURNAL OF SOFTWARE, VOL. 7, NO. 11, NOVEMBER 2012

Clustering Algorithm Analysis of Web Users with Dissimilarity and SOM Neural Networks

Xiao Qiang
School of Economics and Management, Lanzhou Jiaotong University, Lanzhou, China
Email: lzt_q@6.com

Qian Xiao-dong
Lanzhou Jiaotong University Graduate School, Lanzhou, China
Email: qiad@mail.lztu.c

Liao Hui
School of Economics and Management, Lanzhou Jiaotong University, Lanzhou, China
Email: lzt_liaohui@6.com

Abstract: To organize and analyze massive web information effectively, this paper designs a clustering mining algorithm for web users. Because the SOM neural network algorithm has several disadvantages for data clustering, we propose a new method, the D-SOM (Dissimilarity-Self Organizing Feature Mapping) algorithm, for clustering web users. The algorithm estimates the centers and the number of clusters in the data set by computing dissimilarity, which optimizes SOM neural network learning and improves the clustering effect. In the experiment, web data are collected and processed with the D-SOM algorithm. The experimental results verify that the D-SOM clustering algorithm achieves better clustering accuracy and is more efficient than the SOM neural network algorithm.

Index Terms: Clustering; Dissimilarity; Self Organizing Feature Mapping; E-commerce

Ⅰ. INTRODUCTION

With the development of information technology, E-commerce offers different forms of platform for various business activities over the Internet [1]. Helping users obtain exact information quickly is becoming an urgent problem, and web data mining technology is a core research problem in networking. Web log files contain information about customers' browsing. If we can analyze the web logs effectively and understand customers' behavior, we can reveal the relation between web users and access paths, improve the web site, discover users' access behavior, and support personalized services for web users. In knowledge discovery in databases, the SOM neural network has developed rapidly in recent years and has solved many data mining problems, because a neural network can simulate human brain thinking and has a strong ability to learn [2].
We can optimize the clustering effect by iterative computation. However, SOM has been observed to have many disadvantages, so in this paper we use an improved SOM neural network as the clustering method to design a web clustering mining system.

This paper is organized as follows. In Section 2 we review web log data and build the web session matrix. In Section 3 we introduce the SOM neural network structure and its shortcomings in clustering data. In Section 4 we introduce the D-SOM neural network algorithm, followed by the experimental evaluations in Section 5. The conclusions are given in Section 6.

Ⅱ. BUILDING MATRIX OF WEB USER'S DIALOGUE

When a user accesses a website, a series of log entries is generated and recorded on the web server. The web log files include date and time, IP address, method, status, size, agent and referrer [3]. To carry out clustering analysis of the users of E-commerce websites, we need to obtain the users' browsing patterns by extracting user information from the server logs, namely:

P = <ip (l-id, l-time)>

where P denotes the pages browsed in a certain period of time, ip denotes a user accessing the E-commerce site, l-id denotes the page accessed, and l-time denotes the time the user spent accessing the page. If a user's page visit is not successful, or the access time is less than the threshold for visited links, that web user is deleted. From the final web user sessions we establish Table Ⅰ, listed below:

TABLE Ⅰ
WEB USER'S SESSION

IP      L1    L2    L3    ...   LN
ip1     0/1   0/1   0/1   ...   0/1
ip2     0/1   0/1   0/1   ...   0/1
...
ipn     0/1   0/1   0/1   ...   0/1

ACADEMY PUBLISHER, doi:10.4304/jsw.7.11.2533-2537
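As a rough illustration of the filtering rule and the session table described above, the following Python sketch builds a binary user-by-link table from log records. The record layout `(ip, link_id, seconds_on_page, http_status)`, the status check against 200, the `min_seconds` threshold and the function name `build_session_matrix` are all assumptions made for this example; the paper does not specify them. Filtering is applied per record here, which is only one possible reading of the rule.

```python
def build_session_matrix(records, links, min_seconds=1.0):
    """Return (users, P) where P[r][c] is 1 if user r visited link c.

    records: iterable of (ip, link_id, seconds_on_page, http_status).
    The tuple layout, the 200 status check and min_seconds are
    illustrative assumptions, not taken from the paper.
    """
    column = {link: c for c, link in enumerate(links)}
    rows = {}
    for ip, link_id, seconds, status in records:
        # Drop failed requests and visits shorter than the threshold.
        if status != 200 or seconds < min_seconds:
            continue
        # Ignore pages that are not among the tracked links.
        if link_id not in column:
            continue
        rows.setdefault(ip, [0] * len(links))[column[link_id]] = 1
    users = sorted(rows)
    return users, [rows[ip] for ip in users]


# Hypothetical log records, invented for the example.
records = [
    ("10.0.0.1", "L1", 30.0, 200),
    ("10.0.0.1", "L3", 12.0, 200),
    ("10.0.0.2", "L2", 0.5, 200),   # too short, filtered out
    ("10.0.0.2", "L3", 20.0, 200),
    ("10.0.0.3", "L1", 15.0, 404),  # failed request, filtered out
]
links = ["L1", "L2", "L3"]
users, P = build_session_matrix(records, links)
print(users)
print(P)
```

In this sketch a user whose records are all filtered out (the third address) simply does not appear as a row, which matches the idea of deleting such users before the table is built.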
As Table Ⅰ shows, L_N denotes the links of the E-commerce website and ip denotes a user who accesses the E-commerce web site; 0 denotes that the user did not click the link, and 1 denotes that the user clicked the link. We build the matrix P of web user sessions from Table Ⅰ:

$$P = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1N} \\ p_{21} & p_{22} & \cdots & p_{2N} \\ \vdots & \vdots & & \vdots \\ p_{n1} & p_{n2} & \cdots & p_{nN} \end{bmatrix} \tag{1}$$

where row i corresponds to user ip_i, column j corresponds to link L_j, and each entry is 0 or 1. The matrix P supplies the input vectors that are processed by the SOM neural network to realize the clustering of web users.

Ⅲ. THE SOM NEURAL NETWORK

The self-organizing feature mapping neural network, called the SOM neural network, is a numerical simulation method. It was presented by Professor Kohonen according to the characteristics of the human brain [5][6]. The SOM algorithm mainly includes competition, cooperation and weight adjustment, through which the network is trained by unsupervised self-organized learning [9]. The SOM neural network structure is shown in Fig. 1.

Fig. 1. SOM neural network structure

As Fig. 1 shows, the network includes an input layer, an output layer and weights. The input layer includes the input nodes and the input vector, the output layer includes the output nodes and the output vector, and weights connect the input layer and the output layer [8][9]. Here x_k denotes the input vector, y_j denotes the output vector, and W_ij denotes the weight. The input vectors are clustered in the output layer by computing and adjusting W_ij, which yields the clustering result for the data set.

However, the SOM neural network structure has disadvantages, and its clustering effect is not satisfactory. The reasons include:

1) It is difficult to determine the output nodes, which affects the clustering of data sets.
2) For the weights linking the output nodes, different initial values may lead to different clustering results.

To overcome these disadvantages, we need to fix the input vectors and select suitable weights so that a better clustering effect is obtained. In Section 4, we propose a new algorithm to address these deficiencies.

Ⅳ. D-SOM NEURAL NETWORK ALGORITHM

Having analyzed the shortcomings of the SOM neural network, we present a new algorithm. The input vectors are first clustered by dissimilarity calculation. According to the number of clusters and the cluster center vectors, we determine the number of output layer nodes and the link weights between the input nodes and the output nodes. The cluster initialization data are then entered into the input layer of the SOM network, so that a better clustering effect is obtained. On the basis of the D-SOM neural network, the web user clustering system is designed as shown in the following figure.

Fig. The system of D-SOM algorithm

A. Input Vectors Dissimilarity-calculated

Dissimilarity denotes the degree of similarity between one object and another. For binary variables, the dissimilarity is calculated by the Jaccard coefficient d(i,j) [4][7], given by:

$$d(i,j) = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}} \tag{2}$$

where, for the two objects x and y being compared, f_{11} denotes the number of variables for which x and y both take 1, f_{10} the number for which x takes 1 and y takes 0, f_{01} the number for which x takes 0 and y takes 1, and f_{00} the number for which x and y both take 0. A greater Jaccard coefficient means that the two objects are more similar; a smaller one means that they are less similar.

Dissimilarity can be expressed by a dissimilarity matrix [10][11], so we can build the dissimilarity matrix from the matrix of web users' dialogue, given by:

$$D(t) = \begin{bmatrix} d(2,1) & & & \\ d(3,1) & d(3,2) & & \\ \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & d(n,3) \;\cdots & d(n,n-1) \end{bmatrix} \tag{3}$$

The clustering of matrix D(t) proceeds as follows:

1) Select the largest element d(i,j) in the matrix D(t), beginning with the initial matrix, and merge row i and row j into one class.
2) Calculate the dissimilarity between the new class and the other classes, and build a new dissimilarity matrix D(t+1).
3) If all samples have been clustered into one class, stop the algorithm; otherwise set t = t+1 and go to step 1.
4) Set different thresholds to obtain different cluster centers.

To illustrate the dissimilarity matrix D, consider a sample 6×7 session matrix P.
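As a hedged illustration of the merge procedure above, the sketch below runs it on a hypothetical 6×7 binary matrix; the values are invented for this example and are not the paper's sample. It assumes that the measure of Eq. (2) is the share of link positions on which two users agree, so that larger values mean "more alike", which is how the surrounding text reads. Average linkage between merged classes and stopping once the best value drops below a threshold `alpha` are further assumptions, since the paper does not state how the value between a new class and the other classes is computed.

```python
def matching_score(u, v):
    """Share of positions on which two equal-length 0/1 vectors agree."""
    agree = sum(1 for a, b in zip(u, v) if a == b)
    return agree / len(u)


def cluster_by_threshold(rows, alpha):
    """Repeatedly merge the pair of classes with the largest score.

    Assumptions (not from the paper): average linkage between classes,
    and merging stops once the best score falls below alpha.
    Returns a list of classes, each a list of row indices.
    """
    clusters = [[i] for i in range(len(rows))]

    def link(c1, c2):
        scores = [matching_score(rows[i], rows[j]) for i in c1 for j in c2]
        return sum(scores) / len(scores)

    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = link(clusters[a], clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        if best < alpha:
            break
        a, b = pair
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters


# Hypothetical 6 users x 7 links session matrix (illustrative only).
P_demo = [
    [1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
    [1, 0, 1, 0, 1, 0, 1],
    [1, 1, 1, 0, 0, 0, 1],
]
groups = cluster_by_threshold(P_demo, alpha=0.6)
print(sorted(sorted(g) for g in groups))  # [[0, 1, 5], [2, 3], [4]]
```

With seven links every pairwise value is a multiple of 1/7, and the number of classes left when merging stops is what the D-SOM method would use as the number of SOM output nodes.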
[Worked example: the 6×7 sample matrix P, its lower-triangular matrix D with entries d(2,1), d(3,1), d(3,2), ..., d(6,5), and the successively reduced matrices obtained as rows are merged into the classes c1, c2, c3 and c4. The numeric entries are not recoverable from the extracted layout.]

Finally, c4 and c2 are clustered into one class, denoted by c5. The clustering dendrogram is shown in the following figure.

Fig. The clustering dendrogram

According to the cluster dendrogram, the dissimilarity threshold α is set to 0.6 and the cluster center vectors of the data set, which include sample 4, are determined.

B. Determine the Output Layer and Link Weight of SOM Neural Network

By calculating the dissimilarity matrix, we obtain the cluster center vectors C_1, C_2, ..., C_n of the web users of the web site and the number of clusters. The process is as follows:

Step 1: The number of output nodes of the SOM neural network is determined by the number of clusters.

Step 2: The initial weights of the SOM neural network are taken from the cluster center vectors obtained from the dissimilarity matrix, i.e. W_1 = C_1, W_2 = C_2, ..., W_n = C_n.

Step 3: The vectors of the session matrix P, composed of the web users of the web site, serve as the network input samples; one sample represents one user's access links.

Step 4: Calculate the distance from the input vector at time t to all the output nodes:

$$d_j = \sum_{i=1}^{n} \left( x_i(t) - W_{ij}(t) \right)^2 \tag{4}$$

where d_j denotes the distance to output node j at time t and x_i(t) is the input vector.

Step 5: Select the node with the minimum distance as the best matching neuron, i(x) = min_j (d_j); neuron i(x) is the winning neuron.

Step 6: Adjust the connection weight vector of the winning output node with the update formula

$$W_{ij}(t+1) = W_{ij}(t) + \eta(t)\left( x_i(t) - W_{ij}(t) \right) \tag{5}$$

where η(t) is the learning rate, an exponential function of the training time t.

Step 7: For successive training times t, repeat the steps above until the network weights stabilize, that is, until the network converges.

Step 8: Once the network has converged, determine the clustering of the samples according to the node responses.
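Steps 1 to 8 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the authors' implementation: the learning-rate schedule `eta0 * exp(-t / tau)`, the stopping tolerance `tol`, the winner-only update (no neighborhood function) and all names are choices made for this example. The samples and the three starting centers are hypothetical.

```python
import math


def sq_dist(x, w):
    """Squared distance between an input vector and a weight vector (Eq. 4)."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))


def winner(x, weights):
    """Index of the output node whose weights are closest to x (Step 5)."""
    return min(range(len(weights)), key=lambda j: sq_dist(x, weights[j]))


def train_dsom(samples, centers, eta0=0.5, tau=5.0, max_epochs=100, tol=1e-4):
    """Train a winner-only SOM whose nodes start at the given centers.

    eta0, tau, max_epochs and tol are illustrative assumptions.
    Returns (weights, labels), labels[k] being the node of samples[k].
    """
    # Steps 1 and 2: one output node per center, weights set to the centers.
    weights = [[float(v) for v in c] for c in centers]
    for t in range(max_epochs):
        eta = eta0 * math.exp(-t / tau)  # assumed exponential decay
        largest_change = 0.0
        for x in samples:  # Step 3: each user row is one input sample
            j = winner(x, weights)
            for i, xi in enumerate(x):  # Step 6: move the winner toward x
                delta = eta * (xi - weights[j][i])
                weights[j][i] += delta
                largest_change = max(largest_change, abs(delta))
        if largest_change < tol:  # Step 7: weights have stabilized
            break
    # Step 8: assign every sample to its responding node.
    labels = [winner(x, weights) for x in samples]
    return weights, labels


# Hypothetical 6 x 7 session rows; rows 0, 2 and 4 are used as centers.
samples = [
    [1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
    [1, 0, 1, 0, 1, 0, 1],
    [1, 1, 1, 0, 0, 0, 1],
]
centers = [samples[0], samples[2], samples[4]]
weights, labels = train_dsom(samples, centers)
print(labels)  # [0, 0, 1, 1, 2, 0]
```

Starting the weights at cluster centers, instead of at random values, is the point of the D-SOM initialization: each sample already has a nearby node, so the winners do not change during training in this small example.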
Ⅴ. EXPERIMENTS

To compare the SOM method and the D-SOM method for clustering web site data, this paper evaluates them by the density of clusters and the average separation between clusters [12]. The density of a cluster measures how concentrated the data points are around the center, in terms of similarity; a higher value indicates a better clustering effect.

$$S = \sum_{o \in N(p)} dist(c_i, S) \Big/ m$$

where dist(c_i, S) denotes the distance between two points, for which the Euclidean distance, the Manhattan distance or the Minkowski distance can be used. If dist(c_i, S) = 0, then c_i and S are the same point and are not counted as neighbors of each other. N(p) denotes the set of data points after clustering, and m is the number of clusters.

The average separation between clusters measures the degree of difference between the centers of different clusters; a higher average value indicates a better clustering effect.

$$ds = \sum_{i=1}\sum_{j=1} \sqrt{(c_{i1} - c_{j1})^2 + (c_{i2} - c_{j2})^2 + \cdots + (c_{im} - c_{jm})^2} \Big/ n$$

where (c_{i1}, c_{i2}, ..., c_{im}) and (c_{j1}, c_{j2}, ..., c_{jm}) denote the centers of the clusters, and the division by n turns the sum into an average.

In this experiment, the operating environment is a Pentium(R) Dual-Core CPU E5300 at 2.6 GHz with 1.98 GB RAM, and the experimental software is MATLAB 2007b. The experiments use the access log data provided by the UCI KDD Archive (http://kdd.ics.uci.edu/databases/msnbc/msnbc.data.html). The web log user access data are used to construct the session matrix P, and the two algorithms are compared in accuracy and running time. The web log user data are cleaned by removing sessions of length less than 4 and sessions of length greater than 7, and the data of 6 users are selected as experimental data, with L denoting a link. Four data sets of different sizes were collected to assess the clustering of the two algorithms.
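As a rough sketch of the two evaluation measures defined in this section, the Python code below computes a within-cluster compactness value and the average separation between cluster centers, using the Euclidean distance. The exact normalization in the paper's formulas is not fully specified, so both functions are assumptions: compactness is taken as the mean distance of each point to its own center, and separation as the mean distance over all distinct pairs of centers. Note that a mean distance to the center falls as clusters tighten, so it is an inverse stand-in for the density score described in the text.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def mean_distance_to_center(points, labels, centers):
    """Mean distance of every point to the center of its own cluster.

    Assumed stand-in for the cluster density: smaller means tighter.
    """
    total = sum(euclidean(p, centers[k]) for p, k in zip(points, labels))
    return total / len(points)


def average_separation(centers):
    """Mean Euclidean distance over all distinct pairs of centers.

    Averaging over pairs is an assumption about the normalization.
    """
    pairs = list(combinations(centers, 2))
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)


# Hypothetical two-dimensional data, invented for the example.
points = [(0.0, 0.0), (0.0, 2.0), (3.0, 4.0), (3.0, 1.0)]
labels = [0, 0, 1, 1]
centers = [(0.0, 0.0), (3.0, 4.0)]
print(mean_distance_to_center(points, labels, centers))  # 1.25
print(average_separation(centers))                       # 5.0
```

Running both functions on the outputs of SOM and D-SOM for the same data set would give the kind of side-by-side comparison reported in the figures, under the assumptions stated above.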
TABLE Ⅱ
WEB USER'S SESSION DATA MATRIX

Web users   l1    l2    l3    l4    l5    l6    l7
Ip1         0/1   0/1   0/1   0/1   0/1   0/1   0/1
Ip2         0/1   0/1   0/1   0/1   0/1   0/1   0/1
Ip3         0/1   0/1   0/1   0/1   0/1   0/1   0/1
...
Ip6         0/1   0/1   0/1   0/1   0/1   0/1   0/1

The density assessment results of the clustering are shown in Fig. 2.

Fig. 2. Within-cluster density assessment of the SOM clustering algorithm and the improved SOM algorithm

The average separation assessment of the clustering is shown in Fig. 3.

Fig. 3. Average separation assessment of the SOM algorithm and the improved SOM algorithm

Fig. 2 shows that the clustering effect of the SOM algorithm and the improved SOM algorithm differs with the amount of data. For a small amount of data, the two algorithms perform similarly, but as the amount of data increases, the improved SOM algorithm clusters significantly better than the SOM algorithm. Fig. 3 shows that, for the different data sets, the improved SOM algorithm separates the clusters better than the SOM algorithm. This is mainly because the improved SOM algorithm can create more accurate SOM output nodes and initialize the weights closer to the cluster centers.

Clustering web site access with the improved SOM algorithm groups together users with the same access interests. This makes it easier for the web site to improve its link structure, to provide specific services to different users according to their IP when they access the site again, to raise the click-through rate, and to increase the efficiency with which users buy on the web site.

Ⅵ. CONCLUSION

The method proposed in this paper for clustering web user access patterns is valid. The experimental data show that the SOM neural network algorithm itself has defects and does not perform very well in data mining applications, and that the improved SOM algorithm remedies these deficiencies and can be applied well to web log data mining. It has broad application prospects for improving the design of personalized business websites. Further work will combine the user's registration information, such as age, gender, income and region, with access time to extend this algorithm.

ACKNOWLEDGEMENT

This work is supported by the National Funds of Social Science (NO. 8XTQ), awarded to Qian Xiao-dong, and by the Young Scholars Science Foundation of Lanzhou Jiaotong University (NO.44).

REFERENCES

[1] Zhou Hua, Huang Li-Ping. C-means clustering algorithm based on SOM neural network. Computer Application. 7. VOL.7 NO.6 Page 5-5
[2] Guo Wei-ye, Zhao Xiao-da, Pang Ying-zhi, et al. Research on Clustering Algorithm Based on SOM Neural Network in Data Mining. Information Science. 9. VOL.6 NO.6 Page 874-876
[3] Li Gang, An Lu. Clustering analysis of E-commerce Transactions with self-organizing map. New Technology of Library and Information Service. 8. VOL.69 NO.9 Page 7-77
[4] Dong Yi-Hong, Zhuang Yue-Ting. Web log mining based on a novel competitive neural network. Journal of Computer Research and Development. 3. VOL.4 NO.5 Page 66-667
[5] Krishna K, Murty M N. Genetic K-means algorithm. IEEE Transactions on Systems, Man and Cybernetics, Part B. 1999. VOL.29 NO.3: 433-439.
[6] Kohonen T. Self-organized formation of topologically correct feature maps. Biological Cybernetics. 1982. VOL.43 NO.1: 59-69.
[7] G A Carpenter, S Grossberg. A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics and Image Processing, 1987, VOL.37: 54-115.
[8] J Kangas, T Kohonen, et al. Variants of self-organizing maps. IEEE Trans on Neural Networks, 1990, VOL.1 NO.1: 93-99
[9] Ding C, Patra J C. User Modeling for Personalized Web Search with Self-Organizing Map. Journal of the American Society for Information Science and Technology. 2007. VOL.58 NO.4: 494-507
[10] Zhao Ming-Qing, Jiang Chang-Jun, Tao Shu-feng. Dissimilarity matrix based on equivalence of cluster. Computer Science, 4, VOL.3 NO.7: 83-84
[11] Gu Zongwei, i Huiua. Based on the dissimilarity measure of the graph clustering method. Shanxi Agricultural University. 9, VOL.9 NO.3: 84-88
[12] Bing Liu (author); Yu Chung, Xue Guirong, et al. (translators). Web data mining. Tsinghua University Press, 9 NO.4: 58-66.