The 9th International Days of Statistics and Economics, Prague, September 10-12, 2015

NOMCLUST: AN R PACKAGE FOR HIERARCHICAL CLUSTERING OF OBJECTS CHARACTERIZED BY NOMINAL VARIABLES

Zdeněk Šulc, Hana Řezanková

Abstract

Clustering of objects characterized by nominal (categorical) variables is still not sufficiently supported in software systems. Most of them provide only hierarchical cluster analysis with one basic similarity measure for nominal data, the simple matching coefficient, which gives worse results than clustering based on some of the recently proposed measures. In this paper, we introduce the nomclust R package, which completely covers hierarchical clustering of objects characterized by nominal variables, from the proximity matrix computation to the evaluation of the final clusters. It offers a choice of 11 similarity measures and three linkage methods of hierarchical clustering, and it uses six evaluation criteria based on the mutability and the entropy. Some of the criteria were constructed for comparing different clustering approaches, and some for determining the optimal number of clusters. The package contains both the main function, which covers the whole clustering process, and a set of subsidiary functions, which compute only a part of the clustering process, e.g. a proximity matrix or the evaluation of a given cluster solution. We illustrate the methodology of this package on a small dataset.

Key words: cluster analysis, nominal variables, similarity measures, evaluation criteria

JEL Code: C38, C88

Introduction

In this paper, we present the nomclust package for the R software, which was developed for hierarchical clustering of objects characterized by nominal variables. The package is available on the Comprehensive R Archive Network (CRAN) web site¹. We developed it because software implementations of similarity measures designed for nominal variables are lacking. Nominal variables are currently clustered by means of methods based on the simple matching (SM) coefficient, which are usually used in statistical software.
This treatment is not sufficient, though, because it neglects important characteristics of a dataset, such as the frequencies of categories or the number of categories, which could be used for better similarity determination. Another possibility is to use measures designed for mixed-type data. One of the best known is the Gower similarity coefficient, see (Gower, 1971), which can be found e.g. in the function daisy in the R package cluster; another is the log-likelihood distance in two-step cluster analysis, see (Chiu et al., 2001), in the IBM SPSS software. However, the Gower coefficient treats nominal variables in the same way as measures based on the SM coefficient, and as far as we know, the log-likelihood distance is not available in any non-commercial software. A further possibility is to transform the original nominal variables into a set of dummy variables, to which measures for quantitative or binary data can be applied. However, this procedure always entails some loss of information caused by the dummy transformation. Overall, current software solutions do not provide satisfactory results.

¹ http://cran.r-project.org/

In recent years, many similarity measures have been proposed for nominal variables, e.g. the Lin measure, see (Lin, 1998), or the Eskin measure, see (Eskin et al., 2002). Many of them provide very good results as well, see (Šulc and Řezanková, 2014) or (Šulc, 2015). The problem is that those measures have not become widely known, mostly due to the lack of software implementation. By introducing our R package, we aim to address this problem. The package contains 11 similarity measures designed for nominal data, and it completely covers the clustering process, from the proximity matrix computation, through hierarchical cluster analysis, to the computation of evaluation criteria for the resulting clusters. In this paper, we introduce the functionality of this package in detail.

The paper is organized as follows.
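Section 1 offers a short introduction to the similarity measures, linkage methods, and evaluation criteria used. Section 2 presents several typical use cases on a simple example. Final remarks are given in the Conclusion.

1 Theoretical background of the package

The package contains 11 similarity measures for nominal data; 10 of them were summarized in (Boriah et al., 2008), and one was introduced by Morlini and Zani (2012). All the measures were examined more thoroughly in (Šulc and Řezanková, 2014) or (Šulc, 2015). Tab. 1 presents the equations which determine the similarity between the i-th and j-th objects by the c-th variable. This is the first (and the most important) level of a proximity matrix computation. On the second level, the similarity of objects across all variables is computed. The last level involves the computation of the dissimilarity. All formulas in Tab. 1 are based on the data matrix $X = [x_{ic}]$, where $i = 1, 2, \ldots, n$ ($n$ is the total number of objects) and $c = 1, 2, \ldots, m$ ($m$ is the total number of variables). The number of categories of the c-th variable is denoted as $n_c$, the absolute frequency as $f$, and the relative frequency as $p$.

Tab. 1: Similarity measures in the nomclust package (without dummy transformation)

Measure: similarity between the i-th and j-th objects by the c-th variable

Eskin:
$$S_c(x_{ic}, x_{jc}) = \begin{cases} 1 & \text{if } x_{ic} = x_{jc} \\ \dfrac{n_c^2}{n_c^2 + 2} & \text{otherwise} \end{cases}$$

Goodall 1:
$$S_c(x_{ic}, x_{jc}) = \begin{cases} 1 - \sum_{q \in Q} p^2(q) & \text{if } x_{ic} = x_{jc} \\ 0 & \text{otherwise} \end{cases} \qquad Q \subseteq X_c: \forall q \in Q,\ p(q) \le p(x_{ic})$$

Goodall 2:
$$S_c(x_{ic}, x_{jc}) = \begin{cases} 1 - \sum_{q \in Q} p^2(q) & \text{if } x_{ic} = x_{jc} \\ 0 & \text{otherwise} \end{cases} \qquad Q \subseteq X_c: \forall q \in Q,\ p(q) \ge p(x_{ic})$$

Goodall 3:
$$S_c(x_{ic}, x_{jc}) = \begin{cases} 1 - p^2(x_{ic}) & \text{if } x_{ic} = x_{jc} \\ 0 & \text{otherwise} \end{cases}$$

Goodall 4:
$$S_c(x_{ic}, x_{jc}) = \begin{cases} p^2(x_{ic}) & \text{if } x_{ic} = x_{jc} \\ 0 & \text{otherwise} \end{cases}$$

IOF:
$$S_c(x_{ic}, x_{jc}) = \begin{cases} 1 & \text{if } x_{ic} = x_{jc} \\ \dfrac{1}{1 + \ln f(x_{ic}) \cdot \ln f(x_{jc})} & \text{otherwise} \end{cases}$$

Lin:
$$S_c(x_{ic}, x_{jc}) = \begin{cases} 2 \ln p(x_{ic}) & \text{if } x_{ic} = x_{jc} \\ 2 \ln\left(p(x_{ic}) + p(x_{jc})\right) & \text{otherwise} \end{cases}$$

Lin 1:
$$S_c(x_{ic}, x_{jc}) = \begin{cases} \ln \sum_{q \in Q} p(q) & \text{if } x_{ic} = x_{jc} \\ 2 \ln \sum_{q \in Q} p(q) & \text{otherwise} \end{cases} \qquad Q \subseteq X_c: \forall q \in Q,\ p(x_{ic}) \le p(q) \le p(x_{jc})$$

OF:
$$S_c(x_{ic}, x_{jc}) = \begin{cases} 1 & \text{if } x_{ic} = x_{jc} \\ \dfrac{1}{1 + \ln \frac{n}{f(x_{ic})} \cdot \ln \frac{n}{f(x_{jc})}} & \text{otherwise} \end{cases}$$

SM:
$$S_c(x_{ic}, x_{jc}) = \begin{cases} 1 & \text{if } x_{ic} = x_{jc} \\ 0 & \text{otherwise} \end{cases}$$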
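The three levels of the proximity computation can be illustrated with a minimal base-R sketch for the SM, Eskin, and IOF measures. It is not the package's implementation: the toy data and the function names (`sim_sm`, `sim_eskin`, `sim_iof`, `similarity_matrix`) are hypothetical, and both the second-level aggregation (a plain average over variables) and the third-level conversion (1 - S, shown for SM only) are assumptions made for illustration, since the text does not spell them out.

```r
# Toy nominal data: 6 objects, 2 variables (hypothetical values).
toy <- data.frame(
  colour = c("red", "red", "blue", "blue", "blue", "green"),
  shape  = c("disc", "cube", "cube", "disc", "cube", "cube"),
  stringsAsFactors = TRUE
)

# Level 1: similarity of objects i and j by a single variable x (a factor).
sim_sm <- function(x, i, j) as.numeric(x[i] == x[j])

sim_eskin <- function(x, i, j) {
  n_c <- nlevels(droplevels(x))        # number of categories of the variable
  if (x[i] == x[j]) 1 else n_c^2 / (n_c^2 + 2)
}

sim_iof <- function(x, i, j) {
  if (x[i] == x[j]) return(1)
  f <- table(x)                        # absolute frequencies of the categories
  f_i <- as.numeric(f[as.character(x[i])])
  f_j <- as.numeric(f[as.character(x[j])])
  1 / (1 + log(f_i) * log(f_j))
}

# Level 2 (assumption): average the per-variable similarities.
similarity_matrix <- function(data, sim_fun) {
  n <- nrow(data)
  s <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      s[i, j] <- mean(vapply(data, function(x) sim_fun(x, i, j), numeric(1)))
    }
  }
  s
}

s_sm    <- similarity_matrix(toy, sim_sm)
s_eskin <- similarity_matrix(toy, sim_eskin)
s_iof   <- similarity_matrix(toy, sim_iof)

# Level 3 (assumption): convert the SM similarity to a dissimilarity as 1 - S.
d_sm <- 1 - s_sm
```

In this sketch, two objects that differ in a variable get similarity 0 under SM, whereas Eskin and IOF still assign them a positive value that depends on the number of categories or on the category frequencies. This is the extra information the SM coefficient neglects.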
The Morlini and Zani similarity measure uses the dummy transformation; thus, it is computed in a different way than the rest of the measures. The similarity between the objects $x_i$ and $x_j$ is expressed as

$$S(x_i, x_j) = \frac{\sum_{c=1}^{m} \sum_{u=1}^{K_c} \delta(x_{icu}, x_{jcu}) \ln \frac{1}{f_{cu}^2}}{\sum_{c=1}^{m} \sum_{u=1}^{K_c} \delta(x_{icu}, x_{jcu}) \ln \frac{1}{f_{cu}^2} + \sum_{c=1}^{m} \sum_{u=1}^{K_c} \delta'(x_{icu}, x_{jcu}) \ln \frac{1}{f_{cu}^2}}, \tag{1}$$

where $u = 1, 2, \ldots, K_c$ is the index of the u-th dummy variable of the c-th original variable, and $f_{cu}^2$ is the second power of the frequency of the u-th category of the c-th original variable. When using Eq. (1), the following conditions must hold:

$$\delta(x_{icu}, x_{jcu}) = \begin{cases} 1 & \text{if } x_{icu} = 1 \text{ and } x_{jcu} = 1, \\ 0 & \text{otherwise,} \end{cases}$$

$$\delta'(x_{icu}, x_{jcu}) = \begin{cases} 1 & \text{if } x_{icu} = 1,\ x_{jct} = 1 \text{ and } x_{ict} = x_{jcu} = 0, \text{ for } u \neq t, \\ 0 & \text{otherwise.} \end{cases}$$

After the proximity matrix is computed, hierarchical cluster analysis is performed. The package offers three linkage methods: complete, single, and average. They differ in how they determine the dissimilarity between two clusters. The complete linkage method takes the dissimilarity between the two furthest objects from two different clusters. The single linkage method uses the dissimilarity between the two closest objects, and the average linkage method takes the average pairwise dissimilarity between objects in two different clusters.

The resulting clusters can be evaluated by six criteria, which were introduced in (Řezanková et al., 2011). They are based on the within-cluster variability, which can be measured either by the mutability or by the entropy. The criteria are presented in Tab. 2, where $k$ is the number of clusters, $n_g$ is the number of objects in the g-th cluster ($g = 1, 2, \ldots, k$), $n_{gcu}$ is the number of objects in the g-th cluster with the u-th category of the c-th variable ($u = 1, 2, \ldots, K_c$; $K_c$ is the number of categories of the c-th variable), and $c = 1, 2, \ldots, m$ ($m$ is the number of variables). WCM and WCE measure the within-cluster variability in a dataset. They take values from zero to one, where values close to one indicate high variability. With an increasing number of clusters, the within-cluster variability decreases, and thus the values of these indices decrease.
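As a rough illustration of how the two within-cluster variability coefficients behave, the following base-R sketch computes them for a given partition. It assumes the usual normalized forms, i.e. the mutability $1 - \sum_u (n_{gcu}/n_g)^2$ scaled by $K_c/(K_c - 1)$ and the entropy $-\sum_u (n_{gcu}/n_g) \ln(n_{gcu}/n_g)$ scaled by $1/\ln K_c$, each weighted by $n_g/(n \cdot m)$. The function name `within_cluster_variability` is hypothetical, and the code is not the package's own implementation; it also assumes every variable has at least two categories.

```r
# Within-cluster mutability (WCM) and entropy (WCE) of a partition.
# data:    data frame of nominal variables, one row per object
# cluster: vector of cluster labels, one per object
within_cluster_variability <- function(data, cluster) {
  n <- nrow(data)
  m <- ncol(data)
  wcm <- 0
  wce <- 0
  for (g in unique(cluster)) {
    members <- data[cluster == g, , drop = FALSE]
    n_g <- nrow(members)
    for (c_idx in seq_len(m)) {
      # K_c: number of categories of the variable in the whole dataset
      K_c <- length(unique(data[[c_idx]]))
      # relative frequencies of categories inside cluster g
      p <- as.numeric(table(members[[c_idx]])) / n_g
      p <- p[p > 0]                      # empty categories contribute nothing
      mutability <- 1 - sum(p^2)
      entropy <- -sum(p * log(p))
      wcm <- wcm + n_g / (n * m) * K_c / (K_c - 1) * mutability
      wce <- wce + n_g / (n * m) * entropy / log(K_c)
    }
  }
  c(WCM = wcm, WCE = wce)
}

# Hypothetical toy data: two variables with two categories each.
toy <- data.frame(a = c("x", "x", "y", "y"),
                  b = c("u", "u", "v", "v"))

one_cluster  <- within_cluster_variability(toy, c(1, 1, 1, 1))
two_clusters <- within_cluster_variability(toy, c(1, 1, 2, 2))
```

For this toy data, a single cluster with evenly spread categories gives the maximal variability, while the two-cluster partition is perfectly homogeneous, so both coefficients drop to zero.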
The PSTau and PSU coefficients take values from zero to one as well. This time, values close to one indicate low within-cluster variability and vice versa. The pseudo F coefficients, originally defined for quantitative data, see e.g. (Löster, 2014), are based here on the mutability or the entropy for nominal variables. They were developed to determine the optimal cluster solution, i.e. the solution with the relatively highest decrease of within-cluster variability. The maximal value should indicate the best solution.

Tab. 2: Evaluation criteria in the nomclust package

Within-cluster mutability coefficient (WCM):
$$WCM(k) = \sum_{g=1}^{k} \frac{n_g}{n \cdot m} \sum_{c=1}^{m} \frac{K_c}{K_c - 1} \left(1 - \sum_{u=1}^{K_c} \left(\frac{n_{gcu}}{n_g}\right)^2\right)$$

Within-cluster entropy coefficient (WCE):
$$WCE(k) = \sum_{g=1}^{k} \frac{n_g}{n \cdot m} \sum_{c=1}^{m} \frac{1}{\ln K_c} \left(-\sum_{u=1}^{K_c} \frac{n_{gcu}}{n_g} \ln \frac{n_{gcu}}{n_g}\right)$$

Pseudo tau coefficient (PSTau):
$$PSTau(k) = \frac{WCM(1) - WCM(k)}{WCM(1)}$$

Pseudo uncertainty coefficient (PSU):
$$PSU(k) = \frac{WCE(1) - WCE(k)}{WCE(1)}$$

Pseudo F coefficient based on the mutability (PSFM):
$$PSFM(k) = \frac{(n - k)\left(WCM(1) - WCM(k)\right)}{(k - 1)\, WCM(k)}$$

Pseudo F coefficient based on the entropy (PSFE):
$$PSFE(k) = \frac{(n - k)\left(WCE(1) - WCE(k)\right)}{(k - 1)\, WCE(k)}$$

Coefficients based on the mutability and the entropy usually do not differ very much. They are included in the package in order to provide two independent ways of computing variability. Big differences between mutability-based and entropy-based coefficients should attract the researcher's attention. The use of two types of coefficients is especially useful when determining the optimal number of clusters using the pseudo F coefficients. If both coefficients prefer the same cluster solution, there is strong evidence for choosing this particular solution.

2 Illustrations of use

There are three possible ways to use the nomclust package. One can use it to compute a proximity matrix using one of the 11 available similarity measures, or to perform the complete clustering process, where the input is the data matrix X and the output consists of cluster membership variables and a set of evaluation criteria. The third way is to compute a set of evaluation criteria for a dataset with cluster membership variables which were obtained by any other clustering process. In this section, all three ways are described in detail. The examples are demonstrated on the dataset data20, which contains five nominal variables, each with 20 observations. It is available in the package after typing:

R> data(data20)

2.1 Proximity matrix computation

If one wants to compute a proximity matrix using only one of the available similarity measures, e.g. for research purposes, the best way is to call the function of the particular measure directly. All the function calls are presented in Tab. 3. The resulting proximity matrices have a standard form: their size is n × n, and they have zero diagonal elements. Thus, they can be used as input in other packages as well, for instance in hclust in the stats package.

Tab. 3: Function calls for proximity matrix computation

Measure     Function call        Measure   Function call
Eskin       R> eskin(data)       Morlini   R> morlini(data)
Goodall 1   R> good1(data)       Lin       R> lin(data)
Goodall 2   R> good2(data)       Lin 1     R> lin1(data)
Goodall 3   R> good3(data)       OF        R> of(data)
Goodall 4   R> good4(data)       SM        R> sm(data)
IOF         R> iof(data)

2.2 Complete clustering process

The most complex function in the package is the nomclust function. It covers the whole clustering process and has the following syntax:

R> nomclust(data, measure = "iof", clu_low = 2, clu_high = 6, eval = TRUE, prox = FALSE, method = "complete")

The only mandatory parameter is data, which represents the input data for the analysis. The remaining parameters have default values that should suit the majority of users, but they can be changed if needed. The parameter measure chooses the similarity measure for the analysis.
As a default choice, the IOF measure is used, because it often provides the best results, see (Šulc and Řezanková, 2014). In general, any of the measures from Tab. 3 can be used in its place. The parameters clu_low and clu_high determine the lower and upper bounds for the number of created cluster solutions. There are two logical parameters, eval and prox, which enable computing the evaluation criteria and displaying the proximity matrix, respectively. The last parameter is method, which selects one of the three linkage methods. By default, the complete linkage method is chosen, because it usually provides the best results; see (Šulc, 2014). In the package, the agnes algorithm from the cluster package is used for performing the hierarchical cluster analysis.

To use the nomclust function on the data20 dataset with the IOF measure and a displayed proximity matrix, the following syntax is needed:

R> data <- nomclust(data20, prox = TRUE)

The output of the nomclust function has the form of a list with up to three components. The mem component contains cluster memberships for all cases. Because the output would be large, only the first six rows are displayed:

R> clu_mem <- head(data$mem)
  clu_2 clu_3 clu_4 clu_5 clu_6
1     1     1     1     1     1
2     1     1     1     1     1
3     1     2     2     2     2
4     1     1     1     3     3
5     1     1     1     1     1
6     2     3     3     4     4

The evaluation criteria can be found in the eval component after typing:

R> clu_eval <- data$eval
  cluster    WCM    WCE  PSTau    PSU   PSFM   PSFE
1       1 0.9666 0.9600     NA     NA     NA     NA
2       2 0.7330 0.7010 0.1987  0.084 4.4646  4.738
3       3  0.617 0.5789 0.3365 0.3607 4.3106 4.7949
4       4 0.4546 0.4066 0.5005 0.5491 5.3446 6.4961
5       5 0.4136 0.3658 0.5373 0.5807 4.3551 5.1943
6       6 0.3600 0.3085 0.6004 0.6514  4.074   5.37

The output contains the values of all the evaluation criteria presented in Tab. 2 for the defined range of clusters. The first row always shows the variability of the whole dataset by means of the WCM and WCE coefficients; the other evaluation criteria are always NA in this row. The remaining rows show how the within-cluster variability decreases as the number of clusters grows. Both the PSFM and the PSFE coefficients prefer the four-cluster solution.

The prox component contains the proximity matrix if the logical parameter prox is set to TRUE. Since the proximity matrix would be large, the output is omitted.

R> clu_prox <- data$prox

2.3 Clustering evaluation

Sometimes there is a need to evaluate a set of cluster solutions obtained by a similarity measure or method which is not available in the nomclust package, for instance, results obtained by two-step cluster analysis in IBM SPSS. For such cases, one can use the evalclust function, which has the following syntax:

R> evalclust(data, num_var, clu_low = 2, clu_high = 6)

The function has two mandatory parameters: data, which is the original dataset followed by the cluster membership variables in increasing order, and num_var, which defines the number of original variables in the dataset. Again, the clu_low and clu_high parameters define the range of cluster solutions used. Thus, the computation of evaluation criteria for the data20 dataset with five additional cluster membership variables (2 to 6 clusters) can be written in the following way:

R> evalclust(data20, 5, clu_low = 2, clu_high = 6)
  cluster    WCM    WCE  PSTau    PSE   PSFM   PSFE
1       1 0.9666 0.9600     NA     NA     NA     NA
2       2 0.7601  0.714 0.1943  0.366 4.3400  5.580
3       3 0.6118 0.5670 0.3490 0.4034 4.5570 5.7473
4       4 0.4615  0.400 0.4851 0.5488   5.05 6.4857
5       5 0.3938  0.384  0.566 0.6355 4.8955 6.5379
6       6  0.383  0.649  0.693 0.6941 4.7540  6.354

The output has the same form as the eval component of the nomclust function, so the outputs can be compared directly. For instance, in this particular case, two-step cluster analysis provides results comparable to those obtained by the IOF measure in the nomclust function. Moreover, PSFM and PSFE determine the same optimal solution in the case of IOF, whereas the solutions they indicate for two-step cluster analysis differ. For more information about the clustering performance of two-step cluster analysis, see e.g. (Šulc and Řezanková, 2014).

Conclusion

In this paper, we introduced the main features of the nomclust package. We think it offers a comprehensive treatment of nominal clustering which cannot be found in any other R package or commercial software. We are aware that R package development is a living process, and thus we are going to improve the package steadily. In the next releases, we would like to focus on performance optimization, because the computation of proximity matrices takes a lot of time for large datasets. Next, we will incorporate graphical outputs of the analysis, which should make the cluster evaluation process even easier.

Acknowledgments

This paper was processed with the contribution of long-term institutional support of research activities by the Faculty of Informatics and Statistics, University of Economics, Prague.

References

Boriah, S., Chandola, V., Kumar, V. (2008). Similarity measures for categorical data: a comparative evaluation. In Proceedings of the 8th SIAM International Conference on Data Mining. SIAM, 243-254.

Chiu, T., Fang, D., Chen, J., Wang, Y., Jeris, C. (2001). A robust and scalable clustering algorithm for mixed type attributes in large database environment. In Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, 263.

Gower, J. C. (1971). A general coefficient of similarity and some of its properties. In Biometrics, 27(4), 857-871.
Eskin, E., Arnold, A., Prerau, M., Portnoy, L., Stolfo, S. (2002). A geometric framework for unsupervised anomaly detection. In Applications of Data Mining in Computer Security, 78-100.

Morlini, I., Zani, S. (2012). A new class of weighted similarity indices using polytomous variables. In Journal of Classification, 29(2), 199-226.

Lin, D. (1998). An information-theoretic definition of similarity. In ICML '98: Proceedings of the 15th International Conference on Machine Learning. San Francisco: Morgan Kaufmann Publishers Inc., 296-304.

Löster, T. (2014). The evaluation of CHF coefficient in determining the number of clusters using Euclidean distance measure. In The 8th International Days of Statistics and Economics. Slaný: Melandrium, 858-869. http://msed.vse.cz/msed_2014/article/463-loster-tomaspaper.pdf.

Řezanková, H., Löster, T., Húsek, D. (2011). Evaluation of categorical data clustering. In Mugellini, E., Szczepaniak, P. S., Pettenati, M. C. et al. (Eds.), Advances in Intelligent Web Mastering 3. Berlin: Springer Verlag, 173-182.

Šulc, Z. (2014). Similarity Measures for Nominal Variable Clustering. In The 8th International Days of Statistics and Economics. Slaný: Melandrium, 1536-1545. http://msed.vse.cz/msed_2014/article/75-sulc-zdenek-paper.pdf.

Šulc, Z. (2015). Application of Goodall's and Lin's similarity measures in hierarchical clustering. In Sborník prací vědeckého semináře doktorského studia FIS VŠE. Praha: Oeconomica, 112-118. http://fis.vse.cz/wp-content/uploads/2015/01/dd_fis_2015_cely_sbornik.pdf.

Šulc, Z., Řezanková, H. (2014). Evaluation of recent similarity measures for categorical data. In Proceedings of the 17th International Conference Applications of Mathematics and Statistics in Economics. Wroclaw: Wydawnictwo Uniwersytetu Ekonomicznego we Wrocławiu, 249-258.

Zdeněk Šulc
University of Economics, Prague, Dept. of Statistics and Probability
W. Churchill sq. 4, 130 67 Prague 3, Czech Republic
zdenek.sulc@vse.cz

Hana Řezanková
University of Economics, Prague, Dept. of Statistics and Probability
W. Churchill sq. 4, 130 67 Prague 3, Czech Republic
hana.rezankova@vse.cz