Comparing Dissimilarity Measures for Symbolic Data Analysis

Comaring Dissimilarity Measures for Symbolic Data Analysis Donato MALERBA, Floriana ESPOSITO, Vincenzo GIOVIALE and Valentina TAMMA Diartimento di Informatica, University of Bari Via Orabona 4 76 Bari, Italy e-mail: {malerba, esosito}@di.uniba.it, vincenzo_gio@yahoo.it, valli@csc.liv.ac.uk Abstract: Symbolic data analysis aims at extending classical data analysis techniues to manage symbolic data, which are a form of aggregated data. This extension asses through the definition of a new set of dissimilarity measures. Many of them have been reorted in the literature, but no comarative study has been erformed. This aer resents an emirical evaluation of dissimilarity measures roosed for a restricted class of symbolic data, namely Boolean symbolic obects. To define a ground truth for the emirical evaluation, a data set with a fully understandable and exlainable roerty has been selected. Emirical results show a variety of, sometimes unexected, behaviours of the comared measures. Keywords: Symbolic data analysis, Boolean symbolic obects, dissimilarity measure. 1. Introduction Most of statistical techniues for data analysis have been designed for a relatively simle situation: the unit for statistical analysis is an individual (e.g., a erson or an obect) described by a well defined set of random variables (either ualitative or uantitative), each of which result in ust one single value. Nowadays, data analysts are confronted with new challenges: they are asked to rocess data that go beyond the classical framework, as in the case of data concerning more or less homogeneous classes or grous of individuals (second-order obects) instead of single individuals (first-order obects). A tyical situation is that of census data, which raise rivacy issues in all governmental agencies that distribute them. To guarantee that data analysts cannot identify an individual or a single business establishment, data are made available in aggregate form. Data aggregations by census tracts or by enumeration districts are examles of second-order obects. Presently at the Deartment of Comuter Science, The University of Liverool, Chadwick Building, Peach St., Liverool L69 7ZF, UK. 473

Donato Malerba, Floriana Esosito, Vincenzo Gioviale and Valentina Tamma Aggregated data describe a grou of individuals by set-valued or modal variables. A variable Y defined for all elements k of a set E is termed set-valued with the domain Y if it takes its values in P(Y)={U U Y }, that is the ower set of Y. When Y(k) is finite for each k, than Y is called multi-valued. A single-valued variable is a secial case of set-valued variable for which Y(k) =1 for each k. When an order relation is defined on Y then the value returned by a set-valued variable can be exressed by an interval [,], and Y is termed an interval variable. More generally, a modal variable is a set-valued variable with a measure or a (freuency, robability or weight) distribution associated to Y(k). A class or grou of individuals described by a number of set-valued or modal variables is termed symbolic data. Symbolic data lead to more comlex data tables called symbolic data tables. The extension of classical data analysis techniues to such tables is termed symbolic data analysis [1]. Imlementing an integrated software environment for both the construction of symbolic data tables from records of individuals and the analysis of symbolic data has been the general aim of the three-years ESPRIT roect SODAS 1 (Symbolic Official Data Analysis System), concluded in November 1999. The recently started three-years IST roect ASSO (Analysis System of Symbolic Official Data) is intended to imrove the SODAS rototye with resect to several asects (management of symbolic data, new or imroved data analysis methods, new visualization techniues). An imortant module of the SODAS software is that concerning the comutation of some dissimilarity measures. Indeed, the extension of statistical techniues to symbolic data reuires the secification of some dissimilarity (or conversely, similarity) measures. Many formulations of dissimilarity measures for symbolic data have been reorted in literature [5], but no comarative study on their suitability to real-world roblems has been erformed. This aer resents an emirical evaluation of dissimilarity measures roosed for a restricted class of symbolic data, namely Boolean symbolic obects (BSOs). To define a ground truth for the emirical evaluation, a data set with a fully understandable and exlainable roerty has been selected. Emirical results show a variety of, sometimes unexected, behaviours of the comared measures. The reorted exerimentation is the first ste towards the fulfillment of one of the obectives of the ASSO roect, namely comaring various dissimilarity measures to suort users in selecting the best measure for their data analysis roblem. 2. Dissimilarity measures for Boolean Symbolic Obects Henceforth, the term dissimilarity measure d on a set of obects E refers to a real valued * function on E E such that: d a d( a, a) d( a, b) d( b, a) for all a,be. Conversely, a similarity measure s on a set of obects E is a real valued function on E E such that: * * * * * s a s( a, a) s( a, b) s( b, a) for all a,be. Generally, d a d and s a s for each obect a in E, and more secifically, d * = 1 while s * =. Studies on their roerties can be limited to dissimilarity measures alone, since it is always ossible to transform a similarity 1 The SODAS software is freely distributed by CISIA at the following URL address: htt://www.cisia.com/download.htm. 474

Comaring dissimilarity measures for symbolic data analysis measure into a dissimilarity one with the same roerties. Many methods have been reorted in the literature to derive dissimilarity measures from a matrix of observed data [4], or, more generally, for a set of symbolic obects [5]. In the following, only some measures roosed for BSOs are briefly reorted. Let a and b be two BSOs: a = [Y 1 =A 1 ] [Y 2 =A 2 ]... [Y =A ] b = [Y 1 =B 1 ] [Y 2 =B 2 ]... [Y =B ] where each variable Y takes values in a domain Y and A and B are subsets of Y. It is ossible to define a dissimilarity measure between two BSOs a and b by aggregating dissimilarity values comuted indeendently at the level of single variables Y (comonentwise dissimilarities). A classical aggregation function is the generalized Minkowski metric. However, another class of measures defined for BSOs is based on the the notion of descrition otential, (a), which is defined as the volume of the Cartesian roduct A 1 A 2... A. For this class of measures, no comonentwise decomosition is necessary, so that no function is reuired to aggregate dissimilarities comuted indeendently for each variable. The list of dissimilarity measures considered in this study are reorted in Table 1, where they are denoted as in the SODAS software, namely: U_1: Gowda and Diday s dissimilarity measure [6]; U_2: Ichino and Yaguchi s first formulation of a dissimilarity measure [7]; U_3: Ichino and Yaguchi s dissimilarity measure normalized w.r.t. domain length [7]; U_4: Ichino and Yaguchi s normalized and weighted dissimilarity measure [7]; SO_1: De Carvalho s dissimilarity measure [2]; SO_2: De Carvalho s extension of Ichino and Yaguchi s dissimilarity [2]; SO_3: De Carvalho s first dissimilarity measure based on descrition otential [3]; SO_4: De Carvalho s second dissimilarity measure based on descrition otential [3]; SO_5: De Carvalho s normalized dissimilarity measure based on descrition otential [3]; C_1: De Carvalho s normalized dissimilarity measure for constrained BSOs [3]. The term constrained BSO refers to the fact that some deendences between variables are defined, namely either hierarchical deendences (mother-daughter) which establish conditions for some variables being not measurable (not-alicable values), or logical deendences which establish the set of ossible values for a variable Y conditioned by the set of values taken by another variable Y k. An investigation of the effect of constraints on the comutation of dissimilarity measures is out of the scoe of this aer, nevertheless it is always ossible to aly the measures defined for constrained BSOs to unconstrained BSOs. This exlains why C_1 has been considered in the emirical comarison reorted in the next section. 3. Exerimental evaluation Many dissimilarity measures have been roosed for symbolic data analysis, nevertheless they have never been comared in order to understand both their common and their eculiar roerties. In this section, an emirical evaluation is reorted with reference to a data set for which a desirable behaviour of a dissimilarity measure can be defined. The data set is called Abalone data and is available at the UCI Machine Learning Reository 475

Donato Malerba, Floriana Esosito, Vincenzo Gioviale and Valentina Tamma Table 1. Dissimilarity measures available in the DI method for BSO s, and related arameters. Name Comonentwise Obectwise dissimilarity measure dissimilarity measure U_1 D () (A, B ) = D (A, B ) + () D s (A, B ) + D c (A, B ) d(a, b) = D (A, B) 1 where D (A, B ) is due to osition, D s (A, B ) is due to sanning and D c (A,B ) is due to content. U_2 (A, B ) = A B A B + (2A B d (a, b) = A B A, B ) 1 where meet () and oin () are two Cartesian oerators. U_3 ( A, B ) ( A, B ) Y d (a, b) = A, B U_4 ( A, B ) SO_1 d i (A,B ) i=1,...,5 SO_2 SO_3 none SO_4 none SO_5 none ' A, B ( A, B ) C_1 d i (A,B ) i=1,...,5 Y A, B A B 1 d (a, b) = A, B w 1 i d (a, b) = w di A, B 1 d' (a, b) = 1 ' A, 1 B d 1 (a, b) = (a b) (a b) + (2(a b) (a) (b)) ' ( a b ) ( a b ) ( 2 ( a b ) ( a ) ( b )) d2( a,b ) E ( a ) where (a E ) = [Y 1 Y 1 ] [Y Y ] ( a b ) ( a b ) ( ( a b ) ( a ) ( b )) d ' 2 3( a,b ) ( a b ) i 1 d (a, b) = w d i J 1 A, B ( ) where () is the indicator function, 476

Comaring dissimilarity measures for symbolic data analysis Table 2. Attributes of the Abalone data set Attribute Name Data Tye Unit Descrition Sex Nominal M, F. I (infant) Length Continuous mm Longest shell measurement Diameter Continuous mm Perendicular to length Height Continuous mm Measured with meat in shell Whole weight Continuous grams Weight of the whole abalone Shucked weight Continuous grams Weight of the meat Viscera weight Continuous grams Gut weight after bleeding Shell weight Continuous grams Weigh of the dried shell Rings Integer Number of rings (URL: htt://www.ics.uci.edu/~mlearn/mlreository.html). It contains 4177 cases of marine crustaceans, which are described by means of the nine attributes listed in Table 2. There are no missing values in the data. Generally this data set is used for rediction tasks. The number of rings (last attribute) is the value to be redicted from which it is ossible to know the age in years of the crustacean by adding 1.5 to the number of rings. Since the deendent attribute is integervalued, this database has been extensively investigated in emirical studies concerning regression-tree induction [8,1]. The number of rings varies between 1 and 29, with samle mean eual to 9.934 and samle standard deviation eual to 3.224 (there are few cases of crustaceans with less than 3 or more than 25 rings). The erformance of regression-tree induction systems reorted in the literature is generally high, meaning that the eight indeendent attributes are actually sufficiently informative for the intended rediction task. In other words, two abalones with the same number of rings should also resent similar values for the attributes sex, length, diameter, height, and so on. Basing uon this consideration we exect to observe that the degree of dissimilarity between crustaceans comuted on the indeendent attributes do actually be roortional to the dissimilarity in the deendent attribute (i.e., difference in the number of rings). We will call this roerty as monotonic increasing dissimilarity (shortly, MID roerty). Abalone data can be aggregated into symbolic obects, each of which corresond to a range of values for the number of rings. In articular, nine BSOs have been generated by alying the DB2SO facility [9] available in the SODAS software (see Table 3). The dissimilarity measures briefly resented in Section 2 have been alied to the BSOs reviously illustrated. In this examle the value of the arameter is set to.5, the order of ower is 2 and the weights are uniformly distributed. Results are deicted in Figure 1. Dissimilarities are reorted along the vertical axis, while BSOs are listed along the horizontal axis, in ascending order with resect to the number of rings. Each line reresents the dissimilarity between a given BSO and the subseuent BSOs in the list. The number of lines in each grah is eight, since there are nine BSOs. For the sake of clarity, the lower triangular matrix of the dissimilarities deicted in the grah labeled Abalone- U_1 is reorted in Table 4. 477

Donato Malerba, Floriana Esosito, Vincenzo Gioviale and Valentina Tamma Table 3. Boolean symbolic obects generated by artitioning the Rings attribute into nine intervals of eual length BSO Rings Sex Length Diameter Height Whole Shucked Viscera Shell 1 1-3 I,M [.8:.24] [.5:.17] [.1:.6] [.:.7] [.:.3] [.:.1] [.:.2] 2 4-6 I,M,F [.13:.66] [.9:.47] [.:.18] [.1:1.37] [.:.64] [.:.29] [.:.35] 3 7-9 I,M,F [.2:.75] [.16:.58] [.:1.13] [.4:2.33] [.2:1.25] [.1:.54] [.2:.56] 4 1- I,M,F [.29:.78] [.22:.63] [.6:.51] [.:2.78] [.4:1.49] [.2:.76] [.4:.73] 5 13-15 I,M,F [.32:.81] [.25:.65] [.8:.25] [.16:2.55] [.6:1.35] [.3:.57] [.5:.8] 6 16-18 I,M,F [.4:.77] [.31:.6] [.1:.24] [.35:2.83] [.11:1.15] [.6:.48] [.:1.] 7 19-21 I,M,F [.45:.74] [.35:.59] [.:.23] [.41:2.13] [.11:.87] [.7:.49] [.16:.85] 8 22-24 M,F [.45:.8] [.38:.63] [.14:.22] [.64:2.53] [.16:.93] [.11:.59] [.24:.71] 9 25-29 M,F [.55:.7] [.47:.58] [.18:.22] [1.6:2.18] [.32:.75] [.19:.39] [.38:.88] It is noteworthy that the MID roerty does not hold when the dissimilarity among BSOs is comuted by means of Gowda and Diday s measure (U_1). Surrisingly, old crustaceans with a high number of rings (25-29) are considered more similar to very young crustaceans with low number of rings (1-3) than to middle aged abalones with 1618 rings. Actually, for all numeric variables the dissimilarity comonents due to osition (D ) and content (D c ) increase along the horizontal axis, while the comonent due to sanning (D s ) first increases and then decreases. Sanning measures the difference between two interval widths and indeed BSO1 and BSO9 seem retty similar due to the fact that continuous variables have intervals with the same (small) width desite the fact that the intervals are uite distant. Incidentally, such intervals are relatively small since there are few cases of abalones aggregated into BSO1 and BSO9. Thus, U_1 can lead to unexected results in those cases in which BSOs are generated from uneually distributed cases with resect to a given class variable, such as rings. Also for the dissimilarity measure U_2 the MID roerty does not hold and the first BSO is the most atyical. In articular, BSO1 and BSO7 are more similar than BSO1 and BSO4. This can be exlained by noting that the dissimilarity comonent due to meet () has no effect when is set to.5, while the oin () of intervals in BSO1 and BSO7 is lower than the oin of intervals in BSO1 and BSO4, since all numerical intervals of BSO7 are included in the corresonding intervals for BSO4. Generalizing, we note that it is advisable not to nullify the effect of the meet oerator by setting =.5, otherwise anomalous similarities can be found. Similar considerations aly to the dissimilarity measures U_3 and U_4, although their normalization factor (i.e., domain Table 4. Dissimilarity values comuted by means of the dissimilarity U_1. Rings 1-3 4-6 7-9 1-13-15 16-18 19-21 22-24 25-29 1-3 4-6.76 7-9 13.817 6.6 1-13.6988 7.4184 3.8644 13-15 13.2341 6.6741 4.135 2.7576 16-18.439 7.7761 6.1147 5.225 3.3863 19-21 11.6926 7.882 7.263 6.8322 5.1771 3.946 22-24 11.3287 9.1946 8.59 7.6176 6.1752 4.8742 3.5518 25-29 9.211 11.7497.1851.65 11.1622 9.8778 8.61 7.443 478

Comaring dissimilarity measures for symbolic data analysis 15 9 6 3 1-34-67-91- Abalone - U_1 2,5 2 1,5 1,5 Abalone - U_2 1-34-67-91- 1,5 1,2,9,6,3 Abalone - U_3 1-3 4-6 7-9 1-13- 15 16-19- 22-25- 18 21 24 29,2,15,1,5 Abalone - U_4 1-3 4-6 7-9 1-,4,3,2,1 Abalone - SO_1 1-3 4-6 7-9 1-,25,2,15,1,5 Abalone - SO_2 1-3 4-6 7-9 1-1,5 1,2,9,6,3 Abalone - SO_3 1-3 4-6 7-9 1-,4,3,2,1 Abalone - SO_4 1-34-67-91- 1,2,9,6,3 Abalone - SO_5 1-34-67-91- 1,2,9,6,3 Abalone - C_1 1-3 4-6 7-9 1- Figure 1. Grahs of ten dissimilarity measures for the abalone data. 479

Donato Malerba, Floriana Esosito, Vincenzo Gioviale and Valentina Tamma cardinality) tend to reduce the effect of the missing meet comonent. On the contrary, the normalization by the oin volume (A B ), roosed by De Carvalho in SO_2, totally removes the roblem even when =.5, and the exected MID roerty is satisfied. The MID roerty is generally valid for SO_1 as well, and the atyical symbolic obect is still BSO1. However, in this case, the numerical variables have almost no effect on the comutation of the dissimilarity, since the agreement index at the numerator of each d i, namely =(A B ), is very small. As a matter of fact, the dissimilarity between two symbolic obects is comuted on the basis of the only nominal variable sex. In the case of SO_3, the atyical obect is the third, while BSO1 is considered similar to all other symbolic obects, including BSO9. Once again, the contribution of the meet is nullified by setting =.5, nevertheless the situation is uite different from that resented in the case of U_2. The effect of the oin is multilicative in the comutation of the descrition otential defined by De Carvalho, while it is additive in Ichino and Yaguchi s dissimilarity measures. In SO_3 small intervals can zero the dissimilarity, while in U_2 they have ractically no effect. On the contrary, a single large interval can have a strong imact in U_2, while it may have no effect in SO_3. In the Abalone data set, small intervals are obtained by alying the oin oerator to continuous variables of symbolic obects BSO i, i3. This exlains the strange behaviour of lines deicted in the grah Abalone SO_3. Also in SO_5, which is a normalized version of SO_3, the MID roerty is not satisfied even though it has a more regular behaviour than SO_3 since the contribution of the oin oerator on numerical variables is reduced. Finally, in the case of C_1, the MID roerty is satisfied. According to this measure all symbolic obects are uite dissimilar (the maximum distance is 1.), and the most atyical is that reresenting young abalones (BSO1). Summarizing, the following conclusions can be drawn from this emirical evaluation: 1. Only three dissimilarity measures roosed by de Carvalho, namely SO_1, SO_2 and C_1, satisfy the MID roerty. For all these measures the less tyical obect is that reresenting very young abalones (BSO1). 2. When BSOs are generated from uneually distributed cases with resect to a given class variable, the actual size of variable value sets A i might simly reflect the original distribution of cases. In articular, small value sets may be due to the scantiness of cases used in the BSO generation rocess, while large value sets may occur because of the natural variability in a large oulation of cases used to synthesize BSOs. When this haens, distance measures based on the sanning factor (e.g., U_1) may lead to unexected results. 3. In Ichino and Yaguchi s measures (i.e., U_2, U_3 and U_4), the contribution of the meet oerator should not be nullified by setting the arameter to.5. An intermediate value between and.5 is generally recommended. Moreover, the normalization roosed by de Carvalho (see SO_2) show better results. 4. In the case of continuous variables, the width of value intervals is critical, since dissimilarities measures based on additive aggregation tend to return high values when only one comonentwise dissimilarity is uite large, while measures based on descrition otential return small values when only one comonentwise dissimilarity is uite small. 48

Comaring dissimilarity measures for symbolic data analysis 4. Conclusions In symbolic data analysis a key role is layed by the comutation of dissimilarity measures. Many measures have been roosed in the literature, although a comarison that investigates their alicability to real data has never been reorted. The main difficulty was due to the lack of a standard in the reresentation of symbolic obects and the necessity of imlementing many dissimilarity measures. The software roduced by the ESPRIT Proect SODAS has artially solved this roblem by defining a suite of modules that enable the generation, visualization and maniulation of symbolic obects. In this work, a comarative study of the dissimilarity measures for BSOs is reorted with reference to a articular data set for which an exected roerty could be defined. Interestingly enough, such a roerty has been observed only for some dissimilarity measures, which actually show very different behaviours. There are a number of ossible directions for future research. One is to exeriment whether other data sets with fully understandable and exlainable roerties related to the roximity concet. Another direction is to extend the emirical evaluation to dissimilarity measure defined on robabilistic symbolic obects. A third direction is to develo new dissimilarity measures for symbolic data that remove the two basic assumtions, namely variable indeendence and eual attribute relevance. Acknowledgements This work was suorted artly by the Euroean IST roect no. 25161 (ASSO). References [1] Bock, H.H., Diday, E. (eds.): Analysis of Symbolic Data.Exloratory Methods for Extracting Statistical Information from Comlex Data, Series: Studies in Classification, Data Analysis, and Knowledge Organisation, Vol. 15, Sringer-Verlag, Berlin, (2), ISBN 3-54-66619-2 [2] de Carvalho, F.A.T.: Proximity coefficients between Boolean symbolic obects. In: Diday, E. et al. (eds.): New Aroaches in Classification and Data Analysis, Series: Studies in Classification, Data Analysis, and Knowledge Organisation, Vol. 5, Sringer-Verlag, Berlin, (1994) 387-394. [3] de Carvalho, F.A.T.: Extension based roximity coefficients between constrained Boolean symbolic obects. In: Hayashi, C. et al. (eds.): Proc. of IFCS 96, Sringer, Berlin, (1998) 37-378. [4] Esosito, F., Malerba, D., Tamma, V., and Bock, H.H.: Classical resemblance measures. In: Bock, H.H, Diday, E. (eds.): Analysis of Symbolic Data. Exloratory Methods for extracting Statistical Information from Comlex Data, Series: Studies in Classification, Data Analysis, and Knowledge Organisation, Vol. 15, Sringer-Verlag, Berlin, (2) 139-152 [5] Esosito, F., Malerba, D., Tamma, V.: Dissimilarity Measures for Symbolic Obects. In: Bock, H.H., Diday, E. (eds.): Analysis of Symbolic Data. Exloratory Methods for extracting Statistical Information from Comlex Data, Series: Studies in Classification, Data Analysis, and Knowledge Organisation, Vol. 15, Sringer-Verlag, Berlin, (2) 165-185 [6] Gowda, K. C., Diday, E.: Symbolic clustering using a new dissimilarity measure. In Pattern Recognition, Vol. 24, No. 6, (1991) 567-578. [7] Ichino, M., Yaguchi, H.: Generalized Minkowski Metrics for Mixed Feature-Tye Data Analysis. IEEE Transactions on Systems, Man, and Cybernetics, Vol. 24, No. 4, (1994) 698-77 [8] Robnik-Šikona, M., Kononenko, I.: Pruning regression trees with MDL. In: Prade, H. (ed.): Proc. of the 13th Euroean Conference on Artificial Intelligence, John Wiley & Sons, Chichester, England, (1998) 455-459. [9] Stéhan, V., Hébrail, G., and Lechevallier, Y.: Generation of Symbolic Obects from Relational Databases. In: Bock, H.H., Diday, E. (eds.): Analysis of Symbolic Data. Exloratory Methods for extracting Statistical Information from Comlex Data, Series: Studies in Classification, Data Analysis, and Knowledge Organisation, Vol. 15, Sringer-Verlag, Berlin, (2) 78-15. [1] Torgo, L.: Functional Models for Regression Tree Leaves. In: Fisher, D. (ed.): Proc. of the International Machine Learning, Morgan Kaufmann, San Francisco, CA, (1997) 385-393. 481