Visual Cluster Analysis in Data Mining

A thesis submitted in fulfilment of the requirements for the Degree of Doctor of Philosophy

by Ke-Bing Zhang
October 2007

Master (Honours) of Science (Macquarie University) 2002
Bachelor of Engineering (Tianjin University of Technology) 1987

Department of Computing
Division of Information and Communication Sciences
Macquarie University, NSW 2109, Australia

© 2007 Ke-Bing Zhang
To my parents Zhang Lu-de and Zhang Xiao-lin
ACKNOWLEDGMENT

First of all, I am deeply indebted to my supervisor, Associate Professor Mehmet A. Orgun, for providing me with supervision, motivation and encouragement throughout the course of this work. His insight, breadth of knowledge and enthusiasm have been invaluable to my training as a researcher. He led me in the right direction at every stage of the research. Without his care, supervision and friendship, I would not have been able to complete this work. I am also grateful to my co-supervisor, Professor Kang Zhang, for his suggestions and guidance of my research. His help has been integral to the success of this work. I am faithfully indebted to my parents Zhang Lu-de and Zhang Xiao-lin for their love, forever affection, patience and constant encouragement. I deeply thank my wife Liu Yu for her love and understanding, especially for comforting me when I encountered sorrow and loneliness. I would like to express my appreciation to my brother, Professor Kewei Zhang, for his comments on the mathematics in this work. Finally, my thanks also go to other faculty and staff members of the Department of Computing, and to my fellow graduate students, for providing a friendly and enjoyable environment during my time here.
DECLARATION

I hereby certify that the work embodied in this thesis is the result of original research. This work has not been submitted for a higher degree to any other university or institution.

Signed:
Date:
ABSTRACT

Cluster analysis is a widely applied technique in data mining. However, most of the existing clustering algorithms are not efficient at dealing with the arbitrarily shaped data distributions of extremely large and high-dimensional datasets. On the other hand, statistics-based cluster validation methods incur a very high computational cost in cluster analysis, which prevents clustering algorithms from being used effectively in practice. Visualization techniques have been introduced into cluster analysis. However, most visualization techniques employed in cluster analysis are mainly used as tools for information rendering, rather than for investigating how data behavior changes with variations in the parameters of the algorithms. In addition, the impreciseness of visualization limits its usability in contrasting the grouping information of data. This thesis proposes a visual approach called HOV3 (Hypothesis Oriented Verification and Validation by Visualization) to assist data miners in cluster analysis. HOV3 employs quantified domain knowledge, statistical measures, and explorative observation as predictions to project high-dimensional data onto 2D space, revealing the gaps between the data distribution and the predictions. Based on the quantified measurement capability of HOV3, this thesis also proposes a visual external cluster validation method to verify the stability of clustering results by comparing the data distributions of a clustered subset and non-clustered subsets projected by HOV3. With this method, data miners can perform an intuitive visual assessment and obtain a precise evaluation of the consistency of the cluster structure. This thesis also introduces a visual approach called M-HOV3 to enhance the visual separation of clusters based on the projection technique of HOV3. With enhanced separation of clusters, data miners can explore cluster distributions intuitively as well as deal with cluster validation effectively in HOV3.
As a consequence, with the advantage of the quantified measurement feature of HOV3, data miners can identify the cluster number efficiently in the pre-processing stage of clustering, and also verify the membership formation of clusters effectively in the post-processing stage of clustering in data mining.
Table of Contents

LIST OF FIGURES ... VI
CHAPTER 1. INTRODUCTION ... 1
CHAPTER 2. CONTRIBUTIONS ... 5
CHAPTER 3. CLUSTER ANALYSIS ... 11
  3.1 Clustering and Clustering Algorithms ... 11
    3.1.1 Partitioning methods ... 12
    3.1.2 Hierarchical methods ... 14
    3.1.3 Density-based methods ... 16
    3.1.4 Grid-based methods ... 17
    3.1.5 Model-based clustering methods ... 18
  3.2 Cluster Validation ... 19
    3.2.1 Internal criteria ... 19
    3.2.2 Relative criteria ... 21
    3.2.3 External criteria ... 22
  3.3 The issues of cluster analysis ... 23
CHAPTER 4. VISUAL CLUSTER ANALYSIS ... 25
  4.1 Multidimensional Data Visualization ... 26
    4.1.1 Icon-based techniques ... 26
    4.1.2 Pixel-oriented techniques ... 28
    4.1.3 Geometric techniques ... 31
  4.2 Visual Cluster Analysis ... 34
    4.2.1 MDS and PCA ... 34
    4.2.2 HD-Eye ... 35
    4.2.3 Grand Tour ... 37
    4.2.4 Hierarchical BLOB ... 38
    4.2.5 SOM ... 39
    4.2.6 FastMap ... 40
    4.2.7 OPTICS ... 40
    4.2.8 Star Coordinates and VISTA ... 41
  4.3 Major Challenges ... 45
    4.3.1 Requirements of Visualization in Cluster Analysis ... 45
    4.3.2 Motivation ... 45
  4.4 Our Approach ... 46
    4.4.1 HOV3 Model ... 47
    4.4.2 External Cluster Validation by HOV3 ... 48
    4.4.3 Enhancing the Separation of Clusters by HOV3 ... 49
    4.4.4 Prediction-based Cluster Analysis by HOV3 ... 50
CHAPTER 5. CONCLUSION AND FUTURE WORK ... 53
  5.1 Conclusion ... 53
  5.2 Future Work ... 54
    5.2.1 Three Dimensional HOV3 ... 55
    5.2.2 Dynamic Visual Cluster Analysis ... 55
    5.2.3 Quasi-Cluster Data Points Collection ... 56
    5.2.4 Combination of Fuzzy Logic Approaches and HOV3 ... 57
APPENDIX ... 59
BIBLIOGRAPHY ... 61
List of Figures

Figure 3-1 An example of the clustering procedure of K-means [HaK01] ... 13
Figure 3-2 Hierarchical Clustering Process [HaK01] ... 15
Figure 3-3 The grid-cell structure of grid-based clustering methods ... 17
Figure 3-4 External criteria based validation [ZOZ07a] ... 22
Figure 4-1 An example of Chernoff-Faces ... 27
Figure 4-2 Stick Figure Visualization Technique ... 27
Figure 4-3 Stick Figure Visualization of the Census Data ... 28
Figure 4-4 Displaying attribute windows for data with six attributes ... 29
Figure 4-5 Illustration of the Recursive Pattern Technique ... 30
Figure 4-6 The Recursive Pattern Technique in VisDB [KeK94] ... 30
Figure 4-7 Scatterplot-Matrices [Cle93] ... 31
Figure 4-8 15,000 coloured data items in Parallel Coordinates ... 32
Figure 4-9 Star plots of data items [SGF71] ... 33
Figure 4-10 Clustering of 1352 genes in MDS by [Bes] ... 35
Figure 4-12 The framework of the HD-Eye system and its different visualization projections ... 36
Figure 4-13 The 3D data structures in HD-Eye and their intersection trails on the planes ... 36
Figure 4-14 The Grand Tour Technique and its 3D example ... 38
Figure 4-15 Cluster hierarchies shown for 1, 5, 10 and 20 clusters [SBG00] ... 38
Figure 4-16 Model matching with SOM by [KSP01] ... 39
Figure 4-17 Data structure mapped in Gaussian bumps by OPTICS [ABK+99] ... 40
Figure 4-18 Clustering structure of 30,000 16-dimensional data items visualized by OPTICS ... 41
Figure 4-19 Positioning a point by an 8-attribute vector in Star Coordinates [Kan01] ... 42
Figure 4-20 Axis scaling, angle rotation and footprint functions of Star Coordinates [Kan01] ... 43
CHAPTER 1. INTRODUCTION

Clustering analysis, also called segmentation analysis or taxonomy analysis [MaW], aims to organize homogeneous objects into a set of groups, named clusters, according to given criteria. Clustering is a very important technique of knowledge discovery for human beings. It has a long history and can be traced back to the times of Aristotle [HaJ97]. These days, cluster analysis is mainly conducted on computers to deal with very large-scale and complex datasets. With the development of computer-based techniques, clustering has been widely used in data mining, ranging from web mining, image processing, machine learning, artificial intelligence, pattern recognition, social network analysis, bioinformatics, geography, geology, biology, psychology, sociology, customer behavior analysis and marketing to e-business and other fields [JMF99] [Har75]. Cluster analysis includes two major aspects: clustering and cluster validation. Clustering aims to distinguish objects into groups according to certain criteria. The grouped objects are called clusters, where the similarity of objects is high within clusters and low between clusters. To serve different application purposes, a large number of clustering algorithms have been developed [JMF99, Ber06]. However, there is no general-purpose clustering algorithm that fits all kinds of applications; thus the evaluation of the quality of clustering results plays a critical role in cluster analysis. This evaluation, i.e., cluster validation, aims to assess the quality of clustering results and find a cluster scheme that fits a specific application.
However, in practice, it may not always be possible to cluster huge datasets successfully using clustering algorithms, due to the weakness of most existing automated clustering algorithms in dealing with arbitrarily shaped data distributions. As Abul et al. pointed out, "In high dimensional space, traditional clustering algorithms tend to break down in terms of efficiency as well as accuracy because data do not cluster well anymore" [AAP+03]. In addition, the very high computational cost of statistics-based cluster validation methods directly impacts the efficiency of cluster validation [HCN01]. The clustering of large datasets in data mining is an iterative process involving humans [JMF99]. Thus, the user's initial estimation of the cluster number is important for choosing the parameters of clustering algorithms in the pre-processing stage of clustering. Also, the user's clear understanding of the cluster distribution is helpful for assessing the quality of clustering results in the post-processing stage of clustering. All of these rely heavily on the user's visual perception of the data distribution. Clearly, visualization is a crucial aspect of cluster exploration and verification in cluster analysis. Visual presentations can be very powerful in revealing trends, highlighting outliers, showing clusters, and exposing gaps in data [Shn01]. Therefore, introducing visualization techniques to explore and understand high-dimensional datasets is becoming an efficient way to combine human intelligence with the immense brute-force computation power available nowadays [PGW03]. Visualization used in cluster analysis maps high-dimensional data to a 2D or 3D space and provides users with an intuitive and easily understood graph/image that reveals the grouping relationships among the data. As an indispensable revealing technique, visualization is involved in almost every step of data mining [Chi00, AnK01].
Visual cluster analysis is a combination of visualization and cluster analysis. The datasets that clustering algorithms deal
with are normally of high dimensionality (>3D). Thus, choosing a suitable technique to visualize clusters of high-dimensional data is the first task of visual cluster analysis. There have been many works on multidimensional data visualization [WoB94], but those earlier techniques are not suitable for visualizing cluster structures in very high-dimensional and very large datasets. With the increasing applications of clustering in data mining, more and more visualization techniques have been developed in the last decade to study the structure of datasets in cluster analysis [OlL03, Shn05]. Several approaches have been proposed for visual cluster analysis [ABK+99, ChL04, HCN01, HuL00, Kan01, KSP01, HWK99, SBG00], but their arbitrary exploration of grouping information makes them inefficient and time-consuming in the cluster exploration stage. On the other hand, the impreciseness of visualization limits its utilisation in the quantitative verification and validation of clustering results. Thus, developing a visualization technique with the features of purposeful cluster detection and precise contrast between clustering results is the motivation of this research. To mitigate the above-mentioned problems, based on hypothesis testing, we propose a visual projection technique called Hypothesis Oriented Verification and Validation by Visualization (HOV3) [ZOZ06a, ZOZ06b]. HOV3 generalizes the random adjustments of Star Coordinates based techniques as measure vectors. Thus, compared with the Star Coordinates based techniques, HOV3 has several advantages. First, data miners can summarize their prior knowledge of the studied data as measure vectors, i.e., hypotheses about the data. Based on hypothesis testing, data miners can quantitatively analyze the data distribution projected by HOV3 under these hypotheses. Second, HOV3 avoids the arbitrariness and randomness of most existing
visual techniques in cluster exploration, for example Star Coordinates [Kan01] and its implementations, such as VISTA/iVIBRATE [ChL04, ChL06]. As a consequence, HOV3 provides data miners with a purposeful and effective visual method for cluster analysis [ZOZ06b]. Based on the quantified measurement feature of HOV3, we propose a visual external cluster validation model to verify the consistency of cluster structures. Compared with statistics-based external cluster validation methods, we show that the HOV3 based external cluster validation model is more intuitive and effective [ZOZ07a]. We also introduce a visual approach called M-HOV3/M-Mapping to enhance the visual separation of clusters [ZOZ07b]. With the above features of HOV3, a prediction-based visual approach is proposed to explore and verify clusters [ZOZ07c, ZOZ07d]. The next chapter presents the contributions of this thesis in more detail. This thesis is structured as follows: Chapter 2 summarizes the contributions of this thesis. Chapter 3 gives an introduction to clustering, clustering algorithms and cluster validation. Chapter 4 reviews related work on high-dimensional data visualization and the visual techniques that have been used in cluster analysis. Finally, Chapter 5 summarizes the work in the thesis and discusses future work.
CHAPTER 2. CONTRIBUTIONS

This is a publication-based thesis. Its main contributions have been published in the proceedings of five international conferences. A follow-up report has also been submitted to an established journal for further publication. Below, we summarize the contributions of the thesis in the chronological order of those publications:

1. Hypothesis Oriented Verification and Validation by Visualization (HOV3) model: To fill the gap between imprecise cluster detection by visualization and the unintuitive results often obtained from clustering algorithms, a novel visual projection technique called Hypothesis Oriented Verification and Validation by Visualization (HOV3) is proposed [ZOZ+06, ZOZ06]. The aim of interactive visualization in cluster exploration and rendering is to help data miners obtain visually separated groups or a fully separated clustering result of the data. For example, Star Coordinates and its extensions provide such interaction by tuning the weight value of each axis (axis scaling in Star Coordinates [Kan01], α-adjustment in VISTA/iVIBRATE [ChL04, ChL06]), but their arbitrary and random adjustments limit their applicability. HOV3 generalizes these adjustments as a coefficient/measure vector [ZOZ06]. Compared with the Star Coordinates model and its implementations, such as VISTA/iVIBRATE, it is observed that HOV3 has better performance on cluster detection. This is because HOV3 provides data miners with a mechanism to quantify their knowledge or hypotheses as measure vectors for precisely exploring grouping information.
As a consequence, HOV3 provides a bridge between qualitative analysis and quantitative analysis. Based on the idea of obtaining grouping clues by contrasting a dataset against quantified measures, HOV3 synthesizes the feedback from exploration discovery and the user's domain knowledge to produce quantified measures, and then projects the test dataset against those measures. Geometrically, HOV3 reveals the data distribution against the measures in visual form. This approach not only inherits the intuitive and easily understood features of visualization, but also avoids the weaknesses of randomness and arbitrary exploration in the existing visual methods employed in data mining.

[ZOZ+06] K-B. Zhang, M. A. Orgun, K. Zhang and Y. Zhang, Hypothesis Oriented Cluster Analysis in Data Mining by Visualization, Proceedings of the working conference on Advanced Visual Interfaces 2006 (AVI06), May 23-26, 2006, Venezia, Italy, ACM Press, pp. 254-257 (2006)

[ZOZ06] K-B. Zhang, M. A. Orgun and K. Zhang, HOV3: An Approach for Visual Cluster Analysis, Proceedings of the 2nd International Conference on Advanced Data Mining and Applications (ADMA 2006), Xi'an, China, August 14-16, 2006, Lecture Notes in Computer Science, Volume 4093, Springer Press, pp. 316-327 (2006)
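In geometric terms, the measure-vector projection can be sketched as follows: each dimension is mapped to an evenly spaced axis direction on the unit circle, and the measure vector weights each dimension's contribution to a point's 2D position (a measure vector of all ones reduces to plain Star Coordinates). This is a simplified real-valued sketch under an assumed min-max normalization; the function and variable names are illustrative, not taken from the publications.

```python
import numpy as np

def hov3_project(X, m=None):
    """Project n x d data onto 2D, Star Coordinates style, weighted by a
    measure vector m (m = all ones reduces to plain Star Coordinates).
    A rough sketch of the idea; normalization and names are illustrative."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    # min-max normalize each attribute to [0, 1]
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xn = (X - X.min(axis=0)) / span
    if m is None:
        m = np.ones(d)                        # no hypothesis: plain Star Coordinates
    angles = 2 * np.pi * np.arange(d) / d     # evenly spaced axis directions
    axes = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # d x 2
    return (Xn * m) @ axes                    # n x 2 screen positions

# Example: projecting with a hypothesis (measure vector) that weights attributes
X = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
P = hov3_project(X, m=np.array([1.0, 0.5, 2.0]))
print(P.shape)  # (3, 2)
```

Varying m and re-projecting is what lets the data miner contrast the distribution against different quantified hypotheses, rather than tuning axes at random.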
2. An Algorithm for External Cluster Validation based on Data Distribution Matching: This part of the work starts with the assumption that if two same-sized datasets have a similar cluster structure, then after applying the same linear transformation to both datasets, the similarity of the newly produced distributions of the two sets will still be high. With the quantified measurement feature of HOV3, an external cluster validation method based on distribution matching is proposed to verify the consistency of cluster structures between a clustered subset and the non-clustered subsets of a large dataset [ZOZ07a]. In this approach, a clustered subset of a dataset is chosen as a visual model, and the similarity of cluster structures between the model and the other same-sized non-clustered subsets of the dataset is verified by projecting them together in HOV3. As a consequence, the user
can utilize the well-separated clusters produced by scaling axes in HOV3 as a model to pick out their corresponding quasi-clusters, i.e., the groups of points that overlap the model clusters. In addition, instead of using statistical methods to assess the similarity between the two subsets, this approach simply computes the overlapping rate between the clusters and their quasi-clusters to show their consistency. The experiments show that when the HOV3 based external cluster validation method is introduced into cluster analysis, it can produce more effective cluster validation results than those obtained from pure clustering algorithms, for example K-means, with statistics-based validation methods [ZOZ07a].

[ZOZ07a] K-B. Zhang, M. A. Orgun and K. Zhang, A Visual Approach for External Cluster Validation, Proceedings of the first IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007), Honolulu, Hawaii, USA, April 1-5, 2007, IEEE Press, pp. 576-582 (2007)

3. M-HOV3/M-Mapping: Enhancing the Separation of Clusters: To visually separate overlapping clusters, an approach called M-HOV3/M-Mapping is introduced to enhance the separation of clusters in HOV3 [ZOZ07b, ZOZ07c]. Technically, if it is observed that several groups of data points can be roughly separated (with ambiguous points existing between groups) by projecting a data set with a measure vector in HOV3, then applying M-HOV3/M-Mapping with the same measure vector to the data set will make the groups more contracted and well separated. These features of M-HOV3/M-Mapping are significant for identifying the membership formation of clusters in the processes of cluster exploration and cluster verification. This is because the contracting feature of M-HOV3/M-Mapping keeps the data points within a cluster relatively closer, i.e., grouping information is preserved.
On the other hand, the enhanced separation feature of M-HOV3/M-Mapping pushes distant data points relatively further apart. With the advantage of the enhanced separation and contraction features
of M-HOV3/M-Mapping, the user can identify the cluster number efficiently in the pre-processing stage of clustering, and also verify the membership formation of data points among the clusters effectively in the post-processing stage of clustering by M-HOV3/M-Mapping.

[ZOZ07b] K-B. Zhang, M. A. Orgun and K. Zhang, Enhanced Visual Separation of Clusters by M-mapping to Facilitate Cluster Analysis, Proceedings of the 9th International Conference on Visual Information Systems (VISUAL 2007), June 28-29, 2007, Shanghai, China, Lecture Notes in Computer Science, Volume 4781, Springer Press, pp. 285-297 (2007)

4. Prediction-based Cluster Detection by HOV3: With the quantified measurement of HOV3 and the enhanced separation features of M-HOV3, the user not only can summarize their historically explored knowledge about datasets as predictions, but can also directly introduce abundant statistical measurements of the studied data as predictions, to investigate cluster clues or refine clustering results purposefully and effectively [ZOZ07c, ZOZ07d]. In fact, prediction-based cluster detection by statistical measurements in HOV3 leads to more purposeful cluster exploration, and it gives an easier geometrical interpretation of the data distribution. In addition, with statistical predictions in HOV3 the user may even be able to expose cluster clues that are not easy to find by random cluster exploration.

[ZOZ07c] K-B. Zhang, M. A. Orgun and K. Zhang, A Prediction-based Visual Approach for Cluster Exploration and Cluster Validation by HOV3, Proceedings of the 18th European Conference on Machine Learning/11th European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2007), Warsaw, Poland, September 17-21, 2007, Lecture Notes in Computer Science, LNAI 4702, Springer Press, pp. 336-349 (2007)

5.
Prediction-based Cluster Validation by HOV3: When mapping high-dimensional data into two-dimensional space, there may be overlapping data points and ambiguities in the visual form; therefore, to separate clusters from a lot
of overlapping data points is an aim of this thesis. Based on works such as M-HOV3/M-Mapping and HOV3 with statistical measurements, the measures that result in fully separated clusters can be treated as predictions and introduced into external cluster validation based on data distribution matching by HOV3. In principle, any linear transformation, even a complex linear transformation, can be employed in HOV3 if it separates clusters well. With well-separated clusters, we can improve the efficiency of external cluster validation by HOV3 [ZOZ07c, ZOZ07d].

[ZOZ07d] K-B. Zhang, M. A. Orgun and K. Zhang, Predictive Hypothesis Oriented Cluster Analysis by Visualization, Journal of Data Mining and Knowledge Discovery (submitted)
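The overlapping-rate computation used in the external validation approach (contribution 2) can be sketched as follows. Representing a model cluster and its quasi-cluster as plain sets of point identifiers is an illustrative simplification of the geometric matching performed after projecting both subsets together in HOV3; the function name is an assumption.

```python
def overlap_rate(cluster, quasi_cluster):
    """Fraction of a model cluster's points that also fall in the
    corresponding quasi-cluster. Illustrative sketch only: here both
    groups are simply sets of point identifiers."""
    cluster, quasi_cluster = set(cluster), set(quasi_cluster)
    if not cluster:
        return 0.0
    return len(cluster & quasi_cluster) / len(cluster)

# A model cluster and the quasi-cluster picked out around it
model_cluster = [1, 2, 3, 4, 5]
quasi = [2, 3, 4, 5, 6, 7]
print(overlap_rate(model_cluster, quasi))  # 0.8
```

A rate close to 1 indicates that the non-clustered subset reproduces the model's cluster structure, i.e., the clustering is consistent across subsets.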
CHAPTER 3. CLUSTER ANALYSIS

Cluster analysis is an exploratory discovery process. It can be used to discover structures in data without providing an explanation/interpretation [JaD88]. Cluster analysis includes two major aspects: clustering and cluster validation. Clustering aims at partitioning objects into groups according to certain criteria. To serve different application purposes, a large number of clustering algorithms have been developed [JaD88, KaR90, JMF99, Ber06]. However, since there are no general-purpose clustering algorithms that fit all kinds of applications, an evaluation mechanism is required to assess the quality of the clustering results produced by different clustering algorithms, or by one clustering algorithm with different parameters, so that the user may find a cluster scheme that fits a specific application. This quality assessment process for clustering results is regarded as cluster validation. Cluster analysis is an iterative process of clustering and cluster verification by the user, facilitated by clustering algorithms, cluster validation methods, visualization and domain knowledge of the databases. In this chapter, we give a review of cluster analysis as the background of this thesis. First we introduce clustering, clustering algorithms and their features, as well as the drawbacks of these algorithms. This is followed by an introduction to cluster validation, the existing cluster validation methods, and the problems with the existing cluster validation approaches.

3.1 Clustering and Clustering Algorithms

Clustering is considered an unsupervised classification process [JMF99]. The clustering problem is to partition a dataset into groups (clusters) so that the data elements within a
cluster are more similar to each other than to data elements in different clusters by given criteria. A large number of clustering algorithms have been developed for different purposes [JaD88, KaR90, JMF99, XuW05, Ber06]. Based on the strategy by which data objects are distinguished, clustering techniques can be broadly divided into two classes: hierarchical clustering techniques and partitioning clustering techniques [Ber02]. However, there is no clear boundary between these two classes. Some efforts have been made to combine different clustering methods for dealing with specific applications. Beyond the two traditional hierarchical and partitioning classes, there are several clustering techniques that are categorized into independent classes, for example, density-based methods, grid-based methods and model-based clustering methods [HaK01, Ber06, Pei]. A short review of these methods is given below.

3.1.1 Partitioning methods

Partitioning clustering algorithms, such as K-means [Mac67], K-medoids, PAM [KaR87], CLARA [KaR90] and CLARANS [NgH94], assign objects into k clusters (where k is a predefined cluster number), and iteratively reallocate objects to improve the quality of the clustering results. K-means is the most popular and easy-to-understand clustering algorithm [Mac67]. The main idea of K-means is summarized in the following steps:

1. Arbitrarily choose k objects to be the initial cluster centers/centroids;
2. Assign each object to the cluster associated with the closest centroid;
3. Compute the new position of each centroid as the mean value of the objects in the cluster; and
4. Repeat Steps 2 and 3 until the means are fixed.

Figure 3-1 presents an example of the process of the K-means clustering algorithm.
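The four steps can be sketched in a few lines of code. This is a minimal illustration assuming random initial centroids and Euclidean distance, not the implementation used in any of the cited experiments.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means sketch following the four steps above."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Step 1: arbitrarily choose k objects as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each object to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its objects
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        # Step 4: stop when the means are fixed
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two obvious groups in 2D
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
labels, centroids = kmeans(X, k=2)
print(labels)  # points 0 and 1 share one label; points 2 and 3 share the other
```

Rerunning with a different seed illustrates the initialization sensitivity discussed next: different initial centroids can converge to different partitions on less well-separated data.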
Figure 3-1 An example of the clustering procedure of K-means [HaK01]

However, the K-means algorithm is very sensitive to the selection of the initial centroids; in other words, different initial centroids may produce significantly different clustering results. Another drawback of K-means is that there is no general theoretical solution for finding the optimal number of clusters for any given dataset. A simple solution would be to compare the results of multiple runs with different cluster numbers and choose the best one according to a given criterion, but when the data size is large, it would be very time consuming to perform multiple runs of K-means and compare the clustering results after each run. Instead of using the mean value of the data objects in a cluster as the center of the cluster, a variation of K-means, K-medoids, takes the medoid of the objects in each cluster. The process of the K-medoids algorithm is quite similar to that of K-means. However, the mean-based K-means clustering algorithm is very sensitive to outliers: outliers can seriously influence the clustering results. To address this problem, some efforts have been made based on K-medoids; for example, PAM (Partitioning Around Medoids) was proposed by Kaufman and Rousseeuw [KaR87]. PAM
inherits the features of the K-medoids clustering algorithm. Meanwhile, PAM adds a medoid-swap mechanism to produce better clustering results. PAM is more robust than K-means in terms of handling noise and outliers, since the medoids in PAM are less influenced by outliers. With the O(k(n-k)²) computational cost for each iteration of swapping (where k is the cluster number and n is the number of items in the dataset), it is clear that PAM only performs well on small-sized datasets and does not scale well to large datasets. In practice, PAM is embedded in statistical analysis systems such as SAS, R, S+, etc. To deal with applications on large datasets, CLARA (Clustering LARge Applications) [KaR90] applies PAM to multiple sampled subsets of a dataset; for each sample, CLARA can produce better clustering results than PAM on larger datasets. But the efficiency of CLARA depends on the sample size. On the other hand, a locally optimal clustering of the samples may not be the global optimum of the whole dataset. Ng and Han [NgH94] abstract the medoid search in PAM and CLARA as a search for sub-graphs in a graph of n points, and based on this understanding, they propose a PAM-like clustering algorithm called CLARANS (Clustering Large Applications based upon RANdomized Search). While PAM searches the whole graph and CLARA searches some random sub-graphs, CLARANS randomly samples a set and selects medoids by hill-climbing over sub-graphs. CLARANS selects the neighboring objects of medoids as candidates for new medoids, and it samples subsets to verify medoids multiple times to avoid bad samples. Obviously, multiple rounds of sampling for medoid verification are time consuming. This prevents CLARANS from clustering very large datasets in an acceptable time period.

3.1.2 Hierarchical methods
Hierarchical clustering algorithms assign objects to tree-structured clusters, i.e., a cluster can contain data points or representatives of lower-level clusters [HaK01]. Hierarchical clustering algorithms can be classified into two categories according to their clustering process: agglomerative and divisive. The processes of agglomerative and divisive clustering are exhibited in Figure 3-2.

Figure 3-2 Hierarchical Clustering Process [HaK01]

Agglomerative: one starts with each of the units in a separate cluster and ends up with a single cluster that contains all units. Divisive: one starts with a single cluster of all units and then forms new clusters by dividing those determined at previous stages, until one ends up with clusters containing individual units.

AGNES (Agglomerative Nesting) adopts the agglomerative strategy to merge clusters [KaR90]. AGNES arranges each object as a cluster at the beginning, then merges clusters step-by-step into upper-level clusters by given agglomerative criteria until all objects form a single cluster, as shown in Figure 3-2. The similarity between two clusters is measured by the similarity function of the closest pair of data points in the two clusters, i.e., single link. DIANA (Divisive ANAlysis)
adopts the opposite strategy: it initially puts all objects in one cluster, then splits them into several levels of clusters until each cluster contains only one object [KaR90]. The merging/splitting decisions are critical in AGNES and DIANA. On the other hand, with O(n²) computational cost, their application is not scalable to very large datasets. Zhang et al. [ZRL96] proposed an effective hierarchical clustering method, BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), to deal with the above problems. BIRCH summarizes an entire dataset into a CF-tree, a multi-level compression structure, and then runs a hierarchical clustering algorithm on it to get the clustering result. Its linear scalability allows good clustering with a single scan, and the quality can be further improved by a few additional scans. It is an efficient clustering method for arbitrarily shaped clusters. But BIRCH is sensitive to the input order of the data objects and can only deal with numeric data. This limits its stability of clustering and its scalability in real-world applications. CURE uses a set of representative points to describe the boundary of a cluster in its hierarchical algorithm [GRS98]. But as the complexity of the cluster shapes increases, the number of representative points increases dramatically in order to maintain precision. CHAMELEON [KHK99] employs a multilevel graph partitioning algorithm on the k-Nearest Neighbour graph, which may produce better results than CURE on complex cluster shapes for spatial datasets. But the high complexity of the algorithm prevents its application to higher-dimensional datasets.

3.1.3 Density-based methods

The primary idea of density-based methods is that for each point of a cluster, the neighborhood of a given unit distance contains at least a minimum number of points, i.e., the
density in the neighborhood should reach some threshold [EKS+96]. However, this idea is based on the assumption that the clusters are of spherical or regular shapes. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) was proposed to adopt density-reachability and density-connectivity for handling arbitrarily shaped clusters and noise [EKS+96]. But DBSCAN is very sensitive to the parameters Eps (unit distance or radius) and MinPts (threshold density), because the user is expected to estimate Eps and MinPts before doing cluster exploration. DENCLUE (DENsity-based CLUstEring) is a distribution-based algorithm [HiK98], which performs well on clustering large datasets with high noise. It is also significantly faster than existing density-based algorithms, but DENCLUE needs a large number of parameters. OPTICS is good at investigating arbitrarily shaped clusters, but its non-linear complexity often makes it only applicable to small or medium datasets [ABK+99]. 3.1.4 Grid-based methods The idea of grid-based clustering methods is based on clustering-oriented query answering in multilevel grid structures. The upper level stores the summary of the information of its next level; thus the grids make up cells between the connected levels, as illustrated in Figure 3-3. Figure 3-3 The grid-cell structure of grid-based clustering methods
Many grid-based methods have been proposed, such as STING (Statistical Information Grid Approach) [WYM97], CLIQUE [AGG+98], and the combined grid-density based technique WaveCluster [SCZ98]. The grid-based methods are efficient on clustering data, with a complexity of O(N). However, the primary issue of grid-based techniques is how to decide the size of grids, which depends largely on the user's experience. 3.1.5 Model-based clustering methods Model-based clustering methods are based on the assumption that data are generated by a mixture of underlying probability distributions, and they optimize the fit between the data and some mathematical model, for example a statistical approach, a neural network approach or other AI approaches. The typical techniques in this category are AutoClass [CKS+88], DENCLUE [HiK98] and COBWEB [Fis87]. When facing an unknown data distribution, choosing a suitable one from the model-based candidates is still a major challenge. On the other hand, clustering based on probability suffers from high computational cost, especially when the scale of data is very large. Based on the above review, we can conclude that the application of clustering algorithms to detect grouping information in real world data mining applications is still a challenge, primarily due to the inefficiency of most existing clustering algorithms in coping with the arbitrarily shaped distribution of data in extremely large and high-dimensional datasets. Extensive survey papers on clustering techniques can be found in the literature [JaD88, KaR90, JMF99, EML01, XuW05, Ber06].
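To make the neighborhood condition of the density-based methods in Section 3.1.3 concrete, the Eps/MinPts core-point test underlying DBSCAN can be sketched as follows. This is a minimal illustrative sketch in Python; the point set, the Eps and MinPts values, and the function names are ours, not taken from the original DBSCAN implementation.

```python
# Sketch of the core-point condition behind DBSCAN [EKS+96]: a point is
# "dense" if its Eps-neighborhood contains at least MinPts points.
# The points and parameter values below are illustrative only.

def eps_neighborhood(points, p, eps):
    """All points within Euclidean distance eps of p (including p itself)."""
    return [q for q in points
            if ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5 <= eps]

def is_core_point(points, p, eps, min_pts):
    """True if the Eps-neighborhood of p reaches the MinPts threshold."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts

pts = [(0, 0), (0.1, 0), (0, 0.1), (5, 5)]
core = [p for p in pts if is_core_point(pts, p, eps=0.5, min_pts=3)]
```

The sketch makes the sensitivity noted above visible: shrinking Eps or raising MinPts immediately changes which points count as dense, which is why the user must estimate both parameters before exploration.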
3.2 Cluster Validation A large number of clustering algorithms have been developed to deal with specific applications [JMF99]. Several questions arise: which clustering algorithm is most suitable for the application at hand? How many clusters are there in the studied data? Is there a better cluster scheme? These questions are related to evaluating the quality of clustering results, that is, cluster validation. Cluster validation is a procedure of assessing the quality of clustering results and finding a suitable cluster strategy for a specific application. It aims at finding the optimal cluster scheme and interpreting the cluster patterns [HBV02]. Cluster validation is an indispensable process of cluster analysis, because no clustering algorithm can guarantee the discovery of genuine clusters from real datasets, and different clustering algorithms often impose different cluster structures on a data set even if there is no cluster structure present in it [Gor98] [Mil96]. Cluster validation is needed in data mining to solve the following problems [HCN01]: 1. To measure a partition of a real data set generated by a clustering algorithm. 2. To identify the genuine clusters from the partition. 3. To interpret the clusters. Generally speaking, cluster validation approaches are classified into the following three categories: internal approaches, relative approaches and external approaches [ALA+03]. We give a short introduction to cluster validation methods as follows. 3.2.1 Internal criteria
Internal cluster validation is a method of evaluating the quality of clusters when statistics are devised to capture the quality of the induced clusters using the available data objects only [VSA05]. In other words, internal cluster validation excludes any information beyond the clustering data, and only focuses on assessing cluster quality based on the clustered data themselves. Statistical methods of quality assessment are employed as internal criteria; for example, the root-mean-square standard deviation (RMSSTD) is used for the compactness of clusters [Sha96]; R-squared (RS) for the dissimilarity between clusters; and S_Dbw for a compound evaluation of compactness and dissimilarity [HaV01]. The formulas of RMSSTD, RS and S_Dbw are shown below.

RMSSTD = \left[ \frac{\sum_{i=1}^{n_c} \sum_{j=1}^{d} \sum_{k=1}^{n_{ij}} (x_k - \bar{x}_j)^2}{\sum_{i=1}^{n_c} \sum_{j=1}^{d} (n_{ij} - 1)} \right]^{1/2}   (3-1)

where \bar{x}_j is the expected value in the jth dimension; n_{ij} is the number of elements of the ith cluster in the jth dimension; n_j is the number of elements in the jth dimension in the whole data set; and n_c is the number of clusters.

RS = \frac{SS_t - SS_w}{SS_t}   (3-2)

where

SS_t = \sum_{j=1}^{d} \sum_{k=1}^{n_j} (x_k - \bar{x}_j)^2 , \quad SS_w = \sum_{i=1}^{n_c} \sum_{j=1}^{d} \sum_{k=1}^{n_{ij}} (x_k - \bar{x}_{ij})^2   (3-3)

i.e., SS_t is the total sum of squared deviations of the whole data set and SS_w is the sum of squared deviations within the clusters. The formula of S_Dbw is given as:

S_Dbw = Scat(c) + Dens_bw(c)   (3-4)

where Scat(c) is the average scattering within the c clusters. Scat(c) is defined as:
Scat(c) = \frac{1}{c} \sum_{i=1}^{c} \frac{\lVert \sigma(v_i) \rVert}{\lVert \sigma(S) \rVert}   (3-5)

The value of Scat(c) is the degree to which the data points are scattered within clusters; it reflects the compactness of the clusters. The term \sigma(S) is the variance of the data set, and the term \sigma(v_i) is the variance of cluster c_i. Dens_bw(c) indicates the average number of points between the c clusters (i.e., an indication of inter-cluster density) in relation to the density within the clusters. The formula of Dens_bw is given as:

Dens\_bw(c) = \frac{1}{c(c-1)} \sum_{i=1}^{c} \sum_{j=1, j \neq i}^{c} \frac{density(u_{ij})}{\max\{density(v_i), density(v_j)\}}   (3-6)

where u_{ij} is the middle point of the line segment between the centres of the clusters v_i and v_j. The density function of a point is defined as the number of points around a specific point within a given radius. 3.2.2 Relative criteria Relative assessment compares two structures and measures their relative merit. The idea is to run a clustering algorithm for a range of parameter values (e.g., for each possible number of clusters) and identify the clustering scheme that best fits the dataset [ALA+03]; i.e., relative criteria assess clustering results by applying an algorithm with different parameters on a data set and finding the optimal solution. In practice, relative criteria methods also use RMSSTD, RS and S_Dbw to find the best cluster scheme, in terms of compactness and dissimilarity, from all the clustering results. Relative cluster validity is also called cluster
stability, and recent works on relative cluster validity are presented in [KeC00, LeD01, BEG02, RBL+02, BeG03]. 3.2.3 External criteria The results of a clustering algorithm are evaluated based on a pre-specified structure, which reflects the user's intuition about the clustering structure of the data set [HKK05]. As a necessary post-processing step, external cluster validation is a procedure of hypothesis testing, i.e., given a set of class labels produced by a cluster scheme, comparing it with the clustering results obtained by applying the same cluster scheme to the other partitions of a database, as shown in Figure 3-4. Figure 3-4 External criteria based validation [ZOZ07a] External cluster validation is based on the assumption that an understanding of the output of the clustering algorithm can be achieved by finding a resemblance of the clusters with existing classes [Dom01], [KDN+96], [Ran71], [FoM83], [MSS83]. Statistical methods for quality assessment are employed in external cluster validation, such as the Rand statistic [Ran71], the Jaccard Coefficient [Jac08], the Fowlkes and Mallows index [MSS83], Hubert's Γ statistic and the Normalized Γ statistic [Th99], and the Monte Carlo method
[Mil81], to measure the similarity between a priori modelled partitions and the clustering results of a dataset. Extensive surveys on cluster validation can be found in the literature [JaD88, MiI96, Gor98, JMF99, Th99, HBV01, HBV02, HKK05]. 3.3 The issues of cluster analysis From the survey of cluster analysis above, it is clear that there are two major drawbacks that influence the feasibility of cluster analysis in real world data mining applications. The first one is the weakness of most existing automated clustering algorithms in dealing with arbitrarily shaped data distributions. The second issue is that the evaluation of the quality of clustering results by statistics-based methods is time consuming when the database is large, primarily due to the very high computational cost of statistics-based methods for assessing the consistency of cluster structure between sampling subsets. The implementation of statistics-based cluster validation methods does not scale well to very large datasets. On the other hand, arbitrarily shaped clusters also make the traditional statistical cluster validity indices ineffective, which makes it difficult to determine the optimal cluster structure [HBV02]. In addition, the inefficiency of clustering algorithms in handling arbitrarily shaped clusters in extremely large datasets directly impacts the effectiveness of cluster validation, because cluster validation is based on the analysis of clustering results produced by clustering algorithms. Moreover, most of the existing clustering algorithms tend to deal with the entire clustering process automatically, i.e., once the user sets the parameters of the algorithm, the clustering
result is produced with no interruption, which excludes the user until the end. As a result, it is very hard to incorporate user domain knowledge into the clustering process. Cluster analysis is an iterative process of multiple runs; without any user domain knowledge, it would be inefficient and unintuitive in satisfying the specific requirements of application tasks in clustering. Visualization techniques have proven to be of high value in exploratory data analysis and data mining [Shn01]. Therefore, the introduction of domain experts' knowledge supported by visualization techniques is a good remedy for those problems. A detailed review of visualization techniques used in cluster analysis is presented in the next chapter.
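As a concrete example of the external indices listed in Section 3.2.3, the Rand statistic [Ran71] can be sketched in a few lines: over all pairs of items, it counts how often two labelings agree on whether a pair belongs together. This is a minimal sketch; the variable names are ours.

```python
# Sketch of the Rand statistic [Ran71] for external cluster validation:
# agreement between two labelings, counted over all pairs of items.
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which the two labelings agree
    (both place the pair in the same cluster, or both in different ones)."""
    agree = 0
    pairs = list(combinations(range(len(labels_a)), 2))
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agree += 1
    return agree / len(pairs)

# Two identical partitions (up to label renaming) agree on every pair.
score = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

Note that the index is invariant to label renaming, which is exactly what external validation needs: it compares groupings, not label values. The pairwise loop also makes the quadratic cost of such indices on large datasets, discussed in Section 3.3, directly visible.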
CHAPTER 4 VISUAL CLUSTER ANALYSIS As described in the last chapter, most of the existing automated clustering algorithms suffer in terms of efficiency and effectiveness in dealing with arbitrarily shaped cluster distributions of extremely large and multidimensional datasets [AAP+03]. Baumgartner et al [BPR+04] concluded that in high dimensional space, traditional clustering algorithms tend to break down in terms of efficiency as well as accuracy because data do not cluster well anymore. Another obstacle to the application of cluster analysis in data mining is the high computational cost of statistics-based cluster validation methods [HCN01]. These drawbacks limit the usability of clustering algorithms in real-world data mining applications. To mitigate the above problems, visualization has been introduced into cluster analysis. As Card et al [CMS99] described, visualization is the use of computer-supported, interactive visual representations of abstract data to amplify cognition. Visualization is considered one of the most intuitive methods for cluster detection and validation, and it performs especially well on the representation of irregularly shaped clusters. Visual data mining is the use of visualization techniques to allow data miners and analysts to evaluate, monitor, and guide the inputs, products and process of data mining [GHK+96]. As a branch of visual data mining, visual cluster analysis is a combination of information visualization and cluster analysis techniques. In the cluster analysis process, visualization provides analysts with intuitive feedback on data distribution and supports decision-making activities. As a consequence, visual presentations can be very powerful in revealing trends,
highlighting outliers, showing clusters, and exposing gaps in data [Shn01]. Introducing visualization techniques to explore and understand high-dimensional datasets is becoming an efficient way to combine human intelligence with the immense brute force computation power available nowadays [PGW03]. A large number of visualization techniques have been developed to map multidimensional datasets to two or three dimensional space [WTP+95, AhW95, HKW99, RSE99, RKJ+99, ABK+99, AEK00, SGB00, MRC02, FGW02, Shn01, PGW03, LKS+04]. In practice, most of them simply take information visualization as a layout problem; therefore, they are not suitable for visualizing clusters of very large datasets. 4.1 Multidimensional Data Visualization Many efforts have been devoted to multidimensional (d>3) data visualization [OlL03]. However, most of those visual approaches have difficulty in dealing with high dimensional and very large datasets. We give a more detailed discussion of them as follows. 4.1.1 Icon-based techniques Icon-based presentations are relatively older techniques for visual data mining. The idea of icon-based techniques is to map each multidimensional data item to an icon, for example [Pic70, Che73, Bed90, Lev91, KeK94, Hea95]. We explain several popular techniques below. Chernoff Faces A well-known iconic approach is Chernoff faces [Che73]. The Chernoff face uses two dimensions of multidimensional data to locate a face position in the two display dimensions. The remaining dimensions are mapped to the properties of the face icon, i.e., the shape of the nose, mouth, eyes, and the shape of the face itself, as shown in Figure 4-1.
Chernoff face visualization capitalizes on the human sensitivity to faces and facial features. However, the number of data items that can be visualized using the Chernoff face technique is quite limited. Figure 4-1. An example of Chernoff-Faces Stick Figures Another famous icon-based technique is to use stick figures for visualizing larger amounts of data; an adequate number of data items can therefore be presented for data mining purposes [Pic70][PiG88]. The stick figures technique uses two dimensions as the display dimensions, and the other dimensions are mapped to the angles and lengths of the stick figure icon, as illustrated in Figure 4-2a. Different stick figure icons with variable dimensionality may be used, as shown in Figure 4-2b. a. Stick Figure Icon b. A Family of Stick Figures Figure 4-2. Stick Figure Visualization Technique
Figure 4-3 shows the 1980 United States census data, which have five dimensions, visualized by the stick figure visualization technique. In Figure 4-3, income and age are used as the display space, and the other attributes, occupation, education level, marital status and sex, are visualized by the stick figures. However, it can be observed in Figure 4-3 that the user cannot easily understand and interpret the graph of stick figures without good training in advance. Figure 4-3. Stick Figure Visualization of the Census Data Many other icon-based systems have also been proposed, such as Shape-Coding [Bed90], Color Icons [Lev91, KeK94], and TileBars [Hea95]. Icon-based techniques can display the multidimensional properties of data; however, as the amount of data increases, the user can hardly make sense of most properties of the data intuitively, because the user cannot focus on the details of each icon when the data scale is very large. 4.1.2 Pixel-oriented Techniques Pixel-oriented visualization techniques map each attribute value of data to a single colored pixel, yielding a display of the largest possible amount of information at a time [KeK94, KKA95, Kei97,
An01]. With this technique, each data value is mapped to a colored pixel, and the data values belonging to one attribute are presented in separate windows, as displayed in Figure 4-4. Pixel-oriented techniques use various colour mapping approaches, such as linear variation of brightness, maximum variation of hue (colour) and constant maximum saturation, to map each data value to a colored pixel and arrange them adequately in the limited space. Pixel-oriented techniques are powerful in providing an overview of large amounts of data, and meanwhile they preserve the perception of small regions of interest. This feature makes them suitable for a variety of data mining tasks on extremely large databases. Figure 4-4. Displaying attribute windows for data with six attributes Keim [KeK94] presented the first pixel-oriented technique in the VisDB system, which has the capability to represent large amounts of multidimensional data with respect to a given query. As a result, users are able to refine their query based on the knowledge gathered from the visual representation of the data. Other pixel-oriented techniques have been developed, for example, the Recursive Pattern Technique [KKA95], the Circle Segments Technique [AKK96], Spiral [KeK94], Axes [Kei97], PBC [AEK00] and OPTICS [ABK+99]. They have been successfully applied in data exploration of high dimensional databases.
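The linear brightness variation mentioned above can be sketched as a one-line rescaling from an attribute's value range to a grey level. This is a minimal sketch; the 8-bit output range and the function name are our own illustrative assumptions.

```python
# Sketch of a linear-brightness colour mapping for pixel-oriented displays:
# an attribute value is linearly rescaled to an 8-bit grey level
# (the 0..255 output range is an assumption for illustration).
def value_to_brightness(value, vmin, vmax):
    """Linearly map value in [vmin, vmax] to a brightness level in 0..255."""
    return round(255 * (value - vmin) / (vmax - vmin))

levels = [value_to_brightness(v, 0.0, 10.0) for v in (0.0, 5.0, 10.0)]
```

The hue and saturation mappings mentioned in the text follow the same pattern, differing only in which colour channel the rescaled value drives.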
Figure 4-5. Illustration of the Recursive Pattern Technique To present data more expressively in a limited area, the recursive pattern pixel-oriented technique was proposed based on a generic recursive schema [KeK94]. With the changeable parameters of the recursive schema, the user can control the semantically meaningful substructures, which determine the arrangement of the attribute values, as presented in Figure 4-5. A use of VisDB for visualizing financial information is illustrated in Figure 4-6. Figure 4-6. The Recursive Pattern Technique in VisDB [KeK94] In particular, pixel-oriented techniques aim at representing datasets in the input time order according to one attribute, whereas clustering arranges data items with similar values close together based on distance/density functions according to similarity/dissimilarity measures.
However, with pixel-oriented techniques close data items are coloured similarly but distributed in time series order, which cannot reveal the structure of clusters very well. Therefore, they are not well suited as visual representation methods in cluster analysis. 4.1.3 Geometric Techniques The basic idea of geometric techniques is to use geometric transformations and projections of the data to produce useful and insightful visualizations. Geometric projection techniques aim at finding interesting projections of multidimensional data sets [Hub85] [FrT74]. Typical systems using geometric techniques are Scatterplot-Matrices [And72, Che73], Parallel Coordinates [Ins85, ID90], Star Plots [Fie79], Landscapes [WTP+95], Projection Pursuit Techniques [Hub85], Prosection Views [FuB94, STD+95] and Hyperslice [WiL93]. Here we introduce several of them as follows. Scatterplot-Matrices Plot-based data visualization approaches such as Scatterplot-Matrices [Cle93] and similar techniques [AlC91] visualize data in rows and columns of cells containing simple graphical depictions. Figure 4-7. Scatterplot-Matrices [Cle93]
This category of techniques gives bi-attribute visual information. An example of Scatterplot-Matrices is shown in Figure 4-7. The user can clearly observe each bi-attribute data distribution. However, plot-based techniques do not provide the best overview of the whole dataset. As a result, they are not able to present clusters of datasets very well. On the other hand, plot-based visual techniques do not perform well on the presentation of databases with a large number of dimensions, due to the physical size limitation of computer monitors. Parallel Coordinates A famous multidimensional visualization technique, Parallel Coordinates, utilizes equidistant parallel axes to visualize each attribute of a given dataset and projects multiple dimensions onto a two-dimensional surface [Ins97]. The axes correspond to the dimensions and are linearly scaled from the minimum to the maximum value of the corresponding dimension. Each data item is presented as a polygonal line, intersecting each of the axes at the point which corresponds to its value in the considered dimension, as presented in Figure 4-8. Figure 4-8. 15,000 coloured data items in Parallel Coordinates
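The Parallel Coordinates mapping just described can be sketched directly: a record of n attribute values becomes n vertices of a polygonal line, one per axis, each height linearly rescaled between the dimension's minimum and maximum. This is a minimal sketch and the function name is ours.

```python
# Sketch of the Parallel Coordinates mapping: a record of n attribute values
# becomes n (axis index, height) vertices of a polygonal line, with each
# height linearly scaled between that dimension's minimum and maximum.
def polyline(record, mins, maxs):
    return [(i, (v - mins[i]) / (maxs[i] - mins[i]))
            for i, v in enumerate(record)]

line = polyline([2.0, 50.0, 1.0], mins=[0.0, 0.0, 0.0], maxs=[4.0, 100.0, 4.0])
```

Drawing one such polyline per record reproduces the overlap problem discussed below: with a huge dataset the lines pile on top of one another, which is why Parallel Coordinates struggles to give a clear overall insight.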
Star Plots arranges coordinate axes on a circular space with equal angles between neighbouring axes from the centre of a circle, and links data points on each axis by lines to form a star [SGF71]. An example of the Star Plots technique is presented in Figure 4-9. In principle, these two techniques can provide visual presentations of any number of attributes. However, neither Parallel Coordinates nor Star Plots is adequate to give the user a clear overall insight into data distribution when the dataset is huge, primarily due to the unavoidably high overlapping between data points. Another drawback of these two techniques is that, though they supply an intuitive visual relationship between neighbouring axes, the visual presentation of non-neighbouring axes may confuse the user's perception. These obstacles make them improper for visualizing cluster structure in very large and high dimensional databases. Figure 4-9. Star plots of data items [SGF71] An integrated multidimensional data visualization system, XmdvTool, has been proposed, which includes scatterplot matrices, parallel coordinates, star plots, and dimensional stacking, linking alternative views by brushing [War95]. Many other techniques have been introduced in multidimensional data visualization [OlL03]. However, most of them
either suffer from weakness in visualizing large amounts of data items and higher dimensional data, or hardly provide a clear perception of clusters in visual form to the user. The multidimensional data visualization techniques developed earlier can be found in the literature [BeR78, Fie79, HoG01, Che07, FrD07]. In the last decade, many efforts have been made on using visualization techniques to assist data miners in finding cluster patterns in data. A survey of these visualization techniques is presented below. 4.2 Visual Cluster Analysis Visual cluster analysis, as the term implies, is a discipline combining information visualization and cluster analysis techniques. With the wide application of cluster analysis in data mining, many visualization techniques have been employed to study the structure of datasets in the applications of cluster analysis. Reviews of these works can be found in the literature [AnK01, HoG01, Kei02, OlL03, MiG04]. Several representative visualization techniques that are especially important in cluster analysis are discussed below. 4.2.1 MDS and PCA Multidimensional scaling (MDS) maps multidimensional data as points in a 2D Euclidean space, where the distances between data points reflect their similarity/dissimilarity [KrW78], as illustrated in Figure 4-10. However, the relatively high computational cost of MDS (polynomial time O(N²)) limits its usability on very large datasets.
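The O(N²) cost cited for MDS stems from the pairwise distance matrix whose entries the embedding must preserve; merely building that matrix is already quadratic in the number of points. A minimal sketch of that step (the helper name is ours):

```python
# The quadratic cost behind MDS: it must preserve all N*(N-1)/2 pairwise
# distances, so even constructing the distance matrix is O(N^2).
def distance_matrix(points):
    n = len(points)
    return [[sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5
             for j in range(n)] for i in range(n)]

d = distance_matrix([(0.0, 0.0), (3.0, 4.0), (0.0, 8.0)])
```

The MDS optimization then searches for 2D positions whose mutual distances approximate this matrix, which only adds to the quadratic lower bound and explains why MDS scales poorly to very large datasets.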
Figure 4-10. Clustering of 1352 genes in MDS by [Bes] Principal Component Analysis (PCA) involves a mathematical procedure that transforms a number of (possibly) correlated variables (of higher dimensions) into a smaller number of uncorrelated variables (of lower dimensions) called principal components [Jol02]. PCA first has to find the correlated variables for reducing the dimensionality, which restricts its performance in the exploration of unknown data. 4.2.2 HD-Eye HD-Eye is an interactive visual clustering system based on density-plots of any two interesting dimensions [HKW03]. It projects any two dimensions of the multidimensional data based on density-plots to investigate interesting grouping clues. HD-Eye employs icons to represent the clusters and the relationships between the clusters, as shown in Figure 4-12.
Figure 4-12. The framework of the HD-Eye system and its different visualization projections HD-Eye provides the user with a rough data structure of high dimensional data in visual representations. However, the user can hardly synthesize all of the interesting 2D projections to find the general pattern of the clusters. Figure 4-13. The 3D data structures in HD-Eye and their intersection trails on the planes
HD-Eye also uses 3D techniques to visualize data in mountain-like structures, and uses the intersecting planes of the 3D graphs to present the trails of the graphs on the planes at different levels in 2D form [HKW99], as illustrated in Figure 4-13. But the kernel density-based 3D graph formation in HD-Eye limits its use in the interactive cluster detection of large datasets. 4.2.3 Grand Tour The Grand Tour technique uses a series of variable projections to map multidimensional data onto orthogonal 2D spaces in order to obtain different perspectives of the data [Asi85]. To effectively reduce the huge search space of the data, Projection Pursuit is introduced to help the user investigate the interesting projections [CBC+95]. The projection of Grand Tour with Projection Pursuit is illustrated in Figure 4-14a. However, because Grand Tour systems involve many repeated projections and complicated computation, their visualization models are not intuitive to users. Based on the Grand Tour technique, several extensions have been proposed. For example, Yang implemented a 3D version of the Grand Tour technique to project data in animations [Yan03], but the complex 3D graph formation of this technique limits its use in large scale data visualization. An example of Yang's Grand Tour based visualization is presented in Figure 4-14b. Dhillon et al. proposed a technique to visualize cluster structure [DMS98], but their technique visualizes only 3 clusters; it requires the assistance of a more sophisticated Grand Tour technique to deal with more than 3 clusters.
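One frame of a Grand Tour can be sketched as projecting the data onto a 2D plane whose basis depends on a time parameter. This is a heavily simplified sketch: a real Grand Tour moves the plane through general positions in d-dimensional space, whereas here one basis vector rotates and the other is held fixed, and the function name is ours.

```python
import math

# Sketch of a single Grand Tour frame: project 3-D points onto a 2-D plane
# spanned by two orthonormal vectors. Vector u rotates with the tour
# parameter t; v is held fixed here for simplicity (a real tour varies both).
def tour_frame(points, t):
    u = (math.cos(t), math.sin(t), 0.0)
    v = (0.0, 0.0, 1.0)
    return [(sum(p[k] * u[k] for k in range(3)),
             sum(p[k] * v[k] for k in range(3))) for p in points]

frame = tour_frame([(1.0, 2.0, 3.0)], t=0.0)
```

Sweeping t through a sequence of values yields the series of projections the technique animates; Projection Pursuit then scores each frame so the user need only inspect the interesting ones.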
a. The projections of the Grand Tour technique b. Grand Tour based 3D animation by [Yan03] Figure 4-14. The Grand Tour Technique and its 3D example 4.2.4 Hierarchical BLOB Based on the hierarchical clustering and visualization algorithm H-BLOB, Sprenger et al presented a technique for visualizing hierarchical clusters in nested blobs [SBG00], as shown in Figure 4-15. Figure 4-15. Cluster hierarchies are shown for 1, 5, 10 and 20 clusters [SBG00] The most significant feature of their technique is that H-BLOB not only provides an overview of the whole dataset in blobs, but also gives a detailed visual representation of
lower level clusters. By exhibiting clusters in the form of blobs, H-BLOB results in a very intuitive and easily understood visual presentation. However, the high visual complexity of the two-stage blob graph formation makes it unsuitable for cluster visualization of very large datasets. 4.2.5 SOM Kaski et al. employ the Self-Organizing Map (SOM) technique [Koh97] to project multidimensional data sets onto 2D space for matching visual models [KSP01]. Technically, in their method, a data sample is mapped into a bar graph, and then the graph is compared with all existing vector models in bar graphs to find the best matching one, as shown in Figure 4-16, where the bar graphs in the rectangular region are existing models and X is the sample bar graph. Figure 4-16. Model matching with SOM by [KSP01] However, the traversal matching process is time-consuming. On the other hand, the SOM technique is based on a single projection strategy; it is not powerful enough to discover all the interesting features from the original data. Another drawback of this technique is that, with an
increasing number of dimensions, the bar graphs become wider. As a result, the user cannot easily observe the matched cluster model intuitively. 4.2.6 FastMap Huang et al. proposed approaches based on FastMap [FaL95] to assist users in identifying and verifying the validity of clusters in visual form [HCN01, HuL00]. Their techniques work well in cluster identification, but are unable to evaluate the cluster quality very well. On the other hand, these techniques visualize clusters statically and do not always present the genuine cluster structure. As a consequence, they do not provide enough information for either clustering or cluster validation. 4.2.7 OPTICS OPTICS uses a density-based technique to detect cluster structure and visualizes it in Gaussian bumps [ABK+99]. It is an intuitive method for assisting the user to observe cluster structures. But its non-linear time complexity makes it neither suitable for dealing with very large data sets nor for providing a contrast between clustering results, as shown in Figure 4-17. Figure 4-17. Data structure mapped in Gaussian bumps by OPTICS [ABK+99] OPTICS also visualizes clusters in a 1D visualization manner [ABK+99]. It works well in finding basic arbitrarily shaped clusters, as presented in Figure 4-18. However, it lacks the ability to help the user understand inter-cluster relationships.
Figure 4-18. Clustering structure of 30,000 16-dimensional data items visualized by OPTICS [ABK+99] 4.2.8 Star Coordinates and VISTA The most relevant approach to this thesis is the Star Coordinates technique [Kan01]. The idea of the Star Coordinates technique is intuitive: it extends the perspective of the traditional orthogonal 2D X-Y and 3D X-Y-Z coordinate techniques to a higher dimensional space. Star Coordinates divides a 2D plane into n equal sectors with n coordinate axes, where each axis represents a dimension and all axes share their initial points at the centre of a circle on the 2D space [Kan01]. First, the data in each dimension are normalized into the [0, 1] interval. Then the values of all axes are mapped to orthogonal X-Y coordinates which share the initial point with the Star Coordinates on the 2D space. Thus, an n-dimensional data item is expressed as a point in the X-Y 2D plane. Figure 4-19 illustrates the mapping from 8 Star Coordinates axes to X-Y coordinates.
Figure 4-19. Positioning a point by an 8-attribute vector in Star Coordinates [Kan01] Star Coordinates provides users with the ability to apply various transformations dynamically, integrate and separate dimensions of interest, analyze correlations of multiple dimensions, and view clusters, trends, and outliers in the distribution of data. Formula (4-1) states the mathematical description of Star Coordinates.

p_j(x, y) = \left( \sum_{i=1}^{n} \bar{u}_{xi} (d_{ji} - min_i), \; \sum_{i=1}^{n} \bar{u}_{yi} (d_{ji} - min_i) \right)   (4-1)

where p_j(x, y) is the normalized location of D_j = (d_{j1}, d_{j2}, ..., d_{jn}), and d_{ji} is the value of the jth record of a data set on the ith coordinate C_i in Star Coordinates space; \bar{u}_{xi}(d_{ji} - min_i) and \bar{u}_{yi}(d_{ji} - min_i) are the unit-vector mappings of d_{ji} onto the X direction and the Y direction; min_i = min(d_{ji}, 0 ≤ j < m) and max_i = max(d_{ji}, 0 ≤ j < m) are the minimum and maximum values of the ith dimension respectively; and m is the number of records in the data set. As presented in formula (4-1), the computational complexity of the Star Coordinates projection is linear time. Therefore, Star Coordinates based techniques are powerful for the interactive visualization and analysis of clusters. Interactive Functions of Star Coordinates Star Coordinates provides various interactive functions to stimulate visual thinking in the early stages of the knowledge discovery process. Those functions include scaling axes, see Figure
4-20-a; rotating the angle between axes, see Figure 4-20-b; marking data points in a certain area by coloring; selecting data value ranges on one or more axes and marking the corresponding data points in the visualization; presenting histograms of selected clusters; and footprints for tracing the paths of data points, see Figure 4-20-c [Kan01].

Figure 4-20. Axis scaling, angle rotation and footprint functions of Star Coordinates [Kan01]: (a) axis scaling of the name attribute of the auto-mpg data; (b) angle rotation of the attributes of the auto-mpg data; (c) footprints of axis scaling of the weight and mpg attributes of the auto-mpg data

Star Coordinates based techniques

Based on the idea of Star Coordinates, but instead of normalizing the data in each dimension into the [0, 1] interval, Chen and Liu proposed an approach named α-mapping, which normalizes the data in each dimension into the [-1, 1] interval to give more expressive space for axis scaling in their VISTA and iVIBRATE systems [ChL04, ChL06]. Moreover, Chen and Liu also discussed using their approach to refine and verify clusters by
VISTA/iVIBRATE [ChL06]. Shaik and Yeasin proposed a 3D version of Star Coordinates to provide users with a more intuitive 3D environment for the observation of cluster structure [ShY06]. Ma and Teoh employed the Star Coordinates technique in their StarClass system for data classification by visualization [MaT03]. A very similar technique to Star Coordinates, RadViz, was presented in [HGM97], but its non-linear mapping is an obstacle to employing RadViz as an interactive tool for cluster analysis of very large databases.

Issues of the existing Star Coordinates based visualization techniques

The existing Star Coordinates based visualization techniques employed in cluster analysis tend to be used as information rendering tools, and do not perform well in verifying the validity of clustering results. Moreover, the exploration-oriented characteristics of these techniques inevitably make them random and imprecise in the process of cluster detection and validation. Chen and Liu combined clustering algorithms and their visualization technique VISTA/iVIBRATE to observe cluster structures of datasets, refine the quality of clusters produced by clustering algorithms, and validate clusters [ChL04, ChL05]. However, the data observation based on the α-mapping (α-adjustment) of their approach is still a random exploratory process, which inevitably suffers from subjectivity. In addition, VISTA adopts landmark points as representatives of a clustered subset and resamples them for cluster validation [ChL04]. But its experience-based landmark point selection does not always handle the scalability of data very well, because landmark points that represent one subset well may fail in other subsets of a database.
4.3 Major Challenges

Visualization can be considered as a collection of transformations from the problem domain to the representation domain [GMH+94]. A more practical and effective approach to cluster visualization is to incorporate all available clustering information, for example algorithmic clustering results and domain knowledge, into visual cluster exploration.

4.3.1 Requirements of Visualization in Cluster Analysis

From the above analysis, we can summarize that the visualization techniques to be used in cluster analysis should be able to handle several important aspects of visual perception:

1. Visualizing large and multidimensional datasets;
2. Providing a clear overview and detailed insight into the cluster structure;
3. Having linear time complexity for data mapping from a higher dimensional space to a lower dimensional space;
4. Supporting interactive and dynamic visual representation of clusters;
5. Involving the knowledge of domain experts in cluster exploration;
6. Giving data miners purposeful and precise guidance in cluster investigation and cluster validation, rather than simply random cluster exploration.

As discussed above, most existing cluster visualization techniques work well for visualizing multidimensional data sets. However, as the size and dimensionality of data sets increase, these techniques do not perform well on very large data visualizations, can hardly deal with the visual representation of higher dimensional data, and cannot provide an intuitive overview of cluster structure. In short, they do not satisfy all of the above requirements.

4.3.2 Motivation
A question arises: which visualization technique can provide a genuine representation of the cluster structure of data? In practice, few visualization techniques can meet all of the above requirements. As Seo and Shneiderman pointed out, "A large number of clustering algorithms have been developed, but only a small number of cluster visualization tools are available to facilitate researchers' understanding of the clustering results" [SeS05]. How to preserve the identity between the problem domain and the representation domain under visualization is the critical challenge of cluster visualization. Star Coordinates based techniques are a good choice for cluster visualization, because they meet almost all of the considerations above, except the last one. Simple static visualization is not sufficient for visualizing clusters [Kei01, Shn02], and it has been shown that clusters can hardly be satisfactorily preserved in a static visualization [CBC+95, DMS98]. With their linear time transformation/projection, Star Coordinates based techniques are powerful for large scale data visualization, and especially for interactive and dynamic cluster visualization. But the random and subjective characteristics of these techniques hinder their effectiveness and efficiency in real-world applications. The main motivation of this thesis is to provide effective and purposeful visual guidance to data miners in cluster analysis.

4.4 Our Approach

In this section, we briefly describe a novel approach called HOV 3 for addressing the challenges presented above. As this is a publication-based thesis, the detailed discussion of the work in this thesis can be found in the cited papers.
4.4.1 The HOV 3 Model

Visualization is typically employed as an observational mechanism to assist users with intuitive comparisons and a better understanding of the studied data. Instead of precisely contrasting clustering results, most of the existing visualization techniques employed in cluster analysis focus on providing the user with an easy and intuitive understanding of the cluster structure, or explore clusters randomly. In general, it is not easy to visualize multidimensional data sets on a 2D space and give a genuine visual interpretation, because mapping multidimensional data onto a 2D space inevitably introduces overlapping and bias. To mitigate this problem, Star Coordinates based techniques provide some visual adjustment mechanisms [Kan01, ChL04, ChL06]. However, the stochastic adjustment of Star Coordinates and VISTA limits their usability in cluster analysis. To overcome the arbitrary and random adjustments of Star Coordinates and its extensions, Zhang et al. proposed a hypothesis-oriented visual approach (Hypothesis Oriented Verification and Validation by Visualization), HOV 3 in short, to detect clusters [ZOZ+06, ZOZ06]. The idea of HOV 3 is that, in analytical geometry, the difference between a data set (a matrix) D j and a measure vector M with the same number of variables as D j can be represented by their inner product, D j · M. HOV 3 uses the measure vector M to represent the corresponding axis weight values. Then, given a non-zero measure vector M in $\mathbb{R}^n$ and a family of vectors P j, the projection of P j against M in the complex number system, the HOV 3 model, is presented as:

$$P_j(z_0) = \sum_{k=1}^{n} \left[ \frac{d_{jk} - \min(d_k)}{\max(d_k) - \min(d_k)} \cdot z_0^{\,k} \right] m_k \qquad (4\text{-}2)$$
where $\min(d_k)$ and $\max(d_k)$ represent the minimal and maximal values of the kth dimension respectively, and $m_k$ is the kth attribute of the measure vector M; $z_0 = e^{2\pi i/n}$ maps the kth axis direction onto the complex plane. The aim of the interactive adjustments of Star Coordinates and its extensions is to obtain some separated groups, or a fully separated clustering result, by tuning the weight value of each axis (axis scaling in Star Coordinates, α-adjustment in VISTA/iVIBRATE), but their arbitrary and random adjustments limit their applicability. As shown in formula (4-2), HOV 3 summarizes these adjustments as a coefficient/measure vector. Comparing formulas (4-1) and (4-2), it can be observed that HOV 3 subsumes the Star Coordinates model [ZOZ06]. Thus the HOV 3 model provides the user with a mechanism to quantify a hypothesis/prediction about a data set as a measure vector for precisely exploring grouping information.

4.4.2 External Cluster Validation by HOV 3

With the quantified measurement feature of HOV 3, an external cluster validation method based on distribution matching is proposed to verify the consistency of cluster structures between a clustered subset and non-clustered subsets of a dataset [ZOZ07a]. The idea of this approach is based on the assumption that if two same-sized data sets have a similar cluster structure, then after applying the same linear transformation to both data sets, the similarity of the newly produced distributions of the two sets should still be high. This approach employs a clustered subset of a database as a visual model (classifier) to verify the similarity of cluster structures between the model and other same-sized non-clustered subsets of the database by projecting them together in HOV 3. Technically, the user first separates each overlapped cluster individually by axis scaling or M-mapping in HOV 3. Then the data points in the separated cluster, together with the data points of the non-clustered subset that they geometrically cover, called a quasi-cluster, are picked up. Finally, instead of using statistical
methods to assess the similarity between the two subsets, this approach simply computes the overlapping rate between the clusters and their quasi-clusters to show their consistency. Compared with statistics-based validation methods, distribution matching based external cluster validation is not only visually intuitive, but also more effective in real applications [ZOZ07a].

4.4.3 Enhancing the Separation of Clusters by HOV 3

To assist data miners in investigating cluster clues effectively, an approach called M-HOV 3 /M-mapping is introduced to enhance the separation of data groups in cluster analysis by HOV 3 [ZOZ07b, ZOZ07c]. The paper [ZOZ07c] presents a mathematical proof of the following property: if it is observed that several groups of data points can be roughly separated (with ambiguous points remaining between groups) by projecting a data set against a measure vector in HOV 3, then applying M-HOV 3 /M-mapping with that measure vector to the data set leads to the groups becoming more contracted, in other words, to a better separation of the groups. This feature is significant for identifying the membership formation of clusters in the process of cluster exploration and cluster verification. This is because the contracting feature of M-HOV 3 /M-mapping keeps the data points within a cluster relatively closer, i.e., grouping information is preserved. On the other hand, the enhanced separation feature of M-HOV 3 /M-mapping pushes distant data points relatively further apart. With the advantage of the enhanced separation and contraction features of M-HOV 3 /M-mapping, the user can efficiently identify the number of clusters in the pre-processing stage of clustering, and also effectively verify the membership formation of data points among the clusters in the post-processing stage of clustering.
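The projection of formula (4-2), together with the repeated application of a measure vector, can be sketched as follows. This is a minimal illustration under two assumptions: that $z_0 = e^{2\pi i/n}$, so the powers $z_0^k$ give the n Star Coordinates axis directions, and that applying the same measure vector η times amounts to raising it to an element-wise power; the function name and NumPy usage are ours:

```python
import numpy as np

def hov3(data, measure, eta=1):
    """Project records onto the complex plane in the spirit of (4-2).

    `measure` is the vector M of per-axis weights (the prediction);
    eta > 1 sketches repeated application of the same measure vector.
    """
    data = np.asarray(data, dtype=float)
    n = data.shape[1]
    mins, maxs = data.min(axis=0), data.max(axis=0)
    norm = (data - mins) / np.where(maxs > mins, maxs - mins, 1)
    axes = np.exp(2j * np.pi / n) ** np.arange(1, n + 1)   # z0 ** k
    weights = np.asarray(measure, dtype=float) ** eta
    return (norm * weights) @ axes   # one complex point per record
```

With the all-ones measure vector and eta = 1, the projection reduces to the plain Star Coordinates mapping, illustrating the subsumption noted above.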
4.4.4 Prediction-based Cluster Analysis by HOV 3

Having a precise overview of the data distribution in the early stages of data mining is important, because correct insights into the data help data miners make decisions on adopting appropriate algorithms in the forthcoming analysis stages. Exploration discovery (qualitative analysis) is regarded as pre-processing for verification discovery (quantitative analysis), and is mainly used for building user predictions based on cluster detection or other techniques. It is an iterative process under the guidance of the user's domain knowledge, not an aimless and/or arbitrary process. In each iteration of exploration discovery, the user's feedback provides new insights and enriches their domain knowledge of the dataset they are dealing with. Predictive exploration is a mathematical description of future behaviour based on the historical exploration of patterns. The goal of predictive visual exploration by HOV 3 is that, by applying a prediction (measure vector) to a dataset, the user may identify groups from the resulting visualization. Thus the key issue in applying HOV 3 to detect grouping information is how to quantify historical patterns (or the user's domain knowledge) as a measure vector to achieve this goal. Formula (4-2) is a standard form of linear transformation of n variables, where $m_k$ is the coefficient of the kth variable of $P_j$. In principle, any measure vector, even one in complex number form, can be introduced into the linear transformation of HOV 3 if it can separate a data set into groups or produce visually well separated clusters. For example, grouping data obtained by random axis scaling in HOV 3 or by M-mapping/M-HOV 3, as well as statistical
methods that reflect the characteristics of the studied data set, can be introduced as predictions in the HOV 3 projection. With the quantified measurement of HOV 3 and the enhanced separation features of M-mapping/M-HOV 3, the user can not only summarise their historically explored knowledge about datasets as predictions, but also directly introduce the abundant statistical measurements of the studied data as predictions, to investigate cluster clues or refine clustering results effectively [ZOZ07c, ZOZ07d]. In fact, prediction-based cluster detection by statistical measurements in HOV 3 is a more purposeful form of cluster exploration, and it gives an easier geometrical interpretation of the data distribution. In addition, with statistical predictions in HOV 3 the user may even expose cluster clues that are not easily found by random cluster exploration. Separating clusters from heavily overlapped data points is an aim of this thesis. Building on work such as M-HOV 3 /M-mapping and HOV 3 with statistical measurements, any measure that results in fully separated clusters can be treated as a prediction to be introduced into external cluster validation based on data distribution matching by HOV 3. In principle, any linear transformation, even a complex linear transformation, can be employed in HOV 3 if it can separate clusters well. With well-separated clusters, the efficiency of external cluster validation by HOV 3 may be improved [ZOZ07c, ZOZ07d].
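As one concrete, hypothetical illustration of a statistical prediction, the per-dimension standard deviations of the data can serve as the measure vector M in the transformation of formula (4-2). The sketch below is self-contained; the choice of standard deviation, the roots-of-unity axis layout, and the function name are our own assumptions:

```python
import numpy as np

def project_with_std_prediction(data):
    """Apply the HOV 3 style linear transformation using the column
    standard deviations of the data as the measure vector (prediction)."""
    data = np.asarray(data, dtype=float)
    n = data.shape[1]
    measure = data.std(axis=0)                    # statistical prediction M
    mins, maxs = data.min(axis=0), data.max(axis=0)
    norm = (data - mins) / np.where(maxs > mins, maxs - mins, 1)
    axes = np.exp(2j * np.pi / n) ** np.arange(1, n + 1)
    return (norm * measure) @ axes
```

Under this measure, dimensions with higher variance contribute more to the layout, so records that differ in those dimensions are pulled further apart in the plot.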
CHAPTER 5

CONCLUSION AND FUTURE WORK

5.1 Conclusion

This thesis has proposed a novel visual approach called HOV 3, Hypothesis Oriented Verification and Validation by Visualization, to assist data miners in the cluster analysis of high-dimensional datasets. HOV 3 provides data miners with an effective mechanism to introduce their quantified domain knowledge as predictions in the cluster exploration process, for revealing the gaps between the data distribution and the predictions. As a result, using HOV 3 to investigate cluster clues in very large and high-dimensional datasets is more efficient and purposeful. This thesis has also proposed a visual cluster validation approach based on distribution matching, supported by the projection mechanism of HOV 3. This approach is based on the assumption that when a measure vector is used to project data sets with a similar cluster structure, the similarity of the resulting changes in their data distributions should be high. By comparing the data distributions of a clustered subset and non-clustered subsets projected by HOV 3 with measures, data miners can both make an intuitive visual assessment and obtain a precise evaluation of the consistency of the cluster structure by performing geometrical computation on the data distributions. Compared with existing visual techniques for cluster validation, it has been observed that this approach is not only efficient in performance, but also effective in real-world applications.
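The geometrical computation behind this validation idea can be sketched as follows. The proximity rule (a fixed radius around projected cluster points as a stand-in for "geometrically covered" points) and the function names are our own simplification, not the exact procedure of [ZOZ07a]:

```python
import numpy as np

def quasi_cluster_mask(cluster_pts, subset_pts, radius):
    """Mark points of a projected non-clustered subset that fall within
    `radius` of some point of a projected, separated cluster
    (a hypothetical stand-in for geometrically covered points)."""
    cluster_pts = np.asarray(cluster_pts, dtype=float)
    subset_pts = np.asarray(subset_pts, dtype=float)
    d = np.linalg.norm(subset_pts[:, None, :] - cluster_pts[None, :, :], axis=2)
    return (d <= radius).any(axis=1)

def overlap_rate(cluster_pts, subset_pts, radius):
    """Fraction of the same-sized subset that lands in the cluster's
    region; a high rate suggests a consistent cluster structure."""
    return quasi_cluster_mask(cluster_pts, subset_pts, radius).mean()
```

Because the computation is a simple geometric membership count rather than a statistical test, it stays cheap enough for interactive use.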
Based on the projection technique of HOV 3, a visual approach called M-HOV 3 /M-mapping has also been introduced to enhance the visual separation of clusters. The visual separability of clusters is significant for cluster analysis. Full geometrical separation of clusters is beneficial not only for revealing the membership formation of clusters, but also for verifying the validity of clustering results. With M-HOV 3 /M-mapping, data miners can both explore the cluster distribution intuitively and verify clustering results effectively by matching the geometrical distributions of clustered and non-clustered subsets produced by M-HOV 3 /M-mapping. Experiments show that the HOV 3 technique can improve the effectiveness of cluster analysis by visualization. HOV 3 can be seen as a bridge between qualitative analysis and quantitative analysis. It not only supports the verification and validation of quantified domain knowledge, but also directly utilizes the abundant statistical measurements of the studied data as predictions, in order to give data miners effective guidance towards more precise cluster information in data mining. As a consequence, with the advantage of the quantified measurement feature of HOV 3, data miners can efficiently identify the number of clusters in the pre-processing stage of clustering, and also effectively verify and refine the membership of data points among the clusters in the post-processing stage of clustering. We believe the application of HOV 3 will be fruitful.

5.2 Future Work

This thesis has addressed the challenges of introducing visualization techniques into cluster analysis in data mining, and proposed a visual technique called HOV 3 to mitigate the
problems in visual cluster analysis. However, there are still some open research issues worth future effort.

5.2.1 Three Dimensional HOV 3

This thesis has introduced quantified measures as predictions with HOV 3 to detect cluster clues and verify clustering results when clustering large datasets that automated clustering algorithms cannot effectively handle. So far, HOV 3 projects high dimensional data onto a 2D space [ZOZ06]. 3D visualization can provide more intuition and also more information about the studied data [Rei95]. However, most of the existing 3D visual techniques involved in cluster analysis are density-based or metaphor-based [KOC+04], and they suffer from the high computational cost of composing 3D graphs of clusters. This drawback limits their application to 3D cluster investigation in very large databases, and especially to 3D interactive cluster exploration [Yan03]. Recently, Shaik and Yeasin proposed a 3D visualization model based on the Star Coordinates technique [ShY06]. However, the relatively complex projection of its 3D formation is a drawback for the 3D visualization of large datasets. In fact, composing a third dimensional vector from the two known orthogonal vectors in HOV 3 is not hard; the 3D visual presentation of the data in HOV 3 can then be produced by linear combinations of the three vectors. Thanks to the linear time complexity of the HOV 3 projection, the 3D HOV 3 projection also runs in linear time. Thus data miners may grasp cluster clues from the studied datasets more effectively through interactive 3D HOV 3 exploration.

5.2.2 Dynamic Visual Cluster Analysis
Dynamic clustering, also called stepwise clustering, is a kind of iterative clustering method based on distance [ZHW+03]. Dynamic clustering aims to study the behaviour changes of groups and to revise clusters dynamically along with the cluster exploration process, even revising the criteria of clustering. It treats data grouping as cluster analysis over time series [BeC96, AbM98, CHS04]. In each clustering iteration, clustering algorithms sample at the time series points and revise the formation of clusters dynamically by given criteria. However, the existing clustering algorithms do not perform well with arbitrarily shaped data distributions, and the very high computational cost of statistics-based cluster validation methods limits their usability in real time applications. Based on the HOV 3 model, we have proposed a cluster validation method based on distribution matching in this thesis [ZOZ07a]. This approach can provide a solution to the above problem, because it only calculates the overlapping rate between the classifier (a clustered subset of a dataset) and its geometrically covered data points. It is much quicker than the existing statistics based cluster validation methods [ZOZ07a]. For revising clustering criteria, the newly produced clustering criteria can be generated automatically from the density function of the data points in the overlapped area.

5.2.3 Quasi-Cluster Data Points Collection

Based on the quantified measurement feature of HOV 3, an external cluster validation based on distribution matching has been proposed in this thesis to verify the consistency of cluster structures between a clustered subset and non-clustered subsets of a dataset [ZOZ07a]. But, so far, the quasi-cluster points are picked up manually by geometrical intuition. The Newton method of data analysis can be introduced into this quasi-cluster point collection. The Newton
method is an efficient approach for finding the neighbouring points of a given point [Smi86]. This would improve the accuracy and effectiveness of quasi-cluster point collection in HOV 3.

5.2.4 Combination of Fuzzy Logic Approaches and HOV 3

Fuzzy clustering is an active branch of cluster analysis [OlP07]. Instead of each data point being assigned to exactly one cluster, in fuzzy (soft) clustering data points can belong to more than one cluster [Sim93]. The data points can be associated with clusters at different grades, which indicate the degree of nearness of their relationship to the clusters. However, when fuzzy clustering algorithms deal with dynamic clustering applications, the cost of recomputing the grades of membership associated with the clusters is very high [BLO+03]. In the fuzzy clustering proposed in [BaB99], each data point has a vector V(1...) associated with the K clusters [Bez81]. In this thesis, the HOV 3 model has been proposed to assist data miners in cluster investigation and verification [ZOZ+06, ZOZ06]. The HOV 3 model is presented as formula (8) in [ZOZ06]. There, the measure coefficient $m_k$ of the kth dimension can be combined with the associated grade of each data point. Thus, with a color mapping function [Fai98], HOV 3 could provide a very intuitive visual presentation of the membership of each data point, since the closest data points would be colored similarly. This approach would be very helpful for data miners in identifying the membership formation of clusters during interactive cluster exploration.
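A minimal sketch of the color mapping idea, assuming the fuzzy membership grades are already available (e.g. from a fuzzy c-means run) and choosing, as our own illustrative convention, a hue per dominant cluster with saturation encoding the grade, so ambiguous points look washed out:

```python
import colorsys
import numpy as np

def membership_colors(grades):
    """Map each point's fuzzy membership grades (one row per point, one
    column per cluster, rows summing to 1) to an RGB color: the hue
    encodes the dominant cluster, the saturation its membership grade."""
    grades = np.asarray(grades, dtype=float)
    n_points, k = grades.shape
    colors = []
    for g in grades:
        dominant = int(np.argmax(g))
        hue = dominant / k            # one hue per cluster
        sat = float(g[dominant])      # confident points are vivid
        colors.append(colorsys.hsv_to_rgb(hue, sat, 1.0))
    return colors
```

A point shared equally between two clusters then appears pale, visually flagging its ambiguous membership during interactive exploration.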
APPENDIX

Publications Relevant to This Thesis:

1. K-B. Zhang, M. A. Orgun, K. Zhang and Y. Zhang, Hypothesis Oriented Cluster Analysis in Data Mining by Visualization, Proceedings of the Working Conference on Advanced Visual Interfaces 2006 (AVI06), May 23-26, 2006, Venezia, Italy, ACM Press, pp. 254-257 (2006)

2. K-B. Zhang, M. A. Orgun and K. Zhang, HOV 3: An Approach for Visual Cluster Analysis, Proceedings of the 2nd International Conference on Advanced Data Mining and Applications (ADMA 2006), Xi'an, China, August 14-16, 2006, Lecture Notes in Computer Science, Volume 4093, Springer Press, pp. 316-327 (2006)

3. K-B. Zhang, M. A. Orgun and K. Zhang, A Visual Approach for External Cluster Validation, Proceedings of the First IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007), Honolulu, Hawaii, USA, April 1-5, 2007, IEEE Press, pp. 576-582 (2007)

4. K-B. Zhang, M. A. Orgun and K. Zhang, Enhanced Visual Separation of Clusters by M-mapping to Facilitate Cluster Analysis, Proceedings of the 9th International Conference on Visual Information Systems (VISUAL 2007), June 28-29, 2007, Shanghai, China, Lecture Notes in Computer Science, Volume 4781, Springer Press, pp. 285-297 (2007)

5. K-B. Zhang, M. A. Orgun and K. Zhang, A Prediction-based Visual Approach for Cluster Exploration and Cluster Validation by HOV 3, Proceedings of the 18th European Conference on Machine Learning/11th European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2007), Warsaw, Poland, September 17-21, 2007, Lecture Notes in Computer Science, LNAI 4702, Springer Press, pp. 336-349 (2007)

6. K-B. Zhang, M. A. Orgun and K. Zhang, Predictive Hypothesis Oriented Cluster Analysis by Visualization, Journal of Data Mining and Knowledge Discovery (2007) (submitted)

Publications Not Relevant to This Thesis:

7. K-B. Zhang, M. A. Orgun and K.
Zhang, "Compiled Visual Programs by VisPro", Pan-Sydney Area Workshop on Visual Information Processing, Sydney, Australia, December 2003, Australian Computer Society Press, Vol. 36, pp. 113-117 (2004)

8. K-B. Zhang, K. Zhang and M. A. Orgun, "Semantic Specifications in Reserved Graph Grammars", The Ninth International Conference on Distributed Multimedia Systems (DMS'2003), Florida International University, Miami, Florida, USA, September (2003)
BIBLIOGRAPHY

[AAP+03] A. L. Abul, R. Alhajj, F. Polat and K. Barker, Cluster Validity Analysis Using Subsampling, Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, IEEE Press, Vol. 2, pp. 1435-1440 (2003)

[ABK+99] M. Ankerst, M. Breunig, H.-P. Kriegel and J. Sander, OPTICS: Ordering Points To Identify the Clustering Structure, Proceedings of ACM SIGMOD '99, International Conference on Management of Data, Philadelphia, PA, pp. 49-60 (1999)

[AbM98] A. J. Abrantes and J. S. Marques, A Method for Dynamic Clustering of Data, Proceedings of the British Machine Vision Conference 1998 (BMVC 1998), Southampton, UK, British Machine Vision Association, pp. 154-163 (1998)

[AEK00] M. Ankerst, M. Ester and H.-P. Kriegel, Towards an Effective Cooperation of the Computer and the User for Classification, Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD 2000), Boston, MA, pp. 179-188 (2000)

[AGG+98] R. Agrawal, J. Gehrke, D. Gunopulos and P. Raghavan, Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, Proceedings of the ACM SIGMOD Conference, Seattle, WA, pp. 94-105 (1998)

[AhW95] C. Ahlberg and E. Wistrand, IVEE: An Environment for Automatic Creation of Dynamic Queries Applications, Proceedings of the Human Factors in Computing Systems CHI '95 Conference, Demo Program, Denver, CO (1995)

[AKK96] M. Ankerst, D. A. Keim and H.-P. Kriegel, Circle Segments: A Technique for Visually Exploring Large Multidimensional Data Sets, Proceedings of Visualization '96, Hot Topic Session, San Francisco, CA (1996)

[ALA+03] O. Abul, A. Lo, R. Alhajj, F. Polat and K. Barker, Cluster Validity Analysis Using Subsampling, Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (IEEE-SMC), Vol. 2, pp. 1435-1440 (2003)

[AlC91] B. Alpern and L. Carter, Hyperbox, Proceedings of Visualization '91, San Diego, CA, pp. 133-139 (1991)

[And72] D. F.
Andrews, Plots of High-Dimensional Data, Biometrics, Vol. 29, pp. 125-136 (1972)

[And73] M. Anderberg, Cluster Analysis for Applications, New York: Academic Press (1973)

[AnK01] M. Ankerst and D. Keim, Visual Data Mining and Exploration of Large Databases, Proceedings of the 5th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'01), Freiburg, Germany, September 2001
[Asi85] D. Asimov, The Grand Tour: A Tool for Viewing Multidimensional Data, SIAM Journal of Scientific and Statistical Computing, Vol. 6(1), pp. 128-143 (1985)

[BaB99] A. Baraldi and P. Blonda, A Survey of Fuzzy Clustering Algorithms for Pattern Recognition, Part I, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Vol. 29(6), pp. 778-785 (1999)

[BeC96] D. J. Berndt and J. Clifford, Finding Patterns in Time Series: A Dynamic Programming Approach, in Advances in Knowledge Discovery and Data Mining, Menlo Park, CA: AAAI/MIT Press, pp. 229-248 (1996)

[Bed90] J. Beddow, Shape Coding of Multidimensional Data on a Microcomputer Display, Proceedings of Visualization '90, San Francisco, CA, pp. 238-246 (1990)

[BeG03] A. Ben-Hur and I. Guyon, Detecting Stable Clusters Using Principal Component Analysis, Methods in Molecular Biology, M. J. Brownstein and A. Khodursky (eds.), Humana Press, pp. 159-182 (2003)

[BEG02] A. Ben-Hur, A. Elisseeff and I. Guyon, A Stability Based Method for Discovering Structure in Clustered Data, Proceedings of the Pacific Symposium on Biocomputing (2002)

[Ber06] P. Berkhin, A Survey of Clustering Data Mining Techniques, in J. Kogan, C. Nicholas and M. Teboulle (eds.), Grouping Multidimensional Data, Springer Press, pp. 25-72 (2006)

[BeR78] J. R. Beniger and D. L. Robyn, Quantitative Graphics in Statistics: A Brief History, The American Statistician, Vol. 32(1), pp. 1-9 (1978)

[Bes] C. Best, http://www.computationalgroup.com/tigertiger/cb/index.html

[Bez81] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York (1981)

[BLO+03] M. Buerki, K. O. Lovblad, H. Oswald, A. C. Nirkko, P. Stein, C. Kiefer and G. Schroth, Multiresolution Fuzzy Clustering of Functional MRI Data, Neuroradiology, Vol. 45, pp. 691-699 (2003)

[BPR+04] C. Baumgartner, C. Plant, K. Kailing, H.-P. Kriegel and P.
Kroger, Subspace Selection for Clustering High-Dimensional Data, Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM '04), pp. 11-18 (2004)

[CBC+95] D. Cook, A. Buja, J. Cabrera and C. Hurley, Grand Tour and Projection Pursuit, Journal of Computational and Graphical Statistics, Vol. 23, pp. 155-172 (1995)

[Che07] C. Chen, A Brief History of Data Visualization, in W. Hardle and A. Unwin (eds.), Handbook of Computational Statistics: Data Visualization, Vol. III, Springer (2007)

[Che73] H. Chernoff, The Use of Faces to Represent Points in k-Dimensional Space Graphically, Journal of the American Statistical Association, Vol. 68, pp. 361-368 (1973)
[Chi00] E. Chi, A Taxonomy of Visualization Techniques Using the Data State Reference Model, Proceedings of the Symposium on Information Visualization (InfoVis 2000), pp. 69-75 (2000)

[CHS04] W.-P. Chen, J. C. Hou and L. Sha, Dynamic Clustering for Acoustic Target Tracking in Wireless Sensor Networks, IEEE Transactions on Mobile Computing, Vol. 3(3), July-September 2004, pp. 358-371

[CKS+88] P. Cheeseman, J. Kelly, M. Self, J. Stutz, W. Taylor and D. Freeman, AutoClass: A Bayesian Classification System, Proceedings of the 5th International Conference on Machine Learning, Morgan Kaufmann, pp. 54-64 (1988)

[Cle93] W. S. Cleveland, Visualizing Data, AT&T Bell Laboratories, Murray Hill, NJ, Hobart Press, Summit, NJ (1993)

[CMS99] S. K. Card, J. D. Mackinlay and B. Shneiderman (eds.), Readings in Information Visualization: Using Vision to Think, Morgan Kaufmann, San Francisco (1999)

[DMS98] I. S. Dhillon, D. S. Modha and W. S. Spangler, Visualizing Class Structure of Multidimensional Data, Proceedings of the 30th Symposium on the Interface: Computing Science and Statistics, Vol. 30, pp. 488-493 (1998)

[Dom01] B. Dom, An Information-Theoretic External Cluster-Validity Measure, Research Report RJ 10219, IBM T. J. Watson Research Center (2001)

[DuJ79] R. Dubes and A. K. Jain, Validity Studies in Clustering Methodologies, Pattern Recognition, Vol. 11, pp. 235-254 (1979)

[EKS+96] M. Ester, H.-P. Kriegel, J. Sander and X. Xu, A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), pp. 226-231 (1996)

[ELL01] B. Everitt, S. Landau and M. Leese, Cluster Analysis, London: Arnold (2001)

[Fai98] M. D. Fairchild, Color Appearance Models, Addison-Wesley, Reading, MA (1998)

[FaL95] C. Faloutsos and K. Lin, FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets, Proceedings of ACM SIGMOD '95, pp. 163-174
(1995) [FGW02] U. Fayyad, G. Grinstein and A. Wierse (eds.), Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann Publishers (2002) [Fie79] S. E. Fienberg, Graphical methods in statistics, The American Statistician, Vol. 33, pp.165-178 (1979) [Fis87] D. Fisher, Improving Inference through Conceptual Clustering, Proceedings of the 1987 AAAI Conference, Seattle, Washington, pp.461-465 (1987)
[FoM83] E. Fowlkes and C. Mallows, A method for comparing two hierarchical clusterings, Journal of the American Statistical Association, Vol. 78, pp.553-569 (1983) [FrD01] J. Fridlyand and S. Dudoit, Applications of resampling methods to estimate the number of clusters and to improve the accuracy of a clustering method, Statistics Department Technical Report No. 600, University of California (2001) [FrD07] M. Friendly and D. J. Denis, Milestones in the history of thematic cartography, statistical graphics, and data visualization, http://www.math.yorku.ca/scs/gallery/milestone/visualization_milestones.pdf, York University, Canada (2007) [FrT74] J. Friedman and J. Tukey, A Projection Pursuit Algorithm for Exploratory Data Analysis, IEEE Transactions on Computers, Vol. 23, pp.881-890 (1974) [FuB94] G. W. Furnas and A. Buja, Prosection Views: Dimensional Inference through Sections and Projections, Journal of Computational and Graphical Statistics, Vol. 3(4), pp.323-353 (1994) [Fu90] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, San Diego, CA (1990) [GMH+94] G. Grinstein, T. Mihalisin, H. Hinterberger and A. Inselberg, Visualizing multidimensional (multivariate) data and relations, Proceedings of the conference on Visualization '94, IEEE, pp.404-409 (1994) [Gor98] A. D. Gordon, Cluster validation, in Data Science, Classification, and Related Methods, C. Hayashi, N. Ohsumi, K. Yajima, Y. Tanaka, H.-H. Bock and Y. Baba (eds.), Springer, Tokyo, pp.22-39 (1998) [GRS98] S. Guha, R. Rastogi and K. Shim, CURE: An efficient clustering algorithm for large databases, Proceedings of the ACM SIGMOD Conference 98, pp.73-84 (1998) [HaK01] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers (2001) [HaJ97] P. Hansen and B. Jaumard, Cluster analysis and mathematical programming, Mathematical Programming, Vol. 79, pp.191-215 (1997) [HaV01] M. Halkidi and M.
Vazirgiannis, Clustering Validity Assessment: Finding the Optimal Partitioning of a Data Set, Proceedings of ICDM 2001, pp.187-194 (2001) [Har75] J. Hartigan, Clustering Algorithms, Wiley, New York (1975) [HBV01] M. Halkidi, Y. Batistakis and M. Vazirgiannis, On Clustering Validation Techniques, Journal of Intelligent Information Systems, Vol. 17(2-3) (2001) [HBV02] M. Halkidi, Y. Batistakis and M. Vazirgiannis, Cluster Validity Methods: Part I & II, SIGMOD Record, Vol. 31(2-3) (2002)
[HCN01] Z. Huang, D. W. Cheung and M. K. Ng, An Empirical Study on the Visual Cluster Validation Method with Fastmap, Proceedings of the 7th International Conference on Database Systems for Advanced Applications, pp.84-91 (2001) [Hea95] M. Hearst, TileBars: Visualization of Term Distribution Information in Full Text Information Access, Proceedings of the ACM Human Factors in Computing Systems Conference (CHI '95), pp.59-66 (1995) [HGM97] P. Hoffman, G. Grinstein, K. Marx, I. Grosse and E. Stanley, DNA visual and analytic data mining, IEEE Visualization, pp.437-442 (1997) [HiK98] A. Hinneburg and D. Keim, An Efficient Approach to Clustering in Large Multimedia Databases with Noise, Proceedings of KDD-98 (1998) [HKK05] J. Handl, J. Knowles and D. B. Kell, Computational cluster validation in post-genomic data analysis, Bioinformatics, Vol. 21(15), pp.3201-3212 (2005) [HKW99] A. Hinneburg, D. A. Keim and M. Wawryniuk, HD-Eye: Visual Mining of High-Dimensional Data, IEEE Computer Graphics and Applications, Vol. 19(5), September 1999, pp.22-31 (1999) [HKW03] A. Hinneburg, D. A. Keim and M. Wawryniuk, HD-Eye: Visual Clustering of High-dimensional Data, Proceedings of the 19th International Conference on Data Engineering, pp.753-755 (2003) [HoG01] P. E. Hoffman and G. G. Grinstein, A survey of visualizations for multidimensional data mining, in Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann Publishers Inc., pp.47-82 (2001) [Hub85] P. J. Huber, Projection Pursuit, The Annals of Statistics, Vol. 13(2), pp.435-474 (1985) [HuL00] Z. Huang and T. Lin, A visual method of cluster validation with Fastmap, Proceedings of PAKDD-2000, pp.153-164 (2000) [InD90] A. Inselberg and B. Dimsdale, Parallel Coordinates: A Tool for Visualizing Multi-Dimensional Geometry, Proceedings of Visualization 90, San Francisco, CA, pp.361-370 (1990) [Ins85] A.
Inselberg, The Plane with Parallel Coordinates, Special Issue on Computational Geometry, The Visual Computer, Vol. 1, pp.69-97 (1985) [Ins97] A. Inselberg, Multidimensional Detective, Proceedings of IEEE Information Visualization '97, pp.100-107 (1997) [JaD88] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall Press (1988) [Jac08] P. Jaccard, Nouvelles recherches sur la distribution florale, Bull. Soc. Vaud. Sci. Nat., Vol. 44, pp.223-270 (1908)
[JMF99] A. Jain, M. N. Murty and P. J. Flynn, Data Clustering: A Review, ACM Computing Surveys, Vol. 31(3), pp.264-323 (1999) [Jol02] I. T. Jolliffe, Principal Component Analysis, Springer Press (2002) [Kei01] D. A. Keim, Visual exploration of large data sets, Communications of the ACM, Vol. 44(8), pp.38-44 (2001) [Kan00] E. Kandogan, Star Coordinates: A Multi-dimensional Visualization Technique with Uniform Treatment of Dimensions, IEEE Symposium on Information Visualization 2000, Salt Lake City, Utah, pp.4-8 (2000) [KaR90] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: an Introduction to Cluster Analysis, John Wiley & Sons (1990) [KDN+96] T. Kanungo, B. Dom, W. Niblack and D. Steele, A fast algorithm for MDL-based multi-band image segmentation, in Image Technology, J. Sanz (ed.), Springer-Verlag (1996) [KeC00] M. K. Kerr and G. A. Churchill, Bootstrapping Cluster Analysis: Assessing the Reliability of Conclusions from Microarray Experiments, Proceedings of the National Academy of Sciences (2000) [Kei02] D. A. Keim, Information Visualization and Data Mining, IEEE Transactions on Visualization and Computer Graphics, Vol. 7(1), January-March 2002, pp.100-107 (2002) [KeK94] D. A. Keim and H.-P. Kriegel, VisDB: Database Exploration Using Multidimensional Visualization, IEEE Computer Graphics and Applications, Vol. 14(5), pp.40-49 (1994) [KHK99] G. Karypis, E.-H. S. Han and V. Kumar, Chameleon: hierarchical clustering using dynamic modeling, IEEE Computer, Vol. 32(8), pp.68-75 (1999) [KrW78] J. B. Kruskal and M. Wish, Multidimensional Scaling, SAGE university paper series on quantitative applications in the social sciences, Sage Publications, CA, pp.07-011 (1978) [KKA95] D. A. Keim, H.-P. Kriegel and M. Ankerst, Recursive Pattern: A Technique for Visualizing Very Large Amounts of Data, Proceedings of Visualization 95, Atlanta, GA, pp.279-286 (1995) [KOC+04] S. Kabelac, S. Olbrich, K. Chmielewski, K. Meier, C.
Holzknecht, 3D Visualization of Molecular Simulations in High-performance Parallel Computing Environments, Journal of Molecular Simulation, Vol. 30(7), June 2004, Taylor and Francis Ltd., pp.469-477 (2004) [Koh97] T. Kohonen, Self-Organizing Maps, Springer, Berlin, second extended edition (1997) [KSP01] S. Kaski, J. Sinkkonen and J. Peltonen, Data Visualization and Analysis with Self-Organizing Maps in Learning Metrics, DaWaK 2001, LNCS 2114, pp.162-173 (2001)
[LeD01] E. Levine and E. Domany, Resampling Method for Unsupervised Estimation of Cluster Validity, Neural Computation (2001) [Lev91] H. Levkowitz, Color icons: Merging color and texture perception for integrated visualization of multiple parameters, Proceedings of the 2nd conference on Visualization '91, San Diego, CA, pp.164-170 (1991) [LKS+04] J. Lin, E. Keogh, S. Lonardi, J. Lankford and D. M. Nystrom, Visually Mining and Monitoring Massive Time Series, KDD 04, August 22-25, 2004, Seattle, Washington, USA (2004) [Mac67] J. B. MacQueen, Some Methods for Classification and Analysis of Multivariate Observations, Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley, pp.281-297 (1967) [MaT03] K.-L. Ma and S. T. Teoh, StarClass: Interactive Visual Classification Using Star Coordinates, Proceedings of the 3rd SIAM International Conference on Data Mining, pp.178-185 (2003) [MaW] The MathWorks, Inc., textbook online, http://www.mathworks.com/ [MiG04] J. R. Miller and E. A. Gustavo, The Immersive Visualization Probe for Exploring n-Dimensional Spaces, Proceedings of IEEE Computer Graphics and Applications 2004, pp.76-85 (2004) [MiI80] G. W. Milligan and P. D. Isaac, The validation of four ultrametric clustering algorithms, Pattern Recognition, Vol. 12, pp.41-50 (1980) [Mil81] G. W. Milligan, A Monte Carlo study of thirty internal criterion measures for cluster analysis, Psychometrika, Vol. 46(2), pp.187-199 (1981) [Mil96] G. W. Milligan, Clustering validation: results and implications for applied analysis, in Clustering and Classification, P. Arabie, L. J. Hubert and G. De Soete (eds.), World Scientific, pp.341-375 (1996) [MRC02] A. Morrison, G. Ross and M. Chalmers, Combining and comparing clustering and layout algorithms, University of Glasgow (2002) [MSS83] G. W. Milligan, L. M. Sokol, and S. C.
Soon, The effect of cluster size, dimensionality and the number of clusters on recovery of true cluster structure, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 5(1), pp.40-47 (1983) [OlL03] F. Oliveira and H. Levkowitz, From Visual Data Exploration to Visual Data Mining: A Survey, IEEE Transactions on Visualization and Computer Graphics, Vol. 9(3), pp.378-394 (2003) [OlP07] J. V. de Oliveira and W. Pedrycz (eds.), Advances in Fuzzy Clustering and its Applications, Wiley, June (2007) [Pei] http://www.cs.sfu.ca/~jpei/
[PGW03] E. Pampalk, W. Goebl, and G. Widmer, Visualizing Changes in the Structure of Data for Exploratory Feature Selection, Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 03), August 24-27, 2003, Washington, DC, USA, pp.157-166 (2003) [Pic70] R. M. Pickett, Visual Analyses of Texture in the Detection and Recognition of Objects, in Picture Processing and Psycho-Pictorics, B. S. Lipkin and A. Rosenfeld (eds.), Academic Press, New York (1970) [Ran71] W. M. Rand, Objective Criteria for the Evaluation of Clustering Methods, Journal of the American Statistical Association, Vol. 66, pp.846-850 (1971) [Rei95] S. P. Reiss, An Engine for the 3D Visualization of Program Information, Journal of Visual Languages and Computing, Vol. 6, pp.299-323 (1995) [RBL+02] V. Roth, M. L. Braun, T. Lange and J. M. Buhmann, Stability-Based Model Order Selection in Clustering with Applications to Gene Expression Data, Lecture Notes in Computer Science, Vol. 2415, Proceedings of the International Conference on Artificial Neural Networks, pp.607-612 (2002) [RKJ+99] W. Ribarsky, J. Katz, F. Jiang and A. Holland, Discovery Visualization using Fast Clustering, IEEE Computer Graphics and Applications, Vol. 19(5) (1999) [RSE99] R. M. Rohrer, J. L. Sibert and D. S. Ebert, Shape-based Visual Interface for Text Retrieval, IEEE Computer Graphics and Applications, Vol. 19(5) (1999) [SBG00] T. C. Sprenger, R. Brunella and M. H. Gross, H-BLOB: a hierarchical visual clustering method using implicit surfaces, Proceedings of Visualization 2000, pp.61-68 (2000) [SCZ98] G. Sheikholeslami, S. Chatterjee and A. Zhang, WaveCluster: A multi-resolution clustering approach for very large spatial databases, Proceedings of the Very Large Databases Conference (VLDB 98), pp.428-439 (1998) [SDT+95] H. Su, H. Dawes, L. Tweedie and R. Spence, An Interactive Visualization Tool for Tolerance Design, Technical Report, Imperial College, London (1995) [SeS05] J. Seo and B.
Shneiderman, From Integrated Publication and Information Systems to Virtual Information and Knowledge Environments, Essays Dedicated to Erich J. Neuhold on the Occasion of His 65th Birthday, Lecture Notes in Computer Science, Vol. 3379, Springer (2005) [SGF71] J. H. Siegel, R. M. Goldwyn and H. P. Friedman, Irregular polygon to represent multivariate data (with vertices at equal intervals, distanced from the centre proportionally to the value of the variable), USA, October (1971) [Sha96] S. Sharma, Applied multivariate techniques, John Wiley & Sons, Inc. (1996) [Shn01] B. Shneiderman, Inventing Discovery Tools: Combining Information Visualization with Data Mining, Proceedings of Discovery Science 2001, Lecture Notes in Computer Science, Vol. 2226, pp.17-28 (2001)
[Shn02] B. Shneiderman, Inventing discovery tools: Combining information visualization with data mining, Information Visualization, Vol. 1, pp.5-12 (2002) [ShY06] J. Shaik and M. Yeasin, Visualization of High Dimensional Data using an Automated 3D Star Co-ordinate System, Proceedings of the International Joint Conference on Neural Networks 2006 (IJCNN '06), 16-21 July 2006, Vancouver, Canada, IEEE Press, pp.1339-1346 (2006) [Sim93] P. K. Simpson, Fuzzy min-max neural networks, Part II: Clustering, IEEE Transactions on Fuzzy Systems, Vol. 1(1), pp.32-45 (1993) [Smi86] W. A. Smith, Elementary Numerical Analysis, Prentice-Hall (1986) [Th99] S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press (1999) [VSA05] R. Vilalta, T. Stepinski and M. Achari, An Efficient Approach to External Cluster Assessment with an Application to Martian Topography, Technical Report No. UH-CS-05-08, Department of Computer Science, University of Houston (2005) [War95] M. Ward, High dimensional brushing for interactive exploration of multivariate data, Proceedings of Visualization '95, pp.271-278 (1995) [WiL93] J. J. van Wijk and R. D. van Liere, HyperSlice, Proceedings of the Visualization 93 Conference, San Jose, CA, pp.119-125 (1993) [WTP+95] J. A. Wise, J. J. Thomas, K. Pennock, D. Lantrip, M. Pottier, A. Schur and V. Crow, Visualizing the Non-Visual: Spatial Analysis and Interaction with Information from Text Documents, Proceedings of the Symposium on Information Visualization 1995, Atlanta, GA, pp.51-58 (1995) [WYM97] W. Wang, J. Yang and R. Muntz, STING: A statistical information grid approach to spatial data mining, Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB 97), pp.186-195 (1997) [XEK+98] X. Xu, M. Ester, H.-P. Kriegel and J. Sander, A distribution-based clustering algorithm for mining in large spatial databases, Proceedings of the IEEE International Conference on Data Engineering (ICDE 98), pp.324-331 (1998) [XuW05] R. Xu and D. C.
Wunsch, Survey of Clustering Algorithms, IEEE Transactions on Neural Networks, Vol. 16(3), May 2005, pp.645-678 (2005) [Yan03] L. Yang, Visual Exploration of Large Relational Data Sets through 3D Projections and Footprint Splatting, IEEE Transactions on Knowledge and Data Engineering, Vol. 15(6), pp.1460-1471, November/December (2003) [ZHW+03] X. Zheng, P. He, F. Wan, Z. Wang and G. Wu, Dynamic Clustering Analysis of Documents Based on Cluster Centroids, Proceedings of the Second International Conference on Machine Learning and Cybernetics, Xi'an, 2-5 Nov. 2003, IEEE Press, Vol. 1, pp.194-198 (2003)
[ZOZ+06] K.-B. Zhang, M. A. Orgun, K. Zhang and Y. Zhang, Hypothesis Oriented Cluster Analysis in Data Mining by Visualization, Proceedings of the Working Conference on Advanced Visual Interfaces 2006 (AVI 2006), May 23-26, 2006, Venezia, Italy, ACM Press, pp.254-257 (2006) [ZOZ06] K.-B. Zhang, M. A. Orgun and K. Zhang, HOV 3 : An Approach for Visual Cluster Analysis, Proceedings of the 2nd International Conference on Advanced Data Mining and Applications (ADMA 2006), Xi'an, China, August 14-16, 2006, Lecture Notes in Computer Science, Vol. 4093, Springer Press, pp.316-327 (2006) [ZOZ07a] K.-B. Zhang, M. A. Orgun and K. Zhang, A Visual Approach for External Cluster Validation, Proceedings of the First IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007), Honolulu, Hawaii, USA, April 1-5, 2007, IEEE Press, pp.576-582 (2007) [ZOZ07b] K.-B. Zhang, M. A. Orgun and K. Zhang, Enhanced Visual Separation of Clusters by M-mapping to Facilitate Cluster Analysis, Proceedings of the 9th International Conference on Visual Information Systems (VISUAL 2007), June 28-29, 2007, Shanghai, China, Lecture Notes in Computer Science, Vol. 4781, Springer Press, pp.288-300 (2007) [ZOZ07c] K.-B. Zhang, M. A. Orgun and K. Zhang, A Prediction-based Visual Approach for Cluster Exploration and Cluster Validation by HOV 3 , Proceedings of the 18th European Conference on Machine Learning / 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2007), Warsaw, Poland, September 17-21, 2007, Lecture Notes in Computer Science, LNAI Vol. 4702, Springer Press, pp.336-349 (2007) [ZOZ07d] K.-B. Zhang, M. A. Orgun and K. Zhang, Predictive Hypothesis Oriented Cluster Analysis by Visualization, Journal of Data Mining and Knowledge Discovery (2007) (submitted) [ZRL96] T. Zhang, R. Ramakrishnan and M. Livny, BIRCH: An Efficient Data Clustering Method for Very Large Databases, Proceedings of the ACM SIGMOD International Conference on Management of Data, pp.103-114 (1996)
Hypothesis Oriented Cluster Analysis in Data Mining by Visualization
Ke-Bing Zhang, Mehmet A. Orgun, Department of Computing, Macquarie University, Sydney, NSW 2109, Australia, 612-9850 9590, 612-9850 9570, {kebing, mehmet}@ics.mq.edu.au
Kang Zhang, Department of Computer Science, The University of Texas at Dallas, Richardson, TX 75083-0688, USA, 1-972-883 6351, kzhang@utdallas.edu
Yihao Zhang, Department of Computing, Macquarie University, Sydney, NSW 2109, Australia, 612-9850 9590, yihao@ics.mq.edu.au
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. AVI '06, May 23-26, 2006, Venezia, Italy. Copyright 2006 ACM 1-59593-353-0/06/0005...$5.00.
[Recovered equations from the paper body:] Star Coordinates positions the $j$th record $D_j=(d_{j1},\dots,d_{jn})$ at $p_j(x,y)=\bigl(\sum_{i=1}^{n}u_{xi}(d_{ji}-\min_i),\ \sum_{i=1}^{n}u_{yi}(d_{ji}-\min_i)\bigr)$ (1). The gap between two vectors is measured by their inner product, with $\cos(\theta)=\langle A,B\rangle/(\|A\|\,\|B\|)$ (2), where $\|A\|=\sqrt{a_1^2+a_2^2+\dots+a_n^2}$ and $\|B\|=\sqrt{b_1^2+b_2^2+\dots+b_n^2}$; for a measure vector $M=(m_1,\dots,m_n)$, $\langle M,D_j\rangle=m_1 d_{j1}+m_2 d_{j2}+\dots+m_n d_{jn}=\sum_{k=1}^{n}m_k d_{jk}$ (3). Using the Euler formula $e^{ix}=\cos x+i\sin x$ and $z_0=e^{2\pi i/n}$, the Star Coordinates mapping becomes $p_j(z_0)=\sum_{k=1}^{n}\bigl[(d_{jk}-\min d_k)/(\max d_k-\min d_k)\bigr]z_0^{k}$ (6); writing $dn_{jk}$ for the normalized value of $d_{jk}$, $p_j(z)=\sum_{k=1}^{n}dn_{jk}\,z^{k}$ with $z=e^{i\theta}$ (8), and the HOV 3 projection against a measure $m$ is $p_j(z)=\sum_{k=1}^{n}dn_{jk}\,m_k\,z^{k}$ (9).
HOV 3 : An Approach to Visual Cluster Analysis Ke-Bing Zhang 1, Mehmet A. Orgun 1, and Kang Zhang 2 1 Department of Computing, Macquarie University, Sydney, NSW 2109, Australia {kebing, mehmet}@ics.mq.edu.au 2 Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083-0688, USA kzhang@utdallas.edu Abstract. Clustering is a major technique in data mining. However, the numerical feedback of clustering algorithms makes it difficult for users to gain an intuitive overview of the dataset they are dealing with. Visualization has proven to be very helpful for high-dimensional data analysis, so it is desirable to introduce visualization techniques, together with users' domain knowledge, into the clustering process. However, most existing visualization techniques used in clustering are exploration oriented and, inevitably, mainly stochastic and subjective in nature. In this paper, we introduce an approach called HOV 3 (Hypothesis Oriented Verification and Validation by Visualization), which projects high-dimensional data onto 2D space and reflects data distributions based on user hypotheses. In addition, HOV 3 enables users to adjust hypotheses iteratively in order to obtain an optimized view. As a result, HOV 3 provides users with an efficient and effective visualization method for exploring cluster information. 1 Introduction Clustering is an important technique that has been successfully used in data mining. The goal of clustering is to partition objects into groups (clusters) based on given criteria. In data mining, the datasets used in clustering are normally huge and high dimensional. Nowadays, the clustering process is mainly performed by computers with automated clustering algorithms. However, those algorithms favor clustering spherical or regularly shaped datasets, and are not very effective in dealing with arbitrarily shaped clusters. This is because they are based on the assumption that datasets have a regular cluster distribution.
Several efforts have been made to deal with datasets with arbitrarily shaped data distributions [2], [11], [9], [21], [23], [25]. However, those approaches still have some drawbacks in handling irregularly shaped clusters. For example, CURE [11], FAÇADE [21] and BIRCH [25] perform well on low-dimensional datasets, but as the number of dimensions increases, they encounter high computational complexity. Other approaches, such as the density-based clustering techniques DBSCAN [9] and OPTICS [2], and the wavelet-based clustering technique WaveCluster [23], attempt to cope with this problem, but their non-linear complexity often makes them unsuitable for the analysis of very large datasets. In high-dimensional spaces, traditional clustering algorithms tend to break down in terms of efficiency as well as accuracy because data do not cluster well X. Li, O.R. Zaiane, and Z. Li (Eds.): ADMA 2006, LNAI 4093, pp. 316-327, 2006. Springer-Verlag Berlin Heidelberg 2006
anymore. The recent clustering algorithms applied in data mining are surveyed by Jain et al. [15] and Berkhin [4]. Visual data mining is mainly a combination of information visualization and data mining. In the data mining process, visualization can provide data miners with intuitive feedback on data analysis and support decision-making activities. In addition, visual presentations can be very powerful in revealing trends, highlighting outliers, showing clusters, and exposing gaps in data [24]. Many visualization techniques have been employed to study the structure of datasets in the applications of cluster analysis [18]. However, in practice, those visualization techniques treat the problem of cluster visualization simply as a layout problem. Several visualization techniques have been developed for cluster discovery [2], [6], [16], but they are more exploration oriented, i.e., stochastic and subjective in the cluster discovery process. In this paper, we propose a novel approach, named HOV 3 (Hypothesis Oriented Verification and Validation by Visualization), which projects the data distribution based on given hypotheses by visualization in 2D space. Our approach adopts user hypotheses (quantitative domain knowledge) as measures in the cluster discovery process to reveal the gaps between the data distribution and the measures. It is more object/goal oriented and measurable. The rest of this paper is organized as follows. Section 2 briefly reviews related work on cluster analysis and visualization in data mining. Section 3 provides a more detailed account of our approach HOV 3 and its mathematical description. Section 4 demonstrates the application of our approach to several well-known datasets in the data mining area to show its effectiveness. Finally, Section 5 evaluates our approach and provides a succinct summary.
2 Related Work Cluster analysis aims to find patterns (clusters) and relations among the patterns in large multi-dimensional datasets. In high-dimensional spaces, traditional clustering algorithms tend to break down in terms of efficiency as well as accuracy because data do not cluster well anymore. Thus, using visualization techniques to explore and understand high-dimensional datasets is becoming an efficient way to combine human intelligence with the immense brute-force computation power available nowadays [19]. Many studies have been performed on high-dimensional data visualization [18], but most of those visualization approaches have difficulty in dealing with high-dimensional and very large datasets. For example, icon-based methods [7], [17], [20] can display high-dimensional properties of data; however, as the amount of data increases substantially, the user may find it hard to understand most properties of the data intuitively, since the user cannot focus on the details of each icon. Plot-based data visualization approaches such as Scatterplot-Matrices [8] and similar techniques [1], [5] visualize data in rows and columns of cells containing simple graphical depictions. This kind of technique gives bi-attribute visual information, but does not give the best overview of the whole dataset. As a result, such techniques are not able to present clusters in the dataset very well.
Parallel Coordinates [14] utilizes equidistant parallel axes to visualize each attribute of a given dataset and projects multiple dimensions onto a two-dimensional surface. Star Plots [10] arranges coordinate axes on a circle with equal angles between neighbouring axes from the centre of the circle, and links the data points on each axis by lines to form a star. In principle, these techniques can provide visual presentations of any number of attributes. However, neither parallel coordinates nor star plots are adequate to give the user a clear overall insight into the data distribution when the dataset is huge, primarily due to the unavoidably high overlapping. Another drawback of these two techniques is that, while they can supply a more intuitive visual relationship between neighbouring axes, for non-neighbouring axes the visual presentation may confuse the user's perception. HD-Eye [12] is an interactive visual clustering system based on density plots of any two interesting dimensions. The 1D-visualization-based OPTICS [2] works well in finding basic arbitrarily shaped clusters, but lacks the ability to help the user understand inter-cluster relationships. The approaches that are most relevant to our research are Star Coordinates [16] and its extensions, such as VISTA [6]. Star Coordinates arranges coordinate axes on a two-dimensional surface, where each axis shares the same origin point. This approach utilizes a point to represent a vector element. We give a more detailed discussion of Star Coordinates in contrast with our model in the next section. The recent surveys [3], [13] provide a comprehensive summary of high-dimensional visualization approaches in data mining. 3 Our Approach Data mining approaches are roughly categorized into discovery driven and verification driven [22].
Discovery driven approaches attempt to discover information automatically by using appropriate tools or algorithms, while verification driven approaches aim at validating a hypothesis derived from user domain knowledge. The discovery driven method can be regarded as discovering information by exploration, and the verification driven approach can be thought of as discovering information by verification. Star Coordinates [16] is a good choice as an exploration discovery tool for cluster analysis in a high-dimensional setting. The Star Coordinates technique and its salient features are briefly presented below. 3.1 Star Coordinates Star Coordinates arranges the values of the n attributes of a database on n-dimensional coordinates on a 2D plane. The minimum data value on each dimension is mapped to the origin, and the maximum value is mapped to the other end of the coordinate axis. Unit vectors on each coordinate axis are then calculated accordingly to allow scaling of data values to the length of the coordinate axes. Finally, the values on the n-dimensional coordinates are mapped to the orthogonal coordinates X and Y, which share the origin point with the n-dimensional coordinates. Star Coordinates uses the x-y values to represent a set of points on the two-dimensional surface, as shown in Fig. 1.
Fig. 1. Positioning a point by an 8-attribute vector in Star Coordinates [16] Formula (1) states the mathematical description of Star Coordinates: $p_j(x,y)=\bigl(\sum_{i=1}^{n}\vec{u}_{xi}(d_{ji}-\min_i),\ \sum_{i=1}^{n}\vec{u}_{yi}(d_{ji}-\min_i)\bigr)$ (1), where $p_j(x,y)$ is the normalized location of $D_j=(d_{j1},d_{j2},\dots,d_{jn})$, and $d_{ji}$ is the value of the $j$th record of the dataset on $C_i$, the $i$th coordinate in Star Coordinates space. $\vec{u}_{xi}(d_{ji}-\min_i)$ and $\vec{u}_{yi}(d_{ji}-\min_i)$ are the unit-vector mappings of $d_{ji}$ in the X and Y directions respectively, where $\vec{u}_i=\vec{C}_i/(\max_i-\min_i)$, with $\min_i=\min(d_{ji},\,0\le j<m)$, $\max_i=\max(d_{ji},\,0\le j<m)$, and $m$ the number of records in the dataset. By mapping high-dimensional data into two-dimensional space, Star Coordinates inevitably produces data overlapping and ambiguities in visual form. To mitigate these drawbacks, Star Coordinates provides visual adjustment mechanisms, such as scaling the weight of the attributes on a particular axis; rotating the angles between axes; marking the data points in a certain area by coloring; and selecting data value ranges on one or more axes and marking the corresponding data points in the visualization [16]. However, Star Coordinates is a typical method of exploration discovery. Numerically supported (quantitative) cluster analysis is time consuming and inefficient, while visual (qualitative) clustering approaches such as Star Coordinates are subjective, stochastic, and lacking in precision. To address the precision of visual cluster analysis, we introduce a new approach in the next section. 3.2 Our Approach HOV 3 Having a precise overview of the data distribution in the early stages of data mining is important, because correct insights into the data overview help data miners to make decisions on adopting appropriate algorithms for the forthcoming analysis stages.
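The Star Coordinates mapping of formula (1) is straightforward to implement. The following NumPy sketch is illustrative only (the thesis experiments used MATLAB, and the function name is ours): it places the n axes at equal angles around the origin, as unit vectors, and sums the min-max-scaled attribute values along them.

```python
import numpy as np

def star_coordinates(data):
    """Project an (m x n) data matrix onto 2D Star Coordinates.

    Axis i is a unit vector at angle 2*pi*i/n; each attribute is
    min-max scaled, then the scaled values are summed along the axis
    directions (formula (1), with unit-length axes).
    """
    data = np.asarray(data, dtype=float)
    m, n = data.shape
    angles = 2 * np.pi * np.arange(n) / n
    axes = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (n, 2) unit axis vectors
    mins = data.min(axis=0)
    maxs = data.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)             # guard constant attributes
    scaled = (data - mins) / span                               # min-max normalisation
    return scaled @ axes                                        # (m, 2) screen positions
```

The record whose attributes are all at their minima lands on the origin, and weighting or rotating an axis amounts to modifying one row of `axes`, which is exactly the kind of visual adjustment described above.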
3.2.1 Basic Idea Exploration discovery (qualitative analysis) is regarded as a pre-processing step for verification discovery (quantitative analysis), and is mainly used for building user hypotheses based on cluster detection or other techniques. But it is not an aimless and/or arbitrary process. Exploration discovery is an iterative process under the guidance of user domain knowledge. Each iteration of exploration feeds back to users new insights and enriches their domain knowledge of the dataset that they are dealing with. However, the way in which qualitative analysis is done by visualization mostly depends on each individual user's experience. Thus subjectivity, randomness and lack of precision may be introduced into exploration discovery. As a result, quantitative analysis based on the result of imprecise qualitative analysis may be inefficient and ineffective. To fill the gap between imprecise cluster detection by visualization and the unintuitive results of clustering algorithms, we propose a new approach, called HOV 3 , which is a quantified-knowledge-based analysis and provides a bridging process between qualitative analysis and quantitative analysis. HOV 3 synthesizes the feedback from exploration discovery and user domain knowledge to produce quantified measures, and then projects the test dataset against the measures. Geometrically, HOV 3 reveals the data distribution against the measures in visual form. We give the mathematical description of HOV 3 below. 3.2.2 Mathematical Model of HOV 3 To project a high-dimensional space onto a two-dimensional surface, we adopt the Polar Coordinates representation, so that any vector can easily be transformed to the orthogonal coordinates X and Y. In analytic geometry, the difference of two vectors A and B can be presented by their inner/dot product, A.B. Let $A=(a_1,a_2,\dots,a_n)$ and $B=(b_1,b_2,\dots,b_n)$; then their inner product can be written as $\langle A,B\rangle=a_1 b_1+a_2 b_2+\dots+a_n b_n=\sum_{k=1}^{n}a_k b_k$ (2). Fig. 2. Vector B projected against vector A in Polar Coordinates
Then we have the equation:

cos(θ) = <A, B> / (|A| |B|)

where θ is the angle between A and B, and |A| and |B| are the lengths of A and B respectively:

|A| = \sqrt{a_1^2 + a_2^2 + ... + a_n^2} and |B| = \sqrt{b_1^2 + b_2^2 + ... + b_n^2}.

Let A be a unit vector; the geometry of <A, B> in Polar Coordinates presents the gap from point B (d_b, θ) to point A, as shown in Fig. 2, where A and B are in 8-dimensional space.

Mapping to Measures

In the same way, a matrix D, a set of vectors (a dataset), can also be mapped to a measure vector M. As a result, this projects the distribution of D based on the vector M. Let D_j = (d_{j1}, d_{j2}, ..., d_{jn}) and M = (m_1, m_2, ..., m_n); then the inner product of each record D_j of the dataset with M has the same form as equation (2) and is written as:

<D_j, M> = m_1 d_{j1} + m_2 d_{j2} + ... + m_n d_{jn} = \sum_{k=1}^{n} m_k d_{jk}   (3)

So the mapping F: R^n -> R^2 from an n-dimensional dataset to one measure (dimension) can be defined as:

F(D_j, M) = <D_j, M>   (4)

where D_j is a record of a dataset with n attributes, and M is a quantified measure.

In the Complex Number System

Since our experiments are run in MATLAB (The MathWorks, Inc.), we use the complex number system in order to better explain our approach. Let z = x + iy, where i is the imaginary unit. According to the Euler formula:

e^{ix} = cos x + i sin x

Let z_0 = e^{2πi/n}; then z_0^1, z_0^2, z_0^3, ..., z_0^{n-1}, z_0^n (with z_0^n = 1) divide the unit circle on the complex plane into n equal sectors. The mapping in Star Coordinates (1) can now be simply written as:

p_j(z_0) = \sum_{k=1}^{n} [(d_{jk} - min d_k)/(max d_k - min d_k)] z_0^k   (5)

where min d_k and max d_k represent the minimal and maximal values of the kth attribute/coordinate respectively.
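The complex-number form of formula (5) admits a compact sketch. The following Python/NumPy fragment (a stand-in for the MATLAB code, with hypothetical names) maps each record to a single complex number via the n-th roots of unity:

```python
import numpy as np

def star_complex(data):
    """Complex-plane form of the Star Coordinates mapping (formula (5)).

    z0 = exp(2*pi*i/n) is an n-th root of unity; record j maps to
    sum over k of dn_jk * z0**k, where dn_jk is the min-max
    normalised value of attribute k.
    """
    data = np.asarray(data, dtype=float)
    m, n = data.shape
    mins, maxs = data.min(axis=0), data.max(axis=0)
    ranges = np.where(maxs > mins, maxs - mins, 1.0)
    dn = (data - mins) / ranges            # normalised attributes
    z0 = np.exp(2j * np.pi / n)
    axes = z0 ** np.arange(1, n + 1)       # z0^1 ... z0^n (= 1)
    return dn @ axes                       # one complex number per record
```

The real and imaginary parts of each result are the X and Y screen coordinates, so the complex formulation and formula (1) describe the same picture.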
This is the case of an equally divided circle surface. The more general form can be defined as:

p_j(z) = \sum_{k=1}^{n} [(d_{jk} - min d_k)/(max d_k - min d_k)] z_k   (6)

where z_k = e^{iθ_k}; θ_k is the angle between neighbouring axes; and \sum_{k=1}^{n} θ_k = 2π. Since the term (d_{jk} - min d_k)/(max d_k - min d_k) in (5) and (6) is the normalized value of the original d_{jk}, we write it as dn_{jk}. Thus formula (6) is written as:

p_j(z) = \sum_{k=1}^{n} dn_{jk} z_k   (7)

In any case these can be viewed as mappings from R^n to C, the complex plane. Given a non-zero measure vector m in R^n, and a family of vectors P_j, the projections of P_j against m according to formulas (4) and (7) give our model HOV 3 as the following equation (8):

p_j(z) = \sum_{k=1}^{n} dn_{jk} z_k m_k   (8)

where m_k is the kth attribute of measure m.

3.2.3 Discussion

In Star Coordinates, the purpose of scaling the weight of attributes on a particular axis (called α-mapping in VISTA) is to adjust the contribution of the attribute laid on a specific coordinate through interactive actions, so that data miners might gain interesting cluster information that automated clustering algorithms cannot easily provide [6], [16]. Comparing the model of Star Coordinates in equation (7) and our model HOV 3 in equation (8), we may observe that our model subsumes the model of Star Coordinates when the angles of the coordinates are the same in both models. This is because any change of weights in the Star Coordinates model can be viewed as changing one or more values of m_k (k = 1, ..., n) in the measure vector m in equation (8) or (4). As a special case, when all values in m are set to 1, HOV 3 clearly reduces to the Star Coordinates model (7), i.e., the no-measure case. In addition, either moving a coordinate axis to its opposite direction or scaling up the adjustment interval of an axis, for example from [0,1] to [-1,1] in VISTA, can also be regarded as negating the original measure value.
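Equation (8) differs from (7) only in the per-axis weights m_k, which a sketch makes explicit (a Python/NumPy stand-in for the MATLAB experiments; the function name is ours):

```python
import numpy as np

def hov3(data, measure):
    """HOV3 projection (formula (8)): Star Coordinates with each axis
    weighted by the corresponding component of a quantified measure
    vector m.  With measure = ones(n) this reduces to the plain Star
    Coordinates mapping (7), i.e. the no-measure case."""
    data = np.asarray(data, dtype=float)
    measure = np.asarray(measure, dtype=float)
    m, n = data.shape
    mins, maxs = data.min(axis=0), data.max(axis=0)
    ranges = np.where(maxs > mins, maxs - mins, 1.0)
    dn = (data - mins) / ranges                  # dn_jk in the text
    z0 = np.exp(2j * np.pi / n)
    axes = z0 ** np.arange(1, n + 1)             # z_k for equal angles
    return dn @ (axes * measure)                 # sum_k dn_jk * z_k * m_k
```

Setting one component of `measure` to 0 suppresses that axis, and negating a component mirrors the axis, which matches the weight-scaling and axis-flipping adjustments of Star Coordinates/VISTA discussed above.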
Moreover, as a bridge between qualitative analysis and quantitative analysis, HOV 3 not only supports quantified domain knowledge verification and validation, but can also directly utilize the rich set of statistical analysis tools as measures and guide data miners with additional cluster information. We demonstrate several examples running in MATLAB in comparison with the same datasets running in the VISTA system [6] in the next section.
4 Examples and Explanation

In this section, we present several examples to demonstrate the advantages of using HOV 3. We have implemented our approach in MATLAB running under Windows 2000 Professional. The results of our experiments with HOV 3 are compared to those of VISTA, a Star Coordinates based system [6]. At this stage, we only employed several simple statistical methods on those datasets as measures. The datasets used in the examples are well known and can be obtained from the UCI machine learning website: http://www.ics.uci.edu/~mlearn/machine-learning.html.

4.1 Iris

The Iris dataset is perhaps the best known in the pattern recognition literature. Iris has 3 classes, 4 numeric attributes and 150 instances. The diagram presented in Fig. 3 (left) is the initial data distribution in Star Coordinates produced by the VISTA system. Fig. 3 (right) shows the data distribution produced by HOV 3 without any adopted measures. It can be observed that the shapes of the data distributions are almost identical in the two figures. Only the orientations of the two shapes differ slightly, since VISTA shifts the appearance of the data by 30 degrees in the counter-clockwise direction.

Fig. 3. The original data distribution in the VISTA system (left) and its distribution by HOV 3 in MATLAB (right)

Fig. 4 illustrates the results after several random weight adjustment steps. In Fig. 4, it can be observed very clearly that there are three data groups (clusters). The initial data distribution cannot give data miners a clear idea about the clusters; see Fig. 3 (left). Thus, in VISTA the user may verify them by further interactive actions, such as weight scaling and/or changing the angles of axes. However, even though better results may sometimes appear, as shown in Fig. 4, users do not know where the results came from, because this adjustment process is largely stochastic and not easily repeatable.
Fig. 4. The labeled clusters in VISTA after performing random adjustments in the system

Fig. 5. Projecting Iris data against its mean (left), and against its standard deviation (right)

We use simple statistical measures such as the mean and standard deviation of Iris to detect cluster information. Fig. 5 gives the data projections based on these measures respectively. HOV 3 also reveals three data groups, and in addition, several outliers. Moreover, the user can clearly understand how the results came about, and can iteratively repeat experiments with the same measures.

4.2 Shuttle

The Shuttle dataset is much bigger than Iris, both in size and in the number of attributes. It has 10 attributes and 15,000 instances. Fig. 6 illustrates the initial Shuttle data distribution, which is the same for both VISTA and HOV 3. The clustered data is illustrated in Fig. 7 after performing manual weight scaling of axes in VISTA, where clusters are marked by different colours. We used the median and the covariance matrix of Shuttle to detect the gaps of the Shuttle dataset against them. The detected results are shown in Fig. 8. These distributions provide the user with cluster information different from that in VISTA. On the other hand, HOV 3 can repeat exactly what VISTA did, if the user records each weight scaling step and quantifies it; as mentioned for equation (8), the HOV 3 model subsumes Star Coordinates based techniques.
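The measure-based projections described above can be sketched as follows. This hypothetical Python/NumPy fragment uses synthetic two-blob data in place of the UCI Iris/Shuttle sets, and the column mean and standard deviation as measure vectors:

```python
import numpy as np

# Hypothetical stand-in data for the UCI Iris/Shuttle sets used in the
# paper: two Gaussian blobs in 4 dimensions, for illustration only.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 1.0, (50, 4)),
                  rng.normal(5.0, 1.0, (50, 4))])

n = data.shape[1]
mins, maxs = data.min(axis=0), data.max(axis=0)
dn = (data - mins) / (maxs - mins)                   # normalised attributes
axes = np.exp(2j * np.pi * np.arange(1, n + 1) / n)  # z0^k, k = 1..n

projections = {}
for name, m in [("mean", data.mean(axis=0)),
                ("std", data.std(axis=0))]:
    # HOV3: each axis k is weighted by the k-th component of the measure
    projections[name] = dn @ (axes * m)
# plot projections[name].real vs projections[name].imag to view the result
```

Because the measure vectors are computed deterministically from the data, re-running the projection always reproduces the same picture, which is exactly the repeatability advantage over manual weight scaling claimed above.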
Fig. 6. Left: the initial Shuttle data distribution in VISTA. Right: the initial Shuttle data distribution in HOV 3.

Fig. 7. Post-adjustment of the Shuttle data with colored labels in VISTA

Fig. 8. Mapping the Shuttle dataset against its median by HOV 3 (left) and against its covariance matrix (right)
The experiments we performed on the Shuttle dataset also show that HOV 3 provides users with an efficient and effective method to verify their hypotheses by visualization. As a result, HOV 3 can feed back a more precise visual presentation of the data distribution to users.

5 Conclusions

In this paper we have proposed a novel approach called HOV 3 to assist data miners in the cluster analysis of high-dimensional datasets by visualization. The HOV 3 visualization technique employs hypothesis oriented measures to project data and allows users to iteratively adjust the measures to optimize the cluster results. Experiments show that the HOV 3 technique can improve the effectiveness of cluster analysis by visualization and provide a better, intuitive understanding of the results. HOV 3 can be seen as a bridging process between qualitative analysis and quantitative analysis. It not only supports quantified domain knowledge verification and validation, but can also directly utilize the rich set of statistical analysis tools as measures and give data miners efficient and effective guidance towards more precise cluster information in data mining. Iteration is a commonly used method in numerical analysis to find an optimized solution. HOV 3 supports verification by quantified measures, and thus provides an opportunity to detect clusters in data mining by combining HOV 3 with iterative methods. This is the future goal of our work.

Acknowledgement

We would like to thank Kewei Zhang for his valuable support on the mathematics of this work. We also would like to express our sincere appreciation to Keke Chen and Ling Liu for offering their VISTA system code, which greatly accelerated our work.

References

1. Alpern B. and Carter L.: Hyperbox. Proc. Visualization '91, San Diego, CA (1991) 133-139
2. Ankerst M., Breunig M.M., Kriegel H.-P., Sander J.: OPTICS: Ordering points to identify the clustering structure. Proc.
of ACM SIGMOD Conference (1999) 49-60
3. Ankerst M. and Keim D.: Visual Data Mining and Exploration of Large Databases. 5th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'01), Freiburg, Germany, September (2001)
4. Berkhin P.: Survey of clustering data mining techniques. Technical report, Accrue Software (2002)
5. Cook D.R., Buja A., Cabrera J., and Hurley H.: Grand tour and projection pursuit. Journal of Computational and Graphical Statistics, Volume 23 (1995) 225-250
6. Chen K. and Liu L.: VISTA: Validating and Refining Clusters via Visualization. Journal of Information Visualization, Volume 3(4) (2004) 257-270
7. Chernoff H.: The Use of Faces to Represent Points in k-Dimensional Space Graphically. Journal of the American Statistical Association, Volume 68 (1973) 361-368
8. Cleveland W.S.: Visualizing Data. AT&T Bell Laboratories, Murray Hill, NJ, Hobart Press, Summit NJ (1993)
9. Ester M., Kriegel H.-P., Sander J., Xu X.: A density-based algorithm for discovering clusters in large spatial databases with noise. 2nd International Conference on Knowledge Discovery and Data Mining (1996)
10. Fienberg S.E.: Graphical methods in statistics. American Statisticians, Volume 33 (1979) 165-178
11. Guha S., Rastogi R., Shim K.: CURE: An efficient clustering algorithm for large databases. In Proc. of ACM SIGMOD Int'l Conf. on Management of Data, ACM Press (1998) 73-84
12. Hinneburg A., Keim D.A., Wawryniuk M.: HD-Eye: Visual Clustering of High-dimensional Data. Proc. of the 19th International Conference on Data Engineering (2003) 753-755
13. Hoffman P.E. and Grinstein G.: A survey of visualizations for high-dimensional data mining. In Fayyad U., Grinstein G.G. and Wierse A. (eds.) Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann Publishers Inc. (2002) 47-82
14. Inselberg A.: Multidimensional Detective. Proc. of IEEE Information Visualization '97 (1997) 100-107
15. Jain A., Murty M.N., and Flynn P.J.: Data Clustering: A Review. ACM Computing Surveys, Volume 31(3) (1999) 264-323
16. Kandogan E.: Visualizing multi-dimensional clusters, trends, and outliers using star coordinates. Proc. of ACM SIGKDD Conference (2001) 107-116
17. Keim D.A. and Kriegel H.-P.: VisDB: Database Exploration using Multidimensional Visualization. Computer Graphics & Applications (1994) 40-49
18. Maria Cristina Ferreira de Oliveira, Haim Levkowitz: From Visual Data Exploration to Visual Data Mining: A Survey. IEEE Transactions on Visualization and Computer Graphics, Volume 9(3) (2003) 378-394
19.
Pampalk E., Goebl W., and Widmer G.: Visualizing Changes in the Structure of Data for Exploratory Feature Selection. SIGKDD'03, Washington, DC, USA (2003)
20. Pickett R.M.: Visual Analyses of Texture in the Detection and Recognition of Objects. Picture Processing and Psycho-Pictorics, Lipkin B.S., Rosenfeld A. (eds.) Academic Press, New York (1970) 289-308
21. Qian Y., Zhang G., and Zhang K.: FAÇADE: A Fast and Effective Approach to the Discovery of Dense Clusters in Noisy Spatial Data. In Proc. ACM SIGMOD 2004 Conference, ACM Press (2004) 921-922
22. Ribarsky W., Katz J., Jiang F. and Holland A.: Discovery visualization using fast clustering. IEEE Computer Graphics and Applications, Volume 19 (1999) 32-39
23. Sheikholeslami G., Chatterjee S., Zhang A.: WaveCluster: A multi-resolution clustering approach for very large spatial databases. Proc. of 24th Intl. Conf. on Very Large Data Bases (1998) 428-439
24. Shneiderman B.: Inventing Discovery Tools: Combining Information Visualization with Data Mining. Discovery Science 2001, Proceedings. Lecture Notes in Computer Science, Volume 2226 (2001) 17-28
25. Zhang T., Ramakrishnan R. and Livny M.: BIRCH: An efficient data clustering method for very large databases. In Proc. of SIGMOD'96, Montreal, Canada (1996) 103-114
Proceedings of the 2007 IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007)

A Visual Approach for External Cluster Validation

Ke-Bing Zhang, Mehmet A. Orgun, Senior Member, IEEE, and Kang Zhang, Senior Member, IEEE

Abstract: Visualization can be very powerful in revealing cluster structures. However, directly using visualization techniques to verify the validity of clustering results is still a challenge. This is due to the fact that visual representation lacks precision in contrasting clustering results. To remedy this problem, in this paper we propose a novel approach, which employs a visualization technique called HOV 3 (Hypothesis Oriented Verification and Validation by Visualization), which offers a tunable measure mechanism to project clustered subsets and non-clustered subsets from a multidimensional space to a 2D plane. By comparing the data distributions of the subsets, users not only have an intuitive visual evaluation but also a precise evaluation of the consistency of the cluster structure, by calculating geometrical information of their data distributions.

I. INTRODUCTION

The goal of clustering is to distinguish objects into partitions/clusters based on given criteria. A large number of clustering algorithms have been developed for different application purposes [8, 14, 15]. However, due to the memory limitations of computers and extremely large databases, in practice it is infeasible to cluster entire data sets at the same time. Thus, applying clustering algorithms to sampled data to extract hidden patterns is a commonly used approach in data mining [5]. As a consequence of cluster analysis on sampled data, the goal of external cluster validation is to evaluate whether a well-suited cluster scheme learnt from one subset of a database is suitable for the other subsets in the database. In real applications, achieving this task is still a challenge.
This is not only due to the high computational cost of statistical methods for assessing the robustness of cluster structures between the subsets of a large database, but also due to the non-linear time complexity of most existing clustering algorithms.

Visualization provides users with an intuitive interpretation of cluster structures. It has been shown that visualization allows for verification of clustering results [10]. However, the direct use of visualization techniques to evaluate the quality of clustering results has not attracted enough attention in the data mining community. This might be due to the fact that visual representation lacks precision in contrasting clustering results.

Manuscript received October 31, 2006. K-B. Zhang is with the Department of Computing, Macquarie University, Sydney, NSW 2109, Australia (phone: 612-9850-9590; fax: 612-9850-9551; e-mail: kebing@ics.mq.edu.au). M.A. Orgun is with the Department of Computing, Macquarie University, Sydney, NSW 2109, Australia (e-mail: mehmet@ics.mq.edu.au). K. Zhang is with the Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083-0688, USA (e-mail: kzhang@utdallas.edu).

We have proposed an approach called HOV 3 to detect cluster structures [28]. In this paper, we discuss its projection mechanism to support external cluster validation. Our approach is based on the assumption that when the same measure is used to project data sets with the same cluster structure, the similarity of their data distributions should be high. By comparing the distributions produced by applying the same measures to a clustered subset and other non-clustered subsets of a database by HOV 3, users can investigate the consistency of cluster structures between them both in visual form and in numerical calculation.

The rest of this paper is organized as follows. Section 2 briefly introduces ideas of cluster validation (with more details on external cluster validation) and visual cluster validation.
A review of related work on cluster validation by visualization, and a more detailed account of HOV 3, are presented in Section 3. Section 4 describes our idea of verifying the consistency of cluster structure by a distribution matching based method in HOV 3. Section 5 demonstrates the application of our approach on several well-known data sets. Finally, Section 6 summarizes the contributions of this paper.

II. BACKGROUND

A. Cluster Validation

Cluster validation is a procedure for assessing the quality of clustering results and finding a cluster strategy fit for a specific application. It aims at finding the optimal cluster scheme and interpreting the cluster patterns. In general, cluster validation approaches are classified into the following three categories [9, 15, 27].

Internal approaches: they assess the clustering results by applying an algorithm with different parameters on a data set to find the optimal solution [1];

Relative approaches: the idea of relative assessment is based on the evaluation of a clustering structure by comparing it to other clustering schemes [8]; and

External approaches: the external assessment of a clustering approach is based on the idea that there exist a priori known cluster indices produced by a clustering algorithm, and then assessing the consistency of the clustering structures generated by applying the clustering algorithm to different data sets [12].

1-4244-0705-2/07/$20.00 2007 IEEE 576
B. External Cluster Validation

As a necessary post-processing step, external cluster validation is a procedure of hypothesis testing, i.e., given a set of class labels produced by a cluster scheme, compare it with the clustering results of applying the same cluster scheme to the other partitions of a database, as shown in Fig. 1.

Fig. 1. External criteria based validation

Statistical methods for quality assessment are employed in external cluster validation, such as the Rand statistic [24], the Jaccard Coefficient [7], the Fowlkes and Mallows index [21], Hubert's Γ statistic and the normalized Γ statistic [27], and the Monte Carlo method [20], to measure the similarity between the a priori modeled partitions and the clustering results of a dataset. However, achieving these tasks is time consuming when the database is large, due to the high computational cost of statistics-based methods for assessing the consistency of cluster structures between the sampled subsets. Recent surveys on cluster validation methods can be found in the literature [10, 12, 27].

C. Visual Cluster Validation

In high dimensional space, traditional clustering algorithms tend to break down in terms of efficiency as well as accuracy because data do not cluster well anymore [3]. Thus, introducing visualization techniques to explore and understand high-dimensional datasets is becoming an efficient way to combine human intelligence with the immense brute force computation power available nowadays [23]. Visual presentations can be very powerful in revealing trends, highlighting outliers, showing clusters, and exposing gaps in data [26]. Visual cluster validation is a combination of information visualization and cluster validation techniques. In the cluster analysis process, visualization provides analysts with intuitive feedback on data distribution and supports decision-making activities.
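For concreteness, the pair-counting indices mentioned above (the Rand statistic and the Jaccard coefficient) can be computed as in this minimal Python sketch. The function names are ours, and this O(m^2) illustration is not part of HOV 3 itself; it shows the kind of statistical comparison whose cost motivates the visual approach:

```python
from itertools import combinations

def pair_counts(labels_a, labels_b):
    """Count point pairs that are in the same/different cluster
    under two labelings: (same-same, same-diff, diff-same, diff-diff)."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            ss += 1
        elif same_a:
            sd += 1
        elif same_b:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd

def rand_statistic(a, b):
    """Fraction of point pairs on which the two labelings agree."""
    ss, sd, ds, dd = pair_counts(a, b)
    return (ss + dd) / (ss + sd + ds + dd)

def jaccard_coefficient(a, b):
    """Agreeing same-cluster pairs over all pairs grouped by either."""
    ss, sd, ds, dd = pair_counts(a, b)
    return ss / (ss + sd + ds)
```

Both indices are 1 when two labelings induce identical partitions, and the quadratic number of pairs is exactly why such indices become expensive on large databases.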
III. RELATED WORK

A. Previous Works

A large number of clustering algorithms have been developed, but only a small number of cluster visualization tools are available to facilitate researchers' understanding of clustering results [25]. Several efforts have been made in cluster validation with visualization [2, 4, 11, 13, 16, 18]. While these techniques tend to give users intuitive comparisons and a better understanding of cluster structures, they do not focus on assessing the quality of clusters. For example, OPTICS [2] uses a density-based technique to detect cluster structures and visualizes them as Gaussian bumps, but its non-linear time complexity makes it neither suitable for dealing with very large data sets, nor suitable for providing contrasts between clustering results. Kaski et al. [18] employ the Self-Organizing Map (SOM) technique to project multidimensional data sets onto a 2D space for matching visual models [17]. However, the SOM technique is based on a single projection strategy and is not powerful enough to discover all the interesting features from the original data. Huang et al. [11, 13] proposed approaches based on FastMap [5] to assist users in identifying and verifying the validity of clusters in visual form. Their techniques are good for cluster identification, but are not able to evaluate cluster quality very well. The most prominent feature of techniques based on Star Coordinates, such as VISTA [4] and HOV 3 [28], is their linear time computational complexity. This feature makes them suitable as visual interpretation and detection tools in cluster analysis. However, the characteristic of imprecise qualitative analysis of Star Coordinates and VISTA limits their use as quantitative analysis tools. In addition, VISTA adopts landmark points as representatives of a clustered subset and re-samples them to deal with cluster validation [4]. But its experience-based landmark point selection does not always handle the scalability of data very well, due to the fact that well-representative landmark points selected in one subset may fail in other subsets of a database. Visualization techniques used in data mining and cluster analysis are surveyed in the literature [22, 25].

B. Star Coordinates

The HOV 3 approach employed in this research was inspired by Star Coordinates [16]. For a better understanding of our work, we briefly describe it here. Star Coordinates utilizes a point on a 2D surface to represent a set of points of n-dimensional data. The values of the n-dimensional coordinates are mapped to the orthogonal coordinates X and Y, as shown in Fig. 2.

Fig. 2. Positioning a point by an 8-attribute vector in Star Coordinates [16]
The mapping from n-dimensional Star Coordinates to 2D X-Y coordinates is calculated as in Formula (1):

p_j(x, y) = ( \sum_{i=1}^{n} u_{xi}(d_{ji} - min_i), \sum_{i=1}^{n} u_{yi}(d_{ji} - min_i) )   (1)

where p_j(x, y) is the normalized location of D_j = (d_{j1}, d_{j2}, ..., d_{jn}), and d_{ji} is the value of the jth record of a data set on the ith coordinate C_i in Star Coordinates space; u_{xi}(d_{ji} - min_i) and u_{yi}(d_{ji} - min_i) are unit-vector projections of d_{ji} onto the X and Y directions; min_i = min(d_{ji}, 0 <= j < m) and max_i = max(d_{ji}, 0 <= j < m) are the minimum and maximum values of the ith dimension respectively; and m is the number of records in the data set.

C. HOV 3 Model

The idea of HOV 3 is based on hypothesis testing by visualization. It treats hypotheses as measures to reveal the difference between the hypotheses and the real performance by projecting the test data against the measures [28]. Geometrically, the difference of a vector D_j and a vector M can be represented by their inner product, D_j . M. Let D_j = (d_{j1}, d_{j2}, ..., d_{jn}) be a record of a data set with n attributes, and M = (m_1, m_2, ..., m_n). The inner product of D_j with M can be seen as part of a mapping from an n-dimensional data set to one measure, F: R^n -> R^2. It is written as:

<D_j, M> = m_1 d_{j1} + m_2 d_{j2} + ... + m_n d_{jn} = \sum_{k=1}^{n} m_k d_{jk}   (2)

In order to enlarge the data analysis space, we introduce the complex number system into our study. Let z = x + iy, where i is the imaginary unit. According to the Euler formula, we have e^{ix} = cos x + i sin x. Let z_0 = e^{2πi/n}; then z_0^1, z_0^2, z_0^3, ..., z_0^{n-1}, z_0^n (with z_0^n = 1) divide the unit circle on the complex plane into n equal sectors. Then formula (1) can be simply written as:

P_j(z_0) = \sum_{k=1}^{n} [(d_{jk} - min d_k)/(max d_k - min d_k)] z_0^k   (3)

where min d_k and max d_k represent the minimal and the maximal values of the kth coordinate respectively. This is the case of an equally divided circle surface.
Then the more general form can be defined as:

P_j(z) = \sum_{k=1}^{n} [(d_{jk} - min d_k)/(max d_k - min d_k)] z_k   (4)

where z_k = e^{iθ_k}; θ_k is the angle between neighbouring axes; and \sum_{k=1}^{n} θ_k = 2π. In any case equation (4) can be viewed as a mapping from R^n to C, the complex plane. Given a non-zero measure vector m in R^n, and a family of vectors P_j, with the projections of P_j against m according to formulas (2) and (4), the HOV 3 model is given as the following equation:

P_j(z) = \sum_{k=1}^{n} [(d_{jk} - min d_k)/(max d_k - min d_k)] z_k m_k   (5)

where m_k is the kth attribute of measure m.

As shown above, a hypothesis in HOV 3 is a quantified measure vector. Thus HOV 3 is also able to detect the consistency of cluster structures among the subsets of a database by comparing their data distributions, because the cluster validation procedure is primarily a hypothesis testing process.

D. The Axis Tuning Feature

Overlapping and ambiguities are inevitably introduced by projecting multidimensional data into 2D space. To mitigate the problem, Star Coordinates provides several visual adjustment mechanisms, such as axis scaling, rotation of axis angles, coloring of data points, etc. [16]. We use Iris, a well-known data set in machine learning research, as an example to demonstrate the axis scaling feature of techniques based on Star Coordinates, as follows.

Fig. 3. The initial data distribution of clusters of Iris produced by k-means in VISTA.

Iris has 4 numeric attributes and 150 instances. We first applied the k-means clustering algorithm to it and obtained 3 clusters (k = 3 here), and then tuned the weight value of each axis of Iris (called α-adjustment in VISTA) [4]. Fig. 3 shows the original data distribution of Iris, which has overlapping among the clusters. A well-separated distribution of Iris is illustrated in Fig. 4, obtained by a series of axis scaling operations. The clusters are much easier to recognize in Fig. 4 than in the original.

Fig. 4. The tuned version of the Iris data distribution in VISTA.
This axis-tuning feature is significant for our external cluster validation method based on distribution matching by HOV 3. We explain this in detail next.

IV. CLUSTER VALIDATION WITH HOV 3

The tunable axis feature provides us with a mechanism to quantitatively handle external cluster validity by HOV 3. Our approach is based on the assumption that when the same measure is used to project data sets with the same cluster structure, the similarity of their data distributions should be high. Based on this idea, we have implemented an approach for external cluster validation based on distribution matching by HOV 3.

A. Definitions

To explain our approach precisely, we first give a few definitions below.

Definition 1: A data projection from n-dimensional space to the 2D plane obtained by applying HOV 3 to a data set P, as shown in formula (5), is denoted as D_p = C(P, M), where P is an n-dimensional data set, P = (p_1, p_2, ..., p_m), and p_j (1 <= j <= m) is an instance of P; M = (w_{1t}, w_{2t}, ..., w_{nt}) is a non-zero measure vector, where w_{it} (1 <= i <= n) is the weight value of the ith coordinate at moment t in the Star Coordinates plane; D_p is the geometrical distribution of P in 2D space, D_p = (p_1(x_1, y_1), p_2(x_2, y_2), ..., p_m(x_m, y_m)), where p_j(x_j, y_j) is the location of p_j in the X-Y Coordinates plane.

Definition 2: Let P be a database of data points. A clustering C := (D, L) is a non-empty set D with a label set L, and the ith cluster is C_i = {p in D | l(p) = i, i > 0}, where l(p) in L is the cluster label of p, l in {-1, 0, 1, ..., k}, and k is the number of clusters. As special cases, an outlier point is an element of P with cluster label -1; a non-clustered element of P has a cluster label of 0, i.e., it has not been clustered.
Definition 3: A spy subset P_s is a clustered subset of P produced by a clustering algorithm, where P_s = {C_1, C_2, ..., C_k, C_E}, C_i (1 <= i <= k) is a cluster in P_s, and C_E is the outlier set of P_s. A spy subset is used as a model to verify the cluster structure in the other partitions of the database.

Definition 4: A subset P_t is a target subset of P_s if every point p in P_t has cluster label l(p) = 0 and |P_s| = |P_t|. A target subset P_t is thus a non-clustered subset of P with the same size as the spy subset P_s. It is used as a target for investigating the similarity of its cluster structure with that of the spy subset P_s.

Definition 5: A non-clustered point p_o is called an overlapping point of a cluster C_i, denoted as C_i ~ p_o, iff there exists p in C_i such that p_o is not in C_i and |p_o - p| <= ε, where ε is a threshold distance given by the user.

Definition 6: The set of overlapping points of cluster C_i composes a quasi-cluster of C_i, denoted as C_qi, i.e., C_qi = {p_o | C_i ~ p_o}. All overlapping points of C_i together compose the quasi-cluster C_qi of C_i.

Definition 7: A cluster C_i is called a visually well-separated cluster when it satisfies the condition that for all C_i, C_j in P_s with i != j, no point p in C_j is an overlapping point of C_i. A well-separated cluster C_i in the spy subset implies that no points of the other clusters in the spy subset are within the threshold distance ε of C_i.

Based on the above definitions, we present the application of our approach to external cluster validation based on distribution matching by HOV 3 as follows.

B. The Stages of Our Approach

The stages of the application of our approach are summarized in the following steps:

1. Clustering. First, the user applies a clustering algorithm to a randomly selected subset P_s of the given dataset.

2. Cluster Separation. The clustering result of P_s is introduced and visualized in HOV 3. Then the user manually tunes the weight value of each axis to separate overlapping clusters. If one or more clusters are separated from the others visually, then the weight values of the axes are recorded as a measure vector M.

3.
Data Projection by HOV 3. The user samples another observation with the same number of points as Πs as a target subset Πt. The clustered subset Πs (now acting as a spy subset) and its target subset Πt are projected together by HOV 3 with the measure vector M to detect the distribution consistency between Πs and Πt.

4. The Generation of Quasi-Clusters. The user gives a threshold ε, and then, according to Definitions 5, 6 and 7, a quasi-cluster Cqi of a separated cluster Ci is computed. Then Cqi is removed from Πt, and Ci is removed from Πs. If Πs still has clusters, we go back to step 2; otherwise we proceed to the next step.

5. The Interpretation of the Result. The overlapping rate of each cluster-and-quasi-cluster pair is calculated as ρ(Cqi, Ci) = |Cqi| / |Ci|. If the overlapping rate approaches 1, cluster Ci and its quasi-cluster Cqi have high similarity, since the size ratio of the spy subset and the target subset is 1:1. Thus the overlapping analysis is simply transformed into a linear regression analysis, i.e., how closely the points lie around the line C = Cq.
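Steps 4 and 5 above can be sketched in a few lines of code. The original implementation is in MATLAB; the following is a hypothetical Python/NumPy sketch in which the function names and the epsilon value are illustrative, not from the paper.

```python
# A hypothetical sketch of steps 4 and 5 (quasi-cluster generation and the
# overlapping rate). Function names and epsilon are illustrative only.
import numpy as np

def quasi_cluster(cluster_2d, target_2d, eps):
    """Indices of projected target points lying within distance eps of some
    point of the projected cluster (Definitions 5 and 6)."""
    cluster_2d = np.asarray(cluster_2d, dtype=float)
    members = []
    for i, p in enumerate(np.asarray(target_2d, dtype=float)):
        # Euclidean distance from p to every point of the projected cluster
        if np.min(np.linalg.norm(cluster_2d - p, axis=1)) <= eps:
            members.append(i)
    return members

def overlapping_rate(cluster_2d, target_2d, eps):
    """Step 5: rho(Cqi, Ci) = |Cqi| / |Ci|; values near 1 indicate that the
    spy and target subsets share a similar cluster structure."""
    return len(quasi_cluster(cluster_2d, target_2d, eps)) / len(cluster_2d)
```

In the paper the quasi-clusters are collected manually from the visualization; this sketch only makes the threshold test of Definition 5 explicit.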
Corresponding to the procedure described above, we give the algorithm of external cluster validation based on distribution matching by HOV 3 in Fig. 5. To handle the scalability of resampling datasets, we choose non-clustered observations with the same size as the clustered subset, and then project them together by HOV 3. As a consequence, the user can easily utilize the well-separated clusters produced by scaling axes in HOV 3 as a model to pick out their corresponding quasi-clusters, where the points of a quasi-cluster overlap its corresponding cluster. Also, instead of using statistical methods to assess the similarity between the two subsets, we simply compute the overlapping rate between the clusters and their quasi-clusters to explore their consistency.

V. EXAMPLES AND EXPLANATION

In this section, we present several examples to demonstrate the advantages of external cluster validation in HOV 3. We have implemented our approach in MATLAB running under Windows 2000 Professional. The datasets used in the examples are obtained from the UCI machine learning website: http://www.ics.uci.edu/~mlearn/machine-Learning.html.

Fig. 5. The algorithm of external cluster validation based on distribution matching in HOV 3

In Fig. 5, the procedure clusterseparate responds to the user's axis tuning to separate the clusters in the spy subset and to gather the weight values of the axes as a measure vector; the procedure quasiclustergeneration produces quasi-clusters in the target subset corresponding to the clusters in the spy subset.

C. Our Model

In contrast to the statistics-based external cluster validation model illustrated in Fig. 2, we exhibit our model for external cluster validation by visualization in HOV 3 in Fig. 6.

Fig. 6.
External cluster validation by HOV 3

Comparing these two models, we may observe that instead of using a clustering algorithm to cluster other sampled data sets, in our model we use a clustered subset of a database as a model to verify the similarity of cluster structure between the model and the other non-clustered subsets from the database.

Fig. 7. The original data distribution of the first 5,000 data points of Shuttle in MATLAB by HOV 3 (without cluster indices)

The Shuttle data set has 9 attributes and 43,500 instances. We chose the first 5,000 instances of Shuttle as a sampling data set and applied the K-means algorithm [19] to it. Then we used the clustered result as a spy subset. We assumed that we had found the optimal cluster number k = 5 for the sampling data. The original data distributions without and with cluster indices are illustrated in the diagrams of Fig. 7 and Fig. 8 respectively. It can be seen that there exists cluster overlapping in Fig. 8. To obtain well-separated clusters, we tuned the weight of each coordinate and obtained a satisfactory version of the data distribution, as shown in Fig. 9. The weight values of the axes were recorded as a measure vector, [0.80, 0.55, 0.85, 0.0, 0.40, 0.95, 0.20, 0.05, 0.459] in this case. Then we chose the second 5,000 instances of Shuttle as a target subset and projected the target subset and the spy subset together against the measure vector by HOV 3.
The size of each quasi-cluster and its corresponding cluster are listed in Table I, and their curves of linear regression relative to the line C = Cq are illustrated in Fig. 11.

Fig. 8. The original data distribution of the first 5,000 data points of Shuttle in MATLAB by HOV 3 (with cluster indices)

Their distributions are presented in Fig. 10, where we may observe that their data distributions match very well. We chose the points in the enclosed area in Fig. 10 as a cluster and then obtained a quasi-cluster in the target subset corresponding to the cluster in the enclosed area. In the same way, we can find the other quasi-clusters in the target subset.

TABLE I
CLUSTERS AND THEIR CORRESPONDING QUASI-CLUSTERS

Subset   | Cq1/C1           | Cq2/C2           | Cq3/C3           | Cq4/C4             | Cq5/C5
Spy      | 318              | 773              | 513              | 2254               | 1142
Target 1 | 278/318 = 0.8742 | 670/773 = 0.8668 | 503/513 = 0.9805 | 2459/2254 = 1.0909 | 1123/1142 = 0.9834
Target 2 | 279/318 = 0.8773 | 897/773 = 1.1604 | 626/513 = 1.2203 | 2048/2254 = 0.9086 | 1602/1142 = 1.4028
Target 3 | 280/318 = 0.8805 | 875/773 = 1.1320 | 481/513 = 0.9376 | 2093/2254 = 0.9286 | 1455/1142 = 1.2741
Target 4 | 261/318 = 0.8208 | 713/773 = 0.9224 | 368/513 = 0.7173 | 2416/2254 = 1.0719 | 1169/1142 = 1.0264

*At the current stage, we collect the quasi-clusters manually, so the Cqi here may contain redundant or mislabeled points.

It is observed that the curves match the line C = Cq well, i.e., the overlapping rates between the clusters and their quasi-clusters are high. The standard deviation is a good way to reflect the difference between two vectors. Thus we have calculated the standard deviation of the Cqi/Ci pairs of each target subset (i = 1, …, 4) against the spy subset; they are 0.0826, 0.1975, 0.1491 and 0.1304 respectively. This means that the similarity of the cluster structure in the spy and target subsets is high. In summary, the experiments show that the cluster structure in the spy subset of Shuttle also exists in the target subsets of Shuttle.

Fig. 9.
A well-separated version of the spy subset distribution of Shuttle

Fig. 11. The curves of linear regression relative to the line C = Cq

In these experiments, we have also measured the timing of both clustering and projection in MATLAB. The results are listed in Table 2.

Fig. 10. The projection of the spy subset and a target subset of Shuttle by applying a measure vector

We have performed the same experiment on 4 target subsets of Shuttle.
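The reported standard deviations can be checked directly from the overlapping rates in Table I. A short sketch, assuming the population formula (NumPy's default `ddof=0`) was used over the five rounded rates of each target subset:

```python
# Recomputing the standard deviations reported for Table I from the
# overlapping rates; np.std with its default ddof=0 (population formula)
# reproduces the reported figures from the rounded ratios in the table.
import numpy as np

rates = {
    "Target 1": [0.8742, 0.8668, 0.9805, 1.0909, 0.9834],
    "Target 2": [0.8773, 1.1604, 1.2203, 0.9086, 1.4028],
    "Target 3": [0.8805, 1.1320, 0.9376, 0.9286, 1.2741],
    "Target 4": [0.8208, 0.9224, 0.7173, 1.0719, 1.0264],
}
stds = {name: round(float(np.std(r)), 4) for name, r in rates.items()}
# stds reproduces 0.0826, 0.1975, 0.1491 and 0.1304, as reported in the text
```

The overlapping rates themselves come from the table; only the choice of the population formula is an inference from the fact that it reproduces the reported values.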
TABLE 2
TIMING OF CLUSTERING AND PROJECTING

Clustering by K-means (k=5)           Projecting by HOV 3
Subset   | Size  | Time (second)      Subset       | Size   | Time (second)
Target 1 | 5,000 | 0.532              Spy+Target 1 | 10,000 | 0.11
Target 2 | 5,000 | 0.61               Spy+Target 2 | 10,000 | 0.109
Target 3 | 5,000 | 0.656              Spy+Target 3 | 10,000 | 0.11
Target 4 | 5,000 | 0.453              Spy+Target 4 | 10,000 | 0.109

Based on these measurements, it can be observed that projection by HOV 3 is much faster than the clustering process by the K-means algorithm. It is particularly effective for verifying clustering results within extremely large databases. Although the cluster separation in our approach may take some time, once well-separated clusters are found, using a measure vector to project a huge data set is far more efficient than re-applying a clustering algorithm to the data set.

VI. CONCLUDING REMARKS

In this paper we have proposed a novel visual approach to assist users in verifying the validity of a cluster scheme, i.e., an approach based on distribution matching for external cluster validation by visualization. The HOV 3 visualization technique has been employed in our approach; it uses measure vectors to project a data set and allows the user to iteratively adjust the measures to optimize the clustering result. By comparing the data distributions of a clustered subset and non-clustered subsets projected by HOV 3 with tunable measures, users can perform an intuitive visual evaluation, and can also obtain a precise evaluation of the consistency of the cluster structure by performing geometrical computation on their data distributions. By comparing our approach with existing visual methods, we have observed that our method is not only efficient in performance, but also effective in real applications.

REFERENCES

[1] A. L. Abul, R. Alhajj, F. Polat and K.
Barker, "Cluster Validity Analysis Using Subsampling," in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Washington DC, Oct. 2003, Volume 2, pp. 1435-1440.
[2] M. Ankerst, M. M. Breunig, H.-P. Kriegel and J. Sander, "OPTICS: Ordering points to identify the clustering structure," in Proceedings of the ACM SIGMOD Conference, 1999, pp. 49-60.
[3] C. Baumgartner, C. Plant, K. Kailing, H.-P. Kriegel and P. Kroger, "Subspace Selection for Clustering High-Dimensional Data," in Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM 04), 2004, pp. 11-18.
[4] K. Chen and L. Liu, "VISTA: Validating and Refining Clusters via Visualization," Journal of Information Visualization, Volume 3(4), 2004, pp. 257-270.
[5] E. Clifford, Data Analysis by Resampling: Concepts and Applications, Duxbury Press, 2000.
[6] C. Faloutsos and K. Lin, "FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia data sets," in Proceedings of ACM SIGMOD, 1995, pp. 163-174.
[7] P. Jaccard, "Nouvelles recherches sur la distribution florale," Bull. Soc. Vaud. Sci. Nat., 44, 1908, pp. 223-270.
[8] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
[9] M. Halkidi, Y. Batistakis and M. Vazirgiannis, "On Clustering Validation Techniques," Journal of Intelligent Information Systems, Volume 17(2/3), 2001, pp. 107-145.
[10] M. Halkidi, Y. Batistakis and M. Vazirgiannis, "Cluster validity methods: Part I and II," SIGMOD Record, 31, 2002.
[11] Z. Huang, D. W. Cheung and M. K. Ng, "An Empirical Study on the Visual Cluster Validation Method with Fastmap," in Proceedings of DASFAA01, Hong Kong, April 2001, pp. 84-91.
[12] J. Handl, J. Knowles and D. B. Kell, "Computational cluster validation in post-genomic data analysis," Journal of Bioinformatics, Volume 21(15), 2005, pp. 3201-3212.
[13] Z. Huang and T. Lin, "A visual method of cluster validation with Fastmap," in Proceedings of PAKDD-2000, 2000, pp. 153-164.
[14] A. K. Jain and R. C.
Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
[15] A. K. Jain, M. N. Murty and P. J. Flynn, "Data Clustering: A Review," ACM Computing Surveys, Volume 31(3), 1999, pp. 264-323.
[16] E. Kandogan, "Visualizing multi-dimensional clusters, trends, and outliers using star coordinates," in Proceedings of the ACM SIGKDD Conference, 2001, pp. 107-116.
[17] T. Kohonen, Self-Organizing Maps, Springer, Berlin, second extended edition, 1997.
[18] S. Kaski, J. Sinkkonen and J. Peltonen, "Data Visualization and Analysis with Self-Organizing Maps in Learning Metrics," DaWaK 2001, LNCS 2114, 2001, pp. 162-173.
[19] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematics, Statistics and Probability, Volume 1, 1967, pp. 281-298.
[20] G. W. Milligan, "A Review of Monte Carlo Tests of Cluster Analysis," Journal of Multivariate Behavioral Research, Volume 16(3), 1981, pp. 379-407.
[21] G. W. Milligan, L. M. Sokol and S. C. Soon, "The effect of cluster size, dimensionality and the number of clusters on recovery of true cluster structure," IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 5(1), 1983, pp. 40-47.
[22] F. Oliveira and H. Levkowitz, "From Visual Data Exploration to Visual Data Mining: A Survey," IEEE Transactions on Visualization and Computer Graphics, Volume 9(3), 2003, pp. 378-394.
[23] E. Pampalk, W. Goebl and G. Widmer, "Visualizing Changes in the Structure of Data for Exploratory Feature Selection," SIGKDD 03, August 24-27, 2003, Washington, DC, USA.
[24] W. M. Rand, "Objective Criteria for the Evaluation of Clustering Methods," Journal of the American Statistical Association, 66, 1971, pp. 846-850.
[25] J. Seo and B. Shneiderman, "From Integrated Publication and Information Systems to Virtual Information and Knowledge Environments," Essays Dedicated to Erich J. Neuhold on the Occasion of His 65th Birthday, Lecture Notes in Computer Science, Volume 3379, Springer, 2005.
[26] B. Shneiderman, "Inventing Discovery Tools: Combining Information Visualization with Data Mining," in Proceedings
of Discovery Science 2001, Lecture Notes in Computer Science, Volume 2226, 2001, pp. 17-28.
[27] S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 1999.
[28] K-B. Zhang, M. A. Orgun and K. Zhang, "HOV 3, An Approach for Cluster Analysis," in Proceedings of ADMA 2006, Xi'an, China, Lecture Notes in Computer Science, Volume 4093, 2006, pp. 317-328.
Enhanced Visual Separation of Clusters by M-Mapping to Facilitate Cluster Analysis*

Ke-Bing Zhang 1, Mehmet A. Orgun 1, and Kang Zhang 2
1 Department of Computing, Macquarie University, Sydney, NSW 2109, Australia {kebing, mehmet}@ics.mq.edu.au
2 Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083-0688, USA zhang@utdallas.edu

Abstract. The goal of clustering in data mining is to distinguish objects into partitions/clusters based on given criteria. Visualization methods and techniques may provide users with an intuitively appealing interpretation of cluster structures. Having well visually separated groups of the studied data is beneficial for detecting cluster information as well as for refining the membership formation of clusters. In this paper, we propose a novel visual approach called M-mapping, based on the projection technique of HOV 3, to achieve the separation of cluster structures. With M-mapping, users can explore visual cluster clues intuitively and validate clusters effectively by matching the geometrical distributions of clustered and non-clustered subsets produced in HOV 3.

Keywords: Cluster Analysis, Visual Separability, Visualization.

1 Introduction

Cluster analysis is an iterative process of clustering and cluster verification by the user, facilitated by clustering algorithms, cluster validation methods, visualization and domain knowledge of databases. The application of clustering algorithms to detect grouping information in real world applications is still a challenge, primarily due to the inefficiency of most existing clustering algorithms in coping with arbitrarily shaped distributions of extremely large and high-dimensional databases. Moreover, the very high computational cost of statistics-based cluster validation methods is another obstacle to effective cluster analysis. Visual presentations can be very powerful in revealing trends, highlighting outliers, showing clusters, and exposing gaps in data [18].
Nowadays, as an indispensable technique, visualization is involved in almost every step of cluster analysis. However, due to its impreciseness, visualization is often used as an observation and rendering tool in cluster analysis, but it has rarely been employed directly in the precise comparison of clustering results. HOV 3 is a visualization technique based on hypothesis testing [20]. In HOV 3, each hypothesis is quantified as a measure vector, which is used to project a data set for

* The datasets used in this paper are available from http://www.ics.uci.edu/~mlearn/machine-learning.html

G. Qiu et al. (Eds.): VISUAL 2007, LNCS 4781, pp. 288-300, 2007. Springer-Verlag Berlin Heidelberg 2007
investigating cluster distribution. The projection of HOV 3 has also been proposed to deal with cluster validation [21]. In this paper, in order to gain an enhanced visual separation of groups, we develop the projection of HOV 3 into a technique which we call M-mapping, i.e., projecting a data set against a series of measure vectors.

We structure the rest of this paper as follows. Section 2 briefly introduces the current issues of cluster analysis. It also briefly reviews the efforts that have been made in visual cluster analysis and discusses the projection of HOV 3 as the background of this research. Section 3 discusses the M-mapping model and several of its important features. Section 4 demonstrates the effectiveness of the enhanced separation feature of M-mapping for cluster exploration and validation. Finally, Section 5 concludes the paper with a brief summary of our contributions.

2 Background

2.1 Cluster Analysis

Cluster analysis includes two processes: clustering and cluster validation. Clustering aims to distinguish objects into partitions, called clusters, by given criteria. The objects in the same cluster have a higher similarity to each other than to those in other clusters. Many clustering algorithms have been proposed for different purposes in data mining [8, 11]. Cluster validation is the procedure of assessing the quality of clustering results and finding a cluster scheme that fits a specific application at hand. Since different clustering results may be obtained by applying different clustering algorithms to the same data set, or even by applying the same clustering algorithm with different parameters, cluster validation plays a critical role in cluster analysis. However, in practice, it may not always be possible to cluster huge datasets successfully using clustering algorithms.
As Abul et al. pointed out, "In high dimensional space, traditional clustering algorithms tend to break down in terms of efficiency as well as accuracy because data do not cluster well anymore" [1]. In addition, the very high computational cost of statistics-based cluster validation methods directly impacts the efficiency of cluster validation.

2.2 Visual Cluster Analysis

The user's correct estimation of the cluster number is important for choosing the parameters of clustering algorithms in the pre-processing stage of clustering, as well as for assessing the quality of clustering results in the post-processing stage. The success of these tasks heavily relies on the user's visual perception of the distribution of a given data set. It has been observed that visualization of a data set is crucial in the verification of clustering results [6]. Visual cluster analysis enhances cluster analysis by combining it with visualization techniques. Visualization techniques are typically employed as an observational mechanism for understanding the studied data. Therefore, instead of contrasting the quality of clustering results, most of the visualization techniques used in cluster analysis focus on assisting users in gaining an easy and intuitive understanding of the cluster structure in the data. Visualization has been shown to be an intuitive and effective method in the exploration and verification of cluster analysis.
Several efforts have been made in the area of cluster analysis with visualization. OPTICS [2] uses a density-based technique to detect cluster structures and visualizes clusters as Gaussian bumps, but it has non-linear time complexity, making it unsuitable for dealing with very large data sets or for providing contrasts between clustering results. H-BLOB [16] visualizes clusters as blobs in a 3D hierarchical structure. It is an intuitive cluster rendering technique, but its 3D, two-stage expression limits its capability for interactive investigation of cluster structures. Kaski et al. [15] use Self-Organizing Maps (SOM) to project high-dimensional data sets to 2D space for matching visual models [14]. However, the SOM technique is based on a single projection strategy and is not powerful enough to discover all the interesting features in the original data. Huang et al. [7, 10] proposed several approaches based on FastMap [4] to assist users in identifying and verifying the validity of clusters in visual form. Their techniques are good for cluster identification, but cannot deal with the evaluation of cluster quality very well. Moreover, the techniques discussed here are not very well suited to the interactive investigation of data distributions of high-dimensional data sets. A recent survey of visualization techniques in cluster analysis can be found in the literature [19]. Interactive visualization is useful for the user to import his/her domain knowledge into the cluster exploration stage through the observation of data distribution changes. Star Coordinates lends itself to this with its interactive adjustment features [21]. The M-mapping approach discussed in this paper has been developed based on Star Coordinates and the projection of HOV 3 [20]. For a better understanding of the work in this paper, we briefly describe them next.
2.3 The Star Coordinates Technique

Star Coordinates is a technique for mapping high-dimensional data to 2D. It divides a 2D plane into n equal sectors with n coordinate axes, where each axis represents a dimension and all axes share their initial point at the centre of a circle on the 2D space. First, the data on each dimension are normalized into the [0, 1] or [-1, 1] interval. Then the values on all axes are mapped to orthogonal X-Y coordinates which share the initial point with Star Coordinates on the 2D space. Thus, an n-dimensional data item is expressed as a point on the 2D plane. The most prominent feature of Star Coordinates and its extensions, such as VISTA [4] and HOV 3 [20], is that their computational complexity is only linear in time (that is, every n-dimensional data item is processed only once). Therefore they are very suitable as visual interpretation and exploration tools in cluster analysis. However, mapping high-dimensional data to 2D space inevitably introduces overlapping and bias. To mitigate this problem, Star Coordinates based techniques provide some visual adjustment mechanisms, such as axis scaling (called α-adjustment in VISTA), footprints, rotating axis angles and coloring data points [12].

Axis Scaling

The purpose of axis scaling in Star Coordinates is to adjust the weight value of each axis dynamically and observe the changes in the data distribution under the newly weighted axes. We use Iris, a well-known data set in the machine learning area, as an example to demonstrate how axis scaling works.
Fig. 1. The initial data distribution of clusters of Iris produced by K-means in VISTA

Fig. 2. The tuned version of the Iris data distribution in VISTA

Iris has 4 numeric attributes and 150 instances. We first applied the K-means clustering algorithm to it and obtained 3 clusters (with k=3), and then tuned the weight value of each axis of Iris in VISTA [4]. The diagram in Fig. 1 shows the original data distribution of Iris, which has overlapping among the clusters. A well-separated cluster distribution of Iris, obtained by a series of axis scalings, is illustrated in Fig. 2. The clusters are much easier to recognize there than in the original diagram.

Footprints

To observe the effect of the changes to the data points under axis scaling, Star Coordinates provides the footprints function to reveal the trace of each point [12]. We use another data set, auto-mpg, to demonstrate this feature. The auto-mpg data set has 8 attributes and 397 items. Fig. 3 presents the footprints of axis scaling of the attributes weight and mpg, where we may find some points with longer traces, and some with shorter footprints. However, the imprecise and random adjustments of Star Coordinates and VISTA limit their use as quantitative analysis tools.

Fig. 3. Footprints of axis scaling of weight and mpg attributes in Star Coordinates [12]

2.4 The HOV 3 Model

The HOV 3 model improves on the Star Coordinates model. Geometrically, the difference between a matrix Dj (a data set) and a vector M (a measure) can be represented by their inner product, Dj·M. Based on this idea, Zhang et al. proposed a projection technique called HOV 3, which generalizes the weight values of the axes in Star Coordinates into a hypothesis (measure vector) to reveal the differences between the hypotheses and the real performance [20].
The Star Coordinates model can be described simply by the Euler formula. According to the Euler formula, e^(ix) = cos x + i·sin x, where z = x + i·y and i is the imaginary unit. Let z0 = e^(2πi/n); then z0^1, z0^2, z0^3, …, z0^(n-1), z0^n (with z0^n = 1) divide the unit circle on the complex plane into n equal sectors. Thus the Star Coordinates mapping can be written simply as:

    Pj(z0) = Σ_{k=1}^{n} [(djk − min dk)/(max dk − min dk)]·z0^k        (1)

where min dk and max dk represent the minimal and maximal values of the kth coordinate respectively. In any case, equation (1) can be viewed as a mapping from R^n to C (treated as R^2). Then, given a non-zero measure vector M in R^n and a family of vectors Pj, the projection of Pj against M according to formula (1), in the HOV 3 model [20], is given as:

    Pj(z0) = Σ_{k=1}^{n} [(djk − min dk)/(max dk − min dk)]·z0^k·mk        (2)

where mk is the kth attribute of the measure M. As shown above, a hypothesis in HOV 3 is a quantified measure vector. HOV 3 not only inherits the axis scaling feature of Star Coordinates, but also generalizes axis scaling into a quantified measurement. The processes of cluster detection and cluster validation can be tackled with HOV 3 based on its quantified measurement feature [20, 21]. To improve the efficiency and effectiveness of HOV 3, we develop the projection technique of HOV 3 further into M-mapping.

3 M-Mapping

3.1 The M-Mapping Model

It is not easy to synthesize several hypotheses into one vector. In practice, rather than using a single measure to implement a hypothesis test, it is more feasible to investigate the synthesized response of applying several hypotheses/predictions together to a data set. To simplify the discussion of the M-mapping model, we give a definition first.

Definition 1 (Poly-multiply vectors to a matrix). The inner product of multiplying a series of non-zero measure vectors M1, M2, …, Ms with a matrix A is denoted as A·∏_{i=1}^{s} Mi = A·M1·M2·…·Ms.
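Equations (1) and (2), together with the poly-multiplication of Definition 1, can be sketched compactly in code. The original system is in MATLAB; the following is an illustrative Python/NumPy sketch (the function name is ours, not from the papers): with no measures it is plain Star Coordinates, with one measure it is HOV 3, and with several measures it is M-mapping.

```python
# A minimal sketch of the Star Coordinates / HOV 3 / M-mapping projection:
# min-max normalise each column, weight it by the (product of) measure
# vector(s), and sum along the n unit vectors z0^k = e^(2*pi*i*k/n).
import numpy as np

def hov3_project(data, measures=()):
    """Project n-D rows onto the plane; `measures` is a sequence of
    n-dimensional measure vectors (empty -> plain Star Coordinates)."""
    data = np.asarray(data, dtype=float)
    n = data.shape[1]
    span = data.max(axis=0) - data.min(axis=0)
    span[span == 0] = 1.0                        # guard constant columns
    norm = (data - data.min(axis=0)) / span      # min-max normalisation
    m = np.ones(n)
    for M in measures:                           # Definition 1: poly-multiply
        m = m * np.asarray(M, dtype=float)
    z0 = np.exp(2j * np.pi * np.arange(1, n + 1) / n)   # unit-circle axes
    z = (norm * m) @ z0                          # one complex point per row
    return np.column_stack([z.real, z.imag])
```

Passing the same measure vector s times in `measures` reproduces the repeated application M^s discussed below.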
A simple notation of the HOV 3 projection, Δp = HC(Π, M), was given by Zhang et al. [20], where Π is a data set and Δp is the data distribution of Π obtained by applying a measure vector M. The projection of M-mapping is then denoted as Δp = HC(Π, ∏_{i=1}^{s} Mi). Based on equation (2), M-mapping is formulated as follows:

    Pj(z0) = Σ_{k=1}^{n} [(djk − min dk)/(max dk − min dk)]·z0^k·∏_{i=1}^{s} mik        (3)
where mik is the kth attribute (dimension) of the ith measure vector Mi, and s ≥ 1. When s = 1, equation (3) reduces to equation (2) (the HOV 3 model). We may observe that the single factor mk of formula (2) is replaced by the poly-multiplication ∏_{i=1}^{s} mik in equation (3). Equation (3) is more general and also closer to the real procedure of cluster detection: it introduces several aspects of domain knowledge together into the process of cluster detection by HOV 3. Geometrically, the data projection by M-mapping is the synthesized effect of applying each measure vector by HOV 3. In addition, applying M-mapping to a dataset with the same measure vector repeatedly can enhance the separation of grouped data points under certain conditions. We describe this enhanced separation feature of M-mapping below.

3.2 The Features of M-Mapping

To explain the geometrical meaning of the M-mapping projection, we use the real number system. According to equation (2), the general form of the distance σ (i.e., the weighted Minkowski distance) between two points a and b in the HOV 3 plane can be represented as:

    σ(a, b, M) = (Σ_{k=1}^{n} mk·(ak − bk)^q)^(1/q),  q > 0        (4)

If q = 1, σ is the Manhattan (city block) distance; and if q = 2, σ is the Euclidean distance. To simplify the discussion of our idea, we adopt the Manhattan metric. Note that there exists an equivalent mapping (bijection) of distance comparison between the Manhattan and Euclidean metrics [13]. For example, if the distance ab between points a and b is longer than the distance a'b' between points a' and b' in the Manhattan metric, it is also true in the Euclidean metric, and vice versa. In Fig. 4, the orthogonal lines represent the Manhattan distances and the diagonal lines the Euclidean distances (red for a'b' and blue for ab) respectively.
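Equation (4) is a one-liner in code; this illustrative sketch (function name ours) covers both the q = 1 and q = 2 cases mentioned above:

```python
# A small sketch of the weighted Minkowski distance of equation (4); q=1
# gives the Manhattan (city block) metric and q=2 the Euclidean metric.
import numpy as np

def weighted_minkowski(a, b, m, q):
    a, b, m = (np.asarray(v, dtype=float) for v in (a, b, m))
    return float(np.sum(m * np.abs(a - b) ** q) ** (1.0 / q))
```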
Then the Manhattan distance between points a and b is calculated as in formula (5).

Fig. 4. The distance representation in the Manhattan and Euclidean metrics

    σ(a, b, M) = Σ_{k=1}^{n} mk·|ak − bk|        (5)

According to equations (2), (3) and (5), we can express the distance of M-mapping as a Manhattan distance as follows:

    σ(a, b, ∏_{i=1}^{s} Mi) = Σ_{k=1}^{n} ∏_{i=1}^{s} mik·|ak − bk|        (6)
Definition 2 (The distance representation of M-mapping). The distance between two data points a and b projected by M-mapping is denoted as (∏_{i=1}^{s} Mi)·σab. If the measure vectors in an M-mapping are all the same, the distance can simply be written as M^s·σab; if each attribute of M is 1 (the no-measure case), the distance between points a and b is denoted as σab. For example, the distance between two points a and b projected by M-mapping applying the same measure twice can be represented as M^2·σab, and the projection of a and b by HOV 3 can be written as M·σab.

Contracting Feature

From equations (5) and (6), we may observe that the application of M-mapping to a data set is a contracting process of the data distribution of the data set. This is because, when mk < 1 and σab ≠ 0, we have Σ_{k=1}^{n} mk·|ak − bk| < Σ_{k=1}^{n} |ak − bk|, i.e., σ(a, b, M) < σab. In the same way, we have σ(a, b, M^2) < σ(a, b, M) and σ(a, b, M^(n+1)) < σ(a, b, M^n) for any n. Hinneburg et al. proved that a contracting projection of a data set can strictly preserve the density of the dataset [9]. Chen and Liu also proved that in the Star Coordinates 2D space, data points that were originally close remain relatively close in the newly produced data distribution after axis scaling [4]. Thus the relative geometrical positions of the data points within a cluster become closer by applying M-mapping to the data set.

Enhanced Separation Feature

If the measure vector is changed from M to M′, and |M·σab − M·σac| < |M′·σab − M′·σac|, then

    |M′^2·σab − M′^2·σac| / |M′·σab − M′·σac| > |M′·σab − M′·σac| / |M·σab − M·σac|.

Due to space limitations, the detailed proof of this property can be found in [22].
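The contracting feature is easy to verify numerically. A toy check, assuming all weights mk < 1 and using the Manhattan form of equation (6) (the values and function name are illustrative, and this sketch does not prove the separation inequality itself):

```python
# A toy check of the contracting feature: with all weights m_k < 1, each
# further application of the same measure vector shrinks the weighted
# Manhattan distance of equation (6), so sigma(a,b,M^(n+1)) < sigma(a,b,M^n).
import numpy as np

def m_distance(a, b, m, s=1):
    """Manhattan distance after s applications of measure vector m (eq. 6)."""
    a, b, m = (np.asarray(v, dtype=float) for v in (a, b, m))
    return float(np.sum((m ** s) * np.abs(a - b)))

a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 1.0]
m = [0.8, 0.5, 0.9]
d = [m_distance(a, b, m, s) for s in range(4)]   # s=0 is the unweighted case
assert all(d[i + 1] < d[i] for i in range(3))    # strictly contracting
```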
This inequality shows that if the difference between the distances ab and ac is increased by scaling the axes from M to M′ (which can be observed in the footprints of points a, b and c, as shown in Fig. 3), then after applying M-mapping to a, b and c, the variation rate of the distances ab and ac is enhanced further. In other words, if it is observed that several groups of data points can be roughly separated (with ambiguous points remaining between the groups) by projecting a data set with a measure vector in HOV 3, then applying M-mapping with that measure vector to the data set leaves the groups more contracted, i.e., there will be a good separation of the groups. These two features of M-mapping are significant for identifying the membership formation of clusters in the process of cluster exploration and cluster verification. This is because the contracting feature of M-mapping keeps the data points within a
cluster relatively closer, i.e., grouping information is preserved. On the other hand, the enhanced separation feature of M-mapping pushes distant data points relatively further apart.

Improving the Accuracy of Data Point Selection with Zooming

External cluster validation [19] refers to the comparison of previously produced cluster patterns with newly produced cluster patterns in order to evaluate the genuine cluster structure of a data set. However, due to the very high computational cost of statistical methods for assessing the consistency of cluster structures between the subsets of a large database, this task remains a challenge. Let us assume that if two sampled subsets of a dataset have similar data distributions, and a measure vector is applied to both of them by HOV 3, the similarity of their data distributions should still be high. Based on this assumption, Zhang et al. [21] proposed a visual external cluster validation approach with HOV 3. Their approach takes a clustered subset of a database and a same-sized non-clustered subset as an observation. It then applies several measure vectors that can separate the clusters in the clustered subset. Each cluster and the data points it geometrically covers within a given threshold distance (called a quasi-cluster in their approach) are thus selected. Finally, the overlapping rate of each cluster and quasi-cluster pair is calculated; if the overlapping rate approaches 1, the two subsets have a similar cluster distribution. Compared to statistics-based external validation methods, their method is not only visually intuitive, but also more effective in real applications [21]. However, separating a cluster from a large number of overlapping points manually is often time consuming.
We claim that the enhanced separation feature of HOV3 can provide improvements not only in efficiency but also in accuracy in dealing with external cluster validation by the proposed approach [21]. As mentioned above, the application of M-mapping to a data set is a contracting process. In order to avoid the contracting effect causing pseudo data points to be selected, we introduce a zooming feature into M-mapping. According to equation (2), zooming in HOV3 can be understood as projecting a data set with a vector whose attribute values are all the same, i.e., each m_k in equation (2) has the same value. We choose min(m)⁻¹ as the zooming vector value, where min(m) is the non-zero minimal value of the m_k. Thus the scale of patterns in HOV3 is amplified by applying the combination of M-mapping and zooming. This combination is formalized in equation (7):

P_j(z_0) = Σ_{k=1}^{n} [ (d_jk − min d_k) / (max d_k − min d_k) · z_0^k · (m_k · min(m)⁻¹)^s ]    (7)

Because m_k < 1, and min(m) is the non-zero minimal value of the m_k in a measure vector, we have (m_k)^s · min(m)⁻^s > 1 whenever there exists m_k > min(m). With the effect of (m_k)^s · min(m)⁻^s, M-mapping enlarges the scale of the data distributions projected by HOV3. With the same threshold distance for data selection as proposed by Zhang et al [21], M-mapping with zooming can improve the precision of geometrically covered data point selection.
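Equation (7) can be sketched in a few lines of code. The function name and the min-max normalisation details are illustrative assumptions; the axis layout follows the Star Coordinates convention z_0 = e^{2πi/n}.

```python
import numpy as np

def m_mapping_zoom(data, measure, s=2):
    """Project `data` (rows x n) onto the complex plane with the measure
    vector applied s times, rescaled by min(m)^(-s) (equation (7))."""
    data = np.asarray(data, dtype=float)
    m = np.asarray(measure, dtype=float)
    n = data.shape[1]
    lo, hi = data.min(axis=0), data.max(axis=0)
    norm = (data - lo) / np.where(hi > lo, hi - lo, 1.0)  # (d - min)/(max - min)
    axes = np.exp(2j * np.pi * np.arange(1, n + 1) / n)   # z0^1 .. z0^n
    m_min = m[m != 0].min()                               # non-zero minimum of m
    weights = (m / m_min) ** s                            # (m_k * min(m)^-1)^s
    return norm @ (axes * weights)

rng = np.random.default_rng(1)
X = rng.random((100, 4))
p = m_mapping_zoom(X, [0.5, 1.0, 0.25, 0.75], s=2)  # 100 complex points
```

Because every weight (m_k / min(m))^s is at least 1, the projected pattern keeps the shape of the plain M-mapping result but at an amplified scale, so a fixed selection threshold covers fewer spurious neighbours.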
296 K.-B. Zhang, M.A. Orgun, and K. Zhang

4 Examples and Explanation

In this section we present several examples to demonstrate the efficiency and the effectiveness of M-mapping in cluster analysis.

4.1 Cluster Exploration with M-Mapping

Choosing an appropriate cluster number for an unknown data set is meaningful in the pre-clustering stage. The enhanced separation feature of M-mapping is advantageous for identifying the cluster number in this stage. We demonstrate this advantage of M-mapping with the following examples.

Wine Data
The Wine data set (Wine in short) has 13 attributes and 178 records. The original data distribution of Wine in 2D space is shown in Fig. 5a, where no grouping information can be observed. Then we tuned the axes weight values randomly and obtained a

Fig. 5. Distributions of Wine data produced by HOV3 in MATLAB: (a) the original data distribution of Wine (no measure case); (b) the data distribution of Wine after tuning axes weight values randomly; (c) Δp2 = H_C(Wine, M*M); (d) Δp2 colored by the cluster indices of K-means (k=3)
roughly separated data distribution of Wine (what looks like two groups), as shown in Fig. 5b; we recorded the axes values of Wine as M = [-0.44458, -0.028484, -0.23029, -0.020356, -0.087636, 0.015982, 0.17392, 0.21283, -0.11461, 0.099163, -0.19181, 0.34533, 0.27328]. Then we employed M² (the elementwise product M.*M) as a measure vector and applied it to Wine. The newly projected distribution Δp2 of Wine is presented in Fig. 5c, where it has become much easier to identify 3 groups of Wine. Thus, we colored the Wine data with the cluster indices produced by the K-means clustering algorithm with k=3. The colored data distribution Δp2 of Wine is illustrated in Fig. 5d. To demonstrate the effectiveness of the enhanced separation feature of M-mapping, we contrast the statistics of Δp2 of Wine (clustered by their distribution, as shown in Fig. 5c) with the clustering result of Wine by K-means (k=3). The result is shown in Table 1, where the left side of Table 1 (C_H) gives the statistics of the clustering result of Wine by M-mapping, and the right side (C_K) gives the clustering result by K-means (k=3). By comparing the statistics of these two clustering results, we may observe that the quality of the clustering result obtained from the distribution of Δp2 of Wine is slightly better than that produced by K-means, according to the variances of the clustering results. Observing the colored data distribution in Fig. 5d carefully, we may find that there is a green point grouped into the brown group by K-means. Table 1.
The statistics of the clusters in Wine data produced by M-mapping in HOV3 and by K-means

       C_H                                          C_K
   Items   %       Radius    Variance  MaxDis   Items   %       Radius    Variance  MaxDis
1    48    26.966  102.286   0.125     102.523    48    27.528  102.008   0.126     102.242
2    71    39.888   97.221   0.182      97.455    71    39.326   97.344   0.184      97.579
3    59    33.146  108.289   0.124     108.497    59    33.146  108.289   0.124     108.497

By analyzing the data of these 3 groups, we have found that group 1 contains 48 items with Alcohol value 3; group 2 has 71 instances with Alcohol value 2; and group 3 includes 59 records with Alcohol value 1.

Boston Housing Data
The Boston Housing data set (simply written as Housing) has 14 attributes and 506 instances. The original data distribution of Housing is given in Fig. 6a. As in the above example, based on observation and axis scaling we obtained a roughly separated data distribution of Housing, as shown in Fig. 6b; we fixed the weight values of the axes as M = [0.5, 1, 0, 0.95, 0.3, 1, 0.5, 0.5, 0.8, 0.75, 0.25, 0.55, 0.45, 0.75]. By comparing the diagrams in Fig. 6a and Fig. 6b, we can see that the data points in Fig. 6b are constricted into possibly 3 or 4 groups. Then M-mapping was applied to Housing. Fig. 6c and Fig. 6d show the results of M-mapping with M.*M and M.*M.*M respectively. It is much easier to observe the grouping insight in Fig. 6c and Fig. 6d, where we can identify the group members easily. We believe that with domain experts involved in the process, the M-mapping approach can perform even better in real world applications of cluster analysis.
Fig. 6. The enhanced separation of the data set Housing: (a) the original data distribution of Housing; (b) Δp1 = H_C(Housing, M); (c) Δp2 = H_C(Housing, M*M); (d) Δp3 = H_C(Housing, M*M*M)

4.2 Cluster Validation with M-Mapping

We may observe that the data distributions in Fig. 6c and Fig. 6d are more contracted than the data distributions in Fig. 6a and Fig. 6b. To ensure that this contracting

Fig. 7. The distributions produced by M-mapping and M-mapping with zooming: (a) Δp2 = H_C(Housing, M*M); (b) Δp2 = H_C(Housing, M*M*V)
process does not affect data selection, we introduce zooming into the M-mapping process. For example, in the last example, the non-zero minimal value of the measure vector M is 0.25. We then use V = [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4] (4 = 1/0.25) as the zooming vector. We discuss the application of M-mapping with zooming below. It can be observed that the shape of the patterns in Fig. 7a is exactly the same as that in Fig. 7b, but the scale in Fig. 7b is enlarged. Thus the effect of combining M-mapping and zooming improves the accuracy of data selection in external cluster validation by HOV3 [21].

5 Conclusions

In this paper we have proposed a visual approach, called M-mapping, to aid users in enhancing the separation and the contraction of data groups/clusters in cluster detection and cluster validation. We have also shown that, based on the observation of data footprints, users can trace grouping clues, and then, by applying the M-mapping technique to the data set, they can enhance the separation and the contraction of the potential data groups, and therefore find useful grouping information effectively. With the advantage of the enhanced separation and contraction features of M-mapping, users can identify the cluster number efficiently in the pre-processing stage of clustering, and they can also verify the membership formation of data points among the clusters effectively in the post-processing stage of clustering by M-mapping with zooming.

References

1. Abul, A.L., Alhajj, R., Polat, F., Barker, K.: Cluster Validity Analysis Using Subsampling. In: Proc. of the IEEE International Conference on Systems, Man, and Cybernetics, vol. 2, pp. 1435-1440. Washington DC (October 2003)
2. Ankerst, M., Breunig, M.M., Kriegel, H.-P., Sander, J.: OPTICS: Ordering points to identify the clustering structure. In: Proc. of ACM SIGMOD Conference, pp. 49-60 (1999)
3.
Baumgartner, C., Plant, C., Kailing, K., Kriegel, H.-P., Kröger, P.: Subspace Selection for Clustering High-Dimensional Data. In: Perner, P. (ed.) ICDM 2004. LNCS (LNAI), vol. 3275, pp. 11-18. Springer, Heidelberg (2004)
4. Chen, K., Liu, L.: VISTA: Validating and Refining Clusters via Visualization. Journal of Information Visualization 13(4), 257-270 (2004)
5. Faloutsos, C., Lin, K.: Fastmap: a fast algorithm for indexing, data mining and visualization of traditional and multimedia data sets. In: Proc. of ACM SIGMOD, pp. 163-174 (1995)
6. Halkidi, M., Batistakis, Y., Vazirgiannis, M.: Cluster validity methods: Part I and II. SIGMOD Record 31 (2002)
7. Huang, Z., Cheung, D.W., Ng, M.K.: An Empirical Study on the Visual Cluster Validation Method with Fastmap. In: Proc. of DASFAA 2001, pp. 84-91 (2001)
8. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco (2001)
9. Hinneburg, A., Keim, D.A., Wawryniuk, M.: HD-Eye: Visual mining of high-dimensional data. IEEE Computer Graphics & Applications 19(5), 22-31 (1999)
10. Huang, Z., Lin, T.: A visual method of cluster validation with Fastmap. In: Proc. of PAKDD 2000, pp. 153-164 (2000)
11. Jain, A., Murty, M.N., Flynn, P.J.: Data Clustering: A Review. ACM Computing Surveys 31(3), 264-323 (1999)
12. Kandogan, E.: Visualizing multi-dimensional clusters, trends, and outliers using star coordinates. In: Proc. of ACM SIGKDD Conference, pp. 107-116 (2001)
13. Kominek, J., Black, A.W.: Measuring Unsupervised Acoustic Clustering through Phoneme Pair Merge-and-Split Tests. In: 9th European Conference on Speech Communication and Technology (Interspeech 2005), Lisbon, Portugal, pp. 689-692 (2005)
14. Kohonen, T.: Self-Organizing Maps, 2nd edn. Springer, Berlin (1997)
15. Kaski, S., Sinkkonen, J., Peltonen, J.: Data Visualization and Analysis with Self-Organizing Maps in Learning Metrics. In: Kambayashi, Y., Winiwarter, W., Arikawa, M. (eds.) DaWaK 2001. LNCS, vol. 2114, pp. 162-173. Springer, Heidelberg (2001)
16. Sprenger, T.C., Brunella, R., Gross, M.H.: H-BLOB: A Hierarchical Visual Clustering Method Using Implicit Surfaces. In: Proc. of the Conference on Visualization 2000, pp. 61-68. IEEE Computer Society Press, Los Alamitos (2000)
17. Shneiderman, B.: Inventing Discovery Tools: Combining Information Visualization with Data Mining. In: Jantke, K.P., Shinohara, A. (eds.) DS 2001. LNCS (LNAI), vol. 2226, pp. 17-28. Springer, Heidelberg (2001)
18. Seo, J., Shneiderman, B.: In: Hemmje, M., Niederée, C., Risse, T. (eds.) From Integrated Publication and Information Systems to Information and Knowledge Environments. LNCS, vol. 3379. Springer, Heidelberg (2005)
19. Vilalta, R., Stepinski, T., Achari, M.: An Efficient Approach to External Cluster Assessment with an Application to Martian Topography. Technical Report No. UH-CS-05-08, Department of Computer Science, University of Houston (2005)
20. Zhang, K.-B., Orgun, M.A., Zhang, K.: HOV3: An Approach for Cluster Analysis. In: Li, X., Zaïane, O.R., Li, Z.
(eds.) ADMA 2006. LNCS (LNAI), vol. 4093, pp. 317-328. Springer, Heidelberg (2006)
21. Zhang, K.-B., Orgun, M.A., Zhang, K.: A Visual Approach for External Cluster Validation. In: Proc. of the First IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007), Honolulu, Hawaii, pp. 577-582. IEEE Computer Press, Los Alamitos (2007)
22. Zhang, K.-B., Orgun, M.A., Zhang, K.: A Prediction-based Visual Approach for Cluster Exploration and Cluster Validation by HOV3. In: ECML/PKDD 2007, Warsaw, Poland, September 17-21, pp. 336-349 (2007)
A Prediction-Based Visual Approach for Cluster Exploration and Cluster Validation by HOV3 *

Ke-Bing Zhang 1, Mehmet A. Orgun 1, and Kang Zhang 2
1 Department of Computing, ICS, Macquarie University, Sydney, NSW 2109, Australia {kebing,mehmet}@ics.mq.edu.au
2 Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083-0688, USA zhang@utdallas.edu

Abstract. Predictive knowledge discovery is an important knowledge acquisition method. It is also used in the clustering process of data mining. Visualization is very helpful for high dimensional data analysis, but it is not precise, and this limits its usability in quantitative cluster analysis. In this paper, we adopt a visual technique called HOV3 to explore and verify clustering results with quantified measurements. With the quantified contrast between grouped data distributions produced by HOV3, users can detect clusters and verify their validity efficiently.

Keywords: predictive knowledge discovery, visualization, cluster analysis.

1 Introduction

Predictive knowledge discovery utilizes existing knowledge to deduce, reason about and establish predictions, and to verify the validity of the predictions. Through this validation processing, the knowledge may be revised and enriched with new knowledge [20]. The methodology of predictive knowledge discovery is also used in the clustering process [3]. Clustering is regarded as an unsupervised learning process that finds group patterns within datasets. It is a widely applied technique in data mining. To achieve different application purposes, a large number of clustering algorithms have been developed [3, 9]. However, most existing clustering algorithms cannot handle arbitrarily shaped data distributions within extremely large and high-dimensional databases very well. The very high computational cost of statistics-based cluster validation methods in cluster analysis also prevents clustering algorithms from being used in practice.
Visualization is very powerful and effective in revealing trends, highlighting outliers, showing clusters, and exposing gaps in high-dimensional data analysis [19]. Many studies have been proposed to visualize the cluster structure of databases [15, 19]. However, most of them focus on information rendering, rather than on investigating how data behavior changes with the variation of the parameters of the algorithms.

* The datasets used in this paper are available from http://www.ics.uci.edu/~mlearn/machine-Learning.html.

J.N. Kok et al. (Eds.): PKDD 2007, LNAI 4702, pp. 336-349, 2007. Springer-Verlag Berlin Heidelberg 2007
In this paper we adopt HOV3 (Hypothesis Oriented Verification and Validation by Visualization) to project high dimensional data onto a 2D complex space [22]. By applying predictive measures (quantified domain knowledge) to the studied data, users can detect grouping information precisely, and employ the clustered patterns as predictive classes to verify the consistency between the clustered subset and non-clustered subsets. The rest of this paper is organized as follows. Section 2 briefly introduces the current issues of cluster analysis, and the HOV3 technique as the background of this research. Section 3 presents our prediction-based visual cluster analysis approach, with examples to demonstrate its effectiveness in cluster exploration and cluster validation. A short review of the related work in visual cluster analysis is provided in Section 4. Finally, Section 5 summarizes the contributions of this paper.

2 Background

The approach reported in this paper has been developed based on the projection technique of HOV3 [22], which was inspired by the Star Coordinates technique. For a better understanding of our work, we briefly describe Star Coordinates and HOV3.

2.1 Visual Cluster Analysis

Cluster analysis includes two major aspects: clustering and cluster validation. Clustering aims at grouping objects into clusters such that the similarity of objects is high within clusters and low between clusters. Hundreds of clustering algorithms have been proposed [3, 9]. Since there is no general-purpose clustering algorithm that fits all kinds of applications, the evaluation of the quality of clustering results becomes the critical issue of cluster analysis, i.e., cluster validation. Cluster validation aims to assess the quality of clustering results and to find a cluster scheme that fits a given specific application.
The user's initial estimation of the cluster number is important for choosing the parameters of clustering algorithms in the pre-processing stage of clustering. Also, the user's clear understanding of the cluster distribution is helpful for assessing the quality of clustering results in the post-processing stage of clustering. The user's visual perception of the data distribution plays a critical role in these processing stages. Using visualization techniques to explore and understand high dimensional datasets is becoming an efficient way to combine human intelligence with the immense brute-force computation power available nowadays [16]. Visual cluster analysis is a combination of visualization and cluster analysis. As an indispensable aid for human participation, visualization is involved in almost every step of cluster analysis. Many studies have been performed on high dimensional data visualization [2, 15], but most of them do not visualize clusters well for high dimensional and very large data. Section 4 discusses several studies that have focused on visual cluster analysis [1, 7, 8, 10, 13, 14, 17, 18] as the related work of this research. Star Coordinates is a good choice for visual cluster analysis with its interactive adjustment features [11].
2.2 Star Coordinates

The idea of the Star Coordinates technique is intuitive; it extends the perspective of the traditional orthogonal X-Y 2D and X-Y-Z 3D coordinate techniques to a higher dimensional space [11]. Technically, Star Coordinates divides a 2D plane into n equal sectors with n coordinate axes, where each axis represents a dimension and all axes share the initial point at the centre of a circle on the 2D space. First, the data in each dimension are normalized into the [0, 1] or [-1, 1] interval. Then the values on all axes are mapped to orthogonal X-Y coordinates which share the initial point with Star Coordinates on the 2D space. Thus, an n-dimensional data item is expressed as a point in the X-Y 2D plane. Fig. 1 illustrates the mapping from 8 Star Coordinates axes to X-Y coordinates. In practice, projecting high dimensional data onto 2D space inevitably introduces overlapping and ambiguities, even bias. To mitigate the problem, Star Coordinates and its extension iVIBRATE [4] provide several visual adjustment mechanisms, such as axis scaling, axis angle rotation and data point filtering, to change the data distribution of a dataset interactively in order to detect cluster characteristics and render clustering results effectively. Below we briefly introduce the two adjustment features relevant to this research.

Fig. 1. Positioning a point by an 8-attribute vector in Star Coordinates [11]

Axis scaling
The purpose of axis scaling in Star Coordinates (called α-adjustment in iVIBRATE) is to interactively adjust the weight value of each axis so that users can observe the data distribution changing dynamically. For example, the diagram in Fig. 2 shows the original data distribution of Iris (Iris has 4 numeric attributes and 150 instances) with the clustering indices produced by the K-means clustering algorithm in iVIBRATE, where clusters overlap (here k=3). A well-separated cluster distribution of Iris is illustrated in Fig.
3 by a series of random α-adjustments, where the clusters are much easier to recognize than those in the original distribution in Fig. 2. For tracing how data points change over a certain period of time, the footprint function is provided by Star Coordinates. It is discussed below.

Fig. 2. The initial data distribution of the clusters of Iris produced by K-means in iVIBRATE
Fig. 3. The separated version of the Iris data distribution in iVIBRATE
Footprint
We use another data set, auto-mpg, to demonstrate the footprint feature. The data set auto-mpg has 8 attributes and 398 items. Fig. 4 presents the footprints of axis tuning of the attributes weight and mpg, where we may find some points with longer traces, and some with shorter footprints. The most prominent feature of Star Coordinates and its extensions such as iVIBRATE is that their computational complexity is only linear in time. This makes them very suitable to be employed as visual tools for interactive interpretation and exploration in cluster analysis.

Fig. 4. Footprints of axis scaling of the weight and mpg attributes in Star Coordinates [11]

However, cluster exploration and refinement based on the user's intuition inevitably introduces randomness and subjectiveness into visual cluster analysis, and as a result, the adjustments of Star Coordinates and iVIBRATE can sometimes be arbitrary and time consuming.

2.3 HOV3

In fact, the Star Coordinates model can be mathematically described by the Euler formula. According to the Euler formula, e^{ix} = cos x + i·sin x, where z = x + i·y and i is the imaginary unit. Let z_0 = e^{2πi/n}; then z_0^1, z_0^2, z_0^3, …, z_0^{n-1}, z_0^n (with z_0^n = 1) divide the unit circle on the complex 2D plane into n equal sectors. Thus, Star Coordinates can simply be written as:

P_j(z_0) = Σ_{k=1}^{n} [ (d_jk − min d_k) / (max d_k − min d_k) · z_0^k ]    (1)

where min d_k and max d_k represent the minimal and maximal values of the kth coordinate respectively. In any case, equation (1) can be viewed as a mapping from R^n to R^2. To overcome the arbitrary and random adjustments of Star Coordinates and iVIBRATE, Zhang et al proposed a hypothesis-oriented visual approach called HOV3 to detect clusters [22]. The idea of HOV3 is that, in analytical geometry, the difference between a data set (a matrix) D_j and a measure vector M with the same number of variables as D_j can be represented by their inner product, D_j · M.
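Formula (1) can be sketched directly; adding a measure vector as an extra per-axis weight then gives the HOV3 projection that weights each axis. The function name and the min-max normalisation details are illustrative assumptions for this sketch.

```python
import numpy as np

def star_coordinates(data, measure=None):
    """Formula (1): map each n-D row to one complex number using axis
    vectors z0^k = exp(2*pi*i*k/n). An optional measure vector adds the
    per-axis weighting m_k used by the HOV3 projection."""
    data = np.asarray(data, dtype=float)
    n = data.shape[1]
    lo, hi = data.min(axis=0), data.max(axis=0)
    norm = (data - lo) / np.where(hi > lo, hi - lo, 1.0)  # (d - min d)/(max d - min d)
    axes = np.exp(2j * np.pi * np.arange(1, n + 1) / n)   # z0^1 .. z0^n, with z0^n = 1
    m = np.ones(n) if measure is None else np.asarray(measure, dtype=float)
    return norm @ (axes * m)

rng = np.random.default_rng(0)
X = rng.random((150, 4))
p = star_coordinates(X)                          # plain Star Coordinates
q = star_coordinates(X, [0.5, 1.0, 0.25, 0.75])  # measure-weighted projection
```

With a measure vector of all ones the weighted projection reduces to formula (1), which is the sense in which HOV3 subsumes the Star Coordinates model.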
HOV3 uses a measure vector M to represent the corresponding axes' weight values. Given a non-zero measure vector M in R^n and a family of vectors P_j, the projection of P_j against M according to formula (1), i.e., the HOV3 model, is presented as:

P_j(z_0) = Σ_{k=1}^{n} [ (d_jk − min d_k) / (max d_k − min d_k) · z_0^k · m_k ]    (2)

where m_k is the kth attribute of the measure vector M. The aim of the interactive adjustments of Star Coordinates and iVIBRATE is to obtain some separated groups, or a fully separated clustering result, by tuning the weight value of each axis, but their arbitrary and random adjustments limit their applicability. As shown in formula (2), HOV3 summarizes these adjustments as a coefficient/measure vector. Comparing formulas (1) and (2), it can be observed that
HOV3 subsumes the Star Coordinates model [22]. Thus the HOV3 model provides users with a mechanism to quantify a prediction about a data set as a measure vector of HOV3 for precisely exploring grouping information. Equation (2) is a standard form of a linear transformation of n variables, where m_k is the coefficient of the kth variable of P_j. In principle, any measure vector, even one in complex number form, can be introduced into the linear transformation of HOV3 if it can separate a data set into groups or produce well separated clusters visually. Thus the rich statistical methods reflecting the characteristics of a data set can also be introduced as predictions in the HOV3 projection, so that users may discover more clustering patterns. The detailed explanation of this approach is presented next.

3 Predictive Visual Cluster Analysis by HOV3

Predictive exploration is a mathematical description of future behavior based on the historical exploration of patterns. The goal of predictive visual exploration by HOV3 is that, by applying a prediction (measure vector) to a dataset, the user may identify groups from the resulting visualization. Thus the key issue in applying HOV3 to detect grouping information is how to quantify historical patterns (or the user's domain knowledge) as a measure vector to achieve this goal.

3.1 Multiple HOV3 Projection (M-HOV3)

In practice, it is not easy to synthesize historical knowledge about a data set into one vector; rather than using a single measure to implement a prediction test, it is more suitable to apply several predictions (measure vectors) together to the data set. We call this process multiple HOV3 projection, M-HOV3 in short. We now provide a detailed description of M-HOV3 and its feature of enhanced group separation. To simplify the discussion of the M-HOV3 model, we first give a definition.

Definition 1.
(poly-multiply vectors to a matrix) The inner product of multiplying a series of non-zero measure vectors M_1, M_2, …, M_s with a matrix A is denoted as A ∗ Π_{i=1}^{s} M_i = A∗M_1∗M_2∗…∗M_s.

Zhang et al [23] gave a simple notation for the HOV3 projection, D_p = H_C(P, M), where P is a data set and D_p is the data distribution of P obtained by applying a measure vector M. The projection of M-HOV3 is then denoted as D_p = H_C(P, Π_{i=1}^{s} M_i). Based on equation (2), we formulate M-HOV3 as:

P_j(z_0) = Σ_{k=1}^{n} [ (d_jk − min d_k) / (max d_k − min d_k) · z_0^k · Π_{i=1}^{s} m_ik ]    (3)

where m_ik is the kth attribute (dimension) of the ith measure vector M_i, and s ≥ 1. When s=1, formula (3) reduces to formula (2). We may observe that the single multiplier m_k in formula (2) is replaced by the poly-multiplication Π_{i=1}^{s} m_ik in formula (3). Formula (3) is more
general and also closer to the real procedure of cluster detection, because it introduces several aspects of domain knowledge together into the cluster detection. In addition, the effect of applying M-HOV3 to datasets with the same measure vector can enhance the separation of grouped data points under certain conditions.

3.2 The Enhanced Separation Feature of M-HOV3

To explain the geometrical meaning of the M-HOV3 projection, we use the real number system. According to equation (2), the general form of the distance σ (i.e., the weighted Minkowski distance) between two points a and b in the HOV3 plane can be represented as:

σ(a, b, M) = (Σ_{k=1}^{n} m_k |a_k − b_k|^q)^{1/q}   (q > 0)    (4)

If q = 1, σ is the Manhattan (city block) distance; and if q = 2, σ is the Euclidean distance. To simplify the discussion of our idea, we adopt the Manhattan metric for the explanation. Note that there exists an equivalent mapping (bijection) of distance calculation between the Manhattan and Euclidean metrics [6]. For example, if the distance between points a and b is longer than the distance between points a′ and b′ in the Manhattan metric, it is also true in the Euclidean metric, and vice versa. The Manhattan distance between points a and b is then calculated as in formula (5):

σ(a, b, M) = Σ_{k=1}^{n} m_k |a_k − b_k|    (5)

According to formulas (2), (3) and (5), we can present the distance of M-HOV3 in the Manhattan metric as follows:

σ(a, b, Π_{i=1}^{s} M_i) = Σ_{k=1}^{n} (Π_{i=1}^{s} m_ik) |a_k − b_k|    (6)

Definition 2. (the distance representation of M-HOV3) The distance between two data points a and b projected by M-HOV3 is denoted as Π_{i=1}^{s} M_i σab. In particular, if the measure vectors in an M-HOV3 are all the same, this can simply be written as M^s σab; if each attribute of M is 1 (the no-measure case), the distance between points a and b is denoted as σab. Thus, we have M^s σab = H_C((a,b), M^s).
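Formulas (5) and (6) are easy to exercise numerically: with every measure entry in (0, 1), applying the measure once and then twice shrinks the weighted Manhattan distance step by step. The points and measure below are made up for illustration.

```python
import numpy as np

def weighted_manhattan(a, b, m):
    """Formula (5): sigma(a, b, M) = sum_k m_k * |a_k - b_k|."""
    return float(np.sum(np.asarray(m, dtype=float)
                        * np.abs(np.asarray(a, dtype=float)
                                 - np.asarray(b, dtype=float))))

a = np.array([0.9, 0.1, 0.4])
b = np.array([0.2, 0.8, 0.6])
m = np.array([0.5, 0.9, 0.3])   # every m_k in (0, 1)

d0 = weighted_manhattan(a, b, np.ones(3))  # sigma_ab (no measure)
d1 = weighted_manhattan(a, b, m)           # M sigma_ab
d2 = weighted_manhattan(a, b, m * m)       # M^2 sigma_ab, per formula (6)
print(d0, d1, d2)  # 1.6 > 1.04 > 0.76: each application contracts the distance
```

This monotone contraction is exactly the behaviour formalised in the lemmas that follow.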
For example, the distance between two points a and b projected by M-HOV3 with the same two measures can be represented as M²σab, and the HOV3 projection distance of a and b can be written as Mσab. We now give several important properties of M-HOV3.

Lemma 1. In Star Coordinates space, if σab ≠ 0 and M ≠ 0 (with 0 < m_k ≤ 1 for all k and ∃ m_k ∈ M with 0 < m_k < 1), then σab > Mσab.

Proof.
σab = Σ_{k=1}^{n} |a_k − b_k| and Mσab = Σ_{k=1}^{n} m_k |a_k − b_k|, so
σab − Mσab = Σ_{k=1}^{n} |a_k − b_k| − Σ_{k=1}^{n} m_k |a_k − b_k| = Σ_{k=1}^{n} (1 − m_k) |a_k − b_k|
Since 0 < m_k ≤ 1 for all k (k = 1 … n), every term (1 − m_k)|a_k − b_k| is non-negative, and since σab ≠ 0 there is some k with |a_k − b_k| > 0 and m_k < 1, so the sum is positive. Hence σab > Mσab. ∎

This result shows that the distance Mσab between points a and b projected by HOV3 with a non-zero measure M is less than the original distance σab between a and b.

Lemma 2. In Star Coordinates space, if σab ≠ 0 and M ≠ 0 (with 0 < m_k < 1 for all m_k ∈ M), then M^n σab > M^{n+1} σab, n ∈ N.

Proof. Let M^n σab = σ′ab. By Definition 1, M^{n+1} σab = Mσ′ab, and by Lemma 1, σ′ab > Mσ′ab; hence M^n σab > M^{n+1} σab. ∎

In general, it can be proved that in Star Coordinates space, if σab ≠ 0 and M ≠ 0 (with 0 < m_k < 1 for all m_k ∈ M), then M^m σab > M^n σab for m, n ∈ N with m < n.

Theorem 1. If the measure vector is changed from M to M′ (with 0 < m_k, m′_k < 1) and Mσab − Mσac < M′σab − M′σac, then

(M′²σab − M′²σac) / (M′σab − M′σac) < (M′σab − M′σac) / (Mσab − Mσac)

Proof. Let a_k − b_k = x_k and a_k − c_k = y_k. Then
M′σab − M′σac = Σ_{k=1}^{n} m′_k (|a_k − b_k| − |a_k − c_k|) = M′σxy
M′²σab − M′²σac = Σ_{k=1}^{n} m′_k² (|a_k − b_k| − |a_k − c_k|) = M′²σxy
By Lemma 2, M′²σxy < M′σxy, so (M′²σab − M′²σac) / (M′σab − M′σac) = M′²σxy / M′σxy < 1.
By the assumption Mσab − Mσac < M′σab − M′σac, i.e., Mσxy < M′σxy, we have (M′σab − M′σac) / (Mσab − Mσac) = M′σxy / Mσxy > 1.
Combining the two inequalities yields the claimed inequality. ∎

Theorem 1 shows that if the user observes that the difference between the distance of a and b and the distance of a and c increases relatively (which can be observed in the footprints of points a, b and c, as shown in Fig. 4) when tuning the weight values of the axes from M to M′, then after applying M-HOV3 to a, b and c, the variation rate of the distances between the pairs a, b and a, c is enhanced, as presented in Fig. 5.

Fig. 5. The contraction and separation effect of M-HOV3

In other words, if it is observed that several data point groups can be roughly separated visually (there may exist ambiguous points between groups) by projecting a measure vector in HOV3 onto a data set, then applying M-HOV3 with the same measure vector to the data set leads to the groups becoming more condensed, i.e., to a good separation of the groups.

3.3 Predictive Cluster Exploration by M-HOV3

Following the notation of the HOV3 projection of a dataset P as D_p = H_C(P, M), the M-HOV3 projection is denoted as D_p = H_C(P, M^n), where n ∈ N. We use the auto-mpg dataset again as an example to demonstrate predictive cluster exploration by M-HOV3. Fig. 6a illustrates the original data distribution of auto-mpg produced by HOV3 in MATLAB, where it is not possible to recognize any grouping information. Then we tuned each axis manually and roughly distinguished three groups, as shown in Fig. 6b. The weight values of the axes were recorded as a vector M = [0.10, 0, 0.25, 0.2, 0.8, 0.85, 0.1, 0.95]. Fig. 6b shows that there exist several ambiguous data points between the groups. Then we employed M² (the elementwise product M.*M) as a predictive measure vector and applied it to the data set auto-mpg. The projected distribution D_p2 of auto-mpg is presented in Fig. 6c. It is much easier to identify the 3 groups of auto-mpg in Fig. 6c than in Fig. 6b.
To show the contrast between the two distributions D_p1 and D_p2, we overlap them in Fig. 6d. By analyzing the data of these 3 groups, we have found that group 1 contains 70 items with origin value 2 (Europe); group 2 has 79 instances with origin value 3 (Japan); and group 3 includes 249 records with origin value 1 (USA). This natural grouping based on the user's intuition serendipitously clustered the data set according to the origin attribute of auto-mpg. In the same way, the user may find more grouping information through interactive cluster exploration by applying predictive measurements.
Fig. 6. Diagrams of the data set auto-mpg projected by HOV3 in MATLAB: (a) the original data distribution of auto-mpg; (b) D_p1 = H_C(auto-mpg, M); (c) D_p2 = H_C(auto-mpg, M²); (d) the overlapping diagram of D_p1 and D_p2

3.4 Predictive Cluster Exploration by HOV3 with Statistical Measurements

Many statistical measurements, such as the mean, median and standard deviation, can be directly introduced into HOV3 as predictions to explore data distributions. In fact, prediction based on statistical measurements makes cluster exploration more purposeful, and gives an easier geometrical interpretation of the data distribution. We use the Iris dataset as an example. As shown in Fig. 3, by random axis scaling the user can divide the Iris data into 3 groups. This example exhibits that cluster exploration based on random adjustment may expose data grouping information, but sometimes it is hard to interpret such groupings. We employ the standard deviation of Iris, M = [0.2302, 0.1806, 0.2982, 0.3172, 0.4089], as a prediction to project Iris by HOV3 in iVIBRATE. The result is shown in Fig. 7, where 3 groups clearly exist. It can be observed in Fig. 7 that there is a blue point in the pink-colored cluster and a pink point in the green-colored cluster, resulting from the K-means clustering algorithm with k=3. Intuitively, they have been wrongly clustered. We re-clustered them by their distributions, as shown in Fig. 8. The contrast between the clusters (C_K) produced by the K-means clustering algorithm and the new clustering result (C_H) projected by HOV3 is summarized in Table 1. We can see that the
quality of the new clustering result of Iris is better than that obtained by K-means, according to their Variance comparison. Each cluster projected by HOV3 has a higher similarity than that produced by K-means. By analyzing the newly grouped data points of Iris, we have found that they are distinguished by the class attribute of Iris, i.e., Iris-setosa, Iris-versicolor and Iris-virginica. Cluster 1 generated by K-means is an outlier.

A Prediction-Based Visual Approach 345

Fig. 7. Data distribution of Iris projected by HOV3 in iVIBRATE, with the cluster indices marked by K-means
Fig. 8. Data distribution of Iris projected by HOV3 in iVIBRATE, with the new clustering indices given by the user's intuition

Table 1. The statistics of the clusters in Iris produced by HOV3 with a predictive measure

C  %       Radius  Variance  MaxDis    C_H  %       Radius  Variance  MaxDis
1  1.333   1.653   2.338     3.306
2  32.667  5.754   0.153     6.115     1    33.333  5.753   0.152     6.113
3  33.333  8.196   0.215     8.717     2    33.333  8.210   0.207     8.736
4  33.333  7.092   0.198     7.582     3    33.333  7.112   0.180     7.517

With the statistical predictions in HOV3, the user may even expose cluster clues that are not easy to find by random adjustments. For example, we adopted the 8th row of auto-mpg's covariance matrix, (0.04698, -0.07657, -0.06580, 0.00187, -0.05598, 0.01343, 0.02202, 0.16102), as a predictive measure to project auto-mpg by HOV3 in MATLAB. The result is shown in Fig. 9. We grouped the points by their distribution as in Fig. 10. Table 2 (right part) reports the statistics of the clusters generated by the projection of HOV3, and reveals that the points in each cluster have very high similarity. As we chose the 8th row of auto-mpg's covariance matrix as the prediction, the result mainly depends on the 8th column of the auto-mpg data, i.e., origin (country). Fig. 10 shows that C1, C2 and C3 are close to each other because they have the same origin value 1.
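The Radius, Variance and MaxDis columns of these tables can be reproduced along the following lines. The paper does not define the three statistics, so the definitions below are only one plausible guess, labeled as such in the code.

```python
import numpy as np

def cluster_stats(points):
    """Per-cluster summary statistics in the spirit of Tables 1 and 2.
    Hedged guess at the definitions: Radius = mean distance of the
    members to the cluster centroid, Variance = variance of those
    distances, MaxDis = largest pairwise distance (cluster diameter)."""
    pts = np.asarray(points, dtype=float)
    dists = np.linalg.norm(pts - pts.mean(axis=0), axis=1)
    pairwise = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return {"Radius": dists.mean(),
            "Variance": dists.var(),
            "MaxDis": pairwise.max()}
```

Whatever the exact definitions, the intended comparison is the same: a cluster whose members sit tightly around a common centre scores a low Variance, which is the criterion used in the text to rank the two clusterings.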
The more detailed formation of the clusters is given in the right part of Table 2. We believe that a domain expert could give a better and more intuitive explanation of this clustering. Then we chose the number 5 to cluster auto-mpg by K-means. Its clustering result is presented in the left part of Table 2. Comparing their corresponding statistics, we can see that, according to the Variance of the clusters, the quality of the clustering result by
HOV3 with the covariance prediction of auto-mpg is better than the one produced by K-means (k=5; cluster 1 produced by K-means is an outlier).

Fig. 9. Data distribution of auto-mpg projected by HOV3 in MATLAB with the 8th row of auto-mpg's covariance matrix as the prediction
Fig. 10. Clustered distribution of the data in Fig. 9 by the user's intuition

Table 2. The statistical contrast of clusters in auto-mpg produced by K-means and HOV3

Clusters produced by K-means (k=5)          Clusters generated by the user's intuition on the data distribution
C  %       Radius    Variance  MaxDis       Origin  Cylinders  %       Radius    Variance  MaxDis
1  0.503   681.231   963.406   1362.462     1       8          25.879  4129.492  0.130     4129.768
2  18.090  2649.108  0.206     2649.414     1       6          18.583  3222.493  0.098     3222.720
3  16.080  2492.388  0.139     2492.595     1       4          18.090  2441.881  0.090     2442.061
4  21.608  3048.532  0.207     3048.897     2       4          17.588  2427.449  0.142     2427.632
5  25.377  3873.052  0.220     3873.670     3       3          19.849  2225.465  0.093     2225.658
6  18.593  2417.804  0.148     2417.990

3.5 Predictive Cluster Validation by HOV3

In practice, with extremely large datasets, it is infeasible to cluster an entire data set within an acceptable time. A common solution used in data mining is that clustering algorithms are first applied to a training (sampling) subset of the data from a database to extract cluster patterns, and then the cluster scheme is assessed to see whether it is suitable for the other subsets in the database. This procedure is regarded as external cluster validation [21]. Due to the high computational cost of statistical methods for assessing the consistency of cluster structures between large subsets, achieving this goal by statistical methods is still a challenge in data mining.
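The visual alternative described next reduces this comparison to an overlap rate between each cluster and its "quasi-cluster" in a second subset. The following is only a hedged sketch of that idea: the paper selects the region covered by a cluster geometrically on screen, whereas here it is approximated by the cluster's 2D bounding box, and all names are ours.

```python
import numpy as np

def overlap_rate(proj_a, labels_a, proj_b, cluster_id):
    """Hedged sketch of the cluster / quasi-cluster comparison used in
    visual external validation.  Both subsets are projected with the
    same measure vector; the region covered by one cluster of the
    clustered subset A is approximated by its bounding box, and the
    quasi-cluster is the set of subset-B points falling inside it.
    A rate close to 1 suggests a similar cluster structure."""
    members = proj_a[labels_a == cluster_id]
    lo, hi = members.min(axis=0), members.max(axis=0)
    inside = np.all((proj_b >= lo) & (proj_b <= hi), axis=1)
    frac_a = len(members) / len(proj_a)    # share of A in the cluster
    frac_b = inside.sum() / len(proj_b)    # share of B in the same region
    return min(frac_a, frac_b) / max(frac_a, frac_b)
```

Because both subsets are pushed through the same linear projection, this check costs only linear time per subset, which is the point of contrast with the statistical methods above.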
Based on the assumption that if two same-sized data sets have a similar cluster structure, then after applying the same linear transformation to both, the similarity of the newly produced distributions of the two sets will still be high, Zhang et al. proposed a visual external validation approach by HOV3 [23]. Technically, their approach takes a clustered subset and a same-sized unclustered subset from a database as the observation, and applies the measure vectors that can separate clusters in the clustered subset by HOV3. Thus each cluster and the data points it geometrically covers (called a quasi-cluster in their approach) are selected. Finally, the overlapping rate of each
cluster/quasi-cluster pair is calculated; if the overlapping rate approaches 1, this means that the two subsets have a similar cluster distribution. Compared with statistics-based validation methods, their method is not only visually intuitive, but also more effective in real applications [23]. As mentioned above, it is sometimes time consuming to separate clusters manually in Star Coordinates or iVIBRATE. Thus, the separation of clusters from a large number of overlapping points is an aim of this research. As described above, approaches such as M-HOV3 and HOV3 with statistical measurements can be introduced into external cluster validation by HOV3. In principle, any linear transformation can be employed in HOV3 if it separates clusters well.

Fig. 11. The data distribution of auto-mpg projected by HOV3 with cos(m·10i) as the prediction

We therefore introduce a complex linear transformation into this process. We again use the auto-mpg data set as an example. As shown in Fig. 6b, three roughly separated clusters appear there, where the vector M = [0.10, 0, 0.25, 0.2, 0.8, 0.85, 0.1, 0.95] was obtained from the axes values. We then adopt cos(m·10i) as a prediction, where i is the imaginary unit. The projection of HOV3 with cos(m·10i) is illustrated in Fig. 11, where the three clusters are separated very well. In the same way, many other linear transformations can be applied to different datasets to obtain well-separated clusters. With fully separated clusters, there will be a marked improvement in the efficiency of visual cluster validation.

4 Related Work

Visualization is typically employed as an observational mechanism to assist users with intuitive comparisons and a better understanding of the studied data.
Instead of quantitatively contrasting clustering results, most of the visualization techniques employed in cluster analysis focus on providing users with an easy and intuitive understanding of the cluster structure, or on exploring clusters randomly. For instance, Multidimensional Scaling (MDS) [14] and Principal Component Analysis (PCA) [10] are two commonly used multivariate analysis techniques. However, the relatively high computational cost of MDS (polynomial time O(N^2)) limits its usability on very large datasets, and PCA first has to find the correlated variables for reducing the dimensionality, which makes it unsuitable for the exploration of unknown data. OPTICS [1] uses a density-based technique to detect cluster structure and visualizes clusters as Gaussian bumps, but its non-linear time complexity makes it neither suitable for dealing with very large data sets, nor for providing the contrast between clustering results. H-BLOB visualizes clusters as blobs in a 3D hierarchical structure [17]. It is an intuitive cluster rendering technique, but its 3D, two-stage expression restricts it from interactively investigating cluster structures beyond the existing clusters. Kaski et al. [13] use self-organizing maps (SOM) [12] to project high-dimensional data sets onto 2D space for matching visual models. However, the SOM technique is based
on a single projection strategy and is not powerful enough to discover all the interesting features in the original data set. Huang et al. [7, 8] proposed approaches based on FastMap [5] to assist users in identifying and verifying the validity of clusters in visual form. Their techniques work well in cluster identification, but are unable to evaluate cluster quality very well. On the other hand, these techniques are not well suited to the interactive investigation of the data distributions of high-dimensional data sets. A recent survey of visualization techniques in cluster analysis can be found in the literature [18].

5 Conclusions

In this paper, we have proposed a prediction-based visual approach to explore and verify clusters. This approach uses the HOV3 projection technique and quantifies previously obtained knowledge and statistical measurements about a high-dimensional data set as predictions, so that users can utilize the predictions to project the data onto a 2D plane in order to investigate grouping clues or verify the validity of clusters based on the distribution of the data. This approach not only inherits the intuitive and easy-to-understand features of visualization, but also avoids the weaknesses of randomness and arbitrary exploration of the existing visual methods employed in data mining. As a consequence, with the advantage of the quantified predictive measurement of this approach, users can efficiently identify the cluster number in the pre-processing stage of clustering, and can also intuitively verify the validity of clusters in the post-processing stage of clustering.

References

1. Ankerst, M., Breunig, M.M., Kriegel, H.-P., Sander, J.: OPTICS: Ordering points to identify the clustering structure. In: Proc. of ACM SIGMOD Conference, pp. 49-60. ACM Press, New York (1999)
2. Ankerst, M., Keim, D.: Visual Data Mining and Exploration of Large Databases.
In: 5th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'01), Freiburg, Germany (September 2001)
3. Berkhin, P.: A Survey of Clustering Data Mining Techniques. In: Kogan, J., Nicholas, C., Teboulle, M. (eds.) Grouping Multidimensional Data, pp. 25-72. Springer, Heidelberg (2006)
4. Chen, K., Liu, L.: iVIBRATE: Interactive visualization-based framework for clustering large datasets. ACM Transactions on Information Systems (TOIS) 24(2), 245-294 (2006)
5. Faloutsos, C., Lin, K.: FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia data sets. In: Proc. of ACM SIGMOD, pp. 163-174 (1995)
6. Fleming, W.: Functions of Several Variables. In: Gehring, F.W., Halmos, P.R. (eds.), 2nd edn. Springer, Heidelberg (1977)
7. Huang, Z., Cheung, D.W., Ng, M.K.: An Empirical Study on the Visual Cluster Validation Method with FastMap. In: Proc. of DASFAA'01, pp. 84-91 (2001)
8. Huang, Z., Lin, T.: A visual method of cluster validation with FastMap. In: Terano, T., Chen, A.L.P. (eds.) PAKDD 2000. LNCS, vol. 1805, pp. 153-164. Springer, Heidelberg (2000)
9. Jain, A., Murty, M.N., Flynn, P.J.: Data Clustering: A Review. ACM Computing Surveys 31(3), 264-323 (1999)
10. Jolliffe, I.T.: Principal Component Analysis. Springer, Heidelberg (2002)
11. Kandogan, E.: Visualizing multi-dimensional clusters, trends, and outliers using star coordinates. In: Proc. of ACM SIGKDD Conference, pp. 107-116. ACM Press, New York (2001)
12. Kohonen, T.: Self-Organizing Maps, 2nd extended edn. Springer, Berlin (1997)
13. Kaski, S., Sinkkonen, J., Peltonen, J.: Data Visualization and Analysis with Self-Organizing Maps in Learning Metrics. In: Kambayashi, Y., Winiwarter, W., Arikawa, M. (eds.) DaWaK 2001. LNCS, vol. 2114, pp. 162-173. Springer, Heidelberg (2001)
14. Kruskal, J.B., Wish, M.: Multidimensional Scaling. SAGE University Paper Series on Quantitative Applications in the Social Sciences, pp. 7-11. Sage Publications, CA (1978)
15. Oliveira, M.C., Levkowitz, H.: From Visual Data Exploration to Visual Data Mining: A Survey. IEEE Transactions on Visualization and Computer Graphics 9(3), 378-394 (2003)
16. Pampalk, E., Goebl, W., Widmer, G.: Visualizing Changes in the Structure of Data for Exploratory Feature Selection. In: SIGKDD'03, Washington, DC, USA (2003)
17. Sprenger, T.C., Brunella, R., Gross, M.H.: H-BLOB: A Hierarchical Visual Clustering Method Using Implicit Surfaces. In: Proc. of the Conference on Visualization '00, pp. 61-68. IEEE Computer Society Press, Los Alamitos (2000)
18. Seo, J., Shneiderman, B.: From Integrated Publication and Information Systems to Virtual Information and Knowledge Environments. In: Hemmje, M., Niederée, C., Risse, T. (eds.) From Integrated Publication and Information Systems to Information and Knowledge Environments. LNCS, vol. 3379. Springer, Heidelberg (2005)
19. Shneiderman, B.: Inventing Discovery Tools: Combining Information Visualization with Data Mining. In: Jantke, K.P., Shinohara, A. (eds.) DS 2001. LNCS (LNAI), vol. 2226, pp. 17-28.
Springer, Heidelberg (2001)
20. Weiss, S.M., Indurkhya, N.: Predictive Data Mining: A Practical Guide. Morgan Kaufmann Publishers, San Francisco (1998)
21. Vilalta, R., Stepinski, T., Achari, M.: An Efficient Approach to External Cluster Assessment with an Application to Martian Topography. Technical Report UH-CS-05-08, Department of Computer Science, University of Houston (2005)
22. Zhang, K.-B., Orgun, M.A., Zhang, K.: HOV3: An Approach for Cluster Analysis. In: Li, X., Zaïane, O.R., Li, Z. (eds.) ADMA 2006. LNCS (LNAI), vol. 4093, pp. 317-328. Springer, Heidelberg (2006)
23. Zhang, K.-B., Orgun, M.A., Zhang, K.: A Visual Approach for External Cluster Validation. In: Proc. of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007), Honolulu, Hawaii, USA, April 1-5, 2007, pp. 576-582. IEEE Press, Los Alamitos (2007)
Predictive Hypothesis Oriented Cluster Analysis by Visualization

Ke-Bing Zhang 1, Mehmet A. Orgun 1, Kang Zhang 2
1 Department of Computing, ICS, Macquarie University, NSW 2109, Australia {kebing, mehmet}@ics.mq.edu.au
2 Department of Computer Science, University of Texas at Dallas, TX 75083-0688, USA zhang@utdallas.edu

Abstract

Clustering is a widely applied technique in data mining and many clustering algorithms have been developed for real-world applications. However, when dealing with arbitrarily shaped cluster distributions, most existing automated clustering algorithms suffer in terms of efficiency, and they are sometimes not suitable for clustering extremely large and high-dimensional datasets. On the other hand, the high computational cost of statistics-based cluster validation methods is another obstacle to the application of cluster analysis in data mining. As a remedy, visualization techniques have been introduced into cluster analysis and they have been very helpful for the analysis of high-dimensional data. However, most visualization techniques employed in cluster analysis are mainly used as tools for information rendering, rather than for investigating how data behavior changes with the variations of the parameters of the algorithms. In addition, the impreciseness of visualization limits its usability in contrasting grouping information of data precisely. This paper presents a visual technique called HOV3 (Hypothesis Oriented Verification and Validation by Visualization) to map high-dimensional data onto 2D space with quantified measurements. Therefore, users can quantify domain knowledge or historical patterns about datasets as predictions to detect clusters and verify clustering results effectively by HOV3.

1 Introduction

Predictive knowledge discovery is regarded as the procedure of using existing knowledge to deduce, reason about and establish predictions, and to verify the validity of those predictions.
Through validation processing, the knowledge may be revised and enriched with new knowledge [Wei98]. The methodology of predictive knowledge discovery is also used in the clustering process [Ber06]. Cluster analysis is a very important knowledge mining method for large-scale data. It is widely applied in data mining, in areas ranging from image processing, marketing, customer behavior analysis, business trend prediction and bioinformatics to geology, and so on. Cluster analysis includes two major aspects: clustering and cluster validation. Clustering aims at identifying objects and grouping them according to given criteria. Each group of objects is called a cluster, where the similarity of objects is high within clusters and low between clusters. To achieve different application purposes, a large number of clustering algorithms have been developed [JMF99, Ber06]. However, there are no general-purpose clustering algorithms that fit all kinds of applications; thus, the evaluation of the quality of clustering results takes the critical role in cluster analysis, i.e., cluster
validation, which aims to assess the quality of clustering results and to find a cluster scheme that fits a specific application. In practice, cluster analysis is not always successfully applied to databases in data mining. This is because most of the existing automated clustering algorithms do not deal with arbitrarily shaped data distributions very well, and statistics-based cluster validation methods incur a very high computational cost in cluster analysis. The user's initial estimation of the cluster number is very important for choosing the parameters of clustering algorithms in the pre-processing stage of clustering. Also, the user's clear understanding of the cluster distribution is helpful for assessing the quality of clustering results in the post-processing stage of clustering. All these issues rely heavily on the user's visual perception of the data distribution. Clearly, visualization is a crucial aspect of cluster exploration and verification in cluster analysis. Visual presentations can be very powerful in revealing trends, highlighting outliers, showing clusters, and exposing gaps in data [Shn01]. Therefore, the introduction of visualization techniques to explore and understand high-dimensional datasets is becoming an efficient way to combine human intelligence with the immense brute-force computation power available nowadays [PGW03]. The visualization methods utilized in cluster analysis map high-dimensional data to 2D or 3D space and provide users with intuitive and easily understood graphs and/or images that reveal the grouping relationships among the data. Visual cluster analysis is a combination of visualization and cluster analysis. As an indispensable exploration technique, visualization is involved in almost every step of cluster analysis. Clustering algorithms normally deal with data sets in high dimensions (>3D).
Thus, the choice of a technique fit for visualizing clusters of high-dimensional data is the first task of visual cluster analysis. Many research efforts have been made on multidimensional data visualization [WoB94], but those earlier techniques are not suitable for visualizing cluster structures in very high-dimensional and very large datasets. With the increasing application of clustering in data mining over the last decade, more and more visualization techniques have been developed to study the structure of datasets in the applications of cluster analysis [OlL03, Shn05]. However, in practice, those visualization techniques tend to treat the problem of cluster visualization simply as a layout problem. They mainly focus on rendering the cluster structure, rather than on investigating how data behavior changes with the variation of the parameters of the algorithms used. There have been many research efforts on visual cluster analysis [ABK+99, ChL04, HCN01, HuL00, Kan01, KSP01, HKW99, SBG00]. They normally facilitate an arbitrary exploration of grouping information, and this causes them to be inefficient and time consuming in the cluster exploration stage. On the other hand, the impreciseness of visualization limits its usability in the quantitative verification and validation of clustering results. Thus the motivation of our work is to develop a visualization technique that supports more purposeful cluster detection and contrasts data distributions more precisely, to facilitate researchers in cluster analysis. As a solution to the above problems, in this paper we propose a novel visual projection technique, Hypothesis Oriented Verification and Validation by Visualization (HOV3), which projects high-dimensional datasets onto 2D space by the user's quantified measurements [ZOZ06].
Based on the quantified measurement feature of HOV3, we also present a distribution-matching based visual external cluster validation model to verify the consistency of cluster structures between a clustered subset and non-clustered subsets [ZOZ07a]. To deal with overlapping clusters, in this paper we introduce a visual approach called M-HOV3 to enhance the visual separation of clusters [ZOZ07b]. With the enhanced separation feature of M-HOV3, the user can not only separate overlapped clusters efficiently in the post-processing stage of clustering, but can also obtain more cluster clues effectively in the pre-processing stage of clustering.
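A minimal sketch of how we read M-HOV3 follows: the measure vector M is applied more than once (e.g. using M squared element-wise as the axis weights, as in the projection H_C(auto-mpg, M^2) shown earlier in this thesis), which stretches the differences between the axis weights and can pull overlapped clusters apart. The function names and the normalization details are ours, not the paper's.

```python
import numpy as np

def m_hov3(data, measure, times=2):
    """Hedged sketch of M-HOV3: the measure vector is applied
    element-wise `times` times (weights = M ** times), then the records
    are projected as in HOV3 with axes evenly spaced on the unit circle
    of the complex plane."""
    data = np.asarray(data, dtype=float)
    weights = np.asarray(measure, dtype=float) ** times
    n = data.shape[1]
    mins, maxs = data.min(axis=0), data.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)  # guard constant columns
    norm = (data - mins) / span                     # normalize to [0, 1]
    axes = np.exp(2j * np.pi * np.arange(n) / n)    # evenly spaced complex axes
    z = norm @ (axes * weights)                     # weighted complex sum
    return np.column_stack([z.real, z.imag])
```

Under this reading, `times=1` degenerates to the plain HOV3 projection, so the enhancement is a strictly incremental change to the projection machinery.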
The rest of this paper is organized as follows. Section 2 briefly introduces the current issues of cluster analysis and the visual techniques that have been employed in cluster analysis. A review of related work on cluster analysis by visualization is presented in Section 3. Section 4 describes our Hypothesis Oriented Verification and Validation by Visualization (HOV3) model. Section 5 demonstrates the use of HOV3 to achieve purposeful cluster exploration with given quantified measurements as predictions, and to verify the consistency of the cluster structure by distribution matching based on the quantified measurements of HOV3. Section 6 focuses on external cluster validation by HOV3 on several well-known data sets. Finally, Section 7 summarizes the contributions of this paper.

2 Background

Cluster analysis is an iterative process of clustering and cluster verification by the user, facilitated by clustering algorithms, cluster validation methods, visualization and domain knowledge of the databases.

2.1 The Issue of Clustering

Clustering takes the responsibility of assigning the objects of the studied data into groups based on a given distinguishing strategy. Hundreds of clustering algorithms have been developed to deal with data sets in different real-world applications [JMF99, Ber06]. The existing clustering algorithms are not always successfully applied to very large databases, because they perform well when clustering spherical or regularly shaped datasets, but are not very effective in dealing with arbitrarily shaped clusters. Several research efforts have been made to deal with datasets with arbitrarily shaped cluster distributions [ZRL96, EKS+96, SCZ98, GRS98, ABK+99]. However, those approaches still have some drawbacks in handling irregularly shaped clusters. For example, CURE [GRS98] and BIRCH [ZRL96] perform well on low-dimensional datasets; however, as the dimensionality of the data increases, they encounter high computational complexity.
Other approaches, such as the density-based clustering techniques DBSCAN [EKS+96] and OPTICS [ABK+99] and wavelet-based clustering techniques such as WaveCluster [SCZ98], attempt to cope with this problem, but their non-linear complexity often makes them unsuitable for the analysis of very large datasets. As Abul et al. pointed out, "In high dimensional spaces, traditional clustering algorithms tend to break down in terms of efficiency as well as accuracy because data do not cluster well anymore" [AAP+03]. A recent survey of clustering algorithms can be found in the literature [JMF99, Ber06].

2.2 The Issue of Cluster Validation

The selection of a cluster scheme, from hundreds of clustering algorithms with variable parameters, that fits a specific application is hard. Different clustering results may be obtained by applying different clustering algorithms to the same data set, or even by applying the same clustering algorithm with different parameters to the same data set. However, the very high computational cost of statistics-based cluster validation methods directly impacts the efficiency of cluster validation, since cluster validation is the procedure of comparing previously produced cluster patterns with newly produced cluster patterns to evaluate the genuine cluster structure of a data set. In general, the methods of cluster validation are classified into the following three categories [HBV01, JMF99, ThK99]: (1) Internal approaches: they assess the clustering results by applying an algorithm with different parameters on a data set and finding the optimal solution [AAP+03]; (2) Relative approaches: the idea of relative assessment is based on the evaluation of a clustering structure by comparing it to other clustering schemes [HaK01]; and (3) External approaches: the external assessment of clustering is based
on the idea that there exist known a priori cluster indices produced by a clustering algorithm, and then assessing the consistency of the clustering structures generated by applying the clustering algorithm to different data sets [HKK05].

2.3 External Cluster Validation

External cluster validation is a procedure of hypothesis testing, i.e., given a set of class labels produced by a cluster scheme, it is compared with the clustering results obtained by applying the same cluster scheme to the other partitions of a database, as shown in Figure 1.

Figure 1. External cluster validation by statistics-based methods

Statistical methods for quality assessment are employed in external cluster validation, such as the Rand statistic [Ran71], the Jaccard coefficient [Jac08], the Fowlkes and Mallows index [MSS83], Hubert's Γ statistic and the normalized Γ statistic [ThK99], and the Monte Carlo method [Mil81], to measure the similarity between the a priori modeled partitions and the clustering results of a dataset. Recent surveys on cluster validation methods can be found in the literature [HBV02, HKK05, ThK99].

3 Related Work

This section discusses related work on visualization techniques and tools that have been proposed for cluster representation and analysis. Of particular interest is the Star Coordinates technique proposed by Kandogan [Kan01] that inspired HOV3.

3.1 Visual Cluster Representation

There have been many studies on multidimensional data visualization. However, most of the proposed techniques do not visualize the cluster structure very well for high-dimensional or very large databases [OlL03]. For example, icon-based methods [Pic70, Che73, KeK94] can display high-dimensional properties of data. However, as the amount of data increases substantially, the user may find it hard to understand most properties of the data intuitively, since the user cannot focus on the details of each icon.
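As an aside on the external indices named in Section 2.3: the Rand statistic and the Jaccard coefficient both reduce to agreement counts over all point pairs, which also makes their quadratic cost on large subsets concrete. A minimal sketch (function names are ours):

```python
from itertools import combinations

def pair_counts(labels_a, labels_b):
    """Count pair agreements between two clusterings: ss = same cluster
    in both, sd / ds = same in one and different in the other,
    dd = different in both."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            ss += 1
        elif same_a:
            sd += 1
        elif same_b:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd

def rand_statistic(labels_a, labels_b):
    # fraction of pairs on which the two clusterings agree
    ss, sd, ds, dd = pair_counts(labels_a, labels_b)
    return (ss + dd) / (ss + sd + ds + dd)

def jaccard_coefficient(labels_a, labels_b):
    # like Rand, but ignoring pairs that are separated in both clusterings
    ss, sd, ds, dd = pair_counts(labels_a, labels_b)
    return ss / (ss + sd + ds)
```

The loop over all pairs is O(m^2) in the number of records, which is precisely the computational burden that motivates the visual alternative developed in this paper.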
Plot-based data visualization approaches, such as Scatterplot Matrices [Cle93] and similar techniques [AlC91, CBC+5], visualize data in rows and columns of cells containing simple graphical depictions. This kind of technique gives visual information about pairs of attributes, but does not give a good overview of the whole dataset, and such techniques are simply not able to present clusters in the dataset very well. Parallel Coordinates [Ins97] utilizes equidistant parallel axes to visualize each attribute of a given
dataset and projects multiple dimensions onto a two-dimensional surface. Star Plots [Fie79] arranges the coordinate axes on a circular space, with equal angles between neighbouring axes radiating from the centre of a circle, and links a data point's values on the axes by lines to form a star. In principle, those techniques can provide visual presentations of any number of attributes. However, neither Parallel Coordinates nor Star Plots is adequate to give the user a clear overall insight into the data distribution when the dataset is huge, primarily due to the unavoidably high overlap among points. Another drawback of these two techniques is that, while they can supply a more intuitive visual relationship between neighbouring axes, for non-neighbouring axes the visual presentation may confuse the user's perception.

3.2 Visual Cluster Analysis

A large number of clustering algorithms have been developed, but only a small number of cluster visualization tools are available to facilitate researchers' understanding of clustering results [SeS05]. Several research efforts have been made in the area of visual cluster analysis [ABK+99, ChL04, HCN01, HuL00, Kan01, KSP01, HKW99, SBG00]. While these techniques help users make intuitive comparisons and understand cluster structures better, they do not focus on the assessment of the quality of clusters. For example, OPTICS [ABK+99] uses a density-based technique to detect cluster structures and visualizes them as Gaussian bumps, but its non-linear time complexity makes it neither suitable for dealing with very large data sets, nor for providing the contrast between clustering results. Kaski et al. [KSP01] employ the self-organizing map (SOM) technique [Koh97] to project multidimensional data sets onto 2D space for matching visual models. However, the SOM technique is based on a single projection strategy and is not powerful enough to discover all the interesting features from the original data.
H-BLOB visualizes clusters as blobs in a 3D hierarchical structure [SBG00]. It is an intuitive cluster rendering technique, but the 3D, two-stage expression of H-BLOB limits it in the interactive investigation of cluster structures. Huang et al. [HCN01, HuL00] proposed approaches based on FastMap [FaL95] to assist users in identifying and verifying the validity of clusters in visual form. Their techniques are good at cluster identification, but are not able to deal with the evaluation of cluster quality very well. HD-Eye [HKW99] is an interactive visual clustering system based on density plots of any two interesting dimensions, but it lacks the ability to help the user understand inter-cluster relationships. To verify the validity of clustering results by visualization, VISTA adopts landmark points as representatives of a clustered subset and re-samples them to deal with cluster validation [ChL04]. However, its experience-based landmark point selection does not always handle the scalability of data very well, because the representative landmark points selected in one subset may fail in other subsets of a database. Star Coordinates is very suitable for visual cluster analysis with its interactive adjustment features [Kan01]. Since the starting point of the approach reported in this paper is the Star Coordinates technique, we describe it in more detail next. A survey of other work on visual cluster analysis can be found in the literature [SeS05].

3.3 Star Coordinates
The idea of the Star Coordinates technique is intuitive: it extends the perspective of the traditional orthogonal 2D X-Y and 3D X-Y-Z coordinate techniques to a higher-dimensional space [Kan01]. Star Coordinates plots a 2D plane divided into n equal sectors by n coordinate axes, where each axis represents a dimension and all axes share their initial points at the centre of a circle on the 2D space. First, the data in each dimension are normalized into the [0, 1] or [-1, 1] interval. Then the values on all axes are mapped to orthogonal X-Y coordinates, which share the initial point with Star Coordinates on the 2D space. Thus, an n-dimensional data item is expressed as a point on the X-Y 2D plane. Figure 2 illustrates the mapping from 8 Star Coordinates axes to X-Y coordinates. Formula (1) states the mathematical description of Star Coordinates:

Figure 2. Positioning a point by an 8-attribute vector in Star Coordinates [Kan01]

p_j(x, y) = ( Σ_{i=1..n} u_xi · (d_ji − min_i),  Σ_{i=1..n} u_yi · (d_ji − min_i) )    (1)

where p_j(x, y) is the normalized location of D_j = (d_j1, d_j2, ..., d_jn), and d_ji is the value of the jth record of the data set on the ith coordinate axis C_i in Star Coordinates space; u_xi·(d_ji − min_i) and u_yi·(d_ji − min_i) are the mappings of d_ji onto the X and Y directions by the unit vectors u_xi and u_yi of axis C_i; min_i = min(d_ji, 0 ≤ j < m) and max_i = max(d_ji, 0 ≤ j < m) are the minimum and maximum values of the ith dimension respectively; and m is the number of records in the data set. In practice, mapping high-dimensional data to 2D space inevitably introduces overlapping and ambiguity, and even bias. To mitigate this problem, Star Coordinates and its extension VISTA [ChL04] provide several visual adjustment mechanisms, such as axis scaling, rotation of axis angles, and filtering of data points, to vary the data distribution of a dataset in order to detect cluster characteristics and render clustering results effectively. We briefly introduce the two adjustment features most relevant to this research below.
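Formula (1) translates directly into code. This is a sketch under the assumptions stated in the text: each dimension is normalized into [0, 1], and the n axes are placed at equal angles around the circle.

```python
import numpy as np

def star_coordinates(data):
    """Formula (1) in code: after min-max normalization, record j is
    mapped to the sum over all n axes of its ith normalized value times
    the unit vector of axis i, with axis i at angle 2*pi*i/n."""
    data = np.asarray(data, dtype=float)
    n = data.shape[1]
    angles = 2 * np.pi * np.arange(n) / n
    u_x, u_y = np.cos(angles), np.sin(angles)       # unit vectors per axis
    mins, maxs = data.min(axis=0), data.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)  # guard constant columns
    norm = (data - mins) / span                     # each dimension into [0, 1]
    return np.column_stack([norm @ u_x, norm @ u_y])
```

The mapping is a single matrix product per coordinate, which is where the linear-time complexity claimed for Star Coordinates and its extensions comes from.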
Axis scaling The purpose of axis scaling in Star Coordinates (called α-adjustment in VISTA) is to interactively adjust the weight value of each axis so that users can observe the change in the data distribution dynamically. For example, the diagram in Figure 3 shows the original data distribution of Iris (Iris has 4 numeric attributes and 150 instances) with the cluster indices obtained by applying the K-means (here k=3) clustering algorithm in VISTA, where clusters overlap. A well-separated cluster distribution of Iris is illustrated in Figure 4, produced by a series of random α-adjustments, where clusters are much easier to recognize than in the original distribution in Figure 3.
Figure 3. The initial data distribution of clusters of Iris produced by K-means in VISTA

Figure 4. The separated version of the Iris data distribution in VISTA

Footprint For tracing how data points change over a certain period of time, the footprint function is provided by Star Coordinates. We use another data set, auto-mpg, to demonstrate the footprint feature. The data set auto-mpg has 8 attributes and 398 items. Figure 5 shows the footprints of axis tuning of the attributes weight and mpg, where we may find some points with longer traces, and some with shorter ones.

Figure 5. Footprints of axis scaling in Star Coordinates [Kan01]

The most prominent feature of Star Coordinates and its extensions such as VISTA [ChL04] and HOV 3 [ZOZ06] is that their computational complexity is only linear in time. This makes them very suitable to be employed as visual tools for interactive interpretation and exploration in cluster analysis. However, the exploration and refinement of clusters based on the user's intuition may be random and subjective in visual cluster analysis, and as a result, the adjustments of Star Coordinates and VISTA can sometimes be arbitrary and time consuming. To overcome the arbitrary and random adjustments of Star Coordinates and VISTA, we have proposed a hypothesis-oriented visual approach called HOV 3 to detect clusters [ZOZ06]. We present a detailed description of the HOV 3 model in the next section.
4 HOV 3 Model

Cluster exploration (qualitative analysis) is regarded as the pre-processing for cluster validation (quantitative analysis), and is mainly used for building user hypotheses/predictions based on the exploration. This is not an aimless and/or arbitrary process. Having a precise overview of the data distribution in the early stages of data mining is important, because, with correct insights into the data, data miners can make more informed decisions on adopting appropriate algorithms for the forthcoming analysis stages. To fill the gap between imprecise visual cluster analysis and unintuitive numerical cluster analysis, we have proposed a new approach, called HOV 3, Hypothesis Oriented Verification and Validation by Visualization [ZOZ06].

4.1 The Basic Idea of HOV 3

When we discuss the measurement of an object, first we must provide a coordinate system for the discussion. For example, without another object to contrast against, the user cannot have any idea about the size of the object. Based on the same principle, the idea of HOV 3 is concerned with obtaining cluster clues by contrasting a data set against quantified measurements, rather than with the random adjustments of Star Coordinates and VISTA. In analytic geometry, the difference of two vectors A = (a_1, a_2, ..., a_n) and B = (b_1, b_2, ..., b_n) can be represented by their inner/dot product. We use the notation <A, B> for their inner product, given as:

<A, B> = a_1 b_1 + a_2 b_2 + ... + a_n b_n = \sum_{k=1}^{n} a_k b_k  (2)

Then we have the equation cos(θ) = <A, B> / (|A| |B|), where θ is the angle between A and B, and |A| and |B| are the lengths of A and B respectively, given as |A| = \sqrt{a_1^2 + a_2^2 + ... + a_n^2} and |B| = \sqrt{b_1^2 + b_2^2 + ... + b_n^2}. Let A be a unit vector; the geometry of <A, B> in Polar Coordinates presents the gap from point B (|B|, θ) to point A, as shown in Figure 6, where A and B are in 8-dimensional space. 
In the same way, a matrix, i.e., a set of vectors (a dataset), can also be mapped against a measure vector M; as a result, the distribution of the matrix is projected based on the vector M. Let D_j = (d_{j1}, d_{j2}, ..., d_{jn}) and M = (m_1, m_2, ..., m_n); then the inner product of each vector D_j of the dataset with M has the same form as equation (2) and is written as:

<D_j, M> = m_1 d_{j1} + m_2 d_{j2} + ... + m_n d_{jn} = \sum_{k=1}^{n} m_k d_{jk}  (3)

Figure 6. Vector B projected against vector A in Polar Coordinates.

4.2 The Mathematical Description of HOV 3

The Star Coordinates model can in fact be mathematically depicted by the Euler formula. According to the Euler formula, e^{ix} = cos x + i sin x, where z = x + i·y and i is the imaginary unit. Let z_0 = e^{2πi/n}; then z_0^1, z_0^2, z_0^3, ..., z_0^{n-1}, z_0^n (with z_0^n = 1) divide the unit circle on the complex 2D plane into n equal sectors. Thus, the Star Coordinates model can be simply written as:
P_j(z) = \sum_{k=1}^{n} z_0^k \, (d_{jk} - \min d_k) / (\max d_k - \min d_k)  (4)

where min d_k and max d_k represent the minimal and maximal values of the kth coordinate respectively. Then, in any case, equation (4) can be viewed as a mapping from R^n to R^2. HOV 3 uses a measure vector M to represent the corresponding axes' weight values. Given a non-zero measure vector M in R^n and a family of vectors P_j, the HOV 3 model is presented as:

P_j(z) = \sum_{k=1}^{n} z_0^k \, m_k \, (d_{jk} - \min d_k) / (\max d_k - \min d_k)  (5)

where m_k is the kth attribute of the measure M. Comparing the model of Star Coordinates in equation (4) and the HOV 3 model in equation (5), we may observe that the HOV 3 model subsumes the Star Coordinates model. This is because any axis scaling or axis angle rotation in the Star Coordinates model or in VISTA can be viewed as changing one or more of the coefficient values m_k (k = 1, ..., n) in equation (5). For example, either moving a coordinate axis to its opposite direction or scaling up the adjustment interval of an axis from [0, 1] to [-1, 1] in VISTA can be regarded as negating the original measure value. As a special case, when all m_k (k = 1, ..., n) in M are set to 1, HOV 3 is transformed into the Star Coordinates model (4), i.e., the no-measure case. Thus the HOV 3 model provides users with a mechanism to quantify domain knowledge about a data set as a measure vector (prediction) for precisely investigating cluster clues. Note that equation (5) is a standard form of a linear transformation of n variables, where m_k is the coefficient of the kth variable of P_j. In principle, any measure vector, even in complex number form, can be introduced into the linear transformation of HOV 3 if it can partition a data set into groups or produce well-separated clusters visually. Thus the rich statistical methods reflecting the characteristics of a data set can also be introduced as predictions in the HOV 3 projection, such that users may discover more clustering patterns. 
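Equation (5) translates almost directly into code. The sketch below is a Python/NumPy illustration under our own assumptions, not the thesis implementation; it builds the complex axes z_0^k and weights each normalized attribute by the measure vector, reducing to the Star Coordinates model (4) when M is all ones:

```python
import numpy as np

def hov3(data, measure):
    """Sketch of the HOV 3 model (equation (5)): axis k is the complex
    unit z0**k with z0 = exp(2j*pi/n); each min-max normalized attribute
    is weighted by m_k before summing into one complex point per record."""
    data = np.asarray(data, dtype=float)
    measure = np.asarray(measure, dtype=float)
    mins, maxs = data.min(axis=0), data.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)   # guard constant columns
    normed = (data - mins) / span
    n = data.shape[1]
    z0 = np.exp(2j * np.pi / n)
    axes = z0 ** np.arange(1, n + 1)      # z0^1 ... z0^n (with z0^n = 1)
    return (normed * measure) @ axes      # one complex 2D point per record

# Hypothetical 4-dimensional records, used only for illustration.
data = [[1, 2, 3, 4], [4, 3, 2, 1], [2, 2, 4, 3]]
p_star = hov3(data, [1, 1, 1, 1])          # no-measure case = Star Coordinates
p_half = hov3(data, [0.5, 0.5, 0.5, 0.5])  # uniform scaling of every axis
```

Because the model is linear in the measure, a uniform measure merely rescales the whole plot, while a non-uniform measure reshapes it.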
The detailed explanation of this approach is presented in the next section.

5 Predictive Visual Cluster Exploration by HOV 3

Predictive exploration is a mathematical description of future behavior based on the historical exploration of patterns. The goal of predictive visual exploration by HOV 3 is that, by applying a prediction (measure vector) to a dataset, the user may identify groups from the resulting visualization. Thus the key issue in applying HOV 3 to detect grouping information is how to quantify historical patterns (or the user's domain knowledge) as a measure vector to achieve this goal.

5.1 Multiple HOV 3 Projection (M-HOV 3 )

In practice, it is not easy to synthesize historical knowledge about a data set into one vector. So, rather than using a single measure to implement a prediction test, it is more suitable to apply several predictions (measure vectors) together to the data set. We call this process multiple HOV 3 projection, M-HOV 3 in short [ZOZ07b]. Now we provide a detailed description of M-HOV 3 and its feature of enhanced group separation. To simplify the discussion of the M-HOV 3 model, we give two definitions first.

Definition 1 (HOV 3 projection) A data projection from n-dimensional space to the 2D plane by applying HOV 3 to a data set P, as shown in formula (5), is denoted as D_p = C(P, M), where P is an n-dimensional data set, P = (p_1, p_2, ..., p_m), p_j (1 <= j <= m) is an instance of P, and m is the size of the data set P;
M = (m_{1t}, m_{2t}, ..., m_{nt}) is a non-zero measure vector, where m_{kt} (1 <= k <= n) is the weight value of the kth coordinate at moment t in the Star Coordinates plane; D_p is the geometrical distribution of P in 2D space, D_p = (p_1(x_1, y_1), p_2(x_2, y_2), ..., p_j(x_j, y_j), ..., p_m(x_m, y_m)), where p_j(x_j, y_j) is the location of p_j in the X-Y Coordinates plane.

Definition 2 (poly-multiply vectors to a matrix) The inner product of multiplying a series of non-zero measure vectors M_1, M_2, ..., M_s with a matrix A is denoted as A * \prod_{i=1}^{s} M_i = A · M_1 · M_2 · ... · M_s.

Then the projection of M-HOV 3 is denoted as D_p = C(P, \prod_{i=1}^{s} M_i). Based on equation (5), we formulate M-HOV 3 as:

P_j(z) = \sum_{k=1}^{n} z_0^k \left( \prod_{i=1}^{s} m_{ki} \right) (d_{jk} - \min d_k) / (\max d_k - \min d_k)  (6)

where m_{ki} is the kth attribute (dimension) of the ith measure vector M_i, and s >= 1. When s = 1, formula (6) is transformed into equation (5). We may observe that the single multiplication by m_k in equation (5) is replaced by the poly-multiplication \prod_{i=1}^{s} m_{ki} in formula (6). Formula (6) is more general and also closer to the real procedure of cluster detection, as it introduces several aspects of domain knowledge together into the cluster detection process. In addition, applying M-HOV 3 to datasets with the same measure vector can enhance the separation of grouped data points under certain conditions. We give the mathematical proof below.

5.2 The Enhanced Separation Feature of M-HOV 3

To explain the geometrical meaning of the M-HOV 3 projection, we use the real number system. According to equation (5), the general form of the distance (i.e., the weighted Minkowski distance) between two points a and b in the HOV 3 plane can be represented as:

σ(a, b, M) = \left( \sum_{k=1}^{n} | m_k (a_k - b_k) |^q \right)^{1/q}  (q > 0)  (7)

If q = 1, σ is the Manhattan (city block) distance; and if q = 2, σ is the Euclidean distance. To simplify the discussion of our idea, we adopt the Manhattan metric for the explanation. 
Note that there exists an equivalent mapping (bijection) of distance calculation between the Manhattan and Euclidean metrics [Fle77]. For example, if the distance between points a and b is longer than the distance between points a' and b' in the Manhattan metric, it is also longer in the Euclidean metric, and vice versa. As shown in Figure 7, the orthogonal lines represent the Manhattan distance and the diagonal lines the Euclidean distance (red for a'b' and blue for ab) respectively. The Manhattan distance between points a and b is calculated as in formula (8):

σ(a, b, M) = \sum_{k=1}^{n} | m_k (a_k - b_k) |  (8)

According to formulas (6), (7) and (8), we can present the distance of M-HOV 3 in the Manhattan metric as follows:

σ(a, b, \prod_{i=1}^{s} M_i) = \sum_{k=1}^{n} \left| \prod_{i=1}^{s} m_{ki} (a_k - b_k) \right|  (9)

Figure 7. The distance representation in Manhattan and Euclidean metrics.
Definition 3 (the distance representation of M-HOV 3 ) The distance between two data points a and b projected by M-HOV 3 is denoted as σ_{\prod_{i=1}^{s} M_i}(ab). In particular, if the measure vectors in an M-HOV 3 are the same, σ_{\prod_{i=1}^{s} M_i}(ab) can be simply written as M^s σab; if each attribute of M is 1 (the no-measure case), the distance between points a and b is denoted as σab.

Thus we have σ_{\prod_{i=1}^{s} M_i}(ab) = C((a, b), \prod_{i=1}^{s} M_i). For example, the distance between two points a and b projected by M-HOV 3 with the same two measures can be represented as M^2 σab, and the projection of HOV 3 of a and b can be written as M σab. We now give several important properties of M-HOV 3 as follows.

Lemma 1 In the Star Coordinates space, if σab ≠ 0 and M ≠ 0 (∀m_k ∈ M: 0 < |m_k| < 1), then σab > M σab.

Proof:
σab = \sum_{k=1}^{n} |a_k - b_k| and M σab = \sum_{k=1}^{n} |m_k (a_k - b_k)|
σab - M σab = \sum_{k=1}^{n} |a_k - b_k| (1 - |m_k|)
Since 0 < |m_k| < 1 (k = 1, ..., n), each factor (1 - |m_k|) > 0; and since σab ≠ 0, it follows that σab > M σab. ∎

This result shows that the distance M σab between points a and b projected by HOV 3 with such a non-zero M is less than the original distance σab between a and b.

Lemma 2 In the Star Coordinates space, if σab ≠ 0 and M ≠ 0 (∀m_k ∈ M: 0 < |m_k| < 1), then M^n σab > M^{n+1} σab, n ∈ N.

Proof:
Let M^n σab = σa'b'. By Definition 3, M^{n+1} σab = M σa'b'. By Lemma 1, σa'b' > M σa'b', hence M^n σab > M^{n+1} σab. ∎

Lemma 3 In the Star Coordinates space, if σab ≠ 0 and M ≠ 0 (∀m_k ∈ M: 0 < |m_k| < 1), then M^m σab > M^n σab for m, n ∈ N with m < n.

Proof: By Lemma 2 and the transitivity of inequality. ∎

Theorem 1 If the measure vector is changed from M to M' (m_k' = m_k + t_k, 0 < |m_k + t_k| < 1) and |M σab - M σac| < |M' σab - M' σac|, then

1 - (M'^2 σab - M'^2 σac) / (M' σab - M' σac) > 1 - (M' σab - M' σac) / (M σab - M σac)

Proof:
M' σab = \sum_{k=1}^{n} |m_k' (a_k - b_k)| and M' σac = \sum_{k=1}^{n} |m_k' (a_k - c_k)|
Let x_k = |a_k - b_k| and y_k = |a_k - c_k|; then
M' σab - M' σac = \sum_{k=1}^{n} |m_k'| (x_k - y_k) = M' σxy
M'^2 σab - M'^2 σac = \sum_{k=1}^{n} |m_k'|^2 (x_k - y_k) = M'^2 σxy
By Lemma 2, M'^2 σxy < M' σxy, so (M'^2 σab - M'^2 σac) / (M' σab - M' σac) = M'^2 σxy / M' σxy < 1.
By the hypothesis, M σxy = M σab - M σac < M' σab - M' σac = M' σxy, so (M' σab - M' σac) / (M σab - M σac) = M' σxy / M σxy > 1.
Therefore 1 - (M'^2 σab - M'^2 σac) / (M' σab - M' σac) > 0 > 1 - (M' σab - M' σac) / (M σab - M σac). ∎

Theorem 1 shows that if the user observes that the difference between the distance from a to b and the distance from a to c is increased by tuning the weight values of the axes from M to M' (which can be observed by the footprints of points a, b and c, as shown in Figure 5), then after applying M-HOV 3 to a, b and c, the distance variation rate between the pairs of points (a, b) and (a, c) is enhanced. In other words, if it is observed that several data point groups can be roughly separated visually by projecting a measure vector in HOV 3 onto a data set (there may exist ambiguous points between groups), then applying M-HOV 3 with the same measure vector to the data set would lead to the groups being more compact, i.e., to a better separation of the groups. This enhanced separation feature of M-HOV 3 is significant for identifying the membership formation of clusters during the exploration of clusters and for verifying the validity of the clustering structure in unclustered subsets [ZOZ07b]. We present several examples to demonstrate the efficiency and the effectiveness of M-HOV 3 in cluster analysis below.

5.3 Predictive Cluster Exploration by M-HOV 3

According to the notation of the HOV 3 projection of a dataset as D_p = C(P, M), the M-HOV 3 model is denoted as D_p = C(P, M^n), where n ∈ N. We use the auto-mpg dataset again as an example to demonstrate predictive cluster exploration by M-HOV 3. 
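The shrinking behaviour established by Lemmas 1 and 2 can be checked numerically. The sketch below uses hypothetical points and a hypothetical measure vector chosen only for illustration; it computes the weighted Manhattan distance of formula (8) and confirms that each further application of a measure M with 0 < |m_k| < 1 contracts the distance:

```python
import numpy as np

def manhattan(a, b, m):
    """Weighted Manhattan distance of formula (8): sum_k |m_k * (a_k - b_k)|."""
    a, b, m = (np.asarray(v, dtype=float) for v in (a, b, m))
    return np.abs(m * (a - b)).sum()

# Hypothetical 3-dimensional points and a measure with every |m_k| in (0, 1).
a = np.array([1.0, 4.0, 2.0])
b = np.array([3.0, 1.0, 5.0])
M = np.array([0.5, 0.8, 0.4])

d0 = manhattan(a, b, np.ones(3))   # sigma(ab), the no-measure case
d1 = manhattan(a, b, M)            # M sigma(ab),   one application of M
d2 = manhattan(a, b, M * M)        # M^2 sigma(ab), M-HOV3 with s = 2
# Lemmas 1 and 2 predict: sigma(ab) > M sigma(ab) > M^2 sigma(ab)
```

Since every pairwise distance contracts at a rate driven by the per-dimension weights, groups aligned with heavily weighted dimensions stay apart while loosely related points draw together, which is the enhanced-separation effect in practice.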
Figure 8a illustrates the original data distribution of auto-mpg produced by HOV 3 in MATLAB, where it is not possible to recognize any grouping information. We then tuned each axis manually and roughly distinguished three groups, as shown in Figure 8b. The weight values of the axes were recorded as a vector M = [0.10, 0, 0.25, 0.2, 0.8, 0.85, 0.1, 0.95].
Figure 8a. auto-mpg's original data distribution  Figure 8b. p1 = C(auto-mpg, M)  Figure 8c. p2 = C(auto-mpg, M^2)  Figure 8d. The overlapping diagram of p1 and p2

Figure 8. Diagrams of the data set auto-mpg projected by HOV 3 in MATLAB

Figure 8b shows that there exist several ambiguous data points between the groups. We then employed M^2 (the poly-multiplication M·M of Definition 2) as a predictive measure vector and applied it to the data set auto-mpg. The projected distribution p2 of auto-mpg is presented in Figure 8c. It is much easier to identify the 3 groups of auto-mpg in Figure 8c than in Figure 8b. To show the contrast between the two diagrams p1 and p2, we overlap them in Figure 8d. By analyzing the data of these 3 groups, we have found that group 1 contains 70 items with origin value 2 (sourced from Europe); group 2 has 79 instances with origin value 3 (made in Japan); and group 3 includes 249 records with origin value 1 (from the USA). This natural grouping based on the user's intuition serendipitously clustered the data set according to the origin attribute of auto-mpg. In the same way, the user may find more grouping information in the interactive cluster exploration process by applying predictive measurements.

5.4 Cluster Exploration by HOV 3 with Statistical Measurements

Many statistical measurements, such as the mean, median, standard deviation and others, can be directly introduced into HOV 3 as predictions to explore data distributions. In fact, prediction based on statistical measurements is a more purposeful cluster exploration, and it is easier to give a geometrical interpretation of the resulting data distribution. We use the Iris dataset as an example to demonstrate cluster exploration with statistical measurements. As shown above in Figure 4, by random axis scaling the user can divide the Iris data into 3 groups. That example shows that cluster exploration based on random adjustments may expose data grouping information, but sometimes it is hard to find or interpret such a grouping. 
Now, let us cluster Iris by HOV 3 with a statistical measurement. First, we applied the K-means clustering algorithm with k=3 (three clusters) to Iris, and displayed the clustered Iris data in VISTA.
Its original distribution is shown in Figure 3, where it can be observed that there exist overlapping points. We then employed the standard deviation of Iris, M = [0.2302, 0.1806, 0.2982, 0.3172, 0.4089], as a prediction to project the clustered Iris data by HOV 3 in VISTA. The result is shown in Figure 9, where 3 clearly separated groups exist. It can also be observed in Figure 9 that there is a blue point in the pink-colored cluster and a pink point in the green-colored cluster, resulting from the K-means clustering algorithm with k=3. Intuitively, they have been wrongly clustered. We re-clustered them by their distributions, as shown in Figure 10.

Figure 9. The data distribution of clustered Iris (marked by K-means) projected by HOV 3 in VISTA

Figure 10. The data distribution of Iris projected by HOV 3 in VISTA with the new clustering indices given by the user's intuition

The contrast between the clusters (C_K) produced by the K-means clustering algorithm and the new clustering result (C_H) projected by HOV 3 is summarized in Table 1. We can see that the quality of the new clustering result of Iris is better than that obtained by K-means according to the comparison of their Variance values: each cluster projected by HOV 3 has a higher internal similarity than that produced by K-means. By analyzing the newly grouped data points of Iris, we have found that they are distinguished by the class attribute of Iris, i.e., Iris-setosa, Iris-versicolor and Iris-virginica. Cluster 1 generated by K-means is an outlier group.

Table 1. The statistics of the clusters in Iris produced by K-means (k=3) and by HOV 3 with predictive measures

  C_K   %       Radius  Variance  MaxDis  |  C_H   %       Radius  Variance  MaxDis
  1     1.333   1.653   2.338     3.306   |
  2     32.667  5.754   0.153     6.115   |  1     33.333  5.753   0.152     6.113
  3     33.333  8.196   0.215     8.717   |  2     33.333  8.210   0.207     8.736
  4     33.333  7.092   0.198     7.582   |  3     33.333  7.112   0.180     7.517

With statistical predictions in HOV 3 the user may even expose cluster clues that are not easily found by random adjustments. 
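Computing such a statistical prediction is straightforward. The sketch below is a Python/NumPy illustration on a small hypothetical dataset, not the actual Iris computation in VISTA; it derives the per-attribute standard deviation to be used as the measure vector M in the HOV 3 projection of equation (5):

```python
import numpy as np

# Hypothetical 4-attribute samples; any dataset matrix works the same way.
data = np.array([[5.1, 3.5, 1.4, 0.2],
                 [4.9, 3.0, 1.4, 0.2],
                 [6.2, 3.4, 5.4, 2.3],
                 [5.9, 3.0, 5.1, 1.8]])

# One statistic per attribute becomes one weight per Star Coordinates axis.
M_std  = data.std(axis=0)        # standard deviation as a prediction
M_mean = data.mean(axis=0)       # the mean could serve as another prediction
```

Attributes with higher variability then pull points further along their axes, which is what gives the statistical prediction its geometrical interpretation.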
For example, we adopted the 8th row of auto-mpg's covariance matrix as a predictive measure, [0.04698, -0.07657, -0.06580, 0.00187, -0.05598, 0.01343, 0.02202, 0.16102], to project auto-mpg by HOV 3 in MATLAB. The result is shown in Figure 11. We grouped the points by their distribution, as shown in Figure 12. Table 2 reports the statistics of the clusters (in the left part of the table, i.e., C_H), and reveals that the points in each cluster have a very high similarity.
Figure 11. The data distribution of auto-mpg projected by HOV 3 in MATLAB with the 8th row of auto-mpg's covariance matrix as the prediction

Figure 12. The clustered distribution of the data in Figure 11 by the user's intuition

As we chose the 8th row of auto-mpg's covariance matrix as the prediction, the result mainly depends on the 8th column of the auto-mpg data, i.e., origin (country). Figure 12 shows that C1, C2 and C3 are closer to each other because they have the same origin value 1. The more detailed formation of the clusters is given in Table 2. We believe that a domain expert could give a better and more intuitive explanation of this clustering.

Table 2. The statistics of the clusters in auto-mpg produced by HOV 3 with the covariance prediction of auto-mpg (C_H, left) and by K-means with k=5 (C_K, right)

  C_H  origin  cylinder  %       Radius    Variance  MaxDis    |  C_K  %       Radius    Variance  MaxDis
  1    1       8         25.879  4129.492  0.130     4129.768  |  1    0.503   681.231   963.406   1362.462
  2    1       6         18.583  3222.493  0.098     3222.720  |  2    18.090  2649.108  0.206     2649.414
  3    1       4         18.090  2441.881  0.090     2442.061  |  3    16.080  2492.388  0.139     2492.595
  4    2       4         17.588  2427.449  0.142     2427.632  |  4    21.608  3048.532  0.207     3048.897
  5    3       3         19.849  2225.465  0.093     2225.658  |  5    25.377  3873.052  0.220     3873.670
                                                               |  6    18.593  2417.804  0.148     2417.990

Then we chose cluster number 5 to cluster auto-mpg by K-means. Its clustering result is presented in the right part of Table 2 (C_K). By comparing these two clustering results, we can see that, according to the Variance of the clusters, the quality of the clustering result by HOV 3 with the covariance prediction of auto-mpg is better than that produced by K-means (k=5; cluster 1 in C_K is an outlier group).

5.5 Cluster Exploration by HOV 3 with Complex Linear Transformation

In principle, any linear transformation can be employed in HOV 3 if it can separate clusters well. We therefore introduce a complex linear transformation into this process. We again use the auto-mpg data set as an example. 
As shown in Figure 8b, three roughly separated clusters appear, where the vector M = [0.10, 0, 0.25, 0.2, 0.8, 0.85, 0.1, 0.95] was obtained from the axes' values. We then adopt cos(M·10i) as a prediction, where i is the imaginary unit. The projection of HOV 3 with cos(M·10i) is illustrated in Figure 13, where the three clusters are separated very well. In the same way, many other linear transformations can be applied to different datasets to obtain well-separated clusters. With clearly grouped objects or fully separated clusters, the efficiency of identifying cluster formations in cluster analysis by visualization improves markedly.
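For a measure entry of the form cos(m_k · 10i), the identity cos(ix) = cosh(x) means each resulting weight is the real value cosh(10 m_k), which amplifies large m_k far more aggressively than linear scaling. The brief check below (a sketch reusing the vector M from Figure 8b) confirms this:

```python
import numpy as np

M = np.array([0.10, 0.0, 0.25, 0.2, 0.8, 0.85, 0.1, 0.95])
weights = np.cos(10j * M)   # cos(m_k * 10i), complex cosine of a real vector

# cos(ix) = cosh(x): the weights are purely real and grow rapidly with m_k,
# stretching the heavily weighted axes far more than linear scaling would.
```

This exponential stretching of a few dominant dimensions is what pulls the roughly separated clusters of Figure 8b fully apart.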
Figure 13. The data distribution of auto-mpg projected by HOV 3 with cos(M·10i) as the prediction

6 External Cluster Validation by HOV 3

In practice, with extremely large datasets, it is infeasible to cluster an entire data set within an acceptable time scale. A common solution used in data mining is that clustering algorithms are first applied to a training (sampling) subset of data from a database to extract cluster patterns, and then the cluster scheme is assessed to see whether it is suitable for the other subsets in the database. This procedure is regarded as external cluster validation [VSA05]. Due to the high computational cost of statistical methods for assessing the consistency of cluster structures between large subsets, achieving this goal by statistical methods is still a challenge in data mining. Based on the assumption that, if two same-sized data sets have a similar cluster structure, then after applying the same linear transformation to both, the similarity of the newly produced distributions of the two sets will still be high, we have proposed a distribution-matching based external cluster validation by HOV 3 [ZOZ07a]. The detailed explanation of this approach is presented in the following.

6.1 Definitions

For a precise explanation of our approach, we first give some formal definitions.

Definition 4 (cluster) A cluster C := (D, L) is a non-empty set of data points D from the database with a label set L, and the ith cluster C_i = {p ∈ D, l ∈ L | ∀p ∈ C_i: l = i ∧ i > 0}, where l is the cluster label of p, l ∈ {-1, 0, 1, ..., k}, and k is the number of clusters. As special cases, an outlier point is an element of D with cluster label -1, and a non-clustered element of D has a cluster label of 0, i.e., it has not been clustered. 
Definition 5 (spy subset) A spy subset P_s is a clustered subset of the database produced by a clustering algorithm, where P_s = {C_1, C_2, ..., C_k, C_E}, C_i (1 <= i <= k) is a cluster in P_s, and C_E is the outlier set of P_s. A spy subset is used as a visual model to verify the cluster structure in the other partitions of the database.

Definition 6 (target subset) A target subset P_t of P_s is a non-clustered subset of the database, P_t = {p ∈ P_t, l ∈ L | ∀p ∈ P_t: l = 0 ∧ |P_s| = |P_t|}.
A target subset P_t is a non-clustered subset of the database with the same size as a spy subset P_s. It is used as a target for investigating the similarity of its cluster structure with that of the spy subset P_s.

Definition 7 (overlapping point) A non-clustered point p_o is called an overlapping point of a cluster C_i iff ∃p ∈ C_i: p_o ∉ C_i ∧ ||p_o - p|| <= ε, where ε is the threshold distance given by the user.

Definition 8 (quasi-cluster) The set of overlapping points of a cluster C_i composes a quasi-cluster of C_i, denoted as C_qi, i.e., C_qi = {p_o | p_o is an overlapping point of C_i}.

Definition 9 (well-separated cluster) A cluster C_i is called a visually well-separated cluster when it satisfies the condition that, for all C_i, C_j ∈ P_s with i ≠ j, no point p ∈ C_i is an overlapping point of C_j. A well-separated cluster C_i in the spy subset implies that no points of C_i are within the threshold distance of any other cluster in the spy subset.

Based on the above definitions, we present the application of our approach to external cluster validation based on distribution matching by HOV 3 as follows.

6.2 The Processing of the Approach

The stages in the application of our approach are summarized in the following 5 steps:

1. Clustering: First, the user applies a clustering algorithm to a randomly selected subset P_s of the given dataset.

2. Cluster Separation: The clustering result of P_s is introduced and visualized in the HOV 3 system. Then the user manually tunes the weight value of each axis, or applies the other cluster separation methods of HOV 3, such as M-HOV 3, separation by statistical measurements, or complex linear transformations, to separate overlapping clusters. If one or more clusters are separated from the others visually, then the weight values of the axes are recorded as a measure vector M.

3. 
Data Projection by HOV 3 : The user samples another observation from the database with the same number of points as P_s, as a target subset P_t. The clustered subset P_s (now acting as a spy subset) and its target subset P_t are projected together by HOV 3 with the vector M to detect the distribution consistency between P_s and P_t.

4. The Generation of Quasi-Clusters: The user gives a threshold ε, and then, according to Definitions 5, 6 and 7, the quasi-cluster C_qi of a separated cluster C_i is computed. Then C_qi is removed from P_t, and C_i is removed from P_s. If P_s still has clusters, we go back to step 2; otherwise we proceed to the next step.

5. The Interpretation of Results: The overlapping rate of each cluster-and-quasi-cluster pair is calculated as ω(C_qi, C_i) = |C_qi| / |C_i|. If the overlapping rate approaches 1, cluster C_i and its quasi-cluster C_qi have a high similarity, since the size ratio of the spy subset and the target subset is 1:1. Thus the overlapping analysis is simply transformed into a linear regression analysis, i.e., of the points around the line C = C_q.

Corresponding to the procedure described above, we give the algorithm of external cluster validation based on distribution matching by HOV 3 in Figure 14.
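Steps 4 and 5 above reduce to a small computation once the 2D projections are available. The sketch below is an illustrative Python fragment with hypothetical 2D points (the thesis implementation is in MATLAB); it forms a quasi-cluster per Definitions 7 and 8 and computes the overlapping rate of step 5:

```python
import numpy as np

def quasi_cluster_mask(cluster_pts, target_pts, eps):
    """Definitions 7-8: a target point joins the quasi-cluster of C if it
    lies within threshold distance eps of some point of C in the 2D plane."""
    cluster_pts = np.asarray(cluster_pts, dtype=float)
    target_pts = np.asarray(target_pts, dtype=float)
    # distance from every target point to its nearest cluster point
    d = np.linalg.norm(target_pts[:, None, :] - cluster_pts[None, :, :], axis=2)
    return d.min(axis=1) <= eps

def overlap_rate(cluster_pts, target_pts, eps):
    """Step 5: omega(C_q, C) = |C_q| / |C|; values near 1 indicate that the
    target subset reproduces the cluster structure of the spy subset."""
    return quasi_cluster_mask(cluster_pts, target_pts, eps).sum() / len(cluster_pts)

# Hypothetical projections: a spy-subset cluster and some target points.
C = [[0.00, 0.00], [0.10, 0.00], [0.00, 0.10]]
T = [[0.05, 0.02], [0.08, 0.09], [5.00, 5.00]]
rate = overlap_rate(C, T, eps=0.2)      # 2 of the 3 target points overlap
```

Because the overlap test runs on 2D projections rather than on the full high-dimensional data, its cost stays linear in the subset sizes times the cluster size.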
Figure 14. The algorithm of external cluster validation based on distribution matching by HOV 3

In Figure 14, the procedure clusterseparate responds to the axis tuning by the user to separate the clusters in the spy subset and to gather the weight values of the axes as a measure vector; the procedure quasiclustergeneration produces the quasi-clusters in the target subset corresponding to the clusters in the spy subset.

6.3 The Model of Distribution-Matching Based External Cluster Validation by HOV 3

In contrast to the statistics-based external cluster validation model illustrated in Figure 1, we show our model of external cluster validation by visualization in Figure 15.

Figure 15. External cluster validation by HOV 3

Comparing these two models, we may observe that, instead of using a clustering algorithm to cluster another sampled data set, in our model we use a clustered subset of a database as a visual model to verify the similarity of cluster structures between the model and the other non-clustered subsets of the database. To handle the scalability of resampling datasets, we choose non-clustered observations with the same size as the clustered subset, and then project them together by HOV 3. As a consequence, the user can utilize the well-separated clusters produced by scaling axes in HOV 3 as a model to pick out their corresponding quasi-clusters, i.e., the points that overlap the clusters. Also, instead of using statistical methods to assess the similarity between the two subsets, we simply compute the overlapping rate between the clusters and their quasi-clusters to show their consistency. Compared with statistics-based validation methods, our method is not only visually intuitive, but also more effective in real applications [ZOZ07a]. Obviously, how to obtain well-separated clusters plays a very important role in the procedure of
external cluster validation by HOV 3. Separating clusters from many overlapping points is also an aim of this research. Thus the approaches mentioned above, such as M-HOV 3 [ZOZ07b] and HOV 3 with statistical measurements and complex linear transformations, can be introduced into this process.

6.4 External Cluster Validation with M-HOV 3

Separating clusters from many overlapping points manually is often time consuming. We claim that the enhanced separation feature of M-HOV 3 can provide improvements not only in efficiency but also in accuracy when dealing with external cluster validation [ZOZ07b]. This is because the combination of zooming and M-HOV 3 with the same threshold distance can improve the precision of quasi-cluster data point selection. According to formula (5), zooming in HOV 3 can be understood as projecting a data set with a vector whose attributes all have the same value greater than 1. Note that the application of an M-HOV 3 would normally shrink the size of the patterns in HOV 3. Technically, we therefore choose min(m_k)^{-1} as the zooming vector value, where min(m_k) is the minimal non-zero value among the m_k. Thus, under the condition of a fixed threshold distance between the closest data points, the scale of the patterns in HOV 3 is amplified by applying the combination of M-HOV 3 and zooming. This combination is formalized in equation (10):

P_j(z) = \sum_{k=1}^{n} z_0^k \left( \prod_{i=1}^{s} m_{ki} \cdot \min(m_k)^{-1} \right) (d_{jk} - \min d_k) / (\max d_k - \min d_k)  (10)

We have presented examples of how to gain cluster clues by applying the HOV 3 projection to databases in the previous sections. In the next section, we demonstrate the effectiveness of external cluster validation by HOV 3 with several examples.

7 Examples and Explanation

In this section, we present several examples to demonstrate the advantages of cluster exploration and external cluster validation by HOV 3. We have implemented our approach in MATLAB running under Windows 2000 Professional. 
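The zooming factor in equation (10) is simply the reciprocal of the smallest non-zero measure weight, and multiplying every weight by the same scalar rescales the whole projection without changing its shape. A brief check, sketched here using the Housing measure vector of Section 7.1 as an example:

```python
import numpy as np

M = np.array([0.5, 1, 0, 0.95, 0.3, 1, 0.5, 0.5,
              0.8, 0.75, 0.25, 0.55, 0.45, 0.75])

zoom = 1.0 / M[M > 0].min()   # reciprocal of the smallest non-zero weight
weights = M * M                # M-HOV3 weights with two identical measures
zoomed = weights * zoom        # the combined weights of equation (10)
# Every projected point is scaled by the same factor, so only the scale of
# the pattern changes, not its shape.
```

Because the zoom is a uniform scalar, the quasi-cluster membership under a fixed threshold distance becomes easier to judge on the enlarged pattern.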
The datasets used in the examples are obtained from the UCI machine learning website: http://www.ics.uci.edu/~mlearn/machine-learning.html.

7.1 M-HOV 3

Choosing an appropriate cluster number for an unknown data set is meaningful in the pre-clustering stage. The enhanced separation feature of M-HOV 3 is advantageous for the identification of the cluster number in this stage. We demonstrate this advantage of M-HOV 3 with the next example. We use the data set Boston Housing (simply written as Housing). The Housing set has 14 attributes and 506 instances. The original data distribution of Housing is given in Figure 16a. As in the process of the last example, based on observation and axis scaling we obtained a roughly separated data distribution of Housing, as demonstrated in Figure 16b; we fixed the weight values of the axes as M = [0.5, 1, 0, 0.95, 0.3, 1, 0.5, 0.5, 0.8, 0.75, 0.25, 0.55, 0.45, 0.75]. Comparing the diagrams in Figure 16a and Figure 16b, we can see that the data points in Figure 16b are constricted into 3 (or 4?) groups. Then M-HOV 3 was applied to the data set Housing. Figure 16c and Figure 16d are the results of M-HOV 3 with M.*M and M.*M.*M respectively. It is much easier to gain grouping insights from Figure 16c and Figure 16d, where we can identify the group members conveniently.
Figure 16a. The original data distribution of Housing. Figure 16b. p1 = C(Housing, M). Figure 16c. p2 = C(Housing, M.*M). Figure 16d. p3 = C(Housing, M.*M.*M).

We believe that with domain experts involved in the process, the M-HOV 3 approach can perform even better in real world applications. Now we demonstrate the improvement of precision achieved by applying M-HOV 3 with zooming to gain cluster members. We still use the above example; the non-zero minimal value of the measure vector M is 0.25, so we use V = [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4] (4 = 1/0.25) as the zooming vector. We contrast M-HOV 3 with and without zooming below.

Figure 17. p2 = C(Housing, M.*M). Figure 18. p2' = C(Housing, M.*M.*V).

It is observed that the shape of the patterns in Figure 17 is exactly the same as in Figure 18, but the scale in Figure 18 is enlarged. Thus the combination of M-HOV 3 and zooming can improve the accuracy of data selection in external cluster validation by HOV 3.

7.2 External Cluster Validation by HOV 3
The Shuttle data set has 9 attributes and 43,500 instances. We chose the first 5,000 instances of Shuttle as a sampling subset and applied the K-means algorithm [McQ67] to it. Then we utilized the clustered result as a spy subset. We assumed that we had found the optimal cluster number k = 5 for the sampling subset. The original data distributions without and with cluster indices are illustrated in the diagrams of Figure 19 and Figure 20 respectively. It can be seen that there is cluster overlapping in Figure 20.

Figure 19. The original data distribution of the first 5,000 data points of Shuttle in MATLAB by HOV 3 (without cluster indices). Figure 20. The original data distribution of the first 5,000 data points of Shuttle in MATLAB by HOV 3 (with cluster indices).

To obtain well-separated clusters, we tuned the weight of each coordinate and obtained a satisfactory version of the data distribution, as shown in Figure 21. The weight values of the axes are recorded as the measure vector [0.80, 0.55, 0.85, 0.0, 0.40, 0.95, 0.20, 0.05, 0.459] in this case. Then we chose the second 5,000 instances of Shuttle as a target subset and projected the target subset and the spy subset together against the measure vector by HOV 3. Their distributions are presented in Figure 22, where we may observe that their data distributions match very well. We chose the points in the enclosed area in Figure 22 as a cluster and then obtained a quasi-cluster in the target subset corresponding to the cluster in the enclosed area. In the same way, we can find the other quasi-clusters from the target subset.

Figure 21. A well-separated version of the spy subset distribution of Shuttle. Figure 22. The projection of the spy subset and a target subset of Shuttle by applying a measure vector.

We have done the same experiment on 4 target subsets of Shuttle.
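The spy/target workflow above can be sketched end to end. The fragment below is a Python/NumPy stand-in for the thesis's MATLAB experiment: the tiny `kmeans`, the `hov3` projection, the synthetic 9-attribute data (Shuttle is not bundled here), the measure vector, and the nearest-spy-point rule for forming quasi-clusters are all illustrative assumptions; in the thesis the quasi-clusters are selected manually from enclosed regions:

```python
import numpy as np

def hov3(data, m):
    """Sketch of a HOV 3-style projection: one complex 2D point per row."""
    data = np.asarray(data, dtype=float)
    n = data.shape[1]
    lo, hi = data.min(axis=0), data.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    axes = np.exp(2j * np.pi * np.arange(1, n + 1) / n)
    return ((data - lo) / span * m) @ axes

def kmeans(X, k, iters=100, seed=0):
    """Tiny Lloyd-style K-means standing in for the MATLAB implementation."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# synthetic 9-attribute data standing in for Shuttle
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.2, size=(400, 9)) for c in (0.0, 1.5, 3.0)])
rng.shuffle(X)
spy, target = X[:600], X[600:]          # spy = the clustered sampling subset

labels = kmeans(spy, 3)                 # cluster indices for the spy subset
m = np.full(9, 0.7)                     # stand-in for the hand-tuned measure vector

# project the spy and target subsets together against the same measure vector
p = hov3(np.vstack([spy, target]), m)
ps, pt = p[:600], p[600:]

# quasi-clusters: each target point adopts the cluster of its nearest spy
# point in the 2D projection (a crude stand-in for selecting enclosed regions)
nearest = np.argmin(np.abs(pt[:, None] - ps[None, :]), axis=1)
quasi = labels[nearest]
sizes = np.bincount(quasi, minlength=3)
assert sizes.sum() == len(target)       # every target point is assigned
```

Comparing `sizes` against the spy cluster sizes then plays the role of the cluster/quasi-cluster comparison in Table 3.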
The size of each quasi-cluster and its corresponding cluster are listed in Table 3, and their curves of linear regression to the line C = C_q are illustrated in Figure 23.

Table 3. Cluster/quasi-cluster pairs and their overlapping rates
Subset    C_q1/C_1          C_q2/C_2          C_q3/C_3          C_q4/C_4            C_q5/C_5
Spy       318               773               513               2254                1142
Target1   278/318=0.8742    670/773=0.8668    503/513=0.9805    2459/2254=1.0909    1123/1142=0.9834
Target2   279/318=0.8773    897/773=1.1604    626/513=1.2203    2048/2254=0.9086    1602/1142=1.4028
Target3   280/318=0.8805    875/773=1.1320    481/513=0.9376    2093/2254=0.9286    1455/1142=1.2741
Target4   261/318=0.8208    713/773=0.9224    368/513=0.7173    2416/2254=1.0719    1169/1142=1.0264

(*At the current stage we collect the quasi-clusters manually, so the C_qi values may contain redundant or misallocated points.)

It is observed that the curves match the line C = C_q well, i.e., the overlapping rates between the clusters and their quasi-clusters are high. The standard deviation is a good way to reflect the difference between two vectors. Thus we calculated the standard deviation of the C_qi/C_i ratio vector of each target subset Target_i (i = 1, ..., 4) against the spy subset. They are 0.0826, 0.1975, 0.1491 and 0.1304 respectively. This means that the similarity of the cluster structures in the spy and target subsets is high. In summary, the experiments show that the cluster structure found in the spy subset of Shuttle also exists in the target subsets of Shuttle.

Figure 23. The curves of linear regression to the line C = C_q.

In these experiments, we have also measured the time taken for both clustering and projection in MATLAB. The results are listed in Table 4.

Table 4. Timing of clustering and projection

Clustering by K-means (k = 5)              Projection by HOV 3
Subset     Amount    Time (seconds)        Subset         Size      Time (seconds)
Target1    5,000     0.532                 Spy+Target1    10,000    0.11
Target2    5,000     0.61                  Spy+Target2    10,000    0.109
Target3    5,000     0.656                 Spy+Target3    10,000    0.11
Target4    5,000     0.453                 Spy+Target4    10,000    0.109

Based on these measurements, it can be observed that projection by HOV 3 is much faster than the clustering process by the K-means algorithm. It is particularly effective for verifying clustering results within very large databases.
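The reported deviations can be reproduced directly from the ratios printed in Table 3. The short check below assumes they are the population standard deviation (ddof = 0) of each target subset's ratio vector, which matches the published figures:

```python
import numpy as np

# C_qi / C_i ratios as listed in Table 3, one row per target subset
ratios = np.array([
    [0.8742, 0.8668, 0.9805, 1.0909, 0.9834],   # Target1
    [0.8773, 1.1604, 1.2203, 0.9086, 1.4028],   # Target2
    [0.8805, 1.1320, 0.9376, 0.9286, 1.2741],   # Target3
    [0.8208, 0.9224, 0.7173, 1.0719, 1.0264],   # Target4
])
sd = ratios.std(axis=1)          # population standard deviation (ddof=0)
print(np.round(sd, 4))           # [0.0826 0.1975 0.1491 0.1304]
```

A ratio vector close to all ones with a small standard deviation is exactly the "curves near the line C = C_q" reading of Figure 23.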
Although the cluster separation in our approach may take some time, once well-separated clusters are found, using a measure vector to project a huge data set will be a lot
more efficient than re-applying a clustering algorithm to the data set.

8 Concluding Remarks

In this paper we have proposed a novel approach called HOV 3, Hypothesis Oriented Verification and Validation by Visualization, to assist data miners in the cluster analysis of high-dimensional datasets by visualization. The HOV 3 visualization technique employs hypothesis-oriented measures to project data and allows users to iteratively adjust the measures to optimize the clustering result. This approach provides data miners with an opportunity to introduce their quantified domain knowledge as predictions in the cluster discovery process for revealing the gaps between the data distribution and the predictions. HOV 3 is thus a more purposeful visual method for investigating clusters in high-dimensional databases. In this paper, based on the projection technique of HOV 3, we have also introduced a visual approach called M-HOV 3 to enhance the visual separation of clusters. The visual separability of clusters is significant for cluster analysis. A good visual separation of clusters is beneficial not only in revealing the membership formation of clusters, but also in verifying the validity of clustering results. With M-HOV 3, users can both explore cluster distributions intuitively and deal with cluster validation effectively by matching the geometrical distributions of clustered and non-clustered subsets produced by M-HOV 3. Based on the capability of HOV 3 to use quantified domain knowledge about datasets as predictions/measurements, we have also addressed visual external cluster validation supported by the projection mechanism of HOV 3. This approach is based on the assumption that if data sets share the same cluster structure, then projecting them with the same measure should yield highly similar data distributions.
By comparing the data distributions of a clustered subset and non-clustered subsets projected by HOV 3 with tunable measures, users can perform an intuitive visual evaluation, and can also obtain a precise evaluation of the consistency of the cluster structure by performing geometrical computations on the data distributions. By comparing our approach with existing visual methods, we have observed that our method is not only efficient in performance, but also effective in real applications. Experiments show that the HOV 3 technique can improve the effectiveness of cluster analysis by visualization and provide a better, intuitive understanding of the results. HOV 3 can be seen as a bridge between qualitative analysis and quantitative analysis. It not only supports the verification and validation of quantified domain knowledge, but can also directly utilize rich statistical analysis tools as measures, giving data miners efficient and effective guidance towards more precise cluster information in data mining. As a result, with the advantage of the quantified measurement feature of HOV 3, data miners can identify the cluster number in the pre-processing stage of clustering efficiently, and also verify the membership of data points among the clusters effectively in the post-processing stage of clustering in data mining.

References

[AAP+03] A. L. Abul, R. Alhajj, F. Polat and K. Barker, Cluster Validity Analysis Using Subsampling, in Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Washington DC, Oct. 2003, Volume 2, pp. 1435-1440.
[ABK+99] M. Ankerst, M. M. Breunig, H.-P. Kriegel and J. Sander, OPTICS: Ordering points to identify the clustering structure, in Proceedings of the ACM SIGMOD Conference, 1999, pp. 49-60.
[AlC91] B. Alpern and L. Carter, Hyperbox, in Proceedings of Visualization '91, San Diego, CA, 1991, pp. 133-139.
[Ber06] P. Berkhin, A Survey of Clustering Data Mining Techniques, in J. Kogan, C. Nicholas and M. Teboulle (Eds.),
Grouping Multidimensional Data, Springer, 2006, pp. 25-72.
[BPR+04] C. Baumgartner, C. Plant, K. Kailing, H.-P. Kriegel and P. Kröger, Subspace Selection for Clustering
High-Dimensional Data, in Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM'04), 2004, pp. 11-18.
[CBC+95] D. R. Cook, A. Buja, J. Cabrera and C. Hurley, Grand tour and projection pursuit, Journal of Computational and Graphical Statistics, Volume 4(3), 1995, pp. 155-172.
[Che73] H. Chernoff, The Use of Faces to Represent Points in k-Dimensional Space Graphically, Journal of the American Statistical Association, Volume 68, 1973, pp. 361-368.
[ChL04] K. Chen and L. Liu, VISTA: Validating and Refining Clusters via Visualization, Journal of Information Visualization, Volume 3(4), 2004, pp. 257-270.
[Cle93] W. S. Cleveland, Visualizing Data, AT&T Bell Laboratories, Murray Hill, NJ, Hobart Press, Summit, NJ, 1993.
[Cli00] C. E. Lunneborg, Data Analysis by Resampling: Concepts and Applications, Duxbury Press, 2000.
[EKS+96] M. Ester, H.-P. Kriegel, J. Sander and X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996.
[FaL95] C. Faloutsos and K. Lin, FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia data sets, in Proceedings of ACM SIGMOD, 1995, pp. 163-174.
[Fie79] S. E. Fienberg, Graphical methods in statistics, The American Statistician, Volume 33, 1979, pp. 165-178.
[GRS98] S. Guha, R. Rastogi and K. Shim, CURE: An efficient clustering algorithm for large databases, in Proceedings of the ACM SIGMOD Conference, 1998.
[HaK01] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
[HBV01] M. Halkidi, Y. Batistakis and M. Vazirgiannis, On Clustering Validation Techniques, Journal of Intelligent Information Systems, Volume 17(2/3), 2001, pp. 107-145.
[HBV02] M. Halkidi, Y. Batistakis and M. Vazirgiannis, Cluster validity methods: Part I and II, SIGMOD Record, 31, 2002.
[HKW99] A. Hinneburg, D. A. Keim and M. Wawryniuk, HD-Eye: Visual mining of high-dimensional data,
Computer Graphics & Applications, 19(5):22-31, September/October 1999.
[HCN01] Z. Huang, D. W. Cheung and M. K. Ng, An Empirical Study on the Visual Cluster Validation Method with Fastmap, in Proceedings of DASFAA'01, Hong Kong, April 2001, pp. 84-91.
[HKK05] J. Handl, J. Knowles and D. B. Kell, Computational cluster validation in post-genomic data analysis, Bioinformatics, Volume 21(15), 2005, pp. 3201-3212.
[HuL00] Z. Huang and T. Lin, A visual method of cluster validation with Fastmap, in Proceedings of PAKDD-2000, 2000, pp. 153-164.
[Ins97] A. Inselberg, Multidimensional Detective, in Proceedings of IEEE Information Visualization '97, 1997, pp. 100-107.
[Jac08] P. Jaccard, Nouvelles recherches sur la distribution florale, Bull. Soc. Vaud. Sci. Nat., 44, 1908, pp. 223-270.
[JaD88] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
[JMF99] A. Jain, M. N. Murty and P. J. Flynn, Data Clustering: A Review, ACM Computing Surveys, Volume 31(3), 1999, pp. 264-323.
[Kan01] E. Kandogan, Visualizing multi-dimensional clusters, trends, and outliers using star coordinates, in Proceedings of the ACM SIGKDD Conference, 2001, pp. 107-116.
[KeK94] D. A. Keim and H.-P. Kriegel, VisDB: Database Exploration using Multidimensional Visualization, IEEE Computer Graphics & Applications, 1994, pp. 40-49.
[Koh97] T. Kohonen, Self-Organizing Maps, Springer, Berlin, second extended edition, 1997.
[KSP73] S. Kaski, J. Sinkkonen and J. Peltonen, Data Visualization and Analysis with Self-Organizing Maps in Learning Metrics, DaWaK 2001, LNCS 2114, 2001, pp. 162-173.
[McQ67] J. MacQueen, Some methods for classification and analysis of multivariate observations, in Proceedings of the 5th Berkeley Symposium on Mathematics, Statistics and Probability, Volume 1, 1967, pp. 281-298.
[Mil81] G. W. Milligan, A Review of Monte Carlo Tests of Cluster Analysis, Multivariate Behavioral Research, Volume 16(3), 1981, pp. 379-407.
[MSS83] G. W. Milligan, L. M. Sokol and S. C.
Soon, The effect of cluster size, dimensionality and the number of clusters on recovery of true cluster structure, IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(1):40-47, 1983.
[OlL03] F. Oliveira and H. Levkowitz, From Visual Data Exploration to Visual Data Mining: A Survey, IEEE Transactions on Visualization and Computer Graphics, Volume 9(3), 2003, pp. 378-394.
[PGW03] E. Pampalk, W. Goebl and G. Widmer, Visualizing Changes in the Structure of Data for Exploratory
Feature Selection, in Proceedings of SIGKDD'03, August 24-27, 2003, Washington, DC, USA.
[PiC70] R. M. Pickett, Visual Analyses of Texture in the Detection and Recognition of Objects, in B. S. Lipkin and A. Rosenfeld (Eds.), Picture Processing and Psychopictorics, Academic Press, New York, 1970, pp. 289-308.
[Ran71] W. M. Rand, Objective Criteria for the Evaluation of Clustering Methods, Journal of the American Statistical Association, 66:846-850, 1971.
[SeS05] J. Seo and B. Shneiderman, From Integrated Publication and Information Systems to Virtual Information and Knowledge Environments, Essays Dedicated to Erich J. Neuhold on the Occasion of His 65th Birthday, Lecture Notes in Computer Science, Volume 3379, Springer, 2005.
[Shn01] B. Shneiderman, Inventing Discovery Tools: Combining Information Visualization with Data Mining, in Proceedings of Discovery Science 2001, Lecture Notes in Computer Science, Volume 2226, 2001, pp. 17-28.
[SCZ98] G. Sheikholeslami, S. Chatterjee and A. Zhang, WaveCluster: A multi-resolution clustering approach for very large spatial databases, in Proceedings of the Very Large Databases Conference, 1998.
[ThK99] S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 1999.
[VSA05] R. Vilalta, T. Stepinski and M. Achari, An Efficient Approach to External Cluster Assessment with an Application to Martian Topography, Technical Report No. UH-CS-05-08, Department of Computer Science, University of Houston, 2005.
[Wei98] S. M. Weiss and N. Indurkhya, Predictive Data Mining: A Practical Guide, Morgan Kaufmann Publishers, 1998.
[WoB94] P. C. Wong and R. D. Bergeron, 30 Years of Multidimensional Multivariate Visualization, in Scientific Visualization: Overviews, Methodologies, and Techniques, IEEE Computer Society, 1994, pp. 3-33.
[ZOZ06] K-B. Zhang, M. A. Orgun and K. Zhang, HOV 3: An Approach for Cluster Analysis, in Proceedings of ADMA 2006, Xi'an, China, Lecture Notes in Computer Science, Volume 4093, 2006, pp. 317-328.
[ZOZ07a] K-B. Zhang, M. A. Orgun and K. Zhang, A Visual Approach for External Cluster Validation, Proc.
of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007), Honolulu, Hawaii, USA, April 1-5, 2007, IEEE Press, 2007, pp. 576-582.
[ZOZ07b] K-B. Zhang, M. A. Orgun and K. Zhang, Enhanced Visual Separation of Clusters by M-mapping to Facilitate Cluster Analysis, in Proceedings of the 9th International Conference on Visual Information Systems (VISUAL 2007), June 28-29, 2007, Shanghai, China (to appear).
[ZRL96] T. Zhang, R. Ramakrishnan and M. Livny, BIRCH: An efficient data clustering method for very large databases, in Proceedings of SIGMOD '96, Montreal, Canada, 1996, pp. 103-114.