Visual Cluster Analysis in Data Mining


A thesis submitted in fulfilment of the requirements for the Doctoral Degree of Philosophy

by Ke-Bing Zhang

October 2007

Master (Honours) of Science (Macquarie University) 2002
Bachelor of Engineering (Tianjin University of Technology) 1987

Department of Computing
Division of Information and Communication Sciences
Macquarie University, NSW 2109, Australia

2007 Ke-Bing Zhang

To my parents Zhang Lu-de and Zhang Xiao-lin

ACKNOWLEDGMENT

First of all, I am deeply indebted to my supervisor, Associate Professor Mehmet A. Orgun, for providing me with supervision, motivation and encouragement throughout the course of this work. His insight, breadth of knowledge and enthusiasm have been invaluable to my training as a researcher. He led me in the right direction at every stage of the research. Without his care, supervision and friendship, I would not have been able to complete this work.

I am also grateful to my co-supervisor, Professor Kang Zhang, for his suggestions and guidance on my research. His help has been integral to the success of this work.

I am faithfully indebted to my parents Zhang Lu De and Zhang Xiao Lin for their love, lasting affection, patience and constant encouragement. I deeply thank my wife Liu Yu for her love and understanding, especially for comforting me when I encountered sorrow and loneliness. I would like to express my appreciation to my brother, Professor Kewei Zhang, for his comments on the mathematics in this work.

Finally, my thanks also go to the other faculty and staff members of the Department of Computing, and to my fellow graduate students, for providing a friendly and enjoyable environment during my time here.

DECLARATION

I hereby certify that the work embodied in this thesis is the result of original research. This work has not been submitted for a higher degree to any other university or institution.

Signed:
Date:

ABSTRACT

Cluster analysis is a widely applied technique in data mining. However, most existing clustering algorithms are not efficient at dealing with arbitrarily shaped data distributions in extremely large and high-dimensional datasets. In addition, statistics-based cluster validation methods incur a very high computational cost, which prevents clustering algorithms from being used effectively in practice.

Visualization techniques have been introduced into cluster analysis. However, most visualization techniques employed in cluster analysis are used mainly as tools for information rendering, rather than for investigating how data behavior changes as the parameters of the algorithms vary. In addition, the imprecision of visualization limits its usability in contrasting the grouping information of data.

This thesis proposes a visual approach called HOV³, Hypothesis Oriented Verification and Validation by Visualization, to assist data miners in cluster analysis. HOV³ employs quantified domain knowledge, statistical measures, and explorative observation as predictions to project high dimensional data onto 2D space, revealing the gaps between the data distribution and the predictions. Based on the quantified measurement capability of HOV³, this thesis also proposes a visual external cluster validation method that verifies the stability of clustering results by comparing the data distributions of a clustered subset and non-clustered subsets projected by HOV³. With this method, data miners can perform an intuitive visual assessment and obtain a precise evaluation of the consistency of the cluster structure. This thesis also introduces a visual approach called M-HOV³ to enhance the visual separation of clusters based on the projection technique of HOV³. With enhanced separation of clusters, data miners can explore cluster distributions intuitively and carry out cluster validation effectively in HOV³.

As a consequence, with the advantage of the quantified measurement feature of HOV³, data miners can identify the cluster number efficiently in the pre-processing stage of clustering, and can also verify the membership formation of clusters effectively in the post-processing stage of clustering in data mining.

Table of Contents

LIST OF FIGURES
CHAPTER 1. INTRODUCTION
CHAPTER 2. CONTRIBUTIONS
CHAPTER 3. CLUSTER ANALYSIS
    Clustering and Clustering Algorithms
        Partitioning methods
        Hierarchical methods
        Density-based methods
        Grid-based methods
        Model-based clustering methods
    Cluster Validation
        Internal criteria
        Relative criteria
        External criteria
    The issues of cluster analysis
CHAPTER 4. VISUAL CLUSTER ANALYSIS
    Multidimensional Data Visualization
        Icon-based techniques
        Pixel-oriented Techniques
        Geometric Techniques
    Visual Cluster Analysis
        MDS and PCA
        HD-Eye
        Grand Tour
        Hierarchical BLOB
        SOM
        FastMap
        OPTICS
        Star Coordinates and VISTA
    Major Challenges
        Requirements of Visualization in Cluster Analysis
        Motivation
    Our Approach
        HOV³ Model
        External Cluster Validation by HOV³
        Enhanced the Separation of Clusters By HOV³
        Prediction-based Cluster Analysis by HOV³
CHAPTER 5. CONCLUSION AND FUTURE WORK
    Conclusion
    Future Work
        Three Dimensional HOV³
        Dynamic Visual Cluster Analysis
        Quasi-Cluster Data Points Collection
        Combination of Fuzzy Logical approaches and HOV³
APPENDIX
BIBLIOGRAPHY

List of Figures

Figure 3-1 An example of the clustering procedure of K-means [HaK01]
Figure 3-2 Hierarchical Clustering Process [HaK01]
Figure 3-3 The grid-cell structure of grid-based clustering methods
Figure 3-4 External criteria based validation [ZOZ07a]
Figure 4-1. An example of Chernoff-Faces
Figure 4-2. Stick Figure Visualization Technique
Figure 4-3. Stick Figure Visualization of the Census Data
Figure 4-4. Displaying attribute windows for data with six attributes
Figure 4-5. Illustration of the Recursive Pattern Technique
Figure 4-6. The Recursive Pattern Technique in VisDB [KeK94]
Figure 4-7. Scatterplot-Matrices [Cle93]
Figure ...,000 coloured data items in Parallel Coordinates
Figure 4-9. Star plots of data items [SGF71]
Figure Clustering of 1352 genes in MDS by [Bes]
Figure The framework of the HD-Eye system and its different visualization projections
Figure The 3D data structures in HD-Eye and their intersection trails on the planes
Figure The Grand Tour Technique and its 3D example
Figure Cluster hierarchies are shown for 1, 5, 10 and 20 clusters [SBG00]
Figure Model matching with SOM by [KSP01]
Figure Data structure mapped in Gaussian bumps by OPTICS [ABK+99]
Figure Clustering structure of 30,... ...-dimensional data items visualized by OPTICS
Figure Positioning a point by an 8-attribute vector in Star Coordinates [Kan01]
Figure 4-20 Axis scaling, angle rotation and footprint functions of Star Coordinates [Kan01]

CHAPTER 1 INTRODUCTION

Clustering analysis, also called segmentation analysis or taxonomy analysis [MaW], aims to partition objects into a set of homogeneous groups, named clusters, according to given criteria. Clustering is a very important knowledge discovery technique. It has a long history and can be traced back to the times of Aristotle [HaJ97]. These days, cluster analysis is mainly conducted on computers to deal with very large-scale and complex datasets. With the development of computer-based techniques, clustering has been widely used in data mining, in fields ranging from web mining, image processing, machine learning, artificial intelligence, pattern recognition, social network analysis, bioinformatics, geography, geology, biology, psychology, sociology, customer behavior analysis and marketing to e-business [JMF99] [Har75].

Cluster analysis includes two major aspects: clustering and cluster validation. Clustering distinguishes objects into groups according to certain criteria. The grouped objects are called clusters, where the similarity of objects is high within clusters and low between clusters. To serve different application purposes, a large number of clustering algorithms have been developed [JMF99, Ber06]. However, no general-purpose clustering algorithm fits all kinds of applications. The evaluation of the quality of clustering results, i.e., cluster validation, therefore plays a critical role in cluster analysis: it aims to assess the quality of clustering results and to find a cluster scheme that fits a specific application.

However, in practice, it may not always be possible to cluster huge datasets successfully with clustering algorithms, because most existing automated clustering algorithms are weak at dealing with arbitrarily shaped data distributions. As Abul et al. pointed out, "In high dimensional space, traditional clustering algorithms tend to break down in terms of efficiency as well as accuracy because data do not cluster well anymore" [AAP+03]. In addition, the very high computational cost of statistics-based cluster validation methods directly affects the efficiency of cluster validation [HCN01].

The clustering of large datasets in data mining is an iterative process involving humans [JMF99]. Thus, the user's initial estimate of the cluster number is important for choosing the parameters of clustering algorithms in the pre-processing stage of clustering. Also, the user's clear understanding of the cluster distribution is helpful for assessing the quality of clustering results in the post-processing stage of clustering. Both rely heavily on the user's visual perception of the data distribution. Clearly, visualization is a crucial aspect of cluster exploration and verification in cluster analysis. Visual presentations can be very powerful in revealing trends, highlighting outliers, showing clusters, and exposing gaps in data [Shn01]. Therefore, introducing visualization techniques to explore and understand high dimensional datasets is becoming an efficient way to combine human intelligence with the immense brute force computation power available nowadays [PGW03].

Visualization used in cluster analysis maps high-dimensional data to a 2D or 3D space and gives users an intuitive and easily understood graph or image that reveals the grouping relationships among the data. As an indispensable revealing technique, visualization is involved in almost every step of data mining [Chi00, AnK01]. Visual cluster analysis is a combination of visualization and cluster analysis. The datasets that clustering algorithms deal with are normally high dimensional (>3D). Thus, choosing a suitable technique to visualize clusters of high dimensional data is the first task of visual cluster analysis.

There has been much work on multidimensional data visualization [WoB94], but those earlier techniques are not suitable for visualizing cluster structures in very high dimensional and very large datasets. With the increasing application of clustering in data mining over the last decade, more and more visualization techniques have been developed to study the structure of datasets in cluster analysis [OlL03, Shn05]. Several approaches have been proposed for visual cluster analysis [ABK+99, ChL04, HCN01, HuL00, Kan01, KSP01, HWK99, SBG00], but their arbitrary exploration of group information makes them inefficient and time consuming in the cluster exploration stage. In addition, the imprecision of visualization limits its use in the quantitative verification and validation of clustering results. Thus, the motivation of this research is to develop a visualization technique that offers purposeful cluster detection and precise contrast between clustering results.

To mitigate the above-mentioned problems, we propose a visual projection technique based on hypothesis testing, called Hypothesis Oriented Verification and Validation by Visualization (HOV³) [ZOZ06a, ZOZ06b]. HOV³ generalizes the random adjustments of Star Coordinates based techniques as measure vectors. Thus, compared with the Star Coordinates based techniques, HOV³ has several advantages. First, data miners can summarize their prior knowledge of the studied data as measure vectors, i.e., hypotheses about the data. Based on hypothesis testing, data miners can quantitatively analyze the data distribution projected by HOV³ against these hypotheses. Second, HOV³ avoids the arbitrariness and randomness of most existing visual techniques for cluster exploration, for example Star Coordinates [Kan01] and its implementations, such as VISTA/iVIBRATE [ChL04, ChL06]. As a consequence, HOV³ provides data miners with a purposeful and effective visual method for cluster analysis [ZOZ06b].

Based on the quantified measurement feature of HOV³, we propose a visual external cluster validation model to verify the consistency of cluster structures. Compared with statistics-based external cluster validation methods, we show that the HOV³ based external cluster validation model is more intuitive and effective [ZOZ07a]. We also introduce a visual approach called M-HOV³/M-mapping to enhance the visual separation of clusters [ZOZ07b]. With the above features of HOV³, a prediction-based visual approach is proposed to explore and verify clusters [ZOZ07c, ZOZ07d]. The next chapter presents the contributions of this thesis in more detail.

This thesis is structured as follows: Chapter 2 summarizes the contributions of this thesis. Chapter 3 gives an introduction to clustering, clustering algorithms and cluster validation. Chapter 4 reviews related work on high dimensional data visualization and the visual techniques that have been used in cluster analysis. Finally, Chapter 5 summarises the work in the thesis and discusses future work.

CHAPTER 2 CONTRIBUTIONS

This is a publication-based thesis. Its main contributions have been published in the proceedings of five international conferences. A follow-up report has also been submitted to an established journal for further publication. Below, we summarise the contributions of the thesis in the chronological order of those publications:

1. Hypothesis Oriented Verification and Validation by Visualization (HOV³) model: To fill the gap between imprecise cluster detection by visualization and the unintuitive results often obtained by clustering algorithms, a novel visual projection technique called Hypothesis Oriented Verification and Validation by Visualization, HOV³, is proposed [ZOZ+06, ZOZ06]. The aim of interactive visualization in cluster exploration and rendering is to help data miners obtain visually separated groups or a fully separated clustering result. For example, Star Coordinates and its extensions provide such interaction by tuning the weight value of each axis (axis scaling in Star Coordinates [Kan01], α-adjustment in VISTA/iVIBRATE [ChL04, ChL06]), but their arbitrary and random adjustments limit their applicability. HOV³ generalizes these adjustments as a coefficient/measure vector [ZOZ06]. Compared with the Star Coordinates model and its implementations, such as VISTA/iVIBRATE, it is observed that HOV³ performs better on cluster detection. This is because HOV³ provides data miners with a mechanism to quantify their knowledge or hypotheses as measure vectors for precisely exploring grouping information.

As a consequence, HOV³ provides a bridge between qualitative analysis and quantitative analysis. Based on the idea of obtaining group clues by contrasting a dataset against quantified measures, HOV³ synthesizes the feedback from exploration discovery and the user's domain knowledge to produce quantified measures, and then projects the test dataset against the measures. Geometrically, HOV³ reveals the data distribution against the measures in visual form. This approach not only inherits the intuitive and easily understood features of visualization, but also avoids the weaknesses of randomness and arbitrary exploration of the existing visual methods employed in data mining.

[ZOZ+06] K-B. Zhang, M. A. Orgun, K. Zhang and Y. Zhang, Hypothesis Oriented Cluster Analysis in Data Mining by Visualization, Proceedings of the working conference on Advanced visual interfaces 2006 (AVI06), May 23-26, 2006, Venezia, Italy. ACM Press, pp (2006)

[ZOZ06] K-B. Zhang, M. A. Orgun, K. Zhang, HOV³: An Approach for Visual Cluster Analysis, Proceedings of The 2nd International Conference on Advanced Data Mining and Applications (ADMA 2006), Xi'an, China, August 14-16, 2006, Lecture Notes in Computer Science, Volume 4093, Springer Press, pp (2006)

2. An Algorithm for External Cluster Validation based on Data Distribution Matching: This part of the work starts with the assumption that "if two same-sized data sets have a similar cluster structure, then after applying a linear transformation to the data sets, the similarity of the newly produced distributions of the two sets would still be high". With the quantified measurement feature of HOV³, an external cluster validation method based on distribution matching is proposed to verify the consistency of cluster structures between a clustered subset and non-clustered subsets of a large dataset [ZOZ07a]. In this approach, a clustered subset of a dataset is chosen as a visual model, and the similarity of cluster structures between the model and other same-sized non-clustered subsets of the dataset is verified by projecting them together in HOV³. As a consequence, the user can utilize the well-separated clusters produced by scaling axes in HOV³ as a model to pick out their corresponding quasi-clusters, i.e., the points that overlap the clusters. In addition, instead of using statistical methods to assess the similarity between the two subsets, this approach simply computes the overlapping rate between the clusters and their quasi-clusters to show their consistency. The experiments show that when the HOV³ based external cluster validation method is introduced into cluster analysis, it yields more effective cluster validation results than those obtained from pure clustering algorithms, for example K-means, combined with statistics-based validation methods [ZOZ07a].

[ZOZ07a] K-B. Zhang, M. A. Orgun and K. Zhang, A Visual Approach for External Cluster Validation, Proceedings of the first IEEE Symposium on Computational Intelligence and Data Mining (CIDM2007), Honolulu, Hawaii, USA, April 1-5, 2007, IEEE Press, pp (2007)

3. M-HOV³/M-mapping, Enhanced the Separation of Clusters: To visually separate overlapping clusters, an approach called M-HOV³/M-mapping is introduced to enhance the separation of clusters by HOV³ [ZOZ07b, ZOZ07c]. Technically, if it is observed that several groups of data points can be roughly separated (with ambiguous points remaining between groups) by applying a measure vector to a dataset in HOV³, then applying M-HOV³/M-mapping with the same measure vector to the dataset would make the groups more contracted and better separated. These features of M-HOV³/M-mapping are significant for identifying the membership formation of clusters in the process of cluster exploration and cluster verification. This is because the contracting feature of M-HOV³/M-mapping keeps the data points within a cluster relatively closer, i.e., grouping information is preserved. At the same time, the enhanced separation feature of M-HOV³/M-mapping pushes distant data points relatively further apart. With the advantage of the enhanced separation and contraction features of M-HOV³/M-mapping, the user can identify the cluster number efficiently in the pre-processing stage of clustering, and can also verify the membership formation of data points among the clusters effectively in the post-processing stage of clustering.

[ZOZ07b] K-B. Zhang, M. A. Orgun and K. Zhang, Enhanced Visual Separation of Clusters by M-mapping to Facilitate Cluster Analysis, Proceedings of 9th International Conference series on Visual Information Systems (VISUAL 2007), June 28-29, 2007, Shanghai, China, Lecture Notes in Computer Science, Volume 4781, Springer Press, pp (2007)

4. Prediction-based Cluster Detection by HOV³: With the quantified measurement feature of HOV³ and the enhanced separation features of M-HOV³, users can not only summarise their historically explored knowledge about datasets as predictions, but also directly introduce abundant statistical measurements of the studied data as predictions to investigate cluster clues or to refine clustering results purposefully and effectively [ZOZ07c, ZOZ07d]. In fact, prediction-based cluster detection by statistical measurements in HOV³ leads to more purposeful cluster exploration, and it gives an easier geometrical interpretation of the data distribution. In addition, with the statistical predictions in HOV³ the user may even be able to expose cluster clues that are not easy to find by random cluster exploration.

[ZOZ07c] K-B. Zhang, M. A. Orgun and K. Zhang, A Prediction-based Visual Approach for Cluster Exploration and Cluster Validation by HOV³, Proceedings of 18th European Conference on Machine Learning/11th European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2007), Warsaw, Poland, September 17-21, 2007, Lecture Notes in Computer Science, LNAI 4702, Springer Press, pp (2007)

5. Prediction-based Cluster Validation by HOV³: When high-dimensional data are mapped into two-dimensional space, there may be overlapping data and ambiguities in the visual form. Separating clusters from a large number of overlapping data points is therefore an aim of this thesis. Based on the work on M-HOV³/M-mapping and HOV³ with statistical measurements, the measures that result in fully separated clusters can be treated as predictions and introduced into External Cluster Validation based on Data Distribution Matching by HOV³. In principle, any linear transformation, even a complex linear transformation, can be employed in HOV³ if it separates clusters well. With well-separated clusters, we can improve the efficiency of external cluster validation by HOV³ [ZOZ07c, ZOZ07d].

[ZOZ07d] K-B. Zhang, M. A. Orgun and K. Zhang, Predictive Hypothesis Oriented Cluster Analysis by Visualization, Journal of Data Mining and Knowledge Discovery (submitted)
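
The measure-vector idea behind the first contribution can be illustrated with a minimal sketch. It projects min-max normalized n-dimensional points onto the 2D plane in the Star Coordinates style, with axis k weighted by entry k of a measure vector. The function names, the normalization, and the even spacing of the axes on the unit circle are assumptions made for illustration only; the precise HOV³ formula is given in the cited publications, not in this chapter.

```python
import cmath
import math


def project(point, mins, maxs, measure):
    """Project one n-dimensional point to 2D, Star Coordinates style.

    Assumption: axis k points along exp(2*pi*i*k/n), and the normalized
    attribute value on that axis is scaled by measure[k].
    """
    n = len(point)
    z = 0j
    for k in range(n):
        span = maxs[k] - mins[k]
        # Min-max normalize to [0, 1]; a constant attribute contributes 0.
        norm = (point[k] - mins[k]) / span if span else 0.0
        z += norm * measure[k] * cmath.exp(2j * math.pi * k / n)
    return (z.real, z.imag)


def project_dataset(data, measure):
    """Project every row of `data` using per-attribute minima and maxima."""
    mins = [min(col) for col in zip(*data)]
    maxs = [max(col) for col in zip(*data)]
    return [project(row, mins, maxs, measure) for row in data]
```

Under these assumptions, changing one entry of `measure` stretches or shrinks the contribution of the corresponding attribute, which is the kind of axis adjustment that the text describes as being generalized into a measure vector.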

CHAPTER 3 CLUSTER ANALYSIS

Cluster analysis is an exploratory discovery process. It can be used to discover structures in data without providing an explanation or interpretation [JaD88]. Cluster analysis includes two major aspects: clustering and cluster validation. Clustering aims at partitioning objects into groups according to certain criteria. To serve different application purposes, a large number of clustering algorithms have been developed [JaD88, KaR90, JMF99, Ber06]. However, because no general-purpose clustering algorithm fits all kinds of applications, an evaluation mechanism is required to assess the quality of the clustering results produced by different clustering algorithms, or by one clustering algorithm with different parameters, so that the user can find a cluster scheme that fits a specific application. This quality assessment process is called cluster validation. Cluster analysis is an iterative process of clustering and cluster verification carried out by the user with the help of clustering algorithms, cluster validation methods, visualization and domain knowledge.

In this chapter, we give a review of cluster analysis as the background of this thesis. First we introduce clustering, clustering algorithms and their features, and also the drawbacks of these algorithms. This is followed by an introduction to cluster validation, the existing cluster validation methods, and the problems with the existing cluster validation approaches.

3.1 Clustering and Clustering Algorithms

Clustering is considered an unsupervised classification process [JMF99]. The clustering problem is to partition a dataset into groups (clusters) so that, by given criteria, the data elements within a cluster are more similar to each other than to data elements in different clusters. A large number of clustering algorithms have been developed for different purposes [JaD88, KaR90, JMF99, XuW05, Ber06]. Based on the strategy by which data objects are distinguished, clustering techniques can be broadly divided into two classes: hierarchical clustering techniques and partitioning clustering techniques [Ber02]. However, there is no clear boundary between these two classes. Some efforts have been made to combine different clustering methods for specific applications. Beyond the two traditional hierarchical and partitioning classes, several clustering techniques are categorized into independent classes, for example density-based methods, grid-based methods and model-based clustering methods [HaK01, Ber06, Pei]. A short review of these methods is given below.

3.1.1 Partitioning methods

Partitioning clustering algorithms, such as K-means [Mac67], K-medoids, PAM [KaR87], CLARA [KaR90] and CLARANS [NgH94], assign objects to k clusters (k being a predefined cluster number) and iteratively reallocate objects to improve the quality of the clustering results.

K-means is the most popular and easy-to-understand clustering algorithm [Mac67]. The main idea of K-means is summarised in the following steps:

1. Arbitrarily choose k objects to be the initial cluster centers/centroids;
2. Assign each object to the cluster associated with the closest centroid;
3. Compute the new position of each centroid as the mean value of the objects in its cluster; and
4. Repeat Steps 2 and 3 until the means are fixed.

Figure 3-1 presents an example of the process of the K-means clustering algorithm.
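
The four steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a production implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, the use of squared Euclidean distance, and the handling of empty clusters are all assumptions that the text does not specify.

```python
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster a list of equal-length numeric tuples into k groups."""
    rng = random.Random(seed)
    # Step 1: arbitrarily choose k objects as the initial centroids.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster of its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Step 3: move each centroid to the mean of the objects in its cluster.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # assumption: keep an empty cluster's centroid
        # Step 4: repeat Steps 2 and 3 until the means are fixed.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

Running the function with different `seed` values changes the initial centroids chosen in Step 1, which is a simple way to observe how the outcome depends on the starting point.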

Figure 3-1 An example of the clustering procedure of K-means [HaK01]

However, the K-means algorithm is very sensitive to the selection of the initial centroids; in other words, different initial centroids may produce significantly different clustering results. Another drawback of K-means is that there is no general theoretical solution for finding the optimal number of clusters for a given dataset. A simple solution would be to compare the results of multiple runs with different values of k and choose the best one according to a given criterion, but when the data size is large, running K-means multiple times and comparing the clustering results after each run would be very time consuming.

Instead of using the mean value of the data objects in a cluster as the center of the cluster, K-medoids, a variation of K-means, calculates the medoid of the objects in each cluster. The process of the K-medoids algorithm is otherwise quite similar to that of K-means. However, the K-medoids clustering algorithm is very sensitive to outliers, which can seriously influence clustering results. To solve this problem, some efforts have been made based on K-medoids; for example, PAM (Partitioning Around Medoids) was proposed by Kaufman and Rousseeuw [KaR87]. PAM inherits the features of the K-medoids clustering algorithm and adds a medoid swap mechanism to produce better clustering results. PAM is more robust than K-means in terms of handling noise and outliers, since the medoids in PAM are less influenced by outliers. With a computational cost of $O(k(n-k)^2)$ for each swap iteration (where k is the number of clusters and n is the number of items in the dataset), it is clear that PAM performs well only on small datasets and does not scale well to large datasets.

In practice, PAM is embedded in statistical analysis systems such as SAS, R and S+, where large datasets are handled by CLARA (Clustering LARge Applications) [KaR90]. By applying PAM to multiple sampled subsets of a dataset, CLARA can produce better clustering results than PAM on larger datasets. However, the efficiency of CLARA depends on the sample size. Moreover, a locally optimal clustering of the samples may not be the global optimum for the whole dataset.

Ng and Han [NgH94] abstract the medoid search in PAM or CLARA as a search for subgraphs in a graph of n points and, based on this understanding, propose a PAM-like clustering algorithm called CLARANS (Clustering Large Applications based upon RANdomized Search). While PAM searches the whole graph and CLARA searches some random sub-graphs, CLARANS randomly samples a set and selects medoids by climbing sub-graph mountains. CLARANS selects the neighboring objects of medoids as candidates for new medoids. It samples subsets multiple times to verify medoids and so avoid bad samples. Obviously, this repeated sampling for medoid verification is time consuming, which prevents CLARANS from clustering very large datasets in an acceptable time period.

3.1.2 Hierarchical methods

CHAPTER 3. CLUSTER ANALYSIS

Hierarchical clustering algorithms organize objects into tree-structured clusters, i.e., a cluster can contain data points or representatives of lower-level clusters [HaK01]. Hierarchical clustering algorithms can be classified into two categories according to their clustering process: agglomerative and divisive. Both processes are shown in Figure 3-2.

Figure 3-2 Hierarchical Clustering Process [HaK01]

Agglomerative: one starts with each of the units in a separate cluster and ends up with a single cluster that contains all units.

Divisive: one starts with a single cluster of all units and then forms new clusters by dividing those that were determined at previous stages, until one ends up with clusters containing individual units.

AGNES (Agglomerative Nesting) adopts the agglomerative strategy to merge clusters [KaR90]. AGNES treats each object as a cluster at the beginning and then merges clusters step by step into upper-level clusters according to a given agglomerative criterion, until all objects form a single cluster, as shown in Figure 3-2. The similarity between two clusters is measured by the similarity of the closest pair of data points in the two clusters, i.e., single link. DIANA (Divisive ANAlysis) adopts the opposite strategy: it initially puts all objects in one cluster and then splits them level by level until each cluster contains only one object [KaR90]. The merging/splitting decisions are critical in AGNES and DIANA. Moreover, with a computational cost of O(n²), they do not scale to very large datasets.

Zhang et al. [ZRL96] proposed an effective hierarchical clustering method to deal with the above problems, BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies). BIRCH summarizes the entire dataset into a CF-tree, a multi-level compression structure, and then runs a hierarchical clustering algorithm on the CF-tree to obtain the clustering result. It scales linearly: a single scan yields a good clustering, and the quality can be further improved by a few additional scans. It is an efficient clustering method on arbitrarily shaped clusters. But BIRCH is sensitive to the input order of data objects and can only deal with numeric data. This limits the stability of its clustering and its scalability in real-world applications.

CURE uses a set of representative points to describe the boundary of a cluster in its hierarchical algorithm [GRS98]. But as the complexity of cluster shapes increases, the number of representative points needed to maintain precision increases dramatically. CHAMELEON [KHK99] employs a multilevel graph partitioning algorithm on the k-Nearest Neighbour graph, which may produce better results than CURE on complex cluster shapes for spatial datasets. But the high complexity of the algorithm prevents its application to higher dimensional datasets.

Density-based methods

The primary idea of density-based methods is that, for each point of a cluster, the neighborhood of a given unit distance contains at least a minimum number of points, i.e., the density in the neighborhood should reach some threshold [EKS+96]. However, this idea is based on the assumption that the clusters have spherical or regular shapes. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) was proposed to adopt density-reachability and density-connectivity for handling arbitrarily shaped clusters and noise [EKS+96]. But DBSCAN is very sensitive to the parameters Eps (unit distance or radius) and MinPts (threshold density), both of which the user is expected to estimate before cluster exploration. DENCLUE (DENsity-based CLUstEring) is a distribution-based algorithm [HiK98], which performs well on clustering large datasets with high noise. It is also significantly faster than existing density-based algorithms, but DENCLUE needs a large number of parameters. OPTICS is good at investigating arbitrarily shaped clusters, but its non-linear complexity often makes it applicable only to small or medium datasets [ABK+99].

Grid-based methods

The idea of grid-based clustering methods is based on clustering-oriented query answering in multilevel grid structures. Each upper level stores a summary of the information of the level below it, so the grids form cells between the connected levels, as illustrated in Figure 3-3.

Figure 3-3 The grid-cell structure of grid-based clustering methods
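The multilevel summary idea can be made concrete with a minimal sketch. It assumes two-dimensional points, a per-cell point count as the only stored statistic, and a parent cell that covers 2 x 2 child cells; these choices and all function names are illustrative assumptions, and actual grid-based methods may store richer summaries per cell.

```python
from collections import Counter

def bottom_level_counts(points, cell_size):
    """Count the points that fall into each cell of the finest grid level."""
    counts = Counter()
    for x, y in points:
        counts[(int(x // cell_size), int(y // cell_size))] += 1
    return counts

def parent_level_counts(child_counts):
    """Summarize one level: a parent cell stores the sum of its 2 x 2 children."""
    parents = Counter()
    for (i, j), n in child_counts.items():
        parents[(i // 2, j // 2)] += n
    return parents

def build_grid_levels(points, cell_size, n_levels):
    """Return the per-cell counts of every level, finest level first."""
    levels = [bottom_level_counts(points, cell_size)]
    for _ in range(n_levels - 1):
        levels.append(parent_level_counts(levels[-1]))
    return levels

# Two small groups of points, far apart from each other (made-up data).
points = [(0.2, 0.3), (0.4, 0.1), (0.9, 0.8), (3.1, 3.7), (3.3, 3.2), (3.9, 3.6)]
levels = build_grid_levels(points, cell_size=1.0, n_levels=3)
for depth, level in enumerate(levels):
    print(depth, dict(level))
```

A query about a coarse region can then be answered from an upper level without touching the individual points, which is where the efficiency of this family of methods comes from. The result depends directly on `cell_size`.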

Many grid-based methods have been proposed, such as STING (Statistical Information Grid Approach) [WYM97], CLIQUE [AGG+98], and WaveCluster [SCZ98], which combines grid-based and density-based techniques. Grid-based methods are efficient, clustering data with a complexity of O(N). However, the primary issue of grid-based techniques is how to decide the size of the grid cells, which depends heavily on the user's experience.

Model-based clustering methods

Model-based clustering methods are based on the assumption that data are generated by a mixture of underlying probability distributions, and they optimize the fit between the data and some mathematical model, for example through statistical approaches, neural network approaches and other AI approaches. The typical techniques in this category are AutoClass [CKS+88], DENCLUE [HiK98] and COBWEB [Fis87]. When facing an unknown data distribution, choosing a suitable model from the model-based candidates is still a major challenge. In addition, clustering based on probability suffers from high computational cost, especially when the scale of data is very large.

Based on the above review, we can conclude that applying clustering algorithms to detect grouping information in real-world data mining applications is still a challenge, primarily because most existing clustering algorithms cope inefficiently with arbitrarily shaped data distributions in extremely large and high-dimensional datasets. Extensive survey papers on clustering techniques can be found in the literature [JaD88, KaR90, JMF99, EML01, XuW05, Ber06].
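To illustrate the mixture assumption behind model-based clustering, the following sketch fits a two-component, one-dimensional Gaussian mixture with the standard expectation-maximization (EM) procedure. EM is used here only as a familiar example of optimizing the fit between data and a model; it is not a reproduction of AutoClass, DENCLUE or COBWEB, and the data, the initialization and the function names are assumptions made for the illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_two_gaussians(data, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture by EM; returns weights, means, variances."""
    n = len(data)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    means = [min(data), max(data)]        # deterministic initial guess (assumption)
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            p = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate the parameters from the weighted data.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # floor keeps the variance positive
    return weights, means, variances

# Made-up sample drawn around two well-separated centres.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances = fit_two_gaussians(data)
print(weights, means, variances)
```

Each iteration touches every point for every component, which hints at the computational cost noted above. The sketch also fixes both the number of components and the Gaussian form in advance, which is exactly the model-selection difficulty that arises when the data distribution is unknown.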

3.2 Cluster Validation

A large number of clustering algorithms have been developed to deal with specific applications [JMF99]. Several questions arise: which clustering algorithm is best suited to the application at hand? How many clusters are there in the studied data? Is there a better cluster scheme? These questions concern the evaluation of the quality of clustering results, that is, cluster validation. Cluster validation is a procedure for assessing the quality of clustering results and finding a cluster strategy that fits a specific application. It aims at finding the optimal cluster scheme and interpreting the cluster patterns [HBV02].

Cluster validation is an indispensable part of cluster analysis, because no clustering algorithm can guarantee the discovery of genuine clusters from real datasets, and because different clustering algorithms often impose different cluster structures on a data set even if there is no cluster structure present in it [Gor98] [Mil96]. Cluster validation is needed in data mining to solve the following problems [HCN01]:

1. To measure a partition of a real data set generated by a clustering algorithm.
2. To identify the genuine clusters from the partition.
3. To interpret the clusters.

Generally speaking, cluster validation approaches are classified into the following three categories: internal approaches, relative approaches and external approaches [ALA+03]. We give a short introduction to cluster validation methods as follows.

Internal criteria

Internal cluster validation evaluates the quality of clusters with statistics that are devised to capture the quality of the induced clusters using the available data objects only [VSA05]. In other words, internal cluster validation excludes any information beyond the clustering data and assesses the quality of clusters based on the clustering data themselves. Statistical methods of quality assessment are employed as internal criteria: for example, root-mean-square standard deviation (RMSSTD) is used for the compactness of clusters [Sha96]; R-squared (RS) for the dissimilarity between clusters; and S_Dbw for a compound evaluation of compactness and dissimilarity [HaV01]. The formulas of RMSSTD, RS and S_Dbw are shown below.

$$ RMSSTD = \sqrt{\frac{\sum_{i=1}^{n_c} \sum_{j=1}^{d} \sum_{k=1}^{n_{ij}} (x_k - \bar{x}_j)^2}{\sum_{i=1}^{n_c} \sum_{j=1}^{d} (n_{ij} - 1)}} \tag{3-1} $$

where $\bar{x}_j$ is the expected value in the jth dimension; $n_{ij}$ is the number of elements in the ith cluster in the jth dimension; $n_j$ is the number of elements in the jth dimension in the whole data set; $n_c$ is the number of clusters; and $d$ is the number of dimensions.

$$ RS = \frac{SS_t - SS_w}{SS_t} \tag{3-2} $$

where

$$ SS_t = \sum_{j=1}^{d} \sum_{k=1}^{n_j} (x_k - \bar{x}_j)^2, \qquad SS_w = \sum_{i=1}^{n_c} \sum_{j=1}^{d} \sum_{k=1}^{n_{ij}} (x_k - \bar{x}_j)^2 \tag{3-3} $$

The formula of S_Dbw is given as:

$$ S\_Dbw = Scat(c) + Dens\_bw(c) \tag{3-4} $$

where Scat(c) is the average scattering within the c clusters. Scat(c) is defined as:

$$ Scat(c) = \frac{1}{c} \sum_{i=1}^{c} \frac{\lVert \sigma(v_i) \rVert}{\lVert \sigma(S) \rVert} \tag{3-5} $$

The value of Scat(c) is the degree to which the data points are scattered within clusters. It reflects the compactness of clusters. The term $\sigma(S)$ is the variance of the data set, and the term $\sigma(v_i)$ is the variance of cluster $c_i$.

Dens_bw(c) indicates the average number of points between the c clusters (i.e., an indication of inter-cluster density) in relation to the density within clusters. The formula of Dens_bw is given as:

$$ Dens\_bw(c) = \frac{1}{c(c-1)} \sum_{i=1}^{c} \sum_{j=1, j \neq i}^{c} \frac{density(u_{ij})}{\max\{density(v_i), density(v_j)\}} \tag{3-6} $$

where $u_{ij}$ is the middle point of the distance between the centres of the clusters, $v_i$ and $v_j$. The density function of a point is defined as the number of points around that point within the given radius.

Relative criteria

Relative assessment compares two structures and measures their relative merit. The idea is to run the clustering algorithm for a range of parameter values (e.g., for each possible number of clusters) and identify the clustering scheme that best fits the dataset [ALA+03]. That is, relative criteria assess the clustering results by applying an algorithm with different parameters to a data set and finding the optimal solution. In practice, relative criteria methods also use RMSSTD, RS and S_Dbw to find the best cluster scheme in terms of compactness and dissimilarity among all the clustering results. Relative cluster validity is also called cluster stability, and recent work on relative cluster validity is presented in [KeC00, LeD01, BEG02, RBL+02, BeG03].

External criteria

The results of a clustering algorithm are evaluated based on a pre-specified structure, which reflects the user's intuition about the clustering structure of the data set [HKK05]. As a necessary post-processing step, external cluster validation is a hypothesis-testing procedure: given a set of class labels produced by a cluster scheme, it compares them with the clustering results obtained by applying the same cluster scheme to the other partitions of a database, as shown in Figure 3-4.

Figure 3-4 External criteria based validation [ZOZ07a]

External cluster validation is based on the assumption that an understanding of the output of the clustering algorithm can be achieved by finding a resemblance of the clusters with existing classes [Dom01], [KDN+96], [Ran71], [FoM83], [MSS83]. Statistical methods for quality assessment are employed in external cluster validation, such as the Rand statistic [Ran71], the Jaccard coefficient [Jac08], the Fowlkes and Mallows index [MSS83], Hubert's Γ statistic and the normalized Γ statistic [Th99], and the Monte Carlo method [Mil81], to measure the similarity between the a priori modelled partitions and the clustering results of a dataset.

Extensive surveys on cluster validation can be found in the literature [JaD88, Mil96, Gor98, JMF99, Th99, HBV01, HBV02, HKK05].

3.3 The issues of cluster analysis

From the survey of cluster analysis above, it is clear that there are two major drawbacks that influence the feasibility of cluster analysis in real-world data mining applications. The first is the weakness of most existing automated clustering algorithms in dealing with arbitrarily shaped data distributions. The second is that evaluating the quality of clustering results by statistics-based methods is time consuming when the database is large, primarily because of the very high computational cost of statistics-based methods for assessing the consistency of cluster structure between the sampling subsets. Statistics-based cluster validation methods therefore do not scale well to very large datasets. Arbitrarily shaped clusters also make the traditional statistical cluster validity indices ineffective, which makes it difficult to determine the optimal cluster structure [HBV02].

In addition, the inefficiency of clustering algorithms in handling arbitrarily shaped clusters in extremely large datasets directly affects cluster validation, because cluster validation is based on the analysis of the clustering results produced by clustering algorithms.

Moreover, most of the existing clustering algorithms tend to handle the entire clustering process automatically, i.e., once the user sets the parameters of an algorithm, the clustering result is produced with no interruption, which excludes the user until the end. As a result, it is very hard to incorporate the user's domain knowledge into the clustering process. Cluster analysis is an iterative process of multiple runs; without any user domain knowledge, it would be inefficient and unintuitive to satisfy the specific requirements of application tasks in clustering.

Visualization techniques have proven to be of high value in exploratory data analysis and data mining [Shn01]. Therefore, the introduction of domain experts' knowledge supported by visualization techniques is a good remedy for those problems. A detailed review of visualization techniques used in cluster analysis is presented in the next chapter.
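As a concrete illustration of the external criteria of Section 3.2, and of why statistics-based validation becomes costly on large datasets, the following sketch computes the Rand statistic and the Jaccard coefficient by counting pairs of objects. The pair-counting formulation is the standard one for these two indices; the function names and the example labelings are assumptions made for the illustration.

```python
from itertools import combinations

def pair_counts(clustering, classes):
    """Classify every pair of objects by whether both labelings group it together.

    ss: same cluster and same class      sd: same cluster, different class
    ds: different cluster, same class    dd: different cluster and different class
    """
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(clustering)), 2):
        same_cluster = clustering[i] == clustering[j]
        same_class = classes[i] == classes[j]
        if same_cluster and same_class:
            ss += 1
        elif same_cluster:
            sd += 1
        elif same_class:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd

def rand_statistic(clustering, classes):
    """Fraction of pairs on which the two labelings agree."""
    ss, sd, ds, dd = pair_counts(clustering, classes)
    return (ss + dd) / (ss + sd + ds + dd)

def jaccard_coefficient(clustering, classes):
    """Like the Rand statistic, but ignoring pairs separated in both labelings.

    Undefined (division by zero) when no pair is grouped together by either labeling.
    """
    ss, sd, ds, _ = pair_counts(clustering, classes)
    return ss / (ss + sd + ds)

clustering = [0, 0, 0, 1, 1, 1]   # labels produced by a clustering algorithm
classes = [0, 0, 1, 1, 1, 1]      # pre-specified structure
print(pair_counts(clustering, classes))
print(rand_statistic(clustering, classes), jaccard_coefficient(clustering, classes))
```

For six objects the loop visits 15 pairs; for n objects it visits n(n-1)/2, so the cost of this naive formulation grows quadratically with the size of the dataset.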

CHAPTER 4 VISUAL CLUSTER ANALYSIS

As described in the last chapter, most of the existing automated clustering algorithms suffer in terms of efficiency and effectiveness when dealing with arbitrarily shaped cluster distributions of extremely large and multidimensional datasets [AAP+03]. Baumgartner et al. [BPR+04] concluded that "In high dimensional space, traditional clustering algorithms tend to break down in terms of efficiency as well as accuracy because data do not cluster well anymore." Another obstacle to the application of cluster analysis in data mining is the high computational cost of statistics-based cluster validation methods [HCN01]. These drawbacks limit the usability of clustering algorithms in real-world data mining applications.

To mitigate the above problems, visualization has been introduced into cluster analysis. As Card et al. [CMS99] described, visualization is the use of computer-supported, interactive, visual representation of abstract data to amplify cognition. Visualization is considered one of the most intuitive methods for cluster detection and validation, and it performs especially well in representing irregularly shaped clusters.

Visual data mining is the use of visualization techniques to allow data miners and analysts to evaluate, monitor, and guide the inputs, products and process of data mining [GHK+96]. As a branch of visual data mining, visual cluster analysis is a combination of information visualization and cluster analysis techniques. In the cluster analysis process, visualization provides analysts with intuitive feedback on data distribution and supports decision-making activities. As a consequence, visual presentations can be very powerful in revealing trends, highlighting outliers, showing clusters, and exposing gaps in data [Shn01].

Introducing visualization techniques to explore and understand high-dimensional datasets is becoming an efficient way to combine human intelligence with the immense brute-force computation power available nowadays [PGW03]. A large number of visualization techniques have been developed to map multidimensional datasets to two- or three-dimensional space [WTP+95, AhW95, HKW99, RSE99, RKJ+99, ABK+99, AEK00, SGB00, MRC02, FGW02, Shn01, PGW03, LKS+04]. In practice, most of them simply treat information visualization as a layout problem; therefore, they are not suitable for visualizing clusters of very large datasets.

4.1 Multidimensional Data Visualization

Much effort has been devoted to multidimensional (d>3) data visualization [OlL03]. However, most of those visual approaches have difficulty in dealing with high dimensional and very large datasets. We give a more detailed discussion of them as follows.

Icon-based techniques

Icon-based presentations are relatively older techniques for visual data mining. The idea of icon-based techniques is to map each multidimensional data item to an icon, for example [Pic70, Che73, Bed90, Lev91, KeK94, Hea95]. We explain several popular techniques below.

Chernoff Faces

A well-known iconic approach is Chernoff faces [Che73]. The Chernoff face uses two dimensions of the multidimensional data to locate the face position in the two display dimensions. The remaining dimensions are mapped to the properties of the face icon, i.e., the shape of the nose, mouth and eyes, and the shape of the face itself, as shown in Figure 4-1.
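The mapping behind Chernoff faces can be sketched as follows. The sketch only computes the parameters of each face and does not draw anything; the min-max scaling, the feature names and the sample records are assumptions made for the illustration and are not taken from [Che73].

```python
# Hypothetical names for the facial properties listed in the text.
FEATURES = ["nose_shape", "mouth_shape", "eye_shape", "face_shape"]

def min_max_normalize(rows):
    """Scale every column of `rows` to [0, 1]; a constant column maps to 0.5."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.5
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def to_chernoff_faces(rows):
    """Map each record to a face position plus facial-feature parameters.

    The first two dimensions become the display position; the remaining
    dimensions are assigned to FEATURES in order. Dimensions beyond the
    available features are silently dropped by zip, which reflects the
    limited capacity of the icon.
    """
    faces = []
    for norm in min_max_normalize(rows):
        face = {"x": norm[0], "y": norm[1]}
        face.update(zip(FEATURES, norm[2:]))
        faces.append(face)
    return faces

# Three made-up six-dimensional records.
rows = [
    [1.0, 10.0, 0.0, 5.0, 2.0, 100.0],
    [3.0, 20.0, 1.0, 5.0, 4.0, 300.0],
    [2.0, 30.0, 0.5, 5.0, 3.0, 200.0],
]
faces = to_chernoff_faces(rows)
for face in faces:
    print(face)
```

Each record yields one icon, so the display holds only as many records as there is room for separately readable faces.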

Chernoff face visualization capitalizes on the human sensitivity to faces and facial features. However, the number of data items that can be visualized using the Chernoff face technique is quite limited.

Figure 4-1. An example of Chernoff-Faces

Stick Figures

Another famous icon-based technique uses stick figures to visualize larger amounts of data, so that an adequate number of data items can be presented for data mining purposes [Pic70][PiG88]. The stick figures technique uses two dimensions as the display dimensions, and the other dimensions are mapped to the angles and lengths of the limbs of the stick figure icon, as illustrated in Figure 4-2a. Different stick figure icons with variable dimensionality may be used, as shown in Figure 4-2b.

a. Stick Figure Icon b. A Family of Stick Figures

Figure 4-2. Stick Figure Visualization Technique
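The geometry of a single stick-figure icon can be sketched as below. The layout is a simplifying assumption rather than the icon of [Pic70] or [PiG88]: a vertical body is centred on the display position, limbs are attached alternately to its top and bottom, each remaining attribute (already scaled to [0, 1]) sets one limb angle, and limb lengths are kept fixed although the technique also allows them to vary.

```python
import math

def stick_figure(x, y, limb_values, body_length=1.0, limb_length=0.6):
    """Return the line segments ((x0, y0), (x1, y1)) of one stick-figure icon."""
    top = (x, y + body_length / 2.0)
    bottom = (x, y - body_length / 2.0)
    segments = [(bottom, top)]              # the body
    for index, value in enumerate(limb_values):
        anchor = top if index % 2 == 0 else bottom
        angle = value * math.pi             # [0, 1] maps to [0, 180] degrees
        end = (anchor[0] + limb_length * math.cos(angle),
               anchor[1] + limb_length * math.sin(angle))
        segments.append((anchor, end))
    return segments

# One record: display position (2.0, 3.0) and four scaled attribute values.
icon = stick_figure(2.0, 3.0, [0.0, 0.5, 1.0, 0.25])
for segment in icon:
    print(segment)
```

When many such icons are drawn side by side, records with similar attribute values produce similar limb orientations, and it is the resulting texture, rather than any single icon, that the viewer is expected to read.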

Figure 4-3 shows 1980 United States census data visualized by the stick figure visualization technique; the census data have five dimensions. In Figure 4-3, income and age are used as the display space, and the other attributes (occupation, education level, marital status and sex) are visualized by the stick figures. However, it can be observed in Figure 4-3 that the user cannot easily understand and interpret the graph of stick figures. The user needs good training in advance.

Figure 4-3. Stick Figure Visualization of the Census Data

Many other icon-based systems have also been proposed, such as Shape-Coding [Bed90], Color Icons [Lev91, KeK94], and TileBars [Hea95]. Icon-based techniques can display multidimensional properties of data. However, as the amount of data increases, the user can hardly make sense of most properties of the data intuitively, because the user cannot focus on the details of each icon when the data scale is very large.

Pixel-oriented Techniques

Pixel-oriented visualization techniques map each attribute value of the data to a single colored pixel, yielding a display of as much information as possible at a time [KeK94, KKA95, Kei97, An01]. With this technique, each data value is mapped to a colored pixel, and the data values belonging to one attribute are presented in a separate window, as displayed in Figure 4-4. Pixel-oriented techniques use various colour mapping approaches, such as linear variation of brightness, maximum variation of hue (colour) and constant maximum saturation, to map each data value to a colored pixel and arrange the pixels adequately in limited space. Pixel-oriented techniques are powerful in providing an overview of large amounts of data while preserving the perception of small regions of interest. This feature makes them suitable for a variety of data mining tasks on extremely large databases.

Figure 4-4. Displaying attribute windows for data with six attributes

Keim [KeK94] presented the first pixel-oriented technique in the VisDB system, which has the capability to represent large amounts of multidimensional data with respect to a given query. As a result, users are able to refine their query based on the knowledge gathered from the visual representation of the data. Other pixel-oriented techniques have been developed, for example, Recursive Pattern Technique [KKA95], Circle Segments Technique [AKK96], Spiral [KeK94], Axes [Kei97], PBC [AEK00] and OPTICS [ABK+99]. They have been successfully applied in data exploration for high dimensional databases.
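The basic pixel-oriented mapping, one coloured pixel per attribute value and one window per attribute, can be sketched as follows. The sketch varies the hue while keeping saturation and brightness at their maximum; the hue range from red to blue, the simple row-by-row arrangement and the function names are assumptions made for the illustration, whereas the techniques cited above use more elaborate arrangements.

```python
import colorsys

def value_to_rgb(value, low, high):
    """Map a value in [low, high] to an RGB pixel by varying the hue only."""
    t = (value - low) / (high - low) if high > low else 0.0
    hue = t * 2.0 / 3.0                    # 0 is red, 2/3 is blue (assumed range)
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return (round(r * 255), round(g * 255), round(b * 255))

def attribute_window(values, width):
    """Arrange the pixels of one attribute row by row in a window `width` wide."""
    low, high = min(values), max(values)
    pixels = [value_to_rgb(v, low, high) for v in values]
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]

def attribute_windows(records, width):
    """Build one window per attribute; a record keeps the same pixel position in each."""
    return [attribute_window(list(column), width) for column in zip(*records)]

# Four made-up records with two attributes each.
records = [(0, 10), (5, 20), (10, 30), (5, 40)]
windows = attribute_windows(records, width=2)
for window in windows:
    print(window)
```

Because every record occupies the same position in every window, a region that stands out in one attribute window can be compared directly with the same region in the others.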


More information

Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is Clustering 15-381 Artificial Intelligence Henry Lin Modified from excellent slides of Eamonn Keogh, Ziv Bar-Joseph, and Andrew Moore What is Clustering? Organizing data into clusters such that there is

More information

Comparison of K-means and Backpropagation Data Mining Algorithms

Comparison of K-means and Backpropagation Data Mining Algorithms Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and

More information

GraphZip: A Fast and Automatic Compression Method for Spatial Data Clustering

GraphZip: A Fast and Automatic Compression Method for Spatial Data Clustering GraphZip: A Fast and Automatic Compression Method for Spatial Data Clustering Yu Qian Kang Zhang Department of Computer Science, The University of Texas at Dallas, Richardson, TX 75083-0688, USA {yxq012100,

More information

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

More information

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

Cluster Analysis: Basic Concepts and Algorithms

Cluster Analysis: Basic Concepts and Algorithms Cluster Analsis: Basic Concepts and Algorithms What does it mean clustering? Applications Tpes of clustering K-means Intuition Algorithm Choosing initial centroids Bisecting K-means Post-processing Strengths

More information

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College

More information

Categorical Data Visualization and Clustering Using Subjective Factors

Categorical Data Visualization and Clustering Using Subjective Factors Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,

More information

Going Big in Data Dimensionality:

Going Big in Data Dimensionality: LUDWIG- MAXIMILIANS- UNIVERSITY MUNICH DEPARTMENT INSTITUTE FOR INFORMATICS DATABASE Going Big in Data Dimensionality: Challenges and Solutions for Mining High Dimensional Data Peer Kröger Lehrstuhl für

More information

A Comparative Analysis of Various Clustering Techniques used for Very Large Datasets

A Comparative Analysis of Various Clustering Techniques used for Very Large Datasets A Comparative Analysis of Various Clustering Techniques used for Very Large Datasets Preeti Baser, Assistant Professor, SJPIBMCA, Gandhinagar, Gujarat, India 382 007 Research Scholar, R. K. University,

More information

USING SELF-ORGANIZING MAPS FOR INFORMATION VISUALIZATION AND KNOWLEDGE DISCOVERY IN COMPLEX GEOSPATIAL DATASETS

USING SELF-ORGANIZING MAPS FOR INFORMATION VISUALIZATION AND KNOWLEDGE DISCOVERY IN COMPLEX GEOSPATIAL DATASETS USING SELF-ORGANIZING MAPS FOR INFORMATION VISUALIZATION AND KNOWLEDGE DISCOVERY IN COMPLEX GEOSPATIAL DATASETS Koua, E.L. International Institute for Geo-Information Science and Earth Observation (ITC).

More information

Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool.

Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool. International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 9, Issue 8 (January 2014), PP. 19-24 Comparative Analysis of EM Clustering Algorithm

More information

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data

More information

Quality Assessment in Spatial Clustering of Data Mining

Quality Assessment in Spatial Clustering of Data Mining Quality Assessment in Spatial Clustering of Data Mining Azimi, A. and M.R. Delavar Centre of Excellence in Geomatics Engineering and Disaster Management, Dept. of Surveying and Geomatics Engineering, Engineering

More information

Clustering Techniques: A Brief Survey of Different Clustering Algorithms

Clustering Techniques: A Brief Survey of Different Clustering Algorithms Clustering Techniques: A Brief Survey of Different Clustering Algorithms Deepti Sisodia Technocrates Institute of Technology, Bhopal, India Lokesh Singh Technocrates Institute of Technology, Bhopal, India

More information

A Survey of Clustering Techniques

A Survey of Clustering Techniques A Survey of Clustering Techniques Pradeep Rai Asst. Prof., CSE Department, Kanpur Institute of Technology, Kanpur-0800 (India) Shubha Singh Asst. Prof., MCA Department, Kanpur Institute of Technology,

More information

Grid Density Clustering Algorithm

Grid Density Clustering Algorithm Grid Density Clustering Algorithm Amandeep Kaur Mann 1, Navneet Kaur 2, Scholar, M.Tech (CSE), RIMT, Mandi Gobindgarh, Punjab, India 1 Assistant Professor (CSE), RIMT, Mandi Gobindgarh, Punjab, India 2

More information

Information Visualization WS 2013/14 11 Visual Analytics

Information Visualization WS 2013/14 11 Visual Analytics 1 11.1 Definitions and Motivation Lot of research and papers in this emerging field: Visual Analytics: Scope and Challenges of Keim et al. Illuminating the path of Thomas and Cook 2 11.1 Definitions and

More information

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations

More information

How To Solve The Cluster Algorithm

How To Solve The Cluster Algorithm Cluster Algorithms Adriano Cruz adriano@nce.ufrj.br 28 de outubro de 2013 Adriano Cruz adriano@nce.ufrj.br () Cluster Algorithms 28 de outubro de 2013 1 / 80 Summary 1 K-Means Adriano Cruz adriano@nce.ufrj.br

More information

Visual Data Mining with Pixel-oriented Visualization Techniques

Visual Data Mining with Pixel-oriented Visualization Techniques Visual Data Mining with Pixel-oriented Visualization Techniques Mihael Ankerst The Boeing Company P.O. Box 3707 MC 7L-70, Seattle, WA 98124 mihael.ankerst@boeing.com Abstract Pixel-oriented visualization

More information

Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

More information

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA Prakash Singh 1, Aarohi Surya 2 1 Department of Finance, IIM Lucknow, Lucknow, India 2 Department of Computer Science, LNMIIT, Jaipur,

More information

K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 K-Means Cluster Analsis Chapter 3 PPDM Class Tan,Steinbach, Kumar Introduction to Data Mining 4/18/4 1 What is Cluster Analsis? Finding groups of objects such that the objects in a group will be similar

More information

Data Mining K-Clustering Problem

Data Mining K-Clustering Problem Data Mining K-Clustering Problem Elham Karoussi Supervisor Associate Professor Noureddine Bouhmala This Master s Thesis is carried out as a part of the education at the University of Agder and is therefore

More information

Data Preprocessing. Week 2

Data Preprocessing. Week 2 Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.

More information

Data Mining Process Using Clustering: A Survey

Data Mining Process Using Clustering: A Survey Data Mining Process Using Clustering: A Survey Mohamad Saraee Department of Electrical and Computer Engineering Isfahan University of Techno1ogy, Isfahan, 84156-83111 saraee@cc.iut.ac.ir Najmeh Ahmadian

More information

Data Mining for Knowledge Management. Clustering

Data Mining for Knowledge Management. Clustering Data Mining for Knowledge Management Clustering Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management Thanks for slides to: Jiawei Han Eamonn Keogh Jeff

More information

Cluster analysis Cosmin Lazar. COMO Lab VUB

Cluster analysis Cosmin Lazar. COMO Lab VUB Cluster analysis Cosmin Lazar COMO Lab VUB Introduction Cluster analysis foundations rely on one of the most fundamental, simple and very often unnoticed ways (or methods) of understanding and learning,

More information

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications

More information

DATA MINING TECHNIQUES AND APPLICATIONS

DATA MINING TECHNIQUES AND APPLICATIONS DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,

More information

A Two-Step Method for Clustering Mixed Categroical and Numeric Data

A Two-Step Method for Clustering Mixed Categroical and Numeric Data Tamkang Journal of Science and Engineering, Vol. 13, No. 1, pp. 11 19 (2010) 11 A Two-Step Method for Clustering Mixed Categroical and Numeric Data Ming-Yi Shih*, Jar-Wen Jheng and Lien-Fu Lai Department

More information

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS Mrs. Jyoti Nawade 1, Dr. Balaji D 2, Mr. Pravin Nawade 3 1 Lecturer, JSPM S Bhivrabai Sawant Polytechnic, Pune (India) 2 Assistant

More information

Visualization of large data sets using MDS combined with LVQ.

Visualization of large data sets using MDS combined with LVQ. Visualization of large data sets using MDS combined with LVQ. Antoine Naud and Włodzisław Duch Department of Informatics, Nicholas Copernicus University, Grudziądzka 5, 87-100 Toruń, Poland. www.phys.uni.torun.pl/kmk

More information

A Computational Framework for Exploratory Data Analysis

A Computational Framework for Exploratory Data Analysis A Computational Framework for Exploratory Data Analysis Axel Wismüller Depts. of Radiology and Biomedical Engineering, University of Rochester, New York 601 Elmwood Avenue, Rochester, NY 14642-8648, U.S.A.

More information

Visualization Techniques in Data Mining

Visualization Techniques in Data Mining Tecniche di Apprendimento Automatico per Applicazioni di Data Mining Visualization Techniques in Data Mining Prof. Pier Luca Lanzi Laurea in Ingegneria Informatica Politecnico di Milano Polo di Milano

More information

Proposed Application of Data Mining Techniques for Clustering Software Projects

Proposed Application of Data Mining Techniques for Clustering Software Projects Proposed Application of Data Mining Techniques for Clustering Software Projects HENRIQUE RIBEIRO REZENDE 1 AHMED ALI ABDALLA ESMIN 2 UFLA - Federal University of Lavras DCC - Department of Computer Science

More information

OPTICS: Ordering Points To Identify the Clustering Structure

OPTICS: Ordering Points To Identify the Clustering Structure Proc. ACM SIGMOD 99 Int. Conf. on Management of Data, Philadelphia PA, 1999. OPTICS: Ordering Points To Identify the Clustering Structure Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Jörg Sander

More information

A Survey of Clustering Algorithms for Big Data: Taxonomy & Empirical Analysis

A Survey of Clustering Algorithms for Big Data: Taxonomy & Empirical Analysis TRANSACTIONS ON EMERGING TOPICS IN COMPUTING, 2014 1 A Survey of Clustering Algorithms for Big Data: Taxonomy & Empirical Analysis A. Fahad, N. Alshatri, Z. Tari, Member, IEEE, A. Alamri, I. Khalil A.

More information

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 8/05/2005 1 What is data exploration? A preliminary

More information

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.

More information

Clustering Data Streams

Clustering Data Streams Clustering Data Streams Mohamed Elasmar Prashant Thiruvengadachari Javier Salinas Martin gtg091e@mail.gatech.edu tprashant@gmail.com javisal1@gatech.edu Introduction: Data mining is the science of extracting

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Clustering Methods in Data Mining with its Applications in High Education

Clustering Methods in Data Mining with its Applications in High Education 2012 International Conference on Education Technology and Computer (ICETC2012) IPCSIT vol.43 (2012) (2012) IACSIT Press, Singapore Clustering Methods in Data Mining with its Applications in High Education

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli (alberto.ceselli@unimi.it)

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Clustering Algorithms K-means and its variants Hierarchical clustering

More information

Interactive Data Mining and Visualization

Interactive Data Mining and Visualization Interactive Data Mining and Visualization Zhitao Qiu Abstract: Interactive analysis introduces dynamic changes in Visualization. On another hand, advanced visualization can provide different perspectives

More information

Forschungskolleg Data Analytics Methods and Techniques

Forschungskolleg Data Analytics Methods and Techniques Forschungskolleg Data Analytics Methods and Techniques Martin Hahmann, Gunnar Schröder, Phillip Grosse Prof. Dr.-Ing. Wolfgang Lehner Why do we need it? We are drowning in data, but starving for knowledge!

More information

DEA implementation and clustering analysis using the K-Means algorithm

DEA implementation and clustering analysis using the K-Means algorithm Data Mining VI 321 DEA implementation and clustering analysis using the K-Means algorithm C. A. A. Lemos, M. P. E. Lins & N. F. F. Ebecken COPPE/Universidade Federal do Rio de Janeiro, Brazil Abstract

More information

Dynamical Clustering of Personalized Web Search Results

Dynamical Clustering of Personalized Web Search Results Dynamical Clustering of Personalized Web Search Results Xuehua Shen CS Dept, UIUC xshen@cs.uiuc.edu Hong Cheng CS Dept, UIUC hcheng3@uiuc.edu Abstract Most current search engines present the user a ranked

More information

USING THE AGGLOMERATIVE METHOD OF HIERARCHICAL CLUSTERING AS A DATA MINING TOOL IN CAPITAL MARKET 1. Vera Marinova Boncheva

USING THE AGGLOMERATIVE METHOD OF HIERARCHICAL CLUSTERING AS A DATA MINING TOOL IN CAPITAL MARKET 1. Vera Marinova Boncheva 382 [7] Reznik, A, Kussul, N., Sokolov, A.: Identification of user activity using neural networks. Cybernetics and computer techniques, vol. 123 (1999) 70 79. (in Russian) [8] Kussul, N., et al. : Multi-Agent

More information

Outlier Detection in Clustering

Outlier Detection in Clustering Outlier Detection in Clustering Svetlana Cherednichenko 24.01.2005 University of Joensuu Department of Computer Science Master s Thesis TABLE OF CONTENTS 1. INTRODUCTION...1 1.1. BASIC DEFINITIONS... 1

More information