Gene Expression Data Clustering Analysis: A Survey


Gene Expression Data Clustering Analysis: A Survey

Sajid Nagi #, Dhruba K. Bhattacharyya *, Jugal K. Kalita
# Department of Computer Science, St. Edmund's College, Shillong, Meghalaya, India.
* Department of Computer Science and Engineering, Tezpur University, Napaam, Assam, India.
Department of Computer Science, University of Colorado, Colorado Springs, CO 80918, USA.

Abstract- The advent of DNA microarray technology has enabled biologists to monitor the expression levels (mRNA) of thousands of genes simultaneously. In this survey, we address various approaches to gene expression data analysis using clustering techniques, and we discuss the performance of existing clustering algorithms under each of these approaches. Since the proximity measure plays an important role in making a clustering technique effective, we briefly discuss various proximity measures. Finally, since evaluating the effectiveness of clustering techniques over gene data requires validity measures and data sources for numeric data, we discuss them as well.

Keywords- Gene expression data, proximity measure, coherent pattern, clustering, cluster validation

I. INTRODUCTION
Microarray technology has made it possible to simultaneously monitor the expression levels of many thousands of genes through two major types of microarray experiments: cDNA microarrays [1] and oligonucleotide arrays [2]. Although the two types of experiments follow different protocols, they share some common basic procedures. The analysis of gene expression is shown at a very high level in Fig. 1.

Fig. 1: Gene expression analysis pipeline: laboratory work (RNA, microarrays; sample preparation, hybridization, scanning) -> computer-aided data analysis (preprocessing, cluster analysis) -> further analysis (hypothesis generation, validation).

The basic steps executed in the process of gene expression data generation are as follows.
Chip manufacture: A microarray is a small chip consisting of a solid surface onto which tens of thousands of DNA molecules (probes) are attached in fixed grids. Each grid cell corresponds to a DNA sequence. Since thousands of DNA molecules are bonded to a single array, it is possible to measure the expression of many thousands of genes in parallel.
Sample preparation, labeling and hybridization: Two mRNA samples (a test sample and a control sample) are reverse-transcribed into cDNA (targets), labeled using either fluorescent dyes or radioactive isotopes, and then hybridized with the probes on the surface of the chip.
Washing: The slides are then washed to remove excess hybridization solution from the array so that only the DNA complementary to each of the features remains bound to the features on the array.
Image acquisition: The final step is to produce an image of the surface of the hybridized array; the chips are scanned to read the emitted signal intensity. The image of the microarray generated by the scanner is the raw data. Feature extraction software converts the image into numerical information that quantifies gene expression.

A. Pre-processing of gene expression data
A microarray experiment assesses a large number of DNA sequences under multiple conditions. These conditions may be a time series during a biological process or a collection of different tissue samples. In this survey, without making any distinction among the DNA sequences, we refer to them as genes. Similarly, we refer to all kinds of experimental conditions as samples. A gene expression dataset from a microarray experiment may be considered an n x l matrix M (as shown in Fig.
2), where M = {m_ij}: the rows represent the expression patterns of a set of n genes G = {g_1, ..., g_n}, the columns represent the expression profiles of a set of l samples S = {s_1, ..., s_l} at different conditions/time points, and each cell m_ij is the expression level of gene g_i (1 <= i <= n) on sample s_j (1 <= j <= l).

Fig. 2: A gene expression matrix: rows correspond to genes, columns to samples, and entry m_ij holds the expression level of gene g_i on sample s_j.

The microarray data thus generated are then cleaned, transformed and normalized to resolve errors, noise and bias introduced by the microarray experiments.

B. Coherent, Co-expressed and Co-regulated Genes
Two genes are said to co-express when they have similar expression patterns and may belong to the same or similar functional categories, whereas two genes are said to co-regulate when they are regulated by common transcription factors. Co-expressed genes in the same group are in all probability involved in the same cellular process. A strong expression pattern correlation between such genes indicates co-regulation. A coherent gene expression pattern (or coherent pattern, in short) characterizes the common trend of expression levels for a group of co-expressed genes. A coherent pattern is a template, and the expression profiles of the corresponding co-expressed genes match the pattern with small divergence. Coherent gene expression patterns may characterize important cellular processes and may also be involved in the regulating mechanisms of cells. Clustering techniques have proven useful in finding coherent or co-expressed gene patterns. However, as stated in [3], co-regulated genes have a very high chance of having similar functions, whereas co-expressed genes are not necessarily co-regulated and may not have similar functions.

C. Extraction of Interesting Patterns from Gene Expression Data Using Clustering Techniques
Clustering is the process of grouping data objects into a set of disjoint classes, called clusters, so that objects within a class have high similarity to each other, while objects in separate classes are more dissimilar. Clustering is an example of unsupervised classification. Classification refers to a procedure that assigns data objects to a set of classes; unsupervised means that clustering does not rely on predefined classes and training examples while classifying the data objects. As stated in [4], the various clustering approaches are partitional, hierarchical, density based, model based, grid based and soft computing. In this survey, we consider the use of these algorithms for gene-based clustering.

D. Proximity Measures
There are different methods [5] for quantifying similarity or dissimilarity among gene expression levels, described in terms of the distance between them in the high-dimensional space of gene expression measurements. A dissimilarity measure for any two genes g_i and g_j obeys the following properties.
i. The distance between any two profiles cannot be negative.
ii. The distance between a profile and itself must be zero.
iii. A zero distance between two profiles implies that the profiles are identical.
iv. The distance between profile g_i and profile g_j is the same as the distance between profile g_j and profile g_i, i.e., D(g_i, g_j) = D(g_j, g_i).

Different proximity measures: A microarray experiment compares genes from an organism under different development time points, conditions or treatments.
For an n-condition experiment, a single gene has an n-dimensional observation vector known as its gene expression profile. A proximity measure is a real-valued function that assigns a similarity (or dissimilarity) value to any pair of expression vectors. Therefore, to identify genes or samples that have similar expression profiles, selection of an appropriate proximity measure is essential. Some commonly used proximity measures can be found in [6]. Selection of a measure depends upon the type of data, its dimensionality and the approach used in the identification of coherent patterns.

E. Cluster Validity Measures
Cluster validity depends upon (i) the homogeneity of individual clusters, (ii) agreement with the ground truth of the clusters (available from domain knowledge), and (iii) the reliability of the clusters. A few validation measures commonly used in evaluating the effectiveness of techniques for gene expression data mining will be highlighted in this survey.

F. Paper Organization
The rest of the survey is organized as follows. Section II describes the formulation of the problem and its complexity. In Section III we discuss proximity measures for numeric data. Various generalized clustering approaches are reported in Section IV. In Section V, cluster validity measures are discussed, and Section VI discusses challenges and research issues. Section VII takes a brief look at some gene expression datasets to which gene-based clustering approaches have commonly been applied. Finally, we conclude in Section VIII.

II. PROBLEM FORMULATION
Gene expression data are generated by DNA chips and other microarray techniques, and they are often presented as matrices of expression levels of genes under different conditions (including environments, individuals and tissues). One of the major objectives of gene expression data analysis is to identify groups of genes having similar expression patterns over the full space or subspace of conditions.
It may result in the discovery of regulatory patterns or condition similarities. Generally, co-expressed genes, which are members of the same clusters, are expected to have similar functions. Most researchers address this issue by (i) using partitional, hierarchical, density-based, model-based or subspace clustering algorithms, based on the proximity between genes or conditions in the expression matrix, or (ii) giving equal weights to all conditions in the computation of gene similarity (and to all genes in the computation of condition similarity). However, if the proximity measure is not properly selected, it may lead to the discovery of some similar groups of genes at the expense of obscuring other similar groups. Hence, it may be necessary to recover information lost due to over-simplification of the similarity and grouping computation, which may reveal the involvement of a gene or a condition in more than one group.

III. EXISTING PROXIMITY MEASURES FOR NUMERIC DATA
Distance or similarity measures are essential for many pattern recognition problems such as classification, clustering and retrieval. From the scientific and mathematical points of view, distance is defined as a quantitative degree of how far apart two objects are [3]. A synonym for distance is dissimilarity. Distance measures satisfying the metric properties are simply called metrics, while other, non-metric distance measures are occasionally called divergences. Synonyms for similarity include proximity, and similarity measures are often called similarity coefficients. The choice of proximity measure depends upon the type of data, its dimensionality and the approach used in the identification of coherent patterns. We reproduce from [6] a partial list of proximity measures for numeric data, since gene expression data are mostly available as numeric data.
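As a concrete illustration of two of the most common measures listed below, Euclidean distance and a Pearson-correlation-based dissimilarity (1 - r) can be computed in a few lines of pure Python. The two expression profiles here are illustrative toy values, not real microarray data:

```python
import math

def euclidean(x, y):
    # Minkowski distance with n = 2
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_dissimilarity(x, y):
    # 1 - r, where r is the Pearson correlation coefficient;
    # insensitive to the magnitude of the two profiles
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

# Two toy expression profiles over 4 samples: same trend, different scale
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [2.0, 4.0, 6.0, 8.0]

print(euclidean(g1, g2))              # large: the scales differ
print(pearson_dissimilarity(g1, g2))  # near zero: identical trend
```

The contrast shows why measure selection matters: the two profiles are far apart under Euclidean distance but essentially identical under the correlation-based measure, which is often the more biologically meaningful notion of co-expression.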

TABLE I: PROXIMITY MEASURES FOR NUMERIC DATA

Minkowski distance: D(X, Y) = (sum_i |x_i - y_i|^n)^(1/n). Metric. Invariant to translation and rotation only for n = 2 (Euclidean distance). Features with large values and variances tend to dominate the other features.

Euclidean distance: special case of the Minkowski metric at n = 2. The most commonly used metric. Tends to form hyperspherical clusters.

City-block distance: special case of the Minkowski metric at n = 1. Tends to form hyper-rectangular clusters.

Sup distance: special case of the Minkowski metric as n tends to infinity.

Mahalanobis distance: D_M = (X_i - Y_i)^T S^{-1} (X_i - Y_i), where S is the within-group covariance matrix, calculated over all objects. Invariant to any nonsingular linear transformation. Tends to form hyper-ellipsoidal clusters. When features are uncorrelated, the squared Mahalanobis distance is equivalent to the squared Euclidean distance. May be computationally expensive.

Pearson correlation: D_P(X, Y) = 1 - r(X, Y), where r is the Pearson correlation coefficient. Not a metric. Unable to detect the magnitude of differences between two variables.

The effective use of any of these proximity measures (Table I) is highly influenced by the number of data records/instances, their dimensionality, the level of precision required and the nature of the analysis.

IV. GENE-BASED CLUSTERING APPROACHES
In this section, we discuss the problem of clustering genes based on their expression patterns. The purpose of gene-based clustering is to group together co-expressed genes, which indicates co-function and co-regulation. We first review a series of clustering algorithms belonging to the various approaches reported in Section I-C that have been applied to group genes. Finally, we present the issues involved in gene-based clustering. A.
Partitioning Approach
Clustering techniques belonging to this approach fall into two major sub-categories: centroid-based, where each cluster is represented by the gravity centre (mean) of its instances, and medoid-based, where each cluster is represented by the instance closest to the gravity centre. K-means [7] is a simple and fast centroid-based clustering algorithm which divides the data into a pre-defined number of clusters so as to optimize a predefined criterion. However, it may not yield the same result on each run. To detect the optimal number of clusters, users usually run the algorithm repeatedly with different values of k and compare the clustering results. For a large gene expression dataset containing thousands of genes, this extensive parameter fine-tuning may not be practical. Moreover, gene expression data typically contain a large amount of noise, and the k-means algorithm forces each gene into a single cluster, which may lead to the generation of biologically non-relevant clusters. It is also not suitable for detecting clusters of arbitrary shape. In spite of these limitations, the k-means algorithm is frequently used to cluster gene expression data due to its simplicity, and it provides baseline results against which new clustering algorithms can be compared. The k-means algorithm and its variants are widely applied to gene expression data, as evidenced by [50]-[59], [61] and [62]. k-modes [8] is a faster extension of the k-means algorithm for categorical data: it uses a different similarity measure, replaces means with modes, and updates the modes with a frequency-based method. The k-modes algorithm is useful only after conversion of the numeric data into categorical form; however, this conversion may lead to information loss and hence may deteriorate the clustering result. The working of the Fuzzy C-Means (FCM) [9] algorithm is very similar to k-means.
In FCM, each point has a degree of membership in each cluster rather than belonging completely to just one cluster. Thus, points on the edge of a cluster may belong to it to a lesser degree than points in the center of the cluster. Fuzzy C-Means and its variants are widely used for microarray data clustering, as in [55], [60] and [63]. Two early k-medoid methods are the algorithms PAM and CLARA [10]. PAM uses dissimilarity values and an iterative optimization approach to determine a representative object, called a medoid, for each cluster. Once the medoids have been selected, each non-selected object is grouped with the medoid to which it is most similar. Finally, the quality of a clustering is measured by the average dissimilarity between an object and the medoid of its cluster. CLARA follows the same principle as PAM but, instead of finding representative objects for the entire dataset, it draws a sample of the dataset and applies PAM to this sample. It then classifies the remaining objects using partitioning principles. CLARANS [11] relies on a randomized search of a graph to find medoids representing clusters. The algorithm takes as input maxneighbor and numlocal, selects a random node and then checks a sample of the node's neighbors. If a better neighbor is found, based on the cost differential of the two nodes, it moves to that neighbor and continues processing until the maxneighbor criterion is met. Otherwise, it declares the current node a local minimum and starts a new pass to search for other local minima. After a specified number of local minima (numlocal) have been collected, the algorithm returns the best of them as the medoid of the cluster.
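As an illustrative sketch of the centroid-based idea (a minimal pure-Python k-means, not any particular published variant), the following operates on toy two-dimensional profiles. Note that k must be fixed in advance and the outcome depends on the random initial centroids, which is why repeated runs with different parameters are compared in practice:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: random data points
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # Update step: each centroid moves to the mean of its members
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = tuple(sum(v) / len(members)
                                     for v in zip(*members))
    return clusters, centroids

# Two well-separated toy groups of "expression profiles" in 2-D
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
        (5.0, 5.1), (5.2, 5.0), (5.1, 4.9)]
clusters, _ = kmeans(data, k=2)
print([len(c) for c in clusters])  # each cluster captures one group
```

On this cleanly separated toy data any initialization converges to the two natural groups, but on noisy, high-dimensional gene expression data different seeds and different k values can yield quite different partitions, which is the weakness discussed above.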

Discussion: Though partitioning-based clustering algorithms can find separate clusters, in the context of gene expression data mining they have the following disadvantages.
i. The number of clusters is not known a priori.
ii. The proximity measures used by most of these methods are inadequate due to the high dimensionality of gene data.
iii. Gene data, apart from displaying disjoint cluster patterns, often show evidence of intersected and embedded cluster patterns, which are usually not detectable by the partitioning approach.

B. Hierarchical Approaches
Hierarchical clustering generates a hierarchical series of nested clusters which can be graphically represented by a tree, called a dendrogram. We can obtain a specified number of clusters by cutting the dendrogram at an appropriate level. Hierarchical clustering algorithms can be further divided into agglomerative and divisive approaches based on how the dendrogram is formed. Agglomerative algorithms (bottom-up) perform repeated amalgamations of groups of data until some pre-defined threshold is reached, whereas divisive algorithms (top-down) recursively divide the data until some pre-defined threshold is reached. For agglomerative approaches, linkage is the criterion used to determine the distance between two clusters: single linkage is the minimum distance between two objects of the two clusters, complete linkage is the maximum distance, and average linkage is the mean distance between every pair of objects of the two clusters. CURE [12] is an agglomerative hierarchical clustering algorithm which begins by choosing a constant number c of well-scattered points from a cluster. These points are used to identify the shape and size of the cluster. The next step of the algorithm shrinks the selected points toward the centroid of the cluster by some predetermined fraction.
ROCK [13], also an agglomerative hierarchical clustering algorithm, employs links, not distances, to measure the similarity/proximity between a pair of data points before merging them. CHAMELEON [14] uses a graph partitioning algorithm to partition the data using a method based on the k-nearest-neighbor approach, and then uses an agglomerative hierarchical clustering algorithm to combine the sub-clusters and find the real clusters. A study of the CHAMELEON algorithm on gene data is given in [83]. UPGMA [15] adopts an agglomerative method to graphically represent the clustered dataset. In this method, each cell of the gene expression matrix is colored on the basis of the measured fluorescence ratio, and the rows of the matrix are reordered based on the hierarchical dendrogram structure and a consistent node-ordering rule. After clustering, the original gene expression matrix is represented by a colored table (a cluster image) where large contiguous patches of color represent groups of genes that share similar expression patterns over multiple conditions. However, it is not robust in the presence of noise. BIRCH [16] is an integrated hierarchical clustering method that represents clusters by introducing two concepts, the clustering feature (CF) and the CF tree, which help the method (i) achieve good speed, (ii) scale to large databases and (iii) utilize memory better. BIRCH is also effective for incremental and dynamic clustering of incoming objects. It applies a multiphase clustering technique: a single scan of the dataset yields a good basic clustering, and additional scans may be used to further improve its quality. AMOEBA [17] is an algorithm that supports hierarchical spatial clustering based on Delaunay triangulation.
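The agglomerative scheme with a dendrogram cut can be sketched as follows. This is a minimal single-linkage illustration on toy one-dimensional points, with the distance threshold playing the role of the cut level; real implementations use far more efficient data structures:

```python
import math

def single_linkage(c1, c2):
    # Smallest distance between any pair of members of the two clusters
    return min(math.dist(a, b) for a in c1 for b in c2)

def agglomerate(points, threshold):
    # Start with singleton clusters and repeatedly merge the closest pair
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        (i, j), d = min(
            (((i, j), single_linkage(clusters[i], clusters[j]))
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda t: t[1])
        if d > threshold:   # cutting the dendrogram at this height
            break
        clusters[i].extend(clusters.pop(j))
    return clusters

data = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,)]
print(agglomerate(data, threshold=1.0))
```

Swapping `min` for `max` in `single_linkage` gives complete linkage, and taking the mean over all pairs gives average linkage; the choice changes which merges happen first and hence the shape of the dendrogram.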
Discussion: Hierarchical clustering methods are vastly popular because of their presentation of cluster results, which biologists often prefer, as evidenced by [50], [56]-[58] and [61]-[68]. However, the effectiveness of a method in this approach is mostly influenced by the following points.
i. The appropriateness of the proximity measure used.
ii. The scope for incorporating regulation information, expression values and other types of biological knowledge while expanding a cluster.
iii. The ability to work with high-dimensional numeric data.
iv. The representation of nested cluster structure is clear but is inadequate for representing intersected clusters.

C. Density Based Approaches
Density-based clustering identifies dense areas in the object space: clusters are highly dense areas separated by sparse areas. A cluster continues to grow as long as the density (the number of objects or data points in the Eps-neighborhood) exceeds some threshold. Such a method can be used to filter out noise and discover clusters of arbitrary shape. Moreover, it has been proposed in [18] that statistically significant patterns can be derived from dense regions, which can then be used to identify genes and/or samples of interest and also to eliminate genes and/or samples corresponding to outliers, noise or abnormalities. The well-known DBSCAN [19] density-based clustering algorithm starts with an arbitrary instance p in a dataset D and retrieves all instances of D density-reachable from p with respect to Eps and MinPts. The algorithm makes use of a spatial data structure, the R*-tree, to locate points within Eps distance of the core points of the clusters. It is found to be effective on spatial data with uniform density. Its use in gene data clustering is given in [82]. An extension of DBSCAN is OPTICS [20], which relaxes the strict requirements on the input parameters and can handle variable density. OPTICS computes an augmented cluster ordering to automatically and interactively cluster the data.
The ordering represents the density-based clustering structure of the data and contains information equivalent to the density-based clusterings obtained over a range of parameter settings. DBCLASD [21] is another locality-based clustering algorithm, but unlike DBSCAN it assumes that the points inside each cluster are uniformly distributed. Each cluster has a probability distribution of points to their nearest neighbors, and this probability set is used to define the cluster. A grid-based representation is used to approximate the clusters as part of the probability calculation. The major advantages of DBCLASD are that it requires no external input, it is unaffected by noise since it relies on a probability-based distance factor, and it can handle clusters of arbitrary shape. Its main disadvantages are that the assumption of uniformly distributed points within a cluster limits its effectiveness, its runtime slows as the size of the database increases, and it is computationally expensive compared to later algorithms. DENCLUE [22] is a generalization of the partitioning, hierarchical, density-based and grid-based clustering approaches. The algorithm initially performs a pre-clustering step, creating a map of the active portion of the dataset which is used to speed up density function calculations. The next step is clustering, including the identification of density attractors and their corresponding points. DENCLUE can identify clusters of arbitrary shape and exhibits good clustering properties in the presence of noise. TURN* [23] consists of an overall algorithm and two component algorithms: TURN-RES, an efficient resolution-dependent clustering algorithm which returns both a clustering result and cluster features from that result, and TURN-CUT, an automatic method for finding the important or optimum resolutions from a set of TURN-RES resolution results. TURN* removes the need for input parameters. The basic idea of the DHC [24] method is to consider a cluster as a high-dimensional dense area in which data objects are attracted to each other. It organizes the cluster structure of the dataset in a two-level hierarchical structure. Initially, the root node of the density tree represents the entire data as a single dense area, which is then split into several sub-dense areas based on some criterion, each represented by a child node of the root. These sub-dense areas are further split until each contains a single cluster.
The computational complexity of this splitting step makes DHC inefficient.
Discussion: Most existing density-based algorithms are capable of handling two-dimensional data with a uniform global density distribution. However, the characteristics of gene expression data demand a density-based clustering algorithm with the following capabilities.
i. The algorithm should be able to handle high-dimensional numeric data.
ii. It should recognize embedded and intersected cluster patterns in addition to uniformly distributed cluster patterns.
iii. It should be less dependent on input parameters.

D. Model Based Approach
Based on our study and analysis of gene expression data [72]-[76], we observe that a single gene may often exhibit membership in more than one cluster, which calls for a probabilistic clustering approach for a better fit. Model-based clustering is one such approach and is well suited to gene expression data. Such an algorithm attempts to optimize the fit between the given data and some mathematical model. Applications of the model-based approach to gene expression data are discussed in [56], [57] and [69]-[71]. COBWEB [25] creates a hierarchical clustering in the form of a classification tree, where the input objects are described by categorical attribute-value pairs. It has two operators, merging and splitting, which make COBWEB less dependent on input order and allow it to perform bidirectional search. The Self-Organizing Map (SOM) [26] was developed on the basis of a single-layered neural network. Each neuron of the network is associated with a reference vector, and each data point is mapped to the neuron with the closest reference vector. As the algorithm runs, each data object acts as a training sample which directs the movement of the reference vectors towards the denser areas of the input vector space, so that the reference vectors are trained to fit the distribution of the input dataset.
When the training is complete, clusters are identified by mapping all data points to the output neurons. The advantages of SOM are that (i) it generates intuitive cluster patterns from a high-dimensional dataset and (ii) it is relatively insensitive to noisy data. Its disadvantages are that (i) the number of clusters and the grid structure of the neuron map need to be given as input, and (ii) it is sensitive to these input parameters. SOM is a popular method for clustering data in a biological context, as seen in [54], [55], [59], [60], [61], [63] and [77]. AutoClass [27] uses a Bayesian approach, starting from a random initialization of the parameters and incrementally adjusting them in an attempt to find their maximum likelihood estimates. It assumes that, in addition to the observed or predictive attributes, there is a hidden variable which reflects the cluster membership of every element in the dataset. The data-clustering problem is therefore an example of learning from incomplete data, due to the existence of this hidden variable. The use of AutoClass for clustering microarray data is shown in [58] and [78].
Discussion: Though the model-based approach can be considered highly relevant to the gene expression data mining problem, most existing methods under this approach suffer from the following disadvantages:
i. the number of clusters and the grid structure need to be given as input,
ii. they are sensitive to the input parameters, and
iii. the algorithms are computationally expensive.

E. Graph Theoretic Approach
Graph-theoretic clustering techniques explicitly work with data represented as a graph. AUTOCLUST [28] automatically extracts cluster boundaries based on Voronoi modeling and Delaunay diagrams. Parameters need not be specified by users; they are revealed by the proximity structures of the Voronoi modeling, and AUTOCLUST calculates them from the Delaunay diagram. This removes human-generated bias and also reduces the exploration time.
The advantages are that (i) it is effective in detecting clusters of different densities, and (ii) it identifies and removes multiple bridges linking clusters.
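A much simplified sketch of the graph-theoretic idea (far simpler than CLICK or CAST, and purely illustrative): threshold the pairwise distances to build a proximity graph, then read clusters off as connected components. The points and the eps threshold here are toy assumptions:

```python
import math
from collections import defaultdict

def proximity_graph(points, eps):
    # Add an edge between two points when their distance is below eps
    graph = defaultdict(set)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) < eps:
                graph[i].add(j)
                graph[j].add(i)
    return graph

def connected_components(n, graph):
    # Depth-first search; each component is one cluster
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(graph[v] - seen)
        components.append(sorted(comp))
    return components

pts = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.3), (4.0, 4.0), (4.3, 4.0)]
print(connected_components(len(pts), proximity_graph(pts, eps=1.0)))
# → [[0, 1, 2], [3, 4]]
```

Algorithms such as CLICK and CAST improve on this naive thresholding by reasoning probabilistically about which edges are spurious, rather than trusting a single global cutoff.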

CLICK [29] tries to identify clusters as highly connected components in a proximity graph, based on a probabilistic assumption. It defines the weight of an edge as the probability that the vertices of the edge are in the same cluster, and then iteratively finds the minimum cut in the graph, recursively splitting the dataset into a set of connected components subject to a predefined threshold. It also includes two post-pruning steps to refine the cluster results. CLICK is capable of recognizing intersecting clusters, and it generates good-quality clusters over gene expression datasets in terms of homogeneity and separation. Its use is found in the literature in [43], [59], [69] and [79]. CAST [30] is based on the corrupted clique graph data model. It takes as input a real, symmetric, n-by-n similarity matrix M and an affinity threshold. It assumes that the input dataset contains errors and that the true clusters of the data points are given by a disjoint union of complete sub-graphs (a clique graph), where each clique represents a cluster; a cluster is a set of high-affinity elements subject to the affinity threshold. The input proximity graph (derived from the n-by-n similarity matrix) is assumed to be generated from the true clique graph by flipping each edge/non-edge with some probability. The clustering process defined by CAST is thus a process of eliminating noise or contamination from the corrupted version with a minimum number of flips. CAST discovers clusters one at a time. It is popular for generating clusters from gene expression data [30], [43], [50], [59], [62], [69] and [80].
Discussion: The graph-theoretic approach can be considered highly relevant to gene expression data mining, as it is capable of discovering intersected and embedded clusters. However, it sometimes generates non-realistic cluster patterns. Moreover, the results depend on the proximity measure, so inappropriate selection of the proximity measure will result in biologically non-relevant clusters.

F. Soft Computing Approach
Fuzzy Clustering: Gene expression data analysis sometimes encounters intersecting gene patterns, in which case crisp or hard clustering may not yield a good result. In fuzzy clustering, however, a gene can belong to several clusters with certain degrees of membership. An application of fuzzy clustering in a biological context can be found in [81]. Fuzzy C-Means (FCM), a generalization of ISODATA [31], operates on numeric data to identify c fuzzy clusters (c is a user input). FCM uses Euclidean distance and encounters problems in outlier handling, identification of initial partitions and discovery of clusters of all shapes, which are also the limitations of hard-partitioning clustering techniques. However, several variants of FCM found in the literature attempt to overcome these limitations.
Neural Networks-Based Clustering: Two popular neural-network-based clustering approaches, SOFM (Self-Organizing Feature Map) and ART (Adaptive Resonance Theory), are discussed briefly here. SOFM can be viewed as a visual representation of a lattice of neurons interconnected via adaptable weights. During the training process, neighboring input patterns are projected onto adjacent neurons of the lattice. The advantages of SOFM are that (i) it approximates the density of the input space and (ii) it is input-order independent. The disadvantages are that (i) like k-means, SOFM needs the size of the lattice (the number of clusters) to be predefined and (ii) it may suffer from input space density misrepresentation. Several variants of SOFM that attempt to overcome these limitations can be found in the literature [34]-[36]. The use of SOFM for clustering biological data is found in [82]. ART, a large family of neural network architectures, is able to learn any input pattern in a fast, stable and self-organizing way.
ART1 deals with binary input patterns and can be extended to arbitrary input patterns through a coding mechanism; ART2 extends the approach to analog input patterns, and ART3 achieves efficient parallel search in hierarchical structures based on certain biological processes. Several applications of ART can be found in the literature [61]. GA-Based Clustering: GA (Genetic Algorithm) clustering uses randomized search and optimization techniques based on the principles of evolution and natural genetics. GAs search complex, large and multimodal landscapes and provide near-optimal solutions to the objective function of an optimization problem. Several GA-based clustering algorithms are found in the literature [37]-[42]. GenClust [43] is an effective GA-based clustering algorithm with two key features: (i) a simplified and compact coding of the search space, and (ii) the use of a data-driven internal validation method. GenClust works stage-wise, producing a sequence of partitions, each consisting of k classes; the partitioning process continues until a termination criterion is satisfied. The GenClust algorithm is used in [59] to cluster gene expression data. Discussion: In fuzzy clustering, the problem of specifying the number of clusters remains. The limitations of FCM can be addressed by (i) using the city-block distance (L1 norm) to improve robustness to outliers [32], (ii) an initialization strategy based on unsupervised tracking of cluster prototypes in a two-layer clustering scheme [33], and (iii) using appropriate prototypes and distance functions to identify clusters of arbitrary shape. The basic advantages of ART are that it is fast and exhibits stable learning and pattern detection. Its disadvantages are (i) inefficiency in dealing with noise and (ii) difficulty in representing clusters in higher dimensions.
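A minimal sketch of the basic FCM iteration discussed above, assuming Euclidean distance and the common fuzzifier m = 2 (the deterministic initialisation and toy data are simplifications for illustration; real implementations initialise randomly):

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100):
    """Minimal fuzzy c-means sketch: each point receives a degree of
    membership in every cluster (rows of U sum to 1), unlike hard
    k-means."""
    idx = np.linspace(0, len(X) - 1, c).astype(int)
    centers = X[idx].astype(float)          # crude deterministic init
    for _ in range(n_iter):
        # membership update: u_ik proportional to d_ik^(-2/(m-1))
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12
        U = 1.0 / d ** (2.0 / (m - 1.0))
        U /= U.sum(axis=1, keepdims=True)
        # centre update: weighted mean with weights u_ik^m
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
    return centers, U

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centers, U = fuzzy_c_means(X, c=2)
# rows of U sum to 1; each point's largest membership marks its cluster
```

On well-separated points the memberships become nearly crisp; a gene lying between two expression patterns would instead keep intermediate memberships in both clusters, which is exactly the behaviour that motivates fuzzy clustering here.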
GenClust creates good clusters from gene expression datasets by using external criteria (when a true solution is available) and internal criteria (when a true solution is not known). However, like other GA-based solutions, GenClust is not free from the local optima problem.
G. Incremental Approach
In [44], the authors present an incremental clustering approach based on the DBSCAN algorithm. A one-pass clustering algorithm for relational datasets is proposed in [45]. Rough set theory is employed in an incremental approach for clustering interval datasets in [46]. In [47], an incremental genetic k-means algorithm is presented. In [48], an incremental gene selection algorithm using a wrapper-based method is presented; it reduces the search-space complexity since it works on the rankings directly. In [49], an incremental clustering algorithm over gene expression data is presented; it uses regulation information to store cluster information for reuse when clustering genes incrementally. The effectiveness of an incremental gene clustering algorithm depends on the proximity measure used and on variations in density.
V. CLUSTER VALIDITY MEASURES
The previous section reviewed a number of clustering algorithms that partition a dataset based on a variety of clustering criteria. For gene expression data, clustering yields groups of co-expressed genes, groups of samples with a common phenotype, or blocks of genes and samples involved in specific biological processes. However, different clustering algorithms, or even a single clustering algorithm used with different parameters, generally produce different sets of clusters [50]. It is therefore important to compare clustering results and select the one that best fits the true data distribution. Cluster validation is the process of assessing the quality and reliability of the cluster sets derived from various clustering processes. Cluster validity generally has three aspects. First, the quality of clusters can be measured in terms of homogeneity and separation, following the definition of a cluster: objects within one cluster are similar to each other and different from objects in other clusters. The second aspect relies on the ground truth of the clusters, which may come from domain knowledge such as the clinical diagnosis of normal or cancerous tissues; validation is then based on the agreement between the clustering results and the ground truth. The third aspect focuses on the reliability of the clusters, i.e., the likelihood that the cluster structure has not arisen by chance.
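The agreement-with-ground-truth aspect reduces to counting gene pairs. A small self-contained sketch (with toy label vectors) of the pair counts a, b, c, d on which the Rand index and Jaccard coefficient in this section are built:

```python
from itertools import combinations

def pair_counts(C, P):
    """Count gene pairs by agreement between a clustering C and a
    ground-truth partition P: a = together in both, b = together in C
    only, c = together in P only, d = separated in both."""
    a = b = c = d = 0
    for i, j in combinations(range(len(C)), 2):
        same_c, same_p = C[i] == C[j], P[i] == P[j]
        if same_c and same_p:
            a += 1
        elif same_c:
            b += 1
        elif same_p:
            c += 1
        else:
            d += 1
    return a, b, c, d

C = [0, 0, 1, 1, 1]      # algorithm's clusters (toy labels)
P = [0, 0, 1, 1, 0]      # ground-truth annotation (toy labels)
a, b, c, d = pair_counts(C, P)
rand = (a + d) / (a + b + c + d)
print(rand)              # -> 0.6
```

The same four counts feed directly into the measures defined next: the Rand index uses all of a, b, c, d, while the Jaccard coefficient discards the "separated in both" pairs d.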
In this section, we discuss these three aspects of cluster validation in the light of some popular validity measures.
A. Rand Index
The Rand index [84] measures the similarity between two partitions. It is the number of object pairs that are placed together in both partitions or apart in both partitions, divided by the total number of pairs:
Rand index = (a + d) / (a + b + c + d),
where, for a clustering C and a partition P, a is the number of gene pairs (g_i, g_j) with C_ij = 1 and P_ij = 1, b is the number of pairs with C_ij = 1 and P_ij = 0, c is the number of pairs with C_ij = 0 and P_ij = 1, and d is the number of pairs with C_ij = 0 and P_ij = 0. The Rand index lies between 0 and 1; the maximum value of 1 is achieved when the two partitions C and P agree perfectly. Examples of papers that use the Rand index are [57], [63] and [70].
B. Cluster Homogeneity
Homogeneity measures the quality of clusters on the basis of the definition of a cluster: objects within a cluster are similar, while objects in different clusters are dissimilar. The homogeneity measure used here is the overall average homogeneity of [85], calculated as follows.
i. Compute the average similarity between each gene g_i in cluster C_k and the centroid c_k of that cluster:
H(C_k) = (1 / |C_k|) * Sum_{g_i in C_k} sim(g_i, c_k), where c_k is the centroid of C_k.
ii. Compute the average homogeneity of the clustering C, weighted by cluster size:
H_ave = (1 / N) * Sum_k |C_k| * H(C_k), where N is the total number of genes.
C. Silhouette Index
The silhouette index [86] assesses the quality of any clustering solution, reflecting both the compactness and the separation of clusters. It is calculated as follows.
i. Compute a(g_i), the average distance of gene g_i to the other genes of the cluster A to which it belongs.
ii. Compute d(g_i, C_k), the average distance of gene g_i to the genes of each cluster C_k with g_i not in C_k.
iii. Compute b(g_i) = min_{C_k != A} d(g_i, C_k), the distance of gene g_i to its closest other cluster, and the silhouette width of gene g_i:
s(g_i) = (b(g_i) - a(g_i)) / max(a(g_i), b(g_i)).
iv. Compute the silhouette index as the average of s(g_i) over i = 1, 2, ..., G, where G is the total number of genes: S = average{s(g_i)}.
The silhouette index varies from -1 to 1, with higher values indicating better clustering. It has been used in [63].
D. Z-score
The z-score [87] is calculated by investigating the relation (mutual information, MI) between a clustering obtained by an algorithm and the functional annotation of the genes in each cluster. It represents a standardized distance between the MI value obtained by the clustering and the MI values obtained by random assignment of genes to clusters. A higher z-score indicates that genes are better clustered by function, i.e., that the clustering result is more significantly related to gene function and therefore more biologically relevant. The use of the z-score is shown in [88].
E. Jaccard Coefficient
The Jaccard coefficient measures the proportion of pairs that are in the same cluster of C and in the same group of P among those that are together in at least one of the two [91]:
J = a / (a + b + c),
where a (SS) counts pairs in the same cluster of C and the same group of P, b (SD) counts pairs in the same cluster of C but different groups of P, and c (DS) counts pairs in different clusters of C but the same group of P. As with the Rand index, the coefficient lies between 0 and 1, and values close to 1 indicate high agreement between C and P.
F. Davies-Bouldin Index
Let s_i be a measure of the dispersion of cluster C_i (e.g., the average distance of its members to its centroid) and d_ij = d(C_i, C_j) the dissimilarity between two clusters. The Davies-Bouldin index [92] is defined as
DB = (1/k) * Sum_{i=1..k} R_i, with R_i = max_{j != i} R_ij and R_ij = (s_i + s_j) / d_ij,
where R_ij is a similarity index between C_i and C_j. Lower values indicate better clustering.
G. Dunn Index
The Dunn index [93] is defined as
D = min_{1<=i<=k} min_{j != i} [ d(C_i, C_j) / max_{1<=l<=k} diam(C_l) ],
where d(C_i, C_j) = min_{x in C_i, y in C_j} d(x, y) is the dissimilarity between clusters C_i and C_j, and diam(C) = max_{x, y in C} d(x, y) is the diameter of cluster C. Higher values indicate better clustering.
H. P-value
The reliability of the resulting clusters can be estimated by the p-value [52] of a cluster. It measures the probability of finding the observed number of genes annotated with a given Gene Ontology (GO) term (i.e., function, process or component) within a cluster. For a GO category containing f of the g genes, the probability p of getting k or more such genes within a cluster of size n is given by the cumulative hypergeometric distribution:
p = Sum_{i=k..n} [ C(f, i) * C(g - f, n - i) / C(g, n) ].
A low p-value indicates that the genes belonging to the enriched functional category are biologically significant in the corresponding cluster. The use of the p-value to compare different clustering solutions of yeast cell-cycle gene expression data is given in [90].
VI. DATASETS USED
We present some datasets related to Human, Yeast and the Rat Central Nervous System in Table II along with their sources. These datasets have often been used to evaluate the clustering of gene array data.
TABLE II: DATASETS AND THEIR SOURCE
Dataset | No. of genes | No. of conditions | Source
Yeast Sporulation | | |
Yeast Cell Cycle | | | Sample input files in Expander [115]
Subset of Yeast Cell Cycle | | |
Yeast Diauxic Shift | | |
Rat CNS | | |
Arabidopsis Thaliana | | |
Subset of Human Fibroblasts Serum | | |
Human cancer data | | |
Colon cancer data | | and 22 |
VII. CHALLENGES IN CLUSTERING GENE EXPRESSION DATA
Clustering gene expression data poses challenges different from those of clustering non-biological data, owing to the very nature of data collected from microarray experiments.
A. Challenges and Research Issues
Studies have confirmed that clustering algorithms are useful in identifying groups of co-expressed genes and discovering coherent expression patterns. However, due to the distinct characteristics of time-series gene expression data and the special requirements of the biology domain, clustering gene expression data still faces the following challenges.
1) The effectiveness of a clustering technique is highly dependent on the proximity measure used by the technique. Choosing or finding an appropriate proximity measure is a challenging task.
2) Most existing clustering techniques depend on input parameters or stopping criteria to discover the true number of clusters, which remains a major challenge.
3) Gene expression data often contain clusters that are highly connected [89], intersected or even embedded [24]. A clustering algorithm should be capable of handling such situations effectively.
4) The available gene datasets often contain considerable noise and missing values. A clustering algorithm should therefore be able to extract the true clusters in the presence of this noise and to handle missing values.
5) A clustering algorithm should be efficient, so that it scales with the increasing size and dimensionality of datasets.

6) Apart from clustering, an algorithm should be capable of showing the associations among the clusters, which may be useful for drawing conclusions.
Among the partitioning approaches, algorithms such as PAM and Fuzzy C-Means have been observed to be robust. However, they suffer from limitations: (i) the number of clusters must be known a priori, and (ii) the proximity measure used may be inadequate for finding the true number of biologically relevant clusters. Algorithms following the hierarchical approach are advantageous from the biologists' perspective, as they represent cluster-to-cluster associations in addition to individual co-expressed gene groups. However, they too have limitations: (i) difficulty in deciding an appropriate stopping criterion, and (ii) difficulty in simultaneously representing disjoint, embedded and intersected clusters. Density-based approaches are established as good at finding clusters of arbitrary shape, and they can identify global as well as local (embedded) clusters. Their two main limitations are that (i) they are sensitive to input parameters, and (ii) they are ineffective at finding intersected patterns in high-dimensional data. A model-based approach provides an estimated probability that a single gene belongs to more than one cluster, indicating that a gene may correlate highly with two quite different clusters. It discovers good parameter values iteratively and can handle various data shapes. However, (i) it can be computationally expensive, since many iterations may be required to estimate its parameters, and (ii) it assumes that the dataset fits a specific distribution, which is not always true. Graph-theoretic algorithms are suitable for subspace and high-dimensional data clustering.
The algorithms under this approach do not require a user-defined number of clusters, handle outliers efficiently, and can discover intersected and embedded clusters. However, they are limited by (i) the difficulty of determining a good threshold value, and (ii) the need for a priori knowledge of the dataset. Among the soft computing approaches, Fuzzy C-Means (FCM) and Genetic Algorithms (GAs) have been used effectively in clustering gene expression data. FCM requires the number of clusters as an input parameter; GA-based algorithms can detect biologically relevant clusters but depend on proper tuning of their input parameters.
VIII. CONCLUSION
Two groups of people are involved in the clustering of biological data. One is the biologist, who uses an existing clustering algorithm to solve an underlying biological problem; the challenge here is to make an appropriate choice of algorithm, since different algorithms produce different results. The other is the developer of clustering algorithms, who strives to improve existing algorithms so that the underlying biological problems can be solved efficiently. A proper amalgamation of these two groups will lead to rapid advancement in this field. Although gene expression clustering has commonly been done with k-means, hierarchical clustering and SOM algorithms, the desired features for clustering include minimal user input, the ability to find arbitrarily shaped clusters, robustness to outliers and the ability to handle high dimensionality. If outliers pose a problem when using k-means for clustering gene expression data, the biologist can consider k-medoids, CLARA or CLARANS as alternatives. If hierarchical clustering is to be used, BIRCH is suitable where outlier handling matters, while CURE or DHC can detect arbitrarily shaped clusters.
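When competing clustering solutions must be compared, a validity measure can guide the choice. A compact, dependency-light sketch of the silhouette index of Section V.C, contrasting a good and a bad labelling of the same toy data (data and labels are invented for illustration):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette width s(i) = (b - a) / max(a, b): a is the
    average within-cluster distance, b the average distance to the
    nearest other cluster; higher means tighter, better-separated
    clusters."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None], axis=2)     # pairwise distances
    s = np.empty(n)
    for i in range(n):
        own = labels == labels[i]
        a = D[i, own & (np.arange(n) != i)].mean()       # cohesion
        b = min(D[i, labels == c].mean()                 # separation
                for c in set(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
good = np.array([0, 0, 1, 1])      # respects the two natural groups
bad = np.array([0, 1, 0, 1])       # mixes the two groups
print(silhouette(X, good) > silhouette(X, bad))   # -> True
```

The same comparison applies to competing algorithms or parameter settings: the labelling with the higher average silhouette width is, by this criterion, the better-separated clustering.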
Fuzzy C-means and GA-based approaches support overlapping clusters of co-regulated genes, while projected density-based approaches improve quality on high-dimensional data but again need user-specified parameters. A clustering algorithm's suitability for biological data depends on desirable features such as speed, a minimal number of input parameters, robustness to noise and outliers, redundancy handling and independence from the input order of objects. Though many clustering algorithms match these requirements, they have not yet been applied to clustering biological data. Moreover, not all validity measures are suitable for all gene datasets; hence a judicious choice of validity measure has to be made. It is well known that most clustering methods are highly sensitive to the input data, and a slight variation in the data may result in very different gene clusters. If information from genomic knowledge bases, such as GO, could be incorporated (data fusion) earlier in the analysis of genomic data, that additional information about genes and their relationships would improve the stability, accuracy and/or biological relevance of the cluster results. In this survey, an attempt has been made to provide a comprehensive account of various clustering approaches in the context of coherent pattern identification in gene expression data. The effectiveness of a clustering technique is highly influenced by the proximity measure it uses, and this survey also presents a partial list of proximity measures available for numeric data clustering. It attempts to analyse the effectiveness of algorithms belonging to each approach in gene expression data mining. Finally, it discusses various validity measures and the sources of various datasets for the effective evaluation of clustering results. REFERENCES [1] M. Schena, D. Shalon, R.W. Davis, and P.O.
Brown, Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray, Science, vol. 270, pp , [2] D. Lockhart et al., Expression Monitoring by Hybridization to High-Density Oligonucleotide Arrays, Nature Biotechnology, vol. 14, pp , [3] R. Loganantharaj, Beyond clustering of array expressions, Int. J. Bioinformatics Res. Appl., vol. 5, no. 3, pp , [4] P. Berkhin, Survey of Clustering Data Mining Techniques, [5] S.-H. Cha, Comprehensive survey on distance/similarity measures between probability density functions, International Journal of

Mathematical Models and Methods in Applied Science, vol. 1, no. 4, November [6] R. Xu and D. Wunsch, Survey of Clustering Algorithms, IEEE Transactions on Neural Networks, vol. 16, no. 3, pp , May [7] J. MacQueen, Some Methods for Classification and Analysis of Multivariate Observations, in Proc. of the 5th Berkeley Symposium on Mathematics, Statistics and Probability, pp , 1967 [8] Z. Huang, A Fast Clustering Algorithm to cluster very large categorical datasets in Data Mining, SIGMOD Workshop on Research Issues on DM and Knowledge Discovery, May 1997 [9] J. C. Bezdek, R. Ehrlich and W. Full, FCM: The Fuzzy c-Means Clustering Algorithm, Computers and Geosciences, [10] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990. [11] R. T. Ng and J. Han, Efficient and Effective Clustering Method for Spatial Data Mining, in Proc. VLDB '94, pp , [12] S. Guha, R. Rastogi, and K. Shim, CURE: An Efficient Clustering Algorithm for Large Datasets, ACM SIGMOD Conf., [13] S. Guha, R. Rastogi and K. Shim, ROCK: A Robust Clustering Algorithm for Categorical Attributes, IEEE Conf. on Data Engineering, [14] G. Karypis, E. H. Han and V. Kumar, CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modelling, Computer, vol. 32, no. 8, pp , [15] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein, Cluster Analysis and Display of Genome-Wide Expression Patterns, In Proc. Nat'l Academy of Science, vol. 95, no. 25, pp , Dec [16] T. Zhang, R. Ramakrishnan, and M. Livny, BIRCH: An Efficient Data Clustering Method for Very Large Datasets, Data Mining and Knowledge Discovery, vol. 1, no. 2, pp , [17] V. Estivill-Castro and I. Lee, AMOEBA: Hierarchical Clustering Based On Spatial Proximity Using Delaunay Diagram, in the 9th Int'l Symposium on Spatial Data Handling, [18] A. M. Yip, M. K. Ng, E. H. Wu and T. F.
Chan, Strategies for Identifying Statistically Significant Dense Regions in Microarray Data, IEEE/ACM Transactions on Computational Biology and Bioinformatics, pp , July-September, 2007 [19] M. Ester, H.-P. Kriegel, J. Sander,and X. Xu, A density-based algorithm for discovering clusters in large spatial databases, In Proc. Int. Conf. Knowledge Discovery and Data Mining (KDD 96), pp , Portland, OR, Aug.1996 [20] M. Ankerst, M. Breuing, H. -P. Kriegel and J.Sander, OPTICS: Ordering points to identify the clustering structure, In Proc ACM- SIGMOD Conf. on Management of Data (SIGMOD 99), pp , [21] X. Xu, M. Ester, H.-P Kriegel and J Sander, A Nonparametric Clustering Algorithm for Knowledge Discovery in Large Spatial Datasets, in IEEE Int. Conf. On Data Engineering, [22] A. Hinneburg and D. Keim, An efficient approach to clustering in large multimedia datasets with noise, in the Proc. of 4th Int. Conf. On Knowledge Discovery and Data Mining, pp , 1998 [23] A. Foss and O. R. Zaiane, TURN* unsupervised clustering of spatial data, in Proc. of the ACM-SIKDD Int. Conf. On Knowledge Discovery and Data Mining, 2002 [24] D. Jiang, J. Pei, and A. Zhang, DHC: A Density-Based Hierarchical Clustering Method for Time-Series Gene Expression Data, in the Proc. BIBE2003: Third IEEE Int l Symp. Bioinformatics and Bioeng., [25] D. Fisher, Improving interference through conceptual clustering, in the Proc. of the AAAI Conf., pp , 1987 [26] T. Kohonen, Self-Organizing Maps, 2nd Extended Edition, Springer Series in Information Sciences, vol. 30, 1997 [27] P. Cheeseman and J. Stutz, Bayesian Classification (AutoClass): Theory and results, in Advances in Knowledge Discovery and Data Mining, pp , [28] V. Estivil-Castro and I. Lee, AutoClust: Automatic clustering via boundary extraction for mining massive point datasets, In Proc. 5th International Conference on Geocomputation. Pp , [29] R. Shamir and R. Sharan, CLICK: A Clustering Algorithm for Gene Expression Analysis, Proc. Eighth Int l Conf. 
Intelligent Systems for Molecular Biology (ISMB 00), [30] A. Ben-Dor, R. Shamir, and Z. Yakhini, Clustering Gene Expression Patterns, J. Computational Biology, vol. 6, nos. 3/4, pp , [31] J. Dunn, A fuzzy relative of the ISODATA process and its use in detecting compact well separated clusters, J. Cybern., vol. 3, no. 3, pp , [32] P. Kersten, Implementation issues in the fuzzy c-medians clustering algorithm, in Proc. 6th IEEE Int. Conf. Fuzzy Systems, vol. 2, pp , [33] I. Gath and A. Geva, Unsupervised optimal fuzzy clustering, IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, no. 7, pp , Jul [34] T. Kohonen, Self-Organizing Maps, 3rd edition, New York: Springer- Verlag, [35] M. Su and H. Chang, Fast self-organizing feature map algorithm, IEEE Trans. Neural Netw., vol. 11, no. 3, pp , May [36] J. Vesanto and E. Alhoniemi, Clustering of the self-organizing map, IEEE Trans. Neural Netw., vol. 11, no. 3, pp , May [37] G. Babu and M. Murty, A near-optimal initial seed value selection in K-means algorithm using a genetic algorithm, Pattern Recognit. Lett., vol. 14, no. 10, pp , [38] G. Babu and M. Murty, Clustering with evolution strategies, Pattern Recognit., vol. 27, no. 2, pp , [39] M. Cowgill, R. Harvey, and L. Watson, A genetic algorithm approach to cluster analysis, Comput. Math. Appl., vol. 37, pp , [40] U. Maulik and S. Bandyopadhyay, Genetic algorithm-based clustering technique, Pattern Recognition Letters, vol. 33, pp , [41] L. Tseng and S. Yang, A genetic approach to the automatic clustering problem, Pattern Recognit., vol. 34, pp , [42] K. Krishna and M. Murty, Genetic K-means algorithm, IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 29, no. 3, pp , Jun [43] V. Di-Gesu, R. Giancarlo, G. Lo-Bosco, A. Raimondi and D. Scaturro, GenClust: A genetic algorithm for clustering gene expression data, BMC Bioinformatics, vol. 6, pp. 289, 2005 [44] M. Ester, H. P. Kriegel, J. Sander, M. Wimmer, and X. 
Xu, An incremental clustering for mining in a data warehousing environment, in Proceedings of the 24th VLDB Conference, New York, USA, [45] L. Tao and S. Sarabjot, Hirel: An incremental clustering algorithm for relational datasets, in Proceedings of Eighth IEEE International Conference on Data Mining, 2008, pp [46] S. Asharaf, M. Narasimha, and S. Shevade, Rough set based incremental clustering of interval data, Pattern Recognition Letters, vol. 27, pp , [47] Y. Lu, S. Lu, F. Fotouhi, Y. Deng, and S. Brown, Incremental genetic k-means algorithm and its application in gene expression data analysis, BMC Bioinformatics, vol. 5(172), [48] R. Ruiz, J. Riquelme, and J. Aguilar-Ruiz, Incremental wrapper-based gene selection from microarray data for cancer classification, Pattern Recognition, vol. 39, p , [49] R. Das, D. Bhattacharyya, and J. Kalita, An incremental clustering of gene expression data, in Proceedings of NABIC, Coimbatore, India, [50] K. Y. Yeung, D. R. Haynor, and W. L. Ruzzo, Validating Clustering for Gene Expression Data, Bioinformatics, vol. 17, no. 4, pp , [51] P. Kraj, A. Sharma, N. Garge, R. Podolsky and R. A. Mcindoe, Parakmeans: Implementation of a parallelized k-means algorithm suitable for general laboratory use, BMC Bioinformatics, vol. 9, pp. 200, [52] S. Tavazoie, D. Hughes, M.J. Campbell, R.J. Cho, and G.M. Church, Systematic Determination of Genetic Network Architecture, Nature Genetics, pp ,

[53] J. A. Hartigan and M. A. Wong, A K-means clustering algorithm, Appl. Stat., vol. 28, pp , [54] A. Sugiyama, M. Kotani, Analysis of gene expression Data Using Self-organizing Maps and k-means Clustering, IEEE, pp , [55] J.H. Do and D.-K. Choi, Clustering Approaches to Identifying Gene Expression Patterns from DNA Microarray Data, Mol. Cells, vol. 25, no. 2, pp. 1-1, [56] S. Datta and S. Datta, Comparisons and validation of statistical clustering techniques for microarray gene expression data, Bioinformatics, vol. 19, no. 4, pp , [57] A. Thalamuthu, I. Mukhopadhyay, X. Zheng and G. C. Tseng, Evaluation and comparison of gene clustering methods in microarray analysis, Bioinformatics, vol. 22, no. 19, pp , [58] Y. Liu, S. B. Navathe, J. Civera, V. Dasigi, A. Ram, B. J. Ciliax, and R. Dingledine, Text Mining Biomedical Literature for Discovering Gene-to-Gene Relationships: A Comparative Study of Algorithms, IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2, no. 1, January-March [59] R. Shamir and R. Sharan, Algorithmic Approaches to Clustering Gene Expression Data, In Current Topics in Computational Biology, pp , MIT Press, [60] D. Dembele and P. Kastner, Fuzzy c-means method for clustering microarray data, Bioinformatics, vol. 19, pp , [61] S. Tomida, T. Hanai, H. Honda, T. Kobayashi, Analysis of expression profile using fuzzy adaptive resonance theory, Bioinformatics, vol. 18, no. 8, pp , [62] B. Dost, C. Wu, A. Su and V. Bafna, TCLUST: A Fast Method for Clustering Genome-Scale Expression Data, IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 99, 2010 [63] S. Bandyopadhyay, A. Mukhopadhyay and U. Maulik, An improved algorithm for clustering gene expression data, Bioinformatics, vol. 23, no. 21, pp , [64] H. Ikota, S. Kinjo, H. Yokoo and Y. Nakazato, Systematic immunohistochemical profiling of 378 brain tumors with 37 antibodies using tissue microarray technology, Acta Neuropathol, vol. 111, no. 5, pp , [65] F.X.
Wu, W.J. Zhang, and A.J. Kusalik, Determination of the minimum number of microarray experiments for discovery of gene expression patterns, BMC Bioinformatics, vol. 7 (Suppl. 4), S13, [66] B. Xing, C.M. Greenwood and S.B. Bull, A hierarchical clustering method for estimating copy number variation, Biostatistics, vol. 8, pp , [67] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein, Cluster Analysis and Display of Genome-Wide Expression Patterns, In Proc. Nat l Academy of Science, vol. 95, no. 25, pp , Dec [68] U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack, and A.J. Levine, Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Array, In Proc. Nat l Academy of Science, vol. 96, no. 12, pp , June [69] D. Jiang, C. Tang and A. Zhang, Cluster Analysis for Gene Expression Data: A Survey, IEEE Transactions on Knowledge and data Engineering, vol. 16, no. 11, November [70] K.Y. Yeung, C. Fraley, A. Murua, A.E. Raftery, and W.L. Ruzz, Model-Based Clustering and Data Transformations for Gene Expression Data, Bioinformatics, vol. 17, pp , [71] G.J. McLachlan, R.W. Bean, and D. Peel, A Mixture Model-Based Approach to the Clustering of Microarray Expression Data, Bioinformatics, vol. 18, pp , [72] R. Das, D K Bhattacharyya and J K Kalita, An Effective Dissimilarity Measure for Clustering Gene Expression Time Series Data, Fourth Biotechnology and Bioinformatics Symposium (BIOT 07),Colorado Springs, CO, pp , October [73] R. Das, J. Kalita and D. Bhattacharyya, A Frequent-Itemset Nearest Neighbor Based Approach for Clustering Gene Expression Data, Fifth Biotechnology and Bioinformatics Symposium (BIOT 08), Arlington, TX, pp , October 2008 [74] R. Das and D K Bhattacharyya, A Pattern matching Approach for Identifying Coherent Patterns over Gene Expression Data, in the Technical Journal of ASS, vol. 50, no. 1-2, 2009 [75] R. 
Das, D K Bhattacharyya and J K Kalita, Clustering Gene Expression Data using Density Approach, in the IJRTE, (ISSN: ), Finland, vol 2, no 1-6,. November, 2009 [76] R. Das, D K Bhattacharyya and J K Kalita, A New Approach for Clustering Gene Expression Time Series Data, Int'nl Journal of Bioinformatics Research and Applications (IJBRA), vol. 5, no. 3, pp , [77] P. Tamayo, D. Solni, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E.S. Lander, and T.R. Golub, Interpreting Patterns of Gene Expression with Self-Organizing Maps: Methods and Application to Hematopoietic Differentiation, Proc. National Academy of Science, vol. 96, no. 6, pp , March [78] F. Achcar, J.M. Camadro and D. Mestivier, AutoClass@IJM: a powerful tool for Bayesian classification of heterogeneous data in biology, Nucleic Acids Research, vol. 37, Web Server issue, [79] CLIC: T. Yun, T. Hwang, K. Cha and G.-S. Yi, Clustering Analysis of large Microarray Datasets with Individual Dimension-based Clustering, Nucleic Acids Research, pp.1 8, [80] A. Bellaachia, D. Portnoy, Y. Chen and A. G. Elkahloun, E-CAST: A Data Mining Algorithm For Gene Expression Data, In Workshop on Data Mining in Bioinformatics, pp , [81] A.P Gasch and M.B Eisen, Exploring the Conditional Coregulation of Yeast Gene Expression through Fuzzy K-Means Clustering, Genome Biology, vol. 3, no. 11:research , [82] C. Bohm, K. Kailing, H.-P. Kriegel and P. Kroger, Density Connected Clustering with Local Subspace Preferences, in Proc. 4th IEEE Int. Conf. on Data Mining (ICDM'04), Brighton, UK, 2004 [83] S. Sahay, Study and Implementation of CHAMELEON algorithm for Gene Clustering, [84] W. M. Rand, Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association, vol. 66, no. 336, pp , [85] R. Sharan, A. Maron-Katz, and R. Shamir, CLICK and EXPANDER: A System for Clustering and Visualizing Gene Expression Data. Bioinformatics, vol. 19, no. 14, pp , 2003 [86] P. 
Rousseeuw, Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. Journal of Computational and Applied Mathematics, vol. 20, pp , 1987 [87] F. Gibbons and F. Roth, Judging the Quality of Gene Expression Based Clustering Methods using Gene Annotation, Genome Research, vol. 12, pp , [88] C. Cheadle, M. P. Vawter, W. J. Freed and K. G. Becker, Analysis of Microarray Data Using Z Score Transformation, Journal of Molecular Diseases, vol. 5, no. 2, 2003 [89] D. Jiang, J. Pei, and A. Zhang, Interactive Exploration of Coherent Patterns in Time-Series Gene Expression Data, Proc. Ninth ACM SIGKDD Int l Conf. Knowledge Discovery and Data Mining (SIGKDD 03), [90] I. Gat-Viks, R. Sharan and R. Shamir, Scoring clustering solutions by their biological relevance, Bioinformatics, vol. 19, no. 18, pp , [91] Y Batistakis, M. Halkidi and M. Vazirgiannis, Cluster validity methods: Part I, Sigmod Record, vol. 31, no. 12, [92] D. Davies and D. Bouldin, A cluster separation measure, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, no. 2, [93] J. Dunn, Well separated clusters and optimal fuzzy partitions, Journal of Cybernetics, vol.4,


A Survey of Clustering Techniques

A Survey of Clustering Techniques A Survey of Clustering Techniques Pradeep Rai Asst. Prof., CSE Department, Kanpur Institute of Technology, Kanpur-0800 (India) Shubha Singh Asst. Prof., MCA Department, Kanpur Institute of Technology,

More information

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang Classifying Large Data Sets Using SVMs with Hierarchical Clusters Presented by :Limou Wang Overview SVM Overview Motivation Hierarchical micro-clustering algorithm Clustering-Based SVM (CB-SVM) Experimental

More information

Data Preprocessing. Week 2

Data Preprocessing. Week 2 Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.

More information

Advanced Web Usage Mining Algorithm using Neural Network and Principal Component Analysis

Advanced Web Usage Mining Algorithm using Neural Network and Principal Component Analysis Advanced Web Usage Mining Algorithm using Neural Network and Principal Component Analysis Arumugam, P. and Christy, V Department of Statistics, Manonmaniam Sundaranar University, Tirunelveli, Tamilnadu,

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information