Gene Expression Data Clustering Analysis: A Survey


Gene Expression Data Clustering Analysis: A Survey

Sajid Nagi #, Dhruba K. Bhattacharyya *, Jugal K. Kalita
# Department of Computer Science, St. Edmund's College, Shillong, Meghalaya, India.
* Department of Computer Science and Engineering, Tezpur University, Napaam, Assam, India.
Department of Computer Science, University of Colorado, Colorado Springs, CO 80918, USA.

Abstract- The advent of DNA microarray technology has enabled biologists to monitor the expression levels (mRNA) of thousands of genes simultaneously. In this survey, we address various approaches to gene expression data analysis using clustering techniques, and we discuss the performance of existing clustering algorithms under each of these approaches. Since the proximity measure plays an important role in making a clustering technique effective, we briefly discuss various proximity measures. Finally, since evaluating the effectiveness of clustering techniques over gene data requires validity measures and data sources for numeric data, we discuss them as well.

Keywords- Gene expression data, proximity measure, coherent pattern, clustering, cluster validation

I. INTRODUCTION
Microarray technology has made it possible to simultaneously monitor the expression levels of many thousands of genes through two major types of microarray experiments: cDNA microarrays [1] and oligonucleotide arrays [2]. Although the two types of experiments follow different protocols, they share some common basic procedures. The analysis of gene expression is shown at a very high level in Fig. 1.

Fig. 1: Gene expression analysis pipeline: laboratory work (RNA, microarrays; sample preparation, hybridization, scanning) -> computer-aided data analysis (preprocessing, cluster analysis) -> further analysis (hypothesis generation, validation).

The basic steps executed in the process of gene expression data generation are as follows.
Chip manufacture: A microarray is a small chip consisting of a solid surface onto which tens of thousands of DNA molecules (probes) are attached in fixed grids. Each grid cell corresponds to a DNA sequence. Since thousands of DNA molecules are bonded to a single array, it is possible to measure the expression of many thousands of genes in parallel.
Sample preparation, labeling and hybridization: Two mRNA samples (a test sample and a control sample) are reverse-transcribed into cDNA (targets), labeled using either fluorescent dyes or radioactive isotopes, and then hybridized with the probes on the surface of the chip.
Washing: The slides are then washed to remove excess hybridization solution from the array so that only the DNA complementary to each of the features remains bound to the features on the array.
Image acquisition: The final step is to produce an image of the surface of the hybridized array; the chips are scanned to read the emitted signal intensity. The image of the microarray generated by the scanner is the raw data. Feature extraction software converts the image into numerical information that quantifies gene expression.

A. Pre-processing of gene expression data
A microarray experiment assesses a large number of DNA sequences under multiple conditions. These conditions may be a time series during a biological process or a collection of different tissue samples. In this survey, without making any distinction among the DNA sequences, we refer to them as genes. Similarly, we refer to all kinds of experimental conditions as samples. A gene expression dataset from a microarray experiment may be considered an n x l matrix M (as shown in Fig.
2), where M = {m_ij}: the rows represent the expression patterns of a set of n genes G = {g_1, ..., g_n}, the columns represent the expression profiles of a set of l samples S = {s_1, ..., s_l} at different conditions/time points, and each cell m_ij is the expression level of gene g_i (1 <= i <= n) on sample s_j (1 <= j <= l).

Fig. 2: A gene expression matrix: rows correspond to genes, columns to samples, and entry m_ij holds the expression level of gene g_i on sample s_j.

The microarray data thus generated are then cleaned, transformed and normalized to resolve errors, noise and bias introduced by the microarray experiments.

B. Coherent, Co-expressed and Co-regulated Genes
Two genes are said to co-express when they have similar expression patterns and may belong to the same or similar functional categories, whereas two genes are said to co-regulate when they are regulated by common transcription factors. Co-expressed genes in the same group are in all probability involved in the same cellular process. A strong expression pattern correlation between such genes indicates co-regulation. A coherent gene expression pattern (or coherent pattern, in short) characterizes the common trend of expression levels for a group of co-expressed genes. A coherent pattern is a template, and the expression profiles of the corresponding co-expressed genes match the pattern with small divergence. Coherent gene expression patterns may characterize important cellular processes and may also be involved in the regulating mechanisms of cells. Clustering techniques have proven useful in finding coherent or co-expressed gene patterns. However, as stated in [3], co-regulated genes have a very high chance of having similar functions, whereas co-expressed genes are not necessarily co-regulated and may not have similar functions.

C. Extraction of Interesting Patterns from Gene Expression Data Using Clustering Techniques
Clustering is the process of grouping data objects into a set of disjoint classes, called clusters, so that objects within a class have high similarity to each other, while objects in separate classes are more dissimilar. Clustering is an example of unsupervised classification. Classification refers to a procedure that assigns data objects to a set of classes; unsupervised means that clustering does not rely on predefined classes and training examples while classifying the data objects. As stated in [4], the various clustering approaches are partitional, hierarchical, density based, model based, grid based and soft computing. In this survey, we consider the use of these algorithms for gene-based clustering.

D. Proximity Measures
There are different methods [5] for quantifying similarity or dissimilarity among gene expression levels, described in terms of the distance between them in the high-dimensional space of gene expression measurements. A dissimilarity measure for any two genes g_i and g_j obeys the following properties.
i. The distance between any two profiles cannot be negative.
ii. The distance between a profile and itself must be zero.
iii. A zero distance between two profiles implies that the profiles are identical.
iv. The distance between profile g_i and profile g_j is the same as the distance between profile g_j and profile g_i, i.e., D(g_i, g_j) = D(g_j, g_i).

Different proximity measures: A microarray experiment compares genes from an organism under different development time points, conditions or treatments.
For an n-condition experiment, a single gene has an n-dimensional observation vector known as its gene expression profile. A proximity measure is a real-valued function that assigns a similarity (or dissimilarity) value to any pair of expression vectors. Therefore, to identify genes or samples that have similar expression profiles, selection of an appropriate proximity measure is essential. Some commonly used proximity measures can be found in [6]. Selection of a measure depends upon the type of data, its dimensionality and the approach used in the identification of coherent patterns.

E. Cluster Validity Measures
Cluster validity depends upon (i) the homogeneity of individual clusters, (ii) agreement with the ground truth of the clusters (available from domain knowledge), and (iii) the reliability of the clusters. A few validation measures commonly used in evaluating the effectiveness of techniques for gene expression data mining will be highlighted in this survey.

F. Paper Organization
The rest of the survey is organized as follows. Section II describes the formulation of the problem and its complexity. In Section III we discuss proximity measures for numeric data. Various generalized clustering approaches are reported in Section IV. In Section V, cluster validity measures are discussed, and Section VI discusses challenges and research issues. Section VII takes a brief look at some gene expression datasets to which gene-based clustering approaches have commonly been applied. Finally, we conclude in Section VIII.

II. PROBLEM FORMULATION
Gene expression data are generated by DNA chips and other microarray techniques, and they are often presented as matrices of expression levels of genes under different conditions (including environments, individuals and tissues). One of the major objectives of gene expression data analysis is to identify groups of genes having similar expression patterns over the full space or subspace of conditions.
It may result in the discovery of regulatory patterns or condition similarities. Generally, co-expressed genes, which are members of the same clusters, are expected to have similar functions. Most researchers address this issue by (i) using partitional, hierarchical, density-based, model-based or subspace clustering algorithms, based on the proximity between genes or conditions in the expression matrix, or (ii) giving equal weights to all conditions in the computation of gene similarity (and to all genes in the computation of condition similarity). However, if the proximity measure is not properly selected, it may lead to the discovery of some similar groups of genes at the expense of obscuring other similar groups. Hence, it may be necessary to recover information lost due to over-simplification of the similarity and grouping computation, which may reveal the involvement of a gene or a condition in more than one group.

III. EXISTING PROXIMITY MEASURES FOR NUMERIC DATA
Distance or similarity measures are essential for many pattern recognition problems such as classification, clustering and retrieval. From the scientific and mathematical points of view, distance is defined as a quantitative degree of how far apart two objects are [3]. A synonym for distance is dissimilarity. Distance measures satisfying the metric properties are simply called metrics, while other, non-metric distance measures are occasionally called divergences. Synonyms for similarity include proximity, and similarity measures are often called similarity coefficients. The choice of proximity measure depends upon the type of data, its dimensionality and the approach used in the identification of coherent patterns. We reproduce from [6] a partial list of proximity measures for numeric data, since gene expression data are mostly available as numeric data.
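As a concrete illustration of two of the most common measures listed below, Euclidean distance and a Pearson-correlation-based dissimilarity (1 - r) can be computed in a few lines of pure Python. The two expression profiles here are illustrative toy values, not real microarray data:

```python
import math

def euclidean(x, y):
    # Minkowski distance with n = 2
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_dissimilarity(x, y):
    # 1 - r, where r is the Pearson correlation coefficient;
    # insensitive to the magnitude of the two profiles
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

# Two toy expression profiles over 4 samples: same trend, different scale
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [2.0, 4.0, 6.0, 8.0]

print(euclidean(g1, g2))              # large: the scales differ
print(pearson_dissimilarity(g1, g2))  # near zero: identical trend
```

The contrast shows why measure selection matters: the two profiles are far apart under Euclidean distance but essentially identical under the correlation-based measure, which is often the more biologically meaningful notion of co-expression.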

TABLE I: PROXIMITY MEASURES FOR NUMERIC DATA

Minkowski distance: D(X, Y) = (sum_i |x_i - y_i|^n)^(1/n). Metric. Invariant to translation and rotation only for n = 2 (Euclidean distance). Features with large values and variances tend to dominate the other features.

Euclidean distance: special case of the Minkowski metric at n = 2. The most commonly used metric. Tends to form hyperspherical clusters.

City-block distance: special case of the Minkowski metric at n = 1. Tends to form hyper-rectangular clusters.

Sup distance: special case of the Minkowski metric as n tends to infinity.

Mahalanobis distance: D_M = (X_i - Y_i)^T S^{-1} (X_i - Y_i), where S is the within-group covariance matrix, calculated over all objects. Invariant to any nonsingular linear transformation. Tends to form hyper-ellipsoidal clusters. When features are uncorrelated, the squared Mahalanobis distance is equivalent to the squared Euclidean distance. May be computationally expensive.

Pearson correlation: D_P(X, Y) = 1 - r(X, Y), where r is the Pearson correlation coefficient. Not a metric. Unable to detect the magnitude of differences between two variables.

The effective use of any of these proximity measures (Table I) is highly influenced by the number of data records/instances, their dimensionality, the level of precision required and the nature of the analysis.

IV. GENE-BASED CLUSTERING APPROACHES
In this section, we discuss the problem of clustering genes based on their expression patterns. The purpose of gene-based clustering is to group together co-expressed genes, which indicates co-function and co-regulation. We first review a series of clustering algorithms belonging to the various approaches reported in Section I-C that have been applied to group genes. Finally, we present the issues involved in gene-based clustering. A.
Partitioning Approach
Clustering techniques belonging to this approach fall into two major sub-categories: centroid-based, where each cluster is represented by the gravity centre (mean) of its instances, and medoid-based, where each cluster is represented by the instance closest to the gravity centre. K-means [7] is a simple and fast centroid-based clustering algorithm which divides the data into a pre-defined number of clusters so as to optimize a predefined criterion. However, it may not yield the same result on each run. To detect the optimal number of clusters, users usually run the algorithm repeatedly with different values of k and compare the clustering results. For a large gene expression dataset containing thousands of genes, this extensive parameter fine-tuning may not be practical. Moreover, gene expression data typically contain a large amount of noise, and the k-means algorithm forces each gene into a single cluster, which may lead to the generation of biologically non-relevant clusters. It is also not suitable for detecting clusters of arbitrary shape. In spite of these limitations, the k-means algorithm is frequently used to cluster gene expression data due to its simplicity, and it provides baseline results against which new clustering algorithms can be compared. The k-means algorithm and its variants are widely applied to gene expression data, as evidenced by [50]-[59], [61] and [62]. k-modes [8] is a faster extension of the k-means algorithm for categorical data: it uses a different similarity measure, replaces means with modes, and updates the modes with a frequency-based method. The k-modes algorithm is useful only after conversion of the numeric data into categorical form; however, this conversion may lead to information loss and hence may deteriorate the clustering result. The working of the Fuzzy C-Means (FCM) [9] algorithm is very similar to k-means.
In FCM, each point has a degree of membership in each cluster rather than belonging completely to just one cluster. Thus, points on the edge of a cluster may belong to it to a lesser degree than points in the center of the cluster. Fuzzy C-Means and its variants are widely used for microarray data clustering, as in [55], [60] and [63]. Two early k-medoid methods are the algorithms PAM and CLARA [10]. PAM uses dissimilarity values and an iterative optimization approach to determine a representative object, called a medoid, for each cluster. Once the medoids have been selected, each non-selected object is grouped with the medoid to which it is most similar. Finally, the quality of a clustering is measured by the average dissimilarity between an object and the medoid of its cluster. CLARA follows the same principle as PAM but, instead of finding representative objects for the entire dataset, it draws a sample of the dataset and applies PAM to this sample. It then classifies the remaining objects using partitioning principles. CLARANS [11] relies on a randomized search of a graph to find medoids representing clusters. The algorithm takes as input maxneighbor and numlocal, selects a random node and then checks a sample of the node's neighbors. If a better neighbor is found, based on the cost differential of the two nodes, it moves to that neighbor and continues processing until the maxneighbor criterion is met. Otherwise, it declares the current node a local minimum and starts a new pass to search for other local minima. After a specified number of local minima (numlocal) have been collected, the algorithm returns the best of them as the medoid of the cluster.
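As an illustrative sketch of the centroid-based idea (a minimal pure-Python k-means, not any particular published variant), the following operates on toy two-dimensional profiles. Note that k must be fixed in advance and the outcome depends on the random initial centroids, which is why repeated runs with different parameters are compared in practice:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: random data points
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # Update step: each centroid moves to the mean of its members
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = tuple(sum(v) / len(members)
                                     for v in zip(*members))
    return clusters, centroids

# Two well-separated toy groups of "expression profiles" in 2-D
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
        (5.0, 5.1), (5.2, 5.0), (5.1, 4.9)]
clusters, _ = kmeans(data, k=2)
print([len(c) for c in clusters])  # each cluster captures one group
```

On this cleanly separated toy data any initialization converges to the two natural groups, but on noisy, high-dimensional gene expression data different seeds and different k values can yield quite different partitions, which is the weakness discussed above.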

Discussion: Though partitioning-based clustering algorithms can find separate clusters, in the context of gene expression data mining they have the following disadvantages.
i. The number of clusters is not known a priori.
ii. The proximity measures used by most of these methods are inadequate due to the high dimensionality of gene data.
iii. Gene data, apart from displaying disjoint cluster patterns, often show evidence of intersected and embedded cluster patterns, which are usually not detectable by the partitioning approach.

B. Hierarchical Approaches
Hierarchical clustering generates a hierarchical series of nested clusters which can be graphically represented by a tree, called a dendrogram. We can obtain a specified number of clusters by cutting the dendrogram at an appropriate level. Hierarchical clustering algorithms can be further divided into agglomerative and divisive approaches based on how the dendrogram is formed. Agglomerative algorithms (bottom-up) perform repeated amalgamations of groups of data until some pre-defined threshold is reached, whereas divisive algorithms (top-down) recursively divide the data until some pre-defined threshold is reached. For agglomerative approaches, linkage is the criterion used to determine the distance between two clusters: single linkage is the minimum distance between two objects of the two clusters, complete linkage is the maximum distance, and average linkage is the mean distance between every pair of objects of the two clusters. CURE [12] is an agglomerative hierarchical clustering algorithm which begins by choosing a constant number c of well-scattered points from a cluster. These points are used to identify the shape and size of the cluster. The next step of the algorithm shrinks the selected points toward the centroid of the cluster by some predetermined fraction.
ROCK [13], also an agglomerative hierarchical clustering algorithm, employs links, not distances, to measure the similarity/proximity between a pair of data points before merging them. CHAMELEON [14] uses a graph partitioning algorithm to partition the data using a method based on the k-nearest-neighbor approach, and then uses an agglomerative hierarchical clustering algorithm to combine the sub-clusters and find the real clusters. A study of the CHAMELEON algorithm on gene data is given in [83]. UPGMA [15] adopts an agglomerative method to graphically represent the clustered dataset. In this method, each cell of the gene expression matrix is colored on the basis of the measured fluorescence ratio, and the rows of the matrix are reordered based on the hierarchical dendrogram structure and a consistent node-ordering rule. After clustering, the original gene expression matrix is represented by a colored table (a cluster image) where large contiguous patches of color represent groups of genes that share similar expression patterns over multiple conditions. However, it is not robust in the presence of noise. BIRCH [16] is an integrated hierarchical clustering method that represents clusters by introducing two concepts, the clustering feature (CF) and the CF tree, which help the method (i) achieve good speed, (ii) scale to large databases and (iii) utilize memory better. BIRCH is also effective for incremental and dynamic clustering of incoming objects. It applies a multiphase clustering technique: a single scan of the dataset yields a good basic clustering, and additional scans may be used to further improve its quality. AMOEBA [17] is an algorithm that supports hierarchical spatial clustering based on Delaunay triangulation.
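The agglomerative scheme with a dendrogram cut can be sketched as follows. This is a minimal single-linkage illustration on toy one-dimensional points, with the distance threshold playing the role of the cut level; real implementations use far more efficient data structures:

```python
import math

def single_linkage(c1, c2):
    # Smallest distance between any pair of members of the two clusters
    return min(math.dist(a, b) for a in c1 for b in c2)

def agglomerate(points, threshold):
    # Start with singleton clusters and repeatedly merge the closest pair
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        (i, j), d = min(
            (((i, j), single_linkage(clusters[i], clusters[j]))
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda t: t[1])
        if d > threshold:   # cutting the dendrogram at this height
            break
        clusters[i].extend(clusters.pop(j))
    return clusters

data = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,)]
print(agglomerate(data, threshold=1.0))
```

Swapping `min` for `max` in `single_linkage` gives complete linkage, and taking the mean over all pairs gives average linkage; the choice changes which merges happen first and hence the shape of the dendrogram.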
Discussion: Hierarchical clustering methods are vastly popular because of their presentation of cluster results, which biologists often prefer, as evidenced by [50], [56]-[58] and [61]-[68]. However, the effectiveness of a method in this approach is mostly influenced by the following points.
i. The appropriateness of the proximity measure used.
ii. The scope for incorporating regulation information, expression values and other types of biological knowledge while expanding a cluster.
iii. The ability to work with high-dimensional numeric data.
iv. The representation of nested cluster structure is clear but is inadequate for representing intersected clusters.

C. Density Based Approaches
Density-based clustering identifies dense areas in the object space: clusters are highly dense areas separated by sparse areas. A cluster continues to grow as long as the density (the number of objects or data points in the Eps-neighborhood) exceeds some threshold. Such a method can be used to filter out noise and discover clusters of arbitrary shape. Moreover, it has been proposed in [18] that statistically significant patterns can be derived from dense regions, which can then be used to identify genes and/or samples of interest and also to eliminate genes and/or samples corresponding to outliers, noise or abnormalities. The well-known DBSCAN [19] density-based clustering algorithm starts with an arbitrary instance p in a dataset D and retrieves all instances of D density-reachable from p with respect to Eps and MinPts. The algorithm makes use of a spatial data structure, the R*-tree, to locate points within Eps distance of the core points of the clusters. It is found to be effective on spatial data with uniform density. Its use in gene data clustering is given in [82]. An extension of DBSCAN is OPTICS [20], which relaxes the strict requirements on the input parameters and can handle variable density. OPTICS computes an augmented cluster ordering to automatically and interactively cluster the data.
The ordering represents the density-based clustering structure of the data and contains information equivalent to the density-based clusterings obtained over a range of parameter settings. DBCLASD [21] is another locality-based clustering algorithm, but unlike DBSCAN it assumes that the points inside each cluster are uniformly distributed. Each cluster has a probability distribution of points to their nearest neighbors, and this probability set is used to define the cluster. A grid-based representation is used to approximate the clusters as part of the probability calculation. The major advantages of DBCLASD are that it requires no external input, it is unaffected by noise since it relies on a probability-based distance factor, and it can handle clusters of arbitrary shape. Its main disadvantages are that the assumption of uniformly distributed points within a cluster limits its effectiveness, its runtime slows as the size of the database increases, and it is computationally expensive compared to later algorithms. DENCLUE [22] is a generalization of the partitioning, hierarchical, density-based and grid-based clustering approaches. The algorithm initially performs a pre-clustering step, creating a map of the active portion of the dataset which is used to speed up density function calculations. The next step is clustering, including the identification of density attractors and their corresponding points. DENCLUE can identify clusters of arbitrary shape and exhibits good clustering properties in the presence of noise. TURN* [23] consists of an overall algorithm and two component algorithms: TURN-RES, an efficient resolution-dependent clustering algorithm which returns both a clustering result and cluster features from that result, and TURN-CUT, an automatic method for finding the important or optimum resolutions from a set of TURN-RES resolution results. TURN* removes the need for input parameters. The basic idea of the DHC [24] method is to consider a cluster as a high-dimensional dense area in which data objects are attracted to each other. It organizes the cluster structure of the dataset in a two-level hierarchical structure. Initially, the root node of the density tree represents the entire data as a single dense area, which is then split into several sub-dense areas based on some criterion, each represented by a child node of the root. These sub-dense areas are further split until each contains a single cluster.
The computational complexity of this splitting step makes DHC inefficient.
Discussion: Most existing density-based algorithms are capable of handling two-dimensional data with a uniform global density distribution. However, the characteristics of gene expression data demand a density-based clustering algorithm with the following capabilities.
i. The algorithm should be able to handle high-dimensional numeric data.
ii. It should recognize embedded and intersected cluster patterns in addition to uniformly distributed cluster patterns.
iii. It should be less dependent on input parameters.

D. Model Based Approach
Based on our study and analysis of gene expression data [72]-[76], we observe that a single gene may often exhibit membership in more than one cluster, which calls for a probabilistic clustering approach for a better fit. Model-based clustering is one such approach and is well suited to gene expression data. Such an algorithm attempts to optimize the fit between the given data and some mathematical model. Applications of the model-based approach to gene expression data are discussed in [56], [57] and [69]-[71]. COBWEB [25] creates a hierarchical clustering in the form of a classification tree, where the input objects are described by categorical attribute-value pairs. It has two operators, merging and splitting, which make COBWEB less dependent on input order and allow it to perform bidirectional search. The Self-Organizing Map (SOM) [26] was developed on the basis of a single-layered neural network. Each neuron of the network is associated with a reference vector, and each data point is mapped to the neuron with the closest reference vector. As the algorithm runs, each data object acts as a training sample which directs the movement of the reference vectors towards the denser areas of the input vector space, so that the reference vectors are trained to fit the distribution of the input dataset.
When the training is complete, clusters are identified by mapping all data points to the output neurons. The advantages of SOM are that (i) it generates intuitive cluster patterns from a high-dimensional dataset and (ii) it is relatively insensitive to noisy data. Its disadvantages are that (i) the number of clusters and the grid structure of the neuron map need to be given as input, and (ii) it is sensitive to these input parameters. SOM is a popular method for clustering data in a biological context, as seen in [54], [55], [59], [60], [61], [63] and [77]. AutoClass [27] uses a Bayesian approach, starting from a random initialization of the parameters and incrementally adjusting them in an attempt to find their maximum likelihood estimates. It assumes that, in addition to the observed or predictive attributes, there is a hidden variable which reflects the cluster membership of every element in the dataset. The data-clustering problem is therefore an example of learning from incomplete data, due to the existence of this hidden variable. The use of AutoClass for clustering microarray data is shown in [58] and [78].
Discussion: Though the model-based approach can be considered highly relevant to the gene expression data mining problem, most existing methods under this approach suffer from the following disadvantages:
i. the number of clusters and the grid structure need to be given as input,
ii. they are sensitive to the input parameters, and
iii. the algorithms are computationally expensive.

E. Graph Theoretic Approach
Graph-theoretic clustering techniques explicitly work with data represented as a graph. AUTOCLUST [28] automatically extracts cluster boundaries based on Voronoi modeling and Delaunay diagrams. Parameters need not be specified by users; they are revealed by the proximity structures of the Voronoi modeling, and AUTOCLUST calculates them from the Delaunay diagram. This removes human-generated bias and also reduces the exploration time.
The advantages are that (i) it is effective in detecting clusters of different densities, and (ii) it identifies and removes multiple bridges linking clusters.
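A much simplified sketch of the graph-theoretic idea (far simpler than CLICK or CAST, and purely illustrative): threshold the pairwise distances to build a proximity graph, then read clusters off as connected components. The points and the eps threshold here are toy assumptions:

```python
import math
from collections import defaultdict

def proximity_graph(points, eps):
    # Add an edge between two points when their distance is below eps
    graph = defaultdict(set)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) < eps:
                graph[i].add(j)
                graph[j].add(i)
    return graph

def connected_components(n, graph):
    # Depth-first search; each component is one cluster
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(graph[v] - seen)
        components.append(sorted(comp))
    return components

pts = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.3), (4.0, 4.0), (4.3, 4.0)]
print(connected_components(len(pts), proximity_graph(pts, eps=1.0)))
# → [[0, 1, 2], [3, 4]]
```

Algorithms such as CLICK and CAST improve on this naive thresholding by reasoning probabilistically about which edges are spurious, rather than trusting a single global cutoff.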

CLICK [29] tries to identify clusters as highly connected components in a proximity graph, based on a probabilistic assumption. It defines the weight of an edge as the probability that the vertices of the edge are in the same cluster, and then iteratively finds the minimum cut in the graph, recursively splitting the dataset into a set of connected components subject to a predefined threshold. It also includes two post-pruning steps to refine the cluster results. CLICK is capable of recognizing intersecting clusters, and it generates good-quality clusters over gene expression datasets in terms of homogeneity and separation. Its use is found in the literature in [43], [59], [69] and [79]. CAST [30] is based on the corrupted clique graph data model. It takes as input a real, symmetric, n-by-n similarity matrix M and an affinity threshold. It assumes that the input dataset contains errors and that the true clusters of the data points are given by a disjoint union of complete sub-graphs (a clique graph), where each clique represents a cluster; a cluster is a set of high-affinity elements subject to the affinity threshold. The input proximity graph (derived from the n-by-n similarity matrix) is assumed to be generated from the true clique graph by flipping each edge/non-edge with some probability. The clustering process defined by CAST is thus a process of eliminating noise or contamination from the corrupted version with a minimum number of flips. CAST discovers clusters one at a time. It is popular for generating clusters from gene expression data [30], [43], [50], [59], [62], [69] and [80].
Discussion: The graph-theoretic approach can be considered highly relevant to gene expression data mining, as it is capable of discovering intersected and embedded clusters. However, it sometimes generates non-realistic cluster patterns. Moreover, the results depend on the proximity measure, so inappropriate selection of the proximity measure will result in biologically non-relevant clusters.

F. Soft Computing Approach
Fuzzy Clustering: Gene expression data analysis sometimes encounters intersecting gene patterns, in which case crisp or hard clustering may not yield a good result. In fuzzy clustering, however, a gene can belong to several clusters with certain degrees of membership. An application of fuzzy clustering in a biological context can be found in [81]. Fuzzy C-Means (FCM), a generalization of ISODATA [31], operates on numeric data to identify c fuzzy clusters (c is a user input). FCM uses Euclidean distance and encounters problems in outlier handling, identification of initial partitions and discovery of clusters of all shapes, which are also the limitations of hard-partitioning clustering techniques. However, several variants of FCM found in the literature attempt to overcome these limitations.
Neural Networks-Based Clustering: Two popular neural-network-based clustering approaches, SOFM (Self-Organizing Feature Map) and ART (Adaptive Resonance Theory), are discussed briefly here. SOFM can be viewed as a visual representation of a lattice of neurons interconnected via adaptable weights. During the training process, neighboring input patterns are projected onto adjacent neurons of the lattice. The advantages of SOFM are that (i) it approximates the density of the input space and (ii) it is input-order independent. The disadvantages are that (i) like k-means, SOFM needs the size of the lattice (the number of clusters) to be predefined and (ii) it may suffer from input space density misrepresentation. Several variants of SOFM that attempt to overcome these limitations can be found in the literature [34]-[36]. The use of SOFM for clustering biological data is found in [82]. ART, a large family of neural network architectures, is able to learn any input pattern in a fast, stable and self-organizing way.
ART1 deals with binary input patterns and can be extended to arbitrary input patterns through a coding mechanism; ART2 extends the approach to analog input patterns, and ART3 achieves efficient parallel search in hierarchical structures based on certain biological processes. Several applications of ART can be found in the literature [61]. GA-Based Clustering: GA (Genetic Algorithm) clustering uses randomized search and optimization techniques based on the principles of evolution and natural genetics. GAs search complex, large and multimodal landscapes and provide near-optimal solutions to the objective function of an optimization problem. Several GA-based clustering algorithms are found in the literature [37]-[42]. GenClust [43] is an effective GA-based clustering algorithm with two key features: (i) a simplified and compact coding of the search space, and (ii) the use of a data-driven internal validation method. GenClust works stage-wise, producing a sequence of partitions, each consisting of k classes; the partitioning process continues until a termination criterion is satisfied. The GenClust algorithm is used in [59] to cluster gene expression data. Discussion: In fuzzy clustering, the problem of specifying the number of clusters remains. The limitations of FCM can be addressed by (i) using the city-block distance (L1 norm) to improve robustness to outliers [32], (ii) an initialization strategy based on unsupervised tracking of cluster prototypes in a two-layer clustering scheme [33], and (iii) using appropriate prototypes and distance functions to identify clusters of arbitrary shape. The basic advantages of ART are that it is fast and exhibits stable learning and pattern detection. Its disadvantages are (i) inefficiency in dealing with noise and (ii) difficulty in representing clusters in higher dimensions.
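A minimal sketch of the basic FCM iteration discussed above, assuming Euclidean distance and the common fuzzifier m = 2 (the deterministic initialisation and toy data are simplifications for illustration; real implementations initialise randomly):

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100):
    """Minimal fuzzy c-means sketch: each point receives a degree of
    membership in every cluster (rows of U sum to 1), unlike hard
    k-means."""
    idx = np.linspace(0, len(X) - 1, c).astype(int)
    centers = X[idx].astype(float)          # crude deterministic init
    for _ in range(n_iter):
        # membership update: u_ik proportional to d_ik^(-2/(m-1))
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12
        U = 1.0 / d ** (2.0 / (m - 1.0))
        U /= U.sum(axis=1, keepdims=True)
        # centre update: weighted mean with weights u_ik^m
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
    return centers, U

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centers, U = fuzzy_c_means(X, c=2)
# rows of U sum to 1; each point's largest membership marks its cluster
```

On well-separated points the memberships become nearly crisp; a gene lying between two expression patterns would instead keep intermediate memberships in both clusters, which is exactly the behaviour that motivates fuzzy clustering here.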
GenClust creates good clusters from gene expression datasets by using external criteria (when a true solution is available) and internal criteria (when a true solution is not known). However, like other GA-based solutions, GenClust is not free from the local optima problem.
G. Incremental Approach
In [44], the authors present an incremental clustering approach based on the DBSCAN algorithm. A one-pass clustering algorithm for relational datasets is proposed in [45]. Rough set theory is employed in an incremental approach for clustering interval datasets in [46]. In [47], an incremental genetic k-means algorithm is presented. In [48], an incremental gene selection algorithm using a wrapper-based method is presented; it reduces the search-space complexity since it works on the rankings directly. In [49], an incremental clustering algorithm over gene expression data is presented; it uses regulation information to store cluster information for reuse when clustering genes incrementally. The effectiveness of an incremental gene clustering algorithm depends on the proximity measure used and on variations in density.
V. CLUSTER VALIDITY MEASURES
The previous section reviewed a number of clustering algorithms that partition a dataset based on a variety of clustering criteria. For gene expression data, clustering yields groups of co-expressed genes, groups of samples with a common phenotype, or blocks of genes and samples involved in specific biological processes. However, different clustering algorithms, or even a single clustering algorithm used with different parameters, generally produce different sets of clusters [50]. It is therefore important to compare clustering results and select the one that best fits the true data distribution. Cluster validation is the process of assessing the quality and reliability of the cluster sets derived from various clustering processes. Cluster validity generally has three aspects. First, the quality of clusters can be measured in terms of homogeneity and separation, following the definition of a cluster: objects within one cluster are similar to each other and different from objects in other clusters. The second aspect relies on the ground truth of the clusters, which may come from domain knowledge such as the clinical diagnosis of normal or cancerous tissues; validation is then based on the agreement between the clustering results and the ground truth. The third aspect focuses on the reliability of the clusters, i.e., the likelihood that the cluster structure has not arisen by chance.
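The agreement-with-ground-truth aspect reduces to counting gene pairs. A small self-contained sketch (with toy label vectors) of the pair counts a, b, c, d on which the Rand index and Jaccard coefficient in this section are built:

```python
from itertools import combinations

def pair_counts(C, P):
    """Count gene pairs by agreement between a clustering C and a
    ground-truth partition P: a = together in both, b = together in C
    only, c = together in P only, d = separated in both."""
    a = b = c = d = 0
    for i, j in combinations(range(len(C)), 2):
        same_c, same_p = C[i] == C[j], P[i] == P[j]
        if same_c and same_p:
            a += 1
        elif same_c:
            b += 1
        elif same_p:
            c += 1
        else:
            d += 1
    return a, b, c, d

C = [0, 0, 1, 1, 1]      # algorithm's clusters (toy labels)
P = [0, 0, 1, 1, 0]      # ground-truth annotation (toy labels)
a, b, c, d = pair_counts(C, P)
rand = (a + d) / (a + b + c + d)
print(rand)              # -> 0.6
```

The same four counts feed directly into the measures defined next: the Rand index uses all of a, b, c, d, while the Jaccard coefficient discards the "separated in both" pairs d.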
In this section, we discuss these three aspects of cluster validation in the light of some popular validity measures.
A. Rand Index
The Rand index [84] measures the similarity between two partitions. It is the number of object pairs that are placed together in both partitions or apart in both partitions, divided by the total number of pairs:
Rand index = (a + d) / (a + b + c + d),
where, for a clustering C and a partition P, a is the number of gene pairs (g_i, g_j) with C_ij = 1 and P_ij = 1, b is the number of pairs with C_ij = 1 and P_ij = 0, c is the number of pairs with C_ij = 0 and P_ij = 1, and d is the number of pairs with C_ij = 0 and P_ij = 0. The Rand index lies between 0 and 1; the maximum value of 1 is achieved when the two partitions C and P agree perfectly. Examples of papers that use the Rand index are [57], [63] and [70].
B. Cluster Homogeneity
Homogeneity measures the quality of clusters on the basis of the definition of a cluster: objects within a cluster are similar, while objects in different clusters are dissimilar. The homogeneity measure used here is the overall average homogeneity of [85], calculated as follows.
i. Compute the average similarity between each gene g_i in cluster C_k and the centroid c_k of that cluster:
H(C_k) = (1 / |C_k|) * Sum_{g_i in C_k} sim(g_i, c_k), where c_k is the centroid of C_k.
ii. Compute the average homogeneity of the clustering C, weighted by cluster size:
H_ave = (1 / N) * Sum_k |C_k| * H(C_k), where N is the total number of genes.
C. Silhouette Index
The silhouette index [86] assesses the quality of any clustering solution, reflecting both the compactness and the separation of clusters. It is calculated as follows.
i. Compute a(g_i), the average distance of gene g_i to the other genes of the cluster A to which it belongs.
ii. Compute d(g_i, C_k), the average distance of gene g_i to the genes of each cluster C_k with g_i not in C_k.
iii. Compute b(g_i) = min_{C_k != A} d(g_i, C_k), the distance of gene g_i to its closest other cluster, and the silhouette width of gene g_i:
s(g_i) = (b(g_i) - a(g_i)) / max(a(g_i), b(g_i)).
iv. Compute the silhouette index as the average of s(g_i) over i = 1, 2, ..., G, where G is the total number of genes: S = average{s(g_i)}.
The silhouette index varies from -1 to 1, with higher values indicating better clustering. It has been used in [63].
D. Z-score
The z-score [87] is calculated by investigating the relation (mutual information, MI) between a clustering obtained by an algorithm and the functional annotation of the genes in each cluster. It represents a standardized distance between the MI value obtained by the clustering and the MI values obtained by random assignment of genes to clusters. A higher z-score indicates that genes are better clustered by function, i.e., that the clustering result is more significantly related to gene function and therefore more biologically relevant. The use of the z-score is shown in [88].
E. Jaccard Coefficient
The Jaccard coefficient measures the proportion of pairs that are in the same cluster of C and in the same group of P among those that are together in at least one of the two [91]:
J = a / (a + b + c),
where a (SS) counts pairs in the same cluster of C and the same group of P, b (SD) counts pairs in the same cluster of C but different groups of P, and c (DS) counts pairs in different clusters of C but the same group of P. As with the Rand index, the coefficient lies between 0 and 1, and values close to 1 indicate high agreement between C and P.
F. Davies-Bouldin Index
Let s_i be a measure of the dispersion of cluster C_i (e.g., the average distance of its members to its centroid) and d_ij = d(C_i, C_j) the dissimilarity between two clusters. The Davies-Bouldin index [92] is defined as
DB = (1/k) * Sum_{i=1..k} R_i, with R_i = max_{j != i} R_ij and R_ij = (s_i + s_j) / d_ij,
where R_ij is a similarity index between C_i and C_j. Lower values indicate better clustering.
G. Dunn Index
The Dunn index [93] is defined as
D = min_{1<=i<=k} min_{j != i} [ d(C_i, C_j) / max_{1<=l<=k} diam(C_l) ],
where d(C_i, C_j) = min_{x in C_i, y in C_j} d(x, y) is the dissimilarity between clusters C_i and C_j, and diam(C) = max_{x, y in C} d(x, y) is the diameter of cluster C. Higher values indicate better clustering.
H. P-value
The reliability of the resulting clusters can be estimated by the p-value [52] of a cluster. It measures the probability of finding the observed number of genes annotated with a given Gene Ontology (GO) term (i.e., function, process or component) within a cluster. For a GO category containing f of the g genes, the probability p of getting k or more such genes within a cluster of size n is given by the cumulative hypergeometric distribution:
p = Sum_{i=k..n} [ C(f, i) * C(g - f, n - i) / C(g, n) ].
A low p-value indicates that the genes belonging to the enriched functional category are biologically significant in the corresponding cluster. The use of the p-value to compare different clustering solutions of yeast cell-cycle gene expression data is given in [90].
VI. DATASETS USED
We present some datasets related to Human, Yeast and the Rat Central Nervous System in Table II along with their sources. These datasets have often been used to evaluate the clustering of gene array data.
TABLE II: DATASETS AND THEIR SOURCE
Dataset | No. of genes | No. of conditions | Source
Yeast Sporulation | | |
Yeast Cell Cycle | | | Sample input files in Expander [115]
Subset of Yeast Cell Cycle | | |
Yeast Diauxic Shift | | |
Rat CNS | | |
Arabidopsis Thaliana | | |
Subset of Human Fibroblasts Serum | | |
Human cancer data | | |
Colon cancer data | | and 22 |
VII. CHALLENGES IN CLUSTERING GENE EXPRESSION DATA
Clustering gene expression data poses challenges different from those of clustering non-biological data, owing to the very nature of data collected from microarray experiments.
A. Challenges and Research Issues
Studies have confirmed that clustering algorithms are useful in identifying groups of co-expressed genes and discovering coherent expression patterns. However, due to the distinct characteristics of time-series gene expression data and the special requirements of the biology domain, clustering gene expression data still faces the following challenges.
1) The effectiveness of a clustering technique is highly dependent on the proximity measure used by the technique. Choosing or finding an appropriate proximity measure is a challenging task.
2) Most existing clustering techniques depend on input parameters or stopping criteria to discover the true number of clusters, which remains a major challenge.
3) Gene expression data often contain clusters that are highly connected [89], intersected or even embedded [24]. A clustering algorithm should be capable of handling such situations effectively.
4) The available gene datasets often contain considerable noise and missing values. A clustering algorithm should therefore be able to extract the true clusters in the presence of this noise and to handle missing values.
5) A clustering algorithm should be efficient, so that it scales with the increasing size and dimensionality of datasets.

6) Apart from clustering, an algorithm should be capable of showing the associations among the clusters, which may be useful for drawing conclusions.
Among the partitioning approaches, algorithms such as PAM and Fuzzy C-Means have been observed to be robust. However, they suffer from limitations: (i) the number of clusters must be known a priori, and (ii) the proximity measure used may be inadequate for finding the true number of biologically relevant clusters. Algorithms following the hierarchical approach are advantageous from the biologists' perspective, as they represent cluster-to-cluster associations in addition to individual co-expressed gene groups. However, they too have limitations: (i) difficulty in deciding an appropriate stopping criterion, and (ii) difficulty in simultaneously representing disjoint, embedded and intersected clusters. Density-based approaches are established as good at finding clusters of arbitrary shape, and they can identify global as well as local (embedded) clusters. Their two main limitations are that (i) they are sensitive to input parameters, and (ii) they are ineffective at finding intersected patterns in high-dimensional data. A model-based approach provides an estimated probability that a single gene belongs to more than one cluster, indicating that a gene may correlate highly with two quite different clusters. It discovers good parameter values iteratively and can handle various data shapes. However, (i) it can be computationally expensive, since many iterations may be required to estimate its parameters, and (ii) it assumes that the dataset fits a specific distribution, which is not always true. Graph-theoretic algorithms are suitable for subspace and high-dimensional data clustering.
The algorithms under this approach do not require a user-defined number of clusters, handle outliers efficiently, and can discover intersected and embedded clusters. However, they are limited by (i) the difficulty of determining a good threshold value, and (ii) the need for a priori knowledge of the dataset. Among the soft computing approaches, Fuzzy C-Means (FCM) and Genetic Algorithms (GAs) have been used effectively in clustering gene expression data. FCM requires the number of clusters as an input parameter; GA-based algorithms can detect biologically relevant clusters but depend on proper tuning of their input parameters.
VIII. CONCLUSION
Two groups of people are involved in the clustering of biological data. One is the biologist, who uses an existing clustering algorithm to solve an underlying biological problem; the challenge here is to make an appropriate choice of algorithm, since different algorithms produce different results. The other is the developer of clustering algorithms, who strives to improve existing algorithms so that the underlying biological problems can be solved efficiently. A proper amalgamation of these two groups will lead to rapid advancement in this field. Although gene expression clustering has commonly been done with k-means, hierarchical clustering and SOM algorithms, the desired features for clustering include minimal user input, the ability to find arbitrarily shaped clusters, robustness to outliers and the ability to handle high dimensionality. If outliers pose a problem when using k-means for clustering gene expression data, the biologist can consider k-medoids, CLARA or CLARANS as alternatives. If hierarchical clustering is to be used, BIRCH is suitable where outlier handling matters, while CURE or DHC can detect arbitrarily shaped clusters.
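When competing clustering solutions must be compared, a validity measure can guide the choice. A compact, dependency-light sketch of the silhouette index of Section V.C, contrasting a good and a bad labelling of the same toy data (data and labels are invented for illustration):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette width s(i) = (b - a) / max(a, b): a is the
    average within-cluster distance, b the average distance to the
    nearest other cluster; higher means tighter, better-separated
    clusters."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None], axis=2)     # pairwise distances
    s = np.empty(n)
    for i in range(n):
        own = labels == labels[i]
        a = D[i, own & (np.arange(n) != i)].mean()       # cohesion
        b = min(D[i, labels == c].mean()                 # separation
                for c in set(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
good = np.array([0, 0, 1, 1])      # respects the two natural groups
bad = np.array([0, 1, 0, 1])       # mixes the two groups
print(silhouette(X, good) > silhouette(X, bad))   # -> True
```

The same comparison applies to competing algorithms or parameter settings: the labelling with the higher average silhouette width is, by this criterion, the better-separated clustering.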
Fuzzy C-means and GA-based approaches support overlapping clusters of co-regulated genes, while projected density-based approaches improve quality on high-dimensional data but again need user-specified parameters. A clustering algorithm's suitability for biological data depends on desirable features such as speed, a minimal number of input parameters, robustness to noise and outliers, redundancy handling and independence from the input order of objects. Though many clustering algorithms match these requirements, they have not yet been applied to clustering biological data. Moreover, not all validity measures are suitable for all gene datasets; hence a judicious choice of validity measure has to be made. It is well known that most clustering methods are highly sensitive to the input data, and a slight variation in the data may result in very different gene clusters. If information from genomic knowledge bases, such as GO, could be incorporated (data fusion) earlier in the analysis of genomic data, that additional information about genes and their relationships would improve the stability, accuracy and/or biological relevance of the cluster results. In this survey, an attempt has been made to provide a comprehensive account of various clustering approaches in the context of coherent pattern identification in gene expression data. The effectiveness of a clustering technique is highly influenced by the proximity measure it uses, and this survey also presents a partial list of proximity measures available for numeric data clustering. It attempts to analyse the effectiveness of algorithms belonging to each approach in gene expression data mining. Finally, it discusses various validity measures and the sources of various datasets for the effective evaluation of clustering results. REFERENCES [1] M. Schena, D. Shalon, R.W. Davis, and P.O.
Brown, Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray, Science, vol. 270, pp , [2] D. Lockhart et al., Expression Monitoring by Hybridization to High-Density Oligonucleotide Arrays, Nature Biotechnology, vol. 14, pp , [3] R. Loganantharaj, Beyond clustering of array expressions, Int. J. Bioinformatics Res. Appl., vol. 5, no. 3, pp , [4] P. Berkhin, Survey of Clustering Data Mining Techniques, [5] S.-H. Cha, Comprehensive survey on distance/similarity measures between probability density functions, International Journal of

Mathematical Models and Methods in Applied Science, vol. 1, no. 4, November [6] R. Xu and D. Wunsch, Survey of Clustering Algorithms, IEEE Transactions on Neural Networks, vol. 16, no. 3, pp , May [7] J. MacQueen, Some Methods for Classification and Analysis of Multivariate Observations, in Proc. of the 5th Berkeley Symposium on Mathematics, Statistics and Probability, pp , 1967 [8] Z. Huang, A Fast Clustering Algorithm to cluster very large categorical datasets in Data Mining, SIGMOD Workshop on Research Issues on DM and Knowledge Discovery, May 1997 [9] J. C. Bezdek, R. Ehrlich and W. Full, FCM: The Fuzzy c-Means Clustering Algorithm, Computers and Geosciences, [10] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990. [11] R. T. Ng and J. Han, Efficient and Effective Clustering Method for Spatial Data Mining, in Proc. VLDB '94, pp , [12] S. Guha, R. Rastogi, and K. Shim, CURE: An Efficient Clustering Algorithm for Large Datasets, ACM SIGMOD Conf., [13] S. Guha, R. Rastogi and K. Shim, ROCK: A Robust Clustering Algorithm for Categorical Attributes, IEEE Conf. on Data Engineering, [14] G. Karypis, E. H. Han and V. Kumar, CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modelling, Computer, vol. 32, no. 8, pp , [15] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein, Cluster Analysis and Display of Genome-Wide Expression Patterns, In Proc. Nat'l Academy of Science, vol. 95, no. 25, pp , Dec [16] T. Zhang, R. Ramakrishnan, and M. Livny, BIRCH: An Efficient Data Clustering Method for Very Large Datasets, Data Mining and Knowledge Discovery, vol. 1, no. 2, pp , [17] V. Estivill-Castro and I. Lee, AMOEBA: Hierarchical Clustering Based On Spatial Proximity Using Delaunay Diagram, in the 9th Int'l Symposium on Spatial Data Handling, [18] A. M. Yip, M. K. Ng, E. H. Wu and T. F.
Chan, Strategies for Identifying Statistically Significant Dense Regions in Microarray Data, IEEE/ACM Transactions on Computational Biology and Bioinformatics, pp , July-September, 2007 [19] M. Ester, H.-P. Kriegel, J. Sander,and X. Xu, A density-based algorithm for discovering clusters in large spatial databases, In Proc. Int. Conf. Knowledge Discovery and Data Mining (KDD 96), pp , Portland, OR, Aug.1996 [20] M. Ankerst, M. Breuing, H. -P. Kriegel and J.Sander, OPTICS: Ordering points to identify the clustering structure, In Proc ACM- SIGMOD Conf. on Management of Data (SIGMOD 99), pp , [21] X. Xu, M. Ester, H.-P Kriegel and J Sander, A Nonparametric Clustering Algorithm for Knowledge Discovery in Large Spatial Datasets, in IEEE Int. Conf. On Data Engineering, [22] A. Hinneburg and D. Keim, An efficient approach to clustering in large multimedia datasets with noise, in the Proc. of 4th Int. Conf. On Knowledge Discovery and Data Mining, pp , 1998 [23] A. Foss and O. R. Zaiane, TURN* unsupervised clustering of spatial data, in Proc. of the ACM-SIKDD Int. Conf. On Knowledge Discovery and Data Mining, 2002 [24] D. Jiang, J. Pei, and A. Zhang, DHC: A Density-Based Hierarchical Clustering Method for Time-Series Gene Expression Data, in the Proc. BIBE2003: Third IEEE Int l Symp. Bioinformatics and Bioeng., [25] D. Fisher, Improving interference through conceptual clustering, in the Proc. of the AAAI Conf., pp , 1987 [26] T. Kohonen, Self-Organizing Maps, 2nd Extended Edition, Springer Series in Information Sciences, vol. 30, 1997 [27] P. Cheeseman and J. Stutz, Bayesian Classification (AutoClass): Theory and results, in Advances in Knowledge Discovery and Data Mining, pp , [28] V. Estivil-Castro and I. Lee, AutoClust: Automatic clustering via boundary extraction for mining massive point datasets, In Proc. 5th International Conference on Geocomputation. Pp , [29] R. Shamir and R. Sharan, CLICK: A Clustering Algorithm for Gene Expression Analysis, Proc. Eighth Int l Conf. 
Intelligent Systems for Molecular Biology (ISMB 00), [30] A. Ben-Dor, R. Shamir, and Z. Yakhini, Clustering Gene Expression Patterns, J. Computational Biology, vol. 6, nos. 3/4, pp , [31] J. Dunn, A fuzzy relative of the ISODATA process and its use in detecting compact well separated clusters, J. Cybern., vol. 3, no. 3, pp , [32] P. Kersten, Implementation issues in the fuzzy c-medians clustering algorithm, in Proc. 6th IEEE Int. Conf. Fuzzy Systems, vol. 2, pp , [33] I. Gath and A. Geva, Unsupervised optimal fuzzy clustering, IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, no. 7, pp , Jul [34] T. Kohonen, Self-Organizing Maps, 3rd edition, New York: Springer- Verlag, [35] M. Su and H. Chang, Fast self-organizing feature map algorithm, IEEE Trans. Neural Netw., vol. 11, no. 3, pp , May [36] J. Vesanto and E. Alhoniemi, Clustering of the self-organizing map, IEEE Trans. Neural Netw., vol. 11, no. 3, pp , May [37] G. Babu and M. Murty, A near-optimal initial seed value selection in K-means algorithm using a genetic algorithm, Pattern Recognit. Lett., vol. 14, no. 10, pp , [38] G. Babu and M. Murty, Clustering with evolution strategies, Pattern Recognit., vol. 27, no. 2, pp , [39] M. Cowgill, R. Harvey, and L. Watson, A genetic algorithm approach to cluster analysis, Comput. Math. Appl., vol. 37, pp , [40] U. Maulik and S. Bandyopadhyay, Genetic algorithm-based clustering technique, Pattern Recognition Letters, vol. 33, pp , [41] L. Tseng and S. Yang, A genetic approach to the automatic clustering problem, Pattern Recognit., vol. 34, pp , [42] K. Krishna and M. Murty, Genetic K-means algorithm, IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 29, no. 3, pp , Jun [43] V. Di-Gesu, R. Giancarlo, G. Lo-Bosco, A. Raimondi and D. Scaturro, GenClust: A genetic algorithm for clustering gene expression data, BMC Bioinformatics, vol. 6, pp. 289, 2005 [44] M. Ester, H. P. Kriegel, J. Sander, M. Wimmer, and X. 
Xu, An incremental clustering for mining in a data warehousing environment, in Proceedings of the 24th VLDB Conference, New York, USA, [45] L. Tao and S. Sarabjot, Hirel: An incremental clustering algorithm for relational datasets, in Proceedings of Eighth IEEE International Conference on Data Mining, 2008, pp [46] S. Asharaf, M. Narasimha, and S. Shevade, Rough set based incremental clustering of interval data, Pattern Recognition Letters, vol. 27, pp , [47] Y. Lu, S. Lu, F. Fotouhi, Y. Deng, and S. Brown, Incremental genetic k-means algorithm and its application in gene expression data analysis, BMC Bioinformatics, vol. 5(172), [48] R. Ruiz, J. Riquelme, and J. Aguilar-Ruiz, Incremental wrapper-based gene selection from microarray data for cancer classification, Pattern Recognition, vol. 39, p , [49] R. Das, D. Bhattacharyya, and J. Kalita, An incremental clustering of gene expression data, in Proceedings of NABIC, Coimbatore, India, [50] K. Y. Yeung, D. R. Haynor, and W. L. Ruzzo, Validating Clustering for Gene Expression Data, Bioinformatics, vol. 17, no. 4, pp , [51] P. Kraj, A. Sharma, N. Garge, R. Podolsky and R. A. Mcindoe, Parakmeans: Implementation of a parallelized k-means algorithm suitable for general laboratory use, BMC Bioinformatics, vol. 9, pp. 200, [52] S. Tavazoie, D. Hughes, M.J. Campbell, R.J. Cho, and G.M. Church, Systematic Determination of Genetic Network Architecture, Nature Genetics, pp ,

[53] J. A. Hartigan and M. A. Wong, A K-means clustering algorithm, Appl. Stat., vol. 28, pp , [54] A. Sugiyama, M. Kotani, Analysis of gene expression Data Using Self-organizing Maps and k-means Clustering, IEEE, pp , [55] J.H. Do and D.-K. Choi, Clustering Approaches to Identifying Gene Expression Patterns from DNA Microarray Data, Mol. Cells, vol. 25, no. 2, pp. 1-1, [56] S. Datta and S. Datta, Comparisons and validation of statistical clustering techniques for microarray gene expression data, Bioinformatics, vol. 19, no. 4, pp , [57] A. Thalamuthu, I. Mukhopadhyay, X. Zheng and G. C. Tseng, Evaluation and comparison of gene clustering methods in microarray analysis, Bioinformatics, vol. 22, no. 19, pp , [58] Y. Liu, S. B. Navathe, J. Civera, V. Dasigi, A. Ram, B. J. Ciliax, and R. Dingledine, Text Mining Biomedical Literature for Discovering Gene-to-Gene Relationships: A Comparative Study of Algorithms, IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2, no. 1, January-March [59] R. Shamir and R. Sharan, Algorithmic Approaches to Clustering Gene Expression Data, In Current Topics in Computational Biology, pp , MIT Press, [60] D. Dembele and P. Kastner, Fuzzy c-means method for clustering microarray data, Bioinformatics, vol. 19, pp , [61] S. Tomida, T. Hanai, H. Honda, T. Kobayashi, Analysis of expression profile using fuzzy adaptive resonance theory, Bioinformatics, vol. 18, no. 8, pp , [62] B. Dost, C. Wu, A. Su and V. Bafna, TCLUST: A Fast Method for Clustering Genome-Scale Expression Data, IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 99, 2010 [63] S. Bandyopadhyay, A. Mukhopadhyay and U. Maulik, An improved algorithm for clustering gene expression data, Bioinformatics, vol. 23, no. 21, pp , [64] H. Ikota, S. Kinjo, H. Yokoo and Y. Nakazato, Systematic immunohistochemical profiling of 378 brain tumors with 37 antibodies using tissue microarray technology, Acta Neuropathol, vol. 111, no. 5, pp , [65] F.X.
Wu, W.J. Zhang, and A.J. Kusalik, Determination of the minimum number of microarray experiments for discovery of gene expression patterns, BMC Bioinformatics, vol. 7 (Suppl. 4), S13, [66] B. Xing, C.M. Greenwood and S.B. Bull, A hierarchical clustering method for estimating copy number variation, Biostatistics, vol. 8, pp , [67] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein, Cluster Analysis and Display of Genome-Wide Expression Patterns, In Proc. Nat l Academy of Science, vol. 95, no. 25, pp , Dec [68] U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack, and A.J. Levine, Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Array, In Proc. Nat l Academy of Science, vol. 96, no. 12, pp , June [69] D. Jiang, C. Tang and A. Zhang, Cluster Analysis for Gene Expression Data: A Survey, IEEE Transactions on Knowledge and data Engineering, vol. 16, no. 11, November [70] K.Y. Yeung, C. Fraley, A. Murua, A.E. Raftery, and W.L. Ruzz, Model-Based Clustering and Data Transformations for Gene Expression Data, Bioinformatics, vol. 17, pp , [71] G.J. McLachlan, R.W. Bean, and D. Peel, A Mixture Model-Based Approach to the Clustering of Microarray Expression Data, Bioinformatics, vol. 18, pp , [72] R. Das, D K Bhattacharyya and J K Kalita, An Effective Dissimilarity Measure for Clustering Gene Expression Time Series Data, Fourth Biotechnology and Bioinformatics Symposium (BIOT 07),Colorado Springs, CO, pp , October [73] R. Das, J. Kalita and D. Bhattacharyya, A Frequent-Itemset Nearest Neighbor Based Approach for Clustering Gene Expression Data, Fifth Biotechnology and Bioinformatics Symposium (BIOT 08), Arlington, TX, pp , October 2008 [74] R. Das and D K Bhattacharyya, A Pattern matching Approach for Identifying Coherent Patterns over Gene Expression Data, in the Technical Journal of ASS, vol. 50, no. 1-2, 2009 [75] R. 
Das, D K Bhattacharyya and J K Kalita, Clustering Gene Expression Data using Density Approach, in the IJRTE, (ISSN: ), Finland, vol 2, no 1-6,. November, 2009 [76] R. Das, D K Bhattacharyya and J K Kalita, A New Approach for Clustering Gene Expression Time Series Data, Int'nl Journal of Bioinformatics Research and Applications (IJBRA), vol. 5, no. 3, pp , [77] P. Tamayo, D. Solni, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E.S. Lander, and T.R. Golub, Interpreting Patterns of Gene Expression with Self-Organizing Maps: Methods and Application to Hematopoietic Differentiation, Proc. National Academy of Science, vol. 96, no. 6, pp , March [78] F. Achcar, J.M. Camadro and D. Mestivier, AutoClass@IJM: a powerful tool for Bayesian classification of heterogeneous data in biology, Nucleic Acids Research, vol. 37, Web Server issue, [79] CLIC: T. Yun, T. Hwang, K. Cha and G.-S. Yi, Clustering Analysis of large Microarray Datasets with Individual Dimension-based Clustering, Nucleic Acids Research, pp.1 8, [80] A. Bellaachia, D. Portnoy, Y. Chen and A. G. Elkahloun, E-CAST: A Data Mining Algorithm For Gene Expression Data, In Workshop on Data Mining in Bioinformatics, pp , [81] A.P Gasch and M.B Eisen, Exploring the Conditional Coregulation of Yeast Gene Expression through Fuzzy K-Means Clustering, Genome Biology, vol. 3, no. 11:research , [82] C. Bohm, K. Kailing, H.-P. Kriegel and P. Kroger, Density Connected Clustering with Local Subspace Preferences, in Proc. 4th IEEE Int. Conf. on Data Mining (ICDM'04), Brighton, UK, 2004 [83] S. Sahay, Study and Implementation of CHAMELEON algorithm for Gene Clustering, [84] W. M. Rand, Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association, vol. 66, no. 336, pp , [85] R. Sharan, A. Maron-Katz, and R. Shamir, CLICK and EXPANDER: A System for Clustering and Visualizing Gene Expression Data. Bioinformatics, vol. 19, no. 14, pp , 2003 [86] P. 
Rousseeuw, Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. Journal of Computational and Applied Mathematics, vol. 20, pp , 1987 [87] F. Gibbons and F. Roth, Judging the Quality of Gene Expression Based Clustering Methods using Gene Annotation, Genome Research, vol. 12, pp , [88] C. Cheadle, M. P. Vawter, W. J. Freed and K. G. Becker, Analysis of Microarray Data Using Z Score Transformation, Journal of Molecular Diseases, vol. 5, no. 2, 2003 [89] D. Jiang, J. Pei, and A. Zhang, Interactive Exploration of Coherent Patterns in Time-Series Gene Expression Data, Proc. Ninth ACM SIGKDD Int l Conf. Knowledge Discovery and Data Mining (SIGKDD 03), [90] I. Gat-Viks, R. Sharan and R. Shamir, Scoring clustering solutions by their biological relevance, Bioinformatics, vol. 19, no. 18, pp , [91] Y Batistakis, M. Halkidi and M. Vazirgiannis, Cluster validity methods: Part I, Sigmod Record, vol. 31, no. 12, [92] D. Davies and D. Bouldin, A cluster separation measure, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, no. 2, [93] J. Dunn, Well separated clusters and optimal fuzzy partitions, Journal of Cybernetics, vol.4,


A Survey of Clustering Techniques

A Survey of Clustering Techniques A Survey of Clustering Techniques Pradeep Rai Asst. Prof., CSE Department, Kanpur Institute of Technology, Kanpur-0800 (India) Shubha Singh Asst. Prof., MCA Department, Kanpur Institute of Technology,

More information

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang Classifying Large Data Sets Using SVMs with Hierarchical Clusters Presented by :Limou Wang Overview SVM Overview Motivation Hierarchical micro-clustering algorithm Clustering-Based SVM (CB-SVM) Experimental

More information

Data Preprocessing. Week 2

Data Preprocessing. Week 2 Data Preprocessing Week 2 Topics Data Types Data Repositories Data Preprocessing Present homework assignment #1 Team Homework Assignment #2 Read pp. 227 240, pp. 250 250, and pp. 259 263 the text book.

More information

Advanced Web Usage Mining Algorithm using Neural Network and Principal Component Analysis

Advanced Web Usage Mining Algorithm using Neural Network and Principal Component Analysis Advanced Web Usage Mining Algorithm using Neural Network and Principal Component Analysis Arumugam, P. and Christy, V Department of Statistics, Manonmaniam Sundaranar University, Tirunelveli, Tamilnadu,

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information