Comparative Analysis of Biclustering Algorithms
Doruk Bozdağ, Biomedical Informatics, The Ohio State University
Ashwin S. Kumar, Computer Science and Engineering / Biomedical Informatics, The Ohio State University
Umit V. Catalyurek, Biomedical Informatics / Electrical and Computer Engineering, The Ohio State University

ABSTRACT
Biclustering is a very popular method for identifying hidden co-regulation patterns among genes. Numerous biclustering algorithms have been designed to undertake this challenging task; however, a thorough comparison between these algorithms is even harder to accomplish, due to the lack of a ground truth and the large variety in the search strategies and objectives of the algorithms. In this paper, we address this less studied yet important problem and formally analyze several biclustering algorithms in terms of the bicluster patterns they attempt to discover. We systematically formulate the requirements for well-known patterns and show how the constraints imposed by biclustering algorithms determine their capacity to identify such patterns. We also give experimental results from a carefully designed testbed to evaluate the power of the employed search strategies. Furthermore, on a set of real datasets, we report the biological relevance of the clusters identified by each algorithm.

Categories and Subject Descriptors
J.3 [Biology and genetics]

Keywords
Biclustering, microarray, gene expression

1. INTRODUCTION
Analyzing variations in the expression levels of genes under different conditions (samples) is important for understanding the underlying complex biological processes that the genes participate in. In gene expression data analysis, the expression levels of genes in each sample are represented by a real-valued data matrix, with rows and columns representing the genes and the samples, respectively. The goal is to identify genes that have correlated expression values across various samples.
Clustering is the most widely used technique for discovering interesting patterns in gene expression data. Genes assigned to the same cluster are likely to have related biological functions; hence they are good candidates for further wet-laboratory analysis. Most traditional clustering algorithms operate in full space, i.e., they take all columns (samples) into account while assigning a row (gene) to a cluster [8, 18]. In gene expression data analysis, however, due to the diversity of sample sources, functionally related genes may not exhibit a similar pattern in all samples but only in a subset of them. To address this observation, biclustering (subspace clustering) approaches have emerged, where only a subset of the columns is considered to determine the assignment of rows to clusters [4]. Since such a subset of columns is also initially unknown, biclustering can be viewed as simultaneous clustering of rows and columns.

Following the work of Cheng and Church [4], biclustering quickly became popular for analyzing gene expression data, and numerous biclustering algorithms have been proposed [7, 9, 10, 11, 12, 14, 17, 19, 20, 21]. Since each algorithm focuses on the identification of different bicluster patterns, thoroughly evaluating these algorithms is a very challenging task. Comparative analysis of biclustering techniques has only been considered in a few studies in the literature [13, 16]. In this work, we attempt to address this problem and systematically compare several biclustering algorithms to answer the following three crucial questions about each of them: What kinds of bicluster patterns can it discover? How powerful is its search technique? Are the resulting clusters biologically relevant?

We classify bicluster patterns into two categories, based on whether the pattern is defined on a single cluster or on multiple clusters. If a bicluster pattern is defined on a single bicluster, we call the pattern a local pattern.
Otherwise, we call the pattern a global pattern. An example of a local pattern is a scaling pattern, where every row in a bicluster is a multiple of another row (when considered across the columns in the bicluster). In a local pattern, no information is required about the elements outside the bicluster. This is in contrast to global patterns, where the membership of a row (column) in a bicluster depends on the elements of the row (column) external to the bicluster, and/or on the membership of the row (column) in other biclusters. Such external dependencies bring an additional layer of complexity to the analysis of
global patterns. Therefore, as a first step toward understanding bicluster patterns, in this work we focus our attention on local patterns. In particular, we consider the identification of local shifting (additive) and scaling (multiplicative) patterns, which are extremely useful for characterizing co-regulation relationships among genes.

Table 1: Well-known bicluster patterns, the corresponding expected value expressions, and the constraints on the α_i, π_j and β_i factors that yield each pattern. "Constant" refers to non-zero constant values, "varying" to non-constant values, and "any" to values without any constraints.

  α_i      | π_j     | β_i      | Simplified â_ij (i ∈ I, j ∈ J) | Bicluster pattern
  any      | 0       | constant | constant                       | Constant bicluster
  any      | 0       | varying  | β_i                            | Constant rows
  constant | varying | 0        | π_j                            | Constant columns
  constant | varying | varying  | π_j + β_i                      | Shifting
  varying  | varying | 0        | α_i π_j                        | Scaling
  varying  | varying | varying  | α_i π_j + β_i                  | Shift-scale

We analyze the quality criteria and search strategies of several algorithms, and discuss the types of bicluster patterns each algorithm can identify. For this purpose we identified and analyzed four algorithms that can discover local patterns: Cheng and Church (CC) [4], OPSM [2], HARP [22] and CPB [3]. In order to decouple the objective and the search technique of an algorithm as much as possible, we designed an extensive testbed consisting of synthetic datasets that correspond to various input scenarios. Furthermore, we included two additional well-known algorithms that seek global patterns (SAMBA [19] and MSSRCC [5]) in our experimental studies, and we applied the algorithms to real datasets to provide insight into the biological relevance of the resulting clusters. The paper is organized as follows. In the following section, we describe the shifting and scaling patterns and bicluster quality metrics.
An analytical discussion of the considered algorithms is given in Section 3. We report results from our experiments in Section 4, and conclude in Section 5.

2. BICLUSTER PATTERNS AND QUALITY METRICS
In this section, we introduce the notation and show how various local patterns can be modeled as variations of shifting and scaling patterns. A gene expression data matrix A = (R, C) can be represented by a set of rows R and a set of columns C, denoting genes and samples, respectively. Each entry in this matrix is denoted by (r, c), and the expression level of gene r in sample c is represented by a_rc, for r ∈ R, c ∈ C. A bicluster B = (I, J) is composed of a subset of rows I ⊆ R and a subset of columns J ⊆ C, where all a_ij, for i ∈ I and j ∈ J, are expected to fit a predetermined target pattern, possibly with small deviations. Therefore, each a_ij value in a bicluster can be decomposed into two components:

    a_ij = â_ij + ε_ij    (1)

where â_ij is the expected value of a_ij that would maximize its match to the target pattern, and ε_ij is the deviation from the expected value. ε_ij is also referred to as a residue and is commonly used to quantify the quality of a bicluster. The most commonly used bicluster quality metric is the Mean Squared Residue (MSR) [4], defined as

    MSR = (1 / (|I| |J|)) Σ_{i∈I, j∈J} ε_ij²    (2)

A bicluster that matches a target pattern perfectly has ε_ij = 0 for all i and j; hence its MSR score is also 0. The types of patterns searched by most biclustering algorithms can be modeled using an appropriate expression for â_ij. For example, shifting patterns and scaling patterns can be represented as â_ij = π_j + β_i and â_ij = α_i π_j, respectively. Here, α_i and β_i represent the scaling and shifting factors associated with row i, and π_j represents a base value associated with column j. A more general expression that encompasses both shifting and scaling patterns can be stated as â_ij = α_i π_j + β_i [1, 15].
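The decomposition in (1) and the MSR score in (2) take only a few lines of NumPy; the `msr` helper and the toy factor values below are ours, for illustration only:

```python
import numpy as np

def msr(A, A_hat):
    """Mean Squared Residue per Eq. (2): the mean of the squared residues."""
    eps = A - A_hat                     # residues eps_ij = a_ij - a^_ij, Eq. (1)
    return float((eps ** 2).mean())

# A bicluster that fits a shifting pattern perfectly (a_ij = pi_j + beta_i)
# has zero residues, hence an MSR score of 0:
beta = np.array([[0.0], [1.0], [2.0]])  # row shifting factors beta_i
pi = np.array([[5.0, 7.0, 11.0]])       # column base values pi_j
A = pi + beta                           # 3x3 shifting-pattern bicluster
print(msr(A, pi + beta))                # -> 0.0
```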
We refer to this pattern as a shift-scale pattern, since it represents a simultaneously shifting and scaling pattern. The shift-scale pattern expression can be reduced to simpler pattern expressions by applying constraints on the α_i, π_j and β_i factors. For instance, if α_i is constrained to be equal to a constant α, the shift-scale expression reduces to â_ij = α π_j + β_i = π_j + β_i (absorbing the constant α into π_j), which is a shifting pattern expression. A set of well-known bicluster patterns and the corresponding expected value expressions are given in Table 1, together with the constraints on the α_i, π_j and β_i factors required to obtain each of these patterns.

3. ANALYSIS OF BICLUSTERING ALGORITHMS
In this section, we formally analyze four biclustering algorithms that seek local patterns, in terms of the types of patterns they can discover, and we discuss their search strategies.
3.1 Cheng and Church Algorithm
The first and most popular expression for the residue ε_ij was defined by Cheng and Church [4]:

    ε_ij = a_ij − a_iJ − a_Ij + a_IJ    (3)

where a_iJ, a_Ij and a_IJ are the means of the entries in row i, in column j, and in the entire bicluster, respectively, for i ∈ I and j ∈ J. In their biclustering algorithm [4], Cheng and Church considered minimization of the MSR as their objective. Aguilar-Ruiz [1] showed that defining ε_ij as in (3) allows capturing shifting patterns, but not shift-scale patterns. In their analysis they considered a bicluster with a perfect shift-scale pattern, i.e., they set a_ij = α_i π_j + β_i. Let ᾱ and β̄ denote the mean values of α_i and β_i over the rows in I, and let π̄ denote the mean value of π_j over the columns in J. Then,

    a_iJ = (1/|J|) Σ_{j∈J} a_ij = (1/|J|) Σ_{j∈J} (α_i π_j + β_i) = α_i π̄ + β_i    (4)
    a_Ij = (1/|I|) Σ_{i∈I} a_ij = (1/|I|) Σ_{i∈I} (α_i π_j + β_i) = ᾱ π_j + β̄    (5)
    a_IJ = (1/(|I| |J|)) Σ_{i∈I, j∈J} a_ij = (1/(|I| |J|)) Σ_{i∈I, j∈J} (α_i π_j + β_i) = ᾱ π̄ + β̄    (6)

Plugging these values into (3) gives

    ε_ij = (α_i π_j + β_i) − (α_i π̄ + β_i) − (ᾱ π_j + β̄) + (ᾱ π̄ + β̄) = (α_i − ᾱ)(π_j − π̄)    (7)

In order to have a zero MSR value, each residue ε_ij should also be zero. This condition can be expressed as

    (α_i − ᾱ)(π_j − π̄) = 0    (8)

for all i ∈ I and j ∈ J [1]. The possible ways of non-trivially satisfying this condition, and the corresponding bicluster patterns from Table 1, are shown in Table 2.

Table 2: All possible non-trivial cases satisfying condition (8) for the Cheng and Church algorithm.

  α_i      | π_j      | β_i | Bicluster pattern
  0        | any      | any | constant bicluster, constant rows
  constant | any      | any | constant bicluster, constant columns, shifting
  any      | 0        | any | constant bicluster, constant rows
  any      | constant | any | constant bicluster, constant rows

As can be seen from this table, biclusters with scaling or shift-scale patterns cannot have a zero MSR value. Therefore, those patterns may not be identified by the Cheng and Church (CC) algorithm, or by other algorithms that use the residue definition in (3).
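The derivation in (4)-(7) is easy to check numerically. The sketch below (our own helper, with arbitrary toy factors) confirms that a pure shifting pattern has zero residues under Eq. (3), while a scaling pattern does not:

```python
import numpy as np

def cc_residues(B):
    """Residues of Eq. (3): eps_ij = a_ij - a_iJ - a_Ij + a_IJ."""
    return (B - B.mean(axis=1, keepdims=True)
              - B.mean(axis=0, keepdims=True) + B.mean())

rng = np.random.default_rng(1)
pi = rng.uniform(0.0, 1.0, (1, 6))       # varying column base values
alpha = rng.uniform(1.0, 2.0, (5, 1))    # varying row scaling factors
beta = rng.uniform(0.0, 1.0, (5, 1))     # varying row shifting factors

shift = np.ones((5, 1)) * pi + beta      # shifting pattern: alpha_i constant
scale = alpha * pi                       # scaling pattern

print(np.allclose(cc_residues(shift), 0.0))   # -> True: MSR = 0
print(np.allclose(cc_residues(scale), 0.0))   # -> False: scaling escapes CC
# Eq. (7): the residues equal (alpha_i - mean(alpha)) * (pi_j - mean(pi))
print(np.allclose(cc_residues(scale),
                  (alpha - alpha.mean()) * (pi - pi.mean())))  # -> True
```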
CC is an iterative algorithm that starts with I = R and J = C. At each iteration, a single row/column or a group of rows/columns is removed from the bicluster to improve the MSR value. This is followed by a row/column addition phase, where rows or columns are iteratively added to the cluster until a user-specified threshold MSR value is reached. To decide which row or column to move, a score similar to the MSR is defined for each row and each column: the score for row i (column j) is the sum of the squared residues across the columns in J (rows in I). The CC algorithm discovers biclusters one at a time, and each discovered bicluster is masked by random numbers in the original matrix before the search for the next one begins. The returned biclusters depend on the value of the MSR threshold, which in turn depends on the range of the data values. Therefore, choosing an appropriate MSR threshold is not trivial and may have a significant impact on the resulting biclusters.

3.2 HARP Algorithm
In [22], Yip et al. introduced a quality metric slightly different from the MSR. In the HARP algorithm they proposed, the quality of a bicluster is measured as the sum of the relevance indices of its columns. The relevance index R_Ij for column j ∈ J is defined as

    R_Ij = 1 − σ²_Ij / σ²_.j    (9)

where σ²_Ij (the local variance) and σ²_.j (the global variance) are the variances of the values a_ij in column j for i ∈ I and i ∈ R, respectively. Note that the relevance index of a column is maximized if its local variance is zero, provided that the global variance is not. The local variance is computed as

    σ²_Ij = (1/|I|) Σ_{i∈I} (a_ij − a_Ij)²    (10)

where a_Ij denotes the mean of the values in column j over the rows in the bicluster. In order to understand the types of biclusters that can be identified using the relevance index, again consider a bicluster with potentially shifting and scaling patterns, thus set a_ij = α_i π_j + β_i. Then, a_Ij can be
simplified as in (5), and the local variance becomes

    σ²_Ij = (1/|I|) Σ_{i∈I} ((α_i π_j + β_i) − (ᾱ π_j + β̄))²
          = (1/|I|) Σ_{i∈I} ((α_i − ᾱ) π_j + (β_i − β̄))²    (11)

Therefore, the local variance of column j ∈ J becomes zero only if

    (α_i − ᾱ) π_j + (β_i − β̄) = 0    (12)

for all i ∈ I. Note that the bicluster quality is maximized only if the local variances of all columns in J are zero; therefore, the constraint in (12) should be satisfied for all j ∈ J to maximize quality. In Table 3, all possible ways of non-trivially satisfying this condition and the corresponding bicluster patterns are given.

Table 3: All possible non-trivial cases satisfying condition (12) for the HARP algorithm.

  α_i      | π_j | β_i      | Bicluster pattern
  any      | 0   | constant | constant bicluster
  0        | any | constant | constant bicluster
  constant | any | constant | constant bicluster, constant columns

As shown in Table 3, the only types of bicluster patterns that maximize the quality metric of the HARP algorithm are constant bicluster and constant column patterns. Thus, one can conclude that the HARP algorithm is not suitable for discovering constant row, shifting, scaling or shift-scale patterns.

3.3 Correlated Pattern Biclusters Algorithm (CPB)
The goal of the CPB algorithm [3] is to identify biclusters in which the Pearson Correlation Coefficient (PCC) between every pair of rows i ∈ I is above a threshold with respect to the columns j ∈ J. The PCC between two rows i, l ∈ I with respect to the columns in J is defined as

    PCC = Σ_{j∈J} (a_ij − a_iJ)(a_lj − a_lJ) / sqrt( Σ_{j∈J} (a_ij − a_iJ)² · Σ_{j∈J} (a_lj − a_lJ)² )    (13)

where a_iJ and a_lJ respectively denote the means of the entries in rows i and l over the columns in the bicluster. To investigate the relationship between the PCC and the shifting and scaling patterns, let a_ij = α_i π_j + β_i as before. Then, a_iJ is simplified as given in (4).
Thus,

    a_ij − a_iJ = (α_i π_j + β_i) − (α_i π̄ + β_i) = α_i (π_j − π̄)    (14)

Similarly,

    a_lj − a_lJ = α_l (π_j − π̄)    (15)

Plugging (14) and (15) into (13) gives

    PCC = Σ_{j∈J} α_i (π_j − π̄) α_l (π_j − π̄) / sqrt( Σ_{j∈J} (α_i (π_j − π̄))² · Σ_{j∈J} (α_l (π_j − π̄))² )
        = α_i α_l Σ_{j∈J} (π_j − π̄)² / sqrt( α_i² α_l² ( Σ_{j∈J} (π_j − π̄)² )² ) = 1    (16)

provided that the denominator is not zero. This result indicates that if a bicluster has a shift-scale pattern, there is perfect correlation between any pair of rows in I with respect to the columns in J. In Table 4, all possible cases that ensure a non-zero denominator in the PCC expression are given. The table shows that, besides shift-scale patterns, the PCC can also be used to discover shifting, scaling and constant column patterns. However, it may not be effective in identifying constant bicluster or constant row patterns. The CPB algorithm starts by randomly initializing the I and J sets and iteratively improves the bicluster by moving rows and columns in and out of the cluster. To represent the general tendency of the rows over the columns of the bicluster, a tendency vector T is utilized. A row is included in the cluster only if its PCC with this tendency vector is above a certain threshold. Each row is associated with a scaling factor α_i and a shifting factor β_i, and each column is associated with a tendency value t_j, the j-th element of the tendency vector T. For each a_ij value in the bicluster, a corresponding normalized value a′_ij is computed by first subtracting β_i from a_ij and then dividing by α_i. t_j is then updated by setting it equal to the mean of the a′_ij values across the rows in I. Then, least squares fitting is applied to the values a_ij in each row i to find the parameters that maximize their fit to the updated t_j values. The slope and intercept of the fit are assigned to α_i and β_i, respectively, and the entire process is repeated until all α_i, β_i and t_j values converge.
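Equation (16) can be confirmed with a few lines of NumPy; the `pcc` helper and the toy rows below are ours, and we assume positive scaling factors (a negative α_i would flip the sign of the correlation):

```python
import numpy as np

def pcc(x, y):
    """Pearson correlation coefficient between two rows, per Eq. (13)."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

pi = np.array([0.2, 0.9, 0.4, 0.6])   # varying column base values pi_j
row_i = 1.5 * pi + 0.3                # alpha_i pi_j + beta_i
row_l = 0.7 * pi + 2.0                # different alpha_l, beta_l, same pattern
print(pcc(row_i, row_l))              # -> 1.0 (up to floating point), per Eq. (16)
```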
Then, the bicluster is updated to ensure that only rows having a large PCC with the new T vector, and columns whose entries have a small a′_ij − t_j difference over all rows, are included. In general, since the search strategy of the CPB algorithm accounts for both shifting and scaling factors, it is well suited for identifying shift-scale patterns.

3.4 Order Preserving Submatrix (OPSM)
An order preserving submatrix is defined as a submatrix (bicluster) for which there exists a permutation of the columns in J such that the sequence of values in every row i ∈ I is strictly increasing [2]. This condition can be stated as

    a_ij < a_ik iff a_lj < a_lk    (17)
for all i, l ∈ I and all j, k ∈ J such that i ≠ l and j ≠ k. In other words,

    a_ij − a_ik < 0 iff a_lj − a_lk < 0    (18)

Assuming that the bicluster has a shift-scale pattern, the condition in (18) becomes

    (α_i π_j + β_i) − (α_i π_k + β_i) < 0 iff (α_l π_j + β_l) − (α_l π_k + β_l) < 0,

which can be rearranged as

    α_i (π_j − π_k) < 0 iff α_l (π_j − π_k) < 0    (19)

In order to satisfy the condition in (18), α_i and α_l should be non-zero, and π_j and π_k should be varying. The possible scenarios satisfying this condition are the same as those given in Table 4. In other words, OPSM may potentially identify the same types of biclusters as CPB.

Table 4: All possible non-trivial cases that (i) ensure a non-zero denominator in the PCC expression in (16) for the CPB algorithm, and (ii) satisfy the condition in (18) for the OPSM algorithm.

  α_i      | π_j     | β_i | Bicluster pattern
  constant | varying | any | constant columns, shifting
  varying  | varying | any | scaling, shift-scale

The expected value expression â_ij that the OPSM algorithm seeks can be stated as

    â_ij = α_i π^i_j + β_i    (20)

where π^i is a vector of length |J| associated with row i. Although this expression is less restrictive than the shift-scale pattern expression (since the π^i_j do not have to be the same for all rows i ∈ I), the condition in (17) still needs to be satisfied. Let π̄^i denote the mean of the π^i_j values over the columns j ∈ J. Setting a_ij to the â_ij expression in (20) and plugging it into the PCC equation in (13) gives

    PCC = Σ_{j∈J} α_i (π^i_j − π̄^i) α_l (π^l_j − π̄^l) / sqrt( Σ_{j∈J} (α_i (π^i_j − π̄^i))² · Σ_{j∈J} (α_l (π^l_j − π̄^l))² )
        = Σ_{j∈J} (π^i_j − π̄^i)(π^l_j − π̄^l) / sqrt( Σ_{j∈J} (π^i_j − π̄^i)² · Σ_{j∈J} (π^l_j − π̄^l)² )    (21)

provided that α_i and α_l are not zero. One can observe that the expression in (21) is equal to the PCC between the vectors π^i and π^l. Therefore, the distribution of PCC values between rows in an order preserving submatrix is the same as the distribution of PCC values between any two vectors having the same column ordering.
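A direct way to test the condition in (17) on a candidate submatrix is to sort the columns by the first row and check that every row becomes strictly increasing; if an order preserving permutation exists at all, this is the one. The sketch below is ours (a membership test, not the OPSM search itself):

```python
import numpy as np

def is_order_preserving(B):
    """Check Eq. (17): does some column permutation make every row of B
    strictly increasing?  Sorting the columns by the first row must
    produce such a permutation whenever one exists."""
    P = B[:, np.argsort(B[0])]
    return bool((np.diff(P, axis=1) > 0).all())

pi = np.array([0.3, 0.8, 0.1, 0.5])   # varying base values
alpha = np.array([[1.0], [2.5]])      # positive scaling factors
beta = np.array([[0.0], [1.0]])
print(is_order_preserving(alpha * pi + beta))          # -> True
print(is_order_preserving(np.array([[1, 2], [2, 1]]))) # -> False
```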
Using this observation, we designed a small experiment to see the practical values of the PCC in an order preserving submatrix. We generated 20 row vectors of length t and sorted the values in each vector to ensure that they have the same column ordering. The values for the vectors were chosen from a uniform distribution between 0 and 1. The smallest PCC value between any pair of these vectors was found to be 0.83, 0.94 and 0.96 for t = 20, t = 40 and t = 60, respectively. These results show that, even when random values are used, requiring the same column ordering results in extremely high PCC values between the rows of a bicluster. This implies that the type of bicluster pattern sought by the OPSM algorithm resembles a shift-scale pattern, especially as the number of columns gets larger.

4. EXPERIMENTAL RESULTS
In this section we report results from our extensive experiments on synthetically generated as well as real datasets to demonstrate the performance of the algorithms discussed.

4.1 Synthetic datasets
We generated synthetic datasets by implanting biclusters in fixed-size matrices, one in each. All matrices comprised 100 rows and 120 columns. There were three types of implanted biclusters, namely shift-scale biclusters, shift biclusters and order-preserving biclusters; the datasets containing the respective biclusters were named correspondingly. The shift-scale biclusters were generated according to the shift-scale expression discussed earlier (â_ij = α_i π_j + β_i). The α_i, β_i and π_j values were randomly chosen from a uniform distribution between 0 and 1. The matrices were generated in exactly the same manner, thereby ensuring that the distribution of values remained the same throughout the matrix. After generation, the elements of the matrix were shuffled, followed by implanting of the bicluster in the matrix as is. The matrix with the implanted bicluster was then subjected to random permutation of its rows and columns.
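The small experiment described in Section 3.4 can be reproduced along the following lines (our sketch; sorting each random row enforces a common column ordering, and the exact minimum depends on the random draw):

```python
import numpy as np
from itertools import combinations

def min_pairwise_pcc(n_rows, t, seed=0):
    """Draw n_rows random vectors of length t from U(0, 1), sort each one so
    that all rows share the same column ordering, and return the smallest
    pairwise Pearson correlation coefficient."""
    rng = np.random.default_rng(seed)
    V = np.sort(rng.uniform(0.0, 1.0, (n_rows, t)), axis=1)
    C = np.corrcoef(V)
    return min(C[i, j] for i, j in combinations(range(n_rows), 2))

for t in (20, 40, 60):
    print(t, min_pairwise_pcc(20, t))   # minima grow toward 1 as t increases
```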
The shift datasets and order-preserving datasets were constructed in exactly the same manner, except that the value generation was done according to their respective patterns, discussed earlier. In Figure 1, examples of 20x20 biclusters embedded into 40x40 synthetic matrices, before random row and column permutation, are displayed for each type of dataset considered.

(a) shift-scale (b) shift (c) order preserving
Figure 1: Examples of synthetic datasets.

In general, when evaluating the quality of a biclustering result, two main factors are to be taken into consideration: the portion of the implanted bicluster that the algorithm is able to return, and the portion external or irrelevant to the implanted bicluster that the algorithm returns. We define two metrics to evaluate cluster quality: U, the uncovered portion of the implanted bicluster, and E, the portion of the output cluster external to the implanted cluster.

4.2 Algorithms used
We used the publicly available implementation of the CC algorithm 1, and for the OPSM algorithm we used the implementation in BicAT (Biclustering Analysis Toolbox) 2. Unfortunately, there is no publicly available implementation of the HARP algorithm. Since in Section 3.2 we analytically showed that the HARP algorithm is not suitable for discovering constant row, shifting, scaling or shift-scale patterns, we decided to exclude HARP from the experimental evaluation. We implemented the CPB algorithm ourselves. This algorithm is not deterministic, i.e., its results are likely to change on each run. To further enhance our experimental setup, we included two well-known biclustering algorithms that seek global patterns, namely the Minimum Sum-Squared Residue Co-Clustering algorithm (MSSRCC) [5] 3 and the Statistical Algorithmic Method for Bicluster Analysis (SAMBA) [19] 4. These two algorithms are significantly different from the other algorithms previously discussed in terms of their search techniques and evaluation criteria. For CC, CPB and MSSRCC, the number of runs on each dataset was set to 10; since the OPSM and SAMBA algorithms are deterministic, we ran them once. For each of the experiments on synthetic datasets, unless mentioned otherwise, the parameter values of the algorithms were as follows. For the CPB algorithm, the PCC threshold was set to 0.9. The MSR threshold in CC was set to 0.1, which was found to be the best value after various trials.

Table 5: Properties of the biclustering algorithms.

  Algorithm   | Target bicluster patterns                                            | Metric/Constraint
  CC [4]      | Local: constant bicluster, constant columns, constant rows, shifting | MSR
  HARP [22]   | Local: constant bicluster, constant columns                          | Relevance index
  CPB [3]     | Local: constant columns, shifting, scaling, shift-scale              | PCC
  OPSM [2]    | Local: constant columns, shifting, scaling, shift-scale              | Ordered columns
  SAMBA [19]  | Global: biclusters with large variance                               | Heaviest bi-cliques
  MSSRCC [5]  | Global: biclusters with small combined mean-squared residue          | MSR
In SAMBA, the default values were used, i.e., the degree limit was set to 50, the number of bi-cliques to be returned, K, to 20, and the overlap ratio to 0.3. In the case of OPSM, the number of best partial models to be selected was set to 10 throughout. For MSSRCC, the cluster split was given according to the size of the implanted bicluster, so that the expected sizes of the output clusters were the same as those of the implanted biclusters. The SAMBA algorithm [19] transforms the matrix into a bipartite graph in which the genes and conditions collectively form the set of vertices. There is an edge between a gene and a condition only when the gene deviates sharply from its normal value under that condition. SAMBA essentially finds the K heaviest bi-cliques in the graph. It can find only those biclusters in which the variance of the elements is large, and clustering stops as soon as all columns are covered. This algorithm is deterministic. The MSSRCC algorithm [5] performs simultaneous clustering of the data matrix, using the mean-squared residue as the objective function to be minimized. In a k-means fashion, it first clusters the columns such that the overall residue is minimized, then performs a similar clustering on the rows. This algorithm forces every row/column into a cluster; hence, it cannot detect overlapping clusters, and it necessarily returns the number of clusters dictated by the input cluster split.

3 cocluster.html
4 The implementation is available in EXPANDER (Expression Analyzer and DisplayER) at expander/.

The properties of the biclustering algorithms considered in this paper are summarized in Table 5. The implementations that ran fastest were CC, SAMBA and MSSRCC, followed by CPB, which was also reasonably fast. OPSM was rather time-consuming.

Effect of implanted bicluster type and size
For the first set of experiments, we implanted three different types of biclusters, namely shift-scale, shift and order-preserving biclusters, with three different bicluster sizes: small (20x20), medium (40x40) and large (60x60). Figure 2 displays the results of these 9 experiments.
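For reference, the U and E metrics defined in Section 4.1 can be computed as follows. This is a sketch under our own assumption that both portions are measured in matrix cells; the helper name and the cell-based accounting are ours, not from the paper:

```python
def u_and_e(found_rows, found_cols, true_rows, true_cols):
    """U: uncovered portion of the implanted bicluster.
    E: portion of the output cluster external to the implanted one.
    Row/column index collections describe the two submatrices."""
    fr, fc, tr, tc = map(set, (found_rows, found_cols, true_rows, true_cols))
    overlap = len(fr & tr) * len(fc & tc)   # cells shared by the two submatrices
    implanted = len(tr) * len(tc)           # cells in the implanted bicluster
    found = len(fr) * len(fc)               # cells in the returned cluster
    U = 1.0 - overlap / implanted
    E = 1.0 - overlap / found if found else 0.0
    return U, E

# A perfect match gives U = E = 0:
print(u_and_e(range(20), range(20), range(20), range(20)))  # -> (0.0, 0.0)
```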
The results indicate improved performance for all algorithms as we go from the 20x20 datasets to the 60x60 datasets. Thus, the ease of finding an implanted bicluster increases with bicluster size. A good performance is expected from CPB and OPSM, since both shift and shift-scale datasets have a perfect PCC and the
(a) 20x20 shift-scale (b) 40x40 shift-scale (c) 60x60 shift-scale (d) 20x20 shift (e) 40x40 shift (f) 60x60 shift (g) 20x20 order preserving (h) 40x40 order preserving (i) 60x60 order preserving
Figure 2: Scatter plots of the U and E values of the algorithms on the base synthetic datasets with implanted biclusters.
(a) no noise (b) 5% noise (c) 10% noise (d) 20% noise
Figure 3: Results on datasets with varying noise levels. 40x40 shift datasets were used.

order-preserving dataset has a very large PCC between the rows of the implanted bicluster, as discussed in Section 3.4. The CPB algorithm provided the best performance overall, finding a perfect match in most cases. A perfect match is a cluster whose U and E values are both zero. On the 40x40 and 60x60 datasets, CPB hardly missed any rows and columns and returned absolutely no external rows or columns (E = 0). CPB also found one perfect match on the 20x20 datasets. CC performed as the second best algorithm overall and did not return too many external rows and columns, i.e., its E value was consistently low throughout. OPSM performed better than CC, especially when finding medium and large biclusters in the order-preserving and shift-scale datasets. The MSSRCC algorithm works well only in exhaustive cases, i.e., when the entire matrix is made up of implanted biclusters, which is not the case in our datasets. We verified this by running MSSRCC on a special 120x120 dataset comprising 9 implanted clusters. The results indicated that up to 30% of the clusters found had U and E values between 0 and 0.5, while up to 60% of the clusters had U and E values between 0.8 and 1. The SAMBA algorithm does not find any of the patterns, because it cannot find shift and scale patterns but only constant patterns with large variance. We also tested implanting larger biclusters (up to 50x60), and found that, except for CPB, the performance of the algorithms improved only marginally as the size of the implanted bicluster increased. CPB, on the other hand, consistently returned a perfect match in all cases.

Effect of noise
Our second set of experiments evaluates the effect of noise in the input dataset. We generated three noise datasets, corresponding to three selected noise levels: 5%, 10% and 20%.
The noise datasets were prepared from the 40x40 shift datasets by incrementing each value by a random percentage of up to 5%, 10% or 20% for the corresponding datasets. The results are displayed in Figure 3. In general, we see that as the noise level increases, the performance of all algorithms degrades, as expected. CPB gave the best results and was least affected by noise compared to the other algorithms. With increasing noise level, CPB demonstrated an increase in its U value, but its E value remained at 0. This indicates that, even though CPB fails to identify the implanted bicluster entirely under high noise, it is still successful at filtering out external rows and columns. CC, on the other hand, showed an increase in both U and E values with increasing noise. OPSM was most affected by noise, because the addition of noise to the dataset no longer preserved the same ordering of the columns.

Effect of overlap
In the third set of experiments (Figure 4), we evaluated the effect of overlapping clusters. We generated 40x40 shift datasets with two implanted biclusters instead of one. There were three different types of datasets, corresponding to the overlap ratio. The overlap ratio is defined as the ratio of the overlapping rows and columns in the implanted biclusters. For example, an overlap ratio of 0.5 indicates that 50% of the rows and 50% of the columns in
(a) no overlap (b) 25% overlap (c) 50% overlap
Figure 4: Results on datasets with overlapping implanted biclusters. The overlap ratio shows the fraction of rows and columns overlapping in the two biclusters. 40x40 shift datasets were used.

(a) GDS1406, mouse (b) GDS1611, yeast (c) GDS1739, drosophila
Figure 5: -log10 transformed p-values of the top 10 clusters for each algorithm on the three real datasets.

two implanted biclusters are overlapping. The overlap ratios we used were 0, 0.25 and 0.5. CPB, OPSM and SAMBA are capable of detecting overlaps. Owing to their nature, the CC and MSSRCC algorithms cannot detect overlapping patterns. As the overlap increased, OPSM showed an increase in U and E values, while CPB remained unaffected.

4.3 Results on real datasets
We tested the algorithms on three large real datasets to determine the biological relevance of their results. The datasets we used were those of yeast (GDS1611), mouse (GDS1406) and drosophila (GDS1739) from the Gene Expression Omnibus (GEO) database [6]. The yeast dataset comprises 9275 genes and 96 conditions; the mouse dataset has 87 conditions and the drosophila dataset 54 conditions. The resulting biclusters were evaluated based on the enrichment of Gene Ontology (GO) terms. For each bicluster, the GO term with the smallest p-value is reported. In Figure 5, the -log10 transformed p-values of the top 10 clusters with the smallest associated p-values are presented for each algorithm on each of the three datasets. As can be seen in Figure 5(a), the clusters returned by the CPB algorithm were statistically the most significant ones for the GDS1406 dataset, followed by those of the SAMBA and OPSM algorithms. On the GDS1611 dataset, on the other hand, the clusters with the lowest associated p-values were returned by the SAMBA algorithm, followed by those of CPB and one of the OPSM clusters. However, the most significant cluster was identified by the CPB algorithm.
and also performed well on the GDS1739 dataset, and the algorithm also identified one of the most significant clusters for this dataset. However, in general, and performed relatively poorly in our experiments on real datasets.

5. CONCLUSIONS

In this paper we analyzed and compared several biclustering algorithms, including Cheng and Church, Correlated Pattern Biclusters and , in terms of the types of bicluster patterns they seek and the strength of their search strategies. First, we analytically modeled six well-known biclustering patterns; then we examined the algorithms with respect to the types of patterns each of them can identify, based on the constraints imposed by their quality criteria. Our mathematical analysis as well as our experiments showed that the Correlated Pattern Biclusters algorithm performs significantly better than the other considered algorithms that can identify local patterns, and that it can be a good candidate for searching for simultaneously shifting and scaling patterns. The clusters identified by the as well as the algorithm (which considers global patterns) from real datasets
were found to be more significantly enriched than the clusters identified by the other algorithms. This indicates that the patterns sought by these algorithms may have higher biological relevance. The analysis methods introduced in this paper can be extended to formally analyze various other biclustering algorithms, both in terms of the target bicluster patterns and the search strategies.

6. ACKNOWLEDGMENTS

This work was supported in part by National Institutes of Health / National Cancer Institute Grant R01CA141090; by the U.S. DOE SciDAC Institute Grant DE-FC02-06ER25775; by the U.S. National Science Foundation under Grants CNS , OCI-0904809, OCI-0904802 and CNS-0403342; and by an allocation of computing time from the Ohio Supercomputer Center.

7. REFERENCES

[1] J. S. Aguilar-Ruiz. Shifting and scaling patterns from gene expression data. Bioinformatics, 21(20):3840-3845, 2005.
[2] A. Ben-Dor, B. Chor, R. Karp, and Z. Yakhini. Discovering local structure in gene expression data: The order-preserving submatrix problem. In Proc. International Conference on Computational Biology, pages 49-57, 2002.
[3] D. Bozdag, J. D. Parvin, and U. Catalyurek. A biclustering method to discover co-regulated genes using diverse gene expression datasets. In Proc. of International Conference on Bioinformatics and Computational Biology, volume 5462, April 2009.
[4] Y. Cheng and G. M. Church. Biclustering of expression data. In Proc. of the International Conference on Intelligent Systems for Molecular Biology, pages 93-103, 2000.
[5] H. Cho, I. Dhillon, Y. Guan, and S. Sra. Minimum sum-squared residue based co-clustering of gene expression data. In SIAM International Conference on Data Mining, 2004.
[6] R. Edgar, M. Domrachev, and A. E. Lash. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1):207-210, 2002.
[7] J. Gu and J. Liu. Bayesian biclustering of gene expression data. BMC Genomics, 9(Suppl 1):S4, 2008.
[8] J. A. Hartigan and M. A. Wong.
A k-means clustering algorithm. Applied Statistics, 28(1):100-108, 1979.
[9] D. Jiang, J. Pei, and A. Zhang. DHC: A density-based hierarchical clustering method for time series gene expression data. In Proc. of the IEEE International Symposium on Bioinformatics and Bioengineering, pages 393-400, 2003.
[10] Y. Kluger, R. Basri, J. T. Chang, and M. Gerstein. Spectral biclustering of microarray data: coclustering genes and conditions. Genome Research, 13(4):703-716, 2003.
[11] G. Li, Q. Ma, H. Tang, A. H. Paterson, and Y. Xu. QUBIC: a qualitative biclustering algorithm for analyses of gene expression data. Nucleic Acids Research, 37(15):e101, 2009.
[12] J. Liu and W. Wang. OP-cluster: Clustering by tendency in high dimensional space. In Proc. of the IEEE International Conference on Data Mining, page 187, 2003.
[13] S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1(1):24-45, 2004.
[14] T. Murali and S. Kasif. Extracting conserved gene expression motifs from gene expression data. In Pac. Symp. Biocomp., volume 8, 2003.
[15] J. A. Nepomuceno, A. T. Lora, J. S. Aguilar-Ruiz, and J. G. García-Gutiérrez. Biclusters evaluation based on shifting and scaling patterns. In Intelligent Data Engineering and Automated Learning, 2007.
[16] A. Prelic, S. Bleuler, P. Zimmermann, A. Wille, P. Buhlmann, W. Gruissem, L. Hennig, L. Thiele, and E. Zitzler. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 22(9):1122-1129, May 2006.
[17] E. Segal, B. Taskar, A. Gasch, N. Friedman, and D. Koller. Rich probabilistic models for gene expression. Bioinformatics, 17(Suppl 1):S243-S252, 2001.
[18] R. R. Sokal and C. D. Michener. A statistical method for evaluating systematic relationships. University of Kansas Scientific Bulletin, 38:1409-1438, 1958.
[19] A. Tanay, R. Sharan, and R. Shamir. Discovering statistically significant biclusters in gene expression data.
Bioinformatics, 18(Suppl 1):S136-S144, 2002.
[20] H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets. In SIGMOD, 2002.
[21] J. Yang, H. Wang, W. Wang, and P. Yu. Enhanced biclustering on expression data. In Proc. of the IEEE International Symposium on Bioinformatics and Bioengineering, page 321, 2003.
[22] K. Y. Yip, D. W. Cheung, and M. K. Ng. HARP: A practical projected clustering algorithm. IEEE Transactions on Knowledge and Data Engineering, 16(11):1387-1397, 2004.