Data Clustering: Algorithms and Applications

Contents

1 Grid-based Clustering
  Wei Cheng, Wei Wang, and Sandra Batista
  1.1 Introduction
  1.2 The Classical Algorithms
    1.2.1 Earliest Approaches: GRIDCLUS and BANG
    1.2.2 STING and STING+: The Statistical Information Grid Approach
    1.2.3 WaveCluster: Wavelets in Grid-based Clustering
  1.3 Adaptive Grid-based Algorithms
    1.3.1 AMR: Adaptive Mesh Refinement Clustering
  1.4 Axis-shifting Grid-based Algorithms
    1.4.1 NSGC: New Shifting Grid Clustering Algorithm
    1.4.2 ADCC: Adaptable Deflect and Conquer Clustering
    1.4.3 ASGC: Axis-Shifted Grid-Clustering
    1.4.4 GDILC: Grid-based Density-IsoLine Clustering Algorithm
  1.5 High Dimensional Algorithms
    1.5.1 CLIQUE: The Classical High-Dimensional Algorithm
    1.5.2 Variants of CLIQUE
      1.5.2.1 ENCLUS: Entropy-based Approach
      1.5.2.2 MAFIA: Adaptive Grids in High Dimensions
    1.5.3 OptiGrid: Density-based Optimal Grid Partitioning
    1.5.4 Variants of the OptiGrid Approach
      1.5.4.1 O-Cluster: A Scalable Approach
      1.5.4.2 CBF: Cell-based Filtering
  1.6 Conclusions and Summary
  Bibliography


Chapter 1
Grid-based Clustering

Wei Cheng
Computer Science Department
University of North Carolina at Chapel Hill
Chapel Hill, NC

Wei Wang
Computer Science Department
University of California, Los Angeles
Los Angeles, CA

Sandra Batista
Statistical Science Department
Duke University
Durham, NC

1.1 Introduction

Grid-based clustering algorithms are efficient in mining large multidimensional data sets. These algorithms partition the data space into a finite number of cells to form a grid structure and then form clusters from the cells in the grid structure. Clusters correspond to regions that are more dense in data points than their surroundings.

Grids were initially proposed by Warnekar and Krishna [29] to organize the feature space, e.g., in GRIDCLUS [25], and increased in popularity after STING [28], CLIQUE [1], and WaveCluster [27] were introduced.

The great advantage of grid-based clustering is a significant reduction in time complexity, especially for very large data sets. Rather than clustering the data points directly, grid-based approaches cluster the neighborhood surrounding the data points, represented by cells. Since in most applications the number of cells is significantly smaller than the number of data points, the performance of grid-based approaches is significantly improved.

Grid-based clustering algorithms typically involve the following five steps [9, 10]:

1. Creating the grid structure, i.e., partitioning the data space into a finite number of cells.
2. Calculating the cell density for each cell.
3. Sorting the cells according to their densities.
4. Identifying cluster centers.
5. Traversing neighbor cells.

Since cell density often needs to be calculated in order to sort cells and select cluster centers, most grid-based clustering algorithms may also be considered density-based. A minimal sketch of this generic procedure is given below, after the discussion of data challenges. Some grid-based clustering algorithms also incorporate hierarchical clustering or subspace clustering in order to organize cells based on their density. Table 1.1 lists several representative grid-based algorithms that also use hierarchical clustering or subspace clustering.

TABLE 1.1: Grid-based algorithms that use hierarchical clustering or subspace clustering
hierarchical clustering: GRIDCLUS, BANG-clustering, AMR, STING, STING+
subspace clustering: MAFIA, CLIQUE, ENCLUS

Grid-based clustering is susceptible to the following data challenges:

1. Non-Uniformity: Using a single inflexible, uniform grid may not be sufficient to achieve the desired clustering quality or efficiency for highly irregular data distributions.
2. Locality: If there are local variations in the shape and density of the distribution of data points, the effectiveness of grid-based clustering is limited by predefined cell sizes, cell borders, and the density threshold for significant cells.
3. Dimensionality: Since performance depends on the size of the grid structure, and the size of the grid structure may increase significantly with more dimensions, grid-based approaches may not be scalable for clustering very high-dimensional data. In addition, aspects of the curse of dimensionality, including filtering noise and selecting the most relevant attributes, become increasingly difficult with more dimensions in a grid-based clustering approach.
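To make the five generic steps above concrete, the following is a minimal sketch in Python. It is not drawn from any particular algorithm in this chapter: points are binned into equal-width cells, cell densities are computed and thresholded, and neighboring dense cells are merged into clusters by a breadth-first search. The function name grid_cluster and the parameters cell_width and density_threshold are illustrative assumptions.

```python
from collections import defaultdict, deque

def grid_cluster(points, cell_width=1.0, density_threshold=3):
    """Minimal generic grid-based clustering sketch (not a specific published algorithm)."""
    # Step 1: create the grid structure by mapping each point to a cell index.
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(coord // cell_width) for coord in p)
        cells[key].append(p)

    # Step 2: cell density = number of points per cell (all cells have equal volume).
    density = {key: len(pts) for key, pts in cells.items()}

    # Steps 3-4: sort cells by density and keep the dense ones as candidate cluster cells.
    dense = [key for key, d in sorted(density.items(), key=lambda kv: -kv[1])
             if d >= density_threshold]

    # Step 5: traverse neighboring dense cells; connected dense cells form one cluster.
    labels, next_label = {}, 0
    dense_set = set(dense)
    for start in dense:
        if start in labels:
            continue
        labels[start] = next_label
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            for dim in range(len(cell)):              # axis-aligned (face) neighbors only
                for delta in (-1, 1):
                    nb = cell[:dim] + (cell[dim] + delta,) + cell[dim + 1:]
                    if nb in dense_set and nb not in labels:
                        labels[nb] = next_label
                        queue.append(nb)
        next_label += 1

    # Map each point to the cluster of its cell (points in sparse cells get label -1).
    return [labels.get(tuple(int(c // cell_width) for c in p), -1) for p in points]
```

Calling grid_cluster on a list of d-dimensional tuples returns one cluster label per point, with -1 marking points that fall in sparse cells.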

To overcome the challenge of non-uniformity, adaptive grid-based clustering algorithms that divide the feature space at multiple resolutions, e.g., AMR [14] and MAFIA [21], were proposed. The varying grid sizes allow data with non-uniform distributions to be clustered well. For example, as illustrated in Figure 1.1(a), the data are dispersed throughout the spatial domain with several denser nested regions in the shape of a circle, a square, and a rectangle. A single-resolution uniform grid would have difficulty identifying those denser, nested regions as clusters, as shown in Figure 1.1(b). In contrast, an adaptive algorithm such as AMR, which permits higher resolution where it is needed, can recognize those nested, denser clusters with centers at the most clearly dense shapes. (Figure 1.1 is adapted from Figure 1 of Liao et al. [14] and is only illustrative, not based on real data.)

FIGURE 1.1: Non-uniformity example with nested clusters. (a) Original data. (b) Uniform grid.

To address locality, axis-shifting algorithms were introduced. These methods adopt axis-shifted partitioning strategies to identify areas of high density in the feature space. For instance, in Figure 1.2(a), traditional grid-based algorithms will have difficulty adhering to the border and continuity of the most dense regions because of the predefined grids and the threshold for significant cells. The clustering from a single uniform grid, shown in Figure 1.2(b), demonstrates that some clusters are divided into several smaller clusters because the continuity of the border of the dense regions is disturbed by cells with low density. To remedy this, axis-shifting algorithms, such as ASGC [16], shift the coordinate axes by half a cell width in each dimension, creating a new grid structure. This shifting yields a clustering that recognizes dense regions adjacent to low-density cells, as shown in Figure 1.2(c). By combining the clusterings from both grid structures, such algorithms can recognize the dense regions as clusters, as shown in Figure 1.2(d). (Figures 1.2(a), 1.2(b), 1.2(c), and 1.2(d) are adapted from Figures 11, 12, 14, and 18, respectively, of Lin et al. [16] and are only illustrative, not based on real data or real clustering algorithm results.)

For handling high-dimensional data, there are several grid-based approaches. For example, the CLIQUE algorithm selects appropriate subspaces, rather than the whole feature space, for finding the dense regions. In contrast, the OptiGrid algorithm uses density estimations. A summary of grid-based algorithms that address these three challenges is presented in Table 1.2.

TABLE 1.2: Grid-based Algorithms Addressing Non-uniformity (Adaptive), Locality (Axis-Shifting), and Dimensionality
Adaptive: MAFIA, AMR
Axis-shifting: NSGC, ADCC, ASGC, GDILC
High-dimension: CLIQUE, MAFIA, ENCLUS, OptiGrid, O-Cluster, CBF

FIGURE 1.2: Locality example: axis-shifting grid-based clustering. (a) Original data. (b) Three clusters found by cells. (c) Three clusters found by cells after axis-shifting. (d) Final clusters obtained by combining (b) and (c).

In the remainder of this chapter we survey classical grid-based clustering algorithms as well as algorithms that directly address the challenges of non-uniformity, locality, and high dimensionality. First, we discuss some classical grid-based clustering algorithms in Section 1.2. These include the earliest approaches, GRIDCLUS, STING, and WaveCluster, along with variants of them. We present an adaptive grid-based algorithm, AMR, in Section 1.3. Several axis-shifting algorithms are evaluated in Section 1.4. In Section 1.5, we discuss high-dimensional grid-based algorithms, including CLIQUE, OptiGrid, and their variants. We offer our conclusions and summary in Section 1.6.

1.2 The Classical Algorithms

In this section, we introduce three classical grid-based clustering algorithms together with their variants: GRIDCLUS, STING, and WaveCluster.

1.2.1 Earliest Approaches: GRIDCLUS and BANG

Schikuta et al. [25] introduced the first GRID-based hierarchical CLUStering algorithm, called GRIDCLUS. The algorithm partitions the data space into a grid structure comprised of disjoint d-dimensional hyper-rectangles, or blocks. Data points are considered points in d-dimensional space and are assigned to blocks in the grid structure such that their topological distributions are maintained. Once the data are assigned to blocks, clustering is done by a neighbor search algorithm. In some respects, GRIDCLUS is the canonical grid-based clustering algorithm, and its basic steps coincide with those given for grid-based algorithms in Section 1.1. Namely, GRIDCLUS inserts points into blocks in its grid structure, calculates the resultant density of the blocks, sorts the blocks according to their density, recognizes the most dense blocks as cluster centers, and constructs the rest of the clusters using a neighbor search on the blocks.

The grid structure has a scale for each dimension, a grid directory, and the set of data blocks. Each scale is used to partition the entire d-dimensional space, and this partitioning is stored in the grid directory. The data blocks contain the data points, and there is an upper bound on the number of points per block. The blocks must be non-empty, cover all the data points, and not have any data points in common. Hinrichs offers a more thorough discussion of the grid file structure used [15].

The density index of a block B is defined as the number of points in the block divided by the spatial volume of the block, i.e.,

$D_B = p_B / V_B$,   (1.1)

where $p_B$ is the number of data points in the block B and $V_B$ is the spatial volume of the block B, i.e.,

$V_B = \prod_{i=1}^{d} e_{B_i}$,   (1.2)

where d is the number of dimensions and $e_{B_i}$ is the extent of the block in the i-th dimension.
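As a small illustration of the density index in Equations 1.1 and 1.2, the following sketch computes $D_B$ for a block from its point count and its per-dimension extents; the helper names block_volume and density_index and the block representation are hypothetical, not taken from the GRIDCLUS paper.

```python
from math import prod

def block_volume(extents):
    """Spatial volume of a block, V_B = product of its extents (Equation 1.2)."""
    return prod(extents)

def density_index(num_points, extents):
    """Density index of a block, D_B = p_B / V_B (Equation 1.1)."""
    return num_points / block_volume(extents)

# Example: a 2-D block of extent 0.5 x 2.0 containing 8 points has density index 8.0.
print(density_index(8, [0.5, 2.0]))
```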

GRIDCLUS sorts the blocks according to their density, and those with the highest density are chosen as the cluster centers. The blocks are clustered iteratively in order of descending density to create a nested sequence of nonempty, disjoint clusters. Starting from the cluster centers, only neighboring blocks are merged into clusters. The neighbor search is done recursively, starting at the cluster center, checking for adjacent blocks that should be added to the cluster, and continuing the search from only those neighboring blocks that were added to the cluster. The GRIDCLUS algorithm is described in Algorithm 1, and the function NEIGHBOR_SEARCH is the recursive procedure described in Algorithm 2 [10, 25].

Algorithm 1 GRIDCLUS Algorithm
1: Set u := 0, W[] := {}, C[] := {} {initialization};
2: Create the grid structure and calculate the block density indices;
3: Generate a sorted block sequence B_1, B_2, ..., B_b and mark all blocks "not active" and "not clustered";
4: while a "not active" block exists do
5:   u ← u + 1;
6:   mark the first blocks B_1, B_2, ..., B_j having equal density index "active";
7:   for each "not clustered" block B_l in B_1, B_2, ..., B_j do
8:     Create a new cluster set C[u];
9:     W[u] ← W[u] + 1, C[u, W[u]] ← B_l;
10:    Mark block B_l "clustered";
11:    NEIGHBOR_SEARCH(B_l, C[u, W[u]]);
12:  end for
13:  for each "not active" block B do
14:    W[u] ← W[u] + 1, C[u, W[u]] ← B;
15:  end for
16:  Mark all blocks "not clustered";
17: end while

Algorithm 2 Procedure NEIGHBOR_SEARCH(B, C)
1: for each "active" and "not clustered" neighbor B' of B do
2:   C ← C ∪ {B'};
3:   Mark block B' "clustered";
4:   NEIGHBOR_SEARCH(B', C);
5: end for

While no explicit time complexity analysis is given for GRIDCLUS in the original paper, the algorithm may not have time complexity much better than other hierarchical clustering algorithms in the worst case. The number of blocks in the worst case is O(n), where n is the number of data points, and sorting the blocks by density is O(n log n); this alone would still be better than hierarchical clustering. The problem is that step 4 can also require O(n) iterations if all the blocks have different densities, and step 7 can also require O(n) iterations if all the blocks have the same density. In addition, while the number of neighbors of any block is a function of the number of dimensions, the depth of the recursive calls to the neighbor search function can also be O(n). This can occur if the blocks are adjacent in a single chain, analogous to a spanning tree that is a straight line. Without any discriminating density thresholds, the pathological case of step 7 could also apply, and the complexity would be O(n^2). (Granted, average-case complexity for several distributions may be significantly better, i.e., O(n), and that may be the better analysis to consider.)

The BANG algorithm introduced by Schikuta and Erhart [26] is an extension of the GRIDCLUS algorithm. It addresses some of the inefficiencies of GRIDCLUS in terms of grid structure size, searching for neighbors, and managing blocks by their density. BANG also places data points in blocks and uses a variant of the grid directory, called a BANG structure, to maintain the blocks.

Neighbor search and processing of the blocks in decreasing order of density are also used for clustering the blocks. Nearness of neighbors is determined by the maximum number of dimensions shared by a common face between blocks. A binary tree is used to store the grid structure, so that neighbor searching can be done more efficiently. From this tree in the grid directory and the sorted block densities, the dendrogram is calculated. Centers of clusters are still the most highly dense blocks in the clustering phase. The BANG algorithm is summarized in Algorithm 3 [10].

Algorithm 3 BANG-clustering Algorithm
1: Partition the feature space into rectangular blocks, each containing up to a maximum of p_max data points.
2: Build a binary tree to maintain the populated blocks, in which the partition level corresponds to the node depth in the tree.
3: Calculate the dendrogram, in which the density indices of all blocks are calculated and sorted in decreasing order.
4: Starting with the highest density index, determine all neighbor blocks and classify them in decreasing order. BANG-clustering places the found regions in the dendrogram to the right of the original blocks.
5: Repeat step 4 for the remaining blocks of the dendrogram.

While both GRIDCLUS and BANG can discern nested clusters efficiently, BANG has been shown to be more efficient than GRIDCLUS on large data sets because of the significantly reduced growth of its grid structure size [26].

1.2.2 STING and STING+: The Statistical Information Grid Approach

Wang et al. [28] proposed a STatistical INformation Grid-based clustering method (STING) to cluster spatial databases and to facilitate region-oriented queries. STING divides the spatial area into rectangular cells and stores the cells in a hierarchical grid structure tree. Each cell (except the leaves in the tree) is partitioned into 4 child cells at the next level, with each child corresponding to a quadrant of the parent cell. A parent cell is the union of its children; the root cell at level 1 corresponds to the whole spatial area. The leaf-level cells are of uniform size, determined globally from the average density of objects. For each cell, both attribute-dependent and attribute-independent statistical parameters are maintained. These parameters are defined in Table 1.3.

TABLE 1.3: Statistical Information in STING
n: number of objects (points) in the cell
mean: mean of each dimension in this cell
std: standard deviation of each dimension in this cell
min: the minimum value of each dimension in this cell
max: the maximum value of each dimension in this cell
dist: the distribution of points in this cell

STING maintains summary statistics for each cell in its hierarchical tree. As a result, the statistical parameters of parent cells can easily be computed from the parameters of their child cells. Note that the distribution types may be normal, uniform, exponential, or none. The value of dist may either be assigned by the user or obtained by hypothesis tests such as the χ² test.
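The bottom-up computation of a parent cell's parameters from its children can be sketched as follows. This is an illustrative reconstruction of the standard aggregation formulas (counts, means, and variances combine exactly; min and max are taken over the children), not code from the STING paper, and it ignores the dist parameter; CellStats and merge_children are hypothetical names.

```python
from dataclasses import dataclass
from math import sqrt

@dataclass
class CellStats:
    n: int        # number of points in the cell
    mean: float   # mean of one attribute over the cell
    std: float    # standard deviation of that attribute
    min: float
    max: float

def merge_children(children):
    """Compute a parent cell's statistics from its child cells (one attribute)."""
    n = sum(c.n for c in children)
    if n == 0:
        return CellStats(0, 0.0, 0.0, float("inf"), float("-inf"))
    mean = sum(c.n * c.mean for c in children) / n
    # Combine variances via the second moment E[x^2] = std^2 + mean^2 of each child.
    second_moment = sum(c.n * (c.std ** 2 + c.mean ** 2) for c in children) / n
    std = sqrt(max(second_moment - mean ** 2, 0.0))
    return CellStats(n, mean, std,
                     min(c.min for c in children if c.n > 0),
                     max(c.max for c in children if c.n > 0))

# Example: merging two quadrants of a 2-D grid for one attribute.
parent = merge_children([CellStats(10, 2.0, 1.0, 0.5, 4.0),
                         CellStats(30, 6.0, 2.0, 3.0, 9.0)])
print(parent)
```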

Even though these statistical parameters are calculated in a bottom-up fashion starting from the leaf nodes, the STING algorithm adopts a top-down approach for clustering and querying, starting from the root of its hierarchical grid structure tree. The algorithm is summarized in Algorithm 4 [10, 28].

Algorithm 4 STING Algorithm
1: Determine a level to begin with.
2: For each cell of this level, calculate the confidence interval (or estimated range) of the probability that this cell is relevant to the query.
3: From the interval calculated above, label the cell as relevant or not relevant.
4: If this level is the leaf level, go to Step 6; otherwise, go to Step 5.
5: Go down the hierarchy structure by one level. Go to Step 2 for those cells that form the relevant cells of the higher level.
6: If the specification of the query is met, go to Step 8; otherwise, go to Step 7.
7: Retrieve the data that fall into the relevant cells and do further processing. Return the results that meet the requirements of the query. Go to Step 9.
8: Find the regions of relevant cells. Return those regions that meet the requirements of the query. Go to Step 9.
9: Stop.

The tree can be constructed in O(N) time, where N is the total number of data points. Dense cells are identified and clustered by examining the density of these cells, in a similar vein to the density-based DBSCAN algorithm [7]. If the cell tree has K leaves, then the complexity of spatial querying and clustering for STING is O(K), which is O(N) in the worst case, since cells that would be empty never need to be materialized and stored in the tree. A common misconception is that K would be O(2^d), where d is the number of dimensions, and that this would be problematic in high dimensions. STING may have the problems with higher-dimensional data that are common to all grid-based algorithms (e.g., handling noise and selecting the most relevant attributes) [11], but scalability of the grid structure is not one of them.

There are several advantages of STING. First, it is a query-independent approach, since the statistical information exists independent of queries. The computational complexity of STING for clustering is O(K), which is quite efficient for clustering large data sets, especially when K ≪ N. The algorithm is readily parallelizable and allows multiple resolutions for examining the data in its hierarchical grid structure. In addition, incremental data updating is supported, so there is low overhead for incorporating new data points. Wang et al. extended STING to STING+ so that it is able to process dynamically evolving spatial databases. In addition, STING+ enables active data mining by supporting user-defined trigger conditions.

1.2.3 WaveCluster: Wavelets in Grid-based Clustering

Sheikholeslami et al. [27] proposed a grid-based and density-based clustering approach that uses wavelet transforms: WaveCluster. This algorithm applies wavelet transforms to the data points and then uses the transformed data to find clusters. A wavelet transform is a signal processing technique that decomposes a signal into different frequency subbands. The insight behind using wavelet transforms is that the collection of data points in the feature space can be treated as a d-dimensional signal, where d is the number of dimensions. The high-frequency parts of the signal correspond to the more sparse data regions, such as the boundaries of clusters, whereas the low-frequency, high-amplitude parts of the signal correspond to the more dense data regions, such as cluster interiors [3]. By examining different frequency subbands, clustering results may be obtained at different resolutions and scales, from fine to coarse. Data are transformed so as to preserve the relative distance between objects at different levels of resolution.

A hat-shaped filter is used to emphasize regions where points cluster and to suppress weaker information at their boundaries. This makes natural clusters more distinguishable and eliminates outliers simultaneously. As input, the algorithm requires the number of grid cells for each dimension, the wavelet, and the number of applications of the wavelet transform. The algorithm is summarized in Algorithm 5 [27].

Algorithm 5 WaveCluster Algorithm
INPUT: multidimensional data objects' feature vectors
OUTPUT: clustered objects
1: First bin the feature space, then assign objects to the units, and compute unit summaries.
2: Apply the wavelet transform on the feature space.
3: Find connected components (clusters) in the subbands of the transformed feature space, at multiple levels.
4: Assign labels to the units in the connected components.
5: Make the lookup table.
6: Map the objects to the clusters.

WaveCluster offers several advantages. The time complexity is O(N), where N is the number of data points; this is very efficient for large spatial databases. The clustering results are insensitive to outliers and to the data input order. The algorithm can accurately discern arbitrarily shaped clusters, such as those with concavity and nesting. The wavelet transformation permits multiple levels of resolution, so that clusters may be detected more accurately. The algorithm is primarily suited only for low-dimensional data. However, in the case of very high-dimensional data, PCA may be applied to the data to reduce the number of dimensions so that N > m^f, where m is the number of intervals in each dimension and f is the number of dimensions selected after PCA. After this, WaveCluster may be applied to cluster the data and still achieve linear time efficiency [27].
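The spirit of steps 1-3 of Algorithm 5 can be sketched for a 2-D feature space as follows, using one level of a simple Haar-style 2x2 averaging (a stand-in for the wavelet filter actually used by WaveCluster) followed by a connected-components pass on the coarse, low-frequency grid. The function name wavecluster_sketch and the grid_size and threshold parameters are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.ndimage import label

def wavecluster_sketch(points, grid_size=64, threshold=2.0):
    """One-level Haar-style approximation of WaveCluster's steps 1-3 on 2-D points."""
    pts = np.asarray(points, dtype=float)
    # Step 1: quantize the feature space into grid_size x grid_size units.
    counts, xedges, yedges = np.histogram2d(pts[:, 0], pts[:, 1], bins=grid_size)

    # Step 2: low-frequency subband via 2x2 averaging (a coarser, smoother grid).
    low = 0.25 * (counts[0::2, 0::2] + counts[1::2, 0::2]
                  + counts[0::2, 1::2] + counts[1::2, 1::2])

    # Step 3: connected components of sufficiently dense coarse units are the clusters.
    labeled, num_clusters = label(low >= threshold)
    return labeled, num_clusters

# Example with synthetic data: two well-separated blobs should yield two clusters.
rng = np.random.default_rng(0)
blob1 = rng.normal(loc=(2, 2), scale=0.3, size=(500, 2))
blob2 = rng.normal(loc=(7, 7), scale=0.3, size=(500, 2))
labels_grid, k = wavecluster_sketch(np.vstack([blob1, blob2]))
print("clusters found:", k)
```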

1.3 Adaptive Grid-based Algorithms

When a single inflexible, uniform grid is used, it may be difficult to achieve the desired clustering quality or efficiency for highly irregular data distributions. In such instances, adaptive algorithms that modify the uniform grid may be able to overcome this weakness. In this section, we introduce an adaptive grid-based clustering algorithm, AMR. Another adaptive algorithm, MAFIA, will be discussed in Section 1.5.2.2.

1.3.1 AMR: Adaptive Mesh Refinement Clustering

Liao et al. [14] proposed a grid-based clustering algorithm, AMR, that uses an Adaptive Mesh Refinement technique to apply higher-resolution grids to the localized denser regions. Unlike traditional grid-based clustering algorithms, such as CLIQUE and GRIDCLUS, which use a single-resolution mesh grid, AMR divides the feature space at multiple resolutions. While STING also offers multiple resolutions, it does so over the entire space, not over localized regions. AMR creates a hierarchical tree constructed from the grids at multiple resolutions. Using this tree, the algorithm can discern clusters, especially nested ones, that may be difficult to discover without clustering at several levels of resolution at once. AMR is well suited to data mining problems with highly irregular data distributions.

The AMR clustering algorithm mainly consists of two steps, summarized here from [13]:

1. Grid construction: First, grids are created at multiple resolutions based on regional density. The grid hierarchy tree contains nested grids of increasing resolution, since the grid construction is done recursively. The construction of the AMR tree starts with a uniform grid covering the entire space, and for those cells that exceed a density threshold, the grid is refined into higher-resolution grids at each recursive step. The new child grids created as part of the refinement step are connected in the tree to the parent grid cells whose density exceeds the threshold.

2. Clustering: To create clusters, each leaf node is considered to be the center of an individual cluster. The algorithm recursively assigns objects in the parent nodes to clusters until the root node is reached. Cells are assigned to clusters based on the minimum distance to the clusters under the tree branch.

The overall complexity of constructing the AMR tree is $O(dtN\frac{1-p^h}{1-p} + (dtk + 6^d)\,r\,\frac{1-q^h}{1-q})$, where N is the number of data points, d is the dimensionality, t is the number of attributes in each dimension, h is the AMR tree height, p is the average percentage of data points to be refined at each level, r is the mesh size at the root, and q is the average ratio of mesh sizes between two grid levels [14]. Like most grid-based methods, AMR is insensitive to the order of the input data. The AMR clustering algorithm may be applied to any collection of attributes with numerical values, even those with very irregular or very concentrated data distributions. However, like GDILC, it cannot be scaled to high-dimensional databases because of its overall complexity.
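The recursive refinement in the grid construction step can be sketched as follows: starting from a coarse box, any region whose point count exceeds a threshold is subdivided and its points are re-binned at the finer resolution, producing a tree of nested grids. This is an illustrative sketch of the adaptive-mesh idea, not the data structures of the AMR paper; the function name refine and the threshold, splits, and max_depth parameters are assumptions.

```python
from itertools import product

def refine(points, lo, hi, threshold=20, splits=2, max_depth=4, depth=0):
    """Build an AMR-style tree node over the half-open box [lo, hi) containing `points`.

    A node is a dict with its bounding box, its points, and (if it was dense enough
    and the depth limit allows) a list of refined child nodes at finer resolution.
    """
    node = {"lo": lo, "hi": hi, "points": points, "children": []}
    if len(points) <= threshold or depth >= max_depth:
        return node

    d = len(lo)
    widths = [(hi[i] - lo[i]) / splits for i in range(d)]
    # Subdivide each dimension into `splits` equal intervals: splits**d child boxes.
    for index in product(range(splits), repeat=d):
        c_lo = tuple(lo[i] + index[i] * widths[i] for i in range(d))
        c_hi = tuple(c_lo[i] + widths[i] for i in range(d))
        inside = [p for p in points
                  if all(c_lo[i] <= p[i] < c_hi[i] for i in range(d))]
        if inside:  # only materialize populated child grids
            node["children"].append(
                refine(inside, c_lo, c_hi, threshold, splits, max_depth, depth + 1))
    return node

# Example: refine(points, lo=(0.0, 0.0), hi=(1.0, 1.0)) for 2-D data in the unit square.
```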
1.4 Axis-shifting Grid-based Algorithms

The effectiveness of a grid-based clustering algorithm is seriously limited by the size of the predefined grid, the borders of the cells, and the density threshold for significant cells in the face of local variations in shape and density in a data space. These challenges motivate another kind of grid-based algorithm: axis-shifting algorithms. In this section, we introduce four axis-shifting algorithms: NSGC, ADCC, ASGC, and GDILC.

1.4.1 NSGC: New Shifting Grid Clustering Algorithm

Fixed grids may suffer from the boundary effect. To alleviate this, Ma and Chow [18] proposed a New Shifting Grid Clustering algorithm (NSGC). NSGC is both density-based and grid-based. To form its grid structure, the algorithm divides each dimension of the space into an equal number of intervals. NSGC shifts the whole grid structure and uses the shifted grid along with the original grid to determine the density of cells. This reduces the influence of the size and borders of the cells. It then clusters the cells rather than the points. Specifically, NSGC consists of four main steps, summarized in Algorithm 6 [18]. NSGC repeats these steps until the difference between the result of the previous iteration and that of the current iteration is smaller than a specified accepted error threshold. The complexity of NSGC is O((2w)^d), where d is the dimensionality and w is the number of iterations of the algorithm. While it is claimed that this algorithm is non-parametric, its performance depends on the choice of the number of iterations, w, and the accepted error threshold.

Algorithm 6 NSGC Algorithm
1: Cell construction: Divide each dimension of the space into 2w intervals, where w is the number of iterations.
2: Cell assignment: First find the data points belonging to each cell, then shift the grid by half the cell size in the corresponding dimension, and find the data points belonging to the shifted cells.
3: Cell density computation: Use both the density of the cell itself and that of its nearest neighborhood to obtain a descriptive density profile.
4: Group assignment (clustering): Start when the considered cell or one of its neighbor cells has no group assigned; otherwise, consider the next cell, until all non-empty cells are assigned.

If w is set too low (or too high) or the error threshold too high, then the clustering results may not be accurate; there is no a priori way to know the best values of these parameters for specific data. NSGC is also susceptible to errors caused by cell sizes that are too small. As the size of the cells decreases (and the number of iterations increases), the total number of cells and the number of clusters reported both increase. The reported clusters may be too small and may not correspond to clusters in the original data. The strongest advantage of NSGC is that its grid-shifting strategy permits it to recognize clusters with very arbitrary boundary shapes with great accuracy.

1.4.2 ADCC: Adaptable Deflect and Conquer Clustering

The clustering quality of grid-based clustering algorithms often depends on the size of the predefined grid and the density threshold. To reduce their influence, Lin et al. adopted deflect and conquer techniques to propose a new grid-based clustering algorithm, ADCC (Adaptable Deflect and Conquer Clustering) [17]. Very similar to NSGC, the idea of ADCC is to use the predefined grid and a predefined threshold to identify significant cells. Nearby cells that are also significant can be merged to develop a cluster. Next, the grid is deflected half a cell size in all directions and the significant cells are identified again. Finally, the newly generated significant cells and the initial set of significant cells are merged to improve the clustering of both phases. Specifically, ADCC is summarized in Algorithm 7.

Algorithm 7 ADCC Algorithm
1: Generate the first grid structure.
2: Identify the significant cells.
3: Generate the first set of clusters.
4: Transform the grid structure.
5: Generate the second set of clusters.
6: Revise the original clusters. In this case, the first set of clusters and the second set of clusters are combined recursively.
7: Generate the final clustering result.

The overall complexity of ADCC is O(m^d + N), where m is the number of intervals in each dimension, d is the dimensionality of the data, and N is the number of data points. While ADCC is very similar to NSGC in its axis-shifting strategy, it is quite different in how it constructs clusters from the two sets of grids. Rather than examining a neighborhood of the two grids at once, as NSGC does, ADCC examines the two grids recursively, looking for consensus in the significance of cells in both clusterings, especially those that overlap a previous clustering, to make a determination about the final clustering.

This step can actually help to separate clusters more effectively, especially if there is only a small gap with very little data between them. Both methods are susceptible to errors caused by small cell sizes, but can for the most part handle arbitrary borders and shapes of clusters very well. ADCC is not dependent on many parameters to determine its termination; it depends only on the choice of the number of intervals per dimension, m.

1.4.3 ASGC: Axis-Shifted Grid-Clustering

Another attempt, by Chang et al. [16], to minimize the impact of the size and borders of the cells is ASGC (Axis-Shifted Grid-Clustering), also referred to as ACICA+. After creating an original grid structure and an initial clustering from that grid structure, the original grid structure is shifted in each dimension and another clustering is done. The shifted grid structure can be translated by an arbitrary, user-specified distance. The effect of this is to implicitly change the size of the original cells. It also offers greater flexibility to adjust to the boundaries of clusters in the original data and minimizes the effect of the boundary cells. The clusters generated from this shifted grid structure can be used to revise the original clusters. Specifically, the ASGC algorithm involves 7 steps and is summarized in Algorithm 8, from [13].

Algorithm 8 ASGC Algorithm
1: Generate the first grid structure: the entire feature space is divided into non-overlapping cells, forming the first grid structure.
2: Identify the significant cells: these are cells whose density is more than a predefined threshold.
3: Generate the first set of clusters: all neighboring significant cells are grouped together to form clusters.
4: Transform the grid structure: the original coordinate origin is shifted by a distance ξ in each dimension of the feature space to obtain a new grid structure.
5: Generate the second set of clusters: new clusters are generated using steps 2 and 3.
6: Revise the original clusters: the clusters generated from the shifted grid structure can be used to revise the clusters generated from the original grid structure.
7: Generate the final clustering result.

The complexity of ASGC is the same as that of ADCC, O(m^d + N), where N is the number of data points, d is the dimensionality of the data, and m is the number of intervals in each dimension. The main difference between ADCC and ASGC is that the consensus method used to revise clusters is bi-directional in ASGC: using the overlapping cells, the clusters from the first phase can be used to modify the clusters of the second phase and vice versa. When a cluster of the first clustering overlaps a cluster of the second clustering, the combined cluster formed from the union of both can then be modified in order to generate the final clustering. This permits great flexibility in handling arbitrary cluster shapes in the original data and minimizes the extent to which either grid structure will separate clusters. By translating the original grid structure an arbitrary distance to create the second grid and overlapping it with the original grid structure, a different resolution (and implicitly a different cell size) is also achieved. While this method is less susceptible to the effects of cell size and cell density thresholds than other axis-shifting grid clustering methods, it still requires a careful initial choice of cell size and cell density threshold.
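The core shift-and-combine idea shared by NSGC, ADCC, and ASGC can be sketched as follows: significant cells are found once on the original grid and once on a grid whose origin is shifted by half a cell width, and a point is kept as clustered if it falls in a significant cell of either grid. This is a simplified illustration of the shifting step only (the consensus and revision logic of ADCC and ASGC is more involved); the names significant_cells and shift_and_combine and the cell_width and threshold parameters are assumptions.

```python
from collections import Counter

def significant_cells(points, cell_width, threshold, offset=0.0):
    """Cells of a grid whose origin is shifted by `offset` and whose count meets the threshold."""
    counts = Counter(tuple(int((c - offset) // cell_width) for c in p) for p in points)
    return {cell for cell, n in counts.items() if n >= threshold}

def shift_and_combine(points, cell_width=1.0, threshold=3):
    """Mark each point as lying in a dense region of either the original grid
    or the half-cell-shifted grid (the union of both sets of significant cells)."""
    original = significant_cells(points, cell_width, threshold, offset=0.0)
    shifted = significant_cells(points, cell_width, threshold, offset=cell_width / 2)
    kept = []
    for p in points:
        in_original = tuple(int(c // cell_width) for c in p) in original
        in_shifted = tuple(int((c - cell_width / 2) // cell_width) for c in p) in shifted
        kept.append(in_original or in_shifted)
    return kept
```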

1.4.4 GDILC: Grid-based Density-IsoLine Clustering Algorithm

Zhao and Song [31] proposed a Grid-based Density-IsoLine Clustering algorithm (GDILC) to perform clustering by making use of the density-isoline figure. It assumes that all data samples have been normalized: all attributes are numerical and lie in the range [0, 1]. This is for the convenience of distance and density calculation. GDILC first implicitly calculates a density-isoline figure, the contour figure of the density of the data points. Then clusters are discovered from the density-isoline figure. GDILC computes the density of a data point by counting the number of points in its neighbor region. Specifically, the density of a data point x is defined as follows:

$\mathrm{Density}(x) = |\{\, y : \mathrm{Dist}(x, y) \le T \,\}|$,   (1.3)

where T is a given distance threshold and Dist(x, y) is a distance function (e.g., Euclidean distance) used to measure the dissimilarity between data points x and y. The density-isoline figure is never drawn, but is obtained from the density vectors. The density vectors are computed by counting the elements of each row of the distance matrix that are less than the radius of the neighbor region, T. To avoid enumerating all data points when calculating the density vector, GDILC employs a grid-based method. The grid-based method first partitions each dimension into several intervals, creating hyper-rectangular cells. Then, to calculate the density of a data point x, GDILC considers only data points in the same cell as x and data points in its neighbor cells; this is identical to axis shifting. The GDILC algorithm is shown in Algorithm 9 [10].

Algorithm 9 GDILC Algorithm
1: Cells are initialized by dividing each dimension into m intervals.
2: The distances between sample points and those in neighboring cells are calculated. The distance threshold T is computed.
3: The density vector and the density threshold τ are computed.
4: At first, GDILC takes each data point whose density is more than the density threshold τ as a cluster. Then, for each data point x, check, for every data point whose density is more than the density threshold τ in the neighbor cells of x, whether its distance to x is less than the distance threshold T. If so, GDILC combines the two clusters containing those two data points. The algorithm continues until all point pairs have been checked.
5: Outliers are removed.

For many data sets, this grid-based method significantly reduces the search space for calculating the point-pair distances; the complexity may appear nearly linear. In the worst case, the time complexity of GDILC remains O(N^2) (i.e., in the pathological case when all the points fall in the neighborhood of a constant number of cells). However, GDILC cannot be scaled to high-dimensional data because the space is divided into m^d cells, where m is the number of intervals in each dimension and d is the dimensionality. When the dimensionality d is very large, m^d is significantly large and the data points in each cell are very sparse, so the GDILC algorithm will no longer work. (There will be difficulty computing any distances or thresholds.) There are two significant advantages to this algorithm. First, it can handle outliers explicitly, and this handling can be refined as desired. Second, it computes the necessary thresholds, such as those for density and distance, directly from the data. These can be fine-tuned as needed (i.e., they do not need to be guessed at any point). In essence, this algorithm dynamically learns the data distribution of the samples and the parameters for the thresholds while discerning the clustering in the data.
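A minimal sketch of the density computation in Equation 1.3, restricted to a point's own cell and its neighboring cells as GDILC does, might look like the following. The function name gdilc_densities is hypothetical, and the interval count m and distance threshold T are assumed parameters rather than the data-derived values GDILC actually computes.

```python
from collections import defaultdict
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+

def gdilc_densities(points, m=10, T=0.1):
    """Density(x) = number of points y with Dist(x, y) <= T, searching only
    the cell of x and its neighboring cells (data assumed normalized to [0, 1])."""
    cells = defaultdict(list)
    for p in points:
        key = tuple(min(int(c * m), m - 1) for c in p)  # cell index per dimension
        cells[key].append(p)

    densities = []
    for p in points:
        key = tuple(min(int(c * m), m - 1) for c in p)
        count = 0
        # Enumerate the 3^d block formed by the cell and its neighbors.
        for offset in product((-1, 0, 1), repeat=len(key)):
            neighbor = tuple(k + o for k, o in zip(key, offset))
            for q in cells.get(neighbor, ()):
                if dist(p, q) <= T:
                    count += 1
        densities.append(count)
    return densities
```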

1.5 High Dimensional Algorithms

The scalability of grid-based approaches is a significant problem for higher-dimensional data because of the increase in the size of the grid structure and the resultant increase in time complexity. Moreover, inherent issues in clustering high-dimensional data, such as filtering noise and identifying the most relevant attributes or dimensions that represent the most dense regions, must be addressed in the creation of the grid structure as well as in the clustering algorithm itself. In this section, we examine carefully a subspace clustering approach, presented in CLIQUE, and a density estimation approach, presented in OptiGrid. This section is greatly influenced by the survey of Berkhin [3], with additional insights on the complexity, strengths, and weaknesses of each algorithm presented.

1.5.1 CLIQUE: The Classical High-Dimensional Algorithm

Agrawal et al. [1] proposed a hybrid density-based, grid-based clustering algorithm, CLIQUE (CLustering In QUEst), to automatically find subspace clusterings of high-dimensional numerical data. It locates clusters embedded in subspaces of high-dimensional data without much user intervention to discern significant subclusters. In order to present the clustering results in an easily interpretable format, each cluster is given a minimal description as a disjunctive normal form (DNF) expression.

CLIQUE first partitions its numerical space into units for its grid structure. More specifically, let $A = \{A_1, A_2, \ldots, A_d\}$ be a set of bounded, totally ordered domains (attributes) and $S = A_1 \times A_2 \times \cdots \times A_d$ be a d-dimensional numerical space. By partitioning every dimension $A_i$ ($1 \le i \le d$) into m intervals of equal length, CLIQUE divides the d-dimensional data space into m^d non-overlapping rectangular units. A d-dimensional data point v is considered to be in a unit u if the value of v in each attribute is greater than or equal to the left boundary of that attribute in u and less than the right boundary of that attribute in u. The selectivity of a unit is defined to be the fraction of the total data points in the unit. Only units whose selectivity is greater than a parameter τ are viewed as dense and retained. The definition of dense units applies to all subspaces of the original d-dimensional space.

To identify the dense units to retain and the subspaces that contain clusters, CLIQUE considers projections of the subspaces from the bottom up (i.e., from the lowest-dimensional subspaces to those of increasing dimension). Given a projection subspace $A_{t_1} \times A_{t_2} \times \cdots \times A_{t_p}$, where p < d and $t_i < t_j$ if i < j, a unit is the intersection of an interval from each dimension. By leveraging the Apriori algorithm, CLIQUE employs a bottom-up scheme because monotonicity holds: if a collection of points is a cluster in a p-dimensional space, then this collection of points is also part of a cluster in any (p−1)-dimensional projection of this space. In CLIQUE, the recursive step from (p−1)-dimensional units to p-dimensional units involves a self-join of the (p−1)-dimensional units that share their first (p−2) dimensions [3].

To reduce the time complexity of the Apriori process, CLIQUE prunes the pool of candidates, keeping only the set of dense units to form the candidate units at the next level of the dense unit generation algorithm. To prune the candidates, all the subspaces are sorted by their coverage, i.e., the fraction of the database that is covered by the dense units in them, and the less-covered subspaces are pruned. The cut point between retained and pruned subspaces is selected based on the MDL principle [24] from information theory.
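The bottom-up, Apriori-style generation of dense units can be sketched as follows: 1-dimensional dense units are found first, and candidate units in p dimensions are formed by joining dense (p−1)-dimensional units that agree on their first p−2 dimensions, then checked against the selectivity threshold. This sketch works on (dimension, interval) pairs, assumes data normalized to [0, 1], and omits CLIQUE's coverage-based subspace pruning and MDL cut point; the name dense_units and the m, tau, and max_dim parameters are assumptions.

```python
from collections import Counter
from itertools import combinations

def dense_units(points, m=10, tau=0.02, max_dim=3):
    """Bottom-up generation of dense units (CLIQUE-style, without subspace pruning).

    A unit is a tuple of (dimension, interval) pairs, sorted by dimension; a point
    lies in a unit if, in every listed dimension, it falls into the listed interval.
    """
    n = len(points)
    d = len(points[0])

    def interval(value):            # equal-length intervals over [0, 1]
        return min(int(value * m), m - 1)

    def selectivity(unit):
        return sum(all(interval(p[dim]) == iv for dim, iv in unit) for p in points) / n

    # 1-dimensional dense units.
    counts = Counter()
    for p in points:
        for dim in range(d):
            counts[((dim, interval(p[dim])),)] += 1
    dense = {u for u, c in counts.items() if c / n > tau}
    all_dense = {1: set(dense)}

    # Join (p-1)-dimensional dense units sharing their first p-2 dimensions.
    for p_dim in range(2, max_dim + 1):
        candidates = set()
        for u1, u2 in combinations(sorted(dense), 2):
            if u1[:-1] == u2[:-1] and u1[-1][0] < u2[-1][0]:
                candidates.add(u1 + (u2[-1],))
        dense = {u for u in candidates if selectivity(u) > tau}
        all_dense[p_dim] = dense
    return all_dense
```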

CLIQUE then forms clusters from the remaining candidate units. Two p-dimensional units u_1 and u_2 are connected if they have a common face or if there exists another p-dimensional unit u_s such that u_1 is connected to u_s and u_2 is connected to u_s. A cluster is a maximal set of connected dense units in p dimensions. Finding clusters is equivalent to finding connected components in the graph that has the dense units as its vertices and an edge between two vertices if and only if the corresponding units share a common face. In the worst case, this can be done in time quadratic in the number of dense units. After finding all the clusters, CLIQUE uses a DNF expression to specify a finite set of maximal segments (regions) whose union is the cluster. Finding the minimal descriptions for the clusters is equivalent to finding an optimal cover of the clusters; this is NP-hard. In light of this, CLIQUE instead adopts a greedy approach that covers the cluster with regions and then discards redundant regions.

By integrating density-based, grid-based, and subspace clustering, CLIQUE discovers clusters embedded in subspaces of high-dimensional data without requiring users to select subspaces of interest. The DNF expressions for the clusters give a clear representation of the clustering results. The time complexity of CLIQUE is O(c^p + pN), where p is the highest subspace dimension selected, N is the number of input points, and c is a constant; this grows exponentially with respect to p. The algorithm offers an effective, efficient method of pruning the space of dense units in order to counter the inherent exponential nature of the problem. However, there is a trade-off for the pruning of dense units in the subspaces with low coverage: while the algorithm is faster, there is an increased likelihood of missing clusters. In addition, while CLIQUE does not require users to select subspaces of interest, its susceptibility to noise and its ability to identify relevant attributes are highly dependent on the user's choice of the number of unit intervals, m, and the sensitivity threshold, τ.

1.5.2 Variants of CLIQUE

There are two aspects of the CLIQUE algorithm that can be improved. The first is the criterion for subspace selection. The second is the size and resolution of the grid structure. The former is addressed by the ENCLUS algorithm, which uses entropy as the subspace selection criterion. The latter is addressed by the MAFIA algorithm, which uses adaptive grids for fast subspace clustering.

1.5.2.1 ENCLUS: Entropy-based Approach

The algorithm ENCLUS (ENtropy-based CLUStering) [6] is an adaptation of CLIQUE that uses a different, entropy-based criterion for subspace selection. Rather than using the fraction of total points in a subspace as the criterion to select subspaces, ENCLUS uses an entropy criterion, and only those subspaces spanned by attributes $A_1, \ldots, A_p$ with entropy $H(A_1, \ldots, A_p) < \varpi$ (a threshold) are selected for clustering. A low-entropy subspace corresponds to a more dense region of units. An analogous monotonicity condition, or Apriori property, also exists in terms of entropy: if a p-dimensional subspace has low entropy, then so does any (p−1)-dimensional projection of this subspace, since

$H(A_1, \ldots, A_{p-1}) = H(A_1, \ldots, A_p) - H(A_p \mid A_1, \ldots, A_{p-1}) < \varpi$.   (1.4)

A significant limitation of ENCLUS is its extremely high computational cost, especially in terms of computing the entropy of subspaces. However, this cost also yields the benefit that the approach has increased sensitivity for detecting clusters, especially extremely dense, small ones.
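The entropy criterion can be illustrated with a small sketch that estimates $H(A_1, \ldots, A_p)$ from the histogram of grid units in the chosen subspace: a uniform spread of points over the units gives high entropy, while concentration in a few units gives low entropy. The binning and the threshold comparison are illustrative, not ENCLUS's exact procedure; subspace_entropy, low_entropy, m, and omega (standing in for ϖ) are assumed names and parameters.

```python
from collections import Counter
from math import log2

def subspace_entropy(points, dims, m=10):
    """Estimate H over the units of the subspace spanned by `dims`
    (data assumed normalized to [0, 1]; m intervals per dimension)."""
    counts = Counter(
        tuple(min(int(p[d] * m), m - 1) for d in dims) for p in points
    )
    n = len(points)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def low_entropy(points, dims, omega=4.0, m=10):
    """Select the subspace for clustering only if its entropy is below omega."""
    return subspace_entropy(points, dims, m) < omega
```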
1.5.2.2 MAFIA: Adaptive Grids in High Dimensions

MAFIA (Merging of Adaptive Finite IntervAls), proposed by Goil et al. [21], is a descendant of CLIQUE. Instead of using a fixed-size cell grid structure with an equal number of bins in each dimension, MAFIA constructs adaptive grids to improve subspace clustering, and it also uses parallelism on a shared-nothing architecture to handle massive data sets.

MAFIA proposes an adaptive grid of bins in each dimension. Then, using an Apriori-style algorithm, dense intervals are merged to create clusters in the higher-dimensional space. The adaptive grid is created by partitioning each dimension independently based on the distribution (i.e., the histogram) observed in that dimension, merging intervals that have the same observed distribution, and pruning those intervals with low density. This pruning during the construction of the adaptive grid reduces the overall computation of the clustering step. The steps of MAFIA are summarized from [3] in Algorithm 10.

Algorithm 10 MAFIA Algorithm
1: Do one scan of the data to construct adaptive grids in each dimension.
2: Compute the histograms by reading blocks of data into memory, using bins.
3: Use the histograms to merge bins into a smaller number of adaptive, variable-size bins, where adjacent bins with similar histogram values are combined to form larger bins. The bins that have a low density of data are pruned.
4: Select bins that are α times more densely populated than average (α is a parameter called the cluster dominance factor) as the p-dimensional (p = 1 at this point) candidate dense units (CDUs).
5: Iteratively scan the data for higher dimensions; construct a new p-dimensional CDU from two (p−1)-dimensional CDUs if they share any (p−2)-dimensional face, and merge adjacent CDUs into clusters.
6: Generate minimal DNF expressions for each cluster.

If p is the highest dimensionality of a candidate dense unit (CDU), N is the number of data points, and m is a constant, the algorithm's complexity is O(m^p + pN), still exponential in the dimension, as CLIQUE also is. However, performance results on real data sets show that MAFIA is 40 to 50 times faster than CLIQUE because of the use of adaptive grids and their ability to select a smaller set of interesting CDUs [6]. Parallel MAFIA further offers the ability to obtain a highly scalable clustering for large data sets. Since the adaptive grid permits not only variable resolution, because of the variable bin size, but also variable, adaptive grid boundaries, MAFIA yields with greater accuracy cluster boundaries that are very close to grid boundaries and are readily expressed as minimal DNF expressions.
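Step 3 of Algorithm 10, merging adjacent histogram bins with similar values into variable-size bins and pruning sparse ones, can be sketched for a single dimension as follows; the function name adaptive_bins and the fine_bins, rel_tol, and prune_frac parameters are assumptions, and this is only an approximation of MAFIA's actual merging rule.

```python
import numpy as np

def adaptive_bins(values, fine_bins=50, rel_tol=0.2, prune_frac=0.02):
    """Merge adjacent fine histogram bins with similar counts into variable-size
    bins for one dimension, and drop merged bins that are too sparse (pruning)."""
    counts, edges = np.histogram(values, bins=fine_bins)
    merged = []                      # list of (left_edge, right_edge, total_count)
    start, total = 0, int(counts[0])
    for i in range(1, fine_bins):
        similar = abs(int(counts[i]) - counts[start]) <= rel_tol * max(counts[start], 1)
        if similar:
            total += int(counts[i])
        else:
            merged.append((edges[start], edges[i], total))
            start, total = i, int(counts[i])
    merged.append((edges[start], edges[-1], total))

    threshold = prune_frac * len(values)
    # Keep only merged bins that are dense enough; the rest are pruned.
    return [(lo, hi, c) for lo, hi, c in merged if c >= threshold]
```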
1.5.3 OptiGrid: Density-based Optimal Grid Partitioning

Hinneburg and Keim proposed OptiGrid (OPTimal GRID-Clustering) [12] to address several aspects of the curse of dimensionality, namely noise, scalability of the grid construction, and selection of relevant attributes, by optimizing the density function over the data space. OptiGrid uses density estimations to determine the centers of clusters, as was done for the DENCLUE algorithm [11]. A cluster is a region of concentrated density centered around a strong density attractor, or local maximum of the density function, with density above the noise threshold. Clusters may also have multiple centers if the centers are strong density attractors and there exists a path between them above the noise threshold. By recursively partitioning the feature space into multidimensional grids, OptiGrid creates an optimal grid partition by constructing the best cutting hyperplanes of the space. These cutting planes cut the space in areas of low density (i.e., local minima of the density function) and preserve areas of high density, or clusters, specifically the cluster centers (i.e., local maxima of the density function). The cutting hyperplanes are found using a set of contracting linear projections of the feature space. The contracting projections create upper bounds for the density on the planes orthogonal to them. Namely, for any point x in the image of a contracting projection P, and for any point y such that P(y) = x, the density at y is at most the density at x. The grid is defined more precisely by the formal definitions given in [12].
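A simplified illustration of the cutting-plane idea for the axis-parallel case: project the data onto each coordinate axis, estimate a 1-D density with a histogram, and place a cut at a low-density bin that lies between two sufficiently dense peaks. This is only a sketch of the intuition; OptiGrid's actual cutting planes, projections, and quality criteria are defined formally in [12], and the function name axis_parallel_cuts and the bins and noise_level parameters are assumptions.

```python
import numpy as np

def axis_parallel_cuts(points, bins=40, noise_level=0.02):
    """For each dimension, propose a cut point at a density minimum that lies
    between two peaks exceeding the noise level (None if no such cut exists)."""
    pts = np.asarray(points, dtype=float)
    n, d = pts.shape
    cuts = []
    for dim in range(d):
        counts, edges = np.histogram(pts[:, dim], bins=bins)
        density = counts / n
        peaks = [i for i in range(1, bins - 1)
                 if density[i] > noise_level
                 and density[i] >= density[i - 1] and density[i] >= density[i + 1]]
        if len(peaks) < 2:
            cuts.append(None)                      # no good place to cut on this axis
            continue
        lo, hi = peaks[0], peaks[-1]
        valley = lo + int(np.argmin(density[lo:hi + 1]))   # lowest density between peaks
        cut_value = 0.5 * (edges[valley] + edges[valley + 1])
        cuts.append(cut_value)
    return cuts
```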


More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/2004 Hierarchical

More information

Multimedia Databases. Wolf-Tilo Balke Philipp Wille Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.

Multimedia Databases. Wolf-Tilo Balke Philipp Wille Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs. Multimedia Databases Wolf-Tilo Balke Philipp Wille Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 14 Previous Lecture 13 Indexes for Multimedia Data 13.1

More information

VISUALIZING HIERARCHICAL DATA. Graham Wills SPSS Inc., http://willsfamily.org/gwills

VISUALIZING HIERARCHICAL DATA. Graham Wills SPSS Inc., http://willsfamily.org/gwills VISUALIZING HIERARCHICAL DATA Graham Wills SPSS Inc., http://willsfamily.org/gwills SYNONYMS Hierarchical Graph Layout, Visualizing Trees, Tree Drawing, Information Visualization on Hierarchies; Hierarchical

More information

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier Data Mining: Concepts and Techniques Jiawei Han Micheline Kamber Simon Fräser University К MORGAN KAUFMANN PUBLISHERS AN IMPRINT OF Elsevier Contents Foreword Preface xix vii Chapter I Introduction I I.

More information

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distance-based K-means, K-medoids,

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining /8/ What is Cluster

More information

K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 K-Means Cluster Analsis Chapter 3 PPDM Class Tan,Steinbach, Kumar Introduction to Data Mining 4/18/4 1 What is Cluster Analsis? Finding groups of objects such that the objects in a group will be similar

More information

Jan F. Prins. Work-efficient Techniques for the Parallel Execution of Sparse Grid-based Computations TR91-042

Jan F. Prins. Work-efficient Techniques for the Parallel Execution of Sparse Grid-based Computations TR91-042 Work-efficient Techniques for the Parallel Execution of Sparse Grid-based Computations TR91-042 Jan F. Prins The University of North Carolina at Chapel Hill Department of Computer Science CB#3175, Sitterson

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

On Clustering Validation Techniques

On Clustering Validation Techniques Journal of Intelligent Information Systems, 17:2/3, 107 145, 2001 c 2001 Kluwer Academic Publishers. Manufactured in The Netherlands. On Clustering Validation Techniques MARIA HALKIDI mhalk@aueb.gr YANNIS

More information

Cluster Analysis: Basic Concepts and Algorithms

Cluster Analysis: Basic Concepts and Algorithms 8 Cluster Analysis: Basic Concepts and Algorithms Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. If meaningful groups are the goal, then the clusters should

More information

SCAN: A Structural Clustering Algorithm for Networks

SCAN: A Structural Clustering Algorithm for Networks SCAN: A Structural Clustering Algorithm for Networks Xiaowei Xu, Nurcan Yuruk, Zhidan Feng (University of Arkansas at Little Rock) Thomas A. J. Schweiger (Acxiom Corporation) Networks scaling: #edges connected

More information

Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is Clustering 15-381 Artificial Intelligence Henry Lin Modified from excellent slides of Eamonn Keogh, Ziv Bar-Joseph, and Andrew Moore What is Clustering? Organizing data into clusters such that there is

More information

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report. Document Clustering. Meryem Uzun-Per Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

More information

Data Exploration Data Visualization

Data Exploration Data Visualization Data Exploration Data Visualization What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping to select

More information

Cluster Analysis: Basic Concepts and Methods

Cluster Analysis: Basic Concepts and Methods 10 Cluster Analysis: Basic Concepts and Methods Imagine that you are the Director of Customer Relationships at AllElectronics, and you have five managers working for you. You would like to organize all

More information

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised

More information

Forschungskolleg Data Analytics Methods and Techniques

Forschungskolleg Data Analytics Methods and Techniques Forschungskolleg Data Analytics Methods and Techniques Martin Hahmann, Gunnar Schröder, Phillip Grosse Prof. Dr.-Ing. Wolfgang Lehner Why do we need it? We are drowning in data, but starving for knowledge!

More information

The Data Mining Process

The Data Mining Process Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

More information

Load Balancing. Load Balancing 1 / 24

Load Balancing. Load Balancing 1 / 24 Load Balancing Backtracking, branch & bound and alpha-beta pruning: how to assign work to idle processes without much communication? Additionally for alpha-beta pruning: implementing the young-brothers-wait

More information

Introduction. Introduction. Spatial Data Mining: Definition WHAT S THE DIFFERENCE?

Introduction. Introduction. Spatial Data Mining: Definition WHAT S THE DIFFERENCE? Introduction Spatial Data Mining: Progress and Challenges Survey Paper Krzysztof Koperski, Junas Adhikary, and Jiawei Han (1996) Review by Brad Danielson CMPUT 695 01/11/2007 Authors objectives: Describe

More information

Neural Networks Lesson 5 - Cluster Analysis

Neural Networks Lesson 5 - Cluster Analysis Neural Networks Lesson 5 - Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt. - Sapienza University of Rome http://ispac.ing.uniroma1.it/scarpiniti/index.htm michele.scarpiniti@uniroma1.it Rome, 29

More information

Principles of Data Mining by Hand&Mannila&Smyth

Principles of Data Mining by Hand&Mannila&Smyth Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences

More information

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations

More information

Vector storage and access; algorithms in GIS. This is lecture 6

Vector storage and access; algorithms in GIS. This is lecture 6 Vector storage and access; algorithms in GIS This is lecture 6 Vector data storage and access Vectors are built from points, line and areas. (x,y) Surface: (x,y,z) Vector data access Access to vector

More information

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

The Role of Visualization in Effective Data Cleaning

The Role of Visualization in Effective Data Cleaning The Role of Visualization in Effective Data Cleaning Yu Qian Dept. of Computer Science The University of Texas at Dallas Richardson, TX 75083-0688, USA qianyu@student.utdallas.edu Kang Zhang Dept. of Computer

More information

Data Mining Classification: Decision Trees

Data Mining Classification: Decision Trees Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous

More information

Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009

Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009 Cluster Analysis Alison Merikangas Data Analysis Seminar 18 November 2009 Overview What is cluster analysis? Types of cluster Distance functions Clustering methods Agglomerative K-means Density-based Interpretation

More information

Data Exploration and Preprocessing. Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)

Data Exploration and Preprocessing. Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) Data Exploration and Preprocessing Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

More information

BIRCH: An Efficient Data Clustering Method For Very Large Databases

BIRCH: An Efficient Data Clustering Method For Very Large Databases BIRCH: An Efficient Data Clustering Method For Very Large Databases Tian Zhang, Raghu Ramakrishnan, Miron Livny CPSC 504 Presenter: Discussion Leader: Sophia (Xueyao) Liang HelenJr, Birches. Online Image.

More information

Performance of KDB-Trees with Query-Based Splitting*

Performance of KDB-Trees with Query-Based Splitting* Performance of KDB-Trees with Query-Based Splitting* Yves Lépouchard Ratko Orlandic John L. Pfaltz Dept. of Computer Science Dept. of Computer Science Dept. of Computer Science University of Virginia Illinois

More information

Cluster Description Formats, Problems and Algorithms

Cluster Description Formats, Problems and Algorithms Cluster Description Formats, Problems and Algorithms Byron J. Gao Martin Ester School of Computing Science, Simon Fraser University, Canada, V5A 1S6 bgao@cs.sfu.ca ester@cs.sfu.ca Abstract Clustering is

More information

Robust Outlier Detection Technique in Data Mining: A Univariate Approach

Robust Outlier Detection Technique in Data Mining: A Univariate Approach Robust Outlier Detection Technique in Data Mining: A Univariate Approach Singh Vijendra and Pathak Shivani Faculty of Engineering and Technology Mody Institute of Technology and Science Lakshmangarh, Sikar,

More information

Jiří Matas. Hough Transform

Jiří Matas. Hough Transform Hough Transform Jiří Matas Center for Machine Perception Department of Cybernetics, Faculty of Electrical Engineering Czech Technical University, Prague Many slides thanks to Kristen Grauman and Bastian

More information

An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis]

An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis] An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis] Stephan Spiegel and Sahin Albayrak DAI-Lab, Technische Universität Berlin, Ernst-Reuter-Platz 7,

More information

Grid Density Clustering Algorithm

Grid Density Clustering Algorithm Grid Density Clustering Algorithm Amandeep Kaur Mann 1, Navneet Kaur 2, Scholar, M.Tech (CSE), RIMT, Mandi Gobindgarh, Punjab, India 1 Assistant Professor (CSE), RIMT, Mandi Gobindgarh, Punjab, India 2

More information

Load balancing in a heterogeneous computer system by self-organizing Kohonen network

Load balancing in a heterogeneous computer system by self-organizing Kohonen network Bull. Nov. Comp. Center, Comp. Science, 25 (2006), 69 74 c 2006 NCC Publisher Load balancing in a heterogeneous computer system by self-organizing Kohonen network Mikhail S. Tarkov, Yakov S. Bezrukov Abstract.

More information

How To Cluster Of Complex Systems

How To Cluster Of Complex Systems Entropy based Graph Clustering: Application to Biological and Social Networks Edward C Kenley Young-Rae Cho Department of Computer Science Baylor University Complex Systems Definition Dynamically evolving

More information

How To Cluster

How To Cluster Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

Iris Sample Data Set. Basic Visualization Techniques: Charts, Graphs and Maps. Summary Statistics. Frequency and Mode

Iris Sample Data Set. Basic Visualization Techniques: Charts, Graphs and Maps. Summary Statistics. Frequency and Mode Iris Sample Data Set Basic Visualization Techniques: Charts, Graphs and Maps CS598 Information Visualization Spring 2010 Many of the exploratory data techniques are illustrated with the Iris Plant data

More information

Drawing a histogram using Excel

Drawing a histogram using Excel Drawing a histogram using Excel STEP 1: Examine the data to decide how many class intervals you need and what the class boundaries should be. (In an assignment you may be told what class boundaries to

More information

Lecture 10: Regression Trees

Lecture 10: Regression Trees Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

More information

Clustering UE 141 Spring 2013

Clustering UE 141 Spring 2013 Clustering UE 141 Spring 013 Jing Gao SUNY Buffalo 1 Definition of Clustering Finding groups of obects such that the obects in a group will be similar (or related) to one another and different from (or

More information

Graph/Network Visualization

Graph/Network Visualization Graph/Network Visualization Data model: graph structures (relations, knowledge) and networks. Applications: Telecommunication systems, Internet and WWW, Retailers distribution networks knowledge representation

More information

The Minimum Consistent Subset Cover Problem and its Applications in Data Mining

The Minimum Consistent Subset Cover Problem and its Applications in Data Mining The Minimum Consistent Subset Cover Problem and its Applications in Data Mining Byron J Gao 1,2, Martin Ester 1, Jin-Yi Cai 2, Oliver Schulte 1, and Hui Xiong 3 1 School of Computing Science, Simon Fraser

More information

Characterizing the Performance of Dynamic Distribution and Load-Balancing Techniques for Adaptive Grid Hierarchies

Characterizing the Performance of Dynamic Distribution and Load-Balancing Techniques for Adaptive Grid Hierarchies Proceedings of the IASTED International Conference Parallel and Distributed Computing and Systems November 3-6, 1999 in Cambridge Massachusetts, USA Characterizing the Performance of Dynamic Distribution

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Clustering Algorithms K-means and its variants Hierarchical clustering

More information

Image Segmentation and Registration

Image Segmentation and Registration Image Segmentation and Registration Dr. Christine Tanner (tanner@vision.ee.ethz.ch) Computer Vision Laboratory, ETH Zürich Dr. Verena Kaynig, Machine Learning Laboratory, ETH Zürich Outline Segmentation

More information

Linear Threshold Units

Linear Threshold Units Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

More information

Data Mining: Foundation, Techniques and Applications

Data Mining: Foundation, Techniques and Applications Data Mining: Foundation, Techniques and Applications Lesson 1b :A Quick Overview of Data Mining Li Cuiping( 李 翠 平 ) School of Information Renmin University of China Anthony Tung( 鄧 锦 浩 ) School of Computing

More information

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether

More information

Unsupervised learning: Clustering

Unsupervised learning: Clustering Unsupervised learning: Clustering Salissou Moutari Centre for Statistical Science and Operational Research CenSSOR 17 th September 2013 Unsupervised learning: Clustering 1/52 Outline 1 Introduction What

More information

Technology White Paper Capacity Constrained Smart Grid Design

Technology White Paper Capacity Constrained Smart Grid Design Capacity Constrained Smart Grid Design Smart Devices Smart Networks Smart Planning EDX Wireless Tel: +1-541-345-0019 I Fax: +1-541-345-8145 I info@edx.com I www.edx.com Mark Chapman and Greg Leon EDX Wireless

More information

Overview. Background. Data Mining Analytics for Business Intelligence and Decision Support

Overview. Background. Data Mining Analytics for Business Intelligence and Decision Support Mining Analytics for Business Intelligence and Decision Support Chid Apte, PhD Manager, Abstraction Research Group IBM TJ Watson Research Center apte@us.ibm.com http://www.research.ibm.com/dar Overview

More information

A Comparative Analysis of Various Clustering Techniques used for Very Large Datasets

A Comparative Analysis of Various Clustering Techniques used for Very Large Datasets A Comparative Analysis of Various Clustering Techniques used for Very Large Datasets Preeti Baser, Assistant Professor, SJPIBMCA, Gandhinagar, Gujarat, India 382 007 Research Scholar, R. K. University,

More information

Topological Properties

Topological Properties Advanced Computer Architecture Topological Properties Routing Distance: Number of links on route Node degree: Number of channels per node Network diameter: Longest minimum routing distance between any

More information

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 8/05/2005 1 What is data exploration? A preliminary

More information

System Interconnect Architectures. Goals and Analysis. Network Properties and Routing. Terminology - 2. Terminology - 1

System Interconnect Architectures. Goals and Analysis. Network Properties and Routing. Terminology - 2. Terminology - 1 System Interconnect Architectures CSCI 8150 Advanced Computer Architecture Hwang, Chapter 2 Program and Network Properties 2.4 System Interconnect Architectures Direct networks for static connections Indirect

More information

Distributed Dynamic Load Balancing for Iterative-Stencil Applications

Distributed Dynamic Load Balancing for Iterative-Stencil Applications Distributed Dynamic Load Balancing for Iterative-Stencil Applications G. Dethier 1, P. Marchot 2 and P.A. de Marneffe 1 1 EECS Department, University of Liege, Belgium 2 Chemical Engineering Department,

More information

A Note on Maximum Independent Sets in Rectangle Intersection Graphs

A Note on Maximum Independent Sets in Rectangle Intersection Graphs A Note on Maximum Independent Sets in Rectangle Intersection Graphs Timothy M. Chan School of Computer Science University of Waterloo Waterloo, Ontario N2L 3G1, Canada tmchan@uwaterloo.ca September 12,

More information

Part-Based Recognition

Part-Based Recognition Part-Based Recognition Benedict Brown CS597D, Fall 2003 Princeton University CS 597D, Part-Based Recognition p. 1/32 Introduction Many objects are made up of parts It s presumably easier to identify simple

More information

Clustering methods for Big data analysis

Clustering methods for Big data analysis Clustering methods for Big data analysis Keshav Sanse, Meena Sharma Abstract Today s age is the age of data. Nowadays the data is being produced at a tremendous rate. In order to make use of this large-scale

More information

Network Intrusion Detection using a Secure Ranking of Hidden Outliers

Network Intrusion Detection using a Secure Ranking of Hidden Outliers Network Intrusion Detection using a Secure Ranking of Hidden Outliers Marwan Hassani and Thomas Seidl Data Management and Data Exploration Group RWTH Aachen University, Germany {hassani, seidl}@cs.rwth-aachen.de

More information

CHAPTER-24 Mining Spatial Databases

CHAPTER-24 Mining Spatial Databases CHAPTER-24 Mining Spatial Databases 24.1 Introduction 24.2 Spatial Data Cube Construction and Spatial OLAP 24.3 Spatial Association Analysis 24.4 Spatial Clustering Methods 24.5 Spatial Classification

More information

Interconnection Networks. Interconnection Networks. Interconnection networks are used everywhere!

Interconnection Networks. Interconnection Networks. Interconnection networks are used everywhere! Interconnection Networks Interconnection Networks Interconnection networks are used everywhere! Supercomputers connecting the processors Routers connecting the ports can consider a router as a parallel

More information

Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

More information

Clustering. Chapter 7. 7.1 Introduction to Clustering Techniques. 7.1.1 Points, Spaces, and Distances

Clustering. Chapter 7. 7.1 Introduction to Clustering Techniques. 7.1.1 Points, Spaces, and Distances 240 Chapter 7 Clustering Clustering is the process of examining a collection of points, and grouping the points into clusters according to some distance measure. The goal is that points in the same cluster

More information