Scalable Single Linkage Hierarchical Clustering For Big Data

Timothy C. Havens (1), James C. Bezdek (2), Marimuthu Palaniswami (2)
(1) Electrical and Computer Engineering and Computer Science Departments, Michigan Technological University, Houghton, MI, USA
(2) Department of Electrical and Electronic Engineering, University of Melbourne, Parkville, VIC, Australia

Abstract: Personal computing technologies are everywhere; hence, there is an abundance of staggeringly large data sets: the Library of Congress has stored over 160 terabytes of web data, and it is estimated that Facebook alone logs nearly a petabyte of data per day. Thus, there is a pertinent need for systems by which one can elucidate the similarity and dissimilarity among and between groups in these big data sets. Clustering is one way to find such groups. In this paper, we extend the scalable Visual Assessment of Tendency (svat) algorithm to return single-linkage partitions of big data sets. The svat algorithm is designed to provide visual evidence of the number of clusters in unloadable (big) data sets. The extension we describe enables svat to also efficiently return the data partition indicated by the visual evidence. The computational complexity and storage requirements of svat are (usually) significantly less than the O(n^2) requirements of the classic single-linkage hierarchical algorithm. We show that svat is a scalable instantiation of single-linkage clustering for data sets that contain c compact-separated clusters, where c ≪ n and n is the number of objects. For data sets that do not contain compact-separated clusters, we show that svat produces a good approximation of single-linkage partitions. Experimental results are presented for both synthetic and real data sets.

I. INTRODUCTION

Clustering, or cluster analysis, is a form of exploratory data analysis in which data are separated into groups or subsets such that the objects in each group share more similarity to each other than to the objects in other groups. Clustering has been used for many purposes, and there are many good books that describe its various uses [1, 2]. The most popular use of clustering is to assign labels to unlabeled data, i.e., data for which no pre-existing grouping is known. Assigning labels, or partitioning, is not the only operation in clustering. There is also cluster tendency, which asks the question, "Are there clusters and, if so, how many?", and cluster validity, which asks, "Are the clusters I found any good?" In this paper we use the scalable version of the popular visual assessment of tendency (VAT) [3] algorithm to both assess tendency and partition the data.

Big data is a reality of modern computing: social networking and mobile computing alone account for terabytes (TB) to petabytes of logged data per day. Hence, there is a great need for algorithms that can address these big data concerns. In 1996, Huber [4] classified data set sizes as in Table I; Bezdek and Hathaway [5] added the Very Large (VL) category to this table in 2006. In recent years, these data have been collectively called Big Data.

                       TABLE I
          HUBER'S NOMINAL DATA SET SIZES [4]

  Bytes   10^6     10^8    10^10   10^12     >10^12
  Size    medium   large   huge    monster   VL
                                   |----- Big Data -----|

(Huber also defined tiny as 10^2 bytes and small as 10^4 bytes.)

Interestingly, monster and VL data sets are still unloadable on most current (circa 2012) computers. For example, a data set representing 10^12 objects, each with 10 features, stored in short integer (4 byte) format would require 40 TB of storage, while most high-performance computing platforms have less than 1 TB of main memory.
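
As a quick check of that storage figure (a direct calculation under the 10^12-object, 10-feature, 4-byte assumptions stated above):

    10^12 objects x 10 features x 4 bytes = 4 x 10^13 bytes = 40 TB.
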
Hence, we believe that Table I will continue to be pertinent for many years: Big Data will always be a concern, because data sets will continue to grow.

There are two main approaches to clustering in big data: distributed clustering based on various incremental styles, and clustering a sample found by progressive or random sampling (or some hybrid scheme). Each has been applied in the context of just about every clustering algorithm, including k-means, fuzzy c-means, and kernelized variants [5-10]. Both approaches provide useful ways to accomplish two objectives: acceleration for loadable data, and approximation for unloadable data.

Now consider a set of n objects O = {o_1, ..., o_n}, e.g., hurricanes, wireless sensor network nodes, B-list actors, or members of a social network. Each object is typically represented by numerical feature-vector data of the form X = {x_1, ..., x_n} ⊂ R^p, where the coordinates of x_i provide feature values (e.g., wind speed, sensor voltage, age, number of blog posts, etc.) describing object o_i. An alternative form of data is relational data, where only the relationships between pairs of objects are known. This type of data is especially prevalent in document analysis and bioinformatics. Relational data typically consist of the n^2 values of a dissimilarity matrix R, where r_ij = d(o_i, o_j) is the pair-wise dissimilarity (or distance) between objects o_i and o_j. For instance, numerical data X can always be converted to R by R = [r_ij] = [||x_i - x_j||] (any vector norm on R^p).
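
For loadable feature-vector data this conversion is a one-liner; the sketch below is our illustration (the function name is ours, and SciPy is assumed), not code from [3] or [11]:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def dissimilarity_matrix(X, metric="euclidean"):
        # Convert feature-vector data X (n x p) to the n x n relational
        # matrix R, with R[i, j] = d(x_i, x_j) under the chosen metric.
        return squareform(pdist(np.asarray(X), metric=metric))

Note that this materializes all n^2 entries of R, which is exactly the step that becomes infeasible for VL data.
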

There are, however, similarity and dissimilarity relational data that do not begin as feature-vector data; for these, there is no choice but to use a relational algorithm. Hence, relational data represent the most general form of input data.

A crisp partition of the objects is defined by the n entries of an indicator vector Π = (π_1, ..., π_n), where π_i is the cluster label of object o_i; e.g., π_i = 2 indicates that o_i is in cluster 2. Other types of partitions exist, such as fuzzy and probabilistic, but we limit our discussion here to crisp partitions: each object belongs wholly and unequivocally to exactly one cluster.

In this paper, we propose an extension to the scalable VAT (svat) algorithm [11] that produces crisp c-partitions of big data. Our algorithm first uses svat to sample the data and reorder the sample. We then use a property of VAT reordering that allows us to efficiently compute single-linkage partitions of the reordered data sample. Finally, we extend the partition to the entire data set using the simple nearest-object rule. Section II describes the svat algorithm, followed by our proposed svat-SL algorithm in Section III. We demonstrate svat-SL on synthetic and real data in Section IV. Finally, Section V summarizes and proposes future research directions. We now discuss the VAT algorithm and related research.

A. Matrix Reordering for Tendency Assessment and Partitioning

Matrix reordering has a long history of use for clustering data. Petrie, in 1899, illustrated the value of reordering objects by visually clustering 917 pieces of prehistoric Egyptian pottery recovered from about 4,000 overlapping tombs [12]. In 1909, Czekanowski proposed the first method for clustering in dissimilarity data using a visual approach [13]: he reordered, by hand, a dissimilarity matrix that represented the average difference among various skulls, showing that the skulls roughly clustered into two distinct groups. These pieces of history, and other important milestones in the use of heat maps (aka reordered dissimilarity images) to visualize clusters, are described in detail in [14].

The matrix reordering method that we focus on in this paper is VAT, outlined in Algorithm 1. The VAT algorithm displays an image of reordered (and scaled for imaging) dissimilarity data [3]. Each pixel of the image of the VAT-reordered matrix R* displays the scaled dissimilarity value between two objects: white pixels represent high dissimilarity, while black represents low dissimilarity. (Each object is exactly similar to itself, resulting in a zero-valued diagonal of R and R*; the diagonal thus gives no information on how the data should cluster. Hence, most VAT users will scale the pixel values of the off-diagonal elements of R* to best use the quantized range of the imaging method. An easy way to do this in MATLAB is to set the diagonal elements of R* to the minimum of the off-diagonal elements and then use the MATLAB function imagesc to display the image.) A dark block along the diagonal of the image of R* is a sub-matrix of similarly small dissimilarity values; hence, dark blocks may represent clusters of objects that are relatively similar to each other.

[Fig. 1. Example of VAT visualization showing a cluster tendency of c = 3: (a) data set X; (b) Euclidean distance matrix R; (c) VAT image of R.]
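
In Python, the imaging trick from the parenthetical note above might look like the following sketch (ours; matplotlib is assumed, and the function name is our own):

    import numpy as np
    import matplotlib.pyplot as plt

    def show_vat_image(R_star):
        # Display a VAT-reordered dissimilarity matrix as a grayscale image.
        # The zero diagonal carries no cluster information, so set it to the
        # minimum off-diagonal value before scaling (the imagesc trick).
        D = np.array(R_star, dtype=float)
        off_diag = D[~np.eye(D.shape[0], dtype=bool)]
        np.fill_diagonal(D, off_diag.min())
        plt.imshow(D, cmap="gray", interpolation="nearest")  # white = high dissimilarity
        plt.show()
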
Thus, cluster tendency, or the number of clusters, may be shown by the number of dark blocks along the diagonal of the VAT image. Figure 1 shows the VAT visualization for a data set that has a preferred cluster count of c = 3: the VAT image in view (c) clearly shows 3 dark blocks and, thus, a tendency of the data to have 3 clusters.

Algorithm 1: VAT
  Data: R ∈ (R+)^(n x n), dissimilarity matrix
  Result: R*, the VAT-reordered R; p, the VAT reordering indices of R; d, the MST cut-magnitude vector
  Set K = {1, ..., n} and I = J = ∅
  Select (i, j) ∈ arg max_{k ∈ K, q ∈ K} r_kq
  p_1 = i; I = {i}; J = K - {i}
  for t = 2, ..., n do
      Select (i, j) ∈ arg min_{k ∈ I, q ∈ J} r_kq
      p_t = j; I = I ∪ {j}; J = J - {j}
      d_{t-1} = r_ij
  Obtain the reordered dissimilarity matrix R* using the ordering array p as r*_kq = r_{p_k, p_q}, 1 ≤ k, q ≤ n.

Algorithm 1 shows that VAT is based on (but not identical to) Prim's algorithm [15] for finding the minimum spanning tree (MST) of a weighted undirected graph.
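
A direct NumPy rendering of Algorithm 1 is sketched below (our illustration of the listing above, not the authors' released code). Keeping, for every unvisited object, its smallest dissimilarity to the visited set is what makes the sweep Prim-like and O(n^2):

    import numpy as np

    def vat(R):
        # VAT reordering (Algorithm 1). Returns the reordered matrix R*,
        # the ordering p, and the MST edge magnitudes d, where d[t-1] is
        # the edge that attaches the object placed at position t.
        n = R.shape[0]
        i, _ = np.unravel_index(np.argmax(R), R.shape)  # endpoint of the largest dissimilarity
        p = np.empty(n, dtype=int)
        p[0] = i
        d = np.empty(n - 1)
        nearest = np.array(R[i], dtype=float)  # distance of each object to the visited set
        visited = np.zeros(n, dtype=bool)
        visited[i] = True
        for t in range(1, n):
            nearest[visited] = np.inf          # never re-select a visited object
            j = int(np.argmin(nearest))
            d[t - 1] = nearest[j]
            p[t] = j
            visited[j] = True
            nearest = np.minimum(nearest, R[j])
        return R[np.ix_(p, p)], p, d

Calling vat(R) on a loadable R returns the reordered matrix to image, the ordering p, and the MST edge magnitudes d that are cut below to form partitions.
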

In [16], we showed that this property can be used to prove that all single-linkage partitions appear as aligned partitions in the VAT-reordered objects O. Aligned c-partitions of O have c contiguous blocks of cluster labels in Π, starting with 1 and ending with c. For example, Π = (1, 1, 1, 2, 2, 2, 3, 3) is an aligned partition of 8 objects, while Π = (1, 1, 1, 2, 2, 3, 2, 3) is not. The special nature of aligned partitions enables us to specify them in an alternative form: every aligned c-partition of n objects is isomorphic to the unique sequence of c integers (the cardinalities of the c clusters in Π) satisfying {n_i ≥ 1; 1 ≤ i ≤ c; Σ_{i=1}^{c} n_i = n}. So, aligned partitions are completely specified by {n_1 : ... : n_c}; for example, Π = (1, 1, 1, 2, 2, 2, 3, 3) = {3 : 3 : 2}.

Because single-linkage clusters are always aligned in the VAT-reordered data, to find them we merely have to cut the largest (c - 1) edges in the MST and form the corresponding aligned partition. For example, to compute the single-linkage 4-partition, we find the 3 largest values of the vector d, store their index values as i_(1), i_(2), and i_(3) (where i_(1) < i_(2) < i_(3)), and form the aligned partition {i_(1) : i_(2) - i_(1) : i_(3) - i_(2) : n - i_(3)}. We will use this method to form single-linkage partitions of the svat sample, which provides an approximation of the single-linkage partition for big data.

Another property of the VAT image is that the contrast and presence of dark blocks on the diagonal is related to Dunn's index [17], a measure of how well a set of clusters represents compact-separated (CS) clusters. For a set of objects O with corresponding relational dissimilarity data R, we say that a partition Π = {π_1, ..., π_c} of O is CS relative to R if each of the possible intra-cluster distances is strictly less than each of the possible inter-cluster distances; we state this by saying that O can be partitioned into c CS clusters. For a given relational matrix R and set of indicator vectors Π, Dunn's index is defined as

    α(c, Π) = [ min_{1 ≤ k ≤ c} min_{1 ≤ q ≤ c, q ≠ k} dist(π_k, π_q) ] / [ max_{1 ≤ k ≤ c} diam(π_k) ],    (1)

where π_k is the kth cluster, dist(π_k, π_q) is the distance between two clusters, and diam(π_k) is the cluster diameter [18]. The distance and diameter functions are

    dist(π_k, π_q) = min_{i ∈ π_k, j ∈ π_q} r_ij,    (2)

    diam(π_k) = max_{i, j ∈ π_k} r_ij.    (3)

The relative validity of clusters found at different values of c, or by different clustering algorithms, can be compared by examining the respective values of (1) for each partition.

Definition 1. Clusters that have a Dunn's index α(c, Π) > 1 are compact-separated (CS) clusters [18].

Later, we will show a relationship between CS clusters and the proposed svat-SL algorithm. First, we describe the svat algorithm in detail.
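
Before doing so, note that (1)-(3) translate directly into a few lines of NumPy. The sketch below is ours (assuming a loadable R, integer labels, and at least two clusters) and can be used to check whether a candidate partition is CS:

    import numpy as np

    def dunns_index(R, labels):
        # Dunn's index (Eq. 1) for a crisp partition over dissimilarity data R.
        # Values > 1 certify compact-separated clusters (Definition 1).
        clusters = [np.flatnonzero(labels == c) for c in np.unique(labels)]
        # Smallest single-linkage distance between any pair of clusters (Eq. 2).
        min_dist = min(R[np.ix_(a, b)].min()
                       for i, a in enumerate(clusters)
                       for b in clusters[i + 1:])
        # Largest cluster diameter (Eq. 3).
        max_diam = max(R[np.ix_(a, a)].max() for a in clusters)
        return min_dist / max_diam
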
II. SCALABLE VAT

The svat algorithm is a scalable solution for visualizing the number of clusters in a big data set. It first draws a representative sample of the data and then produces a visualization that shows the number of clusters as the number of dark blocks on the diagonal of a reordered dissimilarity image. Algorithm 2 outlines the steps.

Algorithm 2: svat
  Data: R ∈ (R+)^(n x n), dissimilarity matrix
  Input: c', overestimate of the true number of clusters c; n_s, size of the approximating sample
  Result: R_s, the VAT-reordered R_S; S = {S_1, ..., S_c'}, grouping sets; S', indices of the samples in R_S; p, VAT reordering indices of the sample R_S; d, MST cut vector
  1. Select the indices M of the c' distinguished objects:
       m_1 = 1
       d = (d_1, ..., d_n) = (r_11, ..., r_1n)
       for t = 2, ..., c' do
           d = (min{d_1, r_{m_{t-1},1}}, ..., min{d_n, r_{m_{t-1},n}})
           m_t = arg max_{1 ≤ j ≤ n} {d_j}
  2. Group the objects in O = {o_1, ..., o_n} with their nearest distinguished object:
       S_1 = S_2 = ... = S_c' = ∅
       for t = 1, ..., n do
           k = arg min_{1 ≤ j ≤ c'} {r_{m_j,t}}
           S_k = S_k ∪ {t}
  3. Select some data for R_S near each of the distinguished objects:
       n_t = ⌈n_s |S_t| / n⌉, t = 1, ..., c'
       Draw n_t random indices S_t' from S_t without replacement, t = 1, ..., c'
       S' = ∪_{t=1}^{c'} S_t'
       Form R_S, the square submatrix of R indexed by S' in both rows and columns
  4. Apply VAT to R_S, returning R_s and p (and, optionally, d)

The svat algorithm essentially performs four steps. In Step 1, svat selects a set of c' distinguished objects that, one hopes, represents the clustering structure of all the objects; in other words, the c' distinguished objects are prototypical indices for the clusters in the big data. At Step 2, the objects in O are partitioned using the nearest-prototype rule, where the prototypes are the c' distinguished object indices. Step 3 draws a random subset of O to produce a well-represented sample of (approximate) size n_s. Finally, in Step 4, VAT is applied to the (approximately) n_s x n_s submatrix R_S.
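
Steps 1-3 can be paraphrased in NumPy as follows (our sketch; the function name is ours). Step 1 is a maximin ("farthest-first") sweep, and Step 3 samples each group in proportion to its size:

    import numpy as np

    def svat_sample(R, c_prime, n_s, rng=None):
        # Steps 1-3 of svat: pick c' distinguished objects, group every
        # object with its nearest distinguished object, and draw a
        # proportional random sample from each group. Returns the sample
        # indices S', so that R_S = R[np.ix_(S, S)].
        rng = np.random.default_rng(rng)
        n = R.shape[0]
        # Step 1: distinguished objects (start at object 0; then farthest-first).
        m = [0]
        d = np.array(R[0], dtype=float)
        for _ in range(1, c_prime):
            d = np.minimum(d, R[m[-1]])
            m.append(int(np.argmax(d)))
        # Step 2: nearest-prototype grouping (R[m] is a c' x n block).
        groups = np.argmin(R[m], axis=0)
        # Step 3: proportional sampling without replacement from each group.
        sample = []
        for k in range(c_prime):
            idx = np.flatnonzero(groups == k)
            if len(idx) == 0:
                continue
            n_k = int(np.ceil(n_s * len(idx) / n))
            sample.extend(rng.choice(idx, size=min(n_k, len(idx)), replace=False))
        return np.sort(np.array(sample))
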

The computational complexity of svat Steps 1 and 2 is O(c'n), with storage requirements of O(c'n) (for fast execution). Drawing the indices of R_S in Step 3 requires O(c'n) computations and storage of ñ_s entries, where ñ_s = |S'| (note that ñ_s is usually only slightly greater than the chosen sample size n_s). The final run of VAT on R_S is O(ñ_s^2) in both computation and storage. If the data begin as vectors X ⊂ R^p, then Step 1 requires O(pc'n) operations to compute the necessary elements of R, and Step 3 requires O(pñ_s^2) operations to compute the elements of R_S. It can be shown that ñ_s ≤ n_s + c'; hence, the overall storage requirement of svat is O(max{c'n, (n_s + c')^2}). The overall computational complexity is O(max{c'n, (n_s + c')^2}) if the data begin as relational data R, and O(max{pc'n, pñ_s^2, (n_s + c')^2}) if the data begin as vectors X. In [11], it was stated that these complexities show that svat scales linearly with n. We argue, however, that because c', ñ_s, and n_s are some fraction of n, svat does not asymptotically scale linearly with n. Nevertheless, if c' ≪ n and n_s ≪ n, these orders of complexity represent a significant reduction in both storage and computation.

The svat algorithm has two desirable properties for data sets that contain CS clusters. Proposition 1, from [11], describes the first property and is important in our later analysis of svat-SL. The second property is as follows: for an object set O that contains c CS clusters, the proportion of objects in the svat sample S' (drawn in Step 3 of Algorithm 2) from the ith cluster equals the proportion of objects in O from the ith cluster. In other words, each CS cluster's size will appear correctly in the svat image (large clusters will appear as large dark blocks, while small clusters will appear as small dark blocks).

Proposition 1 ([11]). Consider a set of objects O that can be partitioned into c CS clusters, and let c' ≥ c. Then Step 1 of svat selects at least one distinguished object from each CS cluster.

Proof: See [11].

We now propose an extension to svat, called svat-SL, that returns a c-partition, c ≤ c', of O. For the case where there exist c CS clusters in O, we show that the svat-SL partition is exactly the single-linkage c-partition of O.

III. SVAT-SL ALGORITHM

The svat-SL algorithm, proposed in Algorithm 3, calculates partitions of data represented by a dissimilarity matrix R. In brief, svat-SL calculates a single-linkage partition of the svat-sampled data and then extends this partition to the entire data set. We will show that in certain cases the svat-SL c-partition is equivalent to the single-linkage c-partition of the VL data set.

Algorithm 3: svat-SL
  Data: R ∈ (R+)^(n x n), dissimilarity matrix
  Input: c', overestimate of the true number of clusters c; n_s, size of the approximating sample
  Result: R_s, the VAT-reordered R_S; Π, a c-partition of O
  1. Run svat on R, returning R_s, S, S', p, and d
  2. Choose the number of clusters c (e.g., by using the svat image)
  3. Find the indices t of the (c - 1) largest values in d, with t_1 < ... < t_{c-1}
  4. Form the aligned partition Π̃ of the reordered sample as {t_1 : t_2 - t_1 : ... : ñ_s - t_{c-1}}
  5. π_{s_{p_i}} = π̃_i, i = 1, ..., ñ_s, where s_j denotes the jth index in S'
  6. for each ŝ ∈ Ŝ = O - S' do
         j = arg min_{k ∈ S'} r_{ŝ,k}
         π_ŝ = π_j
In Step 1 of svat-SL, the svat algorithm is run on R, returning the VAT-reordered sample matrix R_s, the grouping sets S computed in svat Step 2, the sample indices S', the reordering vector p, and the magnitudes d of the MST links. At Step 2 of svat-SL, the user must choose the number of clusters c to seek. We recommend using the svat visualization of R_s to help choose the number of clusters; one could also use the biggest-jump criterion, sorting d in descending order and choosing c as the arg max of {d_c - d_{c+1}}. At Step 3 of svat-SL, the indices of the (c - 1) largest values of d are found and denoted as t; these indicate the links of the MST that are cut to find the c single-linkage clusters of R_s. Next, the aligned partition Π̃ is calculated using the indices t found in Step 3. At Step 5, we merely reorder the cluster indicator vector of the sample to match the index-ordering of the original objects O (i.e., we label the objects of O that are represented in S'). Finally, at Step 6 we label the remaining objects in O by giving each the label of the nearest object in S'.

The computational complexity of svat-SL, beyond the run of svat at Step 1, is as follows. Step 3 is O(ñ_s), as is the re-sorting of the labels at Step 5. The final step of svat-SL assigns the cluster indicators for the (n - ñ_s) objects that are not in the sample S'. Each iteration of the for loop requires O(ñ_s) computations, resulting in a final O(ñ_s n̂) computational complexity, where n̂ = |Ŝ|. The storage requirement is O(ñ_s n̂) for the submatrix of R indexed by Ŝ and S', used in Step 6 (although one could achieve fast computation with a smaller storage requirement by loading one row of R at a time in Step 6).
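
Given the vat and svat_sample sketches above, Steps 3-6 of Algorithm 3 might be rendered as follows (our illustration, using the (c - 1)-cut convention discussed above):

    import numpy as np

    def svat_sl_labels(R, S, p, d, c):
        # Steps 3-6 of svat-SL: cut the (c-1) largest MST edges of the
        # VAT-ordered sample, form the aligned c-partition, and extend it
        # to all n objects by the nearest-sampled-object rule.
        S = np.asarray(S)
        n = R.shape[0]
        # Steps 3-4: block boundaries in the VAT ordering of the sample.
        cuts = np.sort(np.argsort(d)[::-1][:c - 1]) + 1
        pi_aligned = np.zeros(len(S), dtype=int)
        for b, start in enumerate(cuts):
            pi_aligned[start:] = b + 1
        # Step 5: map sample labels back to the original object indices.
        labels = np.full(n, -1, dtype=int)
        labels[S[p]] = pi_aligned
        # Step 6: label each unsampled object via its nearest sampled object.
        rest = np.flatnonzero(labels < 0)
        nearest = S[np.argmin(R[np.ix_(rest, S)], axis=1)]
        labels[rest] = labels[nearest]
        return labels

    # Hypothetical end-to-end usage with the earlier sketches:
    #   S = svat_sample(R, c_prime=20, n_s=1000)
    #   R_s, p, d = vat(R[np.ix_(S, S)])
    #   labels = svat_sl_labels(R, S, p, d, c=3)
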

Proposition 2. Consider a data set O and a corresponding dissimilarity matrix R. Let Π be the c-partition indicator vector returned by svat-SL, and let Π* be the single-linkage c-partition indicator vector of O. If O, according to R, can be partitioned into c CS clusters, then Π = Π* whenever c ≤ c'.

Proof: First, it is well known that single linkage finds the c-partition corresponding to the c CS clusters of a data set; hence, Π* is the partition of O into its c CS clusters. Second, Proposition 1 says that the sampling procedure of svat selects at least one distinguished object from each of the c CS clusters; hence, the sampled data S' contain at least one object from each of the c CS clusters. Because of the property of CS clusters, the subset S' contains the same c CS clusters as O (albeit in sampled form). In [16], we showed that every single-linkage partition is aligned in the VAT-reordered dissimilarity matrix. Thus, Steps 3 and 4 of svat-SL find the single-linkage c-partition of the sampled data S', and the aligned partition produced at Step 4 is the partition of S' into its c CS clusters; it is the same as Π*, albeit sampled and reordered. At Step 6 of svat-SL, each object in O that is not in S' is labeled as being in the cluster of the nearest object in S', producing the partition Π of O. Hence, Π = Π*.

Remark 1. Proposition 2 states that if O contains c CS clusters, then svat-SL will find them, as long as c ≤ c'. If O does not contain c CS clusters, then svat-SL is not guaranteed to find the same partition as single linkage. However, we show in Section IV that svat-SL produces a good approximation of the preferred partition of O.

IV. EXPERIMENTS

We compare the clustering results of single linkage (SL) and svat-SL on two main data sets. The data set denoted 3 Clouds is a variably sized data set with 3 clouds of points drawn from 3 Gaussian distributions with parameters μ_1 = (5, 5), μ_2 = (0, 0), μ_3 = (10, 0), and Σ_1 = Σ_2 = Σ_3 = I_2. The sizes of the clouds are, respectively, 0.2n, 0.4n, and 0.4n. Figure 2(a) shows a plot of one draw of the 3 Clouds data for n = 25,000. The svat image of these data, with c' = 20 and n_s = 1,000, is shown in view (b). The svat image clearly shows 3 clusters; furthermore, it accurately shows the relative size of each cluster.

[Fig. 2. svat image of the 3 Clouds data, c' = 20, n_s = 1,000: (a) 3 Clouds (n = 25,000); (b) svat image.]
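
For reference, one draw of the 3 Clouds data with the stated parameters can be generated as follows (our sketch; the function name is ours):

    import numpy as np

    def draw_three_clouds(n, rng=None):
        # One instance of the 3 Clouds data: Gaussian clouds at (5,5),
        # (0,0), and (10,0) with identity covariance, sized 0.2n, 0.4n,
        # and 0.4n (rows are ordered by cloud).
        rng = np.random.default_rng(rng)
        means = np.array([[5.0, 5.0], [0.0, 0.0], [10.0, 0.0]])
        n1 = int(0.2 * n)
        n2 = int(0.4 * n)
        sizes = [n1, n2, n - n1 - n2]
        return np.vstack([rng.normal(mu, 1.0, size=(m, 2))
                          for mu, m in zip(means, sizes)])
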
The second data set we show results for is the Forest Cover data. (The Forest Cover data set can be downloaded at dstar/data/clusteringdata.html.) These data are composed of 54 cartographic features obtained from United States Geological Survey (USGS) and United States Forest Service (USFS) data. The features were collected from a total of 581,012 30-meter cells, each of which was assigned one of 7 forest cover types by the USFS by analyzing the 54 features. We normalize the features to the interval [0, 1].

We compared single linkage and svat-SL by calculating the purity and normalized mutual information (NMI) of the partitions produced by each algorithm, as well as the run time. Both indices are described in detail in [19]; in brief, an index value of 1 is perfect, while an index value of 0 indicates poor performance (in the case of NMI, a value of 0 indicates randomly chosen clusters). For the 3 Clouds data, we randomly drew 50 instances of the data set with size n = 25,000 and ran single linkage and svat-SL (with various parameters) on each instance. Table II reports the mean and standard deviation of purity and NMI over these 50 runs. Single linkage performs very poorly at finding the 3 clusters in these data, because single linkage can be fatally affected by a single outlier (which can easily occur with Gaussian-distributed data). In contrast, svat-SL produces very good clustering results for all parameter settings: both purity and NMI are > 0.75 and, for c' = 10 and n_s = 100, these indices are both > 0.9, a near-perfect result. These results convince us that svat-SL is producing the preferred clusters in the 3 Clouds data, while single-linkage performance is hampered by the outlier points. The Time column in Table II shows that svat-SL is also 2 orders of magnitude faster than single linkage.

For our second experiment, we ran svat-SL on the 3 Clouds data (n = 250,000) and the Forest Cover data. Each of these data sets was too large to be clustered by single linkage on our machine; one would need more than 1 TB of main memory to store the full n x n distance matrix. Table II shows that svat-SL is able to cluster both of these data sets in a very reasonable amount of time (see the Time columns). Furthermore, the svat-SL performance indices show that the algorithm produces very accurate clusters for the 3 Clouds data: with the parameters (c' = 5, n_s = 500), svat-SL produces clusters with indices > 0.9, a near-perfect solution. For the Forest Cover data, svat-SL returns clusters that achieve a purity of about 0.5 and an NMI of about 0.2; for comparison, NMI values on these data are also reported for the algorithms proposed in [20].

V. CONCLUSIONS AND FUTURE WORK

Our analysis in Section III shows that svat-SL is an approximation to single-linkage clustering for big data. Its computational complexity is significantly smaller than the O(n^2) complexity of the full hierarchical algorithm, and experimental run-time measurements show that the algorithm is fast at both producing a cluster visualization and partitioning the data. Furthermore, we showed that the svat-SL c-partition is equivalent to the single-linkage c-partition for data that have c compact-separated clusters.

A weakness of svat-SL is that the partition is based on the single-linkage criterion, which has the well-known drawback that the partition can be ruined by outliers. For data with many outliers, the outliers have to be partitioned (usually into their own clusters) before the preferred partition can be found via the single-link criterion. However, svat-SL provides visual evidence of how large the clusters should be; hence, if the sizes of the single-link clusters do not match the visual evidence, the user can disregard the partition (perhaps choosing a different clustering algorithm to partition the sample, or discarding data from small clusters).

In the future, we will examine the sampling scheme suggested by the k-means++ initialization technique [21], which draws objects according to a distance-weighted probability distribution. The drawback of the k-means++ method is that it does not absolutely ensure that an object is drawn from each cluster in the case of CS clusters.
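
For concreteness, a k-means++ style of seeding [21] applied to relational data would look roughly like the following sketch (ours; it is not part of svat-SL as proposed here, and it assumes at least k distinct objects):

    import numpy as np

    def dsq_sample(R, k, rng=None):
        # k-means++-style seeding over dissimilarity data: after a random
        # first pick, each subsequent object is drawn with probability
        # proportional to its squared dissimilarity to the nearest pick.
        rng = np.random.default_rng(rng)
        n = R.shape[0]
        picks = [int(rng.integers(n))]
        d = np.array(R[picks[0]], dtype=float)
        for _ in range(1, k):
            prob = d**2 / np.sum(d**2)  # the distance-weighted distribution
            picks.append(int(rng.choice(n, p=prob)))
            d = np.minimum(d, R[picks[-1]])
        return np.array(picks)
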

TABLE II
CLUSTERING RESULTS*

[Table II reports purity, NMI, and run time (secs) for SL and svat-SL, as mean ± standard deviation, on 3 Clouds (n = 25,000; five settings of c' and n_s, with n_s ∈ {100, 500, 1000}), 3 Clouds (n = 250,000; two settings), and Forest Cover (n = 581,012; two settings with n_s = 500). SL entries are unavailable for the two data sets it could not load.]

*Mean and standard deviation over 50 independent trials. Bold indicates the superior algorithm of SL and svat-SL by a 2-sided t-test.

REFERENCES

[1] A. Jain and R. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[2] R. Xu and D. Wunsch II, Clustering. Piscataway, NJ: IEEE Press.
[3] J. C. Bezdek and R. J. Hathaway, "VAT: A tool for visual assessment of (cluster) tendency," in Proc. IJCNN, Honolulu, HI, 2002.
[4] P. Huber, "Massive data sets workshop: The morning after," in Massive Data Sets. National Academy Press, 1997.
[5] R. Hathaway and J. Bezdek, "Extending fuzzy and probabilistic clustering to very large data sets," Computational Statistics and Data Analysis, vol. 51, 2006.
[6] T. C. Havens, J. C. Bezdek, C. Leckie, L. O. Hall, and M. Palaniswami, "Fuzzy c-means algorithms for very large data," IEEE Trans. Fuzzy Systems, 2012.
[7] N. Pal and J. Bezdek, "Complexity reduction for large image processing," IEEE Trans. Systems, Man, and Cybernetics B, vol. 32, no. 5, 2002.
[8] P. Hore, L. Hall, and D. Goldgof, "Single pass fuzzy c means," in Proc. IEEE Int. Conf. Fuzzy Systems, London, England, 2007.
[9] R. Chitta, R. Jin, T. Havens, and A. Jain, "Approximate kernel k-means: Solution to large scale kernel clustering," in Proc. ACM SIGKDD Conf. Knowledge Discovery and Data Mining, 2011.
[10] T. Havens, R. Chitta, A. Jain, and R. Jin, "Speedup of fuzzy and possibilistic c-means for large-scale clustering," in Proc. IEEE Int. Conf. Fuzzy Systems, Taipei, Taiwan, 2011.
[11] R. J. Hathaway, J. C. Bezdek, and J. M. Huband, "Scalable visual assessment of cluster tendency for large data sets," Pattern Recognition, vol. 39, no. 7, July 2006.
[12] W. Petrie, "Sequences in prehistoric remains," J. Anthropological Inst. Great Britain and Ireland, vol. 29, 1899.
[13] J. Czekanowski, "Zur Differentialdiagnose der Neandertalgruppe," Korrespondenzblatt der Deutschen Gesellschaft für Anthropologie, Ethnologie und Urgeschichte, vol. 40, 1909.
[14] L. Wilkinson and M. Friendly, "The history of the cluster heat map," The American Statistician, vol. 63, no. 2, 2009.
[15] R. C. Prim, "Shortest connection networks and some generalizations," Bell System Technical Journal, vol. 36, 1957.
[16] T. C. Havens, J. C. Bezdek, J. M. Keller, M. Popescu, and J. M. Huband, "Is VAT really single linkage in disguise?" Ann. Math. Artif. Intell., vol. 55, no. 3-4, 2009.
[17] T. C. Havens, J. C. Bezdek, J. M. Keller, and M. Popescu, "Dunn's cluster validity index as a contrast measure of VAT images," in Proc. ICPR, Tampa, FL, December 2008.
[18] J. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters," J. of Cybernetics, vol. 3, no. 3, 1973.
[19] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
[20] R. Chitta, R. Jin, and A. K. Jain, "Efficient kernel clustering using random Fourier features," in Int. Conf. Data Mining, 2012.
[21] D. Arthur and S. Vassilvitskii, "k-means++: The advantages of careful seeding," in Proc. SODA, 2007.


More information

L15: statistical clustering

L15: statistical clustering Similarity measures Criterion functions Cluster validity Flat clustering algorithms k-means ISODATA L15: statistical clustering Hierarchical clustering algorithms Divisive Agglomerative CSCE 666 Pattern

More information

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2 Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data

More information

Social Media Mining. Graph Essentials

Social Media Mining. Graph Essentials Graph Essentials Graph Basics Measures Graph and Essentials Metrics 2 2 Nodes and Edges A network is a graph nodes, actors, or vertices (plural of vertex) Connections, edges or ties Edge Node Measures

More information

Distributed Dynamic Load Balancing for Iterative-Stencil Applications

Distributed Dynamic Load Balancing for Iterative-Stencil Applications Distributed Dynamic Load Balancing for Iterative-Stencil Applications G. Dethier 1, P. Marchot 2 and P.A. de Marneffe 1 1 EECS Department, University of Liege, Belgium 2 Chemical Engineering Department,

More information

SoSe 2014: M-TANI: Big Data Analytics

SoSe 2014: M-TANI: Big Data Analytics SoSe 2014: M-TANI: Big Data Analytics Lecture 4 21/05/2014 Sead Izberovic Dr. Nikolaos Korfiatis Agenda Recap from the previous session Clustering Introduction Distance mesures Hierarchical Clustering

More information

Bisecting K-Means for Clustering Web Log data

Bisecting K-Means for Clustering Web Log data Bisecting K-Means for Clustering Web Log data Ruchika R. Patil Department of Computer Technology YCCE Nagpur, India Amreen Khan Department of Computer Technology YCCE Nagpur, India ABSTRACT Web usage mining

More information

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014 Beyond Watson: Predictive Analytics and Big Data U.S. National Security Agency Research Directorate - R6 Technical Report February 3, 2014 300 years before Watson there was Euler! The first (Jeopardy!)

More information

Improved Fuzzy C-means Clustering Algorithm Based on Cluster Density

Improved Fuzzy C-means Clustering Algorithm Based on Cluster Density Journal of Computational Information Systems 8: 2 (2012) 727 737 Available at http://www.jofcis.com Improved Fuzzy C-means Clustering Algorithm Based on Cluster Density Xiaojun LOU, Junying LI, Haitao

More information

Chapter 4: Non-Parametric Classification

Chapter 4: Non-Parametric Classification Chapter 4: Non-Parametric Classification Introduction Density Estimation Parzen Windows Kn-Nearest Neighbor Density Estimation K-Nearest Neighbor (KNN) Decision Rule Gaussian Mixture Model A weighted combination

More information

Data a systematic approach

Data a systematic approach Pattern Discovery on Australian Medical Claims Data a systematic approach Ah Chung Tsoi Senior Member, IEEE, Shu Zhang, Markus Hagenbuchner Member, IEEE Abstract The national health insurance system in

More information

Data, Measurements, Features

Data, Measurements, Features Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

More information

SEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA

SEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA SEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA J.RAVI RAJESH PG Scholar Rajalakshmi engineering college Thandalam, Chennai. ravirajesh.j.2013.mecse@rajalakshmi.edu.in Mrs.

More information

Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach

Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach Outline Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach Jinfeng Yi, Rong Jin, Anil K. Jain, Shaili Jain 2012 Presented By : KHALID ALKOBAYER Crowdsourcing and Crowdclustering

More information