Scalable Single Linkage Hierarchical Clustering For Big Data


Timothy C. Havens¹, James C. Bezdek², Marimuthu Palaniswami²
¹Electrical and Computer Engineering and Computer Science Departments, Michigan Technological University, Houghton, MI, USA
²Department of Electrical and Electronic Engineering, University of Melbourne, Parkville, VIC, Australia

Abstract: Personal computing technologies are everywhere; hence, there is an abundance of staggeringly large data sets: the Library of Congress has stored over 160 terabytes of web data, and it is estimated that Facebook alone logs nearly a petabyte of data per day. Thus, there is a pertinent need for systems by which one can elucidate the similarity and dissimilarity among and between groups in these big data sets. Clustering is one way to find these groups. In this paper, we extend the scalable Visual Assessment of Tendency (sVAT) algorithm to return single-linkage partitions of big data sets. The sVAT algorithm is designed to provide visual evidence of the number of clusters in unloadable (big) data sets. The extension we describe enables sVAT to also efficiently return the data partition indicated by the visual evidence. The computational complexity and storage requirements of sVAT are (usually) significantly less than the O(n^2) requirements of the classic single-linkage hierarchical algorithm. We show that sVAT is a scalable instantiation of single-linkage clustering for data sets that contain c compact-separated clusters, where c ≪ n and n is the number of objects. For data sets that do not contain compact-separated clusters, we show that sVAT produces a good approximation of single-linkage partitions. Experimental results are presented for both synthetic and real data sets.

I.
INTRODUCTION

Clustering, or cluster analysis, is a form of exploratory data analysis in which data are separated into groups or subsets such that the objects in each group share more similarity to each other than to the objects in other groups. Clustering has been used for many purposes, and there are many good books that describe its various uses [1, 2]. The most popular use for clustering is to assign labels to unlabeled data: data for which no preexisting grouping is known. Assigning labels, or partitioning, is not the only operation in clustering. There is also cluster tendency, which asks the question, "Are there clusters and, if so, how many?", and cluster validity, which asks, "Are the clusters I found any good?" In this paper we will use the scalable version of the popular visual assessment of tendency (VAT) [3] algorithm to both assess tendency and partition the data.

Big data is a reality of modern computing: social networking and mobile computing alone account for terabytes (TB) to petabytes of logged data per day. Hence, there is a great need for algorithms that can address these big data concerns. In 1996, Huber [4] classified data set sizes as in Table I (Huber also defined tiny as 10^2 and small as 10^4). Bezdek and Hathaway [5] added the Very Large (VL) category to this table in 2006. In recent years, these data have been collectively called Big Data.

TABLE I
HUBER'S NOMINAL DATA SET SIZES [4]

  size:  medium | large | huge  | monster | VL
  bytes: 10^6   | 10^8  | 10^10 | 10^12   | >10^12

(the monster and VL categories are collectively "Big Data")

Interestingly, monster and VL data with 10^12 objects are still unloadable on most current (circa 2012) computers. For example, a data set representing 10^12 objects, each with 10 features, stored in short integer (4 byte) format would require 40 TB of storage (most high-performance computing platforms have <1 TB of main memory). Hence, we believe that Table I will continue to be pertinent for many years: Big Data will always be a concern, as data sets will continue to grow.
There are two main approaches to clustering in big data: distributed clustering based on various incremental styles, and clustering a sample found by progressive or random sampling (or some hybrid scheme). Each has been applied in the context of just about any clustering algorithm, including k-means, fuzzy c-means, and kernelized variants [5-10]. Both approaches provide useful ways to accomplish two objectives: acceleration for loadable data, and approximation for unloadable data.

Now consider a set of n objects O = {o_1, ..., o_n}, e.g., hurricanes, wireless sensor network nodes, B-list actors, or members of a social network. Each object is typically represented by numerical feature-vector data of the form X = {x_1, ..., x_n} ⊂ R^p, where the coordinates of x_i provide feature values (e.g., wind speed, sensor voltage, age, number of blog posts, etc.) describing object o_i. An alternative form of data is relational data, where only the relationship between pairs of objects is known. This type of data is especially prevalent in document analysis and bioinformatics. Relational data typically consist of the n^2 values of a dissimilarity matrix R, where r_ij = d(o_i, o_j) is the pairwise dissimilarity (or distance) between objects o_i and o_j. For instance, numerical data X
can always be converted to R by R = [r_ij] = [‖x_i − x_j‖] (any vector norm on R^p). There are, however, similarity and dissimilarity relational data that do not begin as feature-vector data; for these, there is no choice but to use a relational algorithm. Hence, relational data represent the most general form of input data.

A crisp partition of the objects is defined as a set of n entries of an indicator vector Π = (π_1, ..., π_n), where π_i is the cluster label of object o_i; e.g., π_i = 2 indicates that o_i is in cluster 2. Other types of partitions exist, such as fuzzy and probabilistic, but we limit our discussion here to crisp partitions: each object belongs wholly and unequivocally to exactly one cluster.

In this paper, we propose an extension to the scalable VAT (sVAT) algorithm [11] that produces crisp c-partitions of big data. Our algorithm first uses sVAT to sample the data and reorder the sample. We then use a property of VAT reordering that allows us to efficiently compute single-linkage partitions of the reordered data sample. Finally, we extend the partition to the entire data set using the simple nearest-object rule. Section II describes the sVAT algorithm, followed by our proposed sVAT-SL algorithm in Section III. We demonstrate sVAT-SL on synthetic and real data in Section IV. Finally, Section V summarizes and proposes future research directions. We now discuss the VAT algorithm and related research.

A. Matrix Reordering For Tendency Assessment and Partitioning

Matrix reordering has a long history of use for clustering data. Petrie, in 1899, illustrated the value of reordering objects by visually clustering 917 pieces of prehistoric Egyptian pottery recovered from about 4,000 overlapping tombs [12]. In 1909, Czekanowski proposed the first method for clustering in dissimilarity data using a visual approach [13].
Czekanowski reordered, by hand, a dissimilarity matrix that represented the average difference between various skulls, showing that the skulls roughly clustered into two distinct groups. These pieces of history and other important milestones in the use of heat maps (aka reordered dissimilarity images) to visualize clusters are described in detail in [14].

The matrix reordering method that we focus on in this paper is VAT, outlined in Algorithm 1. The VAT algorithm displays an image of reordered (and scaled for imaging) dissimilarity data [3]. Each pixel of the image of the VAT-reordered matrix R* displays the scaled dissimilarity value between two objects. White pixels represent high dissimilarity, while black represents low dissimilarity.² A dark block along the diagonal of the image of R* is a submatrix of similarly small dissimilarity values; hence, dark blocks may represent clusters of objects that are relatively similar to each other. Thus, cluster tendency, or the number of clusters, may be shown by the number of dark blocks along the diagonal of the VAT image. Figure 1 shows the VAT visualization for a data set that has a preferred cluster count of c = 3. It is clear that the VAT image in view (c) shows 3 dark blocks and, thus, a tendency of the data to have 3 clusters.

[Fig. 1. Example of VAT visualization showing a cluster tendency of c = 3: (a) data set X; (b) Euclidean distance matrix R; (c) VAT image of R*.]

²Note that each object is exactly similar to itself, resulting in a zero-valued diagonal of R and R*. The diagonal thus gives us no information on how the data should cluster. Hence, most VAT users will scale the pixel values of the off-diagonal elements of R* to best use the quantized range of the imaging method. An easy way to do this in MATLAB is to set the diagonal elements of R* to the minimum of the off-diagonal elements and then use the MATLAB function imagesc to display the image.
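The scaling trick described in the footnote is not MATLAB-specific. The following sketch (ours, not from the paper; the function name vat_display_image is our own) does the same in NumPy: it hides the uninformative zero diagonal by setting it to the minimum off-diagonal value, then rescales to [0, 1] so black maps to the most similar pairs and white to the most dissimilar.

```python
import numpy as np

def vat_display_image(R):
    """Scale a (reordered) dissimilarity matrix for display.

    The zero diagonal carries no cluster information, so it is set to
    the minimum off-diagonal value before rescaling to [0, 1].
    """
    D = np.array(R, dtype=float)
    off = ~np.eye(D.shape[0], dtype=bool)   # mask of off-diagonal entries
    np.fill_diagonal(D, D[off].min())       # hide the uninformative diagonal
    D -= D.min()                            # rescale: black = most similar,
    if D.max() > 0:                         # white = most dissimilar
        D /= D.max()
    return D
```

The returned array can then be shown with any grayscale image routine, e.g. matplotlib's imshow with cmap="gray".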
Algorithm 1: VAT
  Data: R ∈ (R^+)^{n×n}, dissimilarity matrix
  Result: R* (VAT-reordered R); p (VAT reordering indices of R); d (MST cut magnitude vector)
  Set K = {1, ..., n} and I = J = ∅
  Select (i, j) ∈ arg max_{k∈K, q∈K} r_kq
  p_1 = i; I = {i}; and J = K \ {i}
  for t = 2, ..., n do
    Select (i, j) ∈ arg min_{k∈I, q∈J} r_kq
    p_t = j; I = I ∪ {j}; J = J \ {j}
    d_{t−1} = r_ij
  Obtain the reordered dissimilarity matrix R* using the ordering array p as r*_kq = r_{p_k, p_q}, 1 ≤ k, q ≤ n.

Algorithm 1 shows that VAT is based on (but not identical to) Prim's algorithm [15] for finding the minimum spanning tree (MST) of a weighted undirected graph. In [16], we showed that this property can be used to prove that all single-linkage partitions appear as aligned partitions of the VAT-reordered objects O. Aligned c-partitions of O have c contiguous blocks of cluster labels in Π, starting with 1 and ending with c. For example, Π = (1, 1, 1, 2, 2, 2, 3, 3) is an
aligned partition of 8 objects, while Π = (1, 1, 1, 2, 2, 3, 2, 3) is not. The special nature of aligned partitions enables us to specify them in an alternative form. Every member of the set of all aligned c-partitions for a set of n objects is isomorphic to the unique set of c positive integers (which are the cardinalities of the c clusters in Π) that satisfy {n_i : 1 ≤ n_i, 1 ≤ i ≤ c, Σ_{i=1}^{c} n_i = n}; so, aligned partitions are completely specified by {n_1 : ... : n_c}. For example, Π = (1, 1, 1, 2, 2, 2, 3, 3) = {3 : 3 : 2}.

Because single-linkage clusters are always aligned in the VAT-reordered data, to find them we merely have to cut the largest (c − 1) edges in the MST and form the corresponding aligned partition. For example, if we wish to compute the single-linkage 4-partition, we find the 3 largest values of the vector d, storing the index values as i_(1), i_(2), and i_(3) (where i_(1) < i_(2) < i_(3)), and form the aligned partition {i_(1) : i_(2) − i_(1) : i_(3) − i_(2) : n − i_(3)}. We will use this method to form single-linkage partitions of the sVAT sample, which provides an approximation of the single-linkage partition for big data.

Another property of the VAT image is that the contrast and presence of dark blocks on the diagonal is related to Dunn's index [17]. This index is a metric of how well a set of clusters represents compact-separated (CS) clusters. For a set of objects O with corresponding relational dissimilarity data R, we say that a partition Π = {π_1, ..., π_c} of O is CS relative to R if each of the possible intra-cluster distances is strictly less than each of the possible inter-cluster distances. We state this by saying that O can be partitioned into c CS clusters.
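The two ideas above can be illustrated with a minimal Python sketch (ours, not the authors' code; the names vat and single_linkage_from_vat are our own): a direct, Prim-style rendering of Algorithm 1, followed by cutting the (c − 1) largest MST links to form the aligned single-linkage c-partition.

```python
import numpy as np

def vat(R):
    """Prim-style VAT reordering of a dissimilarity matrix (Algorithm 1).

    Returns the reordered matrix R*, the ordering p, and the vector d of
    MST link magnitudes (d[t-1] is the link that added the t-th object).
    As written this is O(n^3); a min-distance array gives O(n^2).
    """
    R = np.asarray(R, dtype=float)
    n = R.shape[0]
    # Start from one endpoint of the largest dissimilarity.
    i, _ = np.unravel_index(np.argmax(R), R.shape)
    p = [i]
    J = set(range(n)) - {i}
    d = []
    for _ in range(1, n):
        # Smallest link from the ordered set to the remainder (Prim's rule).
        k, q = min(((k, q) for k in p for q in J), key=lambda kq: R[kq])
        p.append(q)
        J.remove(q)
        d.append(R[k, q])
    p = np.array(p)
    return R[np.ix_(p, p)], p, np.array(d)

def single_linkage_from_vat(p, d, c):
    """Cut the (c-1) largest MST links; return single-linkage labels
    (1, ..., c) in the ORIGINAL object order."""
    if c <= 1:
        return np.ones(len(p), dtype=int)
    cuts = np.sort(np.argsort(d)[-(c - 1):])   # cut positions in the ordering
    labels_in_order = np.zeros(len(p), dtype=int)
    for cut in cuts:                           # objects past a cut start a new block
        labels_in_order[cut + 1:] += 1
    labels = np.empty(len(p), dtype=int)
    labels[p] = labels_in_order + 1            # map reordered positions back
    return labels
```

On data with CS clusters the cut positions are exactly the aligned-partition boundaries {i_(1) : i_(2) − i_(1) : ...} described above.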
For a given relational matrix R and partition Π = {π_1, ..., π_c}, Dunn's index is defined as

  α(c, Π) = [ min_{1≤k≤c} min_{1≤q≤c, q≠k} dist(π_k, π_q) ] / [ max_{1≤k≤c} diam(π_k) ],   (1)

where π_k is the kth cluster, dist(π_k, π_q) is the distance between two clusters, and diam(π_k) is the cluster diameter [18]. The distance and diameter functions are

  dist(π_k, π_q) = min_{i∈π_k, j∈π_q} r_ij,   (2)
  diam(π_k) = max_{i∈π_k, j∈π_k} r_ij.   (3)

The relative validity of clusters found at different values of c or by different clustering algorithms can be compared by examining the respective values of (1) for each partition.

Definition 1. Clusters that have a Dunn's index α(c, Π) > 1 are compact-separated (CS) clusters [18].

Later, we will show a relationship between CS clusters and the proposed sVAT-SL algorithm. First, we describe the sVAT algorithm in detail.

II. SCALABLE VAT

The sVAT algorithm is a scalable solution for visualizing the number of clusters in a big data set. It first draws a representative sample of the data and then produces a visualization that shows the number of clusters as the number of dark blocks on the diagonal of a reordered dissimilarity image. Algorithm 2 outlines the steps.

Algorithm 2: sVAT
  Data: R ∈ (R^+)^{n×n}, dissimilarity matrix
  Input: c′, overestimate of the true number of clusters c; n_s, size of approximating sample
  Result: R_s* (VAT-reordered R_S); S = {S_1, ..., S_c′}, grouping sets; s, indices of samples in R_S; p, VAT reordering indices of sample R_S; d, MST cut vector
  1) Select the indices M of the c′ distinguished objects.
       m_1 = 1
       d = (d_1, ..., d_n) = (r_11, ..., r_1n)
       for t = 2, ..., c′ do
         d = (min{d_1, r_{m_{t−1},1}}, ..., min{d_n, r_{m_{t−1},n}})
         m_t = arg max_{1≤j≤n} {d_j}
  2) Group the objects in O = {o_1, ..., o_n} with their nearest distinguished object.
       S_1 = S_2 = ... = S_c′ = ∅
       for t = 1, ..., n do
         k = arg min_{1≤j≤c′} {r_{m_j,t}}
         S_k = S_k ∪ {t}
  3) Select some data for R_S near each of the distinguished objects.
       n_t = ⌈n_s |S_t| / n⌉, t = 1, ..., c′
       Draw n_t random indices s_t from S_t without replacement, t = 1, ..., c′.
       s = ∪_{t=1}^{c′} s_t
       Form R_S, the square submatrix of R indexed by s in both rows and columns.
  4) Apply VAT to R_S, returning R_s* and p (optionally d).

The sVAT algorithm essentially performs the following four steps. In Step 1, sVAT defines a set of c′ distinguished objects that hopefully represents the clustering structure of all the objects. In other words, the c′ distinguished objects are prototypical indices for the clusters in the big data. At Step 2, the objects in O are partitioned using the nearest-prototype rule, where the prototypes are the c′ distinguished object indices. Step 3 draws a random subset of O to produce a well-represented set of (approximate) size n_s. Finally, in Step 4, VAT is applied to the approximately n_s × n_s submatrix R_S.

The computational complexity of sVAT Steps 1 and 2 is O(c′n), with storage requirements of O(c′n) (for fast execution). Drawing the indices of R_S in Step 3 requires O(c′n) computations and storage of ñ_s entries, where ñ_s = |s| (note that ñ_s is usually only slightly greater than the chosen
sample size n_s). The final run of VAT on R_S is O(ñ_s^2) in computation and storage complexity. If the data begin as vectors X ⊂ R^p, then Step 1 requires O(pc′n) operations to compute the necessary elements of R, and Step 3 requires O(pñ_s^2) operations to compute the elements of R_S. It can be shown that ñ_s ≤ n_s + c′; hence, the overall storage requirement of sVAT is O(max{c′n, (n_s + c′)^2}). The overall computational complexity is O(max{c′n, (n_s + c′)^2}) if the data begin as relational data R, and O(max{pc′n, pñ_s^2, (n_s + c′)^2}) if the data begin as vectors X. In [11], it was stated that these complexities show that sVAT scales linearly with n. However, we argue that because c′, ñ_s, and n_s are some fraction of n, sVAT does not asymptotically scale linearly with n. Nevertheless, if c′ ≪ n and n_s ≪ n, then these orders of complexity show a significant reduction in both storage and computational cost.

The sVAT algorithm has two desirable properties for data sets that contain CS clusters. Proposition 1 from [11] describes the first property and is important in our later analysis of sVAT-SL. The second property is as follows: for an object set O that contains c CS clusters, the proportion of objects in the sVAT sample s (drawn in Step 3 of Algorithm 2) from the ith cluster equals the proportion of objects in O from the ith cluster. In other words, each CS cluster's size will appear correctly in the sVAT image (large clusters will appear as large dark blocks, while small clusters will appear as small dark blocks).

Proposition 1. [11] Consider a set of objects O that can be partitioned into c CS clusters, and let c′ ≥ c. Then Step 1 of sVAT selects at least one distinguished object from each CS cluster.

Proof: See [11].

We now propose an extension to sVAT, called sVAT-SL, that returns a c-partition, c ≤ c′, of O. For the case where there exist c CS clusters in O, we show that the sVAT-SL partition is exactly the single-linkage c-partition of O.

III.
SVAT-SL ALGORITHM

The sVAT-SL algorithm, proposed in Algorithm 3, calculates partitions of data represented by a dissimilarity matrix R. In brief, sVAT-SL calculates a single-linkage partition of the sVAT-sampled data and then extends this partition to the entire data set. We will show that in certain cases the sVAT-SL c-partition is equivalent to the single-linkage c-partition of the VL data set.

In Step 1 of sVAT-SL, the sVAT algorithm is run on R, returning the VAT-reordered sample matrix R_s*, the grouping sets S computed in sVAT Step 2, the indices of the sample s, the reordering vector p, and the magnitudes of the MST links d. At Step 2 of sVAT-SL, the user must choose the number of clusters c to seek. We recommend using the sVAT visualization of R_s* to help choose the number of clusters. One could also use the biggest-jump criterion by sorting d in descending order and choosing c as the arg max of {d_c − d_{c+1}}. At Step 3 of sVAT-SL, the indices of the (c − 1) largest values of d are found and denoted as t. These indicate the links of the MST that are cut to find the c single-linkage clusters of R_s*. Next, the aligned partition Π* is calculated using the indices t found in Step 3. At Step 5, we merely reorder the cluster indicator vector Π* of the sample to match the index ordering of the original objects O (i.e., we label the samples of O that are represented in s). Finally, at Step 6 we label the remaining objects in O by giving them the label of the nearest respective object in s.

Algorithm 3: sVAT-SL
  Data: R ∈ (R^+)^{n×n}, dissimilarity matrix
  Input: c′, overestimate of the true number of clusters c; n_s, size of approximating sample
  Result: R_s* (VAT-reordered R_S); Π, c-partition of O
  1) Run sVAT on R, returning R_s*, S, s, p, and d
  2) Choose the number of clusters c (e.g., by using the sVAT image)
  3) Find the indices t of the (c − 1) largest values in d
  4) Form the aligned partition Π* as {t_1 : t_2 − t_1 : ... : ñ_s − t_{c−1}}
  5) π_{s_{p_i}} = π*_i, i = 1, ..., ñ_s
  6) for each ŝ ∈ Ŝ = {1, ..., n} \ s do
       j = arg min_{k∈s} r_{ŝk}
       π_ŝ = π_j
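The nearest-object extension in Step 6 can be sketched as follows for feature-vector data (our own illustrative helper, not the authors' implementation; the name extend_labels is ours). It computes one row of distances at a time, so the full n × n matrix never needs to be stored.

```python
import numpy as np

def extend_labels(X, sample_idx, sample_labels):
    """Label out-of-sample objects with the label of the nearest sampled
    object (Step 6 of sVAT-SL), one object at a time to limit storage."""
    n = len(X)
    labels = np.zeros(n, dtype=int)
    labels[sample_idx] = sample_labels      # sampled objects keep their labels
    in_sample = np.zeros(n, dtype=bool)
    in_sample[sample_idx] = True
    S = X[sample_idx]                       # the sampled feature vectors
    for i in np.flatnonzero(~in_sample):
        # distances from object i to every sampled object (one row of R)
        dists = np.linalg.norm(S - X[i], axis=1)
        labels[i] = sample_labels[np.argmin(dists)]
    return labels
```

For purely relational data the inner line would instead read one row of R from disk, which matches the low-storage variant discussed below.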
The computational complexity of sVAT-SL, beyond the run of sVAT at Step 1, is as follows. Step 3 is O(ñ_s), as is the re-sorting of Π* at Step 5. The final step of sVAT-SL assigns the cluster indicators for the (n − ñ_s) objects that are not in the sample s. Each iteration of the for loop requires O(ñ_s) computations, thus resulting in a final O(ñ_s n̂) computational complexity, where n̂ = |Ŝ|. The storage requirement is O(ñ_s n̂) for the submatrix of R with rows indexed by Ŝ and columns indexed by s, used in Step 6 (although one could certainly achieve fast computation and a smaller storage requirement by loading one row of R at a time in Step 6).

Proposition 2. Consider a data set O and a corresponding dissimilarity matrix R. Let Π be the c-partition indicator vector returned by sVAT-SL and Π′ be the single-linkage c-partition indicator vector of O. If O, according to R, can be partitioned into c CS clusters, then Π = Π′ if c ≤ c′.

Proof: First, it is well known that single linkage finds the c-partition corresponding to the c CS clusters of a data set. Hence, Π′ is the partition of O into its c CS clusters. Second, Proposition 1 says that the sampling procedure of sVAT selects at least one distinguished object from each cluster of the c CS clusters. Hence, the sampled data s contain at least one object from each of the c CS clusters. Because of the property of CS clusters, the subset s contains the same c CS clusters as O (albeit in a sampled form). In [16], we showed that every single-linkage partition is aligned in the VAT-reordered dissimilarity matrix. Thus, Steps 3 and 4 of sVAT-SL find the single-linkage c-partition of the sampled data s. Thus, Π*, produced at Step 4 of sVAT-SL,
is the partition of s into its c CS clusters. The partition Π* is thus the same as Π′, albeit sampled and reordered. At Step 6 of sVAT-SL, each object in O that is not in s is labeled as being in the cluster of the nearest object in s, producing the partition Π of O. Hence, the partition Π = Π′.

Remark 1. Proposition 2 states that if O contains c CS clusters then sVAT-SL will find them, as long as c ≤ c′. If O does not contain c CS clusters, then sVAT-SL is not guaranteed to find the same partition as single linkage. However, we show in Section IV that sVAT-SL produces a good approximation of the preferred partition of O.

IV. EXPERIMENTS

We compare the clustering results of single linkage and sVAT-SL on two main data sets. The data set denoted as 3 Clouds is a variably sized data set with 3 clouds of data drawn from 3 Gaussian distributions with the following parameters: µ_1 = (5, 5), µ_2 = (0, 0), µ_3 = (10, 0), Σ_1 = Σ_2 = Σ_3 = I_2. The sizes of the clouds, respectively, are 0.2n, 0.4n, and 0.4n. Figure 2 shows a plot of a draw of the 3 Clouds data for n = 25,000. The sVAT image of the data in Fig. 2(a), with c′ = 20 and n_s = 1,000, is shown in view (b). The sVAT image clearly shows 3 clusters. Furthermore, it shows the relative size of each cluster accurately.

[Fig. 2. sVAT image of 3 Clouds data, c′ = 20, n_s = 1,000: (a) 3 Clouds (n = 25,000); (b) sVAT image.]

The second data set we show results for is the Forest Cover data.³ These data are composed of 54 cartographic features obtained from United States Geological Survey (USGS) and United States Forest Service (USFS) data. These features were collected from a total of 581,012 30 × 30 meter cells, each of which was then determined to be one of 7 forest cover types by the USFS by analyzing the 54 features. We normalize the features to the interval [0, 1]. We compared single linkage and sVAT-SL by calculating the purity and normalized mutual information (NMI) of partitions calculated with each algorithm, and also the runtime.
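As a reminder of how these two external indices are computed, here is a minimal sketch of our own (not the authors' evaluation code); [19] gives the full definitions, and the NMI below uses the geometric-mean normalization, one of several common variants.

```python
import numpy as np
from collections import Counter

def purity(labels, truth):
    """Fraction of objects that fall in the majority true class of their cluster."""
    n = len(labels)
    total = 0
    for c in set(labels):
        members = [truth[i] for i in range(n) if labels[i] == c]
        total += Counter(members).most_common(1)[0][1]  # size of majority class
    return total / n

def nmi(labels, truth):
    """Normalized mutual information between a clustering and the true classes."""
    labels, truth = np.asarray(labels), np.asarray(truth)
    n = len(labels)

    def entropy(a):
        _, counts = np.unique(a, return_counts=True)
        p = counts / n
        return -(p * np.log(p)).sum()

    I = 0.0  # mutual information from the joint label/class distribution
    for c in np.unique(labels):
        for k in np.unique(truth):
            p_ck = np.mean((labels == c) & (truth == k))
            if p_ck > 0:
                I += p_ck * np.log(p_ck / (np.mean(labels == c) * np.mean(truth == k)))
    denom = np.sqrt(entropy(labels) * entropy(truth))
    return I / denom if denom > 0 else 0.0
```

A perfect clustering scores 1 on both indices; a clustering independent of the true classes scores an NMI of 0.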
Each of these indices is described in detail in [19]. In brief, an index value of 1 is perfect, while an index value of 0 indicates poor performance (in the case of NMI, a value of 0 indicates randomly chosen clusters). For the 3 Clouds data, we randomly drew 50 instances of the data set with size n = 25,000 and ran single linkage and sVAT-SL (with various parameters) on each instance. Table II outlines the mean and standard deviation of purity and NMI for these 50 runs. Single linkage performs very poorly at finding the 3 clusters in these data. This is because single linkage can be fatally affected by one outlier (which can easily happen with Gaussian-distributed data). In contrast, sVAT-SL produces very good clustering results for all parameter settings: both purity and NMI are > 0.75, and for c′ = 10 and n_s = 100 these indices are both > 0.9, a near-perfect result. These results convince us that sVAT-SL is producing the preferred clusters in the 3 Clouds data, while single-linkage performance is hampered by the outlier data points. The Time column in Table II shows that sVAT-SL is also 2 orders of magnitude faster than single linkage.

For our second experiment, we ran sVAT-SL on the 3 Clouds data (n = 250,000) and the Forest Cover data. Each of these data sets was too large to be clustered by single linkage on our machine; one would need > 1 TB of main memory to store the full n × n distance matrix. Table II shows that sVAT-SL is able to cluster both of these data sets in a very reasonable amount of time, on the order of seconds (see Table II). Furthermore, the sVAT-SL performance indices show that the algorithm produces very accurate clusters for the 3 Clouds data. With the parameters (c′ = 5, n_s = 500), sVAT-SL produces clusters with indices > 0.9: a near-perfect solution.

³The Forest Cover data set can be downloaded at dstar/data/clusteringdata.html.
For the Forest Cover data, sVAT-SL returns clusters that achieve a purity of about 0.5 and an NMI of about 0.2. To compare, the algorithms proposed in [20] achieve an NMI of …. Our proposed sVAT-SL algorithm is able to cluster these data in about … seconds.

V. CONCLUSIONS AND FUTURE WORK

Our analysis in Section III shows that sVAT-SL is an approximation to single-linkage clustering for big data. Its computational complexity is significantly smaller than the O(n^2) complexity of the full hierarchical algorithm. Experimental runtime measurements also show that the algorithm is fast both at producing a cluster visualization and at partitioning the data. Furthermore, we showed that the sVAT-SL c-partition is equivalent to the single-linkage c-partition for data that have c compact-separated clusters.

A weakness of sVAT-SL is that the partition is based on the single-linkage criterion, which has the well-known drawback that the partition can be ruined by outliers. For data with many outliers, the outliers have to be partitioned (usually into their own clusters) before the preferred partition can be found via the single-link criterion. However, sVAT-SL provides visual evidence as to how large the clusters should be; hence, if the sizes of the single-link clusters do not match well with the visual evidence, then the user can disregard the partition (perhaps choosing a different clustering algorithm to partition the sample, or throwing out data from small clusters).

In the future, we will examine the sampling scheme suggested by the k-means++ initialization technique [21], which
draws objects according to a distance-weighted probability distribution. The drawback of the k-means++ method is that it does not absolutely ensure that an object is drawn from each cluster in the case of CS clusters.

TABLE II
CLUSTERING RESULTS*

[The numeric entries of Table II were lost in transcription. The table reports the mean and standard deviation of Purity, NMI, and Time (secs) for SL and sVAT-SL on: 3 Clouds (n = 25,000) at several (c′, n_s) settings with n_s ∈ {100, 500, 1000}; 3 Clouds (n = 250,000); and Forest Cover (n = 581,012). SL entries for the two largest data sets are marked "?" because SL could not be run on them.]

*Mean and standard deviation over 50 independent trials. Bold indicates the superior algorithm of SL and sVAT-SL by a 2-sided t-test.

REFERENCES

[1] A. Jain and R. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall, 1988.
[2] R. Xu and D. Wunsch II, Clustering. Piscataway, NJ: IEEE Press.
[3] J. C. Bezdek and R. J. Hathaway, "VAT: A tool for visual assessment of (cluster) tendency," in Proc. IJCNN, Honolulu, HI, 2002.
[4] P. Huber, Massive Data Sets. National Academy Press, 1997, ch. "Massive Data Sets Workshop: The Morning After."
[5] R. Hathaway and J. Bezdek, "Extending fuzzy and probabilistic clustering to very large data sets," Computational Statistics and Data Analysis, vol. 51, 2006.
[6] T. C. Havens, J. C. Bezdek, C. Leckie, L. O. Hall, and M. Palaniswami, "Fuzzy c-means algorithms for very large data," IEEE Trans. Fuzzy Systems.
[7] N. Pal and J. Bezdek, "Complexity reduction for 'large image' processing," IEEE Trans. Systems, Man, and Cybernetics, Part B, vol. 32.
[8] P. Hore, L. Hall, and D. Goldgof, "Single pass fuzzy c means," in Proc. IEEE Int. Conf. Fuzzy Systems, London, England, 2007.
[9] R. Chitta, R. Jin, T. Havens, and A. Jain, "Approximate kernel k-means: Solution to large scale kernel clustering," in Proc. ACM SIGKDD Conf.
Knowledge Discovery and Data Mining, 2011.
[10] T. Havens, R. Chitta, A. Jain, and R. Jin, "Speedup of fuzzy and possibilistic c-means for large-scale clustering," in Proc. IEEE Int. Conf. Fuzzy Systems, Taipei, Taiwan.
[11] R. J. Hathaway, J. C. Bezdek, and J. M. Huband, "Scalable visual assessment of cluster tendency for large data sets," Pattern Recognition, vol. 39, no. 7, July 2006.
[12] W. Petrie, "Sequences in prehistoric remains," J. Anthropological Inst. Great Britain and Ireland, vol. 29, 1899.
[13] J. Czekanowski, "Zur Differentialdiagnose der Neandertalgruppe," Korrespondenzblatt der Deutschen Gesellschaft für Anthropologie, Ethnologie und Urgeschichte, vol. 40, 1909.
[14] L. Wilkinson and M. Friendly, "The history of the cluster heat map," The American Statistician, vol. 63, no. 2, 2009.
[15] R. C. Prim, "Shortest connection networks and some generalizations," Bell System Technical Journal, vol. 36, 1957.
[16] T. C. Havens, J. C. Bezdek, J. M. Keller, M. Popescu, and J. M. Huband, "Is VAT really single linkage in disguise?" Ann. Math. Artif. Intell., vol. 55, no. 3-4, 2009.
[17] T. C. Havens, J. C. Bezdek, J. M. Keller, and M. Popescu, "Dunn's cluster validity index as a contrast measure of VAT images," in Proc. ICPR, Tampa, FL, December 2008.
[18] J. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters," J. of Cybernetics, vol. 3, no. 3, 1973.
[19] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
[20] R. Chitta, R. Jin, and A. K. Jain, "Efficient kernel clustering using random Fourier features," in Int. Conf. Data Mining, 2012.
[21] D. Arthur and S. Vassilvitskii, "k-means++: The advantages of careful seeding," in Proc. SODA, 2007.
More informationClustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University September 19, 2012
Clustering Big Data Anil K. Jain (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University September 19, 2012 EMart No. of items sold per day = 139x2000x20 = ~6 million
More informationClustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012
Clustering Big Data Anil K. Jain (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012 Outline Big Data How to extract information? Data clustering
More informationInformation Retrieval and Web Search Engines
Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 10 th, 2013 WolfTilo Balke and Kinda El Maarry Institut für Informationssysteme Technische Universität Braunschweig
More informationNonnegative Matrix Factorization (NMF) in Semisupervised Learning Reducing Dimension and Maintaining Meaning
Nonnegative Matrix Factorization (NMF) in Semisupervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step
More informationSOME CLUSTERING ALGORITHMS TO ENHANCE THE PERFORMANCE OF THE NETWORK INTRUSION DETECTION SYSTEM
SOME CLUSTERING ALGORITHMS TO ENHANCE THE PERFORMANCE OF THE NETWORK INTRUSION DETECTION SYSTEM Mrutyunjaya Panda, 2 Manas Ranjan Patra Department of E&TC Engineering, GIET, Gunupur, India 2 Department
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining /8/ What is Cluster
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Clustering Algorithms Kmeans and its variants Hierarchical clustering
More informationCategorical Data Visualization and Clustering Using Subjective Factors
Categorical Data Visualization and Clustering Using Subjective Factors ChiaHui Chang and ZhiKai Ding Department of Computer Science and Information Engineering, National Central University, ChungLi,
More information10810 /02710 Computational Genomics. Clustering expression data
10810 /02710 Computational Genomics Clustering expression data What is Clustering? Organizing data into clusters such that there is high intracluster similarity low intercluster similarity Informally,
More information. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns
Outline Part 1: of data clustering NonSupervised Learning and Clustering : Problem formulation cluster analysis : Taxonomies of Clustering Techniques : Data types and Proximity Measures : Difficulties
More informationNeural Networks Lesson 5  Cluster Analysis
Neural Networks Lesson 5  Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt.  Sapienza University of Rome http://ispac.ing.uniroma1.it/scarpiniti/index.htm michele.scarpiniti@uniroma1.it Rome, 29
More informationClustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016
Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with
More informationSocial Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
More informationSPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING
AAS 07228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations
More informationScalable Parallel Clustering for Data Mining on Multicomputers
Scalable Parallel Clustering for Data Mining on Multicomputers D. Foti, D. Lipari, C. Pizzuti and D. Talia ISICNR c/o DEIS, UNICAL 87036 Rende (CS), Italy {pizzuti,talia}@si.deis.unical.it Abstract. This
More informationUNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS
UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS Dwijesh C. Mishra I.A.S.R.I., Library Avenue, New Delhi110 012 dcmishra@iasri.res.in What is Learning? "Learning denotes changes in a system that enable
More informationUSING THE AGGLOMERATIVE METHOD OF HIERARCHICAL CLUSTERING AS A DATA MINING TOOL IN CAPITAL MARKET 1. Vera Marinova Boncheva
382 [7] Reznik, A, Kussul, N., Sokolov, A.: Identification of user activity using neural networks. Cybernetics and computer techniques, vol. 123 (1999) 70 79. (in Russian) [8] Kussul, N., et al. : MultiAgent
More informationMethod of Data Center Classifications
Method of Data Center Classifications Krisztián Kósi Óbuda University, Bécsi út 96/B, H1034 Budapest, Hungary kosi.krisztian@phd.uniobuda.hu Abstract: This paper is about the Classification of big data
More informationPrototypeless Fuzzy Clustering
Prototypeless Fuzzy Clustering Christian Borgelt Abstract In contrast to standard fuzzy clustering, which optimizes a set of prototypes, one for each cluster, this paper studies fuzzy clustering without
More informationClustering. Adrian Groza. Department of Computer Science Technical University of ClujNapoca
Clustering Adrian Groza Department of Computer Science Technical University of ClujNapoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 Kmeans 3 Hierarchical Clustering What is Datamining?
More informationEnvironmental Remote Sensing GEOG 2021
Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class
More informationFuzzy Clustering Technique for Numerical and Categorical dataset
Fuzzy Clustering Technique for Numerical and Categorical dataset Revati Raman Dewangan, Lokesh Kumar Sharma, Ajaya Kumar Akasapu Dept. of Computer Science and Engg., CSVTU Bhilai(CG), Rungta College of
More informationCLASSIFYING SERVICES USING A BINARY VECTOR CLUSTERING ALGORITHM: PRELIMINARY RESULTS
CLASSIFYING SERVICES USING A BINARY VECTOR CLUSTERING ALGORITHM: PRELIMINARY RESULTS Venkat Venkateswaran Department of Engineering and Science Rensselaer Polytechnic Institute 275 Windsor Street Hartford,
More informationIMPROVISATION OF STUDYING COMPUTER BY CLUSTER STRATEGIES
INTERNATIONAL JOURNAL OF ADVANCED RESEARCH IN ENGINEERING AND SCIENCE IMPROVISATION OF STUDYING COMPUTER BY CLUSTER STRATEGIES C.Priyanka 1, T.Giri Babu 2 1 M.Tech Student, Dept of CSE, Malla Reddy Engineering
More informationData Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland
Data Mining and Knowledge Discovery in Databases (KDD) State of the Art Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland 1 Conference overview 1. Overview of KDD and data mining 2. Data
More informationLargeScale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 59565963 Available at http://www.jofcis.com LargeScale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
More informationUse of Data Mining Techniques to Improve the Effectiveness of Sales and Marketing
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 4, April 2015,
More informationRoleVAT: Visual Assessment of Practical Need for Role Based Access Control
RoleVAT: Visual Assessment of Practical Need for Role Based Access Control Dana Zhang The University of Melbourne zhangd@csse.unimelb.edu.au Kotagiri Ramamohanarao The University of Melbourne rao@csse.unimelb.edu.au
More informationLecture 20: Clustering
Lecture 20: Clustering Wrapup of neural nets (from last lecture Introduction to unsupervised learning Kmeans clustering COMP424, Lecture 20  April 3, 2013 1 Unsupervised learning In supervised learning,
More informationImage Estimation Algorithm for Out of Focus and Blur Images to Retrieve the Barcode Value
IJSTE  International Journal of Science Technology & Engineering Volume 1 Issue 10 April 2015 ISSN (online): 2349784X Image Estimation Algorithm for Out of Focus and Blur Images to Retrieve the Barcode
More informationA Study of Web Log Analysis Using Clustering Techniques
A Study of Web Log Analysis Using Clustering Techniques Hemanshu Rana 1, Mayank Patel 2 Assistant Professor, Dept of CSE, M.G Institute of Technical Education, Gujarat India 1 Assistant Professor, Dept
More informationA Novel Fuzzy Clustering Method for Outlier Detection in Data Mining
A Novel Fuzzy Clustering Method for Outlier Detection in Data Mining Binu Thomas and Rau G 2, Research Scholar, Mahatma Gandhi University,Kerala, India. binumarian@rediffmail.com 2 SCMS School of Technology
More informationDoptimal plans in observational studies
Doptimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational
More informationMedical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu
Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?
More informationRanking on Data Manifolds
Ranking on Data Manifolds Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet, and Bernhard Schölkopf Max Planck Institute for Biological Cybernetics, 72076 Tuebingen, Germany {firstname.secondname
More informationDiscuss the size of the instance for the minimum spanning tree problem.
3.1 Algorithm complexity The algorithms A, B are given. The former has complexity O(n 2 ), the latter O(2 n ), where n is the size of the instance. Let n A 0 be the size of the largest instance that can
More informationIEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 7, JULY 2009 1181
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 7, JULY 2009 1181 The Global Kernel kmeans Algorithm for Clustering in Feature Space Grigorios F. Tzortzis and Aristidis C. Likas, Senior Member, IEEE
More informationAn Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015
An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content
More informationClustering and Data Mining in R
Clustering and Data Mining in R Workshop Supplement Thomas Girke December 10, 2011 Introduction Data Preprocessing Data Transformations Distance Methods Cluster Linkage Hierarchical Clustering Approaches
More informationLoad balancing in a heterogeneous computer system by selforganizing Kohonen network
Bull. Nov. Comp. Center, Comp. Science, 25 (2006), 69 74 c 2006 NCC Publisher Load balancing in a heterogeneous computer system by selforganizing Kohonen network Mikhail S. Tarkov, Yakov S. Bezrukov Abstract.
More informationMining SocialNetwork Graphs
342 Chapter 10 Mining SocialNetwork Graphs There is much information to be gained by analyzing the largescale data that is derived from social networks. The bestknown example of a social network is
More informationData Mining. Cluster Analysis: Advanced Concepts and Algorithms
Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototypebased clustering Densitybased clustering Graphbased
More informationOffline sorting buffers on Line
Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com
More informationClustering. 15381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is
Clustering 15381 Artificial Intelligence Henry Lin Modified from excellent slides of Eamonn Keogh, Ziv BarJoseph, and Andrew Moore What is Clustering? Organizing data into clusters such that there is
More informationClient Based Power Iteration Clustering Algorithm to Reduce Dimensionality in Big Data
Client Based Power Iteration Clustering Algorithm to Reduce Dimensionalit in Big Data Jaalatchum. D 1, Thambidurai. P 1, Department of CSE, PKIET, Karaikal, India Abstract  Clustering is a group of objects
More informationAn OrderInvariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis]
An OrderInvariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis] Stephan Spiegel and Sahin Albayrak DAILab, Technische Universität Berlin, ErnstReuterPlatz 7,
More informationOPTIMAL DESIGN OF DISTRIBUTED SENSOR NETWORKS FOR FIELD RECONSTRUCTION
OPTIMAL DESIGN OF DISTRIBUTED SENSOR NETWORKS FOR FIELD RECONSTRUCTION Sérgio Pequito, Stephen Kruzick, Soummya Kar, José M. F. Moura, A. Pedro Aguiar Department of Electrical and Computer Engineering
More informationClustering. Data Mining. Abraham Otero. Data Mining. Agenda
Clustering 1/46 Agenda Introduction Distance Knearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in
More informationChapter ML:XI (continued)
Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis DensityBased Cluster Analysis Cluster Evaluation Constrained
More informationCost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation:
CSE341T 08/31/2015 Lecture 3 Cost Model: Work, Span and Parallelism In this lecture, we will look at how one analyze a parallel program written using Cilk Plus. When we analyze the cost of an algorithm
More informationCluster Algorithms. Adriano Cruz adriano@nce.ufrj.br. 28 de outubro de 2013
Cluster Algorithms Adriano Cruz adriano@nce.ufrj.br 28 de outubro de 2013 Adriano Cruz adriano@nce.ufrj.br () Cluster Algorithms 28 de outubro de 2013 1 / 80 Summary 1 KMeans Adriano Cruz adriano@nce.ufrj.br
More informationA Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment
A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment Panagiotis D. Michailidis and Konstantinos G. Margaritis Parallel and Distributed
More informationA Comparison of General Approaches to Multiprocessor Scheduling
A Comparison of General Approaches to Multiprocessor Scheduling JingChiou Liou AT&T Laboratories Middletown, NJ 0778, USA jing@jolt.mt.att.com Michael A. Palis Department of Computer Science Rutgers University
More informationThe Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
More informationA Novel Density based improved kmeans Clustering Algorithm Dbkmeans
A Novel Density based improved kmeans Clustering Algorithm Dbkmeans K. Mumtaz 1 and Dr. K. Duraiswamy 2, 1 Vivekanandha Institute of Information and Management Studies, Tiruchengode, India 2 KS Rangasamy
More informationHadoop SNS. renren.com. Saturday, December 3, 11
Hadoop SNS renren.com Saturday, December 3, 11 2.2 190 40 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December
More informationComplexity Reduction for Large Image Processing
598 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS PART B: CYBERNETICS, VOL. 32, NO. 5, OCTOBER 2002 Complexity Reduction for Large Image Processing Nikhil R. Pal, Senior Member, IEEE, and James C.
More informationUnsupervised learning: Clustering
Unsupervised learning: Clustering Salissou Moutari Centre for Statistical Science and Operational Research CenSSOR 17 th September 2013 Unsupervised learning: Clustering 1/52 Outline 1 Introduction What
More informationKMeans Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1
KMeans Cluster Analsis Chapter 3 PPDM Class Tan,Steinbach, Kumar Introduction to Data Mining 4/18/4 1 What is Cluster Analsis? Finding groups of objects such that the objects in a group will be similar
More informationUsing Data Mining for Mobile Communication Clustering and Characterization
Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer
More informationClustering UE 141 Spring 2013
Clustering UE 141 Spring 013 Jing Gao SUNY Buffalo 1 Definition of Clustering Finding groups of obects such that the obects in a group will be similar (or related) to one another and different from (or
More informationClustering Very Large Data Sets with Principal Direction Divisive Partitioning
Clustering Very Large Data Sets with Principal Direction Divisive Partitioning David Littau 1 and Daniel Boley 2 1 University of Minnesota, Minneapolis MN 55455 littau@cs.umn.edu 2 University of Minnesota,
More informationChapter 7. Cluster Analysis
Chapter 7. Cluster Analysis. What is Cluster Analysis?. A Categorization of Major Clustering Methods. Partitioning Methods. Hierarchical Methods 5. DensityBased Methods 6. GridBased Methods 7. ModelBased
More informationB490 Mining the Big Data. 2 Clustering
B490 Mining the Big Data 2 Clustering Qin Zhang 11 Motivations Group together similar documents/webpages/images/people/proteins/products One of the most important problems in machine learning, pattern
More informationApproximation Algorithms
Approximation Algorithms or: How I Learned to Stop Worrying and Deal with NPCompleteness Ong Jit Sheng, Jonathan (A0073924B) March, 2012 Overview Key Results (I) General techniques: Greedy algorithms
More informationDISTRIBUTED ANOMALY DETECTION IN WIRELESS SENSOR NETWORKS
DISTRIBUTED ANOMALY DETECTION IN WIRELESS SENSOR NETWORKS Sutharshan Rajasegarar 1, Christopher Leckie 2, Marimuthu Palaniswami 1 ARC Special Research Center for UltraBroadband Information Networks 1
More informationForschungskolleg Data Analytics Methods and Techniques
Forschungskolleg Data Analytics Methods and Techniques Martin Hahmann, Gunnar Schröder, Phillip Grosse Prof. Dr.Ing. Wolfgang Lehner Why do we need it? We are drowning in data, but starving for knowledge!
More informationAlgorithmic Aspects of Big Data. Nikhil Bansal (TU Eindhoven)
Algorithmic Aspects of Big Data Nikhil Bansal (TU Eindhoven) Algorithm design Algorithm: Set of steps to solve a problem (by a computer) Studied since 1950 s. Given a problem: Find (i) best solution (ii)
More informationClustering & Visualization
Chapter 5 Clustering & Visualization Clustering in highdimensional databases is an important problem and there are a number of different clustering paradigms which are applicable to highdimensional data.
More informationL15: statistical clustering
Similarity measures Criterion functions Cluster validity Flat clustering algorithms kmeans ISODATA L15: statistical clustering Hierarchical clustering algorithms Divisive Agglomerative CSCE 666 Pattern
More informationResearch on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2
Advanced Engineering Forum Vols. 67 (2012) pp 8287 Online: 20120926 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.67.82 Research on Clustering Analysis of Big Data
More informationSocial Media Mining. Graph Essentials
Graph Essentials Graph Basics Measures Graph and Essentials Metrics 2 2 Nodes and Edges A network is a graph nodes, actors, or vertices (plural of vertex) Connections, edges or ties Edge Node Measures
More informationDistributed Dynamic Load Balancing for IterativeStencil Applications
Distributed Dynamic Load Balancing for IterativeStencil Applications G. Dethier 1, P. Marchot 2 and P.A. de Marneffe 1 1 EECS Department, University of Liege, Belgium 2 Chemical Engineering Department,
More informationSoSe 2014: MTANI: Big Data Analytics
SoSe 2014: MTANI: Big Data Analytics Lecture 4 21/05/2014 Sead Izberovic Dr. Nikolaos Korfiatis Agenda Recap from the previous session Clustering Introduction Distance mesures Hierarchical Clustering
More informationBisecting KMeans for Clustering Web Log data
Bisecting KMeans for Clustering Web Log data Ruchika R. Patil Department of Computer Technology YCCE Nagpur, India Amreen Khan Department of Computer Technology YCCE Nagpur, India ABSTRACT Web usage mining
More informationAsking Hard Graph Questions. Paul Burkhardt. February 3, 2014
Beyond Watson: Predictive Analytics and Big Data U.S. National Security Agency Research Directorate  R6 Technical Report February 3, 2014 300 years before Watson there was Euler! The first (Jeopardy!)
More informationImproved Fuzzy Cmeans Clustering Algorithm Based on Cluster Density
Journal of Computational Information Systems 8: 2 (2012) 727 737 Available at http://www.jofcis.com Improved Fuzzy Cmeans Clustering Algorithm Based on Cluster Density Xiaojun LOU, Junying LI, Haitao
More informationChapter 4: NonParametric Classification
Chapter 4: NonParametric Classification Introduction Density Estimation Parzen Windows KnNearest Neighbor Density Estimation KNearest Neighbor (KNN) Decision Rule Gaussian Mixture Model A weighted combination
More informationData a systematic approach
Pattern Discovery on Australian Medical Claims Data a systematic approach Ah Chung Tsoi Senior Member, IEEE, Shu Zhang, Markus Hagenbuchner Member, IEEE Abstract The national health insurance system in
More informationData, Measurements, Features
Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are
More informationSEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA
SEMANTIC WEB BASED INFERENCE MODEL FOR LARGE SCALE ONTOLOGIES FROM BIG DATA J.RAVI RAJESH PG Scholar Rajalakshmi engineering college Thandalam, Chennai. ravirajesh.j.2013.mecse@rajalakshmi.edu.in Mrs.
More informationCrowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach
Outline Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach Jinfeng Yi, Rong Jin, Anil K. Jain, Shaili Jain 2012 Presented By : KHALID ALKOBAYER Crowdsourcing and Crowdclustering
More information