Scalable Single Linkage Hierarchical Clustering For Big Data


Scalable Single Linkage Hierarchical Clustering For Big Data

Timothy C. Havens 1, James C. Bezdek 2, Marimuthu Palaniswami 2
1 Electrical and Computer Engineering and Computer Science Departments, Michigan Technological University, Houghton, MI, USA
2 Department of Electrical and Electronic Engineering, University of Melbourne, Parkville, VIC, Australia

Abstract — Personal computing technologies are everywhere; hence, there is an abundance of staggeringly large data sets: the Library of Congress has stored over 160 terabytes of web data, and it is estimated that Facebook alone logs nearly a petabyte of data per day. Thus, there is a pertinent need for systems with which one can elucidate the similarity and dissimilarity among and between groups in these big data sets. Clustering is one way to find such groups. In this paper, we extend the scalable Visual Assessment of Tendency (sVAT) algorithm to return single-linkage partitions of big data sets. The sVAT algorithm is designed to provide visual evidence of the number of clusters in unloadable (big) data sets. The extension we describe enables sVAT to then also efficiently return the data partition indicated by the visual evidence. The computational complexity and storage requirements of sVAT are (usually) significantly less than the O(n^2) requirement of the classic single-linkage hierarchical algorithm. We show that sVAT is a scalable instantiation of single-linkage clustering for data sets that contain c compact-separated clusters, where c << n and n is the number of objects. For data sets that do not contain compact-separated clusters, we show that sVAT produces a good approximation of single-linkage partitions. Experimental results are presented for both synthetic and real data sets.

I.
INTRODUCTION

Clustering, or cluster analysis, is a form of exploratory data analysis in which data are separated into groups or subsets such that the objects in each group share more similarity to each other than to the objects in other groups. Clustering has been used for many purposes, and there are many good books that describe its various uses [1, 2]. The most popular use of clustering is to assign labels to unlabeled data, i.e., data for which no preexisting grouping is known. Assigning labels, or partitioning, is not the only operation in clustering. There is also cluster tendency, which asks the question, "Are there clusters and, if so, how many?", and cluster validity, which asks, "Are the clusters I found any good?" In this paper we use the scalable version of the popular visual assessment of tendency (VAT) [3] algorithm to both assess tendency and partition the data.

Big data is a reality of modern computing: social networking and mobile computing alone account for terabytes (TB) to petabytes of logged data per day. Hence, there is a great need for algorithms that can address these big data concerns. In 1996, Huber [4] classified data set sizes as in Table I.^1

TABLE I
HUBER'S NOMINAL DATA SET SIZES [4]

    bytes   10^6     10^8    10^10   10^12     >10^12
    size    medium   large   huge    monster   VL / Big Data

Bezdek and Hathaway [5] added the Very Large (VL) category to this table. In recent years, these data have been collectively called Big Data. Interestingly, monster and VL data with 10^12 objects are still unloadable on most current (circa 2012) computers. For example, a data set representing 10^12 objects, each with 10 features, stored in short integer (4 byte) format would require 40 TB of storage (most high-performance computing platforms have <1 TB of main memory). Hence, we believe that Table I will continue to be pertinent for many years: Big Data will always be a concern, as data sets will continue to grow.
There are two main approaches to clustering in big data: distributed clustering based on various incremental styles, and clustering a sample found by progressive or random sampling (or some hybrid scheme). Each has been applied in the context of just about every clustering algorithm, including k-means, fuzzy c-means, and kernelized variants [5-10]. Both approaches provide useful ways to accomplish two objectives: acceleration for loadable data, and approximation for unloadable data.

Now consider a set of n objects O = {o_1, ..., o_n}, e.g., hurricanes, wireless sensor network nodes, B-list actors, or members of a social network. Each object is typically represented by numerical feature-vector data of the form X = {x_1, ..., x_n} in R^p, where the coordinates of x_i provide feature values (e.g., wind speed, sensor voltage, age, number of blog posts, etc.) describing object o_i. An alternative form of data is relational data, where only the relationship between pairs of objects is known. This type of data is especially prevalent in document analysis and bioinformatics. Relational data typically consist of the n^2 values of a dissimilarity matrix R, where r_ij = d(o_i, o_j) is the pairwise dissimilarity (or distance) between objects o_i and o_j. For instance, numerical data X

^1 Huber also defined tiny as 10^2 and small as 10^4.
can always be converted to R by R = [r_ij] = [||x_i - x_j||] (any vector norm on R^p). There are, however, similarity and dissimilarity relational data that do not begin as feature-vector data; for these, there is no choice but to use a relational algorithm. Hence, relational data represent the most general form of input data.

A crisp partition of the objects is defined by the n entries of an indicator vector Π = (π_1, ..., π_n), where π_i is the cluster label of object o_i; e.g., π_i = 2 indicates that o_i is in cluster 2. Other types of partitions exist, such as fuzzy and probabilistic, but we limit our discussion here to crisp partitions: each object belongs wholly and unequivocally to exactly one cluster.

In this paper, we propose an extension to the scalable VAT (sVAT) algorithm [11] that produces crisp c-partitions of big data. Our algorithm first uses sVAT to sample the data and reorder the sample. We then use a property of VAT reordering that allows us to efficiently compute single-linkage partitions of the reordered data sample. Finally, we extend the partition to the entire data set using the simple nearest-object rule. Section II describes the sVAT algorithm, followed by our proposed sVAT-SL algorithm in Section III. We demonstrate sVAT-SL on synthetic and real data in Section IV. Finally, Section V summarizes and proposes future research directions. We now discuss the VAT algorithm and related research.

A. Matrix Reordering For Tendency Assessment and Partitioning

Matrix reordering has a long history of use for clustering data. Petrie, in 1899, illustrated the value of reordering objects by visually clustering 917 pieces of prehistoric Egyptian pottery recovered from about 4,000 overlapping tombs [12]. In 1909, Czekanowski proposed the first method for clustering dissimilarity data using a visual approach [13].
Czekanowski reordered, by hand, a dissimilarity matrix that represented the average difference among various skulls, showing that the skulls roughly clustered into two distinct groups. These pieces of history, and other important milestones in the use of heat maps (aka reordered dissimilarity images) to visualize clusters, are described in detail in [14].

The matrix reordering method that we focus on in this paper is VAT, outlined in Algorithm 1. The VAT algorithm displays an image of reordered (and scaled for imaging) dissimilarity data [3]. Each pixel of the image of the VAT-reordered matrix R* displays the scaled dissimilarity value between two objects. White pixels represent high dissimilarity, while black represents low dissimilarity.^2 A dark block along the diagonal of the image of R* is a submatrix of similarly small dissimilarity values; hence, dark blocks may represent clusters of objects that are relatively similar to each other. Thus, cluster tendency, or the number of clusters, may be shown by the number of dark blocks along the diagonal of the VAT image. Figure 1 shows the VAT visualization for a data set that has a preferred cluster count of c = 3. It is clear that the VAT image in view (c) shows 3 dark blocks and, thus, a tendency of the data to have 3 clusters.

^2 Note that each object is exactly similar to itself, resulting in a zero-valued diagonal of R and R*. The diagonal thus gives us no information on how the data should cluster. Hence, most VAT users will scale the pixel values of the off-diagonal elements of R* to best use the quantized range of the imaging method. An easy way to do this in MATLAB is to set the diagonal elements of R* to the minimum of the off-diagonal elements and then use the MATLAB function imagesc to display the image.

Fig. 1. Example of VAT visualization showing a cluster tendency of c = 3: (a) data set X; (b) Euclidean distance matrix R; (c) VAT image of R*.
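The display scaling described in the footnote can be sketched in NumPy. This is an illustrative helper of our own (not code from the paper): the uninformative zero diagonal is replaced by the minimum off-diagonal value, and the result is rescaled to [0, 1] so the full gray range is spent on the informative entries.

```python
import numpy as np

def scale_for_display(R):
    """Rescale a dissimilarity matrix for imaging: replace the zero
    diagonal with the minimum off-diagonal value, then map the result
    linearly to [0, 1] (0 = black, 1 = white)."""
    D = np.asarray(R, dtype=float).copy()
    n = D.shape[0]
    off = ~np.eye(n, dtype=bool)          # mask of off-diagonal entries
    np.fill_diagonal(D, D[off].min())     # diagonal carries no information
    lo, hi = D.min(), D.max()
    return (D - lo) / (hi - lo) if hi > lo else np.zeros_like(D)
```

The scaled matrix can then be handed to any gray-scale image display (e.g., matplotlib's imshow with a gray colormap).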
Algorithm 1: VAT
Data: R in (R+)^{n x n} — dissimilarity matrix
Result: R* — VAT-reordered R; p — VAT reordering indices of R; d — MST cut magnitude vector
    Set K = {1, ..., n} and I = J = {}
    Select (i, j) in arg max_{k in K, q in K} r_kq
    p_1 = i; I = {i}; J = K \ {i}
    for t = 2, ..., n do
        Select (i, j) in arg min_{k in I, q in J} r_kq
        p_t = j; I = I + {j}; J = J \ {j}
        d_{t-1} = r_ij
    Obtain the reordered dissimilarity matrix R* using the ordering array p as r*_kq = r_{p_k, p_q}, 1 <= k, q <= n.

Algorithm 1 shows that VAT is based on (but not identical to) Prim's algorithm [15] for finding the minimum spanning tree (MST) of a weighted undirected graph. In [16], we showed that this property can be used to prove that all single-linkage partitions appear as aligned partitions in the VAT-reordered objects O. Aligned c-partitions of O have c contiguous blocks of cluster labels in Π, starting with 1 and ending with c. For example, Π = (1, 1, 1, 2, 2, 2, 3, 3) is an
aligned partition of 8 objects, while Π = (1, 1, 1, 2, 2, 3, 2, 3) is not. The special nature of aligned partitions enables us to specify them in an alternative form. Every member of the set of all aligned c-partitions of n objects is isomorphic to the unique c-tuple of integers (the cardinalities of the c clusters in Π) that satisfy {n_i >= 1; 1 <= i <= c; sum_{i=1}^{c} n_i = n}; so, aligned partitions are completely specified by {n_1 : ... : n_c}. For example, Π = (1, 1, 1, 2, 2, 2, 3, 3) = {3 : 3 : 2}.

Because single-linkage clusters are always aligned in the VAT-reordered data, to find them we merely have to cut the largest (c - 1) edges in the MST and form the corresponding aligned partition. For example, if we wish to compute the single-linkage 4-partition, we find the 3 largest values of the vector d, store the index values as i_(1), i_(2), and i_(3) (where i_(1) < i_(2) < i_(3)), and form the aligned partition {i_(1) : i_(2) - i_(1) : i_(3) - i_(2) : n - i_(3)}. We will use this method to form single-linkage partitions of the sVAT sample, which provides an approximation of the single-linkage partition for big data.

Another property of the VAT image is that the contrast and presence of dark blocks on the diagonal is related to Dunn's index [17]. This index is a measure of how well a set of clusters represents compact-separated (CS) clusters. For a set of objects O with corresponding relational dissimilarity data R, we say that a partition Π = {π_1, ..., π_c} of O is CS relative to R if each of the possible intra-cluster distances is strictly less than each of the possible inter-cluster distances. We state this by saying that O can be partitioned into c CS clusters.
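To make Algorithm 1 and the MST-cut procedure concrete, here is a small NumPy sketch (illustrative code of our own, not the authors' implementation): vat() follows Algorithm 1 directly, and single_linkage_cut() forms the aligned c-partition by cutting the (c - 1) largest values of d.

```python
import numpy as np

def vat(R):
    """VAT reordering (Algorithm 1).  Returns the reordered matrix
    R_star, the reordering indices p, and the MST cut-magnitude
    vector d (n - 1 entries)."""
    R = np.asarray(R, dtype=float)
    n = R.shape[0]
    # Start from one endpoint of the largest dissimilarity.
    i, _ = np.unravel_index(np.argmax(R), R.shape)
    p, J, d = [i], set(range(n)) - {i}, []
    for _ in range(1, n):
        # Prim-style step: cheapest edge between the visited set and the rest.
        r_kq, j = min((R[k, q], q) for k in p for q in J)
        p.append(j)
        J.remove(j)
        d.append(r_kq)
    p = np.array(p)
    return R[np.ix_(p, p)], p, np.array(d)

def single_linkage_cut(d, c):
    """Cut the (c - 1) largest MST edges and return the aligned
    c-partition labels (1..c) of the VAT-reordered objects."""
    # Boundaries (positions in the reordered sequence) of the cuts.
    cuts = np.sort(np.argsort(d)[-(c - 1):]) + 1
    labels = np.ones(len(d) + 1, dtype=int)
    for k, start in enumerate(cuts, start=2):
        labels[start:] = k
    return labels
```

For a 1-D toy data set with three well-separated groups of sizes 3, 3, and 2, cutting the two largest MST edges reproduces the aligned partition {3 : 3 : 2} discussed above.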
For a given relational matrix R and partition Π, Dunn's index is defined as

    α(c, Π) = min_{1<=k<=c} { min_{1<=q<=c, q!=k} dist(π_k, π_q) } / max_{1<=k<=c} diam(π_k),   (1)

where π_k is the kth cluster, dist(π_k, π_q) is the distance between two clusters, and diam(π_k) is the cluster diameter [18]. The distance and diameter functions are

    dist(π_k, π_q) = min_{i in π_k, j in π_q} r_ij,   (2)
    diam(π_k) = max_{i in π_k, j in π_k} r_ij.   (3)

The relative validity of clusters found at different values of c or by different clustering algorithms can be compared by examining the respective values of (1) for each partition.

Definition 1. Clusters that have a Dunn's index α(c, Π) > 1 are compact-separated (CS) clusters [18].

Later, we will show a relationship between CS clusters and the proposed sVAT-SL algorithm. First, we describe the sVAT algorithm in detail.

II. SCALABLE VAT

The sVAT algorithm is a scalable solution for visualizing the number of clusters in a big data set. It first draws a representative sample of the data and then produces a visualization that shows the number of clusters as the number of dark blocks on the diagonal of a reordered dissimilarity image. Algorithm 2 outlines the steps.

Algorithm 2: sVAT
Data: R in (R+)^{n x n} — dissimilarity matrix
Input: c' — overestimate of the true number of clusters c; n_s — size of approximating sample
Result: R_s* — VAT-reordered R_S; S = {S_1, ..., S_c'} — grouping sets; S — indices of samples in R_S; p — VAT reordering indices of sample R_S; d — MST cut vector
1. Select the indices M of the c' distinguished objects:
    m_1 = 1
    d = (d_1, ..., d_n) = (r_11, ..., r_1n)
    for t = 2, ..., c' do
        d = (min{d_1, r_{m_{t-1},1}}, ..., min{d_n, r_{m_{t-1},n}})
        m_t = arg max_{1<=j<=n} {d_j}
2. Group the objects in O = {o_1, ..., o_n} with their nearest distinguished object:
    S_1 = S_2 = ... = S_c' = {}
    for t = 1, ..., n do
        k = arg min_{1<=j<=c'} {r_{m_j,t}}
        S_k = S_k + {t}
3. Select some data for R_S near each of the distinguished objects:
    n_t = ceil(n_s |S_t| / n), t = 1, ..., c'
    Draw n_t random indices S_t' from S_t without replacement, t = 1, ..., c'.
    S = union of S_t', t = 1, ..., c'
    Form R_S, where R_S is the square submatrix of R indexed by S in both rows and columns.
4. Apply VAT to R_S, returning R_s* and p (optionally d).

The sVAT algorithm essentially performs the following four steps. In Step 1, sVAT defines a set of c' distinguished objects that hopefully represents the clustering structure of all the objects; in other words, the c' distinguished objects are prototypical indices for the clusters in the big data. At Step 2, the objects in O are partitioned using the nearest-prototype rule, where the prototypes are the c' distinguished object indices. Step 3 draws a random subset of O to produce a well-represented set of (approximate) size n_s. Finally, in Step 4, VAT is applied to the approximately n_s x n_s submatrix R_S.

The computational complexity of sVAT Steps 1 and 2 is O(c'n), with storage requirements of O(c'n) (for fast execution). Drawing the indices of R_S in Step 3 requires O(c'n) computations and storage of ñ_s entries, where ñ_s = |S| (note that ñ_s is usually only slightly greater than the chosen sample size n_s). The final run of VAT on R_S is O(ñ_s^2) in computation and storage complexity. If the data begin as vectors X in R^p, then Step 1 requires O(pc'n) operations to compute the necessary elements of R and Step 3 requires O(pñ_s^2) operations to compute the elements of R_S. It can be shown that ñ_s <= n_s + c'; hence, the overall storage requirement of sVAT is O(max{c'n, (n_s + c')^2}). The overall computational complexity is O(max{c'n, (n_s + c')^2}) if the data begin as relational data R, and O(max{pc'n, pñ_s^2, (n_s + c')^2}) if the data begin as vectors X. In [11], it was stated that these complexities show that sVAT scales linearly with n. However, we argue that because c', ñ_s, and n_s are some fraction of n, sVAT does not asymptotically scale linearly with n. Nonetheless, if c' << n and n_s << n, these orders of complexity represent a significant reduction in both storage and computational complexity.

The sVAT algorithm has two desirable properties for data sets that contain CS clusters. Proposition 1, from [11], describes the first property and is important in our later analysis of sVAT-SL. The second property is as follows: for an object set O that contains c CS clusters, the proportion of objects in the sVAT sample S (drawn in Step 3 of Algorithm 2) from the ith cluster equals the proportion of the objects in O from the ith cluster. In other words, each CS cluster's size will appear correctly in the sVAT image (large clusters will appear as large dark blocks, while small clusters will appear as small dark blocks).

Proposition 1. [11] Consider a set of objects O that can be partitioned into c CS clusters, and let c' >= c. Then Step 1 of sVAT selects at least one distinguished object from each CS cluster.

Proof: See [11] for proof.

We now propose an extension to sVAT, called sVAT-SL, that returns a c-partition, c <= c', of O. For the case where there exist c CS clusters in O, we show that the sVAT-SL partition is exactly the single-linkage c-partition of O.

III.
sVAT-SL ALGORITHM

The sVAT-SL algorithm, proposed in Algorithm 3, calculates partitions of data represented by a dissimilarity matrix R. In brief, sVAT-SL calculates a single-linkage partition of the sVAT-sampled data and then extends this partition to the entire data set. We will show that in certain cases the sVAT-SL c-partition is equivalent to the single-linkage c-partition of the VL data set.

In Step 1 of sVAT-SL, the sVAT algorithm is run on R, returning the VAT-reordered sample matrix R_s*, the grouping sets S computed in sVAT Step 2, the indices of the sample S, the reordering vector p, and the magnitudes of the MST links d. At Step 2 of sVAT-SL, the user must choose the number of clusters c to seek. We recommend using the sVAT visualization of R_s* to help choose the number of clusters. One could also use the biggest-jump criterion by sorting d in descending order and choosing c as the arg max of {d_c - d_{c+1}}. At Step 3 of sVAT-SL, the indices of the c largest values of d are found and denoted t. These indicate the links of the MST that are cut to find the c single-linkage clusters of R_s*.

Algorithm 3: sVAT-SL
Data: R in (R+)^{n x n} — dissimilarity matrix
Input: c' — overestimate of the true number of clusters c; n_s — size of approximating sample
Result: R_s* — VAT-reordered R_S; Π — c-partition of O
1. Run sVAT on R, returning R_s*, S, S, p, and d
2. Choose the number of clusters c (e.g., by using the sVAT image)
3. Find the indices t of the c largest values in d
4. Form the aligned partition Π' as {t_1 : t_2 - t_1 : ... : t_c - t_{c-1}}
5. π_{S_{p_i}} = π'_i, i = 1, ..., ñ
6. for each ŝ in Ŝ = {1, ..., n} \ S do
       j = arg min_{k in S} r_{ŝk}
       π_ŝ = π_j

Next, the aligned partition Π' is calculated using the indices t found in Step 3. At Step 5, we merely reorder the cluster indicator vector Π' of S to match the index ordering of the original objects O (i.e., we label the samples of O that are represented in S). Finally, at Step 6 we label the remaining objects in O by giving them the label of the nearest respective object in S.
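The nearest-object labeling of Step 6 can be sketched as follows. This is an illustrative helper of our own: get_row is a hypothetical callback that returns one row of the full dissimilarity matrix R, so the full matrix never needs to be in memory at once.

```python
import numpy as np

def extend_partition(get_row, n, sample_idx, sample_labels):
    """sVAT-SL Step 6: give every object not in the sample the label
    of its nearest sampled object.  get_row(i) returns row i of the
    full n x n dissimilarity matrix, one row at a time."""
    sample_idx = np.asarray(sample_idx)
    labels = np.empty(n, dtype=int)
    labels[sample_idx] = sample_labels    # labels of the sampled objects
    in_sample = np.zeros(n, dtype=bool)
    in_sample[sample_idx] = True
    for i in np.flatnonzero(~in_sample):
        row = get_row(i)
        # Nearest sampled object determines the label of object i.
        nearest = sample_idx[np.argmin(row[sample_idx])]
        labels[i] = labels[nearest]
    return labels
```

Looping over one row at a time keeps the storage at O(n) beyond the sample, at the cost of one pass over the unsampled objects.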
The computational complexity of sVAT-SL, beyond the run of sVAT at Step 1, is as follows. Step 3 is O(ñ), as is the re-sorting of Π' at Step 5. The final step of sVAT-SL assigns the cluster indicators for the (n - ñ) objects that are not in the sample S (i.e., not represented in R_S). Each iteration of the for loop requires O(ñ) computations, thus resulting in a final O(ñ n̂) computational complexity, where n̂ = |Ŝ|. The storage requirement is O(ñ n̂) for the submatrix of R indexed by Ŝ and S, used in Step 6 (although one could certainly achieve fast computation and a smaller storage requirement by loading one row of R at a time in Step 6).

Proposition 2. Consider a data set O and a corresponding dissimilarity matrix R. Let Π be the c-partition indicator vector returned by sVAT-SL and Π* be the single-linkage c-partition indicator vector of O. If O, according to R, can be partitioned into c CS clusters, then Π = Π* if c' >= c.

Proof: First, it is well known that single-linkage finds the c-partition corresponding to the c CS clusters of a data set. Hence, Π* is the partition of O into its c CS clusters. Second, Proposition 1 says that the sampling procedure of sVAT selects at least one distinguished object from each of the c CS clusters. Hence, the sampled data S contain at least one object from each of the c CS clusters. Because of the properties of CS clusters, the subset S contains the same c CS clusters as O (albeit in sampled form). In [16], we showed that every single-linkage partition is aligned in the VAT-reordered dissimilarity matrix. Thus, Steps 3 and 4 of sVAT-SL find the single-linkage c-partition of the sampled data S. Thus, Π', produced at Step 4 of sVAT-SL,
is the partition of S into its c CS clusters. The partition Π' is thus the same as Π*, albeit sampled and reordered. At Step 6 of sVAT-SL, each object in O that is not in S is labeled as being in the cluster of the nearest object in S, producing the partition Π of O. Hence, the partition Π = Π*.

Remark 1. Proposition 2 states that if O contains c CS clusters, then sVAT-SL will find them, as long as c' >= c. If O does not contain c CS clusters, then sVAT-SL is not guaranteed to find the same partition as single-linkage. However, we show in Section IV that sVAT-SL produces a good approximation of the preferred partition of O.

IV. EXPERIMENTS

We compare the clustering results of single-linkage and sVAT-SL on two main data sets. The data set denoted 3 Clouds is a variably sized data set with 3 clouds of data drawn from 3 Gaussian distributions with the following parameters: μ_1 = (5, 5), μ_2 = (0, 0), μ_3 = (10, 0), Σ_1 = Σ_2 = Σ_3 = I_2. The sizes of the clouds are, respectively, 0.2n, 0.4n, and 0.4n. Figure 2 shows a plot of a draw of the 3 Clouds data for n = 25,000. The sVAT image of the data in Fig. 2(a), with c' = 20 and n_s = 1,000, is shown in view (b). The sVAT image clearly shows 3 clusters. Furthermore, it shows the relative size of each cluster accurately.

Fig. 2. sVAT image of 3 Clouds data, c' = 20, n_s = 1,000: (a) 3 Clouds (n = 25,000); (b) sVAT image.

The second data set we show results for is the Forest Cover data.^3 These data are composed of 54 cartographic features obtained from United States Geological Survey (USGS) and United States Forest Service (USFS) data. These features were collected from a total of 581,012 30 x 30 meter cells, which were then determined to be one of 7 forest cover types by the USFS by analyzing the 54 features. We normalize the features to the interval [0, 1]. We compared single-linkage and sVAT-SL by calculating the purity and normalized mutual information (NMI) of partitions calculated with each algorithm, and also the runtime.
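The 3 Clouds data described above can be drawn as follows (a sketch under the stated parameters; the helper name and seed are our own):

```python
import numpy as np

def three_clouds(n, seed=0):
    """Draw the 3 Clouds data: Gaussian clouds centered at (5,5),
    (0,0), and (10,0), each with identity covariance, of sizes 0.2n,
    0.4n, and 0.4n respectively.  Returns (X, y) with labels 1..3."""
    rng = np.random.default_rng(seed)
    sizes = [int(0.2 * n), int(0.4 * n), n - int(0.2 * n) - int(0.4 * n)]
    means = [(5.0, 5.0), (0.0, 0.0), (10.0, 0.0)]
    X = np.vstack([rng.normal(loc=m, scale=1.0, size=(s, 2))
                   for m, s in zip(means, sizes)])
    y = np.repeat([1, 2, 3], sizes)
    return X, y
```

With n = 25,000 this reproduces the cloud sizes 5,000 / 10,000 / 10,000 used in the first experiment.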
Each of these indices is described in detail in [19]. In brief, an index value of 1 is perfect, while an index value of 0 indicates poor performance (in the case of NMI, a value of 0 indicates randomly chosen clusters). For the 3 Clouds data, we randomly drew 50 instances of the data set with size n = 25,000 and ran single-linkage and sVAT-SL (with various parameters) on each instance. Table II outlines the mean and standard deviation of purity and NMI for these 50 runs. Single-linkage performs very poorly at finding the 3 clusters in these data because single-linkage can be fatally affected by one outlier (which can easily happen with Gaussian-distributed data). In contrast, sVAT-SL produces very good clustering results for all parameter settings: both purity and NMI are > 0.75, and for c' = 10 and n_s = 100 these indices are both > 0.9, a near-perfect result. These results convince us that sVAT-SL is producing the preferred clusters in the 3 Clouds data, while single-linkage performance is hampered by the outlier data points. The Time column in Table II shows that sVAT-SL is also 2 orders of magnitude faster than single-linkage.

For our second experiment, we ran sVAT-SL on the 3 Clouds data (n = 250,000) and the Forest Cover data. Each of these data sets was too large to be clustered by single-linkage on our machine; one would need > 1 TB of main memory to store the full n x n distance matrix. Table II shows that sVAT-SL is able to cluster both of these data sets in a very reasonable amount of time: on the order of seconds for the 3 Clouds data and for the Forest Cover data. Furthermore, the sVAT-SL performance indices show that the algorithm produces very accurate clusters for the 3 Clouds data. With the parameters (c' = 5, n_s = 500), sVAT-SL produces clusters with indices > 0.9: a near-perfect solution.

^3 The Forest Cover data set can be downloaded at dstar/data/clusteringdata.html.
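Purity and NMI can both be computed from the cluster-vs-class contingency table. The sketch below is our own; it uses the geometric-mean normalization for NMI, one common convention ([19] gives the formal definitions, and the exact normalization used for Table II may differ).

```python
import numpy as np

def purity_nmi(labels, truth):
    """Purity and normalized mutual information of a clustering
    against reference labels.  Both arguments are integer label
    arrays of equal length; returns (purity, nmi) in [0, 1]."""
    labels = np.asarray(labels)
    truth = np.asarray(truth)
    n = len(truth)
    ks, cs = np.unique(labels), np.unique(truth)
    # Contingency table: clusters (rows) x reference classes (columns).
    C = np.array([[np.logical_and(labels == k, truth == c).sum()
                   for c in cs] for k in ks], dtype=float)
    purity = C.max(axis=1).sum() / n      # majority class per cluster
    pj = C / n                            # joint distribution
    pk = pj.sum(axis=1)                   # cluster marginals
    pc = pj.sum(axis=0)                   # class marginals
    nz = pj > 0
    mi = (pj[nz] * np.log(pj[nz] / (pk[:, None] * pc[None, :])[nz])).sum()
    hk = -(pk[pk > 0] * np.log(pk[pk > 0])).sum()
    hc = -(pc[pc > 0] * np.log(pc[pc > 0])).sum()
    nmi = mi / np.sqrt(hk * hc) if hk > 0 and hc > 0 else 0.0
    return purity, nmi
```

A clustering that matches the reference labeling exactly scores 1 on both indices; a single all-inclusive cluster scores an NMI of 0, matching the "randomly chosen clusters" interpretation above.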
For the Forest Cover data, sVAT-SL returns clusters that achieve a purity of about 0.5 and an NMI of about 0.2; the NMI reported for the algorithms proposed in [20] provides a point of comparison. Our proposed sVAT-SL algorithm is able to cluster these data in a matter of seconds.

V. CONCLUSIONS AND FUTURE WORK

Our analysis in Section III shows that sVAT-SL is an approximation to single-linkage clustering for big data. Its computational complexity is significantly smaller than the O(n^2) complexity of the full hierarchical algorithm. Experimental runtime measurements also show that the algorithm is fast at both producing a cluster visualization and partitioning the data. Furthermore, we showed that the sVAT-SL c-partition is equivalent to the single-linkage c-partition for data that have c compact-separated clusters.

A weakness of sVAT-SL is that the partition is based on the single-linkage criterion, which has the well-known drawback that the partition can be ruined by outliers. For data with many outliers, the outliers have to be partitioned (usually into their own clusters) before the preferred partition can be found via the single-link criterion. However, sVAT-SL provides visual evidence as to how large the clusters should be; hence, if the sizes of the single-link clusters do not match well with the visual evidence, the user can disregard the partition (perhaps choosing a different clustering algorithm to partition the sample, or throwing out data from small clusters).

In the future, we will examine the sampling scheme suggested by the k-means++ initialization technique [21], which
draws objects according to a distance-weighted probability distribution. The drawback of the k-means++ method is that it does not absolutely ensure that an object is drawn from each cluster in the case of CS clusters.

TABLE II
CLUSTERING RESULTS*
[Purity, NMI, and time (secs), mean and standard deviation, for SL and sVAT-SL on the 3 Clouds data (n = 25,000 and n = 250,000) and the Forest Cover data (n = 581,012), for several settings of c' and n_s; SL could not be run on the two largest data sets.]
*Mean and standard deviation over 50 independent trials. Bold indicates the superior algorithm of SL and sVAT-SL by a 2-sided t-test.

REFERENCES
[1] A. Jain and R. Dubes, Algorithms for Clustering Data. Englewood Cliffs, NJ: Prentice-Hall.
[2] R. Xu and D. Wunsch II, Clustering. Piscataway, NJ: IEEE Press.
[3] J. C. Bezdek and R. J. Hathaway, "VAT: a tool for visual assessment of (cluster) tendency," in Proc. IJCNN, Honolulu, HI, 2002.
[4] P. Huber, "Massive data sets workshop: the morning after," in Massive Data Sets. National Academy Press, 1997.
[5] R. Hathaway and J. Bezdek, "Extending fuzzy and probabilistic clustering to very large data sets," Computational Statistics and Data Analysis, vol. 51.
[6] T. C. Havens, J. C. Bezdek, C. Leckie, L. O. Hall, and M. Palaniswami, "Fuzzy c-means algorithms for very large data," IEEE Trans. Fuzzy Systems.
[7] N. Pal and J. Bezdek, "Complexity reduction for 'large image' processing," IEEE Trans. Systems, Man, and Cybernetics B, vol. 32, no. 5, 2002.
[8] P. Hore, L. Hall, and D. Goldgof, "Single pass fuzzy c means," in Proc. IEEE Int. Conf. Fuzzy Systems, London, England, 2007.
[9] R. Chitta, R. Jin, T. Havens, and A. Jain, "Approximate kernel k-means: solution to large scale kernel clustering," in Proc. ACM SIGKDD Conf.
Knowledge Discovery and Data Mining, 2011.
[10] T. Havens, R. Chitta, A. Jain, and R. Jin, "Speedup of fuzzy and possibilistic c-means for large-scale clustering," in Proc. IEEE Int. Conf. Fuzzy Systems, Taipei, Taiwan.
[11] R. J. Hathaway, J. C. Bezdek, and J. M. Huband, "Scalable visual assessment of cluster tendency for large data sets," Pattern Recognition, vol. 39, no. 7, July 2006.
[12] W. Petrie, "Sequences in prehistoric remains," J. Anthropological Inst. Great Britain and Ireland, vol. 29, 1899.
[13] J. Czekanowski, "Zur Differentialdiagnose der Neandertalgruppe," Korrespondenzblatt der Deutschen Gesellschaft für Anthropologie, Ethnologie und Urgeschichte, vol. 40, 1909.
[14] L. Wilkinson and M. Friendly, "The history of the cluster heat map," The American Statistician, vol. 63, no. 2.
[15] R. C. Prim, "Shortest connection networks and some generalizations," Bell System Technical Journal, vol. 36.
[16] T. C. Havens, J. C. Bezdek, J. M. Keller, M. Popescu, and J. M. Huband, "Is VAT really single linkage in disguise?" Ann. Math. Artif. Intell., vol. 55, no. 3-4.
[17] T. C. Havens, J. C. Bezdek, J. M. Keller, and M. Popescu, "Dunn's cluster validity index as a contrast measure of VAT images," in Proc. ICPR, Tampa, FL, December 2008.
[18] J. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters," J. of Cybernetics, vol. 3, no. 3.
[19] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press.
[20] R. Chitta, R. Jin, and A. K. Jain, "Efficient kernel clustering using random Fourier features," in Proc. Int. Conf. Data Mining.
[21] D. Arthur and S. Vassilvitskii, "k-means++: the advantages of careful seeding," in Proc. SODA, 2007.
Scalable Histograms on Large Probabilistic Data Mingwang Tang and Feifei Li School of Computing, University of Utah, Salt Lake City, USA {tang, lifeifei}@cs.utah.edu ABSTRACT Histogram construction is
More informationWITH the availability of large data sets in application
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 10, OCTOBER 2004 1 Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance Ruoming
More informationPARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce
PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce Matteo Riondato Dept. of Computer Science Brown University Providence, RI, USA matteo@cs.brown.edu Justin A.
More informationOn Clusterings: Good, Bad and Spectral
On Clusterings: Good, Bad and Spectral RAVI KANNAN Yale University, New Haven, Connecticut AND SANTOSH VEMPALA AND ADRIAN VETTA M.I.T., Cambridge, Massachusetts Abstract. We motivate and develop a natural
More information1 NOT ALL ANSWERS ARE EQUALLY
1 NOT ALL ANSWERS ARE EQUALLY GOOD: ESTIMATING THE QUALITY OF DATABASE ANSWERS Amihai Motro, Igor Rakov Department of Information and Software Systems Engineering George Mason University Fairfax, VA 220304444
More informationIEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, 2013. ACCEPTED FOR PUBLICATION 1
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, 2013. ACCEPTED FOR PUBLICATION 1 ActiveSet Newton Algorithm for Overcomplete NonNegative Representations of Audio Tuomas Virtanen, Member,
More informationGRAPHBASED ranking models have been deeply
EMR: A Scalable Graphbased Ranking Model for Contentbased Image Retrieval Bin Xu, Student Member, IEEE, Jiajun Bu, Member, IEEE, Chun Chen, Member, IEEE, Can Wang, Member, IEEE, Deng Cai, Member, IEEE,
More informationTHE development of methods for automatic detection
Learning to Detect Objects in Images via a Sparse, PartBased Representation Shivani Agarwal, Aatif Awan and Dan Roth, Member, IEEE Computer Society 1 Abstract We study the problem of detecting objects
More informationThe Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
More informationSearching the Web. Abstract
Searching the Web Arvind Arasu Junghoo Cho Hector GarciaMolina Andreas Paepcke Sriram Raghavan Computer Science Department, Stanford University {arvinda,cho,hector,paepcke,rsram}@cs.stanford.edu Abstract
More informationLeveraging Aggregate Constraints For Deduplication
Leveraging Aggregate Constraints For Deduplication Surajit Chaudhuri Anish Das Sarma Venkatesh Ganti Raghav Kaushik Microsoft Research Stanford University Microsoft Research Microsoft Research surajitc@microsoft.com
More informationRobust Set Reconciliation
Robust Set Reconciliation Di Chen 1 Christian Konrad 2 Ke Yi 1 Wei Yu 3 Qin Zhang 4 1 Hong Kong University of Science and Technology, Hong Kong, China 2 Reykjavik University, Reykjavik, Iceland 3 Aarhus
More information2 Basic Concepts and Techniques of Cluster Analysis
The Challenges of Clustering High Dimensional Data * Michael Steinbach, Levent Ertöz, and Vipin Kumar Abstract Cluster analysis divides data into groups (clusters) for the purposes of summarization or
More informationEFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES
ABSTRACT EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES Tyler Cossentine and Ramon Lawrence Department of Computer Science, University of British Columbia Okanagan Kelowna, BC, Canada tcossentine@gmail.com
More information