# Dynamic load balancing in parallel KD-tree k-means


Dynamic load balancing in parallel KD-tree k-means (abstract not recoverable from the source scan; the paper is deposited in the CentAUR repository)

k-means algorithm. In Section IV we introduce a parallel formulation of the efficient k-means algorithm based on KD-Trees. Section V provides a comparative experimental analysis of the proposed approaches. Finally, Section VI provides concluding remarks and future research directions.

II. THE K-MEANS ALGORITHM

Given a set $X = \{x_1, \ldots, x_N\}$ of $N$ data vectors in a $d$-dimensional space $\mathbb{R}^d$ and a parameter $K$ ($1 < K < N$) which defines the number of desired clusters, k-means determines a set of $K$ vectors $C = \{c_1, \ldots, c_K\}$, called centers or centroids, to minimise the within-cluster variance. Without loss of generality, we assume the input vectors to be defined in the range $[0, 1]$ in each dimension. The centroid of cluster $k$ approximates the centre of mass of the cluster and is defined as:

$c_k = \frac{1}{n_k} \sum_{i=1}^{n_k} x_i^{(k)}$,  (1)

where $n_k$ is the number of data points in cluster $k$, and $x_i^{(k)}$ is a data point in cluster $k$. The error for each cluster is the sum of a squared norm, e.g. the Euclidean norm, between each input pattern and its nearest centroid. The objective function that k-means optimises is the overall error (square-error distortion) $E(C)$, which is given by the sum of the squared errors over all clusters:

$E(C) = \sum_{i=1}^{N} \min_{k=1..K} \| x_i - c_k \|^2 = \sum_{k=1}^{K} \sum_{i=1}^{n_k} \| x_i^{(k)} - c_k \|^2$.  (2)

A general problem in clustering methods is that there are no computationally feasible methods to find the optimal solution. In practical applications it is not possible to guarantee that a given clustering configuration minimises the square-error distortion, as this would require an exhaustive search. The number of possible clustering configurations, even for small numbers of patterns $N$ and clusters $K$, quickly becomes enormously large. In [9] it has been shown that all distinct Voronoi partitions, which are equivalent to all possible clustering configurations, can be enumerated in $O(N^{Kd+1})$.
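As a concrete illustration of equation (2), the distortion of a given set of centroids can be computed directly. The sketch below uses NumPy; the function and variable names (X for the patterns, C for the centroids) are our own, chosen for illustration:

```python
import numpy as np

def distortion(X, C):
    """Square-error distortion E(C): each pattern contributes the squared
    Euclidean distance to its nearest centroid, as in equation (2)."""
    # Pairwise squared distances between the N patterns and K centroids: shape (N, K)
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    # Nearest-centroid distance per pattern, summed over the data set
    return d2.min(axis=1).sum()

X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]])
C = np.array([[0.0, 0.0], [1.0, 1.0]])
e = distortion(X, C)  # only the second pattern is off-centroid: E(C) = 0.1**2
```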
Therefore many clustering algorithms use iterative hill-climbing methods that can be terminated when specific conditions are met [10]. These conditions typically are: when no further change can be made (local convergence), when the last improvement in the total squared error drops below a certain threshold, or when a user-defined number of iterations has been completed. However, only the first condition reaches a local minimum in the search space. The k-means algorithm works by first sampling K centres at random from the data points. At each iteration, data points are assigned to the closest centre and, finally, centres are updated according to (1). Although the centres can be chosen arbitrarily, the algorithm itself is fully deterministic once the number of clusters and the initial centres have been fixed. At the core of each iteration k-means performs a nearest-neighbour query for each of the data points of the input data set. This requires exactly N·K distance computations at each iteration. While the algorithm is guaranteed to converge in a finite number of iterations, this number can vary significantly for different problems and for different choices of the initial centres on the same problem. The overall worst-case time complexity is often indicated as O(INKd), where I is the number of iterations. However, unless explicitly bounded by the user, the number of iterations depends on the problem. Only recently have the theoretical bounds on the number of iterations k-means can take to converge been studied. In [11] it has been shown that k-means requires more than a polylogarithmic number of iterations, and polynomial upper and lower bounds on the number of iterations have been derived for simple configurations. In particular, for two clusters in one dimension the lower bound is Ω(N). The upper bound in one dimension and for any number of clusters is O(NΔ²), where Δ is the spread of the set X, i.e. the ratio between the longest and shortest pairwise distances.
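The iteration scheme just described can be sketched as follows. This is a minimal brute-force implementation for illustration (the names kmeans, assign, etc. are our own), not the parallel formulation discussed later:

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Brute-force k-means: N*K distance computations per iteration,
    terminating at local convergence (no assignment changes)."""
    rng = np.random.default_rng(seed)
    # Sample K initial centres at random from the data points
    C = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    assign = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assignment step: closest centre for every data point
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        new = d2.argmin(axis=1)
        if np.array_equal(new, assign):
            break  # local convergence reached
        assign = new
        # Update step: each centroid moves to its cluster's centre of mass, as in (1)
        for k in range(K):
            members = X[assign == k]
            if len(members) > 0:
                C[k] = members.mean(axis=0)
    return C, assign
```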
The authors of [12] prove that the worst-case running time is superpolynomial, improving the best known lower bound from Ω(N) iterations to 2^{Ω(√N)}. The clustering configuration produced by k-means is not necessarily a global minimum and can be arbitrarily bad compared to the optimal configuration. Both the quality of the clustering and the number of iterations for convergence depend on the initial choice of centroids. It is common to run k-means several times with different initial centroids (e.g. [13]). Moreover, the assumption that the number of clusters K is known a priori does not hold in most explorative data mining applications. Solutions for an appropriate choice of K range from trial-and-error strategies to more complex techniques based on Information Theory [14] and heuristic approaches with repeated k-means runs and a varying number of centroids [10]. For these reasons, and because of the interactive nature of data mining applications, several k-means runs often need to be carried out. Thus, it is important to identify appropriate techniques to improve the efficiency of the core iteration step, which is the aim of the techniques based on KD-Trees described in the following section.

III. KD-TREE K-MEANS

A KD-Tree [15] is a multi-dimensional binary search tree commonly adopted for organising spatial data. It is useful in several problems such as graph partitioning, n-body simulations and database applications. KD-Trees can be used to perform an efficient nearest-neighbour search for classification. In the case of the k-means algorithm, a KD-Tree is used to optimise the identification of the closest centroid for each pattern. The basic idea is to group patterns with similar coordinates so that assignments can be performed for whole groups, whenever possible, without explicit distance computations for each of the patterns. During a pre-processing step the input patterns are organised in a KD-Tree. At each k-means iteration the tree is traversed and the patterns are assigned to their closest centroid. Construction and traversal of the tree are described in the next two sections.

A. Tree Construction

Each node of the tree is associated with a set of patterns. The root node is associated with all input patterns, and a binary partition is recursively carried out. At each node the data set is partitioned into two approximately equal-sized sets, which are assigned to the left and right child nodes. The partitioning process is repeated until the full tree is built and each leaf node is associated with a single data pattern. A minimum leaf size (> 1) can also be defined, which leads to an incomplete tree. The partitioning operation is performed by selecting a dimension and a pivot value. Data points are assigned to the left child if their coordinate in that dimension is smaller than the pivot value, and to the right child otherwise. The dimension for partitioning can be selected with a round-robin policy during a depth-first construction of the tree. This way the same dimension is used to split the data sets of internal nodes at the same tree level. An alternative method is to select, at each node, the dimension with the widest range of values. The pivot value can be the median or the mid-point. The median guarantees equal-sized partitions at some computational cost. The computation of the mid-point is faster but may lead to imbalanced trees. An example of a KD-Tree for a data set in a two-dimensional space is shown in Figure 1.

Figure 1. KD-Tree example in 2D

Each KD-Tree node stores parameters and statistics of the associated set of patterns.
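Under the widest-range and median policies described above, the construction can be sketched as follows. This is a simplified illustration; the class and field names are our own, not the paper's implementation:

```python
import numpy as np

class Node:
    """KD-Tree node caching the statistics used by the traversal."""
    def __init__(self, X):
        self.X = X
        self.n = len(X)                        # cardinality of the pattern set
        self.lo, self.hi = X.min(0), X.max(0)  # boundary vectors (per-dimension range)
        self.vsum = X.sum(axis=0)              # vector sum of the patterns
        self.ssum = float((X ** 2).sum())      # scalar sum of squared norms
        self.left = self.right = None

def build(X, min_leaf=1):
    """Split on the widest-range dimension at the median until the
    leaf size drops to min_leaf or a split becomes degenerate."""
    node = Node(X)
    if len(X) > min_leaf:
        dim = int(np.argmax(node.hi - node.lo))  # widest range of values
        pivot = float(np.median(X[:, dim]))      # median pivot: balanced partitions
        mask = X[:, dim] < pivot
        if mask.any() and (~mask).any():
            node.dim, node.pivot = dim, pivot
            node.left = build(X[mask], min_leaf)
            node.right = build(X[~mask], min_leaf)
    return node
```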
The information cached at each KD-Tree node is computed during the pre-processing step and includes:

- the set of patterns {x_i} associated with the KD-Tree node;
- the KD-Tree node parameters:
  - the dimension used for partitioning the data,
  - the pivot value used for partitioning the data;
- the KD-Tree node statistics:
  - the cardinality of the pattern set {x_i},
  - two boundary vectors (the range of values in each dimension),
  - the vector sum of the patterns Σ_i x_i,
  - the scalar sum of the squared norms Σ_i ||x_i||².

The dimension and the value for partitioning the data are used to generate the child nodes during the construction of the tree. The KD-Tree node statistics (aka Clustering Feature [16]) are generated during the construction and exploited during the traversal to speed up the computation of each iteration, as described in the following section.

B. Traversal of the Tree

At each k-means iteration the KD-Tree is visited to perform the closest-centroid search for the input patterns. After all patterns have been assigned to their closest centroid, the centroids can be updated. The traversal follows a depth-first search (DFS) strategy. During the traversal a list of centroids (candidates) is propagated from the parent node to the child nodes. The list contains the centroids that might be closest to any of the patterns in the node. The list of candidates at the root of the tree is initialised with all centroids. At the visit of a node the list of candidates is filtered to remove any centroid that cannot be the closest one for any of the patterns associated with the node. The excluded centroids are not propagated to the child nodes since, by definition, a child node's data set is a partition of the parent node's data set. If during the traversal the list of candidates reduces to a single centroid, the traversal of the subtree rooted at the node is skipped (backtrack).
In this case, all the patterns of the node can be assigned to the centroid without any further operation (bulk assignment). The explicit computation of the distances between these patterns and the centroids is entirely avoided. If the visit reaches a leaf node, the explicit distance computations must be performed. However, the distances are computed only between the patterns in the node and the centroids that have reached the leaf node. In this case, the performance gain depends on how many centroids have been filtered out of the complete list along the path from the root node to the leaf. In both cases, early backtrack and leaf-node visit, there is a potential reduction of the number of distance computations with respect to the brute-force k-means algorithm. Moreover, in the case of early backtracking there is a further computational improvement: the statistics cached at the node are used to update the partial sums (bulk assignment) required to update the centroids at the end of the iteration. The total number of distance operations can be significantly reduced, at the cost of the pre-processing for constructing the KD-Tree and of some computational overhead.
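A self-contained sketch of the traversal follows. The filtering rule used here is a simplified, conservative one (prune a candidate when its minimum possible distance to the node's bounding box exceeds the smallest maximum distance of any candidate); it is weaker than, but in the same spirit as, the criterion adopted in the paper. Names and structure are illustrative, and a minimal node class restates the cached statistics so the example runs on its own:

```python
import numpy as np

class Node:
    """Minimal KD-Tree node with the cached statistics (cardinality,
    bounding box, vector sum) needed by the traversal."""
    def __init__(self, X, min_leaf=1):
        self.X, self.n = X, len(X)
        self.lo, self.hi = X.min(0), X.max(0)
        self.vsum = X.sum(axis=0)
        self.left = self.right = None
        if len(X) > min_leaf:
            dim = int(np.argmax(self.hi - self.lo))
            mask = X[:, dim] < np.median(X[:, dim])
            if mask.any() and (~mask).any():
                self.left = Node(X[mask], min_leaf)
                self.right = Node(X[~mask], min_leaf)

def min_max_dist2(c, lo, hi):
    """Squared min and max distance from centroid c to the box [lo, hi]."""
    nearest = np.clip(c, lo, hi)
    farthest = np.where(np.abs(c - lo) > np.abs(c - hi), lo, hi)
    return ((c - nearest) ** 2).sum(), ((c - farthest) ** 2).sum()

def traverse(node, C, cand, counts, sums):
    """DFS with candidate filtering; counts/sums accumulate the partial
    sums used to update the centroids at the end of the iteration."""
    d = np.array([min_max_dist2(C[k], node.lo, node.hi) for k in cand])
    # Prune a candidate if its best case (min distance to the box) is worse
    # than some candidate's worst case (max distance to the box).
    cand = [k for k, ok in zip(cand, d[:, 0] <= d[:, 1].min()) if ok]
    if len(cand) == 1:
        # Backtrack: bulk assignment from the cached statistics,
        # with no explicit distance computation for the subtree.
        counts[cand[0]] += node.n
        sums[cand[0]] += node.vsum
    elif node.left is None:
        # Leaf: explicit distances, but only to the surviving candidates
        d2 = ((node.X[:, None, :] - C[cand][None, :, :]) ** 2).sum(axis=2)
        for i, j in enumerate(d2.argmin(axis=1)):
            counts[cand[j]] += 1
            sums[cand[j]] += node.X[i]
    else:
        traverse(node.left, C, cand, counts, sums)
        traverse(node.right, C, cand, counts, sums)
```

After a full traversal starting at the root with all centroids as candidates, the updated centroid of each non-empty cluster is simply sums[k] / counts[k].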

Figure 3. 2D representation of the multi-dimensional prototypes (a) and patterns (b)

Figure 4. Speedup and running time

... a decentralized and asynchronous k-means algorithm suitable for large-scale systems in wide-area networks.

V. EXPERIMENTAL ANALYSIS

In order to evaluate the three proposed approaches for parallel KD-Tree k-means, Static Partitioning (SP), Static Load Balancing (SLB) and Dynamic Load Balancing (DLB), we compare them with the parallel formulation [7] of the brute-force k-means algorithm (pkmeans). We have generated an artificial data set with patterns in a 20-dimensional space with a mixed Gaussian distribution, as described in the following. First we have generated 50 pattern prototypes in the multi-dimensional space; this corresponds to the number of clusters K. We have generated patterns with a Gaussian distribution around each prototype, with a random standard deviation in the range [0.0, 0.1]. In order to create a more realistically skewed data distribution, we did not generate the prototypes uniformly in the multi-dimensional space. We have distributed 25 prototypes uniformly in the whole domain, while 25 prototypes were restricted to a subdomain. This generated a higher density of prototypes in the subdomain. The skewed distribution of data patterns in the domain is meant to emphasise the load-imbalance problem. The parameters were chosen in order to generate a data set which contains some well-separated clusters and some that are not well separated. We have applied Multi-Dimensional Scaling [19] to the set of prototypes and to a sample of the patterns to visualise them in 2-dimensional maps. Figure 3(a) shows the 50 prototypes; the higher-density area is clearly visible at the centre of the map. Figure 3(b) shows a 2D map of a sample of the generated patterns.
The software has been developed in Java and adopts MPJ Express [20], a Java binding for the MPI standard. The experimental tests were carried out on an IBM BladeCenter cluster (2.5 GHz dual-core PowerPC 970MP) connected via a Myrinet network, running Linux (ppc64) and J2RE (IBM J9 2.3). In all our tests we have built KD-Trees with a minimum leaf size of 50, with attribute and value selection based, respectively, on the widest range and the median. Figure 4 shows the relative speedup of the four parallel algorithms and the running time of pkmeans. For a small number of processes (up to 32) the algorithms pkmeans and SP have similar performance. For larger numbers of processes pkmeans outperforms SP. In spite of the more efficient data structure, SP shows clear scalability problems with respect to the parallel brute-force approach. The second approach, SLB, slightly outperforms pkmeans up to 32 processes. At 64 processes they have comparable speedup. At 128 processes SLB also shows worse performance than the parallel brute-force k-means algorithm. Over the entire range of tests the algorithm DLB outperforms all the others. For computing environments with more than 32 processors, the worse performance of the first two KD-Tree based algorithms (SP and SLB) is due to load imbalance. It is evident that SLB has reduced the problem but has not completely solved it. The third KD-Tree based approach, which adopts a dynamic load-balancing policy, is indeed able to overcome the problem and to make the use of the more efficient data structure convenient for a parallel k-means implementation.

Figure 5. Parallel costs: cumulative times over all processes: (a) parallel k-means (pkmeans); (b) parallel KD-Tree k-means, Static Partitioning (SP); (c) parallel KD-Tree k-means, Static Load Balancing (SLB); (d) parallel KD-Tree k-means, Dynamic Load Balancing (DLB)

To analyse the performance of the algorithms in more detail, we show the contributions of the different parallel costs to the overall running time. We have accumulated the time spent by each process, respectively, in computation, communication, idle periods and, for the algorithm DLB, in performing dynamic load balancing. Figure 5 shows the cumulative computation time and the components of the overall parallel cost. The charts confirm that SP and SLB spend less time in computation as an effect of the better data structure and algorithm they adopt. However, they always have higher idle time than the simple pkmeans. The advantage of using a more complex data structure is neutralised by the inefficient load distribution. In contrast, the algorithm DLB is able to reduce the load imbalance and to provide better overall parallel efficiency. As expected, the communication times of all methods are very similar. The computation time in the approaches based on KD-Trees tends to increase slightly with the number of processes. This effect is due to the extra computation introduced by the filtering criteria. In a larger computing environment, the master node generates a deeper initial tree. The resulting local subtrees are smaller and rooted deeper in the initial tree. This makes the filtering criteria less effective by introducing additional distance calculations for filtering which are not present in the corresponding sequential algorithm.
Local root nodes start with a full list of candidates, while in the sequential algorithm they receive a sublist filtered along the path from the global root. We could start the filtering process at the global root in each parallel process and perform the filtering along the path to the local root, as in the sequential case. Nevertheless, this would not solve the issue, because the initial filtering would be redundantly repeated in each parallel process and would still introduce extra computation.

VI. CONCLUSIONS

We have presented a parallel formulation of the k-means algorithm based on an efficient data structure, namely multidimensional binary search trees. While the sequential algo-


Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

### Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Classifying Large Data Sets Using SVMs with Hierarchical Clusters Presented by :Limou Wang Overview SVM Overview Motivation Hierarchical micro-clustering algorithm Clustering-Based SVM (CB-SVM) Experimental

### Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with

### Divide and Conquer Examples. Top-down recursive mergesort. Gravitational N-body problem.

Divide and Conquer Divide the problem into several subproblems of equal size. Recursively solve each subproblem in parallel. Merge the solutions to the various subproblems into a solution for the original

### Analysis of Micromouse Maze Solving Algorithms

1 Analysis of Micromouse Maze Solving Algorithms David M. Willardson ECE 557: Learning from Data, Spring 2001 Abstract This project involves a simulation of a mouse that is to find its way through a maze.

### R-trees. R-Trees: A Dynamic Index Structure For Spatial Searching. R-Tree. Invariants

R-Trees: A Dynamic Index Structure For Spatial Searching A. Guttman R-trees Generalization of B+-trees to higher dimensions Disk-based index structure Occupancy guarantee Multiple search paths Insertions

### Cluster Analysis for Optimal Indexing

Proceedings of the Twenty-Sixth International Florida Artificial Intelligence Research Society Conference Cluster Analysis for Optimal Indexing Tim Wylie, Michael A. Schuh, John Sheppard, and Rafal A.

### Load balancing in a heterogeneous computer system by self-organizing Kohonen network

Bull. Nov. Comp. Center, Comp. Science, 25 (2006), 69 74 c 2006 NCC Publisher Load balancing in a heterogeneous computer system by self-organizing Kohonen network Mikhail S. Tarkov, Yakov S. Bezrukov Abstract.

### Measuring the Performance of an Agent

25 Measuring the Performance of an Agent The rational agent that we are aiming at should be successful in the task it is performing To assess the success we need to have a performance measure What is rational

### ?kt. An Unconventional Method for Load Balancing. w = C ( t m a z - ti) = p(tmaz - 0i=l. 1 Introduction. R. Alan McCoy,*

ENL-62052 An Unconventional Method for Load Balancing Yuefan Deng,* R. Alan McCoy,* Robert B. Marr,t Ronald F. Peierlst Abstract A new method of load balancing is introduced based on the idea of dynamically

### A Novel Density based improved k-means Clustering Algorithm Dbkmeans

A Novel Density based improved k-means Clustering Algorithm Dbkmeans K. Mumtaz 1 and Dr. K. Duraiswamy 2, 1 Vivekanandha Institute of Information and Management Studies, Tiruchengode, India 2 KS Rangasamy

### A Load Balancing Algorithm based on the Variation Trend of Entropy in Homogeneous Cluster

, pp.11-20 http://dx.doi.org/10.14257/ ijgdc.2014.7.2.02 A Load Balancing Algorithm based on the Variation Trend of Entropy in Homogeneous Cluster Kehe Wu 1, Long Chen 2, Shichao Ye 2 and Yi Li 2 1 Beijing

### Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

### Data Warehousing und Data Mining

Data Warehousing und Data Mining Multidimensionale Indexstrukturen Ulf Leser Wissensmanagement in der Bioinformatik Content of this Lecture Multidimensional Indexing Grid-Files Kd-trees Ulf Leser: Data

### A Non-Linear Schema Theorem for Genetic Algorithms

A Non-Linear Schema Theorem for Genetic Algorithms William A Greene Computer Science Department University of New Orleans New Orleans, LA 70148 bill@csunoedu 504-280-6755 Abstract We generalize Holland

### Subgraph Patterns: Network Motifs and Graphlets. Pedro Ribeiro

Subgraph Patterns: Network Motifs and Graphlets Pedro Ribeiro Analyzing Complex Networks We have been talking about extracting information from networks Some possible tasks: General Patterns Ex: scale-free,

### A Study on the Application of Existing Load Balancing Algorithms for Large, Dynamic, Heterogeneous Distributed Systems

A Study on the Application of Existing Load Balancing Algorithms for Large, Dynamic, Heterogeneous Distributed Systems RUPAM MUKHOPADHYAY, DIBYAJYOTI GHOSH AND NANDINI MUKHERJEE Department of Computer

### Load Balancing Between Heterogenous Computing Clusters

Load Balancing Between Heterogenous Computing Clusters Siu-Cheung Chau Dept. of Physics and Computing, Wilfrid Laurier University, Waterloo, Ontario, Canada, N2L 3C5 e-mail: schau@wlu.ca Ada Wai-Chee Fu

### ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications

### Clustering Very Large Data Sets with Principal Direction Divisive Partitioning

Clustering Very Large Data Sets with Principal Direction Divisive Partitioning David Littau 1 and Daniel Boley 2 1 University of Minnesota, Minneapolis MN 55455 littau@cs.umn.edu 2 University of Minnesota,

Scheduling Allowance Adaptability in Load Balancing technique for Distributed Systems G.Rajina #1, P.Nagaraju #2 #1 M.Tech, Computer Science Engineering, TallaPadmavathi Engineering College, Warangal,

### Classification On The Clouds Using MapReduce

Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal simao.martins@tecnico.ulisboa.pt Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal claudia.antunes@tecnico.ulisboa.pt

Load Balancing in Periodic Wireless Sensor Networks for Lifetime Maximisation Anthony Kleerekoper 2nd year PhD Multi-Service Networks 2011 The Energy Hole Problem Uniform distribution of motes Regular,

### Standardization and Its Effects on K-Means Clustering Algorithm

Research Journal of Applied Sciences, Engineering and Technology 6(7): 399-3303, 03 ISSN: 040-7459; e-issn: 040-7467 Maxwell Scientific Organization, 03 Submitted: January 3, 03 Accepted: February 5, 03

### Parallelization: Binary Tree Traversal

By Aaron Weeden and Patrick Royal Shodor Education Foundation, Inc. August 2012 Introduction: According to Moore s law, the number of transistors on a computer chip doubles roughly every two years. First

### Machine Learning Big Data using Map Reduce

Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories

### Survey On: Nearest Neighbour Search With Keywords In Spatial Databases

Survey On: Nearest Neighbour Search With Keywords In Spatial Databases SayaliBorse 1, Prof. P. M. Chawan 2, Prof. VishwanathChikaraddi 3, Prof. Manish Jansari 4 P.G. Student, Dept. of Computer Engineering&

### PARALLEL PROGRAMMING

PARALLEL PROGRAMMING TECHNIQUES AND APPLICATIONS USING NETWORKED WORKSTATIONS AND PARALLEL COMPUTERS 2nd Edition BARRY WILKINSON University of North Carolina at Charlotte Western Carolina University MICHAEL

### 2.3 Scheduling jobs on identical parallel machines

2.3 Scheduling jobs on identical parallel machines There are jobs to be processed, and there are identical machines (running in parallel) to which each job may be assigned Each job = 1,,, must be processed

### From Self-Organising Mechanisms to Design Patterns

Self-aware Pervasive Service Ecosystems From Self-Organising Mechanisms to Design Patterns University of Geneva Giovanna.Dimarzo@unige.ch 1 Outline Motivation: Spatial Structures and Services Self-Organising

### Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will

### DYNAMIC LOAD BALANCING IN A DECENTRALISED DISTRIBUTED SYSTEM

DYNAMIC LOAD BALANCING IN A DECENTRALISED DISTRIBUTED SYSTEM 1 Introduction In parallel distributed computing system, due to the lightly loaded and overloaded nodes that cause load imbalance, could affect

### Visualization of large data sets using MDS combined with LVQ.

Visualization of large data sets using MDS combined with LVQ. Antoine Naud and Włodzisław Duch Department of Informatics, Nicholas Copernicus University, Grudziądzka 5, 87-100 Toruń, Poland. www.phys.uni.torun.pl/kmk

### Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza

Handling missing data in large data sets Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza The problem Often in official statistics we have large data sets with many variables and

### Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism

Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism Jianqiang Dong, Fei Wang and Bo Yuan Intelligent Computing Lab, Division of Informatics Graduate School at Shenzhen,

### Overlapping Data Transfer With Application Execution on Clusters

Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm reid@cs.toronto.edu stumm@eecg.toronto.edu Department of Computer Science Department of Electrical and Computer

### Junghyun Ahn Changho Sung Tag Gon Kim. Korea Advanced Institute of Science and Technology (KAIST) 373-1 Kuseong-dong, Yuseong-gu Daejoen, Korea

Proceedings of the 211 Winter Simulation Conference S. Jain, R. R. Creasey, J. Himmelspach, K. P. White, and M. Fu, eds. A BINARY PARTITION-BASED MATCHING ALGORITHM FOR DATA DISTRIBUTION MANAGEMENT Junghyun