# MODULE 15 Clustering Large Datasets LESSON 34

## Transcription

Incremental Clustering

Keywords: Single Database Scan, Leader, BIRCH, CF-Tree

### Clustering Large Datasets

Pattern matrix: It is convenient to view the input data as a pattern matrix of size $n \times d$, where there are $n$ patterns (rows) and each pattern is represented by $d$ feature values (columns).

Data compression: Using a suitable algorithm, it is possible to cluster the rows, the columns, or both of the pattern matrix. Clustering the rows is helpful in prototype selection, and clustering the columns aids in feature selection.

Versatility of algorithms: Hierarchical algorithms are more versatile than partitional algorithms. For example, the single-link clustering algorithm works well on data sets containing non-isotropic clusters, including well-separated, chain-like, and concentric clusters, whereas a typical partitional algorithm like the k-means algorithm works well only on data sets having isotropic clusters.

Hierarchical algorithms are expensive: The k-means algorithm is one of the most popular partitional algorithms; it needs $O(nkdl)$ time to cluster $n$ patterns into $k$ clusters using $l$ iterations. Each iteration of the algorithm scans the data set once, so it requires $l$ data set scans. Hierarchical algorithms, on the other hand, initially compute a proximity matrix of size $n \times n$ and use this matrix to cluster the $n$ patterns. Computing and storing this proximity matrix alone needs $O(n^2 d)$ time and $O(n^2)$ space, both of which grow quadratically with $n$.

Large datasets: There are several applications where the pattern matrix is large. By large, we mean that the entire pattern matrix cannot be accommodated in the main memory of the computer. So, we store the input data on a secondary storage medium such as a disk and transfer the data in parts to the main memory for processing.

Applications: For example, a transaction database of a supermarket chain may consist of trillions of transactions, and each transaction is a sparse vector of very high dimensionality; the dimensionality depends on the number of product lines. Similarly, in a network intrusion detection application, the number of connections could be prohibitively large and the number of packets to be analyzed or classified could be even larger. Another application is the clustering of click-streams, which forms an important part of web usage mining. Other applications include genome sequence clustering, where the dimensionality could run into millions, text mining, and biometrics.

Feasibility of algorithms: The pattern matrix is large when either $n$ or $d$ or both are large. An increase in $n$ increases the size of the proximity matrix quadratically; this limits the applicability of hierarchical algorithms that use the proximity matrix for grouping. Even partitional algorithms like the k-means algorithm may demand multiple passes through the data and may be infeasible on large data sets.

### Possible Solutions

Large data: An objective way of characterizing the largeness of a data set is to specify bounds on the number of patterns and features present. For example, a data set having more than a billion patterns and/or more than a million features is large. However, such a characterization is not universally acceptable and is bound to change with developments in technology. For example, in the 1960s, large meant several thousand patterns. So, it is better to consider a more pragmatic characterization: a data set is large if it is not possible to fit the data in the main memory of the machine on which it is processed.

Number of dataset scans: So, the data resides on a secondary storage device and has to be transferred to the main memory as needed. Further, accessing secondary storage is several orders of magnitude slower than accessing the main memory. This assumption is behind the design of various data mining tasks where large data sets are routinely processed, and it prompts us to consider the number of dataset scans an algorithm requires.
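To make the quadratic growth of the proximity matrix concrete, the following is a rough back-of-the-envelope sketch in Python. It assumes each proximity value is stored as an 8-byte float; the function name `proximity_matrix_bytes` and the `symmetric` option (storing only the entries above the diagonal) are illustrative choices, not part of the lesson.

```python
def proximity_matrix_bytes(n: int, bytes_per_entry: int = 8,
                           symmetric: bool = False) -> int:
    """Estimate the storage needed for an n x n proximity matrix.

    Assumes a dense matrix; with symmetric=True only the n(n-1)/2
    entries above the diagonal are counted.
    """
    entries = n * (n - 1) // 2 if symmetric else n * n
    return entries * bytes_per_entry


if __name__ == "__main__":
    # Storage grows 100-fold each time n grows 10-fold.
    for n in (10_000, 100_000, 1_000_000):
        gib = proximity_matrix_bytes(n) / 2**30
        print(f"n = {n:>9,}: about {gib:,.1f} GiB")
```

Even the symmetric variant only halves the requirement, so the growth remains quadratic in $n$.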
Feasibility of the clustering algorithms: It is important that clustering algorithms that work with large data sets scale up well. Algorithms having non-linear time and space complexities are ruled out. Even algorithms requiring linear time and space may not be feasible if they scan the data set several times. Based on these observations, it is possible to list the following solutions for clustering large data sets.

1. Incremental Clustering

The basis of incremental clustering is that the data is considered sequentially and the patterns are processed step by step. Such algorithms are useful in processing stream data. In most incremental clustering algorithms, one of the patterns in the data set (usually the first pattern) is selected to form an initial cluster. Each of the remaining points is assigned to one of the existing clusters or is used to form a new cluster, based on some criterion. Here, a new data item is assigned to a cluster without affecting the existing clusters.

Characterization of incremental clustering: We can characterize incremental clustering formally as follows. Let $X = \{X_1, X_2, \ldots, X_n\}$ be the set of $n$ patterns, where $X_i$ is the $i$-th pattern. In incremental clustering, the data is considered sequentially, say in the order $X_1, X_2, \ldots, X_n$. Let $A_k$ represent the abstraction generated using the first $k$ patterns and $A_n$ the abstraction obtained after all $n$ patterns are processed. Further, in incremental clustering, $A_{k+1}$ is obtained using $A_k$ and $X_{k+1}$ only.

Abstraction generated using clustering: $A_k$ varies from algorithm to algorithm and can take different forms. Some of them are:

(a) Abstraction $A_k$ is a set of prototypes or cluster representatives. The Leader clustering algorithm is a well-known member of this category. It is described below.

Leader Clustering Algorithm

i. Assign the first data item to a cluster.

ii. Assign the next data item to one of the existing clusters or to a new cluster. It is assigned to an existing cluster if the distance between the data item and the cluster representative (leader) is less than a user-provided threshold ($T$). Otherwise, a new cluster is started.

iii. Repeat step ii till all the data items are assigned to clusters.

The Leader algorithm is the simplest algorithm for handling large data. We explain it using an example.

Example 1

Consider the following collection of ten 3-dimensional patterns:

$(1, 1, 1)^t$, $(1, 1, 2)^t$, $(1, 3, 2)^t$, $(2, 1, 1)^t$, $(6, 3, 1)^t$, $(6, 4, 4)^t$, $(6, 6, 6)^t$, $(6, 5, 7)^t$, $(6, 7, 5)^t$, $(7, 5, 6)^t$

Let the user-specified threshold be 5 units and let the $L_1$ norm be used to compute the distance between a pair of points.

First we consider the pattern $(1, 1, 1)^t$. It is assigned to cluster $C_1$; it is the leader of $C_1$.

Next we consider $(1, 1, 2)^t$. The distance ($L_1$ norm) of this pattern from the leader of $C_1$ is 1 unit; it is less than the threshold of 5 units. So, $(1, 1, 2)^t$ is assigned to $C_1$.

Next we consider $(1, 3, 2)^t$. The distance from the only leader is $0 + 2 + 1 = 3$ units, which is again below the threshold; so, it is assigned to $C_1$.

Now consider $(2, 1, 1)^t$. The distance is 1 unit from the leader of $C_1$, and so it is assigned to $C_1$.

Next we consider $(6, 3, 1)^t$. This pattern is at a distance of 7 units from the leader of $C_1$; the distance is above the threshold of 5 units. So, we start a new cluster, $C_2$, and assign $(6, 3, 1)^t$ to $C_2$. So, the leader of $C_2$ is $(6, 3, 1)^t$.

Now $(6, 4, 4)^t$ is processed. It is at a distance of 11 units from the leader of $C_1$, but the distance from the leader of $C_2$, $(6, 3, 1)^t$, is $0 + 1 + 3 = 4$ units, which is below the threshold. So, it is assigned to $C_2$.
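The three steps of the Leader algorithm can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the names `l1_distance` and `leader_clustering` are illustrative, and a pattern is assigned to the first leader found within the threshold (the lesson does not say how ties between several qualifying leaders are resolved). The data and threshold are those of Example 1.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, ...]


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 (city-block) distance between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def leader_clustering(points: Sequence[Point],
                      threshold: float) -> Tuple[List[Point], List[List[Point]]]:
    """Single-scan Leader clustering.

    Assumption: a point joins the first leader whose distance is
    strictly less than the threshold; otherwise it becomes a new leader.
    """
    leaders: List[Point] = []
    clusters: List[List[Point]] = []
    for p in points:
        for idx, leader in enumerate(leaders):
            if l1_distance(p, leader) < threshold:
                clusters[idx].append(p)
                break
        else:
            # No existing leader is close enough: start a new cluster.
            leaders.append(p)
            clusters.append([p])
    return leaders, clusters


PATTERNS: List[Point] = [
    (1, 1, 1), (1, 1, 2), (1, 3, 2), (2, 1, 1), (6, 3, 1),
    (6, 4, 4), (6, 6, 6), (6, 5, 7), (6, 7, 5), (7, 5, 6),
]

leaders, clusters = leader_clustering(PATTERNS, threshold=5)

if __name__ == "__main__":
    for leader, members in zip(leaders, clusters):
        print(leader, members)
```

Each pattern is read exactly once, and only the list of leaders has to stay in memory, which is what makes the method suitable for a single database scan. Note that the result depends on the order in which the patterns are presented.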

The pattern $(6, 6, 6)^t$ is considered now. It is at a distance of 15 units from the leader of $C_1$ and at a distance of 8 units from that of $C_2$. As both these distances are more than the threshold of 5 units, a new cluster ($C_3$) is started and $(6, 6, 6)^t$ is assigned to $C_3$ as its leader.

Note that $(6, 5, 7)^t$ is at a distance of 15 units from the leader of $C_1$, 8 units from the leader of $C_2$, and 2 units from that of $C_3$. So, it is assigned to $C_3$.

Similarly, the remaining two patterns, $(6, 7, 5)^t$ and $(7, 5, 6)^t$, are assigned to $C_3$ in sequence because each of them is at a distance of 2 units from $(6, 6, 6)^t$, the leader of $C_3$. Further, each of them is at a distance of 15 units from the leader of $C_1$ and 8 units from the leader of $C_2$.

So, we end up with three clusters with their respective leaders as given in Table 1.

| Cluster | $C_1$ | $C_2$ | $C_3$ |
|---|---|---|---|
| Leader | $(1, 1, 1)^t$ | $(6, 3, 1)^t$ | $(6, 6, 6)^t$ |

Table 1: Cluster Representatives

(b) Abstraction $A_k$ is a tree: the CF (Clustering Feature) tree in BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies). Here, $A_k$ is the CF-tree after inserting $k$ patterns. Each node in the tree stores information such as the linear sum of patterns, the squared sum of patterns, and the number of patterns assigned to a subcluster (cluster features, or sufficient statistics), from which the prototypes for forming clusters are obtained later. The tree may be illustrated using Figure 1. The CF vector representation is ideally suited to representing cluster structures, because the vector corresponding to a merged cluster is obtained by adding the vectors corresponding to the constituent clusters.
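The additivity of cluster features can be illustrated with a small Python sketch. The class name `CF` and its methods are hypothetical, and the sketch assumes the squared sum is kept as a single scalar (the sum of squared norms of the patterns), which is one common way to store it. Under that assumption the centroid is $LS/N$ and the radius is $\sqrt{SS/N - \lVert LS/N \rVert^2}$.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Sequence, Tuple


@dataclass(frozen=True)
class CF:
    """Cluster feature: count, linear sum, and scalar squared sum."""
    n: int
    ls: Tuple[float, ...]
    ss: float

    @classmethod
    def from_point(cls, p: Sequence[float]) -> "CF":
        return cls(1, tuple(float(v) for v in p), float(sum(v * v for v in p)))

    def __add__(self, other: "CF") -> "CF":
        # Merging two subclusters is component-wise addition.
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self) -> Tuple[float, ...]:
        return tuple(v / self.n for v in self.ls)

    def radius(self) -> float:
        # Root-mean-square distance of the members from the centroid.
        c2 = sum(v * v for v in self.centroid())
        return sqrt(max(self.ss / self.n - c2, 0.0))


a = CF.from_point((1, 1, 1)) + CF.from_point((1, 1, 2))
b = CF.from_point((2, 1, 1))
merged = a + b

if __name__ == "__main__":
    print(merged, merged.centroid(), merged.radius())
```

Because merging only adds three quantities, a node never needs to revisit the original patterns, which is why these statistics are sufficient for building the tree in one scan.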

Figure 1: Example CF-Tree.

Tree construction: We illustrate the construction of the CF-tree using an example.

Example 2

Consider the following collection of eight 3-dimensional points:

$(1, 1, 1)^t$, $(1, 1, 2)^t$, $(1, 3, 2)^t$, $(2, 1, 1)^t$, $(6, 3, 1)^t$, $(6, 4, 4)^t$, $(6, 6, 6)^t$, $(6, 5, 7)^t$

A simplified version of the CF-tree: We show the tree constructed after inserting 1, 2, 3, and 4 patterns in Figure 2. In this simple case, we use a binary tree and each leaf node in the tree can accommodate two clusters; each cluster consists of points falling in a sphere of radius 2 (threshold) units. A cluster is represented by a simplified version of the CF vector; it consists of the number of elements in the cluster and the linear sum of the vectors in the cluster. So, each vector is of dimension 4.

After inserting (1,1,1), the vector is (1,1,1,1). The next pattern, (1,1,2), is at a distance of 1 unit from the current cluster center. Because the threshold is 2 units, we assign (1,1,2) to the same cluster to give the vector (2,2,2,3). Now, (1,3,2) is at a Euclidean distance of about 2.06 units from the centroid (1, 1, 1.5) of that cluster, which exceeds the threshold; so, a new cluster is started with the vector (1,1,3,2). Next, we consider (2,1,1); it is at a distance of about 1.12 units from the centroid of the first cluster. So, we assign it to cluster 1, and the resulting vector of cluster 1 is (3,4,3,4).

| After inserting | Root entry | Child entries |
|---|---|---|
| (1,1,1) | (1, 1, 1, 1) | |
| (1,1,2) | (2, 2, 2, 3) | |
| (1,3,2) | (3, 3, 5, 5) | (2, 2, 2, 3), (1, 1, 3, 2) |
| (2,1,1) | (4, 5, 6, 6) | (3, 4, 3, 4), (1, 1, 3, 2) |

Figure 2: CF-Tree after inserting 1-4 patterns

Tree for the eight patterns: Next we insert the remaining 4 patterns; the resulting CF-tree is shown in Figure 3. Note that each node in the tree has degree 2; it can accommodate up to two children. In a more practical setting, each node can have a degree of 100 or 1000, based on the size of the data set. The degree, $B$, of each non-leaf node and the degree, $L$, of each leaf node could be different. Here, $B = L = 2$.
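The leaf-level bookkeeping of Example 2 can be sketched as follows. This is a deliberately flattened sketch, not BIRCH itself: it keeps the simplified vectors (count followed by the linear sum) in a plain list instead of a tree, uses Euclidean distance to the nearest centroid as the admission test, and ignores node capacities and splits. The function name `insert_point` is hypothetical.

```python
from math import dist
from typing import List, Sequence, Tuple

Entry = Tuple[float, ...]  # (count, sum_x, sum_y, sum_z)


def insert_point(entries: List[Entry], point: Sequence[float],
                 threshold: float) -> None:
    """Absorb the point into the nearest cluster or start a new one.

    Assumption: a point is absorbed when its Euclidean distance to the
    nearest centroid is below the threshold.
    """
    best, best_d = None, float("inf")
    for i, (n, *linear_sum) in enumerate(entries):
        centroid = [s / n for s in linear_sum]
        d = dist(centroid, point)
        if d < best_d:
            best, best_d = i, d
    if best is not None and best_d < threshold:
        n, *linear_sum = entries[best]
        # Update the count and the linear sum in place.
        entries[best] = (n + 1, *(s + x for s, x in zip(linear_sum, point)))
    else:
        entries.append((1, *point))


POINTS = [(1, 1, 1), (1, 1, 2), (1, 3, 2), (2, 1, 1),
          (6, 3, 1), (6, 4, 4), (6, 6, 6), (6, 5, 7)]

entries: List[Entry] = []
for p in POINTS:
    insert_point(entries, p, threshold=2)

if __name__ == "__main__":
    print(entries)
```

On the eight points of Example 2 this yields five cluster vectors; a full CF-tree would additionally route each point down from the root and split nodes that exceed their capacity.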

The CF-tree does not explicitly accommodate all the patterns. At the leaf level, it represents all the points in a cluster by the corresponding vector. For example, the vector (3,4,3,4) in Figure 2 and in Figure 3 represents a cluster of 3 points, $(1, 1, 1)^t$, $(1, 1, 2)^t$, and $(2, 1, 1)^t$, which fall in a sphere of radius less than 2 units. It is possible to compute the mean (centroid) of all the points in the cluster from the vector. For example, the mean of the cluster represented by the vector (3, 4, 3, 4) is $(\tfrac{4}{3}, 1, \tfrac{4}{3})$.

| Level | Entries |
|---|---|
| Top | (6, 17, 13, 11), (2, 12, 11, 13) |
| Middle | (4, 5, 6, 6), (2, 12, 7, 5), (2, 12, 11, 13) |
| Leaf | (3, 4, 3, 4), (1, 1, 3, 2), (1, 6, 3, 1), (1, 6, 4, 4), (2, 12, 11, 13) |

Figure 3: CF-Tree after inserting all the patterns
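The arithmetic above can be checked with exact fractions. The sketch below is illustrative only: the helper names `centroid_from_vector` and `add_vectors` are hypothetical, and the leaf vectors are the five leaf entries listed for Figure 3.

```python
from fractions import Fraction
from functools import reduce
from typing import Sequence, Tuple


def centroid_from_vector(vec: Sequence[int]) -> Tuple[Fraction, ...]:
    """Mean of a cluster from its (count, linear sum) vector."""
    n, *linear_sum = vec
    return tuple(Fraction(s, n) for s in linear_sum)


def add_vectors(a: Sequence[int], b: Sequence[int]) -> Tuple[int, ...]:
    """A parent entry is the component-wise sum of its children."""
    return tuple(x + y for x, y in zip(a, b))


centroid = centroid_from_vector((3, 4, 3, 4))

LEAVES = [(3, 4, 3, 4), (1, 1, 3, 2), (1, 6, 3, 1), (1, 6, 4, 4), (2, 12, 11, 13)]
total = reduce(add_vectors, LEAVES)
top_level = add_vectors((6, 17, 13, 11), (2, 12, 11, 13))

if __name__ == "__main__":
    print(centroid, total, top_level)
```

Summing the leaf vectors gives the same total as summing the two top-level entries, which confirms that every level of the tree summarizes all eight patterns without storing them individually.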


### Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distance-based K-means, K-medoids,

### Physical Database Design Process. Physical Database Design Process. Major Inputs to Physical Database. Components of Physical Database Design

Physical Database Design Process Physical Database Design Process The last stage of the database design process. A process of mapping the logical database structure developed in previous stages into internal

### Lecture 10: Regression Trees

Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

### Content-Based Recommendation

Content-Based Recommendation Content-based? Item descriptions to identify items that are of particular interest to the user Example Example Comparing with Noncontent based Items User-based CF Searches

### Distributed Computing and Big Data: Hadoop and MapReduce

Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:

### CELLULAR MANUFACTURING

CELLULAR MANUFACTURING Grouping Machines logically so that material handling (move time, wait time for moves and using smaller batch sizes) and setup (part family tooling and sequencing) can be minimized.

### Flat Clustering K-Means Algorithm

Flat Clustering K-Means Algorithm 1. Purpose. Clustering algorithms group a set of documents into subsets or clusters. The cluster algorithms goal is to create clusters that are coherent internally, but

### Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr WEKA Gallirallus Zeland) australis : Endemic bird (New Characteristics Waikato university Weka is a collection

### TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP

TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző csaba.fozo@lloydsbanking.com 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions

### Finding Frequent Patterns Based On Quantitative Binary Attributes Using FP-Growth Algorithm

R. Sridevi et al Int. Journal of Engineering Research and Applications RESEARCH ARTICLE OPEN ACCESS Finding Frequent Patterns Based On Quantitative Binary Attributes Using FP-Growth Algorithm R. Sridevi,*

### K-means Clustering Technique on Search Engine Dataset using Data Mining Tool

International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 6 (2013), pp. 505-510 International Research Publications House http://www. irphouse.com /ijict.htm K-means

### Clustering. Chapter 7. 7.1 Introduction to Clustering Techniques. 7.1.1 Points, Spaces, and Distances

240 Chapter 7 Clustering Clustering is the process of examining a collection of points, and grouping the points into clusters according to some distance measure. The goal is that points in the same cluster

### A comparison of various clustering methods and algorithms in data mining

Volume :2, Issue :5, 32-36 May 2015 www.allsubjectjournal.com e-issn: 2349-4182 p-issn: 2349-5979 Impact Factor: 3.762 R.Tamilselvi B.Sivasakthi R.Kavitha Assistant Professor A comparison of various clustering

### Building Data Cubes and Mining Them. Jelena Jovanovic Email: jeljov@fon.bg.ac.yu

Building Data Cubes and Mining Them Jelena Jovanovic Email: jeljov@fon.bg.ac.yu KDD Process KDD is an overall process of discovering useful knowledge from data. Data mining is a particular step in the

### International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering

### Regression Using Support Vector Machines: Basic Foundations

Regression Using Support Vector Machines: Basic Foundations Technical Report December 2004 Aly Farag and Refaat M Mohamed Computer Vision and Image Processing Laboratory Electrical and Computer Engineering