MODULE 15 Clustering Large Datasets LESSON 34


 Angela Todd
Incremental Clustering
Keywords: Single Database Scan, Leader, BIRCH, Tree
Clustering Large Datasets

Pattern matrix: It is convenient to view the input data as a pattern matrix of size n × d, where there are n patterns (rows) and each pattern is represented by d feature values (columns).

Data compression: Using a suitable algorithm, it is possible to cluster the rows, the columns, or both of the pattern matrix. Clustering the rows is helpful in prototype selection, and clustering the columns aids in feature selection.

Versatility of algorithms: Hierarchical algorithms are more versatile than partitional algorithms. For example, the single-link clustering algorithm works well on data sets containing non-isotropic clusters, including well-separated, chain-like, and concentric clusters, whereas a typical partitional algorithm like the k-means algorithm works well only on data sets having isotropic clusters.

Hierarchical algorithms are expensive: The k-means algorithm is one of the most popular partitional algorithms; it needs O(nkdl) time to cluster n patterns into k clusters using l iterations. Each iteration of the algorithm scans the data set once, so it requires l data set scans. On the other hand, hierarchical algorithms initially compute a proximity matrix of size n × n and use this matrix to cluster the n patterns. Computation and storage of this proximity matrix alone need O(n^2 d) time and O(n^2) space, both of which grow quadratically with n.

Large datasets: There are several applications where the size of the pattern matrix is large. By large, we mean that the entire pattern matrix cannot be accommodated in the main memory of the computer. So, we store the input data on a secondary storage medium like a disk and transfer the data in parts to the main memory for processing.

Applications: For example, a transaction database of a supermarket chain may consist of trillions of transactions, and each transaction is a sparse vector of very high dimensionality; the dimensionality depends on the number of product lines. Similarly, in a network intrusion detection application, the number of connections could be prohibitively large, and the number of packets to be analyzed or classified could be even larger. Another application is the clustering of click-streams; this forms an important part of web usage mining. Other applications include genome sequence clustering, where the dimensionality could run into millions, text mining, and biometrics.

Feasibility of algorithms: The pattern matrix is large when either n or d or both are large. An increase in n increases the size of the proximity matrix quadratically; this limits the applicability of the hierarchical algorithms that use the proximity matrix for grouping. Even partitional algorithms like the k-means algorithm may demand multiple passes through the data and may be infeasible on large data sets.

Possible Solutions

Large data: An objective way of characterizing the largeness of a data set is by specifying bounds on the number of patterns and features present. For example, a data set having more than a billion patterns and/or more than a million features is large. However, such a characterization is not universally acceptable and is bound to change with developments in technology. For example, in the 1960s, large meant several thousand patterns. So, it is good to consider a more pragmatic characterization: a data set is large if it is not possible to fit the data in the main memory of the machine on which it is processed.

Number of dataset scans: So, the data resides on a secondary storage device and has to be transferred to the main memory based on need. Further, accessing secondary storage is several orders of magnitude slower than accessing the main memory. This assumption underlies the design of data mining algorithms that routinely process large data sets, and it prompts us to consider the number of dataset scans an algorithm makes.
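To make the quadratic growth of the proximity matrix concrete, the following is a rough back-of-the-envelope sketch. It assumes dense storage at 8 bytes per value and a hypothetical dimensionality d = 100; neither figure comes from the lesson, and the function names are illustrative only.

```python
def pattern_matrix_bytes(n, d, bytes_per_value=8):
    """Bytes needed to hold a dense n x d pattern matrix (assumed 8 bytes per value)."""
    return n * d * bytes_per_value


def proximity_matrix_bytes(n, bytes_per_value=8):
    """Bytes needed to hold a full n x n proximity matrix (assumed 8 bytes per value)."""
    return n * n * bytes_per_value


GIB = 1024 ** 3
d = 100  # hypothetical number of features, chosen only for illustration

for n in (10_000, 100_000, 1_000_000):
    data_gib = pattern_matrix_bytes(n, d) / GIB
    prox_gib = proximity_matrix_bytes(n) / GIB
    print(f"n={n:>9,}  pattern matrix: {data_gib:10.3f} GiB  proximity matrix: {prox_gib:12.3f} GiB")
```

Under these assumptions the pattern matrix grows linearly with n while the proximity matrix grows with n squared, so the proximity matrix is what stops fitting in main memory first.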
Feasibility of the clustering algorithms: It is important that clustering algorithms that work with large data sets scale up well. Algorithms having nonlinear time and space complexities are ruled out. Even algorithms requiring linear time and space may not be feasible if they scan the data set several times. Based on these observations, it is possible to list the following solutions for clustering large data sets.

1. Incremental Clustering

The basis of incremental clustering is that the data is considered sequentially and the patterns are processed step by step. Such algorithms are useful in processing stream data. In most incremental clustering algorithms, one of the patterns in the data set (usually the first pattern) is selected to form an initial cluster. Each of the remaining points is either assigned to one of the existing clusters or used to form a new cluster, based on some criterion. Here, a new data item is assigned to a cluster without affecting the existing clusters.

Characterization of incremental clustering: We can characterize incremental clustering formally as follows. Let X = {X_1, X_2, ..., X_n} be the set of n patterns, where X_i is the i-th pattern. In incremental clustering, the data is considered sequentially, say in the order X_1, X_2, ..., X_n. Let A_k represent the abstraction generated using the first k patterns and A_n represent the abstraction obtained after all n patterns are processed. Further, in incremental clustering, A_{k+1} is obtained using A_k and X_{k+1} only.

Abstraction generated using clustering: A_k varies from algorithm to algorithm and can take different forms. Some of them are:

(a) Abstraction A_k is a set of prototypes or cluster representatives. The Leader clustering algorithm is a well-known member of this category. It is described below.

Leader Clustering Algorithm
i. Assign the first data item to a cluster.
ii. Assign the next data item to one of the existing clusters or to a new cluster. It is assigned to an existing cluster if the distance between the data item and the cluster representative (leader) is less than a user-provided threshold (T). Otherwise, a new cluster is started.
iii. Repeat step ii till all the data items are assigned to clusters.

The Leader algorithm is among the simplest algorithms for handling large data. We explain it using an example.

Example 1

Consider the following collection of ten 3-dimensional patterns:
(1, 1, 1)^t, (1, 1, 2)^t, (1, 3, 2)^t, (2, 1, 1)^t, (6, 3, 1)^t, (6, 4, 4)^t, (6, 6, 6)^t, (6, 5, 7)^t, (6, 7, 5)^t, (7, 5, 6)^t

Let the user-specified threshold be 5 units, and let the L1 norm be used to compute the distance between a pair of points.

First we consider the pattern (1, 1, 1)^t. It is assigned to cluster C_1; it is the leader of C_1. Next we consider (1, 1, 2)^t. The distance (L1 norm) of this pattern from the leader of C_1 is 1 unit, which is less than the threshold of 5 units. So, (1, 1, 2)^t is assigned to C_1. Next we consider (1, 3, 2)^t. Its distance from the only leader is 3 units; so, it is assigned to C_1. Now consider (2, 1, 1)^t. Its distance from the leader of C_1 is 1 unit, and so it is assigned to C_1.

Next we consider (6, 3, 1)^t. This pattern is at a distance of 7 units from the leader of C_1; the distance is above the threshold of 5 units. So, we start a new cluster, C_2, and assign (6, 3, 1)^t to C_2. So, the leader of C_2 is (6, 3, 1)^t. Now (6, 4, 4)^t is processed. It is at a distance of 11 units from the leader of C_1, but its distance from the leader of C_2, (6, 3, 1)^t, is 4 units. So, it is assigned to C_2.
The pattern (6, 6, 6)^t is considered now. It is at a distance of 15 units from the leader of C_1 and at a distance of 8 units from that of C_2. As both these distances are more than the threshold of 5 units, a new cluster (C_3) is started and (6, 6, 6)^t is assigned to C_3 as its leader. Note that (6, 5, 7)^t is at a distance of 15 units from the leader of C_1, 8 units from the leader of C_2, and 2 units from that of C_3. So, it is assigned to C_3. Similarly, the remaining two patterns, (6, 7, 5)^t and (7, 5, 6)^t, are assigned to C_3 in sequence because each of them is at a distance of 2 units from (6, 6, 6)^t, the leader of C_3. Further, each of them is at a distance of 15 units from the leader of C_1 and 8 units from the leader of C_2. So, we end up with three clusters with their respective leaders as given in Table 1.

Cluster   C_1            C_2            C_3
Leader    (1, 1, 1)^t    (6, 3, 1)^t    (6, 6, 6)^t

Table 1: Cluster Representatives

(b) The CF (Clustering Feature) tree in BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies). Here, A_k is the tree after inserting k patterns. Each node in the tree stores information such as the linear sum of the patterns, the squared sum of the patterns, and the number of patterns assigned to a subcluster (cluster features, or sufficient statistics); these are used to obtain prototypes for forming clusters later. The tree may be illustrated using Figure 1. The vector representation is ideally suited to represent cluster structures because the vector corresponding to a merged cluster is obtained by adding the vectors corresponding to the constituent clusters.
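Before moving on to the tree construction, the Leader run in Example 1 can be checked with a short sketch. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and when several leaders lie within the threshold the sketch picks the nearest one, which is an assumption (the algorithm as stated only requires some leader within the threshold).

```python
def l1_distance(a, b):
    """L1 (city-block) distance between two equal-length tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def leader_clustering(patterns, threshold, distance=l1_distance):
    """Single-scan Leader clustering.

    Returns (leaders, assignments), where assignments[i] is the index of
    the cluster that patterns[i] was placed in.
    """
    leaders = []
    assignments = []
    for pattern in patterns:  # each pattern is read exactly once
        best_index, best_dist = None, None
        for i, leader in enumerate(leaders):
            dist = distance(pattern, leader)
            # Assumption: among leaders within the threshold, take the nearest.
            if dist < threshold and (best_dist is None or dist < best_dist):
                best_index, best_dist = i, dist
        if best_index is None:
            # No leader is close enough: the pattern becomes a new leader.
            leaders.append(pattern)
            best_index = len(leaders) - 1
        assignments.append(best_index)
    return leaders, assignments


patterns = [
    (1, 1, 1), (1, 1, 2), (1, 3, 2), (2, 1, 1), (6, 3, 1),
    (6, 4, 4), (6, 6, 6), (6, 5, 7), (6, 7, 5), (7, 5, 6),
]
leaders, assignments = leader_clustering(patterns, threshold=5)
print(leaders)      # the three leaders listed in Table 1
print(assignments)  # cluster index (0-based) for each pattern, in input order
```

Only the list of leaders has to stay in memory, which is why a single scan of the data set suffices.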
Figure 1: Example Tree.

Tree construction: We illustrate the construction of the tree using an example.

Example 2

Consider the following collection of eight 3-dimensional points:
(1, 1, 1)^t, (1, 1, 2)^t, (1, 3, 2)^t, (2, 1, 1)^t, (6, 3, 1)^t, (6, 4, 4)^t, (6, 6, 6)^t, (6, 5, 7)^t

A simplified version of the tree: We show the tree constructed after inserting 1, 2, 3, and 4 patterns in Figure 2. In this simple case, we use a binary tree, and each leaf node in the tree can accommodate two clusters; each cluster consists of points falling in a sphere of radius 2 (threshold) units. A cluster is represented by a simplified version of the cluster feature vector; it consists of the number of elements in the cluster and the linear sum of the vectors in the cluster. So, each vector is of dimension 4.

After inserting (1,1,1), the vector is (1,1,1,1). The next pattern, (1,1,2), is at a distance of 1 unit from the current cluster center. Because the threshold is 2 units, we assign (1,1,2) to the same cluster to give the vector (2,2,2,3). Now, (1,3,2) is more than 2 units from the current centroid (1, 1, 1.5): 2.5 units in the L1 norm and about 2.06 units in the Euclidean norm. So, a new cluster is started with the vector (1,1,3,2). Next, we consider (2,1,1); it is within 2 units of the centroid of the first cluster (1.5 units in the L1 norm, about 1.12 units in the Euclidean norm). So, we assign it to cluster 1, and the resulting vector of cluster 1 is (3,4,3,4).

After inserting (1,1,1):  (1, 1, 1, 1)
After inserting (1,1,2):  (2, 2, 2, 3)
After inserting (1,3,2):  parent (3, 3, 5, 5); children (2, 2, 2, 3) and (1, 1, 3, 2)
After inserting (2,1,1):  parent (4, 5, 6, 6); children (3, 4, 3, 4) and (1, 1, 3, 2)

Figure 2: Tree after inserting 1 to 4 patterns

Tree for the eight patterns: Next we insert the remaining 4 patterns; the resulting tree is shown in Figure 3. Note that each node in the tree has degree 2; it can accommodate up to two children. In a more practical setting, each node can have a degree of 100 or 1000, based on the size of the data set. The degree, B, of each non-leaf node and the degree, L, of each leaf node could be different. Here, B = L = 2.
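The leaf-level bookkeeping in Example 2 can be sketched as follows. This is a simplified illustration under stated assumptions: it keeps a flat list of cluster vectors and ignores the tree structure and node splitting entirely, it measures Euclidean distance to the centroid (the example does not name the norm), and it absorbs a point when that distance is at most the threshold. All function names are hypothetical.

```python
import math


def centroid(cf):
    """Centroid of a cluster stored as (count, linear_sum)."""
    count, linear_sum = cf
    return [s / count for s in linear_sum]


def insert_point(entries, point, threshold):
    """Absorb `point` into the nearest entry within `threshold`, else start a new entry."""
    best = None  # (index, distance) of the nearest qualifying entry
    for i, cf in enumerate(entries):
        dist = math.dist(centroid(cf), point)  # assumption: Euclidean distance
        if dist <= threshold and (best is None or dist < best[1]):
            best = (i, dist)
    if best is None:
        entries.append((1, list(point)))
    else:
        count, linear_sum = entries[best[0]]
        # Updating a cluster only needs the count and the component-wise sum.
        entries[best[0]] = (count + 1, [s + x for s, x in zip(linear_sum, point)])
    return entries


def as_vector(cf):
    """Flatten (count, linear_sum) into the 4-dimensional vector used in the example."""
    count, linear_sum = cf
    return (count, *linear_sum)


points = [
    (1, 1, 1), (1, 1, 2), (1, 3, 2), (2, 1, 1),
    (6, 3, 1), (6, 4, 4), (6, 6, 6), (6, 5, 7),
]
entries = []
for p in points:
    insert_point(entries, p, threshold=2)

leaf_vectors = [as_vector(cf) for cf in entries]
print(leaf_vectors)
```

Under these assumptions the first four insertions give the vectors (3,4,3,4) and (1,1,3,2), matching Figure 2. A full BIRCH implementation would additionally store squared sums and split nodes that exceed their capacity.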
The tree does not explicitly store all the patterns. At the leaf level, it represents all the points in a cluster by the corresponding vector. For example, the vector (3,4,3,4) in Figure 2 and in Figure 3 represents a cluster of 3 points, (1, 1, 1)^t, (1, 1, 2)^t, and (2, 1, 1)^t, which fall in a sphere of radius less than 2 units. It is possible to compute the mean (centroid) of all the points in the cluster from the vector. For example, the mean of the cluster represented by the vector (3, 4, 3, 4) is (4/3, 1, 4/3).

Root level:    (6, 17, 13, 11)   (2, 12, 11, 13)
Middle level:  (4, 5, 6, 6)   (2, 12, 7, 5)   (2, 12, 11, 13)
Leaf level:    (3, 4, 3, 4)   (1, 1, 3, 2)   (1, 6, 3, 1)   (1, 6, 4, 4)   (2, 12, 11, 13)

Figure 3: Tree after inserting all the patterns
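The two properties used above, additivity of the vectors and recovery of the centroid from a vector, can be checked with a small sketch. The helper names `merge` and `mean` are hypothetical, and exact fractions are used only so that 4/3 prints without rounding.

```python
from fractions import Fraction


def merge(*vectors):
    """Vector of a merged cluster: component-wise sum of the constituent vectors."""
    return tuple(sum(column) for column in zip(*vectors))


def mean(vector):
    """Centroid recovered from (count, sum_1, ..., sum_d): each sum divided by the count."""
    count, *linear_sum = vector
    return tuple(Fraction(s, count) for s in linear_sum)


# Leaf-level vectors from Figure 3, merged upwards.
left = merge((3, 4, 3, 4), (1, 1, 3, 2))
right = merge((1, 6, 3, 1), (1, 6, 4, 4))
root_entry = merge(left, right)

print(left, right, root_entry)
print(mean((3, 4, 3, 4)))
```

Because a parent vector is just the sum of its children, non-leaf entries can be maintained during insertion without revisiting any of the original points.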
More informationChapter 12 File Management. Roadmap
Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 12 File Management Dave Bremer Otago Polytechnic, N.Z. 2008, Prentice Hall Overview Roadmap File organisation and Access
More informationChapter 12 File Management
Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 12 File Management Dave Bremer Otago Polytechnic, N.Z. 2008, Prentice Hall Roadmap Overview File organisation and Access
More informationMachine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
More informationData Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining
Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distancebased Kmeans, Kmedoids,
More informationPhysical Database Design Process. Physical Database Design Process. Major Inputs to Physical Database. Components of Physical Database Design
Physical Database Design Process Physical Database Design Process The last stage of the database design process. A process of mapping the logical database structure developed in previous stages into internal
More informationLecture 10: Regression Trees
Lecture 10: Regression Trees 36350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,
More informationContentBased Recommendation
ContentBased Recommendation Contentbased? Item descriptions to identify items that are of particular interest to the user Example Example Comparing with Noncontent based Items Userbased CF Searches
More informationDistributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
More informationCELLULAR MANUFACTURING
CELLULAR MANUFACTURING Grouping Machines logically so that material handling (move time, wait time for moves and using smaller batch sizes) and setup (part family tooling and sequencing) can be minimized.
More informationFlat Clustering KMeans Algorithm
Flat Clustering KMeans Algorithm 1. Purpose. Clustering algorithms group a set of documents into subsets or clusters. The cluster algorithms goal is to create clusters that are coherent internally, but
More informationUniversité de Montpellier 2 Hugo AlatristaSalas : hugo.alatristasalas@teledetection.fr
Université de Montpellier 2 Hugo AlatristaSalas : hugo.alatristasalas@teledetection.fr WEKA Gallirallus Zeland) australis : Endemic bird (New Characteristics Waikato university Weka is a collection
More informationTRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP
TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző csaba.fozo@lloydsbanking.com 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions
More informationFinding Frequent Patterns Based On Quantitative Binary Attributes Using FPGrowth Algorithm
R. Sridevi et al Int. Journal of Engineering Research and Applications RESEARCH ARTICLE OPEN ACCESS Finding Frequent Patterns Based On Quantitative Binary Attributes Using FPGrowth Algorithm R. Sridevi,*
More informationKmeans Clustering Technique on Search Engine Dataset using Data Mining Tool
International Journal of Information and Computation Technology. ISSN 09742239 Volume 3, Number 6 (2013), pp. 505510 International Research Publications House http://www. irphouse.com /ijict.htm Kmeans
More informationClustering. Chapter 7. 7.1 Introduction to Clustering Techniques. 7.1.1 Points, Spaces, and Distances
240 Chapter 7 Clustering Clustering is the process of examining a collection of points, and grouping the points into clusters according to some distance measure. The goal is that points in the same cluster
More informationA comparison of various clustering methods and algorithms in data mining
Volume :2, Issue :5, 3236 May 2015 www.allsubjectjournal.com eissn: 23494182 pissn: 23495979 Impact Factor: 3.762 R.Tamilselvi B.Sivasakthi R.Kavitha Assistant Professor A comparison of various clustering
More informationBuilding Data Cubes and Mining Them. Jelena Jovanovic Email: jeljov@fon.bg.ac.yu
Building Data Cubes and Mining Them Jelena Jovanovic Email: jeljov@fon.bg.ac.yu KDD Process KDD is an overall process of discovering useful knowledge from data. Data mining is a particular step in the
More informationInternational Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, MayJune 2015
RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering
More informationRegression Using Support Vector Machines: Basic Foundations
Regression Using Support Vector Machines: Basic Foundations Technical Report December 2004 Aly Farag and Refaat M Mohamed Computer Vision and Image Processing Laboratory Electrical and Computer Engineering
More informationAn Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
More information, each of which contains a unique key value, say k i , R 2. such that k i equals K (or to determine that no such record exists in the collection).
The Search Problem 1 Suppose we have a collection of records, say R 1, R 2,, R N, each of which contains a unique key value, say k i. Given a particular key value, K, the search problem is to locate the
More information