Clustering. Data Mining. Abraham Otero.



Agenda
Introduction
Distance
K-nearest neighbors
K-Means
DBSCAN
Hierarchical clustering
Quick reference

Introduction

It seems logical that in a new situation we should act in a similar way as we did in previous similar situations, provided we succeeded in them. To take advantage of this strategy, it is necessary to define what is meant by "similar", or the equivalent mathematical concept of "distance". It is also necessary to determine when we are going to take advantage of this similarity:
In an eager mode, processing the available data before starting the process.
In a lazy mode, processing the data as it arrives.

Problem formulation:

Distance

Several common distances:
The p-norm (Euclidean for p = 2, Minkowski for p > 2).
Manhattan (the p-norm with p = 1).
Chebyshev (the limit of the p-norm as p tends to infinity).
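As an illustration, here is a minimal Python sketch of these distances; the function names and sample points are ours, not from any particular library:

```python
def minkowski(x, y, p):
    """p-norm distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def chebyshev(x, y):
    """Limit of the p-norm as p grows: the largest coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (0.0, 0.0), (3.0, 4.0)
print(minkowski(x, y, 1))  # 7.0 (Manhattan)
print(minkowski(x, y, 2))  # 5.0 (Euclidean)
print(chebyshev(x, y))     # 4.0
```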

Be careful when applying distances: attributes measured on very different scales can dominate the result.

Always normalize first. But when normalizing, beware of outliers!
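A small sketch of two common normalization schemes (the toy data is ours), showing how a single outlier squashes min-max normalization; robust alternatives use, e.g., the median and the interquartile range:

```python
def min_max(values):
    """Rescale to [0, 1]; very sensitive to extreme values."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center on the mean and scale by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

ages = [20, 22, 25, 24, 23, 90]  # 90 is an outlier
print(min_max(ages))             # the normal ages are crammed near 0
print(z_score(ages))
```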

Sometimes we need to calculate the distance between a point and a set of points: for example, the distance to the closest member of the set, to the farthest member, or to the centroid of the set.

k-nearest neighbors

The k-nearest neighbors algorithm (k-NN) classifies objects based on the closest training examples in the feature space. It is an instance-based, lazy learning algorithm: an object is classified by a majority vote of its neighbors, and is assigned to the class that is most common amongst its k nearest neighbors.

It is one of the simplest classification methods. It requires an initial set of labeled points, and it is critical to determine an appropriate value for k: try several values.
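A minimal k-NN sketch in Python; the point representation, function names, and toy data are ours:

```python
import math
from collections import Counter

def knn_classify(query, labeled_points, k):
    """Majority vote among the k labeled points closest to the query."""
    neighbors = sorted(labeled_points,
                       key=lambda pl: math.dist(query, pl[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1, 1), "circle"), ((1, 2), "circle"),
         ((5, 5), "square"), ((6, 5), "square")]
print(knn_classify((2, 2), train, k=3))  # "circle"
```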

K-Means

K-Means is a prototype-based clustering algorithm. Each of the classes is represented by a prototype vector (a fictitious instance of the class) called a centroid. Once the centroids have been calculated, to classify a new element we simply find its closest centroid: this will be its class. The centroids divide the space into a set of regions called Voronoi regions.

Centroid calculation: each centroid is the mean of the points currently assigned to its cluster.

K-Means algorithm: choose k initial centroids, assign each point to its closest centroid, recompute each centroid as the mean of the points assigned to it, and repeat until the centroids no longer change.
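A compact sketch of this iteration in Python; the function name and toy data are ours, and a library implementation should be preferred in practice:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Assign points to the nearest centroid, move centroids to the mean, repeat."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        new = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:  # converged: no centroid moved
            break
        centroids = new
    return centroids, clusters

pts = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
print(kmeans(pts, k=2)[0])
```

Since the result depends on the initial centroids, a common practice is to run the algorithm with several seeds and keep the solution with the smallest within-cluster sum of squared distances.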

A sample (successful) run converges to the natural clusters. But initialization matters: try different initial values.

The selection of k is also critical (compare, e.g., k = 3 vs. k = 4 on the same data): try different k values.

Limitations: clusters of different sizes.

Limitations: clusters of different densities.

Limitations: non-globular shapes.

One possible solution is to use many clusters: find parts of the natural clusters, and then put them together.

What about nominal attributes? We can define a function delta(a, b) that is 0 if a = b, and 1 otherwise. The distance between two instances is then given by the sum of delta over all of their attributes: d(x, y) = delta(x1, y1) + ... + delta(xn, yn).
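A one-line sketch of this distance (illustrative names):

```python
def nominal_distance(x, y):
    """Count the attributes on which the two instances differ."""
    return sum(0 if a == b else 1 for a, b in zip(x, y))

print(nominal_distance(("red", "small", "round"), ("red", "large", "round")))  # 1
```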

K-Means demos:
http://home.dei.polimi.it/matteucc/clustering/tutorial_html/appletkm.html
http://www.cs.ualberta.ca/~yaling/cluster/Applet/Code/Cluster.html

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a data clustering algorithm that is not prototype based. It finds a number of clusters starting from the estimated density distribution of the data points. It classifies points into three categories:
A core point has more than a specified number of points (MinPts) within a radius Eps (these points are in the interior of a cluster).
A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
A noise point is any point that is not a core point or a border point.

Algorithm:
Classify points as core, border, or noise.
Eliminate the noise points.
Perform clustering on the remaining points: put core points that are within Eps of each other in the same cluster, and assign each border point to the cluster of one of its core points.
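A straightforward quadratic-time Python sketch of this procedure (illustrative names and toy data; real implementations use spatial indexes to find neighborhoods efficiently):

```python
import math

def dbscan(points, eps, min_pts):
    """Return a label per point: a cluster id, or -1 for noise."""
    labels = {p: None for p in points}
    neighbors = lambda p: [q for q in points if math.dist(p, q) <= eps]
    cluster = 0
    for p in points:
        if labels[p] is not None:
            continue
        nb = neighbors(p)
        if len(nb) < min_pts:        # not a core point (may become border later)
            labels[p] = -1
            continue
        labels[p] = cluster          # p is a core point: grow a new cluster
        queue = [q for q in nb if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:      # noise reachable from a core point -> border
                labels[q] = cluster
            if labels[q] is not None:
                continue
            labels[q] = cluster
            qnb = neighbors(q)
            if len(qnb) >= min_pts:  # q is also a core point: keep expanding
                queue.extend(qnb)
        cluster += 1
    return labels

pts = [(1, 1), (1.2, 1.1), (0.9, 1.0), (5, 5), (5.1, 5.2), (4.9, 5.0), (9, 1)]
print(dbscan(pts, eps=0.5, min_pts=3))  # two clusters; (9, 1) is noise
```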

Strong points: resistant to noise; can handle clusters of different shapes and sizes.

Weak points: clusters with varying densities; high-dimensional data (where the density usually becomes too sparse).

Parameter determination: for MinPts a small number is usually employed; for two-dimensional experimental data it has been shown that 4 is the most reasonable value. Eps is trickier, as we have seen. A possible solution: for points in a cluster, their k-th nearest neighbors are at roughly the same distance, while noise points have their k-th nearest neighbor at a farther distance. So, plot the sorted distance of every point to its k-th nearest neighbor and look for a sharp increase; that distance is a good candidate for Eps.
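A sketch of this heuristic (illustrative names and toy data): the jump at the end of the sorted list marks the noise point and suggests a value for Eps:

```python
import math

def sorted_kth_distances(points, k):
    """For each point, the distance to its k-th nearest neighbor, sorted ascending."""
    dists = []
    for p in points:
        d = sorted(math.dist(p, q) for q in points if q != p)
        dists.append(d[k - 1])
    return sorted(dists)

pts = [(1, 1), (1.2, 1.1), (0.9, 1.0), (5, 5), (5.1, 5.2), (4.9, 5.0), (9, 1)]
print(sorted_kth_distances(pts, k=2))
```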

DBSCAN demo:
http://www.cs.ualberta.ca/~yaling/cluster/applet/code/cluster.html

Hierarchical clustering

Hierarchical clustering builds a hierarchy of clusters based on distance measurements. The traditional representation of this hierarchy is a tree (called a dendrogram), with the individual elements at the leaves and a single cluster containing every element at the root. The tree-like diagram can be interpreted as a sequence of merges or splits, and any desired number of clusters can be obtained by cutting the dendrogram at the proper level.

There are two main types of hierarchical clustering:
Agglomerative (AGNES, Agglomerative NESting): starts with the points as individual clusters; at each step, it merges the closest pair of clusters, until only one cluster (or k clusters) remains.
Divisive (DIANA, DIvisive ANAlysis): starts with one, all-inclusive cluster; at each step, it splits a cluster, until each cluster contains a single point (or there are k clusters).
In both cases, once a decision is made to combine or split two clusters, it cannot be undone: there is no global minimization.

How do we define the inter-cluster distance?
Single link (minimum distance between members): can handle non-elliptical clusters, but is sensitive to noise and outliers.
Complete link (maximum distance between members): less sensitive to noise and outliers, but tends to break large clusters and is biased towards globular clusters.
Group average and centroid distance: less sensitive to noise and outliers, but biased towards globular clusters.
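A small agglomerative (AGNES-style) sketch supporting single and complete link; the names and toy data are ours, and for real data library routines such as scipy.cluster.hierarchy.linkage are preferable:

```python
import math

def agglomerative(points, k, linkage="single"):
    """Repeatedly merge the closest pair of clusters until k clusters remain."""
    def dist(c1, c2):
        pairwise = [math.dist(p, q) for p in c1 for q in c2]
        return min(pairwise) if linkage == "single" else max(pairwise)
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)  # merge cluster j into cluster i
    return clusters

pts = [(1, 1), (1.5, 1.2), (5, 5), (5.5, 5.1), (9, 9)]
print(agglomerative(pts, k=3, linkage="complete"))
```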

Hierarchical clustering demo:
http://home.dei.polimi.it/matteucc/clustering/tutorial_html/appleth.html

Quick reference

Some general tips for choosing a clustering algorithm:
Prototype-based and hierarchical clustering (except single link) tend to form globular clusters. This is good for vector quantization but not for other kinds of data.
Density-based and graph-based methods (including single link) tend to form non-globular clusters.
Most clustering algorithms work well in low-dimensional spaces. If the dimensionality of the data is very large, consider reducing the dimensionality beforehand (e.g., with PCA).
If a taxonomy is to be created, consider hierarchical clustering. If a summarization of the data is needed, consider a partitional clustering.
Can we allow the algorithm to discard outliers? They might represent, for example, unusually profitable customers.
Is it necessary to classify all the data? (For example, we may have to classify all documents in a database.)
Computing the mean makes sense only for real-valued attributes (K-Means). Define an appropriate distance for the data at hand (for example, the Euclidean distance is valid for real-valued attributes only).