Statistical Databases and Registers with some datamining




Unsupervised learning - Statistical Databases and Registers with some datamining: a course in Survey Methodology and Official Statistics. Pages in the book: 501-528. Department of Statistics, Stockholm University, October 2010

Goals for cluster analysis (also called data segmentation): arrange the objects into a natural hierarchy, and group them with respect to similarities.

You need to identify people with similar patterns of past purchases so that you can tailor your marketing strategies. You've been assigned to group television shows into homogeneous categories based on viewer characteristics. This can be used for market segmentation. You can use it in biology, to derive plant and animal taxonomies, to categorize genes with similar functionality, and to gain insight into structures inherent in populations. You're trying to examine patients with a diagnosis of depression to determine if distinct subgroups can be identified, based on a symptom checklist.

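Definition (Data matrix). Let $x_{ij}$ be the measurement of variable $j$ on object $i$, where $i = 1, 2, \dots, n$ and $j = 1, 2, \dots, p$. We then have the $n \times p$ data matrix

$$
X = \begin{bmatrix}
x_{11} & \cdots & x_{1j} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{ij} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nj} & \cdots & x_{np}
\end{bmatrix}
$$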

Definition (Dissimilarity matrix). Set

$$
D = \begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots &        & \ddots \\
d(n,1) & d(n,2) & \cdots & d(n,n-1) & 0
\end{bmatrix}
$$

where $d(i,j)$ is the measured difference or dissimilarity between objects $i$ and $j$, $i, j = 1, 2, \dots, n$. Here $d(i,j)$ is a distance function, i.e.
1. $d(i,j) \ge 0$
2. $d(i,i) = 0$
3. $d(i,j) = d(j,i)$
4. $d(i,j) \le d(i,k) + d(k,j)$
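As an illustration of the definition, the following minimal sketch builds the dissimilarity matrix for a small data matrix and checks the four distance properties. It assumes Euclidean distance as the choice of d; the function names `dissimilarity_matrix` and `is_distance` and the data values are hypothetical and not taken from the slides.

```python
import math

def dissimilarity_matrix(X):
    """Return the n x n matrix D with D[i][j] = Euclidean distance between rows i and j of X."""
    n = len(X)
    return [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]

def is_distance(D, tol=1e-12):
    """Check non-negativity, zero diagonal, symmetry and the triangle inequality."""
    n = len(D)
    for i in range(n):
        if D[i][i] != 0:                              # property 2
            return False
        for j in range(n):
            if D[i][j] < 0:                           # property 1
                return False
            if abs(D[i][j] - D[j][i]) > tol:          # property 3
                return False
            for k in range(n):
                if D[i][j] > D[i][k] + D[k][j] + tol:  # property 4
                    return False
    return True

# Hypothetical data matrix: n = 4 objects, p = 2 variables.
X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [1.0, 1.0]]
D = dissimilarity_matrix(X)
print(D[1][0])         # d(2, 1) in the 1-based notation of the definition
print(is_distance(D))
```

Because of symmetry and the zero diagonal, only the lower triangle of D needs to be stored, which is why the definition shows D as a lower-triangular array.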

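K-means is a natural and easy method for finding clusters when the number of clusters is given. It is based on Euclidean distance and the following observation:

$$
\begin{aligned}
T &= \sum_{j=1}^{p} \sum_{i=1}^{n} \sum_{i'=1}^{n} (x_{ij} - x_{i'j})^2 \\
  &= \sum_{j=1}^{p} \sum_{i=1}^{n} \sum_{i'=1}^{n} \left[ (x_{ij} - \bar{x}_j)^2 - 2 (x_{ij} - \bar{x}_j)(x_{i'j} - \bar{x}_j) + (x_{i'j} - \bar{x}_j)^2 \right] \\
  &= 2n \sum_{j=1}^{p} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \\
  &= 2 n^2 \cdot \text{variance}
\end{aligned}
$$

where $\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}$ and variance $= \sum_{j=1}^{p} \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$. The cross term vanishes because $\sum_{i=1}^{n} (x_{ij} - \bar{x}_j) = 0$ for every $j$.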
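The relation between the total pairwise scatter T and the variance can be checked numerically. The sketch below assumes that "variance" means the sum over the p variables of the variance with divisor n; the helper names and the data values are hypothetical.

```python
def total_scatter(X):
    """T: sum over variables j and all ordered pairs (i, i') of (x_ij - x_i'j)^2."""
    n, p = len(X), len(X[0])
    return sum((X[i][j] - X[k][j]) ** 2
               for j in range(p) for i in range(n) for k in range(n))

def total_variance(X):
    """Sum over variables of the variance with divisor n (an assumption about the slide's notation)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return sum((row[j] - means[j]) ** 2 for row in X for j in range(p)) / n

# Hypothetical data matrix: n = 3 objects, p = 2 variables.
X = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]
n = len(X)
T = total_scatter(X)
print(T, 2 * n * n * total_variance(X))  # the two numbers should agree
```

The triple sum costs on the order of n² p operations, while the variance form needs only on the order of n p, which is one practical reason to work with deviations from the mean.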

K-means (cont.)

We know that the total variation may be split into two parts: within-cluster variation (W) and between-cluster variation (B). The K-means method tries to find a partition into clusters such that W is minimized. The algorithm to do this is:

ALGORITHM
1. Choose points $m_1, \dots, m_K$ and call them centroids.
2. Partition the population into K subsets so that each point $x_i = (x_{i1}, \dots, x_{ip})$ is attached to the $m_k$ that is closest in Euclidean distance.
3. For each subset $k$, $k = 1, 2, \dots, K$, compute the center of gravity and set $m_k$ equal to this center.
4. If at least one $m_k$ changes, go to 2; otherwise stop.
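The four steps of the algorithm can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a reference implementation: the function names, the `max_iter` safety cap, the rule that an empty subset keeps its old centroid, and the data values are all additions that do not appear in the slides.

```python
import math

def assign(X, centroids):
    """Step 2: for each point, the index of the closest centroid (Euclidean distance)."""
    return [min(range(len(centroids)), key=lambda k: math.dist(x, centroids[k]))
            for x in X]

def update(X, labels, centroids):
    """Step 3: move each centroid to the center of gravity of its subset."""
    new = []
    for k, old in enumerate(centroids):
        members = [x for x, lab in zip(X, labels) if lab == k]
        if not members:
            new.append(list(old))  # assumption: an empty subset keeps its old centroid
        else:
            new.append([sum(col) / len(members) for col in zip(*members)])
    return new

def kmeans(X, centroids, max_iter=100):
    """Iterate steps 2 and 3 until no centroid changes (step 4)."""
    centroids = [list(m) for m in centroids]   # step 1: the caller supplies the start
    for _ in range(max_iter):                  # assumption: cap on the number of rounds
        new = update(X, assign(X, centroids), centroids)
        if new == centroids:                   # step 4: nothing moved, so stop
            break
        centroids = new
    return centroids, assign(X, centroids)

# Hypothetical data: two well-separated groups, deliberately poor starting centroids.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centroids, labels = kmeans(X, [[0.0, 0.0], [1.0, 0.0]])
print(centroids)
print(labels)
```

Both starting centroids lie in the same group here, yet the second centroid is pulled toward the far group after the first update and the procedure settles on the two visible groups.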

K-means example (start): the five chosen centroids are shown in red (not a particularly good choice).

The first picture shows the partition into subsets induced by the centroids. The second picture shows the new centroids.

In the first and second pictures, the direction of the centroids' movement is indicated.

The left picture shows the final result. Also take a look at autonlab.com and the animation Kmeans.

In our example we had five clusters and five centroids, and we found the five best clusters. However, we will not always find the best partition, even when we have the proper number of centroids. Contemplate the following figure: will the given centroids ever move?
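A small constructed case shows how centroids can get stuck. The configuration below is hypothetical and is not the figure from the slides: four points at the corners of a long rectangle, with two starting centroids placed at the midpoints of the long sides. The within-cluster sum of squared distances to the centroid is used here as a stand-in for W.

```python
import math

def step(X, centroids):
    """One round of K-means: assign points to the nearest centroid, then recompute centroids."""
    labels = [min(range(len(centroids)), key=lambda k: math.dist(x, centroids[k]))
              for x in X]
    new = []
    for k in range(len(centroids)):
        members = [x for x, lab in zip(X, labels) if lab == k]
        new.append([sum(col) / len(members) for col in zip(*members)])
    return labels, new

def within(X, labels, centroids):
    """Sum of squared distances from each point to its centroid (stand-in for W)."""
    return sum(math.dist(x, centroids[lab]) ** 2 for x, lab in zip(X, labels))

# Hypothetical data: corners of a 4 x 1 rectangle.
X = [[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]]
bad = [[2.0, 0.0], [2.0, 1.0]]    # midpoints of the long sides
good = [[0.0, 0.5], [4.0, 0.5]]   # midpoints of the short sides

bad_labels, bad_new = step(X, bad)
good_labels, good_new = step(X, good)
print(bad_new == bad, within(X, bad_labels, bad_new))      # stuck, large W
print(good_new == good, within(X, good_labels, good_new))  # stuck, small W
```

Both starting configurations are fixed points of the algorithm, but the first one pairs the points across the long side and has a much larger within-cluster variation. This is why the result of K-means depends on the initial centroids, and why trying several starts is a common practical remedy.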

[Figure: four panels labeled Step 1) to Step 4).]

Such clustering techniques give rise to dendrograms.
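A dendrogram records which groups were merged and at what dissimilarity. The sketch below assumes an agglomerative procedure with single linkage (the distance between two clusters is the smallest distance between their members); the slides do not state which linkage is used, and the function name and data values are hypothetical.

```python
import math

def single_linkage(X):
    """Repeatedly merge the two closest clusters; return the merge history.

    Each entry is (left members, right members, merge height), which is the
    information a dendrogram displays.
    """
    clusters = [[i] for i in range(len(X))]   # start with every object on its own
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: smallest distance between members (an assumption).
                d = min(math.dist(X[i], X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Hypothetical data: four objects measured on one variable.
X = [[0.0], [1.0], [5.0], [7.0]]
merges = single_linkage(X)
for left, right, height in merges:
    print(left, right, height)
```

Reading the merge history from first to last entry corresponds to reading the dendrogram from the leaves up to the root; cutting it at a chosen height gives a partition into clusters.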