Cluster Analysis



- What is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Model-Based Clustering Methods
- Outlier Analysis
- Summary

What is Cluster Analysis?
- Cluster: a collection of data objects
  - Similar to one another within the same cluster
  - Dissimilar to the objects in other clusters
- Cluster analysis: grouping a set of data objects into clusters
- Clustering is unsupervised classification: there are no predefined classes
- Clustering is used:
  - As a stand-alone tool to get insight into the data distribution; visualization of clusters may unveil important information
  - As a preprocessing step for other algorithms; efficient indexing or compression often relies on clustering

General Applications of Clustering
- Pattern recognition
- Spatial data analysis
  - Create thematic maps in GIS by clustering feature spaces
  - Detect spatial clusters and explain them in spatial data mining
- Image processing: cluster images based on their visual content
- Economic science (especially market research)
- WWW and IR
  - Document classification
  - Cluster Weblog data to discover groups of similar access patterns

What Is Good Clustering?
- A good clustering method will produce high-quality clusters with
  - high intra-class similarity
  - low inter-class similarity
- The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Requirements of Clustering in Data Mining
- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- Ability to handle high dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability

Outliers
- Outliers are objects that do not belong to any cluster, or that form clusters of very small cardinality.
- In some applications we are interested in discovering outliers, not clusters (outlier analysis).

[Figure: a cluster and its outliers]

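Data Structures
- Data matrix (two modes): the classic data input, with n tuples/objects as rows and p attributes/dimensions as columns:

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

- Dissimilarity or distance matrix (one mode): the desired data input to some clustering algorithms, with objects as both rows and columns:

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$

Measuring Similarity in Clustering
- Dissimilarity/similarity metric: the dissimilarity $d(i, j)$ between two objects $i$ and $j$ is expressed in terms of a distance function, which is typically a metric:
  - $d(i, j) \ge 0$ (non-negativity)
  - $d(i, i) = 0$ (isolation)
  - $d(i, j) = d(j, i)$ (symmetry)
  - $d(i, j) \le d(i, h) + d(h, j)$ (triangular inequality)
- The definitions of distance functions are usually different for interval-scaled, boolean, categorical, ordinal and ratio-scaled variables.
- Weights may be associated with different variables based on applications and data semantics.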

Types of Data in Cluster Analysis
- Interval-scaled variables, e.g., salary, height
- Binary variables, e.g., gender (M/F), has_cancer (T/F)
- Nominal (categorical) variables, e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)
- Ordinal variables, e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)
- Ratio-scaled variables, e.g., population growth
- Variables of mixed types: multiple attributes with various types

Similarity and Dissimilarity Between Objects
- Distance metrics are normally used to measure the similarity or dissimilarity between two data objects.
- The most popular conform to the Minkowski distance:

$$L_p(i, j) = \left( |x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \cdots + |x_{in} - x_{jn}|^p \right)^{1/p}$$

  where $i = (x_{i1}, x_{i2}, \ldots, x_{in})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jn})$ are two n-dimensional data objects, and $p$ is a positive integer.
- If $p = 1$, $L_1$ is the Manhattan (or city block) distance:

$$L_1(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}|$$
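The Minkowski family can be sketched in a few lines of Python. This is only an illustration: the function name `minkowski` and the two sample objects are assumptions, not part of the slides. Setting `p=1` gives the Manhattan distance and `p=2` the Euclidean distance.

```python
def minkowski(x, y, p=2):
    """Minkowski distance L_p between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of dimensions")
    if p < 1:
        raise ValueError("p must be a positive integer")
    # Sum the p-th powers of the coordinate differences, then take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Two hypothetical 2-dimensional objects.
i = (1.0, 2.0)
j = (4.0, 6.0)

manhattan = minkowski(i, j, p=1)  # |1-4| + |2-6|
euclidean = minkowski(i, j, p=2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

The same pair of objects is farther apart under `p=1` than under `p=2`, which is why the choice of `p` is part of the similarity measure and not a detail.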

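Similarity and Dissimilarity Between Objects (Cont.)
- If $p = 2$, $L_2$ is the Euclidean distance:

$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{in} - x_{jn}|^2}$$

- Properties:
  - $d(i, j) \ge 0$
  - $d(i, i) = 0$
  - $d(i, j) = d(j, i)$
  - $d(i, j) \le d(i, k) + d(k, j)$
- One can also use a weighted distance:

$$d(i, j) = \sqrt{w_1 |x_{i1} - x_{j1}|^2 + w_2 |x_{i2} - x_{j2}|^2 + \cdots + w_n |x_{in} - x_{jn}|^2}$$

Binary Variables
- A binary variable has two states: 0 (absent) and 1 (present).
- A contingency table for binary data:

                 Object j
                 1      0      sum
Object i   1     a      b      a+b
           0     c      d      c+d
           sum   a+c    b+d    p

- Simple matching coefficient distance (invariant, if the binary variable is symmetric):

$$d(i, j) = \frac{b + c}{a + b + c + d}$$

- Jaccard coefficient distance (noninvariant if the binary variable is asymmetric):

$$d(i, j) = \frac{b + c}{a + b + c}$$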
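As a rough illustration of the weighted distance, the sketch below multiplies each squared coordinate difference by its weight before taking the square root. The function name `weighted_euclidean`, the sample objects and the weights are all assumptions made for this example.

```python
import math


def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance: sqrt(sum_k w_k * (x_k - y_k)^2)."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    return math.sqrt(sum(wk * (a - b) ** 2 for wk, a, b in zip(w, x, y)))


# With unit weights the result is the ordinary Euclidean distance.
d_plain = weighted_euclidean((0, 0), (3, 4), (1, 1))

# Hypothetical weights that emphasize the first variable and ignore the second.
d_weighted = weighted_euclidean((0, 0), (3, 4), (4, 0))
print(d_plain, d_weighted)
```

A weight of zero removes a variable from the comparison, so weights are one way to encode application knowledge about which attributes matter.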

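Binary Variables (Cont.)
- Another approach is to define the similarity of two objects rather than their distance. In that case we have the following:
  - Simple matching coefficient similarity: $s(i, j) = \dfrac{a + d}{a + b + c + d}$
  - Jaccard coefficient similarity: $s(i, j) = \dfrac{a}{a + b + c}$
- Note that $s(i, j) = 1 - d(i, j)$.

Dissimilarity between Binary Variables
- Example (Jaccard coefficient):

Name   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   1       0       1        0        0        0
Mary   1       0       1        0        1        0
Jim    1       1       0        0        0        0

- All attributes are asymmetric binary.
- 1 denotes presence or a positive test; 0 denotes absence or a negative test.

$$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$

$$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$

$$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$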
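The Jaccard example can be checked with a short sketch that counts the contingency-table cells a, b, c, d for a pair of bit vectors. The helper names are hypothetical, and the 0/1 rows are an assumed encoding of the three patients in the column order Fever, Cough, Test-1 to Test-4.

```python
def binary_counts(x, y):
    """Return the contingency counts (a, b, c, d) for two 0/1 vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)  # both present
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)  # only in x
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)  # only in y
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)  # both absent
    return a, b, c, d


def jaccard_distance(x, y):
    """(b + c) / (a + b + c): joint absences (d) are ignored."""
    a, b, c, _ = binary_counts(x, y)
    return 0.0 if a + b + c == 0 else (b + c) / (a + b + c)


def simple_matching_distance(x, y):
    """(b + c) / (a + b + c + d): joint absences count as agreement."""
    a, b, c, d = binary_counts(x, y)
    return (b + c) / (a + b + c + d)


# Assumed encoding: Fever, Cough, Test-1, Test-2, Test-3, Test-4.
patients = {
    "Jack": (1, 0, 1, 0, 0, 0),
    "Mary": (1, 0, 1, 0, 1, 0),
    "Jim": (1, 1, 0, 0, 0, 0),
}

distances = {
    ("Jack", "Mary"): jaccard_distance(patients["Jack"], patients["Mary"]),
    ("Jack", "Jim"): jaccard_distance(patients["Jack"], patients["Jim"]),
    ("Jim", "Mary"): jaccard_distance(patients["Jim"], patients["Mary"]),
}
for pair, value in distances.items():
    print(pair, round(value, 2))
```

Under this encoding Jack and Mary come out closest, and Jim and Mary farthest apart.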

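A simpler definition
- Each variable is mapped to a bitmap (binary vector):

Name   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   1       0       1        0        0        0
Mary   1       0       1        0        1        0
Jim    1       1       0        0        0        0

  - Jack: 101000
  - Mary: 101010
  - Jim: 110000
- Simple match distance:

$$d(i, j) = \frac{\text{number of non-common bit positions}}{\text{total number of bits}}$$

- Jaccard coefficient:

$$d(i, j) = 1 - \frac{\text{number of 1's in } i \wedge j}{\text{number of 1's in } i \vee j}$$

Variables of Mixed Types
- A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio-scaled.
- One may use a weighted formula to combine their effects:

$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$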
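A minimal sketch of the weighted formula for mixed types follows. It assumes that each variable f has already been reduced to a per-variable dissimilarity in [0, 1] and that δ is a 0/1 indicator of whether variable f takes part in the comparison (for example, 0 when a value is missing). The function name and the sample values are hypothetical.

```python
def mixed_dissimilarity(contributions):
    """Combine per-variable terms given as (delta, d) pairs.

    delta is 1 if the variable takes part in the comparison and 0 otherwise;
    d is that variable's dissimilarity for the pair of objects.
    """
    numerator = sum(delta * d for delta, d in contributions)
    denominator = sum(delta for delta, _ in contributions)
    if denominator == 0:
        raise ValueError("no variable contributes to this pair of objects")
    return numerator / denominator


# Three hypothetical variables; the third is excluded (delta = 0).
example = mixed_dissimilarity([(1, 0.5), (1, 1.0), (0, 0.9)])
print(example)
```

Because excluded variables drop out of both the numerator and the denominator, the result stays on the same scale no matter how many variables are usable for a given pair.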

Major Clustering Approaches
- Partitioning algorithms: construct random partitions and then iteratively refine them by some criterion
- Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
- Density-based: based on connectivity and density functions
- Grid-based: based on a multiple-level granularity structure
- Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to those models

Partitioning Algorithms: Basic Concepts
- Partitioning method: construct a partition of a database D of n objects into a set of k clusters
- Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
  - Global optimum: exhaustively enumerate all partitions
  - Heuristic methods: the k-means and k-medoids algorithms
  - k-means (MacQueen '67): each cluster is represented by the center of the cluster
  - k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw '87): each cluster is represented by one of the objects in the cluster

The k-means Clustering Method
- Given k, the k-means algorithm is implemented in 4 steps:
  1. Partition the objects into k nonempty subsets.
  2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
  3. Assign each object to the cluster with the nearest seed point.
  4. Go back to Step 2; stop when there are no more new assignments.

The k-means Clustering Method: Example

[Figure: k-means clustering example]
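The four steps can be sketched in plain Python as follows. This is an illustrative version under a few assumptions that the slides do not specify: the initial partition is a shuffled round-robin split, distances are squared Euclidean, and a cluster that becomes empty keeps its previous centroid. All names and the sample points are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(members):
    """Mean point of a non-empty list of points."""
    return tuple(sum(col) / len(members) for col in zip(*members))


def kmeans(points, k, max_iter=100, seed=0):
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    order = list(range(len(points)))
    rng.shuffle(order)
    # Step 1: arbitrary partition into k nonempty subsets (round-robin).
    labels = [0] * len(points)
    for rank, idx in enumerate(order):
        labels[idx] = rank % k
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: recompute the centroid of every non-empty cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = centroid(members)
        # Step 3: assign each object to the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Step 4: stop when no assignment changes.
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels


# Two hypothetical, well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

On data this well separated the result does not depend on the initial partition. In general it can, since the method may stop at a local optimum.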

Comments on the k-means Method
- Strength
  - Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
  - Often terminates at a local optimum.
- Weaknesses
  - Applicable only when the mean is defined; what about categorical data?
  - Need to specify k, the number of clusters, in advance
  - Unable to handle noisy data and outliers
  - Not suitable for discovering clusters with non-convex shapes

The k-Medoids Clustering Method
- Find representative objects, called medoids, in clusters
- PAM (Partitioning Around Medoids, 1987)
  - Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
  - PAM works effectively for small data sets, but does not scale well for large data sets
- CLARA (Kaufmann & Rousseeuw, 1990)
- CLARANS (Ng & Han, 1994): randomized sampling

PAM (Partitioning Around Medoids) (1987)
- PAM (Kaufman and Rousseeuw, 1987), built into the statistical package S
- Uses real objects to represent the clusters:
  1. Select k representative objects arbitrarily.
  2. For each pair of non-selected object h and selected object i, calculate the total swapping cost $TC_{ih}$.
  3. For each pair of i and h, if $TC_{ih} < 0$, i is replaced by h. Then assign each non-selected object to the most similar representative object.
  4. Repeat steps 2-3 until there is no change.

PAM Clustering: Total swapping cost $TC_{ih} = \sum_j C_{jih}$
- i is a current medoid, h is a non-selected object
- Assume that i is replaced by h in the set of medoids
- $TC_{ih} = 0$
- For each non-selected object $j \ne h$:
  - $TC_{ih} \mathrel{+}= d(j, \text{new\_med}_j) - d(j, \text{prev\_med}_j)$
  - $\text{new\_med}_j$ = the closest medoid to j after i is replaced by h
  - $\text{prev\_med}_j$ = the closest medoid to j before i is replaced by h
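The swap-based search can be sketched as below. This is a simplified illustration and not the package implementation. It makes three assumptions: the swap cost is computed as the change in total distance over all objects (so the outgoing medoid and the incoming object are included, which makes the cost equal the exact change in the objective), only the single best swap is applied per pass, and the first k objects serve as the arbitrary initial medoids. All names and the sample data are hypothetical.

```python
def total_distance(points, medoids, dist):
    """Sum over all objects of the distance to the closest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def total_swap_cost(points, medoids, i, h, dist):
    """Change in total distance if medoid i is replaced by non-medoid h."""
    swapped = [h if m == i else m for m in medoids]
    return total_distance(points, swapped, dist) - total_distance(points, medoids, dist)


def pam(points, k, dist, tol=1e-12):
    n = len(points)
    medoids = list(range(k))  # step 1: arbitrary initial medoids (indices)
    while True:
        best_tc, best_pair = -tol, None
        # Step 2: evaluate every (medoid i, non-medoid h) pair.
        for i in medoids:
            for h in range(n):
                if h in medoids:
                    continue
                tc = total_swap_cost(points, medoids, i, h, dist)
                if tc < best_tc:
                    best_tc, best_pair = tc, (i, h)
        # Step 4: stop when no swap lowers the total distance.
        if best_pair is None:
            break
        # Step 3: apply the best improving swap.
        i, h = best_pair
        medoids = [h if m == i else m for m in medoids]
    # Assign every object to its closest medoid.
    labels = [min(medoids, key=lambda m: dist(p, points[m])) for p in points]
    return medoids, labels


def abs_dist(a, b):
    return abs(a - b)


# Hypothetical 1-D data with two obvious groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
first_tc = total_swap_cost(points, [0, 1], 0, 4, abs_dist)
medoids, labels = pam(points, 2, abs_dist)
print(first_tc, sorted(medoids), labels)
```

Each pass evaluates k(n-k) candidate swaps and each evaluation scans all n objects, which illustrates why PAM does not scale well to large data sets.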

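PAM Clustering: Total swapping cost $TC_{ih} = \sum_j C_{jih}$

[Figure: four panels showing the cases for an object j when medoid i is replaced by h, with t another current medoid]
- j currently belongs to i and moves to h: $C_{jih} = d(j, h) - d(j, i)$
- j belongs to another medoid t and stays there: $C_{jih} = 0$
- j currently belongs to i and moves to t: $C_{jih} = d(j, t) - d(j, i)$
- j currently belongs to t and moves to h: $C_{jih} = d(j, h) - d(j, t)$

CLARA (Clustering Large Applications)
- CLARA (Kaufmann and Rousseeuw, 1990)
  - Built into statistical analysis packages, such as S
- It draws multiple samples of the data set, applies PAM to each sample, and gives the best clustering as the output
- Strength: deals with larger data sets than PAM
- Weaknesses:
  - Efficiency depends on the sample size
  - A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased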
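The sampling idea behind CLARA can be sketched as follows. To keep the example short, an exhaustive search over the sample stands in for PAM, which is a deliberate substitution and only workable for tiny samples. The sketch also assumes that candidate medoid sets are compared by their total distance on the whole data set. Function names, parameters and data are hypothetical.

```python
import itertools
import random


def total_cost(points, medoids, dist):
    """Sum over all points of the distance to the closest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def best_medoids_in_sample(sample, k, dist):
    """Exhaustive stand-in for PAM: the best k medoids within the sample."""
    return min(
        itertools.combinations(sample, k),
        key=lambda ms: total_cost(sample, ms, dist),
    )


def clara(points, k, dist, n_samples=5, sample_size=4, seed=0):
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(n_samples):
        # Draw a random sample and cluster only the sample.
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = best_medoids_in_sample(sample, k, dist)
        # Judge the sample's medoids on the whole data set.
        cost = total_cost(points, medoids, dist)
        if cost < best_cost:
            best, best_cost = medoids, cost
    return best, best_cost


def abs_dist(a, b):
    return abs(a - b)


# Hypothetical 1-D data with two obvious groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
medoids, cost = clara(points, 2, abs_dist)
print(medoids, cost)
```

If every sample happens to come mostly from one group, the returned medoids will be poor for the full data set, which is the bias weakness noted above.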

CLARANS ("Randomized" CLARA)
- CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han, 1994)
- CLARANS draws a sample of neighbors dynamically
- The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids
- If a local optimum is found, CLARANS starts with a new randomly selected node in search of a new local optimum
- It is more efficient and scalable than both PAM and CLARA
- Focusing techniques and spatial access structures may further improve its performance (Ester et al., 1995)
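The graph search can be sketched as a randomized local search over sets of k medoids, where two nodes are neighbors if they differ in exactly one medoid. The parameters `num_local` (number of restarts) and `max_neighbor` (failed random neighbors tolerated before a node is declared a local optimum) are assumptions of this sketch, as are all names and the sample data.

```python
import random


def total_cost(points, medoid_idx, dist):
    """Sum over all points of the distance to the closest medoid."""
    return sum(min(dist(p, points[m]) for m in medoid_idx) for p in points)


def clarans(points, k, dist, num_local=3, max_neighbor=20, seed=0):
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must be at least 1 and smaller than the number of points")
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(num_local):
        # Start from a randomly selected node (a set of k medoid indices).
        current = rng.sample(range(n), k)
        current_cost = total_cost(points, current, dist)
        tries = 0
        while tries < max_neighbor:
            # Random neighbor: swap one medoid for one non-medoid.
            out_pos = rng.randrange(k)
            h = rng.choice([x for x in range(n) if x not in current])
            neighbor = current.copy()
            neighbor[out_pos] = h
            cost = total_cost(points, neighbor, dist)
            if cost < current_cost:
                current, current_cost = neighbor, cost
                tries = 0  # moved to a better node; restart the counter
            else:
                tries += 1
        # current is treated as a local optimum; keep the best one seen.
        if current_cost < best_cost:
            best, best_cost = sorted(current), current_cost
    return best, best_cost


def abs_dist(a, b):
    return abs(a - b)


# Hypothetical 1-D data with two obvious groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
best, best_cost = clarans(points, 2, abs_dist)
print(best, best_cost)
```

Unlike PAM, each step inspects one random neighbor and not all k(n-k) of them, and unlike CLARA the search is never confined to a fixed sample.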