Lecture 18: Clustering & classification


CPS260/BGT204: Algorithms in Computational Biology, October 30, 2003
Lecturer: Pankaj K. Agarwal
Lecture 18: Clustering & classification
Scribe: Daun Hou

Open Problem

In Homework 2, problem 5 contains an open problem which may be easy or may be hard; you can publish a paper if you find the solution. The problem is: how to improve the space complexity of the algorithm to O() or O(+log n).

1. What is Clustering?

A loose definition of clustering could be: the process of organizing objects into groups whose members are similar in some way. A cluster is therefore a collection of objects which are similar to each other and are dissimilar to the objects belonging to other clusters. Cluster analysis is also used to form descriptive statistics to ascertain whether or not the data consists of a set of distinct subgroups, each group representing objects with substantially different properties. The latter goal requires an assessment of the degree of difference between the objects assigned to the respective clusters [5].

Example of Clustering

Central to clustering is deciding what constitutes a good clustering. This can only come from subject-matter considerations; there is no absolute "best" criterion which would be independent of the final aim of the clustering.

For example, we could be interested in finding representatives for homogeneous groups (data reduction), in finding natural clusters and describing their unknown properties (natural data types), in finding useful and suitable groupings (useful data classes), or in finding unusual data objects (outlier detection).

Two important components of cluster analysis are the similarity (distance) measure between two data samples and the clustering algorithm.

2. Distance Measure

Different formulas for the distance between two data points can lead to different classification results. Domain knowledge must be used to guide the formulation of a suitable distance measure for each particular application. For high-dimensional data, a popular measure is the Minkowski metric:

    $d(x_i, x_j) = \left( \sum_{k=1}^{d} |x_{i,k} - x_{j,k}|^p \right)^{1/p}$

where $d$ is the dimensionality of the data. Special cases:

p = 2: Euclidean distance
p = 1: Manhattan distance
p → ∞: sup distance

However, there are no general theoretical guidelines for selecting a measure for any given application. In the case that the components of the data feature vectors are not immediately comparable, such as the days of the week, domain knowledge must be used to formulate an appropriate measure.
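To make the special cases concrete, here is a minimal Python/NumPy sketch (not from the lecture; the sample vectors and the helper name `minkowski` are purely illustrative) that evaluates the Minkowski metric for p = 1, p = 2, and the limiting sup distance:

```python
import numpy as np

def minkowski(x_i, x_j, p):
    """Minkowski distance between two feature vectors for a given p."""
    diff = np.abs(np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float))
    if np.isinf(p):
        return diff.max()                         # p -> infinity: sup distance
    return float((diff ** p).sum() ** (1.0 / p))

# two made-up 3-dimensional data points
x, y = [1.0, 4.0, 2.0], [3.0, 1.0, 2.0]
print(minkowski(x, y, 1))        # Manhattan distance: 5.0
print(minkowski(x, y, 2))        # Euclidean distance: sqrt(13), about 3.61
print(minkowski(x, y, np.inf))   # sup distance: 3.0
```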

3. Clustering Algorithms

Clustering algorithms may be classified as listed below.

Exclusive Clustering
In exclusive clustering, data are grouped in an exclusive way, so that a given datum belongs to exactly one cluster. K-means clustering is one example of an exclusive clustering algorithm.

Overlapping Clustering
Overlapping clustering uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership.

Hierarchical Clustering
The hierarchical clustering algorithm has two versions: agglomerative clustering and divisive clustering. Agglomerative clustering is based on the union of the two nearest clusters; the initial condition is realized by setting every datum as its own cluster, and after a number of iterations the desired final clusters are reached. This is the bottom-up version. Divisive clustering starts from one cluster containing all data items; at each step, clusters are successively split into smaller clusters according to some dissimilarity. This is the top-down version.

Probabilistic Clustering
Probabilistic clustering, e.g. a mixture of Gaussians, uses a completely probabilistic approach.

4. Hierarchical Algorithm

Given a set of N objects $S = \{s_1, s_2, \dots, s_N\}$ to be clustered and a distance function $D(c_i, c_j)$ between two clusters $c_i, c_j \subseteq S$ with $c_i \cap c_j = \emptyset$, build a hierarchy tree on $S$. The basic process of hierarchical clustering (S. C. Johnson, 1967) is as follows:

1. Start by assigning each object to its own cluster $c_i = \{s_i\}$ ($i = 1, \dots, N$), so that if you have N objects, you have N clusters $C = \{c_1, c_2, \dots, c_N\}$, each containing just one item.
2. Find the pair of clusters $(c_i, c_j)$ such that $D(c_i, c_j) \le D(c_{i'}, c_{j'})$ for all $c_{i'}, c_{j'} \in C$, and merge them into a single cluster $c = c_i \cup c_j$. Delete $c_i, c_j$ from $C$ and insert $c$ into $C$, so that you now have one cluster less.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.

An example of how the hierarchical algorithm can lead to long clusters is shown in the following figure:

Example of Hierarchical Clustering

Step 3 in the hierarchical algorithm can be done in different ways, which is what distinguishes single-linkage from complete-linkage and average-linkage clustering.

In single-linkage clustering, the distance between one cluster and another is the shortest distance from any member of one cluster to any member of the other:

    $D(c_i, c_j) = \min \{ d(a, b) : a \in c_i, b \in c_j \}$

It is obvious that

    $D(c, c_l) = \min \{ D(c_i, c_l), D(c_j, c_l) \}$  for  $c = c_i \cup c_j$.

In complete-linkage clustering, the distance between one cluster and another is the greatest distance from any member of one cluster to any member of the other:

    $D(c_i, c_j) = \max \{ d(a, b) : a \in c_i, b \in c_j \}$

In average-linkage clustering, the distance between one cluster and another is the average distance from any member of one cluster to any member of the other:

    $D(c_i, c_j) = \frac{1}{|c_i|\,|c_j|} \sum_{a \in c_i} \sum_{b \in c_j} d(a, b)$

It is obvious that

    $D(c, c_l) = \frac{|c_i|\, D(c_i, c_l) + |c_j|\, D(c_j, c_l)}{|c_i| + |c_j|}$  for  $c = c_i \cup c_j$.

The main weaknesses of hierarchical clustering methods are that they do not scale well (the time complexity is at least O(N^2), where N is the total number of objects) and that they can never undo what was done previously. It is basically a greedy approach.
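As a concrete illustration of the agglomerative procedure, here is a minimal Python/NumPy sketch using the single-linkage rule above. The toy points, the use of Euclidean point distances, and stopping at k clusters (rather than merging all the way down to one cluster, as in the notes) are illustrative assumptions:

```python
import numpy as np

def single_linkage(points, k):
    """Agglomerative clustering with single linkage, stopped at k clusters."""
    clusters = [[i] for i in range(len(points))]          # step 1: one item per cluster
    # pairwise point distances (Euclidean)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):                    # step 2: find the closest pair
            for b in range(a + 1, len(clusters)):
                dist = min(d[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]           # merge c_a and c_b ...
        del clusters[b]                                    # ... and delete the original pair
    return clusters                                        # steps 2-3 repeat until k clusters remain

pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
print(single_linkage(pts, 2))   # -> [[0, 1], [2, 3]]
```

Swapping the `min` in the inner loop for `max` or a mean gives complete-linkage or average-linkage clustering, respectively.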

An example [4]: clustering 4 data items in 2-dimensional space.


5. K-Means Clustering

K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solves the clustering problem. The objective is to classify a given data set $S = \{s_1, s_2, \dots, s_N\}$ into a certain number k of clusters fixed a priori. The idea is to define k initial centroids, one for each cluster $c_i$ ($i = 1, \dots, k$). The procedure is:

1. Initial clusters: $C^0 = \{c_1^0, c_2^0, \dots, c_k^0\}$; the initial centroids should be placed as far as possible from each other.
2. Calculate the centroids of the clusters: $u_i^j = \frac{1}{|c_i^j|} \sum_{x \in c_i^j} x$, where $i = 1, \dots, k$ and $j$ denotes the j-th iteration.
3. Take each point of the data set and associate it with the nearest centroid: $c_i^{j+1} = \{ x : d(x, u_i^j) \le d(x, u_{i'}^j) \ \forall i' \}$, giving $C^{j+1} = \{c_i^{j+1}\}$.
4. Repeat steps 2 and 3 until no more changes can be made to the clusters, that is, until $C^{j+1} = C^j$; in other words, the centroids do not move any more.

Each iteration takes O(kN) time, but we do not know in advance how many iterations it will take to converge. Finally, this algorithm aims at minimizing an objective function, in this case the squared-error function

    $J = \sum_{i=1}^{k} \sum_{x \in c_i} \| x - u_i \|^2$

Although it can be proved that the k-means algorithm always terminates, it does not necessarily find the optimal configuration corresponding to the global minimum of the objective function; it might get stuck in a local minimum. The algorithm is also significantly sensitive to the initially selected cluster centers. To get out of a local minimum, the k-means algorithm can be run multiple times from different initial clusterings, or a simulated annealing technique can be used.
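A compact Python/NumPy sketch of the four steps above. The toy data, the choice k = 2, and seeding the centroids from the first k points are illustrative assumptions; the notes only require that the initial centroids be far apart:

```python
import numpy as np

def kmeans(S, k, max_iter=100):
    """Plain k-means: alternate nearest-centroid assignment and centroid updates."""
    u = S[:k].copy()                                      # illustrative seeding: first k points
    for _ in range(max_iter):
        # step 3: associate each point with its nearest centroid
        labels = np.argmin(np.linalg.norm(S[:, None, :] - u[None, :, :], axis=-1), axis=1)
        # step 2: recompute each centroid as the mean of its cluster
        # (a production version would also guard against empty clusters)
        new_u = np.array([S[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_u, u):                         # step 4: stop when centroids settle
            break
        u = new_u
    J = sum(np.sum((S[labels == i] - u[i]) ** 2) for i in range(k))   # squared-error objective J
    return labels, u, J

S = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [4.1, 3.9]])
labels, centroids, J = kmeans(S, k=2)
print(labels, centroids, J)
```

Running the routine several times from different initializations and keeping the clustering with the smallest J is one simple way to reduce the risk of a poor local minimum, as suggested above.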

6. Principal Component Analysis

The principal components $v_j$ ($j = 1, 2, \dots, d$) of a data set $S \subset R^d$ consisting of N d-dimensional random vectors $S = \{s_1, s_2, \dots, s_N\}$ (d >> log N) provide a sequence of best linear approximations to the data in terms of minimum mean-square error. Each original vector $s_i$ ($i = 1, 2, \dots, N$) can hence be represented as a linear combination of the principal components:

    $s_i = \sum_{j=1}^{d} a_{ij} v_j$

This representation has several attractive properties:

1. The effectiveness of each principal component $v_j$ ($j = 1, 2, \dots, d$) in representing $S$ is determined by its corresponding eigenvalue. Therefore, the principal component with the largest eigenvalue has the greatest importance in effectively approximating the original data set $S$, and vice versa.
2. The principal components $v_j$ ($j = 1, 2, \dots, d$) are mutually uncorrelated, that is, $v_i^T v_j = 0$ ($i < j \le d$). Furthermore, in the special case where $S$ is normally distributed, the principal components are mutually independent.
3. The set of principal components $v_j$ ($j = 1, 2, \dots, k$) corresponding to the k largest eigenvalues minimizes the mean-square error over all choices of k orthogonal vectors.

The main application of Principal Component Analysis is feature-space reduction: the d-dimensional data can be projected onto a k-dimensional subspace using the first k principal components. In classification, the data might then be clustered around the k-dimensional hyperplane after the projection.

To calculate the principal components, the covariance matrix $X$ of the data set $S$ is first computed:

    $X = \frac{1}{N} \sum_{i=1}^{N} (s_i - u)(s_i - u)^T$,  where  $u = \frac{1}{N} \sum_{i=1}^{N} s_i$.

The average $u$ of all vectors $s_i$ in the data set is subtracted so that the eigenvector with the largest eigenvalue represents the subspace dimension in which the variance of the data set is maximized in the correlation sense. The principal components $v_j$ ($j = 1, 2, \dots, d$) are the eigenvectors of the covariance matrix $X$ and can be determined by solving the well-known eigenvalue decomposition problem

    $\lambda_j v_j = X v_j$,  where  $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_j \ge \dots \ge \lambda_d$.
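A minimal NumPy sketch of this computation, using made-up data and an illustrative choice of k = 2 for the projection; it forms the covariance matrix, solves the eigenvalue problem, and projects the centered data onto the first k principal components:

```python
import numpy as np

def principal_components(S):
    """Eigenvalues and eigenvectors of the covariance matrix, largest eigenvalue first."""
    u = S.mean(axis=0)                          # average u of all vectors s_i
    X = (S - u).T @ (S - u) / len(S)            # covariance matrix X
    lam, V = np.linalg.eigh(X)                  # eigendecomposition (X is symmetric)
    order = np.argsort(lam)[::-1]               # sort so that lambda_1 >= lambda_2 >= ...
    return lam[order], V[:, order], u

# made-up data: 200 points in R^3 that vary mostly along one direction
rng = np.random.default_rng(0)
S = rng.normal(size=(200, 1)) * np.array([3.0, 1.0, 0.2]) + rng.normal(scale=0.1, size=(200, 3))

lam, V, u = principal_components(S)
k = 2                                           # illustrative choice of k
S_k = (S - u) @ V[:, :k]                        # project onto the first k principal components
print(lam, S_k.shape)
```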

One approach to selecting k, so that the first k eigenvectors of X capture the important variation in the data set S, is to require

    $\frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{d} \lambda_j} \ge T$

where the threshold T is close to, but less than, unity.

7. Supervised Learning

All the cluster analysis methods introduced above are examples of unsupervised learning algorithms. A learning method is considered unsupervised if it learns in the absence of a teacher signal that provides prior knowledge of the correct answer. Supervised learning has a substantial advantage over unsupervised learning: it allows us to take advantage of our own knowledge about the classification problem we are trying to solve. Instead of just letting the algorithm work out for itself what the classes should be, we can tell it what we know about the classes: how many there are and what examples of each one look like. The supervised learning algorithm's job is then to find the features in the examples that are most useful in predicting the classes. Neural networks and support vector machines are widely used in supervised learning.

The following is an example of supervised learning. For feature vectors $S = \{s_1, s_2, \dots, s_N\}$ and a classifier $\chi : s_i \mapsto \{+1, -1\}$, the objective is to find a hyperplane $H$ passing through the origin so that the objective function $\sum_i \mathrm{sign}(H(s_i))\,\chi(s_i)$ is maximized; basically, you want to maximize the number of times $\mathrm{sign}(H(s_i)) = \chi(s_i)$. Sometimes a hyperplane does not work because the points are not linearly separable; in that case you could look at different hyper-surfaces to separate the data points.

References:

[1] S. C. Johnson (1967): "Hierarchical Clustering Schemes", Psychometrika, 32:241-254.
[2] J. B. MacQueen (1967): "Some Methods for Classification and Analysis of Multivariate Observations", Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, 1:281-297.
[3] A Tutorial on Clustering Algorithms, http://www.elet.polimi.it/upload/matteucci/clustering/tutorial_html/
[4] Jeong-Ho Chang, http://cbit.snu.ac.kr/tutorial-2002/ppt/clusteranalysis.pdf
[5] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning.