IOSR Journal of Engineering

AN OVERVIEW ON CLUSTERING METHODS

T. Soni Madhulatha
Associate Professor, Alluri Institute of Management Sciences, Warangal.

ABSTRACT
Clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Clustering is the process of grouping similar objects into different groups, or more precisely, the partitioning of a data set into subsets, so that the data in each subset are similar according to some defined distance measure. This paper covers clustering algorithms, their benefits and their applications. The paper concludes by discussing some limitations.

Keywords: Clustering, hierarchical algorithm, partitional algorithm, distance measure

I. INTRODUCTION
Clustering can be considered the most important unsupervised learning problem; so, as with every other problem of this kind, it deals with finding a structure in a collection of unlabeled data. A cluster is therefore a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters. Besides the term data clustering, synonyms such as cluster analysis, automatic classification, numerical taxonomy, botryology and typological analysis are in use.

II. TYPES OF CLUSTERING
Data clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find successive clusters using previously established clusters, whereas partitional algorithms determine all clusters at once. Hierarchical algorithms can be agglomerative (bottom-up) or divisive (top-down). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.

HIERARCHICAL CLUSTERING
A key step in a hierarchical clustering is to select a distance measure. A simple measure is Manhattan distance, equal to the sum of absolute distances for each variable. The name comes from the fact that in a two-variable case, the variables can be plotted on a grid that can be compared to city streets, and the distance between two points is the number of blocks a person would walk.
A more common measure is Euclidean distance, computed by finding the square of the distance between each variable, summing the squares, and finding the square root of that sum. In the two-variable case, the distance is analogous to finding the length of the hypotenuse in a triangle; that is, it is the distance "as the crow flies." A review of cluster analysis in health psychology research found that the most common distance measure in published studies in that research area is the Euclidean distance or the squared Euclidean distance.

The Manhattan distance function computes the distance that would be traveled to get from one data point to the other if a grid-like path is followed. The Manhattan distance between two items is the sum of the absolute differences of their corresponding components. The formula for this distance between a point X = (X1, X2, etc.) and a point Y = (Y1, Y2, etc.) is:

$$ d = \sum_{i=1}^{n} |X_i - Y_i| $$

where n is the number of variables, and $X_i$ and $Y_i$ are the values of the ith variable at points X and Y respectively.

The Euclidean distance function measures the "as-the-crow-flies" distance. The formula for this distance between a point X = (X1, X2, etc.) and a point Y = (Y1, Y2, etc.) is:

$$ d = \sqrt{\sum_{j=1}^{n} (X_j - Y_j)^2} $$

Deriving the Euclidean distance between two data points involves computing the square root of the sum of the squares of the differences between corresponding values. The accompanying figure illustrates the difference between Manhattan distance and Euclidean distance.

ISSN: 2250-3021 www.iosrjen.org
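The two distance formulas above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `manhattan_distance` and `euclidean_distance` and the sample points are assumptions made for the example, not part of the original paper.

```python
import math


def manhattan_distance(x, y):
    """Sum of the absolute differences of corresponding components."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of variables")
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean_distance(x, y):
    """Square root of the sum of squared differences of corresponding components."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of variables")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical two-variable example: a 3-4-5 right triangle.
p, q = (0, 0), (3, 4)
print(manhattan_distance(p, q))  # 7   (3 blocks east, 4 blocks north)
print(euclidean_distance(p, q))  # 5.0 (length of the hypotenuse)
```

For the same pair of points the Euclidean distance never exceeds the Manhattan distance, which matches the grid-path versus straight-line picture.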
[Figure: Manhattan distance versus Euclidean distance]

Agglomerative hierarchical clustering
For example, suppose these data are to be analyzed, where pixel Euclidean distance is the distance metric. This method builds the hierarchy from the individual elements by progressively merging clusters. In this example, we have six elements {a} {b} {c} {d} {e} and {f}. The first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, therefore we must define a distance between elements. One can also construct a distance matrix at this stage.

Usually the distance between two clusters A and B is one of the following:
- The maximum distance between elements of each cluster, also called complete linkage clustering: $\max \{ d(x, y) : x \in A, y \in B \}$
- The minimum distance between elements of each cluster, also called single linkage clustering: $\min \{ d(x, y) : x \in A, y \in B \}$
- The mean distance between elements of each cluster, also called average linkage clustering: $\frac{1}{\operatorname{card}(A)\,\operatorname{card}(B)} \sum_{x \in A} \sum_{y \in B} d(x, y)$
- The sum of all intra-cluster variance
- The increase in variance for the cluster being merged
- The probability that candidate clusters spawn from the same distribution function.

Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged or when there is a sufficiently small number of clusters.

Divisive clustering
So far we have only looked at agglomerative clustering, but a cluster hierarchy can also be generated top-down. This variant of hierarchical clustering is called top-down clustering or divisive clustering. We start at the top with all documents in one cluster. The cluster is split using a flat clustering algorithm. This procedure is applied recursively until each document is in its own singleton cluster. Top-down clustering is conceptually more complex than bottom-up clustering since we need a second, flat clustering algorithm as a "subroutine". It has the advantage of being more efficient if we do not generate a complete hierarchy all the way down to individual document leaves.
For a fixed number of top levels, using an efficient flat algorithm like K-means, top-down algorithms are linear in the number of documents and clusters.
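The agglomerative procedure described earlier in this section, together with the complete, single and average linkage definitions, can be sketched as follows. This is a minimal, deliberately naive sketch: the function names, the stopping rule based on a target number of clusters, and the six sample points are assumptions chosen for illustration.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cluster_distance(A, B, linkage="single"):
    """Distance between two clusters A and B, given as lists of points."""
    pairwise = [euclidean(x, y) for x in A for y in B]
    if linkage == "complete":   # maximum distance between elements
        return max(pairwise)
    if linkage == "single":     # minimum distance between elements
        return min(pairwise)
    if linkage == "average":    # mean distance between elements
        return sum(pairwise) / (len(A) * len(B))
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerative(points, num_clusters, linkage="single"):
    """Start with one cluster per point and repeatedly merge the two
    closest clusters until only num_clusters remain."""
    if not 1 <= num_clusters <= len(points):
        raise ValueError("num_clusters must be between 1 and len(points)")
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters


# Six hypothetical elements forming three obvious pairs.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
print(agglomerative(points, 3, linkage="single"))
```

Stopping at a target number of clusters corresponds to the "sufficiently small number of clusters" criterion; a distance threshold could be used instead by checking `best[0]` before each merge.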
Hierarchical methods suffer from the fact that once a merge or split is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs by not worrying about a combinatorial number of different choices. However, there are two approaches to improve the quality of hierarchical clustering:
- Perform careful analysis of object linkages at each hierarchical partitioning, as in CURE and Chameleon.
- Integrate hierarchical agglomeration and then refine the result using iterative relocation, as in BIRCH.

PARTITIONAL CLUSTERING
Partitioning algorithms are based on specifying an initial number of groups, and iteratively reallocating objects among groups to convergence. These algorithms typically determine all clusters at once. Most applications adopt one of two popular heuristic methods:
- the k-means algorithm
- the k-medoids algorithm

K-means algorithm
The K-means algorithm assigns each point to the cluster whose center (also called centroid) is nearest. The center is the average of all the points in the cluster; that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster.
The pseudo code of the k-means algorithm explains how it works:
A. Choose K as the number of clusters.
B. Initialize the codebook vectors of the K clusters (randomly, for instance).
C. For every new sample vector:
a. Compute the distance between the new vector and every cluster's codebook vector.
b. Re-compute the closest codebook vector with the new vector, using a learning rate that decreases in time.

The reason behind choosing the k-means algorithm to study is its popularity, for the following reasons:
- Its time complexity is O(nkl), where n is the number of patterns, k is the number of clusters, and l is the number of iterations taken by the algorithm to converge.
- Its space complexity is O(k+n). It requires additional space to store the data matrix.
- In its batch form it is order-independent; for a given initial seed set of cluster centers, it generates the same partition of the data irrespective of the order in which the patterns are presented to the algorithm.
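Steps A to C of the pseudo code above can be sketched in Python as shown below. The sketch follows the sample-by-sample (online) form of the pseudo code. Several details are assumptions that the text does not fix: the codebook is initialized from K randomly chosen samples, the learning rate for a codebook vector is 1 divided by the number of samples it has absorbed so far, and the names `online_kmeans` and `assign` are hypothetical.

```python
import math
import random


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def online_kmeans(samples, k, seed=0):
    """Sample-by-sample k-means; returns the k codebook vectors."""
    rng = random.Random(seed)
    # Step B: initialize the codebook from k randomly chosen samples (assumption).
    codebook = [list(v) for v in rng.sample(samples, k)]
    counts = [1] * k  # how many samples each codebook vector has absorbed
    # Step C: process every sample vector in turn.
    for x in samples:
        # C.a: distance from the sample to every codebook vector.
        distances = [euclidean(x, c) for c in codebook]
        nearest = distances.index(min(distances))
        # C.b: move the closest codebook vector toward the sample with a
        # learning rate that decreases as the vector absorbs more samples.
        counts[nearest] += 1
        rate = 1.0 / counts[nearest]
        codebook[nearest] = [c + rate * (xi - c)
                             for c, xi in zip(codebook[nearest], x)]
    return codebook


def assign(samples, codebook):
    """Index of the nearest codebook vector for every sample."""
    return [min(range(len(codebook)), key=lambda j: euclidean(x, codebook[j]))
            for x in samples]


# Two hypothetical, well-separated groups of points.
samples = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11)]
codebook = online_kmeans(samples, k=2)
print(assign(samples, codebook))
```

Unlike the batch form, this online variant depends on the order in which samples are presented, so the order-independence property listed above should not be expected of it.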
K-medoids algorithm
The basic strategy of k-medoids algorithms is that each cluster is represented by one of the objects located near the center of the cluster. PAM (Partitioning Around Medoids) was one of the first k-medoids algorithms introduced.
The pseudo code of the k-medoids algorithm explains how it works:
Arbitrarily choose k objects as the initial medoids
Repeat
  Assign each remaining object to the cluster with the nearest medoid
  Randomly select a non-medoid object O_random
  Compute the total cost, S, of swapping a current medoid O_j with O_random
  If S < 0 then swap O_j with O_random to form the new set of k medoids
Until no change

The k-medoids method is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean.

DENSITY-BASED CLUSTERING
Density-based clustering algorithms are devised to discover arbitrary-shaped clusters. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold. DBSCAN and SNN are two typical algorithms of this kind.

DBSCAN algorithm
The DBSCAN algorithm was first introduced by Ester et al., and relies on a density-based notion of clusters. Clusters are identified by looking at the density of points. Regions with a high density of points depict the existence of clusters, whereas regions with a low density of points indicate clusters of noise or clusters of outliers. This algorithm is particularly suited to deal with large datasets, with noise, and is able to identify clusters with different sizes and shapes.
The key idea of the DBSCAN algorithm is that, for each point of a cluster, the neighbourhood of a given radius has to contain at least a minimum number of points, that is, the density in the neighbourhood has to exceed some predefined threshold. This algorithm needs three input parameters:
- k, the neighbour list size;
- Eps, the radius that delimits the neighbourhood area of a point (Eps-neighbourhood);
- MinPts, the minimum number of points that must exist in the Eps-neighbourhood.
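The density condition built from Eps and MinPts can be illustrated with a short Python sketch. It is not a full DBSCAN implementation: it only counts the points in each Eps-neighbourhood and labels every point. The labelling convention used here is the common one and is an assumption with respect to this text: a point is "core" if its Eps-neighbourhood (itself included) holds at least MinPts points, "border" if it is not core but lies within Eps of a core point, and "noise" otherwise. The function names and sample data are hypothetical.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def eps_neighbourhood(points, index, eps):
    """Indices of all points within distance eps of points[index], itself included."""
    p = points[index]
    return [j for j, q in enumerate(points) if euclidean(p, q) <= eps]


def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border' or 'noise'."""
    neighbours = [eps_neighbourhood(points, i, eps) for i in range(len(points))]
    # A core point has at least min_pts points in its Eps-neighbourhood.
    core = {i for i, nb in enumerate(neighbours) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbours):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")  # not dense itself, but next to a core point
        else:
            labels.append("noise")
    return labels


# Hypothetical data: a dense square, one point on its edge, one far outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1), (8, 8)]
print(classify_points(points, eps=1.5, min_pts=4))
```

With these values the four corner points are core points, the point at (2.2, 1) is a border point because only (1, 1) and itself fall within Eps, and (8, 8) is noise.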
The clustering process is based on the classification of the points in the dataset as core points, border points and noise points, and on the use of density relations between points to form the clusters.
The pseudo code of the DBSCAN algorithm explains how it works:
To cluster a dataset, our DBSCAN implementation starts by identifying the k nearest neighbours of each point and identifies the farthest of these nearest neighbours. The average of all these distances is then calculated. For each point of the dataset the algorithm identifies the directly density-reachable points using the Eps threshold provided by the user and classifies the points into core or border points. It then loops through all points of the dataset and for the core points it starts to construct a new cluster with the support of the GetDRPoints() procedure that follows the definition of density-reachable points. In this phase the value used as Eps threshold is the average distance calculated previously. At the end, the composition of the clusters is verified in order to check if there exist clusters that can be merged together. This can happen if two points of different clusters are at a distance less than Eps.
Note: DBSCAN does not deal very well with clusters of different densities.

SNN ALGORITHM
The SNN (Shared Nearest Neighbour) algorithm, like DBSCAN, is a density-based clustering algorithm. The main difference between this algorithm and DBSCAN is that it defines the similarity between points by looking at the number of nearest neighbours that two points share. Using this similarity measure in the SNN algorithm, the density is defined as the sum of the similarities of the nearest neighbours of a point. Points with high density become core points, while points with low density represent noise points. All remaining points that are strongly similar to a specific core point will represent a new cluster. The SNN algorithm needs three input parameters:
- K, the neighbours' list size;
- Eps, the threshold density;
- MinPts, the threshold that defines the core points.
The pseudo code of the SNN algorithm explains how it works:
Define the input parameters. Find the K nearest neighbours of each point of the dataset.
Then the similarity between pairs of points is calculated in terms of how many nearest neighbours the two points share. Using this similarity measure, the density of each point can be calculated as being the number of neighbours with which the number of shared neighbours is equal to or greater than Eps. The points are classified as being core points if the density of the point is equal to or greater than MinPts. At this point, the algorithm has all the information needed to start to build the clusters. Those start to be constructed around the core points. However, these clusters do not contain all points. They contain only points that come from regions of relatively uniform density. The points that are not classified into any cluster are classified as noise points.

GRID-BASED CLUSTERING
The grid-based clustering approach uses a multiresolution grid data structure. It quantizes the space into a finite number of cells that form a grid structure on which all the operations for clustering are performed. Grid approaches include STING (STatistical INformation Grid approach) and CLIQUE.
Basic grid-based algorithm:
1. Define a set of grid cells.
2. Assign objects to the appropriate grid cell and compute the density of each cell.
3. Eliminate cells whose density is below a certain threshold t.
4. Form clusters from contiguous (adjacent) groups of dense cells.

The pseudo code of the STING algorithm explains how it works:
- The spatial area is divided into rectangular cells.
- There are several levels of cells corresponding to different levels of resolution.
- Each cell is partitioned into a number of smaller cells in the next level.
- Statistical information of each cell is calculated and stored beforehand and is used to answer queries.
- Parameters of higher-level cells can be easily calculated from parameters of lower-level cells: count, mean, standard deviation (s), min, max, and type of distribution (normal, uniform, etc.).
- Use a top-down approach to answer spatial data queries.
- Start from a pre-selected layer, typically with a small number of cells.
- From the pre-selected layer until you reach the bottom layer do the following: for each cell in the current level compute the confidence interval indicating the cell's relevance to a given query;
1. If it is relevant, include the cell in a cluster
2. If it is irrelevant, remove the cell from further consideration
3. Otherwise, look for relevant cells at the next lower layer
Finally, combine relevant cells into relevant regions (based on grid-neighbourhood) and return the clusters so obtained as the answer.
Advantages: query-independent, easy to parallelize, incremental update; complexity O(K), where K is the number of grid cells at the lowest level.
Disadvantages: all the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected.

MODEL-BASED CLUSTERING
Model-based clustering methods attempt to optimize the fit between the given data and some mathematical model. Such methods are often based on the assumption that the data are generated by a mixture of underlying probability distributions. Model-based clustering methods follow two major approaches: the statistical approach or the neural network approach.
In the neural network approach, for example with self-organizing maps (SOMs):
1. Clustering is also performed by having several units competing for the current object
2. The unit whose weight vector is closest to the current object wins
3. The winner and its neighbors learn by having their weights adjusted
4. SOMs are believed to resemble processing that can occur in the brain
5. Useful for visualizing high-dimensional data in 2- or 3-D space

In model-based clustering, the data x are viewed as coming from a mixture density

$$ f(x) = \sum_{k=1}^{G} \tau_k f_k(x) $$

where $f_k$ is the probability density function of the observations in group k, and $\tau_k$ is the probability that an observation comes from the kth mixture component. Each component is usually modeled by the normal or Gaussian distribution. Component distributions are characterized by the mean $\mu_k$ and the covariance matrix $\Sigma_k$, and have the probability density function

$$ \phi(x_i; \mu_k, \Sigma_k) = \frac{\exp\left\{ -\frac{1}{2} (x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k) \right\}}{\sqrt{\det(2\pi\Sigma_k)}} $$

For univariate data, the covariance matrix reduces to a scalar variance. The likelihood for data consisting of n observations, assuming a Gaussian mixture model with G multivariate mixture components, is

$$ \prod_{i=1}^{n} \sum_{k=1}^{G} \tau_k \, \phi(x_i; \mu_k, \Sigma_k) $$

MCLUST is probably the most well known model-based clustering software.
This is all about various clustering algorithms.

III. HOW TO DETERMINE THE NUMBER OF CLUSTERS
Many clustering algorithms require the specification of the number of clusters to produce in the input data set, prior to execution of the algorithm. Barring knowledge of the proper value beforehand, the appropriate value must be determined, a problem on its own for which a number of techniques have been developed.
- If the number of clusters is known, the termination condition is given.
- In general, set a distance threshold value (termination condition).
- The K-cluster lifetime is the range of threshold values on the dendrogram tree that leads to the identification of K clusters.
- Heuristic rule: cut the dendrogram tree with maximum lifetime.
One simple rule of thumb sets the number to $k \approx \sqrt{n/2}$, with n as the number of objects.

Elbow criterion
The elbow criterion is a common rule of thumb to determine what number of clusters should be chosen, for example for k-means and agglomerative hierarchical clustering. The elbow criterion says that you should choose a number of clusters so that adding another cluster doesn't add sufficient information. More precisely, if you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information, but at some point the marginal gain will drop, giving an angle in the graph.
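One way to make the elbow criterion concrete is sketched below. The text does not prescribe how to locate the angle in the graph; this sketch assumes a simple heuristic, namely picking the number of clusters at which the marginal gain in explained variance drops the most. The function name `elbow_point` and the percentages in the example are hypothetical.

```python
def elbow_point(explained):
    """Return the number of clusters at the elbow.

    explained[i] is the percentage of variance explained when
    i + 1 clusters are used (so explained[0] is for one cluster).
    """
    if len(explained) < 3:
        raise ValueError("need at least three values to locate an elbow")
    # Marginal gain from adding one more cluster: gains[i] is k=i+1 -> k=i+2.
    gains = [b - a for a, b in zip(explained, explained[1:])]
    # Drop in marginal gain at each interior k: gain into k minus gain out of k.
    drops = [g_in - g_out for g_in, g_out in zip(gains, gains[1:])]
    best = max(range(len(drops)), key=lambda i: drops[i])
    return best + 2  # drops[0] corresponds to k = 2


# Hypothetical percentages of variance explained for k = 1..6.
explained = [40.0, 70.0, 90.0, 93.0, 95.0, 96.0]
print(elbow_point(explained))  # 3
```

In this example the gains are 30, 20, 3, 2 and 1 percentage points, so the sharpest drop occurs after the third cluster, which is where the graph bends.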
Another set of methods for determining the number of clusters are information criteria, such as:
- The Akaike information criterion (AIC),
- The Bayesian information criterion (BIC),
- The Deviance information criterion (DIC).

IV. HOW ALGORITHMS ARE COMPARED
The above clustering algorithms are compared according to the following factors:
- The size of the dataset,
- Number of the clusters,
- Type of dataset,
- Type of software.
Table 1 explains how the four algorithms are compared and the conclusions are written down.

Table 1. Comparison of the four algorithms.
Rows: Partitional, Hierarchical, Grid-based, Model-based.
Columns: Size of dataset, Number of clusters, Type of dataset, Type of software.

V. POSSIBLE APPLICATIONS
Clustering algorithms can be applied in many fields, for instance:
- Marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past buying records;
- Financial tasks: forecasting stock markets, currency exchange rates and bank bankruptcies, understanding and managing financial risk, trading futures, credit rating;
- Biology: classification of plants and animals given their features;
- Libraries: book ordering;
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds;
- City-planning: identifying groups of houses according to their house type, value and geographical location;
- Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones;
- WWW: document classification; clustering weblog data to discover groups of similar access patterns.

VI. CONCLUSION
Clustering is a descriptive technique. The solution is not unique and it strongly depends upon the analyst's choices. We described how it is possible to combine different results in order to obtain stable clusters, not depending too much on the criteria selected to analyze data. Clustering always provides groups, even if there is no group structure. When applying a cluster analysis we are hypothesizing that the groups exist. But this assumption may be false or weak. Clustering results should not be generalized.
Cases in the same cluster are similar only with respect to the information the cluster analysis was based on, i.e., the dimensions/variables inducing the dissimilarities.