
Data Clustering Analysis in a Multidimensional Space

A. Bouguettaya and Q. LeViet
School of Information Systems
Queensland University of Technology
Brisbane, Qld 4001, Australia
{athman,quangl}@icis.qut.edu.au

Abstract

Cluster analysis techniques are used to classify objects into groups based on their similarities. There is a wide choice of methods with different requirements in computer resources. We present the result of a fairly exhaustive study to evaluate three commonly used clustering algorithms, namely, single linkage, complete linkage, and centroid. The cluster analysis study is conducted in the 2-dimensional space. Three types of statistical distribution are used. Two different types of distances to compare lists of objects are also used. The results point to some startling similarities in the behavior and stability of all clustering methods.

1 Introduction and Motivation

Cluster analysis is a generic name for multivariate analysis techniques to create groups of objects based on their degree of association [21]. In simple words, cluster analysis classifies items into groups based on their similarities. In databases, cluster analysis has been used to re-allocate stored information based on predefined criteria with the goal to improve the efficiency of data retrieval operations. The standard way of evaluating the degree of similarity varies from one application to another. By re-allocating data, related information is physically stored as close as possible to reduce the number of disk accesses. In a distributed environment, clustering is even more important because of the impact on the response time if the requested data is physically located at different sites. The need for clustering is clear. There are many issues that need to be addressed:

1. Calculation of the degree of association between different types of objects.
2. Determination of an acceptable criterion to evaluate the "goodness" of clustering methods.
3. Adaptability of the clustering methods to different distributions of data: randomly scattered, skewed, or concentrated around certain regions, etc.
Several clustering methods have been proposed that differ in the approach taken to perform the clustering, some of which are: hierarchical versus partitional, agglomerative versus divisive, extrinsic versus intrinsic, etc. [7][10][8][14][22][23][25][18][1]. In that respect, each clustering method has a different theoretical basis and is applicable to particular fields. We propose a fairly exhaustive study of well-known clustering techniques in the 2-D space. Previous studies, which are less exhaustive in their analysis, have focused on the 1-D space [5]. Our experiments include a variety of environment settings to test the clustering techniques' sensitivity and behavior. The results can be of paramount importance across a wide range of applications.

1.1 Definitions

We present a high-level description of the different ontologies used in the research literature and this paper. The ontologies we cover are the following: cluster analysis, objects, clusters, distance and similarity, coefficient of correlation, and standard deviation.

Cluster analysis is about the generation of groups of entities that fit a set of definitions. The group which forms a cluster should have a higher degree of association within group members than with members of different groups. At a high level of abstraction, a cluster can be viewed as a group of "similar" objects. Cluster analysis is sometimes referred to as automatic classification. Cluster analysis has a property that makes it different from other classification methods, namely, information about classes of groupings is not known prior to the processing. Items are grouped into clusters that are defined by the members of those clusters.

Objects (or items) are used in a broad sense. They can be anything that requires to be classified based on certain criteria. An object may represent a single attribute in a relational database or a complex object in an object-oriented database, provided that it can be represented as a point in a measurement space. Obviously, all objects to be clustered should be defined in the same measurement space. In a 1-D (one-dimensional) environment, an object is represented as a point belonging to a segment defined by the interval [a, b], where a and b are arbitrary numbers.

Clusters are groups of objects linked together according to some rules. The goal of clustering is to find groups containing objects most homogeneous within these groups. Homogeneity refers to the common properties of the objects to be clustered. There are two ways to represent clusters in a measurement space: as a hypothetical point which is not an object in the cluster, or as an existing object in the cluster called the centroid or cluster representative. Clusters are represented in the measurement space in the same form as the objects they contain. To distinguish between an object and a cluster, additional information is needed: the number of objects in the cluster. From that point of view, a single object is a cluster containing exactly one object. An example of clusters in a 1-D environment is {}, {}, {,,}, etc.

Distance and Similarity: To cluster items in a database or in any other environment, some means of quantifying the degree of association between items are needed. They may be a measure of distances or similarities. There are a number of similarity measures available, and the choice may have an effect on the results obtained. Objects which have more than one dimension may use relative or normalized weights to convert their distances to an arbitrary scale so they can be compared. Once the objects are defined in the same measurement space as the points, it is a straightforward exercise to compute the distance or similarity. The smaller the distance, the more similar two objects are. The most popular choice in computing distance is the Euclidean distance:

d(i, j) = sqrt((x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_in - x_jn)^2)

where n is the number of dimensions. For the 1-D space, the distance becomes:

d(i, j) = |x_i - x_j|

There is also the Manhattan distance, or city block concept, that is represented as follows:

d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_in - x_jn|

The distance between two clusters involves some or all items of the two clusters and is calculated differently depending on the clustering method.

Standard Deviation is the measurement of the fluctuation of values as compared to the mean value. In this study, the standard deviation is used to show the acceptability of the results. The standard deviation of a random variable X is given by

sigma(X) = sqrt(E(X^2) - E^2(X))

Coefficient of Correlation is the measurement of the strength of the relationship between two variables X and Y. It essentially answers the question "how similar are X and Y?". The values of the coefficient of correlation range from 0 to 1, where the value 0 points to no similarity and the value 1 points to high similarity. The coefficient of correlation is used to find the similarity among objects. The correlation r of two random variables X and Y, where X = (x_1, x_2, x_3, ..., x_n) and Y = (y_1, y_2, y_3, ..., y_n), is given by the formula:

r = |E(X.Y) - E(X)E(Y)| / (sqrt(E(X^2) - E^2(X)) * sqrt(E(Y^2) - E^2(Y)))

where E(X) = (1/n) * sum_i x_i, E(Y) = (1/n) * sum_i y_i, and E(X.Y) = (1/n) * sum_i x_i * y_i.

1.2 Related Work

Cluster analysis has been used in several fields of science to determine taxonomy relationships among entities, including fields dealing with species, psychiatric profiles, census and survey data, and chemical structures [18][27][14][22]. Clustering applications range from databases (e.g., data clustering and data mining) to machine learning to data compression [11]. Our application domain is in the area of databases. In the case of database clustering, the ability to categorize items into groups allows the re-allocation of related data in a database to improve the performance of DBMSs. Data records which are frequently referenced together are moved into close proximity to reduce access time. To reach this goal, cluster analysis is used to form clusters based on the similarities of data items. Data may be re-allocated based on the values of an attribute, a group of attributes, or on accessing patterns. These criteria determine the measuring distance between data items.

With the advent of OODBs, the need for efficient clustering techniques becomes crucial. Some OODBs use some limited form of clustering to improve their performance. However, they are mostly static in nature [4]. The case of OODBs is unique in that the underlying model provides a testbed for dynamic clustering. This is the reason why clustering takes on a whole new meaning with OODBs. There has been a surge in the number of studies of database clustering [13][17][24][3][2]. In particular, there were recently a number of studies which investigate adaptive clustering techniques, i.e., clustering techniques which can cope with changing access patterns and perform clustering on-line [5][26].

In database mining and knowledge discovery, the primary goal is the search for, and the discovery of, previously unknown patterns in large data sets stored in databases [19][9]. The patterns are then used to predict the model of data classification. There is a wide range of benefits to "mining" data to find interesting associations. Data warehouses become valuable in terms of understanding, managing, and using previously unknown relationships between sets of data. Our experiments are meant to provide a generic view on how data is clustered. In this regard, the experiments do not target specific applications and are applicable to a wide range of domains.

This research builds upon previous work that we have conducted using a different set of environment settings. We investigated three commonly used clustering algorithms: Slink, Clink, and Average, with different settings [5]. The preliminary findings seem to point out that the choice of clustering method becomes irrelevant in terms of final outcomes. The study presented here extends our previous work to include several other settings [5]. The new environment settings include additional parameters: a new clustering method, a statistical distribution, larger inputs, and the space dimension. The aim is to provide a basis for a more categorical argument as to the behavior and sensitivity of the considered clustering methods.
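As an illustration only (the paper itself gives no code), the distance and correlation definitions of Section 1.1 can be sketched in Python; the function names are ours:

```python
import math

def euclidean(p, q):
    """d(i, j) = sqrt(sum over k of (x_ik - x_jk)^2) for n-dimensional points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block distance: sum over k of |x_ik - x_jk|."""
    return sum(abs(a - b) for a, b in zip(p, q))

def correlation(xs, ys):
    """The paper's r: |E(XY) - E(X)E(Y)| divided by the product of the
    standard deviations sqrt(E(X^2) - E^2(X)) and sqrt(E(Y^2) - E^2(Y))."""
    n = len(xs)
    ex, ey = sum(xs) / n, sum(ys) / n
    exy = sum(x * y for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum(x * x for x in xs) / n - ex ** 2)
    sy = math.sqrt(sum(y * y for y in ys) / n - ey ** 2)
    return abs(exy - ex * ey) / (sx * sy)

# A 2-D check: the classic 3-4-5 right triangle.
assert euclidean((0.0, 0.0), (3.0, 4.0)) == 5.0
```

In 1-D both distances reduce to |x_i - x_j|, the form used later in the examples of Section 2.3.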
1.3 Statistical Distributions

The objects used in this study consist of points lying in the interval [0,1]. Our aim is to see whether clustering is dependent on the way objects are generated. There are numerous statistical distributions; we selected the three that closely model real-world distributions [6]. The first one is the uniform distribution, the second one is the piecewise distribution, and the third one is the Gaussian distribution. In what follows, we describe the statistical distributions that we used in our experiments.

Uniform Distribution

The respective distribution function is the following: F(x) = x. The density function of this distribution is f(x) = F'(x) = 1 for all x such that 0 <= x <= 1.

Piecewise (Skewed) Distribution

The respective distribution function is the following:

F(x) = 0.05   if 0 <= x < 0.37
       0.475  if 0.37 <= x < 0.62
       0.525  if 0.62 <= x < 0.743
       0.95   if 0.743 <= x < 0.89
       1      if 0.89 <= x <= 1

The density function of this distribution is f(x) = (F(b) - F(a)) / (b - a) for all x such that a <= x < b.

Gaussian (Normal) Distribution

This is a two-parameter (mu and sigma) distribution, where mu is the mean of the distribution and sigma^2 is the variance. The density function is:

f(x) = F'(x) = (1 / (sigma * sqrt(2 * pi))) * e^(-(x - mu)^2 / (2 * sigma^2))

The distribution is symmetric about the mean mu. In producing samples for the Gaussian distribution, we choose fixed values of mu and sigma. Tabulated values of the distribution function F(x) on [0, 1] include:

F(x) = 0.00132 if 0.1 <= x < 0.2
       0.02277 if 0.2 <= x < 0.3
       0.15867 if 0.3 <= x < 0.4
       0.49997 if 0.4 <= x < 0.5

Following is an outline of the paper. In Section 2, the clustering methods used in this study are described. In Section 3, we detail the experiments conducted in this study. In Section 4, we provide the interpretations of the experiment results. In Section 5, we provide some concluding remarks.

2 Clustering Methods

There are different ways to classify clustering methods according to the type of cluster structure they produce. The simple non-hierarchical methods divide the data set of N objects into M clusters, where no overlap is allowed. They are also known as partitioning methods. Each item is a member of only the cluster with which it is most similar, and the cluster may be represented by a centroid or cluster representative that represents the characteristics of all contained objects. This method is heuristically based and mostly applied in the social sciences.

Hierarchical methods produce a nested data set in which pairs of items or clusters are successively linked until every item in the data set is linked to form one cluster. Hierarchical methods can be:

- either agglomerative, with n-1 pairwise joins from an unclustered data set. In other words, from n clusters of one object each, this method gradually forms one cluster of n objects. At each step, clusters or objects are joined together into larger and larger clusters, ending with one big cluster containing all objects;

- or divisive, in which all objects belong to a single cluster at the beginning; then they are divided into smaller clusters over and over until the last cluster containing two objects has been broken apart into both atomic constituents.

In both cases, the result of the procedure is a hierarchical tree, hence the name "hierarchical methods". The hierarchical tree may be presented as a dendrogram, in which the pairwise coupling of the objects in the data set is shown, and the length of the branches (vertices) or the value of the similarity is represented numerically. Divisive methods are less commonly used, and only agglomerative methods will be discussed in this paper.

2.1 Hierarchical Clustering Techniques

This section describes the hierarchical agglomerative clustering methods and their characteristics. These methods have been used in the experiments presented in this paper.

Single linkage clustering method (Slink): The distance between two clusters is the smallest of all distances between two items (x, y), denoted D_{x,y}, such that x is a member of one cluster and y is a member of another cluster. This method is also known as the nearest neighbor method. The distance D_{X,Y} is computed as follows:

D_{X,Y} = min{D_{x,y}} with x in X, y in Y

where (X, Y) are clusters and (x, y) are objects in the corresponding clusters. This method is the simplest among all clustering methods. It has some attractive theoretical properties [12]. However, it tends to form long or chaining clusters. This method may not be very suitable for objects concentrated around some centers in the measurement space.

Complete linkage clustering method (Clink): The similarity coefficient is the longest distance between any pair of objects, denoted D_{x,y}, taken from two clusters. This method is also called the furthest neighbor clustering method [16][22][8]. The distance is computed as follows:

D_{X,Y} = max{D_{x,y}} with x in X, y in Y

Centroid/median method: Clusters in this method are represented by a "centroid", a point in the middle of the cluster. The distance between two clusters is the distance between their centroids. This method also has a special case in which the centroid of the smaller group is leveled to the larger one.
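The three inter-cluster distances can be sketched for 1-D objects as below. This is an illustrative Python fragment, not part of the paper; the function names are ours, and Clink takes the furthest pair (the furthest neighbor rule):

```python
def slink_distance(X, Y):
    """Single linkage: smallest pairwise distance across the two clusters."""
    return min(abs(x - y) for x in X for y in Y)

def clink_distance(X, Y):
    """Complete linkage: largest pairwise distance (furthest neighbor)."""
    return max(abs(x - y) for x in X for y in Y)

def centroid_distance(X, Y):
    """Centroid method: distance between the two cluster centroids."""
    cx, cy = sum(X) / len(X), sum(Y) / len(Y)
    return abs(cx - cy)

A, B = [0.1, 0.2], [0.4, 0.9]
# Nearest pair is (0.2, 0.4), furthest pair is (0.1, 0.9),
# and the centroids are 0.15 and 0.65.
```

The same set of clusters can thus be joined at quite different distances depending on the method, which is exactly what the examples of Section 2.3 illustrate.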
2.2 General Algorithm

We provide examples using the three clustering methods. More examples can be found in [22][8]. As a first step, objects are generated using a random number generator. In our case, these objects are points in the interval [0,1]. After these objects are created, they are compared to each other by measuring the distance. The distance between two clusters is computed using the similarity coefficient. The way objects and clusters of objects are joined together to form larger clusters varies with the approach used. We outline a generic algorithm that is applicable to all clustering methods. Essentially, it consists of two phases. The first phase records the similarity coefficients. The second phase computes the minimum and then performs the clustering. Initially, every cluster consists of exactly one object.

1. Scan all clusters and record all similarity coefficients.
2. Compute the minimum of all similarity coefficients and then join the corresponding clusters.
3. If exactly one cluster remains, then stop.
4. Go to (1).

There is a case when using the Clink method where ambiguity may arise. Suppose that when performing Step 1, three successive clusters are to be joined (they all have the minimum similarity value). When performing Step 2, the first two clusters are joined. However, when computing the similarity coefficient between this new cluster and the third cluster, the similarity value may now be different from the minimum value. The question now is what the next step should be. There are essentially two possibilities:

- Either proceed by joining clusters using a recomputation of the similarity coefficient each time in Step 2.
- Or join all those clusters that had the minimum similarity coefficient at once and do not recompute the similarity in Step 2.

In general, there is no evidence that one is better than the other [22]. For our study, we selected the first alternative.

2.3 Examples

Following are examples of how data is clustered, to provide an idea of how different clustering methods work with the same set of data. Example 1 uses the Slink method, Example 2 uses the Clink method, and Example 3 uses Centroid. The sample data has 10 items; each item has an identification and a value on which the distance is calculated. The uniform distribution was used to generate the set of objects. For the sake of simplicity, we consider the 1-D space for generating data objects.

Example 1: Slink

1. Join clusters {10} and {1} at distance 0.0117584.
2. Join clusters {6} and {10,1} at 0.05885.
3. Join clusters {4} and {8} at 0.05885140.
4. Join clusters {3} and {7} as one cluster, and {2}, {6,10,1}, {5}, and {9} as another cluster at 0.05885148.
5. Join clusters {4,8} and {3,7} and {2,6,10,1,5,9} as one cluster at 0.17643.

Table 1: Example of a sample data list. Ten items, with Ids 1 to 10 and values generated from the uniform distribution: 0.058909, 0.117760, 0.1764770, 0.294196, 0.353047, 0.529483, 0.588334, 0.6471851, 0.823621, 0.9882473.

Figure 1: Clustering Tree using Slink (leaf order: 4, 8, 3, 7, 2, 6, 10, 1, 5, 9).

Example 2: Clink

1. Join clusters {1} and {10} at distance 0.0117584.
2. Join clusters {4} and {8} at 0.05885140.
3. Join clusters {3} and {7} as one cluster, {2} and {6} as another cluster, and {5} and {9} as another cluster at 0.05885148.
4. Join clusters {2,6} and {1,10} at 0.235286.
5. Join clusters {4,8} and {3,7} at 0.294138.
6. Join clusters {2,6,10,1} and {5,9} at 0.352989.
7. Join clusters {4,8,3,7} and {2,6,10,1,5,9} at 0.823563.

Figure 2: Clustering Tree using Clink (leaf order: 4, 8, 3, 7, 2, 6, 10, 1, 5, 9).

Example 3: Centroid

1. Join clusters {10} and {1} at a distance of 0.0117585 and form the central point at 0.7059775.
2. Join clusters {2} and {6} as one cluster, {3} and {7} as another cluster, and {4} and {8} as another cluster at 0.05885148, and form the central points at 0.0589085, 0.2241215, and 0.5883345.
3. Join clusters {5} and {9} at 0.058852 and form the central point at 0.93530470.
4. Join clusters {1,10} and {2,6} at 0.6478690 and form the central point at 0.8824430.
5. Join clusters {3,7} and {4,8} at 0.2357870 and form the central point at 0.7062280.
6. Join clusters {1,10,2,6} and {5,9} at 0.4706040 and form the central point at 0.1177450.
7. Join clusters {1,10,2,6,5,9} and {3,7,4,8} at 0.315170.

Figure 3: Clustering Tree using Centroid (leaf order: 4, 8, 3, 7, 2, 6, 10, 1, 5, 9).
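The two-phase procedure of Section 2.2 can be sketched end to end. This is an illustrative Python version using single linkage on ten uniform 1-D objects; the data and names are ours, not the paper's sample list:

```python
import random

def slink(X, Y):
    # Single linkage similarity coefficient: nearest pair across clusters.
    return min(abs(x - y) for x in X for y in Y)

def agglomerate(objects, dist=slink):
    """Generic algorithm: start with singleton clusters, then repeatedly
    scan all pairs (phase 1) and join the pair with the minimum
    similarity coefficient (phase 2) until one cluster remains."""
    clusters = [[x] for x in objects]
    merges = []                      # record of (distance, joined cluster)
    while len(clusters) > 1:
        # Phase 1: scan all cluster pairs and record similarity coefficients.
        pairs = [(dist(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        # Phase 2: take the minimum and join the corresponding clusters.
        d, i, j = min(pairs)
        joined = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(joined)
        merges.append((d, joined))
    return merges

random.seed(0)
objects = [random.random() for _ in range(10)]   # uniform on [0, 1]
history = agglomerate(objects)
# n objects always yield n - 1 joins; the last join contains every object.
assert len(history) == len(objects) - 1
```

For single linkage the recorded join distances are nondecreasing, which is why the merge history can be drawn as a dendrogram like Figures 1 to 3.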

3 Experiment Description

In this study, the hierarchical tree is implemented as a minimum spanning tree (MST). An MST is a tree in which n objects are linked by n-1 connections and there is no cycle in the tree. The MST has the following properties that make it suitable to represent the hierarchical tree:

- Any isolated item can be connected to a nearest neighbor.
- Any isolated fragment (sub-set of an MST) can in any case be connected to a nearest neighbor by the available link.

Data used in this paper is drawn from a two-dimensional (2-D) space. The values range from 0 to 1 inclusive, and the sizes of the lists range from 100 to 500. The linear congruential algorithm is used to generate data objects [15][20]. The seed is the system time. Each experiment followed these steps:

- Generate lists of objects.
- Carry out the clustering process with the three different clustering methods.
- Calculate the coefficient of correlation for each clustering method.

Each experiment is repeated 100 times, and the standard deviation of the coefficients of correlation is calculated. The least square approximation (LSA) is used to evaluate the acceptability of the approximation. If a coefficient of correlation obtained using the LSA falls within the segment defined by the corresponding standard deviation, the approximation is deemed to be acceptable.

Types of distances in comparing trees: There are two ways of comparing two trees, obtained from lists of objects, to compute their coefficient of correlation. The distance used in the coefficient of correlation could, for instance, be computed using the actual linear (Euclidean) difference between two objects. The other method of computing the distance is to use the minimum number of edges (of a tree) needed to join two objects. The latter has an advantage over the former in that it provides a more "natural" implementation of a correlation. We call the first type of distance linear distance, and the second, edge distance. Once we choose a distance type, we compute the coefficient of correlation by selecting one pair of identifiers in the second list (the shorter list) and computing its distance, and then looking for the same pair in the first list and computing its distance. We repeat the same process for all remaining pairs in the second list.
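While the linear distance is just the value difference between two objects, the edge distance needs a walk over the tree. A hedged sketch, assuming the MST is kept as an adjacency list (Python, names ours):

```python
from collections import deque

def edge_distance(tree, a, b):
    """Minimum number of tree edges joining objects a and b, found by a
    breadth-first search; tree maps each node to its list of neighbors."""
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if node == b:
            return hops
        for nxt in tree[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, hops + 1))
    raise ValueError("objects are not connected")

# A small MST over five objects with edges 1-2, 2-3, 3-4, 3-5.
mst = {1: [2], 2: [1, 3], 3: [2, 4, 5], 4: [3], 5: [3]}
assert edge_distance(mst, 1, 4) == 3   # path 1-2-3-4
assert edge_distance(mst, 4, 5) == 2   # path 4-3-5
```

In a tree the path between two nodes is unique, so the breadth-first count is exactly the minimum number of edges the paper refers to.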
There are several types of coefficients of correlation. This stems from the fact that several parameters have been used. For instance, the clustering method is one parameter. There are 3 * 2 * 3 = 18 possible ways to compute the coefficient of correlation for two lists of objects. Indeed, we have the following choices:

- first parameter: Slink, Clink, or Centroid;
- second parameter: linear distance or edge distance;
- third parameter: uniform distribution, piecewise distribution, or Gaussian distribution.

The other dimension of this comparison study that has a direct influence on the clustering is the data input. This determines what kind of data is to be compared and what its size is. The following cases have been identified to check the sensitivity of each clustering method with regard to the input data. For every type of coefficient of correlation mentioned above, eleven (11) types of situations (hence, eleven coefficients of correlation) have been isolated. It is our belief that these cases closely represent what may influence the choice of a clustering method.

1. The coefficient of correlation is between pairs of objects drawn from a set S and pairs of objects drawn from the first half of the same set S. The first half of S is used before the set is sorted.
2. The coefficient of correlation is between pairs of objects drawn from S and pairs of objects drawn from the second half of S. The second half of S is used before the set is sorted.
3. The coefficient of correlation is between pairs of objects drawn from the first half of S, say S2, and pairs of objects drawn from the first half of another set S', say S'2. The two sets are given ascending identifiers after being sorted. The first object of S2 is given the number 1 as identifier, and so is the first object of S'2. The second object of S2 is given the number 2 as identifier, and so is the second object of S'2, and so on.
4. The coefficient of correlation is between pairs of objects drawn from the second half of S, say S2, and pairs of objects drawn from the second half of S', say S'2. The two sets are given ascending identifiers after being sorted in the same way as in the previous case.
5. The coefficient of correlation is between pairs of objects drawn from S and pairs of objects drawn from the union of a set X and S. The set X contains 10% new randomly generated objects.
6. The definition is the same as for the fifth coefficient of correlation, except that X now contains 20% new randomly generated objects.
7. The definition is the same as for the fifth coefficient of correlation, except that X now contains 30% new randomly generated objects.
8. The definition is the same as for the fifth coefficient of correlation, except that X now contains 40% new randomly generated objects.
9. The coefficient of correlation is between pairs of objects drawn from S using the uniform distribution and pairs of objects drawn from S' using the piecewise distribution.
10. The coefficient of correlation is between pairs of objects drawn from S using the uniform distribution and pairs of objects drawn from S' using the Gaussian distribution.
11. The coefficient of correlation is between pairs of objects drawn from S using the Gaussian distribution and pairs of objects drawn from S' using the piecewise distribution.
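The 3 * 2 * 3 = 18 parameter combinations described above can be enumerated mechanically; a small illustrative Python fragment (the label strings are ours):

```python
from itertools import product

methods = ["Slink", "Clink", "Centroid"]
distances = ["linear", "edge"]
distributions = ["uniform", "piecewise", "gaussian"]

# One tuple per way of computing the coefficient of correlation.
combinations = list(product(methods, distances, distributions))
assert len(combinations) == 3 * 2 * 3
```

Crossed with the eleven input situations, this yields the full grid of experiments the paper reports on.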

In a nutshell, the above coefficients of correlation are meant to analyze different situations in the evaluation of results. We partition these situations into three groups, represented by three blocks of coefficients of correlation:

First Block: The first, second, third, and fourth coefficients of correlation are used to check the influence of the context on how objects are clustered.

Second Block: The fifth, sixth, seventh, and eighth coefficients of correlation are used to check the influence of the data.

Third Block: The ninth, tenth, and eleventh coefficients of correlation are used to check the relation which may exist between two lists obtained using two different distributions.

To ensure the statistical representativity of the results, the averages of 100 coefficient of correlation and standard deviation values (of the same type) are computed. The least square approximation is then applied to obtain the following equation:

f(x) = ax + b

The criterion for a good approximation (or acceptability) is given by the inequality:

|y_i - f(x_i)| <= sigma(y_i) for all i

where y_i is the coefficient of correlation, f is the approximation function, and sigma(y_i) is the standard deviation for y_i. If this inequality is satisfied, f is then a good approximation. The least square approximation, if acceptable, helps predict the behavior of clustering methods for data points beyond the range considered in our experiments.

4 Results and their Interpretations

As stated earlier, the aim of this study is to conduct experiments to determine the stability of clustering methods and how they compare to each other. For the sake of readability, a shorthand notation is used to indicate all possible cases. A similar notation has been used in our previous findings [5]. For instance, to represent the input with the following parameters: Slink, uniform distribution, and linear distance, the abbreviation SUL is used. Results are presented in figures and tables. The figures describe the different types of coefficients of correlation. The tables describe the least square approximations of the coefficients of correlation.
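The fit and acceptability test just described can be sketched as follows. This is an illustrative Python fragment with hypothetical data; the fit is ordinary least squares for f(x) = ax + b:

```python
def least_squares(xs, ys):
    """Fit f(x) = a*x + b by ordinary least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

def acceptable(xs, ys, sigmas):
    """Criterion |y_i - f(x_i)| <= sigma(y_i) for all i."""
    a, b = least_squares(xs, ys)
    return all(abs(y - (a * x + b)) <= s
               for x, y, s in zip(xs, ys, sigmas))

xs = [100, 200, 300, 400, 500]        # e.g. input list sizes
ys = [0.81, 0.80, 0.82, 0.79, 0.81]   # averaged correlation coefficients
sigmas = [0.05] * 5                   # corresponding standard deviations
assert acceptable(xs, ys, sigmas)
```

A fitted line whose residuals all stay within one standard deviation is, in the paper's terms, a good approximation, and its near-zero slope is what the interpretation sections below lean on.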

Term Slink Clink Centroid Shorthand UniformDistr. S GaussianDistr.GC PiecewiseDistr.PO LinearDistanceLU Table2:ListofAbbreviations EdgeDistance E Coecient First&Fifth Second&Sixth Third&Seventh Correlation ofdashed solid dotted First Blocksand SecondTenth Ninth Eleventh Coecient Correlation ofdashed solid dotted ThirdBlock Table3:GraphicalRepresentationofallTypesofCorrelationCoecientsandStandardDeviations Fourth&Eighth dash-dotted 4.1AnalysisoftheStabilityandSensitivityoftheClusteringMethods Werstlookatthedierentclusteringmethodsandanalyzehowstableandsensitivetheyareto tothechangesinparametervalues. 4.1.1Slink:ResultsInterpretation thevariousparameters.inessence,weareinterestedinknowinghoweachclusteringmethodreacts Wethenprovideaninterpretationofthecorrespondingresults. Welookatthebehaviorofthe3blocksofcoecientsofcorrelationvaluesasdenedinsection3. thecontext.fig.4representtherst4coecientsofcorrelation.the(small)dierencebetweenl Aspreviouslymentioned,therst4coecientsofcorrelationaremeanttotesttheinuenceof Firstblockofcoecientsofcorrelation andevaluesisconsistentlythesameacrossallexperimentswiththeexceptionofthoseexperiments comparingdierentdistributions(fig.6,fig.9,andfig.12).wenotethatthevaluesusingeare However,whenEisused,thismaynotbetrue(eg.treethatisnotheightbalanced)sincethe distanceisequaltothenumberofedgesconnectingtwomembersbelongingtodierentclusters. WhenLisused,thedistancebetweenthemembersoftwoclustersisthesameforallmembers. consistentlysmallerthanthosevaluesusingl.thereasonliesindierenceincomputingdistances. InthecaseofFig.6,Fig.9,andFig.12thedierenceisattenuatedduetotheuseofdierent 13

Single linkage Uniform dist. Linear Single linkage Uniform dist. Edge 5 5 5 5 5 1 Single linkage Piecewise dist. Linear Single linkage Piecewise dist. Edge 0.9 Single linkage Gaussian dist. Linear Single linkage Gaussian dist. Edge Figure4:Slink:FirstBlockofCoecientofCorrelation distributions. ofcorrelationisalmostthesame.thispointstothefactthatthedistancetypedoesnotplaya WhenthevaluesinLandEarecomparedagainsteachother,thetrendamongthe4coecients majorroleinthenalclustering. typesofcorrelationcomparedataobjectsdrawnfromdierentsets.oneshouldexpecttheformer fourthtypesofcorrelationbecauseofthecorrespondingintrinsicsemantics.therstandsecond typesofcorrelationcomparesdataobjectsdrawnformthesameinitialset.thethirdandfourth Thevaluesoftherstandsecondtypesofcorrelationarelargerthanthoseofthethirdand dataobjectstobemorerelatedthanthelatterdataobjects.thestandarddeviationvaluesexhibit pointstothefactthatthedierenttypesofcorrelationbehaveinauniformandpredictablefashion. roughlythesamekindofbehaviorastheircorrespondingcoecientofcorrelationvalues.this totheimportantobservationthatthedatacontextdoesnotseemtoplayasignicantroleinthe onthenalclustering.notethattheslopevalueisalmostequaltozero.thisisalsoconrmedby naldataclustering.likewise,thedatasetdoesnotseemtohaveanysubstantialinuence Sincethecoecientofcorrelationvaluesaregreaterthan.5inalltheabovecases,thispoints 14

Second block of coefficients of correlation

The next 4 coefficients of correlation check the influence of the data on clustering. Fig. 5 depicts the next block of coefficients of correlation.

Figure 5: Slink: Second Block of Coefficient of Correlation. Panels: single linkage under Uniform, Piecewise, and Gaussian distributions, for Linear and Edge distances.

When the values in L and E are compared, there is no substantial difference, which is indicative of the independence of the clustering from any type of distance used. As in the previous case, the standard deviation values exhibit the same behavior as the corresponding coefficient of correlation values. This is reminiscent of a uniform and predictable behavior of the different types of correlations.

The high values indicate that the contexts have little effect on how data is clustered. As in the previous case, the data does not seem to influence the final clustering outcome as the slope is nearly equal to zero. Likewise, the dataset does not seem to have any substantial influence on the final clustering. Note that the slope value is almost equal to zero. This is also confirmed by the uniform behavior of the standard deviation values as described above.

onthenalclustering.notethattheslopevalueisalmostequaltozero.thisisalsoconrmedby theuniformbehaviorofthestandarddeviationvaluesasdescribedabove. Thirdblockofcoecientsofcorrelation Single linkage... Linear Single linkage... Edge 5 5 std / cc (9 11) 5 5 std / cc (9 11) 5 5 5 5 5 Thenext3coecientsofcorrelationchecktheinuenceofthedistributionforLandE.All Figure6:Slink:ThirdBlockofCoecientofCorrelation 5 otherparametersaresetandthesameforthepairsofsetsofobjectstobecompared. case,showsvaluesthatareabitlowerthanthevaluesincurvesrepresentingug(uniformand Gaussiandistributions)andGP(GaussianandPiecewisedistributions).Thiscanbeexplained Fig.6depictsthelastthreetypesofcoecientsofcorrelations. bytheproblemofbootstrappingtherandomnumbergenerator.thisisaconstantmostofthe ThecurverepresentingthecaseforUP(UniformandPiecewisedistributions)ineitherLorE experimentsconductedinthisstudy.therstconcurrentexperiments(slinkusinglande) exhibitabehaviorthatisalittledierentfromtheotherpiecesofexperiments. Thisisindicativeoftheindependenceoftheclusteringfromthetypesofdistancesused.Like thepreviouscase,thestandarddeviationvaluesexhibitthesamebehaviorasthecorresponding coecientofcorrelationvalues.thisisreminiscentofauniformandpredictablebehaviorofthe WhenthevaluesinthecaseofLandEarecompared,nosubstantialdierenceisobserved. dierenttypesofcorrelations. clusteringonewayortheother.asinthepreviouscase,thedatadoesnotseemtoinuence seemtohaveanysubstantialinuenceonthenalclustering.notethattheslopevalueisalmost thenalclusteringoutcomeastheslopeisnearlyequaltozero.likewise,thedatasetdoesnot Sincevaluesconvergetothevalue.5,thisindicatesthatthedistributionsdonotinuencethe equaltozero.thisisalsoconrmedbytheuniformbehaviorofthestandarddeviationvaluesas describedabove. behaviorastheslinkmethod.fig.7,fig.8,and,fig.9depictthetherst,second,thirdblocks Inessence,theexperimentsfortheClinkclusteringmethodfollowthesametypeofpatternand 4.1.2Clink:ResultsInterpretation 16

Figure 7: Clink: First Block of Coefficient of Correlation. Panels: complete linkage under Uniform, Piecewise, and Gaussian distributions, for Linear and Edge distances.

The interpretations that apply for the Slink method also apply for the Clink clustering method, as both follow a very similar pattern of behavior and the values for both the coefficients of correlation and the standard deviation are quite similar.

4.1.3 Centroid: Results Interpretation

As was the case for Slink and Clink, the experiments for the Centroid clustering method follow the same type of pattern and behavior. The only difference lies in the values for the different correlations obtained using different parameters. Fig. 10, Fig. 11, and Fig. 12 depict the first, second, and third blocks of coefficients of correlation.

The interpretations that apply for the previous clustering methods also apply for the Centroid clustering method, as it follows a similar pattern of behavior. Similarly, the values for both the coefficients of correlation and the standard deviation are also similar.
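The similar behavior of the three methods can be illustrated with a minimal sketch of the three linkage rules. This is a simplified illustration only, not the implementation used in the experiments, and the function names (`agglomerate`, `slink_distance`, etc.) are ours:

```python
import math

def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def slink_distance(c1, c2):
    """Single linkage: distance between the two closest members."""
    return min(dist(a, b) for a in c1 for b in c2)

def clink_distance(c1, c2):
    """Complete linkage: distance between the two farthest members."""
    return max(dist(a, b) for a in c1 for b in c2)

def centroid_distance(c1, c2):
    """Centroid linkage: distance between the cluster centroids."""
    cx1 = (sum(p[0] for p in c1) / len(c1), sum(p[1] for p in c1) / len(c1))
    cx2 = (sum(p[0] for p in c2) / len(c2), sum(p[1] for p in c2) / len(c2))
    return dist(cx1, cx2)

def agglomerate(points, k, linkage):
    """Repeatedly merge the two closest clusters until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

On well-separated data all three linkage rules produce the same partition, which mirrors the observation above that the methods differ only in the single inter-cluster distance they compute, not in the overall merging behavior.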

Figure 8: Clink: Second Block of Coefficient of Correlation. Panels: complete linkage under Uniform, Piecewise, and Gaussian distributions, for Linear and Edge distances.

Figure 9: Clink: Third Block of Coefficient of Correlation. Panels: complete linkage, Linear and Edge distances; curves show std / cc (9-11).

Figure 10: Centroid: First Block of Coefficient of Correlation. Panels: centroid linkage under Uniform, Piecewise, and Gaussian distributions, for Linear and Edge distances.

4.2 Summary of Results

Table 4 shows a summary of the results that averages out the different computed coefficients of correlation. Block 1, Block 2, and Block 3 correspond respectively to the first block of 4 coefficients of correlation, the second block of 4 coefficients of correlation, and the third block of 3 coefficients of correlation.

4.3 Acceptability of the Least Square Approximation

Table 5, Table 6, and Table 7 (see Appendix) represent the least square approximations for all the curves shown in this study. The acceptability of an approximation depends on whether all the coefficient of correlation values fall within the interval delimited by the approximating function and the standard deviation. If this is the case, then we say that the approximation is good. Otherwise, we look at how many points do not fall within the boundaries and determine the goodness of the function.

Figure 11: Centroid: Second Block of Coefficient of Correlation. Panels: centroid linkage under Uniform, Piecewise, and Gaussian distributions, for Linear and Edge distances.

Figure 12: Centroid: Third Block of Coefficient of Correlation. Panels: centroid linkage, Linear and Edge distances; curves show std / cc (9-11).

Block 1:  L U .6    E U .5    G .7 .6 .55 .75 .7
Block 2:  L U .8    P .7    G .6 .6 .55    P .9 .85 .65    E U .65    G .85 .85 .75
Block 3:  L    P .8    G .7 .75 .65    E .5 .55 .5    .45
(columns: Slink, Clink, Centroid)

Table 4: Summary of Results

Using these functions will enable us to predict the behavior of the clustering methods with larger datasets.

As Table 5, Table 6, and Table 7 (see Appendix) show, the values of the slopes are all very small. This points to the stability of all results. All approximations yield lines almost parallel to the x-axis.

The acceptability test was run and all points passed the test satisfactorily. Therefore all the approximations listed in the tables mentioned above are good approximations.

4.4 Comparison of Results across Clustering Methods

In what follows, we compare the different clustering methods against each other using the different parameters used in this study. We rely on the results obtained and the general observations that can be drawn from the experiments shown in the different figures and tables.

1. The results show that across space dimensions, the context does not completely hide the sets. For instance, the first and second types of coefficients of correlation (as shown in all figures) are a little different from the third and fourth types of coefficients of correlation. The values clearly show what kinds of coefficients are computed.

2. The results show that given the same distribution and type of distance, all clustering methods exhibit the same behavior and yield approximately the same values.

3. Slink, Clink, and Centroid seem to have very similar behavior. The coefficients of correlation values are also very close. An explanation for the similarity in behavior between Slink, Clink, and Centroid is that these methods are based on one single object per cluster to determine similarity distances.

4. The second block of coefficients of correlation for all clustering methods demonstrates that the context does not influence the data clustering, because all coefficients of correlation are close to the value 1.

5. The results also show that all clustering methods are equally stable. This finding comes as a surprise since, intuitively, one expects one clustering method to be more stable than the others.

6. The results show that the data distribution does not significantly affect the clustering techniques, because the values obtained are very similar to each other. That is a relatively major finding, as the results strongly point to the independence of the data clustering from the distribution.

7. The third block of coefficients of correlation across all clustering methods shows that the three methods are little or not perturbed even in a noisy environment, since there are no significant differences in results from the uniform, piecewise, and Gaussian distributions.

8. The type of distance (linear or edge) does not influence the clustering process, as there are no significant differences between the coefficients of correlation obtained using either linear or edge distances.

The results obtained in this study confirm those from [5], which used a one-dimensional (1-D) data sample and fewer parameters. The results point very strongly to the conclusion that, in general, no clustering technique is better than another. What this essentially means is that there is an inherent way for data objects to cluster, independently of the techniques used.

The other important result is that the only discriminator for selecting a clustering method is its computational attractiveness and nothing else. This is a very important result, as in the past there was never evidence that clustering methods had very similar behavior.

The results presented here are compelling evidence that clustering methods do not seem to influence the outcome of the clustering process. Indeed, all clustering methods considered here exhibit a behavior that is almost constant regardless of the parameters being used in comparing them.
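The least square approximation and the acceptability test of Section 4.3 can be sketched as follows. This is a simplified illustration under our own naming, not the fitting code used in the study: a line y = aX + b is fit to a series of coefficient of correlation values, the slope is checked to be near zero, and the fit is accepted when every observed value falls within the band delimited by the fitted line and the standard deviation:

```python
def least_squares_line(xs, ys):
    """Fit y = a*x + b by ordinary least squares; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

def acceptable(xs, ys, std):
    """Acceptability test: every observed value lies within fit(x) +/- std."""
    a, b = least_squares_line(xs, ys)
    return all(abs(y - (a * x + b)) <= std for x, y in zip(xs, ys))
```

For a series of stable coefficient of correlation values the fitted slope is tiny, matching the observation that all approximations yield lines almost parallel to the x-axis.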
5 Conclusion

In this exhaustive study, we considered three clustering methods. The experiments involved a wide range of parameters to test the stability and sensitivity of the clustering methods. These experiments were conducted for objects that are in the 2-D space. The results obtained overwhelmingly point to the stability of each clustering method and to its low sensitivity to noise. The most startling finding of this study, however, is that all clustering methods exhibit an almost identical behavior, regardless of the parameters used. The fact that data objects are drawn from different data spaces does not change the above findings [5].

The above findings have paramount implications. In that regard, one of the most important results is that objects have a natural tendency to cluster themselves. Clustering methods do not seem to play a major role in the final shape of the clustering. This also means that the only criterion that should be used to select one clustering method is the attractiveness of the computational complexity of the clustering algorithm.
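The claim that different methods converge on nearly the same final clustering can be quantified by a pairwise agreement score between two cluster assignments, such as the Rand index. This measure is not used in the paper; it is shown here only as an illustration of how such a comparison could be made:

```python
from itertools import combinations

def pair_agreement(labels_a, labels_b):
    """Rand index: fraction of point pairs on which two clusterings agree
    (both place the pair in the same cluster, or both place it apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)
```

A score of 1.0 means the two clusterings induce the same partition (even if the cluster labels themselves differ), which is the kind of agreement the study reports across Slink, Clink, and Centroid.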

Acknowledgments

We would like to thank Mostefa Golea and Alex Delis for their insightful and detailed comments.

References

[1] M. S. Alderfer and R. K. Blashfield. Cluster Analysis. Sage Publications, California, 1984.
[2] Jay Banerjee, Won Kim, Sung-Jo Kim, and Jorge Garza. Clustering a DAG for CAD databases. IEEE Transactions on Software Engineering, 14(11):1684-1699, 1988.
[3] Veronique Benzaken. An evaluation model for clustering strategies in the O2 object-oriented database system. In ICDT, 1990.
[4] Veronique Benzaken and Claude Delobel. Enhancing performance in a persistent object store: Clustering strategies in O2. In PODS, 1990.
[5] A. Bouguettaya. On-line Clustering. IEEE Transactions on Knowledge and Data Engineering, 8(2), April 1996.
[6] A. Delis and V. R. Basili. Data Binding Tool: a Tool for Measurement Based Ada Source Reusability and Design Assessment. International Journal of Software Engineering and Knowledge Engineering, 3(3):287-318, November 1993.
[7] R. Dubes and A. K. Jain. Clustering Methodologies in Exploratory Data Analysis. Advances in Computers, 19, 1980.
[8] B. Everitt. Cluster Analysis. Heinemann Educational Books, Yorkshire, England, 1977.
[9] Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy, editors. Advances in Knowledge Discovery and Data Mining. AAAI Press/MIT Press, Menlo Park, CA, 1996.
[10] J. A. Hartigan. Clustering Algorithms. John Wiley & Sons, London, 1975.
[11] Anil K. Jain, Jianchang Mao, and K. M. Mohiuddin. Artificial neural networks. Computer, 29(3):31-44, 1996.
[12] N. Jardine and R. Sibson. Mathematical Taxonomy. John Wiley & Sons, London, 1971.
[13] Jia-bing R. Cheng and A. R. Hurson. Effective clustering of complex objects in object-oriented databases. In SIGMOD, 1991.
[14] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data, an Introduction to Cluster Analysis. John Wiley & Sons, London, 1990.
[15] D. E. Knuth. The Art of Computer Programming. Addison-Wesley, Reading, MA, 1971.
[16] G. N. Lance and W. T. Williams. A General Theory for Classification Sorting Strategy. The Computer Journal, 9(..):373-386, 1967.

[17] William J. McIver and Roger King. Self-adaptive, on-line reclustering of complex object data. In SIGMOD, 1994.
[18] F. Murtagh. A Survey of Recent Advances in Hierarchical Clustering Algorithms. The Computer Journal, 26(4):354-358, 1983.
[19] G. Piatetsky-Shapiro and W. J. Frawley, editors. Knowledge Discovery in Databases. AAAI Press, Menlo Park, CA, 1991.
[20] W. H. Press. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 2nd edition, 1992.
[21] E. Ramussen. Clustering Algorithms in Information Retrieval. In W. B. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs, NJ, 1990.
[22] H. C. Romesburg. Cluster Analysis for Researchers. Krieger Publishing Company, Malabar, FL, 1990.
[23] R. C. Tryon and D. E. Bailey. Cluster Analysis. McGraw-Hill, New York, 1970.
[24] Manolis M. Tsangaris and Jeffrey F. Naughton. A stochastic approach for clustering in object bases. In SIGMOD, 1991.
[25] P. Willett. Recent Trends in Hierarchic Document Clustering, a Critical Review. Information Processing and Management, 9(24):577-597, 1988.
[26] C. T. Yu, C. Suen, K. Lam, and M. K. Siu. Adaptive Record Clustering. ACM Transactions on Database Systems, 2(10):180-204, June 1985.
[27] J. Zupan. Clustering of Large Data Sets. Research Studies Press, Letchworth, England, 1982.

Type    First Correlation
SUL -0.00009X+2    SUE 0.00042X+6    SPL    SPE 0.000003X+8    0.00076X+9
Second Correlation: -0.00016X+2  -0.0000067X+9  -0.0000032X+0  -0.00034X+2  0.00041X+6  0.00087X+0
Third Correlation: 0.000055X+4  -0.00093X+9
Fourth Correlation
SGL 0.00026X+7    SGE 0.0000089X+4  0.00102X+3  0.000042X+8  0.00011X+8  -0.0000051X+0  -0.0000046X+1  -0.00046X+7  0.0000047X+3
CUL 0.000032X+8    CUE 0.000071X+3    CPL -0.000018X+9  -0.00031X+8  0.00029X+0  -0.00066X+6  0.000004X+2  -0.00057X+2
CPE 0.00032X+6  0.000066X+1  -0.00109X+6  -0.000068X+2
CGL 0.000001X+6  -0.00034X+5  0.000042X+2
CGE 0.00075X+8  0.0000053X+4  -0.00074X+0  0.00105X+1
OUL -0.00014X+1  -0.0002X+7  -0.00043X+4  -0.00015X+8
OUE 0.00057X+9  0.00095X+3  -0.000078X+1  -0.00078X+8
OPL -0.000002X+1  -0.0000027X+1  -0.00037X+1  -0.00014X+1  0.00066X+4  -0.00041X+2
OPE 0.0000036X+6  0.0000051X+5  0.00053X+7  0.0000083X+9  -0.00078X+2
OGL 0.00039X+1  0.000071X+3
OGE 0.00084X+5  0.000016X+4  0.0000098X+2  0.000062X+0  -0.00073X+7  -0.00045X+9  -0.0000061X+8  0.00083X+1  -0.0005X+7  -0.00073X+6  0.00024X+6

Table 5: Function Approximation of the First Block of Coefficients of Correlation

Type    Fifth Correlation
SUL -0.000021X+8    SUE 0.00047X+0    SPL    SPE 0.000031X+7  -0.000004X+5  0.0000058X+5
Sixth Correlation: 0.000009X+1  0.00086X+1
Seventh Correlation    Eighth Correlation
SGL 0.00002X+7  -0.000049X+0.92  -0.00043X+0.93  0.0000006X+9  -0.0000045X+0.93  0.00051X+0.93  0.000009X+2  -0.000008X+0
SGE 0.00042X+6  0.00011X+7  0.00007X+8  0.00107X+8
CUL -0.0000021X+2  0.0000034X+3
CUE 0.00004X+2  0.00054X+0  0.000008X+8  0.00009X+0
CPL -0.00081X+0.91  0.00033X+0  -0.0000064X+8  -0.00077X+6  0.000052X+9  0.00076X+6  -0.0000002X+7
CPE 0.00037X+7
CGL -0.0000002X+0  -0.00012X+7  -0.00036X+4  -0.00085X+8  -0.000068X+5  0.0005X+6  -0.00048X+0
CGE 0.000049X+6
OUL 0.00017X+8  -0.00024X+9  -0.00006X+4
OUE 0.0000044X+9  -0.0000046X+3  0.000006X+5  -0.000033X+5
OPL -0.00037X+9  0.00022X+7  -0.00029X+5  -0.00018X+9
OPE 0.0000037X+6  0.00046X+6  0.00077X+4  -0.000057X+1
OGL 0.0000022X+0  -0.0003X+7  -0.0003X+6  -0.00032X+3
OGE 0.00049X+5  0.00006X+8  0.000003X+4  0.00031X+8  0.000086X+6  0.00046X+5  0.00036X+7  0.00075X+5  0.00094X+1  0.00035X+3  -0.00038X+6  -0.00016X+4

Table 6: Function Approximation of the Second Block of Coefficients of Correlation

Type    Ninth Correlation
SL    SE    CL    CE    0.00113X+5  -0.00089X+3  -0.0000078X+4  -0.00083X+0  -0.00085X+2
Tenth Correlation: -0.000119X+1  -0.000124X+1  -0.00095X+2  0.00089X+0  0.0000088X+3  -0.0008X+1  0.00088X+1
Eleventh Correlation
OL    OE    0.00079X+0  0.0000101X+6  0.00079X+1  0.00111X+8  -0.0000088X+0  0.0000101X+9

Table 7: Function Approximation of the Third Block of Coefficients of Correlation