Data Clustering Analysis in a Multidimensional Space

A. Bouguettaya and Q. Le Viet
School of Information Systems
Queensland University of Technology
Brisbane, Qld 4001, Australia
{athman,quangl}@icis.qut.edu.au

Abstract
Cluster analysis techniques are used to classify objects into groups based on their similarities. There is a wide choice of methods with different requirements in computer resources. We present the result of a fairly exhaustive study to evaluate three commonly used clustering algorithms, namely, single linkage, complete linkage, and centroid. The cluster analysis study is conducted in the 2-dimensional space. Three types of statistical distribution are used. Two different types of distances to compare lists of objects are also used. The results point to some startling similarities in the behavior and stability of all clustering methods.

1 Introduction and Motivation
Cluster analysis is a generic name for multivariate analysis techniques to create groups of objects based on their degree of association [21]. In simple words, cluster analysis classifies items into groups based on their similarities. In databases, cluster analysis has been used to re-allocate stored information based on predefined criteria with the goal to improve the efficiency of data retrieval operations. The standard way of evaluating the degree of similarity varies from one application to another. By re-allocating data, related information is physically stored as close as possible to reduce the number of disk accesses. In a distributed environment, clustering is even more important because of the impact on the response time if the requested data is physically located at different sites. The need to do clustering is clear. There are many issues that need to be addressed:

1. Calculation of the degree of association between different types of objects.
2. Determination of an acceptable criterion to evaluate the "goodness" of clustering methods.
3. Adaptability of the clustering methods to different distributions of data: randomly scattered, skewed or concentrated around certain regions, etc.
Several clustering methods have been proposed that differ in the approach taken to perform the clustering, some of which are: hierarchical versus partitional, agglomerative versus divisive, extrinsic versus intrinsic, etc. [7][10][8][14][22][23][25][18][1]. In that respect, each clustering method has a different theoretical basis and is applicable to particular fields. We propose a fairly exhaustive study of well-known clustering techniques in the 2-D space. Previous studies, that are less exhaustive in their analysis, have focused on the 1-D space [5]. Our experiments include a variety of environment settings to test the clustering techniques' sensitivity and behavior. The results can be of paramount importance across a wide range of applications.

1.1 Definitions
We present a high-level description of the different ontologies used in the research literature and this paper. The ontologies we cover are the following: cluster analysis, objects, clusters, distance and similarity, coefficient of correlation, and standard deviation.

Cluster analysis is about the generation of groups of entities that fit a set of definitions. The group which forms a cluster should have a higher degree of association within group members than with members of different groups. At a high level of abstraction, a cluster can be viewed as a group of "similar" objects. Cluster analysis is sometimes referred to as automatic classification. Cluster analysis has a property that makes it different from other classification methods, namely, information about classes of groupings is not known prior to the processing. Items are grouped into clusters that are defined by members of those clusters.

Objects (or items) are used in a broad sense. They can be anything that requires to be classified based on certain criteria. The object may represent a single attribute in a relational database or a complex object in an object-oriented database, provided that it can be represented as a point in a measurement space. Obviously, all objects to be clustered should be defined in the same measurement space. In a 1-D (one-dimensional) environment, an object is represented as a point belonging to a segment defined by the interval [a, b], where a and b are arbitrary numbers.

Clusters are groups of objects linked together according to some rules. The goal of clustering is to find groups containing objects most homogeneous within these groups. Homogeneity refers to the common properties of the objects to be clustered. There are two ways to represent clusters in a measurement space: as a hypothetical point which is not an object in the cluster, or as an existing object in the cluster called centroid or cluster representative.
Clusters are represented in the measurement space in the same form as the objects they contain. To distinguish between an object and a cluster, additional information is needed: the number of objects in the cluster. From that point of view, a single object is a cluster containing exactly one object. An example of clusters in a 1-D environment is {.}, {.}, {., ., .}, etc.

Distance and Similarity: To cluster items in a database or in any other environment, some means of quantifying the degree of association between items is needed. It may be a measure of distances or similarities. There are a number of similarity measures available and the choice may have an effect on the results obtained. Objects which have more than one dimension may use relative or normalized weights to convert their distances to an arbitrary scale so they can be compared. Once the objects are defined in the same measurement space as the points, it is a straightforward exercise to compute the distance or similarity. The smaller the distance, the more similar two objects are. The most popular choice in computing distance is the Euclidean distance:

    d(i, j) = sqrt((x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_in - x_jn)^2)

where n is the number of dimensions. For the 1-D space, the distance becomes:

    d(i, j) = |x_i - x_j|

There is also the Manhattan distance or city block concept, represented as follows:

    d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_in - x_jn|

The distance between two clusters involves some or all items of the two clusters and is calculated differently depending on the clustering method.

Standard Deviation is the measurement of the fluctuation of values as compared to the mean value. In this study, the standard deviation is used to show the acceptability of the results. The standard deviation of a random variable X is given by

    sigma(X) = sqrt(E(X^2) - E^2(X))

Coefficient of Correlation is the measurement of the strength of the relationship between two variables X and Y. It essentially answers the question "how similar are X and Y?". The values of the coefficient of correlation range from 0 to 1, where the value 0 points to no similarity and the value 1 points to high similarity. The coefficient of correlation is used to find the similarity among objects. The correlation r of two random variables X = (x_1, x_2, x_3, ..., x_n) and Y = (y_1, y_2, y_3, ..., y_n) is given by the formula:

    r = |E(XY) - E(X)E(Y)| / (sqrt(E(X^2) - E^2(X)) * sqrt(E(Y^2) - E^2(Y)))

where E(X) = (sum_i x_i)/n, E(Y) = (sum_i y_i)/n, and E(XY) = (sum_i x_i y_i)/n.

1.2 Related Work
Cluster analysis has been used in several fields of science to determine taxonomy relationships among entities, including fields dealing with species, psychiatric profiles, census and survey data, and chemical structures [18][27][14][22]. Clustering applications range from databases (e.g., data clustering and data mining) to machine learning to data compression [11]. Our application domain is in the area of databases. In the case of database clustering, the ability to categorize items into groups allows the re-allocation of related data in a database to improve the performance of DBMSs. Data records which are frequently referenced together are moved into close proximity to reduce access time. To reach this goal, cluster analysis is used to form clusters based on the similarities of data items.
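As a concrete illustration of the measures defined above, the following sketch (ours, not the authors' code; the function names are hypothetical) computes the Euclidean distance, the Manhattan distance, and the coefficient of correlation r exactly as given by the formulas.

```python
import math

def euclidean(p, q):
    # d(i, j) = sqrt(sum_k (x_ik - x_jk)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # d(i, j) = sum_k |x_ik - x_jk|
    return sum(abs(a - b) for a, b in zip(p, q))

def correlation(xs, ys):
    # r = |E(XY) - E(X)E(Y)| / (sqrt(E(X^2) - E^2(X)) * sqrt(E(Y^2) - E^2(Y)))
    n = len(xs)
    ex = sum(xs) / n
    ey = sum(ys) / n
    exy = sum(x * y for x, y in zip(xs, ys)) / n
    ex2 = sum(x * x for x in xs) / n
    ey2 = sum(y * y for y in ys) / n
    return abs(exy - ex * ey) / math.sqrt((ex2 - ex ** 2) * (ey2 - ey ** 2))
```

Note that, because of the absolute value in the numerator, r lies in [0, 1] as stated above, rather than in the usual [-1, 1].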
Data may be re-allocated based on values of an attribute, a group of attributes, or on access patterns. These criteria determine the measuring distance between data items.

With the advent of OODBs, the need for efficient clustering techniques becomes crucial. Some OODBs use some limited form of clustering to improve their performance. However, they are mostly static in nature [4]. The case of OODBs is unique in that the underlying model provides a testbed for dynamic clustering. This is the reason why clustering takes on a whole new meaning with OODBs. There has been a surge in the number of studies of database clustering [13][17][24][3][2]. In particular, there were recently a number of studies which investigate adaptive clustering techniques, i.e., clustering techniques which can cope with changing access patterns and perform clustering on-line [5][26].

In database mining and knowledge discovery, the primary goal is the search for, and the discovery of, previously unknown patterns in large data sets stored in databases [19][9]. The patterns are then used to predict the model of data classification. There is a wide range of benefits for "mining" data to find interesting associations. Data warehouses become valuable in terms of understanding, managing, and using previously unknown relationships between sets of data. Our experiments are meant to provide a generic view on how data is clustered. In this regard, the experiments do not target specific applications and are applicable to a wide range of domains.

This research builds upon previous work that we have conducted using a different set of environment settings. We investigated three commonly used clustering algorithms: Slink, Clink, and Average, with different settings [5]. The preliminary findings seem to point out that the choice of clustering method becomes irrelevant in terms of final outcomes. The study presented here extends our previous work to include several other settings [5]. The new environment settings include additional parameters: a new clustering method, a statistical distribution, larger inputs, and space dimension. The aim is to provide a basis for a more categorical argument as to the behavior and sensitivity of the considered clustering methods.

1.3 Statistical Distributions
The objects used in this study consist of points lying in the interval [0, 1]. There are numerous statistical distributions; we selected the three that closely model real-world distributions [6]. Our aim is to see whether clustering is dependent on the way objects are generated. The first one is the uniform distribution, the second one is the piecewise distribution, and the third one is the Gaussian distribution. In what follows, we describe the statistical distributions that we used in our experiments.

Uniform Distribution
The respective distribution function is the following: F(x) = x. The density function of this distribution is f(x) = F'(x) = 1 for all x such that 0 <= x <= 1.

Piecewise (Skewed) Distribution
The respective distribution function is the following:
    F(x) = 0.05    if 0 <= x < 0.37
           0.475   if 0.37 <= x < 0.62
           0.525   if 0.62 <= x < 0.743
           0.95    if 0.743 <= x < 0.89
           1       if 0.89 <= x <= 1

The density function of this distribution is f(x) = (F(b) - F(a)) / (b - a) for all x such that a <= x < b.

Gaussian (Normal) Distribution
This is a two-parameter (mu and sigma) distribution, where mu is the mean of the distribution and sigma^2 is the variance. The density function of this distribution is:

    f(x) = F'(x) = (1 / (sigma * sqrt(2 * pi))) * e^(-(x - mu)^2 / (2 * sigma^2))

In producing samples for the Gaussian distribution, we choose fixed values of mu and sigma. Selected values of the distribution function F(x) are:

    F(x) = 0.00132 if 0.1 <= x < 0.2
           0.02277 if 0.2 <= x < 0.3
           0.15867 if 0.3 <= x < 0.4
           0.49997 if 0.4 <= x < 0.5

For values of x that are in the range [mu, 1], the distribution is symmetric.

Following is an outline of the paper. In Section 2, the clustering methods used in this study are described. In Section 3, we detail the experiments conducted in this study. In Section 4, we provide the interpretations of the experiment results. In Section 5, we provide some concluding remarks.

2 Clustering Methods
There are different ways to classify clustering methods according to the type of cluster structure they produce. The simple non-hierarchical methods divide the data set of N objects into M clusters, where no overlap is allowed. They are also known as partitioning methods. Each item is a member only of the cluster with which it is most similar, and the cluster may be represented by a centroid or cluster representative that represents the characteristics of all contained objects. This approach is heuristically based and mostly applied in the social sciences.

Hierarchical methods produce a nested data set in which pairs of items or clusters are successively linked until every item in the data set is linked to form one cluster. Hierarchical methods can be:

either agglomerative, with N-1 pairwise joins from an unclustered data set. In other words, from N clusters of one object each, this method gradually forms one cluster of N objects. At each step, clusters or objects are joined together into larger and larger clusters, ending with one big cluster containing all objects.
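To make the three settings concrete, here is a small sampling sketch (ours, not the authors' code). It interprets the tabulated piecewise values as the CDF reached at each breakpoint and inverts that step CDF; the Gaussian sampler uses illustrative parameter values and clamps draws to [0, 1], an assumption, since the paper does not say how out-of-range draws are handled.

```python
import random

# Breakpoints of the piecewise (skewed) distribution and the CDF value
# reached at the right end of each segment.
PIECES = [(0.00, 0.05), (0.37, 0.475), (0.62, 0.525), (0.743, 0.95), (0.89, 1.0)]

def sample_uniform():
    # F(x) = x on [0, 1].
    return random.random()

def sample_piecewise():
    # Inverse-CDF sampling: draw u, find the segment whose CDF range covers u,
    # then interpolate linearly inside it (the density is constant there).
    u = random.random()
    f_prev = 0.0
    for i, (a, f) in enumerate(PIECES):
        b = PIECES[i + 1][0] if i + 1 < len(PIECES) else 1.0
        if u <= f:
            return a + (b - a) * (u - f_prev) / (f - f_prev)
        f_prev = f
    return 1.0

def sample_gaussian(mu=0.5, sigma=0.1):
    # mu and sigma are illustrative choices; clamp to the unit interval.
    return min(1.0, max(0.0, random.gauss(mu, sigma)))
```

Under this reading, roughly 42.5% of the piecewise samples fall in [0.37, 0.62), which is what makes the distribution skewed toward that region.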
or divisive, in which all objects belong to a single cluster at the beginning; they are then divided into smaller clusters over and over until the last cluster containing two objects has been broken apart into both atomic constituents.

In both cases, the result of the procedure is a hierarchical tree, hence the name "hierarchical methods". The hierarchical tree may be presented as a dendrogram, in which the pairwise coupling of the objects in the data set is shown, and the length of the branches (vertices) or the value of the similarity is represented numerically. Divisive methods are less commonly used and only agglomerative methods will be discussed in this paper.

2.1 Hierarchical Clustering Techniques
This section describes the hierarchical agglomerative clustering methods and their characteristics. These methods have been used in the experiments presented in this paper.

Single linkage clustering method (Slink): The distance between two clusters is the smallest distance of all distances between two items (x, y), denoted D_{x,y}, such that x is a member of one cluster and y is a member of another cluster. This method is also known as the nearest neighbor method. The distance D_{X,Y} is computed as follows:

    D_{X,Y} = min{D_{x,y}} with x in X, y in Y

where (X, Y) are clusters, and (x, y) are objects in the corresponding clusters. This method is the simplest among all clustering methods. It has some attractive theoretical properties [12]. However, it tends to form long or chaining clusters. This method may not be very suitable for objects concentrated around some centers in the measurement space.

Complete linkage clustering method (Clink): The similarity coefficient is the longest distance between any pair of objects, denoted D_{x,y}, taken from two clusters. This method is also called the furthest neighbor clustering method [16][22][8]. The distance is computed as follows:

    D_{X,Y} = max{D_{x,y}} with x in X, y in Y

Centroid/median method: Clusters in this method are represented by a "centroid", a point in the middle of the cluster. The distance between two clusters is the distance between their centroids. This method also has a special case in which the centroid of the smaller group is leveled to the larger one.

2.2 General Algorithm
We provide examples using the three clustering methods. More examples can be found in [22][8]. As a first step, objects are generated using a random number generator. In our case, these objects are points in the interval [0, 1]. After these objects are created, they are compared to each other by measuring the distance. The distance between two clusters is computed using the similarity coefficient. The way objects and clusters of objects are joined together to form larger clusters varies with the approach used. We outline a generic algorithm that is applicable to all clustering methods. Essentially, it consists of two phases. The first phase records the similarity coefficients. The second phase computes the minimum and then performs the clustering. Initially every cluster consists of exactly one object.

1. Scan all clusters and record all similarity coefficients.
2. Compute the minimum of all similarity coefficients and then join the corresponding clusters.
3. If exactly one cluster remains then stop.
4. Go to (1).

There is a case when using the Clink method where ambiguity may arise. Suppose when performing Step 1, three successive clusters are to be joined (they all have the minimum similarity value). When performing Step 2, the first two clusters are joined. However, when computing the similarity coefficient between this new cluster and the third cluster, the similarity value may now be different from the minimum value. The question now is what the next step should be. There are essentially two possibilities:

- Either proceed by joining clusters using a recomputation of the similarity coefficient each time in Step 2.
- Or join all those clusters that have the same similarity coefficient at once and do not recompute the similarity in Step 2.

In general, there is no evidence that one is better than the other [22]. For our study, we selected the first alternative.

2.3 Examples
Following are examples of how data is clustered, to provide an idea of how different clustering methods work with the same set of data. Example 1 uses the Slink method, Example 2 uses the Clink method, and Example 3 uses Centroid. The sample data has 10 items; each item has an identification and a value on which the distance is calculated (see Table 1). The uniform distribution was used to generate this set of objects. For the sake of simplicity, we consider the 1-D space for generating data objects.

Example 1: Slink
1. Join clusters {10} and {1} at distance 0.0117584.
2. Join clusters {6} and {10, 1} at 0.05885.
3. Join clusters {4} and {8} at 0.0588514.
4. Join clusters {3} and {7} as one cluster, and {2}, {6, 10, 1}, {5} and {9} as another cluster at 0.0588514.
5. Join clusters {4, 8} and {3, 7} and {2, 6, 10, 1, 5, 9} as one cluster at 0.17643.
Table 1: Example of a sample data list

    Id    Value
    1     0.764770
    2     0.529483
    3     0.294196
    4     0.058909
    5     0.823621
    6     0.588334
    7     0.353047
    8     0.117760
    9     0.988247
    10    0.647185

Figure 1: Clustering Tree using Slink (leaf order: 4, 8, 3, 7, 2, 6, 10, 1, 5, 9)

Example 2: Clink
1. Join clusters {1} and {10} at distance 0.0117584.
2. Join clusters {4} and {8} at 0.0588514.
3. Join clusters {3} and {7} as one cluster, {2} and {6} as another cluster, and {5} and {9} as another cluster at 0.0588514.
4. Join clusters {2, 6} and {1, 10} at 0.235286.
5. Join clusters {4, 8} and {3, 7} at 0.294138.
6. Join clusters {2, 6, 10, 1} and {5, 9} at 0.352989.
7. Join clusters {4, 8, 3, 7} and {2, 6, 10, 1, 5, 9} at 0.823563.
Figure 2: Clustering Tree using Clink (leaf order: 4, 8, 3, 7, 2, 6, 10, 1, 5, 9)

Example 3: Centroid
1. Join clusters {10} and {1} at distance 0.0117585 and form the central point at 0.7059775.
2. Join clusters {2} and {6} as one cluster, {3} and {7} as another cluster, and {4} and {8} as another cluster at 0.0588514, and form the central points at 0.589085, 0.2241215, and 0.5883345.
3. Join clusters {5} and {9} at 0.058852 and form the central point at 0.9353047.
4. Join clusters {1, 10} and {2, 6} at 0.6478690 and form the central point at 0.8824430.
5. Join clusters {3, 7} and {4, 8} at 0.2357870 and form the central point at 0.7062280.
6. Join clusters {1, 10, 2, 6} and {5, 9} at 0.4706040 and form the central point at 0.1177450.
7. Join clusters {1, 10, 2, 6, 5, 9} and {3, 7, 4, 8} at 0.315170.

Figure 3: Clustering Tree using Centroid (leaf order: 4, 8, 3, 7, 2, 6, 10, 1, 5, 9)
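The generic two-phase algorithm of Section 2.2, together with the three inter-cluster distances of Section 2.1, can be sketched as follows for 1-D data (our illustrative code, not the authors' implementation; ties are broken by scan order rather than by the recomputation rule discussed above).

```python
def cluster_distance(cx, cy, method):
    # Slink: nearest pair; Clink: furthest pair; Centroid: distance between means.
    if method == "slink":
        return min(abs(x - y) for x in cx for y in cy)
    if method == "clink":
        return max(abs(x - y) for x in cx for y in cy)
    return abs(sum(cx) / len(cx) - sum(cy) / len(cy))

def agglomerate(values, method="slink"):
    """Return the list of merges (cluster, cluster, distance) until one cluster remains."""
    clusters = [[v] for v in values]  # initially one object per cluster
    merges = []
    while len(clusters) > 1:
        # Phase 1: record all similarity coefficients; Phase 2: take the minimum.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], method)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((tuple(clusters[i]), tuple(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges
```

With N objects, any of the three methods produces exactly N-1 joins, mirroring the examples above.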
3 Experiment Description
In this study, the hierarchical tree is implemented as a minimum spanning tree (MST). An MST is a tree in which N objects are linked by N-1 connections and there is no cycle in the tree. An MST has the following properties that make it suitable to represent the hierarchical tree:

- Any isolated item can be connected to a nearest neighbor.
- Any isolated fragment (sub-set of an MST) can in any case be connected to a nearest neighbor by the available link.

Data used in this paper is drawn from a two-dimensional (2-D) space. The values range from 0 to 1 inclusive and the sizes range from 100 to 500. The linear congruential algorithm is used to generate data objects [15][20]. The seed is the system time. Each experiment followed the following steps:

- Generate lists of objects.
- Carry out the clustering process with the three different clustering methods.
- Calculate the coefficient of correlation for each clustering method.

Each experiment is repeated 100 times and the standard deviation of the coefficients of correlation is calculated. The least square approximation (LSA) is used to evaluate the acceptability of the approximation. If a coefficient of correlation obtained using the LSA falls within the segment defined by the corresponding standard deviation, the approximation is deemed to be acceptable.

Types of distances in comparing trees: There are two ways of comparing two trees, obtained from lists of objects, to compute their coefficient of correlation. The distance used in the coefficient of correlation could for instance be computed using the actual linear (Euclidean) difference between two objects. The other method of computing the distance is to use the minimum number of edges (of a tree) needed to join two objects. The latter has an advantage over the former in that it provides a more "natural" implementation of a correlation. We call the first type of distance linear distance and the second edge distance. Once we choose a distance type, we compute the coefficient of correlation by selecting one pair of identifiers in the second (shorter) list and computing its distance, then looking for the same pair in the first list and computing its distance. We repeat the same process for all remaining pairs in the second list.
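The two tree-comparison distances can be sketched as follows (our illustration, not the authors' code; the hierarchical tree is assumed to be given as an adjacency list over leaf and internal node labels). The linear distance is the actual difference between two objects' values, while the edge distance counts the edges on the path joining their leaves.

```python
from collections import deque

def linear_distance(values, i, j):
    # Actual (Euclidean, here 1-D) difference between objects i and j.
    return abs(values[i] - values[j])

def edge_distance(tree, i, j):
    # Minimum number of tree edges needed to join leaves i and j (BFS).
    seen, frontier = {i}, deque([(i, 0)])
    while frontier:
        node, d = frontier.popleft()
        if node == j:
            return d
        for nb in tree[node]:
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, d + 1))
    raise ValueError("leaves are not connected")
```

For example, for two leaves hanging off the same internal node the edge distance is 2, and it grows with the depth at which the pair is first joined, which is why it gives a more "natural" correlation over the tree structure.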
There are several types of coefficients of correlation. This stems from the fact that several parameters have been used. For instance, the clustering method is one parameter. There are 3 * 2 * 3 = 18 possible ways to compute the coefficient of correlation for two lists of objects. Indeed, we have the following choices:

- first parameter: Slink, Clink, Centroid
- second parameter: linear distance, edge distance
- third parameter: uniform distribution, piecewise distribution, Gaussian distribution

The other dimension of this comparison study that has a direct influence on the clustering is the data input. This determines what kind of data is to be compared and what its size is. The following cases have been identified to check the sensitivity of each clustering method with regard to the input data. For every type of coefficient of correlation mentioned above, eleven (11) types of situations (hence, eleven coefficients of correlation) have been isolated. It is our belief that these cases closely represent what may influence the choice of a clustering method.

1. The coefficient of correlation is between pairs of objects drawn from a set S and pairs of objects drawn from the first half of the same set. The first half of S is used before the set is sorted.
2. The coefficient of correlation is between pairs of objects drawn from S and pairs of objects drawn from the second half of S. The second half of S is used before the set is sorted.
3. The coefficient of correlation is between pairs of objects drawn from the first half of S, say S2, and pairs of objects drawn from the first half of another set S', say S'2. The two sets are given ascending identifiers after being sorted. The first object of S2 is given the identifier 1 and so is the first object of S'2. The second object of S2 is given the identifier 2 and so is the second object of S'2, and so on.
4. The coefficient of correlation is between pairs of objects drawn from the second half of S, say S2, and pairs of objects drawn from the second half of S', say S'2. The two sets are given ascending identifiers after being sorted in the same way as in the previous case.
5. The coefficient of correlation is between pairs of objects drawn from S and pairs of objects drawn from the union of a set X and S. The set X contains 10% new randomly generated objects.
6. The coefficient of correlation definition is the same as the fifth coefficient of correlation except that X now contains 20% new randomly generated objects.
7. The coefficient of correlation definition is the same as the fifth coefficient of correlation except that X now contains 30% new randomly generated objects.
8. The coefficient of correlation definition is the same as the fifth coefficient of correlation except that X now contains 40% new randomly generated objects.
9. The coefficient of correlation is between pairs of objects drawn from S using the uniform distribution and pairs of objects drawn from S' using the piecewise distribution.
10. The coefficient of correlation is between pairs of objects drawn from S using the uniform distribution and pairs of objects drawn from S' using the Gaussian distribution.
11. The coefficient of correlation is between pairs of objects drawn from S using the Gaussian distribution and pairs of objects drawn from S' using the piecewise distribution.

In a nutshell, the above coefficients of correlation are meant to analyze different situations in the evaluation of results. We partition these situations into three groups, represented by three blocks of coefficients of correlation:

First Block: The first, second, third and fourth coefficients of correlation are used to check the influence of the context on how objects are clustered.
Second Block: The fifth, sixth, seventh and eighth coefficients of correlation are used to check the influence of the data.
Third Block: The ninth, tenth, and eleventh coefficients of correlation are used to check the relation which may exist between two lists obtained using two different distributions.

To ensure the statistical representativity of the results, the average of 100 coefficient of correlation and standard deviation values (of the same type) are computed. The least square approximation is then applied to obtain the following equation:

    f(x) = ax + b

The criterion for a good approximation (or acceptability) is given by the inequality:

    |y_i - f(x_i)| <= sigma(y_i) for all i

where y_i is the coefficient of correlation, f is the approximation function and sigma(y_i) is the standard deviation for y_i. If this inequality is satisfied, f is then a good approximation. The least square approximation, if acceptable, helps predict the behavior of clustering methods for data points beyond the range considered in our experiments.

4 Results and their Interpretations
As stated earlier, the aim of this study is to conduct experiments to determine the stability of clustering methods and how they compare to each other. For the sake of readability, a shorthand notation is used to indicate all possible cases. A similar notation has been used in our previous findings [5]. For instance, to represent the input with the following parameters: Slink, uniform distribution and linear distance, the abbreviation SUL is used. Results are presented in figures and tables. The figures describe the different types of coefficients of correlation. The tables describe the least square approximations of the coefficients of correlation.
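The fit and acceptability test described in Section 3 can be sketched as follows (our code, not the authors'): a plain least-squares line f(x) = ax + b, followed by the per-point check |y_i - f(x_i)| <= sigma_i.

```python
def least_squares_line(xs, ys):
    # Fit f(x) = a*x + b minimizing the squared error.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    return a, b

def acceptable(xs, ys, sigmas):
    # Good approximation iff every point lies within its standard deviation of the line.
    a, b = least_squares_line(xs, ys)
    return all(abs(y - (a * x + b)) <= s for x, y, s in zip(xs, ys, sigmas))
```

A slope a close to zero, combined with acceptability, is what the later sections use as evidence that a clustering method behaves stably as the input size grows.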
Table 2: List of Abbreviations

    Term               Shorthand
    Slink              S
    Clink              C
    Centroid           O
    Uniform Distr.     U
    Gaussian Distr.    G
    Piecewise Distr.   P
    Linear Distance    L
    Edge Distance      E

Table 3: Graphical Representation of all Types of Correlation Coefficients and Standard Deviations

    First and Second Blocks: First & Fifth - solid; Second & Sixth - dashed; Third & Seventh - dotted; Fourth & Eighth - dash-dotted.
    Third Block: Ninth - solid; Tenth - dashed; Eleventh - dotted.

4.1 Analysis of the Stability and Sensitivity of the Clustering Methods
We first look at the different clustering methods and analyze how stable and sensitive they are to the various parameters. In essence, we are interested in knowing how each clustering method reacts to the changes in parameter values.

4.1.1 Slink: Results Interpretation
We look at the behavior of the 3 blocks of coefficient of correlation values as defined in Section 3. We then provide an interpretation of the corresponding results.

First block of coefficients of correlation
As previously mentioned, the first 4 coefficients of correlation are meant to test the influence of the context. Fig. 4 represents the first 4 coefficients of correlation. The (small) difference between L and E values is consistently the same across all experiments, with the exception of those experiments comparing different distributions (Fig. 6, Fig. 9, and Fig. 12). We note that the values using E are consistently smaller than those values using L. The reason lies in the difference in computing distances. When L is used, the distance between the members of two clusters is the same for all members. However, when E is used, this may not be true (e.g., a tree that is not height balanced) since the distance is equal to the number of edges connecting two members belonging to different clusters. In the case of Fig. 6, Fig. 9, and Fig. 12 the difference is attenuated due to the use of different distributions.
Figure 4: Slink: First Block of Coefficient of Correlation (panels: single linkage under uniform, piecewise, and Gaussian distributions, with linear and edge distance)

When the values in L and E are compared against each other, the trend among the 4 coefficients of correlation is almost the same. This points to the fact that the distance type does not play a major role in the final clustering.

The values of the first and second types of correlation are larger than those of the third and fourth types of correlation because of the corresponding intrinsic semantics. The first and second types of correlation compare data objects drawn from the same initial set. The third and fourth types of correlation compare data objects drawn from different sets. One should expect the former data objects to be more related than the latter data objects. The standard deviation values exhibit roughly the same kind of behavior as their corresponding coefficient of correlation values. This points to the fact that the different types of correlation behave in a uniform and predictable fashion.

Since the coefficient of correlation values are greater than .5 in all the above cases, this points to the important observation that the data context does not seem to play a significant role in the final data clustering. Likewise, the data set size does not seem to have any substantial influence on the final clustering. Note that the slope value is almost equal to zero. This is also confirmed by the uniform behavior of the standard deviation values as described above.
Second block of coefficients of correlation
The next 4 coefficients of correlation check the influence of the data on clustering. Fig. 5 depicts this block of coefficients of correlation.

Figure 5: Slink: Second Block of Coefficient of Correlation

When the values in L and E are compared, there is no substantial difference, which is indicative of the independence of the clustering from any type of distance used. As in the previous case, the standard deviation values exhibit the same behavior as the corresponding coefficient of correlation values. This is reminiscent of a uniform and predictable behavior of the different types of correlations.

The high values indicate that the contexts have little effect on how data is clustered. As in the previous case, the data does not seem to influence the final clustering outcome as the slope is nearly equal to zero. Likewise, the data set size does not seem to have any substantial influence on the final clustering. Note that the slope value is almost equal to zero. This is also confirmed by the uniform behavior of the standard deviation values as described above.
Third block of coefficients of correlation

Figure 6: Slink: Third Block of Coefficient of Correlation

The next 3 coefficients of correlation check the influence of the distribution for L and E. All other parameters are set and the same for the pairs of sets of objects to be compared. Fig. 6 depicts the last three types of coefficients of correlation.

The curve representing the case for UP (Uniform and Piecewise distributions), in either the L or E case, shows values that are a bit lower than the values in the curves representing UG (Uniform and Gaussian distributions) and GP (Gaussian and Piecewise distributions). This can be explained by the problem of bootstrapping the random number generator. This is constant across most of the experiments conducted in this study. The first concurrent experiments (Slink using L and E) exhibit a behavior that is a little different from the other pieces of experiments.

When the values in the case of L and E are compared, no substantial difference is observed. This is indicative of the independence of the clustering from the types of distances used. Like the previous case, the standard deviation values exhibit the same behavior as the corresponding coefficient of correlation values. This is reminiscent of a uniform and predictable behavior of the different types of correlations.

Since the values converge to the value .5, this indicates that the distributions do not influence the clustering one way or the other. As in the previous case, the data does not seem to influence the final clustering outcome as the slope is nearly equal to zero. Likewise, the data set size does not seem to have any substantial influence on the final clustering. Note that the slope value is almost equal to zero. This is also confirmed by the uniform behavior of the standard deviation values as described above.

4.1.2 Clink: Results Interpretation
In essence, the experiments for the Clink clustering method follow the same type of pattern and behavior as the Slink method. Fig. 7, Fig. 8, and Fig. 9 depict the first, second, and third blocks of coefficients of correlation.
Figure 7: Clink: First Block of Coefficient of Correlation

The interpretations that apply for the Slink method also apply for the Clink clustering method, as both follow a very similar pattern of behavior and the values for both the coefficients of correlation and the standard deviation are quite similar.

4.1.3 Centroid: Results Interpretation
As was the case for Slink and Clink, the experiments for the Centroid clustering method follow the same type of pattern and behavior. The only difference lies in the values for the different correlations obtained using different parameters. Fig. 10, Fig. 11, and Fig. 12 depict the first, second, and third blocks of coefficients of correlation.

The interpretations that apply for the previous clustering methods also apply for the Centroid clustering method, as it follows a similar pattern of behavior. Similarly, the values for both the coefficients of correlation and the standard deviation are also similar.
Figure 8: Clink: Second Block of Coefficient of Correlation

Figure 9: Clink: Third Block of Coefficient of Correlation
Figure 10: Centroid: First Block of Coefficient of Correlation

4.2 Summary of Results
Table 4 shows a summary of the results that averages out the different computed coefficients of correlation. Block 1, Block 2, and Block 3 correspond respectively to the first block of 4 coefficients of correlation, the second block of 4 coefficients of correlation, and the third block of 3 coefficients of correlation.

4.3 Acceptability of the Least Square Approximation
Table 5, Table 6, and Table 7 (see Appendix) represent the least square approximations for all the curves shown in this study. The acceptability of an approximation depends on whether all the coefficient of correlation values fall within the interval delimited by the approximating function and the standard deviation. If this is the case, then we say that the approximation is good. Otherwise, we look at how many points do not fall within the boundaries and determine the goodness of the function.
Figure 11: Centroid: Second Block of Coefficient of Correlation

Figure 12: Centroid: Third Block of Coefficient of Correlation
we look at how many points do not fall within the boundaries and determine the goodness of the function. Using these functions will enable us to predict the behavior of the clustering methods with larger data sets.

[Table 4: Summary of Results. Averaged coefficients of correlation for each block (Block 1, Block 2, Block 3), distance type (L: linear, E: edge), and distribution (U: uniform, P: piecewise, G: Gaussian), for the Slink, Clink, and Centroid methods.]

As Table 5, Table 6, and Table 7 (see Appendix) show, the values of the slopes are all very small. This points to the stability of all results. All approximations yield lines that are almost parallel to the x-axis.

The acceptability test was run and all points passed the test satisfactorily. Therefore, all the approximations listed in the tables mentioned above are good approximations.

4.4 Comparison of Results across Clustering Methods

In what follows, we compare the different clustering methods against each other using the different parameters used in this study. We rely on the results obtained and the general observations that can be drawn from the experiments shown in the different figures and tables.

1. The results show that across space dimensions, the context does not completely hide the sets. For instance, the first and second types of coefficients of correlation (as shown in all figures) are a little different from the third and fourth types of coefficients of correlation (as shown in all figures). The values clearly show what kinds of coefficients are computed.

2. The results show that given the same distribution and type of distance, all clustering methods exhibit the same behavior and yield approximately the same values.

3. Slink, Clink, and Centroid seem to have very similar behavior. The coefficients of correlation values are also very close. An explanation for the similarity in behavior between Slink, Clink, and Centroid is that these methods are based on one single object per cluster to determine similarity distances.
4. The second block of coefficients of correlation for all clustering methods demonstrates that the context does not influence the data clustering, because all coefficients of correlation are close to the value 1.

5. The results also show that all clustering methods are equally stable. This finding comes as a surprise, as intuitively, one expects a clustering method to be more stable than the others.

6. The results show that the data distribution does not significantly affect the clustering techniques, because the values obtained are very similar to each other. That is a relatively major finding, as the results strongly point to the independence of the distribution and the data clustering.

7. The third block of coefficients of correlation across all clustering methods shows that the three methods are little or not perturbed even in a noisy environment, since there are no significant differences in results from the uniform, piecewise, and Gaussian distributions.

8. The type of distance (linear or edge) does not influence the clustering process, as there are no significant differences between the coefficients of correlation obtained using either linear or edge distances.

The results obtained in this study confirm those from [5], which used a one-dimensional (1-D) data sample and fewer parameters. The results point very strongly to the conclusion that, in general, no clustering technique is better than another. What this essentially means is that there is an inherent way for data objects to cluster, independently of the techniques used.

The other important result is that the only discriminator for selecting a clustering method is its computational attractiveness and nothing else. This is a very important result, as in the past there was no evidence that clustering methods had very similar behavior.

The results presented here are compelling evidence that clustering methods do not seem to influence the outcome of the clustering process. Indeed, all clustering methods considered here exhibit a behavior that is almost constant regardless of the parameters being used in comparing them.

5 Conclusion

In this exhaustive study, we considered three clustering methods. The experiments involved a wide range of parameters to test the stability and sensitivity of the clustering methods. These experiments were conducted for objects in the 2-D space. The results obtained overwhelmingly point to the stability of each clustering method and its little sensitivity to noise. The most startling finding of this study is, however, that all clustering methods exhibit an almost identical behavior, regardless of the parameters used.

The above findings have implications of paramount importance. In that regard, one of the most important results is that objects have a natural tendency to cluster themselves. Clustering methods do not seem to play a major role in the final shape of the clustering. The fact that data objects are drawn from different data spaces does not change the above findings [5]. This also means that the only criterion that should be used to select one clustering method is the attractiveness of the computational complexity of the clustering algorithm.
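The similarity in behavior noted above has a simple structural explanation: each of the three methods reduces the similarity between two clusters to distances between individual representative objects. A minimal 2-D sketch of the three rules (function and variable names are ours, for illustration only):

```python
import math

def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def slink(a, b):
    """Single linkage: distance between the closest pair of objects."""
    return min(dist(p, q) for p in a for q in b)

def clink(a, b):
    """Complete linkage: distance between the farthest pair of objects."""
    return max(dist(p, q) for p in a for q in b)

def centroid(a, b):
    """Centroid method: distance between the two cluster centroids."""
    ca = (sum(p[0] for p in a) / len(a), sum(p[1] for p in a) / len(a))
    cb = (sum(p[0] for p in b) / len(b), sum(p[1] for p in b) / len(b))
    return dist(ca, cb)
```

In each case a single point (or point pair) per cluster determines the similarity distance, which is consistent with the observation that the three methods yield very close coefficients of correlation.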
Acknowledgments

We would like to thank Mostefa Golea and Alex Delis for their insightful and detailed comments.

References

[1] M.S. Alderfer and R.K. Blashfield. Cluster Analysis. Sage Publications, California, 1984.

[2] Jay Banerjee, Won Kim, Sung-Jo Kim, and Jorge Garza. Clustering a DAG for CAD databases. IEEE Transactions on Software Engineering, 14(11):1684-1699, 1988.

[3] Veronique Benzaken. An evaluation model for clustering strategies in the O2 object-oriented database system. In ICDT, 1990.

[4] Veronique Benzaken and Claude Delobel. Enhancing performance in a persistent object store: Clustering strategies in O2. In PODS, 1990.

[5] A. Bouguettaya. On-line Clustering. IEEE Transactions on Knowledge and Data Engineering, 8(2), April 1996.

[6] A. Delis and V.R. Basili. Data Binding Tool: a Tool for Measurement Based Ada Source Reusability and Design Assessment. International Journal of Software Engineering and Knowledge Engineering, 3(3):287-318, November 1993.

[7] R. Dubes and A.K. Jain. Clustering Methodologies in Exploratory Data Analysis. Advances in Computers, 19, 1980.

[8] B. Everitt. Cluster Analysis. Heinemann Educational Books, Yorkshire, England, 1977.

[9] Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy, editors. Advances in Knowledge Discovery and Data Mining. AAAI Press/MIT Press, Menlo Park, CA, 1996.

[10] J.A. Hartigan. Clustering Algorithms. John Wiley & Sons, London, 1975.

[11] Anil K. Jain, Jianchang Mao, and K.M. Mohiuddin. Artificial neural networks. Computer, 29(3):31-44, 1996.

[12] N. Jardine and R. Sibson. Mathematical Taxonomy. John Wiley & Sons, London, 1971.

[13] Jia-bing R. Cheng and A.R. Hurson. Effective clustering of complex objects in object-oriented databases. In SIGMOD, 1991.

[14] L. Kaufman and P.J. Rousseeuw. Finding Groups in Data, an Introduction to Cluster Analysis. John Wiley & Sons, London, 1990.

[15] D.E. Knuth. The Art of Computer Programming. Addison-Wesley, Reading, MA, 1971.

[16] G.N. Lance and W.T. Williams. A General Theory for Classification Sorting Strategy. The Computer Journal, 9(..):373-386, 1967.
[17] William J. McIver and Roger King. Self-adaptive, on-line reclustering of complex object data. In SIGMOD, 1994.

[18] F. Murtagh. A Survey of Recent Advances in Hierarchical Clustering Algorithms. The Computer Journal, 26(4):354-358, 1983.

[19] G. Piatetsky-Shapiro and W.J. Frawley, editors. Knowledge Discovery in Databases. AAAI Press, Menlo Park, CA, 1991.

[20] W.H. Press. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 2nd edition, 1992.

[21] E. Rasmussen. Clustering Algorithms in Information Retrieval. In W.B. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs, NJ, 1990.

[22] H.C. Romesburg. Cluster Analysis for Researchers. Krieger Publishing Company, Malabar, FL, 1990.

[23] R.C. Tryon and D.E. Bailey. Cluster Analysis. McGraw-Hill, New York, 1970.

[24] Manolis M. Tsangaris and Jeffrey F. Naughton. A stochastic approach for clustering in object bases. In SIGMOD, 1991.

[25] P. Willett. Recent Trends in Hierarchic Document Clustering, a Critical Review. Information Processing and Management, 9(24):577-597, 1988.

[26] C.T. Yu, C. Suen, K. Lam, and M.K. Siu. Adaptive Record Clustering. ACM Transactions on Database Systems, 10(2):180-204, June 1985.

[27] J. Zupan. Clustering of Large Data Sets. Research Studies Press, Letchworth, England, 1982.
[Table 5: Function Approximation of the First Block of Coefficients of Correlation. Least square lines of the form aX + b, one per combination of clustering method (S: Slink, C: Clink, O: Centroid), distribution (U: uniform, P: piecewise, G: Gaussian), and distance type (L: linear, E: edge), for each of the first four coefficients of correlation.]

[Table 6: Function Approximation of the Second Block of Coefficients of Correlation. Same layout as Table 5, for the fifth through eighth coefficients of correlation.]
[Table 7: Function Approximation of the Third Block of Coefficients of Correlation. Least square lines of the form aX + b, one per combination of clustering method (S: Slink, C: Clink, O: Centroid) and distance type (L: linear, E: edge), for the ninth through eleventh coefficients of correlation.]
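The approximations in Tables 5, 6, and 7 are least square lines of the form f(X) = aX + b, and the acceptability test of Section 4.3 checks that every coefficient of correlation falls within the band delimited by the approximating line and the standard deviation. A minimal sketch of both steps (assuming, for illustration, that the standard deviation is that of the observed coefficients; the exact band used in the study is defined in Section 4.3):

```python
import math

def least_squares_line(xs, ys):
    """Fit f(X) = a*X + b by least squares; return (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

def acceptable(xs, ys):
    """Acceptability test: every observed value must fall within
    the band f(x) +/- std around the approximating line."""
    a, b = least_squares_line(xs, ys)
    n = len(ys)
    my = sum(ys) / n
    std = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return all(abs(y - (a * x + b)) <= std for x, y in zip(xs, ys))
```

A slope a close to zero, as reported for all entries of Tables 5-7, corresponds to a fitted line almost parallel to the x-axis, which is the stability property discussed in Section 4.3.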