Data Integration Techniques based on Data Quality Aspects

Michael Gertz
Department of Computer Science
University of California, Davis
One Shields Avenue
Davis, CA 95616, USA
gertz@cs.ucdavis.edu

Ingo Schmitt
Institut für Techn. Informationssysteme
Otto-von-Guericke-Universität Magdeburg
Universitätsplatz 2
D-39106 Magdeburg, Germany
schmitt@iti.cs.uni-magdeburg.de

Abstract

In multidatabase systems, a major data integration problem is to resolve data conflicts where two objects having the same definition and representing the same real world object have different extensions. Traditional data integration approaches suggest static conflict resolution functions that perform a computation over conflicting attribute values, thus assuming a unique and time-independent resolution. In this paper, we argue that this type of data conflict often arises due to heterogeneities among the data capturing and processing techniques and methods used by component databases. That is, component databases differ in how and when they map real world data into local data structures. As a result, the quality of the data stored at different sites can differ and can also vary over time, thus requiring dynamic data integration methods, depending on which data quality goal is required at the global level. We outline a novel framework that allows one to formalize, model and utilize diverse and, in particular, orthogonal data quality aspects such as timeliness, accuracy, and completeness in database integration. By making the notion of data quality aspects explicit both in modeling and querying a multidatabase system, existing approaches to database integration can not only be extended, but tools can also be developed that ensure different types of "high quality data" at the integration level and for global applications.

1 Introduction
The integration of previously independent but semantically related database systems into globally accessible multidatabase systems (MDBS) has been one of the most active areas in database research over the past 15 years. Most of the work has been devoted to approaches and techniques that allow designers to identify and resolve structural and semantic conflicts between meta-data objects (tables, classes, ...) and data items (records, tuples, objects, ...) located at local databases participating in the multidatabase system [BE96, KCG95, KS91, She91]. Despite the known difficulties in detecting semantic equivalences and resolving semantic conflicts and heterogeneities, many products have emerged that realize some kind of a multidatabase system.

The most prominent type of such systems are data warehouses [Kim96, BA97]. Data warehouses typically restrict access to local databases to read-only accesses, providing
global users and applications a means to integrate data from different resources for, e.g., decision support systems, hospital information systems or environmental information systems. Although the issue we discuss in this paper is not specific to data warehouses but applies to any system that provides access to integrated data, most of the problems related to integrated data have been reported for data warehouses. As recent studies and reports show, applications built on top of data warehouses in particular often experience several problems with regard to the reliability and quality of the integrated data [Kim96b, Huf96, BA97, JV97]. The main reason for this is that the local databases participating in the multidatabase system often already contain incorrect, inaccurate, outdated or simply poor quality data. The quality of integrated data then becomes even worse unless suitable methods and techniques for modeling and managing these issues are employed at multidatabase design time. Despite the amount of work that focuses on semantic heterogeneity among meta-data and data items, aspects of the quality of integrated data have not been addressed so far. Data quality has mainly been, and still is, an important research topic independent of database integration [Red96, Wan98, WS96, WSF96].

In this paper we explicitly introduce the notion of data quality in database and data integration. We claim that data conflicts between local data items often occur due to discrepancies in how the data are gathered, processed and maintained at local databases. This type of heterogeneity, called operational heterogeneity, can then result in the fact that the quality of information stored about real world objects or artifacts in different local databases may differ and vary over time. We show that data quality can be defined as orthogonal aspects, referring to timeliness, accuracy and completeness of data. The orthogonality of these aspects indicates that there is not always a unique resolution of data conflicts. For example, if the attribute values of two objects referring to the same real world object differ and one value is known to be more accurate and the other is known to be more up-to-date, one cannot give a unique resolution (or data integration rule) for this conflict. More importantly, accurate data does not necessarily imply up-to-date data or vice versa. Therefore, in this paper we suggest some kind of measurement for different data quality aspects. These measurements are used to model data quality aspects of local and global classes at multidatabase design time. We present a technique called information profiling that allows designers to associate and relate data quality aspects with different portions of semantically equivalent classes for which data conflicts can occur.

Information about data quality measurements as well as preferences among local objects and attribute values with respect to different data quality aspects are recorded as metadata at the integration level. The transparency of global query processing is weakened in that a global user or application can specify a data quality goal for the result of a global query. Based on such data quality goals and the recorded metadata, data integration rules are generated dynamically by the global query processor to ensure the retrieval of "high quality data" from local databases.

Throughout the paper we adopt a simple object-oriented data model, not focusing on aspects like complex objects, methods or object/method inheritance.

A typical data integration scenario which reveals the different aspects of data quality and also the need for dynamic data integration rules is discussed in the next section.

1.1 Motivating Example

Consider two classes at two local databases $DB_1$ and $DB_2$ which store environmental data. Both classes (named Pollution) contain data about the quantity of some toxic materials
Mat1 and Mat2 recorded for different regions. We assume that schematic conflicts between the two classes have been resolved and now a data integration rule for the global class, say GPollution, needs to be specified. Below are the extensions of the two classes:

[Table: Pollution@DB1 ($C_1$), columns Region, Area, Mat1, Mat2; cell values garbled in the source]
[Table: Pollution@DB2 ($C_2$), columns Region, Area, Mat1, Mat2; cell values garbled in the source]

In order to define an appropriate data integration rule for these two classes, obviously a data conflict has to be resolved due to the different quantities recorded for the same regions. Traditional data integration approaches suggest a conflict resolution function that either chooses one class (or attribute) over the other or that computes the average of the values recorded for the same region. A global query against GPollution then always retrieves the data from the two class extensions according to the specified data integration rule.

Now assume the following scenarios where additional information about the two classes and their extensions has been obtained:

Scenario 1: $DB_1$ updates its class $C_1$ on Mondays, Thursdays, and Saturdays, and $DB_2$ updates its class $C_2$ on Tuesdays, Fridays, and Sundays. In this case, the time at which a global query is issued against the global class GPollution determines from which extension up-to-date data are retrieved.

Scenario 2: While the quantity of Mat1 in $C_1$ is recorded using some kind of sensors, the values for Mat1 in $C_2$ are recorded on a manual basis. Thus the values for $Mat1.C_1$ may be more accurate than the values recorded in $Mat1.C_2$.

Scenario 3: $DB_1$ covers more regions in $C_1$ than $DB_2$ does in $C_2$. That is, the information in $C_2$ is less complete than the information in $C_1$.

The above three simple scenarios, which will be formalized in the next section, show that how and when data is populated into local classes plays an important role in integrating local data for global queries. Interestingly, the above scenarios also describe orthogonal aspects of data quality. That is, for example, up-to-date data implies neither most accurate nor most complete data. More importantly, while one global user (or application) might be interested in most complete data, another user might be interested in most accurate data. In both cases there must be the possibility for a user to specify certain data quality goals or, at least, the global query interface should reflect that the data retrieved from different sources in response to a global query have different quality, e.g., through tagging or grouping retrieved records.

Furthermore, the above data quality aspects are not only of interest in case of overlapping (and conflicting) extensions of local classes, but also if the extensions of semantically equivalent classes are disjoint. Thus the above scenarios describe more than just data conflicts.

1.2 Comparison with other Work

The methods and techniques described in this paper are influenced by two areas: database integration and data quality management. In the area of database integration most of the work focuses on resolving structural conflicts on the schema level (schematic conflicts)
including class and attribute conflicts, e.g., [BLN86, SPD92, GSC96, KCG95]. Less work has been done on detecting and resolving so-called semantic data conflicts and semantic heterogeneities at the data level, typically because it is difficult to formalize and compare semantics associated with data items and data values. Classifications and discussions on semantic issues in multidatabase systems can be found in [She91, SK93, KS96] and in particular [BE96]. Special types of semantic conflicts, in particular the resolution of different object identifiers, have been discussed in [CTK96, Pu91, SS95, Dem89]. Recent and very promising work addressing semantic issues in database interoperability focuses on context-based approaches, e.g., [GMS94, KS96]. Semantic conflicts of types like inconsistent data or outdated data have been listed in some work (e.g., [SK93, BE96]), but their identification and resolution has not been discussed in detail.

In the area of data quality management there has been no particular focus on integrated data or multidatabase systems in general. Most of the work focuses on definitions and measurements of data quality aspects in database and information systems [Red96, Wan98, WS96, WSF96]. Only [RW95] discusses the estimation of the quality of data in federated database systems, thus not considering the integration aspect or task but only the already integrated data. In [JV97, JJQ98] the aspect of data quality is addressed in the context of data warehouses but neither formal definitions or measurements for data quality nor design methodologies focusing on different data quality aspects are given.

1.3 Paper Outline

In Section 2 we show how the data quality aspects timeliness, accuracy and completeness can be formally defined using the notion of virtual classes. Because virtual classes and their properties are not directly accessible at multidatabase design time, in Section 3 we present an approach called information profiling that allows one to identify and compare different data quality aspects associated with local classes and their possible extensions. The result obtained through information profiling is recorded as metadata at the integration layer and is used for global query processing, which is outlined in Section 4.
2 Formalization of Data Quality in MDBS

The quality of data stored at local databases participating in a multidatabase system typically cannot be described or modeled by using discrete values such as "good", "poor" or "bad". In this section, we suggest a specification of (time-varying) data quality assertions based on comparisons of semantically related classes and class extensions. In Sections 2.1 and 2.2 we give formal definitions of different data quality aspects using virtual classes.

2.1 Virtual and Conceptual Classes

In order to analyze and determine the conflicts or semantic proximity of two objects, object attributes or even complete classes, one needs to have some kind of reference point for comparisons. In this paper, such comparisons are based on virtual classes. A virtual class is a description of real world objects or artifacts that all have the same (not necessarily completely instantiated) attributes. The extension of a virtual class is assumed to be always up-to-date, complete and correct, i.e., only current real world objects and data are reflected in the extension of a virtual class.
Assuming this type of class is quite natural in database design, and such classes are also used to describe semantic issues in database interoperability, e.g., in [SG93, GSC96]. The reason for this is that the development of local database schemas is typically driven by what information about real world objects and classes is needed for local applications and what information is available about these classes and objects.

Given a description of real world objects in terms of a virtual class $C_{virt}$, local databases $DB_1, \ldots, DB_m$ typically employ only a partial mapping of real world data into local data structures, i.e., only information relevant to local applications is mapped into local classes $C_1, \ldots, C_n$. More importantly, the $n$ mappings adopted by the local databases differ in the underlying local data structures (schemata) and in how real world data is populated into these structures. Different mappings then result in schematic and semantic heterogeneities among the local classes $C_1, \ldots, C_n$ that refer to the same virtual class $C_{virt}$.

While a local class $C_i$ typically maps only a portion of the information associated with $C_{virt}$, a conceptual (or global) class $C_{con}$ integrates all aspects modeled in semantically equivalent or similar local classes (Figure 1).
Figure 1: Relationship between virtual, conceptual and local classes (diagram labels: virtual class, local mappings, semantically equivalent local classes $C_1, \ldots, C_n$, schema integration, $C_{con}$)

Determining conceptual classes as components of, e.g., a federated or multidatabase schema, is the main task in database integration. One main goal in database integration is that the specification of a conceptual class $C_{con}$ comes as near as possible to the specification of the associated virtual class $C_{virt}$ from which the local classes $C_1, \ldots, C_n$ are derived. Beside these structural aspects of conceptual classes, the other aspect is how to integrate data from (now structurally equivalent) local classes into global classes. That is, data integration rules need to be specified. If the (possible) extensions of local classes are known to be disjoint, then the data integration rule basically consists of uniting the extensions of these local classes. In case of possible object or data conflicts, that is, for an object of the global class there are at least two local objects corresponding to that object but with different attribute values, data integration rules additionally describe conflict resolution. Such resolutions ensure that only one object is retrieved from local classes or that local attribute values from the same objects are combined appropriately into one global object.

2.2 Basic Data Quality Aspects

A basic property of real world objects is that objects as instances of virtual classes evolve over time. Objects are added and deleted, or properties of objects change. At local sites different organizational activities are performed to map such time-varying information into local classes, thus resulting in a type of heterogeneity among local databases we call operational heterogeneity. We consider operational heterogeneity as a non-uniformity in the frequency, processes, and techniques by which real world information is populated
into local data structures. For example, while at one local database the stored data are updated manually on a weekly basis, at another database storing related information updates are performed automatically (e.g., sensor-based) on a monthly basis. In such cases, operational heterogeneity can lead to the fact that similar data referring to the same properties and attributes of real world objects have different quality and thus may have varying reliability. Depending on the data capturing and processing approaches used, the quality of data at a local database can even vary over time; for example, data can be outdated, as discussed in Section 1.1. More importantly, we have also shown that operational heterogeneity cannot be resolved on the schema integration level but requires a suitable approach to data integration.

These observations raise the question whether existing approaches to resolving semantic data conflicts are appropriate or rich enough to handle aspects such as outdated data or wrong data. In this paper we consider operational heterogeneity and data quality in particular as a type of semantic data conflict which requires a not necessarily unique resolution at run-time by means of (time-varying) data integration rules.

We first give a formal definition of the time-varying data quality aspects we consider most important in database integration and resolving semantic data conflicts. For this we make the following (simplified) assumptions.

- There are two classes $C_1\langle A_1, \ldots, A_n\rangle$ and $C_2\langle A_1, \ldots, A_n\rangle$ from two local databases $DB_1$ and $DB_2$. $C_1$ and $C_2$ refer to the same virtual class $C_{virt}$ and schema integration has been performed for $C_1$ and $C_2$ into the conceptual class $C_{con}\langle A_1, \ldots, A_n\rangle$. We assume that the two classes are represented in the global data model. The resolution of heterogeneous representations of the original local class structures is a topic of schema integration and is therefore outside the scope of this paper.

- Using the predicate $same$ it is possible to determine whether an object $o_1$ from
the extension of $C_1$, denoted by $Ext(C_1)$, refers to the same real world object $o \in Ext(C_{virt})$ as an object $o_2 \in Ext(C_2)$. Differences among key representations are assumed to be resolved during schema integration by appropriate methods as suggested in, e.g., [Pu91, CTK96, SS95].

As motivated in Section 1.1, the aspect of time plays an important role in operational heterogeneity. For this, we assume a discrete model of time isomorphic to the natural numbers. In this model time is interpreted as a set of equally spaced timepoints, each timepoint $t$ having a single successor. The present point of time, which advances as time goes by, is denoted by $t_{now}$. The extension of a class $C$ at timepoint $t$ is denoted by $Ext(C, t)$, the value of an object $o$ for an attribute $A \in schema(C)$ at timepoint $t$ is denoted by $Val_C(o, A, t)$. If no timepoint is explicitly specified, we assume the timepoint $t_{now}$. We furthermore assume a function $time_C(o, A, t)$ that determines the timepoint $t' \le t$ at which the value $A.o$ of attribute $A$ of object $o \in Ext(C, t)$ was last updated before $t$.

The above definitions and assumptions now provide us with a suitable framework to define the data quality aspects timeliness, completeness, and accuracy in a formal way.

Definition 2.1 (Timeliness) Given two classes $C_1$ and $C_2$ with $schema(C_1) = schema(C_2)$. Class $C_1$ is said to be more up-to-date than $C_2$ at timepoint $t \le t_{now}$ with respect to attribute $A \in schema(C_1)$, denoted by $C_1 >^{time}_{A,t} C_2$, iff

\[
\begin{aligned}
&\mathrm{count}\{o_1 \mid o_1 \in Ext(C_1, t),\ o_2 \in Ext(C_2, t) : same(o_1, o_2) \wedge time_{C_1}(o_1, A, t) > time_{C_2}(o_2, A, t)\} \\
>\ &\mathrm{count}\{o_2 \mid o_1 \in Ext(C_1, t),\ o_2 \in Ext(C_2, t) : same(o_1, o_2) \wedge time_{C_2}(o_2, A, t) > time_{C_1}(o_1, A, t)\}
\end{aligned}
\]
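The count-based condition of Definition 2.1 can be illustrated with a minimal sketch. It assumes that the `same` predicate is realized by a shared object key and that the last update timepoints of attribute $A$ are already known for both classes; the function name `more_up_to_date` and the sample timepoints are hypothetical and not taken from the paper.

```python
from datetime import date

def more_up_to_date(last_update_c1, last_update_c2):
    """Count-based timeliness test in the spirit of Definition 2.1.

    Each argument maps an object key (standing in for the `same`
    predicate) to the timepoint at which attribute A was last updated.
    """
    shared = last_update_c1.keys() & last_update_c2.keys()
    c1_newer = sum(1 for k in shared if last_update_c1[k] > last_update_c2[k])
    c2_newer = sum(1 for k in shared if last_update_c2[k] > last_update_c1[k])
    return c1_newer > c2_newer

# Hypothetical update timepoints, keyed by (region, area).
c1 = {("R1", "A1"): date(1998, 10, 9),
      ("R1", "A2"): date(1998, 10, 7),
      ("R2", "B1"): date(1998, 10, 1)}
c2 = {("R1", "A1"): date(1998, 10, 8),
      ("R1", "A2"): date(1998, 10, 6),
      ("R2", "B1"): date(1998, 10, 13)}

print(more_up_to_date(c1, c2))  # C1 is newer for two of the three shared objects
```

Only objects present in both extensions are compared, which mirrors the role of $same(o_1, o_2)$ in the definition.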
In other words, class $C_1$ is more up-to-date than $C_2$ at timepoint $t$ with respect to attribute $A$ if its extension $Ext(C_1, t)$ contains more recent updates on $A$ than $Ext(C_2, t)$ does. Note that for $t = t_{now}$ this property may flip since $t_{now}$ advances as time goes by. It should also be noted that there are alternative definitions for timeliness. Possible definitions depend on how much information about update strategies is available for local databases and classes at integration time. For example, focusing more on the distances between the timepoints where the attribute values of the same objects were updated would give rise to the following condition:

\[
\begin{aligned}
&\mathrm{sum}\{time_{C_1}(o_1, A, t) - time_{C_2}(o_2, A, t) \mid o_1 \in Ext(C_1, t),\ o_2 \in Ext(C_2, t) : same(o_1, o_2)\} \\
>\ &\mathrm{sum}\{time_{C_2}(o_2, A, t) - time_{C_1}(o_1, A, t) \mid o_1 \in Ext(C_1, t),\ o_2 \in Ext(C_2, t) : same(o_1, o_2)\}
\end{aligned}
\]

The following example shows the difference between the two conditions.

Example 2.2 Assume that the following values for $time_{C_1}$ and $time_{C_2}$ have been determined at $t = t_{now}$ = '10-14-98' for the attribute $A$. The attribute values $A.C_1$ and $A.C_2$ shown in each row are assumed to be values of objects from $Ext(C_1, t)$ and $Ext(C_2, t)$ and both objects refer to the same object in $C_{virt}$.

[Table: columns $A.C_1$, $time_{C_1}(o_1, A, t)$, $A.C_2$, $time_{C_2}(o_2, A, t)$; row values garbled in the source]

Applying the first condition and applying the second condition lead to different results for the timeliness relationship between $C_1$ and $C_2$; for the second condition, the two sums compared are 12 and 2 (12 > 2).

Although at a local database modifications of real world objects are mapped in a timely manner using a certain data capturing approach, this does not necessarily mean that this approach suitably propagates information about deleted or new objects of the virtual class. While timeliness essentially refers to properties of attributes, the data quality aspect completeness focuses on the extensions of the two classes $C_1$ and $C_2$ as a whole.

Definition 2.3 (Completeness) Class $C_1$ is said to be more complete than the class $C_2$ at timepoint $t \le t_{now}$, denoted by $C_1 >^{comp}_{t} C_2$, iff

\[
\begin{aligned}
&\mathrm{count}\{o_1 \mid o_1 \in Ext(C_1, t) \wedge \neg\exists o' \in Ext(C_{virt}, t) : same(o_1, o')\} \\
<\ &\mathrm{count}\{o_2 \mid o_2 \in Ext(C_2, t) \wedge \neg\exists o' \in Ext(C_{virt}, t) : same(o_2, o')\}
\end{aligned}
\]

In other words, although the extension of $C_2$ may contain more objects at timepoint $t$, this does not necessarily mean that these objects still exist in the corresponding virtual class. $C_2$ can contain numerous objects which have already been deleted in reality at a timepoint $t' < t$. The above definition thus reflects the aspect of outdated objects. Note that we have not included the aspect that $C_1$ must also contain more information about any real world object from $C_{virt}$ than $C_2$ does. Including this aspect would imply that at a local database one is always interested in all objects that belong to a virtual class. This, however, is often not the case because the working scope of local classes and applications is typically restricted to a subset of such real world objects.

The third data quality aspect, which is orthogonal to timeliness and completeness, is the aspect of data accuracy which focuses on how well properties or attributes of real world objects are mapped into local classes.
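Definition 2.3 can be sketched in a few lines. The sketch assumes, purely for illustration, that the extension of the virtual class is available as a set of object keys (in practice it is not directly accessible) and that equal keys stand in for the `same` predicate; the names `stale_count` and `more_complete` and the sample regions are hypothetical.

```python
def stale_count(local_ext, virtual_ext):
    """Number of local objects without a counterpart in the virtual class."""
    return len(set(local_ext) - set(virtual_ext))

def more_complete(ext_c1, ext_c2, virtual_ext):
    """True if C1 holds fewer outdated objects than C2 (Definition 2.3)."""
    return stale_count(ext_c1, virtual_ext) < stale_count(ext_c2, virtual_ext)

virtual = {"R1", "R2", "R4"}        # hypothetical Ext(C_virt, t)
ext_c1 = {"R1", "R2", "R4"}         # no outdated objects
ext_c2 = {"R1", "R2", "R3", "R5"}   # R3 and R5 no longer exist in reality

print(more_complete(ext_c1, ext_c2, virtual))
```

Here the second extension contains more objects than the first but is still less complete in the sense of the definition, because two of its objects are outdated.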
Definition 2.4 (Data Accuracy) Given two classes $C_1$ and $C_2$ with $schema(C_1) = schema(C_2)$ and attribute $A \in schema(C_1)$. Class $C_1$ is said to be more accurate than $C_2$ with respect to $A$ at timepoint $t$, denoted by $C_1 >^{acc}_{A,t} C_2$, iff

\[
\begin{aligned}
&\mathrm{count}\{o_1 \mid o_1 \in Ext(C_1, t),\ o_2 \in Ext(C_2, t),\ o \in Ext(C_{virt}, t) : same(o_1, o_2) \wedge same(o_1, o)\ \wedge \\
&\qquad |Val_{C_{virt}}(o, A, t) - Val_{C_1}(o_1, A, t)| < |Val_{C_{virt}}(o, A, t) - Val_{C_2}(o_2, A, t)|\} \\
>\ &\mathrm{count}\{o_2 \mid o_1 \in Ext(C_1, t),\ o_2 \in Ext(C_2, t),\ o \in Ext(C_{virt}, t) : same(o_1, o_2) \wedge same(o_2, o)\ \wedge \\
&\qquad |Val_{C_{virt}}(o, A, t) - Val_{C_2}(o_2, A, t)| < |Val_{C_{virt}}(o, A, t) - Val_{C_1}(o_1, A, t)|\}
\end{aligned}
\]

In the above definition $-$ denotes a generic minus operator which is applicable to a pair of strings, numbers or dates. In order to suitably incorporate the aspect of possible null values, one can define a value $A_{max}$ that is used if $Val_{C_i}(o_i, A, t)$ is null. It is even possible to give a definition for data accuracy that takes into account only the number of objects that have the value null for the attribute $A$.

Determining the timeliness and accuracy of data only makes sense for certain attributes. For example, object identifiers are typically time-invariant and thus cannot cause any data quality problems (unless a property of real world objects designates the object identifier). We will discuss the aspect of time-varying and time-invariant attributes in more detail in Section 3. The important point with the above definitions is that they describe orthogonal data quality aspects. That is, in case of a data conflict among two objects referring to the same real world object, it is possible to choose either the most accurate or the most up-to-date data about this object, depending on whether respective specifications exist for the two objects. Also in case of non-conflicting object classes and extensions (i.e., the extensions of semantically equivalent local classes are disjoint), a data quality aspect might give a reason to choose one extension over the other or, at least, to distinguish or group the results retrieved from these classes. Both scenarios, of course, require weakening the transparency property of integrated classes and objects at the global level.

In practice it is rather unlikely that there is one class among several semantically equivalent classes such that this class satisfies all three data quality aspects best. That is why these, possibly time-varying, aspects need to be modeled suitably and utilized for global query processing.

3 Modeling Data Quality Aspects

In the previous section we have shown that it is possible to formally define data quality aspects using virtual classes. In practice, however, virtual classes are not explicitly provided in a way that they can easily be used to evaluate the conditions given in Definitions 2.1, 2.3, and 2.4. In this section we discuss how statements about data quality can be extracted from local databases participating in a multidatabase system. We show how these statements can be modeled as metadata available for global query processing. The underlying concept for this is information profiling.

3.1 Information Profiling

Information profiling essentially refers to the activity of describing the information capturing and processing techniques used to populate real world data into a local data model and data structures, respectively, classes. Information profiling is not peculiar to database
integration, but is an integral task in database and application design. An information profile, e.g., for a class, describes not only the schema of that class (static properties) but also how real world data or artifacts are mapped as objects into this class (dynamic properties). This includes a description of (sets of) object methods and how these methods interact with the database and application environment. Most information for profiling can be obtained while structural and semantic conflicts are investigated at multidatabase design time. Like any other proposal for database integration and resolving semantic conflicts, resolving operational heterogeneity through information profiling requires not only a good knowledge about particular local database schemas but also about the environments in which the local databases operate.

Given a local class $C$ with attributes $\langle A_1, \ldots, A_n\rangle$ and a data quality aspect $Q \in \{timeliness, completeness, accuracy\}$. The basic idea of information profiling is to partition $C$ and its possible extensions $PExt(C)$ into a set $U_C$ of information units such that all (possible) data in a unit $u \in U_C$ show the same properties (see below) with respect to $Q$. A basic information unit $u$ for a class $C$ and its possible extensions $PExt(C)$ can be one of the following types. We assume that the object identifier is part of every unit.

(U1) an attribute $A \in \{A_i, \ldots, A_k\}$, i.e., $u^A = A$

(U2) a subset of possible attribute values for an attribute $A \in \{A_i, \ldots, A_k\}$, i.e., $u^A = \{A(o) \mid o \in PExt(C) \wedge p(o)\}$ with $p$ being a selection predicate.

The set of basic information units $u^A_1, u^A_2, \ldots, u^A_k$ associated with an attribute $A \in schema(C)$ is denoted by $U^A_C$. Partitioning a class $C$ into basic information units can be optimized by building composed information units that consider more than one attribute, i.e., $u = \{A_{i_1}, \ldots, A_{i_k}\}$ or $u = \{A_{i_1}, \ldots, A_{i_k}(o) \mid o \in PExt(C) \wedge p(o)\}$, where $A_{i_j} \in \{A_i, \ldots, A_k\}$, if all the constituting basic units have the same properties with respect to the data quality aspect $Q$. In order to discuss the basic idea of information profiling and also later the global query processing in the presence of information units (Section 4), we do not consider these optimization issues in the rest of the paper.
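The two types of basic information units can be pictured with a small data structure. This is only a sketch under stated assumptions: objects are plain dictionaries with an `oid` key, a unit without a predicate plays the role of type (U1), and a unit with a selection predicate plays the role of type (U2). The class `InformationUnit`, its fields, and the sample objects are hypothetical and not part of the paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class InformationUnit:
    name: str
    attribute: str
    # None models type (U1): the unit is the whole attribute.
    # A callable models type (U2): a selection predicate p over objects.
    predicate: Optional[Callable[[dict], bool]] = None

    def covers(self, obj: dict) -> bool:
        return self.predicate is None or self.predicate(obj)

    def values(self, possible_ext):
        """(object identifier, attribute value) pairs that fall into this unit."""
        return [(o["oid"], o[self.attribute]) for o in possible_ext if self.covers(o)]

# Hypothetical objects of a class with attributes Region and Mat1.
objects = [
    {"oid": 1, "Region": "R1", "Mat1": 15},
    {"oid": 2, "Region": "R2", "Mat1": 7},
]

whole_attribute = InformationUnit("u_all", "Mat1")                        # type (U1)
region_r1 = InformationUnit("u_r1", "Mat1", lambda o: o["Region"] == "R1")  # type (U2)

print(whole_attribute.values(objects))
print(region_r1.values(objects))
```

The object identifier is returned with every value, reflecting the assumption in the text that it is part of every unit.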
With an information unit a profile is associated, describing how and when the data in this unit are captured, processed and maintained. Such descriptions, of course, depend on the type of data quality aspect $Q$ for which the class $C$ has been partitioned (Q-profile, Q-partitioning). From a practical point of view, we think that it is rather unlikely to have a formal specification for profiles. Depending on the type of data quality aspect, a profile should consider, respectively address, the following issues:

Timeliness: Given an attribute $A \in schema(C)$. What is the (average) update frequency of that attribute? Do updates occur on an event-based basis? Are update frequencies time-dependent?

Completeness: How do data processing techniques ensure that real world objects relevant for that class are recorded? Are the employed processing techniques time-dependent?

Data Accuracy: What type of technique is used to record real world data? Are data entered manually or are they read, e.g., by sensors? Are the data extracted from other resources? How many data processing units (applications) are involved in reading real world data into local data structures? Are the employed techniques time-dependent?
We are fully aware of the fact that not having a formal specification for data quality profiles can be a critical point. It should, however, be noted that formal specifications often do not exist either where structural or certain semantic relationships and heterogeneities among classes and attributes need to be discovered and resolved at the database schema level. We think that system descriptions such as data flow diagrams, workflow diagrams, or even information obtained through system analysis can often provide useful and sufficient data for information profiling.

Beside the information units $U^A_C$ that can be associated with an attribute $A \in schema(C)$ there are two special types of units. The first one we call default unit, denoted by $u^A_d$, and it corresponds to the partition of $A$ for which data is (or will be) recorded but no Q-profile can be given. The second type, denoted by $u^A_0$, is the description of the partition for which data in $PExt(C)$ exist, but for which data are never recorded because, e.g., they are not admissible due to some integrity constraints. Note that (1) the specification of $u^A_0$ can always be determined based on the specifications of the other units, and (2) the specification of $u^A_0$ is always of type (U2). Adding these two types of units to the units in $U^A_C$ for which profiles have been determined means that for an attribute $A$ and associated domain $D$ a complete partitioning can always be given, which, in the worst case (no profiles available), consists only of the default unit $u^A_d$.

3.2 Class Partitioning and Quality Assertions

Partitioning and information profiling is applied to all attributes of a class $C$. That is, for $C$ a set $U^{\mathcal{A}}_C = U^{A_1}_C \cup \ldots \cup U^{A_n}_C$ of information units is determined. Applying this technique to some sample classes and their possible extensions shows that time-invariant attributes typically are all partitioned in the same way (i.e., same specification of $p$ for basic units of type (U2)) while partitions for time-variant attributes (i.e., attributes that are updated frequently) differ. Figure 2 illustrates an example result of partitioning a class $C$ with time-invariant attributes $A_1, A_3$ and time-variant attributes $A_2, A_4$.
Figure 2: Partitioning of a class $C$ (diagram: the unit sets $U^{A_1}_C$, $U^{A_2}_C$, $U^{A_3}_C$, $U^{A_4}_C$)

Having at least two partitions $u^A_i$ and $u^A_j$ different from $u^A_0$ and $u^A_d$ for an attribute $A$ indicates that even within possible extensions of $C$, the quality of the data values for $A$ differs. This is quite natural since we cannot expect all objects and attributes in a class to have the same quality with respect to a data quality aspect $Q$. Because the partitions themselves do not say much about how the quality of the data in these units is related, attribute quality assertions are determined among the units in $U^A_C$ and associated profiles such that a complete order among the units in $U^A_C$ is obtained. Determining this order is done by pairwise comparing the Q-profiles associated with each unit $u^A_i \in U^A_C \setminus \{u^A_0, u^A_d\}$.
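How pairwise comparisons lead to a complete order can be sketched as follows. The sketch assumes that the outcome of comparing two profiles is available as a pair `(better, worse)` of unit names; it closes these pairs under transitivity and reports `None` when some pair of units remains incomparable or contradictory. The function `total_order` and the unit names are hypothetical, and the paper does not prescribe this procedure.

```python
from itertools import combinations

def total_order(units, assertions):
    """Derive a best-first order from pairwise (better, worse) assertions.

    Returns None if the assertions yield only a partial order or
    contradict each other.
    """
    better = set(assertions)
    changed = True
    while changed:  # transitive closure of the asserted pairs
        changed = False
        for a, b in list(better):
            for c, d in list(better):
                if b == c and (a, d) not in better:
                    better.add((a, d))
                    changed = True
    for a, b in combinations(units, 2):
        # exactly one direction must hold for every pair of units
        if ((a, b) in better) == ((b, a) in better):
            return None
    return sorted(units, key=lambda u: sum((u, v) in better for v in units), reverse=True)

units = ["u1", "u2", "u3"]
print(total_order(units, {("u1", "u2"), ("u2", "u3")}))  # complete order
print(total_order(units, {("u1", "u2")}))                # u3 is incomparable
```

The default unit and the never-recorded unit are deliberately left out of `units`, since no assertions refer to them.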
Definition 3.1 (Attribute Quality Assertions) Given two information units $u^A_i, u^A_j \in U^A_C \setminus \{u^A_0, u^A_d\}$ which have been determined based on a Q-partitioning of $A$. If the profiles associated with $u^A_i, u^A_j$ reveal that the data contained in $u^A_i$ has a better quality than the data contained in $u^A_j$, then the attribute quality assertion $u^A_i >_Q u^A_j$ is said to be valid.

The goal of pairwise comparing profiles associated with the units $u^A_i$ is to determine a complete order among the units in $U^A_C$. Not having a complete order but only a partial order introduces some kind of vagueness in the assertions which then has to be dealt with in a special way. For the sake of simplicity and also due to space limitations, in the remainder of the paper we assume that a complete order can always be determined. Note that $u^A_d$ cannot be compared with any other unit from $U^A_C$ because no profile for $u^A_d$ is given, thus there are no assertions referring to $u^A_d$. Furthermore, there is no need to specify any assertion involving $u^A_0$ because corresponding data is never recorded for that unit in $C$.

Before we discuss comparisons of attribute quality assertions of semantically equivalent classes, we first focus on the aspect of time which we explicitly introduced in Section 2.2. Time plays an important role when determining quality assertions among units that have been obtained based on a timeliness-partitioning (see also Scenario 1 in Section 1.1). It is easy to see that if time plays a role, the total order among information units in $U^A_C$ is time-dependent, too. While a time-dependent order can be optional with respect to the data quality aspects completeness and accuracy, it is intrinsic to the aspect of timeliness.
This means that for each possible timepoint $t$, we have to specify a total order among the units in $U^A_C$. This can be done in different ways. Assuming a finite domain for time, e.g., number of the week in a year or day of the week, it is possible to give a complete specification of time-dependent orders. Thus with each order a set of days or numbers of weeks can be associated. More complex temporal quantifiers for time-dependent orders, of course, can be useful and need to be investigated in future research. Here we assume that with each order a set of time intervals or timepoints is associated and that, given a partitioning $U^A_C$ and an order among the information units in $U^A_C$, with each timepoint $t$ an order can be associated.

3.3 Inter-Class Considerations

Assume that two semantically equivalent classes $C_1\langle A_1, \ldots, A_n\rangle$ and $C_2\langle A_1, \ldots, A_n\rangle$ have been Q-partitioned and that all total orders have been determined for the units in $U^{A_1}_{C_1}, \ldots, U^{A_n}_{C_1}$ and $U^{A_1}_{C_2}, \ldots, U^{A_n}_{C_2}$. The task now is to combine the partition sets $U^{A_i}_{C_1}$ and $U^{A_i}_{C_2}$ into a new partitioning $U^{A_i}_C$ such that $C$ is the global class integrating $C_1$ and $C_2$. In building $U^A_C$ the task is not to define the structure of $C$ but the data integration rules associated with $C$ at the global level. The main idea for this is to resolve possible data conflicts among semantically equivalent objects in $C_1$ and $C_2$ by comparing and ordering the information units and profiles that may contain (attributes of) conflicting objects. Also in case no data conflicts exist among possible extensions of $C_1$ and $C_2$, the information units and profiles are used to establish an order (with respect to a data quality aspect $Q$) among possible objects in the global class $C$.

Given two sets of information units $U^{A_i}_{C_1}$ and $U^{A_i}_{C_2}$, a partitioning $U^{A_i}_C$ is obtained by overlapping the units in $U^{A_i}_{C_1}$ with the units in $U^{A_i}_{C_2}$ as illustrated in Figure 3.

Figure 3: Overlapping of local information units (diagram: $U^A_{C_1}$ with units $u^A_{1,1}, u^A_{1,2}, u^A_{1,3}$ overlapped with $U^A_{C_2}$ with units $u^A_{2,1}, u^A_{2,2}, u^A_{2,3}$ yields $U^A_C$ with units $u^A_1, \ldots, u^A_5$)

Note that the specification of the units in $U^A_C$ can easily be obtained from the specification of the units in $U^A_{C_1}$ and $U^A_{C_2}$. Furthermore, the partitioning $U^A_C$ is always complete because of the default units in $U^A_{C_1}$ and $U^A_{C_2}$. Given the specifications of the partitions in $U^A_C$, the next step is to compare the profiles associated with the (local) units that contribute to a unit $u^A_i \in U^A_C$. In Figure 3, for example, the units that contribute to $u^A_1 \in U^A_C$ are $u^A_{1,1} \in U^A_{C_1}$ and $u^A_{2,1} \in U^A_{C_2}$, and for $u^A_2$ they are $u^A_{1,1}$ and $u^A_{2,2}$. Because the profiles associated with the local units all refer to the same data quality aspect $Q$, it is possible to determine a preference among two local units that contribute to a global unit in $U^A_C$. Determining preferences essentially corresponds to resolving data conflicts in case for a global object there is a corresponding object in each local unit. Again, the two types of local information units $u^A_{i,d}$ and $u^A_{i,0}$, $i \in \{1, 2\}$, need special consideration. We consider the different cases assuming that $u^A_{i,d} = u^A_{1,d}$ and $u^A_{i,0} = u^A_{1,0}$ and that either of these two units is compared with a unit $u^A_{2,j} \in U^A_{C_2}$:

1. $u^A_{1,d}$ and $u^A_{2,j} \in U^A_{C_2} \setminus \{u^A_{2,d}, u^A_{2,0}\}$: because no profile is associated with $u^A_{1,d}$, no preference can be given unless it is known that, for example, $u^A_{2,j}$ has the highest quality among the local units in $U^A_{C_2}$ with respect to the data quality aspect $Q$.

2. $u^A_{1,d}$ and $u^A_{2,d}$: no preference can be given, meaning that it is possible at the global level to return two local objects (or attributes) that correspond to the same global object and there (perhaps) is a data conflict among the two objects (attributes) in $u^A_{1,d}$ and $u^A_{2,d}$.

3. $u^A_{1,0}$ and $u^A_{2,j} \notin \{u^A_{2,d}, u^A_{2,0}\}$: because $u^A_{1,0}$ never contains data, $u^A_{2,j}$ is chosen.

4. $u^A_{1,0}$ and $u^A_{2,d}$: same as for 3.

5. $u^A_{1,0}$ and $u^A_{2,0}$: there are never data in either of the two units. Thus there will never be a conflict that needs to be resolved based on the profiles associated with the two units.

Once a preference among the units building the global information units in $U^A_C$ has been determined, again a complete order among the information units in $U^A_C \setminus \{u^A_d, u^A_0\}$ is determined. Note that in case the two local classes $C_1$ and $C_2$ have been partitioned with respect to a data quality aspect $Q$ and the total order among the units in $U^A_{C_1}$, respectively $U^A_{C_2}$, is time-dependent, the total order among the units in $U^A_C$ can be time-dependent, too. More importantly in this case, comparisons of the profiles of local units must occur for each timepoint $t$. It remains to be investigated to what extent the process of determining
giventheattributequalityassertionsforuac1anduac2. a(time-dependent)totalorderamongtheunitsinuaccanbedeterminedautomatically, recordedasmetadataforglobalqueryprocessing,wersttakeacloserlookatthedata qualityaspectcompleteness.asdescribedinsection2.2,completenessreferstocomplete extensionsofclassesratherthanonpossibledataconictscoveredbytheaspectstimelinessandaccuracy.partitioningaclassanditspossibleextensionswithrespecttothe aspectcompletenessoftenrevealsaverysimplepattern.figure4illustratesanoptimized completeness-partitioninguac1anduac2fortwosemanticallyequivalentlocalclassesc1 andc2. PSfragreplacements ua1;1 ua2;1 UAC2 Beforeweconcludethissectionbydescribinghowinformationunitsandprolesare ua1;2 ua2;2 NotethatforallattributesinAthesamepartitioningisadopted.Thisisquitenatural Figure4:Partitioningbasedonthedataqualityaspectcompleteness becausecompletenessreferstoobjectsasawholeandnottosingleattributes.forthis particulardataqualityaspectthusdeninglocalattributequalityassertionsaswellas 3.4 deningtheinformationobtainedthroughoverlappingoflocalunitscanbedoneeasily. alentclassesisusedbytheglobalqueryprocessortodealwithresolvingdataconicts Asmentionedearlier,informationaboutpartitioningandoverlappingsemanticallyequiv- DataQualityInformationasMetadata andorderingobjectsandattributesofdierentquality. 
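The five preference cases of Section 3.3 can be illustrated with a minimal sketch. It is not part of the framework's specification: the names `Kind`, `Unit`, and `prefer` are hypothetical, and the integer `rank` (smaller means better with respect to the aspect Q) is an assumed stand-in for the outcome of comparing two profiles. The "unless it is known that ..." exception of case 1 is deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Kind(Enum):
    REGULAR = "regular"  # unit with an associated profile
    DEFAULT = "default"  # default unit u_d: no profile is associated
    EMPTY = "empty"      # unit u_0: never contains data


@dataclass(frozen=True)
class Unit:
    name: str
    kind: Kind = Kind.REGULAR
    # Hypothetical encoding of a profile comparison w.r.t. one aspect Q:
    # a smaller rank stands for a better profile.
    rank: Optional[int] = None


def prefer(u1: Unit, u2: Unit) -> Optional[Unit]:
    """Return the preferred local unit, or None if no preference applies."""
    # Case 5: neither unit ever holds data, so no conflict can arise.
    if u1.kind is Kind.EMPTY and u2.kind is Kind.EMPTY:
        return None
    # Cases 3 and 4: an empty unit never holds data, so the other is chosen.
    if u1.kind is Kind.EMPTY:
        return u2
    if u2.kind is Kind.EMPTY:
        return u1
    # Cases 1 and 2: a default unit has no profile, so nothing can be compared.
    if Kind.DEFAULT in (u1.kind, u2.kind):
        return None
    # Two regular units: compare the (assumed) profile ranks.
    if u1.rank is None or u2.rank is None or u1.rank == u2.rank:
        return None
    return u1 if u1.rank < u2.rank else u2
```

A result of `None` covers both "no preference can be given" and "no conflict can occur"; a metadata repository would probably need to distinguish the two, which this sketch does not attempt.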
The metadata repository utilized by the query processor already contains information about the structure of global (conceptual) classes and the (structural) mappings of local classes and attributes into global classes. Besides this structural information about metadata at the global level, the repository also has to record information about how to actually integrate data contained in local class extensions into global class extensions. Such data integration rules are typically specified by using local class structures formulated in the global data model. Assume two semantically equivalent classes $C_1$ and $C_2$ where structural and (certain) semantic conflicts have been resolved, and $C_1$, $C_2$ are specified in the global data model. A simple data integration rule for the global class $C$ then might look like $Ext(C) := Ext(C_1) \cup Ext(C_2)$ (using the relational operator union). Depending on possible data conflicts among objects in $Ext(C_1)$ and $Ext(C_2)$, a data integration rule can be more complex because conflict resolution functions are encoded in the rule.

What do data integration rules look like in the presence of various information about the data quality of local classes, information units, etc.? While in traditional approaches to query processing in multidatabase systems the global query processing engine simply utilizes the specified data integration rule, in the scenario presented in this paper it is the task of the query engine to form a data integration rule. For this, the following information is recorded in the metadata repository at the integration layer.
• For each local class $C_i$ and each data quality aspect Q, the specification of the Q-partitions, the associated Q-profiles, the description of the attribute quality assertions, and their (time-dependent) total order.

• For each global class $C$ and each attribute $A \in schema(C)$, the specification of the partitioning $U^A_C$ and the description of the total order among the units in $U^A_C$.

A more precise definition of the information recorded in the metadata repository depends on the underlying global data model, the type of expressions allowed in building information units of type (U2), and the type of temporal quantifiers that can be associated with total orders among information units. For the sake of simplicity, in the following section we assume the abstract description of the metadata as above.

4 Query Processing

In this section we give a short outline of global query processing in a multidatabase system in the presence of data quality information. Because of the various features and possible optimization techniques that can be applied in this scenario, we only give the basic idea of global query processing. For example, we do not consider complex queries (nested queries, subqueries). We also assume that the reader is familiar with the basic tasks and methods in query processing in multidatabase systems (see, e.g., [MY95] or [EDN97]).

4.1 Global Query Language

A main feature of a global query language for an MDBS should be to provide global users, designers, and applications with transparent access to local data. That is, the user should be unaware of possible structural and semantic conflicts among local meta-data, and s/he should also not be responsible for resolving such conflicts (some approaches, e.g., [SRL93, MR95], adopt exactly the opposite strategy).

In the presence of different data quality aspects, which are encoded in the metadata, it is quite obvious that global queries cannot simply be decomposed solely based on prespecified, unique data integration rules. But what does this mean when there are local data of different quality and the data need to be integrated at run-time? First, as discussed extensively, data quality aspects are orthogonal. That is, there does not exist a unified or unique view on the integrated data. Assuming that the user is aware of the fact that local data have different quality, query language constructs are necessary that support the specification of a data quality goal for global queries and thus for integrated data. Such an approach then supports the following two aspects:

1. In case there are data conflicts among two or more semantically equivalent local objects, the object (or attributes) satisfying the specified data quality goal best is chosen. Note that in this case the metadata about local information units and preferences among these units (with respect to a global class) are utilized.

2. In case there are no conflicts among objects of local classes but the quality of the objects (with respect to a data quality aspect Q) is different, the integrated objects need to be grouped or tagged to indicate this aspect at the global query interface.
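The two aspects above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the functions `resolve` and `integrate`, the tuple layout of a candidate, and the per-aspect integer ranks (smaller means better) are all hypothetical and merely stand in for the preferences among information units recorded in the metadata repository.

```python
from typing import Any, Dict, List, Tuple

# A candidate value for one global object: (local unit name, value, ranks).
# The ranks map a data quality aspect to an integer, smaller meaning better.
# This encoding is an assumption made for the sketch.
Candidate = Tuple[str, Any, Dict[str, int]]


def resolve(candidates: List[Candidate], goals: List[str]) -> Candidate:
    """Pick the candidate that satisfies the listed quality goals best.

    Goals are compared in the given order, so a later goal only matters
    when the earlier goals do not separate two candidates.
    """
    return min(candidates, key=lambda c: tuple(c[2][g] for g in goals))


def integrate(local_values: Dict[str, List[Candidate]],
              goals: List[str]) -> Dict[str, Dict[str, Any]]:
    """Build one tagged result entry per object identifier."""
    result = {}
    for oid, candidates in local_values.items():
        unit, value, ranks = resolve(candidates, goals)
        # Aspect 2: keep the source unit and its quality as a tag, so the
        # query interface can group or mark values of different quality.
        result[oid] = {
            "value": value,
            "source": unit,
            "quality": {g: ranks[g] for g in goals},
        }
    return result
```

When two candidates are equal on every listed goal, `min` returns the first one in the list; a real query processor might instead return both values and flag the unresolved conflict.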
In order to support querying objects and attributes with possibly different quality, we suggest an extension of a global query language, say OQL, by a data quality goal clause:

    select <list of attributes>
    from <list C1, ..., Cn of global classes>
    where <selection condition>
    with goal <data quality goal>;

A simple data quality goal can be either most accurate, most up-to-date, or most complete. It is also possible to allow a list of data quality goals, meaning that among two conflicting objects (attributes) which have the same quality with respect to the first goal, those attributes are chosen which satisfy the second goal best.

4.2 Querying a Global Class

Assume a global class $C\langle A_1, \ldots, A_4 \rangle$ composed of local classes $C_1, \ldots, C_m$. In the simplest case, a query of the type

    select c.* from C with goal most accurate;

could retrieve local data following the data integration rule $Ext(C) = Ext(C_1) \cup \ldots \cup Ext(C_m)$ ($C_i$ being specified in the global data model). Now assume that information profiling has been performed for the local classes with respect to the data quality aspect accuracy and the following pattern has been obtained for the global class $C$:

[Figure 5: Global class C with its information units. Shown are the partitionings $U^{A_1}_C$, $U^{A_2}_C$, $U^{A_3}_C$ (containing the unit $u^{A_3}_1$), and $U^{A_4}_C$.]

For the above query, the query processor then utilizes the specification of each information unit in $U^A_C$ in order to compose global objects in $Ext(C)$ from local objects and attributes in $Ext(C_i)$. For example, for attribute values in the unit $u^{A_3}_1$, the query processor chooses those values from corresponding local units that satisfy the aspect accuracy best (according to the attribute quality assertions specified for the unit $u^{A_3}_1$). Roughly speaking, for each object identifier known to be in $Ext(C)$, corresponding attribute values are selected from possibly different local information units. Note that the object identifier is contained in each local information unit (see Section 3.1).

Note also that for another data quality aspect the partitioning $U^A_C$ might look completely different. In particular, if the data quality goal timeliness has been specified in a global query, the point in time at which the global query is issued determines which of the different time-dependent orders among the partitions in $U^A_C$ is chosen.

In any case, if for an attribute $A$ there exist multiple units within a global class and attribute quality assertions have been specified for these units, the display of attribute values should indicate the variance in quality among the attribute values. This, for example, can be done by using certain coloring schemes or object grouping structures. This feature is particularly necessary if a global user wants to compare the different results of the same query with different data quality goals.

4.3 Joining Global Objects

In the presence of data quality information one would like to avoid simply combining objects that have different qualities (that is also why we suggest certain grouping structures in the previous section). For example, it should be avoided (or alternatively indicated in the query result) that an "up-to-date" object is joined with an outdated object (based on the relational join operator). Because joining objects occurs in a similar way as joining tuples in relational databases, based on primary key and foreign key relationships, ensuring this property is quite trivial under the following assumption: primary and foreign key attributes (or object references) always have the same quality for all data quality aspects. Obviously, this assumption can easily be verified while modeling data quality aspects as shown in Section 3. Suppose the query

    select c1.*, c2.* from C1, C2
    where c1.primary-key = c2.foreign-key
    with goal most accurate;

where the join condition is based on a primary-foreign key relationship. For this query, the query processor first builds the extension of the global class $C_2$ according to the specified data quality goal. Then, for each referenced object, the respective (subset of the) extension of $C_1$ is built according to the same data quality goal, but restricted to the objects needed for $Ext(C_1)$.

5 Conclusions and Future Work

In this paper we have presented a novel framework for handling diverse data quality aspects in database integration. We have shown that there is a strong need for supporting data quality in data integration, ensuring that applications built on top of a multidatabase system can rely on "high-quality data".

The important contribution of this paper to database integration approaches is that we have shown (1) how to formalize the basic data quality aspects timeliness, completeness, and accuracy, and (2) how to model diverse data quality properties of local and global class extensions using information units, information profiling, and attribute quality assertions.

We are convinced that the concept of information profiling plays an important role in designing a multidatabase system, in particular for designing data quality dependent data integration strategies and rules. In our future work we aim to develop more formal specifications of information profiles, ideally allowing profiles to be compared and attribute quality assertions to be determined on an automated basis, supported by tools of, e.g., a multidatabase system design environment. Another aspect which needs more investigation is the use of a temporal logic framework for specifying time-dependent orders among attribute quality assertions.

Making the notion of data quality explicit at the multidatabase query language level opens up a completely new area of research in multidatabase query processing:

• How to exploit data quality features recorded for local and global classes and their extensions at the global query level? What are useful query language features?
• How to perform multidatabase query optimizations in the presence of data having different quality?

• How to represent data of different quality at the global query interface level?

Having a global query processing engine that utilizes metadata about information profiles, attribute quality assertions, and preferences among information units provides a very flexible means to cope with dynamic database environments. That is, if profiles for local databases and information units change, these changes need only be described at the metadata level. More precisely, no new data integration rules need to be specified at that level; instead, they are dynamically determined by the query processing engine.

Further weakening the transparency of the existence of local databases at the global level can provide application designers and users with a sophisticated means to perform data cleansing. Because differences in the quality of local data are suitably represented at the global level, query language constructs might be useful to investigate the source of poor quality data. Thus, the framework presented in this paper furthermore provides a suitable basis for applying data cleansing techniques to local databases. The development of such environments and underlying strategies in the context of multidatabase systems, and in particular data warehouses, is subject to our future research.

References

[BA97] J. Bischoff, T. Alexander: Data Warehouses: Practical Advice from the Experts. Prentice-Hall, 1997.

[BE96] O. Bukhres, A. Elmagarmid: Object Oriented Multidatabase Systems. Prentice-Hall, 1996.

[BLN86] C. Batini, M. Lenzerini, S. B. Navathe: A Comparative Analysis of Methodologies for Database Schema Integration. ACM Computing Surveys 18:4 (December 1986), 323–364.

[CTK96] A. L. P. Chan, P. Tsai, J.-L. Koh: Identifying Object Isomerism in Multidatabase Systems. Distributed and Parallel Database Systems 4:2 (April 1996), 143–168.

[DeM89] L. G. DeMichiel: Resolving Database Incompatibility: An Approach to Performing Relational Operations over Mismatched Domains. IEEE Transactions on Knowledge and Data Engineering 1:4 (December 1989), 485–493.

[EDN97] C. Evrendilek, A. Dogac, S. Nural, F. Ozcan: Multidatabase Query Optimization. Distributed and Parallel Databases 5 (1997), 77–114.

[GMS94] C. H. Goh, S. E. Madnick, M. Siegel: Context Interchange: Overcoming the Challenges of Large-Scale Interoperable Database Systems in a Dynamic Environment. In Proceedings of the Third International Conference on Information and Knowledge Management (CIKM'94), ACM Press, 1994, 337–346.

[GSC96] M. Garcia-Solaco, F. Saltor, M. Castellanos: Semantic Heterogeneity in Multidatabases. Invited chapter in: O. Bukhres and A. Elmagarmid (eds.), Object Oriented Multidatabase Systems. Prentice-Hall, 129–202, 1996.

[Huf96] D. Hufford: Data Warehouse Quality. Data Management Review, Feb/Mar 1996.
[JJQ98] M. Jarke, M. A. Jeusfeld, C. Quix, P. Vassiliadis: Architecture and Quality in Data Warehouses. In Advanced Information Systems Engineering – CAiSE'98. LNCS Vol. 1413, Springer, 1998, 93–113.

[JV97] M. Jarke, Y. Vassiliou: Data Warehouse Quality Design: A Review of the DWQ Project. Invited Paper, Proc. of the 2nd Conference on Information Quality. Massachusetts Institute of Technology, Cambridge, 1997.

[Kim95] W. Kim: Modern Database Systems: The Object Model, Interoperability, and Beyond, 649–663. ACM Press, New York, 1995.

[Kim96] R. Kimball: The Data Warehouse Toolkit. John Wiley, 1996.

[Kim96b] R. Kimball: Dealing with Dirty Data. DBMS Magazine 9:10, September 1996, Miller Freeman, Inc., 1996.

[KCG95] W. Kim, I. Choi, S. Gala, M. Scheevel: On Resolving Schematic Heterogeneity in Multidatabase Systems. In [Kim95], 521–550.

[KS91] W. Kim, J. Seo: Classifying Schematic and Data Heterogeneity in Multidatabase Systems. IEEE Computer 24:12 (December 1991), 12–18.

[KS96] V. Kashyap, A. Sheth: Semantic and Schematic Similarities Between Database Objects: A Context-Based Approach. The VLDB Journal 5:4 (December 1996), 276–304.

[MR95] P. Missier, M. Rusinkiewicz: Extending a Multidatabase Manipulation Language to Resolve Schema and Data Conflicts. In R. Meersman, L. Mark (eds.), Database Applications Semantics, Proceedings of the Sixth IFIP TC-2 Working Conference on Data Semantics (DS-6), 93–115, Chapman & Hall, London, 1995.

[MY95] W. Meng, C. Yu: Query Processing in Heterogeneous Environment. In [Kim95], 551–572.

[Pu91] C. Pu: Key Equivalence in Heterogeneous Databases. In Y. Kambayashi, M. Rusinkiewicz, A. Sheth (eds.), Proc. of the 1st Int. Workshop on Interoperability in Multidatabase Systems (IMS'91), Kyoto, Japan, IEEE Computer Society Press, 314–316, 1991.

[Red96] T. C. Redman: Data Quality for the Information Age. Artech House, Boston, 1996.

[RW95] M. P. Reddy, R. Y. Wang: Estimating Data Accuracy in a Federated Database Environment. In S. Bhalla (ed.), Information Systems and Data Management, Proc. of the 6th Conf., CISMOD'95, LNCS 1006, Springer-Verlag, 115–134, 1995.

[She91] A. Sheth: Semantic Issues in Multidatabase Systems. SIGMOD Record 20:4, Special Issue, December 1991.

[SG93] F. Saltor, M. Garcia-Solaco: Diversity with Cooperation in Database Schemata: Semantic Relativism. Proceedings of the 14th International Conference on Information Systems (ICIS'93, Orlando 1993), 247–254.

[SK93] A. Sheth, V. Kashyap: So Far (Schematically) Yet So Near (Semantically). In D. Hsiao, E. Neuhold, R. Sacks-Davis (eds.), Interoperable Database Systems (DS-5), North-Holland, Amsterdam, The Netherlands, 1993.

[SL90] A. Sheth, J. Larson: Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. ACM Computing Surveys 22:3 (1990), 183–236.
[SPD92] S. Spaccapietra, C. Parent, Y. Dupont: Model Independent Assertions for Integration of Heterogeneous Schemas. VLDB Journal 1:1, 81–126, 1992.

[SRL93] L. Suardi, M. Rusinkiewicz, W. Litwin: Execution of Extended Multidatabase SQL. In A. Elmagarmid, E. Neuhold (eds.), Proc. of the 9th International Conference on Data Engineering, 641–650, IEEE Computer Society Press, 1993.

[SS95] I. Schmitt, G. Saake: Managing Object Identity in Federated Database Systems. In M. Papazoglou (ed.), OOER'95: Object-Oriented and Entity-Relationship Modeling, Proc. of the 14th Int. Conf., Gold Coast, Australia, LNCS 1021, Springer-Verlag, 400–411, December 1995.

[Wan98] R. Y. Wang: A Product Perspective on Total Data Quality Management. Communications of the ACM 41:2, 58–65, 1998.

[WS96] R. Y. Wang, D. M. Strong: Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems 12:4, 5–34, 1996.

[WSF96] R. Y. Wang, V. C. Storey, C. P. Firth: A Framework for Analysis of Data Quality Research. IEEE Transactions on Knowledge and Data Engineering 7:4 (August 1995), 623–640, 1996.