
Data Integration Techniques based on Data Quality Aspects

Michael Gertz
Department of Computer Science
University of California, Davis
One Shields Avenue
Davis, CA 95616, USA
gertz@cs.ucdavis.edu

Ingo Schmitt
Institut für Techn. Informationssysteme
Otto-von-Guericke-Universität Magdeburg
Universitätsplatz 2
D-39106 Magdeburg, Germany
schmitt@iti.cs.uni-magdeburg.de

Abstract

In multidatabase systems, a major data integration problem is to resolve data conflicts where two objects having the same definition and representing the same real world object have different extensions. Traditional data integration approaches suggest static conflict resolution functions that perform a computation over conflicting attribute values, thus assuming a unique and time-independent resolution. In this paper, we argue that this type of data conflict often arises due to heterogeneities among the data capturing and processing techniques and methods used by component databases. That is, component databases differ in how and when they map real world data into local data structures. As a result, the quality of the data stored at different sites can differ and can also vary over time, thus requiring dynamic data integration methods, depending on which data quality goal is required at the global level. We outline a novel framework that allows one to formalize, model and utilize diverse and in particular orthogonal data quality aspects such as timeliness, accuracy, and completeness in database integration. By making the notion of data quality aspects explicit both in modeling and querying a multidatabase system, existing approaches to database integration can not only be extended, but tools can also be developed that ensure different types of "high quality data" at the integration level and for global applications.

1 Introduction
The integration of previously independent but semantically related database systems into globally accessible multidatabase systems (MDBS) has been one of the most active areas in database research over the past 15 years. Most of the work has been devoted to approaches and techniques that allow designers to identify and resolve structural and semantic conflicts between meta-data objects (tables, classes, ...) and data items (records, tuples, objects, ...) located at local databases participating in the multidatabase system [BE96, KCG95, KS91, She91]. Despite the known difficulties in detecting semantic equivalences and resolving semantic conflicts and heterogeneities, many products have emerged that realize some kind of multidatabase system.

The most prominent type of such systems are data warehouses [Kim96, BA97]. Data warehouses typically restrict access to local databases to read-only accesses, providing global users and applications a means to integrate data from different resources for, e.g., decision support systems, hospital information systems or environmental information systems. Although the issue we discuss in this paper is not specific to data warehouses but applies to any system that provides access to integrated data, most of the problems related to integrated data have been reported for data warehouses. As recent studies and reports show, applications built on top of data warehouses in particular often experience several problems with regard to the reliability and quality of the integrated data [Kim96b, Huf96, BA97, JV97]. The main reason for this is that often the local databases participating in the multidatabase system already contain incorrect, inaccurate, outdated or simply poor quality data. The quality of integrated data then becomes even worse unless suitable methods and techniques for modeling and managing these issues are employed during multidatabase design time. Despite the amount of work that focuses on semantic heterogeneity among meta-data and data items, aspects of the quality of integrated data have not been addressed so far. Data quality mainly has been and still is an important research topic independent of database integration [Red96, Wan98, WS96, WSF96].

In this paper we explicitly introduce the notion of data quality in database and data integration. We claim that data conflicts between local data items often occur due to discrepancies in how the data are gathered, processed and maintained at local databases. This type of heterogeneity, called operational heterogeneity, can then result in the fact that the quality of information stored about real world objects or artifacts in different local databases may be different and vary over time. We show that data quality can be defined as orthogonal aspects, referring to timeliness, accuracy and completeness of data. The orthogonality of these aspects indicates that there is not always a unique resolution of data conflicts. For example, if the attribute values of two objects referring to the same real world object differ and one value is known to be more accurate and the other is known to be more up-to-date, one cannot give a unique resolution (or data integration rule) for this conflict. More importantly, accurate data does not necessarily imply up-to-date data or vice versa. Therefore, in this paper we suggest some kind of measurement for different data quality aspects. These measurements are used to model data quality aspects of local and global classes at multidatabase design time. We present a technique called information profiling that allows designers to associate and relate data quality aspects with different portions of semantically equivalent classes for which data conflicts can occur.

Information about data quality measurements as well as preferences among local objects and attribute values with respect to different data quality aspects are recorded as metadata at the integration level. The transparency of global query processing is weakened in that a global user or application can specify a data quality goal for the result of a global query. Based on such data quality goals and the recorded metadata, data integration rules are generated dynamically by the global query processor to ensure the retrieval of "high quality data" from local databases.

Throughout the paper we adopt a simple object-oriented data model, not focusing on aspects like complex objects, methods or object/method inheritance.

A typical data integration scenario which reveals the different aspects of data quality and also the need for dynamic data integration rules is discussed in the next section.

1.1 Motivating Example

Suppose there are two classes at two local databases DB1 and DB2 which store environmental data. Both classes (named Pollution) contain data about the quantity of some toxic materials Mat1 and Mat2 recorded for different regions. We assume that schematic conflicts between the two classes have been resolved and now a data integration rule for the global class, say GPollution, needs to be specified. Below are the extensions of the two classes:

[Tables: extensions of Pollution@DB1 (C1) and Pollution@DB2 (C2), each with the attributes Region, Area, Mat1, Mat2.]

In order to define an appropriate data integration rule for these two classes, a data conflict obviously has to be resolved due to the different quantities recorded for the same regions. Traditional data integration approaches suggest a conflict resolution function that either chooses one class (or attribute) over the other or that computes the average of the values recorded for the same region. A global query against GPollution then always retrieves the data from the two class extensions according to the specified data integration rule.

Now assume the following scenarios where additional information about the two classes and their extensions has been obtained:

Scenario 1: DB1 updates its class C1 on Mondays, Thursdays, and Saturdays, and DB2 updates its class C2 on Tuesdays, Fridays, and Sundays. In this case, the time at which a global query is issued against the global class GPollution determines from which extension up-to-date data are retrieved.

Scenario 2: While the quantity of Mat1 in C1 is recorded using some kind of sensors, the values for Mat1 in C2 are recorded on a manual basis. Thus the values for Mat1:C1 may be more accurate than the values recorded in Mat1:C2.

Scenario 3: DB1 covers more regions in C1 than DB2 does in C2. That is, the information in C2 is less complete than the information in C1.

The above three simple scenarios, which will be formalized in the next section, show that how and when data is populated into local classes plays an important role in integrating local data for global queries. Interestingly, the above scenarios also describe orthogonal aspects of data quality. That is, for example, up-to-date data implies neither most accurate nor most complete data. More importantly, while one global user (or application) might be interested in most complete data, another user might be interested in most accurate data. In both cases there must be the possibility for a user to specify certain data quality goals or, at least, the global query interface should reflect that the data retrieved from different sources in response to a global query have different quality, e.g., through tagging or grouping retrieved records.

Furthermore, the above data quality aspects are not only of interest in case of overlapping (and conflicting) extensions of local classes, but also if semantically equivalent classes are disjoint. Thus the above scenarios describe more than just data conflicts.

1.2 Comparison with other Work

The methods and techniques described in this paper are influenced by two areas: database integration and data quality management. In the area of database integration most of the work focuses on resolving structural conflicts on the schema level (schematic conflicts) including class and attribute conflicts, e.g., [BLN86, SPD92, GSC96, KCG95]. Less work has been done on detecting and resolving so-called semantic data conflicts and semantic heterogeneities at the data level, typically because it is difficult to formalize and compare semantics associated with data items and data values. Classifications and discussions on semantic issues in multidatabase systems can be found in [She91, SK93, KS96] and in particular [BE96]. Special types of semantic conflicts, in particular the resolution of different object identifiers, have been discussed in [CTK96, Pu91, SS95, Dem89]. Recent and very promising work addressing semantic issues in database interoperability focuses on context-based approaches, e.g., [GMS94, KS96]. Semantic conflicts of types like inconsistent data or outdated data have been listed in some work (e.g., [SK93, BE96]), but their identification and resolution has not been discussed in detail.

In the area of data quality management there has been no particular focus on integrated data or multidatabase systems in general. Most of the work focuses on definitions and measurements of data quality aspects in database and information systems [Red96, Wan98, WS96, WSF96]. Only [RW95] discusses the estimation of the quality of data in federated database systems, thus not considering the integration aspect or task but only the already integrated data. In [JV97, JJQ98] the aspect of data quality is addressed in the context of data warehouses, but neither formal definitions or measurements for data quality nor design methodologies focusing on different data quality aspects are given.

1.3 Paper Outline

In Section 2 we show how the data quality aspects timeliness, accuracy and completeness can be formally defined using the notion of virtual classes. Because virtual classes and their properties are not directly accessible at multidatabase design time, in Section 3 we present an approach called information profiling that allows one to identify and compare different data quality aspects associated with local classes and their possible extensions. The result obtained through information profiling is recorded as metadata at the integration layer and is used for global query processing, which is outlined in Section 4.
2 Formalization of Data Quality in MDBS

The quality of data stored at local databases participating in a multidatabase system typically cannot be described or modeled by using discrete values such as "good", "poor" or "bad". In this section, we suggest a specification of (time-varying) data quality assertions based on comparisons of semantically related classes and class extensions. In Sections 2.1 and 2.2 we give formal definitions of different data quality aspects using virtual classes.

2.1 Virtual and Conceptual Classes

In order to analyze and determine the conflicts or semantic proximity of two objects, object attributes or even complete classes, one needs to have some kind of reference point for comparisons. In this paper, such comparisons are based on virtual classes. A virtual class is a description of real world objects or artifacts that all have the same (not necessarily completely instantiated) attributes. The extension of a virtual class is assumed to be always up-to-date, complete and correct, i.e., only current real world objects and data are reflected in the extension of a virtual class.

Assuming this type of class is quite natural in database design, and such classes are also used to describe semantic issues in database interoperability, e.g., in [SG93, GSC96]. The reason for this is that the development of local database schemas is typically driven by what information about real world objects and classes is needed for local applications and what information is available about these classes and objects.

Given a description of real world objects in terms of a virtual class $C_{virt}$, local databases $DB_1, \dots, DB_m$ typically employ only a partial mapping of real world data into local data structures, i.e., only information relevant to local applications is mapped into local classes $C_1, \dots, C_n$. More importantly, the $n$ mappings adopted by local databases differ in the underlying local data structures (schemata) and in how real world data is populated into these structures. Different mappings then result in schematic and semantic heterogeneities among the local classes $C_1, \dots, C_n$ that refer to the same virtual class $C_{virt}$.

While a local class $C_i$ typically maps only a portion of the information associated with $C_{virt}$, a conceptual (or global) class $C_{con}$ integrates all aspects modeled in semantically equivalent or similar local classes (Figure 1).
[Figure 1: Relationship between virtual, conceptual and local classes. A virtual class is connected by local mappings to the semantically equivalent local classes $C_1, \dots, C_n$, which schema integration combines into $C_{con}$.]

Determining conceptual classes as components of, e.g., a federated or multidatabase schema, is the main task in database integration. One main goal in database integration is that the specification of a conceptual class $C_{con}$ comes as near as possible to the specification of the associated virtual class $C_{virt}$ from which the local classes $C_1, \dots, C_n$ are derived. Beside these structural aspects of conceptual classes, the other aspect is how to integrate data from (now structurally equivalent) local classes into global classes. That is, data integration rules need to be specified. If the (possible) extensions of local classes are known to be disjoint, then the data integration rule basically consists of joining the extensions of these local classes. In case of possible object or data conflicts, that is, if for an object of the global class there are at least two local objects corresponding to that object but with different attribute values, data integration rules additionally describe conflict resolution. Such resolutions ensure that only one object is retrieved from local classes or that local attribute values from the same objects are combined appropriately into one global object.

2.2 Basic Data Quality Aspects

A basic property of real world objects is that objects as instances of virtual classes evolve over time. Objects are added and deleted, or properties of objects change. At local sites different organizational activities are performed to map such time-varying information into local classes, thus resulting in a type of heterogeneity among local databases we call operational heterogeneity. We consider operational heterogeneity as a non-uniformity in the frequency, processes, and techniques by which real world information is populated into local data structures. For example, while at one local database the stored data are updated manually on a weekly basis, at another database storing related information updates are performed automatically (e.g., sensor-based) on a monthly basis. In such cases, operational heterogeneity can lead to the fact that similar data referring to the same properties and attributes of real world objects have different quality and thus may have varying reliability. Depending on the data capturing and processing approaches used, the quality of data at a local database can even vary over time; for example, data can be outdated, as discussed in Section 1.1. More importantly, we have also shown that operational heterogeneity cannot be resolved on the schema integration level but requires a suitable approach to data integration.

These observations raise the question whether existing approaches to resolving semantic data conflicts are appropriate or rich enough to handle aspects such as outdated data or wrong data. In this paper we consider operational heterogeneity and data quality in particular as a type of semantic data conflict which requires a not necessarily unique resolution at run-time by means of (time-varying) data integration rules.

We first give a formal definition of the time-varying data quality aspects we consider most important in database integration and resolving semantic data conflicts. For this we make the following (simplified) assumptions.

There are two classes $C_1\langle A_1, \dots, A_n\rangle$ and $C_2\langle A_1, \dots, A_n\rangle$ from two local databases $DB_1$ and $DB_2$. $C_1$ and $C_2$ refer to the same virtual class $C_{virt}$, and schema integration has been performed for $C_1$ and $C_2$ into the conceptual class $C_{con}\langle A_1, \dots, A_n\rangle$. We assume that the two classes are represented in the global data model. The resolution of heterogeneous representations of the original local class structures is a topic of schema integration and is therefore outside the scope of this paper.

Using the predicate $\mathit{same}$ it is possible to determine whether an object $o_1$ from the extension of $C_1$, denoted by $Ext(C_1)$, refers to the same real world object $o \in Ext(C_{virt})$ as an object $o_2 \in Ext(C_2)$. Differences among key representations are assumed to be resolved during schema integration by appropriate methods as suggested in, e.g., [Pu91, CTK96, SS95].

As motivated in Section 1.1, the aspect of time plays an important role in operational heterogeneity. For this, we assume a discrete model of time isomorphic to the natural numbers. In this model time is interpreted as a set of equally spaced time points, each time point $t$ having a single successor. The present point of time, which advances as time goes by, is denoted by $t_{now}$. The extension of a class $C$ at time point $t$ is denoted by $Ext(C, t)$, and the value of an object $o$ for an attribute $A \in schema(C)$ at time point $t$ is denoted by $Val_C(o, A, t)$. If no time point is explicitly specified, we assume the time point $t_{now}$. We furthermore assume a function $time_C(o, A, t)$ that determines the time point $t' \le t$ at which the value $A{:}o$ of attribute $A$ of object $o \in Ext(C, t)$ was updated the last time before $t$.

The above definitions and assumptions now provide us a suitable framework to define the data quality aspects timeliness, completeness, and accuracy in a formal way.

Definition 2.1 (Timeliness) Given two classes $C_1$ and $C_2$ with $schema(C_1) = schema(C_2)$. Class $C_1$ is said to be more up-to-date than $C_2$ at time point $t \le t_{now}$ with respect to attribute $A \in schema(C_1)$, denoted by $C_1 >^{time}_{A,t} C_2$, iff

\[
\begin{aligned}
&\mathrm{count}\{\, o_1 \mid o_1 \in Ext(C_1,t),\ o_2 \in Ext(C_2,t) : \mathit{same}(o_1,o_2) \land time_C(o_1,A,t) > time_C(o_2,A,t) \,\} \\
>\ &\mathrm{count}\{\, o_2 \mid o_1 \in Ext(C_1,t),\ o_2 \in Ext(C_2,t) : \mathit{same}(o_1,o_2) \land time_C(o_2,A,t) > time_C(o_1,A,t) \,\}
\end{aligned}
\]
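As a rough illustration of the counting condition in Definition 2.1, the following Python sketch compares, for one attribute and one fixed time point, how often each class holds the more recent update. The dictionary representation, the integer time points, the sample data and the function names are assumptions made for this sketch; the predicate same is simplified to equality of a shared object key.

```python
from typing import Dict, Hashable, Tuple

# Hypothetical representation: for one attribute A and a fixed time point t,
# map each object key to time_C(o, A, t), the time point of the last update.
# Matching keys stand in for the predicate same(o1, o2).
LastUpdate = Dict[Hashable, int]


def timeliness_counts(c1: LastUpdate, c2: LastUpdate) -> Tuple[int, int]:
    """Return (n1, n2): the number of shared objects for which C1
    (respectively C2) holds the more recent update of the attribute."""
    shared = c1.keys() & c2.keys()
    n1 = sum(1 for k in shared if c1[k] > c2[k])
    n2 = sum(1 for k in shared if c2[k] > c1[k])
    return n1, n2


def more_up_to_date(c1: LastUpdate, c2: LastUpdate) -> bool:
    """True if C1 is more up-to-date than C2 under the counting condition."""
    n1, n2 = timeliness_counts(c1, c2)
    return n1 > n2


# Illustrative data only; "r4" has no counterpart in c1 and is ignored.
c1 = {"r1": 9, "r2": 7, "r3": 5}
c2 = {"r1": 8, "r2": 7, "r3": 6, "r4": 2}
```

With this data each class is more recent for exactly one shared object, so neither class is more up-to-date than the other under the counting condition.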

In other words, class $C_1$ is more up-to-date than $C_2$ at time point $t$ with respect to attribute $A$ if its extension $Ext(C_1,t)$ contains more recent updates on $A$ than $Ext(C_2,t)$ does. Note that for $t = t_{now}$ this property may flip since $t_{now}$ advances as time goes by. It should also be noted that there are alternative definitions for timeliness. Possible definitions depend on how much information about update strategies is available for local databases and classes at integration time. For example, focusing more on the distances between the time points at which the attribute values of the same objects were updated would give rise to the following condition:

\[
\begin{aligned}
&\mathrm{sum}\{\, time_C(o_1,A,t) \ominus time_C(o_2,A,t) \mid o_1 \in Ext(C_1,t),\ o_2 \in Ext(C_2,t) : \mathit{same}(o_1,o_2) \,\} \\
>\ &\mathrm{sum}\{\, time_C(o_2,A,t) \ominus time_C(o_1,A,t) \mid o_1 \in Ext(C_1,t),\ o_2 \in Ext(C_2,t) : \mathit{same}(o_1,o_2) \,\}
\end{aligned}
\]

The following example shows the difference between the two conditions.

Example 2.2 Assume that the following values for $time_{C_1}$ and $time_{C_2}$ have been determined at $t = t_{now} =$ '10-14-98' for the attribute $A$. The attribute values $A{:}C_1$ and $A{:}C_2$ shown in each row are assumed to be values of objects from $Ext(C_1,t)$ and $Ext(C_2,t)$, and both objects refer to the same object in $C_{virt}$.

[Table: one row per pair of matching objects, with columns $A{:}C_1$, $time_{C_1}(o_1,A,t)$, $A{:}C_2$, $time_{C_2}(o_2,A,t)$.]

Applying the first condition would yield that $C_1 =^{time}_{A,t} C_2$ holds, while for the second condition we have $C_1 >^{time}_{A,t} C_2$ ($12 > 2$).

Although at a local database modifications of real world objects may be mapped in a timely manner using a certain data capturing approach, this does not necessarily mean that this approach suitably propagates information about deleted or new objects of the virtual class. While timeliness essentially refers to properties of attributes, the data quality aspect completeness focuses on the extensions of the two classes $C_1$ and $C_2$ as a whole.

Definition 2.3 (Completeness) Class $C_1$ is said to be more complete than the class $C_2$ at time point $t \le t_{now}$, denoted by $C_1 >^{comp}_{t} C_2$, iff

\[
\begin{aligned}
&\mathrm{count}\{\, o_1 \mid o_1 \in Ext(C_1,t) \land \lnot\exists\, o' \in Ext(C_{virt},t) : \mathit{same}(o_1,o') \,\} \\
<\ &\mathrm{count}\{\, o_2 \mid o_2 \in Ext(C_2,t) \land \lnot\exists\, o' \in Ext(C_{virt},t) : \mathit{same}(o_2,o') \,\}
\end{aligned}
\]

In other words, although the extension of $C_2$ may contain more objects at time point $t$, this does not necessarily mean that these objects still exist in the corresponding virtual class. $C_2$ can contain numerous objects which already have been deleted in reality at a time point $t' < t$. The above definition thus nicely reflects the aspect of outdated objects. Note that we have not included the aspect that $C_1$ must also contain more information about any real world object from $C_{virt}$ than $C_2$ does. Including this aspect would imply that at a local database one is always interested in all objects that belong to a virtual class. This, however, is often not the case because the working scope of local classes and applications is typically restricted to a subset of such real world objects.

The third data quality aspect, which is orthogonal to timeliness and completeness, is the aspect of data accuracy, which focuses on how well properties or attributes of real world objects are mapped into local classes.
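Definition 2.3 (Completeness) can be illustrated with a small Python sketch. Since a virtual class is an idealized reference that is not directly available in practice, the set `virtual` below is purely hypothetical, as are the object keys and the function names; the predicate same is simplified to key equality.

```python
from typing import Hashable, Set


def outdated_count(ext: Set[Hashable], virtual: Set[Hashable]) -> int:
    """Number of objects in a local extension that have no counterpart
    in the extension of the (hypothetical) virtual class."""
    return len(ext - virtual)


def more_complete(ext1: Set[Hashable], ext2: Set[Hashable],
                  virtual: Set[Hashable]) -> bool:
    """C1 is more complete than C2 in the sense of Definition 2.3 if it
    holds fewer objects that no longer exist in the virtual class."""
    return outdated_count(ext1, virtual) < outdated_count(ext2, virtual)


# Illustrative data only
virtual = {"r1", "r2", "r4"}
ext1 = {"r1", "r2", "r4"}
ext2 = {"r1", "r2", "r3", "r5"}
```

Here `ext2` holds more objects than `ext1`, yet two of them have no counterpart in `virtual`, so `ext1` is judged more complete.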

Definition 2.4 (Data Accuracy) Given two classes $C_1$ and $C_2$ with $schema(C_1) = schema(C_2)$ and attribute $A \in schema(C_1)$. Class $C_1$ is said to be more accurate than $C_2$ with respect to $A$ at time point $t$, denoted by $C_1 >^{acc}_{A,t} C_2$, iff

\[
\begin{aligned}
&\mathrm{count}\{\, o_1 \mid o_1 \in Ext(C_1,t),\ o_2 \in Ext(C_2,t),\ o \in Ext(C_{virt},t) : \mathit{same}(o_1,o_2) \land \mathit{same}(o_1,o)\ \land \\
&\qquad |Val_{C_{virt}}(o,A,t) \ominus Val_{C_1}(o_1,A,t)| < |Val_{C_{virt}}(o,A,t) \ominus Val_{C_2}(o_2,A,t)| \,\} \\
>\ &\mathrm{count}\{\, o_2 \mid o_1 \in Ext(C_1,t),\ o_2 \in Ext(C_2,t),\ o \in Ext(C_{virt},t) : \mathit{same}(o_1,o_2) \land \mathit{same}(o_2,o)\ \land \\
&\qquad |Val_{C_{virt}}(o,A,t) \ominus Val_{C_1}(o_1,A,t)| > |Val_{C_{virt}}(o,A,t) \ominus Val_{C_2}(o_2,A,t)| \,\}
\end{aligned}
\]

In the above definition $\ominus$ denotes a generic minus operator which is applicable to a pair of strings, numbers or dates. In order to suitably incorporate the aspect of possible null values, one can define a value $A_{max}$ that is used if $Val_{C_i}(o_i,A,t)$ is null. It is even possible to give a definition for data accuracy that takes into account only the number of objects that have the value null for the attribute $A$.

Determining the timeliness and accuracy of data only makes sense for certain attributes. For example, object identifiers are typically time-invariant and thus cannot cause any data quality problems (unless a property of real world objects designates the object identifier). We will discuss the aspect of time-varying and time-invariant attributes in more detail in Section 3. The important point with the above definitions is that they describe orthogonal data quality aspects. That is, in case of a data conflict among two objects referring to the same real world object, it is possible to choose either the most accurate or the most up-to-date data about this object, depending on whether respective specifications exist for the two objects. Also in case of non-conflicting object classes and extensions (i.e., the extensions of semantically equivalent local classes are disjoint), a data quality aspect might give a reason to choose one extension over the other or, at least, to distinguish or group the results retrieved from these classes. Both scenarios, of course, require weakening the transparency property of integrated classes and objects at the global level.

In practice it is rather unlikely that there is one class among several semantically equivalent classes such that this class satisfies all three data quality aspects best. That is why these, possibly time-varying, aspects need to be modeled suitably and utilized for global query processing.

3 Modeling Data Quality Aspects

In the previous section we have shown that it is possible to formally define data quality aspects using virtual classes. In practice, however, virtual classes are not explicitly provided in a way that they can easily be used to evaluate the conditions given in Definitions 2.1, 2.3, and 2.4. In this section we discuss how statements about data quality can be extracted from local databases participating in a multidatabase system. We show how these statements can be modeled as metadata available for global query processing. The underlying concept for this is information profiling.

3.1 Information Profiling

Information profiling essentially refers to the activity of describing the information capturing and processing techniques used to populate real world data into a local data model and data structures, that is, classes. Information profiling is not peculiar to database integration, but is an integral task in database and application design. An information profile, e.g., for a class, describes not only the schema of that class (static properties) but also how real world data or artifacts are mapped as objects into this class (dynamic properties). This includes a description of (sets of) object methods and how these methods interact with the database and application environment. Most information for profiling can be obtained while structural and semantic conflicts are investigated at multidatabase design time. Like any other proposal for database integration and resolving semantic conflicts, resolving operational heterogeneity through information profiling requires good knowledge not only about particular local database schemas but also about the environments in which the local databases operate.

Given a local class $C$ with attributes $\langle A_1, \dots, A_n\rangle$ and a data quality aspect $Q \in \{timeliness, completeness, accuracy\}$. The basic idea for information profiling is to partition $C$ and its possible extensions $PExt(C)$ into a set $U_C$ of information units such that all (possible) data in a unit $u \in U_C$ show the same properties (see below) with respect to $Q$. A basic information unit $u$ for a class $C$ and its possible extensions $PExt(C)$ can be one of the following types. We assume that the object identifier is part of every unit.

(U1) an attribute $A \in \{A_i, \dots, A_k\}$, i.e., $u^A = A$

(U2) a subset of possible attribute values for an attribute $A \in \{A_i, \dots, A_k\}$, i.e., $u^A = \{A(o) \mid o \in PExt(C) \land P(o)\}$ with $P$ being a selection predicate.

The set of basic information units $u^A_1, u^A_2, \dots, u^A_k$ associated with an attribute $A \in schema(C)$ is denoted by $U^A_C$. Partitioning a class $C$ into basic information units can be optimized by building composed information units that consider more than one attribute, i.e., $u = \{A_{i_1}, \dots, A_{i_k}\}$ or $u = \{A_{i_1}, \dots, A_{i_k}(o) \mid o \in PExt(C) \land P(o)\}$, where $A_{i_j} \in \{A_i, \dots, A_k\}$, if all the constituting basic units have the same properties with respect to the data quality aspect $Q$. In order to discuss the basic idea of information profiling and also later the global query processing in the presence of information units (Section 4), we do not consider these optimization issues in the rest of the paper.
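The two kinds of basic information units, (U1) and (U2), can be pictured with a small Python sketch. The class names `AttributeUnit` and `PredicateUnit`, the dictionary-based objects, the sample values and the sample predicate are assumptions made for illustration and are not part of the framework itself.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical object representation: attribute name -> value, plus "oid".
Obj = Dict[str, Any]


@dataclass(frozen=True)
class AttributeUnit:
    """Type (U1): the unit covers a whole attribute."""
    attribute: str

    def covers(self, obj: Obj) -> bool:
        return self.attribute in obj


@dataclass(frozen=True)
class PredicateUnit:
    """Type (U2): the unit covers the values of one attribute for those
    objects that satisfy a selection predicate P."""
    attribute: str
    predicate: Callable[[Obj], bool]

    def covers(self, obj: Obj) -> bool:
        return self.attribute in obj and self.predicate(obj)


def values_in_unit(unit, objects: List[Obj]) -> Dict[Any, Any]:
    """Attribute values covered by the unit, keyed by object identifier
    (the identifier is assumed to be part of every unit)."""
    return {o["oid"]: o[unit.attribute] for o in objects if unit.covers(o)}


# Illustrative data only
objects = [
    {"oid": 1, "Region": "R1", "Mat1": 15},
    {"oid": 2, "Region": "R2", "Mat1": 7},
]
whole_attribute = AttributeUnit("Mat1")
region_r1_only = PredicateUnit("Mat1", lambda o: o["Region"] == "R1")
```

A unit of type (U2) such as `region_r1_only` selects only the Mat1 values of objects in region R1, which is the situation where different parts of one attribute are populated by different capturing techniques.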
With an information unit a profile is associated, describing how and when the data in this unit are captured, processed and maintained. Such descriptions, of course, depend on the type of data quality aspect $Q$ for which the class $C$ has been partitioned (Q-profile, Q-partitioning). From a practical point of view, we think that it is rather unlikely to have a formal specification for profiles. Depending on the type of data quality aspect, a profile should address the following issues:

Timeliness: Given an attribute $A \in schema(C)$. What is the (average) update frequency of that attribute? Do updates occur on an event-based basis? Are update frequencies time-dependent?

Completeness: How do data processing techniques ensure that real world objects relevant for that class are recorded? Are the employed processing techniques time-dependent?

Data Accuracy: What type of technique is used to record real world data? Are data entered manually or are they read, e.g., by sensors? Are the data extracted from other resources? How many data processing units (applications) are involved in reading real world data into local data structures? Are the employed techniques time-dependent?

We are fully aware of the fact that not having a formal specification for data quality profiles can be a critical point. It should, however, be noted that formal specifications often do not exist where structural or certain semantic relationships and heterogeneities among classes and attributes need to be discovered and resolved at the database schema level. We think that system descriptions such as data flow diagrams, workflow diagrams, or even information obtained through system analysis can often provide useful and sufficient data for information profiling.

Beside the information units $U^A_C$ that can be associated with an attribute $A \in schema(C)$ there are two special types of units. The first one we call default unit, denoted by $u^A_d$, and it corresponds to the partition of $A$ for which data is (or will be) recorded but no Q-profile can be given. The second type, denoted by $u^A_0$, is the description of the partition for which data in $PExt(C)$ exist, but for which data are never recorded because, e.g., they are not admissible due to some integrity constraints. Note that (1) the specification of $u^A_0$ can always be determined based on the specifications of the other units, and (2) the specification of $u^A_0$ is always of type (U2). Adding these two types of units to the units in $U^A_C$ for which profiles have been determined means that for an attribute $A$ and associated domain $D$ a complete partitioning can always be given, which, in the worst case (no profiles available), consists only of the default unit $u^A_d$.

3.2 Class Partitioning and Quality Assertions

Partitioning and information profiling is applied to all attributes of a class $C$. That is, for $C$ a set $U_C = U^{A_1}_C \cup \dots \cup U^{A_n}_C$ of information units is determined. Applying this technique to some sample classes and their possible extensions shows that time-invariant attributes typically are all partitioned in the same way (i.e., same specification of $P$ for basic units of type (U2)) while partitions for time-variant attributes (i.e., attributes that are updated frequently) differ. Figure 2 illustrates an example result of partitioning a class $C$ with time-invariant attributes $A_1, A_3$ and time-variant attributes $A_2, A_4$.
[Figure 2: Partitioning of a class C, showing the unit sets $U_C^{A_1}$, $U_C^{A_2}$, $U_C^{A_3}$, $U_C^{A_4}$]

Having at least two partitions $u_i^A$ and $u_j^A$ different from $u_0^A$ and $u_d^A$ for an attribute A indicates that even within possible extensions of C, the quality of the data values for A differs. This is quite natural since we cannot expect all objects and attributes in a class to have the same quality with respect to a data quality aspect Q. Because the partitions themselves do not say much about how the quality of the data in these units is related, attribute quality assertions are determined among the units in $U_C^A$ and associated profiles such that a complete order among the units in $U_C^A$ is obtained. Determining this order is done by pairwise comparing the Q-profiles associated with each unit $u_i^A \in U_C^A \setminus \{u_0^A, u_d^A\}$.

Definition 3.1 (Attribute Quality Assertions) Given two information units $u_i^A, u_j^A \in U_C^A \setminus \{u_0^A, u_d^A\}$ which have been determined based on a Q-partitioning of A. If the profiles associated with $u_i^A, u_j^A$ reveal that the data contained in $u_i^A$ has a better quality than the data contained in $u_j^A$, then the attribute quality assertion $u_i^A >_Q u_j^A$ is said to be valid.

The goal of pairwise comparing profiles associated with the $u_i^A$s is to determine a complete order among the units in $U_C^A$. Not having a complete order but only a partial order introduces some kind of vagueness in the assertions which then has to be dealt with in a special way. For the sake of simplicity and also due to space limitations, in the remainder of the paper we assume that always a complete order can be determined. Note that $u_d^A$ cannot be compared with any other unit from $U_C^A$ because no profile for $u_d^A$ is given, thus there are no assertions referring to $u_d^A$. Furthermore, there is no need to specify any assertion involving $u_0^A$ because corresponding data is never recorded for that unit in C.

Before we discuss comparisons of attribute quality assertions of semantically equivalent classes, we first focus on the aspect of time which we explicitly introduced in Section 2.2. Time plays an important role when determining quality assertions among units that have been obtained based on a timeliness-partitioning (see also Scenario 1 in Section 1.1). It is easy to see that if time plays a role, the total order among information units in $U_C^A$ is time-dependent, too. While a time-dependent order can be optional with respect to the data quality aspects completeness and accuracy, it is intrinsic to the aspect of timeliness.
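The step from pairwise attribute quality assertions to a complete order can be illustrated with a small sketch. It is not part of the framework itself: the function name `total_order`, the unit labels, and the encoding of an assertion $u_i^A >_Q u_j^A$ as a pair `(ui, uj)` are assumptions made for illustration only. The units passed in are assumed to exclude $u_0^A$ and $u_d^A$.

```python
from itertools import combinations

def total_order(units, assertions):
    """Return the units sorted best-first if the assertions form a complete order.

    `assertions` is a set of pairs (u, v), each read as "u >_Q v"
    (hypothetical encoding). Raises ValueError if some pair of units is not
    strictly comparable or if the assertions are not transitive.
    """
    better = set(assertions)
    # Every pair of distinct units must be asserted in exactly one direction.
    for u, v in combinations(units, 2):
        if ((u, v) in better) == ((v, u) in better):
            raise ValueError(f"units {u} and {v} are not strictly comparable")
    # In a complete order, a unit's rank follows from how many units it beats.
    ranked = sorted(
        units,
        key=lambda u: sum((u, v) in better for v in units),
        reverse=True,
    )
    # Check transitivity: each unit must beat every unit ranked after it.
    for i, u in enumerate(ranked):
        for v in ranked[i + 1:]:
            if (u, v) not in better:
                raise ValueError("assertions are not transitive")
    return ranked

# Illustrative units of one attribute and three valid assertions.
order = total_order(["u1", "u2", "u3"],
                    {("u2", "u1"), ("u1", "u3"), ("u2", "u3")})
```

A time-dependent order could be represented, under the same assumptions, as one such ranked list per time point or time interval.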
This means that for each possible time point t, we have to specify a total order among the units in $U_C^A$. This can be done in different ways. Assuming a finite domain for time, e.g., number of the week in a year or day of the week, it is possible to give a complete specification of time-dependent orders. Thus with each order a set of days or numbers of weeks can be associated. More complex temporal quantifiers for time-dependent orders, of course, can be useful and need to be investigated in future research. Here we assume that with each order a set of time intervals or time points is associated and that, given a partitioning $U_C^A$ and order among the information units in $U_C^A$, with each time point t an order can be associated.

3.3 Inter-Class Considerations

Assume that two semantically equivalent classes $C_1\langle A_1, \dots, A_n\rangle$ and $C_2\langle A_1, \dots, A_n\rangle$ have been Q-partitioned and that all total orders have been determined for the units in $U_{C_1}^{A_1}, \dots, U_{C_1}^{A_n}$ and $U_{C_2}^{A_1}, \dots, U_{C_2}^{A_n}$. The task now is to combine the partition sets $U_{C_1}^{A_i}$ and $U_{C_2}^{A_i}$ into a new partitioning $U_C^{A_i}$ such that C is the global class integrating $C_1$ and $C_2$. In building $U_C^A$ the task is not to define the structure of C but the data integration rules associated with C at the global level. The main idea for this is to resolve possible data conflicts among semantically equivalent objects in $C_1$ and $C_2$ by comparing and ordering the information units and profiles that may contain (attributes of) conflicting objects. Also in case no data conflicts will exist among possible extensions of $C_1$ and $C_2$, the information units and profiles are used to establish an order (with respect to a data quality aspect Q) among possible objects in the global class C.

Given two sets of information units $U_{C_1}^{A_i}$ and $U_{C_2}^{A_i}$, a partitioning $U_C^{A_i}$ is obtained by overlapping the units in $U_{C_1}^{A_i}$ with the units in $U_{C_2}^{A_i}$ as illustrated in Figure 3.

[Figure 3: Overlapping of local information units. The local units $u_{1,1}^A, u_{1,2}^A, u_{1,3}^A$ of $U_{C_1}^A$ are overlapped with the local units $u_{2,1}^A, u_{2,2}^A, u_{2,3}^A$ of $U_{C_2}^A$, yielding the global units $u_1^A, \dots, u_5^A$ of $U_C^A$.]

Note that the specification of the units in $U_C^A$ can easily be obtained from the specification of the units in $U_{C_1}^A$ and $U_{C_2}^A$. Furthermore, the partitioning $U_C^A$ is always complete because of the default units in $U_{C_1}^A$ and $U_{C_2}^A$. Given the specifications of the partitions in $U_C^A$, the next step is to compare the profiles associated with the (local) units that contribute to a unit $u_i^A \in U_C^A$. In Figure 3, for example, the units that contribute to $u_1^A \in U_C^A$ are $u_{1,1}^A \in U_{C_1}^A$ and $u_{2,1}^A \in U_{C_2}^A$, and for $u_2^A$ they are $u_{1,1}^A$ and $u_{2,2}^A$. Because the profiles associated with the local units all refer to the same data quality aspect Q, it is possible to determine a preference among two local units that contribute to a global unit in $U_C^A$. Determining preferences essentially corresponds to resolving data conflicts in case for a global object there is a corresponding object in each local unit. Again, the two types of local information units $u_{i,d}^A$ and $u_{i,0}^A$, $i \in \{1, 2\}$, need special consideration. We consider the different cases assuming that $u_{i,d}^A = u_{1,d}^A$ and $u_{i,0}^A = u_{1,0}^A$ and that either of these two units is compared with a unit $u_{2,j}^A \in U_{C_2}^A$:

1. $u_{1,d}^A$ and $u_{2,j}^A \in U_{C_2}^A \setminus \{u_{2,d}^A, u_{2,0}^A\}$: Because no profile is associated with $u_{1,d}^A$, no preference can be given unless it is known that, for example, $u_{2,j}^A$ has the highest quality among the local units in $U_{C_2}^A$ with respect to the data quality aspect Q.
2. $u_{1,d}^A$ and $u_{2,d}^A$: No preference can be given, meaning that it is possible at the global level to return two local objects (or attributes) that correspond to the same global object and there (perhaps) is a data conflict among the two objects (attributes) in $u_{1,d}^A$ and $u_{2,d}^A$.
3. $u_{1,0}^A$ and $u_{2,j}^A \notin \{u_{2,d}^A, u_{2,0}^A\}$: Because $u_{1,0}^A$ never contains data, $u_{2,j}^A$ is chosen.
4. $u_{1,0}^A$ and $u_{2,d}^A$: Same as for 3.
5. $u_{1,0}^A$ and $u_{2,0}^A$: There are never data in either of the two units. Thus there will never be a conflict that needs to be resolved based on the profiles associated with the two units.

Once a preference among the units building the global information units in $U_C^A$ has been determined, again a complete order among the information units in $U_C^A \setminus \{u_d^A, u_0^A\}$ is determined. Note that in case the two local classes $C_1$ and $C_2$ have been partitioned with respect to a data quality aspect Q and the total order among the units in $U_{C_1}^A$, respectively $U_{C_2}^A$, is time-dependent, the total order among the units in $U_C^A$ can be time-dependent, too. More importantly in this case, comparisons of the profiles of local units must occur for each time point t. It remains to be investigated to what extent the process of determining a (time-dependent) total order among the units in $U_C^A$ can be automated, given the attribute quality assertions for $U_{C_1}^A$ and $U_{C_2}^A$.

Before we conclude this section by describing how information units and profiles are recorded as metadata for global query processing, we first take a closer look at the data quality aspect completeness. As described in Section 2.2, completeness refers to complete extensions of classes rather than to possible data conflicts covered by the aspects timeliness and accuracy. Partitioning a class and its possible extensions with respect to the aspect completeness often reveals a very simple pattern. Figure 4 illustrates an optimized completeness-partitioning $U_{C_1}^A$ and $U_{C_2}^A$ for two semantically equivalent local classes $C_1$ and $C_2$.

[Figure 4: Partitioning based on the data quality aspect completeness, with local units $u_{1,1}^A, u_{1,2}^A$ and $u_{2,1}^A, u_{2,2}^A$]

Note that for all attributes in $\mathcal{A}$ the same partitioning is adopted. This is quite natural because completeness refers to objects as a whole and not to single attributes. For this particular data quality aspect, defining local attribute quality assertions as well as defining the information obtained through overlapping of local units can thus be done easily.

3.4 Data Quality Information as Metadata

As mentioned earlier, information about partitioning and overlapping semantically equivalent classes is used by the global query processor to deal with resolving data conflicts and ordering objects and attributes of different quality.
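The five preference cases of Section 3.3 can be summarized in a minimal sketch. The names `Kind`, `prefer` and `profile_better` are hypothetical and chosen only for illustration. The sketch assumes each local unit is classified as profiled, default ($u_d^A$) or never recorded ($u_0^A$), and that the outcome of comparing two profiles is supplied by the caller. The exception mentioned in case 1 (a unit known to have the highest quality) is deliberately left out.

```python
from enum import Enum

class Kind(Enum):
    PROFILED = "profiled"  # ordinary unit with an associated Q-profile
    DEFAULT = "default"    # default unit u_d: data recorded, no profile
    EMPTY = "empty"        # unit u_0: data is never recorded

def prefer(kind1, kind2, profile_better=None):
    """Return 1 or 2 for the preferred local unit, or None if no preference.

    `profile_better` (assumed input) is the result of comparing the two
    Q-profiles and is only consulted when both units are profiled.
    """
    if kind1 is Kind.EMPTY and kind2 is Kind.EMPTY:
        return None  # case 5: no data on either side, so no conflict
    if kind1 is Kind.EMPTY:
        return 2     # cases 3 and 4: the other unit is chosen
    if kind2 is Kind.EMPTY:
        return 1     # symmetric to cases 3 and 4
    if kind1 is Kind.DEFAULT or kind2 is Kind.DEFAULT:
        return None  # cases 1 and 2: no profile, hence no preference
    return profile_better  # both profiled: decided by the profiles

choice = prefer(Kind.EMPTY, Kind.PROFILED)
```

Returning `None` corresponds to the situation in which two local objects for the same global object may both be returned at the global level.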
The metadata repository utilized by the query processor already contains information about the structure of global (conceptual) classes and the (structural) mappings of local classes and attributes into global classes. Beside this structural information about metadata at the global level, the repository also has to record information about how to actually integrate data contained in local class extensions into global class extensions. Such data integration rules are typically specified by using local class structures formulated in the global data model. Assume two semantically equivalent classes $C_1$ and $C_2$ where structural and (certain) semantic conflicts have been resolved, and $C_1, C_2$ are specified in the global data model. A simple data integration rule for the global class C then might look like $Ext(C) := Ext(C_1) \cup Ext(C_2)$ (using the relational operator union). Depending on possible data conflicts among objects in $Ext(C_1)$ and $Ext(C_2)$, a data integration rule can be more complex because conflict resolution functions are encoded in the rule.

What do data integration rules look like in the presence of various information about data quality of local classes, information units etc.? While for traditional approaches to query processing in multidatabase systems the global query processing engine simply utilizes the specified data integration rule, for the scenario presented in this paper it is the task of the query engine to form a data integration rule. For this, the following information is recorded in the metadata repository at the integration layer.

- For each local class $C_i$ and each data quality aspect Q, the specification of the Q-partitions, the associated Q-profiles, the description of the attribute quality assertions, and their (time-dependent) total order.
- For each global class C and each attribute $A \in schema(C)$, the specification of the partitioning $U_C^A$ and the description of the total order among the units in $U_C^A$.

A more precise definition of the information recorded in the metadata repository depends on the underlying global data model, the type of expressions allowed in building information units of type (U2), and the type of temporal quantifiers that can be associated with total orders among information units. For the sake of simplicity, in the following section we assume the abstract description of the metadata as above.

4 Query Processing

In this section we give a short outline of the aspect of global query processing in a multidatabase system in the presence of data quality information. Because of the various features and possible optimization techniques that can be applied in this scenario, we only give the basic idea of global query processing. For example, we do not consider complex queries (nested queries, subqueries). We also assume that the reader is familiar with the basic tasks and methods in query processing in multidatabase systems (see, e.g., [MY95] or [EDN97]).

4.1 Global Query Language

A main feature of a global query language for a MDBS should be to provide global users, designers and applications a transparent access to local data. That is, the user should be unaware of possible structural and semantic conflicts among local meta-data, and s/he should also not be responsible for resolving such conflicts (some approaches, e.g., [SRL93, MR95] adopt exactly the opposite strategy).

In the presence of different data quality aspects, which are encoded in the metadata, it is quite obvious that global queries cannot simply be decomposed solely based on prespecified unique data integration rules. But what does this mean when there are local data of different quality and the data need to be integrated at run-time? First, as discussed extensively, data quality aspects are orthogonal. That is, there does not exist a unified or unique view on the integrated data. Assuming that the user is aware of the fact that local data have different quality, query language constructs are necessary that support the specification of a data quality goal for global queries and thus integrated data. Such an approach then supports the following two aspects:

1. In case there are data conflicts among two or more semantically equivalent local objects, the object (or attributes) satisfying the specified data quality goal best is chosen. Note that in this case the metadata about local information units and preferences among these units (with respect to a global class) are utilized.
2. In case there are no conflicts among objects of local classes but the quality of the objects (with respect to a data quality aspect Q) is different, the integrated objects need to be grouped or tagged to indicate this aspect at the global query interface.
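The two aspects above can be sketched for a single attribute of a single global object. This is only an illustration under stated assumptions: the function name `integrate`, the representation of candidates as `(unit, value)` pairs, and the best-first list `order` standing in for the recorded total order are all hypothetical and not prescribed by the framework.

```python
def integrate(candidates, order):
    """Pick the attribute value that satisfies the data quality goal best.

    candidates: list of (unit, value) pairs, one per local unit that holds a
                value for this attribute of this global object (assumed shape).
    order:      unit names sorted best-first with respect to the goal.
    Returns (value, unit); the unit doubles as a quality tag for the result.
    """
    rank = {unit: position for position, unit in enumerate(order)}
    # Aspect 1: among conflicting values, the best-ranked unit wins.
    unit, value = min(candidates, key=lambda c: rank[c[0]])
    # Aspect 2: the originating unit is returned so results can be tagged.
    return value, unit

# Two local units disagree on a value; "u1" is ranked above "u2".
result = integrate([("u2", "Zurich"), ("u1", "Basel")], ["u1", "u2"])
```

Even without a conflict (a single candidate), the returned unit lets the query interface group or tag objects by quality.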

In order to support querying objects and attributes with possibly different quality, we suggest an extension of a global query language, say OQL, by a data quality goal clause:

    select <list of attributes>
    from <list C1, ..., Cn of global classes>
    where <selection condition>
    with goal <data quality goal>;

A simple data quality goal can either be most accurate, most up-to-date, or most complete. It is also possible to allow a list of data quality goals, meaning that among two conflicting objects (attributes) which have the same quality with respect to the first goal, those attributes are chosen which satisfy the second goal best.

4.2 Querying a Global Class

Assume a global class $C\langle A_1, \dots, A_4\rangle$ composed of local classes $C_1, \dots, C_m$. In the most simple case, a query of the type

    select C.* from C with goal most-accurate;

could retrieve local data following the data integration rule $Ext(C) = Ext(C_1) \cup \dots \cup Ext(C_m)$ ($C_i$ being specified in the global data model). Now assume that information profiling has been performed for the local classes with respect to the data quality aspect accuracy and the following pattern has been obtained for the global class C:

[Figure 5: Global class C with its information units $U_C^{A_1}$, $U_C^{A_2}$, $U_C^{A_3}$ (containing the unit $u_1^{A_3}$), $U_C^{A_4}$]

For the above query, the query processor then utilizes the specification of each information unit in $U_C^A$ in order to compose global objects in $Ext(C)$ from local objects and attributes in $Ext(C_i)$. For example, for attribute values in the unit $u_1^{A_3}$, the query processor chooses those values from corresponding local units that satisfy the aspect accuracy best (according to the attribute quality assertions specified for the unit $u_1^{A_3}$). Roughly speaking, for each object identifier known to be in $Ext(C)$, corresponding attribute values are selected from possibly different local information units. Note that the object identifier is contained in each local information unit (see Section 3.1).

Note also that for another data quality aspect the partitioning $U_C^A$ might look completely different. In particular, if the data quality goal timeliness has been specified in a global query, the point in time the global query is issued determines which of the different time-dependent orders among the partitions in $U_C^A$ is chosen.

In any case, if for an attribute A there exist multiple units within a global class and for these units attribute quality assertions have been specified, it is advisable to indicate the variances among the quality of the attribute values when displaying them. This, for example, can be done by using certain coloring schemas or object grouping structures. This feature is in particular necessary if a global user wants to compare the different results of the same queries with different data quality goals.

4.3 Joining Global Objects

In the presence of data quality information one would like to avoid simply combining objects that have different qualities (that is also why we suggest certain grouping structures in the previous section). For example, it should be avoided, or alternatively indicated in the query result, that an "up-to-date" object is joined with an outdated object (based on the relational join operator). Because joining objects occurs in a similar way as joining tuples in relational databases, based on primary key and foreign key relationships, ensuring this property is quite trivial under the following assumption: primary and foreign key attributes (or object references) always have the same quality for all data quality aspects. Obviously, this assumption can easily be verified during modeling data quality aspects as shown in Section 3. Suppose the query

    select C1.*, C2.*
    from C1, C2
    where C1.primary-key = C2.foreign-key
    with goal most accurate;

where the join condition is based on a primary-foreign key relationship. For this query, the query processor first builds the extension of the global class $C_2$ according to the specified data quality goal. Then, for each referenced object, the respective (subset of the) extension of $C_1$ is built according to the same data quality goal, but restricted to the objects needed for $Ext(C_1)$.

5 Conclusions and Future Work

In this paper we have presented a novel framework for handling diverse data quality aspects in database integration. We have shown that there is a strong need for supporting data quality in data integration, ensuring that applications built on top of a multidatabase system can rely on "high-quality data".

The important contribution to database integration approaches in this paper is that we have shown (1) how to formalize the basic data quality aspects timeliness, completeness and accuracy, and (2) how to model diverse data quality properties of local and global class extensions using information units, information profiling, and attribute quality assertions. We are convinced that the concept of information profiling plays an important role in designing a multidatabase system, in particular for designing data quality dependent data integration strategies and rules. In our future work we aim to develop specifications of information profiles which are more formal, ideally allowing profiles to be compared and attribute quality assertions to be determined on an automated basis, supported by tools of, e.g., a multidatabase system design environment. Another aspect which needs more investigation is the use of a temporal logic framework for specifying time-dependent orders among attribute quality assertions.

Making the notion of data quality explicit at the multidatabase query language level opens up a completely new area of research in multidatabase query processing:

- How to exploit data quality features recorded for local and global classes and their extensions at the global query level? What are useful query language features?

- How to perform multidatabase query optimizations in the presence of data having different quality?
- How to represent data of different quality at the global query interface level?

Having a global query processing engine that utilizes metadata about information profiles, attribute quality assertions and preferences among information units provides a very flexible means to cope with dynamic database environments. That is, if profiles for local databases and information units change, these changes need only to be described at the metadata level. More precisely, no new data integration rules need to be specified at that level but they are dynamically determined by the query processing engine.

Further weakening the transparency of the existence of local databases at the global level can provide application designers and users a sophisticated means to perform data cleansing. Because differences among the quality of local data are suitably represented at the global level, query language constructs might be useful to investigate the source of poor quality data. Thus, the framework presented in this paper furthermore provides a suitable basis for applying data cleansing techniques to local databases. The development of such environments and underlying strategies in the context of multidatabase systems and in particular data warehouses is subject to our future research.

References

[BA97] J. Bischoff, T. Alexander: Data Warehouses: Practical Advice from the Experts. Prentice-Hall, 1997.
[BE96] O. Bukhres, A. Elmagarmid: Object Oriented Multidatabase Systems. Prentice-Hall, 1996.
[BLN86] C. Batini, M. Lenzerini, S. B. Navathe: A Comparative Analysis of Methodologies for Database Schema Integration. ACM Computing Surveys 18:4 (December 1986), 323-364.
[CTK96] A. L. P. Chan, P. Tsai, J.-L. Koh: Identifying Object Isomerism in Multidatabase Systems. Distributed and Parallel Database Systems 4:2 (April 1996), 143-168.
[DeM89] L. G. DeMichiel: Resolving Database Incompatibility: An Approach to Performing Relational Operations over Mismatched Domains. IEEE Transactions on Knowledge and Data Engineering 1(4), December 1989, 485-493.
[EDN97] C. Evrendilek, A. Dogac, S. Nural, F. Ozcan: Multidatabase Query Optimization. Distributed and Parallel Databases 5 (1997), 77-114.
[GMS94] C. H. Goh, S. E. Madnick, M. Siegel: Context Interchange: Overcoming the Challenges of Large-Scale Interoperable Database Systems in a Dynamic Environment. In Proceedings of the Third International Conference on Information and Knowledge Management (CIKM'94), ACM Press, 1994, 337-346.
[GSC96] M. Garcia-Solaco, F. Saltor, M. Castellanos: Semantic Heterogeneity in Multidatabases. Invited chapter in: O. Bukhres and A. Elmagarmid (eds.), Object Oriented Multidatabase Systems. Prentice-Hall, 129-202, 1996.
[Huf96] D. Hufford: Data Warehouse Quality. Data Management Review, Feb/Mar 1996.

[JJQ98] M. Jarke, M. A. Jeusfeld, C. Quix, P. Vassiliadis: Architecture and Quality in Data Warehouses. In Advanced Information Systems Engineering - CAiSE'98. LNCS Vol. 1413, Springer, 1998, 93-113.
[JV97] M. Jarke, Y. Vassiliou: Data Warehouse Quality Design: A Review of the DWQ Project. Invited Paper, Proc. of the 2nd Conference on Information Quality. Massachusetts Institute of Technology, Cambridge, 1997.
[Kim95] W. Kim: Modern Database Systems: The Object Model, Interoperability, and Beyond, 649-663. ACM Press, New York, 1995.
[Kim96] R. Kimball: The Data Warehouse Toolkit. John Wiley, 1996.
[Kim96b] R. Kimball: Dealing with Dirty Data. DBMS Magazine 9:10, September 1996, Miller Freeman, Inc., 1996.
[KCG95] W. Kim, I. Choi, S. Gala, M. Scheevel: On Resolving Schematic Heterogeneity in Multidatabase Systems. In [Kim95], 521-550.
[KS91] W. Kim, J. Seo: Classifying Schematic and Data Heterogeneity in Multidatabase Systems. IEEE Computer 24:12 (December 1991), 12-18.
[KS96] V. Kashyap, A. Sheth: Semantic and Schematic Similarities Between Database Objects: A Context-Based Approach. The VLDB Journal 5(4), Dec 1996, 276-304.
[MR95] P. Missier, M. Rusinkiewicz: Extending a Multidatabase Manipulation Language to Resolve Schema and Data Conflicts. In R. Meersman, L. Mark (eds.), Database Applications Semantics, Proceedings of the Sixth IFIP TC-2 Working Conference on Data Semantics (DS-6), 93-115, Chapman & Hall, London, 1995.
[MY95] W. Meng, C. Yu: Query Processing in Heterogeneous Environment. In [Kim95], 551-572.
[Pu91] C. Pu: Key Equivalence in Heterogeneous Databases. In Y. Kambayashi, M. Rusinkiewicz and A. Sheth (eds.), Proc. of the 1st Int. Workshop on Interoperability in Multidatabase Systems (IMS'91), Kyoto, Japan, IEEE Computer Society Press, 314-316, 1991.
[Red96] T. C. Redman: Data Quality for the Information Age. Artech House, Boston, 1996.
[RW95] M. P. Reddy, R. Y. Wang: Estimating Data Accuracy in a Federated Database Environment. In S. Bhalla (ed.), Information Systems and Data Management, Proc. of the 6th Conf., CISMOD'95, LNCS 1006, Springer-Verlag, 115-134, 1995.
[She91] A. Sheth: Semantic Issues in Multidatabase Systems. SIGMOD Record 20(4), Special Issue, December 1991.
[SG93] F. Saltor, M. Garcia-Solaco: Diversity with Cooperation in Database Schemata: Semantic Relativism. Proceedings of the 14th International Conference on Information Systems (ICIS'93, Orlando 1993), 247-254.
[SK93] A. Sheth, V. Kashyap: So Far (Schematically) Yet So Near (Semantically). In D. Hsiao, E. Neuhold, R. Sacks-Davis (eds.), Interoperable Database Systems (DS-5), North-Holland, Amsterdam, The Netherlands, 1993.
[SL90] A. Sheth, J. Larson: Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. ACM Computing Surveys 22:3 (1990), 183-236.

[SRL93] L. Suardi, M. Rusinkiewicz, W. Litwin: Execution of Extended Multidatabase SQL. In A. Elmagarmid, E. Neuhold (eds.), Proc. of the 9th International Conference on Data Engineering - 1993, 641-650, IEEE Computer Society Press, 1993.
[SPD92] S. Spaccapietra, C. Parent, Y. Dupont: Model Independent Assertions for Integration of Heterogeneous Schemas. VLDB Journal 1:1, 81-126, 1992.
[SS95] I. Schmitt, G. Saake: Managing Object Identity in Federated Database Systems. In M. Papazoglou (ed.), OOER'95: Object-Oriented and Entity-Relationship Modeling, Proc. of the 14th Int. Conf., Gold Coast, Australia, LNCS 1021, Springer-Verlag, pages 400-411, December 1995.
[Wan98] R. Y. Wang: A Product Perspective on Total Data Quality Management. Communications of the ACM 41:2, 58-65, 1998.
[WS96] R. Y. Wang, D. M. Strong: Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems 12:4, 5-34, 1996.
[WSF96] R. Y. Wang, V. C. Storey, C. P. Firth: A Framework for Analysis of Data Quality Research. IEEE Transactions on Knowledge and Data Engineering 7:4 (August 1995), 623-640, 1996.