Schemaless Representation of Semistructured Data and Schema Construction

Dong-Yal Seo (1), Dong-Ha Lee (1), Kang-Sik Moon (1), Jisook Chang (1), Jeon-Young Lee (1), and Chang-Yong Han (2)

(1) Dept. of Computer Science and Engineering, Pohang University of Science and Technology, Pohang, Kyungbuk, 790-784, KOREA
(2) Data Warehouse Advanced Technology, Oracle Systems Korea, Ltd., Youngdeungpo-Gu, Seoul, 150-010, KOREA

Abstract. In the networked information world, we must deal with semistructured data, which carries only weak schema information. To manage such semistructured data efficiently, this paper introduces a data model for semistructured data and operations for schema construction. In contrast to earlier studies, which depend fully on schemaless manipulation, we transform semistructured data into structured data through a schema construction methodology. For schema construction, we define operations for building is-a and is-part-of relationships, collecting data objects to build a primitive class, and merging two data instances or classes.

1 Introduction

1.1 Motivation

In the early stages of data processing, such as inventory and account management systems, a large centralized database system was used as an information server. During the database design (or real-world modeling) phase, the DBA (Database Administrator) defines a well-structured schema. Data acquisition and creation take place after schema definition. The role of end-users is to manipulate data via the predefined schema. The schema is firm, and end-users are not responsible for schema management.

As data sources and computing environments become distributed, an abundance of information is provided by individual users and updated very quickly. Each end-user creates and updates his or her own information, like a DBA. The WWW (World-Wide Web) is a typical example. Every user creates his or her own documents and submits them to the WWW. How can we manage this mass of HTML documents and other web resources? We must always navigate through hyperlinks or search by keywords, because there is no absolute schema in the stored information. If we could define a schema on the set of web resources, we would not only have a better structure for the gathered information but could also use a structured query language. A schema provides a well-structured view of
stored data. It expresses data location, relationships among data objects, data categories, summarized concepts, and so on.

Data sets for which no absolute schema is fixed in advance, and whose structure may be irregular or incomplete, are known as semistructured data. Even if data is created as a well-structured, i.e., schema-based, data set, it becomes semistructured when it is taken out of its original structure. For example, a single record from a relational table is semistructured if we have no idea about the overall structure of the table and its relationships with other tables.

1.2 Problems and Approaches

Figure 1 shows the approaches to information processing according to the structural nature of the data sources. The right side of the figure presents conventional data processing. The information itself has a rigid structure and is represented with a well-structured model, such as the relational or object-oriented model. Users manipulate data through a schema, which mainly provides a structural abstraction of the stored data.

[Figure 1. Left side: Semistructured Information Sources, Semistructured Representation Models, Semistructured Data Processing (Lightweight Information Systems). Right side: Structured Information Sources, Structured Representation Models, Structured Data Processing (Conventional DBMSs). Schema Construction connects the semistructured side to the structured side.]

Fig. 1. Information Structures and Processings

The left side shows the processing of information with a poor schema, i.e., semistructured information [12] (or even unstructured information [3]). Semistructured information is represented with a lightweight model, which permits schemaless creation of data instances, and is stored in lightweight storage. Stored information is manipulated with a lightweight query language, which can be used with incomplete schema information.

Studies on lightweight approaches have largely overlooked the importance of the database schema. Although schemaless manipulation is convenient for users who want to retrieve data without deep knowledge of the underlying structures, a schema is indispensable for embedded SQL, API calls, and stored procedures. A schema gives firmness and a conceptualized view.

In semistructured data processing, end-users represent data instances without complete knowledge of the predefined schema (or even without any schema
information). The data creation phase can therefore be performed before the schema definition phase. For semistructured data processing, we established the strategy "schemaless creation, schema-based manipulation", which involves the following goals:

1. Provide a representation model for schemaless data instances. The model should be expressive enough to describe semistructured data instances from heterogeneous data sources.
2. Devise a mechanism for schema construction that can be applied to a schemaless pool of data instances. After applying the schema construction procedures, we have a rigid schema and can manipulate the stored data with SQL.

The remainder of this paper is organized as follows. Section 2 presents related work and the corresponding contributions. Sections 3 and 4 address, respectively, a data model for schemaless creation of data objects and the operations for schema construction, which are the main contribution of this work. Finally, conclusions and directions for future work are discussed in Section 5.

2 Related Work

General problems of networked information processing are discussed in [4]. Three conceptual layers of networked information systems are introduced: information interface, information dispersion, and information gathering. Our work deals with problems in the information gathering layer. More specifically, we are interested in the data mapping problem, and we introduce schema construction operators for it. The importance of and motivation for schemaless data representation and manipulation are discussed in [1][3][11][12].

Schemaless data instances are usually described by their attributes and corresponding values. Attribute-value pairs were used for data representation in OEM (Object Exchange Model) [11], Labeled-Tree [3], and the Data Forest Model [1]. Although not developed as a semistructured data model, O2's complex value model [8] shows a good way of representing semistructured data with attribute-value pairs. All the earlier models for semistructured data are similar to each other, and their expressive powers are almost the same.

The TSIMMIS (The Stanford-IBM Manager of Multiple Information Sources) project [7] introduces OEM and other related work [12] on the integration of heterogeneous information sources. OEM provides sets and nested substructures, as well as atomic values such as integers and strings, by using attribute-value pairs. (In [11], the term "level-value pair" was used instead of "attribute-value pair".) The labeled-tree model, which has the same expressive power as OEM, represents semistructured data as trees with labeled edges. The Data Forest Model supports a list type, which cannot be described in OEM or the labeled-tree model.

Earlier studies focus mainly on lightweight approaches and overlook the importance of the database schema. The schema construction approach distinguishes our study from conventional methodologies in semistructured data processing. The problems of structuring [13] and typing [9] semistructured data have been introduced recently.

Studies on subject-oriented programming [10] introduce a methodology for class composition that deals with behaviors, whereas our approach deals with data.

3 Schemaless Creation of Data Objects

Objects are usually distinguished by their types, where a type describes the applicable operations and properties. Properties define either attributes of the object itself or relationships among objects. We consider data objects mainly from the viewpoint of properties.

3.1 Model Definition

Our data model describes schemaless data objects with a series of attribute-value pairs, called an AVPL (Attribute-Value Pairs List). An attribute-value pair consists of two components, an attribute and a value. When we denote the set of AVPLs, the set of attributes, and the set of values as $D$, $A$, and $V$, respectively, an AVPL is defined as follows:

1. A single attribute-value pair is an AVPL:
   $(a \in A) \wedge (v \in V) \longrightarrow \{(a, v)\} \in D$
2. The union of two AVPLs is an AVPL:
   $D_1, D_2 \in D \longrightarrow D_1 \cup D_2 \in D$

An attribute is an ordered collection of one or more variables, where each variable is itself an attribute. When $S$ denotes the set of strings, an attribute is defined as follows:

1. Singleton attribute:
   $s \in S \longrightarrow s \in A$
2. Composite attribute (attribute with multiple variables):
   $a_1, a_2, \ldots, a_n \in A \longrightarrow (a_1, a_2, \ldots, a_n) \in A$
   where $(a_1, a_2, \ldots, a_n)$ is an ordered sequence of attribute variables.

A value is an instance, or a set of instances, that can be assigned to the corresponding attribute and its variables. The assignment of values to attributes is defined as follows:

1. Singleton attribute and value:
   $a \leftarrow v$
   where $a \in A$ and $v \in V$.
2. Composite attribute and value:
   $(a_1, a_2, \ldots, a_n) \leftarrow (v_1, v_2, \ldots, v_n)$
   where $(a_1, a_2, \ldots, a_n) \in A$, $(v_1, v_2, \ldots, v_n) \in V$, and each $v_i$ is assigned to $a_i$ $(1 \le i \le n)$.

The domain of attributes includes primitive strings, references to values, sets (or lists) of values, and AVPL objects. Therefore, an AVPL object (or a set of AVPL objects) can itself be a component of other AVPL objects, which allows nested structure.

When we denote the set (or type system) of values as $V$, the types of values are defined as follows:

1. Primitive character strings:
   $s \in S \longrightarrow s \in V$
   where $S$ denotes the set of strings.
2. References to any type of value:
   $v \in V \longrightarrow \&v \in V$
   where $\&v$ is the reference, i.e., the identifier, of $v$.
3. Set of any types of values (unordered):
   $v_1, v_2, \ldots, v_n \in V \longrightarrow \{v_1, v_2, \ldots, v_n\} \in V$
4. List of any types of values (ordered):
   $v_1, v_2, \ldots, v_n \in V \longrightarrow \langle v_1, v_2, \ldots, v_n \rangle \in V$
5. AVPL objects:
   $d \in D \longrightarrow d \in V$
   where $D$ denotes the set of AVPL objects.
6. Null (empty value):
   Any attribute can be null valued, i.e., no value is assigned.
7. Identifier:
   A self-contained label, a string that begins with `#'. An identifier is optional and is used by other AVPL objects as a reference.

3.2 Expressive Power

All atomic values in AVPL are strings. Other atomic types, such as integer, float, and boolean, are not provided. Those types can easily be derived from a data source, so users do not have to deal with atomic types.

We can represent table-structured values as well as references (identifiers), sets, lists, and nested structures. A table structure is expressed with a composite attribute (as the table header) and a corresponding list of composite values (as the record tuples). Advanced semantics of the object-oriented model, such as the is-a relationship, cannot be represented. Figure 2 is an example of an AVPL object in tabular representation.
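The definitions of Section 3.1 can be sketched in Python as a small validity check. This is only an illustration under stated assumptions, not the authors' implementation: an AVPL object is modeled as a `dict` from attribute to value, a composite attribute as a tuple of attributes, a list value as a Python `list`, a set value as a `frozenset`, null as `None`, and a reference as a hypothetical `Ref` wrapper around an identifier string. The names `Ref`, `is_attribute`, `is_value`, `fits`, and `is_avpl` are invented for this sketch. The sample object follows the content of Figure 2; nesting the Education and Contact tables as inner AVPL objects is one possible encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical wrapper for a reference (&v), held as an identifier string."""
    target: str


def is_attribute(a):
    """Singleton attribute: a string. Composite attribute: a non-empty tuple of attributes."""
    if isinstance(a, str):
        return True
    return isinstance(a, tuple) and len(a) > 0 and all(is_attribute(x) for x in a)


def is_value(v):
    """Check v against the value kinds listed in Section 3.1."""
    if v is None or isinstance(v, (str, Ref)):   # null, string or identifier, reference
        return True
    if isinstance(v, (frozenset, list, tuple)):  # set, list, composite value
        return all(is_value(x) for x in v)
    if isinstance(v, dict):                      # nested AVPL object
        return is_avpl(v)
    return False


def fits(a, v):
    """A composite attribute takes a tuple of equal length, or a list of such tuples (table rows)."""
    if isinstance(a, str) or v is None:
        return True
    rows = v if isinstance(v, list) else [v]
    return all(isinstance(r, tuple) and len(r) == len(a) for r in rows)


def is_avpl(d):
    """An AVPL object is a collection of well-formed attribute-value pairs."""
    return isinstance(d, dict) and all(
        is_attribute(a) and is_value(v) and fits(a, v) for a, v in d.items()
    )


# The example object of Figure 2, with nested tables encoded as inner AVPL objects.
seo = {
    "Name": "Dong-Yal Seo",
    "Research": "Database",
    "Education": {
        ("Degree", "School", "Year"): [
            ("BS", "POSTECH", "1992"),
            ("MS", "POSTECH", "1994"),
        ]
    },
    "Contact": {
        ("Telephone", "Fax", "e-mail"): (
            "0562-279-5660",
            "0562-279-5699",
            "dyseo@white.postech.ac.kr",
        )
    },
}
```

All atomic values are kept as strings, as Section 3.2 requires, so the year "1992" is not stored as an integer. The arity check in `fits` reflects the rule that each $v_i$ is assigned to $a_i$.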
Name       Dong-Yal Seo
Research   Database
Education  Degree  School   Year
           BS      POSTECH  1992
           MS      POSTECH  1994
Contact    Telephone      Fax            e-mail
           0562-279-5660  0562-279-5699  dyseo@white.postech.ac.kr

Fig. 2. Tabular Representation of an Example AVPL Object

4 Schema Construction

4.1 Schema and Objects

A schema, in an OODB, defines classes and their relationships, and the relationships among classes imply the relationships among objects. A schema defines both the structural and the behavioral part of a class. In this work, we focus mainly on the structural part.

To construct a schema from a pool of schemaless objects, we must build 1) a class from a set of instances, and 2) various relationships among those classes. We first recall the possible relationships among classes, in order to define operations for constructing classes and their relationships. There are two relationships between objects, is-a and is-part-of. The former is the basis of the inheritance hierarchy and the latter is the basis of the composition hierarchy.

In an object-oriented model, a type is a collection of objects with the same structural and behavioral information. A type is implemented as a class, and the class defines a collection relationship. Not only can objects be created as instances of a class; a class can also be created as a collection of instances.

One object can be described in more than one way. In this case, the descriptions must be merged into a single description, because an object is unique in the object-oriented world. Two descriptions can therefore have an equivalence relationship.

4.2 Schema Construction Operations

1. Creation of a class by Instance Collection
   For the AVPL objects $S_1, \ldots, S_m$ and their attribute-sets $A_1, \ldots, A_m$, a class of AVPL objects $U$ is constructed with ObjectCollect($S_1, \ldots, S_m$) if
   $U = \{S_1, S_2, \ldots, S_m\}$
   where the attribute-set $U_A$ of $U$ is $U_A = A_1 \cup A_2 \cup \cdots \cup A_m$. For all attributes $a \in U_A$, if there is any AVPL object $S_i$ $(1 \le i \le m)$ where $a$ is not in $A_i$, the value of $a$ is null.
2. Merging
   (a) Object merging
       For two AVPL objects $S$ and $T$, a new AVPL object $U$ is constructed with ObjectMerge($S$, $T$) if
       $U = \{ w \mid (w \in S \cup T) \wedge \exists \hat{a}\, (\hat{a} \in S \wedge \hat{a} \in T) \}$
       where the attribute-set of $U$ is the same as that of $S \cup T$, $\hat{a}$ is a common (shared) attribute-value pair of $S$ and $T$, and $w$ is an attribute-value pair.
   (b) Class merging
       For two classes of AVPL objects $S$ and $T$, a new class of AVPL objects $U$ is constructed with ClassMerge($S$, $T$) if each $u \in U$ is constructed with ObjectMerge($s$, $t$) and $U$ is constructed with ObjectCollect($u_1, u_2, \ldots, u_m$), where $s \in S$, $t \in T$, and $u_i \in U$ for $1 \le i \le m$.
3. Composition (IS-PART-OF Relationship)
   (a) Object composition
       For two AVPL objects $S$ and $T$, a new AVPL object $U$ is constructed with ObjectCompose($S$, $T$) if
       $U = (S - t) \cup \hat{T}$
       where $t \in S$ and $t \in T$. $\hat{T}$ is $T$ itself or a reference to $T$. The attribute-set of $U$ is the same as that of $S$.
   (b) Class composition
       For two classes of AVPL objects $S$ and $T$, a new class of AVPL objects $U$ is constructed with ClassCompose($S$, $T$) if each $u \in U$ is constructed with ObjectCompose($s$, $t$) and $U$ is constructed with ObjectCollect($u_1, u_2, \ldots, u_m$), where $s \in S$, $t \in T$, and $u_i \in U$ for $1 \le i \le m$.
4. Inclusion (IS-A Relationship)
   For two sets of AVPL objects $U$ and $V$, a new relationship, $U$ is a subset of $V$, can be constructed with ClassInclude($U$, $V$) if
   $U \subseteq V$
   where the attribute-sets $U_A$ and $V_A$ of $U$ and $V$, respectively, satisfy $V_A \subseteq U_A$.
5. Additional operations, such as destruction, splitting, and exclusion, can be defined trivially as the inverses of the operations defined above.

Figure 3 illustrates the operations ObjectMerge(), ObjectCompose(), and ObjectCollect(). Other operations such as ClassMerge() and ClassCompose() can be implemented by using ObjectMerge() or ObjectCompose(), respectively, together with ObjectCollect(). ObjectCompose() in Figure 3 denotes composition by reference value.
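The object-level operations can be illustrated with a short Python sketch that replays the objects of Figure 3. It is a simplified reading of the definitions, not the authors' implementation, and it makes several assumptions: AVPL objects are plain `dict`s with string values, a class is a list of such dicts, merging rejects objects that disagree on a shared attribute, and composition replaces every value of `s` that also occurs as a value in `t` (the situation shown in Figure 3, where the Advisor value of o3 equals the Name value of o4). The function names, the `by_reference` parameter, and the identifier `"#o4"` are hypothetical.

```python
def object_collect(*objects):
    """Build a class: each member gets the union attribute set, missing values become None."""
    attrs = []
    for obj in objects:
        for a in obj:
            if a not in attrs:
                attrs.append(a)
    return [{a: obj.get(a) for a in attrs} for obj in objects]


def object_merge(s, t):
    """Merge two descriptions of one object; they must share an attribute-value pair."""
    if not any(a in t and s[a] == t[a] for a in s):
        raise ValueError("no common attribute-value pair")
    # Assumption of this sketch: conflicting values for one attribute are rejected.
    if any(a in t and s[a] != t[a] for a in s):
        raise ValueError("conflicting values for a shared attribute")
    return {**s, **t}


def object_compose(s, t, by_reference=None):
    """Replace values of s that occur in t by t itself, or by a reference to t."""
    t_values = [v for v in t.values() if v is not None]
    hits = [a for a, v in s.items() if v is not None and v in t_values]
    if not hits:
        raise ValueError("s and t share no value")
    part = t if by_reference is None else by_reference
    return {a: (part if a in hits else v) for a, v in s.items()}


def can_include(u, v):
    """Attribute condition of ClassInclude(U, V): V's attribute set is contained in U's."""
    def attribute_set(cls):
        return set().union(*(obj.keys() for obj in cls))
    return attribute_set(v) <= attribute_set(u)


# Objects of Figure 3 (truncated strings are kept exactly as printed there).
o1 = {"Name": "Dong-Yal Seo", "Advisor": "J.Y. Lee", "Research": "Database"}
o2 = {"Name": "Dong-Yal Seo", "Telephone": "279-5660", "e-mail": "dyseo@white...."}
o4 = {"Name": "J.Y. Lee", "Position": "Associate Prof.", "Lab....": "IIS"}
o6 = {"Name": "Chang-Yong Han", "Age": "28", "Address": "Pohang", "Home City": "Sungnam"}

o3 = object_merge(o1, o2)                        # a) object merging
o5 = object_compose(o3, o4, by_reference="#o4")  # b) composition by reference
o7 = object_collect(o1, o6)                      # c) collection into a class
```

Class-level merging and composition would apply `object_merge` or `object_compose` pairwise and pass the results to `object_collect`, as the text describes. `can_include` checks only the attribute-set condition; deciding which pairs of classes should actually be related remains a modeling decision.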
a) Object Merging

o1
Name          Advisor   Research
Dong-Yal Seo  J.Y. Lee  Database

o2
Name          Telephone  e-mail
Dong-Yal Seo  279-5660   dyseo@white....

o3 = Object_Merge(o1, o2)
Name          Advisor   Research  Telephone  e-mail
Dong-Yal Seo  J.Y. Lee  Database  279-5660   dyseo@white....

b) Object Composition

o4
Name      Position         Lab....
J.Y. Lee  Associate Prof.  IIS

o5 = Object_Compose(o3, o4)
Name          Advisor            Research  Telephone  e-mail
Dong-Yal Seo  (reference to o4)  Database  279-5660   dyseo@white....

c) Object Collection to Build a Class

o6
Name            Age  Address  Home City
Chang-Yong Han  28   Pohang   Sungnam

o7 = Object_Collect(o1, o6)
Name            Advisor   Research  Age     Address  Home City
Dong-Yal Seo    J.Y. Lee  Database  (null)  (null)   (null)
Chang-Yong Han  (null)    (null)    28      Pohang   Sungnam

Fig. 3. Example of Schema Construction Operations

4.3 Schema Consistency

When the user runs a schema construction procedure using the operations above, schema evolution takes place in the pre-existing schema. In the database world, it is very important to keep the schema consistent. We describe several effects of the schema construction operations on the existing schema hierarchy and map the consequences of these effects onto the taxonomy of schema-modification operations listed in [2]. In fact, the schema construction operations may heavily affect the static structure of classes.

For example, a merge operation can be regarded as an attribute-adding operation in the schema evolution taxonomy if the class to be merged has relationships with other classes. Thus, if the invariant properties of the inheritance hierarchy cannot be preserved, the operation that breaks the schema consistency rules should be rejected (see [6] for more about schema invariants).

We chose the schema evolution taxonomy of the ORION data model, based on the comparisons in [6]. Table 1 shows the taxonomy of schema modifications in an object-oriented database and the corresponding schema construction operations. This means that we can map the consistency problems raised by our schema construction operations onto schema modification problems.

Because we did not consider the behavioral part of objects, the reader will not find any taxonomy for methods. Nor do we address the categories of default-value attributes or shared attributes defined in [2], since these features are meaningful only in the classical object-oriented model, where class definition always precedes object instantiation.

Table 1. Schema Construction Operations and Evolutions

Schema Construction  Corresponding Schema Evolution
Merge                Add attributes
Split                Delete attributes and build a new class
Compose              Modify the domain's attributes
Decompose            Modify composite attributes into noncomposite attributes
Include              Make a class S the superclass of class C
Exclude              Remove a class S from the list of superclasses of class C
Collect              Create a new class

5 Conclusion and Future Work

We introduced a new model of database processing in which objects are created before schema definition. We defined a type system for semistructured data instances, and operations for constructing a structural schema from a set of schemaless data instances.

In our data model, a schemaless data instance is created as a description that contains a list of user-defined attributes and their corresponding values. For schema construction, we defined operations for building IS-A and IS-PART-OF relationships, collecting objects to build a class, and merging two objects or classes to make a larger one. The operations can be applied at both the object level and the class level.

Our approach is suitable for applications that collect and manage semistructured data instances, which are not created as instances of a predefined schema. A database system for collected HTML documents is a good application of our work. HTML documents have significantly less structure than the examples in this paper, and it is more difficult to extract the attribute-value pairs needed to construct the schema.
References

1. Abiteboul, S., Cluet, S., Milo, T.: Correspondence and Translation for Heterogeneous Data. Proceedings of the '97 ICDT, Delphi, Greece (1997) 352–363
2. Banerjee, J., Kim, W., Kim, H., Korth, H.: Semantics and Implementation of Schema Evolution in Object-Oriented Databases. Proceedings of the '87 ACM SIGMOD, San Francisco, CA (1987) 311–322
3. Buneman, P., Davidson, S., Hillerbrand, G., Suciu, D.: A Query Language and Optimization Techniques for Unstructured Data. Proceedings of the '96 ACM SIGMOD, Montreal, Canada (1996) 505–516
4. Bowman, C., Danzig, P., Manber, U., Schwartz, M.: Scalable Internet Resource Discovery: Research Problems and Approaches. Communications of the ACM 37(8) (1994) 98–107
5. Bowman, C., Danzig, P., Hardy, D., Manber, U., Schwartz, M.: The Harvest Information Discovery and Access System. Proceedings of the Second International World Wide Web Conference, Chicago, Illinois (1994) 763–771
6. Tsichritzis, D., ed.: Object Management. Centre Universitaire d'Informatique, University of Geneva (1990)
7. Chawathe, S., Garcia-Molina, H., Hammer, J., Ireland, K.: The TSIMMIS Project: Integration of Heterogeneous Information Sources. Proceedings of IPSJ Conference, Tokyo, Japan (1994)
8. Bancilhon, F., Delobel, C., Kanellakis, P., eds.: Building an Object-Oriented Database System: The Story of O2. Morgan Kaufmann, San Mateo, CA (1992)
9. Nestorov, S., Abiteboul, S., Motwani, R.: Inferring Structure in Semistructured Data. Proceedings of the Workshop for the Management of Semistructured Data (in conjunction with '97 ACM PODS/SIGMOD), Tucson, Arizona (1997) 42–48
10. Ossher, H., Kaplan, M., Harrison, W., Katz, A., Kruskal, V.: Subject-Oriented Composition Rules. Proceedings of the OOPSLA '95, Austin, Texas (1995) 235–250
11. Papakonstantinou, Y., Garcia-Molina, H., Widom, J.: Object Exchange Across Heterogeneous Information Sources. Proceedings of the 11th IEEE International Conference on Data Engineering, Taipei, Taiwan (1995) 251–260
12. Quass, D., Rajaraman, A., Ullman, J., Widom, J.: Querying Semistructured Heterogeneous Information. Proceedings of 4th International Conference on Deductive and Object-Oriented Databases, Singapore (1995) 319–344
13. Seo, D., Lee, D., Lee, K., Lee, J.: Discovery of Schema Information from a Forest of Selectively Labeled Ordered Trees. Proceedings of the Workshop for the Management of Semistructured Data (in conjunction with '97 ACM PODS/SIGMOD), Tucson, Arizona (1997) 54–59