Xml Tree Structure and Methods of Interdependence


Reasoning About Data in XML Data Integration

Tadeusz Pankowski (1,2)
(1) Institute of Control and Information Engineering, Poznań University of Technology, Poland
(2) Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Poznań, Poland
tadeusz.pankowski@put.poznan.pl

Abstract

In this paper, we propose solutions to some problems that arise when data from different sources is to be integrated under a given target schema. We address the following problems: inferring missing data based on constraints imposed by the target schema, generating mappings from a source schema to a target schema based on key constraints and value dependencies, and merging data based on subsumptions between XML data controlled by an ontology and semantics defined by means of description logic.

1 Introduction

In data integration [2, 11] we identify the following issues concerning reasoning about data: (1) inferring data values which are not given explicitly in sources but can be deduced from constraints enforced by the target schema; (2) finding an executable mapping from a source schema into a target schema, so that an instance of the target schema can be computed from a given set of source instances; (3) merging heterogeneous source data in such a way that the result is subsumed by all merged components, i.e. the result is at least as specific as any component and is free of overlapping data.

In the process of data transformation some missing or incomplete data may be inferred. We achieve that by representing missing data by terms reflecting constraints imposed by the schema. In some cases such terms may be resolved and replaced by the actual data [15].

In the paper we propose a method for generating mappings between schemas based on key constraints and value dependencies defined by means of an XML schema. In the first step an automapping over a schema, i.e. a mapping from the schema onto itself, is generated. The automapping represents the schema. A composition of automappings over two schemas gives a mapping between these schemas. We propose a language, called XDMap, for mapping specification based on source-to-target dependencies and Skolem functions.

The data taken from different sources may not only have different structures but may also use different names, concepts, precision, etc. In order to handle them we have to use a domain ontology. However, semantic relationships provided by the ontology must be generalized to XML tree structures in order to reason about subsumptions or equivalences between XML data. We have done it using the semantics of description logic.

Section 2 illustrates the problem of inferring missing data in data integration. In Section 3 an approach to creating executable schema mappings is proposed. We show how key constraints defined in XML Schema may be used to generate automappings and how mappings can be derived from automappings. In Section 4 we discuss subsumptions on XML data trees and their use for merging data. Section 5 concludes the paper.

2 Using constraints for inferring data in data integration

We will show how missing data may be inferred in data integration using constraints on the target schema. Suppose there are three schemas S_1, S_2, and S_3 (Fig. 1), and that only S_2 and S_3 are associated with data, while S_1 is a mediated (or target) schema that does not store any data. The labels have the following meaning: author (A), name (N) and university (U) of the author; paper (P), title (T), year (Y) of publication and the conference (C) where the paper has been presented. Elements labeled with R and K are used to join authors with their papers. I_2 and I_3 are instances of S_2 and S_3, respectively.

Figure 1: Schemas S_1, S_2, S_3, and schema instances I_2 and I_3 (S_1 does not have any stored instance). Tree diagrams are not reproduced; the instance data recoverable from the figure are:
I_2: paper t1 with authors (a1, u1) and (a2, u2); paper t2 with author (a1, u1).
I_3: author a1 with references R = i1, i2; author a3 with reference R = i3; papers (K, T, Y, C) = (i1, t1, 05, C1), (i2, t2, 03, C2), (i3, t3, 04, C1).

In such a scenario we meet the problem of data integration (data exchange), i.e. computing target instances from source instances [3, 8, 12, 17, 20]. It is commonly agreed that mappings are needed to perform these functions effectively, where a mapping specifies a relationship between a set of source schemas and a target schema. In particular, an instance of S_1 in Fig. 1 can be obtained by the transformations M_21(I_2) or M_31(I_3), or by merging M_21(I_2) ∪ M_31(I_3) = (M_21 ∪ M_31)(I_2, I_3), where M_ij denotes a mapping from S_i into S_j.

We can use two kinds of constraints to define mappings, namely:
1. Value dependencies (on the target) to declare that a value of a path depends on a tuple of values of other paths;
2. Key constraints (on a source) to declare that a subtree is uniquely identified by a tuple of values of key paths.

Value dependencies can be used to infer missing data [3, 15, 20]. Suppose we want to transform the instance I_2 to the target schema S_1, i.e. an instance I_11 = M_21(I_2) must be produced (Fig. 2(a)). The original instance provides no data about the publication year. We know, however, that the publication year (Y) uniquely depends on the title (T), which is denoted by the value dependency constraint Y = y(T), where y is the name of a function mapping titles into publication years. Hence, we assign the term y(t) as the text value of Y, where t is the title. This convention forces some elements of type Y to have the same values (Fig. 2(a)). Such value dependencies can be defined within a schema declared by means of an (extended) XML Schema (Fig. 3).

A term, like y(t), may be resolved using other mappings. Suppose we want to merge the instance in Fig. 2(a) and I_3. In this process terms denoting years will be replaced with actual values (Fig. 2(b)). Note that in this way we are able to infer the publication year of the paper written by a2. This information is given explicitly neither in I_2 nor in I_3.

Figure 2: Instances of schema S_1 produced by mappings using value dependency constraints. Tree diagrams are not reproduced; the recoverable content is:
(a) I_11 = M_21(I_2): author a1 (U: u1) with papers (t1, y(t1)) and (t2, y(t2)); author a2 (U: u2) with paper (t1, y(t1)).
(b) I_13 = M_21(I_2) ∪ M_31(I_3) = (M_21 ∪ M_31)(I_2, I_3): author a1 (U: u1) with papers (t1, 05) and (t2, 03); author a2 (U: u2) with paper (t1, 05); author a3 (U: u(a3)) with paper (t3, 04).

Key constraints, declared by <xs:key> elements within the XML Schema (Fig. 3), specify how many instances (nodes) of an element type must be in the computed target instance. For example, the element type /1/A in S_1 is uniquely identified by the key path N. So, there are as many nodes of type /1/A as there are different values of /1/A/N. In S_2, however, elements of type A are identified by N, but only in a context determined by the element type /2/P, which is identified by T. Thus, to identify /2/P/A we need a pair of values determined by the paths /2/P/T and /2/P/A/N.

<xs:schema xmlns:xs="...">
  <xs:element name="1">
    <xs:complexType><xs:sequence>
      <xs:element ref="A"/></xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="A">
    <xs:complexType><xs:sequence>
      <xs:element name="N" type="xs:string"/>
      <xs:element name="U" type="xs:string"/>
      <xs:element ref="P"/></xs:sequence>
    </xs:complexType>
    <xs:key name="AKey"><xs:selector xpath="."/>
      <xs:field xpath="N"/>
    </xs:key>
    <xs:valdep>
      <xs:target name="U"/><xs:function name="u"/>
      <xs:source xpath="N"/>
    </xs:valdep>
  </xs:element>
  <xs:element name="P">
    <xs:complexType><xs:sequence>
      <xs:element name="T" type="xs:string"/>
      <xs:element name="Y" type="xs:string"/>
    </xs:sequence>
    </xs:complexType>
    <xs:key name="PKey"><xs:selector xpath="."/>
      <xs:field xpath="T"/>
    </xs:key>
    <xs:valdep>
      <xs:target name="Y"/><xs:function name="y"/>
      <xs:source xpath="T"/>
    </xs:valdep>
  </xs:element>
</xs:schema>

Figure 3: XML Schema of S_1, extended with the <xs:valdep> declaration

3 XML schema mappings

3.1 Basic ideas of mappings

We will show how, from the declaration in Fig. 3, the automapping M_11 over S_1 can be generated (Fig. 4). The foreach clause defines variables. Lines (1) and (2) are obvious. Line (3) includes the value dependencies specified in the schema. Let $y = f($x_1) and $z = f($x_2) be two value dependencies, let Ω be a set of bindings for $x_1, let Ω' be a set of bindings for $z and $x_2, and suppose there is no binding for $y, neither in Ω nor in Ω' ($x_1 denotes a vector of variables). The value of $y is assigned according to the rules:
1. For a binding ω ∈ Ω, the term f(a), where a = ω($x_1), is assigned to $y.
2. If there is a binding ω' ∈ Ω' such that ω'($x_2) = a, then the value ω'($z) is assigned to $y (we say that the term f(a) has been resolved).
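The two assignment rules can be illustrated with a small Python sketch. It is not part of XDMap: the `Term` class and the `assign` helper are hypothetical names introduced here, and the lookup table stands in for the bindings Ω' (here the title/year pairs of I_3 from Fig. 1).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """Unresolved term f(a), e.g. y(t1); hypothetical helper, not XDMap syntax."""
    func: str
    args: tuple

    def __str__(self):
        return f"{self.func}({', '.join(self.args)})"


def assign(func, args, resolved):
    """Rule 1: build the term f(a). Rule 2: replace it when a value for f(a) is known."""
    term = Term(func, tuple(args))
    return resolved.get(term, term)


# Assumed stand-in for the bindings of I_3: y(title) -> year.
years_from_i3 = {
    Term("y", ("t1",)): "05",
    Term("y", ("t2",)): "03",
    Term("y", ("t3",)): "04",
}

unresolved = assign("y", ["t1"], {})            # only I_2 is available
resolved = assign("y", ["t1"], years_from_i3)   # merged with I_3
print(unresolved)  # y(t1)
print(resolved)    # 05
```

A term with no matching binding, such as u(a3), simply stays a term, which mirrors the value u(a3) shown in Fig. 2(b).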
M_11 = (G_11, Φ_11, C_11, E_11) =
(1)  foreach $y_1 in /1, $y_A in $y_1/A, $y_N in $y_A/N, $y_U in $y_A/U,
             $y_P in $y_A/P, $y_T in $y_P/T, $y_Y in $y_P/Y
(2)  where true
(3)  when $y_U = u($y_N), $y_Y = y($y_T)
     exists
(4)    F_{/1}() in F_{()}()/1
(5)    F_{/1/A}($y_N) in F_{/1}()/A
(6)    F_{/1/A/N}($y_N) in F_{/1/A}($y_N)/N with $y_N
(7)    F_{/1/A/U}($y_N, $y_U) in F_{/1/A}($y_N)/U with $y_U
(8)    F_{/1/A/P}($y_N, $y_T) in F_{/1/A}($y_N)/P
(9)    F_{/1/A/P/T}($y_N, $y_T) in F_{/1/A/P}($y_N, $y_T)/T with $y_T
(10)   F_{/1/A/P/Y}($y_N, $y_T, $y_Y) in F_{/1/A/P}($y_N, $y_T)/Y with $y_Y

Figure 4: Automapping M_11 over S_1

Line (4) creates two new nodes, the root r and the node n of the outermost element of type /1, as results of the Skolem functions F_{()}() and F_{/1}(), respectively. The node n is a child of type 1 of r. Line (5) creates a new node for any distinct value of $y_N; each such node has the type /1/A and is a child of type A of the node created by F_{/1}() in (4). In line (6), for any distinct value of $y_N a new node n of type /1/A/N is created. Each such node is a child of type N of the node created by the invocation of F_{/1/A}($y_N) in (5) for the same value of $y_N. Because n is a leaf, it obtains the text value equal to the current value of $y_N. The remaining lines are analogous.

3.2 Capturing key constraints by automappings

In the specification of automappings, Skolem functions and their arguments play a crucial role. We assume that:
- for any path P in the schema there is exactly one Skolem function F_P(...),
- the arguments of a Skolem function F_P(...) are determined by the key paths defined for the element of type P in the schema.

In S_1 there is exactly one root and one outermost element, so the corresponding Skolem functions have empty lists of arguments. The element of type /1/A has the key path N. Each of its subelements inherits this key path and additionally has its local key paths. Local key paths for non-leaf elements are defined in the schema. The local key path for a leaf element is, by default, this leaf element itself. Thus, in S_1 we have the following key paths: N for /1/A and for /1/A/N; (N, T) for /1/A/P and for /1/A/P/T; and (N, T, Y) for /1/A/P/Y. Values of these key paths are bound to variables and are used as arguments of Skolem functions.

In the definition of S_3 (Fig. 5), the schema specifies the key and keyref relationships between the K child element of the P element (the referenced key) and the R child element of the A element (the foreign key). Additionally, the value dependency K = k(N, T) says that the path N must start at the element A referencing P via its foreign key defined in AKeyref.

<xs:element name="A">
  <xs:complexType>...</xs:complexType>...
  <xs:keyref name="AKeyref" refer="PKey">
    <xs:selector xpath="."/>
    <xs:field xpath="R"/>
  </xs:keyref>
</xs:element>
<xs:element name="P">
  <xs:complexType>...</xs:complexType>
  <xs:key name="PKey">
    <xs:selector xpath="."/>
    <xs:field xpath="K"/>
  </xs:key>
  <xs:valdep>
    <xs:target name="K"/><xs:function name="k"/>
    <xs:source xpath="T"/>
    <xs:source xpath="N" ref="AKeyref"/>
  </xs:valdep>...
</xs:element>

Figure 5: Fragment of XML Schema for S_3

Key references are captured as follows:
- in the exists clause any occurrence of a variable $x_f ranging over values of a foreign key is replaced with a variable $x_k ranging over values of the corresponding referenced key;
- the equality $x_f = $x_k is inserted into the where clause.

Using these rules, we obtain the following specification of the automapping over S_3:

M_33 =
  foreach $z_D3 in /D3, $z_A in $z_D3/A, $z_N in $z_A/N, $z_R in $z_A/R,
          $z_P in $z_D3/P, $z_K in $z_P/K, $z_T in $z_P/T, $z_Y in $z_P/Y, $z_C in $z_P/C
  where $z_R = $z_K
  when $z_K = k($z_N, $z_T), $z_Y = y($z_T), $z_C = c($z_T)
  exists
    F_{/D3}() in F_{()}()/D3
    F_{/D3/A}($z_N) in F_{/D3}()/A
    F_{/D3/A/N}($z_N) in F_{/D3/A}($z_N)/N with $z_N
    F_{/D3/A/R}($z_N, $z_K) in F_{/D3/A}($z_N)/R with $z_K
    F_{/D3/P}($z_K) in F_{/D3}()/P
    F_{/D3/P/K}($z_K) in F_{/D3/P}($z_K)/K with $z_K
    F_{/D3/P/T}($z_K, $z_T) in F_{/D3/P}($z_K)/T with $z_T
    F_{/D3/P/Y}($z_K, $z_Y) in F_{/D3/P}($z_K)/Y with $z_Y
    F_{/D3/P/C}($z_K, $z_C) in F_{/D3/P}($z_K)/C with $z_C

3.3 Syntax and semantics for mappings

The foreach/where/when part of a mapping M determines a partially ordered set (Ω, ≤) of bindings of the variables (x, y), where x and y denote vectors of variables. For example, in the mapping M_21 (Fig. 6), for the two bindings over I_2, ω_1 = ($x_T ↦ t_1, $x_N ↦ a_1, $x_U ↦ u_1, $y_Y ↦ y(t_1)) and ω_2 = ($x_T ↦ t_1, $x_N ↦ a_2, $x_U ↦ u_2, $y_Y ↦ y(t_1)), we have ω_1 < ω_2, because the tuple of leaf nodes providing values for ω_1 precedes the tuple of leaf nodes providing values for ω_2. Bindings from Ω are used in the exists part E to produce the resulting target instance. The ordering imposed on Ω by a source instance should be preserved in the target instance.

Note that if the foreach/where clause is defined over S_2, while the when/exists clause concerns S_1, then we deal with a mapping M_21 from S_2 into S_1. Then, after an appropriate replacement of variables, we obtain:

M_21 =
  foreach $x_2 in /2, $x_P in $x_2/P, $x_T in $x_P/T, $x_A in $x_P/A,
          $x_N in $x_A/N, $x_U in $x_A/U
  where true
  when C_11($y_N, $y_U, $y_T, $y_Y) [$y_N ↦ $x_N, $y_U ↦ $x_U, $y_T ↦ $x_T]
  exists E_11($y_N, $y_U, $y_T, $y_Y) [$y_N ↦ $x_N, $y_U ↦ $x_U, $y_T ↦ $x_T]

Figure 6: Mapping M_21 from S_2 into S_1

In M_21 there is no replacement for $y_Y, thus its value must be set in some other way, e.g. as a null value [3]. We set it as the term y(t), where t is the current value of $x_T (see Fig. 2(a)). It is a form of Skolemization. Thus, a mapping specification in XDMap conforms to the general form of source-to-target generating dependencies [1, 9, 12, 13]:

\[
\forall \mathbf{x}\,\bigl(G(\mathbf{x}) \wedge \Phi(\mathbf{x}) \Rightarrow \exists \mathbf{y}\,(C(\mathbf{x}, \mathbf{y}) \wedge E(\mathbf{x}, \mathbf{y}))\bigr).
\]

Definition 1. An executable schema mapping in XDMap (or mapping for short) between a source schema S and a target schema T is a sequence M ::= (M_1, ..., M_k) of mapping constraints between S and T, where each M_i has the form:

  foreach G(x) where Φ(x) when C(x, y) exists F_{P/l}(x, y) in F_P(x', y')/l [with $x]

and
- G is a list of variable definitions over a source schema: $x in Q or $x' in $x/Q;
- Φ is a conjunction of atomic conditions: $x = $x' or $x ≠ $x';
- C is a list of target constraints $x = f(x') or $y = f(x'), where $x ∈ x and $y ∈ y;
- F_P(x, y) is a Skolem term, where P is a rooted path in the target schema; (x', y') ⊆ (x, y) and $x ∈ (x, y).

Definition 2. Let M = (G, Φ, C, E)(x, y) be a mapping, and let (Ω, ≤) be the partially ordered set of bindings of the variables (x, y) determined by (G, Φ, C). A target instance I of a target schema T is then obtained as follows:
1. F_{()}() = r is the root of I.
2. F_P(x, y)(ω) = n is a node of type P.
3. If F_{P/l}(x, y)(ω) = n and F_P(x', y')(ω) = n', and (x', y') ⊆ (x, y), then n is a child of type l of the node n'.
4. Let F_{P/l}(x, y)(ω_1) = n_1 and F_{P/l}(x, y)(ω_2) = n_2, where ω_1 ≤ ω_2 and (x', y')(ω_1) = (x', y')(ω_2). Then n_1 ≤ n_2 in the document order in the set of children of type l of the node F_P(x', y')(ω_1).
5. If F_{P/l}(x, y)(ω) = n is a leaf, then the text value of n is equal to ω($x).

4 Subsumptions on XML data trees

Until now we have assumed that source documents are ontologically homogeneous. In real applications [16], however, we need domain ontologies to make use of relationships between the concepts used for data modeling. Relationships between concepts need to be generalized to cope with XML data trees. Then XML data taken from different sources can be merged into a document that is the greatest lower bound of the set of data being merged, i.e. is subsumed by the data.

To discuss the problem more precisely, we will use a simple tree language L to express paths and tree patterns (at schema level) as well as values and trees (at instance level):

  T ::= P | P/(T, ..., T)        (tree patterns)
  P ::= l | l/P                  (paths)
  t ::= v | P:v | P/(t, ..., t)  (trees)
  v ::= s | (v, ..., v)          (values)

where l is a node label and s is a string value. Note that a tree pattern is a set of paths with a common prefix.
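The remark that a tree pattern is a set of paths with a common prefix can be made concrete with a short Python sketch. The encoding is an assumption made only for illustration: a path is a plain string such as "author/fname", and a tree pattern P/(T, ..., T) is a pair (prefix, list of subpatterns). The function name `paths` is hypothetical.

```python
def paths(pattern):
    """Expand a tree pattern into the list of root-to-leaf paths it denotes.

    A pattern is either a path string (e.g. "title") or a pair
    (prefix, [subpattern, ...]) standing for prefix/(subpattern, ...).
    """
    if isinstance(pattern, str):
        return [pattern]
    prefix, children = pattern
    return [f"{prefix}/{p}" for child in children for p in paths(child)]


author = ("author", ["fname", "lname"])          # author/(fname, lname)
paper = ("paper", ["title", author, "journal"])  # paper/(title, author/(...), journal)

print(paths(author))
# ['author/fname', 'author/lname']
print(paths(paper))
# ['paper/title', 'paper/author/fname', 'paper/author/lname', 'paper/journal']
```

Every path produced for a pattern starts with the pattern's prefix, which is the property used later when subsumption between tree patterns is reduced to subsumption between paths.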

To define the semantics of L, we will use the approach used in description logic [4]. Let Δ be a non-empty set of individuals, and let child be a transitively closed binary relation over Δ. An interpretation of L is a function .^I defined as follows:

\[
\begin{aligned}
c^I &\subseteq \Delta, \qquad l^I \subseteq \Delta,\\
(v_1, \ldots, v_n)^I &= v_1^I \cap \cdots \cap v_n^I,\\
(l/P)^I &= (l^I \bowtie child \bowtie P^I).2,\\
&\quad \text{where } (X \bowtie child \bowtie Y).2 = \{\, y \in Y \mid \exists x\,(x \in X \wedge (x, y) \in child) \,\},\\
(T_1, \ldots, T_n)^I &= T_1^I \cap \cdots \cap T_n^I,\\
(P/(T_1, \ldots, T_n))^I &= (P^I \bowtie child \bowtie (T_1, \ldots, T_n)^I).2,\\
(t_1, \ldots, t_n)^I &= t_1^I \cap \cdots \cap t_n^I,\\
(P/(t_1, \ldots, t_n))^I &= (P^I \bowtie child \bowtie (t_1, \ldots, t_n)^I).2,\\
(P : v)^I &= (P^I \bowtie child \bowtie v^I).2.
\end{aligned}
\]

We say that an expression E_1 is subsumed by an expression E_2, or that E_2 subsumes E_1, written E_1 ⊑ E_2, if E_1^I ⊆ E_2^I. If both E_1 ⊑ E_2 and E_2 ⊑ E_1, then E_1 is equivalent to E_2, written E_1 ≡ E_2.

Theorem 1. The following rules hold:
R1. (v_1, ..., v_n) ⊑ (v_1, ..., v_i), (T_1, ..., T_n) ⊑ (T_1, ..., T_i), (t_1, ..., t_n) ⊑ (t_1, ..., t_i), for any 1 ≤ i ≤ n;
R2. P/t ⊑ t;
R3. P/P':v ⊑ P:v;
R4. if P_1/t_1 and P_2/t_2 are valid trees, then P_1 ⊑ P_2 ∧ t_1 ⊑ t_2 ⇒ P_1/t_1 ⊑ P_2/t_2;
R5. P/(P_1:v_1, ..., P_n:v_n) ⊑ P:(v_1, ..., v_n).

Proof. (R1) follows from the properties of set intersection. To prove (R2), note that

\[
(P/t)^I = (P^I \bowtie child \bowtie t^I).2 \subseteq t^I .
\]

In the proof of (R3) we use the fact that the child relation is transitively closed, thus we have

\[
(P/P' : v)^I = ((P^I \bowtie child \bowtie P'^I).2 \bowtie child \bowtie v^I).2 \subseteq (P^I \bowtie child \bowtie v^I).2 .
\]

(R4) is a standard property of partial ordering relations. (R5) follows from the definition and from (R1) and (R3):

\[
(P/(P_1 : v_1, \ldots, P_n : v_n))^I = (P/P_1 : v_1, \ldots, P/P_n : v_n)^I \subseteq (P : v_1, \ldots, P : v_n)^I = (P : (v_1, \ldots, v_n))^I .
\]

In data integration we try to merge different XML documents into one duplicate-free and well-constructed document. In order to achieve this we can use:
- definitions of source data schemas given by means of DTDs or XML Schemas, if they are available;
- domain ontologies, both for names and tags (at the schema level) and for values (at the instance level);
- any other resources which can be used to understand and classify data correctly, such as dictionaries, taxonomies, thesauri, user-provided match and mismatch information, as well as knowledge discovered in data, e.g. keys and statistical characteristics.

Using these resources and methods, we can classify XML tree fragments, such as values, paths and tree patterns, into equivalence classes with respect to the synonymy relation. The value representing a class of semantically equivalent values resolves such issues as diversity of currencies, measures, and representation formats, and so overcomes difficulties in duplicate elimination and value comparison. For text values there is a problem with synonyms, different languages, jargon and so on. To solve these problems, methods from information retrieval can be used [5, 7]. Next, subsumption on these classes can be defined, where v_1 ⊑ v_2 means that v_1 is more desirable than v_2, because v_1 is more informative, more reliable (one database may be considered to be more reliable than others) or has higher precision. A correct definition of this relation is crucial, because it is used to define the subsumption relation over complex expressions.

In order to define subsumption on tree patterns, we start with establishing it on individual labels. As for values, patterns with different syntax may have the same meaning, e.g. fname, first-name, and firstname belong to the same equivalence class. The path author/name and the tree pattern author/name(fname, lname) will belong to different, but somehow related, equivalence classes. Again, identification of such patterns can be supported by ontologies, statistics and machine-learning methods [10, 18]. For complex patterns, the subsumption relation can be inferred from atomic patterns by means of the rules proved in Theorem 1.

Figure 7: Source data trees t_1 and t_2, and their join t_3. Fat arrows denote equivalent key paths. Tree diagrams are not reproduced; in the tree language L the trees are:
t_1 = article/(title: title-1, author/(fname: John, lname: Smith))
t_2 = paper/(title: title-1, author: John Smith, journal: journal-1)
t_3 = paper/(title: title-1, author/(fname: John, lname: Smith), journal: journal-1)

The following inference rules follow from Theorem 1 and are of special importance for data merging during data integration:

\[
\frac{P_1 \sqsubseteq P_2 \qquad v_1 \sqsubseteq v_2}{P_1 : v_1 \sqsubseteq P_2 : v_2},
\qquad
\frac{P/(P_1, \ldots, P_n) \sqsubseteq P'/(P'_1, \ldots, P'_m) \qquad (v_1, \ldots, v_n) \sqsubseteq (v'_1, \ldots, v'_m)}{P/(P_1 : v_1, \ldots, P_n : v_n) \sqsubseteq P'/(P'_1 : v'_1, \ldots, P'_m : v'_m)}
\]

It follows from Theorem 1 that it is sufficient to inspect subsumptions between trees and paths, rather than between trees and trees.

Example. For the data trees t_1, t_2, and t_3 from Fig. 7, we have:
Patterns: article ≡ paper, author/fname ⊑ author, author/lname ⊑ author, author/(fname, lname) ⊑ author.
Values: John Smith ⊑ John, John Smith ⊑ Smith.
Trees: author/(fname: John, lname: Smith) ⊑ author: John Smith, t_3 ⊑ t_1, t_3 ⊑ t_2.

Note that if we restricted ourselves to paths only, we would not be able to construct the expected minimal result tree t_3 from t_1 and t_2, because neither author/fname: John nor author/lname: Smith is subsumed by the path author: John Smith.

Trees t_1 and t_2 from Fig. 7 can be joined because there are two keys holding in t_1 and t_2, respectively, which are equivalent and have equivalent values, i.e. article/title: title-1 ≡ paper/title: title-1. Thus, these trees can be treated as describing the same entity from the semantic domain of interest. When trees describe different entities they are non-joinable.
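The joinability test and the join of t_1 and t_2 can be sketched in Python under strong simplifying assumptions. Trees are flattened to dictionaries from paths to values; the label equivalence article ≡ paper is hard-coded in a hypothetical `CANON` table rather than taken from an ontology; and "path A is a strict prefix of path B" is used as a crude stand-in for the subsumption test between a path and a more specific tree pattern. Value subsumption (e.g. John Smith against John and Smith) is not checked.

```python
# Assumed label equivalence classes (article is mapped to its representative paper).
CANON = {"article": "paper"}


def canonical(path):
    """Rewrite every label of a path to the representative of its equivalence class."""
    return "/".join(CANON.get(label, label) for label in path.split("/"))


def joinable(tree1, tree2, key1, key2):
    """Trees are joinable if their key paths are equivalent and carry equal values."""
    return canonical(key1) == canonical(key2) and tree1[key1] == tree2[key2]


def join(tree1, tree2):
    """Merge two joinable trees, keeping the more specific of overlapping entries."""
    merged = {canonical(p): v for p, v in {**tree2, **tree1}.items()}
    # Drop a path when a more specific path below it exists (prefix heuristic).
    return {p: v for p, v in merged.items()
            if not any(q.startswith(p + "/") for q in merged)}


t1 = {"article/title": "title-1",
      "article/author/fname": "John",
      "article/author/lname": "Smith"}
t2 = {"paper/title": "title-1",
      "paper/author": "John Smith",
      "paper/journal": "journal-1"}

if joinable(t1, t2, "article/title", "paper/title"):
    t3 = join(t1, t2)
    print(sorted(t3.items()))
```

For the data of Fig. 7 the result contains the title, the journal, and the author split into fname and lname, which matches the tree t_3 shown in the figure.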
Non-joinable trees are merged in such a way that a new root label is created and all trees under consideration become the highest-level subtrees of the newly created root.

5 Conclusion

We discussed some reasoning methods useful in XML data integration systems. We motivated our research by a scenario of data exchange in which data structured under source schemas are to be transformed into data structured under another schema (a target schema). In such data integration some missing or incomplete data can be inferred. The reasoning about missing data is based on data constraints imposed by the target schema. Integration of data needs mappings which describe the transformation from a source into a target schema. We propose a novel approach to XML schema mapping specification based on key constraints [6, 19]. First, automappings over schemas are generated, and next the automappings are combined to create mappings between the schemas represented by these automappings. The other kind of reasoning is based on ontologies and concerns the problem of finding the greatest lower bound of merged data. The assumption of the existence of some domain-oriented taxonomies and ontologies makes the problem more feasible than in the case of deep Web integration [10]. The method presented in the paper is a part of our research on XML data integration [16, 15], XML data transformation [14] and query reformulation.

References

[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, Reading, Massachusetts.
[2] S. Abiteboul, L. Segoufin, and V. Vianu. Representing and Querying XML with Incomplete Information. In PODS Conference.
[3] M. Arenas and L. Libkin. XML Data Exchange: Consistency and Query Answering. In PODS, pages 13-24.
[4] F. Baader, D. Calvanese, D. McGuinness, D. Nardi, and P. Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation and Applications. Cambridge.
[5] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, New York.
[6] P. Buneman, S. B. Davidson, W. Fan, C. S. Hara, and W. C. Tan. Reasoning about keys for XML. Information Systems, 28(8).
[7] J. C. P. Carvalho and A. S. da Silva. Finding similar identities among objects from multiple web sources. In WIDM 2003. ACM.
[8] R. Fagin, P. G. Kolaitis, and L. Popa. Data exchange: getting to the core. ACM TODS, 30(1).
[9] R. Fagin, P. G. Kolaitis, L. Popa, and W. C. Tan. Composing schema mappings: Second-order dependencies to the rescue. In PODS, pages 83-94.
[10] B. He, K. C.-C. Chang, and J. Han. Discovering complex matchings across web query interfaces: a correlation mining approach. In KDD 2004. ACM.
[11] M. Lenzerini. Data integration: A theoretical perspective. In PODS.
[12] S. Melnik, P. A. Bernstein, A. Y. Halevy, and E. Rahm. Supporting executable mappings in model management. In SIGMOD Conference.
[13] A. Nash, P. A. Bernstein, and S. Melnik. Composition of mappings given by embedded dependencies. In PODS.
[14] T. Pankowski. A High-Level Language for Specifying XML Data Transformations. In ADBIS, Lecture Notes in Computer Science, 3255.
[15] T. Pankowski. Management of executable schema mappings for XML data exchange. In DATAX 2006, EDBT 2006 Workshops. Lecture Notes in Computer Science (to appear), pages 1-12.
[16] T. Pankowski and E. Hunt. Data merging in life science data integration systems. In Intelligent Information Systems, Advances in Soft Computing. Springer Verlag.
[17] L. Popa, Y. Velegrakis, R. J. Miller, M. A. Hernández, and R. Fagin. Translating web data. In VLDB.
[18] A. Theobald and G. Weikum. The Index-Based XXL Search Engine for Querying XML Data with Relevance Ranking. In EDBT, Lecture Notes in Computer Science, 2287.
[19] XML Schema Part 1: Structures, 2nd Edition.
[20] C. Yu and L. Popa. Constraint-based XML query rewriting for data integration. In SIGMOD Conference, 2004.