XML Data Integration




XML Data Integration Lucja Kot Cornell University 11 November 2010 Lucja Kot (Cornell University) XML Data Integration 11 November 2010 1 / 42

Introduction: Data Integration and Query Answering
A data integration system is a triple $\langle G, S, M \rangle$ where
- $G$ is the global schema
- $S$ is the source schema
- $M$ is a set of assertions relating elements of the source schema and elements of the global schema
Key issue in data integration: query answering
- given a query on the global schema, want to answer it using source data

Data Integration and Query Answering (2)
Challenge: there may be more than one way to map source data to the target schema
Solution: certain answers semantics for queries
- include only those tuples that always appear as answers
- first developed for databases with incomplete information
- now widely used in data integration and data exchange
- source instance + schemas + mappings = incomplete description of target instance
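A minimal Python sketch of the certain answers idea, assuming the possible target instances can be listed explicitly (in general there may be infinitely many). The instance contents and the names `certain_answers` and `rulers_with_successor` are illustrative and do not come from the talk.

```python
def certain_answers(query, possible_instances):
    """Return the tuples that `query` yields on every possible instance."""
    answer_sets = [set(query(instance)) for instance in possible_instances]
    if not answer_sets:
        raise ValueError("need at least one possible instance")
    return set.intersection(*answer_sets)


# Two hypothetical target instances, each a set of (ruler, successor) pairs.
instance_a = {("James V", "Mary I"), ("Mary I", "James VI & I")}
instance_b = {("James V", "Mary I"), ("Elizabeth I", "James VI & I")}


def rulers_with_successor(instance):
    """Toy query: every ruler that has some successor."""
    return {(ruler,) for ruler, _ in instance}


result = certain_answers(rulers_with_successor, [instance_a, instance_b])
print(result)
```

Only James V is returned, because he is the only ruler with a successor in both instances.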

Moving to XML
How do we do data integration in XML?
- what does the setting look like, formally?
- given that some queries can return trees, what do certain answers look like?
[Figure: example trees built from nodes labelled r, a, b, c, d]
This talk's focus: the query answering problem as we move to XML

Talk outline
1. Warm-up: representing incomplete information in XML
   - gets us thinking in XML
   - introduces interesting issues in XML query answering
2. A study of query answering complexity in XML in the presence of schema mappings
   - tradeoff between complexity of mapping and query languages
3. Certain answers for queries that return trees

Incomplete Information in XML: (Re)introducing XML
While the details of formalisms differ, XML data has the following key features:
- tree structure
- nodes have labels
- nodes have attributes
- attributes have values
- nodes may have ids
- document order

An example XML document
europe
  country (Scotland)
    ruler (James V)
    ruler (Mary I)
    ruler (James VI & I)
    ruler (Charles I)
  country (England)
    ruler (Elizabeth I)
    ruler (James VI & I)
    ruler (Charles I)

Schema information
Can have a schema for XML documents
- specifies tree structure and other related things
- XML Schema, DTD
Example DTD:
  europe → country*
  country → ruler*
  ruler → ε
  country: @name
  ruler: @name
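As a rough illustration of what conforming to such a DTD means, the Python sketch below checks a document against the rules, assuming the example DTD reads europe → country*, country → ruler*, ruler → ε with a `name` attribute on country and ruler. The encoding of rules as regular expressions over child labels and the names `CHILD_RULES`, `ATTRIBUTES` and `conforms` are assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

# Assumed reading of the example DTD, one regular expression per element label.
# Each child label is followed by a space so that labels cannot run together.
CHILD_RULES = {"europe": r"(country )*", "country": r"(ruler )*", "ruler": r""}
ATTRIBUTES = {"europe": set(), "country": {"name"}, "ruler": {"name"}}


def conforms(node):
    """Check one node against the rules, then recurse into its children."""
    if node.tag not in CHILD_RULES:
        return False
    child_word = "".join(child.tag + " " for child in node)
    if re.fullmatch(CHILD_RULES[node.tag], child_word) is None:
        return False
    if set(node.attrib) != ATTRIBUTES[node.tag]:
        return False
    return all(conforms(child) for child in node)


good = ET.fromstring(
    '<europe><country name="Scotland"><ruler name="James V"/>'
    '<ruler name="Mary I"/></country><country name="England">'
    '<ruler name="Elizabeth I"/></country></europe>'
)
bad = ET.fromstring('<europe><ruler name="James V"/></europe>')

print(conforms(good), conforms(bad))
```

The second document is rejected because a ruler appears directly under europe, which the first rule does not allow.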

Incomplete information
How do we represent incomplete information in XML?
Relational case: tables with null values
- Codd tables: all nulls distinct
- naïve or v-tables: repeated nulls (variables) permitted
- c-tables: constraints on variables permitted
A representation $t$ corresponds to a set of complete (ground) instances $Rep(t)$
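To make $Rep(t)$ concrete, here is a small Python sketch that enumerates the ground instances of a table with nulls. It assumes a closed-world reading and a finite domain of constants, both simplifications chosen for illustration; nulls are written as strings starting with `?`, and the names `is_null` and `rep` are hypothetical.

```python
from itertools import product


def is_null(value):
    """Nulls are encoded as strings such as '?x' (an assumption of this sketch)."""
    return isinstance(value, str) and value.startswith("?")


def rep(table, domain):
    """Enumerate the ground instances obtained by replacing nulls with constants."""
    nulls = sorted({v for row in table for v in row if is_null(v)})
    instances = set()
    for values in product(domain, repeat=len(nulls)):
        valuation = dict(zip(nulls, values))
        ground = frozenset(
            tuple(valuation.get(v, v) for v in row) for row in table
        )
        instances.add(ground)
    return instances


# Naive table: the null ?x is repeated, so both rows must get the same year.
naive_table = [("Vianu", "?x"), ("Abiteboul", "?x")]
# Codd table: every null occurs once, so the years vary independently.
codd_table = [("Vianu", "?x"), ("Abiteboul", "?y")]

print(len(rep(naive_table, [1995, 1996])), len(rep(codd_table, [1995, 1996])))
```

Over a two-element domain the naïve table has two ground instances and the Codd table has four.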

Interesting questions about incomplete data representations
Interesting problems:
- Consistency: given a representation $t$, is $Rep(t) \neq \emptyset$?
- Membership: given an instance $T$ and a representation $t$, is $T \in Rep(t)$?
- Query answering: given a representation $t$ and a query $q$, what are the certain answers to $q$ over $t$?
  - that is, what is $\bigcap_{T \in Rep(t)} q(T)$?
- Strong representation systems: is it the case that for each $q$ and $t$, there exists a computable representation $u$ such that $Rep(u) = \{q(T) \mid T \in Rep(t)\}$?
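The membership problem can be sketched in Python for relational naïve tables. This is a brute-force check under a closed-world reading (a valuation must map the table exactly onto the instance), which is an assumption of the example; the name `is_member` and the `?x` encoding of nulls are hypothetical.

```python
from itertools import product


def is_member(instance, table):
    """Does some valuation of the nulls turn `table` into exactly `instance`?"""
    nulls = sorted({v for row in table for v in row if str(v).startswith("?")})
    constants = sorted({v for row in instance for v in row}, key=str)
    target = set(instance)
    # Try every assignment of constants from the instance to the nulls.
    for values in product(constants, repeat=len(nulls)):
        valuation = dict(zip(nulls, values))
        image = {tuple(valuation.get(v, v) for v in row) for row in table}
        if image == target:
            return True
    return False


table = [("Vianu", "?x"), ("Abiteboul", "?x")]
same_year = {("Vianu", 1995), ("Abiteboul", 1995)}
different_years = {("Vianu", 1995), ("Abiteboul", 1996)}

print(is_member(same_year, table), is_member(different_years, table))
```

The second instance is rejected because the repeated null forces both rows to share one year. The loop is exponential in the number of nulls, so it only illustrates the definition.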

Incomplete Information in XML
P. Barcelo, L. Libkin, A. Poggi, and C. Sirangelo. XML with incomplete information: models, properties, and query answering. PODS 2009.
- an in-depth study of various incomplete information models for XML
In XML, incompleteness can be structural as well as value-related
- may only know that one node is a descendant of another, not that it is a grandchild
- can be missing node ids and/or node labels
- may or may not have a DTD present

Incomplete Information (1)
[Figure: incomplete tree with root r and a node labelled book; leaves title (Found. of DB), author (Vianu), year (x), title (y), author (Abiteboul), year (x)]

Incomplete Information (2)
[Figure: the same incomplete tree with node ids $i_0$ to $i_8$ attached to its nodes]

Incomplete Information (3)
[Figure: tree with node ids $i_0$, $i_1$, $i_3$, $i_4$, $i_5$, $i_7$; root r, a node labelled book, leaves title (Found. of DB), author (Vianu), year (x), author (Abiteboul)]

Contributions
- Give a taxonomy of incomplete information models for XML
- Study the complexity of key computational problems as a function of the types of incompleteness allowed
  - consistency
  - membership
  - query answering (for queries that return tuples)

Kinds of incomplete information considered
- labels: may be replaced by wildcards
- node ids: either all absent or all present
- structural information
  - may use any subset of the axes $\downarrow$, $\downarrow^*$, $\rightarrow$, $\rightarrow^*$ (child, descendant, next sibling, following sibling)
  - may specify siblings without sibling order
  - may use markings: root, leaf, first child, last child
- data values: either constants and variables (cf. naïve tables) or totally absent
- DTD: may be present or not
Goal: understand which of these features impact complexity

Consistency
This is always in NP
Results overview:
- without node ids and without a DTD
  - only markings can lead to inconsistency
  - with markings, NP-complete for three specific fragments and in PTIME otherwise
- adding a (fixed) DTD leads to intractability even for very simple descriptions
- node ids help a lot
  - always in PTIME without a DTD
  - even with a fixed DTD, PTIME as long as the descendant relation is not used
  - but remains NP-complete if the DTD is not fixed

Membership
This is also always in NP
Results overview:
- with node ids, is in PTIME
- without node ids, is NP-complete even for simple descriptions
- but drops to PTIME if we restrict each (data value) variable to occur only once in the tree
  - cf. relational case: membership complexity for Codd tables vs. naïve tables
  - although the proof technique used is different

Query answering
Query language:
- a query is an incomplete tree with no node ids and existential quantification over the attribute value variables it contains
  - a tree pattern
  - answers are valuations
  - analogous to relational conjunctive queries
- full language: unions of such queries
- classes of queries can be defined based on the structural information they use
- since queries return tuples, can define certain answers in the usual way:
  $certain(q, t) = \bigcap \{q(T) \mid T \in Rep(t)\}$
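A small Python sketch of how a tree pattern yields valuations on a complete tree. It is restricted to the child axis and to patterns anchored at the root, which is a simplification of the query language described above. The tuple encodings `(label, variable, subpatterns)` and `(tag, value, children)` and the names `embeddings` and `answers` are assumptions for this example.

```python
def embeddings(pattern, node, valuation):
    """Yield each extension of `valuation` under which `pattern` matches at `node`."""
    label, variable, subpatterns = pattern
    tag, value, children = node
    if label not in ("_", tag):  # "_" plays the role of a wildcard label
        return
    if variable is not None:
        # A variable seen before must be bound to the same attribute value.
        if valuation.get(variable, value) != value:
            return
        valuation = {**valuation, variable: value}
    partial = [valuation]
    for sub in subpatterns:
        # Every subpattern must match at some child of the current node.
        partial = [
            ext
            for val in partial
            for child in children
            for ext in embeddings(sub, child, val)
        ]
    yield from partial


def answers(pattern, root, output_vars):
    """Project the valuations onto the output variables, giving a set of tuples."""
    return {
        tuple(val[v] for v in output_vars)
        for val in embeddings(pattern, root, {})
    }


document = ("europe", None, [
    ("country", "Scotland", [("ruler", "James V", []), ("ruler", "Mary I", [])]),
    ("country", "England", [("ruler", "Elizabeth I", [])]),
])
# Pattern europe/country(z)/ruler(x), returning the pairs (z, x).
query = ("europe", None, [("country", "z", [("ruler", "x", [])])])

print(sorted(answers(query, document, ["z", "x"])))
```

Projecting onto fewer output variables corresponds to existentially quantifying the remaining ones, as in a conjunctive query.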

Query answering
Results overview: generally, the news is not good
- problem is always in coNP
- DTDs or markings in trees and queries induce coNP-completeness
- but can get coNP-completeness even without either of these
  - [...] and [...] cause problems too
- a tractable case: the trees are severely restricted to rigid incomplete trees
  - essentially a complete tree that may use variables for attribute values and wildcards for node labels
  - can perform relational-style naïve evaluation over relational representations of such trees
  - for tractable query answering as long as the query does not use markings

Query answering under mappings in XML
S. Amano, C. David, L. Libkin, and F. Murlak. On the tradeoff between mapping and querying power in XML data exchange. ICDT 2010.
- a study of the complexity of query answering in the data exchange setting
Setting:
- have an XML schema mapping $\langle D_s, D_t, \Sigma \rangle$ where
  - $D_s$ and $D_t$ are source and target DTDs
  - $\Sigma$ is a set of source-to-target dependencies in a suitable language
- also have a query language and want to pose queries over $D_t$
  - queries still return tuples

Contributions
The paper is a study of the (data) complexity of computing certain answers as we vary the expressiveness of:
- the query language
- the mapping language used for source-to-target dependencies
- (the DTDs)

An example source document
europe
  country (Scotland)
    ruler (James V)
    ruler (Mary I)
    ruler (James VI & I)
    ruler (Charles I)
  country (England)
    ruler (Elizabeth I)
    ruler (James VI & I)
    ruler (Charles I)

Source DTD
  europe → country*
  country → ruler*
  ruler → ε
  country: @name
  ruler: @name

Target DTD
  rulers → ruler*
  ruler → successor
  successor → ε
  ruler: @name
  successor: @name

An example solution (target) document
rulers
  ruler (James V)
    successor (Mary I)
  ruler (Mary I)
    successor (James VI & I)
  ruler (James VI & I)
    successor (Charles I)
  ruler (Elizabeth I)
    successor (James VI & I)

Mapping language
Language for mappings between source and target documents
- based on tree patterns
- very expressive, allows vertical and horizontal navigation as well as equality/inequality constraints on variables
Example tree pattern:
  $europe/country(z)[ruler(x) \rightarrow ruler(y)]$
Example source-to-target dependency:
  $europe/country(z)[ruler(x) \rightarrow ruler(y)] \longrightarrow rulers/ruler(x)/successor(y)$
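The effect of the example dependency can be sketched in Python, assuming the source pattern matches two rulers that are consecutive siblings under the same country (next-sibling navigation) and that each match requires a ruler(x)/successor(y) path in the target. The dictionary encoding of the source document and the name `required_pairs` are hypothetical.

```python
def required_pairs(source):
    """Collect the (ruler, successor) pairs that the dependency forces into the target.

    `source` maps each country name to its rulers in document order.
    """
    pairs = set()
    for rulers in source.values():
        # Each pair of consecutive sibling rulers is one match of the source pattern.
        for x, y in zip(rulers, rulers[1:]):
            pairs.add((x, y))
    return pairs


source_document = {
    "Scotland": ["James V", "Mary I", "James VI & I", "Charles I"],
    "England": ["Elizabeth I", "James VI & I", "Charles I"],
}

for ruler, successor in sorted(required_pairs(source_document)):
    print(f"rulers/ruler({ruler})/successor({successor})")
```

The four pairs produced here are the ruler and successor combinations of the example target document. The pair (James VI & I, Charles I) is matched in both countries but is only required once.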

Query answering
Query language: same tree patterns as used for mappings
- can restrict queries (or mappings!) to disallow some features, e.g. horizontal navigation
- answers are valuations as before
Assume we are given a query $q$, a mapping $M = \langle D_s, D_t, \Sigma \rangle$ and a source document $T$ conforming to $D_s$.
  $certain_M(q, T) = \bigcap \{q(T') \mid T' \text{ is a solution for } T \text{ under } M\}$

Query answering: known results
Some results known from previous work
- which used a subset of the mapping language without horizontal navigation and inequality comparisons
- for tractability, need to restrict DTDs, specifically wrt disjunction
  - nested relational DTDs
- also need to restrict mappings to fully specified ones
  - use neither descendant nor wildcard in target patterns
- otherwise the problem is coNP-complete
Main question in this paper: how do the new language features affect complexity?

New results: the good news
- Even with the most expressive mappings and queries, the complexity of query answering remains in coNP
- If the query language and DTDs are kept simple, full horizontal navigation can be added to mappings without loss of tractability
- Query answering remains in PTIME when:
  - DTDs are nested relational
  - queries may use vertical navigation (child, descendant) and equality comparisons
  - mappings may use everything except descendant and wildcard (still!)

New results: the bad news
- Extending the expressiveness of queries leads to intractability quickly
- any form of horizontal navigation leads to coNP-completeness
  - even if the mappings can only use the child relation
  - and even if the DTDs are nested relational
  - and even under some additional restrictions
Takeaway on horizontal order: ok to use in mappings, but not in queries.

Certain answers and trees: Certain answers for queries that return trees
C. David, L. Libkin, and F. Murlak. Certain answers for XML queries. PODS 2010.
[Figure: example trees built from nodes labelled r, a, b, c, d]
First step: revisit foundations of relational certain answers

The theory of certain answers
Observation 1: Given a representation of a set of databases $\mathcal{D}$, need a way to represent all the information that is true for all $D \in \mathcal{D}$
- $\mathcal{D}$ can be a set of query results, i.e. $\{q(D') \mid D' \in \mathcal{D}'\}$, but does not need to be
- The notion of "all the information that is true" depends on what language we have available to represent it
  - if we can only represent ground tuples, then our certain information is limited to the ground tuples that are found in all $D \in \mathcal{D}$
  - but if we can use naïve tables, can represent more information (weak representation systems)

The theory of certain answers
Observation 2: A representation of a set of databases $\mathcal{D}$ in a language $L$ can be viewed as a logical theory $L_{\mathcal{D}}$.
- $\mathcal{D} = Mod(L_{\mathcal{D}})$
Example: if $\mathcal{D}$ is represented by a naïve table $R$, then
- $R$ defines a conjunctive query $q_R$ ($R$ is the tableau of $q_R$)
- view $q_R$ as a logical formula
- for a database $D$, $D \in Rep(R)$ if and only if $D$ is a model of $q_R$
Given a query $q$ on $\mathcal{D}$, the certain answers are those implied by $L_{\mathcal{D}}$
- e.g. if we are interested in ground facts, we want tuples $\bar{a}$ such that $L_{\mathcal{D}} \models q(\bar{a})$
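The statement that $D$ is a model of $q_R$ amounts to finding a homomorphism from the naïve table $R$ into $D$. Below is a backtracking sketch in Python under that open-world reading, with nulls encoded as strings starting with `?`; the encoding and the names `is_null` and `is_model` are assumptions for this example.

```python
def is_null(term):
    """Nulls (variables of the tableau) are encoded as strings such as '?x'."""
    return isinstance(term, str) and term.startswith("?")


def is_model(database, table):
    """Is there a valuation of the nulls mapping every row of `table` into `database`?"""
    rows = list(table)

    def extend(index, valuation):
        if index == len(rows):
            return True
        for fact in database:
            if len(fact) != len(rows[index]):
                continue
            candidate = dict(valuation)
            consistent = True
            for term, constant in zip(rows[index], fact):
                if is_null(term):
                    # Bind the null, or check it against an earlier binding.
                    if candidate.setdefault(term, constant) != constant:
                        consistent = False
                        break
                elif term != constant:
                    consistent = False
                    break
            if consistent and extend(index + 1, candidate):
                return True
        return False

    return extend(0, {})


R = [("Vianu", "?x"), ("Abiteboul", "?x")]
D1 = {("Vianu", 1995), ("Abiteboul", 1995), ("Hull", 1995)}
D2 = {("Vianu", 1995), ("Abiteboul", 1996)}

print(is_model(D1, R), is_model(D2, R))
```

`D1` is a model even though it contains an extra fact, which is what distinguishes this reading from an exact, closed-world match.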

Max-descriptions
Suppose $L$ is a logical formalism and $\mathcal{D}$ a class of databases
- The certain $L$-knowledge of the class $\mathcal{D}$ is the $L$-theory of $\mathcal{D}$, denoted $Th_L(\mathcal{D})$
  - this is the set of all $L$-formulae satisfied in all structures from $\mathcal{D}$
- Want a finite set of $L$-formulae $\Phi$ such that $Mod(\Phi) = Mod(Th_L(\mathcal{D}))$
  - if such a set exists, we call it a max-$L$-description of $\mathcal{D}$ (or max-description if $L$ is clear)

Max-descriptions and certain answers
Back to certain answers: given a set $\mathcal{D}$ and a query $q$, the certain answers to $q$ over $\mathcal{D}$ are represented by a max-description of $\{q(D) \mid D \in \mathcal{D}\}$
A max-description of a set $\mathcal{D}$, if it exists, need not be unique
- but there may be a core
  - a smallest max-description with the property that all others can be minimized to it

Applying this to XML
Language $L$: simple tree patterns $\pi$
- fully specified trees with attribute variables
If $\mathcal{T}$ is a set of trees, then $Th(\mathcal{T}) = \{\pi \mid \forall T \in \mathcal{T} : T \models \pi\}$
A pattern $\pi$ is a max-description for a set of trees $\mathcal{T}$ if $Mod(\pi) = Mod(Th(\mathcal{T}))$

Max-description for our example trees
[Figure: trees with root r, nodes labelled a(1), a(2) and a(x), and leaves b, c, d]

Max-descriptions and cores
- The paper gives results about the complexity of computing max-descriptions for sets of XML trees
- Also gives a definition of the core of a max-description
  - defined using homomorphisms
  - theorem with bounds on the core size (upper and lower)

Back to query answering
- Give a query language that returns trees
  - uses patterns
  - has the flavor of XQuery FLWR expressions
- Certain answers to query $q$ over $\mathcal{T}$ are given by a max-description of $q(\mathcal{T})$
- Introduce the notion of a basis for a set $\mathcal{T}$
  - intuitively, a more concise representation
  - a basis $\mathcal{B}$ for $\mathcal{T}$ can help in computing certain answers
  - if $q$ is a query in their language, then certain answers to $q$ over $\mathcal{B}$ and $\mathcal{T}$ coincide
  - sufficient to compute a max-description of $q(\mathcal{B})$

Putting it all into practice
- Paper gives two case studies for certain answers
  - XML with incomplete information
  - data exchange
- Show how to compute small bases for the appropriate sets $\mathcal{T}$
- Answer several open complexity questions
- Define a new tractable class of data exchange problems by placing suitable restrictions on source-to-target dependencies
  - restriction guarantees the existence of small bases

Summary
- incomplete information in XML
  - many more kinds of incompleteness than in relational case
  - complexity of query answering very sensitive to the kind of incompleteness allowed in the representation
- query answering in XML under mappings
  - again, complexity very sensitive to parameters chosen for expressiveness
  - sometimes surprising/nonintuitive
- certain answers for queries that return trees
  - nontrivial, but definitely doable
- lots of interesting work remains to be done!

Additional slides
This section contains additional slides for the following paper:
S. Abiteboul, L. Segoufin, and V. Vianu. Representing and querying XML with incomplete information. ACM Trans. Database Syst. 31(1), pp. 208-254, 2006.

Representing and Querying XML with Incomplete Information
One of the first papers on representing incomplete information in XML with a query answering focus
Setting: maintain an incomplete, but growing XML document that represents Web data
- document is a data repository
- can be grown by asking more queries of external data sources to retrieve more information
- assumptions:
  - data is static
  - DTDs of sources are available

Document and schema model
Document model: assume node ids, no order, no attribute names
Schema model: simplified DTD
- with all child multiplicities restricted to $\{*, +, ?, 1\}$
- no ordering constraints as with standard regular expressions
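A minimal Python sketch of checking one node against such a simplified DTD, assuming a rule assigns each allowed child label one multiplicity from {*, +, ?, 1} and that labels not mentioned in the rule are forbidden. The rule format and the names `BOUNDS` and `children_ok` are hypothetical.

```python
from collections import Counter

# Lower and upper bound on the number of children for each multiplicity symbol.
BOUNDS = {"*": (0, None), "+": (1, None), "?": (0, 1), "1": (1, 1)}


def children_ok(child_labels, rule):
    """Check the unordered child labels of a node against a multiplicity rule."""
    counts = Counter(child_labels)
    if any(label not in rule for label in counts):
        return False
    for label, multiplicity in rule.items():
        low, high = BOUNDS[multiplicity]
        n = counts[label]
        if n < low or (high is not None and n > high):
            return False
    return True


# Hypothetical rule for a book node: one title, at least one author, optional year.
book_rule = {"title": "1", "author": "+", "year": "?"}

print(children_ok(["author", "title", "author"], book_rule))
print(children_ok(["title"], book_rule))
```

Because only counts are compared, the order of the children is irrelevant, which matches the absence of ordering constraints in this schema model.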

Query language
Queries are tree patterns with optional selection conditions
- two sibling nodes cannot have the same label
- no descendant navigation, data joins, etc.

Query semantics
Queries return subtrees of the document based on matches of the query pattern

Motivation for incomplete documents
Based on a query result, can build an incomplete representation of the underlying data

Model for representing incomplete information in XML
Main features of the incomplete representation:
- conditions on data values, e.g. > 200
- specialization of node labels, e.g. product1 and product2 are specializations of product
- some nodes may be fully instantiated (node ids are known)
- the DTD may now contain disjunctions of multiplicity atoms, e.g. na + a +
Theorem: Consistency can be decided in PTIME

Query answering
Theorem: this representation is a strong representation system for the query language in the paper
- Let $\Sigma$ be a fixed set of node labels. Given an incomplete tree $\mathbf{T}$ and a query $q$, one can construct an incomplete tree $q(\mathbf{T})$ such that $Rep(q(\mathbf{T})) = \{q(T) \mid T \in Rep(\mathbf{T})\} = q(Rep(\mathbf{T}))$
- Furthermore, $q(\mathbf{T})$ can be constructed in PTIME with respect to $q$ and $\mathbf{T}$ (exponential in $\Sigma$)

Certain answers
Can define certain answers using $q(\mathbf{T})$
Idea: given any incomplete representation, can define certain prefixes (and possible prefixes)
- tree prefix defined formally in paper (need to account for node ids)
- a certain prefix of $q(\mathbf{T})$ is a certain answer to $q$ with respect to $\mathbf{T}$
- it can be determined in PTIME whether a given tree is a certain prefix of $q(\mathbf{T})$

Other results in paper
Focus on the specific application setting
- algorithm for refining representation based on successive query answers
- methods for shrinking the size of the representation
  - representation that allows conjunction
  - restrictions on queries
- algorithm for generating queries that supply crucial information
  - deep search: given a query $q$, if the answer to $q$ on the local document is unsatisfactory, generate additional queries for a more precise answer
- extensions: more expressive queries, document order, no node ids
  - all associated with an increase in complexity