Similar documents
Introduction to XML. Data Integration. Structure in Data Representation. Yanlei Diao UMass Amherst Nov 15, 2007

AN ENHANCED DATA MODEL AND QUERY ALGEBRA FOR PARTIALLY STRUCTURED XML DATABASE

Modern Databases. Database Systems Lecture 18 Natasha Alechina

Managing large sound databases using Mpeg7

How To Improve Performance In A Database

Quiz! Database Indexes. Index. Quiz! Disc and main memory. Quiz! How costly is this operation (naive solution)?

Relational Databases for Querying XML Documents: Limitations and Opportunities. Outline. Motivation and Problem Definition Querying XML using a RDBMS

Semistructured data and XML. Institutt for Informatikk INF Ahmet Soylu

Department of Computer and Information Science, Ohio State University. In Section 3, the concepts and structure of signature

II. PREVIOUS RELATED WORK

Last Week. XML (extensible Markup Language) HTML Deficiencies. XML Advantages. Syntax of XML DHTML. Applets. Modifying DOM Event bubbling

A Workbench for Prototyping XML Data Exchange (extended abstract)

Web Data Management - Some Issues

An XML Based Data Exchange Model for Power System Studies

XML WEB TECHNOLOGIES

A MEDIATION LAYER FOR HETEROGENEOUS XML SCHEMAS

Databases in Organizations

Lightweight Data Integration using the WebComposition Data Grid Service

Combining Unstructured, Fully Structured and Semi-Structured Information in Semantic Wikis

NETMARK: A SCHEMA-LESS EXTENSION FOR RELATIONAL DATABASES FOR MANAGING SEMI-STRUCTURED DATA DYNAMICALLY

Integrating Heterogeneous Data Sources Using XML

Application of XML Tools for Enterprise-Wide RBAC Implementation Tasks

Chapter 3: XML Namespaces

XML Processing and Web Services. Chapter 17

OBJECT ORIENTED EXTENSIONS TO SQL

Graph-Based Linking and Visualization for Legislation Documents (GLVD) Dincer Gultemen & Tom van Engers

10CS73:Web Programming

The MongoDB Tutorial Introduction for MySQL Users. Stephane Combaudon April 1st, 2014

CST6445: Web Services Development with Java and XML Lesson 1 Introduction To Web Services Skilltop Technology Limited. All rights reserved.

XML. CIS-3152, Spring 2013 Peter C. Chapin

CSE 233. Database System Overview

Advantages of XML as a data model for a CRIS

Mobile Web Design with HTML5, CSS3, JavaScript and JQuery Mobile Training BSP-2256 Length: 5 days Price: $ 2,895.00

Agents and Web Services

Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM)

XML Data Integration

Big Data Analytics. Rasoul Karimi

XML and Data Integration

A first step towards modeling semistructured data in hybrid multimodal logic

Introduction to XML Applications

Lecture 21: NoSQL III. Monday, April 20, 2015

INTEGRATION OF XML DATA IN PEER-TO-PEER E-COMMERCE APPLICATIONS

HETEROGENEOUS DATABASE INTEGRATION FOR WEB APPLICATIONS

XSLT Mapping in SAP PI 7.1

Model-Mapping Approaches for Storing and Querying XML Documents in Relational Database: A Survey

SPARQL: Un Lenguaje de Consulta para la Web

Using Database Metadata and its Semantics to Generate Automatic and Dynamic Web Entry Forms

XML: extensible Markup Language. Anabel Fraga

Lesson 8: Introduction to Databases E-R Data Modeling

CSE 132A. Database Systems Principles

Language. Johann Eder. Universitat Klagenfurt. Institut fur Informatik. Universiatsstr. 65. A-9020 Klagenfurt / AUSTRIA

Translating between XML and Relational Databases using XML Schema and Automed

2. Background on Data Management. Aspects of Data Management and an Overview of Solutions used in Engineering Applications

Data Integration through XML/XSLT. Presenter: Xin Gu

DLDB: Extending Relational Databases to Support Semantic Web Queries

Structured vs. unstructured data. Semistructured data, XML, DTDs. Motivation for self-describing data

Object Oriented Databases. OOAD Fall 2012 Arjun Gopalakrishna Bhavya Udayashankar

Chapter 2: Designing XML DTDs

Formal Engineering for Industrial Software Development

Technologies for a CERIF XML based CRIS

Programming Languages

by LindaMay Patterson PartnerWorld for Developers, AS/400 January 2000

Cedalion A Language Oriented Programming Language (Extended Abstract)

Data Integration Hub for a Hybrid Paper Search

Chapter 3. Data Modeling Using the Entity-Relationship (ER) Model

Introduction to Ingeniux Forms Builder. 90 minute Course CMSFB-V6 P

DTD Tutorial. About the tutorial. Tutorial

04 XML Schemas. Software Technology 2. MSc in Communication Sciences Program in Technologies for Human Communication Davide Eynard

Semester Review. CSC 301, Fall 2015

Part VI. Object-relational Data Models

Transcription:

Dan Suciu AT&T Labs Labs 1 AT&T

How the Web is Today HTML documents all intended for human consumption many are generated automatically by applications Labs 2 AT&T

applications consuming HTML documents are out there Netbot): brittle technology (Junglee, the new Web standard XML makes data exchange between radically simpler applications Paradigm Shift on the Web { XML generated by applications { XML consumed by applications data exchange { across platforms (enterprise inter-operability) { across enterprises Web shifts from collection of documents to data and documents Labs 3 AT&T

Database Community Can Help Relevant elds: query optimization, query processing views, transformations data warehouses, data integration systems mediators, query rewriting secondary storage, indexes Labs 4 AT&T

But it Needs a Paradigm Shift Too Web data is dierent from database data: self-describing, schema-less structure changes without notice heterogeneous, deeply nested, irregular documents and data mixed together accessed through limited capabilities distributed, linked XML was designed by document experts, not database experts! Need to extend data management to Web data management Labs 5 AT&T

What This Tutorial is About what the database community has done: semistructured data { data model: OEM { query languages: Lore, UnQL, StruQL, etc. { schemas what the Web community has done: { data format: XML { transformation language: XSL { schemas where they meet and where they dier Labs 6 AT&T

Outline Semistructured data and XML Query languages Schemas Systems issues Conclusions Labs 7 AT&T

Part 1 Semistructured Data and XML Labs 8 AT&T

Semistructured Data Origins: integration of heterogeneous sources data sources with non-rigid structure biological data Web data Labs 9 AT&T

Semistructured Data Model The Bib &o1 paper book paper author title references references references author author year author publisher author http author title &o12 &o24 &o29 title page "..." &o43 &o96 &o25 1997 "..." "..." "..." "..." "..." "..." firstname lastname first last &o243 &o206 "Serge" "Abiteboul" "Vianu" 122 133 "Victor" 8 < - complex objects: &o1, &o12,... Object Exchange Model (OEM): : - atomic objects: &o243, &o206,... Semistructured data is an edge-labeled graph Labs 10 AT&T

the data: Serializing f paper: &o12 f... g, Bib:&o1 Syntax for Semistructured Data book: &o24 f... g, paper: &o29 f author: &o52 "Abiteboul", author: &o96 f rstname: &o243 "Victor", lastname: &o206 "Vianu"g, title: &o93 "Regular path queries with constraints", references: &o12, references: &o24, page: &o25 f rst: &o64 122, last: &o92 133ggg Labs 11 AT&T

omit oid's: May paper: f Syntax for Semistructured Data f author: "Abiteboul", author: f rstname: "Victor", lastname: "Vianu"g, title: "Regular path queries with constraints", page: frst: 122, last: 133ggg Labs 12 AT&T

Characteristics of Semistructured Data missing or additional attributes multiple attributes attributes with dierent types in dierent objects heterogeneous collections self-describing, irregular data, no a priori structure Labs 13 AT&T

Comparison of the Semistructured and Relational Model row row row name phone name phone name phone name phone John 3634 "John" 3634 "Sue" 6343 "Dick" 6363 Sue 6343 f row : fname: "John", phone: 3634g, Dick 6363 row : fname: "Sue", phone: 6343g, row : fname: "Dick", phone: 6363g g schema components become labels in semistructured data Labs 14 AT&T

standard approved by W3C, to complement HTML New structured text (SGML) Origins: XML Motivation: HTML describes presentation XML describes content XML v.s. HTML [Bosak'97] { users dene new tags { arbitrary nesting { validation is possible Labs 15 AT&T

HTML: Bibliography </h1> <h1> From HTML to XML <i> Foundations of Databases </i>, Abiteboul, Hull, Vianu <p> Addison Wesley, 1995 <br> <i> Data on the Web </i>, Abiteboul, Buneman, Suciu <p> Morgan Kaufmann, <br> HTML describes the presentation Labs 16 AT&T

Basics XML <book> <title> Foundations of Databases </title> <bibliography> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book>... </bibliography> XML describes the content Labs 17 AT&T

XML Terminology tags: book, title, author,... start tag: <book> end tag: </book> elements: <book>... </book>, or <author>... </author> elements are nested empty elements: <red></red>, or <red/> an XML document consists of a single element called root element a well-formed XML document is one with correctly matching tags Labs 18 AT&T

XML: Attributes More price="55" currency="usd"> <book <title> Foundations of Databases </title> <author> Abiteboul </author> <author> Hull </author> <author> Vianu </author> <publisher> Addison Wesley </publisher> <year> 1995 </year> </book> attributes price, currency, values "55", "USD" attributes are alternative ways to represent same data Labs 19 AT&T

XML: Oid's and References More id="o555"> <name>jane</name> <person <person id="o456"> <name>mary</name> <person id="o123" mother="o456"> <children idref="o123 o555"/> </person> </person> <name>john</name> </person> oids and references in XML are just syntax Labs 20 AT&T

XML Data Model does not exists closest approximation: Document Object Model (DOM): { class hierarchy (node, element, attribute,...) { object model, not data model: objects have behavior { denes API to inspect/modify the document Labs 21 AT&T

of XML and Semistructured Data Comparison id="o123"> f person: &o123 <person <name> Alan </name> f name: "Alan", <age> 42 </age> age: 42, <email> agb@abc.com </email> email: "agb@abc.com"gg <person> <person father="o123">... f person: f father &o123...g person </person> father person father g name age email name age email Alan 42 agb@abc.com Alan 42 agb@abc.com labels on nodes v.s. on edges: trees easy to convert, graphs harder Labs 22 AT&T

XML can mix text and elements: { Making Java easier to type and easier to type <talk> <speaker>phil Wadler</speaker> Comparison of XML and Semistructured Data similarities: { both described best by a graph { both are schema-less, self-describing dierences: { XML is ordered, ssd is not </talk> { XML has other stu (processing instructions, comments) Labs 23 AT&T

Part 2 Query Languages Semistructured data and XML Query languages Schemas Systems issues Conclusions Labs 24 AT&T

Query Languages: Outline For Semistructured Data: { Lorel (coercions, regular path expressions) { UnQL (patterns, constructors) { Skolem Functions For XML: XML-QL A dierent paradigm: { Structural Recursion { XSL Labs 25 AT&T

Lorel Part of the Lore system (Stanford) Lorel adapts OQL to semistructured data select X.title from Bib.paper X where X.year > 1995 Abbreviated to: select Bib.paper.title where Bib.paper.year > 1995 Labs 26 AT&T

Lorel v.s. OQL implicit coercions ( 1995 v.s. "1995") missing attributes (empty answer v.s. type error) set-valued attributes ( X.year>1995: may be several years) regular path expressions (next) Labs 27 AT&T

Lorel Regular Path Expressions select X.title from Bib.paper X, Bib.(paperjbook) Y where Y.author.lastname? = "Ullman" and Y.reference+ X Useful for: syntactic substitute for inheritance: paperjbook navigating partially known structures: lastname? transitive closure: reference+ Labs 28 AT&T

UnQL Pattern Matching: select T where Bib.paper: f title: T, year: Y, journal: "TODS"g and Y > 1995 Labs 29 AT&T

looks like: Results result: f fn:"john", ln:"smith", f UnQL Simple Constructors select result:f fn: F, ln: L, pub: f title: T, year: Ygg where Bib.paper: f title: T, author: frstname: F, lastname: Lg, year: Yg and Y > 2000 pub:ftitle:"p equals NP", year: 2005gg, result: f fn:"joe", ln:"doe", pub:ftitle:"errata to P=NP", year: 2006gg,...g Labs 30 AT&T

languages for semistructured data have Skolem functions: Many StruQL, Yat, Florid. MSL, Skolem Functions Skolem Functions in databases: Maier, 1986: in OO systems. Kifer et al, 1989: F-logic. Hull and Yoshikawa, 1990: in deductive systems (ILOG). Papakonstantinou et al., 1996: in semistructured data (MSL). Labs 31 AT&T

Skolem Functions Strudel: a Web Site Management System StruQL: it's query language where Root! "Bib"! X, X! "paper"! P, P! "author"! A, P! "title"! T, P! "year"! Y create Root(), HomePage(A), YearPage(A,Y), PubPage(P) link Root()! "person"! HomePage(A), HomePage(A)! "yearentry"! YearPage(A,Y), YearPage(A,Y)! "publication"! PubPage(P), PubPage(P)! "author"! HomePage(A) PubPage(P)! "title"! T Labs 32 AT&T

Functions Skolem Root() person person person HomePage("Smith") HomePage("Jones") HomePage("Mark") yearentry yearentry author author yearentry yearentry yearentry author YearPage("Smith",1994) YearPage("Smith",1996) YearPage("Jones",1994)YearPage("Jones",1996) YearPage("Mark",1996) publication author publication publication author publication PubPage("The Comma") PubPage("The Dot") PubPage("Experience with Comma Categories") title title title Labs 33 AT&T

relies on the edge-labeled semistructured data model - suers model mismatch from A query language for XML: XML-QL comes from the semistructured data community features: { regular path expressions { patterns { simple constructors { Skolem Functions plus: XML syntax Labs 34 AT&T

where <book language="french"> <publisher><name>morgan Kaufmann</name> Pattern Matching in XML-QL </publisher> <title> $t </title> <author> $a </author> </book> in "www.a.b.c/bib.xml" construct $a Labs 35 AT&T

Simple Constructors in XML-QL where language=$l> <book $a </> <author> IN "www.a.b.c/bib.xml" </> <result> <author> $a </> construct <lang> $l </> </> Note: </> abbreviates </book> or </result>, etc. Groups titles by authors: <result> <author>smith</author> <lang>english</lang> </result> <result> <author>smith</author> <lang>mandarin</lang> </result>... Labs 36 AT&T

Skolem Functions in XML-QL <paperjbook> <title> $t </> where $a </> <author> IN "www.a.b.c/bib.xml" </> <result id=f($a)> <author> $a </> construct <t> $t </> </> <result> <author>joe</author> <t><t/> <t><t/> </result> <result> <author>jim</author> <t><t/> <t></t> <t></t> </result> <result> <author>jan</author> <t><t/> </result>... Labs 37 AT&T

as sets, with union operator: Data a:fb:"one",c:5g, b:4g = fa:3g U fa:fb:"one",c:5g, b:4g fa:3, fresult:3, result:5, result:4g A Dierent Paradigm: Structural Recursion = fa:3g U fa:fb:"one",c:5gg U fb:4g Example: retrieve all integers in the data. f(v) = if isint(v) then f"result": vg else fg f(fg) = fg f(fl: tg) = f(t) f(t1 U t2) = f(t1) U f(t2) Answer: standard textbook programming on trees Labs 38 AT&T

Structural Recursion with Multiple Functions Example: increase all engine's prices by 10%. f engine:f part: fprice: 100g, f engine:f part: fprice: 110g, price: 1000g, price: 1100g, f! body: f part: fprice: 100g, body: f part: fprice: 100g, price: 1000gg price: 1000gg f(v) = v g(v) = v g(fg) = fg f(fg) = fg f(fl: tg) g(fl: tg) = if l = "engine" = if l = "price" then fl: g(t)g then fl: 1.1*tg else fl: g(t)g else fl: f(t)g g(t1 U t2) = g(t1) U g(t2) f(t1 U t2) = f(t1) U f(t2) Labs 39 AT&T

limited expressive power, e.g. no joins (but continues to be changed) XSL a proposal at W3C (soon a standard) implemented in IE 5.0 purpose: stylesheet specication language: { stylesheet: XML! HTML { more general: XML! XML perfectly integrated with XML still changing... Labs 40 AT&T

<xsl:template match="/bib/*/title"> XSL: Templates and Rules a query is collection of template rules a template rule has a match pattern and a template Example: retrieve all book titles in the bibliography database: <xsl:template> <xsl:apply-templates/> </xsl:template> <result> <xsl:value-of/> </result> </xsl:template> Labs 41 AT&T

Path Expressions in XSL bib matches a bib element * matches any element / matches the root element /bib matches a bib element immediately after root bib/paper matches a paper following a bib bib//paper matches a paper following a bib at any depth //paper matches a paper at any depth paper j book matches a paper or a book @price matches a price attribute bib/book/@price the price attribute of a book Labs 42 AT&T

<xsl:template match="a"> <A> <xsl:apply-templates/> </A> <xsl:template match="b"> <B> <xsl:apply-templates/> </B> <xsl:template match="c"> <C> <xsl:value-of/> </C> Flow Control in XSL <xsl:template> <xsl:apply-templates/> </xsl:template> </xsl:template> </xsl:template> </xsl:template> Labs 43 AT&T

</b> <a> </e> 4 </c> <c> 1 </c> <c> 2 </c> <c> & % <B> <C> 1 </C> <A> 2 </C> <C> </B> <C> 3 </C> <A> </A> 4 </C> <C> $ % ' $ <a> <e> <b> '! <c> 3 </c> </a> </A> </a> & Labs 44 AT&T

query = a single structural recursion function XSL query with modes = multiple functions XSL XSL is Structural Recursion Equivalent to: f(v) = v f(fg) = fg f(fl:tg) = if l = "c" then f"c":tg else if l = "b" then f"b":f(t)g else if l = "a" then f"a":f(t)g else f(t) f(t1 U t2) = f(t1) U f(t2) Labs 45 AT&T

<xsl:template match="e"> <xsl:apply-patterns select="/"> Comparison of XSL and Structural Recursion applicability: { structural recursion: on arbitrary graphs { XSL: on trees only termination: { structural recursion: always terminates { XSL: may run forever. Add this to previous program </xsl:template> Stack overow on IE 5.0 Labs 46 AT&T

powerful features: regular expressions, constructors, Skolem structural recursion Functions, XSL intended to be a style sheet language, can serve as query substitute language Summary of Query Languages studied extensively in the context of semistructured data no standard query language for XML yet control ow in XSL is structural recursion Labs 47 AT&T

Part 3 Schemas Semistructured data and XML Query languages Schemas Systems issues Conclusions Labs 48 AT&T

Highlights of Schemas in Semistructured Data why? { to describe semantics { to improve data processing here lies our interest what? { many proposals (both in ssd and in the W3C) { this tutorial: show two basic formalisms, and DTDs when? { a posteriori { RDBMS: need schema a priori to interpret binary data how? { independent from the data (may have several schemas!) { RDF, DCD: schema hardwired with the data Labs 49 AT&T

&r1 Schemas: An Example person company person company person manages employee c.e.o. works-for &p1 &c1 &p2 &c2 &p3 works-for works-for c.e.o. name name name name name position address Phone address position url &s0 &s1 &s2 &s3 &s4 &s5 &s6 &s7 &s8 &s9 "Smith" "Manager" "Widget" "Trenton" "Jones" "Gaget" "Paris" "Dupont" "5556666" &s10 "Sales" description description "www.gp.fr" &a1 &a4 performance procurement &a5 salesrep task 1998 &a2 1997 contact &a3 &a7 &a6 "below target" "on target" some well-dened classes: companies, persons some irregular data Labs 50 AT&T

Lower-Bound Schemas Shows required labels Root company person works-for Company c.e.o. Employee managed-by name address name string Labs 51 AT&T

Upper Bound Schemas Shows permitted labels. Root manages person employee company Person c.e.o. worksfor Company description name phone position name address url Any string _ Called Schema Graphs in [Buenman et al'97] Labs 52 AT&T

The Two Questions to Ask Conformance: does that data conform to this schema? Classication: if so, then which objects belong to what classes? Labs 53 AT&T

Using Unary Datalog on Lower-Bound Schemas [Nestorov et. al 1997] Unary Datalog Types Root(X) :- ref(x,company,c), Company(C), ref(x,person,m), Employee(M) Employee(E) :- ref(e,works-for,c), Company(C), EDBs: ref, string ref(e,managed-by,m), Employee(M) IDBs: ref(e,name,n), string(n) Root, Employee, Company Company(C) :- ref(c,ceo.,e), Employee(E), ref(c,name,n), string(n), ref(c,address,a), string(a) Conformance: check Root(&r1) Classication: check Employee(&p2), Company(&c1), etc Need greatest xpoint semantics for Datalog! Labs 54 AT&T

Denition Given two edge-labeled graphs G 1 G 2 if 1 x 2 ) 2 R, then for every edge x a 1! y 1 in G 1 a (x 2 in G 2 (same label) y! there exists an edge x 2...G 2 can do too! compute simulation quite eciently: O(j G 1 j j G 2 j). can Henzinger, and Kopke'95] [Henzinger, Graph Simulation a simulation is a relation R on their nodes s.t.: s.t. (y 1 y 2 ) 2 R. Notation: G 1 R G 2 x R x 1 2 l l y 1 R y 2 what G 1 can do... G 1 G 2 Labs 55 AT&T

Using Simulation Given data instance D, schema S Upper bound schemas: { conformance: check D R S { classication: check (x c) 2 R Lower bound schemas: { conformance: check S R D { classication: check (c x) 2 R where R is the largest simulation For lower bound schemas: same as unary datalog Labs 56 AT&T

is an ecient way to decide conformance and compute Simulation classication object Conformance and Classication &r1 Root Root person company company Company works-for c.e.o. person Employee manages person company person company person manages employee c.e.o. works-for &p1 &c1 &p2 &c2 &p3 works-for works-for c.e.o. name name name name name position address Phone address position url &s0 &s1 &s2 &s3 &s4 &s5 &s6 &s7 &s8 &s9 manages description Person employee c.e.o. worksfor name phone position Company name address url name address name "Smith" "Manager" "Widget" "Trenton" "Jones" "Gaget" "Paris" "Dupont" "5556666" &s10 "Sales" description Any string Lower Bound (Unary Datalog) description "www.gp.fr" &a1 &a4 performance procurement &a5 salesrep task &a2 1997 contact &a3 &a6 "on target" Semistructured Data Instance 1998 &a7 "below target" _ Upper Bound (Schema Graph) Labs 57 AT&T

Application 1: Improve Secondary Storage Root Company company person oid name address c.e.o. works-for............ Company name address c.e.o. Employee name managed-by Employee oid name managed-by works-for string............ Store rest in overow repository Lower-Bound Schema Labs 58 AT&T

Application 2: Query Optimization Bib select X.title from Bib. X paper book where X.*.zip = \12345" year journal title address author title + int string string string X.title select zip city street last name first name string string string string string from Bib.book X where X.address.zip= \12345" Upper-Bound Schema Labs 59 AT&T

Schema Extraction from Data Problem statement: given data instance D nd the \most specic" schema S for D in practice: may be too large, need to relax (approximate, etc) Labs 60 AT&T

Schema Extraction Example data: r employee employee employee employee employee employee employee employee manages manages manages manages company manages &p1 managedby &p2 &p3 &p4 &p5 &p6 &p7 managedby &p8 worksfor managedby managedby name phone position position managedby position name name name name name name position name "Jones" "6666" "Smith" "Joe" "Marketing" "Dupont" "Gaston" "Sales" "Gonnet" worksfor "Jack" "IT" "Fred" "IT" &c worksfor name worksfor worksfor worksfor worksfor worksfor "Widget Trenton" Labs 61 AT&T

data guide is precisely the most specic (deterministic) the schema upper-bound Extracting Upper Bound Data Guide [Goldman and Widom 1997] Given data instance D: a data guide S for D is a graph s.t.: accuracy D and S have the same sets of paths conciseness every path in S is unique construction of S: powerset automaton for D implemented in Lore Labs 62 AT&T

Root &r employee Data Guide company Employees &p1, &p2, &p3, &p4 &p5, &p6, &p7, &p8 worksfor managedby name position phone manages Company &c worksfor managedby Boss &p1, &p4, &p6 name phone manages Regular &p2, &p3, &p5 &p7, &p8 name position name worksfor Data guides are deterministic Labs 63 AT&T

Extracting the Lower Bound [Nestorov et al., 1998] construction: { build \most specic" datalog program: each node = a class { run the program { collapse equal classes alternatively: compute the simulation equivalence classes Root company employee manages employee employee employee manages employee manages Boss1 Regular1 Regular2 Boss2 Regular3 managedby worksfor name phone managedby position managedby name name name name position worksfor worksfor Company worksfor worksfor name Labs 64 AT&T

Schema Inference Problem statement: given query Q nd most specic schema S s.t. { for every D, Q(D) conforms to S related to type inference in functional programming languages Labs 65 AT&T

Schema Inference where Root! "Bib"! X, X! "paper"! P, P! "author"! A, P! "title"! T, P! "year"! Y create Root(), HomePage(A), YearPage(A,Y), PubPage(P) link Root()! "person"! HomePage(A), HomePage(A)! "yearentry"! YearPage(A,Y), YearPage(A,Y)! "publication"! PubPage(P), PubPage(P)! "author"! HomePage(A) PubPage(P)! "title"! T Labs 66 AT&T

Inference Schema Root() Root() person person person HomePage("Smith") HomePage("Jones") HomePage() yearnetry yearnetry author author yearnetry yearnetry yearnetry author YearPage("Smith",1994) YearPage("Smith",1996) YearPage("Jones",1994)YearPage("Jones",1996) YearPage() publication author author publication PubPage("Moving the Comma") PubPage("Moving the Dot") PubPage() title title title Some data instance Q(D) Inferred schema Labs 67 AT&T

Schemas in XML Document Type Descriptor (DTD) optionally accompanies an XML document species a grammar has only limited use as a schema Labs 68 AT&T

] > <!ELEMENT section ((title,section*) j text)> DTD's as Grammars <!DOCTYPE paper [ <!ELEMENT paper (section*)> <!ELEMENT title (#PCDATA)> <!ELEMENT text (#PCDATA)> <paper> <section> <text> </text> </section> <section> <title> </title> <section>...</section> <section>...</section> </section> </paper> Labs 69 AT&T

DTD's as Schemas Not so well suited: impose unwanted restrictions on order: <!ELEMENT person (name,phone)> means that <name> must occur before <phone> problems with references: cannot restrict their range can be too vague: <!ELEMENT person ((namejphonejemail)*)> what combinations of attributes should one expect? Labs 70 AT&T

all proposals the XML documents must make explicit reference in the schema to Other Schemas Proposed for XML Currently considered by the W3C (no clear winner): XML-Data RDF Schema DCD ::: Characteristics: object-oriented avor (classes, subclasses) constraints Labs 71 AT&T

in SS data: data and schema are decoupled ) schema from data extraction in SS data: the emphasis is on using schema information in processing (optimization, storage) data in XML: schemas range from grammars (DTD) to object-oriented Summary of Schemas in XML: emphasis is on semantics Labs 72 AT&T

Part 4 Systems Issues Semistructured data and XML Query languages Schemas Systems issues Conclusions Labs 73 AT&T

Systems Issues: Outline servers for semistructured data mediators for semistructured data Labs 74 AT&T

Servers for Semistructured Data secondary storage: { text le (XML) { OO repository (need DTD): POQL [Christophides et al. 94], ObjectDesign's excelon, Bluestone, etc. { extend OO: Ozone { build object repository from scratch (Lore) { as ternary relation: [Florescu and Kossmann '99] { from data mining to relational schema [Deutsch et al. 99] indexes { deal with dierent atomic types: 1995, "1995" (Lore) { compute path expressions: Region Algebras (structured text), data guides, T-indexes, XSet query processing and optimization (Lore) Labs 75 AT&T

query optimization in the presence of limited source { (nding feasible plans) [Yerneni et al'99, capabilities query composition and rewriting [Papakonstantinou et { Papakonstantinou&Vassalos'99, Deutsch et al'99, al'96, Mediator for Semistructured Data XML = a view over Relational/OO/OR data mediators: { translation between models, e.g. Relational $ XML { integration issues: Florescu et al'99] Calvanese et al'99] { data conversion [Cluet et al'98] Labs 76 AT&T

Part 5 Conclusions Semistructured data and XML Query languages Schemas Systems issues Conclusions Labs 77 AT&T

not covered by the tutorial: transactions, limited source [Yerneni et al'99], lots of W3C standard proposals capabilities Summary data on the Web { at the intersection of semistructured data and structured text (XML) { highlights cultural mismatch: db community and document community paradigm shift, for both Web and db: { Web shifts from document to data { db shifts from structured to irregular, self-describing data covered by this tutorial: data models, queries, schemas (XML-Pointer, XML-Link, RDF, RDF-Schema, DCD) Labs 78 AT&T

XML object repository, indexes, query needed: data mining processing/optimization, What Technologies are Needed Web applications possible today: { servers store/process data in traditional ways { applications exchange data in XML (and queries too!) { XML = view over structured databases needed: mediator technology, good compression (not there yet) Web applications in the future: { store/process native XML data { mine/analyze XML data Labs 79 AT&T

Why This Is Cool for Database Researchers put to work all those techniques you teach in CS101! { recursive tree traversal (structural recursion, XSL) { automata theory (DTD's, path expressions) { graph theory (simulation, bisimulation, graph morphism) adapt old db techniques to handle the new kind of data make the world a better place (from FAX to XML)! The End Labs 80 AT&T