Data Modeling using XML. Murali Mani, WPI; Antonio Badia, University of Louisville. Oct 13, 2003




Data Modeling using XML Murali Mani, WPI Antonio Badia, University of Louisville

Outline
  Part I: How to come up with good XML designs for real-world database applications?
  Part II: Translation between the relational and XML models.

Part I: How to come up with good XML designs for real world database applications?

What is XML?
  <book>
    <author>J. E. Hopcroft</author>
    <author>J. D. Ullman</author>
    <publisher name="Addison-Wesley"/>
  </book>
  Tree view:
    book
      author: J. E. Hopcroft
      author: J. D. Ullman
      publisher name="Addison-Wesley"

XML for information exchange

XML Publishing
  Text applications: the sentence "X gave two thumbs up rating to the movie Fugitive, The" is marked up as
    <reviewer>X</reviewer> gave <rating>two thumbs up</rating> to <movie>Fugitive, The</movie>
  Database applications: the table
    Name  Age  Salary
    A     25   20000
    B     62   140000
  is published as
    <person> <Name>A</Name> <Age>25</Age> <Salary>20000</Salary> </person>
    <person> <Name>B</Name> <Age>62</Age> <Salary>140000</Salary> </person>

Publishing Relational Databases
  Name  Age  Salary
  A     25   20000
  B     62   140000
  <person> <Name>A</Name> <Age>25</Age> </person>
  Users/applications see a uniform XML view.
  Exchange data with other applications.
  Querying XML is easier?
  Problem: What is a good XML schema for a relational schema?

XML for Data Modeling
  Location → location (@val, @time, GPS)
  GPS → gps (@satellite)
  Location → location (@val, @time, Bstation*)
  Bstation → bstation (@id, @sigstrength)
  Location → location (@val, @time, (GPS | Bstation*))

XML as a logical data model
  Location → location (@val, @time, (GPS | Bstation*))
  Use the data modeling features provided by XML:
    Union types
    Recursive types
    Ordered relationships
  Easier to query?
  Problems:
    What is a good XML schema for an application?
    How do we store the data in relational databases?

XML for data integration
  [Figure: Source1, Source2, ..., SourceN sit behind a wrapper that exposes XML; users browse, query, and update.]

Database Design Stages
  Application Requirements → Conceptual Design → Conceptual Schema → Logical Design → Logical Schema → Physical Design → Physical Schema

Logical Data Model and Redundancy
  Bad Design
    Student_Professor
      sname  BS  advisor  age
      SD     CS  MXM      40
      YC     EE  MXM      40
  Good Design
    Professor
      pname  age
      MXM    40
    Student
      sname  BS  advisor
      SD     CS  MXM
      YC     EE  MXM
  Person
    name  address  city    state  zip
    AV    A1       Los An  CA     90034
    AN    A2       Los An  CA     90034

What is a Data Model?
  Structural specification
  Constraint specification
  Operations

Entity Relationship (ER) Model
  Structures: entity types, relationship types
  Constraints: cardinality constraints
  Professor (0,*) -- Advisor -- (1,1) Student
    Professor: pname, age
    Advisor: year
    Student: sname

Relational Model
  Structures: relations
  Constraints: key, foreign key
  Professor (pname, age)
  Student (sname, advisor)
  Key constraints:
    Key (Professor) = <pname>
    Key (Student) = <sname>
  Foreign key constraints:
    Student (advisor) references Professor (pname)

Specifying Structures for XML
  G = (N, T, P, S)
  N = {Book, Author, Publisher, #PCDATA}
  T = {book, author, publisher, pcdata}
  S = {Book}
  P:
    Book → book (Author+, Publisher)
    Author → author (#PCDATA)
    Publisher → publisher (@name::string)
    #PCDATA → pcdata (ε)
  Regular tree grammar: every production rule is of the form A → a X, where A ∈ N, a ∈ T, and X is a regular expression over N.
  Example tree:
    book
      author: J. E. Hopcroft
      author: J. D. Ullman
      publisher name="Addison-Wesley"

XML Schema Language Proposals
  W3C DTD: local tree grammar
  W3C XML Schema: single-type tree grammar
  ISO/OASIS RELAX NG: full-fledged regular tree grammar

Properties of different Regular Tree Grammar classes
  Expressiveness
    Regular tree grammars are strictly more expressive than single-type tree grammars.
    Single-type tree grammars are strictly more expressive than local tree grammars.
  Closure properties
    Regular tree grammars are closed under union, intersection, and difference.
    Single-type and local tree grammars are closed only under intersection.
  Type assignment
    Type assignment can be ambiguous for regular tree grammars.
    Type assignment is unambiguous for local and single-type tree grammars.

Ambiguous Type Assignment
  G = (N, T, P, S)
  N = {Book, Author1, Author2, Publisher, #PCDATA}
  T = {book, author, publisher, pcdata}
  S = {Book}
  P:
    Book → book (Author1*, Author2*, Publisher)
    Author1 → author (#PCDATA)
    Author2 → author (#PCDATA)
    Publisher → publisher (@name::string)
    #PCDATA → pcdata (ε)
  Example tree (each author node can be typed Author1 or Author2):
    book
      author: J. E. Hopcroft
      author: J. D. Ullman
      publisher name="Addison-Wesley"
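The ambiguity can be made concrete with a small sketch that enumerates every way of typing the example tree under the grammar on this slide. This is only an illustration, not the authors' algorithm: the `GRAMMAR` encoding, the `assignments` helper, and the choice to ignore #PCDATA and attributes are assumptions made here to keep the example short, and the brute-force enumeration is exponential in the worst case.

```python
import re
from itertools import product

# Hypothetical encoding: nonterminal -> (tag, content model).
# A content model is a regular expression over nonterminal names,
# each name followed by one space. #PCDATA and attributes are ignored.
GRAMMAR = {
    "Book": ("book", r"(Author1 )*(Author2 )*Publisher "),
    "Author1": ("author", r""),
    "Author2": ("author", r""),
    "Publisher": ("publisher", r""),
}

def assignments(node):
    """Return every typed version (nonterminal, typed_children) of a (tag, children) tree."""
    tag, children = node
    child_options = [assignments(child) for child in children]
    results = []
    for nonterminal, (terminal, content) in GRAMMAR.items():
        if terminal != tag:
            continue
        # Try every combination of child types against the content model.
        for combo in product(*child_options):
            word = "".join(child_type + " " for child_type, _ in combo)
            if re.fullmatch(content, word):
                results.append((nonterminal, combo))
    return results

doc = ("book", [("author", []), ("author", []), ("publisher", [])])
interpretations = assignments(doc)
for _, children in interpretations:
    print([child_type for child_type, _ in children])
```

The two author nodes can be typed (Author1, Author1), (Author1, Author2), or (Author2, Author2), so the document has three valid type assignments; a local or single-type grammar would admit at most one.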

Constraint Specification for XML
  Why? If we represent all relationships only by hierarchies, the logical model will have redundancy.
    Person (0,*) -- Review -- (0,*) Book
      Person: name, zip
      Review: rating
      Book: ISBN, title, year
  What constraint specification?
    Key, foreign key
    ID/IDREF

Specifying Constraints for XML: Example
  library
    person name="RRM"
      review article="P1" rating="9"
      review article="B2" rating="9"
    person name="CZ"
      review article="B1" rating="9"
      review article="B2" rating="10"
    book ISBN="I1" title="T1" BID="B1"
    book ISBN="I2" title="T2" BID="B2"
    paper title="T1" journal="J1" PID="P1"

Specifying Constraints for XML
  Keys are specified as (rel, sel, field):
    rel is the relative axis,
    sel is the selector axis,
    field is a set of path expressions.
  For any element that belongs to rel, sel gives a set of elements; within this set, field is the key.
  rel and sel can be types or path expressions.
  Foreign keys are specified as (rel1, sel1, field1) references (rel2, sel2, field2).

Constraint Specification Proposals
  W3C XML Schema: relative axis = type; selector axis = path expression
  Keys for XML (WWW10): relative axis = path expression; selector axis = path expression
  UCM (WWW10): no relative axis; selector axis = type

Our proposal
  Relative axis = type
  Selector axis = type
  IDREF and IDREFS identify target types

Example
  library
    person name="RRM"
      review article="P1" rating="9"
      review article="B2" rating="9"
    person name="CZ"
      review article="B1" rating="9"
      review article="B2" rating="10"
    book ISBN="I1" title="T1" BID="B1"
    book ISBN="I2" title="T2" BID="B2"
    paper title="T1" journal="J1" PID="P1"
  Constraints:
    (Library, Person, <@name>)
    (Library, Book, <@ISBN>)
    (Library, Paper, <@title>)
    (Person, Review, <@article>)
    @article::IDREF references (Book | Paper)
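The constraints listed for this example can be checked mechanically. The following is a minimal sketch, assuming that types can be approximated by element tags and that fields are attribute names; `key_violations` and `dangling_idrefs` are hypothetical helpers written for this illustration, and the assignment of reviews to persons follows the order in which they appear on the slide.

```python
import xml.etree.ElementTree as ET

LIBRARY = ET.fromstring("""
<library>
  <person name="RRM">
    <review article="P1" rating="9"/>
    <review article="B2" rating="9"/>
  </person>
  <person name="CZ">
    <review article="B1" rating="9"/>
    <review article="B2" rating="10"/>
  </person>
  <book ISBN="I1" title="T1" BID="B1"/>
  <book ISBN="I2" title="T2" BID="B2"/>
  <paper title="T1" journal="J1" PID="P1"/>
</library>
""")

def key_violations(root, rel, sel, fields):
    """Check the key (rel, sel, fields); return (scope element, repeated value) pairs."""
    violations = []
    for scope in root.iter(rel):        # every element of the relative type
        seen = set()
        for elem in scope.iter(sel):    # selected elements inside that scope
            if elem is scope:
                continue
            value = tuple(elem.get(f) for f in fields)
            if value in seen:
                violations.append((scope, value))
            seen.add(value)
    return violations

def dangling_idrefs(root, ref_attr, id_attrs):
    """Return IDREF values that do not match any ID attribute of a target element."""
    ids = {e.get(a) for e in root.iter() for a in id_attrs if e.get(a) is not None}
    return [e.get(ref_attr) for e in root.iter()
            if e.get(ref_attr) is not None and e.get(ref_attr) not in ids]

print(key_violations(LIBRARY, "person", "review", ["article"]))   # key relative to Person
print(key_violations(LIBRARY, "library", "review", ["article"]))  # same field, relative to Library
print(dangling_idrefs(LIBRARY, "article", ["BID", "PID"]))
```

The key (Person, Review, <@article>) holds, while the same field taken relative to the whole library does not, because both persons review B2. This is why the relative axis is part of the key specification.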

Why use XML as logical data model?

Path Expressions vs Joins
  Relational:
    Professor
      pname  age
      MXM    40
      CZ     55
    Student
      sname  year  advisor
      SD     1998  MXM
      YC     2000  MXM
      FC     1999  CZ
    Student (advisor) references Professor (pname)
  XML:
    university
      professor pname="MXM" age="40"
        student sname="SD" year="1998"
        student sname="YC" year="2000"
      professor pname="CZ" age="55"
        student sname="FC" year="1999"
  Query: Give the names of students of professors of age 40.
    Relational algebra: π_sname ((σ_age=40 (Professor)) ⋈ Student)
    Path expression: professor[@age=40]/student/@sname
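The contrast between the two query styles can be sketched in a few lines. This is an illustration only: the lists stand in for the two relations, the join is written as a nested comprehension, and the path expression uses the limited XPath subset supported by Python's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Relational side: two flat relations linked by the foreign key Student.advisor.
professors = [("MXM", 40), ("CZ", 55)]
students = [("SD", 1998, "MXM"), ("YC", 2000, "MXM"), ("FC", 1999, "CZ")]

# Selection on Professor, then a join with Student on advisor = pname.
join_result = [sname
               for (pname, age) in professors if age == 40
               for (sname, year, advisor) in students if advisor == pname]

# XML side: students are nested under their advisor, so no join is needed.
university = ET.fromstring("""
<university>
  <professor pname="MXM" age="40">
    <student sname="SD" year="1998"/>
    <student sname="YC" year="2000"/>
  </professor>
  <professor pname="CZ" age="55">
    <student sname="FC" year="1999"/>
  </professor>
</university>
""")
path_result = [s.get("sname")
               for s in university.findall("professor[@age='40']/student")]

print(join_result, path_result)
```

Both computations return the same students; the nested representation answers the query by navigation alone.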

Union Types - attributes
  Person
    name  city   state  zip
    AV    Los A  CA     null
    AN    null   null   90034
  Person → person (@name, ((@city, @state) | @zip))
  university
    person name="AV" city="Los A" state="CA"
    person name="AN" zip="90034"

Union Types - Relationships
  Person
    name  zip
    RRM   01609
    CZ    90095
  Book
    ISBN  title  author
    I1    T1     RRM
    I2    T2     RRM
  Paper
    title  journal  author
    T1     J1       CZ
  Person → person (@name, @zip, (Book* | Paper*))
  library
    person name="RRM" zip="01609"
      book ISBN="I1" title="T1"
      book ISBN="I2" title="T2"
    person name="CZ" zip="90095"
      paper title="T1" journal="J1"

Union Types - Relationships
  Conference
    name  venue
    ER    Chicago
  Journal
    name  publisher
    TOIT  ACM
  Paper
    title  inst  conf  journal
    T1     WPI   ER    null
    T2     UCLA  null  TOIT
  Conference → conference (@name, @venue, Paper*)
  Journal → journal (@name, @publisher, Paper*)
  library
    conference name="ER" venue="Chicago"
      paper title="T1" inst="WPI"
    journal name="TOIT" publisher="ACM"
      paper title="T2" inst="UCLA"

Recursive Types
  Assembly
    name   superpart  qty
    seat   frame      1
    tire   wheel      1
    frame  bike       1
    wheel  bike       2
    bike   null       0
  assembly
    part name="bike" qty="0"
      part name="frame" qty="1"
        part name="seat" qty="1"
      part name="wheel" qty="2"
        part name="tire" qty="1"
  Query: What are the subparts of bike?
    SQL:
      WITH RECURSIVE SubPart (name) AS (
        (SELECT name FROM Assembly WHERE superpart = 'bike')
        UNION
        (SELECT R2.name FROM SubPart R1, Assembly R2 WHERE R2.superpart = R1.name)
      )
      SELECT * FROM SubPart
    Path expression: part[@name="bike"]//part/@name
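The two formulations of the subparts query can be compared side by side. The sketch below is an illustration under simple assumptions: `subparts` is a hypothetical helper that mimics the recursive SQL query with a fixpoint loop over the Assembly rows, and the XML side uses the descendant step `.//part` of `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Rows of Assembly (name, superpart, qty) from the slide.
assembly_rows = [("seat", "frame", 1), ("tire", "wheel", 1),
                 ("frame", "bike", 1), ("wheel", "bike", 2), ("bike", None, 0)]

def subparts(rows, top):
    """Fixpoint computation equivalent to the recursive query: all transitive subparts of top."""
    found = {name for name, superpart, _ in rows if superpart == top}
    frontier = set(found)
    while frontier:
        new = {name for name, superpart, _ in rows if superpart in frontier} - found
        found |= new
        frontier = new
    return found

# The same data as a recursive XML type: Part -> part (@name, @qty, Part*).
assembly = ET.fromstring("""
<assembly>
  <part name="bike" qty="0">
    <part name="frame" qty="1"><part name="seat" qty="1"/></part>
    <part name="wheel" qty="2"><part name="tire" qty="1"/></part>
  </part>
</assembly>
""")
bike = assembly.find("part[@name='bike']")
xml_subparts = {p.get("name") for p in bike.findall(".//part")}

print(sorted(subparts(assembly_rows, "bike")), sorted(xml_subparts))
```

With the recursive type, the transitive closure is a single descendant step instead of an explicit iteration.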

IDREF vs Foreign Keys
  Relational:
    Professor
      name  age
      MXM   40
      CZ    55
    Student
      name  year  advisor
      SD    1998  MXM
      YC    2000  MXM
      FC    1999  CZ
    Student (advisor) references Professor (name)
  XML:
    university
      professor name="MXM" age="40" PID="P1"
      professor name="CZ" age="55" PID="P2"
      student name="SD" year="1998" PRef="P1"
      student name="YC" year="2000" PRef="P1"
      student name="FC" year="1999" PRef="P2"
    @PRef::IDREF references (Professor)
  Query: Give the names of students of professors of age 40.
    student[@PRef => professor/@age = 40]/@name
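Dereferencing an IDREF can be sketched as an index lookup. This is a minimal illustration, assuming the PID and PRef attribute names from the slide; `by_id` is a hypothetical index built here, since `xml.etree.ElementTree` has no built-in dereference operator.

```python
import xml.etree.ElementTree as ET

university = ET.fromstring("""
<university>
  <professor name="MXM" age="40" PID="P1"/>
  <professor name="CZ" age="55" PID="P2"/>
  <student name="SD" year="1998" PRef="P1"/>
  <student name="YC" year="2000" PRef="P1"/>
  <student name="FC" year="1999" PRef="P2"/>
</university>
""")

# Index the IDREF targets once: ID value -> professor element.
by_id = {p.get("PID"): p for p in university.iter("professor")}

# Follow @PRef to the referenced professor and test its age.
names = [s.get("name") for s in university.iter("student")
         if by_id[s.get("PRef")].get("age") == "40"]
print(names)
```

Because @PRef is declared to reference the Professor type, the lookup only needs to index professor elements.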

IDREF as union of foreign keys
  Book
    ISBN  title
    I1    T1
    I2    T2
  Paper
    title  journal
    T1     J1
  Person
    name  zip
    RRM   01609
    CZ    90095
  Review
    name  book  paper  rating
    RRM   null  T1     9
    CZ    I1    null   9
    CZ    I2    null   10

IDREF as union of foreign keys
  library
    person name="RRM" zip="01609"
      review article="P1" rating="9"
    person name="CZ" zip="90095"
      review article="B1" rating="9"
      review article="B2" rating="10"
    book ISBN="I1" title="T1" BID="B1"
    book ISBN="I2" title="T2" BID="B2"
    paper title="T1" journal="J1" PID="P1"
  @article::IDREF references (Book | Paper)

Conceptual Model: ERex (ER extended for XML)

Entity Relationship (ER) model
  Entity types, relationship types, and their attributes
  Professor (0,*) -- Advisor -- (1,1) Student
    Professor: pname, age
    Advisor: year
    Student: sname

Object Role Modeling (ORM)
  Closer to natural language sentences
  Attributes and relationships are expressed uniformly using roles
  [Figure: object types Student (sname), Professor (pname), age, and Years (year); roles "has" and "is advisor of ... for".]

Unified Modeling Language (UML)
  Modeling software systems
  Class diagrams, association classes
  [Figure: class Professor (pname, age) with multiplicity 1..1, class Student (sname) with multiplicity 0..*, association class Advisor (years).]

From ER model
  Binary 1:n relationships
    Professor (0,*) -- Advisor -- (1,1) Student
      Professor: pname, age; Advisor: year; Student: sname
  Binary m:n relationships
    Person (0,*) -- Review -- (0,*) Book
      Person: name, zip; Review: rating; Book: ISBN, title, year

From ER Model
  N-ary relationships
    Supplies (quantity) relates Company (1,*), Product (0,*), and City (0,*); each entity type has attribute name.
  Recursive relationships
    Part (name, qty) -- Contains -- Part, with roles superpart (0,*) and subpart (0,1)

Ordered Relationships
  Person (0,*) -- Author -- (1,1) Book
    Person: name, zip; Book: ISBN, title
  Person
    name    zip
    Ullman  95123
    RRM     90095
  Book
    ISBN  title  order  author
    B1    DB     2      Ullman
    B2    Aut    1      Ullman
    B3    MMSL   1      RRM

Categories and Set Constraints
  Person (name) has categories PersonCity (city, state) and PersonZip (zip)
  PersonCity ∩ PersonZip = ∅
  PersonCity ∪ PersonZip = Person

Categories
  Person (name) (0,*) -- Review -- (0,*) Article
  Article is a category over Book and Paper

Set constraints on Roles
  Person (name) (0,*) personbook -- AuthorOf -- (1,1) Book
  Person (name) (0,*) personpaper -- AuthorOf -- (1,1) Paper
  personbook ∩ personpaper = ∅

Set Constraints on Roles
  Conference (name, venue) (0,*) -- Presented -- (0,1) confpaper Paper (title)
  Journal (name, publisher) (0,*) -- Published -- (0,1) journalpaper Paper (title)
  confpaper ∩ journalpaper = ∅
  confpaper ∪ journalpaper = Paper

Translating ERex schemas to XML schemas

System Architecture
  ERex Schemas → Translator → XML Schemas → Schema Designer → Final XML Schema

1:n relationships
  Professor (0,*) -- Advisor -- (1,1) Student
    Professor: pname, age; Advisor: year; Student: sname

Representing 1:n Relationships - subelement
  Relational:
    Professor
      name  age
      RRM   62
      CZ    55
    Student
      name  year  advisor
      MM    1998  RRM
      YC    2000  RRM
      FC    1999  CZ
    Student (advisor) references Professor (name)
  XML:
    university
      professor name="RRM" age="62"
        student name="MM" year="1998"
        student name="YC" year="2000"
      professor name="CZ" age="55"
        student name="FC" year="1999"
    University → university (Professor*)
    Professor → professor (@name, @age, Student*)
    Student → student (@name, @year)
    (University, Professor, <@name>)
    (University, Student, <@name>)

Representing 1:n Relationships - IDREF
  Relational:
    Professor
      name  age
      RRM   62
      CZ    55
    Student
      name  year  advisor
      MM    1998  RRM
      YC    2000  RRM
      FC    1999  CZ
    Student (advisor) references Professor (name)
  XML:
    university
      professor name="RRM" age="62" PID="P1"
      professor name="CZ" age="55" PID="P2"
      student name="MM" year="1998" PRef="P1"
      student name="YC" year="2000" PRef="P1"
      student name="FC" year="1999" PRef="P2"
    University → university (Professor*, Student*)
    Professor → professor (@name, @age, @PID)
    Student → student (@name, @year, @PRef)
    (University, Professor, <@name>)
    (University, Student, <@name>)
    @PRef::IDREF references (Professor)

Representing 1:n Relationships - foreign keys
  Relational:
    Professor
      name  age
      RRM   62
      CZ    55
    Student
      name  year  advisor
      MM    1998  RRM
      YC    2000  RRM
      FC    1999  CZ
    Student (advisor) references Professor (name)
  XML:
    university
      professor name="RRM" age="62"
      professor name="CZ" age="55"
      student name="MM" year="1998" advisor="RRM"
      student name="YC" year="2000" advisor="RRM"
      student name="FC" year="1999" advisor="CZ"
    University → university (Professor*, Student*)
    Professor → professor (@name, @age)
    Student → student (@name, @year, @advisor)
    (University, Professor, <@name>)
    (University, Student, <@name>)
    (University, Student, <@advisor>) references (University, Professor, <@name>)

m:n relationships
  Person (0,*) -- Review -- (0,*) Book
    Person: name; Review: rating; Book: ISBN, title

Representing m:n relationships
  Relational:
    Person
      name
      RRM
      CZ
    Book
      ISBN  title
      I1    T1
      I2    T2
      I3    T3
    Review
      pname  ISBN  rating
      RRM    I3    9
      RRM    I2    9
      CZ     I1    9
      CZ     I1    10
    Review (pname) references Person (name)
    Review (ISBN) references Book (ISBN)
  XML:
    library
      person name="RRM"
        review article="B3" rating="9"
        review article="B2" rating="9"
      person name="CZ"
        review article="B1" rating="9"
        review article="B2" rating="10"
      book ISBN="I1" title="T1" BID="B1"
      book ISBN="I2" title="T2" BID="B2"
      book ISBN="I3" title="T3" BID="B3"
    Library → library (Person*, Book*)
    Person → person (@name, Review*)
    Book → book (@ISBN, @title, @BID)
    Review → review (@article, @rating)
    (Library, Person, <@name>)
    (Library, Book, <@ISBN>)
    (Person, Review, <@article>)
    @article::IDREF references (Book)

Representing m:n relationships
  Relational:
    Person
      name
      RRM
      CZ
    Review
      pname  ISBN  rating
      RRM    I3    9
      RRM    I2    9
      CZ     I1    9
      CZ     I1    10
    Book
      ISBN  title
      I1    T1
      I2    T2
      I3    T3
  XML:
    library
      person name="RRM"
        review article="B3" rating="9"
        review article="B2" rating="9"
      person name="CZ"
        review article="B1" rating="9"
        review article="B2" rating="10"
      book ISBN="I1" title="T1" BID="B1"
      book ISBN="I2" title="T2" BID="B2"
      book ISBN="I3" title="T3" BID="B3"
  Query: What is the rating given by RRM for book T1?
    Relational algebra: π_rating ((σ_title=T1 (Book)) ⋈ (σ_pname=RRM (Review)))
    Path expression: person[@name="RRM"]/review[@article => Book/@title = "T1"]/@rating

N-ary relationships
  Supplies (quantity) relates Company (1,*), Product (0,*), and City (0,*); each entity type has attribute name.
  Root → root (Company*, Product*, City*)
  Company → company (@name, Supply+)
  Supply → supply (@ProdRef, @CityRef, @qty)
  Product → product (@name, @ProdID)
  City → city (@name, @CityID)
  (Root, Company, <@name>)
  (Root, Product, <@name>)
  (Root, City, <@name>)
  @ProdRef::IDREF references (Product)
  @CityRef::IDREF references (City)

Recursive Relationships
  Part -- Contains -- Part, with roles superpart (0,*) and subpart (0,1)
  Assembly
    name   superpart  qty
    seat   frame      1
    tire   wheel      1
    frame  bike       1
    wheel  bike       2
    bike   null       0
  assembly
    part name="bike" qty="0"
      part name="frame" qty="1"
        part name="seat" qty="1"
      part name="wheel" qty="2"
        part name="tire" qty="1"
  Assembly → assembly (Part*)
  Part → part (@name, @qty, Part*)
  (assembly, part, <@name>)

Ordered Relationships
  Person
    name    zip
    Ullman  95123
    RRM     90095
  Book
    ISBN  title  order  author
    B1    DB     2      Ullman
    B2    Aut    1      Ullman
    B3    MMS    1      RRM
  library
    person name="Ullman" zip="95123"
      book ISBN="B1" title="Aut"
      book ISBN="B2" title="DB"
    person name="RRM" zip="90095"
      book ISBN="B3" title="MMS"
  Library → library (Person*)
  Person → person (@name, @zip, Book*)
  Book → book (@ISBN, @title)
  (Library, Person, <@name>)
  (Library, Book, <@ISBN>)

Categories and set constraints
  Person (name) has categories PersonCity (city, state) and PersonZip (zip)
  PersonCity ∩ PersonZip = ∅
  PersonCity ∪ PersonZip = Person
  Person → person (@name, ((@city, @state) | @zip))

Categories and Set Constraints
  Person (name) (0,*) -- Review -- (0,*) Article; Article is a category over Book and Paper
  Root → root (Person*, Book*, Paper*)
  Person → person (@name, Review*)
  Review → review (@article, @rating)
  (Root, Person, <@name>)
  (Person, Review, <@article>)
  @article::IDREF references (Book | Paper)

Set constraints on Roles
  Person (name) (0,*) personbook -- AuthorOf -- (1,1) Book
  Person (name) (0,*) personpaper -- AuthorOf -- (1,1) Paper
  personbook ∩ personpaper = ∅
  Person → person (@name, (Book* | Paper*))

Set Constraints on Roles
  Conference (name, venue) (0,*) -- Presented -- (0,1) confpaper Paper (title)
  Journal (name, publisher) (0,*) -- Published -- (0,1) journalpaper Paper (title)
  confpaper ∩ journalpaper = ∅
  confpaper ∪ journalpaper = Paper
  Conference → conference (@name, @venue, Paper*)
  Journal → journal (@name, @publisher, Paper*)

Converting ERex → XML: Goals
  Maximize the number of relationships represented using subelements.
  Try to represent the remaining relationships using IDREF.

Algorithm: ERex → XML
  Create a non-terminal symbol for each:
    entity type with a key
    m:n relationship
    n-ary relationship
  Create the root non-terminal symbol.
  Represent attributes.
  Represent relationships and identify top nodes:
    1:1 and 1:n relationships
    m:n relationships
    n-ary relationships
  Identify key and IDREF constraints.

Example ERex schema
  Person (name) (0,*) -- Review (rating) -- (0,*) Article; Article is a category over Book and Paper
  Person (0,*) personbook -- Author Of -- (1,1) Book (ISBN, btitle, year)
  Person (0,*) personpaper -- Author Of -- (1,1) Paper (ptitle, year, journal)
  Person has categories PersonCity (city, state) and PersonZip (zip)
  personbook ∩ personpaper = ∅
  personbook ∪ personpaper = Person
  PersonCity ∩ PersonZip = ∅
  PersonCity ∪ PersonZip = Person

Example ERex schema: translation
  [Figure: the ERex schema with Person, Review (rating), Article (Book | Paper), the two Author Of relationships, and the categories PersonCity (city, state) and PersonZip (zip).]
  N = {Root, Person, Book, Paper, Review}
  Book → book (@ISBN, @btitle, @year)
  Paper → paper (@ptitle, @year, @journal)
  Person → person (@name, ((@city, @state) | @zip))
  Person → person (Book* | Paper*)
  Person → person (Review*)
  Review → review (@rating, @article)
  Root → root (Person*)
  (Root, Person, <@name>)
  (Root, Book, <@ISBN>)
  (Root, Paper, <@ptitle>)
  (Person, Review, <@article>)
  @article::IDREF references (Book | Paper)

Conclusions
  ERex Schemas → Translator → XML Schemas → Schema Designer → Final XML Schema
  We obtained good XML schemas from ERex schemas.

Part II: Translation between Relational and XML models.

Why publish relational databases as XML?
  Provide an XML view of our legacy data
    Users/applications can query our data over the web using standards.
    Easier to query XML than legacy (relational) data?
  Convert our legacy data to XML
    We can exchange data with applications.
    Store data in XML databases? Easier to query?

Application Scenarios
  Schema Matching Problem
    Given a relational schema and an XML schema from a standards body, how do we map the relational schema to XML?
    Tools: XML Extender from IBM, Clio (University of Toronto and IBM), MS SQL Server
  Schema Mapping Problem
    Given a relational schema, how do we come up with a good XML schema?

Schema Matching: MS SQL Server Architecture COPYRIGHT NOTICE. Copyright 2003 Microsoft Corporation, One Microsoft Way, Redmond, Washington 98052-6399 U.S.A. All rights reserved.

Schema Mapping Problem
  Name  Age  Salary
  A     25   20000
  B     62   140000
  <person> <Name>A</Name> <Age>25</Age> </person>
  Users/applications see a uniform XML view.
  Exchange data with other applications.
  Store XML in native XML databases.
  Querying XML is easier?
  Problem: What is a good XML schema for a relational schema?

Goals
  The XML schema should maintain constraints.
  The resulting XML should not introduce redundancies.
  Most relationships can be navigated using path expressions rather than joins.
  Minimal user interaction: our translator should suggest good XML schemas to the database designer.

System Architecture
  RDB → NeT, CoT → XML Schemas → Schema Designer → Final XML Schema

Related Work
  XML-DBMS: template-driven mapping language
  SilkRoute: declarative query language (RXL) for viewing relational data as XML
  Xperanto: user specifies the query in an XML query language

Algorithms
  Naïve: FT (Flat Translation)
  Consider relational data: NeT (Nesting-based Translation)
  Consider relational schema: CoT (Constraint-based Translation)

FT: Flat Translation
  1:1 mapping from relational to XML
  Idea:
    A type (non-terminal) corresponds to every relation.
    Attributes of a relation form attributes of the type.
    Keys and foreign keys are preserved.

FT: Flat Translation - Example
  Relational:
    Professor
      name  age
      RRM   62
      CZ    55
    Student
      name  year  advisor
      MM    1998  RRM
      YC    2000  RRM
      FC    1999  CZ
    Student (advisor) references Professor (name)
  XML:
    university
      professor name="RRM" age="62"
      professor name="CZ" age="55"
      student name="MM" year="1998" advisor="RRM"
      student name="YC" year="2000" advisor="RRM"
      student name="FC" year="1999" advisor="CZ"
    University → university (Professor*, Student*)
    Professor → professor (@name, @age)
    Student → student (@name, @year, @advisor)
    (University, Professor, <@name>)
    (University, Student, <@name>)
    (University, Student, <@advisor>) references (University, Professor, <@name>)
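The flat translation is simple enough to sketch directly. The following is an illustration of the idea, not the authors' implementation: `flat_translate` and the dictionary layout of `relations` are assumptions made for this example, and key and foreign key constraints are not emitted.

```python
import xml.etree.ElementTree as ET

def flat_translate(root_tag, relations):
    """Map every tuple of every relation to one element whose attributes are the columns."""
    root = ET.Element(root_tag)
    for element_tag, (columns, rows) in relations.items():
        for row in rows:
            ET.SubElement(root, element_tag, dict(zip(columns, map(str, row))))
    return root

# Hypothetical in-memory form of the two relations on the slide.
relations = {
    "professor": (("name", "age"), [("RRM", 62), ("CZ", 55)]),
    "student": (("name", "year", "advisor"),
                [("MM", 1998, "RRM"), ("YC", 2000, "RRM"), ("FC", 1999, "CZ")]),
}

university = flat_translate("university", relations)
print(ET.tostring(university, encoding="unicode"))
```

Every element is a direct child of the root, so the advisor relationship can still only be followed by comparing @advisor with @name, which is the limitation that NeT and CoT address.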

NeT: Nesting-based Translation Idea: Make use of non-flat features provided by XML: represent repeating groups using *, +

NeT: Example
  Course (cname, prof, text)
    cname       prof   text
    Algorithms  Gafni  Udi Manber
    Algorithms  Gafni  CLR
    Algorithms  Majid  Udi Manber
    Algorithms  Majid  CLR

NeT: Example
  Course (cname, prof+, text)
    cname       prof            text
    Algorithms  {Gafni, Majid}  Udi Manber
    Algorithms  {Gafni, Majid}  CLR
  Course (cname, prof+, text+)
    cname       prof            text
    Algorithms  {Gafni, Majid}  {Udi Manber, CLR}

NeT: Example
  Before nesting:
    university
      course cname="Algorithms" prof="Gafni" text="Udi Manber"
      course cname="Algorithms" prof="Gafni" text="CLR"
      course cname="Algorithms" prof="Majid" text="Udi Manber"
      course cname="Algorithms" prof="Majid" text="CLR"
    University → university (Course*)
    Course → course (@cname, @prof, @text)
    (University, Course, <@cname, @prof, @text>)
  After nesting (ValueRatio = 12/5):
    university
      course cname="Algorithms"
        prof: Gafni
        prof: Majid
        text: Udi Manber
        text: CLR
    University → university (Course*)
    Course → course (@cname, Prof+, Text+)
    Prof → prof (#PCDATA)
    Text → text (#PCDATA)
    #PCDATA → pcdata (ε)
    (University, Course, <@cname, prof, text>)

NeT: Example
  Person (name, city, state, zip)
    name  city         state  zip
    MM    Los Angeles  CA     90034
    AN    Los Angeles  CA     90034
  Person (name+, city, state, zip)
    name      city         state  zip
    {MM, AN}  Los Angeles  CA     90034

NeT: Example
  Before nesting:
    directory
      person name="MM" city="Los Angeles" state="CA" zip="90034"
      person name="AN" city="Los Angeles" state="CA" zip="90034"
    Directory → directory (Person*)
    Person → person (@name, @city, @state, @zip)
    (directory, person, <@name>)
  After nesting (ValueRatio = 8/5):
    directory
      person city="Los Angeles" state="CA" zip="90034"
        name: MM
        name: AN
    Directory → directory (Person*)
    Person → person (@city, @state, @zip, Name*)
    Name → name (#PCDATA)
    (directory, person, <@name>)

NeT: Summary
  Consider a table t with column set C. Nesting on column X is defined as: any two tuples with the same values for (C − X) are combined into one tuple.
  Observation: we need to nest only on key columns.
  Advantages of NeT:
    NeT removes redundancy if the relation is not in 4NF.
    NeT provides more intuitive XML schemas with less redundancy.
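The nesting operation can be sketched on the Course example from the earlier slides. This is an illustration, not the NeT implementation: `nest` and `count_values` are hypothetical helpers, each column is assumed to be nested at most once, and the value ratio is assumed to be the number of values before nesting divided by the number after, which matches the 12/5 reported for this example.

```python
from fractions import Fraction

def nest(rows, columns, x):
    """Combine tuples that agree on every column except x; x becomes set-valued."""
    i = columns.index(x)
    groups = {}
    for row in rows:
        rest = row[:i] + row[i + 1:]          # values of (C - X)
        groups.setdefault(rest, set()).add(row[i])
    return [rest[:i] + (frozenset(values),) + rest[i:]
            for rest, values in groups.items()]

def count_values(rows):
    """Count stored values; a set-valued cell counts once per member."""
    return sum(len(v) if isinstance(v, frozenset) else 1
               for row in rows for v in row)

columns = ("cname", "prof", "text")
course = [("Algorithms", "Gafni", "Udi Manber"), ("Algorithms", "Gafni", "CLR"),
          ("Algorithms", "Majid", "Udi Manber"), ("Algorithms", "Majid", "CLR")]

nested_prof = nest(course, columns, "prof")        # Course (cname, prof+, text)
nested_both = nest(nested_prof, columns, "text")   # Course (cname, prof+, text+)
value_ratio = Fraction(count_values(course), count_values(nested_both))
print(nested_both, value_ratio)
```

Nesting on prof leaves two tuples and nesting on text then leaves one, so the twelve original values shrink to five.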

NeT: Experimentation
  Test Set   #attr  #tuples  ValueRatio  #nested attributes  Time (sec)
  Balloons1  5      16       3.64        3                   1.08
  Hayes      6      132      1.52        1                   1.01
  Bupa       7      345      1           0                   4.40
  Balance    5      625      2.79        4                   21.48
  TA_Eval    6      110      1.24        5                   24.83
  Car        7      1728     15.53       6                   469.47
  Flare      13     365      1.67        4                   6693.41

CoT: Constraint-based Translation Translating relational schema Idea: Use foreign key constraints and our knowledge of how to represent relationships to come up with better XML models.

CoT: Step 1
  Relational:
    Professor
      pname  age
      RRM    62
    Student
      sname  advisor  cname
      MM     RRM      DBs
      YC     RRM      DBs
      YC     RRM      QSs
    Student (advisor) references Professor (pname)
  XML (ValueRatio = 11/8):
    university
      professor pname="RRM" age="62"
        student sname="MM" cname="DBs"
        student sname="YC" cname="DBs"
        student sname="YC" cname="QSs"
    University → university (Professor*)
    Professor → professor (@pname, @age, Student*)
    Student → student (@sname, @cname)
    (University, Professor, <@pname>)
    (Professor, Student, <@sname, @cname>)
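Step 1 can be sketched as nesting the referencing relation under the referenced one and dropping the foreign key column. The code below is an illustration under stated assumptions: `nest_by_foreign_key` is a hypothetical helper, rows are plain dictionaries, and the value count simply totals the attributes in the resulting tree.

```python
import xml.etree.ElementTree as ET

def nest_by_foreign_key(root_tag, parent_tag, parents, pk, child_tag, children, fk):
    """Place each child tuple under the parent it references; the foreign key becomes implicit."""
    root = ET.Element(root_tag)
    for parent in parents:
        parent_elem = ET.SubElement(root, parent_tag, parent)
        for child in children:
            if child[fk] == parent[pk]:
                attrs = {k: v for k, v in child.items() if k != fk}
                ET.SubElement(parent_elem, child_tag, attrs)
    return root

professors = [{"pname": "RRM", "age": "62"}]
students = [{"sname": "MM", "advisor": "RRM", "cname": "DBs"},
            {"sname": "YC", "advisor": "RRM", "cname": "DBs"},
            {"sname": "YC", "advisor": "RRM", "cname": "QSs"}]

university = nest_by_foreign_key("university", "professor", professors, "pname",
                                 "student", students, "advisor")
values_before = sum(len(r) for r in professors + students)
values_after = sum(len(e.attrib) for e in university.iter())
print(values_before, values_after)
```

The relational data holds 11 values and the nested tree holds 8, which agrees with the ValueRatio of 11/8 on the slide.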

CoT: Step 2
  Relational:
    Professor
      pname  age
      RRM    62
    Student
      sname  advisor  cname
      MM     RRM      DBs
      YC     RRM      DBs
      YC     RRM      QSs
    Course
      cname  since
      DBs    1979
      QSs    1962
    Student (advisor) references Professor (pname)
    Student (cname) references Course (cname)
  XML (ValueRatio = 15/14):
    university
      professor pname="RRM" age="62"
        student sname="MM" CRef="c1"
        student sname="YC" CRef="c1"
        student sname="YC" CRef="c2"
      course cname="DBs" since="1979" CID="c1"
      course cname="QSs" since="1962" CID="c2"
    University → university (Professor*, Course*)
    Professor → professor (@pname, @age, Student*)
    Student → student (@sname, @CRef)
    Course → course (@cname, @since, @CID)
    (University, Professor, <@pname>)
    (University, Course, <@cname>)
    (Professor, Student, <@sname, @CRef>)
    @CRef::IDREF references (Course)

CoT: Example
  Relations:
    Student (SID, name, advisor)
    Emp (EID, name, projname)
    Prof (EID, name, teach)
    Course (CID, title, room)
    Dept (dno, mgr)
    Proj (pname, pmgr)
  Foreign keys:
    Student (advisor) references Prof (EID)
    Emp (projname) references Proj (pname)
    Prof (teach) references Course (CID)
    Prof (EID, name) references Emp (EID, name)
    Dept (mgr) references Emp (EID)
    Proj (pmgr) references Emp (EID)
  [Figure: graph over the nodes student, proj, course, prof, emp, dept.]

CoT: Example (result)
  [Figure: graph over the nodes student, proj, course, prof, emp, dept, with the top nodes marked.]
  University → university (Course*, Emp*)
  Course → course (@CID, @title, @room, Prof*)
  Prof → prof (@EmpRef, Student*)
  Student → student (@SID, @name)
  Emp → emp (@EID, @name, @ProjRef, @EmpID, Dept*, Proj*)
  Proj → proj (@pname, @ProjID)
  Dept → dept (@dno)
  (University, Course, <@CID>)
  (University, Emp, <@EID>)
  (University, Prof, <@EmpRef>)
  (University, Student, <@SID>)
  (University, Proj, <@pname>)
  (University, Dept, <@dno>)
  @EmpRef::IDREF references (Emp)
  @ProjRef::IDREF references (Proj)

CoT Experimentation
Ran on TPC-H data.
Value ratio > 100/88 (size decreased by more than 12%).

Conclusions
[Figure: pipeline diagram with the labels RDB, NeT, CoT, XML Schemas, Schema Designer, Final XML Schema]
We obtained good XML schemas from relational schemas:
  Constraints are maintained.
  Redundancies are decreased.
  Most relationships can be navigated using path expressions.
  Minimum user interaction.

Storing XML data in relational databases

Options for storing XML data
Store in relational databases: relational databases are robust and efficient (IBM XML Extender, Oracle, MS SQL Server).
Store in native XML databases: more efficient for XML than relational databases (Natix, eXist, Tamino).
Store in a combination of both: the structured portion of the XML data in a relational database, and the unstructured portion in a native XML store.

Related Work
STORED
  No schema.
  Use data mining techniques to find structured and frequent patterns.
  These are stored in a relational DB, the others in a semistructured overflow store.
  Drawback: requires integration of the relational DB and the semistructured store.
Storing paths
  One relation for storing nodes, one for storing edges.
  Drawback: type information is lost.
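The "storing paths" option can be sketched as follows. This is a minimal sketch assuming SQLite as the relational store; the table and column names (node, edge, label, value), the choice to store attributes as "@"-labelled nodes, and the sample document are all illustrative assumptions rather than details of the cited work.

```python
import itertools
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample document.
doc = ET.fromstring('<conf ctitle="T1"><paper ptitle="P1"/><paper ptitle="P2"/></conf>')

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, label TEXT, value TEXT)")
db.execute("CREATE TABLE edge (parent INTEGER, child INTEGER)")
ids = itertools.count(1)

def shred(elem, parent_id=None):
    """Store one element, its attributes, and its subtree as node and edge rows."""
    node_id = next(ids)
    db.execute("INSERT INTO node VALUES (?, ?, ?)", (node_id, elem.tag, None))
    if parent_id is not None:
        db.execute("INSERT INTO edge VALUES (?, ?)", (parent_id, node_id))
    for name, value in elem.attrib.items():
        attr_id = next(ids)
        db.execute("INSERT INTO node VALUES (?, ?, ?)", (attr_id, "@" + name, value))
        db.execute("INSERT INTO edge VALUES (?, ?)", (node_id, attr_id))
    for child in elem:
        shred(child, node_id)
    return node_id

root_id = shred(doc)

# The path conf/paper/@ptitle needs one join through edge per step.
paper_titles = [row[0] for row in db.execute(
    "SELECT t.value FROM node p "
    "JOIN edge e1 ON e1.child = p.id "
    "JOIN edge e2 ON e2.parent = p.id "
    "JOIN node t ON t.id = e2.child "
    "WHERE e1.parent = ? AND p.label = 'paper' AND t.label = '@ptitle' "
    "ORDER BY p.id", (root_id,))]
print(paper_titles)
```

Every element lands in the same two tables regardless of its type, which illustrates the drawback noted on the slide.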

Type-based relational storage
Jayavel Shanmugasundaram: several key ideas such as schema simplification, inlining, and handling recursion.
LegoDB: use the query workload to come up with an efficient relational schema.

Main features
The entire XML document is shredded and stored in a relational database.
Not all semantic constraints in the XML schema are captured in the relational schema.
We do not discuss how operations on XML are translated to SQL.

Why not capture all constraints?
XML schema:
  A -> a (@b, ((@c, D)* (@e, F)*))
  Key: (Root, A, <@b>)
Relational schema:
  A (@b, @c, @e)
  D (@aref)
  F (@aref)
Semantic constraints lost:
  If D refers to an A, then the corresponding @c should be non-null.
  If F refers to an A, then the corresponding @e should be non-null.

NF2 representation of regular tree grammars
Eliminate union:
  paper (@ptitle, (@journal | @conference))
becomes either two rules
  paper (@ptitle, @journal)
  paper (@ptitle, @conference)
or one rule
  paper (@ptitle, @journal, @conference)

Schema Simplification
  A -> a (@d, (B, C, B)*)
  A -> a (@d [1, 1], (B [2, 2], C [1, 1]) [0, *])
  A -> a (@d [1, 1], B [2, 2] [0, *], C [1, 1] [0, *])
  A -> a (@d [1, 1], B [0, *], C [0, *])
  A -> a (@d, B*, C*)
Semantic information lost:
  The number of Bs is two times the number of Cs.
  For every C, there is a B that occurs before it and one that occurs after it.

Inlining
Before:
  Conf -> conf (@ctitle, @date, Venue)
  Venue -> venue (@city, @country)
After:
  Conf -> conf (@ctitle, @date, @city, @country)
Why inlining? Fewer joins, hence more efficient.
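Inlining can be sketched as a small recursive function over the schema. This is a minimal sketch under stated assumptions: the dictionary encoding of the schema, the occurrence marker "1" for a child that occurs exactly once, and the function name inline are all hypothetical, and the schema is assumed to be non-recursive.

```python
# Each type maps to (attributes, children); a child is (type name, occurrence).
schema = {
    "Conf": (["ctitle", "date"], [("Venue", "1")]),
    "Venue": (["city", "country"], []),
}

def inline(schema, type_name):
    """Return the attributes of type_name after inlining every child that
    occurs exactly once; other children are left for separate relations."""
    attrs, children = schema[type_name]
    result = list(attrs)
    for child, occurrence in children:
        if occurrence == "1":
            result.extend(inline(schema, child))
    return result

inlined = inline(schema, "Conf")
print(inlined)
```

With the venue columns stored in the same row as the conference, a query that needs both no longer joins two tables.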

Mapping Collection Types
XML schema:
  Conf -> conf (@ctitle, @date, Paper*)
  Paper -> paper (@ptitle, @author)
Relational schema:
  Conf (@ctitle, @date)
  Paper (@ptitle, @author, @confref)
Separate relation with a foreign key for every collection type.
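The collection mapping can be tried out with SQLite. This is a minimal sketch: the slide does not say which column identifies a conference, so using ctitle as the key that confref points to is an assumption, and the sample rows are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
db.executescript("""
    CREATE TABLE Conf (ctitle TEXT PRIMARY KEY, date TEXT);
    CREATE TABLE Paper (
        ptitle  TEXT,
        author  TEXT,
        confref TEXT NOT NULL REFERENCES Conf (ctitle)  -- assumed key
    );
""")

# Hypothetical instance: one conference with two papers.
db.execute("INSERT INTO Conf VALUES (?, ?)", ("T1", "D1"))
db.executemany("INSERT INTO Paper VALUES (?, ?, ?)",
               [("P1", "A1", "T1"), ("P2", "A2", "T1")])

# The Paper* collection of a conference is recovered by selecting on the foreign key.
papers_of_t1 = [r[0] for r in db.execute(
    "SELECT ptitle FROM Paper WHERE confref = ? ORDER BY ptitle", ("T1",))]

# A paper that points to no existing conference is rejected.
try:
    db.execute("INSERT INTO Paper VALUES (?, ?, ?)", ("P3", "A3", "T9"))
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

Declaring confref as NOT NULL reflects that every paper element has exactly one conf parent in this schema.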

IDREF attribute
XML schema:
  Person -> person (@name, @zip, Review*)
  Book -> book (@ISBN, @btitle, @BID)
  Paper -> paper (@ptitle, @journal, @PID)
  Review -> review (@article, @rating)
  @article::IDREF references (Book | Paper)
Relational schema:
  Person (@name, @zip)
  Book (@ISBN, @btitle, @BID)
  Paper (@ptitle, @journal, @PID)
  Review (@personref, @bookref, @paperref, @rating)
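Because @article can point to either a Book or a Paper, the Review relation gets one nullable reference column per target type. A minimal Python sketch of that mapping follows; the function name review_row and the sample IDs are hypothetical.

```python
def review_row(personref, article, rating, book_ids, paper_ids):
    """Map one review element to a (personref, bookref, paperref, rating) row.

    Exactly one of bookref and paperref is filled in; the other stays None (NULL).
    """
    if article in book_ids:
        return (personref, article, None, rating)
    if article in paper_ids:
        return (personref, None, article, rating)
    raise ValueError(f"dangling IDREF: {article}")

# Hypothetical IDs taken from the @BID and @PID attributes.
book_ids = {"b1"}
paper_ids = {"p1"}
rows = [
    review_row("Ann", "b1", "5", book_ids, paper_ids),
    review_row("Ann", "p1", "3", book_ids, paper_ids),
]
print(rows)
```

The rule that exactly one of the two columns is non-null lives in this mapping code, not in the relational schema as listed on the slide.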

Recursion using ?
XML schema:
  A -> a (@d, A?)
Relational schema, option 1:
  A (@d, ARef), where ARef refers to the child of A and can be null.
Relational schema, option 2:
  A (@d, ARef), where ARef refers to the parent of A and can be null.

Recursion using *
XML schema:
  A -> a (@d, A*)
Relational schema:
  A (@d, ARef), where ARef refers to the parent of A and can be null.
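The parent-reference mapping for recursive types can be sketched as follows. This is a minimal sketch: the slide lists only @d and ARef, so the generated row id that ARef points to is an assumption, as are the function name and the sample document.

```python
import itertools
import xml.etree.ElementTree as ET

# Hypothetical document for A -> a (@d, A*).
doc = ET.fromstring('<a d="1"><a d="2"><a d="3"/></a><a d="4"/></a>')

def shred(root):
    """Return rows (id, d, ARef), where ARef is the id of the parent row or None."""
    rows = []
    ids = itertools.count(1)

    def visit(elem, parent_id):
        row_id = next(ids)
        rows.append((row_id, elem.get("d"), parent_id))
        for child in elem.findall("a"):
            visit(child, row_id)

    visit(root, None)
    return rows

rows = shred(doc)
for row in rows:
    print(row)
```

Only the root row has a null ARef, and a single relation holds the tree no matter how deeply the elements nest.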

Recursion
General technique: for every cycle, we must have a separate relation.
Algorithm:
  For every strongly connected component, define a separate relation for one of the types.
  In a strongly connected component, if there is a type that can be a child of multiple types, define a separate relation for that type.
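One possible reading of this algorithm can be sketched with Tarjan's strongly connected components algorithm over the graph of types. The sketch makes several assumptions that the slide does not settle: "children of multiple types" is counted inside the component only, an arbitrary type (the alphabetically first) is picked when no such type exists, and the example graph is hypothetical.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; graph maps each type to the types it can contain."""
    index, low = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            # v is the root of a component: pop the stack down to v.
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            components.append(comp)

    for v in graph:
        if v not in index:
            visit(v)
    return components

def types_needing_relations(graph):
    """Pick the types that get their own relation because of recursion."""
    needed = set()
    for comp in strongly_connected_components(graph):
        cyclic = len(comp) > 1 or any(v in graph.get(v, ()) for v in comp)
        if not cyclic:
            continue
        # Types with more than one parent inside the component (assumed reading).
        multi = {v for v in comp
                 if sum(1 for p in comp if v in graph.get(p, ())) > 1}
        needed |= multi if multi else {min(comp)}
    return needed

# Hypothetical type graph: A contains B, B contains C, C contains A and B, D contains A.
graph = {"A": ["B"], "B": ["C"], "C": ["A", "B"], "D": ["A"]}
print(types_needing_relations(graph))
```

In this example B has two parents inside the cycle (A and C), so B gets the separate relation, and removing B breaks both cycles.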

Capturing Order in the Document
Order is captured through order attributes: corresponding to each type in the XML schema, say A, we have an associated order attribute, say aorder.
Before:
  Conf -> conf (@ctitle, @date, Venue)
  Venue -> venue (@city, @country)
After:
  Conf -> conf (@ctitle, @date, @conforder, @city, @country, @venueorder)
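Order attributes can be filled in while shredding. This is a minimal sketch, assuming that the order value is the element's 1-based position in document order (the slide does not fix the numbering scheme); the sample attribute values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document for Conf -> conf (@ctitle, @date, Venue).
doc = ET.fromstring(
    '<conf ctitle="T1" date="D1"><venue city="C1" country="N1"/></conf>'
)

# Number every element by its position in document order (1-based).
order = {elem: pos for pos, elem in enumerate(doc.iter(), start=1)}

venue = doc.find("venue")
row = {
    "ctitle": doc.get("ctitle"),
    "date": doc.get("date"),
    "conforder": order[doc],
    "city": venue.get("city"),
    "country": venue.get("country"),
    "venueorder": order[venue],
}
print(row)
```

Sorting rows by these order columns is enough to reassemble the elements in their original sequence.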

Conclusions
An XML schema with no recursion can be translated to a relational schema with no nulls.
An XML schema with recursion cannot be translated to a relational schema with no nulls.
With recursion, a separate relation is needed for every cycle.
Not all semantic constraints in XML can be captured in the relational schema.
XML resulting from CoT can be translated to the original relational schema; all semantic constraints are maintained.

Open Problems
Standards specification: what structural and constraint specification schemes for XML are needed for database applications?
XML used for text/document publishing: keyword search in XML documents.
Storing data consisting of structured and unstructured portions: integrating relational and XML stores.

Open Problems (contd.)
Translating operations in the XML model to the underlying (relational) sources:
  Use an annotated schema (MS SQL Server).
  Use implicit annotations (LegoDB).
Query minimization: automatic translation might perform unnecessary joins.