Enhancing Traditional Databases to Support Broader Data Management Applications. Yi Chen Computer Science & Engineering Arizona State University

Similar documents

Object Oriented Databases. OOAD Fall 2012 Arjun Gopalakrishna Bhavya Udayashankar

Storing and Querying Ordered XML Using a Relational Database System

Database Systems. Lecture 1: Introduction

Unified XML/relational storage March The IBM approach to unified XML/relational databases

Technologies for a CERIF XML based CRIS

Model-Mapping Approaches for Storing and Querying XML Documents in Relational Database: A Survey

Quiz! Database Indexes. Index. Quiz! Disc and main memory. Quiz! How costly is this operation (naive solution)?

A Workbench for Prototyping XML Data Exchange (extended abstract)

Topics in basic DBMS course

Unraveling the Duplicate-Elimination Problem in XML-to-SQL Query Translation

SQL Query Evaluation. Winter Lecture 23

Relational Databases for Querying XML Documents: Limitations and Opportunities. Outline. Motivation and Problem Definition Querying XML using a RDBMS

Indexing XML Data in RDBMS using ORDPATH

Overview of Data Management

Encoding Library of Congress Subject Headings in SKOS: Authority Control for the Semantic Web

Relational Databases

An Oracle White Paper October Oracle XML DB: Choosing the Best XMLType Storage Option for Your Use Case

CS2Bh: Current Technologies. Introduction to XML and Relational Databases. The Relational Model. The relational model

Data warehousing with PostgreSQL

XML Data Integration

Chapter 1: Introduction

CSE 233. Database System Overview

Overview of Database Management

Data Integration and Exchange. L. Libkin 1 Data Integration and Exchange

Data Integration using Agent based Mediator-Wrapper Architecture. Tutorial Report For Agent Based Software Engineering (SENG 609.

CS2Bh: Current Technologies. Introduction to XML and Relational Databases. Introduction to Databases. Why databases? Why not use XML?

Final Project Report

AllegroGraph. a graph database. Gary King gwking@franz.com

WHAT IS A SITE MAP. Types of Site Maps. vertical. horizontal. A site map (or sitemap) is a

Data Quality in Information Integration and Business Intelligence

High Performance XML Data Retrieval

History of Database Systems

SQL - QUICK GUIDE. Allows users to access data in relational database management systems.

Contents RELATIONAL DATABASES

Technologies & Applications

Inside the PostgreSQL Query Optimizer

Introduction to Polyglot Persistence. Antonios Giannopoulos Database Administrator at ObjectRocket by Rackspace

Chapter 1: Introduction. Database Management System (DBMS) University Database Example

How To Improve Performance In A Database

Big Data Analytics. Rasoul Karimi

JCR or RDBMS why, when, how?

Data Integration Hub for a Hybrid Paper Search

CHAPTER 5: BUSINESS ANALYTICS

Relational Database Basics Review

Introduction. Part I: Finding Bottlenecks when Something s Wrong. Chapter 1: Performance Tuning 3

NakeDB: Database Schema Visualization

InfiniteGraph: The Distributed Graph Database

Implementing XML Schema inside a Relational Database

DATABASE MANAGEMENT SYSTEMS. Question Bank:

Managing large sound databases using Mpeg7

Relational Division and SQL

Extending Data Processing Capabilities of Relational Database Management Systems.

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce

Binary search tree with SIMD bandwidth optimization using SSE

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

ABSTRACT 1. INTRODUCTION. Kamil Bajda-Pawlikowski

12 File and Database Concepts 13 File and Database Concepts A many-to-many relationship means that one record in a particular record type can be relat

Geodatabase Programming with SQL

low-level storage structures e.g. partitions underpinning the warehouse logical table structures

CHAPTER 1 INTRODUCTION

1. INTRODUCTION TO RDBMS

Databases in Organizations

DATABASE MANAGEMENT SYSTEM

14 Databases. Source: Foundations of Computer Science Cengage Learning. Objectives After studying this chapter, the student should be able to:

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Translating between XML and Relational Databases using XML Schema and Automed

Database Management Systems. Chapter 1

DB2 for i. Analysis and Tuning. Mike Cain IBM DB2 for i Center of Excellence. mcain@us.ibm.com

CHAPTER 4: BUSINESS ANALYTICS

End the Microsoft Access Chaos - Your simplified path to Oracle Application Express

The process of database development. Logical model: relational DBMS. Relation

The presentation explains how to create and access the web services using the user interface. WebServices.ppt. Page 1 of 14

Modeling and Querying E-Commerce Data in Hybrid Relational-XML DBMSs

Data Integration and ETL Process

Large Scale Text Analysis Using the Map/Reduce

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

Oracle Database 10g: Building GIS Applications Using the Oracle Spatial Network Data Model. An Oracle Technical White Paper May 2005

GRAPH PATTERN MINING: A SURVEY OF ISSUES AND APPROACHES

10. XML Storage Motivation Motivation Motivation Motivation. XML Databases 10. XML Storage 1 Overview

ASTERIX: An Open Source System for Big Data Management and Analysis (Demo) :: Presenter :: Yassmeen Abu Hasson

UltraQuest Cloud Server. White Paper Version 1.0

Relational Algebra. Module 3, Lecture 1. Database Management Systems, R. Ramakrishnan 1

WebSphere Business Modeler

Apache Kylin Introduction Dec 8,

Transcription:

Enhancing Traditional Databases to Support Broader Data Management Applications Yi Chen Computer Science & Engineering Arizona State University

What Is a Database System? Of course, there are traditional relational database management systems (RDBMS) Was introduced in 1970 by Dr. E. F. Codd (of IBM) Commercial relational databases began to appear in the 1980s The focus of most work in the past 30 years 2

A Relational Database (RDBMS) Table (Relation) Row (Tuple, Record) Refers to Column (Attribute) Climber Name Skill age James Beginner 21 Bob Experienced 33 Climbs Name Route Date Duration Bob Last Tango 10/10/05 5 Bob Last Tango 1/10/06 4.5 A predefined data structure (schema) is required. 3

Querying RDBMS: SQL Climber Name Skill age selection: σname = James James Beginner 21 Bob Experienced 33 projection: Route = Last Tango join: Climber Climbs Name Route Date Duration Bob Last Tango 10/10/05 5 Bob Last Tango 1/10/06 4.5 Climber.name = climbs.name Climbs Name Skill Age Route Date Duration Bob Experienced 33 Last Tango 10/10/05 5 Bob Experienced 33 Last Tango 1/10/06 4.5 4

The Advantages of RDBMS Good data organization High efficiency for large datasets via indexing and query optimization Concurrency control and reliability 5

But, 80% of the World s Data is Not in RDBMS! Examples: WWW, Emails Personal data, documents of various format Sensor data A lot of scientific data (experimental data, large images, documentation, etc) Why not? There are several assumptions in relational databases that do not fit for handling this data. My research addresses how to enhance RDBMS to manage them. 6

Challenges for RDBMS (I) RDBMS Assumption: data conforms to a predefined fixed schema, which is separated from the data itself Reality: Data may be collected from different sources on the web, therefore has different schemas Schema can change over time for a single source Requirements: We need to handle data of different schemas and have the schemas tightly associated with the data 7

XML as a Data Representation Format XML has become a standard data format for various applications, because of: Flexibility in schemas -- semi-structured data Self - describing feature Representing tree data model naturally 8

XML: the Standard for Web Data Representation Web Service Requester XML Data Internet XML Data GenBank NCBI Web Service Publisher PubMed BLAST... 9

XML: Representing Phylogenetic Trees From the Tree of the Life Website, University of Arizona Orangutan Gorilla Chimpanzee Human 10

Challenges for RDBMS (II) RDBMS Assumption: Data is clean and consistent. Reality: real world data is dirty Data collected from different sources may have missing and conflicting information Data that is obtained from data mining is often not error-prone Experimental data often contains random errors Requirements: we need to measure data quality and handle imprecise and/or incomplete data 11

Roadmap of This Talk Managing XML by leveraging mature RDBMS [Chen et al 04] Introduction to XML A generic and efficient XML-to-RDBMS mapping Data mapping from trees to tables Query translation from tree navigation queries to SQL queries that are efficient Handling imprecise and incomplete data in DBMS [Chen et al 06] 12

Sample XML Data <books> <book> <> The lord of the rings... </> <section> <> Locating middle-earth </>... </section> </book> </books> The lord of the rings Locating middleearth books book A hall fit for a king section section figure......... description King Theoden's golden hall 13

Sample XML Queries XML query languages are based on hierarchical structure navigation (e.g. XPath) Sample queries: What are all the section s: //section/ Descendant axis Child axis The lord of the rings Locating middleearth books book A hall fit for a king section section figure......... description King Theoden's golden hall 14

Sample XML Queries XML query languages are based on hierarchical structure navigation (e.g. XPath) Sample queries: What are all the section s: //section/ What are the s of sections that contain a figure: //section[/figure]/ The lord of the rings Locating middleearth books book A hall fit for a king section section figure......... description King Theoden's golden hall Predicates 15

How to Query XML Data efficiently? RDBMS have achieved high performance in query evaluation. Can we leverage RDBMS by encoding XML to tables? 16

Analogy: Fourier Transforms Complex + g * h = - g(u)h(u)du Efficient G(f)H(f) 17

Mapping XML Data to RDBMS XPath Challenge: How to build the bridge between hierarchies and tables? XML fragments Query Translation SQL Storage Mapping XML data Relational databases 18

Data Mapping (3) The lord of the rings A hall fit for a king (1) books (2) book section (4) section figure (5) Locating middleearth description King Theoden's golden hall Parent ID [Florescu & Kossmann 99] T ID Tag Value Structural Information 1 books 2 book 3 The... 4 section 5 Locating 19

Data Mapping (3) The lord of the rings A hall fit for a king (1) books (2) book section (4) section figure (5) Locating middleearth description King Theoden's golden hall Design special labels to encode. node relationships T ID Tag Value Structural Information 1 books 2 book 3 The... 4 section 5 Locating 20

Query Translator Architecture XPath XPath decomposition Sub-query translation SQL sub-query composition SQL Query Translator How to choose XPath subqueries, such that: they can be easily translated to SQL subqueries the SQL subqueries can be efficiently evaluated How to combine SQL subqueries to a complete one? 21

Query Translator book figure section Q: //book[//figure]/section/ 22

Query Translator: (I) Decomposition to Suffix Paths book book figure section Q: //book[//figure]/section/ 23

Encoding Suffix Paths Using P-labeling //book/section/ (342000,343000) σ 342000 Plabel 343000 T books (1) book (2)... (3) (4) section... The lord of the (5) (100) rings section Locating middle-... earth figure A hall fit for a description king King Theoden's golden hall Evaluating suffix paths SQL selections on P-labels T id Plabel 1 100000 2 210000 3 321000 4 421000 5 342100 24

Query Translator: (II) Selection on P-labels book book figure section Q: //book[//figure]/section/ 25

D-labeling Scheme books (1, 20000, 1) book(6, 1200, 2)... D-labeling is used to connect suffix paths. (10,80,3) (81, 250,3) section... The lord of the (100, 200,4) rings section Locating middleearth figure (120, 160,... 5) A hall fit for a description king King Theoden's golden hall D-labels (start, end, depth) can be used to detect ancestor-descendant relationships between nodes in a tree. 26

A Bi-labeling Based Query Translation //book//figure (σ Plabel for //book (ρ T T )) books (1, 20000, 1) book(6, 1200, 2)... (10,80,3) (81, 250,3) section... The lord of the (100, 200,4) rings section Locating middleearth Figure (120, 160, 5) A hall... fit for a description king King Theoden's golden hall Stitching suffix paths SQL joins on D-labels T.Start T.Start T.End T.End (σ Plabel for //figure T) T Plabel Start End Depth Val 100000 1 20000 1 210000 6 1200 2 321000 10 80 3 421000 81 250 3 442100 100 200 4 554680 120 160 5 27

Query Translator: (III) Join on D-labels book book figure section Q: //book[//figure]/section/ 28

Comparison with Previous Approach book book book figure section figure section Previous Approach [Li & Moon 01, Zhang et al 01, Tatarinov et al 02, Grust 02, DeHaan et al 03, etc] Ours: fewer disk accesses, fewer joins 29

Experiment Setup Compare our system (BLAS) with the previous approach using D-labeling scheme only Data sets Data Set Query sets Suffix path queries Path queries XPath queries Benchmark queries Size(MB) Nodes(K) Tags Depth DTD Protein 70 2277 66 7 Tree Shakespeare 26 640 19 7 Acyclic graph Auction 69 1238 77 12 Cyclic graph Query Engines: DB2, TwigStack Join [Bruno et al 02] 30

Query Execution Time Time 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% QA1 QA2 QA3 QP1 QP2 QP3 QS1 QS2 QS3 Queries D-labeling BLAS Query Name: A:Auction P: Protein S: Shakespeare 1: suffix path query 2: path query 3: XPath query 31

Scalability Time (seconds) 14 12 10 8 6 4 2 0 Benchmark data, Benchmark query Q3 35 70 105 139 174 File Size (MB) DLabeling BLAS 32

Summary of XML Data Management We proposed a generic XML-to-RDB mapping, based on a bi-labeling scheme. It is more efficient compared with previous approach, since it generates SQL queries that require: fewer disk accesses fewer joins fewer intermediate results Experiments show the effectiveness 33

Roadmap of This Talk Managing XML by leveraging mature RDBMS [Chen et al 04] Introduction to XML A generic and efficient XML-to-RDBMS mapping Data mapping from trees to tables Query translation from tree navigation queries to SQL queries that are efficient Handling imprecise and incomplete data in DBMS [Chen et al 06] 34

Probabilistic Databases: Managing Imprecise Data How to measure the imprecision of the data? The simplest model: associate each tuple a probability Climber Climbs Name Skill age Pr James Beginner 21 p1 Bob Experienced 33 p2 Name Route Date Duration Pr Bob Last Tango 10/10/05 5 q1 Bob Last Tango 1/10/06 4.5 q2 Assume that all the tuples are independent events 35

How Does This Affect Query Evaluation? [Fuhr&Roellke 97] v p v 1 v 2 p 1 p 2 v 1-(1-p 1 )(1-p 2 ) σ Π v p v 1 p 1 v 2 p 2 v p 1 v p 2 36

Challenges: Correctness Depends on Execution Order [Suciu et al 05]! Find distinct climb routes that have been climbed by an experienced climber. LT 1-(1-p2q1)(1- p2q2) Bob LT p2(1-(1-q 1 )(1-q 2 )) Π Wrong! Bob LT p2q1 Bob LT p2q2 Correct LT 1-(1-q 1 )(1-q 2 ) Π Name Skill Pr Bob Exp p2 Name Route Pr Bob LT q1 Bob LT q2 Name Skill Pr Bob Exp p2 Name Route Pr Bob LT q1 Bob LT q2 37

How to Handle Incomplete Data? Find cars that are convertible and have price less than 10k. Seller Make Model Body Style Year Price Mark BMW 325Cic convertible 2006 36000 Alice Ford Taurus 2001 8000 This can not be convertible! Our techniques infer possible values with probability by mining data statistics 38

Querying Incomplete Databases Given missing value prediction Querying incomplete databases Querying probabilistic databases What if we are not able to store the predicted values? --- e.g. data integration applications On-the-fly query rewriting How should we handling imprecise and incomplete XML Data? 39

Conclusions Traditional RDBMS does not satisfy the requirements in ever growing scientific and web applications We have discussed two enhancement to RDBMS Efficient XML data management Handling imprecise and incomplete data Other enhancement to RDBMS that I am working on Data stream processing Scientific workflow modeling and query processing 40

Thank you! Questions? 41