Efficient Storage and Temporal Query Evaluation of Hierarchical Data Archiving Systems

Similar documents

Archiving Scientific Data - A Practical Approach

The Database Wiki Project: A General-Purpose Platform for Data Curation and Collaboration

Caching XML Data on Mobile Web Clients

Database Technologies

Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis

Sorting Hierarchical Data in External Memory for Archiving

Creating Synthetic Temporal Document Collections for Web Archive Benchmarking

An Efficient Algorithm for Web Page Change Detection

Modeling and Querying E-Commerce Data in Hybrid Relational-XML DBMSs

Binary Coded Web Access Pattern Tree in Education Domain

XML Data Integration Based on Content and Structure Similarity Using Keys

QuickDB Yet YetAnother Database Management System?

Physical Data Organization

Research Problems in Data Provenance

A Model For Revelation Of Data Leakage In Data Distribution

Technologies for a CERIF XML based CRIS

A MEDIATION LAYER FOR HETEROGENEOUS XML SCHEMAS

Managing large sound databases using Mpeg7

Sharing large data collections between mobile peers

Efficient Mapping XML DTD to Relational Database

Enhancing Traditional Databases to Support Broader Data Management Applications. Yi Chen Computer Science & Engineering Arizona State University

XML DATA INTEGRATION SYSTEM

State History Storage in Disk-based Interval Trees

A Workbench for Prototyping XML Data Exchange (extended abstract)

XStruct: Efficient Schema Extraction from Multiple and Large XML Documents

Pushing XML Main Memory Databases to their Limits

SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs

Graph Mining and Social Network Analysis

Summary of Alma-OSF s Evaluation of MongoDB for Monitoring Data Heiko Sommer June 13, 2013

Efficient XML-to-SQL Query Translation: Where to Add the Intelligence?

Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015

Graph Database Performance: An Oracle Perspective

How To Improve Performance In A Database

ABSTRACT 1. INTRODUCTION. Kamil Bajda-Pawlikowski

KEYWORD SEARCH IN RELATIONAL DATABASES

! E6893 Big Data Analytics Lecture 9:! Linked Big Data Graph Computing (I)

Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010

Col*Fusion: Not Just Jet Another Data Repository

Duplicate Detection Algorithm In Hierarchical Data Using Efficient And Effective Network Pruning Algorithm: Survey

Report Paper: MatLab/Database Connectivity

XRecursive: An Efficient Method to Store and Query XML Documents

Original-page small file oriented EXT3 file storage system

Deferred node-copying scheme for XQuery processors

Storing and Querying XML Data using an RDMBS

ON ANALYZING THE DATABASE PERFORMANCE FOR DIFFERENT CLASSES OF XML DOCUMENTS BASED ON THE USED STORAGE APPROACH

Automatic Annotation Wrapper Generation and Mining Web Database Search Result

Elastic Enterprise Data Warehouse Query Log Analysis on a Secure Private Cloud

CSE 326: Data Structures B-Trees and B+ Trees

Static Data Mining Algorithm with Progressive Approach for Mining Knowledge

Inferring Fine-Grained Data Provenance in Stream Data Processing: Reduced Storage Cost, High Accuracy

White Paper. Better Performance, Lower Costs. The Advantages of IBM PowerLinux 7R2 with PowerVM versus HP DL380p G8 with vsphere 5.

Big Data Provenance: Challenges and Implications for Benchmarking

DBMS / Business Intelligence, SQL Server

Ag + -tree: an Index Structure for Range-aggregation Queries in Data Warehouse Environments

Development of Monitoring and Analysis Tools for the Huawei Cloud Storage

Join Minimization in XML-to-SQL Translation: An Algebraic Approach

Lecture Data Warehouse Systems

Change Manager 5.0 Installation Guide

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

GRAPH PATTERN MINING: A SURVEY OF ISSUES AND APPROACHES

ArcGIS for Server Performance and Scalability: Testing Methodologies. Andrew Sakowicz, Frank Pizzi,

Quiz! Database Indexes. Index. Quiz! Disc and main memory. Quiz! How costly is this operation (naive solution)?

PERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0

Novel Data Extraction Language for Structured Log Analysis

CHAPTER 3 PROPOSED SCHEME

Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows

Image Compression through DCT and Huffman Coding Technique

Concurrency Control. Chapter 17. Comp 521 Files and Databases Fall

Information Discovery on Electronic Medical Records

The basic data mining algorithms introduced may be enhanced in a number of ways.

General Purpose Database Summarization

Efficient Structure Oriented Storage of XML Documents Using ORDBMS

Energy Efficiency in Secure and Dynamic Cloud Storage

Introduction to XML Applications

Unraveling the Duplicate-Elimination Problem in XML-to-SQL Query Translation

What is a database? COSC 304 Introduction to Database Systems. Database Introduction. Example Problem. Databases in the Real-World

Firewall Design: Consistency, Completeness, Compactness

Binary Image Scanning Algorithm for Cane Segmentation

Process Mining by Measuring Process Block Similarity

Fast Contextual Preference Scoring of Database Tuples

Supporting Database Provenance under Schema Evolution

Improving Query Performance Using Materialized XML Views: A Learning-Based Approach

The Planets Preservation Planning workflow and the planning tool Plato

Supporting Ontology-based Keyword Search over Medical Databases

Database Design Patterns. Winter Lecture 24

Data Management in RFID Applications

Dependency Free Distributed Database Caching for Web Applications and Web Services

How To Test For Performance And Scalability On A Server With A Multi-Core Computer (For A Large Server)

Three Effective Top-Down Clustering Algorithms for Location Database Systems

A Platform for Supporting Data Analytics on Twitter: Challenges and Objectives 1

On Mining Group Patterns of Mobile Users

Constraint Preserving XML Storage in Relations

An Efficient and Scalable Management of Ontology

Persistent Binary Search Trees

Mining Large Datasets: Case of Mining Graph Data in the Cloud

Protein Protein Interactions (PPI) APID (Agile Protein Interaction DataAnalyzer)

Merkle Hash Tree based Techniques for Data Integrity of Outsourced Data

Learning Outcomes. COMP202 Complexity of Algorithms. Binary Search Trees and Other Search Trees

Efficient Iceberg Query Evaluation for Structured Data using Bitmap Indices

XML Fragment Caching for Small Mobile Internet Devices

Transcription:

Efficient Storage and Temporal Query Evaluation of Hierarchical Data Archiving Systems Hui (Wendy) Wang, Ruilin Liu Stevens Institute of Technology, New Jersey, USA Dimitri Theodoratos, Xiaoying Wu New Jersey Institute of Technology, New, Jersey, USA

Scientific Data in XML extensible Markup Language (XML): A mark-up language XML for scientific data Describe data structure and give simple processing instructions (e.g., XSIL and XDMF) Provide common data format that is an open, universal standard (e.g, CML, MathML, GML) Swiss-Prot Dataset 2

Updates on Scientific Data in XML Scientific databases are continuously updated Necessity of maintaining all versions in the archive Problem: as increasing volumes of these data being accumulated, the archives can reach a critical mass. 3

An Example of Multiple XML Dataset Versions Four consecutive instances of an extract of the Swiss-Prot Dataset 4

Challenges (1) how to store successive versions of XML databases in an archiving database in a cost-effective way. (2) how to evaluate queries efficiently on the archiving database. 5

Our Contributions A novel compact and updateable storage scheme for XML archiving databases. A simple yet expressive query language for XML archiving databases. Optimization of query evaluation over the compact storage. 6

Outline Preliminaries Compact storage Temporal query language Query optimization Experiments Conclusion 7

Compact Storage Scheme (I) Merge versions Store multiple occurrences of the same node only once in the archiving database A. Each node p in A is associated with a timestamp set which contains the timestamps of the instances of p. 8

An Example of Merging root 0 Entry 1 Species 2 Features 3 Descr 4 Rattus norvegicus Domain 5 Descr 6 100KDA protein Versions root 0 Entry 1 Species 2 Features 3 Descr 4 Ref 7 Domain 5 Author 8 Rattus norvegicus 100KDA protein PRO-RICH Version 1 Version 2 root 0 [1-2] [1-2] Entry 1 Species 2 Features 3 Descr 4 Ref 7 [1-2] [1-2] [1-2] [2] Rattus norvegicus Domain 5 [1-2] Descr 6 [1] PRO-RICH 100KDA protein Author 8 [2] Nene V Nene V Monotone property on timestamps of the nodes on the same path 9

Compact Storage Scheme (II) Timestamp set compaction The timestamp label of the parent node only preserves the timestamps that do not appear on any of its children. Lp = ts(p) cchild(p) ts(c) 10

An Example of Timestamp Set root 0 Compaction root 0 Entry 1 Species 2 Features 3 Descr 4 Rattus norvegicus Domain 5 Descr 6 100KDA protein Entry 1 Species 2 Features 3 Descr 4 Ref 7 Domain 5 Author 8 Rattus norvegicus 100KDA protein PRO-RICH Version 1 Version 2 Nene V root 0 [1-2] [1-2] Entry 1 Species 2 Features 3 Descr 4 Ref 7 [1-2] [1-2] [1-2] [1-2] [1-2] [2] Domain 5 Author 8 Rattus norvegicus [1-2] [2] 100KDA protein Descr [2] [2] 6 Nene V [1] [1] PRO-RICH 11

An Example of Compact root 0 Entry 1 Storage root 0 Entry 1 Species 2 Features 3 Descr 4 Rattus norvegicus Domain 5 Descr 6 PRO-RICH 100KDA protein Species 2 Features 3 Descr 4 Ref 7 Domain 5 Author 8 Rattus norvegicus 100KDA protein Nene V root 0 Entry 1 Species 2 Features 3 Descr 4 Ref 7 [1-2] [1-2] Domain 5 Author [2] 8 100KDA protein [2] Descr 6 Nene V [1] Rattus norvegicus PRO-RICH 12

Efficient Updates Incremental timestamp label computation When insert a new instance D k at timestamp k, root 0 add the newly inserted trees T 1,..., T m in D k to the archive, For leaf nodes in D k, update the timestamp labels of their corresponding archive nodes by adding timestamp k. root 0 Entry 1 Entry 1 Features 3 Descr 9 Domain 5 Version 3 Species 2 Features 3 Descr 4 Ref 7 [1-2] [1-2] Domain 5 100KDA Author [2] 8 protein [2] Descr 6 Nene V [1] Rattus norvegicus PRO-RICH Archiving Dataset before Update Descr 9 13

Efficient Updates Incremental timestamp label computation When insert a new instance D k at timestamp k, root 0 add the newly inserted trees T 1,..., T m in D k to the archive, update the archive nodes of the leaf nodes in D k root by adding the timestamp k. 0 Entry 1 Entry 1 Species 2 Features 3 Descr 4 [1-2] [1-2] Ref 7 Descr 9 [3] Features 3 Descr 9 Domain 5 Version 3 Rattus norvegicus Domain 5 [2-3] Descr 6 [1] PRO-RICH 100KDA protein Author 8 [2] Nene V Archiving Dataset after Update 14

Comparison with Related Work Compact storage [1] (Top-down approach) Remove timestamps at children if they are the same as those of the parent node Compared with our solution (bottom-up approach) w.r.t. # of updated timestamp labels when inserting a new instance D k into the archiving database A TD: # of nodes in D k whose corresponding nodes in A have timestamp labels + new nodes in D k BU: # of leaf nodes in D k 15

An Example of TD V.S. BU root 0 [1-4] Entry 1 Species[2-3] 2 Features 3 Descr [2-4] 4 [3-4] Ref 7 [2-4] Rattus norvegicus Domain 5 [3-4] Descr [4] 6 PRO-RICH 100KDA protein Author 8 [2] Nene V TD approach: Archiving Dataset before Update root 0 Entry 1 Species 2 Features 3 Descr 4 Ref 7 [2-3] [2] [3-4] [3-4] Domain 5 100KDA Author [3] 8 protein [2] Descr 6 Nene V [4] Rattus norvegicus PRO-RICH BU approach: Archiving Dataset before Update 16

An Example of TD V.S. BU root 0 [1-5] Entry 1 Species[2-3] 2 Features 3 Descr [2-5] 4 [3-4] Ref 7 [2-4] root 0 Entry 1 Features 3 Domain 5 Descr 6 Version 5 Rattus norvegicus Domain 5 [3-5] Descr [4-5] 6 PRO-RICH root 0 Entry 1 100KDA protein Author 8 [2] Nene V TD approach: Archiving Dataset after Update Species 2 Features 3 Descr 4 Ref 7 [2-3] [2] [3-4] [3-4] Domain 5 100KDA Author [3] 8 protein [2] Descr 6 [4-5] Rattus norvegicus PRO-RICH Nene V BU approach: Archiving Dataset after Update 17

Outline Compact storage Temporal query language Query optimization Experiments Conclusion 18

Temporal Query Language Types: Snapshot history trace Temporal constraints: includes(t), overlaps(t a, t b ), before(t)/after(t), contains(t a, t b )/is_contained(t a, t b ), meets(t a, t b ) Temporal queries: XML structural queries + temporal constraints on query nodes 19

Archivin g databas e Evaluation of Temporal Constraints over Compressed Timestamps root 0 [1-2] Entry 1 Species 2 Features 3 Descr 4 Ref 7 [1-2] [1-2] [1-2] Domain 5 [1-2][2] Author 8 100KDA protein [2] Descr[1] 6 Rattus norvegicus Nene V Answer Descr 4 PRO-RICH 100KDA protein root Query Entry include(2) Domain overlaps(1,3) Descr* contain(1,2) 20

Temporal Evaluation Annotations DC (Descendant Check) LC (Local Check) NC (No Check) Archive root 0 [1-2] Entry 1 Species 2 Features 3 Descr 4 Ref 7 [1-2] [1-2] [1-2] Domain 5 [1-2][2] Author 8 100KDA protein [2] Descr 6 Nene V [1] Rattus norvegicus PRO-RICH Query root Entry DC include(2) LC Domain Descr* overlaps(1,3) contain(1,2) 21

Cost Model Cost model of temporal evaluation annotations where T q is the number of nodes of the tree rooted at q in a query. root 0 Query Domain Entry overlaps(1,3) [1-2] Entry 1 root include(2) Descr* contain(1,2) NC(0) DC(2) LC(1) Archive Species 2 Features 3 Descr 4 Ref 7 [1-2] [1-2] [1-2] Domain 5 [1-2][2] Author 8 100KDA protein [2] Descr 6 [1] Rattus norvegicus PRO-RICH Nene V 22

Outline Preliminaries Compact storage Temporal query language Query optimization Experiments Conclusion 23

Optimization Problem DC is expensive (recursive check) DC can be replaced by LC/NC in some cases Goal: Replace as many DCs as possible with LCs/NCs 24

An Example of Optimization root Entry include(2) DC Descr* contain(1,4) DC Query Q root Database Schema Entry include(2) DC NC Descr* contain(1,4) DC LC After optimization 25

Inference Rules We use inference rules to find redundant temporal annotations Inference rules: P 1,, P k R if the premises P 1,, P k are true, then the conclusion R is also true. Types of inference rules Without database schema With presence of database schema 26

Inference Rules: No Database Schema AD(ancestor-descendant) Rule: Q = p//q q t p TR(transitivity) Rule: p t q and q t r p t r 27

Inference Rules: with Database Schema SP(SinglePath) Rule: Q = p//q, SinglePath(p, q ) p = t q DC(descendant) Rule: Q = p//q, Q = p//r, SinglePath(p, r ) q t r DE(derived) Rule: Q = p//q, Q = p//r, SinglePath(p, r ), SinglePath(q, r ) r t q 28

Temporal Constraint Graph Query root Temporal constraint graph Entry Features include(2) overlaps(1,3) contain(1,4) Descr* contain(3,4) Species An edge from p to q indicates an inferred p t q relationship 29

Temporal Constraint Consumption We check temporal constraint consumption on the temporal constraint graph Consuming temporal constraint on p Consumed temporal constraint on q include(t) p t q includes(t) overlaps(t 1, t 2 ), t [t 1, t 2 ] contains(t 1, t 2 ) includes(t 3 ), t 3 [t 1, t 2 ] contains(t 3, t 4 ), t 3 t 1, t 4 t 2 overlaps(t 3, t 4 ), t 1 < t 3 < t 2 or t 1 < t 4 < t 2 q t p is_contained(t 1, t 2 ) is_contained(t 3, t 4 ), t 3 t 1, t 4 t 2 before(t) after(t) before(t 1 ), t 1 t after(t 1 ), t 1 t q = t p meets(t 1, t 2 ) meets(t 1, t 2 ) 30

An Example of Query Optimization Consuming temporal constraint on p (Descr) contains(1, 4) Consumed temporal constraint on q (Entry) Descr t Entry includes(2) root Entry include(2) DC Features overlaps(1,3) DC NC Descr* contain(1,4) DC Species contain(3,4) DC LC LC NC 31

Outline Preliminaries Compact storage Temporal query language Query optimization Experiments Conclusion 32

Experiment Setup Hardware Intel Core 2 CPU 2.40 GHz processor, 4.00 GB of RAM Software OS: Windows 7 The algorithms were implemented in Java JDOM engine: parse the XML databases Wutka DTD parser: parse the XML DTD Oracle Berkeley DB XML engine: query evaluation 33

Datasets Synthetic dataset using the IBM XML generator on the DTD of the XMark benchmark Real dataset Treebank dataset Dataset Size # of elements Max. depth Avg. depth Treebank 22.3MB 491108 36 8.5 Xmark 14.6MB 160929 21 20.068 34

Versions For Xmark and Treebank datasets, we created 50 and 20 consecutive database instances respectively. Each instance was generated from the previous one by first deleting and then inserting (sub)trees. The (inserted and deleted) trees take 10% of the nodes of the database instance. 35

Experiment Three storage approaches: The naive approach (NA): keeps the timestamp sets as un-compacted The top-down (TD) approach ([1]): eliminates the timestamps of the children nodes that are identical to the parent. Our bottom-up (BU) approach: eliminates the timestamps from the parent nodes that are repeated on children. 36

Experiment: Archiving Archiving Time Overhead Compaction ratio Dataset Top-down Bottom-up XMark (shallow&fat) XMark (deep&thin) 2.15% 2.72% 3.09% 2.75% Total number of timestamps (space overhead) 37

Experiment: Update Cost Summary of archiving overhead Both TD and BU can reduce the number of timestamps in the archive. The difference between TD and BU regarding the number of timetstamps is not significant. BU always has much better update cost than TD. 38

Experiment: Query Optimization Temporal Constraint Evaluation Optimization Our optimization can bring significant performance improvement with negligible overhead 39

Outline Preliminaries Compact storage Temporal query language Query optimization Experiments Conclusion 40

Conclusion We proposed an efficient XML archiving database system that consists of A novel compact and updateable storage scheme A simple yet expressive query language for XML archiving databases Optimization of query evaluation over the compact storage 41

Future Work Consider additional temporal constraints Consider unrestricted database schemas that may contain cycles Design efficient optimization algorithms that can work in this broader framework 42

Thank you! 43

Reference [1] P. Buneman, S. Khanna, K. Tajima, and W.-C. Tan. Archiving scientific data. In ACM Transactions on Database Systems, 2004. [2] A. P. Chapman, H. Jagadish, and P. Ramanan. Efficient provenance storage. In SIGMOD, 2008. [3] S. Chawathe and H. Garcia-molina. Meaningful change detection in structured data. In SIGMOD, 1997. [4] P. T. Jayant and J. R. Haritsa. Xgrind: A query-friendly xml compressor. In ICDE, 2002. [5] H. Liefke and D. Suciu. XMill: an efficient compressor for XML data. In SIGMOD, 1999. [6] H. M uller, P. Buneman, and I. Koltsidas. Xarch: Archiving scientific and reference data. In SIGMOD, 2008. [7] F. Rizzolo and A. A. Vaisman. Temporal xml: modeling, indexing, and query processing. The VLDB Journal, 17:1179 1212, August 2008. [8] F. Wang and C. Zaniolo. Temporal queries in XML document archives and web warehouses. In TIME-ICTL, 2003. [9] F. Wang and C. Zaniolo. Temporal queries and version management in XML-based docu-ment archives. Data Knowl. Eng., 65:304 324, May 2008. 44

Outline Preliminaries Compact storage Temporal query language Query optimization Experiments Conclusion 45

XML database XML database: tree-structured, ID-based. 46

Archiving XML database XML instances: XML database at certain time point Each instance is associate with a timestamp (version number) Archiving Database: multiple instances are merged into one database 47

Updating XML database Update operations: insertion deletion A sequence of update operations can be modeled as a set of deletions followed by a set of insertions. root 0 root 0 Entry 1 Entry 1 Species 2 Features 3 Descr 4 Species 2 Features 3 Descr 4 Ref 7 Domain 5 Rattus norvegicus 100KDA protein Descr 6 Domain 5 Rattus norvegicus 100KDA protein Descr 6 Author 8 Nene V PRO-RICH PRO-RICH 48

A Piece of Related Work P. Buneman, et al. Archiving scientific data. TODS, 2004. Solution: Merge instances Remove timestamps if same as parent node (Top-down) Weakness: Inefficient updates Inefficient query 49

Witness Graph Definition Given a query Q, a witness graph for Q is a graph WQ such that a)the nodes of W Q correspond to the nodes of Q b)there is an edge from node p to node q in W Q, iff p is a witness node of q in Q. Example Construct the witness graph from the temporal constraint graph after temporal annotation consumption (next page) 50