COMP9314: XML. Some reference sites. What is XML? Semistructured Data / XML

Transcription:

COMP9314: XML
Raymond Wong, National ICT Australia & University of New South Wales
wong@cse.unsw.edu.au

Some reference sites
- www.w3.org
- www.xml.com
- www.xml.org
- www.oasis-open.org

What is XML?
- XML is just a markup language defined by W3C (officially in Feb 98)
- It's a simplified version of SGML
- HTML is for presentation markup; XML is for content markup
- [Diagram: HTML4.0 and XML in relation to SGML]

Semistructured Data / XML
- Semistructured => loosely structured (no restrictions on tags & nesting relationships)
- No schema required
- XML falls under the semistructured umbrella
- Self-describing
- The standard for information representation & exchange

Storage format vs presentation format - The power of markup

Traditional database or spreadsheet:
  Raymond, Wong, wong, 5932,
  John, Smith, jsmith, 1234,
  ...

HTML:
  <br>
  <font size=1 color="ff003a">
  <ul>
  <li> <b> Raymond Wong </b> </li>
  <li> Login: wong </li>
  <li> Phone: <i> x5932 </i> </li>
  </ul>
  </font>

XML:
  <Staff>
    <Name>
      <FirstName> Raymond </FirstName>
      <LastName> Wong </LastName>
    </Name>
    <Login> wong </Login>
    <Ext> 5932 </Ext>
  </Staff>

The Family of XML Technologies
- Document structures, e.g. the definition of gst, price, ...: XMLSchema / DTD
- Presentation format info: XML Stylesheet

  <Product>
    <product_id> m101 </product_id>
    <name> Sony walkman </name>
    <currency> AUD </currency>
    <price> 200.00 </price>
    <gst> 10% </gst>
  </Product>

The power of XML is that it facilitates the definition of common standards for information representation & exchange.
[Diagram: data exchanged among Distributors, CRMs, Corps, ISPs, Telcos, Suppliers, Sales people and Messaging over internet, intranet, voice line and wireless]

DBMS support for XML data
When more & more data are in XML, possibly from different information sources, there is a need for an XML DB:
- query and update language/interface
- concurrency control & crash recovery
- access control and authorization
- views & triggers

Why need to query XML data
- To extract data from large XML docs
- To exchange data (data- or query-shipping)
- To exchange data between different user communities, ontologies or schemas
- To integrate data from multiple XML sources

XML data file can be modeled as a tree

  <Staff>
    <Name>
      <FirstName> Raymond </FirstName>
      <LastName> Wong </LastName>
    </Name>
    <Login> wong </Login>
    <Ext> 5932 </Ext>
  </Staff>

Tree:
  Staff
    Name
      FirstName: Raymond
      LastName: Wong
    Login: wong
    Ext: 5932

XML Terminology
- tags: book, title, author, ...
- start tag: <book>, end tag: </book>
- elements: <book> ... </book>; elements are nested
- empty element: <red></red>, abbreviated <red/>
- an XML document: single root element
- well-formed XML document: if it has matching tags

More XML: Attributes

  <book price="55" currency="USD">
    <title> Foundations of Databases </title>
    <author> Abiteboul </author>
    <year> 1995 </year>
  </book>
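The tree view and the well-formedness rule above can be tried out with a short Python sketch. It assumes the standard-library `xml.etree.ElementTree` parser; the helper names `tree_lines` and `is_well_formed` are illustrative and do not come from the slides.

```python
import xml.etree.ElementTree as ET

STAFF_XML = """<Staff>
  <Name>
    <FirstName>Raymond</FirstName>
    <LastName>Wong</LastName>
  </Name>
  <Login>wong</Login>
  <Ext>5932</Ext>
</Staff>"""


def tree_lines(element, depth=0):
    """Return one indented line per element; leaf elements also show their text."""
    text = (element.text or "").strip()
    label = element.tag + (f": {text}" if text else "")
    lines = ["  " * depth + label]
    for child in element:
        lines.extend(tree_lines(child, depth + 1))
    return lines


def is_well_formed(xml_text):
    """True if the text parses: a single root element and properly matched tags."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(STAFF_XML)
print("\n".join(tree_lines(root)))
print(is_well_formed("<book><title>x</title></book>"))  # matching tags
print(is_well_formed("<book><title>x</book></title>"))  # tags closed in the wrong order
print(is_well_formed("<a></a><b></b>"))                 # more than one root element
```

A non-validating parser such as this one checks only well-formedness; it does not check the document against a DTD or schema.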

More XML: Oids and References

  <person id="o555"> <name> Jane </name> </person>
  <person id="o456"> <name> Mary </name>
    <children idref="o123 o555"/>
  </person>
  <person id="o123" mother="o456"> <name> John </name> </person>

More XML: CDATA Section
Syntax: <![CDATA[ ...any text here... ]]>
Example:
  <example>
    <![CDATA[ some text here </notatag> <> ]]>
  </example>

More XML: Entity References
Syntax: &entityname;
Example: <element> this is less than &lt; </element>
Some entities:
  &lt;    <
  &gt;    >
  &amp;   &
  &apos;  '
  &quot;  "
  &#38;   Unicode char

More XML: Processing Instructions
Syntax: <?target argument?>
Example:
  <product>
    <name> Alarm Clock </name>
    <?ringbell 20?>
    <price> 19.99 </price>
  </product>

More XML: Comments
Syntax: <!-- ... Comment text ... -->

XML Namespaces
http://www.w3.org/TR/REC-xml-names (1/99)
name ::= [prefix:]localpart

  <book xmlns:isbn="www.isbn-org.org/def">
    <title> ... </title>
    <number> 15 </number>
    <isbn:number> ... </isbn:number>
  </book>
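How a parser treats entity references, CDATA sections and namespace prefixes can be seen in a small Python sketch. It assumes the standard-library `xml.etree.ElementTree` module; the `<note>` element and the `placeholder-isbn` value are made up for illustration, while the namespace name is the one used on the slide.

```python
import xml.etree.ElementTree as ET

# Assumption: <note> and the ISBN value are placeholders, not from the slides.
DOC = """<book xmlns:isbn="www.isbn-org.org/def">
  <title>Less than: &lt; and ampersand: &amp;</title>
  <number>15</number>
  <isbn:number>placeholder-isbn</isbn:number>
  <note><![CDATA[ some text here </notatag> <> ]]></note>
</book>"""

root = ET.fromstring(DOC)

# Entity references are replaced by the characters they stand for.
title = root.find("title").text

# A CDATA section is delivered as plain text; the markup inside is not parsed.
note = root.find("note").text

# The unprefixed <number> and the prefixed <isbn:number> are different names.
plain = root.find("number")
qualified = root.find("isbn:number", {"isbn": "www.isbn-org.org/def"})

print(title)
print(repr(note))
print(plain.tag, plain.text)
print(qualified.tag, qualified.text)
```

ElementTree expands a prefixed name to `{namespace}localpart`, which makes the point of the slide concrete: the prefix is only shorthand for the namespace name.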

XML Namespaces
- syntactic: <number>, <isbn:number>
- semantic: provide URL for schema

  <tag xmlns:mystyle="http://...">        (mystyle is defined here)
    <mystyle:title> ... </mystyle:title>
    <mystyle:number> ...
  </tag>

Need more?
- Read the specs from W3C!
- Some XML books may give you more examples that are better organized and more complete; however, most materials can be found on the net - use Google.

Parsing

XML Parsers
There are several different ways to categorise parsers:
- Validating versus non-validating parsers
- Parsers that support the Document Object Model (DOM)
- Parsers that support the Simple API for XML (SAX)
- Parsers written in a particular language (Java, C, C++, Perl, etc.)

Non-Validating Parsers
- Speed and efficiency
- It takes a significant amount of effort for an XML parser to process a DTD / XML Schema and make sure that every element in an XML document follows the rules of the DTD / Schema.
- If we only want to find tags and extract information, we should use a non-validating parser

Using an XML Parser
Three basic steps in using an XML parser:
- Creating a parser object
- Passing the XML document to the parser
- Processing the results
Generally, writing out XML is not in the scope of parsers (though some may implement proprietary mechanisms)

The SAX Parser
- SAX parser is an event-driven API
- An XML document is sent to the SAX parser
- The XML file is read sequentially
- The parser notifies the class when events happen, including errors
- The events are handled by the API methods that the programmer implemented

SAX Parser Events
- A SAX parser generates events at the start and end of a document, at the start and end of an element, when it finds characters inside an element, and at several other points
- The user writes the code that handles each event and decides what to do with the information from the parser

Example Event Handlers
- startelementhandler
- endelementhandler
- chardatahandler
- CDATASectionHandler
- CommentHandler
- PIHandler
- etc...

When to (not to) use SAX
- Ideal for simple operations on XML files, e.g. reading and extracting elements
- Good for very large XML files (cf. DOM)
- Not good if we want to manipulate XML structure
- Not designed for writing out XML

DOM
- Document Object Model
- Set of interfaces for an application that reads an XML file into memory and stores it as a tree structure
- The abstract API allows for constructing, accessing and manipulating the structure and content of XML and HTML documents

DOM Parser produces a memory tree (DOM Tree) after parsing

  <Staff>
    <Name>
      <FirstName> Raymond </FirstName>
      <LastName> Wong </LastName>
    </Name>
    <Login> wong </Login>
    <Ext> 5932 </Ext>
  </Staff>

DOM Tree:
  Staff
    Name
      FirstName: Raymond
      LastName: Wong
    Login: wong
    Ext: 5932
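The contrast between the two parser styles can be sketched in Python using the standard-library `xml.sax` and `xml.dom.minidom` modules. The handler class `StaffHandler` and the added `<Email>` element are illustrative assumptions, not part of the slides; the three steps (create a parser or handler, pass it the document, process the results) follow the list given earlier.

```python
import xml.sax
from xml.dom import minidom

STAFF_XML = (
    "<Staff><Name><FirstName>Raymond</FirstName><LastName>Wong</LastName></Name>"
    "<Login>wong</Login><Ext>5932</Ext></Staff>"
)


class StaffHandler(xml.sax.ContentHandler):
    """SAX: record element events and collect the text of each leaf element."""

    def __init__(self):
        super().__init__()
        self.events = []   # ("start" | "end", tag) in document order
        self.values = {}   # leaf tag -> text content
        self._buffer = []  # character chunks seen since the last tag

    def startElement(self, name, attrs):
        self.events.append(("start", name))
        self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate rather than assign.
        self._buffer.append(content)

    def endElement(self, name):
        text = "".join(self._buffer).strip()
        if text:
            self.values[name] = text
        self._buffer = []
        self.events.append(("end", name))


# SAX: the document is read sequentially and no tree is kept in memory.
handler = StaffHandler()
xml.sax.parseString(STAFF_XML.encode("utf-8"), handler)
print(handler.events[:3])
print(handler.values)

# DOM: the whole document becomes an in-memory tree that can be modified.
dom = minidom.parseString(STAFF_XML)
login_text = dom.getElementsByTagName("Login")[0].firstChild.data

email = dom.createElement("Email")  # hypothetical extra element
email.appendChild(dom.createTextNode("wong@cse.unsw.edu.au"))
dom.documentElement.appendChild(email)
serialized = dom.documentElement.toxml()
print(login_text)
print(serialized)
```

The SAX half only extracts values as they stream past, which is why it suits very large files; the DOM half edits the structure and writes it back out, which SAX is not designed to do.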

Why to Use DOM
- Task of writing parsers is reduced to coding against the DOM Tree API
- Domain-specific frameworks will be written on top of DOM

XPath

XPath
- http://www.w3.org/TR/xpath (11/99)
- Building block for other W3C standards: XSL Transformations (XSLT), XML Link (XLink), XML Pointer (XPointer), XML Query
- Was originally part of XSL

Example for XPath Queries

  <bib>
    <book>
      <publisher> Addison-Wesley </publisher>
      <author> Serge Abiteboul </author>
      <author> <first-name> Rick </first-name>
               <last-name> Hull </last-name>
      </author>
      <author> Victor Vianu </author>
      <title> Foundations of Databases </title>
      <year> 1995 </year>
    </book>
    <book price="55">
      <publisher> Freeman </publisher>
      <author> Jeffrey D. Ullman </author>
      <title> Principles of Database and Knowledge Base Systems </title>
      <year> 1998 </year>
    </book>
  </bib>

XPath: Simple Expressions
/bib/book/year
  Result: <year> 1995 </year>
          <year> 1998 </year>
/bib/paper/year
  Result: empty

XPath: Restricted Kleene Closure
//author
  Result: <author> Serge Abiteboul </author>
          <author> <first-name> Rick </first-name>
                   <last-name> Hull </last-name> </author>
          <author> Victor Vianu </author>
          <author> Jeffrey D. Ullman </author>
/bib//first-name
  Result: <first-name> Rick </first-name>
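The simple path expressions above can be checked with a short Python sketch. It assumes the standard-library `xml.etree.ElementTree` module, which implements only a subset of XPath and evaluates paths relative to an element, so `/bib/book/year` is written as `book/year` from the `bib` root and `//author` as `.//author`. It also assumes that author names are wrapped in `<author>` elements, as the queries imply.

```python
import xml.etree.ElementTree as ET

BIB_XML = """<bib>
  <book>
    <publisher>Addison-Wesley</publisher>
    <author>Serge Abiteboul</author>
    <author><first-name>Rick</first-name><last-name>Hull</last-name></author>
    <author>Victor Vianu</author>
    <title>Foundations of Databases</title>
    <year>1995</year>
  </book>
  <book price="55">
    <publisher>Freeman</publisher>
    <author>Jeffrey D. Ullman</author>
    <title>Principles of Database and Knowledge Base Systems</title>
    <year>1998</year>
  </book>
</bib>"""

root = ET.fromstring(BIB_XML)  # root is the <bib> element

# /bib/book/year
years = [y.text for y in root.findall("book/year")]

# /bib/paper/year: there are no <paper> elements, so the result is empty
paper_years = root.findall("paper/year")

# //author: authors at any depth
authors = root.findall(".//author")

# /bib//first-name
first_names = [f.text for f in root.findall(".//first-name")]

print(years)
print(paper_years)
print(len(authors))
print(first_names)
```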

XPath: Text Nodes
/bib/book/author/text()
  Result: Serge Abiteboul
          Victor Vianu
          Jeffrey D. Ullman
Rick Hull doesn't appear because he has first-name, last-name
Functions in XPath:
- text() = matches the text value
- node() = matches any node (= * or @* or text())
- name() = returns the name of the current tag

XPath: Wildcard
//author/*
  Result: <first-name> Rick </first-name>
          <last-name> Hull </last-name>
* matches any element

XPath: Attribute Nodes
/bib/book/@price
  Result: 55
@price means that price has to be an attribute

XPath: Qualifiers
/bib/book/author[first-name]
  Result: <author> <first-name> Rick </first-name>
                   <last-name> Hull </last-name> </author>

XPath: More Qualifiers
/bib/book/author[firstname][address[//zip][city]]/lastname
  Result: <lastname> ... </lastname>
          <lastname> ... </lastname>

XPath: More Qualifiers
/bib/book[@price < "60"]
/bib/book[author/@age < "25"]
/bib/book[author/text()]
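Text nodes, wildcards, attributes and qualifiers can be approximated in Python as follows. This is a sketch under the assumption that the standard-library `xml.etree.ElementTree` module is used: it supports `*`, `[tag]` and `[@attr]` predicates, but not `text()` steps or `<` comparisons, so those two cases are emulated in plain Python. The document is the bibliography example, with author names assumed to be wrapped in `<author>` elements.

```python
import xml.etree.ElementTree as ET

BIB_XML = """<bib>
  <book>
    <publisher>Addison-Wesley</publisher>
    <author>Serge Abiteboul</author>
    <author><first-name>Rick</first-name><last-name>Hull</last-name></author>
    <author>Victor Vianu</author>
    <title>Foundations of Databases</title>
    <year>1995</year>
  </book>
  <book price="55">
    <publisher>Freeman</publisher>
    <author>Jeffrey D. Ullman</author>
    <title>Principles of Database and Knowledge Base Systems</title>
    <year>1998</year>
  </book>
</bib>"""

root = ET.fromstring(BIB_XML)

# /bib/book/author/text(): emulated by keeping authors that have direct text
author_texts = [
    a.text.strip() for a in root.findall("book/author") if a.text and a.text.strip()
]

# //author/*: any element child of an author
author_children = [c.tag for c in root.findall(".//author/*")]

# /bib/book/@price: the attribute value of books that carry a price
prices = [b.get("price") for b in root.findall("book[@price]")]

# /bib/book/author[first-name]: authors that have a first-name child
structured = root.findall("book/author[first-name]")
structured_last = [a.find("last-name").text for a in structured]

# /bib/book[@price < "60"]: the comparison is done in Python
cheap_titles = [
    b.find("title").text
    for b in root.findall("book[@price]")
    if float(b.get("price")) < 60
]

print(author_texts)
print(author_children)
print(prices)
print(structured_last)
print(cheap_titles)
```

A full XPath 1.0 engine would evaluate all five expressions directly; the Python filtering here only stands in for the parts ElementTree does not implement.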

XPath: More Details
We can navigate along 13 axes:
- ancestor
- ancestor-or-self
- attribute
- child
- descendant
- descendant-or-self
- following
- following-sibling
- namespace
- parent
- preceding
- preceding-sibling
- self

XPath: More Details
Examples:
- child::author/child::lastname = author/lastname
- child::author/descendant::zip = author//zip
- child::author/parent::* = author/..
- child::author/attribute::age = author/@age
What does this mean?
- paper/publisher/parent::*/author
- /bib//address[ancestor::book]
- /bib//author/ancestor::*//zip

XPath: Even More Details
- name() = the name of the current node
- /bib//*[name()="book"] is the same as /bib//book
What does this mean?
- /bib//*[ancestor::*[name()!="book"]]

Storage

Naïve Storage of XML using RDBMS
Let's use two tables for the XML instances:
- one to store all edge information
- one to store values

The two tables
- Ref(src, label, dst)
- Val(oid, value)
Suppose a simple query like: family/person/hobby in XPath
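A minimal sketch of the two-table scheme, assuming SQLite through Python's standard-library `sqlite3` module. The sample family document, the hobby values, the generated oids (`o1`, `o2`, ...) and the helper name `shred` are all assumptions for illustration; only the table layouts `Ref(src, label, dst)` and `Val(oid, value)` and the path `family/person/hobby` come from the slides.

```python
import sqlite3
import xml.etree.ElementTree as ET
from itertools import count

# Assumption: a made-up document; the slides do not show the family data.
FAMILY_XML = """<family>
  <person><name>Jane</name><hobby>chess</hobby></person>
  <person><name>Mary</name><hobby>tennis</hobby><hobby>painting</hobby></person>
</family>"""


def shred(xml_text, conn):
    """Store every parent-child edge in Ref and every leaf text in Val."""
    conn.execute("CREATE TABLE Ref (src TEXT, label TEXT, dst TEXT)")
    conn.execute("CREATE TABLE Val (oid TEXT, value TEXT)")
    ids = count(1)

    def visit(element, parent_oid):
        oid = f"o{next(ids)}"
        conn.execute(
            "INSERT INTO Ref VALUES (?, ?, ?)", (parent_oid, element.tag, oid)
        )
        text = (element.text or "").strip()
        if text:
            conn.execute("INSERT INTO Val VALUES (?, ?)", (oid, text))
        for child in element:
            visit(child, oid)

    visit(ET.fromstring(xml_text), "root")


conn = sqlite3.connect(":memory:")
shred(FAMILY_XML, conn)

# family/person/hobby: one self-join on Ref per path step, plus a join with Val.
PATH_QUERY = """
SELECT v.value
FROM Ref r1, Ref r2, Ref r3, Val v
WHERE r1.src = 'root' AND r1.label = 'family'
  AND r1.dst = r2.src AND r2.label = 'person'
  AND r2.dst = r3.src AND r3.label = 'hobby'
  AND r3.dst = v.oid
"""
hobbies = sorted(row[0] for row in conn.execute(PATH_QUERY))
edge_count = conn.execute("SELECT COUNT(*) FROM Ref").fetchone()[0]
print(hobbies)
print(edge_count)
```

Every extra step in the path adds one more self-join on `Ref`, which is the cost that makes this layout naïve.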

The same query in SQL

  select v.value
  from Ref r1, Ref r2, Ref r3, Val v
  where r1.src = root AND r1.label = 'family'
    AND r1.dst = r2.src AND r2.label = 'person'
    AND r2.dst = r3.src AND r3.label = 'hobby'
    AND r3.dst = v.oid

This is a 4-way join!!! It's very inefficient, though an index on label can help a lot.

Efficiency problem
- Even a simple query will have a large number of joins
- An RDBMS organizes data based on the structure of tables and type info => clustering, indexing and query optimization do not work properly for XML data
- Also, there are many more ways to traverse path expressions than there are on tables

Querying & Maintaining a Compact XML Storage (WWW 2007)
- Motivations
- Architecture
- How it works
- Experiments
- Conclusion

Motivations: XML everywhere

A lightweight & efficient XML storage / processor runtime?
- XML file + DOM + XPath lib
- Persistent DOM + XPath lib
- XML compressor (e.g. XMill) + XPath
- XML file + XPath / XQuery processor
- Native / relational XML DB
- XML DB on desktop? XML file + DOM on mobile?

Requirements
1. Space does matter for many applications
2. Generally reducing space improves cache locality
3. Indirection is expensive
4. Support fast navigations
5. Support fast insertion and deletion
6. Support efficient joins
7. Separate topology, text and schema

Our Goal
To find a space-efficient storage scheme for XML data without compromising either query or update performance

Proposed Storage Structure
The ISX Structure

Sample DBLP XML Fragment

Balanced Parenthesis Encoding
[Figure: bit sequence 0 0 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 1 1]

Node Navigations

Topology Tiers
- No. of (
- No. of )
- No. of text nodes
- Min, max of forward excess
- Min, max of backward excess
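The idea behind a balanced parenthesis encoding can be sketched in Python: each element contributes one opening and one closing parenthesis in document order, and navigation works on that sequence by tracking the excess of opens over closes. This is a generic illustration and not the ISX implementation. The function names are hypothetical, parentheses are kept as characters rather than bits, and matching is done by a linear scan, whereas per-block counts and excess bounds like those listed under Topology Tiers would presumably let a real system skip whole blocks.

```python
import xml.etree.ElementTree as ET

STAFF_XML = (
    "<Staff><Name><FirstName>Raymond</FirstName><LastName>Wong</LastName></Name>"
    "<Login>wong</Login><Ext>5932</Ext></Staff>"
)


def encode(root):
    """Return (parens, tags): the topology string and the tags in document order."""
    parens, tags = [], []

    def visit(node):
        parens.append("(")
        tags.append(node.tag)
        for child in node:
            visit(child)
        parens.append(")")

    visit(root)
    return "".join(parens), tags


def find_close(parens, i):
    """Position of the ')' that matches the '(' at position i."""
    excess = 0
    for j in range(i, len(parens)):
        excess += 1 if parens[j] == "(" else -1
        if excess == 0:
            return j
    raise ValueError("unbalanced sequence")


def first_child(parens, i):
    """Position of the first child of the node at i, or None for a leaf."""
    return i + 1 if parens[i + 1] == "(" else None


def next_sibling(parens, i):
    """Position of the next sibling of the node at i, or None if it is the last."""
    j = find_close(parens, i) + 1
    return j if j < len(parens) and parens[j] == "(" else None


def tag_at(parens, tags, i):
    """Tag of the node opening at i: its rank among the '(' symbols indexes tags."""
    return tags[parens.count("(", 0, i)]


parens, tags = encode(ET.fromstring(STAFF_XML))
name_pos = first_child(parens, 0)
login_pos = next_sibling(parens, name_pos)
print(parens)
print(tag_at(parens, tags, name_pos), tag_at(parens, tags, login_pos))
```

Keeping the topology string separate from the tag list mirrors the requirement to separate topology, text and schema: the structure costs two symbols per node regardless of tag or text length.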

Efficient Updates

Example
- 100 MB DBLP document
- 5 million XML nodes
- ISX: 1MB topology

Another example (demo available)
Core Duo 1.83GHz, 1GB RAM, 5400 RPM Harddrive, MS Vista

  5M DBLP      Runtime (loading)   Loading time   Runtime (//www)   //www
  MSXML        15MB                0.54s          21MB              0.096s
  ISX          4MB                 0.035s         4MB               0.004s

  100M DBLP    Runtime (loading)   Loading time   Runtime (//www)   //www
  MSXML        329MB               17.8s          333MB             1.814s
  ISX          67MB                0.67s          67MB              0.143s

ISX Features

Experiments
Setup:
- Fixed at 64MB memory buffer
- Up to 16 GB XML document, e.g. 16 GB DBLP contains > 770 million nodes
- NO index or query optimization has been employed for ISX (except for ISX Stream, where the TurboXPath algorithm has been employed)

Storage Size (ISX vs NoK)

Storage Size (ISX, XMill, XGrind): DBLP

Storage Size (ISX, XMill): TreeBank

Bulk Loading Performance

Q1: //inproceedings

Q5: //article[.//month/text() = "July"]//title

Node Navigation

Full document traversal

Update (Insertion) Performance

Conclusions
- Small storage footprint
- Small runtime footprint
- Fast and consistent performance on navigational access
- Superior query performance (further indexing / query optimization can be added)
- Superior update performance