COMP9314: XML. Some reference sites. What is XML? Semistructured Data / XML

Some reference sites COMP9314: XML www.w3.org www.xml.com www.xml.org www.oasis-open.org Raymond Wong National ICT Australia & University of New South Wales wong@cse.unsw.edu.au 1 2 What is XML? Semistructured Data / XML XML is just a markup language defined by W3C (officially in Feb 98) It s a simplified version of SGML HTML for presentation markup, HTML4.0 XML SGML XML for content markup Semistructured=> loosely structured (no restrictions on tags & nesting relationships) no schema required XML under the semistructured umbrella self-describing the standard for information representation & exchange 3 4 Storage format vs presentation format - The power of markup The Family of XML Technologies Traditional Database or or Spreadsheet Raymond, Wong, Wong, wong, wong, 5932, 5932, John, John, Smith, Smith, jsmith, jsmith, 1234, 1234,...... HTML HTML XML XML <Staff> <Staff> <Name> <Name> <ul> <ul> <FirstName> Raymond </FirstName> <li> <li> Raymond Wong Wong </li> </li> <LastName> Wong Wong </LastName> <li> <li> Login: Login: wong wong </li> </li> </Name> <li> <li> Phone: Phone: x5932 x5932 </li> </li> <Login> <Login> wong wong </Login> </ul> </ul> <Ext> <Ext> 5932 5932 </Ext> </Ext> </Staff> </Staff> Document structures Definition of gst, price, XMLSchema / DTD <Product> <Product> <product_id> m101 </product_id> <product_id> <Product> m101 </product_id> <name> Sony walkman </name> <name> Sony <product_id> walkman m101 </name> </product_id> <currency> AUD </currency> <currency> <name> AUD </currency> Sony walkman </name> <price> 200.00 </price> <price> 200.00 <currency> </price> AUD </currency> <gst> 10% </gst> <gst> 10% <price> </gst> 200.00 </price> </Product> </Product> <gst> 10% </gst> </Product> Presentation format info XML Stylesheet XML Stylesheet XML Stylesheet 5 6 1

The power of XML is that it facilitates the definition of common standards for information representation & exchange. DBMS support for XML data Distributors CRMs internet voice line Corps intranet ISPs Telcos When more & more data are in XML, possibly from different information sources, there is a need for XMLDB query and update language/interface concurrency control & crash recovery access control and authorization views & triggers Suppliers wireless Sales people Messaging 7 8 Why need to query XML data XML data file can be modeled as a tree To extract data from large XML docs To exchange data (data- or query-shipping) To exchange data beteen different user communities or ontologies or schemas To integrate data from multiple XML sources <Staff> <Staff> <Name> <Name> <FirstName> <FirstName> Raymond Raymond </FirstName> </FirstName> <LastName> <LastName> Wong Wong </LastName> </LastName> </Name> </Name> <Login> <Login> wong wong </Login> </Login> <Ext> <Ext> 5932 5932 </Ext> </Ext> </Staff> </Staff> Staff Name Login Ext wong 5932 FirstName LastName Raymond Wong 9 10 XML Terminology More XML: Attributes tags: book, title, author, start tag: <book>, end tag: </book> elements: <book> </book>, elements are nested empty element: <red></red> abbrv. <red/> an XML document: single root element well formed XML document: if it has matching tags <book price = 55 currency = USD > <title> Foundations of Databases </title> Abiteboul <year> 1995 </year> </book> 11 12 2

More XML: Oids and References More XML: CDATA Section <person id= o555 > <name> Jane </name> </person> <person id= o456 > <name> Mary </name> <children idref= o123 o555 /> </person> <person id= o123 mother= o456 > <name>john</name> </person> Syntax: <![CDATA[...any text here ]]> Example: <example> <![CDATA[ some text here </notatag> <> ]]> </example> 13 14 More XML: Entity References More XML: Processing Instructions Syntax: &entityname; Example: <element> this is less than < </element> Some entities: < > & < > & Syntax: <?target argument?> Example: <product> <name> Alarm Clock </name> <?ringbell 20?> <price> 19.99 </price> </product> ' " & Unicode char 15 16 More XML: Comments XML Namespaces Syntax  http://www.w3.org/tr/rec-xml-names (1/99) name ::= [prefix:]localpart <book xmlns:isbn= www.isbn-org.org/def > <title> </title> <number> 15 15 </number> <isbn:number>.. </isbn:number> </book> 17 18 3

XML Namespaces Need more? syntactic: <number>, <isbn:number> semantic: provide URL for schema <tag xmlns:mystyle = http:// > defined here <mystyle:title> </mystyle:title> <mystyle:number> </tag> Read the specs from W3C! Some XML books may give you more examples that are better organized and completed, however, most materials can be found from the net - use Google. 19 20 XML Parsers Parsing There are several different ways to categorise parsers: Validating versus non-validating parsers Parsers that support the Document Object Model (DOM) Parsers that support the Simple API for XML (SAX) Parsers written in a particular language (Java, C, C++, Perl, etc.) 21 22 Non-Validating Parsers Using an XML Parser Speed and efficiency It takes a significant amount of effort for an XML parser to process a DTD / XML Schema and make sure that every element in an XML document follows the rules of the DTD / Schema. If we only want to find tags and extract information we should use a non-validating parser Three basic steps in using an XML parser Creating a parser object Passing the XML document to the parser Processing the results Generally, writing out XML is not in the scope of parsers (though some may implement proprietary mechanisms) 23 24 4

The SAX Parser SAX Parser Events SAX parser is an event-driven API An XML document is sent to the SAX parser The XML file is read sequentially The parser notifies the class when events happen, including errors The events are handled by the implemented API methods to handle events that the programmer implemented A SAX parser generates events at the start and end of a document, at the start and end of an element, when it finds characters inside an element, and at several other points User writes the code that handles each event, and decides what to do with the information from the parser 25 26 Example Event Handlers When to (not to) use SAX startelementhandler endelementhandler chardatahandler CDATASectionHandler CommentHandler PIHandler etc... Ideal for simple operations on XML files E.g. reading and extracting elements Good for very large XML files (c.f. DOM) Not good if we want to manipulate XML structure Not designed for writing out XML 27 28 DOM DOM Parser produces a memory tree (DOM Tree) after parsing Document Object Model Set of interfaces for an application that reads an XML file into memory and stores it as a tree structure The abstract API allows for constructing, accessing and manipulating the structure and content of XML and HTML documents <Staff> <Staff> <Name> <Name> <FirstName> <FirstName> Raymond Raymond </FirstName> </FirstName> <LastName> <LastName> Wong Wong </LastName> </LastName> </Name> </Name> <Login> <Login> wong wong </Login> </Login> <Ext> <Ext> 5932 5932 </Ext> </Ext> </Staff> </Staff> DOM Parser Staff Name Login Ext FirstName Raymond LastName Wong wong 5932 29 30 5

Why to Use DOM Task of writing parsers is reduced to coding against the DOM Tree API Domain-specific frameworks will be written on top of DOM XPath 31 32 XPath Example for XPath Queries http://www.w3.org/tr/xpath (11/99) Building block for other W3C standards: XSL Transformations (XSLT) XML Link (XLink) XML Pointer (XPointer) XML Query Was originally part of XSL <bib> <bib> <book> <book> <publisher> <publisher> Addison-Wesley Addison-Wesley </publisher> </publisher> Serge Serge Abiteboul Abiteboul <first-name> <first-name> Rick Rick </first-name> </first-name> <last-name> <last-name> Hull Hull </last-name> </last-name> Victor Victor Vianu Vianu <title> <title> Foundations Foundations of of Databases Databases </title> </title> <year> <year> 1995 1995 </year> </year> </book> </book> <book <book price= 55 > price= 55 > <publisher> <publisher> Freeman Freeman </publisher> </publisher> Jeffrey Jeffrey D. D. Ullman Ullman <title> <title> Principles Principles of of Database Database and and Knowledge Knowledge Base Base Systems Systems </title> </title> <year> <year> 1998 1998 </year> </year> </book> </book> </bib> </bib> 33 34 XPath: Simple Expressions XPath: Restricted Kleene Closure /bib/book/year Result: <year> 1995 </year> <year> 1998 </year> /bib/paper/year Result: empty //author Result: Serge Abiteboul <first-name> Rick </first-name> <last-name> Hull </last-name> Victor Vianu Jeffrey D. Ullman /bib//first-name Result: <first-name> Rick </first-name> 35 36 6

XPath: Text Nodes XPath: Wildcard /bib/book/author/text() Result: Serge Abiteboul Victor Vianu Jeffrey D. Ullman Rick Hull doesn t appear because he has firstname, lastname Functions in XPath: text() = matches the text value node() = matches any node (= * or @* or text() ) name() = returns the name of the current tag //author/* Result: <first-name> Rick </first-name> <last-name> Hull </last-name> * Matches any element 37 38 XPath: Attribute Nodes XPath: Qualifiers /bib/book/@price Result: 55 @price means that price is has to be an attribute /bib/book/author[firstname] Result: <first-name> Rick </first-name> <last-name> Hull </last-name> 39 40 XPath: More Qualifiers XPath: More Qualifiers /bib/book/author[firstname][address[//zip][city]]/lastname Result: <lastname> </lastname> <lastname> </lastname> /bib/book[@price < 60 ] /bib/book[author/@age < 25 ] /bib/book[author/text()] 41 42 7

XPath: More Details XPath: More Details We can navigate along 13 axes: ancestor ancestor-or-self attribute child descendant descendant-or-self following following-sibling namespace parent preceding preceding-sibling self Examples: child::author/child:lastname = author/lastname child::author/descendant::zip = author//zip child::author/parent::* = author/.. child::author/attribute::age = author/@age What does this mean? paper/publisher/parent::*/author /bib//address[ancestor::book] /bib//author/ancestor::*//zip 43 44 XPath: Even More Details name() = the name of the current node /bib//*[name()=book] same as /bib//book What does this mean? /bib//*[ancestor::*[name()!=book]] Storage 45 46 Naïve Storage of XML using RDBMS The two tables let s use two tables for the XML instances: one to store all edge information one to store values Ref(src, label, dst) Val(oid, value) Suppose a simple query like: family/person/hobby in XPath 47 48 8

The same query in SQL Efficiency problem select v.value from Ref r1, Ref r2, Ref r3, Val v where r1.src = root AND r1.label = family AND r1.dst = r2.src AND r2.label = person AND r2.dst = r3.src AND r3.label = hobby AND r3.dst = v.oid This is a 4-way join!!! It s very inefficient though index on label can help a lot. even simple query will have a large no of joins RDBMS organizes data based on the structure of tables and type info => clustering, indexing, query optimization are not working properly for XML data Also #ways to traverse path expressions are much more than that on tables 49 50 Querying & Maintaining a Compact XML Storage (www2007) Motivations: XML everywhere Motivations Architecture How it works Experiments Conclusion 51 52 A lightweight & efficient XML storage / processor runtime? Requirements XML file + DOM + XPath lib Persistent DOM + XPath lib XML compressor (e.g. XMill) + XPath XML file + XPath / XQuery processor Native / relational XML DB XML DB on desktop? XML file + DOM on mobile? 1. Space does matter for many applications 2. Generally reducing space improves cache locality 3. Indirection is expensive 4. Support fast navigations 5. Support fast insertion and deletion 6. Support efficient joins 7. Separate topology, text and schema 53 54 9

Our Goal Proposed Storage Structure To find a space-efficient storage scheme for XML data without compromising both query and update performances The ISX Structure 55 56 Sample DBLP XML Fragment Balanced Parenthesis Encoding 0 0 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 1 1 57 58 Node Navigations Topology Tiers No. of ( No. of ) No. of text nodes Min, max of forward excess Min, max of backward excess 59 60 10

Efficient Updates Example 100 MB DBLP document 5 million XML nodes ISX: 1MB topology 61 62 Another example (demo available) ISX Features Core Duo 1.83GHz 1GB RAM 5400 RPM Harddrive MS Vista 5M DBLP Runtime (loading) Loading time Runtime (//www) //www MSXML 15MB 0.54s 21MB 0.096s ISX 4MB 0.035s 4MB 0.004s 100M DBLP Runtime (loading) Loading time Runtime (//www) //www MSXML 329MB 17.8s 333MB 1.814s ISX 67MB 0.67s 67MB 0.143s 63 64 Experiments Storage Size (ISX vs NoK) Setup Fixed at 64MB memory buffer Up to 16 GB XML document E.g. 16 GB DBLP contains > 770 million nodes NO index or query optimization has been employed for ISX (except for ISX Stream where TurboXPath algorithm has been employed) 65 66 11

Storage Size (ISX, XMill, XGrind): DBLP Storage Size (ISX, XMill): TreeBank 67 68 Bulk Loading Performance Q1: //inproceedings 69 70 Q5: //article[.//month/text() = July ]//title Node Navigation 71 72 12

Full document traversal Update (Insertion) Performance 73 74 Conclusions Small storage footprint Small runtime footprint Fast and consistent performance on navigational access Superior query performance (further indexing / query optimization can be added) Superior update performance 75 13