Indexing XML Data in RDBMS using ORDPATH
Microsoft SQL Server 2005
Concepts developed by:
Patrick O'Neil, Elizabeth O'Neil (University of Massachusetts Boston)
Shankar Pal, Istvan Cseri, Oliver Seeliger, Gideon Schaller, Leo Giakoumakis, Vasili Zolotov, Nigel Westbury (Microsoft Corporation)
5 July 2006, Stephan Müller

XML Data Model
Sample XML data (serialized form):
  <BOOK ISBN="1-55860-438-3">
    <SECTION>
      <TITLE>Bad Bugs</TITLE>
      Nobody loves bad bugs.
      <FIGURE CAPTION="Sample bug"/>
    </SECTION>
    <SECTION>
      <TITLE>Tree Frogs</TITLE>
      All right-thinking people <BOLD>love</BOLD> tree frogs.
    </SECTION>
  </BOOK>
XML Data Model
XML document / fragment properties: hierarchy and document order.
Hierarchy (nodes numbered in document order):
  1 Book
    2 ISBN
    3 Section
      4 Title
      5 Nobody
      6 Figure
        7 Caption
    8 Section
      9 Title
      10 All right
      11 Bold
      12 Frogs
Document order: 1 < 2 < 3 < 4 < 5 < ... < 11 < 12
XML Data Stored in a Relational Database
SQL command:
  CREATE TABLE docs ( id INT PRIMARY KEY, xdoc XML );
Created docs table:
  ID  XDOC
  1   XML fragment as BLOB
  2   XML document as BLOB
SQL with embedded XQuery and XPath:
  SELECT id, xdoc.query('
      for $s in /BOOK[@ISBN = "1-55860-438-3"]//SECTION
      return <topic>{ data($s/TITLE) }</topic>')
  FROM docs;
ORDPATH
Introduction
What we expect from a labeling scheme:
- Support for structural fidelity (hierarchy + document order)
- Support for efficient structural modifications to the XML tree, without relabeling:
  - insert sub-tree
  - delete sub-tree
  - move sub-tree
- Support for high-performance query plans for native XML queries using relational primitives
- Independence from the XML schemas that type the XML instances
Example of an Initial Load
Hierarchy:
  1 Book
    1.1 ISBN
    1.3 Section
      1.3.1 Title
      1.3.3 Nobody
      1.3.5 Figure
        1.3.5.1 Caption
    1.5 Section
      1.5.1 Title
      1.5.3 All right
      1.5.5 Bold
      1.5.7 Frogs
Primary index: infoset
  ORDPATH   TAG          NODE_TYPE      VALUE
  1         1 (BOOK)                    Null
  1.1       2 (ISBN)     2 (Attribute)  '1-55860-438-3'
  1.3       3 (SECTION)                 Null
  1.3.1     4 (TITLE)                   'Bad Bugs'
  1.3.3     --           4 (Value)      'Nobody loves bad bugs'
  1.3.5     5 (FIGURE)                  Null
  1.3.5.1   6 (CAPTION)  2 (Attribute)  'Sample bug'
  1.5       3 (SECTION)                 Null
  1.5.1     4 (TITLE)                   'Tree frogs'
  1.5.3     --           4 (Value)      'All right-thinking people'
  1.5.5     7 (BOLD)                    'love'
  1.5.7     --           4 (Value)      'tree frogs'
Document order: 1 < 1.1 < 1.3 < 1.3.1 < ... < 1.5.7
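The labels in this example follow a simple pattern: the root is 1, and the children of a node get the ordinals 1, 3, 5, ... appended to the parent's label. A minimal Python sketch of that pattern, assuming a hypothetical in-memory `Node` class (not part of the slides or of SQL Server):

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical tree node used only for this illustration.
    name: str
    children: list = field(default_factory=list)


def label_initial_load(root, root_label=(1,)):
    """Return [(ordpath, name)] in document order, using odd ordinals only."""
    out = []

    def visit(node, label):
        out.append((".".join(map(str, label)), node.name))
        for i, child in enumerate(node.children):
            # i-th child (0-based) gets ordinal 1, 3, 5, ...
            visit(child, label + (2 * i + 1,))

    visit(root, root_label)
    return out


book = Node("BOOK", [
    Node("ISBN"),
    Node("SECTION", [Node("TITLE"), Node("text"),
                     Node("FIGURE", [Node("CAPTION")])]),
    Node("SECTION", [Node("TITLE"), Node("text"),
                     Node("BOLD"), Node("text")]),
])
labels = label_initial_load(book)
```

The even ordinals skipped here are what later insertions can use as carets.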
Li/Oi Pair Design
Li/Oi Pair Design
ORDPATH example value: 1.5.3.-9.11
Li/Oi pair design: L0 O0 L1 O1 ... LK OK
ORDPATH bit pattern: 0100101101010110001111111000011
We need a prefix-free Li encoding.
Prefix-Free Encoding of the Li Bitstrings (using the Fano condition)
Li/Oi Pair Design
ORDPATH example value: 1.5.3.-9.11
Using Li values from Figure 3.2a:
  i   Li bitstring   Li   Oi bitstring   Oi
  0   01             3    001            1
  1   01             3    101            5
  2   01             3    011            3
  3   00011          4    1111           -9
  4   100            4    0011           11
ORDPATH bit pattern (Figure 3.2a): 0100101101010110001111111000011
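The worked example can be reproduced with a small encoder. The sketch below is in Python and uses only a partial prefix table; the value ranges attached to each Li bitstring are an assumption chosen to be consistent with the five components on this slide, not a copy of the full Figure 3.2a.

```python
# Partial table (assumption): (Li prefix bitstring, Oi length in bits, lowest Oi value).
# Only the rows needed for small ordinals are listed.
L_TABLE = [
    ("00011", 4, -24),  # Oi in [-24, -9]
    ("001",   3, -8),   # Oi in [-8, -1]
    ("01",    3, 0),    # Oi in [0, 7]
    ("100",   4, 8),    # Oi in [8, 23]
]


def encode(ordpath):
    """Encode a dotted ORDPATH string as a bit string of Li/Oi pairs."""
    bits = []
    for comp in map(int, ordpath.split(".")):
        for prefix, length, low in L_TABLE:
            if low <= comp < low + (1 << length):
                # Oi is stored as an unsigned offset from the range's low end.
                bits.append(prefix + format(comp - low, "0{}b".format(length)))
                break
        else:
            raise ValueError("{} is outside the partial table".format(comp))
    return "".join(bits)


def decode(bits):
    """Decode a bit string back into a dotted ORDPATH string."""
    comps, pos = [], 0
    while pos < len(bits):
        for prefix, length, low in L_TABLE:
            # Prefix-freeness guarantees at most one prefix matches here.
            if bits.startswith(prefix, pos):
                start = pos + len(prefix)
                comps.append(low + int(bits[start:start + length], 2))
                pos = start + length
                break
        else:
            raise ValueError("unknown Li prefix")
    return ".".join(map(str, comps))


pattern = encode("1.5.3.-9.11")
```

Because no Li bitstring is a prefix of another, the decoder never needs a separator between pairs.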
Li/Oi Pair Design
Advantages of comparing ORDPATH values:
- Determining the ancestor-descendant relationship for any two ORDPATHs is very easy.
- The distance between two ORDPATHs is easy to determine.
- A simple bitstring (or byte-by-byte) comparison yields document order.
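A rough Python illustration of these comparisons on decoded labels follows. It compares tuples of ordinals as a stand-in for the bitstring comparison, and the ancestor test assumes the first argument is the label of a real node (one that ends in an odd ordinal); the function names are hypothetical.

```python
def parse(ordpath):
    """Split a dotted ORDPATH string into a tuple of integer ordinals."""
    return tuple(int(c) for c in ordpath.split("."))


def before(a, b):
    """True if a precedes b in document order (lexicographic on ordinals)."""
    return parse(a) < parse(b)


def is_ancestor(a, d):
    """True if a is a proper ancestor of d, i.e. a is a proper prefix of d."""
    pa, pd = parse(a), parse(d)
    return len(pa) < len(pd) and pd[:len(pa)] == pa


# Examples taken from the sample document's labels.
in_order = before("1.3.5.1", "1.5")
caption_under_section = is_ancestor("1.3", "1.3.5.1")
```

Tuple comparison places a prefix before any of its extensions, so an ancestor always sorts before its descendants.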
Descendants of a given Context Node
Context node: cn = 1.3
  1 Book
    1.1 ISBN
    1.3 Section   (context node)
      1.3.1 Title
      1.3.3 Nobody
      1.3.5 Figure
        1.3.5.1 Caption
    1.5 Section
      1.5.1 Title
      1.5.3 All right
      1.5.5 Bold
      1.5.7 Frogs
Descendants of a given Context Node
SQL query (cn = 1.3, cn+1 = 1.4):
  SELECT Ordpath FROM infoset
  WHERE 1.3 < Ordpath    -- cn
    AND 1.4 > Ordpath    -- cn+1
Infoset table:
  ORDPATH   TAG          NODE_TYPE      VALUE
  1         1 (BOOK)                    Null
  1.1       2 (ISBN)     2 (Attribute)  '1-55860-438-3'
  1.3       3 (SECTION)                 Null
  1.3.1     4 (TITLE)                   'Bad Bugs'
  1.3.3     --           4 (Value)      'Nobody loves bad bugs'
  1.3.5     5 (FIGURE)                  Null
  1.3.5.1   6 (CAPTION)  2 (Attribute)  'Sample bug'
  1.5       3 (SECTION)                 Null
  1.5.1     4 (TITLE)                   'Tree frogs'
  1.5.3     --           4 (Value)      'All right-thinking people'
  1.5.5     7 (BOLD)                    'love'
  1.5.7     --           4 (Value)      'tree frogs'
Rows returned: 1.3.1, 1.3.3, 1.3.5, 1.3.5.1
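The range predicate amounts to a single contiguous scan of the index in document order. A small Python sketch of the same idea, using a sorted list and binary search in place of the primary index (the `descendants` helper is hypothetical):

```python
from bisect import bisect_left, bisect_right


def parse(ordpath):
    """Split a dotted ORDPATH string into a tuple of integer ordinals."""
    return tuple(int(c) for c in ordpath.split("."))


def descendants(sorted_labels, cn):
    """Return the labels strictly between cn and cn+1 in document order.

    sorted_labels must already be in document order, as in the primary index.
    """
    lo = parse(cn)
    hi = lo[:-1] + (lo[-1] + 1,)      # cn+1: last ordinal incremented by one
    keys = [parse(label) for label in sorted_labels]
    start = bisect_right(keys, lo)    # first key greater than cn
    end = bisect_left(keys, hi)       # first key not less than cn+1
    return sorted_labels[start:end]


infoset = ["1", "1.1", "1.3", "1.3.1", "1.3.3", "1.3.5", "1.3.5.1",
           "1.5", "1.5.1", "1.5.3", "1.5.5", "1.5.7"]
result = descendants(infoset, "1.3")
```

Only the two boundaries are searched for; everything between them is the subtree.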
Arbitrary Inserts
Arbitrary Insertions
Rightmost / leftmost insertion:
  3.5 Parent
    3.5.-1 Child4   (leftmost)
    3.5.1  Child1
    3.5.3  Child2
    3.5.5  Child3   (rightmost)
Arbitrary Insertions
Careting in nodes between two existing nodes (label structure; even ordinals are carets):
  3.5
    3.5.1
    3.5.2
      3.5.2.1
      3.5.2.2
        3.5.2.2.-1
        3.5.2.2.1
      3.5.2.3
    3.5.3
Arbitrary Insertions
Careting in nodes between two existing nodes (resulting children of 3.5, in document order):
  3.5 Parent
    3.5.1       Child1
    3.5.2.1     Child3
    3.5.2.2.-1  Child6
    3.5.2.2.1   Child5
    3.5.2.3     Child4
    3.5.3       Child2
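The labels on the insertion slides can be derived with three small rules. The Python sketch below is a simplified illustration with hypothetical function names: it handles the rightmost and leftmost cases, and careting only between two siblings whose last ordinals are consecutive odd numbers. It also shows that even (caret) ordinals do not add to a node's depth.

```python
def parse(ordpath):
    return tuple(int(c) for c in ordpath.split("."))


def fmt(comps):
    return ".".join(map(str, comps))


def insert_rightmost(last_sibling):
    """New label to the right of the current last sibling: last ordinal + 2."""
    p = parse(last_sibling)
    return fmt(p[:-1] + (p[-1] + 2,))


def insert_leftmost(first_sibling):
    """New label to the left of the current first sibling: last ordinal - 2."""
    p = parse(first_sibling)
    return fmt(p[:-1] + (p[-1] - 2,))


def insert_between(left, right):
    """Caret in between two siblings with consecutive odd last ordinals."""
    l, r = parse(left), parse(right)
    if l[:-1] != r[:-1] or r[-1] - l[-1] != 2 or l[-1] % 2 == 0:
        raise ValueError("sketch only covers consecutive odd siblings")
    # The even ordinal between them becomes the caret; 1 starts a new run.
    return fmt(l[:-1] + (l[-1] + 1, 1))


def depth(ordpath):
    """Tree depth of a node: only odd ordinals count, carets are skipped."""
    return sum(1 for c in parse(ordpath) if c % 2 != 0)


child3 = insert_between("3.5.1", "3.5.3")
child4 = insert_rightmost(child3)
child5 = insert_between(child3, child4)
child6 = insert_leftmost(child5)
```

None of these steps touches the label of an existing node.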
Comment
Note: Multiple levels of carets are extremely rare in practice.
Advantage: Insertions require no relabeling of old nodes. We avoid updates to primary key values, which would involve the primary index and all secondary indexes.
Conclusion
ORDPATH
- is a hierarchical, prefix-based labeling scheme.
- provides efficient access to subtrees.
- supports all kinds of structural modifications; insertions require no relabeling of existing nodes.