Purely Relational XQuery

Purely Relational XQuery Database-Supported XML Processors Torsten Grust http://www-db.in.tum.de/~grust/ LUG Erding, Oct 2007

A Database Guy s View of XML, XPath, and XQuery 2

A Database Guy s View of XML, XPath, and XQuery Build XML processors that do not stumble once XML input volumes become sizable. 2

A Database Guy s View of XML, XPath, and XQuery Build XML processors that do not stumble once XML input volumes become sizable. Wait! Sizable data volume...? Use Relational Databases! 2

A Database Guy s View of XML, XPath, and XQuery Build XML processors that do not stumble once XML input volumes become sizable. Wait! Sizable data volume...? Use Relational Databases! <?xml version= 1.0?> <root> <elem att= val >  text </elem> </root> let $doc := doc( in.xml ) for $t in $doc//text() return count($t/../comment()) 2

A Database Guy s View of XML, XPath, and XQuery Build XML processors that do not stumble once XML input volumes become sizable. Wait! Sizable data volume...? Use Relational Databases! pre size level 1 1 10 2 1 20 3 1 30 let $doc := doc( in.xml ) for $t in $doc//text() return count($t/../comment()) 2

A Database Guy s View of XML, XPath, and XQuery Build XML processors that do not stumble once XML input volumes become sizable. Wait! Sizable data volume...? Use Relational Databases! pre size level 1 1 10 2 1 20 3 1 30 SELECT pre,size, ROW_NUMBER ( ) FROM t000 WHERE value GROUP BY iter, 2

A Database Guy s View of XML, XPath, and XQuery Build XML processors that do not stumble once XML input volumes become sizable. Wait! Sizable data volume...? Use Relational Databases! OFF-THE-SHELF pre size level 1 1 10 2 1 20 3 1 30 SELECT pre,size, ROW_NUMBER ( ) FROM t000 WHERE value GROUP BY iter, 2

RDBMSs as Processors for Non-Relational Languages Relational databases contain the best understood and most scalable query processors available today. Can we use relational query engines as-is to efficiently process non-relational languages? 3

RDBMSs as Processors for Non-Relational Languages Relational databases contain the best understood and most scalable query processors available today. Can we use relational query engines as-is to efficiently process non-relational languages? L X X 3

RDBMSs as Processors for Non-Relational Languages Relational databases contain the best understood and most scalable query processors available today. Can we use relational query engines as-is to efficiently process non-relational languages? L X X R R -1 SQL / Relational Algebra 3

RDBMSs as Processors for Non-Relational Languages Relational databases contain the best understood and most scalable query processors available today. Can we use relational query engines as-is to efficiently process non-relational languages? XQuery R R -1 SQL / Relational Algebra 3

Purely Relational XQuery Find a relational representation of XQuery (XML, XPath) that leverages the table-based processing model: Exploit functionality already built into the database system (do not invade the database kernel): SQL idioms, OLAP extensions, partitioned B-Trees. Run on top off-the-shelf relational database systems: 4

Purely Relational XQuery Find a relational representation of XQuery (XML, XPath) that leverages the table-based processing model: Exploit functionality already built into the database system (do not invade the database kernel): SQL idioms, OLAP extensions, partitioned B-Trees. Run on top off-the-shelf relational database systems: DB2 (kxsystems) IBM DB2 V9 SQL Server 2005 PostgreSQL kdb+ MonetDB DB2 Version 9 4

Pathfinder pathfinder-xquery.org Relational Database System (Kernel) 5

Pathfinder pathfinder-xquery.org Pathfinder XQuery Compiler XPath/XQuery Semantics XML Encoding Relational Database System (Kernel) 5

Pathfinder pathfinder-xquery.org XQuery Pathfinder XQuery Compiler XPath/XQuery Semantics XML Encoding Relational Database System (Kernel) 5

Pathfinder pathfinder-xquery.org XQuery Pathfinder XQuery Compiler XPath/XQuery Semantics XML Encoding SQL Relational Database System (Kernel) 5

Pathfinder pathfinder-xquery.org XQuery Pathfinder XQuery Compiler XPath/XQuery Semantics XML Encoding RA SQL Relational Database System (Kernel) 5

Pathfinder & DB2 DB2 Against 110+ MB of XML DB2 Version 9 for Linux, UNIX, an 6

Slow Motion Replay XQuery Pathfinder SQL DB2 DB2 Version 9 for Linux, UNIX, and Windows 7

Slow Motion Replay XQuery XQuery (XMark Q8) let $a := doc("xmark-110mb.xml") return Pathfinder <item person="{ $p/name/text() }"> { count($ca) } </item> SQL DB2 DB2 Version 9 for Linux, UNIX, and Windows 7

Slow Motion Replay XQuery SQL:1999 (no SQL/XML) Pathfinder SQL DB2 DB2 Version 9 WITH t0000 (iter,pre) AS SELECT FROM WHERE t0001 (iter) AS SELECT FROM WHERE SELECT FROM WHERE ORDER BY pre,size,kind, for Linux, UNIX, and Windows 7

Slow Motion Replay XQuery Pathfinder SQL DB2 Relational Encoding of Result pre size kind 5072509 2 1 5072510 0 2 5072511 0 3 DB2 Version 9 for Linux, UNIX, and Windows 7

Slow Motion Replay XQuery Pathfinder SQL DB2 Serialized XML Text <?xml version="1.0" encoding="utf-8"?> <XQuery> <item person= Jixiang >0</item> <item person= Harpreet >0</item> </XQuery> DB2 Version 9 for Linux, UNIX, and Windows 7

Table Access Modes 8

Table Access Modes Relational query engines derive much of their efficiency from the simplicity of the table of tuples model. A B 8

Table Access Modes Relational query engines derive much of their efficiency from the simplicity of the table of tuples model. 1. Sequential scans: read-ahead on disks, prefetching CPU caches. A B 8

Table Access Modes Relational query engines derive much of their efficiency from the simplicity of the table of tuples model. 1. Sequential scans: read-ahead on disks, prefetching CPU caches. A B 2. Index-based access: B-Trees, range scans. 8

Table Access Modes Relational query engines derive much of their efficiency from the simplicity of the table of tuples model. 1. Sequential scans: read-ahead on disks, prefetching CPU caches. A B 2. Index-based access: B-Trees, range scans. But: We will benefit only if we actually operate the relational database in this bulk-oriented mode. 8

The Core of XQuery 9

The Core of XQuery (e1,..,en) ordered item sequences 9

The Core of XQuery (e1,..,en) for $x in e 1 return e 2 ordered item sequences iteration 9

The Core of XQuery (e1,..,en) for $x in e 1 return e 2 let $x := e1 return e2 ordered item sequences iteration variable binding 9

The Core of XQuery (e1,..,en) for $x in e 1 return e 2 let $x := e1 return e2 if (e1) then e2 else e3 ordered item sequences iteration variable binding conditionals 9

The Core of XQuery (e1,..,en) for $x in e 1 return e 2 let $x := e1 return e2 if (e1) then e2 else e3 <t> { e1 } </t> ordered item sequences iteration variable binding conditionals XML node construction 9

The Core of XQuery (e1,..,en) for $x in e 1 return e 2 let $x := e1 return e2 if (e1) then e2 else e3 <t> { e1 } </t> e 1 /child::t, e 1 /following::t ordered item sequences iteration variable binding conditionals XML node construction XPath location steps 9

The Core of XQuery (e1,..,en) for $x in e 1 return e 2 let $x := e1 return e2 if (e1) then e2 else e3 <t> { e1 } </t> e 1 /child::t, e 1 /following::t unordered { e 1 } ordered item sequences iteration variable binding conditionals XML node construction XPath location steps local order indifference 9

The Core of XQuery (e1,..,en) for $x in e 1 return e 2 let $x := e1 return e2 if (e1) then e2 else e3 <t> { e1 } </t> e 1 /child::t, e 1 /following::t unordered { e 1 } fn:doc(e 1 ), fn:data(e 1 ) ordered item sequences iteration variable binding conditionals XML node construction XPath location steps local order indifference built-in functions + Document order, node identity, dynamic typing, schema validation, user-defined functions,... 9

The Core of XQuery (e1,..,en) for $x in e 1 return e 2 let $x := e1 return e2 if (e1) then e2 else e3 <t> { e1 } </t> e 1 /child::t, e 1 /following::t unordered { e 1 } fn:doc(e 1 ), fn:data(e 1 ) ordered item sequences iteration variable binding conditionals XML node construction XPath location steps local order indifference built-in functions + Document order, node identity, dynamic typing, schema validation, user-defined functions,... The XQuery SQL gap appears substantial. 9

FP-Style Iteration 10

FP-Style Iteration The XQuery for..in..return construct performs side-effect free iteration: for $x in (e1,..,en) return b 10

FP-Style Iteration The XQuery for..in..return construct performs side-effect free iteration: for $x in (e1,..,en) return b ( b[e1/$x],..,b[en/$x] ) Individual iterations cannot interfere and may be evaluated in any order (or in parallel). 10

Loop Lifting Relational loop unrolling : For any XQuery subexpression, generate a relational plan that evaluates the expression in all iterations. 11

Loop Lifting Relational loop unrolling : For any XQuery subexpression, generate a relational plan that evaluates the expression in all iterations. for $x in (10,20,30) return $x + 42 11

Loop Lifting Relational loop unrolling : For any XQuery subexpression, generate a relational plan that evaluates the expression in all iterations. for $x in (10,20,30) return $x + 42 iter pos item 1 1 42 2 1 42 3 1 42 11

Loop Lifting Relational loop unrolling : For any XQuery subexpression, generate a relational plan that evaluates the expression in all iterations. for $x in (10,20,30) return $x + 42 iter pos item 1 1 42 2 1 42 3 1 42 for $x in (10,20,30) return $x + 42 11

Loop Lifting Relational loop unrolling : For any XQuery subexpression, generate a relational plan that evaluates the expression in all iterations. for $x in (10,20,30) return $x + 42 for $x in (10,20,30) return $x + 42 iter pos item 1 1 42 2 1 42 3 1 42 iter pos item 1 1 10 2 1 20 3 1 30 11

Loop Lifting Relational loop unrolling : For any XQuery subexpression, generate a relational plan that evaluates the expression in all iterations. for $x in (10,20,30) return $x + 42 for $x in (10,20,30) return $x + 42 iter pos item 1 1 42 2 1 42 3 1 42 iter pos item 1 1 10 2 1 20 3 1 30 let $x := (10,20,30) return $x 11

Loop Lifting Relational loop unrolling : For any XQuery subexpression, generate a relational plan that evaluates the expression in all iterations. for $x in (10,20,30) return $x + 42 for $x in (10,20,30) return $x + 42 let $x := (10,20,30) return $x 11 iter pos item 1 1 42 2 1 42 3 1 42 iter pos item 1 1 10 2 1 20 3 1 30 iter pos item 1 1 10 1 2 20 1 3 30

Loop Lifting 12

Loop Lifting for $x in (10,20,30) return $x + 42 12

Loop Lifting for $x in (10,20,30) return $x + 42 iter pos item 1 1 10 2 1 20 3 1 30 iter 1 pos 1 item 1 1 1 42 2 1 42 3 1 42 12

Loop Lifting for $x in (10,20,30) return $x + 42 iter pos item 1 1 10 2 1 20 3 1 30 iter = iter 1 iter 1 pos 1 item 1 1 1 42 2 1 42 3 1 42 + item 2 :(item,item 1 ) π iter,pos,item 2 12

Loop Lifting for $x in (10,20,30) return $x + 42 iter pos item 1 1 10 2 1 20 3 1 30 iter = iter 1 + item 2 :(item,item 1 ) π iter,pos,item 2 iter 1 pos 1 item 1 1 1 42 2 1 42 3 1 42 iter pos item iter 1 pos 1 item 1 1 1 10 1 1 42 2 1 20 2 1 42 3 1 30 3 1 42 12

Loop Lifting for $x in (10,20,30) return $x + 42 iter pos item 1 1 10 2 1 20 3 1 30 iter = iter 1 + item 2 :(item,item 1 ) π iter,pos,item 2 iter 1 pos 1 item 1 1 1 42 2 1 42 3 1 42 iter pos item iter 1 pos 1 item 1 item 2 1 1 10 1 1 42 52 2 1 20 2 1 42 62 3 1 30 3 1 42 72 12

Loop Lifting for $x in (10,20,30) return $x + 42 iter pos item 1 1 10 2 1 20 3 1 30 iter = iter 1 iter 1 pos 1 item 1 1 1 42 2 1 42 3 1 42 + item 2 :(item,item 1 ) π iter,pos,item 2 iter pos item 2 1 1 52 2 1 62 3 1 72 12

Loop Lifting 13

for $x in (1,2,3,4) Loop Lifting return if ($x mod 2 = 0) then even else odd 13

for $x in (1,2,3,4) Loop Lifting return if ($x mod 2 = 0) then even else odd iter pos item 1 1 false 2 1 true 3 1 false 4 1 true 13

for $x in (1,2,3,4) Loop Lifting return if ($x mod 2 = 0) then even else odd σ item π iter iter pos item 1 1 false 2 1 true 3 1 false 4 1 true pos item 1 even pos item 1 odd σ item π iter 13

for $x in (1,2,3,4) Loop Lifting return if ($x mod 2 = 0) then even else odd iter pos item 2 1 true 4 1 true σ item π iter iter pos item 1 1 false 2 1 true 3 1 false 4 1 true pos item 1 even pos item 1 odd σ item π iter 13

for $x in (1,2,3,4) Loop Lifting return if ($x mod 2 = 0) then even else odd iter 2 4 σ item π iter iter pos item 1 1 false 2 1 true 3 1 false 4 1 true pos item 1 even pos item 1 odd σ item π iter 13

for $x in (1,2,3,4) Loop Lifting return if ($x mod 2 = 0) then even else odd iter pos item 2 1 even 4 1 even σ item π iter iter pos item 1 1 false 2 1 true 3 1 false 4 1 true pos item 1 even pos item 1 odd σ item π iter 13

for $x in (1,2,3,4) Loop Lifting return if ($x mod 2 = 0) then even else odd iter pos item 1 1 odd σ item π iter 2 1 even 3 1 odd iter pos item 1 1 false 2 1 true 3 1 false 4 1 true pos item 1 even pos item 1 odd 4 1 even σ item π iter 13

Intermediate Code: Relational Algebra XQuery Pathfinder XQuery Compiler Relational Algebra σ π 14

Intermediate Code: Relational Algebra XQuery Pathfinder XQuery Compiler Relational Algebra σ π Compiles into RA that mimics capabilites of SQL:1999 query engines. RA dialect is... 1. primitive 2. explicit 3. designed to be easily analyzable 14

Intermediate Code: Relational Algebra XQuery Pathfinder XQuery Compiler Relational Algebra σ π Code Gen Code Gen Code Gen Compiles into RA that mimics capabilites of SQL:1999 query engines. RA dialect is... 1. primitive 2. explicit 3. designed to be easily analyzable 14

Intermediate Code: Relational Algebra XQuery Pathfinder XQuery Compiler Relational Algebra σ π Code Gen Code Gen Code Gen Compiles into RA that mimics capabilites of SQL:1999 query engines. RA dialect is... 1. primitive SQL Q MIL 2. explicit (kxsystems) 3. designed to be easily analyzable 14

Typical Plans... 15

Relational XQuery Optimization let $d := fn:doc( ) return for $a in $d//a for $b in $d//b where $b/@c = $a/@d return $b 16

Relational XQuery Optimization We use concepts in the relational domain to analyze plans and to steer simplification and optimization. Detect join-like XQuery expressions. Let XQuery s syntactic diversity not affect the detection: let $d := fn:doc( ) return for $a in $d//a for $b in $d//b where $b/@c = $a/@d return $b 16

Relational XQuery Optimization We use concepts in the relational domain to analyze plans and to steer simplification and optimization. Detect join-like XQuery expressions. Let XQuery s syntactic diversity not affect the detection: let $d := fn:doc( ) return $d//b[@c = $d//a/@c] 16

Relational XQuery Optimization We use concepts in the relational domain to analyze plans and to steer simplification and optimization. Detect join-like XQuery expressions. Let XQuery s syntactic diversity not affect the detection: let $d := fn:doc( ) return $d//b[@c = $d//a/@c] For both queries, the value-based XQuery join surfaces as a multi-valued dependency in the algebraic plans. 16

ROW# (pos1:<sort, pos>/outer) (iter, pos, item:res) access textnode content (res:<item>) X (iter = inner) @ (pos), val: #1 (iter:inner, item) (outer:iter, sort:pos, inner) NUMBER (inner) ROW# (pos:<item>/iter) DISTINCT (iter, item) ROW# (pos:<item>/iter) / child::text (iter, item) (iter, item) ROW# (pos:<item>/iter) DISTINCT (iter, item) ROW# (pos:<item>/iter) / child::element name { item* } (iter, item) (iter, pos, item:cast) CAST (cast:<item>), type: ua (iter, pos, item:res) access attribute value (res:<item>) @ (pos), val: #1 (iter:inner, item) (iter, item) @ (pos), val: #1 (iter:inner, item) X (iter = inner) (outer:iter, sort:pos, inner) NUMBER (inner) ROW# (pos:<item>/iter) DISTINCT (iter, item) ROW# (pos:<item>/iter) / attribute::attribute id { atomic* } (iter, item) (iter:inner, item) @ (pos), val: #1 (iter:inner, item) NUMBER (inner) ROW# (pos:<item>/iter) DISTINCT (iter, item) ROW# (pos:<item>/iter) / child::element closed_auction { item* } (iter, item) (iter, item) ROW# (pos:<item>/iter) NUMBER (inner) ROW# (pos:<item>/iter) DISTINCT (iter, item) ROW# (pos:<item>/iter) / child::element person { item* } (iter, item) (iter, item) ROW# (pos:<item>/iter) DISTINCT (iter, item) DISTINCT (iter, item) ROW# (pos:<item>/iter) / child::element closed_auctions { item* } (iter, item) ROW# (pos:<item>/iter) / child::element people { item* } (iter, item) (iter, item) (iter) @ (pos), val: #1 (iter:inner, item) (iter, item) ROW# (pos:<item>/iter) DISTINCT (iter, item) DIFF ROW# (pos:<item>/iter) @ (item), val: false DISTINCT (iter:outer) NUMBER (inner) / child::element site { item* } (iter, item) ROW# (pos1:<sort, pos>/outer) (outer:iter, sort:pos, inner) NUMBER (inner) X (iter = inner) @ (item), val: 1 @ (pos), val: #1 (iter) SEL (item) (iter, pos, item:res) = (res:<item, item1>) X (iter = iter1) (iter:inner, item) X (iter = outer) (iter, pos, item:res) NOT (res:<item>) @ (pos), val: #1 U @ (item), val: true (iter, pos, item:cast) (iter1:iter, item1:cast) CAST (cast:<item>), type: str (iter:inner, pos, item) X (iter = outer) ROW# (pos1:<sort, pos>/outer) (iter:outer, pos:pos1, item) (iter:outer, pos:pos1, item) CAST (cast:<item>), type: str ROW# (pos1:<sort, pos>/outer) (iter, pos, item:cast) CAST (cast:<item>), type: ua (iter, pos, item:res) access attribute value (res:<item>) @ (pos), val: #1 X (iter = inner) (iter:inner, item) @ (pos), val: #1 (iter:inner, item) X (iter = outer) (iter:inner, pos, item) (outer:iter, sort:pos, inner) NUMBER (inner) ROW# (pos:<item>/iter) DISTINCT (iter, item) ROW# (pos:<item>/iter) / attribute::attribute person { atomic* } (iter, item) (iter, item) ROW# (pos:<item>/iter) DISTINCT (iter, item) ROW# (pos:<item>/iter) (outer:iter, sort:pos, inner) (outer:iter, sort:pos, inner) (outer:iter, sort:pos, inner) / child::element buyer { item* } (iter, item) X (iter = outer) (iter, item) Rewriting XMark Q8 17

EMPTY_FRAG FRAG_UNION FRAGs ROW# (pos1:<sort>) ELEM (iter, item:<iter, item><iter, pos, item>) FRAG_UNION FRAG_UNION FRAGs @ (item), val: person (iter:outer, pos:pos1, item) ROW# (pos1:<sort>/outer) X (iter = inner) (iter, pos, item:res) access textnode content (res:<item>) @ (pos), val: #1 (iter:inner, item) ATTR (res:<item, item1>) X (iter = iter1) @ (pos), val: #1 NUMBER (inner) ROW# (pos:<item>/iter) / child::text (iter, item) X (iter = inner) @ (pos), val: #1 ROOTS ELEM_TAG @ (item), val: item (iter1:iter, item1:item) fn:string_join (outer:iter, sort:pos, inner) / child::element name { item* } (iter, item) @ (item), val: " " (iter:inner) (iter, pos:pos1, item) ROW# (pos1:<ord>/iter) @ (ord), val: #1 @ (ord), val: #2 (iter, pos, item:res) @ (pos), val: #1 ROOTS U FRAGs (iter, item:res) ROOTS TEXT (res:<cast>) CAST (cast:<item>), type: str U FRAG_UNION @ (item), val: 0 DIFF (iter) COUNT (item:/iter) (iter:outer) X (iter = inner) (iter) X (iter = iter1) (iter1:iter) SEL (item) (iter, item:res) NOT (res:<item>) U @ (item), val: false (iter:inner) (iter:inner, item) NUMBER (inner) ROW# (pos:<item>) / child::element person { item* } (iter, item) / child::element people { item* } (iter, item) FRAGs (iter:inner, item) NUMBER (inner) @ (item), val: true / child::element closed_auction { item* } (iter, item) / child::element closed_auctions { item* } (iter, item) / child::element site { item* } (iter, item) DOC DIFF (iter, item:cast) CAST (cast:<item>), type: ua (iter, item:res) access attribute value (res:<item>) / attribute::attribute person { atomic* } (iter, item) / child::element buyer { item* } (iter, item) @ (item), val: "auctiong.xml" DISTINCT (iter:outer) @ (item), val: false (iter:inner) DIFF NUMBER (inner) (iter:inner, item) X (iter = inner) (outer:iter, sort:pos, inner) / child::element site { item* } (iter, item) ROOTS (iter) SEL (item) (iter, item:res) NOT (res:<item>) U DISTINCT (iter:outer) @ (item), val: true X (iter = inner) (iter) SEL (item) (iter, item:res) = (res:<item, item1>) (iter, item:cast) X (iter = iter1) CAST (cast:<item>), type: str (iter:inner, item) X (iter = outer) (iter:inner, item) (iter:outer, item) X (iter = inner) (outer:iter, inner) (iter:inner, item) NUMBER (inner) (iter1:iter, item1:cast) CAST (cast:<item>), type: str (iter, item:cast) CAST (cast:<item>), type: ua (iter, item:res) (iter:inner, item) access attribute value (res:<item>) (iter:inner, item) NUMBER (inner) (iter:outer, item) X (iter = inner) (outer:iter, inner) (outer:iter, inner) NUMBER (inner) / attribute::attribute id { atomic* } (iter, item) (iter:inner, item) X (iter = outer) (iter:inner, item) X (iter = outer) (outer:iter, inner) (outer:iter, inner) Rewriting XMark Q8 Rewriting phases: 1. Constant propagation and projection pushdown, 17

EMPTY_FRAG FRAG_UNION FRAGs SERIALIZE (item, pos) ROW# (pos:<pos1>) (pos1, item) X (iter = iter1) ELEM (iter1, item:<iter1, item><iter1, pos, item>) (iter1, item1, pos) ROW# (pos:<pos1>/iter1) (iter1, pos1, item1) access textnode content (item1:<item>) ROW# (pos1:<item>/iter1) / child::text (iter1, item) ROOTS FRAG_UNION FRAG_UNION FRAGs ATTR (item:<item2, item1>) @ (item2), val: person fn:string_join / child::element name { item* } (iter1, item) (item:item1, iter1:iter) @ (item1), val: " " ELEM_TAG @ (item), val: item NUMBER (iter) ROW# (pos1:<item1>) (item1) (iter1:iter) / child::element person { item* } (iter, item1) (iter1, item, pos) ROW# (pos:<pos1>/iter1) U @ (pos1), val: #1 @ (pos1), val: #2 (iter1, item) (iter1, item) ROOTS FRAGs TEXT (item:<item2>) CAST (item2:<item1>), type: str U @ (item1), val: 0 DIFF (iter1) ROOTS COUNT (item1:/iter1) (iter1) X (iter = iter2) (iter:iter2) DISTINCT (iter2) SEL (item) (iter2, item) = (item:<item3, item4>) (iter2, item3, item4) CAST (item4:<item>), type: str CAST (item3:<item2>), type: str (iter2, item2, item) CAST (item:<item4>), type: ua access attribute value (item4:<item3>) (item3, iter2, item2) / child::element people { item* } (iter, item1) FRAG_UNION X (iter1 = iter3) (iter1:iter3, item3) / attribute::attribute id { atomic* } (iter3, item3) (item3:item, iter3:iter1) FRAGs (iter2, item2:item1, iter3:iter1) NUMBER (iter1) (item, iter2, item1) CAST (item1:<item3>), type: ua access attribute value (item3:<item2>) / child::element site { item* } (iter, item1) DOC TBL: (iter item) [#1,"auctionG.xml"] @ (iter), val: #1 (item1:item) ROOTS (item2, item, iter2) X (iter = iter3) (iter:iter3, item2) / attribute::attribute person { atomic* } (iter3, item2) / child::element buyer { item* } (iter3, item2) (item, item1, iter1, iter2:iter) (item2:item, iter3:iter) (item:item1, iter2:iter, iter3:iter) (item) NUMBER (iter) (item1, iter1:iter) / child::element closed_auction { item* } (iter, item) / child::element closed_auctions { item* } (iter, item) / child::element site { item* } (iter, item) Rewriting XMark Q8 Rewriting phases: 1. Constant propagation and projection pushdown, 2. functional dependency and data flow analysis, 17

EMPTY_FRAG FRAG_UNION FRAGs SERIALIZE (item, pos) ROW# (pos:<pos1>) (pos1, item) X (iter = iter1) ELEM (iter1, item:<iter1, item><iter1, pos, item>) (iter1, item1, pos) ROW# (pos:<pos1>/iter1) (iter1, pos1, item1) access textnode content (item1:<item>) ROW# (pos1:<item>/iter1) / child::text (iter1, item) ROOTS FRAG_UNION FRAG_UNION FRAGs ATTR (item:<item2, item1>) @ (item2), val: person fn:string_join @ (item1), val: " " / child::element name { item* } (iter1, item) (item:item1, iter1:iter) ELEM_TAG @ (item), val: item (iter1:iter) NUMBER (iter) ROW# (pos1:<item1>) (item1) / child::element person { item* } (iter, item1) (iter1, item, pos) ROW# (pos:<pos1>/iter1) U @ (pos1), val: #1 @ (pos1), val: #2 (iter1, item) (iter1, item) ROOTS FRAGs TEXT (item:<item2>) CAST (item2:<item1>), type: str U @ (item1), val: 0 DIFF (iter1) COUNT (item1:/iter1) (iter1) DISTINCT (iter, iter1) ROOTS X (item1 = item) (iter, item1) (iter1, item) access attribute value (item1:<item>) (item, iter:iter2) / attribute::attribute person { atomic* } (iter2, item) / child::element people { item* } (iter, item1) FRAG_UNION FRAGs / child::element buyer { item* } (iter2, item) DOC TBL: (iter item1) [#1,"auctionG.xml"] (item:item1, iter2:iter) NUMBER (iter) ROOTS (item1) / child::element closed_auction { item* } (iter, item1) / child::element closed_auctions { item* } (iter, item1) / child::element site { item* } (iter, item1) access attribute value (item:<item1>) (item1, iter1:iter2) / attribute::attribute id { atomic* } (iter2, item1) (item1, iter2:iter) Rewriting XMark Q8 Rewriting phases: 1. Constant propagation and projection pushdown, 2. functional dependency and data flow analysis, 3. algebraic XQuery join detection, 17

Attach (item), val: item SERIALIZE ROOTS twig (iter, item) ELEM (iter, item) fcns ATTR (iter, item, item1) Attach (item), val: person fn:string_join fcns access textnode content (item1:<item>) Attach (item1), val: " " Project (iter, item) / + child::text (item:item1) level=5 / + child::element name { item* } (item1:iter) level=4 Project (iter) / + child::element person { item* } (iter:item) level=3 Project (item) TEXT (iter, item) Project (iter, item) access attribute value (item1:<item>) / + child::element people { item* } (item:item1) level=2 CAST (item:<item1>), type: str / + attribute::attribute id { atomic* } (item:iter) level=4 Project (item1) / + child::element site { item* } (item1:item) level=1 ROOTS DOC (iter, item) TBL: (iter item) [#1,"auctionG.xml"] Project (item) UNION COUNT (item1:/iter) Project (iter) ThetaJoin (item eq item1) Project (iter, item1) access attribute value (item:<item1>) Project (item1) / + attribute::attribute person { atomic* } (item1:item) level=5 Project (item) / + child::element buyer { item* } (item:item1) level=4 Project (item1) / + child::element closed_auction { item* } (item1:item) level=3 Project (item) / + child::element closed_auctions { item* } (item:item1) level=2 Attach (item1), val: 0 DIFF Project (iter) Rewriting XMark Q8 Rewriting phases: 1. Constant propagation and projection pushdown, 2. functional dependency and data flow analysis, 3. algebraic XQuery join detection, 4. XPath join graph isolation, 17

EMPTY FRAG item1 :item item:iter descendant::text() GPS = 802, level = 5 SERIALIZE FRAG UNION FRAGs FRAGs ROOTS twig (iter, item) DOC (iter, item) TBL: (iter item) item:item1 πitem1 ROOTS πiter,item1 πitem item1 :item attribute(person) GPS = 960, level = 5 πitem item:item1 descendant::element(buyer) GPS = 959, level = 4 πitem1 :item item = item1 πiter,item1 item1 :item πiter ELEM (iter, item) πitem @item: item item:iter attribute(id) GPS = 799, level = 4 item:iter fcns ATTR (iter, item, item1) @item: person COUNT item1 :()/iter πiter fcns TEXT (iter, item) πiter,item descendant::element(person) GPS = 798, level = 3 nil CAST (item: item1 ), type: str @item1 :0 \ πiter Rewriting XMark Q8 Rewriting phases: 1. Constant propagation and projection pushdown, 2. functional dependency and data flow analysis, 3. algebraic XQuery join detection, 4. XPath join graph isolation, 5. data guide exploitation ( Node GPS ). Pathfinder & IBM DB2 V9, 115 MB XML input data < 1 sec 17

Rewriting XMark Q8 π iter Rewriting phases: item=item1 1. Constant propagation and projection pushdown, π item π iter,item1 2. functional dependency and data flow analysis, item:item1 item1 :item 3. algebraic XQuery join detection, π item1 item1 :item 4. XPath join graph item:iter GPS = descendant::text() 802, isolation, level = 5 attribute(person) GPS = 960, level = 5 item1 :item item:iter a GPS = 79 5. data guide exploitation ( Node GPS ). π item Pathfinder & IBM DB2 V9, 115 MB XML item:item1 input data < 1 sec descendant::element(buyer) GPS = 959, level = 4 17 π item π item1 :item

upstream Can DB2 Cope? SORT(33) 373.32 HSJOIN(35) 373.32 NLJOIN(37) 75.02 HSJOIN(43) 298.3 IXSCAN(41) 50.01 NLJOIN(45) 149.15 DB2 Version 9 NLJOIN( IDX_GUIDE_PRE for Linux, UNIX, and Windows IXSCAN(53) 50.01 NLJOIN(47) 75.02 NLJOIN( XMARK_1.DOC IDX_GUIDE_PRE_VALUE IXSCAN(51) 50.01 IXSCAN(49) 50.01 IXSCAN(63) 50.0 XMARK_1.DOC IDX_GUIDE_PRE_PRE_PLUS_SIZE IDX_VALUE_KIND_PRE_SIZE IDX_GUIDE_PRE_VAL XMARK_1.DOC XMARK_1.DOC XMARK_1.DOC 18

upstream Can DB2 Cope? SORT(33) 373.32 HSJOIN(35) 373.32 *) Indexes avised by DB2 itself NLJOIN(37) 75.02 HSJOIN(43) 298.3 IXSCAN(41) 50.01 NLJOIN(45) 149.15 DB2 Version 9 NLJOIN( IDX_GUIDE_PRE for Linux, UNIX, and Windows IXSCAN(53) 50.01 NLJOIN(47) 75.02 NLJOIN( XMARK_1.DOC *) IDX_GUIDE_PRE_VALUE IXSCAN(51) 50.01 IXSCAN(49) 50.01 IXSCAN(63) 50.0 XMARK_1.DOC IDX_GUIDE_PRE_PRE_PLUS_SIZE *) *) IDX_VALUE_KIND_PRE_SIZE IDX_GUIDE_PRE_VAL XMARK_1.DOC XMARK_1.DOC XMARK_1.DOC 18

Order Indifference let $warning := do not press button, computer will explode! return fn:distinct-doc-order( for $x in $warning//* return $x/text() ) 19

Order Indifference Order in XML and XQuery processing is essential: let $warning := do not press button, computer will explode! return fn:distinct-doc-order( for $x in $warning//* return $x/text() ) 19

Order Indifference Order in XML and XQuery processing is essential: let $warning := do not press button, computer will explode! return fn:distinct-doc-order( for $x in $warning//* return $x/text() ) Do not press button, computer will explode! 19

Order Indifference Order in XML and XQuery processing is essential: let $warning := do not press button, computer will explode! return for $x in $warning//* return $x/text() Do press button, computer will not explode! 19

Order Indifference Order in XML and XQuery processing is essential: let $warning := do not press button, computer will explode! return for $x in $warning//* return $x/text() Do press button, computer will not explode! Order-awareness thus is often deeply wired into XQuery processors. Supporting constructs like unordered { e1 } is challenging. 19

Order Indifference Order in XML and XQuery processing is essential: let $warning := do not press button, computer will explode! return for $x in $warning//* return $x/text() Do press button, computer will not explode! Order-awareness thus is often deeply wired into XQuery processors. SERIALIZE (item, pos) Supporting constructs like unordered { e1 } is challenging. FRAG_UNION FRAGs (pos1, item) X (iter = iter1) ROOTS ELEM (iter1, item:<iter1, item><iter1, pos, item>) ROW# (pos:<pos1>) ELEM_TAG (iter1, item, pos) @ (item), val: item ROW# (pos:<pos1>/iter1) U 19 FRAG_UNION FRAG_UNION @ (pos1), val: #1 @ (pos1), val: #2 (iter1, item) (iter1, item) ROOTS FRAGs ROOTS FRAGs TEXT (item:<item2>)

Order Indifference Order in XML and XQuery processing is essential: let $warning := do not press button, computer will explode! return for $x in $warning//* return $x/text() Do press button, computer will not explode! Order-awareness thus is often deeply wired into XQuery processors. π iter,item SERIALIZE (item, pos) Supporting constructs like unordered { e1 } is challenging. FRAG_UNION FRAGs (pos1, item) X (iter = iter1) ROOTS ELEM (iter1, item:<iter1, item><iter1, pos, item>) ROW# (pos:<pos1>) ELEM_TAG (iter1, item, pos) @ (item), val: item ROW# (pos:<pos1>/iter1) U 19 FRAG_UNION FRAG_UNION @ (pos1), val: #1 @ (pos1), val: #2 (iter1, item) (iter1, item) ROOTS FRAGs ROOTS FRAGs TEXT (item:<item2>)

More Purely Relational XQuery 20

More Purely Relational XQuery Declarative XQuery debugging with time travel. Users observe expressions and may traverse iterations forward/backward. 20

More Purely Relational XQuery Declarative XQuery debugging with time travel. Users observe expressions and may traverse iterations forward/backward. iter pos item 1 2 3 4 20

More Purely Relational XQuery Declarative XQuery debugging with time travel. Users observe expressions and may traverse iterations forward/backward. iter pos item 1 2 3 4 Dependable cardinality estimates at XQuery subexpression level (# items per iteration and overall contribution). for $x in $doc//x return if (e 1 ) then e 2 else e 3 20

More Purely Relational XQuery Declarative XQuery debugging with time travel. Users observe expressions and may traverse iterations forward/backward. iter pos item 1 2 3 4 Dependable cardinality estimates at XQuery subexpression level (# items per iteration and overall contribution). for $x in $doc//x 4.3 return if (e 1 ) then e 2 else e 3 8 2 20

Atomization Indexes <a> <c>same<d>shirt</d></c> different day </a> 21

Atomization Indexes <a> <c>same<d>shirt</d></c> different day </a> Same c b d a different day shirt 21

Atomization Indexes <a> <c>same<d>shirt</d></c> different day </a> Same c b d a different day node atom1 atom2 atom3 atom4 a b c shirt d 21

Atomization Indexes <a> <c>same<d>shirt</d></c> different day </a> Same c b d a different day node atom1 atom2 atom3 atom4 a b c Same d shirt different day shirt 21

Atomization Indexes <a> <c>same<d>shirt</d></c> different day </a> Same c b d a different day node atom1 atom2 atom3 atom4 a b c Same d shirt shirt different day shirt 21

Atomization Indexes <a> <c>same<d>shirt</d></c> different day </a> Same c b d a different day node atom1 atom2 atom3 atom4 a b c Same Same Same Same shirt shirt shirt different day different d shirt shirt different day shirt 21

Atomization Indexes <a> <c>same<d>shirt</d></c> different day </a> Same c b d a different day node atom1 atom2 atom3 atom4 a b c Same Same Same Same shirt shirt shirt different day different d shirt shirt different day shirt dict atom 21

Atomization Indexes <a> <c>same<d>shirt</d></c> different day </a> Same c b d a different day node atom1 atom2 atom3 atom4 a b c Sameδ 1 Sameδ 1 Sameδ 1 Sameδ 1 shirt shirt shirt different day different d shirt shirt different day shirt dict δ 1 Same atom 21

Atomization Indexes <a> <c>same<d>shirt</d></c> different day </a> Same c b d a different day node atom1 atom2 atom3 atom4 a δ 1 δ 2 δ 3 δ 4 b δ 1 δ 2 δ 3 c δ 1 δ 2 δ 1 d δ 2 δ 2 δ 3 δ 4 21 shirt dict atom δ 1 Same δ 2 shirt δ 3 different day δ 4

Atomization Indexes <a> <c>same<d>shirt</d></c> different day </a> Same c b d a different day node atom1 atom2 atom3 atom4 a δ 15 δ 2 δ 3 δ 4 b δ 15 δ 2 δ 3 c δ 15 δ 2 δ 1 d δ 2 δ 2 δ 3 shirt dict atom δ 1 Same δ 2 shirt δ 3 different δ 4 day δ 5 Same shirt δ 4 21

Atomization Indexes <a> <c>same<d>shirt</d></c> different day </a> Same c b d a different day node atom1 atom2 atom3 atom4 a δ 15 6 δ 2 δ 3 δ 4 b δ 15 6 δ 2 δ 3 c δ 1 5 δ 2 δ 1 d δ 2 δ 2 δ 3 δ 4 21 shirt dict atom δ 1 Same δ 2 shirt δ 3 different δ 4 day δ 5 δ 1 δ 2 δ 6 δ 5 δ 3

Atomization Indexes <a> <c>same<d>shirt</d></c> different day </a> Same c b d a different day node atom1 atom2 a δ 6 δ 4 b δ 6 c δ 5 δ 1 d δ 2 δ 2 δ 3 δ 4 21 shirt dict atom δ 1 Same δ 2 shirt δ 3 different δ 4 day δ 5 δ 1 δ 2 δ 6 δ 5 δ 3

Relational Encodings for XML XML documents represent ordered, unranked trees (nodes of 7 kinds). Relational XQuery processing calls for an adequate relational encoding of such trees: 1. Schema-oblivious (XQuery constructs arbitrary XML fragments), 2. accessible for the query processor and index support ( 10 7 nodes in typical GB-range XML instances). 22

Relational Encodings for XML XML documents represent ordered, unranked trees (nodes of 7 kinds). Relational XQuery processing calls for an adequate relational encoding of such trees: 1. Schema-oblivious (XQuery constructs arbitrary XML fragments), 2. accessible for the query processor and index support ( 10 7 nodes in typical GB-range XML instances). doc X 22 1 <a>b<c/>d<e/><f>g</f></a> 2

Pre/Postorder Encoding <a> <c><d/>e</c> <f> <h><j/></h> </f> </a> 23

Pre/Postorder Encoding <a> <c><d/>e</c> <f> <h><j/></h> </f> </a> d c b e a g i f h j 23

Pre/Postorder Encoding <a> <c><d/>e</c> <f> <h><j/></h> </f> </a> 1b3b 2c2c 3d0d 0a9a 6g4g 4e1e 5f8f 8i5i 7h7h 9j6j 23

Pre/Postorder Encoding <a> <c><d/>e</c> <f> <h><j/></h> </f> </a> 1b3b 2c2c 3d0d 0a9a 6g4g 4e1e 5f8f 8i5i 7h7h 9j6j 23 pre post node 0 9 a 1 3 b 2 2 c 3 0 d 4 1 e 5 8 f 6 4 g 7 7 h 8 5 i 9 6 j

Pre/Postorder Encoding <a> <c><d/>e</c> <f> <h><j/></h> </f> </a> 1b3b 2c2c 3d0d 0a9a 6g4g 4e1e 5f8f 8i5i 7h7h 9j6j 23 document order pre post node 0 9 a 1 3 b 2 2 c 3 0 d 4 1 e 5 8 f 6 4 g 7 7 h 8 5 i 9 6 j

Pre/Postorder Encoding <a> <c><d/>e</c> <f> <h><j/></h> </f> </a> 1b3b 2c2c 3d0d 0a9a 6g4g 4e1e 5f8f 8i5i 7h7h 9j6j document order pre post node 0 9 a 1 3 b 2 2 c 3 0 d 4 1 e 5 8 f 6 4 g 7 7 h 8 5 i 9 6 j kind elem elem elem elem text elem comm elem elem elem 23

Pre/Postorder Encoding <a> <c><d/>e</c> <f> <h><j/></h> </f> </a> 1b3b 2c2c 3d0d 0a9a 6g4g 4e1e 5f8f 8i5i 7h7h 9j6j document order pre post node 0 9 a 1 3 b 2 2 c 3 0 d 4 1 e 5 8 f 6 4 g 7 7 h 8 5 i 9 6 j kind elem elem elem elem text elem comm elem elem elem size level 9 0 3 1 2 2 0 3 0 3 4 1 0 2 2 2 0 3 0 3 23

Pre/Postorder Encoding 0a9a <a> <c><d/>e</c> <f> <h><j/></h> </f> </a> 1b3b 2c2c 3d0d 4e1e 6g4g 5f8f 8i5i 7h7h 9j6j document order pre node 0 a 1 b 2 c 3 d 4 e 5 f 6 g 7 h 8 i 9 j kind elem elem elem elem text elem comm elem elem elem size level 9 0 3 1 2 2 0 3 0 3 4 1 0 2 2 2 0 3 0 3 23

Pre/Postorder Encoding <a> <c><d/>e</c> <f> <h><j/></h> </f> </a> post 5 5 pre 23 3d0d 1b3b 2c2c document order 4e1e 0a9a 6g4g pre node 0 a 1 b 2 c 3 d 4 e 5 f 6 g 7 h 8 i 9 j 5f8f 8i5i kind elem elem elem elem text elem comm elem elem elem 7h7h 9j6j size level 9 0 3 1 2 2 0 3 0 3 4 1 0 2 2 2 0 3 0 3

Pre/Postorder Encoding <a> <c><d/>e</c> <f> <h><j/></h> </f> </a> post a 5 b c d f e 5 g h i j pre 23 3d0d 1b3b 2c2c document order 4e1e 0a9a 6g4g pre node 0 a 1 b 2 c 3 d 4 e 5 f 6 g 7 h 8 i 9 j 5f8f 8i5i kind elem elem elem elem text elem comm elem elem elem 7h7h 9j6j size level 9 0 3 1 2 2 0 3 0 3 4 1 0 2 2 2 0 3 0 3

B-Trees Accelerate XPath Location Steps The XPath axes ancestor, descendant, preceding, following partition any XML document: 24

B-Trees Accelerate XPath Location Steps The XPath axes ancestor, descendant, preceding, following partition any XML document: a f h 5 b c e g i j d 5 24

B-Trees Accelerate XPath Location Steps The XPath axes ancestor, descendant, preceding, following partition any XML document: a 5 b c d f e 5 g h i j 24 B-Tree pre post node 0 9 a 1 3 b 2 2 c 3 0 d 4 1 e 5 8 f 6 4 g 7 7 h 8 5 i 9 6 j

Partitioned B-Trees 25

Partitioned B-Trees Use low selectivity key prefixes to partition the XML nodes in a B-Tree according to various criteria: 25

Partitioned B-Trees Use low selectivity key prefixes to partition the XML nodes in a B-Tree according to various criteria: 1. level (support XPath child axis) B-Tree 25

Partitioned B-Trees Use low selectivity key prefixes to partition the XML nodes in a B-Tree according to various criteria: 1. level (support XPath child axis) B-Tree level = 1 level = 2 level = 3 25

Partitioned B-Trees Use low selectivity key prefixes to partition the XML nodes in a B-Tree according to various criteria: 1. level (support XPath child axis) B-Tree 2. element tag name 3. path to node ( Node GPS ) level = 1 level = 2 level = 3 Given a Pathfinder-generated workload, the index advisor of IBM DB2 V9 proposes such partitionings automatically. 25

Injecting More Tree Awareness In XQuery, an XPath location step originates in a sequence of context nodes. Duplicate nodes, out of order results (XPath semantics!), wasted work. 26

Injecting More Tree Awareness In XQuery, an XPath location step originates in a sequence of context nodes. Duplicate nodes, out of order results (XPath semantics!), wasted work. c3 c4 c1 c2 26

Injecting More Tree Awareness In XQuery, an XPath location step originates in a sequence of context nodes. Duplicate nodes, out of order results (XPath semantics!), wasted work. Pathfinder XQuery Compiler c1 c3 c4 XPath/XQuery Semantics XML Encoding c2 26

Injecting More Tree Awareness In XQuery, an XPath location step originates in a sequence of context nodes. Duplicate nodes, out of order results (XPath semantics!), wasted work. Pathfinder XQuery Compiler c1 c2 c3 c4 XPath/XQuery Semantics XML Encoding staircase join 26

Staircase Join: Context Pruning XPath following axis: c3 c4 c1 c2 27

Staircase Join: Context Pruning XPath following axis: c1 c2 27

Staircase Join: Context Pruning XPath following axis: c 1 c1 c2 c 2 27

Staircase Join: Context Pruning XPath following axis: c 1 c2 c 2 Index scan yields duplicate free result in document order. 27

Staircase Join: Skipping XPath descendant axis: c3 c4 c1 c2 28

Staircase Join: Skipping XPath descendant axis: c3 c1 28

Staircase Join: Skipping XPath descendant axis: c3 c1 scan 28

Staircase Join: Skipping XPath descendant axis: c3 v c 1 v c1 scan 28

Staircase Join: Skipping XPath descendant axis: c3 v c 1 v c1 scan skip scan Index scan yields duplicate free result in document order. 28

MonetDB/XQuery 29

MonetDB/XQuery MonetDB: Extensible relational database kernel, optimized for query processing close to the CPU. Full vertical fragmentation (column store). pre post node 0 9 a 1 3 b 2 2 c 3 0 d 4 1 e 5 8 f 6 4 g 7 7 h 8 5 i 9 6 j 29

MonetDB/XQuery MonetDB: Extensible relational database kernel, optimized for query processing close to the CPU. Full vertical fragmentation (column store). pre post 0 9 1 3 2 2 3 0 4 1 5 8 6 4 7 7 8 5 9 6 pre node 0 a 1 b 2 c 3 d 4 e 5 f 6 g 7 h 8 i 9 j 29

MonetDB/XQuery MonetDB: Extensible relational database kernel, optimized for query processing close to the CPU. Full vertical fragmentation (column store). Pathfinder + MonetDB = MonetDB/XQuery. Implementation of XQuery 1.0. Evaluates queries against GB-range XML instances in interactive time. pre post 0 9 1 3 2 2 3 0 4 1 5 8 6 4 7 7 8 5 9 6 pre node 0 a 1 b 2 c 3 d 4 e 5 f 6 g 7 h 8 i 9 j + XQuery Update, XQuery RPC ( execute at uri { e } ), support for multi-dimensional XML,... 29

DB2 s XML support doesn t beat its relational self. 30

DB2 s XML support doesn t beat its relational self. Pathfinder builds on 30+ years of development of relational database technology. 30

DB2 s XML support doesn t beat its relational self. Pathfinder builds on 30+ years of development of relational database technology. Outperforms the built-in DB2 V9 purexml engine, especially for large XML input instances. 30

DB2 s XML support doesn t beat its relational self. Pathfinder builds on 30+ years of development of relational database technology. Outperforms the built-in DB2 V9 purexml engine, especially for large XML input instances. Covers other data-intensive languages: 1. SQL/XML 30

DB2 s XML support doesn t beat its relational self. Pathfinder builds on 30+ years of development of relational database technology. Outperforms the built-in DB2 V9 purexml engine, especially for large XML input instances. Covers other data-intensive languages: 1. SQL/XML 2. LINQ + other Nested Loop languages. 30

grust@in.tum.de http://www-db.in.tum.de/~grust/ 31