Performance Tips for DataDirect XQuery 2.0 Table of Contents Introduction... 1 Preparing Queries... 1 Query Pooling... 2 Querying Large XML Files... 2 Retrieving Query Results... 3 Avoiding Compensation in WHERE Clauses and Predicates... 4 Specifying SQL Generation Algorithms... 5 Summary... 6 Introduction DataDirect XQuery was designed for performance and scalability, which are critical to our customers. This document describes some of the performance optimizations in our XQuery implementation and provides guidelines that will make most applications perform well. Some of these guidelines may seem quite obvious to some users for instance, most JDBC programmers know they should prepare queries, and most XML programmers know that the DOM requires more memory and is slower than SAX or StAX this document is designed to be helpful to both groups. Preparing Queries The first time a query is executed, a significant percent of the total query time is spent preparing the query, which includes query optimization and generating the query plan. Because most programs invoke queries more than once, we spend the time needed to find an efficient way to process the query, even though that makes the first query slower. Fortunately, this effort does not need to be repeated. One way to ensure that a query is prepared only once in your program is to use a prepared query. (Another way is to use Query Pooling, described in the following section). To parameterize a query, use external variables to change the values used for the query each time it is invoked. For instance, the following query creates a portfolio for a given user: Example 1. A Query with an External Variable declare variable $user as xs:string external; collection('holdings')/holdings[userid=$user] In this query, $user is an external variable which must be bound before executing the query. The following XQJ code shows how to prepare the query and bind a value to $user. Example 2. Preparing a Query // Get a connection, prepare the query XQDataSource datasource = new DDXQDataSource(); datasource.setjdbcurl("jdbc:xquery:sqlserver://server1:1433;databasename=stocks"); XQConnection connection = datasource.getconnection(); XQPreparedExpression preparedexpression = connection.prepareexpression(xquerytext); // Bind variable $user to 'Jonathan' and execute preparedexpression.bindstring(new QName("user"), "Jonathan"); XQSequence xqsequence = preparedexpression.executequery(); The following code shows how to bind $user to a different value and execute prepared query again. 1
Example 3. Executing a Prepared Query with New Values // Bind variable $user to 'Minollo' and execute preparedexpression.bindstring(new QName("user"), "Minollo"); xqsequence = preparedexpression.executequery(); Benchmarks that measure the speed of a query should mirror the usage you expect in your program. Since most programs execute a query many times, when you design a benchmark you should prepare your query, then execute it many times. Query Pooling Prepared expressions let the programmer decide when a query should be prepared and when a prepared query is no longer needed. In programs that use hundreds or thousands of queries, it is generally better to let DataDirect XQuery keep track of which queries are actually being used, preparing them the first time they are executed, and discarding the least frequently used prepared queries if there are too many queries. This is called query pooling. In DataDirect XQuery, you enable query pooling by specifying the number of prepared queries to hold in the pool. If you configure your connection by setting properties in the Java API, you can set the MaxPooledQueries property to do this; for instance, the following code sets the size of the query pool to 20: Example 4. Enabling Query Pooling with the Java API XQDataSource datasource = new DDXQDataSource(); datasource.setjdbcurl("jdbc:xquery:sqlserver://server1:1433;databasename=stocks"); datasource.setmaxpooledqueries(20); XQConnection connection = datasource.getconnection(); If you configure your connection using a Source Configuration File, you can set the size of the query pool using the max- PooledQueries element: Example 5. Enabling Query Pooling in a Source Configuration File <?xml version="1.0" encoding="utf-8"?> <XQJConnection xmlns="http://www.datadirect.com/xquery"> <baseuri>file:///c:/program%20files/apache%20software%20foundation/tomcat%205.5/</baseuri> <maxpooledqueries>20</maxpooledqueries> <JDBCConnection name="stocks"> <url>jdbc:xquery:sqlserver://localhost:1433</url> <sqlxmlmapping> <forest>true</forest> <identifierescaping>none</identifierescaping> </sqlxmlmapping> </JDBCConnection> </XQJConnection> Querying Large XML Files DataDirect XQuery provides several ways to query an XML file. In DataDirect XQuery 2.0, you will get better performance, especially for large XML files, if you use fn:doc() to access XML in a query. You will get much worse performance if you parse the XML to create a DOM tree, bind the DOM tree to an external variable, and query the external variable. In most XQuery processors, a large percent of the time needed to query a large XML file is spent parsing the file and creating an in-memory representation that can be queried. The in-memory representation may be many times the size of the original XML file, and Java VM out of memory errors may occur at run time. When your query addresses an XML file using fn:doc(), DataDirect XQuery uses a technique known as Document Projection 1, which allows it to create only the part of a document needed by the query. This results in dramatic improvements in memory usage and scalability, and significant improvements in performance. Using document projection, you can generally query documents many times the size of available memory. Incidentally, if your application uses XML Deployment Adapters [http://www.stylusstudio.com/deployment/datadirectxquery/], it addresses these adapters using fn:doc(), and DataDirect XQuery uses Document Projection appropriately for these adapters just as it does for XML documents. 1 Document Projection is described in A. Marian and J. Simeon, "Projecting XML Documents", Proc. VLDB 2003. 2
Document projection uses the path expressions in a query to determine what parts of a document need to be built. Path expressions that use wildcards ('*') provide less information for document projection, so they should be avoided when querying large documents. For instance, in the examples below, the second path expression will perform much better for large documents. Example 6. Avoid path expressions with wildcards doc("scenario.xml")//*[role='teacher'] Example 7. Specific path expressions aid document projection doc("scenario.xml")/scenario/people/person[role='teacher'] If you parse a document to create a DOM tree, then bind the DOM tree to an external variable, you bypass DataDirect's Document Projection. This forces your program to create the entire in-memory representation of your XML document not just the part that DataDirect XQuery needs to process your queries. In addition to the savings due to Document Projection, DataDirect's internal representation of an XML document is much more efficient than most DOM implementations, so using fn:doc() gives you two substantial optimizations that you lose if you parse the document yourself. DataDirect XQuery also supports a second technique known as Streaming. Streaming processes a document sequentially, discarding portions of the document that are no longer needed to produce further query results. This reduces memory usage because only the portion of a document needed at a given stage of query processing is instantiated in memory. Streaming can reduce memory consumption, but generally involves a (relatively minor) performance penalty. Unlike document projection, streaming is not generally a performance win. Also, some queries require much of the document to be instantiated at a given time, diminishing the benefits of streaming. For some queries, however, streaming can be a substantial win, especially if the query results exceed available memory; therefore, we allow streaming to be enabled, but we do not enable it by default. To enable streaming, set the ddtek:xml-streaming option to 'yes' in the XQuery Prolog: Example 8. Enabling Streaming in the Query Prolog declare option ddtek:xml-streaming 'yes'; You can also enable streaming at the DataSource level, which means streaming will be used for all queries that use the DataSource. One way to do this is to set the Pragmas property of the DDXQDataSource class. Example 9. Enabling Streaming for a DataSource in the Java API XQDataSource datasource = new DDXQDataSource(); datasource.setjdbcurl("jdbc:xquery:sqlserver://server1:1433;databasename=stocks"); datasource.setpragmas("xml-streaming=yes"); Another way to enable streaming at the DataSource level is to specify the declaration option using a pragma element in a Source Configuration File: Example 10. Enabling Streaming for a DataSource in a Source Configuration File <?xml version="1.0" encoding="utf-8"?> <XQJConnection xmlns="http://www.datadirect.com/xquery"> <baseuri>file:///c:/program%20files/apache%20software%20foundation/tomcat%205.5/</baseuri> <maxpooledqueries>20</maxpooledqueries> <pragma name="xml-streaming">yes</pragma> <JDBCConnection name="stocks"> <url>jdbc:xquery:sqlserver://localhost:1433</url> <sqlxmlmapping> <forest>true</forest> <identifierescaping>none</identifierescaping> </sqlxmlmapping> </JDBCConnection> </XQJConnection> Retrieving Query Results DataDirect XQuery is carefully designed to allow very large query results to be retrieved efficiently, and to allow incremental instantiation of query results, so that memory can also be used efficiently. If your query's results can be large, stream the results to an OutputStream or use SAX or StAX, so that retrieving the results does not introduce issues with memory or performance. Retrieving large query results as DOM can use large amounts of memory and significantly increase the time needed for processing queries. 3
If you want to write your query results to an output stream or a file, the easiest way is to use the writesequence() method of the result set to write directly to the stream, as shown in the following code. Example 11. Writing a Result Sequence to a Stream XQSequence sequence = expression.executequery(query); OutputStream os = new FileOutputStream("C:/temp/results.xml"); sequence.writesequence(os); Many XML programs process XML using SAX, and DataDirect XQuery can write query results directly to a SAX event handler, as shown in the following code. Example 12. Writing to a SAX Event Handler CustomSAXEventHandler eventhandler = new CustomSAXEventHandler(); sequence.writesequencetosax(eventhandler); For many kinds of programming, pull processing using StAX is both convenient and efficient. You can get a StAX stream from a result sequence using the getsequenceasstream() method: Example 13. Retrieving Results as a StAX Stream XMLStreamReader reader = sequence.getsequenceasstream(); Avoiding Compensation in WHERE Clauses and Predicates When a database is queried, DataDirect XQuery usually pushes queries to the database, where they are executed in SQL. However, some expressions, functions and casts are not supported by SQL, and are therefore performed in program memory, a strategy known as compensation. Unfortunately, when these operations are performed in main memory, they can not take advantage of indexes and other optimizations available in the database server. This can dramatically hurt performance. The DataDirect XQuery 2.0 User's Guide and Reference clearly states which expressions, functions, and operators are compensated. Most of the commonly used functions that are compensated fall into the following categories: String Functions that Use Pattern Matching Date and Time Functions for XML Schema Date/Time Datatypes Casts to XML Schema Datatypes not supported by the underlying database. XML whitespace functions FLWOR Expressions using the AT Clause Numeric Predicates When querying relational data, if a compensated function or operator is used in the WHERE clause of a FLWOR expression or in a predicate of a path expression, performance can suffer dramatically. For instance, in the following query, normalizespace() is compensated, and forces evaluation in memory; since this function must be called once for every row, the performance hit is very significant. Example 14. Avoid using compensated functions in a WHERE clause for $h in collection('stocks.dbo.historical')/historical where normalize-space($h/ticker) = 'AMZN' return $h One way to avoid compensation is to use a SQL function instead of a compensated XQuery function, as in the following example. 4
Example 15. Using a SQL function instead of a compensated function declare function ddtek-sql:rtrim($in as xs:string) as xs:string external; for $h in collection('historical')/historical where ddtek-sql:rtrim($h/ticker) = 'AMZN' return $h When querying relational data, using numeric predicates or an AT clause in a FLWOR expression forces compensation, which can dramatically impact performance, especially when it forces a join to be done in memory. Consider the following query. Example 16. Avoid AT clauses when querying relational data for $user at $i in collection("users")/users, $hold at $j in collection("holdings")/holdings return <holding> <userid>{ $i }</userid> <user>{ string($user/userid) }</user> <stockid>{ $j }</stockid> <stock>{ string($hold/stockticker) }</stock> <share>{ string($hold/shares }</share> </holding> This query uses AT clauses to identify the individual users and stocks, and these AT clauses force DataDirect XQuery to do the joins in memory instead of letting the database engine do them. Not only is this a dramatic performance hit, it's also a bad way to identify the users and holdings because rows in a relational table do not generally have a guaranteed stable order, these identifiers might change each time the query is run. It is generally much better to use the primary key from a table to identify the rows of that table. Example 17. For identifiers, rely on the primary keys from relational tables (:------------------------------------------------------------------- : userid is the primary key for collection("users") : stockticker, userid is the primary key for collection("holdings") --------------------------------------------------------------------:) for $user in collection("users")/users, $hold in collection("holdings")/holdings return <holding> <user>{ string($user/userid) }</user> <stock>{ string($hold/stockticker) }</stock> <share>{ string($hold/shares }</share> </holding> Specifying SQL Generation Algorithms The algorithms DataDirect XQuery uses to create hierarchies are optimal for creating large, complex documents, but still good for smaller documents with shallower hierarchies. If query results contain four or fewer levels of hierarchy, and your performance seems inadequate, you will sometimes get better performance if you tell DataDirect XQuery to use either outer join or sorted outer union (see the DataDirect 2.0 User's Guide and Reference for details on these algorithms). There are several ways to specify SQL generation algorithms in DataDirect XQuery: 5
Example 18. Specifying SQL Generation Algorithms The algorithm can be specified via the XQJ API: XQDataSource ds = new DDXQDataSource(); ds.setjdbcpragmas("sql-rewrite-algorithm=nested-loop"); The algorithm can be specified with an entry in the source configuration file: <pragma name="sql-rewrite-algorithm">outer-join</pragma> The algorithm can be specified in the XQuery prolog with a declaration option: declare option ddtek:sql-rewrite-algorithm "outer-join"; Summary DataDirect XQuery was designed for performance and scalability. It does sophisticated query optimization, supports prepared queries and query pooling, and efficiently queries large XML documents using document projection and streaming. If you follow these few simple guidelines, your DataDirect XQuery applications will usually have good performance: Prepare your queries, or use query pooling. Use fn:doc() when possible for XML input; avoid parsing XML input yourself. For large XML files, avoid wildcards ('*'), and use specific paths. For large XML files, consider streaming if memory usage is large and performance is unsatisfactory. Use SAX, StAX, or output streams for output; avoid using the DOM for output. Avoid compensation in WHERE clauses and predicates. Avoid AT clauses in FLWOR expressions that query relational tables. Avoid numeric predicates in expressions that operate on relational data. For shallow hierarchies, consider specifying a SQL generation algorithm. 6