Performance Tips for DataDirect XQuery 2.0



Similar documents
DataDirect XQuery Technical Overview

Data XML and XQuery A language that can combine and transform data

High Performance XML Data Retrieval

DataDirect XQuery and the Insurance Industry

SQL Server and MicroStrategy: Functional Overview Including Recommendations for Performance Optimization. MicroStrategy World 2016

How To Use X Query For Data Collection

ABSTRACT 1. INTRODUCTION. Kamil Bajda-Pawlikowski

Concrete uses of XML in software development and data analysis.

Course 6232A: Implementing a Microsoft SQL Server 2008 Database

SAP Business Objects Business Intelligence platform Document Version: 4.1 Support Package Data Federation Administration Tool Guide

Understanding SQL Server Execution Plans. Klaus Aschenbrenner Independent SQL Server Consultant SQLpassion.at

The end. Carl Nettelblad

Rational Application Developer Performance Tips Introduction

In this session, we use the table ZZTELE with approx. 115,000 records for the examples. The primary key is defined on the columns NAME,VORNAME,STR

Processing XML with Java A Performance Benchmark

New Features in Oracle Application Express 4.1. Oracle Application Express Websheets. Oracle Database Cloud Service

Smooks Dev Tools Reference Guide. Version: GA

MS SQL Performance (Tuning) Best Practices:

SQL Server Query Tuning

Java Application Developer Certificate Program Competencies

Oracle Database: SQL and PL/SQL Fundamentals

ColdFusion 8. Performance Tuning, Multi-Instance Management and Clustering. Sven Ramuschkat MAX 2008 Milan

MyOra 3.0. User Guide. SQL Tool for Oracle. Jayam Systems, LLC

Course Number: IAC-SOFT-WDAD Web Design and Application Development

Toad for Oracle 8.6 SQL Tuning

SQL Server. SQL Server 100 Most Asked Questions: Best Practices guide to managing, mining, building and developing SQL Server databases

Database Programming with PL/SQL: Learning Objectives

Improving Maintenance and Performance of SQL queries

Technologies for a CERIF XML based CRIS

Using the DataDirect Connect for JDBC Drivers with WebLogic 8.1

Storing Measurement Data

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Oracle Database 11g: SQL Tuning Workshop

Lesson 4 Web Service Interface Definition (Part I)

Cloudera Certified Developer for Apache Hadoop

Visual Basic. murach's TRAINING & REFERENCE

ORACLE DATABASE 10G ENTERPRISE EDITION

Unified XML/relational storage March The IBM approach to unified XML/relational databases

Enterprise Performance Tuning: Best Practices with SQL Server 2008 Analysis Services. By Ajay Goyal Consultant Scalability Experts, Inc.

Oracle Database: SQL and PL/SQL Fundamentals NEW

ZooKeeper. Table of contents

Empirical study of performance of data binding in ASP.NET web applications

XML Programming with PHP and Ajax

Oracle SQL. Course Summary. Duration. Objectives

KonyOne Server Prerequisites _ MS SQL Server

White Paper November Technical Comparison of Perspectium Replicator vs Traditional Enterprise Service Buses

ORACLE OLAP. Oracle OLAP is embedded in the Oracle Database kernel and runs in the same database process

Advanced Oracle SQL Tuning

Java EE Web Development Course Program

XQuery and the E-xml Component suite

Technology WHITE PAPER

D61830GC30. MySQL for Developers. Summary. Introduction. Prerequisites. At Course completion After completing this course, students will be able to:

Using the DataDirect Connect for JDBC Drivers with the Sun Java System Application Server

Oracle Database: SQL and PL/SQL Fundamentals

An XML Based Data Exchange Model for Power System Studies

Object Oriented Databases. OOAD Fall 2012 Arjun Gopalakrishna Bhavya Udayashankar

White Paper. How Streaming Data Analytics Enables Real-Time Decisions

SQL Server for developers. murach's TRAINING & REFERENCE. Bryan Syverson. Mike Murach & Associates, Inc. Joel Murach

Achieving Database Interoperability Across Data Access APIs through SQL Up-leveling

Performance Implications of Various Cursor Types in Microsoft SQL Server. By: Edward Whalen Performance Tuning Corporation

Condusiv s V-locity Server Boosts Performance of SQL Server 2012 by 55%

Developing Microsoft SQL Server Databases MOC 20464

Oracle Database: Program with PL/SQL

Performance rule violations usually result in increased CPU or I/O, time to fix the mistake, and ultimately, a cost to the business unit.

Geodatabase Programming with SQL

CASE STUDY: Oracle TimesTen In-Memory Database and Shared Disk HA Implementation at Instance level. -ORACLE TIMESTEN 11gR1

Developing and Implementing Windows-Based Applications With Microsoft Visual C#.NET and Microsoft Visual Studio.NET

XML Processing and Web Services. Chapter 17

Querying Combined Cloud-Based and Relational Databases

Introduction to XML Applications

SOLUTION BRIEF. Advanced ODBC and JDBC Access to Salesforce Data.

Beginning C# 5.0. Databases. Vidya Vrat Agarwal. Second Edition

Hypertable Architecture Overview

MyOra 3.5. User Guide. SQL Tool for Oracle. Kris Murthy

OLH: Oracle Loader for Hadoop OSCH: Oracle SQL Connector for Hadoop Distributed File System (HDFS)

Realization of Inventory Databases and Object-Relational Mapping for the Common Information Model

SQL Anywhere 12 New Features Summary

Terms and Definitions for CMS Administrators, Architects, and Developers

Business Application Services Testing

An Approach to Translate XSLT into XQuery

Oracle Database 11g: SQL Tuning Workshop Release 2

The EMSX Platform. A Modular, Scalable, Efficient, Adaptable Platform to Manage Multi-technology Networks. A White Paper.

Master of Sciences in Informatics Engineering Programming Paradigms 2005/2006. Final Examination. January 24 th, 2006

JAVA r VOLUME II-ADVANCED FEATURES. e^i v it;

Chapter 1: Introduction

Scaling Sitecore for Load

Topics in basic DBMS course

Deferred node-copying scheme for XQuery processors

Firewall Builder Architecture Overview

How To Script Administrative Tasks In Marklogic Server

Monitoring HP OO 10. Overview. Available Tools. HP OO Community Guides

SQL Query Evaluation. Winter Lecture 23

Chapter 13. Introduction to SQL Programming Techniques. Database Programming: Techniques and Issues. SQL Programming. Database applications

Java 7 Recipes. Freddy Guime. vk» (,\['«** g!p#« Carl Dea. Josh Juneau. John O'Conner

Specialized Programme on Web Application Development using Open Source Tools

What is Data Virtualization? Rick F. van der Lans, R20/Consultancy

Managing large sound databases using Mpeg7

White Paper. Optimizing the Performance Of MySQL Cluster

Performance Monitoring API for Java Enterprise Applications

Developing XML Solutions with JavaServer Pages Technology

Transcription:

Performance Tips for DataDirect XQuery 2.0 Table of Contents Introduction... 1 Preparing Queries... 1 Query Pooling... 2 Querying Large XML Files... 2 Retrieving Query Results... 3 Avoiding Compensation in WHERE Clauses and Predicates... 4 Specifying SQL Generation Algorithms... 5 Summary... 6 Introduction DataDirect XQuery was designed for performance and scalability, which are critical to our customers. This document describes some of the performance optimizations in our XQuery implementation and provides guidelines that will make most applications perform well. Some of these guidelines may seem quite obvious to some users for instance, most JDBC programmers know they should prepare queries, and most XML programmers know that the DOM requires more memory and is slower than SAX or StAX this document is designed to be helpful to both groups. Preparing Queries The first time a query is executed, a significant percent of the total query time is spent preparing the query, which includes query optimization and generating the query plan. Because most programs invoke queries more than once, we spend the time needed to find an efficient way to process the query, even though that makes the first query slower. Fortunately, this effort does not need to be repeated. One way to ensure that a query is prepared only once in your program is to use a prepared query. (Another way is to use Query Pooling, described in the following section). To parameterize a query, use external variables to change the values used for the query each time it is invoked. For instance, the following query creates a portfolio for a given user: Example 1. A Query with an External Variable declare variable $user as xs:string external; collection('holdings')/holdings[userid=$user] In this query, $user is an external variable which must be bound before executing the query. The following XQJ code shows how to prepare the query and bind a value to $user. Example 2. Preparing a Query // Get a connection, prepare the query XQDataSource datasource = new DDXQDataSource(); datasource.setjdbcurl("jdbc:xquery:sqlserver://server1:1433;databasename=stocks"); XQConnection connection = datasource.getconnection(); XQPreparedExpression preparedexpression = connection.prepareexpression(xquerytext); // Bind variable $user to 'Jonathan' and execute preparedexpression.bindstring(new QName("user"), "Jonathan"); XQSequence xqsequence = preparedexpression.executequery(); The following code shows how to bind $user to a different value and execute prepared query again. 1

Example 3. Executing a Prepared Query with New Values // Bind variable $user to 'Minollo' and execute preparedexpression.bindstring(new QName("user"), "Minollo"); xqsequence = preparedexpression.executequery(); Benchmarks that measure the speed of a query should mirror the usage you expect in your program. Since most programs execute a query many times, when you design a benchmark you should prepare your query, then execute it many times. Query Pooling Prepared expressions let the programmer decide when a query should be prepared and when a prepared query is no longer needed. In programs that use hundreds or thousands of queries, it is generally better to let DataDirect XQuery keep track of which queries are actually being used, preparing them the first time they are executed, and discarding the least frequently used prepared queries if there are too many queries. This is called query pooling. In DataDirect XQuery, you enable query pooling by specifying the number of prepared queries to hold in the pool. If you configure your connection by setting properties in the Java API, you can set the MaxPooledQueries property to do this; for instance, the following code sets the size of the query pool to 20: Example 4. Enabling Query Pooling with the Java API XQDataSource datasource = new DDXQDataSource(); datasource.setjdbcurl("jdbc:xquery:sqlserver://server1:1433;databasename=stocks"); datasource.setmaxpooledqueries(20); XQConnection connection = datasource.getconnection(); If you configure your connection using a Source Configuration File, you can set the size of the query pool using the max- PooledQueries element: Example 5. Enabling Query Pooling in a Source Configuration File <?xml version="1.0" encoding="utf-8"?> <XQJConnection xmlns="http://www.datadirect.com/xquery"> <baseuri>file:///c:/program%20files/apache%20software%20foundation/tomcat%205.5/</baseuri> <maxpooledqueries>20</maxpooledqueries> <JDBCConnection name="stocks"> <url>jdbc:xquery:sqlserver://localhost:1433</url> <sqlxmlmapping> <forest>true</forest> <identifierescaping>none</identifierescaping> </sqlxmlmapping> </JDBCConnection> </XQJConnection> Querying Large XML Files DataDirect XQuery provides several ways to query an XML file. In DataDirect XQuery 2.0, you will get better performance, especially for large XML files, if you use fn:doc() to access XML in a query. You will get much worse performance if you parse the XML to create a DOM tree, bind the DOM tree to an external variable, and query the external variable. In most XQuery processors, a large percent of the time needed to query a large XML file is spent parsing the file and creating an in-memory representation that can be queried. The in-memory representation may be many times the size of the original XML file, and Java VM out of memory errors may occur at run time. When your query addresses an XML file using fn:doc(), DataDirect XQuery uses a technique known as Document Projection 1, which allows it to create only the part of a document needed by the query. This results in dramatic improvements in memory usage and scalability, and significant improvements in performance. Using document projection, you can generally query documents many times the size of available memory. Incidentally, if your application uses XML Deployment Adapters [http://www.stylusstudio.com/deployment/datadirectxquery/], it addresses these adapters using fn:doc(), and DataDirect XQuery uses Document Projection appropriately for these adapters just as it does for XML documents. 1 Document Projection is described in A. Marian and J. Simeon, "Projecting XML Documents", Proc. VLDB 2003. 2

Document projection uses the path expressions in a query to determine what parts of a document need to be built. Path expressions that use wildcards ('*') provide less information for document projection, so they should be avoided when querying large documents. For instance, in the examples below, the second path expression will perform much better for large documents. Example 6. Avoid path expressions with wildcards doc("scenario.xml")//*[role='teacher'] Example 7. Specific path expressions aid document projection doc("scenario.xml")/scenario/people/person[role='teacher'] If you parse a document to create a DOM tree, then bind the DOM tree to an external variable, you bypass DataDirect's Document Projection. This forces your program to create the entire in-memory representation of your XML document not just the part that DataDirect XQuery needs to process your queries. In addition to the savings due to Document Projection, DataDirect's internal representation of an XML document is much more efficient than most DOM implementations, so using fn:doc() gives you two substantial optimizations that you lose if you parse the document yourself. DataDirect XQuery also supports a second technique known as Streaming. Streaming processes a document sequentially, discarding portions of the document that are no longer needed to produce further query results. This reduces memory usage because only the portion of a document needed at a given stage of query processing is instantiated in memory. Streaming can reduce memory consumption, but generally involves a (relatively minor) performance penalty. Unlike document projection, streaming is not generally a performance win. Also, some queries require much of the document to be instantiated at a given time, diminishing the benefits of streaming. For some queries, however, streaming can be a substantial win, especially if the query results exceed available memory; therefore, we allow streaming to be enabled, but we do not enable it by default. To enable streaming, set the ddtek:xml-streaming option to 'yes' in the XQuery Prolog: Example 8. Enabling Streaming in the Query Prolog declare option ddtek:xml-streaming 'yes'; You can also enable streaming at the DataSource level, which means streaming will be used for all queries that use the DataSource. One way to do this is to set the Pragmas property of the DDXQDataSource class. Example 9. Enabling Streaming for a DataSource in the Java API XQDataSource datasource = new DDXQDataSource(); datasource.setjdbcurl("jdbc:xquery:sqlserver://server1:1433;databasename=stocks"); datasource.setpragmas("xml-streaming=yes"); Another way to enable streaming at the DataSource level is to specify the declaration option using a pragma element in a Source Configuration File: Example 10. Enabling Streaming for a DataSource in a Source Configuration File <?xml version="1.0" encoding="utf-8"?> <XQJConnection xmlns="http://www.datadirect.com/xquery"> <baseuri>file:///c:/program%20files/apache%20software%20foundation/tomcat%205.5/</baseuri> <maxpooledqueries>20</maxpooledqueries> <pragma name="xml-streaming">yes</pragma> <JDBCConnection name="stocks"> <url>jdbc:xquery:sqlserver://localhost:1433</url> <sqlxmlmapping> <forest>true</forest> <identifierescaping>none</identifierescaping> </sqlxmlmapping> </JDBCConnection> </XQJConnection> Retrieving Query Results DataDirect XQuery is carefully designed to allow very large query results to be retrieved efficiently, and to allow incremental instantiation of query results, so that memory can also be used efficiently. If your query's results can be large, stream the results to an OutputStream or use SAX or StAX, so that retrieving the results does not introduce issues with memory or performance. Retrieving large query results as DOM can use large amounts of memory and significantly increase the time needed for processing queries. 3

If you want to write your query results to an output stream or a file, the easiest way is to use the writesequence() method of the result set to write directly to the stream, as shown in the following code. Example 11. Writing a Result Sequence to a Stream XQSequence sequence = expression.executequery(query); OutputStream os = new FileOutputStream("C:/temp/results.xml"); sequence.writesequence(os); Many XML programs process XML using SAX, and DataDirect XQuery can write query results directly to a SAX event handler, as shown in the following code. Example 12. Writing to a SAX Event Handler CustomSAXEventHandler eventhandler = new CustomSAXEventHandler(); sequence.writesequencetosax(eventhandler); For many kinds of programming, pull processing using StAX is both convenient and efficient. You can get a StAX stream from a result sequence using the getsequenceasstream() method: Example 13. Retrieving Results as a StAX Stream XMLStreamReader reader = sequence.getsequenceasstream(); Avoiding Compensation in WHERE Clauses and Predicates When a database is queried, DataDirect XQuery usually pushes queries to the database, where they are executed in SQL. However, some expressions, functions and casts are not supported by SQL, and are therefore performed in program memory, a strategy known as compensation. Unfortunately, when these operations are performed in main memory, they can not take advantage of indexes and other optimizations available in the database server. This can dramatically hurt performance. The DataDirect XQuery 2.0 User's Guide and Reference clearly states which expressions, functions, and operators are compensated. Most of the commonly used functions that are compensated fall into the following categories: String Functions that Use Pattern Matching Date and Time Functions for XML Schema Date/Time Datatypes Casts to XML Schema Datatypes not supported by the underlying database. XML whitespace functions FLWOR Expressions using the AT Clause Numeric Predicates When querying relational data, if a compensated function or operator is used in the WHERE clause of a FLWOR expression or in a predicate of a path expression, performance can suffer dramatically. For instance, in the following query, normalizespace() is compensated, and forces evaluation in memory; since this function must be called once for every row, the performance hit is very significant. Example 14. Avoid using compensated functions in a WHERE clause for $h in collection('stocks.dbo.historical')/historical where normalize-space($h/ticker) = 'AMZN' return $h One way to avoid compensation is to use a SQL function instead of a compensated XQuery function, as in the following example. 4

Example 15. Using a SQL function instead of a compensated function declare function ddtek-sql:rtrim($in as xs:string) as xs:string external; for $h in collection('historical')/historical where ddtek-sql:rtrim($h/ticker) = 'AMZN' return $h When querying relational data, using numeric predicates or an AT clause in a FLWOR expression forces compensation, which can dramatically impact performance, especially when it forces a join to be done in memory. Consider the following query. Example 16. Avoid AT clauses when querying relational data for $user at $i in collection("users")/users, $hold at $j in collection("holdings")/holdings return <holding> <userid>{ $i }</userid> <user>{ string($user/userid) }</user> <stockid>{ $j }</stockid> <stock>{ string($hold/stockticker) }</stock> <share>{ string($hold/shares }</share> </holding> This query uses AT clauses to identify the individual users and stocks, and these AT clauses force DataDirect XQuery to do the joins in memory instead of letting the database engine do them. Not only is this a dramatic performance hit, it's also a bad way to identify the users and holdings because rows in a relational table do not generally have a guaranteed stable order, these identifiers might change each time the query is run. It is generally much better to use the primary key from a table to identify the rows of that table. Example 17. For identifiers, rely on the primary keys from relational tables (:------------------------------------------------------------------- : userid is the primary key for collection("users") : stockticker, userid is the primary key for collection("holdings") --------------------------------------------------------------------:) for $user in collection("users")/users, $hold in collection("holdings")/holdings return <holding> <user>{ string($user/userid) }</user> <stock>{ string($hold/stockticker) }</stock> <share>{ string($hold/shares }</share> </holding> Specifying SQL Generation Algorithms The algorithms DataDirect XQuery uses to create hierarchies are optimal for creating large, complex documents, but still good for smaller documents with shallower hierarchies. If query results contain four or fewer levels of hierarchy, and your performance seems inadequate, you will sometimes get better performance if you tell DataDirect XQuery to use either outer join or sorted outer union (see the DataDirect 2.0 User's Guide and Reference for details on these algorithms). There are several ways to specify SQL generation algorithms in DataDirect XQuery: 5

Example 18. Specifying SQL Generation Algorithms The algorithm can be specified via the XQJ API: XQDataSource ds = new DDXQDataSource(); ds.setjdbcpragmas("sql-rewrite-algorithm=nested-loop"); The algorithm can be specified with an entry in the source configuration file: <pragma name="sql-rewrite-algorithm">outer-join</pragma> The algorithm can be specified in the XQuery prolog with a declaration option: declare option ddtek:sql-rewrite-algorithm "outer-join"; Summary DataDirect XQuery was designed for performance and scalability. It does sophisticated query optimization, supports prepared queries and query pooling, and efficiently queries large XML documents using document projection and streaming. If you follow these few simple guidelines, your DataDirect XQuery applications will usually have good performance: Prepare your queries, or use query pooling. Use fn:doc() when possible for XML input; avoid parsing XML input yourself. For large XML files, avoid wildcards ('*'), and use specific paths. For large XML files, consider streaming if memory usage is large and performance is unsatisfactory. Use SAX, StAX, or output streams for output; avoid using the DOM for output. Avoid compensation in WHERE clauses and predicates. Avoid AT clauses in FLWOR expressions that query relational tables. Avoid numeric predicates in expressions that operate on relational data. For shallow hierarchies, consider specifying a SQL generation algorithm. 6