Reading Assignment 5 An Overview of Query Optimization in Relational Systems

José Filipe Barbosa de Carvalho (jose.carvalho@fe.up.pt)
5th December 2007
Advanced Database Systems
Technische Universität Wien, Karlsplatz 13, A-1040 Wien, AUSTRIA

Abstract: This text summarizes the fundamental ideas in [1], after a careful reading. It also presents the things that the author did not understand well and his personal opinion about the paper.

1 Introduction

One of the most important concerns in research on Database Management Systems (DBMSs) is performance. Many researchers have created and analyzed algorithms, architectures and models to use resources more efficiently and to achieve lower response times. Recall that a DBMS is usually a central component of an application, supporting a set of operations on data and providing a reliable storage system.

A database system is composed of several parts, such as the process manager and the transactional storage manager. Some of them are tightly interrelated; others are almost independent components. One of these components is the query processor, which is responsible for converting an SQL query into an internal representation and producing an optimized plan to be executed. These operations are performed by relatively independent blocks of the query processor: the query parser, the query rewriter, the optimizer and the executor.

The optimizer is a central component for improving the overall performance of a database management system (as its name suggests). It creates an execution plan based on the operators available in the query execution engine and on statistical data; this plan should be the most efficient one, both in time and in space.

2 Important ideas and results of the paper

The paper presents a good overview of query optimizer design issues, explaining the optimizer's role within a database management system. First, it explains how the query optimizer and the query execution engine work together. Then it discusses the System R optimizer, the search space, statistics and cost estimation. Finally, it presents some enumeration architectures, Starburst and Volcano/Cascades, and outlines more complex challenges in query optimization research.

The text begins by describing two key components of query evaluation in a relational database: the query execution engine and the query optimizer. The first implements a set of physical operators over stored data, such as external sort, sequential scan, index scan, merge-join and so on. An operator consumes one or more data streams as input and produces an output stream. These operators differ from the logical operators in SQL queries, and converting logical operators to physical ones is not trivial, because the query execution engine usually offers several ways to do the same thing (such as nested-loop join and sort-merge join).

The query optimizer is responsible for taking the original logical query (possibly after some simplifications, such as view expansion) and producing the best possible execution plan, based on the estimated cost of the physical operators and on statistical information about the data to be processed. The query execution engine simply follows orders: it executes the plan produced by the query optimizer, assuming that it is the best one. The paper also presents an abstract representation of an execution, the physical operator tree. Of course, the optimizer's task is not trivial, because a given query may have a large number of possible physical operator trees.
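To make the idea of physical operators as stream consumers and producers more concrete, here is a minimal sketch in Python. It is not taken from [1]: the function names, the tuple-based row format and the tiny tables are all assumptions made for illustration. It shows two physical operators (nested-loop join and sort-merge join) that implement the same logical join.

```python
from typing import Iterable, Iterator, Tuple

Row = Tuple  # illustrative choice: a row is a plain tuple


def seq_scan(table: Iterable[Row]) -> Iterator[Row]:
    """Physical operator: stream every row of a stored table."""
    for row in table:
        yield row


def nested_loop_join(outer: Iterable[Row], inner_table: Iterable[Row],
                     outer_key: int, inner_key: int) -> Iterator[Row]:
    """For each outer row, rescan the inner table and emit matching pairs."""
    for o in outer:
        for i in seq_scan(inner_table):
            if o[outer_key] == i[inner_key]:
                yield o + i


def sort_merge_join(left: Iterable[Row], right: Iterable[Row],
                    left_key: int, right_key: int) -> Iterator[Row]:
    """Sort both inputs on the join key, then merge them in a single pass."""
    ls = sorted(left, key=lambda row: row[left_key])
    rs = sorted(right, key=lambda row: row[right_key])
    i = j = 0
    while i < len(ls) and j < len(rs):
        lk, rk = ls[i][left_key], rs[j][right_key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(rs) and rs[j_end][right_key] == lk:
                j_end += 1
            # Pair every left row with this key against the whole run.
            while i < len(ls) and ls[i][left_key] == lk:
                for k in range(j, j_end):
                    yield ls[i] + rs[k]
                i += 1
            j = j_end


# Hypothetical tables: emp(id, name, dept_id) and dept(id, name).
emp = [(1, "Ana", 10), (2, "Bob", 20), (3, "Eva", 10)]
dept = [(10, "Sales"), (20, "IT"), (30, "HR")]

plan_a = list(nested_loop_join(seq_scan(emp), dept, 2, 0))
plan_b = list(sort_merge_join(seq_scan(emp), seq_scan(dept), 2, 0))
```

Both plans return the same set of rows, but not necessarily in the same order: in this sketch the sort-merge join emits its output sorted on the join key, while the nested-loop join follows the order of the outer input. Which of the two is cheaper depends on table sizes and on whether the inputs are already sorted, and that is the kind of decision left to the optimizer.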
The number of alternatives grows quickly: a logical representation of the query can be transformed into equivalent logical representations, and some logical operators have several different implementations (e.g., a join can be implemented as a nested-loop join, a sort-merge join or a hash-based join). Query optimization can be viewed as a hard search problem that can be solved using:

- A space of plans (the search space);
- A cost estimation technique that assigns a cost to each plan in the search space;
- An enumeration algorithm that can search through the execution space.

According to the author, a desirable optimizer is one where the search space includes plans that have low cost, the costing technique is accurate and the enumeration algorithm is efficient [1]. Because achieving these properties is an enormous task, designing and implementing a good optimizer is a complex and difficult challenge.

System R is a database system built as a research project at IBM. It had a large impact on database research and is also discussed in the text because of its advances in optimization techniques, which have been incorporated into many commercial optimizers. The author presents some important ideas, namely those related to Select-Project-Join (SPJ) queries. The search space of the System R optimizer in the context of SPJ queries consists of physical operator trees that correspond to linear sequences of join operations. This approach yields several different operator trees, because joins are associative and commutative and because different physical operators can do the same thing. The system uses a cost model to estimate the performance of any plan and also determines the output size of every operator in the operator tree. The database maintains statistics on relations and indexes, and uses formulas to estimate the selectivity of predicates and the size of the output data stream, as well as the CPU and I/O costs of every operator. These formulas are quite difficult and tricky to develop, and usually do not take all variables into account, either for performance reasons or because not all the information is at hand. Many of these formulas are based on previous work, for instance the work by Graefe [2]. The text also explains some techniques applied in System R, such as the dynamic programming approach, the use of interesting orders, bottom-up cost estimation, taking into account whether input and output streams are ordered, checking expected stream sizes and their influence on the choice of physical operators, and so on.

The search space for optimization is the set of possible plans, given the possible algebraic transformations and the physical operators chosen. Algebraic transformations can improve the behavior of a query, but this is not guaranteed, so it is necessary to estimate whether a transformation has a positive effect. The paper explains three types of transformations. One type exploits the commutativity among operators, because some of them (namely join operators) are commutative and associative. Although these transformations considerably increase the cost of enumerating the search space, they can greatly improve query execution (for example, by performing Cartesian products earlier). The paper discusses only a subset of these transformations, covering general joins, outer joins and the interaction between group-by and joins. Another type reduces multi-block queries to a single block, thereby decreasing complexity. For instance, views can be merged, if they are defined as conjunctive queries, to obtain a single-block query.
It is also possible to merge nested subqueries by using appropriate techniques; the paper explains these techniques in general terms, pointing out the differences between the cases where the inner query block does or does not contain variables from the outer query block. A further type of transformation exploits the selectivity of predicates across blocks, to reduce the computation performed, and the data streamed, by the operators applied last.

Cost estimation really is a tricky and complex issue. Besides the complexity of enumerating and processing the search space, it is not easy to evaluate which operator tree consumes the fewest resources. The first question is which resource usage will be measured and which resources matter most, among CPU time, memory, I/O cost, communication bandwidth and other system constraints, so that a balanced metric can be used. The second is that the estimation function must be accurate (to ensure a correct optimization process) and efficient (because it is repeatedly invoked in the inner loop of optimization). Most current database optimizers perform the estimation using a model derived from System R: the model collects statistical summaries of the stored data and, for each operator of the plan and its input data streams, determines a statistical summary of the output data stream and the cost of executing the operation. Examples of statistical data are histograms, which store information about the distribution of the data in a column, and the minimum and maximum values of a column. The paper [1] also discusses how to estimate statistics, including issues such as sampling and incremental maintenance, and how to propagate these statistics through the operators of an operator tree.
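As an illustration of how a histogram can be used to estimate the selectivity of a predicate, here is a minimal sketch in Python. It assumes an equi-width histogram and a uniform distribution of values inside each bucket; the names Histogram, build_histogram and selectivity_le, and the sample column, are hypothetical and are not taken from [1].

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Histogram:
    lo: float          # minimum value seen in the column
    hi: float          # maximum value seen in the column
    counts: List[int]  # number of rows in each equi-width bucket

    @property
    def total(self) -> int:
        return sum(self.counts)


def build_histogram(values: Sequence[float], buckets: int) -> Histogram:
    """Summarize a column as `buckets` equi-width buckets between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / buckets or 1.0  # avoid a zero width when all values are equal
    counts = [0] * buckets
    for v in values:
        idx = min(int((v - lo) / width), buckets - 1)  # the maximum goes in the last bucket
        counts[idx] += 1
    return Histogram(lo, hi, counts)


def selectivity_le(h: Histogram, v: float) -> float:
    """Estimated fraction of rows satisfying `column <= v`."""
    if v < h.lo:
        return 0.0
    if v >= h.hi:
        return 1.0
    n = len(h.counts)
    width = (h.hi - h.lo) / n
    pos = (v - h.lo) / width
    full = min(int(pos), n - 1)  # buckets lying entirely below v
    frac = pos - full            # covered share of the bucket containing v
    rows = sum(h.counts[:full]) + frac * h.counts[full]
    return rows / h.total


# Hypothetical column of ten ages.
ages = [18, 20, 22, 25, 30, 35, 40, 50, 60, 78]
hist = build_histogram(ages, buckets=3)
estimated = selectivity_le(hist, 28)
actual = sum(1 for a in ages if a <= 28) / len(ages)
```

With these values the estimate for `age <= 28` is 0.3, while the true fraction is 0.4. The gap comes from the uniformity assumption inside the first bucket, and it shows why the accuracy of the statistics matters for the cost estimates built on top of them.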

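The three ingredients of the search problem (search space, cost estimation, enumeration) can be put together in a small sketch of dynamic programming over linear join sequences, in the spirit of the System R approach summarized above. This is only an illustration under strong assumptions: the row counts and join selectivities are invented, predicates are treated as independent, the cost of a plan is simply the sum of its intermediate result sizes, and interesting orders and the choice of physical join operator are left out.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical statistics: row counts per relation and join selectivities.
CARD: Dict[str, int] = {"A": 1000, "B": 100, "C": 10}
# A pair without an entry has no join predicate (Cartesian product, selectivity 1).
SEL: Dict[FrozenSet[str], float] = {
    frozenset(("A", "B")): 0.01,
    frozenset(("B", "C")): 0.1,
}

Plan = Tuple[float, float, Tuple[str, ...]]  # (cost, output rows, join order)


def join_selectivity(joined: FrozenSet[str], rel: str) -> float:
    """Combined selectivity of all predicates between `rel` and the joined set."""
    s = 1.0
    for r in joined:
        s *= SEL.get(frozenset((r, rel)), 1.0)
    return s


def best_linear_plan(relations: List[str]) -> Plan:
    """Return the cheapest linear join sequence under the toy cost model."""
    # best[S] holds the cheapest plan found for joining exactly the relations in S.
    best: Dict[FrozenSet[str], Plan] = {
        frozenset((r,)): (0.0, float(CARD[r]), (r,)) for r in relations
    }
    for size in range(2, len(relations) + 1):
        for subset in combinations(relations, size):
            s = frozenset(subset)
            for rel in subset:  # rel is the last relation added to the sequence
                rest = s - {rel}
                cost, rows, order = best[rest]
                out_rows = rows * CARD[rel] * join_selectivity(rest, rel)
                new_cost = cost + out_rows  # toy cost: size of each intermediate result
                if s not in best or new_cost < best[s][0]:
                    best[s] = (new_cost, out_rows, order + (rel,))
    return best[frozenset(relations)]


cost, rows, order = best_linear_plan(["A", "B", "C"])
```

For these made-up statistics the cheapest sequence joins B and C first and adds A last, with a total cost of 1100 under the toy model. The key point is that the best plan for each subset of relations is computed once and reused when the subset is extended by one more relation, instead of enumerating every complete join sequence separately.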
The enumeration architecture plays a key role in optimizer design. Nowadays these architectures are extensible, so they can adapt to changes in the search space, such as the addition of new transformations and new physical operators, and to changes in the cost estimation techniques. Of course, extensibility must generally be balanced against efficiency, and it complicates the implementation. The paper presents two examples of extensible optimizers: Starburst and Volcano/Cascades. Both use generalized cost functions and physical properties attached to operator nodes, use a rule engine that allows transformations to modify the query expression or the operator trees, and expose many knobs that can be used to tune the behavior of the system. However, there are some differences between them: Starburst uses two distinct optimization phases while Volcano/Cascades uses only one; in the Volcano/Cascades framework the mapping from algebraic to physical operators occurs in a single step and rules are applied in a goal-driven way, whereas in Starburst the rules are applied in a forward-chaining fashion.

Finally, the paper briefly presents some additional issues in query optimization: distributed and parallel databases, user-defined functions, materialized views and object-oriented systems, some of which are still open issues.

3 Things not well understood

I had no trouble understanding the paper as a whole. The text focuses on query optimization, which was introduced in one lesson of this course some weeks ago and explored in the reading assignments. The previous reading [2] was very important for understanding some details about the query optimizer, namely its relationship with the query execution engine, which matter when reading this assignment. The paper is easy to read, although it requires some deeper knowledge of database management systems.
However, I did not know about some historical developments such as the Starburst system (which first appears on the 3rd page), Exodus (page 3), Cascades/Volcano and the XPRS project (8th page). With a simple search on Google or Wikipedia I understood, in a general way, their aims. I knew about System R, but I discovered a new project named System R* (8th page). I could not find anything on the web about this project, but I assume that it is an improvement of System R. I also read more about CUBE (which appears on the 9th page) in [3], to understand what it is.

4 Things that I liked and did not like in the paper

The paper is an introduction to query optimization, explaining its most important topics in a short article (less than 9 pages). I like this paper because it summarizes the optimization challenges in a simple and easy-to-read text. The author does not use complex mathematical formulas and uses some figures to better explain some concepts. However, some previous knowledge about the structure of database systems and about the query executor is recommended.

I believe that many details of optimization are left out of this article, but I think that further reading can be done using the many bibliographical references given. This short paper is a door to query optimization issues that an interested reader can use to begin exploring the topic. Once more in these reading assignments, the paper is quite old (written in 1998). Probably some of the issues detailed in the paper are now standard practice and others have emerged as the main questions.

5 Conclusion

This paper is a short and well-written summary of query optimization issues. It explains in general terms some challenges in query optimizers, such as cost estimation and how to generate the search space. A lot of further reading is provided, so an interested reader can go deeper in researching this topic. I therefore truly recommend this document to everyone involved in developing database management systems.

6 Bibliography

[1] CHAUDHURI, Surajit; "An Overview of Query Optimization in Relational Systems". PODS 1998
[2] GRAEFE, Goetz; "Query Evaluation Techniques for Large Databases". ACM Computing Surveys 1993 (Sections 1-5, 7, 8)
[3] GRAY, J., BOSWORTH, A., LAYMAN, A., PIRAHESH, H.; "Data Cube: A Relational Aggregation Operator Generalizing Group-by, Cross-Tab, and Sub-Totals". In Proc. of IEEE Conference on Data Engineering, New Orleans,