Optimization of SQL Queries in Main-Memory Databases

Optimization of SQL Queries in Main-Memory Databases Ladislav Vastag and Ján Genči Department of Computers and Informatics Technical University of Košice, Letná 9, 042 00 Košice, Slovakia lvastag@netkosice.sk jan.genci@tuke.sk http://kpi.fei.tuke.sk Abstract. In this article we have tried to summarize and notify the problems concerning query optimization those we have been dealing during elaboration of this thesis. Though we have dedicated maximum effort to compress information presented in this article, significant amount of information is not provided herein due to the limited range of this article. Key words: logical plan, physical plan, logical optimization, physical optinozation, cost function 1 Introduction With gradual evolution of real-time systems such as telecommunication systems, satellite pictures processing systems or target-searching in RADAR systems; the demand for databases with capability to perform transactions within extremely short time periods started to grow. To be more specific, we are talking about time intervals much shorter than the database stores primary copy of data on thedrive. If we consider that accessing main memory takes much less time than accessing the drive, databases those store primary copy of data in main memory (Main Memory databases - MMDB) became an alternative of solution to a problem of time demands on transaction execution. Decrease in semiconductor memory prices and increase in their capacity enabled to store larger amounts of data in main memory and consequently MMDB became reality. However with MMDB rollout many problems, previously only partly or not at all solved by DRDB, arose. 2 Goals The main goal of this master thesis was to create an SQL query optimizer for optimization tasks in conditions of Main-Memory databases. Shortly, it meant to create a complete query compiler. To achieve this goal we needed to carry out:

2 L. Vastag and J. Genči 1. SQL parser and preprocessor - main tasks of this two components of query compiler is both syntactical and lexical analysis of incoming query and translation of query string into an internal parse tree 2. Query optimizer - takes parse tree on input and transforms it into a cheapest physical plan. As we shall introduce a task of an optimizer can be divided into two, relatively independent, stages. At first, it is a logical optimization and second stage, it is a physical optimization. As we will see the task of an optimizer is very nontrivial. At the same time it is inevitable to respect the primary organization of data that was created at the department of computers and information technology [1] 3 Analysis In this section we shall introduce a short description of functionality of all stages of query compiler. 3.1 Parser and Preprocessor The parser is responsible for syntactic checking of a query. The job of the parser is to take text written in SQL (or other language) and convert it into a parse tree that is a tree whose nodes correspond to either atoms or syntactic categories. Atoms are lexical elements such as keywords, names of attributes or relations, constants, parentheses, operators etc. Syntactic categories are names for families of query subparts those all play a similar role in a query. For example, the syntactic category simple-select will be used to represent any simple SELECT query in common select-from-where form and condition will represent any expression that is a condition i. e. it can follow WHERE in SQL. The preprocessor has several important functions. If a relation used in query is a view, then each use of this relation in from-list must be replaced by a parse tree that represents the view. The preprocessor is also responsible for semantic checking. For instance, the preprocessor must check relation uses, check and resolve attribute uses and check types. If a parse tree passes all these tests, it is said to be valid, and parse tree is given to the logical plan generator. 3.2 Query optimizer Optimizer is responsible for elaboration of physical plan for executor. Input of optimizer is a parse tree of query and output is an effective physical plan selected fromspace of possible plans. The role of the optimizer is really nontrivial, especially if we take into account that the space of possible plans for the SQL queries can be rather large due to following reasons: Algebraic expression of the query can be transformed into various logically equivalent algebraic expressions,

Optimization of SQL queries in MMDBs 3 There can be several physical plans for implementation of one algebraic expression. For instance, in database systems there are several algorithms for execution of join operation. Moreover, times of execution of given physical plans can vary. Therefore reasonable selection of suitable physical plan is rather critical. Based on these premises we can consider optimization of queries to be a solution of complicated task within search space. In order to solve this problem we need to provide the optimizer with following: 1. A space of plans (search space) 2. Cost estimation technique that will assign each plan from search space its cost. Cost estimation is an estimate of system resources consumption required for the execution of the plan. 3. Searching or enumeration algorithm that will be used to search within the search space. In the context of previously mentioned points we can state that the required optimizer has following qualities: 1. Search space contains, beside other, also plans with low cost, 2. Cost estimate is exact, 3. Enumeration algorithm is effective. We should also state that transforming SQL query into physical plan is called query compilation. Logical optimization The first stage of query optimization is a logical optimization. The logical optimization is responsible for turning the parse tree into the preferred logical plan. Logical optimization consists of two steps. In the first step, the structures and nodes of parse tree are, in appropriate groups, replaced by an operator or operators of relational algebra. In this step the optimizer also simplifies the query by removing sub queries from condition i. e. from WHERE clause. The second step is aimed on taking the logical plan from first step and to turning it into the expression that, as we expect can be converted into the most efficient physical query plan. The second step is commonly called as query rewriting. The query rewriting applies some heuristics using the algebraic laws [2]. Keep in your mind that the logical query optimization of id conditions of Main-Memory database is same like that in conventional databases. Physical optimization The second stage of query optimization is the physical optimization. Physical optimization in MMDB significantly differs from physical optimization used in conventional databases because of different data organization and cost function. In conventional databases the cost function simply counts data blocks those are transferred from or to disk during an evaluation of physical operator. This is impossible in Main-Memory database because MMDB holds a primary copy of data in main memory. Cost function in MMDB must cover

4 L. Vastag and J. Genči costs such as CPU computation cost, cache misses, TLB misses and memory- CPU bandwidth. As an example we can see cost functions which are implemented in MMDB Monet. Detailed descriptions of these functions are in [3]. The role of physical optimization is to translate logical plan into physical plan. The first problem that shall be solved by physical optimization is selecting an order for joining of three or more relations. Considering that there can be large amount of join trees those represent equivalent connections it is inevitable to limit plan space by any sort of heuristics in order reach effectiveness. The mentioned techniques will be described briefly in the following article. After selecting an order for joining the resulting logical plan is translated into physical plan in such way that every logical operator is translated into one or more physical operators. Consequently algorithms for physical operators are selected. It is required to state that selection of physical operators and their algorithms is performed based on cost estimate so that the final cost of physical plan is minimum. Such an optimization is called cost based optimization. Basically the same principles as in logical optimization are used because every logical operator can be replaced by various physical operators or their combination. At the same time every physical operator can be implemented by various algorithms. Search space and enumeration algorithms As we have already mentioned the space for possible plans can be rather large. This can result in situation where optimization cost can exceed optimization benefits in case naive search is applied. In order to prevent this situation techniques called enumeration algorithms are used. Their role is to: Limit the space to potentially advantageous plans Increase the effectiveness of searching within the plans space As an example we can mention experimental database system SYSTEM R from IBM. In this system limitation of search space was reached by considering only linear join trees. For searching the search space dynamic programming was used. However this was not classical dynamic programming. We are considering splitting the task into sub tasks resulting in partial assignments and utilizing sub optimality principle. Technique called memoization is used. It is based on memorizing the previously computed results. This approach to optimization is called MEMO-based optimization and it is described in [4, 5] Besides dynamic programming also other techniques can be used. For instance PosgreSQL utilizes genetic algorithm for selecting a join tree. It is a heuristics optimization method that utilizes non deterministic random searching. Detailed description of genetic optimization of queries implemented in PosgreSQL, geqo (genetic query optimization) for short, can be found in [6] 4 Creating a query compiler If we take into account that logical optimization in conventional databases is the same as in MMDB, it was possible to take it over from any Open Source database system. Our selection was PosgreSQL. Let us proceed step by step.

Optimization of SQL queries in MMDBs 5 The first essential part of a query compiler is parser and preprocessor. Because we have possibility to take over logical optimization from PostgreSQL, there is no reason why we can t do this in case of parser. Therefore parser, after a modification, was taken over too. Modification consisted of elimination of parsable queries into SELECT. From optimization point of view this was sufficient. Preprocessor could not have been taken over because it has to access system catalog. In our case system catalog has to be based on [1]. Unfortunately, we have discovered that based on the solution in [1] it is not possible to build neither required system catalog nor physical optimization. The reason is an inadequate API of primary data organization created in [1]. This API is built on relation operators and its character does not enable implementation of physical optimization because its output is physical plan and its nodes are created from physical operators. Implementation of logical optimization encountered problems because transferring of parameters into functions of this API is executed on the basis of text strings what means that our query compiler shall work as follows. Syntactic and semantic analysis of query is performed. Afterwards query is translated into logical plan and logical optimization takes place. Consequently text string representing query is generated from logical plan and transferred into executor. However, such a compilation of queries is not effective because physical optimization is omitted. 4.1 Other components borrowed from PostgreSQL Other components taken over from PosgreSQL are error logging and memory management. Error logging Robust error logging system is implemented in PosgreSQL. It is capable of processing recurrent errors, where several options are needed to be taken into account: Error can arise also during processing of current mistake, therefore it is inevitable to differentiate between current and re-entered recursion, this is treated by creating a small stack of ErrorData type, Error logging system itself can create an error during treating another error, the most common error is out of memory. Memory management Memory allocation is performed based on so called memory context. Implementation of memory context is performed by AllocSet, that can be found in file backend/utils/mmgr/aset.c. Memory management provides API those enables to perform following basic operations: Context generation Allocating, reallocating or freeing of memory chunk within the context Deleting the context (including freeing all memory allocated within context) Resetting context (i. e. freeing memory allocated within context with the exception of context itself)

6 L. Vastag and J. Genči 5 Conclusion Part of the compiler created by myself is not yet finished at the moment. Only lexical and syntactical analyses are currently working. To complete the compiler and link it to the primary data organization from [1] the following steps need to be completed: Suggest data model of system catalog Modify API of primary data organization from [1] and at the same time suggest and implement execution engine Implement simple bootloader that shall load system catalog (and also testing data if required) from the drive during database start-up Implement preprocessor Take over logical optimization from PostgreSQL Implement physical optimization References 1. Raška P.: Main Memory Databases - Experimental Verification of Primary Organization, Master Thesis 2005, Technical University of Košice, Faculty of electrotechnics and informatics, Department of Computers and Informatics 2. Garcia-Molina H., Ullman J. D., Widom J.: Database System Implementation, Prentice Hall 2000, ISBN 0130402648, pp. 329 422 3. Manegold S.: Understanding, Modeling, and Improving Main-Memory Database Performance, SIKS Dissertation Series No. 2002-17, ISBN 90 6196 5179, pp. 71 104 4. Chaudhuri S.: An Overview of Query Optimization in Relational Systems 5. Wass F.: Principles of Probablistic Query Optimization, SIKS Dissertation Series No. 2000-9, ISBN 90-6196-494-6, pp. 49 56 6. The PostgreSQL Global Development Group: PostgreSQL 8.0.0 Documentation, pp. 1121 1124