Query Optimization


Introduction:

Query optimization is a function of many relational database management systems in which multiple query plans for satisfying a query are examined and a good query plan is identified. The chosen plan may or may not be the absolute best strategy, because there are many possible plans and there is a trade-off between the time spent searching for the best plan and the time spent running the plan. Different database management systems balance these two in different ways.

Cost-based query optimizers evaluate the resource footprint of various query plans and use this as the basis for plan selection. Typically the resources that are costed are CPU path length, amount of disk buffer space, disk storage service time, and interconnect usage between units of parallelism. The set of query plans examined is formed by examining possible access paths (e.g., primary index access, secondary index access, full file scan) and various relational table join techniques (e.g., merge join, hash join, product join). The search space can become quite large depending on the complexity of the SQL query.

There are two types of optimization: logical optimization, which generates a sequence of relational algebra operations to solve the query, and physical optimization, which determines the means of carrying out each operation.

In An Introduction to Database Systems, C. J. Date discusses automatic optimization and gives several reasons why an optimizer might actually do better than a human.

A good optimizer has a wealth of information that a human user typically does not have, such as certain statistical information:
o Number of distinct values of each type.
o Number of tuples currently appearing in each base relvar.
o Number of distinct values currently appearing in each attribute in each base relvar.
and so on.

As a result, the optimizer is able to make a more accurate assessment of the efficiency of any given strategy for implementing a particular request, and is thus more likely to choose the most efficient implementation.
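To make the kind of statistics listed above concrete, here is a minimal Python sketch that gathers them from an in-memory relation. The sample shipments data and the function name `collect_statistics` are assumptions made for illustration; a real optimizer would read such figures from the system catalog.

```python
from collections import Counter

# Hypothetical sample of a shipments relvar SP (illustrative data only).
SP = [
    {"S#": "S1", "P#": "P1", "QTY": 300},
    {"S#": "S1", "P#": "P2", "QTY": 200},
    {"S#": "S2", "P#": "P1", "QTY": 300},
    {"S#": "S2", "P#": "P2", "QTY": 400},
    {"S#": "S3", "P#": "P2", "QTY": 200},
]


def collect_statistics(relation):
    """Return tuple count, distinct-value counts, and value frequencies."""
    cardinality = len(relation)
    value_counts = {}
    for row in relation:
        for attribute, value in row.items():
            value_counts.setdefault(attribute, Counter())[value] += 1
    distinct = {attr: len(counts) for attr, counts in value_counts.items()}
    return cardinality, distinct, value_counts


cardinality, distinct, value_counts = collect_statistics(SP)
print(cardinality)          # number of tuples in SP
print(distinct)             # number of distinct values per attribute
print(value_counts["P#"])   # how often each part number occurs
```

The per-value frequencies are more than the list above strictly requires; they are included because they come for free when counting distinct values.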

If the database statistics change over time, a different strategy might become preferable; in other words, reoptimization might be required. In a relational system, reoptimization is trivial: the system optimizer simply reprocesses the original query. In a nonrelational system, reoptimization involves rewriting the program.

The optimizer is a program, so it is much more patient than a human. It can consider several hundred implementation strategies for a given request, whereas a human user would rarely consider more than three or four.

The optimizer embodies the skills and services of the best human programmers, so it makes those scarce skills available to everybody in an efficient and cost-effective manner.

These reasons are evidence that the optimizability of relational requests is in fact a strength of relational systems.

Motivating Example:

Consider the shipment example, "Get names of suppliers who supply part P2":

((SP JOIN S) WHERE P# = P#('P2')) {SNAME}

Suppose the database contains 100 suppliers and 10,000 shipments, of which 50 are for part P2. If the query is executed without optimization, the following sequence occurs:

1. Join SP and S (over S#): This step involves reading 10,000 shipments; reading each of the 100 suppliers 10,000 times; constructing 10,000 joined tuples; and writing them back to disk.
2. Restrict the result of step 1 to just the tuples for part P2: This step involves reading the 10,000 joined tuples and produces a result consisting of 50 tuples.
3. Project the result of step 2 over SNAME: This step produces the desired result.

If the query is executed with optimization, the following sequence occurs instead:

1. Restrict SP to just the tuples for part P2: This step involves reading 10,000 tuples once and produces 50 tuples.
2. Join the result of step 1 to S (over S#): This step involves reading the 100 suppliers only once and produces 50 joined tuples.
3. Project the result of step 2 over SNAME: This step produces the desired result.

The execution without optimization involves a total of 1,030,000 tuple I/Os, whereas the execution with optimization involves 10,100. If the number of tuple I/Os is our measure, the second procedure is roughly 100 times better than the first. A simple change in the execution algorithm (restricting and then joining, instead of joining and then restricting) has produced a dramatic improvement in performance.

Performance improves even more dramatically if hashing or indexing on P# is available. The number of shipments read in step 1 drops from 10,000 to 50 and the number of suppliers read in step 2 drops from 100 to 50, which is just over 10,000 times better than the original execution. In other words, if the unoptimized query took 3 hours to execute, the optimized query using hashing or indexing would take just over 1 second.

An Overview of Query Optimization:

We can identify four broad stages in query processing:
1. Cast the query into internal form.
2. Convert to canonical form.
3. Choose candidate low-level procedures.
4. Generate query plans and choose the cheapest.
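The tuple I/O figures quoted for the motivating example can be reproduced with a few lines of arithmetic. The sketch below only restates the counts given in the text; the variable names are chosen for illustration.

```python
SUPPLIERS = 100        # tuples in S
SHIPMENTS = 10_000     # tuples in SP
P2_SHIPMENTS = 50      # shipments for part P2

# Without optimization: join first, then restrict.
unoptimized = (
    SHIPMENTS                # read every shipment
    + SUPPLIERS * SHIPMENTS  # read all suppliers once per shipment
    + SHIPMENTS              # write the joined tuples to disk
    + SHIPMENTS              # read the joined tuples back for the restriction
)

# With optimization: restrict first, then join.
optimized = SHIPMENTS + SUPPLIERS

# With hashing or indexing, as described in the text:
# only 50 shipments and 50 suppliers are read.
indexed = P2_SHIPMENTS + P2_SHIPMENTS

print(unoptimized, optimized, indexed)
print(unoptimized / optimized)          # improvement from restricting early
print(unoptimized / indexed)            # improvement with an index or hash
print(3 * 3600 * indexed / unoptimized) # seconds, if the original took 3 hours
```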

Query Processing Overview

Cast the query into internal form:

The original query is converted into some internal representation that is more suitable for machine manipulation, thus eliminating purely external considerations (such as syntax) and paving the way for the subsequent stages of the overall process. View processing is also done during this stage.

What formalism should the internal representation be based on? Whatever formalism is chosen, it must be rich enough to represent all queries in the external query language. It should also be neutral, in the sense that it should not prejudice subsequent choices. The internal form typically chosen is some kind of abstract syntax tree or query tree.
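One possible shape for such a query tree is sketched below in Python. The node classes (`Relvar`, `Join`, `Restrict`, `Project`) and the helper `to_algebra` are hypothetical names introduced for this illustration, not part of any particular system; the tree built at the end is the shipment query from the motivating example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relvar:
    name: str


@dataclass(frozen=True)
class Join:
    left: object
    right: object


@dataclass(frozen=True)
class Restrict:
    source: object
    condition: str


@dataclass(frozen=True)
class Project:
    source: object
    attributes: tuple


def to_algebra(node):
    """Render a query tree back into an algebra-style expression."""
    if isinstance(node, Relvar):
        return node.name
    if isinstance(node, Join):
        return "(" + to_algebra(node.left) + " JOIN " + to_algebra(node.right) + ")"
    if isinstance(node, Restrict):
        return "(" + to_algebra(node.source) + " WHERE " + node.condition + ")"
    if isinstance(node, Project):
        return to_algebra(node.source) + " {" + ", ".join(node.attributes) + "}"
    raise TypeError(f"unknown node type: {type(node).__name__}")


# Project at the root, restriction below it, join of SP and S at the bottom.
query = Project(
    Restrict(Join(Relvar("SP"), Relvar("S")), "P# = P#('P2')"),
    ("SNAME",),
)
print(to_algebra(query))
```

Because the tree records only operators and operands, it carries none of the surface syntax of the original query, which is the neutrality the text asks for.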

Query tree for "Get names of suppliers who supply part P2".

For the internal representation, however, it is convenient to choose a formalism we are already familiar with, namely relational algebra or relational calculus. The algebraic expression for the above tree is:

((SP JOIN S) WHERE P# = P#('P2')) {SNAME}

Convert to canonical form:

In this stage, the optimizer performs a number of optimizations that are guaranteed to be good regardless of the actual data values and physical access paths. The point is that relational languages allow all but the simplest of queries to be expressed in a variety of ways that are at least superficially distinct, even without counting trivial variations such as replacing A = B by B = A. The performance of a query should not depend on the particular way the user chose to write it. The next step in processing is therefore to convert the internal representation into some equivalent canonical form, with the objective of eliminating such superficial distinctions.

Given a set Q of objects and a notion of equivalence among those objects, a subset C of Q is said to be a set of canonical forms for Q if every object q in Q is equivalent to exactly one object c in C.

To transform the output of stage 1 into an equivalent but more efficient form, the optimizer makes use of certain transformation rules. For example, the expression

(A JOIN B) WHERE restriction on A

can be transformed into the equivalent but more efficient expression

(A WHERE restriction on A) JOIN B

Choose candidate low-level procedures:

After converting the internal representation into some more desirable form, the optimizer must decide how to execute the transformed query. At this stage, considerations such as the actual data values and physical access paths come into play.

The basic strategy is to consider the query as specifying a sequence of low-level operations. These operations are interdependent: the code for one operation may require its input tuples to be sorted in some order, in which case the preceding operation must produce its output tuples in that order.

For each possible low-level operation, the optimizer has a set of predefined implementation procedures. For example, the restriction operation has a set of implementation procedures: one for the case where the restriction is an equality comparison, one where the restriction attribute is indexed, and one where the restriction attribute is hashed.

Next, using catalog information about the current state of the database, the optimizer chooses one or more candidate procedures for each low-level operation. This process is sometimes referred to as access path selection.

Generate query plans and choose the cheapest:

The final stage of the optimization process involves constructing a set of candidate query plans and then choosing the best (i.e., cheapest) of those plans. Each query plan is built by combining candidate procedures, one such procedure for each low-level operation in the query. It is usually not a good idea to generate all possible plans, because there may be very many of them, so some heuristic technique is used to keep the generated set within bounds. This is referred to as reducing the search space.

Choosing the cheapest plan obviously requires a method for assigning a cost to a plan. The cost of a given plan is basically the sum of the costs of the individual procedures that make up the plan. The problem is that the cost of a procedure depends on the size of the relations it has to process. Since intermediate results are generated during execution, the optimizer has to estimate the size of those intermediate results in order to estimate costs. But those sizes depend on the actual data values, so accurate cost estimation is a difficult problem.

Expression Transformation:

In this section we describe some of the transformation rules that might be useful in stage 2 of the optimization process, explaining with examples why they are useful.

Given a particular expression to transform, the application of one rule might generate an expression that can then be transformed in accordance with some other rule. Starting from the original expression, the optimizer applies its transformation rules repeatedly until it finally arrives at an expression that it judges, according to some set of heuristics, to be good enough.
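The idea of applying rules repeatedly, with one rule enabling another, can be sketched as a small rewrite loop. Everything here is an illustrative assumption: the tuple encoding of trees, the `SCHEMA` dictionary, and the two rules (move a restriction below a projection, then below a join). The starting expression is a hypothetical variant of the shipment query, chosen so that the second rule only becomes applicable after the first has fired.

```python
# Trees are nested tuples:
#   ("rel", name) | ("join", left, right)
#   ("restrict", source, attribute, condition) | ("project", source, attrs)
SCHEMA = {"S": {"S#", "SNAME", "CITY"}, "SP": {"S#", "P#", "QTY"}}
CHILDREN = {"rel": (), "join": (1, 2), "restrict": (1,), "project": (1,)}


def attributes(node):
    """Return the set of attribute names produced by a subtree."""
    kind = node[0]
    if kind == "rel":
        return SCHEMA[node[1]]
    if kind == "join":
        return attributes(node[1]) | attributes(node[2])
    if kind == "restrict":
        return attributes(node[1])
    return set(node[2])  # project


def restrict_below_project(node):
    """(X {attrs}) WHERE c  ->  (X WHERE c) {attrs}"""
    if node[0] == "restrict" and node[1][0] == "project":
        _, (_, source, attrs), attr, cond = node
        return ("project", ("restrict", source, attr, cond), attrs)
    return None


def restrict_below_join(node):
    """(A JOIN B) WHERE c on A  ->  (A WHERE c) JOIN B, and likewise for B."""
    if node[0] == "restrict" and node[1][0] == "join":
        _, (_, left, right), attr, cond = node
        if attr in attributes(left):
            return ("join", ("restrict", left, attr, cond), right)
        if attr in attributes(right):
            return ("join", left, ("restrict", right, attr, cond))
    return None


RULES = [restrict_below_project, restrict_below_join]


def rewrite_once(node):
    """Apply one rule somewhere in the tree; return None if none applies."""
    for rule in RULES:
        result = rule(node)
        if result is not None:
            return result
    for i in CHILDREN[node[0]]:
        new_child = rewrite_once(node[i])
        if new_child is not None:
            return node[:i] + (new_child,) + node[i + 1:]
    return None


def optimize(node):
    """Keep rewriting until no rule applies any more."""
    while True:
        rewritten = rewrite_once(node)
        if rewritten is None:
            return node
        node = rewritten


COND = "P# = P#('P2')"
start = ("restrict",
         ("project", ("join", ("rel", "SP"), ("rel", "S")), ("SNAME", "P#")),
         "P#", COND)
print(optimize(start))
```

Both rules only ever move a restriction further down the tree, so the loop is guaranteed to stop; a real optimizer needs its heuristics precisely because its rule set has no such simple termination argument.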

Restrictions and Projections:

It is generally better to do a restriction before a projection, because the restriction reduces the size of the input to the projection and hence the amount of data that might need to be sorted for duplicate elimination.

Distributivity:

The transformation rule used in the previous example (transforming a join followed by a restriction into a restriction followed by a join) is actually a special case of the distributive law. In general, the monadic operator f is said to be distributive over the dyadic operator o if and only if

f (A o B) = f (A) o f (B)

for all A, B. In ordinary arithmetic, for example, SQRT (for non-negative operands) is distributive over multiplication, because

SQRT (A * B) = SQRT (A) * SQRT (B)
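The definition can be spot-checked numerically. The helper `distributes` below is a hypothetical name; it tests the law for one pair of operands only, so a True result is evidence rather than proof, while a False result is a genuine counterexample. The relational case (a restriction over a union of two small relations) is an assumed extra instance added for illustration.

```python
import math
import operator


def distributes(f, op, a, b, equal=operator.eq):
    """Check f(a op b) == f(a) op f(b) for a single pair of operands."""
    return equal(f(op(a, b)), op(f(a), f(b)))


# Arithmetic: SQRT over multiplication and over addition.
sqrt_over_mul = distributes(math.sqrt, operator.mul, 4.0, 9.0, math.isclose)
sqrt_over_add = distributes(math.sqrt, operator.add, 4.0, 9.0, math.isclose)

# Relational: relations as sets of (S#, P#) tuples, restriction to part P2.
A = {("S1", "P1"), ("S1", "P2")}
B = {("S2", "P2"), ("S3", "P3")}


def only_p2(relation):
    return {t for t in relation if t[1] == "P2"}


restrict_over_union = distributes(only_p2, operator.or_, A, B)

print(sqrt_over_mul, sqrt_over_add, restrict_over_union)
```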

So an arithmetic expression optimizer can replace either of these expressions by the other when transforming arithmetic expressions. As a counterexample, SQRT is not distributive over addition, because SQRT (A + B) is not in general equal to SQRT (A) + SQRT (B).

In relational algebra, restriction is distributive over union, intersection, and difference. It also distributes over join if and only if the restriction condition consists, at its most complex, of two simple restriction conditions ANDed together, one for each of the two join operands. In the suppliers example, this requirement was indeed satisfied (in fact, the condition was a simple restriction condition on just one of the operands), so we could use the distributive law to replace the overall expression by a more efficient equivalent. The net effect was that we were able to do the restriction early. Doing restrictions early is a good idea, because it reduces the number of tuples to be scanned in the next operation in sequence and probably reduces the number of tuples in the output of that operation too.

Here are a couple more specific cases of the distributive law, this time involving projection. First, projection distributes over union, but not in general over intersection or difference (two relations can share no tuples and yet have projections that overlap):

(A UNION B) {acl} = A {acl} UNION B {acl}

A and B must be of the same type, of course. Second, projection also distributes over join, as long as the projection retains all of the join attributes, thus:

(A JOIN B) {acl} = (A {acl1}) JOIN (B {acl2})

Here acl1 is the union of the join attributes and those attributes of acl that appear in A only, and acl2 is the union of the join attributes and those attributes of acl that appear in B only. These laws can be used to do projections early, which again is usually a good idea, for reasons similar to those given previously for restrictions.

Idempotence and Absorption:

Commutativity and Associativity:

Computational Expressions:

It is not just relational expressions that are subject to transformation laws. For instance, we have already indicated that certain transformations are valid for arithmetic expressions. Here is a specific example. The expression

A * B + A * C

can be transformed into

A * (B + C)

by virtue of the fact that * distributes over +. A relational optimizer needs to know about such transformations because it will encounter such expressions in the context of the extend and summarize operators.

Note, incidentally, that this example illustrates a slightly more general form of distributivity. Earlier, we defined distributivity in terms of a monadic operator distributing over a dyadic operator; in the case at hand, however, * and + are both

dyadic operators. In general, the dyadic operator δ is said to be distributive over the dyadic operator Ο if and only if

A δ (B Ο C) = (A δ B) Ο (A δ C)

for all A, B, C (in the arithmetic example, take δ as * and Ο as +).

Boolean Expressions:

We turn now to Boolean expressions. Suppose A and B are attributes of two distinct relations. Then the Boolean expression

A > B AND B > 3

is clearly equivalent to the following:

A > B AND B > 3 AND A > 3

The equivalence is based on the fact that the comparison operator ">" is transitive. Note that this transformation is certainly worth making, because it enables the system to perform an additional restriction (on A) before doing the greater-than join implied by the comparison "A > B". To repeat a point made earlier, doing restrictions early is generally a good idea; having the system infer additional "early" restrictions, as here, is also a good idea.

Note: This technique is implemented in several commercial products, including, for example, DB2 (where it is called "predicate transitive closure") and Ingres.

Here is another example. The expression

A > B OR (C = D AND E < F)

can be transformed into

(A > B OR C = D) AND (A > B OR E < F)

by virtue of the fact that OR distributes over AND. This example illustrates another general law, viz.: any Boolean expression can be transformed into an equivalent expression in what is called conjunctive normal form (CNF). A CNF expression is an expression of the form

C1 AND C2 AND ... AND Cn

where each of C1, C2, ..., Cn is, in turn, a Boolean expression (called a conjunct) that involves no ANDs. The advantage of CNF is that a CNF expression is true only if every conjunct is true; equivalently, it is false if any conjunct is false. Since AND is commutative, the optimizer can evaluate the individual conjuncts in any order it likes; in particular, it can do them in order of increasing difficulty. As soon as it finds one that is false, the whole process can stop.

Furthermore, in a parallel-processing system, it might even be possible to evaluate all of the conjuncts in parallel. Again, as soon as one is found that is false, the whole process can stop.
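Both points in this subsection lend themselves to a short Python sketch. The first part checks, over all eight truth assignments, that OR distributes over AND. The second evaluates the conjuncts of "A > B AND B > 3 AND A > 3" cheapest first and stops at the first false one; the cost figures and the function name `evaluate_cnf` are assumptions made purely for illustration.

```python
from itertools import product

# OR distributes over AND: p OR (q AND r) == (p OR q) AND (p OR r).
or_distributes_over_and = all(
    (p or (q and r)) == ((p or q) and (p or r))
    for p, q, r in product([False, True], repeat=3)
)


def evaluate_cnf(conjuncts, row):
    """Evaluate (cost, label, predicate) conjuncts cheapest first.

    Returns the overall truth value and the labels actually evaluated,
    stopping as soon as one conjunct turns out to be false.
    """
    evaluated = []
    for _cost, label, predicate in sorted(conjuncts, key=lambda c: c[0]):
        evaluated.append(label)
        if not predicate(row):
            return False, evaluated
    return True, evaluated


# Hypothetical costs: simple restrictions are cheap, the join comparison is not.
CONJUNCTS = [
    (10, "A > B", lambda row: row["A"] > row["B"]),
    (1, "A > 3", lambda row: row["A"] > 3),
    (2, "B > 3", lambda row: row["B"] > 3),
]

print(or_distributes_over_and)
print(evaluate_cnf(CONJUNCTS, {"A": 2, "B": 5}))  # fails on the cheapest conjunct
print(evaluate_cnf(CONJUNCTS, {"A": 7, "B": 5}))  # every conjunct is evaluated
```

In the first call the inferred restriction "A > 3" rejects the row before the expensive comparison "A > B" is ever attempted, which is the benefit the text attributes to predicate transitive closure.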

It follows from this subsection and its predecessor that the optimizer needs to know how general properties such as distributivity apply not only to relational operators such as join, but also to comparison operators such as ">", Boolean operators such as AND and OR, arithmetic operators such as "+", and so on.

Choice of Evaluation Plans:

Generation of expressions is only part of the query-optimization process, since each operation in the expression can be implemented with different algorithms. An evaluation plan is therefore needed to define exactly what algorithm should be used for each operation, and how the execution of the operations should be coordinated. As we have seen, several different algorithms can be used for each relational operation, giving rise to alternative evaluation plans. Further, decisions about pipelining have to be made. In the figure, the edges from the selection operations to the merge join operation are marked as pipelined; pipelining is feasible if the selection operations generate their output sorted on the join attributes. They would do so if the indices on branch and account store records with equal values for the index attributes sorted by branch_name.

Interaction of Evaluation Techniques:

One way to choose an evaluation plan for a query expression is simply to choose, for each operation, the cheapest algorithm for evaluating it. We can choose any ordering of the operations that ensures that operations lower in the tree are executed before operations higher in the tree. However, choosing the cheapest algorithm for each operation independently is not necessarily a good idea. Although a merge join at a given level may be costlier

than a hash join, it may provide a sorted output that makes evaluating a later operation (such as duplicate elimination, intersection, or another merge join) cheaper. Similarly, a nested-loop join with indexing may provide opportunities for pipelining the results to the next operation, and thus may be useful even if it is not the cheapest way of performing the join. To choose the best overall algorithm, we must consider even non-optimal algorithms for individual operations.

Thus, in addition to considering alternative expressions for a query, we must also consider alternative algorithms for each operation in an expression. We can use rules much like the equivalence rules to define what algorithms can be used for each operation, and whether its result can be pipelined or must be materialized. We can use these rules to generate all the query-evaluation plans for a given expression. Depending upon the indices available, certain selection operations can be evaluated using only an index without accessing the relation itself.

That still leaves the problem of choosing the best evaluation plan for a query. There are two broad approaches: the first searches all the plans and chooses the best plan in a cost-based fashion; the second uses heuristics to choose a plan. Practical query optimizers incorporate elements of both approaches.

Cost-Based Optimization:

A cost-based optimizer generates a range of query-evaluation plans from the given query by using the equivalence rules, and chooses the one with the least cost. For a complex query, the number of different query plans that are equivalent to a given plan can be large. As an illustration, consider the expression

$r_1 \bowtie r_2 \bowtie \cdots \bowtie r_n$

where the joins are expressed without any ordering. With n = 3, there are 12 different join orderings:

$r_1 \bowtie (r_2 \bowtie r_3)$   $r_1 \bowtie (r_3 \bowtie r_2)$   $(r_2 \bowtie r_3) \bowtie r_1$   $(r_3 \bowtie r_2) \bowtie r_1$
$r_2 \bowtie (r_1 \bowtie r_3)$   $r_2 \bowtie (r_3 \bowtie r_1)$   $(r_1 \bowtie r_3) \bowtie r_2$   $(r_3 \bowtie r_1) \bowtie r_2$
$r_3 \bowtie (r_1 \bowtie r_2)$   $r_3 \bowtie (r_2 \bowtie r_1)$   $(r_1 \bowtie r_2) \bowtie r_3$   $(r_2 \bowtie r_1) \bowtie r_3$

In general, with n relations, there are $(2(n-1))! / (n-1)!$ different join orders. For joins involving small numbers of relations, this number is acceptable; for example, with n = 5, the number is 1680. However, as n increases, this number rises quickly. With n = 7, the number is 665,280; with n = 10, the number is greater than 17.6 billion!

Luckily, it is not necessary to generate all the expressions equivalent to a given expression. For example, suppose we want to find the best join order of the form

$(r_1 \bowtie r_2 \bowtie r_3) \bowtie r_4 \bowtie r_5$

which represents all join orders where $r_1$, $r_2$, and $r_3$ are joined first (in some order), and the result is joined (in some order) with $r_4$ and $r_5$. There are 12 different join orders for computing $r_1 \bowtie r_2 \bowtie r_3$, and 12 orders for computing the join of this result

with $r_4$ and $r_5$. Thus, there appear to be 144 join orders to examine. However, once we have found the best join order for the subset of relations $\{r_1, r_2, r_3\}$, we can use that order for further joins with $r_4$ and $r_5$, and can ignore all costlier join orders of $r_1 \bowtie r_2 \bowtie r_3$. Thus, instead of 144 choices to examine, we need to examine only 12 + 12 choices.

Using this idea, we can develop a dynamic-programming algorithm for finding optimal join orders. Dynamic-programming algorithms store results of computations and reuse them, a procedure that can reduce execution time greatly. The procedure stores the evaluation plans it computes in an associative array bestplan, which is indexed by sets of relations. Each element of the associative array contains two components: the cost of the best plan of S, and the plan itself. The value of bestplan[S].cost is assumed to be initialized to ∞ if bestplan[S] has not yet been computed.

Dynamic-programming algorithm for join order optimization:

    procedure FindBestPlan(S)
        if (bestplan[S].cost ≠ ∞)    /* bestplan[S] already computed */
            return bestplan[S]
        if (S contains only 1 relation)
            set bestplan[S].plan and bestplan[S].cost based on best way of accessing S
        else for each non-empty subset S1 of S such that S1 ≠ S
            P1 = FindBestPlan(S1)
            P2 = FindBestPlan(S - S1)
            A = best algorithm for joining results of P1 and P2
            cost = P1.cost + P2.cost + cost of A
            if cost < bestplan[S].cost
                bestplan[S].cost = cost
                bestplan[S].plan = "execute P1.plan; execute P2.plan;
                                    join results of P1 and P2 using A"
        return bestplan[S]

The procedure first checks if the best plan for computing the join of the given set of relations S has been computed already (and stored in the associative array bestplan); if so, it returns the already computed plan. If S contains only one relation, the best way of accessing S (taking selections on S, if any, into account) is recorded in bestplan. This may involve using an index to identify tuples, and then fetching the tuples (often referred to as an index scan), or scanning the entire relation (often referred to as a relation scan). Otherwise, the procedure tries every way of dividing S into two disjoint subsets. For each division, the procedure recursively finds the best plans for each of

the two subsets, and then computes the cost of the overall plan by using that division. The procedure picks the cheapest plan from among all the alternatives for dividing S into two sets. The cheapest plan and its cost are stored in the array bestplan, and returned by the procedure. The time complexity of the procedure can be shown to be $O(3^n)$.

Actually, the order in which tuples are generated by the join of a set of relations is also important for finding the best overall join order, since it can affect the cost of further joins (for instance, if merge join is used). A particular sort order of the tuples is said to be an interesting sort order if it could be useful for a later operation. For instance, generating the result of $r_1 \bowtie r_2 \bowtie r_3$ sorted on the attributes common with $r_4$ or $r_5$ may be useful, but generating it sorted on the attributes common to only $r_1$ and $r_2$ is not useful. Using merge join for computing $r_1 \bowtie r_2 \bowtie r_3$ may be costlier than using some other join technique, but may provide an output sorted in an interesting sort order.

Hence, it is not sufficient to find the best join order for each subset of the set of n given relations. Instead, we have to find the best join order for each subset, for each interesting sort order of the join result for that subset. The number of subsets of n relations is $2^n$. The number of interesting sort orders is generally not large. Thus, about $2^n$ join expressions need to be stored. The dynamic-programming algorithm for finding the best join order can be easily extended to handle sort orders. The cost of the extended algorithm depends on the number of interesting orders for each subset of relations; since this number has been found to be small in practice, the cost remains at $O(3^n)$. With n = 10, this number is around 59,000, which is much better than the 17.6 billion different join orders.
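A compact Python rendering of the FindBestPlan procedure is sketched below, together with the join-order counts quoted in the text. It is a toy under stated assumptions: the cardinalities, the single `SELECTIVITY` value, and the cost model (each join reads both of its inputs once) are all hypothetical, the choice of join algorithm A is folded into that one cost formula, and interesting sort orders are ignored. The `lru_cache` decorator plays the role of the bestplan array.

```python
from functools import lru_cache
from itertools import combinations
from math import factorial, prod

# Hypothetical statistics, for illustration only.
CARDINALITY = {"r1": 1000, "r2": 100, "r3": 10, "r4": 10000}
SELECTIVITY = 0.01  # assumed fraction of tuple pairs surviving each join


def estimated_size(relations):
    """Toy estimate of the number of tuples in the join of `relations`."""
    sizes = prod(CARDINALITY[r] for r in relations)
    return sizes * SELECTIVITY ** (len(relations) - 1)


def proper_subsets(relations):
    """Yield every non-empty subset S1 of `relations` with S1 != relations."""
    names = sorted(relations)
    for k in range(1, len(names)):
        for combo in combinations(names, k):
            yield frozenset(combo)


@lru_cache(maxsize=None)  # memo table standing in for bestplan
def find_best_plan(relations):
    """Return (cost, plan) for joining a frozenset of relation names."""
    if len(relations) == 1:
        (name,) = relations
        return float(CARDINALITY[name]), name  # relation scan
    best_cost, best_plan = float("inf"), None
    for s1 in proper_subsets(relations):
        s2 = relations - s1
        cost1, plan1 = find_best_plan(s1)
        cost2, plan2 = find_best_plan(s2)
        # Toy join cost: read both inputs once.
        cost = cost1 + cost2 + estimated_size(s1) + estimated_size(s2)
        if cost < best_cost:
            best_cost, best_plan = cost, (plan1, plan2)
    return best_cost, best_plan


def join_orders(n):
    """Number of join orders for n relations: (2(n-1))! / (n-1)!"""
    return factorial(2 * (n - 1)) // factorial(n - 1)


def leaves(plan):
    """Collect the relation names appearing in a nested plan tuple."""
    if isinstance(plan, str):
        return {plan}
    return leaves(plan[0]) | leaves(plan[1])


best_cost, best_plan = find_best_plan(frozenset(CARDINALITY))
print(best_plan, best_cost)
for n in (3, 5, 7, 10):
    print(n, join_orders(n), 3 ** n)
```

The last loop puts the number of join orders next to $3^n$ for the same n, which is the comparison the text makes for n = 10.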
More important, the storage required is much less than before, since we need to store only one join order for each interesting sort order of each of the 1024 subsets of $r_1, \ldots, r_{10}$. Although both numbers still increase rapidly with n, commonly occurring joins usually have fewer than 10 relations, and can be handled easily.

We can use several techniques to reduce further the cost of searching through a large number of plans. For instance, when examining the plans for an expression, we can terminate after we examine only a part of the expression, if we determine that the cheapest plan for that part is already costlier than the cheapest evaluation plan for a full expression examined earlier. Similarly, suppose that we determine that the cheapest way of evaluating a sub-expression is costlier than the cheapest evaluation plan for a full expression examined earlier. Then, no full expression involving that sub-expression needs to be examined. We can further reduce the number of evaluation plans that need to be considered fully by first making a heuristic guess of a good plan, and estimating that plan's cost. Then, only a few competing plans will require a full analysis of cost. These optimizations can reduce the overhead of query optimization significantly.

The intricacies of SQL introduce a good deal of complexity into query optimizers. The approach to optimization described above concentrates on join-order

optimization. In contrast, the optimizers used in some other systems, notably Microsoft SQL Server, are based on equivalence rules. The benefit of using equivalence rules is that it is easy to extend the optimizer with new rules. For example, nested queries can be represented using extended relational-algebra constructs, and transformations of nested queries can be expressed as equivalence rules. Making the approach work efficiently requires efficient techniques for detecting duplicate derivations, and a form of dynamic programming to avoid reoptimizing the same sub-expressions. This approach was pioneered by the Volcano research project.

Advanced Types of Optimization:

In this section, we attempt to provide a brief glimpse of advanced types of optimization that researchers have proposed over the past few years. The descriptions are based on examples only; further details may be found in the references provided. Furthermore, there are several issues that are not discussed at all due to lack of space, although much interesting work has been done on them, e.g., nested query optimization, rule-based query optimization, query optimizer generators, object-oriented query optimization, optimization with materialized views, heterogeneous query optimization, recursive query optimization, aggregate query optimization, optimization with expensive selection predicates, and query optimizer validation.

1. Semantic Query Optimization:

Semantic query optimization is a form of optimization mostly related to the Rewriter module. The basic idea lies in using integrity constraints defined in the database to rewrite a given query into semantically equivalent ones [Kin81]. These can then be optimized by the Planner as regular queries, and the most efficient plan among all can be used to answer the original query. As a simple example, using a hypothetical SQL-like syntax, consider the following integrity constraint:

ASSERT sal-constraint ON emp: sal > 100K WHERE job = "Sr. Programmer".
Also consider the following query:

SELECT name, floor
FROM emp, dept
WHERE emp.dno = dept.dno AND job = "Sr. Programmer".

Using the above integrity constraint, the query can be rewritten into a semantically equivalent one to include a selection on sal:

SELECT name, floor
FROM emp, dept
WHERE emp.dno = dept.dno AND job = "Sr. Programmer" AND sal > 100K.

Having the extra selection could help tremendously in finding a fast plan to answer the query if the only index in the database is a B+-tree on emp.sal. On the other hand, it would certainly be a waste if no such index exists. For such reasons, all proposals for semantic query optimization present various heuristics or rules on which rewritings have the potential of being beneficial and should be applied, and which should not.

2. Global Query Optimization:

So far, we have focused our attention on optimizing individual queries. Quite often, however, multiple queries become available for optimization at the same time, e.g., queries with unions, queries from multiple concurrent users, queries embedded in a single program, or queries in a deductive system. Instead of optimizing each query separately, one may be able to obtain a global plan that, although possibly suboptimal for each individual query, is optimal for the execution of all of them as a group. Several techniques have been proposed for global query optimization [Sel88]. As a simple example of the problem of global optimization, consider the following two queries:

SELECT name, floor
FROM emp, dept
WHERE emp.dno = dept.dno AND job = "Sr. Programmer",

SELECT name
FROM emp, dept
WHERE emp.dno = dept.dno AND budget > 1M.

Depending on the sizes of the emp and dept relations and the selectivity of the selections, it may well be that computing the entire join once and then applying the two selections separately to obtain the results of the two queries is more efficient than doing the join twice, each time taking into account the corresponding selection. Developing Planner modules that would examine all the available global plans and identify the optimal one is the goal of global/multiple query optimizers.

3. Parametric/Dynamic Query Optimization:

As mentioned earlier, embedded queries are typically optimized once at compile time and are executed multiple times at run time. Because of this temporal separation between optimization and execution, the values of various parameters that are used during optimization may be very different during execution. This may make the chosen plan invalid (e.g., if indices used in the plan are no longer available) or simply not optimal (e.g., if the number of available buffer pages or operator selectivities have changed, or if new indices have become available). To address this issue, several techniques have been proposed that use various search strategies (e.g., randomized algorithms or the strategy of Volcano) to optimize queries as much as possible at compile time, taking into account all possible values that interesting parameters may have at run time. These techniques use the actual parameter values

at run time, and simply pick the plan that was found optimal for them, with little or no overhead. Of a drastically different flavor is the technique of Rdb/VMS [Ant93], where, by dynamically monitoring how the probability distribution of plan costs changes, plan switching may actually occur during query execution.

Estimation of Query-Processing Cost:

1. To choose a strategy based on reliable information, the database system may store statistics for each relation r:
o $n_r$ - the number of tuples in r.
o $s_r$ - the size in bytes of a tuple of r (for fixed-length records).
o $V(A, r)$ - the number of distinct values that appear in relation r for attribute A.
2. The first two quantities allow us to estimate accurately the size of a Cartesian product.
o The Cartesian product $r \times s$ contains $n_r \cdot n_s$ tuples.
o Each tuple of $r \times s$ occupies $s_r + s_s$ bytes.
o The third statistic is used to estimate how many tuples satisfy a selection predicate of the form <attribute-name> = <value>.
o We need to know how often each value appears in a column.
o If we assume each value appears with equal probability, then $\sigma_{A=a}(r)$ is estimated to have $n_r / V(A, r)$ tuples.
o This may not be the case, but it is a good approximation of reality in many relations.
o We assume such a uniform distribution for the rest of this chapter.
o Estimation of the size of a natural join is more difficult.
o Let $r_1(R_1)$ and $r_2(R_2)$ be relations on schemes $R_1$ and $R_2$.
o If $R_1 \cap R_2 = \emptyset$ (no common attributes), then $r_1 \bowtie r_2$ is the same as $r_1 \times r_2$, and we can estimate the size of this accurately.
o If $R_1 \cap R_2$ is a key for $R_1$, then we know that a tuple of $r_2$ will join with at most one tuple of $r_1$.
o Thus the number of tuples in $r_1 \bowtie r_2$ will be no greater than $n_{r_2}$.
o If $R_1 \cap R_2$ is not a key for $R_1$ or $R_2$, things are more difficult.

o We use the third statistic and the assumption of uniform distribution.
o Assume $R_1 \cap R_2 = \{A\}$.
o We assume there are $n_{r_2} / V(A, r_2)$ tuples in $r_2$ with an A value of t[A] for tuple t in $r_1$.
o So tuple t of $r_1$ produces $n_{r_2} / V(A, r_2)$ tuples in $r_1 \bowtie r_2$.
3. Considering all the tuples in $r_1$, we estimate that there are $n_{r_1} \cdot n_{r_2} / V(A, r_2)$ tuples in total in $r_1 \bowtie r_2$.
4. If we reverse the roles of $r_1$ and $r_2$ in this equation, we get a different estimate, $n_{r_1} \cdot n_{r_2} / V(A, r_1)$, if $V(A, r_1) \neq V(A, r_2)$.
o If this occurs, there are likely to be some dangling tuples that do not participate in the join.
o Thus the lower estimate is probably the better one.
o This estimate may still be high if the A values in $r_1$ have few values in common with the A values in $r_2$.
o However, it is unlikely that the estimate is far off, as dangling tuples are likely to be a small fraction of the tuples in a real-world relation.
5. To maintain accurate statistics, it is necessary to update the statistics whenever a relation is modified. The overhead of doing so can be substantial, so most systems do this updating during periods of light load on the system.

Guidelines:

For any production database, SQL query performance becomes an issue sooner or later. Long-running queries not only consume system resources, making the server and application run slowly, but may also lead to table locking and data corruption issues. So, query optimization becomes an important task.

First, we offer some guiding principles for query optimization:

1. Understand how your database is executing your query

Most databases today have their own query optimizer, and offer a way for users to understand how a query is executed. For example, which index from which table is

being used to execute the query? The first step in query optimization is understanding what the database is doing. Different databases have different commands for this. For example, in MySQL, one can use the "EXPLAIN [SQL Query]" statement to see the query plan. In Oracle, one can use "EXPLAIN PLAN FOR [SQL Query]" to see the query plan.

2. Retrieve as little data as possible
The more data returned from the query, the more resources the database needs to expend to process and store the data. So, for example, if you only need to retrieve one column from a table, do not use 'SELECT *'.

3. Store intermediate results
Sometimes the logic for a query can be quite complex. Often, it is possible to achieve the desired result through the use of subqueries, inline views, and UNION-type statements. In those cases, the intermediate results are not stored in the database, but are immediately used within the query. This can lead to performance issues, especially when the intermediate results have a large number of rows. The way to increase query performance in those cases is to store the intermediate results in a temporary table, and break up the initial SQL statement into several SQL statements. In many cases, you can even build an index on the temporary table to speed up the query even more. Granted, this adds a little complexity in query management (i.e., the need to manage temporary tables), but the speedup in query performance is often worth the trouble.

Below are several specific query optimization strategies.

Use Index
Using an index is the first strategy one should use to speed up a query. In fact, this strategy is so important that index optimization is also discussed.

Aggregate Table
Pre-populate tables at higher levels of aggregation so that less data needs to be parsed.

Vertical Partitioning
Partition the table by columns. This strategy decreases the amount of data a SQL query needs to process.

Horizontal Partitioning
Partition the table by data value, most often time. This strategy decreases the amount of data a SQL query needs to process.
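
The advice to inspect the query plan and the "Use Index" strategy can be tried out together on a small scale. The following is only a minimal sketch, assuming SQLite through Python's built-in sqlite3 module rather than MySQL or Oracle; SQLite's counterpart to EXPLAIN is "EXPLAIN QUERY PLAN". The table orders, its columns, the index name idx_orders_customer, and the helper functions are all hypothetical names chosen for illustration.

```python
import sqlite3


def plan_details(conn, sql, params=()):
    """Return the textual 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The detail text is the last column of each plan row.
    return [row[-1] for row in rows]


def demo():
    # Hypothetical schema and data, used only for illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
        [(i % 100, float(i)) for i in range(1000)],
    )

    # Retrieve only the column that is needed instead of SELECT *.
    query = "SELECT total FROM orders WHERE customer_id = ?"

    before = plan_details(conn, query, (42,))  # plan without an index
    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
    after = plan_details(conn, query, (42,))   # plan once the index exists

    conn.close()
    return before, after


if __name__ == "__main__":
    before, after = demo()
    print("without index:", before)
    print("with index:   ", after)
```

In this sketch the first plan should report a full scan of the table, while the second should report a search that uses the new index. The exact wording of the plan text varies between SQLite versions, so it is best read rather than parsed.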

De-normalization
The process of de-normalization combines multiple tables into a single table. This speeds up query performance because fewer table joins are needed.

Server Tuning
Each server has its own parameters, and tuning them so that the server can take full advantage of the hardware resources can often significantly speed up query performance.

References:
An Introduction to Database Systems, Eighth Edition - C. J. Date
Database System Concepts, Fifth Edition - Silberschatz, Korth, Sudarshan