Query Optimization

Evaluation of Expressions
- Materialization: evaluate one operation at a time and materialize (store) each intermediate result for subsequent use
  - Always applicable
  - Cost = sum of the costs of the individual operations + cost of writing intermediate results to disk; may be costly
- Pipelining: evaluate several operations simultaneously, with no need to store temporary results
  - May not be applicable in some situations
- Double buffering: use two output buffers for each operation; when one is full, write it to disk while the other is being filled, which overlaps disk writes with computation and reduces execution time
CMPT 454: Database II -- Query Optimization

Pipelining
- Demand driven (pulling data up): the system repeatedly requests tuples from the operation at the top of the pipeline
- Producer driven (pushing data up): operations generate tuples eagerly into their output buffers until the buffers are full
- Iterators: many operations are active at once, each exposing hasNext() and next()
[Figure: join using pipelining; an iterator over $\sigma_{balance < 2500}$ (use index 1) on account feeds the join]
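
The demand-driven (pull) model above can be sketched with Python generators, which behave like iterators with hasNext()/next(). This is only an illustrative sketch: the operator names scan, select and project, and the sample account rows, are assumptions and do not come from the slides.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]

def scan(rows: Iterable[Row]) -> Iterator[Row]:
    """Leaf operator: hands out one tuple each time the parent asks."""
    for row in rows:
        yield row

def select(child: Iterator[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Selection: pulls from its child and passes on matching tuples."""
    for row in child:
        if predicate(row):
            yield row

def project(child: Iterator[Row], attrs: List[str]) -> Iterator[Row]:
    """Projection (without duplicate elimination): keeps only attrs."""
    for row in child:
        yield {a: row[a] for a in attrs}

# Hypothetical sample data for the account relation.
account = [
    {"account_number": "A-101", "balance": 500},
    {"account_number": "A-102", "balance": 4000},
    {"account_number": "A-215", "balance": 700},
]

# No intermediate relation is stored: each tuple flows through all operators
# only when the consumer at the top requests it.
pipeline = project(
    select(scan(account), lambda r: r["balance"] < 2500),
    ["account_number"],
)
result = list(pipeline)
```

Nothing is computed until list() starts pulling from the top operator, which is the behavior the demand-driven bullet describes.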
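
Query Optimization Example
- Query: find the names of all customers who have an account at any branch located in Brooklyn
- Relational expression: $\Pi_{customer\_name}(\sigma_{branch\_city = \text{"Brooklyn"}}(branch \bowtie (account \bowtie depositor)))$
- Pushing the selection down: $\Pi_{customer\_name}((\sigma_{branch\_city = \text{"Brooklyn"}}(branch)) \bowtie (account \bowtie depositor))$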

Steps in Query Optimization
- Input: a relational-algebra expression
- Generate expressions that are logically equivalent to the given expression
  - Using equivalence rules
- Annotate the resulting expressions in alternative ways to generate alternative query-evaluation plans
- Estimate the cost of each evaluation plan
  - Using statistical information about the relations, e.g., sizes and indices
- Cost-based optimization: choose the query plan with the lowest estimated cost

Equivalence Rules
- Two relational algebra expressions are equivalent if, on every legal database instance, they generate the same set of tuples
- Two expressions in the multiset version of the relational algebra are equivalent if, on every legal database instance, they generate the same multiset of tuples
- The order of tuples does not matter
- An equivalence rule says that expressions of two forms are equivalent, so we can replace one form by the other
  1. $\sigma_{\theta_1 \wedge \theta_2}(E) = \sigma_{\theta_1}(\sigma_{\theta_2}(E))$
  2. $\sigma_{\theta_1}(\sigma_{\theta_2}(E)) = \sigma_{\theta_2}(\sigma_{\theta_1}(E))$
  3. $\Pi_{L_1}(\Pi_{L_2}(\ldots(\Pi_{L_n}(E))\ldots)) = \Pi_{L_1}(E)$
  4a. $\sigma_{\theta}(E_1 \times E_2) = E_1 \bowtie_{\theta} E_2$
  4b. $\sigma_{\theta_1}(E_1 \bowtie_{\theta_2} E_2) = E_1 \bowtie_{\theta_1 \wedge \theta_2} E_2$
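
More Equivalence Rules
  5. $E_1 \bowtie_{\theta} E_2 = E_2 \bowtie_{\theta} E_1$
  6a. $(E_1 \bowtie E_2) \bowtie E_3 = E_1 \bowtie (E_2 \bowtie E_3)$
  6b. $(E_1 \bowtie_{\theta_1} E_2) \bowtie_{\theta_2 \wedge \theta_3} E_3 = E_1 \bowtie_{\theta_1 \wedge \theta_3} (E_2 \bowtie_{\theta_2} E_3)$, where $\theta_2$ involves attributes of $E_2$ and $E_3$ only
  7a. $\sigma_{\theta_0}(E_1 \bowtie_{\theta} E_2) = (\sigma_{\theta_0}(E_1)) \bowtie_{\theta} E_2$, when $\theta_0$ involves attributes of $E_1$ only
  7b. $\sigma_{\theta_1 \wedge \theta_2}(E_1 \bowtie_{\theta} E_2) = (\sigma_{\theta_1}(E_1)) \bowtie_{\theta} (\sigma_{\theta_2}(E_2))$, when $\theta_1$ involves attributes of $E_1$ only and $\theta_2$ attributes of $E_2$ only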
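
More Equivalence Rules
  8a. $\Pi_{L_1 \cup L_2}(E_1 \bowtie_{\theta} E_2) = (\Pi_{L_1}(E_1)) \bowtie_{\theta} (\Pi_{L_2}(E_2))$, when $\theta$ involves attributes in $L_1 \cup L_2$ only
  8b. $\Pi_{L_1 \cup L_2}(E_1 \bowtie_{\theta} E_2) = \Pi_{L_1 \cup L_2}((\Pi_{L_1 \cup L_3}(E_1)) \bowtie_{\theta} (\Pi_{L_2 \cup L_4}(E_2)))$, where $L_3$ ($L_4$) are the attributes of $E_1$ ($E_2$) that appear in $\theta$ but not in $L_1 \cup L_2$
  9. $E_1 \cup E_2 = E_2 \cup E_1$ and $E_1 \cap E_2 = E_2 \cap E_1$
  10. $(E_1 \cup E_2) \cup E_3 = E_1 \cup (E_2 \cup E_3)$ and $(E_1 \cap E_2) \cap E_3 = E_1 \cap (E_2 \cap E_3)$
  11. $\sigma_{\theta}(E_1 - E_2) = \sigma_{\theta}(E_1) - \sigma_{\theta}(E_2)$, and similarly for $\cup$ and $\cap$ in place of $-$
      $\sigma_{\theta}(E_1 - E_2) = \sigma_{\theta}(E_1) - E_2$, and similarly for $\cap$ in place of $-$, but not for $\cup$
  12. $\Pi_{L}(E_1 \cup E_2) = (\Pi_{L}(E_1)) \cup (\Pi_{L}(E_2))$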

Example
- Find the names of all customers with an account at a Brooklyn branch whose account balance is over $1000
  $\Pi_{customer\_name}(\sigma_{branch\_city = \text{"Brooklyn"} \wedge balance > 1000}(branch \bowtie (account \bowtie depositor)))$
- Transformation using join associativity (Rule 6a):
  $\Pi_{customer\_name}((\sigma_{branch\_city = \text{"Brooklyn"} \wedge balance > 1000}(branch \bowtie account)) \bowtie depositor)$
- The second form provides an opportunity to apply the "perform selections early" rule, resulting in the subexpression
  $\sigma_{branch\_city = \text{"Brooklyn"}}(branch) \bowtie \sigma_{balance > 1000}(account)$
- When we compute $(\sigma_{branch\_city = \text{"Brooklyn"}}(branch)) \bowtie account$, we obtain a relation whose schema is (branch_name, branch_city, assets, account_number, balance)
- Push projections down and eliminate unneeded attributes from intermediate results to get
  $\Pi_{customer\_name}((\Pi_{account\_number}((\sigma_{branch\_city = \text{"Brooklyn"}}(branch)) \bowtie account)) \bowtie depositor)$
- Performing the projection as early as possible reduces the size of the relation to be joined
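
To make the transformation above concrete, the following sketch evaluates the original expression and the form with selections and projections pushed down on a tiny in-memory database and checks that both return the same customers. The helper functions and all sample rows are assumptions made for illustration; relations are plain lists of dictionaries.

```python
def natural_join(r, s):
    """Join on all attributes the two relations have in common."""
    out = []
    for t in r:
        for u in s:
            if all(t[a] == u[a] for a in set(t) & set(u)):
                out.append({**t, **u})
    return out

def select(r, pred):
    return [t for t in r if pred(t)]

def project(r, attrs):
    """Projection with duplicate elimination (set semantics)."""
    out = []
    for t in r:
        row = {a: t[a] for a in attrs}
        if row not in out:
            out.append(row)
    return out

# Hypothetical sample data.
branch = [
    {"branch_name": "Downtown", "branch_city": "Brooklyn", "assets": 9000000},
    {"branch_name": "Perryridge", "branch_city": "Horseneck", "assets": 1700000},
]
account = [
    {"account_number": "A-101", "branch_name": "Downtown", "balance": 500},
    {"account_number": "A-102", "branch_name": "Perryridge", "balance": 4000},
    {"account_number": "A-201", "branch_name": "Downtown", "balance": 1500},
]
depositor = [
    {"customer_name": "Johnson", "account_number": "A-101"},
    {"customer_name": "Hayes", "account_number": "A-102"},
    {"customer_name": "Smith", "account_number": "A-201"},
]

# Original expression: selection applied on top of the full three-way join.
original = project(
    select(
        natural_join(branch, natural_join(account, depositor)),
        lambda t: t["branch_city"] == "Brooklyn" and t["balance"] > 1000,
    ),
    ["customer_name"],
)

# Transformed expression: selections first, then an early projection.
brooklyn = select(branch, lambda t: t["branch_city"] == "Brooklyn")
rich = select(account, lambda t: t["balance"] > 1000)
optimized = project(
    natural_join(project(natural_join(brooklyn, rich), ["account_number"]), depositor),
    ["customer_name"],
)
```

Agreement on one instance does not prove equivalence; the rules guarantee it for every legal instance, and the sketch only shows the two plans side by side.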

Join Order
- Consider the expression
  $\Pi_{customer\_name}((\sigma_{branch\_city = \text{"Brooklyn"}}(branch)) \bowtie (account \bowtie depositor))$
- We could compute $account \bowtie depositor$ first and join the result with $\sigma_{branch\_city = \text{"Brooklyn"}}(branch)$, but $account \bowtie depositor$ is likely to be a large relation
- Only a small fraction of the bank's customers are likely to have accounts in branches located in Brooklyn, so it is better to compute $\sigma_{branch\_city = \text{"Brooklyn"}}(branch) \bowtie account$ first

Catalog Information
- Stored in the DBMS catalog
  - $n_r$: number of tuples in a relation r
  - $b_r$: number of blocks containing tuples of r
  - $l_r$: size of a tuple of r
  - $f_r$: blocking factor of r, i.e., the number of tuples of r that fit into one block
  - $V(A, r)$: number of distinct values that appear in r for attribute A; same as the size of $\Pi_A(r)$
  - If tuples of r are stored together physically in a file, then $b_r = \lceil n_r / f_r \rceil$
- Maintained periodically
  - Maintaining exact information is too expensive!
- Statistics information in real-world systems
  - More information is kept
  - Histogram: attribute age can be divided into ranges 0-9, 10-19, ..., 90-99, 100+, with the count of tuples stored for each range
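
As a rough illustration of the catalog statistics above, the sketch below derives $n_r$, $b_r$ and $V(A, r)$ from an in-memory relation and builds the age histogram with the ranges named on the slide. The function names and the way the relation is represented are assumptions; a real DBMS gathers these numbers differently, often by sampling.

```python
from collections import Counter
from math import ceil

def catalog_stats(rows, attr, tuples_per_block):
    """Return n_r, b_r and V(attr, r) for a relation stored contiguously."""
    n_r = len(rows)
    return {
        "n_r": n_r,
        "b_r": ceil(n_r / tuples_per_block),     # b_r = ceil(n_r / f_r)
        "V": len({row[attr] for row in rows}),   # distinct values of attr
    }

def age_histogram(ages):
    """Tuple counts per range: bucket k covers 10k..10k+9, bucket 10 is 100+."""
    return Counter(min(age // 10, 10) for age in ages)

# Hypothetical relation with 25 tuples and 3 distinct ages.
people = [{"id": i, "age": [5, 42, 130][i % 3]} for i in range(25)]
stats = catalog_stats(people, "age", tuples_per_block=10)
hist = age_histogram([5, 9, 10, 42, 99, 100, 130])
```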

Obtaining Statistics in SQL Server
    CREATE STATISTICS statistics_name
    ON { table | view } ( column [ ,...n ] )
    [ WITH
        [ [ FULLSCAN | SAMPLE number { PERCENT | ROWS } ] [ , ] ]
        [ NORECOMPUTE ]
    ]
Examples:
    CREATE STATISTICS names
    ON Customers (CompanyName, ContactName)
    WITH SAMPLE 5 PERCENT

    CREATE STATISTICS anames
    ON authors (au_lname, au_fname)
    WITH SAMPLE 50 PERCENT

Selection Size Estimation
- Simple case: $\sigma_{A = v}(r)$
  - Under a uniform distribution, the estimated size is $n_r / V(A, r)$
  - $n_r$: number of tuples in relation r; $V(A, r)$: number of distinct values that appear in r for attribute A
- The uniform-distribution assumption often does not hold in real data sets
  - It is still a reasonable approximation in many cases and keeps the presentation relatively simple
  - Histogram information can be used for better estimates
- Estimation of a single comparison $\sigma_{A \le v}(r)$, with estimated size c
  - $c = 0$ if $v < \min(A, r)$
  - $c = n_r \cdot \dfrac{v - \min(A, r)}{\max(A, r) - \min(A, r)}$ otherwise
  - $\min(A, r)$ and $\max(A, r)$: the minimum and maximum values that appear in r for attribute A
  - If statistics information is unavailable, estimate c as $n_r / 2$
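
The estimates above translate directly into small functions. This is a sketch under the slide's uniform-distribution assumption; the function names are made up for illustration, and capping the range estimate at $n_r$ when v is at or above the maximum is an added assumption that the slide does not spell out.

```python
def estimate_equality(n_r, distinct_values):
    """Estimated size of sigma_{A = v}(r): n_r / V(A, r)."""
    return n_r / distinct_values

def estimate_at_most(n_r, v, min_a, max_a):
    """Estimated size of sigma_{A <= v}(r) from min(A, r) and max(A, r)."""
    if v < min_a:
        return 0.0
    if v >= max_a:
        # Assumption: every tuple qualifies once v reaches the maximum.
        return float(n_r)
    return n_r * (v - min_a) / (max_a - min_a)

def estimate_without_statistics(n_r):
    """Fallback when no statistics are available: n_r / 2."""
    return n_r / 2

# Hypothetical numbers: 10000 tuples, 50 distinct values, A between 0 and 10000.
eq_size = estimate_equality(10000, 50)
range_size = estimate_at_most(10000, 2500, 0, 10000)
```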
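
Estimation of Complex Cases
- Selectivity of a condition $\theta_i$: the probability that a tuple in the relation r satisfies $\theta_i$
  - If $s_i$ is the number of satisfying tuples in r, the selectivity of $\theta_i$ is $s_i / n_r$
- Conjunction $\sigma_{\theta_1 \wedge \theta_2 \wedge \ldots \wedge \theta_n}(r)$: $n_r \cdot \dfrac{s_1 \cdot s_2 \cdots s_n}{n_r^{\,n}}$
- Disjunction $\sigma_{\theta_1 \vee \theta_2 \vee \ldots \vee \theta_n}(r)$: $n_r \cdot \left(1 - \prod_{j=1}^{n} \left(1 - \dfrac{s_j}{n_r}\right)\right)$
- Negation $\sigma_{\neg\theta}(r)$: $n_r - \mathrm{size}(\sigma_{\theta}(r))$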
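
A short sketch of the three formulas above, assuming (as the conjunction and disjunction formulas do) that the conditions are independent. The function names and the sample numbers are illustrative assumptions.

```python
from math import prod

def conjunction_size(n_r, sizes):
    """n_r * (s_1 * ... * s_n) / n_r^n, written as a product of selectivities."""
    return n_r * prod(s / n_r for s in sizes)

def disjunction_size(n_r, sizes):
    """n_r * (1 - prod_j (1 - s_j / n_r))."""
    return n_r * (1 - prod(1 - s / n_r for s in sizes))

def negation_size(n_r, size_theta):
    """n_r - size(sigma_theta(r))."""
    return n_r - size_theta

# Hypothetical relation of 1000 tuples; theta_1 matches 500, theta_2 matches 200.
both = conjunction_size(1000, [500, 200])    # about 100
either = disjunction_size(1000, [500, 200])  # about 600
neither_first = negation_size(1000, 500)
```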

Join Size Estimation
- Cartesian product $r \times s$: $n_r \cdot n_s$ tuples
- $R \cap S = \emptyset$: $r \bowtie s$ is the same as $r \times s$
- $R \cap S$ is a key for R: $r \bowtie s$ has no more tuples than s
  - A tuple of s joins with at most one tuple from r
- $R \cap S$ in S is a foreign key in S referencing R: $r \bowtie s$ has the same number of tuples as s
- Example: consider the natural join $R \bowtie S \bowtie U$, with statistics for the three relations

    R(a,b)           S(b,c)           U(c,d)
    T(R) = 1000      T(S) = 2000      T(U) = 5000
    V(R,b) = 20      V(S,b) = 50      V(U,c) = 500
                     V(S,c) = 100

Example (Continued)
- Estimate for $(R \bowtie S) \bowtie U$
  - $T(R \bowtie S) = T(R) \cdot T(S) / \max(V(R,b), V(S,b)) = 1000 \cdot 2000 / 50 = 40000$
  - Join $R \bowtie S$ with U: $T(R \bowtie S) \cdot T(U) / \max(V(R \bowtie S, c), V(U,c))$
  - $V(R \bowtie S, c)$ is the same as $V(S,c) = 100$
  - The result is $40000 \cdot 5000 / \max(100, 500) = 400000$
- We could instead start by joining S and U first
- With this estimation rule, the estimate for a natural join is the same regardless of how we order the joins
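
The arithmetic in this example can be replayed in a few lines, including the alternative order that joins S and U first. The function name join_size is an assumption; the statistics are the ones given for R, S and U, and, as in the example, an attribute's distinct-value count is assumed to carry over unchanged into a join result.

```python
def join_size(t_left, t_right, v_left, v_right):
    """T(left) * T(right) / max(V(left, A), V(right, A)) for join attribute A."""
    return t_left * t_right / max(v_left, v_right)

T_R, T_S, T_U = 1000, 2000, 5000
V_R_b, V_S_b, V_S_c, V_U_c = 20, 50, 100, 500

# Order 1: (R join S) join U. V(R join S, c) is taken to be V(S, c).
rs = join_size(T_R, T_S, V_R_b, V_S_b)
rs_u = join_size(rs, T_U, V_S_c, V_U_c)

# Order 2: R join (S join U). V(S join U, b) is taken to be V(S, b).
su = join_size(T_S, T_U, V_S_c, V_U_c)
r_su = join_size(T_R, su, V_R_b, V_S_b)
```

Both orders give the same final estimate even though the intermediate results differ in size (40000 tuples for R with S versus 20000 for S with U), which is why join order still matters for cost.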

The Case of Non-Key
- Assume $R \cap S = \{A\}$ is not a key, and each value of A appears with equal probability
- If all tuples in r produce tuples in $r \bowtie s$
  - Each tuple t in r produces $\dfrac{n_s}{V(A, s)}$ tuples in $r \bowtie s$
  - Estimated size: $\dfrac{n_r \cdot n_s}{V(A, s)}$
- If all tuples in s produce tuples in $r \bowtie s$
  - Estimated size: $\dfrac{n_r \cdot n_s}{V(A, r)}$
- Choose the smaller one as the estimate: $\dfrac{n_r \cdot n_s}{\max\{V(A, r), V(A, s)\}}$
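
A quick numeric check of the reasoning above: taking the smaller of the two candidate estimates is the same as dividing by the larger distinct-value count. The statistics used here are hypothetical.

```python
def non_key_join_estimates(n_r, n_s, v_a_r, v_a_s):
    """Return both candidate estimates and the one finally chosen."""
    from_r = n_r * n_s / v_a_s   # every tuple of r finds n_s / V(A, s) partners
    from_s = n_r * n_s / v_a_r   # every tuple of s finds n_r / V(A, r) partners
    return from_r, from_s, min(from_r, from_s)

# Hypothetical statistics: n_r = 10000, n_s = 5000, V(A, r) = 200, V(A, s) = 50.
from_r, from_s, chosen = non_key_join_estimates(10000, 5000, 200, 50)
```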

Estimation for Other Operations
- Projection: estimated size of $\Pi_A(r)$ = $V(A, r)$
- Aggregation: estimated size of ${}_A g_F(r)$ = $V(A, r)$
- Outer join
  - Left outer join: size of $r ⟕ s$ = size of $r \bowtie s$ + size of r
  - Full outer join: size of $r ⟗ s$ = size of $r \bowtie s$ + size of r + size of s
  - Inaccurate; these are only upper bounds on the sizes
- Set operations
  - Unions/intersections of selections on the same relation: rewrite and use the size estimate for selections
    - e.g., $\sigma_{\theta_1}(r) \cap \sigma_{\theta_2}(r)$ can be rewritten as $\sigma_{\theta_1}(\sigma_{\theta_2}(r))$
  - Operations on different relations:
    - Estimated size of $r \cup s$ = size of r + size of s
    - Estimated size of $r \cap s$ = minimum of size of r and size of s
    - Estimated size of $r - s$ = size of r
    - Inaccurate; these are upper bounds on the sizes

Estimation of Number of Distinct Values for Selections $\sigma_{\theta}(r)$
- If $\theta$ forces A to take a specified value: $V(A, \sigma_{\theta}(r)) = 1$
- If $\theta$ forces A to take one of a specified set of values: $V(A, \sigma_{\theta}(r))$ = number of specified values
  - e.g., $(A = 1 \vee A = 3 \vee A = 4)$
- If the selection condition $\theta$ is of the form A op v: estimated $V(A, \sigma_{\theta}(r)) = V(A, r) \cdot s$, where s is the selectivity of the selection
- In all other cases: use the approximate estimate $\min(V(A, r), n_{\sigma_{\theta}(r)})$
- A more accurate estimate is feasible using probability theory, but the above generally works well

Enumeration of Equivalent Plans
- Use equivalence rules to systematically generate expressions equivalent to the given expression
  - For each expression found so far, apply all applicable equivalence rules and add newly generated expressions to the set of expressions found so far
  - Repeat until no more expressions can be found
- Reduce space cost: share common subexpressions
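
The generate-until-no-change loop above can be sketched for join expressions using only join commutativity and join associativity. This is a simplified illustration: expressions are nested Python tuples, the function names are assumptions, and a real optimizer would apply many more rules and share subexpressions instead of copying them.

```python
def rewrites(expr):
    """Yield every expression reachable by one rule application anywhere in expr."""
    if isinstance(expr, str):      # a base relation: nothing to rewrite
        return
    left, right = expr
    yield (right, left)            # commutativity: E1 join E2 = E2 join E1
    if isinstance(left, tuple):    # associativity: (E1 join E2) join E3 = E1 join (E2 join E3)
        e1, e2 = left
        yield (e1, (e2, right))
    for new_left in rewrites(left):
        yield (new_left, right)
    for new_right in rewrites(right):
        yield (left, new_right)

def enumerate_equivalent(expr):
    """Apply the rules repeatedly until no new expression is found."""
    found = {expr}
    frontier = [expr]
    while frontier:
        current = frontier.pop()
        for candidate in rewrites(current):
            if candidate not in found:
                found.add(candidate)
                frontier.append(candidate)
    return found

plans = enumerate_equivalent((("r1", "r2"), "r3"))
```

For three relations the closure contains 12 expressions: two tree shapes times the six orderings of the relations.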

A Locally Optimal Method
- Choose the best algorithm for each operator independently
  - The global effect may not be optimal
  - A merge join may be costlier than a hash join, but its sorted output enables fast algorithms for later operations, e.g., duplicate elimination, intersection, or another merge join
- All possible algorithms should be considered; two approaches:
  - Consider all the possible plans and choose the best one in a cost-based fashion
  - Use heuristics to choose a plan
- Practical systems incorporate elements from both approaches

Cost-Based Optimization
- Generate a range of query-evaluation plans using the equivalence rules
- Choose the one with the least cost
- For a complex query, the number of different possible query plans can be large
  - For $r_1 \bowtie r_2 \bowtie \ldots \bowtie r_n$, there are $(2(n-1))! / (n-1)!$ different join orders!
- No need to generate all the join orders
  - Using dynamic programming, the least-cost join order for any subset of $\{r_1, r_2, \ldots, r_n\}$ is computed only once and stored for future use
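
To get a feel for how quickly the number of join orders grows, the formula above can be evaluated directly. The function name is an assumption made for this sketch.

```python
from math import factorial

def join_order_count(n):
    """Number of join orders for n relations: (2(n-1))! / (n-1)!."""
    return factorial(2 * (n - 1)) // factorial(n - 1)

counts = {n: join_order_count(n) for n in (3, 5, 7, 10)}
```

With 3 relations there are 12 orders, with 7 there are 665280, and with 10 the count already exceeds 17.6 billion.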

Find the Best Plan
- Time complexity: $O(3^n)$
- Space complexity: $O(2^n)$

    procedure findbestplan(S)
        if (bestplan[S].cost ≠ ∞)
            return bestplan[S]    // already computed earlier
        // else bestplan[S] has not been computed earlier, compute it now
        if (S contains only one relation)
            // base case, needed so that the recursion terminates
            set bestplan[S].plan and bestplan[S].cost to the best way of accessing S
        else for each non-empty proper subset S1 of S
            P1 = findbestplan(S1)
            P2 = findbestplan(S - S1)
            A = best algorithm for joining results of P1 and P2
            cost = P1.cost + P2.cost + cost of A
            if cost < bestplan[S].cost
                bestplan[S].cost = cost
                bestplan[S].plan = "execute P1.plan; execute P2.plan;
                                    join results of P1 and P2 using A"
        return bestplan[S]
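
A runnable sketch of the same dynamic program follows. It is not the course's implementation: the cost model is passed in as a hypothetical join_cost function, accessing a single base relation is assumed to be free, and memoization through lru_cache plays the role of the bestplan table.

```python
from functools import lru_cache
from itertools import combinations
from math import prod

def find_best_plan(relations, join_cost):
    """Return (cost, plan) for joining all relations; plan is a nested tuple."""

    @lru_cache(maxsize=None)            # bestplan[S] is computed only once
    def best(s):
        if len(s) == 1:
            (name,) = s
            return 0.0, name            # assumption: base access costs nothing
        best_cost, best_plan = float("inf"), None
        members = sorted(s)
        for k in range(1, len(members)):
            for left in combinations(members, k):   # non-empty proper subsets
                s1 = frozenset(left)
                s2 = s - s1
                cost1, plan1 = best(s1)
                cost2, plan2 = best(s2)
                cost = cost1 + cost2 + join_cost(s1, s2)
                if cost < best_cost:
                    best_cost, best_plan = cost, (plan1, plan2)
        return best_cost, best_plan

    return best(frozenset(relations))

# Hypothetical cost model: a join costs |left| * |right|, where the size of an
# intermediate result is taken to be the product of its base relation sizes.
sizes = {"a": 10, "b": 100, "c": 1000}

def size(s):
    return prod(sizes[name] for name in s)

cost, plan = find_best_plan(sizes, lambda s1, s2: size(s1) * size(s2))
```

Under this toy cost model the cheapest plan joins the two smallest relations, a and b, first and brings in c last.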

Heuristic Optimization
1. Deconstruct conjunctive selections into a sequence of single selection operations
2. Move selection operations down the query tree for the earliest possible execution
3. Execute first those selection and join operations that will produce the smallest relations
4. Replace Cartesian product operations that are followed by a selection condition with join operations
5. Deconstruct lists of projection attributes and move them as far down the tree as possible, creating new projections where needed
6. Identify those subtrees whose operations can be pipelined, and execute them using pipelining

Left-Deep Join Order
- Only consider those join orders where the right operand of each join is one of the initial relations
- Convenient for pipelined evaluation: only one input to each join is pipelined
- If only left-deep join orders are considered, the time complexity of enumerating them is $O(n!)$
  - With dynamic programming, the complexity is $O(n \, 2^n)$
- For an n-way join, consider n sets of left-deep join orders such that each set starts with a different one of the n relations
[Figure: a left-deep join tree and a non-left-deep join tree]