Evaluation of Expressions




Query Optimization

Evaluation of Expressions
- Materialization: evaluate one operation at a time and materialize each intermediate result for subsequent use
  - Always applicable
  - Cost = sum of the costs of the individual operations + cost of writing intermediate results to disk, which may be high
- Pipelining: evaluate several operations simultaneously, with no need to store temporary results
  - May not be applicable in some situations
- Double buffering: use two output buffers for each operation; when one is full, write it to disk while the other is being filled
  - Overlaps disk writes with computation and reduces execution time
CMPT 454: Database II -- Query Optimization

Pipelining
- Demand driven (pulling data up): the system keeps making requests for tuples from the operation at the top of the pipeline
- Producer driven (pushing data up): operations generate tuples eagerly into output buffers until they are full
- Iterators: many operations are active at once, which gives pipelining
  - Iterator interface: Hasnext(), Next()
[Figure: pipelined evaluation of a projection on balance over σ balance<2500 (account); the selection uses index 1]
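The demand-driven style can be illustrated with Python generators. This is only a sketch under assumptions: the operator names `scan`, `select`, and `project` and the sample `account` rows are made up for illustration, and Python's `next()` with `StopIteration` stands in for the Hasnext()/Next() pair on the slide.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]

def scan(table: Iterable[Row]) -> Iterator[Row]:
    """Leaf operator: hands out stored tuples one at a time."""
    for row in table:
        yield row

def select(child: Iterator[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Pulls from its child only when its own consumer asks for a tuple."""
    for row in child:
        if predicate(row):
            yield row

def project(child: Iterator[Row], attrs: List[str]) -> Iterator[Row]:
    """Keeps only the requested attributes of each tuple."""
    for row in child:
        yield {a: row[a] for a in attrs}

# Hypothetical sample data, not taken from the slides.
account = [
    {"account_number": "A-101", "branch_name": "Downtown", "balance": 500},
    {"account_number": "A-215", "branch_name": "Mianus", "balance": 700},
    {"account_number": "A-102", "branch_name": "Perryridge", "balance": 4000},
]

# Build the pipeline; nothing is evaluated until a tuple is requested.
plan = project(select(scan(account), lambda r: r["balance"] < 2500), ["balance"])
first = next(plan)   # one request at the top pulls one tuple through all operators
rest = list(plan)    # drain the remaining tuples
```

No intermediate relation is ever stored: each call at the top of the pipeline triggers just enough work in the operators below to produce one tuple.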

Query Optimization Example
- Query: find the names of all customers who have an account at any branch located in Brooklyn
- [Figure: relational expression tree for the query, before and after pushing the selection down]

Steps in Query Optimization
- Input: an expression
- Generate expressions that are logically equivalent to the given expression
  - Using equivalence rules
- Estimate the cost of each evaluation plan
  - Using statistical information about the relations, e.g., size, indices, etc.
- Annotate the resultant expressions in alternative ways to generate alternative query-evaluation plans
- Cost-based optimization: find the query plan of the lowest cost

Equivalence Rules
- Two relational algebra expressions are said to be equivalent if, on every legal database instance, the two expressions generate the same set of tuples
- Two expressions in the multiset version of the relational algebra are said to be equivalent if, on every legal database instance, the two expressions generate the same multiset of tuples
  - Order of tuples does not matter
- An equivalence rule says that expressions of two forms are equivalent
  - We can replace one form by the other
- Rules:
  1. $\sigma_{\theta_1 \wedge \theta_2}(E) = \sigma_{\theta_1}(\sigma_{\theta_2}(E))$
  2. $\sigma_{\theta_1}(\sigma_{\theta_2}(E)) = \sigma_{\theta_2}(\sigma_{\theta_1}(E))$
  3. $\Pi_{L_1}(\Pi_{L_2}(\ldots(\Pi_{L_n}(E))\ldots)) = \Pi_{L_1}(E)$
  4a. $\sigma_{\theta}(E_1 \times E_2) = E_1 \bowtie_{\theta} E_2$
  4b. $\sigma_{\theta_1}(E_1 \bowtie_{\theta_2} E_2) = E_1 \bowtie_{\theta_1 \wedge \theta_2} E_2$

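More Equivalence Rules
  5. $E_1 \bowtie_{\theta} E_2 = E_2 \bowtie_{\theta} E_1$
  6a. $(E_1 \bowtie E_2) \bowtie E_3 = E_1 \bowtie (E_2 \bowtie E_3)$
  6b. $(E_1 \bowtie_{\theta_1} E_2) \bowtie_{\theta_2 \wedge \theta_3} E_3 = E_1 \bowtie_{\theta_1 \wedge \theta_3} (E_2 \bowtie_{\theta_2} E_3)$, where $\theta_2$ involves attributes from only $E_2$ and $E_3$
  7a. $\sigma_{\theta_0}(E_1 \bowtie_{\theta} E_2) = (\sigma_{\theta_0}(E_1)) \bowtie_{\theta} E_2$, when $\theta_0$ involves only attributes of $E_1$
  7b. $\sigma_{\theta_1 \wedge \theta_2}(E_1 \bowtie_{\theta} E_2) = (\sigma_{\theta_1}(E_1)) \bowtie_{\theta} (\sigma_{\theta_2}(E_2))$, when $\theta_1$ involves only attributes of $E_1$ and $\theta_2$ only attributes of $E_2$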

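More Equivalence Rules
  8a. $\Pi_{L_1 \cup L_2}(E_1 \bowtie_{\theta} E_2) = (\Pi_{L_1}(E_1)) \bowtie_{\theta} (\Pi_{L_2}(E_2))$, when $\theta$ involves only attributes in $L_1 \cup L_2$
  8b. $\Pi_{L_1 \cup L_2}(E_1 \bowtie_{\theta} E_2) = \Pi_{L_1 \cup L_2}((\Pi_{L_1 \cup L_3}(E_1)) \bowtie_{\theta} (\Pi_{L_2 \cup L_4}(E_2)))$, where $L_3$ and $L_4$ are the attributes of $E_1$ and $E_2$, respectively, that appear in $\theta$ but not in $L_1 \cup L_2$
  9. $E_1 \cup E_2 = E_2 \cup E_1$ and $E_1 \cap E_2 = E_2 \cap E_1$
  10. $(E_1 \cup E_2) \cup E_3 = E_1 \cup (E_2 \cup E_3)$ and $(E_1 \cap E_2) \cap E_3 = E_1 \cap (E_2 \cap E_3)$
  11. $\sigma_{\theta}(E_1 - E_2) = \sigma_{\theta}(E_1) - \sigma_{\theta}(E_2)$, and similarly for $\cup$ and $\cap$ in place of $-$
      $\sigma_{\theta}(E_1 - E_2) = \sigma_{\theta}(E_1) - E_2$, and similarly for $\cap$ in place of $-$, but not for $\cup$
  12. $\Pi_{L}(E_1 \cup E_2) = (\Pi_{L}(E_1)) \cup (\Pi_{L}(E_2))$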

Example
- Find the names of all customers with an account at a Brooklyn branch whose account balance is over $1000:
  $\Pi_{customer\_name}(\sigma_{branch\_city = \text{"Brooklyn"} \wedge balance > 1000}(branch \bowtie (account \bowtie depositor)))$
- Transformation using join associativity (Rule 6a):
  $\Pi_{customer\_name}((\sigma_{branch\_city = \text{"Brooklyn"} \wedge balance > 1000}(branch \bowtie account)) \bowtie depositor)$
- The second form provides an opportunity to apply the "perform selections early" rule, resulting in the subexpression
  $\sigma_{branch\_city = \text{"Brooklyn"}}(branch) \bowtie \sigma_{balance > 1000}(account)$
- When we compute $\sigma_{branch\_city = \text{"Brooklyn"}}(branch) \bowtie account$, we obtain a relation whose schema is (branch_name, branch_city, assets, account_number, balance)
- Push projections and eliminate unneeded attributes from intermediate results to get
  $\Pi_{customer\_name}((\Pi_{account\_number}(\sigma_{branch\_city = \text{"Brooklyn"}}(branch) \bowtie account)) \bowtie depositor)$
- Performing the projection as early as possible reduces the size of the relation to be joined
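The equivalence of the first form and the form with selections pushed down can be checked on a toy instance. The sketch below is an illustration only: the helper functions and the sample rows are assumptions, and checking one instance does not prove the rule, which must hold on every legal database instance.

```python
def natural_join(r, s):
    """Naive natural join of two relations given as lists of dicts."""
    out = []
    for t in r:
        for u in s:
            common = t.keys() & u.keys()
            if all(t[a] == u[a] for a in common):
                out.append({**t, **u})
    return out

def select(r, pred):
    return [t for t in r if pred(t)]

def project(r, attrs):
    """Projection with duplicate elimination (set semantics)."""
    seen = []
    for t in r:
        row = {a: t[a] for a in attrs}
        if row not in seen:
            seen.append(row)
    return seen

# Hypothetical sample instance, not taken from the slides.
branch = [
    {"branch_name": "Brighton", "branch_city": "Brooklyn", "assets": 7100000},
    {"branch_name": "Downtown", "branch_city": "Brooklyn", "assets": 9000000},
    {"branch_name": "Perryridge", "branch_city": "Horseneck", "assets": 1700000},
]
account = [
    {"account_number": "A-101", "branch_name": "Downtown", "balance": 500},
    {"account_number": "A-201", "branch_name": "Brighton", "balance": 1500},
    {"account_number": "A-102", "branch_name": "Perryridge", "balance": 4000},
    {"account_number": "A-217", "branch_name": "Brighton", "balance": 750},
]
depositor = [
    {"customer_name": "Johnson", "account_number": "A-101"},
    {"customer_name": "Smith", "account_number": "A-201"},
    {"customer_name": "Hayes", "account_number": "A-102"},
    {"customer_name": "Jones", "account_number": "A-217"},
]

def in_brooklyn(t):
    return t["branch_city"] == "Brooklyn"

def rich(t):
    return t["balance"] > 1000

def names(r):
    return sorted(t["customer_name"] for t in project(r, ["customer_name"]))

# Selection applied after joining everything.
original = names(select(natural_join(branch, natural_join(account, depositor)),
                        lambda t: in_brooklyn(t) and rich(t)))

# Selections pushed down to the base relations before joining.
pushed = names(natural_join(
    natural_join(select(branch, in_brooklyn), select(account, rich)),
    depositor))
```

Both plans return the same customer names here, but the second one joins far fewer tuples.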


Join Order
- Consider the expression
  $\Pi_{customer\_name}((\sigma_{branch\_city = \text{"Brooklyn"}}(branch)) \bowtie (account \bowtie depositor))$
- We could compute $account \bowtie depositor$ first and join the result with $\sigma_{branch\_city = \text{"Brooklyn"}}(branch)$, but $account \bowtie depositor$ is likely to be a large relation
- Only a small fraction of the bank's customers are likely to have accounts in branches located in Brooklyn, so it is better to compute $\sigma_{branch\_city = \text{"Brooklyn"}}(branch) \bowtie account$ first

Catalog Information
- Stored in the DBMS catalog
  - $n_r$: number of tuples in a relation r
  - $b_r$: number of blocks containing tuples of r
  - $l_r$: size of a tuple of r
  - $f_r$: blocking factor of r, i.e., the number of tuples of r that fit into one block
  - $V(A, r)$: number of distinct values that appear in r for attribute A; same as the size of $\Pi_A(r)$
  - If tuples of r are stored together physically in a file, then $b_r = \lceil n_r / f_r \rceil$
- Maintained periodically
  - Maintaining exact information is too expensive!
- Statistics information in real-world systems
  - More information is kept, e.g., histograms
  - Attribute age can be divided into ranges of 0-9, 10-19, ..., 90-99, 100+
  - The histogram stores the count of tuples in each range
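As a rough illustration of how these statistics could be derived, the sketch below computes $n_r$, $f_r$, $b_r$, and $V(A, r)$ for an in-memory relation and builds the age histogram described above. The function names, the block size, and the tuple size are assumptions chosen for the example, not values from the slides.

```python
import math
from collections import Counter

def catalog_stats(rows, attr, block_size_bytes, tuple_size_bytes):
    """Compute basic catalog statistics for a relation stored contiguously."""
    n_r = len(rows)
    f_r = block_size_bytes // tuple_size_bytes   # tuples that fit into one block
    b_r = math.ceil(n_r / f_r)                   # b_r = ceil(n_r / f_r)
    v = len({row[attr] for row in rows})         # V(attr, r)
    return {"n_r": n_r, "f_r": f_r, "b_r": b_r, "V": v}

def age_histogram(ages, width=10, cap=100):
    """Count values per range 0-9, 10-19, ..., 90-99, 100+."""
    counts = Counter()
    for age in ages:
        lo = min(age // width * width, cap)
        label = f"{cap}+" if lo >= cap else f"{lo}-{lo + width - 1}"
        counts[label] += 1
    return dict(counts)

# Hypothetical data: 600 tuples of 100 bytes each in 4096-byte blocks.
ages = [3, 15, 17, 42, 42, 101]
people = [{"age": a} for a in ages] * 100

stats = catalog_stats(people, "age", block_size_bytes=4096, tuple_size_bytes=100)
hist = age_histogram(ages)
```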

Obtaining Statistics in SQL Server
- Syntax:

    CREATE STATISTICS statistics_name
    ON { table | view } ( column [ ,...n ] )
    [ WITH
        [ [ FULLSCAN | SAMPLE number { PERCENT | ROWS } ] [ , ] ]
        [ NORECOMPUTE ]
    ]

- Examples:

    CREATE STATISTICS names ON Customers (CompanyName, ContactName) WITH SAMPLE 5 PERCENT
    CREATE STATISTICS anames ON authors (au_lname, au_fname) WITH SAMPLE 50 PERCENT

Selection Size Estimation
- Simple case: $\sigma_{A = v}(r)$
  - Under a uniform distribution, the estimated size is $n_r / V(A, r)$
  - $n_r$: number of tuples in relation r; $V(A, r)$: number of distinct values that appear in r for attribute A
  - The uniform distribution often does not hold in real data sets, but it is a reasonable approximation in many cases and keeps the presentation relatively simple
  - Histogram information can be used for better estimates
- Estimation of a single comparison $\sigma_{A \le v}(r)$, with estimated size c
  - $c = 0$ if $v < \min(A, r)$
  - $c = n_r \cdot \dfrac{v - \min(A, r)}{\max(A, r) - \min(A, r)}$ otherwise
  - $\min(A, r)$ and $\max(A, r)$: the minimum and maximum values that appear in r for attribute A
  - If statistics information is unavailable, estimate c as $n_r / 2$
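The two estimates can be written as small functions. This is a sketch with hypothetical function names; the branch that returns $n_r$ when $v \ge \max(A, r)$ is not on the slide and is added here as an assumption so that the estimate never exceeds the relation size.

```python
def estimate_equality(n_r, distinct_values):
    """Estimated size of sigma_{A = v}(r) under the uniform-distribution assumption."""
    return n_r / distinct_values

def estimate_less_equal(n_r, v, min_a=None, max_a=None):
    """Estimated size of sigma_{A <= v}(r)."""
    if min_a is None or max_a is None:
        return n_r / 2          # no statistics available
    if v < min_a:
        return 0
    if v >= max_a:
        return n_r              # assumption: cap at the relation size (not on the slide)
    return n_r * (v - min_a) / (max_a - min_a)

# Hypothetical numbers: 10,000 accounts, 50 distinct branches, balances in [0, 10000].
eq = estimate_equality(10_000, 50)
le = estimate_less_equal(10_000, 2_500, min_a=0, max_a=10_000)
unknown = estimate_less_equal(10_000, 2_500)
```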

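Estimation of Complex Cases
- Selectivity of a condition $\theta_i$: the probability that a tuple in the relation r satisfies $\theta_i$
  - If $s_i$ is the number of satisfying tuples in r, the selectivity of $\theta_i$ is $s_i / n_r$
- Conjunction $\sigma_{\theta_1 \wedge \theta_2 \wedge \ldots \wedge \theta_n}(r)$: estimated size $n_r \cdot \dfrac{s_1 \cdot s_2 \cdots s_n}{n_r^{\,n}}$
- Disjunction $\sigma_{\theta_1 \vee \theta_2 \vee \ldots \vee \theta_n}(r)$: estimated size $n_r \cdot \left(1 - \prod_{j=1}^{n}\left(1 - \dfrac{s_j}{n_r}\right)\right)$
- Negation $\sigma_{\neg\theta}(r)$: estimated size $n_r - \mathrm{size}(\sigma_{\theta}(r))$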
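A short sketch of the three formulas, assuming the individual conditions are independent (which is what the product form implies). The function names and the sample numbers are illustrative only.

```python
from math import prod

def conjunction_size(n_r, sizes):
    """n_r * (s_1 * ... * s_n) / n_r^n, written as a product of selectivities."""
    return n_r * prod(s / n_r for s in sizes)

def disjunction_size(n_r, sizes):
    """n_r * (1 - prod(1 - s_j / n_r)): a tuple fails only if it fails every condition."""
    return n_r * (1 - prod(1 - s / n_r for s in sizes))

def negation_size(n_r, size_theta):
    """n_r - size(sigma_theta(r))."""
    return n_r - size_theta

# Hypothetical relation of 1000 tuples; theta_1 matches 500 tuples, theta_2 matches 200.
both = conjunction_size(1000, [500, 200])
either = disjunction_size(1000, [500, 200])
neither_first = negation_size(1000, 500)
```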

Join Size Estimation
- Cartesian product $r \times s$: $n_r \cdot n_s$ tuples
- $R \cap S = \emptyset$: $r \bowtie s$ is the same as $r \times s$
- $R \cap S$ is a key for R: the number of tuples in $r \bowtie s$ is no greater than the number of tuples in s
  - A tuple of s will join with at most one tuple from r
- $R \cap S$ in S is a foreign key in S referencing R: $r \bowtie s$ has the same number of tuples as s
- Example: consider the natural join $R \bowtie S \bowtie U$, with these statistics for the three relations

    R(a,b)          S(b,c)          U(c,d)
    T(R) = 1000     T(S) = 2000     T(U) = 5000
    V(R,b) = 20     V(S,b) = 50     V(U,c) = 500
                    V(S,c) = 100

Example (Continued)
- Estimate for $(R \bowtie S) \bowtie U$
  - $T(R \bowtie S)$ is $T(R)\,T(S) / \max(V(R,b), V(S,b))$
  - $T(R \bowtie S) = 1000 \cdot 2000 / 50 = 40000$
- Join $R \bowtie S$ with U
  - $T(R \bowtie S)\,T(U) / \max(V(R \bowtie S, c), V(U, c))$
  - $V(R \bowtie S, c)$ is the same as $V(S, c)$ (= 100)
  - The result is $40000 \cdot 5000 / \max(100, 500) = 400000$
- We could also start by joining S and U first
  - The estimate of any natural join is the same, regardless of how we order the joins
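The arithmetic for both join orders can be replayed in a few lines. The helper name `join_size` is hypothetical; the statistics are the ones given for R, S, and U, and the sketch assumes, as the example does, that a join preserves the distinct-value count of the non-join attribute (so $V(S \bowtie U, b) = V(S, b)$).

```python
def join_size(t_left, t_right, v_left, v_right):
    """T(left) * T(right) / max(V(left, A), V(right, A)) for join attribute A."""
    return t_left * t_right / max(v_left, v_right)

T = {"R": 1000, "S": 2000, "U": 5000}
V = {("R", "b"): 20, ("S", "b"): 50, ("S", "c"): 100, ("U", "c"): 500}

# Order 1: (R join S) join U
rs = join_size(T["R"], T["S"], V[("R", "b")], V[("S", "b")])
rs_u = join_size(rs, T["U"], V[("S", "c")], V[("U", "c")])

# Order 2: R join (S join U)
su = join_size(T["S"], T["U"], V[("S", "c")], V[("U", "c")])
r_su = join_size(T["R"], su, V[("R", "b")], V[("S", "b")])
```

The intermediate sizes differ (40000 versus 20000 tuples), which is what matters for cost, while the final estimate is the same for both orders.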

The Case of Non-Key
- The join attribute A ($R \cap S = \{A\}$) is not a key for R or S
- Assume each value appears with equal probability
- Each tuple t in r produces $\dfrac{n_s}{V(A, s)}$ tuples in $r \bowtie s$
  - All tuples in r together produce $\dfrac{n_r \cdot n_s}{V(A, s)}$ tuples in $r \bowtie s$
- Symmetrically, all tuples in s together produce $\dfrac{n_r \cdot n_s}{V(A, r)}$ tuples in $r \bowtie s$
- Choose the smaller one as the estimate: $\dfrac{n_r \cdot n_s}{\max\{V(A, r), V(A, s)\}}$

Estimation for Other Operations
- Projection: estimated size of $\Pi_A(r)$ = $V(A, r)$
- Aggregation: estimated size of ${}_A\mathcal{G}_F(r)$ = $V(A, r)$
- Outer join
  - Left outer join: size of $r \bowtie s$ + size of r
  - Full outer join: size of $r \bowtie s$ + size of r + size of s
  - Inaccurate: these are only upper bounds on the sizes
- Set operations
  - Unions/intersections of selections on the same relation: rewrite and use the size estimate for selections
    - $\sigma_{\theta_1}(r) \cap \sigma_{\theta_2}(r)$ can be rewritten as $\sigma_{\theta_1}(\sigma_{\theta_2}(r))$
  - Operations on different relations:
    - Estimated size of $r \cup s$ = size of r + size of s
    - Estimated size of $r \cap s$ = minimum of size of r and size of s
    - Estimated size of $r - s$ = size of r
    - Inaccurate: these are upper bounds on the sizes

Estimation of Number of Distinct Values
- For selections $\sigma_{\theta}(r)$
  - If $\theta$ forces A to take a specified value: $V(A, \sigma_{\theta}(r)) = 1$
  - If $\theta$ forces A to take one of a specified set of values: $V(A, \sigma_{\theta}(r))$ = number of specified values
    - e.g., $(A = 1 \vee A = 3 \vee A = 4)$
  - If the selection condition $\theta$ is of the form A op v, where op is a comparison operator: estimated $V(A, \sigma_{\theta}(r)) = V(A, r) \cdot s$, where s is the selectivity of the selection
  - In all other cases: use the approximate estimate $\min(V(A, r), n_{\sigma_{\theta}(r)})$
  - A more accurate estimate is feasible using probability theory, but the above generally works well

Enumeration of Equivalent Plans
- Use equivalence rules to systematically generate expressions equivalent to the given expression
  - For each expression found so far, apply all applicable equivalence rules and add newly generated expressions to the set of expressions found so far
  - Repeat until no more expressions can be found
- Reduce the space cost: share common subexpressions
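The repeat-until-no-change loop can be sketched for pure join expressions, restricted to two rules: join commutativity and join associativity. The representation (a relation name as a string, a join as a pair) and the function names are assumptions for illustration; a real optimizer applies many more rules and shares subexpressions instead of copying trees.

```python
def rewrites(expr):
    """Yield every expression reachable from expr by one rule application."""
    if isinstance(expr, str):          # a base relation has no rewrites
        return
    left, right = expr
    yield (right, left)                # commutativity: E1 join E2 = E2 join E1
    if isinstance(left, tuple):        # associativity: (E1 join E2) join E3 = E1 join (E2 join E3)
        a, b = left
        yield (a, (b, right))
    for new_left in rewrites(left):    # apply rules inside the left subexpression
        yield (new_left, right)
    for new_right in rewrites(right):  # apply rules inside the right subexpression
        yield (left, new_right)

def equivalent_expressions(expr):
    """Apply the rules until no new expression is found."""
    found = {expr}
    frontier = [expr]
    while frontier:
        current = frontier.pop()
        for candidate in rewrites(current):
            if candidate not in found:
                found.add(candidate)
                frontier.append(candidate)
    return found

three_way = equivalent_expressions((("r1", "r2"), "r3"))
four_way = equivalent_expressions(((("r1", "r2"), "r3"), "r4"))
```

Even for this tiny rule set the number of generated expressions grows quickly with the number of relations, which is why the space cost matters.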

A Local Optimal Method
- Choose the best algorithm for each operator
  - The global effect may not be optimal
  - A merge join may be costlier than a hash join, but its sorted output enables fast algorithms for later operations, e.g., duplicate elimination, intersection, or another merge join
- All possible algorithms should be considered
  - Consider all the possible plans and choose the best one in a cost-based fashion
  - Use heuristics to choose a plan
- Practical systems incorporate elements from both approaches

Cost-Based Optimization
- Generate a range of query-evaluation plans using the equivalence rules
  - Choose the one with the least cost
- For a complex query, the number of different possible query plans can be large
  - For $r_1 \bowtie r_2 \bowtie \ldots \bowtie r_n$, there are $(2(n-1))! / (n-1)!$ different join orders!
- No need to generate all the join orders
  - Using dynamic programming, the least-cost join order for any subset of $\{r_1, r_2, \ldots, r_n\}$ is computed only once and stored for future use
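To get a feel for how fast the number of join orders grows, the formula can be evaluated directly. The function name is hypothetical; the formula is the one stated above.

```python
from math import factorial

def join_orders(n):
    """Number of join orders for n relations: (2(n-1))! / (n-1)!"""
    return factorial(2 * (n - 1)) // factorial(n - 1)

counts = {n: join_orders(n) for n in (3, 5, 7, 10)}
```

For 10 relations there are already more than 17 billion orders, while a dynamic-programming table only needs one entry per non-empty subset, i.e. $2^{10} - 1 = 1023$ entries.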

Find the Best Plan
- Time complexity: $O(3^n)$
- Space complexity: $O(2^n)$

    procedure findbestplan(S)
        if (bestplan[S].cost ≠ ∞)
            return bestplan[S]
        // else bestplan[S] has not been computed earlier, compute it now
        if (S contains only one relation)
            set bestplan[S].plan and bestplan[S].cost based on the best way of accessing S
        else for each non-empty proper subset S1 of S
            P1 = findbestplan(S1)
            P2 = findbestplan(S - S1)
            A = best algorithm for joining results of P1 and P2
            cost = P1.cost + P2.cost + cost of A
            if cost < bestplan[S].cost
                bestplan[S].cost = cost
                bestplan[S].plan = "execute P1.plan; execute P2.plan;
                                    join results of P1 and P2 using A"
        return bestplan[S]
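A runnable sketch of the same dynamic program is given below. Several parts are assumptions made only for illustration: the cost of a join is taken to be the estimated size of its result, accessing a single relation costs nothing, and `join_divisor` maps a pair of relations to the max-of-distinct-values divisor from the join size estimate (pairs without an entry form a Cartesian product). Memoization via `lru_cache` plays the role of the `bestplan` table.

```python
from functools import lru_cache
from itertools import combinations

def find_best_plan(sizes, join_divisor):
    """Return (cost, plan) for joining all relations in `sizes`.

    sizes: dict relation name -> number of tuples.
    join_divisor: dict frozenset({a, b}) -> max(V(A, a), V(A, b)).
    """
    def result_size(rels):
        size = 1.0
        for r in rels:
            size *= sizes[r]
        for a, b in combinations(sorted(rels), 2):
            size /= join_divisor.get(frozenset((a, b)), 1)
        return size

    @lru_cache(maxsize=None)            # each subset is solved once and stored
    def best(rels):
        if len(rels) == 1:              # base case: a single relation
            (r,) = rels
            return 0.0, r
        best_cost, best_plan = float("inf"), None
        members = sorted(rels)
        for k in range(1, len(members)):
            for left in combinations(members, k):   # every non-empty proper subset
                s1 = frozenset(left)
                cost1, plan1 = best(s1)
                cost2, plan2 = best(rels - s1)
                cost = cost1 + cost2 + result_size(rels)
                if cost < best_cost:
                    best_cost, best_plan = cost, (plan1, plan2)
        return best_cost, best_plan

    return best(frozenset(sizes))

# Statistics of R, S, U from the join size example.
sizes = {"R": 1000, "S": 2000, "U": 5000}
join_divisor = {frozenset(("R", "S")): 50, frozenset(("S", "U")): 500}
cost, plan = find_best_plan(sizes, join_divisor)
```

Under this toy cost model the optimizer prefers to join S and U first, because that intermediate result (20000 tuples) is smaller than the result of joining R and S (40000 tuples).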

Heuristic Optimization
- Deconstruct conjunctive selections into a sequence of single selection operations
- Move selection operations down the query tree for the earliest possible execution
- Execute first those selection and join operations that will produce the smallest relations
- Replace Cartesian product operations that are followed by a selection condition by join operations
- Deconstruct lists of projection attributes and move them as far down the tree as possible, creating new projections where needed
- Identify those subtrees whose operations can be pipelined, and execute them using pipelining

Left-Deep Join Order
- Only consider those join orders where the right operand of each join is one of the initial relations
  - Convenient for pipelined evaluation: only one input to each join is pipelined
- If only left-deep join orders are considered, the time complexity of enumerating them is $O(n!)$
  - With dynamic programming, the complexity is $O(n\,2^n)$
- For an n-way join, consider n sets of left-deep join orders such that each set starts with a different one of the n relations
[Figure: a left-deep join tree and a non-left-deep join tree]