CPSC 421 Database Management Systems. Lecture 18: Query Optimization


* Some material adapted from R. Ramakrishnan, L. Delcambre, and B. Ludaescher
CPSC 421, 2009

Agenda
  No quiz!
  More on Query Optimization

On the fly
  In on-the-fly evaluation, each operator is performed in memory while the tuple is available (streaming!)
  On-the-fly evaluation induces no I/O cost!
  Note: The relation on the left is assumed to be the outer relation

Limitations of On the fly
  Can only happen if:
    Computation can be done entirely on tuples in memory
    Results do not need to be materialized
  Cannot be used for the first table scan
    Someone has to read the table
  Cannot be used with all operations
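The streaming idea can be sketched with Python generators: each operator pulls one tuple at a time from its input and passes it on without materializing anything. This is only an illustration, not how any particular DBMS implements it; the function names and the three sample rows are made up for the example.

```python
from typing import Callable, Iterable, Iterator, List

Row = dict

def scan(table: Iterable[Row]) -> Iterator[Row]:
    # Stand-in for a table scan. In a real DBMS this is where page I/O occurs,
    # which is why the first scan cannot be "on the fly".
    for row in table:
        yield row

def select(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    # Selection on the fly: test each tuple while it is in memory.
    for row in rows:
        if predicate(row):
            yield row

def project(rows: Iterable[Row], columns: List[str]) -> Iterator[Row]:
    # Projection on the fly: keep only the requested columns of each tuple.
    for row in rows:
        yield {c: row[c] for c in columns}

# Hypothetical sample data, for illustration only.
sailors = [
    {"sid": 1, "sname": "Dustin", "rating": 7},
    {"sid": 2, "sname": "Lubber", "rating": 8},
    {"sid": 3, "sname": "Rusty", "rating": 3},
]

# Operators are chained; no intermediate result is stored.
plan = project(select(scan(sailors), lambda r: r["rating"] > 5), ["sname"])
result = list(plan)
print(result)
```

Only the final `list(plan)` call drives the pipeline; until then no tuple has been read.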

Cost of Plan 1
  Assume:
    No index
    Sailors is the inner relation

  SELECT S.sname
  FROM Reserves R, Sailors S
  WHERE R.sid = S.sid AND R.bid = 100 AND S.rating > 5;

  Plan (top to bottom):
    π sname                  (on the fly)
    σ bid=100 ∧ rating>5     (on the fly)
    ⋈ sid=sid                (Page NL Join)
    Reserves (outer)   Sailors (inner)

  Cost of page-oriented nested loops join is: M + M * N
    With M = 1000 pages of Reserves and N = 500 pages of Sailors:
    1000 + 1000 * 500 = 501,000
  And on-the-fly ops have no I/O, so plan cost is 501,000!

Generating More Plans
  How can we generate more plans?
    Use equivalences to produce new query trees
      Then find a plan for each of these
    Or assign different algorithms to operators
      That is, generate more plans from one query tree
  Example: Apply commutativity of join to the current tree
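The cost arithmetic for the page-oriented nested loops join, and the effect of swapping the join inputs via commutativity, can be checked with a few lines of Python. The function and constant names are chosen for this sketch; the page counts are the ones used in the cost formula above.

```python
def page_nl_join_cost(outer_pages: int, inner_pages: int) -> int:
    """I/O cost of a page-oriented nested loops join.

    The outer relation is read once, and the whole inner relation is
    scanned once for every page of the outer relation.
    """
    return outer_pages + outer_pages * inner_pages

RESERVES_PAGES = 1000  # M in the slides
SAILORS_PAGES = 500    # N in the slides

# Reserves as outer, Sailors as inner.
cost_reserves_outer = page_nl_join_cost(RESERVES_PAGES, SAILORS_PAGES)

# Same join with the inputs swapped: Sailors as outer, Reserves as inner.
cost_sailors_outer = page_nl_join_cost(SAILORS_PAGES, RESERVES_PAGES)

print(cost_reserves_outer)  # 501000
print(cost_sailors_outer)   # 500500
```

Putting the smaller relation on the outside saves only 500 I/Os here, because the M * N term dominates either way.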

Cost of Plan 2
  Assume:
    No index
    Reserves is the inner relation

  SELECT S.sname
  FROM Reserves R, Sailors S
  WHERE R.sid = S.sid AND R.bid = 100 AND S.rating > 5;

  Plan (top to bottom):
    π sname                  (on the fly)
    σ bid=100 ∧ rating>5     (on the fly)
    ⋈ sid=sid                (Page NL Join)
    Sailors (outer)   Reserves (inner)

  Cost of page-oriented nested loops join is: N + N * M
    500 + 500 * 1000 = 500,500
  And on-the-fly ops have no I/O, so plan cost is 500,500!

Cost of Plan 3
  Push down selects: σ_c(R ⋈ S) = σ_c(R) ⋈ S (when c refers only to attributes of R)

  SELECT S.sname
  FROM Reserves R, Sailors S
  WHERE R.sid = S.sid AND R.bid = 100 AND S.rating > 5;

  Plan (top to bottom):
    π sname                  (on the fly)
    ⋈ sid=sid                (Page NL Join)
    σ rating>5 (Sailors)   σ bid=100 (Reserves)   (Table Scan)

  What is the cost now?
    Scanning Sailors and Reserves costs M + N I/Os
    What about the join?
      Depends on how many reservations there are for boat 100 and how many sailors have a rating > 5
  How do we find this information?
    Guess
    Better: Use the statistics kept by the DBMS!

Estimating costs
  For all operators (except leaf-level of the query plan)
    The input tables are the result of some earlier query
    Thus, we need to estimate the size of intermediate results!
  This can be difficult
    This is a reason why cost estimates may not be very good
    Estimation errors tend to compound
  For example, what information would you need to estimate the number of reservations for bid = 100?
    Or how many sailors have a rating higher than 5?

The DBMS Catalog and Statistics
  A Catalog typically contains at least:
    The # of tuples for each table
    The # of pages for each table
    The # of distinct key values and # of pages for each index
    The index height, low/high key values for each tree index
    A distribution of data values
  Catalogs are updated periodically
    This could be after each insert, daily, weekly, monthly, after each backup...
  Simplest case: Assume values are uniformly distributed
    Thus, for a gender attribute
      Assume half the rows have the male value and the other half have the female value
    This may be grossly inaccurate!

Calculating Selectivities
  Assume that rating values range from 1 to 10
    And that bid values range from 1 to 100
  What percentage of the incoming tuples to the operator σ bid=100 will be output?
  What about σ rating>5?
  What about σ bid=100 ∧ rating>5?
    More later

Improving Uniform Distributions
  The DBMS might gather more detailed information about how the values of attributes are distributed
    Often using histograms of the values in a field
    The histograms are stored in the catalog
  Suppose there was an attribute degree-program with 3 possible values
    BS/CS, MS/CS, PhD/CS
    The DBMS might count the values and know there are 428 BS/CS values, 98 MS/CS values, and 25 PhD/CS values
  This allows for a much better estimate of the reduction factor
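As a rough sketch of the two estimation styles above, the following Python computes selectivities under the uniform assumption and from the degree-program counts. The function names are invented for this example, and the range estimate assumes integer-valued attributes so that values can simply be counted.

```python
from fractions import Fraction

def rf_equals_uniform(num_distinct: int) -> Fraction:
    # attr = value, assuming every distinct value is equally likely.
    return Fraction(1, num_distinct)

def rf_greater_uniform(value: int, low: int, high: int) -> Fraction:
    # attr > value over the integer domain low..high (inclusive),
    # assuming every value in the domain is equally likely.
    return Fraction(high - value, high - low + 1)

def rf_equals_histogram(histogram: dict, value: str) -> Fraction:
    # attr = value, using exact per-value counts kept in the catalog.
    total = sum(histogram.values())
    return Fraction(histogram.get(value, 0), total)

# bid ranges over 1..100 and rating over 1..10, as in the slides.
rf_bid = rf_equals_uniform(100)           # fraction passing bid = 100
rf_rating = rf_greater_uniform(5, 1, 10)  # fraction passing rating > 5

# Counts from the degree-program example.
degree_program = {"BS/CS": 428, "MS/CS": 98, "PhD/CS": 25}
rf_phd_uniform = rf_equals_uniform(len(degree_program))
rf_phd_histogram = rf_equals_histogram(degree_program, "PhD/CS")

print(rf_bid, rf_rating)
print(float(rf_phd_uniform), float(rf_phd_histogram))
```

With these counts the uniform assumption estimates that one third of the rows are PhD/CS, while the histogram gives 25/551, under 5%.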

Reduction Factor (RF)
  A reduction factor (RF) is associated with each term
    The fraction of tuples in the table that satisfy a given conjunct
    Reflects the impact of the term in reducing the result size
  A value between 0 and 1
    Often expressed as a percentage (e.g., a reduction of 30%)
    Confusingly, more is less
  For a term
    Result Cardinality = Cardinality * Reduction Factor

Independence of Reduction Factors
  How do we use distributions for multiple conditions?
    That is, a select operator with terms separated by AND
    E.g., σ bid=100 ∧ rating>5
  We assume that all terms are independent!
  If one attribute is class and the other is num_hours
    The optimizer might assume that class is uniformly distributed over {Freshman, Sophomore, Junior, Senior}
    And that num_hours is uniformly distributed over {0, 1, ..., 144}
    But we know that class correlates with credit hours! (E.g., it could be that num_hours → class)
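To see how much the independence assumption can mislead, here is a small Python experiment on synthetic data in which class is fully determined by num_hours. The data set, the 36-hours-per-class cutoff, and the predicate names are all invented for illustration. Treating the terms as independent means multiplying their reduction factors.

```python
from fractions import Fraction

CLASSES = ["Freshman", "Sophomore", "Junior", "Senior"]

# Synthetic table (an assumption for this sketch): one student per value of
# num_hours in 0..143, with class determined by num_hours.
students = [{"num_hours": h, "class": CLASSES[h // 36]} for h in range(144)]

def is_senior(row: dict) -> bool:
    return row["class"] == "Senior"

def has_many_hours(row: dict) -> bool:
    return row["num_hours"] >= 108

def reduction_factor(rows: list, predicate) -> Fraction:
    # Fraction of rows that satisfy a single term.
    return Fraction(sum(1 for r in rows if predicate(r)), len(rows))

rf_class = reduction_factor(students, is_senior)
rf_hours = reduction_factor(students, has_many_hours)

# Independence assumption: multiply the per-term reduction factors.
estimated_rows = len(students) * rf_class * rf_hours

# Ground truth for the conjunction of the two terms.
actual_rows = sum(1 for r in students if is_senior(r) and has_many_hours(r))

print(estimated_rows, actual_rows)
```

Each term alone keeps a quarter of the rows, so the independence estimate is 144 * 1/4 * 1/4 = 9 rows, while the true result has 36 rows because the two terms select exactly the same students.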

Independence of Reduction Factors
  What percentage of the input tuples to the operator σ bid=100 ∧ rating>5 will be output?
    What is the reduction factor of this complex term?
  Given multiple terms (ANDed together), we simply multiply all the reduction factors
    Total RF = RF_T1 * RF_T2 * ... * RF_Tn
  So: Result Cardinality = Cardinality * product of all RFs
  How accurate is this?

Enumerating Plans for Multiple Joins
  What if the query has multiple joins?
  Should we generate as many plans as possible?
    No: it typically takes too long / there are too many
    It is best to generate a few, if we can find some cheap ones
  In System R only left-deep join trees are generated

  [Figure: join trees over relations A, B, C, and D; one is labeled "This is a left-deep join tree!"]

Enumerating Plans for Multiple Joins
  Left-deep trees allow us to generate all of the fully pipelined plans
    That is, intermediate results are not written to temporary files
    Not all left-deep trees can be fully pipelined (e.g., under Sort-Merge Join)
  Because picking only left-deep plans restricts the search space (of plans), the optimizer may not find the optimal plan!

  [Figure: a join tree over relations A, B, C, and D]

Enumerating Left-Deep Plans
  Considering all possible left-deep plans
    For SPJ queries, these are enumerated by ordering tables
    For each ordering, consider the access method for each relation and the join method for each join
  (Greedy) Enumeration using N passes (if N relations joined):
    Pass 1: Find the best 1-relation plan for each relation
    Pass 2: Find the best way to join the result of each 1-relation plan (as outer) to another relation (All 2-relation plans)
    ...
    Pass N: Find the best way to join the result of an (N-1)-relation plan (as outer) to the Nth relation (All N-relation plans)
  At the end of each pass only retain
    Cheapest overall plan + cheapest plan for each interesting order of tuples (e.g., for further joins, ORDER-BY, GROUP-BY, etc.)
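The pass-by-pass enumeration can be sketched in Python as a small dynamic program. This is a deliberately simplified illustration, not System R's actual algorithm: it assumes a table scan is the only access method, page-oriented nested loops is the only join method, a single made-up `selectivity` applies to every join, any two relations may be joined, and interesting orders are ignored (only the cheapest plan per set of relations is retained).

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Plan:
    order: tuple   # join order, outermost (leftmost) relation first
    cost: float    # estimated I/O cost of the plan
    pages: float   # estimated result size in pages

def best_left_deep_plan(pages: dict, selectivity: float) -> Plan:
    names = sorted(pages)
    # Pass 1: best 1-relation plan for each relation (here: a table scan).
    best = {frozenset([r]): Plan((r,), pages[r], pages[r]) for r in names}
    # Pass k: extend the retained (k-1)-relation plans by one base relation.
    for k in range(2, len(names) + 1):
        for subset in combinations(names, k):
            candidates = []
            for inner in subset:
                outer = best[frozenset(subset) - {inner}]
                # Page NL join: scan the inner once per page of the outer.
                cost = outer.cost + outer.pages * pages[inner]
                # Hypothetical size estimate for the intermediate result.
                size = outer.pages * pages[inner] * selectivity
                candidates.append(Plan(outer.order + (inner,), cost, size))
            # Retain only the cheapest plan for this set of relations.
            best[frozenset(subset)] = min(candidates, key=lambda p: p.cost)
    return best[frozenset(names)]

plan = best_left_deep_plan({"Reserves": 1000, "Sailors": 500}, selectivity=0.001)
print(plan.order, plan.cost)
```

For the two-relation case this picks Sailors as the outer relation at a cost of 500,500 I/Os, matching Plan 2. A fuller version would also keep the cheapest plan per interesting order and consider several access and join methods in each pass.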

Nested Queries
  Nested block is optimized independently
    Outer tuple is considered as providing a selection condition
  Outer block is optimized with the cost of calling the nested block
  Implicit ordering of these blocks means that some good strategies are not considered
  The non-nested version of the query is traditionally optimized better, so you may need to explicitly unnest the query

  SELECT S.sname
  FROM Sailors S
  WHERE EXISTS (SELECT *
                FROM Reserves R
                WHERE R.bid = 103 AND R.sid = S.sid);

  Nested block to optimize:
    SELECT *
    FROM Reserves R
    WHERE R.bid = 103 AND R.sid = outer value;

  Equivalent non-nested version:
    SELECT S.sname
    FROM Sailors S, Reserves R
    WHERE S.sid = R.sid AND R.bid = 103;

Finding the Best Plan
  Query optimizers don't (always) find the best plan
    There are usually more plans than you can consider (even if only left-deep plans are considered)
    The optimizer might not even try to generate all of them
    Sometimes the optimizer will compare the optimization cost to the estimated execution cost and quit early
  The optimizer chooses the plan with the lowest ESTIMATED cost; actual costs may differ
  Optimizers and optimization techniques are fairly sophisticated
    We have only covered the traditional approaches (that stemmed from System R)
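The nested query and its non-nested version from the Nested Queries slide can be tried side by side with Python's built-in sqlite3 module. The table contents below are invented for this sketch, and SQLite is used only as a convenient engine; its optimizer is not the one described in these slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sailors (sid INTEGER PRIMARY KEY, sname TEXT, rating INTEGER);
    CREATE TABLE Reserves (sid INTEGER, bid INTEGER, day TEXT);
    -- Hypothetical rows, for illustration only.
    INSERT INTO Sailors VALUES (1, 'Ann', 7), (2, 'Bob', 4), (3, 'Cy', 9);
    INSERT INTO Reserves VALUES (1, 103, '2009-01-01'),
                                (2, 101, '2009-01-02'),
                                (3, 103, '2009-01-03');
""")

# Nested version: the inner block is evaluated per outer tuple (conceptually).
nested = conn.execute("""
    SELECT S.sname
    FROM Sailors S
    WHERE EXISTS (SELECT * FROM Reserves R
                  WHERE R.bid = 103 AND R.sid = S.sid)
""").fetchall()

# Non-nested version: a plain join that the optimizer can reorder freely.
unnested = conn.execute("""
    SELECT S.sname
    FROM Sailors S, Reserves R
    WHERE S.sid = R.sid AND R.bid = 103
""").fetchall()

same_rows = sorted(nested) == sorted(unnested)
print(sorted(nested), sorted(unnested), same_rows)
conn.close()
```

On this data both queries return the same names. Note that if a sailor had reserved boat 103 more than once, the join version would list that sailor once per reservation unless DISTINCT were added, so the two forms agree only up to duplicates.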

Next time
  Physical database design
  Then transactions and recovery