Why Query Optimization? Access Path Selection in a Relational Database Management System. How to come up with the right query plan?



Similar documents
Access Path Selection in a Relational Database Management System

Inside the PostgreSQL Query Optimizer

Access Path Selection in a Relational Database Management System. P. Griffiths Selinger M. M. Astrahan D. D. Chamberlin,: It. A. Lorie.

Portable Bushy Processing Trees for Join Queries

Chapter 13: Query Processing. Basic Steps in Query Processing

Query Optimization in a Memory-Resident Domain Relational Calculus Database System

SQL Query Evaluation. Winter Lecture 23

Advanced Oracle SQL Tuning

Fallacies of the Cost Based Optimizer

Query Processing C H A P T E R12. Practice Exercises

Chapter 13: Query Optimization

D B M G Data Base and Data Mining Group of Politecnico di Torino

Elena Baralis, Silvia Chiusano Politecnico di Torino. Pag. 1. Physical Design. Phases of database design. Physical design: Inputs.

Rethinking SIMD Vectorization for In-Memory Databases

Understanding SQL Server Execution Plans. Klaus Aschenbrenner Independent SQL Server Consultant SQLpassion.at

Answer Key. UNIVERSITY OF CALIFORNIA College of Engineering Department of EECS, Computer Science Division

In-Memory Data Management for Enterprise Applications

University of Massachusetts Amherst Department of Computer Science Prof. Yanlei Diao

Datenbanksysteme II: Implementation of Database Systems Implementing Joins

PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce. Authors: B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo.

In this session, we use the table ZZTELE with approx. 115,000 records for the examples. The primary key is defined on the columns NAME,VORNAME,STR

Testing SQL Server s Query Optimizer: Challenges, Techniques and Experiences

Topics in basic DBMS course

Data storage Tree indexes

Best Practices for DB2 on z/os Performance

Chapter 5: Overview of Query Processing

Chapter 14: Query Optimization

Query Optimization in Microsoft SQL Server PDW The article done by: Srinath Shankar, Rimma Nehme, Josep Aguilar-Saborit, Andrew Chung, Mostafa

PartJoin: An Efficient Storage and Query Execution for Data Warehouses

Comp 5311 Database Management Systems. 16. Review 2 (Physical Level)

Query Optimization for Distributed Database Systems Robert Taylor Candidate Number : Hertford College Supervisor: Dr.

Analysis of Algorithms I: Optimal Binary Search Trees

CA Performance Handbook. for DB2 for z/os

Efficient Processing of Joins on Set-valued Attributes

Chapter 6: Episode discovery process

Run$me Query Op$miza$on

SQL Server Query Tuning

Research Statement Immanuel Trummer

DBMS / Business Intelligence, SQL Server

The Tower of Hanoi. Recursion Solution. Recursive Function. Time Complexity. Recursive Thinking. Why Recursion? n! = n* (n-1)!

Data Warehouse Design

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Automatic segmentation of text into structured records

Architectures for Big Data Analytics A database perspective

SQL Tuning Proven Methodologies

Searching frequent itemsets by clustering data

Big Data, Fast Processing Speeds Kevin McGowan SAS Solutions on Demand, Cary NC

Efficient Data Access and Data Integration Using Information Objects Mica J. Block

Section IV.1: Recursive Algorithms and Recursion Trees

Unraveling the Duplicate-Elimination Problem in XML-to-SQL Query Translation

Data Warehousing und Data Mining

Indexing Techniques for Data Warehouses Queries. Abstract

ML for the Working Programmer

Exploring Query Optimization Techniques in Relational Databases

Efficient Integration of Data Mining Techniques in Database Management Systems

Efficient Iceberg Query Evaluation for Structured Data using Bitmap Indices

Lecture 6: Query optimization, query tuning. Rasmus Pagh

Oracle Database 11g: SQL Tuning Workshop Release 2

Big Data Challenges in Bioinformatics

Execution Strategies for SQL Subqueries

Query Optimization in Oracle 12c Database In-Memory

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce

SQL Performance for a Big Data 22 Billion row data warehouse

Architecture Sensitive Database Design: Examples from the Columbia Group

In-Memory Columnar Databases HyPer. Arto Kärki University of Helsinki

Instructional Design Framework CSE: Unit 1 Lesson 1

Bitmap Index an Efficient Approach to Improve Performance of Data Warehouse Queries

Physical DB design and tuning: outline

A Comparison of Functional and Imperative Programming Techniques for Mathematical Software Development

Efficiently Identifying Inclusion Dependencies in RDBMS

Reminder: Complexity (1) Parallel Complexity Theory. Reminder: Complexity (2) Complexity-new

MATHEMATICAL INDUCTION. Mathematical Induction. This is a powerful method to prove properties of positive integers.

FPGA-based Multithreading for In-Memory Hash Joins

EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES

Evaluation of Expressions

An Oracle White Paper May The Oracle Optimizer Explain the Explain Plan

Oracle Database 11g: SQL Tuning Workshop

External Sorting. Why Sort? 2-Way Sort: Requires 3 Buffers. Chapter 13

DATA STRUCTURES USING C

A Generic Auto-Provisioning Framework for Cloud Databases

Performance Evaluation of Natural and Surrogate Key Database Architectures

Web Service Mediation Through Multi-level Views

Caching XML Data on Mobile Web Clients

Transcription:

Why Query Optimization? Access Path Selection in a Relational Database Management System P. Selinger, M. Astrahan, D. Chamberlin, R. Lorie, T. Price Peyman Talebifard Queries must be executed and execution takes time There are multiple execution plans for most queries Some plans cost less than others How to come up with the right query plan? 1. Measure the cost of each query 2. Enumerate possibilities 3. Pick the least expensive one Is that all? Simple Example SELECT * FROM A,B,C WHERE A.n = B.n AND B.m = C.m A = 100 tuples B = 50 tuples C = 2 tuples Which plan is cheaper? Join( C, Join( A, B ) ) Join( A, Join( B, C ) ) But the search space is too big Search space becomes too large as the number of joins increases In a matter of n! Just to remind you, 20! = 2,432,902,008,176,640,000 So now what do we do? Discussion 1 What will be changed in the query optimizer with the hardware today? 1.Can the search on the query space be decomposed into sub-problems, i.e.,could we design a parallel program for the space search problem? 2.More broadly, is there some way that we take advantage of the new multi-core technology to improve query optimization? 1

Changes Brought by Multi-core to Query Optimizer [VLDB 2008] Parallelizing Query Optimization [SIGMOD 2009] Parallelizing Extensible Query Optimizations [SIGMOD 2009] Dependency-Aware Reordering for Parallelizing Query Optimization in Multi-Core CPUs Selectivity factor Use Statistics Expected fraction of tuples that satisfy the predicate For each relation keep track of Cardinality of relations Number of pages in the segment containing tuples of relation T Fraction of empty to non-empty pages Use these statistics in conjunction with Sargable predicates Interesting Orders But Statistics alone cannot save us Expensive to compute Can t keep track of all joint statistics Compromise on statistics Periodically update stats for each relation Predicates Predicates like =, >, NOT, etc. reduce the number of tuples THUS: Evaluate predicates as early as possible System R Optimizer System R Optimizer A join operator can use either nested loop implementation sort-merge implementation Predicates are evaluated as early as possible Cost model relies on Use of statistics Selectivity factor CPU and I/O cost Consideration of interesting orders 2

Interesting Orders GROUP BY and ORDER BY or sortmerge joins generate interesting orders Cost of generating the interesting order should be added to the cost of a plan Back to the issue of search space Recall: search space too big (n!) Dynamic programming approach Dynamic programming (Wikipedia) Dynamic programming Assumption: Cost model satisfies the principle of optimality Optimal substructure means that optimal solutions of subproblems can be used to find the optimal solutions of the overall problem. An N-Join is really just a sequence of 2-Joins 1. Break the problem into smaller sub-problems. 2. Solve these problems optimally using this three-step process recursively. 3. Use these optimal solutions to construct an optimal solution for the original problem. Discussion 2 With formulas & methods; Without the estimation & experiment 1. Similar to Codd's paper "A Relational Model of Data for Large Shared Data Banks", this paper was also published in the ACM even though it lacks experimental data. Can we say that Codd's paper perhaps caused a paradigm shift in how the journal accepts papers? 2. It is the first time that a cost-based method is proposed to select access paths. The method has a deep influences to the subsequent work on query optimizer. But at that time when the paper is published, do you think the method proposed in the paper is not convincing without experimental data? If so, why does it succeed in the following decades? Major Contributions of Paper Cost based optimization Statistics CPU utilization (for sorts, etc.) Dynamic programming approach Interesting Orders 3

Summary of the Approach Example Schema Enumerate access paths to each relation Sequential scans Interesting orders Enumerate access paths to join a second relation to these results (if there is a predicate to do so) Nested loop (unordered) Merge (interesting order) Compare with equivalent solutions found so far but only keep the cheapest Example Query Example Initial Access Paths for single relations Example Search Tree 2 nd Relations Nested Loop Joining a second relation to the results for single relations 4

2 nd Relations Merge Join Prune and 3 Relations 5