SQL Query Evaluation. Winter 2006-2007 Lecture 23



Similar documents
Chapter 13: Query Processing. Basic Steps in Query Processing

Comp 5311 Database Management Systems. 16. Review 2 (Physical Level)

Query Processing C H A P T E R12. Practice Exercises

Chapter 13: Query Optimization

Physical DB design and tuning: outline

In this session, we use the table ZZTELE with approx. 115,000 records for the examples. The primary key is defined on the columns NAME,VORNAME,STR

SQL Server Query Tuning

Chapter 14: Query Optimization

D B M G Data Base and Data Mining Group of Politecnico di Torino

SQL Query Performance Tuning: Tips and Best Practices

Elena Baralis, Silvia Chiusano Politecnico di Torino. Pag. 1. Physical Design. Phases of database design. Physical design: Inputs.

Lecture 6: Query optimization, query tuning. Rasmus Pagh

Inside the PostgreSQL Query Optimizer

Database Design Patterns. Winter Lecture 24

Chapter 6: Physical Database Design and Performance. Database Development Process. Physical Design Process. Physical Database Design

Join Example. Join Example Cart Prod Comprehensive Consulting Solutions, Inc.All rights reserved.

Understanding Query Processing and Query Plans in SQL Server. Craig Freedman Software Design Engineer Microsoft SQL Server

Execution Strategies for SQL Subqueries

University of Massachusetts Amherst Department of Computer Science Prof. Yanlei Diao

Unit Storage Structures 1. Storage Structures. Unit 4.3

Advanced Oracle SQL Tuning

SQL Tables, Keys, Views, Indexes

Relational Databases

Oracle Database 11g: SQL Tuning Workshop Release 2

Relational Division and SQL

Evaluation of Expressions

Topics in basic DBMS course

Performance Tuning for the Teradata Database

Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification

Oracle Database 11g: SQL Tuning Workshop

SUBQUERIES AND VIEWS. CS121: Introduction to Relational Database Systems Fall 2015 Lecture 6

MOC 20461C: Querying Microsoft SQL Server. Course Overview

Execution Plans: The Secret to Query Tuning Success. MagicPASS January 2015

Data Warehousing und Data Mining

Physical Data Organization

Datenbanksysteme II: Implementation of Database Systems Implementing Joins

CIS 631 Database Management Systems Sample Final Exam

Oracle Database 12c: Introduction to SQL Ed 1.1

ENHANCEMENTS TO SQL SERVER COLUMN STORES. Anuhya Mallempati #

In-Memory Database: Query Optimisation. S S Kausik ( ) Aamod Kore ( ) Mehul Goyal ( ) Nisheeth Lahoti ( )

FHE DEFINITIVE GUIDE. ^phihri^^lv JEFFREY GARBUS. Joe Celko. Alvin Chang. PLAMEN ratchev JONES & BARTLETT LEARN IN G. y ti rvrrtuttnrr i t i r

Best Practices for DB2 on z/os Performance

Useful Business Analytics SQL operators and more Ajaykumar Gupte IBM

MS SQL Performance (Tuning) Best Practices:

Course ID#: W 35 Hrs. Course Content

DATABASE DESIGN - 1DL400

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) Total 92.


1Z0-117 Oracle Database 11g Release 2: SQL Tuning. Oracle

Multi-dimensional index structures Part I: motivation

Querying Microsoft SQL Server 20461C; 5 days

Querying Microsoft SQL Server

Understanding SQL Server Execution Plans. Klaus Aschenbrenner Independent SQL Server Consultant SQLpassion.at

Query Optimization for Distributed Database Systems Robert Taylor Candidate Number : Hertford College Supervisor: Dr.

Query tuning by eliminating throwaway

Partitioning under the hood in MySQL 5.5

External Sorting. Why Sort? 2-Way Sort: Requires 3 Buffers. Chapter 13

Chapter 5: Overview of Query Processing

ICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001

Querying Microsoft SQL Server Course M Day(s) 30:00 Hours

Oracle Database 11 g Performance Tuning. Recipes. Sam R. Alapati Darl Kuhn Bill Padfield. Apress*

Data storage Tree indexes

In-Memory Databases MemSQL

Relational Database: Additional Operations on Relations; SQL

Querying Microsoft SQL Server 2012

Efficient Data Access and Data Integration Using Information Objects Mica J. Block

Course 20461C: Querying Microsoft SQL Server Duration: 35 hours

MapReduce examples. CSE 344 section 8 worksheet. May 19, 2011

SQL Server for developers. murach's TRAINING & REFERENCE. Bryan Syverson. Mike Murach & Associates, Inc. Joel Murach

Lecture 1: Data Storage & Index

Course 10774A: Querying Microsoft SQL Server 2012

Course 10774A: Querying Microsoft SQL Server 2012 Length: 5 Days Published: May 25, 2012 Language(s): English Audience(s): IT Professionals

Big Data, Fast Processing Speeds Kevin McGowan SAS Solutions on Demand, Cary NC

The Relational Algebra

Fallacies of the Cost Based Optimizer

Introducing Microsoft SQL Server 2012 Getting Started with SQL Server Management Studio

Efficiently Identifying Inclusion Dependencies in RDBMS

Access Path Selection in a Relational Database Management System

Sorting revisited. Build the binary search tree: O(n^2) Traverse the binary tree: O(n) Total: O(n^2) + O(n) = O(n^2)

Binary Heap Algorithms

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

Physical Database Design Process. Physical Database Design Process. Major Inputs to Physical Database. Components of Physical Database Design

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

COMP 5138 Relational Database Management Systems. Week 5 : Basic SQL. Today s Agenda. Overview. Basic SQL Queries. Joins Queries

Introduction. Part I: Finding Bottlenecks when Something s Wrong. Chapter 1: Performance Tuning 3

Chapter 8: Structures for Files. Truong Quynh Chi Spring- 2013

Oracle Database 10g: Introduction to SQL

Why Query Optimization? Access Path Selection in a Relational Database Management System. How to come up with the right query plan?

Introduction to SQL for Data Scientists

Oracle EXAM - 1Z Oracle Database 11g Release 2: SQL Tuning. Buy Full Product.

DB2 for i5/os: Tuning for Performance

SAP Business Objects Business Intelligence platform Document Version: 4.1 Support Package Data Federation Administration Tool Guide

Capacity Planning Process Estimating the load Initial configuration

In This Lecture. Physical Design. RAID Arrays. RAID Level 0. RAID Level 1. Physical DB Issues, Indexes, Query Optimisation. Physical DB Issues

Database Systems. Session 8 Main Theme. Physical Database Design, Query Execution Concepts and Database Programming Techniques

Transcription:

SQL Query Evaluation Winter 2006-2007 Lecture 23

SQL Query Processing Databases go through three steps: Parse SQL into an execution plan Optimize the execution plan Evaluate the optimized plan Execution plans are generally based on the extended relational algebra Includes general projection and grouping Also some other features, like sorting results, nested queries, etc.

Query Evaluation Example Simple query: SELECT balance FROM account WHERE balance < 2500 Relational algebra expression: Π balance (σ balance < 2500 (account)) Database might create this structure: Can be evaluated using a pushor a pull-based approach Evaluation loop requests results from top-level Π operation Π balance σ balance < 2500 account

Query Optimization Previous query can be evaluated several ways: Π balance (σ balance < 2500 (account)) σ balance < 2500 (Π balance (account)) Which is faster? Query optimization phase generates a variety of equivalent plans Plans are usually assigned a cost The lowest-cost plan is chosen Heuristic optimizers just follow a set of rules for optimizing a query plan

Query Evaluation Costs A variety of costs in query evaluation Primary expense is reading data from disk Usually, data being processed won t fit in memory Try to minimize disk seeks, reads and writes! CPU and memory requirements are secondary Some ways of computing a result require more CPU and memory resources than others Become especially important in concurrent usage scenarios Can be other costs as well In distributed database systems, network bandwidth must be managed by query optimizer

Implementation Details Several questions to consider: How is a relation s data stored on the disk? How are relational algebra operations actually implemented? How does the database decide which query execution plan is best? Given the answers to these questions, how can we make the database go faster? as a user of the database, of course.

Data Storage Format Several different data storage formats can be used in database implementations A very common format: Each table s data is stored in its own file Individual records can have varying sizes e.g. due to VARCHAR or NUMERIC fields Records not stored in any particular order Called a heap file organization Accesses to a data file use a specific block size Fastest access when records are within the block size Larger records must span multiple blocks, which slows down reads

Select Operation How to implement σ P operation? An easy solution: scan the entire data file Just test selection predicate against each tuple in the data file Will be slow, since every block must be read into memory Not a good general solution What is the selection predicate P? Depending on the characteristics of P, can choose more optimal evaluation strategies

Select Operation Most select predicates are a binary comparison Is an attribute equal to some value? Is an attribute less than some value? If data file was ordered, could use binary search Would dramatically reduce number of blocks read Maintaining record ordering becomes costly if data changes frequently Solution: Continue using heap file organization for main data For important attributes, build an index on the data Index provides a faster way to find specific values in the data file

Table Indexes Indexes provide an alternate access path to particular table records If looking for a specific value or range of values, can use index to find approximately where to start looking in the table Different kinds of indexes for different purposes Ordered indexes maintain a sorted ordering of values in the index Hash indexes divide values into buckets, based on a hash function Often want to build multiple indexes against a particular table

Index Characteristics Indexes have various characteristics What types of access are most efficient for the kind of index? How costly is it to find a particular item, or a set of items? How much time to add a new item to the index? How much time to delete an item from the index? How much additional space does the index take? Indexes can speed many kinds of access, but must also be kept up to date! Indexes can often slow down update operations, while making selects faster

Select Operation For select operations: If select predicate is an equality test and an index is available for that attribute, can use an index scan Can also use index scan for comparison tests if ordered index is available for the attribute For more complicated tests, or if no index is available for attributes being used: Use simple file scan approach

Query Optimization Using Indexes Database query optimizer looks for available indexes If a select operation can use an index, execution plan is annotated with this detail Databases often build indexes for primary keys, automatically Lots of queries will use primary key to retrieve records A database designer can create other indexes based on most common queries e.g. relation employee(emp_id, ssn, emp_name, ) could have indexes on ssn and emp_name fields

Project Operation Project operation is simple to implement For each input tuple, create a new tuple with specified attributes May also involve computed values Which would be faster, in general? Π balance (σ balance < 2500 (account)) σ balance < 2500 (Π balance (account)) Want to project as few rows as possible, to minimize CPU and memory usage Do select first: Π balance (σ balance < 2500 (account)) Good example of a heuristic: Do projects last.

Sorting SQL allows results to be ordered Databases must provide sorting capabilities in execution plans Data being sorted may be much larger than memory! For relations that fit in memory, normal sorting techniques are used (e.g. quick-sort) For relations that are larger than memory, must use an external sorting technique Relation divided into runs to be sorted in memory Each run is sorted, then written to a temporary file All runs are merged using an N-way merge sort

Sorting (2) In general, sorting should be applied as late as possible Ideally rows being sorted fit into memory Some other operations can also use sorted inputs to improve performance Join operations Grouping and aggregation Usually occurs when sorted results are already available Could also perform sorting with an ordered index Scan index, and retrieve each tuple in order Disk seek cost usually makes this prohibitive

Join Operations Join operations are very common in SQL queries especially when using normalized schemas Could also potentially be a very costly operation! r s defined as σ r.a=s.a (r s) A simple strategy for r θ s : for each tuple t r in r do begin for each tuple t s in s do begin if t r, t s satisfy condition θ then add t r t s to result end end

Nested-Loop Join Called the nested-loop join algorithm: for each tuple t r in r do begin for each tuple t s in s do begin if t r, t s satisfy condition θ then add t r t s to result end end Very slow join implementation Scans r once, and s once for each row in r! Not so horrible if s fits entirely in memory Can also handle arbitrary conditions For some queries, only option is nested-loop join!

Indexed Nested-Loop Join Most join conditions involve equalities Called equijoins Indexes can speed up table lookups Version of nested-loop join that takes advantage of indexes for each tuple t r in r do begin use index on s to retrieve tuple t s if t r, t s satisfy condition θ then add t r t s to result end Only an option for equijoins, where an index exists for the join attributes

Sort-Merge Join If relations are already ordered by join attributes, can use merge-sort technique Must be an equijoin! Simple high-level description: Two pointers to traverse relations in order: pr starts at first tuple in r ps starts at first tuple in s If one pointer s tuple has join attribute values less than the other pointer, advance that pointer When pointers have same join attribute values, generate joins using those rows

Sort-Merge Join (2) Much better performance than nested-loop join Dramatically reduces disk accesses Unfortunately, relations aren t usually ordered Can also enhance sort-merge joins when at least one relation has an index e.g. only one relation is sorted, but unsorted relation has an index on the join attributes Traverse unsorted relation s index in order For matching rows, use index to pull tuples from disk Disk seek cost must be managed carefully, with this technique

Hash Join Another join technique for equijoins For relations r and s : Use a hash function on join attribute values to divide rows of r and s into partitions Use same hash function on both r and s, of course Partitions are saved to disk as they are generated Aim for each partition to fit in memory r partitions: H r0, H r1,, H rn s partitions: H s0, H s1,, H sn Rows in H ri will only join with rows in H si

Hash Join (2) After partitioning: for i = 0 to n do build a hash index on H si for each row t r in H ri probe hash index for matching rows in H si for each matching tuple t s in H si add t r t s to result end end end Very fast and efficient equijoin strategy Can perform badly when rows can t be hashed into partitions that fit into memory

Outer Joins Join algorithms can be modified to generate left outer joins reasonably efficiently Right outer join can be restated as left outer join Will still impact overall query performance if many rows are generated Full outer joins can be significantly harder to implement Sort-merge join can compute full outer join easily Nested loop and hash join are harder to extend Full outer joins can also impact query performance heavily

Other Operations Set operations require duplicate elimination Duplicate elimination can be performed with sorting or with hashing Grouping and aggregation can be implemented in several ways Can sort results, then compute aggregates over sorted values Much more efficient to generate aggregate results on the fly

Optimizing Query Performance To improve query performance, you must know how the DB actually runs your query Most databases have an explain statement that reports how a query will be executed Using this information, you can: Create indexes on tables, where appropriate Restate the query to help the DB pick a better plan In hard cases, may need to use multiple steps: Generate intermediate results more well-suited for the desired query Use intermediate results to generate final results

Creating Indexes General syntax is simple: CREATE INDEX name ON table (col1, col2,...); Indexes are usually ordered indexes B-tree indexes: a balanced tree structure with a rather large branching factor Commercial databases offer other index types Hash indexes, or partition indexes Bitmap indexes Overall performance may change Queries may speed up, but updates may slow down Also, may want to verify that the database actually uses the index

Restating Queries Databases with a good optimizer will usually generate good plans Sometimes, need to force database to perform operations in a specific order Join operations are most common example For an N-way join, and binary join operator implementation, N! options Goal: Eliminate as many rows as quickly as possible! Sometimes database picks wrong order Databases usually provide a way to force order of join operations in a query

Correlated Subqueries Correlated subqueries are a notorious source of slow performance Subqueries in SELECT clause are a common source of correlated subqueries Set-membership tests in WHERE clause are another! A simple example: Find total salary of each department in a company SELECT dept_name, dept_vp_name, (SELECT SUM(salary) FROM employee AS e WHERE e.dept_name = d.dept_name) AS payroll FROM department AS d; A good optimizer can rewrite this as an equijoin and grouping/aggregation

Correlated Subqueries (2) A nastier example: Find the sales offices, and their cities, where individual salespeople have sold more than $50,000 SELECT office_id, city FROM sales_offices AS o WHERE EXISTS(SELECT r.employee_name FROM sales_records AS r WHERE r.office_id = o.office_id GROUP BY r.employee_name HAVING SUM(sale_value) > 50000); Inner query must be evaluated for each record in outer query! No guarantee that optimizer can make this fast

Rewriting Correlated Subqueries Usually, can rewrite as uncorrelated subqueries Then, join together computed results on proper attribute For example: Compute each employee s sales total, and limit to sales > $50,000 while you re at it! SELECT employee_name, office_id, SUM(sale_value) AS total_sales FROM sales_records GROUP BY employee_name, office_id HAVING total_sales > 50000; Very easy to join this result against sales_offices and report the desired result

Difficult Queries Some queries are very difficult to optimize Especially queries involving joins on inequality comparisons Sometimes there is a trick Example: dense rank query Computing dense rank with a correlated subquery or a Cartesian product will be slow for large datasets Or, could just sort distinct values being ranked on, and assign each value an increasing number e.g. using a sequence Join values and their ranks back against original data

Review Discussed general details of how most databases evaluate SQL queries Some relational algebra operations have several ways to evaluate them Optimizations for very common special cases, e.g. equijoins Can give the database some guidance Create indexes on tables where appropriate Rewrite queries to be more efficient