Query Processing C H A P T E R12. Practice Exercises



Similar documents
Chapter 13: Query Processing. Basic Steps in Query Processing

Comp 5311 Database Management Systems. 16. Review 2 (Physical Level)

Datenbanksysteme II: Implementation of Database Systems Implementing Joins

SQL Query Evaluation. Winter Lecture 23

Lecture 1: Data Storage & Index

Evaluation of Expressions

Efficient Processing of Joins on Set-valued Attributes

Elena Baralis, Silvia Chiusano Politecnico di Torino. Pag. 1. Physical Design. Phases of database design. Physical design: Inputs.

Chapter 14: Query Optimization

Databases and Information Systems 1 Part 3: Storage Structures and Indices

Overview of Storage and Indexing

File Management. Chapter 12

Answer Key. UNIVERSITY OF CALIFORNIA College of Engineering Department of EECS, Computer Science Division

Overview of Storage and Indexing. Data on External Storage. Alternative File Organizations. Chapter 8

In-Memory Database: Query Optimisation. S S Kausik ( ) Aamod Kore ( ) Mehul Goyal ( ) Nisheeth Lahoti ( )

File Management. Chapter 12

Analysis of Binary Search algorithm and Selection Sort algorithm

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

External Sorting. Why Sort? 2-Way Sort: Requires 3 Buffers. Chapter 13

Physical Data Organization

DATABASE DESIGN - 1DL400

University of Massachusetts Amherst Department of Computer Science Prof. Yanlei Diao

Symbol Tables. Introduction

EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES

Unit Storage Structures 1. Storage Structures. Unit 4.3

Operating Systems CSE 410, Spring File Management. Stephen Wagner Michigan State University

Chapter 2 Data Storage

Chapter 13: Query Optimization

Storage and File Structure

Query Optimization in a Memory-Resident Domain Relational Calculus Database System

Inside the PostgreSQL Query Optimizer

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Oracle Database 11g: SQL Tuning Workshop Release 2

SQL Server Query Tuning

What Is Recursion? Recursion. Binary search example postponed to end of lecture

Architecture Sensitive Database Design: Examples from the Columbia Group

Advanced Oracle SQL Tuning

Why? A central concept in Computer Science. Algorithms are ubiquitous.

Understanding Query Processing and Query Plans in SQL Server. Craig Freedman Software Design Engineer Microsoft SQL Server

Big Data and Scripting. Part 4: Memory Hierarchies

CIS 631 Database Management Systems Sample Final Exam

Comp 5311 Database Management Systems. 3. Structured Query Language 1

Adaptive Join Operators for Result Rate Optimization on Streaming Inputs

Why Query Optimization? Access Path Selection in a Relational Database Management System. How to come up with the right query plan?

Oracle Database 11g: SQL Tuning Workshop

Operating Systems. Virtual Memory

Performance Tuning for the Teradata Database

6. Standard Algorithms

RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems CLOUD COMPUTING GROUP - LITAO DENG

File Management. File Management

Bloom Filters. Christian Antognini Trivadis AG Zürich, Switzerland

Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification

Understanding SQL Server Execution Plans. Klaus Aschenbrenner Independent SQL Server Consultant SQLpassion.at

Semi-Streamed Index Join for Near-Real Time Execution of ETL Transformations

Physical Database Design Process. Physical Database Design Process. Major Inputs to Physical Database. Components of Physical Database Design

Chapter 9 Joining Data from Multiple Tables. Oracle 10g: SQL

Chapter 12 File Management

Chapter 12 File Management. Roadmap

SMALL INDEX LARGE INDEX (SILT)

Loop Invariants and Binary Search

Quiz! Database Indexes. Index. Quiz! Disc and main memory. Quiz! How costly is this operation (naive solution)?

ENHANCEMENTS TO SQL SERVER COLUMN STORES. Anuhya Mallempati #

Data Warehousing und Data Mining

In this session, we use the table ZZTELE with approx. 115,000 records for the examples. The primary key is defined on the columns NAME,VORNAME,STR

CS 245 Final Exam Winter 2013

Recovery System C H A P T E R16. Practice Exercises

Chapter 6: Physical Database Design and Performance. Database Development Process. Physical Design Process. Physical Database Design

Query Optimization for Distributed Database Systems Robert Taylor Candidate Number : Hertford College Supervisor: Dr.

Execution Strategies for SQL Subqueries

Query tuning by eliminating throwaway

Big Data With Hadoop

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Zabin Visram Room CS115 CS126 Searching. Binary Search

Chapter 12 File Management

Question 1. Relational Data Model [17 marks] Question 2. SQL and Relational Algebra [31 marks]

DATA STRUCTURES USING C

Chapter 8: Bags and Sets

Multi-dimensional index structures Part I: motivation

The Tower of Hanoi. Recursion Solution. Recursive Function. Time Complexity. Recursive Thinking. Why Recursion? n! = n* (n-1)!

2) What is the structure of an organization? Explain how IT support at different organizational levels.

Developing MapReduce Programs

Oracle Rdb Performance Management Guide

Sorting revisited. Build the binary search tree: O(n^2) Traverse the binary tree: O(n) Total: O(n^2) + O(n) = O(n^2)

Implementation and Analysis of Join Algorithms to handle skew for the Hadoop Map/Reduce Framework

Data Structures and Algorithms Written Examination

1) The postfix expression for the infix expression A+B*(C+D)/F+D*E is ABCD+*F/DE*++

IBM DB2: LUW Performance Tuning and Monitoring for Single and Multiple Partition DBs

White Paper. Optimizing the Performance Of MySQL Cluster

Chapter 12 File Management

Transcription:

C H A P T E R12 Query Processing Practice Exercises 12.1 Assume (for simplicity in this exercise) that only one tuple fits in a block and memory holds at most 3 blocks. Show the runs created on each pass of the sort-merge algorithm, when applied to sort the following tuples on the first attribute: (kangaroo, 17), (wallaby, 21), (emu, 1), (wombat, 13), (platypus, 3), (lion, 8), (warthog, 4), (zebra, 11), (meerkat, 6), (hyena, 9), (hornbill, 2), (baboon, 12). We will refer to the tuples (kangaroo, 17) through (baboon, 12) using tuple numbers t 1 through t 12. We refer to the j th run used by the i th pass, as r i j. The initial sorted runs have three blocks each. They are: r 11 = {t 3, t 1, t 2 } r 12 = {t 6, t 5, t 4 } r 13 = {t 9, t 7, t 8 } r 14 = {t 12, t 11, t 10 } Each pass merges three runs. Therefore the runs after the of the first pass are: r 21 = {t 3, t 1, t 6, t 9, t 5, t 2, t 7, t 4, t 8 } r 22 = {t 12, t 11, t 10 } At the of the second pass, the tuples are completely sorted into one run: r 31 = {t 12, t 3, t 11, t 10, t 1, t 6, t 9, t 5, t 2, t 7, t 4, t 8 } 12.2 Consider the bank database of Figure 12.13, where the primary keys are underlined, and the following SQL query: 1

2 Chapter 12 Query Processing select T.branch name from branch T, branch S where T.assets > S.assets and S.branch city = Brooklyn Write an efficient relational-algebra expression that is equivalent to this query. Justify your choice. Query: T.branch name (( branch name, assets ( T (branch))) T.assets > S.assets ( assets ( (branch city= Brooklyn ) ( S (branch))))) This expression performs the theta join on the smallest amount of data possible. It does this by restricting the right hand side operand of the join to only those branches in Brooklyn, and also eliminating the unneeded attributes from both the operands. 12.3 Let relations r 1 (A, B, C) and r 2 (C, D, E) have the following properties: r 1 has 20,000 tuples, r 2 has 45,000 tuples, 25 tuples of r 1 fit on one block, and 30 tuples of r 2 fit on one block. Estimate the number of block transfers and seeks required, using each of the following join strategies for r 1 r 2 : a. Nested-loop join. b. Block nested-loop join. c. Merge join. d. Hash join. r 1 needs 800 blocks, and r 2 needs 1500 blocks. Let us assume M pages of memory. If M > 800, the join can easily be done in 1500 + 800 disk accesses, using even plain nested-loop join. So we consider only the case where M 800 pages. a. Nested-loop join: Using r 1 as the outer relation we need 20000 1500 + 800 = 30, 000, 800 disk accesses, if r 2 is the outer relation we need 45000 800 + 1500 = 36, 001, 500 disk accesses. b. Block nested-loop join: If r 1 is the outer relation, we need 800 1500 + 800 disk accesses, M 1 if r 2 is the outer relation we need 1500 800 + 1500 disk accesses. M 1 c. Merge-join: Assuming that r 1 and r 2 are not initially sorted on the join key, the total sorting cost inclusive of the output is B s = 1500(2 log M 1 (1500/M) +

Exercises 3 2) + 800(2 log M 1 (800/M) + 2) disk accesses. Assuming all tuples with the same value for the join attributes fit in memory, the total cost is B s + 1500 + 800 disk accesses. d. Hash-join: We assume no overflow occurs. Since r 1 is smaller, we use it as the build relation and r 2 as the probe relation. If M > 800/M, i.e. no need for recursive partitioning, then the cost is 3(1500 + 800) = 6900 disk accesses, else the cost is 2(1500 + 800) log M 1 (800) 1 + 1500 + 800 disk accesses. 12.4 The indexed nested-loop join algorithm described in Section 12.5.3 can be inefficient if the index is a secondary index, and there are multiple tuples with the same value for the join attributes. Why is it inefficient? Describe a way, using sorting, to reduce the cost of retrieving tuples of the inner relation. Under what conditions would this algorithm be more efficient than hybrid merge join? If there are multiple tuples in the inner relation with the same value for the join attributes, we may have to access that many blocks of the inner relation for each tuple of the outer relation. That is why it is inefficient. To reduce this cost we can perform a join of the outer relation tuples with just the secondary index leaf entries, postponing the inner relation tuple retrieval. The result file obtained is then sorted on the inner relation addresses, allowing an efficient physical order scan to complete the join. Hybrid merge join requires the outer relation to be sorted. The above algorithm does not have this requirement, but for each tuple in the outer relation it needs to perform an index lookup on the inner relation. If the outer relation is much larger than the inner relation, this index lookup cost will be less than the sorting cost, thus this algorithm will be more efficient. 12.5 Let r and s be relations with no indices, and assume that the relations are not sorted. Assuming infinite memory, what is the lowest-cost way (in terms of I/O operations) to compute r s? What is the amount of memory required for this algorithm? We can store the entire smaller relation in memory, read the larger relation block by block and perform nested loop join using the larger one as the outer relation. The number of I/O operations is equal to b r + b s, and memory requirement is min(b r, b s ) + 2 pages. 12.6 Consider the bank database of Figure 12.13, where the primary keys are underlined. Suppose that a B + -tree index on branch city is available on relation branch, and that no other index is available. List different ways to handle the following selections that involve negation: a. (branch city< Brooklyn ) (branch)

4 Chapter 12 Query Processing b. (branch city= Brooklyn ) (branch) c. (branch city< Brooklyn assets<5000) (branch) a. Use the index to locate the first tuple whose branch city field has value Brooklyn. From this tuple, follow the pointer chains till the, retrieving all the tuples. b. For this query, the index serves no purpose. We can scan the file sequentially and select all tuples whose branch city field is anything other than Brooklyn. c. This query is equivalent to the query (branch city Brooklyn assets<5000)(branch) Using the branch-city index, we can retrieve all tuples with branch-city value greater than or equal to Brooklyn by following the pointer chains from the first Brooklyn tuple. We also apply the additional criteria of assets < 5000 on every tuple. 12.7 Write pseudocode for an iterator that implements indexed nested-loop join, where the outer relation is pipelined. Your pseudocode must define the standard iterator functions open(), next(), and close(). Show what state information the iterator must maintain between calls. Let outer be the iterator which returns successive tuples from the pipelined outer relation. Let inner be the iterator which returns successive tuples of the inner relation having a given value at the join attributes. The inner iterator returns these tuples by performing an index lookup. The functions IndexedNLJoin::open, IndexedNLJoin::close and IndexedNLJoin::next to implement the indexed nested-loop join iterator are given below. The two iterators outer and inner, the value of the last read outer relation tuple t r and a flag done r indicating whether the of the outer relation scan has been reached are the state information which need to be remembered by IndexedNLJoin between calls. IndexedNLJoin::open() outer.open(); inner.open(); done r := false; if(outer.next() false) move tuple from outer s output buffer to t r ; else done r := true;

Exercises 5 IndexedNLJoin::close() outer.close(); inner.close(); boolean IndexedNLJoin::next() while( done r ) if(inner.next(t r [JoinAttrs]) false) move tuple from inner s output buffer to t s ; compute t r t s and place it in output buffer; return true; else if(outer.next() false) move tuple from outer s output buffer to t r ; rewind inner to first tuple of s; else done r := true; return false; 12.8 Design sort-based and hash-based algorithms for computing the relational division operation (see Practise Exercises of Chapter 6 for a definition of the division operation). Suppose r(t S) and s(s) be two relations and r s has to be computed. For sorting based algorithm, sort relation s on S. Sort relation r on (T, S). Now, start scanning r and look at the T attribute values of the first tuple. Scan r till tuples have same value of T. Also scan s simultaneously and check whether every tuple of s also occurs as the S attribute of r, in a fashion similar to merge join. If this is the case, output that value of T and proceed with the next value of T. Relation s may have to be scanned multiple times but r will only be scanned once. Total disk accesses, after

6 Chapter 12 Query Processing sorting both the relations, will be r + N s, where N is the number of distinct values of T in r. We assume that for any value of T, all tuples in r with that T value fit in memory, and consider the general case at the. Partition the relation r on attributes in T such that each partition fits in memory (always possible because of our assumption). Consider partitions one at a time. Build a hash table on the tuples, at the same time collecting all distinct T values in a separate hash table. For each value of T, Now, for each value V T of T, each value s of S, probe the hash table on (V T, s). If any of the values is absent, discard the value V T, else output the value V T. In the case that not all r tuples with one value for T fit in memory, partition r and s on the S attributes such that the condition is satisfied, run the algorithm on each corresponding pair of partitions r i and s i. Output the intersection of the T values generated in each partition. 12.9 What is the effect on the cost of merging runs if the number of buffer blocks per run is increased, while keeping overall memory available for buffering runs fixed? Seek overhead is reduced, but the the number of runs that can be merged in a pass decreases potentially leading to more passes. A value of b b that minimizes overall cost should be chosen.