Homework 2: Query Processing/Optimization, Transactions/Recovery (due October 6th, 2014, 2:30pm, in class, hard copy please)

Virginia Tech, Computer Science
CS 5614 (Big) Data Management Systems, Fall 2014, Prakash

Reminders:
a. Out of 100 points. Contains 7 pages.
b. Rough time estimate: 5~8 hours.
c. Please type your answers. Illegible handwriting may get no points, at the discretion of the grader. Only drawings may be hand-drawn, as long as they are neat and legible.
d. There could be more than one correct answer. We shall accept them all.
e. Whenever you make an assumption, please state it clearly.
f. Each HW has to be done individually, without taking any help from non-class resources (e.g. websites etc.).

Q1. Hashing [20 points]

Q1.1. (10 points) Suppose that we are using extendible hashing on a file that contains records with the following search-key values:
(2369, 3760, 4692, 4871, 5659, 1821, 1074, 7115, 1620, 2428, 3943, 4750, 6975, 4981, 9208)
Load these records into a file in the given order using extendible hashing. Assume that every block (bucket) can store up to four (4) values. Show the structure of the directory after every 3 insertions. Use the hash function h(k) = k mod 128 and then apply the extendible hashing technique.

Q1.2. (10 points) Suppose we are using linear hashing as discussed in class. Assume two records per bucket. The intermediate hash functions h_i(x) are given by the last (i+1) bits of x; e.g. h_0(x) is the last bit of the binary representation of x, and so on. Assume a split is initiated whenever an overflow bucket is created. Starting with an empty hash file with the hash function h_0(x) and two buckets, show the significant intermediate steps (whenever there is an overflow and/or a split) when keys are inserted in the order given: (1, 3, 2, 8, 6, 10, 9, 7, 5, 11).

Q2. Query Optimization (Pen-n-Paper) [20 points]

Consider the following schema:
Sailors(sid, sname, rating, age)
Reserves(sid, bid, day)
Boats(bid, bname, size)

Reserves.sid is a foreign key to Sailors.sid and Reserves.bid is a foreign key to Boats.bid. We are given the following information about the database:
Reserves contains 10,000 records with 40 records per page.
Sailors contains 1000 records with 20 records per page.
Boats contains 100 records with 10 records per page.
There are 50 values for Reserves.bid.
There are 10 values for Sailors.rating (1..10).
There are 10 values for Boats.size.
There are 500 values for Reserves.day.

Consider the following queries:

Query 1:
    SELECT S.sid, S.sname, B.bname
    FROM Sailors S, Reserves R, Boats B
    WHERE S.sid = R.sid AND R.bid = B.bid

Query 2:
    SELECT S.sid, S.sname, B.bname
    FROM Sailors S, Reserves R, Boats B
    WHERE S.sid = R.sid AND R.bid = B.bid AND B.size > 5 AND R.day = 'July 4, 2003'

Q2.1. (4 points) Assuming uniform distribution of values and column independence, estimate the number of tuples returned by Query 2. Also assume you are given that the size of Query 1 is 10^4 tuples. (Hint: Estimate the selectivities of Boats.size > 5 and Reserves.day = 'July 4, 2003'.)

Q2.2. (4 points) Draw all possible left-deep join query trees for Query 1.

Q2.3. (12 points) For the first join in each query plan you get in Q2.2 (i.e. the join at the bottom of the tree), what join algorithm would work best (i.e. cost the least)? Assume that you have 50 pages of buffer memory. There are no indexes, so indexed nested loop is not an option. Consider the Page-Oriented Nested Loop Join (PNJ), Block-NJ, Sort-Merge-J, and Hash-J. Make sure you also write down the formula you use for each case (check slides).

Q3. Query Optimization (Hands-on) [25 points]

Consider the following two relations:
employees(emp_no, birth_date, first_name, last_name, gender, hire_date);
emp_salaries(emp_no, salary, from_date, to_date);

The employees table has the id of the employee, birth date, first name, last name, gender (M or F) and hire_date. Each employee could have many salaries over their career. These salaries are stored in the emp_salaries table, which has the emp_no, the salary and the date range during which the employee had this particular salary. Assume there are no existing indexes on these tables and both relations are stored as simple heap files.

We have created sample instances of these two tables on the cs4604.cs.vt.edu server. This is the first time you will be accessing the PostgreSQL server, so refer to the guidelines (do not wait till the HW due date to check your access to the server!):
http://people.cs.vt.edu/~badityap/classes/cs5614-fall14/homeworks/hw2/pgsql-instructions.pdf

Use the following commands to copy the tables to your private database:
    pg_dump -U YOUR-PID -t employees cs4604s14 | psql -d YOUR-PID
    pg_dump -U YOUR-PID -t emp_salaries cs4604s14 | psql -d YOUR-PID

Sanity check: run the following two statements and verify the output.
    select count(*) from employees; // output-count = 300024
    select count(*) from emp_salaries; // output-count = 2844047

For this question, it may help to familiarize yourself with the pg_class and pg_stats tables, provided by PostgreSQL as part of its catalog. Please see the links below:
http://www.postgresql.org/docs/8.4/static/view-pg-stats.html
http://www.postgresql.org/docs/9.3/static/catalog-pg-class.html

Now answer the following questions:

Q3.1. (5 points) Using a single SQL SELECT query on the pg_class table, find whether the relations have primary keys, the number of attributes and the number of disk pages occupied by each relation (employees and emp_salaries). Write down the SQL query you used and also paste the output.

Q3.2. (5x3=15 points) We want to find out whether the employees are mostly male or female, using the gender attribute in the employees table. We will do this in two different ways.
A. Write a SQL query that uses the employees table to find the most frequent gender among the employees. Paste the output too.

B. Now write a SQL query that uses only the pg_stats table instead to find the most frequent gender. Again paste the output.
C. Use the EXPLAIN ANALYZE command to find the total runtime and to analyze the execution plan that the PostgreSQL planner generates for the two queries from parts A and B. Also compare the outputs you get in parts A and B. What do you observe?

Q3.3. (2 points) Consider the following query that retrieves every salary that was granted after Feb 9, 1952. Note: You do not need to run anything on the server for this part.
    select * from emp_salaries where from_date > '1953-09-02';
Will a non-clustered B+-tree index on the from_date attribute help to speed up the retrieval? Explain in only 1 line.

Q3.4. (3 points) Create an index on your database that will decrease the runtime of the query in Q3.3. Write down the index definition and the time improvement.

Q4. Concurrency Control and Deadlocks [15 points]

Q4.1. (5 points) Consider the schedule below. R(.) stands for Read and W(.) for Write. Answer the following questions.
a. (1 point) Is the schedule serial?
b. (2 points) Give the dependency graph. Is the schedule conflict-serializable? If yes, give the equivalent serial schedule, else explain briefly.
c. (2 points) Could this schedule have been produced by 2PL? Either show the sequence of locks and unlocks obeying 2PL or explain why not.

Q4.2. (4 points) Consider the schedule below. R(.) stands for Read and W(.) for Write.

Answer the following questions:
a. (2 points) Could this schedule have been produced by 2PL? Either show a valid sequence of locks and unlocks obeying 2PL or explain why not.
b. (2 points) Could this schedule have been produced by strict 2PL? Either show a valid sequence of locks and unlocks obeying strict 2PL or explain why not.

Q4.3. (6 points) Consider the two lock schedules below. S(.) stands for a shared lock and X(.) denotes an exclusive lock.
Schedule 1    Schedule 2

Answer the following questions:
a. (2 points) For both schedules, which locks will be granted or blocked by the lock manager?
b. (2 points) Give the waits-for graphs for both schedules.
c. (2 points) Is there a deadlock at the end of Schedule 1? What about Schedule 2? For each, explain in 1 line.

Q5. Logging and Recovery [20 points]

Consider the log below. The records are of the form: (t-id, object-id, old-value, new-value).
Assumptions:
a. The PrevLSN has been omitted (it's easy to figure it out yourself);
b. For simplicity we assume that A, B, C and D each represent a page.

1. (T1, start)
2. (T1, A, 45, 10)
3. (T2, start)
4. (T2, B, 5, 10)
5. (T2, C, 35, 10)
6. (T1, D, 15, 5)
7. (T1, commit)
8. (T3, start)
9. (T3, A, 10, 15)
10. (T2, D, 5, 20)
11. (begin checkpoint, end checkpoint)
12. (T2, commit)
13. (T4, start)
14. (T4, D, 20, 30)
15. (T3, C, 10, 15)
16. (T3, commit)
17. (T4, commit)

Q5.1. (3x4=12 points) What are the values of pages A, B, C and D in the buffer pool after recovery? Also specify which transactions have been ReDo(ne) and which transactions have been UnDo(ne). You are not required to show the details of the intermediate steps.
a. If the system crashes just before line 6 is written to stable storage?
b. If the system crashes just before line 10 is written to stable storage?
c. If the system crashes just before line 13 is written to stable storage?
d. If the system crashes just before line 17 is written to stable storage?
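
The redo/undo reasoning asked for in Q5.1 can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name recover and the toy log are hypothetical (the toy log is deliberately not the log above), checkpoints are ignored, every logged update in the surviving prefix is redone ("repeat history"), and the updates of transactions without a commit record are then undone newest-first.

```python
# Hypothetical toy log, NOT the assignment log. Record shapes follow the
# handout: (t-id, "start"), (t-id, "commit"), (t-id, page, old, new).
TOY_LOG = [
    ("T1", "start"),
    ("T1", "X", 1, 2),
    ("T2", "start"),
    ("T2", "Y", 5, 6),
    ("T1", "commit"),
    ("T2", "X", 2, 9),
    ("T2", "commit"),
]


def recover(log, crash_before):
    """Simulate recovery when the crash happens just before record number
    `crash_before` (1-based) reaches stable storage.

    Returns (committed transactions, loser transactions, page values).
    Assumption: no checkpoints, and nothing was flushed before the crash.
    """
    prefix = log[: crash_before - 1]
    committed = {rec[0] for rec in prefix if rec[1] == "commit"}
    started = [rec[0] for rec in prefix if rec[1] == "start"]
    losers = [t for t in started if t not in committed]

    pages = {}
    # Redo pass: repeat history by re-applying every logged update in order.
    for rec in prefix:
        if len(rec) == 4:
            _tid, page, _old, new = rec
            pages[page] = new
    # Undo pass: roll back the losers' updates, newest first.
    for rec in reversed(prefix):
        if len(rec) == 4 and rec[0] in losers:
            _tid, page, old, _new = rec
            pages[page] = old
    return sorted(committed), losers, pages
```

For example, recover(TOY_LOG, 7) treats T1 as committed and T2 as a loser, so T2's writes to X and Y are rolled back to the values they had before T2 touched them.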

Q5.2. (8 points) Assume that only the crash listed in Q5.1c has actually happened, that a recovery has then been performed, and that the dirty pages caused by T1 were flushed to disk before line 8.
a. (2 points) Show the content of the transaction table and the dirty page table at both the beginning and the end of the Analysis phase. You may assume that both the transaction table and the dirty page table are empty at the beginning of line 1 of the log.
b. (6 points) Show the content of the log when the recovery has completed.
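
The bookkeeping behind the Analysis phase in Q5.2a can be sketched as follows. This is a simplified illustration, not the exact procedure from class: the function name analysis_pass and the toy log are hypothetical, the scan starts from empty tables at the first record (no checkpoint handling), page flushes are not modeled, and a commit record is assumed to remove the transaction from the transaction table because the log format here has no separate end records.

```python
# Hypothetical toy log prefix that survived a crash, NOT the assignment log.
# The LSN of a record is taken to be its 1-based position in the list.
SURVIVING_LOG = [
    ("T1", "start"),
    ("T1", "X", 1, 2),
    ("T2", "start"),
    ("T2", "Y", 5, 6),
    ("T1", "commit"),
    ("T2", "X", 2, 9),
]


def analysis_pass(log):
    """Scan the log forward and rebuild the two recovery tables.

    Returns (transaction_table, dirty_page_table) where
      transaction_table maps t-id -> LSN of its most recent log record, and
      dirty_page_table maps page  -> recLSN, the first LSN that dirtied it.
    """
    transaction_table = {}
    dirty_page_table = {}
    for lsn, rec in enumerate(log, start=1):
        tid, kind = rec[0], rec[1]
        if kind == "commit":
            # Assumption: commit also ends the transaction.
            transaction_table.pop(tid, None)
        elif kind == "start":
            transaction_table[tid] = lsn
        else:
            # Update record: `kind` holds the page id here.
            transaction_table[tid] = lsn
            dirty_page_table.setdefault(kind, lsn)
    return transaction_table, dirty_page_table
```

On the toy log, the pass ends with only T2 in the transaction table (last LSN 6), and the dirty page table records X with recLSN 2 and Y with recLSN 4. Transactions left in the table at the end of the scan are the ones the Undo phase has to roll back.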