Homework 3: Normalization, Indexing SOLUTION

CS 461, Database Systems, Spring 2015

Problem 1 (25pts): Normalization

Consider relation R(ABCD) together with the following set of FDs: AB → C; D → B; AC → D. Answer each question below and carefully justify your answer.

(a) (5 points) List all candidate keys of relation R.

This relation has 3 candidate keys: AB, AC and AD. This is because {AB}+ = {ABCD}, {AC}+ = {ABCD} and {AD}+ = {ABCD}. Attribute A appears on the right-hand side of no FD, so it must be part of every key, and {A}+ = {A}, so A alone is not a key.

(b) (5 points) Does AD → B follow from the set of FDs AB → C; D → B; AC → D?

Yes. To check this, we must check whether B is in the closure of {AD}. We know that this is the case because, as we saw in (a), {AD} is a candidate key of R, and so all attributes are in the closure of {AD}.

(c) (5 points) Is relation R in 3NF? Is relation R in BCNF? Justify your answer.

R is in 3NF. The FDs AB → C and AC → D each have a candidate key on the left. The remaining FD, D → B, has B on the right, and B is part of the candidate key {AB}. R is not in BCNF: the FD D → B violates BCNF because it is non-trivial and D is not a superkey.

(d) (10 points) Decompose R into BCNF, underlining the key for each relation in the decomposition. Show the projected dependencies for each relation. Is this decomposition dependency-preserving?

ABCD is decomposed on the FD D → B into R1(ACD), with keys AC and AD and FDs AC → D and AD → C, and R2(DB), with key D and FD D → B. Both R1 and R2 are in BCNF, so decomposition stops. This decomposition is not dependency-preserving, because FD AB → C is not enforced.
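The closure computations used in Problem 1 can be checked mechanically. The following is a minimal sketch, not part of the original solution: the helper names `closure` and `candidate_keys` are illustrative, and FDs are written as (left-hand side, right-hand side) pairs of attribute strings.

```python
from itertools import combinations

def closure(attrs, fds):
    """Return the attribute closure of attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Fire the FD if its left side is covered and it adds something new.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def candidate_keys(schema, fds):
    """Enumerate minimal attribute sets whose closure is the whole schema."""
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            # Skip supersets of keys already found: they are not minimal.
            if any(set(k) <= set(combo) for k in keys):
                continue
            if closure(combo, fds) == set(schema):
                keys.append("".join(combo))
    return keys

# FDs from Problem 1: AB -> C, D -> B, AC -> D
FDS = [("AB", "C"), ("D", "B"), ("AC", "D")]

keys = candidate_keys("ABCD", FDS)
print(keys)                          # ['AB', 'AC', 'AD']
print(sorted(closure("AD", FDS)))    # ['A', 'B', 'C', 'D'], so AD -> B holds
print(sorted(closure("D", FDS)))     # ['B', 'D'], so D is not a superkey
```

Enumerating subsets by increasing size is exponential in the number of attributes, which is acceptable for four-attribute homework relations but not for wide schemas.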

Problem 2 (20 points): Normalization continued

(a) (10 points) Consider relation R(WXYZ) with the following set of FDs: Y → Z; YZ → W; WX → Y; XZ → W. Give a decomposition of R into BCNF, underlining the key for each relation in the decomposition. Show the projected dependencies for each relation. Is this decomposition dependency-preserving?

First, we must determine the candidate keys of this relation. We start by observing that, since no FD has X on the right, X must be part of every candidate key. It turns out that all two-element sets that include X, namely XY, XZ and XW, are candidate keys of R.

Next, we check which FDs violate BCNF. There are two such FDs: Y → Z and YZ → W. However, note that the second FD is not part of a minimal cover of the FDs: Z can be removed from its left-hand side with no effect on attribute closures. Therefore, rather than considering YZ → W, we will consider Y → W. Since two FDs violate BCNF, we show two decompositions; one is sufficient for full credit.

Option 1: Decomposing on Y → Z, we get:

R1(XYW) with candidate keys XW and XY, and FDs XW → Y, XY → W and Y → W. (Underlining only one of the two keys.) R1 is not in BCNF; Y → W is the offending FD. We further decompose R1 as follows:
  o R3(YW) with key Y and FD Y → W; this relation is in BCNF.
  o R4(XY) with key XY; this relation is in BCNF.

R2(YZW) with candidate key Y and FDs Y → Z and Y → W. Note that R2 contains attribute W in addition to Y and Z, since W is in the closure of Y w.r.t. the original FDs. R2 is in BCNF since Y is a candidate key.

This decomposition is not dependency-preserving, since FDs XZ → W and WX → Y are lost.

Option 2: Decomposing on Y → W, we get:

R1(XYZ) with candidate keys XY and XZ, and FDs XY → Z, XZ → Y and Y → Z. (Underlining only one of the two keys.) R1 is not in BCNF: the FD Y → Z also projects onto R1, and Y is not a superkey of R1. We further decompose R1 as follows:
  o R3(YZ) with key Y and FD Y → Z; this relation is in BCNF.
  o R4(XY) with key XY; this relation is in BCNF.

R2(YZW); see Option 1 for keys and FDs. R2 is in BCNF.

This decomposition is not dependency-preserving, since FDs XZ → W and WX → Y are lost.
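The BCNF violations among the given FDs and the FDs lost by the Option 1 decomposition can be verified with a short script. This is a sketch under stated assumptions rather than part of the original solution: the function names are hypothetical, and the preservation test is the standard textbook procedure that grows the left-hand side using only closures restricted to each decomposed relation.

```python
def closure(attrs, fds):
    """Return the attribute closure of attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def bcnf_violations(schema, fds):
    """Non-trivial FDs whose left-hand side is not a superkey of schema."""
    return [(lhs, rhs) for lhs, rhs in fds
            if not set(rhs) <= set(lhs) and closure(lhs, fds) != set(schema)]

def is_preserved(fd, decomposition, fds):
    """Check whether fd can be enforced using only the projected FDs."""
    lhs, rhs = fd
    z = set(lhs)
    changed = True
    while changed:
        changed = False
        for rel in decomposition:
            # Attributes derivable from z while staying inside one relation.
            gained = closure(z & set(rel), fds) & set(rel)
            if not gained <= z:
                z |= gained
                changed = True
    return set(rhs) <= z

# FDs from Problem 2(a): Y -> Z, YZ -> W, WX -> Y, XZ -> W
FDS = [("Y", "Z"), ("YZ", "W"), ("WX", "Y"), ("XZ", "W")]

violations = bcnf_violations("WXYZ", FDS)
print(violations)   # [('Y', 'Z'), ('YZ', 'W')]

# Final relations of Option 1: R3(YW), R4(XY), R2(YZW)
option1 = ["YW", "XY", "YZW"]
lost = [fd for fd in FDS if not is_preserved(fd, option1, FDS)]
print(lost)         # [('WX', 'Y'), ('XZ', 'W')]
```

The same `is_preserved` check can be pointed at any other candidate decomposition by changing the list of relation schemas.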

(b) (10 points) Consider relation R(ABCD) with the following set of FDs: C → B; A → B; CD → A; BCD → A. Decompose R into 3NF, underlining the key for each relation in the decomposition. Show the projected dependencies for each relation.

First, we compute candidate keys for R. Since no FD has either C or D on the right, both of these attributes must be part of every candidate key. In fact, {CD} is the only candidate key of R, since {CD}+ = {ABCD}. R is not in 3NF, since FDs C → B and A → B violate this normal form: neither left-hand side is a superkey, and B is not part of any candidate key.

To find a 3NF decomposition, we compute a minimal basis of the set of FDs. To do this, we observe that the last FD, BCD → A, can be dropped, since it is implied by CD → A. We create a 3NF decomposition with relations R1(CB) with key C, R2(AB) with key A, and R3(CDA) with key CD. Since R3 contains the candidate key CD of R, we don't need to add any more relations to the decomposition.

Problem 3 (20 points): External sorting

Consider a file with 10,000 records, each 1KB in size. Further, suppose that the size of a block is 64KB.

(a) (10 points) How many passes will be required to sort this file using two-way external merge-sort? What is the total I/O cost of sorting this file?

In this dataset, there are ceil(10,000 / 64) = 157 pages that must be sorted. In two-way external merge-sort, we use 1 memory block in pass 0 (each 64-record block is sorted), and 3 memory blocks in subsequent passes (pairs of adjacent sorted runs are merged). To sort 157 pages, we will need 1 + ceil(log2(157)) = 1 + 8 = 9 passes. Each page is read and written once on each pass (2 I/Os per page per pass). Thus, the total cost of two-way external merge-sort on this dataset is 2 * 157 * 9 = 2,826 I/Os.

(b) (10 points) Suppose now that we have 320KB of memory at our disposal. How many passes will be required to sort this file using generalized external merge-sort? What is the total I/O cost of sorting this file?
In pass 0 of generalized external merge-sort, we read in and sort 320KB (5 pages' worth) at a time, creating ceil(157/5) = 32 sorted runs of at most 5 blocks each. Then in subsequent passes we merge 5 - 1 = 4 neighboring runs at a time. We need ceil(log4(32)) = 3 merge passes to complete sorting. That's a total of 4 passes, with 2 I/Os per page per pass, for a total of 2 * 157 * 4 = 1,256 I/Os, a significant reduction compared to (a).

Problem 4 (25pts): Indexing

Consider the following relation:

Sailors (id: integer; name: string; rating: integer; age: integer)

Ids range from 0 to 100,000, ratings range from 1 to 10, ages range from 20 to 80. You can assume uniform distributions of age and rating values, that is, all values of age and rating are equally likely and are uncorrelated. The Sailors relation is stored on disk as a sorted file, sorted on id. There are 100,000 records in this file, 1,000 per disk page, for a total of 100 disk pages.

Suppose that the following access paths are available, and that all indexes are unclustered.

  No index
  Hash index on (id)
  Hash index on (age)
  Hash index on (age, rating)
  Hash index on (name, age, rating)
  B+-tree index on (name, age, rating)
  B+-tree index on (age, rating)

For each query below, decide which access path you will use to speed up the query, and briefly explain why.

(a) (5 points) Print name, age, rating of all sailors.

The B+-tree index on (name, age, rating) contains all the required information. This index can be traversed, and assuming that the index fits in memory, no disk pages will need to be retrieved at all.

(b) (5 points) Print name, age and rating of the sailor with id = 123.

The hash index on (id) should be used. This index is on the primary key, so at most 1 record will match the query, and if a record does match, we will retrieve exactly 1 page from disk.

(c) (5 points) Count the number of sailors with rating = 5 and age < 40.

We can use the unclustered B+-tree index on (age, rating) to answer this query. The leaf level of the index will contain all the relevant data entries, and we will be able to count the number of records without retrieving any pages from disk.

(d) (5 points) Count the number of sailors with rating = 5.

Either of the B+-tree indexes can be used for this operation. The condition rating = 5 does not match either index, since rating is not a prefix of either (name, age, rating) or (age, rating), so we cannot use the indexes to look up records with rating = 5 directly. However, we can traverse the leaf level of an index, filter the data entries on rating = 5 in memory, and compute the count of the matching record identifiers. Assuming that the index fits in memory, this operation will incur no I/Os.

(e) (5 points) Print name, age and rating of sailors with rating < 5 and age < 40.

We can use the B+-tree index on (age, rating) to answer this query. However, because the index is unclustered, and because it does not contain the complete information needed to answer this query (sailor name is missing), we have to be careful not to incur more disk I/Os than a sequential scan would. About 40% of the records have rating < 5, and about 30% have age < 40. Since the attributes are uncorrelated, we expect about 12% of the records to match both conditions. That's 12,000 records. Accessing these records using an unclustered index can incur up to 12,000 I/Os, one page fetch per matching record. In contrast, a full scan of the relation will incur 100 I/Os, one per disk page. Therefore, it is more efficient not to use the index in this case, and to access the file sequentially instead.
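The comparison in (e) can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the selectivities are the approximate 40% and 30% figures quoted in the answer, the file size is the 100,000 records at 1,000 per page given in the problem statement, and the index cost assumes the worst case of one page fetch per matching record.

```python
import math

NUM_RECORDS = 100_000      # from the problem statement
RECORDS_PER_PAGE = 1_000   # from the problem statement
RATING_PCT = 40            # "about 40%" of records have rating < 5
AGE_PCT = 30               # "about 30%" of records have age < 40

def matching_records(num_records, pct_a, pct_b):
    """Expected matches when the two predicates are uncorrelated."""
    # Integer arithmetic on percentages avoids floating-point rounding.
    return num_records * pct_a * pct_b // (100 * 100)

def unclustered_index_ios(num_matching):
    """Worst case for an unclustered index: one page fetch per match."""
    return num_matching

def full_scan_ios(num_records, records_per_page):
    """A sequential scan reads every data page exactly once."""
    return math.ceil(num_records / records_per_page)

matches = matching_records(NUM_RECORDS, RATING_PCT, AGE_PCT)
index_cost = unclustered_index_ios(matches)
scan_cost = full_scan_ios(NUM_RECORDS, RECORDS_PER_PAGE)

print(matches)      # 12000
print(index_cost)   # 12000
print(scan_cost)    # 100
print("use scan" if scan_cost < index_cost else "use index")  # use scan
```

The conclusion does not depend on the exact page count: the scan wins whenever the number of matching records exceeds the number of data pages.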