Homework 3: Normalization, Indexing SOLUTION

CS 461, Database Systems, Spring 2015

Problem 1 (25 points): Normalization

Consider relation R(ABCD) together with the following set of FDs: AB → C; D → B; AC → D. Answer each question below and carefully justify your answer.

(a) (5 points) List all candidate keys of relation R.

This relation has 3 candidate keys: AB, AC and AD. This is because {AB}+ = {ABCD}, {AC}+ = {ABCD} and {AD}+ = {ABCD}, and no proper subset of any of these sets determines all attributes of R.

(b) (5 points) Does AD → B follow from the set of FDs AB → C; D → B; AC → D?

Yes. To check this, we must check whether B is in the closure of {AD}. We know that this is the case because, as we saw in (a), {AD} is a candidate key of R, and so all attributes are in the closure of {AD}.

(c) (5 points) Is relation R in 3NF? Is relation R in BCNF? Justify your answer.

R is in 3NF. The FDs AB → C and AC → D each have a candidate key on the left. The final FD, D → B, has B on the right, and B is part of the candidate key {AB}.

R is not in BCNF. The FD D → B violates BCNF because it is non-trivial and D is not a superkey.

(d) (10 points) Decompose R into BCNF, underlining the key for each relation in the decomposition. Show the projected dependencies for each relation. Is this decomposition dependency-preserving?

ABCD is decomposed on the FD D → B into R1(ACD), with keys AC and AD and FDs AC → D and AD → C, and R2(DB), with key D and FD D → B. Both R1 and R2 are in BCNF, so decomposition stops. This decomposition is not dependency-preserving, because FD AB → C is not enforced.
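The closure computations used in Problem 1 can be checked mechanically. The following is a minimal sketch, not part of the original solution: the helper names `closure` and `candidate_keys` are hypothetical, and FDs are encoded as (left side, right side) pairs of attribute strings. It enumerates attribute subsets by brute force, which is only reasonable for small schemas like this one.

```python
from itertools import combinations

def closure(attrs, fds):
    """Return the closure of a set of attributes under a list of (lhs, rhs) FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Fire an FD when its whole left side is already in the closure.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def candidate_keys(schema, fds):
    """Enumerate the minimal attribute sets whose closure is the whole schema."""
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            if any(set(k) <= set(combo) for k in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(combo, fds) == set(schema):
                keys.append("".join(combo))
    return keys

# Problem 1: R(ABCD) with AB -> C, D -> B, AC -> D
FDS_P1 = [("AB", "C"), ("D", "B"), ("AC", "D")]

keys_p1 = candidate_keys("ABCD", FDS_P1)
print(keys_p1)                       # candidate keys, part (a)
print("B" in closure("AD", FDS_P1))  # does AD -> B follow? part (b)
print(closure("D", FDS_P1))          # D is not a superkey, part (c)
```

Because subsets are tried in order of increasing size, any set that contains an already-found key is skipped, so only minimal keys are reported.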

Problem 2 (20 points): Normalization continued

(a) (10 points) Consider relation R(WXYZ) with the following set of FDs: Y → Z; YZ → W; WX → Y; XZ → W. Give a decomposition of R into BCNF, underlining the key for each relation in the decomposition. Show the projected dependencies for each relation. Is this decomposition dependency-preserving?

First, we must determine the candidate keys of this relation. Since no FD has X on the right, X must be part of every candidate key. It turns out that all two-element sets that include X, namely XY, XZ and XW, are candidate keys of R.

Next, we check which FDs violate BCNF. There are two such FDs: Y → Z and YZ → W. However, the second FD is not in minimal form: because Y → Z, Z can be removed from its left-hand side with no effect on attribute closures. Therefore, rather than considering YZ → W, we will consider Y → W. Since two FDs violate BCNF, we show two decompositions; one is sufficient for full credit.

Option 1: Decomposing on Y → Z, we get:

o R1(XYW) with candidate keys XW and XY, and FDs XW → Y, XY → W and Y → W. (Underlining only one of the two keys.) R1 is not in BCNF; Y → W is the offending FD. We further decompose R1 as follows:
   o R3(YW) with key Y and FD Y → W. This relation is in BCNF.
   o R4(XY) with key XY. This relation is in BCNF.
o R2(YZW) with candidate key Y and FDs Y → Z and Y → W. Note that R2 contains attribute W in addition to Y and Z, since W is in the closure of Y w.r.t. the original FDs. R2 is in BCNF since Y is a candidate key.

This decomposition is not dependency-preserving, since FDs XZ → W and WX → Y are lost.

Option 2: Decomposing on Y → W, we get:

o R1(XYZ) with candidate keys XY and XZ, and FDs Y → Z, XY → Z and XZ → Y. (Underlining only one of the two keys.) R1 is not in BCNF: Y → Z also holds on R1, and Y is not a superkey of R1. We further decompose R1 into (YZ), with key Y and FD Y → Z, and (XY), with key XY. Both of these relations are in BCNF.
o R2(YZW), see Option 1 for keys and FDs. R2 is in BCNF.

This decomposition is not dependency-preserving, since FDs XZ → W and WX → Y are lost.
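The claim about lost dependencies can be verified without computing the projected FDs explicitly. The sketch below is an illustration rather than part of the original solution. It applies the standard dependency-preservation test to the final relations of Option 1, that is R3(YW), R4(XY) and R2(YZW). The function names and the string encoding of relations are assumptions made for this example.

```python
def closure(attrs, fds):
    """Return the closure of a set of attributes under a list of (lhs, rhs) FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_preserved(fd, decomposition, fds):
    """Check whether fd = (lhs, rhs) follows from the FDs projected onto the decomposition."""
    lhs, rhs = fd
    reachable = set(lhs)
    changed = True
    while changed:
        changed = False
        for rel in decomposition:
            rel_attrs = set(rel)
            # Attributes derivable inside this relation from what we already have.
            gained = closure(reachable & rel_attrs, fds) & rel_attrs
            if not gained <= reachable:
                reachable |= gained
                changed = True
    return set(rhs) <= reachable

# Problem 2(a): R(WXYZ) with Y -> Z, YZ -> W, WX -> Y, XZ -> W
FDS_P2 = [("Y", "Z"), ("YZ", "W"), ("WX", "Y"), ("XZ", "W")]
OPTION_1 = ["YW", "XY", "YZW"]  # R3, R4, R2 from Option 1

lost = [f"{lhs}->{rhs}" for lhs, rhs in FDS_P2
        if not is_preserved((lhs, rhs), OPTION_1, FDS_P2)]
print(lost)
```

An FD is reported as lost when its right side cannot be reached by repeatedly taking closures restricted to one relation at a time.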

(b) (10 points) Consider relation R(ABCD) with the following set of FDs: C → B; A → B; CD → A; BCD → A. Decompose R into 3NF, underlining the key for each relation in the decomposition. Show the projected dependencies for each relation.

First, we compute candidate keys for R. Since no FD has either C or D on the right, both of these attributes must be part of every candidate key. In fact, {CD} is the only candidate key of R, since {CD}+ = {ABCD}.

R is not in 3NF, since FDs C → B and A → B violate this normal form: neither left-hand side is a superkey, and B is not part of any candidate key.

To find a 3NF decomposition, we compute a minimal basis of the set of FDs. The last FD, BCD → A, can be dropped, since it is redundant given CD → A. We create a 3NF decomposition with one relation per remaining FD: R1(CB) with key C and FD C → B, R2(AB) with key A and FD A → B, and R3(CDA) with key CD and FD CD → A. Since R3 contains the candidate key CD of R, we don't need to add any more relations to the decomposition, and we are done.

Problem 3 (20 points): External sorting

Consider a file in which there are 10,000 records, each record is 1KB in size. Further, suppose that the size of a block is 64KB.

(a) (10 points) How many passes will be required to sort this file using two-way external merge-sort? What is the total I/O cost of sorting this file?

In this dataset, there are ceil(10,000 / 64) = 157 pages that must be sorted. In two-way external merge-sort, we use 1 memory block in pass 0 (each 64-record block is sorted on its own), and 3 memory blocks in subsequent passes (pairs of adjacent sorted runs are merged). To sort 157 pages, we will need 1 + ceil(log2(157)) = 1 + 8 = 9 passes. Each page is read and written once on each pass (2 I/Os per page per pass). Thus, the total cost of two-way external merge-sort on this dataset is 2 * 157 * 9 = 2,826 I/Os.

(b) (10 points) Suppose now that we have 320KB of memory at our disposal. How many passes will be required to sort this file using generalized external merge-sort? What is the total I/O cost of sorting this file?
In pass 0 of generalized external merge-sort, we read in and sort 320KB (5 pages worth) at a time, creating ceil(157/5) = 32 sorted runs of up to 5 blocks each. Then in subsequent passes we merge 5 - 1 = 4 neighboring runs at a time. We need ceil(log4(32)) = 3 merge passes to complete sorting. That's a total of 4 passes, with 2 I/Os per page per pass, for a total of 2 * 157 * 4 = 1,256 I/Os, a significant reduction compared to (a).

Problem 4 (25 points): Indexing

Consider the following relation:

Sailors (id: integer; name: string; rating: integer; age: integer)

Ids range from 0 to 100,000, ratings range from 1 to 10, ages range from 20 to 80. You can assume uniform distributions of age and rating values, that is, all values of age and rating are equally likely and are uncorrelated. The Sailors relation is stored on disk as a sorted file, sorted on id. There are 100,000 records in this file, 1,000 per disk page, for a total of 100 disk pages.

Suppose that the following access paths are available, and that all indexes are unclustered.

o No index
o Hash index on (id)
o Hash index on (age)
o Hash index on (age, rating)
o Hash index on (name, age, rating)
o B+-tree index on (name, age, rating)
o B+-tree index on (age, rating)

For each query below, decide which access path you will use to speed up the query, and briefly explain why.

(a) (5 points) Print name, age, rating of all sailors.

The B+-tree index on (name, age, rating) contains all the required information. The leaf level of this index can be traversed, and assuming that the index fits in memory, no disk pages will need to be retrieved at all.

(b) (5 points) Print name, age and rating of the sailor with id = 123.

The hash index on (id) should be used. This index is on the primary key, so at most 1 record will match the query, and if a record does match, we will retrieve exactly 1 page from disk.

(c) (5 points) Count the number of sailors with rating = 5 and age < 40.

We can use the unclustered B+-tree index on (age, rating) to answer this query. The leaf level of the index will contain all the relevant data entries, and we will be able to count the number of records without retrieving any pages from disk.

(d) (5 points) Count the number of sailors with rating = 5.

Either of the B+-tree indexes can be used for this operation. The condition rating = 5 does not match either index, since rating is not a prefix of either (name, age, rating) or (age, rating), so we cannot use the indexes to look up records with rating = 5 directly. However, we can traverse the leaf level of either index, filter the data entries on rating = 5 in memory, and compute the count of the matching record identifiers. Assuming that the index fits in memory, this operation will incur no I/Os.

(e) (5 points) Print name, age and rating of sailors with rating < 5 and age < 40.

We could use the B+-tree index on (age, rating) to answer this query. However, because the index is unclustered, and because it does not contain all the information needed to answer this query (sailor name is missing), we have to be careful not to incur more disk I/Os than a sequential scan would. About 40% of the records have rating < 5, and about 30% have age < 40. Since the attributes are uncorrelated, we expect about 12% of the records to match both conditions. That's 12,000 records. Accessing these records using an unclustered index can incur up to 12,000 I/Os, one per matching record. In contrast, a full scan of the relation will incur 100 I/Os, one per disk page. Therefore, it is more efficient not to use the index in this case, and to access the file sequentially instead.
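The comparison in (e) can be reproduced with a few lines of arithmetic. This is a rough sketch under the solution's own assumptions: the selectivities are the approximate 40% and 30% figures quoted above, the file size is the 100 pages stated in the problem setup, and the unclustered index is charged one data-page I/O per matching record (a worst-case estimate). The constant and function names are chosen for this example only.

```python
import math
from fractions import Fraction

NUM_RECORDS = 100_000
RECORDS_PER_PAGE = 1_000

# Selectivity estimates taken from the solution text (assumed, approximate).
RATING_SELECTIVITY = Fraction(4, 10)  # rating < 5: 4 of the 10 rating values
AGE_SELECTIVITY = Fraction(3, 10)     # age < 40: about 30% of sailors

def full_scan_cost(num_records, records_per_page):
    """A sequential scan reads every data page exactly once."""
    return math.ceil(num_records / records_per_page)

def unclustered_index_cost(num_matching):
    """Worst case for an unclustered index: one data-page I/O per matching record."""
    return num_matching

# Uncorrelated attributes, so the selectivities multiply.
matching = int(NUM_RECORDS * RATING_SELECTIVITY * AGE_SELECTIVITY)
index_ios = unclustered_index_cost(matching)
scan_ios = full_scan_cost(NUM_RECORDS, RECORDS_PER_PAGE)
use_index = index_ios < scan_ios

print(matching, index_ios, scan_ios, use_index)
```

Under this cost model the unclustered index only pays off when fewer records match than there are pages in the file, which is far from the case for this query.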