Jordan University of Science & Technology Computer Science Department CS 728: Advanced Database Systems Midterm Exam First 2009/2010

Jordan University of Science & Technology Computer Science Department CS 728: Advanced Database Systems Midterm Exam First 2009/2010 Student Name: ID: Part 1: Multiple-Choice Questions (17 questions, 1 point each) 1. Select the TRUE statement concerning normalization. a. Performance is increased. b. Data consistency is improved. c. Redundancy is increased. d. The number of tables is reduced. e. Functional dependencies increase. 2. A lack of normalization can lead to which one of the following problems: a. Deadlock b. Lost Updates c. Insertion problems d. Deferred updates e. Deletion of data 3. Each of the following is an argument which might be used to support the use of relations which are not fully normalized. Select the weakest argument. a. A fully normalized database may have too many tables b. Full normalization may compromise existing applications/systems c. Full normalization may make some queries too complicated d. A fully normalised database may result in tables which are too large e. A fully normalized database may perform too slowly 4. If a non-key attribute of a table can be null, that table automatically violates which normal form (choose the lowest one): NONE 1NF 2NF 3NF BCNF 4NF 5NF 5. If an attribute of a table can have multiple values, that table automatically violates which normal form (choose the lowest one): NONE 1NF 2NF 3NF BCNF 4NF 5NF 6. Given the following relation and dependences, state which normal form the relation is in. R(p,q,r,s,t) p,q -> r,s,t r,s -> p,q,t t -> s a. 1NF b. 2NF c. 3NF d. BCNF e. Unnormalized

7. Which of the following is the highest normal form by which the R relation can be classified? R(patient, consultant, hospital, address, date, time) Given patient, consultant -> hospital, address, date, time hospital -> address a. 1NF b. Unnormalised c. 2NF d. BCNF e. 3NF 8. Assume the relation R(A, B, C,D, E) is in at least 3NF. Which of the following functional dependencies must be FALSE? a. A, B -> C b. A, C -> E c. A, B -> D d. C, D -> E e. None of the above 9. Consider the relational schema R(A,B,C,D,E) with non-key functional dependencies C,D -> E and B -> C. Select the strongest statement that can be made about the schema R a. R is in third normal form b. R is in second normal form c. R is in BCNF normal form d. R is in first normal form e. None of the above 10. Consider the following functional dependencies: a, b -> c, d e -> c b -> e, f Given the same functional dependencies as shown above, which option shows the relations normalized to 3NF of: R(a, b, c, d, e, f) a. R(a,b,c,d,e,f) R(e,c) R(b,e,f) b. R(a,b,c,d) R(c,e) R(e,f,b) c. R(a,b,c,d) R(c,e) R(b,e,f) d. R(a,b,d) R(e,c) R(b,e,f) e. R(a,b,c,d,e,f)

11. Given the following relation and dependencies, select the option that is the result of fully normalizing the relation to BCNF. R(a,b,c,d,e) a -> c d -> c,e a. R1(a,c) R2(d,e) R(a,b,d) b. R1(a,c) R2(d,c,e) R(a,b,d) c. R1(d,c,e) R(a,b,d) d. R(a,b,c,d,e) e. None of the above 12. If a relation schema contains two different multi-valued dependencies, that table automatically violates which normal form (choose the lowest one): NONE 1NF 2NF 3NF BCNF 4NF 5NF 13. In spatial databases, what does MBR refer to? a. Minimum Box Rectangle b. Minimum Box Region c. Minimum Bounding Region d. Minimum Bounding Rectangle e. Minimum Bounding Box 14. The two steps (in order) that are performed when answering a spatial query using MBR are a. filtering and refinement steps b. refinement and filtering steps c. filtering and screening steps b. screening and filtering steps 15. The first step(s) in the knowledge discovery process is/are a. data selection b. visualization c. data mining d. data transformation e. data cleaning and integration 16. The measure of interestingness used by association rule mining that is based on how frequently an itemset appears in data is called a. correlation b. confidence c. support d. importance e. none of the above 17. Association rule mining is an example of a. Descriptive data mining b. Unsupervised learning c. Predictive data mining d. Learning by example e. NOA

1. Part 2: Essay Questions (83 points all) a. Is the following decomposition of Book lossless? Why or why not? 4 R1(A, T, P), R2(I, P, Y ), R3(I, T) I A T P Y R 1 b 11 b 12 b 13 b 14 b 15 R 2 b 21 b 22 b 23 b 24 b 25 R 3 b 31 b 32 b 33 b 34 b 35 TP I AP T I ATP TP I AP T I ATP I A T P Y R 1 b 11 a 2 a 3 a 4 b 15 R 2 a 1 b 22 b 23 a 4 a 5 R 3 a 1 b 32 a 3 b 34 b 35 No changes to table S No changes to table S Table S is changed to I A T P Y R 1 b 11 a 2 a 3 a 4 b 15 R 2 a 1 b 22 a 3 a 4 a 5 R 3 a 1 b 22 a 3 a 4 b 35 Table S is changed to I A T P Y R 1 a 1 a 2 a 3 a 4 b 15 R 2 a 1 b 22 a 3 a 4 a 5 R 3 a 1 b 22 a 3 a 4 b 35 No changes to table S Table S is changed to I A T P Y R 1 a 1 a 2 a 3 a 4 b 15 R 2 a 1 a 2 a 3 a 4 a 5 R 3 a 1 a 2 a 3 a 4 b 35 Since row 2 is all "a" symbols, then the decomposition is lossless.

b. Does the decomposition in a preserve all functional dependencies in F +? If so, simply state Yes. If not, give a functional dependency that is not preserved. 4 No, because TP I is not preserved. Here is why. Restrictions of F + to each relation in the decomposition: R 1 = ATP: {AP T; TP A} R 2 = IPY: {I P} R 3 = IT: {I T} Testing FD preservation: test each FD in F to see if it is implied by the restricted FDs above. TP I: (TP) + = TPA so TP I is not preserved. AP T: (AP) + = APT so AP T is preserved. I ATP: I + = IATP so I ATP is preserved. Another justification: F + = {TP I, TP A, AP T, I A, I T, I P} (π R1 (F) π R2 (F) π R3 (F)) + = (AP T, TP A, I P, I T) + {AP T, TP A, I P, I T, I A} F + c. Consider replacing F by an alternative set of functional dependencies G: 4 AP-> IT I -> ATP Is the effect of G the same as the effect of F? In other words, does G + = F +? Why or why not? Under these FDs (G), we have TP + = TP, so TP I is not implied, hence G + F +. (TP) + (AP) + I + Under F = TPIA = APIT = IATP (TP) + (AP) + I + Under G = TP = APIT = IATP 2.

a. Consider the following database instance D 1 of R: 4 Is D 1 consistent with the dependencies specified above? Why or why not? No, because the instant D 1 violates (does not preserve) the dependency S D. We have Novell (S) paying dividends (D) $0.05 and $0.10. b. Give a lossless decomposition of R into Boyce-Codd Normal Form. 4 The Key of R is IS, since (IS) + = R. F + = F {I O} Solution I: 1. Decompose R by I B into R 1 = IB and R 2 = IOSQD. 2. R 1 is in BCNF. 3. Decompose R 2 by S D into R 3 = SD and R 4 = IOSQ. 4. R 3 is in BCNF. 5. Decompose R 4 by I O into R 5 = IO and R 6 = ISQ. 6. R 5 is in BCNF. 7. R 6 is in BCNF. Solution II: {BO, IB, SD, ISQ}, which does preserve functional dependencies. c. Does your answer to Question b preserve all given and implied functional dependencies? If No, state which dependencies that are not preserved. 4 Solution I: does not preserve FD B O Solution II: does preserve all FDs 3. Consider the following function dependencies for a relation R(X, Y, Z, W): 4 WX -> Y X -> Z Z -> WY Give a derivation of X -> Y from the given functional dependencies above. Justify your steps with Armstrong s axioms. Z -> WY Z -> Y X -> Z X -> Y

4. Consider a relation R(A, B, C, D, E, F, G, H) with Functional Dependencies 21 AB E C D F GH B F (a) Consider R 1 (A, B, C, D) with the above functional dependencies. What would be a candidate key for R 1? Solution I: A + = A Not a key B + = B Not a key C + = CD Not a key (AB) + = AB Not a key (ABC) + = ABCD = R 1 is a key Solution II: Assum that the key K is ABCD K - D ABCD = R 1 K - DC AB R 1 K - DA BCD R 1 (b) (c) (d) K - DB ACD R 1 So the key is K D = ABC Is R 1 (A, B, C, D) in second normal form? If not, say why. No, because of C D (partial dependency), D is not fully dependent on the key ABC. Is R 1 (A, B, C, D) and R 2 (A, B, E) a lossless join decomposition of R (A, B, C, D, E)? Explain why or why not. Since the decomposition is binary, we can test the lossless property using the following rules: If R 1 R 2 R 1 - R 2 or R 2 - R 1 then it is lossless. Since R 1 R 2 = AB, R 2 - R 1 = E, and AB E, then decomposition is lossless. Is R 1 (A, B, C, D), R 2 (A, B, E), R 3 (F, G, H) and R 4 (B, F) a dependency preserving decomposition? Explain why or why not. Yes. AB E is covered by R 2

C D is covered by R 1 F GH is covered by R 3 B F is covered by R 4 Another justification: F + = {AB E, C D, F GH, B FGH} (π R1 (F) π R2 (F) π R3 (F) π R4 (F)) + = (C D, AB E, F GH, B F) + (e) {AB E, C D, F GH, B FGH} = F + Now start with these functional dependencies: AB E C D F GH FG GH B FG Find a minimal covering of these functional dependencies. Then use it to synthesize a set of dependency preserving, 3NF relations with a lossless join. I.e. use algorithm 11.4 from the text book. Minimal Covering: Step 1. Write the FDs with singleton RHSs AB E C D F G FG G F H FG H B F B G Step 2. Remove redundant LHS attributes. FG G becomes F G FG H becomes F H Step 3. Remove redundant FDs Remove one copy of F G Remove one copy of F H Also remove B G The minimal covering is {AB E, C D, F G, F H, B F} Now on to the 3NF algorithm: Step 1 is done. Step 2, write down relations: R1(A, B, E), R2(C, D), R3(F, G, H) and R4(B, F) Step 3, no relation is a key for the whole thing. Add another relation R5(A, B, C)

5. List three spatial data types. 3 a. point b. line c. region 6. What are the three spatial operators? Give an example of each operator. 6 a. Topological operators: Overlap b. Direction operators: North c. Distance operators: Near 7. List two types of spatial queries. Give an example of each one. 4 a. Spatial Range Queries Find all cities within 50 miles of Madison Find me buildings that are adjacent to the Railway Stations? b. Nearest-Neighbor Queries Find the 10 cities nearest to Madison Find me the nearest fire station to Clementi Ave. 3? c. Window Range Query Find me data points that satisfy the conditions x1 < A1 < x2, y1 <A2 <y2? d. Spatial Join Queries: Find all cities near a lake 8. Any member of R-Tree family has a set of characteristics. List 3 of these characteristics. 3 it is a height-balanced tree structure it is based on MBR approximation of spatial objects it guarantees that the storage utilization is at least 50% it takes paging into account

9. Draw the R-tree for the following set of rectangles. 3 B 1 E 4 2 F 3 6 5 11 12 G 10 7 8 H 9 C A: B C B: E F C: G H E F G 4 5 6 1 2 3 10 11 1 2 H 7 8 9

10. Consider the following set of spatial objects and the corresponding R-tee. How many pages (nodes, I/O) are read in order to search for object #5? 3 1 a 6 b 2 5 7 3 4 8 9 c d x 10 12 13 11 f 14 y e g 15 16 x y a b c d e f g 1,2,3 6,7 4,5 8,9 10, 11 12, 13, 14 15, 16 4 pages 11. What is the main difference between classification and clustering? 3 Clustering (Unsupervised learning) vs. Classification (Supervised learning) No prior knowledge (Number of clusters, Meaning of clusters) 12. List three factors that affect classification based on decision trees. 3 Choosing Splitting Attributes Ordering of Splitting Attributes Splits Predicates Tree Structure Stopping Criteria Training Data 13. List three methods used to represent the distance between clusters 3 Single/Complete/Average Link Centroid Medoid 14. Consider the following set of transactions. What is the support of PeanutButter & Jelly? 3 1/5 = 20%