Normalization Page 1
Objectives, outcomes, and key concepts
Tuesday, January 6, 2015 11:45 AM

Objectives: give an overview of the normal forms and their benefits and problems.

Outcomes: students should be able to:
- Describe the normal forms and their requirements.
- Check whether a given database satisfies a given normal form.
- Normalize a given database -- given a real-world description and a starting schema -- into a given normal form.
- Describe situations in which normalization is not possible.

Key concepts and ideas:
- Ordering by strictness of requirements: 4NF ⊂ BCNF ⊂ 3NF
- Benefits of specific normal forms:
  - Elimination of anomalies
  - Preservation of dependencies
  - Recovery of the base relation
  - Elimination of redundancy
- MVDs and MVDAs
The normal forms
Thursday, February 19, 2015 11:10 AM

- 1NF: first normal form: values are atomic and indivisible; no "sets as values".
- 2NF: second normal form: non-key attributes depend only upon whole candidate keys, and not upon parts of keys.
- 3NF: third normal form: recoverable, preserves functional dependencies, prevents insertion anomalies. Prevents transitive dependencies.
- BCNF (3.5NF): Boyce-Codd normal form: recoverable, prevents redundancy based upon FDs, does not always preserve functional dependencies.
- 4NF: fourth normal form: BCNF + prevents redundant use of storage caused by multi-valued dependencies.

(When I studied databases as a student, I learned that 4NF was 3NF plus redundancy elimination... BCNF had not become popular...)
A correction
Thursday, February 19, 2015 11:16 AM

I guess part of my "Freudian slip" attitude toward BCNF is summarized by the fact that it is less faithful (in extreme cases) than 3NF: a BCNF decomposition can lose functional dependencies that a 3NF decomposition keeps. The lack of restrictions in 3NF allows it to be more faithful.
The true hierarchy of normal forms
Thursday, February 19, 2015 11:26 AM

(Figure: hierarchy of the normal forms; the image is not preserved in this text.)
Several facts about normal forms
Thursday, February 19, 2015 11:28 AM

A normal form is a set of properties that must be true of a database as a set of relations.
Whether a particular database's design complies with a normal form is independent of how the design was derived.
For example, it is very common for a database schema generated by the 3NF algorithm to also satisfy BCNF and 4NF.
Thus, it is just as important to know how to check a database for compliance as it is to be able to transform a database into a specific form.
A short tour of what we know
Thursday, February 19, 2015 11:32 AM

3NF:
- Generated by eliminating redundant functional dependencies and then designing relations according to the remaining dependencies, with a single "catch-all" relation for those attributes that are not part of other relations.
- Always:
  - Eliminates common insertion anomalies.
  - Is recoverable.
  - Preserves functional dependencies.

BCNF:
- Generated by splitting non-compliant relations into two parts.
- Splitting is not commutative (the result can depend on the order of the splits) unless all FDs have singleton LHSs.
- Splitting can -- in extreme cases -- lead to non-compliant databases or the elimination of existing functional dependencies.
- Always:
  - Protects against more insertion anomalies.
  - Is recoverable.
- Sometimes:
  - Preserves functional dependencies.
Differences between 3NF and BCNF
Thursday, February 19, 2015 11:43 AM

The 3NF conditions, as summarized by Zaniolo in 1982: for each functional dependency X → A of the relation R, either:
1. X contains A (X → A is trivial), or
2. X is a superkey for R, or
3. Each attribute in A - X is contained in some candidate key for R.

The BCNF conditions eliminate the third possibility: for each functional dependency X → A of R, either:
1. X contains A (X → A is trivial), or
2. X is a superkey for R.
BCNF ⊂ 3NF
Thursday, February 19, 2015 3:59 PM

Since BCNF disallows something that 3NF allows, anything in BCNF is automatically in 3NF.
Also, "almost everything" that is in 3NF is also in BCNF. The exception occurs when a table has several overlapping candidate keys.
If candidate keys do not overlap, then 3NF ⇒ BCNF, so the two forms coincide.
A very subtle example
Thursday, February 19, 2015 4:02 PM

Court  Start Time  End Time  Rate Type
1      09:30       10:30     SAVER
1      11:00       12:00     SAVER
1      14:00       15:30     STANDARD
2      10:00       11:30     PREMIUM-B
2      11:30       13:30     PREMIUM-B
2      15:00       16:30     PREMIUM-A

From <http://en.wikipedia.org/wiki/boyce%e2%80%93codd_normal_form>

Multiple overlapping candidate keys:
- { Court, Start Time }
- { Court, End Time }
- { Rate Type, Start Time }
- { Rate Type, End Time }

and one very pesky functional dependency:

Rate Type → Court

This violates BCNF but not 3NF: Rate Type is not a superkey, so BCNF rejects the dependency, but Court is part of a candidate key, so the third clause of the 3NF conditions allows it.
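The three-clause test can be applied mechanically. The following is a minimal sketch, assuming the candidate keys are supplied by hand (it does not derive them) and that only the listed FDs are checked; the names classify_fd, is_3nf, and is_bcnf are illustrative and do not come from the notes.

```python
from typing import FrozenSet, Iterable, List, Tuple

# An FD X -> A is stored as the pair (X, A).
FD = Tuple[FrozenSet[str], FrozenSet[str]]


def classify_fd(fd: FD, candidate_keys: Iterable[FrozenSet[str]]) -> str:
    """Return which clause of the 3NF conditions admits X -> A, if any."""
    x, a = fd
    keys = list(candidate_keys)
    if a <= x:
        return "trivial"      # clause 1: X contains A
    if any(key <= x for key in keys):
        return "superkey"     # clause 2: X contains a candidate key
    prime_attributes = frozenset().union(*keys)
    if (a - x) <= prime_attributes:
        return "prime"        # clause 3: allowed by 3NF only
    return "violation"


def is_3nf(fds: List[FD], candidate_keys: List[FrozenSet[str]]) -> bool:
    return all(classify_fd(fd, candidate_keys) != "violation" for fd in fds)


def is_bcnf(fds: List[FD], candidate_keys: List[FrozenSet[str]]) -> bool:
    allowed = ("trivial", "superkey")
    return all(classify_fd(fd, candidate_keys) in allowed for fd in fds)


# The court-booking relation: keys and FD as listed in the notes.
court_keys = [
    frozenset({"Court", "Start Time"}),
    frozenset({"Court", "End Time"}),
    frozenset({"Rate Type", "Start Time"}),
    frozenset({"Rate Type", "End Time"}),
]
court_fds = [(frozenset({"Rate Type"}), frozenset({"Court"}))]

print(classify_fd(court_fds[0], court_keys))
print(is_3nf(court_fds, court_keys), is_bcnf(court_fds, court_keys))
```

For this relation the single FD is admitted only by the third clause, so the 3NF check passes and the BCNF check fails.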
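Solution: encode the rate type dependency in another table

{rate type, court, member}

leaving

{rate type, start time, end time}

(Here member is an extra member-flag column that the Wikipedia version of this solution introduces; it is not a column of the table above. Rate Type → Court now lives in a table whose key is rate type.)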
A no-win situation
Thursday, February 19, 2015 6:28 PM
Identifying 3NF relations
Thursday, February 19, 2015 4:19 PM

The above example is a bit of a technicality, but it provides a way to determine whether a relation is in 3NF or not:

A relation in 3NF has no transitive dependencies A → B → C where A is a candidate key, A → B holds but B → A does not, and neither B nor C is part of a candidate key.

In other words, to determine whether a relation is in 3NF, look for transitive dependencies.
Example
Thursday, February 19, 2015 4:21 PM

The following table

{ shoe name, shoe maker, maker address }

with the FDs

shoe name → shoe maker
shoe name → maker address
shoe maker → maker address

is not in 3NF. The third dependency makes maker address transitively dependent on the key (shoe name → shoe maker → maker address): shoe name is a key, and shoe maker and maker address are not.

The appropriate fix is to normalize to

{shoe name, shoe maker} and {shoe maker, maker address}

via a Boyce-Codd split.

Since the normalized version is in 3NF and has no overlapping candidate keys, it is also in BCNF.
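The split in this example can be reproduced with an attribute-closure computation. This is a minimal sketch under the assumption that the violating left-hand side (shoe maker) has already been identified; closure and bcnf_split are illustrative names, not part of the notes.

```python
def closure(attrs, fds):
    """Attributes functionally determined by `attrs` under the FDs (X, Y)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)


def bcnf_split(relation, lhs, fds):
    """Split `relation` on a violating FD whose left-hand side is `lhs`."""
    determined = closure(lhs, fds) & relation
    first = determined                      # lhs plus everything it determines
    second = lhs | (relation - determined)  # lhs plus all remaining attributes
    return first, second


shoes = frozenset({"shoe name", "shoe maker", "maker address"})
shoe_fds = [
    (frozenset({"shoe name"}), frozenset({"shoe maker"})),
    (frozenset({"shoe name"}), frozenset({"maker address"})),
    (frozenset({"shoe maker"}), frozenset({"maker address"})),
]

# shoe maker does not determine shoe name, so it is not a superkey.
maker = frozenset({"shoe maker"})
print(sorted(closure(maker, shoe_fds)))
print([sorted(part) for part in bcnf_split(shoes, maker, shoe_fds)])
```

The two parts share shoe maker, which is the key of the first part; that shared key is what makes the join of the two parts recover the original table.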
End of lecture on 2/8
Monday, February 8, 2016 6:58 PM
So far,
Wednesday, February 10, 2016 2:07 PM

We've studied the normal forms 3NF and 3.5NF (BCNF).
We know how to evaluate a decomposition through two methods:
- The Chase algorithm for determining lossless joins.
- The fact that a functional dependency is not preserved if its attributes are split among more than one relation (unless it is implied by the dependencies that are retained).
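An example
Wednesday, February 10, 2016 2:09 PM

Consider

name → address, position, extension
position → salary
address → phone
snack (not part of any FD)

Part 1: Compute the 3NF for this:

{name, address, position, extension}
{position, salary}
{address, phone}
{snack, name} (the catch-all superkey)

Part 2: Is this lossless?

Let's notate the Chase tableau a different way, writing only the subscript instead of name-with-subscript. An attribute name in a cell is the distinguished value; a bare number i is a value known only to row i. Row Ri corresponds to the i-th relation above.

Initial tableau:

    name  address  position  extension  salary  phone  snack
R1  name  address  position  extension  1       1      1
R2  2     2        position  2          salary  2      2
R3  3     address  3         3          3       phone  3
R4  name  4        4         4          4       4      snack

After position → salary (R1 and R2 agree on position):

    name  address  position  extension  salary  phone  snack
R1  name  address  position  extension  salary  1      1
R2  2     2        position  2          salary  2      2
R3  3     address  3         3          3       phone  3
R4  name  4        4         4          4       4      snack

After address → phone (R1 and R3 agree on address):

    name  address  position  extension  salary  phone  snack
R1  name  address  position  extension  salary  phone  1
R2  2     2        position  2          salary  2      2
R3  3     address  3         3          3       phone  3
R4  name  4        4         4          4       4      snack

After name → address, position, extension (R1 and R4 agree on name), followed by position → salary and address → phone on R4:

    name  address  position  extension  salary  phone  snack
R1  name  address  position  extension  salary  phone  1
R2  2     2        position  2          salary  2      2
R3  3     address  3         3          3       phone  3
R4  name  address  position  extension  salary  phone  snack

Row R4 now consists entirely of distinguished values, so the Chase completes, and the decomposition is lossless.

Does it preserve functional dependencies? Yes, the FDs are each contained in one relation.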
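The tableau manipulation above can be automated. The following is a minimal sketch of the Chase test for FDs only, assuming the symbol "a" marks a distinguished value and "b1", "b2", ... mark values known only to one row; chase_lossless is an illustrative name.

```python
from itertools import combinations


def chase_lossless(attributes, decomposition, fds):
    """Return True if the decomposition has a lossless join under the FDs."""
    # One row per relation: "a" where the relation has the attribute,
    # otherwise a symbol private to that row.
    rows = [
        {attr: "a" if attr in rel else f"b{i}" for attr in attributes}
        for i, rel in enumerate(decomposition, start=1)
    ]
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            for r1, r2 in combinations(rows, 2):
                if all(r1[x] == r2[x] for x in lhs):
                    for y in rhs:
                        if r1[y] != r2[y]:
                            # Equate the two symbols, preferring "a".
                            keep = "a" if "a" in (r1[y], r2[y]) else min(r1[y], r2[y])
                            drop = r2[y] if r1[y] == keep else r1[y]
                            for row in rows:
                                if row[y] == drop:
                                    row[y] = keep
                            changed = True
    # Lossless exactly when some row is fully distinguished.
    return any(all(symbol == "a" for symbol in row.values()) for row in rows)


attributes = ["name", "address", "position", "extension", "salary", "phone", "snack"]
decomposition = [
    {"name", "address", "position", "extension"},
    {"position", "salary"},
    {"address", "phone"},
    {"snack", "name"},
]
fds = [
    ({"name"}, {"address", "position", "extension"}),
    ({"position"}, {"salary"}),
    ({"address"}, {"phone"}),
]

print(chase_lossless(attributes, decomposition, fds))
```

Dropping the catch-all relation {snack, name} from the decomposition makes the test fail, because no remaining row can ever acquire a distinguished snack value.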
Multi-value dependencies
Thursday, February 19, 2015 12:01 PM

It is perfectly fine -- in relational normalization terms -- to have data in tables that is not functionally determined.
- One user determines a list of preferences.
- One can have a list of friends in which (person, friend) is an unconstrained relation.

What is not so fine is the situation in which, in some implicit fashion, a cross-product of data is included in the design of a table.
This creates redundancy in the database as well as a second kind of insertion anomaly: insertion with omission.
Some multi-valued dependencies
Wednesday, February 10, 2016 1:50 PM

{model} ↠ {color}
A car model determines a set of possible colors {(color)}.

{recipe} ↠ {ingredient, quantity}
A recipe determines a set of pairs {(ingredient, quantity)}.

{tournament} ↠ {team, opponent, score}
A tournament determines a set of triples {(team, opponent, score)}.

Some simple facts about ↠:
- If X ↠ Y Z, this does not mean that X ↠ Y or X ↠ Z.
  Ingredients in the above recipe are meaningless unless their quantity is also present!
- Likewise, if X ↠ Y and X ↠ Z, this does not mean that X ↠ Y Z.
  There would be no correlation between the pairs!
- If X → Y then X ↠ Y (perfectly good to have sets of one element!).
  The converse is obviously false.
A very simple example
Thursday, February 19, 2015 12:05 PM

We have models of cars.
Every model can have one of five colors: white, black, blue, gray, yellow.
Every model can have one of three trim finish grades: std, ltd, custom.

Then, to store descriptions of three car models in a table (model, color, trim), one has to store 3x5x3 = 45 rows.

We call this a multi-value dependency and write

model ↠ color and model ↠ trim

where ↠ is read "multi-determines".

The goal of fourth normal form is to eliminate this storage redundancy.
The normalization step is to do a Boyce-Codd split that creates a separate relation {model, color} and preserves model in the original relation.
Thus, we obtain two relations

(model, color)
(model, trim)

equivalent to the original relation (model, color, trim).
The join (model, color) ⋈ (model, trim) on model exactly reproduces the cross product in the MVD.

In other words, to remove an MVD, split X ↠ Y in R into relations (X ∪ Y) and (X ∪ (R - Y)) using a Boyce-Codd split.
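The storage savings and the recovery of the original relation can be checked directly. This is a small sketch, assuming three hypothetical model names (m1, m2, m3) and that every model is available in every color and trim, as in the example.

```python
from itertools import product

models = ["m1", "m2", "m3"]  # hypothetical model names
colors = ["white", "black", "blue", "gray", "yellow"]
trims = ["std", "ltd", "custom"]

# The unnormalized relation (model, color, trim): a hidden cross product.
original = set(product(models, colors, trims))

# Boyce-Codd split on model ->> color.
model_color = {(m, c) for m, c, _ in original}
model_trim = {(m, t) for m, _, t in original}

# Natural join of the two projections on model.
rejoined = {
    (m, c, t)
    for m, c in model_color
    for m2, t in model_trim
    if m == m2
}

print(len(original), len(model_color) + len(model_trim))
print(rejoined == original)
```

The split stores 15 + 9 = 24 rows instead of 45, and the join gives back exactly the original relation.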
Explicitly, the 4NF algorithm is
Thursday, February 19, 2015 4:44 PM

1. Express the database in BCNF.
2. For each MVD X ↠ Y, split the relation R into X ∪ Y and X ∪ (R - Y) via a Boyce-Codd split. The difference here is that X ∪ Y is the key of (X, Y), not X.
3. Repeat until no MVD remains combined with non-MVD columns.

Watch out:
- The fact that there are two MVDs in a relation does not mean that two splits will be required.
- If the two MVDs are the whole relation, then one split will separate them.
- Only if there is other data that is not an MVD will two splits be necessary.
A problem with the book
Monday, February 8, 2016 10:11 AM

Last time I taught this, I stuck with the definition of MVD in the book:

A multi-valued dependency occurs when two separate set-valued dependencies on the same key occur in the same table.

The result was mass confusion... mostly because the book itself is confused on the issue!

It is therefore time for civil disobedience! My revised (and more common) definition:

A multi-value dependency (MVD) is any situation in which a value for one set of attributes determines a set of values for another set of attributes.

A multi-value dependency anomaly (MVDA) is any situation in which an MVD is expressed along with non-MVD columns in the same relation.

In other words,
a) an MVD by itself is still an MVD, but not a problem.
b) an MVDA is a problem. An MVD A ↠ B becomes part of an MVDA when there is a set of columns C in the relation that are not part of the MVD.
The bad news
Thursday, February 19, 2015 12:23 PM

MVDs are exceedingly difficult to define and locate in a relation.

In general, a multi-value dependency anomaly occurs when a relation R has attributes A, B, and C such that
- A determines a set of values for B,
- there is a separate set of columns C, and
- the values of B and C are independent of each other.

We satisfy ourselves with stepwise refinement by setting C to be the rest of the columns, because if the rest of the columns contain C, then we can utilize that set as an independent entity from B.

Thus, the definition of an MVDA is somewhat contorted:

Definition. For sets of columns X and Y, an MVD X ↠ Y is also part of an MVDA if:

If t1 and t2 are tuples such that π_X(t1) = π_X(t2), then there are tuples t3 and t4 such that
1. π_X(t1) = π_X(t3) = π_X(t4)
2. π_Y(t1) = π_Y(t3) and π_Y(t2) = π_Y(t4)
3. π_Z(t1) = π_Z(t4) and π_Z(t2) = π_Z(t3)
where Z = R - (X ∪ Y).
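A rather strange definition
Thursday, February 19, 2015 12:40 PM

     X          Y    R - (X ∪ Y)
t1   something  v1   r1
t2   something  v2   r2
t3   something  v1   r2
t4   something  v2   r1

(t3 takes its Y value from t1 and its remaining columns from t2; t4 takes its Y value from t2 and its remaining columns from t1.)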
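The definition can be tested against a concrete table instance. The sketch below assumes rows are dictionaries with identical keys; satisfies_mvd is an illustrative name. It only reports whether the current rows are consistent with X ↠ Y; it cannot establish that the MVD is part of the design.

```python
def satisfies_mvd(rows, x, y):
    """Check whether the given rows are consistent with the MVD X ->> Y."""
    if not rows:
        return True
    z = set(rows[0]) - set(x) - set(y)  # Z = R - (X u Y)

    def proj(row, attrs):
        # Projection of a row onto attrs, as a hashable value.
        return tuple(sorted((a, row[a]) for a in attrs))

    present = {(proj(r, x), proj(r, y), proj(r, z)) for r in rows}
    for t1 in rows:
        for t2 in rows:
            if proj(t1, x) == proj(t2, x):
                # Required tuple: X and Y from t1, Z from t2.
                # The swapped tuple is covered when the pair is visited as (t2, t1).
                if (proj(t1, x), proj(t1, y), proj(t2, z)) not in present:
                    return False
    return True


full = [
    {"X": "something", "Y": "v1", "Z": "r1"},
    {"X": "something", "Y": "v2", "Z": "r2"},
    {"X": "something", "Y": "v1", "Z": "r2"},
    {"X": "something", "Y": "v2", "Z": "r1"},
]

print(satisfies_mvd(full, {"X"}, {"Y"}))
print(satisfies_mvd(full[:3], {"X"}, {"Y"}))
```

With all four rows present the check succeeds; removing any one of them leaves a hole in the cross product and the check fails.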
Spotting MVDAs
Thursday, February 19, 2015 12:50 PM

These cannot be spotted easily from a database; they are a rather complex constraint on what values must appear (rather than what values must not appear, as with single-valued dependencies (FDs)).
They are -- in essence -- hidden cross products.
One violates the constraint by leaving out a tuple of a cross product.

However, it is easy to spot them in English descriptions of a database.
And similarly, it is easy to spot them in application code that manipulates databases.
Example:
Wednesday, February 10, 2016 2:31 PM

model ↠ color
model ↠ trim

And suppose
model = taurus, color = blue, gray
model = taurus, trim = L, LX, SE

So that we would need in our table

model   color  trim
taurus  blue   L
taurus  blue   LX
taurus  blue   SE
taurus  gray   L
taurus  gray   LX
taurus  gray   SE

Omitting any row of this is an omission anomaly.

We would normalize this into two tables

X =
model   color
taurus  blue
taurus  gray

and Y =
model   trim
taurus  L
taurus  LX
taurus  SE

So that X ⋈ Y is the original relation.
Then omission anomalies cannot happen.
What is not an MVDA
Thursday, February 19, 2015 4:11 PM

MVDAs require that there be at least two independently varying things:
- {model, color} does not contain an MVDA by itself.
- {model, color, trim} does, because color and trim vary independently based upon model.

The key is that there has to be something to take the cross product of.
Example
Thursday, February 19, 2015 4:13 PM

A hat company makes hats that come in several colors and sizes.
Each particular hat comes in a fixed number of sizes and colors, but other hats can vary in the range of sizes and colors.

Let's model this via the MVDs

hat name ↠ size
hat name ↠ color

and consider the relation

{ hat name, size, color, price }

with the additional FD

hat name → price

The first step is to split the relation on one MVD (say hat name ↠ size) to create

{ hat name, size }
{ hat name, color, price }

Is the result in 4NF? No! The second relation still combines the MVD hat name ↠ color with the non-MVD column price. We need

{hat name, size}
{hat name, price}
{hat name, color}
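The repeated splitting in this example can be sketched as a loop. This is a simplified illustration, assuming the MVDs are given explicitly and the starting relation is already in BCNF; split_mvds is an illustrative name, and the sketch does not test whether the left-hand side of an MVD is a superkey.

```python
def split_mvds(relation, mvds):
    """Split a relation until no given MVD shares a relation with other columns."""
    done, todo = [], [frozenset(relation)]
    while todo:
        rel = todo.pop()
        for x, y in mvds:
            y_here = (y & rel) - x      # the part of Y present in this relation
            rest = rel - x - y_here     # columns outside the MVD
            if x <= rel and y_here and rest:
                todo.append(x | y_here)  # (X u Y)
                todo.append(x | rest)    # (X u (R - Y))
                break
        else:
            done.append(rel)             # nothing left to split here
    return done


hat_relation = {"hat name", "size", "color", "price"}
hat_mvds = [
    (frozenset({"hat name"}), frozenset({"size"})),
    (frozenset({"hat name"}), frozenset({"color"})),
]

for part in split_mvds(hat_relation, hat_mvds):
    print(sorted(part))
```

Two splits are needed here because price is present alongside the two MVDs; without price, a single split would already separate size from color.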
MVD algebra
Thursday, February 19, 2015 4:32 PM

It is not surprising that MVDs have an algebra similar to that of FDs.
It is surprising that several of the things we like about the algebra of FDs are not true for MVDs.
What is even more subtle is that some of the rules come with qualifications.

Some rules to remember:
- Transitivity: A ↠ B and B ↠ C implies A ↠ C - B
- Promotion: A → B implies A ↠ B
- Complementation: A ↠ B implies A ↠ R - B (where R is the set of all attributes of the relation)
- Trivial MVDs: B ⊆ A implies A ↠ B, but there is a new kind of triviality: if A ∪ B = R and A ∩ B = ∅, then A ↠ B.

Some rules that do not work for MVDs:
- Cannot split or combine MVDs. In general,
  A ↠ B, C does not imply A ↠ B
  A ↠ B, A ↠ C does not imply A ↠ B, C
  (The second statement holds under the set-valued reading of ↠ used in these notes; under the tuple-based definition given earlier, combining two MVDs with the same left-hand side is valid.)
A very evocative table
Thursday, February 19, 2015 4:43 PM

See page 113:

Property                            3NF  BCNF  4NF
Eliminates redundancy due to FDs    No   Yes   Yes
Eliminates redundancy due to MVDs   No   No    Yes
Preserves FDs                       Yes  No    No
Preserves MVDs                      No   No    No
End of lecture on 2/10
Wednesday, February 10, 2016 6:59 PM
Some common mistakes
Thursday, February 19, 2015 4:51 PM

Whether a table is in 3NF, BCNF, or 4NF is not a property of the data.
It is solely a property of the table's columns and their FDs and MVDs.
(It is always possible that the current instance of the table does not exhibit the particular MVDA you are seeking to stamp out. This does not mean that it won't appear in the future.)
5NF
Wednesday, February 10, 2016 2:38 PM

A table is in 5NF (also known as Project-Join Normal Form (PJNF)) if
a) it is in 4NF, and
b) no further split of any kind can result in a lossless join.

In practice, this is an impractical definition:
a) Most tables in 4NF are in 5NF.
b) The ones that aren't have a rather subtle structure.
A super-subtle 5NF from Wikipedia
Wednesday, February 10, 2016 2:53 PM

Traveling Salesman  Brand   Product Type
Jack Schneider      Acme    Vacuum Cleaner
Jack Schneider      Acme    Breadbox
Mary Jones          Robust  Pruning Shears
Mary Jones          Robust  Vacuum Cleaner
Mary Jones          Robust  Breadbox
Mary Jones          Robust  Umbrella Stand
Louis Ferguson      Robust  Vacuum Cleaner
Louis Ferguson      Robust  Telescope
Louis Ferguson      Acme    Vacuum Cleaner
Louis Ferguson      Acme    Lava Lamp
Louis Ferguson      Nimbus  Tie Rack

From <https://en.wikipedia.org/wiki/fifth_normal_form>

This cannot be split via the normal FD rules. But it can be split by another kind of rule:

A Traveling Salesman has certain Brands and certain Product Types in his repertoire. If Brand B1 and Brand B2 are in his repertoire, and Product Type P is in his repertoire, then (assuming Brand B1 and Brand B2 both make Product Type P), the Traveling Salesman must offer products of Product Type P those made by Brand B1 and those made by Brand B2.

From <https://en.wikipedia.org/wiki/fifth_normal_form>

This is not an FD or an MVD rule. But it does give rise to a decomposition:

{Traveling Salesman, Brand}, {Traveling Salesman, Product Type}, and {Brand, Product Type}

whose three-way join just happens to be lossless because of the (rather strange) rule! The third relation is needed: joining only the first two produces rows that are not in the table.
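Whether a set of projections rejoins to the sample table can be checked by brute force. The sketch below uses the eleven rows of the Wikipedia table; the variable names are illustrative. It compares the join of the two salesman projections with the join that also includes a {Brand, Product Type} projection, as in the Wikipedia decomposition.

```python
rows = {
    ("Jack Schneider", "Acme", "Vacuum Cleaner"),
    ("Jack Schneider", "Acme", "Breadbox"),
    ("Mary Jones", "Robust", "Pruning Shears"),
    ("Mary Jones", "Robust", "Vacuum Cleaner"),
    ("Mary Jones", "Robust", "Breadbox"),
    ("Mary Jones", "Robust", "Umbrella Stand"),
    ("Louis Ferguson", "Robust", "Vacuum Cleaner"),
    ("Louis Ferguson", "Robust", "Telescope"),
    ("Louis Ferguson", "Acme", "Vacuum Cleaner"),
    ("Louis Ferguson", "Acme", "Lava Lamp"),
    ("Louis Ferguson", "Nimbus", "Tie Rack"),
}

# The three binary projections.
salesman_brand = {(s, b) for s, b, _ in rows}
salesman_type = {(s, p) for s, _, p in rows}
brand_type = {(b, p) for _, b, p in rows}

# Join of the two salesman projections only.
two_way = {
    (s, b, p)
    for s, b in salesman_brand
    for s2, p in salesman_type
    if s == s2
}

# Adding the brand/product-type projection filters out the spurious rows.
three_way = {(s, b, p) for s, b, p in two_way if (b, p) in brand_type}

print(len(rows), len(two_way), len(three_way))
print(three_way == rows)
```

On this instance the two-way join contains spurious rows (for example, Louis Ferguson offering a Nimbus Telescope), while the three-way join returns exactly the original eleven rows.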
A basic caveat
Wednesday, February 10, 2016 3:00 PM

The effectiveness of a normalization depends upon how one writes into a database.
So far, we have been studying normalizations that are based upon errors one can make with INSERT.
But, in other kinds of databases, these normalizations fail to be useful because there is no INSERT available!

Example: SOLR
- Key/value store.
- A key determines a document.
- There is no concept of a partial change to a document, i.e., no INSERT.
- Three primitives:
  set(key, document)
  get(key)          # provides document
  search(phrases)   # provides a set of documents matching the phrases
- No concept of a join.
- One must manually compute joins by fetching documents, getting keys from them, and then fetching sub-documents.

In SOLR, thus:
- Relational normalization is irrelevant.
- One wants documents to be large.
- One wants -- to the extent possible -- to avoid joins.

Thus everything we learned so far won't help us!
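To make the "manual join" concrete, here is a small sketch using a hypothetical in-memory store with the three primitives listed above. DocumentStore and the sample documents are invented for illustration; this is not the Solr API.

```python
class DocumentStore:
    """Hypothetical in-memory key/document store (not the Solr API)."""

    def __init__(self):
        self._documents = {}

    def set(self, key, document):
        # Replaces the whole document; there is no partial update.
        self._documents[key] = document

    def get(self, key):
        return self._documents[key]

    def search(self, phrase):
        # Naive matching: keep documents whose text contains the phrase.
        return [d for d in self._documents.values() if phrase in str(d)]


store = DocumentStore()
store.set("maker:acme", {"name": "Acme Shoes", "address": "1 Main St"})
store.set("shoe:runner", {"name": "Runner", "maker_key": "maker:acme"})

# "Normalized" layout: a manual join costs one extra fetch per reference.
shoe = store.get("shoe:runner")
maker = store.get(shoe["maker_key"])
print(shoe["name"], maker["address"])

# Denormalized layout: one large document, no join needed to read it.
store.set("shoe:runner", {"name": "Runner", "maker": maker})
print(store.get("shoe:runner")["maker"]["address"])
```

The denormalized document repeats the maker's address in every shoe, which is exactly the redundancy that relational normalization removes; in a store without joins, that trade is often accepted on purpose.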