Chapter 8. Database Design II: Relational Normalization Theory


The E-R approach is a good way to start dealing with the complexity of modeling a real-world enterprise. However, it is only a set of guidelines that requires considerable expertise and intuition, and it can lead to several alternative designs for the same enterprise. Unfortunately, the E-R approach does not include the criteria or tools needed to evaluate alternative designs and suggest improvements. In this chapter, we present relational normalization theory, which includes a set of concepts and algorithms that can help with the evaluation and refinement of the designs obtained through the E-R approach.

The main tool used in normalization theory is the notion of functional dependency (and, to a lesser degree, join dependency). Functional dependency is a generalization of the key dependencies in the E-R approach, whereas join dependencies do not have an E-R counterpart. Both types of dependency are used by designers to spot situations in which the E-R design unnaturally places attributes of two distinct entity types into the same relation schema. These situations are characterized in terms of normal forms, whence comes the term normalization theory. Normalization theory forces relations into an appropriate normal form using decompositions, which break up schemas involving unhappy unions of attributes of unrelated entity types. Because of the central role that decompositions play in relational design, the techniques that we are about to discuss are sometimes also called relational decomposition theory.

8.1 THE PROBLEM OF REDUNDANCY

An example is the best way to understand the potential problems with relational designs based on the E-R approach. Consider the CREATE TABLE Person statement (5.1) on page 93. Recall that this relation schema was obtained by direct translation from the E-R diagram in Figure 5.1. The first indication that something is wrong with this translation was the realization that SSN is not a key of the resulting Person relation. Instead, the key is the combination (SSN, Hobby). In other words, the attribute SSN does not uniquely identify the tuples in the Person relation even though it does uniquely identify the entities in the Person entity set.

Not only is this counterintuitive, but it also has a number of undesirable effects on the instances of the Person relation schema. To see this, we take a closer look at the relation instance shown in Figure 5.2, page 93. Notice that John Doe and Mary Doe are both represented by multiple tuples and that their addresses, names, and Ids occur multiple times as well. Redundant storage of the same information is apparent here. However, wasted space is the least of the problems. The real issue is that when database updates occur, we must keep all the redundant copies of the same data consistent with each other, and we must do it efficiently. Specifically, we can identify the following problems:

Update anomaly. If John Doe moves to 1 Hill Top Dr., updating the relation in Figure 5.2 requires changing the address in both tuples that describe the John Doe entity.

Insertion anomaly. Suppose that we decide to add Homer Simpson to the Person relation, but Homer's information sheet does not specify any hobbies. One way around this problem might be to add the tuple ⟨Homer's SSN, Homer Simpson, Fox 5 TV, NULL⟩, that is, to fill in the missing field with NULL. However, Hobby is part of the primary key, and most DBMSs do not allow null values in primary keys. Why? For one thing, DBMSs generally maintain an index on the primary key, and it is not clear how the index should refer to the null value. Assuming that this problem can be solved, suppose that a request is made to insert ⟨Homer's SSN, Homer Simpson, Fox 5 TV, acting⟩. Should this new tuple just be added, or should it replace the existing tuple ⟨Homer's SSN, Homer Simpson, Fox 5 TV, NULL⟩? A human will most likely choose to replace it, because humans do not normally think of hobbies as a defining characteristic of a person. However, how does a computer know that the tuples with primary keys ⟨Homer's SSN, NULL⟩ and ⟨Homer's SSN, acting⟩ refer to the same entity? (Recall that the information about which tuple came from which entity is lost in the translation!) Redundancy is at the root of this ambiguity: if Homer were described by at most one tuple, only one course of action would be possible.

Deletion anomaly. Suppose that Homer Simpson is no longer interested in acting. How are we to delete this hobby? We can, of course, delete the tuple that talks about Homer's acting hobby. However, since there is only one tuple that refers to Homer (see Figure 5.2), this throws out perfectly good information about Homer's Id and address. To avoid this loss of information, we can try to replace acting with NULL. Unfortunately, this again raises the issue of nulls in primary key attributes. Once again, redundancy is the culprit. If only one tuple could possibly describe Homer, the attribute Hobby would not be part of the key.

For convenience, we sometimes use the term update anomalies to refer to all of the above anomaly types.

8.2 DECOMPOSITIONS

The problems caused by redundancy, wasted storage and anomalies, can be easily fixed using the following simple technique. Instead of having one relation describe all that is known about persons, we can use two separate relation schemas:

    Person1(SSN, Name, Address)
    Hobby(SSN, Hobby)                                          (8.1)

Projecting the relation in Figure 5.2 on each of these schemas yields the result shown in Figure 8.1.

Figure 8.1 Decomposition of the Person relation shown in Figure 5.2: (a) the relation Person1 (SSN, Name, Address), with one tuple per person; (b) the relation Hobby (SSN, Hobby), with one tuple per hobby of each person.

The new design has the following important properties:

1. The original relation of Figure 5.2 is exactly the natural join of the two relations in Figure 8.1. In fact, one can prove that this property is not an artifact of our example, that is, that under certain natural assumptions¹ any relation, r, over the schema of Figure 5.2 equals the natural join of the projections of r on the two relational schemas in (8.1). This property, called losslessness, is discussed in Section 8.6. It means that our decomposition preserves the original information represented by the Person relation.

2. The redundancy present in the original relation of Figure 5.2 is gone, and so are the update anomalies. The only items that are stored more than once are SSNs, which are identifiers of entities of type Person. Thus, changes to addresses, names, or hobbies now affect only a single tuple. Likewise, the removal of Bart Simpson's hobbies from the database does not delete the information about his address and does not require us to rely on null values. The insertion anomaly is also gone because we can now add people and hobbies independently.

Observe that the new design still has a certain amount of redundancy and that we might still need to use null values in certain cases. First, since we use SSNs as tuple identifiers, each SSN can occur multiple times, and all of these occurrences must be kept consistent across the database. So consistency maintenance has not been eliminated completely. However, if the identifiers are not dynamic (for instance, SSNs, which do not change frequently), consistency maintenance is considerably simplified. Second, imagine a situation in which we add a person to our Person1 relation and the address is not known. Clearly, even with the new design, we have to insert NULL in the Address field for the corresponding tuple. However, Address is not part of a primary key, so the use of NULL here is not that bad (although we will still have difficulties joining Person1 on the Address attribute).

¹ Namely, that SSN uniquely determines the name and address of a person.
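To make the losslessness claim concrete, the following short Python sketch projects a small Person-like relation onto the two schemas of (8.1) and checks that their natural join restores the original relation. The tuples below, including the SSN values, are invented placeholders for illustration, not the actual data of Figure 5.2.

    # Hypothetical data standing in for Figure 5.2; the SSN values are made up.
    person = [
        {"SSN": "111", "Name": "John Doe",     "Address": "123 Main St.", "Hobby": "stamps"},
        {"SSN": "111", "Name": "John Doe",     "Address": "123 Main St.", "Hobby": "hiking"},
        {"SSN": "222", "Name": "Mary Doe",     "Address": "7 Lake Dr.",   "Hobby": "coins"},
        {"SSN": "222", "Name": "Mary Doe",     "Address": "7 Lake Dr.",   "Hobby": "hiking"},
        {"SSN": "333", "Name": "Bart Simpson", "Address": "Fox 5 TV",     "Hobby": "skating"},
        {"SSN": "333", "Name": "Bart Simpson", "Address": "Fox 5 TV",     "Hobby": "acting"},
    ]

    def project(rel, attrs):
        """Relational projection: keep only attrs, eliminating duplicate tuples."""
        return {frozenset((a, t[a]) for a in attrs) for t in rel}

    def natural_join(rel1, rel2):
        """Natural join of two sets of tuples, each tuple a frozenset of (attribute, value) pairs."""
        out = set()
        for t1 in rel1:
            d1 = dict(t1)
            for t2 in rel2:
                d2 = dict(t2)
                shared = d1.keys() & d2.keys()
                if all(d1[a] == d2[a] for a in shared):
                    out.add(frozenset({**d1, **d2}.items()))
        return out

    original = {frozenset(t.items()) for t in person}
    person1  = project(person, ["SSN", "Name", "Address"])   # decomposition (8.1)
    hobby    = project(person, ["SSN", "Hobby"])

    # The decomposition is lossless: joining the projections gives back the original.
    assert natural_join(person1, hobby) == original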

It is important to understand that not all decompositions are created equal. In fact, most of them do not make any sense even though they might be doing a good job of eliminating redundancy. The decomposition

    Ssn(SSN)
    Name(Name)
    Address(Address)                                           (8.2)
    Hobby(Hobby)

is the ultimate redundancy eliminator. Projecting the relation of Figure 5.2 on these schemas yields a database where each value appears exactly once, as shown in Figure 8.2.

Figure 8.2 The ultimate decomposition: four single-attribute relations containing, respectively, the SSNs, the names, the addresses, and the hobbies of Figure 5.2, each value listed exactly once.

Unfortunately, this new database is completely devoid of any useful information; for instance, it is no longer possible to tell where John Doe lives or who collects stamps as a hobby. This situation is in sharp contrast with the decomposition of Figure 8.1, where we were able to completely restore the information represented by the original relation using a natural join.

The need for schema refinement. Translation of the Person entity type into the relational model indicates that one cannot rely solely on the E-R approach for designing database schemas. Furthermore, the problems exhibited by the Person example are by no means rare or unique. Consider the relationship HasAccount from the brokerage example of Chapter 5. A typical translation of HasAccount into the relational model might be

    CREATE TABLE HasAccount (
        AccountNumber  INTEGER NOT NULL,
        ClientId       CHAR(20),
        OfficeId       INTEGER,                                (8.3)
        PRIMARY KEY (ClientId, OfficeId),
        FOREIGN KEY (OfficeId) REFERENCES Office
    )

Recall that a client can have at most one account in an office and hence (ClientId, OfficeId) is a key. Also, an account must be assigned to exactly one office. Careful analysis shows that this requirement leads to some of the same problems that we saw in the Person example. For example, a tuple that records the fact that a particular account is managed by a particular office cannot be added without also recording client information (since ClientId is part of the primary key), which is an insertion anomaly. This (and the dual deletion anomaly) is perhaps not a serious problem, because of the specifics of this particular application, but the update anomaly could present maintenance issues. Moving an account from one office to another involves changing OfficeId in every tuple corresponding to that account. If the account has multiple clients, this might be a problem. We return to this example later in this chapter because HasAccount exhibits certain interesting properties not found in the Person example. For instance, even though a decomposition of HasAccount might still be desirable, it incurs additional maintenance overhead that the decomposition of Person does not.

The above discussion brings out two key points: (1) decomposition of relation schemas can serve as a useful tool that complements the E-R approach by eliminating redundancy problems; (2) the criteria for choosing the right decomposition are not immediately obvious, especially when we have to deal with schemas that contain many attributes. For these reasons, the purpose of Sections 8.3 through 8.6 is to develop techniques and criteria for identifying relation schemas that are in need of decomposition, as well as to understand what it means for a decomposition not to lose information. The central tool in developing much of decomposition theory is functional dependency, which is a generalization of the idea of key constraints. Functional dependencies are used to define normal forms, a set of requirements on relational schemas that are desirable in update-intensive transaction systems. This is why the theory of decompositions is often also called normalization theory. Sections 8.5 through 8.9 develop algorithms for carrying out the normalization process.

8.3 FUNCTIONAL DEPENDENCIES

For the remainder of this chapter, we use a special notation for representing attributes, which is common in relational normalization theory: capital letters from the beginning of the alphabet (A, B, C, D) represent individual attributes, and capital letters from the middle to the end of the alphabet (P, V, W, X, Y, Z) represent sets of attributes. Also, strings of attribute letters, such as ABCD, denote sets of the respective attributes ({A, B, C, D}), and strings of set names, such as XYZ, stand for the unions of those sets (X ∪ Y ∪ Z). Although this notation requires some getting used to, it is very convenient and provides a succinct language, which we use in examples and definitions.

A functional dependency (FD) on a relation schema, R, is a constraint of the form X → Y, where X and Y are sets of attributes used in R. If r is a relation instance of R, it is said to satisfy this functional dependency if

    for every pair of tuples, t and s, in r, if t and s agree on all attributes in X, then t and s agree on all attributes in Y.

Put another way, there must not be a pair of tuples in r that have the same values for every attribute in X but different values for some attribute in Y.

Note that the key constraints introduced in Chapter 4 are a special kind of FD. Suppose that key(K) is a key constraint on the relational schema R and that r is a relational instance over R. By definition, r satisfies key(K) if and only if there is no pair of distinct tuples, t, s ∈ r, such that t and s agree on every attribute in K. Therefore, this key constraint is equivalent to the FD K → R, where K is the set of attributes in the key constraint and R denotes the set of all attributes in the schema R.

Keep in mind that functional dependencies are associated with relation schemas, but when we consider whether or not a functional dependency is satisfied we must consider relation instances over those schemas. This is because FDs are integrity constraints on the schema (much like key constraints), which restrict the set of allowable relation instances to those that satisfy the given FDs. Thus, given a schema, R = (R; Constraints), where R is a set of attributes and Constraints is a set of FDs, we are interested in the set of all relation instances over R that satisfy every FD in Constraints. Such relation instances are called legal instances of R.

Functional dependencies and update anomalies. Certain functional dependencies that exist in a relational schema can lead to redundancy in the corresponding relation instances. Consider the two examples discussed in Sections 8.1 and 8.2: the schemas Person and HasAccount. Each has a primary key, as illustrated by the corresponding CREATE TABLE statements (5.1), page 93, and (8.3), page 214. Correspondingly, there are the following functional dependencies:

    Person:      SSN, Hobby → SSN, Name, Address, Hobby
    HasAccount:  ClientId, OfficeId → AccountNumber, ClientId, OfficeId        (8.4)

These are not the only FDs implied by the original specifications, however. For instance, both Name and Address are defined as single-valued attributes in the E-R diagram of Figure 5.1. This clearly implies that one Person entity (identified by its attribute SSN) can have at most one name and one address. Similarly, the business rules of PSSC (the brokerage firm discussed in Section 5.5) require that every account be assigned to exactly one office, which means that the following FDs must also hold for the corresponding relation schemas:

    Person:      SSN → Name, Address
    HasAccount:  AccountNumber → OfficeId                                       (8.5)
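Operationally, the definition of FD satisfaction is a simple pairwise check over a relation instance. The following Python sketch is an illustration of that check; the sample tuples, including the SSN values, are made up rather than taken from Figure 5.2.

    from itertools import combinations

    def satisfies_fd(rel, X, Y):
        """True iff the relation instance rel (a list of dicts) satisfies the FD X -> Y:
        any two tuples that agree on all attributes in X also agree on all attributes in Y."""
        for t, s in combinations(rel, 2):
            if all(t[a] == s[a] for a in X) and not all(t[a] == s[a] for a in Y):
                return False
        return True

    # Hypothetical stand-in for part of the Person instance of Figure 5.2.
    person = [
        {"SSN": "111", "Name": "John Doe", "Address": "123 Main St.", "Hobby": "stamps"},
        {"SSN": "111", "Name": "John Doe", "Address": "123 Main St.", "Hobby": "hiking"},
        {"SSN": "222", "Name": "Mary Doe", "Address": "7 Lake Dr.",   "Hobby": "coins"},
    ]

    # The FD SSN -> Name, Address of (8.5) holds in this instance ...
    print(satisfies_fd(person, ["SSN"], ["Name", "Address"]))   # True
    # ... but SSN -> Hobby does not: John Doe's two tuples agree on SSN and differ on Hobby.
    print(satisfies_fd(person, ["SSN"], ["Hobby"]))             # False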

It is easy to see that the syntactic structure of the dependencies in (8.4) closely corresponds to the update anomalies that we identified for the corresponding relations. For instance, the problem with Person was that for any given SSN we could not change the values of the attributes Name and Address independently of whether the corresponding person had hobbies: if the person had multiple hobbies, the change had to occur in multiple rows. Likewise, with HasAccount we cannot change the value of OfficeId (i.e., transfer an account to a different office) without having to look for all clients associated with this account. Since a number of clients might share the same account, there can be multiple rows in HasAccount that refer to the same account and hence multiple rows in which OfficeId must be changed. Note that in both cases, the attributes involved in the update anomalies appear on the left-hand sides of an FD.

We can see that update anomalies are associated with certain kinds of functional dependencies. Which dependencies are bad? At the risk of giving away the secret, we draw your attention to one major difference between the dependencies in (8.4) and (8.5): the former specify key constraints for their corresponding relations, whereas the latter do not.

However, simply knowing which dependencies cause the anomalies is not enough; we must do something about them. We cannot just abolish the offending FDs, because they are part of the semantics of the enterprise being modeled by the database. They are implicitly or explicitly part of the Requirements Document and cannot be changed without an agreement with the customer. On the other hand, we saw that schema decomposition can be a useful tool. Even though a decomposition cannot abolish a functional dependency, it can make it behave. For instance, the decomposition shown in (8.1) yields schemas in which the offending FD, SSN → Name, Address, becomes a well-behaved key constraint.

8.4 PROPERTIES OF FUNCTIONAL DEPENDENCIES

Before going any further, we need to learn some mathematical properties of functional dependencies and develop algorithms to test them. Since these properties and algorithms rely heavily on the notational conventions introduced at the beginning of Section 8.3, it might be a good idea to revisit those conventions.

The properties of FDs that we are going to study are based on entailment. Consider a set of attributes R, a set, F, of FDs over R, and another FD, f, on R. We say that F entails f if every relation r over the set of attributes R has the following property: if r satisfies every FD in F, then r satisfies the FD f. Given a set of FDs, F, the closure of F, denoted F⁺, is the set of all FDs entailed by F. Clearly, F⁺ contains F as a subset.²

² If f ∈ F, then every relation that satisfies every FD in F obviously satisfies f. Therefore, by the definition of entailment, f is entailed by F.

If F and G are sets of FDs, we say that F entails G if F entails every individual FD in G. F and G are said to be equivalent if F entails G and G entails F. We now present several simple but important properties of entailment.

Reflexivity. Some FDs are satisfied by every relation, no matter what. These dependencies all have the form X → Y, where Y ⊆ X, and are called trivial FDs. The reflexivity property states that, if Y ⊆ X, then X → Y. To see why trivial FDs are always satisfied, consider a relation, r, whose set of attributes includes all of the attributes mentioned in X → Y. Suppose that t, s ∈ r are tuples that agree on X. Since Y ⊆ X, this means that t and s agree on Y as well. Thus, r satisfies X → Y. We can now relate trivial FDs and entailment: because a trivial FD is satisfied by every relation, it is entailed by every set of FDs! In particular, F⁺ contains every trivial FD.

Augmentation. Consider an FD, X → Y, and another set of attributes, Z, and let R contain X ∪ Y ∪ Z. Then X → Y entails XZ → YZ. In other words, every relation r over R that satisfies X → Y must also satisfy the FD XZ → YZ. The augmentation property states that, if X → Y, then XZ → YZ. To see why this is true, note that if tuples t, s ∈ r agree on every attribute of XZ, then in particular they agree on X. Since r satisfies X → Y, t and s must also agree on Y. They also agree on Z, since we have assumed that they agree on the bigger set of attributes XZ. Thus, if s and t agree on every attribute in XZ, they must agree on YZ. As this is an arbitrarily chosen pair of tuples in r, it follows that r satisfies XZ → YZ.

Transitivity. The set of FDs {X → Y, Y → Z} entails the FD X → Z. The transitivity property states that, if X → Y and Y → Z, then X → Z. This property can be established similarly to the previous two (see the exercises).

These three properties of FDs are known as Armstrong's axioms,³ and their main use is in the proofs of correctness of various database design algorithms. However, they are also a powerful tool used by (real, breathing human) database designers because they can help them quickly spot problematic FDs in relational schemas. We now show how Armstrong's axioms are used to derive new FDs.

Union of FDs. Any relation, r, that satisfies X → Y and X → Z must also satisfy X → YZ. To show this, we can derive X → YZ from X → Y and X → Z using simple syntactic manipulations defined by Armstrong's axioms. Such manipulations can be easily programmed on a computer, unlike the tuple-based considerations we used to establish the axioms themselves.

³ Strictly speaking, these are inference rules, because they derive new rules out of old ones. Only the reflexivity rule can be viewed as an axiom.

Here is how it is done:

    (a) X → Y      Given
    (b) X → Z      Given
    (c) X → YX     Adding X to both sides of (a): Armstrong's augmentation rule
    (d) YX → YZ    Adding Y to both sides of (b): Armstrong's augmentation rule
    (e) X → YZ     By Armstrong's transitivity rule applied to (c) and (d)

Decomposition of FDs. In a similar way, we can prove the following rule: every relation that satisfies X → YZ must also satisfy the FDs X → Y and X → Z. This is accomplished by the following simple steps:

    (a) X → YZ     Given
    (b) YZ → Y     By Armstrong's reflexivity rule, since Y ⊆ YZ
    (c) X → Y      By transitivity from (a) and (b)

The derivation of X → Z is similar.

Armstrong's axioms are obviously sound. By sound we mean that any expression of the form X → Y derived using the axioms is actually a functional dependency entailed by the given set of FDs. Soundness follows from the fact that we have proved that these inference rules are valid for every relation. It is much less obvious, however, that they are also complete, that is, that if a set of FDs, F, entails another FD, f, then f can be derived from F by a sequence of steps, similar to the ones above, that rely solely on Armstrong's axioms! We do not prove this fact here, but a proof can be found in [Ullman 1988]. Soundness and completeness of Armstrong's axioms is not just a theoretical curiosity; this result has considerable practical value because it guarantees that entailment of FDs (i.e., the question of whether f ∈ F⁺) can be verified by a computer program. We are now going to develop one such algorithm.

An obvious way to verify entailment of an FD, f, by a set of FDs, F, is to instruct the computer to apply Armstrong's axioms to F in all possible ways. Since the number of attributes mentioned in F and f is finite, this derivation process cannot go on forever. When we are satisfied that all possible derivations have been made, we can simply check whether f is among the FDs derived by this process. Completeness of Armstrong's axioms guarantees that f ∈ F⁺ if and only if f is one of the FDs thus derived.

To see how this process works, consider the following sets of FDs: F = {AC → B, A → C, D → A} and G = {A → B, A → C, D → A, D → B}. We can use Armstrong's axioms to prove that these two sets are equivalent, i.e., every FD in G is entailed by F, and vice versa. For instance, to prove that A → B is entailed by F, we can apply Armstrong's axioms in all possible ways. Most of these attempts will not lead anywhere, but a few will. For instance, the following derivation establishes the desired entailment:

    (a) A → C      An FD in F
    (b) A → AC     From (a) and Armstrong's augmentation axiom
    (c) A → B      From (b), AC → B ∈ F, and Armstrong's transitivity axiom

The FDs A → C and D → A belong to both F and G, so their derivation is trivial. For D → B in G, the computer can try to apply Armstrong's axioms until this FD is derived. After a while, it will stumble upon this valid derivation:

    (a) D → A      An FD in F
    (b) A → B      Derived previously
    (c) D → B      From (a), (b), and Armstrong's transitivity axiom

Showing that every FD in F is entailed by G is done similarly.

Although the simplicity of checking entailment by blindly applying Armstrong's axioms is attractive, this approach is not very efficient. In fact, the size of F⁺ can be exponential in the size of F, so for large database schemas it can take a very long time before the designer ever sees the result. We are therefore going to develop a more efficient algorithm, which is also based on Armstrong's axioms but which applies them much more judiciously.

Checking entailment of FDs. The idea of the new algorithm for verifying entailment is based on attribute closure. Given a set of FDs, F, and a set of attributes, X, we define the attribute closure of X with respect to F, denoted X⁺_F, as follows:

    X⁺_F = {A | X → A ∈ F⁺}

In other words, X⁺_F is the set of all those attributes, A, such that X → A is entailed by F. Note that X ⊆ X⁺_F because, if A ∈ X, then, by Armstrong's reflexivity axiom, X → A is a trivial FD that is entailed by every set of FDs, including F. It is important to understand that the closure of F (i.e., F⁺) and the closure of X (i.e., X⁺_F) are related but different notions: F⁺ is a set of functional dependencies, whereas X⁺_F is a set of attributes.

The more efficient algorithm for checking entailment of FDs now works as follows: given a set of FDs, F, and an FD, X → Y, check whether Y ⊆ X⁺_F. If so, then F entails X → Y. Otherwise, if Y ⊄ X⁺_F, then F does not entail X → Y.

The correctness of this algorithm follows from Armstrong's axioms. If Y ⊆ X⁺_F, then X → A ∈ F⁺ for every A ∈ Y (by the definition of X⁺_F). By the union rule for FDs, F thus entails X → Y. Conversely, if Y ⊄ X⁺_F, then there is a B ∈ Y such that B ∉ X⁺_F. Hence, X → B is not entailed by F. But then F cannot entail X → Y; if it did, it would have to entail X → B as well, by the decomposition rule for FDs.

The heart of the above algorithm is a check of whether a set of attributes is contained in X⁺_F. Therefore, we are not done yet: we need an algorithm for computing the closure of X, which we present in Figure 8.3.

Figure 8.3 Computation of the attribute closure X⁺_F.

    closure := X
    repeat
        old := closure
        if there is an FD Z → V ∈ F such that Z ⊆ closure
            then closure := closure ∪ V
    until old = closure
    return closure

The idea behind the algorithm is to enlarge the set of attributes known to belong to X⁺_F by applying the FDs in F. The variable closure is initialized to X, since we know that X is always a subset of X⁺_F. The soundness of the algorithm can be proved by induction. Initially, closure is X, so X → closure is in F⁺. Then, assuming that X → closure ∈ F⁺ at some intermediate step in the repeat loop of Figure 8.3, and given an FD Z → V ∈ F such that Z ⊆ closure, we can use the generalized transitivity rule (see Exercise 8.7) to infer that F entails X → closure ∪ V. Thus, if A ∈ closure at the end of the computation, then A ∈ X⁺_F. The converse is also true: if A ∈ X⁺_F, then at the end of the computation A ∈ closure (see Exercise 8.8).

Unlike the simple-minded algorithm that uses Armstrong's axioms indiscriminately, the algorithm in Figure 8.3 has run-time complexity that is quadratic in the size of F. In fact, an algorithm for computing X⁺_F that is linear in the size of F is given in [Beeri and Bernstein 1979]. This algorithm is better suited for a computer program, but its inner workings are more complex.

Example (Checking Entailment). Consider a relational schema, R = (R; F), where R = ABCDEFGHIJ and F contains the following FDs: AB → C, D → E, AE → G, GD → H, ID → J. We wish to check whether F entails ABD → GH and ABD → HJ.

First, let us compute ABD⁺_F. We begin with closure = ABD. Two FDs can be used in the first iteration of the loop in Figure 8.3. For definiteness, let us use AB → C, which makes closure = ABDC. In the second iteration, we can use D → E, which makes closure = ABDCE. Now it becomes possible to use the FD AE → G in the third iteration, yielding closure = ABDCEG. This in turn allows GD → H to be applied in the fourth iteration, which results in closure = ABDCEGH. In the fifth iteration, we cannot apply any new FDs, so closure does not change and the loop terminates. Thus, ABD⁺_F = ABDCEGH.

Since GH ⊆ ABDCEGH, we conclude that F entails ABD → GH. On the other hand, HJ ⊄ ABDCEGH, so we conclude that ABD → HJ is not entailed by F. Note, however, that F does entail ABD → H.
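The closure computation of Figure 8.3, together with the closure-based entailment test, is short enough to implement directly. The Python sketch below is one such rendering (an illustration, not the book's code); it reproduces the worked example above.

    def attribute_closure(X, F):
        """Compute the attribute closure of X with respect to F, following Figure 8.3.
        X is an iterable of attributes; F is a list of (lhs, rhs) pairs of attribute sets."""
        closure = set(X)
        changed = True
        while changed:
            changed = False
            for lhs, rhs in F:
                # If Z -> V is in F and Z is contained in the closure, add V to the closure.
                if set(lhs) <= closure and not set(rhs) <= closure:
                    closure |= set(rhs)
                    changed = True
        return closure

    def entails(F, X, Y):
        """True iff F entails the FD X -> Y, i.e., Y is a subset of the closure of X."""
        return set(Y) <= attribute_closure(X, F)

    # The FDs of the example: AB -> C, D -> E, AE -> G, GD -> H, ID -> J.
    F = [("AB", "C"), ("D", "E"), ("AE", "G"), ("GD", "H"), ("ID", "J")]

    print(sorted(attribute_closure("ABD", F)))   # ['A', 'B', 'C', 'D', 'E', 'G', 'H']
    print(entails(F, "ABD", "GH"))               # True
    print(entails(F, "ABD", "HJ"))               # False (J is not in the closure)
    print(entails(F, "ABD", "H"))                # True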

The above algorithm for testing entailment leads to a simple test for equivalence between a pair of sets of FDs. Let F and G be such sets. To check that they are equivalent, we must check that every FD in G is entailed by F, and vice versa. The algorithm is depicted in Figure 8.4.

Figure 8.4 Testing equivalence of sets of FDs.

    Input:  two sets of FDs, F and G
    Output: true, if F is equivalent to G; false otherwise
    for each f ∈ F do
        if G does not entail f then return false
    for each g ∈ G do
        if F does not entail g then return false
    return true

8.5 NORMAL FORMS

To eliminate redundancy and potential update anomalies, database theory identifies several normal forms for schemas such that, if a schema is in one of the normal forms, it has certain predictable properties. Originally, [Codd 1970] proposed three normal forms, each eliminating more anomalies than the previous one. The first normal form (1NF), as introduced by Codd, is equivalent to the definition of the relational data model. The second normal form (2NF) was an attempt to eliminate some potential anomalies, but it has turned out to be of no practical use, so we do not discuss it. The third normal form (3NF) was initially thought to be the ultimate normal form. However, Boyce and Codd soon realized that 3NF can still harbor undesirable combinations of functional dependencies, so they introduced the so-called Boyce-Codd normal form (BCNF). Unfortunately, the sobering reality of the computational sciences is that there is no free lunch: even though BCNF is more desirable, it is not always achievable without paying a price elsewhere.

In this section, we define both BCNF and 3NF. Subsequent sections develop algorithms for automatically converting relational schemas that possess various bad properties into sets of schemas in 3NF and BCNF. We also study the trade-offs associated with such conversions. Toward the end of this chapter, we show that certain types of redundancy are caused not by FDs but by other dependencies. To deal with this problem, we introduce the fourth normal form (4NF), which further extends BCNF.

The Boyce-Codd Normal Form. A relational schema, R = (R; F), where R is the set of attributes of R and F is the set of functional dependencies associated with R, is in Boyce-Codd normal form (BCNF) if, for every FD X → Y ∈ F, either of the following is true:

- Y ⊆ X (i.e., the FD is trivial), or
- X is a superkey of R.

In other words, the only nontrivial FDs are those in which a key functionally determines one or more attributes. It is easy to see that Person1 and Hobby, the relational schemas given in (8.1), are in BCNF, because the only nontrivial FD is SSN → Name, Address; it applies to Person1, which has SSN as a key. On the other hand, consider the schema Person defined by the CREATE TABLE statement (5.1), page 93, and the schema HasAccount defined by the SQL statement (8.3), page 214. As discussed earlier, these statements fail to capture some important relationships, which are represented by the FDs in (8.5), page 216. Each of these FDs violates the BCNF requirement: they are not trivial, and their left-hand sides, SSN and AccountNumber, are not superkeys of their respective schemas.

Note that a BCNF schema can have more than one key. For instance, R = (ABCD; F), where F = {AB → CD, AC → BD}, has two keys, AB and AC. And yet it is in BCNF because the left-hand side of each of the two FDs in F is a key.

An important property of BCNF schemas is that their instances do not contain redundant information. Since we have been illustrating redundancy problems only through concrete examples, the above statement might seem vague. Exactly what is redundant information? For instance, does an abstract relation over the above-mentioned BCNF schema R in which two tuples agree on the attributes BCD store redundant information? Superficially it might seem so, because the two tuples agree on all but one attribute. However, having identical values in some attributes of different tuples does not necessarily imply that the tuples are storing redundant information. Redundancy arises when the values of some set of attributes, X, necessarily imply the value that must exist in another attribute, A (a functional dependency): if two tuples have the same values in X, they must have the same value of A. Redundancy is eliminated if we store the association between X and A only once (in a separate relation) instead of repeating it in all tuples of an instance of the schema R that agree on X. Since R does not have FDs over the attributes BCD, no redundant information is stored. The fact that the tuples in the relation coincide over BCD is coincidental. For instance, the value of attribute D in the first tuple can be changed from 4 to 5 without regard for the second tuple.

A DBMS automatically eliminates one type of redundancy: two tuples with the same values in the key fields are prohibited in any instance of a schema. This is a special case: the key identifies an entity and so determines the values of all attributes describing that entity. As the definition of BCNF precludes associations that do not contain keys, relations over BCNF schemas do not store redundant information. As a result, deletion and update anomalies do not arise in BCNF relations.

Relations with more than one key can still have insertion anomalies. To see this, suppose that an association over ABD and an association over ACD are added to our relation. Because the value of the attribute C in the first association and of B in the second is unknown, we fill in the missing information with NULL. However, now we cannot tell whether the two newly added tuples describe the same association; it all depends on the real values hidden behind the nulls. A practical solution to this problem, as adopted by the SQL standard, is to designate one key as primary and to prohibit null values in its attributes.

The Third Normal Form. A relational schema, R = (R; F), where R is the set of attributes of R and F is the set of functional dependencies associated with R, is in third normal form (3NF) if, for every FD X → A ∈ F, any one of the following is true:

- A ∈ X (i.e., the FD is trivial),
- X is a superkey of R, or
- A ∈ K for some key K of R.

Observe that the first two conditions in the definition of 3NF are identical to the conditions that define BCNF. Thus, 3NF is a relaxation of BCNF's requirements: every schema in BCNF must also be in 3NF, but the converse is not true in general. For instance, the relation HasAccount (8.3) on page 214 is in 3NF because the only FD that is not based on a key constraint is AccountNumber → OfficeId, and OfficeId is part of the key. However, this relation is not in BCNF, as shown previously.⁴

If you are wondering about the intrinsic merit of the third condition in the definition of 3NF, the answer is that there is none. In a way, 3NF was discovered by mistake in the search for what we now call BCNF! The reason for its remarkable survival is that it was later found to have some desirable algorithmic properties that BCNF does not have. We discuss these issues in subsequent sections.

Recall from Section 8.2 that relation instances over HasAccount might store redundant information. Now we can see that this redundancy arises because of the functional dependency that relates AccountNumber and OfficeId and that is not implied by key constraints.

⁴ In fact, HasAccount is the smallest possible example of a 3NF relation that is not in BCNF (see Exercise 8.5).
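Both normal-form definitions can be checked mechanically once a schema's FDs and candidate keys are known. The following Python sketch is an illustration of such a check (the candidate keys are supplied by hand rather than computed, and the closure-based superkey test from Section 8.4 is reused); applied to HasAccount, it reports that the schema is in 3NF but not in BCNF.

    def closure(X, F):
        """Attribute closure of X under F (pairs of attribute sets), as in Figure 8.3."""
        c, changed = set(X), True
        while changed:
            changed = False
            for lhs, rhs in F:
                if set(lhs) <= c and not set(rhs) <= c:
                    c |= set(rhs)
                    changed = True
        return c

    def is_superkey(X, attrs, F):
        return closure(X, F) >= set(attrs)

    def is_bcnf(attrs, F):
        """Every nontrivial FD X -> Y must have a superkey as its left-hand side."""
        return all(set(rhs) <= set(lhs) or is_superkey(lhs, attrs, F) for lhs, rhs in F)

    def is_3nf(attrs, F, keys):
        """As BCNF, except that a right-hand-side attribute may instead belong to some key."""
        for lhs, rhs in F:
            for a in set(rhs) - set(lhs):                  # only the nontrivial part matters
                if not (is_superkey(lhs, attrs, F) or any(a in k for k in keys)):
                    return False
        return True

    # HasAccount with the FDs of (8.4) and (8.5); its candidate keys under these FDs
    # are (ClientId, OfficeId) and (ClientId, AccountNumber).
    attrs = {"AccountNumber", "ClientId", "OfficeId"}
    F = [({"ClientId", "OfficeId"}, {"AccountNumber"}),
         ({"AccountNumber"}, {"OfficeId"})]
    keys = [{"ClientId", "OfficeId"}, {"ClientId", "AccountNumber"}]

    print(is_bcnf(attrs, F))        # False: AccountNumber -> OfficeId has a non-superkey LHS
    print(is_3nf(attrs, F, keys))   # True:  OfficeId belongs to the key (ClientId, OfficeId)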

For another example, consider the schema Person discussed earlier. This schema violates the 3NF requirements because, for example, the FD SSN → Name is not based on a key constraint (SSN is not a superkey) and Name does not belong to a key of Person. However, the decomposition of this schema into Person1 and Hobby in (8.1), page 213, yields a pair of schemas that are in both 3NF and BCNF.

8.6 PROPERTIES OF DECOMPOSITIONS

Since there is no redundancy in BCNF schemas and redundancy in 3NF is limited, we are interested in decomposing a given schema into a collection of schemas, each of which is in one of these normal forms. The main thrust of the discussion in the previous section was that 3NF does not completely solve the redundancy problem. Therefore, at first glance, there appears to be no justification for considering 3NF as a goal for database design. It turns out, however, that the maintenance problems associated with redundancy do not show the whole picture. As we will see, maintenance effort is also associated with integrity constraints,⁵ and 3NF decompositions sometimes have better properties in this regard than do BCNF decompositions. Our first step is to define these properties.

Recall from Section 8.2 that not all decompositions are created equal. For instance, the decomposition of Person shown in (8.1) is considered good, while the one in (8.2) makes no sense. Is there an objective way to tell which decompositions make sense and which do not, and can this objective way be explained to a computer? The answer to both questions is yes. The decompositions that make sense are called lossless. Before tackling this notion, we need to be more precise about what we mean by a decomposition.

A decomposition of a schema, R = (R; F), where R is a set of attributes of the schema and F is its set of functional dependencies, is a collection of schemas R₁ = (R₁; F₁), R₂ = (R₂; F₂), ..., Rₙ = (Rₙ; Fₙ) such that the following conditions hold:

1. R = R₁ ∪ R₂ ∪ ... ∪ Rₙ
2. F entails Fᵢ for every i = 1, ..., n.

The first part of the definition is clear: a decomposition should not introduce new attributes, and it should not drop attributes found in the original schema. The second part of the definition says that a decomposition should not introduce new functional dependencies (but may drop some). We discuss the second requirement in more detail later.

The decomposition of a schema naturally leads to the decomposition of relations over it. A decomposition of a relation, r, over schema R is a set of relations

    r₁ = πR₁(r), r₂ = πR₂(r), ..., rₙ = πRₙ(r)

where π is the projection operator.

⁵ For example, a particular integrity constraint in the original table might be checkable in the decomposed tables only by taking the join of these tables, which results in significant overhead at run time.

It can be shown (see Exercise 8.9) that, if r is a valid instance of R, then each rᵢ satisfies all FDs in Fᵢ and thus each rᵢ is a valid relation instance over the schema Rᵢ. The purpose of a decomposition is to replace the original relation, r, with a set of relations r₁, ..., rₙ over the schemas that constitute the decomposed schema. In view of the above definitions, it is important to realize that schema decomposition and relation instance decomposition are two different but related notions.

Applying the above definitions to our running example, we see that splitting Person into Person1 and Hobby (see (8.1) and Figure 8.1) yields a decomposition. Splitting Person as shown in (8.2) and Figure 8.2 is also a decomposition in the above sense. It clearly satisfies the first requirement for being a decomposition. It also satisfies the second, since only trivial FDs hold in (8.2), and these are entailed by every set of dependencies. This last example shows that the above definition of a decomposition does not capture all of the desirable properties of a decomposition because, as you may recall from Section 8.2, the decomposition in (8.2) makes no sense. In the following, we introduce additional desirable properties of decompositions.

Lossless and Lossy Decompositions

Consider a relation, r, and its decomposition, r₁, ..., rₙ, as defined above. Since after the decomposition the database no longer stores the relation r and instead maintains its projections r₁, ..., rₙ, the database must be able to reconstruct the original relation r from these projections. Not being able to reconstruct r means that the decomposition does not represent the same information as does the original database (imagine a bank losing the information about who owns which account or, worse, giving those accounts to the wrong owners!). In principle, one can use any computational method that guarantees reconstruction of r from its projections. However, the natural and, in most cases, practical method is the natural join. We thus assume that r is reconstructible if and only if

    r = r₁ ⋈ r₂ ⋈ ... ⋈ rₙ

Reconstructibility must be a property of the schema decomposition and not of a particular instance over this schema. At the database design stage, the designer manipulates schemas, not relations, and any transformation performed on a schema must guarantee that reconstructibility holds for all of its valid relation instances. This discussion leads to the following notion.

A decomposition of schema R = (R; F) into a collection of schemas R₁ = (R₁; F₁), R₂ = (R₂; F₂), ..., Rₙ = (Rₙ; Fₙ) is lossless if, for every valid instance r of schema R,

    r = r₁ ⋈ r₂ ⋈ ... ⋈ rₙ

where r₁ = πR₁(r), r₂ = πR₂(r), ..., rₙ = πRₙ(r).

A decomposition is lossy otherwise.

In plain terms, a lossless schema decomposition is one that guarantees that any valid instance of the original schema can be reconstructed from its projections on the individual schemas of the decomposition. Note that r ⊆ r₁ ⋈ r₂ ⋈ ... ⋈ rₙ holds for any decomposition (Exercise 8.10), so losslessness really just states the opposite inclusion.

The fact that r ⊆ r₁ ⋈ r₂ ⋈ ... ⋈ rₙ holds no matter what may seem confusing at first. If we can get more tuples by joining the projections of r, why is such a decomposition called lossy? After all, we gained more tuples, not fewer! To clarify this issue, observe that what we might lose here are not tuples but rather the information about which tuples are the right ones. Consider, for instance, the decomposition (8.2) of the schema Person. Figure 8.2 presents the corresponding decomposition of a valid relation instance over Person shown in Figure 5.2. However, if we now compute the natural join of the relations in the decomposition (which becomes a Cartesian product, since these relations do not share attributes), we will not be able to tell who lives where and who has what hobbies. The relationship among names and SSNs is also lost. In other words, when reconstructing the original relation, getting more tuples is as bad as getting fewer: we must get exactly the set of tuples in the original relation.

Now that we are convinced of the importance of losslessness, we need an algorithm that a computer can use to verify this property, since the definition of lossless joins does not provide an effective test; it only tells us to try every possible relation, which is neither feasible nor efficient. A general test of whether a decomposition into n schemas is lossless exists, but it is somewhat complex; it can be found in [Beeri et al. 1981]. However, there is a much simpler test that works for binary decompositions, that is, decompositions into a pair of schemas. This test can also establish the losslessness of a decomposition into more than two schemas, provided that the decomposition was obtained by a series of binary decompositions. Since most decompositions are obtained in this way, the simple binary test introduced below is sufficient for most practical purposes.

Testing the losslessness of a binary decomposition. Let R = (R; F) be a schema and let R₁ = (R₁; F₁), R₂ = (R₂; F₂) be a binary decomposition of R. This decomposition is lossless if and only if at least one of the following is true:

- (R₁ ∩ R₂) → R₁ ∈ F⁺, or
- (R₁ ∩ R₂) → R₂ ∈ F⁺.

Figure 8.5 Tuple structure in a lossless binary decomposition: a row of r₁ combines with exactly one row of r₂ on the shared attributes R₁ ∩ R₂.

To see why this is so, suppose that (R₁ ∩ R₂) → R₂ ∈ F⁺. Then R₁ is a superkey of R, since the values in a subset of the attributes of R₁ determine the values of all attributes of R₂ (and, obviously, R₁ functionally determines R₁). Let r be a valid relation instance of R. Since R₁ is a superkey of R, every tuple in r₁ = πR₁(r) extends to exactly one tuple in r. Thus, as depicted in Figure 8.5, the cardinality (the number of tuples) of r₁ equals the cardinality of r, and every tuple in r₁ joins with exactly one tuple in r₂ = πR₂(r) (if more than one tuple in r₂ joined with a tuple in r₁, (R₁ ∩ R₂) → R₂ would not be an FD). Therefore, the cardinality of r₁ ⋈ r₂ equals the cardinality of r₁, which in turn equals the cardinality of r. Since r must be a subset of r₁ ⋈ r₂, it follows that r = r₁ ⋈ r₂. Conversely, if neither of the above FDs holds, it is easy to construct a relation r such that r ≠ r₁ ⋈ r₂. Details of this construction are left to the exercises.

The above test can now be used to substantiate our intuition that the decomposition (8.1) of Person into Person1 and Hobby is a good one. The intersection of the attributes of Hobby and Person1 is {SSN}, and SSN is a key of Person1. Thus, this decomposition is lossless.

Dependency-Preserving Decompositions

Consider the schema HasAccount once again. Recall that it has the attributes AccountNumber, ClientId, and OfficeId, and that its FDs are

    ClientId, OfficeId → AccountNumber                         (8.6)
    AccountNumber → OfficeId                                   (8.7)

According to the losslessness test, the following decomposition is lossless:

    AcctOffice = (AccountNumber, OfficeId; {AccountNumber → OfficeId})          (8.8)
    AcctClient = (AccountNumber, ClientId; { })

because AccountNumber (the intersection of the two attribute sets) is a key of the first schema, AcctOffice.
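In code, the binary losslessness test is just an attribute-closure computation over the shared attributes. The following Python sketch (illustrative only; it reuses the closure routine of Section 8.4, and the attribute encodings are assumptions of the sketch) applies the test to decompositions (8.1) and (8.8), and to a deliberately lossy split of Person1.

    def closure(X, F):
        """Attribute closure of X under F (pairs of attribute sets), as in Figure 8.3."""
        c, changed = set(X), True
        while changed:
            changed = False
            for lhs, rhs in F:
                if set(lhs) <= c and not set(rhs) <= c:
                    c |= set(rhs)
                    changed = True
        return c

    def lossless_binary(R1, R2, F):
        """Binary losslessness test: the shared attributes must functionally determine
        all of R1 or all of R2, i.e., (R1 ∩ R2) -> R1 or (R1 ∩ R2) -> R2 is in F+."""
        shared = set(R1) & set(R2)
        cl = closure(shared, F)
        return set(R1) <= cl or set(R2) <= cl

    # Decomposition (8.1): Person into Person1 and Hobby, with the FD (8.5) SSN -> Name, Address.
    person_fds = [({"SSN"}, {"Name", "Address"})]
    print(lossless_binary({"SSN", "Name", "Address"}, {"SSN", "Hobby"}, person_fds))   # True

    # Decomposition (8.8): HasAccount into AcctOffice and AcctClient, with FDs (8.6) and (8.7).
    hasaccount_fds = [({"ClientId", "OfficeId"}, {"AccountNumber"}),
                      ({"AccountNumber"}, {"OfficeId"})]
    print(lossless_binary({"AccountNumber", "OfficeId"},
                          {"AccountNumber", "ClientId"}, hasaccount_fds))              # True

    # A split of Person1 on the non-key attribute Name fails the test.
    print(lossless_binary({"Name", "Address"}, {"Name", "SSN"}, person_fds))           # False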

Even though decomposition (8.8) is lossless, something seems to have fallen through the cracks. AcctOffice hosts the FD (8.7), but AcctClient's associated set of FDs is empty. This leaves the FD (8.6), which exists in the original schema, homeless: neither of the two schemas in the decomposition has all of the attributes needed to house this FD, and, furthermore, the FD cannot be derived from the FDs that belong to the schemas AcctOffice and AcctClient. In practical terms, this means that even though decomposing relations over the schema HasAccount into relations over AcctOffice and AcctClient does not lead to information loss, it might incur a cost for maintaining the integrity constraint that corresponds to the lost FD. Unlike the FD (8.7), which can be checked locally in the relation AcctOffice, verification of the FD (8.6) requires computing the join of AcctOffice and AcctClient before the checking of the FD can even begin. In such cases, we say that the decomposition does not preserve the dependencies of the original schema. We now define what this means.

Consider a schema, R = (R; F), and suppose that R₁ = (R₁; F₁), R₂ = (R₂; F₂), ..., Rₙ = (Rₙ; Fₙ) is a decomposition. F entails each Fᵢ by definition, so F entails F₁ ∪ ... ∪ Fₙ. However, this definition does not require that the two sets of dependencies be equivalent, that is, that F₁ ∪ ... ∪ Fₙ must also entail F. This reverse entailment is what is missing in the above example: the FD set of HasAccount, which consists of dependencies (8.6) and (8.7), is not entailed by the union of the dependencies in decomposition (8.8), which consists of the single dependency AccountNumber → OfficeId. Because of this, (8.8) is not a dependency-preserving decomposition.

Formally, R₁ = (R₁; F₁), R₂ = (R₂; F₂), ..., Rₙ = (Rₙ; Fₙ) is said to be a dependency-preserving decomposition of R = (R; F) if and only if it is a decomposition and the sets of FDs F and F₁ ∪ ... ∪ Fₙ are equivalent.

The fact that an FD, f, in F is not in any Fᵢ does not by itself mean that the decomposition is not dependency preserving, since f might be entailed by F₁ ∪ ... ∪ Fₙ. In this case, maintaining f as a functional dependency requires no extra effort: if the FDs in F₁ ∪ ... ∪ Fₙ are maintained, f will be maintained as well. It is only when f is not entailed by F₁ ∪ ... ∪ Fₙ that the decomposition is not dependency preserving, and maintenance of f then requires a join.

In view of this definition, the above decomposition of HasAccount is not dependency preserving, and this is precisely what is wrong with it. The dependencies that exist in the original schema but are lost in the decomposition become inter-relational constraints that cannot be maintained locally. Each time a relation in the decomposition is changed, satisfaction of the inter-relational constraints can be checked only after the reconstruction of the original relation.

To illustrate, consider the decomposition of HasAccount shown in Figure 8.6: the relation HasAccount contains two tuples, for account B123 at office SB01 and for account A908 at office MN08, and its projections AcctOffice and AcctClient record, respectively, the office and the client of each account.

Figure 8.6 Decomposition of the HasAccount relation into AcctOffice (AccountNumber, OfficeId) and AcctClient (AccountNumber, ClientId).

If we now add the tuple ⟨B567, SB01⟩ to the relation AcctOffice and a tuple for account B567 to AcctClient that pairs it with the client who already holds account B123, the two relations will still satisfy their local FDs (in fact, we see from (8.8) that only AcctOffice has a dependency to satisfy). In contrast, the inter-relational FD (8.6) is not satisfied after these updates, but this is not immediately apparent. To verify it, we must join the two relations, as depicted in Figure 8.7. We now see that constraint (8.6) is violated by the first two tuples in the updated HasAccount relation: the same client now has two different accounts, B123 and B567, at the same office, SB01.

Figure 8.7 HasAccount and its decomposition after the insertion of the new rows for account B567.
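Dependency preservation can also be checked mechanically: project F onto the attributes of each schema in the decomposition (by computing what every subset of those attributes determines under F) and then test whether the union of the projections entails F. The following Python sketch is a brute-force illustration of this idea (exponential in the number of attributes, which is acceptable for textbook-sized schemas); it reports that decomposition (8.8) of HasAccount is not dependency preserving, whereas decomposition (8.1) of Person is.

    from itertools import combinations

    def closure(X, F):
        """Attribute closure of X under F (pairs of attribute sets), as in Figure 8.3."""
        c, changed = set(X), True
        while changed:
            changed = False
            for lhs, rhs in F:
                if set(lhs) <= c and not set(rhs) <= c:
                    c |= set(rhs)
                    changed = True
        return c

    def project_fds(F, attrs):
        """Projection of F onto attrs: every subset X of attrs determines
        (closure of X) restricted to attrs. Exponential, but fine for small schemas."""
        attrs = set(attrs)
        projected = []
        for k in range(1, len(attrs) + 1):
            for X in combinations(sorted(attrs), k):
                rhs = (closure(set(X), F) & attrs) - set(X)
                if rhs:
                    projected.append((set(X), rhs))
        return projected

    def preserves_dependencies(F, decomposition):
        """True iff the union of the FDs projected onto the decomposed schemas entails F."""
        union = [fd for attrs in decomposition for fd in project_fds(F, attrs)]
        return all(set(rhs) <= closure(lhs, union) for lhs, rhs in F)

    # HasAccount, FDs (8.6) and (8.7), decomposed as in (8.8): not dependency preserving.
    F_ha = [({"ClientId", "OfficeId"}, {"AccountNumber"}),
            ({"AccountNumber"}, {"OfficeId"})]
    print(preserves_dependencies(F_ha, [{"AccountNumber", "OfficeId"},
                                        {"AccountNumber", "ClientId"}]))        # False

    # Person, FDs (8.4) and (8.5), decomposed as in (8.1): dependency preserving.
    F_p = [({"SSN", "Hobby"}, {"SSN", "Name", "Address", "Hobby"}),
           ({"SSN"}, {"Name", "Address"})]
    print(preserves_dependencies(F_p, [{"SSN", "Name", "Address"},
                                       {"SSN", "Hobby"}]))                      # True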


More information

Determination of the normalization level of database schemas through equivalence classes of attributes

Determination of the normalization level of database schemas through equivalence classes of attributes Computer Science Journal of Moldova, vol.17, no.2(50), 2009 Determination of the normalization level of database schemas through equivalence classes of attributes Cotelea Vitalie Abstract In this paper,

More information

normalisation Goals: Suppose we have a db scheme: is it good? define precise notions of the qualities of a relational database scheme

normalisation Goals: Suppose we have a db scheme: is it good? define precise notions of the qualities of a relational database scheme Goals: Suppose we have a db scheme: is it good? Suppose we have a db scheme derived from an ER diagram: is it good? define precise notions of the qualities of a relational database scheme define algorithms

More information

CS 377 Database Systems. Database Design Theory and Normalization. Li Xiong Department of Mathematics and Computer Science Emory University

CS 377 Database Systems. Database Design Theory and Normalization. Li Xiong Department of Mathematics and Computer Science Emory University CS 377 Database Systems Database Design Theory and Normalization Li Xiong Department of Mathematics and Computer Science Emory University 1 Relational database design So far Conceptual database design

More information

Database Systems Concepts, Languages and Architectures

Database Systems Concepts, Languages and Architectures These slides are for use with Database Systems Concepts, Languages and Architectures Paolo Atzeni Stefano Ceri Stefano Paraboschi Riccardo Torlone To view these slides on-screen or with a projector use

More information

Chapter 15 Basics of Functional Dependencies and Normalization for Relational Databases

Chapter 15 Basics of Functional Dependencies and Normalization for Relational Databases Chapter 15 Basics of Functional Dependencies and Normalization for Relational Databases Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 15 Outline Informal Design Guidelines

More information

Database Management System

Database Management System UNIT -6 Database Design Informal Design Guidelines for Relation Schemas; Functional Dependencies; Normal Forms Based on Primary Keys; General Definitions of Second and Third Normal Forms; Boyce-Codd Normal

More information

C# Cname Ccity.. P1# Date1 Qnt1 P2# Date2 P9# Date9 1 Codd London.. 1 21.01 20 2 23.01 2 Martin Paris.. 1 26.10 25 3 Deen London.. 2 29.

C# Cname Ccity.. P1# Date1 Qnt1 P2# Date2 P9# Date9 1 Codd London.. 1 21.01 20 2 23.01 2 Martin Paris.. 1 26.10 25 3 Deen London.. 2 29. 4. Normalisation 4.1 Introduction Suppose we are now given the task of designing and creating a database. How do we produce a good design? What relations should we have in the database? What attributes

More information

Theory behind Normalization & DB Design. Satisfiability: Does an FD hold? Lecture 12

Theory behind Normalization & DB Design. Satisfiability: Does an FD hold? Lecture 12 Theory behind Normalization & DB Design Lecture 12 Satisfiability: Does an FD hold? Satisfiability of FDs Given: FD X Y and relation R Output: Does R satisfy X Y? Algorithm: a.sort R on X b.do all the

More information

Chapter 5: FUNCTIONAL DEPENDENCIES AND NORMALIZATION FOR RELATIONAL DATABASES

Chapter 5: FUNCTIONAL DEPENDENCIES AND NORMALIZATION FOR RELATIONAL DATABASES 1 Chapter 5: FUNCTIONAL DEPENDENCIES AND NORMALIZATION FOR RELATIONAL DATABASES INFORMAL DESIGN GUIDELINES FOR RELATION SCHEMAS We discuss four informal measures of quality for relation schema design in

More information

Relational Database Design Theory

Relational Database Design Theory Relational Database Design Theory Informal guidelines for good relational designs Functional dependencies Normal forms and normalization 1NF, 2NF, 3NF BCNF, 4NF, 5NF Inference rules on functional dependencies

More information

Chapter 10 Functional Dependencies and Normalization for Relational Databases

Chapter 10 Functional Dependencies and Normalization for Relational Databases Chapter 10 Functional Dependencies and Normalization for Relational Databases Copyright 2004 Pearson Education, Inc. Chapter Outline 1 Informal Design Guidelines for Relational Databases 1.1Semantics of

More information

Database Design and Normalization

Database Design and Normalization Database Design and Normalization Chapter 10 (Week 11) EE562 Slides and Modified Slides from Database Management Systems, R. Ramakrishnan 1 Computing Closure F + Example: List all FDs with: - a single

More information

Introduction Decomposition Simple Synthesis Bernstein Synthesis and Beyond. 6. Normalization. Stéphane Bressan. January 28, 2015

Introduction Decomposition Simple Synthesis Bernstein Synthesis and Beyond. 6. Normalization. Stéphane Bressan. January 28, 2015 6. Normalization Stéphane Bressan January 28, 2015 1 / 42 This lecture is based on material by Professor Ling Tok Wang. CS 4221: Database Design The Relational Model Ling Tok Wang National University of

More information

Formal Languages and Automata Theory - Regular Expressions and Finite Automata -

Formal Languages and Automata Theory - Regular Expressions and Finite Automata - Formal Languages and Automata Theory - Regular Expressions and Finite Automata - Samarjit Chakraborty Computer Engineering and Networks Laboratory Swiss Federal Institute of Technology (ETH) Zürich March

More information

Introduction to Database Systems. Normalization

Introduction to Database Systems. Normalization Introduction to Database Systems Normalization Werner Nutt 1 Normalization 1. Anomalies 1. Anomalies 2. Boyce-Codd Normal Form 3. 3 rd Normal Form 2 Anomalies The goal of relational schema design is to

More information

Normalization in Database Design

Normalization in Database Design in Database Design Marek Rychly mrychly@strathmore.edu Strathmore University, @ilabafrica & Brno University of Technology, Faculty of Information Technology Advanced Databases and Enterprise Systems 14

More information

SQL DDL. DBS Database Systems Designing Relational Databases. Inclusion Constraints. Key Constraints

SQL DDL. DBS Database Systems Designing Relational Databases. Inclusion Constraints. Key Constraints DBS Database Systems Designing Relational Databases Peter Buneman 12 October 2010 SQL DDL In its simplest use, SQL s Data Definition Language (DDL) provides a name and a type for each column of a table.

More information

Database Constraints and Design

Database Constraints and Design Database Constraints and Design We know that databases are often required to satisfy some integrity constraints. The most common ones are functional and inclusion dependencies. We ll study properties of

More information

2. Basic Relational Data Model

2. Basic Relational Data Model 2. Basic Relational Data Model 2.1 Introduction Basic concepts of information models, their realisation in databases comprising data objects and object relationships, and their management by DBMS s that

More information

Normalisation. Why normalise? To improve (simplify) database design in order to. Avoid update problems Avoid redundancy Simplify update operations

Normalisation. Why normalise? To improve (simplify) database design in order to. Avoid update problems Avoid redundancy Simplify update operations Normalisation Why normalise? To improve (simplify) database design in order to Avoid update problems Avoid redundancy Simplify update operations 1 Example ( the practical difference between a first normal

More information

Normalisation to 3NF. Database Systems Lecture 11 Natasha Alechina

Normalisation to 3NF. Database Systems Lecture 11 Natasha Alechina Normalisation to 3NF Database Systems Lecture 11 Natasha Alechina In This Lecture Normalisation to 3NF Data redundancy Functional dependencies Normal forms First, Second, and Third Normal Forms For more

More information

Chapter 10. Functional Dependencies and Normalization for Relational Databases. Copyright 2007 Ramez Elmasri and Shamkant B.

Chapter 10. Functional Dependencies and Normalization for Relational Databases. Copyright 2007 Ramez Elmasri and Shamkant B. Chapter 10 Functional Dependencies and Normalization for Relational Databases Copyright 2007 Ramez Elmasri and Shamkant B. Navathe Chapter Outline 1 Informal Design Guidelines for Relational Databases

More information

Relational Normalization Theory (supplemental material)

Relational Normalization Theory (supplemental material) Relational Normalization Theory (supplemental material) CSE 532, Theory of Database Systems Stony Brook University http://www.cs.stonybrook.edu/~cse532 2 Quiz 8 Consider a schema S with functional dependencies:

More information

Theory of Relational Database Design and Normalization

Theory of Relational Database Design and Normalization Theory of Relational Database Design and Normalization (Based on Chapter 14 and some part of Chapter 15 in Fundamentals of Database Systems by Elmasri and Navathe) 1 Informal Design Guidelines for Relational

More information

Database Sample Examination

Database Sample Examination Part 1: SQL Database Sample Examination (Spring 2007) Question 1: Draw a simple ER diagram that results in a primary key/foreign key constraint to be created between the tables: CREATE TABLE Salespersons

More information

Mathematics for Computer Science/Software Engineering. Notes for the course MSM1F3 Dr. R. A. Wilson

Mathematics for Computer Science/Software Engineering. Notes for the course MSM1F3 Dr. R. A. Wilson Mathematics for Computer Science/Software Engineering Notes for the course MSM1F3 Dr. R. A. Wilson October 1996 Chapter 1 Logic Lecture no. 1. We introduce the concept of a proposition, which is a statement

More information

KNOWLEDGE FACTORING USING NORMALIZATION THEORY

KNOWLEDGE FACTORING USING NORMALIZATION THEORY KNOWLEDGE FACTORING USING NORMALIZATION THEORY J. VANTHIENEN M. SNOECK Katholieke Universiteit Leuven Department of Applied Economic Sciences Dekenstraat 2, 3000 Leuven (Belgium) tel. (+32) 16 28 58 09

More information

DATABASE NORMALIZATION

DATABASE NORMALIZATION DATABASE NORMALIZATION Normalization: process of efficiently organizing data in the DB. RELATIONS (attributes grouped together) Accurate representation of data, relationships and constraints. Goal: - Eliminate

More information

Functional Dependencies

Functional Dependencies BCNF and 3NF Functional Dependencies Functional dependencies: modeling constraints on attributes stud-id name address course-id session-id classroom instructor Functional dependencies should be obtained

More information

Carnegie Mellon Univ. Dept. of Computer Science 15-415 - Database Applications. Overview - detailed. Goal. Faloutsos CMU SCS 15-415

Carnegie Mellon Univ. Dept. of Computer Science 15-415 - Database Applications. Overview - detailed. Goal. Faloutsos CMU SCS 15-415 Faloutsos 15-415 Carnegie Mellon Univ. Dept. of Computer Science 15-415 - Database Applications Lecture #17: Schema Refinement & Normalization - Normal Forms (R&G, ch. 19) Overview - detailed DB design

More information

BCA. Database Management System

BCA. Database Management System BCA IV Sem Database Management System Multiple choice questions 1. A Database Management System (DBMS) is A. Collection of interrelated data B. Collection of programs to access data C. Collection of data

More information

An Algorithmic Approach to Database Normalization

An Algorithmic Approach to Database Normalization An Algorithmic Approach to Database Normalization M. Demba College of Computer Science and Information Aljouf University, Kingdom of Saudi Arabia bah.demba@ju.edu.sa ABSTRACT When an attempt is made to

More information

Normalisation in the Presence of Lists

Normalisation in the Presence of Lists 1 Normalisation in the Presence of Lists Sven Hartmann, Sebastian Link Information Science Research Centre, Massey University, Palmerston North, New Zealand 1. Motivation & Revision of the RDM 2. The Brouwerian

More information

Reading 13 : Finite State Automata and Regular Expressions

Reading 13 : Finite State Automata and Regular Expressions CS/Math 24: Introduction to Discrete Mathematics Fall 25 Reading 3 : Finite State Automata and Regular Expressions Instructors: Beck Hasti, Gautam Prakriya In this reading we study a mathematical model

More information

Testing LTL Formula Translation into Büchi Automata

Testing LTL Formula Translation into Büchi Automata Testing LTL Formula Translation into Büchi Automata Heikki Tauriainen and Keijo Heljanko Helsinki University of Technology, Laboratory for Theoretical Computer Science, P. O. Box 5400, FIN-02015 HUT, Finland

More information

Objectives of Database Design Functional Dependencies 1st Normal Form Decomposition Boyce-Codd Normal Form 3rd Normal Form Multivalue Dependencies

Objectives of Database Design Functional Dependencies 1st Normal Form Decomposition Boyce-Codd Normal Form 3rd Normal Form Multivalue Dependencies Objectives of Database Design Functional Dependencies 1st Normal Form Decomposition Boyce-Codd Normal Form 3rd Normal Form Multivalue Dependencies 4th Normal Form General view over the design process 1

More information

Question 1. Relational Data Model [17 marks] Question 2. SQL and Relational Algebra [31 marks]

Question 1. Relational Data Model [17 marks] Question 2. SQL and Relational Algebra [31 marks] EXAMINATIONS 2005 MID-YEAR COMP 302 Database Systems Time allowed: Instructions: 3 Hours Answer all questions. Make sure that your answers are clear and to the point. Write your answers in the spaces provided.

More information

Jordan University of Science & Technology Computer Science Department CS 728: Advanced Database Systems Midterm Exam First 2009/2010

Jordan University of Science & Technology Computer Science Department CS 728: Advanced Database Systems Midterm Exam First 2009/2010 Jordan University of Science & Technology Computer Science Department CS 728: Advanced Database Systems Midterm Exam First 2009/2010 Student Name: ID: Part 1: Multiple-Choice Questions (17 questions, 1

More information

OPRE 6201 : 2. Simplex Method

OPRE 6201 : 2. Simplex Method OPRE 6201 : 2. Simplex Method 1 The Graphical Method: An Example Consider the following linear program: Max 4x 1 +3x 2 Subject to: 2x 1 +3x 2 6 (1) 3x 1 +2x 2 3 (2) 2x 2 5 (3) 2x 1 +x 2 4 (4) x 1, x 2

More information

RELATIONAL DATABASE DESIGN

RELATIONAL DATABASE DESIGN RELATIONAL DATABASE DESIGN g Good database design - Avoiding anomalies g Functional Dependencies g Normalization & Decomposition Using Functional Dependencies g 1NF - Atomic Domains and First Normal Form

More information

SUBGROUPS OF CYCLIC GROUPS. 1. Introduction In a group G, we denote the (cyclic) group of powers of some g G by

SUBGROUPS OF CYCLIC GROUPS. 1. Introduction In a group G, we denote the (cyclic) group of powers of some g G by SUBGROUPS OF CYCLIC GROUPS KEITH CONRAD 1. Introduction In a group G, we denote the (cyclic) group of powers of some g G by g = {g k : k Z}. If G = g, then G itself is cyclic, with g as a generator. Examples

More information

INCIDENCE-BETWEENNESS GEOMETRY

INCIDENCE-BETWEENNESS GEOMETRY INCIDENCE-BETWEENNESS GEOMETRY MATH 410, CSUSM. SPRING 2008. PROFESSOR AITKEN This document covers the geometry that can be developed with just the axioms related to incidence and betweenness. The full

More information

MCQs~Databases~Relational Model and Normalization http://en.wikipedia.org/wiki/database_normalization

MCQs~Databases~Relational Model and Normalization http://en.wikipedia.org/wiki/database_normalization http://en.wikipedia.org/wiki/database_normalization Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy. Normalization usually involves

More information

C H A P T E R Regular Expressions regular expression

C H A P T E R Regular Expressions regular expression 7 CHAPTER Regular Expressions Most programmers and other power-users of computer systems have used tools that match text patterns. You may have used a Web search engine with a pattern like travel cancun

More information

Information, Entropy, and Coding

Information, Entropy, and Coding Chapter 8 Information, Entropy, and Coding 8. The Need for Data Compression To motivate the material in this chapter, we first consider various data sources and some estimates for the amount of data associated

More information

COMP 378 Database Systems Notes for Chapter 7 of Database System Concepts Database Design and the Entity-Relationship Model

COMP 378 Database Systems Notes for Chapter 7 of Database System Concepts Database Design and the Entity-Relationship Model COMP 378 Database Systems Notes for Chapter 7 of Database System Concepts Database Design and the Entity-Relationship Model The entity-relationship (E-R) model is a a data model in which information stored

More information

Automata and Formal Languages

Automata and Formal Languages Automata and Formal Languages Winter 2009-2010 Yacov Hel-Or 1 What this course is all about This course is about mathematical models of computation We ll study different machine models (finite automata,

More information

Graham Kemp (telephone 772 54 11, room 6475 EDIT) The examiner will visit the exam room at 15:00 and 17:00.

Graham Kemp (telephone 772 54 11, room 6475 EDIT) The examiner will visit the exam room at 15:00 and 17:00. CHALMERS UNIVERSITY OF TECHNOLOGY Department of Computer Science and Engineering Examination in Databases, TDA357/DIT620 Tuesday 17 December 2013, 14:00-18:00 Examiner: Results: Exam review: Grades: Graham

More information

Module 5: Normalization of database tables

Module 5: Normalization of database tables Module 5: Normalization of database tables Normalization is a process for evaluating and correcting table structures to minimize data redundancies, thereby reducing the likelihood of data anomalies. The

More information

Answer Key. UNIVERSITY OF CALIFORNIA College of Engineering Department of EECS, Computer Science Division

Answer Key. UNIVERSITY OF CALIFORNIA College of Engineering Department of EECS, Computer Science Division Answer Key UNIVERSITY OF CALIFORNIA College of Engineering Department of EECS, Computer Science Division CS186 Fall 2003 Eben Haber Midterm Midterm Exam: Introduction to Database Systems This exam has

More information

You know from calculus that functions play a fundamental role in mathematics.

You know from calculus that functions play a fundamental role in mathematics. CHPTER 12 Functions You know from calculus that functions play a fundamental role in mathematics. You likely view a function as a kind of formula that describes a relationship between two (or more) quantities.

More information

WHAT ARE MATHEMATICAL PROOFS AND WHY THEY ARE IMPORTANT?

WHAT ARE MATHEMATICAL PROOFS AND WHY THEY ARE IMPORTANT? WHAT ARE MATHEMATICAL PROOFS AND WHY THEY ARE IMPORTANT? introduction Many students seem to have trouble with the notion of a mathematical proof. People that come to a course like Math 216, who certainly

More information

LiTH, Tekniska högskolan vid Linköpings universitet 1(7) IDA, Institutionen för datavetenskap Juha Takkinen 2007-05-24

LiTH, Tekniska högskolan vid Linköpings universitet 1(7) IDA, Institutionen för datavetenskap Juha Takkinen 2007-05-24 LiTH, Tekniska högskolan vid Linköpings universitet 1(7) IDA, Institutionen för datavetenskap Juha Takkinen 2007-05-24 1. A database schema is a. the state of the db b. a description of the db using a

More information

1 Uncertainty and Preferences

1 Uncertainty and Preferences In this chapter, we present the theory of consumer preferences on risky outcomes. The theory is then applied to study the demand for insurance. Consider the following story. John wants to mail a package

More information

Normalization. CIS 331: Introduction to Database Systems

Normalization. CIS 331: Introduction to Database Systems Normalization CIS 331: Introduction to Database Systems Normalization: Reminder Why do we need to normalize? To avoid redundancy (less storage space needed, and data is consistent) To avoid update/delete

More information

TOPOLOGY: THE JOURNEY INTO SEPARATION AXIOMS

TOPOLOGY: THE JOURNEY INTO SEPARATION AXIOMS TOPOLOGY: THE JOURNEY INTO SEPARATION AXIOMS VIPUL NAIK Abstract. In this journey, we are going to explore the so called separation axioms in greater detail. We shall try to understand how these axioms

More information

Quiz 3: Database Systems I Instructor: Hassan Khosravi Spring 2012 CMPT 354

Quiz 3: Database Systems I Instructor: Hassan Khosravi Spring 2012 CMPT 354 Quiz 3: Database Systems I Instructor: Hassan Khosravi Spring 2012 CMPT 354 1. [10] Show that each of the following are not valid rules about FD s by giving a small example relations that satisfy the given

More information

Design Theory for Relational Databases: Functional Dependencies and Normalization

Design Theory for Relational Databases: Functional Dependencies and Normalization Design Theory for Relational Databases: Functional Dependencies and Normalization Juliana Freire Some slides adapted from L. Delcambre, R. Ramakrishnan, G. Lindstrom, J. Ullman and Silberschatz, Korth

More information

Theory I: Database Foundations

Theory I: Database Foundations Theory I: Database Foundations 19. 19. Theory I: Database Foundations 07.2012 1 Theory I: Database Foundations 20. Formal Design 20. 20: Formal Design We want to distinguish good from bad database design.

More information

Lecture 6. SQL, Logical DB Design

Lecture 6. SQL, Logical DB Design Lecture 6 SQL, Logical DB Design Relational Query Languages A major strength of the relational model: supports simple, powerful querying of data. Queries can be written intuitively, and the DBMS is responsible

More information

Boyce-Codd Normal Form

Boyce-Codd Normal Form 4NF Boyce-Codd Normal Form A relation schema R is in BCNF if for all functional dependencies in F + of the form α β at least one of the following holds α β is trivial (i.e., β α) α is a superkey for R

More information

Chapter 7: Relational Database Design

Chapter 7: Relational Database Design Chapter 7: Relational Database Design Pitfalls in Relational Database Design Decomposition Normalization Using Functional Dependencies Normalization Using Multivalued Dependencies Normalization Using Join

More information

11 Ideals. 11.1 Revisiting Z

11 Ideals. 11.1 Revisiting Z 11 Ideals The presentation here is somewhat different than the text. In particular, the sections do not match up. We have seen issues with the failure of unique factorization already, e.g., Z[ 5] = O Q(

More information

Chapter 23. Database Security. Security Issues. Database Security

Chapter 23. Database Security. Security Issues. Database Security Chapter 23 Database Security Security Issues Legal and ethical issues Policy issues System-related issues The need to identify multiple security levels 2 Database Security A DBMS typically includes a database

More information

Notes on Complexity Theory Last updated: August, 2011. Lecture 1

Notes on Complexity Theory Last updated: August, 2011. Lecture 1 Notes on Complexity Theory Last updated: August, 2011 Jonathan Katz Lecture 1 1 Turing Machines I assume that most students have encountered Turing machines before. (Students who have not may want to look

More information

The process of database development. Logical model: relational DBMS. Relation

The process of database development. Logical model: relational DBMS. Relation The process of database development Reality (Universe of Discourse) Relational Databases and SQL Basic Concepts The 3rd normal form Structured Query Language (SQL) Conceptual model (e.g. Entity-Relationship

More information

Relational Normalization: Contents. Relational Database Design: Rationale. Relational Database Design. Motivation

Relational Normalization: Contents. Relational Database Design: Rationale. Relational Database Design. Motivation Relational Normalization: Contents Motivation Functional Dependencies First Normal Form Second Normal Form Third Normal Form Boyce-Codd Normal Form Decomposition Algorithms Multivalued Dependencies and

More information

a 11 x 1 + a 12 x 2 + + a 1n x n = b 1 a 21 x 1 + a 22 x 2 + + a 2n x n = b 2.

a 11 x 1 + a 12 x 2 + + a 1n x n = b 1 a 21 x 1 + a 22 x 2 + + a 2n x n = b 2. Chapter 1 LINEAR EQUATIONS 1.1 Introduction to linear equations A linear equation in n unknowns x 1, x,, x n is an equation of the form a 1 x 1 + a x + + a n x n = b, where a 1, a,..., a n, b are given

More information

[Refer Slide Time: 05:10]

[Refer Slide Time: 05:10] Principles of Programming Languages Prof: S. Arun Kumar Department of Computer Science and Engineering Indian Institute of Technology Delhi Lecture no 7 Lecture Title: Syntactic Classes Welcome to lecture

More information

Lecture 2 Normalization

Lecture 2 Normalization MIT 533 ระบบฐานข อม ล 2 Lecture 2 Normalization Walailuk University Lecture 2: Normalization 1 Objectives The purpose of normalization The identification of various types of update anomalies The concept

More information

Introduction to Database Systems. Chapter 4 Normal Forms in the Relational Model. Chapter 4 Normal Forms

Introduction to Database Systems. Chapter 4 Normal Forms in the Relational Model. Chapter 4 Normal Forms Introduction to Database Systems Winter term 2013/2014 Melanie Herschel melanie.herschel@lri.fr Université Paris Sud, LRI 1 Chapter 4 Normal Forms in the Relational Model After completing this chapter,

More information