Introduction to Database Systems. Chapter 4 Normal Forms in the Relational Model. Chapter 4 Normal Forms



Similar documents
Database Design and Normalization

Relational Database Design: FD s & BCNF

Introduction to Databases, Fall 2005 IT University of Copenhagen. Lecture 5: Normalization II; Database design case studies. September 26, 2005

Normalisation to 3NF. Database Systems Lecture 11 Natasha Alechina

COSC344 Database Theory and Applications. Lecture 9 Normalisation. COSC344 Lecture 9 1

Relational Database Design

Chapter 7: Relational Database Design

normalisation Goals: Suppose we have a db scheme: is it good? define precise notions of the qualities of a relational database scheme

Functional Dependencies and Finding a Minimal Cover

Introduction to Database Systems. Normalization

Databases -Normalization III. (N Spadaccini 2010 and W Liu 2012) Databases - Normalization III 1 / 31

Design of Relational Database Schemas

Schema Refinement, Functional Dependencies, Normalization

Database Management System

Database Design and Normal Forms

Chapter 15 Basics of Functional Dependencies and Normalization for Relational Databases

DATABASE NORMALIZATION

Database Management Systems. Redundancy and Other Problems. Redundancy

CS 377 Database Systems. Database Design Theory and Normalization. Li Xiong Department of Mathematics and Computer Science Emory University

Lecture Notes on Database Normalization

Chapter 10. Functional Dependencies and Normalization for Relational Databases

Normalization in Database Design

Schema Refinement and Normalization

Theory behind Normalization & DB Design. Satisfiability: Does an FD hold? Lecture 12

Schema Design and Normal Forms Sid Name Level Rating Wage Hours

Functional Dependencies and Normalization

Theory of Relational Database Design and Normalization

Functional Dependencies

Chapter 8. Database Design II: Relational Normalization Theory

CS143 Notes: Normalization Theory

6.830 Lecture PS1 Due Next Time (Tuesday!) Lab 1 Out today start early! Relational Model Continued, and Schema Design and Normalization

CSCI-GA Database Systems Lecture 7: Schema Refinement and Normalization

Week 11: Normal Forms. Logical Database Design. Normal Forms and Normalization. Examples of Redundancy

Functional Dependency and Normalization for Relational Databases

Relational Database Design Theory

Normalization. Reduces the liklihood of anomolies

Chapter 5: FUNCTIONAL DEPENDENCIES AND NORMALIZATION FOR RELATIONAL DATABASES

Limitations of E-R Designs. Relational Normalization Theory. Redundancy and Other Problems. Redundancy. Anomalies. Example

Chapter 10. Functional Dependencies and Normalization for Relational Databases. Copyright 2007 Ramez Elmasri and Shamkant B.

Chapter 7: Relational Database Design

Why Is This Important? Schema Refinement and Normal Forms. The Evils of Redundancy. Functional Dependencies (FDs) Example (Contd.)

RELATIONAL DATABASE DESIGN

Introduction Decomposition Simple Synthesis Bernstein Synthesis and Beyond. 6. Normalization. Stéphane Bressan. January 28, 2015

DATABASE DESIGN: NORMALIZATION NOTE & EXERCISES (Up to 3NF)

C# Cname Ccity.. P1# Date1 Qnt1 P2# Date2 P9# Date9 1 Codd London Martin Paris Deen London

Boyce-Codd Normal Form

Normalization. CIS 331: Introduction to Database Systems

Chapter 10 Functional Dependencies and Normalization for Relational Databases

Objectives of Database Design Functional Dependencies 1st Normal Form Decomposition Boyce-Codd Normal Form 3rd Normal Form Multivalue Dependencies

An Algorithmic Approach to Database Normalization

BCA. Database Management System

Relational Normalization Theory (supplemental material)

Jordan University of Science & Technology Computer Science Department CS 728: Advanced Database Systems Midterm Exam First 2009/2010

DBMS. Normalization. Module Title?

Introduction to Databases

Normalisation. Why normalise? To improve (simplify) database design in order to. Avoid update problems Avoid redundancy Simplify update operations

How To Find Out What A Key Is In A Database Engine

Normalization. Functional Dependence. Normalization. Normalization. GIS Applications. Spring 2011

Database Systems Concepts, Languages and Architectures

Design Theory for Relational Databases: Functional Dependencies and Normalization

LiTH, Tekniska högskolan vid Linköpings universitet 1(7) IDA, Institutionen för datavetenskap Juha Takkinen

14 Databases. Source: Foundations of Computer Science Cengage Learning. Objectives After studying this chapter, the student should be able to:

The process of database development. Logical model: relational DBMS. Relation

Theory of Relational Database Design and Normalization

SQL DDL. DBS Database Systems Designing Relational Databases. Inclusion Constraints. Key Constraints

Announcements. SQL is hot! Facebook. Goal. Database Design Process. IT420: Database Management and Organization. Normalization (Chapter 3)

Database Design and Normalization

IT2305 Database Systems I (Compulsory)

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer

Conceptual Design Using the Entity-Relationship (ER) Model


Database Design. Marta Jakubowska-Sobczak IT/ADC based on slides prepared by Paula Figueiredo, IT/DB

Normalisation 6 TABLE OF CONTENTS LEARNING OUTCOMES

Database Constraints and Design

Chapter 5: Logical Database Design and the Relational Model Part 2: Normalization. Introduction to Normalization. Normal Forms.

Normalization of Database

Part 6. Normalization

Tutorial on Relational Database Design

Topic 5.1: Database Tables and Normalization

Advanced Relational Database Design

Lecture 6. SQL, Logical DB Design

Lecture 2 Normalization

Normal forms and normalization

Chapter 9: Normalization

The Relational Model. Ramakrishnan&Gehrke, Chapter 3 CS4320 1

Normalization. Normalization. Normalization. Data Redundancy

UNDERSTANDING NORMALIZATION

Unit 3.1. Normalisation 1 - V Normalisation 1. Dr Gordon Russell, Napier University

Module 5: Normalization of database tables

Limitations of DB Design Processes

Once the schema has been designed, it can be implemented in the RDBMS.

Normalisation 1. Chapter 4.1 V4.0. Napier University

Determination of the normalization level of database schemas through equivalence classes of attributes

MCQs~Databases~Relational Model and Normalization

KNOWLEDGE FACTORING USING NORMALIZATION THEORY

Chapter 6. Database Tables & Normalization. The Need for Normalization. Database Tables & Normalization

DATABASE SYSTEMS. Chapter 7 Normalisation

Transcription:

Introduction to Database Systems Winter term 2013/2014 Melanie Herschel melanie.herschel@lri.fr Université Paris Sud, LRI 1 Chapter 4 Normal Forms in the Relational Model After completing this chapter, you should be able to work with functional dependencies (FDs): define them, detect them in DB schemas, decide implication, determine keys. explain insert, update, and delete anomalies, explain BCNF, test a given relation for BCNF, and transform a relation into BCNF 1. Introduction 2. ER-Modeling 3. Relational model(ing) 4. Relational algebra 5. SQL 2 Chapter 4 Normal Forms Functional Dependencies (FDs) Anomalies, FD-based Normal Forms 3

Introduction Relational database design theory is mainly based on a class of constraints called Functional Dependencies (FDs). FDs are a generalization of keys. This theory defines when a relation is in normal form (e.g., in Third Normal Form or 3NF) for a given set of FDs. It is usually a sign of bad DB design if a schema contains relations that violate the normal form requirements. There are exceptions and tradeoffs, however. 4 Introduction If a normal form is violated, data is stored redundantly, and information about different concepts is intermixed. Redundant data storage Movie MID Title Year Studio StudioAddress 1 The Incredibles 2004 Disney Burbank, CA 2 Bambi 1942 Disney Burbank, CA 3 Avatar 2010 20th Century Century City, CA The studio address is stored multiple times (once for each movie the studio has produced). The table also mixes the concepts of Studios and Movies. 5 Introduction Today, the Third Normal Form (3NF) is considered the standard relational normal form used in practice (and education). Boyce-Codd Normal Form (BCNF) is a bit more restrictive, easier to define, and will be a better match for our intuition of good DB design. In rare circumstances, a relation might not have an equivalent BCNF form while preserving all its FDs (this is always guaranteed for 3NF). In a nutshell, BNCF requires that all FDs are already enforced (represented) by keys. 6

Introduction Normalization algorithms can construct good relation schemas from a set of attributes and a set of FDs alone. In practice, relations are derived from ER models and normalization is used as an additional check (for BCNF) only. When an ER model is well designed, the resulting derived relational tables will automatically be in BCNF. Awareness of normal forms can help to detect design errors already in the ER design phase. 7 First Normal Form First normal form (1NF) First Normal Form (1NF) requires that all table entries are atomic (not lists, sets, records, relations) The relational model is already defined this way. All further normal forms assume that tables are in 1NF. A few modern DBMS allow table entries to be non-atomic (structured). Such systems are commonly referred to as NF 2 DBMS (NF 2 = NFNF = Non-First Normal Form). 8 First Normal Form NF 2 table Actor AID Name DOB Movies 1 Pitt 4.6.75 Title Year Rating Babel 2006 7 Inglorious Basterds 2009 8 2 Jolie 18.12.63 Title Year Rating Wanted 2008 3 Note: A discussion of 1NF does not really belong here (1NF is not based on functional dependencies at all). 9

First Normal Form It is not a violation of 1NF if 1. A table entry contains values of a domain that exhibit an internal structure (e.g., all Genres of a Movie are collected in a VARCHAR(100) attribute and are separated by spaces), 2. Related information is represented by several columns for a single row (e.g., columns Genre1, Genre2, Genre3 to represent up to three genres a of a movie). These are signs of bad design anyway. Why? 10 Functional Dependencies An instance of a functional dependency (FD) in our latest example tables is Studio! StudioAddress Whenever two rows of a relation agree in the the attribute Studio, they must also agree in the StudioAddress column values: Redundant data storage Movie MID Title Year Studio StudioAddress 1 The Incredibles 2004 Disney Burbank, CA 2 Bambi 1942 Disney Burbank, CA 3 Avatar 2010 20th Century Century City, CA 11 Functional Dependencies Intuitively, the reason for the validity of Studio! StudioAddress is that studio address only depends on the studio itself, not on other movie data. This FD is read as Studio {functionally, uniquely} determines StudioAddress. Also: Studio is a determinant for StudioAddress. Saying that A is a determinant for B is actually stronger than the FD A B since A must be minimal and A B (see below). A determinant is like a partial key: it uniquely determines some attributes, but not all in general. 12

Functional Dependencies A key uniquely determines all attributes of its relation. In the example, the FDs MID! TITLE, MID! Year, MID! Studio, MID! StudioAddress hold, too. There will never be two distinct rows with the same MID, so the FD condition is trivially satisfied. The FD Studio! Title is not satisfied, for example. At least two rows agreeing on Studio disagree on Title: Redundant data storage Movie MID Title Year Studio StudioAddress 1 The Incredibles 2004 Disney Burbank, CA 2 Bambi 1942 Disney Burbank, CA 3 Avatar 2010 20th Century Century City, CA 13 Functional Dependencies In general, an FD takes the form A1,...,An B1,...,Bm Sequence and multiplicity of attributes in an FD are unimportant, since both sides formally are sets of attributes: {A1,...,An} {B1,...,Bm} In discussing FDs, the focus is on a single relation R. All Ai, Bj are attributes of the same relation R. 14 Functional Dependencies Functional dependency The functional dependency (FD) A1,..., An B1,..., Bm holds for a relation R in a database state I if and only if for all tuples t, u I(R): t.a1 = u.a1... t.an = u.an t.b1 = u.b1... t.bm=u.bm 15

Functional Dependencies An FD with m attributes on the right-hand side (rhs) is equivalent to the m FDs: A1,..., An B1,..., Bm A1,..., An B1 A1,..., An Bm Thus, in the following discussion it suffices to consider FDs with a single column name on the rhs. 16 Functional Dependencies are Constraints For the DB design process, the only interesting FDs are those that hold for all possible database states. FDs are constraints (like keys). Non-interesting FDs Movie MID Title Year Studio StudioAddress 1 The Incredibles 2004 Disney Burbank, CA 2 Bambi 1942 Disney Burbank, CA 3 Avatar 2010 20th Century Century City, CA In this example state, FD Title! Year holds. But this is not true in general (movies with equal titles may appear in different years). It is a task of DB design to verify that a FD is not mere coincidence. 17 FDs vs. Keys FDs are a generalization of keys: A1,..., An is a key of relation R(A1,..., An, B1,..., Bm) A1,..., An B1,..., Bm holds Given the FDs for a relation R, one can derive a key for R, i.e., as a set of attributes A1,..., An that functionally determines the other attributes of R (see below). 18

Example The following table holds information about All-time-favorite movies and their respective actors. Identifying FDs Movies Actor No Title Year MID Tim Robbins 1 The Shawshank Redemption 1994 1 Morgan Freeman 2 The Shawshank Redemption 1994 1 Marlon Brando 1 The Godfather 1972 2 Al Pacino 2 The Godfather 1972 2 James Caan 3 The Godfather 1972 2 Several actors can star in a movie and we have one actor per row. Attribute No is used to identify the position of an actor in the movie credits. 19 Example The MID uniquely identifies a movie. Thus, the following FD holds MID! Title, Year Equivalently, we could use two FDs: MID! Title MID! Year Since a movie may have several actors, the following FD does not hold: MID! Actor 20 Example One actor can star in many movies, thus the following FD may not be assumed although it happens to hold in the given example DB state: Actor! Title It is possible that there are movies with the same title but with different actors or with different years. So Title determines no other attribute. For every movie (we now know that movies are identified by MID) there can only be one first (second, third,...) actor: MID, No! Actor At first glance, the actor of any given movie is also uniquely assigned a position in the credit sequence. MID, Actor! Num However, this is violated by credits that include a same actor twice (e.g., Tracey Ullman speaks both Neil van Dort and Hildegarde in Corpse Bride). 21

Example We might also be tempted to assume the FD Title, Year, No! Actor If, in a TV series the credit sequence changes over a season, we are in trouble. During DB design, only unquestionable conditions should be used as FDs. DB normalization alters the table structure depending on the specified FDs. If it turns out later that an FD indeed does not hold, it is not sufficient to simply remove a constraint (e.g., a key) from the schema with an ALTER TABLE SQL statement. Instead, tables will be created or deleted, data will be copied, and application programs may have to be changed. 22 FD Quiz FD-Quiz 1.Which FDs should hold for this table in general? 2.Identify an FD that holds in this state but not in general. MovieShows CinemaID CName CAddress Showtime Showroom Capacity MID 1 Ufa Stuttgart 14:00 A 50 1 1 Ufa Stuttgart 16:00 A 50 1 1 Ufa Stuttgart 14:00 B 100 2 2 Cinestar Berlin 20:00 X 200 1 2 Cinestar Berlin 20:00 Y 150 3 3 Korso Stuttgart 19:00 1 50 3 23 FD Quiz - Notes 24

Implication of FDs Note: the FD MID! StudioAddress is a consequence since we already knew MID! Studio and Studio! StudioAddress. Whenever A B and B C hold, A C is automatically satisfied. Studio! Studio trivially holds, but is not interesting. FDs of the form A A always hold (for every DB state). Implication of FDs A set of FDs {α1 β1,...,αn βn} implies an FD α β if and only if every DB state which satisfies αi βi,1 i n, also satisfies α β. 25 Implication of FDs The DB designer is normally not interested in all FDs which hold, but only in a representative FD set that implies all other FDs. Armstrong Axioms Reflexivity: If β α, then α β. Augmentation: If α β, then α γ β γ. Transitivity: If α β and β γ, then α γ. 26 Implication of FDs However, a simpler way to check whether α β is implied by a given FD set is to compute the cover α + of α and then to check if β α +. Cover The cover α + of a set of attributes α is the set of all attributes B that are uniquely determined by the attributes α (with respect to a given FD set F): α + := {B F implies α B} If necessary, write α + F to make the dependence on F explicit. A set of FDs F implies α β if and only if β α + F. 27

Implication of FDs Cover computation Input:! α (set of attributes) α1! β1,...,αn! βn (set of FDs F) Output:! α + (set of attributes) x α; while x did change do foreach given FD αi! βi do if αi x then x x βi ; fi od od return x; 28 Implication of FDs Compute {MID} + Consider the following FDs: MID! Title, Studio MID, No! Actor Studio! StudioAddress 1. We start with x = {MID}. 2. The FD MID! Title, Studio has a lhs that is completely contained in the current attribute set x. 3. Extend x by the FD s rhs attribute set, giving x = {MID, Title, Studio}. 4. Now, the FD Studio! StudioAddress becomes applicable. 5. Add rhs attribute set of the FD to the current attribute set x, giving x = {MID, Title, Studio, StudioAddress} 6. The FD MID, No! Actor is still not applicable (attribute No x). 7. No further way to extend set x, the algorithm returns {MID} + = {MID, Title, Studio, StudioAddress} 8. We may now conclude, e.g., MID! StudioAddress 29 How to Determine Keys Given a set of FDs and the set of all attributes A of a relation R, one can determine all possible keys of R: α A is key of R α + = A Remember: Normally, we are interested in minimal keys only. A superset of a key is a key again, e.g., if {Title, Year} is key for relation Movie then so is {Title, Year, Length}). We thus usually require that every A α is vital, i.e., (α {A}) + A. 30

How to Determine Keys 1. Start to construct a key x iteratively with the empty attribute set (x = ). 2. In order to avoid non-minimal keys, make sure that the set x never contains the rhs B of an FD α B when x already contains the lhs, i.e., α x. Without loss of generality, we assume the rhs of all FDs to consist of a single attribute only. 3. As long as x does not yet determine all attributes of R (i.e., x + A), choose any of the remaining attributes X A x +. 4. Now the process splits (need to follow both alternatives): (a)simply add X to x. (b)for each FD α X,add α to x. The search for the key branches. We need to backtrack to such a branch if We run into a dead-end (determine a non-minimal key), or We have found a key and want to look for an alternative minimal key. 31 How to Determine Keys Key Computation (call with x = ) procedure key(x,a,f) foreach α! B F do if α x and B x and B α then return; /* x not minimal */ fi od if x + = A then print x; /*found a minimal key x*/ else X any element of A x + ; od fi key(x {X}, A, F); foreach α!x F do key (x α, A, F); 32 FD Quiz Determine keys for a relation Assume F = { MovieID, CreditPos Actor, MovieID title } MovieActors MovieID Title CreditPos Actor 1 Star Wars 1 Mark Hamill 1 Star Wars 2 Harrison Ford 2 Star Trek 1 Patrick Stewart 2 Star Trek 2 Johnathan Frakes 1. Does F imply MovieID, CreditPos title? 2.Determine a key for relation MovieActors 33

FD Quiz - Notes 34 Determinants Determinants (Non-trivial FDs) The attribute set A1,...,An is called a determinant for attribute set B1,...,Bm if and only if the FD A1,...,An B1,...Bm holds, and the lhs is minimal, i.e., whenever any Ai is removed then A1,...,Ai 1,Ai+1,An B1,...Bm does not hold, and the lhs and rhs are distinct, i.e., {A1,...,An} {B1,...,Bm}. 35 Chapter 4 Normal Forms Functional Dependencies (FDs) Anomalies, FD-based Normal Forms 36

Consequences of Bad DB Design It is normally a severe sign of bad DB design if a table contains an FD (encodes a partial function) that is not implied by a key (e.g., Studio StudioAddress). Among other problems, this leads to the redundant storage of certain facts (here: studio addresses): Redundant data storage Movie MID Title Year Studio StudioAddress 1 The Incredibles 2004 Disney Burbank, CA 2 Bambi 1942 Disney Burbank, CA 3 Avatar 2010 20th Century Century City, CA 37 Consequences of Bad DB Design If a schema permits redundant data, additional constraints need to be specified to ensure that the redundant copies indeed agree. In the example, this constraint is precisely the FD Studio StudioAddress. General FDs, however, are not standard constraints of the relational model and thus unsupported by RDBMS. The solution is to transform FDs into key constraints. This is what DB normalization tries to do. 38 Consequences of Bad DB Design Update Anomalies Update anomalies: When a single mini-world fact needs to be changed (e.g., a change in the studio address for Disney), multiple tuples must be updated. This complicates application programs. Redundant copies potentially get out of sync and it is impossible/hard to identify the correct information. Application programs unexpectedly receive multiple answers when a single answer was expected: SELECT INTO breaks due to unexpected result SELECT DISTINCT StudioAddress INTO :address FROM MOVIE WHERE Studio = Disney 39

Consequences of Bad DB Design Insertion Anomalies Insertion anomalies: The address of a new studio cannot be inserted into the database until it is known what movies it will produce. Since MID is a key, it cannot be set to NULL. Such insertion anomalies arise when unrelated concepts (here: movies and studios) are stored together in a single table. Remember: a relational table should encode a single (partial) function. 40 Consequences of Bad DB Design Deletion Anomalies Deletion anomalies: When the last movie of a studio is deleted, its address is lost. Even if MID could be set to NULL: it would be strange that all movies produced by a studio may be deleted but the last. Again: such deletion anomalies arise when unrelated concepts (here: movies and studios) are stored together in a single table. 41 Boyce-Codd Normal Form Boyce-Codd Normal Form A relation R is in Boyce-Codd Normal Form (BCNF) if and only if all its FDs are already implied by its key constraints. For relation R in BCNF and any FD A1,...,An B1,...,Bm of R one of the following statements hold: 1. the FD is trivial, i.e., {B1,...,Bm} {A1,...,An}, or 2. the FD follows from a key, because {A1,...,An} or a subset of it is already a key of R. Intuitively: Each attribute must provide a fact about the key, the whole key, and nothing but the key. BCNF every determinant is a possible key of R. If a relation R is in BCNF, ensuring its key constraints automatically satisfies all of R s FDs. The three anomalies (update / insertion / deletion) do not occur. 42

Boyce-Codd Normal Form BCNF example The relation with FDs MOVIE(MID, Title, Studio, StudioAddress) MID Title, Studio, StudioAddress Studio StudioAddress is not in BCNF because the FD Studio StudioAddress is not implied by a key: Studio is not a key of the entire relation The FD is not trivial. However, without the attribute StudioAddress (and the second FD) the relation MOVIE(MID, Title, Studio) is in BCNF. 43 Boyce-Codd Normal Form BCNF example A movie set can be booked by at most one movie at a time. MovieSet(MID, Title, Date, Time, Location) The relation thus satisfies the following FDs (plus implied ones): The keys of MovieSet are: MID Date, Time, Location MID Title, Date, Time, Location Date, Time, Location MID Both FDs are implied by keys (in this case, their lhs even coincide with the keys), thus MovieSet is in BCNF. 44 Boyce-Codd Normal Form BCNF example Consider the relation and the following FDs: ACTOR(AID, Name, Fortune) AID Name, AID Fortune Name, Fortune Name AID, Name Fortune This relation is in BCNF: The two first FDs indicate that AID is a key. Both FDs are thus implied by a key. The second FD is trivial (and may be ignored). The lhs of the last FD contains a key and is thus also implied by a key. 45

BCNF Quiz BCNF Quiz 1. Is Movie(MID, Title, CreditPos, Actor) with the following FDs in BCNF? MID, CreditPos Actor MID Title 2. Is Invoice(Inv_No, Date, Amount, Cust_No, Cust_Name) with the following FDs in BCNF? Inv_No Date, Amount, Cust_No Inv_No, Date Cust_Name Cust_No Cust_Name Date, Amount Date 46 BCNF Quiz - Notes 47 Third Normal Form Third Normal Form (3NF) is slightly weaker than BCNF: if a relation is in BCNF, it is automatically in 3NF. The differences are small. For most practical applications, the two can be considered as equivalent. BCNF is a clear concept: we want no non-trivial FDs except those which are implied by a key constraint. In some rare scenarios, this preservation of FDs (see below) is lost when a relation is transformed into BCNF. This never happens with 3NF. 48

Third Normal Form Third Normal Form A relation R is in Third Normal Form (3NF) if and only if every FD A1,...,An B (assume that FDs with multiple attributes on rhs have been expanded) satisfies at least one of the following conditions: (1)The FD is trivial, i.e., B {A1,...,An},or (2)The FD follows from a key, because {A1,...,An} or some subset of it is already a key of R, or (3)B is a key attribute, i.e., element of a candidate key of R. Intuitively: Each non-key attribute must provide a fact about the key, the whole key, and nothing but the key. The only difference to BCNF is the additional clause (3). An attribute is called a non-key attribute (of R), if it does not appear in any minimal key of R. The minimality requirement is important here. Otherwise all attributes of R would qualify as key attributes. All attributes that are not non-key attributes are key attributes. 3NF means that every determinant of a non-key attribute is a key (see clauses (2), (3)). 49 Third Normal Form 3NF example Given the relation assume the following FDs: The keys of Shows are: Title, City Cinema, Title Shows(Title, Cinema, City) Cinema City Title, City Cinema (assumes a movie is not shown twice in the same city) The first FD violates BCNF, because Cinema is not a super-key. However, Shows is in 3NF because City is a key-attribute (part of the first key). Note: whenever keys do not overlap and 3NF is satisfied, then the relation automatically also satisfies BCNF. 50 Second Normal Form Second Normal Form A relation is in Second Normal Form (2NF) if and only if every non-key attribute depends on the complete key. The Second Normal Form (2NF) is only interesting for historical reasons. If a relation is in 3NF or BCNF, it is automatically in 2NF. 2NF is too weak, e.g., the Movie(MID, Title, Year, Studio, StudioAddress) example actually satisfies 2NF (although it exhibits anomalies). 51

Second Normal Form A relation is in 2NF if and only if the given FDs do not imply an FD A1,...,An B such that 1. A1,...,An is a strict subset of a minimal key, and 2. B is not contained in any minimal key (i.e., is a non-key attribute). 2NF counter example The MovieActors table is not in 2NF: Title is already uniquely determined by MovieID, i.e., a part of the key. MovieActors MovieID Title CreditPos Actor 1 Star Wars 1 Mark Hamill 1 Star Wars 2 Harrison Ford 2 Star Trek 1 Patrick Stewart 2 Star Trek 2 Johnathan Frakes 52 Splitting Relations If a table R is not in BCNF, we can split it into two tables (decomposition). The violating FD determines how to split. If the FD A1,...,An B1,...Bm violates BCNF, create a new relation S(A1,...,An,B1,...,Bm) and remove B1,...,Bm from the original relation R. Splitting along an FD Remember: the FD is the reason why table violates BCNF. Split the table into Studio StudioAddress Movie(MID, Title, Studio, StudioAddress) MovieList(MID, Title, Studio) AddressList (Studio, StudioAddress) 53 Splitting Relations It is important that this splitting transformation is lossless, i.e., that the original relation can be reconstructed by means of a (natural) join: Splitting along an FD Movie = MovieList AddressList CREATE VIEW Movie (MID, TITLE, Studio, StudioAddress) AS SELECT M.MID, M.TITLE, M.Studio, A.StudioAddress FROM MovieList M, AddressList A WHERE M.Studio = A.Studio 54

Splitting Relations Decomposition Theorem The split of relations is guaranteed to be lossless if the intersection of the attributes of the new tables is a key of at least one of them. The join connects tuples depending on the attribute (values) in the intersection. If these values uniquely identify tuples in the other relation we do not lose information. Lossy decomposition Original table Decomposition Reconstruction (key A, B, C) R 1 R 2 R 1 on R 2 A B C A B A C A B C a 11 b 11 c 11 a 11 b 11 a 11 c 11 a 11 b 11 c 11 a 11 b 11 c 12 a 11 b 12 a 11 c 12 a 11 b 11 c 12 a 11 b 12 c 11 a 11 b 12 c 11 a 11 b 12 c 12 55 Splitting Relations Lossless split condition satisfied {MID, Title, Studio} {Studio, StudioAddress} = {Studio} All splits initiated by the above method for transforming relations into BCNF satisfy the condition of the decomposition theorem. It is always possible to transform a relation into BCNF by lossless splitting (if necessary, split repeatedly). 56 Splitting Relations However, not every lossless split is reasonable: Example Movie MID Title Year 1 The Incredibles 2004 2 Bambi 1942 3 Avatar 2010 Splitting Movie into MovieTitle(MID, Title) and MovieYear(MID, Year) is lossless, but the split is not necessary to enforce a normal form, only requires costly joins in subsequent queries. Full splitting down to binary tables on the physical DB level, however, leads to highperformance DBMS implementations. 57

Splitting Relations Losslessness means that the resulting schema (after splitting) can represent all DB states which were possible before. We can translate states from the old schema into the new schema and back. The new schema supports all queries supported by the old schema. However, the new schema allows states which do not correspond to a state in the old schema (we may now store studios and studio addresses without any affiliation to a movie). The new schema is more general. If studios without movies are possible in the real world, the decomposition removes a fault in the old schema (insertion and update anomalies). If they are not, - a new constraint is needed that is not necessarily easier to enforce than the FD, but - at least the redundancy is avoided (update anomaly). 58 Splitting Relations Finally, note that although computable columns (such as Age which is derivable from Birthdate) lead to violations of BCNF, splitting the relation is not the right solution. The split would yield a relation R(Birthday, Age) which would try to materialize the computable function. The correct solution is to eliminate Age from the original table and to define a view which contains all columns plus the computed column Age (invoking a SQL stored procedure). 59 Preservations of FDs Besides losslessness, a property that a good decomposition of a relation should guarantee is the preservation of FDs. The problem is that an FD can refer only to attributes of a single relation. When you split a relation into two, there might be FDs that can no longer be expressed (these FDs are not preserved). FD gets lost during decomposition Consider Addresses(Street, City, State, ZIP) with FDs Street, City, State! ZIP ZIP! State The second FD violates BCNF and would lead us to split the relation into Addresses1(Street, City, ZIP) and Addresses2(ZIP, State). But now the first FD can no longer be expressed! 60

Preservation of FDs Probably, most designers would not split the table (the table is in 3NF but violates BCNF). If there are many addresses with the same ZIP code, there will be significant redundancy. If you split, queries will involve more joins. If this were a DB for a mailing company, a separate table of ZIP codes might be of interest on its own. Then the split looks feasible. 61 3NF Synthesis Algorithm The following algorithm (divided in 2 sub-algorithms on this and the next slide), namely the synthesis algorithm, produces a lossless decomposition of a given relation R into 3NF relations that preserves the FDs. Determine a minimal (canonical) set of FDs that is equivalent to the given FDs (a) Replace every FD α B1,..., Bmby the corresponding FDs α Bi, 1 i m. Let the result be F. (b) Minimize the lhs of every FD in F. For each FD A1,...,An B and each i = 1,...,n compute {A1,...,Ai 1,Ai+1,...,An} + F. If the result contains B, replace F by (F {A1,...,An B}) {A1,...,Ai 1,Ai+1,...,An B}, and repeat. (c) Remove implied FDs. For each FD α B, compute the cover α + F where F = F {α B}. If the cover contains B (i.e., the FD is already implied by the FDs in F), continue with F F. 62 3NF Synthesis Algorithm Synthesis Algorithm (1) Compute a canonical (minimal) set of FDs F (see (a) - (c) on previous slide). (2) For each lhs α of an FD in F create a relation with attributes A = α {B α B F }. (3) If none of the relations constructed in step (2) contains a key of the original relation R, add one relation containing the attributes of a key of R. (4) For any two relations R1,2 constructed in steps (2), (3), if the schema of R1 is contained in the schema of R2, discard R1. 63

Summary Tables should not contain FDs other than those implied by the keys (i.e., all tables should be in BCNF). Such violating FDs indicate the combination of pieces of information that should be stored separately (presence of an embedded function). This leads to redundancy. A relation may be normalized into BCNF by splitting it. Sometimes it may make sense to avoid a split (and thus to violate BCNF). The DB designer has to carefully resolve such scenarios, incorporating application or domain knowledge. 64 Summary Functional dependencies: detect them in DB schemas, decide implication, determine keys. Insert, update, and delete anomalies. Normal forms based on FDs: BCNF, 3NF, 2NF Synthesis algorithm to produce schema in 3NF 1NF 2NF 3NF BCNF 65