Topic 5.1: Database Tables and Normalization



Similar documents
Module 5: Normalization of database tables

Chapter 6. Database Tables & Normalization. The Need for Normalization. Database Tables & Normalization

DATABASE SYSTEMS. Chapter 7 Normalisation

Chapter 9: Normalization

Functional Dependency and Normalization for Relational Databases

Normalization. Reduces the liklihood of anomolies

DATABASE NORMALIZATION

Chapter 5: Logical Database Design and the Relational Model Part 2: Normalization. Introduction to Normalization. Normal Forms.

COSC344 Database Theory and Applications. Lecture 9 Normalisation. COSC344 Lecture 9 1

C# Cname Ccity.. P1# Date1 Qnt1 P2# Date2 P9# Date9 1 Codd London Martin Paris Deen London

Database Design. Marta Jakubowska-Sobczak IT/ADC based on slides prepared by Paula Figueiredo, IT/DB

DATABASE DESIGN: NORMALIZATION NOTE & EXERCISES (Up to 3NF)

Normalization of Database Tables. Functional Dependency. Examples of Functional Dependencies: So Now what is Normalization? Transitive Dependencies

Part 6. Normalization

Normalization in Database Design

Database Design and the Reality of Normalisation

Chapter 15 Basics of Functional Dependencies and Normalization for Relational Databases

Normalisation to 3NF. Database Systems Lecture 11 Natasha Alechina

A. TRUE-FALSE: GROUP 2 PRACTICE EXAMPLES FOR THE REVIEW QUIZ:

Normalization of Database

Normal forms and normalization

Normalization. CIS 331: Introduction to Database Systems


MCQs~Databases~Relational Model and Normalization

BCA. Database Management System

SQL Server. 1. What is RDBMS?

DBMS. Normalization. Module Title?

Normalisation 1. Chapter 4.1 V4.0. Napier University

Unit 3.1. Normalisation 1 - V Normalisation 1. Dr Gordon Russell, Napier University

Normalisation 6 TABLE OF CONTENTS LEARNING OUTCOMES

Database Design and Normalization

Normalization. Functional Dependence. Normalization. Normalization. GIS Applications. Spring 2011

Introduction to Computing. Lectured by: Dr. Pham Tran Vu

Chapter 10. Functional Dependencies and Normalization for Relational Databases. Copyright 2007 Ramez Elmasri and Shamkant B.

Normalization in OODB Design

CS 377 Database Systems. Database Design Theory and Normalization. Li Xiong Department of Mathematics and Computer Science Emory University

Databases -Normalization III. (N Spadaccini 2010 and W Liu 2012) Databases - Normalization III 1 / 31

Normalization. CIS 3730 Designing and Managing Data. J.G. Zheng Fall 2010

C HAPTER 4 INTRODUCTION. Relational Databases FILE VS. DATABASES FILE VS. DATABASES

Database Management System

How To Write A Diagram

Introduction to normalization. Introduction to normalization

Chapter 10 Functional Dependencies and Normalization for Relational Databases

Fundamentals of Database System

Optimum Database Design: Using Normal Forms and Ensuring Data Integrity. by Patrick Crever, Relational Database Programmer, Synergex

Normalization. Normalization. Normalization. Data Redundancy

Chapter 10. Functional Dependencies and Normalization for Relational Databases

If it's in the 2nd NF and there are no non-key fields that depend on attributes in the table other than the Primary Key.

RELATIONAL DATABASE DESIGN

14 Databases. Source: Foundations of Computer Science Cengage Learning. Objectives After studying this chapter, the student should be able to:

Tutorial on Relational Database Design

Theory of Relational Database Design and Normalization

DATABASE DESIGN: Normalization Exercises & Answers

Functional Dependencies and Finding a Minimal Cover

Chapter 5: FUNCTIONAL DEPENDENCIES AND NORMALIZATION FOR RELATIONAL DATABASES

Introduction to Databases, Fall 2005 IT University of Copenhagen. Lecture 5: Normalization II; Database design case studies. September 26, 2005

CSCI-GA Database Systems Lecture 7: Schema Refinement and Normalization

Fundamentals of Relational Database Design

Analysis and Design Complex and Large Data Base using MySQL Workbench

Theory of Relational Database Design and Normalization

MODULE 8 LOGICAL DATABASE DESIGN. Contents. 2. LEARNING UNIT 1 Entity-relationship(E-R) modelling of data elements of an application.

Lecture 2 Normalization

Announcements. SQL is hot! Facebook. Goal. Database Design Process. IT420: Database Management and Organization. Normalization (Chapter 3)

Normalization. Normalization. First goal: to eliminate redundant data. for example, don t storing the same data in more than one table

The Relational Database Model

The process of database development. Logical model: relational DBMS. Relation

Fundamentals of Database Design

Database Normalization. Mohua Sarkar, Ph.D Software Engineer California Pacific Medical Center

z Introduction to Relational Databases for Clinical Research Michael A. Kohn, MD, MPP copyright 2007Michael A.

Normalisation. Why normalise? To improve (simplify) database design in order to. Avoid update problems Avoid redundancy Simplify update operations

- Eliminating redundant data - Ensuring data dependencies makes sense. ie:- data is stored logically

Improving Data Quality in Relational Databases: Overcoming Functional Entanglements

Jordan University of Science & Technology Computer Science Department CS 728: Advanced Database Systems Midterm Exam First 2009/2010

Chapter 7: Relational Database Design

Schema Design and Normal Forms Sid Name Level Rating Wage Hours

RELATIONAL DATABASE DESIGN. Basic Concepts

Lecture Notes on Database Normalization

B2.2-R3: INTRODUCTION TO DATABASE MANAGEMENT SYSTEMS

Introduction to Database Systems. Normalization

CS143 Notes: Normalization Theory

Teaching Database Modeling and Design: Areas of Confusion and Helpful Hints

Answers to Selected Questions and Problems

Database Design and Normalization

7. Databases and Database Management Systems

Access Database Design

Determination of the normalization level of database schemas through equivalence classes of attributes

2. Basic Relational Data Model

THE BCS PROFESSIONAL EXAMINATION Diploma. October 2004 EXAMINERS REPORT. Database Systems

UNDERSTANDING NORMALIZATION

KNOWLEDGE FACTORING USING NORMALIZATION THEORY

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer

Functional Dependencies

Database Design Standards. U.S. Small Business Administration Office of the Chief Information Officer Office of Information Systems Support

IT2305 Database Systems I (Compulsory)

Lecture Notes INFORMATION RESOURCES

Design of Relational Database Schemas

Transcription:

Topic 5.1: Database Tables and Normalization What is Normalization? Normalization is a process for evaluating and correcting table structures to minimize data redundancies, thereby helping to eliminate data anomalies. It helps us evaluate table structures and produce good tables. What are the stages of Normalization? Normalization Works through a series of stages called normal forms: o Normal form (1NF) o Second normal form (2NF) o Third normal form (3NF) o Boyce-Codd Normal Form (BCNF) o Forth Normal (4NF) o Fifth Normal (5NF) o Domain-key normal form (DKNF) From a structural point of view, 2NF is better than 1NF, and 3NF is better than 2NF. For most business database design purposes, 3NF is as high as we need to go in the normalization process. And some very specialized applications may require normalization beyond 4NF. Although normalization is a very important database design ingredient, you should not assume that the highest level of normalization is always the most desirable. Generally, the higher the normal form, the more joins are required to produce a specified output and the more slowly the database system responds to end-user demands. A successful design must also consider end-user demand for fast performance. Therefore, you will occasionally be expected to denormalize some portions of a database design in order to meet performance requirements. (Denormalization produces a lower normal form). 5.1.1 The Need for Normalization To illustrate the normalization process, we will examine a simple business application. In this case we will explore the simplified database activities of a construction company that manages several building projects. Each project has its own project number, name, employees assigned to it and so on. Each employee has an employee number, name, and job classification such as engineer or computer technician.

The company charges its clients by billing the hours spent on each contract. The hourly billing rate is dependent on the employee s position. Periodically, a report is generated that contains the information displayed in Table 5.1. The Total Charge in Table 5.1 is a derived attribute and, at this point is not stored in this table. The easiest short-term to generate the required report might seem to be a table whose contents correspond to the reporting requirements. (See Figure 5.1.)

The structure of the data set in Figure 5.1 does not handle data very well for the following reasons: 1. The project number (PROJ_NUM) is apparently intended to be a primary key, but it contains nulls. 2. The table entries invite data inconsistencies. For example, the JOB_CLASS value Elect.Engineer might be entered as Elect.Eng. in some cases, El. Eng or EE in others. 3. The table displays data redundancies. These data redundancies yield the following anomalies: a. Update anomalies. Modifying the JOB_CLASS for employee number 105 requires (potentially many alterations, one for each EMP_NUM =105) b. Insertion anomalies. Just to complete a row definition, a new employee must be assigned to a project. If the employee is not yet assigned, a phantom project must be created to complete the employee data entry. c. Deletion anomalies. If employee 103 quits, deletions must be made for every entry in which EMP_NUM =103. Such deletions will result in loosing other vital data of project assignments from the database.

In spite of these structural deficiencies, the table structure appears to work; the report is generated with ease. Unfortunately, the report may yield different results, depending on what data anomaly has occurred. 5.1.2 Conversion to First Normal Form Figure 5.1 contains what is known as repeating groups. A repeating group derives its name from the fact that a group of multiple (related) entries can exist for any single key attribute occurrence. A relational table must not contain repeating groups. The existence of repeating groups provides evidence that the RPT_FORMAT table in Figure 5.1 fails to meet even the lowest normal form requirements, thus reflecting data redundancies. Normalizing the table structure will reduce these data redundancies. If repeating groups do exist, they must be eliminated by making sure that each row defines a single entity. In addition, the dependencies must be identified to diagnose the normal form. The normalization process starts with a simple three-step procedure. Step 1: Eliminate the Repeating Groups Start by presenting the data in a tabular format, where each cell has a single value and there are no repeating groups. To eliminate the repeating groups, eliminate the nulls by making sure that each repeating group attribute contains an appropriate data value. This change converts the RPT_FORMAT table in Figure 5.1 to the DATA_ORG_1NF table in Figure 5.2.

Step 2: Identify the Primary Key The layout in Figure 5.2 represents much more than a mere cosmetic change. Even a casual observer will note that PROJ_NUM is not an adequate primary key because the project number does not uniquely identify all the remaining entity (row) attributes. For example, the PROJ_NUM value 15 can identify any one of five employees. To maintain a proper primary key that will uniquely identify any attribute value, the new key must be composed of a combination of PROJ_NUM and EMP_NUM. Step 3: Identify all Dependencies Dependencies can be depicted with the help of a diagram as shown in Figure 5.3. Because such a diagram depicts all the dependencies found within a given table structure, it is known as a dependency diagram. Dependency diagrams are very helpful in getting a bird s-eye view of all the relationships among a table s attributes, and their use makes it much less likely that you might overlook an important dependency.

Notice the following dependency diagram features from Figure 5.3: 1. The primary key attributes are bold, underlined, and shaded in a different color. 2. The arrows above the attributes indicate all desirable dependencies, that is, dependencies that are based on the primary key. In this case, note that the entity s attributes are dependent on the combination of PROJ_NUM and EMP_NUM. 3. The arrows below the dependency diagram indicate less-desirable dependencies. Two types of such dependencies exist: a. Partial dependencies. Dependencies based on only a part of a composite primary key are called partial dependencies. b. Transitive dependencies. A transitive dependency is a dependency of one nonprime attribute on another nonprime attribute. The problem with transitive dependencies is that they still yield data anomalies. The first normal form (1NF) describes the tabular format, shown in Figure 5.2, in which three conditions are satisfied: All the key attributes are defined. There are no repeating groups in the table. All attributes are dependent on the primary key. All relational tables satisfy the 1NF requirements. The problem with the 1NF table structure shown in Figure 5.3 is that it contains partial dependencies and transitive dependency.

5.1.3 Conversion to Second Normal Form The rule of conversion from INF format to 2NF format is: Eliminate all partial dependencies from the 1NF format. The conversion from 1NF to 2NF format is done in two steps: Step 1: Identify All the Key Components Fortunately, the relational database design can be improved easily by converting the database into a format known as the second normal form (2NF). The 1NF-to- 2NF conversion is simple: Starting with the 1NF format displayed in Figure 5.3, you do the following activity: Eliminate partial dependencies from the 1NF format in Figure 5.3. This step will result in producing three tables from the original table shown in Figure 5.2. From Figure 5.3, two partial dependencies exist: 1. PROJ_NAME depends on PROJ_NUM, and 2. EMP_NAME, JOB_CLASS, and CHG_HOUR depend on EMP_NUM. To eliminate the existing two partial dependencies, write each component on a separate line, and then write the original (composite) key on the last line. PROJ_NUM EMP_NUM PROJ_NUM EMP_NUM Each component will become the key in a new table. The original table is now divided into three tables: PROJECT, EMPLOYEE, and ASSIGN. Step 2: Identify the Dependent Attributes Determine which attributes are dependent on which other attributes. The three new tables, PROJECT, EMPLOYEE and ASSIGN, are described by: PROJECT (PROJ_NUM, PROJ_NAME) EMPLOYEE (EMP_NUM, EMP_NAME, JOB_CLASS, CHG_HOUR) ASSIGN (PROJ_NUM, EMP_NUM, ASSIGN_HOURS)

The results of steps 1 and 2 are displayed in Figure 5.4. At this point, most of the anomalies have been eliminated. A table is in second normal form (2NF) if the following two conditions: It is in 1NF It includes no partial dependencies; that is, no attribute is dependent on only a portion of the primary key. Because a partial dependency can exist only if a table s primary key is composed of several attributes, a table whose primary key consists of only a single attribute is automatically in 2NF if it is in 1NF. In the ASSIGN table, the attribute ASSIGN_HOURS depends on both key attributes of the composite primary key EMP_NUM and PROJ_NUM. However, Figure 5.4 still shows a transitive dependency as CHG_HOUR depends on JOB_CLASS. This transitive dependency can generate anomalies. 5.1.4 Conversion to Third Normal Form The rule of conversion from 2NF format to 3NF format is: Eliminate all transitive dependencies from the 2NF format.

The conversion from 2NF to 3NF format is done in three steps: The data anomalies created by the database organization shown in Figure 5.4 are easily eliminated by completing the following three steps: Step 1: Identify Each New Determinant For every transitive dependency, write its determinant as a PK for a new table. (A determinant is any attribute whose value determines other values within a row.) If you have three different transitive dependencies, you will have three different determinants. Figure 5.4 shows only one case of transitive dependency. Therefore, write the determinant for this transitive dependency: JOB_CLASS Step 2: Identify the Dependent Attributes Identify the attributes that are dependent on each determinant identified in Step 1 and identify the dependency. In this case, you write: JOB_CLASS CHG_HOUR Name the table to reflect its contents and function. In this case, JOB seems appropriate. Step 3: Remove the Dependent Attributes from Transitive Dependencies o Eliminate all the dependent attributes in the transitive relationship(s) from each of the tables shown Figure 5.4 that have such a transitive relationship. o Draw a new dependency diagram to show all the tables defined in Steps 1-3. o Check the new tables as well as the tables modified in Step 3 to make sure that each table has a determinant and that no table contains inappropriate dependencies (partial or transitive). When Steps 1-3 have been completed, the resulting tables will be shown in Figure 5.5.

A table is in third normal form (3NF) if the following two conditions are satisfied: o It is in 2NF o It contains no transitive dependencies 5.1.5 Improving the Design The table structures are refined to eliminate the troublesome initial partial and transitive dependencies. Normalization cannot, by itself, be relied on to make good designs. Instead it is valuable because its use helps eliminate data redundancies. Therefore the following changes have been made: o PK Assignment o Naming Conventions o Attribute Atomicity o Adding Attributes o Adding Relationships o Refining PKs o Maintaining Historical Accuracy o Using Derived Attributes The enhancements are shown in the tables and dependency diagrams in Figure 5.6.

5.1.6 Limitations on System-Assigned Keys System-assigned primary key may not prevent confusing entries Data entries in Table 5.2 are inappropriate because they duplicate existing records - Yet there has been no violation of either entity integrity or referential integrity This multiple duplicate records problem was created when the JOB_CODE was added to become the PK. In any case, if JOB_CODE is to be the PK, we still must ensure the existence of unique values in the JOB_DESCRIPTION through the use of a unique index. Although our design meets the vital entity and referential requirements, there are still some concerns the designer must address. The JOB_CODE attribute was created and designated to be the JOB table s primary key to ensure entity integrity in the JOB table. The DBMS may be used to have the system assign the PK values. However it is useful to remember that the JOB_CODE does not prevent us from making the entries in the JOB table shown in Table 5.2. It is worth repeating that database design often involves trade-offs and the exercise of professional judgment. In a real-world environment, we must strike a balance between design integrity and flexibility. 5.1.7 The Boyce-Codd Normal Form (BCNF) A table is in Boyce-Codd normal form (BCNF) If every determinant in the table is a candidate key Has same characteristics as primary key, but for some reason, not chosen to be primary key If a table contains only one candidate key, the 3NF and the BCNF are equivalent BCNF can be violated only if the table contains more than one candidate key

Most designers consider the Boyce-Codd normal form (BCNF) as a special case of 3NF A table is in 3NF if it is in 2NF and there are no transitive dependencies A table can be in 3NF and not be in BCNF A non-key attribute is the determinant of a key attribute In other words, a table is in 3NF if it is in 2NF and there are no transitive dependencies. But what about a case in which a non-key attribute is the determinant of a key attribute? The condition does not violate 3NF, yet it fails to meet the BCNF requirements, because BCNF requires that every determinant in the table be a candidate key. The situation mentioned is most easily understood by examining Figure 5.7. The table structure shown in Figure 5.7 has no partial dependencies, nor does it contain transitive dependencies. (The condition C B indicates that a nonkey attribute determines part of the primary key and this dependency is not transitive.) To convert the table structure in Figure 5.7 into table structures that are in 3NF and in BCNF, first change the primary key to A + C. This is an appropriate action, because the dependency C B means that C is, in effect, a superset of B. Next, follow the standard decomposition procedures to produce the results shown in Figure 5.8.

To see how this procedure can be applied to an actual problem, examine the sample data in Table 5.3. Table 5.3 reflects the following conditions: Each CLASS_CODE identifies a class uniquely. For example, a course labeled INFS 420 might be taught in two classes (sections), each identified by a unique code to facilitate registration. Thus the CLASS_CODE 32456 might identify INFS 420, class section 1, while the CLASS_CODE 32457 might identify INFS 420, class section 2. A student can take many classes. A staff member can teach many classes, but each class is taught by only one staff member. Because the relationship between CLASS and

STAFF is 1:1, a CLASS_CODE value can determine the value of STAFF_ID of the staff teaching the class. The structure shown in Table 5.3 is reflected in Panel A of Figure 5.9: Panel A in Figure 5.9 shows a structure that is clearly in 3NF, but the table represented by this structure has a major problem; because it is trying to describe two things: student's enrolment and teaching assignment. This cause data redundancy and consequently causes data anomalies. The solution to this problem is to decompose the table structure following the procedure outlined earlier. The decomposition of Panel B shown in Figure 5.9 yields two table structures that conform to both 3NF and BCNF requirements. Notice, each table in panel B represents the details of one thing or one concept. The table on the left represents student's enrolment, and the other table on the right represents class assignment. Note also that each of the two tables panel B is in BCNF because if every determinant in each table is a candidate key, or there only one candidate in each table which is the primary key.

Concept Check What is normalization? What are the stages of normalization? What is BCNF?