Database Design Foundations of of Information Management (WS (WS 2008/0 2008/09) Chapter 4? Principles of of Database Design Foundations of IM 1
Design of DB applications Relative costs for compensating the consequences of errors: 1 Requirements analysis Database development 10 10 Design Topic of this chapter 100 100 Implementation >> >> 100 100 Usage Foundations of IM 2
DB-Design: Design: Phases Application area in the "real world" Design documentation Logical structure on paper" Conceptual design Logical design Conceptual level Logical level Physical structure in storage Physical design Physical level Foundations of IM 3
Conceptual design Application area in the "real world" Design documentation Formally define definewhat whatyou youknow about about the the world world without withoutthinking of of computers and and databases (yet (yet)! Conceptual design Conceptual level Logical structure on paper" Logical design Logical level Physical structure in storage Physical design Physical level Foundations of IM 4
The Entity-Relationship model Entity-Relationship data model (ER model): Proposed in 1976 in a paper by Peter Chen Graphical notation for application modeling (ER diagrams) Independent semantic data model (aiming at the meaning of concepts in real world) Predecessor of today s object-oriented oriented data models Extremely successful as a means of pre-design of relational DBs The ER model offers few very simple and basic concepts: Entities (objects), characterized by attributes (properties) Binary or n-ary relationships between entities, possibly characterized by attributes as well Often not mentioned explicitly, but important and basic: Values: printable symbols as values of attributes; play a subordinate role (characterizing objects) Roles: Names for the special meaning an entity has within a relationship Foundations of IM 5
An entity Each entity is completely characterized by the values of all its attributes. an an entity its itsattributes persnr firstname name age topic rank 765 "Rainer" "Manthey" 53 "Informatik" their "C3" their theirvalues Foundations of IM 6
Two similar entities persnr firstname name age topic rank 765 "Rainer" "Manthey" 53 "Informatik" "C3" another entity persnr firstname name age topic rank identical attributes?? Armin B." Cremers" 60 "Informatik" "C4" partially different values Foundations of IM 7
Entity types Similar entities can be combined into entity types. Similarity" requires at least identical attribute structure. (Attribute names and corresponding value domains are identical.) Entity types are graphically represented by rectangles.. Attributes label the line connecting an entity type and a value domain (often symbolized by an oval): professor Name of the type persnr firstname name age topic rank Int String String Int String String Foundations of IM 8
Entity types: : Alternative representations Domain names are often omitted (in case they are obvious or irrelevant). Instead attributes are placed inside ovals: professor persnr firstname name age topic rank In larger diagrams, the attribute structure is often entirely omitted in order to save space (or is written down in abbreviated form only): professor persnr, firstname, name,,... Foundations of IM 9
Instances and population of an entity type Each entity belongs to" at least one entity type: It is called an instance of this type. an instance of professor The set of all current instances of an entity type is calles its current population. a population of professor Foundations of IM 10
Multiple classification One and the same entity can be an instance of various entity types. instance of professor instance of fan of Werder Bremen In such a case, the attributes of the different types of this entity may well be quite different. Foundations of IM 11
Classification in presence of identical attributes Entities with the same attributes do not at all have to be instances of the same type! instance of Ass name, firstname,, age instance of Fan of football club Werder Bremen name, firstname,, age Almost always additional classification criteria are required, usually not derivable from the attribute structure. Foundations of IM 12
Key attributes As in the relational model (Access, SQL), there are usually one or more attributes per entity type the values of which are sufficient for uniquely identifying each instance: Key Key attributes professor persnr firstname name age topic rank (Primary) key attributes are usually underlined in an ER-diagram diagram. Keys ought to be "minimal"" (no attribute can be omitted). A distinction between primary key and other candidate keys is not made in the ER-model model, even though it would be useful to do so. Foundations of IM 13
Relationships 2 nd main concept of the ER-model model: elementary relationships between two or more entities (possibly with their own attribute values) Relationship PersNr 765 year of assignment 1992 name Universität Bonn Each relationship is uniquely characterized by the key values of the partici- pating entities and by the values of all relationship attributes. However, there are no separate key attributes for relationships,, as the keys of the participating entities always suffice for unique identification. Foundations of IM 14
Multiple relationships Entitites may participate in various relationships (also similar ones). year of assignment 1992 year of assignment 1999 Foundations of IM 15
Relationship types Similar relationships may be grouped into relationship types. Similarity" requires at least identical attribute structure and identical types (and number) of participating entities. Relationship types are graphically denoted in the ER-model by a diamond. Attributes are written as for entity types. assigned_to professor university persnr,,... year of assignment name,,... Int Foundations of IM 16
Instances of a relationship type assigned_to professor university persnr,,... year of assignment Int name,,... Relationship types have instances, too: Individual relationships between individual entities are analogously considered instances of the corresponding R-type. assigned_to persnr 765... year of assignment 1992 name... Universität Bonn Foundations of IM 17
Type vs. instance assigned_to professor university persnr,,... entity type type year of assignment Int name,,... relationship type type The distinction between types and instances is difficult even for specialists: Try to be precise from the very beginning! assigned_to persnr 765... entity year of assignment 1992 relationship name... Universität Bonn Foundations of IM 18
Roles One and the same entity type may participate in a relationship type more than once,, e.g.: mother descendent_of person daughter For a (syntactical( syntactical) distinction between the different forms of participation, special designators are used, called roles. Roles are used as labels of the lines connecting the resp. entity and relationship type. Foundations of IM 19
Extended Geo-DB: ER-diagram capital capital_of_country country 1 neighbour_country city country country 2 stadt_an_fluss city_in_country river_ through_ country region_ in_country river source river relationship type region source_of role entity type Foundations of IM 20
Relationship types and mathematical relations On the level of instances, a relation over the populations of the entity types involved is associated with each relationship type. E 1 R E 2 entities not involved in R Foundations of IM 21
Functional relationships In many cases, at most one entity on one side"" of a relationship type may be associated with a particular entity on the other side of the relationship,, e.g.: city country city_in_country Each Eachcity cityis isin in exactly one onecountry. In In each eachcountry, however, arbitrarily many cities may maybe besituated. Foundations of IM 22
Functional relationships (2) Mathematically, city_in_country is a function: city_in_country: city city country More exactly: :... a function from the population of city into the population of country. population(city (city) population(country (country) Foundations of IM 23
Functional relationships (3) In the ER-model model,, such restrictions of the admissible combinations can be expressed by means of so-called functionalities, annotations attached to the edges connecting entity and relationship types. There are four different kinds of binary relationships expressible by means of functionalities: 1 : : 1 1 : : N N : : 1 N : : M In this context,, N resp. M stands for arbitrary integer values 0. In the city_in_country' city_in_country'-example,, an N:1-relationship is appropriate: There is exactly one country per city, but arbitrarily many citites per country (N 0). city N 1 city_in_country country Foundations of IM 24
Functional relationships (4) Functionalities of type 1 : 1, 1 : N or N : 1 define partial functions where some of the instances of the types involved possibly are not related at all. city N 1 city_in_country country City without country! In the normal ER-model model,, total functions cannot be distinguished from partial ones in extensions of the model there are additional graphical means for ex- plicitly stating whether a function is partial or total. Foundations of IM 25
Functional relationships (5) In a 1 : 1-relationship1 each instance of one of the entity types involved is related to none or exactly one of the instances of the other entity type. An N : M-relationship can be considered the normal case" without restrictions on the number of participating entities. If no functionalities have been stated for a relationship type, then an implicit N : M functionality is assumed. Functionalities can be defined for relationships with more than two entities in- volved, too: E 1 N R M E 2 R: R: E 1 1 E 2 2 E 3 3 1 E 3 Foundations of IM 26
Functional relationships (6) If in an n-ary relationship several edges are marked by '1', then the resp. relationship type represents several partial functions: E 1 N R 1 E 2 R (1) R (1) (1) : : E 1 1 E 2 2 E 3 3 1 R (2) R (2) (2) : : E 1 1 E 3 3 E 2 2 E 3... and analogously: E 1 1 R 1 1 E 2 E 3 R (1) R (1) (1) : : E 1 1 E 2 2 E 3 3 R (2) R (2) (2) : : E 1 1 E 3 3 E 2 2 R (3) R (3) (3) : : E 2 2 E 3 3 E 1 1 Foundations of IM 27
Extended ER model: Generalization The concepts introduced so far have been contained in Chen s s original proposal throughout. Since then, however, various extensions have been proposed: Extended Entity-Relationship Model (EER-model) Most important extensions (as in object-oriented oriented models): Generalization This means: formation of subtypes of entity types sub-/supertype relationships (type hierarchy) inheritance of attributes and of participations"" in R-types Special graphical notation for generalization of E-types: is_a... Foundations of IM 28
Generalization (2) city capital_of region supertype name, area, inhabitants subtypes is_a country prime minister state governor Inheritance"" in this example means: Both subtypes inherit all attributes of the supertype, i.e., they own" these attributes without explicit definition. Both subtypes participate in the relationship type capital_of', which has been explicitly defined for the supertype only. Foundations of IM 29
Generalization (3) region is_a Generalization always means that the populations of the subtypes are subsets of the population of the supertype. country state population(region (region) This circumstance motivates the notion 'is_a' is_a'-relationship: population(country (country) "Every country is a region." Quantifier over instances! population(state (state) Foundations of IM 30
Generalization (4) In the example: special case disjoint generalization (Empty intersection of the populations) In general: This form of the 'is_a' is_a'-notation just means some form of subset formation,, i.e. overlapping, incomplete subdivision. Foundations of IM 31
Logical design: ER Relations Design documentation Logical structure on paper" Conceptual design Logical design Physical design ER ER model 1 st 1 st st step step Mapping Relational model Improvement Map MapER types systema- tically to to relations: From semantic concepts to to syntactic structures!! Foundations of IM 32
ER Relations: Mapping Entity types Mapping from ER model to relational model: in principle very easy: : per type one relation (table) in detail and for extensions: quite difficult Mapping of entity types: rather obvious type table name attribute E column name E A 1 A 2... A n A 1 A 2... A n Key attributes are mapped to primary key columns. Foundations of IM 33
ER Relations: Mapping Relationship types For relationship types: Use the same basic idea: one relation per type. but: How are participating entity types represented? city city_in_state state name inhabitants... since name area... Obvious solution as well: The participating entities are represented by means of the values of their primary key attributes (possibly after renaming). city_in_state city.name state.name since Foreign key Foundations of IM 34
ER Relations: Mapping Relationship types (2) In presence of special functionalities (1 : N, N : 1, 1 : 1 resp.), a separate relation- ship table is not necessary,, as the relationship information can be embedded into the table of the entity type on the N-side: city city_in_state N 1 state name inhabitants... since name area... Foreign key city city.name inhabitants... state.name since Foundations of IM 35
Relational representation of generalization hierarchies E 1 A 1, A 2,... How Howto to realize inheritance and and sub-/super type typerelationships relationally? is_a inheritance Relational model: does not know any inheritance! Inheritance thus has to be "simulated"". E 2 B 1, B 2,... Relational representation of a super type E 1 is obvious: E 1 A 1 A 2... Foundations of IM 36
Relational representation of generalization hierarchies (2) E 1 is_a A 1, A 2,... Inheritance E 1 A 1 A 2... a 11 a 12... a 21 a 22... inherited native" attributes E 2 B 1, B 2,... E 2 A 1 A 2... B 1 B 2... a 11 a 12... b 11 b 12... Obvious relational realization of a subtype: subtype relation E 2 owns native" and inherited attributes. But: Values of the inherited attributes of all E 2 -instances have to be (redundantly) repeated in the E 1 -relation in this case! Reason: Each E 2 -instance is an E 1 -instance, too! Foundations of IM 37
Relational representation of generalization hierarchies (3) E 1 is_a A 1, A 2,... Avoiding duplication: : Store only native" attributes (+ key for joining) ) in the subtype relation! inherited attributes E 2 B 1, B 2,... E 1 inheritance A 1 A 2... a 11 a 12... a 21 a 22... joint key native" attributes But Butin in this thiscase caserelation E 2 does 2 does no no longer contain all allattributes of of E 1 -entities 1!! E 2 A 1 B 1 B 2... 12... a 11 b 11 b 12 Foundations of IM 38
Relational representation of generalization hierarchies (4) way out: : E2-population is completely realized by means of a view joining the inherited and the native attributes. inherited attributes E 2 -global A 1 A 2... B 1 B 2... a 11 a 12... b 11 b 12... CREATE CREATE VIEW VIEW E2-global E2-global AS AS (SELECT (SELECT* * FROM FROM E1 E1 INNER INNER JOIN JOINE2-local ON ON E1.A1 E1.A1 = E2-local.A1) E2-local.A1) E 1 E 2 -local A 1 A 2... a 11 a 12... a 21 a 22... native" attributes A 1 B 1 B 2... 12... a 11 b 11 b 12 Foundations of IM 39
Relational representation of generalization hierarchies (5) 3 rd rd alternative (also free of redundancies and using a view): Distribute values of the inherited attributes to different relations super types are reconstructed via views. E 1 -local E 1 -global A 1 A 2... a 11 a 12... a 21 a 22... A 1 A 2... a 21 a 22... CREATE CREATE VIEW VIEW E1-global E1-global AS AS ( ( TABLE TABLE E1-local E1-local )) UNION UNION (SELECT (SELECT A 1, 1, A 2, 2,...... FROM FROM E2 E2 )) E 2 π inherited native" attributes A 1 A 2... B 1 B 2... a 11 a 12... b 11 b 12... Foundations of IM 40
Relational representation of generalization hierarchies (6) Which of of the thethree threealternatives is is the the best" best"? 1. Relations E 1 and E 2, no views: + short access time (without any join of tables) high requirements for space (due to redundant storage) 2. Relations E 1 and E 2 -local, view E 2 -global (JOIN): + Only key attribute values are stored redundantly. Access to E 2 -attributes is slower (due to join). 3. Relations E 2 and E 1 -local, view E 1 -global (PROJECT-UNION): + No duplication of any attribute values. Access to E 1 -attributes is slower (due to projection and union) ). Foundations of IM 41
Generalization and updates E 1 is_a E 2 A 1, A 2,... B 1, B 2,... What happens if ifan an E 2 -entity 2 is isdeleted? relational variant 1 (inherited( attributes duplicated): Deletion from both relations is necessary. relational variant 2 (inherited( and native A. separated): Deletion from both relations is necessary. relational variant 3 (E 2 -attributes only in one relation): no propagation of deletions required In In variants 1 and and 2: 2: referential integrity constraints with withdelete cascade is isrequired. For insertions and modifications: Changes in several relations may be necessary, too (depending on the chosen strategy). Deletion of instances of the super type E 1 : Cascading deletion if the resp. instance is an E 2 -instance, too (again referential integrity). Foundations of IM 42
Relationships and referential integrity Each relationship type induces FOREIGN KEY-constraints as well: With N : M-functionality: river_through_state river N M state CREATE CREATE TABLE TABLE river_through_state river_through_state ( ( River River String String REFERENCES REFERENCESriver, river, State State String String REFERENCES REFERENCESstate state )) With N : 1-functionality: 1 city N city_in in_state 1 state CREATE CREATE TABLE TABLE city city ( (....,, State State String String REFERENCES REFERENCESstate state )) Foundations of IM 43
Functionalities as constraints But: Uniqueness of the state in the city_in in_state-relationship has not yet been expressed! An additional CHECK-constraint is required contraining the number of state instances: city city_in in_state N 1 state CREATE CREATE TABLE TABLE city city ( (....,, CHECK CHECK COUNT COUNT (SELECT (SELECT State State FROM FROM city citys WHERE WHERE S.Name S.Name = Name) Name) =< =< 11 Implicit universal quantifier ranging over each city row! Foundations of IM 44
ER Relations (2) Design documentation Logical structure on paper" Conceptual design Logical design Physical design ER ER model Mapping Relational model Improvement 2 nd nd step step (after (afterreaching relations): Improve design by by fine tuning the the structure of of tables Foundations of IM 45
Redundancies due to bad design Example of bad design (remaining after mapping from ER level): City_in_ City_in_state and capital_of have been placed into a single table. City Bonn Köln Essen... Mainz Trier... State Capital NW NW NW RP RP Düsseldorf Düsseldorf Düsseldorf Mainz Mainz Redundantly stored information: Düsseldorf is isthe thecapital of of North Rhine-Westphalia. Obviously one topic (Which city is the capital of...?) has been combined with another topic (In which state is a certain city situated?) in such an unlucky manner that considerable redundancies occur, resulting in waste of space. What does does a a topic mean? Foundations of IM 46
Anomalies resulting from redundancies An immediate consequence of such cases of storing multiple topics in one table is the occurrence of so-called anomalies when updating such tables: Assume Köln (as( largest city in NW) replaces Düsseldorf as capital: One fact changes, but multiple updates have to be made. City Bonn Köln Essen... Mainz Trier... State Capital NW NW NW RP RP Düsseldorf Düsseldorf Düsseldorf Mainz Mainz Analoguos anomalies may happen due to insertions and deletions: An instance of topic 1 disappears, as soon as it is no longer associated with any instance of topic 2. A new instance of topic 1 can only be inserted, if it is combined with an instance of topic 2 (or( null values are used). Foundations of IM 47
Decompositions avoid redundancies and anomalies How Howto to prevent such such defects (anomalies, redundancies) )? In the example, there is a simple remedy: Separate the two topics into different relations! City Bonn Köln Essen... Mainz Trier... State Capital NW NW NW RP RP Düsseldorf Düsseldorf Düsseldorf Mainz Mainz No redundancy,, no anomaly! Decomposition City Bonn Köln Neuss... State NW NW NW State Capital NW RP... Düsseldorf Mainz Foundations of IM 48
Functional dependencies and decompositions Already discovered by Codd before 1970: Functional relationships between attributes are of help for finding meaningful decompositions and for avoiding reduncancies! Resulting from this observation, Codd developed an elaborate theory of relational normal forms. Prerequisite: : Designers identify such functional relationships during schema design (quite( similar to identifying functionalities in the ER model) ) and express them as special integrity constraints. Functional dependency (short:: FD) Principle of functional dependency: Let A and B be attributes of a relation R. B depends functionally on A, A if in each state of R each A-value always occurs in combination with the same, uniquely determined B-value. symbolic notation: A B Foundations of IM 49
FDs in the example Each city lies in exactly one state: City State Each state has exactly one capital: State Capital But also: If a city is a capital then it is the capital of exactly one state: Capital State Each city is associated with exactly one capital (namely the capital of its state): City Capital City Bonn Köln Essen... Mainz Trier... State Capital NW NW NW RP RP Düsseldorf Düsseldorf Düsseldorf Mainz Mainz Decomposition separates FDs: City State State Capital One FD seems to be lost, however: : City Capital Foundations of IM 50
"Codd's thesis" Thesis (claim( claim) on which the normalization theory of Codd is based: Attributes connected via viaan anfd FDrepresent semantically significant topics of of the theapplication domain. That is: Every FD identifies a topic thus separating topics means separating FDs. But: Not every such topic is necessarily represented by an FD. Moreover: : FD-connection is a sufficient, but not a necessary criterion for the existence of a topic. Basic idea of Codd s approach to normalization of relations: Decompose relations in such a way that normally" each FD has a compo- nent relation of its own. But try to identify exceptions where several FDs may coexist in one and the same relation! Foundations of IM 51
Transitive FDs and Armstrong s axioms There are FDs which are derivable from other FDs, already known.. An important example are so-called transitive FDs: α β is a transitive FD if there is an attribute set γ, such that α γ and γ β are both FDs, but not γ α. Every such transitive case leads to a (new( new) functional dependency. There are two other such inference rules for FDs, called the Armstrong axioms (as they have been discovered by the Canadian scientist W. Armstrong). ββ α α αα β β αα β β ααγ γ ββγ γ Every subset depends on its superset. Augmentation on both sides. Not to be found in the literature, but quite useful: : Special notion for FDs which are not transitive FDs: α β is a direct FD if there is no attribute set γ, such that α γ and γ β are both FDs, but not γ α. Foundations of IM 52
Normal forms Properly determining FDs and investigating their properties is the basis for each meaningful decomposition of relations into components free of redundancies. Codd's recepy (in (in short shortfrom): Redundancies can canbe besafely avoided if iffds FDsalways originate from from candidate keys keysof of a relation. Codd defined various degrees of FD separation, called normal forms the process of transforming a given schema into relations all of which exhibit a given normal form is called normalization. The most important normal form is the third normal form (short: 3NF) defined as follows: A relation is isin in 3 rd rd normal form form Each Eachnon-key attribute functionally depends directly on on each eachcandidate key keyof of the therelation.. There are various other normal forms (1NF, 2NF, 4NF and others). Foundations of IM 53
Normalization in the example schema Our example schema from the geographic domain originally was not in 3NF, as it still contains a transitive dependency pointing from a candidate key (City) to a non-key attribute (Capital): (R, ) City 3NF Bonn Köln Essen... State NW NW NW Capital Düsseldorf Düsseldorf Düsseldorf (R, ) (R, ) City State State Capital 3NF 3NF After decomposition, however, each of the resulting component schemas is in 3NF! Such a decomposition will always be possible. The lost dependency City Capital can be reconstructed by means of the transitivity axiom of Armstrong. Foundations of IM 54