Normalisation and Data Storage Devices

Fundamentals of Database Management

Unit 4 Normalisation and Data Storage Devices

Structure
4.1 Introduction
4.2 Functional Dependency
4.3 Normalisation
    4.3.1 Why do we Normalize a Relation?
    4.3.2 Second Normal Form Relation
    4.3.3 Third Normal Form
    4.3.4 Boyce-Codd Normal Form (BCNF)
    4.3.5 Fourth and Fifth Normal Form
4.4 Data Storage Devices
4.5 File Systems
4.6 Summary
4.7 Self Understanding

4.1 Introduction

Normalisation is used to design and arrive at a good database schema. A database normally consists of table definitions, column definitions and some constraints to be enforced by the system. Although we glanced at the process of normalisation earlier, let us now study it formally with an example.

A table containing an arbitrary collection of attributes can lead to a number of problems, especially with regard to update operations. For example, suppose a table called SCD is defined to contain data about students. The table consists of roll number, name, course, credits for the course, grade obtained by the student and the student's study center number. The key of this table is a composite key consisting of roll number and course.

One may face the following problems while updating the data in such a table:

- A new course cannot be added to the table until some student opts for it, since the roll number would have to be null for this course, which is not permitted because roll number is a key attribute.
- If any change has to be made to the details of a course, the modification has to be made in a number of tuples, since many students might have opted for the course.
- If some students are deleted, this may lead to the deletion of a course itself, if only those students had opted for that course.
- Enforcing integrity constraints places more workload on the DBMS, since it may have to check many tuples.

To avoid these anomalies (problems in updating), one can follow these principles of good database design:

1. Unrelated data should be kept in different tables. For example, data regarding students and data regarding courses should be kept in separate tables.
2. The design should represent constraints explicitly to the extent possible, and the table structure should itself reflect the database constraints.
3. A table should not contain any redundancy. For example, in the above table the data of a course is repeated for every student, which leads to updating problems (anomalies).

These principles are formalised in the theory of normalisation in order to arrive at a good database design.

Constraints: There are basically two types of constraints: one defines the permitted values that attributes can have, and the other defines a relationship between different attributes (generally known as a dependency). Let us look at dependency in detail, as it is a formal tool used to capture the constraints that influence database design.

4.2 Functional Dependency

A functional dependency (FD) is denoted by an arrow (→). The FD X → Y means that X uniquely determines Y, where X and Y are simple or composite attributes. The dependency from X to Y holds if the application guarantees the following: if T1 and T2 are two tuples with the same value of X, then the value of Y must also be the same in T1 and T2. That is, the relationship between X and Y is independent of any other attributes that may be present in the table. In simple words, for a given X there is always a single value of Y.

For example:

- Street, City → Pin code (a street of a city has a unique pin code; however, the reverse need not be true).
- ISBN → Title, Author (given the ISBN of a book, one can find the title and author of the book).

From the example in hand, i.e. SCD, one can identify the following FDs:

ROLL NO → NAME
COURSE → CREDITS
ROLL NO → STUDY CENTER NO
ROLL NO, COURSE → GRADE

From the definition of a relation, we note that every row in a table is unique and no two rows can have exactly the same attribute values. The key plays an important role in the design of tables, since the key has a unique value in each row. A key may consist of one or more attributes, and may be minimal or may contain superfluous attributes. A table may have one or more minimal keys, called candidate keys, one of which may be designated as the primary key.

Implications and Covers: The application can call for some functional dependencies which may imply additional functional dependencies.

If F is a set of FDs, then the closure of F, denoted F+, is the set of all FDs that are implied by F. To find F+ from F, we need inference rules for FDs. The inference rules are important for good database design for the following reasons:

- Given F, one may like to determine whether X → Y is implied or not.
- They are needed for computing the closure F+ of F.
- Given F, we may want to remove those FDs which are redundant in F. An FD is redundant if it is implied by the other FDs in F.

While designing a database schema, one has to find a minimal cover G of F. The minimal cover G does not contain any redundant FDs, and G+ is the same as F+. By computing a minimal cover G of F, we can ensure that the DBMS needs to enforce only the constraints in G, which automatically enforces all the constraints implied by G.

Inference rules for FDs: The inference rules were published by Armstrong. The first three are known as Armstrong's axioms; the remaining three can be derived from them. The properties are given below:

1. Reflexivity property: X → Y is true if Y is a subset of X.
2. Augmentation property: If X → Y is true, then XZ → YZ is also true.
3. Transitivity property: If X → Y and Y → Z, then X → Z is implied.
4. Union property: If X → Y and X → Z are true, then X → YZ is also true. This property allows FDs with the same left-hand side to be combined.
5. Decomposition property: If X → Y is implied and Z is a subset of Y, then X → Z is implied. This property indicates that if the right-hand side of an FD contains many attributes, then the FD holds for each of them. It is the reverse of the union property.
6. Pseudotransitivity property: If X → Y and WY → Z are given, then XW → Z is true.
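In practice the inference rules are rarely applied one at a time. Whether X → Y is implied by F is usually decided by computing the closure of the attribute set X under F and checking that Y is contained in it. The following Python sketch illustrates this for the FDs of the SCD table. The function names `closure` and `implies`, the attribute spellings and the encoding of an FD as a pair of sets are assumptions made for this illustration.

```python
def closure(attrs, fds):
    """Return every attribute functionally determined by `attrs` under `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the whole left-hand side is already determined,
            # the right-hand side is determined too (transitivity).
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def implies(fds, lhs, rhs):
    """True if the FD lhs -> rhs follows from the set `fds`."""
    return set(rhs) <= closure(lhs, fds)


# FDs of the SCD table; attribute names are spelled for illustration.
SCD_FDS = [
    (frozenset({"rollno"}), frozenset({"name"})),
    (frozenset({"course"}), frozenset({"credits"})),
    (frozenset({"rollno"}), frozenset({"study_center"})),
    (frozenset({"rollno", "course"}), frozenset({"grade"})),
]

key_closure = closure({"rollno", "course"}, SCD_FDS)
rollno_closure = closure({"rollno"}, SCD_FDS)
```

Here `key_closure` contains all six attributes, which confirms that (roll number, course) is a key of SCD, while `rollno_closure` contains only roll number, name and study center, so roll number alone is not a key.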

To have a better understanding of these properties, let us consider an example. Consider a college having a table STUDY with course, teacher, room number and department as attributes:

STUDY (course, teacher, roomno, dept)

Here we can identify a few FDs, namely:

Course → Teacher
Teacher → Department
Course → Room number

The following additional FDs can be derived from the above using the inference properties:

By reflexivity: Course, Teacher → Teacher
By augmentation: Course, Room number → Teacher, Room number
By transitivity: Course → Department
By union: Course → Teacher, Room number

The axioms proposed by Armstrong are sound and complete. These terms are defined as follows:

1. Soundness property: If X → Y can be inferred from F using the above axioms, then X → Y is true in every relation in which F holds.
2. Completeness property: If X → Y cannot be inferred from F using the above axioms, then there is some relation in which F holds but X → Y does not.

4.3 Normalization

Consider the relation shown in Table 4.1. In this relation, an order no. includes many items. The attribute item lines is not a single attribute but is composed of many attributes.

Table 4.1 An Unnormalized Relation

Order no.   Order date   Item lines
                         Item code   Quantity   Price/unit
                         Item code   Quantity   Price/unit
                         Item code   Quantity   Price/unit

Besides this, the number of item lines is variable. This form is not suitable for storage as a file in a computer. Further, retrieval of data based on a component of a composite attribute is difficult. For example, to find out how many items with a specified item code are ordered, one must break up the composite attribute first before attempting a search. Thus a relation with a format such as the one in Table 4.1 is not allowed. It is said to be unnormalized.

To normalize this relation, a composite attribute is converted to individual attributes. The normalization step consists of first identifying the fields within a composite attribute as individual attributes. After doing this, the common attributes for a composite attribute are duplicated as many times as there are lines in the composite attribute. The normalized relation corresponding to the relation given in Table 4.1 is shown in Table 4.2.
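The flattening step described above can be sketched in a few lines of Python. The order numbers, dates, item codes and prices below are made-up sample values, and the function name `to_1nf` and the dictionary layout are assumptions for this illustration only.

```python
def to_1nf(orders):
    """Flatten orders with a variable number of item lines into 1NF rows.

    The common attributes (order no., order date) are repeated once for
    every item line, so each output row has exactly five single values.
    """
    rows = []
    for order in orders:
        for item_code, quantity, price_per_unit in order["item_lines"]:
            rows.append((order["order_no"], order["order_date"],
                         item_code, quantity, price_per_unit))
    return rows


# Hypothetical unnormalized data: each order carries a list of item lines.
unnormalized = [
    {"order_no": 1001, "order_date": "2024-01-05",
     "item_lines": [(11, 5, 2.50), (12, 1, 40.00)]},
    {"order_no": 1002, "order_date": "2024-01-06",
     "item_lines": [(11, 3, 2.50)]},
]

flat = to_1nf(unnormalized)
```

The result has one row per item line (three rows here). The order date of order 1001 and the price of item 11 each appear twice, which is the redundancy that the higher normal forms remove.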

Table 4.2 Normalized Form of the Relation given in Table 4.1

Order no.   Order date   Item code   Quantity   Price/unit

The relation shown in Table 4.2 is said to be in First Normal Form, abbreviated as 1NF. This form is also called a flat file. There are no composite attributes, and every attribute is single and describes one property. Converting a relation to 1NF is the first essential step in normalization.

There are successively higher normal forms known as 2NF, 3NF, BCNF, 4NF and 5NF. Each form is an improvement over the earlier form. In other words, 2NF is an improvement on 1NF, 3NF is an improvement on 2NF, and so on. The relations in a higher normal form are a subset of the relations in a lower normal form, as shown in Fig. 4.1. The higher normalization steps are based on three important concepts.

Fig. 4.1 Illustration of successive normal forms of a relation (nested sets, with 5NF innermost, then 4NF, BCNF, 3NF, 2NF and 1NF outermost)

1. Dependence among attributes in a relation.
2. Identification of an attribute or a set of attributes as the key of a relation.
3. Multivalued dependency between attributes.

(i) Functional dependency: As the concept of dependency is very important, it is essential that we first understand it well and then proceed to the idea of normalization. There is no fool-proof algorithmic method of identifying dependency. We have to use our common sense and judgement to specify dependencies.

Let X and Y be two attributes of a relation. Given the value of X, if there is only one value of Y corresponding to it, then Y is said to be functionally dependent on X. This is indicated by the notation:

X → Y
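Although a dependency has to be specified from the meaning of the data, sample data can at least be used to refute a proposed dependency: if two rows agree on X but differ on Y, then X → Y cannot hold. The Python sketch below shows such a check. The function name `fd_holds` and the sample rows are hypothetical, and a result of `True` only means that the sample does not contradict the dependency.

```python
def fd_holds(rows, lhs, rhs):
    """Check whether the rows are consistent with the FD lhs -> rhs.

    `rows` is a list of dicts; `lhs` and `rhs` are lists of attribute names.
    """
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # The first Y value seen for each X is remembered; any later
        # row with the same X must carry the same Y.
        if seen.setdefault(x, y) != y:
            return False
    return True


# Hypothetical sample rows.
rows = [
    {"order_no": 1, "item_code": 11, "item_name": "bolt"},
    {"order_no": 2, "item_code": 11, "item_name": "bolt"},
    {"order_no": 2, "item_code": 12, "item_name": "nut"},
]

name_depends_on_code = fd_holds(rows, ["item_code"], ["item_name"])
code_depends_on_order = fd_holds(rows, ["order_no"], ["item_code"])
```

In this sample `name_depends_on_code` is `True`, while `code_depends_on_order` is `False` because order 2 contains two different item codes.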

For example, given the value of item code, there is only one value of item name for it. Thus item name is functionally dependent on item code. This is shown as:

Item code → Item name

Similarly in Table 4.2, given an order number, the date of the order is known. Thus:

Order no. → Order date

Functional dependency may also be based on a composite attribute. For example, if we write

X, Z → Y

it means that there is only one value of Y corresponding to given values of X and Z. In other words, Y is functionally dependent on the composite X, Z. In Table 4.2, for example, Order no. and Item code together determine Qty. and Price. Thus:

Order no., Item code → Qty., Price

As another example, consider the relation

Student (Roll no., Name, Address, Dept., Year of study)

In this relation, Name is functionally dependent on Roll no. In fact, given the value of Roll no., the values of all the other attributes can be uniquely determined. Name and Department are not functionally dependent, because given the name of a student one cannot find his department uniquely. This is due to the fact that there may be more than one student with the same name. Name in this case is not a key. Department and Year of study are not functionally dependent, as Year of study pertains to a student whereas Department is an independent attribute. The functional dependency in this relation is shown in Fig. 4.2 as a dependency diagram. Such dependency diagrams are very useful in normalization.

Fig. 4.2 Dependency diagram for the relation "Student" (Roll no. → Name, Address, Department, Year of study)

(ii) Relation key: Consider a vendor relation with the attributes Vendor code, Vendor name and Address. Given the Vendor code, the Vendor name and Address are uniquely determined. Thus Vendor code is the relation key. Given a relation, if the value of an attribute X uniquely determines the values of all other attributes in a row, then X is said to be the key of that relation.

Sometimes more than one attribute is needed to uniquely determine the other attributes in a relation row. In that case such a set of attributes is the key. In Table 4.2, Order no. and Item code together determine Order date, Qty. and Price. Thus Order no. and Item code together form the key. In the relation "Supplies" (Vendor code, Item code, Qty. supplied, Date of supply, Price/unit), Vendor code and Item code together form the key. This dependency is shown in the dependency diagram of Fig. 4.3.

Fig. 4.3 Dependency diagram for the relation "Supplies" (Vendor code, Item code → Qty. supplied, Date of supply, Price/unit)

Observe that in the figure the fact that Vendor code and Item code together form a composite key is clearly shown by enclosing them together in a rectangle.

4.3.1 Why do we Normalize a Relation?

Relations are normalized so that when relations in a database are altered during the lifetime of the database, we do not lose information or introduce inconsistencies. The types of alteration normally needed for relations are:

1. Insertion of new data values into a relation. This should be possible without being forced to leave blank fields for some attributes.
2. Deletion of a tuple, namely, a row of a relation. This should be possible without losing vital information unknowingly.
3. Updating or changing a value of an attribute in a tuple. This should be possible without exhaustively searching all the tuples in the relation.

Consider, for example, the relation shown in Table 4.2. If we wish to enter in our database a new item with item code 3945, whose price/unit is known but for which no order has been placed, we cannot do it unless we leave blank fields for order no. and order date. Order no. is a key field, and leaving a blank for it would make retrieval impossible.

If Order no. 1886 in Table 4.2 is deleted, then we lose the information about the price of Item code 4629. Such an accidental loss of information should not occur. If the price of Item code 4627 is changed, then in the relation of Table 4.2 it is necessary to find all the tuples (rows) where Item code 4627 occurs and then change the Price/unit in all these places. In the table, three rows should be changed. If by mistake one row is missed, there will be an inconsistency in the database.

Ideal relations after normalization should have the following properties, so that the problems mentioned above do not occur:

1. No data value should be duplicated in different rows unnecessarily.
2. A value must be specified (and required) for every attribute in a row.
3. Each relation should be self-contained. In other words, if a row from a relation is deleted, important information should not be accidentally lost.
4. When a row is added to a relation, other relations in the database should not be affected.
5. A value of an attribute in a tuple may be changed independently of other tuples in the relation and of other relations.

The idea of normalizing relations to higher and higher normal forms is to attain the goal of having a set of ideal relations meeting the above criteria.

4.3.2 Second Normal Form Relation

We will now define a relation in the Second Normal Form (2NF). A relation is said to be in 2NF if it is in 1NF and its non-key attributes are functionally dependent on the key attribute(s). Further, if the key has more than one attribute, then no non-key attribute should be functionally dependent upon only a part of the key.

Consider, for example, the relation given in Table 4.2. This relation is in 1NF. The key is (Order no., Item code). The dependency diagram for the attributes of this relation is shown in Fig. 4.4. The non-key attribute Price/unit is functionally dependent on Item code, which is part of the relation key. Also, the non-key attribute Order date is functionally dependent on Order no., which is a part of the relation key. Thus the relation is not in 2NF. It can be transformed to 2NF by splitting it into three relations as shown in Table 4.3.

Fig. 4.4 Dependency diagram for the relation given in Table 4.2 (Order no. → Order date; Item code → Price/unit; Order no., Item code → Quantity)

Table 4.3 Splitting of Relation given in Table 4.2 into 2NF Relations

(a) Orders
Order no.   Order date

(b) Order details
Order no.   Item code   Qty.

(c) Prices
Item code   Price/unit

In Table 4.3 the relation "Orders" has Order no. as the key. The relation "Order details" has the composite key Order no. and Item code. The relation "Prices" has Item code as the key. In all three relations the non-key attributes are functionally dependent on the whole key. Observe that by transforming to 2NF relations, the repetition of Order date (Table 4.2) has been removed. Further, if an order for an item is cancelled, the price of the item is not lost. For example, if Order no. "1886" for Item code "4629" is cancelled in Table 4.2, then the fourth row will be removed and the price of the item is lost. In Table 4.3 only the fourth row of Table 4.3 (b) is omitted. The item price is not lost, as it is available in Table 4.3 (c). The date of the order is also not lost, as it is in Table 4.3 (a).

These relations in 2NF meet all the "ideal" conditions specified. Observe that the three relations obtained are self-contained. There is no duplication of data within a relation.

4.3.3 Third Normal Form

Normalization to Third Normal Form is needed when the attributes in a relation tuple are not functionally dependent only on the key attribute. If two non-key attributes are functionally dependent, then there will be unnecessary duplication of data. Consider the relation given in Table 4.4. Here, Roll no. is the key and all other attributes are functionally dependent on it. Thus it is in 2NF.

Table 4.4 A 2NF Form Relation

Roll no.   Name       Department    Year   Hostel name
1784       Raman      Physics       1      Ganga
1648       Krishnan   Chemistry     1      Ganga
1768       Gopalan    Mathematics   2      Kaveri
1848       Raja       Botany        2      Kaveri
1682       Maya       Geology       3      Krishna
1485       Singh      Zoology       4      Godavari

If it is known that in the college all first year students are accommodated in Ganga hostel, all second year students in Kaveri, all third year students in Krishna, and all fourth year students in Godavari, then the non-key attribute Hostel name is dependent on the non-key attribute Year. This dependency is shown in Fig. 4.5. Observe that given the year of a student, his hostel is known, and vice versa.

Fig. 4.5 Dependency diagram for the relation given in Table 4.4 (Roll no. → Name, Department, Year; Year → Hostel name)

The dependency of hostel on year leads to duplication of data, as is evident from Table 4.4. If it is decided to ask all first year students to move to Kaveri hostel and all second year students to Ganga hostel, this change has to be made in many places in Table 4.4. Also, when a student's year of study changes, his change of hostel must also be recorded in Table 4.4. This is undesirable.

A table is said to be in 3NF if it is in 2NF and no non-key attribute is functionally dependent on any other non-key attribute. Table 4.4 is thus not in 3NF. To transform it to 3NF, we should introduce another relation, which includes the functionally related non-key attributes. This is shown in Table 4.5.

It should be stressed again that dependency between attributes is a semantic property and has to be stated in the problem specification. In this example the dependency between Year and Hostel is clearly stated. If the hostels allocated to students did not depend on their year in college, then Table 4.4 would already be in 3NF.

Table 4.5 Conversion of Table 4.4 into two 3NF Relations

Roll no.   Name       Department    Year
1784       Raman      Physics       1
1648       Krishnan   Chemistry     1
1768       Gopalan    Mathematics   2
1848       Raja       Botany        2
1682       Maya       Geology       3
1485       Singh      Zoology       4

Year   Hostel name
1      Ganga
2      Kaveri
3      Krishna
4      Godavari

Let us consider another example of a relation. The relation "Employee" is given below and its dependency diagram in Fig. 4.6.

Employee (Employee code, Employee name, Dept., Salary, Project no., Termination date of project)

As can be seen from the figure, the termination date of a project is dependent on the Project no. Thus this relation is not in 3NF. The 3NF relations are:

Employee (Employee code, Employee name, Dept., Salary, Project no.)
Project (Project no., Termination date)

Fig. 4.6 Dependency diagram of Employee relation (Employee code → Employee name, Department, Salary, Project no.; Project no. → Termination date)
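Returning to the student example, the split of Table 4.4 into the two relations of Table 4.5 can be reproduced with two projections, and joining them on Year gives back the original rows. The Python sketch below uses the data of Table 4.4; the variable names are chosen for this illustration.

```python
# Rows of Table 4.4: (roll no., name, department, year, hostel name).
STUDENTS = [
    (1784, "Raman", "Physics", 1, "Ganga"),
    (1648, "Krishnan", "Chemistry", 1, "Ganga"),
    (1768, "Gopalan", "Mathematics", 2, "Kaveri"),
    (1848, "Raja", "Botany", 2, "Kaveri"),
    (1682, "Maya", "Geology", 3, "Krishna"),
    (1485, "Singh", "Zoology", 4, "Godavari"),
]

# Projection on the key and the attributes that depend only on the key.
student = {(roll, name, dept, year) for roll, name, dept, year, _ in STUDENTS}

# Projection on the functionally related non-key attributes.
# Using a set removes the repeated (year, hostel) pairs.
hostel = {(row[3], row[4]) for row in STUDENTS}

# Joining the two projections on Year reconstructs Table 4.4.
hostel_of = dict(hostel)
rejoined = {(roll, name, dept, year, hostel_of[year])
            for roll, name, dept, year in student}
```

After the split, moving all first year students to another hostel means changing a single row of `hostel` rather than every first year row of the student data.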

4.3.4 Boyce-Codd Normal Form (BCNF)

Assume that a relation has more than one possible key. Assume further that these keys are composite and have a common attribute. If an attribute of one composite key is dependent on an attribute of the other composite key, a normalization called BCNF is needed. Consider, as an example, the relation "Professor":

Professor (Professor code, Dept., Head of Dept., Percent time)

It is assumed that:

1. A professor can work in more than one department.
2. The percentage of the time he spends in each department is given.
3. Each department has only one Head of Department.

The dependency diagram for the above relation is given in Fig. 4.7. Table 4.6 gives the relation attributes. The two possible composite keys are Professor code and Dept., or Professor code and Head of Dept. Observe that Dept. as well as Head of Dept. are not non-key attributes. They are a part of a composite key.

Fig. 4.7 Dependency diagram of Professor relation (attributes: Professor code, Department, Head of Department, Percent time)

Table 4.6 Normalization of Relation "Professor"

Professor code   Department    Head of Dept.   Percent time
P1               Physics       Ghosh           50
P1               Mathematics   Krishnan        50
P2               Chemistry     Rao             25
P2               Physics       Ghosh           75
P3               Mathematics   Krishnan        100

The relation given in Table 4.6 is in 3NF. Observe, however, that the names of Dept. and Head of Dept. are duplicated. Further, if Professor P2 resigns, rows 3 and 4 are deleted, and we lose the information that Rao is the Head of Department of Chemistry.

The normalization of the relation is done by creating a new relation for Dept. and Head of Dept. and deleting Head of Dept. from the Professor relation. The normalized relations are shown in Table 4.7 and the dependency diagrams for these new relations in Fig. 4.8. The dependency diagram gives the important clue to this normalization step, as is clear from Figs. 4.7 and 4.8.

Table 4.7 Normalized Professor Relation in BCNF

(a)
Professor code   Department    Percent time
P1               Physics       50
P1               Mathematics   50
P2               Chemistry     25
P2               Physics       75
P3               Mathematics   100

(b)
Department    Head of Dept.
Physics       Ghosh
Mathematics   Krishnan
Chemistry     Rao

Fig. 4.8 Dependency diagram of Professor relation (Professor code, Department → Percent time; Department → Head of Department)
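A common way to state the BCNF condition is that the left-hand side of every non-trivial FD must be a superkey, that is, its closure must contain all attributes of the relation. The Python sketch below applies this test to the Professor relation and to the two relations of Table 4.7. The function names and attribute spellings are assumptions for this illustration, and the FD Head of Dept. → Dept. is assumed because the text treats (Professor code, Head of Dept.) as a possible key.

```python
def closure(attrs, fds):
    """Return every attribute functionally determined by `attrs` under `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(attributes, fds):
    """Return the non-trivial FDs whose left-hand side is not a superkey."""
    all_attrs = set(attributes)
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not set(rhs) <= set(lhs) and closure(lhs, fds) != all_attrs
    ]


PROFESSOR = ("prof_code", "dept", "head", "percent_time")
PROFESSOR_FDS = [
    (("prof_code", "dept"), ("percent_time",)),
    (("dept",), ("head",)),
    (("head",), ("dept",)),  # assumption: a person heads only one department
]

# The two relations of Table 4.7 with the FDs that apply to each.
PROF_DEPT = ("prof_code", "dept", "percent_time")
PROF_DEPT_FDS = [(("prof_code", "dept"), ("percent_time",))]
DEPARTMENT = ("dept", "head")
DEPARTMENT_FDS = [(("dept",), ("head",)), (("head",), ("dept",))]

violations = bcnf_violations(PROFESSOR, PROFESSOR_FDS)
```

For the original relation, `violations` lists the two dependencies between Dept. and Head of Dept., since neither attribute alone determines Professor code or Percent time. For the two decomposed relations the same function returns an empty list.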

4.3.5 Fourth and Fifth Normal Form

When attributes in a relation have multivalued dependency, further normalization to 4NF and 5NF is required. We will illustrate this with an example. Consider a vendor supplying many items to many projects in an organization. The following are the assumptions:

1) A vendor is capable of supplying many items.
2) A project uses many items.
3) A vendor supplies to many projects.
4) An item may be supplied by many vendors.

Table 4.8 gives a relation for this problem and Fig. 4.9 the dependency diagram(s).

Table 4.8 Vendor-supply-projects Relation

Vendor code   Item code   Project no.
V1            I1          P1
V1            I2          P1
V1            I1          P3
V1            I2          P3
V2            I2          P1
V2            I3          P1
V3            I1          P1
V3            I1          P2

Fig. 4.9 Dependency diagrams of vendor-supply-project relation (the arrows in the figure indicate multivalued dependency)

The relation given in Table 4.8 has a number of problems. For example:

- If vendor V1 has to supply to project P2, but the item is not yet decided, then a row with a blank for item code has to be introduced.
- The information about item I1 is stored twice for vendor V3.

Observe that the relation given in Table 4.8 is in 3NF and also in BCNF. It still has the problems mentioned above. The problem is reduced by expressing this relation as two relations in the Fourth Normal Form (4NF). A relation is in 4NF if it has no more than one independent multivalued dependency, or one independent multivalued dependency with a functional dependency. Table 4.8 can be expressed as the two 4NF relations given in Table 4.9. The fact that vendors are capable of supplying certain items and that they are assigned to supply for some projects is independently specified in the 4NF relations.

Table 4.9 Vendor-supply-project Relations in 4NF

(a) Vendor Supply
Vendor code   Item code
V1            I1
V1            I2
V2            I2
V2            I3
V3            I1

(b) Vendor Project
Vendor code   Project no.
V1            P1
V1            P3
V2            P1
V3            P1
V3            P2

These relations still have a problem. Even though vendor V1's capability to supply items and his allotment to supply for specified projects are known, he may not actually be supplying them to a project, as the project may not need them. We thus need another relation which specifies this. This is called the 5NF form. The 5NF relations are the relations in Table 4.9 (a) and (b) together with the relation given in Table 4.10.

Table 4.10 5NF Additional Relation

Project no.   Item code
P1            I1
P1            I2
P2            I1
P3            I1
P3            I3

In Table 4.11 we summarize the normalization steps already explained.

Table 4.11 Summary of Normalization Steps

Input relation   Transformation                                          Output relation
All relations    Eliminate variable length records; remove               1NF
                 multiattribute lines in table
1NF relation     Remove dependency of non-key attributes on part         2NF
                 of a multiattribute key
2NF              Remove dependency of non-key attributes on other        3NF
                 non-key attributes
3NF              Remove dependency of an attribute of a                  BCNF
                 multiattribute key on an attribute of another
                 (overlapping) multiattribute key
BCNF             Remove more than one independent multivalued            4NF
                 dependency from relation by splitting relation
4NF              Add one relation relating attributes with               5NF
                 multivalued dependency to the two relations with
                 multivalued dependency
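The 4NF step for the vendor example can be checked mechanically: projecting Table 4.8 on (Vendor code, Item code) and on (Vendor code, Project no.) should give the two relations of Table 4.9, and joining them on Vendor code should give back exactly the rows of Table 4.8. The Python sketch below performs this check with the data of Table 4.8; the variable names are chosen for this illustration.

```python
# Rows of Table 4.8: (vendor code, item code, project no.).
SUPPLY = {
    ("V1", "I1", "P1"), ("V1", "I2", "P1"),
    ("V1", "I1", "P3"), ("V1", "I2", "P3"),
    ("V2", "I2", "P1"), ("V2", "I3", "P1"),
    ("V3", "I1", "P1"), ("V3", "I1", "P2"),
}

# The two projections correspond to Table 4.9 (a) and (b).
vendor_item = {(vendor, item) for vendor, item, _ in SUPPLY}
vendor_project = {(vendor, project) for vendor, _, project in SUPPLY}

# Natural join of the two projections on vendor code.
rejoined = {
    (vendor, item, project)
    for vendor, item in vendor_item
    for other_vendor, project in vendor_project
    if vendor == other_vendor
}

is_lossless = rejoined == SUPPLY
```

For this data `is_lossless` is true, so the eight rows of Table 4.8 carry no information beyond the five rows in each of the two smaller relations.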

4.4 Data Storage Devices

There are many ways of storing and accessing data in an application. Let us get a brief idea of the various physical methods of storing and accessing data. A comparison of different file organizations and the advanced storage techniques available will be the focus of this section.

Physical Data Organization: As we know, data in storage devices is stored in files. A file is a collection of records, and a record contains values for many fields. For each file, the contents of the records, such as the type of each field, must be defined. Based on the record instances they hold, files can be categorised into:

- Homogeneous file: a file that holds instances of a single record type.
- Non-homogeneous file: a file that holds instances of many different record types.

The records in a file may be of fixed length or of varying length. A field (like an attribute in a table) must have a domain and an internal representation. Normally fields are of fixed length, but some files permit varying length fields. Also, in a record, a field may occur once or many times (as an array).

In summary, a typical application will have multiple files, and the data within these files will be inter-related. To process inter-related records we have to capture inter-file relationships efficiently.

Based on usage, a file can be categorised as:

- Master file: contains operational data.
- Transaction file: contains records of various business transactions.
- Reference file: contains semi-permanent data for use in processing.
- History file: contains past data from either master or transaction files.

One may note at this point that, to process data stored in files, a programming language provides operations for writing a new record, deleting an existing record, and retrieving and modifying an existing record.

Access Method: To retrieve data from a file, various access methods can be used. In these methods one has to indicate the access path used to reach a record stored in the file. The nature of this path depends on how the data is organised and searched. We are mainly concerned with the length of the access path, which is measured in terms of the number of I/O operations required to bring a desired record into memory. An access method is software that searches through the access path to locate the record. Different types of access methods, corresponding to different ways of file organization, are typically provided as part of a data processing environment. The basic access methods are the following:

- In the sequential access mode, the next physical record is retrieved.
- In the random access mode, any record in the file can be accessed at random.
- A dynamic access mode permits both sequential and random access.

Even in the random access mode, it may be necessary to access a few records in the file sequentially. In most data processing situations, we are interested in locating records given a value for one of their fields, called key fields. For example, if we wish to locate the record of a bank customer given his account no., then account no. is the search key. Keys may be broadly divided into two categories:

- The primary key has unique values in all records.
- The secondary key may be duplicated in many records of the file.
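The difference between the sequential and random access modes can be illustrated with fixed-length records, where the position of the n-th record can be computed directly from its index. The Python sketch below uses an in-memory byte stream in place of a disk file. The record layout (a 4-byte account number followed by a 20-byte name), the function names and the sample customers are assumptions for this illustration.

```python
import io
import struct

# Fixed-length record: 4-byte integer account number, 20-byte padded name.
RECORD = struct.Struct("<i20s")


def write_records(f, records):
    """Append records to the file, each occupying RECORD.size bytes."""
    for account_no, name in records:
        f.write(RECORD.pack(account_no, name.encode("ascii")))


def decode(chunk):
    account_no, raw_name = RECORD.unpack(chunk)
    return account_no, raw_name.rstrip(b"\x00").decode("ascii")


def read_sequential(f):
    """Sequential access: yield records in physical order from the start."""
    f.seek(0)
    while True:
        chunk = f.read(RECORD.size)
        if len(chunk) < RECORD.size:
            break
        yield decode(chunk)


def read_random(f, index):
    """Random access: jump straight to the record at position `index`."""
    f.seek(index * RECORD.size)
    return decode(f.read(RECORD.size))


f = io.BytesIO()
write_records(f, [(101, "Asha"), (102, "Ravi"), (103, "Meena")])

all_records = list(read_sequential(f))
third = read_random(f, 2)
```

Reading the third record sequentially passes over the first two, whereas `read_random` needs a single seek. Locating a record by account number rather than by position would additionally require an index or a search over the records.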

Performance: The performance of a file organisation may be measured in absolute terms to find out how it will perform in a given situation. The overall performance depends not only on how the data is organised, but also on the types of file operations and their frequency (e.g. number of reads, number of updates, etc.). The important performance measures are:

- Response time: the time elapsed from the initiation to the completion of an operation. It includes:
    - time spent waiting for processor and device availability, which depends on the system and its load;
    - time required to locate the data on the device;
    - time required to transfer the data between device and memory;
    - time to process the data.
- Search length: the length of the access path. It may vary over a range of values, as the records in a file may not all have the same access path. Hence the average length is used in evaluating performance.
- Expected I/O time: this is calculated for comparing files where different numbers of sequential and random accesses are required.

Application Parameters: The following application parameters are used in evaluating the performance of a file organization for an application:

- Hit ratio: what percentage of file records will be accessed in an operation (i.e., business function).
- Volatility: what percentage of records are added, modified and deleted (over a period of time).
- Access keys: which fields are used for accessing the records.
- Frequency: how frequently the operation is executed.

Hence, although an application may make many kinds of accesses to different data types, only the dominant operations, which require efficient handling, are considered.

Storage Devices: There are many storage devices available in the market to store the data required by a DBMS. We shall discuss some of them in general.

Magnetic Tape: The salient features of this device are listed below:

1. It is a sequential access device, where data is recorded as a physical sequence of blocks separated by 'inter-record gaps'.
2. The tape is moved when a read/write operation is initiated.
3. Data can be recorded at high densities, measured in bytes per inch of tape.
4. The transfer rate between the device and memory is 50 to 300 KBytes per second.
5. A block on tape is the unit of I/O, and may be of fixed or varying length.
6. One or many files may fit on one reel of tape, whereas a large file may occupy many reels.
7. Since the device is sequential in nature, a file stored on tape can be either in input mode or in output mode. Hence a tape file cannot be updated in place; instead, a new copy has to be created.
8. A block on tape may store one or more file records; this is called 'blocking' of records. The blocking and 'unblocking' of records is carried out automatically by the file organization software. Blocking increases the effective utilisation of the tape and reduces the number of I/O operations on the file.
9. The tape includes certain labels to ensure identification of the stored data and its correct processing. The two types of labels, called the volume label and the file label, contain the following:

Typical Volume Label Contents
- label number (multiple labels permitted)

volume serial number
security code
identification

Typical File Label Contents
file identifier
file and volume sequence numbers, for proper sequencing of a multivolume file
generation and version numbers
creation and expiry dates
file security
count of data blocks (in trailer label)

From the above it is clear that access to data stored on tape is highly restricted because the device is sequential. Hence in a DBMS environment, the use of tape files is minimal and is restricted to the following:
for archiving historical data
for storing transaction logs (quite rare)
for storing transaction data to help in disaster recovery (again, quite rare nowadays).

Magnetic Disks: The salient features of magnetic disks are as below:
1. Disk devices are the primary means of storing data for DBMSs. Because they permit random access to the stored data, they offer flexibility in organising data in many ways for efficient access.
2. Capacities of disk devices vary from 1 gigabyte to hundreds of gigabytes, and they support a high rate of data transfer.
3. The data is stored in sectors, which are organised into multiple circular tracks on a magnetised, rotating disk medium.
4. Each disk surface has its own read/write head, which is positioned on the required track and sector for reading or writing.

5. The device permits direct access to stored data by specifying the address of the block or sector containing the data.
6. The following hardware parameters are considered for performance evaluation:
Seek time: the time to position the head on the required cylinder (minimum about 4-5 msec; the average and maximum values are higher).
Rotational delay: the time to locate the required block on a track, also called latency time; on average it equals half of the rotation time (typically 8.33 msec at 3600 rpm, where one rotation takes 16.67 msec).
Transfer rate: 200 to 3000 KB/sec (very slow compared to CPU speeds of megabytes per second; hence disk I/O must be minimised).
7. The disk volume contains a variety of control data for identification as well as quick positioning.

In summary, the disk is a fairly complex device, and the flexibility it offers is the backbone of modern DBMSs.

4.5 File Systems

A file system is an important component of the operating system of a computer. A DBMS uses the features offered by a file system and builds its own facilities on top of it. The most important functions of a file system are:
1. Directory service: A hierarchical directory structure is commonly provided for grouping related files in multiple directory levels. It also stores control information (access rights, how and where the data are stored, date of creation, etc.).

2. Space allocation of files: Space is allocated to a file as it is created. A page of fixed size, which may store one or more file records, is used as the unit of allocation. Space may be allocated to a file in contiguous pages.

Blocking and buffering of records: One disk block/page can contain multiple file records, which improves disk space utilisation. A large chunk of memory, called 'buffer' space, is set aside; disk pages are read into it, which increases processing efficiency. The OS uses some policy (such as least-recently-used) for replacing pages in the buffers to make room for new pages.

Many operating systems provide a few built-in file organisations as part of their file system. A file organisation defines how data will be organised, what additional structures will be created to access data efficiently, and what the storage size and speed of access will be. Disk devices allow a lot of flexibility in organising data. The typical file organisations offered by many file systems include sequential, indexed and hash-based methods. However, a DBMS may offer only a select few.

Record pointers: A relationship between two record types is often implemented using pointers. A pointer is the address of a record on disk, which allows direct access to related records. Pointers (relative or absolute) are also used for implementing indexes and for linking records that have the same value for a field of interest.

Sequential Organisation

A sequential file has its records stored in a physical sequence. In order to facilitate proper processing, the records are stored in sorted order based on the values of some field (e.g., account number). The file requires minimal storage, but offers only a few operations, which limits its usefulness. The operations are:

open the file: either in input or output mode, although update mode is possible for disk sequential files.
read the next record, write the next record, or rewrite the last read record without changing its length.
close the file.

Such files may be placed on a tape or disk device. In order to achieve independence from the device used, such files are not updated in place; a new version of the file is produced when updating a sequential file. To achieve efficiency of operation, updates are carried out for a batch of transactions instead of for each transaction.

Because of the limited options available with sequential files, their use in a DBMS is limited to the following situations:
when the file has a high hit ratio but infrequent updates
for small files
for intermediate results of operations.

Indexed-Sequential Files

An indexed-sequential file facilitates both sequential and direct access on one key field. The records are stored in ascending order of their key values. An index is also built on the key field to facilitate random access. An index is a small table whose entries are (key, record-address) pairs. In a sparse index, an entry is made for only one key per block of records; that key can be considered an 'anchor' for the block/sector/track/cylinder on the disk. By examining consecutive entries in the index table, the access method can determine the block in which a record with the desired key may be found.

For large data files, the index table itself becomes large. Note that the index table is also stored externally on the disk, and may need to be read in parts for searching its entries. To speed up searching in the index, it is

possible to create multiple levels of indexes, where each index at a higher level is an index to the next level of index, as indicated below:
1. Track index: Contains the anchor entry of each track in a cylinder; it is stored on the same cylinder.
2. Cylinder index: Contains one anchor per cylinder; it is stored in a separate area.
3. Master index: An index to the cylinder index, created when the latter is itself large; it has one entry for each block of the cylinder index.

The procedure to locate a data record using these levels of indexes is:
1. read and search the master index, and locate the block of the cylinder index in which the key falls; read that cylinder index block;
2. read the track index stored on that cylinder, and search it to get the likely track on which the record may be present;
3. read the records on the track for the required key (this may be done in one or more I/O operations).

Thus, up to 3 reads are required for searching the index levels.

The advantage of using a sparse index is that it saves considerable storage space: the index is small and facilitates efficient random access. The disadvantages are that there can be only one such index for a file, and that insert/delete operations may require periodic re-organisation of the file.

Insertion in indexed sequential file:
First find the track to which the record logically belongs.
If this track is full, make room for the new record. While inserting, one of the existing records on the track gets displaced.
Store the displaced record in the overflow track on the same cylinder, to avoid unnecessary movement of the disk arm.

Modify the format of the track index to include not only the highest key on the track, but also the highest key and its record address in the overflow area.
Link the overflow records from the same track using pointers to facilitate search.
If the overflow track itself is full, a separate overflow area, consisting of a few cylinders, is used to absorb the overflows.

From the above it is clear that the access time suffers as insertions take place.

Deletion in indexed sequential file: A simple way to delete a record is to mark it with a flag. To remove overflows, and to physically delete flagged records, an indexed sequential file is periodically re-organised by re-creating it as a new file. This action restores its performance to the initial level.

The following operations are normally supported for an indexed-sequential file:
1. OPEN (in Input, Output or I/O mode)
2. READ next, READ by key
3. WRITE (also means insert in I/O mode)
4. REWRITE (key can't be changed)
5. DELETE last read record or by key
6. Position by key for subsequent sequential processing.

Hashed Files

A hashed file permits random access on some field (usually the key field) of the record. It uses a mapping, called a 'hashing function', to convert a key value into a record position in the file. The hashing function should be such

that it produces record positions within the file space, and gives distinct positions for all keys. Meeting both of these requirements is difficult, so two keys may produce the same record position; this situation is called a 'collision'. The hashing methods are designed to handle collisions.

Ideally, the method that converts key values to record positions should distribute the records randomly within the file, since this removes any 'bias' of the keys on their positions. Also, while designing the function one has to consider the type and range of key values for the given application. Experiments have shown that the following methods work satisfactorily in most cases:

Multiplicative method: Multiply the given key by a factor, and take the m lower significant bits of the product as the hash result. The recommended factor is (sqrt(5) - 1)/2, approximately 0.618, which is the reciprocal of the golden ratio.
Division method: Take the remainder of a division (i = k mod p), where p is chosen as a prime number to remove any bias of the key from the result and to achieve good scattering.

When no collisions are present, the hashed file organisation gives the best performance. The desired record is retrieved in a single access, as its position is obtained directly by hashing the specified key value. Insertions are also simple, as the new record can be stored at the place given by the hashing method. Deletion can be handled by a flag.

However, collisions are common, and the hashed file must use some method to handle them. We must detect that a collision has occurred and, for insertion, find a new free position where the record can be stored. The strategy for this should be such that all records, including those which collided, can be efficiently located during retrieval.
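The two hashing methods can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function names and the file size of 97 positions are hypothetical, and the multiplicative version uses the common textbook formulation (keep the fractional part of k times the factor and scale it to m bits), which is an assumption about what the description above intends.

```python
import math

# Multiplier recommended in the text: (sqrt(5) - 1) / 2, roughly 0.618.
A = (math.sqrt(5) - 1) / 2


def division_hash(k: int, p: int) -> int:
    """Division method: i = k mod p, with p chosen as a prime number."""
    return k % p


def multiplicative_hash(k: int, m: int) -> int:
    """Multiplicative method (assumed textbook form): keep the fractional
    part of k * A and scale it to an m-bit position in [0, 2**m)."""
    fraction = (k * A) % 1.0
    return int(fraction * (1 << m))


# Two different keys can map to the same position: a collision.
FILE_SIZE = 97  # hypothetical prime number of record positions
collides = division_hash(70, FILE_SIZE) == division_hash(167, FILE_SIZE)
```

Both functions always return a position inside the file space, but neither can guarantee distinct positions for all keys, which is why a collision-handling method is still needed.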

There are many methods available for collision handling:
1. Chaining method: All records colliding at position i are linked by pointers (the colliding records are stored in a separate area).
2. Open addressing method: An alternate place is found, relative to the hashed value i, by using an appropriate increment (for some constant c).

A record being deleted may be on a chain (logical or physical) of collided records. This chain needs to be adjusted (not an easy task) before deleting the record. An easier way is to flag the record as deleted.

Hashed files, in general, give good performance; however, they do not facilitate sequential access, and only one hashed access path can be set up.

Indexed Files

Although an indexed sequential file uses a sparse index and stores the index efficiently, it does not handle insertions/deletions efficiently, and it allows only one sparse index. Indexed files use a dense index, in which the index has an entry for every key value. Many independent indexes can be created, which facilitate both sequential and random access on many fields (e.g., on both account number and customer name for the accounts file of a bank). Here the file records need not be physically stored in key sequence, which simplifies insertion. The index itself is stored in ascending order of key values to facilitate sequential retrieval of records in ascending order of keys. Being in sorted order, it can also be searched efficiently for random access.

For a large data file, the index itself becomes large. Since it is stored on disk, it must be so organised that it permits efficient search and updates without incurring a high I/O cost.
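The idea of an indexed file with several dense indexes can be sketched in memory as follows. This is a minimal illustration under stated assumptions: the class name DenseIndex, the field names acc_no and name, and the sample records are hypothetical, and a Python list stands in for the file on disk, with list positions playing the role of record addresses.

```python
import bisect


class DenseIndex:
    """A dense index: one (key, record position) entry per record, kept sorted."""

    def __init__(self):
        self.entries = []

    def add(self, key, position):
        bisect.insort(self.entries, (key, position))

    def lookup(self, key):
        """Random access: positions of all records having this key."""
        i = bisect.bisect_left(self.entries, (key, -1))
        found = []
        while i < len(self.entries) and self.entries[i][0] == key:
            found.append(self.entries[i][1])
            i += 1
        return found

    def scan(self):
        """Sequential access: record positions in ascending key order."""
        return [position for _, position in self.entries]


accounts = []          # records are stored in arrival order, not key order
by_number = DenseIndex()
by_name = DenseIndex()


def insert(record):
    position = len(accounts)
    accounts.append(record)            # no reshuffling of existing records
    by_number.add(record["acc_no"], position)
    by_name.add(record["name"], position)


for rec in ({"acc_no": 30, "name": "Rao"},
            {"acc_no": 10, "name": "Mehta"},
            {"acc_no": 20, "name": "Das"}):
    insert(rec)
```

Each index gives its own sorted view of the same unsorted file, so both account-number order and customer-name order are available without moving any record. A sorted in-memory list would not scale to a large index on disk, which is the problem the B-tree addresses.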

B-Tree: The B-tree is a practical and efficient method for organising indexes on external storage devices. Each level in the B-tree is like a level in the index, leading to a multilevel index. It effectively provides indexes to indexes from one level to another, until we reach the node leading to the desired record.

The B-tree data structure is defined as follows:
an order m is associated with a B-tree
the root node has at least 1 key value and 2 pointers
all leaf nodes are at the same level
all nodes other than the root have at least m/2 keys and m/2 + 1 pointers (the maximum number of keys is m)

Searching a B-tree for a given key value k:
Start with the root node. If the node contains k, the search ends here.
Otherwise, look for the two consecutive keys in the node between which k falls, and take the pointer between them to the node on the next level.
Repeat the above process until k is found, or until a leaf has been searched without finding it.
From the above we conclude that the maximum length of a search is equal to the height of the B-tree.

Insertion of a new key value k:
Starting from the root node, locate the leaf node B into which k must be placed.
If B is not full (has fewer than m keys), then k is added to B (maintaining the order of keys).

If B is already full, adding k to it will make it have m+1 keys. We now need to split B. This is done as follows:
get a free node B';
redistribute the m+1 keys between B and B', each having m/2 keys;
the middle key and the pointer to B' are inserted in the parent of B using the same procedure.
One may also note that a B-tree always grows upwards.

Deletion of a key k: The deletion must ensure that the definition of the B-tree is preserved. The deletion of k is simple if it is in a leaf node. Otherwise, we replace it by the next higher key k1, which will be in a leaf, and delete k1 instead. While deleting, a leaf may become critical when the number of keys in it falls below m/2; in that case, either borrow key values from its sibling nodes, or merge it with one of them. Merging reduces the number of nodes in the tree by one. Merging may also propagate upwards, and may reduce the height of the tree.

The advantages of the B-tree for organising an index are:
The order m is usually quite large; hence the height of the tree is usually small.
With buffering, most of the action takes place in main memory. In one experiment with m = 120, a file was created with 1,00,000 keys; 10 buffers were used for buffering nodes; it required only 22 reads and 857 writes to create the index.
Space utilisation is good, as nodes are required by definition to be at least half-full; it can be further improved by modifying the definition.
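The search and insertion procedures described above can be sketched in memory as follows. This is a minimal illustration under stated assumptions: the order m is taken to be even, duplicate keys are ignored, nodes are Python objects rather than disk pages, and deletion is omitted. The class and method names are hypothetical.

```python
import bisect


class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []   # an empty list marks a leaf

    @property
    def is_leaf(self):
        return not self.children


class BTree:
    def __init__(self, m=4):
        assert m % 2 == 0, "this sketch assumes an even order m"
        self.m = m
        self.root = Node()

    def search(self, k):
        """Return (found, nodes_visited), descending from the root."""
        node, visited = self.root, 0
        while True:
            visited += 1
            i = bisect.bisect_left(node.keys, k)
            if i < len(node.keys) and node.keys[i] == k:
                return True, visited
            if node.is_leaf:
                return False, visited
            node = node.children[i]      # pointer between the two keys around k

    def insert(self, k):
        split = self._insert(self.root, k)
        if split:                        # the root was split: the tree grows upwards
            middle, right = split
            self.root = Node([middle], [self.root, right])

    def _insert(self, node, k):
        i = bisect.bisect_left(node.keys, k)
        if i < len(node.keys) and node.keys[i] == k:
            return None                  # duplicate key, nothing to do
        if node.is_leaf:
            node.keys.insert(i, k)
        else:
            split = self._insert(node.children[i], k)
            if split:                    # child was split: absorb its middle key
                middle, right = split
                node.keys.insert(i, middle)
                node.children.insert(i + 1, right)
        if len(node.keys) > self.m:      # node now holds m+1 keys
            return self._split(node)
        return None

    def _split(self, node):
        """Split a node holding m+1 keys into two nodes of m/2 keys each."""
        half = self.m // 2
        middle = node.keys[half]
        right = Node(node.keys[half + 1:], node.children[half + 1:])
        node.keys = node.keys[:half]
        node.children = node.children[:half + 1]
        return middle, right

    def height(self):
        h, node = 1, self.root
        while not node.is_leaf:
            node, h = node.children[0], h + 1
        return h


tree = BTree(m=4)
for key in range(1, 101):
    tree.insert(key)
```

Every search visits at most one node per level, so the number of nodes visited never exceeds the height of the tree. With a realistic, much larger order the height stays small even for very large files.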

Secondary Indexes

The salient features of secondary indexes are listed below:
A secondary index is an index on a non-key field, which may not have unique values in the records.
A file may have many secondary indexes, to provide efficient access paths on many attributes independently.
The index may be exhaustive or selective: in the former case, index entries are made for all values of the attribute; in the latter case, entries exist only for selected values of the attribute.
As a value may occur in many records, a typical index entry consists of a value and a set of pointers to records. The size of an index entry therefore varies with the size of the set. One may choose an appropriate method for storing such varying-length entries.
Insertion and deletion of records in a file require the index to be modified too: insertion requires a pointer to be added to the set, and deletion requires a pointer to be removed from the set.

Varying Length Records

When file records are of fixed length, it is easy to calculate the offset (i.e., relative position) of a field within the record and access the field value. When the fields are of varying length, we need to store the field lengths within the record as well, which makes access to field values more difficult.

A varying length record contains a varying length field (e.g., employee name) or a varying number of occurrences of a field. The actual length, or the actual number of occurrences, must be stored within the record, as illustrated below:

E# | Length of name | Employee name | Salary

In order to interpret such a record, we must know how many varying-length fields are present and how the lengths are stored. The contents of a record need to be scanned in order to locate a field value (such as salary).
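The record layout illustrated above can be sketched with a length-prefixed encoding. This is a minimal illustration: the field widths (4-byte employee number, 1-byte name length, 4-byte salary), the byte order and the function names are assumptions made for the example, not part of any particular file system.

```python
import struct

HEADER = "<IB"   # assumed layout: 4-byte E#, then 1-byte length of name
SALARY = "<I"    # assumed layout: 4-byte salary after the name


def pack_employee(emp_no: int, name: str, salary: int) -> bytes:
    raw_name = name.encode("utf-8")      # must fit in 255 bytes here
    return (struct.pack(HEADER, emp_no, len(raw_name))
            + raw_name
            + struct.pack(SALARY, salary))


def unpack_employee(buffer: bytes):
    emp_no, name_length = struct.unpack_from(HEADER, buffer, 0)
    name_start = struct.calcsize(HEADER)
    name = buffer[name_start:name_start + name_length].decode("utf-8")
    # The salary offset is not fixed: it depends on the stored name length.
    (salary,) = struct.unpack_from(SALARY, buffer, name_start + name_length)
    return emp_no, name, salary


short_record = pack_employee(7, "Asha", 52000)
long_record = pack_employee(8, "Venkataraman", 61000)
```

The two records differ in length, and the salary can only be located after reading the name length, which is the scanning cost described above.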

A varying length record may be stored using different methods, as discussed below:
1. Reserved space: Here we allocate the maximum length required by a field, and use spaces/nulls to fill shorter values (e.g., shorter names are padded with blanks). Essentially, this corresponds to fixed-length fields. The wastage of space may be high.
2. Using pointers: The varying field/array is stored in a separate area. The record contains a pointer to where the value is stored (and the length of the value). The record now becomes a fixed length record, facilitating efficient access to fields.
3. Combined method: The record contains space for an average length or an average number of occurrences. Additional characters in the value are stored in a separate area, whose pointer is placed in the main record. Here also, the record is of fixed length. In most cases, it may not be necessary to access the separate area.

Multi-attribute Queries

A multi-attribute query requires data to be retrieved on the basis of multiple fields in the record. The query specifies values (or ranges of values) for the desired fields. For example, the following query specifies two fields:

select NAME from STUDENT where HOSTEL = 5 and GAME = 'cricket'

Queries can be divided into different categories based on their conditions:
1. exact-match query: It specifies a value for each of the selected fields and uses the '=' (equality) operation.
2. range query: It specifies a range for the selected attributes, or uses non-equality operations (e.g., <, >, etc.).
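One way such queries might be answered with secondary indexes of the kind described earlier (a value mapped to a set of record pointers) is sketched below. This is a minimal illustration: the sample STUDENT records, the function names and the use of list positions as record pointers are all hypothetical.

```python
from collections import defaultdict

students = [
    {"NAME": "Anil", "HOSTEL": 5, "GAME": "cricket"},
    {"NAME": "Bina", "HOSTEL": 5, "GAME": "hockey"},
    {"NAME": "Chitra", "HOSTEL": 3, "GAME": "cricket"},
    {"NAME": "Dev", "HOSTEL": 5, "GAME": "cricket"},
]


def build_secondary_index(records, field):
    """Map each value of a non-key field to the set of record positions."""
    index = defaultdict(set)
    for position, record in enumerate(records):
        index[record[field]].add(position)
    return index


hostel_index = build_secondary_index(students, "HOSTEL")
game_index = build_secondary_index(students, "GAME")


def exact_match(hostel, game):
    """HOSTEL = hostel and GAME = game: intersect the two pointer sets."""
    positions = hostel_index.get(hostel, set()) & game_index.get(game, set())
    return [students[p]["NAME"] for p in sorted(positions)]


def hostel_range(low, high):
    """low <= HOSTEL <= high: union the pointer sets of all values in range."""
    positions = set()
    for value, pointers in hostel_index.items():
        if low <= value <= high:
            positions |= pointers
    return [students[p]["NAME"] for p in sorted(positions)]
```

In the exact-match case the 'and' is evaluated on the pointer sets alone, and only the records that satisfy both conditions are then fetched from the file.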

The file may have indexes on some or all of the attributes used in the query. When more than one index is available, it should be possible to do the boolean operations (e.g., the 'and' above) on the indexes themselves, without actually retrieving records from the file. Although a number of innovative file organisations have been proposed to handle multi-attribute queries, they have limitations; for example, they fix an ordering for the attributes.

Comparison of File Organisations

We have studied many types of file organisations in this module. They create additional data structures for the file records in order to provide efficient access. Thus, each file organisation is associated with a storage cost and an access cost. Further, they use the direct access capabilities of the disk device to fine-tune their performance.

The file organisation is chosen based on the characteristics of the data and how they will be processed by the applications. The following data about the application is required for performance evaluation:
1. Volume of data (number of records for each record type).
2. Growth and volatility of data (e.g., the employee data may grow by 20 records and some records in this file may get updated in a year; deletions may be 5).
3. Pattern of usage of data by the applications; for each application, we should obtain:
frequency of execution
fields using which the records are accessed
fields which are retrieved/updated
hit ratio (number of records accessed in each execution)
sequence, if any, in which the records must be retrieved

Using the above data, we can determine the overall performance of a file organisation based on storage, access and update costs. In a DBMS environment, first a logical database design is made, which gives a set of

normalised tables, which may be modified for improving performance. This step is sometimes called 'de-normalisation'. The typical modifications include the following:
1. split a table vertically, so that attributes commonly required together (accessed/updated) are bundled together, in separate tables,
2. split a table horizontally, so that tuples are placed in different tables based on their usage by different applications (for example, an employee file may be split into two files: permanent and temporary employees),
3. merge two or more tables (by taking their natural join),
4. merge tuples from the same table by grouping them on a field,
5. introduce aggregate fields for ease of processing.

The above modifications must be made in view of the usage of data by the applications. There must be a good justification for every 'de-normalisation' action.

4.6 Summary

In this unit we have learnt about functional dependency, how to decompose relations, and normalisation and its steps in detail. Designing schemas and the different approaches to their design have been discussed. Different storage devices and file systems have been covered in detail.

4.7 Self Understanding

1. Discuss the advantages and shortcomings of various storage devices.
2. Discuss the functions of file systems.
3. Differentiate between sequential files, indexed sequential files, hashed files and indexed files.
4. What do you understand by functional dependency? Discuss the inference rules for functional dependencies.

5. What do you understand by decomposition of relations?
6. Explain normalisation in detail, with an example.
7. Derive the normal form for the case of a consultant dealing with a database of students/candidates.
8. Derive the normal form for the case of the manufacturing department in an industry.