A Review of Database Schemas

Introduction

The purpose of this note is to review the traditional set of schemas used in databases, particularly as regards how the conceptual schemas affect the design of the storage of relations. It is sometimes considered that each base relation must be stored as is, i.e. each logical tuple must be mapped directly to a corresponding physical record. This is not so, and some relational DBMS products already provide some facilities to improve on this. The results of the review lead to a rationalisation of the database schema architecture.

Physical Data Independence

The contents of a base relation (see footnote 1) are physically stored. However the whole purpose of physical data independence is that the data can be stored in any desirable way. It does not need to be stored as the physical equivalent of a table. Any file or data structure or combination thereof can be used and manipulated, so long as the stored data appears to the user as the logical abstraction of a relation.

Another consequence of physical data independence is that a base relation could be stored in terms of other relations. There are three possibilities :-

1. A relation is fragmented into several smaller relations, and the relational fragments are physically stored instead.
2. A relation is merged with other relations into a bigger relation, which is physically stored instead.
3. Some combination of the above two methods.

Whenever one needs the original relation, its contents are created from the contents of the relevant stored relations.

The first possible method is well developed for the design of distributed databases. A relation is fragmented horizontally or vertically. Horizontal fragmentation uses restrictions and/or semi-joins to split the original relation into several sets of tuples (i.e. several relations), such that they can be unioned together to re-create the original relation.
Vertical fragmentation uses projection to split the original relation into several sets of attributes (i.e. several relations), such that they can be naturally joined together to re-create the original relation. A fragment can itself be fragmented, either horizontally or vertically, so that the fragmentation can be continued recursively till fragments suitable for storage are obtained. There are design rules governing how the fragmentation should be specified, in order to ensure that the original relations can be re-created from the fragments and to ensure the fragments have desirable characteristics; these design rules are not further considered here.

Footnote 1: All the named relations in a database, whether they be base relations, views or stored relations (as defined later), are actually relational variables, since they permanently exist as relations and their values are allowed to vary during their lifetimes. However traditional terminology is still used here, even though it doesn't always make clear that the relations are variables.
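The fragmentation and re-creation steps described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the relation EMP, its attributes, and the helper functions are hypothetical, and relations are modelled simply as sets of attribute/value tuples rather than in any particular DBMS. It splits a relation horizontally by restriction, splits one of the resulting fragments vertically by projection (keeping the key in both pieces), and then re-creates the original by natural join followed by union.

```python
# A relation is a frozenset of tuples; each tuple is a frozenset of
# (attribute, value) pairs, so tuples are hashable and attribute order is irrelevant.

def relation(*rows):
    """Build a relation from dict rows."""
    return frozenset(frozenset(r.items()) for r in rows)

def restrict(rel, pred):
    """Restriction: keep the tuples for which pred (applied to a dict) holds."""
    return frozenset(t for t in rel if pred(dict(t)))

def project(rel, attrs):
    """Projection onto attrs; duplicates vanish because the result is a set."""
    return frozenset(frozenset((a, v) for a, v in t if a in attrs) for t in rel)

def natural_join(r, s):
    """Natural join: combine tuples that agree on every shared attribute."""
    out = set()
    for t in r:
        for u in s:
            dt, du = dict(t), dict(u)
            if all(dt[a] == du[a] for a in dt.keys() & du.keys()):
                out.add(frozenset({**dt, **du}.items()))
    return frozenset(out)

# Hypothetical base relation; "id" is its key.
EMP = relation(
    {"id": 1, "name": "Ann", "site": "Leeds", "salary": 30},
    {"id": 2, "name": "Bob", "site": "York", "salary": 25},
    {"id": 3, "name": "Cy", "site": "Leeds", "salary": 40},
)

# Horizontal fragmentation: two restrictions that partition the tuples.
leeds = restrict(EMP, lambda t: t["site"] == "Leeds")
other = restrict(EMP, lambda t: t["site"] != "Leeds")

# Recursive step: the Leeds fragment is split vertically, key kept in both pieces.
leeds_public = project(leeds, {"id", "name", "site"})
leeds_pay = project(leeds, {"id", "salary"})

# Re-creation: join the vertical pieces, then union the horizontal ones.
rebuilt = natural_join(leeds_public, leeds_pay) | other
assert rebuilt == EMP
```

The round trip only succeeds because the restrictions partition the tuples and both vertical pieces retain the key; these are the kind of conditions that the design rules mentioned above are there to guarantee.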

A variation of the first method is also used for single-site databases by some DBMSs - e.g. Oracle and RdB - in that relations can be split horizontally into sub-relations that are separately stored. However the splitting is not done via algebra or its SQL equivalent, but by syntax which is especially developed as part of the SQL storage control facilities. This is in conformance with the principles of physical data independence, in that it makes the fragmentation a purely storage issue and hides it from the SQL query user.

With the second method, relations can be merged together horizontally or vertically. A horizontal merger is derived by unioning together compatible relations, such that the original relations can be re-created via restrictions on the merged result. A vertical merger is derived by a natural join of appropriate relations, such that the original relations can be re-created via projections of the merger. Mergers can themselves be merged together, either horizontally or vertically, so that the merging can be continued recursively till a merger suitable for storage is obtained. Design rules corresponding to the fragmentation rules apply here too.

Traditionally larger mergers are not considered so attractive for storage, as in general it is harder to access a desired data item from a larger set of data than it is from a smaller set. Nevertheless vertical mergers can be beneficial. If a query involves joining relations, and if the relation that is the result of the join were to be physically stored instead by means of vertical mergers using the same join properties as in the query, it would remove the need for the physical join operation to answer the query; and joins are potentially time-consuming. Despite this, only the Caché DBMS uses vertical mergers; it stores the tables representing hypercubes as a single joined table. (It also uses a system of coding the data to compress table sizes.) Some DBMSs, e.g. Oracle, do allow clustering, whereby data from two relations to be joined is stored so that the data of tuples that would be merged in the result is held in the same physical block. This is a halfway house that uses just physical design ideas rather than any logical design ideas.

The de-normalisation of some Conceptual schema designs often merges together relations in the original Conceptual schema. The motivation is usually to improve query performance by merging relations that have to be joined together in commonly occurring queries. However de-normalisation can lead to confusion and maintenance problems in the long run because the design of the Conceptual schema no longer reflects the inherent nature and structure of the data in the database. Queries and performance must inevitably be based on the inherent nature and structure of the data, since nothing else makes sense.

However, there have been research efforts to store vertical mergers in order to improve performance. For example, Schkolnik and Sorenson have investigated storing denormalised relations as materialised joins in addition to the normalised relations (see [5]). Scholl, Paul, and Schek have investigated storing nested relations that are the logical equivalent of joins (see [6]). Thus it would be helpful if database designers could separate out the logical relations that represent the inherent nature and structure of the data from the logical relations needed to optimise performance, particularly as performance optima can change much more frequently than the inherent nature and structure of the data.

The author has not come across any instances of the third method, combining both merging and fragmentation, but logically it is possible. If one is alive to the possibility, then opportunities for its use may emerge. (If merging is never formally used, then naturally it cannot be combined with fragmentation.)

It becomes convenient to distinguish between base relations and stored relations. A stored relation is one whose data contents are physically stored in files such that there is always a direct link between the stored relation and its physical storage, even if the physical storage is changed (possibly radically and/or frequently) to optimise physical performance. A base relation is also one whose data contents are physically stored, but the storage may be direct - i.e. the base relation is also a stored relation - or it may be indirect in that it may be mapped onto other relation(s), possibly recursively, culminating in stored relation(s).

Note that from a technical perspective, updating underlying stored relation(s) in order to update a base relation is in principle no different to updating underlying base relation(s) in order to update a view. Consider the relations :

R1 (A, B, C)
R2 (A, D, E)
REL (A, B, C, D, E)

where A denotes an attribute that is a candidate key. Furthermore, suppose :

REL = R1 Join[ A ] R2
R1 = REL Project[ A, B, C ]
R2 = REL Project[ A, D, E ]
{ R1, R2 } ⇔ REL

Then is it possible to deduce unequivocally that R1 and R2 are two views derived from the base relation REL by projection, or that REL is a stored relation created by a natural join of the two base relations R1 and R2 and used to store their data? The answer is that it is not, because the two cases are logically equivalent.
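The equivalence of the two readings can be checked on concrete data. The following Python sketch uses the relation names R1, R2 and REL from the text; the sample tuples are invented, and the sketch assumes that every value of A occurring in R1 also occurs in R2 and vice versa (without that assumption the join would lose tuples and the projections would not return the originals).

```python
# R1(A, B, C) and R2(A, D, E) as sets of plain tuples; A is a candidate key of both.
# The sample values are hypothetical.
R1 = {(1, "b1", "c1"), (2, "b2", "c2")}
R2 = {(1, "d1", "e1"), (2, "d2", "e2")}

def join_on_a(r1, r2):
    """R1 Join[A] R2, producing tuples in the attribute order (A, B, C, D, E)."""
    return {(a, b, c, d, e) for (a, b, c) in r1 for (a2, d, e) in r2 if a == a2}

def project_abc(rel):
    """REL Project[A, B, C]."""
    return {(a, b, c) for (a, b, c, d, e) in rel}

def project_ade(rel):
    """REL Project[A, D, E]."""
    return {(a, d, e) for (a, b, c, d, e) in rel}

REL = join_on_a(R1, R2)

# Reading 1: R1 and R2 are views obtained from REL by projection.
assert project_abc(REL) == R1
assert project_ade(REL) == R2

# Reading 2: REL is a stored merger obtained from R1 and R2 by natural join.
assert join_on_a(project_abc(REL), project_ade(REL)) == REL
```

Both sets of assertions hold for the same data, so nothing in the relation values themselves indicates which relations are "really" the base ones.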
In fact it is generally true that when relation(s) are derived from other relation(s) via algebraic operations, the derivation or mapping between them is independent of whether view(s) are being derived from base relation(s) or base relation(s) are being derived from stored relation(s). The differences between the two situations are that :

1. they have different purposes;
2. different approaches are used to defining views and stored relations, since both are derived from base relations but whereas base relations underlie views, stored relations underlie base relations;
3. stored relations should obey design criteria which views are not constrained to obey.

However none of these affect the logical nature of the mappings.

Date and McGoveran have developed a method by which all logically-updateable views may be updated by updating the relations from which the views are derived : see ref. [3]. This method is very general and wide-ranging and so does not have the limitations of conventional ad hoc view updating methods. Consequently the same method can be used for updating base relations whose data is held in stored relations. Ref. [4] explicitly refers to this.

In fact Date and McGoveran's method is far more general than the horizontal and vertical partitioning described above. The only constraint is that updates must be logically possible. An example of where an update is impossible is an attempt to insert a tuple into a view formed by a GroupBy operation where the view includes an attribute formed by aggregation. The insert is impossible since many different sets of underlying tuples could produce the same aggregated tuple in the view, and there is no general way to determine which of them should be inserted into the underlying relation. As the fragmentation and merging methods do not incur this limitation, it does not cause a problem for them.

A final bonus from distinguishing between stored and base relations is that a query optimiser that uses algebraic transformations can use these transformations in going from base to stored relations. Thus a greater part of the optimisation process can be done via algebraic transformations. (This was a major motivation for Scholl, Paul, and Schek's work.)
The extra optimisation opportunity costs nothing since it merely uses the existing algebraic transformation part of the optimiser. Other things being equal, the part of the optimiser that deals with physical storage can now be simplified as the physical storage options can now be simplified without loss of optimisation.

Database Schemas

The fact that fragment relations and merged relations are not intended to be seen by the users, nor reflect the inherent nature or structure of the data, suggests that they be viewed in terms of another schema. In order to pursue this idea, first consider existing database architectures.

The traditional database schema architecture, based on the ANSI/SPARC standard, is the 3-layer one :

1. External (or Sub-) schemas.
2. Conceptual (or Logical) schema.
3. Internal (or Physical) schema.

Where fragment relations have been employed in distributed databases, the traditional schema architecture has been expanded (for example, see ref. [1]) to encompass the following :

1. Global External (or Sub-) schemas.
2. Global Conceptual (or Logical) schema.
3. (Global) Fragmentation schema.
4. (Global) Allocation (or Replication) schema.
5. Local Conceptual (or Logical) schemas.
6. Local Internal (or Physical) schemas.

Comparing the traditional and distributed cases, schemas 1 and 2 are the same in both of the cases. The distributed schema 6 is the traditional schema 3 at each node. The distributed schema 5 is the equivalent of the traditional schema 2 at each node. The distributed schemas 3 and 4 are new.

The Fragmentation schema is the schema wherein is specified what fragments are to be used to store the data of base relations, and how the base relations relate to their fragments. Thus it is proposed that :

- the Fragmentation schema become generalised and allow both fragmentation and merging;
- it is called the Storage schema to reflect the change;
- it is used by all databases, whether single-site or distributed.

Since some large single-site databases may also wish to store multiple copies of some relations, in order to improve resilience and/or query performance, one may as well also keep the Allocation schema as standard for all databases. In which case, there would be several Local Conceptual schemas, one for each separately stored part of the database, where typically each part is stored on a separate storage device. This could help to manage the stored data.

Where a single-site database has no multiple copies of relations, the Allocation schema would trivially be a one-to-one mapping of the Storage schema to the one and only Local Conceptual schema; i.e. the Storage schema would be identical to the Local Conceptual schema. Consequently one would expect the DBMS to automatically optimise this situation so that it caused no additional overhead. Even in this situation, if the database is large and complicated, it might make the database easier to manage if there were several Local Conceptual schemas and the Allocation schema mapped different portions of the Storage schema to each Local Conceptual schema.
These Local Conceptual schemas would each be a physically coherent portion of the database that was best managed as a whole.

From a design point of view, adding the Storage schema provides a pleasing symmetry for the top three schema levels of the database. The central schema of the three, the Conceptual schema, consists of all the base relations in the database. These should be designed to the best possible logical design standards to reflect the inherent structure and characteristics of the data, i.e. be a canonical model of the real-world data. They would be designed using normalisation and without regard to application or performance considerations. The External schemas in the layer above would be derived from it, and each would reflect what was best for the application(s) that it supported. The Storage schema in the layer below would also be derived from it, but would reflect what was most desirable from a performance point of view.

It is conjectured that the lack of the Storage schema in the past has encouraged database designers to incorporate the physical design possibilities for each relation in the Conceptual schema, rather than separate out the logical and physical aspects of the relational design. It is also conjectured that this lack has led to the dilemma of de-normalisation : should Conceptual schema relations be de-normalised for performance or maintained as a valid logical design to support the long term maintenance of the database? However the ability to fragment and/or merge relations allows for a lot more possibilities for designing stored relations than merely de-normalising. The introduction of the Storage schema could remove all such impediments to thought.

As the horizontal fragmentation storage options of many DBMSs (e.g. Oracle and RdB) indicate, there is a perceived need to create storage fragments. Creating the new schema allows this to be done openly, with the full power of relational algebra, or its SQL equivalent, instead of having to re-invent parts of the algebra/SQL under cover of the storage options; and yet by being in a separate schema, it maintains physical data independence, which is so important. It also gives free rein to consider other storage designs that go beyond just simple horizontal fragmentation.

It is also possible to surmise that the lack of the Storage schema is in turn due to the lack of a general-purpose view updating mechanism, such as that proposed by Date and McGoveran. It is obvious that a general-purpose Storage schema is infeasible without such a mechanism. Yet hitherto no such mechanism has been implemented in any DBMS. View updating is limited and ad hoc in SQL. In turn this is due to the nature of SQL, which either does not implement relational principles fully or implements them anomalously.

Were Date and McGoveran's view updating mechanism to be found to be unsound or flawed, then it could not be used for even a pure relational database. Either another mechanism (if this were logically possible) or a workaround would be needed, at least for the algebraic operators involved in horizontal and vertical fragmentation. Otherwise there would still be no support for the separation of the logical and physical design concerns, and for the distributed case no support for the geographical design concerns.
Appraisal of the Schemas

Consideration of the various schemas reveals that schemas of relations fall into two categories :

1. sets of relations,
2. sets of mappings between relations.

Let us consider each category of schemas in turn.

Schemas that are Sets of Relations

The schemas containing sets of relations are the :-

- Subschemas,
- the Conceptual schema,
- the Storage schema,
- local Storage schemas.

Note that the use of external and internal as schema names has been abandoned for simplicity, as has the use of the adjective Global. These schemas represent a four-layered architecture. Hence they can be portrayed as follows :-

Subschemas :-
Conceptual Schema :-
Storage Schema :-
Local Storage Schemas :-

View relations appear only in Subschemas, although Subschemas may also include base relations. The Conceptual schema contains only base relations. The Storage schema and the Local Storage schemas contain only stored relations, any of which may also be a base relation.

It would be possible to have a Subschema that contains all the Conceptual schema's base relations plus some views. (The terminology of subschema here - with its normal meaning of subset of the Conceptual schema - would still be appropriate, as a subset does not have to be a proper subset; it can be equal to the other set. Even the additional views are not holding additional information, merely additionally displaying existing information in more convenient ways.)

To clarify this architecture, consider a small example that just illustrates the architecture's top three layers. Let

V = View
B = Base relation
S = Stored relation

The following represents a small database :-

Subschemas :- V1 B1 B3 V2 B2 V3
Conceptual Schema :- B1 B2 B3 B4
Storage Schema :- S1 S2 S3 B3 B4

This shows that some of the base relations appear in Subschemas and the Storage schema as well as (of course) in the Conceptual schema. If these schemas were to be portrayed using a Venn diagram (since they are all sets of relations), they would be shown as :-

[Venn diagram: the Subschemas, the Conceptual schema and the Storage schema as overlapping sets over the relations V1, V2, V3, B1, B2, B3, B4, S1, S2, S3]

This example raises the following question. When a view is created, in what schema is it immediately held? One does not want the inconvenience of having to assign it to a Subschema as part of the view creation operation. The proposal here is that there is a system Subschema (called Views) to which any view is immediately and automatically assigned. Views can then be moved or copied from there to other Subschemas as and when desired.

Still this architecture, because it refers only to relations, omits the actual physical storage structures that hold the relational data, the files, indexes, etc. A Local Physical schema is needed for each Local Storage schema, where the Local Physical schema is the set of all physical objects - files, indexes, etc. - used to actually store the data of the Local Storage schema's relations. Thus another architectural layer, consisting of the Local Physical schemas, should be added on to the bottom of the architecture as follows :-

Subschemas :-
Conceptual Schema :-
Storage Schema :-
Local Storage Schemas :-
Local Physical Schemas :-

Note that the above architecture assumes a distributed database, or a centralised database where it is useful to have local schemas for each of a set of storage devices attached to the single computer. In the case of a simple centralised database where the Storage schema is not divided up between several local Storage schemas, the architecture could be simplified to :-

Subschemas :-
Conceptual Schema :-
Storage Schema :-
Physical Schema :-

These architectures give rise to some identities that are always true, and which could be useful in managing the database :-

Dist[ Union ] Set-of-Subschemas-except-Views Diff Conceptual-schema ⊆ Views-Subschema
(N.B. Dist[ Union ] means Distributed Union; see footnote 2.)
Subschema-X Diff Conceptual-schema = Set-of-Views-in-X
Conceptual-schema Diff Subschema-X = Set-of-Base-Relations-not-in-X
Conceptual-schema Diff Storage-schema = Set-of-Base-Relations-not-directly-stored
Storage-schema Diff Conceptual-schema = Set-of-all-Storage-only-Relations

Footnote 2: These identities assume a mathematical notation. The operators (including the comparators) could be implemented in a relational algebra language, e.g. RAQUEL.

The schema architecture for a distributed database still only has 5 layers compared to the usual 6. This is because the Allocation schema, which is a set of mappings between relations, is missing.

Schemas that are Sets of Mappings

Adding the Allocation schema yields the following architecture :-

Subschemas :-
Conceptual Schema :-
Storage Schema :-
Allocation Schema :-
Local Storage Schemas :-
Local Physical Schemas :-
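The identities over the set schemas stated above can be checked mechanically once each schema is held as a set of relation names. The Python sketch below uses the small example database given earlier (V1 to V3, B1 to B4, S1 to S3). Two things in it are assumptions rather than part of the original: the grouping of the example's relations into two user subschemas named X and Y, and the treatment of the first identity as a containment test, which presumes that views copied into user subschemas also remain in the Views subschema.

```python
# Schemas as sets of relation names, taken from the small example database.
views_subschema = {"V1", "V2", "V3"}      # system subschema holding every view
subschemas = {                             # hypothetical grouping into user subschemas
    "X": {"V1", "B1", "B3"},
    "Y": {"V2", "B2", "V3"},
}
conceptual = {"B1", "B2", "B3", "B4"}
storage = {"S1", "S2", "S3", "B3", "B4"}

def dist_union(sets):
    """Distributed union: the union of an arbitrary collection of sets."""
    return set().union(*sets)

all_user = dist_union(subschemas.values())

# Views appearing in user subschemas are all known to the Views subschema.
assert all_user - conceptual <= views_subschema

# Views in X, and base relations not in X.
assert subschemas["X"] - conceptual == {"V1"}
assert conceptual - subschemas["X"] == {"B2", "B4"}

# Base relations that are not directly stored, and storage-only relations.
assert conceptual - storage == {"B1", "B2"}
assert storage - conceptual == {"S1", "S2", "S3"}
```

A data dictionary that exposed the schemas as sets in this way would let a designer run such checks as routine consistency tests.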

It is readily seen that besides the Allocation schema, other sets of mappings between the layers of the architecture must exist in order to link adjacent layers, and that the database designer needs to be aware of and use these mappings. Two other mapping schemas are :-

1. the View schema (the mappings that define the views in terms of other relations);
2. the Equivalence schema (the mappings that define base relations in terms of stored relations).

They are like the Allocation schema in that they are mappings between relations, but they differ in that they arise automatically from the definitions of views and fragments/mergers, whereas the Allocation schema mappings must be entered manually because their choice is part of the design of the database.

Displayed pictorially, a small View schema might look as follows :-

[Diagram: views V1 and V2 mapped onto base relations B1, B2 and B3]

Three mappings are used to define two views, because one view is defined in terms of two relations but the other only in terms of one relation. The mappings correspond to the definition of the views. Naturally the actual algebraic definitions of the views, or SQL equivalents, also need to be stored in the schema.

Similarly a small Equivalence schema might look like :-

[Diagram: base relations B1, B2 and B3 mapped onto stored relations S1, S2 and S3]

Here one base relation is stored in two (fragment) relations, while two other base relations are stored in one (merged) relation. Again the mappings correspond to the definitions, whose algebraic/SQL expression needs to be stored.

Another set of mappings is from relations to their physical storage arrangements. These are called Local Conversion schemas. Their purpose is to define how each stored relation's data is actually stored in terms of the physical objects. There will need to be a Local Conversion schema to map between each Local Storage schema and its corresponding Local Physical schema. Because Local Conversion schemas do not deal purely with the relational model, they could be handled differently by different DBMSs. In particular, it could be that Local Physical schemas could be derived automatically from Local Conversion schemas.
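A mapping schema of the kind just described can be sketched as a set of ordered pairs. In the Python fragment below each pair runs from the relation being defined to a relation it is defined in terms of; that direction, the particular pairings chosen for V1, V2, B1 to B3 and S1 to S3, and the helper names dom, ran and image are all assumptions made for illustration, since the original diagrams are not reproduced here.

```python
# View schema: three mappings defining two views
# (one view over two base relations, the other over one).
view_schema = {("V1", "B1"), ("V1", "B2"), ("V2", "B3")}

# Equivalence schema: B1 is fragmented into S1 and S2;
# B2 and B3 are merged into the single stored relation S3.
equivalence_schema = {("B1", "S1"), ("B1", "S2"), ("B2", "S3"), ("B3", "S3")}

def dom(mapping):
    """Domain: the relations that the mapping defines."""
    return {x for x, _ in mapping}

def ran(mapping):
    """Range: the relations that the definitions draw on."""
    return {y for _, y in mapping}

def image(mapping, x):
    """The relations that x is defined in terms of."""
    return {y for a, y in mapping if a == x}

assert dom(view_schema) == {"V1", "V2"}
assert ran(view_schema) == {"B1", "B2", "B3"}

# A fragmented base relation maps to several stored relations ...
assert image(equivalence_schema, "B1") == {"S1", "S2"}
# ... while merged base relations share a single stored relation.
assert image(equivalence_schema, "B2") == image(equivalence_schema, "B3") == {"S3"}
```

In a real schema each pair would also carry, or point to, the algebraic or SQL definition from which it arises; the pairs alone record only which relations depend on which.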

Thus there are 4 kinds of mapping schema. If each kind is made a layer of the architecture in the same way that the Allocation schema is, then one ends up with a 9-layer architecture. It can be portrayed as follows :-

Subschemas :-
View Schema :-
Conceptual Schema :-
Equivalence Schema :-
Storage Schema :-
Allocation Schema :-
Local Storage Schemas :-
Local Conversion Schemas :-
Local Physical Schemas :-

Again, a simple centralised database could have an architecture that was simplified in the obvious way.
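The layering can be written down as data and its regularity checked. The Python sketch below lists the nine layers in order, tags each as a set schema or a mapping schema, and confirms that the two kinds alternate. The centralised variant is one plausible reading of "simplified in the obvious way": it assumes that the Allocation schema and the Local Storage schemas drop out and that the remaining local layers become a single Conversion schema and Physical schema; the name "Conversion Schema" is hypothetical.

```python
from enum import Enum

class Kind(Enum):
    SET = "set schema"
    MAPPING = "mapping schema"

# The 9-layer architecture, top to bottom.
DISTRIBUTED = [
    ("Subschemas", Kind.SET),
    ("View Schema", Kind.MAPPING),
    ("Conceptual Schema", Kind.SET),
    ("Equivalence Schema", Kind.MAPPING),
    ("Storage Schema", Kind.SET),
    ("Allocation Schema", Kind.MAPPING),
    ("Local Storage Schemas", Kind.SET),
    ("Local Conversion Schemas", Kind.MAPPING),
    ("Local Physical Schemas", Kind.SET),
]

# Assumed simplification for a simple centralised database.
DROPPED = {"Allocation Schema", "Local Storage Schemas"}
RENAMED = {
    "Local Conversion Schemas": "Conversion Schema",   # hypothetical name
    "Local Physical Schemas": "Physical Schema",
}
CENTRALISED = [
    (RENAMED.get(name, name), kind)
    for name, kind in DISTRIBUTED
    if name not in DROPPED
]

def alternates(layers):
    """True if no two adjacent layers are of the same kind."""
    return all(a[1] is not b[1] for a, b in zip(layers, layers[1:]))

def count(layers, kind):
    """Number of layers of the given kind."""
    return sum(1 for _, k in layers if k is kind)

assert alternates(DISTRIBUTED) and alternates(CENTRALISED)
assert (count(DISTRIBUTED, Kind.SET), count(DISTRIBUTED, Kind.MAPPING)) == (5, 4)
assert (count(CENTRALISED, Kind.SET), count(CENTRALISED, Kind.MAPPING)) == (4, 3)
```

Under these assumptions every mapping schema sits between two set schemas, in both the distributed and the centralised form.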

Although 9 layers may seem excessively complex, note that in reality they are all there anyway. Nothing new has been introduced. The only change is to make explicit what was previously implicit. Together the 9 layers provide a comprehensive conceptual structure for the database so that its designers can better envisage what they need to design, and better monitor the progress of their design; it also supports the amendment of the design of current databases when evolutionary changes need to be made. All 9 components are necessary. However a proper DBMS should be able to automate the production of much of the schemas in ways already indicated, thereby giving the database designer the maximum support with the minimum of effort. A proper DBMS should also provide the tools - via its data dictionary and/or user friendly commands - to help the designer use all the schemas effectively.

There are some further identities that are always true, and which could be useful in managing the database :-

Dist[ Union ] Set-of-Subschemas-except-Views ⊆ Dom[ View ] Union Conceptual-schema
(N.B. Dom[ View ] is the domain of the View schema mapping.)
Conceptual-schema Diff Dom[ Equivalence ] Union Storage-schema = Set-of-all-Stored-Relations
Ran[ Allocation ] = Dist[ Union ] Local-Storage-schemas
(N.B. Ran[ Allocation ] is the range of the Allocation schema mapping.)

All the identities take advantage of the fact that the set and mapping schemas conform to the principles of traditional mathematical sets and relations respectively; i.e. a set schema is a set of items (in this case a set of database relations) and a mapping schema is a set of mappings (in this case between database relations).

Conclusion

The above is an attempt to provide a rational framework in which the study of the design of relational databases could be carried out.
It tries to rise above the peculiarities of any individual relational DBMS so that it can provide a basis not only for a range of current relational DBMSs (so that they can be compared and one can easily transfer from working with one to working with another) but also for future developments in relational DBMSs (so that new developments can be evaluated using rational criteria). The framework is based on a rationalisation of the current standard approaches to centralised and distributed databases, and can be used for both.

The proposals can be summarised as :-

1. The standard database schema architecture should be expanded to include a Storage schema. This will facilitate the design of the best storage mechanisms while encouraging the Conceptual schema to be a design based purely on the inherent nature and structure of the data. This overcomes the dilemma which many database designers face as to which of these two choices to go for.

2. To achieve this, stored relations should be differentiated from base relations. The latter may be stored indirectly via mappings to stored relations, while the former are always stored directly.

3. Date and McGoveran's method for view updating can and should also be applied for the updating of base relations whose data are held indirectly in stored relations. If this method is inadequate, then some logically equivalent means must be found to achieve the same end.

4. Any reasonable relational DBMS has an optimiser which uses the technique of algebraic transformations as part of its optimisation strategy. Such an optimiser will be able to use the storage definitions of base relations for additional optimisation. Furthermore, it will not require an extension of the optimiser to accomplish this - the extra optimisation comes for free - since the extra optimisation can be done by algebraic transformations.

5. It may be useful to make an Allocation schema a standard part of the database architecture, since centralised databases may also wish to have more than one copy of some data. However this does imply that to avoid being a burden in the simple centralised case, one-to-one mappings should appear as the default, and the DBMS should automatically note one-to-one mappings and not create any overhead for them.

6. It is worthwhile to rationalise the schema architecture, which should be built from two kinds of schema : schemas that are sets of relations and schemas that are sets of mappings between relations. A mapping schema forms a layer of the architecture that appears between two architectural layers formed from sets of relations. The mappings show how the relations above and below in the architecture relate to each other. Together they provide much better support for designing and maintaining the database.

7. A distributed database would have 5 set schemas with 4 mapping schemas sandwiched between them, while a simple centralised database could reduce this to 4 set schemas with 3 mapping schemas.

8. A DBMS should support the database designer by automating as much as possible of schema creation and by providing easy ways to inspect the contents of schemas. It should also be able to exploit a variety of identities between schemas.

References

[1] Distributed Databases : Principles and Systems. S. Ceri & G. Pelagatti (McGraw-Hill Computer Science Series, 1985).
[2] Private communication from Hugh Darwen, 1999.
[3] An Introduction to Database Systems. C. J. Date (Addison-Wesley, 2000), ch. 9, section 4.
[4] An Introduction to Database Systems. C. J. Date (Addison-Wesley, 2000), ch. 23, page 698. (David McGoveran was the original author of this chapter.)
[5] The Effects of Denormalisation on Database Performance. M. Schkolnik & P. Sorenson (Res. Rep. RJ3082(38128), IBM Research Lab, San Jose, 1981).

[6] Supporting Flat Relations by a Nested Relational Kernel. M. H. Scholl, H. B. Paul, & H. J. Schek (Proceedings of 13th VLDB Conference, Brighton, 1987).