
1 Physical Database Design for an Object-Oriented Database System

Marc H. Scholl
University of Ulm, Department of Computer Science
Oberer Eselsberg, D-7900 Ulm, Germany

In: J.-C. Freytag, G. Vossen, D.E. Maier (eds.), Query Processing for Advanced Database Applications, Morgan Kaufmann, 1992, to appear

Abstract

Object-oriented database systems typically offer a variety of structuring capabilities to model complex objects. This flexibility, together with type (or class) hierarchies and computed "attributes" (methods), poses a high demand on the physical design of object-oriented databases. Similar to traditional databases, it is hardly ever true that the conceptual structure of the database is also a good, that is, efficient, internal one. Rather, data representing the conceptual objects may be structured completely differently, for performance reasons. Database systems providing a reasonable amount of data independence allow a physical design that differs significantly from the logical structure. Hence, the performance of the system can be tailored to the overall transaction load faced. The paper presents choices for physical designs that make use of a complex storage model, an extended nested relational model. A first prototype of a physical design optimizer is also presented.

1.1 Introduction

Object-oriented database management systems (OODBMSs) typically offer a variety of structuring capabilities to model complex objects: objects may be hierarchically composed of subobjects, several objects may share common subobjects, objects may appear as (attribute) values of other objects, and different objects can be related to each other by functions, methods, or relationships. Type (or class) hierarchies introduce another dimension of object interrelation: an object of one class also "appears" in all its superclasses; with multiple inheritance, this need not be a strict hierarchical inclusion. Computed values (attributes, methods) may be used to derive, rather than store, data that are associated with objects. Obviously, it is not at all trivial to find good, that is, efficient, storage structures that support the variety of operations on objects reasonably well.
Two of the standard approaches to implementing such object-oriented data models are to either (i) map everything to an underlying relational database system (RDBMS), or (ii) implement an advanced storage server that offers more complex structures than flat relations. The first approach offers the advantage that one can build on established, mature technology and that, because of standards, the solution appears to be portable: it is not necessarily tied to one particular RDBMS.

On the other hand, the typical disadvantage of such "front-end" solutions is that, without being able to tune the RDBMS internally to the application, good performance is unlikely, since the complex structures of the object-oriented database schema have to be broken into small pieces in order to be stored in (flat) relations. As a consequence, queries to the object-oriented schema have to be mapped into large join queries against the relational database. Of course, one might improve on the state of the art in commercial RDBMSs by including advanced access support, such as join indices, link fields, or materialized functions, in order to make the relational implementation more feasible. We discuss such extensions under the second approach in a more general context.

The second approach has the obvious disadvantage that one has to implement a new storage manager with more powerful capabilities for complex structured data, which requires a major effort in terms of design and implementation. On the other hand, the potential benefit of such an endeavor is superior performance due to a more flexible physical database organization that allows for more efficient query processing algorithms. Over the last decade, there have been numerous attempts to come up with new DBMS architectures based on advanced storage managers (see [Carey et al., 1986; Batory, 1987; Haas et al., 1990; Haerder et al., 1987] for some examples). The DASDBS project (see footnote 1) is one of these attempts [Paul et al., 1987; Schek et al., 1990]; its storage manager implements nested relations, that is, a hierarchical data structure where attribute values can either be atomic or embedded (sub-)relations. The idea is that nested relations serve as a high-level, abstract description of internal storage structures. It was shown in [Scholl et al., 1987] that nested relations can in fact be used to model all schema-driven clustering strategies, that is, all storage schemes that are described by static information.
For example, a physical design that stores all employee records adjacent to "their" department record is a schema-driven strategy (which is naturally represented by a nested department relation with an employee subrelation). In contrast, a physical design that locates employee records either with "their" department record or with "their" manager record, depending on which of these related records was "current" at the time of the creation of the employee record, represents a dynamic clustering strategy, which is partially, but not completely, schema-driven. An example of a fully dynamic, schema-independent clustering strategy is "always append at the end of the database". Schema-driven clustering techniques are both practically important (they resemble but largely extend the current state of the art) and theoretically interesting (they give rise to powerful algebraic query optimizations). The latter was shown in [Scholl et al., 1987; Scholl, 1986] in a context where (flat) relational schemas were internally represented as nested relations. Both the transformations of the structures and those of the operations can be expressed in a nested relational algebra, so query optimization can mostly operate on an algebraic level.

Footnote 1: Darmstadt Database Kernel System.

In this paper, we discuss the problem of physical database design, that is, given a logical DB schema and a transaction load, we want to determine the internal DB layout with the least overall cost of transaction execution, in the context of DASDBS as the storage manager. All relational database designs are therefore subsumed by this approach, while more flexibility is introduced by taking hierarchical clustering strategies into account. For example, tuples of two "related" tables might as well be stored together in one nested relational tuple, containing for each tuple of the first table all the "matching" tuples from the second as a "subrelation". The intuition behind such an organization is that the larger nested tuple is stored consecutively (that is, together with its subtuples) in one page, or in as few pages as possible, on disk (see [Paul et al., 1987; Schek et al., 1990; Deppisch et al., 1987] for details). Other options to accelerate the execution of "implicit" or "functional" joins include "link fields", that is, references (in the form of object identifiers (OIDs) or addresses) to objects, which may be stored together with the referencing object tuple or separately from it. We will show that, with the extended nested relational interface of the DASDBS storage manager (the extension consists in the availability of physical tuple addresses), we can express a wide variety of physical design alternatives.
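The idea of storing the tuples of two related tables together in one nested tuple can be sketched in a few lines of Python. This is only an illustration at the level of in-memory dictionaries, not the DASDBS interface; the function name `nest` and the sample department and employee data are assumptions made for the example.

```python
from collections import defaultdict

def nest(parents, children, key, sub_name):
    """Return one nested tuple per parent, embedding all matching children.

    `key` is the attribute shared by both tables (hypothetical example of an
    equality predicate); `sub_name` names the resulting subrelation.
    """
    groups = defaultdict(list)
    for child in children:
        # The join attribute is implied by the nesting, so drop it from the subtuple.
        sub = {k: v for k, v in child.items() if k != key}
        groups[child[key]].append(sub)
    return [{**p, sub_name: groups.get(p[key], [])} for p in parents]

# Hypothetical flat tables.
depts = [{"dno": 1, "dname": "R&D"}, {"dno": 2, "dname": "Sales"}]
empls = [
    {"eno": 10, "ename": "Ann", "dno": 1},
    {"eno": 11, "ename": "Bob", "dno": 1},
    {"eno": 12, "ename": "Cy", "dno": 2},
]

nested_depts = nest(depts, empls, "dno", "Empl")
```

Each element of `nested_depts` then carries its employees as a subrelation, which is the unit a storage manager of this kind would try to place on as few pages as possible.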
Furthermore, the high-level (i.e., relational) description of these choices allows the query optimizer to apply algebraic transformations in order to exploit these storage structures when mapping logical-level query expressions to the physical level.

The placement trees of O2 [Benzaken, 1990] pursue a similar purpose: objects that are related via super- and subobject relationships or via methods can be clustered hierarchically by defining appropriate placement trees. A placement tree is a hierarchical structure whose nodes are O2-classes. Upon generation of a new object of some class, the system searches for placement trees that contain the object's class and places the new object on the same page as its parent object, if such a placement tree is found and if the new object is related to an instance of the parent class in that tree. Multiple placement trees may contain the same class, in which case the algorithm for determining the storage location of a newly generated object becomes more complex. In contrast to our approach, placement trees are more of a dynamic clustering strategy, because the clustering expressed there is not guaranteed; rather, it is used as a "guideline". In our approach, the definition of a certain clustering strategy (in terms of a nested relational storage relation) precisely determines where object tuples will be stored. While being less flexible on the one hand, our approach has the advantage that the query processor can rely on the information represented in the storage hierarchies, whereas the O2 optimizer makes no use of this information (yet?). Rather, O2 expects performance gains to result from increased hit ratios in the buffer pool.

The paper is structured as follows: Section 1.2 describes our notation for the object model as well as the nested relational description of storage clusters. Section 1.3 presents the alternatives that are offered when mapping object schemas to DASDBS, and discusses their pros and cons in terms of which operations benefit and which incur extra costs. In Section 1.4 we describe a first physical database design tool that we have implemented to select a good (ideally the best) database layout for a given object schema and load description. Some remarks about query optimization are given in Section 1.6 together with a summary and outline of future work.

1.2 Notation and Terminology

Before entering the technical exposition, we introduce the notation and terminology used throughout this paper.
We first describe the COCOON object model used at the logical level for the schema description; then we set up the framework for the physical level, the extended nested relations available at the interface of the DASDBS storage manager.

1.2.1 The COCOON Object Model

There is no universal consensus on a specific object-oriented data model (OODM); however, many of the features found in the various proposals seem to approach a mature state, for example, support for complex structures, including shared subobjects, and some form of inheritance hierarchies. Like many others, we have contributed to the field by proposing one such model, called COCOON [Scholl and Schek, 1990b; Scholl and Schek, 1990a]. In this paper, we do not depend heavily on the particular flavor of the OODM used for the logical database schema, so we use our notation and terminology mostly because we are most familiar with this model. For the reader, however, it should be straightforward to translate into his or her model of choice.

COCOON is a so-called "object-function" model, as is IRIS [Wilkinson et al., 1990], for example. This means that objects are pure abstractions, in the sense of the well-known abstract data type (ADT) approach. In particular, none of the descriptive information "associated with" an object is considered to be "part of" the object in any sense. Rather, functions (or methods) are used as the uniform abstraction of stored fields, computed attributes, and relationships. In an even more general interpretation, "functions" can also be taken as an abstraction of retrieval and update methods, that is, the ADT-specific operators. Throughout the rest of the paper, we do not consider functions with side-effects, that is, update methods. Therefore, the term "function" refers to type-specific operators without side-effects. Intentionally, we do not distinguish between stored and computed (derived) functions here; we consider it a higher level of data independence to hide this distinction from the logical database schema. In order to express general relationships, functions may be set-valued. Furthermore, two functions may be defined as being inverses of each other.

In terms of other OODMs that do distinguish between (stored) attributes or instance variables on the one hand and methods on the other, just think of database schemas where all attributes are hidden (encapsulated) behind access functions (retrieval methods). The point behind our more abstract view is that we want to leave it up to the process of physical database design to make the decisions on what to store and what to derive. Of course, there are restrictions to these decisions, so we can mainly decide to materialize derived functions, trading update effort for retrieval speed.
In COCOON, like in most other OODMs, objects are instances of types, which are arranged in an inheritance hierarchy (actually, due to multiple inheritance, this is not a strict hierarchy). A type describes the set of functions that can be applied to its instances. COCOON's query language is strongly typed, that is, a type checker (statically) guarantees that only type-valid expressions are ever executed. The subtype hierarchy essentially represents the superset relationship between the set of functions defined on the subtype and that of its supertype(s). A subtype inherits all the functions from all its supertypes and adds new functions. Also, all instances of a subtype are instances of the supertype(s) as well.

A less common characteristic of the COCOON model is its separation between types and classes. A type describes the common interface of all instances, whereas a class represents a collection of objects of a given type (or subtypes thereof). Therefore, we can have more than one collection of a given type, for example distinguished by different membership predicates. Each class C is characterized by two properties: the type of its member objects, mtype(C), and its current set of member objects, extent(C). Classes can also be arranged in a (non-strict) hierarchy, representing the subset relationships between their extents, that is, the extent of a subclass is necessarily a subset of the extents of all its superclasses (see footnote 2). Other OODMs that do not distinguish between types and classes would map into COCOON by defining exactly one class per type. Notice that O2, for example, uses both terms, however with different semantics: O2-types are data types, O2-classes are object types. The O2 clause "with extent" indicates that an explicit extent should be kept for that particular O2-class. (In our terminology: O2-classes are types, and "with extent" defines a class with the same name as the type.)

Footnote 2: An analysis of object algebra operators, such as selection and projection, shows that the separation of types and classes is necessary in order to define "object-preserving" queries, because the subtype relationship between types and the subset relationship between classes need not always correspond to each other [Scholl and Schek, 1990a; Heuer and Scholl, 1991; Beeri, 1990]. Details of this aspect, however, are not relevant for the purpose of this paper. Just keep in mind that we want to allow the maintenance of more than one "type extent".

Example 1 [Logical DB Schema] In the following discussions we will refer back to this example database as the logical-level DB schema. The database contains information about companies, employees and cities [Scholl and Schek, 1990a].

    define database SampleDB
    define type city =
        name : string, zip : string, pop : integer,
        has_comp : set of company inverse location
    define type person =
        name : string, bdate : date, addr : city
    define type company =
        name : string, budg : integer,
        loc : set of city inverse has_comp,
        pres : chief,
        staff : set of employee inverse works_for
    define type employee isa person =
        hired : date, ssec : integer, sal : integer,
        works_for : company inverse staff
    define class City : city
    define class Pers : person
    define class Comp : company
    define class Empl : employee some Pers
    end.

The phrase "some Pers" in the definition of class "Empl" states that (i) Empl is a subclass (i.e., subset) of Pers, and (ii) inclusion of person objects in the subclass has to be specified explicitly by the user. (In contrast, "all Pers where P" would define a class whose members are automatically determined from the superclass and the predicate P.)

1.2.2 Nested Relations as a Description of Storage Structures

In this section we introduce our notation for physical database designs. We use nested relations to describe the physical clustering strategy on disk blocks. That is, if we say that data from the conceptual database schema is stored in a particular nested relation on the physical level, we assume that the nested tuples are directly mapped to disk blocks in a depth-first fashion: for each nesting level, we will first find all the atomic attribute values, followed by the representations of all tuples of the first subrelation, then followed by the representations of all tuples of the second subrelation, and so on, recursively, until no more subrelations exist. This way, one nested tuple is implemented on as few pages as possible. Furthermore, this implementation gives efficient access to complete nested tuples as well as to parts thereof. For the latter to work, the storage manager has to keep structural information that helps in figuring out which of the pages belonging to a nested tuple actually have to be read in from disk in order to process a given request. One implementation technique for this purpose, the one used by DASDBS, is described in detail in [Paul et al., 1987; Deppisch et al., 1987].

As usual, we denote the schema of a nested relation by recursively giving the name of the (sub-)relation followed by a list of attributes enclosed in parentheses: Dept(dno dname budget Empl(eno ename salary)) is a two-level nested relation Dept with three atomic attributes and one subrelation, Empl, that itself has three atomic attributes. In the sequel, when talking about relations (on the physical level), we always mean "relations used to store some information about objects" (object-relations). In order to describe physical-level nested relations we introduce the following additional notation:

@R, for a given relation R, denotes the physical address of R-tuples, e.g. tuple identifiers (TIDs). We can use @R either as a (virtual) attribute of relation R, or as a (stored) attribute in any other relation S (not necessarily distinct from R), in order to describe the fact that the stored S-tuples contain a physical reference to an R-tuple (a "link field").

#R, for a given relation R representing a conceptual object type R, denotes the unique object identifier (OID). This is a stored value used to represent the object itself, a surrogate value that is given to the object by the system upon creation. Notice that we do not a priori assume that we use the physical address of the tuple representing an object as the object's identifier.
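The depth-first mapping of nested tuples described at the beginning of this subsection can be sketched as follows. This is a simplified illustration, not the DASDBS storage format: the schema encoding (a pair of atomic attribute names and a dictionary of subrelation schemas) and the function name `serialize` are assumptions, and the structural information a real storage manager keeps for partial access is omitted.

```python
def serialize(nested_tuple, schema):
    """Lay out a nested tuple depth-first as a flat list of values.

    `schema` is a pair (atomic_names, subrelations), where `subrelations`
    maps each subrelation name to its own schema (hypothetical encoding).
    """
    atomic_names, subrelations = schema
    # First all atomic attribute values of this nesting level ...
    out = [nested_tuple[name] for name in atomic_names]
    # ... then all tuples of the first subrelation, then of the second, etc.
    for sub_name, sub_schema in subrelations.items():
        for sub_tuple in nested_tuple[sub_name]:
            out.extend(serialize(sub_tuple, sub_schema))
    return out

# Dept(dno dname budget Empl(eno ename salary))
dept_schema = (["dno", "dname", "budget"],
               {"Empl": (["eno", "ename", "salary"], {})})

dept = {
    "dno": 1, "dname": "R&D", "budget": 500,
    "Empl": [
        {"eno": 10, "ename": "Ann", "salary": 70},
        {"eno": 11, "ename": "Bob", "salary": 65},
    ],
}

layout = serialize(dept, dept_schema)
# layout == [1, "R&D", 500, 10, "Ann", 70, 11, "Bob", 65]
```

The flat list stands in for the byte sequence written to consecutive pages; the sample values are invented for the example.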
Given that we use tuple identifiers (TIDs) as the physical addressing scheme, we could have chosen to do so, since TIDs are guaranteed to be stable. Let us explain why we chose to separate the issues. First, even with TIDs as the physical addressing scheme, it might be necessary to have logical OIDs that are never changed or reused (not even in case of DB reorganizations), for example, if OIDs are given to users for some reason (we do not do this!). Second, and more importantly, we do not exclude redundant storage schemes, where objects may be represented in more than one tuple. This might be useful as a "decomposed" storage strategy, for example, when different object properties have very inhomogeneous access frequencies. Furthermore, we may gain overall performance from object replication. In the latter case, we certainly need a unique OID that is independent of physical tuple addresses.

The use of nested relations to describe a wide variety of physical database designs has been discussed extensively in [Scholl et al., 1987], in the context of flat relations as the conceptual model. Essentially, the choices that we have now, for an object-oriented model at the conceptual level, are largely the same. A relational schema together with the key-to-foreign-key relationships might be considered a "Complex Object" schema, without generalization. Therefore, we repeat the basic ideas here using a few examples.

Example 2 [Physical storage structures described with nested relations] A logical "relationship" can be supported in a variety of ways at the physical level, ranging from no particular support, via several kinds of indexes or link fields, through physical neighborhood of related tuples (clustering). Some of these, particularly clustering, can only be applied to 1:n-relationships without incorporating redundancy. Basically, all the options for n:m-relationships can be traced back to a specific choice for either of the two hierarchical directions embodied in the n:m-relationship. Assume two relations R and S are related through some predicate P (which might just be equality on a common attribute, some more complex condition, a function in the COCOON object model, or whatever). Physical database design may provide:

No specific support: Relations R and S are stored separately:

    R (#R ... some attributes ...)
    S (#S ... some attributes ...)

Upon retrieval, both have to be traversed (possibly in a nested-loops fashion) in order to evaluate the predicate P on every pair of R- and S-tuples. (Alternatively, an index could be used, if present. Here and in the following, we do not take indexes into account, since this is an orthogonal issue.
Indexes can be useful in all the designs discussed here.)

An embedded reference: Assuming that each R-tuple is related to at most one S-tuple, we could store the address of that S-tuple with each R-tuple. That is, upon insertion of the R-tuple, we evaluate the predicate P on relation S for that R-tuple, find the matching S-tuple (if any), and store its physical address (@S), or its OID (#S), or both, in the R-tuple:

    S (#S ... some attributes ...) and
    R (#R #S ... some attributes ...) or
    R (#R @S ... some attributes ...) or
    R (#R #S @S ... some attributes ...)

In the case where the predicate P is actually a function relating objects on the conceptual level, we will have to store at least the OID of the referenced object.

An embedded reference set: Similarly, if an R-tuple can be related to more than one S-tuple, we can store a whole set of references (OIDs and/or TIDs) within each R-tuple, in a nested subrelation, SRef (for "references to S"; see footnote 3):

    S (#S ... some attributes ...) and
    R (#R SRef(#S) ... some attributes ...) or
    R (#R SRef(@S) ... some attributes ...) or
    R (#R SRef(#S @S) ... some attributes ...)

This storage structure corresponds to CODASYL pointer arrays.

Footnote 3: The notation "SRef(...)" inside the schema of relation R denotes a subrelation with name "SRef" and subattributes "(...)".

A "join index": The pointers linking related objects could also be stored separately from the object-tuples, thus resembling the idea of join indices [Valduriez, 1987]:

    R (#R ... some attributes ...)
    S (#S ... some attributes ...) plus
    JI1 (@R SRef(@S)) and JI2 (@S RRef(@R))

Notice that we have grouped (nested) one set of addresses in each of the two parts of the join index. Furthermore, we can be more flexible in that physical addresses (TIDs) can be accompanied by or replaced with OIDs, independently from each other in either of the two parts of the join index.

Physical clustering: The strongest way of supporting fast access "along" the predicate P is to physically cluster related R- and S-tuples. Without replication, however, this is only possible for a 1:n-relationship:

    R (#R ... R-attributes ... S(#S ... S-attributes ...))

In this storage scheme, all S-tuples related to a given R-tuple are stored within that R-tuple, so no extra I/O is necessary once we have the R-tuple.

Given a logical database schema, it is the task of the physical database design process to select one of these choices for each "relationship" between objects in the logical schema. The choice is based on cost estimates for all types of operations on all the different storage structures. Heuristics (such as an experienced DBA or even some automated tool) can be used to find a good design for a given transaction load (see Section 1.4).

The choices indicated in the example above all refer to the implementation of "relationships" between objects, such as via functions in COCOON. Decisions have to be taken for other choices too, for example, how to implement the inheritance hierarchy, and how to deal with computed functions. We present the alternatives considered in our context next.

1.3 Alternatives for Physical DB Design

This section presents the alternatives for mapping object-oriented database schemas from the conceptual level to nested relations at the physical level. We proceed by stepping through the basic concepts of the COCOON object model and showing the implementation choices. Since the choices for each of the concepts combine orthogonally, a large decision space is spanned, which is later investigated by the physical database design tool (next section).
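To make the relationship-support options of Example 2 more concrete, the following sketch derives three of them (embedded reference set, join index, and physical clustering) from two in-memory relations and a predicate P. It is a toy model under stated assumptions: the keys `oid` and `tid` stand in for #R/#S and @R/@S, and the function names and sample data are hypothetical.

```python
def matches(r, s_tuples, pred):
    """All S-tuples related to the R-tuple r through predicate pred."""
    return [s for s in s_tuples if pred(r, s)]

def embedded_reference_set(r_tuples, s_tuples, pred):
    # R (#R SRef(#S) ...): each R-tuple carries the OIDs of its related S-tuples.
    return [{**r, "SRef": [s["oid"] for s in matches(r, s_tuples, pred)]}
            for r in r_tuples]

def join_index(r_tuples, s_tuples, pred):
    # JI1 (@R SRef(@S)) and JI2 (@S RRef(@R)), stored apart from R and S.
    ji1 = {r["tid"]: [s["tid"] for s in matches(r, s_tuples, pred)]
           for r in r_tuples}
    ji2 = {s["tid"]: [r["tid"] for r in r_tuples if pred(r, s)]
           for s in s_tuples}
    return ji1, ji2

def clustered(r_tuples, s_tuples, pred):
    # R (#R ... S(#S ...)): complete S-tuples stored inside the R-tuple.
    # Free of redundancy only if every S-tuple matches at most one R-tuple.
    return [{**r, "S": matches(r, s_tuples, pred)} for r in r_tuples]

# Hypothetical sample data; P is equality on a common attribute value.
R = [{"oid": "r1", "tid": 100, "city": "Ulm"},
     {"oid": "r2", "tid": 101, "city": "Bonn"}]
S = [{"oid": "s1", "tid": 200, "loc": "Ulm"},
     {"oid": "s2", "tid": 201, "loc": "Ulm"},
     {"oid": "s3", "tid": 202, "loc": "Bonn"}]

def P(r, s):
    return r["city"] == s["loc"]

ji1, ji2 = join_index(R, S, P)
```

In this sketch the predicate is evaluated once when the structure is built, mirroring the text's remark that references are determined upon insertion; keeping them current under updates is the price paid for faster retrieval.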

1.3.1 Implementing Objects

According to the object-function paradigm of COCOON, an object itself is sufficiently implemented by a unique identifier (OID), which is generated by the system. All data related to an object in one way or another will refer to this identifier (see below). Following the conventions set up above, we denote, for each object type T, attributes of internal relations containing the OID of objects of type T by #T.

1.3.2 Implementing Functions

In COCOON, functions are the basic way of associating information (data values or other objects) with objects. In principle, we can think of each function as being implemented as a binary relation, with one attribute for the argument OID and the other for the result value (data item or OID). In the case of set-valued functions, the second attribute will actually be a subrelation of unary subtuples, containing one result (OID or data value) each. So, in principle, a single-valued function f_s : T1 -> T2 and a multi-valued function f_m : T1 -> set(T3) could be implemented by two binary relations:

    f_s (#T1 #T2)
    f_m (#T1 T3Ref(#T3))

There are some obvious choices (such as: Do we really store each function in a separate binary relation or do we combine several of them into a "wider" relation?), and also some more subtle alternatives (such as: Shall we include physical pointers?). The decision space as far as function implementations are concerned includes the following alternatives in our current approach:

Bundled vs. Decoupled: Each function f defined on a given domain object type T might be stored in a separate (binary) relation f as shown above: we call this the decoupled mode. Alternatively, we can bundle the function f (possibly with other functions) together with the relation T implementing the type T (see below).
Notice the restriction: the set of all functions defined on the same domain type is partitioned into bundled functions (which are all stored together in one internal relation) and decoupled functions (which are each stored in a separate table). More flexible function partitioning schemes are possible and certainly useful. However, we currently limit our optimization process to this restricted choice for tractability reasons.

In the example above, the bundled implementation of both f_s and f_m would yield the following type table for T1:

    T1 (#T1 f_s# f_m#Set(#T3))

Notice the naming convention: attributes are named after the function they implement, a suffix "#" indicates a (logical, OID) reference, and a suffix "Set" indicates a multi-valued function (a subrelation).

Logical vs. Physical Reference: A function returning a (set of) object(s), not (a) data value(s), can be implemented by storing just the OIDs of result objects (logical reference) or by including a TID (physical reference) as well. In the latter case, relations for single-valued functions become 3-attribute relations, and those for multi-valued functions now have pairs in the subrelation. Continuing the example above (bundled), inclusion of physical references for both f_s and f_m would result in:

    T1 (#T1 f_s# f_s@ f_m#Set(#T3 @T3))

Oneway vs. Bothway References: A function f from type T to (possibly a set of) type S can be implemented by a forward reference only (oneway), or it can be implemented with backpointers (bothway). Again, backward pointers, if any, can be implemented with just logical references or with physical references. Notice that COCOON includes the specification of inverse functions. If an inverse function is defined in the conceptual schema, then the "backward" reference is present anyway. Therefore, this option is only considered for functions that have no inverse in the object schema (see footnote 4). Whenever the inverse function is not given explicitly in the schema, we have to assume that back references are multi-valued. For decoupled functions, the backward references will also be decoupled. Therefore, decoupled functions with backpointers result in the "join indices" shown in the previous section. For bundled functions, back references are also bundled with the corresponding type table.
In our (bundled) example above, assuming a (logical only) backpointer for function f_s would make the type table for type T2 look like:

    T2 (#T2 f_s^-1# ... other attributes ...)

Footnote 4: Backpointers might be useful, because COCOON's query language allows traversing functions backwards even if the inverse is not given explicitly in the schema.

Reference vs. Materialized: Functions returning (sets of) objects, not data values, can be implemented by the various forms of references discussed up to now. Alternatively, however, we can directly materialize the object-tuple(s) representing the result object(s) within the object-tuple representing the argument object. That is, we can store the resulting object-tuple "in place". This strategy achieves physical clustering. In our example, the decision to materialize the function f_m would generate a nested type table for T1 that contains the type table for T3 as a subrelation:

    T1 (#T1 f_s# f_s@ f_mSet(#T3 ... other T3-attributes ...))

Obviously, we need no backward references in this case. Furthermore, this alternative is free of redundancy only if the materialized function is 1:n, that is, its inverse is single-valued. As shown in the example, materialization is considered only in conjunction with bundling in our current optimizer. More generally, it may be optimal to materialize decoupled functions as well. Then we would actually partition the objects of the result type according to this function.

Computed vs. Materialized: Finally, an additional option is to materialize derived (computed) functions. Assuming that some function f on type T can be computed, we could nonetheless decide to materialize it internally, if retrieval on f dominates updates to the underlying base information significantly. The more retrieval dominates updates, and the more costly the computation is, the more likely it is that materialization pays off. For example, with geometric object descriptions, one typically uses a "bounding box" function to filter objects coarsely in spatial queries. Obviously, the bounding box is derived from the actual geometry of objects.
But computing the bounding box incurs considerable effort, and if object shapes rarely change, materializing the bounding box function clearly is a good strategy (see also [Kemper and Moerkotte, 1990a; Kemper and Moerkotte, 1990b]).

Let us repeat that choosing how to implement functions (retrieval methods) for an object-oriented database schema is essentially the same problem as physical database design for network (CODASYL) databases, or for "Complex Object" databases (in the sense of [Abiteboul and Beeri, 1988; Abiteboul et al., 1989]), or even for relational databases (where the "structure" stems from key-to-foreign-key relationships).
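The bundled/decoupled and logical/physical choices for reference-valued functions can be summarized by a small generator of storage schemas in the notation used above. This is a sketch, not part of the optimizer described in the paper: the class `Fn`, the helper names, and the use of plain ASCII names such as `f_s` are assumptions, as is the "@" suffix for bundled physical references.

```python
from dataclasses import dataclass

@dataclass
class Fn:
    name: str               # function name, e.g. "f_s"
    result: str             # result object type, e.g. "T2"
    multi: bool = False     # set-valued function?
    bundled: bool = True    # stored in the type table of its domain type?
    physical: bool = False  # store a TID in addition to the OID?

def bundled_attr(f):
    """Attribute(s) for a bundled function, following the naming convention."""
    if f.multi:
        refs = f"#{f.result}" + (f" @{f.result}" if f.physical else "")
        return f"{f.name}#Set({refs})"
    return f"{f.name}#" + (f" {f.name}@" if f.physical else "")

def decoupled_relation(type_name, f):
    """Separate binary relation for a decoupled function."""
    refs = f"#{f.result}" + (f" @{f.result}" if f.physical else "")
    body = f"{f.result}Ref({refs})" if f.multi else refs
    return f"{f.name} (#{type_name} {body})"

def storage_schema(type_name, fns):
    """Type table first, then one relation per decoupled function."""
    attrs = [f"#{type_name}"] + [bundled_attr(f) for f in fns if f.bundled]
    relations = [f"{type_name} ({' '.join(attrs)})"]
    relations += [decoupled_relation(type_name, f) for f in fns if not f.bundled]
    return relations

bundled = storage_schema("T1", [Fn("f_s", "T2"), Fn("f_m", "T3", multi=True)])
decoupled = storage_schema("T1", [Fn("f_s", "T2", bundled=False),
                                  Fn("f_m", "T3", multi=True, bundled=False)])
```

Here `bundled` is `["T1 (#T1 f_s# f_m#Set(#T3))"]`, while `decoupled` yields the bare type table `T1 (#T1)` plus the two binary relations for f_s and f_m. Backpointers, materialization, and computed functions are left out of this sketch.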

1.3.3 Implementing Types, Classes, and Inheritance

Some new aspects in physical database design, however, originate from data modelling concepts not typically found in "pre-object-oriented" models: inheritance hierarchies. In the context of the COCOON object model, we are dealing with two such hierarchies: one between types (organizing structural, function inheritance), and one between classes (organizing set inclusion). Before going into the details of these hierarchies, there is one more basic question to be answered: how to implement types and classes.

In general, our approach to physical design is schema-driven (as opposed to fully dynamic or instance-driven). That is, we analyze and optimize the physical DB layout based on schema-level information (types, classes, functions) rather than for individual objects. In our model, this raises the question whether we do the design for types of objects, or for individual classes. Since classes are always bound to a particular (member) type, physical design for types is the coarser-grained approach, whereas design for individual classes would be the finer-grained approach (remember, there may be more than one class per type). Currently, we do the physical design on a type basis, that is, all objects of a given type are physically represented in the same way (even if they belong to several classes). The argument for doing so is that it is easier, because it gives fewer choices. Furthermore, if typical database schemas have roughly one class per type, the difference as compared to a class-based physical design is only marginal.

Classes are implemented as views over their underlying type table. If a class is defined by a predicate, this predicate is used as a selection condition; user-defined classes (whose members are explicitly added or removed by query language operators) require an additional boolean attribute in the type table.
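The view-based implementation of classes can be illustrated with a toy type table. In this sketch, an "all"-class is a selection with its defining predicate, and a "some"-class is a selection on an extra boolean attribute named after the class. The table contents, the class names `Seniors` and `Club`, and the attribute `byear` are hypothetical and not taken from the example schema.

```python
# Hypothetical type table for type person; "Club" is the boolean membership
# attribute added for a user-defined ("some") class of the same name.
person_table = [
    {"oid": 1, "name": "Ann", "byear": 1948, "Club": True},
    {"oid": 2, "name": "Bob", "byear": 1960, "Club": False},
    {"oid": 3, "name": "Cy", "byear": 1939, "Club": True},
]

def all_class(type_table, predicate):
    """Extent of a predicate-defined class: a selection over the type table."""
    return [t["oid"] for t in type_table if predicate(t)]

def some_class(type_table, class_name):
    """Extent of a user-defined class: select on its boolean attribute."""
    return [t["oid"] for t in type_table if t[class_name]]

def add_member(type_table, class_name, oid):
    """Explicitly add an object to a user-defined class."""
    for t in type_table:
        if t["oid"] == oid:
            t[class_name] = True

seniors = all_class(person_table, lambda t: t["byear"] < 1950)
club_before = some_class(person_table, "Club")
add_member(person_table, "Club", 2)
club_after = some_class(person_table, "Club")
```

Because both kinds of class are computed from the one type table, no object-tuple is ever duplicated for class membership, which is the property the type-based design relies on.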
The inheritance hierarchy for types introduces two further degrees of freedom for physical design: first, if functions are bundled, shall we include inherited functions in the type tables of subtypes? Second, shall objects be represented in an object-tuple only in the most specific subtype's table, or in several object-tuples, one per supertype? These choices have sometimes been called horizontal versus vertical partitioning of objects or properties. Currently, we allow only very limited choices with respect to types, classes, and inheritance:

Types: Each object type T is mapped to a type table T with at least one attribute, #T, containing the OID. Additional attributes are present in case of any bundled functions and/or materializations of object functions. The type table T may itself be a subrelation of some other table S, if type T was materialized w.r.t. a function returning T-objects. (In the latter case, a dummy object tuple has to be added to S that collects T-objects not related to any S-object, that is, if the function used for materialization is not onto T.)

Classes: Each class C is implemented as a view over its underlying type table. If the class is defined by a predicate ("all"-classes and views in COCOON), this predicate is used as the selection condition. If the class is defined to include manually added member objects ("some"-classes in COCOON, see [Scholl and Schek, 1990a; Scholl et al., 1991]), the underlying type table is extended by a boolean attribute C that is set to true if and only if the object is a member of this class C.

Inheritance: Subtyping is implemented by having one type table per subtype. Two possibilities are considered:

- An object-tuple is included in each supertype's table. In case there are any bundled or materialized functions, these are not repeated in the subtypes' tables. In this case, object-tuples in subtype tables might optionally include physical references to supertype tuples. With this option, a physical reference to an object always points to an object-tuple in one specific type table. So, there is yet another degree of freedom: which one to point to? When choosing this option, we always point to the object-tuple in the type table that implements the range type of the function under consideration.

- An object-tuple is included in only one type's table, that of the most specific subtype. In case of any bundled or materialized functions in supertypes, these are also included in the subtype's table.

Therefore, function values are never kept redundantly, while OIDs may be replicated in all supertypes. Subclassing is implemented without redundancy, because all classes are views anyway. Future plans include the consideration of classes instead of types as the basis for physical design, and potential redundancy with respect to inheritance.
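The two possibilities for representing objects under subtyping can be contrasted in a short sketch. The schema (Empl as a subtype of Pers), the function names, and the dictionary-based tables are hypothetical and serve only to show where object-tuples and function values end up in each case.

```python
from typing import Any, Dict, List, Optional

# Hypothetical schema: each type lists only its own (non-inherited) functions.
OWN_FUNCTIONS: Dict[str, List[str]] = {"Pers": ["name", "bdate"], "Empl": ["sal", "hired"]}
SUPERTYPE: Dict[str, Optional[str]] = {"Empl": "Pers", "Pers": None}

Tables = Dict[str, List[Dict[str, Any]]]


def type_chain(t: Optional[str]) -> List[str]:
    """Return t followed by all its supertypes, most specific first."""
    chain = []
    while t is not None:
        chain.append(t)
        t = SUPERTYPE[t]
    return chain


def store_in_all_supertypes(tables: Tables, oid: int, t: str, values: Dict[str, Any]) -> None:
    """First possibility: one object-tuple per (super)type, own functions only."""
    for s in type_chain(t):
        row = {"#" + s: oid}
        row.update({f: values[f] for f in OWN_FUNCTIONS[s]})
        tables.setdefault(s, []).append(row)


def store_in_most_specific(tables: Tables, oid: int, t: str, values: Dict[str, Any]) -> None:
    """Second possibility: a single object-tuple in the most specific subtype's
    table, including the functions inherited from supertypes."""
    row = {"#" + t: oid}
    for s in type_chain(t):
        row.update({f: values[f] for f in OWN_FUNCTIONS[s]})
    tables.setdefault(t, []).append(row)
```

In the first variant the OID appears once per level of the hierarchy, but every function value is stored exactly once; in the second variant a scan of all Pers objects has to visit the subtype tables as well.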

1.3.4 Indexes

Obviously, among the most important decisions that have to be made during physical database design for any database is the selection of appropriate indexes. All the classical indexing techniques, such as B+-trees, will be considered. Furthermore, several specialized index structures have been proposed that are designed to support OODB-specific kinds of operations, such as path traversals [Bertino, 1990]. In order to evaluate the advantages and costs of using a complex record (DASDBS) instead of a flat record (RDBMS) storage manager, though, indexes play only a supporting role. The main emphasis is on the effects of hierarchical clustering and embedded references. Therefore, we currently do not consider index selection. In the future, we plan to take indexes into account, particularly because DASDBS allows the implementation of very powerful index structures, such as nested or path indexes.

1.3.5 The Default Physical Design

In order to have a starting point for both the physical database design tool described in the next section and the implementation of COCOON on top of DASDBS, we have identified a default physical design that includes the following choice of implementation strategies:

Functions: All functions are bundled with their type table, so as to cluster all object properties together. Object-valued functions are implemented as references (potentially shared subobjects), with physical references and backpointers (efficient access, also for the inverse direction). Multi-valued functions become subrelations.

Inheritance: Objects are present in all supertype tables (efficient access to all instances at all levels); inherited functions are not repeated in subtype tables (non-redundant storage scheme). No backpointers to supertype tuples are included (fast access via index is assumed).
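How the default strategies for functions translate into type tables can be sketched roughly as follows. The sketch rests on several assumptions that are not confirmed by the text: the attribute notation (f# for a logical reference, f@ for a physical reference, f^-1Set for a backpointer subrelation) is modelled on the schema notation used in this chapter, the Function record and default_design are hypothetical names, backpointers are generated for every object-valued function, and inheritance is left out entirely.

```python
from typing import Dict, List, NamedTuple, Optional


class Function(NamedTuple):
    name: str
    domain: str                 # type the function is defined on
    range_type: Optional[str]   # target type, or None for atomic values
    multi_valued: bool = False


def default_design(types: List[str], functions: List[Function]) -> Dict[str, List[str]]:
    """Derive attribute lists of the type tables under the default strategies."""
    tables = {t: ["#" + t] for t in types}  # every type table holds the OID
    for f in functions:
        # Bundling: every function is stored with the table of its domain type.
        if f.range_type is None:
            attrs = [f.name]                        # atomic value
        else:
            attrs = [f.name + "#", f.name + "@"]    # logical + physical reference
        if f.multi_valued:
            # Multi-valued functions become subrelations.
            tables[f.domain].append(f.name + "Set(" + ", ".join(attrs) + ")")
        else:
            tables[f.domain].extend(attrs)
    for f in functions:
        if f.range_type is not None:
            # Backpointer subrelation in the range type's table (assumed layout).
            d = f.domain
            tables[f.range_type].append(f"{f.name}^-1Set({d}#, {d}@)")
    return tables
```

For a schema with an object-valued function addr from Pers to City, the sketch yields the attributes addr# and addr@ in the Pers table and a subrelation addr^-1Set(Pers#, Pers@) in the City table.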
For the conceptual schema given in Example 1, the default physical design would be as follows (a backpointer for a function f is called f^-1; names of set-valued functions are suffixed by "Set"):

[Figure: nested relational schemas of the type tables City, Comp, Pers, and Empl. Recoverable attribute lists:
    City: #City, pop, name, zip, has_compSet(comp#, Comp@), addr^-1Set(Pers#, Pers@)
    Comp: #Comp, name, budg, pres#, pres@, ...
    Pers: #Pers, name, bdate, addr#, ...
    Empl: #Empl, sal, hired, ssno, works_for#, ...
 Further subrelations appearing in the figure: locSet(city#, ...), staffSet(emp#, ...), pres^-1Set(Comp#, ...)]

Notice that an employee object's OID (Empl#) is actually the same as the corresponding person object's OID (Pers#).

1.4 A Physical Design Tool

In this section, we present a preliminary physical database design tool that considers some of the alternatives above and produces an internal nested relational schema for use with DASDBS, derived from a conceptual COCOON schema, a load description, and a cost model.

1.4.1 General Approach

One of the main objectives of the COCOON project is to investigate the architecture of OODBMSs. Therefore, three implementation platforms are currently being used: a commercial relational DBMS (Oracle), a commercial OODBMS (Ontos), and the DASDBS prototype. The relational and the nested relational "storage managers" will be used for extensive performance experiments to evaluate the pros and cons of the different storage alternatives. While the physical design alternatives presented above were developed mainly for implementation on top of DASDBS, some of them can be mapped to Oracle, too. For example, nested relations with only two levels of nesting (i.e., all subrelations are flat) can be simulated quite exactly by means of Oracle's "Clusters" [ORACLE, 1990]. In order to assist the DBA in selecting a good physical design for a given conceptual database schema and anticipated transaction load (which may be estimated, observed, or "guesstimated"), we have implemented a first prototype of a physical database design tool (DBDesigner). The system was developed as a master's thesis [Gross, 1991] and is implemented in PROLOG. It uses a simplified version of a cost model for DASDBS operations developed earlier [Brauburger and Deuer, 1987; Paul, 1988].

1.4.2 Load Description

The transaction load is given to DBDesigner as a collection of abstractions of COCOON operations together with their frequencies. That is, a load description is a collection of entries of the form

    /f/operation-specification

where f is the frequency of the operation. The operation specifications consist of the following:

For selections we record which attributes occur in the predicate and the general form of the predicate. The specific predicate used is not included. Furthermore, the estimated selectivity of the predicate is also given in the load description (as an absolute cardinality or as a relative fraction). As an example, consider the following two entries:

    /100/select/0.3/[(name(manufacturer)) mul (name(owner))] (Vehicle)
    /30/select/150/[address rel location(works_for)](employee)

The first entry states that a selection of Vehicles returning 30% of all member objects of that class is issued 100 times. The predicate involves names of manufacturers and names of owners; these two parts are combined conjunctively ("mul"). The second entry indicates that 30 selections on the Employee class are issued that return about 150 objects each. The predicate compares the address of the Employee with the location of his or her employer (using a set comparator, as indicated by "rel").

For projections, of which COCOON's query language actually has two forms, an object-preserving operator "project" and a tuple-generating operator "extract", we record only extract, since project is a typecast operation that is "executed" completely at compile-time (used for type checking purposes). Since extracts can be nested to produce nested sets of tuples, the load description for extracts records the "path traversals" that are performed by these operations. For example, if 50 extracts of Company data together with (nested) Employee data are contained in the mix, the load description will have the following line:

    /50/extract[cname,budget,extract[name,salary](staff)] (Company)

This basically conveys which parts of the accessed objects are read for output.

For extend operations, which define new derived functions and can also be used to simulate joins (see [Scholl and Schek, 1990a]), the load description currently contains no entries. The query compiler will substitute the defining expressions for the function names, so in the first DBDesigner prototype we expect this substitution to be done before the load description is drawn up.

Set operations, such as union, difference, and intersection, are currently excluded from the load description. The reason is mainly to reduce the problem space (and also that we expect them to be less frequent and less crucial).

For update operations, the main information is the frequency and which functions get updated. The following three lines describe two update operations that occur 30 and 4 times, respectively:

    /30/update[produces := select/430/[id](vehicle)](company)
    /4/newemployer := select/1/[name](company)
    /4/update[works_for := newemployer](employee)

The first update sets the 'produces' function to a new value for a Company. The new value for 'produces' is obtained by a query (selection) against the Vehicle class (that returns 430 objects on average). The second update proceeds in two steps: first, a variable is assigned the result of a selection on Companies (returning only 1 object), then this Company is made the new value of the 'works_for' function of an Employee object. Notice that frequency information for update statements can be interpreted in two ways: either single object updates are performed that often, or that many objects are updated at once. For the cost calculation there is no difference.

1.4.3 Statistical Information

In order to compute the cost of operations, the design optimizer should actually cooperate with the query optimizer of the execution engine. In the current development phase of the COCOON-DASDBS mapping, however, this part is not yet completed (see Section 1.5). Therefore, DBDesigner uses its own set of statistics and cost formulae. The statistics used by the physical designer are:

- cardinalities of types (how many objects of that type are in the DB?) (see footnote 5)
- (average) cardinalities of set-valued functions
- (average) sizes of all atomic values

No information about value distributions is needed, since selectivities are included in the load description.

1.4.4 The Optimization Process

The optimization algorithm of DBDesigner uses a branch-and-bound method. From the given load description we first generate a "transaction graph" (TG) that will be used for enumerating the design alternatives. The TG consists of vertices representing the types in the conceptual database schema; directed edges connect types if there is a "traversal" in the corresponding direction in the load. Traversals can, for example, occur in selection conditions: whenever a selection on a class over type T1 uses a function that returns objects of type T2, there will be an edge from node T1 to node T2 in the TG. Other possibilities for such traversals are projections (extract) and 'information flow' in update statements (where do the values assigned come from?). Finally, use of inherited functions also leads to an edge from the subtype node to the supertype node. The next step is to add a weight to the edges in the TG. This weight represents the accumulated traffic across this link, that is, from the load description we compute the sum of the frequencies of all operations that incur the traversal represented by the edge. The current version of DBDesigner has the following restriction. In general, a conceptual schema may contain multiple connections between two types; for example, several functions might connect the same two types. This is not permitted in schemas that can be optimized by DBDesigner.
As a consequence, if the conceptual schema does contain such cases, the range type of such functions has to be specialized into subtypes, such that the functions map objects of the domain type to different range types.

Footnote 5: Accumulation for supertypes is done by the tool, based on the assumption that different subtypes of one type have disjoint extents.
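The construction of the weighted transaction graph can be sketched as follows. This is not DBDesigner's PROLOG code: the input format (a frequency plus a list of traversals per operation), the function name build_transaction_graph, and the mapping of the sample operations to type-level edges are assumptions made for illustration. Each operation is assumed to contribute its frequency once per edge, which is one reading of "the sum of the frequencies of all operations that incur the traversal".

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str]  # (source type, target type)


def build_transaction_graph(load: Iterable[Tuple[int, List[Edge]]]) -> Dict[Edge, int]:
    """Accumulate, per directed edge, the frequencies of all operations
    whose evaluation traverses that edge."""
    weights: Dict[Edge, int] = defaultdict(int)
    for frequency, traversals in load:
        # An operation counts once per edge, even if it lists the edge twice.
        for edge in set(traversals):
            weights[edge] += frequency
    return dict(weights)


# Hypothetical abstraction of three load entries:
#   30 selections on Empl that follow works_for to Comp,
#   50 extracts on Comp that follow staff to Empl and read the inherited name,
#    4 updates of works_for on Empl with a value taken from Comp.
load = [
    (30, [("Empl", "Comp")]),
    (50, [("Comp", "Empl"), ("Empl", "Pers")]),
    (4, [("Empl", "Comp")]),
]

tg = build_transaction_graph(load)
# Edges ordered by decreasing weight (heaviest traffic first).
ranked = sorted(tg.items(), key=lambda item: item[1], reverse=True)
```

Because at most one connection between two types is permitted, a pair of type names identifies an edge unambiguously in this representation.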

After the construction of the TG, the optimization can actually begin. Starting from the default physical design (see above), the optimizer selects the most promising (that is, heaviest) edge from the TG and tries to improve performance by materializing the corresponding function (physical clustering) or by repeating the inherited attributes in the subtype's object-table. The total cost of the new design is compared with the old cost. The next step depends on how these costs compare. If the transformation of the physical design led to an improvement, we proceed with this design; otherwise we try the second heaviest edge, and so on. The search is always continued at the currently best alternative, as long as it still contains some immediate potential for improvement. An alternative has no immediate potential if either no more transformations can be applied, or all possible transformations have already been tested and none of them has led to an improvement. Notice that an alternative without immediate potential could still lead to the optimal solution. So, immediate potential is only used as the criterion for where to continue the search next. If no alternatives with immediate potential are left, we continue with the best alternative that allows transformations, until no more transformations are possible. The search can further be limited by the user by giving a maximum number of final designs to evaluate. A design is final if it has no immediate potential.

1.4.5 Experiences and Extensions

We have tested DBDesigner with a couple of (rather small) sample databases, with only a modest complexity in the transaction load. Larger scale experiments are planned, but have not yet been carried out. With the small test cases, as expected, the performance was good and PROLOG has not (yet?) turned out to be a big penalty. The first prototype has already been extended to allow dynamic modifications in either the transaction load or the database schema.
The objective here is to avoid complete re-iteration of the optimization process, for two reasons: one is to avoid duplicate work. The other, more challenging one is that the new physical design should not be too different from the old one in case we already have a big populated database. Otherwise we would have to reorganize the existing data. Particular emphasis was put on the inclusion of view definitions in the schema description. For this specific case we have added redundant storage strategies to the optimizer's repertoire: a view can be materialized in a separate internal relation or just kept as a virtual (computed) class.

1.5 Query Optimization

In this section we briefly discuss the transformation and optimization of queries that are given to the system in terms of the conceptual database schema. It is the task of the query optimizer to map these COCOON queries down to the physical level by: (i) transforming them to the nested relational model and algebra as available at the DASDBS kernel interface, and (ii) selecting a good (if not the best) execution strategy. Because COCOON's query language, COOL, is quite similar to a nested relational algebra, a straightforward transformation from COOL expressions down to nested relational algebra expressions against a fixed implementation at the internal level (e.g., the default physical design) is rather easy. Complications arise from the fact that the mapping of data structures is quite flexible, and that, depending on the chosen design, operations have to be optimized substantially. Originally, we planned to investigate two competing approaches to query transformation and optimization. The first would have been a purely algebraic one, comparable to what we did with the relational to nested relational mapping [Scholl et al., 1987; Scholl, 1986]: COCOON classes would be defined as 'views' over the stored nested relations, COOL queries would be transformed to the nested relational level by 'view substitution', and finally, algebraic transformations within the nested relational algebra could be applied, so as to eliminate redundant subexpressions. Quite a few redundant joins would have to be removed in case we materialized functions (hierarchical clustering). This is exactly the problem addressed in [Scholl, 1986].
The second approach is to transform the given COOL query into a nested algebra query using a class connection graph, similar to the one used in [Lanzelotte et al., 1991]. The class connection graph is somewhat similar to the transaction graph used in DBDesigner, except that its edges are labelled by the implementation strategy. For example, whether a pointer has to be followed (in case of a physical reference), whether a join has to be performed (in case of a logical reference), or whether a subrelation has to be accessed (in case of a materialized function) is represented by corresponding labels. Each label