A Framework for the Design of Distributed Databases

Fernanda Baião, Marta Mattoso, Gerson Zaverucha
Computer Science Department, COPPE/UFRJ
Federal University of Rio de Janeiro, Brazil

This work presents a framework to handle the class fragmentation problem during the design of distributed object databases. The framework works at the conceptual level, and thus uses the object data model to capture the application semantics represented by the user. The proposed framework integrates three modules. The heuristic module defines a set of heuristics to drive the fragmentation of object databases and incorporates them in a methodology that includes an analysis algorithm and horizontal and vertical class fragmentation algorithms. The theory revision module automatically improves the analysis algorithm through the use of an artificial intelligence technique named theory revision, using fragmentation schemas with previously known performance as examples. Finally, the branch-and-bound module uses optimization techniques to perform an intelligent search for an optimal fragmentation schema through a larger space of hypotheses than the one covered by the heuristic approach.

INTRODUCTION

Distributed and parallel processing on database management systems (DBMS) is an efficient way of improving the performance of applications that manipulate large volumes of data. This may be accomplished by eliminating access to irrelevant data during the execution of queries and by reducing the data exchange among sites, which are the two main goals of the design of distributed databases [1]. Also, many recent problem domains are supported by applications that, in addition to handling a great volume of data, are typically more complex than traditional applications.
Those applications require, at least at the conceptual level, a semantically richer data model that is capable of directly representing complex structures and operations in a more natural and adequate manner, such as the object data model. Therefore, in order to improve the performance of those applications, it is very important to design information distribution properly and to take the application semantics into account as much as possible.

The distribution design involves making decisions on the fragmentation and placement of data across the sites of a computer network. The first phase of the distribution design in a top-down approach is the fragmentation phase, which is the process of clustering into fragments the information accessed simultaneously by applications. The fragmentation phase is then followed by the allocation phase, which handles the physical storage of the generated fragments among the nodes of a computer network, and the replication of fragments. This work addresses the fragmentation phase of databases. We believe that, by outputting good fragmentation schemas with improved performance, data allocation and replication may then be carried out more efficiently, since the fragmentation schema will adequately reflect appropriate units of distribution according to the application access patterns, and thus may significantly reduce the search space of the allocation phase.

However, the generation of a good fragmentation schema of a database using the object data model is a difficult task, for four basic reasons: (i) it is not a well-defined problem; (ii) it must take many parameters into account; (iii) it has many conflicting goals; and (iv) it requires some estimates and heuristics that may sometimes conflict. Nevertheless, the designer may concentrate on semantic relationships, leaving physical distribution design to the last phase.

To fragment a class, it is possible to use two basic techniques: horizontal fragmentation and vertical fragmentation. In object databases, horizontal fragmentation distributes class instances across the fragments. Thus, a horizontal fragment of a class contains a subset of the whole class extension. On the other hand, vertical fragmentation (VF) breaks the class logical structure (its attributes and methods) and distributes them across the fragments. Horizontal fragmentation is usually subdivided into primary and derived horizontal fragmentation. Primary horizontal fragmentation (PHF) basically optimizes set operations (search over a class extension), firstly by reducing the amount of irrelevant data accessed and, secondly, by permitting applications to be executed concurrently, thus achieving a high degree of parallelism. On the other hand, derived horizontal fragmentation (DHF) can be viewed as a way of clustering objects of distinct classes on disk, therefore clearly addressing the relationships between classes and improving the performance of applications with navigational access.
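The difference between horizontal and vertical fragmentation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Employee` class, the predicate and attribute-group names, and the choice to repeat the object identifier in every vertical fragment so that objects can be reassembled. It covers attributes only (not methods) and is not the fragmentation algorithm of this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Employee:
    """Hypothetical class used only to illustrate the two techniques."""
    oid: int
    name: str
    dept: str
    salary: float


def horizontal_fragments(extension, predicates):
    """Distribute class instances: each fragment holds a subset of the extension."""
    return {
        frag: [obj for obj in extension if pred(obj)]
        for frag, pred in predicates.items()
    }


def vertical_fragments(extension, attribute_groups):
    """Split the class structure: each fragment holds some attributes of every object.

    The oid is repeated in every fragment (an assumption of this sketch)
    so that an object can be rebuilt by joining fragments on it.
    """
    return {
        frag: [
            {"oid": obj.oid, **{attr: getattr(obj, attr) for attr in attrs}}
            for obj in extension
        ]
        for frag, attrs in attribute_groups.items()
    }


extension = [
    Employee(1, "Ana", "sales", 10.0),
    Employee(2, "Bo", "eng", 20.0),
    Employee(3, "Cy", "eng", 30.0),
]
by_dept = horizontal_fragments(
    extension,
    {"eng": lambda e: e.dept == "eng", "other": lambda e: e.dept != "eng"},
)
by_attrs = vertical_fragments(
    extension, {"public": ["name", "dept"], "payroll": ["salary"]}
)
```

In this sketch a query that only scans engineering employees touches one horizontal fragment, while a query that only reads salaries touches one vertical fragment.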
It is also possible to apply both vertical and horizontal fragmentation techniques to a class simultaneously (which we call hybrid fragmentation) or to apply different fragmentation techniques to different classes in the database schema (which we call mixed fragmentation).

There are many approaches in the literature addressing the DDODB problem [2, 3, 4, 5, 6]. However, due to its complexity, most of them rely on a specific set of estimates and heuristics. Also, some approaches require an instantiated database to work on, which may limit their application. Most importantly, the distribution design algorithms presented are limited to applying just one of the fragmentation techniques (horizontal or vertical, but not both) to all classes of the schema, and therefore propose either a horizontal-only or a vertical-only class fragmentation approach. We have already pointed out in [7] the benefits of mixed fragmentation (that is, the combination of vertical and horizontal fragmentation in different classes of the schema) and hybrid fragmentation (in the same class) in increasing the performance of applications. It is also important to analyze the database schema and the application characteristics in order to propose good fragmentation schemas. However, such issues are not addressed in other works in the literature.

FRAMEWORK FOR THE DESIGN OF DISTRIBUTED DATABASES

This work presents a framework to handle the class fragmentation problem during the design of distributed databases, using the object model at the conceptual level. This way, the ideas presented may be applied in different environments (such as in domains where data is managed by object-relational or object-oriented database management systems), as long as the application conceptual model is compatible with the object-oriented model defined in this work. The proposed framework (illustrated in Figure 1) integrates three modules: the DDODB heuristic module, the theory revision module (TREND3) and the DDODB branch-and-bound module.

[Figure 1. Overall framework for the class fragmentation in the DDODB. The diagram connects the distribution designer, the database application (semantics, operations and quantitative information), the DDODB heuristic module (AA, VF, HF), the TREND3 module with FORTE, and the DDODB branch-and-bound module with the query processing cost function.]

The distribution designer provides input information about the database semantics, the operations that will be executed over the stored data and additional quantitative information such as the estimated cardinality of each class. This information is then passed to the DDODB heuristic module. The DDODB heuristic module defines a set of heuristics to search for the best fragmentation schema for a given database application. The execution of the algorithms from the heuristic module (AA: Analysis Algorithm, VF: Vertical Fragmentation and HF: Horizontal Fragmentation) follows this set of heuristics and quickly outputs a good fragmentation schema to the distribution designer, to be implemented on the database. Intermediary results of the heuristic module are presented in [7, 8, 9].
Performance results from these works have shown the effectiveness of the DDODB heuristic module in an experimental study on top of the OO7 benchmark.

The set of heuristics implemented by the DDODB heuristic module may be further automatically improved by executing a theory revision process through the use of inductive logic programming (ILP) [10]. This process is called Theory REvisioN on the Design of Distributed Databases (TREND3), and is represented in our framework by the TREND3 module [11]. The improvement process may be carried out by providing two input parameters to the TREND3 module: the PROLOG implementation of the analysis algorithm (representing the initial theory) and a fragmentation schema with previously known performance (representing a set of examples). The analysis algorithm is then automatically modified by a theory revision system (called FORTE) so as to produce a revised theory. The revised theory represents an improved analysis algorithm that is able to output the fragmentation schema given as input parameter, and this revised analysis algorithm then replaces the original one in the DDODB heuristic module.

Additionally, the input information from the distribution designer may be passed to our third module, the DDODB branch-and-bound module [12]. This module represents an alternative approach to the heuristic module in searching for the best fragmentation schema for a given database application. The branch-and-bound procedure searches for an optimal solution in the space of potentially good fragmentation schemas for an application and outputs its result to the distribution designer. Although the search space covered by the branch-and-bound algorithm is much larger than the one covered by the heuristic algorithm, its execution cost is also much higher. To handle this, the branch-and-bound algorithm tries to bound its search for the best fragmentation schema by using a query processing cost function during the evaluation of each fragmentation schema in the hypotheses space. This cost function, defined in [13], is responsible for estimating the execution cost of queries on top of the distributed database being evaluated. The branch-and-bound algorithm then discards all the fragmentation schemas with an estimated cost higher than the cost of the fragmentation schema output from the heuristic module.
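The pruning idea described above can be sketched as follows. This is a toy illustration under stated assumptions, not the algorithm of [12] nor the cost function of [13]: the schema is modeled as one technique per class, the hypothetical `class_cost` callback returns a non-negative, additive cost per class (so the cost of a partial schema is a valid lower bound), and `upper_bound` stands for the estimated cost of the schema produced by the heuristic module.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical set of choices for each class in this sketch.
TECHNIQUES = ("none", "horizontal", "vertical", "hybrid")
Schema = Dict[str, str]


def branch_and_bound(
    classes: List[str],
    class_cost: Callable[[str, str], float],
    upper_bound: float,
) -> Tuple[Optional[Schema], float]:
    """Search technique assignments, pruning with the heuristic schema's cost.

    Returns (schema, cost). The schema is None when no candidate costs
    at most `upper_bound`, i.e. the heuristic schema is kept.
    """
    best_schema: Optional[Schema] = None
    best_cost = upper_bound

    def extend(index: int, partial: Schema, cost_so_far: float) -> None:
        nonlocal best_schema, best_cost
        # Bound: costs are non-negative, so no completion of this partial
        # schema can cost less than cost_so_far. Discard it if already worse.
        if cost_so_far > best_cost:
            return
        if index == len(classes):
            # Complete schema: keep it if it is the first one found within
            # the bound or strictly improves on the best so far.
            if best_schema is None or cost_so_far < best_cost:
                best_schema, best_cost = dict(partial), cost_so_far
            return
        cls = classes[index]
        # Branch: try every technique for the next class.
        for technique in TECHNIQUES:
            partial[cls] = technique
            extend(index + 1, partial, cost_so_far + class_cost(cls, technique))
            del partial[cls]

    extend(0, {}, 0.0)
    return best_schema, best_cost
```

A call such as `branch_and_bound(["A", "B"], cost_fn, heuristic_cost)` returns a schema only when the search finds one whose estimated cost does not exceed the heuristic result. Tightening the bound each time a better schema is found is a common refinement and goes slightly beyond what the text states.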
Finally, the result from the branch-and-bound algorithm, as well as the fragmentation schemas discarded during the search, may generate examples (positive or negative) for the TREND3 module, thus incorporating the branch-and-bound results into the DDODB heuristic module. The complete framework is detailed in [14].

CONCLUSIONS

This work presents a framework to handle the class fragmentation problem during the design of distributed object databases. The framework works at the conceptual level, and thus uses the object data model to capture the application semantics represented by the user. The proposed framework integrates three modules (heuristic, knowledge-based and branch-and-bound). The heuristic module defines a set of heuristics to drive the fragmentation of object databases and incorporates them in a methodology that includes an analysis algorithm and horizontal and vertical class fragmentation algorithms, addressing the need mentioned by Özsu and Valduriez [1] for a distribution design methodology that encompasses the horizontal and vertical fragmentation algorithms and uses them as part of a more general strategy. Experiments using our methodology resulted in fragmentation schemas with better performance results when compared to other fragmentation schemas proposed in the literature.

The main contribution of the heuristic module is the analysis phase, which chooses the most adequate fragmentation technique to be applied to each class of the database schema, based on heuristics derived from experimental results previously obtained. With current algorithms proposed in the literature, the distribution designer is induced to apply one single type of fragmentation to all classes. Even when the designer decides to apply a horizontal fragmentation algorithm to one class and a vertical fragmentation algorithm to another, the designer is left with no assistance in making this decision.

REFERENCES

[1] M. Özsu and P. Valduriez, Principles of Distributed Database Systems, 2nd edition (1st edition 1991), New Jersey, Prentice-Hall,
[2] L. Bellatreche, K. Karlapalem and A. Simonet, "Algorithms and Support for Horizontal Class Partitioning in Object-Oriented Databases", International Journal of Distributed and Parallel Databases, Kluwer Academic Publishers, vol. 8(2), 2000, pp
[3] Y. Chen and S. Su, "Implementation and Evaluation of Parallel Query Processing Algorithms and Data Partitioning Heuristics in Object Oriented Databases", International Journal of Distributed and Parallel Databases, Kluwer Academic Publishers, vol. 4(2), 1996, pp
[4] C. Ezeife and K. Barker, "Distributed Object Based Design: Vertical Fragmentation of Classes", International Journal of Distributed and Parallel Databases, Kluwer Academic Publishers, vol. 6(4), 1998, pp
[5] K. Karlapalem, S. Navathe and M. Morsi, "Issues in Distribution Design of Object-Oriented Databases". In M. Özsu et al. (eds.), Distributed Object Management, Morgan Kaufmann Publishers Inc., San Francisco, USA,
[6] M. Savonnet, M. Terrasse and K. Yétongnon, "Fragtique: A Methodology for Distributing Object Oriented Databases". In Proceedings of the International Conference on Computing and Information (ICCI'98), Winnipeg, Canada, 1998, pp
[7] F. Baião and M. Mattoso, "A Mixed Fragmentation Algorithm for Distributed Object Oriented Databases". In Special Issue of the Journal of Computing and Information (JCI), vol. 3(1), ICCI 98, March 2000, ISSN , pp
[8] F. Baião, M. Mattoso and G. Zaverucha, "Towards an Inductive Design of Distributed Object Oriented Databases". In Proceedings of the Third IFCIS Conference on Cooperative Information Systems (CoopIS'98), IEEE CS Press, New York, USA, Aug 1998, pp
[9] F. Baião, M. Mattoso and G. Zaverucha, "Horizontal Fragmentation in Object DBMS: New Issues and Performance Evaluation". In Proceedings of the 19th IEEE International Performance, Computing and Communications Conference (IPCCC 2000), IEEE CS Press, Phoenix, Feb 2000, pp
[10] N. Lavrac and S. Dzeroski, Inductive Logic Programming: Techniques and Applications, Ellis Horwood,
[11] F. Baião, M. Mattoso, J. Shavlik and G. Zaverucha, "Applying Theory Revision in the Design of Distributed Databases". In preparation, Feb
[12] F. Baião, M. Mattoso, J. Shavlik and G. Zaverucha, "A Branch-and-Bound Approach for the Design of Distributed Databases". In preparation, Feb
[13] G. Ruberg, F. Baião and M. Mattoso, "A Cost Model for the Evaluation of Path Expressions in Distributed Object Databases", submitted for publication, Nov
[14] F. Baião, "A Methodology and Algorithms for the Design of Distributed Databases using Theory Revision", D.Sc. Thesis, COPPE/UFRJ, Dec