Transparent Indexing in Distributed Object-Oriented Databases



TECHNICAL UNIVERSITY OF LODZ
Faculty of Electrical, Electronic, Computer and Control Engineering
Computer Engineering Department

mgr inż. Tomasz Marek Kowalski

Ph.D. Thesis
Transparent Indexing in Distributed Object-Oriented Databases

Supervisor: prof. dr hab. inż. Kazimierz Subieta

Łódź 2009

To my wife Kasia

INDEX OF CONTENTS

SUMMARY
ROZSZERZONE STRESZCZENIE
CHAPTER 1 INTRODUCTION
  Context
  Short State of the Art of Indexing in Databases
  Research Problem Formulation
  Proposed Solution
  Main Theses of the PhD Dissertation
  Thesis Outline
CHAPTER 2 INDEXING IN DATABASES - STATE OF THE ART
  Database Index Properties
    Transparency
    Indices Classification
  Index Data-Structures
    Linear-Hashing
    Scalable Distributed Data Structure (SDDS)
  Relational Systems
  OODBMSs
    db4o Database
    Objectivity/DB
    ObjectStore
    Versant
    GemStone's Products
  Advanced Solutions in Object-Relational Databases
    Oracle's Function-based Index Maintenance
  Global Indexing Strategies in Parallel Systems
    Central Indexing
    Strategies Involving Decentralised Indexing
  Distributed DBMSs
CHAPTER 3 THE STACK-BASED APPROACH
  Abstract Data Store Models
    AS0 Model
    Abstract Store Models Supporting Inheritance
    Example Database Schema
    Example Store with Static Inheritance of Objects
  Environment and Result Stacks
    Bind Operation
    Nested Function
  SBQL Query Language
    Expressions Evaluation
    Imperative Statements Evaluation
  Static Query Evaluation and Metabase
    Type Checking
  Updateable Object-Oriented Views
CHAPTER 4 ORGANISATION OF INDEXING IN OODBMS
  Implementation of a Linear Hashing Based Index
    Index Key Types
    Example Indices
  Index Management
    Index Creating Rules and Assumed Limitations
  Automatic Index Updating
    Index Update Triggers
    The Architectural View of the Index Update Process
    SBQL Interpreter and Binding Extension
    Example of Update Scenarios
      Conceptual Example
      Path Modification
      Keys with Optional Attributes
      Polymorphic Keys
    Optimising Index Updating
    Properties of the Solution
    Comparison of Index Maintenance Approaches
  Indexing Architecture for Distributed Environment
    Global Indexing Management and Maintenance
    Example on Distributed Homogeneous Data Schema
CHAPTER 5 QUERY OPTIMISATION AND INDEX OPTIMISER
  Query Optimisation in the ODRA Prototype
  Index Optimiser Overview
    General Algorithm
  Selection Predicates Analysis
    Incommutable Predicates
    Matching Index Key Values Criteria
    Processing Inclusion Operator
  Role of a Cost Model
    Estimation of Selectivity
  Query Transformation Applying Indices
    Index Invocation Syntax
    Rewriting Routines
    Processing Disjunction of Predicates
    Optimising Existential Quantifier
    Reuse of Indices through Inheritance
  Secondary Methods
    Factoring Out Independent Subqueries
    Pushing Selection
    Methods Assisting Invoking Views
    Syntax Tree Normalisation
    Harmful Methods
  Optimisations Involving Distributed Index
    Rank Queries Optimisation
    Hoare's Algorithm in Distributed Environment
    Modification of Hoare's Algorithm
  Increasing Query Flexibility with Respect to Indices Management
CHAPTER 6 INDEXING OPTIMISATION RESULTS
  Test Data Distribution
  Sample Index Optimisation Test
  Omitting Key in an Index Call Test
  enum Key Types Test
  Multiple Index Invocation Test
  Complex Expression Based Index Test
  Disjunction of Predicates Test
CHAPTER 7 INDEXING FOR OPTIMISING PROCESSING OF HETEROGENEOUS RESOURCES
  Volatile Indexing
    Conditions for Volatile Indexing Optimisation
    Index Materialisation
    Solution Properties
    Proof of Concept Test
  Optimising Queries Addressing Heterogeneous Resources
    Overview of a Wrapper to RDBMS
    Volatile Indexing Technique Test
CHAPTER 8 CONCLUSIONS
  Future Work
INDEX OF FIGURES
INDEX OF TABLES
BIBLIOGRAPHY

SUMMARY

The Ph.D. thesis focuses on the development of a robust transparent indexing architecture for distributed object-oriented databases. The solution comprises management facilities, an automatic index updating mechanism and an index optimiser. From the conceptual point of view, transparency is the most essential property of a database index. It implies that programmers need not include explicit operations on indices in an application program. Usually a query optimiser automatically inserts references to indices into a query execution plan when necessary. The second aspect of transparency concerns the mechanism maintaining cohesion between existing indices and the indexed data. So-called automatic index updating detects data modifications and reflects them in indices accordingly. The thesis has been developed in the context of the Stack-Based Architecture (SBA) [1, 117], a theoretical and methodological framework for developing object-oriented query and programming languages. The developed query optimisation methods are based on the corresponding Stack-Based Query Language (SBQL). The orthogonality of SBQL constructs makes it simple to define complex selection predicates accessing arbitrary data. The main goal of the work is to design an indexing architecture facilitating the processing of a possibly wide family of predicates. This requires a generic and complete approach to the problem of index transparency. The solution presented in the thesis provides transparent indexing employing single- or multiple-key indices in a distributed homogeneous object-oriented environment. The selection of an index structure, either centralised or distributed, is not restricted. The work extensively describes optimisation methods facilitating processing in the context of the where operator, i.e. selection, considering the role of a cost model, conjunction and disjunction of predicates, and class inheritance.
The author proposes a robust approach to automatic index updating capable of dealing with index keys based on arbitrary, deterministic and side-effect-free expressions. Consequently, optimised selection predicates can be freely composed of various SBQL constructs, in particular algebraic and non-algebraic operators, path expressions, aggregate functions and class method invocations. The solution also takes inheritance and polymorphism into consideration.

A part of the thesis concerns optimisation methods devoted to distributed object-oriented databases, enabling efficient parallel processing of queries. In particular, one of the designed methods concerns the optimisation of rank queries. It makes it possible to take advantage of distributed and scalable index structures. A particularly difficult query optimisation domain concerns processing queries addressing heterogeneous resources. The volatile indexing technique proposed by the author is a significant step in this matter. This solution relies on the developed indexing architecture. Additionally, it can be applied to data virtually accessible through SBQL views. In contrast to regular indices, a volatile index is materialised only during query evaluation. Therefore, the efficacy of this technique shows when the index is invoked multiple times, which mainly concerns the processing of complex and laborious queries. A key aspect of the development of database query optimisation methods is the preservation of the original query semantics. Consequently, for the designed optimisation methods the author has determined rules in the context of the assumed object data model and the SBQL query language. With this knowledge a database programmer can be assisted and advised, e.g. by a compiler, on how to design safe and optimisable queries. Moreover, the conducted research can also assist database designers. Among other things, the potential influence of other optimisation methods on indexing has been verified. A significant part of the algorithms and solutions developed in the thesis has been verified and confirmed in the prototype ODRA OODBMS implementation [58, 59].

Keywords: indexing, database, distributed database, object-oriented, query optimisation, SBA, SBQL, ODRA
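The volatile indexing idea summarised above can be illustrated with a minimal sketch (Python is used purely for illustration; all names, such as evaluate_with_volatile_index, are hypothetical and do not come from the thesis): an index over one collection is materialised only for the duration of a single query that probes it repeatedly, and is discarded afterwards, so no automatic index maintenance is ever needed.

```python
# Sketch of the volatile indexing idea (illustrative only, not the ODRA
# implementation): when a complex query probes the same collection many
# times, an index can be materialised on the fly for the duration of the
# query and dropped afterwards.

def evaluate_with_volatile_index(orders, customers):
    """Join-like query: for each order, find its customer by id."""
    # Materialise a volatile index over the (possibly virtual) collection.
    index = {}
    for c in customers:
        index.setdefault(c["id"], []).append(c)
    # The index pays off because it is probed once per order.
    result = []
    for o in orders:
        for c in index.get(o["cust_id"], []):
            result.append((o["no"], c["name"]))
    # The index is simply discarded here; no maintenance is ever needed.
    return result
```

The sketch mirrors the claim in the summary: the technique is worthwhile precisely when the index is invoked more than once during the evaluation of a single, laborious query.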

ROZSZERZONE STRESZCZENIE (EXTENDED ABSTRACT)

LODZ UNIVERSITY OF TECHNOLOGY
Faculty of Electrical, Electronic, Computer and Control Engineering
Computer Engineering Department (Katedra Informatyki Stosowanej)
Doctoral thesis: Przezroczyste indeksowanie w rozproszonych obiektowych bazach danych (Transparent Indexing in Distributed Object-Oriented Databases)

Databases form the foundation of many large and, nowadays, often distributed computer systems. Managing systems of such size and complexity is made easier by object-oriented technologies. Industry, however, leans towards relational solutions when efficiency is the key issue. This aspect is still neglected in databases based on object-oriented paradigms, due to the shortage of advanced optimisation procedures. Indexing is the most important optimisation method in databases. The essential concept of indexing in object-oriented databases does not differ from indexing in relational systems [15, 20, 29, 54, 55, 65]. From the conceptual point of view, the most essential property of a database index is transparency. It means that the programmer of a database application does not have to be aware of the existence of indices. Most often a query optimiser is responsible for the automatic use of indices. The second important aspect of transparency is related to maintaining consistency between indices and the indexed data. This is the problem of so-called automatic index updating. Modifications in the database should be automatically detected and reflected in the appropriate indices. In distributed databases the most advanced solutions are based on static index partitioning. They are implemented in the leading object-relational products. They only allow index keys to be defined using simple expressions over data located in

one table. Regarding the global optimisation of queries addressing heterogeneous resources, the author has not found any formalised indexing-based methods in the research literature. The analysis of the state of the art unambiguously indicates the need to develop indexing methods and architectures for distributed object-oriented databases. The orthogonality of the SBQL language makes it exceptionally easy to define complex selection predicates referring to arbitrary data. The main goal of the work is to develop an indexing architecture that would support the processing of a possibly wide family of predicates. This requires a generic and complete approach to the problem of transparency. Since the work concerns distributed object-oriented databases, another important goal is the development of optimisation methods enabling the parallelisation of computations, in particular through the use of distributed scalable index structures. A particularly difficult domain in the context of optimisation is the processing of queries addressing distributed heterogeneous resources. For this reason, as a further goal of the work, the author set out to identify a transparent and efficient indexing strategy that could be applied at the level of the global database schema. A key aspect of the work on all the optimisation methods is the preservation of the original query semantics. To this end, the author determined rules that apply to the developed methods in the context of the assumed object data model and the SBQL query language. Knowledge of these rules can help programmers build queries whose form enables automatic optimisation. Additionally, the conducted research can also be helpful to database designers. Among other things, the potential influence of other optimisation methods on the operation of the index-based optimiser has been determined.
The solutions proposed by the author in the doctoral dissertation are presented in the context of the Stack-Based Architecture (SBA) [1, 117] and the resulting query language (SBQL, Stack-Based Query Language). The Stack-Based Architecture is a formal methodology concerning object-oriented query and programming languages in databases.

The theses of the dissertation are as follows:
1. Processing of selection predicates based on arbitrary key expressions accessing data in a distributed object-oriented database can be optimised by centralised or distributed transparent indexing.
2. Evaluation of complex queries addressing distributed heterogeneous resources can be facilitated by techniques employing transparent index-based optimisation.

In documenting the above theses, the index management system and the index-applying query optimiser designed by the author were used. An additional element closely related to the first thesis is the author's approach to the problem of automatic index updating. The presented solution provides transparent indexing using indices based on one or many keys. The optimisation concerns the processing of selection predicates based on arbitrary, deterministic and side-effect-free expressions, which may consist of e.g. path expressions, aggregate functions and invocations of class methods (taking inheritance and polymorphism into account). The proposed indexing architecture can be applied to distributed homogeneous data sources. The choice of the index structure, centralised or distributed, is not restricted in any way. The author also proposed a method for the optimisation of rank queries which makes it possible to use both existing local indices and a distributed, scalable global index. The solution proposed by the author in order to prove the second thesis of the work is the volatile indexing technique. It relies on the same indexing architecture, but additionally can be applied to the processing of heterogeneous data virtually accessible through SBQL views. In contrast to regular indices, a volatile index is materialised only during query evaluation.
The presented technique is effective in the processing of complex queries in which the index is invoked more than once. The algorithms and solutions developed in connection with the theses of the work have to a significant extent been verified and confirmed on a prototype implementation in the ODRA object-oriented database [58, 59].

The dissertation is divided into eight chapters, briefly described below:

Chapter 1 Introduction
The first chapter introduces the subject of the work, presenting its context, a concise description of the state of the art in the field and the author's motivation. The goals of the work are formulated and the problems related to them are identified. In this context the theses of the dissertation are discussed in detail and the solutions developed by the author are outlined.

Chapter 2 Indexing in Databases - State of the Art
The state-of-the-art description presents the basic notions related to indexing in databases. Representative examples of existing solutions in industry and in the research literature are cited. The chapter contains an overview of various index structures, with particular attention to linear hashing, which has been used in the author's solution. Additionally, centralised and distributed indexing strategies in various distributed systems are examined.

Chapter 3 The Stack-Based Approach
The chapter concerns the theoretical foundations of the theses, i.e. the Stack-Based Architecture (SBA) and the resulting SBQL query language. Descriptions of the basic notions are given: the environment stack, the result stack, name binding, static query evaluation and updateable object-oriented views.

Chapter 4 Organisation of Indexing in OODBMS
This part of the work presents the indexing architecture designed and to a significant extent implemented in the ODRA object-oriented database. The basic properties of the employed index structure and of the index management module are described. The author's mechanism providing transparent, automatic index updating, based on the idea of index update triggers, is also presented.
The presented concept is extended for the purposes of global indexing in the context of the distributed architecture developed in the ODRA project.

Chapter 5 Query Optimisation and Index Optimiser
The chapter concentrates on the methods of transparent use of indices in query optimisation developed by the author. Algorithms concerning the transformation of the intermediate query tree and the rules related to them are presented. Particular emphasis is placed on preserving the original query semantics in the optimisation process. The developed methods are supported with appropriate real examples of transformations in the SBQL language. The author also presents a discussion of the influence of other query optimisation methods on indexing. The chapter also discusses optimisation methods dedicated to the processing of global queries in a distributed environment. In this respect, the author's approach to the optimisation of rank queries in a distributed database architecture, based on a modified Hoare's algorithm, is presented.

Chapter 6 Indexing Optimisation Results
The chapter presents the results of tests of the implemented indexing system. The results confirm the effectiveness and efficiency of the developed methodology. The tests empirically confirm the correctness of the applied solutions described in Chapters 4 and 5. As a whole, this constitutes the proof of the first thesis of the dissertation.

Chapter 7 Indexing for Optimising Processing of Heterogeneous Resources
This part of the work proves the second thesis of the dissertation. The chapter presents the so-called volatile indexing technique and its application in the optimisation of queries addressing distributed heterogeneous data. The effectiveness of the proposed technique is confirmed by a test in which an SBQL query addressing the resources of an object-oriented database and resources located in a relational database is optimised.
Chapter 8 Conclusions
The final chapter summarises the work on the architecture of the indexing system for a distributed object-oriented database. The developed solutions and the research results unambiguously confirming the theses of the doctoral dissertation are listed. Finally, directions of further research in this field are indicated.

Chapter 1 Introduction

Databases are a fundamental feature of many large computer applications. In many cases databases have to be geographically distributed. The size and complexity of such systems require developers to take advantage of modern software engineering methods, which as a rule are based on the object-oriented approach (cf. the UML notation). In contrast, the industry still widely uses relational databases. While their efficiency in the majority of applications cannot be questioned, many professionals point out their drawbacks. One of the major drawbacks is the so-called impedance mismatch. The mismatch concerns many incompatibilities between object-oriented design and relational implementation. It also concerns incompatibilities between object-oriented programming (in languages such as C++, Java and C#) and SQL, the primary programming interface to relational databases. For this reason, over the last two decades ever new object-oriented database management systems have been proposed. Some of them are well recognised on the market (e.g. ObjectStore, Objectivity/DB, Versant, db4o, and others); however, the scale of their applications is at least an order of magnitude smaller than that of relational systems (some of which are extended by object-oriented features). One of the reasons for the relatively low acceptance of commercial object-oriented databases concerns their query languages, which are considered very limited and treated as secondary in the development of applications. This is in sharp contrast to relational systems, where SQL is considered the primary factor stimulating their success. In this research we focus on equipping object-oriented database systems with a powerful and efficient query language. The power of such a language should not be lower than the power of SQL. The performance of such a language requires powerful query optimisation methods.
Query optimisation in object-oriented database management systems has been deeply investigated over the last two decades. Unfortunately, this research remains mostly unimplemented in today's OODBMSs for many reasons: limited query languages, proposed methods that turned out to be non-implementable, lack of interest of commercial companies, etc. In this thesis we investigate a well-known and the most important method of performance improvement known as indexing. The research addresses this subject in the

context of the Stack-Based Architecture (SBA), which is a theoretical and methodological framework for developing object-oriented query and programming languages. The solutions that we have developed are implemented and tested in the ODRA OODBMS prototype [58, 59], which is based on SBA and its own query language SBQL (Stack-Based Query Language).

1.1 Context

The Stack-Based Architecture (SBA) is a formal methodology addressing object-oriented database query and programming languages [1, 117]. It assumes the object relativism principle, which claims no conceptual difference between objects of different kinds or stored on different object hierarchy levels. Everything (e.g. a Person object, a salary attribute, a procedure returning the age of a person and a view returning well-paid employees) is considered an object with its own unique identifier. SBA reconstructs query language concepts from the point of view of programming languages (PLs), introducing notions and methods developed in the domain of programming languages (e.g. environment stack, result stack, nesting and binding names). ODRA (Object Database for Rapid Application development) is a prototype object-oriented database management system based on the Stack-Based Architecture (SBA) [2, 119]. ODRA introduces its own query language SBQL, which is integrated with programming capabilities and abstractions, including database abstractions: updatable views, stored procedures and transactions. The main goal of the ODRA project is to develop new paradigms of database application development together with a distributed database-oriented and object-oriented execution environment.

1.2 Short State of the Art of Indexing in Databases

The general idea of indices in object-oriented databases does not differ from indexing in relational databases [15, 20, 29, 54, 55, 65]. The most characteristic property of database indexing is transparency.
A programmer of database applications does not need to be aware of the existence of indices, as they are utilised by the database engine automatically. This is usually accomplished by a query optimiser that automatically inserts references to indices into a query execution plan when necessary. The second important aspect of transparency concerns maintaining cohesion between existing indices and the data that is indexed. Data modifications are

automatically detected and corresponding changes are reflected in indices. This process is called automatic index updating. Many indexing methods can be adopted from relational database systems, and their applicability can even be significantly extended. There are also situations where indexing methods from RDBMSs become outdated in object-oriented databases. In particular, join operations do not need to be supported, because in object databases the necessity for joins is much lower due to object identifiers and explicit pointer links in the database. In the object-oriented database domain, research into indexing has mainly focused on path expression processing and inheritance hierarchies inside indexed collections [10, 11, 12, 21, 67, 77, 81, 111]. Some papers propose generic approaches to providing automatic index maintenance transparency [43, 46]. However, there is no information that these proposals have actually been incorporated in commercial or open source database products. Indexing is also an important subject in a distributed environment. Most of the research concerns the development of various distributed index structures and global indexing strategies. Many works are conducted in the context of data exchange in p2p networks. In databases, the most advanced solutions are based on static index partitioning. They are implemented in leading object-relational products. Nevertheless, an index key definition is limited to expressions accessing data from only one table. The author has not found in the research literature any formalised global optimisation methods based on indexing for processing queries involving heterogeneous resources. The analysis of the state of the art unambiguously indicates that the development of indexing methods and architectures dedicated to distributed object-oriented databases is still a valid and challenging subject.
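As a rough illustration of the automatic index updating problem discussed above (a generic sketch under assumed names, not the ODRA mechanism described later in the thesis), a store can intercept modifications, re-evaluate a deterministic, side-effect-free key expression, and move the affected index entry so that the index stays coherent with the data:

```python
# Minimal sketch of automatic index updating: every modification goes
# through the store, which recomputes the key expression for the changed
# object and relocates its index entry. All names are illustrative.

class IndexedStore:
    def __init__(self, key_expr):
        self.key_expr = key_expr          # e.g. lambda obj: obj["city"]
        self.objects = {}                 # oid -> object
        self.index = {}                   # key value -> set of oids

    def insert(self, oid, obj):
        self.objects[oid] = obj
        self.index.setdefault(self.key_expr(obj), set()).add(oid)

    def update(self, oid, field, value):
        obj = self.objects[oid]
        old_key = self.key_expr(obj)      # key before the modification
        obj[field] = value                # the actual data modification
        new_key = self.key_expr(obj)      # "trigger": recompute the key
        if new_key != old_key:            # and move the index entry
            self.index[old_key].discard(oid)
            self.index.setdefault(new_key, set()).add(oid)

    def lookup(self, key):
        return self.index.get(key, set())
```

The hard cases named in the thesis (inheritance, polymorphic keys, path expressions, distribution) amount to the difficulty of knowing which objects influence the value of key_expr, which this toy store sidesteps by routing every update through itself.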
1.3 Research Problem Formulation

The orthogonality of SBQL language constructs allows selection predicates to be defined using complex and robust expressions accessing arbitrary data. Transparent indexing of objects to facilitate the processing of queries involving such predicates requires the development of a generic and complete solution. In particular, achieving automatic index updating transparency is simple only in the case of indices defined on simple keys, i.e. direct attributes and table columns. Inheritance, method

polymorphism, data distribution, etc. make it difficult to identify the objects influencing the value of an index key. Data processing in a distributed environment enables parallel processing of queries and may take advantage of distributed and scalable index structures. This creates a demand for introducing an appropriate indexing architecture and specific optimisation methods. An even more complex task concerns the evaluation of queries addressing a heterogeneous distributed environment. From the point of view of performance it is vital to exploit local resource optimisation methods and to develop robust techniques improving query processing at the global schema level. Identifying effective transparent global indexing strategies is in this context a significant, but particularly challenging subject. Finally, each optimisation method improving query performance must ensure the preservation of query semantics. Therefore, in the context of a query language and an object model, the appropriate rules for exploiting such methods must be determined. With this knowledge a database programmer can be assisted and advised, e.g. by a compiler, on how to design proper optimisable queries.

1.4 Proposed Solution

In order to provide transparent indexing in distributed object-oriented databases, the author of this thesis proposes the following tenets:
- precisely defined index management facilities and a convenient syntax for an index call to be used in query optimisation,
- a set of algorithms, optimisation methods and rules composing the index optimiser, i.e. the module responsible for detecting parts of a query that can be substituted with an index call and performing appropriate query transformations,
- a generic automatic index maintenance solution based on index update definitions assigned to indices and the associated index update triggers assigned to objects participating in indexing,
- the volatile indexing technique, enabling taking advantage of the developed indexing architecture while omitting the troublesome issue of automatic index maintenance in the processing of a specific family of queries addressing heterogeneous resources.
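The core job of the index optimiser named in the tenets above — detecting a part of a query that can be substituted with an index call — can be sketched as follows (a simplified illustration with hypothetical names, not the actual SBQL rewriting developed in the thesis). The essential requirement is that the rewritten form returns exactly the same result as the unoptimised evaluation, so that query semantics are preserved:

```python
# Illustrative sketch: a selection predicate of the form key == constant
# is answered by an index call instead of a full scan; both strategies
# must return the same result so that query semantics are preserved.

def full_scan(objects, key_expr, value):
    # Naive evaluation of: objects where key_expr == value
    return [o for o in objects if key_expr(o) == value]

def build_index(objects, key_expr):
    # Redundant auxiliary structure: key value -> matching objects
    index = {}
    for o in objects:
        index.setdefault(key_expr(o), []).append(o)
    return index

def index_call(index, value):
    # The rewritten query: a direct index lookup replaces the predicate.
    return index.get(value, [])
```

A cost model (discussed in Chapter 5) decides whether the substitution is worthwhile; the sketch only shows that the substitution itself is semantics-preserving.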

The most important properties necessary to provide the desired index behaviour have been implemented in the ODRA OODBMS prototype and are operational [59].

1.5 Main Theses of the PhD Dissertation

The summarised theses are:
1. Processing of selection predicates based on arbitrary key expressions accessing data in a distributed object-oriented database can be optimised by centralised or distributed transparent indexing.
2. Evaluation of complex queries involving distributed heterogeneous resources can be facilitated by techniques taking advantage of transparent index optimisation.

The common basis for accomplishing the theses comprises the developed indexing management facilities and the index optimiser. The first thesis is additionally supported by the author's generic approach to automatic index maintenance. The proposed approach provides transparent indexing using single- or multiple-key indices. It applies to selection predicates based on arbitrary, deterministic and side-effect-free expressions consisting of e.g. path expressions, aggregate functions and class method invocations (addressing inheritance and polymorphism). An extensive part of the work comprises optimisation methods facilitating processing in the context of the where operator (i.e. selection), considering the role of a cost model, conjunction and disjunction of predicates, and class inheritance. The proposed architecture can handle homogeneous data distribution and distributed index structures. The selection of an index structure, either centralised or distributed, is not restricted. The author also introduces an efficient method for the optimisation of rank queries taking advantage of indexing in a distributed environment. The solution proposed by the author to address the second thesis is the volatile indexing technique. It relies on the same indexing architecture, but also addresses data virtually accessible through SBQL views.
A volatile index differs from a regular index in that it is materialised only during query evaluation. Therefore, the efficacy of this technique can be seen in the processing of laborious queries where the index is invoked more than once. A significant part of the theses has been verified and confirmed by a prototype

implementation in the ODRA OODBMS. The only important aspect still to be implemented and validated in the future concerns data and index distribution in the context of the first thesis. This element is planned to be finished together with the development of a distributed infrastructure in the ODRA prototype.

1.6 Thesis Outline

The thesis is organised as follows:

Chapter 1 Introduction
The chapter presents a general overview of the thesis subject, its context, the author's motivation, the formulation of the problem and the objectives of the research, the theses, and a description of the developed solutions.

Chapter 2 Indexing in Databases - State of the Art
The state of the art chapter introduces basic concepts concerning indexing in databases together with an overview of solutions existing in commercial products and in the research literature. Additionally, an inspection of the variety of index structures and of indexing strategies applying to centralised and distributed environments is provided.

Chapter 3 The Stack-Based Approach
The theoretical foundation of the thesis is the Stack-Based Architecture (SBA) and the corresponding query language SBQL. The chapter introduces the basic notions relevant to the work, including the environment and result stacks, static query evaluation and updateable object-oriented views.

Chapter 4 Organisation of Indexing in OODBMS
The chapter presents the indexing architecture designed and implemented in the ODRA OODBMS. It focuses particularly on the basic properties of the employed index structure, the designed indexing management facilities and the module providing automatic index updating transparency (based on the author's index update triggers concept). Finally, extending the architecture to distributed databases is discussed.
Chapter 5 Query Optimisation and Index Optimiser
The algorithms and rules responsible for taking advantage of indices in transparent, semantics-preserving query optimisation are presented and explained with examples. The chapter includes a description of the indexing methods designed

for a distributed environment and a discussion of the influence of secondary methods on indexing.

Chapter 6 Indexing Optimisation Results
The chapter presents results of tests confirming the efficiency of the methods presented in the thesis.

Chapter 7 Indexing for Optimising Processing of Heterogeneous Resources
The chapter focuses on the volatile indexing technique and presents its application in the optimisation of queries addressing heterogeneous resources. The description is supported by an appropriate test proving the efficacy of this technique.

Chapter 8 Conclusions
The chapter draws conclusions concerning the achieved objectives and outlines areas of future work.

Chapter 2 Indexing in Databases - State of the Art

Indices are auxiliary (redundant) data structures stored at a server. A database administrator manages a pool of indices, creating new ones or removing existing ones depending on current needs. Just as the index at the end of a book is used for quick page finding, a database index makes it possible to quickly retrieve objects (or records) matching given criteria. Because indices are relatively small (compared to the whole database), the gain in performance is fully justified by some extra storage space. Due to single-aspect search, which allows for a very efficient physical organisation, the gain in performance can reach even several orders of magnitude. In general, an index can be considered a two-column table where the first column consists of unique key values and the other one holds non-key values, usually references to objects or database table rows. Key values are used as input for an index search procedure. As a result, the procedure returns the corresponding non-key values from the same table row. In query optimisation, indices are usually used in the context of a where operator when the left operand refers to a collection indexed by key values composing the right operand. [29, 33, 118]

Fig. 2.1 Typical stages of high-level language query optimisation [29] (the figure shows the pipeline: high-level language query, syntactic analysis and validation, intermediate query representation, query optimiser, query evaluation plan, query code generator, query code, runtime database processor, query result)

A database query is expressed in a high-level query language (e.g. SQL, OQL). Fig. 2.1 presents the general steps of processing a query. First, it is subjected to syntactic analysis (parsing). Next, it is validated for semantic correctness and accordance with the present database schema. The database uses an internal query

representation, usually organised into a tree or graph structure. There may be many execution strategies that a DBMS can follow to obtain an answer to a query. In terms of query results all execution plans are equivalent, but the cost difference between alternative plans can be enormous. The cost is usually measured as the time needed to complete query execution. A database query optimiser should efficiently estimate the cost of a plan. The final steps of query processing consist of code generation according to the designed execution strategy and eventually its execution [29, 54, 118]. An important part of designing an execution plan is the analysis of database indices. The query optimiser should be capable of identifying the parts of a query whose evaluation can be assisted by indexing. Next, with the help of a database cost model, it has to decide which combination of indices would minimise the cost of query execution. An important task of database administrators is to manage the pool of indices, which is part of the processes of physical design and tuning of a database. The obvious advantages of a database index follow from its physical and conceptual properties. However, when the design is improper, processing queries through indices may harm the global processing time. Such disadvantages are usually caused by frequent database updates, which may totally undermine the gain in query processing due to an index, because the cost of updating the index exceeds the gain due to faster query processing.

2.1 Database Index Properties

Indices are an essential constituent of a database's architecture. Obviously, their central feature is a data structure that can be efficiently organised, searched and maintained. Nonetheless, their actual strength lies in the unique properties and versatile utilisation of a database index.
A significant advantage, and a partial cause of the success of large database systems, is indexing transparency.

2.1.1 Transparency

In the common approach, the programmer should not involve explicit operations on indices in an application program. To make indexing transparent from the point of view of a database application programmer, a system ought to ensure two important functionalities: index optimisation and automatic index updating. The first functionality means that indices are used automatically during query

evaluation. Therefore, the administrator of a database can freely establish new indices and remove them without changing application code. The responsibility for ensuring such transparency lies with query optimisation and particularly with the index optimiser. The second functionality, i.e. automatic index updating, is also referred to in the research literature as index maintenance or dynamic index adaptation. It is a response to changes in a database. Indices, like all redundant structures, can lose consistency with the data if the database is updated. An automatic mechanism should alter, remove or rebuild an index in the case of database updates that affect the validity of its contents. Consequently, the gain in query performance coming from indexing compromises insertion, deletion and data modification speed, since such operations require suitable updates to indices. Thus, it is an administrator's responsibility to manage indices judiciously so as not to cause overall database performance deterioration, particularly in update-intensive systems. In general, databases provide the user with full transparency. Nevertheless, some approaches let administrators and application designers decide about the degree of index transparency and explicitly control index state depending on need. Occasionally, transparency is supported only to a limited extent, burdening the database user.

2.1.2 Indices Classification

According to [29] there are three essential kinds of database index:
- primary index - physically ordering data on a disk or in memory according to some unique property field (each record must contain a unique value for such a field, the so-called primary key),
- clustering index - introducing physical data order according to a non-unique property (i.e.
when several data records can have an equal value of the ordering fields),
- secondary index - providing alternative access to data according to designated criteria without affecting their actual location (also called secondary access paths or methods).

Since only one physical ordering is possible, a data table or collection can have only one primary index or clustering index, not both. The limit concerning the number of

secondary indices depends on the database. In reality, departures from the above definitions often occur. For example, in some databases data and indices are stored separately, and even primary or clustering indices contain only references to the actual data, which are stored physically e.g. in a linked list. Indices can also be classified according to the relation between keys and indexed data. Usually the division is the following:
- dense index - contains an entry for each key value occurring in a database,
- sparse index - associates blocks of ordered indexed data with only a single key value (e.g. the lowest one).

Primary indices are usually sparse, since data are often physically divided into blocks. In addition to dense and sparse, a range index can be considered, since an index can be split into slots representing specified ranges of key values. Another obvious classification of indices concerns their data structure, e.g. a hash table or a B-tree. In databases, many index kinds are sometimes combined into one so-called multilevel index. The next subchapter describes the most popular kinds of data structures employed in database indexing systems.

2.2 Index Data-Structures

The most popular data structures used for index organisation are various kinds of B-trees, proposed by Bayer and McCreight [8], and hash tables, invented by H. P. Luhn [14]. The efficiency improvement for queries selecting or sorting data varies with the choice of a proper data structure for indexing certain data. However, each index consumes some amount of database storage space and adds some overhead to the time of inserting, modifying or removing indexed data. Individual properties of different structures have been presented in thousands of papers and books devoted to databases and algorithms, e.g. [14, 23, 29]. In the context of this dissertation the kind of exploited physical index organisation is generally insignificant.
Only some properties of the index structure are important, in particular:
- key order preservation, i.e. range queries support,
- support for indexing using multiple keys,
- distribution of an index on multiple servers.

The same index interface, from the point of view of the database, can be used for a variety of index structures. Therefore, this work omits a detailed discussion of this subject, focusing mainly on the index structure used in the author's implementation, i.e. linear hashing. A hash table uses the hash coding method based on a hash function which maps key values into a limited set of integer values. A calculated hash value points to a location in an in-memory table (called a bucket) holding the corresponding non-key values. This method allows indexed values to be looked up or updated in very short, constant time, particularly when the hash function distributes key values evenly. A disadvantage of this technique is the necessity of specifying the size of the index table; however, dynamic hashing and linear hashing (described in the next section) deal with this issue. Another problem appears when two or more keys are mapped to the same location in the table. Similarly, it may happen that two or more objects have the same key value of an attribute. Resolving these so-called collisions leads to deterioration of index performance. There are many techniques allowing such items to be put in a hash table and queried in a fairly fast way: a rehash function, a linked list approach (separate chaining), a linked list inside a table (coalesced hashing) and buckets. Methods involving linear or dynamic hashing use load control algorithms automatically forcing a hash table to expand to prevent performance loss. Another very popular indexing technique is based on B-trees. A B-tree is slightly worse than a hash table from the point of view of search time and frequent data updates, which often involve tree reorganisation. However, its advantages are the simplicity of the algorithm and economical memory consumption. B-trees store keys in non-descending order, so they can be very helpful in laborious queries involving sorting or ranking data.
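To illustrate the single-interface point above, a minimal index interface might be sketched as follows (purely illustrative; the names are assumptions, not ODRA's actual API). A hash-backed implementation answers exact-match lookups, while range search is left to key-order-preserving structures such as B-trees:

```python
# Hypothetical sketch: one index interface, many possible structures.
from abc import ABC, abstractmethod

class Index(ABC):
    @abstractmethod
    def insert(self, key, ref): ...

    @abstractmethod
    def remove(self, key, ref): ...

    @abstractmethod
    def search(self, key): ...          # exact-match lookup

    def range_search(self, lo, hi):
        # only key-order-preserving structures (e.g. B-trees) support this
        raise NotImplementedError("index does not preserve key order")

class HashIndex(Index):
    """Separate-chaining hash table behind the common interface."""
    def __init__(self):
        self._table = {}

    def insert(self, key, ref):
        self._table.setdefault(key, []).append(ref)

    def remove(self, key, ref):
        self._table.get(key, []).remove(ref)

    def search(self, key):
        return list(self._table.get(key, []))

idx = HashIndex()
idx.insert("Smith", 7)
idx.insert("Smith", 9)
matches = idx.search("Smith")   # -> [7, 9]
```

A B-tree-backed implementation could override range_search; the query optimiser would only see the interface, not the structure behind it.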
Many different kinds of tree structures have been proposed in the literature and incorporated in commercial products, e.g. B+ tree, B# tree and B* tree (varieties of the B-tree), AVL tree and splay tree (balanced binary search trees), radix tree (optimised to store a set of strings in lexicographical order), and many more [14]. Indexing techniques used in data warehousing applications are a bit different from the techniques used in on-line transaction processing. Bitmap indices are stored as bitmaps (often compressed) [17, 66]. Consequently, answers to most queries can

be obtained by performing bitwise logical operations. They are most effective on keys with a limited set of values (e.g. a gender field) and often use a combination of such keys (i.e. a multiple-key index). When these conditions are met, bitmap indices offer reduced storage requirements and greater efficiency than regular indices. On the other hand, the performance of index maintenance is their serious drawback. Bitmap indices are primarily intended for non-volatile systems, since the method is very sensitive to updates of indexed data. Reflecting a change requires keeping locks on the segments storing a bitmap index, which is very time consuming. In typical cases bitmap indices are easier to destroy and re-create than to maintain. Other variants of indices for data warehousing have also been developed [84]:
- projection index - quite useful in cases where column values must be retrieved for all selected rows, because they would probably be found on the same index page,
- bit-sliced index - based on processing bitmaps; provides an efficient means of calculating aggregates.

Many other index structures have evolved to facilitate various index applications:
- inverted files, signature-based files - two principal indexing methods for text document databases [134] and for indexing according to set-valued attributes with low cardinality [41],
- multi-index, path index, access support relations, T-Index, path dictionary index - for path expression processing in OODBMSs [10, 11, 12, 67, 81],
- inherited multi-index, nested-inherited index, triple-node hierarchies, H-tree, CH-tree, hcc-tree (hierarchy class Chain tree), signature file hierarchies, signature graphs - oriented on facilitating processing of collections organised in class hierarchies in OODBMSs [10, 11, 12, 21, 77, 111],
- R-Tree, UB-tree, kd-tree, X-Tree, Parametric R-Tree, TPR-Tree (Time Parameterized R-Tree), TPR*-Tree, grid file - for spatial (i.e.
multidimensional) and spatio-temporal data, e.g. in Geographic Information Systems [30, 32, 40, 125],
etc.

Another group of index structures can be defined in the distributed environment

domain. In general, together with the growth of an indexed dataset, an index can be split into small parts maintained on independent servers, hence utilising their storage (e.g. main memory or disks) and processing power. In contrast to local indices, such an index:
- enables exploiting parallel computing power (therefore such indices are usually referred to as parallel or distributed indices),
- can be scalable, freely spreading its parts between network nodes without compromising its primary efficiency,
- provides a higher level of concurrency.

An overview of the properties of a distributed index structure based on the idea of linear hashing is given in section 2.2.2. Similarly to local indices, parallel indices have been developed in many variants for various applications and systems, e.g.:
- scalable distributed data structure variants (cf. section 2.2.2),
- distributed hash table (DHT) [34] for data indexing in peer-to-peer (P2P) networks, e.g. Chord [113],
- scalable distributed B-tree [3],
- a combination of a bit vector, a graph structure and a grid file - a database multi-key distributed index [45],
- hierarchical distributed index psix for XML documents in P2P networks [105],
- DiST, PN-tree - structures for indexing multidimensional (spatial) datasets [4, 16].

All index structures mentioned in this subchapter are only a small fraction of existing solutions, which are described in thousands of research papers. The next section concerns the linear hashing index, which is an important part of the author's prototype implementation verifying the thesis.

2.2.1 Linear-Hashing

Linear hashing is a dynamic indexing structure invented by Witold Litwin [72]. Similarly to a regular hash table, it comprises buckets which store index entries according to some hash function. Linear hashing strives to keep a relation between the number of index entries and the number of buckets in order to ensure constant

search, insertion and deletion efficiency and to minimise bucket overflow. Buckets are added (through splitting) and removed (through merging) one at a time, which is possible by taking advantage of a dynamic family of hashing functions. At the start, a linear hashing structure consists of N0 empty buckets numbered from 0 to N0 - 1. Three important parameters describe an index state:
- n - the number of the bucket to be split next, if necessary (initially equal to 0),
- j - the current lowest level of index buckets (initially equal to 0),
- N - the number of buckets, equal to N0*2^j + n (consequently, initially equal to N0).

The buckets from n to N0*2^j - 1 belong to level j, while the rest of the buckets, i.e. from 0 to n - 1 and all from bucket N0*2^j to N - 1, belong to level j + 1. Index entries are spread over the index according to hash functions h(j, key) depending on the level of an index bucket. The target bucket T for a key is determined according to the following formulas:

    T(j, key) = h(j + 1, key)  if h(j, key) is in [0, n[
    T(j, key) = h(j, key)      if h(j, key) is in [n, N0*2^j[

where:
    h(j, key) := hash(key) mod (N0*2^j),
    hash(key) is the basic key hashing function,
    [minValue stands for the inclusive left limit of the defined values range,
    maxValue[ stands for the exclusive right limit of the defined values range.

The most crucial operation, i.e. splitting, is triggered after an insertion when the index load becomes too high. A new bucket is appended to the buckets table and the elements from bucket n are divided between bucket n and the new bucket n + N0*2^j according to the h(j + 1, key) function. It is worth noticing that:

    h(j + 1, key) is in {h(j, key), h(j, key) + N0*2^j}

Next, the parameters n and N are incremented by one.
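The addressing and splitting rules above can be sketched compactly (an illustrative simplification, not the author's ODRA implementation; all names are hypothetical, and integer keys are used so that Python's built-in hash is deterministic):

```python
# Minimal linear hashing sketch: buckets split one at a time, addressing
# follows the T(j, key) rule described above.
class LinearHashing:
    def __init__(self, n0=4, max_load=0.75):
        self.n0 = n0                          # initial bucket count (N0)
        self.j = 0                            # current lowest bucket level
        self.n = 0                            # next bucket to split
        self.buckets = [[] for _ in range(n0)]
        self.size = 0
        self.max_load = max_load

    def _h(self, level, key):                 # h(level, key)
        return hash(key) % (self.n0 * 2 ** level)

    def _target(self, key):
        t = self._h(self.j, key)
        # buckets below n were already split -> address with level j + 1
        return self._h(self.j + 1, key) if t < self.n else t

    def insert(self, key, value):
        self.buckets[self._target(key)].append((key, value))
        self.size += 1
        if self.size / len(self.buckets) > self.max_load:
            self._split()

    def _split(self):
        old = self.buckets[self.n]
        self.buckets.append([])               # new bucket n + N0*2^j
        self.buckets[self.n] = []
        for key, value in old:                # redistribute via h(j+1, key)
            self.buckets[self._h(self.j + 1, key)].append((key, value))
        self.n += 1
        if self.n == self.n0 * 2 ** self.j:   # table doubled its size
            self.n = 0
            self.j += 1

    def lookup(self, key):
        return [v for k, v in self.buckets[self._target(key)] if k == key]
```

Merging on deletions (the reverse of `_split`) is omitted for brevity; it would shrink the table when the load falls below a lower threshold.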
Eventually, when n reaches N0*2^j (so that N = N0*2^(j+1), indicating that the hash table has doubled its size from N0*2^j), the parameter n is set to 0 and the index level j is incremented by one. Conversely to splitting, if during a deletion the index load falls below some fixed

threshold, then merging of buckets is performed. An example bucket split procedure is presented in Fig. 2.2. Bucket entries are represented by the values of their hash(key) functions. The state before the split is presented in Fig. 2.2a. The parameters of the index were the following: n = 0, N0 = N = 100, j = 0, so all buckets are addressed using the h(0, key) function.

Fig. 2.2 Example of a bucket split operation [72]

The split is performed on bucket n, which has already overflowed. A new bucket is allocated at the end of the buckets table and filled with the entries moved from bucket 0 for which h(1, key), i.e. hash(key) mod 200, is equal to 100. Finally, n and N are incremented. The index state after the split is shown in Fig. 2.2b. As shown, the dynamic expansion of a linear hashing table helps to minimise bucket overflow. An overview of SDDS, an efficient structure for distributed indexing based on the idea of linear hashing, is presented in the following section.

2.2.2 Scalable Distributed Data Structure (SDDS)

SDDS is a scalable distributed data structure introduced by W. Litwin [73, 74] which deals with storing index positions in a file distributed over a given network. Its properties make it a good candidate for indexing global data in a distributed

infrastructure (e.g. a grid). SDDS uses LH*, which generalises the linear hashing method described in section 2.2.1 to distributed memory or disk files. In contrast to linear hashing, SDDS buckets can be located on different sites. The LH* structure does not require a central directory, and it grows gracefully, through splits of one bucket at a time, to virtually any number of servers. The SDDS strategies differ in the approach to bucket splitting, which can be managed by a coordinator site, triggered by a bucket overflow, or driven by controlling an index load factor. An application of the SDDS significantly extends the features of linear hashing. The major advantages of an SDDS index concerning distributed indexing are the following:
- avoiding a central address calculation spot,
- parallel and distributed query evaluation support,
- concurrency transparency,
- scalability - it does not assume any constraints on size or capacity; an SDDS file expands over new servers when current servers are overloaded,
- index updating does not demand a global refresh on servers or clients,
- over 65% of an SDDS file is used,
- in general, a small number of messages between servers (1 per random insert; 2 per key search); parallel operations on M SDDS buckets require at most 2M + 1 messages and between 1 and O(log(M)) rounds of messages.

These characteristics of SDDS outperform in efficiency the centralised index directory approach (described in detail in section 2.6.1) and any static data structures. Variants of the SDDS index include implementations:
- preserving key order and supporting range queries (e.g. the RP* family of SDDS structures [75]),
- providing high availability, i.e. toleration of unavailability of some server sites composing the SDDS (e.g. LH*RS [76]).
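The claim that index updating needs no global refresh can be illustrated with a toy, single-process sketch (hypothetical and highly simplified; real LH* involves servers forwarding mis-addressed requests in at most a couple of hops). Each client keeps only an image of the file parameters, and an image adjustment message from a server corrects a stale client lazily:

```python
# Toy LH*-style addressing sketch; keys are assumed to be already-hashed
# integers, and "file" state stands in for the real distributed servers.
def lh_address(j, n, key, n0):
    a = key % (n0 * 2 ** j)                  # h(j, key)
    if a < n:                                # bucket already split
        a = key % (n0 * 2 ** (j + 1))        # h(j + 1, key)
    return a

class LHStarFile:
    """Actual file state (in reality spread over many server sites)."""
    def __init__(self, n0, j, n):
        self.n0, self.j, self.n = n0, j, n

    def address(self, key):
        return lh_address(self.j, self.n, key, self.n0)

class Client:
    """Keeps only a private image (j', n') of the file state; may be stale."""
    def __init__(self, n0):
        self.n0, self.j, self.n = n0, 0, 0   # image from file-creation time

    def address(self, key):
        return lh_address(self.j, self.n, key, self.n0)

    def image_adjustment(self, j, n):
        # correction piggy-backed on the reply to a mis-addressed request
        self.j, self.n = j, n

lhfile = LHStarFile(n0=4, j=2, n=1)          # file has grown to 17 buckets
client = Client(n0=4)                        # stale image: assumes 4 buckets
key = 13
if client.address(key) != lhfile.address(key):
    client.image_adjustment(lhfile.j, lhfile.n)
```

The point of the sketch is that splits never require broadcasting the new file state: each client converges towards the correct image only when its own requests are mis-addressed.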

2.3 Relational Systems

System R, developed by IBM (International Business Machines Corporation) Research between 1972 and 1981, is the first database management system implementing the relational model [6]. Innovative solutions developed within the system included a query optimiser utilising indices [15, 65]. An overview of relational query optimisation, including the fundamentals of the approach to indexing, has been collected in [20, 54, 55]. Almost 40 years of research on relational systems resulted in the development of various indexing aspects. Numerous indexing-based solutions are incorporated in available commercial products. The major RDBMSs currently are SQL Server by Microsoft [109], DB2 by IBM [24], Informix by IBM [53] and Oracle by Oracle Corporation [91]. The most popular open-source relational systems are PostgreSQL by the PostgreSQL Global Development Group [103], MySQL by SUN Microsystems [83] and Firebird by the Firebird Foundation [31]. The well-known indexing solutions designed for RDBMSs are the following:
- primary index, clustering index, secondary access paths (cf. section 2.1.2),
- multi-key index - enables indexing using a combination of multiple fields,
- derived key index (isystem DB2/400 by IBM) [51],
- function-based index (Oracle) [115], functional indexes (Informix) [49] - indices on expressions, built-in functions or user functions that exactly match selection predicates within an SQL where clause,
- computed-column indices (MS SQL Server) - a solution similar to the previous one but relying on an additional table column (a computed column), which can define an indexable expression using derived attributes and user functions (the index maintenance relies on maintenance of the computed column) [110],
- temporary index - a transient internal structure created automatically by the DB engine or defined manually (described below in this subchapter) [51, 110],
- diverse index structures (the topic of subchapter 2.2),
- other product-specific solutions.

In RDBMSs the keys that are used for defining an index on a table are usually
In RDBMSs the keys that are used for defining an index on a table are usually 1 International Business Machines Corporation Page 30 of 181

31 Chapter 2 Indexing In Databases - State of the Art simple values stored in columns. Developers of such an index can use various index structures and mechanisms for assuring index transparency. A query optimiser can easily identify where clauses addressing indexed selection predicates. Modifications to an indexed table are also easy to detect by the DB engine during run-time or even earlier through the analysis of an intermediate form of DML 2 statements. Insertion or deletion of table rows transparently triggers addition or removal of an appropriate index entry. Analogously, modification to any value in a key column results in changes inside an index. Therefore, details of automatic index updating in RDBMSs are usually omitted in technical RDBMS specifications and considered rather as an implementation issue. Function-based indices and other similar solutions enabling defining keys using expressions addressing more than one table column and internal functions or userwritten functions generally do not introduce conceptual difficulties. The functions supporting such indices can be written in a native database language (e.g. PL/SQL) or an external programming language (C++, Java, etc.). Furthermore, they must be deterministic (i.e. depend only on the state of a database store) and side effects free (i.e. do not introduce any changes to data). The idea of function-based indices is derived from optimisation through method (or function) pre-computation or materialisation. It is widely discussed in the research literature [5, 9, 13, 27, 57, 80]. The optimisation gain relies on pre-calculating the result of a given function or a derived attribute for all objects of a collection. The obtained results are used as keys to index objects and are stored inside the index. Thus, when queries are evaluated, the optimiser strives to use the result computed earlier in order to avoid laborious execution of functions and derived attributes. 
The automatic index maintenance of function-based indices requires simply considering modifications to any value stored in all columns used in a key definition. Nevertheless, this aspect of indexing becomes complex when an object-oriented model and language extensions are considered. In extreme cases, it may even lead to serious errors (see section 2.5.1). If appropriate indices do not exist, then the optimiser can try to facilitate query

processing using temporary indices instead. Their applications are described in detail for isystem DB2/400 by IBM [51]. A temporary index can be created solely for performing joins (e.g. a nested loop join), ordering, grouping, distinct and record selection. It is applied by the optimiser to satisfy a specific query request. Such an index can be built as part of a query plan; after query execution it is destroyed. In effect, it is not reused or shared across jobs and queries. Sometimes a temporary index can be created for a longer period. Such a decision can be made by the DB engine based on the analysis of query requests over time. In order to reuse and share such an index, it has to be altered if the underlying table changes. The advantage of a temporary index is the shorter access time, as it is stored only in main memory.

2.4 OODBMSs

Index organisation and optimisation in object-oriented database management systems have been deeply researched, see [7, 10, 11, 12, 21, 43, 46, 77, 79, 81, 93, 111]. Experimental database prototypes include IRIS by Hewlett Packard, ORION by MCC, OPENOODB by Texas Instruments and the project ENCORE/ObServer by Brown University. A few former commercial OODBMSs are: ONTOS by Ontos, ARDENT by ARDENT Software, ODE by AT&T Bell Labs and POET by POET Software [29]. OODBMSs are based on a hierarchical object-oriented data model. One of the important notions of the object-oriented model is a reference, i.e. a pointer link to an object. Pointer links express relationships (associations) between objects. As a result of attempts to standardise object-oriented database management systems, the ODMG [18] proposed OQL [29, 33], which to some extent influenced the development of object-oriented query languages. Differences in data models and query languages imply that some indexing techniques are specialised to relational or object-oriented approaches only.
(MCC - Microelectronics and Computer Technology Corporation, Austin, Texas; ARDENT - formerly O2 by O2 Technology; ODMG - Object Data Management Group; OQL - Object Query Language.)

OQL involves path expressions composed of object names separated by dots in

order to navigate easily via pointers to objects. Navigation to a pointed object in OODBMSs can be fast, as it is usually resolved at a low level with a direct link. In the relational model such relationships (i.e. primary-foreign key dependencies) require performing joins and, for efficient query evaluation, require indices. Nevertheless, some object-oriented systems may implicitly rely on a flat, relational-like data model; in such a case, navigation along a pointer link still requires performing an implicit join among objects. Thus the assumption limiting OQL path expressions is that the operand before a dot operator should not deliver a collection. Much work in OODBMS research has been dedicated to improving the efficiency of processing nested predicates, i.e. predicates based on derived attributes defined using path expressions. These works additionally extend path expression indexing with consideration of inheritance issues. The most important proposed solutions are Multi-Index, Inherited Multi-Index, Nested-Inherited Index, Path Index, Access Support Relations [10, 11, 12], Triple-node hierarchies [77] and T-Index (focused on semi-structured data) [81]. The efficiency of these methods was deeply studied, described through appropriate cost models and verified by prototype implementations. The solutions focus on various criteria, such as the cost of retrieval, the cost of update operations or the cost of storage. However, the transparency aspect of automatic index updating is not always precisely explained. Generally, it is assumed that each modification of an attribute of a class instance, as well as creation or deletion of an instance, should cause appropriate index updating actions. However, instances of one of the classes accessed by an indexed path expression can be located in different collections. Moreover, these collections can contain an arbitrary number of objects not associated with the indexed objects.
These circumstances can make automatic index updating routines inapplicable or can seriously affect database performance. Let us consider an example of an OQL query returning data concerning departments supervised by an employee John Doe:

SELECT * FROM Departments d WHERE d.supervisor.name = "John Doe"

A path expression based index supporting this query's evaluation concerns only the employees who are department supervisors. Unfortunately, modifying the name of any employee would be burdened by the index maintenance mechanisms. This inconvenience is, however, justified. In the approach to automatic index updating

presented in [10, 11, 12], all instances of the classes associated with a path expression based index need to be taken into consideration to ensure index validity after data modifications. Hence, an index structure often preserves some additional information concerning objects currently not accessed by the given index but located in collections processed during the path expression evaluation. An overview of the architecture of a system oriented on indexing based on path expressions is given in a later section. The distributed object management system H-PCTE (High-performance Portable Common Tool Environment), developed at the University of Siegen [47], has proposed a different solution to automatic index maintenance. It is independent of the index structure kind and its contents. This work relies on P-OQL [42], an extended OQL variant designed to reflect the data model of H-PCTE. The approach is based on so-called index update definitions, which consist of a description of the event causing the need for an index update, a reference to the affected index structure, a query determining the elements for which the respective index entries have to be updated, and a corresponding update operation. These index update definitions can be generated during index creation. The solution handles complex derived attributes, for instance employing regular path expressions and exploiting OQL aggregate functions. On the other hand, the authors outline some limitations of this approach concerning efficiency and the consideration of user methods, giving general suggestions how these disadvantages could be overcome [43]. Another approach to index maintenance is discussed in detail for function-based indexing [46] developed in the context of Thor, a distributed, object-oriented database system developed at the Massachusetts Institute of Technology [71]. It descends from works on optimisation for methods and functions in databases [9, 57].
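A rough sketch of such registration-based maintenance (a hypothetical simplification, not Thor's actual scheme; all names are illustrative): objects whose modification can affect the index are registered when indexed, and a write to a registered object triggers the corresponding index update:

```python
# Registration-based index maintenance sketch.
class RegistrationIndex:
    def __init__(self, key_fn):
        self.key_fn = key_fn
        self.registered = {}   # obj id -> last computed key
        self.index = {}        # key -> set of obj ids

    def register(self, obj_id, obj):
        # index the object and remember that it is registered
        key = self.key_fn(obj)
        self.registered[obj_id] = key
        self.index.setdefault(key, set()).add(obj_id)

    def on_modify(self, obj_id, obj):
        # called by the store for registered objects only; unregistered
        # objects are modified without any index overhead
        old_key = self.registered[obj_id]
        new_key = self.key_fn(obj)
        if old_key == new_key:
            return
        self.index[old_key].discard(obj_id)
        self.index.setdefault(new_key, set()).add(obj_id)
        self.registered[obj_id] = new_key

    def lookup(self, key):
        return self.index.get(key, set())

emp = {"name": "Doe"}
idx = RegistrationIndex(lambda o: o["name"])
idx.register(1, emp)
emp["name"] = "Smith"
idx.on_modify(1, emp)      # the registration check triggers the update
```

The appeal of the scheme is that objects never touched by any index pay no maintenance cost at all; only writes to registered objects incur the key recomputation.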
Indices are maintained using a so-called object registration schema. Registration concerns only objects whose modification can affect an index. An index update is triggered by a mechanism that checks registration information during object modification. Despite the theoretical genericity of this approach, it has not been fully implemented, since Thor provides object persistence for applications, but without support for queries. In [123] the authors present an approach generalising index-based methods by stored queries. They propose to store a response, i.e. the result of a query for

a current database state, along with the query. The universality of this solution enables taking advantage of indexing that exploits complex predicates, e.g. aggregate functions. However, in the context of the traditional approach to database indices, this work is close to optimisation by query caching.

To the best of the author's knowledge, only a few indexing techniques proposed in the scientific literature have been incorporated into commercial OODBMS products and major prototypes. Careful inspection of applied indexing facilities is possible through analysis of major object-oriented DBMSs. The prototypes and commercial products presented in the next sections represent only a part of the existing object-oriented database management systems landscape. Nevertheless, they provide a sufficiently complete overview of the indexing state of the art in OODBMSs.

db4o Database

The db4o database system by db4objects [25] is designed as a tool providing transparent persistence for objects of an object-oriented language. Native Queries in db4o supply an advanced query interface using the semantics of the JAVA programming language [22]. Another query interface is SODA. Transparent indexing in the db4o OODBMS is provided only for attributes of classes defining indexed collections [26]. This means that db4o handles index maintenance and query optimisation automatically. The documentation does not present details about index properties, only about their usage. SODA query optimisation allows db4o to use indices. Native Queries are converted to SODA where possible. Otherwise, a Native Query is executed by instantiating all objects.

Objectivity/DB

The approach of Objectivity/DB by Objectivity [85, 86, 87] to object persistence in programming languages is similar to that of db4o; however, it is considered more as an alternative to the traditional understanding of a query language.
Objectivity/DB, besides C++, JAVA and SMALLTALK support, provides Objectivity/SQL++, which complies with the ANSI-standard SQL-92 and augments it with some object-oriented extensions. Storage objects, in terms of Objectivity/DB, are used to group other objects and their indices in order to meet space utilisation, performance and concurrency requirements. There are three kinds of storage objects that correspond to three levels of grouping in

the Objectivity/DB storage hierarchy: the federated database, the database and the container. The structure of an index maintains references to persistent objects of a particular class (the so-called indexed class) and its derived classes within a particular storage object. Objectivity/DB supports indexing on a single class field or a concatenated index on several attributes (key fields). The indexed class is specified when creating an index. An index can be created on any persistence-capable class, i.e. a class whose behaviour enables storing its instances persistently in Objectivity/DB. Indices can be regarded as sorted collections of references to indexed objects. The order of key values in an index is very relevant to the proper working of predicate scans. By default, indexed objects are stored in ascending order of the values of their key fields; a different order can be specified while creating the index.

Let us consider index usage in Objectivity/DB. The main goal of an index is to optimise predicate scans. The predicate used in a scan can be one of the following:
- a single optimised condition (=, ==, >, <, >=, <=, =~ for string match) that tests the first key field of the index,
- a conjunction (&&) of conditions in which the first conjunct is an optimised condition that tests the first key field of the index (there is no support for disjunction, i.e. OR).
Objectivity provides a way to declare the uniqueness of the combination of values in the key fields of indexed objects. This can be specified when creating an index. The DBMS, however, does not automatically ensure this property; Objectivity simply considers for indexing only objects with a unique key field value combination. A subsequent object with the same combination of key field values will not be indexed.
Modifications concerning objects of an indexed class in the relevant storage object automatically cause appropriate changes in the index. Additionally, the session's index mode can be used to control updates. It enables determining the time of an index update relative to when indexed objects are modified. The index modes are as follows:
- INSENSITIVE: an update is applied when the transaction commits,
- SENSITIVE: the update is performed when the next predicate scan is executed in the transaction or, if no scans are performed, when the transaction commits,

- EXPLICIT_UPDATE: suppresses automatic updating of indices; an update-intensive application that works with this index mode can update indices explicitly after every relevant change.

ObjectStore

Similarly to db4o and Objectivity/DB, one of the goals of ObjectStore by Progress [88, 89] is making access to a database transparent for a programming language. In ObjectStore, a collection is an object that groups together other objects. While adding an index to a collection, its order and uniqueness can be specified (by default an index is unordered and allows duplicates). In the case of ObjectStore, the place where an index is stored can be chosen at creation time: a particular database segment or a specified database. By default, the index is stored in the same database, segment and cluster as the collection to which it was added. ObjectStore introduces a so-called multistep index. It can be created using a complex path expression, which can access multiple public data members and methods, as a key. Additionally, for the purpose of optimising queries involving types that have many subtypes, the idea of a superindex was implemented. By default, adding an index on a type results in recursively adding indices to all its subtypes. For queries over a large and intricate hierarchy of subtypes, such regular indexing can seriously deteriorate processing. A superindex added to a type with many subtypes differs from a default index in one essential feature: there is only one superindex for the whole hierarchy. It eliminates the recursion; consequently, only one parent query operation occurs, in contrast to multiple queries when using a regular index. A superindex is updated automatically, just as a default index is.
However, there are some flaws regarding the superindex:
- using a superindex for a small number of subtypes will not bring significant gain,
- a query started for a subtype gains nothing from the supertype's superindex,
- a superindex cannot be applied to types with subtypes in different segments of the same database or in a different database.

The last superindex limitation can be used to prevent adding new subtypes located in different databases to a superindexed type. The ObjectStore ODBMS automatically optimises a query applied to a collection. If an index is added to a collection, the database first evaluates indexed fields and establishes a preliminary result set. Next, it applies non-indexed fields and methods to the elements of the preliminary result set. In ObjectStore, optimisation can be done explicitly (by preparing a query) or automatically (otherwise). The latter means that a query is optimised to use exactly the indices which are available on the collection being queried. Automatic optimisation is convenient and effective. Nevertheless, when a query is to be run many times against multiple collections, with potentially different indices, it is recommended to take manual control over the optimisation strategy. ObjectStore supports multistep indices, but provides only partial index maintenance transparency. It automatically updates an index when elements are removed from or added to an indexed collection. However, updating an index entry after a data modification must be explicitly requested by the programmer. Besides all the indexing capabilities mentioned above, ObjectStore can create a primary index for an unordered collection that does not allow duplicates. It is an index used for queries and for looking up objects in such a collection. Therefore, the primary index must contain no duplicate keys and must cover all elements in the collection. Thanks to this solution, in some cases look-ups and insertions/removals are faster.

Versant

The Versant Object Database by Versant [126] requires explicit use of query language statements in programming language code. The statements appear in the code as strings.
They are processed at run time by a special utility in order to find and manipulate objects in a database. Versant exploits its native query language VQL (Versant Query Language), similar to SQL with some object-oriented extensions. In Versant, indices are set on a single attribute of a class and affect all instances of the class. Versant uses two kinds of index structures: B-trees and hash tables. Both maintain a separate storage area containing attribute values, and differ in organisation.

An attribute can be associated with two indices, one of each kind. A B-tree index is useful for value-range comparisons, while a hash index is better for exact-match comparisons. No index inheritance is supported by Versant. An index can be created on an attribute of only one class; no class inheriting from the one with the index will inherit the index. To index subclass attributes, indices have to be set specifically on each subclass. This results in the need for the administrator to provide index consistency. Advanced transparent indexing in Versant is achieved on virtual attributes, a technique similar to indexing on a computed column in Microsoft SQL Server. This approach enables indexing of derived virtual attributes built using one or more normal attributes of a class [127]. Indices in Versant do not have names and they are maintained automatically while adding, removing or updating the value of an attribute. An extra constraint that a Versant index can enforce is uniqueness. A unique index ensures that each instance of a class has a value of the indexed attribute that is unique with respect to the attribute values in all other instances of the class. In other words, once an attribute receives a unique index, no duplicate value for this attribute can be committed; the database server process first checks the uniqueness constraint. However, such uniqueness must be assured by the index administrator and can only be changed by removing the index.

GemStone's Products

The last commercial OODBMSs evaluated by the thesis author are the GemStone products [37]: GemFire, the Enterprise Data Fabric [35], which supports a subset of OQL, and Facets [36], which provides transparent persistence for the Java programming language using an SQL-92-based language with object extensions. These are the only tested databases which support transparent indexing employing path expressions.
Both databases originate from the GemStone database, whose approach to indexing has been discussed in [79]. In GemStone, indices address path expressions. The variable name appearing at the beginning of a path is called the path prefix. Then, a path contains a sequence of links and a path suffix, e.g. Employee.worksIn.manager. For each link (for instance, a variable of an object) in the path suffix one index is available, thus forming a

sequence of index components. GemStone supports five basic storage formats for objects, one of which is the non-sequenceable collection (NSC). When an object in this format grows large, its representation switches from a contiguous one to a B-tree which maintains the members by OOP (GemStone uses unique surrogates called object-oriented pointers, OOPs, to refer to objects, and an object table to map an OOP to a physical location). Every NSC object has an instance variable named NSCDict that is not accessible to the user. If there are no indices on an NSC, then the value of NSCDict is nil; otherwise, the value of NSCDict is the OOP of an index dictionary. An index dictionary contains the OOPs of one or more dictionary entries. A dictionary entry contains the information about the kind of an index (either equality or identity), the length of a path suffix and two arrays: the first one representing the offset representation of the path suffix and the second one holding the OOP of the index component for each instance variable in the path suffix. These index components are implemented using B+-trees. An index component stores the information about the ordering of keys in the component's B-tree. If the path suffixes of two or more indices into an NSC have a common prefix, then the indices share the index components on the common prefix. In GemStone, identity indices directly support exact-match lookups, whereas equality indices and identity indices on booleans, characters and integers directly support =, >, >=, <, <= and range lookups. Objects in GemStone may be tagged with a dependency list. For every index component in which an object is a value in the component's B-tree, the object's dependency list contains a pair consisting of the OOP of the index component and an offset. The pair indicates that if the value at the specified offset is updated, then an update must be made to the corresponding index component. Consequently, an index component automatically depends on the value of the object at the given offset.
2.5 Advanced Solutions in Object-Relational Databases

A very promising feature of relational systems extended with object-oriented capabilities is indexing using keys defined on expressions consisting of derived attributes, internal and user methods (described in subchapter 2.3: derived indices,

functional indices, function-based indices and computed-column indices). Plain relational systems impose limits on index key definitions:
- an index key can only be calculated using data in the current tuple, since SQL does not enable defining an index using data from other tables associated through a primary-foreign key relationship,
- SQL aggregate functions are forbidden in index definitions, since a simple SQL expression used in a selection predicate for a table must return a single value.
Without advanced object-oriented extensions there is no support for methods associated with tuples, polymorphism or path expressions. Such limitations also apply to the majority of indexing techniques in ORDBMSs. The author has not found any object-relational DBMS supporting indexing using aggregate functions or path expressions. Relatively complex indices involving method invocations and polymorphism in the object-relational environment can be created with the use of the Oracle function-based index feature [108, 115]. The Oracle documentation does not provide extensive information concerning the automatic maintenance of such indices. To identify the properties of Oracle's approach, the author performed the tests described in the next section.

Besides regular indexing facilities, some products introduce robust extensions for advanced indexing purposes. As an example, let us consider two solutions provided by IBM, i.e. the Virtual Index in Informix [50] and Index Extensions in DB2 [114]. These tools are dedicated to experienced database programmers who require indexing mechanisms going beyond standard database capabilities, e.g.:
- creating secondary access methods (i.e. indexing) that provide SQL access to non-relational and other data that does not conform to built-in access methods (e.g.
a user-defined access method retrieving data from an external location),
- creating specialised index support to take the semantics of structured types into account,
- introducing various index structures.
Nevertheless, to take advantage of such extensions the user often needs to define specialised routines, in particular ones responsible for index maintenance (a key generator) and for performing index scans (a range producer). Therefore, the solutions presented above

do not fulfil the indexing transparency property. DB2 additionally introduces a transparent indexing technique for semistructured XML (Extensible Markup Language) data [48]. The so-called pureXML feature allows storing well-formed XML documents, kept in their native hierarchical form, in table columns that have the XML data type. XQuery (XML Query Language), SQL, or a combination of both can be used to query and update XML data. An index over XML data indexes a part of a column, according to a definition which is limited to an XPath (XML Path Language) expression. Hence, an index key can be the value of an atomic-type element nested in an XML structure stored in a column. XML data are stored entirely in table columns, so modifications done to XML data can be easily reflected in the index. The author did not encounter any transparent indexing solutions that would enable indexing using keys more advanced than the solutions presented above. The next subsection focuses on evaluation of the function-based index technique, which is one of the most advanced and the most relevant to the author's work.

Oracle's Function-based Index Maintenance

In order to verify the properties of function-based index maintenance, we introduce the following example of a database schema (Fig. 2.3):

Fig. 2.3 Example object-relational schemata

The method getTotalIncomes of EmpType returns the value of the salary attribute of a tuple. It is overloaded in EmpStudentType in order to also consider the value of a scholarship. The emp table consists of tuples of both types. Creating an index on such a method associated with a table is straightforward:

CREATE INDEX emp_gettotalincomes_idx ON emp e (e.gettotalincomes());

Such an index is automatically used by the query optimiser. The efficiency of the selection process is improved not only through reducing the number of processed rows but also

through avoiding method invocations, since calculated results can be taken from the index. The index efficacy has been tested on a series of simple queries. Modifications to a salary or a scholarship attribute, e.g.

UPDATE emp e SET e.salary = 1500 WHERE = 'KUC';

trigger appropriate changes in the index. As anticipated, the time of such a data alteration deteriorates after adding the emp_gettotalincomes_idx index. Processing an update takes more than three times longer, because the automatic index maintenance needs to alter the corresponding index entries. As far as the tests could show, the created index works correctly.

Unfortunately, defining an index on method calls in Oracle reveals some unexpected disadvantages. The index update operations are also triggered during the modification of the name attribute of any emp tuple. Hence, alteration of any attribute of an emp tuple after creating the emp_gettotalincomes_idx index is similarly more than three times slower. This is caused by unnecessary index updating routines. The Oracle approach to index updating in the case of method-based indices consists in triggering index update routines during modifications done to any data in a tuple with associated index entries. The disadvantage mentioned above grows into a large problem when the method used to define an index key accesses data outside the indexed tuples. For example, the method getYearCost of DeptType has the following definition:

CREATE OR REPLACE TYPE BODY dept_type IS
  MEMBER FUNCTION getyearcost RETURN NUMBER DETERMINISTIC IS
    counter NUMBER;
  BEGIN
    SELECT sum(salary) INTO counter FROM emp e WHERE = SELF.name;
    RETURN counter * 12;
  END;
END;

It accesses not only the data of the given DeptType tuple but also reaches the emp collection. Oracle nonetheless enables indexing the dept collection according to the getyearcost method:

CREATE INDEX dept_getyearcost_idx ON dept d (d.getyearcost());

Similarly to the case of emp_gettotalincomes_idx, a command altering dept tuples triggers updating of the index. However, modifications done to emp tuples, e.g.

INSERT INTO emp SELECT emp_type('John Smith', 350, REF(d)) FROM dept d WHERE = 'HR';

are not taken into consideration, and the dept_getyearcost_idx index loses cohesion with the data. Unfortunately, queries which use the index, e.g.

SELECT, d.getyearcost() FROM dept d WHERE d.getyearcost() < 24500;

can return incorrect answers, since the selection process and the final results depend on the index contents. Hence, the applied index updating solution is not adequate for handling indices with keys based on overly complex methods. In practice, the function-based index feature in Oracle can lead to erroneous behaviour of database queries and applications.

The reference dept in EmpType associates employee tuples with departments. It can be used to formulate selection predicates employing path expressions, e.g.:

SELECT FROM emp e WHERE = 'HR';

Nevertheless, using such a path expression to define an index is forbidden:

CREATE INDEX emp_deptname_idx ON emp e (;
ORA-22808: REF dereferencing not allowed

because it would require accessing a tuple from another table, which obviously would make index maintenance impossible under this scheme.

2.6 Global Indexing Strategies in Parallel Systems

Various indexing approaches have been developed for distributed systems over the last two decades. Most of the interesting solutions have been implemented in the domain of p2p networks [104, 107]. Work [124] introduces a detailed taxonomy of indexing strategies (described as index partitioning schemes) for distributed DBMSs. It analyses index maintenance strategies and storage requirements in the context of data partitioning in a relational system. It assumes that an index is partitioned over the same nodes as the data. The factors considered

as the foundations of the given taxonomy are:
- the degree of index replication between system nodes (none, partial, full),
- index partitioning in the context of data partitioning, i.e. the method determining how index entries are distributed among system nodes.
Generally, a local indexing strategy implies that indices are built locally on the local data. Distributed indexing occurs when the partitioning of the index differs from the partitioning of the data. The taxonomy, however, omits the centralised indexing strategy, which is very important.

Local data indexing is the most common optimisation method used in database systems. Moreover, it is also applicable to indexing the data of a single peer in a distributed environment. There are several advantages of the local indexing strategy in a distributed database environment. The knowledge of indices existing in local stores need not be available at the level of a global schema. A query addressing the global schema can, in many cases, be decomposed during optimisation into subqueries addressing particular servers. Such a subquery concerns data stored locally on a target site. Before evaluation, it can be optimised by the local optimiser in order to take advantage of existing local indices. Global query optimisation is thus divided between servers, and the global optimiser need not take local optimisations into account. Consequently, local indexing is transparent for global applications. Since data and indices are located on the same machine and in the same repository, the implementation of all indexing mechanisms, including index management and maintenance, is standard; in contrast to distributed-environment indexing techniques, it is not as complex. However, local indexing is not always sufficient with regard to the computational power of a distributed database. Global indices can be kept by a global store.
This approach has significant potential for the optimisation of global queries. The idle time of a global store can be devoted to indexing and cataloguing data held by local servers.

From the users' point of view, distributed technology should satisfy the following general requirements: transparency, security, interoperability, efficiency and pragmatic universality. Distributed or federated databases and data-intensive grid technologies, which can be perceived as their successors, aim at providing transparency in many forms: location, concurrency, implementation, scaling, fragmentation,

replication, failure transparency, etc. [39]. Transparency is the most important feature for reducing the complexity of a design and for supporting the programming and maintenance of applications addressing distributed data and services. It much reduces the complexity of a global application. One of the forms of transparency concerns indexing. As in centralised databases, programmers should not involve indices explicitly in the code of applications. Any performance enhancement should be on the side of database tuning, which is the job of database administration. There are several important aspects connected with transparent indexing in distributed databases:
- location and access transparency: the geographical location of indices should not affect the users' work,
- scaling and migration transparency: indices should be maintained in such a way that server data may be migrated, added or removed without any impact on the consistency of applications,
- failure transparency: indices should be updated or migrated if some of the nodes break down,
- implementation and fragmentation transparency: the user need not know how indices are implemented or partitioned,
- concurrency transparency: users can access indexed resources simultaneously and need not know that other users exist.
The next sections discuss the basic properties of centralised and distributed approaches to indexing.

Central Indexing

The most common practice for indexing distributed resources is dedicating one server to an index repository. This strategy is called central indexing and has certainly proved its value in many internet applications. It played a particularly important role in the development of p2p networks. For example, Napster, an application allowing for sharing music files, used a directory server to locate desired resources [104, 107].
The features of this approach include:
- a small amount of necessary communication,
- efficiency for selective queries,

- architectural simplicity.
However, there are also some disadvantages resulting from central indexing. The indexing server becomes a single point of failure. Moreover, query evaluation performance deteriorates if the server is overloaded (i.e. too many clients use the index simultaneously) or fails. Also, this approach does not take advantage of parallel computation.

Strategies Involving Decentralised Indexing

In the Gnutella [38] p2p network, each participating node is responsible for answering and forwarding search requests (a so-called flooded request model). It is an example implementation of the local indexing strategy. However, the features of the Napster solution have proved to be superior and resulted in better performance than Gnutella's. An efficient option for decentralised indexing is the use of global distributed and parallel indices, e.g. SDDS (see section 2.2.2). These kinds of indices assume that a searched key value points to another server, where it can be further forwarded or where the desired non-key values can be found. A simple example of such a technique could be indexing employees by their profession: one server can store references to all employees whose profession starts with the letter A, another server those starting with the letter B, etc. A performance comparison of local indexing strategies (described as partial indexes) and distributed indexing (referred to as partitioned global indexes) in query processing over horizontally fragmented data is the topic of [70]. The evaluation was in favour of the strategy utilising a distributed index. A similar investigation has been performed in the context of an inverted index for parallel text retrieval systems [19]. The conducted research indicated that the local index strategy should be preferred when queries exploiting indices are infrequent.
The advantages of the distributed indexing strategy in contrast to the centralised one are the following:
- it uses the computing potential of a grid (enabling parallel query evaluation),
- it is insensitive to overloading,
- it decentralises the necessary communication.

The organisation and architecture of such an index are more complex than in the case of a central index. Sites can dynamically join or leave a community, forcing the reorganisation of a part or the whole of the index. Achieving scalability in index distribution requires advanced algorithms and data structures, whose complexity can have a disadvantageous impact on index performance. It is common to all global indexing techniques that index entries stored on server X are not associated with the data stored on server X, so maintaining convergence between the data and the index is more difficult and has to be done at the global level. Some works, e.g. [7], consider indexing schemes for a distributed page-server OODB, recognising local caching of a centralised index as a distributed indexing strategy. Nevertheless, this technique does not introduce a significant performance improvement for parallel query processing.

2.7 Distributed DBMSs

Despite the relatively large number of distributed relational and object-oriented DBMSs, only a small fraction of them has global indexing capabilities. The most advanced solutions are based on index partitioning, e.g. SQL Server and Oracle. In databases, partitioning usually refers to tables or indices. The common model of table partitioning in distributed databases relies on a static division of data into independent datasets [92]. Data are partitioned horizontally by declustering relations based on a function (usually a hash function or a range index). This kind of partitioning is static, since the rules assigning datasets to designated partitions do not change without an administrator's interference. With a hash function, the data can be partitioned according to one attribute or a combination of several attributes. Such an approach enables efficient processing of exact-match queries, often independently within only one partition.
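The static horizontal partitioning described above can be sketched with Oracle-style DDL; the table and column names below are illustrative only, not taken from any system discussed in this chapter:

```sql
-- Illustrative sketch: static horizontal partitioning of a table by a hash
-- function on one attribute. An exact-match query on deptno can then be
-- routed to a single partition.
CREATE TABLE emp (
  empno   NUMBER PRIMARY KEY,
  name    VARCHAR2(100),
  deptno  NUMBER
)
PARTITION BY HASH (deptno) PARTITIONS 4;
```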
As a representative example, Oracle's approach to index partitioning is discussed in this subchapter; its details are described in [82, 133]. Oracle enables creating a partitioned index for a partitioned or non-partitioned table. On the other hand, a partitioned table can have partitioned and non-partitioned indices. If the key partitioning a table is identical to the key partitioning a corresponding index, then the index is local. In the remaining cases we deal with a global index. Nevertheless, partitioning of indices uses the same mechanisms as partitioning of tables, so the number of index partitions in

49 Chapter 2 Indexing In Databases - State of the Art case of global or local indices usually does not differ. Moreover, local indexing is superior to global in efficiency of the index maintenance. Consequently, it is the most common used indexing strategy. Partitioned indices inherit majority of regular indices features, e.g. they can be defined using function-based expressions. In the Oracle s approach, partitioning is not a mean of data integration and partitions are not managed autonomously. Therefore, a global or local partitioned index can be created only on an entire table. The research [45] concerns the architecture of a multi-key distributed index. It proposes a distributed index composed of two types of index structures: Global Index (GI) and Local Index (LI). GI is a part managed on a distributed database s level and each LI is created and maintained by local database components. In such an architecture different indexing aspects are described, e.g. query optimisation, index implementation and maintenance (referred as dynamic adaptation [44]), together with evaluation of performance. Generally, the capabilities of the presented approach do not overcome the presented Oracle s index partitioning solution. The SDDS index structure was employed in the SD-SQL 11 Server database [106] in order to distribute data dynamically and transparently between separate database instances. Table rows are moved between sites according to primary key values by SDDS algorithms. This approach solves the limitations of static table partitioning improving data load balancing. SD-SQL Server automatically manages and accordingly queries database instances. This solution is built on top of SQL Server using database stored procedures. 11 Scalable Distributed SQL Page 49 of 181
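The two-level GI/LI architecture mentioned above can be sketched roughly as follows. This is a simplified, hypothetical illustration: the range-based global index and the dictionary-based local indices are assumptions made for the sketch, not details taken from [45].

```python
# A two-level distributed index sketch: the Global Index (GI) maps a key
# range to a site; each site maintains a Local Index (LI) from key to data.
import bisect

class LocalIndex:
    def __init__(self):
        self.entries = {}          # key -> record, maintained by the site

    def insert(self, key, record):
        self.entries[key] = record

    def lookup(self, key):
        return self.entries.get(key)

class GlobalIndex:
    def __init__(self, split_points, sites):
        self.split_points = split_points   # e.g. [100, 200] -> 3 key ranges
        self.sites = sites                 # one LocalIndex per range

    def site_for(self, key):
        return self.sites[bisect.bisect_right(self.split_points, key)]

    def lookup(self, key):
        # The GI only routes the request; the owning site resolves it locally.
        return self.site_for(key).lookup(key)

sites = [LocalIndex() for _ in range(3)]
gi = GlobalIndex([100, 200], sites)
gi.site_for(150).insert(150, "row-150")
print(gi.lookup(150))   # resolved by the second site only
```

The sketch also illustrates the maintenance issue discussed earlier: moving a row between sites requires updating both the owning LI and, if split points change, the GI.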

Chapter 3 The Stack-based Approach

The Stack-based Architecture (SBA) [1, 117, 118] is a formal methodology concerning the construction and semantics of database query languages, especially object-oriented ones. SBA is a coherent theory that enables creating a powerful query language for practically any known data model. The basic assumption behind SBA is that query languages are variants of programming languages. Consequently, notions, concepts and methods developed in the domain of programming languages should also be applied to query languages. In particular, the main semantic and implementation notion in the majority of programming languages is an environment stack. It is an elementary structure used for defining name spaces, binding names, calling procedures [120] (including recursive calls), passing parameters and supporting object-oriented notions such as encapsulation, inheritance and polymorphism. The Stack-based Approach to query languages exploits the environment stack mechanism in order to define and implement operators specific to query languages, such as selection, projection, navigation, join and quantifiers. Taking advantage of the semantics based on the environment stack, SBA makes it possible to achieve full orthogonality and compositionality of the operators. Moreover, SBA enables seamless integration of a query language with imperative constructs and other programming abstractions, including procedures, types and classes. This chapter contains a brief description of basic SBA notions and presents the model query language SBQL (Stack-Based Query Language) developed according to SBA. More details on SBA and SBQL can be found in [117, 118].

3.1 Abstract Data Store Models

SBA deals with several universal models of object stores. Depending on their complexity, they are referred to as abstract store models AS0, AS1, AS2 and AS3 (previously named M0, M1, M2 and M3, respectively).
Each successive model extends the previous one with new features. These models do not exhaust all possibilities; however, they cover most of the currently known ones.

3.1.1 AS0 Model

The AS0 model is built according to the relativity and internal object identification principles. It is a very simple data store model that is capable of representing semistructured data [121]. In AS0 each object comprises an internal identifier (implicit for the programmer), an external identifier (an object name available to the programmer) and a value. There are three kinds of objects: atomic, reference and complex. Assuming that I denotes the set of all acceptable internal identifiers, N the set of acceptable external names of objects, V the set of simple values like numbers, strings, etc., and O denotes any set of AS0 objects, we can define objects as the following triples (where i₁, i₂ ∈ I, n ∈ N and v ∈ V):

- Atomic objects <i₁, n, v> - the simplest kind of objects. They are identified by the internal identifier i₁, have the name n and hold an atomic value v.
- Reference objects <i₁, n, i₂> - they model relations between objects. As in the previous case, they are identified by an internal identifier i₁ and have a name n. Their value is an identifier i₂ referring to some object.
- Complex objects <i₁, n, O> - used to model object nesting. The object with internal identifier i₁ and name n consists of the objects belonging to O. Elements of O are considered subobjects of the object having i₁ as its identifier.

3.1.2 Abstract Store Models Supporting Inheritance

The AS1 store model extends AS0 with classes and static inheritance. A class is a plain complex object containing subobjects which represent invariants of a certain group of objects. Additionally, an inheritance relation between class objects can be defined. Apart from the inheritance relation, there is a relation defining an object's membership in a corresponding class. The AS2 model introduces the notion of an object's dynamic role. Each object can be associated with one or more such roles. If an object is an owner of a role, its situation is similar to being a class instance.
However, whereas class inheritance has a static character, during run-time an object can take on new roles and lose old ones, so inheritance between roles is dynamic [56, 98]. The AS3 model extends the AS1 model (AS3.1) or the AS2 model (AS3.2) with an encapsulation mechanism. It is assumed that each class can be equipped with an export list, which is a set of class field names that are explicitly visible outside the class instances. Other fields are not visible and are treated as private.
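To ground the triple notation, here is a minimal Python sketch of an AS0 store. The encoding is an assumption made for illustration (it is not how ODRA stores objects): atomic, reference and complex objects all share the <identifier, name, value> shape, distinguished only by the type of the value.

```python
# Store: identifier -> (name, value). A set of identifiers marks a complex
# object, an identifier string present in the store marks a reference
# object, and anything else is an atomic value.
store = {
    "i1": ("Emp", {"i2", "i3", "i4"}),   # complex object
    "i2": ("name", "Kowalski"),          # atomic object
    "i3": ("salary", 1200),              # atomic object
    "i4": ("worksin", "i5"),             # reference object -> i5
    "i5": ("Dept", {"i6"}),              # complex object
    "i6": ("name", "CNC"),               # atomic object
}

def subobjects(i):
    """For a complex object i, map each subobject's name to its identifier."""
    _, value = store[i]
    assert isinstance(value, set), f"{i} is not a complex object"
    return {store[j][0]: j for j in value}

def deref(i):
    """For a reference object, return the identifier of the target object."""
    _, value = store[i]
    return value if isinstance(value, str) and value in store else i

# Navigate Emp -> worksin -> Dept -> name:
dept_id = deref(subobjects("i1")["worksin"])
print(store[subobjects(dept_id)["name"]][1])   # prints CNC
```

The relativity principle shows up directly: `subobjects` works identically at every nesting level, since subobjects are ordinary objects.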

3.1.3 Example Database Schema

The example schema in Fig. 3.1 is introduced as a basis for presenting conceptual examples in this and the following chapters. The abstraction level of the schema corresponds to the AS1 store model described in the previous section. Therefore, the schema consists of hierarchical objects, pointer links between objects, classes, static inheritance and multiple inheritance. These are the most relevant elements from the point of view of object-oriented modelling. The indexing solution for the ODRA database management system supports many features that result from adapting the AS1 store model.

Fig. 3.1 Example of an object-oriented database schema for a company (diagram). The diagram shows: the classes PersonClass (name, surname, age, married; getfullname()), StudentClass (scholarship; getscholarship(), setscholarship()), EmpClass (salary; gettotalincomes()) and EmpStudentClass (overriding getfullname() and gettotalincomes()); the structure types DeptType (name) and AddressType (city, street, optional zip); the worksin/employs associations (each Emp works in exactly one Dept, a Dept employs 0..* Emps); and the address subobjects of Person and Dept.

The example schema illustrates the personnel records of a company. It introduces several classes - PersonClass, StudentClass, EmpClass, EmpStudentClass - and two structure types - DeptType and AddressType. Persistent instances of these classes can be accessed using their instance names: Person, Student, Emp and, finally, EmpStudent. Objects called Dept have the DeptType structure with a primary attribute name and represent departments of the company. Each Person object stands for a person connected with the company in some way. Its attributes provide basic information: name, surname, age and marital status.
Additionally, each Dept and Person object includes an address subobject which specifies a city, a street name and, optionally, a zip-code according to the AddressType structure. Instances of the EmpClass represent current employees of the company and extend the Person object attributes with the salary attribute. Emp and Dept objects are associated by references: the worksin reference of an Emp object leads to a department, while Dept objects contain employs references to the department's employees. Another class which extends the PersonClass is the StudentClass. Its objects refer to students who are granted a scholarship by the company; for that reason, this class introduces the scholarship attribute. The last class presented in the schema is called EmpStudentClass and, as its name suggests, it inherits from both EmpClass and StudentClass. It is introduced to represent students who are simultaneously employees of the company. In SBQL, using the name Person results in returning all instances of the PersonClass and its subclasses. Similarly, via the name Emp a programmer refers to both EmpClass and EmpStudentClass instances. Besides attributes, classes are composed of methods. Taking advantage of polymorphism, some methods are overridden in derived subclasses. For example, the gettotalincomes() method of EmpClass returns the value of the salary attribute, but for instances of the EmpStudentClass it returns the sum of the salary and scholarship attributes.

3.1.4 Example Store with Static Inheritance of Objects

Referring to the data schema in Fig. 3.1, we introduce the example store shown in Fig. 3.2, consistent with the AS1 model (cf. section 3.1.2), presenting classes and objects, their values, identifiers and the most important relations between them. An identifier is a property of every database entity. This sample store consists of two objects of the DeptType type and two instances of EmpClass (one of them is also an EmpStudentClass instance). One Emp object describes Marek Kowalski, a person who works in the CNC department. The EmpStudent object depicts Piotr Kuc, a student who is employed by the HR department.
The classes PersonClass and StudentClass are omitted, but according to the schema in Fig. 3.1 they are present in the database.
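The polymorphic behaviour of gettotalincomes() described above can be mirrored in a short Python sketch. Class, attribute and method names follow Fig. 3.1; the Python rendering itself is only an illustration, not ODRA code.

```python
# Python stand-ins for the Fig. 3.1 hierarchy: gettotalincomes() is
# overridden so that for EmpStudent instances it returns the sum of the
# salary and scholarship attributes.

class Person:
    def __init__(self, name, surname):
        self.name, self.surname = name, surname

    def getfullname(self):
        return f"{self.name} {self.surname}"

class Emp(Person):
    def __init__(self, name, surname, salary):
        super().__init__(name, surname)
        self.salary = salary

    def gettotalincomes(self):
        return self.salary

class Student(Person):
    def __init__(self, name, surname, scholarship):
        super().__init__(name, surname)
        self.scholarship = scholarship

class EmpStudent(Emp, Student):              # multiple inheritance
    def __init__(self, name, surname, salary, scholarship):
        Person.__init__(self, name, surname)
        self.salary, self.scholarship = salary, scholarship

    def gettotalincomes(self):               # overridden, as in the schema
        return self.salary + self.scholarship

emps = [Emp("Marek", "Kowalski", 1200), EmpStudent("Piotr", "Kuc", 1000, 500)]
# Binding the name Emp returns EmpClass and EmpStudentClass instances alike:
print([e.gettotalincomes() for e in emps])   # [1200, 1500]
```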

Fig. 3.2 Sample store with classes and objects (diagram). The diagram shows: the Emp object i₆₁ (Marek Kowalski, age 28, married, salary 1200, address in Kraków, worksin → the CNC Dept object i₁₃₁); the EmpStudent object i₃₁ (Piotr Kuc, age 30, salary 1000, scholarship 500, address in Warszawa, worksin → the HR Dept object i₁₄₁); the class objects EmpClass (i₁₁) and EmpStudentClass (i₂₁) with their getfullname() and gettotalincomes() methods; and the employs references (i₁₅₁, i₁₅₂) of the Dept objects.

3.2 Environment and Result Stacks

The semantics of a query language in the Stack-based Approach is explained using two stacks: the environment stack (mentioned earlier) and the result stack. The environment stack (ENVS) controls the name binding space. This stack consists of sections, and each section holds binders. A binder is a construct used to bind a name to an appropriate run-time entity. It is assumed that binders are written as n(r), where n ∈ N and r ∈ R; the set R denotes all possible query results. This brings us to the second stack, the result stack (QRES). It is used for storing temporary

and final query results. The following elements r belong to the set R of query results:

- Value results (a number, character string, logical value, date, etc.) are results of literal expressions or arise through dereference (the process of acquiring a value) of atomic database objects.
- Reference results (identifiers of internal objects) are plain results of expressions referring to database objects through names. They are usually results of name binding; however, they can also appear through dereference of reference objects.
- Binder results (the pairs n(r) mentioned earlier, where n ∈ N, r ∈ R) are created when operators introducing an auxiliary name are used (as, groupas) or as a result of the nested(iₓ) operation (described further), where iₓ is an identifier of a reference or complex object.
- Structure results (struct{r₁, r₂, …, rₙ}, where r₁, r₂, …, rₙ ∈ SR and SR ⊂ R is the set of query results which are not collections) are sequences of single results (elements of SR). Structures are usually created using the comma expression, join, or as a result of dereference of a complex object.
- Collections of single results (bag{r₁, r₂, …, rₙ}, sequence{r₁, r₂, …, rₙ}, where r₁, r₂, …, rₙ ∈ SR) consist of any elements of the R set except other collections. Result collections can be nested in other collections only if they are values of binders. Collections are typically created as a result of binding names or using set operators (e.g. union). There are two main collection types: order-preserving sequences and unordered bags. Nevertheless, other collections can be introduced if necessary, e.g. an array.

The following sections describe the bind and nested operations, which are defined using the stacks presented above. All these SBA elements are essential from the point of view of the author's work.

3.2.1 Bind Operation

Each name occurring in a query is bound to an appropriate run-time entity according to the name binding space.
Name binding is performed using the so-called bind operation. This operation works on the environment stack in order to find appropriate

binders in its sections. At the beginning of a query evaluation the ENVS comprises one section (the base section), which holds binders to all database root objects. During query evaluation new sections, empty or holding several binders, are pushed onto or popped off the environment stack, but the base section remains untouched. Generally, binding a name n consists in searching the ENVS from top to bottom for the first section which holds at least one binder with the name n. Since binder names can repeat inside one section, the result of a binding operation may be a collection of all found binder values. In particular, if no section holds binders with the name n, an empty collection is returned.

3.2.2 Nested Function

The nested function formalises all cases that require pushing new sections onto the ENVS, in particular the concept of pushing the interior of an object. This function takes any query result as a parameter and returns a set of binders. Depending on the kind of the parameter, the following results of the nested operation are defined:

- Reference to a complex object - the result is a set of binders created from the subobjects of the given complex object. For each subobject the created binder has its name, and its value is the subobject's internal identifier.
- Reference to a pointer object - the result is a set holding a binder whose name is the name of the object pointed to by the pointer and whose value is the internal identifier of the pointed object.
- Binder - the result is a set holding the identical binder.
- Structure - the result is the union of the results of the nested function applied to all elements of the structure.
- In other cases, the result is the empty set.

3.3 SBQL Query Language

Queries in the Stack-based Approach are treated in the same way as traditional programming languages treat expressions. Therefore, in this thesis the terms expression and query are used interchangeably.
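The bind and nested operations defined above, and the way non-algebraic operators use them, can be condensed into a small Python sketch. The data layout and helper names are assumptions made for illustration only; the evaluation of where follows the scheme given later for non-algebraic operators.

```python
# Store in the spirit of AS0: identifier -> (name, value); a complex
# object's value is a list of subobject identifiers.
store = {
    "i1": ("Emp", ["i2", "i3"]),
    "i2": ("name", "Kowalski"),
    "i3": ("salary", 1200),
    "i4": ("Emp", ["i5", "i6"]),
    "i5": ("name", "Kuc"),
    "i6": ("salary", 1000),
}

# ENVS: a stack of sections; each section is a list of (name, result) binders.
ENVS = [[("Emp", "i1"), ("Emp", "i4")]]   # base section binds root objects

def bind(name):
    """Search ENVS top-down for the first section holding binders named n;
    the result may be a collection, or empty if no binder is found."""
    for section in reversed(ENVS):
        found = [r for n, r in section if n == name]
        if found:
            return found
    return []

def nested(i):
    """For a reference to a complex object: binders for its subobjects."""
    _, value = store[i]
    return [(store[j][0], j) for j in value] if isinstance(value, list) else []

def deref(i):
    return store[i][1]

def where(left_ids, predicate):
    """'left where right': for each element open a new ENVS section with
    nested(el), evaluate the predicate, then pop the section."""
    eres = []
    for el in left_ids:
        ENVS.append(nested(el))
        if predicate():
            eres.append(el)
        ENVS.pop()
    return eres

# Emp where salary > 1100
res = where(bind("Emp"), lambda: deref(bind("salary")[0]) > 1100)
print(res)   # ['i1'] - only Kowalski's Emp object satisfies the predicate
```

Inside the predicate, binding "salary" succeeds because the section pushed by nested(el) is searched before the base section: this is precisely the stack discipline that makes the where, dot and quantifier operators compositional.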
Even though SBA is independent of any concrete syntax, an abstract syntax called SBQL (Stack-Based Query Language) is used in order to explain semantic constructs. Stack-Based Query Language is a formalised object-oriented query language in the SQL

or OQL style; however, its syntax has been significantly reduced, in particular to avoid large syntactic constructs like select ... from ... where known from SQL.

3.3.1 Expressions Evaluation

SBQL expressions follow the compositionality principle, which means that the semantics of a query is a function of the semantics of its components, recursively. Similarly to programming languages, the simplest queries are names and literals. More complex queries are created by freely connecting subqueries with operators (provided typing constraints are preserved). There are no constraints concerning the nesting of queries. SBA uses operational semantics in order to define operators. The most important SBQL operators and their semantics are described in the tables below.

Tab. 3-1 Evaluation of traditional arithmetic operators

Unary operators:
1. Execute the subquery.
2. Take the result from QRES.
3. Verify that it is a single result (if not, a run-time exception is raised).
4. For a reference result, dereference is performed.
5. Execute the appropriate operation on the value.
6. Push the final result on QRES.

Binary operators: + - * / = != < <= > >= or and
1. Execute both subqueries in sequence.
2. Take both results from QRES.
3. Verify that they are single results (if not, a run-time exception is raised).
4. For each reference result, dereference is performed.
5. Execute the appropriate operation on the values.
6. Push the final result on QRES.

Tab. 3-2 Evaluation of operators working on collections

Structure constructor operator: , (comma)
1. Initialise an empty bag (eres).
2. Execute both subexpressions in sequence.
3. Take both results from QRES (first e2res, then e1res).
4. For each element (e1) of the e1res result do:
   4.1. For each element (e2) of the e2res result do: create the structure {e1, e2} (if e1 and/or e2 is a structure, its fields are used) and add the structure to eres.
5. Push eres on QRES.

Bag and sequence constructors: bag, sequence
1. Initialise an empty bag (eres).
2. Execute the subquery.
3. Take the result from QRES.
4. The result is treated as a structure and each structure field is added to eres.
5. Push eres on QRES.

Existence operator: exists
1. Execute the subquery.
2. Take the result from QRES.
3. Push false on QRES if the result is an empty collection, otherwise true.

Removing duplicates: unique, uniqueref
1. Initialise an empty bag (eres).
2. Execute the subquery.
3. Take the result collection from QRES (colres).
4. For each element (el) of the colres result do:
   4.1. If there is no element in eres equal to el, then add el to eres.
5. Push eres on QRES.
In order to evaluate the unique operator, elements from colres are subjected to the dereference operation if necessary.

Sum of sets: expr1 union expr2
1. Initialise an empty bag (eres).
2. Execute both subexpressions in sequence.
3. Take the results from QRES.
4. Insert all elements from both results into eres.
5. Push eres on QRES.

Traditional set operators: expr1 minus expr2, expr1 intersect expr2
1. Initialise an empty bag (eres).
2. Execute both subexpressions in sequence.
3. Take both results from QRES (first e2res, then e1res).
4. For each element (e1) of the e1res result do:
   4.1. In case of minus: if e2res does not contain an element equal to e1, add e1 to eres. In case of intersect: if e2res contains an element equal to e1, add e1 to eres.
5. Push eres on QRES.
In order to compare the elements e1 and e2, the operator performs the necessary dereference operations.

Inclusion operator: expr1 in expr2
1. Execute both subexpressions in sequence.
2. Take both results from QRES (first e2res, then e1res).
3. For each element (e1) of the e1res result do:
   3.1. If e2res does not contain an element equal to e1, then the false logical literal is pushed on QRES and the evaluation of the operator is stopped.
4. Push the true logical literal on QRES.

Traditional aggregate operators: sum, min, avg, max, count
1. Execute the subquery.
2. Take the result collection from QRES (colres).

3. The final result is initialised (0 in case of the sum and count operators, otherwise the value of the first element of the colres collection).
4. For each element (el) of the colres result do:
   4.1. Suitably to the given operator, the final result is updated considering the element el or its value.
5. Push the final result on QRES.
In order to evaluate the sum, min and max operators, elements from colres are subjected to the dereference operation if necessary. The evaluation of the avg operator consists of the evaluation of the sum and count operators.

Tab. 3-3 Evaluation of non-algebraic SBQL operators

Projection/navigation: leftquery . (dot) rightquery
1. Initialise an empty bag (eres).
2. Execute the left subquery.
3. Take the result collection from QRES (colres).
4. For each element (el) of the colres result do:
   4.1. Open a new section on ENVS.
   4.2. Execute the function nested(el).
   4.3. Execute the right subquery.
   4.4. Take its result from QRES (elres).
   4.5. Insert the elres result into eres.
5. Push eres on QRES.

Selection: leftquery where rightquery
Similarly as in the case of the dot operator, except for:
   4.5. Verify whether elres is a single result (if not, a run-time exception is raised).
   4.6. If elres is equal to true, add el to eres.

Dependent/navigational join: leftquery join rightquery
Similarly as in the case of the dot operator, except for:
   4.5. Perform the Cartesian product operation on el and elres.
   4.6. Insert the obtained structure into eres.

Universal quantifier: leftquery forall rightquery
1. Execute the left subquery.
2. Take the result collection from QRES (colres).
3. For each element (el) of the colres result do:
   3.1. Open a new section on ENVS.
   3.2. Execute the function nested(el).
   3.3. Execute the right subquery.
   3.4. Take its result from QRES (elres).
   3.5. Verify whether elres is a single result (if not, a run-time exception is raised).
   3.6. If elres is equal to false, then the false logical literal is pushed on QRES and the evaluation of the operator is stopped.
4. Push the true literal on QRES.

Existential quantifier: exists leftquery such that rightquery

Similarly as in the case of the forall operator, except for:
   3.6. If elres is equal to true, then the true logical literal is pushed on QRES and the evaluation of the operator is stopped.
4. Push the false literal on QRES.

Sorting: leftquery orderby rightquery
1. Execute the join operation.
2. Take the result from QRES.
3. Sort the obtained structures according to the second structure field, then the third, fourth, etc.
4. Create a new collection using the first structure fields.
5. Push the final collection on QRES.

Tab. 3-4 Evaluation of auxiliary name defining operators

Assigning auxiliary names to collection elements: subquery as name
1. Execute the subquery.
2. Take its result from QRES.
3. Replace each element of the obtained collection with a binder whose name is the operator parameter and whose value is the given element.
4. Push the final collection on QRES.

Assigning an auxiliary name to the whole collection: subquery groupas name
1. Execute the subquery.
2. Take its result from QRES.
3. Create a binder whose name is the operator parameter and whose value is the obtained result.
4. Push the final result on QRES.

Tab. 3-5 Evaluation of sequence ranking operators

Assigning auxiliary ranking binders to a sequence: seqquery rangeas name
1. Execute the seqquery returning a sequence (all sequences are indexed starting from 1).
2. Take its result from QRES (seqres).
3. Replace each element of the obtained sequence (seqres) with a structure consisting of the given element and a binder whose name is the operator parameter and whose value is the index of the element in the sequence.
4. Push the final sequence, converted to a bag, on QRES.

Extracting elements from a sequence: seqquery[subquery]
1. Initialise an empty bag (eres).
2. Execute the seqquery returning a sequence (all sequences are indexed starting from 1).

3. Take its result from QRES (seqres).
4. Execute the subquery returning a collection of integers.
5. Take the result collection from QRES (intres).
6. For each element (el) of the intres result do:
   6.1. Insert the seqres element with the index el into eres.
7. Push eres on QRES.

In this section the most important SBQL operators have been presented. SBA also enables introducing more sophisticated operators, e.g. transitive closures and fixed-point equations. Nonetheless, the operators essential for the author's work are those presented above.

3.3.2 Imperative Statements Evaluation

The following operators, used to modify the state of data, are also a part of the SBQL language; however, they cannot construct expressions that could be used by other operators to form complex queries.

Tab. 3-6 Evaluation of imperative operators

Assigning a value to an object: :=
1. Execute the right subexpression.
2. Take its result from QRES.
3. Verify that it is a single result (if not, a run-time exception is raised).
4. Perform dereference on the result if necessary to obtain a value.
5. Execute the left subexpression.
6. Take its result from QRES (it is assumed that the result of the left subquery should be a reference to a suitable object).
7. Verify that it is a single reference (if not, a run-time exception is raised).
8. Assign the value of the right subquery result to the object pointed to by the reference result.

Creating an object inside an existing object: :<<
1. Execute the right subexpression.
2. Take its result from QRES (it is assumed that the right subquery returns a binder).
3. Verify that it is a single result (if not, a run-time exception is raised).
4. Execute the left subexpression.
5. Take its result from QRES (it is assumed that the result of the left subquery should be a reference to a suitable complex object).
6. Verify that it is a single result (if not, a run-time exception is raised).
7.
Create a database object according to the binder name and its value. If the binder holds an atomic value, then the new object is atomic. If the binder contains another binder or a structure, then a complex object is created (for nested binders appropriate new subobjects are created).
8. Nest the new object inside the object referenced by the left subquery result.

Removing an object: delete
1. Execute the subquery.
2. Take the result collection from QRES (colres). It is assumed that this collection holds references to existing objects.
3. For each element (ref) of the colres result do:
   3.1. Remove the object pointed to by ref from the database, together with its subobjects and the objects referencing it.

3.4 Static Query Evaluation and Metabase

During compilation, SBQL queries are subjected to static analysis. This process is indispensable for performing static type control and most of the optimisations. Static analysis relies on mechanisms similar to the evaluation of a query. The task of such an evaluation is to simulate as many of the situations that may occur at run-time as possible, however using data appropriate for compile time. Hence, the static analysis does not refer to real data. Instead, it uses a metabase, i.e. a graph of the database schema constructed from the declarations of program entities. A database schema graph is a structure similar to a database graph: it is also modelled using simple, complex and reference objects. The significant differences in contrast to a database graph are the following:

- Instead of particular occurrences of objects, the metabase stores only information about the minimal and maximal numbers of objects, i.e. the cardinality of a collection.
- Instead of specific values, the metabase stores information on data types and the relationships (e.g. static inheritance) between them.
- The metabase additionally contains information which can be used during cost-based optimisations, e.g. data-related statistics.
For example, the following source code fragment:

   i : integer [0..*];
   setvar : record { txt : string; note : string [1..5] };

would result in the following metabase, written according to the AS0 model:

<i₀, entry,
  <i₁, i,
    <i₂, meta_object_kind, META_VARIABLE>
    <i₃, type_kind, PRIMITIVE>
    <i₄, type, INTEGER>

    <i₅, minimal_cardinality, 0>
    <i₆, maximal_cardinality, +∞>
  >
  <i₇, setvar,
    <i₈, meta_object_kind, META_VARIABLE>
    <i₉, type_kind, COMPLEX>
    <i₁₀, type, i₁₃>
    <i₁₁, minimal_cardinality, 1>
    <i₁₂, maximal_cardinality, 1>
  >
  <i₁₃, $x_struct_type,
    <i₁₄, meta_object_kind, META_STRUCTURE>
    <i₁₅, fields,
      <i₁₆, txt,
        <i₁₇, meta_object_kind, META_VARIABLE>
        <i₁₈, type_kind, PRIMITIVE>
        <i₁₉, type, STRING>
        <i₂₀, minimal_cardinality, 1>
        <i₂₁, maximal_cardinality, 1>
      >
      <i₂₂, note,
        <i₂₃, meta_object_kind, META_VARIABLE>
        <i₂₄, type_kind, PRIMITIVE>
        <i₂₅, type, STRING>
        <i₂₆, minimal_cardinality, 1>
        <i₂₇, maximal_cardinality, 5>
      >
    >
  >
>

The compile-time equivalent of a query result is an operation signature. The following kinds of signatures can be distinguished:

- a static reference, that is, a reference to a metabase object,
- a static binder, which contains a name and an associated signature as a value,
- a variant, which contains several possible signatures (used when an unambiguous signature cannot be determined during static analysis),
- a value type representation, which contains the identifier of the primitive type it represents (usually concerning literals and static references to atomic objects when dereference is applied),
- a static structure, which contains a set of signatures representing the fields of that structure.

Each of these signatures carries additional information, e.g. concerning the possible cardinality of the run-time result returned by the query represented by the signature. Besides signatures, the compile-time query analyser is equipped with static equivalents of the environment and query result stacks. In contrast to the run-time stacks, these structures work with signatures and the database schema graph rather than with query results.

3.4.1 Type Checking

The compile-time static query analysis allows for performing static type control [112]. According to type determining rules, which are specified for every operator, the compiler can determine the type of a value returned by a complex query through the analysis of its individual parts. The following example is a single rule concerning the union operator:

bag[a..b](type) union bag[c..d](type) => bag[a+c..b+d](type)

This rule describes a set-theoretic sum of a bag comprising from at least a to at most b elements, represented by the left union operand signature, with another bag, whose cardinality is from c to d elements, represented by the right operand signature. It additionally assumes that the types of elements in these collections must be identical. Consequently, the rule indicates that the final collection preserves the type of the input collections and comprises from at least a + c to at most b + d elements. A set of similar rules is an internal part of almost every programming language. Yet SBQL is also a query language, and thus operation signatures are enhanced with additional information concerning collections, like cardinality and order. The rules created for arithmetic operators are usually more restrictive. For example, for the + operator the following rule can be designed:

value[1..1](integer) + value[1..1](real) => value[1..1](real)

This rule assumes that the addition of integer and real values can be executed if the operands are single values (i.e.
the cardinality of both arguments is [1..1]); otherwise a typing error occurs. Nevertheless, one can wonder whether such an assumption concerning the cardinality is too restrictive and consequently whether some

part of the type checking should be moved to the run-time. An alternative form of the rule above can be written:

value[0..*](integer) + value[0..*](real) => value[1..1](real)

In this case the + operator allows a situation where the actual number of arguments is unknown at compile-time. The suitable control ensues at run-time. Therefore, if the left or right subquery does not return a single value, the interpreter reports a run-time error. Such a solution leads to the so-called semi-strong type system. Let us consider the example query:

(Person where surname = "Kuc").age + 1

It illustrates the reason why a semi-strong type system is more comfortable for a programmer. In this example it is assumed that there exists only one person with the given surname. This assumption is controlled dynamically, i.e. if it is not fulfilled then a run-time error indicates a typing error. In the case of a more restrictive type system, the compiler would reject the above construction.

3.5 Updateable Object-Oriented Views

A database view is a collection of virtual objects that are arbitrarily mapped from stored objects. In the context of distributed applications (e.g. web applications) views can be used to resolve incompatibilities between heterogeneous data sources, enabling their integration [60, 61]. The idea of updateable object views relies on augmenting the definition of a view with information on the user's intents with respect to updating operations. Only the view definer is able to express the semantics of view updating. To achieve this, a view definition is subdivided into two parts. The first part is a functional procedure which maps stored objects into virtual objects (similarly to SQL). It returns entities called seeds that unambiguously identify virtual objects (in particular, seeds are OIDs of stored objects). The second part contains redefinitions of generic operations on virtual objects. 
These procedures express the view definer's intentions with respect to update, delete, insert and retrieve operations performed on virtual objects. Seeds are (implicitly) passed as procedure parameters. A view definition usually contains definitions of subviews, which are defined according to the same rule, following the relativism principle. Because a

view definition is a regular complex object, it may also contain other elements, such as procedures, functions, state objects, etc. The above assumptions and SBA semantics allow achieving the following properties: (1) full transparency of views: after defining a view, its user uses the virtual objects in the same way as stored objects; (2) views can be recursive and (as procedures) may have parameters.
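Before moving on, the cardinality rules and the strict versus semi-strong check discussed in the type checking section above can be sketched as follows. This is an illustrative Python simulation only; the class and function names are invented for this sketch and do not appear in the ODRA prototype.

```python
# Illustrative sketch (not ODRA code): static cardinality rules for SBQL-like
# signatures. A signature carries an element type and a [lo..hi] cardinality;
# hi = None stands for the unbounded upper limit "*".

class Signature:
    def __init__(self, elem_type, lo, hi):
        self.elem_type, self.lo, self.hi = elem_type, lo, hi

def union_rule(left, right):
    """bag[a..b](type) union bag[c..d](type) => bag[a+c..b+d](type)."""
    if left.elem_type != right.elem_type:
        raise TypeError("union operands must have identical element types")
    hi = None if None in (left.hi, right.hi) else left.hi + right.hi
    return Signature(left.elem_type, left.lo + right.lo, hi)

def plus_rule(left, right, strict=True):
    """value[1..1](integer) + value[1..1](real) => value[1..1](real).
    With strict=False the cardinality check is deferred to run-time,
    which corresponds to the semi-strong variant of the type system."""
    if strict:
        for arg in (left, right):
            if (arg.lo, arg.hi) != (1, 1):
                raise TypeError("operand cardinality must be [1..1]")
    return Signature("real", 1, 1)
```

With strict=True the sketch rejects the (Person where surname = "Kuc").age + 1 query at compile-time, exactly as a restrictive type system would; with strict=False the check is postponed, mirroring the semi-strong behaviour.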

Chapter 4 Organisation of Indexing in OODBMS

This chapter concerns primarily the architecture and the rules applying to index management and maintenance. The actual optimisation of query processing is the topic of the next chapter. Nonetheless, improving performance depends on the diversity of exploited index structures and on flexibility in defining an index. The properties of SBQL, in particular orthogonality and compositionality, enable easy formulation of complex selection predicates, including usage of complex expressions with polymorphic methods and aggregate operators. The proposed organisation of indexing provides all necessary mechanisms, so the database administrator is unconstrained in creating local or global indices with keys based on such expressions. The implementation exploits the linear hashing index structure (see section 2.2.1). Nevertheless, the solution does not limit the possibility of applying different indexing techniques, e.g. B-trees. Details of this aspect of database indexing are omitted, since it is generally orthogonal to and independent of index management, maintenance and query optimisation.

4.1 Implementation of a Linear Hashing Based Index

The primary reason for implementing a linear hashing index is the possibility of extending this structure to its distributed SDDS version (see section 2.2.2) in order to optimally utilise distributed database resources. Moreover, the author wants to provide extensive query optimisation support by enabling:
- dense indexing for integer, real, string, date and reference key values (dense index key type),
- support for optimising range queries on integer, real, string and date key values (range and enum key types),
- indexing using multiple keys,
- the enum key type: special support facilitating indexing of integer, real, string, date, reference and boolean keys with a countable, limited set of distinct values (low key value cardinality).
The enum key type provides additional flexibility when applied to multiple

key indices, since such keys can be skipped in an index invocation, i.e. can be considered optional.

4.1.1 Index Key Types

The mentioned properties of indexing are introduced through different designs of a hash function. The dense key type implies that the optimisation of selection queries which use the given key as a condition will be applied only for selection predicates based on the = or in operators. Therefore, a hash function can distribute objects in the index randomly, disregarding the order of key values. Such an index does not support optimising range queries; however, it is faster in processing index invocations with exact match selection criteria. The range key type additionally supports optimisation concerning selection predicates based on the range operators: >, ≥, < and ≤. This is achieved through a range partitioning [62, 75] variant implemented by the author. Within the index, a hash function groups object references in individual buckets (see the bucket definition in section 2.2.1) according to key value ranges. The ranges are dynamically split as the index grows, increasing its selectivity. The last key type, enum, is introduced in order to take advantage of keys with a countable, limited set of distinct values, i.e. keys with low value cardinality. The performance of an index can be strongly deteriorated if key values have low cardinality, e.g. a person's eye colour, marital status (a boolean value) or the year of birth. To prevent this, the index internally stores all possible key values (or key value range limits in the case of integer values) and uses this information to facilitate index hashing. The enum key type can deal with optimising selection predicates exactly as in the case of the range key type, i.e. for the =, in, >, ≥, < and ≤ operators. Multiple key indexing is introduced through defining the overall hash function as a composition of the individual keys' hash functions. An enum type key hash function assigns key values to consecutive hash-codes.
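The composition of per-key hash functions with consecutive enum codes can be illustrated by the following sketch. Python is used for illustration only; this is an assumed design, not the actual ODRA implementation, and all names are invented here.

```python
# Sketch: an enum key maps each value of its countable domain to a consecutive
# hash-code, and the overall hash of a multiple-key index is a composition of
# the per-key codes (here a mixed-radix combination over the enum domains).

class EnumKey:
    def __init__(self, values):
        self.codes = {v: i for i, v in enumerate(sorted(values))}
        self.cardinality = len(self.codes)

    def hash(self, value):
        return self.codes[value]          # consecutive codes 0..n-1

def composed_hash(keys, values):
    """Combine per-key codes so that every key-value combination gets
    its own distinct slot (and thus, ideally, its own bucket)."""
    code = 0
    for key, value in zip(keys, values):
        code = code * key.cardinality + key.hash(value)
    return code
```

When all keys are of the enum type, every combination of key values yields a distinct code, so each combination can be directed to a separate bucket of object references.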
As a result, enum type keys are particularly effective in multiple key indices. First, they can be omitted in index calls generated during optimisation of queries, which improves the flexibility of the index optimiser (details are described in section 5.5.2). Furthermore, index invocation evaluation proves highly efficient if all index keys are enum and the number of indexed objects is large enough. In such conditions each key value combination points to a separate bucket of object

references, which eliminates the necessity of verifying search criteria for retained objects.

4.1.2 Example Indices

The example schema given in Fig. 3.1 opens the possibility of presenting a wide variety of indices supported by the OODBMS indexing engine implemented by the author. The prefix idx is used to distinguish names of indices from other database entities. First, let us discuss simple single key indices created on objects' attributes:
- idxPerAge returns Person objects according to the value of the age attribute. It is assumed that this index is capable of processing range queries.
- idxEmpAge identical to the above, except only for Emp objects.
- idxEmpSalary returns Emp objects queried by their salary attribute. This index can similarly use a salary range as a selection criterion.
- idxPerSurname a dense index returning Person objects according to the string type surname attribute.
- idxDeptName a dense index returning Dept objects according to the name attribute.
- idxPerZip the range index which returns Person objects queried by the zip attribute of their address subobject. It is important to note that the zip attribute is optional and therefore this index stores only Person objects containing this attribute.
- idxEmpCity returns instances of the Emp class according to a complex attribute. It is assumed that this index is dense.
- idxAddrStreet a dense index which returns address subobjects of Person objects according to the street attribute. Differently than in the case of the other indices, non-key objects are defined by a path expression, i.e. Person.address.

The following indices use derived and complex attributes as keys:
- idxEmpDeptName a dense index which uses the derived attribute to retrieve Emp objects.
- idxEmpWorkCity an index using the derived attribute for Emp objects. Additionally, in order to take

advantage of the fact that company departments are located in a limited number of cities (low key value cardinality), the key type is enum.
- idxDeptYearCost the most complex of the indices for Dept objects. The key is based on the expression sum(employs.Emp.salary) * 12, which returns an approximate total cost of salaries of a given department over a year period. It is assumed that this index is range.
- idxEmpTotalIncomes a range index which uses the Emp class method getTotalIncomes() as a key for selecting Emp objects. This method is overridden for instances of the EmpStudent class.

Another powerful feature of the proposed indexing solution is multiple key indexing. Using such an index can strengthen the selectivity property (cf. section 5.4.1), in particular when individual keys return only few distinct values.
- idxEmpAge&WorkCity an index for instances of Emp objects. It consists of two dense keys. The first key is set on the age attribute and the next one on the derived attribute. It is assumed that it is necessary to specify both attributes to take advantage of this index.
- idxPerAge&Surname the last index for indexing Person objects, which also uses two keys. The first key is set on the age attribute and supports range queries; low cardinality of values is assumed (enum key type). The second, dense key is set on the person's surname attribute. This index offers greater flexibility, since the age key can be omitted in an index call.

In order to take advantage of indexing, the administrator must only create proper indices. The rest of the optimisation issues are completely transparent.

4.2 Index Management

All indices existing in the database are registered and managed by the index manager. Besides the list of meta-references to objects describing indices, it also holds auxiliary redundant information needed by the index optimiser and the static evaluator, i.e. 
a list of structures (called Nonkey Structures) maintained for each indexed collection of objects, containing information about:
- the query defining the given collection,

- a reference to the metabase object representing objects belonging to this collection,
- the indices set on the given collection along with their meta-references,
- a list of keys used to index the given collection of objects, holding precise information about each key:
o an expression defining the key,
o a list of indices using the given key.

Efficient access to the elements of the lists mentioned above is provided by auxiliary indices. The structure of the index manager is presented in Fig. 4.1.

Fig. 4.1 Index manager structure

The index manager assists in the index optimisation process by making well-organised information about existing indices available (details can be found in subchapter 5.3). For instance, if the index optimiser processes a where clause which selects objects from the whole Emp collection, then the Nonkey Structures Index would return the necessary information about indices set on EmpClass instances. Such information, in the case of the indices introduced in section 4.1.2, is presented in Fig. 4.2. The figure depicts the following Nonkey Structure:

Indexed objects information:
- query: Emp
- objects variable meta-reference: EmpClass
- list of associated indices: idxEmpSalary, idxEmpCity, idxEmpWorkCity, idxEmpTotalIncomes, idxEmpAge&WorkCity
- list of keys:
1. key expression: salary; indices utilising the key: idxEmpSalary (1st key)
2. key expression: city; indices utilising the key: idxEmpCity (1st key)
3. key expression: getTotalIncomes(); indices utilising the key: idxEmpTotalIncomes (1st key)
4. key expression: age; indices utilising the key: idxEmpAge&WorkCity (1st key)
5. key expression: the derived attribute; indices utilising the key: idxEmpWorkCity (1st key), idxEmpAge&WorkCity (2nd key)
- an auxiliary index of Key Structures according to the key expression field value

Fig. 4.2 Example Nonkey Structure for the Emp collection

Taking advantage of the Nonkey Structure presented above, the index optimiser can efficiently match selection predicates of the where clause with the associated indices' keys.
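The lookup performed by the optimiser over such a structure can be sketched as follows. This is a simplified Python illustration; the dictionary mirrors Fig. 4.2, and the derived work-city attribute is written as workCity purely for the purposes of this sketch, since its exact name is not fixed here.

```python
# Sketch: a Nonkey Structure reduced to a mapping from key expressions to the
# indices utilising them, for the collection defined by the query "Emp".

nonkey_structure_emp = {
    "salary":            ["idxEmpSalary"],
    "city":              ["idxEmpCity"],
    "getTotalIncomes()": ["idxEmpTotalIncomes"],
    "age":               ["idxEmpAge&WorkCity"],
    "workCity":          ["idxEmpWorkCity", "idxEmpAge&WorkCity"],  # assumed name
}

def candidate_indices(structure, predicate_key_exprs):
    """Return the indices whose keys occur among the key expressions
    extracted from the predicates of a where clause."""
    hits = set()
    for expr in predicate_key_exprs:
        hits.update(structure.get(expr, ()))
    return sorted(hits)
```

A where clause on salary alone would thus lead the optimiser to idxEmpSalary, while predicates on both age and the derived attribute would expose the multiple-key idxEmpAge&WorkCity as a candidate.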

4.2.1 Index Creating Rules and Assumed Limitations

Each index has to be unique in its namespace. Because of the wide range of topics discussed in this work, the concept of modules and the namespaces connected with them, developed in the ODRA prototype [68, 119], is omitted. The administrator issues the add index command to create a new index in the database. The syntax of this command is the following:

add index <indexname> ( <typeind_1> [ <typeind_2> ... ] ) on <nonkeyexpr> ( <keyexpr_1> [, <keyexpr_2> ... ] )

where:
- indexname stands for a unique name of the index,
- typeind_i is the type indicator of the i-th index key, specified by one of the following values: dense, range and enum (described in section 4.1.1),
- nonkeyexpr is a path expression defining indexed objects,
- keyexpr_i is a query defining the i-th key used to retain indexed objects.

The number of type indicators corresponds to the number of keys forming an index. Indexed objects are defined by the nonkeyexpr expression, which must be bound in the lowest database section (the database's root) of the environment stack. For simplification, it is assumed that this definition should be built using a path expression (name expressions connected using dot non-algebraic operators). Moreover, this path expression should return a collection of distinct objects to be indexed. This is important because some of the optimisation methods are not currently designed to deal with collections containing duplicates. Using reference objects in defining nonkeyexpr can result in the possibility of indexing duplicates; e.g. usually more than several employees work in a single company department and hence the following path expression: Emp.worksIn.Dept would probably return the same Dept objects many times, because worksIn references associate different employees with the same department.
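For concreteness, following the syntax above, some of the example indices from section 4.1.2 might be created with commands like the ones below. This is a sketch: the exact spelling accepted by the ODRA prototype, e.g. module qualification of names, may differ.

```
add index idxEmpSalary ( range ) on Emp ( salary )
add index idxPerSurname ( dense ) on Person ( surname )
add index idxAddrStreet ( dense ) on Person.address ( street )
add index idxPerAge&Surname ( enum dense ) on Person ( age, surname )
```

Note how idxAddrStreet uses the path expression Person.address as nonkeyexpr, and how the two type indicators of idxPerAge&Surname correspond to its two keys.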
Nevertheless, the mentioned limitation concerning the definition of indexed objects is normal in the context of typical indexing solutions for databases. Additionally, such an

index can be used to enforce a constraint that a collection should comprise distinct objects. Enabling support for more complex definitions of non-key objects is possible; however, it would result in some limitations concerning applying optimisations and in increased implementation complexity of automatic index updating, which is presented in subchapter 4.3. Each key value expression keyexpr_i should be defined in the context of the objects defined by the nonkeyexpr expression. Consequently, the query:

nonkeyexpr join (keyexpr_1 [, keyexpr_2 ... ])

returns non-key objects together with the corresponding key values. Each keyexpr_i should depend on the join operator. Index keys should return values of the following types: integer, real, string, date, reference or boolean. Moreover, each key expression has to be deterministic, i.e. for a given non-key object it must return exactly the same result provided that the data used to calculate it have not changed (this excludes, for example, the usage of a random method). An important property of a created index is the cardinality of its keys. For each key it indicates the possible number of returned values. Usually keys return a single value, so their cardinality is [1..1] and a key value exists for each non-key object. As a result, the whole non-key collection is indexed. When the minimal key cardinality is zero, e.g. the key of the idxPerZip index, some objects can be omitted from indexing, since their key value may not exist. This situation does not disable indexing; however, it introduces several requirements for the database programmer in order not to thwart index optimisation (this problem is explained in detail in subchapter 5.3 and section 5.5.2). Currently the author has not provided support for indexing when a key's maximum cardinality is above singular, because of the ambiguity in generating key values for an object, i.e. more than one key value combination can be generated for a single object. 
Considering such a scenario would require introducing minor changes in generating the index structure and extending index optimisation methods to properly deal with selection predicates working with collections. If all conditions described above are met, the index manager initialises an index structure and creates an index-related meta-object. Next, it proceeds to organise the information required for optimisation. First, a Nonkey Structure corresponding to the given nonkeyexpr expression is located, or a new one is created. Then, the structure is updated with the information depicted in Fig. 4.1 concerning the index and all its keys. Each keyexpr_i expression is marked with the index being created in the proper Key Structure. This terminates the creation of a new index. However, it is crucial that the index manager enables the index updating mechanism, which fills the index structure with the appropriate objects. This topic, together with related issues, is discussed in subchapter 4.3.

4.3 Automatic Index Updating

Indices, like all redundant structures, can lose cohesion if the data stored in the database are altered. Rebuilding an index is to be transparent for application programmers and should ensure the validity of maintained indices. For these reasons automatic index updating has been designed and implemented. Furthermore, the additional time required for an index update in response to a data modification should be minimised. This is critical from the point of view of the efficiency of large databases. Any change to data cannot cause long-lasting verification of existing indices or rebuilding the whole index from scratch. To achieve this, a database system should efficiently find the indices which became outdated because of a performed data modification. Next, the appropriate index entries should be corrected so that all index invocations provide valid answers. Such index updating routines should not influence the performance of retrieving information from the database, and the overhead introduced to writing data should be minimal (particularly when no index has been affected by changes to the database). However, finding a general and optimal solution for index updating is not possible because of the complexity of DBMSs. 
Such a task requires the analysis of many different real-life situations occurring in the database environment in order to minimise the deterioration of performance.

4.3.1 Index Update Triggers

Each modification performed on objects (creation, update and deletion) is executed through the ODRA object store CRUD (an acronym for Create, Read, Update and Delete) interface, which generally is responsible for access to persistent data and other database entities. The proposed approach to automatic index updating concentrates on this element of the system, as it is the easiest and most certain way to trace data modifications.
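The interception point can be sketched as follows. This is an illustrative Python simulation of the CRUD extension; the real mediator is part of the ODRA Java store, and all names here are invented.

```python
# Sketch: the store's update path consults the triggers attached to the target
# object before and after the actual modification, so that only changes to
# objects linked with some index activate the index updating mechanism.

class Store:
    def __init__(self):
        self.objects = {}    # oid -> value
        self.iuts = {}       # oid -> list of (index_name, nonkey_oid) triggers

    def update(self, oid, new_value, updater):
        triggers = self.iuts.get(oid, [])
        for index_name, nonkey_oid in triggers:
            updater.before(index_name, nonkey_oid)   # locate the old index entry
        self.objects[oid] = new_value                # the modification itself
        for index_name, nonkey_oid in triggers:
            updater.after(index_name, nonkey_oid)    # adjust the index entry
```

An object without triggers pays only the cost of one lookup, which matches the requirement that the overhead on writes should be minimal when no index is affected.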

Possible modifications that can be performed on an object are the following:
- updating a value of an integer, double, string, boolean, date or object reference,
- deleting,
- adding a child object (in the case of a complex object),
- other database implementation dependent modifications, e.g. adding a child to an aggregate object (the role of this kind of object in index updating is described in section 4.3.5).

The author has introduced a group of special auxiliary structures called Index Update Triggers (IUTs) together with Triggers Definitions (TDs). These elements are essential to perform index updating. Each IUT associates one database object with an appropriate index through a TD. Existing IUTs automatically initialise the index updating mechanism when a modification concerning the given object is about to occur. More than one IUT can be connected with a single object. TDs provide the means to find objects which should be equipped with IUTs. Additionally, a TD specifies the type of an IUT. An object is associated with IUTs when it participates in accessing non-key objects or in calculating key values for indices. Therefore, a modification of objects not linked with any index does not trigger unnecessary index updating. Altering objects equipped with IUTs is likely to influence the topicality of indices and IUTs. Four basic types of IUTs (each IUT refers to a different TD type) are proposed:
1. Root Index Update Trigger (R-IUT) is by default associated with the root database entry, which is a direct or indirect parent of all indexed database objects. When a new object is created in the database's root, the trigger can cause the generation of a NonkeyPath- or Nonkey- Index Update Trigger (described below) for the new child object. This trigger is also used to initialise or terminate all triggers associated with an index.
2. 
NonkeyPath Index Update Trigger (NP-IUT) is a type of trigger associated with objects which are potential direct or indirect parent objects of new indexed objects. This type of trigger is generated when an index non-key object is defined by a path expression (e.g. the idxAddrStreet index), i.e. when non-key objects are not direct children of the database's root. Similarly to an R-IUT, this

trigger can cause the generation of a NonkeyPath- or Nonkey- Index Update Trigger for the new child object.
3. Nonkey Index Update Trigger (NK-IUT) is a trigger assigned to indexed (non-key) objects. It is generated by the direct parent object's update triggers (R-IUTs or NP-IUTs). The process of creating an NK-IUT consists of the following steps:
- first, an NK-IUT is assigned to the given indexed object,
- the key value is calculated,
- the corresponding index entry is created (if a valid key value is found),
- Key Index Update Triggers (described below) are generated and parameterised with the indexed object identifier.
Creating a child object inside a non-key object initialises routines identical to those of a Key Index Update Trigger.
4. Key Index Update Trigger (K-IUT) is associated with objects used to evaluate a key value for a specific non-key object (whose identifier is passed together with the TD as an additional parameter). Each modification of such objects can potentially modify the process of evaluating a key and hence its value. Therefore, a K-IUT is responsible for updating the corresponding index entry and maintaining the appropriate K-IUTs corresponding to the given non-key object.

Based on the sample store depicted in Fig. 3.2, for the indices idxPerAge, idxEmpWorkCity and idxAddrStreet introduced in section 4.1.2, the example IUTs shown in Fig. 4.3, Fig. 4.4 and Fig. 4.5 would be generated. Let us assume that i0 is the identifier of the database's root. Non-key objects associated with K-IUTs are stated in parentheses.

Fig. 4.3 Example Index Update Triggers generated for the idxPerAge index

Fig. 4.4 Example Index Update Triggers generated for the idxEmpWorkCity index

Fig. 4.5 Example Index Update Triggers generated for the idxAddrStreet index

4.3.2 The Architectural View of the Index Update Process

The overview of the index update process that has been proposed and implemented by the author is presented in Fig. 4.6.

Fig. 4.6 Automatic index updating architecture

When the administrator adds an index, TDs are created before IUTs (this step is shown using the green coloured arrows numbered 1a and 1b):
- The index manager initialises a new index and issues the triggers manager a message to build TDs.
- Next, the triggers manager activates the index updating mechanism which, based on the knowledge about indices and TDs, proceeds to add IUTs:
o This process is initialised by introducing an R-IUT for the database's root entry.
o The R-IUT trigger propagates the remaining triggers to database objects.
o When an NK-IUT is added to an indexed non-key object, a key value is evaluated and an adequate entry is added to the index.

Removing an index causes the removal of IUTs and TDs. Together with NK-IUTs, the corresponding index entries are deleted. The mediator managing the addition and removal of IUTs is a special extension of the CRUD interface. The second case in which the index updating mechanism is activated occurs when the database store CRUD interface receives a message to modify an object which is marked with one or more IUTs (shown in Fig. 4.6 using the blue coloured arrow with number 2). CRUD notifies the index updating mechanism about the forthcoming modifications, and all necessary preparations before the database's alteration are performed. This step is particularly important in the case of changes which can affect a key value for the given non-key object. It consists of:
- locating the index entry which corresponds to the non-key object (a key value is necessary),
- identifying objects that are accessed in order to calculate the key value (they are equipped with an identical K-IUT).

After gathering the required information, CRUD performs the requested modifications and the index updating mechanism proceeds to:
- update index entries for the given non-key object by:
o moving the entry corresponding to the non-key object according to a new key value,

o removing the outdated entry if there is no proper new key value,
o inserting a new entry into the index if a proper key value was calculated only after the database alteration,
- update existing IUTs by generating new ones or removing outdated ones.

This finishes servicing the trigger caused by the alteration of the database.

4.3.3 SBQL Interpreter and Binding Extension

A significant element used by the index updating mechanism is the query execution engine, i.e. the SBQL interpreter (also shown in Fig. 4.6), extended with the ability to:
1. Log database objects that occur during the evaluation of an index key expression. Logging takes place during binding object names on ENVS (other database entities like procedures, views, etc. and literals are discarded). This feature is used to locate all objects which are or should be equipped with K-IUTs.
2. Limit the first performed binding to only one specified object. This feature significantly accelerates and facilitates verification of whether a new child subobject added to an object with an R-IUT or NP-IUT should be equipped with an NP-IUT or NK-IUT, i.e. checking whether the new child is a non-key object or a potential direct or indirect parent of a non-key object.

The only module of the SBQL interpreter which required the author's modifications is the run-time binding manager. The proposed extension is introduced using Java static inheritance; therefore, applying different binding mechanisms to the SBQL interpreter is straightforward. The interpreter is used by the index updating mechanism in order to:
- traverse from the database's root or from objects equipped with an NP-IUT to non-key objects,
- generate a key value for a given non-key object.

Let us consider the following example of adding IUTs, starting with an R-IUT, during the creation of the idxAddrStreet index. 
The SBQL interpreter is used first to evaluate the query Person, which returns identifiers of instances of PersonClass and its subclasses EmpClass, StudentClass and EmpStudentClass, i.e. in this case the i31 EmpStudent and i61 Emp objects. Consequently, NP-IUTs are added to the object i31 and the object

i61. In order to propagate triggers to non-key objects, first, the index updating mechanism performs the nested operation on the i31 object to prepare a suitable context for the SBQL interpreter by pushing the necessary binders onto the environment stack. Then, the query address is evaluated and returns the i36 object. Similarly, actions are taken for the i61 Emp object and the i66 identifier is returned. Therefore, NK-IUTs are added to the address objects i36 and i66. Next, in the context of both non-key objects, the SBQL interpreter evaluates the key expression street. Accordingly, the street objects i34 and i64 containing key values are returned. This procedure allows inserting the two non-key objects into the idxAddrStreet index and building the R-IUT, NP-IUT and NK-IUT triggers. Nevertheless, it is insufficient for finding the objects which should be equipped with a K-IUT, because of the possible key expression complexity, which is not limited only to a path expression. However, the enhancement to the run-time binding manager enables finding those objects during the calculation of a key value. All objects accessed in the evaluation of a key expression by the SBQL interpreter occur during the binding operation. Moreover, this enhancement allows finding aggregate objects which implicitly facilitate binding. Such objects can also be useful in improving the performance of index updating (cf. section 4.3.5). To conclude the example, as a result, IUTs have been generated according to Fig. 4.5. The next section discusses more complex examples concerning K-IUTs in order to present the versatility of the proposed approach.

4.3.4 Example of Update Scenarios

In order to trace example scenarios of index updating, let us refer to the sample store in Fig. 3.2. In the examples presented below the most important elements are object and method identifiers. The classes PersonClass and StudentClass occur during the nesting operation; however, in the presented examples they do not affect the binding operation. 
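Before tracing the scenarios, the basic building block they share, evaluating a key expression in the context of a non-key object via nested and bind, can be simulated in a few lines. This is an illustration only: ENVS is reduced to a list of name-to-identifier sections, the key expression is a single name, and all helper names are invented.

```python
# Sketch: evaluating a single-name key expression in the context of a non-key
# object. "nested" pushes the object's subobjects as binders onto ENVS; "bind"
# searches the stack from the top and logs every identifier it returns, i.e.
# exactly the candidates for K-IUTs.

def evaluate_key(store, nonkey_oid, key_name):
    envs = [store[nonkey_oid]]        # nested: push subobject binders
    logged = []                       # identifiers seen during binding
    for section in reversed(envs):    # bind: search from the top of ENVS
        if key_name in section:
            oid = section[key_name]
            logged.append(oid)        # record the K-IUT candidate
            return store[oid], logged # dereference to obtain the key value
    raise KeyError(key_name)
```

The logged identifiers are precisely the objects that should carry K-IUTs for the given non-key object, as described for the extended run-time binding manager above.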
We assume that all examples are correct, so run-time errors do not occur during evaluation. In particular, the left operand of an assign expression always returns precisely one object to be modified.

Conceptual Example

The given statement concerns updating the age attribute of the Person object whose surname is equal to "Kuc":

(Person where surname = "Kuc").age := 31

According to the store state depicted in Fig. 3.2, the left operand of the assignment returns the age attribute with the identifier i34. The interpreter sends a message to the ODRA database CRUD mechanism to update the value of the i34 integer attribute to 31. Before the update operation, the CRUD mechanism checks for IUT triggers connected with the attribute being modified. Let us assume that, according to Fig. 4.3, there is a K-IUT described by the following properties: < index: idxPerAge, non-key object: i31 > associated with the object i34. Consequently, the index updating mechanism is triggered to calculate the key value used to access the i31 object in the idxPerAge index. It is important that, additionally, during this step the objects affecting the key value are identified. To obtain the key value, the updating routines initialise a new SBQL interpreter instance with an empty ENVS and perform the following operations:
1. A reference to the i31 object is put onto the QRES and the nested operation is performed.
2. New frames are created on the ENVS. The lowest stack section contains the components of classes according to the inheritance hierarchy: first for PersonClass, followed by EmpClass and StudentClass (their order, however, is not predictable because of the multiple inheritance) and above them EmpStudentClass. The top ENVS frame is filled with the subobjects of the i31 EmpStudent object.

Fig. 4.7 Calculating the idxPerAge index key value for the i31 object

Next, the interpreter proceeds to evaluate the idxPerAge index key expression age. The evaluation steps are shown in Fig. 4.7. The bind operation with the age name

83 Chapter 4 Organisation of Indexing in OODBMS parameter is performed. The i 34 attribute is put onto QRES. During binding the i 34 identifier is stored by the index updating mechanism, as it has influenced the value of the key. The key value is obtained by dereferencing the i 34 attribute. The index updating mechanism uses the non-key object i 31 and the calculated key value to locate the corresponding to the i 31 idxperage index entry. This is necessary for modifying the index after updating the key value for the given EmpStudent object. Now the CRUD mechanism can alter the i 34 attribute and assign it the new value 31. After age update, the process of calculating the key value is repeated. In this case, it does not differ from the preceding one presented in Fig The index updating mechanism uses all gathered information to: Update the idxperage index: the entry for the non-key object i 31 with the key value 30 is properly adjusted to the new key value 31. Revise IUTs for idxperage index and non-key object i 31 : before as well as after modifying the i 34 attribute the index updating mechanism identified that i 34 is the only object influencing the key value. Since K-IUT (index: idxperage, non-key object: i 31 ) associated with this object is still valid no changes are made to existing IUTs. This finishes the index updating routines for the example presented above Path Modification The next more complex example concerns reassigning the Emp object which surname is equal to Kowalski to HR department through updating worksin reference: (Emp where surname = Kowalski ).worksin := ref Dept where name = HR This operation causes assignment of the i 141 Dept object reference to the i 70 worksin attribute. If there is the idxempworkcity index then CRUD finds a K-IUT described by the following properties < index: idxempworkcity, non-key object: i 61 > associated with object i 70. 
Therefore, before CRUD proceeds to modify the value of the worksin attribute, the routines presented in Fig. 4.8 are performed in order to calculate the corresponding key value, i.e. the value "Opole" of the i134 object, and to identify the objects which compose the key, i.e. the identifiers occurring during binding: i70 worksin, i131 Dept, i133 address and i134 city (written in green in Fig. 4.8).

Fig. 4.8 Calculating the idxempworkcity index key value before update

The CRUD mechanism performs the update on the i70 worksin attribute value.

Fig. 4.9 Calculating the idxempworkcity index key value after update

From the point of view of the index updating mechanism, this modification introduces significant changes in the evaluation of the key expression. As can be seen in Fig. 4.9, not only has the key value changed to "Kraków", i.e. the value of the i144 city attribute, but also the set of identifiers of objects which affect the key value is different, i.e. i70 worksin, i141 Dept, i143 address and i144 city. The index updating mechanism uses all the gathered information to:
- Update the idxempworkcity index: the entry for the non-key object i61 with the key value "Opole" is adjusted to the new key value "Kraków".
- Revise IUTs for the idxempworkcity index and non-key object i61: before as well as after modifying the i70 attribute, the index updating mechanism has identified that the i70 object influences the key value. However, the objects i131, i133 and i134 no longer affect the key value and therefore the K-IUTs <index: idxempworkcity, non-key object: i61> associated with them are removed. On the other hand, the objects i141, i143 and i144 now assist in computing the key value, so K-IUTs <index: idxempworkcity, non-key object: i61> are assigned to them.
It is important to note that modifying the worksin attribute of the i61 Emp should be followed by suitable changes in Dept objects to ensure consistency between the worksin and employs references. However, this is not an issue of the automatic index updating but rather of a particular database application.

Keys with Optional Attributes

The given example consists of two statements removing and adding a zip attribute for an address subobject of an Emp object whose surname is equal to "Kowalski". It shows the main idea of how the automatic index updating deals with deletion and creation of objects.
The following statement causes the database CRUD mechanism to remove the i69 object:

delete((Emp where surname = "Kowalski").address.zip)

Let us assume that the idxperzip index is created and hence, before the deletion, the index updating mechanism finds a K-IUT described by the following properties <index: idxperzip, non-key object: i61> associated with the object i69. The index updating mechanism calculates the key value corresponding to the non-key object (the states of the SBQL stacks during evaluation are presented in Fig. 4.10), i.e. the value of the i69 object.

Additionally, the objects which influence the key value are identified, i.e. the identifiers occurring during binding: the i66 address and the i69 zip. Because of the removal, the latter identifier will not be further considered by the index updating mechanism during update trigger revision.

Fig. 4.10 Calculating the idxperzip index key value before removing the zip attribute

The CRUD mechanism deletes the i69 zip attribute together with all associated IUTs. Consequently, a successful evaluation of the key value is no longer possible (it is depicted in Fig. 4.11). Despite the lack of a key value, the index updating mechanism finds the identifiers of objects which are used during the key value calculation, i.e. the i66 address.

Fig. 4.11 Calculating the idxperzip index key value without the zip attribute

The index updating mechanism uses the gathered information to:
- Update the idxperzip index: the entry for the non-key object i61 is removed.
- Revise IUTs for the idxperzip index and the non-key object i61: before as well as after the modification, the i66 attribute influences the key value. The i69 zip attribute no longer affects the key value; however, the K-IUT <index: idxperzip, non-key object: i61> associated with it was already removed during the deletion. As a result, no other changes are made to the existing IUTs.
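The treatment of optional key attributes above can be sketched as follows. This is a hypothetical simplification (ABSENT, eval_optional_key and apply_key_change are names invented for the example, not ODRA's API): the key expression is modelled as a path of names, a missing step yields the marker ABSENT, the index then stores no entry for the non-key object, and the objects bound before the failure are still recorded as influencing the key.

```python
ABSENT = object()  # marker: the key expression yielded no value

def eval_optional_key(children, values, path, ctx_id, on_bind):
    """Follow a path of names (e.g. ["address", "zip"]) starting at the
    non-key object; identifiers bound on the way are reported via on_bind
    even when a later step fails because an optional subobject is missing."""
    oid = ctx_id
    for name in path:
        sub = children.get(oid, {})
        if name not in sub:
            return ABSENT            # e.g. the optional zip was deleted
        oid = sub[name]
        on_bind(oid)                 # this object influences the key value
    return values[oid]               # dereference the final attribute

def apply_key_change(entries, nonkey_id, old_key, new_key):
    """Adjust the index (key value -> set of non-key ids); an ABSENT key
    contributes no entry at all."""
    if old_key is not ABSENT:
        entries.get(old_key, set()).discard(nonkey_id)
    if new_key is not ABSENT:
        entries.setdefault(new_key, set()).add(nonkey_id)
```

With the zip attribute deleted, evaluating ["address", "zip"] for i61 returns ABSENT while still reporting i66 as an influencing object; after the insertion of a new zip, both i66 and the new attribute are reported and the entry reappears in the index.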

Let us analyse how the automatic index updating deals with inserting a new zip attribute into the address object:

(Emp where surname = "Kowalski").address :<< zip(99726)

The index updating mechanism finds a K-IUT described by the following properties <index: idxperzip, non-key object: i61> associated with the object i66. Before the insertion, the key value corresponding to the non-key object is calculated. The state of the key value has not changed; so, identically as in Fig. 4.11, no key value is found. During the key value computation the i66 address object is used. The CRUD mechanism creates the new zip attribute with the value 99726 and the identifier i104 and inserts it into the i66 address object. The index updating mechanism proceeds to the evaluation of the key expression according to the steps presented in Fig. 4.12. The objects which influence the key value are identified, i.e. the i66 address object and the new i104 zip object.

Fig. 4.12 Calculating the idxperzip index key value after inserting the zip attribute

Consequently, the index updating mechanism:
- Updates the idxperzip index: the entry for the non-key object i61 with the key value 99726 is added.
- Revises IUTs for the idxperzip index and the non-key object i61: before as well as after inserting the new zip attribute, the index updating mechanism identified that i66 influences the key value. Additionally, the i104 object is assigned the K-IUT <index: idxperzip, non-key object: i61> because it participates in computing the key value.

Polymorphic Keys

The following example concerns the idxemptotalincomes index with a key based

on the gettotalincomes() method, which is polymorphic depending on the class of the non-key object. For EmpClass instances gettotalincomes() returns the value of the salary attribute:

return deref(salary)

whereas for EmpStudentClass it also takes into consideration the scholarship attribute:

return deref(salary) + deref(scholarship)

The statement below concerns updating the salary attribute of the Emp object whose surname is equal to "Kowalski":

(Emp where surname = "Kowalski").salary := 2000

As a result of the evaluation, the SBQL interpreter sends a message to the CRUD mechanism to modify the i71 salary attribute of the i61 Emp object. Before the modification is executed, the index updating mechanism finds the K-IUT <index: idxemptotalincomes, non-key object: i61> associated with the i71 salary attribute. The key value for the non-key object is computed. The binding operations performed during the evaluation presented in Fig. 4.13 indicate that the objects which influence the key value are i14 gettotalincomes, i.e. the EmpClass procedure object, and the i71 salary object. The procedure identifier (written in red) can be discarded by the index updating mechanism during update trigger revision.

Fig. 4.13 Calculating the idxemptotalincomes index key value for the i61 object before update

After the salary attribute update, the second calculation of the key value is similar to the one presented above in Fig. 4.13. Only the final value changes to 2000. Considering this information, the index updating mechanism:
- Updates the idxemptotalincomes index: the entry for the non-key object i61 is adjusted to the new salary key value 2000,
- Revises IUTs for the idxemptotalincomes index and non-key object i61: before as

well as after the update, the same IUTs were identified by the index updating mechanism; hence no changes are done.
Let us consider how the automatic index updating deals with an EmpStudentClass instance. The given statement concerns updating the scholarship attribute for EmpStudent objects whose age is equal to 30:

(EmpStudent where age = 30).setScholarship(1500)

According to the sample schema in Fig. 3.2, due to the invocation of the setter method setScholarship(), the i39 scholarship attribute of the i31 EmpStudent object will be updated. Again, before the modification the key value for the idxemptotalincomes index is calculated. The interpreter routines depicted in Fig. 4.14 show that the objects i24 gettotalincomes (the EmpStudentClass procedure object), the i41 salary attribute and the i39 scholarship attribute influence the key value. The procedure identifier can be discarded during update trigger revision.

Fig. 4.14 Calculating the idxemptotalincomes index key value for the i31 object before update

After the scholarship attribute update, the second calculation of the key value is similar to the one presented above in Fig. 4.14. Only the final steps differ (see Fig. 4.15). In order to conclude the CRUD operations, the index updating mechanism:

- Updates the idxemptotalincomes index: the entry for the non-key object i31 is adjusted to the new key value 2500,
- Revises IUTs for the idxemptotalincomes index and the non-key object i31: before as well as after the update, the same IUT triggers were identified by the index updating mechanism; hence no changes are done.

Fig. 4.15 Last steps of computing the idxemptotalincomes index key value for i31 after update

The proposed approach, presented on the examples above, is due to its generality capable of dealing with updating indices with even more complex keys. Extending this solution to support AS2 and the following abstract store types (which consider dynamic inheritance and encapsulation, as depicted in section 3.1.2) does not require introducing significant changes.

Optimising Index Updating

The presented solution to index updating is universal and versatile; however, without optimisations it can cause unnecessary performance deterioration, particularly in simple updating cases. In the most common scenario, a key value is defined by a path expression (e.g. the indices idxperage, idxperzip, idxempdeptname, idxempworkcity). Often, alterations concerning an indexed object's key update the very object which holds the key value. Such an object could be equipped with a different type of trigger, the Index Key Value Update Trigger (KV-IUT), instead of the K-IUT. Modifying the value of an object equipped with this trigger does not require using the SBQL interpreter to recalculate the key value. Moreover, revising IUTs is also unnecessary. This significantly simplifies the index updating mechanism. For example, in the case of the database state presented in Fig. 3.2 and the idxempworkcity index, the following statement changing the city of the HR department to Warszawa:

(Dept where name = "HR").address.city := "Warszawa"

would execute the KV-IUT associated with the i144 city object and the i31 EmpStudent non-key object. However, in order to calculate the key value, instead of executing the key query in the context of the non-key object, the index updating mechanism directly dereferences the city object. Moreover, revising K-IUTs is skipped.
The next optimisation takes advantage of aggregate objects, which are used to model a collection of objects with the same name and type. Aggregate objects are a physical optimisation for searching subobjects, used when the cardinality of a subobject is not singular. The parent object, instead of multiple subobjects with the same name, contains one aggregate subobject. Calling all subobjects by their common name is achieved through the mediation of the aggregate, although the aggregate is their direct parent. If similar IUTs refer to such a collection of objects, then their aggregate parent can be equipped with the identical IUT; consequently, it can be automatically propagated to newly created children of the aggregate object. For example, let us consider the idxperage index and adding a new Person object to the database. The new Person object is not added directly to the database's root, but to a Person aggregate object. Therefore, the correct NK-IUT is simply propagated from the aggregate and does not need to be generated by the R-IUT, which is more complex since it requires an additional verification procedure.
In the current implementation the index updating mechanism works within the range of an atomic database CRUD operation. Still, even a single statement can often cause several changes to the database. In many cases it would be optimal to gather the necessary information during the execution of a series of atomic operations and delay the index updating and index update trigger revision to the very end of a complex operation.
This would, however, require cooperation with the database transaction mechanism, which is still under development in the implemented prototype. The following example statement:

(Dept where name = "HR").address := ("Warszawa" as city, "Koszykowa" as street)

consists of four atomic CRUD operations: first the deletion of the i144 city and i145 street objects and next the creation of new city and street objects. In the case of the idxempworkcity index, it results in running the index updating at least three times for the i31 EmpStudent non-key object (only the deletion of the i145 street object is not connected with K-IUTs). However, it would be more efficient (approximately three times faster) to

execute the index updating mechanism only before the first atomic deletion, to gather the necessary information, and after completing the creation of objects. The depicted lazy index updating strategy would be optimal if most of the index maintenance routines occurred in the database's idle time, i.e. after a data-modifying statement's execution but before the next index invocation.
The last approach to the optimisation of index updating concerns not only efficiency but also decreasing the database load by removing unnecessary IUTs. If identical K-IUTs refer to a collection of identical subobjects, then only their aggregate parent can be equipped with the K-IUT instead. This solution would reduce the space occupied by the IUTs in the case of more complex indices (e.g. in the idxdeptyearcost index, employees of a department are accessed using references inside the employs aggregate object); nevertheless, the index updating mechanism has to additionally check for update triggers of the parent aggregate objects. With the transaction mechanism introduced, together with aggregate objects, maintaining some of the IUTs would become unnecessary. This method must precisely take into account the architecture of the database store and the properties of an object-oriented query language. In the current implementation, aggregate objects are automatically created for a complex object containing subobjects with cardinality different than singular. Therefore, a statement cannot create a new direct child of an existing complex object unless the appropriate subobject was earlier deleted within the processing of the given statement. The K-IUTs connected with complex objects would then not be necessary, because a similar trigger responsible for the preparation of the index update mechanism would have been started earlier, during the deletion of the subobjects of the complex object.
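Assuming the cooperation with a transaction mechanism mentioned above, the lazy strategy could look roughly like the sketch below (DeferredIndexUpdater and its callbacks are hypothetical names, not part of the ODRA prototype): the key value is recomputed once before the first atomic change and once at the end of the complex operation, regardless of how many atomic CRUD operations the statement produces.

```python
class DeferredIndexUpdater:
    """Gathers information during a series of atomic CRUD operations and
    performs the actual index adjustment only once, at the end of the
    complex operation (a sketch of the lazy strategy described above)."""

    def __init__(self, recompute_key):
        self.recompute_key = recompute_key   # (index, nonkey_id) -> key value
        self.pending = {}                    # (index, nonkey_id) -> old key

    def on_atomic_change(self, index, nonkey_id):
        # Called by every atomic CRUD operation hitting a triggered object;
        # only the first call per pair records the pre-operation key value.
        pair = (index, nonkey_id)
        if pair not in self.pending:
            self.pending[pair] = self.recompute_key(index, nonkey_id)

    def commit(self):
        # Each affected index entry is adjusted exactly once.
        for (index, nonkey_id), old_key in self.pending.items():
            new_key = self.recompute_key(index, nonkey_id)
            if new_key != old_key:
                index.entries.get(old_key, set()).discard(nonkey_id)
                index.entries.setdefault(new_key, set()).add(nonkey_id)
        self.pending.clear()
```

For the address-replacement statement above, the three atomic operations touching K-IUTs would then trigger only one pre-operation and one post-operation key computation for the i31 non-key object, instead of a full updating cycle per operation.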
For instance, in the previous example, concerning the idxempworkcity index and modifying the address of the HR department, deleting the i144 city subobject should initialise the index updating mechanism. Therefore, the K-IUT associated with the i143 address object would not be necessary. Similarly, for the given i31 EmpStudent non-key object, the K-IUT for the i141 Dept object can also be omitted.
The majority of the optimisations sketched above, proposed by the author, are implemented in the ODRA database prototype. Modifications which concern taking advantage of the transaction mechanism are planned to be implemented together with the development of transactions in ODRA. The other source of potential optimisations concerning the index maintenance for indices based on path expressions is the research

literature, e.g. [10, 11, 12].

Properties of the Solution

The proposed index updating mechanism meets the guidelines depicted in the introduction to subchapter 4.3 and exhibits several supplementary advantages, i.e.:
- each modification to the indexed data is automatically reflected in the contents of the appropriate indices,
- index updating routines do not influence the performance of retrieving information from the database,
- index updates are triggered only in the case of modifications concerning objects used to access the indexed objects or to determine key values,
- a modification of a single key value introduces an additional time overhead comparable to the time of calculating the given key value twice and performing the modification of the index records,
- the performance of automatic index updating can be improved by the many optimisations described in subchapter 4.3.5,
- the basic solution is independent from:
  o the query language and execution environment (it does not require additional routines during compile time or run time),
  o the index structure,
- generic support for a variety of index definitions (including usage of complex expressions with polymorphic methods and aggregate operators).
On the other hand, the proposed solution to the index updating issue introduces many additional database structures. Unfortunately, almost every object used to access indexed objects or to calculate a key value must be equipped with appropriate IUTs (some exceptions are depicted in subchapter 4.3.5). This is caused by the properties of the SBQL query language, which in many situations make it difficult or even impossible to predict, basing only on an index definition, which objects should trigger an index update. Nevertheless, the author does not exclude the possibility of developing more optimisation methods for this aspect of the index updating process. In particular, such space-preserving optimisations can be easily introduced for very simple indices, e.g. on objects' attributes.

Comparison of Index Maintenance Approaches

The ODRA OODBMS, as well as the implemented index updating mechanism, is a proof-of-concept prototype. There is a great number of available relational, object-relational or object-oriented databases based on different paradigms and exploiting different index structures for diverse applications. Those systems generally lack detailed efficiency comparisons with other existing solutions in the aspect of maintaining index cohesion. A fair comparison of approaches can be conducted considering general properties of the index maintenance and its influence on the capabilities of database indexing. Thus, the comparison of efficiency between the proposed solution and the solutions applied in other systems is omitted. The overview concerning the indexing features of many existing products and some prototype approaches is given earlier, in subchapters 2.3, 2.4 and 2.5.
Routines responsible for index maintenance in relational and object-relational databases are straightforward and therefore simple. An undoubted advantage of the index updating approach in the majority of relational databases is an economical usage of the data store. The information necessary for the mechanisms maintaining cohesion between data and indices is associated with table columns, as it is identical for each row. The quantity of such information is therefore independent from the quantity of data stored in tables (i.e. the number of table rows). Similarly, object-oriented databases associate automatic index updating mechanisms with a whole collection or a class rather than with an object. In contrast, in the implemented solution for the ODRA object-oriented database, IUT triggers are in many cases written together with complex objects and atomic objects containing values. Fortunately, the majority of a database's store space is occupied by the data rather than by the redundant information.
This situation is acceptable considering that nowadays databases administer a very large amount of memory (or disk) space.
The proposed implementation of fully transparent indexing in the Stack-Based Approach enables the creation and automatic maintenance of indices with keys defined using arbitrary deterministic expressions, including method invocations (also of polymorphic methods and aggregate functions), e.g.:
- idxempdeptname: key based on a path expression,
- idxdeptyearcost: key based on the sum(employs.emp.salary) * 12 expression,
- idxemptotalincomes: key based on the Emp class method gettotalincomes().

This method is overridden for instances of the EmpStudent class.
As was said in subchapter 2.5, the properties of a query language (e.g. SQL), a lack of appropriate object-oriented extensions or a primitive approach to index maintenance limit the definition of advanced indices. For these reasons, most of the advanced object-relational transparent indexing approaches, including SQL Server (computed columns) and Informix (functional indexes), do not provide sufficient support to introduce indices of complexity similar to the ones presented above. Similarly, the IBM DB2 Universal Database, in spite of proposing the Index Extensions, which are very powerful indexing tools, also has not provided sufficient transparent solutions. From among the OODBMSs, only GemStone's products enable indices based on a path expression, like the idxempdeptname index. The Oracle function-based index feature, despite the lack of support for path expression based indices, provides facilities for creating an index similar to the idxemptotalincomes index. The test conducted in the subchapter mentioned above used the schema in Fig. 2.3, partially corresponding to the object store in Fig. 3.2. The created emp_gettotalincomes_idx Oracle index is based on an analogous polymorphic method. The disadvantage of the Oracle equivalent of the discussed index concerns its influence on the database performance. The index updates occur in the case of any modification to the indexed table, not only those concerning columns used to determine key values. The attempt to introduce in Oracle an index dept_getyearcost_idx corresponding to the idxdeptyearcost index was unsuccessful. The modifications to a table with data about employees, which were used to calculate the index key, caused dept_getyearcost_idx to lose cohesion with the data. There are no similar errors concerning the maintenance of the idxdeptyearcost index in the ODRA implementation.
Advanced approaches to indices based on path expressions are described in many research documents, e.g. [10, 43]. The index maintenance issue is usually solved by preserving additional information inside the index structure, which enables efficient and correct index updating. To the best of the author's knowledge, the implemented solutions concerning transparent index maintenance presented in the research literature or incorporated in commercial products apply to a specific family of index definitions and cannot be considered generic.
A generic, although not implemented, solution for the maintenance of function-based indexes is defined in [46]. Similarly to the ODRA implementation, the index updating

information is connected with objects associated with indices.
In contrast to all the solutions to the automatic index updating issue presented above, the author's approach based on Index Update Triggers, implemented in ODRA, provides transparent, complete and generic support for a variety of index definitions. Moreover, the additional data modification costs associated with the index maintenance concern exclusively objects used to access the indexed objects or to determine a key value. One can argue about the increased storage cost caused by IUTs. Nevertheless, as is shown in [10, 11, 12], the maintenance of indices defined using complex expressions requires introducing a lot of additional information in the index structure (not only entries of indexed objects according to the values pointed to by path expressions). Another advantage of the author's IUTs set on objects used to determine a key value is that they include a direct reference to the indexed object, whereas other solutions [10, 43] are often forced to identify it indirectly (e.g. by reverse navigation methods or by accessing the key value first and looking up the indexed object in the index).

4.4 Indexing Architecture for Distributed Environment

The different aspects of indexing presented in this chapter form the complete architecture of local index management and maintenance. In subchapter 2.6 the local indexing strategy is explained. It relies completely on the local indexing architecture and general optimisation methods for distributed query processing (i.e. global query decomposition). Therefore, the analysis of this strategy is considered straightforward and is omitted in this subchapter. The discussed global indexing architecture concerns homogeneous, horizontally fragmented data on the integration schema level. It is a currently developed approach (in the ODRA prototype) to the integration of distributed resources.
The integration schema describes how data and services residing on local servers are to be integrated. It consists of individual schemas. The idea of a schema is a combination of an interface known from object-oriented programming languages and a typical database schema. A schema is an abstract description specifying objects with attributes and methods that must be provided by a group of servers contributing to the given schema. Nonetheless, local servers implementing the schema still have wide autonomy. Contributed objects can be either materialised or virtual, using SBA views. They can contain additional attributes and methods not included in the schema. Moreover, contributing

servers provide their own implementation of object methods and can transparently take advantage of inheritance and polymorphism. Generally, schemas enable type-safe querying of integrated, horizontally fragmented data. A query addressing an integration schema is decomposed into parts referring to individual schemas, and the appropriate subqueries are sent to servers to be evaluated locally in parallel. The local evaluation differs depending on the local schema implementation.
According to the taxonomy presented in [124], the global indexing strategy proposed in this subchapter corresponds to a Non-Replicated Index with Index Partitioning Attribute indexing schema. It is the result of the following factors:
- the selected distributed index structure, i.e. the basic SDDS variant, does not replicate parts of an index on different servers,
- in the global indexing strategy, data partitioning and index partitioning are orthogonal.
The data integration approach does not imply any concrete data partitioning method; hence, a distributed index can spread over a greater number of servers than the data. In that context, the described indexing schema is not entirely compatible with the presented taxonomy. Similarly, the centralised indexing strategy is not taken into consideration in the taxonomy.
More advanced and complex data integration, e.g. involving mixed fragmentation, data heterogeneity and replication, can be implemented on top of the presented integration schemas using updateable views (see subchapter 3.5). Such solutions are a topic described in works, e.g. [68], and many research papers, e.g. [2, 39, 60, 61], including those contributed by the author [63, 64, 131]. The next section discusses a proposed approach to indexing management and index maintenance in a distributed object-oriented database.
To conclude this subchapter, an example of indexing in a global schema is presented.

Global Indexing Management and Maintenance

Let us consider creating a global index defined on a schema, addressing a horizontally fragmented collection stored on servers (contributing sites). First, an appropriate index structure is created. The stored non-key values consist of an indexed object reference together with information on its origin, i.e. the contributing site

identifier. A global index can be centralised, i.e. located on one server, or distributed between several indexing sites over the database. Regardless of the indexing strategy, such an index must be made available to many servers. Locally, it can be represented by a proxy forwarding index calls. A centralised index communicates with the proxies on servers directly. In the case of a distributed indexing strategy, an individual proxy can forward index calls to an arbitrary indexing site hosting an index part. Optimally, a proxy may transparently become a part of the distributed index. Further processing of an index call and the communication between indexing sites depend on the particular index implementation. For example, the linear hashing implementation discussed in subchapter 4.1 can be used for centralised indexing. It can be extended to an SDDS distributed index in order to preserve the indexing properties and enable parallel processing.
The next step of creating the global index is its registration. Subchapter 4.2 proposes an organisation of the index management which can be applied in its entirety to indices on the level of an integration schema. The auxiliary information provided by the index manager, which is needed by the index optimiser and the static evaluator, is used in the same way as in the case of local indexing. The main difference lies in the fact that information about global indices must be replicated, together with the integration schema, on all servers which can utilise it. Obviously, the indices referenced by the index manager can be local proxies enabling communication with the appropriate centralised or distributed index.
Next, the index manager initiates populating the index. According to the author's approach presented in subchapter 4.3, this is connected with an activation of the automatic index updating. Again, this mechanism relies mainly on the local index maintenance architecture.
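Returning to the proxy mentioned above, a possible shape of such a component is sketched below under simplifying assumptions: the class names and the modulo-hash placement rule are invented for the example, and a real SDDS client would additionally adjust its addressing image when a contacted site forwards the call.

```python
class RemoteIndex:
    """Stand-in for an index (or index part) hosted on an indexing site."""
    def __init__(self):
        self.entries = {}                 # key value -> set of non-key refs

    def lookup(self, key):
        return self.entries.get(key, set())

class IndexProxy:
    """Local representative of a global index that forwards each call to the
    site expected to hold the key; a single site models the centralised
    case, several sites a (naive) distributed placement."""
    def __init__(self, sites):
        self.sites = sites                # list of RemoteIndex stubs

    def site_for(self, key):
        # Placement rule chosen for the example only.
        return self.sites[hash(key) % len(self.sites)]

    def lookup(self, key):
        return self.site_for(key).lookup(key)

    def insert(self, key, nonkey_ref):
        self.site_for(key).entries.setdefault(key, set()).add(nonkey_ref)
```

Non-key references would carry the contributing site identifier alongside the object reference, as described above, so that a match found in the global index can be resolved on the right server.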
It is essential that the currently considered data distribution model disallows storing references to remote objects; therefore, in the presented solution it is assumed that each key value can be calculated within the site of an indexed object. As a result, the index manager delegates activation of the index maintenance to the contributing sites, where appropriate Trigger Definitions are created according to the index definition. Next, Index Update Triggers are generated locally and independently. During this operation objects are inserted into the global index. If the local index maintenance routines evaluating non-key or key expressions encounter elements that make automatic index updating impossible, e.g. view invocations or links to remote databases, then an appropriate error message is sent to the global index manager and the creation of the global index is cancelled.

Concluding, populating the global index and the further transparent index maintenance are provided mainly locally by the architecture presented in Fig. 4.6, where only the index manager and database indices are global (in contrast to the case discussed in subchapter 4.3). The final element of the indexing architecture, i.e. the approach to index transparency from the point of view of query processing, is the topic of Chapter 5. The presented solution is general, as it applies equally to indexing on the local and global levels.

Example on Distributed Homogeneous Data Schema

Let us consider a schema describing horizontally fragmented data presented in the Figure below:

Fig. Example database schema for data integration

It comprises three interfaces defining what attributes and methods the contributed Person, Emp and Dept collections of objects must contain. Contributing sites have to share data fitting the given integration schema. Actual schemas of contributing sites can be distinct; an example database schema that matches the one presented above was introduced in Fig. 3.1. Differences of a local schema, e.g. other collections, inheritance relations between collections, or extra attributes or methods, do not matter as long as the local schema contains the elements required by the integration schema.

Let us consider the creation and work of an idxempdeptname global index using a derived attribute to retrieve Emp objects. First, an appropriate empty, centralised or distributed index structure is initialised and made available among the distributed database servers. Next, it is registered by the index manager, and the information necessary for the index optimiser and static evaluator modules working on queries addressing the integration schema is generated. In the final step of the global index creation, the index manager initialises automatic index updating mechanisms on the contributing sites. On each site this operation causes the following steps (described in detail in section 4.3.2):
- according to the index definition, Trigger Definitions are created,
- a Root Index Update Trigger is added to the database's root,
- Nonkey Index Update Triggers associated with objects belonging to the Emp collection are generated,
- for each non-key object the key value is calculated, the objects used to determine it are equipped with Key Index Update Triggers, and a corresponding index entry is added to the global index.

It is significant that evaluation of the key expression in the context of an Emp object can be performed completely on a contributing site, since the integration model restricts a worksIn reference to point to a local Dept object. This makes the indexing architecture simple and effective. Consequently, changes affecting indexed data are detected locally within a contributing site, and an appropriate global index update command is issued independently by the local index maintenance mechanisms. Finally, the index optimiser does not distinguish between local and global queries when applying indices available in the schema a query addresses.

Similarly, it is possible to create and utilise in the integration schema depicted in the Figure above almost all of the indices introduced earlier for the schema of Fig. 3.1. The only global index that cannot be created by the administrator is idxempsalary, because in the integration schema Emp objects are devoid of a salary attribute.

There exist several other aspects that an implementation of indexing in a distributed environment should consider. The main problems concern dynamic joining and disconnecting of contributing or indexing sites, and distributed transaction management.
However, there exists a variety of solutions addressing those issues that can be applied, e.g. [29, 76].
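The per-site activation steps listed above can be sketched as follows. This is a toy Python illustration under simplified assumptions, not the actual ODRA structures: the object graph is a small class with a children dictionary, trigger kinds are plain strings, and the key path is a list of attribute names (here worksIn.name, evaluable locally because worksIn points to a local Dept).

```python
class Obj:
    """A toy database object: an optional atomic value plus subobjects."""
    def __init__(self, value=None, **children):
        self.value = value
        self.children = children
        self.triggers = []   # index update triggers attached to this object

def activate_index_maintenance(emps, key_path, global_index, site_id, root):
    # A Root Index Update Trigger is added to the database's root.
    root.triggers.append("RootIUT")
    for emp in emps:
        # Non-key triggers on every object of the indexed collection.
        emp.triggers.append("NonkeyIUT")
        # Evaluate the key locally; every object visited on the way
        # receives a Key Index Update Trigger.
        node = emp
        for step in key_path:
            node = node.children[step]
            node.triggers.append("KeyIUT")
        # A corresponding entry (object reference, origin site) is
        # added to the global index.
        global_index.setdefault(node.value, []).append((emp, site_id))
```

A change to any triggered object can then be detected locally and turned into a global index update command, as described in the text.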

Chapter 5 Query Optimisation and Index Optimiser

The research on optimisation of SBQL queries resulted in the work [93], which deeply investigates this issue, and in many papers, e.g. [94, 95, 96, 97, 98, 99, 100, 122]. The goal of the developed optimisation methods is similar to that of optimisation in RDBMSs [20, 29, 54, 55]: the original query is processed in order to improve its efficiency by modifying its default evaluation plan while preserving its semantics. In the implemented approach query optimisation is achieved through query transformations, mostly efficient, reliable and easy-to-implement query rewriting methods. In contrast to relational optimisers, no other intermediate query representations, e.g. an object-oriented algebra, are applied. The transformation processes are facilitated by static query analysis (sketched in subchapter 3.4). Query optimisation exploits information about the size of the environment stack during the evaluation of query parts in order to:
- equip each non-algebraic operator occurring in a query with the number of the ENVS section which it opens,
- assign to each name, when it is bound, the current size of ENVS together with the number of the section where the binding is performed.

Static query analysis also facilitates locating query parts which raise a threat of run-time errors. One of the most important methods exploiting information from the static analysis is factoring out independent subqueries [95, 97]. Frequently a database query contains a subquery whose names are all bound in sections different from the one opened by the currently evaluated non-algebraic operator. Such a subquery can be evaluated before this operator puts its section onto ENVS. Consequently, the calculation of this subquery is planned earlier than would result from the original query syntax tree.
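The independence condition can be stated compactly. The sketch below is hypothetical and heavily simplified (real static analysis works on binding levels recorded in the query syntax tree); it only checks that none of the subquery's names is bound in the ENVS section opened by the operator in question.

```python
def is_independent(subquery_bindings, operator_section):
    """subquery_bindings: (name, section) pairs produced by static
    analysis for the names used in the subquery.
    operator_section: number of the ENVS section opened by the
    currently evaluated non-algebraic operator.
    The subquery can be factored out if no name binds in that section."""
    return all(section != operator_section
               for _, section in subquery_bindings)

# Suppose an outer "where" opens section 2, while "Emp" inside a nested
# subquery binds in the database section 1: the subquery is independent.
assert is_independent([("Emp", 1)], operator_section=2)
# A name bound in the operator's own section blocks factoring.
assert not is_independent([("salary", 2)], operator_section=2)
```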
Factoring out is vital in the optimisation of the evaluation of non-algebraic operators, because it prevents processing the subquery multiple times when its result is always the same. Let us consider the query which retrieves surnames of the employees who earn as much as the employee with the surname "Kuc":

(Emp where salary = (Emp where surname = "Kuc").salary).surname

The SBQL optimiser rewrites it to the following form:

((Emp where surname = "Kuc").salary groupas salaux).(Emp where salary = salaux).surname

The independent subquery, which determines the salary of the given employee, is factored out and therefore is calculated only once, at the very beginning. Its result is stored inside the salaux binder and is repeatedly accessed by the where clause in order to compare the salaries of all employees. Other SBQL optimisations, which are also implemented in the ODRA prototype, take advantage of the distributivity property of some SBQL operators (e.g. pushing selection before join), use redundant database structures (e.g. indices, caching) or perform other query transformations (e.g. removing auxiliary names, removing dead subqueries), etc. Some of these methods are discussed in the context of indexing in further sections.

5.1 Query Optimisation in the ODRA Prototype

ODRA (Object Database for Rapid Application development) [2, 119] is a research platform providing database application development tools. The essential features of the prototype are a functional run-time environment integrated with an OODBMS, the SBQL query language, an optimisation framework, etc. This subchapter depicts the internal architecture of the ODRA optimisation framework. Its schema is presented in Fig. 5.1; it contains data structures (dashed-line figures) and program modules (grey boxes). The architecture reflects only the most important components from the point of view of query optimisation and processing. Each ODRA instance can work both as a client and as a server; this subdivision is introduced to increase comprehensibility. A server can service many clients and a client can communicate with many servers. Fig. 5.1 also illustrates the general SBQL query processing flow.
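In outline, that flow, static evaluation followed by a chain of rewriting optimisers, can be sketched as below. This is a toy model with hypothetical names, not ODRA code; a plain dictionary stands in for the query syntax tree.

```python
from typing import Callable, List

SyntaxTree = dict  # stand-in for the real query syntax tree

def process_query(tree: SyntaxTree,
                  static_eval: Callable[[SyntaxTree], SyntaxTree],
                  optimisers: List[Callable[[SyntaxTree], SyntaxTree]]) -> SyntaxTree:
    # 1. static evaluation attaches signatures (typing information)
    tree = static_eval(tree)
    # 2. each optimiser rewrites the tree; every optimiser is responsible
    #    for returning the tree with a current set of signatures
    for opt in optimisers:
        tree = opt(tree)
    # 3. the optimised, type-checked tree goes on to compilation/evaluation
    return tree
```

The index optimiser discussed next is one element of the `optimisers` chain; unlike the others, it additionally consults the index manager module.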
First, a query is parsed from its textual form to an equivalent query syntax tree. The processing flow concerning the transformations of the syntax tree proceeds according to the numbers in the schema:
1. Static evaluation adds necessary operators (e.g. casts and dereferences) and equips the query syntax tree with signatures which facilitate the optimisers.
2. The query syntax tree is processed through the chain of optimisers in an appropriate order. Each optimiser rewrites the query and returns its syntax tree with a current set of signatures. The index optimiser is considered one of such optimisers; however, it additionally employs the index manager module.
3. The syntax tree of the optimised and type-checked query is sent for further compilation and evaluation to a suitable ODRA module.

Fig. 5.1 ODRA optimisation architecture [2]

5.2 Index Optimiser Overview

The index optimiser is the main mechanism responsible for reorganising queries in order to take advantage of available indices. It is one of the optimisers which can be used in the query optimisation process. The index optimiser is essential to ensure one of the most important indexing properties: index transparency. During compilation of ad hoc SBQL queries or of ODRA modules, which often contain non-optimised queries in procedures, updateable views, generic procedures and class methods, queries are processed by the index optimiser in order to improve their efficiency. Fig. 5.2 illustrates the index optimisation process and all vital cooperating ODRA elements.

Fig. 5.2 Schema of the index optimiser

The index optimiser input is a query which has already passed through static evaluation; therefore, its syntax tree nodes are equipped with signatures containing typing information. The index optimiser adds index calls to the query and performs the necessary modifications. The most important issue concerning all optimisation methods is to preserve query semantics while rewriting, so that the optimisation does not affect the query evaluation result. The transformed query must also preserve typing constraints. The index optimiser communicates with the following ODRA modules:
- Index manager: provides information about indices set on the database's objects. This information is internally ordered and enables the index optimiser to find indices according to their non-keys as well as their keys.
- Metabase: provides a detailed description of the database's schema. The index optimiser uses information about indices from the metabase to determine if an index call can substitute a fragment of the query.
- Cost model: holds statistical information about the properties of database objects' attributes. The index optimiser, when choosing between alternative index combinations, uses the cost model to pick the best solution.
- Static evaluator: calculates signatures in a query syntax tree. Each time the index optimiser applies an index, the modified part of the syntax tree is filled with the description of types.

An example scenario of a query syntax tree transformation applied by the index optimiser is shown in Fig. 5.3.

Fig. 5.3 Example optimisation applied by the index optimiser

The given query concerns retrieving persons with the surname "KOWALSKI" who are 28 years old:

Person where ((surname = "KOWALSKI") and (age = 28))

The index optimiser applies the idxperage index, which retrieves Person objects according to their age attribute, and rewrites the query to the following form:

$index_idxperage(28 groupas $equal) where surname = "KOWALSKI"

Fig. 5.3 shows that first the predicate age = 28 is selected and removed. The index optimiser then replaces the left operand of where (Person) with an index invocation exactly matching the removed predicate. This transformation preserves semantic equivalence.
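The essence of this transformation can be sketched as below. This is a toy model with hypothetical structures, far simpler than the actual optimiser, which works on syntax trees with signatures and consults the cost model: here a selection is just a collection name plus a list of equality conjuncts, and the first conjunct covered by an available index is replaced with an index invocation.

```python
def apply_index(collection, predicates, indices):
    """collection: name of the queried collection (left operand of where).
    predicates: list of (attribute, value) equality conjuncts.
    indices: maps (collection, attribute) -> index name.
    Returns (new where-left-operand, remaining predicates)."""
    for i, (attr, value) in enumerate(predicates):
        idx = indices.get((collection, attr))
        if idx is not None:
            # the index invocation exactly matches the removed predicate
            call = f"${idx}({value!r} groupas $equal)"
            return call, predicates[:i] + predicates[i + 1:]
    # no applicable index: the query is left untouched
    return collection, predicates

src, rest = apply_index("Person",
                        [("surname", "KOWALSKI"), ("age", 28)],
                        {("Person", "age"): "index_idxperage"})
# src now names the index call; rest keeps the surname test for the where
```

In the real optimiser the choice among several applicable indices is driven by the cost model, and the rewritten fragment is re-typed by the static evaluator, as described above.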
