Transparent Integration of Distributed Resources within Object-Oriented Database Grid


POLITECHNIKA ŁÓDZKA
WYDZIAŁ ELEKTROTECHNIKI, ELEKTRONIKI, INFORMATYKI I AUTOMATYKI
KATEDRA INFORMATYKI STOSOWANEJ

mgr inż. Kamil Kuliberda

Ph.D. Thesis

Transparent Integration of Distributed Resources within Object-Oriented Database Grid

Advisor: prof. dr hab. inż. Kazimierz Subieta

Łódź 2010

To my wife Lidia, sons Alan and Natan, brother Artur and my parents for their love, belief and patience...

Index of Contents

ABSTRACT ... 5
ROZSZERZONE STRESZCZENIE ... 8
CHAPTER 1 INTRODUCTION
    Motivation
    Introduction to the Grid Technology
        Technical Computing Grids
        Utility Computing Grids
        Data Grids
    Research Problem Formulation and Proposed Solution
    Theses and Objectives
    Thesis Outline
CHAPTER 2 STATE OF THE ART AND RELATED WORKS
    Grid History
    Contemporary Grid Related Technologies
        SOA and OGSA
        CORBA
        Distributed objects
    Conclusions
CHAPTER 3 TECHNOLOGICAL BASE OF THE NEW APPROACH TO GRID
    Peer-to-Peer Networks
    JXTA Project
        JXTA Architecture
        JXTA Protocols
        JXTA ID
        JXTA Peers
        JXTA Groups
        JXTA Advertisements
        JXTA Modules
        JXTA Pipes
        JXTA Services
        JXTA Security
    EDUTELLA
    Middleware and Federated Databases
        Object-oriented database as a middleware
        Integration strategies
        Distributed objects
        Distributed and Federated Databases
    The egov-bus Virtual Repository
    The ODRA Database
    Conclusions
CHAPTER 4 AN AUTOMATIC INTEGRATION METHODOLOGY OF DISTRIBUTED DATA, GENERAL CONCEPT AND ASSUMPTIONS
    Object Integration Approach in Data-Grid
        Updatable Object-Oriented Views
        Three-level Integration Model
    Data Fragmentation

Page 3 of 221

    4.3 Automated Integration of Distributed Objects
    Approach to Integration using Data Grid Middleware
    Virtual Network
    Integration Procedure for the Grid
CHAPTER 5 PROTOTYPE IMPLEMENTATION AND EXAMPLES
    Prototype Architecture
    Virtual Network Architecture Details
    User's Logical Environment in Grid Database
    Running the Virtual Network for the ODRA-GRID
    Running the ODRA-GRID Environment
    Integration of Distributed Resources in ODRA-GRID Prototype
    Real-Life Example of Integration
        Creation of local objects representing one health centre's database example
        Creation of views representing health centre's virtual contribution objects example
        Script creating integration views for health centre's databases example
        Script creating global views for health centre's databases example
    CD Contents for Example of Prototype Implementation
CHAPTER 6 AUTOMATIC INTEGRATION TESTING RESULTS
    Testing of Employee-Department-Location Schema
    Testing of Modified Employee-Department-Location Schema
    Testing of Modified Employee-Department-Location Schema in Distributed Environment but without Virtual Network and Automated Integration Mechanisms
    Testing of Modified Employee-Department-Location Schema at One Local Host without Virtual Network, Automated Integration Mechanisms
    Testing Summary
CHAPTER 7 SUMMARY AND CONCLUSIONS
    Prototype's Limitations and Further Works
APPENDIX A OBJECT-TO-RELATIONAL WRAPPER AS A DATA RESOURCE FOR ODRA-GRID
    A.1 Wrapper Architecture and Assumptions
    A.2 Example
INDEX OF FIGURES
INDEX OF LISTINGS
INDEX OF TABLES
BIBLIOGRAPHY

Abstract

This Ph.D. dissertation is focused on a novel approach to the transparent integration of distributed and heterogeneous data resources into one common object-oriented database model. The integration process manages three tiers of updatable object-oriented views which, according to specific guidelines, are able to accomplish the mappings from local (low-level) data schemas into the global (top-level) data schema available directly to the users. The logical location of the tiers also enforces the development of a security mechanism which naturally limits unauthorised access from the grid side to a user's local and private data. The management of the views located on particular tiers is supported mainly by a view generator mechanism, responsible for generating an intermediate view called the integration view, which keeps logical information on the dependencies between distributed resources, objects and their locations. This process makes a virtual global data store of the contributing heterogeneous resources available to the users in such a manner that they need not perform explicit integration operations. The integration process is fully automated and transparent to top-level users: they are not aware of the changes which happen inside the virtual repository and can continue their work unaware of the integration facilities and processes. At the back end of the solution a peer-to-peer network is placed, which moves the physical TCP/IP network infrastructure onto a virtual level. Such an approach essentially improves the efficiency of distributed database processing by removing physical network limitations such as corporate network barriers, firewalls and NATs. Moreover, such a solution moves the ordinary database network connection logic into a simpler abstraction layer in which each database is virtually visible as a node with bound data modules.
This is developed from an architectural concept to a prototype implementation using the JXTA framework.

The automated and transparent integration mechanism has been implemented as a part of the virtual repository solution which relies on the stack-based approach (SBA), the corresponding stack-based query language (SBQL) and updateable object-oriented views. The thesis has been developed under the egov-bus (Advanced eGovernment Information Service Bus) project supported by the European Community under the Information Society Technologies priority of the Sixth Framework Programme (contract number: FP6-IST STP). The idea, in its different aspects (including the virtual repository it is a part of, as well as data fragmentation and integration issues), has been presented in over 20 research papers, e.g. [1], [2], [3], [4], [5].

Keywords: grid, data-grid, database, object-oriented, heterogeneous resources integration, automatic object integration, virtual repository, SBA, SBQL, peer-to-peer, P2P, virtual network

POLITECHNIKA ŁÓDZKA
WYDZIAŁ ELEKTROTECHNIKI, ELEKTRONIKI, INFORMATYKI I AUTOMATYKI
KATEDRA INFORMATYKI STOSOWANEJ

mgr inż. Kamil Kuliberda

Rozprawa doktorska

Przezroczysta integracja rozproszonych zasobów wewnątrz obiektowego bazodanowego gridu

Promotor: prof. dr hab. inż. Kazimierz Subieta

Łódź 2010

Rozszerzone streszczenie (Extended Summary)

The term grid is known as a term denoting distributed computing networks. However, the rapid evolution of the Internet, the growth of Internet communities and the global increase in information exchange over the Internet have created the need for data processing in a distributed architecture, and thereby pushed the development of grid systems towards processing data described by business models, i.e. data with a concrete structure. Currently popular solutions, such as P2P (peer-to-peer) networks, in which large amounts of data in the form of media or files can be processed in parallel, do not support the management of structured data. Data of this type is stored in databases. Processing data originating from database sources currently causes serious problems. These problems are related to the proper physical treatment of such data and to the way the user sees the data, which comprises both access to the data and the methods of processing it. The main difficulty in processing is precisely the structure of such data. Business data is usually characterised by a complex structure, and therefore cannot be identified and processed as a plain sequence of bytes. Its structure may often depend on other structures, so in a distributed system managing such data is very difficult and in some cases impossible. Processing distributed structured data described by a business model requires building a model capable of parallel processing of distributed data, in which data and services of various types, residing in physically separated locations, can be made virtually available through their virtual representation. Distributed systems require transparency, i.e. a user working on data should be able to process it regardless of whether it is local data residing on the user's local computer or data retrieved from remote locations. Moreover, the individual locations from which data is retrieved are usually heterogeneous systems. This poses an additional challenge for the designers of such systems: implementing a mechanism for integrating such resources. By data processing we mean not only reading data but also freely updating it. It is precisely the ability to freely modify distributed data, under the assumption that the user introducing the modification is not even aware that he or she operates on remote data, that is the most serious problem left unresolved in other existing distributed systems.

Under the above assumptions, a distributed database system, called data-intensive or a data grid, must guarantee continuity of operation and easy access to data; to achieve this, these properties must be ensured already at the level of the grid architecture itself. This therefore requires a very flexible, easily scalable architecture. In this dissertation the above problems are discussed and a solution is proposed in the context of the data grid architecture. The goal is a database grid characterised by: transparency of the available grid resources during their processing; automatic integration of resources joining the grid; a virtual network connecting the cooperating databases. To achieve this goal, the author proposes the following:

1. Development of a model for integrating distributed, heterogeneous data resources under one common object-oriented data schema. The integration process is based on a three-layer structure of updatable views which, following appropriate guidelines, are able to realise the mapping between the data contributed to the grid, described by the lowest-level data schema residing on the machine of the user joining the grid, and the global grid schema (the highest level) available to all users working in the grid. The three-layer structure of views is a mutual composition of object-oriented updatable views in three layers, such that each higher layer depends on the layer below it. The dependency between layers is determined by a contract which states how the virtual objects of a higher layer depend on the objects of a lower layer. The view definitions must follow the principles of building updatable views described in [6]. We assume that the first layer provides the mapping and contribution of objects from the local database to a schema corresponding to the global schema; as a result, contribution objects matching the contribution schema are created. The second layer performs the integration mapping, conforming to the integration schema, in which all available contribution objects are mapped into collections of integration objects. The third layer maps the collections of integration objects into global objects available directly in the grid.

2. Development of a generic technique for automatic integration of data resources into the global grid schema. This is a mechanism consisting of a set of methods and rules used in the algorithms that generate the views in all three layers of the developed data integration model. The view generation process is carried out on the basis of a specific view definition.

3. Development of a technique for managing and maintaining the automatic integration mechanism, so that individual contributing resources can continuously join and leave the grid. This mechanism applies similar rules and methods when rebuilding the running views as during view generation, with the difference that it operates only at the integration level (in the second layer).

4. Construction of a virtual network based on the peer-to-peer architecture, whose task is to create a communication environment for the grid which facilitates access to databases as grid nodes and makes communication independent of the TCP/IP layers. In addition, it moves ordinary network operation to a higher level of abstraction in which typical network operations are not required.

5. Construction of a middleware layer containing all the above solutions, in which the proposed database grid solution is encapsulated.

The prototype is built on the virtual repository [7] based on the ODRA object-oriented database engine [7], the updatable view mechanism [2], [6], which is the native view mechanism of ODRA databases, and the JXTA Framework [8] for building peer-to-peer networks. The theses of the dissertation are formulated as follows:

1. It is possible to construct a distributed database system working in a parallel processing (grid) architecture using an object-oriented database based on the theory of the stack-based approach and on updatable views.

2. It is possible to integrate local resources into a virtual repository using the mechanism of virtual object-oriented updatable views.

3. It is possible to realise a virtual transport layer for a distributed database system on top of the P2P architecture and to ensure transparency of data processing through it.

The dissertation was carried out within the egov-bus (Advanced eGovernment Information Service Bus) project supported by the European Community under the Information Society Technologies priority of the Sixth Framework Programme (contract number: FP6-IST STP).

The text of the dissertation is divided into the following chapters, briefly summarised below:

Chapter 1 Introduction
The first chapter introduces the subject of the dissertation, including an introduction to grid technology with a brief characterisation of the currently known types of grids. The author's motivations are presented, the goals of the work are formulated and the related problems are identified. In this context the theses of the dissertation are discussed and the solutions developed by the author are outlined.

Chapter 2 State of the Art and Related Works
The state of the art presents the basic notions related to distributed systems, data processing and the integration of distributed data. The history of the term grid is shown, together with examples of early and later grid-related solutions. The basic grid systems, from the first to the third generation, are discussed. Representative examples of existing distributed architectures used in data processing are quoted, together with the problems that most frequently arise in these solutions.

Chapter 3 Technological Base of the New Approach to Grid
This chapter describes a number of technologies and solutions which directly influenced, or are a part of, the solution presented in this dissertation, focusing exclusively on distributed technologies. It characterises the peer-to-peer network architecture and its implementation in the JXTA project, which was also used in the prototype developed within this work. The notions of middleware and federated databases are characterised, and the egov-bus virtual repository, within which the prototype presented in this work was used, as well as the ODRA object-oriented database prototype, are briefly described.

Chapter 4 An Automatic Integration Methodology of Distributed Data, General Concept and Assumptions
This part of the dissertation presents the three-layer model of transparent integration of distributed data sources into one common data schema for the ODRA-GRID solution. The individual layers of the model are discussed, also with respect to the kinds of data fragmentation that occur. The designed, and to a significant extent implemented, architecture of the mechanism for automatic integration of distributed resources into the database grid is presented, based on the ODRA database engine and the updatable view mechanism. The architecture of the virtual network, the middleware layer in which all the proposed grid mechanisms are embedded, is also presented, together with the operation of the mechanism for automatic generation of updatable views used in the automatic integration of distributed resources. The developed methods and mechanisms are supported by appropriate real-life integration examples in SBQL.

Chapter 5 Prototype Implementation and Examples
This chapter contains a detailed discussion of the developed virtual network, its principle of operation and guidelines on how to run the ODRA-GRID prototype and work within it. The working environment made available to a virtual network user is described in detail, together with its logic. Based on two example object-oriented data schemas, the process of generating the integration model inside the ODRA-GRID prototype is described and the consecutive steps of its processing are explained in detail.

Chapter 6 Automatic Integration Testing Results
This chapter presents the test results of the implemented ODRA-GRID system and of the mechanism for automatic integration of distributed resources. The results confirm the effectiveness of the developed methodology. The tests empirically confirm the correctness of the applied solutions, and as a whole they constitute a proof of the theses of the dissertation formulated in Chapter 1.

Chapter 7 Summary and Conclusions
This chapter contains the experience gained and the conclusions drawn while developing the ODRA-GRID architecture and testing the prototype. The developed solutions and research results, which unequivocally confirm the theses of the dissertation, are listed. A separate subsection is devoted to further works which may be carried out to develop the prototype and extend its functionality.

The text of the dissertation is extended with an appendix discussing the use of an object-to-relational wrapper prototype as an additional data source integrating relational data resources into the grid. The wrapper architecture and the method of its integration with the grid are briefly presented.

Chapter 1 Introduction

1.1 Motivation

One of the current trends in software engineering is to provide technologies and tools for developing applications for data-intensive processing in a distributed environment. The roots of this tendency lie in the business requirements (such as globalization and wide use of the Internet) that applications must fulfil. Nowadays we can observe the growth of computer technologies connected with the development of distributed applications for processing large amounts of data. Such new solutions must conform to existing systems in such a manner that both old and new systems can cooperate with each other. Contemporary business models involve many associated services that could be important for global services. The problem concerns the distribution of data across different locations and the forms of this distribution. Usually such data are heterogeneous, fragmented and redundant, so the information available to the users is less useful for reuse in global models. Therefore a mechanism for the common integration of such data is required. There is also a need for a simple solution aiming at transparent integration of various data models into one common global data schema. In many such applications the users should be able to process data in both ways: reading and updating.

The thesis deals with the data grid architecture for a data-intensive grid solution which covers the technical aspects of distributed data processing in a virtual repository. Such a virtual repository provides functionalities and services that are common for distributed resources, including a trust infrastructure (security, privacy, licensing, payments, etc.), web services, distributed transactions, workflow management, etc. Consequently, the thesis presents a proposal for composing an automatic process to integrate various forms of distributed data into one common data schema through the use of virtual repository software. As a result, this work presents a prototype and a technology for transparent processing of different forms of data. The ideas of the thesis are partially developed and implemented as an additional functionality of the virtual repository in the ODRA-GRID prototype.

Currently there are few solutions which address the problems of transporting and integrating distributed data. Most of them deal with integration of distributed data only in the aspect of integrating services and resources for relational database systems. Moreover, such solutions allow processing data in one direction only, from a resource to a client; data modification at the global side is forbidden. The proposal presented in this dissertation uses the object-oriented data model of the ODRA database engine and a peer-to-peer technology as core mechanisms for service and data integration. The approach allows the users to process the data available at the global side using all options, including data modification.

1.2 Introduction to the Grid Technology

The term grid, in the meaning of a specific computer processing architecture, can be dated to the mid-1990s, but the first introduction of the notion can be observed at the beginning of the 1990s. The meaning of the term grid then concerned meta-computation: the use of several computers in parallel, performing the computation of one task divided into many sub-tasks, each sub-task running on a particular machine. In this sense the term grid simply alluded to the well-known electric power grid. In this paradigm a user relies on the cooperation between machines, which gives him or her the power of the sum of their computations [9]. The first definition of a contemporary grid was formulated by Ian Foster and Carl Kesselman in the book The Grid: Blueprint for a New Computing Infrastructure [9].
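The meta-computing reading of the term described above (one task divided into sub-tasks computed in parallel, with the partial results combined at the end) can be sketched as follows. This is purely illustrative and not part of the thesis prototype; the separate machines are simulated by worker threads on a single host, and all names are invented for the example:

```python
# Illustrative sketch only (not the ODRA-GRID prototype): the early
# "meta-computing" sense of grid -- one task divided into sub-tasks,
# each executed by a separate worker, results combined at the end.
from concurrent.futures import ThreadPoolExecutor

def sub_task(chunk):
    # Each worker computes a partial result independently.
    return sum(x * x for x in chunk)

def grid_compute(data, workers=4):
    # Divide the task into roughly equal sub-tasks, farm them out in
    # parallel, and sum the partial results.
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sub_task, chunks))
```

In a real meta-computing grid the workers would be physically separate machines and the "combine" step a network gather, but the division of one task into independently computed sub-tasks is the same.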
The definition concerns mainly a computational grid and states that a computational grid consists of hardware and associated software which together permit reliable, homogeneous, universal and cheap access to huge computational power. Today the above definition is incomplete, by virtue of focusing only on computational grids. Present grids give the users additional functionalities such as resource sharing or data processing. In [10] the grid definition has been refined and considers other aspects of the term. The authors assume that a grid is an open computer system which offers the possibility of using potentially distributed resources and which has the following properties:

1. it uses standard, open, widely adopted protocols and interfaces;
2. it assumes that the shared resources are located in different physical domains, separated e.g. administratively, geographically, by organization, etc.;
3. it significantly improves the quality of the available services in comparison to the direct usage of the particular resources.

Please notice that the term resource is understood generally: it may denote computational power, applications, a set of data or a specific device. The authors of the definition state that any system conforming to the above definition is exactly a grid system.

Grid computing is a term that arose in the last few years to describe a number of computer architecture approaches based on simple but powerful principles. Since then, the term has been broadened to refer generally to the use of shared (commodity) computer components, processing and storage, in a distributed networked architecture. In essence, grid computing is an architectural alternative to monolithic, centralised computation and storage architectures. Summing up, there are at least three common uses of the term grid in present IT terminology:

- Technical Computing Grids employ rack-mount computer systems in scale-out configurations to bring the aggregate processing power of many CPUs to bear on problems of interest;
- (Enterprise) Utility Computing Grids provide an agile, on-demand model for application provisioning and migration based on sharing of common infrastructure resources implemented through commodity computer components (CPUs, networking, storage);
- Data Grids provide for the distributed capture, management, and sharing of information (and sometimes instrumentation), typically across multiple authority domains.
For the goals of the thesis, the last usage of the term is the most relevant.

Technical Computing Grids

This architectural approach is common in High Performance Computing (HPC) applications [10], which employ large amounts of computing power to solve computationally intense problems. Traditional HPC applications in the scientific computing space include:

- energy research and simulation, and high-energy physics research;
- earth, ocean, and atmospheric sciences, global change, and weather prediction;
- complex multi-physics simulations for aerospace design;
- seismic data analysis;
- large-scale signal and image processing applications.

These cases represent the types of applications that used to run on large supercomputers, such as those developed in the early 1980s by Cray (The Supercomputer Company) and others. As commodity-based cluster computing (Beowulf clusters) emerged, many of these applications were re-hosted on large (1000+ nodes) compute clusters, often called grids. Additionally, many new applications have arisen, in part due to the increased availability of these new commodity supercomputers. These applications are increasingly at the core of business-critical operations, and include:

- drug discovery (computational chemistry, genomics research);
- circuit design simulations;
- automobile design simulations (aerodynamics and crash analysis);
- risk analysis of financial portfolios;
- digital media applications (animation and rendering).

Utility Computing Grids

Utility computing is all about leveraging modular computing, network, and storage components to improve resource utilization, increase enterprise agility through rapid application provisioning (and re-provisioning), and simplify IT operations. In short, it is about creating a more nimble, more cost-effective IT organization.

The terms grid, utility computing, and on-demand computing are often used almost interchangeably to describe a wide variety of approaches that are generally aimed at these objectives. These approaches are typically based on two key principles, the foundational pillars of utility computing: consolidation and virtualization [11]. These two principles go hand in hand to support the hosting of multiple applications, either concurrently or in a time-share model, on the same physical resources. For example, server virtualization, as exemplified by VMware, Xen, and Microsoft Virtual Server, supports the hosting of multiple (virtual) servers on a single physical host. The physical resources of the host (CPU, memory, I/O, network connectivity) are shared amongst a number of virtual servers. Each looks like a standalone server (with its own IP address or addresses, its own network and security settings, and its own OS and applications) but shares the underlying physical resources. Server virtualization improves utilization by consolidating multiple applications onto a common physical hardware platform, eliminating the capital expenditures and operating expenses associated with deploying multiple physical servers. This approach is particularly effective in containing the server sprawl that has occurred in many IT organizations where every application instance required its own server (and local storage) [10]. Similarly, network virtualization strategies (including virtual LANs) and storage virtualization strategies allow applications shared use of those infrastructure resources, often employing Quality-of-Service (QoS) provisions. Storage virtualization strategies, in particular, aim at delivering logical storage containers (file systems and LUNs or volumes) that transcend the physical nature of storage systems, disks and controllers.
Together with tiered storage strategies (using different classes of storage for different types of data) and transparent data migration, they deliver a storage-as-services model in which those services are provided in the storage network and not by individual physical devices or servers: a realization of what many have been calling information lifecycle management (ILM).

Data Grids

The notion of a data grid may be the closest concept to the original grid concept developed by Foster et al. and presented in this thesis. It represents a physically distributed set of information resources (services) contributed by multiple authorities under a common set of protocols. In some ways, the World Wide Web represents the first-generation data grid, in which information is published by individual sites, indexed by crawlers, and accessed via search engines and explicitly represented links. The San Diego Supercomputer Centre's Storage Resource Broker (SRB) is another data grid model, presenting catalogued data collections to a community of interest [12]. SRB presents the user with a single file hierarchy for data distributed across multiple storage systems. It has features to support the management, collaboration, controlled sharing, publication, replication, transfer, and preservation of distributed data. In many ways, these efforts represent first-generation data grids, with static or quasi-static data published in web-based documents or files and consumed by browsers and file-savvy applications. The next generation of data or information grid is one based on Web 2.0 technologies, which affords a richer model for compositing individual information sources into a rich set of distributed web services.

1.3 Research Problem Formulation and Proposed Solution

Contemporary parallel and distributed processing of data coming from database sources poses significant problems. These problems are related to the physical treatment of such data and to the way a user sees the data, which comprises access to the data and its processing methods. The main difficulty in the processing is, in that case, the form of the data: the data has a structure which is usually complex. Therefore, it cannot be identified and treated as a simple string of bytes. Often the data structure may depend on other structures, so in a distributed system such data management is very difficult and in some cases impossible. In distributed systems transparency is required, i.e. a user working on data can process it regardless of whether the data is local (located on the user's own computer) or retrieved from remote locations.
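The transparency requirement just stated can be sketched in a few lines. This is a hypothetical illustration, not the thesis prototype: all names are invented, and the remote "transport" is simulated in-process where the real system would use e.g. a P2P pipe. The point is that the client-side code is identical whether the data source is local or remote:

```python
# Hypothetical sketch of location transparency (not the ODRA-GRID prototype):
# the client function below cannot tell local data from remote data.
from abc import ABC, abstractmethod

class DataSource(ABC):
    """Uniform interface; the user never learns where the data physically lives."""
    @abstractmethod
    def read(self, key): ...
    @abstractmethod
    def update(self, key, value): ...

class LocalSource(DataSource):
    """Data kept on the user's own machine."""
    def __init__(self, store):
        self._store = store
    def read(self, key):
        return self._store[key]
    def update(self, key, value):
        self._store[key] = value

class RemoteSource(DataSource):
    """Stand-in for a peer reached over a network; the 'transport' callable
    simulates whatever carries the request (in the real system, a P2P pipe)."""
    def __init__(self, transport):
        self._transport = transport
    def read(self, key):
        return self._transport("GET", key, None)
    def update(self, key, value):
        self._transport("PUT", key, value)

def raise_salary(source: DataSource, amount: int) -> int:
    # Identical code path for local and remote data: both reading and
    # updating work without any location-specific logic.
    source.update("salary", source.read("salary") + amount)
    return source.read("salary")
```

Run against a LocalSource, raise_salary touches an in-memory store; run against a RemoteSource, every access goes through the transport, yet the function itself is unchanged, which is the essence of the transparency discussed above.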
In most cases such remote locations are heterogeneous systems. This poses an additional challenge for the designers of such systems: the implementation of a mechanism for the integration of such distributed and heterogeneous resources. By distributed data processing we mean not only reading the data but, primarily, updating it without limitations. The ability to freely modify distributed data, under the assumption that the user who introduces a modification is not even aware that he or she operates on remote data, is the most serious problem left unresolved in other existing distributed systems.

With the above assumptions, a distributed database system must guarantee continuity of operation and easy access to data. This must be ensured already at the grid architecture level and therefore requires the implementation of a very flexible and highly scalable architecture. In this dissertation the problems presented above are discussed, and a solution in the context of the data grid architecture is proposed. A prototype implementation of the database grid has been determined as the thesis goal. It is characterised by:

- transparency of the available resources during their processing in the grid;
- automatic integration of resources joining the grid;
- a virtual network which connects the cooperating databases.

The author proposes the following tenets to achieve the goals of the dissertation:

1. Development of a model of integration of distributed, heterogeneous data resources into one common object-oriented data schema. The integration process is based on a three-layer structure of object-oriented updateable views, which in accordance with the relevant guidelines are able to realise the mapping between the data contributing to the grid (the lowest-level data schema, located on the machine of the user plugging into the grid) and the global grid schema (the highest level), which is accessible to all users working in the grid. The three-layer structure of the views is a mutual composition of object-oriented updateable views on three levels, such that each view located on a higher level depends on a view located on a lower level. The dependence of the layers is determined by a contract which tells how the virtual objects of a higher layer (the results of its view) depend on the virtual objects coming from a lower layer.
The view definitions must be consistent with the guidelines for building updatable views described in [6]. In the three-layer architecture, a view in the first layer provides an object mapping that contributes objects from the local database to a schema corresponding to the grid global schema; the result is a set of contribution virtual objects matching the contribution pattern. The second layer performs an integrating object mapping according to an integration schema: the collection of available contribution virtual objects is mapped into a

collection of integration virtual objects. The third layer maps the collection of available integration virtual objects into virtual global objects, accessible in the grid directly.

Creation of a generic technique for automatic integration of data resources into the global schema of the grid. It is a mechanism consisting of a number of methods and rules used in algorithms that generate updatable view definitions. These are employed in all three layers of the developed data integration model. View definitions are generated on the basis of a dedicated description of the views.

Creation of management and maintenance techniques for the automatic integration mechanism, so that various contributing resources can continuously join and leave the grid. When views are rebuilt, this mechanism uses principles and methods similar to those of the view generation mechanism; the difference is that it operates at the integration level only (the second layer).

Building a virtual network architecture based on a peer-to-peer network, whose task is to create a communication environment for the grid. The virtual network facilitates access to databases as grid nodes and makes communication independent of the limitations of the TCP/IP layer. Additionally, it moves ordinary networking to a higher level of abstraction, where typical networking tasks are reduced to a minimum.

Construction of the middleware for the database grid, containing all the above solutions.

1.4 Theses and Objectives

The theses are summarised as follows:

1. It is possible to construct an object-oriented database system in a distributed and parallel processing architecture commonly known as a data-intensive grid. The architecture is supported by an object-oriented database management system and a mechanism of updatable object views.

2.
Such a database system should be equipped with unique integration facilities based on updatable object views. This mechanism can transform local resources into a common, global data schema.

3. For more efficient use of a database grid, and to allow transparent data processing, a peer-to-peer virtual transport platform has to be introduced.

The prototype solution accomplishing, verifying and proving the theses has been developed and implemented according to a modular, reusable software development methodology, in which the components are developed independently and combined through predefined interfaces. The dissertation is organised as follows. First, Chapter 2 analyses the state of the art in distributed processing of data from heterogeneous and autonomous resources, and presents the issues that various approaches have encountered in maintaining remote data. Then, Chapter 3 investigates a set of solutions related to distributed data integration, their assumptions, strengths and weaknesses, the experience gathered in the field, and the possibility of adapting them to the designed model. The author's interests, together with the opportunity to implement them within the egov-bus virtual repository software [7], allowed the design of the general modular data grid architecture (subchapters 4.1 and 4.3) and the implementation of a grid middleware based on the general integration assumptions (presented in Chapter 4). The architecture assures genericity and flexibility in the integration of distributed data, as well as in distributed communication and data transfer. Until the work described here, however, the architecture had not been subjected to any substantial verification or experiments. Accordingly, the first working prototype has been implemented and experimentally tested (Chapters 5 and 6). The tests include the object-to-relational wrapping mechanism [13] presented in Appendix A.
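The three-layer view structure proposed in Section 1.3 can be sketched as a chain of mapping functions. The following Java sketch is purely illustrative and uses assumed names (LocalWorker, GlobalEmployee); in the actual prototype the layers are expressed as SBQL updatable views, which additionally propagate updates back through the layers, a capability this read-only sketch does not show.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: the three view layers modelled as plain mapping
// functions over in-memory objects. All class and field names are hypothetical.
public class ThreeLayerViews {

    // A local object whose schema differs from the grid global schema.
    public static class LocalWorker {
        public final String surname; public final int salary;
        public LocalWorker(String surname, int salary) { this.surname = surname; this.salary = salary; }
    }

    // The shape required by the grid global schema.
    public static class GlobalEmployee {
        public final String name; public final int salary;
        public GlobalEmployee(String name, int salary) { this.name = name; this.salary = salary; }
    }

    // Layer 1 (contributory view): maps local objects to contribution virtual objects.
    public static List<GlobalEmployee> contributionView(List<LocalWorker> local) {
        List<GlobalEmployee> out = new ArrayList<GlobalEmployee>();
        for (LocalWorker w : local) out.add(new GlobalEmployee(w.surname, w.salary));
        return out;
    }

    // Layer 2 (integration view): merges contribution objects from many sites.
    public static List<GlobalEmployee> integrationView(List<List<GlobalEmployee>> sites) {
        List<GlobalEmployee> out = new ArrayList<GlobalEmployee>();
        for (List<GlobalEmployee> site : sites) out.addAll(site);
        return out;
    }

    // Layer 3 (global view): what grid users query directly.
    public static List<GlobalEmployee> globalView(List<GlobalEmployee> integrated) {
        return new ArrayList<GlobalEmployee>(integrated);
    }
}
```

Composing the three functions makes objects from every contributing site appear as one homogeneous collection of virtual global objects, which is the essence of the contract between the layers.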
The next stage concerns the integration of heterogeneous data based on the virtual repository mechanism (subchapter 3.5) and the ODRA object-oriented database management system (subchapter 3.6), including an automated integration process described in detail in subchapter 5.6. The goal was to achieve a fully automated solution for transparent inclusion and detachment of distributed objects in a virtual repository. The approach aims at a platform where all clients and data providers can access multiple distributed resources without any complications concerning data maintenance, and at building a global schema for the accessible data. The assumption is that it must be a complete mechanism performing transparent integration of remote objects built on top of a database, including a fully operational database engine and transport platform. Achieving all these features required the creation of a powerful middleware layer. In short, the development process consisted mainly of analysing related solutions with respect to the thesis objectives and to the requirements of integration with the rest of the virtual repository. The resulting automatic integration process is completely transparent for end users and enables full processing of distributed object data (reading, updating, creation and removal) in the grid architecture. The prototype solution accomplishing these goals is implemented in the Java™ language. It is based on:

- the Stack-Based Approach (SBA), providing SBQL (Stack-Based Query Language), a query language with the complete computational power of regular programming languages, together with updatable object-oriented views;
- the JXTA peer-to-peer framework for creating centralised and decentralised P2P networks, which in turn are used to build the grid virtual network and the data transport platform.

1.5 Thesis Outline

The thesis is subdivided into the following chapters:

Chapter 1 Introduction
The chapter presents the motivation for the thesis subject, the theses and objectives, the research problem with a description of the solutions employed, and an overview of grid technologies.

Chapter 2 State of the Art and Related Works
The state of the art and related works aiming at processing distributed data in various forms and services are discussed here. The solutions that had a fundamental influence on this dissertation are briefly described.

Chapter 3 Technological Base of the New Approach to Grid
The fundamentals of distributed data integration methods and their challenges are given, based on existing solutions. The chapter also focuses on the main types of data integration in existing distributed database systems and on the issues appearing in these solutions.

Chapter 4 An Automatic Integration Methodology of Distributed Data, General Concept and Assumptions
The chapter presents the concept and assumptions of the developed and implemented methodology for automatically integrating heterogeneous object-oriented database resources into a virtual repository. A general three-level approach to data integration in a virtual repository is presented, and the grid middleware concept is described.

Chapter 5 Prototype Implementation and Examples
A detailed description of the developed and implemented automatic integration mechanism is given, including the architecture of the grid. Additionally, the assumptions concerning the virtual network platform are explained in detail. Prototype activities are depicted by demonstrative examples based on introduced object-oriented schemata. The chapter contains a number of listings which show complex transformations of integration views and present how the view generation mechanism works.

Chapter 6 Automatic Integration Testing Results
Assumptions and issues of the testing process, as well as the results of testing, are given here.

Chapter 7 Summary and Conclusions
The conclusions and the future work that can be conducted towards further development of the grid and the automatic integration mechanism prototype.

The thesis text is extended with one appendix describing the integration of relational data within the grid using an object-to-relational wrapper mechanism.

Chapter 2 State of the Art and Related Works

2.1 Grid History

The history of the grid begins in the early networking era, before the Internet. Following Foster and Kesselman [9], some sources give the establishment of the ARPANET, and the network itself, as the starting date. However, in accordance with Ian Foster's second definition of the grid, the first system close to that definition can be considered to be DNS (the Domain Name System). It was originally developed to support the growth of communication on the ARPANET and now supports the Internet on a global scale. The distributed DNS was established in the 1980s (initially, from 1973, the service was centralised and called Host Names On-line). Taking Foster's definition into consideration, DNS satisfies all three of its rules: #1, it uses a standard and open protocol, described in RFC 1034 and RFC 1035; #2, it assumes that shared resources are located at different physical locations, and by distributing the DNS service over 13 root name servers the system complies with this rule; #3, it improves the quality and reliability of the whole service. Of course, DNS was introduced long before the term grid appeared, so it is not nowadays really regarded as a grid system. The evolution of distributed systems has brought a division of grids into generations. Below, a historical draft and the characteristics of grid forerunners are given, with dates. Currently there are three generations of grid systems: the 1st generation, the 2nd generation (also known as the future generation) and the 3rd generation (called the next generation). The 1st generation of grid systems assumes that the main issue is access to grid resources on demand. The main components are resource providers, resource

requestors and a resource information system. Such a grid works by moving code (and the necessary files) from resource to resource. This causes two major problems: security (how can the resource owner trust incoming code?) and file staging (how can large files or databases be moved efficiently?). In 2nd-generation (future) grid systems the main issue is access to grid services on demand, and the main components are service providers, service requestors and a service information system. In such a grid, code does not move. Instead, requests (in the form of data) move to service providers, who process them using their own or purchased (and therefore trusted) code. In this way the biggest problem of 1st-generation grids, namely security, is solved (similarly to the current web technology approach). The problem of file staging is not completely solved, but in many cases the service and the necessary database are located on the same site of the grid, so moving large files is less of a burden. Large scientific databases should be created as distributed, multi-site grid services [14].

The first project that achieved great success in the area of 1st-generation grids is Condor [15]. Condor is a queuing system but, in contrast to standard solutions of this kind, it permits, and even emphasises, the cooperation of systems with administrative autonomy. Condor allows its processes to be installed not only on large cluster systems or dedicated computational servers but also on ordinary workstations. As an early grid system it has the unique feature of advanced sharing of specific machine resources (for example, sharing a machine only when it is not used locally). This feature enables interoperability of multiple machines in a grid-like manner. Condor does not, however, fulfil rule #1 of Foster's second grid definition.
Another distributed system which gives users the possibility of sharing computers' computational power is PVM (Parallel Virtual Machine). It is a software package, first published in 1990, that permits a heterogeneous collection of Unix and/or Windows computers connected by a network to be used as a single large parallel computer. Thus large computational problems can be solved more cost-effectively by using the aggregate power and memory of many computers. PVM is distributed as source code which can be compiled on many operating systems and hardware platforms, and it enables users to exploit their existing computer hardware to solve much larger problems at minimal additional cost. Hundreds of sites around the world are using PVM

to solve important scientific, industrial, and medical problems, in addition to PVM's use as an educational tool for teaching parallel programming [16].

Processing distributed objects among parallel resources was first standardised in the CORBA architecture by the OMG (Object Management Group) [17] in 1991, as CORBA v1.1. CORBA stands for Common Object Request Broker Architecture, and soon after it appeared it became a standard architecture for distributed object systems. CORBA belongs to the 2nd generation of grid systems. It has been realised in many systems, of which the most popular are ORBit [18] and omniORB [19] (free), and Visibroker [20] and Orbix [21] (commercial). More details about this approach are presented in a later subchapter.

Another noteworthy project of the 1st-generation grids, a successful wide-range experiment that was actually built, is I-WAY (Information Wide Area Year). In 1995 this project connected about 20 of the most important computational centres in the USA, creating computational power that was used for virtual reality simulations. I-WAY exposed technical problems that are still relevant in contemporary grid systems, such as the security of resources and computational data, and the allocation and discovery of available data resources.

The first distributed system to give rise to the concept of a data grid, i.e. one involved in sharing data within the meaning of definition #2, was the Storage Resource Broker, also known as SRB (version 1.0 was released in 1997). This system provides a global, uniform view of data (generally, files) distributed over a number of physical locations. The most common use of SRB is as a distributed logical file system (a synergy of database system concepts and file system concepts) that provides a powerful solution for managing multi-organisational file system namespaces.
SRB presents the user with a single file hierarchy for data distributed across multiple storage systems. It has features supporting the management, collaboration, controlled sharing, publication, replication, transfer, and preservation of distributed data. In fact it is a middleware in the sense that it is built on top of other major software packages (file systems, archives, real-time data sources, relational database management systems, etc.). Its architecture assures proper data processing without problems of security or resource discovery. SRB introduced into grids the concept of metadata (i.e. auxiliary data that describe the data), which often appears in today's grid solutions [12].
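The core idea behind SRB's single file hierarchy, a catalogue that maps one logical namespace onto replicas held by different physical storage systems, can be sketched in a few lines. This is a toy illustration under assumed names, not the SRB API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Toy sketch of a logical-to-physical catalogue: users see only logical paths,
// while the catalogue tracks the physical replicas. All names are illustrative.
public class LogicalCatalogue {
    private final Map<String, List<String>> replicas = new HashMap<String, List<String>>();

    // Register a physical replica of a logical path.
    public void register(String logicalPath, String physicalLocation) {
        List<String> copies = replicas.get(logicalPath);
        if (copies == null) {
            copies = new ArrayList<String>();
            replicas.put(logicalPath, copies);
        }
        copies.add(physicalLocation);
    }

    // The user resolves only the logical name; the replica choice stays transparent.
    public String resolve(String logicalPath) {
        List<String> copies = replicas.get(logicalPath);
        if (copies == null || copies.isEmpty()) throw new NoSuchElementException(logicalPath);
        return copies.get(0); // trivially pick the first registered replica
    }
}
```

A real system would attach metadata to each entry and choose replicas by policy (proximity, load); the sketch shows only the transparency property: the logical name never changes when replicas move or multiply.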

Since German scientists founded the UNICORE project (UNiform Interface to COmputing REsources, described in [22]) in 1997, grids have evolved from simply managed applications to multi-layer architectures. UNICORE is still an active project; currently the 6th version of the system is implemented and publicly available. At the beginning the project was oriented towards supporting users who needed distributed resources and computational power to perform large-scale calculations. The essential assumptions of UNICORE are:

- an easy-to-use interface for preparing tasks, sending them to servers able to process them, and subsequently monitoring them;
- a security system that allows uniform identification and authorisation of users regardless of the specifics of the remote server systems;
- a notation for complex tasks consisting of sequences of operation executions; it is assumed that these operations can be performed on different servers and at the same time can depend on each other;
- separation of the user from the system's complexities, which directly affects a task that should be accessible on different system installations and able to run in different system environments;
- support for dependent applications which can only process locally available files.

UNICORE deliberately does not support interactive applications, streaming, etc. At present UNICORE (in version 6) is designed according to the Service-Oriented Architecture (SOA will be described in detail in the next subchapter). In UNICORE, SOA is represented by the UNICORE Atomic Services mechanism, which implements simple, indivisible tasks without workflow support. This means that different programming languages can be used for defining and processing such tasks.
Summarising, this approach is oriented towards processing computational or streaming tasks; data processing (such as structured data coming from databases) is not supported. Nevertheless, all of this makes UNICORE a flexible system which constitutes an important part of grid history, as a system that is still alive and still evolving with current trends.

Globus Toolkit 2 was created in 1997, building on the earlier Globus Project. The formal founder of GT2, and currently of GT4, is the Globus Alliance

team. From its beginning, GT2 turned out to be the standard for grid computations. It is related to OGSA, the Open Grid Services Architecture (see the next subchapter for details). GT2 implements protocols and APIs related to distributed calculations by providing services such as authentication, resource discovery (in a limited form), access to remote resources, data transfer, process execution, control and queuing, and portability. After a few years it became clear that this system had established a new quality of distributed calculation and resource sharing between computing centres, but it was also criticised for the difficulty of its installation and management and for its complexity in use [23]. For this reason the system evolved, with many improvements, into version 4, which is based on Web Services. The Globus Toolkit architecture consists of three types of network nodes:

- computational nodes, which execute processes controlled and queued by broker nodes;
- storage nodes, which store the data needed for calculations, or results;
- brokers, which are responsible for coordinating and queuing tasks and data transfers.

A very important feature introduced in GT2 is distributed transparency of data calculations; nowadays this feature is required in grid solutions. In GT2 a user describes a task using a dedicated language, JDL (Job Definition Language), and then sends it to a broker node, which is responsible for finding the best place in the grid for processing it, considering both the capabilities of the computational nodes and the costs of transferring the data needed for the calculations. When a task fails for some reason, the broker tries to start it again. To improve system operation, data can be replicated on multiple storage nodes, with the state of the replicas coordinated by dedicated processes. The results of calculations and their status can be tracked using the Logging Service.
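The broker's placement decision described above can be sketched as minimising an estimated cost that combines the node's computing capability with the cost of staging the input data. The cost model and all names below are illustrative assumptions, not GT2 code:

```java
import java.util.List;

// Hedged sketch of a broker placement decision: pick the node minimising
// estimated cost = runtime on that node + cost of staging the input data.
public class Broker {

    public static class Node {
        public final String name;
        public final double gflops;             // assumed compute capability
        public final double transferCostPerGb;  // assumed cost of moving data there
        public Node(String name, double gflops, double transferCostPerGb) {
            this.name = name; this.gflops = gflops; this.transferCostPerGb = transferCostPerGb;
        }
    }

    // Returns the name of the cheapest node for a job demanding workGflop of
    // computation and needing dataGb of input staged to the chosen node.
    public static String bestNode(List<Node> nodes, double workGflop, double dataGb) {
        Node best = null;
        double bestCost = Double.POSITIVE_INFINITY;
        for (Node n : nodes) {
            double cost = workGflop / n.gflops + dataGb * n.transferCostPerGb;
            if (cost < bestCost) { bestCost = cost; best = n; }
        }
        return best.name;
    }
}
```

The sketch makes the trade-off visible: a fast node far from the data can lose to a slower node that already holds a replica, which is exactly why replica management and placement are coordinated together.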
Moreover, copying data between nodes is done using a special protocol called GridFTP. Transparently to users, the protocol can divide the processed data into smaller parts and reuse existing replicas to decrease time and power consumption. The improvements through which GT2 achieved distributed transparency of data calculations are the following:

- automatic start-up of processes in the best available locations;

- automatic notifications to users about a calculation's status and its result;
- a homogeneous platform for data access and a transport layer for data transfer.

Unfortunately, some implementation decisions decrease transparency with respect to the data itself and access to it:

- data shared by a consortium under a single agreement is visible to all participants in the group, which can expose members to a loss of information;
- the explicit location of data on the network is essential for accessing it (the node name must be given explicitly), which in complex systems is inconvenient and unnecessary;
- data stored at different locations but belonging to the same user is not integrated automatically, so the user must operate on each location separately.

For these reasons the system is difficult to use for users who do not know low-level commands. The Globus Toolkit still evolves, and version 4 of the platform is currently available.

In this reflection on grid history, a commercial grid solution has to be mentioned: the first one supporting databases in a grid architecture. Together with data integration aspects, it introduced certain facilities associated with data fragmentation and query processing over distributed resources. This is the Oracle 10g system. The Oracle 10g database management system (the "g" standing for Grid) is one of the forerunners in the field of distributed databases and distributed query processing in business. Its manufacturer states that it is the world's first system able to distribute request processing, realise transparent integration of distributed resources, and run database applications in a distributed manner [24]. Experience shows that the system does not differ greatly from its previous version; the grid rhetoric is somewhat over-used.
Oracle 10g offers the possibility of creating and managing a distributed database system, but in reality this can only be a homogeneous system containing Oracle databases alone. Such a system can process heterogeneous data coming from different companies, but only to a very limited extent; this is not an open, generic approach to distributed data processing. In general, the Oracle architecture is a typical client-server architecture. In the grid version, however, a node (the entity performing a query and holding its own data repositories) can simultaneously be a client and a server. Servers in Oracle 10g can be connected by database links, which allow queries to be performed transparently on other servers. This means that a query is executed on other distributed servers, not only on the one to which the client is connected. Each database has a unique name that can be used in queries. The assumption is that a user cannot refer to an arbitrary database, but only to one administratively connected by a link with the database the client currently uses. Links hide the physical connections among database servers, so users need not be aware that a query is processed in a distributed system. However, the capabilities of database links are limited by their flat architecture and by the need to preserve the clarity of calls within a single node. Summarising, Oracle 10g provides data location transparency (though in a limited form), transparency of distributed query processing, and distributed transactions. In Oracle 10g the management of a distributed system is based on the following assumptions:

- each server is managed independently by its administrator, and each has its autonomous repository;
- each server maintains its own independent level of security and user authentication; to get access to the data, a remote user must pass authentication, which can be based on passwords stored locally, stored remotely, or attached to the link;
- remote users linked to a local database are treated like local users and the system does not distinguish them.
All their activities are verified by the local system;
- users can invoke procedures stored on remote servers;
- a distributed query has a global coordinator, which analyses the distribution of the system and divides base queries into (optimisable) sub-queries that are processed on the individual servers.

Oracle 10g is equipped with mechanisms for distributed, remote and local transactions, which control operations performed on multiple servers, on a remote server, and on the local system, respectively. In the distributed case, the system

guarantees that the operations of a given transaction will be entirely committed or rejected on all servers. This is implemented by means of the two-phase commit (2PC) protocol.

2.2 Contemporary Grid Related Technologies

Grid systems have to solve a number of specific problems in order to conform to the grid definition. The basic one is communication between grid resources and their users. The term resource has a wide meaning in relation to distributed systems: it covers computing nodes (clusters, supercomputers, servers) and data storage devices. Note that a resource need not be hardware or a physical resource; very often a resource is an application responsible for controlling and sharing specific computer hardware, commonly known as a service (in the sense of a resource). A good example is a service providing access to data in a database, or simply an application giving access to a file system. Grid communication has to provide mechanisms for taking full advantage of the resources' power. This is a really difficult task, given the diversity of resources: even limiting resources to computational ones still leaves the heterogeneity of processor architectures, operating systems and hardware capabilities (e.g. available memory). The typical approach to this problem is the creation of an intermediate layer, the so-called middleware. In most cases it implements communication protocols and is responsible for the interaction between these protocols and the higher layers performing grid tasks. This is a very flexible and generic approach, which also creates opportunities to solve other challenges facing grid systems. The layered model of grid construction is presented in Fig. 1; it also shows the higher layers built on the middleware layer, i.e. users' applications, service applications and their users.
The middleware, as the management carrier, is very generic in comparison with the upper layers, which contain specific software reflecting user needs: users are usually interested in particular applications which provide only selected grid functionalities, but exposed in a user-friendly manner. To share heterogeneous data in a grid environment between different client platforms (that is, different local environments, such as operating systems or kinds of data in use), a middleware should be employed. Currently three approaches to preparing such middleware are known.

The first involves implementing different middleware versions, each created for a specific client platform. This is a straightforward approach, but it requires a tremendous amount of work and is the most costly; unfortunately, it is therefore not very popular.

Fig. 1 Widely accepted model of grid building and evolution.

The second approach is to implement the middleware in a portable programming language. In such a case the middleware can be used in any environment supporting the language in which it was created. Currently Java is the most popular platform-independent programming language: Java applications are typically compiled to bytecode that can run on any Java Virtual Machine (JVM) regardless of the computer architecture.

The third approach is to use virtualisation on the grid's destination platform. This means that other environments (operating systems, in fact) can be run within the OS physically installed on the hardware. This approach also requires a middleware, which in most cases is produced according to the second approach, but access between the OS environments is provided over network interfaces. Currently this is the most popular solution. It is not perfect, because such an implementation can be inefficient: an OS running virtually inside a physical OS may be limited by the host and its performance decreased, so all processing works more slowly. However, current and future hardware trends are solving this problem.
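The second approach can be illustrated with a trivial example: the same compiled bytecode runs unchanged on any JVM and discovers its host environment at run time, instead of being rebuilt per platform. This is a minimal sketch, not part of any middleware:

```java
// The same class file, compiled once, reports whatever host it happens to run
// on; no per-platform rebuild is needed.
public class PlatformInfo {
    public static String describe() {
        return System.getProperty("os.name") + " / "
             + System.getProperty("os.arch") + " / JVM "
             + System.getProperty("java.version");
    }
    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```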

Another key issue raised in grid systems is providing widely acceptable security. The issue has to be considered under the requirement of preserving the autonomy of the particular systems of which the grid consists. The basic problems are:

- privacy of the data (including transmission over the network);
- reliability and integrity, i.e. obtaining the data unchanged;
- mutual identification of the cooperating partners;
- validity of sent messages (i.e. validation of the sender);
- delegation of authority (e.g. transferring part of a user's rights to an agent so that it can manage a task on the user's behalf);
- detailed authentication for resource access (grouping of users, access roles, verification and management of authentication).

Note that practically all of the above problems have existing, stable and widely accepted solutions. Privacy of transmission over the Internet, as well as the integrity of the transmitted data, is ensured by the SSL and TLS protocols; in conjunction with digital signatures and a public key infrastructure (PKI), these address the further problems listed above. There are also many authentication systems, starting with the simplest mechanisms built into operating systems (such as file access rights, user groups, and access control lists, ACLs). In grid systems these solutions are widely used, yet some issues remain. All of the above security sub-solutions work well given homogeneous security policies, uniform mechanisms used by the communicating participants, and trust between service providers. In grid systems the situation is much more complicated: it is enough that two grid nodes accept different PKI certification authorities, so that a user working with both needs to keep several different certificates, to cause significant difficulties in establishing mutual, encrypted communication.
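The integrity requirement in the list above can be illustrated minimally: a digest computed by the sender lets the receiver detect any change made in transit. This sketches only the idea; real grid deployments rely on TLS and PKI-backed digital signatures rather than a bare hash:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Minimal integrity sketch: compare the received message's digest with the
// digest the sender computed. Any single-bit change is detected.
public class Integrity {

    public static byte[] digest(byte[] message) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(message);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is mandatory in the JDK
        }
    }

    // Constant-time comparison of the recomputed digest with the expected one.
    public static boolean unchanged(byte[] message, byte[] expectedDigest) {
        return MessageDigest.isEqual(digest(message), expectedDigest);
    }
}
```

A bare hash alone guarantees only integrity; authenticating the sender additionally requires signing the digest with the sender's private key, which is where PKI enters the picture.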
The basis of the security architecture for grid systems is the concept of virtual organizations (VO) [10]. A VO specifies a virtual group of distributed users who work towards common goals or share common characteristics. At the moment this concept is being strongly extended in order to make VO creation more dynamic and to increase VO functionality, which is reflected

in many research works and their implementations, including the prototype that has been implemented within this thesis (see Chapter 4).

The last of the major challenges faced by grid systems is actually the most general one: the real possibility of differentiating the various offers of grid systems. This is a question about the value added to the quality of services, in relation to using the same services without a grid system layer. This requirement is already imposed by the grid definition. It can be addressed in several ways:

Optimising resources: grid systems can help in selecting an optimal (in various respects) runtime system;
Ease of use: an extensive client layer can facilitate the remote execution of even the simplest task;
Reducing administrative costs: system monitoring services, stronger access security and dedicated tools for the installation of different grid components can facilitate the deployment of distributed applications as well as system management and control;
Scalability: grid systems are generally able to easily extend their physical structure with additional elements.

This multiplicity of implementations allows the creation of competing grid adaptations, because with respect to the remaining issues the main challenge still remains co-operation: the use of compatible protocols and common standards. Browsing the most recurrent problems accentuated in the specialist press, advertisements and science (see [25]), it is clear that the rapid growth of the amount of data is a universal phenomenon. As a result, it is a strong incentive steering the evolution of grid systems. Technologically, we are ready to store such a large and still growing amount of data in the form of distributed storage resources.
However, this forces us to maintain such data, which in the end raises issues such as browsing, searching, filtering, transferring and transforming the data. The real problems begin when such data forms compound structures meaningful in terms of a business model and has to be prepared for processing in a distributed, global grid environment. Then, the main problems are the following:

Presentation of data which comes from differently structured, heterogeneous resources in terms of an adopted business model, integration of data from different resources and its global processing;
Transparent access to the data;
Searching;
Filtering;
Many other processes which are, at such a large scale, difficult and often impossible to perform.

The currently accepted way of solving the above-mentioned problems is the use of an intermediate layer, so-called middleware. A grid's middleware is often divided into two parts. The first one, more fundamental and also called the core part, physically implements the most basic functions, such as file transfers, low-level security, task reporting and basic supervision. The second one, which contains the other services making up a grid infrastructure, is generally referred to as grid high-level services. Such a division makes it easy to identify the part of the middleware infrastructure which is currently highly developed and will most likely keep growing in the future. The aim of high-level services is to provide greater convenience of use and more advanced features from the perspective of the user. This is related to a problem often raised by critics of early grid systems: running a task (as a single process) in grid technology brings too many complications and, despite a number of offered advantages, it is easier to perform some tasks manually in separate executions. Below are selected categories of services that have recently been gaining popularity. All of them have a major impact on the evolution of grid systems:

Broker services: on behalf of the user, they automatically make an optimal choice of an executive system for running the tasks. The selection criteria may vary, e.g. minimisation of cost, minimisation of running time, or maximisation of the security level.
Several such solutions have been implemented, e.g. the Grid Service Broker of the Gridbus project [26], or GridWay [27];
Advanced data services: they provide more advanced techniques of data access than simple copying of files between grid nodes. The best-known technologies are OGSA-DAI and OGSA-DQP [28]. They allow access, within a grid environment, to data coming from different types of databases, including advanced browsing and searching for data;

Workflow services: one of the most widely used elements in a grid. Their role is currently growing with the grids' adoption of a service-oriented architecture. They give the possibility of flexibly composing invoked services; different languages are available to describe workflows consisting of tasks and services;
Monitoring services: the services most useful for administrators and for other grid services (e.g. broker services). They provide runtime information about the grid, e.g. node availability, working processes, etc.;
User and privilege management services: they are usually connected with the idea of virtual organizations, which nowadays are part of grid systems.

SOA and OGSA

The rapid development of grid technology has produced the need for collaboration and coordination in creating subsequent grid-like solutions. The large and growing number of projects, the large funds involved and the corresponding amount of software and applications were difficult to grasp, even for experts in the field. Solutions were frequently duplicated, and competing approaches to the same problems began to appear. The main disadvantage was the lack of knowledge about existing solutions which could co-operate with each other, together giving a much more powerful utility. However, the biggest problem in the field concerned incompatible protocols, which did not allow existing systems to co-operate. In order to solve these problems, in November 1998 an organization called the Grid Forum [29] was formed in the U.S. In 2001, the Grid Forum merged with European and Japanese initiatives to form a joint organization called the Global Grid Forum. The forum is considered the most influential body concerning grid technology development in recent years. In 2006 the Global Grid Forum transformed further, incorporating the Enterprise Grid Alliance, a similar organization with business-oriented roots.
It thus received the third name in its history: the Open Grid Forum (OGF).

SOA vs. OGSA

The biggest achievement of the Global Grid Forum is undoubtedly promoting and developing a general architecture of grid systems, the Open Grid Services Architecture,

widely known under its abbreviation OGSA. To understand its meaning, we refer to the underlying Service Oriented Architecture (SOA). SOA has many definitions. In the present work it was decided to adopt the following definition, based on the one derived from the Open Group organization, with redundant references to the services industry removed; this brings its meaning close to the definition given by OASIS [30] (which is formally correct, but too general to be used in practice): Service Oriented Architecture is a special type of architecture for a distributed system which is service-oriented. The orientation on services means modelling functionality through services, which are logical representations of processes with specific inputs and outcomes, which may act alone or interact with other services, and which are indivisible from the point of view of the client using them. Developing the SOA concept, one should identify the most frequently recurring properties of the architecture. While the implementation of a service's processes is completely arbitrary (it is hidden from the service user), the interface, which is the means of access to a service, is strictly defined in relation to a specific standard. This makes services interoperable with each other. Services are organised in a way which permits finding them easily. This requires a low degree of dependence between the services (so-called loose coupling). The usability of the SOA model for modelling grids becomes apparent when comparing the SOA and grid definitions. A grid as a model assumes highly distributed resources whose elements are located in separate administrative domains; in an intuitive way, it can be described as a composition of services. A general model of the grid reflecting the SOA architecture is provided by the OGSA specification.
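The contract/implementation split at the heart of this SOA definition can be sketched in a few lines of Java. This is an illustrative sketch only: the TaxService name, the flat 19% rate and both classes are assumptions made for the example, not part of any cited standard.

```java
// Callers depend only on the service contract (the interface),
// never on the implementation hidden behind it: loose coupling.
interface TaxService {                           // strictly defined interface
    double computeTax(double grossAmount);       // specific input and outcome
}

class FlatRateTaxService implements TaxService { // hidden implementation
    public double computeTax(double grossAmount) {
        return grossAmount * 0.19;               // illustrative flat rate
    }
}

public class SoaSketch {
    public static void main(String[] args) {
        TaxService service = new FlatRateTaxService();
        System.out.println(service.computeTax(100.0));
    }
}
```

Swapping FlatRateTaxService for another implementation (or a remote proxy) does not affect the caller, which is precisely the property that makes the SOA model attractive for composing grid services.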
The first widely recognised and fundamental work on this model was presented in "The Physiology of the Grid" [11]. This work also introduces the term Open Grid Services Architecture (OGSA). Currently the term is used to characterise the "new generation" grid. More detailed work on the architecture is continued within the Open Grid Forum community, which also maintains the OGSA specifications [31], [32]. The proper meaning of the term OGSA as used in the literature is somewhat blurred by its use in a vast number of different contexts. The core of OGSA consists of two parts: the first one gives the range of issues and problems faced by grid systems, and the second one defines a theoretical set of services designed to solve these problems.

OGSA also does not impose any specific technologies, formats or communication protocols. Furthermore, it assumes that the given set of services does not necessarily have to be used in full by a compliant solution. This is possible due to the high independence of components which characterises SOA. OGSA also assumes the following set of problems which have to be solved by a grid system:

Possibility of cooperation between the components (viewed as services), which as a result gives the possibility of using heterogeneous systems (in terms of hardware architecture, operating systems used, etc.);
Possibility of sharing resources between their owners;
Optimisation of resource usage (e.g. to achieve a low CPU load);
Ability to execute tasks defined by the user;
Data management: storage, retrieval, browsing, etc.;
Ensuring the security of the entire system, i.e. identification of users, authorisation of access, delegation of rights and many other aspects;
Reduction of administrative costs;
Scalability of the system, together with its ease of extension through the addition of further resources;
Resistance to failures, the possibility of reducing losses caused by accidents, and a high level of service availability;
Ease of use.

Web Services

Until now, the concept of Web services was intentionally not introduced, although this technology is strongly associated with SOA and OGSA. Web services are one of the technologies used to implement a service-oriented architecture. What is more, they are the de facto basic SOA technology, and they have also been selected as the leading technology by the majority of projects involved in the construction of grid systems. Consequently, the majority of specifications based on OGSA use the Web service technology, but it is not the only possibility: grids can be built using other kinds of services, depending on the type of the grid system and its purpose.
Providing a strict definition of a Web service is almost as difficult as defining service-oriented architecture. In this

dissertation we will use the following definition, based on the meaning of the concept used by the IBM company (see [33]): A Web service is a type of service within the meaning of SOA. It describes a set of operations available over the network. Executing an operation exposed by a service relies on sending a suitably formatted XML document in a standardised form. Likewise, a possible result of the operation is provided in the form of a standardised XML document. Additionally, a Web service is described by a standard, formal XML document which contains all information needed to interact with the service. A Web service uses a standard network protocol for the transmission of XML documents in both directions. Since Web service tools were initially used in B2B technologies, many companies of the IT sector give a slightly different meaning to this concept, mostly imposing additional requirements on the definition. For example, it is assumed that Web services should be accessible via the HTTP protocol [34], and that the XML document format should be in accordance with the SOAP protocol [35]; in the vast majority of cases this is what is applied in practice. It is also commonly accepted that the term Web service can denote a service with SOAP messages sent through protocols other than HTTP, e.g. SMTP, or services of the REST type, which do not use SOAP for message processing. In this dissertation the prototype communication layer is not based on Web services.

OGSI

Because the OGSA standard is general, further standards are required, clarifying the details of services, their interfaces and the protocols used. Basing only on the OGSA standard, it is possible to build a large number of different grid systems which do not fulfil the basic postulate of mutual cooperation between grid systems, but merely have a similar architecture. The most relevant cases requiring such a clarification are the following: 1.
the way of representing resources and their state through services, 2. the way of addressing such resources, 3. methods of controlling those resources, such as their creation or setting their running time, 4. methods of securing communication and authenticating users.

Of course there are many other aspects of grid building, but all those listed above are essential. The first such standard making OGSA precise was the Open Grid Services Infrastructure (OGSI) [88] specification. Strictly speaking it is not an OGSA profile, because at the time OGSI was created, OGSA was more a general idea than a formal standard; however, currently it is listed as an OGSA profile. Because OGSI had many unacceptable shortcomings (for details see [36]), a new group of specifications was created on its basis, generally grouped under the names WS-Resource Framework (WSRF) [37] and WS-Notification. The first group of specifications deals with the representation, through network services, of resources having state [38]. One of the improvements in relation to OGSI was the use of an external specification called WS-Addressing [39] to identify resources in a unified way. The WS-Addressing specification defines an additional element placed in the SOAP message header; it keeps the name of the resource which the called operation concerns and is referred to as the endpoint reference (EPR). Moreover, the WSRF group of specifications includes:

WS-Resource Properties, which determines the way of sharing information about the state of a resource (its properties) [40];
WS-Resource Lifetime, defining resource life-cycle management [41];
WS-Base Faults, which slightly extends the error-handling concept of the SOAP protocol, so that diagnostic information about fault causes can be provided in a more unified way [42];
WS-Service Group, which permits grouping resources, e.g. to create a functional registry of them [43].

WS-Notification, as a set of specifications, defines different ways of notifying services about mutual state changes. It includes three documents:

WS-Base Notification, the basic specification defining operations for informing about events that have occurred and for subscribing to such information.
Several such models are defined (e.g. asynchronous retrieval of data about an event, and synchronous notification by the service which is the source of the event) [44];

Brokered WS-Notification, which specifies an optional, additional broker service between the source of information about events and the information consumer. The functions of such a broker may include logging of events, simplifying the implementation, simplifying the discovery of sources and recipients (i.e. acting as a common element for information producers and consumers) or even ensuring the anonymity of event-information producers and consumers [45];
WS-Topics, which determines methods of specifying the notification subject or groups of subjects (e.g. to subscribe only to interesting notifications) [46].

Obviously, the WSRF and WSN specifications are generally useful, not only for grid systems (moreover, they are promoted by the OASIS organization, which is not related only to grid technologies); their importance for OGSA is given in a special profile [47]. Apart from the WSRF and WSN specifications coming from OASIS and supported primarily by the IBM and Hewlett-Packard corporations, there is also a competing group of specifications with a similar functional scope, usually called WS-Management [48] and supported by Microsoft and Intel. The specifications of this group, though important in industry, have had a smaller influence on grid environments. Currently the organizations are working on unifying the WSRF and WSN standards with WS-Management (see [49]). Another competitor in relation to the WS-Notification family of specifications is the WS-Eventing specification [50] (also supported by Microsoft), but it is generally more limited than WS-Notification. The WSRF OGSA profile specifies only some of the issues addressed by OGSA. The most important issues, concerning security profiles in relation to the grid, are omitted in this specification; there are separate specifications dedicated to this problem, for details see [51], [52].
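A SOAP message carrying the WS-Addressing endpoint reference described above might look as follows. This is a hedged sketch: the URIs, the operation name and the ResourceId reference parameter are purely illustrative assumptions, not taken from any cited specification.

```xml
<!-- Illustrative SOAP 1.2 message with a WS-Addressing header. -->
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
               xmlns:wsa="http://www.w3.org/2005/08/addressing">
  <soap:Header>
    <wsa:To>http://example.org/grid/StorageService</wsa:To>
    <wsa:Action>http://example.org/grid/StorageService/GetStatus</wsa:Action>
    <wsa:ReferenceParameters>
      <!-- identifies the concrete stateful resource behind the service -->
      <ResourceId>disk-42</ResourceId>
    </wsa:ReferenceParameters>
  </soap:Header>
  <soap:Body>
    <GetStatus/>
  </soap:Body>
</soap:Envelope>
```

The reference parameters are what lets a single WSRF service front many stateful resources, since each request names the resource it concerns.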
Our OGSA presentation is focused only on the basic standards which determine the fundamental parts of the OGSA architecture. Summarising this section: the simplified version of the grid architecture reflects the contemporary service-oriented approach. Its grid orientation is described through the OGSA standard, but still at a high abstraction level. Its realisation is described by OGSA profiles, among which the most important are those defining resources and security. The first mostly use the WSRF specifications. Security profiles are based on standards of the WS-Security group coming

from OASIS. All solutions use the WS-Addressing specification to address resources and to transfer some of the data describing them.

CORBA

Distributed objects

CORBA (Common Object Request Broker Architecture) [17], [53] is a standard defined by the OMG (Object Management Group) [54] enabling software components written in multiple programming languages and running on multiple computers to work together. Due to CORBA's application in distributed systems, object persistence (in terms of DAO) seems a very important issue, unfortunately not defined in the standard itself. Some attempts have been made to achieve persistence in object-oriented databases (e.g. [55]) and relational ones (e.g. [56]); nevertheless, these solutions seem somewhat exotic and are not very popular. A detailed discussion of persistence and ORM issues for CORBA is contained in [57]. In short, CORBA allows a distributed, heterogeneous collection of objects to interoperate. CORBA defines an architecture for distributed objects; the basic CORBA paradigm is a request for a service of a distributed object. Another important part of the CORBA standard is the definition of a set of distributed services to support the integration and interoperation of distributed objects. CORBA is a software mechanism for normalising the method-call semantics between application objects that reside either in the same address space (application) or in a remote address space (the same host, or a remote host on a network). Version 1.0 was released in October 1991. CORBA uses an interface definition language (IDL) to specify the interfaces that objects present to the outside world as services. CORBA then specifies a mapping from IDL to a specific implementation language like C++ or Java. Standard mappings exist for Ada, C, C++, Lisp, Ruby, Smalltalk, Java, COBOL, PL/I and Python.
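To make the notion of IDL concrete, the fragment below is a minimal, hypothetical interface definition; the module, interface and member names are illustrative assumptions for this sketch, not taken from any cited system. An IDL compiler (e.g. idlj, shipped with the Sun Java SDK) would translate it into language-specific stub and skeleton classes.

```idl
// Hypothetical OMG IDL fragment; all names are illustrative.
module Bank {
  exception InsufficientFunds { float available; };

  interface Account {
    readonly attribute float balance;   // state exposed read-only
    void deposit(in float amount);      // "in" = value passed to the object
    void withdraw(in float amount)
      raises (InsufficientFunds);       // declared, typed CORBA exception
  };
};
```

From the client's point of view only this contract is visible; the ORB dispatches the calls to whichever remote implementation was registered under the interface.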
There are also non-standard mappings for Perl, Visual Basic, Erlang and Tcl, implemented by Object Request Brokers (ORBs) written for those languages. The CORBA specification dictates that there shall be an ORB through which the application interacts with other objects. In practice, the application simply initialises the ORB and accesses an internal Object Adapter which maintains such issues as reference counting, object (and reference) instantiation policies, object lifetime policies, etc. The Object

Adapter is used to register instances of the generated code classes. Generated code classes are the result of compiling the user's IDL code, which translates the high-level interface definition into an OS- and language-specific class base for use by the user application. This step is necessary in order to enforce CORBA semantics and provide a clean user process for interfacing with the CORBA infrastructure. Some IDL language mappings are "more hostile" than others. For example, due to the very nature of Java, the IDL-to-Java mapping is rather straightforward and makes the use of CORBA very simple in a Java application. The C++ mapping is not trivial, but it accounts for all the features of CORBA, e.g. exception handling. The C mapping is stranger still (since C is not an object-oriented language), but it is consistent and handles the RPC semantics just fine. A language mapping requires the developer (a user in this case) to create some IDL code that represents the interfaces to his objects. Typically, a CORBA implementation comes with a tool called an IDL compiler which converts the user's IDL code into language-specific generated code [53]. CORBA (more precisely, IIOP [58]) uses raw TCP/IP connections to transmit data. However, if the client is behind a very restrictive firewall or transparent proxy server environment that only allows HTTP connections to the outside through port 80, communication may be impossible, unless the proxy server in question also allows the HTTP CONNECT method or SOCKS connections. At one time, it was difficult even to force implementations to use a single standard port; they tended to pick multiple random ports instead. Even today, some ORBs still have these deficiencies. Due to such difficulties, some users have made increasing use of web services instead of CORBA.
These communicate using XML/SOAP via port 80, which is normally left open or filtered through an HTTP proxy inside the organization for web browsing. Recent CORBA implementations, though, support SSL and can easily be configured to work on a single port. Most of the popular open-source ORBs, such as TAO and JacORB [59], also support bidirectional GIOP [58], which gives CORBA the advantage of being able to use call-back communication rather than the polling approach characteristic of web-service implementations. Also, more CORBA-friendly firewalls are now commercially available. Currently CORBA's significance is that of a historic approach rather than a solution used in live projects, because of several architectural issues. Some of CORBA's failures were due to the implementations and the process by which CORBA

was created as a standard; others reflect problems in the politics and business of implementing a software standard. These problems led to a significant decline in CORBA use and its adoption in new projects and areas; the technology is being replaced by Java-centric technologies. CORBA ensures a feature very important for grid systems: object location transparency. However, its notion of location transparency has been criticised: objects residing in the same address space and accessible with a simple function call are treated the same as objects residing elsewhere (different processes on the same machine, or different machines). This notion is flawed if it requires all local accesses to be as complicated as the most complex remote scenario. However, CORBA does not place a restriction on the complexity of the calls; many implementations provide for recursive thread/connection semantics, i.e. object A calls object B, which in turn calls object A back before returning. The creation of the CORBA standard is also often cited as an example of design by committee. There was no process to arbitrate between conflicting proposals or to decide on the hierarchy of problems to tackle, so the standard was created by taking the union of the features of all proposals, with no regard to their coherence [60]. This made the specification very complex, prohibitively expensive to implement entirely, and often ambiguous. A design committee composed largely of vendors of standard implementations created a disincentive to make a comprehensive standard, because standards and interoperability increase competition and ease customers' movement between alternative implementations. This led to much political fighting within the committee and frequent releases of revisions of the CORBA standard that were impossible to use without proprietary extensions [61].
Throughout its history, CORBA was plagued by shortcomings of its implementations. Often there were few implementations matching all of the critical elements of the specification [60], and existing implementations were incomplete or inadequate. As there was no requirement to provide a reference implementation, members were free to propose features which were never tested for usefulness or implementability. Implementations were further hindered by the general tendency of the standard to be verbose, and by the common practice of compromising by adopting the sum

of all submitted proposals, which often created APIs that were incoherent and difficult to use, even if the individual proposals were perfectly reasonable. Working implementations of CORBA were very difficult to acquire in the past, but are now much easier to find; the Sun Java SDK comes with CORBA included. Some poorly designed implementations have been found to be complex, slow, incompatible and incomplete, and commercial versions can be very expensive. This changed significantly as commercial, hobbyist and government-funded high-quality free implementations became available.

2.3 Conclusions

The selection of various solutions presented above has a fundamental influence on the research work presented in this dissertation. The solutions concern historical approaches and their implementations in terms of distributed processing of various forms of data. In the next chapter of this thesis, a set of solutions having a direct impact on building the working prototype of the proposed data integration procedure is presented.

Chapter 3 Technological Base of the New Approach to Grid

The contemporary solutions related to distributed data processing which are most important for the approach of this thesis are presented below. The solutions concern distributed architectures, distributed processing and integration of structured data, management of distributed resources and transparent data transport. They create a knowledge base, as well as experiences and concepts, which has proved useful for the prototype implementation.

3.1 Peer-to-Peer Networks

Communication over a network is strictly related to grid technologies. A grid solution should provide easy methods of communication for each part of the grid environment. Since peer-to-peer networks have recently become very popular in the area of file exchange on the Internet, they could also be helpful as part of a grid solution and could provide many facilities for grid functioning. How to plug a peer-to-peer solution into a grid and how to reuse it is a part of this dissertation. Considering the Internet itself, it can be seen that there are millions of computers connected to the network at any given time. All the computers are theoretically connected to one another, and information stored on any of the systems can be accessed. As a whole, the topology, or layout, of the computers on the Internet is a grouping of machines spread out over various locations. Within each group or subnet, depending on their autonomous configurations, computers will be visible to other computers on the subnet and to the outside Internet. Some of the computers will be servers, treated as data or information sources. The machines at Google that serve up content are web servers; browsing Google on a local computer turns the local machine into a client. This

type of client-server interaction is happening for hundreds of thousands of computers at the same time. While a client machine is browsing Google, it could also be sharing a local drive with group members; in this situation, the machine becomes a server to any client that tries to access files on the local drive. In most peer-to-peer systems (simply called P2P), the division between a server and a client is blurred. The computer itself might be connected to other computers using, say, a token-ring topology, but the peer-to-peer system running on it might have a completely different architecture: the peers could all be communicating with a central server, or just with another peer. In most cases, peers will be connected to one another over the Internet using either the TCP or the HTTP protocol. TCP/IP is the fundamental protocol for transferring information over the Internet. The HTTP protocol is built on top of TCP/IP and allows communication between computers using port 80. HTTP is very popular for peer-to-peer systems because most organizations keep port 80 open on their firewalls for web browser traffic. Several network topologies can be used for connecting P2P systems [62]. The most interesting, and those used contemporarily, are the following:

The client-server, or centralised, topology. It is by far the most common topology. The client-server terminology has been around since ARPANET was founded; more recently, the term centralised has been used to describe a system in which a single computer, the server, makes services available over the network. Client machines contact the server when the services are needed. Obviously, the more clients in the system, the larger the server must be; at some point, the server will need to be replicated in order to handle the traffic volume from all clients.
The decentralised topology is the network topology that comes closest to being truly peer-to-peer.
There is no central authority, only individual computers that are able to connect and communicate with any of the other computers in the network. When a packet of information starts its travels across the Internet, it is basically travelling through a decentralised topology: information within the packet itself tells each computer where to send the packet next. Fig. 2 shows an example of a decentralised network topology. Basically, all of the peers in the system act as both clients and servers, handling query
requests and downloads while also making media searches and download requests themselves.

Fig. 2 Decentralised network topology.
Fig. 3 Hybrid network topology.

The hybrid topology, shown in Fig. 3, is an example of a situation where individual computers are considered clients when they need information. The client that needs information contacts a central server (the centralised servers are distributed in the example shown in Fig. 3) to obtain the name of another client where the requested information is stored. The requesting client then contacts the client with the information directly. With the hybrid topology, a computer can be either a client or a server. This
is the topology used when individual peers contact a centralised server to search for other peers, and then contact the found peers directly to exchange information [62].

The term P2P is relatively new, despite the examples that have been in place since the birth of the Internet (see chapter 2.1). Some people credit P2P to Gene Kan and other early Gnutella [63] pioneers. Coming up with a concise definition of P2P, however, is not so simple. There are problems not only with what makes up a P2P application, but also with the many competing P2P protocols and implementations that operate in very different ways. P2P is not about eliminating servers. It is not a single technology, application, or business model. Perhaps most controversially, it should not be characterised strictly by the degree of centralisation versus decentralisation. Centralisation in a P2P network can consist of a central catalogue where some data can be indexed, e.g. a list of neighbour peers and their shared files. Such a central server can act as a traditional client-server system when peer users are looking for indexed data, and the network can act as a P2P network when users transfer the found files. This means the system takes advantage of the fact that it is easy to create a centralised database of shared files and their locations, but very difficult and costly to host such files. In a decentralised P2P network, no peer is different from another except in the content which it shares. The directory service (which is stored in one place in centralised P2P) is now shared among the peers. In general, P2P is more a style of computing that makes network interactions more symmetrical than client-server. Even though there may be centralised services, the end-user peer is the significant focus of the application. If the centralised services are distributed, the system is less susceptible to problems with the network.
Centralised P2P file sharing is the ultimate example of monolithic centralisation: all the peer-to-peer functionality fails if the main server fails or is disconnected. The decentralised system is the opposite, because no single peer, if removed, will significantly affect the quality of the network. While P2P is not a new concept, many factors make P2P practical for a wide number of applications today. These factors include the explosion of connected devices, the rapid increase of affordable bandwidth, the acceleration of computing power, larger storage capacities, and the proliferation of information at the edges of the network.
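The hybrid lookup scheme described above (a central index for searching, direct peer-to-peer transfer for downloads) can be sketched in a few lines. All class and peer names here are hypothetical, and Python is used only for brevity; this is a conceptual sketch, not an implementation of any particular system:

```python
# Sketch of a hybrid P2P topology: a central index only maps file names
# to peer identifiers; the file content itself never passes through it.

class IndexServer:
    """Centralised directory: knows who has what, stores no file content."""
    def __init__(self):
        self.index = {}                   # file name -> set of peer ids

    def register(self, peer_id, files):
        for name in files:
            self.index.setdefault(name, set()).add(peer_id)

    def lookup(self, name):
        return self.index.get(name, set())

class Peer:
    def __init__(self, peer_id, files):
        self.peer_id = peer_id
        self.files = dict(files)          # name -> content

    def join(self, server):
        server.register(self.peer_id, self.files)

    def download(self, server, name, peers):
        # Step 1: ask the central server WHO has the file (client-server).
        holders = server.lookup(name)
        # Step 2: fetch the file directly from one holder (peer-to-peer).
        for pid in holders:
            return peers[pid].files[name]
        return None

server = IndexServer()
a = Peer("peer-a", {"song.mp3": b"audio-bytes"})
b = Peer("peer-b", {})
peers = {"peer-a": a, "peer-b": b}
a.join(server)
b.join(server)

print(b.download(server, "song.mp3", peers))  # content comes from peer-a
```

Note that removing the index server disables searching but not transfers already in progress, which mirrors the single point of failure discussed above.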

P2P gives users (peers) control to use and access their data as they see fit. In many instances, it is more efficient than replicating data on servers while providing the same type of access. P2P applications are flexible and tolerant of errors. They can replicate data as needed and broadcast data to multiple computers. With a server system, there are many points of failure, and many installations have several failure points as a trade-off between cost and reliability. P2P can be characterised in the following areas [64]:

Consumer file sharing: Gnutella, FastTrack, Napster, eDonkey, SoulSeek, etc.;
Distributed resource sharing: SETI@Home, Avaki, Entropia, and Grid projects;
Content distribution networks: OpenCola, Blue Falcon Networks, Kontiki;
P2P communications: instant messengers and video messengers such as Skype and Webex;
Collaboration applications: such as Hive, Groove, and myJXTA.

File sharing and P2P communications together are often the foundation capabilities used to build a workgroup. The common characteristics of today's typical P2P systems include most of the following:

Peer nodes have awareness of other peer nodes;
Peers create a virtual network that abstracts the complexity of interconnecting peers despite firewalls, subnets, and the lack of specific network services;
Each node can act as both a client and a server;
Peers form working communities of data and applications that can be described as peer groups.

The overall performance of a P2P application tends to increase as more nodes are brought online, in contrast to typical client-server environments, where more clients degrade performance. The performance is also dependent on the application, the P2P protocol, and the network topology. The network topology is the arrangement of peers, their bandwidth, and their computing capacity. The protocols send messages and data on the network.
The applications, combined with the overhead of the protocol and the speeds available in particular sections of the network, make up a system with a specific performance profile. Compared to server-based systems, even a
small P2P network can be very complex. The P2P network is really a network of islands of PCs in different corporate, ISP, and home networks, which is a great advantage despite its complexity. P2P faces certain challenges with security, control, and network use, each of which is being addressed by the evolution of technology and the deployment of more sophisticated systems. Still, P2P's advantages outweigh today's challenges for many applications. P2P systems can provide the following capabilities:

Individual control by peers: users become very powerful. They create their own groups (in effect, their own virtual firewalls) and have lower barriers for publishing their resources;
Reliability: it can be thought of as a poor man's high-availability system;
Scalability: P2P has been demonstrated to support as many simultaneous users as the largest centralised systems;
Performance: resources are able to work together to tackle bigger problems more efficiently.

A P2P implementation which delivers the above capabilities well is JXTA. It is discussed in the next subchapter.

3.2 JXTA Project

JXTA (Juxtapose) is an open source peer-to-peer protocol specification developed by Sun Microsystems in 2001. The JXTA protocols are defined as a set of XML messages which allow any device connected to a network to exchange messages and collaborate independently of the underlying network topology. As JXTA is based upon a set of open XML protocols, it can be implemented in any modern computer language. Implementations are currently available for the Java platform, C/C++/C# and J2ME. The C# version uses the C/C++ native bindings and is not on its own a complete reimplementation. JXTA-based applications create peers (network nodes); a set of such nodes creates a virtual overlay network which allows a peer to interact with other peers even when some of the peers and resources are behind firewalls and NATs or use different network transports.
In addition, each resource is identified by a unique ID (a 160-bit SHA-1 URN in the Java binding), so that a peer can change its location address while keeping a constant identification number.
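The point of such an ID is that it is derived from the resource's identity rather than its network location. The sketch below illustrates this idea only: it derives a stable 160-bit SHA-1 digest from a seed and formats it in the urn:jxta style. The real Java binding has its own ID-generation rules, so the function name and format details here are illustrative assumptions:

```python
import hashlib

def make_peer_urn(seed: bytes) -> str:
    """Illustrative only: derive a stable 160-bit SHA-1 digest from a
    seed and format it as a urn:jxta-style identifier. This is NOT the
    actual JXTA ID algorithm; it only shows why such an ID is
    independent of the peer's current network address."""
    digest = hashlib.sha1(seed).hexdigest()   # 160 bits = 40 hex chars
    return "urn:jxta:uuid-" + digest

# The same seed always yields the same URN, regardless of where the
# peer is currently reachable (IP address, transport, NAT status).
urn1 = make_peer_urn(b"my-peer-identity")
urn2 = make_peer_urn(b"my-peer-identity")
assert urn1 == urn2
print(urn1)
```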

3.2.1 JXTA Architecture

The JXTA specification says [8]: the JXTA protocols are a set of six protocols that have been specifically designed for ad hoc, pervasive, and multi-hop peer-to-peer network computing. Using the JXTA protocols, peers can co-operate to form self-organised and self-configured peer groups independently of their positions in the network (edges, firewalls), and without the need of a centralised management infrastructure.

This means that JXTA is a framework with a set of standards that support peer-to-peer applications. JXTA is not an application, and it does not define the types of applications which are created using JXTA. The protocols defined in the standard are also not rigidly defined, so their functionality can be extended to meet specific needs. JXTA is made up of three distinct layers (see Fig. 4). The first is the core layer, where the code to implement the protocols is found. The protocols provide the functionality for peers, peer groups, security, and monitoring, as well as all the message-passing and network protocols. A universal peer group called the WorldPeerGroup is created over the protocols. When a peer starts executing, it automatically becomes a part of the WorldPeerGroup, and has access to peer group functionality or services already implemented. This functionality allows the peer to discover, join and create other peer groups, and exchange messages using pipes.

A service is a functionality, built on top of the core layer, that uses the protocols to accomplish a given task. The services layer can be divided into two areas: essential and convenient. To illustrate this difference, consider two services: a service that provides membership and a service that translates messages from one IM system to another (for example, from GoogleTalk to MSN Messenger). The membership service is an essential service in a peer-to-peer environment.
In the JXTA architecture, all peers automatically join a default group called the NetPeerGroup. This peer group provides basic services, but not all peers will want to be part of the big umbrella group at all times. By using a membership function, peers can join smaller private groups and interact only with other known peers; in the real world it can be compared to a company-internal group with restricted access. On the other hand, the instant messaging translator service is a convenient service, because a peer does not have the inherent need to translate messages between GoogleTalk and MSN.
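The membership idea above can be made concrete with a small sketch: every peer is implicitly a member of the default group, while a private group requires presenting a credential before the peer may join. The classes, group names, and password scheme here are hypothetical simplifications, not the JXTA membership API:

```python
# Sketch of group membership: a public default group admits everyone;
# a private group checks a credential first. All names are hypothetical.

class PeerGroup:
    def __init__(self, name, password=None):
        self.name = name
        self.password = password          # None => public group
        self.members = set()

    def join(self, peer, credential=None):
        if self.password is not None and credential != self.password:
            raise PermissionError("invalid credential for " + self.name)
        self.members.add(peer)

net_peer_group = PeerGroup("NetPeerGroup")                  # public default
company_group = PeerGroup("AcmeGroup", password="s3cret")   # private group

net_peer_group.join("peer-1")          # always succeeds
try:
    company_group.join("peer-1")       # rejected: no credential supplied
except PermissionError as e:
    print(e)
company_group.join("peer-1", credential="s3cret")   # accepted
print(sorted(company_group.members))
```

In real JXTA, the credential is a structured token recognised by the group's membership service rather than a plain password, but the control flow is the same.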

The application layer hosts code that pulls individual peers together for a common piece of functionality.

Fig. 4 JXTA Architecture.

One of the important points in JXTA is that the line between the layers in the architecture is not rigid. If someone develops a peer that provides functionality, another peer might see that peer's functionality as a service that fits its needs, but others might see it as a complete application without dividing it into pieces. Developers need to fill out the application layer for the JXTA specification and related bindings to be successful. This arrangement should be familiar because it is identical to a standard operating system, where there are three layers consisting of the core operating system, services, and applications.

3.2.2 JXTA Protocols

JXTA protocols are used to help peers discover each other, interact, and manage P2P applications. The protocols are not applications themselves, and they need to be programmed to make them work. The protocols hide a lot of details, which makes writing JXTA applications much easier than developing P2P applications from scratch. The following is a list of the JXTA protocols:

Peer Discovery Protocol (PDP): allows a peer to discover other peer advertisements (peer, group, service, or pipe). The discovery protocol is the searching mechanism used to locate information. The protocol can find peers, peer groups, and all published advertisements. The advertisements are mapped to peers, groups, and other objects, such as pipes. Queries are made by specifying an advertisement type (peer, group, or advertisement), an XML tag
name within the advertisement, and the string to match against the data represented by the XML tag. In short, this protocol is for resource searching;

Peer Resolver Protocol (PRP): allows a peer to send a search query to another peer. The resolver protocol is a basic communications protocol that follows a request/response format. To use the protocol, you supply a peer to query and a request message containing XML that will be understood by the targeted peer. The result is a response message. The resolver is used to support communications in JXTA protocols like the router and discovery protocols. For example, it is used by the discovery protocol to send queries that represent searches for advertisements. The resolver also allows the propagation of queries. For example, if a peer receives a query and does not know the answer, the resolver sends the query to other peers. This is an interesting feature, especially because the originating peer does not need to have any knowledge of the peer that may actually have the result to the query. In short, it can be described as a generic query service;

Peer Information Protocol (PIP): allows a peer to learn about the status of another peer. The information protocol is used partly like ping and partly to obtain basic information about the peer's status. The body of a PIP message is free-form, allowing for querying of peer-specific information. In addition, this capability can be extended to provide a control capability. Simply, it is used for monitoring;

Peer Membership Protocol (PMP): allows a peer to join or leave a peer group. The protocol also supports the authentication and authorization of peers joining peer groups. The protocol has three key advertisements used for authorization, and the credential. The credential created in this protocol will be used as proof that the peer is a valid member of the group.
In short, it is the protocol responsible for security;

Pipe Binding Protocol (PBP): used to bind a pipe endpoint to a physical peer. It is used to create a communication path between one or more peers. The protocol is primarily concerned with connecting peers via the route(s) supplied by the peer endpoint protocol. Finally, it is required for addressable messaging;

Rendezvous Protocol (RVP): responsible for propagating messages within JXTA groups. The Rendezvous Protocol defines a base protocol for peers to send and receive messages within the group of peers and to control how messages are propagated. Simply, it is for propagating messages;

Peer Endpoint Protocol (PEP): used to create routes for messages between peers. The protocol uses gateways between peers to create a path that consists of one or more pipe protocols suitable for creating a pipe. The pipe binding protocol uses a list of peers to create routes between peers. One of the more significant problems is that traditional routers and DNS servers fail because of firewalls, proxy servers, and NAT devices. This protocol searches for gateways that allow such barriers, firewalls and others, to be traversed. This protocol also helps when the communicating peers do not support each other's protocols. For example, if you are connecting peer A, which supports TCP, and peer B, which only supports HTTP, the endpoint protocol will choose either one gateway that can make the translation or multiple gateways with multiple but compatible protocols. Simply, it is responsible for routing.

3.2.3 JXTA ID

In a peer-to-peer system, the resources of the system have to be referenced in some manner. A simple name is not enough, because resources could have identical names. There could easily be two peer groups called Home Workgroup or 1000 files named my_photo.jpg. JXTA solves this problem with the JXTA ID, also referred to as a URN, which is a unique string used for the identification of six types of resources:

Peers;
Peer groups;
Pipes;
Content;
Module classes;
Module specifications.

The JXTA ID consists of three parts: the format specifier "urn", the namespace identifier "jxta", and the unique ID value. It is important to note that the URN and JXTA portions of the ID are not case-sensitive, but the data portion of the ID is case-sensitive. An example of a valid ID for a PeerGroupID can be: "urn:jxta:uuid-22a4394eda7e41ee9afab37902".

3.2.4 JXTA Peers

The concept of a peer in JXTA should not be confused with the concept of a user. The peer is a node on the network; it is simply an application, executing on a computing device, that has the ability to communicate with other peers. For the entire system to work, it is fundamental for the peer to have the ability to communicate with other peers. One computer system might be host to any number of peers. For the purposes of JXTA, a peer is any networked device that implements the core JXTA protocols (this is the definition in the specification [8]), so a single networked device can have any number of JXTA peers executing on it. The peers could all be implementing different service code or participating in a computationally complex algorithm. By using the term networked device, the creators of JXTA are also stating that peers are not limited to computers that sit on a desk, but range from mainframes to the smallest PDAs and devices that we might not normally think of as computers. Some of the other capabilities and features of JXTA peers include:

A JXTA peer can volunteer to implement a module specification and lend its host computer to some tasks. In JXTA, any peer can implement a specification regardless of the binding used by the peer. All of the peers that implement the same specification are interchangeable and transparent to the peer using their service;
Peers can, but do not have to, share content within a peer group;
Peers have the ability to discover other peers and content using all of the network transport protocols implemented by the specification binding; however, the peer will use the defined JXTA message format for all communication;
Peers are not required to remain on the JXTA network for any known period.
A peer that is using the services of another peer has no guarantee that the other peer will remain on the network until its services are no longer needed;
Peers are not required to have direct communication or to live directly on the Internet. Peers may use the services of a routing or rendezvous peer for communicating on the network.
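Two properties from the list above, that peers implementing the same specification are interchangeable and that any peer may leave the network at any time, can be illustrated together. In the sketch below, a client asks for any live provider of a specification and simply falls back to another when one disappears; the class names and specification identifier are hypothetical:

```python
# Sketch: peers implementing the same module specification are
# interchangeable; if one leaves the network, a client uses another.

class Network:
    def __init__(self):
        self.providers = {}    # spec id -> set of live peer ids

    def advertise(self, spec_id, peer_id):
        self.providers.setdefault(spec_id, set()).add(peer_id)

    def leave(self, peer_id):
        # A peer may disappear at any time, without notice.
        for peers in self.providers.values():
            peers.discard(peer_id)

    def find_provider(self, spec_id):
        live = self.providers.get(spec_id, set())
        return min(live) if live else None   # any live peer will do

net = Network()
net.advertise("spec:translator-1.0", "peer-a")
net.advertise("spec:translator-1.0", "peer-b")

print(net.find_provider("spec:translator-1.0"))  # some live provider
net.leave("peer-a")                              # peer-a goes offline
print(net.find_provider("spec:translator-1.0"))  # peer-b still serves
```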

Finally, peer behaviour will follow the programmer's intentions.

JXTA defines two main categories of peers: edge peers and super-peers. Super-peers can be further divided into rendezvous and relay peers. Each peer has a well-defined role in the JXTA peer-to-peer model. The edge peers are usually defined as peers which have transient, low-bandwidth network connectivity. They usually reside on the border of the Internet, hidden behind corporate firewalls or accessing the network through non-dedicated connections; they are usually the peers of client network users. A rendezvous peer is a special-purpose peer which is in charge of coordinating the peers in the JXTA network and provides the necessary scope for message propagation. If the peers are located in different subnets, then the network should have at least one rendezvous peer. A relay peer allows the peers which are behind firewalls or NAT systems to take part in the JXTA network. This is performed by using a protocol which can traverse the firewall, like HTTP, for example. Any peer in a JXTA network can be a rendezvous or relay as long as it has the necessary credentials or meets the network/storage/memory/CPU requirements.

A peer needs to be authenticated to get the rights to use services. The identity of the peer in JXTA is a credential. The credential is used throughout the system to ensure that certain operations have the correct permissions. The credential is officially created when a peer joins a group. Also, the credential can simply be some type of token created ahead of time and presented as part of the joining process. The group recognises the credential during the authentication process when joining the group.

3.2.5 JXTA Groups

If several peers get together to share files or work on a large, computationally intensive problem, they have to form a group.
The formation of a group is usually attributed to several things:

Membership of a shared system using a username/password;
A common transport;
Access to a centralised server.

In the first case, the group is formed when peers log into a group with a predetermined username/password or one picked by the peer itself. In some cases, the group is defined by one of the peers publishing the information necessary to join. If the
peer publishes its own username/password, the group would be considered private, because not all peers would potentially know about the group. In the second case, the transport system used to connect peers and exchange information can produce a group itself. For example, two different P2P systems may be unable to communicate between themselves because the network transport is different and the format of the messages exchanged between peers is unique. The potential to create an even larger group of peers is lost because the individual peers do not know how to communicate with each other. This is obvious in instant messaging networks, e.g. GoogleTalk, MSN, and Skype, which all have proprietary systems; a user who wants to communicate with someone on each system must have three individual clients. Finally, a group is formed when all of the peers are required to log into a centralised server in order to be a part of the group. Although the login will require a username and password, the group will not be set up by an individual peer but by the network itself.

Joining a group can provide many benefits that otherwise a single peer would have to implement itself. The group will have features, commonly called services, which each peer can take advantage of. The JXTA network has one umbrella peer group called the WorldPeerGroup. Because the WorldPeerGroup is the default group that all new peers automatically join on the JXTA network, a JXTA peer has a number of services immediately available to it, including discovery, advertisements, and pipes, among others. JXTA permits creating and joining new peer groups in public and private forms. A public peer group does not require a username or password, but a private one does. Any peer can create either type of peer group for any purpose it desires. JXTA group construction requires a default peer group to be available.
This might suggest that somewhere on the network there is a common peer group server. However, it is not so, because the Java realisation of the JXTA specification has all of the default peer group functionality built in. This means that one or more peers can be launched in a network which is completely cut off from the Internet, and they can still function. The default peer group exists by name, and its functionality is contained within all peers by default. Peer groups have a number of services which have been defined as a core set by the specification. The services listed in the current specification are:

Discovery Service: allows searching for peer group content;

Membership Service: allows the creation of a secure peer group;
Access Service: permits validation of a peer;
Pipe Service: allows the creation and use of pipes;
Resolver Service: allows queries and responses for peer services;
Monitoring Service: enables peers to monitor other peers and groups.

Peer groups have the option of creating and implementing additional services as desired. Each group should have at least one rendezvous peer, and it is not possible to send messages between two groups.

3.2.6 JXTA Advertisements

When peers and peer groups have services that they want to make known to the P2P network, they use an advertisement. The advertisement is an XML-based document that describes a JXTA resource and its content. All of the protocols use advertisements to pass information. All advertisements are hierarchical in nature and will contain elements specific to the type of advertisement. The ID of the resource, which is used to identify the resource being advertised, is particularly important.

3.2.7 JXTA Modules

A module is one of the ways in which functionality can be provided. The module is simply a piece of functionality designed to be downloaded or obtained outside the core JXTA implementation. In most cases, a P2P group will advertise a specification that tells about the functionality needed. The specification will be propagated through the JXTA network. A peer can discover the specification and want to use the new functionality. An implementation advertisement tells the network that a service is available at a specific peer that implements the functionality described in the specification. One of the goals of a P2P system is that multiple peers can have implementations of the same specification. The implementations could be in different languages, yet still provide the same service.
This specification/implementation paradigm allows redundancy of services, so that functionality is still available in the network when peers are overloaded or unavailable. The JXTA module abstraction does not impose or specify what the code representation is. The code representation can be a Java class, a Java jar file, a dynamic library (DLL), a choreographed set of XML messages (WSDL), or a script. Multiple implementations of a module may exist to support various
runtime environments. For example, a module can be implemented on different platforms, such as the Java platform, Microsoft Windows, or the Solaris Operating Environment. The JXTA platform uses modules for self-description and for describing the default NetPeerGroup.

3.2.8 JXTA Pipes

The JXTA specification [8] takes the concept of using a pipe as its communication mechanism from the Unix operating system and its shell. Pipes provide a unidirectional, virtual connection between two pipe endpoints: an input pipe (the receiving end) and an output pipe (the sending end). Pipe connections are established independently of the pipe endpoint peers' locations. For example, the input pipe endpoint can be located behind a firewall or NAT, while the output endpoint can be located on a peer on the Internet. The endpoints may even be on physically different networks: the input pipe endpoint could be on a TCP network while the output pipe endpoint is on a token ring network. As long as there are available JXTA relay peers between the two endpoints, a logical pipe between them may be defined. When a pipe is involved, peers do not need to worry about the network topology or where a peer is located on the network in order to send messages. Through pipes, messages can be sent between peers without having to know anything about the underlying infrastructure; therefore, pipes enable connections without any consideration of connectivity. Two peers may require an intermediary routing peer to communicate with each other. Pipes virtualise peer connections to homogenise and provide an abstraction of the full connectivity available within the JXTA network.
The Java implementation of the specification has the following pipe types:

Unicast: a one-way pipe for sending non-secure data over an unreliable channel;
UnicastSecure: a one-way pipe for sending secure data over an unreliable channel;
Propagate: a one-to-many pipe for sending non-secure data over an unreliable channel;
Bidirectional: a two-way pipe for sending non-secure data over an unreliable channel;

Reliable: a pipe that builds on the bidirectional pipe for reliable communication.

The unicast pipes connect one peer to another for one-way communication. The propagate pipe connects an output pipe to multiple input pipes.

3.2.9 JXTA Services

The concept of services in JXTA goes above and beyond the simple web service and extends to functionality that needs to exist in a decentralised network. The term network service is used to represent any kind of service (web services, legacy services, CORBA services, RMI services, etc.) available on the network. JXTA is as agnostic as possible regarding the service invocation model used to access any network service. JXTA's pipe construction can create services using pipes as the principal invocation mechanism, but pipes are not required for all network services. Upcoming standards such as WSDL, ebXML, SOAP, and UPnP may be used by a JXTA application to invoke a service once the location of that service has been discovered. A JXTA application can use a SOAP connection to a WSDL service, an RMI connection to contact a remote server object, or a pipe connection to communicate with a JXTA service. The JXTA platform does not impose any restrictions on the service invocation model used. However, the underlying peer infrastructure will ultimately dictate which service invocation model is used. For example, if a Java runtime environment is not available, an application will not be able to activate an RMI-based service via an RMI call. JXTA protocols do not contain a specific service invocation protocol by design. This has been done to enable JXTA applications to interoperate with a wide range of services. Due to the large variety of service invocation models and the lack of standards in that area, it is thought to be more practical to leave service invocation outside the scope of JXTA. However, services that use pipes will have better integration with the JXTA platform.
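The pipe model from the previous subchapter (a unidirectional channel between input and output endpoints, where a propagate pipe copies one message to many inputs) can be mimicked with a few hypothetical Python classes. Real JXTA pipes additionally virtualise transport and location, which this sketch deliberately omits:

```python
# Sketch of the pipe endpoint model: an output endpoint can only send,
# input endpoints can only receive; a propagate pipe fans one message
# out to many inputs. All class names here are hypothetical.

import queue

class InputPipe:
    """Receiving end of a pipe."""
    def __init__(self):
        self._q = queue.Queue()
    def receive(self):
        return self._q.get_nowait()

class OutputPipe:
    """Sending end of a pipe; unidirectional by construction."""
    def __init__(self, inputs):
        self._inputs = inputs      # one input => unicast, many => propagate
    def send(self, message):
        for inp in self._inputs:
            inp._q.put(message)

# Unicast: one sender, one receiver.
rx = InputPipe()
OutputPipe([rx]).send("hello")
print(rx.receive())

# Propagate: one sender, many receivers get copies of the same message.
rx1, rx2 = InputPipe(), InputPipe()
OutputPipe([rx1, rx2]).send("to-all")
print(rx1.receive(), rx2.receive())
```

Because an OutputPipe has no receive method and an InputPipe has no send method, the unidirectionality of the channel is enforced by the types themselves, mirroring the input/output endpoint split in the specification.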
There are two fundamental kinds of network services in the JXTA platform: peer services and peergroup services. The JXTA platform provides an infrastructure to publish and discover network services via the PDP. Like any JXTA resource, a network service is represented by a service advertisement: an XML document that contains all
the information that uniquely identifies a service and all of the information necessary to invoke it. Members of a peergroup can publish and discover service advertisements just as they operate on any other advertisements. A service is made available to members of a peergroup by publishing a service advertisement in the peergroup. The JXTA platform provides a distinction between two types of services; depending on how the service is published in the peergroup, a JXTA service is either a peer service or a peergroup service [65].

3.2.10 JXTA Security

From the very beginning, JXTA was designed with a security infrastructure in mind. This is a big advantage: history has shown that when security is not included in the initial architecture, it is far easier for malicious intruders to exploit any security mechanisms that are later added to the system. Moreover, it is always more difficult to add a security infrastructure after a system has been designed. Security in a P2P network presents some interesting challenges due to the distributed computing environment, the vulnerability of links (due to multi-hop routing), the dynamic nature of peer interactions, and the lack of a centralised authority. JXTA security requirements are very similar to those of traditional systems, but the decentralised nature of JXTA makes it more difficult to implement confidentiality, integrity, and non-repudiation.

Typically, security models define security in terms of users, objects, and actions. The action defines the operation allowed on an object by a user. The JXTA platform does not provide for the concept of users. This keeps the core minimal; a user definition can always be done at the service layer. JXTA protocols do not need to know who a user is or how a user is authenticated. Instead, protocol messages provide a placeholder with fields to store security credentials. JXTA provides a framework that allows different security solutions to be plugged in.
For example, every message has a designated credential field that can be used to store security-related information. However, how such information is interpreted is beyond the scope of the specification and is left to services and applications. Communication security can be assured in three ways:

1. Using a virtual private network (VPN) and sending JXTA messages inside it;
2. Creating a secure version of the pipe, similar to a protected tunnel, so that any message transmitted over this pipe is automatically secured;

3. Using regular communication mechanisms but letting the developer protect specific data payloads with encryption techniques and digital signatures.

JXTA provides a basic set of security classes based on the Java Card 2.1 platform security APIs. The Java Card API provides a minimal infrastructure appropriate for small, mobile, wireless devices, such as PDAs and cell phones. The JXTA security classes provide the basic foundation for defining RSA and secret keys, performing RC4 encryption, creating SHA-1 and MD5 message digests, and creating secure hashes and digital signatures based on these digests and an SHA-1-based pseudo-random number generator.

JXTA adds several new concepts, such as the peer, peer group, pipe, and endpoints. JXTA uses a new concept in peer-to-peer communication and discovery: advertisements, which are XML documents that describe services and information available on the JXTA network. Finally, it provides various types of identifiers used to distinguish one item or service from another. In terms of P2P, JXTA provides a platform technology and a community of resources necessary to develop new P2P applications. JXTA's design supports the following:

P2P applications that span from fully centralised to fully decentralised;
Any connected device running any OS on any network protocol;
Highly secure applications;
Interoperable components from different developers.

Edutella is a well-known research project where JXTA has been applied.

3.3 EDUTELLA

Edutella is a set of services which provide a mechanism for exchanging information about heterogeneous educational objects stored in different locations and described according to agreements coming from Edutella's framework. The object description has to be done using RDF (Resource Description Framework), a framework for representing information on the Web, according to the metadata formulated by Edutella's consortium.
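As a rough illustration of how a resource can be described by RDF-style metadata, the sketch below models statements as (subject, predicate, object) triples. The URIs and property names are assumptions for illustration, not Edutella's actual vocabulary:

```python
# Minimal sketch of RDF-style metadata: a learning object described by
# (subject, predicate, object) triples. URIs and property names below
# are illustrative assumptions, not Edutella's real schema.

triples = [
    ("urn:lo:algebra-101", "dc:title", "Introduction to Algebra"),
    ("urn:lo:algebra-101", "dc:creator", "Jane Doe"),
    ("urn:lo:algebra-101", "dc:format", "application/pdf"),
]

def describe(subject, triples):
    """Collect all predicate/object pairs describing one resource."""
    return {p: o for s, p, o in triples if s == subject}

metadata = describe("urn:lo:algebra-101", triples)
print(metadata["dc:title"])
```

Such a triple set is exactly the kind of flat statement collection that can be published, queried and exchanged between peers independently of where the described object itself is stored.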
Edutella is the first solution which extends the well-known P2P file-exchange model to data which has a structured form. In common consumer P2P solutions the available data is represented as files in a file system. It can be said that such data is flat, because a file

can be identified by a few atomic values, such as the file name, file length, type of content, and checksum. Such data can easily be shared between peers and easily identified, because thanks to its flatness its atomic values can simply be evaluated during a request. Based on such evaluation, it can be unambiguously determined whether the requested data is the proper one. In this area Edutella took one step forward and found a solution for describing the content of heterogeneous structured data so that it can be unambiguously determined and made available in a distributed network. As Edutella's authors assume, Edutella is a metadata-based peer-to-peer system; therefore it has to be able to integrate heterogeneous peers (using different repositories, query languages and functionalities; heterogeneous in their uptime, performance, storage size, functionality, number of users, etc.) as well as different kinds of metadata schemas. The essential assumption is that all resources maintained in the Edutella network can be described in RDF, and all functionality in the Edutella network is mediated through RDF statements and queries on them. For a local user, the Edutella network transparently provides access to distributed information resources, and different clients/peers can be used to access these resources (e.g. resource metadata may be located on one peer while the shared object resides on a completely different node; Edutella is able to handle these kinds of situations transparently). Each peer will be required to offer a number of basic services and may offer an arbitrary number of advanced services [66]. Each peer in Edutella can be characterised by the set of services it offers. These services must extend the JXTA platform. The Edutella services are:

Query Service – the most basic service within the Edutella network. Peers register the queries they may be asked through the query service (i.e.
by specifying supported metadata schemas, by specifying individual properties, or even values for these properties). Queries are sent through the Edutella network to the subset of peers which have registered with the service as interested in this kind of query. The resulting RDF models are sent back to the requesting peer. The Edutella Query Service is intended to be a standardised query exchange mechanism for RDF metadata stored in distributed RDF repositories, and is meant to serve both as a query interface for individual RDF repositories located at single Edutella peers and as a query interface for distributed queries spanning multiple RDF repositories. An RDF repository (or knowledge base) consists of RDF statements (or facts) and describes metadata according to arbitrary RDFS

schemas. One of the main purposes is to abstract from the various possible RDF storage layer query languages (e.g. SQL) and from different user-level query languages (e.g. RQL, TRIPLE). The Edutella Query Exchange Language and the Edutella common data model provide the syntax and semantics for an overall standard query interface across heterogeneous peer repositories for any kind of RDF metadata. The Edutella network uses the query exchange language family RDF-QEL-i (based on Datalog semantics and subsets thereof) as the standardised query exchange language format, which is transmitted in an RDF/XML format [66];

Edutella Replication – this service complements local storage by replicating data on additional peers to achieve data persistence/availability and workload balancing while maintaining data integrity and consistency. Since Edutella is mainly concerned with metadata, mostly replication of metadata is realised. Replication of data might be an additional possibility (though this complicates synchronization of updates);

Edutella Mapping, Mediation, Clustering – while groups of peers will usually agree on using a common schema (e.g. SCORM or IMS/LOM for educational resources [66]), extensions or variations might be needed in some locations. The Edutella Mapping service will be able to manage mappings between different schemata and use these mappings to translate queries over one schema X to queries over another schema Y. Mapping services will also provide interoperation between RDF-based and XML-based repositories. Mediation services actively mediate access between different services; clustering services use semantic information to set up semantic routing and semantic clusters;

Annotation Service – in order to make it easy to provide metadata for a particular document, the annotation service provides a document viewer. The document viewer may display HTML pages and PDFs.
Furthermore, the service provides a browser for RDF schemas. This means that a corresponding definition is loaded into an annotation tool and may be browsed. Fields for annotation are displayed according to the schema definition and may be filled either by typing or by marking and dragging information from the document viewer. In the Edutella context an annotation is a set of instantiations attached to an HTML document. They can be as follows: instantiations of RDFS classes, instantiated properties

from one class instance to a datatype instance, and instantiated properties from one class instance to another class instance. Class instances have unique URIs. Instantiations may be attached to particular mark-ups in the HTML documents, viz. URIs and attribute values may appear as strings in the HTML text [67].

In the Edutella architecture, Edutella services (described in web service languages like DAML-S or WSDL, etc.) complement the JXTA Service Layer, building upon the JXTA Core Layer. Edutella peers live on the Application Layer, using the functionality provided by these Edutella services as well as possibly other JXTA services (see Fig. 4) [67], [68]. Edutella adds a search service to the JXTA platform, so that any node, or peer, that carries metadata about some resources can announce an Edutella search service to the network. When looking for information on Edutella, a question will be routed to the peers which can answer the query, and they will return matching results to the requester. There are actually three types of roles to fill in an Edutella network:

Provider – provides a query service;
Consumer – asks questions;
Hub – manages query routing in the network.

An Edutella network will contain many types of peers, which may combine several roles. Hubs are typically set up to increase performance in the network. Most providers will not need to care about hubs at all, as they operate transparently in the Edutella network.
Examples of providers exposing data to the Edutella network could be:

A traditional Learning Management System (LMS) at an educational institution;
A modern RDF-based repository such as UNIVERSAL, OLR or SCAM;
A metadata harvester that collects information from legacy archives, such as OAI archives or Z39.50 sources;
A mediator database such as AMOS, which searches a number of databases in combination while exposing only one query service to Edutella, or any other kind of database containing learning object metadata.

Many other kinds of metadata providers can be imagined. All that is required to be a provider is the ability to answer questions formulated in the Edutella query language. Any kind of information source can be given an Edutella interface [69].
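The provider/consumer/hub interplay described above can be sketched in a few lines: providers register the metadata schema they can answer queries about, and the hub forwards a query only to the matching providers and merges their results. The classes, schema names and statements below are illustrative assumptions, not Edutella's real interfaces:

```python
# Hedged sketch of hub-based query routing: providers register by
# schema; the hub routes a query to registered providers and merges
# the results into one result set.

class Provider:
    def __init__(self, name, schema, store):
        self.name, self.schema, self.store = name, schema, store

    def answer(self, predicate):
        # Return the matching (subject, predicate, object) statements.
        return [t for t in self.store if t[1] == predicate]

class Hub:
    def __init__(self):
        self.registry = {}                        # schema -> providers

    def register(self, provider):
        self.registry.setdefault(provider.schema, []).append(provider)

    def query(self, schema, predicate):
        results = []
        for p in self.registry.get(schema, []):   # route by registration
            results.extend(p.answer(predicate))
        return results                            # merged result set

hub = Hub()
hub.register(Provider("lms", "dc", [("urn:a", "dc:title", "Algebra")]))
hub.register(Provider("oai", "dc", [("urn:b", "dc:title", "Biology")]))
print(hub.query("dc", "dc:title"))  # statements from both providers
```

A consumer only talks to the hub; which providers actually answered stays transparent, exactly as the role description above requires.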

The Edutella Service layer defines data exchange formats and protocols (e.g. how to exchange queries, query results and other metadata between Edutella peers), as well as APIs for advanced functionality in a library-like manner. Applications like repositories, annotation tools or GUI interfaces connected to and accessing the Edutella network are implemented on the application layer. To enable a peer to participate in the Edutella network, Edutella wrappers are used to translate queries and results between the Edutella query and result exchange format and the local format of the peer. The wrapper is also used to connect the peer to the Edutella network by a JXTA-based P2P library. To handle queries the wrapper uses a common Edutella query exchange format and data model for query and result representation. For communication with the Edutella network the wrapper translates the local data model into the Edutella Common Data Model (ECDM), and connects to the Edutella network using the JXTA P2P primitives, transmitting the queries based on the ECDM in RDF/XML form. Edutella's RDF-QEL-i query language is mapped by the Query Service onto simple TRIPLE, SQL or AmosQL queries using one-predicate queries (e.g. SELECT subject FROM statement WHERE predicate), which seems to be a weak solution for retrieving information about structured objects from a semantic repository, even if they are just e-book-like objects. In Edutella, the wrapper-mediator approach [70] divides the functionality of a data integration system into two kinds of subsystems. The wrappers provide access to the data in the data sources using a common data model (CDM) and a common query language. The mediators provide coherent views of the data in the data sources by performing semantic reconciliation of the CDM data representations provided by the wrappers.
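The wrapper side of this division can be illustrated with a toy translation step: local relational rows are rewritten into common-model statements before they enter the network. The field names, URI scheme and property prefixes are assumptions for illustration, not the actual ECDM:

```python
# Illustrative sketch of the wrapper idea: translate records from a
# peer's local (here, relational) format into common-model statements.
# Field names, URIs and the "dc:" prefix are assumed for illustration.

def wrap_local_rows(rows):
    """Translate local rows into (subject, predicate, object) statements."""
    statements = []
    for row in rows:
        subject = f"urn:doc:{row['id']}"
        for field, value in row.items():
            if field != "id":                 # the key becomes the subject
                statements.append((subject, f"dc:{field}", value))
    return statements

local = [{"id": 7, "title": "Grid Survey", "creator": "K. K."}]
print(wrap_local_rows(local))
```

A mediator would then work only on such normalised statements, reconciling them across sources without knowing anything about the peers' local formats.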
Edutella wrapping mediators distribute queries to the appropriate peers, with the restriction that a query must be answerable completely by one Edutella peer. Registration of peer query capabilities is based on (instantiated) property statements and schema information, basically telling the network which kind of schema the peer uses, with some possible value constraints. These registration messages have the same syntax as RDF-QEL-1 queries, which are sent from the peer to the

registration/query distribution hub. Additionally, the peer announces to the hub which query level it can handle (RDF-QEL-1, RDF-QEL-2, etc.). Whenever the hub receives queries, it uses these registrations to forward the queries to the appropriate peers, merges the results, and sends them back as one result set [69].

Edutella's approach of using P2P as the information exchange framework is a very interesting solution, but it provides distributed integration only of the metadata of heterogeneous content, not of the content itself. So, a user can query the distributed repository for data, but when the data is unavailable he/she only gets information about the data's location, its accessibility and an additional description. One can conclude that this solution has shown how the integration of structured data can be achieved, but has not resolved the integration problem itself, as the Edutella authors admit themselves [66].

3.4 Middleware and Federated Databases

The basic mechanism enabling efficient communication in a distributed environment is called middleware [71]. It is software which is located between the operating system and the application on each side of a distributed system; that is why it is also called a layer. There are many types of middleware: distributed objects, application servers, transaction monitors, integration servers, message brokers and others. The most important task of middleware is to make the software manufacturing process simpler by providing a programming abstraction which is common for each side of a distributed system, hiding the heterogeneity and distribution of systems, as well as low-level details related to communication. This is often related to the term transparency.
It is important to note that the aspect of transparency in distributed communication is not only the domain of middleware, but also the subject of research in the area of distributed/federated databases [72], [73], [74]. Over the years, specialists in this field have developed many more advanced aspects of transparency than those which characterise contemporary middleware. The most important forms of transparency provided by federated databases (also called virtual databases) include: access, location, concurrency, heterogeneity, scalability, fragmentation, replication, optimization, failure, migration, and other transparencies [1]. Traditional types of middleware do not support most of these properties, because they are not oriented towards the processing of collections or persistent data using declarative constructs.

Middleware is usually procedural in nature, which means that it offers a significantly lower level of abstraction than query languages. The necessity of processing collections in a distributed system cannot be avoided, but it is associated with problems which can be observed, for example, in the technology of distributed objects, i.e. in CORBA [17], where engineers tried to hide the differences between access to remote and local data. This illusion has proved fatal for the performance of many software projects. The source of the problems is the attempt to operate on remote objects using the same algorithms that can be used when the data is located in the local computer's memory. Unfortunately, as it appears, due to the nature of computer networks (access time, failures) the algorithms cannot be the same. The creators of the CORBA standard attempted to introduce additional services (Query Service, Persistence Service) to improve the processing of huge amounts of data using a query language, but this has not brought satisfactory results, mainly because of the impedance mismatch problem between query languages and traditional programming languages. The same problem concerns the popular Enterprise Java Beans (EJB) technology [75], where Entity Beans and the EJB QL query language acting on them do not solve the problem of processing massive distributed data [76]. Despite the implementation of such query languages, they are almost useless in a situation where an application able to combine data from different data sources is required. Because of the low-level processing (through the API) with the middleware, such situations require a gigantic amount of programming work related to optimization of the code. Such perfect optimization is rarely achievable by a human, because many of the same programming operations accessing data can be implemented in multiple (e.g. a hundred) possible ways.
An automatic optimization of distributed queries realised by using techniques known from distributed databases could resolve this problem. Unfortunately, modern types of middleware do not have such possibilities. Current trends run away from this complex problem; for example, the latest solution, based on the idea of Service Oriented Architecture (SOA), totally ignores the problem of processing huge amounts of data. SOA followers argue that SOA operates at the level of services, that is, at a much higher level than the normal data level (high-level integration of services vs. low-level integration of data). Thus, it is assumed that a remote procedure (service) has been designed in a way that handles all possible business situations associated with remote access to resources. This is an illusory assumption that leads to

the rediscovery (from scratch) of well-known data processing mechanisms in a distributed environment (e.g. transactions for Web Services) [76].

Object-oriented database as a middleware

In modern systems databases are mostly used in integration projects where the problem can be solved using replication, federation or by setting up a Data Warehouse. If a project requires the use of integration concepts such as application logic, then other integration strategies are used, e.g. integration through services. However, note that these restrictions of databases arise from the simplicity of the relational data model and do not concern object-oriented databases. Potentially, object-oriented databases combine the advantages of relational databases and programming languages, because every known integration strategy can be expressed in the context of such a system. In integration projects object-oriented databases provide a number of significant advantages. Thanks to the much higher level of abstraction of an object-oriented database, on which a potential integration engineer works, one line of code can often correspond to what would take (in measures based on the CORBA standard) the form of a large program. This is not only important for the productivity of developers and the ease and stability of the created software, but also enables automatic optimization of operations on distributed data. Such automatic optimization is not possible if a component of the system (e.g. a service) is programmed in a low-level programming language. If integration has to cover data from heterogeneous sources and has to be done in real time, then it is necessary to build a federated database. Object-oriented databases are renowned for a highly abstract data model, allowing an easier specification of the canonical data model binding a federation.
This makes it possible to integrate existing distributed and heterogeneous data sets into a single virtual repository with all data available to the organization. It does not matter then whether the integrated resources are relational or object-oriented databases, XML repositories or multimedia data. Integrating such different data models is a big issue when the relational data model is used at the top, but not for a number of existing object models. Notice that even if the relatively universal XML data model is chosen as the canonical data model, that decision does not guarantee success in the integration of data. The most serious problems here are: global data description, query language and optimization, and updating of data by a global user. Due

to the lack of a global object identifier (which exists e.g. in CORBA, but does not exist in data models other than object-oriented ones), it is not possible to create a general method to modify the integrated data. For this reason, although the majority of current federated databases can easily provide reading operations on distributed data, modifying the state of individual databases is not possible at all, or only in certain cases. Since the interface to the global data is usually (in federated databases) defined as a view, the problem extends to the much broader and well-known (for many years) problem of updating views. Views known from relational systems (treated as remembered/materialised queries) allow updating virtual data, but only in the simplest cases. For example, it is not possible to update data generated by joins (with a few minor exceptions), or generated by aggregation operators. However, in a federated database, transparency for all operations implemented in a view is very important and required, because views can integrate data from thousands of data sources [76]. As it turned out, the solution to the problem of updating views has to be linked with the introduction of a much more specialised structure playing the role of a view. Such structures have been designed for object-oriented databases, and only in that context can their power be fully exploited. This is associated with the necessity to extend the object identifier to a form allowing a description of virtual data.
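The idea of a specialised view structure that makes virtual data updatable can be sketched as follows. This is an illustrative toy, not ODRA's actual view mechanism; the gross/net salary mapping and all names are assumptions:

```python
# Illustrative sketch: a view exposes virtual objects generated by a
# query, and a generic procedure overrides the update operation so
# that changes are mapped back onto the stored objects. The virtual
# identifier (here, simply oid) carries enough information to reach
# the stored object behind the virtual one.

class UpdatableView:
    def __init__(self, store, query, on_update):
        self.store = store          # underlying (local) objects
        self.query = query          # generates the virtual objects
        self.on_update = on_update  # generic procedure overriding updates

    def virtual_objects(self):
        return [(oid, self.query(obj)) for oid, obj in self.store.items()]

    def update(self, oid, new_value):
        self.on_update(self.store[oid], new_value)

store = {1: {"net_salary": 800.0}}
view = UpdatableView(
    store,
    query=lambda o: o["net_salary"] * 1.23,                   # gross view
    on_update=lambda o, gross: o.update(net_salary=gross / 1.23),
)
view.update(1, 1230.0)           # update issued against the virtual data
print(store[1]["net_salary"])    # propagated to the stored object
```

Without the overriding procedure, an update of the computed gross salary would be ambiguous or impossible, which is precisely the classical view-updating problem discussed above.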
As defined by the developers of such views, the modification of any virtual data can be done using views having the following functionalities:

Generic procedures overriding basic operations on virtual objects, which are defined by the view developers;
Identifiers carrying information about the view;
A query which generates the virtual objects.

Integration strategies

Specialists in the field of EAI agree on one issue: currently there is no common integration method or technology which provides a mechanism to solve every existing integration problem [76], [77], [78], [79]. For example, for an organization, integration of the data contained in its databases may be essential. Such an organization requires unrestricted access to all of its data, but is unable to anticipate all access paths to the data now or in the future. For example, in accordance with the

SOA architecture, access to a new set of data is associated with the construction of a new service; in this case it means a modification of the target system each time, which for various reasons (e.g. costs) may be unacceptable. A federated database is the best solution here. However, similar problems may not concern an organization which can focus only on the integration of application logic, and not on the processing of distributed data. The various operations possible to implement in the integrated systems belonging to the organization are well known, and their interfaces are static. For such an organization a low-level integration, as in data integration, may be too complicated. However, it is possible to isolate a well-defined functionality of each of the integrated systems and publish it globally as a service. A set of services created this way can be the basic element for the integration of business processes taking place within such an organization. Such high-level (and desirable from the business point of view) integration cannot be used by the organization from the first example. Note that currently both organizations described above would have to use different sets of tools in order to achieve their objectives. The first would probably use a relational database allowing for federated database construction (e.g. DB2 with the built-in Garlic module [80]), and the other would use distributed objects, message-oriented middleware, or just Web Services. All these solutions can be expressed using an object database integrated with a query language extended to a complete programming language. Such a tool used as middleware can bring a completely new quality to the subject of the integration of software applications.
The ability to use the same tool to achieve different integration strategies should enhance the productivity of developers, reduce the costs of software creation, shorten the learning time and reduce the number of structures, concepts and mechanisms that developers need to learn [76]. One of the features of modern organizations is that different business units use different systems for creating, storing and searching for the data important to them. The variety of data sources is associated with the lack of coordination between them, different paces of adopting new technologies, geographical distance, as well as mergers of different companies. Only through the integration of all these systems together can organizations obtain the full value of the data belonging to them. Together with the development of information systems and the Internet, there is more and more

talk about the integration of applications belonging to the same company (Enterprise Application Integration, EAI) and, in the other case, of applications belonging to various organizations (Business to Business, B2B) [77], [78].

Distributed objects

The integration of data and applications can be implemented at the physical level through a number of various techniques. To reduce the costs of software development, companies tend to abandon implementations of their own system-specific protocols in order to use existing, universal software facilitating communication between applications: middleware. Over the years many types of middleware have been designed: distributed objects, application servers, queued communications, integration servers, transaction monitors, and others. Various types of middleware allow different integration strategies (integration oriented towards data, services, communications, business processes, through portals, etc.), using different methods of combining the integrated components (point-to-point or many-to-many connections) and communication scenarios (synchronous and asynchronous). One of the best-known types of middleware is distributed objects, with the CORBA standard [17] being its best-known representative. In the context of integration, distributed objects are defined in the standard as parts of larger applications, designed to co-operate with one another but operating on completely separate machines. The basic merits of distributed objects are:

An object data model;
Compatibility with business logic;
A relatively high level of abstraction;
A number of services (e.g. transactions) available for developers;
A standard protocol for data exchange between components of the distributed system;
Independence from the programming language used in the implementation.
Unfortunately, time has proven that this unification of access methods to data and the ease of access to data made developers attempt to ignore the constraints imposed by the computer communication network (access time, failures) and to design distributed applications as if they were non-distributed applications. For example, the naive implementation of operations on collections of remote objects processed in a

procedural manner (using iterators) means multiple exchanges of data between the server and the client, in proportion to the number of objects. This feature, combined with inconsistencies in the standard, a complex API, incompatibility between the software provided by different vendors, and other serious problems, led to the decline in the popularity of the CORBA standard. Moreover, despite its significant success (7 million installations), CORBA has recently been a subject of devastating criticism, and practically no new installations appear. The question is whether the technology of distributed objects is also condemned to fail. It seems that unifying access to local and remote data is a disadvantage only when the application is created in a traditional programming language, where data is processed in a procedural manner. In this situation, a programmer without experience in creating distributed applications creates unstable and inefficient software. A programmer who has such experience has limited possibilities in the field of optimization, because he/she can only use the methods implemented by the creator of the server application. In the case where the appropriate method is not implemented, it is necessary to process data at the client side (or rebuild the server application). A form of query language has been introduced in another technology, Enterprise Java Beans [75]. This technology has from the very beginning been designed to provide basic services involving persistence, transactions and security in the context of the Java programming language. Unfortunately, the persistence issue is resolved in a manner similar to Hibernate [81], as object-relational mapping. Since the technique of distributed object creation in EJB is also extremely complicated, the loud criticism of EJB from many specialists is not surprising.
In fact, the complexity of the API is a common feature of all contemporary technologies supporting the development of distributed software, no matter whether it is CORBA [17], DCOM [82], EJB [75], Spring [83], JMS [84], or Web Services [78]. The reason for this is the too low level of abstraction and the creation of software using traditional programming languages, which are unsuitable for this purpose. Therefore, a programming language for distributed systems should be integrated with a query language, and also equipped with appropriate constructs supporting distributed communication. Such a language should have optimization mechanisms bound to the processing of huge amounts of data in a slow and unstable distributed environment, as is done in ODRA [7].

Distributed and Federated Databases

The needs of some organizations (e.g. research institutes) require integration of the resources available to them at a very low level, i.e. flat and simple data stored in databases. For such organizations the integration of data using services or distributed objects can be unacceptable due to their too high-level nature, which implies rigidity in the way of access. An alternative solution is to build a centralised database similar to a Data Warehouse; however, this can be unacceptable because of the cost, the access time to the most current version of the data, and so on. Integration of a few databases using a federation may prove to be the most non-invasive method of integration for information systems (e.g. in the case when one company takes over another one). A federated database [72], [73] is a logical link of independent distributed databases, creating a single integrated database system. Data sources integrated in this way may include not only conventional databases (e.g. object-oriented, relational, XML repositories) created by different manufacturers, but also flat files, text documents, spreadsheets, and many other types of structured and unstructured data. The federated architecture makes all of these data appear as one virtual whole (hence it is also called a virtual database). Such integrated databases make a range of their resources available across the federation. These resources may include metadata (database schemata), ordinary data, or programming interfaces allowing the use of a database. The sum of the data shared in this way, together with the central integration database, creates the complete federation infrastructure. Integrated databases share all their content (or a specific part of it) with the other members of the federation, while keeping autonomy in their local management.
New sources of data can be added to the federation through the creation of wrappers or views. Wrappers are relatively simple but low-level pieces of software allowing heterogeneous data sources to be physically connected to the federation [13]. For example, the wrapper for Microsoft Excel files should implement an API allowing the content of the files to be read and dynamically shared with the software which controls the functioning of the federated database (e.g. it can be a JDBC driver), in accordance with the data model adopted for it. The federated database architecture also includes common components called mediators [70]. The mediator is

a special software module located at the side of the integrated resource. Its task is to convert local data in such a way that the data can later be used by a global user in accordance with the rules set for the whole federation. The mediator translates a query expressed against the federated database's global schema into a form which can be executed on local data. Besides schema conversions, the mediator can also perform direct data conversion. A dynamic (virtual) conversion of a salary from one currency (used locally in a resource) to another (used in the federation) is an example use of a mediator. Even a seemingly trivial problem like this requires some decisions and agreements, such as the banking service according to which the conversion of the money will be done. The crucial feature of federated databases is the degree to which the system is able to behave like a centralised database while hiding the complexity of the mechanisms involved in integrating heterogeneous data in a distributed environment.
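As an illustration of the wrapper and mediator roles described above, the following Python sketch wraps a flat data source behind a common interface and places a currency-converting mediator on top of it. All class names, the record shape and the exchange rates are invented for this example; the sketch is not part of any federated DBMS implementation.

```python
# Hypothetical sketch of a wrapper plus a mediator; names and rates
# are invented for illustration only.

class ExcelWrapper:
    """Wraps a flat data source (a list of rows standing in for an
    Excel file) behind the common interface the federation expects."""
    def __init__(self, rows):
        self._rows = rows  # e.g. parsed spreadsheet content

    def read_all(self):
        # Expose local records in a uniform record shape.
        return [{"name": r[0], "salary": r[1], "currency": r[2]}
                for r in self._rows]


class CurrencyMediator:
    """Converts salaries from the local currency to the currency
    agreed for the whole federation (fixed rates for the sketch)."""
    RATES_TO_EUR = {"PLN": 0.23, "EUR": 1.0}

    def __init__(self, wrapper):
        self._wrapper = wrapper

    def read_all(self):
        converted = []
        for rec in self._wrapper.read_all():
            rate = self.RATES_TO_EUR[rec["currency"]]
            converted.append({"name": rec["name"],
                              "salary": round(rec["salary"] * rate, 2),
                              "currency": "EUR"})
        return converted


wrapper = ExcelWrapper([("Kowalski", 10000, "PLN"), ("Nowak", 3000, "EUR")])
mediator = CurrencyMediator(wrapper)
print(mediator.read_all())  # all salaries now expressed in EUR
```

A global user querying through the mediator sees only EUR salaries; the decision about which exchange rate source to use (here a fixed table) is exactly the kind of federation-wide agreement mentioned above.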
Frequently, one refers to the following levels of transparency in federated databases:
Access transparency: homogeneous methods for manipulating local and remote data;
Location transparency: freeing users from having to know the physical location of data in a distributed system;
Concurrency transparency: allows multiple users simultaneous access to the data with full data integrity, without the need for synchronous cooperation between users or low-level programming of synchronization mechanisms;
Heterogeneity transparency: allows uniform treatment of data coming from different sources and stored using different data models;
Scalability transparency: allows adding new elements to the distributed system without affecting the operation of previously working applications and users;
Fragmentation transparency: an automatic merging facility for objects, collections and tables whose fragments are stored in physically separate places;

Replication transparency: allows creating and removing copies of the data in other geographic locations with a direct effect on data processing, but without implications for the working software and the end user;
Optimization transparency: permits applying, without the user's knowledge, a series of compile-time or run-time optimization strategies to speed up query execution within a distributed database;
Failure transparency: enables uninterrupted work for most users of a distributed database when some of its nodes or communication lines have been damaged;
Migration transparency: allows transferring data to other locations without impact on the work of users.
For many years federated databases have been a subject of research projects, as well as several commercial solutions. The TSIMMIS [85] and HERMES systems are among the first attempts to build such pioneering databases. Both systems implemented mediators by means of non-procedural specifications for integrating specific data resources. The DISCO [86] and Pegasus [87] systems were closer to the form of contemporary federated databases. The DISCO creators focused on efficient integration of heterogeneous resources. Pegasus had its own data model and query language. The Garlic system [80] was the first research project designed to build a federated database based on the relational model. The most important parts of this prototype are now a part of DB2 [74]. Configuration of a simple federation system in DB2 consists of a few steps. First, a wrapper to an integrated resource has to be registered (using the CREATE WRAPPER operation) in a server of the federated database. Then a database object representing the remote server is created (using the CREATE SERVER instruction).
In the next step, the structure of each remote table to which the integration server will refer must be described (the CREATE NICKNAME operation should be used). The names given in the last operation are available in a transparent manner to the SQL language. For example, it is possible to build a view in a database (CREATE VIEW) that refers to both the local and remote data of the federated database server. At first glance it may seem that a view realised in this way hides from the user the fact that the data is distributed, but this is not the case. The basic problem with the transparency of such a

solution appears when a user tries to update data returned by the view. In many cases this turns out not to be possible, due to the well-known problem of updating through a view. DB2 therefore recommends implementing updates using triggers instead. A major problem in the field of federated databases is distributed query optimization. Due to very long access times to distributed resources, a specific policy for evaluating queries operating on such data is needed. In relational databases the operation of joining tables requires special evaluation methods. Thus, query decomposition techniques are mostly used to create sub-queries which are finally sent to different servers. Sometimes semi-join operations are also used, transferring between servers only the associating data needed to evaluate such partial operations. A characteristic feature of the existing market of database management systems is that, in addition to data integration and federation (in the form of clusters), they allow for the construction (for example, in the PL/SQL language) of applications operating at the database server side. In such an environment, stored procedures can be exposed as Web Services; applications can then communicate via RPC as well as message queues, and all operations are realised under the strong control of the data protection mechanisms provided by the database management system (DBMS) (e.g. transactions, monitors, etc.). From the software developer's point of view, when such mechanisms run in the context of the database server, their complexity is invisible. This situation is completely different than in the case of middleware, where the programmer is forced to use a complex API. Object-oriented databases in a federated architecture extend this functionality with support for operations on distributed objects.
In addition, an object-oriented federated database offers a powerful programming language integrated with declarative constructs, reduced to the size of the virtual machines of today's popular programming languages. It enables a programmer to use such a database as a programming tool which is not limited to working only at the server side. The tool obtained in this way has the potential to support many methods of integrating data and applications within the same tool. It seems that the transparency mechanisms and the high-level nature of working inside the environment of such a tool should lead to a significant reduction of production and maintenance costs.
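The query decomposition and semi-join technique mentioned earlier in this section can be sketched as follows. The two in-memory lists stand in for tables held on different servers; the data and column names are invented for the example.

```python
# Semi-join sketch: instead of shipping the whole remote Emp table,
# server A sends only the join-column values it needs, and server B
# returns just the matching rows.

dept_at_server_a = [{"dno": 1, "dname": "R&D"}, {"dno": 2, "dname": "HR"}]
emp_at_server_b = [{"ename": "Kowalski", "dno": 1},
                   {"ename": "Nowak", "dno": 3}]

# Step 1: server A projects the join column and transfers it (small).
join_keys = {d["dno"] for d in dept_at_server_a}

# Step 2: server B evaluates the semi-join locally and ships back
# only the matching rows instead of the whole table.
emp_semijoin = [e for e in emp_at_server_b if e["dno"] in join_keys]

# Step 3: server A completes the join on the reduced relation.
result = [{**d, **e} for d in dept_at_server_a for e in emp_semijoin
          if d["dno"] == e["dno"]]
print(result)  # only Kowalski matches a department held on server A
```

Only `join_keys` and `emp_semijoin` cross the network, which is the point of the technique when the joined tables are large.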

3.5 The egov-bus Virtual Repository

egov-bus [7] is an acronym for the Advanced egovernment Information Service Bus project, supported by the European Community under the "Information Society Technologies" priority of the Sixth Framework Programme (contract number: FP6-IST STP). The project was a 24-month international research effort aiming at designing the foundations of a system providing citizens and businesses with improved access to virtual public services, which are based on existing national egovernment services and which support cross-border life events. The overall egov-bus project objective is to research, design and develop technology innovations which will create and support a software environment providing user-friendly, advanced interfaces to support life events of citizens' and businesses' administration interactions involving many different government organizations within the European Union. The life-events model organises services and allows users to access services in a user-friendly manner, by hiding the functional fragmentation and the organizational complexity of the public sector. This approach transforms governmental portals into virtual agencies, which group functions related to the customer's everyday life, regardless of the responsible agency or branch of government. Such virtual agencies offer single points of entry to multiple governmental agencies (European, national, regional and local) and provide citizens and businesses with the opportunity to interact easily with several public agencies. Life-events lead to a series of transactions between users (citizens and enterprises) and various public sector organizations, often crossing traditional department boundaries. There are substantial information and service needs for the user that can span a range of organizations and be quite complicated.
An example of a straightforward life event, moving house within only one country such as Poland, may require complex interaction with a number of government information systems. The detailed objectives are:
1. Create adaptable process management technologies by enabling virtual services to be combined dynamically from the available set of egovernment functions;
2. Improve the effective usage of advanced web service technologies by egovernment functions by means of service-level agreements, an audit trail, semantic representations, and better availability and performance;

3. Exploit and integrate current and ongoing research results in the area of natural language processing to provide user-friendly, customizable interfaces to the egov-bus;
4. Organise currently available web services according to the specific life-event requirements, creating a comprehensive workflow process that provides clear instructions for end users and allows them to personalise services as required;
5. Research a secure, non-repudiable audit trail for combined Web services by promoting qualified electronic signature technology;
6. Support a virtual repository of data sources required by life-event processes, including meta-data, declarative rules, and procedural knowledge about governing life-event categories;
7. Provide these capabilities based on a highly available, distributed and secure architecture that makes use of existing systems.
Generally, citizens and businesses will profit from more accessible public services. The following concrete benefits will be achieved: improved public services for citizens and businesses; easier access to cross-border services and therefore a closer European Union; improved quality of life and quality of communication; reduced red tape and thus an increase in productivity. To accomplish these challenging objectives, egov-bus researches advances in business process and Web service technologies. Virtual repositories provide data abstraction, and a security service framework ensures adequate levels of data protection and information security. Multi-channel interfaces allow citizens easy access using their preferred interface.
The egov-bus architecture comprises three distinct classes of software components: the newly developed features resulting from the project's research and development effort; the modified and extended pre-existing software components, either proprietary components licensed to the project by the project partners or open software; and the pre-existing information system features. The latter category pertains to the egovernment information systems to be rendered interoperable with the use of the egov-bus prototype, as well as to middleware software components such as workflow management engines.

From a technical point of view, one of the goals of the egov-bus project is to expose all the data as a virtual repository, whose schema is shown in Fig. 5. The central part of the system is the ODRA database server, the only component accessible to top-level users and applications, presenting the virtual repository as a global schema. The server virtually integrates data from the underlying resources, available here only as SBQL view definitions (described in subsection Updatable Object-Oriented Views), based on a global integration schema defined by the system administrator and a global grid index keeping resource-specific information (e.g. data fragmentation, redundancy, etc.). The integration schema is another SBQL view (or a set of views) combining the data from the particular resources (also virtually represented as views, as mentioned) according to predefined procedures and the global index contents. These contents must determine each resource's location and its role in the global schema available to top-level users. The resource's virtual representation is referred to as a contributory view, i.e. another SBQL view covering it and transforming its data into a form compliant with the global schema. A contributory view must comply with the global view; actually, it is defined as its subset. Technical aspects of the whole integration mechanism, including the global index, global schema, integration schema and contributory schema, are presented in subchapters 5.4, 5.5 and 5.6 as a part of the implementation prototype for this dissertation. The user of the repository sees the data exposed by the systems integrated by means of the virtual repository through the global integration view. The main role of the integration view is to hide the complexity of the mechanisms involved in accessing local data sources.
The view implements CRUD behaviour which can be augmented with logic responsible for dealing with horizontal and vertical fragmentation, replication, network failures, etc. Thanks to the declarative nature of SBQL, these complex mechanisms can often be expressed in one line of code. The repository has a highly decentralised architecture. In order to get access to the integration view, clients do not send queries to any centralised location in the network. Instead, every client possesses its own copy of the global view and the corresponding integration view, which is automatically downloaded from the integration server after successful authentication to the repository. A query executed on the integration view is optimised using such techniques as rewriting, pipelining, global indexing and global caching.
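The way an integration view can hide horizontal fragmentation, as described above, can be sketched in Python. The classes below are an illustrative simplification: real ODRA views are SBQL definitions, and the peer names and data are invented.

```python
# Sketch only: a client-side integration view that unions the
# horizontally fragmented contributions of several peers.

class ContributoryView:
    """A peer's virtual contribution, already mapped to the global schema."""
    def __init__(self, employees):
        self._employees = employees

    def retrieve(self):
        return list(self._employees)


class IntegrationView:
    """Client-side copy of the global view; callers cannot tell which
    peer stores which fragment."""
    def __init__(self, peers):
        self._peers = peers

    def retrieve(self):
        result = []
        for peer in self._peers:  # horizontal fragments are unioned
            result.extend(peer.retrieve())
        return result


warsaw = ContributoryView([{"Name": "Jan", "Surname": "Kowalski"}])
lodz = ContributoryView([{"Name": "Anna", "Surname": "Nowak"}])
view = IntegrationView([warsaw, lodz])
print(view.retrieve())  # one virtual collection built from two fragments
```

A caller of `view.retrieve()` sees a single collection and cannot tell that its elements come from two different sites, which is the fragmentation transparency discussed in this chapter.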

Fig. 5 egov-bus virtual repository architecture [7].

The currently considered bottom-level resources can be relational databases, RDF resources, Web service applications and XML documents. Each type of resource requires an appropriate wrapper capable of communicating with both the upper ODRA database (see subchapter 3.6) and the resource itself (except for XML documents, which are currently imported into the system). Such a wrapper works according to the early concept originally proposed in [88], while the contributory views are used as mediators performing appropriate data transformations and standing for external resource representations (visible only within the system, not above the virtual repository).

3.6 The ODRA Database

ODRA (Object Database for Rapid Application development) [7] is a prototype object-oriented application development environment being constructed at the Polish-Japanese Institute of Information Technology, also as a part of the egov-bus project. Its aim is to design a next-generation development tool for future database application programmers. The tool is based on SBQL [89]. The SBQL execution environment consists of a virtual machine, a main-memory DBMS and an infrastructure supporting distributed computing.

The main goal of the ODRA project is to develop new paradigms of database application development. This goal can be reached by increasing the level of abstraction at which a programmer works, through the application of a new, universal, declarative programming language together with its distributed, database-oriented and object-oriented execution environment. Such an approach provides functionality common to a variety of popular technologies (such as relational/object databases, several types of middleware, general-purpose programming languages and their execution environments) in a single universal, easy to learn, interoperable and effective application programming environment. The principal ideas implemented in order to achieve this goal are the following:
1. Object-oriented design. Despite the principal role of object-oriented ideas in software modelling and in programming languages, these ideas have not yet succeeded in the field of databases. The ODRA approach differs from current ways of perceiving object databases, represented mostly by the ODMG standard [90] and database-related Java technologies (e.g. [90], [81]). The system is built upon the SBA methodology ([91], [89]). This makes it possible to introduce into database programming all the popular object-oriented mechanisms (objects, classes, inheritance, polymorphism, encapsulation), as well as some mechanisms previously unknown (such as dynamic object roles [92], [93] or interfaces based on database views [6], [94]).
2. Powerful query language extended to a programming language. The most important feature of ODRA is SBQL, an object-oriented query and programming language. SBQL differs from programming languages and from well-known query languages, because it is a query language with the full computational power of programming languages. SBQL alone makes it possible to create fully fledged database-oriented applications.
The chance to use the same very-high-level language for most database application development tasks may greatly improve programmers' efficiency, as well as software stability, performance and maintenance potential.
3. Virtual repository as middleware. In a networked environment it is possible to connect several hosts running ODRA. All systems tied in this manner can share resources in a heterogeneous and dynamically changing, yet reliable and secure environment. This approach to distributed computing is based on object-oriented

virtual updatable database views [2]. Views are used as wrappers (or mediators) on top of local servers, as a data integration facility for global applications, and as customisers that adapt global resources to the needs of particular client applications. This technology can be perceived as a contribution to distributed databases, Enterprise Application Integration (EAI), Grid Computing and peer-to-peer networks.
The distributed nature of contemporary information systems requires highly specialised software facilitating communication and interoperability between applications in a networked environment. Such software is usually referred to as middleware and is used for application integration. ODRA supports information-oriented and service-oriented application integration. The integration can be achieved through several techniques known from research on distributed/federated databases. The key feature of ODRA-based middleware is the concept of transparency. Due to this transparency, many complex technical details of the distributed data/service environment need not be taken into account in application code. ODRA supports the following forms of transparency: access and location transparency; updating transparency from the side of a global client; distribution and heterogeneity transparency; data fragmentation transparency; data/service redundancy and replication transparency; data indexing transparency; etc. These forms of transparency have not been addressed to a satisfactory degree by current technologies, as explained previously in this chapter. Transparency is achieved in ODRA through the concept of a virtual repository [7]. The repository seamlessly integrates distributed resources and provides a global view of the whole system, allowing one to utilise distributed software resources (e.g. databases, services, applications) and hardware (processor speed, disk space, network, etc.).
It is responsible for the global administration and security infrastructure, global transaction processing, communication mechanisms, ontology and metadata management. The repository also facilitates data access by several redundant data structures (global indexes, global caches, replicas), and protects data against random system failures.

A user of the repository sees the data exposed by the systems integrated by means of the virtual repository through a global integration view. The main role of the integration view is to hide the complexity of the mechanisms involved in accessing local and remote data sources. The view implements CRUD behaviour which can be augmented with logic responsible for dealing with horizontal and vertical fragmentation, replication, network failures, etc. Thanks to the declarative nature of SBQL [89], these complex mechanisms can often be expressed in one line of code. Local sites are fully autonomous, which means it is not necessary to change them in order to make their content visible to the global user of the repository. Their content is visible to global clients through a set of contributory views which must conform to the integration view (be a subset of it). Non-ODRA data sources are available to global clients through a set of wrappers, which map the data stored in them to the canonical object model assumed for ODRA. Wrappers have been developed for several popular databases, languages and middleware technologies. Despite their diversity, they can all be made available to global users of the repository. A global user may not only query local data sources, but also update their content using SBQL. Instead of exposing raw data, the repository designer may decide to expose only procedures. Calls to such procedures can be executed synchronously or asynchronously. Together with SBQL's support for semi-structured data, this feature enables a document-oriented interaction characteristic of current technologies supporting Service Oriented Architecture (SOA).

3.7 Conclusions

All solutions described in this chapter are related to contemporary techniques of integrating heterogeneous distributed data.
The approach presented on the following pages is strictly related to the knowledge drawn from the solutions mentioned above. Actually, the issue of automatic integration of heterogeneous resources discussed in the dissertation can be divided into two separate parts. The first focuses on a methodology for creating an automatic integration procedure that combines heterogeneous, distributed and fragmented structured data coming from different resources into one consistent global structure. The second part concerns the creation of middleware for the automatic integration mechanism, which will provide transparent access to the resources for users, including access and location transparency.

The mediation procedure does not rely on bare resources, but on their wrappers instead. Such a wrapped resource is provided with a common interface accessible to mediators, and its actual nature can be neglected in upper-level mediation procedures. The mediation and integration procedures must rely on an effective resource description language used for building low-level schemata combined into the top-level schema available to users, the only one seen by them. In this thesis, this feature is implemented by updateable object-oriented views based on SBQL. SBQL is further employed for querying the resources constituting the virtual repository.

Chapter 4 An Automatic Integration Methodology of Distributed Data, General Concept and Assumptions

Modern computer systems exhibit a growing need for communication between particular computer programs, or even whole systems, within the same organization (Enterprise Application Integration, EAI) or between completely separate organizations (Business to Business, B2B). Users demand immediate access to all information that the organization has, regardless of the system in which such information is stored. The task for designers is to make the individual systems able to work together; this can be realised as the so-called connection of islands of automation (mutually isolated systems performing well-defined tasks) into one virtual system. So far, there have been a number of controversies about how such a virtual system should look from the technical side. As shown in previous chapters, some specialists call for extending a traditional programming language with a set of tools (usually libraries and/or source code generators) which give programmers (to a certain degree) transparent use of distributed resources. Others propose to use basic services that communicate with each other using standard protocols and data description languages. Still others are followers of integration at the level of business processes, also through business portals. There is also a group which supports integration of applications at the data level, through the replication

of data (e.g. in a data warehouse) or a federation of databases. What all these solutions have in common is that all information has to be available immediately, regardless of where and in what form it is actually stored. Therefore, integration mechanisms must operate in real time and be flexible enough to work with various databases, application servers, content management systems, data warehouses, workflow systems, programming languages, data exchange protocols, etc. This chapter consists of two parts: the first (from subchapter 4.1) contains a contribution on how to achieve integration of data between heterogeneous distributed databases in a data-grid architecture; the second (from subchapter 4.3) depicts the proposed automatic integration method for heterogeneous distributed object-oriented databases.

4.1 Object Integration Approach in Data-Grid

The approach to integrating heterogeneous and autonomous resources presented below is based on the approach seen in federated databases, where the global schema is a representation of the objects accessible to clients. Moreover, it extends the egov-bus Virtual Repository (VR) [7] approach with additional integration views, which play a crucial role in the process of creating virtual objects from distributed resources, accessible there to clients as global virtual objects. This will further be called a data-grid. The goal of integration in our data-grid solution is to create a fully automatic process of integrating distributed and heterogeneous objects which is also transparent (in terms of its technical aspects) to users and reduces user activity within the whole process to a minimum. All a user must do is prepare a mapping of their local database objects into virtual repository objects according to a specified schema and then join the virtual repository.
The rest of the integration job will be done by mechanisms covered within the integration middleware, which will be technically described in the next chapter of this dissertation. The mapping of the local objects needs to be done using the updateable object-oriented views mechanism. In this process a user has to create a contributory view in his/her local database environment. During the view creation the user must map his/her local objects onto virtual objects based on information coming from the contributory

schema. Such a schema will be provided by the virtual repository's consortium, e.g. as a UML object description in a paper document.

Updatable Object-Oriented Views

In databases, a view is an arbitrarily defined image of stored data in terms of a distributed application (e.g. Web applications) [6]. Views can be used for resolving incompatibilities between heterogeneous data sources, enabling their integration, which corresponds to mediation [95]. Database views are distinguished as materialised (representing copies of selected data) and virtual ones (standing only for definitions of data that can be accessed by calling such a view). A typical view definition is a procedure that can be invoked from within a query. One of the most important features of database views is their transparency, which means that a query issued by a user must not distinguish between a view and actual stored data (he or she must not be aware of using views); therefore the data model and the syntax of a query language for views must conform with the ones for physical data.
Views should be characterised by the following features [94], [6]:
Customization, conceptualization, encapsulation: a user (programmer) receives only data that is relevant to his/her interests and in a form that is suitable for his/her activity; this facilitates users' productivity and supports software quality by decreasing the probability of errors; views present the external layer in the three-layered architecture (commonly referred to as the ANSI/SPARC architecture [96]);
Security, privacy, autonomy: views give a possibility to restrict a user's access to relevant parts of the database;
Interoperability, heterogeneity, schema integration, legacy applications: views enable the integration of distributed/heterogeneous databases, allowing understanding and processing of alien, legacy or remote databases according to a common, unified schema;
Data independence, schema evolution: views enable users to change the physical and logical database organisation and schema without affecting applications already written.
The idea of updateable object views [6] relies on augmenting the definition of a view with information on users' intents with respect to updating operations. Only the

view definer is able to express the semantics of view updating. To achieve this, a view definition is divided into two parts. The first part is a functional procedure which maps stored objects into virtual objects (similarly to SQL). The second part contains redefinitions of generic operations on virtual objects. These procedures express the users' intents with respect to the update, delete, insert and retrieve operations performed on virtual objects. A view definition usually contains definitions of subviews, which are defined by the same rule, according to the relativism principle. Because a view definition is a regular complex object, it may also contain other elements, such as procedures, functions, state objects, etc. The above assumptions and SBA semantics [89] allow achieving the following properties [91]: full transparency of views (after defining the view, a user can use the virtual objects in the same way as stored objects); views are automatically recursive and (as procedures) can have parameters. The first part of a view definition has the form of a functional procedure named virtual objects. It returns entities called seeds that unambiguously identify virtual objects (usually seeds are OIDs of stored objects). Seeds are then (implicitly) passed as parameters of the procedures that overload operations on virtual objects. These operations are determined in the other part of the view definition. Four generic operations that can be performed on virtual objects are distinguished: delete removes the given virtual object; retrieve (dereference) returns the value of the given virtual object; insert puts an object being a parameter inside the given virtual object; update modifies the value of the given virtual object according to a parameter (a new value). The definitions of these overloading operations are procedures that are performed on stored objects.
In this way the view definer can take full control over all operations that should happen on stored objects in response to updates of the corresponding virtual objects. If some overloading procedure is not defined, the corresponding operation on virtual objects is forbidden. The procedures have fixed names: on_delete, on_retrieve, on_new, and on_update, respectively. All procedures, including the function supplying seeds of virtual objects, are defined in SBQL and may be arbitrarily complex.
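The mechanism described above can be sketched in Python (a minimal illustration with hypothetical names, not the ODRA implementation): a view supplies seeds pointing at stored objects, and the generic operations on virtual objects are delegated to the overloaded procedures, which operate on the seeds. An undefined procedure makes the corresponding operation forbidden.

```python
# Illustrative sketch of an updateable view (hypothetical names, not ODRA's API).
# Stored objects are plain dicts; seeds are references to stored objects.

class UpdatableView:
    """A view definition: a seed-producing function plus overloaded CRUD procedures."""

    def __init__(self, seeds, on_retrieve, on_update=None, on_delete=None):
        self.seeds = seeds            # returns the seeds of virtual objects
        self.on_retrieve = on_retrieve
        self.on_update = on_update    # None means the operation is forbidden
        self.on_delete = on_delete

    def retrieve_all(self):
        # dereference every virtual object through its seed
        return [self.on_retrieve(s) for s in self.seeds()]

    def update(self, seed, value):
        if self.on_update is None:
            raise PermissionError("update on virtual objects is forbidden")
        self.on_update(seed, value)

# Stored Person objects (the local database).
db = [{"FirstName": "Jan", "LastName": "Kowalski"}]

# Virtual Employee objects defined over Person (mirrors the Person/Employee
# mapping used as the running example in this chapter).
employee = UpdatableView(
    seeds=lambda: db,  # each stored Person is a seed
    on_retrieve=lambda p: {"Name": p["FirstName"], "Surname": p["LastName"]},
    on_update=lambda p, v: p.update(FirstName=v["Name"], LastName=v["Surname"]),
)

print(employee.retrieve_all())              # virtual objects, transparent to the user
employee.update(db[0], {"Name": "Adam", "Surname": "Nowak"})
print(db[0])                                # the stored object changed through the view
```

Updating the virtual object modifies the underlying stored object exactly as the view definer's on_update procedure dictates, which is the essence of giving the definer full control over view updating.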

Three-level Integration Model

We assume that three separate layers of views are required in the Virtual Repository to realise a complete and consistent integration mechanism. The views on each layer are responsible for a different range of integration tasks, and each lower layer depends on the next/higher layer, so the whole behaves as a hierarchical model. The particular layers are used (from the client side) from top to bottom. Every layer can contain an unlimited number of views. The model is presented in Fig. 6.

Fig. 6 Three-layer Integration Model.

Simplifying the problem, we assume that a user wants to contribute his/her local objects to a virtual repository and that his/her local object schema contains only Person objects (representing the user's company employees) with FirstName and LastName sub-objects, as presented in Fig. 7. In the virtual repository global schema, the Person objects have to be available as Employee virtual objects with sub-objects Name and Surname, as depicted in Fig. 8. This means that the user needs to create a contributory view which will map his/her local Person objects to Employee virtual objects.

Fig. 7 Person example object.

Fig. 8 Employee example object.

For the above objects, the content of the mapping contributory view (according to the implementation of updateable object-oriented views in [7]) should be as presented in Listing 1, using the SBQL syntax:

Listing 1 Simplified updateable view for virtual Employee objects.

view EmployeeContribDef {
    /* virtual object declaration */
    virtual objects Employee: record { Name: string; Surname: string; } [0..*];
    /* definition of seeds for virtual objects */
    seed: record { p: Person; } [0..*] { return (Person) as p; }
    /* definitions of CRUD procedures for virtual objects */
    on_retrieve { return p.(FirstName as Name, LastName as Surname); }
    on_delete { delete p; }
    on_update { p := value.(Name as FirstName, Surname as LastName); }
    on_new { create permanent Person(value.(Name as FirstName, Surname as LastName)); }

    /* sub-view definitions */
    view NameDef {
        virtual objects Name: string;
        seed: record { _name: Person.FirstName; } { return p.FirstName as _name; }
        on_retrieve { return _name; }
        on_update { _name := value; }
    }
    view SurnameDef {
        virtual objects Surname: string;
        seed: record { _surname: Person.LastName; } { return p.LastName as _surname; }
        on_retrieve { return _surname; }
        on_update { _surname := value; }
    }
}

The mapping of contributory objects can be done under the restriction that the mapped objects will only have a form (a schema) accepted by the global view. The main task of the contributory view is the transformation of heterogeneous data into the schema required and allowed in the virtual repository. As presented in Listing 1, the basic CRUD procedures are also included in this task. Their implementations are very important for further actions which can be performed on contribution objects. In real life, a global schema description or documentation should specify what behaviour of the contributed objects must be preserved. The contributory view layer can be called the lowest layer because it is responsible for the primary and direct mapping between the raw user's objects (located on a remote machine) and the virtual ones. Moreover, only a person who has direct access and appropriate privileges can make a contribution. At this level the user must have the knowledge of how to do the contribution; data might be fragmented or redundant, so human decisions are absolutely required.

The integration view is placed in the middle layer of the presented integration model (see Fig. 6). This view should be provided by a virtual repository administrator and its schema must conform to the global view schema as well. Actually, a local database user (or a client) does not touch the view directly.
The integration view is responsible for keeping information about the distributed resources and their contributed objects which are a part of the virtual repository. Its definition contains code which creates a bag of remotely accessible contribution objects and makes them available in the local user environment through the global view. Information about global

object fragmentation issues, redundancy, etc. is also hidden in the integration view. It can include CRUD procedures for particular objects, too. The integration view should be managed automatically through a special mechanism (introduced in the next chapter); it can be regarded as a dynamic type of view, rebuilt whenever a new contribution appears in the virtual repository. Moreover, if the integration view is not present and the global view is implemented so as not to use it, such a virtual repository model will also work, but only with limited functionality, without the integration facility described in [7]. Summarising, the main task of the integration view is to provide information about the integration of all contributed resources and to make them available through the global view in a user's local environment.

At the top of the integration model there is the global view, which finally defines the global virtual objects available to the clients of the virtual repository. This is a static type of view which should be created and maintained from the beginning by the virtual repository's administrator. It can be propagated manually or automatically, depending on whether integration is enabled; if it is not, the global view can also work directly on contributed objects (without integration) or on integrated objects. The global view provides data visibility in the virtual repository, thus its definition must be implemented in accordance with the virtual repository agreements, which is directly reflected in the CRUD procedures being a part of the view.

From the client's point of view the integration model works top-down:
1. A user can use global virtual objects in a query;
2. Global virtual objects retrieve local integration virtual objects;
3.
Integration virtual objects connect to the particular distributed resources and locally materialise remote objects as result bags;
a. Local contributed objects are treated as remote objects;
b. Remote objects from remote resources are processed through remote contributory views;
4. Results are presented as global virtual objects to the requesting user.
The whole procedure is based on processing the updatable object-oriented views. According to the CRUD procedures implemented in the particular views, the original objects can also be fully updated or created within these procedures' scopes.
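The top-down procedure above can be sketched in Python (hypothetical names and plain functions standing in for SBQL views): the client queries the global view, which delegates to the integration view, which in turn materialises the results of all remote contributory views locally.

```python
# Sketch of the top-down resolution of a query over the three view layers
# (hypothetical names; the real mechanism works on SBQL views, not Python).

# Contributory views: each site maps its raw objects to the agreed schema.
def contrib_site1():  # this site stores Person objects
    return [{"Name": p["FirstName"], "Surname": p["LastName"]}
            for p in [{"FirstName": "Jan", "LastName": "Kowalski"}]]

def contrib_site2():  # this site already stores Employee-like records
    return [{"Name": "Eva", "Surname": "Nowak"}]

# Integration view: keeps the list of live contributions and merges them.
contributions = [contrib_site1, contrib_site2]

def integration_view():
    bag = []
    for contrib in contributions:      # each remote contributory view
        bag.extend(contrib())          # materialise remote objects locally
    return bag

# Global view: what the client actually queries (step 1),
# which delegates downwards (steps 2-3).
def global_employees():
    return integration_view()

# Step 4: results are presented as global virtual objects to the client.
print([e["Surname"] for e in global_employees()])
```

Detaching a resource corresponds to removing its contributory view from the integration view's list, which is exactly the piece of state the automatic integration mechanism has to maintain.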

4.2 Data Fragmentation

The biggest problem during processing of distributed data from remote databases is that the data schemata in particular locations are different. In the context of distributed processing such data is said to be fragmented. In grid-fashioned architectures or federated databases the schemata of particular database resources are similar and can often be reduced to a common schema, unless two or more resources depend on one another and are joined into one result. In the first case, when data with a similar schema is stored in various locations, we speak of horizontal data fragmentation. In the second case, the data logic is divided according to a key; the schemata created in such a way are different, but share a common field which provides the basic information on which a binding between the particular schemata is possible. This is called vertical data fragmentation. There is also a third type of fragmentation which combines both of the above types; it is called mixed fragmentation.

The easiest way to explain the differences is to depict all situations using an object-oriented data schema. Assume there is a schema which contains objects called Person, and these objects contain sub-objects: FirstName, LastName, Address, Country, PESEL (a unique id for each Person). Dealing with horizontal fragmentation, each resource should contain a collection of complete Person objects. This situation is presented in Fig. 9.

Fig. 9 Example of horizontal fragmentation of objects.

Vertical fragmentation occurs when data resources have different logical schemata of the same objects. Such objects carry additional information about which of the particular

object parts, coming from different resources, can be joined together and finally be made available as one consistent object. In the example presented in Fig. 10 the PESEL object serves as the lead information for joining the particular object parts.

Fig. 10 Example of vertical fragmentation of objects.

An example of mixed fragmentation is presented in Fig. 11. There, one source contains complete Person objects; in the remaining resources the object logic is divided, but the join information (the PESEL object) is preserved in each resource.

Fig. 11 Example of mixed fragmentation of objects.

The fragmentation issue in the integration process of distributed resources concerns the creation of consistent virtual objects after integration. This process cannot be done automatically without intervention in the distributed system by the administrators of the contributed resources. They should be aware that fragmentation issues can appear

and, accordingly, environment-specific solutions should be prepared. A solution for horizontal fragmentation is introduced in detail in the next chapter.

4.3 Automated Integration of Distributed Objects

Fig. 12 A general concept of Virtual Repository.

Achieving a fully automated solution for transparent attaching or detaching of distributed objects in a virtual repository is the main goal of this dissertation. A general concept of the solution, where resources are easily pluggable into the system and system users can appear and disappear unexpectedly, is presented in Fig. 12. A user who plugs into the virtual repository can use resources according to his or her needs, assigned privileges and their availability. In the same way resource providers may plug in and offer data. Such a virtual repository must be realised by an additional layer of middleware which supplies full transparency of resources, providers and clients [4]. The goals of the approach are to design a platform where all clients and providers are able to access multiple distributed resources without any complications concerning data maintenance, and to build a global schema for the accessible data. We assume that it should be a complete mechanism performing a transparent integration of remote objects built on top of a database; to achieve all these features, middleware including a fully operational database engine and a transport platform [4], [3], [5] needs to be created. Based on the integration approach presented in the previous subchapters, such a mechanism can be created using the Virtual Repository [7] with data integration improvements and a peer-to-peer framework as a transport platform.

As described above, the crucial element required for the automatic integration process is the integration view. The view describes two very important details related to the live integration process in a dynamic and distributed environment:
1. It keeps the types of objects defined as view seeds (see the seed clauses in Listing 2). It means that all the objects retrieved from the seeds declared in the return clause will have the declared type;
2. It holds a list of live contributions, automatically managed (by an external mechanism) and located in the seeds' return clause, which also defines the result type of the objects returned by the seed clause (e.g. a bag). The list must have an appropriate definition dependent on the fragmentation type of the contributed objects (horizontal, vertical or mixed; see [5] and subchapter 4.2).
It means that each newly initialised virtual object should have a seed which unambiguously identifies it (together with an internal unique id), and when the virtual object is called, data is retrieved from the original object(s) stored in the seed. When complex object(s) are used as seeds, each object as well as its sub-objects (which can repeat recursively) should have an appropriate original object in a seed. This is not a problem when a view deals with objects directly from the local database, because then the objects are known. On the other hand, for virtual objects having as a source other virtual objects from remote locations, the situation becomes more complicated because the original objects from the seeds are locally unavailable. The problem is related to type checking of such non-existing objects. Moreover, some existing virtual objects might be temporary: they can disappear from the environment unexpectedly. In this situation the declaration of the seed type as well as the seed definitions (the list of seeds) should also be changed dynamically on demand (when e.g.
a new contributing resource appears in the system or an existing one is detached from it). Another available, similar virtual object of the same type but from a different remote location should then be used in the seed's type declaration. The seed definitions should also be updated. This part of the problem is solved in the current proposal in the way described above. The ideal solution would be for the types of virtual objects to be non-dynamic and defined globally, but this problem is out of the scope of the dissertation and is unquestionably on the list of future works.

Consider the view example presented in Listing 1 for objects stored in four autonomous databases and contributed to the local user environment with the names

EmployeeContrib, EmployeeContrib1, EmployeeContrib2 and EmployeeContrib3, where all objects are horizontally fragmented, EmployeeContrib is the representation of the local (current user's) contribution and the rest are remote contributions. The integration view content should be as in Listing 2.

Listing 2 Integration view example for automatic integration with horizontal object fragmentation.

view EmployeeAutoIntegrationDef {
    /* virtual object declaration */
    virtual objects EmployeeIntegr: record { NameIntegr: string; SurnameIntegr: string; } [0..*];
    /* definition of seeds for virtual objects */
    seed: record { p: EmployeeContrib; } [0..*] {
        return (EmployeeContrib union EmployeeContrib1 union
                EmployeeContrib2 union EmployeeContrib3) as p;
    }
    /* definitions of CRUD procedures for virtual objects */
    on_retrieve { return p.(Name as NameIntegr, Surname as SurnameIntegr); }
    on_delete { delete p; }
    on_update { p := value.(NameIntegr as Name, SurnameIntegr as Surname); }
    on_new { create permanent EmployeeContrib(value.(NameIntegr as Name, SurnameIntegr as Surname)); }
    /* sub-view definitions */
    view NameAutoIntegrationDef {
        virtual objects NameIntegr: string;
        seed: record { _name: EmployeeContrib.Name; } { return p.Name as _name; }
        on_retrieve { return _name; }
        on_update { _name := value; }
    }
    view SurnameAutoIntegrationDef {
        virtual objects SurnameIntegr: string;
        seed: record { _surname: EmployeeContrib.Surname; } {

            return p.Surname as _surname; }
        on_retrieve { return _surname; }
        on_update { _surname := value; }
    }
}

Notice that in the above view all definitions use objects which are already virtual, as opposed to the view from Listing 1. Thus the seed definition of each view specifies a virtual contribution object as the seed type (e.g. EmployeeContrib in the EmployeeIntegr view and EmployeeContrib.Name in the NameIntegr view). For the EmployeeIntegr view, the first locally available contributed virtual object is declared as the seed type; in the example it is the EmployeeContrib object. When this object is unavailable (e.g. because the resource has been detached from the system), the first available contributed virtual object should be placed there; it can be EmployeeContrib1 as well as the ones with indexes 2 and 3. A bag of results produced using the union operator is defined inside the seed's return clause. It is strictly related to the fragmentation type of the contributed objects. According to our research [3], [5], a collection of contributed objects integrated by the union operator is sufficient for horizontal fragmentation. In the case of vertical fragmentation the problem is much more complicated, but beyond the scope of this dissertation. However, some research has been done in this field [5], and several rigid assumptions must be fulfilled to carry out a proper integration process:
1. The seed type must be a contributed object having the complete schema of the integrated object, otherwise the returned objects will be incomplete; alternatively, a global representation of such an object type is required there;
2. The seed's return clause should contain a join (over the integrated object) of the particular sub-objects taken from the particular contributed resources. The join must use a unique join-predicate which unambiguously points to the sub-objects matching one another.
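The two integration strategies can be sketched in Python (hypothetical data, standing in for the SBQL mechanism): horizontally fragmented collections are merged with a union, while vertically fragmented parts are joined on the unique key (PESEL in the earlier example), matching assumption 2 above.

```python
# Horizontal fragmentation: each resource holds complete objects -> union.
site_a = [{"PESEL": "1", "FirstName": "Jan", "Address": "Lodz"}]
site_b = [{"PESEL": "2", "FirstName": "Eva", "Address": "Warsaw"}]
horizontal_result = site_a + site_b          # a result bag built by "union"

# Vertical fragmentation: each resource holds a part of every object,
# with the unique key (PESEL) repeated in each resource -> join on the key.
names = [{"PESEL": "1", "FirstName": "Jan"}, {"PESEL": "2", "FirstName": "Eva"}]
addresses = [{"PESEL": "1", "Address": "Lodz"}, {"PESEL": "2", "Address": "Warsaw"}]

def join_on_key(left, right, key):
    """Join two vertically fragmented collections on a unique join-predicate."""
    by_key = {r[key]: r for r in right}
    return [{**l, **by_key[l[key]]} for l in left if l[key] in by_key]

vertical_result = join_on_key(names, addresses, "PESEL")
print(vertical_result[0])  # one consistent Person object reassembled from its parts
```

The union case needs no matching logic at all, which is why horizontal fragmentation can be handled automatically, whereas the vertical case requires the schema-specific key knowledge listed in the assumptions above.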
Information about contributed objects and fragmentation should be propagated by the automated integration mechanism. Such a mechanism, as well as the integration procedure, is described in subchapters 5.4, 5.5, 5.6 and 5.7.

4.4 Approach to Integration using Data Grid Middleware

The real challenge for integration in the Data Grid architecture and the Virtual Repository is to find a method of combining and enabling free bidirectional processing of the contents of local clients and resource providers participating in the global virtual data store. The idea of database communication in a grid architecture relies on unbounded data processing between all database engines plugged into the virtual repository. The approach is mainly focused on mechanisms ensuring effortless:
- resource management;
- transparent integration of resources;
- joining of users;
- network communication architecture.
The above issues are hidden behind the general architecture of the proposed virtual network concept: a middleware platform designed for an easy and scalable integration of a community of database users. The middleware platform provides an abstraction of communication in a virtual repository community. The solution creates a unique and simple database grid, processed in a parallel peer-to-peer architecture. From now on we will use the names grid and virtual repository interchangeably, with similar meanings in the context of the solution, although grid rather represents a technical point of view on the virtual repository.

The network communication is hidden behind the basic concept of a transport platform based on the well-known peer-to-peer (P2P) architecture. Our investigations concerning distributed and parallel systems like Edutella [66] and OGSA [28] lead to the conclusion that a database grid should also be independent of TCP/IP stack limitations, e.g. firewalls, NAT systems and encapsulated private corporate restrictions. The network processes (such as access to the resources, joining and leaving the grid) should be transparent for all participants.
Because grid networks (including computational grids) operate in a parallel and distributed architecture, our crucial principle is to design a self-contained virtual network with P2P elements.

Fig. 13 Data Grid Middleware - its communication layers and their dependencies.

The middleware consists of two application layers, presented in Fig. 13: object-oriented database engines (located at the top and directly available to clients) and P2P applications which create the grid virtual network (located behind the database engines and available through an API inside the database). Local networks and the Internet are located at the bottom. For users, the databases work as heterogeneous data stores, but in fact they can be transparently integrated in the virtual repository. Then, users can process their own local data schemata and also use virtual repository data from the global schema available to all grid contributors. In such an architecture, the databases connected to the virtual network and the peer applications arrange unique parallel communication between physical computers for unlimited information exchange. When a user wants to join the grid, he or she needs to meet two conditions:
1. Prepare a contribution view in his/her local environment according to the virtual repository consortium guidelines and instantiate it;
2. Join the grid using a specific command from the local database.
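The two join conditions can be sketched in Python (a hypothetical API, not the ODRA/JXTA implementation): a peer first prepares a contributory view mapping its local objects to the agreed global schema, then registers it with the virtual network layer; detaching removes the contribution transparently.

```python
# Minimal sketch of joining and leaving the grid (hypothetical names).

class GridNetwork:
    """Stand-in for the P2P virtual network layer of the middleware."""
    def __init__(self):
        self.contributions = {}    # peer name -> contributory view

    def join(self, peer, contrib_view):
        # condition 2: the "join the grid" command issued from the local database
        self.contributions[peer] = contrib_view

    def leave(self, peer):
        self.contributions.pop(peer, None)

    def global_bag(self):
        # the integration layer unions all live contributions (horizontal case)
        return [obj for view in self.contributions.values() for obj in view()]

grid = GridNetwork()

# condition 1: a contributory view mapping local Person objects to the
# globally agreed Employee schema
local_persons = [{"FirstName": "Jan", "LastName": "Kowalski"}]
contrib = lambda: [{"Name": p["FirstName"], "Surname": p["LastName"]}
                   for p in local_persons]

grid.join("peer1", contrib)
print(grid.global_bag())    # the peer's data is now visible in the grid
grid.leave("peer1")
print(grid.global_bag())    # detaching removes it transparently
```

The point of the sketch is that attaching and detaching only change the network layer's list of live contributions; neither the local data nor the global schema has to be touched, which is what makes the process transparent for all participants.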


More information

Service Oriented Architecture (SOA) An Introduction

Service Oriented Architecture (SOA) An Introduction Oriented Architecture (SOA) An Introduction Application Evolution Time Oriented Applications Monolithic Applications Mainframe Client / Server Distributed Applications DCE/RPC CORBA DCOM EJB s Messages

More information

What Is the Java TM 2 Platform, Enterprise Edition?

What Is the Java TM 2 Platform, Enterprise Edition? Page 1 de 9 What Is the Java TM 2 Platform, Enterprise Edition? This document provides an introduction to the features and benefits of the Java 2 platform, Enterprise Edition. Overview Enterprises today

More information

Distribution transparency. Degree of transparency. Openness of distributed systems

Distribution transparency. Degree of transparency. Openness of distributed systems Distributed Systems Principles and Paradigms Maarten van Steen VU Amsterdam, Dept. Computer Science steen@cs.vu.nl Chapter 01: Version: August 27, 2012 1 / 28 Distributed System: Definition A distributed

More information

An approach to grid scheduling by using Condor-G Matchmaking mechanism

An approach to grid scheduling by using Condor-G Matchmaking mechanism An approach to grid scheduling by using Condor-G Matchmaking mechanism E. Imamagic, B. Radic, D. Dobrenic University Computing Centre, University of Zagreb, Croatia {emir.imamagic, branimir.radic, dobrisa.dobrenic}@srce.hr

More information

MIGRATING DESKTOP AND ROAMING ACCESS. Migrating Desktop and Roaming Access Whitepaper

MIGRATING DESKTOP AND ROAMING ACCESS. Migrating Desktop and Roaming Access Whitepaper Migrating Desktop and Roaming Access Whitepaper Poznan Supercomputing and Networking Center Noskowskiego 12/14 61-704 Poznan, POLAND 2004, April white-paper-md-ras.doc 1/11 1 Product overview In this whitepaper

More information

Using an In-Memory Data Grid for Near Real-Time Data Analysis

Using an In-Memory Data Grid for Near Real-Time Data Analysis SCALEOUT SOFTWARE Using an In-Memory Data Grid for Near Real-Time Data Analysis by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 IN today s competitive world, businesses

More information

SLA BASED SERVICE BROKERING IN INTERCLOUD ENVIRONMENTS

SLA BASED SERVICE BROKERING IN INTERCLOUD ENVIRONMENTS SLA BASED SERVICE BROKERING IN INTERCLOUD ENVIRONMENTS Foued Jrad, Jie Tao and Achim Streit Steinbuch Centre for Computing, Karlsruhe Institute of Technology, Karlsruhe, Germany {foued.jrad, jie.tao, achim.streit}@kit.edu

More information

DESIGN OF A PLATFORM OF VIRTUAL SERVICE CONTAINERS FOR SERVICE ORIENTED CLOUD COMPUTING. Carlos de Alfonso Andrés García Vicente Hernández

DESIGN OF A PLATFORM OF VIRTUAL SERVICE CONTAINERS FOR SERVICE ORIENTED CLOUD COMPUTING. Carlos de Alfonso Andrés García Vicente Hernández DESIGN OF A PLATFORM OF VIRTUAL SERVICE CONTAINERS FOR SERVICE ORIENTED CLOUD COMPUTING Carlos de Alfonso Andrés García Vicente Hernández 2 INDEX Introduction Our approach Platform design Storage Security

More information

Data Grids. Lidan Wang April 5, 2007

Data Grids. Lidan Wang April 5, 2007 Data Grids Lidan Wang April 5, 2007 Outline Data-intensive applications Challenges in data access, integration and management in Grid setting Grid services for these data-intensive application Architectural

More information

IaaS Cloud Architectures: Virtualized Data Centers to Federated Cloud Infrastructures

IaaS Cloud Architectures: Virtualized Data Centers to Federated Cloud Infrastructures IaaS Cloud Architectures: Virtualized Data Centers to Federated Cloud Infrastructures Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Introduction

More information

INTRODUCTION TO CLOUD COMPUTING CEN483 PARALLEL AND DISTRIBUTED SYSTEMS

INTRODUCTION TO CLOUD COMPUTING CEN483 PARALLEL AND DISTRIBUTED SYSTEMS INTRODUCTION TO CLOUD COMPUTING CEN483 PARALLEL AND DISTRIBUTED SYSTEMS CLOUD COMPUTING Cloud computing is a model for enabling convenient, ondemand network access to a shared pool of configurable computing

More information

THE CCLRC DATA PORTAL

THE CCLRC DATA PORTAL THE CCLRC DATA PORTAL Glen Drinkwater, Shoaib Sufi CCLRC Daresbury Laboratory, Daresbury, Warrington, Cheshire, WA4 4AD, UK. E-mail: g.j.drinkwater@dl.ac.uk, s.a.sufi@dl.ac.uk Abstract: The project aims

More information

Service Oriented Distributed Manager for Grid System

Service Oriented Distributed Manager for Grid System Service Oriented Distributed Manager for Grid System Entisar S. Alkayal Faculty of Computing and Information Technology King Abdul Aziz University Jeddah, Saudi Arabia entisar_alkayal@hotmail.com Abstract

More information

Cloud Models and Platforms

Cloud Models and Platforms Cloud Models and Platforms Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF A Working Definition of Cloud Computing Cloud computing is a model

More information

Analysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms

Analysis and Research of Cloud Computing System to Comparison of Several Cloud Computing Platforms Volume 1, Issue 1 ISSN: 2320-5288 International Journal of Engineering Technology & Management Research Journal homepage: www.ijetmr.org Analysis and Research of Cloud Computing System to Comparison of

More information

An Evaluation of Economy-based Resource Trading and Scheduling on Computational Power Grids for Parameter Sweep Applications

An Evaluation of Economy-based Resource Trading and Scheduling on Computational Power Grids for Parameter Sweep Applications An Evaluation of Economy-based Resource Trading and Scheduling on Computational Power Grids for Parameter Sweep Applications Rajkumar Buyya, Jonathan Giddy, and David Abramson School of Computer Science

More information

Distributed systems. Distributed Systems Architectures

Distributed systems. Distributed Systems Architectures Distributed systems Distributed Systems Architectures Virtually all large computer-based systems are now distributed systems. Information processing is distributed over several computers rather than confined

More information

Deploying a distributed data storage system on the UK National Grid Service using federated SRB

Deploying a distributed data storage system on the UK National Grid Service using federated SRB Deploying a distributed data storage system on the UK National Grid Service using federated SRB Manandhar A.S., Kleese K., Berrisford P., Brown G.D. CCLRC e-science Center Abstract As Grid enabled applications

More information

SERVICE-ORIENTED MODELING FRAMEWORK (SOMF ) SERVICE-ORIENTED SOFTWARE ARCHITECTURE MODEL LANGUAGE SPECIFICATIONS

SERVICE-ORIENTED MODELING FRAMEWORK (SOMF ) SERVICE-ORIENTED SOFTWARE ARCHITECTURE MODEL LANGUAGE SPECIFICATIONS SERVICE-ORIENTED MODELING FRAMEWORK (SOMF ) VERSION 2.1 SERVICE-ORIENTED SOFTWARE ARCHITECTURE MODEL LANGUAGE SPECIFICATIONS 1 TABLE OF CONTENTS INTRODUCTION... 3 About The Service-Oriented Modeling Framework

More information

The Impact of PaaS on Business Transformation

The Impact of PaaS on Business Transformation The Impact of PaaS on Business Transformation September 2014 Chris McCarthy Sr. Vice President Information Technology 1 Legacy Technology Silos Opportunities Business units Infrastructure Provisioning

More information

SOA REFERENCE ARCHITECTURE: SERVICE ORIENTED ARCHITECTURE

SOA REFERENCE ARCHITECTURE: SERVICE ORIENTED ARCHITECTURE SOA REFERENCE ARCHITECTURE: SERVICE ORIENTED ARCHITECTURE SOA Blueprint A structured blog by Yogish Pai Service Oriented Infrastructure (SOI) As the infrastructure to support SOA, service-oriented infrastructure

More information

A Scalability Model for Managing Distributed-organized Internet Services

A Scalability Model for Managing Distributed-organized Internet Services A Scalability Model for Managing Distributed-organized Internet Services TSUN-YU HSIAO, KO-HSU SU, SHYAN-MING YUAN Department of Computer Science, National Chiao-Tung University. No. 1001, Ta Hsueh Road,

More information

Basic Scheduling in Grid environment &Grid Scheduling Ontology

Basic Scheduling in Grid environment &Grid Scheduling Ontology Basic Scheduling in Grid environment &Grid Scheduling Ontology By: Shreyansh Vakil CSE714 Fall 2006 - Dr. Russ Miller. Department of Computer Science and Engineering, SUNY Buffalo What is Grid Computing??

More information

Service-Oriented Architecture and Software Engineering

Service-Oriented Architecture and Software Engineering -Oriented Architecture and Software Engineering T-86.5165 Seminar on Enterprise Information Systems (2008) 1.4.2008 Characteristics of SOA The software resources in a SOA are represented as services based

More information

2) Xen Hypervisor 3) UEC

2) Xen Hypervisor 3) UEC 5. Implementation Implementation of the trust model requires first preparing a test bed. It is a cloud computing environment that is required as the first step towards the implementation. Various tools

More information

How To Create A Third Generation Elearning System

How To Create A Third Generation Elearning System prof. dr hab. Yaroslav Matviychuk m.eng. Roman Hasko Department of Informational Systems and Technologies Institute of Business and Innovative Technologies National University Lviv Polytechnic Ph.D. Olexandra

More information

White Paper on CLOUD COMPUTING

White Paper on CLOUD COMPUTING White Paper on CLOUD COMPUTING INDEX 1. Introduction 2. Features of Cloud Computing 3. Benefits of Cloud computing 4. Service models of Cloud Computing 5. Deployment models of Cloud Computing 6. Examples

More information

Service Computing: Basics Monica Scannapieco

Service Computing: Basics Monica Scannapieco Service Computing: Basics Monica Scannapieco Generalities: Defining a Service Services are self-describing, open components that support rapid, low-cost composition of distributed applications. Since services

More information

A Grid Architecture for Manufacturing Database System

A Grid Architecture for Manufacturing Database System Database Systems Journal vol. II, no. 2/2011 23 A Grid Architecture for Manufacturing Database System Laurentiu CIOVICĂ, Constantin Daniel AVRAM Economic Informatics Department, Academy of Economic Studies

More information

Migrating Legacy Software Systems to CORBA based Distributed Environments through an Automatic Wrapper Generation Technique

Migrating Legacy Software Systems to CORBA based Distributed Environments through an Automatic Wrapper Generation Technique Migrating Legacy Software Systems to CORBA based Distributed Environments through an Automatic Wrapper Generation Technique Hyeon Soo Kim School of Comp. Eng. and Software Eng., Kum Oh National University

More information

Cisco and EMC Solutions for Application Acceleration and Branch Office Infrastructure Consolidation

Cisco and EMC Solutions for Application Acceleration and Branch Office Infrastructure Consolidation Solution Overview Cisco and EMC Solutions for Application Acceleration and Branch Office Infrastructure Consolidation IT organizations face challenges in consolidating costly and difficult-to-manage branch-office

More information

Archiving, Indexing and Accessing Web Materials: Solutions for large amounts of data

Archiving, Indexing and Accessing Web Materials: Solutions for large amounts of data Archiving, Indexing and Accessing Web Materials: Solutions for large amounts of data David Minor 1, Reagan Moore 2, Bing Zhu, Charles Cowart 4 1. (88)4-104 minor@sdsc.edu San Diego Supercomputer Center

More information

MEng, BSc Applied Computer Science

MEng, BSc Applied Computer Science School of Computing FACULTY OF ENGINEERING MEng, BSc Applied Computer Science Year 1 COMP1212 Computer Processor Effective programming depends on understanding not only how to give a machine instructions

More information

Emerging Technologies Shaping the Future of Data Warehouses & Business Intelligence

Emerging Technologies Shaping the Future of Data Warehouses & Business Intelligence Emerging Technologies Shaping the Future of Data Warehouses & Business Intelligence Service Oriented Architecture SOA and Web Services John O Brien President and Executive Architect Zukeran Technologies

More information

DISTRIBUTED SYSTEMS AND CLOUD COMPUTING. A Comparative Study

DISTRIBUTED SYSTEMS AND CLOUD COMPUTING. A Comparative Study DISTRIBUTED SYSTEMS AND CLOUD COMPUTING A Comparative Study Geographically distributed resources, such as storage devices, data sources, and computing power, are interconnected as a single, unified resource

More information

Automatic Configuration and Service Discovery for Networked Smart Devices

Automatic Configuration and Service Discovery for Networked Smart Devices Automatic Configuration and Service Discovery for Networked Smart Devices Günter Obiltschnig Applied Informatics Software Engineering GmbH St. Peter 33 9184 St. Jakob im Rosental Austria Tel: +43 4253

More information

Introduction to CORBA. 1. Introduction 2. Distributed Systems: Notions 3. Middleware 4. CORBA Architecture

Introduction to CORBA. 1. Introduction 2. Distributed Systems: Notions 3. Middleware 4. CORBA Architecture Introduction to CORBA 1. Introduction 2. Distributed Systems: Notions 3. Middleware 4. CORBA Architecture 1. Introduction CORBA is defined by the OMG The OMG: -Founded in 1989 by eight companies as a non-profit

More information

System types. Distributed systems

System types. Distributed systems System types 1 Personal systems that are designed to run on a personal computer or workstation Distributed systems where the system software runs on a loosely integrated group of cooperating processors

More information

INDIGO DataCloud. Technical Overview RIA-653549. Giacinto.Donvito@ba.infn.it. INFN-Bari

INDIGO DataCloud. Technical Overview RIA-653549. Giacinto.Donvito@ba.infn.it. INFN-Bari INDIGO DataCloud Technical Overview RIA-653549 Giacinto.Donvito@ba.infn.it INFN-Bari Agenda Gap analysis Goals Architecture WPs activities Conclusions 2 Gap Analysis Support federated identities and provide

More information

Integration Platforms Problems and Possibilities *

Integration Platforms Problems and Possibilities * BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 8, No 2 Sofia 2008 Integration Platforms Problems and Possibilities * Hristina Daskalova, Tatiana Atanassova Institute of Information

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

An Experience in Accessing Grid Computing Power from Mobile Device with GridLab Mobile Services

An Experience in Accessing Grid Computing Power from Mobile Device with GridLab Mobile Services An Experience in Accessing Grid Computing Power from Mobile Device with GridLab Mobile Services Abstract In this paper review the notion of the use of mobile device in grid computing environment, We describe

More information

COMP5426 Parallel and Distributed Computing. Distributed Systems: Client/Server and Clusters

COMP5426 Parallel and Distributed Computing. Distributed Systems: Client/Server and Clusters COMP5426 Parallel and Distributed Computing Distributed Systems: Client/Server and Clusters Client/Server Computing Client Client machines are generally single-user workstations providing a user-friendly

More information

Service-Oriented Architectures

Service-Oriented Architectures Architectures Computing & 2009-11-06 Architectures Computing & SERVICE-ORIENTED COMPUTING (SOC) A new computing paradigm revolving around the concept of software as a service Assumes that entire systems

More information

Towards an E-Governance Grid for India (E-GGI): An Architectural Framework for Citizen Services Delivery

Towards an E-Governance Grid for India (E-GGI): An Architectural Framework for Citizen Services Delivery Towards an E-Governance Grid for India (E-GGI): An Architectural Framework for Citizen Services Delivery C. S. R. Prabhu 1 ABSTRACT The National e-governance Plan (NeGP) proposes citizen service delivery

More information

What is Middleware? Software that functions as a conversion or translation layer. It is also a consolidator and integrator.

What is Middleware? Software that functions as a conversion or translation layer. It is also a consolidator and integrator. What is Middleware? Application Application Middleware Middleware Operating System Operating System Software that functions as a conversion or translation layer. It is also a consolidator and integrator.

More information

CS550. Distributed Operating Systems (Advanced Operating Systems) Instructor: Xian-He Sun

CS550. Distributed Operating Systems (Advanced Operating Systems) Instructor: Xian-He Sun CS550 Distributed Operating Systems (Advanced Operating Systems) Instructor: Xian-He Sun Email: sun@iit.edu, Phone: (312) 567-5260 Office hours: 2:10pm-3:10pm Tuesday, 3:30pm-4:30pm Thursday at SB229C,

More information

Federation of Cloud Computing Infrastructure

Federation of Cloud Computing Infrastructure IJSTE International Journal of Science Technology & Engineering Vol. 1, Issue 1, July 2014 ISSN(online): 2349 784X Federation of Cloud Computing Infrastructure Riddhi Solani Kavita Singh Rathore B. Tech.

More information

Azure Scalability Prescriptive Architecture using the Enzo Multitenant Framework

Azure Scalability Prescriptive Architecture using the Enzo Multitenant Framework Azure Scalability Prescriptive Architecture using the Enzo Multitenant Framework Many corporations and Independent Software Vendors considering cloud computing adoption face a similar challenge: how should

More information

Chapter Outline. Chapter 2 Distributed Information Systems Architecture. Middleware for Heterogeneous and Distributed Information Systems

Chapter Outline. Chapter 2 Distributed Information Systems Architecture. Middleware for Heterogeneous and Distributed Information Systems Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter 2 Architecture Chapter Outline Distributed transactions (quick

More information

Bridging the gap between peer-to-peer and conventional SIP networks

Bridging the gap between peer-to-peer and conventional SIP networks 1 Bridging the gap between peer-to-peer and conventional SIP networks Mosiuoa Tsietsi, Alfredo Terzoli, George Wells Department of Computer Science Grahamstown, South Africa Tel: +27 46 603 8291 hezekiah@rucus.ru.ac.za

More information

Building Storage Service in a Private Cloud

Building Storage Service in a Private Cloud Building Storage Service in a Private Cloud Sateesh Potturu & Deepak Vasudevan Wipro Technologies Abstract Storage in a private cloud is the storage that sits within a particular enterprise security domain

More information

Distributed System: Definition

Distributed System: Definition Distributed System: Definition A distributed system is a piece of software that ensures that: A collection of independent computers that appears to its users as a single coherent system Two aspects: (1)

More information

Grid based Integration of Real-Time Value-at-Risk (VaR) Services. Abstract

Grid based Integration of Real-Time Value-at-Risk (VaR) Services. Abstract Grid based Integration of Real-Time Value-at-Risk (VaR) s Paul Donachy Daniel Stødle Terrence J harmer Ron H Perrott Belfast e-science Centre www.qub.ac.uk/escience Brian Conlon Gavan Corr First Derivatives

More information

Building Reliable, Scalable AR System Solutions. High-Availability. White Paper

Building Reliable, Scalable AR System Solutions. High-Availability. White Paper Building Reliable, Scalable Solutions High-Availability White Paper Introduction This paper will discuss the products, tools and strategies available for building reliable and scalable Action Request System

More information

3-Tier Architecture. 3-Tier Architecture. Prepared By. Channu Kambalyal. Page 1 of 19

3-Tier Architecture. 3-Tier Architecture. Prepared By. Channu Kambalyal. Page 1 of 19 3-Tier Architecture Prepared By Channu Kambalyal Page 1 of 19 Table of Contents 1.0 Traditional Host Systems... 3 2.0 Distributed Systems... 4 3.0 Client/Server Model... 5 4.0 Distributed Client/Server

More information

Presented By Quentin Hartman System Administrator InsightsNow, Inc. IT Pro Forum 4/17/2007 itproforum.org

Presented By Quentin Hartman System Administrator InsightsNow, Inc. IT Pro Forum 4/17/2007 itproforum.org Server Virtualization with VMware Products Presented By Quentin Hartman System Administrator InsightsNow, Inc. IT Pro Forum 4/17/2007 itproforum.org What is Server Virtualization? Virtualization is an

More information

E-Business Technologies for the Future

E-Business Technologies for the Future E-Business Technologies for the Future Michael B. Spring Department of Information Science and Telecommunications University of Pittsburgh spring@imap.pitt.edu http://www.sis.pitt.edu/~spring Overview

More information

Software Architecture Engagement Summary

Software Architecture Engagement Summary Presented to: George Smith, Chief, Hydrology Laboratory (HL) Jon Roe, Chief, Hydrologic Software Engineering Branch (HSEB) Edwin Welles, Hydrologist, Hydrologic Software Engineering Branch (HSEB) Introduction

More information

Manifest for Big Data Pig, Hive & Jaql

Manifest for Big Data Pig, Hive & Jaql Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,

More information

Aneka: A Software Platform for.net-based Cloud Computing

Aneka: A Software Platform for.net-based Cloud Computing Aneka: A Software Platform for.net-based Cloud Computing Christian VECCHIOLA a, Xingchen CHU a,b, and Rajkumar BUYYA a,b,1 a Grid Computing and Distributed Systems (GRIDS) Laboratory Department of Computer

More information

Web Service Robust GridFTP

Web Service Robust GridFTP Web Service Robust GridFTP Sang Lim, Geoffrey Fox, Shrideep Pallickara and Marlon Pierce Community Grid Labs, Indiana University 501 N. Morton St. Suite 224 Bloomington, IN 47404 {sblim, gcf, spallick,

More information

Development of a generic IT service catalog as pre-arrangement for Service Level Agreements

Development of a generic IT service catalog as pre-arrangement for Service Level Agreements Development of a generic IT service catalog as pre-arrangement for Service Level Agreements Thorsten Anders Universität Hamburg, Regionales Rechenzentrum, Schlüterstraße 70, 20146 Hamburg, Germany Thorsten.Anders@rrz.uni-hamburg.de

More information

Obelisk: Summoning Minions on a HPC Cluster

Obelisk: Summoning Minions on a HPC Cluster Obelisk: Summoning Minions on a HPC Cluster Abstract In scientific research, having the ability to perform rigorous calculations in a bearable amount of time is an invaluable asset. Fortunately, the growing

More information

CT30A8901 Chapter 10 SOA Delivery Strategies

CT30A8901 Chapter 10 SOA Delivery Strategies CT30A8901 Chapter 10 SOA Delivery Strategies Prof. Jari Porras Communications Software Laboratory Contents 10.1 SOA Delivery lifecycle phases 10.2 The top-down strategy 10.3 The bottom-up strategy 10.4

More information