Logical and categorical methods in data transformation (TransLoCaTe)

1 Introduction to the abbreviated project description

This is an abbreviated project description of the TransLoCaTe project, with an emphasis on the project's second part (2.2), for which a PhD student is currently being sought. The TransLoCaTe project consists of two interacting parts, the first (run by H. Forssell) more abstract or theoretical, and the second (PhD) more applied. The first part will investigate and develop new techniques, perspectives, and ideas for the field of database representation, outside the current paradigm of the relational model, in order to address the deficiencies of that model, especially in the area of data transformation (in a wide sense of the term, including transforming to or from ontologies). The project's second part will begin at the other end, with the current research frontier in data transformation, particularly in the new field of ontology-based data access. While the first part will start by developing a new, abstract framework for representing data, with an aim to benefit current developments in the field, the second part will start with an analysis of the current challenges and methods of the field, and then develop a framework for addressing them. In a manner of speaking, while the first part has techniques and ideas and wants to explore what results can be obtained from them, the second knows what results it wants to obtain and is looking for techniques to obtain them. The two parts are to continuously interact and feed off each other.

2 Background and status of knowledge

2.1 Models of databases and data transformation

Databases being, essentially, collections of (possibly interrelated) tables of data, the foundational question is how best to represent such collections of tables mathematically, in order to study their properties and find suitable ways to manipulate them. The dominant mathematical model since its invention by E. F.
Codd [5] has been the so-called relational model, which has provided a powerful yet quite simple theoretical tool. Although very successful, there are areas in which the relational model is less adequate than in others. This is the case, for instance, in how it represents missing information (see e.g. http://thethirdmanifesto.com/), and, more centrally for us, in the means it provides to compare and transform data structured in different ways. The relational model being such a well-entrenched paradigm, it can be difficult for researchers to think along different lines in such areas. It is the task of the project's first part to think outside the relational box and bring techniques and results from areas such as category theory, categorical logic, and logic more widely to the question of representing data and transformations of data.

The relational model of databases

A database being, then, roughly a collection of tables, its shape or schema can be specified by giving the number of tables and the number of columns in each table, e.g. as a list of table names, each with an associated list of (distinct) column names, referred
to as attributes. A particular collection of tables of this shape, an instance of the database schema, is then a filling out of rows in the form of an assignment of a (finite) set of tuples, of the correct arity, to each table name. Thus the situation can be represented logically by letting a database schema be a (first-order, finite, relational) signature and each instance be a (finite) structure in the usual model-theoretic sense. Queries then correspond to formulas over the signature, and constraints, at least those so-called dependencies that can be formulated as query inclusions, correspond to axioms. In practice (see [1]), axioms that can be formulated as implications or sequents of a quite restricted class of formulas, known as positive-primitive (in e.g. [16]) or regular (in e.g. [14]), suffice. Thus, fruitfully if somewhat simplified, the theory of databases can be seen as the (finite) model theory of regular theories (in the sense of [14]) over a finite relational signature. With this model in hand, much of the theory of databases is now quite well understood, at least as concerns the static picture, that is, the properties of instances of a fixed schema.

Data transformation

In somewhat sweeping generality, we take data transformation here to involve converting data from a source schema to a target schema in a suitable way, e.g. to produce an instance of the target schema from an instance of the source schema, or to define a set of valid target instances with respect to a given source instance, or even to answer queries formulated over the target schema with respect to source instances without moving any data. Accordingly, we take data transformation to include both the data exchange of e.g. [8] and the data integration of e.g. [15].
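The model-theoretic picture above can be made concrete with a small, self-contained sketch (the encoding and all names are hypothetical, chosen only for illustration): a schema is a relational signature, an instance assigns a finite set of tuples to each relation name, and a conjunctive query is answered by enumerating assignments to its variables.

```python
from itertools import product

# Schema as a signature: relation names with arities
schema = {"Employee": 2, "Dept": 1}

# Instance as a finite structure: a finite set of tuples per relation
instance = {
    "Employee": {("alice", "sales"), ("bob", "hr")},
    "Dept": {("sales",), ("hr",)},
}

def answers(query, instance):
    """Evaluate a conjunctive query, given as a list of atoms
    (relation, terms), where terms starting with '?' are variables.
    Returns tuples of values for the variables, in sorted variable
    order."""
    variables = sorted({t for _, ts in query for t in ts if t.startswith("?")})
    domain = {c for tuples in instance.values() for tup in tuples for c in tup}
    results = set()
    for values in product(domain, repeat=len(variables)):
        a = dict(zip(variables, values))
        def ground(t):
            return a[t] if t.startswith("?") else t
        if all(tuple(map(ground, ts)) in instance[rel] for rel, ts in query):
            results.add(tuple(a[v] for v in variables))
    return results

# Employee(x, d) AND Dept(d); sorted variable order is (?d, ?x)
q = [("Employee", ("?x", "?d")), ("Dept", ("?d",))]
print(answers(q, instance))  # contains ('sales', 'alice') and ('hr', 'bob')
```

Brute-force enumeration over the active domain is of course not how a real query engine works, but it shows directly how "formula over the signature" and "query" coincide in this model.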
Now, in the relational model, once you fix a schema you have a well-defined and well-understood mathematical object comprising the database instances structured over that schema: you have the category of instances and homomorphisms, in which you have operations such as products and techniques such as the chase algorithm. But the model does not tell you how to compare or transform instances structured over different schemas. One immediate suggestion is to use the notion of theory translation from mathematical logic and the functors (mappings, if you prefer) between instances that these induce. However, a much studied approach over the last decade is instead to simply define a relation on the instances of two different schemas in the familiar terms of dependencies. Such a data transformation setting is then given by a source schema S, a target schema T (possibly with dependencies of its own), and a schema mapping in the form of a set Σ_st of source-to-target dependencies: query inclusions with source-schema queries included in target-schema queries. As such, the approach can be seen as defining a transformation from S to T by defining a third schema consisting of a copy of S, a copy of T, and the new source-to-target dependencies. One can then study the mapping with the usual tried and tested relational notions and techniques (certain answers, the chase algorithm, etc.). We shall return to this picture in more detail in Section 2.2, where it forms part of the basic set-up (the target schema there being that of an ontology, which introduces additional questions and an additional need for abstraction). However, although one may need or want to define a transformation in precisely this way in certain practical situations, this approach can hardly be said to constitute a general, flexible, and principled approach to comparing, mapping, and transforming instances of different database schemas.
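As an illustration of such a setting, the following sketch (hypothetical encoding; restricted to "full" dependencies without existentially quantified head variables, to keep it short) applies a set of source-to-target dependencies chase-style, copying each matched source pattern into a target instance.

```python
from itertools import product

def chase(source, tgds):
    """source: relation name -> set of tuples.
    tgds: list of (body, head) pairs, where body and head are lists of
    (relation, terms) atoms and '?'-prefixed terms are variables.
    Returns a target instance satisfying every dependency."""
    target = {}
    domain = {c for tuples in source.values() for tup in tuples for c in tup}
    for body, head in tgds:
        variables = sorted({t for _, ts in body for t in ts if t.startswith("?")})
        for values in product(domain, repeat=len(variables)):
            a = dict(zip(variables, values))
            g = lambda t: a[t] if t.startswith("?") else t
            if all(tuple(map(g, ts)) in source.get(rel, set()) for rel, ts in body):
                # Body matched in the source: add the head atoms to the target
                for rel, ts in head:
                    target.setdefault(rel, set()).add(tuple(map(g, ts)))
    return target

# Source schema: Emp(name, dept); target schema: WorksIn(name, dept), Unit(dept)
source = {"Emp": {("alice", "sales")}}
sigma_st = [([("Emp", ("?x", "?d"))],
             [("WorksIn", ("?x", "?d")), ("Unit", ("?d",))])]
print(chase(source, sigma_st))
```

With existentially quantified head variables, the real chase would additionally invent labeled nulls as witnesses; that bookkeeping is omitted here.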
Rather, it is the study of how one can relate two schemas by adding extra dependencies, and notions which one would expect to be basic for a concept of schema mapping, such as composition and inverse, are highly non-trivial [7, 9].

2.2 Mappings for heterogeneous schemas for ontology-based data access

The second part of the project is directed at the recent area of research that has become known as ontology-based data access (OBDA), an approach to the problem of handling and accessing big data by means of information systems that use ontological reasoning. The problem of handling large amounts of data from heterogeneous and distributed data sources, also known as the problem of big data, is currently a challenge in many applications. On the Web, sources of semi-structured, overlapping, and semantically related data are currently proliferating at a phenomenal rate. Likewise in
industry, many companies amass large collections of semantically related data. This state of affairs has created a demand for more powerful and flexible information systems (ISs). This new generation of ISs will need to integrate incomplete and semi-structured information from heterogeneous sources, employ rich and flexible schemas, and answer queries by taking into account both knowledge and data. Ontology-based data access [20] has recently been proposed as an architectural principle for such systems. The main idea is to develop a unified view of the data by describing the relevant domain in an ontology, which then provides the vocabulary used to ask queries. This forms an instance of a data transformation setting in the general sense of Section 2.1: queries are posed over a target schema, now in the form of an ontology, and the data to answer them are structured over a source (database) schema (or several). The advantage of an ontology-based IS is that it can use ontological statements, such as the concept hierarchy and other axioms, to derive new facts and thus enrich query answers with implicit knowledge. The ontology thus mediates between the different data sources, allowing users a unified view of the data in a suitable language. This idea has been incorporated into systems such as QuOnto [2], Owlgres (http://pellet.owldl.com/owlgres), ROWLKit [6], and REQUIEM [19], and ontology reasoners such as RACER [13], FaCT++ [23], Pellet [18], and HermiT [17]. In order to accomplish this, ontology-based ISs need to combine reasoning and query answering over ontologies with building and maintaining collections of mappings between ontologies and data sources. In current approaches, the notion of mapping employed is akin to that sketched in Section 2.1, in that mappings are defined in terms of rules φ → ψ relating a query φ over the data sources to a query ψ over the ontology. As such, ontology-based information systems can be seen as a variation of data integration (see e.g.
[15]), where data stored under one relational schema needs to be available for query answering over a different schema. The main difference from data integration is that mappings in ontology-based data access map not between two schemas, but between a schema and an ontology. Thus the problems related to mappings in ontology-based ISs are to some extent the same as or similar to those in database-to-database transformations, but the new set-up also presents some unique challenges. The following three problem areas form the starting point for the second part of the project:

Mapping between heterogeneous schemas

As noted, existing research on mappings has been done in the context of data integration and data exchange, where information in a source database, with a schema S, needs to be expressed using a different target schema T [22], via mappings that govern the transformation. In data exchange, the goal is to create a new database over T containing, as far as possible, the information from the source database. In data integration, on the other hand, the data stays as is, while the system allows users to pose queries using schema T. In both cases, the schemas S and T are over the same language, usually that of relational databases [3, 12]. In this setting, a schema mapping is a first-order logic formula φ → ψ, where e.g. φ is a conjunction of atomic formulas from the source schema, and ψ a conjunction of atomic formulas from the target schema. Informally, such a mapping states that whenever a pattern of facts appears in the source database, a corresponding pattern must appear in the target. Such mappings are known as tuple-generating dependencies (tgds), and their expressive power and the complexity of working with them have been extensively studied [11, 22]. In OBDA, however, the mappings are between two schemas over different languages, usually between a database schema (first-order logic) and an ontology (description logic).
As such, it is not clear which results from data integration and exchange carry over to this setting, or to what extent the different classes of tgds that have been defined and studied in the literature make sense for OBDA. In sum, the fact that in OBDA one needs to relate two different kinds of entities, so to speak, namely databases and ontologies, presents a challenge, invalidates some of the previous research, and indicates the need for a more abstract approach.
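To fix intuitions about the homogeneous case discussed above, the sketch below (hypothetical names and encoding) checks whether a pair of instances satisfies a tgd: whenever the body pattern matches in the source, some choice of witnesses for the existential head variables must make the head pattern match in the target.

```python
from itertools import product

def satisfies_tgd(source, target, body, head):
    """body/head are lists of (relation, terms) atoms; '?'-terms are
    variables. Head variables not occurring in the body are read as
    existentially quantified."""
    dom_s = {c for ts in source.values() for tup in ts for c in tup}
    dom_t = {c for ts in target.values() for tup in ts for c in tup}
    body_vars = sorted({t for _, ts in body for t in ts if t.startswith("?")})
    head_vars = sorted({t for _, ts in head for t in ts if t.startswith("?")}
                       - set(body_vars))
    for values in product(dom_s, repeat=len(body_vars)):
        a = dict(zip(body_vars, values))
        if not all(tuple(a.get(t, t) for t in ts) in source.get(rel, set())
                   for rel, ts in body):
            continue  # body does not match under this assignment
        # Body matches: search for witnesses for the existential variables
        witnessed = any(
            all(tuple({**a, **dict(zip(head_vars, ext))}.get(t, t) for t in ts)
                in target.get(rel, set())
                for rel, ts in head)
            for ext in product(dom_t, repeat=len(head_vars)))
        if not witnessed:
            return False
    return True

# Emp(x, d) -> EXISTS m. WorksIn(x, d) AND Mgr(d, m)
body = [("Emp", ("?x", "?d"))]
head = [("WorksIn", ("?x", "?d")), ("Mgr", ("?d", "?m"))]

source = {"Emp": {("alice", "sales")}}
good_target = {"WorksIn": {("alice", "sales")}, "Mgr": {("sales", "carol")}}
bad_target = {"WorksIn": {("alice", "sales")}}

print(satisfies_tgd(source, good_target, body, head))  # True
print(satisfies_tgd(source, bad_target, body, head))   # False
```

In OBDA the same rule shape has to relate a database to an ontology rather than to a second database, which is precisely where it becomes unclear how much of this classical machinery transfers.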
Query answering with mappings

Numerous query answering algorithms have been developed in both the ontology and the database settings. Mainstream RDBMSs currently employ sophisticated query optimisation techniques based on the assumption that the database instance already satisfies the dependencies, a valid assumption when dependencies are used only as checks. Query answering becomes much harder if an ontology, or a set of dependencies, needs to be taken into account. A number of (worst-case) complexity results are known for answering queries over DLs [4, 10] and dependencies [1]. Currently, all practical approaches to building ontology-based ISs rely on query answering via query rewriting: answers to a query q over an ontology O and a database instance DB are computed by first rewriting q (using O) into another query q′ and then evaluating q′ over DB. Query rewriting is particularly suitable in scenarios where the ontology-based IS has no direct control over the data and cannot modify it. The query obtained through rewriting depends to a significant extent on the choice of mappings. As such, realistic algorithms need to work with and analyze the mappings; treating the mappings as a given is likely to lead to, e.g., redundant queries being executed over the data sources. Therefore, knowing what restrictions on mappings make for efficient query answering would be beneficial for OBDA. Current research in this area [21] uses fairly simple mappings, and thus the topic is still largely unexplored.

Managing mappings in the face of change

It is not clear how mappings can be maintained in the face of changes in the data sources, or in the ontology. In fact, it is not even clear what operations on mappings are required. Here, again, the question of what queries φ and ψ a mapping rule φ → ψ allows becomes significant. Previous work on mapping management [3] is very recent, and has looked at mappings in the context of data integration and exchange.
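Returning for a moment to the query rewriting described above, it can be illustrated in miniature (a hypothetical sketch, limited to atomic queries under subclass axioms): the query is expanded, using the ontology, into the set of concepts whose instances answer it, and the expanded query is then evaluated over the unmodified data.

```python
def rewrite(concept, subclass_axioms):
    """Rewrite an atomic query over `concept` into the set of concepts
    whose instances answer it, by closing under subclass axioms
    (sub, sup), read as 'every sub is a sup'."""
    result = {concept}
    changed = True
    while changed:
        changed = False
        for sub, sup in subclass_axioms:
            if sup in result and sub not in result:
                result.add(sub)
                changed = True
    return result

# Toy ontology O and database DB (all names hypothetical)
ontology = [("Professor", "Teacher"), ("Teacher", "Person")]
db = {("Professor", "turing"), ("Person", "ada")}

# q = Person(x), rewritten into the union Person OR Teacher OR Professor,
# then evaluated directly over the unmodified database
q_rewritten = rewrite("Person", ontology)
results = {i for concept, i in db if concept in q_rewritten}
print(sorted(results))  # ['ada', 'turing']
```

Note that the data is never saturated or modified; all ontological knowledge is compiled into the rewritten query, which is what makes the approach viable when the IS has no control over the sources.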
As discussed above, such results may or may not apply to OBDA settings due to the heterogeneity of schemas. On the one hand, some types of changes in the data sources, such as adding or altering a table, can to some extent be accommodated within existing frameworks for data integration and exchange [3]. On the other hand, other types of changes, such as the deletion of an axiom in the ontology, do not fit into this framework at all. For example, a subclass axiom that is removed may necessitate the addition of a new rule to avoid information loss. To discover and repair problems arising in such ways, a system for mapping maintenance must consider not only the mappings, but also the ontology and possibly the data sources. That is to say, in a dynamic setting, mappings will have to be maintained in the face of changing source databases as well as changing target ontologies. This calls for a conceptual framework which is able to encompass heterogeneous database-to-ontology transformations together with homogeneous database-to-database and ontology-to-ontology schema changes.

2.3 The Optique project

The Scalable End-user Access to Big Data project (short name: Optique), an FP7 Large-scale Integrating Project that runs until November 2016, aims at applying ontology-based data access to large industry use cases from Siemens and Statoil. Optique has a component that addresses ontology and mapping management, and has assembled a consortium of world-leading experts in the domain of databases and ontology-based information systems in order to implement new practical solutions to the problems of managing large collections of mappings. The TransLoCaTe project will work in close collaboration with Optique so as to exploit the results, insights, and use case information from Optique, complementing the practical focus of Optique with foundational theoretical research.
2.4 Interaction

The project consists of two parts, one top-down, starting from abstract methods and ideas, and one bottom-up, starting from the concrete problems facing an area of current research in data transformation. It is clear that the two parts are at the outset some distance apart; the first part will investigate abstract models which it believes will be suitable for data transformation settings, but it does not have ready techniques that are established as suitable for the problems facing the second part. The second part will investigate problems facing an area of data integration which it believes calls for more abstract methods, but it has not identified those methods as being precisely those that the first part sets out to investigate. Nevertheless, each part stands to benefit extensively from the other, as each fills a certain void in the other. For the first part, its idea is that one should model databases in a way more suitable for dynamic settings, but the abstractions and approaches do not spring directly from current applications. The focus on, concern with, and hands-on expertise in such applications is what the second part brings to the project. For the second part, the data transformation challenges it sets out to solve are rather clear, but they seem to call for an interdisciplinary, and more abstract, approach than what is currently being employed. The focus on and experience with such approaches is what the first part brings to the project. In order to keep in close contact with database users and the database community, as well as with ongoing research in mapping management for ontology-based data access, the project will from its start work in close collaboration with the Optique project.

References

[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
[2] A. Acciarri, D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, M. Palmieri, and R. Rosati. QuOnto: Querying Ontologies. In Proc. AAAI, pages 1670–1671, 2005.
[3] M.
Arenas, J. Pérez, J. L. Reutter, and C. Riveros. Foundations of schema mapping management. In J. Paredaens and D. V. Gucht, editors, Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS '10), pages 227–238. ACM, 2010.
[4] D. Calvanese, G. De Giacomo, and M. Lenzerini. On the Decidability of Query Containment under Constraints. In Proc. PODS, pages 149–158, 1998.
[5] E. F. Codd. A relational model of data for large shared data banks. Communications of the ACM, 13(6):377–387, 1970.
[6] C. Corona, M. Ruzzi, and D. F. Savo. Filling the gap between OWL 2 QL and QuOnto: ROWLKit. In Proc. DL, 2009.
[7] R. Fagin. Inverting schema mappings. In Proceedings of the 25th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS '06), pages 50–59, 2006.
[8] R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data exchange: Semantics and query answering. Theoretical Computer Science, 336(1):89–124, 2005.
[9] R. Fagin, P. G. Kolaitis, L. Popa, and W.-C. Tan. Composing schema mappings: Second-order dependencies to the rescue. ACM Transactions on Database Systems (TODS), 30(4):994–1055, 2005.
[10] B. Glimm, I. Horrocks, C. Lutz, and U. Sattler. Conjunctive Query Answering for the Description Logic SHIQ. J. Artif. Intell. Res., 31:151–198, 2008.
[11] G. Gottlob, R. Pichler, and V. Savenkov. Normalization and optimization of schema mappings. The VLDB Journal, 20(2):277–302, 2011.
[12] G. Gottlob and P. Senellart. Schema mapping discovery from data instances. Journal of the ACM, 57(2), 2010.
[13] V. Haarslev and R. Möller. RACER System Description. In Proc. IJCAR, pages 701–706, 2001.
[14] P. T. Johnstone. Sketches of an Elephant, volumes 43 and 44 of Oxford Logic Guides. Clarendon Press, Oxford, 2002.
[15] M. Lenzerini. Data integration: A theoretical perspective. In Proceedings of the ACM Symposium on Principles of Database Systems, pages 233–246, 2002.
[16] M. Makkai. A theorem on Barr-exact categories, with an infinitary generalization. Annals of Pure and Applied Logic, 47:225–268, 1990.
[17] B. Motik, R. Shearer, and I. Horrocks. Hypertableau Reasoning for Description Logics. J. Artif. Intell. Res., 36:165–228, 2009.
[18] B. Parsia and E. Sirin. Pellet: An OWL-DL Reasoner. Poster at ISWC, 2004.
[19] H. Pérez-Urbina, I. Horrocks, and B. Motik. Efficient Query Answering for OWL 2. In Proc. ISWC, pages 489–504, 2009.
[20] A. Poggi, D. Lembo, D. Calvanese, G. De Giacomo, M. Lenzerini, and R. Rosati. Linking Data to Ontologies. J. Data Semantics, 10:133–173, 2008.
[21] A. Poggi, D. Lembo, D. Calvanese, G. De Giacomo, M. Lenzerini, and R. Rosati. Linking data to ontologies. In S. Spaccapietra, editor, Journal on Data Semantics X, volume 4900 of Lecture Notes in Computer Science, pages 133–173. Springer, 2008.
[22] B. ten Cate and P. G. Kolaitis. Structural characterizations of schema-mapping languages. Communications of the ACM, 53(1):101–110, Jan. 2010.
[23] D. Tsarkov and I. Horrocks. FaCT++ Description Logic Reasoner: System Description. In Proc. IJCAR, pages 292–297, 2006.