Heterogeneous databases mediation


MASTER IN COMPUTER SCIENCE, UBIQUITOUS NETWORKING
Heterogeneous databases mediation
Master Thesis Report
Laboratoire d'Informatique des Signaux et Systèmes de Sophia-Antipolis
Team MODALIS
29/08/2014
Author: DJIMENOU Loïc
Supervisors: MONTAGNAT Johan, MICHEL Franck


Abstract

The recent trend towards NoSQL databases raises several research topics and challenges. NoSQL databases offer a rich diversity of models and techniques to enhance data management on the web. Despite their efficiency, NoSQL databases are heterogeneous, and a huge amount of data remains hidden behind legacy policies. Besides NoSQL, the Web of Data standards (RDF, R2RML, SPARQL, etc.) have been designed to facilitate access to and linking of open data sources. Several recent research efforts target the translation of NoSQL data to the Web of Data standards. In the same direction, this report presents an overview of the different NoSQL databases and then proposes an adaptation of the R2RML mapping language to NoSQL databases. We introduce the xr2rml mapping language, a backward-compatible extension of R2RML. Several use cases, examples and an implementation accompany the description of xr2rml.

Plan

1. INTRODUCTION
2. NOSQL STATE OF THE ART
    DEFINITION
    MOTIVATION BEHIND THE NOSQL MOVEMENT
        The CAP Theorem
    DIFFERENT NOSQL DATABASE TYPES
        Key-Value stores databases
        Document store databases
        Extensible records stores
        Graph stores
    A COMMON DATA MODEL
3. THE WEB OF DATA
    THE RESOURCE DESCRIPTION FRAMEWORK (RDF)
    THE RDB-TO-RDF PROCESS
    THE RDB TO RDF MAPPING LANGUAGE (R2RML)
    RDF STORES AND NOSQL
4. EXTENDING R2RML TO NOSQL
    DEFINING A LOGICAL SOURCE
    CREATING RDF TERMS FROM STRUCTURED VALUES
        Referencing data elements with data formats
        The parse type property rrx:parsetype
        Term types
        Production of RDF terms with parse type rr:literal
        Production of RDF terms with parse type rrx:listormap
        Parsing nested structures and typing their elements
        Foreign Key relationship between logical tables
5. IMPLEMENTING XR2RML ON MORPH
    IMPLEMENTING XR2RML
6. IMPROVING XR2RML
    REFERENCING DATA ELEMENTS
        Referencing data elements with mixed data formats
        rrx:joinparse properties and mixed-path
    THE PROPERTY "RRX:PARSETYPESEQ"
7. THESIS SUMMARY AND PERSPECTIVES
BIBLIOGRAPHY

1. Introduction

Since 1970, the relational data model introduced in Codd's article [Ref 25] has been the predominant storage model for applications on the web. Relational databases are known for their efficiency and their consistency. Yet the web and its fast data growth have raised several challenges that relational databases cannot address. Besides traditional relational databases (MySQL, PostgreSQL, etc.), developers have started to consider alternative database systems that can meet their data storage needs. The next generation of database management systems, namely NoSQL, comes with a rich diversity of techniques and systems. In the absence of standards, NoSQL databases proliferate and overcome some limitations of relational databases by relaxing the data model. Big companies like Amazon or Google develop their own systems and foster the adoption of NoSQL. NoSQL databases have become popular on the web and constitute an interesting subject of study. They usually aim at a huge storage capacity to host data available on the web. Most NoSQL systems are application-centric: as with relational databases, the data stored in a NoSQL database is tightly coupled with a particular application or service. As a result, both relational and NoSQL databases offer poor data accessibility.

On the other hand, the concepts of "Linked data" and "Web of data" aim at providing data formats and standards to overcome poor data accessibility on the web. The Semantic Web, or Web of data, provides an interoperable data format, namely RDF, based on web standards such as HTTP and URIs. Formatting data using web standards ensures accessibility, portability and web-scale data linking. The World Wide Web Consortium (W3C [Ref 26]) is an international community that develops standards for the web. The W3C designed the specification of the RDF format and the standard rules concerning data translation into RDF. More precisely, standards and rules for the translation of relational database tables into RDF (RDB-to-RDF) already exist.
In this thesis, we aim at extending the Web of data standards designed for relational databases to NoSQL. The report is organized as follows. Chapter 2 presents the different existing NoSQL database systems, proposes a classification, and identifies a common data model that makes it possible to address most NoSQL systems consistently. Chapter 3 introduces the Web of data standards. Chapter 4 details our contribution: it presents the mapping language we propose in order to make NoSQL data available in the RDF format. This mapping language extends a W3C standard, namely R2RML. Chapter 5 proposes an implementation of the language presented in chapter 4. Chapters 6 and 7 present improvements and perspectives for our work.

2. NoSQL State of the art

2.1 Definition

In 1998, Carlo Strozzi introduced the term "NoSQL" to name his relational database [Ref 1]. Strozzi used the term simply to distinguish his solution from other relational database management systems that use SQL: his database did not expose an SQL interface. Nowadays, the term "NoSQL" is used for database systems that do not follow the relational schema. It stands for "Not Only SQL". Seminal papers on Google's Bigtable [Ref 2] and Amazon's Dynamo [Ref 3] revived the "NoSQL" topic and constitute the starting point of the NoSQL movement. To address their particular storage needs, Google and Amazon created their own data management systems to store and process huge amounts of data. The term "NoSQL" now designates a storage system that has been designed for a specific need, regardless of the relational schema and rules. Among "NoSQL" systems, one can find different categories: document stores, key-value stores, extensible record stores and many others.

2.2 Motivation behind the NoSQL movement

There are many reasons behind the rise of NoSQL databases. The main one comes from the fact that relational databases implement the ACID (Atomicity, Consistency, Isolation, Durability) properties [Ref 4]. These properties guarantee the reliability of database transactions (a transaction is any process that modifies the database state). Atomicity requires that a transaction completes only if all the sub-processes it involves complete. Consistency ensures that any transaction brings the database from one valid state to another. Isolation ensures that the concurrent execution of transactions results in the system state that would be obtained if the same transactions were executed one after the other. Durability means that once a transaction has been committed, it remains so even in case of crashes or errors.

ACID properties guarantee availability and consistency, but come with restrictions in terms of scalability. Relational databases strictly comply with ACID properties, but they are mostly centralized systems for which scalability is vertical (adding more computing power to the host server). Conversely, horizontal scalability, that is, partitioning the data over several machines that form the pool of database system resources, is the most desirable feature in a NoSQL database. NoSQL databases aim at providing a good partition tolerance, which relational databases do not offer due to ACID properties. NoSQL data stores give up ACID properties for the more relaxed BASE model [Ref 5] (Basically Available, Soft state, Eventually consistent).
The notion of Basic Availability reflects the fact that the NoSQL approach focuses on the availability of data even in the presence of multiple failures. This is achieved with a highly distributed database management approach. Instead of maintaining a single large data store and focusing on its fault tolerance, NoSQL databases spread data across many storage systems with a relatively high degree of replication. A failure that disrupts access to a segment of data does not necessarily result in a complete database outage. In this model, availability among partitions is more important than consistency. The notion of Soft State means that the consistency requirements of the ACID model are not necessarily taken into account. One of the basic concepts behind BASE is that data consistency should not be handled by the database but by the application. The last notion is eventual consistency. Unlike the ACID model, in which consistency must be immediate after each transaction on the database, the BASE model proposes an "on-going" consistency that eventually leads the system to an overall consistent state. Database systems are not able to simultaneously ensure availability, strong consistency and reliability in case of node crashes (for data partitioned among many nodes). This assertion is based on the CAP theorem of E. Brewer.

2.2.1 The CAP Theorem

In 2000, Eric Brewer introduced the "CAP theorem" at the ACM Symposium on Principles of Distributed Computing. In his keynote titled "Towards Robust Distributed Systems" [Ref 6], Brewer presented his theorem, which has since been widely adopted by many companies and by the NoSQL community. The CAP acronym stands for Consistency, Availability and Partition tolerance.

Consistency concerns the state of the system after the execution of an operation. A distributed system is typically considered consistent if, after each update operation, all users accessing the system see the latest version of the data. This property is commonly observed in relational databases: every transaction ideally leaves the system in a stable state, and intermediate states are not visible to the users. Availability, and especially high availability, means that a system is designed and implemented to guarantee a result to a process even in case of a crash (hardware or software errors). Partition tolerance is the ability of the system to perform successful operations when the data is spread among several nodes in a network.

These three properties are desirable for a web-scale system but are impossible to achieve together in practice. The term "impossible" here does not imply that one must entirely forfeit consistency, availability or partition tolerance. The choice between consistency and availability can occur several times within the same system, and it can change according to the operation or the nature of the data involved in the transaction. The real meaning of "eventually consistent" is that NoSQL systems provide a tradeoff between consistency and availability according to their needs. Usually, they ensure a low consistency level since partitioning and availability are highly desirable. The article "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services" [Ref 7] proves the CAP theorem using an asynchronous network model and also discusses different solutions for partially synchronous systems.

There are many other reasons behind the growth of NoSQL. Most of them concern the avoidance of the unneeded complexity of relational databases and the high performance that can be reached by tailoring a NoSQL system to its original motivation.
More details can be found in the document "NoSQL databases" by Strauch [Ref 8].

2.3 Different NoSQL Database Types

There are many taxonomies to categorize NoSQL databases. We present in the following a non-exhaustive list of NoSQL techniques.

2.3.1 Key-Value stores databases

Key-value stores are the simplest NoSQL databases. A key-value data store is a schema-less storage design. The data is represented as key-value pairs kept in a persistent store. Every single item in the database is stored as an attribute name (or "key") together with its value, which represents the actual data. The data itself is usually either a basic data type (string, integer, array) or an object that has been assembled by a program. This flexibility in the representation of the data replaces the fixed data model of relational databases. Often, the length of keys is limited to a certain number of bytes, while there is less limitation on values (this depends on the implementation). Key-value stores usually favor high scalability over consistency. Most of them avoid rich ad-hoc querying and analytics features such as joins and aggregate operations. They provide high concurrency and other services like replication, versioning, insert-delete operations, locking, sorting, fast lookups and options for mass storage.

Examples:

Amazon's Dynamo [Ref 9]: Amazon is a large world-wide platform for e-commerce. It runs over tens of thousands of servers and network components located in many datacenters around the world. At this scale, the risk of failure of small or large components is significant. Reliability is one of Amazon's biggest challenges because even the slightest outage can have significant financial consequences and can impact its customers' trust. The way persistency is managed in case of failures determines the reliability and the scalability of the software systems. Dynamo is a highly available key-value storage system that Amazon uses for its storage needs. To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. Dynamo is a completely decentralized system with a minimal need for administration. It is incrementally scalable and allows service owners to scale up and down based on their current request loads. Thus, partitions can be added and removed without any manual redistribution of the data, and service owners can customize their storage system to meet their desired performance, durability and consistency. Dynamo has been used over the past years to demonstrate that decentralized techniques can be combined to provide a single highly available system. Its success shows that an "eventually consistent" storage system can be a building block for highly available applications. Dynamo is also built for latency-sensitive applications that require read and write operations to be performed in very little time. To do so, Dynamo avoids routing requests through multiple nodes. Amazon DynamoDB is the NoSQL database service offered by Amazon that implements Amazon's Dynamo model.

2.3.2 Document store databases

Document store databases store their data in the form of "documents". Documents inside a document-oriented database are similar to records in relational databases, but they are much more flexible. In relational databases, all rows in a table have the same data fields (the same columns for all data in the table), and the unused data fields are kept empty or set to null. In document stores, each document may have similar as well as dissimilar data. Documents in the database are identified by a unique key that points to the document. This key may be a simple string or a string that refers to a URI or a file path. Document stores are slightly more complex than key-value stores as they encapsulate key-value pairs in a document. As a result, document stores can be seen as stores of key-document pairs.
Document stores serve well when the domain model can be split and partitioned across documents. They should be avoided if the storage involves recording many relations between the units of data. The reason is that most document stores materialize a relation between documents by embedding one document in another. In terms of relational databases, this is equivalent to including in one cell all the rows involved in a join condition.

Examples:

MongoDB [Ref 19]: MongoDB is an open source document store written in C++. It provides a document query mechanism. MongoDB supports the automatic distribution of documents over servers. It also supports dynamic queries with automatic use of indexes, as in relational databases. Finally, MongoDB provides additional features like aggregation and ad-hoc queries (queries that are not predefined on the data store).

Apache CouchDB [Ref 20]: CouchDB is an Apache Software Foundation project written in Erlang. It stores data as JSON documents and uses JavaScript to define queries.

Several document stores use the JSON format to model data. Some of them, like MongoDB, use BSON (a binary-encoded serialization of JSON-like documents). However, the structure of the format is the same. The recursive key-value model is very close to the JSON notation, with a few differences. Here is an example of a JSON document:

{
  "string_key": "Albert Smith",
  "array_key": [123456, , 45454],
  "boolean_key": true,
  "document_key": {
    "color": "blue"
  }
}

2.3.3 Extensible records stores

An extensible record is a hybrid between a key-value tuple and a document. The basic data model is rows and columns. The data can be hierarchically represented and stored in columns and rows. The model is similar to relational tables, with more flexibility. Rows in an extensible record store table are allowed to have no value in a cell (which corresponds to an empty column), while in a relational table the value of such a cell must be set to "null". In terms of scalability, both rows and columns of the same table can be split over multiple nodes simultaneously. There are two ways to split data among nodes:

- The row-oriented approach is designed to efficiently return the data of an entire row in as few operations as possible. Rows are split across nodes through sharding on their primary keys. Extensible record stores typically split data by range, so that queries on these ranges of values do not have to go to several nodes. Rows are analogous to documents in document stores. Rows coming from the same table can have different numbers of columns, each uniquely identified and possibly of a different type. For read operations, this approach performs well when the row size is small or when many columns of single rows are read simultaneously. For write operations, the data of a row concerning many columns can be stored at the same time.
- The column-oriented approach distributes the columns of a table over multiple nodes by using "column groups". Column groups are simply a way to indicate which columns, when stored together, can accelerate query processing. For instance, aggregation over several rows concerning columns of the same group can be done efficiently.

Examples:

Google's Bigtable [Ref 2]: Google defined Bigtable as "a distributed storage system for managing structured data that is designed to scale to a large size". Bigtable seems to be the most complete extensible record store nowadays and is used by many of Google's applications.
It was developed in C++ and has been designed to scale across thousands of machines. Bigtable is not distributed outside Google. It is available as a part of Google App Engine.

Cassandra [Ref 21]: Cassandra was initially developed at Facebook, released as open source in 2008 and is now maintained by the Apache Software Foundation. It is written in Java. Based on both Amazon's Dynamo model and Google's Bigtable, it merges key-value store and column store concepts. Cassandra provides partitioning, replication, and automatic failure detection and recovery. Cassandra's model is eventually consistent and can be scaled up to 150 nodes.

SimpleDB by Amazon [Ref 22]: Created around 2007, it is a part of Amazon's proprietary cloud computing system. SimpleDB offers "Select, Delete, Get Attributes, and Put Attributes" operations on documents. SimpleDB is the simplest document store and does not allow nested documents (including or defining a document inside another one). It supports "eventual consistency" and asynchronous replication. The data is not automatically partitioned among nodes because updates are asynchronous (consistency is not the main priority). Up to now, there are some limitations concerning the size of nodes or domains: 10 GB maximum per domain, 100 active domains maximum, a 5-second limit on queries, etc. SimpleDB has a rich documentation provided by Amazon.
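The range-based row partitioning described earlier in this section for the row-oriented approach can be illustrated with a minimal Python sketch. The split points and node names are hypothetical values chosen for the illustration; they are not taken from Bigtable, Cassandra or any other system mentioned above.

```python
from bisect import bisect_right
from typing import List

# Hypothetical split points: keys below "h" live on node-0,
# keys from "h" up to (not including) "p" on node-1, the rest on node-2.
SPLIT_POINTS: List[str] = ["h", "p"]
NODES: List[str] = ["node-0", "node-1", "node-2"]


def node_for_key(row_key: str) -> str:
    """Return the node that owns the key range containing row_key."""
    return NODES[bisect_right(SPLIT_POINTS, row_key)]


def nodes_for_range(start_key: str, end_key: str) -> List[str]:
    """Return the nodes a scan over [start_key, end_key] has to contact."""
    first = bisect_right(SPLIT_POINTS, start_key)
    last = bisect_right(SPLIT_POINTS, end_key)
    return NODES[first:last + 1]


if __name__ == "__main__":
    print(node_for_key("alice"))
    print(nodes_for_range("a", "c"))
    print(nodes_for_range("f", "k"))
```

Because keys are kept in sorted ranges, a scan over a narrow range of row keys touches a single node, which is the benefit of splitting by range mentioned in the text.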

2.3.4 Graph stores

Graph stores store data in the form of a graph. The graph consists of nodes and edges, where nodes represent the objects and edges represent the relationships between these objects. The relationships can be considered as properties related to nodes. Graph stores use a technique called "index-free adjacency", meaning that every node holds a direct pointer to its adjacent nodes. In a graph database, the main emphasis is on the connections between data. Graph databases are schema-less.

Example:

Neo4J [Ref 23] [Ref 24]: Neo4J was developed by Neo Technology and was initially released in [...]. It was developed using Java. Neo4J is reliable, highly available and scalable. It uses Cypher as its query language. Neo4J is usually used in software involving complex relationships, such as social networking or recommendation engines. Many companies use Neo4J, such as Adobe, Accenture, Cisco, Lufthansa, Telenor and Mozilla.

There are many other NoSQL database models: multi-model, multi-dimensional or multi-value databases, object databases, XML databases, etc. All these categories are classified as NoSQL because of their non-relational properties. Nevertheless, the ones introduced above are the most frequently implemented in the industry. Each data store and its implementations address specific needs by providing a flexible data model. The main goals are high availability, high scalability and fault tolerance, which are favored over strong consistency.

2.4 A common data model

The rich diversity of NoSQL is mostly due to the lack of standards in this field. This also raises a real challenge for this thesis. As previously presented, there are many types of NoSQL databases. Each of them proposes a data model that is more flexible than the traditional relational table model. However, these data models have one point in common: they use the concept of "key-value pairs". Key-value stores and document stores are clearly based on this model.
The NoSQL databases that use the table model can also comply with the "key-value pair" concept. When iterating over the rows of a table, each column name represents a "key" and the data of the row in that column represents the value.

Example: the table below is decomposed into a list of "key-value" pairs.

Database table:

firstname   lastname
John        Doe
Al          Smith

Equivalent list of "key-value" pairs:

(firstname, John) (lastname, Doe)
(firstname, Al) (lastname, Smith)

The previous example covers both extensible record stores and relational databases. The model can also be applied to graph stores. The properties of the nodes in a graph database represent the "keys" of the pairs. The values of the pairs are those attributed to the properties. This common "key-value" model, which we named the "recursive key-value model", constitutes the first contribution of this thesis. The rest of our work relies on this common data model. The term "recursive" comes from the recursive aspect of some NoSQL data models, especially document stores: a document in a document store can be embedded in another one.

The recursive key-value model and the JSON format

The standard JSON format [Ref 27] is a particular formalization of the recursive key-value model. It uses the concept of "key-value" pairs and constitutes a perfect example. As a matter of fact, many NoSQL systems propose a JSON serialization of their data. The reason for taking a "recursive key-value model" as a reference instead of an existing format like JSON is simple: the NoSQL field is not standardized. Even if many NoSQL systems propose a JSON serialization, others do not, although the data model is the same. We use the recursive key-value model to be theoretically independent from the JSON format.

3. The web of data

3.1 The Resource Description Framework (RDF)

RDF (Resource Description Framework) is a W3C standard format for data representation. RDF is a flexible graph-based data model that allows a structured representation of knowledge. RDF is designed to be simple, generic and very expressive. It relies on web technologies (HTTP endpoints, Uniform Resource Identifiers for resource description) to facilitate the access to data on the web and its sharing among applications. By combining flexibility with easy access and sharing, RDF is a data format well suited to the web. The uniqueness of the URIs of RDF resources facilitates the data linking process. The elementary data representation in RDF is a "triple". A triple is an association of a subject, a predicate and an object.
The subject and the object are two resources, each identified by a unique URI, and connected to each other by the predicate, which is also identified by a URI (the object may also be a literal value).

Example: the statement "Computer X belongs to John Doe" may roughly produce the following triples:

subject             predicate      object
URI_of_Resource_1   is a           Computer
URI_of_Resource_1   has for name   X
URI_of_Resource_2   is a           Person
URI_of_Resource_2   has for name   John Doe
URI_of_Resource_1   belongs to     URI_of_Resource_2

RDF data sets can be queried with a standardized graph-pattern search language named SPARQL [Ref 17].
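To make the notion of triple and of graph-pattern search more concrete, here is a minimal Python sketch. It is only an illustration under simplifying assumptions: triples are plain tuples of strings, the identifiers are the placeholder names of the example above (only the naming and ownership triples are used), and the `match` and `owner_name` helpers are hypothetical functions that mimic a small pattern query. They are not SPARQL and not part of any RDF library.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Placeholder identifiers reused from the example; real RDF would use full URIs.
TRIPLES: List[Triple] = [
    ("URI_of_Resource_1", "is a", "Computer"),
    ("URI_of_Resource_1", "has for name", "X"),
    ("URI_of_Resource_2", "has for name", "John Doe"),
    ("URI_of_Resource_1", "belongs to", "URI_of_Resource_2"),
]


def match(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


def owner_name(triples: List[Triple], computer_name: str) -> Optional[str]:
    """Chain three patterns: name -> resource -> owner -> owner's name."""
    for computer, _, _ in match(triples, predicate="has for name", obj=computer_name):
        for _, _, owner in match(triples, subject=computer, predicate="belongs to"):
            for _, _, name in match(triples, subject=owner, predicate="has for name"):
                return name
    return None


if __name__ == "__main__":
    print(owner_name(TRIPLES, "X"))
```

Chaining patterns through shared resources is what a graph-pattern query expresses declaratively; the nested loops above only spell out that idea by hand.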

3.2 The RDB-to-RDF process

RDB-to-RDF mapping is the process of translating data from relational databases into the RDF format. The process relies on a mapping description and on its implementation to materialize RDF data sets. In detail, the mapping description consists of a set of rules that describe how each relational data element must be translated into an RDF resource. One distinguishes two types of mapping description:

- The direct mapping approach converts relational data into RDF in a straightforward manner. It is used for its simplicity. The direct mapping method comes up with an ad hoc class description that reflects the relational database schema.
- The domain semantics-driven mapping approach is used when the relational database must be translated using concepts and properties formally described in a well-organized and standardized set of rules. This approach can deal with complex mapping cases. The purpose is to make explicit the semantics that is frequently implicit in the RDB schema.

The second phase of the RDB-to-RDF process is the mapping implementation. Two methods can be used:

- The data materialization approach consists of a full transformation of the source database into an RDF representation, by applying the mapping rules to the whole content of the database. As a result, the RDF data is available at once. Data materialization facilitates further processing, analysis or reasoning over the RDF data, including the execution of heavy inference rules. The drawback is that it hardly supports very large data sets, as the size of the graph produced may exceed memory capacity. Another limitation concerns frequently updated data sets, which become hard to maintain due to the computation time of the process.
- The on-demand mapping approach is a dynamic, query-driven implementation that consists of the run-time evaluation of queries against the relational data. It implements the mapping dynamically in response to a query, usually expressed in SPARQL. In this model, the data remains located in the legacy database. The advantage of this approach is that the data retrieved is always the current version. However, a dynamic mapping implementation may reduce query performance, in particular when entailment rules are applied to the RDF repository to infer new knowledge.

Several standards and tools exist for the RDB-to-RDF process. Many of them have been tested and classified. The research report by F. Michel, J. Montagnat and C. Faron-Zucker [Ref 16] presents an extensive survey of the different RDB-to-RDF approaches and tools. The main purpose of this thesis is to propose a NoSQL-to-RDF approach inspired by existing RDB-to-RDF techniques.

3.3 The RDB to RDF Mapping Language (R2RML)

In 2012, the W3C published the R2RML recommendation [Ref 14], a standard language for describing RDB-to-RDF mappings. R2RML mappings are themselves RDF graphs written in the Turtle language. R2RML provides the ability to embed SQL snippets into the mapping definition, and allows the use of SQL functions to transform object values. Typically, an R2RML mapping consists of several triples maps; each triples map specifies how to map each row of a table of the input relational database into RDF triples.

Example:

RDB table Persons

ID   NAME
1    John Doe
2    Al Smith

R2RML mapping

@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <…> .

<#Human>
    rr:logicalTable [ rr:tableName "Persons" ];
    rr:subjectMap [ rr:template "…{ID}" ];
    rr:predicateObjectMap [
        rr:predicate ex:has-for-name;
        rr:objectMap [ rr:column "NAME" ]
    ] .

Comments:

- The "@prefix" directive defines prefixes that are expanded in the rest of the code. For instance, "rr:logicalTable" is equivalent to "http://www.w3.org/ns/r2rml#logicalTable", which is the real property name. Prefixes make the code easier to read.
- "<#Human>" is the name of the triples map, and the code below it is its description.
- The property "rr:logicalTable" names the relational table associated with the triples map.
- The property "rr:subjectMap" defines the construction pattern of the subject. It can use a template (property "rr:template"), indicate a column name (property "rr:column"), or use a constant value (property "rr:constant"). In our example, a template is used to construct the subject from the values retrieved from the column "ID".
- The property "rr:predicateObjectMap" defines the predicate (property "rr:predicate") and the object (property "rr:objectMap") of the triple.

The table below presents the RDF triples created by an implementation of the previous R2RML mapping.

subject   predicate   object
<…>       <…>         "John Doe"
<…>       <…>         "Al Smith"

R2RML is the standard mapping language, but not all RDB-to-RDF tools implement it. For mapping NoSQL data to RDF, R2RML constitutes a natural starting point. We base our work on the R2RML standard and extend it to NoSQL.
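The way a triples map is applied to each row of a table can be sketched in a few lines of Python. This is a simplified illustration and not an R2RML processor: the mapping is held in a plain dictionary, the namespace `http://example.org/` and the template `person/{ID}` are assumptions made for the sketch (the URIs of the example above are not available), and `apply_triples_map` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Assumed namespace, for illustration only.
BASE = "http://example.org/"

# Dictionary counterpart of the <#Human> triples map.
TRIPLES_MAP: Dict[str, str] = {
    "table": "Persons",                           # rr:tableName
    "subject_template": BASE + "person/{ID}",     # rr:template (assumed value)
    "predicate": BASE + "has-for-name",           # rr:predicate
    "object_column": "NAME",                      # rr:column
}

# Rows of the Persons table from the example.
PERSONS: List[Dict[str, object]] = [
    {"ID": 1, "NAME": "John Doe"},
    {"ID": 2, "NAME": "Al Smith"},
]


def apply_triples_map(tmap: Dict[str, str],
                      rows: List[Dict[str, object]]) -> List[Triple]:
    """Produce one (subject, predicate, object) triple per row."""
    triples: List[Triple] = []
    for row in rows:
        # {ID} is replaced by the value of column ID; a real processor
        # would also percent-encode the value, which is skipped here.
        subject = tmap["subject_template"].format(**row)
        obj = str(row[tmap["object_column"]])
        triples.append((subject, tmap["predicate"], obj))
    return triples


if __name__ == "__main__":
    for triple in apply_triples_map(TRIPLES_MAP, PERSONS):
        print(triple)
```

The sketch corresponds to the data materialization approach of section 3.2: the whole table is scanned and every row yields its triples at once.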

3.4 RDF stores and NoSQL

Since RDF databases cannot be considered relational databases, they can be called NoSQL. Nevertheless, RDF stores differ from other NoSQL systems in several aspects. RDF stores have been designed to store RDF (a standardized directed labeled graph), whereas NoSQL databases can store different types of data like documents or graphs. The principal advantage of RDF database systems is that the RDF format is standardized and comes with a powerful query language: SPARQL [Ref 10]. As RDF relies heavily on web standards, RDF stores offer better data portability and interoperability than the NoSQL implementations available at present. RDF stores have other benefits, such as:

- A simple and uniform standard data model. NoSQL databases typically have ad hoc data models and capabilities designed specifically for a particular case. Usually, NoSQL data models are neither interoperable with each other nor standardized.
- A powerful standard query language. NoSQL databases typically do not provide a unique, standardized, high-level declarative query language equivalent to SQL. Querying these databases is data-model-specific, language-specific and even application-specific. When query languages do exist, they are entirely specific to the NoSQL system (Cypher for Neo4J, CQL for Cassandra, etc.). SPARQL provides RDF databases with an interoperable query language.
- Standardized data interchange formats. Relational databases have SQL dumps, and some NoSQL databases have import/export capabilities from/to implementation-specific structures expressed in an XML or JSON format. RDF databases, by contrast, all have import/export capabilities based on well-defined, standardized formats such as N-Triples.

4. Extending R2RML to NoSQL

This section presents the principal achievement of this thesis. Translating data stored in NoSQL databases into RDF via a mapping language is not a simple task.
The identification of a common model in NoSQL data schema (section 2.4) constitutes the backbone of the mapping language developed in the following. The language is named xr2rml as the extension of R2RML to the wider scope of NoSQL databases. Basically, xr2rml provides the necessary properties to explore any data format compliant to the recursive key value model introduced in section 2.4. xr2rml also enlarges the set of different RDF terms that can be generated such as RDF Containers or RDF Collections missing in R2RML. R2RML is seen as a subset of xr2rml, so that both can be backward compatible. Consequently, any R2RML mapping graph is a valid xr2rml mapping graph. Language description Since relational databases are modeled as tables, the R2RML design is table-oriented. The first version of xr2rml focuses on extending R2RML to NoSQL database systems with a similar architecture, namely Extensible Record Stores, RDF stores, CSV sources file and also Relational databases. In the following, we use the prefix "rrx" for xr2rml properties to differentiate them from the existing R2RML properties. 10

4.1 Defining a logical source

An R2RML triples map describes a logical table as a data set on which the triples map applies: this may be a relational database table or SQL view, or the result of any valid SQL query. Relational databases have clearly identified commonalities (row-based data model, ACID properties, ANSI SQL compatibility). As xr2rml targets NoSQL, the language must be flexible enough to cope with various query languages and protocols, in order to apply to a significant subset of non-relational databases. xr2rml proposes several extensions to describe an input database:

- A logical source (using the property rrx:logicalsource) extends the R2RML concept of logical table (the R2RML property rr:logicaltable) to non-relational databases. A logical source is the result of a query posed to the input database, to be mapped to RDF triples. A logical source is either an R2RML base table or view, or an xr2rml view.
- An R2RML base table or view is a logical source containing data from a base table or view in the input database. A base table or view is represented by a resource that has exactly one rrx:sourcename property. Property rrx:sourcename extends R2RML property rr:tablename for non-relational sources. It may be used to name a table in the context of tabular systems where tables make sense, such as an extensible column store.
- An xr2rml view is a logical source whose content is the result of executing a query against the input database. It is represented by a resource that has exactly one rrx:query property. The value of property rrx:query is a literal representing a valid expression with regard to the query language supported by the input database.
- Optional property rrx:format specifies the format of the data retrieved from the logical source. Currently, possible format values are: rrx:row, rrx:json, and rrx:xml. If a logical source has no rrx:format property, its format defaults to rrx:row, to ensure compatibility with R2RML.

Remarks:

a) Format rrx:row applies to any database returning data as sets of rows, each row being a set of columns: relational database, CSV file, extensible column store, SPARQL result sets retrieved from a SPARQL endpoint (which can be seen as a table in which columns are named after the variables returned).

b) No property specifies the query language used to express queries. Defining a set of query languages within xr2rml would be restrictive with regard to NoSQL systems, in which new query languages may come up frequently. Therefore, an xr2rml processor should allow the use of various kinds of connections and query languages in a flexible manner, namely: a file read from the local file system or available on a web server; a JDBC connection to a relational database; a web service using REST, SOAP or simply HTTP GET parameters; or a SPARQL endpoint.

xr2rml logical source and R2RML logical table definitions may equally be used to describe a relational database. Example:

R2RML logical table:
<TriplesMap>
    rr:logicaltable [ rr:tablename "SOME_TABLE" ];

xr2rml logical source:
<TriplesMap>
    rrx:logicalsource [ rrx:sourcename "SOME_TABLE"; rrx:format rrx:row ];
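The rules stated above for a logical source (exactly one of rrx:sourcename or rrx:query, and a format that defaults to rrx:row) can be illustrated with a small validation routine. This is a minimal sketch under stated assumptions: representing a logical source as a Python dictionary keyed by property name, and the function name resolve_logical_source, are choices made for this example only and are not part of xr2rml.

```python
# Format values listed in this section for a logical source.
ALLOWED_FORMATS = {"rrx:row", "rrx:json", "rrx:xml"}


def resolve_logical_source(props):
    """Check a logical source description and apply the default format.

    `props` is a hypothetical dict representation of the mapping resource,
    e.g. {"rrx:query": "SELECT ...", "rrx:format": "rrx:row"}.
    """
    has_name = "rrx:sourcename" in props
    has_query = "rrx:query" in props
    if has_name == has_query:
        raise ValueError(
            "a logical source needs exactly one of rrx:sourcename or rrx:query")

    # No rrx:format property: default to rrx:row for R2RML compatibility.
    fmt = props.get("rrx:format", "rrx:row")
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")

    return {
        "kind": "base table or view" if has_name else "xr2rml view",
        "source": props["rrx:sourcename"] if has_name else props["rrx:query"],
        "format": fmt,
    }


table = resolve_logical_source({"rrx:sourcename": "SOME_TABLE"})
view = resolve_logical_source({
    "rrx:query": "SELECT NAME, DATE FROM MOVIES LIMIT 10",
    "rrx:format": "rrx:row",
})
print(table["kind"], table["format"])
print(view["kind"], view["format"])
```

Note that the routine deliberately knows nothing about the query language: as remark b) explains, the query string is passed through untouched to whatever connection the processor uses.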

In both definitions, the triples map continues with the same subject map:

    rr:subjectmap [ rr:column "column_name" ].

The table below shows various examples of xr2rml logical source definitions with different input databases.

Type of logical source: relational database
Logical source definition:
    rrx:logicalsource [
        rrx:query """SELECT NAME, DATE FROM MOVIES ORDER BY DATE LIMIT 10""";
        rrx:format rrx:row;
    ]

Type of logical source: Cassandra (extensible column store) using the Cassandra Query Language (CQL)
Logical source definition:
    rrx:logicalsource [
        rrx:query """SELECT NAME, DATE FROM MOVIES LIMIT 10""";
        rrx:format rrx:row;
    ]

Type of logical source: AllegroGraph (RDF graph store) using SPARQL. The rrx:row format is applied to a SPARQL result set: the result set can be seen as a table in which columns are variable names.
Logical source definition:
    rrx:logicalsource [
        rrx:query """select ?name ?date where { ?movie a ex:movie; ex:name ?name; ex:date ?date. } order by ?date limit 10""";
        rrx:format rrx:row;
    ]

By extending the definition of data sources, xr2rml can be applied to a wide range of NoSQL databases storing different data items, possibly structured as nested key-value lists. The following sections focus on the mechanism for parsing the retrieved data.

4.2 Creating RDF terms from structured values

R2RML defines a term map as a function that generates RDF terms from a logical table row. A term map is either a subject map, predicate map or object map. A term map must be exactly one of the following:

- a constant-valued term map (defined by the property rr:constant)
- a column-valued term map (defined by the property rr:column)
- a template-valued term map (defined by the property rr:template).

R2RML treats all values from the input database as literals expressed in native data types (string, number, boolean etc.). To deal with structured values such as lists of elements or key-value maps used in databases relying on JSON, XML or an object-oriented model,

extensions are needed. xr2rml term maps extend R2RML term maps so that structured values can be parsed, and data elements within structured values can be selected to build RDF terms. This is achieved by the xr2rml term map properties rrx:format, rrx:parsetype and rrx:parse, described in this section.

4.2.1 Referencing data elements with data formats

An R2RML mapping graph uses properties rr:column and rr:template to reference columns of a relational database. Whereas the rr:template property name is generic, the rr:column property name explicitly refers to the relational column concept.

- A column-valued term map has exactly one rr:column property. The value of the rr:column property is a valid column name.
- A template-valued term map has exactly one rr:template property. The value of the rr:template property is a valid string template. A string template is a format string used to build strings from multiple components. It references column names by enclosing them in curly braces ("{" and "}").

The syntax of data retrieved from a logical source is specified using the rrx:format property of a logical source. In some use cases, it is common to store values in a format that is not the native format of the database. For instance, an application designer can choose to embed JSON or XML values in the cells of a relational database, for performance reasons or application design constraints. In the NoSQL world, it is common to store JSON documents in rows of extensible column stores, thus mixing the tabular and JSON formats. To reference data elements within such mixed contents, xr2rml allows a term map to indicate the format of a data element with the property rrx:format. Existing values for the property rrx:format are:

- rrx:row (for tables),
- rrx:json (for data in JSON format),
- rrx:xml (for data in XML format),
- rrx:csv (for data in CSV format).

If a term map specifies no format property, its format is the format of the logical source.

Contribution: The table format represented by the value "rrx:row" corresponds to the default behavior of R2RML. Extending the formats allows the treatment of a wider range of data. Even XML or JSON data stored in relational tables can be parsed. With R2RML, these data would be seen as a simple character string.

4.2.2 The parse type property rrx:parsetype

As explained previously, the flexible data representation in NoSQL databases creates the need to reference the data format. This need naturally comes with the possibility to parse the data according to its format. The role of the property rrx:parsetype is to indicate the general structure of the data element, consistent with its format.

An xr2rml constant-valued, column-valued or template-valued term map has a parse type defined with the optional rrx:parsetype property. A parse type must be one of two values:

- rr:literal: values read from the input database are interpreted as literals; in this case the standard behavior of R2RML applies.
- rrx:listormap: values read from the input database are structured values representing either lists of values or key-value maps, written according to the syntax defined by the rrx:format property.

If a term map has no rrx:parsetype property, its parse type defaults to rr:literal.

A term map with parse type rr:literal may have any R2RML term type (rr:literal, rr:blanknode or rr:iri); it must not have an RDF collection or container term type (see section 4.2.3).
Formally: ?X rrx:parsetype rr:literal. ?X rr:termtype ?tt. => ?tt is one of rr:literal, rr:blanknode or rr:iri

Parse type rrx:listormap instructs that the list or key-value map must be parsed according to the logical source data format. The purpose of this parsing is to build RDF terms from the elements of the list or key-value map.

A term map with parse type rrx:listormap may have either no rr:termtype property, or a rr:termtype property with an RDF collection or container term type. It must not have a rr:termtype property with an R2RML term type (rr:literal, rr:blanknode or rr:iri).
Formally: ?X a rr:termmap. ?X rrx:parsetype rrx:listormap. ?X rr:termtype ?tt. => ?tt is one of rrx:rdflist, rrx:rdfseq, rrx:rdfbag or rrx:rdfalt (see section 4.2.3).

A term map with parse type rrx:listormap must not have a rr:language or rr:datatype property.

4.2.3 Term types

If the term map has the optional rr:termtype property, then its term type is the value of that property. The value must be one of the following options:

- If the term map is a subject map: rr:iri or rr:blanknode.
- If the term map is a predicate map: rr:iri.
- If the term map is an object map: rr:iri, rr:blanknode, rr:literal, rrx:rdflist, rrx:rdfseq, rrx:rdfbag or rrx:rdfalt.
- If the term map is a graph map: rr:iri.

If the term map does not have a rr:termtype property, then its term type is:

- rr:literal, if it is an object map and at least one of the following conditions is true:
    o It is a column-valued term map and its parse type is rr:literal.
    o It is a column-valued term map, its parse type is rrx:listormap, and it does not have a rrx:parse property.
    o It has a rr:language property (and thus a specified language tag).
    o It has a rr:datatype property (and thus a specified datatype).
- rr:iri, otherwise.
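The default term type rules above can be written as a small decision function. This is a sketch for illustration only: the function name and its parameters (position, kind, has_parse and so on) are hypothetical ways of describing a term map and do not correspond to any xr2rml property or to the MORPH code base.

```python
def term_type(position, kind, parse_type="rr:literal", has_parse=False,
              language=None, datatype=None, explicit=None):
    """Return the term type of a term map, following the rules of this section.

    position: "subject", "predicate", "object" or "graph"
    kind:     "constant", "column" or "template"
    explicit: value of rr:termtype if the property is present, else None
    """
    # An explicit rr:termtype property always wins.
    if explicit is not None:
        return explicit

    if position == "object":
        column_valued = kind == "column"
        if (
            (column_valued and parse_type == "rr:literal")
            or (column_valued and parse_type == "rrx:listormap" and not has_parse)
            or language is not None
            or datatype is not None
        ):
            return "rr:literal"

    # Every other case defaults to an IRI.
    return "rr:iri"


print(term_type("object", "column"))                    # column-valued object map
print(term_type("subject", "template"))                 # subject maps default to IRIs
print(term_type("object", "template", language="en"))   # a language tag implies a literal
```

Only the default is computed here; checking that an explicit rr:termtype value is allowed for the given kind of term map would be a separate validation step.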

RDF collection or container term types

The RDF terms generated by a term map have a term type (rr:termtype) that may be one of the three R2RML term types: rr:literal, rr:blanknode or rr:iri. xr2rml extends the rr:termtype property with four new values, hereafter referred to as RDF collection or container term types, or xr2rml term types:

- rrx:rdflist: generate an RDF collection of class rdf:list
- rrx:rdfseq: generate an RDF container of class rdf:seq
- rrx:rdfbag: generate an RDF container of class rdf:bag
- rrx:rdfalt: generate an RDF container of class rdf:alt

Contribution: The RDF term classes rdf:list, rdf:bag, rdf:seq and rdf:alt are formally defined in the RDF specification. One limitation of R2RML is that there are no properties or mechanisms to describe the construction of these four RDF term classes using data stored in relational databases. One of the contributions of our R2RML extension is to allow the construction of rdf:list, rdf:bag, rdf:seq and rdf:alt with data from either relational or NoSQL databases.

4.2.4 Production of RDF terms with parse type rr:literal

The behavior of a term map with parse type rr:literal is as described in R2RML. Note that if a term map references a value that is not a simple literal (with regard to the logical source format) and the parse type is rr:literal, then the generated RDF term is the serialization of that non-literal value, considered as a literal.

Example:

Input data (column "Data"):
    { "person": { "FirstName":"John", "LastName":"Smith" } }

Term map (assuming that the logical source description does not mention the JSON format):
    rr:objectmap [
        rr:column "data";
        rrx:parsetype rr:literal;   # optional, this is the default value
    ]

Generated RDF term: the structured value returned as a string literal:
    '{ "FirstName":"John", "LastName":"Smith" }'

4.2.5 Production of RDF terms with parse type rrx:listormap

A term map with parse type rrx:listormap behaves differently depending on its term type:

- with no rr:termtype property, and an R2RML term type (rr:literal, rr:blanknode or rr:iri) inherited from a rrx:parse property (see section 4.2.6), it may produce multiple RDF terms during each iteration;
- with an RDF collection or container term type (rrx:rdflist, rrx:rdfseq, rrx:rdfbag or rrx:rdfalt), it may produce zero or one RDF collection or container during each iteration.

Both cases are described in detail in the rest of this section.

Term map with parse type rrx:listormap and no rr:termtype property

In the R2RML iteration model, a term map generates at most one RDF term during each iteration, and consequently a triples map generates at most one triple during each iteration. With the rrx:listormap parse type, however, a term map generates one RDF term for each element of the list or key-value map processed during each iteration, that is, possibly several RDF terms per iteration. Consequently, a triples map may generate several triples during a single iteration.

In the example below, the subject map generates one RDF term during a single iteration, while the object map generates two RDF terms during the same iteration: literals "Laptop" and "Desktop", thus resulting in the production of two triples:

Input data: JSON document retrieved in a single iteration

Table
ID    data
Dell  ["Laptop", "Desktop"]

Mapping graph:
<#TripleMap>
    rr:logicaltable [ rr:tablename "Table"; rrx:format rrx:row; ];
    rr:subjectmap [ rr:template "
    rr:predicateobjectmap [
        rr:predicate ex:produces;
        rr:objectmap [ rr:column "data"; rrx:format rrx:json; rrx:parsetype rrx:listormap; ]
    ].

Generated triples:
< ex:produces "Laptop".
< ex:produces "Desktop".

Note: If one or several term maps of a triples map produce several RDF terms during a single iteration, then triples are produced as the cartesian product of all RDF terms produced by all term maps of the triples map (subject, predicate, object).
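The cartesian product described in the note can be sketched as follows. This is an illustration under stated assumptions: the row content, the subject template and the helper name parse_cell are hypothetical, and both cells are assumed to hold JSON so that only the standard library is needed.

```python
import json
from itertools import product


def parse_cell(cell, parse_type="rr:literal"):
    """Return the list of values a term map derives from one cell."""
    if parse_type != "rrx:listormap":
        return [cell]                      # R2RML behavior: one value per cell
    value = json.loads(cell)
    if isinstance(value, dict):
        return list(value.values())        # key-value map: keep the values
    if isinstance(value, list):
        return value
    return [value]                         # scalar JSON value


# Hypothetical row in which both cells hold a JSON list.
row = {"companies": '["Dell", "Asus"]', "products": '["Laptop", "Desktop"]'}

# Hypothetical subject template (assumption, not taken from the source).
SUBJECT_TEMPLATE = "http://example.org/company/{}"

subjects = [SUBJECT_TEMPLATE.format(c)
            for c in parse_cell(row["companies"], "rrx:listormap")]
objects = parse_cell(row["products"], "rrx:listormap")

# One triple for every (subject, object) combination of this iteration.
triples = [(s, "ex:produces", o) for s, o in product(subjects, objects)]
for t in triples:
    print(t)
```

With two subjects and two objects, one iteration yields four triples; with parse type rr:literal on both sides the same code degenerates to the single triple of plain R2RML.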

In the example below, during the iteration the subject map produces two RDF terms < and < while the object map produces two literals "Laptop" and "Desktop". A cartesian product between the two subjects and the two objects results in the production of four triples:

Input data: RDB table with columns formatted in JSON and XML

Table
companies: [ "Dell", "Asus" ]
products:  <product>laptop</product> <product>desktop</product>

Mapping graph:
<#TripleMap>
    rr:logicaltable [ rr:tablename "table" ];
    rr:subjectmap [ rr:template "
        rrx:format rrx:json; rrx:parsetype rrx:listormap; ];
    rr:predicateobjectmap [
        rr:predicate ex:produces;
        rr:objectmap [ rr:column "products"; rrx:format rrx:xml; rrx:parsetype rrx:listormap; ]
    ].

Generated triples:
< ex:produces "Laptop".
< ex:produces "Desktop".
< ex:produces "Laptop".
< ex:produces "Desktop".

Term map with parse type rrx:listormap and term type RDF collection or RDF container

A term map with parse type rrx:listormap and an RDF collection or container term type generates one RDF term during each iteration, representing the whole list or key-value map. This complex RDF term consists of several triples; typically a blank node is the root of the collection or container.

In the example below, the triples map generates one triple per iteration; the object of this triple is an RDF bag consisting of several triples:

Input data:

Table
companies  products
Dell       <product>laptop</product> <product>desktop</product>

Mapping graph:
<#TripleMap>
    rr:logicaltable [ rr:tablename "table" ];
    rr:subjectmap [ rr:template "
    rr:predicateobjectmap [
        rr:predicate ex:builds;
        rr:objectmap [ rr:column "products"; rrx:format rrx:xml; rrx:parsetype rrx:listormap; rr:termtype rrx:rdfbag; ]
    ].

Generated triples:
< ex:builds [ a rdf:bag; rdf:_1 "Laptop"; rdf:_2 "Desktop" ].

Constant-valued term maps

In a constant-valued term map, the constant value must be a valid expression with regard to the logical source data format. Example:

Term map (first example):
    rrx:logicalsource [...
    rr:objectmap [ rr:constant '["ABC", "DEF"]'; rrx:format rrx:json; rrx:parsetype rrx:listormap; ]
Generated RDF terms:
    "ABC" "DEF"

Term map (second example):
    rrx:logicalsource [...
    rr:objectmap [ rr:constant '["ABC", "DEF"]'; rrx:format rrx:json; rrx:parsetype rrx:listormap; rr:termtype rrx:rdfseq; ]
Generated RDF terms:
    [ a rdf:seq; rdf:_1 "ABC"; rdf:_2 "DEF" ]
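To make explicit what "one RDF term consisting of several triples" means, the sketch below expands a list of values into the triples of an RDF container or an RDF collection. It relies on the standard RDF vocabulary (rdf:Seq, rdf:Bag, rdf:Alt, rdf:_1, rdf:first, rdf:rest, rdf:nil). The function names, the blank node labels and the tuple representation of triples are assumptions made for this illustration.

```python
from itertools import count

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CONTAINER_CLASSES = {"rrx:rdfseq": "Seq", "rrx:rdfbag": "Bag", "rrx:rdfalt": "Alt"}

_bnode_ids = count(1)


def new_bnode():
    """Return a fresh blank node label (labels are local to this sketch)."""
    return f"_:b{next(_bnode_ids)}"


def build_container(items, term_type):
    """Expand items into an rdf:Seq, rdf:Bag or rdf:Alt rooted at a blank node."""
    root = new_bnode()
    triples = [(root, RDF + "type", RDF + CONTAINER_CLASSES[term_type])]
    for index, item in enumerate(items, start=1):
        triples.append((root, f"{RDF}_{index}", item))   # rdf:_1, rdf:_2, ...
    return root, triples


def build_list(items):
    """Expand items into an rdf:List chain (rdf:first / rdf:rest / rdf:nil)."""
    if not items:
        return RDF + "nil", []
    nodes = [new_bnode() for _ in items]
    triples = []
    for position, (node, item) in enumerate(zip(nodes, items)):
        is_last = position + 1 == len(nodes)
        rest = RDF + "nil" if is_last else nodes[position + 1]
        triples.append((node, RDF + "first", item))
        triples.append((node, RDF + "rest", rest))
    return nodes[0], triples


seq_root, seq_triples = build_container(["ABC", "DEF"], "rrx:rdfseq")
list_root, list_triples = build_list(["ABC", "DEF"])
for triple in seq_triples + list_triples:
    print(triple)
```

The root returned by either function is what the object map would place in the object position of the generated triple.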

4.2.6 Parsing nested structures and typing their elements

In the xr2rml language presented so far, two concerns are not addressed:

- In a term map with parse type rrx:listormap, structured values can be parsed in order to translate each element into RDF terms, possibly assembled in RDF collections or containers. However, it may be necessary to explicitly type elements as literals, blank nodes or IRIs, or to assign them a language tag (rr:language) or data type (rr:datatype).
- Structured data written in JSON or XML format commonly have more than one level of nesting, resulting in potentially deep tree-like values. The purpose of the property rrx:parse is to make it possible to parse these data in order to nest RDF collections and containers.

To address those concerns, a term map with parse type rrx:listormap may have a rrx:parse property. Whereas the rrx:listormap parse type along with an RDF collection/container term type describes how to parse a list or key-value map and possibly translate it into an RDF collection or container, the rrx:parse property of a term map describes how to translate each element of a list or key-value map into RDF terms.

The range of the rrx:parse property is the rrx:parse class. An instance of the rrx:parse class may have the properties below:

- rrx:parsetype bears the same semantics as in the context of a term map;
- rr:termtype bears the same semantics as in the context of a term map;
- rrx:parse is used to recursively parse any depth of nested structured values. Its range is the rrx:parse class;
- rr:language bears the same semantics as defined in R2RML;
- rr:datatype bears the same semantics as defined in R2RML.

A term map may have a rrx:parse property only if its parse type is rrx:listormap.
Formally: ?t rrx:parse ?p => ?t rrx:parsetype rrx:listormap

In a column-valued or a constant-valued term map, the rrx:parse property describes how to translate elements of a structured value referenced in the rr:column property, or retrieved from the rr:constant property, into RDF terms. In a template-valued term map, the rrx:parse property describes how to translate values produced by the template string into RDF terms.

If a term map or an instance of class rrx:parse has a rrx:listormap parse type and no rrx:parse property, it is assumed to have a default rrx:parse property defined as follows:

- If the term map is a column-valued, reference-valued or constant-valued term map:
    rrx:parse [ rrx:parsetype rr:literal; rr:termtype rr:literal ]
- If the term map is a template-valued term map:
    rrx:parse [ rrx:parsetype rr:literal; rr:termtype rr:iri ]

A term map with parse type rrx:listormap may generate either multiple RDF terms (no rr:termtype property), or one RDF term of type RDF collection or container (when it has a rr:termtype property). Conversely, an instance of class rrx:parse with parse type rrx:listormap must generate one RDF term of type RDF collection or container, but cannot generate multiple RDF terms:

An instance of class rrx:parse with parse type rrx:listormap must have a rr:termtype property with an RDF collection or container term type.
Formally: ?X a rrx:parse. ?X rrx:parsetype rrx:listormap. ?X rr:termtype ?tt. => ?tt is one of rrx:rdflist, rrx:rdfseq, rrx:rdfbag or rrx:rdfalt

Finally, properties rr:language and rr:datatype apply when generating literals only, therefore they do not apply in the case of a rrx:listormap parse type.

A term map, or an instance of the rrx:parse class, may have a rr:language or rr:datatype property only if its parse type is rr:literal (either stated by property rrx:parsetype or inferred as a default value).

Using rrx:parse to type RDF terms generated from a list or key-value map

The rrx:parse property provides the ability to specify the term type, and optionally the language tag or data type, of RDF terms produced from the elements of a list or key-value map.

The example below illustrates the usage of property rrx:parse to generate an RDF list of elements typed as IRIs (first example), or multiple typed RDF literals (second example):

First example

Input data (column "data"):
    ["url1", "url2"]
Term map:
    rr:objectmap [
        rr:column "data";
        rrx:parsetype rrx:listormap;
        rrx:format rrx:json;
        rrx:parse [ rrx:parsetype rr:literal; rr:termtype rr:iri; ];
        rr:termtype rrx:rdflist;
    ]
Generated RDF terms, in Turtle abbreviated notation:
    (<url1> <url2>)

Second example

Input data (column "data"):
    [10, 20]
Term map:
    rr:objectmap [
        rr:column "data";
        rrx:format rrx:json;
        rrx:parsetype rrx:listormap;
        rrx:parse [ rrx:parsetype rr:literal; rr:termtype rr:literal; rr:datatype xsd:integer; ]
    ]
Generated RDF terms:
    "10"^^xsd:integer "20"^^xsd:integer
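The recursive nature of rrx:parse can be sketched with a short recursive function. This is an illustration only, not the MORPH-based implementation: the mapping is represented as nested Python dictionaries with hypothetical keys (parsetype, termtype, parse, language, datatype), generated terms are rendered as Turtle-like strings, and a collection or container is represented as a (term type, members) tuple.

```python
import json

# Default applied when a rrx:listormap level has no rrx:parse property
# (column-valued or constant-valued case described in section 4.2.6).
DEFAULT_PARSE = {"parsetype": "rr:literal", "termtype": "rr:literal"}


def make_term(value, spec):
    """Render one element as an IRI, blank node or (typed/tagged) literal."""
    term_type = spec.get("termtype", "rr:literal")
    if term_type == "rr:iri":
        return f"<{value}>"
    if term_type == "rr:blanknode":
        return f"_:{value}"
    lexical = f'"{value}"'
    if "language" in spec:
        return f"{lexical}@{spec['language']}"
    if "datatype" in spec:
        return f"{lexical}^^{spec['datatype']}"
    return lexical


def translate(value, spec):
    """Translate a parsed value according to a (possibly nested) parse spec."""
    if spec.get("parsetype", "rr:literal") == "rr:literal":
        return make_term(value, spec)
    # rrx:listormap: iterate over list elements, or over the values of a map.
    elements = list(value.values()) if isinstance(value, dict) else list(value)
    inner = spec.get("parse", DEFAULT_PARSE)
    members = [translate(element, inner) for element in elements]
    term_type = spec.get("termtype")
    # No term type: several separate terms; otherwise one collection/container.
    return members if term_type is None else (term_type, members)


iri_list = translate(json.loads('["url1", "url2"]'), {
    "parsetype": "rrx:listormap",
    "termtype": "rrx:rdflist",
    "parse": {"parsetype": "rr:literal", "termtype": "rr:iri"},
})
typed_literals = translate(json.loads("[10, 20]"), {
    "parsetype": "rrx:listormap",
    "parse": {"parsetype": "rr:literal", "termtype": "rr:literal",
              "datatype": "xsd:integer"},
})
print(iri_list)
print(typed_literals)
```

Because translate calls itself on every element with the inner spec, adding one more nested "parse" entry is enough to handle one more level of nesting in the data.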

Using rrx:parse to parse nested lists or key-value maps

The example below illustrates the usage of property rrx:parse to, first, parse nested structured values (the column "data" contains an XML list whose "team" elements are lists) and then translate them into two RDF terms of type RDF list.

Input data (column "data"):
    <team> <member>john</member> <member>paul</member> </team>
    <team> <member>cathy</member> <member>ed</member> </team>

Term map:
    rr:objectmap [
        rr:column "data";
        rrx:parsetype rrx:listormap;
        rrx:format rrx:xml;
        rrx:parse [ rrx:parsetype rrx:listormap; rr:termtype rrx:rdflist; ]
    ]

Generated RDF terms:
    ("John" "Paul") ("Cathy" "Ed")

Using the same input data, the two examples below generate one RDF term consisting of an RDF sequence with two nested RDF lists:

- In the first example the elements of the inner RDF lists are not typed explicitly, thus their term type defaults to rr:literal.
- In the second example, the elements of the inner RDF lists are assigned an explicit language tag using an additional nested rrx:parse property.

Input data (column "data", both examples):
    { "team1": ["John", "Paul"], "team2": ["Cathy", "Ed"] }

Term map (both examples start identically):
    rr:objectmap [
        rr:column "data";
        rrx:format rrx:json;
        rrx:parsetype rrx:listormap;

First example (term map, continued):
        rr:termtype rrx:rdfseq;
        rrx:parse [ rrx:parsetype rrx:listormap; rr:termtype rrx:rdflist; ]
    ]
Generated RDF terms:
    [ a rdf:seq; rdf:_1 ("John" "Paul"); rdf:_2 ("Cathy" "Ed"); ]

Second example (term map, continued):
        rr:termtype rrx:rdfseq;
        rrx:parse [
            rrx:parsetype rrx:listormap;
            rr:termtype rrx:rdflist;
            rrx:parse [ rrx:parsetype rr:literal; rr:language "en"; ]
        ]
    ]
Generated RDF terms:
    [ a rdf:seq; rdf:_1 ("John"@en "Paul"@en); rdf:_2 ("Cathy"@en "Ed"@en); ]

Using parse type rrx:listormap with object maps, subject maps, predicate maps

Unlike RDF terms of type IRI or blank node, RDF terms of type RDF collection or container cannot be used as the subject or the predicate of an RDF triple, nor as a graph IRI. Consequently:

A term map with parse type rrx:listormap and term type rrx:rdflist, rrx:rdfseq, rrx:rdfbag or rrx:rdfalt is an object map (hence it cannot be a subject map or predicate map).
Formally: ?X a rr:termmap. ?X rrx:parsetype rrx:listormap. ?X rr:termtype ?tt. ?tt is one of rrx:rdflist, rrx:rdfseq, rrx:rdfbag or rrx:rdfalt => ?X a rr:objectmap.

The rrx:parse property may be used in a subject map only if it produces IRIs or blank nodes, and in a predicate map only if it produces IRIs. Consequently:

A term map with parse type rrx:listormap may be a subject map only if (i) it does not have a rr:termtype property and (ii) the object of its rrx:parse property has a rr:termtype rr:iri or rr:blanknode.
A term map with parse type rrx:listormap may be a predicate map only if (i) it does not have a rr:termtype property and (ii) the object of its rrx:parse property has a rr:termtype rr:iri.
Formally: ?X is a rr:subjectmap or rr:graphmap. ?X rrx:parsetype rrx:listormap. => ?X has no rr:termtype property. ?X rrx:parse ?p. ?p rrx:parsetype rr:literal. ?p has a rr:termtype property with one of rr:iri or rr:blanknode.

?X is a rr:predicatemap. ?X rrx:parsetype rrx:listormap. => ?X has no rr:termtype property. ?X rrx:parse [ rrx:parsetype rr:literal; rr:termtype rr:iri ].

Contribution: As a result of extending R2RML to parse structured data, the possibility of including a rrx:parse property in another rrx:parse makes the xr2rml syntax recursive, like the data it describes. One can explore the data and express, at any level of depth, the construction of RDF term classes like rdf:list, rdf:bag, rdf:seq and rdf:alt.

4.2.7 Foreign Key relationship between logical tables

Reminder of the R2RML definition

A referencing object map allows using the subjects of another triples map as the objects generated by a predicate-object map. Since both triples maps may be based on different logical tables, this may require a join between the logical tables. A referencing object map resource has exactly one rr:parenttriplesmap property (its value is a triples map), and optional rr:joincondition properties. A join condition has exactly one rr:child property and one rr:parent property. The rr:child property references the join condition's child column, the rr:parent property references the join condition's parent column.

xr2rml extension

In xr2rml, the join condition is extended in two ways: (i) rr:child and rr:parent are allowed to specify mixed-syntax paths, (ii) two optional properties are introduced, rrx:childparse and rrx:parentparse:

- Properties rr:child and rr:parent may use mixed-syntax paths to reference data elements by traversing data of different formats.
- The rrx:childparse property (respectively the rrx:parentparse property) of a join condition describes how to interpret and parse the values from the logical source referenced by the rr:child property (respectively the rr:parent property). The range of the rrx:childparse and rrx:parentparse properties is the rrx:joinparse class.

An instance of the rrx:joinparse class has one rrx:parsetype property and one rrx:format property that bear the same semantics as in the context of a term map.

If a join condition has no rrx:childparse property, it is assumed to have the default property:
    rrx:childparse [ rrx:parsetype rr:literal; rrx:format rrx:row ].
If a join condition has no rrx:parentparse property, it is assumed to have the default property:
    rrx:parentparse [ rrx:parsetype rr:literal; rrx:format rrx:row ].

- Vocabulary definitions: If a referencing object map has a join condition, then the parse type provided by the rrx:childparse property is called the child parse type, whereas the parse type provided by the rrx:parentparse property is called the parent parse type.
- Equivalent join queries: Technically, there is no equivalent join query to express the constraints of an rrx:joinparse property. The reason is simple: there is no standard query

language for NoSQL. The rrx:joinparse properties can involve two different logical sources and potentially two different query languages and data formats.

Generating multiple RDF terms with a referencing object map

In the example relational database below, column "Doctor.studies" contains a JSON array whose values are foreign keys to column "Study.study_id".

Input data:

Study
study_id  study_name
1         study1
2         study2
3         study3

Doctor
doc_id  doc_name  studies
1       D1        [1,2]
2       D2        [3]

Mapping graph:
<#Study>
    rr:logicaltable [ rr:tablename "Study" ];
    rr:subjectmap [ rr:template " ].

<#Doctor>
    rr:logicaltable [ rr:tablename "Doctor" ];
    rr:subjectmap [ rr:template "
    rr:predicateobjectmap [
        rr:predicate ex:investigator;
        rr:objectmap [
            rr:parenttriplesmap <#Study>;
            rr:joincondition [
                rr:parent "study_id";
                rr:child "studies";
                rrx:childparse [ rrx:parsetype rrx:listormap; rrx:format rrx:json ]
            ]
        ]
    ].

The rrx:childparse property uses the property rrx:format to specify that the data referenced by rr:child is formatted in JSON.

Generated triples

The table view equivalent to these join conditions is:

doc_id  doc_name  studies  study_id  study_name
1       D1        [1,2]    1         study1
1       D1        [1,2]    2         study2
2       D2        [3]      3         study3

Resulting triples:
< ex:investigator <
< ex:investigator <
< ex:investigator <

Remarks:

- Contrary to the rrx:parse class, an instance of the rrx:joinparse class does not have an rr:termtype property: the join parse is not meant to create RDF terms; instead it allows selecting comparable values from the input database in order to perform a join operation. Thus a rrx:joinparse instance only returns literals and no term type is required.
- Contrary to the rrx:parse class, an instance of the rrx:joinparse class does not have a rrx:parse property. See the justification in Appendix B.

Generating RDF collection or RDF container with a referencing object map

In R2RML, referencing object maps cannot have an rr:termtype property, as they should only produce RDF terms of type rr:iri. In xr2rml however, the result of a join may be translated into an RDF collection or container using property rr:termtype. The rr:termtype property has a specific semantics here: it instructs that join query results should be grouped by child reference, that is, by the subject of the generated triples, and that all objects in the same group should be rendered as an RDF collection or container.

If a referencing object map has no rr:termtype property, then its term type is rr:iri (compliant with the definition of R2RML term types).
A referencing object map may have a rr:termtype property with an RDF collection or container term type (rrx:rdflist, rrx:rdfseq, rrx:rdfbag or rrx:rdfalt). In that case, elements of the collection or container are necessarily of type rr:iri.

In a referencing object map with an RDF collection or container term type, results of the join condition are grouped by child value, i.e. by subjects of the triples map.
The parent values of the groups thus formed (the objects of the triples map) are assembled into a single object of type RDF collection or container, as instructed by the rr:termtype property.

In the example below the referencing object map has an rr:termtype property with value rrx:rdflist.

Input data:

Table Study
study_id  study_name
1         study1
2         study2
3         study3

Table Doctor
doc_id  doc_name  studies
1       D1        [1,2]
2       D2        [3]

Mapping graph:
<#Study>
    rr:logicaltable [ rr:tablename "Study" ];
    rr:subjectmap [ rr:template " ].

<#Doctor>
    rr:logicaltable [ rr:tablename "Doctor" ];
    rr:subjectmap [ rr:template "
    rr:predicateobjectmap [
        rr:predicate ex:investigator;
        rr:objectmap [
            rr:parenttriplesmap <#Study>;
            rr:joincondition [
                rr:parent "study_id";
                rr:child "studies";
                rrx:childparse [ rrx:parsetype rrx:listormap; rrx:format rrx:json ]
            ];
            rr:termtype rrx:rdflist;
        ]
    ].

Generated triples

Resulting triples:
< ex:investigator ( < < ).
< ex:investigator (<
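The join on parsed child values, and the grouping triggered by an RDF collection term type, can be sketched on the Doctor/Study data as follows. This is a simplified in-memory illustration and not how a processor would necessarily evaluate the join. The IRI templates and the function names are hypothetical, since the original templates are not legible in this transcription.

```python
import json
from collections import defaultdict

STUDY = [
    {"study_id": 1, "study_name": "study1"},
    {"study_id": 2, "study_name": "study2"},
    {"study_id": 3, "study_name": "study3"},
]
DOCTOR = [
    {"doc_id": 1, "doc_name": "D1", "studies": "[1,2]"},
    {"doc_id": 2, "doc_name": "D2", "studies": "[3]"},
]

# Hypothetical IRI templates (assumption).
DOCTOR_IRI = "http://example.org/doctor/{doc_id}"
STUDY_IRI = "http://example.org/study/{study_id}"


def join_values(row, column, parse_type):
    """Values of one cell that take part in the join comparison."""
    cell = row[column]
    if parse_type != "rrx:listormap":
        return [cell]
    parsed = json.loads(cell)
    return list(parsed.values()) if isinstance(parsed, dict) else list(parsed)


def join(children, parents, child_col, parent_col, child_parse_type):
    """Return (child row, parent row) pairs whose join values are equal."""
    pairs = []
    for child in children:
        for value in join_values(child, child_col, child_parse_type):
            for parent in parents:
                if parent[parent_col] == value:
                    pairs.append((child, parent))
    return pairs


pairs = join(DOCTOR, STUDY, "studies", "study_id", "rrx:listormap")

# No rr:termtype on the referencing object map: one triple per matching pair.
flat = [(DOCTOR_IRI.format(**c), "ex:investigator", STUDY_IRI.format(**p))
        for c, p in pairs]

# rr:termtype rrx:rdflist: group the objects by subject into one list each.
grouped = defaultdict(list)
for subject, _, obj in flat:
    grouped[subject].append(obj)

print(flat)
print(dict(grouped))
```

The first result corresponds to the three triples of the previous example, the second to one RDF list per doctor.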

Contribution: The concept of "join condition" is barely present in NoSQL databases. Most of them avoid it. Conversely, relational databases rely heavily on join conditions. The ability to parse data based on its format improves the concept of join condition, which used to be limited to a simple comparison of data in cells. One can now compare XML data element values with JSON data element values by parsing them according to their format description.

Appendix A - The reason for not having parse type rrx:listormap with term type rr:literal in a rrx:parse

Input data:
    { "teams": [ ["John", "Paul"], ["Cathy", "Ed"] ] }

Term map:
    rr:objectmap [
        rrx:reference "teams";
        rrx:format rrx:json;
        rrx:parsetype rrx:listormap;   # "teams" is a list
        rr:termtype rrx:rdfseq;
        rrx:parse [
            rrx:parsetype rrx:listormap;   # each element is itself a list
            rr:termtype rr:literal;
        ]
    ]

Generated RDF terms:
For each element of the "teams" list, the object map produces a member of the RDF sequence: rdf:_1, rdf:_2, etc. In addition, the instance of class rrx:parse returns one RDF term for each element of the inner lists. The cartesian product would result in the following invalid sequence:
    [ a rdf:seq;
      rdf:_1 "John"; rdf:_1 "Paul";   # Incorrect to have two properties rdf:_1
      rdf:_2 "Cathy"; rdf:_2 "Ed";    # Incorrect to have two properties rdf:_2
    ]

Appendix B - The reason for not having a rrx:parse property in a rrx:joinparse instance

The semantics of parsing nested structures with several nested rrx:parse properties within a rrx:joinparse instance is difficult to define. Example: if "studies" is a list of lists such as value "[ [1,2],[3,4] ]", then what would be the meaning of the referencing object map below?

    rr:objectmap [
        rr:parenttriplesmap <#Study>;
        rr:joincondition [
            rr:parent "study_id";
            rr:child "studies";
            rrx:childparse [
                rrx:parsetype rrx:listormap;
                rrx:parse [ rrx:parsetype rrx:listormap

The rrx:listormap parse type within the rrx:childparse instructs that the result returned by the rrx:child property must be treated as a list. Additionally, the rrx:parse property instructs that "[1,2]" and "[3,4]" must also be parsed as lists. Using RDF lists, this term map returns: ((1 2) (3 4)). Thus, it would not make sense to compare literal values from the parent reference with the RDF lists (1 2) and (3 4) from the child reference. To be comparable, values from the child and parent references must share the same nature, presumably literals.

Arguably though, we could imagine a use case in which a structured value serves as a foreign key. Below, studies are identified by a structured JSON value. A join is done on studies with an id like {"phase":1, "duration":2} in the parent reference, and [1, 2] in the child reference. Since the parsing gets rid of keys in key-value maps, both values are considered a valid match and the join produces a result.

Input data

JSON documents retrieved by the query in the <#Study> triples map:

    { "study_id": {"phase":1, "duration":2}, "study_name":"study12" }
    { "study_id": {"phase":3, "duration":4}, "study_name":"study34" }

JSON document retrieved by the query in the <#Doctor> triples map:

    { "doc_name":"d1", "studies": [[1,2], [3,4]] }

Mapping graph

    <#Study>
        rrx:logicalsource [ rrx:format rrx:json;...
        rr:subjectmap [ rr:template " ].

    <#Doctor>
        rrx:logicalsource [ rrx:format rrx:json; rrx:query "...";
        rr:subjectmap [ rr:template " ].
        rr:predicateobjectmap [
            rr:predicate ex:investigator;
            rr:objectmap [
                rr:parenttriplesmap <#Study>;
                rr:joincondition [

                    rr:parent "study_id";
                    rrx:parentparse [ rrx:parsetype rrx:listormap;
                    rr:child "studies";
                    rrx:childparse [ rrx:parsetype rrx:listormap;
                        rrx:parse [rrx:parsetype rrx:listormap
                rr:termtype rrx:rdflist;
        ].

Generated RDF triples

    < ex:investigator ( < < ).

5. Implementing xr2rml on MORPH

Having completed a first theoretical work, we focus in this section on implementing xr2rml. The implementation provides a practical framework to test and verify the efficiency of the xr2rml properties. As xr2rml extends the R2RML standard, we base its implementation on an R2RML tool. Several R2RML implementations exist: DB2Triples, Ultrawrap, MORPH, XSPARQL, etc. Among these tools, we chose MORPH for two main reasons: it passes most of the R2RML tests and its source code is freely accessible.

MORPH is an R2RML mapping engine developed in Scala by the Ontology Engineering Group [Ref 11]. MORPH relies on the Domain semantics-driven Mapping (a dynamic query-driven implementation that dynamically executes the mapping in response to a query usually written in SPARQL) to translate relational databases into RDF graphs. There are three implementations of MORPH: Morph-RDB, Morph-Stream and Morph-GFT.

Morph-RDB deals with traditional relational databases. It also performs query translation, which allows evaluating SPARQL queries over a virtual RDF dataset by rewriting those queries into SQL according to an R2RML mapping description. Several databases are supported, such as MySQL, PostgreSQL and MonetDB.

Morph-Stream and Morph-GFT use R2RML for specific types of data sources that are not SQL-based but still follow a relational model. Morph-Stream implements an R2RML engine that works with a Data Stream Management System. Morph-GFT extends Morph-RDB to work with Google Fusion Tables (GFT), a web-based data management system supported by Google (SPARQL only, without batch upgrade).

We focus on Morph-RDB to implement our theoretical work.

5.1. Implementing xr2rml

This implementation takes into account structured data formats such as XML, JSON and CSV. In the remainder we will call our implementation xmorph.

Algorithm

This section presents a brief overview of the algorithm implemented by xmorph. MORPH needs two input files to work: the mapping file and the configuration properties file. xmorph keeps the same requirements. The two listings below present the algorithms of MORPH and xmorph.

MORPH Algorithm

    Input:
        R2RML mapping file: mapping.ttl
        properties file: config.properties
    Start:
        Read "config.properties"
        Connection to the database
        Read "mapping.ttl"
        for each triples map in "mapping.ttl"
            Create the corresponding SQL query
            Run the SQL query and retrieve a SQL table or view
            for each row in the SQL table or view
                for each term map in the triples map
                    Convert data into RDF
                    Write the RDF terms in the output file
                end for
            end for
        end for
    End of the algorithm

xmorph Algorithm

    Input:
        xr2rml mapping file: mapping.ttl
        properties file: file.properties
    Start:
        Read "file.properties"
        Connection to the database
        Read "mapping.ttl"
        for each triples map in "mapping.ttl"
            Create the corresponding query
            Run the query and retrieve a table or a view
            for each row in the table or the view
                for each term map in the triples map
                    Check the format of the term map
                    Convert data into recursive key-value model
                    Convert data into RDF (simple RDF terms, RDF Collections or RDF Containers)
                    Write the RDF terms in the output file
                end for
            end for
        end for
    End of the algorithm

The main changes are the format conversion and the creation of RDF Collections and Containers. For practical reasons, we use the JSON format as a representative of the recursive key-value model. All the example cases presented in section 4 are supported by xmorph. The source code can be found on this link:

6. Improving xr2rml

This section presents an improved version of xr2rml. Keeping it separate from section 4 is motivated by the implementation work. Indeed, the xr2rml version introduced in section 4 is stable enough to be implemented. It reaches the main goal of the language: mapping structured data that complies with the recursive key-value data model. Its implementation in xmorph covers all the use cases presented in section 4. The following version of xr2rml, written during the implementation of the section 4 version, lightens the mapping description. Mainly, it introduces the referencing and mixed-path mechanisms that improve the handling of data formats.
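The two steps that xmorph adds to the MORPH loop (conversion into the recursive key-value model, then production of RDF terms) can be sketched as follows. This is a minimal illustration and not the xmorph code: the function names, the use of nested Python lists and dictionaries as a stand-in for the recursive key-value model, the way XML elements are flattened into lists, and the textual rendering of RDF terms are all assumptions made for this sketch.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def xml_to_key_value(elem):
    """Assumed convention: a leaf element becomes its text, an element with
    children becomes the list of its converted children."""
    children = list(elem)
    if not children:
        return elem.text
    return [xml_to_key_value(child) for child in children]


def to_key_value(raw, fmt):
    """Convert a raw value into nested Python structures according to its format."""
    if fmt == "json":
        return json.loads(raw)
    if fmt == "xml":
        return xml_to_key_value(ET.fromstring(raw))
    if fmt == "csv":
        return next(csv.reader(io.StringIO(raw)))
    return raw  # plain cell value: nothing to parse


def to_rdf_terms(value, term_type):
    """Render a flat value as Turtle-like strings (nested values are not handled)."""
    if isinstance(value, dict):
        value = list(value.values())  # keys of key-value maps are dropped
    items = value if isinstance(value, list) else [value]
    literals = [json.dumps(item) for item in items]
    if term_type == "rdflist":
        return ["( " + " ".join(literals) + " )"]
    if term_type == "rdfseq":
        members = "; ".join(f"rdf:_{n} {lit}" for n, lit in enumerate(literals, 1))
        return [f"[ a rdf:Seq; {members} ]"]
    return literals  # one simple literal per element


cell = "<list><product>Laptop</product><product>Desktop</product></list>"
print(to_rdf_terms(to_key_value(cell, "xml"), "literal"))
print(to_rdf_terms(to_key_value("[1,2,3]", "json"), "rdflist"))
```

Using a single intermediate representation keeps the RDF generation step independent of the input format, which is the role the text assigns to JSON in xmorph.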

6.1. Referencing data elements

xr2rml references columns (section 4) as well as any data element within structured values such as lists of elements or key-value maps. For this purpose, we extend the property rr:template with a data element referencing mechanism explained below. To avoid confusion, xr2rml also extends the property rr:column with the property rrx:reference, which allows referencing data elements in non-relational databases. This leads to the following amended definition.

A reference-valued term map has exactly one rrx:reference property. The value of the rrx:reference property is a valid reference to a data element.

The value of the rr:template property is still a string template. A string template can use data element references by enclosing them in curly braces ("{" and "}").

With a non-mixed data format, properties rr:template and rrx:reference use data element references to select data from structured values using path expressions. The path expression syntax is deduced from the data format of the logical source as follows:

    Property rrx:format of the logical source    Assumed path expression syntax
    rrx:row                                      Column name in a relational database or extensible column store; variable name in a SPARQL result set
    rrx:xml                                      XPath
    rrx:json                                     JSONPath
    rrx:csv                                      CSV file

In case this notation brings confusion with the term "referencing object map" discussed earlier, note that the two have nothing in common. "Referencing object maps" are object maps that use join condition properties to construct their RDF terms. Referencing a data element, on the other hand, means using the native language of the logical source to select data.

The examples below show xr2rml logical source definitions for various input databases, with the associated references to data elements.

Logical source: XML database supporting XQuery.

The database contains:

    <movies>
      <movie id="1" name="movie 1" date="2011" />
      <movie id="2" name="movie 2" date="1989" />
    </movies>

The XQuery expression in property rrx:query returns:

    <movie id="2" name="movie 2" date="1989" />
    <movie id="1" name="movie 1" date="2011" />

The rrx:reference property uses the XPath syntax to retrieve the names of the movies.

Logical source definition:

    rrx:logicalsource [
        rrx:query """for $i in //movies/movie order by $i/@date return $i""";
        rrx:format rrx:xml;
    rr:subjectmap [ rr:template "
    rr:predicateobjectmap [
        rr:predicate ex:title;
        rr:objectmap [ rrx:reference "//movie/@name"; ]
    ].

Logical source: MongoDB database (document store), using Jaspersoft MongoDB Query Language.

The database contains one document:

    [ { "id":1, "name":"movie 1", "date":2011 },
      { "id":2, "name":"movie 2", "date":1989 } ]

The MongoDB query returns:

    { "id":2, "name":"movie 2", "date":1989 }
    { "id":1, "name":"movie 1", "date":2011 }

The rrx:reference property uses the JSONPath syntax.

Logical source definition:

    rrx:logicalsource [
        rrx:query """{ collectionname: 'movies', sort: { date:1 }, findfields: { name:1, date:1 }, limit: 10 }""";
        rrx:format rrx:json;
    rr:subjectmap [ rr:template "
    rr:predicateobjectmap [
        rr:predicate ex:title;
        rr:objectmap [ rrx:reference "$.name"; ]
    ].

Logical source: AllegroGraph (RDF graph store) using SPARQL.

The rrx:reference property indicates the name of a variable returned in the result set.

Logical source definition:

    rrx:logicalsource [
        rrx:query """select ?name ?date where { ?movie a ex:movie; ex:name ?name; ex:date ?date. } order by ?date limit 10""";
        rrx:format rrx:row;
    rr:subjectmap [ rr:template "
    rr:predicateobjectmap [
        rr:predicate ex:title;
        rr:objectmap [ rrx:reference "?name"; ]
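The way the path syntax follows from the format of the logical source can be illustrated with a small Python dispatcher. This is a sketch under explicit assumptions, not xr2rml tooling: evaluate_reference is a hypothetical name, the JSONPath support is reduced to dotted keys such as "$.name", and the XPath support relies on the subset implemented by the standard library module xml.etree.ElementTree, with a trailing "/@attribute" step handled by hand.

```python
import json
import xml.etree.ElementTree as ET


def evaluate_reference(data, reference, fmt):
    """Hypothetical helper: select values with the path syntax implied by fmt."""
    if fmt == "row":
        # The reference is a column name (or a variable name in a result set).
        return [data[reference]]
    if fmt == "json":
        # Tiny JSONPath subset, assumed for this sketch: "$.key1.key2".
        if not reference.startswith("$."):
            raise ValueError("only '$.key' style paths are supported here")
        node = json.loads(data)
        for key in reference[2:].split("."):
            node = node[key]
        return [node]
    if fmt == "xml":
        # ElementTree has no attribute step, so "/@name" is split off manually.
        path, _, attribute = reference.partition("/@")
        elements = ET.fromstring(data).findall("." + path)
        return [e.get(attribute) if attribute else e.text for e in elements]
    raise ValueError(f"unsupported format: {fmt}")


movies_xml = (
    '<movies>'
    '<movie id="1" name="movie 1" date="2011" />'
    '<movie id="2" name="movie 2" date="1989" />'
    '</movies>'
)
movie_json = '{ "id":2, "name":"movie 2", "date":1989 }'

print(evaluate_reference(movies_xml, "//movie/@name", "xml"))
print(evaluate_reference(movie_json, "$.name", "json"))
print(evaluate_reference({"name": "movie 1"}, "name", "row"))
```

A real engine would delegate to complete XPath and JSONPath processors; the point of the sketch is only that one reference string is interpreted differently depending on rrx:format.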

6.1.1. Referencing data elements with mixed data formats

To reference data elements within such mixed contents, xr2rml allows a term map to reference a data element with mixed-syntax paths:

Properties rrx:reference and rr:template use mixed-syntax paths to reference data elements by traversing data of different formats. A mixed-syntax path is the concatenation of several individual path expressions in different syntaxes, separated by the slash '/' character. Each individual path is enclosed in a syntax path constructor naming the path syntax explicitly. Existing constructors are:
- Row(column name),
- JSONPath(JSONPath expression),
- XPath(XPath expression),
- CSV() for CSV data.

Note: there is no universal path language for CSV data equivalent to JSONPath or XPath. Therefore, the constructor CSV() is always used with empty parentheses.

Example:

Input data

    Table
    column
    {"id":1, "FirstName":"John", "LastName": "Smith" }

Term map

    rrx:logicalsource [ rrx:sourcename "Table"; rrx:format rrx:row;
    rr:subjectmap [ rr:template "
    rr:predicateobjectmap [
        rr:predicate ex:name;
        rr:objectmap [
            rrx:reference "Row(column)/JSONPath($.FirstName)";
            rr:language "en";
        ]
    ].

Generated RDF term from the object map

    < ex:name "John"@en

Implicitly, values referenced by a mixed-syntax path are expressed in the syntax corresponding to the last path expression in the mixed-syntax path. Typically, "Row(column)/JSONPath($.name)" references values written in JSON since the last path is expressed in JSONPath. This assumption is necessary when structured values must be parsed.

Example:

Input data

    <person>
      <name>John Smith</name>
      <items>[1,2,3]</items>
    </person>

Node "items" contains a value expressed as a JSON array.

Term map

    rr:objectmap [
        rrx:reference "XPath(/person/items)/JSONPath()";
        rrx:parsetype rrx:listormap;
    ]

The empty parentheses in the last element of the mixed-syntax path, "JSONPath()", indicate that value "[1,2,3]" is formatted in JSON syntax.

Generated RDF terms

6.1.2. rrx:joinparse properties and mixed-path

Generating multiple RDF terms with a referencing object map

In the example relational database below, column "Doctor.studies" contains a JSON array whose values are foreign keys to column "Study.study_id".

Input data

Table Study
    study_id    study_name
    1           study1
    2           study2
    3           study3

Table Doctor
    doc_id    doc_name    studies
    1         D1          [1,2]
    2         D2          [3]

Mapping graph

    <#Study>
        rr:logicaltable [ rr:tablename "Study"
        rr:subjectmap [ rr:template " ].

    <#Doctor>
        rr:logicaltable [ rr:tablename "Doctor"
        rr:subjectmap [ rr:template "
        rr:predicateobjectmap [
            rr:predicate ex:investigator;
            rr:objectmap [
                rr:parenttriplesmap <#Study>;
                rr:joincondition [
                    rr:parent "study_id";
                    rr:child "Row(studies)/JSONPath()";
                    rrx:childparse [rrx:parsetype rrx:listormap;
        ].

The rr:child property uses a mixed-syntax path specifying that the retrieved data is formatted in JSON.

In the example below the referencing object map has an rr:termtype property with value rrx:rdflist (note that the data set is different: D2 is an investigator for studies 2 and 3):

Input data

JSON documents retrieved by the query in the <#Study> triples map:

    { "study_id":1, "study_name":"study1" }
    { "study_id":2, "study_name":"study2" }
    { "study_id":3, "study_name":"study3" }

JSON documents retrieved by the query in the <#Doctor> triples map:

    { "doc_name":"d1", "studies": [1,2] }
    { "doc_name":"d2", "studies": [2,3] }

Mapping graph

    <#Doctor>
        rrx:logicalsource [ rrx:format rrx:json; rrx:query "...";
        rr:subjectmap [ rr:template " ].

    <#Study>
        rrx:logicalsource [ rrx:format rrx:json; rrx:query "...";
        rr:subjectmap [ rr:template "
        rr:predicateobjectmap [
            rr:predicate ex:hasinvestigator;
            rr:objectmap [

                rr:parenttriplesmap <#Doctor>;
                rr:joincondition [
                    rr:child "$.study_id";
                    rr:parent "$.studies";
                    rrx:parentparse [ rrx:parsetype rrx:listormap;
                rr:termtype rrx:rdflist;
        ].

Generated RDF triples

Results are grouped by the child reference, that is "study_id":

    < ex:hasinvestigator ( < ).
    < ex:hasinvestigator ( < < ).
    < ex:hasinvestigator ( < ).

Contribution: Using the property "rrx:reference" has several advantages:
- It makes it possible to use any path expression that matches the format description of the logical source. xr2rml can be used to map any data whose format complies with the recursive key-value data model (section 2.4). This enlarges the scope of xr2rml.
- The mixed-path mechanism allows parsing even the most convoluted data that can be stored. Even if it is unusual to encapsulate XML data in a JSON document or vice versa, such data can be parsed.
- It also restricts the use of the property "rrx:format" to the logical source description.
Moreover, path expressions can be used directly as values of join condition properties.

6.2. The property "rrx:parsetypeseq"

A template-valued term map may reference several data elements from the logical source, captured by curly braces ('{' and '}'). Therefore, it should be possible to specify a parse type for each data element referred to in the template string. The rrx:parsetypeseq property takes as object an RDF sequence of parse types following the order of the capturing curly braces in the template string.

An xr2rml template-valued term map may have a sequence of parse types defined with the optional rrx:parsetypeseq property. A parse type may be one of the two values defined for the parse type property.

Typically, the template string "{ref1}... {ref2}... {refn}" has the following sequence of parse types:

    rrx:parsetypeseq [ a rdf:seq; rdf:_1 value1; rdf:_2 value2; ...; rdf:_n valuen

where value(i) corresponds to the parse type of ref(i).

The simple "parse type" expression is used when all parse types of the parse type sequence have the same value. If a term map has no rrx:parsetypeseq property, its parse type defaults to an RDF sequence of rr:literal.

A template-valued term map referencing literal values has a default rrx:parsetypeseq property in which all members have a rr:literal parse type. Typically, the template string "{ref1}... {ref2}... {refn}" has the following default sequence of parse types:

    rrx:parsetypeseq [ a rdf:seq; rdf:_1 rr:literal; rdf:_2 rr:literal; ...; rdf:_n rr:literal

Example:

Input data: one row retrieved from an RDB, with VARCHAR columns formatted in JSON and XML

    cos                   products
    [ "Dell", "Asus" ]    <list> <product>Laptop</product> <product>Desktop</product> </list>

Mapping graph

    <#TripleMap>
        rrx:logicalsource [...
        rr:subjectmap [
            rr:template "
            rrx:parsetypeseq [ a rdf:seq; rdf:_1 rrx:listormap
        rr:predicateobjectmap [
            rr:predicate ex:produces;
            rr:objectmap [
                rrx:reference "Row(products)/XPath(/list)";
                rrx:parsetype rrx:listormap;

Generated triples

    < ex:produces "Laptop".
    < ex:produces "Desktop".
    < ex:produces "Laptop".
    < ex:produces "Desktop".

7. Thesis summary and perspectives

The aim of this thesis was to propose a solution for mapping NoSQL data into the RDF format. Facing the large variety of NoSQL systems, the first part of this work presents a state of the art of NoSQL databases. The list of NoSQL systems presented contains the major actors, classified according to their data model. As a result of this state of the art, we highlight a data model common to the different NoSQL databases. The second section introduces the semantic web standards on which we rely for the purpose of this thesis, namely RDF and

R2RML. The R2RML mapping language, standardized by the W3C, constitutes a solid starting point because it is a well-documented standard supported by many tools. We therefore extended R2RML properties to NoSQL databases. We mostly focused on designing properties for an efficient parsing of the data. These properties have been designed for any data formatted in the common data model highlighted at the end of the state of the art, namely the recursive key-value model. Thus, we ensure compatibility with a large set of NoSQL systems. xr2rml enlarges the set of databases that can be mapped and also the set of data formats that can be parsed. It allows the creation of complex RDF terms such as RDF Collections and RDF Containers.

Some experimental work has also been performed. We implemented the extended R2RML language (xr2rml) on MORPH, an R2RML engine developed by the Ontology Engineering Group, by modifying its source code to process xr2rml properties. The last part of this document presents an improved version of xr2rml that is not yet implemented. This second version mainly enhances the data format handling.

The NoSQL movement evolves very fast and several NoSQL systems have not reached maturity yet. Designing a mapping language for NoSQL requires a frequently updated standard. Thus, xr2rml is ongoing work.

Bibliography:

[Ref 1] Strozzi, Carlo: NoSQL "A relational database management system"
[Ref 2] Chang, Fay; Dean, Jeffrey; Ghemawat, Sanjay; Hsieh, Wilson C.; Wallach, Deborah A.; Burrows, Mike; Chandra, Tushar; Fikes, Andrew; Gruber, Robert E.: "Bigtable: A Distributed Storage System for Structured Data". November
[Ref 3] G. DeCandia et al. "Dynamo: Amazon's Highly Available Key-value Store". In: ACM SIGOPS Operating Systems Review 41.6 (2007), pp
[Ref 4] English Wikipage of ACID -
[Ref 5] BASE Wikipedia page
[Ref 6] Brewer, Eric A.: "Towards Robust Distributed Systems". Portland, Oregon, July Keynote at the ACM Symposium on Principles of Distributed Computing (PODC), July
[Ref 7] Nancy Lynch and Seth Gilbert, Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services, ACM SIGACT News, Volume 33 Issue 2 (2002), pg :
[Ref 8] Christof Strauch, NoSQL Databases, Stuttgart Media University,
[Ref 9] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall and W. Vogels, Dynamo: Amazon's Highly Available Key-value Store; ACM SIGOPS symposium on Operating systems principles (SOSP '07), Stevenson, USA, Oct pp
[Ref 10] SPARQL specification by W3C:
[Ref 11] Official website of the Ontology Engineering Group:
[Ref 12] RDF 1.1 Primer W3C Working Group, February 2014:
[Ref 13] John Roijackers, "Bridging SQL and NoSQL", Master Thesis in March 2012, Eindhoven University of Technology at the Department of Mathematics and Computer Science:
[Ref 14] R2RML: RDB to RDF Mapping Language W3C Recommendation, September 2012:
[Ref 15] Dimou et al, "RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data", Proceedings of the 7th Workshop on Linked Data on the Web (LDOW2014), April 2014, Seoul, Korea.
[Ref 16] F. Michel, J. Montagnat, C. Faron-Zucker, "A survey of RDB to RDF translation approaches and tools", Laboratoire d'informatique, Signaux, et Systèmes de Sophia-Antipolis (I3S) / Team MODALIS and Team WIMMICS (INRIA Sophia Antipolis / Laboratoire I3S) INRIA Université Nice Sophia Antipolis (UNS) CNRS ouvertes.fr/docs/00/98/66/83/pdf/rapport_rech_i3s_v2_-_michel_et_al_2013_-_A_survey_of_RDB_to_RDF_translation_approaches_and_tools.pdf
[Ref 17] SPARQL Query Language for RDF W3C Recommendation 15 January 2008:
[Ref 18] Master thesis, "Analysis and Classification of NoSQL Databases and Evaluation of their Ability to Replace an Object-relational Persistence Layer", by Kai Orend of Technical University of Munich, Faculty of Computer Science, April
[Ref 19] 10gen, Inc: mongodb
[Ref 20] Pritlove, Tim; Lehnardt, Jan; Lang, Alexander: "CouchDB Die moderne Key/Value-Datenbank lädt Entwickler zum Entspannen" in. June Chaosradio Express Episode 125, Podcast published on
[Ref 21] Lakshman, Avinash; Malik, Prashant: "Cassandra A Decentralized Structured Storage System." In: SIGOPS Operating Systems Review 44 (2010), April, p Also available online.
[Ref 22] Amazon.com, Inc.: Amazon SimpleDB
[Ref 23] Eifrem, Emil: Neo4j The Benefits of Graph Databases. July OSCON presentation.
[Ref 24] Graph Databases by Ian Robinson, Jim Webber, and Emil Eifrem. Copyright 2013 Neo Technology, Inc.
[Ref 25] Codd, Edgar F.: "A Relational Model of Data for Large Shared Data Banks." In Communications of the ACM 13 (1970), June, No. 6, p
[Ref 26] The World Wide Web Consortium -
[Ref 27] The JavaScript Object Notation (JSON) specification -