WDS'07 Proceedings of Contributed Papers, Part I, 110-115, 2007. ISBN 978-80-7378-023-4 MATFYZPRESS

Web Storage Interface

J. Tykal
Charles University, Faculty of Mathematics and Physics, Prague, Czech Republic.

Abstract. This article describes the design and implementation of a Semantic Web data store, one of the main parts of an infrastructure for the Semantic Web. The main goal of this project is to create a new, simple interface between any data store and the other pieces of the infrastructure. The interface should not depend on any existing or future data store; it allows loading data and executing queries over them. The main part of this article covers the interface for data import. The implementation builds on the Oracle relational database and can be used from more than one programming language. Performance tests show not only the bottlenecks of this solution but also that the data store has excellent load characteristics. These tests also suggested improvements and directions for future research.

Introduction

The world wide web contains a great amount of information, and the situation on the Semantic Web is very similar. In both cases, a user who wants to look up something specific usually uses one or more search engines. Every search engine is based at least on importing data into a data store, indexing the data and evaluating queries. Several data stores have been developed for storing and searching semantic data. All of them are based on one of three principles: storing data in memory, storing data in a native format, or storing data in a relational database. Each of these principles has advantages and disadvantages. An in-memory data store is very fast but has limited capacity. A data store based on a native format may be fast, but each change in its structure may be difficult. A data store based on a relational database is slower than an in-memory store, but it can store a huge amount of data.
In the Semantic Web community, the most common data stores are Jena and Sesame. These data stores work perfectly with a small amount of data, but working with huge data becomes impracticable. We therefore tried to design, implement and test a new data store for Semantic Web data. Every two existing data store interfaces are different, so it is difficult to switch data stores in a Semantic Web application; there is no standard for a data store interface. We propose a new interface that should become a universal interface for all semantic data stores. The pilot implementation does not yet use an existing data store (e.g. Jena, Sesame) but a newly designed and implemented one. In future work we will try to implement the proposed interface on top of commonly used repositories.

Infrastructure for the Semantic Web

In [IFR2006], a proposal of an infrastructure for the Semantic Web was introduced. We consider the main part of this infrastructure to be a place where anyone can save their own data; we call this place the Data Store. Many other parts of the infrastructure interact with the Data Store. Each of these parts has a specific function, and each of them (see Figure 1) needs to access the data and possibly insert new data.
The first of them is the Query unit; its main function is to query the data. The second one is the SemWeb server; its main function is to find new relations based on inserted information. These relations are saved back to the Data Store, so this module clearly uses both the querying and the importing of Semantic Web data. The last ones are the Importers; their function is to load new amounts of semantic data. The whole interface can be divided into two main parts: the query interface and the import interface. Both interfaces use universal structures (called basic data structures) that can hold any RDF triple or reification.

Figure 1. The Importer and Query unit modules communicate with a Data Store using the query and import interfaces.

Basic data structure

The building block of the Semantic Web is the RDF triple. An RDF triple [W3CRDF] consists of three parts: the subject, which is an RDF URI reference or a blank node; the predicate, which is an RDF URI reference; and the object, which is an RDF URI reference, a literal or a blank node. RDF triples can be associated with additional information called reification. The internal structure of a reification is similar to the record of an RDF triple; the only difference is the subject, which is a triple (in the case of a reification) or a URI/blank node (in the case of an RDF triple). Based on this, we defined a class hierarchy consisting of one virtual class called Node and the derived classes URI, Literal, Blank node and Triple. By making Triple a descendant of Node we get a unified interface for working with both triples and reifications. This part of the interface offers an API for creating and releasing URIs, literals, blank nodes and triples.

Data structures

The basic property of the import interface is the definition of an internal memory structure for data insertion (sometimes called an RDF graph) and of the functions which provide the connection to the data store. The internal memory structure can be filled with both RDF triples and reifications. The content of this structure is periodically saved into the data store. To save data, it is necessary to define the data store type and other parameters. Every data store that supports this interface should implement at least:

InitializeConnection(repository name, user name, password, parameters). This function initializes the connection to the data store. We assume that every data store is identified at least by a name, a login name and a login password for authentication. Other data-store-specific parameters can be passed in the last argument; the data store will ignore unknown parameters. The return value indicates whether the connection was established successfully.

InitializeInserts(Import type). This function initializes insertion into the data store. The parameter Import type determines whether batch insert is used; more information about this parameter is in the section Import type. The return value indicates whether the data store was initialized successfully.

FinishInserts(). This function finishes an insertion and propagates all triples into the data store.

InsertTriple(triple). This function inserts a triple into the internal memory structure.

Input interface

This implementation is written in C++ and the data is stored in an Oracle relational database.

Import type

Some parts (e.g. the SemWeb server) query the data and, when they deduce formerly unknown knowledge, insert the information back into the data store. This is typically a small number of triples, because the quality of this information is more important than its quantity. Other parts (e.g. the Importers) insert large amounts of data into the data store; their goal is to import data quickly. The conclusion is that the import interface for any data store should support two modes: immediate insert, where data is inserted immediately when the insert-triple function is called, and batch insert, where data is inserted into a temporary space and all triples are saved once the insertion is finished.
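The node hierarchy and the import functions above can be sketched in C++. The class and function names (Node, URI, Literal, BlankNode, Triple, InitializeConnection, InitializeInserts, FinishInserts, InsertTriple) come from the text; the concrete signatures, member types and the ImportType enumeration are assumptions made for illustration, not the actual implementation.

```cpp
#include <memory>
#include <string>
#include <utility>

// Virtual base class of the hierarchy. Because Triple also derives from
// Node, a reification (a triple whose subject is another triple) needs
// no special case in the interface.
class Node {
public:
    virtual ~Node() = default;
};

class URI : public Node {
public:
    explicit URI(std::string value) : value_(std::move(value)) {}
    const std::string& value() const { return value_; }
private:
    std::string value_;
};

class Literal : public Node {
public:
    explicit Literal(std::string value) : value_(std::move(value)) {}
    const std::string& value() const { return value_; }
private:
    std::string value_;
};

class BlankNode : public Node {
public:
    explicit BlankNode(std::string id) : id_(std::move(id)) {}
    const std::string& id() const { return id_; }
private:
    std::string id_;
};

class Triple : public Node {
public:
    Triple(std::shared_ptr<Node> subject,
           std::shared_ptr<Node> predicate,
           std::shared_ptr<Node> object)
        : subject_(std::move(subject)),
          predicate_(std::move(predicate)),
          object_(std::move(object)) {}
    const Node& subject() const { return *subject_; }
    const Node& predicate() const { return *predicate_; }
    const Node& object() const { return *object_; }
private:
    std::shared_ptr<Node> subject_, predicate_, object_;
};

// Import interface: the four functions every conforming data store
// should implement, with assumed C++ signatures.
enum class ImportType { Immediate, Batch };

class ImportInterface {
public:
    virtual ~ImportInterface() = default;
    virtual bool InitializeConnection(const std::string& repository,
                                      const std::string& user,
                                      const std::string& password,
                                      const std::string& parameters) = 0;
    virtual bool InitializeInserts(ImportType type) = 0;
    virtual void InsertTriple(std::shared_ptr<Triple> triple) = 0;
    virtual void FinishInserts() = 0;
};
```

Because a Triple is a Node, the same InsertTriple call handles a plain triple and a reification whose subject is another triple.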
Our implementation supports both of these modes.

Local cache

For performance optimization we had to implement a cache for inserted triples in the import interface. The cache is useful both for data stores based on a relational database and for other data stores with remote access. In the case of a relational-database-based data store, it is better to insert triples in shorter transactions, but not to commit the transaction after each inserted triple: this prevents extensive record locking and the long response times produced by frequent commits. In the case of other types of data stores, caching can reduce the negative effect of high network latency; the interface can send several triples in one request and eliminate needless waiting on the network.

Portability

Porting our interface to other programming languages is possible because all necessary functions are exported in a DLL library. These functions can be called from many programming languages, e.g. Java or C#.
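The local cache and the C-linkage exports described above can be sketched as follows. The TripleCache class, its flushing policy and the exported function names are hypothetical stand-ins: the real implementation would send each flushed batch to the Oracle-backed store in one transaction rather than merely counting triples.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Simplified triple record; the real cache holds the Node-based structures.
struct CachedTriple { std::string subject, predicate, object; };

// Buffers inserted triples and flushes them as one batch, so the store
// commits once per batch instead of once per triple.
class TripleCache {
public:
    explicit TripleCache(std::size_t capacity) : capacity_(capacity) {}

    // Returns how many triples were flushed by this call (0 if only cached).
    std::size_t Insert(CachedTriple t) {
        buffer_.push_back(std::move(t));
        if (buffer_.size() >= capacity_) return Flush();
        return 0;
    }

    // Propagates the whole buffer in one request/transaction and commits once.
    std::size_t Flush() {
        std::size_t n = buffer_.size();
        flushed_ += n;          // stand-in for the actual batch send + commit
        buffer_.clear();
        return n;
    }

    std::size_t flushed() const { return flushed_; }
    std::size_t pending() const { return buffer_.size(); }

private:
    std::size_t capacity_;
    std::size_t flushed_ = 0;
    std::vector<CachedTriple> buffer_;
};

// Portability: exporting plain C functions from the DLL lets Java (via JNI)
// or C# (via P/Invoke) call the interface. These names are illustrative.
extern "C" {
TripleCache* CreateCache(unsigned capacity) { return new TripleCache(capacity); }
unsigned InsertCached(TripleCache* c, const char* s, const char* p, const char* o) {
    return static_cast<unsigned>(c->Insert({s, p, o}));
}
unsigned FlushCache(TripleCache* c) { return static_cast<unsigned>(c->Flush()); }
void DestroyCache(TripleCache* c) { delete c; }
}
```

Keeping the exported surface to plain C functions and opaque pointers is what makes the DLL callable from languages that cannot consume C++ classes directly.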
Implementation of API functions

The implementation of the SemWeb interface has two layers (see Figure 2). The first one is a public library written in C++. The second layer resides in the Oracle relational database and is written in PL/SQL. The connection between these layers is provided by the OCI interface [OCI].

Figure 2. Communication between the import interface and a Data Store based on the Oracle relational database.

The function InitializeConnection stores information such as the user name and password in an internal structure and tries to connect to the Data Store. This function has to be called at least once. The function InitializeInserts tries to connect to the Data Store and to obtain a new BatchId; the import type is chosen at this time. The function FinishInserts finishes an insertion and propagates all triples into the Data Store. The function InsertTriple calls an API function in the underlying database layer; the correct API function is determined by the internal structure of the inserted triple.

Implementation of the Data Store

We chose the Oracle relational database as the Data Store for several reasons: it is optimized for working with large data, it has its own procedural language, and SQL is easy to use.

Performance tests

We made two kinds of tests. The first one compares the load time of an existing Semantic Web repository based on a relational database with the newly developed SemWeb repository. As the existing repository we chose Sesame v1.2 (the Sesame v2.0 beta does not support a relational database) due to its popularity in the Semantic Web community. The second test was designed to predict the load time curve. The tests used large data containing 23 654 790 triples (a 3396 MB Turtle file [Turtle]).

Test environment

The tests were performed on three machines:
1. a desktop computer (1x CPU Pentium-M 1.7 GHz, 1.5 GB RAM; the DB instance was assigned 256 MB RAM and a 512 MB temporary tablespace),
2. an Oracle database server (2x CPU Xeon 3.06 GHz with hyper-threading; the DB instance was assigned 1.0 GB RAM),
3. an application server (2x CPU Quad-Core Xeon 1.6 GHz, 8 GB RAM; the application was assigned 256 MB RAM).

The first machine was used for the comparison between the SemWeb repository and the Sesame-db repository; the other two were used for the large data import and for testing query responses.

Figure 3. Comparison between the Sesame-db repository and the SemWeb repository.

Comparison with the Sesame-db repository

The main goal of this test was to compare the SemWeb repository with an existing solution based on a relational database. The Sesame-db repository was connected to a local instance of the Oracle database; the SemWeb repository was connected to the same instance. We tried to load 150 000 triples into each of them. The SemWeb repository loaded the data in 780 seconds. Sesame-db failed near 118 000 loaded triples; the error was insufficient space in the TEMP tablespace. The load times of both the Sesame-db and the SemWeb repository are shown in Figure 3. The load time of the SemWeb repository is almost linearly dependent on the amount of processed data, whereas Sesame-db shows rather exponential growth. The Sesame-db behavior is expected and matches the one described in [BSW05], which shows that Sesame-db has serious performance issues when loading huge data: the load time greatly increases with the size of the input. The SemWeb repository was primarily designed to work with huge semantic data, whereas Sesame-db was probably designed to store moderate amounts; its database schema and SQL statements are not suited to loading this amount of data. According to this test, smaller data (up to about 110 000 triples in this machine configuration) may be loaded into Sesame-db, but Sesame-db is not suitable for larger data.

Huge data loading

The main goal of this test was to show whether the SemWeb data store can load huge RDF data. The implementation indicated the bottlenecks of the solution and helped us to find further improvements.
Improvements addressing some of these bottlenecks were incorporated into the current solution; others are postponed to future work. The data were loaded in batches of 100 000 triples. The whole load took 22 hours and 54 minutes, out of which 13 hours and 44 minutes were spent transferring data from the source files to temporary tables and another 30 minutes were spent on cleanup actions. The dependency of the load time on the number of loaded triples is shown in Figure 4.
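The batched load described above amounts to a simple driver loop around the import functions. This is a minimal sketch: MockStore and LoadInBatches are hypothetical stand-ins (the real loader calls the Oracle-backed interface with 100 000-triple batches); only the batching logic itself is what the text describes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for the Oracle-backed data store; it only counts what the
// real InsertTriple/FinishInserts pair would persist.
class MockStore {
public:
    void InsertTriple(const std::string& t) { pending_.push_back(t); }
    void FinishInserts() {                 // propagate the batch to the store
        committed_ += pending_.size();
        ++batches_;
        pending_.clear();
    }
    std::size_t committed() const { return committed_; }
    std::size_t batches() const { return batches_; }
private:
    std::vector<std::string> pending_;
    std::size_t committed_ = 0;
    std::size_t batches_ = 0;
};

// Loads `triples` into `store`, propagating after every `batch_size`
// inserts; returns the total number of committed triples.
std::size_t LoadInBatches(MockStore& store,
                          const std::vector<std::string>& triples,
                          std::size_t batch_size) {
    std::size_t in_batch = 0;
    for (const auto& t : triples) {
        store.InsertTriple(t);
        if (++in_batch == batch_size) {      // batch full: propagate now
            store.FinishInserts();
            in_batch = 0;
        }
    }
    if (in_batch > 0) store.FinishInserts(); // flush the final partial batch
    return store.committed();
}
```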
Figure 4. Two tests of the load time of 23.6 M triples into our Data Store.

Conclusion

We have designed and implemented a new SemWeb repository that allows storing and working with huge semantic data. The SemWeb repository interface is accessible from many programming languages, e.g. Java or C#. Our implementation showed us the bottlenecks of this solution and helped us find further improvements. Some of them (e.g. the load mode and the cache) are partially or fully implemented; the rest are subjects of future work. Future work therefore includes at least these improvements: accelerating the data transfer from files to temporary tables in the database, eliminating the clean-up actions, and optimizing the data processing. Compared with another Semantic Web repository, the implementation demonstrates excellent performance test results: it can load over 23 million RDF triples without any problem. The SemWeb data store is part of the infrastructure for the Semantic Web that is currently used as a platform for further Semantic Web research.

References

[W3CRDF] Carroll, J. J., Klyne, G. (2004): Resource Description Framework (RDF): Concepts and Abstract Syntax, W3C Recommendation, 10 February 2004, http://www.w3.org/tr/2004/rec-rdf-concepts-20040210
[IFR2006] Yaghob, J., Zavoral, F. (2006): Semantic Web Infrastructure using DataPile. The 2006 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, IEEE, Los Alamitos, California, ISBN 0-7695-2749-3, pp. 630-633.
[Turtle] Beckett, D.: Turtle - Terse RDF Triple Language, http://www.dajobe.org/2004/01/turtle
[OCI] Oracle Call Interface, http://www.oracle.com/technology/tech/oci/index.html
[BSW05] Wang, S., Guo, Y., Qasem, A., Heflin, J. (2005): Rapid Benchmarking for Semantic Web Knowledge Base Systems. Technical Report LU-CSE-05-026, CSE Department, Lehigh University.