
Hybrid Storage Scheme for RDF Data Management in Semantic Web

Sung Wan Kim
Department of Computer Information, Sahmyook College
Chungryang P.O. Box 118, Seoul 139-742, Korea
swkim@syu.ac.kr

ABSTRACT: With the advent of the Semantic Web as the next generation of Web technology, large volumes of Semantic Web data described in RDF will appear in the near future. Most previous approaches treat RDF data as a set of triples and store them in a single large relational table. Since processing a query basically requires scanning the whole table, retrieval performance may degrade. In addition, this approach does not scale well. In this paper, we propose a hybrid storage approach for RDF data management. The proposed approach aims to provide good query performance, scalability, manageability, and flexibility. To achieve these goals, we distinguish some frequently appearing properties in RDF data. The RDF data described with a distinguished property is treated independently and stored together in a corresponding property-based table. To process a query with a specific property, we can avoid scanning the whole data and only have to access the corresponding table. For queries with specific properties, the proposed scheme achieves better performance than the previous approach.

Categories and Subject Descriptors
B.4 [Input/Output and Data Communications]; D.2.12 [Interoperability Web-based Services]; E.2 [Data Storage Representations]; H.2 [Database Management]

General Terms
Hybrid data storage, W3C, Web data management

Keywords: RDF Data management, Semantic web, Data storage scheme

Received 28 Oct. 2005; Revised 15 Dec. 2005; Accepted 16 Jan. 2006

1. Introduction

The W3C has established the Semantic Web as the next-generation Web. The Semantic Web extends the current Web to make Web information meaningful to computers by giving it a well-defined meaning, the so-called semantics.
This semantic data attached to Web information is the foundation of the Semantic Web. The W3C therefore released the Resource Description Framework (RDF) to represent and exchange semantic data about resources on the Web [1]. In this paper we call these data Semantic Web data or, more concisely, RDF data. As the scope of Semantic Web applications is expected to expand, an enormous amount of Semantic Web data will appear in the near future. For example, MusicBrainz is one of the first of what might be called Semantic Web services [12]. It provides information about musical artists, song titles, and so on using metadata described in RDF. Thus, we strongly believe that storing and managing Semantic Web data efficiently plays a key role in realizing the vision of the Semantic Web.

To manage RDF data, most previous approaches use traditional database management systems such as RDBMS and ORDBMS [2][3][4][14]. In these approaches, RDF data is represented as a set of triples and stored in a single large relational table (a so-called triple table). From a data management viewpoint, this has the advantage of directly using the full power of database management systems. However, since processing a query basically requires scanning the whole table, retrieval performance may degrade. In addition, maintaining a single large triple table is not good for scalability.

Footnote 1: This is a revised version of the paper presented at the International Conference on Next Generation Web Services Practices (NWeSP), August 2005, Seoul, Korea.

Recently, Ding et al. [8] reported an analysis of the empirical usage of properties in FOAF (Friend-of-a-Friend) data and revealed the most frequently used properties. We focus on the fact that, among all properties in the FOAF vocabulary, a small number of properties (about 5) together account for over 50% of the total usage.
We believe that, since the most frequently used properties will continue to be widely used both in future FOAF documents and in user queries, it is more efficient to manage them in a special manner.

In order to enhance query performance, we propose in this paper a novel storage scheme for managing RDF data. We also aim to provide scalability, manageability, and flexibility. We maintain RDF data not in a single large table but in several independent tables. We group RDF data according to some distinguished properties and store each group independently in a corresponding table. Thus, we can avoid a full scan of a single large table and obtain good retrieval performance.

The rest of the paper is organized as follows. Section 2 briefly describes the RDF data model and the previous approaches for managing RDF data. Section 3 explains the proposed storage scheme for RDF data management. Section 4 covers our experiment and the results of our performance test. Finally, Section 5 concludes the paper and discusses future work.

2. Data Model and RDF Data Management

In this section, we briefly review the RDF specifications and define the RDF data model. We then compare various representation schemes for a data object in database systems and describe how to apply them to RDF data management.

2.1. RDF Data Model

RDF is a language for describing semantic metadata about resources on the Web [1]. RDF is based on the idea of identifying things using Web identifiers (URIs) and describing resources in terms of simple properties and property values. This enables RDF to represent simple statements about resources. An RDF statement consists of a subject, a predicate, and an object. Such a three-part statement is often referred to as a triple. The subject (S) is a resource URI. The object (O), as the value of the property, may be either a resource URI or a literal value. The predicate (P) denotes a property and is also a resource URI.
Definition 1. A triple $\langle S, P, O \rangle \in R \times U \times (R \cup L)$ is called an RDF triple, where $U$ is a set of URI references (URIs), $L$ is a set of literals, $B$ is a set of blank nodes, and $R = U \cup B$.

An RDF statement can also be modeled as a directed graph of nodes and arcs representing a resource, its property, and its value. Both the subject and the object are represented by nodes. A resource without a URI is represented by a blank node. The predicate is represented by a directed arc from the subject node to the object node and expresses the relationship between the two nodes. Figure 1 shows an example RDF description and its graph representation using the FOAF ontology vocabulary. FOAF (Friend-of-a-Friend) is an ontology that provides a vocabulary for describing personal information such as name, mailbox, and homepage URI. Figure 2 is a screen capture of the RDF triples extracted by the CARA RDF parser [10] from the RDF graph in Figure 1. Thus, RDF is a directed graph-based data model that consists of a set of RDF triples.

Another characteristic of RDF is that it is a property-centered model. A property can be defined independently of a specific class definition and applied to all classes unless domain specifications are explicitly described. Thus we can assert an RDF statement that associates a resource with any property. Consequently, an RDF storage scheme should be flexible with respect to newly added and deleted properties.

Journal of Digital Information Management, Volume 4 Number 1, March 2006, p. 32
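As an illustration of Definition 1, the following minimal Python sketch checks whether a candidate triple lies in R × U × (R ∪ L). The class names `URI`, `BlankNode`, and `Literal`, the function `is_rdf_triple`, and the sample person "M. Kim" are assumptions introduced only for this example; they do not come from any RDF library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class URI:
    """An element of U (URI references)."""
    value: str


@dataclass(frozen=True)
class BlankNode:
    """An element of B (blank nodes)."""
    label: str


@dataclass(frozen=True)
class Literal:
    """An element of L (literals)."""
    value: str


def is_rdf_triple(s, p, o) -> bool:
    """Return True if (s, p, o) is in R x U x (R ∪ L), with R = U ∪ B."""
    subject_ok = isinstance(s, (URI, BlankNode))            # S in R
    predicate_ok = isinstance(p, URI)                       # P in U
    object_ok = isinstance(o, (URI, BlankNode, Literal))    # O in R ∪ L
    return subject_ok and predicate_ok and object_ok


FOAF = "http://xmlns.com/foaf/0.1/"
person = BlankNode("mkim")  # hypothetical resource without a URI

# A literal is allowed as the object but not as the subject.
name_triple = (person, URI(FOAF + "name"), Literal("M. Kim"))
bad_triple = (Literal("M. Kim"), URI(FOAF + "name"), person)

print(is_rdf_triple(*name_triple))
print(is_rdf_triple(*bad_triple))
```

The check mirrors the asymmetry in the definition: blank nodes may appear as subject or object, literals only as object, and the predicate must always be a URI.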

Figure 1. An Example RDF Description and RDF Graph

Figure 2. Extracted RDF Triples

2.2. RDF Data Management

In traditional database systems, a data object is represented as a row in a relational table. In this approach, since the different attributes of an object are grouped together, only one record per object is required. Depending on the characteristics of the data objects, however, many null values may appear. This approach also offers poor flexibility for schema evolution, such as attribute insertion and deletion. It is referred to as the horizontal storage approach.

The decomposition storage model was introduced in [6]. It divides a table of the horizontal storage approach into as many binary tables as there are attributes, so that all data objects are physically grouped by attribute. We call this scheme the binary storage approach. Since each table name is the same as the attribute name in the horizontal table, there is no need to explicitly maintain an attribute field in each binary table. Null values do not appear either.

The vertical storage approach stores a data object in a different manner: it generates a 3-ary table consisting of an object identifier, an attribute name, and an attribute value [7]. In the vertical scheme, since the table contains records only for those attributes that are present in an object, there is no null value. The different attributes of an object are tied together by the same OID. Schema evolution is easier than in the previous schemes.

Among these approaches, the binary approach has shown better performance in query processing, such as projection, selection, join, and aggregation, than the other two schemes. The poor performance of the horizontal scheme is mainly caused by I/O operations. For further details on the performance comparison, refer to [7]. Figure 3 compares a horizontal table with its corresponding representations in the vertical and binary approaches.
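The relationship between the three layouts can be sketched in a few lines of Python. The sample rows and the helper names `to_vertical` and `to_binary` are hypothetical and serve only to show how a horizontal table with null values maps to the vertical and binary representations.

```python
# Hypothetical horizontal table: one row per object, None marks a null value.
horizontal = [
    {"oid": 1, "name": "Kim", "nick": None, "homepage": "http://example.org/kim"},
    {"oid": 2, "name": "Lee", "nick": "lee", "homepage": None},
]


def to_vertical(rows):
    """Vertical layout: one (oid, attribute, value) record per non-null value."""
    return [
        (row["oid"], attr, value)
        for row in rows
        for attr, value in row.items()
        if attr != "oid" and value is not None
    ]


def to_binary(rows):
    """Binary layout: one (oid, value) table per attribute.

    The table name (dictionary key) takes the place of the attribute column.
    """
    tables = {}
    for oid, attr, value in to_vertical(rows):
        tables.setdefault(attr, []).append((oid, value))
    return tables


vertical = to_vertical(horizontal)
binary = to_binary(horizontal)

for record in vertical:
    print(record)
for name, table in binary.items():
    print(name, table)
```

Neither derived layout stores a null: the two missing values in the horizontal rows simply produce no record, which is the property both the vertical and the binary approach rely on.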
Most approaches for managing Semantic Web data described in RDF regard an RDF graph as a set of RDF triples and store them in a relational table [3][4][14]. The basic storage schema consists of a single large triple table that stores the set of RDF triples. To reduce disk space, some additional tables, such as a resource table and a literal table, are generated. The resource table maintains all resources, including properties. The literal table maintains all literal values. The triple table consists of subject, predicate, and object fields; it stores the RDF triples and references the other tables. A resource description produces as many records as it has RDF triples with different properties. Thus, the triple table can be regarded as an application of the vertical approach mentioned above. Since only one table is maintained apart from the additional tables, it is easy to manage data objects (RDF triples). For example, an RDF statement with a new property for a resource is easily inserted into the table. However, since processing a query basically requires scanning the whole table, retrieval performance may degrade. In addition, the approach does not scale well.

In Jena 2 [5], besides a triple table, a horizontal table-based approach is used in which several properties of a resource are clustered. Thus, the related properties and values of a resource can be accessed together. However, many null values may appear in the table, and adding or deleting a property in the table is very expensive.

Figure 3. Comparison of Storage Schema

3. Hybrid Storage Scheme

3.1. Hybrid Approach

As in the previous approaches, we fundamentally interpret an RDF graph as a set of RDF triples and store it in relational tables. The proposed RDF storage scheme aims to provide:

- Query performance: In general, most RDF queries are given with a specific property. Thus, a query on a specific property, with or without a value, should be processed efficiently.
- Scalability and manageability: The proposed scheme should be both scalable and manageable. The single large triple table of the previous scheme is good for manageability but not for scalability.
- Flexibility: RDF is not a resource-centric (or object-centric) but a property-centric data model. It should therefore be easy to insert an RDF statement that describes a resource with a new property.

Our basic philosophy is that the properties that most commonly appear in RDF data will also be frequently used in the future, both in new data and in user queries. Thus, we handle them specially. To achieve the above goals, we distinguish properties that appear frequently in RDF data and properties that are used frequently in user queries. RDF statements described with these distinguished properties are grouped by property and maintained independently. For this, we adopt the binary storage scheme mentioned in the previous section and maintain an independent table for each property. However, maintaining as many tables as there are properties may cause management overhead if the number of properties is large. Thus, we divide the RDF statements into two categories and manage them in different ways, which we call a hybrid approach. The first category is the set of RDF statements described with the distinguished properties. These RDF statements are maintained in independent binary tables according to their properties. The remaining RDF statements are maintained in the conventional way using a single triple table.
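The two-category split described above can be sketched as a simple in-memory routine. This is only an illustration under assumed names: `partition_triples`, the sample triples, and the chosen set of distinguished properties are hypothetical and are not taken from the implementation evaluated in this paper.

```python
from collections import defaultdict


def partition_triples(triples, distinguished):
    """Split triples into per-property binary tables and one shared triple table."""
    arc_tables = defaultdict(list)  # property -> [(subject, object)]
    arc_others = []                 # [(subject, predicate, object)]
    for s, p, o in triples:
        if p in distinguished:
            # Distinguished property: the table identity encodes the predicate.
            arc_tables[p].append((s, o))
        else:
            # Everything else keeps the predicate explicitly.
            arc_others.append((s, p, o))
    return dict(arc_tables), arc_others


triples = [
    ("genid:a", "foaf:name", "Alice"),
    ("genid:a", "foaf:knows", "genid:b"),
    ("genid:b", "foaf:name", "Bob"),
    ("genid:b", "foaf:givenname", "Robert"),
]

arc_tables, arc_others = partition_triples(triples, {"foaf:name", "foaf:knows"})
print(arc_tables)
print(arc_others)
```

Every triple lands in exactly one place, so the tables together hold the same data as a single triple table without duplication.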
Figure 4 shows a brief schema diagram of the proposed storage structure. As a result, the RDF statements with a frequently used property are physically grouped in a single independent table. On the other hand, RDF statements with less important properties are grouped together in a common triple table. There is no duplication among these tables. Finally, since a query with a specific, distinguished property is evaluated by accessing only the corresponding property-based arc table, we can achieve high performance in query processing.

3.2. Managing the Property-based Arc Tables

Property-based arc tables may be generated selectively according to the characteristics of each property. To do this, we first have to select the distinguished properties by analyzing the RDF data and the RDF query log. For example, properties that appear frequently in the RDF data of an application domain and are used frequently in user queries may be candidates. Generally, the vocabulary of properties used in a specific domain is already defined and limited. In particular, the number of frequently used properties is also limited.

An analysis of FOAF (Friend-of-a-Friend) document usage in the real Web environment was reported in [8]. It analyzed the empirical usage of properties in FOAF data and revealed the most frequently used properties. Among all the properties in the FOAF vocabulary, a few properties such as foaf:name, foaf:mbox_sha1sum, foaf:homepage, foaf:knows, and foaf:nick together account for over 50% of the total usage. Since the most frequently used properties will continue to be used in future FOAF documents and make up a large portion of the entire data, it is efficient to manage them independently. The hybrid approach scales well since RDF data is physically distributed over several tables instead of being stored in a single table.
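One possible way to pick the distinguished properties from data alone is a frequency count with a coverage threshold. The sketch below is an assumption made for illustration: the function `select_distinguished`, the 50% `coverage` default, and the `max_tables` cap are hypothetical choices, and the paper does not prescribe a selection procedure or an optimal number of tables.

```python
from collections import Counter


def select_distinguished(triples, coverage=0.5, max_tables=5):
    """Pick the most frequent properties until they cover `coverage` of all triples.

    Both parameters are illustrative knobs, not values fixed by the scheme.
    """
    counts = Counter(p for _, p, _ in triples)
    total = sum(counts.values())
    chosen, covered = [], 0
    for prop, n in counts.most_common():
        # Stop once the chosen properties cover enough triples or the cap is hit.
        if covered / total >= coverage or len(chosen) >= max_tables:
            break
        chosen.append(prop)
        covered += n
    return chosen


# Hypothetical usage counts: 4 + 3 + 2 + 1 = 10 triples.
sample = (
    [("s", "foaf:name", "o")] * 4
    + [("s", "foaf:knows", "o")] * 3
    + [("s", "foaf:nick", "o")] * 2
    + [("s", "foaf:givenname", "o")] * 1
)

print(select_distinguished(sample))
```

With the sample counts, foaf:name and foaf:knows together cover 70% of the triples, so selection stops after two properties. A query log, where available, could be counted in the same way and combined with the data frequencies.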
Maintaining the property-based arc tables also gives a chance to enhance performance, especially for retrieval queries with specific properties. Although we have not yet seen other literature that analyzes query logs for frequently used properties, it is reasonable to expect that the properties used frequently in data will also be used often in queries over RDF data. Thus, maintaining property-based arc tables for several important properties has a strong advantage in terms of managing and querying RDF data.

Maintaining only a single triple table, as in the previous storage scheme, gives good manageability and flexibility: inserting an RDF triple described with a newly appearing property and deleting RDF triples are easily handled. In the proposed hybrid scheme, all remaining RDF triples described with non-distinguished properties are stored together in a single table like the triple table of the previous scheme, so they can be handled with the same flexibility. The proposed scheme also provides good manageability by maintaining independent tables only as necessary.

Figure 4. Database Schema for the Proposed Storage Scheme

The resource table basically maintains all resources identifiable via URI reference and consists of resource identifier, namespace, resource name, and resource type fields. The literal table maintains all literal values and their related information. Additional fields carrying more information about a literal value may also be included. These two tables are referenced by the other tables. As mentioned above, we generate independent binary tables only for distinguished properties. Only RDF statements described with distinguished properties are stored in these tables, according to their properties. We call these tables property-based arc tables. Each basically consists of arc identifier, subject, and object fields. Since the table name is the same as the property name, we don't need to explicitly maintain a predicate field. The arc_others table holds the remaining RDF statements. It plays the same role as the triple table in the previous approaches and contains the predicate field explicitly.

4. Performance Experiments

We now describe the results of an experiment that evaluates the performance of the proposed scheme, including a performance comparison with the previous scheme described in Section 2.

4.1. Experimental Conditions

In order to compare the previous triple table-based scheme with the proposed scheme, we implemented both schemes directly. As in Jena and Sesame/MySQL, only a single triple table-based approach was implemented for the previous scheme. As mentioned in Section 2, it can be regarded as applying the vertical storage scheme. Thus, we call this previous approach the vertical scheme in this section. Since the experiment focuses only on how RDF data is managed in both storage structures, we implemented the core storage structure and retrieval module of the vertical approach instead of installing comparative systems such as Jena and Sesame/MySQL. The proposed hybrid scheme maintains 5 property-based arc tables for 5 distinguished properties and the arc_others table for the others. Since the resource URIs and literal values in the test data set are not long, an in-lining approach is adopted in both schemes. That is, resource URIs and literal values are stored directly in the triple table and the property-based arc tables. Thus, the resource table and literal table are not implemented in either scheme. The in-lining approach reduces the number of join operations and improves performance, although space overhead increases. Refer to [13] for more on the in-lining approach.

| No. | Description | Query in RDQL format |
|---|---|---|
| Q1 | Return all statements with a specific property (most frequently used property) | SELECT ?x, ?z WHERE (?x <foaf:name> ?z) |
| Q2 | Return all statements with a specific property (least frequently used property) | SELECT ?x, ?z WHERE (?x <foaf:givenname> ?z) |
| Q3 | Find all properties and their values of a specified resource | SELECT ?y, ?z WHERE (http://www.picdiary.com/pics.rdf#photolists ?y ?z) |
| Q4 | Find all value pairs for two related properties of a specified resource | SELECT ?x, ?y WHERE (<genid:mkim>, <foaf:name>, ?x), (<genid:mkim>, <foaf:knows>, ?y) |
| Q5 | Find all object pairs for two related properties grouped by 1st object and having the number of 2nd objects less than 200 | Not described in RDQL format |
| Q6 | Return all instances known by a resource whose name is given (graph pattern query) | SELECT ?z WHERE (?x <foaf:name> "Dr. Steven R. Newcomb"), (?x <foaf:knows> ?z) |
| Q7 | Find all name values a specified resource knows (path query) | SELECT ?z WHERE (<genid:pldms> <foaf:knows> ?y), |

* USING clause (e.g. USING foaf FOR http://xmlns.com/foaf/0.1/) is omitted in Q1 to Q7

Table 1. Test Queries for Performance Evaluation

For the triple table in the vertical approach and the arc_others table in the hybrid approach, we indexed the subject, property, and object columns independently. We also indexed the subject and object fields of each property-based arc table in the hybrid approach. To implement the storage and retrieval modules of both schemes, we used APM_Setup 5 for Win 32, which consists of the MySQL database management system, the PHP language and interpreter, and the Apache web server. We use the open CARA parser [10] as the RDF parser to extract RDF triples. The test is performed on a machine with a Pentium III 866 MHz processor, 256 MB of main memory, and a 20 GB hard disk running Windows 2000 professional server.
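The storage layout and index choices described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiment: it uses Python's built-in sqlite3 module as a stand-in for MySQL and PHP, keeps only two distinguished properties instead of five, and the table names `arc_foaf_name` and `arc_foaf_knows`, the index names, and the helpers `insert_triple` and `query_by_property` are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical mapping from distinguished property to its arc table name.
DISTINGUISHED = {"foaf:name": "arc_foaf_name", "foaf:knows": "arc_foaf_knows"}

# Property-based arc tables: no predicate column, indices on subject and object.
for table in DISTINGUISHED.values():
    conn.execute(
        f"CREATE TABLE {table} (arc_id INTEGER PRIMARY KEY, subject TEXT, object TEXT)"
    )
    conn.execute(f"CREATE INDEX idx_{table}_s ON {table} (subject)")
    conn.execute(f"CREATE INDEX idx_{table}_o ON {table} (object)")

# arc_others: explicit predicate column, one independent index per column.
conn.execute(
    "CREATE TABLE arc_others "
    "(arc_id INTEGER PRIMARY KEY, subject TEXT, predicate TEXT, object TEXT)"
)
for col in ("subject", "predicate", "object"):
    conn.execute(f"CREATE INDEX idx_others_{col} ON arc_others ({col})")


def insert_triple(s, p, o):
    """Store an in-lined triple in its arc table, or in arc_others."""
    table = DISTINGUISHED.get(p)
    if table:
        conn.execute(f"INSERT INTO {table} (subject, object) VALUES (?, ?)", (s, o))
    else:
        conn.execute(
            "INSERT INTO arc_others (subject, predicate, object) VALUES (?, ?, ?)",
            (s, p, o),
        )


def query_by_property(p):
    """Q1/Q2-style query: all (subject, object) pairs for one property."""
    table = DISTINGUISHED.get(p)
    if table:
        sql = f"SELECT subject, object FROM {table} ORDER BY arc_id"
        return conn.execute(sql).fetchall()
    sql = "SELECT subject, object FROM arc_others WHERE predicate = ? ORDER BY arc_id"
    return conn.execute(sql, (p,)).fetchall()


insert_triple("genid:a", "foaf:name", "Alice")
insert_triple("genid:a", "foaf:knows", "genid:b")
insert_triple("genid:b", "foaf:givenname", "Robert")

print(query_by_property("foaf:name"))
print(query_by_property("foaf:givenname"))
```

A query on a distinguished property touches only its own table and needs no predicate filter, while a query on any other property falls back to a filtered lookup in arc_others.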
As test data, we use a single FOAF ontology-based RDF file generated by the FOAF developer site [9]. Analysis of the test RDF data showed that the number of different properties is about 50. Similar to the result in [8], some properties such as foaf:name, foaf:mbox_sha1sum, foaf:thumbnail, foaf:knows, and rdfs:seealso are the most frequently used. Together they account for over 50% of the total usage frequency. The number of extracted RDF triples is about 100,000.

4.2. Performance Results

We first measured storage requirements. The required database sizes for the vertical and hybrid schemes are about 58 MB and 44 MB, respectively. The index sizes are 3.2 MB and 2.4 MB, respectively. The vertical scheme uses more space than the hybrid scheme because it explicitly maintains the property field. In the proposed scheme, on the other hand, each property-based arc table is named after the corresponding property. Thus, except in the arc_others table, the property field is not stored explicitly.

Next we measured retrieval times. RDF is a directed graph-based model, often represented as a set of RDF triples, as mentioned in an earlier section. In most RDF query languages, such as RDQL, the basic query form is founded on the triple pattern. A triple pattern is composed of named variables, URIs, or literals. We used seven triple pattern-based queries for the test, as shown in Table 1. We express these queries in RDQL [11], which is one of the RDF query languages. Queries Q1-Q3 are based on a single triple pattern. Q4 and Q5 are more complex than Q1-Q3. Q4, Q6, and Q7 are based on a graph pattern. In particular, Q7 is a path-based query. Each query was issued several times after a cold boot to completely flush the buffer cache.

In real applications, the number of extracted RDF triples may be very large. Therefore, database schema design factors such as indexing affect query performance, and how to index each table in the storage schema efficiently is important.
In order to observe and analyze the influence of indices, we first experiment with both approaches without indices. We then experiment with both approaches with proper indices. Table 2 shows the average retrieval times for initial executions without indices. Due to cache effects, response times for subsequent executions were much lower than for the initial execution. Since observing cache effects is not a goal of this experiment, we report only the retrieval times for initial executions in this section.

| Query | # of results | Vertical (sec.) | Hybrid (sec.) |
|---|---|---|---|
| Q1 | 16,052 | 40.55 | 6.41 |
| Q2 | 60 | 38.08 | 8.62 |
| Q3 | 1,150 | 39.08 | 30.02 |
| Q4 | 15 | 86.42 | 10.53 |
| Q5 | 19 | - | 625.14 |
| Q6 | 10 | 78.79 | 10.32 |
| Q7 | 10 | 71.41 | 10.37 |

Table 2. A Comparison of Retrieval Times (without indices)

As a whole, the vertical scheme shows lower performance than the proposed scheme. As expected, for queries with a specific property (Q1, Q2, Q4, Q6, Q7), the proposed scheme achieves about 4 to 8 times better retrieval performance than the vertical scheme. For a query issued without a specific property (Q3), the proposed hybrid approach is also faster than the vertical scheme. More disk I/O operations in the vertical scheme are one of the reasons for its lower performance. The graph pattern-based queries Q4-Q7 require a self-join operation in the vertical scheme and a 2-way join operation in the proposed scheme. Because of this, graph pattern-based queries take more time than the simple triple pattern queries in the vertical scheme. Q5 is a very exhaustive query. It requires two full scans of the triple table in the vertical scheme and full scans of the two related tables in the proposed scheme. In the vertical approach, Q5 exceeds the maximum execution time.

Table 3 shows the average retrieval times with indices for initial executions.
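To make the join behaviour of graph pattern queries concrete, the following sketch evaluates a Q6-style pattern in both layouts. It is an assumption-laden illustration: it uses Python's built-in sqlite3 module rather than the MySQL setup of the experiment, and the table names `triples`, `arc_foaf_name`, and `arc_foaf_knows` as well as the sample data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT);
CREATE TABLE arc_foaf_name (subject TEXT, object TEXT);
CREATE TABLE arc_foaf_knows (subject TEXT, object TEXT);
""")

data = [
    ("genid:a", "foaf:name", "Alice"),
    ("genid:a", "foaf:knows", "genid:b"),
    ("genid:a", "foaf:knows", "genid:c"),
    ("genid:b", "foaf:name", "Bob"),
    ("genid:b", "foaf:knows", "genid:a"),
]

# Load the same triples into both layouts.
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", data)
for s, p, o in data:
    table = "arc_foaf_name" if p == "foaf:name" else "arc_foaf_knows"
    conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (s, o))

# Vertical scheme: self-join of the triple table on the shared subject ?x.
vertical_q6 = conn.execute(
    """
    SELECT t2.object
    FROM triples t1 JOIN triples t2 ON t1.subject = t2.subject
    WHERE t1.predicate = 'foaf:name' AND t1.object = ?
      AND t2.predicate = 'foaf:knows'
    ORDER BY t2.object
    """,
    ("Alice",),
).fetchall()

# Hybrid scheme: 2-way join between two property-based arc tables.
hybrid_q6 = conn.execute(
    """
    SELECT k.object
    FROM arc_foaf_name n JOIN arc_foaf_knows k ON n.subject = k.subject
    WHERE n.object = ?
    ORDER BY k.object
    """,
    ("Alice",),
).fetchall()

print(vertical_q6)
print(hybrid_q6)
```

Both queries return the same resources, but the hybrid form joins two smaller tables and drops the predicate filters, since each arc table already contains only one property.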
In [14] it was demonstrated that building independent indices on each column of the triple table is superior to other index combinations. We therefore adopted this index scheme for the triple table in the vertical approach and the arc_others table in the hybrid approach. For each property-based arc table, the subject and object fields are indexed independently.

| Query | # of results | Vertical (sec.) | Hybrid (sec.) |
|---|---|---|---|
| Q1 | 16,052 | 41.34 | 6.38 |
| Q2 | 60 | 0.66 | 0.56 |
| Q3 | 1,150 | 1.97 | 2.72 |
| Q4 | 15 | 0.51 | 0.39 |
| Q5 | 19 | - | 19.36 |
| Q6 | 10 | 0.44 | 0.34 |
| Q7 | 10 | 0.94 | 0.38 |

Table 3. A Comparison of Retrieval Times (with indices)

With indices, both approaches show a substantial performance improvement, as shown in Table 3. Although the performance differences between the two schemes are reduced as a result of adopting a proper index scheme, the hybrid approach shows an overall performance improvement over the vertical one. For Q1, an index is used only in the vertical approach, while the hybrid approach scans the table; even so, the hybrid approach is more than 6 times faster. This comes from the fact that the selectivity is somewhat low. Low selectivity means that the percentage of returned rows is high. For Q3 (no specific property given), the vertical approach is slightly faster because the hybrid approach has to access more tables and the selectivity is high (about 1%). For Q5, since the MySQL DBMS chooses only one of the indices created on a table to execute a query (the index on the subject field in this case), the tables have to be fully scanned in both approaches. The query takes about 19 seconds in the hybrid approach, whereas it exceeds the maximum execution time in the vertical approach.

5. Conclusion and Future Works

A large quantity of Semantic Web data described in RDF will appear in the near future. In most previous approaches, RDF data is stored in a single large relational table called a triple table. Since processing a query basically requires scanning the whole table, retrieval performance may degrade. In addition, the approach does not scale well. We propose a hybrid approach in this paper. First, we distinguish some important properties according to their appearance and usage frequency. Then, based on the binary storage scheme, we generate several property-based tables for the distinguished properties to treat each property independently. RDF statements described with a specific property are grouped and stored in the corresponding table. Thus we can avoid scanning the whole data and achieve better retrieval performance. The other RDF statements, described with non-distinguished properties, are managed in the same manner as in the previous approach.
The hybrid approach also provides good manageability by maintaining several independent tables only as necessary. In addition, it scales well since RDF data is physically distributed over several tables instead of being stored in a single table. Finally, we implemented and evaluated the proposed scheme. The proposed scheme shows better performance especially for retrieval queries with specific properties.

How to analyze the usage frequency of properties remains as future work. In the near future, we plan to analyze and evaluate the optimal number of property-based tables to maintain. In this paper we do not consider ontology languages such as RDF Schema and OWL. Since RDF Schema and OWL documents can fundamentally be described in RDF syntax, we can apply the proposed storage scheme to manage them. However, ontological data described in RDF Schema or OWL has different characteristics from RDF data. Thus, it may be more efficient to treat it in a different manner. We are currently investigating the design of a management scheme for ontological data and its connection with the hybrid storage scheme proposed in this paper.

Sung Wan Kim

He is an assistant professor in the Department of Computer Information at Sahmyook College, Korea. He received the B.Eng. degree with First Class Honors in Computer Science from Myongji University, Korea in 1996, and the M.S. and Ph.D. degrees in Computer Science from Hongik University, Korea in 1998 and 2003, respectively. His current research interests are in the areas of XML and the semantic web from the viewpoint of database systems.

Acknowledgements

This work was supported by the Sahmyook College Research Fund in 2005. I wish to thank Brenda Yoon for her valuable proofreading efforts on this manuscript.

References

[1] W3C (2004). RDF Primer. (http://www.w3c.org)
[2] Melnik, S. (2004). Storing RDF in a Relational Database. (http://www-db.stanford.edu/~melnik/rdf/db.html)
[3] McBride, B. (2001). Jena: Implementing the RDF Model and Syntax. Proc. of the Second International Workshop on the Semantic Web (SemWeb 2001).
[4] Broekstra, J. et al. (2002). Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema. Proc. of the 1st Int'l Semantic Web Conference. 54-68.
[5] Wilkinson, K. et al. (2003). Efficient RDF Storage and Retrieval in Jena2. Proc. of the 1st International Workshop on Semantic Web and Databases. 131-151.
[6] Copeland, G., Khoshafian, S. (1985). A Decomposition Storage Model. Proc. of the ACM SIGMOD Int'l Conf. on Management of Data. 268-279.
[7] Agrawal, R., Somani, A., Xu, Y. (2001). Storage and Querying of E-Commerce Data. Proc. of the 27th Int'l Conf. on Very Large Data Bases (VLDB). 149-158.
[8] Ding, L. et al. (2005). How the Semantic Web is Being Used: An Analysis of FOAF. Proc. of the 38th Hawaii Int'l Conf. on System Sciences.
[9] FOAF project. (http://www.foaf-project.org)
[10] CARA RDF Parser. (http://cara.sourceforge.net)
[11] Seaborne, A. (2004). Jena Tutorial: A Programmer's Introduction - RDQL. (http://jena.sourceforge.net/tutorial/rdql/)
[12] Swartz, A. (2002). MusicBrainz: A Semantic Web Service. IEEE Intelligent Systems, 17 (1) 76-77.
[13] Florescu, D., Kossmann, D. (1999). Storing and Querying XML Data using an RDBMS. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. 22 (3) 27-34.
[14] Ma, L. et al. (2004). RStar: an RDF Storage and Query System for Enterprise Resource Management. Proc. of the 13th ACM Conf. on Information and Knowledge Management. 484-491.