Querying DBpedia Using HIVE-QL

AHMED SALAMA ISMAIL 1, HAYTHAM AL-FEEL 2, HODA M. O. MOKHTAR 3
Information Systems Department, Faculty of Computers and Information
1, 2 Fayoum University, Fayoum, Egypt
3 Cairo University, Cairo, Egypt
1 asi01@fayoum.edu.eg  2 htf00@fayoum.edu.eg  3 h.mokhtar@fci-cu.edu.eg

Abstract: - DBpedia is considered one of the main data hubs on the web nowadays. DBpedia extracts its content from different Wikipedia editions, which grow dramatically day after day as new chapters join. It has become a big data environment describing 38.3 million things with more than 3 billion facts. This data explosion affects both the efficiency and accuracy of the retrieved data. From this point of view, we propose a new architecture that deals with DBpedia using big data techniques in addition to Semantic Web principles and technologies. Our proposed architecture introduces HIVE-QL as a query language for DBpedia instead of the SPARQL query language, which is considered the backbone of semantic web applications. Additionally, this paper presents the implementation and evaluation of this architecture, which minimizes the retrieval time for a query in DBpedia.

Key-Words: - Semantic Web; SPARQL; Big Data; Hadoop; Hive; DBpedia; Map-Reduce

1 Introduction
The Semantic Web [1] is considered an extension of the current web that aims to enrich the meaning of web page content and enables machines and users to work in cooperation. DBpedia is seen as a clear example of both semantic web applications and big data due to its huge number of triples and its different chapters extracted from different Wikipedia editions. It aims to convert the unstructured data published in Wikipedia into a structured form using ontology engineering principles and SPARQL as a standard query language for semantic web triples. On the other hand, big data is a modern term that describes the massive growth of data, both structured and unstructured. Big data challenges have become one of the most pressing problems facing developers due to data Volume, Variety, and Velocity (the 3Vs) [2]. All of these problems need to be highlighted, and innovative solutions need to be proposed based on new techniques such as Hadoop [3] and HIVE-QL [4], which are considered key tools of big data. Studying the problem of the enormous data volume of DBpedia content [5] is one of the primary concerns of this paper, which presents a new approach to querying DBpedia using the HIVE-QL query language.
The rest of this paper is organized as follows: Section 2 discusses the related work; Section 3 overviews the DBpedia project and navigates through Hadoop and Hive. Section 4 presents the proposed architecture for retrieving DBpedia content using HIVE-QL, and Section 5 describes the implementation of this architecture. Section 6 shows the test cases using HIVE-QL and SPARQL and highlights the implementation results. Finally, Section 7 concludes the paper and discusses possible directions for future work.

2 Related Work
The Google search engine is considered a good example of the relation between Semantic Web and big data techniques [6]: it uses Linked Data [7] to retrieve a massive amount of search results and Map-Reduce to accelerate this job. How to use Hadoop and Map-Reduce to retrieve massive semantic data was
discussed in [8]. In addition, an algorithm was proposed to allow users to cast some SPARQL queries into equivalent, simpler queries using Hadoop's Map-Reduce [8], but this work has limitations, such as the difficulty of retrieving data using Hive techniques. On the other hand, Liu proposed an architecture that uses Hadoop and Map-Reduce to answer multiple SPARQL queries simultaneously by dividing the data rows into a number of clusters that are processed together, saving querying effort and time. A system based on Hadoop named HadoopSparql was developed in [9] to allow the handling of multiple queries simultaneously by using a multi-way join operator instead of the traditional two-way join operator. In addition, an optimization algorithm was proposed to calculate the best join order and to allow users to access the system through web browsers. How to store and retrieve a large RDF graph using Hadoop and Map-Reduce [10] is considered an important piece of work in this domain and is discussed in [11]. The main problem addressed there was the handling of a large amount of semantic web data, and it was shown that current semantic web architectures do not scale well. Generating data in RDF/XML format using the LUBM data generator and converting the generated data to N-Triples format with the N-Triple Converter software based on the Jena architecture was discussed in [12]. This work shows that the increase in the time needed to answer a query is always less than the increase in the dataset size. However, this approach also has limitations, such as its complexity and retrieval time. On the other hand, building a Hadoop RDF system was one of the main goals in [13], gathering the advantages of both semantic web RDF content and big data environments, like Hadoop, to avoid the limitation of querying RDF triples on a single machine. This system depends on parsing the SPARQL query into a HIVE-QL query. The distribution of a large RDF triplestore using HBase for storing triples and HIVE-QL for generating query results was discussed in [14]. However, this work has shortcomings: it needs an optimization of the translation from SPARQL to HIVE-QL and also needs to refine its data model.

3 DBpedia, Hadoop & HIVE
3.1 DBpedia
DBpedia is considered one of the most remarkable achievements of the Semantic Web nowadays due to its importance in adding structure to Wikipedia content. The DBpedia project was initially established as a collaboration between the Freie Universität Berlin and the University of Leipzig, together with OpenLink Software. The first DBpedia dataset was published in 2007 and was made available under free licenses, allowing others to reuse it [15] [16]. DBpedia allows users to ask sophisticated queries on datasets that are gathered from Wikipedia and to link other datasets available on the Web to Wikipedia data [17]. In addition, DBpedia-Live provides a live streaming method based on the updates of Wikipedia. DBpedia-Live depends on local copies of Wikipedia that are synchronized via the Open Archives Initiative Protocol for Metadata Harvesting [18], which enables a continuous stream of updates from a wiki [17]. The DBpedia-Live Extraction Manager is applied to any processed page, and the extracted triples are inserted into the triplestore in the form of N-Triples [17] that can be queried via the SPARQL query language. The DBpedia architecture [19], shown in Fig. 1, consists of the following main modules.
Wikipedia is considered the source of the data; its infoboxes are processed by different extractors, such as the label, redirect, and abstract extractors. The extracted data are embedded into DBpedia in the form of an N-Triples dump stored in the Virtuoso triplestore, which enables users to execute sophisticated queries through a SPARQL endpoint.

Fig. 1. DBpedia Architecture components

3.2 Hadoop & Hive
The Apache Hadoop project [3] is concerned with the development of open-source software that is scalable, reliable, and distributed. Hadoop is an architecture that allows the distributed processing of massive datasets across clusters of computers using simple programming models, such as the Hive language [20], which performs map-reduce jobs in its back end.
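As a minimal illustration of this point (not taken from the paper's figures), consider a simple HIVE-QL aggregation over a hypothetical three-column triples table like the one built later in Sections 4 and 5; the table and column names are assumptions used only for this sketch.

```sql
-- Minimal sketch; table and column names are illustrative assumptions.
-- Hive translates this GROUP BY into a map-reduce job behind the scenes:
-- mappers emit (predicate, 1) pairs and reducers sum the counts.
SELECT predicate, COUNT(*) AS triple_count
FROM dbpedia_triples
GROUP BY predicate
ORDER BY triple_count DESC
LIMIT 20;
```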
In addition, Hadoop enables scaling up from a single machine to a number of machines, each of which has its own local computation and storage. The Hadoop Distributed File System (HDFS) [21], the Ontology Cloud Storage System (OCSS) [22], and the Distributed Ontology Cloud Storage System (DOCSS) [23] are examples of distributing information via the Hadoop architecture.

4 Proposed Architecture
Dealing with large-scale semantic content using big data techniques is the main contribution of our work. HIVE-QL is used in our architecture instead of SPARQL, which is regarded as one of the backbones of the semantic web, and DBpedia is used as an example of a big semantic dataset available on the web nowadays. In our architecture, DBpedia datasets are converted into Comma-Separated Values (CSV) files instead of using the HBase triplestore. The architecture consists of three main phases, as shown in Fig. 2.

Fig. 2. Proposed Architecture

1) Input Phase
This phase is based on downloading a local copy of the DBpedia dumps. The downloaded dumps of the DBpedia datasets are stored in the DBpedia triplestore in N-Triples format and are then transformed into RDF format using an RDF validator.

2) Data Storing and Processing Phase
The stored RDF datasets are reformatted into CSV format. They are then loaded as a triplestore into HDFS in Hadoop, which we use here as a distributed file system. After that, the Map-Reduce model divides the massive datasets and performs parallel processing tasks on them. This processing is one of the Hive language functions performed in the Hive back end.

3) Data Querying Phase
At this stage, the CSV datasets are loaded into the new Hive data store. Instead of SPARQL statements, HIVE-QL query statements are used. In this way, any sophisticated query is executed using HIVE-QL on the Hive platform to get the required user results.

5 Implementation
We present our approach as a desktop application implemented using the Java and Scala programming languages to allow users to query any row of data stored in the DBpedia datasets. The physical environment for our experiment consists of an Intel Core i5 machine with 4 GB of memory and a 160 GB hard disk running the Ubuntu 12.04 Linux operating system. The DBpedia triplestore is stored on the DBpedia server, from which we download a local copy of the latest version of the English chapter of DBpedia, which is 3.9, as shown in Fig. 3.

Fig. 3. DBpedia 2014 download server

The English chapter of DBpedia consists of 583 million RDF triples that are divided into different categories; each category is provided in four RDF formats: N-Quads (nq), N-Triples (nt), Turtle Quads (tql), and Turtle (ttl). After getting the DBpedia data in the required format, we convert the RDF triples into CSV format with three separate columns that represent the subject, predicate, and object of each RDF triple. The converted CSV file is loaded into the HDFS of the Hadoop platform and is then loaded into a Hive table, ready for querying with the HIVE-QL query language.
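The exact statements used in our implementation are not reproduced here; the following is a minimal sketch of how such a CSV file could be registered as a Hive table, where the table name, column names, and HDFS path are illustrative assumptions.

```sql
-- Illustrative sketch; the table name, column names, and HDFS path are assumptions.
CREATE TABLE IF NOT EXISTS dbpedia_triples (
  subject   STRING,
  predicate STRING,
  object    STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- LOAD DATA INPATH moves the file already copied into HDFS
-- into the table's warehouse directory.
LOAD DATA INPATH '/user/hadoop/dbpedia/dbpedia_en_3.9.csv'
INTO TABLE dbpedia_triples;
```

In practice a delimiter other than a comma (for example a tab) may be safer, since object literals in DBpedia can themselves contain commas.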
6 Results
In this section, we present the test cases that verify our architecture, which uses HIVE-QL instead of SPARQL to retrieve data from DBpedia. This section answers the question: which performs better on DBpedia, SPARQL, which is considered one of the main players in the Semantic Web and DBpedia, or HIVE-QL, which is one of the building blocks of our proposed architecture? We have three test cases, varying from simple to sophisticated queries, that are run using both SPARQL and HIVE-QL while measuring the retrieval time of each query language for each test case.

The First Query (Q1) aims to retrieve the founding date of the New_York_Times newspaper, which is one of the resources of the DBpedia dataset, as shown in Fig. 4 and Fig. 5. The SPARQL query is shown in Fig. 6, and the founding date it retrieves is shown in Fig. 7. Our proposed HIVE-QL query that retrieves the requested data is shown in Fig. 8. Our query using HIVE-QL takes only 40 secs to retrieve the same results, whereas SPARQL takes 370 secs. Our technique depends on dividing the query into two sub-queries: the first retrieves the organization by which the New York Times was founded, and the second retrieves the foundation year of this company, within a few seconds.

Fig. 4. New York Times resource with its properties & values in DBpedia
Fig. 5. The founding date that should be retrieved from DBpedia
Fig. 6. DBpedia Query using SPARQL query language
Fig. 7. DBpedia Query result using SPARQL query language
Fig. 8. DBpedia Query and its results using HIVE-QL query language

The Second Query (Q2) aims to retrieve the URI address of the university where the German Chancellor studied; such a query is considered one of the sophisticated queries. The SPARQL query is shown in Fig. 9, and its results are shown in Fig. 10; it takes 414 secs. On the other hand, our proposed HIVE-QL query is shown in Fig. 11 and requires only 108 secs to retrieve the results.
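The exact HIVE-QL of Fig. 11 is not reproduced in this text; the following is a minimal sketch of the self-join pattern such a query can take over the three-column triples table, where the property names (leader, almaMater) and resource URI patterns are assumptions, not the statements actually used.

```sql
-- Illustrative sketch only; predicate names and URI patterns are assumptions.
-- The triples table is joined with itself: the first copy finds the Chancellor,
-- the second copy finds the Chancellor's alma mater.
SELECT t2.object AS university_uri
FROM dbpedia_triples t1
JOIN dbpedia_triples t2
  ON t1.object = t2.subject            -- the Chancellor's resource URI
WHERE t1.subject   LIKE '%resource/Germany'
  AND t1.predicate LIKE '%leader%'     -- assumed property linking Germany to its Chancellor
  AND t2.predicate LIKE '%almaMater%'; -- assumed property linking a person to a university
```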
Fig. 9. DBpedia Query using SPARQL query language
Fig. 10. DBpedia Query result using SPARQL query language
Fig. 11. DBpedia query and its results for the query requesting the address of the University of the Chancellor

The Third Query (Q3) aims to retrieve the scientific field of the scientist who was awarded the Nobel Prize in Chemistry and was also born in Egypt. The SPARQL query is shown in Fig. 12, and its results are shown in Fig. 13; it takes around 432 secs. On the other hand, our proposed HIVE-QL query, shown in Fig. 14, takes 174 secs to retrieve the same results.

Fig. 12. DBpedia query in SPARQL to retrieve the scientific field of this scientist
Fig. 13. Result using SPARQL to retrieve the scientific field of that scientist
Fig. 14. Query in HIVE-QL to retrieve the scientific field of that scientist

The results of the three test cases, run in both SPARQL and HIVE-QL, show that HIVE-QL has a lower retrieval time and performs better. Both are efficient in querying semantic datasets, but HIVE-QL performs better on large semantic datasets thanks to its use of Map-Reduce techniques on clusters.

Fig. 15. Retrieval time for HIVE-QL vs. SPARQL on the three queries
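As with Q2, the exact HIVE-QL of Fig. 14 is not reproduced here; a minimal sketch of a three-way self-join formulation of Q3 is given below, with assumed predicate and resource name patterns. It is this kind of multi-way join over a single large triples table that Hive distributes across the cluster as map-reduce jobs.

```sql
-- Illustrative sketch only; predicate and resource name patterns are assumptions,
-- not the exact statements of Fig. 14. Three copies of the triples table are
-- joined on the same subject (the scientist).
SELECT f.subject AS scientist, f.object AS field
FROM dbpedia_triples a
JOIN dbpedia_triples b ON a.subject = b.subject
JOIN dbpedia_triples f ON a.subject = f.subject
WHERE a.predicate LIKE '%award%'
  AND a.object    LIKE '%Nobel_Prize_in_Chemistry'
  AND b.predicate LIKE '%birthPlace%'
  AND b.object    LIKE '%resource/Egypt'
  AND f.predicate LIKE '%field%';
```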
6.1 Architecture Evaluation
Our proposed architecture is presented in the form of a layered architecture [28], as shown in Fig. 2. We evaluated the proposed architecture using the criteria described in [24], [25], and [26], as shown in Table I; these criteria are as follows:
Availability: the degree to which a system is operable and in a committable state at the start of a mission.
Clearly defined context: the possibility to identify the context from the description of the architecture.
Appropriate level of abstraction: the possibility to view the system within the framework as a whole.
Hiding of implementation details: the possibility to hide implementation details in the description of the architecture.
Clearly defined functional layers: the possibility to specify a function in a layer description within the system.
Interoperability: the extent to which systems can exchange information.
Modularity: the possibility to change the implementation of a particular layer while the functionality and the interfaces remain the same.
Upgradability: the degree to which a system can easily be improved in functionality.
Modifiability: the extent to which a system can be modified.
Accessibility: the degree to which as many people as possible can access the system.
Usability: the measure of how easy it is to use the system.
Stability: the extent to which the system rarely exhibits failure.
Efficiency: the degree to which the system runs in an efficient time and generates correct output.

TABLE I. ARCHITECTURE EVALUATION
Criterion: Availability, Clearly defined context, Appropriate level of abstraction, Hiding of implementation details, Clearly defined functional layers, Interoperability, Modularity, Upgradability, Modifiability, Accessibility, Usability, Stability, Efficiency
Proposed Architecture: No, the system cannot be shown as one thing; Partially; Partially; High

7 Conclusions and Future Work
The objective of this work is to propose a new architecture that is capable of efficiently querying massive amounts of semantic DBpedia content on a Hadoop environment based on the Hive mechanism. DBpedia is used as the dataset representing the semantic content. The paper discusses our proposed architecture and shows its implementation based on the Java programming language, HIVE-QL queries, and the Scala programming language [27]. The experimental results show how our proposed approach outperforms SPARQL in retrieving the search results. For future work, we aim to use Apache Spark and its Spark SQL query
language as an attempt to get better results, to extend the system, and to achieve better usability and stability.

References:
[1] N. Shadbolt, T. Berners-Lee and W. Hall, 'The Semantic Web Revisited', IEEE Intell. Syst., vol. 21, no. 3, pp. 96-101, 2006.
[2] 'Getting Started with Hadoop', 2015. [Online]. Available: http://hortonworks.com/get-started/. [Accessed: 24-Jun-2015].
[3] 'Introduction to Hadoop'. [Online]. Available: http://hadoop.apache.org/. [Accessed: 24-Jun-2015].
[4] B. Enrico, G. Marco and I. Mauro, 'Modeling Apache Hive based applications in big data architectures', in the 7th International Conference on Performance Evaluation Methodologies and Tools, 2015, pp. 30-38.
[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak and S. Hellmann, 'DBpedia - A crystallization point for the Web of Data', Web Semantics: Science, Services and Agents on the World Wide Web, vol. 7, no. 3, pp. 154-165, 2009.
[6] E. Dumbill, 'Big data and the semantic web: at war, indifferent, or intimately connected', in Strata Conference New York 2011, 2011.
[7] H. Christian Bizer and B. Christian, 'Special issue on Linked Data', International Journal on Semantic Web and Information Systems, 2014.
[8] H. Mohammad Farhan, K. Latifur, K. Murat and H. Kevin, 'Data intensive query processing for Semantic Web data using Hadoop and MapReduce', IEEE 3rd International Conference, The University of Texas at Dallas, 2011.
[9] L. Chang, Q. Jun, Q. Guilin, W. Haofen and Y. Yong, 'HadoopSparql: a Hadoop-based engine for multiple SPARQL query answering', in the 9th Extended Semantic Web Conference, 2012, pp. 474-479.
[10] J. Dean and S. Ghemawat, 'MapReduce', Communications of the ACM, vol. 51, no. 1, p. 107, 2008.
[11] H. Mohammad Farhan, D. Pankil, K. Latifur and T. Bhavani, 'Storage and retrieval of large RDF graph using Hadoop and MapReduce', Springer, Berlin Heidelberg, 2009, pp. 680-686.
[12] N. Tu Ngoc and S. Wolf, 'SLUBM: An Extended LUBM Benchmark for Stream Reasoning', in OrdRing@ISWC, 2013, pp. 43-54.
[13] D. Jin-Hang, W. Hao-Fen, N. Yuan and Y. Yong, 'HadoopRDF: A scalable semantic data analytical engine', in Intelligent Computing Theories and Applications, 2012, pp. 633-641.
[14] H. Albert and P. Lynette, 'Distributed RDF triplestore using HBase and Hive', The University of Texas at Austin, 2012.
[15] L. Jens, I. Robert, J. Max, J. Anja and K. Dimitris, 'DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia', Semantic Web Journal, vol. 5, pp. 1-29, 2014.
[16] H. Al-Feel, 'A Step towards the Arabic DBpedia', International Journal of Computer Applications, vol. 80, no. 3, pp. 27-33, 2013.
[17] M. Morsey, J. Lehmann, S. Auer, C. Stadler and S. Hellmann, 'DBpedia and the live extraction of structured data from Wikipedia', Program: electronic library and information systems, vol. 46, no. 2, pp. 157-181, 2012.
[18] C. Lagoze, 'The Open Archives Initiative Protocol for Metadata Harvesting', 2008. [Online]. Available: http://www.openarchives.org/oai/2.0/openarchivesprotocol.2008-12-07.htm. [Accessed: 15-Jun-2015].
[19] 'The DBpedia Data Provision Architecture', 2015. [Online]. Available: http://wiki.dbpedia.org/about/aboutdbpedia/architecture. [Accessed: 22-Jun-2015].
[20] 'Hive Wiki', 2015. [Online]. Available: http://www.apache.org/hadoop/hive. [Accessed: 22-Jun-2015].
[21] 'Introduction to HDFS: what is the Hadoop Distributed File System (HDFS)', 2015. [Online]. Available: http://www.1.ibm.com/software/data/infosphere/hadoop/hdfs/. [Accessed: 22-Jun-2015].
[22] F.
Haytham Tawfeek and K. Mohamed Helmy, 'OCSS: Ontology cloud storage system', 2011, pp. 9-13.
[23] K. Mohamed Helmy and F. Haytham Tawfeek, 'DOCSS: Distributed Ontology cloud storage system', 2012, pp. 48-52.
[24] H. Neil B and A. Paris, 'Using Pattern-Based Architecture Reviews to Detect Quality Attribute Issues - An Exploratory Study', in Transactions on Pattern Languages of Programming III, 2013, pp. 168-194.
[25] G. Aurona J, B. Andries and V. Alta J, 'Towards a semantic web layered architecture', in the International Conference on Software Engineering, Innsbruck, Austria, 2007, pp. 353-362.
[26] G. Aurona J, B. Andries and V. Alta J, 'Design and evaluation criteria for layered architectures', in the 8th International Conference on Enterprise Information Systems, Paphos, Cyprus, 2006.
[27] M. Odersky, 'The Scala Language Specification', 2014. [Online]. Available: http://www.scalalang.org/docu/files/scalareference.pdf. [Accessed: 1-Jul-2015].
[28] H. Al-Feel, M. Koutb and H. Suoror, 'Toward An Agreement on Semantic Web Architecture', Europe, vol. 49, no. 3, pp. 806-810, 2009.