Querying DBpedia Using HIVE-QL
AHMED SALAMA ISMAIL 1, HAYTHAM AL-FEEL 2, HODA M. O. MOKHTAR 3
1, 2 Information Systems Department, Faculty of Computers and Information, Fayoum University, Fayoum, Egypt
3 Cairo University, Cairo, Egypt
1 asi01@fayoum.edu.eg, 3 h.mokhtar@fci-cu.edu.eg

Abstract: - DBpedia is considered one of the main data hubs on the web nowadays. DBpedia extracts its content from the different Wikipedia editions, which grow dramatically day after day as new chapters join. It has become a big data environment describing 38.3 million things with over 3 billion facts. This data explosion affects both the efficiency and the accuracy of the retrieved data. From this point of view, we propose a new architecture for dealing with DBpedia using big data techniques in addition to Semantic Web principles and technologies. Our proposed architecture introduces HIVE-QL as a query language for DBpedia instead of the SPARQL query language, which is considered the backbone of semantic web applications. Additionally, this paper presents the implementation and evaluation of this architecture, which minimizes the retrieval time for a query on DBpedia.

Key-Words: - Semantic Web; SPARQL; Big Data; Hadoop; Hive; DBpedia; Map-Reduce

1 Introduction
The Semantic Web [1] is considered an extension of the current web that aims to enrich the meaning of web page content and enables machines and users to work in cooperation. DBpedia is a clear example of both a semantic web application and big data, due to its huge number of triples and its different chapters extracted from different Wikipedia editions. It aims to convert the unstructured data published in Wikipedia into a structured form, using ontology engineering principles and SPARQL as the standard query language for semantic web triples. On the other hand, big data is a modern term that describes the massive growth of data, both structured and unstructured.
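Each DBpedia fact is a subject-predicate-object triple, commonly serialized one per line in N-Triples form. As a minimal sketch (the example line below is illustrative, not taken from the paper's dataset), such a line can be split into its three parts; this is also the three-column shape the paper later uses when converting triples to CSV:

```python
import csv
import io
import re

# Sketch only: split a DBpedia-style N-Triples line into its three parts.
# The sample triple below is a hypothetical illustration.
NT_PATTERN = re.compile(
    r'^(<[^>]*>|_:\S+)\s+'   # subject: IRI or blank node
    r'(<[^>]*>)\s+'          # predicate: IRI
    r'(.+?)\s*\.\s*$'        # object: IRI, blank node, or literal, then final dot
)

def parse_ntriple(line):
    """Return (subject, predicate, object) or None for non-triple lines."""
    m = NT_PATTERN.match(line.strip())
    return m.groups() if m else None

def ntriples_to_csv(lines):
    """Write parsed triples as three-column CSV rows, skipping unparsable lines."""
    out = io.StringIO()
    writer = csv.writer(out)
    for line in lines:
        triple = parse_ntriple(line)
        if triple:
            writer.writerow(triple)
    return out.getvalue()

sample = ('<http://dbpedia.org/resource/The_New_York_Times> '
          '<http://dbpedia.org/ontology/foundingYear> '
          '"1851"^^<http://www.w3.org/2001/XMLSchema#gYear> .')
s, p, o = parse_ntriple(sample)
```

The regular expression here is a simplification that ignores N-Triples edge cases (escaped characters, comments); a production converter would use a proper RDF parser.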
Big data challenges have become one of the most pressing problems facing developers due to data Volume, Variety, and Velocity (the 3Vs) [2]. All of these problems need to be highlighted, and innovative solutions need to be proposed based on new techniques such as Hadoop [3] and HIVE-QL [4], which are considered cornerstones of Big Data. Studying the problem of the enormous data volume of DBpedia content [5] is one of the primary concerns of this paper, which presents a new approach to querying DBpedia using the HIVE-QL query language.

The rest of this paper is organized as follows: Section 2 discusses related work; Section 3 overviews the DBpedia project, Hadoop, and Hive, and presents our architecture for retrieving DBpedia content using HIVE-QL; Section 4 describes the implementation of the proposed architecture; Section 5 presents the test cases using HIVE-QL and SPARQL and highlights the results; finally, Section 6 concludes the paper and discusses possible directions for future work.

2 RELATED WORK
The Google Search Engine is a good example of the relation between Semantic Web and Big Data techniques [6]: it uses Linked Data [7] to retrieve a massive amount of search results and Map-Reduce to accelerate this job. How to use Hadoop and Map-Reduce to retrieve massive semantic data was
discussed in [8]. In addition, an algorithm was proposed in [8] to allow users to cast some SPARQL queries into equivalent, simpler queries using Hadoop's Map-Reduce, but this work faced limitations, such as the inability to retrieve data using Hive techniques. On the other hand, Liu proposed an architecture that uses Hadoop and Map-Reduce to answer multiple SPARQL queries at the same time by dividing the data rows into a number of clusters that are processed together, saving querying effort and time. A system based on Hadoop named HadoopSparql was developed in [9] to handle multiple queries simultaneously using a multi-way join operator instead of the traditional two-way join operator. In addition, an optimization algorithm was proposed to calculate the best join order and to allow users to access the system through web browsers. How to store and retrieve a large RDF graph using Hadoop and Map-Reduce [10] is an important work in this domain and is discussed in [11]. The main problem addressed by that work was handling a large amount of semantic web data; it showed that current semantic web architectures do not scale well. Generating data in RDF/XML format using the LUBM data generator and converting the generated data to N-Triples format with an N-Triple converter based on the Jena architecture was discussed in [12]. This work shows that the increase in the time needed to answer a query is always less than the increase in the dataset size. However, this approach also has limitations, such as its complexity and retrieval time. On the other hand, building a Hadoop RDF system was one of the main goals in [13]; it gathers the advantages of both semantic web RDF content and big data environments like Hadoop to avoid the limitations of querying RDF triples on a single machine. This system depends on parsing SPARQL queries into HIVE-QL queries.
The distribution of a large RDF triplestore using HBase for storing triples and HIVE-QL for generating query results was discussed in [14]. However, this work has shortcomings: it needs an optimized translation from SPARQL to HIVE-QL and a refined data model.

3 DBpedia, Hadoop & HIVE
3.1 DBpedia
DBpedia is considered one of the most remarkable achievements of the Semantic Web nowadays, due to its importance in adding structure to Wikipedia content. The DBpedia project was initially established as a collaboration between the Freie Universität Berlin and the University of Leipzig, together with OpenLink Software. The first DBpedia dataset was made available under free licenses, allowing others to reuse it [15] [16]. DBpedia allows users to ask sophisticated queries on datasets gathered from Wikipedia and to link other datasets on the Web to Wikipedia data [17]. In addition, DBpedia-Live provides a live streaming method based on updates to Wikipedia. DBpedia-Live depends on local copies of Wikipedia that are synchronized via the Open Archives Initiative Protocol for Metadata Harvesting [18], which enables a continuous stream of updates from a wiki [17]. The DBpedia-Live Extraction Manager processes each updated page, and the extracted triples are inserted into the triplestore in the form of N-Triples [17] that can be queried via the SPARQL query language. The DBpedia architecture [19], shown in Fig. 1, consists of several main modules. Wikipedia is the source of the data; its infoboxes are processed by different extractors, such as the label, redirect, and abstract extractors. The extracted data are embedded into DBpedia as an N-Triples dump stored in a Virtuoso triplestore that enables users to execute sophisticated queries through a SPARQL endpoint. Fig. 1.
DBpedia Architecture components

3.2 Hadoop & Hive
The Apache Hadoop project [3] is concerned with the development of open-source software that is scalable, reliable, and distributed. Hadoop is an architecture that allows the distributed processing of massive data sets across clusters of computers using simple programming models, such as the Hive language [20], which performs map-reduce jobs in its
back end. In addition, it enables scaling up from a single machine to a number of machines, each of which has its own local computation and storage. The Hadoop Distributed File System (HDFS) [21], the Ontology Cloud Storage System (OCSS) [22], and the Distributed Ontology Cloud Storage System (DOCSS) [23] are examples of distributing information via the Hadoop architecture.

3.3 Proposed Architecture
Dealing with large-scale semantic content using Big Data techniques is the main contribution of our work. HIVE-QL is used in our architecture instead of SPARQL, which is regarded as one of the backbones of the semantic web, and DBpedia is taken as an example of the big semantic datasets available on the web nowadays. In our architecture, the DBpedia datasets are converted into Comma-Separated Values (CSV) files instead of using the HBase triplestore. The architecture consists of three main phases, as shown in Fig. 2.

Fig. 2. Proposed Architecture

1) Input Phase
This phase is based on downloading a local copy of the DBpedia dumps. The downloaded dumps are stored in the DBpedia triplestore in N-Triples format and are then transformed into RDF format using an RDF validator.

2) Data Storing and Processing Phase
The stored RDF datasets are reformatted into CSV format and then loaded as a triplestore into HDFS, which we use here as a distributed file system. After that, the Map-Reduce model divides the massive datasets and performs parallel processing tasks on them. This process is one of the Hive language functions performed in the Hive back end.

3) Data Querying Phase
At this stage, the CSV datasets are loaded into the new Hive data store. Instead of SPARQL statements, HIVE-QL query statements are used; in this way, any sophisticated query is executed using HIVE-QL across the Hive platform to get the results the user requires.

4. Implementation
We present our approach as a desktop application implemented using the Java and Scala programming languages, allowing users to query any row of data stored in the DBpedia datasets. The physical environment for our experiment consists of an Intel Core i5 machine with 4 GB of memory and a 160 GB hard disk running the Ubuntu Linux operating system. The DBpedia triplestore is stored on the DBpedia server, from which we downloaded a local copy of the latest version of the English chapter of DBpedia, version 3.9, as shown in Fig. 3.

Fig. 3. DBpedia 2014 download server

The English chapter of DBpedia consists of 583 million RDF triples divided into different categories; each category is available in four RDF formats: N-Quads (nq), N-Triples (nt), Turtle Quads (tql), and Turtle (ttl). After getting the DBpedia data in the required format, we convert the RDF triples into CSV
format with three separate columns that represent the RDF triple. The converted CSV file is loaded into the HDFS of the Hadoop platform and then into a Hive table, ready for querying using the HIVE-QL query language.

5. Results
In this section, we present the test cases that verify our architecture, which uses HIVE-QL instead of SPARQL to retrieve data from DBpedia. This section answers the question: which performs better on DBpedia, SPARQL, considered one of the main players in the Semantic Web and DBpedia, or HIVE-QL, one of the building blocks of our proposed architecture? We run three test cases, varying from simple to sophisticated queries, using both SPARQL and HIVE-QL, and measure the retrieval time of each query language for each test case.

The First Query (Q1) aims to retrieve the founding date of the New_York_Times newspaper, one of the resources of the DBpedia dataset, as shown in Fig. 4 and Fig. 5. The SPARQL query that retrieves the founding date is shown in Fig. 6, and its result in Fig. 7. Our proposed HIVE-QL query and its results are shown in Fig. 8. The HIVE-QL query takes only 40 secs to retrieve the same results as SPARQL, which takes 370 secs. Our technique depends on dividing the query into two sub-queries: the first retrieves the organization by which the New York Times was founded, and the second retrieves the foundation year of that company, in a few seconds.

Fig. 4. New York Times resource with its properties & values in DBpedia
Fig. 5. The founding date that should be retrieved from DBpedia
Fig. 6. DBpedia query using the SPARQL query language
Fig. 7. DBpedia query result using the SPARQL query language
Fig. 8. DBpedia query and its results using the HIVE-QL query language

The Second Query (Q2) aims to retrieve the URI address of the university where the German Chancellor studied; such a query is considered one of the sophisticated queries.
The SPARQL query is shown in Fig. 9 and its results are shown in Fig. 10; it takes 414 secs. On the other hand, our proposed HIVE-QL query is shown in Fig. 11 and requires only 108 secs to retrieve the results.
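The HIVE-QL statements themselves appear only as figures in the source. As a minimal sketch of the sub-query strategy used for Q1 and Q2 over the three-column subject/predicate/object layout of our CSV tables, the following uses SQLite as a stand-in for Hive, with hypothetical stand-in rows (the entity names and properties below are illustrative, not the paper's actual data):

```python
import sqlite3

# Sketch: two chained sub-queries over a (subject, predicate, object) table,
# standing in for the HIVE-QL queries shown in the figures. SQLite replaces
# Hive here; the rows are hypothetical stand-ins for DBpedia triples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("dbr:Angela_Merkel", "dbo:almaMater", "dbr:Leipzig_University"),
        ("dbr:Angela_Merkel", "dbo:birthPlace", "dbr:Hamburg"),
        ("dbr:Leipzig_University", "dbo:country", "dbr:Germany"),
    ],
)

# Sub-query 1: find the resource linked to the starting entity.
(uni,) = conn.execute(
    "SELECT object FROM triples "
    "WHERE subject = 'dbr:Angela_Merkel' AND predicate = 'dbo:almaMater'"
).fetchone()

# Sub-query 2: feed that result into a second lookup, instead of
# expressing both hops as one multi-pattern SPARQL graph query.
(country,) = conn.execute(
    "SELECT object FROM triples WHERE subject = ? AND predicate = 'dbo:country'",
    (uni,),
).fetchone()
```

Each sub-query is a simple single-table selection, which is the kind of statement Hive can turn into an efficient map-reduce job over the CSV-backed table.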
Fig. 9. DBpedia query using the SPARQL query language
Fig. 10. DBpedia query result using the SPARQL query language
Fig. 11. DBpedia query and its results for the query requesting the address of the university of the Chancellor

The Third Query (Q3) aims to retrieve the scientific field of a scientist who was awarded the Nobel Prize in Chemistry and was also born in Egypt. The SPARQL query is shown in Fig. 12, and its results are shown in Fig. 13; it takes around 432 secs. On the other hand, our proposed HIVE-QL query, shown in Fig. 14, takes 174 secs to retrieve the same results.

Fig. 12. DBpedia query in SPARQL to retrieve the scientific field of this scientist
Fig. 13. Result using SPARQL to retrieve the scientific field of that scientist
Fig. 14. Query in HIVE-QL to retrieve the scientific field of that scientist

The results of the three test cases, run in both SPARQL and HIVE-QL, show that HIVE-QL has a lower retrieval time and performs better. While both are efficient in querying semantic datasets, HIVE-QL performs better on large semantic datasets owing to its use of Map-Reduce techniques in clusters.

Fig. 15. Retrieval time (secs) for HIVE-QL vs. SPARQL across the three queries
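The speed advantage attributed above to Map-Reduce in clusters comes from splitting the work into parallel map tasks whose outputs are shuffled by key and then reduced. A minimal single-process sketch of that model (counting triples per predicate over hypothetical rows; this illustrates the programming model only, not the Hadoop runtime or the authors' code):

```python
from collections import defaultdict
from itertools import chain

# Single-process illustration of the Map-Reduce model Hive uses in its
# back end; the triples below are hypothetical stand-ins.
def map_phase(record):
    # Emit (key, value) pairs: here, one count per predicate occurrence.
    subject, predicate, obj = record
    yield (predicate, 1)

def reduce_phase(key, values):
    # Combine all values that were shuffled to the same key.
    return (key, sum(values))

def run_mapreduce(records):
    # Shuffle step: group mapped values by key, as Hadoop does between phases.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_phase(r) for r in records):
        groups[key].append(value)
    return dict(reduce_phase(k, vs) for k, vs in groups.items())

triples = [
    ("dbr:A", "dbo:birthPlace", "dbr:Egypt"),
    ("dbr:B", "dbo:birthPlace", "dbr:Egypt"),
    ("dbr:A", "dbo:field", "dbr:Chemistry"),
]
counts = run_mapreduce(triples)  # {"dbo:birthPlace": 2, "dbo:field": 1}
```

In a real cluster, the map calls run in parallel on different nodes holding different HDFS blocks, which is what makes the retrieval times above shrink as data is distributed.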
5.1 Architecture Evaluation
Our proposed architecture is presented in the form of a layered architecture [28], as shown in Fig. 2. We evaluated the proposed architecture against criteria described in [24], [25] and [26], as shown in Table I. Those criteria are as follows:
Availability: the degree to which a system is operable and in a committable state at the start of a mission.
Clearly defined context: the possibility to identify the context from the description of the architecture.
Appropriate level of abstraction: the possibility to view the system within the framework as a whole.
Hiding of implementation details: the possibility to hide implementation details in the description of the architecture.
Clearly defined functional layers: the possibility to specify a function in a layer description within the system.
Interoperability: the extent to which systems can exchange information.
Modularity: the possibility to change the implementation of a particular layer while the functionality and the interfaces remain the same.
Upgradeability: the degree to which a system can easily be improved in functionality.
Modifiability: the extent to which a system can be modified.
Accessibility: the degree to which as many people as possible can access the system.
Usability: the measure of how easy the system is to use.
Stability: the extent to which the system rarely exhibits failure.
Efficiency: the degree to which the system runs in an efficient time and generates correct output.

TABLE I. ARCHITECTURE EVALUATION
Criterion: Availability; Clearly defined context; Appropriate level of abstraction; Hiding of implementation details; Clearly defined functional layers; Interoperability; Modularity; Upgradability; Modifiability; Accessibility; Usability; Stability; Efficiency
Proposed Architecture: No, the system cannot be shown as one thing; Partially; Partially; High
6. Conclusions and Future Work
The objective of this work is to propose a new architecture capable of efficiently querying massive amounts of semantic DBpedia content in a Hadoop environment based on the Hive mechanism. DBpedia is used as the dataset representing the semantic content. The paper discusses our proposed architecture and shows its implementation based on the Java programming language, HIVE-QL queries, and the Scala programming language [27]. The experimental results show how our proposed approach outperforms SPARQL in retrieving the search results. For future work, we aim to use Apache Spark and its Spark SQL query
language in an attempt to get better results, to extend the system, and to improve its usability and stability.

References:
[1] N. Shadbolt, T. Berners-Lee and W. Hall, 'The Semantic Web Revisited', IEEE Intell. Syst., vol. 21, no. 3.
[2] 'Getting Started with Hadoop'. [Online]. [Accessed: 24-Jun-2015].
[3] 'Introduction to Hadoop'. [Online]. [Accessed: 24-Jun-2015].
[4] E. Barbierato, M. Gribaudo and M. Iacono, 'Modeling apache hive based applications in big data architectures', in the 7th International Conference on Performance Evaluation Methodologies and Tools, 2015.
[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak and S. Hellmann, 'DBpedia - A crystallization point for the Web of Data', Web Semantics: Science, Services and Agents on the World Wide Web, vol. 7, no. 3.
[6] E. Dumbill, 'Big data and the semantic web: at war, indifferent, or intimately connected', in Strata Conference New York, 2011.
[7] C. Bizer et al., 'Special issue on Linked Data', International Journal on Semantic Web and Information Systems.
[8] M. F. Husain, L. Khan, M. Kantarcioglu and K. Hamlen, 'Data intensive query processing for Semantic Web data using Hadoop and MapReduce', IEEE 3rd International Conference, The University of Texas at Dallas, 2011.
[9] C. Liu, J. Qu, G. Qi, H. Wang and Y. Yu, 'HadoopSparql: a Hadoop-based engine for multiple SPARQL query answering', in the 9th Extended Semantic Web Conference, 2012.
[10] J. Dean and S. Ghemawat, 'MapReduce', Communications of the ACM, vol. 51, no. 1, p. 107.
[11] M. F. Husain, P. Doshi, L. Khan and B. Thuraisingham, 'Storage and retrieval of large RDF graph using Hadoop and MapReduce', Springer, Berlin Heidelberg, 2009.
[12] T. N. Nguyen and W. Siberski, 'SLUBM: An Extended LUBM Benchmark for Stream Reasoning', in OrdRing@ISWC, 2013.
[13] J. Du, H. Wang, Y. Ni and Y. Yu, 'HadoopRDF: A scalable semantic data analytical engine', in Intelligent Computing Theories and Applications, 2012.
[14] H. Albert and P. Lynette, 'Distributed RDF triplestore using HBase and Hive', the University of Texas at Austin.
[15] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas et al., 'DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia', Semantic Web Journal, vol. 5, pp. 1-29.
[16] H. Al-Feel, 'A Step towards the Arabic DBpedia', International Journal of Computer Applications, vol. 80, no. 3.
[17] M. Morsey, J. Lehmann, S. Auer, C. Stadler and S. Hellmann, 'DBpedia and the live extraction of structured data from Wikipedia', Program: electronic library and information systems, vol. 46, no. 2.
[18] C. Lagoze et al., 'The Open Archives Initiative Protocol for Metadata Harvesting', 2008. [Online]. [Accessed: 15-Jun-2015].
[19] 'The DBpedia Data Provision Architecture'. [Online]. [Accessed: 22-Jun-2015].
[20] Hive Wiki, 2015. [Online]. [Accessed: 22-Jun-2015].
[21] 'Introduction to HDFS: what is the Hadoop Distributed File System (HDFS)', 2015. [Online]. [Accessed: 22-Jun-2015].
[22] F. Haytham Tawfeek and K. Mohamed Helmy, 'OCSS: Ontology cloud storage system', 2011.
[23] K. Mohamed Helmy and F. Haytham Tawfeek, 'DOCSS: Distributed ontology cloud storage system', 2012.
[24] N. B. Harrison and P. Avgeriou, 'Using Pattern-Based Architecture Reviews to Detect Quality Attribute Issues - An Exploratory Study', in Transactions on Pattern Languages of Programming III, 2013.
[25] A. J. Gerber, A. Barnard and A. J. van der Merwe, 'Towards a semantic web layered architecture', in the International Conference on Software Engineering, Innsbruck, Austria, 2007.
[26] A. J. Gerber, A. Barnard and A. J. van der Merwe, 'Design and evaluation criteria for layered architectures', in the 8th International Conference on Enterprise Information Systems, Paphos, Cyprus.
[27] M. Odersky, 'The Scala Language Specification', 2014. [Online]. [Accessed: 1-Jul-2015].
[28] H. Al-Feel, M. Koutb and H. Suoror, 'Toward An Agreement on Semantic Web Architecture', vol. 49, no. 3.
, pp.-53-57. Available online at http://www.bioinfo.in/contents.php?id=344 SECURITY IN CLOUD COMPUTING GAWANDE Y.V. 1, AGRAWAL L.S. 2, BHARTIA A.S. 3 AND RAPARTIWAR S.S. 4 Department of Computer Science
More informationA Survey on: Efficient and Customizable Data Partitioning for Distributed Big RDF Data Processing using hadoop in Cloud.
A Survey on: Efficient and Customizable Data Partitioning for Distributed Big RDF Data Processing using hadoop in Cloud. Tejas Bharat Thorat Prof.RanjanaR.Badre Computer Engineering Department Computer
More informationBig Data and Natural Language: Extracting Insight From Text
An Oracle White Paper October 2012 Big Data and Natural Language: Extracting Insight From Text Table of Contents Executive Overview... 3 Introduction... 3 Oracle Big Data Appliance... 4 Synthesys... 5
More informationVolume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies
Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image
More informationMining the Web of Linked Data with RapidMiner
Mining the Web of Linked Data with RapidMiner Petar Ristoski, Christian Bizer, and Heiko Paulheim University of Mannheim, Germany Data and Web Science Group {petar.ristoski,heiko,chris}@informatik.uni-mannheim.de
More informationBSPCloud: A Hybrid Programming Library for Cloud Computing *
BSPCloud: A Hybrid Programming Library for Cloud Computing * Xiaodong Liu, Weiqin Tong and Yan Hou Department of Computer Engineering and Science Shanghai University, Shanghai, China liuxiaodongxht@qq.com,
More informationAssociate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2
Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue
More informationExploring the Efficiency of Big Data Processing with Hadoop MapReduce
Exploring the Efficiency of Big Data Processing with Hadoop MapReduce Brian Ye, Anders Ye School of Computer Science and Communication (CSC), Royal Institute of Technology KTH, Stockholm, Sweden Abstract.
More informationHadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
More informationBig Data: Tools and Technologies in Big Data
Big Data: Tools and Technologies in Big Data Jaskaran Singh Student Lovely Professional University, Punjab Varun Singla Assistant Professor Lovely Professional University, Punjab ABSTRACT Big data can
More informationManaging Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges
Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges Prerita Gupta Research Scholar, DAV College, Chandigarh Dr. Harmunish Taneja Department of Computer Science and
More informationQLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM
QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM QlikView Technical Case Study Series Big Data June 2012 qlikview.com Introduction This QlikView technical case study focuses on the QlikView deployment
More informationHigh-Performance, Massively Scalable Distributed Systems using the MapReduce Software Framework: The SHARD Triple-Store
High-Performance, Massively Scalable Distributed Systems using the MapReduce Software Framework: The SHARD Triple-Store Kurt Rohloff BBN Technologies Cambridge, MA, USA krohloff@bbn.com Richard E. Schantz
More informationCloud Storage Solution for WSN Based on Internet Innovation Union
Cloud Storage Solution for WSN Based on Internet Innovation Union Tongrang Fan 1, Xuan Zhang 1, Feng Gao 1 1 School of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang,
More informationChukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84
Index A Amazon Web Services (AWS), 50, 58 Analytics engine, 21 22 Apache Kafka, 38, 131 Apache S4, 38, 131 Apache Sqoop, 37, 131 Appliance pattern, 104 105 Application architecture, big data analytics
More informationHadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN
Hadoop MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Understanding Hadoop Understanding Hadoop What's Hadoop about? Apache Hadoop project (started 2008) downloadable open-source software library (current
More informationData-Gov Wiki: Towards Linked Government Data
Data-Gov Wiki: Towards Linked Government Data Li Ding 1, Dominic DiFranzo 1, Sarah Magidson 2, Deborah L. McGuinness 1, and Jim Hendler 1 1 Tetherless World Constellation Rensselaer Polytechnic Institute
More informationHadoop Technology for Flow Analysis of the Internet Traffic
Hadoop Technology for Flow Analysis of the Internet Traffic Rakshitha Kiran P PG Scholar, Dept. of C.S, Shree Devi Institute of Technology, Mangalore, Karnataka, India ABSTRACT: Flow analysis of the internet
More informationIndex Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk.
Load Rebalancing for Distributed File Systems in Clouds. Smita Salunkhe, S. S. Sannakki Department of Computer Science and Engineering KLS Gogte Institute of Technology, Belgaum, Karnataka, India Affiliated
More informationIntroduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.
Big Data Hadoop Administration and Developer Course This course is designed to understand and implement the concepts of Big data and Hadoop. This will cover right from setting up Hadoop environment in
More informationMobile Storage and Search Engine of Information Oriented to Food Cloud
Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 ISSN: 2042-4868; e-issn: 2042-4876 Maxwell Scientific Organization, 2013 Submitted: May 29, 2013 Accepted: July 04, 2013 Published:
More informationBig Data Weather Analytics Using Hadoop
Big Data Weather Analytics Using Hadoop Veershetty Dagade #1 Mahesh Lagali #2 Supriya Avadhani #3 Priya Kalekar #4 Professor, Computer science and Engineering Department, Jain College of Engineering, Belgaum,
More informationMassive Cloud Auditing using Data Mining on Hadoop
Massive Cloud Auditing using Data Mining on Hadoop Prof. Sachin Shetty CyberBAT Team, AFRL/RIGD AFRL VFRP Tennessee State University Outline Massive Cloud Auditing Traffic Characterization Distributed
More informationCitationBase: A social tagging management portal for references
CitationBase: A social tagging management portal for references Martin Hofmann Department of Computer Science, University of Innsbruck, Austria m_ho@aon.at Ying Ding School of Library and Information Science,
More informationAnalysis of Web Archives. Vinay Goel Senior Data Engineer
Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner
More informationHadoop on Windows Azure: Hive vs. JavaScript for Processing Big Data
Hive vs. JavaScript for Processing Big Data For some time Microsoft didn t offer a solution for processing big data in cloud environments. SQL Server is good for storage, but its ability to analyze terabytes
More informationISSN: 2321-7782 (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies
ISSN: 2321-7782 (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
More informationLinkZoo: A linked data platform for collaborative management of heterogeneous resources
LinkZoo: A linked data platform for collaborative management of heterogeneous resources Marios Meimaris, George Alexiou, George Papastefanatos Institute for the Management of Information Systems, Research
More informationChing-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015
E6893 Big Data Analytics Lecture 8: Spark Streams and Graph Computing (I) Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing
More informationAn Approach to Implement Map Reduce with NoSQL Databases
www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 13635-13639 An Approach to Implement Map Reduce with NoSQL Databases Ashutosh
More informationTECHNICAL Reports. Discovering Links for Metadata Enrichment on Computer Science Papers. Johann Schaible, Philipp Mayr
TECHNICAL Reports 2012 10 Discovering Links for Metadata Enrichment on Computer Science Papers Johann Schaible, Philipp Mayr kölkölölk GESIS-Technical Reports 2012 10 Discovering Links for Metadata Enrichment
More informationBig RDF Data Partitioning and Processing using hadoop in Cloud
Big RDF Data Partitioning and Processing using hadoop in Cloud Tejas Bharat Thorat Dept. of Computer Engineering MIT Academy of Engineering, Alandi, Pune, India Prof.Ranjana R.Badre Dept. of Computer Engineering
More informationPerformance Analysis of Hadoop for Query Processing
211 Workshops of International Conference on Advanced Information Networking and Applications Performance Analysis of Hadoop for Query Processing Tomasz Wiktor Wlodarczyk, Yi Han, Chunming Rong Department
More informationLinksTo A Web2.0 System that Utilises Linked Data Principles to Link Related Resources Together
LinksTo A Web2.0 System that Utilises Linked Data Principles to Link Related Resources Together Owen Sacco 1 and Matthew Montebello 1, 1 University of Malta, Msida MSD 2080, Malta. {osac001, matthew.montebello}@um.edu.mt
More informationOutline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
More informationINTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.
More informationScalable End-User Access to Big Data http://www.optique-project.eu/ HELLENIC REPUBLIC National and Kapodistrian University of Athens
Scalable End-User Access to Big Data http://www.optique-project.eu/ HELLENIC REPUBLIC National and Kapodistrian University of Athens 1 Optique: Improving the competitiveness of European industry For many
More informationThe Future of Data Management
The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah (@awadallah) Cofounder and CTO Cloudera Snapshot Founded 2008, by former employees of Employees Today ~ 800 World Class
More information11/18/15 CS 6030. q Hadoop was not designed to migrate data from traditional relational databases to its HDFS. q This is where Hive comes in.
by shatha muhi CS 6030 1 q Big Data: collections of large datasets (huge volume, high velocity, and variety of data). q Apache Hadoop framework emerged to solve big data management and processing challenges.
More informationImage Search by MapReduce
Image Search by MapReduce COEN 241 Cloud Computing Term Project Final Report Team #5 Submitted by: Lu Yu Zhe Xu Chengcheng Huang Submitted to: Prof. Ming Hwa Wang 09/01/2015 Preface Currently, there s
More informationAdvanced Big Data Analytics with R and Hadoop
REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional
More informationData Refinery with Big Data Aspects
International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 7 (2013), pp. 655-662 International Research Publications House http://www. irphouse.com /ijict.htm Data
More informationLog Mining Based on Hadoop s Map and Reduce Technique
Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com
More informationDataBridges: data integration for digital cities
DataBridges: data integration for digital cities Thematic action line «Digital Cities» Ioana Manolescu Oak team INRIA Saclay and Univ. Paris Sud-XI Plan 1. DataBridges short history and overview 2. RDF
More informationBig Data on Microsoft Platform
Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4
More informationRDF Dataset Management Framework for Data.go.th
RDF Dataset Management Framework for Data.go.th Pattama Krataithong 1,2, Marut Buranarach 1, and Thepchai Supnithi 1 1 Language and Semantic Technology Laboratory National Electronics and Computer Technology
More informationLinked Open Data Infrastructure for Public Sector Information: Example from Serbia
Proceedings of the I-SEMANTICS 2012 Posters & Demonstrations Track, pp. 26-30, 2012. Copyright 2012 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes.
More informationBig Data and Hadoop with components like Flume, Pig, Hive and Jaql
Abstract- Today data is increasing in volume, variety and velocity. To manage this data, we have to use databases with massively parallel software running on tens, hundreds, or more than thousands of servers.
More informationReducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan
Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Abstract Big Data is revolutionizing 21st-century with increasingly huge amounts of data to store and be
More informationA Scalable Data Transformation Framework using the Hadoop Ecosystem
A Scalable Data Transformation Framework using the Hadoop Ecosystem Raj Nair Director Data Platform Kiru Pakkirisamy CTO AGENDA About Penton and Serendio Inc Data Processing at Penton PoC Use Case Functional
More informationInteractive data analytics drive insights
Big data Interactive data analytics drive insights Daniel Davis/Invodo/S&P. Screen images courtesy of Landmark Software and Services By Armando Acosta and Joey Jablonski The Apache Hadoop Big data has
More informationDocument Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model
Document Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model Condro Wibawa, Irwan Bastian, Metty Mustikasari Department of Information Systems, Faculty of Computer Science and
More informationHadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
More information