Querying DBpedia Using HIVE-QL

AHMED SALAMA ISMAIL 1, HAYTHAM AL-FEEL 2, HODA M. O. MOKHTAR 3
Information Systems Department, Faculty of Computers and Information
1, 2 Fayoum University, Fayoum, Egypt
3 Cairo University, Cairo, Egypt
1 asi01@fayoum.edu.eg 2 3 h.mokhtar@fci-cu.edu.eg

Abstract: - DBpedia is considered one of the main data hubs on the web today. DBpedia extracts its content from the different Wikipedia editions, and it grows dramatically day after day as new chapters join the project. It has become a big data environment describing 38.3 million things with more than 3 billion facts. This data explosion affects both the efficiency and the accuracy of the retrieved data. From this point of view, we propose a new architecture that handles DBpedia using big data techniques in addition to Semantic Web principles and technologies. Our proposed architecture introduces HIVE-QL as a query language for DBpedia instead of the SPARQL query language, which is considered the backbone of semantic web applications. Additionally, this paper presents the implementation and evaluation of this architecture, which minimizes the retrieval time for a query in DBpedia.

Key-Words: - Semantic Web; SPARQL; Big Data; Hadoop; Hive; DBpedia; Map-Reduce

1 Introduction
The Semantic Web [1] is considered an extension of the current web that aims to enrich the meaning of web page content and to enable machines and users to work in co-operation. DBpedia is a clear example of both a semantic web application and big data, due to the huge number of triples it holds and its different chapters extracted from the different Wikipedia editions. It aims to convert the unstructured data published in Wikipedia into a structured form, using ontology engineering principles and SPARQL as the standard query language for semantic web triples. On the other hand, big data is a modern term that describes the massive growth of data, both structured and unstructured.
Big data challenges have become one of the most pressing problems facing developers due to data Volume, Variety, and Velocity (the 3Vs) [2]. All of these problems need to be highlighted, and innovative solutions need to be proposed based on new techniques such as Hadoop [3] and HIVE-QL [4], which are considered among the main tools of big data. Studying the problem of the enormous data volume of the DBpedia content [5] is one of the primary concerns of this paper, which presents a new approach to querying DBpedia using the HIVE-QL query language.

The rest of this paper is organized as follows: Section 2 discusses the related work; Section 3 overviews the DBpedia project and the Hadoop and Hive technologies; Section 4 presents the proposed architecture for retrieving DBpedia content using HIVE-QL; Section 5 describes the implementation of the proposed architecture; Section 6 presents test cases run with both HIVE-QL and SPARQL and highlights the results; finally, Section 7 concludes the paper and discusses possible directions for future work.

2 Related Work
The Google search engine is a good example of the relation between Semantic Web and big data techniques [6]: it uses Linked Data [7] to retrieve a massive amount of search results and Map-Reduce to accelerate this job. How to use Hadoop and Map-Reduce to retrieve massive semantic data was

discussed in [8]. In addition, an algorithm was proposed to allow users to cast some SPARQL queries into equivalent, simpler queries using Hadoop's Map-Reduce [8], but this work met limitations, such as the incapability and difficulty of retrieving data using Hive techniques. On the other hand, Liu proposed an architecture that uses Hadoop and Map-Reduce to answer multiple SPARQL queries at the same time by dividing the data rows into a number of clusters that are processed together, saving querying effort and time. A Hadoop-based system named HadoopSparql was developed in [9] to handle multiple queries simultaneously using a multi-way join operator instead of the traditional two-way join operator; in addition, an optimization algorithm was proposed to calculate the best join order and to allow users to access the system through web browsers. How to store and retrieve a large RDF graph using Hadoop and Map-Reduce [10] is an important problem in this domain and was discussed in [11]. The main challenge of that work was handling a large amount of semantic web data, and it showed that current semantic web architectures do not scale up well. Generating data in RDF/XML format using the LUBM data generator and converting it to N-Triples format with the N-Triple Converter software based on the Jena framework was discussed in [12]. That work shows that the increase in the time needed to answer a query is always less than the increase in the dataset size; however, the approach also has limitations, such as its complexity and retrieval time. Building a Hadoop RDF system was one of the main goals of [13], combining the advantages of semantic web RDF content and big data environments such as Hadoop to avoid the limitation of querying RDF triples on a single machine. This system depends on parsing the SPARQL query into a HIVE-QL query.
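As a rough illustration of what such SPARQL-to-HIVE-QL parsing involves, the following Python sketch translates a single SPARQL triple pattern into a SELECT over a three-column triples table. The table name `triples`, the column names, and the helper `pattern_to_hiveql` are our own illustrative assumptions; real translators such as the system in [13] must handle far more of SPARQL (joins, filters, prefixes, optional patterns).

```python
# Toy translation of one SPARQL triple pattern into a HIVE-QL-style
# SELECT over a (subject, predicate, object) table. Names are
# illustrative assumptions, not the cited systems' actual schema.

def pattern_to_hiveql(subject, predicate, obj):
    """Translate one (s, p, o) pattern; '?x'-style items are variables."""
    cols = {"subject": subject, "predicate": predicate, "object": obj}
    # Variables become projected columns; constants become WHERE conditions.
    selected = [c for c, v in cols.items() if v.startswith("?")]
    conditions = [f"{c} = '{v}'" for c, v in cols.items() if not v.startswith("?")]
    select_list = ", ".join(selected) or "*"
    sql = f"SELECT {select_list} FROM triples"
    return sql + (f" WHERE {' AND '.join(conditions)}" if conditions else "")

# SPARQL: SELECT ?date WHERE { dbr:New_York_Times dbo:foundingDate ?date }
q = pattern_to_hiveql("dbr:New_York_Times", "dbo:foundingDate", "?date")
print(q)
```

A pattern with all three positions variable simply projects every column with no WHERE clause, which is the degenerate case a real translator would also have to cover.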
The distribution of the large RDF triplestore using HBase for storing triples and HIVE-QL for generating query results was discussed in [14]. However, this work has shortcomings and needs an optimization for the translation of SPARQL to HIVE-QL and also needs to refine its data model. 3 DBpedia, Hadoop & HIVE 3.1 DBpedia DBpedia is considered one of the most remarkable achievements in the Semantic Web nowadays. This is due to its importance in adding structure to Wikipedia content. DBpedia project was initially established as a project between the Freie University of Berlin and the University of Leipzig, in collaboration with Open Link Software. The first dataset of DBpedia was published in It was made available under free licenses, allowing others to reuse the dataset [15] [16]. DBpedia allows users to ask sophisticated queries on datasets that are gathered from Wikipedia and link other datasets flying on the Web to Wikipedia data [17]. In addition, DBpedia-Live provides a live streaming method based on the update of Wikipedia. DBpedialive depends on local copies of Wikipedia that are synchronized via the Open Archives Initiative Protocol for Metadata Harvesting [18] that enables a continuous stream of updates from a wiki [17]. DBpedia-Live Extraction Manager applies on any processed page, and the extracted triples are inserted into the triplestore in the form of N-Triples [17] that can be queried via SPARQL query language as shown in the DBpedia architecture [19] in Fig.1 consists of main modules. Wikipedia is considered the source of data that has infoboxes been extracted by different extractors, such as label, redirect, and abstract extractors. These extracted data were embedded into DBpedia in the form of N-Triples dump stored in Virtuoso triplestore that enables users to execute sophisticated queries through SPARQL endpoint. Fig. 1. 
Fig. 1. DBpedia architecture components

3.2 Hadoop & Hive
The Apache Hadoop project [3] is concerned with developing open-source software for scalable, reliable, and distributed computing. Hadoop allows the distributed processing of massive data sets across clusters of computers using simple programming models, such as the Hive language [20], which runs map-reduce jobs in its

back end. In addition, Hadoop scales from a single machine up to a large number of machines, each providing local computation and storage. The Hadoop Distributed File System (HDFS) [21], the Ontology Cloud Storage System (OCSS) [22], and the Distributed Ontology Cloud Storage System (DOCSS) [23] are examples of distributing information via the Hadoop architecture.

4 Proposed Architecture
Dealing with large semantic content using big data techniques is the main contribution of our work. HIVE-QL is used in our architecture instead of SPARQL, which is regarded as one of the backbones of the semantic web, and DBpedia serves as an example of the big semantic datasets available on the web nowadays. In our architecture, the DBpedia datasets are converted into Comma-Separated Values (CSV) files instead of being stored in an HBase triplestore. The architecture consists of three main phases, as shown in Fig. 2.

Fig. 2. Proposed architecture

1) Input Phase. This phase is based on downloading a local copy of the DBpedia dumps. The downloaded dumps are stored in the DBpedia triplestore in N-Triples format and are then transformed into RDF format using an RDF validator.

2) Data Storing and Processing Phase. The stored RDF datasets are reformatted into CSV and loaded into HDFS, which we use as the distributed file system. The Map-Reduce model then divides the massive datasets and performs parallel processing tasks on them; this is one of the Hive language functions performed in the Hive back end.

3) Data Querying Phase. At this stage, the CSV datasets are loaded into the new Hive data store. Instead of SPARQL statements, HIVE-QL query statements are used, so any sophisticated query is executed with HIVE-QL on the Hive platform to obtain the required results.

5 Implementation
We present our approach as a desktop application implemented in the Java and Scala programming languages, allowing users to query any row of data stored in the DBpedia datasets. The physical environment for our experiments consists of an Intel Core i5 machine with 4 GB of memory and a 160 GB hard disk running the Ubuntu Linux operating system. The DBpedia triplestore is hosted on the DBpedia server, from which we downloaded a local copy of the latest version of the English chapter of DBpedia, version 3.9, as shown in Fig. 3.

Fig. 3. DBpedia 2014 download server

The English chapter of DBpedia consists of 583 million RDF triples divided into different categories; each category is offered in four RDF formats: N-Quads (nq), N-Triples (nt), Turtle Quads (tql), and Turtle (ttl). After getting the DBpedia data in the required format, we convert the RDF triples into CSV format.

The CSV file has three separate columns that represent the RDF triple: subject, predicate, and object. The converted CSV file is loaded into the HDFS of the Hadoop platform and is then loaded into a Hive table, ready for querying with the HIVE-QL query language.

6 Results
In this section, we present the test cases that verify our architecture, which uses HIVE-QL instead of SPARQL to retrieve data from DBpedia. This section answers the question: which performs better on DBpedia, SPARQL, one of the main players in the Semantic Web and DBpedia, or HIVE-QL, one of the building blocks of our proposed architecture? We ran three test cases, varying from simple to sophisticated queries, in both SPARQL and HIVE-QL, and measured the retrieval time of each query language for each test case.

The First Query (Q1) aims to retrieve the founding date of the New_York_Times newspaper, which is one of the resources of the DBpedia dataset, as shown in Fig. 4 and Fig. 5. The SPARQL query is shown in Fig. 6, and the founding date it retrieves is shown in Fig. 7. Our HIVE-QL query and its results are shown in Fig. 8. The HIVE-QL query takes only 40 secs to retrieve the same results as SPARQL, which takes 370 secs. Our technique divides the query into two sub-queries: the first retrieves the organization by which the New York Times was founded, and the second retrieves the foundation year of this company in a few seconds.

Fig. 4. New York Times resource with its properties and values in DBpedia
Fig. 5. The founding date that should be retrieved from DBpedia
Fig. 6. DBpedia query using the SPARQL query language
Fig. 7. DBpedia query result using the SPARQL query language
Fig. 8. DBpedia query and its results using the HIVE-QL query language

The Second Query (Q2) aims to retrieve the URI address of the university where the German Chancellor studied; such a query is considered one of the sophisticated queries.
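The two-sub-query strategy used for Q1 can be replayed in miniature with a Python sketch over an in-memory (subject, predicate, object) table mirroring the CSV layout. The rows, property names, and the `select` helper below are illustrative stand-ins of our own, not actual DBpedia triples; the real queries run in HIVE-QL as shown in Fig. 8.

```python
# Minimal illustration of splitting one lookup into two sub-queries
# over a three-column (subject, predicate, object) table. The data
# rows and property names are made up for the sketch.

triples = [
    ("The_New_York_Times", "foundedBy", "The_New_York_Times_Company"),
    ("The_New_York_Times_Company", "foundingYear", "1851"),
]

def select(rows, subject=None, predicate=None):
    """Filter triples the way a WHERE clause on the Hive table would."""
    return [o for (s, p, o) in rows
            if (subject is None or s == subject)
            and (predicate is None or p == predicate)]

# Sub-query 1: find the organization that founded the newspaper.
founders = select(triples, subject="The_New_York_Times", predicate="foundedBy")

# Sub-query 2: find the founding year of that organization.
years = [y for org in founders
         for y in select(triples, subject=org, predicate="foundingYear")]

print(years)  # the founding year(s) found by the two-step lookup
```

Each sub-query is a simple single-table scan, which is why the pair can be cheaper on Hive than one query that joins the table with itself.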
The SPARQL query for Q2 is shown in Fig. 9, and its results are shown in Fig. 10; it takes 414 secs. Our HIVE-QL query is shown in Fig. 11 and requires only 108 secs to retrieve the results.

Fig. 9. DBpedia query using the SPARQL query language
Fig. 10. DBpedia query result using the SPARQL query language
Fig. 11. DBpedia query and its results for the query requesting the address of the university of the Chancellor

The Third Query (Q3) aims to retrieve the scientific field of a scientist who was awarded the Nobel Prize in Chemistry and was also born in Egypt. The SPARQL query is shown in Fig. 12, and its results are shown in Fig. 13; it takes around 432 secs. Our HIVE-QL query, shown in Fig. 14, takes 174 secs to retrieve the same results.

Fig. 12. DBpedia query in SPARQL to retrieve the scientific field of this scientist
Fig. 13. Result using SPARQL to retrieve the scientific field of that scientist
Fig. 14. Query in HIVE-QL to retrieve the scientific field of that scientist

The results of the three test cases, run in both SPARQL and HIVE-QL, show that HIVE-QL has a lower retrieval time and performs better. Both are efficient in querying semantic datasets, but HIVE-QL performs better on large semantic datasets thanks to its use of Map-Reduce techniques on clusters.

Fig. 15. Retrieval time (secs) for HIVE-QL vs. SPARQL by query type (Q1-Q3)
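From the reported retrieval times, the relative gain of HIVE-QL over SPARQL in the three test cases can be summarized in a few lines of Python. The times are the measurements reported above; the speedup figures are derived from them, not separately measured.

```python
# Retrieval times (seconds) reported for the three test queries.
times = {
    "Q1": {"SPARQL": 370, "HIVE-QL": 40},
    "Q2": {"SPARQL": 414, "HIVE-QL": 108},
    "Q3": {"SPARQL": 432, "HIVE-QL": 174},
}

# Speedup = SPARQL time / HIVE-QL time for each query.
speedups = {q: t["SPARQL"] / t["HIVE-QL"] for q, t in times.items()}

for q, s in speedups.items():
    print(f"{q}: HIVE-QL is {s:.1f}x faster than SPARQL")
```

The gap narrows as the queries grow more sophisticated, which is consistent with more of the HIVE-QL time being spent in the join work itself.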

6.1 Architecture Evaluation
Our proposed architecture is presented in the form of a layered architecture [28], as shown in Fig. 2. We evaluated the proposed architecture against criteria described in [24], [25], and [26], as shown in Table I. The criteria are as follows:

Availability: the degree to which a system is operable and in a committable state at the start of a mission.
Clearly defined context: the possibility to identify the context from the description of the architecture.
Appropriate level of abstraction: the possibility to view the system within the framework as a whole.
Hiding of implementation details: the possibility to hide any detailed implementation information in the description of the architecture.
Clearly defined functional layers: the possibility to specify a function in the description of a layer within the system.
Interoperability: the extent to which systems can exchange information.
Modularity: the possibility to change the implementation of a particular layer while the functionality and the interfaces remain the same.
Upgradeability: the degree to which a system can easily be improved in functionality.
Modifiability: the extent to which a system can be modified.
Accessibility: the degree to which as many people as possible can access a system.
Usability: the measure of how easy the system is to use.
Stability: the extent to which the system rarely exhibits failure.
Efficiency: the degree to which the system runs in an efficient time and generates correct output.

TABLE I. ARCHITECTURE EVALUATION
Criterion: Availability; Clearly defined context; Appropriate level of abstraction; Hiding of implementation details; Clearly defined functional layers; Interoperability; Modularity; Upgradeability; Modifiability; Accessibility; Usability; Stability; Efficiency
Proposed Architecture: No, the system cannot be shown as one thing; Partially; Partially; High
7 Conclusions and Future Work
The objective of this work is to propose a new architecture capable of efficiently querying massive amounts of semantic DBpedia content in a Hadoop environment based on the Hive mechanism. DBpedia is used as the dataset representing the semantic content. The paper discusses our proposed architecture and shows its implementation based on the Java programming language, HIVE-QL queries, and the Scala programming language [27]. The experimental results show how our proposed approach outperforms SPARQL in retrieving the search results. For future work, we aim to use Apache Spark and its Spark SQL query

language as an attempt to get better results, to extend the system, and to achieve better usability and stability.

References:
[1] N. Shadbolt, T. Berners-Lee and W. Hall, 'The Semantic Web Revisited', IEEE Intelligent Systems, vol. 21, no. 3.
[2] 'Getting Started with Hadoop'. [Online]. [Accessed: 24-Jun-2015].
[3] 'Introduction to Hadoop'. [Online]. [Accessed: 24-Jun-2015].
[4] B. Enrico, G. Marco and I. Mauro, 'Modeling apache hive based applications in big data architectures', in the 7th International Conference on Performance Evaluation Methodologies and Tools, 2015.
[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak and S. Hellmann, 'DBpedia - A crystallization point for the Web of Data', Web Semantics: Science, Services and Agents on the World Wide Web, vol. 7, no. 3.
[6] E. Dumbill, 'Big data and the semantic web: at war, indifferent, or intimately connected', in Strata Conference New York, 2011.
[7] H. Christian Bizer and B. Christian, 'Special issue on Linked Data', International Journal on Semantic Web and Information.
[8] H. Mohammad Farhan, K. Latifur, K. Murat and H. Kevin, 'Data intensive query processing for Semantic Web data using Hadoop and MapReduce', in the IEEE 3rd International Conference, The University of Texas at Dallas, 2011.
[9] L. Chang, Q. Jun, Q. Guilin, W. Haofen and Y. Yong, 'HadoopSparql: a Hadoop-based engine for multiple SPARQL query answering', in the 9th Extended Semantic Web Conference, 2012.
[10] J. Dean and S. Ghemawat, 'MapReduce', Communications of the ACM, vol. 51, no. 1, p. 107.
[11] H. Mohammad Farhan, D. Pankil, K. Latifur and T. Bhavani, 'Storage and retrieval of large RDF graph using Hadoop and MapReduce', Springer, Berlin Heidelberg, 2009.
[12] N. Tu Ngoc and S. Wolf, 'SLUBM: An Extended LUBM Benchmark for Stream Reasoning', in OrdRing @ ISWC, 2013.
[13] D. Jin-Hang, W. Hao-Fen, N. Yuan and Y. Yong, 'HadoopRDF: A scalable semantic data analytical engine', in Intelligent Computing Theories and Applications, 2012.
[14] H. Albert and P. Lynette, 'Distributed RDF triplestore using HBase and Hive', The University of Texas at Austin.
[15] L. Jens, I. Robert, J. Max, J. Anja and K. Dimitris, 'DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia', Semantic Web Journal, vol. 5, pp. 1-29.
[16] H. Al-Feel, 'A Step towards the Arabic DBpedia', International Journal of Computer Applications, vol. 80, no. 3.
[17] M. Morsey, J. Lehmann, S. Auer, C. Stadler and S. Hellmann, 'DBpedia and the live extraction of structured data from Wikipedia', Program: electronic library and information systems, vol. 46, no. 2.
[18] T. Carl Lagoze (2008), 'The Open Archives Initiative Protocol for Metadata Harvesting'. [Online]. [Accessed: 15-Jun-2015].
[19] 'The DBpedia Data Provision Architecture'. [Online]. [Accessed: 22-Jun-2015].
[20] Hive Wiki, 2015. [Online]. [Accessed: 22-Jun-2015].
[21] 'Introduction to HDFS: what is the Hadoop Distributed File System (HDFS)', 2015. [Online]. [Accessed: 22-Jun-2015].
[22] F. Haytham Tawfeek and K. Mohamed Helmy, 'OCSS: Ontology cloud storage system', 2011.
[23] K. Mohamed Helmy and F. Haytham Tawfeek, 'DOCSS: Distributed ontology cloud storage system', 2012.
[24] H. Neil B and A. Paris, 'Using Pattern-Based Architecture Reviews to Detect Quality Attribute Issues - An Exploratory Study', in Transactions on Pattern Languages of Programming III, 2013.
[25] G. Aurona J, B. Andries and V. Alta J, 'Towards a semantic web layered architecture', in the International Conference on Software Engineering, Innsbruck, Austria, 2007.
[26] G. Aurona J, B. Andries and V. Alta J, 'Design and evaluation criteria for layered architectures', in the 8th International Conference on Enterprise Information Systems, Paphos, Cyprus.
[27] M. Odersky (2014), The Scala Language Specification. [Online]. [Accessed: 1-Jul-2015].
[28] H. Al-Feel, M. Koutb and H. Suoror, 'Toward An Agreement on Semantic Web Architecture', Europe, vol. 49, no. 3.

D3.3.1: Sematic tagging and open data publication tools COMPETITIVINESS AND INNOVATION FRAMEWORK PROGRAMME CIP-ICT-PSP-2013-7 Pilot Type B WP3 Service platform integration and deployment in cloud infrastructure D3.3.1: Sematic tagging and open data publication

More information

How To Create A Large Data Storage System

How To Create A Large Data Storage System UT DALLAS Erik Jonsson School of Engineering & Computer Science Secure Data Storage and Retrieval in the Cloud Agenda Motivating Example Current work in related areas Our approach Contributions of this

More information

A Study of Data Management Technology for Handling Big Data

A Study of Data Management Technology for Handling Big Data Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 9, September 2014,

More information

COMP9321 Web Application Engineering

COMP9321 Web Application Engineering COMP9321 Web Application Engineering Semester 2, 2015 Dr. Amin Beheshti Service Oriented Computing Group, CSE, UNSW Australia Week 11 (Part II) http://webapps.cse.unsw.edu.au/webcms2/course/index.php?cid=2411

More information

How To Build A Cloud Based Intelligence System

How To Build A Cloud Based Intelligence System Semantic Technology and Cloud Computing Applied to Tactical Intelligence Domain Steve Hamby Chief Technology Officer Orbis Technologies, Inc. shamby@orbistechnologies.com 678.346.6386 1 Abstract The tactical

More information

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

More information

Oracle Big Data SQL Technical Update

Oracle Big Data SQL Technical Update Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical

More information

Reference Architecture, Requirements, Gaps, Roles

Reference Architecture, Requirements, Gaps, Roles Reference Architecture, Requirements, Gaps, Roles The contents of this document are an excerpt from the brainstorming document M0014. The purpose is to show how a detailed Big Data Reference Architecture

More information

Secure Third Party Publications of Documents in a Cloud

Secure Third Party Publications of Documents in a Cloud , pp.-53-57. Available online at http://www.bioinfo.in/contents.php?id=344 SECURITY IN CLOUD COMPUTING GAWANDE Y.V. 1, AGRAWAL L.S. 2, BHARTIA A.S. 3 AND RAPARTIWAR S.S. 4 Department of Computer Science

More information

A Survey on: Efficient and Customizable Data Partitioning for Distributed Big RDF Data Processing using hadoop in Cloud.

A Survey on: Efficient and Customizable Data Partitioning for Distributed Big RDF Data Processing using hadoop in Cloud. A Survey on: Efficient and Customizable Data Partitioning for Distributed Big RDF Data Processing using hadoop in Cloud. Tejas Bharat Thorat Prof.RanjanaR.Badre Computer Engineering Department Computer

More information

Big Data and Natural Language: Extracting Insight From Text

Big Data and Natural Language: Extracting Insight From Text An Oracle White Paper October 2012 Big Data and Natural Language: Extracting Insight From Text Table of Contents Executive Overview... 3 Introduction... 3 Oracle Big Data Appliance... 4 Synthesys... 5

More information

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com Image

More information

Mining the Web of Linked Data with RapidMiner

Mining the Web of Linked Data with RapidMiner Mining the Web of Linked Data with RapidMiner Petar Ristoski, Christian Bizer, and Heiko Paulheim University of Mannheim, Germany Data and Web Science Group {petar.ristoski,heiko,chris}@informatik.uni-mannheim.de

More information

BSPCloud: A Hybrid Programming Library for Cloud Computing *

BSPCloud: A Hybrid Programming Library for Cloud Computing * BSPCloud: A Hybrid Programming Library for Cloud Computing * Xiaodong Liu, Weiqin Tong and Yan Hou Department of Computer Engineering and Science Shanghai University, Shanghai, China liuxiaodongxht@qq.com,

More information

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2 Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

More information

Exploring the Efficiency of Big Data Processing with Hadoop MapReduce

Exploring the Efficiency of Big Data Processing with Hadoop MapReduce Exploring the Efficiency of Big Data Processing with Hadoop MapReduce Brian Ye, Anders Ye School of Computer Science and Communication (CSC), Royal Institute of Technology KTH, Stockholm, Sweden Abstract.

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Big Data: Tools and Technologies in Big Data

Big Data: Tools and Technologies in Big Data Big Data: Tools and Technologies in Big Data Jaskaran Singh Student Lovely Professional University, Punjab Varun Singla Assistant Professor Lovely Professional University, Punjab ABSTRACT Big data can

More information

Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges

Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges Prerita Gupta Research Scholar, DAV College, Chandigarh Dr. Harmunish Taneja Department of Computer Science and

More information

QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM

QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM QlikView Technical Case Study Series Big Data June 2012 qlikview.com Introduction This QlikView technical case study focuses on the QlikView deployment

More information

High-Performance, Massively Scalable Distributed Systems using the MapReduce Software Framework: The SHARD Triple-Store

High-Performance, Massively Scalable Distributed Systems using the MapReduce Software Framework: The SHARD Triple-Store High-Performance, Massively Scalable Distributed Systems using the MapReduce Software Framework: The SHARD Triple-Store Kurt Rohloff BBN Technologies Cambridge, MA, USA krohloff@bbn.com Richard E. Schantz

More information

Cloud Storage Solution for WSN Based on Internet Innovation Union

Cloud Storage Solution for WSN Based on Internet Innovation Union Cloud Storage Solution for WSN Based on Internet Innovation Union Tongrang Fan 1, Xuan Zhang 1, Feng Gao 1 1 School of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang,

More information

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84 Index A Amazon Web Services (AWS), 50, 58 Analytics engine, 21 22 Apache Kafka, 38, 131 Apache S4, 38, 131 Apache Sqoop, 37, 131 Appliance pattern, 104 105 Application architecture, big data analytics

More information

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Hadoop MPDL-Frühstück 9. Dezember 2013 MPDL INTERN Understanding Hadoop Understanding Hadoop What's Hadoop about? Apache Hadoop project (started 2008) downloadable open-source software library (current

More information

Data-Gov Wiki: Towards Linked Government Data

Data-Gov Wiki: Towards Linked Government Data Data-Gov Wiki: Towards Linked Government Data Li Ding 1, Dominic DiFranzo 1, Sarah Magidson 2, Deborah L. McGuinness 1, and Jim Hendler 1 1 Tetherless World Constellation Rensselaer Polytechnic Institute

More information

Hadoop Technology for Flow Analysis of the Internet Traffic

Hadoop Technology for Flow Analysis of the Internet Traffic Hadoop Technology for Flow Analysis of the Internet Traffic Rakshitha Kiran P PG Scholar, Dept. of C.S, Shree Devi Institute of Technology, Mangalore, Karnataka, India ABSTRACT: Flow analysis of the internet

More information

Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk.

Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk. Load Rebalancing for Distributed File Systems in Clouds. Smita Salunkhe, S. S. Sannakki Department of Computer Science and Engineering KLS Gogte Institute of Technology, Belgaum, Karnataka, India Affiliated

More information

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture. Big Data Hadoop Administration and Developer Course This course is designed to understand and implement the concepts of Big data and Hadoop. This will cover right from setting up Hadoop environment in

More information

Mobile Storage and Search Engine of Information Oriented to Food Cloud

Mobile Storage and Search Engine of Information Oriented to Food Cloud Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 ISSN: 2042-4868; e-issn: 2042-4876 Maxwell Scientific Organization, 2013 Submitted: May 29, 2013 Accepted: July 04, 2013 Published:

More information

Big Data Weather Analytics Using Hadoop

Big Data Weather Analytics Using Hadoop Big Data Weather Analytics Using Hadoop Veershetty Dagade #1 Mahesh Lagali #2 Supriya Avadhani #3 Priya Kalekar #4 Professor, Computer science and Engineering Department, Jain College of Engineering, Belgaum,

More information

Massive Cloud Auditing using Data Mining on Hadoop

Massive Cloud Auditing using Data Mining on Hadoop Massive Cloud Auditing using Data Mining on Hadoop Prof. Sachin Shetty CyberBAT Team, AFRL/RIGD AFRL VFRP Tennessee State University Outline Massive Cloud Auditing Traffic Characterization Distributed

More information

CitationBase: A social tagging management portal for references

CitationBase: A social tagging management portal for references CitationBase: A social tagging management portal for references Martin Hofmann Department of Computer Science, University of Innsbruck, Austria m_ho@aon.at Ying Ding School of Library and Information Science,

More information

Analysis of Web Archives. Vinay Goel Senior Data Engineer

Analysis of Web Archives. Vinay Goel Senior Data Engineer Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner

More information

Hadoop on Windows Azure: Hive vs. JavaScript for Processing Big Data

Hadoop on Windows Azure: Hive vs. JavaScript for Processing Big Data Hive vs. JavaScript for Processing Big Data For some time Microsoft didn t offer a solution for processing big data in cloud environments. SQL Server is good for storage, but its ability to analyze terabytes

More information

ISSN: 2321-7782 (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies

ISSN: 2321-7782 (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies ISSN: 2321-7782 (Online) Volume 3, Issue 4, April 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

LinkZoo: A linked data platform for collaborative management of heterogeneous resources

LinkZoo: A linked data platform for collaborative management of heterogeneous resources LinkZoo: A linked data platform for collaborative management of heterogeneous resources Marios Meimaris, George Alexiou, George Papastefanatos Institute for the Management of Information Systems, Research

More information

Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015

Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing. October 29th, 2015 E6893 Big Data Analytics Lecture 8: Spark Streams and Graph Computing (I) Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist, Graph Computing

More information

An Approach to Implement Map Reduce with NoSQL Databases

An Approach to Implement Map Reduce with NoSQL Databases www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 13635-13639 An Approach to Implement Map Reduce with NoSQL Databases Ashutosh

More information

TECHNICAL Reports. Discovering Links for Metadata Enrichment on Computer Science Papers. Johann Schaible, Philipp Mayr

TECHNICAL Reports. Discovering Links for Metadata Enrichment on Computer Science Papers. Johann Schaible, Philipp Mayr TECHNICAL Reports 2012 10 Discovering Links for Metadata Enrichment on Computer Science Papers Johann Schaible, Philipp Mayr kölkölölk GESIS-Technical Reports 2012 10 Discovering Links for Metadata Enrichment

More information

Big RDF Data Partitioning and Processing using hadoop in Cloud

Big RDF Data Partitioning and Processing using hadoop in Cloud Big RDF Data Partitioning and Processing using hadoop in Cloud Tejas Bharat Thorat Dept. of Computer Engineering MIT Academy of Engineering, Alandi, Pune, India Prof.Ranjana R.Badre Dept. of Computer Engineering

More information

Performance Analysis of Hadoop for Query Processing

Performance Analysis of Hadoop for Query Processing 211 Workshops of International Conference on Advanced Information Networking and Applications Performance Analysis of Hadoop for Query Processing Tomasz Wiktor Wlodarczyk, Yi Han, Chunming Rong Department

More information

LinksTo A Web2.0 System that Utilises Linked Data Principles to Link Related Resources Together

LinksTo A Web2.0 System that Utilises Linked Data Principles to Link Related Resources Together LinksTo A Web2.0 System that Utilises Linked Data Principles to Link Related Resources Together Owen Sacco 1 and Matthew Montebello 1, 1 University of Malta, Msida MSD 2080, Malta. {osac001, matthew.montebello}@um.edu.mt

More information

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.

More information

Scalable End-User Access to Big Data http://www.optique-project.eu/ HELLENIC REPUBLIC National and Kapodistrian University of Athens

Scalable End-User Access to Big Data http://www.optique-project.eu/ HELLENIC REPUBLIC National and Kapodistrian University of Athens Scalable End-User Access to Big Data http://www.optique-project.eu/ HELLENIC REPUBLIC National and Kapodistrian University of Athens 1 Optique: Improving the competitiveness of European industry For many

More information

The Future of Data Management

The Future of Data Management The Future of Data Management with Hadoop and the Enterprise Data Hub Amr Awadallah (@awadallah) Cofounder and CTO Cloudera Snapshot Founded 2008, by former employees of Employees Today ~ 800 World Class

More information

11/18/15 CS 6030. q Hadoop was not designed to migrate data from traditional relational databases to its HDFS. q This is where Hive comes in.

11/18/15 CS 6030. q Hadoop was not designed to migrate data from traditional relational databases to its HDFS. q This is where Hive comes in. by shatha muhi CS 6030 1 q Big Data: collections of large datasets (huge volume, high velocity, and variety of data). q Apache Hadoop framework emerged to solve big data management and processing challenges.

More information

Image Search by MapReduce

Image Search by MapReduce Image Search by MapReduce COEN 241 Cloud Computing Term Project Final Report Team #5 Submitted by: Lu Yu Zhe Xu Chengcheng Huang Submitted to: Prof. Ming Hwa Wang 09/01/2015 Preface Currently, there s

More information

Advanced Big Data Analytics with R and Hadoop

Advanced Big Data Analytics with R and Hadoop REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional

More information

Data Refinery with Big Data Aspects

Data Refinery with Big Data Aspects International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 7 (2013), pp. 655-662 International Research Publications House http://www. irphouse.com /ijict.htm Data

More information

Log Mining Based on Hadoop s Map and Reduce Technique

Log Mining Based on Hadoop s Map and Reduce Technique Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com

More information

DataBridges: data integration for digital cities

DataBridges: data integration for digital cities DataBridges: data integration for digital cities Thematic action line «Digital Cities» Ioana Manolescu Oak team INRIA Saclay and Univ. Paris Sud-XI Plan 1. DataBridges short history and overview 2. RDF

More information

Big Data on Microsoft Platform

Big Data on Microsoft Platform Big Data on Microsoft Platform Prepared by GJ Srinivas Corporate TEG - Microsoft Page 1 Contents 1. What is Big Data?...3 2. Characteristics of Big Data...3 3. Enter Hadoop...3 4. Microsoft Big Data Solutions...4

More information

RDF Dataset Management Framework for Data.go.th

RDF Dataset Management Framework for Data.go.th RDF Dataset Management Framework for Data.go.th Pattama Krataithong 1,2, Marut Buranarach 1, and Thepchai Supnithi 1 1 Language and Semantic Technology Laboratory National Electronics and Computer Technology

More information

Linked Open Data Infrastructure for Public Sector Information: Example from Serbia

Linked Open Data Infrastructure for Public Sector Information: Example from Serbia Proceedings of the I-SEMANTICS 2012 Posters & Demonstrations Track, pp. 26-30, 2012. Copyright 2012 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes.

More information

Big Data and Hadoop with components like Flume, Pig, Hive and Jaql

Big Data and Hadoop with components like Flume, Pig, Hive and Jaql Abstract- Today data is increasing in volume, variety and velocity. To manage this data, we have to use databases with massively parallel software running on tens, hundreds, or more than thousands of servers.

More information

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Abstract Big Data is revolutionizing 21st-century with increasingly huge amounts of data to store and be

More information

A Scalable Data Transformation Framework using the Hadoop Ecosystem

A Scalable Data Transformation Framework using the Hadoop Ecosystem A Scalable Data Transformation Framework using the Hadoop Ecosystem Raj Nair Director Data Platform Kiru Pakkirisamy CTO AGENDA About Penton and Serendio Inc Data Processing at Penton PoC Use Case Functional

More information

Interactive data analytics drive insights

Interactive data analytics drive insights Big data Interactive data analytics drive insights Daniel Davis/Invodo/S&P. Screen images courtesy of Landmark Software and Services By Armando Acosta and Joey Jablonski The Apache Hadoop Big data has

More information

Document Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model

Document Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model Document Similarity Measurement Using Ferret Algorithm and Map Reduce Programming Model Condro Wibawa, Irwan Bastian, Metty Mustikasari Department of Information Systems, Faculty of Computer Science and

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information