Heuristics Based Query Processing for Large RDF Graphs Using Cloud Computing

Abstract:

The Semantic Web is an emerging area intended to augment human reasoning. Various technologies are being developed in this arena, which have been standardized by the World Wide Web Consortium (W3C). One such standard is the Resource Description Framework (RDF). Semantic Web technologies can be utilized to build efficient and scalable systems for Cloud Computing. With the explosion of Semantic Web technologies, large RDF graphs are commonplace. This poses significant challenges for the storage and retrieval of RDF graphs. Current frameworks do not scale to large RDF graphs and as a result do not address these challenges. In this paper, we describe a framework that we built using Hadoop to store and retrieve large numbers of RDF triples by exploiting the cloud computing paradigm. We describe a scheme to store RDF data in the Hadoop Distributed File System (HDFS). More than one Hadoop job (the smallest unit of execution in Hadoop) may be needed to answer a query, because a single triple pattern in a query cannot simultaneously take part in more than one join in a single Hadoop job. To determine the jobs, we present an algorithm, based on a greedy approach and with a bounded worst-case cost, that generates a query plan for answering a SPARQL Protocol and RDF Query Language (SPARQL) query. We use Hadoop's MapReduce framework to answer the queries. Our results show that we can store large RDF graphs in Hadoop clusters built with cheap commodity-class hardware. Furthermore, we show that our framework is scalable and efficient and can handle large amounts of RDF data, unlike traditional approaches.

Architecture:

Algorithm: Relaxed-Bestplan Algorithm

The Relaxed-Bestplan problem is to find the job plan that has the minimum number of jobs. We show that if joins are chosen reasonably, and no eligible join operation is left undone in a job, then we can set an upper bound on the maximum number of jobs required for any given query. However, it is still computationally expensive to generate all possible job plans. Therefore, we resort to a greedy algorithm that finds an approximate solution to the Relaxed-Bestplan problem but is guaranteed to produce a job plan within that upper bound.
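The document does not reproduce the full algorithm, so the following is only an illustrative sketch (in Java, with hypothetical class and method names, not the authors' exact Relaxed-Bestplan pseudocode) of the greedy idea under the stated constraint that a triple pattern can take part in at most one join per Hadoop job: in each round, corresponding to one MapReduce job, repeatedly pick the join variable shared by the largest number of still-unused patterns, merge those patterns into one intermediate pattern, and continue until a single pattern remains.

import java.util.*;

// Illustrative sketch only. Each triple pattern is represented by the set of
// variables it contains; the method returns how many Hadoop jobs the greedy
// plan would use for the query.
public class GreedyJobPlanner {

    public static int planJobs(List<Set<String>> patterns) {
        List<Set<String>> current = new ArrayList<>(patterns);
        int jobs = 0;
        while (current.size() > 1) {
            List<Set<String>> joined = new ArrayList<>();
            List<Set<String>> remaining = new ArrayList<>(current);
            // Within one job, keep choosing the variable shared by the most
            // remaining patterns; a pattern may join at most once per job.
            while (true) {
                String best = null;
                int bestCount = 1;
                for (String var : allVariables(remaining)) {
                    int count = countPatternsWith(remaining, var);
                    if (count > bestCount) { best = var; bestCount = count; }
                }
                if (best == null) break;              // no more joins possible in this job
                Set<String> merged = new HashSet<>();
                Iterator<Set<String>> it = remaining.iterator();
                while (it.hasNext()) {
                    Set<String> p = it.next();
                    if (p.contains(best)) { merged.addAll(p); it.remove(); }
                }
                joined.add(merged);                   // intermediate result of this join
            }
            if (joined.isEmpty()) break;              // disconnected patterns; not handled in this sketch
            joined.addAll(remaining);                 // patterns left out of this job
            current = joined;
            jobs++;                                   // one MapReduce job per round
        }
        return jobs;
    }

    private static Set<String> allVariables(List<Set<String>> patterns) {
        Set<String> vars = new HashSet<>();
        for (Set<String> p : patterns) vars.addAll(p);
        return vars;
    }

    private static int countPatternsWith(List<Set<String>> patterns, String var) {
        int n = 0;
        for (Set<String> p : patterns) if (p.contains(var)) n++;
        return n;
    }
}

For instance, three patterns with variable sets {X}, {X,Z} and {X,Y} all share ?X, so the sketch merges them in a single round and reports a one-job plan.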

Existing System: Semantic Web technologies are being developed to present data in a standardized way such that it can be retrieved and understood by both humans and machines. Historically, web pages have been published as plain HTML files, which are not suitable for automated reasoning.
1. No user data privacy.
2. Existing commercial tools and technologies do not scale well in Cloud Computing settings.

Proposed System: Researchers are developing Semantic Web technologies that have been standardized to address such inadequacies. The most prominent standards are the Resource Description Framework (RDF) and the SPARQL Protocol and RDF Query Language (SPARQL). RDF is the standard for storing and representing data, and SPARQL is a query language for retrieving data from an RDF store. Cloud Computing systems can utilize the power of these Semantic Web technologies to provide users with the capability to efficiently store and retrieve data for data-intensive applications. (A small illustration of an RDF triple and a matching SPARQL query is given after the list below.)
1. Researchers propose an indexing scheme for a new distributed database which can be used as a Cloud system.
2. RDF storage is becoming cheaper, and there is a growing need to store and retrieve large amounts of data.
3. Semantic Web technologies could be especially useful for maintaining data in the cloud.
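As a small illustration of these two standards (the URIs below are hypothetical, LUBM-style examples rather than data taken from this document), an RDF triple records a single (Subject, Predicate, Object) fact, and a SPARQL query retrieves data by matching triple patterns.

One RDF fact in N-Triples form (one triple per line):

<http://www.University0.edu/GraduateStudent1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://swat.cse.lehigh.edu/onto/univ-bench.owl#GraduateStudent> .

A SPARQL query that retrieves every such graduate student:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X WHERE { ?X rdf:type ub:GraduateStudent }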

Modules:

1. Data Generation and Storage

We use the LUBM dataset, a benchmark dataset designed to enable researchers to evaluate a Semantic Web repository's performance. The LUBM data generator produces data in the RDF/XML serialization format. This format is not suitable for our purpose, because we store data in HDFS as flat files, and so to retrieve even a single triple we would need to parse the entire file. Therefore, we convert the data to N-Triples, because in that format a complete RDF triple (Subject, Predicate and Object) occupies one line of a file, which is very convenient to use with MapReduce jobs. The processing steps needed to get the data into our intended format are described in the following sections, and a sketch of one possible conversion is given after module 2 below.

2. File Organization

We do not store the data in a single file because, in Hadoop and the MapReduce framework, a file is the smallest unit of input to a MapReduce job and, in the absence of caching, a file is always read from disk. If we kept all the data in one file, the whole file would be input to a job for every query. Instead, we divide the data into multiple smaller files.
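The text does not name the tool used for the RDF/XML-to-N-Triples conversion in module 1; as one possible sketch (assuming Apache Jena, which is not specified in the document), the conversion can be done as follows, producing one complete triple per output line as required above.

import java.io.FileOutputStream;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;

// Illustrative sketch (Apache Jena assumed): read one RDF/XML file produced by
// the LUBM generator and rewrite it as N-Triples, one triple per line.
public class RdfXmlToNTriples {
    public static void main(String[] args) throws Exception {
        String input = args[0];   // an RDF/XML file from the LUBM generator
        String output = args[1];  // N-Triples output suitable for line-oriented MapReduce input
        Model model = RDFDataMgr.loadModel(input, Lang.RDFXML);  // parse the RDF/XML file
        try (FileOutputStream out = new FileOutputStream(output)) {
            RDFDataMgr.write(out, model, Lang.NTRIPLES);         // one (Subject, Predicate, Object) per line
        }
    }
}

Each output line can then be consumed directly by Hadoop's default line-oriented text input once the files are placed in HDFS.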

3. Predicate Object Split (POS)

In the next step, we work with the explicit type information in the rdf:type file (the file holding the triples whose predicate is rdf:type). The predicate rdf:type is used in RDF to denote that a resource is an instance of a class. The rdf:type file is first divided into as many files as there are distinct objects of the rdf:type predicate. For example, if the leaves of the class hierarchy in the ontology are c1, c2, ..., cn, then we create a file for each of these leaves, with file names such as type_c1, type_c2, ..., type_cn. Note that the object values c1, c2, ..., cn no longer need to be stored within the files, because they can easily be recovered from the file names. This further reduces the amount of space needed to store the data. We generate such a file for each distinct object value of the predicate rdf:type.

4. Query Plan Generation

We define the query plan generation problem and show that generating the best (i.e., least-cost) query plan, for the ideal model as well as for the practical model, is computationally expensive. We then present a heuristic and a greedy approach that generate an approximate best plan.

Running example: We will use the following query as a running example in this section.

SELECT ?V ?X ?Y ?Z WHERE {
?X rdf:type ub:GraduateStudent .
?Y rdf:type ub:University .
?Z ?V ub:Department .
?X ub:memberOf ?Z .
?X ub:undergraduateDegreeFrom ?Y }
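The document states that queries are answered with Hadoop MapReduce jobs but does not spell out the job internals. The following is a minimal sketch (the class names and the reduce-side join strategy are assumptions, not the authors' published code) of how one job might join the triple patterns of the running example that share the variable ?X in subject position: the mapper keys each input line by the binding of ?X and tags it with the split file it came from, and the reducer groups all tagged values that share the same ?X binding, which is the essence of the join.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative sketch of a single reduce-side join job on the variable ?X.
public class JoinOnX {

    public static class JoinMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String line = value.toString().trim();
            if (line.isEmpty()) return;
            // In the split files, the subject (the binding of ?X) is the first token.
            String[] parts = line.split("\\s+", 2);
            String sourceFile = ((FileSplit) ctx.getInputSplit()).getPath().getName();
            String rest = parts.length > 1 ? parts[1] : "";
            ctx.write(new Text(parts[0]), new Text(sourceFile + "\t" + rest));
        }
    }

    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            // All values here share the same ?X binding; emitting them together
            // realizes the join. A full implementation would combine bindings
            // pairwise and check that every required pattern is present.
            StringBuilder joined = new StringBuilder();
            for (Text v : values) joined.append(v.toString()).append(" | ");
            ctx.write(key, new Text(joined.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "join-on-X");
        job.setJarByClass(JoinOnX.class);
        job.setMapperClass(JoinMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS directory of the relevant split files
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output of the join
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}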

System Requirements:

Hardware Requirements:
System : Pentium IV, 2.4 GHz
Hard Disk : 40 GB
Floppy Drive : 1.44 MB
Monitor : 15'' VGA Colour
Mouse : Logitech
RAM : 512 MB

Software Requirements:
Operating System : Windows XP
Coding Language : ASP.Net with C#
Database : SQL Server 2005