RDF Graph Model and Data Retrieval

Distributed RDF Graph Keyword Search

2. Linked Data, Non-relational Databases and Cloud Computing

2.1. Linked Data

The World Wide Web has allowed an unprecedented amount of information to be published and shared globally. The barriers of cost to publishing and retrieving information have been greatly lowered, and the interconnection provided by hyperlinks delivers the means to find more related data about a particular topic. Search engines are built on top of those characteristics to rank relevant documents and facilitate locating the information to be retrieved [2]. All those factors combined contributed to the unconstrained growth of the Web [3].

These low barriers apply to multimedia documents, rendered human-readable by browsers, but not to data. As more data is collected and processed, it is produced in different formats, such as Excel tables, CSV files, XML, or SQL databases. Three problems arise from this. First, the logic for accessing and parsing those formats differs from one to another: Excel data, for example, was originally a binary proprietary format, while CSV files allow for different separators and escape codes. Second, the pieces of data stored in those formats have relationships that are more complex than the simple implicit relationship between two hyperlinked documents: a hyperlink does not indicate the type of relationship a document shares with the document it links to, and relationships between pieces of data must be more expressive than a bare connection. Finally, as different providers produce databases, their schemas do not necessarily end up similar to those of other providers working on the same kind of data. To join those databases, a translation between schemas must be made, adding to the cost of making large amounts of data available. Linked Data was proposed to lower some of those barriers and make way for this data to be available on the Web [4].
The principles of the proposal, the "Linked Data principles", are a set of rules for publishing data on the Web:

1. Use URIs as names for things;
2. Use HTTP URIs so that people can look up those names;

3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL);
4. Include links to other URIs, so that they can discover more things.

Typically, Linked Data uses RDF documents to contain data [5]. The relationships between those data are also described in RDF, forming typed links between data nodes. The resulting data, described this way and made available on the Web, constitutes the Web of Data. Since the Linked Data principles were laid out, the public availability of information on the Semantic Web has grown quickly [6]. Many organizations, institutes, and other contributors build and publish RDF repositories of data that they previously made available only through traditional models. Those repositories encompass different kinds of data, such as movie databases (e.g., LinkedMDB and IMDb), New Testament names (e.g., Semantic Bible), encyclopaedic articles (e.g., DBpedia, extracted from Wikipedia), biological taxonomies (e.g., the GeoSpecies Knowledge Base), and many other examples [7].

Figure 1 - The Linking Open Data Cloud Diagram [6]

2.1.1. Generic data model and graph representation

As described above, RDF is the basic format for the data documents and databases that compose the Web of Data. In its most basic form, the generic data model described in RDF is a data triple in the subject-predicate-object format.

RDF can be serialized in several formats. The most recognized is the XML syntax described alongside the RDF definition by the W3C [8], but other serialization formats, such as Notation 3 and N-Triples, also exist, with applications in different scenarios. For example, Figure 2 shows a fictional RDF document. The XML headers of the listing indicate that it is an RDF document and the syntax version it uses. The document then describes two CDs (i.e., two subjects) with information such as artist and price (i.e., several predicate-object pairs). If represented as N-Triples, the same listing would start without any heading about syntax, assuming the parser already knows the format. Each line of the file then defines a triple, with URIs between angle brackets, literals between quotes, and the whole line finished by a full stop, as can be seen in the example in Figure 3.

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:cd="http://www.recshop.fake/cd#">
  <rdf:Description rdf:about="http://www.recshop.fake/cd/Empire Burlesque">
    <cd:artist>Bob Dylan</cd:artist>
    <cd:country>USA</cd:country>
    <cd:company>Columbia</cd:company>
    <cd:price>10.90</cd:price>
    <cd:year>1985</cd:year>
  </rdf:Description>
  <rdf:Description rdf:about="http://www.recshop.fake/cd/Hide your heart">
    <cd:artist>Bonnie Tyler</cd:artist>
    <cd:country>UK</cd:country>
    <cd:company>CBS Records</cd:company>
    <cd:price>9.90</cd:price>
    <cd:year>1988</cd:year>
  </rdf:Description>
  ...
</rdf:RDF>

Figure 2 - Two CDs' listing example in RDF XML [28]
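The relationship between the two serializations can be illustrated with a small script that extracts triples from RDF/XML using only Python's standard library. This is a simplification, not a real RDF parser (libraries such as rdflib handle many more cases), and the URI is adapted to replace the space in the example:

```python
# Sketch: turning the RDF/XML of Figure 2 into N-Triples-style lines like
# those of Figure 3. Fictional URIs from the example, hyphenated to stay valid.
import xml.etree.ElementTree as ET

RDF_NS = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
doc = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:cd="http://www.recshop.fake/cd#">
  <rdf:Description rdf:about="http://www.recshop.fake/cd/Empire-Burlesque">
    <cd:artist>Bob Dylan</cd:artist>
    <cd:year>1985</cd:year>
  </rdf:Description>
</rdf:RDF>"""

triples = []
for desc in ET.fromstring(doc).findall(RDF_NS + "Description"):
    subject = desc.get(RDF_NS + "about")       # the rdf:about URI is the subject
    for prop in desc:
        # prop.tag is "{namespace}localname"; joining the parts yields the
        # full predicate URI, e.g. http://www.recshop.fake/cd#artist
        predicate = prop.tag.replace("{", "").replace("}", "")
        triples.append(f"<{subject}> <{predicate}> \"{prop.text}\" .")

for line in triples:
    print(line)
```

Each child element of an rdf:Description becomes one predicate-object pair for that description's subject, which is exactly the flattening shown in Figure 3.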

<http://fake/cd/Empire Burlesque> <http://fake/cd/artist> "Bob Dylan" .
<http://fake/cd/Empire Burlesque> <http://fake/cd/country> "USA" .
<http://fake/cd/Empire Burlesque> <http://fake/cd/company> "Columbia" .
<http://fake/cd/Empire Burlesque> <http://fake/cd/price> "10.90" .
<http://fake/cd/Empire Burlesque> <http://fake/cd/year> "1985" .
<http://fake/cd/Hide your heart> <http://fake/cd/artist> "Bonnie Tyler" .
<http://fake/cd/Hide your heart> <http://fake/cd/country> "UK" .
<http://fake/cd/Hide your heart> <http://fake/cd/company> "CBS Records" .
<http://fake/cd/Hide your heart> <http://fake/cd/price> "9.90" .
<http://fake/cd/Hide your heart> <http://fake/cd/year> "1988" .

Figure 3 - Two CDs' example listing in N-Triples

The W3C conceptualization also defines a graph data model for the generic data model. In this representation, a subject-predicate-object triple expresses a node-arc-node link, as seen in Figure 4: the subject and the object are the nodes of the link, and the predicate is a directed arc connecting the subject node to the object node. The whole set of triples, expressed in this manner, results in an RDF graph that can be visually represented as a traditional graph.

Figure 4 - Subject-Predicate-Object graph representation

The graph representation presents several advantages. First, the visual description can convey the information for immediate human consumption, whereas the original set of triples could not. The visual display can adopt graph visualization techniques to arrange nodes by the proximity of their relations, to weight nodes by the number of relations leaving them, and to cluster nodes, among other techniques [9]. For data processing, the RDF graph representation allows the application of distributed computing to the storage and retrieval of its data, which will be explored in later sections.
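The node-arc-node view can be sketched with a plain adjacency map: each triple contributes one outgoing arc from its subject node. A minimal sketch, using shortened made-up names in place of the CD example's full URIs:

```python
# Triples viewed as a graph: subject node --predicate arc--> object node.
from collections import defaultdict

triples = [
    ("cd/1", "artist", "Bob Dylan"),
    ("cd/1", "country", "USA"),
    ("cd/2", "artist", "Bonnie Tyler"),
]

# Adjacency map: subject node -> list of (predicate arc, object node).
graph = defaultdict(list)
for s, p, o in triples:
    graph[s].append((p, o))

def neighbors(node):
    """Object nodes reachable from `node` through one outgoing arc."""
    return [o for _, o in graph[node]]

print(neighbors("cd/1"))  # ['Bob Dylan', 'USA']
```

The same structure underlies the distributed storage and retrieval discussed later: partitioning the adjacency map partitions the RDF graph.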
2.2. Non-relational databases

Large amounts of data are being produced and stored through various means, a large proportion of them in an unstructured or semi-structured way. However, the availability of that data is an issue for several reasons.

First, the current dominant paradigm for data storage and retrieval in corporate and government environments is the relational model, and the task of adapting semi-structured data to a fully functional relational model is cumbersome, with questionable results [10]. The resulting database schemas may either be too deconstructed, falling back to the basic RDF schema (e.g., a triples table in SQL), or require too many levels of hierarchy to avoid being too vague (e.g., a city list that relates to a locations list that relates to an entities list). In both cases, deep knowledge of the original domain rules is still needed to use the database.

Second, unstructured data, in particular data in binary formats (e.g., video), tends to be used in units similar to those in which it is stored (e.g., video contained in files). As such, the task of locating and retrieving those units is usually very simple, requiring little to no use of the relational model itself. The real problems in those cases lie in the storage and later processing of those units of data.

Finally, making that data available at current Internet scale, with low response times even under tens of thousands of parallel queries, is a challenge for the conventional relational model. One of the strongest sets of properties of relational databases is ACID: atomicity, consistency, isolation, and durability. Those properties guarantee that database transactions are processed reliably. Assuring all of them at the same time, on parallel and distributed database nodes, on top of replication and load balancing techniques, is time-consuming, and often only a subset of them is enough.

2.2.1. NoSQL

Non-relational databases have been used and studied since the late 1960s, but recently a renewed focus has been given to them under the label of NoSQL. The term covers a broad range of database types, described below. NoSQL is often interpreted as "not only SQL" to avoid the unintentional meaning of being against SQL and relational databases.
Instead, it is meant as the use of non-relational databases for problems less suited to relational databases. The set of properties commonly seen in modern non-relational databases is described as BASE (Basically Available, Soft state, Eventual consistency), in contrast with the relational ACID properties. In short, a trade-off is made, giving

away some of the ACID properties to regain availability and performance. That trade-off is described by the CAP theorem. CAP is an acronym standing for:

Consistency: all nodes of the system are expected to be in a consistent and equivalent state after the execution of an operation.

Availability: the system is expected to be highly available, able to continuously answer any kind of operation even if some nodes are unavailable.

Partition tolerance: a network split creates isolation between the nodes of the system, and continued operation is expected during that network partition and after its eventual restoration.

The CAP theorem states that only two of those three characteristics can be chosen for a shared-data system. Based on this, non-relational databases, compared to relational databases, tend to choose availability and partition tolerance over consistency. As such, a very common description of NoSQL databases is that they are eventually consistent and inherently scalable, even though not all of them make those choices from the CAP set.

2.2.2. Non-relational database types

Non-relational databases can be categorized by the model of data they represent. These types are just guidelines for better understanding how they work, and some databases are hybrids of those models. The most basic types are Key-Value, Document, and Graph; on top of those are more specific types such as Tabular, Object, Multi-value, Tuple, etc. The basic types can be described as follows.

Key-Value: typically consists of data indexed by a string key that is used to retrieve a unit of data of a primitive type (i.e., an integer or another string). Inherently schema-less, it depends on the application to use the data in a meaningful way.
Most implementations add functionality on top of this simplistic model, such as richer primitive types (e.g., Redis' data structures) or a customizable synchronization and replication model (e.g., Cassandra's configurable cluster token and replication factors).

Document: the type of store whose central concept is a multi-valued document or object. In comparison with Key-Value stores,

here the database is expected to offer functionality to query or modify data through the fields of a document, instead of relying only on indexed keys. This type of store can be further categorized by the way objects are organized and grouped in the database, such as by collections, tags, or hierarchies, and differentiated by the underlying document representation, such as JSON, multi-value rows, or XML datasets. Examples of distinguishing functionality are map-reduce support over queried data (e.g., MongoDB's aggregation and data processing) and bi-directional data replication and synchronization (e.g., CouchDB's partition, offline, and reunion operations).

Graph: the kind of database designed for storing and querying data nodes and the relations interconnecting those nodes, i.e., data that can be well represented as a graph.

2.2.3. Data Sharding

Data sharding is a common approach upon which effective use of non-relational databases is based. It consists of relying on the application to indicate how data should be distributed among the many nodes that compose the database system's network. The rationale is that, even though the database can use some natural keys to distribute load and optimize data retrieval, those are never going to be as efficient as keys designed with the application in mind. Support for sharding varies between databases. Some give no support at all, relying entirely on the application (e.g., Redis). Others give limited access to parameterize default sharding algorithms provided by the database (e.g., MongoDB) [11]. Finally, there are databases with extension points for plugging in custom-made modules that dictate how the database should distribute, rebalance, and replicate data, based not only on keys but also on information provided by the database about its network topology (e.g., Cassandra).
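The simplest form of key-based sharding can be sketched in a few lines: a stable hash of the key selects one of N nodes. The node names here are hypothetical, and real systems typically use consistent hashing so that adding or removing a node moves only a fraction of the keys:

```python
# Sketch of hash-based key sharding: the same key always routes to the
# same node, so the application can locate data without a central index.
import hashlib

NODES = ["node-0", "node-1", "node-2"]  # hypothetical database nodes

def shard_for(key: str) -> str:
    """Pick a node deterministically from the key's hash."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

print(shard_for("user:42"))          # always the same node for this key
print(shard_for("user:42") == shard_for("user:42"))  # True
```

A modulo over the node count is the weakness of this naive scheme: changing `len(NODES)` remaps almost every key, which is precisely what consistent hashing and the pluggable sharding modules mentioned above are designed to avoid.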

2.3. Cloud Computing

Cloud Computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources, such as networks, servers, storage, applications, and services, which can be rapidly provisioned and released with minimal effort or service provider interaction [12]. The emergence of the current model is historically associated with the evolution of Amazon's system architecture. Amazon began its business selling books online but quickly expanded to other retail options and, as demand grew, so did its computational needs. That demand is highly seasonal, however, peaking between Thanksgiving and Christmas. In the traditional model of capacity planning, the resources allocated to Amazon's systems had to target the access peaks, leaving the investment in hardware underused during most of the year. Realizing this, Amazon began to remodel its architecture towards an on-demand shared pool of computational resources. In 2006, Amazon began renting out its excess computational resources [13] with the easy-to-remember description: "Sometimes you need a lot of processing power, and sometimes you need just a little. Sometimes you need a lot, but you only need it for a limited amount of time." That model quickly evolved into the current availability of public clouds from providers other than Amazon Web Services (AWS). Services such as Netflix and Dropbox explore those resources extensively, basing their whole operations on third-party public cloud infrastructure. The impact of the new model is changing industries such as television [14] and providing a special boost to e-science.

2.3.1. MapReduce

One of the most useful techniques for cloud-computing-based systems is MapReduce [15]. This programming model is well suited for processing and generating large datasets, and several computational tasks can be expressed in this model with some adaptations.
Systems specified in this model have a map function that processes a key-value pair and generates a set of intermediate key-value pairs that can be processed in parallel, and a reduce function that merges the intermediate results.
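The map and reduce functions just described can be sketched in-process with the canonical word-count example; the grouping step between the two phases stands in for the distributed shuffle that a real framework performs across machines:

```python
# Minimal in-process sketch of the MapReduce model: map emits intermediate
# key-value pairs, pairs are grouped by key (the "shuffle"), and reduce
# merges each group. Document names and contents are made up.
from collections import defaultdict

def map_fn(_, line):
    """Emit (word, 1) for every word in the input line."""
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    """Merge the intermediate values for one key."""
    return word, sum(counts)

def mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):   # map phase
            groups[k].append(v)           # shuffle: group by intermediate key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())  # reduce phase

docs = [("d1", "big data big graphs"), ("d2", "big clusters")]
print(mapreduce(docs, map_fn, reduce_fn))
# {'big': 3, 'data': 1, 'graphs': 1, 'clusters': 1}
```

Because each map call touches one record and each reduce call touches one key, a framework can run both phases in parallel across a cluster without changing the user's two functions.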

Figure 5 - MapReduce Overview (Wikimedia Commons)

Once a computational problem is adapted and expressed in this model, it is automatically ready for parallel execution on large clusters of computers. This approach has several benefits: it allows programmers to easily write parallel and distributed systems that use cluster resources; it makes use of heterogeneous, commodity machines instead of relying on homogeneous clusters; and it provides a uniform approach to implementing distributed systems, which allows the development of frameworks to automate and support the execution of applications built to the MapReduce model. This last benefit has been extensively explored. Any general Job Queues and Workers framework can support a MapReduce application. The model consists of implementing a user-defined perform method and then building up a pool of worker processes on several machines. Each worker periodically accesses a shared job queue, claims a job, and uses the job's associated inputs to execute its own perform method. Once it finishes, it clears the job from the queue and takes another one. A monitoring process can check whether workers are active and mark jobs as failed, or retry them if stalled. Resque is an example of such a framework and is employed on many computational problems, both for regular background job execution and for MapReduce [16].
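The Job Queues and Workers pattern above can be sketched with the standard library alone: worker threads stand in for the worker processes, and a shared queue for the job store. The `perform` body is a hypothetical placeholder for the user-defined work:

```python
# Sketch of the Job Queues and Workers pattern: each worker repeatedly
# claims a job from the shared queue, runs the user-defined perform
# method on its inputs, and clears it before taking another.
import queue
import threading

jobs = queue.Queue()
results = []
results_lock = threading.Lock()

def perform(n):
    """User-defined unit of work (hypothetical: square the input)."""
    return n * n

def worker():
    while True:
        try:
            n = jobs.get_nowait()   # claim a job from the shared queue
        except queue.Empty:
            return                  # queue drained; this worker exits
        outcome = perform(n)
        with results_lock:
            results.append(outcome)
        jobs.task_done()            # clear the job, then loop for another

for n in range(5):                  # enqueue five jobs
    jobs.put(n)

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(sorted(results))  # [0, 1, 4, 9, 16]
```

A monitoring process, retries for stalled jobs, and distribution across machines are what frameworks like Resque add on top of this core loop.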

Even though MapReduce can be executed on any general Job Queues framework, several specialized frameworks now exist, and they have grown to cover other aspects of distributed systems, with modules to handle distributed filesystems, databases, load balancing, and provisioning. Hadoop is an example of such a framework and is employed by several organizations [17].

2.4. Summary

In this chapter, we reviewed the two areas of software engineering involved in this dissertation: Linked Data and Cloud Computing. First, we discussed the founding ideas behind Linked Data and their impact on the unprecedented availability of data on the Web of Data. We then skimmed over the general Linked Data RDF model and discussed non-relational databases: how they differ from relational databases and how they scale well while still having all the characteristics necessary to store large RDF datasets. Finally, we described how Cloud Computing is currently evolving and explored the MapReduce technique, which enables a uniform and accessible way to harness the computational power available on public clouds, using very simple Job Queues and Workers frameworks or specialized ones. In the next chapter, we provide an overview of Linked Data keyword search and discuss a simplified definition of the tensor-based approach to searching large RDF datasets.