Federated Query Processing over Linked Data




An Evaluation of Approaches to Federated Query Processing over Linked Data
Peter Haase, Tobias Mathäß, Michael Ziller
fluid Operations AG, Walldorf, Germany
i-semantics, Graz, Austria, September 1, 2010

Agenda
- Motivation
- Approaches to Linked Data Query Processing
- Benchmark Definition
- Evaluation Results
- Conclusions and Future Work

Linked Data
- Web of Data: a globally distributed data space
- Publishing Linked Data
  - URIs for names of things, HTTP URI lookups to obtain structured data, links to other things
  - Many data sources provide a SPARQL endpoint
  - Many publishers make dumps available
- Potential of consuming Linked Data
  - Aggregation of data sources
  - Answering queries that cannot be answered by a single data source alone

Linked Open Data Cloud (figure; domains shown: Music, Online Activities, Geographic, Cross-Domain, Publications, Life Sciences)

Linked Data Applications (figure; source: TBL@TED)

Querying Linked Data: Alternatives
- Data Warehousing
  - Build a centralized repository from dumps
  - Centralized query processing
- Federated Query Processing
  - A mediator distributes subqueries to the relevant sources and integrates the results
  - Typically via SPARQL endpoints as query interface
- Automated Link Traversal
  - Linked Data URI lookups
  - Evaluation on a continuously augmented data set
  - Discovery of potentially relevant data during execution
  - Discovery driven by intermediate solutions
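The mediator idea behind federated query processing can be illustrated with a small sketch. This is not the Sesame or AliBaba implementation: it assumes in-memory sources standing in for SPARQL endpoints, and the names `Source`, `match`, and `federated_query`, as well as the sample triples, are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[str, str, str]  # terms starting with "?" are variables
Binding = Dict[str, str]


def match(pattern: Pattern, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Bind the variable, or reject if it is already bound differently.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


class Source:
    """Hypothetical stand-in for one SPARQL endpoint (assumption: in-memory triples)."""

    def __init__(self, name: str, triples: Iterable[Triple]) -> None:
        self.name = name
        self.triples = set(triples)

    def evaluate(self, pattern: Pattern, binding: Binding) -> List[Binding]:
        """Answer one triple pattern, given the bindings found so far."""
        return [b for t in self.triples if (b := match(pattern, t, binding)) is not None]


def federated_query(patterns: List[Pattern], sources: List[Source]) -> List[Binding]:
    """Mediator: send each pattern to every source and join the partial results."""
    solutions: List[Binding] = [{}]
    for pattern in patterns:
        solutions = [b for s in solutions for src in sources
                     for b in src.evaluate(pattern, s)]
    return solutions


# Toy data loosely modeled on query Q1.3 (all identifiers are illustrative).
dbpedia = Source("dbpedia", [
    ("dbpedia:Barack_Obama", "dbpedia-owl:party", "dbpedia:Democratic_Party"),
])
nytimes = Source("nytimes", [
    ("nyt:obama", "owl:sameAs", "dbpedia:Barack_Obama"),
    ("nyt:obama", "nytimes:topicPage", "page:obama"),
])
solutions = federated_query(
    [("?pres", "dbpedia-owl:party", "?party"),
     ("?x", "owl:sameAs", "?pres"),
     ("?x", "nytimes:topicPage", "?page")],
    [dbpedia, nytimes],
)
```

The sketch contacts every source for every pattern, which is the naive baseline; a practical mediator would restrict each subquery to the sources that can contribute.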

Sample Application Use Cases
1. Linked data portals
   - A carefully selected set of data sources is preprocessed and aggregated centrally in a portal
   - A primary goal is to enable efficient and reliable access to the data for a large user base
2. Ad hoc data analysis
   - Quickly explore a set of data sources, build federations on the fly
   - Potentially even select a subset of available data sources on a per-query basis

Requirements and Constraints
- Consumer side
  - Selection of data sources: How large? Changes over time?
  - Lifetime: How long is the federation intended to exist?
  - Queries: Characteristics of the queries, types, frequency?
  - Updates: Are updates to the data required?
- Provider side
  - Access: How are the data sources made accessible?
  - Interfaces: What kind of interfaces are exposed?
  - Service levels: What kind of guarantees or restrictions are made with respect to performance, response times, etc.?
  - Dynamics: How frequently do the data sources change?
  - Processing capabilities: On provider or consumer side?

Comparison
                                    Centralized repositories  Federation of               Link Traversal
                                    (Warehousing)             SPARQL endpoints
Source data                         Dumps                     SPARQL Endpoint             URI Lookups
Original data / up-to-dateness      No                        Yes                         Yes
Completeness of results             Yes                       Yes                         No
Dynamic selection of data sources   No                        Possible                    Yes
Query processing on                 Consumer-side             Consumer and provider-side  Consumer-side
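The link traversal approach in the comparison, where URI lookups are driven by intermediate solutions, can be sketched as follows. This is a minimal illustration under stated assumptions: `lookup` is a hypothetical function that simulates dereferencing a URI, `WEB` is a toy stand-in for the Web of Data, and the `max_lookups` bound is an added safeguard rather than part of the approach as presented.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[str, str, str]  # terms starting with "?" are variables
Binding = Dict[str, str]


def match(pattern: Pattern, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def evaluate(patterns: List[Pattern], triples: Iterable[Triple]) -> List[Binding]:
    """Evaluate a conjunction of triple patterns over a local set of triples."""
    triples = list(triples)
    solutions: List[Binding] = [{}]
    for pattern in patterns:
        solutions = [b for s in solutions for t in triples
                     if (b := match(pattern, t, s)) is not None]
    return solutions


def link_traversal(patterns: List[Pattern], seeds: Iterable[str],
                   lookup: Callable[[str], Iterable[Triple]],
                   max_lookups: int = 100) -> List[Binding]:
    """Grow the data set by looking up URIs that occur in intermediate solutions."""
    dataset: Set[Triple] = set()
    visited: Set[str] = set()
    frontier = list(seeds)
    while frontier and len(visited) < max_lookups:
        uri = frontier.pop()
        if uri in visited:
            continue
        visited.add(uri)
        dataset.update(lookup(uri))  # simulated URI lookup
        # Solutions of every pattern prefix count as intermediate solutions.
        for k in range(1, len(patterns) + 1):
            for solution in evaluate(patterns[:k], dataset):
                frontier.extend(v for v in solution.values() if v not in visited)
    return evaluate(patterns, dataset)


# Toy "web": URI -> triples returned when that URI is looked up (illustrative only).
WEB = {
    "dbpedia:Barack_Obama": [
        ("dbpedia:Barack_Obama", "dbpedia-owl:party", "dbpedia:Democratic_Party"),
        ("nyt:obama", "owl:sameAs", "dbpedia:Barack_Obama"),
    ],
    "nyt:obama": [("nyt:obama", "nytimes:topicPage", "page:obama")],
}
results = link_traversal(
    [("?x", "owl:sameAs", "dbpedia:Barack_Obama"), ("?x", "nytimes:topicPage", "?page")],
    seeds=["dbpedia:Barack_Obama"],
    lookup=lambda uri: WEB.get(uri, []),
)
```

Because only URIs reachable from the seed are ever looked up, the result can miss data that no traversed link points to, which matches the "No" entry for completeness in the comparison.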

Goals of the Benchmark / Evaluation
- Compare alternative approaches qualitatively and quantitatively
- Provide insights about their (dis-)advantages
- Assist application developers in choosing the right architecture given their requirements and constraints

Previous Benchmarks
- LUBM, Berlin SPARQL Benchmark, SP²Bench
- Focus on different aspects of query processing
- So far no focus on linked data style processing

Data Sets
- Real-life data sets from the LOD cloud (as opposed to synthetic)
- Two subsets, with different kinds of links between them
  - Cross-domain
  - Life Sciences
- Available via SPARQL endpoint and dump

Queries
- Definition of query mixes:
  1. Test specific features of a language
  2. Requirements of specific use cases
- Focus on aspects relevant for multiple, distributed sources:
  - Number of data sources involved
  - Complexity of the joins
  - Types of links between sources
  - Query result size
- 7 queries for each data set

Queries: Examples

Q1.1: Find all information about Barack Obama

    SELECT ?predicate ?object WHERE {
      { dbpedia:Barack_Obama ?predicate ?object }
      UNION
      { ?subject owl:sameAs dbpedia:Barack_Obama .
        ?subject ?predicate ?object }
    }

Q1.3: Return for all US presidents their party membership and news pages

    SELECT ?pres ?party ?page WHERE {
      ?pres rdf:type dbpedia-owl:President .
      ?pres dbpedia-owl:person/nationality dbpedia:United_States .
      ?pres dbpedia-owl:person/party ?party .
      ?x owl:sameAs ?pres .
      ?x nytimes:topicPage ?page
    }
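As an illustration of how a query such as Q1.1 might be sent to a SPARQL endpoint, the sketch below builds an HTTP GET request following the SPARQL protocol and parses a response in the SPARQL JSON results format. The endpoint URL is a placeholder, the helper names `build_request` and `parse_bindings` are hypothetical, and the request is only constructed here, not sent.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode
from urllib.request import Request

# Q1.1 with the prefix declarations it needs to be a complete query.
Q1_1 = """
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dbpedia: <http://dbpedia.org/resource/>
SELECT ?predicate ?object WHERE {
  { dbpedia:Barack_Obama ?predicate ?object }
  UNION
  { ?subject owl:sameAs dbpedia:Barack_Obama .
    ?subject ?predicate ?object }
}
"""


def build_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def parse_bindings(body: str) -> List[Dict[str, str]]:
    """Flatten a SPARQL JSON result document into a list of {variable: value} rows."""
    data = json.loads(body)
    return [{var: cell["value"] for var, cell in row.items()}
            for row in data["results"]["bindings"]]


# Placeholder endpoint (assumption); passing the request to
# urllib.request.urlopen would actually send it.
request = build_request("http://example.org/sparql", Q1_1)
```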

Configurations evaluated in the benchmark (figure)
a) Integration in a central repository: queries go to one central repository
b) Federation over multiple single repositories: queries go to a federation layer over one single repository per data source
c) Federation over multiple SPARQL endpoints: queries go to a federation layer over one SPARQL endpoint per data source

Benchmark Environment and Performance Measures
- Focus on architectural alternatives, not complete coverage of systems
- Evaluation within the Sesame Framework
  a) Integration in a single repository: Sesame native store, default configuration
  b) Federation over multiple single repositories: Sesame native stores + Federation SAIL from AliBaba
  c) Federation over multiple SPARQL endpoints: Federation SAIL + original SPARQL endpoints
- 2x 3 GHz Intel Xeon server, 20 GB RAM, 64-bit JRE
- Performance measures
  - Load time (if applicable)
  - Query time
- Assumption: data sources known a priori, no dynamic discovery of sources
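The two performance measures, load time and query time, could be collected with a harness along the following lines. This is a sketch, not the benchmark's actual driver: `load` and the entries of `queries` are hypothetical callables, and averaging over several runs is an assumption that the slides do not state.

```python
import time
from statistics import mean
from typing import Any, Callable, Dict, Tuple


def timed(fn: Callable[..., Any], *args: Any) -> Tuple[Any, float]:
    """Run `fn(*args)` and return its result with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def benchmark(load: Callable[[], Any],
              queries: Dict[str, Callable[[Any], Any]],
              runs: int = 3) -> Dict[str, Any]:
    """Measure load time once and the mean query time per query over `runs` runs."""
    store, load_time = timed(load)
    query_times = {
        name: mean(timed(query, store)[1] for _ in range(runs))
        for name, query in queries.items()
    }
    return {"load_time": load_time, "query_time": query_times}


# Toy usage: the "store" is a list and the "query" counts its elements.
report = benchmark(lambda: list(range(1000)), {"count": len})
```

For the federated configurations over remote endpoints, where load time does not apply, `load` would simply return a handle to the federation.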

Configurations evaluated using the benchmark (figure)
a) Integration in a central repository: Sesame Native Store
b) Federation over multiple single repositories: AliBaba Federation SAIL over Sesame Native Stores
c) Federation over multiple SPARQL endpoints: AliBaba Federation SAIL over SPARQL endpoints

Results
- Centralized approach performs best in most cases (all data is local, optimizer has complete knowledge)
- Federation only practical for simple queries
- Federation as a simple means for parallelization and distribution of workloads

Results and Conclusions
- Centralized approach unavoidable for subsecond response times to more complex queries
- Federation over linked data still in its infancy
- Huge potential for linked data federation: ad hoc integration and analytics
- Approaches to federated query optimization needed
  - Statistics and summaries of data sources required, cf. VoID
  - Cost models for linked data processing
  - Potential for reuse of work from distributed databases
- Goal: virtualized access to linked data sources
  - Abstract the applications from the specific setup of the data sources (e.g., local vs. remote, federation and distribution)
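To make the call for statistics and cost models concrete, the sketch below shows one way per-source summaries could drive source selection and join ordering. It assumes VoID-like summaries reduced to a triple count per predicate; the source names, the counts, and the functions `select_sources`, `estimated_cardinality`, and `plan` are all hypothetical, and the cost model (smallest estimated pattern first) is deliberately simplistic.

```python
from typing import Dict, List, Tuple

Pattern = Tuple[str, str, str]  # terms starting with "?" are variables
Summary = Dict[str, int]        # predicate -> number of triples in a source

# Illustrative summaries; the numbers are invented for this example.
SUMMARIES: Dict[str, Summary] = {
    "dbpedia": {"rdf:type": 5000, "dbpedia-owl:party": 300},
    "nytimes": {"owl:sameAs": 100, "nytimes:topicPage": 100},
}


def select_sources(pattern: Pattern, summaries: Dict[str, Summary]) -> List[str]:
    """Return the sources that can contribute triples for the pattern's predicate."""
    predicate = pattern[1]
    if predicate.startswith("?"):
        return sorted(summaries)  # unbound predicate: every source is relevant
    return sorted(name for name, s in summaries.items() if predicate in s)


def estimated_cardinality(pattern: Pattern, summaries: Dict[str, Summary]) -> int:
    """Estimate result size as the total number of triples with that predicate."""
    predicate = pattern[1]
    if predicate.startswith("?"):
        return sum(sum(s.values()) for s in summaries.values())
    return sum(s.get(predicate, 0) for s in summaries.values())


def plan(patterns: List[Pattern],
         summaries: Dict[str, Summary]) -> List[Tuple[Pattern, List[str]]]:
    """Order patterns by estimated size and attach the relevant sources to each."""
    ordered = sorted(patterns, key=lambda p: estimated_cardinality(p, summaries))
    return [(p, select_sources(p, summaries)) for p in ordered]


query_plan = plan(
    [("?pres", "rdf:type", "dbpedia-owl:President"),
     ("?pres", "dbpedia-owl:party", "?party"),
     ("?x", "owl:sameAs", "?pres")],
    SUMMARIES,
)
```

With such a plan, each subquery is sent only to the sources listed for it, instead of to every source in the federation.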

Summary
- Discussion and analysis of alternative approaches to query processing over distributed linked data sources
  - No single best solution for querying linked data
  - Constraints by application requirements and how data is published
- Definition of a benchmark: data sources, queries, performance measures
- Results of experiments
  - Federation only practical for simple cases
- Future and ongoing work
  - Evaluation of additional approaches / implementations
  - More comprehensive classification of queries
- Open invitation to participate in the development of the benchmark: http://code.google.com/p/fbench/

CONTACT US:
fluid Operations
Altrottstr. 31, Walldorf, Germany
Email: info@fluidops.com
Website: www.fluidops.com
Tel.: +49 6227 3849-567