Federated Query Processing over Linked Data

1 An Evaluation of Approaches to Federated Query Processing over Linked Data
Peter Haase, Tobias Mathäß, Michael Ziller
fluid Operations AG, Walldorf, Germany
i-semantics, Graz, Austria, September 1, 2010

2 Agenda
- Motivation
- Approaches to Linked Data Query Processing
- Benchmark Definition
- Evaluation Results
- Conclusions and Future Work

3 Linked Data
- Web of Data: a globally distributed data space
- Publishing Linked Data
  - URIs for names of things, HTTP URI lookups to obtain structured data, links to other things
  - Many data sources provide a SPARQL endpoint
  - Many publishers make dumps available
- Potential of consuming Linked Data
  - Aggregation of data sources
  - Answering queries that cannot be answered by a single data source alone

4 Linked Open Data Cloud
[Figure: Linked Open Data cloud diagram, grouped by domain: Music, Online Activities, Geographic, Cross-Domain, Publications, Life Sciences]

5 Linked Data Applications Source:

6 Querying Linked Data: Alternatives
- Data Warehousing
  - Build a centralized repository from dumps
  - Centralized query processing
- Federated Query Processing
  - A mediator distributes subqueries to the relevant sources and integrates the results
  - Typically via SPARQL endpoints as query interface
- Automated Link Traversal
  - Linked Data URI lookups
  - Evaluation on a continuously augmented data set
  - Discovery of potentially relevant data during execution
  - Discovery driven by intermediate solutions
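The mediator idea behind federated query processing can be illustrated with a small sketch. This is not the implementation evaluated in the talk: the `Source` class stands in for a remote SPARQL endpoint, the triples and the `ex:`/`nyt:` names are invented, and the join strategy (pushing each intermediate binding into the next triple pattern) is just one simple option.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]


def is_var(term: str) -> bool:
    """Variables are written SPARQL-style with a leading question mark."""
    return term.startswith("?")


def match(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    result = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            # Bind the variable, or reject if it is already bound differently.
            if result.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return result


class Source:
    """Hypothetical stand-in for a remote SPARQL endpoint."""

    def __init__(self, name: str, triples: List[Triple]) -> None:
        self.name = name
        self.triples = triples

    def evaluate(self, pattern: Triple, binding: Binding) -> Iterator[Binding]:
        for triple in self.triples:
            extended = match(pattern, triple, binding)
            if extended is not None:
                yield extended


def federated_query(sources: List[Source], patterns: List[Triple]) -> List[Binding]:
    """Send each triple pattern to every source and join at the mediator."""
    solutions: List[Binding] = [{}]
    for pattern in patterns:
        next_solutions: List[Binding] = []
        for binding in solutions:
            for source in sources:
                next_solutions.extend(source.evaluate(pattern, binding))
        solutions = next_solutions
    return solutions


# Invented example data: two sources linked by an owl:sameAs statement.
dbpedia = Source("dbpedia", [("ex:obama", "ex:party", "ex:democrats")])
nytimes = Source("nytimes", [
    ("nyt:obama", "owl:sameAs", "ex:obama"),
    ("nyt:obama", "nyt:topicPage", "http://example.org/obama"),
])

results = federated_query(
    [dbpedia, nytimes],
    [("?pres", "ex:party", "?party"),
     ("?x", "owl:sameAs", "?pres"),
     ("?x", "nyt:topicPage", "?page")],
)
```

Neither source can answer the three patterns alone; the answer only appears once the mediator joins the partial results. Every pattern is sent to every source here, which is exactly the cost that source selection and query optimization aim to reduce.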

7 Sample Application Use Cases
1. Linked data portals
   - A carefully selected set of data sources is preprocessed and aggregated centrally in a portal
   - A primary goal is to enable efficient and reliable access to the data for a large user base
2. Ad hoc data analysis
   - Quickly explore a set of data sources, build federations on the fly
   - Potentially even select a subset of the available data sources on a per-query basis

8 Requirements and Constraints
- Consumer side
  - Selection of data sources: How large? Changes over time?
  - Lifetime: How long is the federation intended to exist?
  - Queries: Characteristics of the queries, types, frequency?
  - Updates: Are updates to the data required?
- Provider side
  - Access: How are the data sources made accessible?
  - Interfaces: What kind of interfaces are exposed?
  - Service levels: What kind of guarantees or restrictions are made with respect to performance, response times, etc.?
  - Dynamics: How frequently do the data sources change?
- Processing capabilities: On provider or consumer side?

9 Comparison
                                   Centralized repositories   Federation of                Link Traversal
                                   (Warehousing)              SPARQL endpoints
Source data                        Dumps                      SPARQL endpoint              URI lookups
Original data / up-to-dateness     No                         Yes                          Yes
Completeness of results            Yes                        Yes                          No
Dynamic selection of data sources  No                         Possible                     Yes
Query processing on                Consumer-side              Consumer and provider-side   Consumer-side

10 Goals of the Benchmark / Evaluation
- Compare alternative approaches qualitatively and quantitatively
- Provide insights about their (dis-)advantages
- Assist application developers in choosing the right architecture given their requirements and constraints

11 Previous Benchmarks
- LUBM, Berlin SPARQL Benchmark, SP²Bench
- Focus on different aspects of query processing
- So far no focus on linked data style processing

12 Data Sets
- Real-life data sets from the LOD cloud (as opposed to synthetic)
- Two subsets, with different kinds of links between them
  - Cross-domain
  - Life Sciences
- Available via SPARQL endpoint and dump

13 Queries
- Definition of query mixes:
  1. Test specific features of a language
  2. Requirements of specific use cases
- Focus on aspects relevant for multiple, distributed sources:
  - Number of data sources involved
  - Complexity of the joins
  - Types of links between sources
  - Query result size
- 7 queries for each data set

14 Queries: Examples
Q1.1: Find all information about Barack Obama
    SELECT ?predicate ?object WHERE {
      { dbpedia:Barack_Obama ?predicate ?object }
      UNION
      { ?subject owl:sameAs dbpedia:Barack_Obama .
        ?subject ?predicate ?object }
    }
Q1.3: Return for all US presidents their party membership and news pages
    SELECT ?pres ?party ?page WHERE {
      ?pres rdf:type dbpedia-owl:President .
      ?pres dbpedia-owl:person/nationality dbpedia:United_States .
      ?pres dbpedia-owl:person/party ?party .
      ?x owl:sameAs ?pres .
      ?x nytimes:topicPage ?page
    }
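To make the shape of Q1.1 concrete, the following sketch evaluates its two UNION branches over a small in-memory triple list. It only illustrates the query semantics and is not part of the benchmark; the triples and the `ex:`/`nyt:` identifiers are invented for the example.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

SAME_AS = "owl:sameAs"


def describe(entity: str, triples: List[Triple]) -> Set[Tuple[str, str]]:
    """Return (predicate, object) pairs for `entity`, following Q1.1."""
    # Branch 1: { entity ?predicate ?object }
    direct = {(p, o) for s, p, o in triples if s == entity}
    # Branch 2: { ?subject owl:sameAs entity . ?subject ?predicate ?object }
    aliases = {s for s, p, o in triples if p == SAME_AS and o == entity}
    via_alias = {(p, o) for s, p, o in triples if s in aliases}
    return direct | via_alias


# Invented triples, as if merged from two linked data sources.
merged = [
    ("ex:obama", "rdfs:label", "Barack Obama"),
    ("nyt:obama", "owl:sameAs", "ex:obama"),
    ("nyt:obama", "nyt:topicPage", "http://example.org/topic/obama"),
]

info = describe("ex:obama", merged)
```

The second branch is what makes the query span sources: the topic page is stated about `nyt:obama`, and it is only reachable through the `owl:sameAs` link.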

15 Configurations evaluated in the benchmark
[Figure: three architectures, each receiving a query]
a) Integration in a central repository: Query -> Central Repository
b) Federation over multiple single repositories: Query -> Federation -> three single repositories
c) Federation over multiple SPARQL endpoints: Query -> Federation -> three SPARQL endpoints, each in front of a data source

16 Benchmark Environment and Performance Measures
- Focus on architectural alternatives, not complete coverage of systems
- Evaluation within the Sesame framework
  a) Integration in a single repository: Sesame native store, default configuration
  b) Federation over multiple single repositories: Sesame native stores + Federation SAIL from AliBaba
  c) Federation over multiple SPARQL endpoints: Federation SAIL + original SPARQL endpoints
- Hardware: 2x 3 GHz Intel Xeon server, 20 GB RAM, 64-bit JRE
- Performance measures
  - Load time (if applicable)
  - Query time
- Assumption: data sources are known a priori, no dynamic discovery of sources
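The two performance measures, load time and query time, can be collected with a very small harness. The sketch below is an assumption about how such measurements might be taken, not the harness used in the evaluation: the in-memory `Store`, the function names, and the choice to report the fastest of a few repetitions are all illustrative.

```python
import time
from typing import Callable, Dict, List, Tuple, TypeVar

T = TypeVar("T")
Store = List[Tuple[str, str, str]]


def timed(action: Callable[[], T]) -> Tuple[T, float]:
    """Run `action` and return its result with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


def run_benchmark(
    load: Callable[[], Store],
    queries: Dict[str, Callable[[Store], list]],
    repetitions: int = 3,
) -> Dict[str, float]:
    """Measure load time once and, per query, the fastest of `repetitions` runs."""
    store, load_time = timed(load)
    report = {"load": load_time}
    for name, query in queries.items():
        runs = [timed(lambda: query(store))[1] for _ in range(repetitions)]
        report[name] = min(runs)  # assumption: report the best run
    return report


def load_store() -> Store:
    """Hypothetical load step: build 1000 synthetic triples in memory."""
    return [("ex:s%d" % i, "ex:p", "ex:o%d" % (i % 10)) for i in range(1000)]


def query_o3(store: Store) -> list:
    """Hypothetical query: all subjects whose object is ex:o3."""
    return [s for s, p, o in store if o == "ex:o3"]


report = run_benchmark(load_store, {"q_o3": query_o3})
```

For a federation over remote SPARQL endpoints there is nothing to load, which is why load time is marked "if applicable"; in this sketch that case would correspond to a `load` function that only opens connections.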

17 Configurations evaluated using the benchmark
[Figure: the three architectures with the concrete components used]
a) Integration in a central repository: Query -> Sesame Native Store
b) Federation over multiple single repositories: Query -> AliBaba Federation SAIL -> three Sesame Native Stores
c) Federation over multiple SPARQL endpoints: Query -> AliBaba Federation SAIL -> three SPARQL endpoints, each in front of a data source

18 Results
- Centralized approach performs best in most cases (all data is local, optimizer has complete knowledge)
- Federation only practical for simple queries
- Federation as a simple means for parallelization and distribution of workloads

19 Results and Conclusions
- Centralized approach unavoidable for subsecond response times to more complex queries
- Federation over linked data still in its infancy
- Huge potential for linked data federation: ad hoc integration and analytics
- Approaches to federated query optimization needed
  - Statistics and summaries of data sources required, cf. VoID
  - Cost models for linked data processing
  - Potential for reuse of work from distributed databases
- Goal: virtualized access to linked data sources
  - Abstract the applications from the specific setup of the data sources (e.g., local vs. remote, federation and distribution)
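How statistics about data sources could feed a federated optimizer can be sketched as follows. The per-predicate triple counts are loosely modelled on the kind of information a VoID description can carry, but the numbers, source names, and the very simple cost model (order patterns by estimated result size) are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]
Stats = Dict[str, Dict[str, int]]

# Hypothetical per-source statistics: predicate -> number of triples.
STATS: Stats = {
    "dbpedia": {"rdf:type": 5000, "ex:party": 300},
    "nytimes": {"owl:sameAs": 100, "nyt:topicPage": 100},
}


def relevant_sources(pattern: Triple, stats: Stats) -> List[str]:
    """Select the sources that can contribute to a triple pattern."""
    predicate = pattern[1]
    if predicate.startswith("?"):
        return sorted(stats)  # unbound predicate: any source may contribute
    return sorted(name for name, counts in stats.items() if predicate in counts)


def estimated_cardinality(pattern: Triple, stats: Stats) -> int:
    """Estimate the result size of a pattern from the predicate counts."""
    predicate = pattern[1]
    if predicate.startswith("?"):
        return sum(sum(counts.values()) for counts in stats.values())
    return sum(counts.get(predicate, 0) for counts in stats.values())


def plan(patterns: List[Triple], stats: Stats) -> List[Tuple[Triple, List[str]]]:
    """Order patterns by estimated size and attach their relevant sources."""
    ordered = sorted(patterns, key=lambda p: estimated_cardinality(p, stats))
    return [(p, relevant_sources(p, stats)) for p in ordered]


query_plan = plan(
    [("?pres", "rdf:type", "ex:President"),
     ("?pres", "ex:party", "?party"),
     ("?x", "owl:sameAs", "?pres")],
    STATS,
)
```

With these statistics each pattern is routed to a single source instead of being broadcast, and the most selective pattern is evaluated first. A realistic cost model would also account for join connectivity and network cost, which this sketch ignores.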

20 Summary
- Discussion and analysis of alternative approaches to query processing over distributed linked data sources
  - No single best solution for querying linked data
  - Constraints by application requirements and how data is published
- Definition of a benchmark: data sources, queries, performance measures
- Results of experiments
  - Federation only practical for simple cases
- Future and ongoing work
  - Evaluation of additional approaches / implementations
  - More comprehensive classification of queries
  - Open invitation to participate in the development of the benchmark

21 CONTACT US: fluid Operations, Altrottstr. 31, Walldorf, Germany