DataOps: Seamless End-to-end Anything-to-RDF Data Integration

Similar documents
LDIF - Linked Data Integration Framework

Scalable End-User Access to Big Data HELLENIC REPUBLIC National and Kapodistrian University of Athens

The Optique Project: Towards OBDA Systems for Industry (Short Paper)

ON DEMAND ACCESS TO BIG DATA THROUGH SEMANTIC TECHNOLOGIES. Peter Haase fluid Operations AG

ON DEMAND ACCESS TO BIG DATA. Peter Haase fluid Operations AG

Smart Cities require Geospatial Data Providing services to citizens, enterprises, visitors...

Enabling End User Access to Big Data in the O&G Industry

Revealing Trends and Insights in Online Hiring Market Using Linking Open Data Cloud: Active Hiring a Use Case Study

THE SEMANTIC WEB AND IT`S APPLICATIONS

Creating and Utilizing Linked Open Statistical Data for the Development of Advanced Analytics Services

How To Build A Cloud Based Intelligence System

Benchmarking the Performance of Storage Systems that expose SPARQL Endpoints

TDAQ Analytics Dashboard

Databricks. A Primer

MicroStrategy Course Catalog

On Rewriting and Answering Queries in OBDA Systems for Big Data (Short Paper)

Publishing Linked Data Requires More than Just Using a Tool

Databricks. A Primer

Mining the Web of Linked Data with RapidMiner

Ganzheitliches Datenmanagement

Data Catalogs for Hadoop Achieving Shared Knowledge and Re-usable Data Prep. Neil Raden Hired Brains Research, LLC

Towards Semantic Mashup Tools for Big Data Analysis

Automated Data Ingestion. Bernhard Disselhoff Enterprise Sales Engineer

Data Virtualization for Agile Business Intelligence Systems and Virtual MDM. To View This Presentation as a Video Click Here

Sieve: Linked Data Quality Assessment and Fusion

Unifi Technology Group & Software Toolbox, Inc. Executive Summary. Building the Infrastructure for emanufacturing

Department of Defense. Enterprise Information Warehouse/Web (EIW) Using standards to Federate and Integrate Domains at DOD

<Insert Picture Here> Oracle BI Standard Edition One The Right BI Foundation for the Emerging Enterprise

Lightweight Data Integration using the WebComposition Data Grid Service

How semantic technology can help you do more with production data. Doing more with production data

Big Data, Cloud Computing, Spatial Databases Steven Hagan Vice President Server Technologies

ORACLE HEALTHCARE ANALYTICS DATA INTEGRATION

Oracle BI Applications (BI Apps) is a prebuilt business intelligence solution.

Federated Query Processing over Linked Data

Making big data simple with Databricks

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Data Virtualization. Paul Moxon Denodo Technologies. Alberta Data Architecture Community January 22 nd, Denodo Technologies

Big Data, Fast Data, Complex Data. Jans Aasman Franz Inc

SQL Server 2012 Business Intelligence Boot Camp

SimCorp Solution Guide

Oracle Database 11g Comparison Chart

Experiences from a Large Scale Ontology-Based Application Development

LinkZoo: A linked data platform for collaborative management of heterogeneous resources

Oracle Data Integrator 12c (ODI12c) - Powering Big Data and Real-Time Business Analytics. An Oracle White Paper October 2013

Towards a Uniform User Interface for Editing Mapping Definitions

Short Paper: Enabling Lightweight Semantic Sensor Networks on Android Devices

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

Evaluating SPARQL-to-SQL translation in ontop

Linked Open Data A Way to Extract Knowledge from Global Datastores

Moving From Hadoop to Spark

Using Big Data Analytics for Financial Services Regulatory Compliance

Extending The Value of SAP with the SAP BusinessObjects Business Intelligence Platform Product Integration Roadmap

How To Write A Drupal Rdf Plugin For A Site Administrator To Write An Html Oracle Website In A Blog Post In A Flashdrupal.Org Blog Post

Introduction to Oracle Business Intelligence Standard Edition One. Mike Donohue Senior Manager, Product Management Oracle Business Intelligence

Distributed Query Processing on the Cloud: the Optique Point of View (Short Paper)

VIVO Dashboard A Drupal-based tool for harvesting and executing sophisticated queries against data from a VIVO instance

Extend your analytic capabilities with SAP Predictive Analysis

RODI: A Benchmark for Automatic Mapping Generation in Relational-to-Ontology Data Integration

The Future of Data Management

City Data Pipeline. A System for Making Open Data Useful for Cities. stefan.bischof@tuwien.ac.at

Building Your Big Data Team

Scope. Cognescent SBI Semantic Business Intelligence

Your Path to. Big Data A Visual Guide

Data-Gov Wiki: Towards Linked Government Data

Best Practices for Hadoop Data Analysis with Tableau

COLINDA: Modeling, Representing and Using Scientific Events in the Web of Data

How to avoid building a data swamp

Smart Financial Data: Semantic Web technology transforms Big Data into Smart Data

Accelerate BI Initiatives With Self-Service Data Discovery And Integration

Visual Analysis of Statistical Data on Maps using Linked Open Data

An Ontological Approach to Oracle BPM

Linked Data Interface, Semantics and a T-Box Triple Store for Microsoft SharePoint

Software Architecture Document

Cloud3DView: Gamifying Data Center Management

SELLING PROJECTS ON THE MICROSOFT BUSINESS ANALYTICS PLATFORM

Pervasive Software + NetSuite = Seamless Cloud Business Processes

Cray: Enabling Real-Time Discovery in Big Data

Sustainable Development with Geospatial Information Leveraging the Data and Technology Revolution

Yet Another Triple Store Benchmark? Practical Experiences with Real-World Data

Business Intelligence Discover a wider perspective for your business

GoodData. Platform Overview

TECHNICAL Reports. Discovering Links for Metadata Enrichment on Computer Science Papers. Johann Schaible, Philipp Mayr

Enhance Performance Management Reporting

UIMA and WebContent: Complementary Frameworks for Building Semantic Web Applications

D5.3.2b Automatic Rigorous Testing Components

Optique Zooming In on Big Data Access

White Paper: Datameer s User-Focused Big Data Solutions

IBM Software Integrating and governing big data

SQLstream 4 Product Brief. CHANGING THE ECONOMICS OF BIG DATA SQLstream 4.0 product brief

Providing real-time, built-in analytics with S/4HANA. Jürgen Thielemans, SAP Enterprise Architect SAP Belgium&Luxembourg

The Ontological Approach for SIEM Data Repository

Semantic Web Applications

Big Data and Semantic Web in Manufacturing. Nitesh Khilwani, PhD Chief Engineer, Samsung Research Institute Noida, India

Ali Ghodsi Head of PM and Engineering Databricks

By Makesh Kannaiyan 8/27/2011 1

Rapid Analytics. A visual, live approach to requirements gathering and business analytic development Mark Marinelli, VP of Product Management

The ESB and Microsoft BI

In principle, SAP BW architecture can be divided into three layers:

IDE Integrated RDF Exploration, Access and RDF-based Code Typing with LITEQ

Map-On: A web-based editor for visual ontology mapping

Transcription:

DataOps: Seamless End-to-end Anything-to-RDF Data Integration Christoph Pinkel, Andreas Schwarte, Johannes Trame, Andriy Nikolov, Ana Sasa Bastinos, and Tobias Zeuch fluid Operations AG, Walldorf, Germany firstname.lastname@fluidops.com Abstract. While individual components for semantic data integration are commonly available, end-to-end solutions are rare. We demonstrate DataOps, a seamless Anything-to-RDF semantic data integration toolkit. DataOps supports the integration of both semantic and non-semantic data from an extensible host of different formats. Setting up data sources end-to-end works in three steps: (1) accessing the data from arbitrary locations in different formats, (2) specifying mappings depending on the data format (e.g., R2RML for relational data), and (3) consolidating new data with existing data instances (e.g., by establishing owl:sameas links). All steps are supported through a fully integrated Web interface with configuration forms and different mapping editors. Visitors of the demo will be able to perform all three steps of the integration process. 1 Introduction In recent years semantic data integration has evolved to an important application area in the industry: software eco systems in companies become more and more complex, produce large amounts of heterogenous information, and make it harder and harder to get a holistic view on the company s knowledge. Traditionally, in such situations dedicated ETL-style systems are used for the analysis. Functionality is provided by large scale data warehouse systems or, more recently, by big data frameworks such as Hadoop YARN [1] that work as a data operating systems running a mix of data warehousing and other applications. Those systems share an important property for enterprise data analysis: available as ready solutions, they include everything - from assisted setup, a broad selection of access methods, over graphical configuration interfaces to a comprehensive documentation and support. However, they come along with a downside: in classical data warehouses a dedicated, global warehousing schema must be designed, mappings must be constructed, and the resulting schema must be documented and communicated to users. For Hadoop-like systems, this is not necessarily the case, as a mix of relational and non-relational workloads are possible with different applications in the system. This comes at the price of either a very small set of supported queries and little flexibility, or involves even more initial effort for programming

all the tasks and queries to be supported. Worse, with a number of data sources that quickly change in structure, maintenance of the resulting schema, mapping and queries can quickly become a nightmare in either case. Often enough, the effort for setup and maintenance becomes unacceptable, especially if some data sources are complex in structure. This contributes to the current situation where enterprises are assumed to analyze less than one sixth of their potentially relevant data. 1 Semantic data integration, with its flexible graph model and vocabularies, is one possible and natural way to address this predicament. To leverage the potential of semantic data integration in such cases, an integration environment needs to be fully functional and sufficiently usable, much the same as enterprise data warehousing solutions. While individual components for semantic data integration are commonly available, end-to-end solutions that fulfill all requirements are rare. Existing frameworks such as the Linked Data Integration Framework [3] focus on web-scale data rather than enterprise data. Similarly, many systems composed of loosely coupled special-purpose components (where setup or maintenance involve intense human effort) fail the one key requirement that motivates their use in the first place: to significantly reduce overall effort in configuration and maintenance. With DataOps, we target the challenge of delivering an end-to-end solution for semantic ETL data integration that supports seamless setup and configuration as well as convenient maintenance procedures. DataOps delivers all primarily relevant components as a toolkit out of the box, fills any gaps between them and offers an integrated user interface, built on industry-proven platform technology. We describe details about the DataOps demonstration in Section 2 before we discuss selected related systems in Section 3. Finally, we conclude with the contributions of our demo (Section 4). 2 DataOps Demo We demonstrate DataOps, a seamless Anything-to-RDF semantic data integration toolkit. DataOps supports the integration of both semantic and nonsemantic data from a host of different formats, including relational databases, CSV, Excel, XML, JSON, existing RDF graphs and others. Additional source formats can be integrated through an extension mechanism. We demonstrate this with a specialized data source that allows to directly access results from the statistical software R. In addition, each data source can be accessed from different locations within an organization, e.g., as local files, from network shares (which may optionally require authentication), from the Web through HTTP, or even through custom protocols. As an integrated toolkit, DataOps supports all setup, configuration and maintenance steps through a fully integrated Web interface with configuration forms 1 According to a recent study of business analysts [2], surveying several hundreds of enterprises.

Fig. 1: DataOps Process and different mapping editors. Setting up data sources end-to-end is implemented in a three-step process (Figure 1): 1. Accessing the different data sources from arbitrary locations through different mechanisms (Figure 2a). 2. Specifying mappings depending on the data source format (Figure 2b). 3. Consolidating new data with existing data instances, e.g., by establishing owl:sameas links (Figure 2c). For some of its features, DataOps makes use of established external components that are integrated in the backend. In particular, ETL extraction of relational databases currently relies on DB2Triples 2 and entity reconcilation uses Silk [4]. All modules are plug-able due to generic interfaces and standards. Postprocessor modules can even be stacked as pipelines of sub components. For example, other standard R2RML mapping engines can be hooked in, if required. The interface used for editing relational-to-ontology mappings is a newer version of the prototype presented in [5]. DataOps builds on the technology of a semantic platform, Information Workbench [6]. Information Workbench is a platform for Linked Data application development, including flexible and extensible interfaces for semantic integration, visualization and collaboration. It provides features such as managed triple store connectivity, SPARQL federation [7], ontology management, or exploration and visualization of resulting data. More specific features for data integration and business analytics use cases can be installed as Apps (e.g., more advanced visualization components, or automatic mapping support [8]). DataOps is available both under a commercial license and under Open Source terms, as part of the Information Workbench. 3 Visitors of the demo will be able to perform all three steps of the integration process. In addition, visitors will be able to view and explore integrated data with widget provided by the platform. 2 https://github.com/antidot/db2triples/ 3 http://www.fluidops.com/en/company/training/open_source.

(a) Step 1: Configuration of Data Sources (b) Step 2: Example of a Mapping Editor (R2RML) (c) Step 3: Reconcilation Rules Fig. 2: Setup/Configuration Steps Demonstrated 3 Related Work Individual connector, extractor, translation and editor components that can be hooked together for disparate ETL style tasks are commonly available, including some of those used in DataOps (e.g., [4, 6, 5]). For end-to-end solutions, however, they are often hard to configure and to use, especially when disparate data sources (e.g., relational databases, XML, Excel, and RDF) each require different components in the setup. There are recent approaches to target this heterogeneity challenge in semantic data integration, such as RML [9], a unified mapping language to translate different source formats into RDF. However, those (and similar) approaches are still isolated components when it comes to providing users with an end-to-end toolkit. A few end-to-end semantic data integration platforms exist so far. However, they usually do not focus on the needs of enterprise data integration but rather on other use cases (e.g., on RDF data sources or on Web scale data rather than enterprise data). For instance, UnifiedViews [10] provides easy to use interfaces, however, focuses on processing of pre-existing RDF data. LDIF [3] provides an

interface for some of the surrounding tasks but also has a focus on RDF data, even though relational data is also supported. Some recent projects (e.g., the Optique platform [11]) aim at providing an integrated platform for large-scale enterprise data integration but at the same time focus to push on other limitations such as federated, virtualized scale-out or even streaming data. 4 Conclusion We presented DataOps, a seamless Anything-to-RDF semantic data integration toolkit. In contrast to most existing frameworks, DataOps assumes data sources in any number of mostly non-semantic source data formats, available from different locations within one organization. DataOps also integrates all procedural and lifecycle aspects of setting up and providing integrated data in a single platform, offering a seamless user experience. It is thus novel as a targeted solution for semantic enterprise data integration with a focus on many different, heterogeneous data formats and ease of use. In the demo, users can try all aspects of DataOps for themselves using both locally provided data and data accessible from the Web. References 1. Vavilapalli, V.K. et al.: Apache Hadoop YARN: Yet Another Resource Negotiator. In: SOCC 2013. (2013) 2. Evelson, B., Kisker, H., Bennett, M., Christakis, S.: Benchmark Your BI Environment. Technical report, Forrester Research, Inc. (October 2013) 3. Schultz, A., Matteini, A., Isele, R., Bizer, C., Becker, C.: LDIF - Linked Data Integration Framework. In: COLD 20122. (2011) 4. Volz, J., Bizer, C., Gaedke, M., Kobilarov, G.: Discovering and Maintaining Links on the Web of Data. In: ISWC. (2009) 5. Pinkel, C., Binnig, C., Haase, P., Sengupta, K., Trame, J., Martin, C.: How to Best Find a Partner? An Evaluation of Editing Approaches to Construct R2RML Mappings. In: ESWC. (2014) 6. Haase, P., Schmidt, M., Schwarte, A.: The Information Workbench as a Self-Service Platform for Linked Data Applications. In: COLD. (2011) 7. Schwarte, A., Haase, P., Hose, K., Schenkel, R., Schmidt, M.: Fedx: Optimization Techniques for Federated Query Processing on Linked Data. In: ISWC. (2011) 8. Pinkel, C., Binnig, C., Kharlamov, E., Haase, P.: IncMap: Pay-as-you-go Matching of Relational Schemata to OWL Ontologies. In: OM. (2013) 9. Dimou, A., Sande, M.V., Colpaert, P., Verborgh, R., Mannens, E., Walle, R.V.D.: RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data. In: LDOW. (2014) 10. Knap, T., Kukhar, M., Macháč, B., Škoda, P., Tomeš, J., Vojt, J.: UnifiedViews: An ETL Framework for Sustainable RDF Data Processing. In: ESWC (Posters & Demos). (2014) 11. Kharlamov, E. et al.: Optique 1.0: Semantic Access to Big Data The Case of Norwegian Petroleum Directorates FactPages. In: ISWC (Posters & Demos). (2013)