TopQuadrant-Syngenta Webcast July 10, 2014 Semantic Data Virtualization: Extracting More Value from Data Silos

Similar documents
Logical Semantic Warehouse - Developing Your Own Semantic Ecosystem Peter Lawrence, TopQuadrant

TopBraid Insight for Life Sciences

TopBraid Life Sciences Insight

Industry 4.0 and Big Data

An industry perspective on deployed semantic interoperability solutions

Publishing Linked Data Requires More than Just Using a Tool

POLAR IT SERVICES. Business Intelligence Project Methodology

How semantic technology can help you do more with production data. Doing more with production data

Customer experiences in implemen0ng SKOS- based vocabulary management systems, Ralph Hodgson, TopQuadrant. CWI, Amsterdam, April 3, 2014

Semantic Data Management. Xavier Lopez, Ph.D., Director, Spatial & Semantic Technologies

Annotea and Semantic Web Supported Collaboration

Linked Open Data Infrastructure for Public Sector Information: Example from Serbia

BIG DATA IN THE CLOUD : CHALLENGES AND OPPORTUNITIES MARY- JANE SULE & PROF. MAOZHEN LI BRUNEL UNIVERSITY, LONDON

Semantic Web Applications

LDIF - Linked Data Integration Framework

Data warehouse and Business Intelligence Collateral

De la Business Intelligence aux Big Data. Marie- Aude AUFAURE Head of the Business Intelligence team Ecole Centrale Paris. 22/01/14 Séminaire Big Data

Building Semantic Content Management Framework

cheminformatics nomenclature activity binding based data sets knowledge thesauri article

Semantic Interoperability

Get More Value from Your Reference Data Make it Meaningful with TopBraid RDM

DISCOVERING RESUME INFORMATION USING LINKED DATA

Three Fundamental Techniques To Maximize the Value of Your Enterprise Data

BUSINESSOBJECTS DATA INTEGRATOR

An Ontology Based Method to Solve Query Identifier Heterogeneity in Post- Genomic Clinical Trials

Data Catalogs for Hadoop Achieving Shared Knowledge and Re-usable Data Prep. Neil Raden Hired Brains Research, LLC

Big Data for Investment Research Management

LinkZoo: A linked data platform for collaborative management of heterogeneous resources

FROM DATA STORE TO DATA SERVICES - DEVELOPING SCALABLE DATA ARCHITECTURE AT SURS. Summary

The Open PHACTS Discovery Platform Semantic data integration for Medicinal Chemists

BUSINESS VALUE OF SEMANTIC TECHNOLOGY

Linked Statistical Data Analysis

Encoding Library of Congress Subject Headings in SKOS: Authority Control for the Semantic Web

Disributed Query Processing KGRAM - Search Engine TOP 10

EC Wise Report: Unlocking the Value of Deeply Unstructured Data. The Challenge: Gaining Knowledge from Deeply Unstructured Data.

SQL Server 2012 Business Intelligence Boot Camp

Evaluating Business Intelligence Offerings: Business Objects and Microsoft.

Improve Cooperation in R&D. Catalyze Drug Repositioning. Optimize Clinical Trials. Respect Information Governance and Security

Busting 7 Myths about Master Data Management

WHITE PAPER TOPIC DATE Enabling MaaS Open Data Agile Design and Deployment with CA ERwin. Nuccio Piscopo. agility made possible

Data Integration Checklist

Enterprise Data Warehouse (EDW) UC Berkeley Peter Cava Manager Data Warehouse Services October 5, 2006

Linked Open Data A Way to Extract Knowledge from Global Datastores

SOLUTION BRIEF CA ERwin Modeling. How can I understand, manage and govern complex data assets and improve business agility?

ORACLE BUSINESS INTELLIGENCE SUITE ENTERPRISE EDITION PLUS

An Information Provider s Wish List for a Next Generation Big Data End-to-End Information System

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

Driving Innovation in Licensing Through Competitive Intelligence and Big Data Analytics

Scalable End-User Access to Big Data HELLENIC REPUBLIC National and Kapodistrian University of Athens

ORACLE BUSINESS INTELLIGENCE SUITE ENTERPRISE EDITION PLUS

Presente e futuro del Web Semantico

Information Services for Smart Grids

The Foundations of Successful Reference Data Management

Cloud computing based big data ecosystem and requirements

By Makesh Kannaiyan 8/27/2011 1

PLATFORA INTERACTIVE, IN-MEMORY BUSINESS INTELLIGENCE FOR HADOOP

Adam Rauch Partner, LabKey Software Extending LabKey Server Part 1: Retrieving and Presenting Data

Introduction to SKOS. Bob DuCharme October 6, 2011

SOLUTION BRIEF CA ERWIN MODELING. How Can I Manage Data Complexity and Improve Business Agility?

It s all around the domain ontologies - Ten benefits of a Subject-centric Information Architecture for the future of Social Networking

ECM Governance Policies

<Insert Picture Here> Extending Hyperion BI with the Oracle BI Server

Agile Business Intelligence Data Lake Architecture

Database Marketing, Business Intelligence and Knowledge Discovery

Data Mining, Predictive Analytics with Microsoft Analysis Services and Excel PowerPivot

Creating a Business Intelligence Competency Center to Accelerate Healthcare Performance Improvement

fédération de données et de ConnaissancEs Distribuées en Imagerie BiomédicaLE Data fusion, semantic alignment, distributed queries

Providing real-time, built-in analytics with S/4HANA. Jürgen Thielemans, SAP Enterprise Architect SAP Belgium&Luxembourg

STAR Semantic Technologies for Archaeological Resources.

Department of Defense. Enterprise Information Warehouse/Web (EIW) Using standards to Federate and Integrate Domains at DOD

Building an Intelligent Biobank to Power Research Decision-Making

Web-Based Genomic Information Integration with Gene Ontology

How to Enhance Traditional BI Architecture to Leverage Big Data

at Work in the Enterprise

Revealing Trends and Insights in Online Hiring Market Using Linking Open Data Cloud: Active Hiring a Use Case Study

BUSINESSOBJECTS DATA INTEGRATOR

Pipeline Pilot Enterprise Server. Flexible Integration of Disparate Data and Applications. Capture and Deployment of Best Practices

How To Use Data Analysis To Get More Information From A Computer Or Cell Phone To A Computer

Serendipity a platform to discover and visualize Open OER Data from OpenCourseWare repositories Abstract Keywords Introduction

ONTOLOGY-BASED APPROACH TO DEVELOPMENT OF ADJUSTABLE KNOWLEDGE INTERNET PORTAL FOR SUPPORT OF RESEARCH ACTIVITIY

How To Build A Cloud Based Intelligence System

Lightweight Data Integration using the WebComposition Data Grid Service

Integrating SAP and non-sap data for comprehensive Business Intelligence

Research Data Integration of Retrospective Studies for Prediction of Disease Progression A White Paper. By Erich A. Gombocz

Data Governance, Data Architecture, and Metadata Essentials

Business Intelligence in Oracle Fusion Applications

Design and Implementation of a Semantic Web Solution for Real-time Reservoir Management

The Data Grid: Towards an Architecture for Distributed Management and Analysis of Large Scientific Datasets

Consulting and Systems Integration (1) Networks & Cloud Integration Engineer

JOURNAL OF OBJECT TECHNOLOGY

Extending The Value of SAP with the SAP BusinessObjects Business Intelligence Platform Product Integration Roadmap

Leveraging existing Web frameworks for a SIOC explorer to browse online social communities

Business Intelligence. A Presentation of the Current Lead Solutions and a Comparative Analysis of the Main Providers

TIBCO Spotfire Helps Organon Bridge the Data Gap Between Basic Research and Clinical Trials

Structured Content: the Key to Agile. Web Experience Management. Introduction

Exploring the Synergistic Relationships Between BPC, BW and HANA

BUSINESSOBJECTS WEB INTELLIGENCE

Foundations of Business Intelligence: Databases and Information Management

Acronym: Data without Boundaries. Deliverable D12.1 (Database supporting the full metadata model)

Transcription:

TopQuadrant-Syngenta Webcast July 10, 2014 Semantic Data Virtualization: Extracting More Value from Data Silos Featuring Syngenta's report on its successful pilot

Webcast Agenda Overview of Problem and Solution: Cost and complexity of distributed data integration Semantic virtual data integration Syngenta Report on its Successful Pilot Approach: Why Semantic Technologies are uniquely suited to solving the Problem TopBraid Insight provides a platform for this approach Discussion and Questions copyright TopQuadrant, Inc. 2014 2

The Context: Data Distribution An Information Tsunami Increasing volumes of data of multiple types from multiple sources combined with An Informatics Tower of Babel Distributed systems using heterogeneous technologies Used by permission: National Cancer Institute Cancer copyright TopQuadrant, Inc. 2014 3 Bioinformatics Grid (cabig)

The Problem (Domain Experts perspective) Domain experts know that data exists - Within an organization - In external data sources and know both actual and potential questions that could be asked of the data IF and this is a big IF the data could be integrated. copyright TopQuadrant, Inc. 2014 4

Example (the problem is not domain-specific) Find all the 3-way valves across all vehicles that correspond to the valves in this vehicle that are showing intermittent malfunctions at this point in checkout since we changed to this new supplier and the associated change orders and work authorizations???????? Data is in different places with no simple way to achieve integration PR CM PDM PLM CO T&V WA Semantic technologies allow the meaning of data to be expressed so that data can be related across databases with different schemas copyright TopQuadrant, Inc. 2014 5

The TMC (complex domain example) 21 st -Century Data Integration across the Translational Medicine Continuum (multiple stakeholders multiple systems, multiple contexts*) * Re-purposing of data requires separation of context from data/meta-data, i.e. explicit representation of context-specific meta-data. Used by permission: National Cancer Institute Cancer Bioinformatics Grid (cabig)

The Solution: (IT Technologists perspective) Physical Data Integration - Technology-centric Proven tools and technologies (data warehouse, data mart, commercial solutions, etc.) Time consuming, expensive Virtual Data Integration Lower cost? Technology dependencies? Time? copyright TopQuadrant, Inc. 2014 7

Virtual Data Integration The Semantic Technologies Solution: Based on an integrated set of W3C standards relationship Thing Thing that expose the meaning (semantics) of distributed data in a technology-independent representation (RDF) and use SPARQL (semantic query standard) to run federated queries over the distributed meta-data/data. copyright TopQuadrant, Inc. 2014 8

Why Semantic Virtual Data Integration? Lower cost More user /domain-expert-centric Less technology centric Enables Deeper insight into data Less development time Increased flexibility and evolutionary capabilities Domain experts can ask unexpected questions after the integration has been done. copyright TopQuadrant, Inc. 2014 9

Syngenta s Successful Pilot Working with TopQuadrant, Syngenta initiated an extensive pilot project that: Speeded the retrieval of the desired information from weeks to days Decreased the effort necessary to integrate data Enabled answering questions that they were unable to answer before Syngenta s requirements helped to motivate development of TopBraid Insight. Problem is important, pervasive and there is high value in addressing it Encouraging other organizations to initiate pilots copyright TopQuadrant, Inc. 2014 10

Extracting more value from data silos Using the semantic web to link chemistry and biology for innovation Tim Eyres Classification: PUBLIC

Syngenta Syngenta is one of the world's leading companies with more than 27,000 employees in some 90 countries dedicated to our purpose: Bringing plant potential to life. Through world-class science, global reach and commitment to our customers we help to increase crop productivity, protect the environment and improve health and quality of life 12 Classification: PUBLIC

Crop Protection Research & Development Progression Synthesize Design Test Efficacy Safety Analyse $/cycle $$$$/cycle 13 Classification: PUBLIC

The Challenge Expand along dimensions beyond potency and AI spectrum Registrability Potency Innovative class/ new MOA 1 Historic Future Competency Questions of Potency & Spectrum What bioactivity is known for small molecule x, or it s similar compounds, targeting protein p? CoGs Broad spectrum AI Competency Questions of Registrability and Innovation ICS potential 2 Synergy in mixtures Source: Project Mendeleev 1 Importance of this axis may differ by indication 2 Compatibility with new delivery technology, formulation breadth, seed care, germplasm, traits, etc. CE References to all Published Documents where Chemical Substances that are similar to Chemical Substance 's' with structure property 'sp' are Subject Of Paper. 14 Classification: PUBLIC

The Linked Data Enterprise (It s strategy not just a single solution) Linked Data describes a method of publishing structured data so that it can be interlinked and become more useful. http://en.wikipedia.org/wiki/linked_data An organization in which the act of information creation is intimately coupled with the act of information sharing Information is produced and consumed in ways for specific business needs, but produced in a way that can be connected to other aspects of the enterprise Key roles for: Metadata: Controlled vocabularies, Taxonomy, tagging, ontologies Metadata governance Project/System publication of data & concept models Architectural [reuse] of existing information & meta-data 15 Classification: PUBLIC

Linked Linked The Approach - Combine internal and external data using Linked Data technology What bioactivity is known for small molecule x, or it s similar compounds, targeting protein p? Syngenta Scientists ask Competency Questions of the Data, phrased using real world terms from the Concept Model Concept models describe R&D knowledge and the semantic vocabulary for Syngenta science Data Models link the scientists real world terms to the raw data and information Data sets, both internal and external can be easily added to the federation 16 Classification: PUBLIC

Idea Gather Competency Questions Breakdown Questions in to Concepts Competency Questions Concepts Properties 1. Dream the ideal assume a world of effortlessly linked data at your finger tips. 2. Brain storm what questions would you like to ask of the knowledge base? Try to be specific. What do you really mean? 3. Assess remove duplicates and cluster similar questions. 4. Categorise importance, needs to be answered today or in the future, routine or non-routine. 5. Identify concepts and the properties of those concepts. 18 Classification: PUBLIC

Competency Question Example Competency Question Concepts Properties All the ChemicalSubstances (and their structure properties) that have activity (of type inhibition) against target 't' with an assay result where assayorganism is a plant. ChemicalSubstance, Target, Assay, Bioactivity, AssayOrganism hassubstancestru cture, activitytype, assayresult 19 Classification: PUBLIC

Now the technical problem Internal Data Sources External/Public Data Sources Bio activity of small molecules 72 Gb Compound catalogue 5 Gb 16 mil compounds Millions of triples ChEMBL Protein crystal structures Small molecule crystal structures Research documents 2 Gb 1 Gb 1000s documents Millions of patents What Chemical-Substances (and their structure properties) are we working on that have activity (of type inhibition) against target 't' with an assay result where assayorganism is a plant and this substance is similar to any patented substance? Derwent Patent Index 20

The lift and shift solution Internal Data Sources External/Public Data Sources Bio activity of small molecules Compound catalogue ChemSpider Protein crystal structures Small molecule crystal structures Derwent Patent Index Research documents 21

The physical data warehouse solution Internal Data Sources External/Public Data Sources Bio activity of small molecules Compound catalogue ChEMBL Protein crystal structures ETL ETL Small molecule crystal structures Derwent Patent Index Research documents 22

Vision Virtual Data Warehouse Map-Reduce Internal Data Sources External/Public Data Sources Bio activity of small molecules Compound catalogue ChEMBL Protein crystal structures Small molecule crystal structures Map-Reduce Derwent Patent Index Research documents 23

Map-Reduce Challenges Virtual Data Warehouse Internal Data Sources External/Public Data Sources Bio activity of small molecules Compound catalogue ChEMBL Protein crystal structures Small molecule crystal structures Map-Reduce Derwent Patent Index Research documents 24

Map-Reduce Actual- Virtual Data Warehouse Internal Data Sources External/Public Data Sources Bio activity of small molecules D2RQ Compound catalogue D2RQ ChEMBL Protein crystal structures D2RQ Small molecule crystal structures D2RQ Map-Reduce Derwent Patent Index Research documents File import File import 25

Outcome Federated Search succeeded as measured by its scientific criteria laid down in the project definition - R&D was able to ask adhoc questions of Potency, Novelty and IP across a broad set of internal and external data sets - Scientists can pose questions using scientific terms, without the need to understand the underlying data model Federated Search delivered - a fully functional linked data federation - concept models which are reusable by future BI or integration projects 26 Classification: PUBLIC

Technical Proofs Linked Data can be used to federate live operation data sets New datasets are simply added by linking their data model to the scientific concept model Using Competency Questions ensures that: - The right things are linked and we do not end up boiling the ocean - Business benefit can be evaluated at the end of any data set integration Federation was able to make disparate data sources appear to the scientists as one unified data set. 27 Classification: PUBLIC

What else in 2013 Linking new data sets: weather, soil, topology, field trial, mechanistic crop modelling Broadening acceptance through reporting and visualization solutions What next in 2014 Extending in the genetic, chemical and environment domains Learning how to scale semantic solutions for enterprise wide usage 28 Classification: PUBLIC

Acknowledgements The project team at Syngenta Our collaborators at TopQuadrant Our collaborators at ChEMBL Our collaborators at Thomson Reuters. The community on whose shoulders we stand 29 Classification: PUBLIC

Data Integration is hard because of Multiple data sources and storage technologies Multiple data-collection contexts Multiple data-usage scenarios copyright TopQuadrant, Inc. 2014 30

Perspectives from a (converted) non-believer Connecting the dots means machine understanding the semantics of the data/meta-data being connected Data integration (by machines) means sharing semantics Two (or more) systems given same data/meta-data will produce the same results when performing the same function There are no silver bullets or secret sauces copyright TopQuadrant, Inc. 2014 31

Semantic technologies What s different? Semantics as a first-class citizen Use of data (by domain experts) rather than storage or transmission of data (by technologists) Bypass the non-semantic barriers to data integration Differences in relational models/tables or XML document hierarchies Vendor-specific technologies/implementations Use of semantic technologies is evolutionary: Not rip and replace Instead integrate and evolve copyright TopQuadrant, Inc. 2014 32

Semantic Data Virtualization The secret sauce isn t really secret It is based on the use of W3C semantic standards However, aligning semantics isn t simple TopBraid Insight has a UI and associated tools to make as easy as it can be and when it s done, it opens the door to insights that were previously difficult, time consuming, and expensive to achieve copyright TopQuadrant, Inc. 2014 33

TopBraid Insight (TBI) Connect the dots New Insights from virtually integrated data Interactive Exploration Federated queries Over distributed data Surface answers using W3C semantic standards Running on commodity internet technologies HTTP, URI, ReST, etc. copyright TopQuadrant, Inc. 2014 34

Key Problem-Solving Features Benefits of this approach to semantic data virtualization: Virtualized Data Storage Federates data from different sources. Resolves data variety. Eliminates data replication and ETL. High performance, flexible, open, scalable, architecture. Simple change management Add new datasets as needed. Change or update consolidated schemas for new concepts and questions without changing underlying datasets. Easily configured Consolidated schemas and datasources for different users without duplication of data or new ETL scripts. Interactive Dynamic Exploration Explore and discover intuitively using Search, Query, Filter, Browse, Navigate, Visualize, Share. copyright TopQuadrant, Inc. 2014 35

TopBraid Insight - Today copyright TopQuadrant, Inc. 2014 36

Questions? copyright TopQuadrant, Inc. 2014 37

Thank you! A wonderful harmony is created when we join together the seemingly unconnected. - Heraclitus