Semantic Technologies for Big Data. Marin Dimitrov (Ontotext)

Semantic Technologies for Big Data Marin Dimitrov (Ontotext) XML Amsterdam 2012

About Ontotext #3
- Provides products and services for creating, managing and exploiting semantic data
- Founded in 2000
- Offices in Bulgaria, USA and UK
- Major clients and industries
  - Media & Publishing (BBC, Press Association)
  - HCLS (AstraZeneca, UCB)
  - Cultural Heritage (The British Museum, The National Archives, Polish National Museum, Dutch Public Library)
  - Defense and Homeland Security

Outline #4
- Semantic Technologies for the Enterprise
- Semantic Technologies for Big Data
- Success stories

SEMANTIC TECHNOLOGIES FOR THE ENTERPRISE #5

The need for a smarter Web #6
- "The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation." (Tim Berners-Lee, 2001)
- "PricewaterhouseCoopers believes a Web of data will develop that fully augments the document Web of today. You'll be able to find pieces of data sets from different places, aggregate them without warehousing, and analyze them in a more straightforward, powerful way than you can now." (PWC, May 2009)

Linked Data #7
- Linked Data is a set of principles that allows publishing, querying and consumption of RDF data distributed across different servers
- Design principles
  - Use unambiguous identifiers for resources (URIs)
  - Use HTTP URIs (dereferenceable)
  - Provide useful information for URI lookups
  - Interlink resources
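The four design principles can be illustrated with a toy sketch. It is not taken from the talk: the `WEB` dictionary stands in for real HTTP servers, `dereference` and `crawl` are hypothetical helper names, and the `example.org` / `example.net` URIs are made up. Only the FOAF property URIs are real vocabulary terms.

```python
FOAF = "http://xmlns.com/foaf/0.1/"

# Toy "web": each HTTP URI maps to the triples a server would return
# when that URI is looked up. The resource URIs are invented.
WEB = {
    "http://example.org/people/alice": [
        ("http://example.org/people/alice", FOAF + "name", "Alice"),
        ("http://example.org/people/alice", FOAF + "knows",
         "http://example.net/staff/bob"),
    ],
    "http://example.net/staff/bob": [
        ("http://example.net/staff/bob", FOAF + "name", "Bob"),
    ],
}

def dereference(uri):
    """Stand-in for an HTTP GET on the URI (principles 2 and 3)."""
    return WEB.get(uri, [])

def crawl(start, max_hops=2):
    """Follow links between resources (principle 4), collecting triples."""
    seen, frontier, triples = set(), [start], []
    for _ in range(max_hops + 1):
        next_frontier = []
        for uri in frontier:
            if uri in seen:
                continue
            seen.add(uri)
            for s, p, o in dereference(uri):
                triples.append((s, p, o))
                # Objects that are HTTP URIs are links worth following.
                if o.startswith("http://"):
                    next_frontier.append(o)
        frontier = next_frontier
    return triples

ALL_TRIPLES = crawl("http://example.org/people/alice")
```

Starting from Alice's URI, the crawl reaches Bob's description on a different (simulated) server purely by following the `foaf:knows` link, which is the point of interlinking.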

The Semantic Web timeline #8
[Timeline figure covering 1999-2012. Standards and initiatives shown: RDF, RDF 2, DAML+OIL, OWL, OWL 2, SPARQL, SPARQL 1.1, RIF, RDFa, SAWSDL, LOD, SKOS, HCLS, SSN, RDB2RDF, PIL, GLD, LDP]

Enterprise Information Management Challenges #9
- Many disparate data sources and data silos
- Many point-to-point interfaces
- Data sources with similar/inconsistent information
- Complex data integration processes inadequate for changing business requirements
- Most of the knowledge is hidden in texts
- Difficult to integrate & analyse structured data and text

Semantic Web and Linked Data Opportunities for the Enterprise #10
- Simplify the information integration processes
  - Flexible, easy to evolve data model
  - Bottom-up / incremental integration
  - Efficiently integrate structured and unstructured data
- Provide an enterprise metadata layer
  - Unified metadata vocabulary for the enterprise
  - Align the legacy data silos
  - Improve the information sharing and reuse

Semantic Web and Linked Data Opportunities for the Enterprise (2) #11
- Discovery and enrichment of information
  - Interlink people, organisations, events, etc.
  - Enrich enterprise content with structured annotations
  - Discover implicit links and relationships
- Unified access to information within the enterprise
  - Simplified infrastructure based on open web standards
- Information interchange across a value chain
  - Easy publishing and consumption of Linked Data
- Augments existing IT assets and technologies
  - No need for disruptive replacement

XML and RDF: friends or foes? #12
- Complement each other
  - XML best for content, structure and interchange format
  - RDF for metadata layer and semantics
- Typical use case
  - Many XML content data sources: content stored in an XML store (XQuery and XSLT)
  - Structured data sources & external Linked Data: RDF-ized and stored in an RDF store (SPARQL)
  - Metadata extracted from content: stored in an RDF store (SPARQL)
  - Semantic search and metadata-driven content delivery
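As a rough illustration of the "metadata extracted from content" step, the sketch below keeps the content as XML and derives a few RDF-style triples from it. The XML schema, the `BASE` namespace and the `extract_metadata` helper are all assumptions made for this example, not something shown in the slides. A real pipeline would load the triples into an RDF store rather than a Python list.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup; element and attribute names are invented.
ARTICLE_XML = """
<article id="a42">
  <title>Cup final report</title>
  <body>
    <p><team ref="t7">Ajax</team> won the final.</p>
  </body>
</article>
"""

BASE = "http://example.org/"  # made-up namespace for the example

def extract_metadata(xml_text):
    """Return (subject, predicate, object) triples describing the article."""
    root = ET.fromstring(xml_text)
    article = BASE + "article/" + root.get("id")
    triples = [(article, BASE + "vocab/title", root.findtext("title"))]
    # Every inline <team> annotation becomes a "mentions" link.
    for team in root.iter("team"):
        triples.append(
            (article, BASE + "vocab/mentions", BASE + "team/" + team.get("ref"))
        )
    return triples

TRIPLES = extract_metadata(ARTICLE_XML)
```

The XML document stays the source of truth for content and structure, while the triples form the metadata layer that a semantic search could query.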

BBC Sports (c) BBC #13

Added value of RDF #14
- Explicit semantics
  - Intended meaning of entities and relations
  - Global identifiers (URIs)
- Simple and flexible graph-based data model
- Easier data mapping & integration
  - Bottom-up / incremental data integration with owl:sameAs
  - Inference of implicit information
- Working with distributed information
  - Linked Data, federated SPARQL
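Incremental integration with owl:sameAs can be sketched as follows. This is a simplified illustration, not how a triplestore implements it: identifiers declared equivalent are grouped with a small union-find structure and every statement is rewritten to one canonical identifier. The `merge_same_as` helper and the `example.org` data are hypothetical.

```python
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def merge_same_as(triples):
    """Rewrite triples so that owl:sameAs-equivalent URIs share one identifier."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # First pass: group identifiers linked by owl:sameAs.
    for s, p, o in triples:
        if p == SAME_AS:
            rs, ro = find(s), find(o)
            if rs != ro:
                # Use the lexicographically smaller URI as the canonical one.
                parent[max(rs, ro)] = min(rs, ro)

    # Second pass: rewrite all other statements to canonical identifiers.
    merged = set()
    for s, p, o in triples:
        if p != SAME_AS:
            merged.add((find(s), p, find(o) if o in parent else o))
    return merged

CRM = "http://example.org/crm/1001"        # made-up record in one silo
BILLING = "http://example.org/billing/77"  # same customer in another silo
DATA = [
    (CRM, "http://example.org/email", "a@example.org"),
    (BILLING, "http://example.org/balance", "12.50"),
    (CRM, SAME_AS, BILLING),  # the only statement added to align the silos
]
MERGED = merge_same_as(DATA)
```

Adding the single owl:sameAs statement is enough to make both silos describe one resource, without changing either source schema.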

Added value of RDF (2) #15
- Descriptive / agile schema
  - Open World Assumption, don't restrict predicates
  - Generated dynamically from data
- Queries based on meaning
  - Not depending on structure / order of statements
  - Data and queries may use different vocabularies
  - Exploratory queries
- Choice of OWL 2 profiles
  - Trade-off of features vs performance
  - New profiles may emerge in the future
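The claim that queries do not depend on the structure or order of statements can be shown with a minimal triple-pattern matcher. It only mimics the idea of a SPARQL basic graph pattern and is not a SPARQL implementation. The `match` and `query` helpers and the sample data are assumptions for this sketch (Python 3.8+ for the `:=` operator).

```python
def match(pattern, triple, bindings):
    """Try to extend `bindings` so that `pattern` matches `triple`."""
    new = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Bind the variable, or check it against an earlier binding.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new

def query(patterns, triples):
    """Return all variable bindings that satisfy every pattern."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for bindings in solutions
            for triple in triples
            if (extended := match(pattern, triple, bindings)) is not None
        ]
    return solutions

TRIPLES = [
    ("alice", "worksFor", "acme"),
    ("acme", "locatedIn", "Sofia"),
    ("bob", "worksFor", "acme"),
]
# "Who works for an organisation, and in which city is it located?"
PATTERNS = [("?person", "worksFor", "?org"), ("?org", "locatedIn", "?city")]

def canonical(solutions):
    """Order-independent form of a result set, for comparison."""
    return sorted(tuple(sorted(s.items())) for s in solutions)

RESULT = canonical(query(PATTERNS, TRIPLES))
RESULT_REVERSED = canonical(query(PATTERNS, list(reversed(TRIPLES))))
```

`RESULT` and `RESULT_REVERSED` are equal: shuffling the statements changes nothing, because the query describes a shape in the graph rather than a position in a document.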

SEMANTIC TECHNOLOGIES FOR BIG DATA #16

The three V's of Big Data #17
- Velocity
  - Streaming, sensor, real-time data
  - Solution: distributed processing & storage
  - Semantic challenge: stream reasoning
- Volume
  - Petabytes of data
  - Solution: distributed processing & storage
  - Semantic challenge: distributed reasoning & querying
- Variety
  - Structured, semi-structured and unstructured data
  - Semantic Technologies (RDF) are a good fit

Types of Big Data (NIST) #18
- Type 1: Velocity (-), Volume (-), Variety (+)
  - Perfect fit for Semantic Technologies
- Type 2: Velocity and/or Volume, Variety (-)
  - Only horizontal scalability required, traditional approaches are a good enough fit
- Type 3: All V's
  - Semantic Technologies not a good fit yet, but moving in that direction

Semantic Technologies for Volume and Velocity #19
- Promising ongoing research
  - Distributed inference with Hadoop/Storm
  - Stream reasoning
  - Continuous queries
  - Continuous (dynamic) semantics
  - SPARQL to Pig translation
  - Distributed RDF stores on top of NoSQL
  - C-SPARQL, EP-SPARQL, CQELS
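A continuous query differs from a one-off query in that it is re-evaluated as data arrives, usually over a time window. The sketch below shows that general idea only. It does not use the syntax or semantics of C-SPARQL, EP-SPARQL or CQELS, and the `continuous_count` helper and the sensor readings are invented. It assumes timestamps arrive in non-decreasing order.

```python
from collections import deque

def continuous_count(stream, predicate, window_seconds):
    """For each arriving triple, yield (timestamp, number of triples with
    `predicate` seen within the last `window_seconds`)."""
    window = deque()
    for ts, triple in stream:
        window.append((ts, triple))
        # Evict triples that have fallen out of the sliding window.
        while window and window[0][0] <= ts - window_seconds:
            window.popleft()
        yield ts, sum(1 for _, (_, p, _) in window if p == predicate)

# Made-up timestamped sensor triples: (seconds, (subject, predicate, object)).
STREAM = [
    (0, ("sensor1", "temperature", "20")),
    (5, ("sensor2", "temperature", "21")),
    (12, ("sensor1", "temperature", "22")),
    (13, ("sensor1", "humidity", "40")),
]

COUNTS = list(continuous_count(STREAM, "temperature", window_seconds=10))
# COUNTS == [(0, 1), (5, 2), (12, 2), (13, 2)]
```

At time 12 the reading from time 0 has expired, so the count stays at 2 even though a third temperature triple has arrived. Stream reasoning adds inference on top of such windows, which is where the open research problems lie.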

Linked Open Data Cloud (Sep 2011) (c) Cyganiak & Jentzsch #20

From Big Linked Data to Linked Big Data #21
- Big Linked Data
  - Big Data approach adopted by the Linked Data community
  - In particular handling Volume and Velocity
  - Exponential growth of Linked Data in the last 5 years
- Linked Big Data
  - Linked Data approach adopted by the Big Data community
  - RDF data model for Variety
  - Enrich Big Data with metadata and semantics, for more powerful analytics on top of it
  - Interlink Big Data sets
  - Simplify data access and data integration

SUCCESS STORIES #22

Typical Use Cases for Linked Data and Semantic Technologies #23
- Publish / consume Linked Data across enterprises
  - Linked Data is not necessarily free data
  - Facilitate data interchange within the value chain
- Information integration within the enterprise
  - Integrated asset management / align data silos
  - Master Data Management
- Knowledge discovery and semantic search
  - Integrate structured and unstructured data
  - Enrich and interlink information
  - Semantic search and exploration of information

Semantic Information Integration (Ontotext) #24

The National Archives (Ontotext) #25
- Challenge
  - Large archive of various UK Government websites since 1997
  - Lots of duplicated information & documents
  - Inefficient search & navigation
- Semantic Knowledge Base project goals
  - Integrate multiple data sources
  - Extract information & metadata from archived documents
  - Interlink the web archive with data.gov.uk and LOD data
  - Advanced search & navigation of the archive

The National Archives (Ontotext) #26
[Architecture diagram. Components shown: front ends (semantic search, semantic annotation, 3rd-party ontology editors, SPARQL graph exploration); semantic repository with the SKB ontologies; data transformation and integration; factual knowledge (TNA data, LOD, data.gov.uk); annotation process (GATE Teamware); semantic annotations; semantic index; identity resolution]

The National Archives (Ontotext) #27
- The numbers
  - 2.5 billion input files
  - 40TB compressed archive data
  - 10 billion RDF triples stored in OWLIM
  - 33,000 EC2 hours used on AWS
  - Dynamic EC2 cluster (180 instances average, 500 max)
- Major challenges
  - Complex pre-processing of documents
  - De-duplication of information & documents
  - EC2/RRS performance & reliability
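A back-of-the-envelope reading of these numbers, under two assumptions that the slide does not state: that "EC2 hours" means instance-hours, and that the 180-instance average applies to the whole run.

```python
ec2_instance_hours = 33_000   # from the slide; assumed to be instance-hours
avg_instances = 180           # from the slide
input_files = 2_500_000_000   # from the slide
rdf_triples = 10_000_000_000  # from the slide

# Instance-hours divided by the average cluster size gives wall-clock time.
wall_clock_hours = ec2_instance_hours / avg_instances
wall_clock_days = wall_clock_hours / 24

# Average number of triples produced per input file.
triples_per_file = rdf_triples / input_files

print(f"{wall_clock_hours:.0f} hours, about {wall_clock_days:.1f} days")
print(f"{triples_per_file:.0f} triples per input file on average")
```

Under those assumptions the processing took roughly 183 hours of wall-clock time (about 7.6 days) and yielded an average of 4 triples per input file.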

Dutch Public Library (Ontotext + Dayon) #28
- Challenge
  - Many disparate data sources, inefficient search
- Goals
  - Data integration
  - Automated metadata generation
  - Open search platform
- Numbers
  - 500 heterogeneous data sources
  - 40 million cultural heritage artifacts to be described
  - 6-8 billion triples to be stored in the knowledge base

Linked Life Data (Ontotext) #29
- Challenge
  - Disparate, heterogeneous and unaligned data silos lock valuable biomedical information
- Goals
  - Semantic warehouse integrating and interlinking public biomedical data sources
  - Interactive discovery and exploration
- Numbers
  - 25+ heterogeneous biomedical data sources integrated
  - 1 billion entities described
  - 5.5 billion RDF triples

Linked Life Data (Ontotext) #30

Linked Life Data-as-a-Service (Ontotext) #31
- More data sources
- Large scale text mining over the LOD cloud
- Adapted for specific use cases
- UCB use case
  - 2 billion entities described
  - 11 billion RDF triples

Dynamic Semantic Publishing (Ontotext) #32
- Challenge
  - Difficult & slow to aggregate content from various sources
- Goals
  - Metadata generation for news (semantic annotation)
  - Interlink & categorize content
  - Metadata-driven web pages
- Numbers
  - Nearly real-time processing & annotation required
  - Tens of millions of (SPARQL) queries to the knowledge base per day

Trillion RDF triples (Franz Inc.) #33
- Use case
  - Use RDF for the customer management database of a telecom
- Challenge
  - 4,000 triples per customer, more than a trillion for the whole customer base
- Numbers
  - 1 trillion triples stored in AllegroGraph by Franz Inc.
  - Hardware requirements undisclosed
  - The 310 billion triple result used an 8-CPU system with 2TB RAM
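The two figures in the challenge imply a customer base size, which a quick calculation makes explicit. This is simple arithmetic on the slide's numbers, assuming the 4,000 triples per customer is an average.

```python
triples_total = 1_000_000_000_000  # one trillion, from the slide
triples_per_customer = 4_000       # from the slide, assumed to be an average

# Number of customers needed to reach one trillion triples.
customers = triples_total // triples_per_customer

print(f"{customers:,} customers")  # prints "250,000,000 customers"
```

So "more than a trillion" corresponds to a customer base of more than 250 million.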

urika (Cray/YarcData) #34
- Big Data appliance for graph analytics
- Based on the Threadstorm™ architecture
- Up to 8K processors, 512TB RAM, 350TB/hr IO throughput
- In-memory RDF database
- SPARQL 1.0 engine
(c) YarcData

TAKEAWAYS #35

Semantic Technologies for Big Data #36
- Rich ecosystem of Semantic Technologies since 1999
  - Strong Enterprise focus in the last 5 years
- Semantic Technologies provide an opportunity for reducing the cost and complexity of data integration
  - Common metadata layer for the enterprise
  - More powerful ways to find and explore information
- RDF complements XML within the enterprise
- Semantic Technologies are a good fit for Big Data's Variety

Semantic Technologies for Big Data (2) #37
- Velocity and Volume are still challenging for Semantic Technologies, but there is a lot of progress in that direction
- Linked Data will grow into Big Linked Data, but Big Data will also benefit from evolving into Linked Big Data
- Interesting success stories for Semantic Technologies in Big Data scenarios

THANK YOU! #38