Technological challenges
Introduction to Ontologies
Combining relational databases and ontologies
Author: Marc Lieber
Date: 21-Jan-2014
BASEL BERN LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN
AGENDA
1. Introduction to Semantic Web
2. Graph databases / Triple Stores overview
   Oracle Graph databases
   Franz AllegroGraph
3. Use cases
   Novartis
   Fraud detection
Semantic technologies
1. Semantic technologies generally refer to a broad spectrum of techniques for finding signal in large or complex data sources:
   Link analysis
   Distance
   Pattern
   Detect anomalies
   Complex search
Ontology Editing and Engineering
TopQuadrant TopBraid
Semantic Web in Use
1. Industries include:
   Life Science, Health care and Pharma
   Energy sector, Oil & Gas
   Google, Facebook, LinkedIn
   Financial services
   Digital libraries
   Libraries & museums
   Defense & Intelligence Service
   eGovernment
   Media, Sport (BBC, NFL)
   Networks & Communication
   Department Stores (Walmart)
W3C Semantic Web technologies
  Go back a few years now
  Large set of specifications for many application domains: RDF, RDFS, OWL, SKOS, SNOMED, etc.
  Google's schema.org initiative to federate the definition of ontologies
  Ontologies: FOAF (Friend of a Friend)
  Serialisation in N3 triples, RDF/XML, Turtle or RDFa (XHTML)
Graph DBs
1. Graph databases can be split into:
   W3C Semantic Web databases, also called triple stores or RDF graph databases
   General graph databases, of which property graphs and hypergraphs are the two main types
2. Triple stores store the relationships between nodes and their properties as triples or quads
3. Property graphs store the relationships between nodes and the properties of each node separately
4. Some databases, such as AllegroGraph, can be considered both a W3C Semantic Web database and a property graph database, since they support graph traversals as well as the W3C SPARQL query language
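The difference between the two storage models in points 2 and 3 can be illustrated with a minimal Python sketch. All account names, property names and the "edge as its own resource" encoding are assumptions made for illustration; real triple stores may use named graphs or other mechanisms instead.

```python
# Hypothetical example data: one payment between two accounts.

# Property graph: properties live on the node and edge records themselves.
nodes = {
    "acct1": {"email": "a@example.org"},
    "acct2": {"email": "b@example.org"},
}
edges = [
    {"src": "acct1", "dst": "acct2", "label": "pays", "amount": 2000},
]


def property_graph_to_triples(nodes, edges):
    """Flatten a property graph into (subject, predicate, object) triples.

    A plain triple cannot carry edge properties, so each edge becomes its
    own resource here (one possible workaround, chosen for illustration).
    """
    triples = []
    for node_id, props in nodes.items():
        for key, value in props.items():
            triples.append((node_id, key, value))
    for i, edge in enumerate(edges):
        edge_id = f"edge{i}"
        triples.append((edge_id, "from", edge["src"]))
        triples.append((edge_id, "to", edge["dst"]))
        triples.append((edge_id, "type", edge["label"]))
        for key, value in edge.items():
            if key not in ("src", "dst", "label"):
                triples.append((edge_id, key, value))
    return triples


triples = property_graph_to_triples(nodes, edges)
```

In the triple form everything, including the payment amount, is expressed with the same three-part structure, whereas the property graph keeps node and edge properties in separate records.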
Property graphs and hypergraphs
1. In a property graph both nodes and links can have properties
[Figure: example property graph with "pays" links; labels shown include time 2013-01-01T12:12:12, lat/long 37.30 121.90, email ja@franz.com, account# 96777543, amount 2000]
Resource Description Framework Graphs
  URIs are used to identify resources, entities, relationships, concepts
  Creates Subject-Property-Object triples
  Properties of subjects are triples
  Standards defined by W3C and OGC (Open Geospatial Consortium)
RDF Triples
  RDF as core data format
  Uniform structure to represent data (triples)
    [subject]  [predicate]  [object]
    JFK        president of the United States
    [resource] [property]   [value]
    JFK        PresidentOf  The United States
  quad = triple + named graph, quint = quad + technical ID (rowid)
  Use of namespaces to differentiate terms
  Some are predefined, but you can create your own namespaces
    <http://www.world.org/celibrity#jfk> <http://www.w3.org/2000/01/rdf-schema#label> "John Fitzgerald Kennedy"^^<http://www.w3.org/2001/XMLSchema#string> .
    <http://www.world.org/airport#jfk> <http://www.world.org/airport#islocatedin> "New York City" .
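As a rough illustration of the triple and quad structure above, the following Python sketch models a statement as a named tuple and renders it in an N-Triples-like form. The class name Quad, the helper functions and the graph URI http://example.org/g1 are hypothetical and not part of any RDF library; the subject URI is the one used on the slide.

```python
from typing import NamedTuple, Optional


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str                      # a URI, or a literal wrapped in double quotes
    graph: Optional[str] = None   # named graph; None means a plain triple


RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def term(value: str) -> str:
    """Render a term: quoted literals stay as-is, URIs get angle brackets."""
    return value if value.startswith('"') else f"<{value}>"


def serialize(q: Quad) -> str:
    """Render one statement per line, terminated by ' .'."""
    parts = [term(q.subject), term(q.predicate), term(q.obj)]
    if q.graph is not None:
        parts.append(term(q.graph))
    return " ".join(parts) + " ."


jfk = Quad("http://www.world.org/celibrity#jfk", RDFS_LABEL,
           '"John Fitzgerald Kennedy"')
triple_line = serialize(jfk)
# Assumed graph URI, for illustration only.
quad_line = serialize(jfk._replace(graph="http://example.org/g1"))
```

The same record serves as a triple or as a quad; only the optional fourth element (the named graph) differs.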
Data migration: Where do triples come from?
1. Relational storage
   ID   | Name  | Hiredate   | Job     | Salary | Deptno
   7982 | Scott | 12-02-1998 | Clerk   | 4800   | 30
   7855 | Adams | 27-09-2001 | Manager | 7500   | 30
2. Equivalent in triples
   Subject       | Predicate       | Object
   <...emp:7982> | rdfs:label      | Scott (xsd:string)
   <...emp:7982> | <..HR#Hiredate> | 12-02-1998 (xsd:date)
   <...emp:7982> | <..HR#hasJob>   | Clerk (xsd:string)
   <...emp:7982> | <..HR#HasSalary> | 4800 (xsd:int)
   <...emp:7982> | <..HR#worksIn>  | <...dept:30>
   <...dept:30>  | rdfs:label      | Sales (xsd:string)
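The row-to-triple mapping shown in the two tables can be sketched in a few lines of Python. The short prefixes "emp:", "dept:" and "HR#" stand in for the base URIs that the slide elides, and the function name row_to_triples is hypothetical; this is an illustration of the idea, not the migration tool used in practice.

```python
# Assumed short prefixes; the slide elides the real base URIs.
EMP = "emp:"
DEPT = "dept:"
HR = "HR#"

# Column name -> (predicate, XSD datatype), mirroring the mapping on the slide.
COLUMN_MAP = {
    "Name": ("rdfs:label", "xsd:string"),
    "Hiredate": (HR + "Hiredate", "xsd:date"),
    "Job": (HR + "hasJob", "xsd:string"),
    "Salary": (HR + "HasSalary", "xsd:int"),
}


def row_to_triples(row):
    """Turn one employee row into triples: the primary key names the subject."""
    subject = f"{EMP}{row['ID']}"
    triples = [
        (subject, predicate, (row[column], datatype))
        for column, (predicate, datatype) in COLUMN_MAP.items()
    ]
    # The foreign key becomes a link to another resource, not a literal.
    triples.append((subject, HR + "worksIn", f"{DEPT}{row['Deptno']}"))
    return triples


scott = {"ID": 7982, "Name": "Scott", "Hiredate": "12-02-1998",
         "Job": "Clerk", "Salary": 4800, "Deptno": 30}
scott_triples = row_to_triples(scott)
```

Each column value becomes one typed literal, while the Deptno foreign key becomes an edge to the department resource.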
Databases Market Overview
  The database world is changing rapidly
  NoSQL databases are often used in conjunction with Big Data
  Graph databases can be split into W3C Semantic Web databases and others
Triple stores comparison
(BT = billion triples)
Triple store      | Scalability          | Query                                                        | Reasoning support               | Full-text search support            | Programming
Jena (TDB)        | up to 1.7 BT         | SPARQL 1.1                                                   | OWL, RDFS                       | Yes (Lucene integration)            | Java
Sesame            | Millions of triples  | SPARQL 1.1                                                   | RDFS                            | Yes (through Lucene SAIL)           | Java
OpenLink Virtuoso | 15.4 BT              | SPARQL 1.1                                                   | RDFS, subsets of OWL            | Yes                                 | Java
Oracle            | >500 billion triples | SPARQL 1.0 (11g), SPARQL 1.1 (12c), SEM_MATCH, SEM_RELATED   | RDFS, OWL, OWLIM, SKOS, SNOMED  | Yes (Oracle Text)                   | Java, SQL, PL/SQL
OWLIM             | 20 BT                | SPARQL 1.1                                                   | RDFS, OWL, OWLIM                | Yes                                 | Java
AllegroGraph      | >500 billion triples | SPARQL 1.1, Prolog                                           | RDFS, Prolog rules              | Yes                                 | Java, LISP, Python, Ruby, C#
4store            | 15 BT                | SPARQL 1.1                                                   | RDFS                            | Yes                                 | Java
BigData           | over 10 BT           | SPARQL 1.1                                                   | RDFS, OWL Lite                  | Internal, external through Lucene   | Java
Urika (YarcData)  | Trillions            | SPARQL 1.1                                                   | RDFS                            | Yes                                 | Java, Python
Anzo Cambridge    | unknown              | SPARQL 1.1                                                   | RDFS, OWL                       | Yes (Information Mining)            | Java, SOAP
SPARQL Protocol and RDF Query Language
  Latest version: 1.1
  SELECT returns all, or a subset of, the variables bound in a query pattern match
  CONSTRUCT returns triples
  ASK returns a boolean
  DESCRIBE asks for triples that describe a particular resource
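To make the four query forms concrete, here is a toy Python pattern matcher over an in-memory list of triples. It is only a sketch of the semantics, not a SPARQL engine: the function names, the short resource names (jfk, obama, usa) and the choice to let "describe" return every triple mentioning the resource are assumptions (the SPARQL specification leaves the DESCRIBE result to the implementation).

```python
def match(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` equals `triple`; None on failure."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):            # variable: bind it or check consistency
            if new.setdefault(p, t) != t:
                return None
        elif p != t:                     # constant: must match exactly
            return None
    return new


def solve(patterns, triples):
    """Return every variable binding that satisfies all patterns."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for t in triples
                    if (b2 := match(pattern, t, b)) is not None]
    return bindings


def select(variables, patterns, triples):
    """SELECT: the values bound to the requested variables."""
    return [tuple(b[v] for v in variables) for b in solve(patterns, triples)]


def ask(patterns, triples):
    """ASK: does at least one match exist?"""
    return bool(solve(patterns, triples))


def construct(template, patterns, triples):
    """CONSTRUCT: new triples built from a template and the matches."""
    return {tuple(b.get(x, x) for x in template) for b in solve(patterns, triples)}


def describe(resource, triples):
    """DESCRIBE (one possible reading): triples that mention the resource."""
    return [t for t in triples if resource in (t[0], t[2])]


data = [
    ("jfk", "presidentOf", "usa"),
    ("jfk", "label", "John Fitzgerald Kennedy"),
    ("obama", "presidentOf", "usa"),
]
presidents = select(["?who"], [("?who", "presidentOf", "usa")], data)
was_president = ask([("jfk", "presidentOf", "usa")], data)
inverse = construct(("usa", "hadPresident", "?who"),
                    [("?who", "presidentOf", "usa")], data)
about_jfk = describe("jfk", data)
```

All four forms share the same pattern-matching step; they differ only in what is returned from the matches.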
SPARQL compared to SQL
A SPARQL query of this type would be quite difficult to translate into SQL queries:
[Figure: example SPARQL query]
Inferencing / Reasoning
  Inferencing is the ability to make logical deductions based on ontology rules.
  The reasoning tools use the rules defined in the RDF model (RDFS, OWL, SKOS, etc.) to detect new properties and new relationships.
  The ability to draw inferences from existing data using the precision and rigor of mathematical logic is probably the most important property that distinguishes semantic data from other data.
  Example of use: LinkedIn or Facebook discovering new links between persons
Reasoning example
  Graph representation and data modelling
  Reasoning builds the missing relations
  Can take time
  Some DBs do it on the fly, others materialize the generated triples
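A minimal Python sketch of materialization, assuming only two RDFS rules (subclass transitivity and type inheritance) and hypothetical class names. Real reasoners apply many more rules and are far better optimized; this naive loop simply shows why reasoning "can take time" and what "materializing the generated triples" means.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def materialize(triples):
    """Forward-chain two RDFS rules until no new triple appears."""
    inferred = set(triples)
    while True:
        new = set()
        for (s, p, o) in inferred:
            for (s2, p2, o2) in inferred:
                if o != s2 or p2 != SUBCLASS:
                    continue
                if p == SUBCLASS:        # subclass of a subclass
                    new.add((s, SUBCLASS, o2))
                elif p == TYPE:          # instance of a subclass
                    new.add((s, TYPE, o2))
        if new <= inferred:              # fixed point reached
            return inferred
        inferred |= new


# Hypothetical ontology and data, for illustration only.
asserted = {
    ("Manager", SUBCLASS, "Employee"),
    ("Employee", SUBCLASS, "Person"),
    ("adams", TYPE, "Manager"),
}
closure = materialize(asserted)
```

The three triples in closure that were not in asserted are the "missing relations" built by reasoning; a store that reasons on the fly would derive them at query time instead of saving them.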
LOD: Linked Open Data Initiative
Semantic Web query federation
Searching multiple datasets with one query
Semantic Web in relation to Big Data, or how to transform Big Data into Smart Data
  Sample vs. All
  Clean vs. Dirty
  Many Undiscovered
  Causation (Why) vs. Correlation
  Table vs. Graph
  Planned Path vs. Discovery
Data Science example using R and SPARQL
1. Extracts data from http://spatial.linkedscience.org and represents the result as a graph
Linked Data in Enterprise
[Figure: layered architecture, top to bottom]
  Access & Presentation Layer
  Semantic Graph model (W3C RDF Metadata Model)
  Index
  Data Servers: Event Server, Hadoop, Appliance, Content Mgmt, BI Server, Data Warehouse
  Data Sources / Types: Machine Generated Data, Social Media, Human Sourced Information, Subscription Services, Transaction Systems
Franz Corp. AllegroGraph
1. AllegroGraph is licensed under a proprietary commercial license
2. Focuses on high scalability
3. Development languages: Java, Python or LISP
4. Alternative to SPARQL queries: Prolog
5. RESTful HTTP protocol to maintain triples in the DB
6. Graphical tool: Gruff
Oracle Spatial & Graph
1. The Oracle RDF triple store is embedded in the relational database
   Schema MDSYS contains the RDF_LINK$ and RDF_VALUE$ tables
   SPARQL 1.1 supported in 12c
   Native support of most of the W3C rules
   Use of named graphs (quads) since 11.2.0.3
   Scales up to 100s of billions of triples
   Oracle-specific adapters available for Jena, Sesame, TopBraid, Protégé and Cytoscape
Oracle Spatial & Graph: other features
1. Support of temporal reasoning and spatial reasoning
2. Fine-grained security at triple level and for inferred graphs
3. The Oracle reasoner persists the inferred triples in the DB. As an alternative, Pellet or TrOWL can be integrated as an external OWL 2 reasoner
4. Jena and Sesame adapters
   1. To build SPARQL endpoints
   2. Bulk load triples from Java
   3. Develop applications in Java
5. Integration with OBIEE, RDF browser
SPARQL and SPARQL in SQL Architecture
[Figure: architecture diagram with the following components]
  HTTP: standard SPARQL endpoint, enhanced with query management control
  Java: Jena API with Jena Adapter, Sesame API with Sesame Adapter
  SPARQL-to-SQL translation logic
  SQL: SEM_MATCH rewritable table function
  Oracle Database RDF query engine
  Can be joined with any other relational table or view
RDB2RDF & R2RML: Modeling Relational Data as a Graph
  Relational to RDF modeling: W3C R2RML
  Oracle Spatial and Graph 12c can represent a relational schema as a graph view
  Integrate content from distributed sources
  Federate distributed databases
  Apply SPARQL queries on tables, views, SQL query results
  No duplication of data and storage
Graph Support on Oracle NoSQL
  Available on Oracle NoSQL Database (Enterprise Edition)
  Graph feature for NoSQL: RDF Graph support in Oracle NoSQL Database Enterprise Edition
  High-performance key-value store
  Standard access to graph data: SPARQL 1.1, Jena & Joseki SPARQL endpoint, Web Services
  Massive horizontal scalability of triples (petabytes)
  Support for World Wide Web Consortium (W3C) Semantic Web standards
Novartis Institutes for BioMedical Research (NIBR)
Use case: project Metastore
  NIBR is the global pharmaceutical organization for Novartis, committed to discovering innovative medicines to treat diseases with high unmet medical need
  6000+ scientists, physicians, business professionals worldwide
  METASTORE is a scientific knowledge portal used by many applications to search over ontology-oriented data
  Organized around scientific concept types: genes, proteins, indications, anatomy, diseases, taxonomy, etc.; can be hierarchically organized and classified
  Builds a semantic network of scientific concepts
Solution implemented: Oracle Spatial & Graph
1. Accessible through a dedicated service layer and reusable widgets
   Integrated application to visualize all Metastore content.
Use case: Fraud detection
A real-world fraud detection example
Find any circle of payments between accounts that all happened within 10 miles of San Jose within the last day and where the payments exceed $1000
Requires:
  Graph analytics
  Temporal reasoning
  Geospatial reasoning
  Social network analysis
Social Network Analysis answers 4 questions
  How far is P1 from P2, and how strong is the relation?
  To what groups does this person belong (ego groups, cliques)?
  How important is this person in the group?
  Does this group have a leader, and how cohesive is it?
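The first three questions can be sketched with plain Python on a small "knows" graph. The people, the adjacency list and the function names are hypothetical, and degree centrality is just one of several possible importance measures; graph databases with SNA support provide their own built-in operators for this.

```python
from collections import deque


def distances_from(graph, start):
    """Breadth-first search: number of hops from `start` to each reachable person."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for friend in graph.get(person, ()):
            if friend not in dist:
                dist[friend] = dist[person] + 1
                queue.append(friend)
    return dist


def ego_group(graph, person, depth):
    """Everyone within `depth` hops of `person`, including the person."""
    return {p for p, d in distances_from(graph, person).items() if d <= depth}


def degree_centrality(graph, group):
    """Fraction of the other group members each member is directly linked to."""
    others = max(len(group) - 1, 1)
    return {p: len(set(graph.get(p, ())) & (group - {p})) / others for p in group}


# Hypothetical "knows" relation, for illustration only.
knows = {
    "jans": ["ann", "bob"],
    "ann": ["jans", "bob", "carl"],
    "bob": ["jans", "ann"],
    "carl": ["ann", "dave"],
    "dave": ["carl"],
}
hops = distances_from(knows, "jans")          # question 1: how far is P1 from P2
group = ego_group(knows, "jans", 2)           # question 2: friends and friends of friends
centrality = degree_centrality(knows, group)  # question 3: importance within the group
most_important = max(centrality, key=centrality.get)
```

With this data, the ego group of depth 2 excludes the person three hops away, and the member with the most direct links inside the group comes out as the most important one.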
Activity recognition
Find all meetings that happened in November within 5 miles of Berkeley and that were attended by the most important person among Jans' friends and friends of friends.
  (select (?x)
    (ego-group person:jans knows ?group 2)               ; SNA
    (actor-centrality-members ?group knows ?x ?num)      ; SNA
    (q ?event fr:actor ?x)                               ; DB lookup
    (qs ?event rdf:type fr:meeting)                      ; RDFS
    (interval-during ?event "2008-11-01" "2008-11-06")   ; Temporal
    (geo-box-around geoname:berkeley ?event 5 miles)     ; Spatial
    !)
Fraud detection example using SPARQL
Find any circle of payments between accounts that all happened within 10 miles of San Jose within the last day and where the payments exceed $1000
  Find the circle
  Inspect the property graph
  Temporal
  Geo
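The combination of graph, temporal and geospatial conditions can be sketched in plain Python. This is not the SPARQL query from the slide: the payment records, field names, coordinates and function names are hypothetical, and the cycle search is a simple depth-first enumeration that would not scale to a large graph.

```python
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt


def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles (haversine formula)."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 3958.8 * asin(sqrt(a))


def payment_circles(payments, center, now, radius_miles=10, min_amount=1000):
    """Return each payment circle once, as a list of accounts in payment order."""
    # Temporal, geo and amount filters: keep only the qualifying edges.
    graph = {}
    for p in payments:
        recent = timedelta(0) <= now - p["time"] <= timedelta(days=1)
        near = miles_between(center[0], center[1], p["lat"], p["lon"]) <= radius_miles
        if recent and near and p["amount"] > min_amount:
            graph.setdefault(p["src"], set()).add(p["dst"])

    circles = []

    def walk(path):
        for nxt in sorted(graph.get(path[-1], ())):
            if nxt == path[0]:
                circles.append(path[:])          # back at the start: a circle
            elif nxt not in path and nxt > path[0]:
                walk(path + [nxt])               # start is the smallest account,
                                                 # so each circle is found once

    for start in sorted(graph):
        walk([start])
    return circles


# Hypothetical data: coordinates near San Jose and one payment far away.
now = datetime(2013, 1, 2, 12, 0, 0)
t = datetime(2013, 1, 1, 12, 12, 12)
san_jose = (37.30, -121.90)
payments = [
    {"src": "a", "dst": "b", "amount": 2000, "time": t, "lat": 37.31, "lon": -121.91},
    {"src": "b", "dst": "c", "amount": 1500, "time": t, "lat": 37.30, "lon": -121.90},
    {"src": "c", "dst": "a", "amount": 3000, "time": t, "lat": 37.29, "lon": -121.89},
    {"src": "c", "dst": "d", "amount": 5000, "time": t, "lat": 37.30, "lon": -121.90},
    {"src": "d", "dst": "a", "amount": 500, "time": t, "lat": 37.30, "lon": -121.90},
    {"src": "x", "dst": "y", "amount": 9000, "time": t, "lat": 37.30, "lon": -121.90},
    {"src": "y", "dst": "x", "amount": 9000, "time": t, "lat": 40.70, "lon": -74.00},
]
circles = payment_circles(payments, san_jose, now)
```

Filtering the edges first keeps the circle search small: payments that are too old, too far away or too small never enter the graph that is searched.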
Conclusion: Why should you choose Semantic Web?
1. You want a flexible, adaptable, transparent information architecture
2. The project requires complex structures and a large number of relations between classes as well as properties
3. The project requires integration of data from different sources
4. Heterogeneous sets of metadata and vocabulary concepts, originating from multiple sources
5. Need for semantic annotations using controlled vocabularies and thesauri such as FOAF, OWL, SKOS, etc.
6. There is a need for making logical deductions based on rules defined by these controlled vocabularies.
THANK YOU.
Marc Lieber
Marc.lieber@trivadis.com
www.trivadis.com