Taming Big Data Variety with Semantic Graph Databases Evren Sirin CTO Complexible
About Complexible Semantic Tech leader since 2006 (née Clark & Parsia) software, consulting W3C leadership Offices in DC & Boston Launched Stardog 1.0 in 2012 Currently raising A Round
Big Data Vs Volume Velocity Variety Veracity Volatility Value
Data diversity is the real challenge Based on Paradigm4 survey of more than 100 data scientists http://www.paradigm4.com/infographic2014/
Data Variety Syntax: Formats Structure: Schemas Identity: Entities https://www.flickr.com/photos/designmilk/8552219138
In large and complex enterprises with lots of data, most analytic challenges can be reduced to data integration challenges.
Data integration approaches Integrated data Data warehouses Sweet spot Data lakes Integration effort
What is a Unified Data Model? Global coherent view over heterogeneous data flexible and composable at the right level of abstraction enabling automated processing and analysis
Data models Tables Trees Graphs
Data models Tables Trees Graphs https://commons.wikimedia.org/wiki/file:social_network_analysis_visualization.png
Graphs are everywhere Knowledge Graph Open Graph Linked Open Data
Why graphs? Generic data representation model Utilize connectedness of the data Flexible and extensible Easy to compose and connect Increasing number of graph database offerings
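To make "easy to compose and connect" concrete, here is a minimal sketch that treats a graph as a plain set of (subject, predicate, object) edges. The two sources, the `ex:` identifiers, and the `neighbors` helper are hypothetical and exist only for illustration.

```python
# Two hypothetical sources that describe overlapping entities as
# (subject, predicate, object) edges. All names here are made up.
crm = {
    ("ex:bob", "ex:worksFor", "ex:acme"),
    ("ex:bob", "ex:name", "Bob"),
}
hr = {
    ("ex:bob", "ex:birthDate", "1990-07-04"),
    ("ex:acme", "ex:locatedIn", "ex:boston"),
}

# Composing two graphs is a set union: shared node identifiers
# connect the data without any schema migration.
merged = crm | hr


def neighbors(graph, node):
    """Return the outgoing (predicate, object) pairs of a node, sorted."""
    return sorted((p, o) for s, p, o in graph if s == node)


bob_facts = neighbors(merged, "ex:bob")
print(bob_facts)
```

After the union, `ex:bob` carries facts from both sources because both used the same identifier for him.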
Why not graphs? Generic data representation model Utilize connectedness of the data Flexible and extensible Easy to compose and connect Increasing number of graph database offerings
Semantic graphs = RDF graphs Meaning is defined in an explicit and machine-processable way
Abstract Graph http://www.w3.org/TR/rdf11-primer/
RDF Graph http://www.w3.org/TR/rdf11-primer/
RDF Serialization

    BASE <http://example.org/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    PREFIX schema: <http://schema.org/>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX wd: <http://www.wikidata.org/entity/>

    <bob#me>
        a foaf:Person ;
        foaf:knows <alice#me> ;
        schema:birthDate "1990-07-04"^^xsd:date ;
        foaf:topic_interest wd:Q12418 .

    wd:Q12418
        dcterms:title "Mona Lisa" ;
        dcterms:creator <http://dbpedia.org/resource/Leonardo_da_Vinci> .

    <http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619>
        dcterms:subject wd:Q12418 .

http://www.w3.org/TR/rdf11-primer/

SPARQL Query

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX schema: <http://schema.org/>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX dbpedia: <http://dbpedia.org/resource/>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    SELECT ?person ?title
    WHERE {
        ?person a foaf:Person ;
                schema:birthDate ?birthdate ;
                foaf:topic_interest ?interest .
        ?interest dcterms:title ?title ;
                  dcterms:creator dbpedia:Leonardo_da_Vinci .
        FILTER (?birthdate < "1991-01-01"^^xsd:date)
    }
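To show what the query above asks of the graph, here is a minimal sketch of triple-pattern matching in plain Python. It is not how Stardog or any SPARQL engine is implemented. The `match` and `solve` helpers are hypothetical, prefixed names are kept as plain strings instead of being expanded to full IRIs, and the Europeana triple is left out for brevity.

```python
from datetime import date

BOB = "http://example.org/bob#me"
ALICE = "http://example.org/alice#me"
MONA_LISA = "http://www.wikidata.org/entity/Q12418"
LEONARDO = "http://dbpedia.org/resource/Leonardo_da_Vinci"

# The example graph as a set of (subject, predicate, object) triples.
triples = {
    (BOB, "rdf:type", "foaf:Person"),
    (BOB, "foaf:knows", ALICE),
    (BOB, "schema:birthDate", date(1990, 7, 4)),
    (BOB, "foaf:topic_interest", MONA_LISA),
    (MONA_LISA, "dcterms:title", "Mona Lisa"),
    (MONA_LISA, "dcterms:creator", LEONARDO),
}


def is_var(term):
    """Variables are strings that start with '?', as in SPARQL."""
    return isinstance(term, str) and term.startswith("?")


def match(pattern, triple, binding):
    """Extend binding so that pattern equals triple; None on conflict."""
    result = dict(binding)
    for p, t in zip(pattern, triple):
        if is_var(p):
            if result.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return result


def solve(patterns, graph):
    """Join the triple patterns left to right (naive nested loops)."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for b in bindings:
            for t in graph:
                b2 = match(pattern, t, b)
                if b2 is not None:
                    extended.append(b2)
        bindings = extended
    return bindings


query = [
    ("?person", "rdf:type", "foaf:Person"),
    ("?person", "schema:birthDate", "?birthdate"),
    ("?person", "foaf:topic_interest", "?interest"),
    ("?interest", "dcterms:title", "?title"),
    ("?interest", "dcterms:creator", LEONARDO),
]

# The FILTER clause becomes an ordinary comparison on the bound value.
rows = [(b["?person"], b["?title"])
        for b in solve(query, triples)
        if b["?birthdate"] < date(1991, 1, 1)]
print(rows)
```

The only solution binds `?person` to Bob and `?title` to "Mona Lisa". A real engine would use indexes and a query planner where this sketch scans the whole graph for every pattern.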
Schema - Ontology [Diagram: Person and Organization are each rdfs:subClassOf Agent; worksFor is owl:inverseOf hasEmployee, with an rdfs:range edge; the instances Bob and ACME are attached to the schema by rdf:type edges; Bob worksFor ACME, ACME hasEmployee Bob]
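The schema above lets a reasoner derive facts that were never stated. The following is a minimal sketch of two such rules (subclass and inverse property) applied to a fixpoint. It assumes Bob is typed as Person and ACME as Organization, which is one reading of the diagram, and it is not a description of Stardog's OWL 2 reasoner. The `infer` function is hypothetical.

```python
# Schema, written as plain lookups (assumed reading of the diagram).
SUBCLASS_OF = {"Person": "Agent", "Organization": "Agent"}
INVERSE_OF = {"worksFor": "hasEmployee", "hasEmployee": "worksFor"}

# Explicitly asserted instance data.
asserted = {
    ("Bob", "rdf:type", "Person"),
    ("ACME", "rdf:type", "Organization"),
    ("Bob", "worksFor", "ACME"),
}


def infer(facts):
    """Apply the subclass and inverse rules until nothing new appears."""
    closure = set(facts)
    while True:
        new = set()
        for s, p, o in closure:
            # rdfs:subClassOf: an instance of a class is an instance
            # of its superclass.
            if p == "rdf:type" and o in SUBCLASS_OF:
                new.add((s, "rdf:type", SUBCLASS_OF[o]))
            # owl:inverseOf: flip subject and object.
            if p in INVERSE_OF:
                new.add((o, INVERSE_OF[p], s))
        if new <= closure:
            return closure
        closure |= new


inferred = sorted(infer(asserted) - asserted)
print(inferred)
```

The derived triples say that Bob and ACME are Agents and that ACME hasEmployee Bob, so a query over hasEmployee succeeds even though only worksFor was asserted.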
Tables to graphs (R2RML) http://www.w3.org/TR/rdb2rdf-ucr/
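The idea behind a table-to-graph mapping can be sketched in a few lines: each row becomes a subject, each column a predicate, and foreign keys become links to other subjects. This is plain Python, not R2RML syntax. The table, column names, base IRI, and `row_to_triples` helper are all hypothetical.

```python
BASE = "http://example.org/"  # assumed base IRI, for illustration only

# A hypothetical EMPLOYEE table, one dict per row.
rows = [
    {"id": 1, "name": "Bob", "dept_id": 10},
    {"id": 2, "name": "Alice", "dept_id": 10},
]


def row_to_triples(row):
    """Map one row to triples: primary key -> subject, columns -> edges."""
    subject = f"{BASE}employee/{row['id']}"
    yield (subject, "rdf:type", f"{BASE}Employee")
    yield (subject, f"{BASE}name", row["name"])
    # The foreign key becomes an edge to another node, not a literal.
    yield (subject, f"{BASE}department", f"{BASE}department/{row['dept_id']}")


triples = [t for row in rows for t in row_to_triples(row)]
for t in triples:
    print(t)
```

Both employees end up pointing at the same department node, which is where the tabular data turns into a connected graph.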
RDF Models Interoperable - No vendor lock-in Actionable - Run queries against it Expressive - Multiple views over same data Reusable - By different apps in other domains
Model-based data integration
Stardog: Semantic Graph Database The leading RDF database Pure Java: any JVM language, full REST bindings Client-server, embedded, middleware modes Rich feature set: ACID Transactions, High Availability, Hot backup/restore, JMX server monitoring, Access & Audit logging, RBAC security model, LDAP integration, SPARQL 1.1 queries, OWL 2 Reasoning, Proof trees, Integrity constraints, Full-text search, Geospatial support, Virtual graphs, Provenance support Supports property graphs (TinkerPop)
Single-node Scalability Scale up to 50B triples on modest hardware 32 cores, 256 GB RAM, 2 x 7200RPM HDDs, < $10K cost Load rates up to 500k triples/second That's 100M triples in 3 min, 1B in 30 min, and 20B in 20 hours Best-of-breed query answering performance Query 100M triples with a throughput of 3M+ queries/hour, 1B at 500k queries/hour, and 10B at 20k queries/hour (BSBM, 64 clients)
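The load-time figures follow from simple arithmetic on the quoted rates. This short check uses only numbers from the slide; the function name is hypothetical.

```python
PEAK_LOAD_RATE = 500_000  # triples per second, the peak rate quoted above


def load_seconds(triple_count, rate=PEAK_LOAD_RATE):
    """Seconds needed to load triple_count triples at a constant rate."""
    return triple_count / rate


minutes_100m = load_seconds(100_000_000) / 60
minutes_1b = load_seconds(1_000_000_000) / 60

# Sustained rate implied by "20B in 20 hours".
implied_rate_20b = 20_000_000_000 / (20 * 3600)

print(minutes_100m, minutes_1b, implied_rate_20b)
```

At the peak rate, 100M triples take about 3.3 minutes and 1B about 33 minutes, matching the rounded figures. The 20B figure implies a sustained rate of roughly 278k triples/second, consistent with 500k being an upper bound ("up to") and not a constant.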
Stardog for Big Data HDFS-backed storage Horizontal partitioning of data Advanced query planner and optimization Parallel query execution with async messaging Coming in version 5 (2016)
Big Data computations Integration with Apache Spark Run Spark jobs over integrated view of the data PageRank, machine learning algos,... Different ways to expose RDF data in Spark RDD[Triple] RDD[SPARQLResults]
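As an illustration of the kind of computation meant here, the sketch below runs PageRank over a collection of triples. It is plain Python standing in for a Spark job, not Spark or Stardog code. The sample triples, the `pagerank` function, and its parameters are hypothetical, and every edge is treated alike regardless of predicate.

```python
from collections import defaultdict

# A hypothetical collection of triples, playing the role of RDD[Triple].
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "knows", "alice"),
    ("dave", "knows", "alice"),
]


def pagerank(triples, damping=0.85, iterations=50):
    """Power-iteration PageRank over the subject -> object edges."""
    out_links = defaultdict(list)
    nodes = set()
    for s, _, o in triples:
        out_links[s].append(o)
        nodes.update((s, o))

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = out_links.get(n)
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[n] / len(nodes)
                for t in nodes:
                    nxt[t] += share
        rank = nxt
    return rank


ranks = pagerank(triples)
top = max(ranks, key=ranks.get)
print(top, ranks)
```

In this small graph `alice` ranks highest because two nodes link to her. In a distributed setting the same per-edge contributions would be computed in parallel across partitions.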
Thanks!