Big Data, Fast Data, Complex Data. Jans Aasman Franz Inc




Franz Inc: Who We Are
Private, founded 1984
AI, Semantic Technology, professional services
Now in Oakland

(1 (2 3) (4 5) (6 7) (8 9) (10 11) (12 13) (14 15)(16 17) (18 19 20 21 22 23 24 27 28) (29 30))

How is it different from an RDB and why is it more flexible?
No Schema: say whatever you want to say, although ontologies may constrain what you put in the triple store.
No Link Tables: you can express one-to-many relationships directly.
No Indexing Choices: you can add new data attributes (predicates) on the fly, and they are available for querying in real time because everything is automatically indexed.
Takes anything you give it: it is trivial to consume rows and columns from an RDB, XML, RDF(S), OWL, text and extracted entities.
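The points above can be illustrated with a toy in-memory triple store. This is a minimal sketch, not AllegroGraph's implementation: the class `TinyTripleStore`, its method names and the sample data are all hypothetical, chosen only to show one-to-many relationships without a link table and a new predicate that is queryable as soon as it is added.

```python
from collections import defaultdict


class TinyTripleStore:
    """Toy schema-less store (hypothetical): every triple is indexed three ways."""

    def __init__(self):
        self.triples = set()
        self.by_subject = defaultdict(set)
        self.by_predicate = defaultdict(set)
        self.by_object = defaultdict(set)

    def add(self, s, p, o):
        # No schema check and no indexing choice: all three indices are updated.
        triple = (s, p, o)
        self.triples.add(triple)
        self.by_subject[s].add(triple)
        self.by_predicate[p].add(triple)
        self.by_object[o].add(triple)

    def match(self, s=None, p=None, o=None):
        """Return the triples matching a pattern; None acts as a wildcard."""
        candidates = []
        for index, key in ((self.by_subject, s), (self.by_predicate, p), (self.by_object, o)):
            if key is not None:
                candidates.append(index.get(key, set()))
        if not candidates:
            return set(self.triples)
        return set.intersection(*candidates)


store = TinyTripleStore()
# One-to-many relationship stated directly, no link table needed.
store.add("alice", "has-phone", "555-0100")
store.add("alice", "has-phone", "555-0101")
# A predicate nobody declared in advance; it is indexed and queryable immediately.
store.add("alice", "favorite-color", "green")

phones = store.match(s="alice", p="has-phone")
colors = store.match(p="favorite-color")
```

The design choice worth noting is that `add` always writes to every index, which is what removes the indexing decision from the user at the cost of extra storage.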

We in the Semantic Community call what we do Complex Data

Complex data is good at
Knowledge (instead of data)
RDF and Logic
Built to share information about objects: think Linked Open Data Cloud (Public and Enterprise)
Complex ad hoc queries, rules and graph algorithms
Getting more and more scalable by the day
And all built on standards

But they keep asking: Shouldn't we do this with big data / NoSQL solutions? Or with a fast in-memory graph database?

Big Data is really good at
Insane amounts of data
Relatively flexible data structures
Finding a single object very fast
Rudimentary analytics using map/reduce
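As an illustration of what "rudimentary analytics using map/reduce" looks like, here is a single-process sketch of the three classic phases (map, shuffle, reduce) counting objects per type. The function names and the sample records are assumptions made for this example; a real Hadoop job would distribute each phase across machines.

```python
from collections import defaultdict


def map_phase(records):
    """Emit one (key, 1) pair per record; here the key is the record's type."""
    for record in records:
        yield record["type"], 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each group of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical sample data.
records = [
    {"id": 1, "type": "customer"},
    {"id": 2, "type": "device"},
    {"id": 3, "type": "customer"},
]
counts = reduce_phase(shuffle(map_phase(records)))
```

Aggregations of this shape fit the model well because each record is processed independently; queries that must follow links between records do not decompose this way.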

Hadoop brought parallel data processing to the masses, but this is what we do in our labs.
Notice the Sparse Graph problem.
And here is where map/reduce fails:
http://cacm.acm.org/blogs/blog-cacm/149074-possible-hadoop-trajectories/fulltext

And what about fast data?
A new OVUM marketing term for in-memory triple stores or in-memory graph databases.
Do we need them? Well, if you have problems expressed as graphs.

Q1: A reasonably hard query for horizontally scaling stores and RDBs, a straightforward query for a graph database

    select ?a ?b ?c ?d ?e where {
      Franz send-money ?a
      ?a send-money ?b
      ?b send-money ?c
      ?c send-money Cray
      Cray send-money ?d
      Not (?d = ?c)
      ?d send-money ?e
      Not (?e = ?b)
      ?e send-money Franz
    }
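To make the shape of Q1 concrete, the pattern can be evaluated by following edges in an adjacency map, which is roughly how a graph database walks it. This is a minimal sketch under stated assumptions: the function `money_cycles`, the `sends` dictionary and the node names A to E are hypothetical sample data, not part of the original query.

```python
def money_cycles(sends, start="Franz", via="Cray"):
    """Find bindings for ?a ?b ?c ?d ?e in the Q1 pattern by following edges."""
    results = []
    for a in sends.get(start, ()):
        for b in sends.get(a, ()):
            for c in sends.get(b, ()):
                if via not in sends.get(c, ()):
                    continue  # ?c must send money to Cray
                for d in sends.get(via, ()):
                    if d == c:
                        continue  # Not (?d = ?c)
                    for e in sends.get(d, ()):
                        if e == b:
                            continue  # Not (?e = ?b)
                        if start in sends.get(e, ()):
                            results.append((a, b, c, d, e))
    return sorted(results)


# Hypothetical send-money edges: sender -> set of receivers.
sends = {
    "Franz": {"A"},
    "A": {"B"},
    "B": {"C"},
    "C": {"Cray"},
    "Cray": {"C", "D"},
    "D": {"E"},
    "E": {"Franz"},
}
cycles = money_cycles(sends)
```

Each nested loop is a single neighbor lookup. In a relational or horizontally partitioned store the same query becomes a seven-way self-join, with each hop potentially landing on a different partition.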

Q1: A very hard query for NoSQL stores and RDBs, a straightforward query for a graph database
Find a money trail from Franz to Cray that is more than two steps, find another money trail between Franz and Cray that is more than two steps, where the two trails are completely different.

    (select (?path1 ?path2)
      (path Franz Cray <send-money> >= 2 ?path1)
      (path Cray Franz <send-money> >= 2 ?path2)
      (empty (intersection ?path1 ?path2)))
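The path form of the query can be sketched as a search for two trails whose intermediate nodes do not overlap. The code below is an illustration only: `simple_paths`, `disjoint_trail_pairs`, the `max_steps` bound and the sample graph are assumptions, and "completely different" is interpreted here as sharing no intermediate node (the endpoints Franz and Cray are necessarily shared).

```python
def simple_paths(sends, source, target, min_steps=2, max_steps=6):
    """Yield cycle-free paths from source to target with at least min_steps edges."""
    stack = [[source]]
    while stack:
        path = stack.pop()
        if len(path) - 1 >= max_steps:
            continue  # already at the step limit, do not extend further
        for nxt in sorted(sends.get(path[-1], ())):
            if nxt == target:
                # Appending the target gives the path len(path) edges.
                if len(path) >= min_steps:
                    yield path + [nxt]
            elif nxt not in path:
                stack.append(path + [nxt])


def disjoint_trail_pairs(sends, a="Franz", b="Cray", min_steps=2):
    """Pair each a->b trail with each b->a trail that shares no intermediate node."""
    pairs = []
    for path1 in simple_paths(sends, a, b, min_steps):
        for path2 in simple_paths(sends, b, a, min_steps):
            if not set(path1[1:-1]) & set(path2[1:-1]):
                pairs.append((path1, path2))
    return pairs


# Hypothetical send-money edges: sender -> set of receivers.
sends = {
    "Franz": {"A"},
    "A": {"B"},
    "B": {"C"},
    "C": {"Cray"},
    "Cray": {"C", "D"},
    "D": {"E"},
    "E": {"Franz"},
}
pairs = disjoint_trail_pairs(sends)
```

Unlike the fixed pattern, the number of hops is not known in advance, so the query cannot be written as a fixed number of joins; this is the part that is hard for stores without native path traversal.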

You have billions of same-type objects and you need to retrieve them extremely fast. Or you need simple analytics.
You have a fixed-size, static data set and you need fast graph computations and pattern matching.
You need all the features of an enterprise database, but you also need to work with an ontology-driven knowledge base and rules, with the flexibility of a graph database.

Are there applications where we need all three?
Track customers, insurance customers, credit cards, employees, parts, etc. in real time.
Always have a 360 view on every entity.

Big Data: Hadoop would be great for storing all triples about a customer, but map/reduce wouldn't get you anywhere when dealing with individual triples or detailed analysis. And it certainly won't help you with single-user updates.
Fast Data: graph databases are currently not dynamic enough and their memory footprint is too big.
Triple stores: We currently solve the problem in AG4 by partitioning on account id and device id: get an object by graph, create a memory cache, apply rules and the prediction engine, store changes.
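The four-step cycle described for triple stores (get an object by graph, create a memory cache, apply rules, store changes) might look roughly like the sketch below. Everything here is hypothetical: `PartitionedStore`, `process_event`, the `flag_big_spender` rule and its threshold stand in for whatever AG4 actually does, which the slides do not show.

```python
from collections import defaultdict


class PartitionedStore:
    """Toy store (hypothetical) keeping one named graph of triples per account id."""

    def __init__(self):
        self.graphs = defaultdict(set)

    def load(self, account_id):
        # Return a copy: this working set plays the role of the memory cache.
        return set(self.graphs[account_id])

    def save(self, account_id, triples):
        self.graphs[account_id] = set(triples)


def flag_big_spender(triples, threshold=1000):
    """Hypothetical rule: flag an account whose total spend exceeds the threshold."""
    spent = [(s, o) for (s, p, o) in triples if p == "spent"]
    total = sum(amount for _, amount in spent)
    if spent and total > threshold:
        return {(spent[0][0], "flag", "big-spender")}
    return set()


def process_event(store, account_id, new_triples, rules):
    cache = store.load(account_id)      # 1. get the object by graph, 2. cache it
    cache |= set(new_triples)
    for rule in rules:                  # 3. apply rules to the cached object
        cache |= rule(cache)
    store.save(account_id, cache)       # 4. store the changes
    return cache


store = PartitionedStore()
first = process_event(store, "acct-1", [("acct-1", "spent", 600)], [flag_big_spender])
second = process_event(store, "acct-1", [("acct-1", "spent", 700)], [flag_big_spender])
```

Because each account lives in its own graph, one update touches a single partition and the rules only ever see that entity's triples.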

AG Horizontal: Distributed triple store.
Using Hadoop principles
Automatic SPARQL to MapReduce translation
AG Vertical: Mostly in-memory triple store.
500% more triples per Gig, including all strings and indices
Programmable as a graph database.

AG Horizontal
Uses BigData hashing ideas for partitioning
Redundant storage for multiple indices (slices)
We have SPARQL 1.0 and partial SPARQL 1.1 support, where we translate a query into a query flow graph and pipeline.
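One common way to combine hash partitioning with redundant indices is to store every triple once per index order and to place it by hashing that order's leading component. The sketch below shows that general idea only; the index names, the partition count and the functions `partition_of`, `build_slices` and `lookup_by_predicate` are assumptions, not a description of AG Horizontal's actual layout.

```python
import hashlib

NUM_PARTITIONS = 4  # hypothetical cluster size
# Each index order lists triple positions (0 = subject, 1 = predicate, 2 = object).
INDEX_ORDERS = {"spo": (0, 1, 2), "pos": (1, 2, 0), "osp": (2, 0, 1)}


def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Stable hash of a term to a partition number (independent of Python's hash seed)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def build_slices(triples):
    """Store each triple redundantly, once per index, partitioned by the leading term."""
    slices = {name: [set() for _ in range(NUM_PARTITIONS)] for name in INDEX_ORDERS}
    for triple in triples:
        for name, order in INDEX_ORDERS.items():
            leading_term = triple[order[0]]
            slices[name][partition_of(leading_term)].add(triple)
    return slices


def lookup_by_predicate(slices, predicate):
    """A predicate-bound lookup only needs the one 'pos' partition the predicate hashes to."""
    partition = slices["pos"][partition_of(predicate)]
    return {t for t in partition if t[1] == predicate}


# Hypothetical sample data.
triples = [
    ("Franz", "send-money", "A"),
    ("A", "send-money", "B"),
    ("Franz", "located-in", "Oakland"),
]
slices = build_slices(triples)
sent = lookup_by_predicate(slices, "send-money")
```

The redundancy triples the storage, but whichever term of a pattern is bound, some index can route the lookup to a single partition instead of broadcasting it.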

AG Vertical
We have a new graph-based database kernel called AIMS
Almost In Memory Store
Almost In a Micro Second
Total disk size for 1 B triples = 35 Gig
Including all strings and inverse indices: 35 bytes per triple.
25% is for the spogi index = 8.75 bytes per triple
A breakthrough in terms of speed and size
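The per-triple figures follow directly from the totals quoted above. The short calculation below reproduces them, assuming 1 Gig means 10^9 bytes and 1 B means 10^9 triples; the variable names are only for illustration.

```python
total_disk_bytes = 35 * 10**9   # 35 Gig, assuming 1 Gig = 10^9 bytes
triple_count = 10**9            # 1 B triples

# Total footprint per triple, including strings and inverse indices.
bytes_per_triple = total_disk_bytes / triple_count

# The spogi index is stated to take 25% of that footprint.
spogi_share = 0.25
spogi_bytes_per_triple = bytes_per_triple * spogi_share

print(bytes_per_triple, spogi_bytes_per_triple)  # prints 35.0 8.75
```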

A simple memory footprint test X?a?y?z

Memory Footprint Results*
* Test data and SPARQL, CYPHER, and PROLOG code available on our website.

Thanks!