NoSQL and Graph Database Biswanath Dutta DRTC, Indian Statistical Institute 8th Mile Mysore Road R. V. College Post Bangalore 560059 International Conference on Big Data, Bangalore, 9-20 March 2015
Outline NoSQL: problem; NoSQL properties and types; motivation; various solutions and why. Graph Database: graph and graph database; graph analytics; various graph databases. Graph and RDF triple: RDF; graph database and RDF triple store. Conclusion.
Introduction Big Data: immense processing and storage requirements; varieties of applications.
RDBMS Problem Reading and writing slow down as the data size increases, and the database becomes prone to deadlocks. Limited capacity: existing SQL solutions do not scale big enough. Expansion is difficult. Database technology is becoming increasingly important.
Various facets of demands High concurrency of reading and writing with low latency. Efficient big data storage and access requirements. High scalability and availability. Lower management and operational costs. Source: [4]
NoSQL Not only SQL. A non-relational database system that tends to be inherently distributed, schema-less, and horizontally scalable (sharding). A means of storage and retrieval of data other than the tabular relations used in relational databases. A very wide category for a group of persistence solutions which do not follow the relational data model and which do not use SQL as the query language.
NoSQL Properties Scale horizontally. Simple and flexible non-relational data models (schema-less). Data replication and distribution over multiple machines for coping with failure and achieving eventual consistency. Simple interfaces for searching the data and calling procedures. Most NoSQL stores lack ACID (Atomicity, Consistency, Isolation, Durability) transactions, although a few recent systems, such as FairCom c-treeACE, Google Spanner, FoundationDB and OrientDB, have made them central to their designs. NoSQL is sometimes referred to as a BASE system (Basically Available, Soft-state, Eventually consistent). High availability: many NoSQL stores compromise Consistency (in terms of the CAP (Consistency, Availability and Partition tolerance) theorem).
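As a rough illustration of horizontal scaling with replication, the sketch below maps each key to a shard by hashing and writes it to that shard plus the next one. The shard count, the replica count and all function names are assumptions made for this example; they do not describe any particular NoSQL product, and the synchronous copy is a simplification (an eventually consistent store would propagate the replica write later).

```python
import hashlib
from typing import List, Optional

NUM_SHARDS = 4   # assumed cluster size (hypothetical)
REPLICAS = 2     # assumed number of copies per key (hypothetical)

# Each shard is a plain dict here; in practice it would be a separate machine.
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(key: str) -> int:
    """Map a key to a shard with a stable hash (same key, same shard)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def replica_shards(key: str) -> List[int]:
    """Primary shard followed by the next shard(s) in ring order."""
    primary = shard_for(key)
    return [(primary + i) % NUM_SHARDS for i in range(REPLICAS)]

def put(key: str, value: str) -> None:
    """Write the value to every replica of the key."""
    for s in replica_shards(key):
        shards[s][key] = value

def get(key: str, failed: frozenset = frozenset()) -> Optional[str]:
    """Read from the first replica that is not marked as failed."""
    for s in replica_shards(key):
        if s not in failed:
            return shards[s].get(key)
    return None

put("user:1", "John")
```

Adding machines raises `NUM_SHARDS`, which is the sense in which such a store scales horizontally; the read still succeeds when the primary shard of a key is marked as failed.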
The key motivations for NoSQL Motivations for this approach include simplicity of design, horizontal scaling, and finer control over availability.
Types of NoSQL Databases (A classification based on data model)
1. Column: Accumulo, Cassandra, Druid, HBase, Vertica
2. Document: Clusterpoint, Apache CouchDB, Couchbase, MarkLogic, MongoDB, OrientDB
3. Key-value: Dynamo, FoundationDB, MemcacheDB, Redis, Riak, FairCom c-treeACE, Aerospike, OrientDB
4. Graph: AllegroGraph, Neo4j, InfiniteGraph, OrientDB, Virtuoso, Stardog
5. Multi-model: OrientDB, FoundationDB, ArangoDB, Alchemy Database, CortexDB
Why so many NoSQL Solutions? A one-size-fits-all solution cannot be provided because of the variety of data and their processing requirements, e.g., blogging data, transportation data, social relations, road maps. Blogging data requires a document type NoSQL solution. Transportation, social relations and road maps require a graph solution. The particular suitability of a given NoSQL database depends on the problem it must solve.
NoSQL Barriers Barriers to the greater adoption of NoSQL stores include: lack of a standard query language (such as SQL) and use of low-level query languages; lack of ACID transactions [7]; lack of standardized interfaces; huge investments already made in SQL by enterprises.
Graph Databases
What is a Graph? An abstract representation of a set of objects where some pairs of objects are connected by links. The interconnected objects are represented by abstractions called vertices or nodes. The links that connect some pairs of vertices are called edges or arcs.
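Types of Graphs Directed graph; undirected graph; mixed graph; multigraph; hypergraph.
Types of Graphs (contd 2) Weighted graph: edges carry a weight (e.g., 0.5). Labeled graph: edges carry a label (e.g., John knows Mary). Property graph: nodes and edges carry properties (e.g., nodes with name: John, age: 32 and name: Peter, age: 30, and an edge with type: knows).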
What is a Graph Database? A database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data. Key characteristic: Provides index-free adjacency. (an index is unnecessary as each node knows the location of its adjacent nodes)
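Index-free adjacency can be pictured with a small sketch in which every node object holds direct references to its neighbours, so a traversal follows references instead of consulting a global index. The class and the sample edges below are assumptions for illustration, not the storage layout of any real graph database.

```python
class Node:
    """A node that keeps direct references to its adjacent nodes."""

    def __init__(self, name):
        self.name = name
        self.out_edges = []  # list of (label, Node) pairs

    def connect(self, label, other):
        """Add a directed, labeled edge from this node to another node."""
        self.out_edges.append((label, other))

    def neighbours(self, label=None):
        """Adjacent nodes, optionally restricted to one edge label."""
        return [n for (l, n) in self.out_edges if label is None or l == label]

# Hypothetical sample data.
john, mary, peter = Node("John"), Node("Mary"), Node("Peter")
john.connect("knows", mary)
mary.connect("knows", peter)

# Two hops from John, following references only (no index lookup).
friends_of_friends = [
    fof.name
    for friend in john.neighbours("knows")
    for fof in friend.neighbours("knows")
]
```

The cost of each hop depends on the number of edges at the current node, not on the total size of the graph.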
Why Graph Databases? Graph databases are designed to: Store interconnected data, e.g., the relationships between people in social networks, between people and artifacts, between items and attributes in recommendation engines. Make it easy to make sense of the data. Make it easy to evolve the database. Enable optimal performance for operations such as discovery of connected data patterns and relatedness queries of arbitrary length. Ironically, relational databases do not store relations.
Why do people use Graph Databases? Problems with join performance. Continuously evolving data set. The shape of the domain is naturally a graph. Graphs are everywhere, e.g., social networks, biological networks, the interstate highway system, the hyperlink structure of the Web.
Graph Analytics Example query: sushi restaurants in Trento that my friends like most. Example graph: nodes John, Bob, Mary, Restaurant: Japoni, Restaurant: ishushi, Cuisine: Sushi, City: Trento; edge labels IsFriendOf, likes, serves, located_in.
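The query on this slide can be sketched over a small list of edges. The node and label names are taken from the slide, but which node each edge connects is an assumption made here (the figure is not recoverable), as is treating Bob as the person asking; the helper names are hypothetical.

```python
from collections import Counter

# (source, label, target) edges; the wiring is assumed for illustration.
edges = [
    ("Bob", "IsFriendOf", "John"),
    ("Bob", "IsFriendOf", "Mary"),
    ("John", "likes", "Japoni"),
    ("Mary", "likes", "ishushi"),
    ("Japoni", "serves", "Sushi"),
    ("ishushi", "serves", "Sushi"),
    ("Japoni", "located_in", "Trento"),
    ("ishushi", "located_in", "Trento"),
]

def targets(source, label):
    """All nodes reachable from source over one edge with the given label."""
    return [t for (s, l, t) in edges if s == source and l == label]

def sushi_in_trento_liked_by_friends(person):
    """Count, per restaurant, how many friends of person like it."""
    likes = Counter()
    for friend in targets(person, "IsFriendOf"):
        for restaurant in targets(friend, "likes"):
            serves_sushi = "Sushi" in targets(restaurant, "serves")
            in_trento = "Trento" in targets(restaurant, "located_in")
            if serves_sushi and in_trento:
                likes[restaurant] += 1
    return likes.most_common()  # most liked first

result = sushi_in_trento_liked_by_friends("Bob")
```

A graph database would answer the same question with a pattern-matching query rather than hand-written loops; the sketch only shows the shape of the traversal.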
Graph Analytics (contd 2) Transportation network: return the shortest or cheapest flight/road from one city to another. Social network: determine whether there is a path of fewer than 4 steps which connects two users in a social network; find the movies acted in by actor X that people like most. Financial network: fraud detection, money laundering; find the path connecting two suspicious transactions. Temporal network: compute the number of computers that were affected by a particular computer virus within three days or thirty days of its discovery. Recommendation. Network impact analysis. Information/document usage pattern. Source: [5]
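Two of these questions (a shortest route, and whether two users are connected in fewer than 4 steps) reduce to breadth-first search on an unweighted graph. The following is a minimal sketch; the adjacency data and the user names are hypothetical, and a cheapest-route query over weighted edges would need a different algorithm such as Dijkstra's.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search; returns the nodes on a shortest path, or None."""
    if start == goal:
        return [start]
    visited = {start}
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        for neighbour in graph.get(path[-1], ()):
            if neighbour in visited:
                continue
            if neighbour == goal:
                return path + [neighbour]
            visited.add(neighbour)
            queue.append(path + [neighbour])
    return None

def connected_within(graph, a, b, max_steps):
    """True if a and b are joined by a path of fewer than max_steps edges."""
    path = shortest_path(graph, a, b)
    return path is not None and len(path) - 1 < max_steps

# Hypothetical social graph given as adjacency lists.
social = {
    "ann": ["ben"],
    "ben": ["ann", "cy"],
    "cy": ["ben", "dee"],
    "dee": ["cy", "eve"],
    "eve": ["dee"],
}
```

With this data, `connected_within(social, "ann", "dee", 4)` holds because the shortest path has 3 edges, while ann and eve are 4 edges apart and therefore fail the same check.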
Graph based NoSQL Solutions Graph database: Neo4j - Open Source, Java, Property Graph model; Sones - Closed Source, .NET focused; HyperGraphDB - Open Source, Java, HyperGraph model; FlockDB - Open Source, Scala (runs on the JVM). RDF database (triple store): AllegroGraph - Closed Source, RDF-QuadStore; Virtuoso - Closed Source, RDF focused; 4store - RDF based.
Neo4j Made up of nodes, relationships and properties. Nodes contain properties in the form of key-value pairs. Relationships connect and structure nodes; a relationship consists of a label, a start node and an end node. Relationships can also have properties, like nodes. Source: http://neo4j.com/product/
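The property graph model described on this slide can be sketched as two small data classes. This is plain Python for illustration only, not the Neo4j API; the class names and the `since` property are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A node with key-value properties."""
    properties: dict = field(default_factory=dict)

@dataclass
class Relationship:
    """A labeled, directed relationship that can carry its own properties."""
    label: str
    start: Node
    end: Node
    properties: dict = field(default_factory=dict)

john = Node({"name": "John", "age": 32})
peter = Node({"name": "Peter", "age": 30})

# The "since" property is an invented example of a relationship property.
knows = Relationship("knows", john, peter, {"since": 2010})
```

Because properties are free-form dictionaries, new keys can be added to individual nodes or relationships without a schema change, which is the flexibility the model is valued for.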
Neo4j (contd 2) Properties: One of the most popular graph databases It is based on property graph. Open source (enterprise edition licensed under AGPL) ACID compliant Java based but has bindings for other languages, e.g., Ruby and Python. Highly scalable, up to several billion nodes and relationships. Flexible schema. Query language: Cypher
FlockDB A distributed and fault tolerant graph database. FlockDB was created by Twitter. Licensed under the Apache License, Version 2.0. Useful for large and shallow graphs. Properties: a high rate of add/update/remove operations; potentially complex set arithmetic queries; paging through query result sets containing millions of entries; ability to "archive" and later restore archived edges; horizontal scaling including replication; online data migration. Source: https://github.com/twitter/flockdb
FlockDB (contd 2) Twitter uses FlockDB to store social graphs (who follows whom, who blocks whom). The major difference between FlockDB and other graph databases like Neo4j is graph traversal. Twitter's model has no need for traversing the social graph. Twitter is only concerned about the direct edges (relationships) on a given node (account). For example, Twitter doesn't want to know who follows a person you follow. Instead, it is only interested in the people you follow. By trimming off graph traversal functions, FlockDB is able to allocate resources elsewhere. (Source: http://readwrite.com/2011/04/20/5-graphdatabases-to-consider)
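A store that only answers questions about direct edges, plus set arithmetic over them, can be sketched as follows. This is a toy model written for this example, not FlockDB's interface; the class, the method names and the account names are all hypothetical.

```python
from collections import defaultdict

class EdgeStore:
    """Toy adjacency-set store: direct edges only, no traversal."""

    def __init__(self):
        self.forward = defaultdict(set)   # (source, kind) -> destinations
        self.backward = defaultdict(set)  # (destination, kind) -> sources

    def add(self, source, kind, destination):
        self.forward[(source, kind)].add(destination)
        self.backward[(destination, kind)].add(source)

    def remove(self, source, kind, destination):
        self.forward[(source, kind)].discard(destination)
        self.backward[(destination, kind)].discard(source)

    def following(self, user):
        return set(self.forward[(user, "follows")])

    def followers(self, user):
        return set(self.backward[(user, "follows")])

    def blocked_by(self, user):
        return set(self.forward[(user, "blocks")])

store = EdgeStore()
store.add("alice", "follows", "carol")
store.add("bob", "follows", "carol")
store.add("dave", "follows", "carol")
store.add("carol", "blocks", "dave")

# Set arithmetic on direct edges:
followed_by_both = store.following("alice") & store.following("bob")
visible_followers = store.followers("carol") - store.blocked_by("carol")
```

Every query touches one or two adjacency sets of a single account, which is why such a design can be sharded by account without supporting multi-hop traversal.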
AllegroGraph A graph database built around the W3C specification for the Resource Description Framework. A proprietary product of Franz Inc. 100% ACID, supporting transactions: commit, rollback, and checkpointing. 100% read concurrency, near-full write concurrency. Dynamic and automatic indexing: all committed triples are always indexed. Advanced text indexing: text indexing per predicate. Solr and MongoDB integration. Supports SPARQL, RDFS++, and Prolog reasoning from numerous client applications. Source: http://franz.com/agraph/allegrograph/
AllegroGraph (contd 2) The company claims Pfizer, Ford, Kodak, NASA and the Department of Defence among its AllegroGraph customers. Source: http://franz.com/agraph/success/
Graph Database and RDF Triple Store
RDF RDF is a decentralized directed labeled graph wherein the arcs start with subject URIs, are labeled with predicate URIs, and end up pointing to object URIs or scalar values.
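As a minimal sketch of this definition, each arc can be held as a (subject, predicate, object) tuple whose subject and predicate are URIs and whose object is either a URI or a scalar value. The `http://example.org/` namespace, the sample triples and the `match` helper are assumptions for illustration; a real triple store would expose this through SPARQL.

```python
EX = "http://example.org/"  # hypothetical namespace

triples = {
    (EX + "John", EX + "knows", EX + "Mary"),   # object is a URI
    (EX + "John", EX + "age", 32),              # object is a scalar value
    (EX + "Mary", EX + "knows", EX + "Peter"),
}

def match(s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return {
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    }

whom_john_knows = {o for (_, _, o) in match(s=EX + "John", p=EX + "knows")}
```

Because every statement has the same three-part shape, data from different sources can be merged by taking the union of their triple sets.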
Graph database and RDF Triple store (similarities) Both graph databases and RDF triple stores are designed to store linked data. Graph databases and RDF triple stores focus on the relationships between the data. A web of nodes and edges can be put together into interesting visualizations, a defining characteristic of graph databases.
Why RDF based graph solution? Simple and uniform data model. Powerful standard query language (SPARQL). Standardized NoSQL solution (built upon W3C Linked Data technology). No vendor or product lock-in (ensures portability and tool chain interoperability). Standardized data interchange (import/export) formats. Inferences on data, e.g., given :human rdfs:subClassOf :mammal. and :man rdfs:subClassOf :human. the RDF database can infer a new triple: :man rdfs:subClassOf :mammal.
Why RDF based graph solution? (contd 2) Future proof: it is far from evident that any solution other than an RDF based one will still be available 20-30 years from now. An RDF based solution is future proof in the sense that it is based on a very basic technology, the URI. Support for globally-addressable row identifiers and property names. Data modeling standards and tooling for creating and publishing schemas. Metastandards for being able to declaratively specify that one piece of information entails another. Inference engines that implement such data transformation rules.
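The inference in this example follows from the transitivity of the subclass relation (written `rdfs:subClassOf` in the RDFS vocabulary). A minimal sketch of that single rule, applied until no new triple appears, is shown here; the function name is hypothetical, and a real RDFS reasoner implements many more rules than this one.

```python
SUBCLASS = "rdfs:subClassOf"

def infer_subclass_closure(triples):
    """Apply 'A subClassOf B, B subClassOf C => A subClassOf C' to a fixpoint."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for (a, p1, b) in list(inferred):
            if p1 != SUBCLASS:
                continue
            for (b2, p2, c) in list(inferred):
                derived = (a, SUBCLASS, c)
                if p2 == SUBCLASS and b2 == b and derived not in inferred:
                    inferred.add(derived)
                    changed = True
    return inferred

asserted = {
    (":human", SUBCLASS, ":mammal"),
    (":man", SUBCLASS, ":human"),
}
new_triples = infer_subclass_closure(asserted) - asserted
```

Here `new_triples` contains exactly the triple stating that `:man` is a subclass of `:mammal`, which was never asserted directly.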
Conclusion NoSQL is inescapable. NoSQL is not an analytical tool, but can play an indispensable role in analytics. There are reasons to use RDF based graph databases.
References
1. Why SPARQL and RDF for graph analytics. http://sparqlcity.com/why-sparql
2. Use cases. http://sparqlcity.com/use-cases
3. Graph Databases, NOSQL and Neo4j. http://www.infoq.com/articles/graph-nosqlneo4j
4. Jing Han, Haihong E, Guan Le and Jian Du (2011). Survey on NoSQL databases. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6106531
5. Max De Marzi. Graph Databases Use Cases. http://www.slideshare.net/maxdemarzi/graph-database-use-cases
6. RDF meets NoSQL (2010). http://decentralyze.com/2010/03/09/rdf-meets-nosql/
7. Katarina Grolinger, Wilson A Higashino, Abhinav Tiwari and Miriam AM Capretz (2013). Journal of Cloud Computing: Advances, Systems and Applications 2013, 2:22. http://www.journalofcloudcomputing.com/content/2/1/22
Thank you very much for your kind attention! Questions?