NoSQL Databases
Institute of Computer Science
Databases and Information Systems (DBIS)
DB 2, WS 2014/2015

Database Landscape
[Figure: database landscape]
Source: H. Lim, Y. Han, and S. Babu, "How to Fit when No One Size Fits," in CIDR, 2013.

Why NoSQL?
- Rise of the Internet (distributed systems, Web 2.0 applications, cloud systems)
  - Applications spanning huge geographic areas
  - Many concurrent users
  - Different data characteristics
- Rise of Big Data
  - The 3Vs of Big Data (according to D. Laney, "3D data management: Controlling data volume, velocity and variety," Appl. Deliv. Strateg. File, vol. 949, 2001):
    - Data velocity: from batch and periodic, through near real time, to real time
    - Data volume: from MB, GB, TB, PB, EB, ...
    - Data variety: from structured (tables, etc.) and semi-structured (JSON, XML, emails, etc.) to unstructured (photos, web, social media, texts, tweets, blogs, audio, etc.)

NoSQL System Characteristics
- Ability to scale horizontally
- Distribution and replication of data over many servers
- Simple interfaces, not necessarily SQL
- Weaker concurrency models than ACID
- Utilization of distributed indexes and memory
- Flexible schemata
- ... and often open source
Source: R. Cattell, "Scalable SQL and NoSQL Data Stores," SIGMOD Record, 39(4), 27-Dec-2010.

CAP Theorem
- The CAP theorem (also known as Brewer's theorem) was stated by Eric Brewer at the Symposium on Principles of Distributed Computing (PODC) in 2000.
- Formal proof by Seth Gilbert and Nancy Lynch in 2002.
- The CAP theorem states that a distributed database can provide at most two of the following three properties at the same time:
  - Consistency: equivalent to having a single up-to-date copy of the data (all requests at the same time retrieve the same value)
  - High Availability of that data (retrieval is always possible as long as at least one server is running)
  - Tolerance to network Partitions (the system keeps functioning even if communication between servers is broken)
- Typically, consistency is traded for a higher level of availability. This is known as BASE (Basically Available, Soft state, Eventually consistent).

CAP Theorem (cont.)
[Figure: CAP triangle with corners C, A, and P. Example systems shown: RDBMS, ATM, DNS, and social media sites (where weak consistency is okay).]

CAP Theorem (cont.)
- Assume a server (single node, no cluster) has performance problems.
  - Solution: add another node to increase performance. Now we have a distributed system.
- A new problem occurs in the two-node cluster: if data is written to both nodes and the nodes are not synchronized, the data becomes inconsistent (the system is still available and partition tolerant).
  - Solution: each node propagates its updates to the other node.
- That requires both nodes to be online all the time. If one node is down, the other cannot function and the system is no longer available (but still consistent and partition tolerant).
  - Solution: updates are stored in a queue, and a node that was offline applies them when it is online again.
- This variant is not partition tolerant (but always consistent and available).
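
The trade-off in the two-node walkthrough can be made concrete with a toy simulation. This is only a sketch under simplifying assumptions: the names `Node`, `Cluster`, and the mode strings "CP" and "AP" are hypothetical, and no real database replicates data this way. It shows what happens to a write while the link between the nodes is cut: one policy rejects the write to stay consistent, the other accepts it, queues the update, and lets the replicas diverge until the link is restored.

```python
class Node:
    """A replica holding a simple key-value map."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    """Two replicas joined by a link that can be cut (a network partition)."""

    def __init__(self, mode):
        # mode "CP": refuse writes that cannot reach both replicas.
        # mode "AP": accept writes locally and queue them for the peer.
        self.mode = mode
        self.a, self.b = Node("a"), Node("b")
        self.partitioned = False
        self.pending = []  # (target node, key, value) updates not yet delivered

    def write(self, node, key, value):
        peer = self.b if node is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                return False  # stay consistent, give up availability
            node.data[key] = value
            self.pending.append((peer, key, value))
            return True  # stay available, replicas diverge for now
        # Link is up: apply the write on both replicas.
        node.data[key] = value
        peer.data[key] = value
        return True

    def heal(self):
        """Restore the link and replay the queued updates in order."""
        self.partitioned = False
        for target, key, value in self.pending:
            target.data[key] = value
        self.pending.clear()


cp = Cluster("CP")
cp.partitioned = True
cp_accepted = cp.write(cp.a, "x", 1)

ap = Cluster("AP")
ap.partitioned = True
ap_accepted = ap.write(ap.a, "x", 1)
diverged = ap.a.data != ap.b.data
ap.heal()
converged = ap.a.data == ap.b.data

print(cp_accepted, ap_accepted, diverged, converged)  # False True True True
```

The "AP" mode corresponds to the queue-based solution: the system keeps answering, and the replicas agree again only after the queued updates have been replayed.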

CAP Theorem Revisited
The "2 of 3" formulation was always misleading because it tended to oversimplify the tensions among properties. Now such nuances matter. CAP prohibits only a tiny part of the design space: perfect availability and consistency in the presence of partitions, which are rare.
Although designers still need to choose between consistency and availability when partitions are present, there is an incredible range of flexibility for handling partitions and recovering from them. The modern CAP goal should be to maximize combinations of consistency and availability that make sense for the specific application. Such an approach incorporates plans for operation during a partition and for recovery afterward, thus helping designers think about CAP beyond its historically perceived limitations.
Source: Eric Brewer, "CAP Twelve Years Later: How the 'Rules' Have Changed," IEEE Computer, vol. 45, no. 2 (2012), pp. 23-29.
Additional reading: Daniel Abadi, "Consistency Tradeoffs in Modern Distributed Database System Design: CAP is Only Part of the Story," IEEE Computer, vol. 45, no. 2 (February 2012), pp. 37-42.

CAP Theorem Revisited (cont.)
"Of the CAP theorem's Consistency, Availability, and Partition Tolerance, Partition Tolerance is mandatory in distributed systems. You cannot not choose it."
  Coda Hale, Yammer software engineer
  http://codahale.com/you-cant-sacrifice-partition-tolerance/
"An important observation is that in larger distributed-scale systems, network partitions are a given; therefore, consistency and availability cannot be achieved at the same time."
  Werner Vogels, Amazon CTO
  http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
"So in reality, there are only two types of systems: CP/CA and AP. I.e., if there is a partition, does the system give up availability or consistency?"
  Daniel Abadi, co-founder of Hadapt, Associate Professor at Yale University
  http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html

NoSQL Databases: A Classification
NoSQL systems are often classified into four categories, ordered from a simple data model with simple queries to a complex data model with complex queries:
- Key-value stores (e.g. Dynamo, Riak)
  - Values are accessed by a key
- Wide column stores (e.g. Bigtable, HBase, Cassandra)
  - Big sparse tables with a lot of columns
- Document stores (e.g. MongoDB, db4o)
  - Documents (e.g. JSON/BSON/XML) are accessed by a key
- Graph databases (e.g. Neo4j, AllegroGraph)
  - Nodes and edges (relationships) are stored

NoSQL Landscape
[Figure: overview of the NoSQL landscape]

Some NoSQL Databases in Detail
- Apache Cassandra (wide column store)
- MongoDB (document store)

Cassandra Characteristics
- Apache project (Apache Cassandra)
- Architecture inspired by Amazon Dynamo and Google Bigtable
- Distributed and decentralized (no master/slave architecture, no single point of failure)
- Good scalability
- High availability and fault tolerance (replication)
- Tunable consistency
- Column-oriented key-value store
- CQL (an SQL-like query language)
- High (write) performance
- Flexible schema (no ETL needed, at least at the ingestion phase)
- Hadoop integration
- Capable of handling Big Data workloads
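
Tunable consistency means that each read or write can choose how many replicas must acknowledge it. A common rule of thumb is that reads are guaranteed to see the latest acknowledged write when the number of replicas read (R) plus the number of replicas written (W) exceeds the replication factor (N). The sketch below only illustrates that arithmetic; the function names are hypothetical and are not part of any Cassandra API.

```python
def quorum(replication_factor):
    """Size of a majority of replicas: floor(N / 2) + 1."""
    return replication_factor // 2 + 1


def read_sees_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set overlaps every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


n = 3
print(quorum(n))                                         # 2
print(read_sees_latest_write(n, quorum(n), quorum(n)))   # True  (2 + 2 > 3)
print(read_sees_latest_write(n, 1, 1))                   # False (1 + 1 <= 3)
print(read_sees_latest_write(n, n, 1))                   # True  (3 + 1 > 3)
```

Lower values of R and W favor latency and availability; higher values favor consistency.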

Cassandra Terms
- Cluster: a group of nodes where you store your data.
- Replication: storing copies of data on multiple nodes to ensure reliability and fault tolerance (the number of copies is set by the replication factor).
- Data center: a (replication) group of related nodes configured together within a cluster for replication purposes. It is not necessarily a physical data center.

Cassandra Architecture
[Figure: Cassandra architecture]

Cassandra Partitioning
- Partitioner: distributes data evenly across the nodes in the cluster for load balancing.
  - Murmur3Partitioner (default): uniformly distributes data based on MurmurHash hash values.
  - RandomPartitioner: uniformly distributes data based on MD5 hash values.
  - ByteOrderedPartitioner: keeps an ordered distribution of data, lexically by key bytes.
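
The idea behind a hash-based partitioner can be sketched in a few lines. The example below hashes row keys with MD5 (in the spirit of the RandomPartitioner, because MD5 is available in the Python standard library) and places them on a ring of node tokens. It is a simplified illustration, not Cassandra's implementation: the class `TokenRing`, the node names, and the choice to derive node tokens from node names are all assumptions. The `replicas` method additionally shows how a replication factor can select the next nodes on the ring.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Hash a key to a 128-bit integer token using MD5."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


class TokenRing:
    """Nodes placed on a hash ring; each key belongs to the next node clockwise."""

    def __init__(self, nodes):
        # Simplification: each node's token is the hash of its name.
        self.ring = sorted((token(name), name) for name in nodes)
        self.tokens = [t for t, _ in self.ring]

    def _start(self, key):
        # Index of the first node whose token is >= the key's token,
        # wrapping around to index 0 past the end of the ring.
        return bisect.bisect_left(self.tokens, token(key)) % len(self.ring)

    def owner(self, key):
        return self.ring[self._start(key)][1]

    def replicas(self, key, replication_factor):
        # The owner plus the following nodes on the ring, without repeats.
        start = self._start(key)
        count = min(replication_factor, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]


ring = TokenRing(["node1", "node2", "node3", "node4"])
owner = ring.owner("db2student")
replicas = ring.replicas("db2student", replication_factor=3)
print(owner, replicas)
```

Because the hash spreads keys uniformly over the token space, each node receives a similar share of the rows, which is the load-balancing property the partitioner is meant to provide. An order-preserving partitioner would instead compare the raw key bytes, which keeps range scans cheap but can create hot spots.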

Cassandra Data Model
- Data is stored in a big, sparse hash table
- Column family (CF)
  - Comparable to a table in an RDBMS
  - A CF contains columns; a set of related columns is identified by a row key
  - Rows in a CF are not required to have the same set of columns
- Keyspace
  - Corresponds to a schema in the relational world
  - All CF objects (tables) live in keyspaces
  - Usually one keyspace per application
  - Replication is controlled on a per-keyspace basis
- The data model is designed around the (expected) queries
- Joining CFs at query time is not supported; there are no foreign keys
- Column values have a timestamp
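
The "big, sparse hash table" view of a column family can be mimicked with nested dictionaries. The sketch below is an in-memory illustration only: the class `ColumnFamily` and its methods are hypothetical, and the sample rows are made up. It shows two points from the list above: rows may carry different columns, and each column value has a timestamp, which is used here to let the newest write win.

```python
import time


class ColumnFamily:
    """Sparse map: row key -> {column name -> (value, timestamp)}."""

    def __init__(self, name):
        self.name = name
        self.rows = {}

    def insert(self, row_key, column, value, timestamp=None):
        ts = time.time_ns() if timestamp is None else timestamp
        row = self.rows.setdefault(row_key, {})
        current = row.get(column)
        # Last write wins: keep the value with the newest timestamp.
        # (Ties go to the later call here, which is a simplification.)
        if current is None or ts >= current[1]:
            row[column] = (value, ts)

    def get(self, row_key):
        """Return the row's columns without their timestamps."""
        return {col: value for col, (value, _) in self.rows.get(row_key, {}).items()}


users = ColumnFamily("usertable")
users.insert("u1", "lastname", "Smith", timestamp=1)
users.insert("u1", "firstname", "Anna", timestamp=1)
users.insert("u2", "lastname", "Jones", timestamp=1)
users.insert("u2", "email", "jones@example.org", timestamp=1)  # column only u2 has
users.insert("u1", "lastname", "Miller", timestamp=5)          # newer, replaces Smith
users.insert("u1", "lastname", "Stale", timestamp=3)           # older, ignored

print(users.get("u1"))  # {'lastname': 'Miller', 'firstname': 'Anna'}
print(users.get("u2"))  # {'lastname': 'Jones', 'email': 'jones@example.org'}
```

Since there are no joins, a query such as "all users with a given last name" would need its own column family keyed by last name, which is why the model is designed around the expected queries.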

Cassandra Data Model (cont.)
- A super column is a way to group multiple columns based on a common lookup value.
- It adds another level of nesting to the regular column family structure.

RDBMS vs. Cassandra Data Model
[Figure: comparison of the RDBMS and Cassandra data models]

RDBMS vs. Cassandra Data Model (cont.)
[Figure: comparison of the RDBMS and Cassandra data models, continued]

Cassandra Query Language (CQL)
CQL command to create a keyspace:
    CREATE KEYSPACE db2_keyspace
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
CQL commands to create a CF:
  Static CF:
    CREATE TABLE usertable (userid TEXT PRIMARY KEY, lastname VARCHAR, firstname VARCHAR);
  Dynamic CF:
    CREATE TABLE usertable (userid TEXT PRIMARY KEY);

Cassandra Overview
[Figure: Cassandra overview]
Source: http://cassandra-php.blogspot.co.uk/

Cassandra Use Cases
Typical Cassandra use cases:
- Geographical distribution
- Write-intensive workloads
- Applications whose queries are well known in advance, at the data model design phase
- Big Data workloads

Cassandra References
- DataStax Cassandra Documentation, http://www.datastax.com/docs (accessed Jan 15, 2015)

MongoDB Characteristics
- Document store (JSON style), flexible data model
- Index support for attributes
- Querying: range queries, search by field
- Map/Reduce support (e.g. aggregation functions)
- Replication
- Open source (GNU AGPL v3.0)
- Good horizontal scalability (due to sharding)
- Easy to understand and learn for application programmers
- Writes are handled only by the master (possible bottleneck)

MongoDB Architecture
- Master/slave architecture
- Writes and reads go to the primary (master) by default
  - Strong consistency, CP system by default
- Reads from secondaries can also be allowed
  - Eventual consistency
- Number of replicas is configurable
- If the master fails, a slave is elected and promoted to master
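
The difference between reading from the primary and reading from a secondary can be illustrated with a toy replica set. This is an in-memory sketch only: the class `ReplicaSet`, its methods, and the member names are hypothetical and have nothing to do with the MongoDB driver API. In particular, `fail_primary` simply promotes the next member as a stand-in for an election, whereas a real election involves voting among the members.

```python
class ReplicaSet:
    """One primary that takes all writes, plus secondaries that lag behind."""

    def __init__(self, members):
        self.members = {name: {} for name in members}
        self.primary = members[0]
        self.unreplicated = []  # writes the secondaries have not applied yet

    def write(self, key, value):
        # All writes go to the primary.
        self.members[self.primary][key] = value
        self.unreplicated.append((key, value))

    def read(self, key, member=None):
        # Default: read from the primary (strongly consistent).
        target = self.primary if member is None else member
        return self.members[target].get(key)

    def replicate(self):
        """Apply the outstanding writes on every secondary."""
        for key, value in self.unreplicated:
            for name, store in self.members.items():
                if name != self.primary:
                    store[key] = value
        self.unreplicated.clear()

    def fail_primary(self):
        """Remove the primary and promote another member (no real election)."""
        del self.members[self.primary]
        self.primary = next(iter(self.members))


rs = ReplicaSet(["m1", "m2", "m3"])
rs.write("age", 25)
fresh = rs.read("age")                   # read from the primary
stale = rs.read("age", member="m2")      # secondary has not caught up yet
rs.replicate()
caught_up = rs.read("age", member="m2")  # secondary is now up to date
rs.fail_primary()
new_primary = rs.primary

print(fresh, stale, caught_up, new_primary)  # 25 None 25 m2
```

Reading from a secondary spreads the read load but may return stale data until replication catches up, which is the eventual consistency mentioned above.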

MongoDB Data Model
[Figure: MongoDB data model]

MongoDB Query Language (Read)
[Figure: MongoDB read query]

MongoDB Query Language (Insert)
[Figure: MongoDB insert operation]

MongoDB Query Language (Update + Delete)
    db.inventory.update( { username: "db2student" }, { $set: { "age": "25" } } )
    db.inventory.remove( { age: "25" } )

MongoDB Use Cases
Typical MongoDB use cases:
- Storing documents (content management)
- Easy (ad hoc) querying of documents and their attributes
- Easy to learn for programmers using object-oriented programming languages

MongoDB References
- MongoDB Documentation, http://docs.mongodb.org/ (accessed Jan 15, 2015)
- MongoDB Architecture Guide, http://info.mongodb.com/rs/mongodb/images/mongodb_architecture_guide.pdf (accessed Jan 15, 2015)