NoSQL Databases. Institute of Computer Science Databases and Information Systems (DBIS) DB 2, WS 2014/2015

Database Landscape
Source: H. Lim, Y. Han, and S. Babu, "How to Fit when No One Size Fits," in CIDR, 2013.
Institute of Computer Science and Mathematics, Databases and Information Systems (DBIS), DB 2, WS 2014/2015

Why NoSQL?
- Rise of the Internet (distributed systems, Web 2.0 applications, cloud systems)
  - Applications spanning huge geographic areas
  - Many concurrent users
  - Different data characteristics
- Rise of Big Data
  - 3Vs of Big Data (according to D. Laney, "3D data management: Controlling data volume, velocity and variety," Appl. Deliv. Strateg. File, vol. 949, 2001):
    - Data Velocity: from batch, periodic, and near real time to real time
    - Data Volume: from MB, GB, TB, PB, EB, ...
    - Data Variety: from structured (tables, etc.) and semi-structured (JSON, XML, emails, etc.) to unstructured (photos, web, social media, texts, tweets, blogs, audio, etc.)

NoSQL System Characteristics
- Ability to scale horizontally
- Distribution and replication of data over many servers
- Simple interfaces, not necessarily SQL
- Weaker concurrency models than ACID
- Utilization of distributed indexes and memory
- Flexible schemata
- ... and often Open Source
Source: R. Cattell, "Scalable SQL and NoSQL Data Stores," SIGMOD Record, 39(4), Dec. 2010.

CAP Theorem
- The CAP theorem (also known as Brewer's theorem) was stated by Eric Brewer at the Symposium on Principles of Distributed Computing (PODC) in 2000.
- Formal proof by Seth Gilbert and Nancy Lynch in 2002.
- The CAP theorem states that a distributed database can provide at most two of the following three properties:
  - Consistency: equivalent to having a single up-to-date copy of the data (all requests at the same time retrieve the same value)
  - High Availability of that data (retrieval of data is always possible as long as at least one server is running)
  - Tolerance to Network Partitions (the system keeps functioning even if communication between nodes is broken)
- Typically, consistency is traded for a higher level of availability. This approach is known as BASE (Basically Available, Soft state, Eventually consistent).

CAP Theorem (cont.)
[Figure: CAP diagram with the corners C, A, and P. Example systems placed in the diagram: RDBMS, ATM, DNS, and social media sites (where weak consistency is okay).]

CAP Theorem (cont.)
- Assume a server (single node, no cluster) has performance problems.
  Solution: Add another node to increase performance. Now we have a distributed system.
- A new problem occurs in our two-node cluster: when data is written to both nodes, the data is not consistent if it is not synchronized (the system is still available and partition tolerant).
  Solution: Each node propagates its updates to the other node.
- That requires both nodes to be online all the time. If one node is down, the other cannot function and the system is no longer available (but it is still consistent and partition tolerant).
  Solution: Offline nodes perform the missed updates (stored in a queue) when they are online again. This is not partition tolerant (but always consistent and available).
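The three steps of this walk-through can be sketched as a toy two-node simulation. This is only an illustration under simplifying assumptions: the names Node, write_unsynchronized, write_synchronous, write_queued, and bring_online are hypothetical and do not belong to any real database.

```python
class Node:
    """One replica holding a key-value map and a queue of missed updates."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.online = True
        self.pending = []  # updates that arrived while the node was offline


def write_unsynchronized(node, key, value):
    # Step 1: a node accepts the write locally and tells nobody else.
    # The cluster stays available, but replicas can diverge.
    node.data[key] = value


def write_synchronous(nodes, key, value):
    # Step 2: a write succeeds only if every replica is reachable.
    # Replicas never diverge, but one offline node blocks all writes.
    if not all(n.online for n in nodes):
        raise RuntimeError("unavailable: a replica is offline")
    for n in nodes:
        n.data[key] = value


def write_queued(nodes, key, value):
    # Step 3: online nodes apply the write, offline nodes get it queued.
    for n in nodes:
        if n.online:
            n.data[key] = value
        else:
            n.pending.append((key, value))


def bring_online(node):
    # Replay the queued updates in arrival order, then empty the queue.
    node.online = True
    for key, value in node.pending:
        node.data[key] = value
    node.pending.clear()
```

Calling write_unsynchronized on two nodes with different values for the same key leaves them disagreeing, write_synchronous raises as soon as one node has online set to False, and write_queued keeps accepting writes while the offline node catches up only after bring_online.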

CAP Theorem Revisited
"The '2 of 3' formulation was always misleading because it tended to oversimplify the tensions among properties. Now such nuances matter. CAP prohibits only a tiny part of the design space: perfect availability and consistency in the presence of partitions, which are rare. Although designers still need to choose between consistency and availability when partitions are present, there is an incredible range of flexibility for handling partitions and recovering from them. The modern CAP goal should be to maximize combinations of consistency and availability that make sense for the specific application. Such an approach incorporates plans for operation during a partition and for recovery afterward, thus helping designers think about CAP beyond its historically perceived limitations."
Source: Eric Brewer, "CAP twelve years later: How the rules have changed," IEEE Computer, Volume 45, Issue 2 (2012), pp. 23-29.
Additional reading: Daniel Abadi (February 2012), "Consistency Tradeoffs in Modern Distributed Database System Design: CAP is Only Part of the Story," IEEE Computer 45(2):37-42.

CAP Theorem Revisited (cont.)
- "Of the CAP theorem's Consistency, Availability, and Partition Tolerance, Partition Tolerance is mandatory in distributed systems. You cannot not choose it."
  Coda Hale, Yammer Software Engineer, http://codahale.com/you-cant-sacrifice-partition-tolerance/
- "An important observation is that in larger distributed-scale systems, network partitions are a given; therefore, consistency and availability cannot be achieved at the same time."
  Werner Vogels, Amazon CTO, http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
- "So in reality, there are only two types of systems: CP/CA and AP. I.e., if there is a partition, does the system give up availability or consistency?"
  Daniel Abadi, Co-founder of Hadapt, Associate Professor at Yale University, http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html

NoSQL Databases: A Classification
NoSQL systems are often classified into four categories, ordered from a simple data model with simple queries to a complex data model with complex queries:
- Key-Value Stores (e.g. Dynamo, Riak)
  - Values are accessed by a key
- Wide Columnar Stores (e.g. Bigtable, HBase, Cassandra)
  - Big sparse tables with a lot of columns
- Document Stores (e.g. MongoDB, DB4O)
  - Documents (e.g. JSON/BSON/XML) are accessed by a key
- Graph Databases (e.g. Neo4j, Allegro)
  - Nodes and edges (relationships) are stored

NoSQL Landscape

Some NoSQL Databases in Detail
- Apache Cassandra (Wide Column Store)
- MongoDB (Document Store)

Cassandra Characteristics
- Apache project (Apache Cassandra)
- Architecture inspired by Amazon Dynamo and Bigtable
- Distributed and decentralized (no master-slave architecture, no SPOF)
- Good scalability
- High availability and fault tolerance (replication)
- Tunable consistency
- Column-oriented key-value store
- CQL (a SQL-like query language)
- High (write) performance
- Flexible schema (no ETL, at least at the ingestion phase)
- Hadoop integration
- Capable of handling Big Data workloads

Cassandra Terms
- Cluster: A group of nodes where you store your data.
- Replication: Storing copies of data on multiple nodes to ensure reliability and fault tolerance (the number of copies is set by the replication factor).
- Data Center: A (replication) group of related nodes configured together within a cluster for replication purposes. It is not necessarily a physical data center.

Cassandra Architecture

Cassandra Partitioning
Partitioner: A partitioner distributes data evenly across the nodes in the cluster for load balancing.
- Murmur3Partitioner (default): uniformly distributes data based on MurmurHash hash values.
- RandomPartitioner: uniformly distributes data based on MD5 hash values.
- ByteOrderedPartitioner: keeps an ordered distribution of data, lexically by key bytes.
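How a hash-based partitioner maps row keys to nodes can be sketched with a small token ring. The sketch hashes keys with MD5, in the spirit of the RandomPartitioner, and stores replicas on the next nodes along the ring. The Ring class, the md5_token helper, and the way node tokens are derived from node names are assumptions made for illustration; this is not Cassandra's actual implementation.

```python
import bisect
import hashlib


def md5_token(key: str) -> int:
    """Map a key to an integer token by interpreting its MD5 digest as a number."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


class Ring:
    """Toy token ring: every node owns the keys whose token is at most its own."""

    def __init__(self, node_names, replication_factor=1):
        # Assumption: a node's token is simply the hash of its name.
        self.tokens = sorted((md5_token(name), name) for name in node_names)
        self.replication_factor = replication_factor

    def replicas(self, row_key):
        """Return the nodes storing row_key: the owner plus the next nodes on the ring."""
        token = md5_token(row_key)
        # First node whose token is >= the key's token; wrap around at the end.
        start = bisect.bisect_left(self.tokens, (token, ""))
        count = min(self.replication_factor, len(self.tokens))
        return [
            self.tokens[(start + i) % len(self.tokens)][1]
            for i in range(count)
        ]
```

For example, Ring(["node1", "node2", "node3"], replication_factor=2).replicas("user42") returns two distinct node names, and it returns the same two for every lookup of that key.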

Cassandra Data Model
- Data is stored in a big sparse hash table
- Column Family (CF)
  - Comparable to a table in an RDBMS
  - A CF contains columns, and a set of related columns is identified by a row key.
  - Rows in a CF are not required to have the same set of columns.
- Keyspace
  - Corresponds to a schema in the relational world
  - All CF objects (tables) are in keyspaces
  - Usually one keyspace per application
  - Replication is controlled on a per-keyspace basis
- The data model is designed based on the (expected) queries
- Joining CFs at query time is not supported; there are no foreign keys
- Column values have a timestamp
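The sparse, timestamped structure described above can be pictured as a nested dictionary. The following sketch is a conceptual model only: the ColumnFamily class and its methods are hypothetical, and resolving concurrent writes by keeping the newest timestamp is a simplified rendering of how Cassandra uses column timestamps.

```python
import time


class ColumnFamily:
    """Toy column family: row key -> {column name -> (value, timestamp)}."""

    def __init__(self):
        self.rows = {}

    def insert(self, row_key, column, value, timestamp=None):
        # Every column value carries a timestamp; default to the current time.
        ts = time.time_ns() if timestamp is None else timestamp
        row = self.rows.setdefault(row_key, {})
        current = row.get(column)
        # Keep the value with the newest timestamp (last write wins).
        if current is None or ts >= current[1]:
            row[column] = (value, ts)

    def get(self, row_key):
        """Return the row as {column: value}; rows may have different columns."""
        row = self.rows.get(row_key, {})
        return {name: value for name, (value, _) in row.items()}
```

Two rows in the same ColumnFamily can hold entirely different columns, and an insert with an older timestamp does not overwrite a newer value.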

Cassandra Data Model (cont.)
- A super column is a way to group multiple columns based on a common lookup value.
- It adds another level of nesting to the regular column family structure.

RDBMS vs. Cassandra Data Model

RDBMS vs. Cassandra Data Model (cont.)

Cassandra Query Language (CQL)
CQL command to create a keyspace:
    CREATE KEYSPACE db2_keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
CQL commands to create a CF:
  Static CF:
    CREATE TABLE usertable (userid TEXT PRIMARY KEY, lastname VARCHAR, firstname VARCHAR);
  Dynamic CF:
    CREATE TABLE usertable (userid TEXT PRIMARY KEY);

Cassandra Overview
Source: http://cassandra-php.blogspot.co.uk/

Cassandra Use Cases
Typical Cassandra use cases:
- Geographical distribution
- Write-intensive workloads
- Applications (and queries) that are well known in advance, during the data model design phase
- Big Data workloads

Cassandra References
- DataStax Cassandra Documentation, http://www.datastax.com/docs (accessed Jan 15, 2015)

MongoDB Characteristics
- Document store (JSON style), flexible data model
- Index support for attributes
- Querying: range queries, search by field
- Map/Reduce support (e.g. aggregation functions)
- Replication
- Open Source (GNU AGPL v3.0)
- Good horizontal scalability (due to sharding)
- Easy to understand and learn for application programmers
- Writes are handled only by the master (possible bottleneck)

MongoDB Architecture
- Master/slave architecture
- Writes and reads go to the primary (master) by default
  - Strong consistency, CP system by default
- It is also possible to allow reading from secondaries
  - Eventual consistency
- The number of replicas is configurable
- If the master fails, a slave is elected and promoted to master
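The effect of reading from the primary versus a secondary can be sketched with a toy replica set. This is a conceptual model under simplifying assumptions, not MongoDB code: the ReplicaSet class is hypothetical, replication is triggered by hand, and the "election" simply promotes the first remaining member.

```python
class ReplicaSet:
    """Toy replica set: one primary takes all writes, secondaries lag behind."""

    def __init__(self, member_names):
        self.data = {name: {} for name in member_names}
        self.primary = member_names[0]
        self.oplog = []  # writes not yet applied to the secondaries

    def write(self, key, value):
        # All writes are handled by the primary.
        self.data[self.primary][key] = value
        self.oplog.append((key, value))

    def replicate(self):
        # Secondaries apply the logged writes some time after the primary.
        for name, store in self.data.items():
            if name != self.primary:
                for key, value in self.oplog:
                    store[key] = value
        self.oplog.clear()

    def read(self, key, member=None):
        # Default: read from the primary. Reading from a secondary may be stale.
        return self.data[member or self.primary].get(key)

    def fail_primary(self):
        # Simplification: writes that were never replicated are lost, and
        # the first remaining member is promoted instead of a real election.
        del self.data[self.primary]
        self.oplog.clear()
        self.primary = next(iter(self.data))
```

After a write, a read from the primary returns the new value immediately, while a read from a secondary returns None until replicate() has run, which is the eventual consistency mentioned above.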

MongoDB Data Model

MongoDB Query Language (Read)

MongoDB Query Language (Insert)

MongoDB Query Language (Update + Delete)
    db.inventory.update( { username: "db2student" }, { $set: { age: "25" } } )
    db.inventory.remove( { age: "25" } )
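What these two commands do to the documents can be emulated on a plain Python list of dictionaries. The helpers matches, update_first, and remove_all are hypothetical and cover only equality queries and $set. They assume the shell defaults of that era, where update changes only the first matching document unless multi is set and remove deletes every matching document.

```python
def matches(doc, query):
    """True if every field in the query equals the document's field."""
    return all(doc.get(field) == value for field, value in query.items())


def update_first(collection, query, set_fields):
    """Emulate update(query, {$set: set_fields}): change the first match only."""
    for doc in collection:
        if matches(doc, query):
            doc.update(set_fields)
            return 1
    return 0


def remove_all(collection, query):
    """Emulate remove(query): delete every matching document in place."""
    kept = [doc for doc in collection if not matches(doc, query)]
    removed = len(collection) - len(kept)
    collection[:] = kept
    return removed


inventory = [
    {"username": "db2student", "age": "20"},
    {"username": "other", "age": "30"},
]
update_first(inventory, {"username": "db2student"}, {"age": "25"})
remove_all(inventory, {"age": "25"})
```

After both calls, only the document of the user "other" is left in inventory.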

MongoDB Use Cases
Typical MongoDB use cases:
- Good for storing documents (content management)
- Easy (ad hoc) querying of documents and their attributes
- Easy to learn for programmers using object-oriented programming languages

MongoDB References
- MongoDB Documentation, http://docs.mongodb.org/ (accessed Jan 15, 2015)
- MongoDB Architecture Guide, http://info.mongodb.com/rs/mongodb/images/mongodb_architecture_guide.pdf (accessed Jan 15, 2015)