NoSQL Databases. Nikos Parlavantzas

Similar documents

Big Data Management in the Clouds. Alexandru Costan IRISA / INSA Rennes (KerData team)

Big Data Management and NoSQL Databases

Cloud Scale Distributed Data Storage. Jürmo Mehine

Structured Data Storage

NoSQL Databases. Institute of Computer Science Databases and Information Systems (DBIS) DB 2, WS 2014/2015

NoSQL systems: introduction and data models. Riccardo Torlone Università Roma Tre

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

Can the Elephants Handle the NoSQL Onslaught?

Introduction to NOSQL

NoSQL Databases. Polyglot Persistence

MongoDB in the NoSQL and SQL world. Horst Rechner Berlin,

Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world

NoSQL Systems for Big Data Management

The NoSQL Ecosystem, Relaxed Consistency, and Snoop Dogg. Adam Marcus MIT CSAIL

Lecture Data Warehouse Systems

MongoDB Developer and Administrator Certification Course Agenda

Preparing Your Data For Cloud

extensible record stores document stores key-value stores Rick Cattel s clustering from Scalable SQL and NoSQL Data Stores SIGMOD Record, 2010

Advanced Data Management Technologies

NoSQL in der Cloud Why? Andreas Hartmann

Domain driven design, NoSQL and multi-model databases

Making Sense ofnosql A GUIDE FOR MANAGERS AND THE REST OF US DAN MCCREARY MANNING ANN KELLY. Shelter Island

Why NoSQL? Your database options in the new non- relational world IBM Cloudant 1

Databases 2 (VU) ( )

Comparing SQL and NOSQL databases

Study and Comparison of Elastic Cloud Databases : Myth or Reality?

NoSQL Database Options

Journal of Cloud Computing: Advances, Systems and Applications

A COMPARATIVE STUDY OF NOSQL DATA STORAGE MODELS FOR BIG DATA

Practical Cassandra. Vitalii

these three NoSQL databases because I wanted to see a the two different sides of the CAP

Introduction to NoSQL and MongoDB. Kathleen Durant Lesson 20 CS 3200 Northeastern University

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Polyglot Persistence. Antonios Giannopoulos Database Administrator at ObjectRocket by Rackspace

NoSQL replacement for SQLite (for Beatstream) Antti-Jussi Kovalainen Seminar OHJ-1860: NoSQL databases

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

Slave. Master. Research Scholar, Bharathiar University

A Selection Method of Database System in Bigdata Environment: A Case Study From Smart Education Service in Korea

Introduction to Apache Cassandra

Big Systems, Big Data

Transactions and ACID in MongoDB

Making Sense of NoSQL Dan McCreary Wednesday, Nov. 13 th 2014

Big Data Analytics. Rasoul Karimi

An Approach to Implement Map Reduce with NoSQL Databases

Big Data Technologies. Prof. Dr. Uta Störl Hochschule Darmstadt Fachbereich Informatik Sommersemester 2015

Big Data Development CASSANDRA NoSQL Training - Workshop. March 13 to am to 5 pm HOTEL DUBAI GRAND DUBAI

Understanding NoSQL Technologies on Windows Azure

Understanding NoSQL on Microsoft Azure

NOSQL DATABASES AND CASSANDRA

The CAP theorem and the design of large scale distributed systems: Part I

NoSQL: Going Beyond Structured Data and RDBMS

Institutionen för datavetenskap Department of Computer and Information Science

Dr. Chuck Cartledge. 15 Oct. 2015

["Sam Stelfox", " ], ["Gabe Koss", " ] }'

So What s the Big Deal?

F1: A Distributed SQL Database That Scales. Presentation by: Alex Degtiar (adegtiar@cmu.edu) /21/2013

Benchmarking Couchbase Server for Interactive Applications. By Alexey Diomin and Kirill Grigorchuk

Data Modeling for Big Data

nosql and Non Relational Databases

Document Oriented Database

Evaluator s Guide. McKnight. Consulting Group. McKnight Consulting Group

Integrating Big Data into the Computing Curricula

Eventually Consistent

MongoDB: document-oriented database

.NET User Group Bern

A survey of big data architectures for handling massive data

Benchmarking and Analysis of NoSQL Technologies

INTRODUCTION TO CASSANDRA

NoSQL and Graph Database

Not Relational Models For The Management of Large Amount of Astronomical Data. Bruno Martino (IASI/CNR), Memmo Federici (IAPS/INAF)

The MongoDB Tutorial Introduction for MySQL Users. Stephane Combaudon April 1st, 2014

NOSQL DATABASE SYSTEMS

Comparison of the Frontier Distributed Database Caching System with NoSQL Databases

NoSQL - What we ve learned with mongodb. Paul Pedersen, Deputy CTO paul@10gen.com DAMA SF December 15, 2011

Challenges for Data Driven Systems

X4-2 Exadata announced (well actually around Jan 1) OEM/Grid control 12c R4 just released

Big Data With Hadoop

Enterprise Operational SQL on Hadoop Trafodion Overview

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Choosing the right NoSQL database for the job: a quality attribute evaluation

High Throughput Computing on P2P Networks. Carlos Pérez Miguel

How To Write A Database Program

Introduction to NoSQL

NoSQL Evaluation. A Use Case Oriented Survey

NoSQL and Hadoop Technologies On Oracle Cloud

Scalable Architecture on Amazon AWS Cloud

Big Data Solutions. Portal Development with MongoDB and Liferay. Solutions

BRAC. Investigating Cloud Data Storage UNIVERSITY SCHOOL OF ENGINEERING. SUPERVISOR: Dr. Mumit Khan DEPARTMENT OF COMPUTER SCIENCE AND ENGEENIRING

Referential Integrity in Cloud NoSQL Databases

NoSQL. What Is NoSQL? Why NoSQL?

MongoDB and Couchbase

On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform

A Review of Column-Oriented Datastores. By: Zach Pratt. Independent Study Dr. Maskarinec Spring 2011

Transcription:

!!!! NoSQL Databases Nikos Parlavantzas

Lecture overview 2 Objective! Present the main concepts necessary for understanding NoSQL databases! Provide an overview of current NoSQL technologies

Outline 3! Main concepts Motivation for NoSQL Data models Distribution models Consistency! NoSQL databases Key-value stores Document databases Column-family stores Graph databases

Relational databases 4 Dominant model for the last 30 years Standard, easy-to-use, powerful query language (SQL) Reliability and consistency in the face of failures and concurrent access! Support for transactions (ACID properties)

Weaknesses 5 Relational databases are not designed to run on multiple nodes (clusters)! Favour vertical scaling! Cannot cope with large volumes of data and operations (e.g., Big Data applications)

Weaknesses 6 Mapping objects to tables is notoriously difficult (impedance mismatch)

NoSQL 7 Definition! Term appeared in 2009! No SQL or Not Only SQL! Practically, anything that deviates from traditional relational database systems (RDBMSs)

NoSQL 8 Common characteristics! Not supporting the relational model! Running well on clusters! Not needing a schema (schema-free)! Typically, relaxing consistency

Advantages 9 Elasticity Big data Automated management Flexible data models

Data models 10

Data models 11 Key-value Document Column family Graph Aggregate oriented Schema free

Aggregate oriented 12 Aggregate: a collection of data with complex structure, manipulated as a unit Unit of data retrieval Unit of atomic update

Example: data model 13

Sample tables 14

Aggregate data model 15

Aggregate orientation 16 Addresses impedance mismatch Helps with running on a cluster! Provides information to the database about which data is manipulated together and thus should live on the same node BUT makes some manipulations difficult! Does not support unanticipated ways of organising data (unlike RDBMSs)

Being schema-free 17 No need to create a schema before storing data into a NoSQL database Facilitates evolution Handles non-uniform data (e.g., records with different fields) BUT knowledge about structure remains in application code (implicit schema)! May be problematic if multiple applications access same database

Distribution models 18

Sharding 19

Sharding 20 Improves both read and write performance Does not provide resilience Distributing data becomes challenging! Ideally, a user interacts with one server, and load is balanced between servers! Aggregate orientation helps! Auto-sharding is provided by some NoSQL databases

Master-slave replication 21

Master-slave replication 22 Read scalability! More slaves " more read requests! Does not help with write load Read resilience! Slaves can still handle read requests after a master failure Possibility of inconsistency! Clients of different slaves may see different values

Peer-to-peer replication 23

Peer-to-peer replication 24 Scalability and resilience! No central element to form bottleneck or single point of failure! Easy to add nodes and to handle node failures Maintaining consistency is challenging! Wide range of options (e.g., coordinating to avoid conflicts, merging inconsistent writes)

Possible combinations 25 Master-slave & sharding Peer-to-peer & sharding

Consistency 26

CAP Theorem 27 Consistency! All nodes see the same data at the same time Availability! Every request receives a response Partition tolerance! The system continues to operate despite arbitrary message loss CAP Theorem! A networked shared-data system can satisfy at most 2 properties at a time

A simple proof 28 Consistency and Availability Client Write data Read data A Replicate B

A simple proof 29 Availability and Partitioning No consistency Client Write data Read old data A B

A simple proof 30 Consistency and Partitioning No availability (waiting ) Client Write data Wait for data A B

Consistency 31 Partitions are inevitable in distributed systems! Must trade-off consistency and availability! Relaxing consistency allows system to remain available under partitions Consistency is also traded off against latency! Keeping replicas consistent requires network interactions! The more nodes are involved, the higher the latency

Client-side consistency 32 Strong consistency! After the update completes, any subsequent access will return the updated value Eventual consistency! If no new updates are made, eventually all accesses will return the last updated value

Server-side consistency 33 N = # replicas of the data W = # replicas that need to confirm a write R = # replicas contacted for a read W+R > N: strong consistency W+R<=N: eventual consistency Can configure N, W, R to tune consistency, availability, read/write performance! W=N, R=1: optimise for read; any failure makes write unavailable! W=1, R=N: optimise for write, any failure makers read unavailable, etc.

Key-value stores 34

Key-value stores 35 Basically, a simple hash table where all access is done via a key Basic operations! Get the value for a key! Put the value for a key! Delete a key and its value Example systems! Redis, Riak, Memcached,

Key-value stores 36 When to use! Storing session Information, user profiles, preferences! Storing shopping cart data When not to use! Need for relationships among data! Transactions involving multiple keys! Querying by data inside value part! Operating on multiple keys at the same time

Riak 37 Open-source, distributed key-value store! First release: 2010! Inspired by Amazon s Dynamo Written in Erlang Decentralised architecture Built-in MapReduce support

Data access 38 Default interface is HTTP! GET (retrieve), PUT (update), POST (create), DELETE (delete) Keys are stored in buckets! namespaces with common properties (e.g., replication factor)! Keys may be user-specified or generated by system Paths:! /riak/<bucket>! /riak/<bucket>/<key>

Examples 39 Store a JSON file into a bucket curl -v -X PUT http://localhost:8091/riak/ animals/ace -H "Content-Type: application/json -d '{"nickname" : "The Wonder Dog", "breed" : "German Shepherd"} Retrieve a value from a bucket curl http://localhost:8091/riak/animals/ 6VZc2o7zKxq2B34kJrm1S0ma3PO

Links 40 Links are metadata that establish oneway relationships between objects Stored in the HTTP Link header curl -X PUT http://localhost:8091/riak/cages/1 \ -H "Content-Type: application/json" \ -H "Link:</riak/animals/polly>;riaktag=\"contains \ -d '{"room" : 101} Used to retrieve data through link walking curl http://localhost:8091/riak/cages/1/ _,contains,_

Advanced queries 41 MapReduce! Clients submit map/reduce functions (in JavaScript or Erlang) and lists of keys. The functions are executed in parallel on nodes where values are located.

Advanced queries 42 Secondary indexes! Tagging objects with metadata and supporting exact match and range queries Riak search! Full-text search and indexing using the objects values

Distribution model 43 Data is automatically distributed around the cluster based on keys (consistent hashing) The number of replicas (N) can be set for each bucket

Key-Value Store Consistency 44 Clients can configure W,R per request to achieve desired levels of consistency, performance, availability Common values for W, R! One, All (i.e., N), Quorum (i.e., N/2+1) N = # replicas of the data W = # replicas that need to confirm a write R = # replicas contacted for a read

Horizontal scaling 45 Can dynamically add and remove nodes; data is redistributed automatically All nodes are equal

Document databases 46

Document Store Document databases 47 Main concept is document! self-describing, hierarchical data structures! JSON, BSON, XML, etc. Similar to key-value stores where the values are examinable Example systems! MongoDB, Couchbase, Terrastore, RavenDB, Lotus Notes, etc.

Document Store Document databases 48 Documents are organised into collections The structure of documents in the same collection can be different! No predefined schema New attributes can be created without the need to define them or to change the existing documents

Document Store Document databases 49 When to use! Event Logging! Content management systems, blogging platforms! Web analytics or real-time analytics! E-Commerce applications When not to use! Need for complex transactions spanning different documents

MongoDB 50 Open-source document database! First released in 2009 Written in C++ JSON documents with dynamic schemas Ad hoc queries Replication, automatic sharding MapReduce support

Data model 51 JSON stored as BSON (binary representation) RDBMS Database Table, View Row Column Index Join Foreign Key Partition MongoDB Database Collection Document (JSON, BSON) Field Index Embedded Document Reference Shard >!db.user.findone({age:39})! {!!!!!!!!!"_id"!:! ObjectId("5114e0bd42 "),!!!!!!!!!"first"!:!"john",!!!!!!!!!"last"!:!"doe",!!!!!!!!!"age"!:!39,!!!!!!!!!"interests"!:![!!!!!!!!!!!!!!!!!"reading",!!!!!!!!!!!!!!!!!"mountain!biking!]!!!!!!!!"favorites":!{!!!!!!!!!!!!!!!!!"color":!"blue",!!!!!!!!!!!!!!!!!"sport":!"soccer"}!! }!

Example operations 52 > db.user.insert({ first: "John", last : "Doe", age: 39 }) > db.user.find () { "_id" : ObjectId("51 "), "first" : "John", "last" : "Doe", "age" : 39 } > db.user.update( {"_id" : ObjectId("51 ")}, { $set: { age: 40, salary: 7000} } ) > db.user.remove({ "first": /^J/ })

Queries 53 Complex queries using ranges, set inclusion, comparison, negation, sorting, counting, etc. Querying nested structured data MapReduce support

Distribution model 54 Master-slave replication! Availability, read performance! Automatic failover If master fails, a new master is elected Automatic sharding! Adding /removing shards! Automatic data balancing using migrations

Distribution model 55

Document Store Consistency 56 The level of guarantee can be configured per operation! Allows setting W (number of servers that need to confirm the update)

Column-family stores 57

Column-family stores 58 Column family: an ordered collection of rows, each of which is an ordered collection of columns

Column-family stores 59 Column families are predefined; columns are not! Can freely add columns! Can have rows with thousands of columns Data for a particular column family is usually accessed together Columns have attached timestamps Example systems! Cassandra, HBase, SimpleDB,

Column-family stores 60 When to use! Event Logging, content management systems, blogging platforms! Data that expires When not to use! Complex queries! Aggregating data (needs MapReduce)! Unknown query patterns

Cassandra 61 Open-source distributed database! Developed in Facebook! Initial release 2008 Written in Java Decentralised architecture MapReduce support CQL (Cassandra Query Language)

Data model 62 The basic unit of storage in Cassandra is a column Each column consists of a column name, a value, and a timestamp

Column-Family Stores Data model 63 A standard column family A super column family

Queries 64 CQL: SQL-like language adapted to the Cassandra data model and architecture! Does not support joins or sub-queries! Enables modifying tables dynamically without blocking updates and queries! Emphasises denormalisation (e.g., supports collection columns) CREATE TABLE users ( user_id text PRIMARY KEY, first_name text, last_name text, emails set<text> );

Distribution model 65 Peer-to-peer replication and automatic sharding! Any user can connect to any node; all writes are partitioned and replicated automatically

Consistency and scalability 66 Tunable consistency on a per-operation basis! Can set N, W, R Can dynamically add and remove nodes

Graph databases 67

Graph databases 68 A graph contains nodes and relationships; both have properties

Graph databases 69 Graph traversal is fast Declarative, domain-specific query languages! Matching patterns of nodes and relationships in the graph and extracting information! e.g., Find all people that work for Big Co. whose friends listen to Rock Music Example systems! Neo4j, Infinite Graph, FlockDB,

Graph databases 70 When to use! Connected data, e.g. social networks! Routing, dispatch and location-based services! Recommendation engines When not to use! Multiple nodes need to be updated! Handling very big graphs (sharding is difficult)

Neo4j 71 Open-source graph database! First release in 2010 Written in Java ACID transactions Clustering for high availability Server mode or embedded (same process) mode

Cypher language 72 Declarative graph query language! ASCII art-like patterns START a=node:user(name='michael') MATCH (c)-[:knows]->(b)-[:knows]->(a), (c)- [:KNOWS]->(a) RETURN b, c

Cypher language 73 Supports SQL-like syntax (e.g., WHERE, ORDER BY, COUNT) START theater=node:venue(name='theatre Royal'), newcastle=node:city(name='newcastle'), bard=node:author(lastname='shakespeare') MATCH (newcastle)<-[:street CITY*1..2]-(theater) <-[:VENUE]-()-[p:PERFORMANCE_OF]->()-[:PRODUCTION_OF]-> (play)<-[:wrote_play]-(bard) RETURN play.title AS play, count(p) AS performance_count ORDER BY performance_count DESC

Indexing 74 Properties of nodes or relationships can be indexed! Useful for finding starting locations for traversals! e.g., theater=node:venue(name='theatre Royal')

Graph Databases Distribution model 75 Master-slave replication! High availability; scalability for reads! Writes to the master are eventually synchronized to slaves No support for sharding

Summary 76 The main characteristics of NoSQL technologies are:! Support for flexible, schema-free data models! Horizontal scalability! Trading some degree of consistency for availability or performance Key-value, document, and column-family databases are suitable when most interactions are done with the same data unit. Graph databases are suitable for data with complex relationships.

References 77 NoSQL distilled: a brief guide to the emerging world of polyglot persistence, Pramod J. Sadalage and Martin Fowler, Pearson Education, 2012. Seven Databases in Seven Weeks, Eric Redmond and Jim Wilson, Pragmatic Bookshelf, O'Reilly, 2012.

Stop following me! 78