Database Revolution: Old SQL, NewSQL, NoSQL Huh?
Michael Bowers, April 9, 2013, v2.9




Abstract

We are in the middle of a database revolution. NoSQL is disrupting the database world by innovating in three ways:
1. New database architectures (software and hardware) that handle the large and ever-growing velocity and volume of data dispersed across geographically distant data centers
2. New graph and document modeling paradigms that compete with the object, relational, and dimensional paradigms
3. Schemaless databases that enable extreme agility in software development and rapid changes to huge data sets

What will you learn?

We are in the middle of a database revolution. NoSQL is disrupting the database world, but which NoSQL solution is best? Which one makes development fastest and easiest? Which one loads data the fastest? Which one scales the best? Which one queries data the best? Which one manages your data the best? Which one is the most mature? Which one is the least expensive? Which one should you use?

Facts, with no spin and no agenda, and lots of thought-provoking information.

About the Author

Michael Bowers, Principal Architect, LDS Church. Author of Pro CSS and HTML Design Patterns (Apress, 2007) and Pro HTML5 and CSS3 Design Patterns (Apress, 2011). cssdesignpatterns.com

Agenda
- What is NoSQL?
- Which NoSQL is best?

Agenda
- What is NoSQL?
  - New Modeling Paradigms
  - New Hardware Paradigms
- Which NoSQL is best?

[Diagram: New Database Paradigms, plotted on two axes: operational velocity (low latency) versus analytical volume (high bandwidth), and less structure (schemaless) versus more structure (schema). A sample operation record (hospital, operation number and type, surgeon, drug table) illustrates the document side. The paradigms and example products:
- Graph: AllegroGraph, Neo4j, InfiniteGraph, OrientDB, DEX, SysTap
- Document (JSON): MarkLogic, MongoDB, CouchDB, Couchbase
- Column / key-value: DynamoDB, Riak, Cassandra, HBase, Hypertable, Oracle NoSQL
- NewSQL (object/SQL): Oracle x10, VoltDB, Clustrix
- Live analytics (SQL): Oracle Exalytics, SAP HANA
- Big data (raw): Hadoop
- Document warehouse (XML): MarkLogic, Solr
- Object: Versant, Objectivity, GemStone
- Relational: Oracle DB, SQL Server, DB2, Sybase, MySQL, PostgreSQL
- Dimensional (data warehouse): Exadata, OLAP, Teradata, Netezza, Sybase, Greenplum, Vertica, ParAccel]

Agenda
- What is NoSQL?
  - New Modeling Paradigms
  - New Hardware Paradigms
- Which NoSQL is best?

What's wrong with SQL DBs?

- Variability: relational modeling is not good at adapting to rapid change, which makes it incompatible with agile development.
- Variety: relational modeling organizes all data into tables, which is a suboptimal representation for graphs, hierarchies, cubes, linked lists, unstructured data, etc.
- Relevance: relational modeling isn't good at relevance ranking, which requires text and data to be stored in document context with links to and from other documents.

Variability: Managing Rapid Change

Schemas are incompatible with rapid change:
- Constantly evolving data structures: can we afford to keep a large application in sync with regular changes to data structures?
- Big data: is the data so large that it takes too long to modify values, structures, and indexes?
- Agile development: are requirements stable enough to create long-lasting relational data structures?

Schemaless data is ideal for rapid change:
- Schemaless data formats and languages: XML, JSON, XQuery, SPARQL, JavaScript
- Defensive programming is required, because you never know what shape of data queries will return (see the sketch below)
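A minimal Python sketch of that defensive style, over a hypothetical operation record; every access must tolerate missing or mistyped fields:

    # Defensive reads over schemaless JSON: any field may be absent,
    # renamed, or of an unexpected type, so the code cannot assume a schema.
    import json

    record = json.loads('{"surgeon": {"name": "Dorothy Oz"}, "drugs": [{"dose": "200"}]}')

    # .get() with defaults instead of direct indexing
    surgeon = record.get("surgeon", {}).get("name", "<unknown>")

    # Coerce types cautiously; schemaless stores often return numbers as strings
    doses = []
    for drug in record.get("drugs", []):
        try:
            doses.append(float(drug.get("dose", 0)))
        except (TypeError, ValueError):
            doses.append(0.0)  # tolerate malformed values rather than crash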

Variety: Handling Data in All Imaginable Forms

The impedance mismatch takes several forms:
- Different data structures: structured, unstructured, semi-structured
- Different data paradigms: relational, dimensional, document, graph, object-oriented, etc.
- Different data types: JSON doesn't have a date/time/duration type, while XML Schema and SQL each have a variety
- Different markup standards: XML, JSON, RDF, etc.
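For example, because JSON has no date type, applications conventionally store ISO 8601 strings and convert at the boundary; a minimal Python sketch:

    # JSON has no date type, so dates round-trip as ISO 8601 strings by convention.
    import json
    from datetime import datetime, date

    doc = {"operation_date": date(2013, 4, 9).isoformat()}   # stored as "2013-04-09"
    text = json.dumps(doc)

    loaded = json.loads(text)
    when = datetime.strptime(loaded["operation_date"], "%Y-%m-%d").date()
    assert when.year == 2013  # the type lives in application code, not in the data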

Relevance: Getting Relevant Data to the Right People at the Right Time

Contextual information = narrative + data + relationships.
- Narrative defines the context surrounding a set of data so that contexts can be compared and deemed more or less relevant to a given query. This requires sets of data to be wrapped in one or more document containers.
- Data items are identified, labeled, and typed. This requires JSON or XML to mark up the data, and metadata to define the meaning and precise domain of each type of data element.
- Structural relationships: structure is identified and leveraged to enable meaningful searches. This requires the structure to be marked up, and metadata to define its meaning.
- Semantic relationships: entities are identified and leveraged to enable meaningful searches. This requires people, places, and things to be marked up, and semantic metadata to define the meaning of each type of entity.

Relevance: Narrative + Data + Relationships = Contextual Information (Semantic & Structural)

[Annotated excerpt of E. F. Codd, "A Relational Model of Data for Large Shared Data Banks," IBM Research Laboratory, San Jose, California; Communications of the ACM, Volume 13, Number 6, June 1970. Key passages:
- "Programs should remain unaffected when the internal representation of data is changed." Tree-structured inadequacies are discussed; relations are discussed and applied to the problems of redundancy and consistency.
- Key words and phrases: data base, data structure, data organization, hierarchies of data, networks of data, relations. CR categories: 3.70, 3.73, 3.75, 4.20, 4.22.
- 1.1 Introduction: this paper is concerned with the application of elementary relation theory to formatted data; the problems are those of data independence and data inconsistency. The relational view appears to be superior in several respects to the graph or network model, and forms a sound basis for treating derivability, redundancy, and consistency.
- 1.2 Data Dependencies in Present Systems: tables represent a major advance toward the goal of data independence. Ordering dependence: programs which take advantage of the stored ordering of a file are likely to fail if that ordering must be replaced by a different one. Indexing dependence: can application programs remain invariant as indices come and go? Access path dependence: many existing formatted data systems provide tree-structured files or slightly more general network models; such programs fail when a change in structure becomes necessary, because they are required to exploit paths to the data and become dependent on the continued existence of those paths.]

How do variability, variety, and relevance map to the five data paradigms?

Five Data Paradigms

SQL DBs currently support two of the five data models:
- Relational model: data-entry apps
- Dimensional model: data-warehouse apps
- Object model: data-processing apps
- Document model: document-delivery apps
- Graph model: relationship apps

[Example document used throughout: Hospital Name: Johns Hopkins; Operation Number: 13; Operation Type: Heart Transplant; Surgeon Name: Dorothy Oz.

    Drug Name   Drug Manufacturer   Dose Size  Dose UOM
    Minicillan  Drugs R Us          200        mg
    Maxicillan  Canada4Less Drugs   400        mg
    Minicillan  Drug USA            150        mg]

Relational Model

Focuses on data; manages data entry. Data is freed from context: data reuse, durability, integrity, independence, unity. Formal data structures and truth statements; truth queries through the dynamic, flexible SQL language.

[ER diagram: Hospital (Hospital Key, Hospital Name), Surgeon (Surgeon Key, Surgeon First Name, Surgeon Last Name), and Operation Codes (Operation Key, Operation Type) relate to Operation (Hospital Key, Surgeon Key, Operation Key, Operation Date): an operation is performed at a hospital, performed by a surgeon, and classified by an operation code.]

- Poor variability: relationships are not dynamic
- Poor variety: impedance mismatch with OO; not designed for unstructured data
- Poor relevance: queries cannot be sorted by relevance because rows have no document context

Relational Model: use for maximum flexibility in querying operational data. Example: traditional data-entry apps.

Dimensional Model

Focuses on information; manages information retrieval. Information is shown in context: information reuse, durability, integrity, independence, unity. Authoritative source, authoritative view, defined business questions. Ad hoc self-service queries through a fast and flexible semantic layer.

- Poor variability: unexpected changes break ETL jobs
- OK variety: data can be extracted and loaded from any data source
- Good relevance: queries return the most relevant data because the information context is carefully defined

Dimensional Model: use to transform authoritative data into contextual information that enables self-service, ad hoc, flexible reporting. Example: traditional BI.

Object Model

Focuses on process; manages data processing. Data is processed in context: process reuse, durability, integrity, independence, unity. Formal processes through methods; flexible object queries through JPQL.

- Best variability: agile development through object schema evolution, object links, embedded objects, and class hierarchies; JPA combined with an OO database provides query flexibility
- OK variety: no impedance mismatch, since OO data is stored directly; fast development because there is no language/model mismatch
- Poor relevance: the object-oriented model is not designed for unstructured data; information hiding prevents querying text in context; objects cannot freely intermingle with text

Object Model: use to manage complex processes and to speed up and simplify application development. Example: traditional data-processing apps.

Document Model

Focuses on content; manages documents. Content is processed in context: document reuse, durability, integrity, independence, unity. Informal document structure (hierarchical and semantic); see the example operation record above. Flexible and accurate document processing through queries (recall); relevance queries through composable, customizable search (precision).

- Best variability: content is dynamic and drives all processing; fast development because there is very little impedance mismatch; requires defensive programming to handle variant data
- Good variety: new data and relationships can be discovered and marked up
- Good relevance: all content is in context, so relevance ranking is possible

XML vs. JSON

    <section>
      <heading>Data Models</heading>
      <paragraph>This paper is ...</paragraph>
      <paragraph>The <i>relational</i> model is no longer,<br/>
        the only game in town.</paragraph>
    </section>

    { "section": {
        "heading": "Data Models",
        "paragraph": [
          "This paper is ...",
          [ {"s": "The "}, {"i": "relational"}, {"s": " model is no longer,"},
            {"br": null}, {"s": " the only game in town."} ]
        ]
    } }

XML is more expressive, and more complex, than JSON:
1. It allows text to be freely interspersed among elements
2. It allows every element to be qualified by a namespace
3. It allows elements to be precisely typed (XML Schema)
4. It is harder to read and parse, and its markup consumes more space
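A small Python sketch makes the expressiveness gap concrete: XML keeps mixed text and markup in document order natively, while the JSON encoding needs the explicit "s" wrappers to imitate it (standard library only; the sample strings are from the slide above):

    import xml.etree.ElementTree as ET

    xml_text = ("<section><heading>Data Models</heading>"
                "<paragraph>The <i>relational</i> model is no longer,"
                "<br/> the only game in town.</paragraph></section>")

    para = ET.fromstring(xml_text).find("paragraph")
    print(para.text)              # "The "  -- mixed content is native to XML
    print(para.find("i").text)    # "relational"
    print(para.find("i").tail)    # " model is no longer,"

    # The JSON form must encode text runs explicitly to preserve their order:
    runs = [{"s": "The "}, {"i": "relational"}, {"s": " model is no longer,"}]
    print("".join(next(iter(r.values())) or "" for r in runs))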

Document Model: use XML documents to enable search relevance; use JSON documents to enable agile development. Example: new browser-centric apps.

Graph Model

Focuses on relationships; manages relationships. Relationships create context: relationship reuse, durability, integrity, independence, unity. Informal relationship structure; relationship queries; flexible queries through pattern matching.

[Diagram: a Surgeon works at a Hospital, likes and is a friend of a Person, performed an Operation, and has poor success at a Hospital; the Operation was performed on a Person at a Hospital; edges labeled likes, works at, friend of, performed, on, at, fails at.]

- Good variability: everything is dynamic, giving maximum flexibility
- Good variety: fast development because there is very little impedance mismatch, though defensive programming is required to handle variant data
- Good relevance: relationship patterns create richer relevance than text alone, though the original text and context are lost

Caveat: data is so granular it is difficult to comprehend and manage.
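A toy Python sketch of the pattern matching described above, over a hypothetical in-memory edge list (real graph databases expose declarative pattern languages for this):

    # Find a two-hop relationship pattern: a friend was operated on at a
    # hospital where this surgeon has poor success.
    edges = [
        ("dorothy", "friend_of", "alice"),
        ("dorothy", "poor_success_at", "st_lukes"),
        ("alice",   "operated_on_at", "st_lukes"),
    ]

    def match(subj, pred):
        """Yield objects o such that (subj, pred, o) is in the graph."""
        return (o for s, p, o in edges if s == subj and p == pred)

    for friend in match("dorothy", "friend_of"):
        for hospital in match(friend, "operated_on_at"):
            if hospital in match("dorothy", "poor_success_at"):
                print(f"dorothy has poor success at {hospital}, where {friend} was operated on")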

Graph Model: use to enhance text search with relationships and relationship patterns. Examples: genetic research and map apps.

Modeling Takeaways

Each model has a specialized purpose:

    Model        Optimal Use
    Dimensional  Self-service ad hoc reporting
    Relational   Flexible queries & joins
    Object       Processes & agile programming
    Document     Search & transformations
    Graph        Relationship relevance & patterns

Use the model that best fits your needs; use multiple models in the same app; glue it all together.

What NoSQL model is best for developing apps? The document model:
- Increases developer productivity by 2x over relational, if you don't need to compensate for lack of consistency and data integration
- Supports agile development without a schema: handles rapidly changing requirements and deeply hierarchical, complex, highly variable data structures
- Has little to no impedance mismatch between application and database; JSON is the new lingua franca of the web
- Has the potential to enable better search relevance: full-text search in the context of document structure, and full-featured queries of any data anywhere in a document

What does an app need from a DB?
- Insert, update, and delete data
- Query data
- Search data
- Bulk-process data
- Do it all consistently

Which document NoSQL DBs are best for developer productivity?

                                          MarkLogic  MongoDB  Riak   Cloudant/CouchDB  Cassandra
    Low barrier to entry                  Great      Great    Great  Great             Poor
    Easy to insert, update, delete
      individual documents                Great      Great    Great  Great             Great
    Easy to retrieve one document         Great      Great    Great  Great             Great
    Query to find multiple docs           Great      Good     Poor   Good              Poor
    Search within documents               Great      Good     Good   Good              Poor
    Search relevance                      Great      Poor     Good   Poor              None
    Advanced search (facets, geospatial,
      entity enrichment)                  Great      None     None   None              None
    Graph capabilities                    Poor       None     Poor   None              None
    Doc bulk processing                   Good       Great    Great  Great             None
    Data integration capabilities         Great      Poor     Poor   Poor              Poor
    Data consistency                      Great      Poor     Poor   Poor              Poor

Agenda
- What is NoSQL?
  - New Modeling Paradigms
  - New Hardware Paradigms
- Which NoSQL is best?

[The New Database Paradigms diagram is repeated here; see above.]

Do we need to optimize databases differently for velocity and volume?

Storage Cost/Performance Determines Physical Architecture

    Optimization  Technology   Persistent  Optimizes for
    Volume        Hard drives  Yes         Inexpensive, high capacity and bandwidth
    Velocity      RAM          No          Low latency, high IOPs and bandwidth

    Technology                   GB/Blade  Latency (µs)  Bandwidth (MB/s)  IOPs (1000/s)  Cost/GB  Cost/MB/s  Cost/IOPs  Cost/Blade
    RAID 0 w/2 HDDs              1,200     4,850         288               2.4            1.25     5.21       0.63       1.5K
    Flash SSDs                   800       123           1,308             29             31.25    19.12      0.86       25.0K
    Flash PCIe SLC               320       26            1,200             238            53.13    14.17      0.07       17.0K
    RAM DDR3 1333 DR (12x16GB)   192       0.03          5,333             666,500        65.10    2.34       0.00       12.5K
    RAM DDR3 1333 SR (10x4GB)    40        0.02          10,666            1,333,000      33.43    0.13       0.00       1.3K
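The cost columns are derived from the blade figures: cost per GB is cost per blade divided by GB per blade, and likewise for bandwidth. A quick Python check against a few rows of the table:

    # Recompute the derived cost columns from the raw blade numbers above.
    blades = {
        "RAID 0 w/2 HDDs":  {"gb": 1200, "mbps": 288,  "cost": 1_500},
        "Flash SSDs":       {"gb": 800,  "mbps": 1308, "cost": 25_000},
        "RAM DDR3 1333 DR": {"gb": 192,  "mbps": 5333, "cost": 12_500},
    }
    for name, b in blades.items():
        print(f"{name}: ${b['cost']/b['gb']:.2f}/GB, ${b['cost']/b['mbps']:.2f} per MB/s")
    # RAID 0 w/2 HDDs: $1.25/GB, $5.21 per MB/s  -- matches the table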

[Diagram: Hardware Architecture. Velocity-optimized (OLTP): an MPP RAM DB exchanging JSON/XML documents via synchronous or asynchronous transactions. Volume-optimized (warehouse): an MPP disk DB, likewise fed synchronously or asynchronously. The example operation record flows through both.]

What's wrong with SQL DBs?

- Velocity: SQL DBs are serialized to ensure consistency and use high-latency disk, and this prevents them from scaling horizontally
- Volume: because SQL DBs share cores, caches, and storage, they are serialized across these resources, and this prevents them from scaling horizontally

Problem: Serialized DB Design

DBs use serialization and synchronization for consistency:
- Logging data
- Loading data from disk to RAM
- Locking rows of data
- Latching buffered data

(Michael Stonebraker at the 2011 NoSQL Conference, San Jose)

Cost of Synchronization

Single-threaded operations are 746 times faster than multiple threads with locks:

    Operation                          Time (ns)
    Single thread                      300
    Single thread with memory barrier  4,700
    Single thread with CAS             5,700
    Single thread with lock            10,000
    Two threads with CAS               30,000
    Two threads with lock              224,000

(Slide from James Gates' SORT presentation, "When Performance Really Matters")
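The exact numbers are hardware-dependent, but the shape of the result is easy to reproduce. A rough Python sketch timing uncontended lock acquire/release on a hot counter (contended locks across real threads are far worse, as the figures above show):

    # Rough, hardware-dependent illustration of lock overhead on a hot counter.
    import threading, time

    N = 1_000_000

    start = time.perf_counter()
    counter = 0
    for _ in range(N):
        counter += 1
    plain = time.perf_counter() - start

    start = time.perf_counter()
    counter = 0
    lock = threading.Lock()
    for _ in range(N):
        with lock:           # acquire/release cost paid on every increment
            counter += 1
    locked = time.perf_counter() - start

    print(f"plain: {plain:.3f}s, locked: {locked:.3f}s, ratio: {locked/plain:.1f}x")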

The Future is More Cores, Not Faster Clocks

[Chart, 1970-2010, log scale: transistor count (thousands), clock speed (MHz), power (W), and performance per clock (ILP). Transistor counts keep growing exponentially, while clock speed is flat (a faster clock needs more power), power is flat (power is expensive and hot), and ILP is flat.]

Hardware Takeaways: Velocity and Volume

- Future architectures will scale horizontally (MPP)
- They will mix synchronous and asynchronous transactions
- RAM is the lowest-cost option for velocity (low latency)
- Disk is the lowest-cost option for volume (high bandwidth)
- MPP functional processing is needed to glue it all together (input, function, output)

Which document NoSQL DBs perform best for apps?

    App Performance                          MarkLogic  MongoDB  Riak   Cloudant/CouchDB  Cassandra
    Insert velocity                          Good       Good     Great  Good              Great
    Volume scalability                       Great      Good     Great  Good              Great
    Fast retrieval of one doc                Great      Great    Great  Great             Great
    Fast query of docs (secondary index)     Great      Great    Good   Good              Good
    Fast query of docs (no secondary index)  Great      Good     Poor   Good              None
    Fast search within documents             Great      Good     Good   Good              Poor
    Fast search relevance                    Great      Poor     Poor   Poor              None
    Fast aggregation                         Great      Good     Poor   Good              Poor
    Fast map/reduce                          Poor       Good     Great  Great             None
    Hadoop integration                       Great      Great    Great  Great             Great

Agenda
- What is NoSQL?
- Which NoSQL is best?
  - Consistency Model
  - Enterprise Readiness

What's wrong with NoSQL?

In most NoSQL solutions, the developer is responsible for ensuring consistency. Imagine programming an app to coordinate thousands of concurrent threads across gigabytes of data structures. Imagine writing code to handle all the threading issues: locks, contention, serialization, deadlocks, race conditions, threading bugs, etc.

ACID vs. BASE

ACID transactions between two nodes (within or between clusters) require a two-phase commit, which tightly couples them and reduces availability:
1. The transaction coordinator has each node pre-commit the transaction and indicate whether the commit is possible
2. If both nodes agree that a commit is possible, the transaction coordinator asks both nodes to apply the commit
3. If either or both nodes veto the commit, the transaction coordinator asks both nodes to roll back

BASE transactions between two nodes (within or between clusters) are decoupled, eventually consistent, reliable, and more highly available:
1. Use an asynchronous, guaranteed-delivery, ordered message queue
2. Add a log table to the target database to track successful execution of queue messages
3. Insert into the log table when a message is successfully executed in the target database
4. Dequeue messages only after the log confirms they have been executed

(A sketch of this recipe follows.)
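A minimal sketch of the BASE recipe (steps 1-4 above) in Python, using sqlite3 for the target database and an in-memory deque standing in for the guaranteed-delivery queue; the table and message names are hypothetical:

    # BASE-style processing: dequeue only after the target DB's log table
    # confirms the message was applied, so redelivery is safe (idempotent).
    import sqlite3
    from collections import deque

    target = sqlite3.connect(":memory:")
    target.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
    target.execute("CREATE TABLE msg_log (msg_id TEXT PRIMARY KEY)")  # step 2
    target.execute("INSERT INTO accounts VALUES ('a1', 100)")

    queue = deque([("m1", "a1", -30), ("m1", "a1", -30)])  # duplicate delivery

    while queue:
        msg_id, account, delta = queue[0]
        seen = target.execute("SELECT 1 FROM msg_log WHERE msg_id=?", (msg_id,)).fetchone()
        if not seen:
            # apply the change and record it in the log in one local transaction (step 3)
            with target:
                target.execute("UPDATE accounts SET balance = balance + ? WHERE id=?",
                               (delta, account))
                target.execute("INSERT INTO msg_log VALUES (?)", (msg_id,))
        queue.popleft()  # dequeue only after the log confirms execution (step 4)

    print(target.execute("SELECT balance FROM accounts").fetchone())  # (70,), not (40,)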

What are ACID transactions?

Databases use the ACID model to make it easy, reliable, and fast for hundreds of concurrent processes to query and modify the same data consistently. (Contrast this with manually programming hundreds of threads to do the same.)
- Atomic: all parts of a transaction succeed, or all fail and roll back
- Consistent: all committed data must be consistent with all data rules, including constraints, triggers, cascades, atomicity, isolation, and durability
- Isolated: no transaction can interfere with other concurrent transactions
- Durable: once a transaction is committed, its data will survive system failures and can be recovered after an unwanted deletion
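The classic illustration is transferring money between two accounts: with ACID, the debit and the credit commit or roll back together. A minimal sketch using Python's sqlite3 (any ACID database behaves the same way; the account names are hypothetical):

    # Atomic transfer: the debit and credit succeed or fail together.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY,"
               " balance INTEGER CHECK (balance >= 0))")
    db.executemany("INSERT INTO accounts VALUES (?, ?)",
                   [("checking", 50), ("savings", 0)])
    db.commit()

    try:
        with db:  # one transaction: commits on success, rolls back on any exception
            db.execute("UPDATE accounts SET balance = balance - 80 WHERE id='checking'")
            db.execute("UPDATE accounts SET balance = balance + 80 WHERE id='savings'")
    except sqlite3.IntegrityError:
        pass  # CHECK rule fired (consistency); the debit was rolled back (atomicity)

    print(db.execute("SELECT * FROM accounts ORDER BY id").fetchall())
    # [('checking', 50), ('savings', 0)] -- no partial transfer survived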

What is Durability?

Durable data survives system failures. Once a transaction has committed, its data is guaranteed to survive system failures:
- Failures may occur in the server, operating system, disk, or database, caused by server crashes, full disks, corrupted disks, power outages, etc.
- Durability requires storage to be redundant
- Durability requires logs to replay asynchronously written data
- Durability requires logs to be archived to another location so they can be recovered
- Durability works with atomicity to ensure that partially written data is not durable
- Without durability you can have faster inserts, updates, and deletes, because there are no logs to write and data can sit in volatile memory while being lazily written to disk

Durable data can be recovered after unwanted deletion:
- Durability allows data to be recovered to a point before the system failed, or before applications or users inappropriately destroyed or modified it
- Durability requires backups, and backups need to be able to recover data to a point in time
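The logging requirement above reduces to one rule: force the log record to stable media before acknowledging the commit. A minimal write-ahead-log sketch in Python (real WALs add checksums, sequencing, and archiving; the file name and record format are hypothetical):

    # Minimal write-ahead log: a commit is acknowledged only after the log
    # record has been flushed and fsync'ed to stable storage.
    import json, os

    def commit(log_path, txn):
        with open(log_path, "a") as log:
            log.write(json.dumps(txn) + "\n")
            log.flush()
            os.fsync(log.fileno())   # without this, a crash can lose "committed" data
        # only now is it safe to tell the client the transaction is durable

    commit("txn.log", {"op": "update", "doc": "operation-13",
                       "set": {"surgeon": "Dorothy Oz"}})

    # On restart, replaying the log reconstructs writes that never reached the data files:
    with open("txn.log") as log:
        for line in log:
            replay = json.loads(line)  # re-apply to the data store (not shown)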

Do you need Durability?

- Can you live with writing advanced code to compensate? Are you willing to trust all developers to properly check for partial transaction failures, detect the current physical layout of the database cluster, and write extra code to propagate data durably across the cluster?
- Can you live with lost data? Can you live with data lost because the database doesn't use logs, doesn't archive its logs, doesn't use mirrored disks, and doesn't require data to be written to multiple servers?
- Can you live with accidental deletion of data? Can you live without a point-in-time recovery feature that can restore data to immediately before an accidental deletion?
- Can you live with scripting your own backup and recovery solution? Are you willing to develop custom scripts to back up and restore data files and database configurations?

Durability Support

    Durability: data survival       MarkLogic  MongoDB  Riak  Cloudant/CouchDB  Cassandra
    Durable on commit               Yes        No       No    Yes               No
    Recoverable to a point in time  Yes        No       No    Yes               No
    Recoverable from backup         Yes        No       No    No                No

NOTES: MongoDB's architecture is fundamentally not durable. By default it is not durable; configuring MongoDB to be durable is complicated and error-prone, and making it durable severely hurts performance. Even when properly configured to be durable, its in-memory design allows data to be committed without being persisted.

To ensure durability in Riak and Cassandra, each transaction must commit across two or more nodes. If only one node reports success, your transaction is not durable, and if that node fails, your data is lost. When this happens in Riak, you can immediately read from the desired nodes to repair the failed writes; in Cassandra, you have to manually retry your inserts. In addition, in Riak and Cassandra the document distribution algorithm does not guarantee that documents will go to different physical nodes, which means there is no guaranteed durability.

What is Atomicity?

An atomic transaction is all or nothing. Atomicity means all parts of a transaction succeed or none do; this requires partially written data to be automatically removed from disk.

A single command, or a set of commands, is called a transaction. A single command from a programmer often represents multiple commands to the database, because the database needs to replicate data to multiple disks, update indexes, execute triggers, verify constraints, cascade deletes and updates, etc. Without atomicity you can have faster transactions, because they don't need a two-phase commit.

Sets of data: an operation may need to process multiple data items, and all data in the set must be changed or none, or the data becomes arbitrarily inconsistent. For example, suppose you delete part of a set of data and the transaction fails partway through: if the changes are not automatically rolled back, the data is left in an inconsistent state that can affect the results of other transactions.

Sets of commands: a series of commands often needs to work as a unit to produce accurate results. All commands need to succeed or all need to fail; otherwise data becomes arbitrarily inconsistent. Inconsistent data is hard to fix: data may contradict other data, and there may be extra or missing data. Without atomicity to roll back failures, there may be no way to fix it. The classic example is debiting one account and crediting another as a single transaction: if one command succeeds and the other fails, the account totals are inaccurate (see the sqlite3 sketch above).

Do you need Atomicity?

- Can you live with modifying only a single document at a time? Without atomicity, you can't guarantee results when working with multiple documents at a time.
- Can you live with partially successful transactions? Without atomicity, you can have higher availability because transactions can partially succeed. For example, rather than losing a sale when an app fails to capture a credit card number, you can save what you captured and later contact the customer for the rest.
- Can you live with inconsistent and incomplete data? Without atomicity, sets of data or sets of commands may fail before being processed completely, leaving data inconsistent or incomplete. Is it OK not to know whether data anomalies are caused by bugs in your code, or are temporarily inconsistent because they haven't yet been synchronized?
- Can you live with writing advanced code to compensate? Are you willing to develop custom solutions for atomic rollback, such as transaction logs, version numbers, and two-phase commits? To program defensively to handle transactions that fail? To develop processes that find and fix inconsistent data?

Atomicity Support

    Atomicity: all-or-nothing commit    MarkLogic  MongoDB  Riak  Cloudant/CouchDB  Cassandra
    For one item plus indexes           Yes        Yes      No    Yes               No
    For one item plus validation rules  Yes        No       No    Yes               No
    Across a set of data                Yes        No       No    No                No
    Across a set of commands            Yes        No       No    No                No
    Across a set of nodes               Yes        No       No    No                No

NOTES: In Riak and Cassandra, to ensure durability each transaction must commit across two or more nodes. If only one node reports success, your transaction is not durable, yet these DBs do not roll back the transaction. Thus, if you want durability, they cannot provide atomicity even for an insert, update, or delete of a single document. To make matters worse, in Riak and Cassandra, if you need to index data in multiple ways you have to insert the same document multiple times with different keys, and some of those inserts may succeed while others do not. Thus, Riak and Cassandra cannot provide any level of atomicity.

What is Isolation?

Isolation prevents concurrent transactions from affecting each other. Read isolation ensures queries return accurate results as of a point in time; write isolation locks data to prevent race conditions during updates, deletes, and inserts. Without isolation, queries and transactions run faster, because the database doesn't have to provide a consistent view using locks, snapshots, or system version numbers.

Sets of data: an operation can only produce accurate results as of a point in time. It takes time for a command to process a set of data, and during this time concurrent transactions may insert, update, and delete data in the set. Without isolation, a single command executes against data while it is being changed by other concurrent transactions: records may be added after the command started running, and records may be deleted or changed after the command has processed them. This creates inconsistent results, and aggregate functions produce wrong answers.

Sets of commands: a series of commands needs to work on a consistent view of the data to produce accurate results. Without isolation, each command in a series will execute against arbitrarily different data.
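Where a store offers only single-item write isolation, applications typically compensate with optimistic versioning: read a version, write only if it is unchanged, retry otherwise. A minimal Python sketch (the lock merely stands in for the store's atomic compare-and-set; this is not how a real MVCC engine is built):

    # Optimistic concurrency control: updates carry the version they read;
    # a stale version means another writer got there first, so retry.
    import threading

    doc = {"version": 0, "count": 0}
    guard = threading.Lock()  # stands in for the store's atomic compare-and-set

    def cas_update(read_version, new_value):
        with guard:
            if doc["version"] != read_version:
                return False            # lost the race: someone committed first
            doc["count"] = new_value
            doc["version"] += 1
            return True

    def increment():
        while True:                     # retry loop hides the race from callers
            v, c = doc["version"], doc["count"]
            if cas_update(v, c + 1):
                return

    threads = [threading.Thread(target=increment) for _ in range(100)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(doc["count"])  # 100 -- no lost updates despite concurrent writers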

Do you need Isolation?

- Can you live with never processing more than a single document? Without isolation, you can't guarantee results if you work with multiple documents at a time.
- Can you live with inaccurate queries? Without isolation, query results are inaccurate because concurrent transactions change data while it is being processed.
- Can you live with race conditions and deadlocks? Without isolation, transactions can create race conditions and deadlocks.
- Can you live with writing advanced code to compensate? Are you willing to implement your own versioning system (like MVCC) that provides a reliable system change number spanning multiple data centers? To write code that hides concurrent updates, inserts, and deletes from queries? To write code that handles update conflicts, race conditions, and deadlocks?

Read Isolation Support

    Isolation: read non-interference  MarkLogic  MongoDB  Riak  Cloudant/CouchDB  Cassandra
    MVCC                              Yes        No       No    Yes               No
    OLTP query:
      for one item                    Yes        Yes      Yes   Yes               Yes
      for multiple items              Yes        No       No    Yes               No
      for multiple commands           Yes        No       No    No                No
    Batch queries (map/reduce):
      for multiple items              Yes        No       Yes   Yes               No
      for multiple commands           Yes        No       No    No                No

Write Isolation Support

    Isolation: write non-interference     MarkLogic  MongoDB  Riak  Cloudant/CouchDB  Cassandra
    Update:
      for one item (optimistic, no lock)  Yes        Yes      Yes   Yes               Yes
      for one item (pessimistic lock)     Yes        No       No    No                No
      for multiple items                  Yes        No       No    Yes               No
      for multiple commands               Yes        No       No    No                No
    Delete:
      for one item                        Yes        Yes      Yes   Yes               Yes
      for multiple items                  Yes        No       No    Yes               No
      for multiple commands               Yes        No       No    No                No
    Insert:
      for one item                        Yes        Yes      Yes   Yes               Yes
      for multiple items                  Yes        No       No    Yes               No
      for multiple commands               Yes        No       No    No                No

What is Consistency?

All committed data must be consistent with all data rules: constraints, triggers, cascades, atomicity, isolation, and durability. Data must always be in a consistent state at any point in time. Consistency is the product of atomicity, isolation, and durability:
- Atomicity ensures that if data rules such as constraints and triggers are violated, the transaction fails and all changes are rolled back
- Isolation ensures a query sees a consistent set of data while other concurrent commands are modifying the underlying data, and that bulk updates lock sets of data so they can be processed as a consistent unit
- Durability ensures that data is consistently replicated to other nodes in a cluster, so the loss of a node won't cause a loss of data

Do you need Consistency?

Not necessarily. Instead, you may prefer:
- The absolute fastest performance at the lowest hardware cost
- The highest global data availability at the lowest hardware cost
- Working with one document at a time
- Writing advanced code to create your own consistency model
- Eventually consistent data
- Some inconsistent data that can't be reconciled
- Some missing data that can't be recovered
- Some inconsistent query results

Consistency Support

    Consistency: results are accurate and
    repeatable at a point in time          MarkLogic  MongoDB  Riak  Cloudant/CouchDB  Cassandra
    True consistency                       Yes        No       No    No                No
    Eventual consistency                   Yes        Yes      Yes   Yes               Yes
    Multi-datacenter clusters              Yes        Yes      Yes   Yes               Yes
    Multi-master replication               No         Yes      Yes   Yes               Yes

NOTES: Eventual consistency is an oxymoron: it means your data and queries may or may not be consistent, which is fundamentally inconsistent. Multi-master replication in all NoSQL databases requires the application to handle conflict resolution, which is a non-trivial problem to solve. MarkLogic does not support multi-master replication because it is fundamentally inconsistent and thus incompatible with the ACID consistency model.
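What "the application handles conflict resolution" means in practice: when two masters accept different writes to the same document, the application must pick a winner or merge. A last-write-wins sketch in Python (the field names are hypothetical; as the notes warn, this approach silently discards data):

    # Two replicas accepted conflicting writes; the application must resolve them.
    conflict = [
        {"_rev": "2-a", "phone": "555-0101", "updated_at": "2013-04-09T10:02:00Z"},
        {"_rev": "2-b", "phone": "555-0199", "updated_at": "2013-04-09T10:05:00Z"},
    ]

    # Last-write-wins: simple, but the losing update is silently lost.
    winner = max(conflict, key=lambda d: d["updated_at"])
    print(winner["phone"])  # 555-0199 -- the 10:02 write is discarded

    # A merge function can do better, but only for fields where a merge is
    # meaningful; that per-field reasoning is the non-trivial problem above.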

What do you need most?

The highest performance for queries and transactions, and the highest data availability across multiple data centers? Or:
- Less data loss (i.e., durability)
- More query accuracy and fewer deadlocks (i.e., isolation)
- More data integrity (i.e., atomicity)
- Less code to compensate for lack of ACID compliance

Compensating for Cassandra

"Hystrix is a library designed to control the interactions between these distributed services, providing greater tolerance of latency and failure. Hystrix does this by isolating ... a dramatic improvement in uptime and resilience has been achieved through its use."

Compensating for Cassandra

"Blitz4j is a logging framework to reduce multithreaded contention and enable highly scalable logging ..."

Agenda
- What is NoSQL?
- Which NoSQL is best?
  - Consistency Model
  - Enterprise Readiness

Which document NoSQL DBs are ready for the enterprise?

                                    MarkLogic  MongoDB  Riak  Cloudant/CouchDB  Cassandra
    Cost                            High       High     High  High              High
    Data integration                Great      Poor     Poor  Poor              Poor
    Security                        Great      Poor     Poor  Poor              Poor
    Administration: monitoring,
      provisioning, patching, etc.  Great      Poor     Poor  Poor              Poor

NOTES: NoSQL is as expensive as SQL databases such as Oracle and SQL Server. Hardware costs (particularly storage) are the largest expense and are comparable. Open-source databases have no licensing costs, but they have annual maintenance, and vendors' annual maintenance costs are comparable.

When will NoSQL be Enterprise Ready?

[Chart: technologies placed along a hype cycle (Technology Trigger, Inflated Expectations, Disillusionment, Enlightenment, Productivity / "Enterprise Ready"): in-memory DBs, MarkLogic, NoSQL, DB-as-a-Service, MapReduce, columnar DWs, DB appliances, Oracle, SQL Server, DB2, and open-source DBs, with time-to-readiness estimates of 1 to 5 and 5 to 10 years. Derived from the Gartner Hype Cycle for Data Management.]

What document NoSQL database is best?

                                      MarkLogic  MongoDB  Riak  Cloudant/CouchDB  Cassandra
    Overall developer productivity    Great      Good     Poor  Good              Poor
    Overall app performance           Great      Good     Good  Good              Good
    Overall enterprise readiness      Great      Poor     Poor  Poor              Poor

Questions / Comments?