The CAP theorem and the design of large scale distributed systems: Part I
Silvia Bonomi
University of Rome "La Sapienza"
www.dis.uniroma1.it/~bonomi
Great Ideas in Computer Science & Engineering, A.A. 2012/2013
A little bit of History (a timeline of system architectures)
- Mainframe-based information systems
- Client-server architectures
- First Internet-based systems for military purposes
- Peer-to-peer systems
- Web-services-based information systems
- Wireless and mobile ad-hoc networks
- Cloud computing platforms
Relational Databases History
- Relational databases have long been the mainstay of business data management
- Web-based applications caused spikes in load
  - Especially true for public-facing e-commerce sites
- Developers began to front the RDBMS with memcached or to integrate other caching mechanisms within the application
Scaling Up
- Issues arise when the dataset is simply too big to scale up
- RDBMSs were not designed to be distributed
- This led to multi-node database solutions
  - Known as scaling out, or horizontal scaling
- Different approaches include:
  - Master-slave replication
  - Sharding
Scaling RDBMS: Master/Slave
- All writes go to the master; reads are performed against the replicated slave databases
- Critical reads may be incorrect, since writes may not yet have been propagated to the slaves
- Large datasets can pose problems, as the master needs to duplicate all data to the slaves
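To make the routing concrete, here is a minimal Python sketch of master/slave routing; the connection objects and their `execute` method are hypothetical placeholders, not the API of any specific RDBMS driver.

```python
import random

class MasterSlaveRouter:
    """Minimal sketch: send writes to the master, spread reads over slaves.

    The connection objects and their execute() method are illustrative
    assumptions, not a real driver API.
    """

    def __init__(self, master, slaves):
        self.master = master          # hypothetical connection to the master
        self.slaves = slaves          # hypothetical connections to replicas

    def write(self, sql, params=()):
        # All INSERT/UPDATE/DELETE statements go to the single master.
        return self.master.execute(sql, params)

    def read(self, sql, params=()):
        # Reads hit a random replica: cheaper, but possibly stale, because
        # replication from the master is asynchronous.
        replica = random.choice(self.slaves)
        return replica.execute(sql, params)
```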
Scaling RDBMS: Sharding
- Data is partitioned (sharded) across nodes
- Scales well for both reads and writes
- Not transparent: the application needs to be partition-aware
- Relationships/joins can no longer span partitions
- Loss of referential integrity across shards
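A minimal sketch of hash-based shard routing; the shard objects and their `put`/`get` methods are illustrative assumptions, and a production system would typically use consistent hashing and a shard map rather than a plain modulo.

```python
import hashlib

class ShardRouter:
    """Minimal sketch of hash-based sharding: each key is mapped to one shard."""

    def __init__(self, shards):
        self.shards = shards  # list of hypothetical shard connections

    def shard_for(self, key):
        # Deterministic mapping key -> shard. Adding or removing shards
        # would move most keys; consistent hashing limits that movement.
        digest = hashlib.md5(str(key).encode()).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self.shard_for(key).put(key, value)

    def get(self, key):
        # A join across keys living on different shards cannot be answered
        # by a single node: this is the loss of cross-shard joins noted above.
        return self.shard_for(key).get(key)
```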
Other ways to scale an RDBMS
- Multi-master replication
- INSERT only, no UPDATEs/DELETEs
- No JOINs, thereby reducing query time
  - This involves de-normalizing the data
- In-memory databases
Today
Context
- Networked shared-data systems: a set of replicated servers accessed by clients over the network
Fundamental Properties
- Consistency: (informally) every request receives the right response
  - E.g., if I fetch my shopping list on Amazon, I expect it to contain all the previously selected items
- Availability: (informally) every request eventually receives a response
  - E.g., I can eventually access my shopping list
- Tolerance to network Partitions: (informally) servers can be partitioned into multiple groups that cannot communicate with one another
The CAP Theorem
- 2000: Eric Brewer, PODC conference keynote
- 2002: Seth Gilbert and Nancy Lynch, ACM SIGACT News 33(2)
Of the three properties of shared-data systems (Consistency, Availability and tolerance to network Partitions), only two can be achieved at any given moment in time.
Proof Intuition (figure): one client writes a new value v1 to a replica of the networked shared-data system while, across a partition, another client reads the same item from a different replica.
Fox & Brewer's CAP Theorem: C-A-P, choose two
Claim: every distributed system sits on one side of the triangle whose vertices are Consistency, Availability and Partition-resilience.
- CA: available and consistent, unless there is a partition
- CP: always consistent, even in a partition, but a reachable replica may deny service without the agreement of the others (e.g., a quorum)
- AP: a reachable replica provides service even in a partition, but may be inconsistent
The CAP Theorem
Theorem: for any shared-data system, you can have at most two of Consistency, Availability and tolerance to network Partitions.
Corollary: a consistency boundary must choose between A and P.
Forfeit Partitions (keep Consistency and Availability)
- Examples: single-site databases, cluster databases, LDAP, fiefdoms ("the inside")
- Traits: 2-phase commit, cache validation protocols
Observations
- CAP states that, in case of failures, you can have at most two of the three properties for any shared-data system
- To scale out, you have to distribute resources
- P is not really an option but rather a necessity
- The real choice is between consistency and availability
- In almost all cases, you would choose availability over consistency
Forfeit Availability (keep Consistency and tolerance to network Partitions)
- Examples: distributed databases, distributed locking, majority protocols
- Traits: pessimistic locking, making minority partitions unavailable
Forfeit Consistency (keep Availability and tolerance to network Partitions)
- Examples: Coda, web caching, DNS, emissaries ("the outside")
- Traits: expirations/leases, conflict resolution, optimistic approaches
Consistency Boundary Summary
- We can have consistency & availability within a cluster
  - No partitions within the boundary!
- OS/networking is better at A than C
- Databases are better at C than A
- Wide-area databases can't have both
- Disconnected clients can't have both
CAP, ACID and BASE
- BASE stands for Basically Available, Soft state, Eventually consistent
- Basically Available: the system is available most of the time, though some subsystems may be temporarily unavailable
- Soft state: data are volatile, in the sense that their persistence is in the hands of the user, who must take care of refreshing them
- Eventually consistent: the system eventually converges to a consistent state
CAP, ACID and BASE
- The relation between ACID and CAP is more complex
- Atomicity: every operation is executed in an all-or-nothing fashion
- Consistency: every transaction preserves the consistency constraints on data
- Isolation: transactions do not interfere; every transaction is executed as if it were the only one in the system
- Durability: after a commit, the updates made are permanent regardless of possible failures
CAP, ACID and BASE
- In CAP:
  - C refers to single-copy consistency
  - A refers to service/data availability
- In ACID:
  - C refers to constraints on the data and the data model
  - A refers to atomicity of operations, and it is always ensured
  - I is deeply related to CAP: isolation can be ensured in at most one partition
  - D is independent from CAP
Warning!
- What CAP says: when there is a partition in the network, you cannot have both C and A
- What CAP does not say: that there cannot exist time periods in which you have C, A and P together
  - During normal periods (i.e., periods with no partitions), both C and A can be achieved
"2 of 3" is misleading
- Partitions are rare events
  - There is little reason to forfeit C or A by design
- Systems evolve over time
  - Depending on the specific partition, service or data, the decision about which property to sacrifice can change
- C, A and P are measured along a continuum
  - Several levels of consistency (e.g., ACID vs BASE)
  - Several levels of availability
  - Several degrees of partition severity
"2 of 3" is misleading
- In principle, every system should be designed to ensure both C and A in normal situations
- When a partition occurs, the decision between C and A can be taken
- When the partition is resolved, the system takes corrective actions and comes back to normal operation
Consistency/Latency Trade-off
- CAP does not force designers to give up A or C, so why are there so many systems trading off C?
- CAP does not explicitly talk about latency
  - However, latency is crucial to grasp the essence of CAP
Consistency/Latency Trade-off
- High availability: a strong requirement of modern shared-data systems
- Replication: to achieve high availability, data and services must be replicated
- Consistency: replication imposes consistency maintenance
- Latency: every form of consistency requires communication, and stronger consistency requires higher latency
PACELC
- Abadi proposes to revise CAP as follows: PACELC (pronounced "pass-elk"): if there is a partition (P), how does the system trade off availability and consistency (A and C); else (E), when the system is running normally in the absence of partitions, how does it trade off latency (L) and consistency (C)?
Partitions Management
Figure 1. The state starts out consistent and remains so until a partition starts. To stay available, both sides enter partition mode and continue to execute operations, creating concurrent states S1 and S2, which are inconsistent. When the partition ends, the truth becomes clear and partition recovery starts. During recovery, the system merges S1 and S2 into a consistent state S' and also compensates for any mistakes made during the partition.
Partition Detection
- CAP does not explicitly talk about latencies; however:
  - To keep the system live, time-outs must be set
  - When a time-out expires, the system must take a decision: is a partition happening?
    - NO, continue to wait: possible availability loss
    - YES, go on with the execution: possible consistency loss
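A minimal sketch of this time-out decision, assuming a hypothetical `send_to_replica` function that blocks until the replica acknowledges or raises `TimeoutError`; the threshold value is arbitrary.

```python
import time

PARTITION_TIMEOUT = 2.0  # seconds; an assumed threshold, tuned per deployment

def replicate_or_degrade(send_to_replica, operation):
    """Minimal sketch of the time-out decision described in the slide."""
    start = time.monotonic()
    try:
        # Hypothetical call: blocks until the replica acknowledges.
        send_to_replica(operation, timeout=PARTITION_TIMEOUT)
        return "consistent"               # replica reached: C preserved
    except TimeoutError:
        elapsed = time.monotonic() - start
        # The node must now choose: keep waiting (give up availability)
        # or answer locally (give up consistency).
        print(f"no ack after {elapsed:.1f}s: entering partition mode")
        return "partition-mode"
```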
Partition Detection
- Partition detection is not global
  - One interacting party may detect the partition while the other does not
  - Different processes may be in different states (partition mode vs normal mode)
- When entering partition mode, the system may:
  - Decide to block risky operations to avoid consistency violations
  - Go on, limiting itself to a subset of operations
Which Operations Should Proceed?
- Selecting the operations to keep live is a hard task
  - It requires knowledge of the severity of each invariant violation
- Examples:
  - Every key in a DB must be unique: managing violations of unique keys is simple (merge the elements with the same key, or update the keys)
  - Every passenger of an airplane must have an assigned seat: managing seat-reservation violations is harder, and compensation may require human intervention
- Log every operation for possible future re-processing
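A minimal sketch of the last point: a per-node log that records every operation applied while in partition mode, so it can be handed to a merge or compensation procedure at recovery time (names and fields are illustrative assumptions).

```python
import time

class PartitionLog:
    """Minimal sketch: record every operation applied in partition mode."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.entries = []

    def record(self, op_name, payload):
        self.entries.append({
            "node": self.node_id,
            "op": op_name,
            "payload": payload,
            "ts": time.time(),  # wall clock only for debugging; ordering
                                # across nodes uses version vectors (below)
        })

    def replay(self, apply):
        # At partition recovery, hand every logged operation to a merge
        # procedure supplied by the application.
        for entry in self.entries:
            apply(entry)
```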
Partition Recovery
- When a partition is repaired, the partition logs may be used to recover consistency
- Strategy 1: roll back and re-execute the operations in the proper order (using version vectors)
- Strategy 2: limit the operations allowed during the partition to those whose effects merge automatically (Commutative Replicated Data Types, CRDTs)
Basic Techniques: Version Vectors
- A version vector has an entry for every node that updates the state
  - Each node has an identifier
- Each operation is stored in the log with an attached pair <nodeId, timestamp>
- Given two version vectors A and B, A is newer than B if:
  - for every node, ts_B ≤ ts_A, and
  - there exists at least one entry where ts_B < ts_A
Version Vectors: example
- Example 1: ts(A) = [1 0 0], ts(B) = [1 1 0]: ts(A) < ts(B), hence A → B (A precedes B)
- Example 2: ts(A) = [0 0 1], ts(B) = [1 0 0]: neither vector dominates the other, hence A ∥ B: POTENTIALLY INCONSISTENT!
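The dominance test from the previous slide can be written directly as code. The sketch below represents a version vector as a dictionary from node identifiers to counters (missing entries count as 0) and replays the two examples above; the node names are illustrative.

```python
def dominates(a, b):
    """True if version vector a is newer than b: for every node,
    b's entry <= a's entry, and strictly smaller for at least one node."""
    nodes = set(a) | set(b)
    return (all(a.get(n, 0) >= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) > b.get(n, 0) for n in nodes))

def compare(a, b):
    if dominates(b, a):
        return "A -> B (A precedes B)"
    if dominates(a, b):
        return "B -> A (B precedes A)"
    return "A || B (concurrent: potentially inconsistent)"

# Example 1: ts(A) = [1,0,0], ts(B) = [1,1,0]  ->  A precedes B
print(compare({"n1": 1, "n2": 0, "n3": 0}, {"n1": 1, "n2": 1, "n3": 0}))
# Example 2: ts(A) = [0,0,1], ts(B) = [1,0,0]  ->  concurrent
print(compare({"n1": 0, "n2": 0, "n3": 1}, {"n1": 1, "n2": 0, "n3": 0}))
```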
Basic Techniques: Version Vectors
- Using version vectors, it is always possible to determine whether two operations are causally related or concurrent (and thus potentially dangerous)
- Using the version vectors stored on both partitions, it is possible to re-order the operations and raise the conflicts, which may then be resolved by hand
- Recent work has proved that this consistency is the best that can be obtained in systems focused on latency
Basic Techniques: CRDTs
- Commutative Replicated Data Types (CRDTs) are data structures that provably converge after a partition (e.g., sets)
- Characteristics:
  - all the operations performed during a partition are commutative (e.g., add(a) and add(b) commute), or
  - values are represented on a lattice and all operations performed during a partition are monotonically increasing with respect to the lattice (which gives an order among them)
- Approach taken by Amazon for the shopping cart
- Allows designers to choose A while still ensuring convergence after partition recovery
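As an illustration, here is a minimal grow-only set, one of the simplest CRDTs: adds commute and merge is a set union, so two replicas that diverge during a partition converge after recovery. Amazon's shopping cart uses a more elaborate set-like structure; this sketch only shows the principle.

```python
class GSet:
    """Minimal sketch of a grow-only set CRDT."""

    def __init__(self):
        self.items = set()

    def add(self, item):
        # add() is commutative and idempotent.
        self.items.add(item)

    def merge(self, other):
        # Union is the least upper bound on the lattice of sets ordered by
        # inclusion: merge is associative, commutative and idempotent.
        self.items |= other.items

# Two replicas separated by a partition:
left, right = GSet(), GSet()
left.add("a"); left.add("b")      # operations applied only on the left side
right.add("b"); right.add("c")    # operations applied only on the right side

# Partition recovery: merging makes the replicas converge to the same state.
left.merge(right)
print(left.items)                 # {'a', 'b', 'c'}
```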
Basic Techniques: Mistake Compensation
- Choosing A and forfeiting C, mistakes may be made
  - Invariant violations
- To fix mistakes the system can:
  - Apply a deterministic rule (e.g., "last write wins")
  - Merge the operations
  - Escalate to a human
- General idea: define specific operations for managing the error (e.g., refund a credit card)
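A minimal sketch of the simplest deterministic rule mentioned above, "last write wins": among the conflicting copies of an item collected from the two sides of a healed partition, keep the one with the highest timestamp. The timestamps and values are purely illustrative.

```python
def last_write_wins(versions):
    """Minimal sketch of a deterministic compensation rule.

    `versions` is a list of (timestamp, value) pairs for the same item,
    collected from both sides of a healed partition.
    """
    ts, value = max(versions, key=lambda pair: pair[0])
    return value

# Conflicting writes produced during a partition:
print(last_write_wins([(10, "seat 12A"), (14, "seat 3C")]))  # -> 'seat 3C'
```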
What is NoSQL?
- Stands for "Not Only SQL"
- A class of non-relational data storage systems
- They usually do not require a fixed table schema, nor do they use the concept of joins
- All NoSQL offerings relax one or more of the ACID properties (recall the CAP theorem)
Why NoSQL?
- For data storage, an RDBMS cannot be the be-all and end-all
- Just as there are different programming languages, we need other data storage tools in the toolbox
- A NoSQL solution is more acceptable to a client now than it was even a year ago
How did we get here?
- Explosion of social media sites (Facebook, Twitter) with large data needs
- Rise of cloud-based solutions such as Amazon S3 (Simple Storage Service)
- Just as with the move to dynamically-typed languages (Ruby/Groovy), a shift to dynamically-typed data with frequent schema changes
- The open-source community
Dynamo and BigTable
- Three major papers were the seeds of the NoSQL movement
- BigTable (Google)
- Dynamo (Amazon)
  - Gossip protocol (discovery and error detection)
  - Distributed key-value data store
  - Eventual consistency
Thank You! Questions?!