Distributed Systems Tutorial 12: Cassandra
Written by Alex Libov, based on a FOSDEM 2010 presentation
Winter semester, 2013-2014
Cassandra
In Greek mythology, Cassandra had the power of prophecy and the curse of never being believed.
Cassandra
A massively scalable, decentralized, structured data store
- Developed by Facebook to power Inbox Search
- Released as an open source project on Google Code in July 2008
- Became an Apache Incubator project in March 2009
- Graduated to a top-level Apache project in February 2010
- Version 2.0 released September 4, 2013
Cassandra features
- Decentralized: every node in the cluster has the same role; no single point of failure
- Scalable: read and write throughput both increase linearly as new machines are added
- Fault-tolerant: data is automatically replicated to multiple nodes; replication across multiple data centers is supported
- Tunable consistency: from "writes never fail" to "block for all replicas to be readable"
- Query language: CQL (Cassandra Query Language) is an SQL-like alternative to the traditional RPC interface
Cassandra structure
[Ring diagram: nodes A-F placed on a hash ring. A key whose token falls in the range (A, B] is stored on nodes B, C, and D (replication factor R=3).]
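The clockwise walk that assigns a key's replicas can be sketched as follows. This is a minimal illustration, not Cassandra's implementation; the node names and token values are made up to match the diagram.

```python
from bisect import bisect_left

def replicas_for(key_token, ring, rf=3):
    """ring: list of (token, node) sorted by token. Walk clockwise from
    the first node whose token is >= the key's token, taking rf nodes."""
    tokens = [t for t, _ in ring]
    i = bisect_left(tokens, key_token) % len(ring)
    return [ring[(i + k) % len(ring)][1] for k in range(rf)]

# Illustrative ring with six nodes and evenly spaced tokens
ring = [(0, "A"), (10, "B"), (20, "C"), (30, "D"), (40, "E"), (50, "F")]
# A key with token 5 falls in range (A, B], so it lands on B, C, D
```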
Vnodes
- More nodes can participate when recovering from a node failure
- Easier use of heterogeneous machines in a cluster
Replication in CQL
- A keyspace is the highest-level container
- SimpleStrategy places replicas on successive nodes in the ring
- NetworkTopologyStrategy places replicas across different data centers (which are defined elsewhere)
- Within a single data center, NetworkTopologyStrategy places replicas by walking the ring clockwise until reaching the first node in another rack
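The two strategies are selected when a keyspace is created. The keyspace and data center names below are illustrative; the statement syntax is standard CQL.

```cql
-- SimpleStrategy: replicas on successive nodes of the ring
CREATE KEYSPACE demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- NetworkTopologyStrategy: a replica count per data center
-- ('DC1' and 'DC2' must match data centers defined by the cluster's snitch)
CREATE KEYSPACE demo_multi_dc
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 2};
```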
Not only an O(1) DHT
- Values are structured and indexed
- Columns / column families
- Queries
Column families
[Diagram: a column family maps each row key (e.g. key1, key2) to its own set of columns; rows need not share the same columns.]
Each column is a triple - name: byte[], value: byte[], timestamp: long
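An in-memory picture of this data model can be sketched as a nested map with last-write-wins timestamps. The `insert` helper and the conflict rule are illustrative, not Cassandra's API.

```python
# Column family sketch: row key -> {column name -> (value, timestamp)}.
def insert(cf, row_key, name, value, timestamp):
    row = cf.setdefault(row_key, {})
    old = row.get(name)
    if old is None or timestamp > old[1]:  # newer timestamp wins
        row[name] = (value, timestamp)

cf = {}
insert(cf, "key1", b"email", b"a@example.org", 100)
insert(cf, "key1", b"email", b"b@example.org", 90)  # older write, ignored
insert(cf, "key2", b"name", b"Bob", 50)             # rows differ in columns
```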
Why column families?
[Comparison diagram]
Write consistency
The client can specify the desired consistency level per request:
- Any: will always succeed
- One: write to at least one replica node
- Two, Three
- Quorum: (N/2)+1 replicas
- Local_one: at least one replica in the local data center
- Local_quorum
- Each_quorum: written to the commit log and memtable on a quorum of nodes in every data center
- All
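The arithmetic behind the quorum levels, with N the replication factor:

```python
# Quorum size for replication factor n, as on the slide: (N/2)+1
def quorum(n):
    return n // 2 + 1

# With N = 3 a quorum is 2, so a quorum write plus a quorum read
# always overlap in at least one replica (read-your-writes).
```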
Write path
[Diagram: a write goes to the commit log and the memtable; memtables are flushed to SSTables on disk.]
Writes
- No reads, no seeks: sequential disk access
- Atomic within a column family
- Fast
- Any node: if the write doesn't belong to the node, it is proxied to where the write belongs
- Always writable (hinted handoff): if the node where the write belongs is down, the write is given to another node with a hint that says "update the correct node when it comes back up"
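The commit-log/memtable/SSTable flow above can be sketched as below. This is a toy model under assumed names (`Node`, `memtable_limit`), not Cassandra's storage engine.

```python
# Write path sketch: append to the commit log, update the memtable,
# flush full memtables to immutable sorted SSTables.
class Node:
    def __init__(self, memtable_limit=2):
        self.commitlog = []   # append-only, sequential disk access
        self.memtable = {}    # in-memory; no read needed before a write
        self.sstables = []    # immutable sorted tables, oldest first
        self.limit = memtable_limit

    def write(self, key, value):
        self.commitlog.append((key, value))  # durability first
        self.memtable[key] = value           # no seek, no read
        if len(self.memtable) >= self.limit:
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}               # flushed to disk

node = Node()
node.write("b", 2)
node.write("a", 1)   # memtable full -> flushed to an SSTable
node.write("c", 3)
```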
Read consistency
- One: read from the closest replica
- Two: read from any two replicas, return the most recent data
- Three
- Quorum
- Local_quorum, Local_one: counts only replicas in the local data center
- Each_quorum
- All
Request illustration
[Diagram]
Read path
[Diagram: a read checks the memtable first, then consults each SSTable's bloom filter (Bf) and index (Idx) before reading from disk.]
Reads
- Any node: Cassandra tracks which replicas respond fastest and prefers to route requests there
- Read repair
- Usual caching conventions apply
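Read repair can be sketched as: ask several replicas, keep the version with the newest timestamp, and push it back to any stale replica. A minimal sketch with replicas modeled as dicts; the function name is hypothetical.

```python
# Each replica is modeled as a dict: key -> (value, timestamp).
def read_with_repair(replicas, key):
    versions = [(rep, rep.get(key)) for rep in replicas]
    newest = max((v for _, v in versions if v is not None),
                 key=lambda v: v[1])          # highest timestamp wins
    for rep, v in versions:
        if v != newest:
            rep[key] = newest                 # repair stale or missing copy
    return newest[0]

r1 = {"k": ("old", 1)}
r2 = {"k": ("new", 2)}
r3 = {}
```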
Hinted handoff
- When a write is performed and a replica is down, the coordinator node stores the request for some time
- After a node discovers from gossip that a node for which it holds hints has recovered, it sends the data row corresponding to each hint to the target
- If insufficient replica targets are alive to satisfy the requested consistency level, an exception is thrown - with or without hinted handoff
- Unlike Dynamo's replication model, Cassandra does not default to sloppy quorum
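The coordinator's behavior above can be sketched as follows. Function names and the data shapes are assumptions for illustration only.

```python
# alive: node name -> that node's store; hints: node name -> pending writes.
def coordinate_write(key, value, replicas, alive, consistency, hints):
    acks = 0
    for node in replicas:
        if node in alive:
            alive[node][key] = value
            acks += 1
        else:
            hints.setdefault(node, []).append((key, value))  # store a hint
    if acks < consistency:
        # hints alone never satisfy the consistency level
        raise RuntimeError("not enough live replicas for consistency level")
    return acks

def replay_hints(node, alive, hints):
    # called when gossip reports the node is back up
    for key, value in hints.pop(node, []):
        alive[node][key] = value
```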
Lightweight transactions
- Two users attempting to create a unique user account in the same cluster could overwrite each other's work, with neither user knowing about it
- Using and extending the Paxos consensus protocol, Cassandra offers a way to ensure a transaction isolation level similar to the serializable level offered by RDBMSs
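In CQL this is exposed as a conditional statement; the `IF NOT EXISTS` clause is the lightweight-transaction syntax, while the table and column names below are illustrative.

```cql
-- Succeeds only if no row with this primary key exists, so two
-- concurrent account creations cannot both win.
INSERT INTO users (username, email)
VALUES ('alex', 'alex@example.org')
IF NOT EXISTS;
```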
A modified Paxos
1. Prepare/promise: each replica promises not to accept any proposal associated with an earlier ballot; along with that promise, it includes the most recent proposal it has already accepted
2. Read: read the current value of the row to see if it matches the expected one
3. Propose/accept: if a majority of the nodes promised to accept the leader's proposal, it may proceed to the actual proposal
4. Commit: reset the Paxos state for subsequent proposals
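The four phases can be condensed into a single-round compare-and-set sketch. This is a deliberately simplified toy (no real message passing, no recovery of in-flight proposals); all names are assumptions.

```python
# replicas: list of {"promised": ballot, "data": {key: value}} dicts.
def compare_and_set(replicas, key, expected, new_value, ballot):
    # Phase 1: prepare/promise -- need a majority of promises
    promises = [r for r in replicas if ballot > r["promised"]]
    if len(promises) <= len(replicas) // 2:
        return False
    for r in promises:
        r["promised"] = ballot
    # Phase 2: read the current value and check the expected condition
    current = replicas[0]["data"].get(key)
    if current != expected:
        return False
    # Phases 3-4: propose/accept, then commit and reset the Paxos state
    for r in replicas:
        r["data"][key] = new_value
        r["promised"] = -1
    return True
```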
Datastore comparison

Consistency
- Google Bigtable: atomic appends; non-atomic writes + atomic transactions on a row basis
- Amazon Dynamo: eventual
- Microsoft Azure: atomic appends
- Yahoo! PNUTS: tunable per request, eventual to timeline
- Apache Cassandra: tunable per request, eventual to serializable

Master implementation
- Google Bigtable: Chubby lock service + primary/backup in GFS
- Amazon Dynamo: none (O(1) DHT)
- Microsoft Azure: stream master + Paxos cluster in the storage layer
- Yahoo! PNUTS: a single pair of active/standby servers
- Apache Cassandra: none (O(1) DHT) + modified Paxos for serializable requests

Request handling
- Google Bigtable: assigned tablet server
- Amazon Dynamo: any node
- Microsoft Azure: load balancer to a specific partition server
- Yahoo! PNUTS: router to the closest tablet server
- Apache Cassandra: any node

Conflict resolution
- Google Bigtable: by client
- Amazon Dynamo: by client
- Microsoft Azure: -
- Yahoo! PNUTS: by client
- Apache Cassandra: by server, using timestamps

Replication count
- Google Bigtable: tunable per file (GFS)
- Amazon Dynamo: tunable per service
- Microsoft Azure: 3 + geo-replication
- Yahoo! PNUTS: geo-replication
- Apache Cassandra: tunable per keyspace

Write implementation
- Google Bigtable: directly with the corresponding tablet server
- Amazon Dynamo: coordinator sends to all replicas + hinted handoffs + background read repairs
- Microsoft Azure: master replica + sync with the stream master on recovery
- Yahoo! PNUTS: pub/sub with guaranteed delivery to commit
- Apache Cassandra: same as Dynamo
For more info
http://cassandra.apache.org/
http://www.datastax.com/documentation/articles/cassandra/cassandrathenandnow.html