Cassandra Jonathan Ellis
Motivation
- Scaling reads to a relational database is hard.
- Scaling writes to a relational database is virtually impossible, and when you do manage it, the result usually isn't relational anymore.
The new face of data
- Scale out, not up
- Online load balancing, cluster growth
- Flexible schema
- Key-oriented queries
- CAP-aware
CAP theorem
- Pick two of Consistency, Availability, Partition tolerance
Two famous papers
- Bigtable: A Distributed Storage System for Structured Data, 2006
- Dynamo: Amazon's Highly Available Key-value Store, 2007
Two approaches
- Bigtable: How can we build a distributed db on top of GFS?
- Dynamo: How can we build a distributed hash table appropriate for the data center?
10,000 ft summary
- Dynamo partitioning and replication
- Log-structured ColumnFamily data model similar to Bigtable's
Cassandra highlights
- High availability
- Incremental scalability
- Eventually consistent
- Tunable tradeoffs between consistency and latency
- Minimal administration
- No single point of failure (SPOF)
Dynamo architecture & Lookup
Architecture details
- O(1) node lookup
- Explicit replication
- Eventually consistent
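The lookup and replication bullets can be illustrated with a toy token ring. This is a minimal sketch, not Cassandra's code: the `Ring` class, the `token` helper, and the choice of MD5 are assumptions made for illustration. It reads "O(1)" as a constant number of network hops, because every node holds the full ring and can compute the replicas locally.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string onto the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical token ring; each node is assumed to hold a full copy."""

    def __init__(self, nodes, replication_factor=3):
        # Sorted (token, node) pairs; no routing hops are needed to find an owner.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]
        self.n = min(replication_factor, len(self.entries))

    def replicas(self, key: str):
        # First node clockwise from the key's token, then the next N-1 nodes.
        start = bisect.bisect_right(self.tokens, token(key)) % len(self.entries)
        return [self.entries[(start + i) % len(self.entries)][1]
                for i in range(self.n)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("keya")  # three distinct nodes, same answer on every node
```

Replication is explicit here in the sense that the replica set is a pure function of the key and the ring, so any node can coordinate a request.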
Architecture layers
- Messaging service, Gossip, Failure detection, Cluster state, Partitioner, Replication
- Commit log, Memtable, SSTable, Indexes, Compaction
- Tombstones, Hinted handoff, Read repair, Bootstrap, Monitoring, Admin tools
Writes
- Any node
- Partitioner
- Commitlog, memtable
- SSTable
- Compaction
- Wait for W responses
Memtable / SSTable
[Diagram labels: Disk, Commit log]
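The local half of the write path (commit log, then memtable, then a flush to an SSTable) can be sketched as follows. The `Node` class, the `memtable_limit` threshold, and the decision to clear the log after a flush are simplifying assumptions for illustration, not Cassandra's actual implementation.

```python
class Node:
    """Hypothetical single-node write path: log, memtable, flush."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # stand-in for the append-only log on disk
        self.memtable = {}     # in memory: key -> {column: (value, timestamp)}
        self.sstables = []     # immutable, key-sorted snapshots of flushed memtables
        self.memtable_limit = memtable_limit

    def write(self, key, column, value, timestamp):
        # 1. Append to the commit log for durability (sequential I/O, no reads).
        self.commit_log.append((key, column, value, timestamp))
        # 2. Update the memtable; the newest timestamp wins.
        row = self.memtable.setdefault(key, {})
        current = row.get(column)
        if current is None or timestamp >= current[1]:
            row[column] = (value, timestamp)
        # 3. Flush once the memtable holds enough rows (assumed trigger).
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out sorted by key; it is never modified afterwards.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        # Assumption: flushed entries no longer need the log.
        self.commit_log.clear()


node = Node(memtable_limit=2)
node.write("keya", "column1", "v1", 1)
node.write("keyc", "column1", "v2", 2)  # second row triggers a flush
```

Nothing in `write` reads from disk, which is the property the deck later summarizes as "No reads, No seeks".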
SSTable format Key / data
SSTable Indexes
- Bloom filter
- Key
- Column
- (Similar to Hadoop MapFile / TFile)
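A Bloom filter lets a read skip SSTables that definitely do not contain a key. The sketch below is a generic Bloom filter, not Cassandra's; the class name, the sizes, and the use of SHA-256 to derive bit positions are illustrative assumptions.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter: no false negatives, some false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, key: str):
        # Derive num_hashes positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter()
bf.add("keya")
bf.might_contain("keya")  # always True for a key that was added
```

A reader would consult the filter first and only touch the key index and data file when `might_contain` returns True.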
Compaction
- Merge keys
- Combine columns
- Discard tombstones
Remove
- A deletion marker (tombstone) is necessary to suppress data in older SSTables until compaction.
- Read repair complicates things a little.
- Eventual consistency complicates things more.
- Solution: a configurable delay before tombstone GC, after which tombstones are not repaired.
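Compaction and tombstone GC can be sketched together. This is a simplified model under stated assumptions: SSTables are plain dicts, `TOMBSTONE` is a hypothetical marker object, and `gc_grace` stands in for the configurable delay; none of these names come from Cassandra itself.

```python
TOMBSTONE = object()  # hypothetical deletion marker


def compact(sstables, now, gc_grace):
    """Merge SSTables shaped as {key: {column: (value, timestamp)}}."""
    merged = {}
    for table in sstables:
        for key, columns in table.items():          # merge keys
            row = merged.setdefault(key, {})
            for name, (value, ts) in columns.items():
                # Combine columns: the newest timestamp wins.
                if name not in row or ts > row[name][1]:
                    row[name] = (value, ts)
    result = {}
    for key, row in merged.items():
        # Discard tombstones only once they are older than the grace period.
        kept = {name: (value, ts) for name, (value, ts) in row.items()
                if not (value is TOMBSTONE and now - ts > gc_grace)}
        if kept:
            result[key] = kept
    return result


older = {"keya": {"column1": ("v", 1), "column2": ("x", 1)}}
newer = {"keya": {"column1": (TOMBSTONE, 5)}}
compacted = compact([older, newer], now=10, gc_grace=100)
# column1 is still a tombstone here, so the old value stays suppressed.
```

Keeping a young tombstone matters because a replica that missed the delete could otherwise resurrect the old value through read repair.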
Cassandra write properties
- No reads
- No seeks
- Fast
- Atomic within ColumnFamily
- Always writable
Read path
- Any node
- Partitioner
- Wait for R responses
- Wait for N - R responses in the background and perform read repair
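The read path with read repair can be sketched as below. It is a toy model: replicas are plain dicts, the "background" step runs inline, and the function name and data shapes are assumptions for illustration.

```python
def read_with_repair(replicas, key, column, r):
    """replicas: list of dicts mapping (key, column) -> (value, timestamp)."""
    responses = [(rep, rep.get((key, column))) for rep in replicas]

    # Answer the client from the first R responses; the newest timestamp wins.
    first = [v for _, v in responses[:r] if v is not None]
    answer = max(first, key=lambda v: v[1]) if first else None

    # Read repair (done inline here): compare all N responses and
    # push the newest version to any replica that is missing it or stale.
    versions = [v for _, v in responses if v is not None]
    if versions:
        newest = max(versions, key=lambda v: v[1])
        for rep, v in responses:
            if v is None or v[1] < newest[1]:
                rep[(key, column)] = newest
    return answer


a = {("keya", "column1"): ("old", 1)}
b = {("keya", "column1"): ("new", 2)}
c = {}
first_read = read_with_repair([a, b, c], "keya", "column1", r=1)   # may be stale
second_read = read_with_repair([a, b, c], "keya", "column1", r=1)  # repaired
```

With R=1 the first read can return the stale value, but the repair brings every replica up to date so that later reads agree.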
Cassandra read properties
- Read multiple SSTables
- Slower than writes (but still fast)
- Seeks can be mitigated with more RAM
- Scales to billions of rows
Consistency in a BASE world
- If W + R > N, you will have consistency
- W=1, R=N
- W=N, R=1
- W=Q, R=Q where Q = N / 2 + 1 (integer division)
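The W + R > N rule works because any set of W replicas and any set of R replicas must then share at least one member, so a read always reaches a replica that saw the latest write. The sketch below checks this by brute force; the function names are illustrative.

```python
from itertools import combinations


def quorum(n: int) -> int:
    """Q = N / 2 + 1 with integer division."""
    return n // 2 + 1


def is_consistent(w: int, r: int, n: int) -> bool:
    """The rule from the slide: W + R > N."""
    return w + r > n


def always_overlap(w: int, r: int, n: int) -> bool:
    """True if every W-sized write set intersects every R-sized read set."""
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))


quorum(3)                # 2
is_consistent(2, 2, 3)   # True: W=Q, R=Q
always_overlap(1, 1, 3)  # False: a read can miss the only replica written
```

The three settings on the slide trade latency differently: W=1, R=N makes writes cheap, W=N, R=1 makes reads cheap, and quorum on both sides balances the two.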
vs MySQL with 50GB of data
            Write      Read
MySQL       ~300ms     ~350ms
Cassandra   ~0.12ms    ~15ms
Achtung!
Data model Rows, ColumnFamilies, Columns
ColumnFamilies
  keya: column1, column2, column3
  keyc: column1, column7, column11
Column
- Byte[] Name
- Byte[] Value
- I64 timestamp
Super ColumnFamilies
[Diagram: rows keyf and keyj hold supercolumns (Super1, Super2, Super5), and each supercolumn holds a set of columns]
Types of queries
- Single column
- Slice
  - Set of names / range of names
  - Simple slice -> columns
  - Super slice -> supercolumns
- Key range
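The data model and the simple (non-super) query types can be sketched in memory. The `Column` fields follow the slide; the `ColumnFamily` class and its method names are hypothetical and do not mirror the Thrift API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: bytes
    value: bytes
    timestamp: int


class ColumnFamily:
    """Hypothetical in-memory ColumnFamily: key -> {column name: Column}."""

    def __init__(self):
        self.rows = {}

    def insert(self, key, column):
        row = self.rows.setdefault(key, {})
        existing = row.get(column.name)
        # The newest timestamp wins; older writes are ignored.
        if existing is None or column.timestamp >= existing.timestamp:
            row[column.name] = column

    def get_column(self, key, name):
        """Single-column query."""
        return self.rows.get(key, {}).get(name)

    def slice_by_names(self, key, names):
        """Slice by an explicit set of column names."""
        row = self.rows.get(key, {})
        return [row[n] for n in names if n in row]

    def slice_by_range(self, key, start, finish):
        """Slice by an inclusive range of names, in sorted name order."""
        row = self.rows.get(key, {})
        return [row[n] for n in sorted(row) if start <= n <= finish]


cf = ColumnFamily()
for name in (b"column1", b"column2", b"column3"):
    cf.insert("keya", Column(name, b"value", 1))
first_two = cf.slice_by_range("keya", b"column1", b"column2")
```

Columns are kept in name order within a row, which is what makes range slices cheap; a super slice would return supercolumns in the same way.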
Range queries
- Add master server
- Implement on top of K/V
- Order-preserving partitioning
Modification
- Insert / update
- Remove
- Single column or batch
- Specify W, number of nodes to wait for
Thrift

struct Column {
  1: binary name,
  2: binary value,
  3: i64 timestamp,
}

struct SuperColumn {
  1: binary name,
  2: list<Column> columns,
}

Column get_column(table, key, column_path, block_for=1)

list<string> get_key_range(table, column_family, start_with="", stop_at="", max_results=100)

void insert(table, key, column_path, value, timestamp, block_for=0)

void remove(tablename, key, column_path_or_parent, timestamp)
Honestly, Thrift kinda sucks
Example: a multiuser blog
Two queries:
- the most recent posts belonging to a given blog, in reverse chronological order
- a single post and its comments, in chronological order
First try
[Diagram: one super ColumnFamily. Rows "JBE blog" and "Evan blog"; each supercolumn is a post ("Cassandra is teh awesome" and "BASE FTW" for JBE, "I like kittens" and "And Ruby" for Evan) whose subcolumns hold the post and its comments]
<ColumnFamily Type="Super" CompareWith="TimeString" CompareSubcolumnsWith="UUID" Name="Blog"/>
Second try
Blog
  JBE blog: Cassandra is teh awesome, BASE FTW
  Evan blog: I like kittens, And Ruby
Comment
  Cassandra is teh awesome: comment, comment
  BASE FTW: comment, comment
  I like kittens: comment, comment
  And Ruby: comment, comment
<ColumnFamily CompareWith="UUIDType" Name="Blog"/>
<ColumnFamily CompareWith="UUIDType" Name="Comment"/>
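The two-ColumnFamily layout can be sketched with plain dicts to show how each of the two blog queries becomes a single-row slice. Everything here is a stand-in: integer ids replace the time-ordered UUIDs, comments are keyed by post id rather than by title, and the helper names are hypothetical.

```python
def add_post(blog_cf, blog_name, post_id, title):
    # "Blog" ColumnFamily: row per blog, one column per post, named by post id.
    blog_cf.setdefault(blog_name, {})[post_id] = title


def add_comment(comment_cf, post_id, comment_id, text):
    # "Comment" ColumnFamily: row per post, one column per comment.
    comment_cf.setdefault(post_id, {})[comment_id] = text


def recent_posts(blog_cf, blog_name, limit=10):
    """Query 1: most recent posts of a blog, reverse chronological order."""
    row = blog_cf.get(blog_name, {})
    return [row[i] for i in sorted(row, reverse=True)[:limit]]


def post_with_comments(blog_cf, comment_cf, blog_name, post_id):
    """Query 2: one post and its comments, chronological order."""
    title = blog_cf[blog_name][post_id]
    row = comment_cf.get(post_id, {})
    return title, [row[i] for i in sorted(row)]


blog, comment = {}, {}
add_post(blog, "JBE blog", 1, "Cassandra is teh awesome")
add_post(blog, "JBE blog", 2, "BASE FTW")
add_comment(comment, 2, 11, "first comment")
add_comment(comment, 2, 12, "second comment")
latest = recent_posts(blog, "JBE blog")
```

Because column names sort by time, both queries read one row in its natural order (or its reverse) without scanning other blogs or posts.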
Roadmap
Cassandra 0.3
- Remove support
- OPP (order-preserving partitioner) / Range queries
- Test suite
- Workarounds for JDK bugs
- Rudimentary multi-datacenter support
Cassandra 0.4
- Branched May 18
- Data file format change to support billions of rows per node instead of millions
- API changes (no more colon delimiters)
- Multi-table (keyspace) support
- LRU key cache
- fsync support
- Bootstrap
- Web interface
Cassandra 0.5
- Bootstrap
- Load balancing
  - Closely related to bootstrap done right
- Merkle tree repair
- Millions of columns per row
  - This will require another data format change
- Multiget
- Callout support
Users
- Production: Facebook, RocketFuel
- Production RSN: Digg, Rackspace
- No date yet: IBM Research, Twitter
- Evaluating: 50+ in #cassandra on freenode
More
- Eventual consistency: http://www.allthingsdistributed.com/2008/12/
- Introduction to distributed databases by Todd Lipcon at NoSQL 09: http://www.vimeo.com/5145059
- Other articles/videos about Cassandra: http://wiki.apache.org/cassandra/articlesandp
- #cassandra on irc.freenode.net
Cassandra