Introduction to Cassandra

Similar documents
HDB++: HIGH AVAILABILITY WITH. l TANGO Meeting l 20 May 2015 l Reynald Bourtembourg

Apache Cassandra for Big Data Applications

Highly available, scalable and secure data with Cassandra and DataStax Enterprise. GOTO Berlin 27 th February 2014

Apache Zeppelin, the missing component for your BigData ecosystem

Real-Time Big Data in practice with Cassandra. Michaël

Distributed Systems. Tutorial 12 Cassandra

Introduction to Apache Cassandra

Practical Cassandra. Vitalii

Introduction to Multi-Data Center Operations with Apache Cassandra and DataStax Enterprise

Cassandra vs MySQL. SQL vs NoSQL database comparison

So What s the Big Deal?

No-SQL Databases for High Volume Data

Introduction to Multi-Data Center Operations with Apache Cassandra, Hadoop, and Solr WHITE PAPER

NetflixOSS A Cloud Native Architecture

CQL for Cassandra 2.2 & later

Evaluation of NoSQL databases for large-scale decentralized microblogging

Apache Cassandra Present and Future. Jonathan Ellis

Apache HBase. Crazy dances on the elephant back

these three NoSQL databases because I wanted to see a the two different sides of the CAP

Going Native With Apache Cassandra. QCon London, 2014

Apache Cassandra and DataStax. DataStax EMEA

Preparing Your Data For Cloud

NoSQL: Going Beyond Structured Data and RDBMS

Comparing the Hadoop Distributed File System (HDFS) with the Cassandra File System (CFS)

INTRODUCTION TO CASSANDRA

Dell Reference Configuration for DataStax Enterprise powered by Apache Cassandra

Enabling SOX Compliance on DataStax Enterprise

NoSQL Databases. Nikos Parlavantzas

LARGE-SCALE DATA STORAGE APPLICATIONS

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Apache Cassandra Query Language (CQL)

Big Data Development CASSANDRA NoSQL Training - Workshop. March 13 to am to 5 pm HOTEL DUBAI GRAND DUBAI

Cassandra A Decentralized, Structured Storage System

Simba Apache Cassandra ODBC Driver

Comparing the Hadoop Distributed File System (HDFS) with the Cassandra File System (CFS) WHITE PAPER

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Design Patterns for Distributed Non-Relational Databases

The Modern Online Application for the Internet Economy: 5 Key Requirements that Ensure Success

Comparing SQL and NOSQL databases

Non-Stop for Apache HBase: Active-active region server clusters TECHNICAL BRIEF

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

NoSQL Databases. Institute of Computer Science Databases and Information Systems (DBIS) DB 2, WS 2014/2015

Non-Stop Hadoop Paul Scott-Murphy VP Field Techincal Service, APJ. Cloudera World Japan November 2014

DBA'S GUIDE TO NOSQL APACHE CASSANDRA

Why NoSQL? Your database options in the new non- relational world IBM Cloudant 1

MakeMyTrip CUSTOMER SUCCESS STORY

Big Data Analytics with Cassandra, Spark & MLLib

Epimorphics Linked Data Publishing Platform

Hadoop Ecosystem B Y R A H I M A.

Cassandra. Jonathan Ellis

HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367

Case study: CASSANDRA

High Throughput Computing on P2P Networks. Carlos Pérez Miguel

How To Use Big Data For Telco (For A Telco)

Going Native With Apache Cassandra. NoSQL Matters, Cologne, 2014

NoSQL Performance Test In-Memory Performance Comparison of SequoiaDB, Cassandra, and MongoDB

DataStax Enterprise Reference Architecture

JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra

Amazon Redshift & Amazon DynamoDB Michael Hanisch, Amazon Web Services Erez Hadas-Sonnenschein, clipkit GmbH Witali Stohler, clipkit GmbH

Referential Integrity in Cloud NoSQL Databases

Xiaoming Gao Hui Li Thilina Gunarathne

Cloud Computing at Google. Architecture

Time series IoT data ingestion into Cassandra using Kaa

Data Modeling in the New World with Apache Cassandra TM. Jonathan Ellis CTO, DataStax Project chair, Apache Cassandra

Cloud Computing with Microsoft Azure

Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world

MariaDB Cassandra interoperability

Practical Guidelines for Selecting NoSQL vs. an RDBMS Deployment Considerations Conclusion About DataStax

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

CQL for Cassandra 2.0 & 2.1

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Introduction to Hbase Gkavresis Giorgos 1470

May 6, DataStax Cassandra South Bay Meetup. Cassandra Modeling. Best Practices and Examples. Jay Patel Architect, Platform

Axibase Time Series Database

Technical Overview Simple, Scalable, Object Storage Software

SQL + NOSQL + NEWSQL + REALTIME FOR INVESTMENT BANKS

Integration of Apache Hive and HBase

Cassandra A Decentralized Structured Storage System

Rakam: Distributed Analytics API

Evaluating Apache Cassandra as a Cloud Database WHITE PAPER

Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

low-level storage structures e.g. partitions underpinning the warehouse logical table structures

An Approach to Implement Map Reduce with NoSQL Databases

Mark Bennett. Search and the Virtual Machine

On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform

Comparing Oracle with Cassandra / DataStax Enterprise

Apache Cassandra 1.2 Documentation

Transcription:

Introduction to Cassandra DuyHai DOAN, Technical Advocate

Agenda! Architecture cluster replication Data model last write win (LWW), CQL basics (CRUD, DDL, collections, clustering column) lightweight transactions 2

Who Am I?! Duy Hai DOAN Cassandra technical advocate talks, meetups, confs open-source devs (Achilles, ) OSS Cassandra point of contact duy_hai.doan@datastax.com 3

Datastax! Founded in April 2010 We contribute a lot to Apache Cassandra 400+ customers (25 of the Fortune 100), 200+ employees Headquarter in San Francisco Bay area EU headquarter in London, offices in France and Germany Datastax Enterprise = OSS Cassandra + extra features 4

Cassandra 5 key facts! Key fact 1: linear scalability C* YOU C* C* NetcoSports 3 nodes, 3GB 1k+ nodes, PB+ 5

Cassandra 5 key facts! Key fact 2: continuous availability ( 100% up-time) resilient architecture (Dynamo) 6

Cassandra 5 key facts! Key fact 3: multi-data centers out-of-the-box (config only) AWS conf for multi-region DCs GCE/CloudStack support Microsoft Azure 7

Cassandra 5 key facts! Key fact 4: operational simplicity 1 node = 1 process + 2 config file (main + IP) deployment automation OpsCenter for monitoring 8

Cassandra 5 key facts! 9

Cassandra 5 key facts! Key fact 5: analytics combo Cassandra + Spark = awesome! realtime streaming/

Cassandra architecture! Cluster Replication

Cassandra architecture! Cluster layer Amazon DynamoDB paper masterless architecture Data-store layer Google Big Table paper Columns/columns family 12

Data distribution! Random: hash of #partition token = hash(#p) Hash: ]-X, X] X = huge number (2 64 /2) n 1 n 2 n 3 n 4 n 5 n 8 n 6 n 7 13

Token Ranges! A: ]0, X/8] B: ] X/8, 2X/8] C: ] 2X/8, 3X/8] D: ] 3X/8, 4X/8] B n 2 nc 3 D n 4 E: ] 4X/8, 5X/8] F: ] 5X/8, 6X/8] G: ] 6X/8, 7X/8] A n 1 ne 5 H: ] 7X/8, X] H n 8 nf 6 G n 7 14

Distributed Table! nc 3 B n 2 D n 4 user_id 1 user_id 2 user_id 3 A n 1 ne 5 user_id 4 user_id 5 H n 8 nf 6 G n 7 15

Distributed Table! nc 3 user_id 1 B n 2 D n 4 user_id 3 A n 1 ne 5 user_id 5 user_id 4 H n 8 nf 6 user_id 2 G n 7 16

Linear scalability! 8 nodes n 3 n 2 n 4 n 1 n 5 n 8 n 6 n 7 17

Linear scalability! 10 nodes n 3 n 4 n 2 n 5 n 1 n 6 n 10 n 7 n 9 n 8 18

Failure tolerance! Replication Factor (RF) = 3 n 3 n 2 n 4 n 1 n 5 n 8 n 6 n 7 19

Coordinator node! Incoming requests (read/write) Coordinator node handles the request n 2 n 3 n 4 n 1 n 5 Every node can be coordinator à masterless n 8 n 7 n 6

Consistency! Tunable at runtime ONE QUORUM (strict majority w.r.t. RF) ALL Apply both to read & write 21

Consistency in action! RF = 3, Write ONE, Read ONE Write ONE: B B A A data replication in progress B A A Read ONE: A 22

Consistency in action! RF = 3, Write ONE, Read QUORUM Write ONE: B B A A data replication in progress B A A Read QUORUM: A 23

Consistency in action! RF = 3, Write ONE, Read ALL Write ONE: B B A A data replication in progress B A A Read ALL: B 24

Consistency in action! RF = 3, Write QUORUM, Read ONE Write QUORUM: B B B A data replication in progress B B A Read ONE: A 25

Consistency in action! RF = 3, Write QUORUM, Read QUORUM Write QUORUM: B B B A data replication in progress B B A Read QUORUM: B 26

Consistency trade-off! 27

Consistency level! ONE Fast, may not read latest written value 28

Consistency level! QUORUM Strict majority w.r.t. Replication Factor Good balance 29

Consistency level! ALL Paranoid Slow, no high availability 30

Consistency summary! ONE Read + ONE Write available for read/write even (N-1) replicas down QUORUM Read + QUORUM Write available for read/write even 1+ replica down 31

! "! Q & R

Data model! Last Write Win! CQL basics! Clustered tables! Lightweight transactions!

Last Write Win (LWW)! INSERT INTO users(login, name, age) VALUES( jdoe, John DOE, 33); #partition jdoe age name 33 John DOE 34

Last Write Win (LWW)! INSERT INTO users(login, name, age) VALUES( jdoe, jdoe age (t 1 ) name (t 1 ) 33 John DOE 35

Last Write Win (LWW)! UPDATE users SET age = 34 WHERE login = jdoe; SSTable 1 SSTable 2 jdoe age (t 1 ) name (t 1 ) 33 John DOE jdoe age (t 2 ) 34 36

Last Write Win (LWW)! DELETE age FROM users WHERE login = jdoe; tombstone SSTable 1 SSTable 2 SSTable 3 jdoe age (t 1 ) name (t 1 ) 33 John DOE jdoe age (t 2 ) 34 jdoe age (t 3 ) ý 37

Last Write Win (LWW)! SELECT age FROM users WHERE login = jdoe;??? SSTable 1 SSTable 2 SSTable 3 jdoe age (t 1 ) name (t 1 ) 33 John DOE jdoe age (t 2 ) 34 jdoe age (t 3 ) ý 38

Last Write Win (LWW)! SELECT age FROM users WHERE login = jdoe; SSTable 1 SSTable 2 SSTable 3 jdoe age (t 1 ) name (t 1 ) 33 John DOE jdoe age (t 2 ) 34 jdoe age (t 3 ) ý 39

Compaction! SSTable 1 SSTable 2 SSTable 3 jdoe age (t 1 ) name (t 1 ) 33 John DOE jdoe age (t 2 ) 34 jdoe age (t 3 ) ý New SSTable jdoe age (t 3 ) name (t 1 ) ý John DOE 40

Historical data! You want to keep data history? do not use internal generated timestamp!!! time-series data modeling history SSTable 1 SSTable 2 id date 1(t 1 ) date 2 (t 2 ) date 9 (t 9 ) id date 10(t 10 ) date 11 (t 11 ) 41

CRUD operations! INSERT INTO users(login, name, age) VALUES( jdoe, John DOE, 33); UPDATE users SET age = 34 WHERE login = jdoe; DELETE age FROM users WHERE login = jdoe; SELECT age FROM users WHERE login = jdoe; 42

Simple Table! CREATE TABLE users ( login text, name text, age int, PRIMARY KEY(login)); partition key (#partition) 43

Clustered table (1 N)! CREATE TABLE mailbox ( login text, message_id timeuuid, interlocutor text, message text, PRIMARY KEY((login), message_id)); partition key clustering column (sorted) unicity 44

On disk layout SSTable 1 jdoe hsue message_id 1 message_id 2 message_id 104 message_id 1 message_id 2 message_id 78 SSTable 2 jdoe message_id 105 message_id 106 message_id 169 45

Queries! Get message by user and message_id (date) 46

Queries! Get message by message_id only (#partition not provided) SELECT * FROM mailbox WHERE message_id = 2014-09-25 16:00:00 ; Get message by date interval (#partition not provided) SELECT * FROM mailbox WHERE and message_id <= 2014-09-25 16:00:00 and message_id >= 2014-09-20 16:00:00 ; 47

Without #partition No #partition no token where are my data????????? 48

The importance of #partition In RDBMS, no primary key full table scan 49

The importance of #partition With Cassandra, no partition key full CLUSTER scan 50

Queries! Get message by user range (range query on #partition) SELECT * FROM mailbox WHERE login >= hsue and login <= jdoe; Get message by user pattern (non exact match on #partition) SELECT * FROM mailbox WHERE login like %doe% ; 51

Clustering order! CREATE TABLE mailbox ( login text, message_id timeuuid, interlocutor text, message text, PRIMARY KEY((login), message_id)) WITH CLUSTERING ORDER BY (message_id DESC) ; 52

Reverse on disk layout! SSTable 1 jdoe message_id 169 message_id 168 message_id 105 SSTable 2 jdoe message_id 104 message_id 103 message_id 1 53

WHERE clause restrictions! All queries (INSERT/UPDATE/DELETE/SELECT) must provide #partition Only exact match (=) on #partition, range queries (<,, >, ) not allowed full cluster scan On clustering columns, only range queries (<,, >, ) and exact match WHERE clause only possible on columns defined

WHERE clause restrictions! What if I want to perform «arbitrary» WHERE clause? search form scenario, dynamic search fields 55

WHERE clause restrictions! What if I want to perform «arbitrary» WHERE clause? search form scenario, dynamic search fields DO NOT RE-INVENT THE WHEEL! Apache Solr (Lucene) integration (Datastax Enterprise) Same JVM, 1-cluster-2-products (Solr & Cassandra) 56

WHERE clause restrictions! What if I want to perform «arbitrary» WHERE clause? search form scenario, dynamic search fields DO NOT RE-INVENT THE WHEEL! Apache Solr (Lucene) integration (Datastax Enterprise) Same JVM, 1-cluster-2-products (Solr & Cassandra) SELECT * FROM users WHERE solr_query = age:[33 TO *] AND gender:male ; SELECT * FROM users WHERE solr_query = lastname:*schwei?er ; 57

Collections & maps! CREATE TABLE users ( login text, name text, age int, friends set<text>, hobbies list<text>, languages map<int, text>, PRIMARY KEY(login)); Keep cardinality low! ( 1000) 58

From SQL to CQL! Normalized User CREATE TABLE comments ( article_id uuid, comment_id timeuuid, author_id text, // typical join id content text, PRIMARY KEY((article_id), comment_id)); 1 n Comment 59

From SQL to CQL! De-normalized User CREATE TABLE comments ( article_id uuid, comment_id timeuuid, author_json text, // de-normalize content text, PRIMARY KEY((article_id), comment_id)); 1 n Comment 60

Data modeling best practices! Start by queries identify core functional read paths 1 read scenario 1 SELECT 61

Data modeling best practices! Start by queries identify core functional read paths 1 read scenario 1 SELECT Denormalize wisely, only duplicate necessary & immutable data functional/technical trade-off 62

Data modeling best practices! Person JSON - firstname/lastname - date of birth - gender - mood - location 63

Data modeling best practices! John DOE, male birthdate: 21/02/1981 subscribed since 03/06/2011 San Mateo, CA Impossible is not John DOE Full detail read from User table on click 64

Lightweight Transaction (LWT)! What? make operations linearizable Why? solve a class of race conditions in Cassandra that would require installing an external lock manager 65

Lightweight Transaction (LWT)! SELECT * FROM account WHERE id= jdoe ; (0 rows) INSERT INTO account (id, email) VALUES ( jdoe, john_doe@fiction.com ); winner SELECT * FROM account WHERE id= jdoe ; (0 rows) INSERT INTO account (id, email) VALUES ( jdoe, jdoe@fiction.com ); 66

Lightweight Transaction (LWT)! How? implementing Paxos protocol on Cassandra Syntax? INSERT INTO account (id, email) VALUES ( jdoe, john_doe@fiction.com ) IF NOT EXISTS; UPDATE account SET email = jdoe@fiction.com IF email = john_doe@fiction.com WHERE id= jdoe ; 67

Lightweight Transaction (LWT)! Recommendations insert with LWT delete must use LWT INSERT INTO my_table IF NOT EXISTS DELETE FROM my_table IF EXISTS 68

Lightweight Transaction (LWT)! Recommendations LWT expensive (4 round-trips), do not abuse only for 1% 5% use cases 69

Lightweight Transaction (LWT)! 1 Queue-in 3 Consensus 2 Compare 4 Swap / Learn 70

! "! Q & R

Thank You duy_hai.doan@datastax.com https://academy.datastax.com/