Practical Cassandra. Vitalii Tymchyshyn tivv00@gmail.com @tivv00




Practical Cassandra
NoSQL key-value vs RDBMS: why and when
Cassandra architecture
Cassandra data model
Life without joins, or "HDD space is cheap today"
Hardware requirements & deployment hints
Vitalii Tymchyshyn tivv00@gmail.com @tivv00

RDBMS problems
Sometimes you reach the point where a single server can't cope. The relational answers each have a price:
Replication: not write-scalable; data is not instantly visible
Sharding: no foreign keys or joins; no transactions; reduced reliability (multiple servers); schema updates are a pain

Cassandra
NoSQL
Master-master replication + sharding in one bottle
Peer-to-peer architecture (no SPOF)
Easy cluster reconfiguration
Eventual consistency as a standard
All data in one record, no need to join
Flexible schema

Our data
We have an intelligent Internet cache
"Intelligent" means we don't cache everything, or we would need Google's data center
It's still hundreds of millions of sites
And tens of TB of packed data
Randomly updated
Analysis must be able to process all of this in a matter of hours

Cassandra ring (diagram; legend: server, client)

Ring partitioner types
Order Preserving: each server serves a key range; range queries are possible; read/write/disk space hot spots are possible; key ranges are complex to fix
Random: data is smoothly distributed across servers; no range queries; no hot spots; fixed key ranges
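The trade-off between the two partitioner types can be illustrated with a small simulation. This is only a sketch: the 2^127 token space, the use of MD5, the four node tokens and the "site-NNNN" keys are assumptions made for illustration, not Cassandra's actual implementation.

```python
import bisect
import hashlib
from collections import Counter

RING_SIZE = 2 ** 127  # size of the token space (assumption of this sketch)


def random_token(key):
    """Random partitioning: hash the key so tokens spread evenly over the ring."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def owner(token, node_tokens):
    """Index of the first node whose token is >= the given token (wraps around)."""
    return bisect.bisect_left(node_tokens, token) % len(node_tokens)


# Four hypothetical nodes.
hashed_ring = [RING_SIZE // 4 * i for i in range(1, 5)]  # evenly spaced tokens
ordered_ring = ["g", "n", "t", "z"]  # upper bounds of key ranges

# Hypothetical keys that share a common prefix.
keys = ["site-%04d" % i for i in range(1000)]

random_load = Counter(owner(random_token(k), hashed_ring) for k in keys)
ordered_load = Counter(owner(k, ordered_ring) for k in keys)

print("random partitioner:", sorted(random_load.items()))
print("order-preserving partitioner:", sorted(ordered_load.items()))
```

With the order-preserving ring, all keys with the same prefix land on one node (a hot spot), but neighbouring keys stay together, which is what makes range queries possible. Hashing spreads the load and destroys that ordering.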

Runtime CAP-solving
The whole thing is about replication
CAP: Consistency, Availability, Partition tolerance. Choose two.
With Cassandra you can choose at runtime.

Runtime CAP-solving
Quorum read/write
Fast writes
Fast reads
Fast, less consistency
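One common way to reason about these four options is the overlap rule: with N replicas, a read of R replicas is guaranteed to see a write acknowledged by W replicas when R + W > N. The sketch below assumes a replication factor of 3, and the mapping of each slide label to concrete (W, R) values is an interpretation, not something stated on the slide.

```python
def quorum(n):
    """Smallest majority of n replicas."""
    return n // 2 + 1


def read_sees_latest_write(n, w, r):
    """True when every read replica set must overlap every write replica set."""
    return r + w > n


N = 3  # replication factor (assumption)

# (W, R) per option; the mapping from slide labels is an assumption.
options = {
    "quorum read/write": (quorum(N), quorum(N)),
    "fast writes": (1, N),
    "fast reads": (N, 1),
    "fast, less consistency": (1, 1),
}

for name, (w, r) in options.items():
    overlap = read_sees_latest_write(N, w, r)
    print(f"{name}: W={w} R={r} overlap={overlap}")
```

Because W and R are chosen per operation, the same cluster can serve both strict and relaxed requests.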

Data model
Keyspaces: much like a database in an RDBMS
Column Families: the storage element, like tables in an RDBMS
Columns: you can have millions in a row; names are flexible, but they are still like columns in an RDBMS
Super Column: a column that has structured content; superseded by composite columns

Example
RDBMS, Twitter DB:
Users table: ID, Name, Birthday
Tweets table: UserID, TweetID, TweetContent
Cassandra, Twitter Keyspace:
Users CF: Key: User ID; columns: Name(Str), Birthday(Str)
Timeline CF: Key: User ID; columns: <TweetID>(TweetContent)

Example (alternative)
RDBMS, Twitter DB:
Users table: ID, Name, Birthday
Tweets table: UserID, TweetID, TweetContent
Cassandra, Twitter Keyspace:
Data CF: Key: User ID; columns: Name(Str), Birthday(Str), <TweetID>(TweetContent)

Example (data)
Users (ID, Name): (1, Tom), (2, John)
Tweets (User, ID, Text): (1, 1, Hello), (1, 2, See me?), (2, 3, See you!)
Data (Key: columns):
1: Name = Tom, T_1 = Hello, T_2 = See me?
2: Name = John, T_3 = See you!

Data model
You can have the same key in multiple column families
You can have a different set of columns for different keys in the same column family
You can query a range of columns for a key (columns are sorted), with pagination
You can have columns without values (and it's useful)
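The column range query with pagination can be sketched in plain Python, using the row for key 1 from the data example. The function names, the dict representation of a row and the "\uffff" end marker are assumptions of this sketch, not a Cassandra client API.

```python
import bisect

# Row for key 1 from the example: one static column plus one column per tweet.
row = {"Name": "Tom", "T_1": "Hello", "T_2": "See me?"}


def slice_columns(row, start, finish, limit):
    """Return up to `limit` (name, value) pairs with start <= name < finish."""
    names = sorted(row)  # columns are kept sorted by name
    lo = bisect.bisect_left(names, start)
    hi = bisect.bisect_left(names, finish)
    return [(name, row[name]) for name in names[lo:min(lo + limit, hi)]]


def paginate(row, start, finish, page_size):
    """Yield pages; each new page starts right after the last column seen."""
    while True:
        page = slice_columns(row, start, finish, page_size)
        if not page:
            return
        yield page
        start = page[-1][0] + "\x00"  # smallest name greater than the last one


# All tweet columns (names starting with "T_"), one column per page.
pages = list(paginate(row, "T_", "T_\uffff", 1))
print(pages)
```

A column without a value fits the same structure: the name alone carries the information, so a sorted set of names can act as a list of members or references.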

ACID vs BASE
Super heroes are good, but not scalable. So, what do we lose?

No Atomicity
You've got no transactions and no rollback
The most you get is an atomic update to a single row
A failed operation MAY still be applied (that's why counters are not reliable)

Eventual Consistency
Cassandra has no central governor
This means no bottleneck
This also means no one knows whether the database as a whole is consistent
Regular repair is your friend!

No Isolation
All mutations are timestamped to restore order from chaotic arrival
You MUST keep your clocks synchronized
That's how operations are applied on the server :)
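Timestamp ordering can be sketched as a last-write-wins rule per column. This is a simplified model: the `apply_mutation` helper and the tie handling (the existing value is kept) are assumptions of the sketch, not a description of Cassandra internals.

```python
def apply_mutation(row, column, value, timestamp):
    """Keep the value with the highest timestamp, whatever the arrival order."""
    current = row.get(column)
    if current is None or timestamp > current[1]:
        row[column] = (value, timestamp)


# Mutations arrive out of order: the newer write is delivered first.
row = {}
apply_mutation(row, "name", "new value", 200)
apply_mutation(row, "name", "old value", 100)
print(row["name"])  # the newer write survives

# A client whose clock runs behind writes later in real time,
# but its timestamp is lower, so its update is silently dropped.
skewed = {}
apply_mutation(skewed, "name", "written first", 1000)
apply_mutation(skewed, "name", "written second, slow clock", 900)
print(skewed["name"])
```

The second case is why clock synchronization matters: the server has only the timestamps to decide which write is the latest.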

Controlled Durability
Cassandra uses a transaction log to ensure durability on a single server
Durability of the whole database depends on both the total number of replicas and the number of replicas each write operation must reach
Remember: a single server with 99% uptime means only 36.6% (0.99^100) uptime for a fully working cluster of 100 servers. Most of the time you've got at least one server down!
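The 36.6% figure follows from multiplying the per-server uptime, assuming server failures are independent (an assumption; correlated failures such as a rack outage break it). The second function, which estimates how often all replicas of one row are down at once, is an added illustration under the same assumption.

```python
def all_nodes_up(node_uptime, nodes):
    """Probability that every node is up at the same moment."""
    return node_uptime ** nodes


def all_replicas_down(node_uptime, replicas):
    """Probability that all replicas of one row are down at the same moment."""
    return (1 - node_uptime) ** replicas


print(f"{all_nodes_up(0.99, 100):.1%} of the time all 100 servers are up")
print(f"{all_replicas_down(0.99, 3):.6%} of the time all 3 replicas are down")
```

So a large cluster is almost never fully healthy, yet with three replicas a given row is almost never completely unreachable.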

Data querying
With SQL you simply ask
You can easily scan the whole DB
Indexes may help
Any calculation is repeated each time
This can be slow on read

Data querying
With NoSQL you can't efficiently scan the whole DB
No GROUP BY or ORDER BY
You must prepare your data beforehand
You have multiple copies of data
You must recalculate on application logic change
The precalculated reads are fast

Think about your queries in advance!
There is no "I'll simply add an index and some hints, and my query will become fast"
Any index is created and maintained from application code
Cassandra now has secondary indexes, but they are much inferior to custom ones

What's wrong with secondary indexes
They work on fixed column names
They are consistent with data
This means they live near the data they index
This means they are distributed between nodes by row key, not by indexed column value
This means you need to ask every node to get a single value

What's wrong with secondary indexes
Node 1: rows A: phone=1, B: phone=3; phone index: 1=A, 3=B
Node 2: rows C: phone=3, D: phone=5; phone index: 3=C, 5=D
Node 3: rows E: phone=1, F: phone=5; phone index: 1=E, 5=F
Node 4: rows G: phone=3, H: phone=7; phone index: 3=G, 7=H
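The four-node picture can be replayed as a small simulation to show why a lookup by indexed value has to contact every node. The data is taken from the slide; the helper names and the in-memory dicts are assumptions of this sketch.

```python
# Rows per node, as on the slide: row key -> phone value.
nodes = {
    "Node 1": {"A": 1, "B": 3},
    "Node 2": {"C": 3, "D": 5},
    "Node 3": {"E": 1, "F": 5},
    "Node 4": {"G": 3, "H": 7},
}


def build_local_index(rows):
    """Each node indexes only the rows it stores: phone -> list of row keys."""
    index = {}
    for key, phone in rows.items():
        index.setdefault(phone, []).append(key)
    return index


local_indexes = {name: build_local_index(rows) for name, rows in nodes.items()}


def find_by_phone(phone):
    """No node knows where a phone value lives, so all of them are asked."""
    matches, contacted = [], 0
    for index in local_indexes.values():
        contacted += 1
        matches.extend(index.get(phone, []))
    return sorted(matches), contacted


print(find_by_phone(3))
print(find_by_phone(7))
```

Even for phone=7, which exists on a single node, the query still touches all four nodes.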

Index example
Column family people:
Key: Fred [phone=2223355, phone2=4445566, fax=9998877]
Key: John [phone=4445566, mobile=099123456]
Column family phone_directory:
Key: 2223355 [Fred]
Key: 4445566 [Fred, John]
Key: 9998877 [Fred]
Key: 099123456 [John]
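A custom index like phone_directory is written by the application alongside the data. The sketch below models both column families as nested dicts; `set_number` and `lookup` are hypothetical helpers, and changing or deleting a number (which would also require removing the old index entry) is left out.

```python
people = {}           # row key: person -> {column name: number}
phone_directory = {}  # row key: number -> {person: ""} (columns without values)


def set_number(person, column, number):
    """Write the data row and the index row together from application code."""
    people.setdefault(person, {})[column] = number
    phone_directory.setdefault(number, {})[person] = ""


def lookup(number):
    """A single-key read on the index row; no scan over people is needed."""
    return sorted(phone_directory.get(number, {}))


# Numbers are strings so that the leading zero is preserved.
set_number("Fred", "phone", "2223355")
set_number("Fred", "phone2", "4445566")
set_number("Fred", "fax", "9998877")
set_number("John", "phone", "4445566")
set_number("John", "mobile", "099123456")

print(lookup("4445566"))
print(lookup("099123456"))
```

Because the index row is keyed by the phone number, it lives on the node that owns that number, so one node answers the query.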

Join example
Column family customer:
Key: Boeing [email: boeing@boing.com]
Key: Oracle [skype: java]
Column family orders:
Key: 1 [customer: Boeing, total: 200m]
Key: 2 [customer: Oracle, total: 300m]
Key: 3 [customer: Boeing, total: 500m]
Column family customer_order_totals:
Key: Boeing [1: 200m, 3: 500m]
Key: Oracle [2: 300m]
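The join is replaced by a precomputed row per customer that the application updates on every order. In this sketch the column families are plain dicts, and `place_order` and `rebuild_totals` are hypothetical helpers; the rebuild step stands in for the full recalculation needed when the application logic changes.

```python
orders = {}                 # order id -> {"customer": ..., "total": ...}
customer_order_totals = {}  # customer -> {order id: total}


def place_order(order_id, customer_name, total):
    """Write the order and its denormalized copy at the same time."""
    orders[order_id] = {"customer": customer_name, "total": total}
    customer_order_totals.setdefault(customer_name, {})[order_id] = total


def rebuild_totals():
    """Recompute the denormalized view from scratch by scanning all orders."""
    rebuilt = {}
    for order_id, order in orders.items():
        rebuilt.setdefault(order["customer"], {})[order_id] = order["total"]
    return rebuilt


place_order(1, "Boeing", "200m")
place_order(2, "Oracle", "300m")
place_order(3, "Boeing", "500m")

# "All orders of Boeing" is now a single-row read instead of a join.
print(customer_order_totals["Boeing"])
```

The price is the extra copy of the data and the need to keep both writes in step, which ties back to the lack of multi-row atomicity.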

Peer-to-peer replication
Your operation can return OK even if it was not written to every replica
Hinted handoff will try to repair later
Even if your operation has failed, it may have been written to some replicas
This inconsistency won't be repaired automatically
These are the drawbacks of a masterless architecture
You need to repair regularly!

Tombstones and Repair
Delete events are recorded as tombstones, to ensure that data written before the delete but arriving after it won't be used
Regular repair not only makes sure your data is replicated, but also that your deletes are replicated
If you don't repair, beware of ghosts!
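The "ghost" effect can be reproduced with two toy replicas. This is a simplified model under stated assumptions: a cell is a (value, timestamp) pair, repair keeps the newest cell per key, and `purge_tombstones` stands in for tombstone garbage collection happening before repair has run.

```python
TOMBSTONE = object()  # marker stored in place of a deleted value


def repair(a, b):
    """Merge two replicas: for every key keep the cell with the newest timestamp."""
    merged = {}
    for key in set(a) | set(b):
        cells = [replica[key] for replica in (a, b) if key in replica]
        merged[key] = max(cells, key=lambda cell: cell[1])
    return merged


def purge_tombstones(replica):
    """Drop tombstones, as garbage collection eventually does."""
    return {k: cell for k, cell in replica.items() if cell[0] is not TOMBSTONE}


def read(replica, key):
    cell = replica.get(key)
    return None if cell is None or cell[0] is TOMBSTONE else cell[0]


replica_a = {"row": (TOMBSTONE, 2)}  # received the delete
replica_b = {"row": ("old", 1)}      # missed the delete

# Repair while the tombstone still exists: the delete wins everywhere.
print(read(repair(replica_a, replica_b), "row"))

# Tombstone purged before repair: the old value comes back as a ghost.
print(read(repair(purge_tombstones(replica_a), replica_b), "row"))
```

Running repair more often than tombstones are purged keeps the second case from happening.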

Resources & Environment
Disk space requirements
Memory requirements
Native plugins & configuration

Disk estimations
Say we've got 1TB of data
Replication factor 3 makes it 3TB
Data duplication makes it 12TB
Tombstones/repair space makes it 24TB
Backups make it 36TB
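The chain of estimates is a product of multipliers. In the sketch below the factors 4 (data duplication), 2 (tombstones/repair space) and 1.5 (backups) are inferred from the slide's numbers and are specific to this example; measure your own before sizing hardware.

```python
def estimate_disk_tb(raw_tb, replication_factor=3, duplication=4.0,
                     repair_headroom=2.0, backup_factor=1.5):
    """Multiply raw data size by each overhead; defaults reproduce the slide."""
    return (raw_tb * replication_factor * duplication
            * repair_headroom * backup_factor)


# Intermediate steps, matching the slide: 3TB, 12TB, 24TB, 36TB.
steps = [
    estimate_disk_tb(1, duplication=1, repair_headroom=1, backup_factor=1),
    estimate_disk_tb(1, repair_headroom=1, backup_factor=1),
    estimate_disk_tb(1, backup_factor=1),
    estimate_disk_tb(1),
]
print(steps)
```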

Memory estimations
Cassandra has certain in-memory structures that grow linearly with the amount of data
Key and row caches are configured at the column family level. Change the defaults if you've got a lot of CFs
Bloom filters and the key samples cache are configured globally in the latest versions
Estimate a minimum of RAM equal to ~0.5% of your data amount

Native specifics
Cassandra (like many other large things) likes JNA. Please install it.
Cassandra maps files to memory: the Cassandra process's virtual and resident memory size will grow because of mmap.
Default heap sizes are large: tame them if Cassandra is not the only task on the host

Q&A
Author: Vitalii Tymchyshyn tivv00@gmail.com @tivv00