Cassandra A Decentralized Structured Storage System



Similar documents
Cassandra A Decentralized, Structured Storage System

Distributed Storage Systems part 2. Marko Vukolić Distributed Systems and Cloud Computing

So What s the Big Deal?

A Review of Column-Oriented Datastores. By: Zach Pratt. Independent Study Dr. Maskarinec Spring 2011

LARGE-SCALE DATA STORAGE APPLICATIONS

Distributed Systems. Tutorial 12 Cassandra

Slave. Master. Research Scholar, Bharathiar University

Introduction to NOSQL

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Cloud Scale Distributed Data Storage. Jürmo Mehine

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

No-SQL Databases for High Volume Data

Xiaowe Xiaow i e Wan Wa g Jingxin Fen Fe g n Mar 7th, 2011

Amr El Abbadi. Computer Science, UC Santa Barbara

these three NoSQL databases because I wanted to see a the two different sides of the CAP

Cassandra. Jonathan Ellis

NoSQL Databases. Institute of Computer Science Databases and Information Systems (DBIS) DB 2, WS 2014/2015

Comparison of the Frontier Distributed Database Caching System with NoSQL Databases

Introduction to Apache Cassandra

NoSQL Database Options

Why NoSQL? Your database options in the new non- relational world IBM Cloudant 1

extensible record stores document stores key-value stores Rick Cattel s clustering from Scalable SQL and NoSQL Data Stores SIGMOD Record, 2010

CS 4604: Introduc0on to Database Management Systems. B. Aditya Prakash Lecture #13: NoSQL and MapReduce

NOSQL DATABASES AND CASSANDRA

Lecture Data Warehouse Systems

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

A Survey of Distributed Database Management Systems

INTRODUCTION TO CASSANDRA

Introduction to Big Data Training

Consistency Trade-offs for SDN Controllers. Colin Dixon, IBM February 5, 2014

Case study: CASSANDRA

MASTER PROJECT. Resource Provisioning for NoSQL Datastores

Structured Data Storage

Integrating Big Data into the Computing Curricula

Preparing Your Data For Cloud

Distributed Data Stores

HDB++: HIGH AVAILABILITY WITH. l TANGO Meeting l 20 May 2015 l Reynald Bourtembourg

OldSQL vs. NoSQL vs. NewSQL on New OLTP

nosql and Non Relational Databases

Infrastructures for big data

Study and Comparison of Elastic Cloud Databases : Myth or Reality?

The NoSQL Ecosystem, Relaxed Consistency, and Snoop Dogg. Adam Marcus MIT CSAIL

Evaluation of NoSQL databases for large-scale decentralized microblogging

NoSQL Databases. Nikos Parlavantzas

Dominik Wagenknecht Accenture

Distributed Data Parallel Computing: The Sector Perspective on Big Data

Joining Cassandra. Luiz Fernando M. Schlindwein Computer Science Department University of Crete Heraklion, Greece

Cloud data store services and NoSQL databases. Ricardo Vilaça Universidade do Minho Portugal

Practical Cassandra. Vitalii

Database Management System Choices. Introduction To Database Systems CSE 373 Spring 2013

Introduction to NoSQL

Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world

Dynamo: Amazon s Highly Available Key-value Store

Highly available, scalable and secure data with Cassandra and DataStax Enterprise. GOTO Berlin 27 th February 2014

High Throughput Computing on P2P Networks. Carlos Pérez Miguel

Cloud Computing Is In Your Future

Design Patterns for Distributed Non-Relational Databases

OLTP Meets Bigdata, Challenges, Options, and Future Saibabu Devabhaktuni

Google Bing Daytona Microsoft Research

Logistics. Database Management Systems. Chapter 1. Project. Goals for This Course. Any Questions So Far? What This Course Cannot Do.

Can the Elephants Handle the NoSQL Onslaught?

Database Scalability {Patterns} / Robert Treat

Cloud DBMS: An Overview. Shan-Hung Wu, NetDB CS, NTHU Spring, 2015

Cassandra vs MySQL. SQL vs NoSQL database comparison

A Distributed Storage Schema for Cloud Computing based Raster GIS Systems. Presented by Cao Kang, Ph.D. Geography Department, Clark University

A COMPARATIVE STUDY OF NOSQL DATA STORAGE MODELS FOR BIG DATA

How To Scale Out Of A Nosql Database

09 Cloud Storage. NoSQL Databases. Dynamo: Amazon s Highly Available Key-value Store. Christof Strauch

Big Data Development CASSANDRA NoSQL Training - Workshop. March 13 to am to 5 pm HOTEL DUBAI GRAND DUBAI

F1: A Distributed SQL Database That Scales. Presentation by: Alex Degtiar (adegtiar@cmu.edu) /21/2013

NewSQL: Towards Next-Generation Scalable RDBMS for Online Transaction Processing (OLTP) for Big Data Management

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

Transactions and ACID in MongoDB

Apache HBase. Crazy dances on the elephant back

Chapter 7: Distributed Systems: Warehouse-Scale Computing. Fall 2011 Jussi Kangasharju

Data. Data S. A Model for Capacity Planning in Cassandra. Sharath Abbireddy. Case Study on Ericsson s Voucher System. Thesis no: MSCS


Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

INTERNATIONAL JOURNAL of RESEARCH GRANTHAALAYAH A knowledge Repository

Non-Stop for Apache HBase: Active-active region server clusters TECHNICAL BRIEF

NoSQL Databases. Polyglot Persistence

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University

Cloud Computing with Microsoft Azure

The Apache Cassandra storage engine

Big Data Technologies Compared June 2014

Comparing SQL and NOSQL databases

Current Data Security Issues of NoSQL Databases

INFO5011 Advanced Topics in IT: Cloud Computing

The Quest for Extreme Scalability

Referential Integrity in Cloud NoSQL Databases

Big Data, Deep Learning and Other Allegories: Scalability and Fault- tolerance of Parallel and Distributed Infrastructures.

Big Data With Hadoop

Pla7orms for Big Data Management and Analysis. Michael J. Carey Informa(on Systems Group UCI CS Department

Open Source Technologies on Microsoft Azure

Performance Management in Big Data Applica6ons. Michael Kopp, Technology

Realtime Apache Hadoop at Facebook. Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens

NoSQL systems: introduction and data models. Riccardo Torlone Università Roma Tre

Applications for Big Data Analytics

What Next for DBAs in the Big Data Era

Transcription:

Cassandra A Decentralized Structured Storage System Avinash Lakshman, Prashant Malik LADIS 2009 Anand Iyer CS 294-110, Fall 2015

Historic Context Early & mid 2000: Web applicaoons grow at tremendous rates Big data 3 V s (Volume, Velocity, Variety) places lots of demands Need for scalable and flexible data storage soluoons TradiOonal soluoons Open source (MySQL, Postgres) Scalability, elasocity Commercial (Oracle, SQL Server) Cost, dependency SQL not really amenable to distributed operaoons Leads to NoSQL revoluoon

Historic Context Big Table (OSDI 2006) MulOple a]ribute based access Master slave architecture Consistency over availability Dynamo (SOSP 2007) User shopping cart storage Decentralized High availability ($$$) Simpler KV model

Problem Facebook Inbox Search Enable users to search through inbox MulOple a]ributes Manage data spread across mulople datacenters Provide high availability and no single point of failure (Why?) Treat failures as the norm. BigTable + Dynamo =

Key Idea (1) Leverage paroooning and replicaoon techniques from Dynamo ParOOon based on consistent hashing (like Dynamo) Load balance by moving lightly loaded nodes on the ring Replicate based on policies Dynamo style replicaoon for simple policies Zookeeper based for more involved policies (e.g. Rack Aware)

Key Idea (2) Use a richer data model similar to Big Table Distributed mulodimensional map Column Families (similar to BigTable) and Supercolumns Persistence using commit logs, memtables and SSTable compacoon Later versions of Cassandra made changes to many of these

Fundamental Tradeoffs Tradeoff #1: Between consistency and availability in the face of network parooons (CAP theorem) Favor availability over consistency Allow tunable consistency Quorum based Strict quorum => strong consistency (R + W > N) ParOal quorum => eventual consistency (R+W <= N) Tradeoff #2: Latency v/s consistency guarantee during normal operaoons

Influence Highly popular, one of the most popular NoSQL stores. InstallaOon at 1500+ companies including ebay, Nemlix, Github, etc. Largest known deployment at Apple, with over 75,000 nodes storing over 10 PB of data. But also acquired FoundaOonDB recently! Influenced the development of several other NoSQL databases.

Impact of NoSQL Of course DB folks are not happy Eventual consistency = creates garbage (Michael Stonebraker, LISA 2011) Several a]empts to scale tradioonal DBMS MySQL Cluster VoltDB Claims 5-7X improvements over Cassandra NewSQL Offer SQL and ACID with NoSQL s scalability Focus on reducing overheads using systems techniques But don t give up SQL or ACID Mostly in- memory

At the same Ome Cassandra 2.0: Secondary index CQL (SQL like interface) Lightweight transacoons (ACID) Triggers (stored procedures) Cassandra 3.0: Materialized views UDF, UDAs

Going Forward Storage problem will likely worsen Easier to collect data Machine generated Mobile App Era Billions of smartphone users Mobile technology improving Internet- of- Things Variety will increase, flexible schema more important BigTable/Cassandra will remain influenoal, but what about the consistency semanocs?

Is eventual consistency really garbage? PBS (Bailis et. al.): in prac)ce, and in the average case, eventually consistent data stores o6en deliver consistent data. Facebook s move to HBase We found Cassandra's eventual consistency model to be a difficult pa@ern to reconcile for our new Messages infrastructure. Newer systems have transacoons as a primary goal Google Spanner (OSDI 12, later in the class) We believe it is be@er to have applica)on programmers deal with performance problems due to overuse of transac)ons as bo@lenecks arise, rather than always coding around the lack of transac)ons. Google F1 (SIGMOD 12, later in the class) We also have a lot of experience with eventual consistency systems at Google. In all such systems, we find developers spend a significant frac)on of their )me building extremely complex and error- prone mechanisms to cope with eventual consistency and handle data that may be out of date. Is this a characterisoc of new emerging workloads/seungs?