Apache Cassandra Present and Future. Jonathan Ellis





History
- Bigtable, 2006
- Dynamo, 2007
- OSS, 2008
- Incubator, 2009
- TLP, 2010
- 1.0, October 2011

Why people choose Cassandra
- Multi-master, multi-dc
- Linearly scalable
- Larger-than-memory datasets
- High performance
- Full durability
- Integrated caching
- Tuneable consistency

Cassandra users
- Financial
- Social Media
- Advertising
- Entertainment
- Energy
- E-tail
- Health care
- Government

Road to 1.0
- Storage engine improvements
- Specialized replication modes
- Performance and other real-world considerations
- Building an ecosystem

Compaction
- Size-Tiered
- Leveled

Differences in Cassandra's leveling
- ColumnFamily instead of key/value
- Multithreading (experimental)
- Optional throttling (16MB/s by default)
- Per-sstable bloom filter for row keys
- Larger data files (5MB by default)
- Does not block writes if compaction falls behind

Column ("secondary") indexing

cqlsh> CREATE INDEX state_idx ON users(state);
cqlsh> INSERT INTO users (uname, state, birth_date) VALUES ('bsanderson', 'UT', 1975);

users
              state   birth_date
bsanderson    UT      1975
prothfuss     WI      1973
htayler       UT      1968

state_idx
UT    bsanderson, htayler
WI    prothfuss

Querying

cqlsh> SELECT * FROM users WHERE state='UT' AND birth_date > 1970 ORDER BY tokenize(key);
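The indexed query above can be illustrated with a small in-memory sketch. This is not Cassandra's storage code: the `users` and `state_idx` dictionaries and the helper names `insert_user` and `query` are hypothetical stand-ins, chosen to mirror the slide's table names. It shows the general pattern of resolving the equality predicate through the index and then filtering the remaining predicate row by row.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the "users" rows and the "state_idx" index.
users = {}
state_idx = defaultdict(set)


def insert_user(uname, state, birth_date):
    """Write the row and keep the index entry for its state in step."""
    old = users.get(uname)
    if old is not None:
        # Drop the stale index entry before re-indexing an overwritten row.
        state_idx[old["state"]].discard(uname)
    users[uname] = {"state": state, "birth_date": birth_date}
    state_idx[state].add(uname)


def query(state, born_after):
    """Use the index for the equality predicate, then filter the rest.

    Results are sorted by name here for determinism; the slide's query
    orders by the token of the row key instead.
    """
    return sorted(
        uname
        for uname in state_idx.get(state, ())
        if users[uname]["birth_date"] > born_after
    )


insert_user("bsanderson", "UT", 1975)
insert_user("prothfuss", "WI", 1973)
insert_user("htayler", "UT", 1968)

print(query("UT", 1970))  # ['bsanderson']
```

Because index values are compared exactly in this sketch, looking up `"ut"` instead of `"UT"` returns no rows.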

More sophisticated indexing?

Want to:
- Support more operators
- Support user-defined ordering
- Support high-cardinality values

But:
- Scatter/gather scales poorly
- If it's not node local, we can't guarantee atomicity
- If we can't guarantee atomicity, double-checking across nodes is a huge penalty

Other storage engine improvements
- Compression
- Expiring columns
- Bounded worst-case reads by re-writing fragmented rows

Eventually-consistent counters
- Counter is partitioned by replica; each replica is master of its own partition
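A minimal sketch of a counter partitioned by replica, assuming each shard carries a logical clock so that merging can keep the newer copy. The class name `ReplicaCounter`, its methods, and the `(clock, subtotal)` layout are illustrative assumptions, not Cassandra's actual implementation.

```python
class ReplicaCounter:
    """One replica's view of a counter partitioned by replica id (illustrative)."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        # replica id -> (clock, subtotal); the clock counts that replica's updates.
        self.shards = {}

    def increment(self, delta=1):
        # A replica only ever writes its own shard, so it is the sole
        # master of that partition of the counter.
        clock, subtotal = self.shards.get(self.replica_id, (0, 0))
        self.shards[self.replica_id] = (clock + 1, subtotal + delta)

    def merge(self, other):
        # For each shard, keep whichever copy has the higher clock.
        for rid, (clock, subtotal) in other.shards.items():
            if clock > self.shards.get(rid, (0, 0))[0]:
                self.shards[rid] = (clock, subtotal)

    def value(self):
        # The counter's value is the sum of all shard subtotals.
        return sum(subtotal for _, subtotal in self.shards.values())


a, b = ReplicaCounter("A"), ReplicaCounter("B")
a.increment(3)
b.increment(2)
b.increment(-1)
a.merge(b)
b.merge(a)
print(a.value(), b.value())  # 4 4
```

Merging is idempotent in this sketch: applying the same remote state twice leaves the value unchanged, which is what lets replicas exchange state repeatedly and still converge.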

The interesting parts
- Tuneable consistency
- Avoiding contention on local increments
- Store increments, not full value; merge on read + compaction
- Renewing counter id to deal with data loss
- Interaction with tombstones

(What about version vectors?)
- Version vectors allow detecting conflict, but do not give enough information to resolve it except in special cases
- In the counters case, we'd need to hold onto all previous versions until we can be sure no new conflict with them can occur
- Jeff Darcy has a longer discussion at http://pl.atyp.us/wordpress/?p=2601
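The point about detection versus resolution can be sketched with a generic version-vector comparison. The function `compare` and the dictionary representation are assumptions made for illustration; nothing here is taken from Cassandra.

```python
def compare(a, b):
    """Compare two version vectors given as {node id: counter} dicts."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        # Each side has seen an update the other has not: concurrent writes.
        return "conflict"
    if a_ahead:
        return "a descends from b"
    if b_ahead:
        return "b descends from a"
    return "equal"


print(compare({"A": 2, "B": 1}, {"A": 1, "B": 1}))  # a descends from b
print(compare({"A": 2, "B": 1}, {"A": 1, "B": 2}))  # conflict
```

The "conflict" result only says that the two versions are concurrent. It carries no information about what each side added relative to a shared ancestor, which is why a counter could not be reconciled from the vectors alone.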

Performance
- A single four-core machine; one million inserts + one million updates

Dealing with the JVM

JNA
- mlockall()
- posix_fadvise()
- link()

Memory
- Move cache off-heap
- In-heap arena allocation for memtables, bloom filters
- Move compaction to a separate process?

The Cassandra ecosystem
- Replication into Cassandra: Gigaspaces, Drizzle
- Solandra: Cassandra + Solr search
- DataStax Enterprise: Cassandra + Hadoop analytics

DataStax Enterprise

Operations

Vanilla Hadoop
- 8+ services to set up, monitor, back up, and recover (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker, Zookeeper, Metastore, ...)
- Single points of failure
- Can't separate online and offline processing

DataStax Enterprise
- Single, simplified component
- Peer to peer
- JobTracker failover
- No additional Cassandra config

What's next
- Ease of use
- CQL: native transport, prepared statements
- Triggers
- Entity groups
- Smarter range queries enabling Hive predicate pushdown
- Blue sky: streaming / CEP

Questions?
jbellis@datastax.com

DataStax is hiring!
- ~15 engineers (35 employees), want to double in 2012
- Austin, Burlingame, NYC, France, Japan, Belarus
- 100+ customers
- $11M Series B funding