Evaluation of NoSQL databases for large-scale decentralized microblogging

Similar documents
Benchmarking Couchbase Server for Interactive Applications. By Alexey Diomin and Kirill Grigorchuk

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Practical Cassandra. Vitalii

Scalable Architecture on Amazon AWS Cloud

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Distributed Systems. Tutorial 12 Cassandra

LARGE-SCALE DATA STORAGE APPLICATIONS

Can the Elephants Handle the NoSQL Onslaught?

NoSQL Databases. Institute of Computer Science Databases and Information Systems (DBIS) DB 2, WS 2014/2015

How To Scale Out Of A Nosql Database

Couchbase Server Under the Hood

Apache Hadoop. Alexandru Costan

MongoDB and Couchbase

Big Systems, Big Data

Cassandra A Decentralized, Structured Storage System

NoSQL and Hadoop Technologies On Oracle Cloud

Benchmarking Cassandra on Violin

NoSQL Databases. Nikos Parlavantzas

On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform

Big Data Development CASSANDRA NoSQL Training - Workshop. March 13 to am to 5 pm HOTEL DUBAI GRAND DUBAI

An Open Source NoSQL solution for Internet Access Logs Analysis

Technical Overview Simple, Scalable, Object Storage Software

Dell Reference Configuration for DataStax Enterprise powered by Apache Cassandra

HDB++: HIGH AVAILABILITY WITH. l TANGO Meeting l 20 May 2015 l Reynald Bourtembourg

A programming model in Cloud: MapReduce

Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world

INTRODUCTION TO CASSANDRA

Big Data JAMES WARREN. Principles and best practices of NATHAN MARZ MANNING. scalable real-time data systems. Shelter Island

Introduction to Cassandra

Why NoSQL? Your database options in the new non- relational world IBM Cloudant 1

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

High Throughput Computing on P2P Networks. Carlos Pérez Miguel

Hands-on Cassandra. OSCON July 20, Eric

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Alfresco Enterprise on AWS: Reference Architecture

Structured Data Storage

Dynamo: Amazon s Highly Available Key-value Store

Realtime Apache Hadoop at Facebook. Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens

NoSQL: Going Beyond Structured Data and RDBMS

Introduction to Apache Cassandra

Design and Evolution of the Apache Hadoop File System(HDFS)

Comparing SQL and NOSQL databases

Lecture Data Warehouse Systems

FAWN - a Fast Array of Wimpy Nodes

Cassandra. Jonathan Ellis

Evaluator s Guide. McKnight. Consulting Group. McKnight Consulting Group

References. Introduction to Database Systems CSE 444. Motivation. Basic Features. Outline: Database in the Cloud. Outline

Introduction to Database Systems CSE 444

Hadoop IST 734 SS CHUNG

Implement Hadoop jobs to extract business value from large and varied data sets

America s Most Wanted a metric to detect persistently faulty machines in Hadoop

Hypertable Architecture Overview

A Survey of Distributed Database Management Systems

Getting Started with SandStorm NoSQL Benchmark

Real-Time Big Data in practice with Cassandra. Michaël

Big Data With Hadoop

F1: A Distributed SQL Database That Scales. Presentation by: Alex Degtiar (adegtiar@cmu.edu) /21/2013

Apache HBase. Crazy dances on the elephant back

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

Cassandra A Decentralized Structured Storage System

A Performance Analysis of Distributed Indexing using Terrier

Case study: CASSANDRA

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com

Joining Cassandra. Luiz Fernando M. Schlindwein Computer Science Department University of Crete Heraklion, Greece

Introduction to Big Data Training

Ankush Cluster Manager - Cassandra Technology User Guide

RAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University

So What s the Big Deal?

Infrastructures for big data

Intro to AWS: Storage Services

16.1 MAPREDUCE. For personal use only, not for distribution. 333

Best Practices for Hadoop Data Analysis with Tableau

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

<Insert Picture Here> Oracle NoSQL Database A Distributed Key-Value Store

Benchmarking Top NoSQL Databases Apache Cassandra, Couchbase, HBase, and MongoDB Originally Published: April 13, 2015 Revised: May 27, 2015

Hadoop & its Usage at Facebook

Cloud Scale Distributed Data Storage. Jürmo Mehine

NoSQL Database Options

Cloud Computing For Bioinformatics

MASTER PROJECT. Resource Provisioning for NoSQL Datastores

nosql and Non Relational Databases

Scaling Pinterest. Yash Nelapati Ascii Artist. Pinterest Engineering. Saturday, August 31, 13

Cloud Computing with Microsoft Azure

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Simba Apache Cassandra ODBC Driver

Amazon EC2 Product Details Page 1 of 5


Making Sense ofnosql A GUIDE FOR MANAGERS AND THE REST OF US DAN MCCREARY MANNING ANN KELLY. Shelter Island

Apache Cassandra for Big Data Applications

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Configuration and Deployment Guide for the Cassandra NoSQL Data Store on Intel Architecture

An Approach to Implement Map Reduce with NoSQL Databases

SWIFT. Page:1. Openstack Swift. Object Store Cloud built from the grounds up. David Hadas Swift ATC. HRL 2012 IBM Corporation

Transcription:

Evaluation of NoSQL databases for large-scale decentralized microblogging Cassandra & Couchbase Alexandre Fonseca, Anh Thu Vu, Peter Grman Decentralized Systems - 2nd semester 2012/2013 Universitat Politècnica de Catalunya

Microblogging Blogging with very small & concise posts. Users subscribe to others by following them. Real-time interactions between growing userbase. Useful for: Real-time updates & news. Marketing and public relations campaigns. Research into social interaction & perception. Examples: Twitter, Tumblr, identi.ca, Facebook & Google+ status updates.

Microblogging - decentralization Twitter: Over 500 million registered users. 340 million tweets per day. More than 12TB of data per day in 2010. Hard to provide this service relying solely on a small number of centralized servers. Scaling up has its limits.

Microblogging - decentralization 2 possible ways to decentralize: Decentralize dedicated servers managed by the service provider - better load sharing. Allow users to share the system load between them (similar to voluntary computing).

Relational Database Systems (RDS) Typical enterprise and service data storage. Tables and table rows as storage units. Highly normalized. Storage divided into entity types. One table per entity type. One row per entity instance. Many-to-many relationship tables. Transaction support for maximum consistency.

RDS - Twitter Example Tweets Users ID Name Password ID PosterID Time Body 1 Alex ********* 1 1 1345453 Burguer 2 Casey ********* 2 2 1345455 Nutella 3 Peter ********* 3 3 1345457 Beer JOIN Followers ID Follows 1 2 1 3 2 1 Timeline for User Alex Tweet ID Poster Time Body 2 Casey 1345453 Nutella 3 Peter 1345455 Beer

NoSQL New (or is it?) data storage paradigm: KISS. Simple key-value stores in its essence. (Usually) no transaction control or strict schemas. Incentive denormalization. Let the application worry about business logic and entity relationships. Focus on: Flexibility (of schemas, topologies, configurations) Scalability Robustness Performance

NoSQL - Twitter Example Users Tweets [{'id': 1, 'name': 'Alex', 'password': '***', 'following':[2, 3]}, {'id': 2, 'name': 'Casey', 'password': '***', 'following':[1]}, {'id': 3, 'name': 'Peter', 'password': '***', 'following':[1, 2]}] [{'id': 1, 'posterid': 1, 'time': 1345453, body: 'Burger'}, {'id': 2, 'posterid': 2, 'time': 1345455, body: 'Nutella'}, {'id': 3, 'posterid': 3, 'time': 1345457, body: 'Beer'}] Timeline-View [{'user': 1, 'tweetid': 2, 'postedby': 2, tweetbody: 'Nutella'}, {'user': 1, 'tweetid': 3, 'postedby': 3, tweetbody: 'Beer'}, {'user': 2, 'tweetid': 1, 'postedby': 1, tweetbody: 'Burger'}, {'user': 3, 'tweetid': 1, 'postedby': 1, tweetbody: 'Burger'}, {'user': 3, 'tweetid': 2, 'postedby': 2, tweetbody: 'Nutella'}]

NoSQL - Twitter Example Users Tweets [{'id': 1, 'name': 'Alex', 'password': '***', 'followers':[2, 3]}, {'id': 2, 'name': 'Casey', 'password': '***', 'followers':[1, 3]}, {'id': 3, 'name': 'Peter', 'password': '***', 'followers':[1]}] [{'id': 1, 'posterid': 1, 'time': 1345453, body: 'Burger'}, {'id': 2, 'posterid': 2, 'time': 1345455, body: 'Nutella'}, {'id': 3, 'posterid': 3, 'time': 1345457, body: 'Beer'}] Timeline-View [{'user': 1, 'tweetid': 2, 'postedby': 2, tweetbody: 'Nutella'}, {'user': 1, 'tweetid': 3, 'postedby': 3, tweetbody: 'Beer'}, {'user': 2, 'tweetid': 1, 'postedby': 1, tweetbody: 'Burger'}, {'user': 3, 'tweetid': 1, 'postedby': 1, tweetbody: 'Burger'}, {'user': 3, 'tweetid': 2, 'postedby': 2, tweetbody: 'Nutella'}]

Cassandra Initially developed by Facebook. Released in 2008. Apache Top-Project by 2010. Latest stable release (1.2.4, April 2013) Data model: Hybrid between key-value and tabular storage. Data split in column families (like RDS tables). Column families split into rows (indexed by row keys). Dynamic columns (schema). Interface: CQL via Thrift.

Cassandra All database nodes play the same role. Nodes divided into variable number of virtual nodes: Organized into ring structure. Assigned tokens determine data assignment: Consistent hashing of partition key (1st column of primary key). Hashing can be parametrized. Cluster can be given topology awareness: Try to keep request resolution local to rack or datacenter. Replicate data over different data centers and racks. Replication and consistency options all configurable.

Couchbase = + Current release v2.0.1 (Apr 9, 2013)

Couchbase Key - Value Key: string, no space allowed, 250 bytes limit Value: JSON document, integer or serializable byte stream, 20MB limit View: MapReduce: to index all data Incrementally and periodically updated

Couchbase All nodes play the same role A bucket (or database) is divided into small subsets, called vbuckets distributed in the cluster. Replication factor: configurable for each bucket with a max of 3. Consistency: Strong consistency for access with "key" Eventual consistency for queries on views

PyDLoader Custom Python script 2 components: Manager: Interactive console. Deploy new nodes. Install and control slaves on those nodes. Slaves: Automatic setup, populating and takedown of database clusters (database slaves). Generation of application workload towards database clusters (workload slaves). Used libraries: Boto (AWS), Paramiko (SSH), RpyC (RPC).

Evaluation Done using Amazon Web Services (AWS). 6 database nodes (m1.small) 1.7 GB of RAM 1 compute unit 150GB of storage 12 workload nodes (t1.micro) 615 MB or RAM 1 compute unit No local storage (NAS) Distributed equally over 2 datacenters in Ireland.

Evaluation Focus on 3 main operations: Tweet Userline (all tweets by user) Timeline (all tweets by people user is following). Limit to 50 items per userline/timeline. Basic data structure: Aggregate data according to operations, not entities: Timeline table/bucket: UserID, TweetID, PostedBy, Body Userline table/bucket: UserID, TweetID, Body

Ease of Setup Cassandra - Awesome! Install, configure addresses, partitioner, snitch, replication factors and seed nodes, launch! Automatic partitioning and replication. Automatic adjustment to churn. Couchbase - Equivalently Awesome! Install, configure RAM, directory, launch! Easily add/failover/remove servers Rebalance: background process, asynchronous, incremental Automatic replication, partitioning and node failure detection.

Setup time Cassandra: 6 nodes form and stabilize ring after ~2 minutes. Populating with sample data: ~1 minute in 1 node configuration. 6.5 minutes in 6 node configuration. Couchbase: 6 nodes form and stabilize ring after ~2 minutes. Populating with sample data: ~1 minute in 1 node configuration. ~2 minutes in 6 node configuration.

Latency Tweet Latency Userline Latency

Latency Timeline Latency

Normalized latency Tweet Latency

Normalized latency Userline Latency Timeline Latency

Tweets & Denormalization Tweet Latency v.s. Number of Followers

Tweets & Denormalization Immediate denormalization not scalable. How to make asynchronous? Cassandra: No native support. External processing. Couchbase: Views!

Reconfiguration Latency Cassandra Adding a new node to a cluster: ~5 mins. Latency while node joining cluster

Reconfiguration Latency Couchbase Adding a new node to a cluster: immediate BUT rebalance ~ 30 mins Latency while node joining cluster

Consistency/Convergence Cassandra: Average of 0.096498 seconds to detect new tweet. Standard deviation of 0.096319: Same datacenter => very fast detection. Different datacenter => slower detection. Couchbase: Average of 0.001593 seconds to detect new tweet with standard deviation of 0.000526 The delay for new tweet to appear on timeline proportional to the schedule period

Replication Cassandra: Very flexible in terms of replication configuration. Per datacenter replication factors with no hard-coded limitations. Behaviour under crash of 2 nodes @ second 100th

Replication Couchbase Automatic and configurable per bucket with a limit of (1+3) replicas. Behaviour under crash of 2 nodes @ second 30th, 100th

Load Balancing Cassandra: Average data ownership per node: 16.68% Standard deviation: 1.23% Couchbase: Evenly distributed with standard deviation: 0.65% for data on disk 0.04% for RAM usage

System Load CPU Idle Time Free Memory

System Load Disk IO

Disk Space Usage SQLite database: 1.6MB Cassandra: Fully denormalized (single node): 16.31MB. Fully denormalized (6-nodes): 9.5MB/node. Includes partitioning and replication. "Normalized" (no body in userline/timeline): Single node: 2.8MB 6-nodes: 1.6MB/node. Commit log after populating: 54MB Need 50% of free disk space at all times: Column family compactions. Data redistributions.

Disk Space Usage Couchbase: Total of 250MB distributed across 6 nodes. No minimum free space requirement. Lack of disk usage limitations in both DBs: Not ideal for voluntary computing systems.

Conclusion Easy cluster setup Allows horizontal scaling over multiple nodes Improved performance through denormalization Higher storage requirements The future is hybrid A mix of RDS and NoSQL-Systems Cassandra's CQL, CouchBase's Views, NoSQL in RDS