1 Evaluation of NoSQL databases for large-scale decentralized microblogging Cassandra & Couchbase Alexandre Fonseca, Anh Thu Vu, Peter Grman Decentralized Systems - 2nd semester 2012/2013 Universitat Politècnica de Catalunya
2 Microblogging Blogging with very small & concise posts. Users subscribe to others by following them. Real-time interactions between growing userbase. Useful for: Real-time updates & news. Marketing and public relations campaigns. Research into social interaction & perception. Examples: Twitter, Tumblr, identi.ca, Facebook & Google+ status updates.
3 Microblogging - decentralization Twitter: Over 500 million registered users. 340 million tweets per day. More than 12TB of data per day in Hard to provide this service relying solely on a small number of centralized servers. Scaling up has its limits.
4 Microblogging - decentralization 2 possible ways to decentralize: Decentralize dedicated servers managed by the service provider - better load sharing. Allow users to share the system load between them (similar to voluntary computing).
5 Relational Database Systems (RDS) Typical enterprise and service data storage. Tables and table rows as storage units. Highly normalized. Storage divided into entity types. One table per entity type. One row per entity instance. Many-to-many relationship tables. Transaction support for maximum consistency.
6 RDS - Twitter Example Tweets Users ID Name Password ID PosterID Time Body 1 Alex ********* Burguer 2 Casey ********* Nutella 3 Peter ********* Beer JOIN Followers ID Follows Timeline for User Alex Tweet ID Poster Time Body 2 Casey Nutella 3 Peter Beer
7 NoSQL New (or is it?) data storage paradigm: KISS. Simple key-value stores in its essence. (Usually) no transaction control or strict schemas. Incentive denormalization. Let the application worry about business logic and entity relationships. Focus on: Flexibility (of schemas, topologies, configurations) Scalability Robustness Performance
10 Cassandra Initially developed by Facebook. Released in Apache Top-Project by Latest stable release (1.2.4, April 2013) Data model: Hybrid between key-value and tabular storage. Data split in column families (like RDS tables). Column families split into rows (indexed by row keys). Dynamic columns (schema). Interface: CQL via Thrift.
11 Cassandra All database nodes play the same role. Nodes divided into variable number of virtual nodes: Organized into ring structure. Assigned tokens determine data assignment: Consistent hashing of partition key (1st column of primary key). Hashing can be parametrized. Cluster can be given topology awareness: Try to keep request resolution local to rack or datacenter. Replicate data over different data centers and racks. Replication and consistency options all configurable.
12 Couchbase = + Current release v2.0.1 (Apr 9, 2013)
13 Couchbase Key - Value Key: string, no space allowed, 250 bytes limit Value: JSON document, integer or serializable byte stream, 20MB limit View: MapReduce: to index all data Incrementally and periodically updated
14 Couchbase All nodes play the same role A bucket (or database) is divided into small subsets, called vbuckets distributed in the cluster. Replication factor: configurable for each bucket with a max of 3. Consistency: Strong consistency for access with "key" Eventual consistency for queries on views
15 PyDLoader Custom Python script 2 components: Manager: Interactive console. Deploy new nodes. Install and control slaves on those nodes. Slaves: Automatic setup, populating and takedown of database clusters (database slaves). Generation of application workload towards database clusters (workload slaves). Used libraries: Boto (AWS), Paramiko (SSH), RpyC (RPC).
16 Evaluation Done using Amazon Web Services (AWS). 6 database nodes (m1.small) 1.7 GB of RAM 1 compute unit 150GB of storage 12 workload nodes (t1.micro) 615 MB or RAM 1 compute unit No local storage (NAS) Distributed equally over 2 datacenters in Ireland.
17 Evaluation Focus on 3 main operations: Tweet Userline (all tweets by user) Timeline (all tweets by people user is following). Limit to 50 items per userline/timeline. Basic data structure: Aggregate data according to operations, not entities: Timeline table/bucket: UserID, TweetID, PostedBy, Body Userline table/bucket: UserID, TweetID, Body
18 Ease of Setup Cassandra - Awesome! Install, configure addresses, partitioner, snitch, replication factors and seed nodes, launch! Automatic partitioning and replication. Automatic adjustment to churn. Couchbase - Equivalently Awesome! Install, configure RAM, directory, launch! Easily add/failover/remove servers Rebalance: background process, asynchronous, incremental Automatic replication, partitioning and node failure detection.
19 Setup time Cassandra: 6 nodes form and stabilize ring after ~2 minutes. Populating with sample data: ~1 minute in 1 node configuration. 6.5 minutes in 6 node configuration. Couchbase: 6 nodes form and stabilize ring after ~2 minutes. Populating with sample data: ~1 minute in 1 node configuration. ~2 minutes in 6 node configuration.
24 Tweets & Denormalization Tweet Latency v.s. Number of Followers
25 Tweets & Denormalization Immediate denormalization not scalable. How to make asynchronous? Cassandra: No native support. External processing. Couchbase: Views!
26 Reconfiguration Latency Cassandra Adding a new node to a cluster: ~5 mins. Latency while node joining cluster
27 Reconfiguration Latency Couchbase Adding a new node to a cluster: immediate BUT rebalance ~ 30 mins Latency while node joining cluster
28 Consistency/Convergence Cassandra: Average of seconds to detect new tweet. Standard deviation of : Same datacenter => very fast detection. Different datacenter => slower detection. Couchbase: Average of seconds to detect new tweet with standard deviation of The delay for new tweet to appear on timeline proportional to the schedule period
29 Replication Cassandra: Very flexible in terms of replication configuration. Per datacenter replication factors with no hard-coded limitations. Behaviour under crash of 2 second 100th
30 Replication Couchbase Automatic and configurable per bucket with a limit of (1+3) replicas. Behaviour under crash of 2 second 30th, 100th
31 Load Balancing Cassandra: Average data ownership per node: 16.68% Standard deviation: 1.23% Couchbase: Evenly distributed with standard deviation: 0.65% for data on disk 0.04% for RAM usage
32 System Load CPU Idle Time Free Memory
33 System Load Disk IO
34 Disk Space Usage SQLite database: 1.6MB Cassandra: Fully denormalized (single node): 16.31MB. Fully denormalized (6-nodes): 9.5MB/node. Includes partitioning and replication. "Normalized" (no body in userline/timeline): Single node: 2.8MB 6-nodes: 1.6MB/node. Commit log after populating: 54MB Need 50% of free disk space at all times: Column family compactions. Data redistributions.
35 Disk Space Usage Couchbase: Total of 250MB distributed across 6 nodes. No minimum free space requirement. Lack of disk usage limitations in both DBs: Not ideal for voluntary computing systems.
36 Conclusion Easy cluster setup Allows horizontal scaling over multiple nodes Improved performance through denormalization Higher storage requirements The future is hybrid A mix of RDS and NoSQL-Systems Cassandra's CQL, CouchBase's Views, NoSQL in RDS
Benchmarking Couchbase Server for Interactive Applications By Alexey Diomin and Kirill Grigorchuk Contents 1. Introduction... 3 2. A brief overview of Cassandra, MongoDB, and Couchbase... 3 3. Key criteria
BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next
Practical Cassandra NoSQL key-value vs RDBMS why and when Cassandra architecture Cassandra data model Life without joins or HDD space is cheap today Hardware requirements & deployment hints Vitalii Tymchyshyn
Distributed Systems Tutorial 12 Cassandra written by Alex Libov Based on FOSDEM 2010 presentation winter semester, 2013-2014 Cassandra In Greek mythology, Cassandra had the power of prophecy and the curse
Facebook: Cassandra Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Leader Election 1/24 Outline 1 2 3 Smruti R. Sarangi Leader Election
Cassandra High Performance Cookbook Over 150 recipes to design and optimize large-scale Apache Cassandra deployments Edward Capriolo [PACKT] publishing"1 open source community experience distilled BIRMINGHAM
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
BENCHMARKING AVAILABILITY AND FAILOVER PERFORMANCE OF LARGE-SCALE DATA STORAGE APPLICATIONS Wei Sun and Alexander Pokluda December 2, 2013 Outline Goal and Motivation Overview of Cassandra and Voldemort
NoSQL Databases Institute of Computer Science Databases and Information Systems (DBIS) DB 2, WS 2014/2015 Database Landscape Source: H. Lim, Y. Han, and S. Babu, How to Fit when No One Size Fits., in CIDR,
Big Systems, Big Data When considering Big Distributed Systems, it can be noted that a major concern is dealing with data, and in particular, Big Data Have general data issues (such as latency, availability,
Benchmarking MongoDB and Couchbase No-SQL Databases Alex Voss Chris Choi University of St Andrews TOP 2 Questions Should a social scientist buy MORE or UPGRADE computers? Which DATABASE(s)? Document Oriented
Getting Started with SandStorm NoSQL Benchmark SandStorm is an enterprise performance testing tool for web, mobile, cloud and big data applications. It provides a framework for benchmarking NoSQL, Hadoop,
Couchbase Server Under the Hood An Architectural Overview Couchbase Server is an open-source distributed NoSQL document-oriented database for interactive applications, uniquely suited for those needing
Big Data Development CASSANDRA NoSQL Training - Workshop March 13 to 17-2016 9 am to 5 pm HOTEL DUBAI GRAND DUBAI ISIDUS TECH TEAM FZE PO Box 121109 Dubai UAE, email training-coordinator@isidusnet M: +97150
Cassandra A Decentralized, Structured Storage System Avinash Lakshman and Prashant Malik Facebook Published: April 2010, Volume 44, Issue 2 Communications of the ACM http://dl.acm.org/citation.cfm?id=1773922
Development of nosql data storage for the ATLAS PanDA Monitoring System M.Potekhin Brookhaven National Laboratory, Upton, NY11973, USA E-mail: email@example.com Abstract. For several years the PanDA Workload
Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract
!!!! NoSQL Databases Nikos Parlavantzas Lecture overview 2 Objective! Present the main concepts necessary for understanding NoSQL databases! Provide an overview of current NoSQL technologies Outline 3!
THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES Vincent Garonne, Mario Lassnig, Martin Barisits, Thomas Beermann, Ralph Vigne, Cedric Serfon Vincent.Garonne@cern.ch firstname.lastname@example.org XLDB
Dynamo: Amazon s Highly Available Key-value Store Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and
Analytics March 2015 White paper Why NoSQL? Your database options in the new non-relational world 2 Why NoSQL? Contents 2 New types of apps are generating new types of data 2 A brief history of NoSQL 3
INTRODUCTION TO CASSANDRA This ebook provides a high level overview of Cassandra and describes some of its key strengths and applications. WHAT IS CASSANDRA? Apache Cassandra is a high performance, open
NoSQL Evaluator s Guide McKnight Consulting Group William McKnight is the former IT VP of a Fortune 50 company and the author of Information Management: Strategies for Gaining a Competitive Advantage with
HDB++: HIGH AVAILABILITY WITH Page 1 OVERVIEW What is Cassandra (C*)? Who is using C*? CQL C* architecture Request Coordination Consistency Monitoring tool HDB++ Page 2 OVERVIEW What is Cassandra (C*)?
NoSQL: Going Beyond Structured Data and RDBMS Scenario Size of data >> disk or memory space on a single machine Store data across many machines Retrieve data from many machines Machine = Commodity machine
Why NoSQL? Your database options in the new non- relational world 2015 IBM Cloudant 1 Table of Contents New types of apps are generating new types of data... 3 A brief history on NoSQL... 3 NoSQL s roots
Cassandra Jonathan Ellis Motivation Scaling reads to a relational database is hard Scaling writes to a relational database is virtually impossible and when you do, it usually isn't relational anymore The
Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores
RAMCloud and the Low- Latency Datacenter John Ousterhout Stanford University Most important driver for innovation in computer systems: Rise of the datacenter Phase 1: large scale Phase 2: low latency Introduction
Introduction to Apache Cassandra White Paper BY DATASTAX CORPORATION JULY 2013 1 Table of Contents Abstract 3 Introduction 3 Built by Necessity 3 The Architecture of Cassandra 4 Distributing and Replicating
NoSQL and Hadoop Technologies On Oracle Cloud Vatika Sharma 1, Meenu Dave 2 1 M.Tech. Scholar, Department of CSE, Jagan Nath University, Jaipur, India 2 Assistant Professor, Department of CSE, Jagan Nath
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
Structured Data Storage Xgen Congress Short Course 2010 Adam Kraut BioTeam Inc. Independent Consulting Shop: Vendor/technology agnostic Staffed by: Scientists forced to learn High Performance IT to conduct
F1: A Distributed SQL Database That Scales Presentation by: Alex Degtiar (email@example.com) 15-799 10/21/2013 What is F1? Distributed relational database Built to replace sharded MySQL back-end of AdWords
Big Data Principles and best practices of scalable real-time data systems NATHAN MARZ JAMES WARREN II MANNING Shelter Island contents preface xiii acknowledgments xv about this book xviii ~1 Anew paradigm
Innovate, Integrate, Transform Hadoop vs Apache Spark www.altencalsoftlabs.com Introduction Any sufficiently advanced technology is indistinguishable from magic. said Arthur C. Clark. Big data technologies
wow CPSC350 relational schemas table normalization practical use of relational algebraic operators tuple relational calculus and their expression in a declarative query language relational schemas CPSC350
Where We Are Introduction to Data Management CSE 344 Lecture 25: DBMS-as-a-service and NoSQL We learned quite a bit about data management see course calendar Three topics left: DBMS-as-a-service and NoSQL
IBM Tivoli Composite Application Manager for Microsoft Applications: Microsoft Hyper-V Server Agent Version 6.3.1 Fix Pack 2 Reference IBM Tivoli Composite Application Manager for Microsoft Applications:
Apache HBase Crazy dances on the elephant back Roman Nikitchenko, 16.10.2014 YARN 2 FIRST EVER DATA OS 10.000 nodes computer Recent technology changes are focused on higher scale. Better resource usage
Page:1 Openstack Swift Object Store Cloud built from the grounds up David Hadas Swift ATC HRL firstname.lastname@example.org Page:2 Object Store Cloud Services Expectations: PUT/GET/DELETE Huge Capacity (Scale) Always
NoSQL Database Options Introduction For this report, I chose to look at MongoDB, Cassandra, and Riak. I chose MongoDB because it is quite commonly used in the industry. I chose Cassandra because it has
Amazon EC2 Product Details Page 1 of 5 Amazon EC2 Functionality Amazon EC2 presents a true virtual computing environment, allowing you to use web service interfaces to launch instances with a variety of
Overview of Databases On MacOS Karl Kuehn Automation Engineer RethinkDB Session Goals Introduce Database concepts Show example players Not Goals: Cover non-macos systems (Oracle) Teach you SQL Answer what
Intro to AWS: Storage Services Matt McClean, AWS Solutions Architect 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved AWS storage options Scalable object storage Inexpensive archive
For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several
Big Data Management Big Data Management (BDM) Autumn 2013 Povl Koch September 30, 2013 29-09-2013 1 Overview Today s program 1. Little more practical details about this course 2. Recap from last time 3.
Cassandra A Decentralized Structured Storage System Avinash Lakshman, Prashant Malik LADIS 2009 Anand Iyer CS 294-110, Fall 2015 Historic Context Early & mid 2000: Web applicaoons grow at tremendous rates
WHITEPAPER Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software SanDisk ZetaScale software unlocks the full benefits of flash for In-Memory Compute and NoSQL applications
Realtime Apache Hadoop at Facebook Jonathan Gray & Dhruba Borthakur June 14, 2011 at SIGMOD, Athens Agenda 1 Why Apache Hadoop and HBase? 2 Quick Introduction to Apache HBase 3 Applications of HBase at
With Saurabh Singh email@example.com The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
Case study: CASSANDRA Course Notes in Transparency Format Cloud Computing MIRI (CLC-MIRI) UPC Master in Innovation & Research in Informatics Spring- 2013 Jordi Torres, UPC - BSC www.jorditorres.eu Cassandra:
Data Warehousing and Analytics Infrastructure at Facebook Ashish Thusoo & Dhruba Borthakur athusoo,firstname.lastname@example.org Overview Challenges in a Fast Growing & Dynamic Environment Data Flow Architecture,
Cloud Computing For Bioinformatics Cloud Computing: what is it? Cloud Computing is a distributed infrastructure where resources, software, and data are provided in an on-demand fashion. Cloud Computing
In Memory Accelerator for MongoDB Yakov Zhdanov, Director R&D GridGain Systems GridGain: In Memory Computing Leader 5 years in production 100s of customers & users Starts every 10 secs worldwide Over 15,000,000
1 SCALABLE DATA SERVICES 2110414 Large Scale Computing Systems Natawut Nupairoj, Ph.D. Outline 2 Overview MySQL Database Clustering GlusterFS Memcached 3 Overview Problems of Data Services 4 Data retrieval
Cloud Computing with Microsoft Azure Michael Stiefel www.reliablesoftware.com email@example.com http://www.reliablesoftware.com/dasblog/default.aspx Azure's Three Flavors Azure Operating
Highly available, scalable and secure data with Cassandra and DataStax Enterprise GOTO Berlin 27 th February 2014 About Us Steve van den Berg Johnny Miller Solutions Architect Regional Director Western
JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra January 2014 Legal Notices Apache Cassandra, Spark and Solr and their respective logos are trademarks or registered trademarks
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
America s Most Wanted a metric to detect persistently faulty machines in Hadoop Dhruba Borthakur and Andrew Ryan dhruba,firstname.lastname@example.org Presented at IFIP Workshop on Failure Diagnosis, Chicago June
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System email@example.com Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture
DataStax Enterprise 3.x Realtime Analytics with Solr Jason Rutherglen 2012 DataStax 1 About the Presenter Big Data Engineer at DataStax Co-author of Programming Hive and Lucene and Solr: The Definitive
Executive Summary... 2 Introduction... 3 Defining Big Data... 3 The Importance of Big Data... 4 Building a Big Data Platform... 5 Infrastructure Requirements... 5 Solution Spectrum... 6 Oracle s Big Data
Big Data Course Highlights The Big Data course will start with the basics of Linux which are required to get started with Big Data and then slowly progress from some of the basics of Hadoop/Big Data (like
Cassandra in Action ApacheCon NA 2013 Yuki Morishita Software Developer@DataStax / Apache Cassandra Committer 1 2 ebay Application/Use Case Social Signals: like/want/own features for ebay product and item
GigaSpaces Real-Time Analytics for Big Data GigaSpaces makes it easy to build and deploy large-scale real-time analytics systems Rapidly increasing use of large-scale and location-aware social media and
Best Practices for Hadoop Data Analysis with Tableau September 2013 2013 Hortonworks Inc. http:// Tableau 6.1.4 introduced the ability to visualize large, complex data stored in Apache Hadoop with Hortonworks
So What s the Big Deal? Presentation Agenda Introduction What is Big Data? So What is the Big Deal? Big Data Technologies Identifying Big Data Opportunities Conducting a Big Data Proof of Concept Big Data
Big Data Primer Alex Sverdlov firstname.lastname@example.org 1 Why Big Data? Data has value. This immediately leads to: more data has more value, naturally causing datasets to grow rather large, even at small companies.