Design Patterns for Distributed Non-Relational Databases

Design Patterns for Distributed Non-Relational Databases, aka Just Enough Distributed Systems To Be Dangerous (in 40 minutes). Todd Lipcon (@tlipcon), Cloudera. June 11, 2009

Introduction. Common Underlying Assumptions. Design Patterns: Consistent Hashing, Consistency Models, Data Models, Storage Layouts, Log-Structured Merge Trees, Cluster Management (Omniscient Master, Gossip). Questions to Ask Presenters.

Why We're All Here: Scaling up doesn't work. Scaling out with traditional RDBMSs isn't so hot either. Sharding scales, but you lose all the features that make RDBMSs useful! Sharding is operationally obnoxious. If we don't need relational features, we want a distributed NRDBMS.

Closed-Source NRDBMSs: The Inspiration. Google BigTable (applications: webtable, Reader, Maps, Blogger, etc.). Amazon Dynamo (shopping cart, ?). Yahoo! PNUTS (applications: ?).

Data Interfaces: This is the NOSQL meetup, right? Every row has a key (PK). Key/value get/put, multiget/multiput. Range scan? With predicate pushdown? MapReduce? SQL?

Underlying Assumptions

Assumptions - Data Size The data does not fit on one node. The data may not fit on one rack. SANs are too expensive. Conclusion: The system must partition its data across many nodes.

Assumptions - Reliability: The system must be highly available to serve web (and other) applications. Since the system runs on many nodes, nodes will crash during normal operation. Data must be safe even though disks and nodes will fail. Conclusion: The system must replicate each row to multiple nodes and remain available despite some node and disk failures.

Assumptions - Performance (...and price thereof): All systems we're talking about today are meant for real-time use. 95th or 99th percentile latency is more important than average latency. Commodity hardware and slow disks. Conclusion: The system needs to perform well on commodity hardware, and maintain low latency even during recovery operations.

Design Patterns

Partitioning Schemes Where does a key live? Given a key, we need to determine which node(s) it belongs on. If that node is down, we need to find another copy elsewhere. Difficulties: Unbounded number of keys. Dynamic cluster membership. Node failures.

Consistent Hashing Maintaining hashing in a dynamic cluster

Consistent Hashing Key Placement
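Key placement on the ring can be sketched in a few lines. This is a toy, not any particular system's implementation: the `HashRing` class, the `nodes_for` method, and the virtual-node count are my own choices, and MD5 is used only because it keeps the sketch dependency-free.

```python
import bisect
import hashlib

def ring_hash(key):
    """Map a string to an integer point on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring with virtual nodes per physical node."""
    def __init__(self, nodes, vnodes=8):
        # Each physical node owns several points, smoothing the distribution.
        self._ring = sorted(
            (ring_hash("%s#%d" % (n, i)), n) for n in nodes for i in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    def nodes_for(self, key, replicas=2):
        """Walk clockwise from the key's point, collecting distinct nodes."""
        idx = bisect.bisect(self._points, ring_hash(key)) % len(self._ring)
        owners = []
        while len(owners) < replicas:
            node = self._ring[idx][1]
            if node not in owners:
                owners.append(node)
            idx = (idx + 1) % len(self._ring)
        return owners
```

Because the walk simply continues clockwise past a down node, the "find another copy elsewhere" case falls out of the same placement rule.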

Consistency Models A consistency model determines rules for visibility and apparent order of updates. Example: Row X is replicated on nodes M and N Client A writes row X to node N Some period of time t elapses. Client B reads row X from node M Does client B see the write from client A? Consistency is a continuum with tradeoffs

Strict Consistency: All read operations must return the data from the latest completed write operation, regardless of which replica the operations went to. Implies either: all operations for a given row go to the same node (replication for availability), or nodes employ some kind of distributed transaction protocol (e.g. two-phase commit or Paxos). CAP Theorem: strict consistency can't be achieved at the same time as availability and partition-tolerance.

Eventual Consistency: As t → ∞, readers will see writes. In a steady state, the system is guaranteed to eventually return the last written value. For example: DNS, or MySQL slave replication (log shipping). Special cases of eventual consistency: read-your-own-writes consistency (the "sent mail" box), causal consistency (if you write Y after reading X, anyone who reads Y sees X). Gmail has RYOW but not causal!

Timestamps and Vector Clocks: Determining a history of a row. Eventual consistency relies on deciding what value a row will eventually converge to. In the case of two writers writing at the same time, this is difficult. Timestamps are one solution, but rely on synchronized clocks and don't capture causality. Vector clocks are an alternative method of capturing order in a distributed system.

Vector Clocks. Definition: a vector clock is a tuple {t1, t2, ..., tn} of clock values, one from each node. v1 < v2 if: for all i, v1[i] ≤ v2[i], and for at least one i, v1[i] < v2[i]. v1 < v2 implies a global time ordering of events. When data is written from node i, it sets ti to its clock value. This allows eventual consistency to resolve conflicts between writes on multiple replicas.
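The comparison rule above translates directly into code. A minimal sketch, with function names of my own choosing; clocks are represented as plain tuples indexed by node:

```python
def vc_less(v1, v2):
    """v1 < v2: every component is <= and at least one is strictly <."""
    assert len(v1) == len(v2)
    return (all(a <= b for a, b in zip(v1, v2))
            and any(a < b for a, b in zip(v1, v2)))

def vc_concurrent(v1, v2):
    """Neither clock dominates the other: the writes are concurrent,
    so the system (or the application) must reconcile them."""
    return not vc_less(v1, v2) and not vc_less(v2, v1) and v1 != v2
```

The concurrent case is exactly the "two writers at the same time" situation that plain timestamps can't distinguish from an ordered pair of writes.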

Data Models: What's in a row? Primary Key → Value. Value could be: a blob; structured (set of columns); semi-structured (set of column families with arbitrary columns, e.g. linkto:<url> in webtable). Each has advantages and disadvantages. Secondary indexes? Tables/namespaces?

Multi-Version Storage: Using timestamps for a 3rd dimension. Each table cell has a timestamp. Timestamps don't necessarily need to correspond to real life. Multiple versions (and tombstones) can exist concurrently for a given row. Reads may return the most recent version, the most recent before time T, etc. (free snapshots). The system may provide optimistic concurrency control with compare-and-swap on timestamps.
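A toy versioned cell makes the snapshot and compare-and-swap ideas concrete. The class and method names here are invented for illustration (no real system's API), and `None` stands in for a tombstone:

```python
class Cell:
    """One table cell holding multiple timestamped versions."""
    def __init__(self):
        self.versions = []  # (timestamp, value) pairs, kept newest-first

    def put(self, ts, value):
        """Insert a version; value=None writes a tombstone."""
        self.versions.append((ts, value))
        self.versions.sort(key=lambda v: v[0], reverse=True)

    def get(self, as_of=None):
        """Most recent version, or most recent at/before as_of (a free snapshot)."""
        for ts, value in self.versions:
            if as_of is None or ts <= as_of:
                return value  # may be None if a tombstone is newest
        return None

    def check_and_put(self, expected_ts, ts, value):
        """Optimistic concurrency: write only if the newest timestamp matches."""
        newest = self.versions[0][0] if self.versions else None
        if newest != expected_ts:
            return False
        self.put(ts, value)
        return True
```

Reading "most recent before T" never blocks writers, which is why multi-version storage gives snapshots essentially for free.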

Storage Layouts: How do we lay out rows and columns on disk? The storage layout maps directly to disk access patterns, and so determines the performance of different access patterns. Fast writes? Fast reads? Fast scans? Whole-row access or subsets of columns?

Row-based Storage. Pros: good locality of access (on disk and in cache) across the different columns of a row; read/write of a single row is a single IO operation. Cons: if you want to scan only one column, you still read all of them.

Columnar Storage. Pros: data for a given column is stored sequentially; scanning a single column (e.g. aggregate queries) is fast. Cons: reading a single row may seek once per column.

Columnar Storage with Locality Groups: Columns are organized into families ("locality groups"). Benefits of the row-based layout within a group. Benefits of the column-based layout across groups - you don't have to read groups you don't care about.

Log-Structured Merge Trees, aka the BigTable model. Random IO for writes is bad (and impossible in some DFSs). LSM trees convert random writes to sequential writes. Writes go to a commit log and in-memory storage (Memtable). The Memtable is occasionally flushed to disk (SSTable). The disk stores are periodically compacted. [P. E. O'Neil, E. Cheng, D. Gawlick, and E. J. O'Neil. The log-structured merge-tree (LSM-tree). Acta Informatica, 1996.]
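The log/Memtable/SSTable/compaction cycle above fits in a short sketch. All names here are mine, not any real engine's API, and the "SSTable" is just an immutable sorted list in memory:

```python
import bisect

class LSMStore:
    """Toy LSM tree: writes hit a log + memtable; flushes produce sorted runs."""
    def __init__(self, memtable_limit=2):
        self.log = []          # stand-in for the sequential commit log
        self.memtable = {}     # in-memory write buffer
        self.sstables = []     # immutable sorted (key, value) runs, newest last
        self.limit = memtable_limit

    def put(self, key, value):
        self.log.append((key, value))  # sequential append, never a random seek
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        """Dump the memtable to disk as an immutable sorted run ('SSTable')."""
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:            # newest data first
            return self.memtable[key]
        for run in reversed(self.sstables):  # then runs, newest to oldest
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return run[i][1]
        return None

    def compact(self):
        """Merge all runs into one, keeping only the newest value per key."""
        merged = {}
        for run in self.sstables:
            merged.update(run)  # later (newer) runs overwrite older values
        self.sstables = [sorted(merged.items())]
```

Note how the read path must consult the memtable and then every run from newest to oldest; compaction exists precisely to keep that list of runs short.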

LSM Data Layout

LSM Write Path

LSM Read Path

LSM Read Path + Bloom Filters
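The Bloom filter on the read path answers "definitely absent" or "maybe present" for each SSTable, letting reads skip runs that cannot contain the key. A toy filter, with class name, bit-packing, and hash scheme of my own choosing:

```python
import hashlib

class BloomFilter:
    """Probabilistic set membership: no false negatives, rare false positives."""
    def __init__(self, bits=1024, hashes=3):
        self.bits = bits
        self.hashes = hashes
        self.array = 0  # bit field packed into a single int

    def _positions(self, key):
        # Derive k bit positions by salting the key with the hash index.
        for i in range(self.hashes):
            digest = hashlib.md5(("%d:%s" % (i, key)).encode()).hexdigest()
            yield int(digest, 16) % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array |= 1 << p

    def might_contain(self, key):
        """False means the key is definitely absent; True means maybe present."""
        return all(self.array & (1 << p) for p in self._positions(key))
```

A per-SSTable filter is built at flush time from the run's keys, so a negative answer saves the disk seek entirely.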

LSM Memtable Flush

LSM Compaction

Cluster Management: Clients need to know where to find data (consistent hashing tokens, etc). Internal nodes may need to find each other as well. Since nodes may fail and recover, a configuration file doesn't really suffice. We need a way of keeping some kind of consistent view of the cluster state.

Omniscient Master: When nodes join/leave or change state, they talk to a master. That master holds the authoritative view of the world. Pros: simplicity; a single consistent view of the cluster. Cons: potential SPOF unless the master is made highly available; not partition-tolerant.

Gossip: Gossip is one method to propagate a view of cluster status. Every t seconds, on each node: the node selects some other node to chat with; the node reconciles its view of the cluster with its gossip buddy's. Each node maintains a timestamp for itself and for the most recent information it has from every other node. Information about cluster state spreads in O(log n) rounds (eventual consistency). Scalable and no SPOF, but state is only eventually consistent.
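The reconcile step above is just a per-node max over heartbeat timestamps. A minimal sketch (function names and the dict-of-dicts view representation are my own):

```python
import random

def reconcile(view_a, view_b):
    """Merge two cluster views, keeping the newer heartbeat for each node."""
    merged = dict(view_a)
    for node, stamp in view_b.items():
        if node not in merged or stamp > merged[node]:
            merged[node] = stamp
    return merged

def gossip_round(views):
    """Every node chats with one random peer; both adopt the merged view."""
    nodes = list(views)
    for node in nodes:
        buddy = random.choice([n for n in nodes if n != node])
        merged = reconcile(views[node], views[buddy])
        views[node] = dict(merged)
        views[buddy] = dict(merged)
```

Starting from each node knowing only its own heartbeat, a few rounds are enough for every node's view to cover the whole cluster, which is the O(log n) spread in miniature.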

Gossip - Initial State

Gossip - Round 1

Gossip - Round 2

Gossip - Round 3

Gossip - Round 4

Questions to Ask Presenters

Scalability and Reliability What are the scaling bottlenecks? How does it react when overloaded? Are there any single points of failure? When nodes fail, does the system maintain availability of all data? Does the system automatically re-replicate when replicas are lost? When new nodes are added, does the system automatically rebalance data?

Performance: What's the goal? Batch throughput or request latency? How many seeks for reads? For writes? How many network RTTs? What 99th percentile latencies have been measured in practice? How do failures impact serving latencies? What throughput has been measured in practice for bulk loads?

Consistency What consistency model does the system provide? What situations would cause a lapse of consistency, if any? Can consistency semantics be tweaked by configuration settings? Is there a way to do compare-and-swap on row contents for optimistic locking? Multirow?

Cluster Management and Topology: Does the system have a single master? Does it use gossip to spread cluster management data? Can it withstand network partitions and still provide some level of service? Can it be deployed across multiple datacenters for disaster recovery? Can nodes be commissioned/decommissioned automatically without downtime? Operational hooks for monitoring and metrics?

Data Model and Storage What data model and storage system does the system provide? Is it pluggable? What IO patterns does the system cause under different workloads? Is the system best at random or sequential access? For read-mostly or write-mostly? Are there practical limits on key, value, or row sizes? Is compression available?

Data Access Methods What methods exist for accessing data? Can I access it from language X? Is there a way to perform filtering or selection at the server side? Are there bulk load tools to get data in/out efficiently? Is there a provision for data backup/restore?

Real Life Considerations (I was talking about fake life in the first 45 slides): Who uses this system? How big are the clusters it's deployed on, and what kind of load do they handle? Who develops this system? Is this a community project or run by a single organization? Are outside contributions regularly accepted? Who supports this system? Is there an active community who will help me deploy it and debug issues? Docs? What is the open source license? What is the development roadmap?

Questions? http://cloudera-todd.s3.amazonaws.com/nosql.pdf