Scaling with MongoDB. by Michael Schurter 2011 @schmichael. Scaling with MongoDB by Michael Schurter - OS Bridge, 2011.06.22

Similar documents

On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform

MongoDB Developer and Administrator Certification Course Agenda

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

Getting Started with MongoDB

Benchmarking Couchbase Server for Interactive Applications. By Alexey Diomin and Kirill Grigorchuk

NoSQL - What we ve learned with mongodb. Paul Pedersen, Deputy CTO paul@10gen.com DAMA SF December 15, 2011

Be Very Afraid. Christophe Pettus PostgreSQL Experts Logical Decoding & Backup Conference Europe 2014

High Availability Solutions for the MariaDB and MySQL Database

MongoDB. The Definitive Guide to. The NoSQL Database for Cloud and Desktop Computing. Apress8. Eelco Plugge, Peter Membrey and Tim Hawkins

MongoDB. Or how I learned to stop worrying and love the database. Mathias Stearn. N*SQL Berlin October 22th, gen

NOSQL INTRODUCTION WITH MONGODB AND RUBY GEOFF

The MongoDB Tutorial Introduction for MySQL Users. Stephane Combaudon April 1st, 2014

NoSQL in der Cloud Why? Andreas Hartmann

MongoDB in the NoSQL and SQL world. Horst Rechner Berlin,

L7_L10. MongoDB. Big Data and Analytics by Seema Acharya and Subhashini Chellappan Copyright 2015, WILEY INDIA PVT. LTD.

Scaling up = getting a better machine. Scaling out = use another server and add it to your cluster.

Practical Cassandra. Vitalii

Writing a Logical Decoding Plug-In. Christophe Pettus PostgreSQL Experts, Inc. FOSDEM 2015

Boost SQL Server Performance Buffer Pool Extensions & Delayed Durability

Introduction to NoSQL and MongoDB. Kathleen Durant Lesson 20 CS 3200 Northeastern University

Big Data & Data Science Course Example using MapReduce. Presented by Juan C. Vega

An Approach to Implement Map Reduce with NoSQL Databases

Why NoSQL? Your database options in the new non- relational world IBM Cloudant 1

MongoDB. An introduction and performance analysis. Seminar Thesis

MONGODB - THE NOSQL DATABASE

HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367

MongoDB and Couchbase

Structured Data Storage

MySQL Replication. openark.org

NoSQL Databases. Nikos Parlavantzas

these three NoSQL databases because I wanted to see a the two different sides of the CAP

Understanding NoSQL on Microsoft Azure

Distributed Systems. Tutorial 12 Cassandra

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

NoSQL Performance Test In-Memory Performance Comparison of SequoiaDB, Cassandra, and MongoDB

MySQL: Cloud vs Bare Metal, Performance and Reliability

Cloud Backup and Recovery

NoSQL Database Options

Understanding NoSQL Technologies on Windows Azure

High Availability for Database Systems in Cloud Computing Environments. Ashraf Aboulnaga University of Waterloo

Designing Apps for Amazon Web Services

Preparing for the Big Oops! Disaster Recovery Sites for MySQL. Robert Hodges, CEO, Continuent MySQL Conference 2011

MongoDB: document-oriented database

Scalable Architecture on Amazon AWS Cloud

Comparing SQL and NOSQL databases

Document Oriented Database

Facebook: Cassandra. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

SQL Server 2014 New Features/In- Memory Store. Juergen Thomas Microsoft Corporation

Building Your First MongoDB Application

An Open Source NoSQL solution for Internet Access Logs Analysis

Performance Monitoring AlwaysOn Availability Groups. Anthony E. Nocentino

Database Replication with MySQL and PostgreSQL

Introduction to Polyglot Persistence. Antonios Giannopoulos Database Administrator at ObjectRocket by Rackspace

LARGE-SCALE DATA STORAGE APPLICATIONS

Big Data Solutions. Portal Development with MongoDB and Liferay. Solutions

Microsoft Azure Data Technologies: An Overview

In Memory Accelerator for MongoDB

How To Fix A Powerline From Disaster To Powerline

Application of NoSQL Database in Web Crawling

High Availability Using MySQL in the Cloud:

MySQL és Hadoop mint Big Data platform (SQL + NoSQL = MySQL Cluster?!)

Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world

NoSQL Failover Characteristics: Aerospike, Cassandra, Couchbase, MongoDB

Website Disaster Recovery

Databases for text storage

The Future of PostgreSQL High Availability Robert Hodges - Continuent, Inc. Simon Riggs - 2ndQuadrant

Couchbase Server Under the Hood

Web Application Deployment in the Cloud Using Amazon Web Services From Infancy to Maturity

HO5604 Deploying MongoDB. A Scalable, Distributed Database with SUSE Cloud. Alejandro Bonilla. Sales Engineer abonilla@suse.com

NoSQL: Going Beyond Structured Data and RDBMS

Integrating Big Data into the Computing Curricula

Introducing DocumentDB

Big data and urban mobility

PostgreSQL Concurrency Issues

Sharding and MongoDB. Release MongoDB, Inc.

CSCC09F Programming on the Web. Mongo DB

Please ask questions! Have people used non-relational dbs before? MongoDB?

MakeMyTrip CUSTOMER SUCCESS STORY

Evaluation of NoSQL databases for large-scale decentralized microblogging

Domain driven design, NoSQL and multi-model databases

A Journey through the MongoDB Internals?

Mark Bennett. Search and the Virtual Machine

Google File System. Web and scalability

Appendix A Core Concepts in SQL Server High Availability and Replication

Frictionless Persistence in.net with MongoDB. Mogens Heller Grabe Trifork

How SourceForge is Using MongoDB

Zero Downtime Deployments with Database Migrations. Bob Feldbauer

NoSQL Databases. Institute of Computer Science Databases and Information Systems (DBIS) DB 2, WS 2014/2015

Not Relational Models For The Management of Large Amount of Astronomical Data. Bruno Martino (IASI/CNR), Memmo Federici (IAPS/INAF)

Efficient Management of System Logs using a Cloud Radoslav Bodó, Daniel Kouřil CESNET. ISGC 2013, March 2013

Scaling Graphite Installations

Transcription:

Scaling with MongoDB by Michael Schurter 2011 @schmichael

What is MongoDB? Community Developer by 10gen AGPL Database Apache drivers JIRA issue tracking On GitHub

What is MongoDB? Architecture Server Database Collection (table) Document (BSON; like a row) Fields (columns) Queries are on the collection level (no joins) Indexes are per collection Documents have a unique ID Atomicity is at the document level

What is MongoDB? Documents (2) BSON Open standard: bsonspec.org Binary JSON or protobufs without the schema Objects/Sub-documents/mappings Arrays UTF-8 Strings Floats, Integers (32 or 64 bit) Timestamps, DateTimes Booleans Binary Assorted others specific to Mongo's use case (Regex, ObjectID)

What is MongoDB? Querying Querying Dynamic queries (JavaScript code or objects) Map/Reduce (JavaScript functions) Secondary indexes (B-tree, R-tree for simple geospatial)

What is MongoDB? Querying (2) Search by value or inside an array: db.collection.find({tags: "some tag"}) [{...}, {...},...] Update tag lists: update({app: "...", $addtoset: {tags: "another tag"}) No escaping values Declarative syntax makes for easy composition Must escape keys and insure values aren t objects

What is MongoDB? Operations Replication Master/Slave Replica Pairs Sets Auto-sharding (data partitioning) Tools mongo shell, mongostat mongo{dump,restore,export,import} Multi-master slaves mongo shell is javascript mongostat is *amazing*

What is(n t) Mongo? Durability Single server durability added in 1.8 (off by default) Preference is Replica Sets No guarantee your data was written with safe=true Use safe=true No guarantee your data was replicated without w=1 If a server goes down, trash it s data and use a slave Unknown performance hit for durability Probably worse than systems that ship WALs/journals as part of replication safe=false is literally fire and forget

What is MongoDB? Performance I hear it s fast It is until: Your data no longer fits in memory Your indexes no longer fit in memory You miss the index (full collection scan) You use safe=true You use w>0 You turn on journaling/durability Oodles of marketing Anecdote: Cassandra vs. Memcache

Random internet picture of a read heavy system Ours was write heavy and ran into something else first... but I m getting ahead of myself

MOAR RAM Rinse; repeat. RAM is probably your single most important resource SSDs would be great Sharding essentially buys you the ability to add RAM forever lol voltdb

In the beginning Project at YouGov prior to Urban Airship User/group database Was custom datastore, migrated to MongoDB 1.2 Highly recommended to Michael Richardson Single PostgreSQL instance dying under load I snuck into Urban Airship before anything blew up User permissions and profiles were relatively complex Initially stored in a BDB-backed CouchDB-inspired Python service. o Worked fine, a bit slow, major NIH-Syndrome MongoDB (1.2?) to the rescue! o Perfect fit for a small read-heavy, write-light dataset

Early perf. issue: syncdelay The theory: Durability within a certain time-frame. (default: 60s) Barely documented Never worked for us Syncs would cause infinite block loops: block for 30s, apply backed up writes, block another 30s, never catch up. Just let the kernel handle flushing syncdelay useless with journaling? No hints in docs There are proc tunables for how the kernel handles dirty pages

Initial MongoDB setup at Urban Airship

Replication Streaming asynchronous replication Streams an oplog; no conflict resolution Master accepts writes, can read from slaves Master + Slaves or... Replica Sets (auto-election & failover) Non-idempotent operations like $inc/$push/etc are changed to their idempotent form: {devices: 1,560} {$inc: {devices: 1}} {devices: 1,561} But we were using replica pairs which were deprecated before replica sets were even officially released Had a driver + replica sets issue where writes with safe=false went to slave (safe=false means we never saw an error)

After the replica set bug we went to a simpler setup

Locked In MongoDB only has Global (per-server) locks Early versions (~1.2) locked server on every query Later (~1.4) separate read & write locks Present (>=1.6) updates use read lock while seeking Future (1.9?) Per collection locks Moral: Do not run long queries on your master Similar to dynamic programming languages Keep indexes in memory to keep read locks short Have fast disks - NOT EBS (See Gavin s talk http://bit.ly/j6pr21) https://jira.mongodb.org/browse/server-1240

3rd generation mongodb setup - separate long reads from backups

Double (Up)Dating Cause: update(..., {$push: {big object}}) Effect: Big object exceeds document padding Document is moved to end of data Update comes along and re-updates all documents Changing your schema can avoid this... how do you change schemaless data s schema? not easily Isn t guaranteed to hit all documents, so you can t just $pop once

Flip/Flop for the Win Data files get sparse as documents are moved Indexes could get sparse (getting better & better) Sparse indexes new in 1.8 (have to recreate old indexes) The Solution: Take slave offline, resync, failover Requires downtime without Replica Sets Future (1.9) - In-place compaction Docs are padded to avoid moving; but when they are moved, gaps can grow Poor data locality in disk / in memory Schemaless means lots of space wasted repeating field names / structure In-place - *not* online. Queries & replication stopped before compaction

When adding RAM isn t enough More RAM doesn t alleviate pain of full server locks You have to write to disk someday Solution: Partition the cluster Good old days: manually shard your data Brave new world (1.6): auto-sharding

Auto-Sharding We don t use it Couldn t get it to work in any of the 1.5 dev releases Couldn t pay me enough to use 1.6.0, maybe safe post 1.8.1 Auto-shards based on a shard-key per collection Order preserving partitioner Querying anything other than the shard-key spams the cluster Each shard should be a Replica Set Run 1 mongos per app server Early adopters seemed to all have 10gen support & special builds Took down 4sq because adding shards & rebalancing incurs expenses in any distributed system (most? maybe there are clever ways to mitigate...) Much larger ops burden Make the right decisions initially (not totally schemaless)

4th (and final) generation Shard by unrelated datasets Don t use an underpowered backup slave Monitor replication delay closely - in a disaster nothing else matters if your replication delay is *14 days*

Disaster Recovery So what happens when things go *really* wrong?

EBS Goes on Strike EBS volumes grind to a halt (or at least 1 per RAID) Restore from backup in a different Availability Zone! 4 hours later the restore is still building indexes Wait for EBS to come back or use stale backup data? In the end: EBS came back before restore finished. When your HDD locks, MongoDB locks

mongorestore --indexeslast Always use it Should be On by default Lock database and copy files (snapshot) is best Waiting for indexes to build takes approximately as long as it takes AWS engineers to EBS issues.

Goodbye MongoDB Moved bulk of data to manually partitioned PostgreSQL 120 GB of MongoDB data became 70 GB in PostgreSQL Migration difficult due to undisciplined MongoDB code PostgreSQL 9.0 s streaming replication is nice Slightly better failover story than MongoDB Master/Slave Lack of data abstraction layer let us get sloppy - long migration

Moral Test MongoDB for your use case If you can t get auto-sharding working, probably run That goes double for Replica Sets Choose your schema well (especially with sharding) Use short key names + a data abstraction layer {created_time: new DateTime()} 27 bytes {created: new DateTime()} 22 bytes {ct: new DateTime()} 17 bytes With Auto-sharding *and* Replica Sets, MongoDB s scaling and durability stories are just too scary to trust. You ll thank me when your data files are 1/3 smaller.

Questions? Content, text & diagrams, public domain (steal away) Slide template copyright Urban Airship (sorry)