Benchmarking MongoDB and Couchbase No-SQL Databases Alex Voss Chris Choi University of St Andrews
TOP 2 Questions: Should a social scientist buy MORE computers or UPGRADE them? Which DATABASE(s)?
Document Oriented Databases
The central concept of a document-oriented database is the notion of a Document
Both MongoDB and Couchbase use JSON to store these Documents
Example Documents:
{ FirstName:"Bob", Address:"5 Oak St.", Hobby:"sailing" }
{ FirstName:"Jonathan", Address:"15 Wanamassa Point Road", Children:[ {Name:"Michael", Age:10}, {Name:"Jennifer", Age:8}, {Name:"Samantha", Age:5}, {Name:"Elena", Age:2} ] }
There are no empty 'fields', and a document does not need to state explicitly that other pieces of information are left out.
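As a rough illustration (not part of the original slides), the two example documents can be written as Python dictionaries and serialised to JSON. The helper name `to_json` is made up for this sketch; the point is only that a field a document does not have is simply absent.

```python
import json

# The two example documents from the slide, written as Python dicts.
bob = {"FirstName": "Bob", "Address": "5 Oak St.", "Hobby": "sailing"}
jonathan = {
    "FirstName": "Jonathan",
    "Address": "15 Wanamassa Point Road",
    "Children": [
        {"Name": "Michael", "Age": 10},
        {"Name": "Jennifer", "Age": 8},
        {"Name": "Samantha", "Age": 5},
        {"Name": "Elena", "Age": 2},
    ],
}


def to_json(document):
    """Serialise one document to a JSON string (hypothetical helper)."""
    return json.dumps(document)


# Jonathan has no "Hobby" and Bob has no "Children": neither document
# stores a placeholder for the field it lacks.
for doc in (bob, jonathan):
    print(sorted(doc.keys()))
    print(to_json(doc))
```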
The Database Tests
Database Test Aggregate/Count: Typical questions: Most User Mentioned, Most Hashed Tags, Most Shared URLs
MongoDB has two query approaches: Map Reduce & Aggregation Framework
Map Reduce For Aggregation Uses JavaScript
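To make the Map Reduce idea concrete, here is a minimal sketch in plain Python. It only simulates the map and reduce steps in memory; MongoDB itself runs these steps as JavaScript functions on the server. The `hashtags` field and all function names are assumptions made for the illustration.

```python
from collections import defaultdict


def map_hashtags(document):
    """Map step: emit a (hashtag, 1) pair for every hashtag in one document."""
    for tag in document.get("hashtags", []):
        yield tag.lower(), 1


def reduce_counts(key, values):
    """Reduce step: combine all values emitted for one key into a total."""
    return sum(values)


def map_reduce(documents, map_fn, reduce_fn):
    """Group the emitted pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for document in documents:
        for key, value in map_fn(document):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


# Hypothetical sample data standing in for the tweet corpus.
tweets = [
    {"hashtags": ["NoSQL", "MongoDB"]},
    {"hashtags": ["nosql"]},
    {"hashtags": []},
]
counts = map_reduce(tweets, map_hashtags, reduce_counts)
print(counts)
```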
Aggregation Framework Similar to Map Reduce But Uses C++ (Faster!)
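For comparison, a "most hashed tags" query in the Aggregation Framework might look roughly like the pipeline below. This is a sketch under assumptions: the `hashtags` array field and the function names are invented here, and `collection` is expected to be something like a pymongo collection exposing `aggregate(pipeline)`. The stages `$unwind`, `$group`, `$sort` and `$limit` are standard pipeline operators.

```python
def build_pipeline(limit):
    """Build a pipeline that counts hashtags and keeps the top `limit`."""
    return [
        # One output document per element of the (assumed) hashtags array.
        {"$unwind": "$hashtags"},
        # Count how often each hashtag occurs.
        {"$group": {"_id": "$hashtags", "count": {"$sum": 1}}},
        # Most frequent first.
        {"$sort": {"count": -1}},
        {"$limit": limit},
    ]


def top_hashtags(collection, limit=10):
    """Run the pipeline on any object that offers aggregate(pipeline)."""
    return list(collection.aggregate(build_pipeline(limit)))
```

With pymongo this would be called as `top_hashtags(db.tweets, 10)`, where `db.tweets` is a hypothetical collection name.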
Single Node: SSD vs HDD [Figure: two bar charts, Map Reduce and Aggregation Framework, comparing HDD and SSD for Most User Mentioned, Most Hashed-Tags and Most Shared URLs]
Single Node: the Aggregation Framework is roughly 2 times faster than Map Reduce
When we have more than 1 node ~ Scaling Generally...
Scaling Vertically: a BIGGER and FATTER machine! Scaling Horizontally: MORE machines!
Replica-Set: Master-Slave Replication. Replicates data from master to slave. Only one node is in charge (the Master), and data only travels in one direction (to the Slave).
Replica-Set: Single Node vs Replica Set, Aggregation Framework [Figure: bar chart of time (minutes) for User Mentioned, Hashed-Tags and Shared URLs, Single Node vs Replica Set]
Replica-Set: Single Node vs Replica Set, Map Reduce [Figure: bar chart of time (minutes) for User Mentioned, Hashed-Tags and Shared URLs, Single Node vs Replica Set]
Replicating Data did not improve query performance (MapReduce and Aggregation Framework)... so we tried Sharding
Sharding: One BIG stack --> Two or more smaller stacks
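The "one big stack into smaller stacks" idea can be sketched as follows. This is a simplified illustration, not MongoDB's actual balancing logic: it assigns each document to a shard by hashing an assumed key field, and the function names are made up for the example.

```python
import hashlib


def shard_for(key, shard_count):
    """Pick a shard index for a key using a stable hash."""
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def split_into_shards(documents, key_field, shard_count):
    """Split one big list of documents into `shard_count` smaller lists."""
    shards = [[] for _ in range(shard_count)]
    for document in documents:
        index = shard_for(document[key_field], shard_count)
        shards[index].append(document)
    return shards


# Hypothetical documents keyed by a user name.
docs = [{"user": "user%d" % i, "text": "..."} for i in range(100)]
shards = split_into_shards(docs, "user", 4)
print([len(shard) for shard in shards])
```

An aggregate query can then run on each smaller stack separately, with the partial results combined afterwards.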
MongoDB uses a democratic hierarchical scaling approach: Master-Slave Replication
But Deployment is an issue with MongoDB... If you want to shard...
The documentation suggests a minimum of 11 nodes to build a fail-safe sharded cluster:
2 Shards: 6 nodes (3 each)
2 mongos (load balancer): 2 nodes
3 mongod (config servers): 3 nodes
-------------------------------------------
Total: 11 nodes
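The node count above is simple arithmetic, sketched here so other cluster sizes can be tried. The function name and parameters are invented for this illustration, and it assumes every process gets its own node, as in the fail-safe layout.

```python
def nodes_needed(shards, replicas_per_shard, mongos, config_servers):
    """Total nodes when every process runs on its own machine."""
    return shards * replicas_per_shard + mongos + config_servers


# The fail-safe layout from the slide: 2 shards of 3 nodes each,
# 2 mongos routers and 3 config servers.
total = nodes_needed(shards=2, replicas_per_shard=3, mongos=2, config_servers=3)
print(total)
```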
The investment to go from single node --> sharded cluster is very HIGH!
By omitting Fail-Safe features. For our tests, we kind of cheated and used (note: X is variable):
X Shards: 2 nodes
1 mongos (load balancer): 0.5 node
1 mongod (config server): 0.5 node
-------------------------------------------
Total: 3 nodes
NOT RECOMMENDED IN PRODUCTION
Which looks something like this: 4 Shard configuration
Or this: 8 Shard configuration
Another problem is with querying in a sharded environment...
We couldn't perform queries using the Aggregation Framework in the shards, because it was limited by:
- Output at every stage of the aggregation can only contain 16MB
- >10% RAM usage
MongoDB would fail if the limits were reached
Sharded Cluster Map Reduce: On 3 modest machines (4 Cores, 8GB RAM) [Figure: time (minutes) for most hash tags, most user mentions and most shared urls against number of shards (2, 4, 8), with the optimal shard count marked]
Sharded Cluster Map Reduce: On 3 modest machines (4 Cores, 8GB RAM) [Figure: two panels of time (minutes) for Most Hashed-Tags, Most User Mentioned and Most Shared URLs: Single Node (HDD vs SSD) next to Three Nodes (2, 4 and 8 shards)]
Performance of a Single Node (SSD) is very comparable to 3 nodes (HDD) across different numbers of shards! (A Single Node with SSD is 2-3 minutes slower than 3 nodes... [40 GB corpus]) Perhaps it's worth purchasing a good SSD rather than investing in more nodes to improve aggregate performance...
What happens when we scale vertically with ONE BIG BOX? we get --->
Sharded Cluster Map Reduce: On a Big Box (32 Cores, 128GB RAM) [Figure: time (minutes) for most hash tags, most user mentions and most shared urls against number of shards (2, 4, 8, 16, 24, 28, 30), with the optimal shard count marked]
Couchbase uses a communistic scaling approach: Master-Master Replication
Single Node [Figure: bar chart of time (seconds) by hard drive type (HDD, SSD) for most user mentioned, most hashed tags and most shared urls]
Single Node VS Map Reduce [Figure: two bar charts side by side, each comparing HDD and SSD for Most User Mentioned, Most Hashed-Tags and Most Shared URLs; the left panel is labelled Time (seconds)]
Single Node VS Aggregation Framework [Figure: two bar charts side by side comparing HDD and SSD; the left panel, labelled Time (seconds), shows most user mentioned, most hashed tags and most shared urls; the right panel shows Most User Mentioned and Most Hashed-Tags]
On a SINGLE NODE, Couchbase is FASTER than MongoDB's MapReduce, but MongoDB's Aggregation Framework is FASTER than Couchbase
That said, adding more Couchbase nodes was disappointing...
In short...
1 Node VS 2 Nodes VS 3 Nodes [Figure: bar chart for Most user mentioned, Most hashed tags and Most shared urls on One Node, Two Nodes and Three Nodes]
Adding more nodes did not improve query performance...
Other users have found similar issues. Couchbase Support says: "One known issue with the current developer preview of Couchbase is that it slows down on smaller clusters. The views are optimal on very large clusters." [1] But they did not specify what counts as very large.
[1] Couchbase performance - real tests - it's horrible - or am I doing somethin wrong?
That was November 2011...
Which one?
The obvious answer is MongoDB, from a performance standpoint.
Simple flat file aggregation with Java took approx 60 mins. On the same machine, MongoDB took approx 30 mins (MapReduce) or 10 mins (Aggregation Framework).
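For readers unfamiliar with the flat-file baseline, the following is a minimal sketch of that kind of aggregation. The original was written in Java and is not shown in the slides; this Python version assumes one JSON object per line with a `text` field, and the regular expressions and names are illustrative only.

```python
import io
import json
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")
URL = re.compile(r"https?://\S+")


def aggregate(lines):
    """Stream over JSON lines once and count mentions, hashtags and URLs."""
    mentions, hashtags, urls = Counter(), Counter(), Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        text = json.loads(line).get("text", "")
        mentions.update(name.lower() for name in MENTION.findall(text))
        hashtags.update(tag.lower() for tag in HASHTAG.findall(text))
        urls.update(URL.findall(text))
    return mentions, hashtags, urls


# Hypothetical two-line corpus; a real run would pass an open file instead.
sample = io.StringIO(
    '{"text": "@alice see http://example.org #nosql"}\n'
    '{"text": "@Alice and @bob on #NoSQL #mongodb http://example.org"}\n'
)
mentions, hashtags, urls = aggregate(sample)
print(mentions.most_common(1), hashtags.most_common(1), urls.most_common(1))
```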
Deployment Options [Figure: total execution time (mins) and approximate total cost (GBP) for each deployment option: 1 Node HDD MR, 1 Node SSD MR, 3 Nodes HDD MR, Big Box HDD MR and 1 Node Fast SSD AF (MR = Map Reduce, AF = Aggregation Framework)]
But! MongoDB has problems with: - Full Text Search - Social Network Graphs - Aggregation
MongoDB's Full Text Search is slow: 1 simple query on a single node with a 44GB corpus took approx 10 mins!
Solution: A Fast Full Text Search Engine The same search took only a fraction of a second
Creating a Social Network Graph with MongoDB can be complex, since you query multiple times to obtain relationships
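The "query multiple times" problem can be sketched with a toy stand-in for a document store, where every lookup counts as one query. The class, the data and the names below are all hypothetical; the point is only that a friends-of-friends lookup needs one query for the person plus one per friend.

```python
class FriendStore:
    """Toy stand-in for a document collection: one lookup is one query."""

    def __init__(self, friends_by_name):
        self._friends = friends_by_name
        self.query_count = 0

    def friends_of(self, name):
        self.query_count += 1
        return self._friends.get(name, [])


def friends_of_friends(store, name):
    """Collect everyone two 'friend' hops away from `name`."""
    result = set()
    for friend in store.friends_of(name):
        result.update(store.friends_of(friend))
    result.discard(name)  # this sketch leaves the starting person out
    return result


store = FriendStore({
    "John": ["Sara", "Joe"],
    "Sara": ["Maria"],
    "Joe": ["Steve", "Maria"],
})
fof = friends_of_friends(store, "John")
print(sorted(fof), store.query_count)
```

The number of queries grows with the number of friends, and each further hop multiplies it again.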
Solution: A Popular Graph Database Where you can obtain a network graph with VERY FEW queries
Example:
START john=node:node_auto_index(name = 'John')
MATCH john-[:friend]->()-[:friend]->fof
RETURN john, fof
The above query fetches all nodes that are friends of friends of John.
With Aggregation Is perhaps not the best, we have yet to play with
Or even An SQL language to build MapReduce queries With
Moral of the story Use the right tools for the right job
Future work We aim to experiment with: With possibility in integrating all with