Why MySQL beats MongoDB



Why MySQL beats MongoDB in one big data scenario
Stephan Sommer-Schulz (Director of Research)
Searchmetrics Inc. 2011

The Challenge

Backlink database: build a graph of most of the backlinks on the internet.
- Store the data in a database
- Give the user the ability to:
  - perform queries with multiple WHERE conditions
  - aggregate fields where needed (grouping, counting)
  - order results over multiple fields (asc, desc)
- Answer requests within seconds
- Keep costs and administration as low as possible

Definition: A backlink is an internet link which points from one webpage to another and is not within the same subdomain (i.e. no internal links).
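The query requirements above boil down to a single SQL shape: multiple WHERE conditions, grouping with counts, and multi-field ordering. A minimal sketch, using an in-memory SQLite database as a stand-in for the real store; the table layout and column names (source_host, target_host, link_type, anchor_len) are illustrative assumptions, not the actual schema:

```python
import sqlite3

# In-memory stand-in for one shard of the backlink store.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE backlinks (
        source_host TEXT,   -- hypothetical column names
        target_host TEXT,
        link_type   INTEGER,
        anchor_len  INTEGER
    )""")
rows = [
    ("blog.example.org", "shop.example.com", 1, 12),
    ("news.example.net", "shop.example.com", 1, 30),
    ("blog.example.org", "docs.example.com", 2, 7),
    ("wiki.example.io",  "shop.example.com", 2, 15),
]
conn.executemany("INSERT INTO backlinks VALUES (?, ?, ?, ?)", rows)

# Multiple WHERE conditions, aggregation, and multi-field ordering
# in one statement -- the shape of query the system must answer fast.
result = conn.execute("""
    SELECT target_host, link_type, COUNT(*) AS n
    FROM backlinks
    WHERE anchor_len > 5 AND link_type IN (1, 2)
    GROUP BY target_host, link_type
    ORDER BY n DESC, target_host ASC
""").fetchall()
print(result)
```

The hard part is not writing this query but answering it within seconds over 112 billion rows, which is what the rest of the deck is about.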

The Data

Graph components: the graph consists of nodes (webpages) and edges (links). Each node and each link can have a number of attributes.
- No. of pages: 9,800,000,000
- No. of links: 112,000,000,000
- No. of daily new links & updates: 1,000,000,000

Because the links are the most interesting part of our scenario, we will have at least one record per link in our database (> 112 billion).

MongoDB Approach (v2.0.0)

Good:
- Built-in sharding!
- Replication (not needed here)
- REST and JSON (or BSON)
- Schemaless / document-oriented (we will see)
- Scalable (we will see)

Not so good:
- Poor datatypes (no tinyint, no smallint, ...)
- Large datasize, because of document orientation, field names, ...
- Performance issues on aggregates (e.g. count distinct)
- Indexes must fit completely into memory
- Funny problems with sharding keys if a shard gets too big but cannot be divided anymore
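The datatype point can be made concrete: a value that fits a 1-byte TINYINT or 2-byte SMALLINT in MySQL occupies a full 8 bytes when stored as a 64-bit integer, as MongoDB stores these fields. A quick illustration with Python's struct module (the value 42 is made up):

```python
import struct

value = 42  # a "tiny attribute" from our schema

tiny  = struct.pack("<b", value)  # MySQL TINYINT:  1 byte
small = struct.pack("<h", value)  # MySQL SMALLINT: 2 bytes
wide  = struct.pack("<q", value)  # MongoDB 64-bit integer: 8 bytes

print(len(tiny), len(small), len(wide))  # 1 2 8
```

Across the 6 tiny and 6 small attributes in our records, that is 18 bytes versus 96 bytes per record, before field names are even counted.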

MongoDB Tests 1: Sharding / Data Input / Index

Setup: 3 sharding servers (AMD hexa-core, 8 GB RAM, SSD)
- Start with one index
- Input 1 billion records (~300-400 million per shard)
- Create a second index ...

Problem 1: On one shard the second index didn't fit into memory (the ETA was several days!)
- Ran db.killOp(...) for the index creation: the process is marked as killed, but there is no other reaction; after 6 hours the system is blocked
- Killed the mongod process, installed more memory (16 GB): index created in 3 hours

Problem 2: The sharding key has to be changed (no possibility of dividing a segment any further)
- Export the data, reimport the data: wasting lots of time and resources
- Lesson: find a key that fits your needs for years, even if the requirements change sometimes (and they will) -> mission impossible in very large systems

Problem 3: The size of data and index (on disk and in memory) is much bigger than expected!?

MongoDB Tests 2: Advanced Queries

Setup: 3 sharding servers (AMD hexa-core, 8 GB RAM, SSD), 1 billion records input (~300-400 million per shard)

Problem 1: Counting rows with DISTINCT is damn slow
- Tried to precalculate the counters overnight: doesn't cover all needs, because users can influence the queries
- Map/reduce is not an option for just-in-time results

Problem 2: Using query operators on non-indexed fields is very slow; the data needs to be kept in memory as well as the indexes (more memory!)
- It is not possible to cover every query situation with an index if we don't want to spend lots of money on memory

Schemaless is fine, but the need to keep the indexes in memory looks like a blocker that MySQL does not have.

MySQL Approach (v5.5)

Good:
- Well-known and powerful SQL
- Replication (not needed here)
- Good datatypes
- Small data on disk because of relations
- Smaller indexes because of better datatypes

Not so good:
- No sharding!
- No REST, no JSON
- More development needed on the application side (watch the costs)

Datasize Comparison (not relational)

Fields               Type (MySQL)              Avg. bytes   Type (MongoDB)         Avg. bytes
6x text fields       VARCHAR(3-1000)           346          Varchar (3-1000)       346
6x tiny attributes   TINYINT (1 byte each)     6            64-bit integer         48
6x small attributes  SMALLINT (2 bytes each)   12           64-bit integer         48
10x numbers          INT (4 bytes each)        40           64-bit integer         80
3x large numbers     BIGINT (8 bytes each)     24           64-bit integer         24
2x doubles           DOUBLE (8 bytes each)     16           Float (8 bytes each)   16
6x relational IDs    6 INTs, 2 BIGINTs         38           --                     0
Sharding key         --                        0            2x 64-bit integer      32
Field names          --                        0            33x 2 bytes            66
Sum per record                                 482 bytes                           660 bytes

Relational: MySQL relational datasize (4 tables): 148 bytes per record!

Datasize Comparison (real world)

MySQL:
- One record: 148 bytes (relational)
- No. of records: 112 billion
- Datasize: 15.1 terabyte
- Indexsize: 5.5 terabyte

MongoDB:
- One record: 660 bytes (document)
- No. of records: 112 billion
- Datasize: 67.2 terabyte
- Indexsize: 11.4 terabyte

Looks like we will run into a little cost problem with MongoDB ...
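The totals above follow from simple arithmetic; the slide's numbers correspond to binary terabytes (2^40 bytes). A quick check:

```python
RECORDS = 112_000_000_000   # links in the backlink graph
TIB = 2 ** 40               # binary terabyte

mysql_tib = 148 * RECORDS / TIB   # relational record, 4 tables
mongo_tib = 660 * RECORDS / TIB   # document record

print(round(mysql_tib, 1))  # 15.1
print(round(mongo_tib, 1))  # 67.2
```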

MongoDB Datasize: The Fieldname Problem

Fieldnames are stored in MongoDB with every single data record. That's necessary, because it is schemaless: fieldnames are stored in every row, uncompressed.

If one data record consists of 33 fields in our scenario and the fieldnames are only 2 bytes long ("aa", "ab", "ac", ...), we will need:
- Fieldname size per data record: 33 fields x 2 bytes = 66 bytes
- Fieldname size for 112 billion data records: 66 bytes * 112 billion = 6.72 terabyte! (for the names alone)

This is where part of your disk space goes in MongoDB. Maybe schemaless is not that big an advantage ...
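The same arithmetic confirms the field-name overhead on its own:

```python
RECORDS = 112_000_000_000

name_bytes = 33 * 2                        # 33 fields, 2-byte names: 66 bytes/record
total_tib = name_bytes * RECORDS / 2**40   # binary terabytes

print(name_bytes)           # 66
print(round(total_tib, 2))  # 6.72
```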

Price on Amazon Cloud (EC2) in Ireland

MySQL:
- Datasize: 15.1 terabyte, indexsize: 5.5 terabyte
- Servers: 107 (with 75% of memory for the index)
- $/month: $158,360 (admin & traffic costs come on top)

MongoDB:
- Datasize: 68.5 terabyte, indexsize: 11.4 terabyte
- Servers: 221 (with 75% of memory for the index)
- $/month: $327,080 (admin & traffic costs come on top)

Instance type: Amazon EC2 High-Memory Quadruple Extra Large
- Memory: 68.4 GB, disk: 1.6 terabyte
- $/hour: $2.024, $/month: $1,480
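The monthly totals are just server count times the per-instance monthly price:

```python
PRICE_PER_MONTH = 1480  # USD per High-Memory Quadruple XL instance

mysql_monthly = 107 * PRICE_PER_MONTH
mongo_monthly = 221 * PRICE_PER_MONTH

print(mysql_monthly)  # 158360
print(mongo_monthly)  # 327080
```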

MySQL Challenge

Assumption: sharding in MySQL is not so difficult if we can avoid some traps.

To do:
- No cross-shard joining
- No cross-shard distinct counting
- Generate unique keys in the application
- Find a schema which fits the above criteria (we luckily found one, after two weeks in the think tank, but there will be use cases which will never work in such a setup)
- Set up the shards as standalone MySQL servers, with no dependencies
- Build a data-deployment mechanism for the shards (remember: unique IDs)
- Build an application which is able to request data from the shards in parallel and then perform aggregations, sorts, ...

Problems:
- Network latency
- Caching at the application level is fine-grained, but can be tricky
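The plan above, standalone shards queried in parallel with aggregation done in the application, can be sketched as follows. In-memory SQLite databases stand in for the standalone MySQL shards; the schema and the merge step are illustrative assumptions, not the production code:

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def make_shard(rows):
    """Stand-in for one standalone MySQL shard (no cross-shard dependencies)."""
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute("CREATE TABLE backlinks (target_host TEXT, n INTEGER)")
    conn.executemany("INSERT INTO backlinks VALUES (?, ?)", rows)
    return conn

# Unique keys are generated in the application, so each row lives on
# exactly one shard and the shards never need to talk to each other.
shards = [
    make_shard([("shop.example.com", 3), ("docs.example.com", 1)]),
    make_shard([("shop.example.com", 2), ("wiki.example.io", 5)]),
]

def query_shard(conn):
    # Each shard computes its local aggregate only -- no cross-shard joins.
    return conn.execute(
        "SELECT target_host, SUM(n) FROM backlinks GROUP BY target_host"
    ).fetchall()

# Request all shards in parallel, then merge, aggregate and sort
# in the application.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(query_shard, shards))

totals = {}
for part in partials:
    for host, n in part:
        totals[host] = totals.get(host, 0) + n

result = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
print(result)
```

The merge step is where the listed problems live: every cross-shard aggregation crosses the network, and its results are what the application-level cache has to keep fresh.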

We Are Hiring: Mad Scientist* (m/f)

- You've trained a support vector machine which selects your morning flakes by weather, milk temperature, blood pressure and thirty other features.
- TF/IDF and BM25 are so 90s. When you are looking for your door key, your search engine ranks listwise.
- You didn't join the Yahoo Learning to Rank challenge because you wouldn't want to be a party pooper.
- With the arrival of map/reduce you gave a lol about the new funny name for the good old scatter/gather.
- Your brain is parted into stack and heap; programming a Doom clone in Whitespace was fun for an afternoon.

The Mad Scientist is just another step on your way to Evil Genius. Hundreds of servers, billions of datasets, sharding, replication, coffee, pizza, coke, more coffee and colleagues dressed in black. Welcome to the real world, Neo.

*job title may differ on business card

Thanks a lot! Searchmetrics Inc. 2011