Summary of ALMA-OSF's Evaluation of MongoDB for Monitoring Data
Heiko Sommer, June 13, 2013
Heavily based on the presentation by Tzu-Chiang Shen and Leonel Peña
ALMA Integrated Computing Team Coordination & Planning Meeting #1, Santiago, 17-19 April 2013
Monitoring Storage Requirements

Expected data rate with 66 antennas:
- 150,000 monitor points (MPs) in total; MPs get archived once per minute
- ~1 minute of MP data is bucketed into a clob: 2,500 clobs/s
- With dependent-MP demultiplexing and fluctuations: ~7,000 clobs/s
- ~25-30 GB/day, ~10 TB/year (equivalent to ~310 KByte/s, or ~2.485 Mbit/s)

Monitoring data characteristics:
- Simple data structure: [ID, timestamp, value]
- But a huge amount of data
- Read-only data
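As a sanity check, the quoted figures are mutually consistent. A minimal sketch (assuming 150,000 monitor points, one archived sample per MP per minute, and the quoted sustained rate of ~310 KByte/s):

```javascript
// Back-of-envelope check of the monitoring data rates quoted above.
const monitorPoints = 150000;
const clobsPerSecond = monitorPoints / 60;       // one clob per MP per minute -> 2500/s baseline

const bytesPerSecond = 310e3;                    // quoted sustained rate, ~310 KByte/s
const mbitPerSecond = bytesPerSecond * 8 / 1e6;  // -> ~2.48 Mbit/s
const gbPerDay = bytesPerSecond * 86400 / 1e9;   // -> ~26.8 GB/day, within the 25-30 GB/day range
console.log(clobsPerSecond, mbitPerSecond, gbPerDay);
```

The ~7,000 clobs/s figure then reads as the 2,500 clobs/s baseline plus demultiplexing of dependent MPs and rate fluctuations.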
Prior DB Investigations

- Oracle: see Alisdair's slides.
- MySQL: query problems, similar to Oracle.
- HBase (2011-08): got stuck with Java client problems; poor support from the community.
- Cassandra (2011-10): keyspace/replicator issue resolved, but poor insert performance (only 270 inserts/minute, of unclear size); clients froze.

These experiments were done with only some help from archive operators, not in the scope of a student's thesis as was later the case with MongoDB. Administrative complexity was also mentioned, without details.
Very Brief Introduction to MongoDB

NoSQL and document-oriented. The storage format is BSON, a variation of JSON.

Terminology mapping:
  SQL:      Database | Table      | Row      | Field | Index
  MongoDB:  Database | Collection | Document | Field | Index

Documents within a collection can differ in structure. For monitor data we don't really need this freedom.

Other features: sharding, replication, aggregation (Map/Reduce).
Very Brief Introduction to MongoDB

A document in MongoDB:

{
  _id: ObjectId("509a8fb2f3f4948bd2f983a0"),
  user_id: "abc123",
  age: 55,
  status: 'A'
}
Schema Alternatives 1.) One MP value per doc

One MP value per document: one MongoDB collection in total, or one per antenna.
Schema Alternatives 2.) MP clob per doc

One clob (~1 minute of flattened MP data) per document: one collection per antenna / other device.
Schema Alternatives 3.) Structured MP/day/doc

- One monitor point data structure per day
- Monthly database
- Shard key = antenna + MP, which keeps matching documents on the same node
- Updates of pre-allocated documents
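A sketch of what a variant-3 document could look like, with values nested hour → minute → second. The metadata field names follow the queries shown later (metadata.date, metadata.antenna, etc.); the exact value types and pre-allocation details here are assumptions:

```javascript
// One monitor point, one day, one document; the "hourly" tree is
// pre-allocated so that incoming samples become in-place updates
// (avoiding document growth and relocation on disk).
const doc = {
  metadata: {
    date: "2012-9-15",
    antenna: "DV10",
    component: "FrontEnd/Cryostat",
    monitorpoint: "GATE_VALVE_STATE",
  },
  hourly: {},
};
for (let h = 0; h < 24; h++) {
  doc.hourly[h] = {};
  for (let m = 0; m < 60; m++) {
    doc.hourly[h][m] = {};
    for (let s = 0; s < 60; s++) doc.hourly[h][m][s] = null;
  }
}

// A sample arriving at 15:29:18 is a single positional write,
// matching the "hourly.15.29.18" dotted path used in the queries:
doc.hourly[15][29][18] = 1;
```

In MongoDB such a write would be an update on the path "hourly.15.29.18" against the document selected by the metadata fields.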
Analysis

Advantages of variant 3.):
- Fewer documents within a collection: ~150,000 documents per day
- The number of index entries is lower as well
- No data fragmentation problem
- Once a specific document is identified (O(log n) via the index), access to a specific range or a single value is O(1)
- Smaller ratio of metadata to data
How would a query look like?

Query to retrieve a value with seconds-level granularity, e.g. to get the value of FrontEnd/Cryostat/GATE_VALVE_STATE at 2012-09-15T15:29:18:

db.monitordata_[month].findOne(
  { "metadata.date": "2012-9-15",
    "metadata.monitorpoint": "GATE_VALVE_STATE",
    "metadata.antenna": "DV10",
    "metadata.component": "FrontEnd/Cryostat" },
  { "hourly.15.29.18": 1 }
);
How would a query look like?

Query to retrieve a range of values, e.g. to get the values of FrontEnd/Cryostat/GATE_VALVE_STATE during minute 29 (at 2012-09-15T15:29):

db.monitordata_[month].findOne(
  { "metadata.date": "2012-9-15",
    "metadata.monitorpoint": "GATE_VALVE_STATE",
    "metadata.antenna": "DV10",
    "metadata.component": "FrontEnd/Cryostat" },
  { "hourly.15.29": 1 }
);
Indexes

A typical query is restricted by: antenna name, component name, monitor point, and date.

db.monitordata_[month].ensureIndex(
  { "metadata.antenna": 1,
    "metadata.component": 1,
    "metadata.monitorpoint": 1,
    "metadata.date": 1 }
);
Testing Hardware / Software

A cluster of two nodes was created:
- CPU: Intel Xeon quad-core X5410
- RAM: 16 GByte
- Swap: 16 GByte
- OS: RHEL 6.0, kernel 2.6.32-279.14.1.el6.x86_64
- MongoDB v2.2.1
Testing Data

Real data from Sep-Nov 2012 was used initially, but then a tool to generate random data was implemented:
- Month: 1 (February)
- Number of days: 11
- Number of antennas: 70
- Number of components per antenna: 41
- Monitor points per component: 35
- Total daily documents: 100,450
- Total documents: 1,104,950
- Average size per document: 1.3 MB
- Size of the collection: 1,375.23 GB
- Total index size: 193 MB
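The generated-data figures are internally consistent with one document per monitor point per day (variant 3); a quick check:

```javascript
// Document counts implied by the generator parameters above.
const antennas = 70;
const componentsPerAntenna = 41;
const mpsPerComponent = 35;
const days = 11;

const docsPerDay = antennas * componentsPerAntenna * mpsPerComponent; // -> 100450
const totalDocs = docsPerDay * days;                                  // -> 1104950
console.log(docsPerDay, totalDocs);
```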
Database Statistics
Data Sets
Schema 1: One Sample of Monitoring Data per Document
Proposed Schema:
More Tests

For more tests, see https://adcwiki.alma.cl/bin/view/software/highvolumedatatestingusingmongodb
TODO

- Test performance of aggregations / combined queries
- Use Map/Reduce to create statistics (max, min, avg, etc.) over ranges of data, to improve the performance of queries such as: find monitor points with values >= 10
- Test performance with a year's worth of data
- Stress tests with a large number of concurrent queries
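As a toy illustration (not part of the evaluation) of the statistics the TODO proposes to pre-compute with Map/Reduce, here is the same min/max/avg reduction over a small in-memory sample array; the field names and values are made up:

```javascript
// In-memory sketch of per-monitor-point min/max/avg, the kind of
// pre-aggregated statistics that would speed up "value >= 10" queries.
const samples = [
  { mp: "GATE_VALVE_STATE", value: 0 },
  { mp: "GATE_VALVE_STATE", value: 1 },
  { mp: "TEMP_SENSOR", value: 12.5 },
  { mp: "TEMP_SENSOR", value: 15.5 },
];

const stats = {};
for (const { mp, value } of samples) {
  if (!stats[mp]) stats[mp] = { min: Infinity, max: -Infinity, sum: 0, n: 0 };
  const s = stats[mp];
  s.min = Math.min(s.min, value);
  s.max = Math.max(s.max, value);
  s.sum += value;
  s.n += 1;
}
for (const s of Object.values(stats)) s.avg = s.sum / s.n;

console.log(stats);
```

In MongoDB the same reduction would run server-side, via mapReduce or the aggregation framework, with results stored per time range.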
Conclusion @ OSF

MongoDB is suitable as an alternative for permanent storage of monitoring data. The tests reported an ingestion rate of 25,000 clobs/s. The schema and indexes are fundamental to achieving millisecond-level response times.
Comments

What will the requirements be like?
- Only extraction by time interval and offline processing, or also data mining running on the DB?
- All queries ad hoc and responsive, or also batch jobs?
- Repair / flagging of bad data? Later reduction of redundancies?
- Can we hide the MP-to-document mapping from upserts/queries? Currently, queries have to patch together results at the 24-hour and monthly breaks.