Comparison of the Frontier Distributed Database Caching System with NoSQL Databases



Similar documents
Comparison of the Frontier Distributed Database Caching System to NoSQL Databases

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

An Approach to Implement Map Reduce with NoSQL Databases

Can the Elephants Handle the NoSQL Onslaught?

Hadoop IST 734 SS CHUNG

Why NoSQL? Your database options in the new non- relational world IBM Cloudant 1

Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world

In Memory Accelerator for MongoDB

Lecture Data Warehouse Systems

NoSQL Data Base Basics

Not Relational Models For The Management of Large Amount of Astronomical Data. Bruno Martino (IASI/CNR), Memmo Federici (IAPS/INAF)

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

Structured Data Storage

extensible record stores document stores key-value stores Rick Cattel s clustering from Scalable SQL and NoSQL Data Stores SIGMOD Record, 2010

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

NoSQL. Thomas Neumann 1 / 22

Cloud Scale Distributed Data Storage. Jürmo Mehine

Introduction to Big Data Training

MongoDB in the NoSQL and SQL world. Horst Rechner Berlin,

Database Scalability and Oracle 12c

An Open Source NoSQL solution for Internet Access Logs Analysis

nosql and Non Relational Databases

Open source large scale distributed data management with Google s MapReduce and Bigtable

Google Bing Daytona Microsoft Research

NoSQL in der Cloud Why? Andreas Hartmann

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

Open Source Technologies on Microsoft Azure

Development of nosql data storage for the ATLAS PanDA Monitoring System

MapReduce with Apache Hadoop Analysing Big Data

Challenges for Data Driven Systems

Introduction to NOSQL

Microsoft Azure Data Technologies: An Overview

Hadoop Ecosystem B Y R A H I M A.

The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect

Slave. Master. Research Scholar, Bharathiar University

Large scale processing using Hadoop. Ján Vaňo

Big Data with Component Based Software

Making Sense ofnosql A GUIDE FOR MANAGERS AND THE REST OF US DAN MCCREARY MANNING ANN KELLY. Shelter Island

NoSQL Databases. Institute of Computer Science Databases and Information Systems (DBIS) DB 2, WS 2014/2015

A COMPARATIVE STUDY OF NOSQL DATA STORAGE MODELS FOR BIG DATA

Sentimental Analysis using Hadoop Phase 2: Week 2

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

NoSQL and Hadoop Technologies On Oracle Cloud

Hadoop and Map-Reduce. Swati Gore

How To Scale Out Of A Nosql Database

INTRODUCTION TO CASSANDRA

Understanding NoSQL Technologies on Windows Azure

Department of Software Systems. Presenter: Saira Shaheen, Dated:

How To Handle Big Data With A Data Scientist

NoSQL for SQL Professionals William McKnight

NoSQL: Going Beyond Structured Data and RDBMS

Practical Cassandra. Vitalii

Comparing SQL and NOSQL databases

Advanced Data Management Technologies

Cloud Computing at Google. Architecture

Using RDBMS, NoSQL or Hadoop?

How To Use Big Data For Telco (For A Telco)

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

NOSQL INTRODUCTION WITH MONGODB AND RUBY GEOFF

Вовченко Алексей, к.т.н., с.н.с. ВМК МГУ ИПИ РАН

The NoSQL Ecosystem, Relaxed Consistency, and Snoop Dogg. Adam Marcus MIT CSAIL

Understanding NoSQL on Microsoft Azure

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

BIG DATA TOOLS. Top 10 open source technologies for Big Data

Putting Apache Kafka to Use!

Integrating Big Data into the Computing Curricula

So What s the Big Deal?

A Performance Analysis of Distributed Indexing using Terrier

NoSQL, Big Data, and all that

Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

DROSS Distributed & Resilient Open Source Software

GigaSpaces Real-Time Analytics for Big Data

NoSQL Databases. Nikos Parlavantzas

A survey of big data architectures for handling massive data

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Cassandra A Decentralized Structured Storage System

these three NoSQL databases because I wanted to see a the two different sides of the CAP

Hadoop implementation of MapReduce computational model. Ján Vaňo

Processing of massive data: MapReduce. 2. Hadoop. New Trends In Distributed Systems MSc Software and Systems

Infrastructures for big data

NoSQL systems: introduction and data models. Riccardo Torlone Università Roma Tre

Scalability of web applications. CSCI 470: Web Science Keith Vertanen

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.

Big Data. Facebook Wall Data using Graph API. Presented by: Prashant Patel Jaykrushna Patel

NoSQL Database Options

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014

Big data blue print for cloud architecture

Internet Scale Storage Microsoft Storage Community

NOSQL DATABASES AND CASSANDRA

Preparing Your Data For Cloud

Big Data Storage Architecture Design in Cloud Computing

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Getting Started with SandStorm NoSQL Benchmark

BIG DATA Alignment of Supply & Demand Nuria de Lama Representative of Atos Research &

Hadoop Distributed File System (HDFS) Overview

Hadoop and Relational Database The Best of Both Worlds for Analytics Greg Battas Hewlett Packard

Transcription:

Comparison of the Frontier Distributed Database Caching System with NoSQL Databases Dave Dykstra dwd@fnal.gov Fermilab is operated by the Fermi Research Alliance, LLC under contract No. DE-AC02-07CH11359 with the United States Department of Energy

Outline Common characteristics of NoSQL databases The Slashdot Effect Frontier Distributed Database Caching system characteristics CMS Frontier/Squid deployment examples Comparison of Frontier to NoSQL in general Comparisons to MongoDB, CouchDB, Hadoop HBase, and Cassandra Conclusions 2

The Slashdot Effect Slashdot Effect (or, slashdotting): when a toosmall server is overwhelmed by the same request from too many clients Named for slashdot.org, a very popular News for Nerds website that often hyperlinks to less-popular sites For web servers, usual solution is to use a Content Delivery Network (CDN) that either replicates or caches the objects around the world Some database applications have similar need 3

NoSQL common characteristics NoSQL denotes a large variety of Database Management Systems (DBMS) Primary unifying characteristic: not a Relational Database Management System (RDBMS) Generally nested key/value instead of row/column Run-time flexibility, doesn't need pre-defined schemas Most don't support the RDBMS standard Structured Query Language SQL Most popular NoSQL DBs support being distributed and fault-tolerant highly scalable on commodity HW Most give up atomicity of updates (ACID) and instead have eventual consistency (BASE) 4

Frontier characteristics The Frontier Distributed Database Caching System is designed for the Slashdot Effect many readers of same data, few writers Distributes RDBMS SQL queries (not NoSQL ) RESTful, so cacheable with standard web proxy caches (we use Squid) Web caches on client premises make ideal CDN Most network traffic on LAN, scalable as needed Practically maintenance-free Simultaneous same requests collapsed to one Simultaneous different requests queued at server 5

CMS Offline Frontier/Squid Conditions deployment RDBMS Offline Frontier Servers Tomcat+ Servlet+ Squid Tier0 Squids Wide Area Network Tier0 Farm Tier1, 2, 3 Squids Tier1, 2,3 Squids TierN Farm TierN Farm Only custom software is Frontier servlet in Tomcat and frontier_client in application on worker node farms Planning to replicate RDBMS & Frontier servers for availability 6

CMS Offline Frontier/Squid Conditions stats For Tier 0, 1, & 2 (not counting Tier 3): Average 250 job starts per minute worldwide Average 500,000 total Frontier requests per minute, aggregate average total 500MB/s Bursts at sites are much higher than average The 3 central server Squids at CERN only get 4,000 average requests per minute, 0.5MB/s Factor of 125 improvement on requests and 1000 on bandwidth (not counting Tier 3) Vast majority of jobs read very quickly because results already cached & valid in local Squids 7

Squid & Frontier limits Frontier tomcat server 3-year old 8-core machine (Xeon L5420 @ 2.5Ghz): Without compression, easily saturates 1Gbit network out With gzip compression, drops to 25MB/s out (but saves much bandwidth later in the caches) Adds 1/3rd overhead before gzip to avoid binary data Squid 2-year old machine (Xeon E5430 @ 2.66Ghz): Saturates 2Gbit network with one single-thread Squid modern machine (AMD Opteron 6140): Up to 7Gbps on 10Gbit network with a single-thread Squid Can get full throughput with two Squids on same port 8

CMS Online Frontier/Squid Conditions deployment RDBMS Online Frontier Servers Tomcat+ Servlet+ Squid Squids Squid placement is very flexible for more bandwidth Hierarchy of Squids on every worker node Blasts data to all 1400 nodes in parallel 9

Frontier vs. NoSQL in general Frontier NoSQL in general DB structure Row/column Nested key/value Consistency Eventual Eventual Write model Central writing Distributed writing Read model Many readers same data Read many different data Data model Central data, cache on demand Distributed data, copies Distributed elements General purpose Special purpose 10

MongoDB Mongo for humongous - for big, cheap data Stores binary JSON (JavaScript Object Notation) data Any field can be memory-indexed for performance Common in RDBMS, not common in NoSQL Flexible queries By fields, ranges, and regular expressions Similar to RDBMS, not common in NoSQL Only one write server per data item Copies are read-only, can take over as master if master goes down 11

MongoDB cont'd Scales by sharding, splitting writing of different data to different servers Not great at Slashdot effect Used by CMS for Data Aggregation Service (DAS) Needed the dynamic structure, liked other features Not a big installation though, only one server Supports MapReduce for distributing query processing to where the data is An ATLAS evaluation showed this didn't work well but it is supposed to be better now in version 2.0 12

CouchDB Stores JSON RESTful interface Can use http proxy caches where needed Also easy to insert authentication proxy Automated, low-maintenance replication All copies get all data, all can read and write Uses MultiVersion Concurrency Control (MVCC) Feature of RDBMS transactions, ACID Readers get consistent view Writing doesn't block reading Write conflicts automatically detected and aborted 13

CouchDB cont'd Supports MapReduce Used by CMS for some data and workflow management queues, job state machine CouchDB data replicated between CERN and Fermilab, 3 replicas at CERN and 4 at Fermilab 14

Hadoop HBase HBase is built on Hadoop Distributed FileSystem HDFS automatically distributes files and replicates them across a cluster Very reliable and automated for large amount of data Modeled after Google's BigTable Billions of rows with millions of columns Good for search engine-like applications Very good at MapReduce Good for big installations, not small 15

HBase cont'd Tunable replication level Also has SQL interface via Hive add-on Used by ATLAS distributed data manager for log analysis and accounting on a 12-node cluster 8 to 20 times faster than Oracle for accounting summary, depending on replication level HBase recognized by the WLCG Database Technical Evolution Group as having greatest potential impact of all the NoSQL technologies CERN IT is setting up a cluster 16

Cassandra Like HBase, modeled after Google BigTable All nodes are masters, decentralized control for geographically distributed fault tolerance Dynamic re-configuration with no downtime Keys and values can be any arbitrary data Has static column families used like indexes in RDBMS Tunable consistency from always consistent to eventual consistency Tunable replication level 17

Cassandra cont'd Originally written by Facebook, but they abandoned it in favor of HBase Used in production by ATLAS PanDA monitoring system Hosted on 3 high-power nodes at BNL, 12 hyperthreaded cores each, 1TB of RAID0 SSDs each 18

Conclusions NoSQL databases have a wide variety of characteristics, including scalability Frontier+Squid easily & efficiently add scalability to Relational databases when there are many readers of the same data Also enables clients to be geographically distant CouchDB with REST can have same scalability Hadoop HBase has most potential for big apps There are good applications in HEP for many different Database Management Systems 19