NoSQL. Thomas Neumann 1 / 22

Similar documents

Structured Data Storage

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

NoSQL Data Base Basics

Comparison of the Frontier Distributed Database Caching System with NoSQL Databases

Can the Elephants Handle the NoSQL Onslaught?

A survey of big data architectures for handling massive data

extensible record stores document stores key-value stores Rick Cattel s clustering from Scalable SQL and NoSQL Data Stores SIGMOD Record, 2010

Lecture Data Warehouse Systems

Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University

NoSQL Database Options

Advanced Data Management Technologies

Cloud Scale Distributed Data Storage. Jürmo Mehine

An Approach to Implement Map Reduce with NoSQL Databases

Transactions and ACID in MongoDB

Preparing Your Data For Cloud

NoSQL in der Cloud Why? Andreas Hartmann

The Sierra Clustered Database Engine, the technology at the heart of

Infrastructures for big data

Introduction to NOSQL

Big Data and Apache Hadoop s MapReduce

Cloud Computing at Google. Architecture

How To Scale Out Of A Nosql Database

So What s the Big Deal?

Big Systems, Big Data

Big Data Solutions. Portal Development with MongoDB and Liferay. Solutions

Introduction to Apache Cassandra

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context

X4-2 Exadata announced (well actually around Jan 1) OEM/Grid control 12c R4 just released

Big Data: Opportunities & Challenges, Myths & Truths 資料來源 : 台大廖世偉教授課程資料

Composite Data Virtualization Composite Data Virtualization And NOSQL Data Stores

Study and Comparison of Elastic Cloud Databases : Myth or Reality?

2.1.5 Storing your application s structured data in a cloud database

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Introduction to Hadoop

Cluster Computing. ! Fault tolerance. ! Stateless. ! Throughput. ! Stateful. ! Response time. Architectures. Stateless vs. Stateful.

CS 4604: Introduc0on to Database Management Systems. B. Aditya Prakash Lecture #13: NoSQL and MapReduce

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

NoSQL Databases. Polyglot Persistence

MongoDB in the NoSQL and SQL world. Horst Rechner Berlin,

NoSQL Evaluation. A Use Case Oriented Survey

Evaluating NoSQL for Enterprise Applications. Dirk Bartels VP Strategy & Marketing

Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world

How To Write A Database Program

Challenges for Data Driven Systems

Big Data With Hadoop

Slave. Master. Research Scholar, Bharathiar University

The NoSQL Ecosystem, Relaxed Consistency, and Snoop Dogg. Adam Marcus MIT CSAIL

16.1 MAPREDUCE. For personal use only, not for distribution. 333

A programming model in Cloud: MapReduce

Hadoop implementation of MapReduce computational model. Ján Vaňo

Benchmarking and Analysis of NoSQL Technologies

Dominik Wagenknecht Accenture

Storage Systems Autumn Chapter 6: Distributed Hash Tables and their Applications André Brinkmann

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Why NoSQL? Your database options in the new non- relational world IBM Cloudant 1

Introduction to NoSQL Databases. Tore Risch Information Technology Uppsala University

The Quest for Extreme Scalability

NoSQL Databases. Nikos Parlavantzas

The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect

NoSQL Databases. Institute of Computer Science Databases and Information Systems (DBIS) DB 2, WS 2014/2015

Oracle Big Data SQL Technical Update

BRAC. Investigating Cloud Data Storage UNIVERSITY SCHOOL OF ENGINEERING. SUPERVISOR: Dr. Mumit Khan DEPARTMENT OF COMPUTER SCIENCE AND ENGEENIRING

NoSQL Systems for Big Data Management

Databases 2 (VU) ( )

Large scale processing using Hadoop. Ján Vaňo

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technologies Compared June 2014

CISC 432/CMPE 432/CISC 832 Advanced Database Systems

BIG DATA Alignment of Supply & Demand Nuria de Lama Representative of Atos Research &

NoSQL Performance Test In-Memory Performance Comparison of SequoiaDB, Cassandra, and MongoDB

How To Use Big Data For Telco (For A Telco)

CloudDB: A Data Store for all Sizes in the Cloud

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May Santa Clara, CA

Apache Hadoop FileSystem and its Usage in Facebook

LARGE-SCALE DATA STORAGE APPLICATIONS

Introducing DocumentDB

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Hadoop Ecosystem B Y R A H I M A.

CSE-E5430 Scalable Cloud Computing Lecture 2

Similarity Search in a Very Large Scale Using Hadoop and HBase

Extreme Computing. Big Data. Stratis Viglas. School of Informatics University of Edinburgh Stratis Viglas Extreme Computing 1

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

Big Data Storage, Management and challenges. Ahmed Ali-Eldin

Keywords: Big Data, HDFS, Map Reduce, Hadoop

these three NoSQL databases because I wanted to see a the two different sides of the CAP

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Integrating Big Data into the Computing Curricula

Open source, high performance database

Chapter 7: Distributed Systems: Warehouse-Scale Computing. Fall 2011 Jussi Kangasharju

Вовченко Алексей, к.т.н., с.н.с. ВМК МГУ ИПИ РАН

In Memory Accelerator for MongoDB

Transcription:

NoSQL Thomas Neumann 1 / 22

What are NoSQL databases? hard to say more a theme than a well defined thing Usually some or all of the following: no SQL interface no relational model / no schema no joins, emphasize on key/value pairs scale out to many machines weak or no consistency guarantees 2 / 22

Why not relational databases? Some commonly stated reasons: RDBMS are hard to use do not scale to web-scale relational model is too restrictive NoSQL is faster, scales better Some of this is true (as we will see), but most likely will not affect you! 3 / 22

Illustrational Web Video MongoDB is web scale http://www.xtranormal.com/watch/6995033 4 / 22

The Performance Argument Voter benchmark: People call to vote for American Idol at most 12 votes are counted per caller-id very simple transaction model On a 8 core Intel Xeon X5570 with 64GB main memory: MongoDB: ca. 10,000 transactions per second relational main-memory database: ca. 1,000,000 transactions per second Do not blindly follow a hype, do the math! 5 / 22

Sucess stories from the net Why we chose MongoDB [...] Very easy to install. [...] Very easy replication [...] We cut down the names to 2-3 characters. This is a little more confusing in the code but the disk storage savings are worth it [...] a massive saving. [...] Was it the right move? Yes. MongoDB has been an excellent choice [...] MongoDB is going to be very cool! MongoDB works fine, but the same query is 25 times faster in PostgreSQL [...] MongoDB will win once I have 26 machines 6 / 22

Technical Arguments in favor of NoSQL CAP-Theorem: In a distributed system you can only have two of the following Consistency all nodes see the same data at the same time Availability node failures do not prevent survivors from continuing to operate Partition Tolerance he system continues to operate despite arbitrary message loss Basis for the claim the RDBMS are not web-scale 7 / 22

Scalability How to scale to thousands of nodes? traditional RDBMS usually scale to less than 100 node transaction semantic requires a lot of coordination two phase commit is expensive O(n 2 ) network connections does not scale to thousands of nodes Partitioning helps, but usually requires human interaction. 8 / 22

Key/Value - Stores Life would be much simpler if we only stored key/value pairs only (or mostly) point-access transactions operate on a single item allows for simple partitioning by spreading keys over nodes we distribute the data usually scales perfectly Life is much simpler if you only care about individual values... 9 / 22

Distributed Hash Tables Basis for many distributed storage schemes: spread a hash table over a large number of nodes nodes can enter and leave (more or less) at will nodes know only a few other nodes offers scalable distributed storage Many algorithms exists: Chord, Pastry, P-Grid, etc. usual idea: hash nodes into hash domain nodes responsible to for hash values near to them 10 / 22

Distributed Hash Tables - Chord Both items and nodes are hashed into ring structure Finger tables similar to skip lists 11 / 22

Expressiveness DHTs are one giant Key Value table only three operations: lookup, insert, delete each operations is limited to a single data item range queries not supported efficiently This is a severe limitation. Scalability is obtained by eliminating functionality 12 / 22

What about Consistency? We want our database to be consistent as long as transactions operate on single items (note: strong restriction) life is relatively simple one node is responsible for the data item as long as all changes are atomic or idempotent everything is fine But: nodes will be replicated for availability ensuring consistency adds the same costs as in standard distributed RDBMSs most systems aim at eventual consistency after waiting long enough (without updates in between), all replicas will have the same value This is usually unacceptable if the data is valuable (e.g., involves money)! 13 / 22

What about Multi-Item Transactions? Short answer: not supported Long answer: not supported very well one can ignore the issue and run multi single-level transactions completely messes up consistency some systems offer explicit locking some problems as in distributed RDBMSs 14 / 22

How to Query the Data Data is spread over thousands of nodes point query are supported by DHTs range queries are not aggregation queries are very important Requires some very heavy machinery inside the NoSQL database query response time usually multiple seconds, even minutes not really suited for interactive queries data will change during query execution usually queries inconsistent data 15 / 22

Map/Reduce Programming paradigm that allows for easy parallelization. Sequence of two operations: 1. Map: (k 1, v 1 ) list(k 2, v 2 ) constructs key-value pairs from input pair can be computed in parallel, no interaction 2. Reduce: (k 2, list(v 2 ) list(k 3, v 3 ) reduces all k 2 pairs into one (or more) value different k 2 s can be parallelized Simple, scalable scheme, but involves massive movement of data 16 / 22

Map/Reduce - Canonical Example Word count is the classical example: 1. map(documentid,document) for each word w in document emit (w,1) 2. reduce(word,counts) count=0 for each c in counts count+ = c emit(word,count) Computes the frequency of each word 17 / 22

Map/Reduce - Database Queries Can also be used to query distributed key/value stores: 1. map(customerid,customerdata) emit (customerid,customerdata.amount) 2. reduce(customerid,revenues) sum = 0 for each r in revenues sum+ = r emit(customerid,r) 3. reduce(customerid,revenue) if revenue > 10000 emit (customerid,revenue) executed across all nodes very heavy operation 18 / 22

Systems There is a huge number of NoSQL systems around BigTable key/value store used inside Google, row/column/time dimensions, slicing Casandra key/value store with tunable consistency MongoDB document centric, JavaScript driven, relatively rich queries CouchDB document centric, JavaScript driven, MVCC Dynamo, Project Voldemort, Hbase,... Unfortunately all incompatible, all different in some aspects 19 / 22

Who needs this ultra-scalability? Going fully web-scale makes sense in a few cases: the data amount is huge petabytes of data consistency is not important click streams, not payment data access is mostly single-item more complicated queries are expensive But: very few companies have these characteristics 20 / 22

The real reason why (some) people use NoSQL: Money A 1TB main-memory machine costs ca. 60K most people do not have large amounts of data anyway or if they have, the data is not that important enterprise database systems are expensive NoSQL products tend to be free or cheap startups do not have money But is this really an argument for NoSQL? 21 / 22

Conclusion NoSQL is a fuzzy term, but usually stores non-relational data aims at scalability to thousands of nodes sacrifices consistency support mainly simple queries efficiently Mainly makes sense if data is really huge and not very valuable Otherwise, use a RDBMS! 22 / 22