NoSQL: Going Beyond Structured Data and RDBMS

Similar documents

Hypertable Architecture Overview

Cloud Scale Distributed Data Storage. Jürmo Mehine

How To Scale Out Of A Nosql Database

NoSQL Data Base Basics

An Approach to Implement Map Reduce with NoSQL Databases

Analytics March 2015 White paper. Why NoSQL? Your database options in the new non-relational world

Integrating Big Data into the Computing Curricula

Bigtable is a proven design Underpins 100+ Google services:

Why NoSQL? Your database options in the new non- relational world IBM Cloudant 1

Lecture Data Warehouse Systems

Challenges for Data Driven Systems

Can the Elephants Handle the NoSQL Onslaught?

Practical Cassandra. Vitalii

NoSQL Databases. Institute of Computer Science Databases and Information Systems (DBIS) DB 2, WS 2014/2015

MongoDB Developer and Administrator Certification Course Agenda

nosql and Non Relational Databases

Structured Data Storage

NoSQL Databases. Nikos Parlavantzas

MySQL és Hadoop mint Big Data platform (SQL + NoSQL = MySQL Cluster?!)

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

NoSQL replacement for SQLite (for Beatstream) Antti-Jussi Kovalainen Seminar OHJ-1860: NoSQL databases

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Building Your First MongoDB Application

INTRODUCTION TO CASSANDRA

Big Systems, Big Data

Overview of Databases On MacOS. Karl Kuehn Automation Engineer RethinkDB

Open Source Technologies on Microsoft Azure

Open source, high performance database

MapReduce with Apache Hadoop Analysing Big Data

these three NoSQL databases because I wanted to see a the two different sides of the CAP

extensible record stores document stores key-value stores Rick Cattel s clustering from Scalable SQL and NoSQL Data Stores SIGMOD Record, 2010

NoSQL and Hadoop Technologies On Oracle Cloud

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

L7_L10. MongoDB. Big Data and Analytics by Seema Acharya and Subhashini Chellappan Copyright 2015, WILEY INDIA PVT. LTD.

In Memory Accelerator for MongoDB

MongoDB in the NoSQL and SQL world. Horst Rechner Berlin,

Comparing SQL and NOSQL databases

Hacettepe University Department Of Computer Engineering BBM 471 Database Management Systems Experiment

SURVEY ON MONGODB: AN OPEN- SOURCE DOCUMENT DATABASE

Вовченко Алексей, к.т.н., с.н.с. ВМК МГУ ИПИ РАН

.NET User Group Bern

Comparison of the Frontier Distributed Database Caching System with NoSQL Databases

Hadoop Ecosystem B Y R A H I M A.

Evaluation of NoSQL databases for large-scale decentralized microblogging

NoSQL Database Options

Application of NoSQL Database in Web Crawling

NoSQL - What we ve learned with mongodb. Paul Pedersen, Deputy CTO paul@10gen.com DAMA SF December 15, 2011

Understanding NoSQL Technologies on Windows Azure

Benchmarking Couchbase Server for Interactive Applications. By Alexey Diomin and Kirill Grigorchuk

Understanding NoSQL on Microsoft Azure

The MongoDB Tutorial Introduction for MySQL Users. Stephane Combaudon April 1st, 2014

The NoSQL Ecosystem, Relaxed Consistency, and Snoop Dogg. Adam Marcus MIT CSAIL

An Open Source NoSQL solution for Internet Access Logs Analysis

BIG DATA TOOLS. Top 10 open source technologies for Big Data

Advanced Data Management Technologies

HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367

Introduction to Hbase Gkavresis Giorgos 1470

Big Data With Hadoop

Scaling up = getting a better machine. Scaling out = use another server and add it to your cluster.

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Cloud Computing at Google. Architecture

The evolution of database technology (II) Huibert Aalbers Senior Certified Executive IT Architect

Not Relational Models For The Management of Large Amount of Astronomical Data. Bruno Martino (IASI/CNR), Memmo Federici (IAPS/INAF)

Big Data. Facebook Wall Data using Graph API. Presented by: Prashant Patel Jaykrushna Patel

HDB++: HIGH AVAILABILITY WITH. l TANGO Meeting l 20 May 2015 l Reynald Bourtembourg

LARGE-SCALE DATA STORAGE APPLICATIONS

NoSQL systems: introduction and data models. Riccardo Torlone Università Roma Tre

Dr. Chuck Cartledge. 15 Oct. 2015

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Preparing Your Data For Cloud

Cassandra A Decentralized, Structured Storage System

Chapter 7. Using Hadoop Cluster and MapReduce

Introduction to Apache Cassandra

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Introduction to Cassandra

Sentimental Analysis using Hadoop Phase 2: Week 2

NOSQL DATABASES AND CASSANDRA

A programming model in Cloud: MapReduce

Document Oriented Database

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

F1: A Distributed SQL Database That Scales. Presentation by: Alex Degtiar (adegtiar@cmu.edu) /21/2013

SQL VS. NO-SQL. Adapted Slides from Dr. Jennifer Widom from Stanford

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform

X4-2 Exadata announced (well actually around Jan 1) OEM/Grid control 12c R4 just released

Big Data with Component Based Software

Data storing and data access

Moving From Hadoop to Spark

MADOCA II Data Logging System Using NoSQL Database for SPring-8

NoSQL Performance Test In-Memory Performance Comparison of SequoiaDB, Cassandra, and MongoDB

Technical Overview Simple, Scalable, Object Storage Software

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Hadoop IST 734 SS CHUNG

Getting Started with MongoDB

Oracle Big Data SQL Technical Update

NoSQL Systems for Big Data Management

Big data and urban mobility

How To Use Big Data For Telco (For A Telco)

Transcription:

NoSQL: Going Beyond Structured Data and RDBMS

Scenario Size of data >> disk or memory space on a single machine Store data across many machines Retrieve data from many machines Machine = Commodity machine

Facebook 2009 Facebook storage system includes 4000 MySQL servers Database servers not enough for large data demand from the users Facebook adds additional 2000 memcached servers: cache recently used query results in key-value stores kept in memory Example of a cluster of commodity machines: #servers 1000 RAM Capacity / Server Total RAM Capacity 64 GB 64 TB Each machine has a few TBs of disk space

Distributed File System - File system stored across many commodity hardware - Fault tolerant (hardware failure is a standard) - High throughput access (streaming access) and NOT low latency data access - Simple coherency access model: Write-once-read many - Write Model: Permission is granted to a process; modifies the primary chunk and other chunks; acknowledge from all the chunk servers (GFS) - Enables to store large dataset

Hypertable

Overview - Database system based on Google s Big Table - Runs on top of DFS (e.g. Hadoop, GlusterFS) - Maximum performance - Scalability - Written in C++ - Comprehensive Language Support (Java, PHP, Python, Perl, Ruby, C++, ) - Ease of Use - Good Documentation

Companies / Organizations - Baidu - ebay - IBM - Yelps - Groupon - Insparx -

Physical Layout RDBMS Hypertable Flattens out the table structure into a sorted list of key/value pairs, each one representing a cell in the table. Cells that are NULL are simply not included (good for sparse data) Redundancy in the row keys and column identifiers is minimal due to the block data compression.

Cell Versions Extends the traditional two-dimensional table model by adding a third dimension: timestamp Representing different versions of each table cell. Specified by MAX_VERSIONS parameter.

System Overview

Data Modelling: Logical View Each cell is identified by a row key and column name Two additional features: - Timestamp - Column qualifier A column actually defines a set of related columns knows as a column family family:qualifier

Data Modelling: Physical View

Access Groups CREATE TABLE User ( name, address, photo, profile, ACCESS GROUP default (name, address, photo), ACCESS GROUP profile (profile) ); SELECT profile from User;

Range Server: Insert

Range Server: Query

Cell Store Format Compressed blocks of cells: Series of sorted blocks of compressed sorted key/value pairs. Bloom Filter: Describes the keys that exist (with high likelihood) in the CellStore. Stores the information if a key is not present Avoid unnecessary block transfer and decompression.

Query Routing METADATA table that contains a row for each range in the system. Two-level hierarchy is overlaid on top of the METADATA table => Client library includes a METADATA cache (avoid walking through METADATA)

MongoDB

Overview Document-Oriented database system JSON-style documents Full Index Support Replication & High Availability Mirror access across LAN or WAN. Auto-Sharding Fast In-Place Updates GridFS file specialized data storage optimized for large documents Implemented in C++

NoSQL Document Model MongoDB CouchDB Graph Model Neo4j HyperGraphDB Key-Value/Wide Column Models Riak Redis HBase Cassandra

Partners - Cloud Partners - Amazon Web Services - Windows Azure - VMware - Hardware Partners - Fusion-IO - Intel

Companies using MongoDB CERN primary back-end for Data Aggregation System Forbes articles and company data Shutterfly photo storage platform with 7 million users Foursquare store venue and user check-ins Source Forge (information from Wikipedia)

Data Modelling MongoDB has a flexible schema References store relationships between data by including links or references Embedded Data denormalized data model

Replication and Sharding Data is replicated in a replica set data copied to multiple servers Each server session in a replica is supposed to be an exact copy of the primary. Secondary servers can become primary when primary fails.

Replication and Sharding -- system setup Multiple servers for sharding App Server processes queries Config Server manages meta data for shards Shard servers contains section of data

Replication and Sharding -- distributing data Range Based Sharding separate into small chucks. Each chuck has a continues section of primary key Hash Based Sharding Primary key is hashed before being placed into chunk

Replication and Sharding -- extra notes Load Balancing is done with chunks (64MB blocks) Migration happens when number of chucks in unbalanced Shards are replica sets replica sets are often >3 machines with duplicate data Queries always pass through App server. If a query does not contains primary key, all shards will be queried.

Queries -- JSON data { "firstname": "John", "lastname": "Smith", "age": 25, "address": { "streetaddress": "21 2nd Street", "city": "New York", "state": "NY", "postalcode": "10021" }, "phonenumber": [ { "type": "home", "number": "212 555-1234" }, { "type": "fax", "number": "646 555-4567" } ], "gender":{ "type":"male" } } JSON records describe the data they hold New data elements in a records can be added or removed as needed.

Queries -- simple data extraction An empty document query selects all documents in the collection: db.inventory.find( { } ) Equality conditions use { <field>: <value> } to select all documents that have that field with the specific value db.inventory.find( { type: "snacks" } )

Query and Projection Operators Query Operations Comparison (!=, >, in,...) Logical (and, or, not, nor) Element (exists, check type of field) Evaluation (mod, regular expression, where) Geospacial (geowithin, near, geointersects) Array (element match, query size) Projection Queries can be combined in various ways using these operators

Query Example CREATE TABLE users ( id MEDIUMINT NOT NULL AUTO_INCREMENT, user_id (30), age, status KEY (id) ) db.users.insert( { user_id: "abc123", age: 55, status: "A" } ) (1), PRIMARY ALTER TABLE users ADD join_date DATETIME db.users.update( { }, { $set: { join_date: new () } }, { multi: true } ) ALTER TABLE users DROP COLUMN join_date db.users.update( { }, { $unset: { join_date: "" } }, { multi: true } ) CREATE INDEX idx_user_id_asc ON users(user_id) db.users.createindex( { user_id: 1 } ) CREATE INDEX idx_user_id_asc_age_desc ON users(user_id, age DESC) db.users.createindex( { user_id: 1, age: -1 } ) DROP TABLE users db.users.drop()

Query Example INSERT INTO users(user_id, age, status) VALUES ("bcd001", 45, "A") db.users.insert( { user_id: "bcd001", age: 45, status: "A" } ) SELECT * FROM users db.users.find() SELECT id, user_id, status FROM users db.users.find( { }, { user_id: 1, status: 1 } ) SELECT user_id, status FROM users db.users.find( { }, { user_id: 1, status: 1, _id: 0 } ) SELECT * FROM users WHERE status = "A" db.users.find( { status: "A" } ) SELECT user_id, status FROM users WHERE status = "A" db.users.find( { status: "A" }, { user_id: 1, status: 1, _id: 0 } ) SELECT * FROM users WHERE status!= "A" db.users.find( { status: { $ne: "A" } } )

Query Example SELECT * FROM users WHERE status = "A" AND age = 50 db.users.find( { status: "A", age: 50 } ) SELECT * FROM users WHERE status = "A" OR age = 50 db.users.find( { $or: [ { status: "A" }, { age: 50 } ] } ) SELECT * FROM users WHERE age > 25 db.users.find( { age: { $gt: 25 } } ) SELECT * FROM users WHERE age < 25 db.users.find( { age: { $lt: 25 } } ) SELECT * FROM users WHERE age > 25 AND age <= 50 db.users.find( { age: { $gt: 25, $lte: 50 } } ) SELECT * FROM users WHERE user_id like "%bc%" db.users.find( { user_id: /bc/ } ) SELECT * FROM users WHERE user_id like "bc%" db.users.find( { user_id: /^bc/ } )

Query Example SELECT * FROM users WHERE status = "A" ORDER BY user_id ASC db.users.find( { status: "A" } ).sort( { user_id: 1 } ) SELECT * FROM users WHERE status = "A" ORDER BY user_id DESC db.users.find( { status: "A" } ).sort( { user_id: -1 } ) SELECT COUNT(*) FROM users db.users.count() db.users.find().count() or SELECT COUNT(user_id) FROM users db.users.count( { user_id: { $exists: db.users.find( true } } ) { user_id: { $exists: true } } ).count() SELECT COUNT(*) FROM users WHERE age > 30 db.users.count( { age: { $gt: 30 db.users.find( } } ) { age: { $gt: 30 } } ).count() SELECT DISTINCT(status) FROM users db.users.distinct( "status" ) SELECT * FROM users LIMIT 5 SKIP 10 db.users.find().limit(5).skip(10)

Query Example UPDATE users SET status = "C" WHERE age > 25 db.users.update( { age: { $gt: 25 } }, { $set: { status: "C" } }, { multi: true } ) DELETE FROM users WHERE status = "D" db.users.remove( { status: "D" } ) DELETE FROM users db.users.remove({})

Map-Reduce Map-reduce is provided for processing aggregated results use a custom JavaScript function for map and reduce operations map-reduce operations can be saved as a collection for iterative map-reduce operations

Map Reduce example

Indexes -- 1 Support for multiple index types on the database Single Field Compound Index Multikey Index index contents of arrays. has its own index entry Each array element

Indexes -- 2 Support for multiple index types on the database Geospatial Index only 2D. Planar geometry or spherical geometry Text Indexes (beta) Language specific stop words removed Hash Based Indexing Only supports equality join Range indexes use B+ tree

Conclusion Distributed Document storage database Indexing options allow for fast query of many data element dynamic schema Generic operations can be usable in many scenarios, but may not be optimized for those scenarios.

Many More Platforms / Frameworks - Cassandra: - Distributed database management system - Spark: - Distributed In-Memory Cluster Computing Engine - Iterative Algorithms - Interactive Data Analysis - RAM Cloud: - Distributed In-Memory storage system -

Apache Cassandra: Overview - Distributed database management system - Large amounts of data across many commodity servers - Support for clusters spanning multiple datacenters - Supports replication over multiple datacenters - Fault-tolerant - High performance and scalability - Cassandra File System (HDFS compatible) - Written in Java - Cassandra Query Language (CQL) - NoSQL system - NO support for joins or subqueries => denormalization

Companies - Netflix - ebay - Twitter - Cisco - Facebook -.

Data Modelling Rows are organized into tables The first component of a table's primary key is the partition key Within a partition, rows are clustered by the remaining columns of the key. Other columns may be indexed separately from the primary key

Data Modelling Music Example CREATE TABLE songs ( id uuid PRIMARY KEY, title text, album text, artist text, data blob ); CREATE TABLE playlists ( id uuid, song_id uuid, PRIMARY KEY (id, song_order ) ); SELECT * FROM playlists, songs, WHERE songs.id = playlists.song_id; JOIN ON TWO TABLES How to do join efficiently in a distributed fashion? (Online / Web Applications

Data Modelling Music Example CREATE TABLE songs ( id uuid PRIMARY KEY, title text, album text, artist text, data blob ); CREATE TABLE playlists ( id uuid, song_order int, song_id uuid, title text, album text, artist text, PRIMARY KEY (id, song_order ) ); SELECT * FROM playlists;

Data Modelling Music Example CREATE INDEX ON playlists (artist); SELECT * FROM playlists WHERE artist = 'Fu Manchu'; CREATE INDEX album_name ON playlists ( album ); CREATE INDEX title_name ON playlists ( title ); SELECT * FROM playlists WHERE album = 'Roll Away' AND title = 'Outside Woman Blues' ALLOW FILTERING ; Selects the least-frequent occurrence of a condition for processing first for efficiency.

Data Modelling General Example The following layout represents a row in a Column Family (CF): Column Name = Column Key

Data Modelling General Example The following layout represents a row in a Super Column Family (SCF):

Data Modelling General Example The following layout represents a row in a Column Family with composite columns:

Data Modelling General Example Do not think of relation table. Think of a nested, sorted map data structure. Map<RowKey, SortedMap<ColumnKey, ColumnValue> > A map gives efficient key lookup, and the sorted nature gives efficient scans.

Data Modelling General Example The number of column keys is unbounded = > Wide Row A key can itself holds a value. In other words, you can have a valueless column. => Inverted Index Range scan on row keys is possible only when data is partitioned in a cluster using Order Preserving Partitioner (OOP). OOP is almost never used. SortedMap<RowKey, SortedMap<ColumnKey, ColumnValue> >

Data Modelling User & Item For this example, let s say we would like to query data as follows: - Get user by user id - Get item by item id - Get all the items that a particular user likes - Get all the users who like a particular item

Data Modelling User & Item Supports querying user data by user id and item data by item id. Can not query anything else!!

Data Modelling User & Item Get the item title in addition to the item id when we query items liked by a particular user?

Data Modelling User & Item What if we want to get all the information (title, desc, price, etc.) about the items liked by a given user?

Data Modelling User & Item Let us include timestamp information:

Data Modelling User & Item Let us include timestamp information:

Cassandra vs MongoDB vs HBase Cassandra achieves the highest throughput for the maximum number of nodes in all experiments. Faster reads and writes Massive scalability Mongo natively works better with JSON blobs.

Conclusion Don t think of a relational table, but think of a nested sorted map data structure. Model column families around query patterns. De-normalize and duplicate for read performance. But don t de-normalize if you don t need to. High performance => data duplications (denormalization)

What s Your Message? Questions?