Alexey Vovchenko, PhD (Cand. Sci. in Engineering), Senior Researcher, CMC MSU / IPI RAN



[Slide: data growth from petabytes to zettabytes, and a sharding illustration: a table (Id, Fn, Ln, Addr; row 1: Fred Jones, Liberty, NY; row 2: John Smith, ??????) split across shards A, B, and C.]

122+ NoSQL database offerings today; 4 dominant flavors.

The buzz:
- Regular machine failures, data center outages, and network service interruptions happen frequently
- The need is for higher volume with fewer features
- Existing RDBMSs do not automatically manage the distribution of data over the available hardware
- Sharding solutions layered over an RDBMS introduce large overhead
- High-scale RDBMSs are too expensive for the increased data volume
- Need for a flexible data model
- Need for a low-latency, low-overhead API to access data
- Need to scale out on cheap commodity hardware
- Increased use of distributed analytics

Simple Key-Value Stores
- The simplest NoSQL store: provides low-latency writes, but only single key/value access
- Stores data as a hash table of keys, where every key maps to an opaque binary object
- Easily scales across many machines; does not support other data types
- Use cases: apps that require massive amounts of simple data (sensor data, web ops), apps with rapidly changing data (stock quotes), caching
- Examples: Memcached, Dynamo

Document Stores
- Represent rich, hierarchical data structures, reducing the need for multi-table joins
- The structure of documents need not be known a priori; it can vary and evolve instantly, yet queries can still understand the contents of a document
- Applications: rapid ingest and delivery for evolving schemas and web-based objects
- Examples: MongoDB, CouchDB (Couchbase)

Column-Family Stores
- Manage structured data with multiple-attribute access
- Columns are grouped into column families/groups; each storage block contains data from only one column/column set, providing data locality for hot columns
- Column groups are defined a priori, but the schema within a column group can vary
- Scale using replication and multi-node distribution for high availability and easy failover; optimized for writes (writes are faster than reads)
- Applications: high-throughput verticals (activity feeds, message queues), caching, web ops
- Examples: HBase, Cassandra, BigTable, Amazon Dynamo

Graph Stores
- Use nodes, relationships between nodes, and key-value properties
- Access data by graph traversal, navigating from start nodes to related nodes according to graph algorithms
- Faster for associative data sets
- Use a schema-less, bottom-up model for capturing ad hoc and rapidly changing data
- Common model: RDF
- Applications: storing and reasoning over complex, connected data, e.g. inferencing applications in healthcare, government, telecom, and oil; computing closure over social-networking graphs
- Examples: Neo4j, DB2
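To make the first two flavors concrete, here is a minimal sketch (not a real database client; all names are made up for illustration) contrasting a key-value store, which treats values as opaque blobs, with a document store, whose queries can see inside the stored structures:

```python
# Key-value store: each key maps to an opaque blob; the store cannot
# interpret the value, so the only operations are get/put by exact key.
kv_store = {}
kv_store["sensor:42:reading"] = b'{"temp": 21.5}'  # opaque bytes to the store

# Document store: the store understands the document's structure, so a
# query can match on fields inside the value. Schema may vary per document.
doc_store = [
    {"_id": 1, "name": "Fred Jones", "city": "Liberty"},
    {"_id": 2, "name": "John Smith"},  # no "city" field: schemas can differ
]

def find(store, **criteria):
    """Toy query: return documents whose fields match all criteria."""
    return [d for d in store if all(d.get(k) == v for k, v in criteria.items())]

print(find(doc_store, city="Liberty"))  # the store can "see" inside documents
```

The key-value store can only answer "what is the blob for this key?", while the document store can answer content-based queries, which is exactly the trade-off the slide describes.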

[Diagram: a sparse table with rows A-C and columns A-C spanning two column families (Family 1, Family 2). Only a few cells are populated, e.g. (Row A, Column A) holds an Integer, (Row B, Column B) holds a Long with versions at Timestamp1 and Timestamp2, and (Row C, Column C) holds a huge URL. Note: column families contain columns with time-stamped versions; columns only exist when inserted (i.e. the table is sparse).]
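HBase's data model is often summarized as a sparse, sorted map keyed by (row, column family, column qualifier, timestamp). A minimal in-memory model of that idea (illustrative only, not real HBase code):

```python
# Model the table as one map from (row, family, qualifier, timestamp) to a
# cell value. Cells only exist when inserted, so the table is naturally
# sparse: an empty cell costs nothing.
cells = {}  # {(row, family, qualifier, timestamp): value}

def put(row, family, qualifier, ts, value):
    cells[(row, family, qualifier, ts)] = value

def get_latest(row, family, qualifier):
    """Return the highest-timestamp version of the cell, or None if the
    cell was never inserted."""
    versions = [(ts, v) for (r, f, q, ts), v in cells.items()
                if (r, f, q) == (row, family, qualifier)]
    return max(versions)[1] if versions else None

put("rowA", "family1", "colA", 1, "Integer")
put("rowB", "family1", "colB", 1, "Long@Timestamp1")
put("rowB", "family1", "colB", 2, "Long@Timestamp2")  # newer version, same cell

print(get_latest("rowB", "family1", "colB"))  # the newest version wins
```

Note how versioning falls out of the key structure: writing the same (row, family, qualifier) with a new timestamp adds a version rather than overwriting the old cell.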


                       HBase                                       RDBMS
Data layout            A sparse, distributed, persistent           Row- or column-oriented
                       multidimensional sorted map
Transactions           ACID support on a single row only           Yes
Query language         get/put/scan only, unless combined          SQL
                       with Hive or another technology
Security               Authentication / authorization              Authentication / authorization
Indexes                Row key only, or a special table            Yes
Throughput             Millions of queries per second              Thousands of queries per second
Maximum database size  PBs                                         TBs

Given this sample RDBMS table (SSN is the primary key):

SSN    Last Name  First Name  Account Number  Type of Account  Timestamp
01234  Smith      John        abcd1234        Checking         20120118
01235  Johnson    Michael     wxyz1234        Checking         20120118
01235  Johnson    Michael     aabb1234        Checking         20111123

The same data as HBase rows, with values of the form (CF, Column, Version, Cell):

Row key  Value
01234    info: { lastname: Smith, firstname: John }
         acct: { checking: abcd1234 }
01235    info: { lastname: Johnson, firstname: Michael }
         acct: { checking: wxyz1234 @ts=2012, checking: aabb1234 @ts=2011 }

info column family:

Row Key  Column Key  Timestamp   Cell Value
01234    info:fname  1330843130  John
01234    info:lname  1330843130  Smith
01235    info:fname  1330843345  Michael
01235    info:lname  1330843345  Johnson

acct column family:

Row Key  Column Key     Timestamp   Cell Value
01234    acct:checking  1330843130  abcd1234
01235    acct:checking  1330843345  wxyz1234
01235    acct:checking  1330843239  aabb1234

Key structure: Key = Row + Column Family + Column Qualifier + Timestamp; Key/Value = Key + Value.
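Physically, each cell in the tables above becomes one key/value pair whose key is (row, column family, column qualifier, timestamp). Keys sort ascending by row, family, and qualifier, but descending by timestamp, so a scan encounters the newest version of a cell first. A sketch of that ordering (illustrative only, using the acct data from above):

```python
# Flattened cells from the acct column family: (row, family, qualifier,
# timestamp, value), exactly as in the table above.
acct_cells = [
    ("01234", "acct", "checking", 1330843130, "abcd1234"),
    ("01235", "acct", "checking", 1330843345, "wxyz1234"),
    ("01235", "acct", "checking", 1330843239, "aabb1234"),
]

def hbase_key(cell):
    row, family, qualifier, ts, _ = cell
    # Negating the timestamp makes newer versions sort before older ones.
    return (row, family, qualifier, -ts)

acct_cells.sort(key=hbase_key)

# For row 01235, the 2012 version (wxyz1234) now precedes the 2011 one.
print([c[4] for c in acct_cells])
```

This newest-first ordering is why fetching a cell's current value is cheap: the latest version is the first match on a scan.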


[Architecture diagram: clients use the HBase API to reach the Master and the Region Servers; each region server maintains a MemStore, HFiles, and a write-ahead log; ZooKeeper coordinates the cluster; data is stored in HDFS.]

The data is sharded: each shard (region) contains all the data in one key range, and regions are auto-sharded across region servers. In the logical view of a table spanning keys A..Z, the regions [A-D], [E-H], [I-M], [N-R], [S-V], and [W-Z] are distributed across Region Servers 1-3.
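Locating the region for a row key is a range lookup over the sorted region start keys. A sketch under the key ranges above (the region-to-server placement is an assumption for illustration; real clients get this mapping from the cluster metadata):

```python
import bisect

# Region start keys for the ranges [A-D],[E-H],[I-M],[N-R],[S-V],[W-Z].
region_starts = ["A", "E", "I", "N", "S", "W"]
# Hypothetical placement of regions on three region servers.
region_server = {"A": 1, "E": 1, "I": 2, "N": 2, "S": 3, "W": 3}

def locate(row_key):
    """Find the region (by start key) and server responsible for row_key:
    the region with the greatest start key <= row_key."""
    i = bisect.bisect_right(region_starts, row_key) - 1
    start = region_starts[i]
    return start, region_server[start]

print(locate("Garcia"))  # falls in the [E-H] region
```

Because each region owns a contiguous key range, a single binary search is enough; no central lookup per row is needed once the client caches the region boundaries.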

[Diagram: region server internals. A client writes through the HLog to a Region; each Region contains Stores, and each Store has one MemStore and multiple StoreFiles backed by HFiles.]
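The diagram describes the region server write path: a mutation is appended to the HLog (write-ahead log) for durability, then applied to the in-memory MemStore; when the MemStore fills, it is flushed to an immutable, sorted StoreFile (HFile). A toy model of that flow (not real HBase code; the cell-count threshold stands in for HBase's byte-size flush threshold):

```python
FLUSH_THRESHOLD = 3  # cells; real HBase flushes by bytes (e.g. ~128 MB)

hlog = []         # write-ahead log: replayed on crash recovery
memstore = {}     # in-memory buffer of recent writes
store_files = []  # immutable, sorted files on disk (HFiles)

def put(key, value):
    hlog.append((key, value))   # 1. durability: log before acknowledging
    memstore[key] = value       # 2. serve reads from memory
    if len(memstore) >= FLUSH_THRESHOLD:
        flush()

def flush():
    # 3. write an immutable sorted file; reads now consult the memstore
    #    plus all store files.
    store_files.append(sorted(memstore.items()))
    memstore.clear()

for i in range(4):
    put(f"row{i}", f"v{i}")

print(len(store_files), len(memstore))  # one flushed file, one cell buffered
```

Writing sequentially to the log and flushing sorted files in bulk is why writes are cheap in this design, and matches the earlier claim that column-family stores are optimized for writes.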


[Diagram: HBase access paths. Java clients use HTable (or AsyncHBase) directly against the cluster of region servers. External API clients go through gateways: a Thrift server (Thrift clients), a REST server (REST clients), and an Avro server (Avro clients). Batch clients include Hive, Pig, Jaql, JRuby, and MapReduce.]


///////////////////////////////////// An Example /////////////////////////////////////

// Create an HBase table
import hbase(*);
t = hbasestring('test',
      schema { key: string, f1?: {*: string}, f2?: {*: string}, f3?: {*: string} },
      replace=true);

// Define the rows
data = [
  { key: "1",
    f1: { Name: "Bruce Brown" },
    f2: { Address: "BigData Blvd, Washington DC 20001" },
    f3: { Email: "brownb@us.ibm.com" } },
  { key: "2",
    f1: { Name: "John Smith" },
    f3: { Email: "jsmith@us.ibm.com" } }
];

// Put the data into the HBase table
data -> write(t);

Speeds development of HBase applications!

InfoSphere BigInsights: visualization & exploration, development tools, advanced engines, connectors, workload optimization, administration & security, on top of IBM-certified Apache Hadoop.

HBase GUI for queries. Integration with BigSheets for advanced analytics. Advanced support and analytics through Jaql.

[Diagram: DB2 / InfoSphere Warehouse / Netezza tools and apps -> DB2 / InfoSphere Warehouse -> BigInsights (HBase) -> HDFS.]

*http://www-01.ibm.com/software/data/bigdata/

Process             Heap        Description
NameNode            8 GB        About 1 GB of heap for every 100 TB of raw data stored,
                                or per million files/inodes
Secondary NameNode  8 GB        Applies the edits in memory, and therefore needs about
                                the same amount as the NameNode
JobTracker          2 GB        Moderate requirements
HBase Master        4 GB        Usually lightly loaded; moderate requirements only
DataNode            1 GB        Moderate requirements
TaskTracker         1 GB        Moderate requirements
HBase RegionServer  12 GB       The majority of available memory, while leaving enough
                                room for the operating system (buffer cache) and for
                                the task attempt processes
Task Attempts       1 GB (ea.)  Multiply by the maximum number you allow for each
ZooKeeper           1 GB        Moderate requirements
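The NameNode row above is really a sizing formula: about 1 GB of heap per 100 TB of raw data, or per million files/inodes. A quick arithmetic sketch of that rule (the helper function and the choice to take the larger of the two estimates, floored at the table's 8 GB figure, are assumptions for illustration):

```python
def namenode_heap_gb(raw_data_tb, million_files, baseline_gb=8):
    """Estimate NameNode heap: ~1 GB per 100 TB of raw data or per million
    files/inodes (whichever dominates), never below the 8 GB baseline."""
    by_data = raw_data_tb / 100   # 1 GB per 100 TB of raw data
    by_files = million_files      # 1 GB per million files/inodes
    return max(baseline_gb, by_data, by_files)

# A 2 PB cluster with 5 million files is data-dominated: ~20 GB of heap.
print(namenode_heap_gb(raw_data_tb=2000, million_files=5))
```

The same style of back-of-the-envelope check applies to the RegionServer row: after subtracting the OS buffer cache and the task attempt heaps from physical RAM, the remainder is what the 12 GB figure represents.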

Resources:
- NoSQL: Your Ultimate Guide to the Non-Relational Universe! http://nosql-database.org/links.html
- Brewer's CAP Theorem, posted by Julian Browne, January 11, 2009. http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
- HBase schema design: http://hbase.apache.org/book/schema.html
- HBase in Action, by Nick Dimiduk and Amandeep Khurana