Using RDBMS, NoSQL or Hadoop?



Similar documents
How to Choose Between Hadoop, NoSQL and RDBMS

Oracle Big Data SQL Technical Update

Performance Management in Big Data Applica6ons. Michael Kopp, Technology

DNS Big Data

Data Management in the Cloud: Limitations and Opportunities. Annies Ductan

Apache Hadoop: The Pla/orm for Big Data. Amr Awadallah CTO, Founder, Cloudera, Inc.

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Constructing a Data Lake: Hadoop and Oracle Database United!

How To Use A Data Center With A Data Farm On A Microsoft Server On A Linux Server On An Ipad Or Ipad (Ortero) On A Cheap Computer (Orropera) On An Uniden (Orran)

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Non-Stop Hadoop Paul Scott-Murphy VP Field Techincal Service, APJ. Cloudera World Japan November 2014

Oracle Database 12c Plug In. Switch On. Get SMART.

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

Hadoop and Map-Reduce. Swati Gore

The Future of Data Management

Moving From Hadoop to Spark

SharePoint Capacity Planning Balancing Organiza,onal Requirements with Performance and Cost

.nl ENTRADA. CENTR-tech 33. November 2015 Marco Davids, SIDN Labs. Klik om de s+jl te bewerken

Can t We All Just Get Along? Spark and Resource Management on Hadoop

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

X4-2 Exadata announced (well actually around Jan 1) OEM/Grid control 12c R4 just released

Hadoop Ecosystem B Y R A H I M A.

Actian SQL in Hadoop Buyer s Guide

Unified Big Data Processing with Apache Spark. Matei

In Memory Accelerator for MongoDB

Hadoop & Spark Using Amazon EMR

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Apache Hadoop: Past, Present, and Future

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Introduc8on to Apache Spark

Microsoft Azure Data Technologies: An Overview

An Oracle White Paper June High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Well packaged sets of preinstalled, integrated, and optimized software on select hardware in the form of engineered systems and appliances

Preview of Oracle Database 12c In-Memory Option. Copyright 2013, Oracle and/or its affiliates. All rights reserved.

How To Manage Big Data In A Microsoft Cloud (Hadoop)

Native Connectivity to Big Data Sources in MSTR 10

News and trends in Data Warehouse Automation, Big Data and BI. Johan Hendrickx & Dirk Vermeiren

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

<Insert Picture Here> Big Data

Hadoop IST 734 SS CHUNG

NoSQL Databases. Institute of Computer Science Databases and Information Systems (DBIS) DB 2, WS 2014/2015

Comparison of the Frontier Distributed Database Caching System with NoSQL Databases

Practical Cassandra. Vitalii

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

2009 Oracle Corporation 1

Business Intelligence for Big Data

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

Direct NFS - Design considerations for next-gen NAS appliances optimized for database workloads Akshay Shah Gurmeet Goindi Oracle

Impala: A Modern, Open-Source SQL Engine for Hadoop. Marcel Kornacker Cloudera, Inc.

Tap into Hadoop and Other No SQL Sources

Alexander Rubin Principle Architect, Percona April 18, Using Hadoop Together with MySQL for Data Analysis

Hadoop and MySQL for Big Data

Actian Vector in Hadoop

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

extensible record stores document stores key-value stores Rick Cattel s clustering from Scalable SQL and NoSQL Data Stores SIGMOD Record, 2010

#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld

Cloudera Impala: A Modern SQL Engine for Hadoop Headline Goes Here

Big Data SQL and Query Franchising

InfiniteGraph: The Distributed Graph Database

SQream Technologies Ltd - Confiden7al

Can the Elephants Handle the NoSQL Onslaught?

Oracle Database In-Memory The Next Big Thing

STeP-IN SUMMIT June 2014 at Bangalore, Hyderabad, Pune - INDIA. Performance testing Hadoop based big data analytics solutions

Oracle Big Data SQL. Architectural Deep Dive. Dan McClary, Ph.D. Big Data Product Management Oracle

Real-time Data Replication

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

CitusDB Architecture for Real-Time Big Data

F1: A Distributed SQL Database That Scales. Presentation by: Alex Degtiar (adegtiar@cmu.edu) /21/2013

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

EMC Federation Big Data Solutions. Copyright 2015 EMC Corporation. All rights reserved.

Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth

Cloud Computing at Google. Architecture

Hadoop Meets Exadata. Presented by: Kerry Osborne. DW Global Leaders Program Decemeber, 2012

Open Source Technologies on Microsoft Azure

Using distributed technologies to analyze Big Data

Splice Machine: SQL-on-Hadoop Evaluation Guide

A Scalable Data Transformation Framework using the Hadoop Ecosystem

Information Builders Mission & Value Proposition

SQL on NoSQL (and all of the data) With Apache Drill

Putting Apache Kafka to Use!

OLTP Meets Bigdata, Challenges, Options, and Future Saibabu Devabhaktuni

Big Data Patterns. Ron Bodkin Founder and President, Think Big

An Oracle White Paper June Oracle: Big Data for the Enterprise

Benchmarking Cassandra on Violin

Parquet. Columnar storage for the people

Using MySQL for Big Data Advantage Integrate for Insight Sastry Vedantam

Hadoop. for Oracle database professionals. Alex Gorbachev Calgary, AB September 2013

Chase Wu New Jersey Ins0tute of Technology

Affordable, Scalable, Reliable OLTP in a Cloud and Big Data World: IBM DB2 purescale

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

HPE Vertica & Hadoop. Tapping Innovation to Turbocharge Your Big Data. #SeizeTheData

Big Data Analytics - Accelerated. stream-horizon.com

Transcription:

Using RDBMS, NoSQL or Hadoop? DOAG Conference 2015 Jean- Pierre Dijcks Big Data Product Management Server Technologies Copyright 2014 Oracle and/or its affiliates. All rights reserved.

Data Ingest 2

Ingest Chunk 256 MB Synchronous HDFS Node 1 Node 2 Node 3 3

Ingest HDFS Ingest <key><value> t Master Node Chunk 256 MB Synchronous Node 1 Node 2 Records Eventually Consistent t + 1 <key><value> Replica Node 1 t + 2 Node 3 <key><value> Replica Node 2 4

Ingest Chunk 256 MB Synchronous HDFS Node 1 Node 2 Ingest Records Eventually Consistent t <key><value> Master Node t + 1 <key><value> Replica Node 1 Transac^ons Op^mized Ingest SQL Parse / Validate ASM Compute Storage t + 2 Node 3 <key><value> Replica Node 2 DBF DBF 5

High- level Comparison Ingest HDFS NoSQL RDBMS Data Type Chunk Record Transac^on Write Type Synchronous Eventually Consistent ACID Compliant Data Prepara^on No Parsing No Parsing Parsing and Valida^on 6

Disaster Recovery - High Availability Across Geography 7

Disaster Recovery: Hadoop Data Center 1 Data Center 2 Ingest Node 1 Node 1 Chunk 256 MB Node 2 Node 2 Synchronous between nodes Node 3 Node 3 MUST Duplicate to a separate Cluster in Data Center 2 8

Disaster Recovery: RDBMS Data Center 1 Data Center 2 Ingest SQL Compute Transac^ons Parse / Validate Storage ASM DBF DBF Op^mized Data Guard Golden Gate 9

Disaster Recovery: NoSQL Op^on t Ingest <key><value> Master Node Records Eventually Consistent <key><value> <key><value> t + 1 Replica Node 1 t + 2 Replica Node 2 Data Center 1 or Region 1 Data Center 2 or Region 2 10

Note on HBase Don't confuse HBase with NoSQL in terms of geographical distribu^on op^ons Hbase is based on HDFS storage Consequently, HBase cannot span data centers as we see in other NoSQL systems 11

Big Data Appliance includes Cloudera BDR Oracle packages this on BDA as a Cloudera op^on called BDR (Backup and Disaster Recovery) BDR = DistCP distributed copy method leveraging parallel compute (MapReduce like) BDR is NOT proven as Data Guard or GG Most importantly, think of BDR as a batch process not a trickle BDR is included (no extra charge) on BDA 12

High- level Comparison Ingest DR HDFS NoSQL RDBMS Data Type Chunk Record Transac^on Write Type Synchronous Eventually Consistent ACID Compliant Data Prepara^on No Parsing No Parsing Parsing and Valida^on DR Type Second Cluster Node Replica Second RDBMS DR Unit File Record Transac^on DR Timing Batch Record Transac^on 13

Access Paths to Data Sets 14

Data Sets and Analy^cal SQL Queries: Hadoop No Parse NO UPDATE json Chunk 256 MB block Hive uses MapReduce In Parallel Achieves SCALE INSERT Data is NOT parsed on ingest, example JSON document is not parsed.. Original files loaded an broken into chunks Does not like small files UDPATE NO update allowed (append only) Expensive to update a few KB by replacing a 256MB chunk (as part of a file replacement) SELECT Scan ALL data even for a single row answer SQL access path is a full table scan Hive op^mizes this with Metadata (Par^^on Pruning) MapReduce in Parallel to achieve scale 15

Data Sets and Analy^cal SQL Queries: NoSQL Index Lookup API: Get Put Mget Mput json <key><value> PUT (Insert) Data is typically not parsed, example, JSON document is loaded as a value with a key GET (Select) Data Retrieval through Index Lookup based on KEY Only access path is index (primary or secondary) If retrieving from Replica may not be consistent (yet) Generally do not issue SQL over NoSQL Not used for large analysis, instead used to receive and provide single records or small sets based on the same key 16

Data Sets and Analy^cal SQL Queries: RDBMS Data is parsed Metadata is available on write Auxiliary structures enable myriad of access paths TABLE INDEX INSERT Parse data and op^mize for retrieval Adhere to transac^on consistency UPDATE Enables individual records to be updated With read- consistency Row level locking etc. SELECT Read- consistent guaranteed SQL Op^miza^on: Choose the best access path based on the ques^on and database sta^s^cs Op^mized joins, spills to disk etc. Supports very complex ques^ons 17

High- level Comparison Ingest DR Acces HDFS NoSQL RDBMS Data Type Chunk Record Transac^on Write Type Synchronous Eventually Consistent ACID Compliant Data Prepara^on No Parsing No Parsing Parsing and Valida^on DR Type Second Cluster Node Replica Second RDBMS DR Unit File Record Transac^on DR Timing Batch Record Transac^on Complex Analy^cs? Yes No Yes Query Speed Slow Fast for simple ques^ons Fast # of Data Access Methods One (full table scan) One (index lookup) Many (Op^mized) Affordable Scale Low Predictable Latency Flexible Performance 18

Performance Aspects Ingest and Query 19

A simple set of criteria Concurrency 5 Skills Acquisi^on Cost 4.5 4 Complex Query Response Times 3.5 3 2.5 Backup per TB Cost 2 1.5 1 0.5 Single Record Read/Write Performance RDBMS 0 NoSQL DB Hadoop System per TB Cost Bulk Write Performance Governance Tools Privileged User Security General User Security 20

Ingest Performance Considera^ons HDFS No parsing Fastest for Inges^ng Large Data Sets, like log files Bulk write of sets of data, not individual records Slower write on a record- by- record basis than NoSQL NoSQL No parsing Fastest for Wri^ng Individual Records to the master Direct writes on a per- record basis Best for Millisecond Write RDBMS RDBMS does work to data (parsing) on ingest Ingest speed is slower than NoSQL on a record by record basis due to parsing Ingest speed is slower than Hadoop on a file by file basis due to parsing Benefits of RDBMS parsing become evident on query performance 21

Query Performance Considera^ons HDFS Requires parsing Slowest on Query Time Due to Full Table Scans on top of parsing No query caching Able to run complex queries Requires use of MR or Spark programs SQL immature (SQL- 92) NoSQL No parsing Fastest on Query Time for get Consistently fast is the goal No syntax available to run complex queries RDBMS Fastest SQL query ^mes because of parsing work done on ingest Completely op^mized storage formats for IO Oracle knows how to pick stuff out, metadata on files.. Least amount of I/O to retrieve records Advanced Caching RDBMS can run more complex SQL queries than NoSQL or Hadoop 22

SQL on Hadoop vs RDBMS Hadoop Pay on QUERY for GAINS on INGEST Full table scan Shorten full table scan by adding more nodes to scan through data Problem remains, data isn t parsed All raw data must be read for EACH and EVERY read RDBMS Pay on INGEST (Parsing) for GAINS on QUERY Op^mized Query Path 23

Benefits Compared Hadoop NoSQL RDBMS Schema on Read Flexibility Write file on disk without error checking Programma^cally s^tch disparate data together Low Latency for simple ac^ons High performance for lots of workloads without end user knowing Parsed Data Mul^ple Access Paths Advanced Query Caching In- Memory Formats Affordable Scale HOWEVER No API to run analy^cal query Consistency - exact same result at exact same ^me Traceability of transac^ons Applica^ons become simpler 24

Concurrency 25

Concurrency Comparisons HDFS Most Hadoop systems have small number of users running large number of jobs Batch or very frequent micro batch calcula^ons Customers report Impala struggles with concurrency (much like Netezza, or worse) Immature resource management, not all solu^ons work nicely NoSQL Very high concurrency Geographically distributed Must deal with consistency issues Great reader farm for publishing data to for example web apps Load balancing built- in RDBMS Hundreds of users, hundreds of queries running at the same ^me Queries are balanced over the system Resource Management able to control the whole system Proven in many large organiza^ons 26

Cost 27

Cost Customers want reduce cost by moving from RDBMS to Hadoop Hadoop delivers lowest cost per TB But which workloads can move from RDBMS to Hadoop? Dealing with transac^ons? Stay in RDBMS Running Advanced SQL queries? Stay in RDBMS Large numbers of concurrent users? Stay in RDBMS Will hit high performance penalty for moving DW to Hadoop because of full table scans 28

Can SQL on Hadoop solve the problems without sacrifice? 29

No.. 1 ½ Axempts Impala Solu^on to speed up Hive and MR Wrote own op^mizer and query engine Large speedup but at the cost of: No MR so loses scalability Less SQL capabili^es so loses BI transparency System cost increases to run Impala (needs add l memory) so loses some cost advantage Impala with Parquet Convert files into columnar data blocks (Parquet file format) Speed up because of columnar formats etc: But at a cost: Must run ETL to Parquet (comparable to ingest into RDBMS) Loose schema on read => flexibility Double the data Essen^ally a new columnar DB on HDFS 30

Conclusion: Choose the Right Tool for the Right Job Don t Use something it s not meant to be 31