How to Choose Between Hadoop, NoSQL and RDBMS


Jean-Pierre Dijcks
Oracle, Redwood City, CA, USA

Keywords: Big Data, Hadoop, NoSQL Database, Relational Database, SQL, Security, Performance

Introduction

A lot of information is available on each individual technology; however, an overview of their respective roles in a complete ecosystem is frequently omitted. This paper identifies criteria for deciding when to choose which technology, in light of each technology's strengths and weaknesses.

Comparing the Technologies

To ultimately make a choice between technologies, it is mandatory to understand the fundamental elements of each technology. We will use three main categories:

Ingest: the loading of data into a technology
Disaster Recovery: the ability to handle failures
Access: the ability to access the data and to query or analyze it

Armed with a basic understanding of these elements, we can look at the main decision drivers, in order:

1. Performance
2. Security
3. Cost

These three basic elements should always drive the decision for one of the technologies. A weight should be attributed to each of them for a specific use case, but the three together should be sufficient to choose the right technology for the job.

Product Set

We will compare three main technologies in this paper: the Hadoop Distributed File System (HDFS), generic NoSQL databases (with key-value stores as the main category) and relational database systems (Oracle Database in this case).

Ingesting Data

Ingesting data into HDFS, which is a distributed file system, happens by loading files using a simple put API call. This loads the file into HDFS; upon ingest, HDFS breaks the file up into chunks of 256 MB (a configurable size; Oracle Big Data Appliance uses 256 MB as the default), which are replicated to two additional nodes in the cluster. The write completes once all data is written to the replicas; we will call this a synchronous write. HDFS does not parse the data, other than breaking it up into chunks. Because data is not parsed, it is never rejected as invalid: schema definitions are imposed after writing the data, not before. Data is simply written and stored as presented. HDFS is also a consistent system, because the write is only acknowledged when all data is on all nodes, which means that a read will always return the same result from any of the nodes.
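As a minimal sketch of this ingest path, the following Python snippet drives the standard hdfs dfs -put command from a driver script. The file and directory paths are hypothetical, and the sketch assumes a configured Hadoop client on the machine running it.

```python
import subprocess

# Hypothetical local file and HDFS target directory.
local_file = "/data/exports/web_logs_2015_01.csv"
hdfs_dir = "/user/etl/raw/web_logs/"

# "hdfs dfs -put" copies the file into HDFS as-is; HDFS splits it into
# chunks and replicates them, but never parses or validates the content.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
subprocess.run(["hdfs", "dfs", "-put", local_file, hdfs_dir], check=True)

# The put returns once the data is written to the replicas (the synchronous
# write described above), so a listing immediately shows the complete file.
subprocess.run(["hdfs", "dfs", "-ls", hdfs_dir], check=True)
```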

NoSQL databases are, at the fundamental layer, often key-value stores, where each entry, or record, is stored by its key and contains a value of arbitrary content and length. A NoSQL database therefore ingests and stores records individually. Values are not parsed and consequently are not validated against a schema-like definition. Writes of records go to a master node, and once the data is confirmed as written it is available to be queried. This enables very high write rates and instant retrieval on a per-record basis. After the write is confirmed, the system replicates the data to non-master nodes in an asynchronous manner. This asynchronous replication can lead to inconsistent reads when data has not yet arrived on the non-master nodes, a phenomenon called eventual consistency. The system scales ingest by adding master nodes, enabling many more processes to write data into the system.

A relational database system ingests data using SQL. The implication is that the data is fully parsed, validated and, when valid, stored in optimized internal file storage. Invalid data is rejected. Oracle Database is fully read-consistent, and only a user-issued commit publishes data to all other users. Replication of data, if done at all, happens outside the user-visible realm; for all intents and purposes the data exists as a single item in the system.

Table 1. Summarized Ingest Characteristics

These high-level characteristics, focusing on how data is handled, are summarized in Table 1.
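To illustrate the key-value ingest contract, the sketch below uses a small in-memory stand-in for a NoSQL client; the class and its put/get methods are hypothetical and do not represent the API of any specific driver.

```python
import json

class FakeKVStore:
    """In-memory stand-in illustrating key-value ingest: the value is an
    opaque blob stored under its key, with no parsing or schema validation."""

    def __init__(self):
        self._records = {}

    def put(self, key, value):
        # Nothing about 'value' is inspected, so nothing can be rejected as
        # invalid. A real store would acknowledge the write on a master node
        # and replicate to non-masters asynchronously (eventual consistency).
        self._records[key] = value

    def get(self, key):
        # Single-record retrieval by key: the store's primary access path.
        return self._records.get(key)

store = FakeKVStore()
store.put("user:42", json.dumps({"name": "Alice", "visits": 17}).encode())
print(store.get("user:42"))
```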

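By contrast, a minimal sketch of relational ingest is shown below, assuming the python-oracledb driver and a hypothetical WEB_LOGS table; the row is parsed and validated against the table definition at write time.

```python
import datetime
import oracledb

# Connection details are placeholders.
conn = oracledb.connect(user="etl", password="secret", dsn="dbhost/orclpdb1")
cur = conn.cursor()

# The INSERT is parsed and validated against the schema; a row with a wrong
# datatype or a constraint violation raises an error instead of being stored.
cur.execute(
    "INSERT INTO web_logs (log_date, user_id, url) VALUES (:1, :2, :3)",
    [datetime.date(2015, 1, 31), 42, "/index.html"],
)

conn.commit()  # only the commit publishes the row to other sessions
```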
Disaster Recovery

Once data is ingested, and before focusing on access to that data in more detail, we discuss disaster recovery and high availability, because they impact how data access can be structured in each of these systems.

Both Oracle Database and HDFS are monolithic systems. Both use data replication to ensure data availability upon failures. Having multiple copies of an element of data implies that the systems can tolerate some data loss within a single system without impact on the user accessing the data. To create a fault-tolerant, geographically distributed system, both are typically replicated to a second instance or copy of the entire system. When replicating the system, the granularity of the replicas is different: HDFS replicates whole blocks or files, whereas Oracle Database can replicate on a per-record basis. HDFS is typically replicated in batch, whereas Oracle Database is typically replicated per record.

NoSQL databases, as described above, feature built-in replication from the master nodes to the non-masters. While this can lead to eventual consistency, it greatly simplifies disaster recovery and geographic scale.

Table 2. Adding DR Characteristics into the Summary

Within Table 2 the summarized DR capabilities are included for reference. Also note that the DR capabilities and style have an impact on the cost of a system and on the ability to scale the system across regions.

Accessing Data

Ingesting data into HDFS does not trigger parsing of the data. The consequence, or benefit, of this behavior is that the data has to be parsed upon read. To retrieve a single record from HDFS, the system reads the whole file (typically a number of chunks) and then presents the desired results to the client application. In Oracle Database terminology, every read is a full table scan, no matter whether the interest is one record or the entire contents of the file originally stored. When using SQL to access the data, Hive optimizes retrieval by enabling the data to be partitioned into subsections that are split across directories. This splitting reduces I/O, because whole parts of the data can be skipped when a query only requires a subset to be returned. HDFS also leverages the replication of data to enable two features: read scalability, which lets clients read from multiple nodes at the same time, and speculative execution, which starts competing threads to read data and serves the result of the fastest thread to the client.
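To make the "every read is a full scan" point concrete, the hypothetical sketch below retrieves one record from a raw file in HDFS by streaming the whole file to the client and filtering it there; the file path and CSV layout are assumptions.

```python
import subprocess

# Stream the entire raw file out of HDFS and filter client-side: even a
# single-record lookup reads the whole file.
target_user = "42"
proc = subprocess.Popen(
    ["hdfs", "dfs", "-cat", "/user/etl/raw/web_logs/web_logs_2015_01.csv"],
    stdout=subprocess.PIPE,
    text=True,
)
for line in proc.stdout:
    # Assumed layout: log_date,user_id,url
    if line.split(",")[1] == target_user:
        print(line.rstrip())
proc.stdout.close()
proc.wait()
```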

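The Hive-style partitioning mentioned above can be sketched as follows. The table, columns and partition key are hypothetical, and the HiveQL is shown as Python strings that would be submitted through any Hive client (beeline, JDBC and so on).

```python
# Hypothetical HiveQL illustrating directory-based partitioning.
create_table = """
CREATE EXTERNAL TABLE web_logs (
  ts      STRING,
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (log_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/etl/warehouse/web_logs'
"""

# A filter on the partition column lets Hive skip every directory except the
# one for 2015-01-31, which is the I/O reduction described above.
query = """
SELECT url, COUNT(*) AS hits
FROM web_logs
WHERE log_date = '2015-01-31'
GROUP BY url
"""
```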
A NoSQL database combines an optimization for access (the keys to each of the records) with optimizations for ingest and flexibility (the unparsed value elements of the records). The combination of these elements enables fast serving of individual records to an application, through what in Oracle Database terms would be an index lookup. NoSQL databases also leverage the replication of the data for improved access. As with ingest, where scale is achieved by adding masters, the same increase in scale is achieved on the read side by adding non-masters. Each non-master can support an additional set of reader processes. Since the replication is asynchronous, many of the non-masters can be in a different geographical location from their masters and from some of their non-master peers. This enables a distributed reader farm to be built with access to individual records.

Because Oracle Database parses the data when ingesting, it can optimize the underlying file formats the data is stored in for retrieval speed. On top of that, expansive schema modeling constructs enable various access paths to be created and optimized for: index structures, partitioning schemes, in-memory columnar formats and so on.

Table 3. Access added into the Summarization

These characteristics lead to the summarization in Table 3. Note that the underlying system does not inherently make query speeds slow; the APIs, such as SQL, and their interaction with the underlying system are what makes a system perform better or worse. That same API also drives some of the analytics capabilities. In the case of NoSQL databases the API is often constrained to get and put, which does not lend itself to advanced analytics; instead, the programs that get the data do the analytics in a separate layer. Contrast that with SQL on Oracle Database.
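The sketch below contrasts the two access paths through the same SQL API, assuming the python-oracledb driver and a hypothetical WEB_LOGS table with an index on USER_ID: a single-record lookup and a more complex analytic query.

```python
import oracledb

conn = oracledb.connect(user="app", password="secret", dsn="dbhost/orclpdb1")
cur = conn.cursor()

# Single-record access: with an index on USER_ID this behaves like the key
# lookup a NoSQL store offers.
cur.execute("SELECT log_date, url FROM web_logs WHERE user_id = :1", [42])
print(cur.fetchall())

# Analytic access: the same interface expresses filtering and aggregation,
# which a pure get/put API would push into a separate application layer.
cur.execute("""
    SELECT url, COUNT(*) AS hits
    FROM web_logs
    WHERE log_date >= DATE '2015-01-01'
    GROUP BY url
    ORDER BY hits DESC
""")
print(cur.fetchmany(10))
```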

Summarizing the Classification

As can be seen from the tables, the essence of each of these systems is quite different, which should lead to different usages, and to convergence, as we will see later.

Figure 1. Classifying the Technologies

Summarizing as is done in Figure 1, we can classify the sweet spot or core use of HDFS as Affordable Scale, of NoSQL databases as Low, Predictable Latency workloads, and of Oracle Database as Flexible Performance. It is important to keep these somewhat simplified core classifications at the forefront of our minds when selecting a technology for your workload(s).

Optimizing HDFS and NoSQL for Access

A special word on Parquet, ORC and other optimized HDFS formats is required here, as the behavior described above has sparked work to mitigate some of the performance issues. This work focuses on the fact that HDFS does not parse data at ingest. When we review the core capabilities of Oracle Database, one of the key points is that it optimizes for query by optimizing the underlying file formats: parse the data, then store it in a way that queries can be optimally served. This is the core design principle behind adding Parquet to HDFS. Rather than leaving the data unparsed, data is loaded into HDFS as usual and then inserted into Parquet tables. This is done via a SQL statement, akin to what happens in Oracle Database. Under the covers, the data is parsed, laid out in a columnar format and then written to HDFS. Metadata appended to the end of the file enables queries to read only the required columns, thereby reducing I/O and improving performance. Note that the data is copied, that is, ingested again, just as when you load it into Oracle Database. Simply said, we have now re-hosted a database file format in HDFS.

NoSQL databases are offering more indexing capabilities to address the limitation of a single query method. The main enhancement comes in the form of secondary indexes, which make it easier and faster to retrieve data based on the contents of the value, without scanning the values of the entire data set.
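The re-ingest into Parquet can be sketched with hypothetical HiveQL, again shown as Python strings for a generic Hive client; the table names mirror the earlier partitioning example.

```python
# Hypothetical HiveQL: copy the raw, unparsed data into a Parquet table.
create_parquet = """
CREATE TABLE web_logs_parquet (
  ts       STRING,
  user_id  BIGINT,
  url      STRING,
  log_date STRING
)
STORED AS PARQUET
"""

# The INSERT...SELECT parses the rows and rewrites them in columnar form, so
# the data is ingested a second time, as described above.
convert = """
INSERT INTO TABLE web_logs_parquet
SELECT ts, user_id, url, log_date
FROM web_logs
"""
```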

Concurrency

We touched on concurrency above, for example with the NoSQL reader farm, but since it is one of the main drivers for data access we discuss it separately here. HDFS is a system very much focused on analytics and scale, not so much on large numbers of users accessing data simultaneously. Systems like Impala do not necessarily change that, as they rely on the same HDFS backbone. NoSQL databases are built for transactions and concurrency; the distributed model enables easy scale-out and inherently serves concurrency, hence the earlier reference to a reader farm. Oracle Database, with its heritage of OLTP workloads, has the ability to run large concurrent user workloads. The main difference with NoSQL is that Oracle Database guarantees read consistency across users; by doing so, however, it sacrifices the ability to scale a single database geographically the way a NoSQL database can.

Choosing Technologies

Taking in all that information, from the core elements of the technologies to their consequences, we can distill a basic understanding of which criteria to use when we need to pick a technology for a use case.

Figure 2. Indicating Technology Strengths and Weaknesses

In Figure 2 we have taken some of the knowledge and understanding from the previous parts of this document and tried to create a set of criteria and rank the technologies on them. These criteria represent a short list of items that are relevant and distinguishing in many of the use cases the author has seen. The criteria, divided by the main decision drivers, are:

Performance:
- Single Record Read/Write Performance: how well the technology deals with requests for individual records
- Bulk Write Performance: how well the system deals with bulk inserts/ingests
- Complex Query Response Times: how well the system handles complex analytics queries
- Concurrency: the ability of the technology to support large concurrent access to the system

Security:
- General User Security: how well data can be secured for the general user population
- Privileged User Security: how well data can be secured from administrators and other privileged users
- Governance Tools: how mature the tooling on the platform is to ensure proper governance of data elements and, for example, to comply with regulations around data

Cost:
- System per TB Cost: what a terabyte of data costs when stored in this technology
- Backup per TB Cost: what a terabyte of data costs when backed up or held in a DR system
- Skills Acquisition Cost: the cost of acquiring new skills, if required, and their relative scarcity

Plotting these criteria in the graph shown in Figure 2 starts to indicate when and how to choose any of these technologies for a given use case. To interpret the numbers, which are pure indicators rather than exact scientific scores, consider that for non-cost items higher numbers are better; for example, the score of 5 for concurrency in the case of NoSQL indicates that it supports concurrency exceptionally well. For cost items, higher numbers indicate a higher cost, and these are relative scores, not an indicator of the actual cost per terabyte.
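As a toy illustration of weighting these drivers for a specific use case, the sketch below combines indicator scores with use-case weights; all numbers are hypothetical placeholders, not the actual values from Figure 2, and the cost score is inverted because a lower cost is better.

```python
# Toy weighted comparison; scores are on a 1-5 scale per technology, and for
# cost a higher score means a higher (relative) cost.
scores = {
    "HDFS":            {"performance": 2, "security": 2, "cost": 1},
    "NoSQL database":  {"performance": 4, "security": 3, "cost": 2},
    "Oracle Database": {"performance": 4, "security": 5, "cost": 4},
}

# Weights for one specific, hypothetical use case.
weights = {"performance": 0.5, "security": 0.3, "cost": 0.2}

def weighted_score(tech_scores):
    # Invert cost so that, like the other drivers, higher is better.
    return (
        weights["performance"] * tech_scores["performance"]
        + weights["security"] * tech_scores["security"]
        + weights["cost"] * (6 - tech_scores["cost"])
    )

for tech, s in sorted(scores.items(), key=lambda kv: -weighted_score(kv[1])):
    print(f"{tech}: {weighted_score(s):.2f}")
```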

Conclusion

While not an exact science, the evaluation of tools and technologies is often overcomplicated. This paper tries to focus evaluations on the key strengths and weaknesses of the specific technologies, and then on the simple equation created by Performance, Security and Cost. Interestingly, many use cases look for a mix of these characteristics, and rather than one of the technologies we see customers using two or sometimes even three of them. Then it becomes very interesting, as we need to play this exact game of deciding what goes where with the data. Again, the simple drivers of Performance, Security and Cost will make those decisions easier and probably more solid.

Contact address:

Jean-Pierre Dijcks
Oracle
500 Oracle Parkway, MS 4op7
Redwood City, CA 94065, USA

Phone: +1 650 607 5394
Email: jean-pierre.dijcks@oracle.com
Internet: www.oracle.com