Using RDBMS, NoSQL or Hadoop?
DOAG Conference 2015
Jean-Pierre Dijcks, Big Data Product Management, Server Technologies
Copyright 2014, Oracle and/or its affiliates. All rights reserved.
Data Ingest
[Diagram: HDFS ingest. A 256 MB chunk is written synchronously across Node 1, Node 2, and Node 3.]
[Diagram: HDFS vs. NoSQL ingest. HDFS: a 256 MB chunk is written synchronously across the nodes. NoSQL: key/value records land on the Master Node at time t and become eventually consistent on Replica Node 1 at t+1 and Replica Node 2 at t+2.]
[Diagram: HDFS, NoSQL, and RDBMS ingest side by side. HDFS: 256 MB chunks written synchronously across nodes. NoSQL: key/value records written to the Master Node at time t, eventually consistent on Replica Node 1 (t+1) and Replica Node 2 (t+2). RDBMS: transactions go through an optimized ingest path (SQL parse/validate) on the compute tier, then to ASM-managed database files (DBF) on the storage tier.]
High-level Comparison: Ingest

|                  | HDFS        | NoSQL                 | RDBMS                  |
| Data Type        | Chunk       | Record                | Transaction            |
| Write Type       | Synchronous | Eventually Consistent | ACID Compliant         |
| Data Preparation | No Parsing  | No Parsing            | Parsing and Validation |
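The three ingest styles in the comparison can be sketched as a toy model in Python (an illustration only, not any vendor's real API): HDFS-style ingest appends raw bytes unparsed, NoSQL-style ingest stores an opaque value under a key, and RDBMS-style ingest parses and validates each record before accepting it.

```python
import json

# Toy models of the three ingest styles (illustration only, not real APIs).

def hdfs_style_ingest(chunks, raw_bytes):
    """Append-only: store the raw bytes as a chunk, no parsing at all."""
    chunks.append(raw_bytes)

def nosql_style_ingest(store, key, value):
    """Key/value put: the value is opaque, no parsing at all."""
    store[key] = value

def rdbms_style_ingest(table, raw_record):
    """Parse and validate on write; reject malformed input."""
    record = json.loads(raw_record)       # parsing happens at ingest time
    if "id" not in record:                # validation (toy schema check)
        raise ValueError("record must have an 'id' column")
    table[record["id"]] = record          # stored in a structured form

chunks, kv, table = [], {}, {}
hdfs_style_ingest(chunks, b'{"id": 1, "broken": true')    # accepted as-is
nosql_style_ingest(kv, "k1", '{"id": 1, "broken": true')  # accepted as-is
try:
    rdbms_style_ingest(table, '{"id": 1, "broken": true')  # malformed JSON
except json.JSONDecodeError:
    print("RDBMS-style ingest rejected the malformed record")
```

Only the RDBMS-style path pays the parsing cost up front; the other two happily store data they cannot yet interpret.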
Disaster Recovery - High Availability Across Geography
Disaster Recovery: Hadoop

[Diagram: identical three-node clusters in Data Center 1 and Data Center 2; 256 MB chunks are written synchronously between the nodes of a cluster.]

- Replication is synchronous between nodes, but only within one cluster
- For disaster recovery you MUST duplicate the data to a separate cluster in Data Center 2
Disaster Recovery: RDBMS

[Diagram: the optimized ingest path (SQL parse/validate, compute, ASM storage, DBF files) in Data Center 1, replicated to Data Center 2 via Data Guard or GoldenGate.]
Disaster Recovery: NoSQL Option

[Diagram: key/value records are written to the Master Node in Data Center 1 or Region 1 at time t, and become eventually consistent on Replica Node 1 (t+1) and Replica Node 2 (t+2) in Data Center 2 or Region 2; replica placement itself can span data centers or regions.]
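The t, t+1, t+2 timeline in the diagram can be sketched as a toy eventually consistent store (plain Python, not any specific NoSQL product): the master accepts the write immediately, each replica applies it some ticks later, and a read from a lagging replica can return stale data.

```python
# Toy sketch of eventually consistent replication: the master sees a write at
# tick t; replica 1 applies it one tick later, replica 2 two ticks later, so a
# read from a lagging replica can be stale.

class EventuallyConsistentStore:
    def __init__(self, replica_lags=(1, 2)):
        self.log = []                 # ordered (tick, key, value) writes
        self.tick = 0
        self.replica_lags = replica_lags

    def put(self, key, value):
        self.log.append((self.tick, key, value))
        self.tick += 1                # each write advances the clock

    def get(self, key, node="master"):
        if node == "master":
            visible = self.log        # the master sees everything
        else:                         # node is "replica1" or "replica2"
            lag = self.replica_lags[int(node[-1]) - 1]
            visible = [w for w in self.log if w[0] <= self.tick - lag]
        for t, k, v in reversed(visible):
            if k == key:
                return v
        return None

s = EventuallyConsistentStore()
s.put("user:42", "v1")                       # write lands on the master
print(s.get("user:42"))                      # master sees it right away
print(s.get("user:42", node="replica2"))     # lagging replica: still stale
```

The stale read is exactly the price paid for replicas that can sit in another data center or region.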
Note on HBase

- Don't confuse HBase with other NoSQL systems in terms of geographical distribution options
- HBase is based on HDFS storage
- Consequently, HBase cannot span data centers the way other NoSQL systems can
Big Data Appliance Includes Cloudera BDR

- Oracle packages this on the BDA as a Cloudera option called BDR (Backup and Disaster Recovery)
- BDR = DistCp, a distributed copy method leveraging parallel compute (MapReduce-like)
- BDR is NOT as proven as Data Guard or GoldenGate
- Most importantly, think of BDR as a batch process, not a trickle
- BDR is included (no extra charge) on the BDA
High-level Comparison: Ingest and DR

|                  | HDFS           | NoSQL                 | RDBMS                  |
| Data Type        | Chunk          | Record                | Transaction            |
| Write Type       | Synchronous    | Eventually Consistent | ACID Compliant         |
| Data Preparation | No Parsing     | No Parsing            | Parsing and Validation |
| DR Type          | Second Cluster | Node Replica          | Second RDBMS           |
| DR Unit          | File           | Record                | Transaction            |
| DR Timing        | Batch          | Record                | Transaction            |
Access Paths to Data Sets
Data Sets and Analytical SQL Queries: Hadoop

[Diagram: a JSON document stored unparsed inside a 256 MB chunk/block; no parse, no update. Hive uses MapReduce in parallel to achieve scale.]

INSERT
- Data is NOT parsed on ingest; for example, a JSON document is not parsed
- Original files are loaded and broken into chunks
- Does not like small files

UPDATE
- No update allowed (append only)
- Expensive to update a few KB, since it means replacing a 256 MB chunk (as part of a file replacement)

SELECT
- Scans ALL data even for a single-row answer
- The SQL access path is a full table scan
- Hive optimizes this with metadata (partition pruning)
- MapReduce in parallel to achieve scale
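The "scan ALL data for a single-row answer" point can be made concrete with a toy scan (plain Python standing in for Hive, which it is not): even when only one row matches, every record in every chunk is read and parsed at query time.

```python
import json

# Toy full-table scan over unparsed chunks (illustration, not Hive itself):
# every chunk is read and every record parsed, even for a one-row answer.

chunks = [
    b'{"id": 1, "city": "Bonn"}\n{"id": 2, "city": "Mainz"}',
    b'{"id": 3, "city": "Nuernberg"}\n{"id": 4, "city": "Bonn"}',
]

def full_scan(chunks, predicate):
    rows_read = 0
    hits = []
    for chunk in chunks:              # no index: touch every chunk
        for line in chunk.splitlines():
            rows_read += 1
            row = json.loads(line)    # parsing deferred to query time
            if predicate(row):
                hits.append(row)
    return hits, rows_read

hits, rows_read = full_scan(chunks, lambda r: r["id"] == 3)
print(hits)        # one matching row...
print(rows_read)   # ...but all 4 rows were read and parsed to find it
```

Partition pruning shrinks the set of chunks that must be scanned, but within the surviving chunks the pattern stays the same.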
Data Sets and Analytical SQL Queries: NoSQL

[Diagram: index lookup over key/value pairs; API: get, put, mget, mput.]

PUT (Insert)
- Data is typically not parsed; for example, a JSON document is loaded as a value with a key

GET (Select)
- Data retrieval through index lookup based on the KEY
- The only access path is an index (primary or secondary)
- If retrieving from a replica, the data may not be consistent (yet)
- Generally you do not issue SQL over NoSQL
- Not used for large analysis; instead used to receive and provide single records, or small sets based on the same key
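The get/put/mget/mput API named on the slide can be sketched as a minimal key/value store (a generic illustration; real NoSQL client APIs differ per product):

```python
# Minimal key/value API sketch (get/put/mget/mput as named on the slide;
# a generic illustration, not any specific NoSQL product's client).

class KVStore:
    def __init__(self):
        self.index = {}               # the ONLY access path is the key

    def put(self, key, value):
        self.index[key] = value       # value stored opaque, unparsed

    def get(self, key):
        return self.index.get(key)    # single index lookup, no scan

    def mput(self, pairs):
        for key, value in pairs:
            self.put(key, value)

    def mget(self, keys):
        return [self.get(k) for k in keys]

db = KVStore()
db.put("user:1", '{"name": "JP"}')    # a JSON doc stored as an opaque string
db.mput([("user:2", "v2"), ("user:3", "v3")])
print(db.get("user:1"))
print(db.mget(["user:2", "user:9"]))  # missing keys simply come back as None
```

Note what is missing: there is no operation that answers "which users live in Bonn?" without fetching by key, which is exactly why this shape is not used for large analysis.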
Data Sets and Analytical SQL Queries: RDBMS

- Data is parsed; metadata is available on write
- Auxiliary structures (TABLE, INDEX) enable a myriad of access paths

INSERT
- Parse data and optimize it for retrieval
- Adhere to transaction consistency

UPDATE
- Enables individual records to be updated
- With read consistency, row-level locking, etc.

SELECT
- Read consistency guaranteed
- SQL optimization: choose the best access path based on the question and database statistics
- Optimized joins, spills to disk, etc.
- Supports very complex questions
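These points can be shown with SQLite from the Python standard library, standing in for any RDBMS: data is parsed and validated on INSERT, a single row can be UPDATEd in place, and the optimizer can pick an index instead of a full scan.

```python
import sqlite3

# SQLite (Python stdlib) as a stand-in for any RDBMS: parsing/validation on
# insert, in-place single-row updates, and index-based access paths.

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, city TEXT NOT NULL)")
con.execute("CREATE INDEX users_city ON users (city)")

# INSERT: values are parsed into typed columns; constraints validate them.
con.executemany("INSERT INTO users VALUES (?, ?)",
                [(1, "Bonn"), (2, "Mainz"), (3, "Nuernberg")])
try:
    con.execute("INSERT INTO users VALUES (4, NULL)")   # fails validation
except sqlite3.IntegrityError:
    print("NOT NULL constraint rejected the bad row")

# UPDATE: one record is changed in place, with no chunk rewrite needed.
con.execute("UPDATE users SET city = 'Bonn' WHERE id = 2")

# SELECT: the optimizer chooses an access path; with an equality predicate on
# an indexed column, the plan should use the users_city index, not a scan.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE city = 'Bonn'").fetchall()
print(plan)
```

The `EXPLAIN QUERY PLAN` output makes the contrast with the Hadoop case visible: the question, not the storage layout, determines the access path.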
High-level Comparison: Ingest, DR, and Access

|                          | HDFS                  | NoSQL                     | RDBMS                   |
| Data Type                | Chunk                 | Record                    | Transaction             |
| Write Type               | Synchronous           | Eventually Consistent     | ACID Compliant          |
| Data Preparation         | No Parsing            | No Parsing                | Parsing and Validation  |
| DR Type                  | Second Cluster        | Node Replica              | Second RDBMS            |
| DR Unit                  | File                  | Record                    | Transaction             |
| DR Timing                | Batch                 | Record                    | Transaction             |
| Complex Analytics?       | Yes                   | No                        | Yes                     |
| Query Speed              | Slow                  | Fast for simple questions | Fast                    |
| # of Data Access Methods | One (full table scan) | One (index lookup)        | Many (optimized)        |
| Summary                  | Affordable Scale      | Low, Predictable Latency  | Flexible Performance    |
Performance Aspects: Ingest and Query
A simple set of criteria

[Radar chart (scale 0-5) comparing RDBMS, NoSQL DB, and Hadoop across: Concurrency, Skills Acquisition Cost, Complex Query Response Times, Backup per TB Cost, Single Record Read/Write Performance, System per TB Cost, Bulk Write Performance, Governance Tools, Privileged User Security, and General User Security.]
Ingest Performance Considerations

HDFS
- No parsing
- Fastest for ingesting large data sets, like log files
- Bulk writes of sets of data, not individual records
- Slower than NoSQL on a record-by-record basis

NoSQL
- No parsing
- Fastest for writing individual records to the master
- Direct writes on a per-record basis
- Best for millisecond writes

RDBMS
- The RDBMS does work to the data (parsing) on ingest
- Ingest is slower than NoSQL on a record-by-record basis due to parsing
- Ingest is slower than Hadoop on a file-by-file basis due to parsing
- The benefits of RDBMS parsing become evident in query performance
Query Performance Considerations

HDFS
- Requires parsing at query time
- Slowest on query time, due to full table scans on top of parsing
- No query caching
- Able to run complex queries
- Requires the use of MR or Spark programs
- SQL support is immature (SQL-92)

NoSQL
- No parsing
- Fastest query time for a get
- "Consistently fast" is the goal
- No syntax available to run complex queries

RDBMS
- Fastest SQL query times, because the parsing work was done on ingest
- Completely optimized storage formats for I/O
- Oracle knows how to pick out just what it needs, using metadata on the files
- Least amount of I/O to retrieve records
- Advanced caching
- The RDBMS can run more complex SQL queries than NoSQL or Hadoop
SQL on Hadoop vs. RDBMS

Hadoop
- Pay on QUERY for gains on INGEST
- Full table scan; it can be shortened by adding more nodes to scan through the data
- The problem remains: the data isn't parsed, so all raw data must be read for EACH and EVERY read

RDBMS
- Pay on INGEST (parsing) for gains on QUERY
- Optimized query path
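The trade-off can be put in rough numbers (all unit costs below are made up, purely for illustration): if parsing at ingest costs extra once but makes each query much cheaper, the pay-on-ingest approach wins as soon as the data is queried more than a handful of times.

```python
# Back-of-the-envelope trade-off (all unit costs are invented for
# illustration): Hadoop pays nothing at ingest but a full scan on EVERY
# query; the RDBMS pays parsing once, then cheap indexed queries.

ROWS = 1_000_000
SCAN_COST_PER_ROW = 1      # work units to read + parse a row at query time
PARSE_COST_PER_ROW = 3     # work units to parse + index a row at ingest
INDEXED_QUERY_COST = 100   # work units for one indexed lookup

def hadoop_total(num_queries):
    return 0 + num_queries * ROWS * SCAN_COST_PER_ROW   # scan every time

def rdbms_total(num_queries):
    return ROWS * PARSE_COST_PER_ROW + num_queries * INDEXED_QUERY_COST

for q in (1, 3, 10):
    print(q, hadoop_total(q), rdbms_total(q))
# With these made-up numbers the RDBMS pulls ahead from the fourth query on.
```

The exact break-even point depends entirely on the invented constants; the shape of the curves, a per-query cost proportional to data size versus a one-time ingest cost, is the point of the slide.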
Benefits Compared

Hadoop
- Schema on read: flexibility
- Write files to disk without error checking
- Programmatically stitch disparate data together

NoSQL
- Low latency for simple actions

RDBMS
- High performance for lots of workloads, without the end user having to know
- Parsed data, multiple access paths
- Advanced query caching, in-memory formats
- Affordable scale

HOWEVER
- No API to run analytical queries
- Consistency: the exact same result at the exact same time
- Traceability of transactions
- Applications become simpler
Concurrency
Concurrency Comparisons

HDFS
- Most Hadoop systems have a small number of users running a large number of jobs
- Batch or very frequent micro-batch calculations
- Customers report that Impala struggles with concurrency (much like Netezza, or worse)
- Immature resource management; not all solutions work nicely

NoSQL
- Very high concurrency
- Geographically distributed, but must deal with consistency issues
- Great reader farm for publishing data to, for example, web apps
- Load balancing built in

RDBMS
- Hundreds of users and hundreds of queries running at the same time
- Queries are balanced over the system
- Resource management is able to control the whole system
- Proven in many large organizations
Cost
Cost

- Customers want to reduce cost by moving from RDBMS to Hadoop
- Hadoop delivers the lowest cost per TB
- But which workloads can move from RDBMS to Hadoop?
  - Dealing with transactions? Stay in the RDBMS
  - Running advanced SQL queries? Stay in the RDBMS
  - Large numbers of concurrent users? Stay in the RDBMS
- Moving a data warehouse to Hadoop incurs a high performance penalty because of full table scans
Can SQL on Hadoop solve the problems without sacrifice?
No... 1½ Attempts

Impala
- A solution to speed up Hive and MR
- Wrote its own optimizer and query engine
- Large speedup, but at a cost:
  - No MR, so it loses scalability
  - Fewer SQL capabilities, so it loses BI transparency
  - System cost increases to run Impala (it needs additional memory), so it loses some of the cost advantage

Impala with Parquet
- Converts files into columnar data blocks (the Parquet file format)
- Speedup because of columnar formats etc., but at a cost:
  - Must run ETL to Parquet (comparable to ingest into an RDBMS)
  - Loses schema on read, and with it flexibility
  - Doubles the data
- Essentially a new columnar DB on HDFS
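The columnar point can be sketched quickly (a toy layout, emphatically NOT the real Parquet format): converting row-oriented records into per-column arrays means a single-column query touches only that column's values, at the price of an extra ETL pass and a second copy of the data.

```python
# Toy row-vs-columnar sketch (NOT the real Parquet format): an extra ETL pass
# produces a second, column-oriented copy; a single-column query then touches
# only that column's values instead of every field of every row.

rows = [
    {"id": 1, "city": "Bonn", "amount": 10.0},
    {"id": 2, "city": "Mainz", "amount": 20.0},
    {"id": 3, "city": "Bonn", "amount": 5.0},
]

def to_columnar(rows):
    """The 'ETL to Parquet' step: rewrite the data column by column."""
    return {col: [r[col] for r in rows] for col in rows[0]}

columns = to_columnar(rows)    # data now exists twice: as rows AND as columns

def sum_amount_rows(rows):
    touched = sum(len(r) for r in rows)    # every field of every row is read
    return sum(r["amount"] for r in rows), touched

def sum_amount_columns(columns):
    col = columns["amount"]
    return sum(col), len(col)              # only the one column is read

print(sum_amount_rows(rows))       # (35.0, 9): 9 values touched
print(sum_amount_columns(columns)) # (35.0, 3): 3 values touched
```

The same answer with a third of the values touched is the speedup; the ETL step and the doubled data are the cost, which is why the slide calls this "essentially a new columnar DB on HDFS".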
Conclusion: Choose the Right Tool for the Right Job

Don't use something for what it's not meant to do.