Using RDBMS, NoSQL or Hadoop?


Leslie Short
8 years ago
Data Product Management Server Technologies

2 Data Ingest

3 Ingest: HDFS
[Diagram: data is ingested in 256 MB chunks and written synchronously to HDFS Node 1, Node 2 and Node 3.]
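
As a rough illustration of chunk-based ingest, the sketch below splits a file of a given size into fixed-size chunks without looking at its content. Only the 256 MB chunk size comes from the slide. The function name `plan_chunks` is hypothetical, and real HDFS block placement and replication are considerably more involved.

```python
CHUNK_SIZE = 256 * 1024 * 1024  # 256 MB, the chunk size shown on the slide


def plan_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes.

    Hypothetical helper: the content is never parsed, only cut by size.
    """
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    chunks = []
    offset = 0
    while offset < file_size:
        length = min(chunk_size, file_size - offset)
        chunks.append((offset, length))
        offset += length
    return chunks


# A 600 MB log file becomes two full chunks and one partial chunk.
chunks = plan_chunks(600 * 1024 * 1024)
```

Because the split depends only on size, ingest does no per-record work, which is consistent with the "No Parsing" entry for HDFS later in the deck.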

4 Ingest: HDFS and NoSQL
[Diagram: HDFS ingests 256 MB chunks, written synchronously to Node 1, Node 2 and Node 3. NoSQL ingests <key><value> records, eventually consistent: the Master Node has the record at time t, Replica Node 1 at t + 1 and Replica Node 2 at t + 2.]
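
The t, t + 1, t + 2 timing for NoSQL records can be sketched as a small simulation. This is a toy model under stated assumptions, not any product's replication protocol: the `Master` and `Replica` classes, the integer tick clock and the fixed per-replica lag are all hypothetical.

```python
class Replica:
    """Toy replica that applies each write a fixed number of ticks late."""

    def __init__(self, lag):
        self.lag = lag
        self.data = {}
        self.pending = []  # (tick at which to apply, key, value)

    def enqueue(self, tick, key, value):
        self.pending.append((tick + self.lag, key, value))

    def advance(self, tick):
        ready = [p for p in self.pending if p[0] <= tick]
        self.pending = [p for p in self.pending if p[0] > tick]
        for _, key, value in ready:
            self.data[key] = value

    def get(self, key):
        return self.data.get(key)


class Master:
    """Toy master node: writes are visible here immediately (time t)."""

    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
        self.tick = 0

    def put(self, key, value):
        self.data[key] = value
        for replica in self.replicas:
            replica.enqueue(self.tick, key, value)

    def step(self):
        """Advance the clock by one tick and let replicas catch up."""
        self.tick += 1
        for replica in self.replicas:
            replica.advance(self.tick)
```

A read from a replica between t and t + 2 may return nothing for a key the master already holds, which is the consistency caveat the deck raises later for reads from replicas.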

5 Ingest: HDFS, NoSQL and RDBMS
[Diagram: HDFS ingests 256 MB chunks, written synchronously to Node 1, Node 2 and Node 3. NoSQL ingests <key><value> records, eventually consistent: Master Node at time t, Replica Node 1 at t + 1, Replica Node 2 at t + 2. RDBMS ingests transactions through SQL: the compute tier parses and validates, and the storage tier (ASM) holds the optimized data files (DBF).]

6 High-level Comparison
Area | Criterion | HDFS | NoSQL | RDBMS
Ingest | Data Type | Chunk | Record | Transaction
Ingest | Write Type | Synchronous | Eventually Consistent | ACID Compliant
Ingest | Data Preparation | No Parsing | No Parsing | Parsing and Validation

7 Disaster Recovery - High Availability Across Geography

8 Disaster Recovery: Hadoop
[Diagram: in Data Center 1, 256 MB chunks are ingested and written synchronously between Node 1, Node 2 and Node 3. Data Center 2 holds a separate cluster with its own Node 1, Node 2 and Node 3.]
Data MUST be duplicated to a separate cluster in Data Center 2.

9 Disaster Recovery: RDBMS
[Diagram: in Data Center 1, transactions are ingested through SQL; the compute tier parses and validates, and the storage tier (ASM) holds the optimized data files (DBF). Data Guard or GoldenGate replicates to Data Center 2.]

10 Disaster Recovery: NoSQL Option
[Diagram: <key><value> records are ingested at the Master Node at time t and are eventually consistent on Replica Node 1 at t + 1 and Replica Node 2 at t + 2. The nodes are spread across Data Center 1 or Region 1 and Data Center 2 or Region 2.]

11 Note on HBase
- Don't confuse HBase with NoSQL in terms of geographical distribution options
- HBase is based on HDFS storage
- Consequently, HBase cannot span data centers as we see in other NoSQL systems

12 Big Data Appliance includes Cloudera BDR
- Oracle packages this on BDA as a Cloudera option called BDR (Backup and Disaster Recovery)
- BDR = DistCp, a distributed copy method leveraging parallel compute (MapReduce-like)
- BDR is NOT as proven as Data Guard or GoldenGate (GG)
- Most importantly, think of BDR as a batch process, not a trickle
- BDR is included (no extra charge) on BDA

13 High-level Comparison
Area | Criterion | HDFS | NoSQL | RDBMS
Ingest | Data Type | Chunk | Record | Transaction
Ingest | Write Type | Synchronous | Eventually Consistent | ACID Compliant
Ingest | Data Preparation | No Parsing | No Parsing | Parsing and Validation
DR | DR Type | Second Cluster | Node Replica | Second RDBMS
DR | DR Unit | File | Record | Transaction
DR | DR Timing | Batch | Record | Transaction

14 Access Paths to Data Sets

15 Data Sets and Analytical SQL Queries: Hadoop
[Diagram: a JSON file is loaded without parsing into 256 MB chunks (blocks); no update is possible; Hive uses MapReduce in parallel to achieve scale.]
INSERT
- Data is NOT parsed on ingest; for example, a JSON document is not parsed
- Original files are loaded and broken into chunks
- Does not like small files
UPDATE
- NO update allowed (append only)
- Expensive to update a few KB by replacing a 256 MB chunk (as part of a file replacement)
SELECT
- Scans ALL data, even for a single-row answer
- SQL access path is a full table scan
- Hive optimizes this with metadata (partition pruning)
- MapReduce in parallel to achieve scale
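
The difference between a full scan and a scan shortened by partition pruning can be sketched in a few lines. This is a toy model, not Hive: the partition layout (one list of raw JSON lines per day), the `scan` function and the sample records are all hypothetical.

```python
import json

# Hypothetical layout: raw, unparsed JSON lines grouped by a date partition.
partitions = {
    "2015-01-01": ['{"user": "a", "amount": 10}', '{"user": "b", "amount": 5}'],
    "2015-01-02": ['{"user": "a", "amount": 7}'],
    "2015-01-03": ['{"user": "c", "amount": 3}', '{"user": "a", "amount": 1}'],
}


def scan(partitions, predicate, only_partition=None):
    """Return (matching rows, number of raw lines parsed)."""
    rows, parsed = [], 0
    for day, lines in partitions.items():
        if only_partition is not None and day != only_partition:
            continue  # pruned from metadata alone; no data is read
        for line in lines:
            record = json.loads(line)  # parsing happens at query time
            record["day"] = day
            parsed += 1
            if predicate(record):
                rows.append(record)
    return rows, parsed


def wanted(record):
    return record["day"] == "2015-01-02" and record["user"] == "a"


full_rows, full_parsed = scan(partitions, wanted)
pruned_rows, pruned_parsed = scan(partitions, wanted, only_partition="2015-01-02")
```

Both calls return the same single row, but the full scan parses every line while the pruned scan parses only the lines in one partition.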

16 Data Sets and Analytical SQL Queries: NoSQL
[Diagram: JSON stored as <key><value>; access by index lookup; API: Get, Put, Mget, Mput.]
PUT (Insert)
- Data is typically not parsed; for example, a JSON document is loaded as a value with a key
GET (Select)
- Data retrieval through index lookup based on KEY
- Only access path is the index (primary or secondary)
- If retrieving from a replica, the data may not be consistent (yet)
- Generally do not issue SQL over NoSQL
- Not used for large analysis; instead used to receive and provide single records or small sets based on the same key
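
The Get / Put / Mget / Mput access pattern can be sketched with an in-memory dictionary. This is a minimal illustration, assuming a hypothetical `KeyValueStore` class; it is not the API of any particular NoSQL product.

```python
class KeyValueStore:
    """Toy key-value store: values are opaque and never parsed."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        # The only access path is a lookup by key.
        return self._data.get(key)

    def mput(self, items):
        for key, value in items.items():
            self.put(key, value)

    def mget(self, keys):
        return {key: self._data[key] for key in keys if key in self._data}


store = KeyValueStore()
# The JSON document is stored as-is; the store never inspects it.
store.put("user:1", '{"name": "Ann", "city": "Utrecht"}')
store.mput({"user:2": '{"name": "Bo"}', "user:3": '{"name": "Cy"}'})
```

There is deliberately no method for filtering on a field inside the value, which mirrors the point that such stores serve single records or small sets by key rather than large analysis.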

17 Data Sets and Analytical SQL Queries: RDBMS
[Diagram: TABLE and INDEX structures.]
- Data is parsed
- Metadata is available on write
- Auxiliary structures enable a myriad of access paths
INSERT
- Parse data and optimize for retrieval
- Adhere to transaction consistency
UPDATE
- Enables individual records to be updated
- With read consistency
- Row-level locking etc.
SELECT
- Read consistency guaranteed
- SQL optimization: choose the best access path based on the question and database statistics
- Optimized joins, spills to disk etc.
- Supports very complex questions
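
The INSERT / UPDATE / SELECT behaviour above can be tried out with Python's built-in `sqlite3` module. SQLite is only a convenient stand-in here, not the database the deck describes, and it does not offer row-level locking; the `orders` table, its columns and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        amount REAL NOT NULL CHECK (amount >= 0)
    )
    """
)
# An auxiliary structure that gives the optimizer a second access path.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")

# INSERT: rows are parsed and validated against the schema, inside a transaction.
with conn:
    conn.executemany(
        "INSERT INTO orders (customer, amount) VALUES (?, ?)",
        [("alice", 10.0), ("bob", 5.0), ("alice", 7.5)],
    )

# Validation on ingest: a row that violates the CHECK constraint is rejected.
try:
    with conn:
        conn.execute(
            "INSERT INTO orders (customer, amount) VALUES (?, ?)", ("carol", -1.0)
        )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# UPDATE: a single record is changed in place.
with conn:
    conn.execute("UPDATE orders SET amount = 12.0 WHERE id = 1")

# SELECT: the optimizer picks the access path; here it can use the index.
total = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE customer = ?", ("alice",)
).fetchone()[0]
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT amount FROM orders WHERE customer = ?", ("alice",)
).fetchall()
plan_text = " ".join(row[-1] for row in plan)
```

Printing `plan_text` shows which access path the optimizer chose for the lookup by customer.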

18 High-level Comparison
Area | Criterion | HDFS | NoSQL | RDBMS
Ingest | Data Type | Chunk | Record | Transaction
Ingest | Write Type | Synchronous | Eventually Consistent | ACID Compliant
Ingest | Data Preparation | No Parsing | No Parsing | Parsing and Validation
DR | DR Type | Second Cluster | Node Replica | Second RDBMS
DR | DR Unit | File | Record | Transaction
DR | DR Timing | Batch | Record | Transaction
Access | Complex Analytics? | Yes | No | Yes
Access | Query Speed | Slow | Fast for simple questions | Fast
Access | # of Data Access Methods | One (full table scan) | One (index lookup) | Many (Optimized)
Overall: HDFS offers affordable scale, NoSQL low predictable latency, RDBMS flexible performance.

19 Performance Aspects: Ingest and Query

20 A simple set of criteria
[Radar chart, scale 0 to 5, comparing RDBMS, NoSQL DB and Hadoop on: Concurrency; Skills Acquisition Cost; Complex Query Response Times; Backup per TB Cost; Single Record Read/Write Performance; System per TB Cost; Bulk Write Performance; Governance Tools; Privileged User Security; General User Security.]

21 Ingest Performance Considerations
HDFS
- No parsing
- Fastest for ingesting large data sets, like log files
- Bulk write of sets of data, not individual records
- Slower write on a record-by-record basis than NoSQL
NoSQL
- No parsing
- Fastest for writing individual records to the master
- Direct writes on a per-record basis
- Best for millisecond writes
RDBMS
- RDBMS does work on the data (parsing) on ingest
- Ingest speed is slower than NoSQL on a record-by-record basis due to parsing
- Ingest speed is slower than Hadoop on a file-by-file basis due to parsing
- Benefits of RDBMS parsing become evident in query performance

22 Query Performance Considerations
HDFS
- Requires parsing
- Slowest on query time, due to full table scans on top of parsing
- No query caching
- Able to run complex queries
- Requires use of MR or Spark programs
- SQL immature (SQL-92)
NoSQL
- No parsing
- Fastest on query time for get
- Consistently fast is the goal
- No syntax available to run complex queries
RDBMS
- Fastest SQL query times because of parsing work done on ingest
- Completely optimized storage formats for I/O
- Oracle knows how to pick out the data it needs, using metadata on files
- Least amount of I/O to retrieve records
- Advanced caching
- RDBMS can run more complex SQL queries than NoSQL or Hadoop

23 SQL on Hadoop vs RDBMS
Hadoop
- Pay on QUERY for GAINS on INGEST
- Full table scan
- Shorten the full table scan by adding more nodes to scan through the data
- Problem remains: data isn't parsed
- All raw data must be read for EACH and EVERY read
RDBMS
- Pay on INGEST (parsing) for GAINS on QUERY
- Optimized query path
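
The "pay on query" versus "pay on ingest" trade-off can be made concrete by counting how often records are parsed. This is a toy sketch: the two classes, the per-user lookup structure and the sample data are hypothetical, and a real RDBMS does far more on ingest than build one dictionary.

```python
import json

raw_lines = [
    '{"user": "a", "amount": 10}',
    '{"user": "b", "amount": 5}',
    '{"user": "a", "amount": 7}',
]


class SchemaOnRead:
    """Store raw lines untouched; parse everything on every query."""

    def __init__(self):
        self.lines = []
        self.parse_count = 0

    def ingest(self, line):
        self.lines.append(line)  # no parsing, so ingest is cheap

    def query(self, user):
        result = []
        for line in self.lines:  # full scan of all raw data
            record = json.loads(line)
            self.parse_count += 1
            if record["user"] == user:
                result.append(record)
        return result


class SchemaOnWrite:
    """Parse once on ingest and keep a lookup structure for queries."""

    def __init__(self):
        self.by_user = {}
        self.parse_count = 0

    def ingest(self, line):
        record = json.loads(line)  # the parsing cost is paid here
        self.parse_count += 1
        self.by_user.setdefault(record["user"], []).append(record)

    def query(self, user):
        return self.by_user.get(user, [])


on_read, on_write = SchemaOnRead(), SchemaOnWrite()
for line in raw_lines:
    on_read.ingest(line)
    on_write.ingest(line)

# Run the same two queries against both stores.
read_results = [on_read.query("a"), on_read.query("b")]
write_results = [on_write.query("a"), on_write.query("b")]
```

After two queries the schema-on-read store has parsed every line twice, while the schema-on-write store has parsed each line once, at ingest. The gap widens with every additional query.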

24 Benefits Compared
Hadoop
- Schema-on-read flexibility
- Write file on disk without error checking
- Programmatically stitch disparate data together
- Affordable scale
NoSQL
- Low latency for simple actions
- HOWEVER: no API to run analytical queries
RDBMS
- High performance for lots of workloads without the end user knowing
- Parsed data
- Multiple access paths
- Advanced query caching
- In-memory formats
- Consistency: exact same result at exact same time
- Traceability of transactions
- Applications become simpler

25 Concurrency

26 Concurrency Comparisons
HDFS
- Most Hadoop systems have a small number of users running a large number of jobs
- Batch or very frequent micro-batch calculations
- Customers report Impala struggles with concurrency (much like Netezza, or worse)
- Immature resource management; not all solutions work nicely
NoSQL
- Very high concurrency
- Geographically distributed
- Must deal with consistency issues
- Great reader farm for publishing data to, for example, web apps
- Load balancing built in
RDBMS
- Hundreds of users, hundreds of queries running at the same time
- Queries are balanced over the system
- Resource management is able to control the whole system
- Proven in many large organizations

27 Cost

28 Cost
- Customers want to reduce cost by moving from RDBMS to Hadoop
- Hadoop delivers the lowest cost per TB
- But which workloads can move from RDBMS to Hadoop?
  - Dealing with transactions? Stay in RDBMS
  - Running advanced SQL queries? Stay in RDBMS
  - Large numbers of concurrent users? Stay in RDBMS
- Moving a DW to Hadoop will incur a high performance penalty because of full table scans
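
The three screening questions on this slide can be written down as a small decision helper. This is only a sketch of the slide's rule of thumb: the function name, its parameters and the returned labels are hypothetical, and a real platform decision would weigh many more factors.

```python
def suggest_platform(needs_transactions, needs_advanced_sql, many_concurrent_users):
    """Apply the slide's three 'stay in RDBMS' checks, in order.

    Returns a (platform, reason) pair. Hypothetical helper for illustration.
    """
    if needs_transactions:
        return ("RDBMS", "workload deals with transactions")
    if needs_advanced_sql:
        return ("RDBMS", "workload runs advanced SQL queries")
    if many_concurrent_users:
        return ("RDBMS", "workload has large numbers of concurrent users")
    # None of the three checks applies, so the lower cost per TB may pay off.
    return ("Hadoop candidate", "none of the stay-in-RDBMS conditions applies")


batch_logs = suggest_platform(False, False, False)
order_entry = suggest_platform(True, False, True)
```

A workload passes as a Hadoop candidate only when all three answers are no.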

29 Can SQL on Hadoop solve the problems without sacrifice?

30 No. 1½ Attempts
Impala
- Solution to speed up Hive and MR
- Wrote its own optimizer and query engine
- Large speedup, but at the cost of:
  - No MR, so loses scalability
  - Fewer SQL capabilities, so loses BI transparency
  - System cost increases to run Impala (needs additional memory), so loses some cost advantage
Impala with Parquet
- Convert files into columnar data blocks (Parquet file format)
- Speedup because of columnar formats etc.
- But at a cost:
  - Must run ETL to Parquet (comparable to ingest into RDBMS)
  - Lose schema on read, and with it flexibility
  - Double the data
  - Essentially a new columnar DB on HDFS

31 Conclusion: Choose the Right Tool for the Right Job
Don't use something for what it's not meant to be.
