How to Choose Between Hadoop, NoSQL and RDBMS

Jean-Pierre Dijcks
Oracle
Redwood City, CA, USA

Keywords: Big Data, Hadoop, NoSQL Database, Relational Database, SQL, Security, Performance

Introduction

A lot of information is available on the individual technologies, but an overview of their respective roles in a complete ecosystem is frequently omitted. This paper tries to identify criteria for when to choose which technology, in light of each technology's strengths and weaknesses.

Comparing the Technologies

To ultimately make a choice between technologies, it is necessary to understand the fundamental elements of each technology. We will use the following main categories:

- Ingest: the loading of data into a technology
- Disaster Recovery: the ability to handle failures
- Access: the ability to access data and query or analyze that data

Armed with a basic understanding based on the above elements, we can start to look at the main decision drivers, in this order:

1. Performance
2. Security
3. Cost

These three basic elements should always drive a decision for one of the technologies. A weight should be attributed to each of them for a specific use case, but these three should be sufficient to choose the right technology for the job.

Product Set

We will be comparing three main technologies in this paper: Hadoop Distributed File System (HDFS), generic NoSQL databases (with key-value stores as the main category) and Relational Database Systems (Oracle Database in this case).

Ingesting Data

Ingesting data into HDFS, which is a distributed file system, happens by loading files using a simple put API call. This loads the file into HDFS, and upon ingest HDFS breaks up the file into 256MB chunks (see footnote 1), which are replicated to two additional nodes in the cluster. The write completes once all data is written to the replicas. We will call this a synchronous write. HDFS does not parse the data, other than breaking it up into chunks. Because data is not parsed, it is never rejected as invalid. Schema definitions are imposed after writing data, not before. Data is simply written and stored as presented. HDFS is also a consistent system, because the write is only acknowledged when all data is on all nodes. This means that a read will always return the same result from any of the nodes.

NoSQL databases are, at the fundamental layer, often key-value stores, where each entry, or record, is stored by its key and contains a value of arbitrary content and length. A NoSQL database therefore ingests and stores individual records. Values are not parsed and consequently are not validated against a schema-like definition. Writes of records go to a master node, and once the data is confirmed as written, it is available to be queried. This enables very high write rates and instant retrieval on a per-record basis. After the write is confirmed, the system replicates the data to non-master nodes in an asynchronous manner. This asynchronous replication can lead to inconsistent reads when data has not yet arrived on the non-master nodes. This phenomenon is called eventual consistency. The system scales ingest by adding master nodes, enabling many more processes to write data into the system.

A relational database system ingests data using SQL. The implication is that the data is fully parsed, validated and, when valid, stored in an optimized storage format that is invisible to the user. Invalid data is rejected. Oracle Database is fully read-consistent, and only a user-issued commit publishes data to all other users.
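The relational ingest behavior described above (validation on write, rejection of invalid data, and visibility only after a commit) can be sketched with Python's built-in sqlite3 module. SQLite stands in for a relational database here purely for illustration; it is not Oracle Database, and the table name, columns and CHECK constraint are assumptions made for this sketch.

```python
import os
import sqlite3
import tempfile


def demo_schema_on_write(db_path):
    """Show rejection of invalid rows and commit-based visibility.

    Returns (rejected, rows_seen_before_commit, rows_seen_after_commit).
    """
    writer = sqlite3.connect(db_path)  # session that ingests data
    reader = sqlite3.connect(db_path)  # a second, independent session
    try:
        # The schema is declared before any data is written.
        writer.execute(
            "CREATE TABLE readings ("
            " sensor_id INTEGER NOT NULL,"
            " temperature REAL NOT NULL CHECK (temperature > -273.15))"
        )
        writer.commit()

        # Invalid data is rejected at ingest time.
        rejected = False
        try:
            writer.execute("INSERT INTO readings VALUES (1, -500.0)")
        except sqlite3.IntegrityError:
            rejected = True

        # A valid row is written but not yet committed, so the other
        # session does not see it.
        writer.execute("INSERT INTO readings VALUES (1, 21.5)")
        before = reader.execute("SELECT COUNT(*) FROM readings").fetchone()[0]

        # The commit publishes the row to other sessions.
        writer.commit()
        after = reader.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
        return rejected, before, after
    finally:
        writer.close()
        reader.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(demo_schema_on_write(os.path.join(tmp, "demo.db")))
```

Contrast this with HDFS and key-value stores as described above, where the equivalent of the invalid row would simply be stored as presented.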
Replication of data, if done, happens outside the user-visible realm, and for all intents and purposes data exists as a single item in the system.

Table 1. Summarized Ingest Characteristics

The above high-level characteristics are summarized in Table 1, which focuses on how data is handled.

Footnote 1: 256MB is a configurable size in HDFS. Oracle Big Data Appliance uses 256MB as the default.

Disaster Recovery

Once data is ingested, and before focusing on access to that data, we discuss disaster recovery and high availability in a little more detail. This is relevant because it impacts how data access can be structured in each of these systems.

Both Oracle Database and HDFS are monolithic systems. Both use data replication to ensure data availability upon failures. Having multiple copies of an element of data implies that the systems can tolerate some data loss within a single system without impact on the user accessing the data. To create a fault-tolerant, geographically distributed system, both are typically replicated to a second instance or copy of the entire system. When replicating the system, the granularity of the replicas is different. HDFS replicates whole blocks or files, whereas Oracle Database can replicate on a record basis. HDFS is typically replicated in batch, whereas Oracle Database is typically replicated per record.

NoSQL databases, as described above, feature built-in replication from the master node to the non-masters. While this can lead to eventual consistency, it does greatly simplify disaster recovery and geographic scale.

Table 2. Adding DR Characteristics into the Summary

Table 2 includes the summarized DR capabilities for reference. Also note that the DR capabilities and style have an impact on the cost of a system and on the ability to scale the system across regions.

Accessing Data

The ingesting of data in HDFS does not trigger parsing of the data. The consequence, or benefit, of this behavior is that the data has to be parsed upon read. To retrieve a single record from HDFS, the system reads the whole file (typically a number of chunks) and then presents the desired results to the client application. In Oracle Database terminology, every read is going to be a full table scan, no matter whether the interest is in one record or in the entire contents of the file originally stored.

When using SQL to access the data, Hive optimizes for retrieval by enabling files to be partitioned into subsections, split across directories. This splitting reduces IO because the partitions a query does not need can be ignored when the query requires only a subset of the data.

HDFS leverages the replication of data to enable two features:

- Read scalability: enabling clients to read from multiple nodes at the same time
- Speculative execution: starting competing threads to read data and serving the fastest thread to the client
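The effect of directory-based partitioning on IO can be sketched in plain Python. The sketch below does not use Hive or HDFS; it imitates the "one directory per partition value" layout on a local file system. The column names, the `day=` directory naming and the CSV file format are assumptions chosen for illustration, and for simplicity the partition column is also kept inside each file.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path

FIELDS = ["day", "user", "amount"]

# Hypothetical sample records, two per day.
SAMPLE_ROWS = [
    {"day": "2015-01-01", "user": "a", "amount": "10"},
    {"day": "2015-01-01", "user": "b", "amount": "20"},
    {"day": "2015-01-02", "user": "a", "amount": "30"},
    {"day": "2015-01-02", "user": "c", "amount": "40"},
    {"day": "2015-01-03", "user": "b", "amount": "50"},
    {"day": "2015-01-03", "user": "c", "amount": "60"},
]


def write_partitioned(root, rows):
    """Write rows into one sub-directory per value of the 'day' column."""
    by_day = defaultdict(list)
    for row in rows:
        by_day[row["day"]].append(row)
    for day, day_rows in by_day.items():
        part_dir = root / f"day={day}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with (part_dir / "part-0.csv").open("w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=FIELDS)
            writer.writeheader()
            writer.writerows(day_rows)


def query_by_day(root, day, prune):
    """Return (matching rows, number of files read).

    With prune=False every file is read and parsed (the "full scan").
    With prune=True directories for other days are skipped unread.
    """
    files_read = 0
    matches = []
    for part_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        if prune and part_dir.name != f"day={day}":
            continue  # partition pruning: this directory is never opened
        for data_file in sorted(part_dir.glob("*.csv")):
            files_read += 1
            with data_file.open(newline="") as fh:
                matches.extend(r for r in csv.DictReader(fh) if r["day"] == day)
    return matches, files_read


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        write_partitioned(root, SAMPLE_ROWS)
        print(query_by_day(root, "2015-01-02", prune=False))
        print(query_by_day(root, "2015-01-02", prune=True))
```

Both calls return the same rows; the difference is only in how many files had to be opened and parsed to find them.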
A NoSQL database combines some optimization for access (the keys to each of the records) with some optimizations for ingest and flexibility (the unparsed value elements of the records). The combination of these elements enables fast serving of individual records to an application through what in Oracle Database terms would be an index lookup.

NoSQL databases also leverage the replication of the data for improved access. As with ingest, where scale is achieved by adding masters, the same increase in scale is achieved on the read side by adding non-masters. Each non-master can support an additional set of reader processes. Because the replication is asynchronous, many of the non-masters can be in a different geographical location from their masters and from some of their non-master peers. This enables a distributed reader farm to be built with access to individual records.
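The interplay of master writes, asynchronous replication and reads from non-masters can be illustrated with a toy in-memory model. This is a minimal sketch, not the design of any particular NoSQL product: the class name `KeyValueCluster`, its methods and the explicit `replicate()` step (which stands in for background replication) are all assumptions made for this example.

```python
from collections import deque


class KeyValueCluster:
    """Toy key-value store with one master and asynchronous replicas."""

    def __init__(self, replica_count):
        self.master = {}
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = deque()  # acknowledged writes not yet replicated

    def put(self, key, value):
        """Write to the master and acknowledge immediately."""
        self.master[key] = value  # the value is stored as-is, unparsed
        self.pending.append((key, value))

    def get(self, key, replica=None):
        """Key lookup on the master (default) or on a given replica."""
        store = self.master if replica is None else self.replicas[replica]
        return store.get(key)

    def replicate(self):
        """Ship all pending writes to every replica.

        In a real system this runs continuously in the background;
        here it is an explicit call so the lag is visible.
        """
        while self.pending:
            key, value = self.pending.popleft()
            for store in self.replicas:
                store[key] = value


if __name__ == "__main__":
    cluster = KeyValueCluster(replica_count=2)
    cluster.put("user:1", "alice")
    print(cluster.get("user:1"))             # read from the master
    print(cluster.get("user:1", replica=0))  # may be stale before replication
    cluster.replicate()
    print(cluster.get("user:1", replica=0))  # replicas have converged
```

Adding entries to `replicas` adds read capacity without touching the write path, which is the reader farm idea; the price is the window in which a replica can return a stale or missing value.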

Because Oracle Database parses the data when ingesting, it can optimize the file formats in which the data is stored under the covers for retrieval speed. On top of that, expansive schema modeling constructs enable various access paths to be created and optimized for, such as index structures, partitioning schemes and in-memory columnar formats.

Table 3. Access added into the Summarization

Those characteristics lead to the summarization in Table 3. Note that the underlying system does not inherently make query speeds slow. The APIs, like SQL, and their interaction with the underlying system are what make a system perform better or worse.

That same API also drives some of the analytics capabilities. In the case of NoSQL databases the API is often constrained to get and put, which does not lend itself to advanced analytics. Instead, programs that get the data will do the analytics in a separate layer. Contrast that with SQL on Oracle Database, where the analytics can be expressed in the query language itself.

Summarizing the Classification

As can be seen from the tables, the essence of each of these systems is quite different and therefore should lead to different usages and convergence, as we will see later.

Figure 1. Classifying the Technologies

Summarizing as is done in Figure 1, we can classify the sweet spot or core use for HDFS as Affordable Scale, for NoSQL Databases as Low, Predictable Latency workloads, and for Oracle Database as Flexible Performance.

It is important to keep these somewhat simplified core classifications at the forefront of our minds when selecting a technology for a workload.

Optimizing HDFS and NoSQL for Access

A special word on Parquet, ORC and other optimized HDFS formats is required here, because the HDFS characteristics described above have sparked work to mitigate some of the performance issues. This work has focused on the non-parse phase of HDFS ingests.

When we review the core capabilities of Oracle Database, one of the key things is that we optimize for query by optimizing the underlying file formats: parse the data, then store it in a way that queries can be optimally served.

This is the core design principle behind adding Parquet to HDFS. Rather than leaving the data unparsed, data gets loaded into HDFS as usual and is then inserted into Parquet tables. This is done via a SQL statement akin to what happens in Oracle Database. Under the covers, the data is parsed, laid out in a columnar format and then written to HDFS. Metadata is appended to the end of the file and enables queries to poke into the files for individual columns, thereby reducing IO and improving performance.

Note that the data is copied (ingested again), just like when you load it into Oracle Database. Simply said, we have now re-hosted a database file format in HDFS.

NoSQL Databases are offering more indexing capabilities to address some of the single query method issue. The main enhancements come in the form of secondary indexes, which make it easier and faster to retrieve data from the value without parsing the entire content of the value in each record.

Concurrency

We touched on concurrency above, for example in the NoSQL reader farm, but it is one of the main drivers for data access and we will discuss it separately here.

HDFS is a system very much focused on analytics and scale, not so much on large numbers of users accessing data simultaneously. Systems like Impala do not necessarily change that, as they rely on the same HDFS backbone.
NoSQL databases are built for transactions and concurrency. The distributed model enables easy scale-out and inherently serves concurrency, hence the earlier reference to it as a reader farm.

Oracle Database, with its heritage of OLTP workloads, has the ability to run large concurrent user workloads. The main difference with NoSQL is that Oracle Database guarantees read consistency across the users. However, by doing so it sacrifices the ability to scale a single database geographically as a NoSQL database can do.

Choosing Technologies

Taking in all that information, from the core elements of each technology to their consequences, we can distill a basic understanding of which criteria to use when we need to pick a technology for a use case.

Figure 2. Indicating Technology Strengths and Weaknesses

In Figure 2 we have taken some of the knowledge and understanding from the previous parts of this document and tried to create a set of criteria and rank the technologies on them. These criteria represent a short list of items that are relevant and distinguishing in many use cases the author has seen.

The criteria, grouped by the main decision drivers, are:

Performance:
- Single Record Read/Write Performance: how well the technology can deal with requests for individual records
- Bulk Write Performance: how well a system can deal with bulk inserts/ingests
- Complex Query Response Times: how well the system handles requests for complex analytics queries
- Concurrency: the ability of the technology to drive large concurrent access to the system

Security:
- General User Security: how well data can be secured for the general user population
- Privileged User Security: how well data can be secured from administrators and other privileged users
- Governance Tools: how mature the tooling is on a platform to ensure proper governance of data elements and, for example, to comply with regulations around data

Cost:
- System per TB Cost: what the cost of a Terabyte of data is when stored in this technology
- Backup per TB Cost: what the cost of a Terabyte of data is when backed up or held in a DR system
- Skills Acquisition Cost: the cost of the new skills, if required, and also their relative scarcity
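Combining such criteria with per-use-case weights can be sketched as a simple weighted score. The sketch below is only one possible way to do this and is not a method prescribed by this paper. It assumes scores on a 1 to 5 scale where higher is better for performance and security and higher means more expensive for cost, so cost is inverted before weighting. All scores and weights in the usage section are hypothetical placeholders and are not the values plotted in Figure 2.

```python
MAX_SCORE = 5  # assumed upper end of the scoring scale

# Drivers for which a higher score is worse and must be inverted.
HIGHER_IS_WORSE = {"cost"}


def weighted_score(scores, weights):
    """Return the weighted average of driver scores for one technology.

    scores:  mapping of driver name -> score on a 1..MAX_SCORE scale
    weights: mapping of driver name -> relative weight for the use case
    """
    total = 0.0
    for driver, weight in weights.items():
        value = scores[driver]
        if driver in HIGHER_IS_WORSE:
            # Turn "higher cost" into "lower desirability".
            value = MAX_SCORE + 1 - value
        total += weight * value
    return total / sum(weights.values())


def rank(technologies, weights):
    """Return technology names ordered from best to worst weighted score."""
    return sorted(
        technologies,
        key=lambda name: weighted_score(technologies[name], weights),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical placeholder scores, for illustration only.
    candidates = {
        "technology_a": {"performance": 4, "security": 2, "cost": 1},
        "technology_b": {"performance": 3, "security": 5, "cost": 4},
    }
    # Hypothetical weights for a use case that values performance most.
    use_case_weights = {"performance": 0.5, "security": 0.3, "cost": 0.2}
    for name in rank(candidates, use_case_weights):
        print(name, round(weighted_score(candidates[name], use_case_weights), 2))
```

Changing only the weights can change the ranking for the same scores, which is why the weights should be set per use case rather than once for the organization.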

Plotting these criteria in the graph shown in Figure 2 starts to indicate when and how to choose any of these technologies for a given use case.

To interpret these numbers, which are pure indicators and not exact scientific scores, consider that for non-cost items, higher numbers are better. For example, the score of 5 for concurrency in the case of NoSQL indicates that it supports concurrency exceptionally well.

In the case of cost, higher numbers indicate a higher cost. These are relative scores, however, and not an indicator of the actual cost per TB.

Conclusion

While not an exact science, the evaluation of tools and technologies is often overcomplicated. This paper tries to focus evaluations on the key strengths and weaknesses of specific technologies, and then on the simple equations created by Performance, Security and Cost.

Interestingly, many use cases are looking for a mix of these characteristics, and rather than one of the technologies, we see customers using two or sometimes even three of them. It then becomes very interesting, as we need to play this exact game of deciding what data goes where. And again, the simple drivers of Performance, Security and Cost will make the decisions easier and probably more solid.

Contact address:

Jean-Pierre Dijcks
Oracle
500 Oracle Parkway MS 4op7
Redwood City, CA, USA

Email: jean-pierre.dijcks@oracle.com