WHAT IS A CLOUD DATABASE?
The Suitability of Algebraix's Technology to Cloud Computing
Robin Bloor, PhD
WHITE PAPER
Copyright 2011, The Bloor Group. All rights reserved. Neither this publication nor any part of it may be reproduced or transmitted or stored in any form or by any means, without either the prior written permission of the copyright holder or the issue of a license by the copyright holder. The Bloor Group is the sole copyright holder of this publication. Oban Drive, Spicewood, TX. Contact: [email protected] www.TheVirtualCircle.com www.BloorGroup.com
Executive Summary

This white paper was commissioned by Algebraix. The goal of the paper is to provide a definition of what a cloud database is and, in the light of that definition, to examine the suitability of Algebraix's technology to fulfill the role of a cloud database. Here is a brief summary of the contents of this paper:

We define a cloud DBMS (CDBMS) to be a distributed database that can deliver a query service across multiple distributed database nodes located in multiple data centers, including cloud data centers. Querying distributed data sources is precisely the problem that businesses will encounter as cloud computing grows in popularity. Such a database also needs to deliver high availability and cater for disaster recovery.

In our view, a CDBMS only needs to provide a query service. SOA already delivers connectivity and integration for transactional systems, so we see no need for a CDBMS to cater for transactional traffic, only query traffic. A CDBMS needs to scale across large computer grids, but it also needs to be able to span multiple data centers and, as far as is possible, cater for slow network connections.

We review traditional databases, focusing primarily on relational databases and column store databases, concluding that such databases, as currently engineered, could not fulfill the role of a CDBMS. They have centralized architectures, and such architectures would encounter a scalability limit at some point, both within and between data centers. We conclude that a distributed peer-to-peer architecture is needed to satisfy the characteristics that we have defined.

We move on to examine the Hadoop/MapReduce environment and its suitability as a CDBMS. It has much better scalability for many workloads than relational or column store databases, because of its distributed architecture. However, it was not built for mixed workloads, for complex data structures, or even for multitasking. In its current form it emphasizes fault tolerance.
It succeeds as a database for very large volumes of data, but does not have the characteristics of a CDBMS.

Finally, we examine Algebraix's technology as implemented in its database product A2DB. Our conclusion is that it has an architecture which is suitable for deployment as a CDBMS. Our view is as follows:
- A2DB's unique capability to reuse intermediate results of queries that it has previously executed contributes to its delivering high performance at a single node.
- The same performance characteristics can be employed to speed up queries that join information between a local node and remote nodes, whether in the same data center or in a remote data center.
- Algebraix's technology is capable of global optimization, balancing the performance requirements of both global and local queries.
- Additionally, the technology can deliver high availability/fault tolerant operation.
We are aware that Algebraix has not deployed and tested its database A2DB in the role of a CDBMS; hence our conclusion is not that it qualifies as a CDBMS, but that it has an architecture that would enable it to be tested in this role.
The Cloud Database - In Concept

Cloud computing is a major driving trend for IT. Over 36 percent of US companies already run applications in the cloud (Mimecast survey, February 2010) and the major cloud vendors are growing their revenues and customer bases rapidly. Given the trends, fairly soon the majority of IT departments will be running applications in the cloud, possibly using more than one cloud provider. So corporate computing will inevitably become much more distributed than it currently is, spreading itself across multiple data centers. This will pose management, architectural and performance challenges, and foster innovation to meet those challenges.

The Cloud Implementation of Transactional and Query Systems

If we think solely in terms of database technology, the wider distribution of transactional systems, such as OLTP systems, communications applications and workflow systems, will not pose a severe problem at the data level. The sweeping success of Salesforce.com demonstrates this. The data problems of placing your CRM system in the cloud are resolved easily enough by the regular transfer of customer and other data from the cloud to the data center. Indeed, the broad success of SOA demonstrates the same thing. Loosely coupling silo transaction systems together works fine as regards the workflow between transactional systems. Because the volume of data passed between applications within a SOA is low, it is highly unlikely that the relatively slow speeds of the Internet will be prohibitive to placing some of these applications in the cloud. There will be exceptions, but in principle it will work well most of the time.

For query workloads typified by BI applications, distribution of the data across multiple data centers is more problematic. There are three main reasons for this:

1. Internet speeds are generally slow compared to data center network speeds, and this limits performance considerably.
This issue can be addressed through high-speed direct connections, but that becomes expensive very quickly.

2. Query workloads are not as predictable as transactional workloads. We can predict transactional workloads reasonably accurately, but we cannot easily predict specifically what questions a user might wish to ask; hence we are less able to predict the workload. This has profound architectural implications for the distribution of query systems. Stated simply: we don't know where best to locate the data ahead of time, because we do not know which sets of data users may wish to join together.

3. Even if we achieve an efficient distribution of data, query workloads involve the movement of much greater volumes of data than transactional workloads. That movement of data will inevitably be slower than if the data were located in a single data center.

This set of constraints suggests that it may be better to centralize query workloads in one physical location. This is traditionally how most BI domains have been constructed: around a big data warehouse with subsets of data drawn off to serve individual BI applications. But ultimately that approach fails the test of scalability. A centralized architecture scales poorly over very large numbers of nodes. Bottlenecks eventually arise.
Towards a Cloud Database

For the moment, we will set aside the fact that there are many challenges in implementing a distributed architecture for query workloads across several data centers, and provide a view of what a cloud database would look like. We can define a cloud DBMS (CDBMS) as a distributed database that delivers a query service across multiple distributed database nodes located in multiple geographically distributed data centers, both corporate data centers and cloud data centers.

So think in terms of an organization with some applications running in the cloud: perhaps Salesforce.com plus some hosted transactional web applications in some remote data center, plus local applications, including BI applications, split between two data centers. Such a situation is illustrated in Figure 1. It is the typical situation that companies will have to deal with as we move forward.

Figure 1. A CDBMS

In practice, a query can originate from anywhere: from a PC within the corporation, which is connected by a fast line to the local data center; from a PC in the home via a VPN line; from a laptop via a WiFi connection; or from a smart phone via a 3G or 4G connection. For that reason we represent a query here as coming through the Internet, implying that the response will possibly travel through the Internet too. The CDBMS will not concentrate all query traffic through a single node. A peer-to-peer architecture will be far more scalable, with any single node able to receive any query. In such an arrangement, each node needs to have a map of the data stored at every node and know the performance characteristics of every node. When a node receives a query, its first task is to determine which node is best able to respond to the query. It then passes responsibility for the query to that node. That node executes the query and returns the result directly to the user.

Figure 1 shows more than one node in some of the data centers. In practice, it will probably be necessary to configure more than one node per data center, to distribute the database workload within the data center as well as between data centers.

Consider Figure 2. It illustrates the likely strategy that would be used by a CDBMS node in accessing data held in local transactional databases or files. If the data is held in a database, the CDBMS can either get at the data directly (via ODBC, for example) or access a replicated data store. Replication will only be needed if read access to the data imposes too great an impact on performance. Critical systems often have a hot standby in place ready to go if the primary system fails, in which case the standby system's database could be used as a data source. Data might also be drawn from operational data stores or data warehouses, with the same kind of replication strategy being employed. Where the application data is held in a file, the CDBMS will probably be able to access the data directly. For non-database data, the CDBMS would maintain a metadata map of the file so it could identify data items within the records read from the file. Finally, the CDBMS will maintain its own store of data consisting of frequently used data drawn from the data sources it accesses. This would likely be most of the data the node was responsible for, with direct access to data stores being used primarily for data refresh.

Figure 2. A CDBMS Node

Local and Distributed Data

In processing local data, the CDBMS acts as an operational data store.
It has up-to-date data and responds to queries using that data. While BI databases, such as a data warehouse or large data mart, could be included, the cloud database might replace rather than complement such data stores. There is a scalability issue here. If we consider a large data center with many terabytes of data, no matter how efficient the CDBMS node is, it probably will not be able to deal with all the query traffic. At each data center there would likely be several database nodes. And if the query traffic grew, as usually happens, the CDBMS would need to instantiate extra nodes to handle the increased workload.
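The peer-to-peer routing idea described above can be sketched in a few lines of code. This is purely an illustration under stated assumptions: the `Node` class, the `route_query` function and the scoring rule (prefer the node holding the most of the requested data, breaking ties by current load) are hypothetical, not part of any actual CDBMS product.

```python
# Hypothetical sketch of peer-to-peer query routing in a CDBMS.
# Any node can receive a query; it forwards it to the node best
# able to answer, using its map of what every node stores.

class Node:
    def __init__(self, name, datasets, load=0.0):
        self.name = name
        self.datasets = set(datasets)  # data this node is responsible for
        self.load = load               # current workload, 0.0 (idle) to 1.0 (busy)

def route_query(query_datasets, nodes):
    """Pick the node holding the most of the requested data,
    breaking ties by choosing the least busy node."""
    def score(node):
        coverage = len(node.datasets & query_datasets)
        return (coverage, -node.load)
    return max(nodes, key=score)

nodes = [
    Node("cloud-1", {"crm", "orders"}, load=0.7),
    Node("dc-1", {"orders", "inventory"}, load=0.2),
    Node("dc-2", {"hr"}, load=0.1),
]
best = route_query({"orders", "inventory"}, nodes)
print(best.name)  # dc-1 holds both requested datasets
```

A real CDBMS node would of course weigh network distance and node performance as well; the point here is only that each peer can make the routing decision locally from its shared map.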
Consider the situation illustrated in Figure 3, where Node A of the CDBMS is managing queries for files A1 and A5 and databases A2, A3 and A4. If the workload gets too great for the resources at its disposal, then, assuming that there is another server available to use, it could split like an amoeba: when the workload expands, Node A instantiates a new node, A'. The original node might take responsibility for file A1 and databases A2 and A3, while the newly created node A' takes responsibility for A4 and A5. In order to do this, Node A would have to keep a full history of query traffic so that it could calculate the optimal division as it split in two. Similarly, there would need to be a reverse procedure that amalgamated two local nodes in the event that the query workload diminished.

Figure 3. Cloud Database Node Splitting

In concept, that takes care of queries that only access local data that Node A has responsibility for. However, there will necessarily be queries that span multiple nodes.

Distributed Queries

Consider the major entities that a company holds as data: customer, product, sales transaction, staff member, supplier, purchase transaction and so on. They crop up in many applications. Consequently, many queries that seek information on these major entities will inevitably span multiple nodes of a CDBMS. Even if we could find a convenient way to distribute and cluster the applications around these entities, there would be many queries that spanned multiple nodes. Most query-oriented databases, column store databases or traditional relational databases, could be configured to handle single-node queries. Technically, the fundamental challenge for the CDBMS is to handle distributed queries effectively. A distributed query which accesses multiple nodes of the CDBMS can be thought of as an amalgamation (a union) of several queries that access individual nodes of the CDBMS.
This is illustrated in Figure 4. Note that the resolution of a query in this manner could result in more than one result set from each node, as illustrated. Once the answers have been calculated, the CDBMS has to determine which node will join them together.
Figure 4. CDBMS: Distributed Queries. The node that receives the query decomposes it into subqueries; the most cost-effective node performs the join.

The best node to choose is the one whose cost in time is least. That can depend upon many physical factors: not just the volume of data that needs to be transmitted, but the network speeds and how long it will take each node to carry out its work. It could even depend upon which node is currently busiest. The challenge is to find the fastest solution, and the problem is not a trivial one.

Other Cloud Database Issues

There are other issues that a CDBMS needs to address. A primary one is high availability. This is a necessity rather than a nice-to-have. The CDBMS needs to be able to recover from the failure of any node and, in the extreme, the failure of a whole data center. However, that is achievable by any distributed database that is capable of replicating its nodes. There are also the traditional issues of database security and the broader issues of data quality and data governance. However, these are not show-stoppers. The CDBMS has to be able to assemble a complete metadata map of all the nodes. For that reason, data security, data quality and data governance issues can be handled as if the CDBMS were a single database. There is also the need to provide support for a variety of data access interfaces. Ultimately these will include the usual SQL interfaces (ODBC, JDBC, ADO.NET), web services interfaces (HTTP, REST, SOAP, XQuery, etc.) and any other specialized interfaces such as MDX (for data cubes). All of these features are both necessary and important, but catering for them is not where the main challenge lies. The greatest engineering challenge is in optimizing varied query workloads across a widely distributed resource space in a manner that consistently performs well.
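The cost factors involved in choosing the join node can be made concrete with a toy cost model. The sizes, link speeds, busyness penalty and the linear cost formula below are all assumptions for the sake of the sketch, not measured values or any product's actual optimizer.

```python
# Illustrative cost model for choosing which node performs the final
# join of a distributed query: ship every other node's result set to
# the candidate host, penalized by how busy that host currently is.

def join_cost(host, result_sizes_mb, link_speed_mb_s, busy_factor):
    """Estimated time (seconds) for `host` to gather the other
    nodes' result sets, scaled up by the host's busyness."""
    transfer = sum(
        size / link_speed_mb_s[(node, host)]
        for node, size in result_sizes_mb.items()
        if node != host
    )
    return transfer * (1.0 + busy_factor[host])

def best_join_node(nodes, result_sizes_mb, link_speed_mb_s, busy_factor):
    return min(nodes, key=lambda n: join_cost(
        n, result_sizes_mb, link_speed_mb_s, busy_factor))

nodes = ["node2", "node5"]
sizes = {"node2": 800, "node5": 40}                       # result sizes in MB
links = {("node2", "node5"): 100, ("node5", "node2"): 100}  # MB/s per link
busy = {"node2": 0.0, "node5": 0.5}

print(best_join_node(nodes, sizes, links, busy))  # node2
```

The example illustrates the usual intuition: it is cheaper to move the small result set to the node holding the large one, especially if the latter is also less busy.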
Can a Traditional Database Evolve to be a CDBMS?

Databases came into existence over 40 years ago because of the limitations of file systems. They were a more effective mechanism for storing data, for many reasons. The main one was that they made metadata (data definition data) available, so that many different programs could use the same data store. The situation further improved with the emergence of a standard data access language: SQL. This meant that, for the most part, the programmer no longer needed to think about how data was stored. Naturally, when databases first appeared, a hope arose that it would eventually be possible to store all of a company's data in a single database. It was a forlorn hope.

Relational Database Evolution

Relational databases (RDBMS) became the dominant type of database as soon as computer hardware was fast enough to enable their use for OLTP. The relational database was originally viewed as a more appropriate database for query workloads, and it was. But in time it was engineered to be suitable for OLTP. Once databases had standardized around a data model (relational) and an access language (SQL), the hope that it would become possible to implement a single corporate database for use by all programs strengthened. There were many reasons why this did not happen. The major ones were:

RDBMS products could cater for many different data structures, but never catered for every possible data structure. The relational model was not a universal model of data and, to compound this problem, SQL was not a universal data access language that could access any kind of data structure. In practice this meant that the RDBMS was simply unfit for storing some kinds of data. Specifically, RDBMS did not properly cater for many important data types (e.g. text, composite data types, etc.). Consequently, other types of database arose (e.g. object databases, text databases, content databases, etc.).
Even though the RDBMS was based on the use of a two-dimensional structure (the table), it never catered for structures of a higher dimension. This meant it did not cater for 3D data cubes or higher-dimensional data cubes. Consequently, specific databases emerged for dealing with such structures (OLAP databases). Most importantly, RDBMS did not directly cater for the dimension of time and for time series data.

While RDBMS could cater to both OLTP and query workloads, it never had the performance capability to cater for both types of workload at the same time. From an engineering perspective it made much more sense to have two database instances, one configured for OLTP and another, fed from the first, configured and tuned for query traffic.

Most RDBMS products charged license fees, so Independent Software Vendors (ISVs) rarely used them. But even when open source RDBMS products became available at no charge, most ISVs continued to ignore them, preferring their own file structures. The IT industry never even tried to agree on a standard file format that exposed the metadata of a file. Thus the commonly used operating systems never provided such a file type. This meant there was no alternative for ISVs but to constantly invent new types of files, and even new data types, for the data they stored. This brought us to the situation where the industry began to accept a de facto reality: there was structured data, held in databases with its metadata available; and there was unstructured data, held in files of various kinds where the metadata was either unavailable or incomplete.

Scale and Scalability

In the light of these constraints, databases evolved in two directions. On one hand, databases accommodated some unstructured data through extensions to the relational model, implementing some version of an object-relational model. On the other hand, the dream of a single corporate database continued, but only for query traffic, giving rise to the idea of the data warehouse. In practice, data warehouses were an attempt to scale up by storing all data in a single instance of a database. But in practice they never did scale up. From the get-go, users were forced to store data subsets in data marts. Focusing all query workloads on the data warehouse would have paralyzed it. Because of the limitations of the relational model, some of the data marts were OLAP databases holding multidimensional data cubes. The impressive march of Moore's Law, which vaporized performance issues in many areas of IT, never came close to fixing this scalability issue, and it still hasn't.

Data flowed from operational systems, through ETL and data quality programs, into a data warehouse for later extraction into a data mart for eventual use. This was a slow process. Consequently, software designed to shortcut that pedestrian route emerged, called Enterprise Information Integration (EII) software. EII tools created Operational Data Stores which were nothing more than accelerated data marts. The RDBMS did not scale out, and little effort was put into making it do so.
So when the likes of Yahoo and Google assembled large data centers with thousands of servers, there was no database technology at all that could scale out across such large computing grids. This gave rise to a completely different approach to scaling out for large volumes of data, which went by the name of MapReduce and which gave rise to Hadoop, a programming framework for implementing MapReduce across large grids of servers.

The Coming of the Column Store

As a database idea, the column store is very old; it goes back to the 1970s. Edward Glaser, principal developer on the MIT MULTICS project, first proposed the idea, and it was used by IBM on a database called APLDI. It came back into fashion via Sybase and Sand Technology when the scalability limitations of the indexed data structures that RDBMS used became more apparent. Column-store databases became increasingly popular with the emergence of new start-up database companies like Vertica and ParAccel that took this approach. The column stores were RDBMS in the sense that they employed SQL as the primary data access language and they held data in tables, but at a physical level they stored columns rather than tables, they made heavy use of data compression and they didn't use indexes. The simple fact was that, while the speed at which data could be read from disk had been increasing rapidly over the years, the speed of the movement of the read/write head across the disk had not increased much. Consequently, using indexes for accessing data on disk had become a liability. It caused disk head movement and slowed everything down. It had become far faster to read data serially from disk.

Figure 5. Column Store DBMS Scalability. Data is compressed, then partitioned on disk by column and by range; the query is decomposed into a sub-query for each node; the columnar database scales up and out by adding more servers.

This gave rise to the scalability approach illustrated in Figure 5, which depicts the general approach of the column store DBMS to scalability. First of all, data is compressed when it is loaded, resulting in a much smaller volume of data; one twentieth of the original raw data is achievable. Then the data is stored in columns. The columns may also be split up between disks and between servers. This ensures good parallelism. A query may need to read the whole of a column from a table, for example, so if the column is split between 12 disks that are split between two servers, then the data retrieval may be 12 times faster. Furthermore, the servers will most likely be configured with a high level of memory so that a good deal of the data is already in memory. The caching algorithms will probably split a fair amount of the memory equally between the disks to balance the average workload. In addition to this, multiple processes will be running, and they will be distributed between multiple cores in the CPUs on each server.
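Two of the techniques just described, compression and column partitioning, can be shown in a toy form. This is a deliberately simplified sketch: real column stores use far more sophisticated encodings, and the function names here are invented for illustration.

```python
# Toy illustration of two column-store techniques: run-length
# compression of a column with repeated values, and range-partitioning
# a column into chunks so several servers can scan it in parallel.

def run_length_encode(column):
    """Compress runs of repeated values as [value, count] pairs."""
    encoded = []
    for v in column:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return encoded

def range_partition(column, n_parts):
    """Split a column into n roughly equal chunks, one per server."""
    size = -(-len(column) // n_parts)  # ceiling division
    return [column[i:i + size] for i in range(0, len(column), size)]

region = ["east"] * 4 + ["west"] * 2
print(run_length_encode(region))           # [['east', 4], ['west', 2]]
print(range_partition(list(range(6)), 3))  # [[0, 1], [2, 3], [4, 5]]
```

Sorted, low-cardinality columns compress dramatically under run-length encoding, which is one reason column stores can reach the twenty-to-one compression ratios mentioned above.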
The overall performance of the column store DBMS will depend on how well the software balances the workload when multiple queries are processed. This solution has the advantage that you can simply add more servers as the data volume expands, and the balancing of the workload across 3, then 4, then 5 servers will usually work out well. This solution scales out onto multiple servers more effectively than the traditional RDBMS, which is precisely why it has become popular. Unfortunately, it will hit a limit at some point. Clearly that limit will depend upon the structure of the data and the variety of queries being processed. Even though it scales out more effectively, it is still a centralized architecture. As the workload increases, a messaging bottleneck will naturally develop at the master node of the column store database and, ultimately, this limits the number of servers it can expand onto.

Hadoop and Map/Reduce: A Distributed Architecture

The Hadoop development framework for MapReduce has attracted a great deal of attention for two reasons. First, it does scale out across large grids of computers; second, it is the product of an open source project, so companies can test it out at low cost. MapReduce is a parallel architecture designed by Google specifically for large-scale search and data analysis. It is very scalable and works in a distributed manner. The Hadoop environment is a MapReduce framework that enables the addition of Java software components. It also provides HDFS (the Hadoop Distributed File System) and has been extended to include HBase, which is a kind of column store database. Figure 6 shows how Hadoop works. Basically, a mapping function partitions data and then passes it to a reducing function, which calculates a result. In the diagram we show many nodes (servers), with nodes 1 to i running the mapping process and nodes i+1 to k running the reducing process. The environment is designed to recover from the failure of any node.
The HDFS holds a redundant copy of all data, so if any node fails, the same data will be available through another node. Every server logs what it is doing and can be recovered using its backup/recovery file if it fails. Because of that, Hadoop/MapReduce is quite slow at each node, but it compensates for this by scaling out over thousands of nodes. It has been used productively on grids of over 5,000 servers. Node failure is a daily event when you have that many commodity servers working together, so at that scale its recoverability is an advantage.

Figure 6. Hadoop & MapReduce

With MapReduce, all the data records consist of a simple key and value pair. An example might be a log file, consisting of message codes (the key) and the details of the condition being reported (the value). For the sake of illustrating the MapReduce process, imagine we have a log file of many terabytes containing messages and message codes, and we simply want to count each type of message record. It could be done in the following way:
- The log file is loaded into the HDFS file system.
- Each mapping node reads some of the log records.
- The mappers look at each record they read and output a key-value pair containing the message code as the key and 1 as the value (the count of occurrences).
- The reducer(s) sort by the key and aggregate the counts.
With repeated reductions it will eventually arrive at the result: a map of distinct keys with their overall counts from all inputs. While this example is very simple, if we had a very large fact table of the type that might reside in a data warehouse, we could execute SQL queries in the same way. The map process would be the SQL SELECT and the reduce process could simply be the sorting and merging of results. You can add any kind of logic to either the map or the reduce step, and you can also have multiple map and reduce cycles for a single task. Also, by deploying HBase it is possible to have a very large, massively parallel column-store database that presides over petabytes of data and which can be regularly updated.
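The message-count example above can be rendered as a minimal, in-memory sketch. A real Hadoop job distributes the map and reduce steps across many nodes and shuffles data between them; this sketch shows only the logic, and the record format is invented for illustration.

```python
# Minimal in-memory rendering of the MapReduce message-count example:
# mappers emit (message_code, 1) pairs; the reducer groups by key
# and sums the counts.

from collections import defaultdict

def map_phase(log_records):
    # Each mapper emits a (key, 1) pair per record it reads.
    for code, _details in log_records:
        yield (code, 1)

def reduce_phase(pairs):
    # The reducer groups the pairs by key and aggregates the counts.
    counts = defaultdict(int)
    for code, n in pairs:
        counts[code] += n
    return dict(counts)

log = [("E100", "disk full"), ("W200", "retry"), ("E100", "disk full")]
print(reduce_phase(map_phase(log)))  # {'E100': 2, 'W200': 1}
```

Because the map step is stateless per record and the reduce step only needs records grouped by key, both steps parallelize naturally across thousands of servers, which is exactly the property Hadoop exploits.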
The CDBMS

Ultimately, neither column store databases nor Hadoop (with HBase) currently has the capabilities needed to function as a CDBMS. Column-store DBMS are (in most cases) centralized databases that will encounter scalability limits as data volumes and workloads increase. Ultimately, all centralized architectures suffer that fate, no matter how splendid the underlying engineering. For that reason, some of the column-store vendors are integrating with Hadoop and enhancing it in various ways. Because Hadoop provides a fully distributed environment, it is unlikely to encounter a scalability limit of the kind that would floor a centralized architecture. Hadoop was purposely designed to preside over massive tables and, in that role, it can be useful, especially for those organizations that run into scalability limits with column store databases. However, in its current form it processes only one workload at a time; it has no multiprocessing capability at all. Also, it does not work well with complex data structures, even when they contain only structured data. Big tables, yes; but lots of little tables from lots of databases, all with varying data structures, decidedly no. Neither is Hadoop equipped to easily distribute workloads across complex networks that work at varying speeds. Hadoop expects a clean environment of similar-sized servers, all networked together at the same speed in an orderly fashion. Its secret sauce is homogeneity in everything it does. A CDBMS has to be able to handle heterogeneity at every level.
Algebraix and the Cloud Database

Algebraix's A2DB is, uniquely, an algebraic database. As such, it is capable of representing any kind of data in an algebraic form and managing it accordingly. Many databases (RDBMS and derivative products) are constrained by the relational model of data, unable to handle data that does not fit in that limited environment. A2DB is not constrained in that way. Its algebraic nature allows it to represent hierarchies, ordered lists, recursive data structures and compound data objects of any kind. (For a more detailed mathematical explanation of how it achieves this, read the Bloor Group white paper: Doing The Math.)

Algebraic Optimization and the Use of Intermediate Results

To understand how Algebraix's technology could implement a CDBMS, you need to understand the optimization strategy it implements. The A2DB product stores all the sets it calculates, including all intermediate result sets, for possible reuse. Consider a fairly simple query which accesses some rows and columns from one table and then joins them to some rows and columns of another table. Most databases will select the data from the first table, select it from the second table and then join the resulting two tables together to provide the answer. A2DB behaves in the same manner, but with the additional nuance that it stores the first selection, the second selection and the joined result for possible later use. If later queries make the same selection, or select a subset of either of the two stored selections, then A2DB will reuse those results. Once A2DB has processed many queries, it has assembled a reasonably large population of these intermediate results. Not only does it store each such set of data, it also stores their algebraic representations. So when it processes a new query, it simply examines its store of algebraic representations and selects those that can contribute to resolving the query.
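The reuse strategy just described can be approximated in a short sketch. To be clear about the assumptions: A2DB matches stored sets by their algebraic representations, whereas this toy cache keys results by a (table, columns, predicate) tuple and reuses a stored set when a later request selects a subset of its columns under the same predicate. The `ResultCache` class and its methods are invented for illustration.

```python
# Hedged sketch of intermediate-result reuse. A stored selection can
# answer a later query that asks for a subset of its columns under
# the same predicate, avoiding a trip back to the base table.

class ResultCache:
    def __init__(self):
        # (table, frozenset(columns), predicate) -> list of row dicts
        self.store = {}

    def put(self, table, cols, predicate, rows):
        self.store[(table, frozenset(cols), predicate)] = rows

    def lookup(self, table, cols, predicate):
        """Return a stored result that covers this selection, if any."""
        wanted = frozenset(cols)
        for (t, stored_cols, pred), rows in self.store.items():
            if t == table and pred == predicate and wanted <= stored_cols:
                # Project the stored rows down to the requested columns.
                return [{c: row[c] for c in cols} for row in rows]
        return None

cache = ResultCache()
cache.put("sales", ["region", "amount"], "year=2011",
          [{"region": "east", "amount": 100}])
# A later query asking only for `region` under the same predicate is
# answered from the stored intermediate result, not the base table.
print(cache.lookup("sales", ["region"], "year=2011"))  # [{'region': 'east'}]
```

An algebraic optimizer generalizes this idea considerably, since it can also recognize subset relationships between predicates, not just between column lists; the sketch shows only the simplest case.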
It then works out which of them has the least cost in terms of resource usage, and uses those sets to resolve the query. Figure 7 illustrates how the performance of A2DB improves when the same type of query is repeated. The first time a query runs, response is slow, but it improves with each repetition until the response time falls to a very low level. This happens with all types of query.

Figure 7. The A2DB Optimizer Performance Curves

Figure 8. Algebraix's Technology in a Distributed Operation

The use of intermediate result sets proves valuable in a distributed environment and a cloud environment. Figure 8 illustrates this. The distributed architecture is peer-to-peer, so there could be many such nodes, even thousands, all functioning in the same way. On the left of the diagram are the data sources that this particular node takes input from and is responsible for. In order to load the database node, it is only necessary to create load files of the source databases. The database doesn't immediately load the data; it just loads the metadata from those files. The way the technology works is that there is no data load per se. As queries arrive, it references the load files (or log files or other data files) and gradually accumulates intermediate result sets, which constitute its managed data store, as illustrated. It uses physically efficient mechanisms to store such data, the same techniques as the typical column store database: no indexes, data compression and data partitioning. There is complete separation between the logical representation of the data sets stored and the physical storage of those data sets. It works in the following way: the XSN Translator translates a query into an algebraic representation that corresponds with the algebraic sets defined at a logical level in the Universe Manager. (XSN stands for Extended Set Notation.)
The Universe Manager holds a logical model of all the database's sets and their relations. The Optimizer first works out which stored sets might participate in a solution. It may deduce that it has to go to source data (load files) for all or part of the data requested by the query.
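The candidate-and-cost loop - enumerate solutions from stored sets, then cost them against physical placement information (the Resource Manager's role, described next) - might be sketched like this. The names and cost weights are invented for illustration, not Algebraix's actual cost model.

```python
# Assumed relative read costs: memory-resident sets are far cheaper than disk.
COST_PER_BYTE = {"memory": 1, "disk": 20}

def cheapest_solution(solutions, placement):
    """solutions: list of candidate solutions, each a list of (set_name, bytes).
    placement: {set_name: 'memory' | 'disk'}, as the resource manager would know it.
    Returns the candidate with the lowest estimated physical cost."""
    def cost(solution):
        return sum(size * COST_PER_BYTE[placement[name]] for name, size in solution)
    return min(solutions, key=cost)

placement = {"s1": "memory", "s2": "disk", "s3": "memory"}
candidates = [[("s1", 100), ("s2", 100)],   # 100*1 + 100*20 = 2100
              [("s1", 100), ("s3", 500)]]   # 100*1 + 500*1  = 600
best = cheapest_solution(candidates, placement)
```

Even though the second candidate reads five times as much data, it wins because all of its sets are cached in memory.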
In any event, the search for alternatives will yield one or more possible solutions. The Optimizer now consults the Resource Manager and tests each of its algebraic solutions against physical information held by the Resource Manager. The Resource Manager knows whether data is on disk or cached in memory, and it knows how it is physically organized. Armed with precise cost information, the Optimizer works out the physical cost of each algebraic solution and chooses the fastest one. Once the Optimizer has decided on a solution, it passes it to the Set Processor, which executes it.

The Distributed Query

Now consider what happens if the query requests some data that is not on this database node. How does the node know what to do? By design, the Universe Manager doesn't just hold a map of local data; it also holds a global map that identifies all other database nodes and the data they are responsible for. When we described how the database handles a query, we omitted to discuss how it handles a query that spans more than one node. Such a query will naturally involve a join of some kind, with one or more parts of the join operation referencing remote data. The mode of operation of Algebraix's technology is essentially the same, but slightly more complex. The Optimizer always checks to see if any of the data requested is part of the remote universe rather than the local universe. If it discovers that some element in the query references remote data, it deconstructs the query into several parts, as follows:

- A subquery for this node
- A subquery for each remote node that is involved
- A master query that joins together all the results of all the subqueries

It calculates which node is the best node to execute the master query by estimating the resource cost of transporting result data from one location to another. If it decides to pass that responsibility to another node, then it behaves as follows:

- It passes all the other subqueries to the nodes where they need to execute.
- It also informs each node where to deliver the result of its subquery.
- It then executes its own subquery and passes the result to the master node when local processing completes. At that point it has finished with that query.

If it has determined that it is, itself, the best node to execute the master query, it behaves as follows:

- It passes all the other subqueries to the nodes where they need to execute.
- It gives itself as the return address for the results of those subqueries.
- It executes its own subquery.
- When it receives all the remote result sets, it executes the master query.
- Finally, it dispatches the end result to the program that sent the query.

Note that in carrying out such a distributed query the database gathers some remote result sets at the node that masters the distributed query. It will save these results as remote result sets in the same way that it saves local result sets, so that when more queries of that type come in it may be able to resolve those queries locally rather than in a distributed manner.

Failover

With Hadoop, the failure of any node can be catered for. The same is true of Algebraix's technology. It is fairly easy to configure complete node mirrors so that a standby node can take over immediately if an active node fails. It would be more economical, though, to use a SAN at each data center and only mirror data that is written to disk (the intermediate results). Then if a node fails, it will be possible to recover the node from the SAN. This injects a greater delay into the recovery process, as the recovered database would have to recreate the last known state of the failed node.

In practice, Algebraix's technology can run on commodity servers. While it may appear that it has a substantial requirement for data storage, because of its strategy of storing intermediate results, in practice this is not the case. This is because, after a suitable time has passed, the database deletes the intermediate results it didn't reuse. The database rarely requires the deployment of additional storage (such as NAS or a SAN). For atypical workloads, special configurations can be deployed for any given node.

Node Splitting

Node splitting becomes necessary when the query load for a node becomes too great. The need becomes apparent when the performance of the node begins to decline. However, node splitting is simple to achieve. A replica of the node is created, and the data sources that the new node will be responsible for are defined - deleting those it will not be responsible for from the Universe Manager. The technology can estimate what the best split is likely to be from an analysis of past query workloads. It can also recognize which intermediate results are derived from which source files or databases. So it reclassifies those intermediate results as remote rather than local.
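Because every intermediate result can be traced back to a source, the reclassification step of a node split is mechanical. A hypothetical sketch (invented structure; the paper does not describe the internal representation):

```python
def split_node(results, sources_moving):
    """results: {result_id: (source, 'local' | 'remote')}.
    sources_moving: the set of data sources handed to the new replica node.
    Returns the reclassified map: results derived from a moved source
    flip from 'local' to 'remote'; everything else is unchanged."""
    return {rid: (src, "remote" if src in sources_moving else cls)
            for rid, (src, cls) in results.items()}

results = {"r1": ("orders_db", "local"), "r2": ("hr_db", "local")}
after = split_node(results, {"hr_db"})
# r2 is now the new node's responsibility, so this node treats it as remote
```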
The original node is reconfigured in the same way, deleting the data sources that it is no longer responsible for. The nature of the changes is then relayed to all the nodes in the cloud database.

Growth

Most source data will consist of databases that are themselves being added to on a regular basis. That data growth is best dealt with by feeding database log file images to the database. For other applications, which simply use file systems, it is best to feed the equivalent of an update audit trail to the database. There is a specific reason for this. Algebraix's technology does not cater for updated data in the way most databases do. Typically, database updates destroy data by over-writing one value with another. This database technology is different. It treats updates as additional (i.e. new) data. In effect, they become non-destructive updates, with a record of the previous values remaining. For deletions, it simply marks the set of data or a data item as no longer current. To achieve these things, the database adds a time stamp to all data as it arrives and is used (if such a time stamp does not exist in the source data). All queries to the database either specify the time that applies, so that the result has an "as at" date/time, or omit the time, in which case the current
date and time is applied. So all updates are taken into account when the associated data is processed according to its time stamp. Because of this, all intermediate result tables also have an "as at" date/time associated with them. The database is configured at every node to accept new data on the basis of a timed switch. It is inadvisable to set the time switch to too short a period, as this rapidly increases the number of sets held by the Universe Manager - and this, in turn, could impact performance.

The Economy of A2DB

In any database, and especially in any distributed database, it is always possible to pose queries that will take a long time to answer. This technology does not make that problem suddenly disappear. For example, if you join two terabyte-sized tables together that are on different nodes, a terabyte of data must pass over the network. If it is a slow network line, the query could take a very long time. If such a query is frequently run, the database will solve this particular performance issue naturally by holding one of the terabyte tables as an intermediate result.

If you have a petabyte or even several petabytes of data that you wish to query regularly, then the database could be used for the task by deploying it on a sufficient number of nodes. In such circumstances it could look quite similar to Hadoop (with HBase). However, that is not the prime requirement of a cloud database. A cloud database needs to be able to handle heterogeneous workloads, some of which access complex data structures, and it needs to do so with economy and with speed. That is what Algebraix's technology does. In the distributed environment it is helped by the fact that users and programs that request data normally do not pose queries that have terabyte-sized answers. They pose queries that have quite short answers - a few megabytes or less. An exception is when users are downloading a large data extract for more detailed analysis, but such downloads are relatively rare.
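The same arithmetic drives both the choice of master node in a distributed query and the economy argument here: run the join where the fewest bytes must cross the network. A toy sketch with invented sizes (not Algebraix's actual cost model):

```python
def choose_master(result_bytes):
    """result_bytes: {node: estimated size of that node's subquery result}.
    Executing the master query at node n means shipping every other node's
    result to n, which costs total - result_bytes[n]; minimize that."""
    total = sum(result_bytes.values())
    return min(result_bytes, key=lambda n: total - result_bytes[n])

sizes = {"node_i": 1_000_000_000_000,  # holds a terabyte-sized table
         "node_j": 5_000_000,
         "node_k": 2_000_000}
master = choose_master(sizes)  # mastering at node_i moves ~7 MB instead of ~1 TB
```

The node holding the terabyte-sized intermediate result is the cheapest master, which is why frequently run large joins gravitate toward the data rather than the other way around.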
This distributed approach has the virtue that it naturally localizes data to suit the query traffic. In each node it keeps the data that is frequently queried in memory. In a distributed environment with multiple nodes it will, through its natural performance mechanisms, gradually localize the data to suit the local and global query traffic. If query volumes rise too high at a given node, then the node can split like an amoeba to cater for the rising workload. If the query traffic changes - with, say, one kind of query not being posed so frequently and a new set of previously unknown queries becoming common - the database will simply adjust, by changing the intermediate results it holds. After three or four queries of each new query type its natural performance will be restored.

The nature of this technology, coupled with the fact that it can be configured for high availability, qualifies it as suitable for deployment as a cloud database.
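The non-destructive, time-stamped update model described under Growth can be sketched as an append-only store: updates append a new version, deletions append a tombstone, and an "as at" query filters by time stamp. This is a hypothetical structure for illustration, not Algebraix's implementation.

```python
import time

class VersionedStore:
    def __init__(self):
        self.rows = []  # (timestamp, key, value_or_None) - None marks a delete

    def put(self, key, value, ts=None):
        # Non-destructive: a new version is appended, old versions remain.
        self.rows.append((ts if ts is not None else time.time(), key, value))

    def delete(self, key, ts=None):
        # Deletion is a tombstone, not an overwrite.
        self.put(key, None, ts)

    def as_at(self, when=None):
        # Replay all rows up to 'when' (default: now) in time-stamp order,
        # keeping the latest surviving version of each key.
        when = when if when is not None else time.time()
        current = {}
        for ts, key, value in sorted(self.rows, key=lambda r: r[0]):
            if ts <= when:
                current[key] = value
        return {k: v for k, v in current.items() if v is not None}

store = VersionedStore()
store.put("price", 10, ts=1)
store.put("price", 12, ts=2)   # the value at ts=1 survives
store.delete("price", ts=3)    # marked no longer current, not destroyed
```

With this shape, `as_at(1.5)` sees the original value, `as_at(2.5)` sees the update, and any time after the tombstone sees nothing - exactly the "as at" semantics the paper describes for queries and intermediate result tables.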
About The Bloor Group

The Bloor Group is a consulting, research and analyst firm that focuses on quality research and analysis of emerging information technologies across the whole spectrum of the IT industry. The firm's research focuses on understanding both the technical features and the business value of information technologies and how they are successfully implemented within modern computing environments. Additional information on The Bloor Group can be found at www.TheVirtualCircle.com and www.BloorGroup.com.
Highly Available Service Environments Introduction
Highly Available Service Environments Introduction This paper gives a very brief overview of the common issues that occur at the network, hardware, and application layers, as well as possible solutions,
Trafodion Operational SQL-on-Hadoop
Trafodion Operational SQL-on-Hadoop SophiaConf 2015 Pierre Baudelle, HP EMEA TSC July 6 th, 2015 Hadoop workload profiles Operational Interactive Non-interactive Batch Real-time analytics Operational SQL
Why DBMSs Matter More than Ever in the Big Data Era
E-PAPER FEBRUARY 2014 Why DBMSs Matter More than Ever in the Big Data Era Having the right database infrastructure can make or break big data analytics projects. TW_1401138 Big data has become big news
Beyond Data Migration Best Practices
Beyond Data Migration Best Practices Table of Contents Executive Summary...2 Planning -Before Migration...2 Migration Sizing...4 Data Volumes...5 Item Counts...5 Effective Performance...8 Calculating Migration
