WHAT IS A CLOUD DATABASE?
The Suitability of Algebraix's Technology to Cloud Computing
Robin Bloor, PhD
WHITE PAPER
Copyright 2011, The Bloor Group. All rights reserved. Neither this publication nor any part of it may be reproduced or transmitted or stored in any form or by any means, without either the prior written permission of the copyright holder or the issue of a license by the copyright holder. The Bloor Group is the sole copyright holder of this publication. Oban Drive, Spicewood, TX. Contact: [email protected] www.TheVirtualCircle.com www.BloorGroup.com
Executive Summary

This white paper was commissioned by Algebraix. The goal of the paper is to provide a definition of what a cloud database is and, in the light of that definition, to examine the suitability of Algebraix's technology to fulfill the role of a cloud database. Here is a brief summary of the contents of this paper:

We define a cloud DBMS (CDBMS) to be a distributed database that can deliver a query service across multiple distributed database nodes located in multiple data centers, including cloud data centers. Querying distributed data sources is precisely the problem that businesses will encounter as cloud computing grows in popularity. Such a database also needs to deliver high availability and cater for disaster recovery.

In our view, a CDBMS only needs to provide a query service. SOA already delivers connectivity and integration for transactional systems, so we see no need for a CDBMS to cater for transactional traffic, only query traffic. A CDBMS needs to scale across large computer grids, but it also needs to be able to span multiple data centers and, as far as is possible, cater for slow network connections.

We review traditional databases, focusing primarily on relational databases and column store databases, concluding that such databases, as currently engineered, could not fulfill the role of a CDBMS. They have centralized architectures, and such architectures would encounter a scalability limit at some point, both within and between data centers. We conclude that a distributed peer-to-peer architecture is needed to satisfy the characteristics that we have defined.

We move on to examine the Hadoop/MapReduce environment and its suitability as a CDBMS. It has much better scalability for many workloads than relational or column store databases, because of its distributed architecture. However, it was not built for mixed workloads, for complex data structures, or even for multitasking. In its current form it emphasizes fault tolerance.
It succeeds as a database for very large volumes of data, but does not have the characteristics of a CDBMS.

Finally, we examine Algebraix's technology as implemented in its database product A2DB. Our conclusion is that it has an architecture which is suitable for deployment as a CDBMS. Our view is as follows:
- A2DB's unique capability to reuse intermediate results of queries that it has previously executed contributes to its delivering high performance at a single node.
- The same performance characteristics can be employed to speed up queries that join information between a local node and remote nodes, whether in the same data center or in a remote data center.
- Algebraix's technology is capable of global optimization, balancing the performance requirements of both global and local queries.
- Additionally, the technology can deliver high availability/fault tolerant operation.
We are aware that Algebraix has not deployed and tested its database A2DB in the role of a CDBMS; hence our conclusion is not that it qualifies as a CDBMS, but that it has an architecture that would enable it to be tested in this role.
The Cloud Database - In Concept

Cloud computing is a major driving trend for IT. Over 36 percent of US companies already run applications in the cloud (Mimecast survey, February 2010) and the major cloud vendors are growing their revenues and customer bases rapidly. Given the trends, fairly soon the majority of IT departments will be running applications in the cloud, possibly using more than one cloud provider. So corporate computing will inevitably become much more distributed than it currently is, spreading itself across multiple data centers. This will pose management, architectural and performance challenges, and foster innovation to meet those challenges.

The Cloud Implementation of Transactional and Query Systems

If we think solely in terms of database technology, the wider distribution of transactional systems, such as OLTP systems, communications applications and workflow systems, will not pose a severe problem at the data level. The sweeping success of Salesforce.com demonstrates this. The data problems of placing your CRM system in the cloud are resolved easily enough by the regular transfer of customer and other data from the cloud to the data center. Indeed, the broad success of SOA demonstrates the same thing. Loosely coupling silo transaction systems together works fine as regards the workflow between transactional systems. Because the volume of data passed between applications within a SOA is low, it is highly unlikely that the relatively slow speeds of the Internet will be prohibitive to placing some of these applications in the cloud. There will be exceptions, but in principle it will work well most of the time.

For query workloads typified by BI applications, distribution of the data across multiple data centers is more problematic. There are three main reasons for this:

1. Internet speeds are generally slow compared to data center network speeds, and this limits performance considerably.
This issue can be addressed through high-speed direct connections, but that becomes expensive very quickly.

2. Query workloads are not as predictable as transactional workloads. We can predict transactional workloads reasonably accurately, but we cannot easily predict specifically what questions a user might wish to ask; hence we are less able to predict the workload. This has profound architectural implications for the distribution of query systems. Stated simply: we don't know where best to locate the data ahead of time, because we do not know which sets of data users may wish to join together.

3. Even if we achieve an efficient distribution of data, query workloads involve the movement of much greater volumes of data than transactional workloads. That movement of data will inevitably be slower than if the data were located in a single data center.

This set of constraints suggests that it may be better to centralize query workloads in one physical location. This is traditionally how most BI domains have been constructed: around a big data warehouse with subsets of data drawn off to serve individual BI applications. But ultimately that approach fails the test of scalability. A centralized architecture scales poorly over very large numbers of nodes. Bottlenecks eventually arise.
Towards a Cloud Database

For the moment, we will set aside the fact that there are many challenges in implementing a distributed architecture for query workloads across several data centers, and provide a view of what a cloud database would look like. We can define a cloud DBMS (CDBMS) as a distributed database that delivers a query service across multiple distributed database nodes located in multiple geographically distributed data centers, both corporate data centers and cloud data centers.

So think in terms of an organization with some applications running in the cloud: perhaps Salesforce.com plus some hosted transactional web applications in some remote data center, plus local applications, including BI applications, split between two data centers. Such a situation is illustrated in Figure 1. It is the typical situation that companies will have to deal with as we move forward.

Figure 1. A CDBMS

In practice, a query can originate from anywhere: from a PC within the corporation, which is connected by a fast line to the local data center; from a PC in the home via a VPN line; from a laptop via a WiFi connection; or from a smart phone via a 3G or 4G connection. For that reason we represent a query here as coming through the Internet, implying that the response will possibly travel through the Internet too. The CDBMS will not concentrate all query traffic through a single node. A peer-to-peer architecture will be far more scalable, with any single node able to receive any query. In such an arrangement, each node needs to have a map of the data stored at every node and know the performance characteristics of every node. When a node receives a query, its first task is to determine which node is best able to respond to the query. It then passes responsibility for the query to that node. That node executes the query and returns the result directly to the user.

Figure 1 shows more than one node in some of the data centers. In practice, it will probably be necessary to configure more than one node per data center, to distribute the database workload within the data center as well as between data centers.

Consider Figure 2. It illustrates the likely strategy that would be used by a CDBMS node in accessing data held in local transactional databases or files. If the data is held in a database, the CDBMS can either get at the data directly (via ODBC, for example) or access a replicated data store. Replication will only be needed if read access to the data imposes too great an impact on performance. Critical systems often have a hot standby in place ready to go if the primary system fails, in which case the standby system's database could be used as a data source. Data might also be drawn from operational data stores or data warehouses, with the same kind of replication strategy being employed. Where the application data is held in a file, the CDBMS will probably be able to access the data directly. For non-database data, the CDBMS would maintain a metadata map of the file so it could identify data items within the records read from the file. Finally, the CDBMS will maintain its own store of data consisting of frequently used data drawn from the data sources it accesses. This would likely be most of the data the node was responsible for, with direct access to data stores being used primarily for data refresh.

Figure 2. A CDBMS Node

Local and Distributed Data

In processing local data, the CDBMS acts as an operational data store.
It has up-to-date data and responds to queries using that data. While BI databases, such as a data warehouse or large data mart, could be included, the cloud database might replace rather than complement such data stores. There is a scalability issue here. If we consider a large data center with many terabytes of data, no matter how efficient the CDBMS node is, it probably will not be able to deal with all the query traffic. At each data center there would likely be several database nodes. And if the query traffic grew, as usually happens, the CDBMS would need to instantiate extra nodes to handle the increased workload.
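The peer-to-peer routing idea described above can be sketched in a few lines of code. This is purely an illustration under stated assumptions: the `Node` class, the `route_query` function and the scoring rule (prefer the node holding the most of the requested data, breaking ties by current load) are hypothetical, not part of any actual CDBMS product.

```python
# Hypothetical sketch of peer-to-peer query routing in a CDBMS.
# Any node can receive a query; it forwards it to the node best
# able to answer, using its map of what every node stores.

class Node:
    def __init__(self, name, datasets, load=0.0):
        self.name = name
        self.datasets = set(datasets)  # data this node is responsible for
        self.load = load               # current workload, 0.0 (idle) to 1.0 (busy)

def route_query(query_datasets, nodes):
    """Pick the node holding the most of the requested data,
    breaking ties by choosing the least busy node."""
    def score(node):
        coverage = len(node.datasets & query_datasets)
        return (coverage, -node.load)
    return max(nodes, key=score)

nodes = [
    Node("cloud-1", {"crm", "orders"}, load=0.7),
    Node("dc-1", {"orders", "inventory"}, load=0.2),
    Node("dc-2", {"hr"}, load=0.1),
]
best = route_query({"orders", "inventory"}, nodes)
print(best.name)  # dc-1 holds both requested datasets
```

A real CDBMS node would of course weigh network distance and node performance as well; the point here is only that each peer can make the routing decision locally from its shared map.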
Consider the situation illustrated in Figure 3, where Node A of the CDBMS is managing queries for files A1 and A5 and databases A2, A3 and A4. If the workload gets too great for the resources at its disposal, then, assuming that there is another server available to use, it could split like an amoeba: when the workload expands, Node A instantiates a new node, A'. The original node might take responsibility for file A1 and databases A2 and A3, while the newly created node A' takes responsibility for A4 and A5. In order to do this, Node A would have to keep a full history of query traffic so that it could calculate the optimal division as it split in two. Similarly, there would need to be a reverse procedure that amalgamated two local nodes in the event that the query workload diminished.

Figure 3. Cloud Database Node Splitting

In concept, that takes care of queries that only access local data that Node A has responsibility for. However, there will necessarily be queries that span multiple nodes.

Distributed Queries

Consider the major entities that a company holds as data: customer, product, sales transaction, staff member, supplier, purchase transaction and so on. They crop up in many applications. Consequently, many queries that seek information on these major entities will inevitably span multiple nodes of a CDBMS. Even if we could find a convenient way to distribute and cluster the applications around these entities, there would be many queries that spanned multiple nodes. Most query-oriented databases, column store databases or traditional relational databases, could be configured to handle single-node queries. Technically, the fundamental challenge for the CDBMS is to handle distributed queries effectively. A distributed query which accesses multiple nodes of the CDBMS can be thought of as an amalgamation (a union) of several queries that access individual nodes of the CDBMS.
This is illustrated in Figure 4. Note that the resolution of a query in this manner could result in more than one result set from each node, as illustrated. Once the answers have been calculated, the CDBMS has to determine which node will join them together.
Figure 4. CDBMS: Distributed Queries. The node that receives the query decomposes it into subqueries; the most cost-effective node performs the join.

The best node to choose is the one whose cost in time is least. That can depend upon many physical factors: not just the volume of data that needs to be transmitted, but the network speeds and how long it will take each node to carry out its work. It could even depend upon which node is currently busiest. The challenge is to find the fastest solution, and the problem is not a trivial one.

Other Cloud Database Issues

There are other issues that a CDBMS needs to address. A primary one is high availability. This is a necessity rather than a nice-to-have. The CDBMS needs to be able to recover from the failure of any node and, in the extreme, the failure of a whole data center. However, that is achievable by any distributed database that is capable of replicating its nodes. There are also the traditional issues of database security and the broader issues of data quality and data governance. However, these are not show-stoppers. The CDBMS has to be able to assemble a complete metadata map of all the nodes. For that reason, data security, data quality and data governance issues can be handled as if the CDBMS were a single database. There is also the need to provide support for a variety of data access interfaces. Ultimately these will include the usual SQL interfaces (ODBC, JDBC, ADO.NET), web services interfaces (HTTP, REST, SOAP, XQuery, etc.) and any other specialized interfaces such as MDX (for data cubes). All of these features are both necessary and important, but catering for them is not where the main challenge lies. The greatest engineering challenge is in optimizing varied query workloads across a widely distributed resource space in a manner that consistently performs well.
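The cost factors involved in choosing the join node can be made concrete with a toy cost model. The sizes, link speeds, busyness penalty and the linear cost formula below are all assumptions for the sake of the sketch, not measured values or any product's actual optimizer.

```python
# Illustrative cost model for choosing which node performs the final
# join of a distributed query: ship every other node's result set to
# the candidate host, penalized by how busy that host currently is.

def join_cost(host, result_sizes_mb, link_speed_mb_s, busy_factor):
    """Estimated time (seconds) for `host` to gather the other
    nodes' result sets, scaled up by the host's busyness."""
    transfer = sum(
        size / link_speed_mb_s[(node, host)]
        for node, size in result_sizes_mb.items()
        if node != host
    )
    return transfer * (1.0 + busy_factor[host])

def best_join_node(nodes, result_sizes_mb, link_speed_mb_s, busy_factor):
    return min(nodes, key=lambda n: join_cost(
        n, result_sizes_mb, link_speed_mb_s, busy_factor))

nodes = ["node2", "node5"]
sizes = {"node2": 800, "node5": 40}                       # result sizes in MB
links = {("node2", "node5"): 100, ("node5", "node2"): 100}  # MB/s per link
busy = {"node2": 0.0, "node5": 0.5}

print(best_join_node(nodes, sizes, links, busy))  # node2
```

The example illustrates the usual intuition: it is cheaper to move the small result set to the node holding the large one, especially if the latter is also less busy.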
Can a Traditional Database Evolve to be a CDBMS?

Databases came into existence over 40 years ago because of the limitations of file systems. They were a more effective mechanism for storing data, for many reasons. The main one was that they made metadata (data definition data) available, so that many different programs could use the same data store. The situation further improved with the emergence of a standard data access language: SQL. This meant that, for the most part, the programmer no longer needed to think about how data was stored. Naturally, when databases first appeared, a hope arose that it would eventually be possible to store all of a company's data in a single database. It was a forlorn hope.

Relational Database Evolution

Relational databases (RDBMS) became the dominant type of database as soon as computer hardware was fast enough to enable their use for OLTP. The relational database was originally viewed as a more appropriate database for query workloads, and it was. But in time it was engineered to be suitable for OLTP. Once databases had standardized around a data model (relational) and an access language (SQL), the hope that it would become possible to implement a single corporate database for use by all programs strengthened. There were many reasons why this did not happen. The major ones were:

RDBMS products could cater for many different data structures, but never catered for every possible data structure. The relational model was not a universal model of data and, to compound this problem, SQL was not a universal data access language that could access any kind of data structure. In practice this meant that the RDBMS was simply unfit for storing some kinds of data. Specifically, RDBMS did not properly cater for many important data types (e.g. text, composite data types, etc.). Consequently, other types of database arose (e.g. object databases, text databases, content databases, etc.).
Even though the RDBMS was based on the use of a two-dimensional structure (the table), it never catered for structures of a higher dimension. This meant it did not cater for 3D data cubes or higher-dimensional data cubes. Consequently, specific databases emerged for dealing with such structures (OLAP databases). Most importantly, RDBMS did not directly cater for the dimension of time and for time series data.

While RDBMS could cater to both OLTP and query workloads, it never had the performance capability to cater for both types of workload at the same time. From an engineering perspective it made much more sense to have two database instances, one configured for OLTP and another, fed from the first, configured and tuned for query traffic.

Most RDBMS products charged license fees, so Independent Software Vendors (ISVs) rarely used them. But even when open source RDBMS products became available at no charge, most ISVs continued to ignore them, preferring their own file structures. The IT industry never even tried to agree on a standard file format that exposed the metadata of a file. Thus the commonly used operating systems never provided such a file type. This meant there was no alternative for ISVs but to constantly invent new types of files, and even new data types, for the data they stored. This brought us to the situation where the industry began to accept a de facto reality: there was structured data, held in databases with its metadata available; and there was unstructured data, held in files of various kinds where the metadata was either unavailable or incomplete.

Scale and Scalability

In the light of these constraints, databases evolved in two directions. On one hand, databases accommodated some unstructured data through extensions to the relational model, implementing some version of an object-relational model. On the other hand, the dream of a single corporate database continued, but only for query traffic, giving rise to the idea of the data warehouse. In practice, data warehouses were an attempt to scale up by storing all data in a single instance of a database. But in practice they never did scale up. From the get-go, users were forced to store data subsets in data marts. Focusing all query workloads on the data warehouse would have paralyzed it. Because of the limitations of the relational model, some of the data marts were OLAP databases holding multidimensional data cubes. The impressive march of Moore's Law, which vaporized performance issues in many areas of IT, never came close to fixing this scalability issue, and it still hasn't.

Data flowed from operational systems, through ETL and data quality programs, into a data warehouse for later extraction into a data mart for eventual use. This was a slow process. Consequently, software designed to shortcut that pedestrian route emerged, called Enterprise Information Integration (EII) software. EII tools created Operational Data Stores which were nothing more than accelerated data marts. The RDBMS did not scale out, and little effort was put into making it do so.
So when the likes of Yahoo and Google assembled large data centers with thousands of servers, there was no database technology at all that could scale out across such large computing grids. This gave rise to a completely different approach to scaling out for large volumes of data, which went by the name of MapReduce and which gave rise to Hadoop, a programming framework for implementing MapReduce across large grids of servers.

The Coming of the Column Store

As a database idea, the column store is very old; it goes back to the 1970s. Edward Glaser, principal developer on the MIT MULTICS project, first proposed the idea, and it was used by IBM on a database called APLDI. It came back into fashion via Sybase and Sand Technology when the scalability limitations of the indexed data structures that RDBMS used became more apparent. Column-store databases became increasingly popular with the emergence of new start-up database companies like Vertica and ParAccel that took this approach. The column stores were RDBMS in the sense that they employed SQL as the primary data access language and they held data in tables, but at a physical level they stored columns rather than tables, they made heavy use of data compression and they didn't use indexes. The simple fact was that, while the speed at which data could be read from disk had been increasing rapidly over the years, the speed of the movement of the read/write head across the disk had not increased much. Consequently, using indexes for accessing data on disk had become a liability. It caused disk head movement and slowed everything down. It had become far faster to read data serially from disk.

Figure 5. Column Store DBMS Scalability. Data is compressed, then partitioned on disk by column and by range; the query is decomposed into a sub-query for each node; the columnar database scales up and out by adding more servers.

This gave rise to the scalability approach illustrated in Figure 5, which depicts the general approach of the column store DBMS to scalability. First of all, data is compressed when it is loaded, resulting in a much smaller volume of data; one twentieth of the original raw data is achievable. Then the data is stored in columns. The columns may also be split up between disks and between servers. This ensures good parallelism. A query may need to read the whole of a column from a table, for example, so if the column is split between 12 disks that are split between two servers, then the data retrieval may be 12 times faster. Furthermore, the servers will most likely be configured with a high level of memory so that a good deal of the data is already in memory. The caching algorithms will probably split a fair amount of the memory equally between the disks to balance the average workload. In addition to this, multiple processes will be running, and they will be distributed between multiple cores in the CPUs on each server.
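Two of the techniques just described, compression and column partitioning, can be shown in a toy form. This is a deliberately simplified sketch: real column stores use far more sophisticated encodings, and the function names here are invented for illustration.

```python
# Toy illustration of two column-store techniques: run-length
# compression of a column with repeated values, and range-partitioning
# a column into chunks so several servers can scan it in parallel.

def run_length_encode(column):
    """Compress runs of repeated values as [value, count] pairs."""
    encoded = []
    for v in column:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return encoded

def range_partition(column, n_parts):
    """Split a column into n roughly equal chunks, one per server."""
    size = -(-len(column) // n_parts)  # ceiling division
    return [column[i:i + size] for i in range(0, len(column), size)]

region = ["east"] * 4 + ["west"] * 2
print(run_length_encode(region))           # [['east', 4], ['west', 2]]
print(range_partition(list(range(6)), 3))  # [[0, 1], [2, 3], [4, 5]]
```

Sorted, low-cardinality columns compress dramatically under run-length encoding, which is one reason column stores can reach the twenty-to-one compression ratios mentioned above.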
The overall performance of the column store DBMS will depend on how well the software balances the workload when multiple queries are processed. This solution has the advantage that you can simply add more servers as the data volume expands, and the balancing of the workload across 3, then 4, then 5 servers will usually work out well. This solution scales out onto multiple servers more effectively than the traditional RDBMS, which is precisely why it has become popular. Unfortunately, it will hit a limit at some point. Clearly that limit will depend upon the structure of the data and the variety of queries being processed. Even though it scales out more effectively, it is still a centralized architecture. As the workload increases, a messaging bottleneck will naturally develop at the master node of the column store database and, ultimately, this limits the number of servers it can expand onto.

Hadoop and Map/Reduce: A Distributed Architecture

The Hadoop development framework for MapReduce has attracted a great deal of attention for two reasons. First, it does scale out across large grids of computers; second, it is the product of an open source project, so companies can test it out at low cost. MapReduce is a parallel architecture designed by Google specifically for large-scale search and data analysis. It is very scalable and works in a distributed manner. The Hadoop environment is a MapReduce framework that enables the addition of Java software components. It also provides HDFS (the Hadoop Distributed File System) and has been extended to include HBase, which is a kind of column store database. Figure 6 shows how Hadoop works. Basically, a mapping function partitions data and then passes it to a reducing function, which calculates a result. In the diagram we show many nodes (servers), with nodes 1 to i running the mapping process and nodes i+1 to k running the reducing process. The environment is designed to recover from the failure of any node.
The HDFS holds a redundant copy of all data, so if any node fails, the same data will be available through another node. Every server logs what it is doing and can be recovered using its backup/recovery file if it fails. Because of that, Hadoop/MapReduce is quite slow at each node, but it compensates for this by scaling out over thousands of nodes. It has been used productively on grids of over 5,000 servers. Node failure is a daily event when you have that many commodity servers working together, so at that scale its recoverability is an advantage.

Figure 6. Hadoop & MapReduce

With MapReduce, all the data records consist of a simple key and value pair. An example might be a log file, consisting of message codes (the key) and the details of the condition being reported (the value). For the sake of illustrating the MapReduce process, imagine we have a log file of many terabytes containing messages and message codes, and we simply want to count each type of message record. It could be done in the following way:
- The log file is loaded into the HDFS file system.
- Each mapping node reads some of the log records.
- The mappers look at each record they read and output a key-value pair containing the message code as the key and 1 as the value (the count of occurrences).
- The reducer(s) sort by the key and aggregate the counts.
With repeated reductions it will eventually arrive at the result: a map of distinct keys with their overall counts from all inputs. While this example is very simple, if we had a very large fact table of the type that might reside in a data warehouse, we could execute SQL queries in the same way. The map process would be the SQL SELECT and the reduce process could simply be the sorting and merging of results. You can add any kind of logic to either the map or the reduce step, and you can also have multiple map and reduce cycles for a single task. Also, by deploying HBase it is possible to have a very large, massively parallel column-store database that presides over petabytes of data and which can be regularly updated.
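The message-count example above can be rendered as a minimal, in-memory sketch. A real Hadoop job distributes the map and reduce steps across many nodes and shuffles data between them; this sketch shows only the logic, and the record format is invented for illustration.

```python
# Minimal in-memory rendering of the MapReduce message-count example:
# mappers emit (message_code, 1) pairs; the reducer groups by key
# and sums the counts.

from collections import defaultdict

def map_phase(log_records):
    # Each mapper emits a (key, 1) pair per record it reads.
    for code, _details in log_records:
        yield (code, 1)

def reduce_phase(pairs):
    # The reducer groups the pairs by key and aggregates the counts.
    counts = defaultdict(int)
    for code, n in pairs:
        counts[code] += n
    return dict(counts)

log = [("E100", "disk full"), ("W200", "retry"), ("E100", "disk full")]
print(reduce_phase(map_phase(log)))  # {'E100': 2, 'W200': 1}
```

Because the map step is stateless per record and the reduce step only needs records grouped by key, both steps parallelize naturally across thousands of servers, which is exactly the property Hadoop exploits.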
The CDBMS

Ultimately, neither column store databases nor Hadoop (with HBase) currently has the capabilities needed to function as a CDBMS. Column-store DBMS are (in most cases) centralized databases that will encounter scalability limits as data volumes and workloads increase. Ultimately, all centralized architectures suffer that fate, no matter how splendid the underlying engineering. For that reason, some of the column-store vendors are integrating with Hadoop and enhancing it in various ways. Because Hadoop provides a fully distributed environment, it is unlikely to encounter a scalability limit of the kind that would floor a centralized architecture. Hadoop was purposely designed to preside over massive tables and, in that role, it can be useful, especially for those organizations that run into scalability limits with column store databases. However, in its current form it processes only one workload at a time; it has no multiprocessing capability at all. Also, it does not work well with complex data structures, even when they contain only structured data. Big tables, yes; but lots of little tables from lots of databases, all with varying data structures, decidedly no. Neither is Hadoop equipped to easily distribute workloads across complex networks that work at varying speeds. Hadoop expects a clean environment of similar-sized servers, all networked together at the same speed in an orderly fashion. Its secret sauce is homogeneity in everything it does. A CDBMS has to be able to handle heterogeneity at every level.
Algebraix and the Cloud Database

Algebraix's A2DB is, uniquely, an algebraic database. As such, it is capable of representing any kind of data in an algebraic form and managing it accordingly. Many databases (RDBMS and derivative products) are constrained by the relational model of data, unable to handle data that does not fit in that limited environment. A2DB is not constrained in that way. Its algebraic nature allows it to represent hierarchies, ordered lists, recursive data structures and compound data objects of any kind. (For a more detailed mathematical explanation of how it achieves this, read the Bloor Group white paper: Doing The Math.)

Algebraic Optimization and the Use of Intermediate Results

To understand how Algebraix's technology could implement a CDBMS, you need to understand the optimization strategy it implements. The A2DB product stores all the sets it calculates, including all intermediate result sets, for possible reuse. Consider a fairly simple query which accesses some rows and columns from one table and then joins them to some rows and columns of another table. Most databases will select the data from the first table, select it from the second table and then join the resulting two tables together to provide the answer. A2DB behaves in the same manner, but with the additional nuance that it stores the first selection, the second selection and the joined result for possible later use. If later queries make the same selection, or select a subset of either of the two stored selections, then A2DB will reuse those results. Once A2DB has processed many queries, it has assembled a reasonably large population of these intermediate results. Not only does it store each such set of data, it also stores their algebraic representations. So when it processes a new query, it simply examines its store of algebraic representations and selects those that can contribute to resolving the query.
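The reuse strategy just described can be approximated in a short sketch. To be clear about the assumptions: A2DB matches stored sets by their algebraic representations, whereas this toy cache keys results by a (table, columns, predicate) tuple and reuses a stored set when a later request selects a subset of its columns under the same predicate. The `ResultCache` class and its methods are invented for illustration.

```python
# Hedged sketch of intermediate-result reuse. A stored selection can
# answer a later query that asks for a subset of its columns under
# the same predicate, avoiding a trip back to the base table.

class ResultCache:
    def __init__(self):
        # (table, frozenset(columns), predicate) -> list of row dicts
        self.store = {}

    def put(self, table, cols, predicate, rows):
        self.store[(table, frozenset(cols), predicate)] = rows

    def lookup(self, table, cols, predicate):
        """Return a stored result that covers this selection, if any."""
        wanted = frozenset(cols)
        for (t, stored_cols, pred), rows in self.store.items():
            if t == table and pred == predicate and wanted <= stored_cols:
                # Project the stored rows down to the requested columns.
                return [{c: row[c] for c in cols} for row in rows]
        return None

cache = ResultCache()
cache.put("sales", ["region", "amount"], "year=2011",
          [{"region": "east", "amount": 100}])
# A later query asking only for `region` under the same predicate is
# answered from the stored intermediate result, not the base table.
print(cache.lookup("sales", ["region"], "year=2011"))  # [{'region': 'east'}]
```

An algebraic optimizer generalizes this idea considerably, since it can also recognize subset relationships between predicates, not just between column lists; the sketch shows only the simplest case.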
It then works out which of them has the least cost in terms of resource usage, and uses those sets to resolve the query. Figure 7 illustrates how the performance of A2DB improves when the same type of query is repeated. The first time a query runs, response is slow, but it improves with each repetition until the response time falls to a very low level. This happens with all types of query.

Figure 7. The A2DB Optimizer Performance Curves

Figure 8. Algebraix's Technology in a Distributed Operation

The use of intermediate result sets proves valuable in a distributed environment and a cloud environment. Figure 8 illustrates this. The distributed architecture is peer-to-peer, so there could be many such nodes, even thousands, all functioning in the same way. On the left of the diagram are the data sources that this particular node takes input from and is responsible for. In order to load the database node, it is only necessary to create load files of the source databases. The database doesn't immediately load the data; it just loads the metadata from those files. The way the technology works is that there is no data load per se. As queries arrive, it references the load files (or log files or other data files) and gradually accumulates intermediate result sets, which constitute its managed data store, as illustrated. It uses physically efficient mechanisms to store such data, the same techniques as the typical column store database: no indexes, data compression and data partitioning. There is complete separation between the logical representation of the data sets stored and the physical storage of those data sets. It works in the following way: the XSN Translator translates a query into an algebraic representation that corresponds with the algebraic sets defined at a logical level in the Universe Manager. (XSN stands for Extended Set Notation.)
The Universe Manager holds a logical model of all the database's sets and their relations. The Optimizer first works out which stored sets might participate in a solution. It may deduce that it has to go to source data (load files) for all or part of the data requested by the query.
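The candidate-and-cost loop - enumerate solutions from stored sets, then cost them against physical placement information (the Resource Manager's role, described next) - might be sketched like this. The names and cost weights are invented for illustration, not Algebraix's actual cost model.

```python
# Assumed relative read costs: memory-resident sets are far cheaper than disk.
COST_PER_BYTE = {"memory": 1, "disk": 20}

def cheapest_solution(solutions, placement):
    """solutions: list of candidate solutions, each a list of (set_name, bytes).
    placement: {set_name: 'memory' | 'disk'}, as the resource manager would know it.
    Returns the candidate with the lowest estimated physical cost."""
    def cost(solution):
        return sum(size * COST_PER_BYTE[placement[name]] for name, size in solution)
    return min(solutions, key=cost)

placement = {"s1": "memory", "s2": "disk", "s3": "memory"}
candidates = [[("s1", 100), ("s2", 100)],   # 100*1 + 100*20 = 2100
              [("s1", 100), ("s3", 500)]]   # 100*1 + 500*1  = 600
best = cheapest_solution(candidates, placement)
```

Even though the second candidate reads five times as much data, it wins because all of its sets are cached in memory.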
In any event, the search for alternatives will yield one or more possible solutions. The Optimizer now consults the Resource Manager and tests each of its algebraic solutions against physical information held by the Resource Manager. The Resource Manager knows whether data is on disk or cached in memory, and it knows how it is physically organized. Armed with precise cost information, the Optimizer works out the physical cost of each algebraic solution and chooses the fastest one. Once the Optimizer has decided on a solution, it passes it to the Set Processor, which executes it.

The Distributed Query

Now consider what happens if the query requests some data that is not on this database node. How does the node know what to do? By design, the Universe Manager doesn't just hold a map of local data; it also holds a global map that identifies all other database nodes and the data they are responsible for. When we described how the database handles a query, we omitted to discuss how it handles a query that spans more than one node. Such a query will naturally involve a join of some kind, with one or more parts of the join operation referencing remote data. The mode of operation of Algebraix's technology is essentially the same, but slightly more complex. The Optimizer always checks to see if any of the data requested is part of the remote universe rather than the local universe. If it discovers that some element in the query references remote data, it deconstructs the query into several parts, as follows:

- A subquery for this node
- A subquery for each remote node that is involved
- A master query that joins together all the results of all the subqueries

It calculates which node is the best node to execute the master query by estimating the resource cost of transporting result data from one location to another. If it decides to pass that responsibility to another node, then it behaves as follows:

- It passes all the other subqueries to the nodes where they need to execute.
- It also informs each node where to deliver the result of its subquery.
- It then executes its own subquery and passes the result to the master node when local processing completes. At that point it has finished with that query.

If it has determined that it is, itself, the best node to execute the master query, it behaves as follows:

- It passes all the other subqueries to the nodes where they need to execute.
- It gives itself as the return address for the results of those subqueries.
- It executes its own subquery.
- When it receives all the remote result sets, it executes the master query.
- Finally, it dispatches the end result to the program that sent the query.

Note that in carrying out such a distributed query the database gathers some remote result sets at the node that masters the distributed query. It will save these results as remote result sets in the same way that it saves local result sets, so that when more queries of that type come in it may be able to resolve those queries locally rather than in a distributed manner.

Failover

With Hadoop, the failure of any node can be catered for. The same is true of Algebraix's technology. It is fairly easy to configure complete node mirrors so that a standby node can take over immediately if an active node fails. It would be more economical, though, to use a SAN at each data center and only mirror data that is written to disk (the intermediate results). Then if a node fails, it will be possible to recover the node from the SAN. This injects a greater delay into the recovery process, as the recovered database would have to recreate the last known state of the failed node.

In practice, Algebraix's technology can run on commodity servers. While it may appear that it has a substantial requirement for data storage, because of its strategy of storing intermediate results, in practice this is not the case. This is because, after a suitable time has passed, the database deletes the intermediate results it didn't reuse. The database rarely requires the deployment of additional storage (such as NAS or a SAN). For atypical workloads, special configurations can be deployed for any given node.

Node Splitting

Node splitting becomes necessary when the query load for a node becomes too great. The need becomes apparent when the performance of the node begins to decline. However, node splitting is simple to achieve. A replica of the node is created, and the data sources that the new node will be responsible for are defined - deleting those it will not be responsible for from the Universe Manager. The technology can estimate what the best split is likely to be from an analysis of past query workloads. It can also recognize which intermediate results are derived from which source files or databases. So it reclassifies those intermediate results as remote rather than local.
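Because every intermediate result can be traced back to a source, the reclassification step of a node split is mechanical. A hypothetical sketch (invented structure; the paper does not describe the internal representation):

```python
def split_node(results, sources_moving):
    """results: {result_id: (source, 'local' | 'remote')}.
    sources_moving: the set of data sources handed to the new replica node.
    Returns the reclassified map: results derived from a moved source
    flip from 'local' to 'remote'; everything else is unchanged."""
    return {rid: (src, "remote" if src in sources_moving else cls)
            for rid, (src, cls) in results.items()}

results = {"r1": ("orders_db", "local"), "r2": ("hr_db", "local")}
after = split_node(results, {"hr_db"})
# r2 is now the new node's responsibility, so this node treats it as remote
```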
The original node is reconfigured in the same way, deleting the data sources that it is no longer responsible for. The nature of the changes is then relayed to all the nodes in the cloud database.

Growth

Most source data will consist of databases that are themselves being added to on a regular basis. That data growth is best dealt with by feeding database log file images to the database. For other applications, which simply use file systems, it is best to feed the equivalent of an update audit trail to the database. There is a specific reason for this. Algebraix's technology does not cater for updated data in the way most databases do. Typically, database updates destroy data by over-writing one value with another. This database technology is different. It treats updates as additional (i.e. new) data. In effect, they become non-destructive updates, with a record of the previous values remaining. For deletions, it simply marks the set of data or a data item as no longer current. To achieve these things, the database adds a time stamp to all data as it arrives and is used (if such a time stamp does not exist in the source data). All queries to the database either specify the time that applies, so that the result has an "as at" date/time, or omit the time, in which case the current
date and time is applied. So all updates are taken into account when the associated data is processed according to its time stamp. Because of this, all intermediate result tables also have an "as at" date/time associated with them. The database is configured at every node to accept new data on the basis of a timed switch. It is inadvisable to set the time switch to too short a period, as this rapidly increases the number of sets held by the Universe Manager - and this, in turn, could impact performance.

The Economy of A2DB

In any database, and especially in any distributed database, it is always possible to pose queries that will take a long time to answer. This technology does not make that problem suddenly disappear. For example, if you join two terabyte-sized tables together that are on different nodes, a terabyte of data must pass over the network. If it is a slow network line, the query could take a very long time. If such a query is frequently run, the database will solve this particular performance issue naturally by holding one of the terabyte tables as an intermediate result.

If you have a petabyte or even several petabytes of data that you wish to query regularly, then the database could be used for the task by deploying it on a sufficient number of nodes. In such circumstances it could look quite similar to Hadoop (with HBase). However, that is not the prime requirement of a cloud database. A cloud database needs to be able to handle heterogeneous workloads, some of which access complex data structures, and it needs to do so with economy and with speed. That is what Algebraix's technology does. In the distributed environment it is helped by the fact that users and programs that request data normally do not pose queries that have terabyte-sized answers. They pose queries that have quite short answers - a few megabytes or less. An exception is when users are downloading a large data extract for more detailed analysis, but such downloads are relatively rare.
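The same arithmetic drives both the choice of master node in a distributed query and the economy argument here: run the join where the fewest bytes must cross the network. A toy sketch with invented sizes (not Algebraix's actual cost model):

```python
def choose_master(result_bytes):
    """result_bytes: {node: estimated size of that node's subquery result}.
    Executing the master query at node n means shipping every other node's
    result to n, which costs total - result_bytes[n]; minimize that."""
    total = sum(result_bytes.values())
    return min(result_bytes, key=lambda n: total - result_bytes[n])

sizes = {"node_i": 1_000_000_000_000,  # holds a terabyte-sized table
         "node_j": 5_000_000,
         "node_k": 2_000_000}
master = choose_master(sizes)  # mastering at node_i moves ~7 MB instead of ~1 TB
```

The node holding the terabyte-sized intermediate result is the cheapest master, which is why frequently run large joins gravitate toward the data rather than the other way around.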
This distributed approach has the virtue that it naturally localizes data to suit the query traffic. In each node it keeps the data that is frequently queried in memory. In a distributed environment with multiple nodes it will, through its natural performance mechanisms, gradually localize the data to suit the local and global query traffic. If query volumes rise too high at a given node, then the node can split like an amoeba to cater for the rising workload. If the query traffic changes - with, say, one kind of query not being posed so frequently and a new set of previously unknown queries becoming common - the database will simply adjust, by changing the intermediate results it holds. After three or four queries of each new query type its natural performance will be restored.

The nature of this technology, coupled with the fact that it can be configured for high availability, qualifies it as suitable for deployment as a cloud database.
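The non-destructive, time-stamped update model described under Growth can be sketched as an append-only store: updates append a new version, deletions append a tombstone, and an "as at" query filters by time stamp. This is a hypothetical structure for illustration, not Algebraix's implementation.

```python
import time

class VersionedStore:
    def __init__(self):
        self.rows = []  # (timestamp, key, value_or_None) - None marks a delete

    def put(self, key, value, ts=None):
        # Non-destructive: a new version is appended, old versions remain.
        self.rows.append((ts if ts is not None else time.time(), key, value))

    def delete(self, key, ts=None):
        # Deletion is a tombstone, not an overwrite.
        self.put(key, None, ts)

    def as_at(self, when=None):
        # Replay all rows up to 'when' (default: now) in time-stamp order,
        # keeping the latest surviving version of each key.
        when = when if when is not None else time.time()
        current = {}
        for ts, key, value in sorted(self.rows, key=lambda r: r[0]):
            if ts <= when:
                current[key] = value
        return {k: v for k, v in current.items() if v is not None}

store = VersionedStore()
store.put("price", 10, ts=1)
store.put("price", 12, ts=2)   # the value at ts=1 survives
store.delete("price", ts=3)    # marked no longer current, not destroyed
```

With this shape, `as_at(1.5)` sees the original value, `as_at(2.5)` sees the update, and any time after the tombstone sees nothing - exactly the "as at" semantics the paper describes for queries and intermediate result tables.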
About The Bloor Group

The Bloor Group is a consulting, research and analyst firm that focuses on quality research and analysis of emerging information technologies across the whole spectrum of the IT industry. The firm's research focuses on understanding both the technical features and the business value of information technologies and how they are successfully implemented within modern computing environments. Additional information on The Bloor Group can be found at www.TheVirtualCircle.com and www.BloorGroup.com.
Highly Available Service Environments Introduction
Highly Available Service Environments Introduction This paper gives a very brief overview of the common issues that occur at the network, hardware, and application layers, as well as possible solutions,
Trafodion Operational SQL-on-Hadoop
Trafodion Operational SQL-on-Hadoop SophiaConf 2015 Pierre Baudelle, HP EMEA TSC July 6 th, 2015 Hadoop workload profiles Operational Interactive Non-interactive Batch Real-time analytics Operational SQL
Why DBMSs Matter More than Ever in the Big Data Era
E-PAPER FEBRUARY 2014 Why DBMSs Matter More than Ever in the Big Data Era Having the right database infrastructure can make or break big data analytics projects. TW_1401138 Big data has become big news
Beyond Data Migration Best Practices
Beyond Data Migration Best Practices Table of Contents Executive Summary...2 Planning -Before Migration...2 Migration Sizing...4 Data Volumes...5 Item Counts...5 Effective Performance...8 Calculating Migration
