Managing big data without breaking the bank


By George Gilbert
April 29, 2013

This report was underwritten by RainStor.

Table of contents

Executive summary
A short look at the present and future of big data analytics
The two main application scenarios that form the sweet spot for online analytic archive databases
    The challenge of keeping data accessible when volumes are growing 50 percent to 100 percent per annum
    The column store approach
    The online analytic archive approach
    The two most popular usage scenarios for online analytic archives
Business impact of big data deployment: Major telco case study
Making disk work as the new tape for an online archive: Major financial services institution case study
RainStor's done for archives what Splunk's done for logs
Technology considerations and choices: What RainStor is good for and why
    Traditional SQL DBMS performance constraints not relevant to online analytic archive
    Data compression
    Greater analytic query performance
    Deployment flexibility and administrative efficiency
    Greater ingest speed of new data
Conclusion
About George Gilbert
About GigaOM Research

Executive summary

In the tsunami of experimentation, investment, and deployment of systems that analyze big data, vendors have seemingly been trying approaches at two extremes: either embracing the Hadoop ecosystem or building increasingly sophisticated query capabilities into database management system (DBMS) engines.

At one end of the spectrum, the scale-out Hadoop distributed file system (HDFS) has become a way to collect volumes and types of data on commodity servers and storage that would otherwise overwhelm traditional enterprise data warehouses (EDWs). The Hadoop ecosystem has a variety of ways to query data in HDFS, with SQL-based approaches growing in variety and maturity.

At the other end of the spectrum are both traditional and NewSQL DBMS vendors, with IBM, Microsoft, and Oracle among the former and Greenplum, Vertica, Teradata Aster, and many others emerging among the latter. What all these seem to have in common is unprecedented innovation and growth in analytic query sophistication. Accessing tables stored on disks organized in rows via SQL is no longer enough. Vendors have been adding the equivalent of new DBMS engine plug-ins, including in-memory caches for performance, column storage for data compression and faster queries, advanced statistical analysis, and even machine learning technology. While the NewSQL vendors have introduced much lower price points than the traditional vendors, as well as greater flexibility in using commodity storage, they haven't made quite as much progress on shrinking the growth in storage hardware required relative to the growth in data volumes.

For some use cases, there appears to be room for a third approach that lies between the extremes and borrows from the best of each. RainStor in particular, and the databases focusing on column storage more generally, have carved out a very sizable set of data storage and analytics scenarios that have been mostly ignored. Much of the explosion in data volumes that needs to be analyzed doesn't need to be updated. In other words, it can be stored as an archive in a deeply compressed format while still remaining online for query and analysis. Databases with column store technology, such as Vertica and Greenplum, have taken important steps in this direction, and the incumbent vendors are also making progress in offering this as an option. In early April, IBM announced a statement of direction in which it intends to add in-memory and column store technology to DB2. Organizing data storage in columns makes it easier to compress, typically by a factor of four to six, with IBM claiming a factor of 10 in its statement of direction. Column stores can accelerate queries by scanning just the relevant, and now smaller, columns in parallel on multiple CPU cores.

But the storage and database engine overhead of mediating potentially simultaneous updates to and reads from the data still remains. In other words, the column stores are a better data warehouse. They are not optimized to serve as archives, however. An online archive can compress its data by a factor of 30 to 40 because it will never have to be decompressed for updates; new data only gets appended. Without the need to support updates, it's much easier to ingest new data at very high speed, and without the need to mediate updates, it's much easier to distribute the data on clusters of low-cost storage.

This paper is written for two audiences. One is the business buyer who is evaluating databases and trying to reconcile data volumes growing at 50 percent to 100 percent per annum with an IT budget growing in the single digits. Of particular value to this audience are the generic use cases and the customer case studies, since they show how others are tackling the problem. Some of the world's largest banks and telcos are already putting these solutions into production. Also relevant is the price comparison with Oracle Exadata, which shows not just the capital cost of a traditional data warehouse solution but also the hidden running costs.

The other audience is the IT infrastructure technologist who is tasked with evaluating the proliferation of database technologies. For this audience, the more technical sections of the paper will be valuable. These sections focus on the different technology approaches to creating online analytic databases. The paper will compare mainstream data warehouse technologies, and column stores in particular, with a database that focuses more narrowly on serving as an online analytic archive. In order to use a concrete example of an existing analytic archive, the paper will explain how RainStor's database works.

When purpose-built, an online analytic archive can achieve as much as an order of magnitude in data storage savings relative to traditional solutions, run on commodity clusters of storage, support standard SQL, ODBC, and business intelligence (BI) tool access, and make more assumptions about administrative requirements so that operational costs are much lower. But when evaluating the two approaches, one has to remember that archives can't refresh existing data with new information; they can only add more recent or older data. They are read-only solutions.

A short look at the present and future of big data analytics

The rapidly growing interest in all things labeled big data is grounded in two major catalysts.

1) Relative to compute and networking, the price and performance of getting data to and from storage has improved exponentially. This is true whether it's through storage area network (SAN)/network-attached storage (NAS) solid-state drive (SSD) tiers or even the commodity clusters of direct-attached storage (DAS) hard disk drives (HDD) that new databases can exploit. That improvement has meant whole new classes of systems could be built cost-effectively and centered on large-scale data capture and analysis.

2) A new branch of data-intensive science has emerged: "informatics" is being appended to just about every science as a new discipline.

Within commercial environments, these changes have shown up in data growth rates of 50 percent to 100 percent per annum, and just about every area in the value chain is becoming more data-intensive and data-driven. The ironic twist unearthed in some of GigaOM's research is that the problem of managing data growth has become so urgent that customers and vendors have been tackling it from two extremes.

At the very high end, the approaches feature advanced scale-out architectures and parallel processing of advanced ways to manipulate data within the database itself, as opposed to using tools once the data is retrieved by the end user. Doing the work in the database makes it possible to process data with much higher performance because it doesn't need to be moved around the network. Some of these new features include user-defined functions, predictive analytics, machine learning, and other algorithms. Accompanying this high sophistication are high software license fees, storage costs that grow at least several times the rate of the IT budget, and hidden operational expenses to make all these pieces fit together and stay well-oiled.

At the other end of the spectrum is the proliferation of technologies in the Hadoop ecosystem. Here customers and vendors are applying the technologies of web-scale consumer online services to more traditional IT challenges. While much lower cost than the high-end approach, the Hadoop ecosystem has required a completely different skill set, particularly when it comes to using standard SQL for interactive data manipulation, though that is beginning to change.

The industry is moving in the direction of combining the best of the high-end databases with the Hadoop ecosystem's support for commodity storage software and hardware infrastructure, but there are still user scenarios that don't need some of the more high-end database features. Instead, these scenarios need to use disk as the new tape. When data volumes move into the petabytes (PB), customers need to store data extremely cost-effectively while leveraging disk's ability to remain online, forever if need be, and support traditional SQL- or BI-driven analytics cost-effectively. This paper will look at several approaches to this problem and contrast them with the way column stores have started to become a popular approach to the problem.

The two main application scenarios that form the sweet spot for online analytic archive databases

The challenge of keeping data accessible when volumes are growing 50 percent to 100 percent per annum

Increasingly, data can't be discarded just because it bumps up against storage capacity, whether for compliance reasons or because of the growing need to have analytic access to all data. Archiving it to tape makes the data just marginally more accessible than the disaster recovery scenario of putting tapes in salt mines, especially if the data's structure has evolved. In the past, the solution was to keep the most recent data, for example the last 12 months, online in a traditional data warehouse; as the data got older, it would be migrated off to progressively less expensive but harder-to-access solutions, with tape archives being the lowest cost but highest volume.

The column store approach

The popular column store databases are relevant solutions. They improve on traditional data warehouses built with OLTP database engines by compressing individual columns instead of tables of rows. It's easier to compress data in columns because all the items stored together are much more likely to be similar or repetitive. As a result, current solutions appear to compress data by a factor of six; IBM's announced DB2 claims it can reach a factor of 10.

The online analytic archive approach

An online analytic archive is an alternate approach: a database designed to bridge the need for online access to current and historical data sets via standard SQL by leveraging data compression of 30 to 40 times relative to raw data. It can achieve higher compression ratios than column stores because it can look for common values not just within individual columns but also across tables within an individual partition of a million rows of data, as the sketch below illustrates. Data can be loaded in a variety of formats, including logs or tabular data from another relational database management system (RDBMS). As an example, RainStor has an online analytic archive database designed for this approach.
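To make the cross-column deduplication idea concrete, here is a deliberately simplified sketch (not RainStor's actual storage format, which is not public in this level of detail): every distinct value in a partition goes into one shared pool, and rows become small integer references into that pool, so a value repeated in any column of any table in the partition is stored only once.

```python
# Simplified illustration of cross-column value deduplication within one partition.
def encode_partition(rows):
    pool, index = [], {}          # shared value pool for the whole partition
    encoded = []
    for row in rows:
        refs = []
        for value in row:
            if value not in index:        # store each distinct value exactly once,
                index[value] = len(pool)  # regardless of which column it came from
                pool.append(value)
            refs.append(index[value])
        encoded.append(refs)
    return pool, encoded

rows = [("NYC", "NYC"), ("NYC", "LON"), ("LON", "NYC")]   # repeats across both columns
pool, encoded = encode_partition(rows)
print(pool)      # ['NYC', 'LON'] -> 2 stored values instead of 6
print(encoded)   # [[0, 0], [0, 1], [1, 0]]
```

A per-column encoder could not share the pool between columns; once the partition is known to be immutable, nothing ever has to be unpacked and re-encoded, which is what makes this depth of compression practical.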

The two most popular usage scenarios for online analytic archives

So far, two mainstream use cases appear uniquely well suited to this approach. A graphical illustration of the scenarios appears in Figure 1 below. The Y-axis captures data recency, with the most recent data at the top. As data ages, it accumulates and drives greater volume on the X-axis.

The first scenario uses an online analytic archive as the primary analytic repository. In this case, the data being loaded is immediately historical and will never change. This could be raw machine or log data, whether call detail records from a switch, network packet flows, or high-volume sensor data. Because the data is continuous and high-velocity, the load volumes are huge. In some situations, the archive may load data from existing applications. Here, entire applications might be decommissioned, but their data is migrated to the online archive for analytic access.

The second scenario uses an online archive to replace tape with an immutable medium that allows for querying and has essentially infinite capacity.

Figure 1. The two most popular usage scenarios for online analytic archives (Source: GigaOM Research)

Economic drivers of big data management and analytics: Managing multiple tiers of price and performance points

(Chart source: GigaOM Research)

Every prospective customer is going to have a different mix of software, hardware, and operational expense requirements. Even so, a snapshot cost comparison is useful if only to show the magnitude of the difference between an online archive such as RainStor and traditional approaches. Traditional enterprise data warehouses (EDWs) or high-end engineered systems are focused on the most demanding and sophisticated analytics.

Oracle and others have been adding capabilities that go well beyond the traditional SQL queries needed for an online analytic archive. Much attention from vendors and industry analysts has been focused on embedding these advanced data manipulation features in the nodes of the DBMS cluster itself for maximum performance. These features include statistical analysis via parallel implementations of the language R, similar implementations of specialized machine learning algorithms, and user-defined functions, among others. Some usage scenarios for an online archive could make use of these features: when developing sophisticated predictive models, both query sophistication and the largest possible dataset are critically valuable. However, many online analytic archive scenarios don't need all these features and the associated high cost of software, servers, storage, and administrative overhead.

The tables below detail the price differences and illustrate how traditional EDWs are targeted at different price and performance points. At every line item, from server cost to storage cost and especially to software and administrative costs, approaches such as RainStor's, used here as an example, are dramatically more cost-effective as an online analytic archive.

Oracle Exadata (totals include a 20% discount on hardware and software)

Item | Product | Unit price | Quantity | Total
Database software license | Oracle EE + RAC (cluster) | $70,500 per core | 4 servers x 12 cores = 48 cores; x 0.5 Intel Core Adjustment Factor = 24 licensed cores | $1,353,600
Storage software license | Exadata storage server software | $10,000 per disk | 84 x 600GB disks | $672,000
Server hardware | 4 servers x 12 Xeon cores = 48 cores (X5675 3GHz, 384GB); 84 x 600GB 10K RPM SAS disks = 50TB raw + 2.6TB flash | $625,000 per half-rack appliance | Half-rack Exadata appliance with high-performance storage (minimal effective compression based on HA replication, denormalization, indexing) | $500,000
System administration cost | Sys admin labor | $250 per hour | 6.4 man hours / TB x 50 TB = 320 man hours | $80,000
Database administration cost | DBA labor | $250 per hour | 41 man hours / TB x 50 TB = 2,080 man hours | $520,000
Total | | | | $3,125,600

*Source: Oracle-published price comparison between Exadata and an IBM Power system (dated 10/12; retrieved 4/2/13)

RainStor (totals include a 20% discount)

Item | Product | Unit price | Quantity | Total
Database software license | RainStor | Tiered: $3,000-$4,000 per raw terabyte** | 100 TB raw | $300,000
Server hardware | HP 12-core Xeon (X5675 3GHz, 384GB) | $12,000 | 2 servers | $24,000
Storage | NetApp NAS | $3,000 per TB | 100 TB raw (compressed volume is 2.5 to 5 TB) | $15,000
System administration cost | Sys admin labor | $250 per hour | 13 man hours / TB x 4.5 TB = 58.5 man hours | $14,625
Total | | | | $353,625

*Assumption: RainStor is designed to minimize the need for a database administrator (DBA). To compensate, the comparison doubles the amount of system administration labor required per TB.
**RainStor doubles the storage capacity at less than one-tenth the cost of Oracle Exadata.
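As a sanity check, the short script below recomputes the headline totals from the line items quoted in the two tables. The only added assumptions are that the NAS line is priced on roughly 5 TB of compressed data and that labor is billed at the quoted hourly rates; every other figure is taken directly from the tables.

```python
# Illustrative arithmetic only: recompute the comparison totals from the line items above.
exadata = {
    "database_software": 70_500 * 24 * 0.8,   # $70,500/core x 24 licensed cores, 20% off
    "storage_software": 10_000 * 84 * 0.8,    # $10,000/disk x 84 disks, 20% off
    "server_hardware": 625_000 * 0.8,         # half-rack appliance list price, 20% off
    "sys_admin_labor": 250 * 320,             # $250/hr x 320 man hours
    "dba_labor": 250 * 2_080,                 # $250/hr x 2,080 man hours
}
rainstor = {
    "database_software": 300_000,             # tiered license on 100 TB raw
    "server_hardware": 12_000 * 2,            # two commodity Xeon servers
    "storage": 3_000 * 5,                     # NAS priced on ~5 TB compressed (assumption)
    "sys_admin_labor": 250 * 58.5,            # $250/hr x ~58.5 man hours
}
print(f"Oracle Exadata: ${sum(exadata.values()):,.0f}")   # -> $3,125,600
print(f"RainStor:       ${sum(rainstor.values()):,.0f}")  # -> $353,625
```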

A company using the RainStor approach of maximum compression and traditional SQL queries can run on commodity hardware, with license and administrative costs that are materially lower. Because of the advanced compression, this approach needs significantly less storage and server hardware to process that smaller amount of data. It also requires less administrative overhead to go along with the smaller deployment, but for the purposes of comparison, we'll look just at the storage hardware cost savings.

The compression factor is well known, but using storage cost per gigabyte (GB) per month makes it possible to convert that compression into a dollar figure once the raw storage amount is figured into the equation. The chart below shows how SAN/NAS storage that would cost $2 to $4 per GB per month converts into its effective price on commodity storage in an online analytic archive. Applying 30 to 40 times compression, that storage cost becomes the equivalent of $0.05 to $0.13 per GB per month. What the chart does not show is that customers can also run their databases on lower-performance, lower-cost storage. Since an online analytic archive supports cloud or scale-out storage, it can run in environments that cost around $0.10 per GB per month. Applying 30-times compression to this storage equates to an effective cost of roughly $0.003 per GB per month, relative to the original $2 to $4 per GB per month.

Figure: For any given storage performance tier, RainStor compression saves a factor of 30x to 40x (Source: GigaOM Research)
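The conversion behind the chart is simple division; a minimal worked example using the per-GB prices quoted above:

```python
# Effective storage cost per GB per month once archive compression is applied.
def effective_cost(cost_per_gb_month, compression_factor):
    return cost_per_gb_month / compression_factor

print(effective_cost(2.0, 40))   # SAN/NAS at $2/GB with 40x compression -> $0.05
print(effective_cost(4.0, 30))   # SAN/NAS at $4/GB with 30x compression -> ~$0.13
print(effective_cost(0.10, 30))  # commodity scale-out tier at $0.10/GB  -> ~$0.003
```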

Business impact of big data deployment: Major telco case study

Looking at how a major telco solved its data storage and analysis problem illustrates the advantages of an online analytic archive approach. At the highest level, this telco used to take data from multiple customer touch points and network operations and carve it up into multiple timeframes, starting with the most recent data in the highest-price, highest-performance analytic tier, the EDW. Older data was archived off into progressively more inaccessible but less expensive repositories because of cost constraints.

By storing data 30 to 40 times more cost-effectively for SQL-92 and open database connectivity (ODBC)/Java database connectivity (JDBC) access, a company can now use an online analytic archive to keep an order of magnitude more data volume in a single repository for analysis. One 10-terabyte (TB) Oracle database with indexes and denormalized data contained only 2.5 TB of raw data. With the online analytic archive's compression, it occupied 116 GB. Although, in the case of RainStor, support is available for MapReduce, Pig, Hive, and search access on HDFS, this customer doesn't yet use that functionality.

In the past, this telco customer typically kept six months of data in an EDW, followed by 18 months in an online archive and seven years (or more, considering tape doesn't get discarded) in an offline tape archive. Cost constraints made it impossible to unify the repositories to analyze and query all the historical data. Now two or sometimes all three tiers can be unified because the telco is using a technology that enables an order-of-magnitude price/performance advantage over traditional EDW solutions.

Unifying multiple tiers of analytic databases makes it easier for business users to analyze data over much longer time periods, for example, 10 years instead of only the most recent year. It also reduces operational cost by an order of magnitude or more, because the reduced data footprint from compression means administrative overhead, which is measured per TB, comes down commensurately. Traditional EDWs require database administrators to analyze query patterns, set up and manage data partitions, and determine how to index the data. If requirements change, such as new views on the data, administrators have to iterate the whole process. The pipeline that moves data from transaction processing systems to analytic systems is traditionally very brittle.
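The footprint figures quoted in this case study imply the following ratios; this is illustrative arithmetic on the numbers above, not additional data from the customer.

```python
# Footprint ratios implied by the telco example above.
oracle_db_gb = 10 * 1000   # 10 TB Oracle database, including indexes and denormalization
raw_data_gb = 2.5 * 1000   # the raw data inside it
archive_gb = 116           # the same data in the online analytic archive
print(round(raw_data_gb / archive_gb))   # ~22x compression of the raw data
print(round(oracle_db_gb / archive_gb))  # ~86x reduction vs. the full Oracle footprint
```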

Making disk work as the new tape for an online archive: Major financial services institution case study

One of Wall Street's major financial services companies demonstrates a second use case: the ability to manage cold or historical data that can't be kept in a traditional data warehouse because of cost. Last year, this bank was accumulating one billion records per day, with a growth rate of 70 percent to 100 percent per annum. The total volume of data is on track to hit 75 PB by the end of the year. The bank retains 60 percent of this data because of the many regulatory and compliance requirements governing it. Currently, the total cost of tier 1 SAN storage is $3 to $4 per GB per month, which equates to $3 to $4 million per PB per month.

This compliance-related data has limited business value, and individual departments were beginning to build their own trade archives. However, these departmental archives would only accelerate the accumulation of information as well as the operational costs of managing it. Overall spending on these regulatory requirements already consumes 40 percent of the entire IT budget.

For Nasdaq trading data, these same compliance rules mandate retention of all historical data indefinitely, even though the data is only trade logs of who bought what from whom, for how much, and when. The two traditional choices for managing it were unacceptable. Keeping all 13 years of it in a traditional data warehouse managed by Sybase IQ on tier 1 EMC VMAX storage at $3 to $4 per GB per month would have been orders of magnitude too expensive. And the old method of keeping just the most recent month of data in the data warehouse, with the balance archived to tape, made that data essentially inaccessible. Keeping all the historical data on tape made it too difficult to meet service level agreements (SLAs) with auditors and regulatory mandates when reports, lookups, or audits required timely access.

Using RainStor as its online analytic archive solution, the bank expects that it will eventually shrink 30 PB of data down to 1 PB. In addition, it expects to further reduce costs by moving the information off tier 1 SAN storage to tier 3 NAS.

RainStor's done for archives what Splunk's done for logs

Splunk has gained a wide following by carving off simplified and cost-effective management of the log files that systems and applications generate. The bank implementing RainStor sees it accomplishing the same result for application archive data and data warehouses, based on its ability to compress trade history data by 40 times.

The trade execution engine is an in-memory database; after each day, the trade data goes through a feed into the online analytic archive database. All 13 years of historical trade data have been moved to the online repository, which has become the single source for all historical online queries and reports. The results of the most demanding queries in the main data warehouse also get archived with the raw data once executed. The archive is accessible via the bank's existing BI tools.

The trading data archive is compressed at a 40-to-1 ratio relative to the raw data. Since Sybase IQ achieved a 1.5-to-1 ratio, this compression delivered close to a 30-time reduction in storage, not just in capital costs but also in overall operational expenditures. The compression actually accelerates query performance as well: since storage bandwidth is generally the constraining factor in query performance, moving less data from storage to server accelerates queries. The smaller footprint also means the bank can use tier 3 NAS storage with slower SATA drives. Tier 3 storage costs $1 to $2 per GB per month, about half of what its tier 1 storage costs. So the total savings was roughly 50 times when combining price, performance, and capacity.

An online analytic archive (such as RainStor's) can take advantage of scale-out storage running on commodity servers, but the bank kept the NAS setup in order to avoid any changes to its storage administration and the data pipeline from the main trading system.

From the bank's perspective, columnar databases can only compress data within the scope of one column, and the data still must be formatted to allow refreshes of updated data. A read-only archive can compress data far more deeply because it never has to be decompressed and changed. As a result, it can physically organize the data to maximize compression across columns and tables within each partition or container, assuming the data will never be changed, only added to. Rather than directly expose what is so densely packed, the database presents a relational interface to any administrator or BI tool that looks at it. Although columnar databases such as Sybase IQ and hybrids such as Oracle 12c on Exadata can achieve five to six times compression relative to raw data, there are compromises in some situations. For this particular customer, when the data was formatted to accelerate some complex queries through denormalization and indexing, or replicated for high availability, compression ratios came down to 1.5 times in practice.
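A back-of-the-envelope view of where the storage reduction comes from, using only the ratios and prices quoted above; the reconciliation to the roughly 50x overall figure is an interpretation, not a number supplied by the bank.

```python
# Net storage reduction: replacing a 1.5x-compressed warehouse with a 40x-compressed archive.
warehouse_compression = 1.5   # Sybase IQ ratio achieved in practice
archive_compression = 40.0    # archive ratio for the trade history data
print(archive_compression / warehouse_compression)  # ~26.7, i.e. "close to a 30-time reduction"
# Moving from tier 1 storage ($3-$4/GB/month) to tier 3 ($1-$2/GB/month) roughly halves the
# per-GB price again, which is how the combined saving approaches the ~50x cited above.
```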

The bank's implementation of the online analytic archive generated additional, unanticipated performance and operational cost benefits. The bank needed no changes to its database administration scripts to get data from the online trading system to the new archive. Security was simpler, since the data is unchangeable by the very design of the database. And the daily load time went down from eight to 12 hours to two hours. In the future, this bank expects to simplify the process further so that it uses a simple extract and load.

More surprisingly, the queries written for the previous warehouse (Sybase IQ) can run unmodified on the archive. That's a strong statement of this technology's support for standard SQL and BI queries. Despite all the focus on its being a space-efficient repository, it's also a database.

The bank has significant plans for this solution. It plans to use it as the online archive for all its tier 1 online databases. Departments will no longer have their own archive landfills. Part of the appeal is that many individual archive databases can run under a single instance of the archive DBMS engine, with common compliance enforced, running on the cheapest storage that still performs. That will create a shared service with far fewer administrative costs, and not just for the original trading system. The bank expects that 30 percent of all new information will ultimately go into this online archive.

Technology considerations and choices: What RainStor is good for and why

So far, several SQL DBMSs have served as generic examples for EDW and transaction-processing usage scenarios. In order to take a concrete look at online analytic archives, this section examines RainStor in detail.

Traditional SQL DBMS performance constraints not relevant to online analytic archive

Understanding how an online analytic archive works starts with understanding its intended use. Because it is meant to be both an analytic database and an online archive, it must be able to store petabytes of data with maximum efficiency, load and append new data with commensurate speed, and read large data sets via SQL with high performance. The crucial difference between this usage scenario and a traditional DBMS is that once data is in the archive, it never gets updated; it can only be appended. Therefore, the core value-add of traditional DBMS technology, mediating reads and writes to common data records via transaction processing, is not relevant.

Traditional databases are optimized to use memory to mediate random reads and writes to data on physical disks that store and fetch data best when reading or writing sequentially. This type of mediation is known as transaction processing. Traditional SQL DBMSs have adhered to rigorous transaction processing principles known as ACID (atomicity, consistency, isolation, and durability). These principles account for much of the code and processing overhead of traditional databases. Without this requirement, a whole range of analytic archive features becomes possible. It's hard to overstate how much that means for the features on which analytic archives in general, and RainStor in particular, focus.

Data compression

Once the design assumption of an analytic archive is made, all the traditional assumptions about how to organize and store data can be reconsidered. One of the first considerations is the size of the block of data on which all operations are made. Transactional databases typically feature small block sizes of 64KB in order to minimize the likelihood that multiple transactions will be trying to access data that another is modifying at the same time.

RainStor, by contrast, works with a block size of 50MB, because it assumes it will sequentially scan the data and never has to worry about an update getting in the way of multiple read requests. Trying to update data stored this way would be extremely slow.

Since the data sets will never need to be updated, it becomes possible to store everything without repeating anything. RainStor looks across all the values and patterns within each container or partition so that everything is stored only once. If the data were going to be updatable, it would have to go through a process that consumes a lot of computational and disk I/O resources to unpack this dense collection, make the changes, and then recalculate how to compress everything again. Even columnar databases like Sybase IQ or Vertica have to make accommodations for updates to previously stored data, limiting how much they can compress it. In addition, a traditional DBMS needs indexes on its tables to speed the process of finding the right records on which to operate. These indexes also take up space, sometimes as much as or more than the underlying data, and they consume CPU resources to be recalculated when the underlying data changes.

Once the large 50MB block sizes are combined with a 40-time compression ratio, RainStor is effectively dealing with 2GB (2,000MB) blocks of raw data. That greatly accelerates query performance.

Greater analytic query performance

Most databases bottleneck on moving data to and from disk because they're reading and writing small blocks of data, typically the 64KB mentioned above. Because RainStor is working with 50MB blocks that actually contain 2GB of raw data, it can process large queries with very high performance. It only has to unpack the data when presenting the final result set; all intermediate operations work on the compressed data.

In addition, query optimizers in a DBMS that features updatable data can't always pinpoint the data they're seeking as quickly. Since updates can change the data, the query optimizer must assume the indexes that point to the data are changing as well. That means the optimizer can only make estimates ahead of time about where data is located in the database. RainStor's equivalent of a query optimizer doesn't have to make those same fuzzy estimates, because the data never changes. RainStor puts one million records in a container and then distributes the containers across the file system, whether HDFS, other scale-out file systems, or high-end SAN and NAS storage. Every data container holds one million immutable records no matter what the total size of the database, and each container has exact statistics on every record it holds. A 10-billion-record database would require 10,000 containers. If the database were distributed on a cluster of servers with local storage, each node would contain the metadata for the entire database. The metadata approximates 2 percent of the size of the database itself, so all of it can typically stay in memory to accelerate performance. RainStor uses the metadata to apply a filter that tells it which containers on which storage nodes to read, as sketched below. Indexes on traditional DBMSs fulfill a similar role but typically occupy 50 percent to 200 percent of the size of the raw data, so that metadata generally can't stay in memory.
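The following is a minimal sketch, not RainStor's implementation, of how exact per-container statistics let a planner skip most of an immutable, partitioned archive before touching storage. The names (ContainerStats, containers_to_scan, the timestamp column) are illustrative.

```python
# Sketch: pruning immutable containers with exact per-container statistics.
from dataclasses import dataclass

@dataclass
class ContainerStats:
    path: str      # where the ~1M-record container lives (HDFS, NAS, local disk, ...)
    min_ts: int    # exact min/max of a column; exact because the data never changes
    max_ts: int

def containers_to_scan(catalog, lo, hi):
    """Return only the containers whose timestamp range overlaps [lo, hi]."""
    return [c.path for c in catalog if c.max_ts >= lo and c.min_ts <= hi]

# The catalog is small (about 2 percent of the data in RainStor's case), so every node
# can keep a full copy in memory and prune before reading anything from storage.
catalog = [ContainerStats(f"/archive/part-{i:05d}", i * 1_000, (i + 1) * 1_000 - 1)
           for i in range(10_000)]                       # e.g. a 10-billion-row archive
print(len(containers_to_scan(catalog, 42_000, 43_999)))  # -> 2 of 10,000 containers read
```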

As previously mentioned, RainStor supports SQL-92 queries with ODBC/JDBC APIs that enable mainstream BI tools to access it. While RainStor supports managing large data sets with high-performance queries, it is not going down the road of complex in-database functions that many other vendors are pursuing. Those products are for different use cases, in which customers are typically trying to build predictive models offline on large datasets that can then be applied in real time via online applications. RainStor and other databases that rely primarily on SQL for data manipulation are more focused on trends and anomalies in historical data.

Deployment flexibility and administrative efficiency

One of the most prominent trends in SQL databases over the last few years is the effectiveness of recent entrants in exploiting the scalability of massively parallel processing (MPP) on commodity hardware. Greenplum and Vertica are only among the most visible of these products. Products such as Hadapt and Cloudera's Impala are moving toward combining the Hadoop ecosystem with SQL query engines. SQL DBMSs and Hadoop-based systems are converging in a direction that combines the best of both approaches.

But Oracle still represents the mainstream, and it hasn't yet mastered MPP. It can run multiple database nodes using its Real Application Clusters (RAC), but these share storage in the form of a SAN or NAS. The need for shared storage goes back to the discussion of block sizes and transactions. Since the DBMS has to mediate multiple potential transactions trying to read or change the same data records, it needs to work with small blocks of data. The whole process bottlenecks on storage, and managing it is much easier when the storage is in one place rather than distributed across all the database server nodes. Oracle can squeeze more performance out of its database when deployed on its Exadata hardware, but it is still using highly specialized and highly expensive storage hardware, this time linking storage nodes with an InfiniBand network so they look like a single storage node.

The economic drivers section above covers the cost implications of this type of deployment.

RainStor can run with one or more database nodes on a SAN or NAS when that is how an organization administers storage, but it also has the flexibility to run on MPP database servers, each with local storage. Most databases that run on this type of scale-out hardware use a configuration called shared nothing, since each node operates independently; one node generally has to act as master in order to organize the output of the others. RainStor actually stores the metadata for the entire database on each node, and because it communicates with 50MB blocks that are effectively 2GB of raw data, its performance on a commodity cluster of servers looks as if it were a single expensive shared-everything machine. Shared-everything architectures are technically easier to scale, but they are the most expensive approach. With RainStor, each node knows exactly where all the other data is stored, so each can call on any other node to process all or part of a query or to load data. Since RainStor operates on large blocks of data, it works well with scale-out file systems, such as S3 and EBS on AWS and HDFS, among others. The design center also targets an administrator closer in skills to a system administrator than to a more specialized database administrator.

Greater ingest speed of new data

Just as RainStor deals in large block sizes for each operation, it also loads data into the file system in sets of a million records at a time, since the database never has to worry about reading or writing out-of-date data. Transactions are more meaningful for RainStor when it is ingesting data at high volume across many servers in parallel: if one node fails during this process, the other nodes won't improperly save related data. Transactions are less meaningful or appropriate for RainStor in cases like the classic update example, in which cash dispensed at an ATM must simultaneously be deducted from the customer's checking account; if either operation fails, both must abort.
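The sketch below shows the general shape of append-only, batch-sealed ingest described above; it is an assumed design for illustration, not RainStor's code. Only complete, sealed batches ever become visible, so a node failure during a load simply discards the partial buffer rather than leaving inconsistent data behind.

```python
# Sketch of append-only ingest: accumulate records, seal them into immutable containers.
BATCH_SIZE = 1_000_000

class ArchiveWriter:
    def __init__(self):
        self.buffer = []
        self.sealed_containers = []   # only sealed containers are ever queryable

    def append(self, record):
        self.buffer.append(record)
        if len(self.buffer) == BATCH_SIZE:
            self._seal()

    def _seal(self):
        # In a real system this step would compress the whole batch and publish it
        # atomically; if the node fails before it completes, the partial batch is
        # simply discarded and re-loaded, so no out-of-date or partial data is read.
        self.sealed_containers.append(tuple(self.buffer))
        self.buffer = []
```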

Conclusion

The explosion in data volumes and velocity requires a specialization in databases that we haven't seen in the decades since traditional SQL DBMSs became universal. There is much innovation at the high end in transaction performance, especially in scale-out and in-memory architectures. Within the database engine itself, analytic databases are supporting emerging applications by providing new and advanced functionality, such as predictive analytics, machine learning, and other programmability options.

Within this race to the top, RainStor in particular, and the databases focusing on column storage more generally, have carved out a very sizable set of data storage and analytics scenarios that have been mostly ignored. Granted, some enterprises might rightly ignore these scenarios because they aren't applicable: if analytic functionality beyond SQL is required, the high-end DBMS vendors represent the best solutions today. But those who adhere strictly to the design center of an online analytic archive can achieve 30 to 40 times improvements in storage efficiency. That allows companies trying to manage the exponential growth in data to keep their archives online and accessible forever via standard SQL, Hive, Pig, MapReduce, and BI tools.

About George Gilbert

George Gilbert is the co-founder and partner of TechAlpha, a management consulting and research firm that advises clients in the technology, media, and telecommunications industries. He is recognized as a thought leader on the future of cloud computing, data center automation, and Software-as-a-Service (SaaS) economics, and he has contributed to many publications, including the Economist. Previously Gilbert was the lead enterprise software analyst for Credit Suisse First Boston, one of the leading investment banks for the technology sector. Prior to being an analyst, Gilbert worked at Microsoft as a product manager on Windows Server and spent four years in product management and marketing at Lotus Development. He received his B.A. in economics from Harvard University.

About GigaOM Research

GigaOM Research gives you insider access to expert industry insights on emerging markets. Focused on delivering highly relevant and timely research to the people who need it most, our analysis, reports, and original research come from the most respected voices in the industry. Whether you're beginning to learn about a new market or are an industry insider, GigaOM Pro addresses the need for relevant, illuminating insights into the industry's most dynamic markets.

Visit us at: pro.gigaom.com

© 2012 Giga Omni Media, Inc. All Rights Reserved. This publication may be used only as expressly permitted by license from GigaOM and may not be accessed, used, copied, distributed, published, sold, publicly displayed, or otherwise exploited without the express prior written permission of GigaOM. For licensing information, please contact us.


Einsatzfelder von IBM PureData Systems und Ihre Vorteile. Einsatzfelder von IBM PureData Systems und Ihre Vorteile demirkaya@de.ibm.com Agenda Information technology challenges PureSystems and PureData introduction PureData for Transactions PureData for Analytics

More information

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software WHITEPAPER Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software SanDisk ZetaScale software unlocks the full benefits of flash for In-Memory Compute and NoSQL applications

More information

Using Attunity Replicate with Greenplum Database Using Attunity Replicate for data migration and Change Data Capture to the Greenplum Database

Using Attunity Replicate with Greenplum Database Using Attunity Replicate for data migration and Change Data Capture to the Greenplum Database White Paper Using Attunity Replicate with Greenplum Database Using Attunity Replicate for data migration and Change Data Capture to the Greenplum Database Abstract This white paper explores the technology

More information

Understanding the Value of In-Memory in the IT Landscape

Understanding the Value of In-Memory in the IT Landscape February 2012 Understing the Value of In-Memory in Sponsored by QlikView Contents The Many Faces of In-Memory 1 The Meaning of In-Memory 2 The Data Analysis Value Chain Your Goals 3 Mapping Vendors to

More information

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com Data Warehousing and Analytics Infrastructure at Facebook Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com Overview Challenges in a Fast Growing & Dynamic Environment Data Flow Architecture,

More information

Proact whitepaper on Big Data

Proact whitepaper on Big Data Proact whitepaper on Big Data Summary Big Data is not a definite term. Even if it sounds like just another buzz word, it manifests some interesting opportunities for organisations with the skill, resources

More information

Oracle Exadata: The World s Fastest Database Machine Exadata Database Machine Architecture

Oracle Exadata: The World s Fastest Database Machine Exadata Database Machine Architecture Oracle Exadata: The World s Fastest Database Machine Exadata Database Machine Architecture Ron Weiss, Exadata Product Management Exadata Database Machine Best Platform to Run the

More information

Il mondo dei DB Cambia : Tecnologie e opportunita`

Il mondo dei DB Cambia : Tecnologie e opportunita` Il mondo dei DB Cambia : Tecnologie e opportunita` Giorgio Raico Pre-Sales Consultant Hewlett-Packard Italiana 2011 Hewlett-Packard Development Company, L.P. The information contained herein is subject

More information

Exadata Database Machine

Exadata Database Machine Database Machine Extreme Extraordinary Exciting By Craig Moir of MyDBA March 2011 Exadata & Exalogic What is it? It is Hardware and Software engineered to work together It is Extreme Performance Application-to-Disk

More information

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract

W H I T E P A P E R. Deriving Intelligence from Large Data Using Hadoop and Applying Analytics. Abstract W H I T E P A P E R Deriving Intelligence from Large Data Using Hadoop and Applying Analytics Abstract This white paper is focused on discussing the challenges facing large scale data processing and the

More information

Oracle Database - Engineered for Innovation. Sedat Zencirci Teknoloji Satış Danışmanlığı Direktörü Türkiye ve Orta Asya

Oracle Database - Engineered for Innovation. Sedat Zencirci Teknoloji Satış Danışmanlığı Direktörü Türkiye ve Orta Asya Oracle Database - Engineered for Innovation Sedat Zencirci Teknoloji Satış Danışmanlığı Direktörü Türkiye ve Orta Asya Oracle Database 11g Release 2 Shipping since September 2009 11.2.0.3 Patch Set now

More information

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

Datenverwaltung im Wandel - Building an Enterprise Data Hub with Datenverwaltung im Wandel - Building an Enterprise Data Hub with Cloudera Bernard Doering Regional Director, Central EMEA, Cloudera Cloudera Your Hadoop Experts Founded 2008, by former employees of Employees

More information

EMC SOLUTION FOR SPLUNK

EMC SOLUTION FOR SPLUNK EMC SOLUTION FOR SPLUNK Splunk validation using all-flash EMC XtremIO and EMC Isilon scale-out NAS ABSTRACT This white paper provides details on the validation of functionality and performance of Splunk

More information

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum Big Data Analytics with EMC Greenplum and Hadoop Big Data Analytics with EMC Greenplum and Hadoop Ofir Manor Pre Sales Technical Architect EMC Greenplum 1 Big Data and the Data Warehouse Potential All

More information

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved. Collaborative Big Data Analytics 1 Big Data Is Less About Size, And More About Freedom TechCrunch!!!!!!!!! Total data: bigger than big data 451 Group Findings: Big Data Is More Extreme Than Volume Gartner!!!!!!!!!!!!!!!

More information

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics ESSENTIALS EMC ISILON Use the industry's first and only scale-out NAS solution with native Hadoop

More information

How to Choose Between Hadoop, NoSQL and RDBMS

How to Choose Between Hadoop, NoSQL and RDBMS How to Choose Between Hadoop, NoSQL and RDBMS Keywords: Jean-Pierre Dijcks Oracle Redwood City, CA, USA Big Data, Hadoop, NoSQL Database, Relational Database, SQL, Security, Performance Introduction A

More information

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc. 2011 2014. All Rights Reserved

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc. 2011 2014. All Rights Reserved Hortonworks & SAS Analytics everywhere. Page 1 A change in focus. A shift in Advertising From mass branding A shift in Financial Services From Educated Investing A shift in Healthcare From mass treatment

More information

BIG DATA-AS-A-SERVICE

BIG DATA-AS-A-SERVICE White Paper BIG DATA-AS-A-SERVICE What Big Data is about What service providers can do with Big Data What EMC can do to help EMC Solutions Group Abstract This white paper looks at what service providers

More information

Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D.

Big Data Technology ดร.ช ชาต หฤไชยะศ กด. Choochart Haruechaiyasak, Ph.D. Big Data Technology ดร.ช ชาต หฤไชยะศ กด Choochart Haruechaiyasak, Ph.D. Speech and Audio Technology Laboratory (SPT) National Electronics and Computer Technology Center (NECTEC) National Science and Technology

More information

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect Big Data & QlikView Democratizing Big Data Analytics David Freriks Principal Solution Architect TDWI Vancouver Agenda What really is Big Data? How do we separate hype from reality? How does that relate

More information

Cisco Solutions for Big Data and Analytics

Cisco Solutions for Big Data and Analytics Cisco Solutions for Big Data and Analytics Tarek Elsherif, Solutions Executive November, 2015 Agenda Major Drivers & Challengs Data Virtualization & Analytics Platform Considerations for Big Data & Analytics

More information

CitusDB Architecture for Real-Time Big Data

CitusDB Architecture for Real-Time Big Data CitusDB Architecture for Real-Time Big Data CitusDB Highlights Empowers real-time Big Data using PostgreSQL Scales out PostgreSQL to support up to hundreds of terabytes of data Fast parallel processing

More information

News and trends in Data Warehouse Automation, Big Data and BI. Johan Hendrickx & Dirk Vermeiren

News and trends in Data Warehouse Automation, Big Data and BI. Johan Hendrickx & Dirk Vermeiren News and trends in Data Warehouse Automation, Big Data and BI Johan Hendrickx & Dirk Vermeiren Extreme Agility from Source to Analysis DWH Appliances & DWH Automation Typical Architecture 3 What Business

More information

Data Refinery with Big Data Aspects

Data Refinery with Big Data Aspects International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 7 (2013), pp. 655-662 International Research Publications House http://www. irphouse.com /ijict.htm Data

More information

Oracle Database 12c Plug In. Switch On. Get SMART.

Oracle Database 12c Plug In. Switch On. Get SMART. Oracle Database 12c Plug In. Switch On. Get SMART. Duncan Harvey Head of Core Technology, Oracle EMEA March 2015 Safe Harbor Statement The following is intended to outline our general product direction.

More information

Converged storage architecture for Oracle RAC based on NVMe SSDs and standard x86 servers

Converged storage architecture for Oracle RAC based on NVMe SSDs and standard x86 servers Converged storage architecture for Oracle RAC based on NVMe SSDs and standard x86 servers White Paper rev. 2015-11-27 2015 FlashGrid Inc. 1 www.flashgrid.io Abstract Oracle Real Application Clusters (RAC)

More information

Dell Cloudera Syncsort Data Warehouse Optimization ETL Offload

Dell Cloudera Syncsort Data Warehouse Optimization ETL Offload Dell Cloudera Syncsort Data Warehouse Optimization ETL Offload Drive operational efficiency and lower data transformation costs with a Reference Architecture for an end-to-end optimization and offload

More information

How To Speed Up A Flash Flash Storage System With The Hyperq Memory Router

How To Speed Up A Flash Flash Storage System With The Hyperq Memory Router HyperQ Hybrid Flash Storage Made Easy White Paper Parsec Labs, LLC. 7101 Northland Circle North, Suite 105 Brooklyn Park, MN 55428 USA 1-763-219-8811 www.parseclabs.com info@parseclabs.com sales@parseclabs.com

More information

Big + Fast + Safe + Simple = Lowest Technical Risk

Big + Fast + Safe + Simple = Lowest Technical Risk Big + Fast + Safe + Simple = Lowest Technical Risk The Synergy of Greenplum and Isilon Architecture in HP Environments Steffen Thuemmel (Isilon) Andreas Scherbaum (Greenplum) 1 Our problem 2 What is Big

More information

EMC XtremSF: Delivering Next Generation Storage Performance for SQL Server

EMC XtremSF: Delivering Next Generation Storage Performance for SQL Server White Paper EMC XtremSF: Delivering Next Generation Storage Performance for SQL Server Abstract This white paper addresses the challenges currently facing business executives to store and process the growing

More information

Why DBMSs Matter More than Ever in the Big Data Era

Why DBMSs Matter More than Ever in the Big Data Era E-PAPER FEBRUARY 2014 Why DBMSs Matter More than Ever in the Big Data Era Having the right database infrastructure can make or break big data analytics projects. TW_1401138 Big data has become big news

More information

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time SCALEOUT SOFTWARE How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T wenty-first

More information

Mike Maxey. Senior Director Product Marketing Greenplum A Division of EMC. Copyright 2011 EMC Corporation. All rights reserved.

Mike Maxey. Senior Director Product Marketing Greenplum A Division of EMC. Copyright 2011 EMC Corporation. All rights reserved. Mike Maxey Senior Director Product Marketing Greenplum A Division of EMC 1 Greenplum Becomes the Foundation of EMC s Big Data Analytics (July 2010) E M C A C Q U I R E S G R E E N P L U M For three years,

More information

Architectures for Big Data Analytics A database perspective

Architectures for Big Data Analytics A database perspective Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum

More information

NetApp Big Content Solutions: Agile Infrastructure for Big Data

NetApp Big Content Solutions: Agile Infrastructure for Big Data White Paper NetApp Big Content Solutions: Agile Infrastructure for Big Data Ingo Fuchs, NetApp April 2012 WP-7161 Executive Summary Enterprises are entering a new era of scale, in which the amount of data

More information

High Performance IT Insights. Building the Foundation for Big Data

High Performance IT Insights. Building the Foundation for Big Data High Performance IT Insights Building the Foundation for Big Data Page 2 For years, companies have been contending with a rapidly rising tide of data that needs to be captured, stored and used by the business.

More information

SUN ORACLE EXADATA STORAGE SERVER

SUN ORACLE EXADATA STORAGE SERVER SUN ORACLE EXADATA STORAGE SERVER KEY FEATURES AND BENEFITS FEATURES 12 x 3.5 inch SAS or SATA disks 384 GB of Exadata Smart Flash Cache 2 Intel 2.53 Ghz quad-core processors 24 GB memory Dual InfiniBand

More information

In-memory computing with SAP HANA

In-memory computing with SAP HANA In-memory computing with SAP HANA June 2015 Amit Satoor, SAP @asatoor 2015 SAP SE or an SAP affiliate company. All rights reserved. 1 Hyperconnectivity across people, business, and devices give rise to

More information

WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS

WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS WINDOWS AZURE DATA MANAGEMENT AND BUSINESS ANALYTICS Managing and analyzing data in the cloud is just as important as it is anywhere else. To let you do this, Windows Azure provides a range of technologies

More information

Executive Summary... 2 Introduction... 3. Defining Big Data... 3. The Importance of Big Data... 4 Building a Big Data Platform...

Executive Summary... 2 Introduction... 3. Defining Big Data... 3. The Importance of Big Data... 4 Building a Big Data Platform... Executive Summary... 2 Introduction... 3 Defining Big Data... 3 The Importance of Big Data... 4 Building a Big Data Platform... 5 Infrastructure Requirements... 5 Solution Spectrum... 6 Oracle s Big Data

More information

Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks

Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks WHITE PAPER July 2014 Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks Contents Executive Summary...2 Background...3 InfiniteGraph...3 High Performance

More information

Analyzing Big Data with Splunk A Cost Effective Storage Architecture and Solution

Analyzing Big Data with Splunk A Cost Effective Storage Architecture and Solution Analyzing Big Data with Splunk A Cost Effective Storage Architecture and Solution Jonathan Halstuch, COO, RackTop Systems JHalstuch@racktopsystems.com Big Data Invasion We hear so much on Big Data and

More information

THE DEVELOPER GUIDE TO BUILDING STREAMING DATA APPLICATIONS

THE DEVELOPER GUIDE TO BUILDING STREAMING DATA APPLICATIONS THE DEVELOPER GUIDE TO BUILDING STREAMING DATA APPLICATIONS WHITE PAPER Successfully writing Fast Data applications to manage data generated from mobile, smart devices and social interactions, and the

More information