InDetail
RainStor archiving
An InDetail Paper by Bloor Research
Author: Philip Howard
Publish date: November 2013
"Archival is a no-brainer when it comes to return on investment and total cost of ownership. This is particularly true as data volumes grow."
Philip Howard
Executive summary

Archiving, by which we mean on-line or near-line archiving rather than off-line, tape-based archiving, has been a technology ahead of its time. It has been around for a number of years but it has never been as widely adopted as its merits would recommend. However, we believe that this is changing. On the one hand, the amount of data held by organisations is greatly increasing and, at the same time, compliance requirements are mandating that companies retain more and more data. On the other hand, compression technologies have improved and we have seen the introduction of Hadoop and other low-cost compute platforms, which combine to further improve the cost-of-ownership benefits of archiving. When combined with the ability to query archived data directly, there is no good argument against adopting this technology and there are plenty of good reasons in favour of it. We expect archiving to move much higher up the CIO's priority list over the coming months and years.

That said, we should perhaps explain what we mean by an (enterprise) data archive, because perceptions of this may vary across organisations. For example, if you are collecting machine-generated data (from networks, sensors or smart meters, for example) then the data in your archive could range from being as little as a day old to being several years old, depending on how long you want to keep this information for analysis purposes. On the other hand, if we think about archiving transactional data then it will often be the case that the data in the archive is no younger than six months old, and the retention of the oldest data will often be mandated by compliance requirements of, say, seven years. Again, if you are archiving from a data warehouse, the youngest data in the archive might be as much as three years old, depending on the duration for which you keep the data for live analysis purposes. Of course, an enterprise data archiving strategy may encompass all of these.

With current rates of data growth, archiving is coming more and more to the fore. Return on investment therefore becomes a critical issue, and so do standards, both for the platform(s) that will be used for archiving and for the processes and best practices used to archive the data. These have certainly been issues that the companies we have spoken to (see later) have faced in adopting their archiving solutions.

Fast facts

RainStor's eponymous product is marketed both as an analytics data hub and as an archiving product. In this paper we are concerned with the product's capabilities in the latter area. Note that the product is specific to storing, retrieving and managing archived data and not to deciding what data should be archived in the first place. In terms of its archiving capabilities RainStor supports three different approaches, all of which are based on the same software, as follows:

• Archiving in the cloud (archiving as a service)
• Archiving on Hadoop
• Archiving on conventional in-house platforms

In this paper we will discuss the facilities provided, supported by the views of a number of RainStor customers to whom we have spoken.

Key findings

In the opinion of Bloor Research, the following represent the key facts of which prospective users should be aware:

• RainStor guarantees 10x compression and normally expects compression to be between 20x and 40x. For some data, compression rates can be as high as 45x.
• RainStor supports ANSI standard queries (via ODBC/JDBC) against its archive, so you should be able to use existing business intelligence tools to query the archived data.
• Queries perform very efficiently, thanks to the use of Bloom filters, which are also used to improve performance for free text search.
• Federated queries are supported (via third-party tools) across both the archive and live systems.
• If the data being archived is relational then you can ingest the data schema along with the data. RainStor understands the concept of schema evolution, so that you can query the data within the context of the schema that was active at the time the data was live (a simple illustration of this idea follows at the end of this summary).
• While RainStor provides a connector called FastConnect for rapid loading of data from Teradata data warehouses, there are no equivalent products for other environments. While users are satisfied with the load performance they are getting in non-Teradata environments, it would be useful if a similar high-performance loader were available for other popular database environments such as Oracle and Netezza (IBM). We understand that RainStor is planning exactly this.
• The Hadoop version of RainStor provides security that Hadoop otherwise lacks, and RainStor supports MapReduce as well as SQL in this environment. However, there are no current facilities to embed SQL statements within MapReduce jobs, or vice versa.

The bottom line

It is the combination of capabilities offered by RainStor that is most impressive: compression as high as 99%, the ability to access archived data via SQL, the performance you get for such queries (thanks to the use of Bloom filters), the support for schema evolution, and the choice of platform for storage (cloud, Hadoop or conventional). There are not many competitors, if any, that can offer this range of cost, performance, scalability and compliance features. The initial process of selecting the data to archive is typically the domain of data profiling and discovery tools, to ensure that business objects are archived in toto. We would like to see RainStor partner with one or more vendors that specialise in this area.
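To make the schema evolution point in the key findings concrete, the sketch below shows the general idea of resolving each archived record against the schema version that was in force when it was loaded. It is purely illustrative and entirely our own: the hypothetical call-detail columns and the versioning mechanism are assumptions, not a description of RainStor's patented implementation.

```python
# Illustrative only: a toy model of schema evolution, not RainStor's implementation.
from bisect import bisect_right
from datetime import datetime

# Each schema version records when it came into force and the columns it defined.
# We assume a hypothetical call-detail table that gained a column in mid-2012.
SCHEMA_VERSIONS = [
    (datetime(2010, 1, 1), ["caller", "callee", "duration_secs"]),
    (datetime(2012, 6, 1), ["caller", "callee", "duration_secs", "cell_id"]),
]

def schema_as_of(load_time: datetime) -> list:
    """Return the column list that was in force when a record was loaded."""
    starts = [start for start, _ in SCHEMA_VERSIONS]
    index = bisect_right(starts, load_time) - 1
    return SCHEMA_VERSIONS[max(index, 0)][1]

# A record archived in 2011 resolves to the three-column schema,
# while one archived in 2013 resolves to the four-column schema.
print(schema_as_of(datetime(2011, 3, 15)))
print(schema_as_of(datetime(2013, 2, 1)))
```

In compliance terms, this is what allows you to show not just what data you held at a given point in time, but the shape in which you held it.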
The product

RainStor is currently at version 5.5, which was released in June 2013. It is targeted at both the archiving and analytics markets. In the latter case the idea is that you store raw data, such as log files or other machine-generated data, in RainStor, while holding aggregated data in your data warehouse. As we shall see, the query capabilities provided by RainStor make this a practical proposition. For Teradata environments (Teradata is a partner) there is a FastConnect product, which specifically enables communications between the Teradata and RainStor systems, as well as a recently introduced product called FastForward, which allows users to reinstate data into RainStor from Teradata tape archives.

As far as archiving is concerned, the software is the same as you would use for analytics but the emphasis is different. As previously noted, this solution may be deployed in conjunction with scale-out NAS environments such as EMC Isilon and popular WORM storage such as EMC Centera, which is the traditional approach adopted by RainStor; or, since release 5 (January 2012), you can implement RainStor on top of a Hadoop cluster. Moreover, you can combine these options, for example with an implementation of EMC Isilon supporting HDFS. The solution can also be deployed in the cloud via an archiving-as-a-service offering, which is perhaps best described as providing the facility for IT to offer this as a service to multiple business units. Multi-tenancy is supported for customers archiving to a central repository across multiple applications.

In the bulk of this paper we will discuss the architecture of the product, the potential benefits it offers and how these are achieved, and we will back this up with reports on the discussions we have had with a couple of RainStor customers.
Figure 1: RainStor architecture. The diagram shows a layered stack: a visualisation layer of BI tools and dashboards (ODBC/JDBC connectivity); a computation layer offering SQL 92 (with Oracle, SQL Server and Sybase IQ extensions) alongside Hive, Pig and Java MapReduce (a batch, distributed programming framework); a security and compliance layer (encryption, masking, audit trail, data disposition, Kerberos, LDAP/Active Directory, immutability); the RainStor database itself (up to 40x data compression); and a storage layer of either HDFS (Hadoop Distributed File System) or NAS, SAN, CAS and NFS, on-premise or in the cloud. The diagram distinguishes Apache projects, RainStor components and vendor-specific elements.

Architecture

RainStor describes its product as being founded on a database. From a purist's point of view this is not correct, at least in the sense that there is no data model that is inherent to the environment. In fact, the underlying structure is a file system. It could, of course, be regarded as a NoSQL environment such as that offered by Hadoop, which is also based on a file system and is schema-free. However, this would be misleading, since RainStor has strong SQL support and understands the concept of database schemas. In fact, the most accurate description of the storage architecture is as a data repository that stores data in files. This means that RainStor is very easy to install and implement and it requires virtually no administration.

The repository element of this is important because it implies the presence of metadata, and this metadata does several things. First of all, it allows the ingestion of SQL schemas and supports schema evolution. This is a patented process, which means that you can query archived data against the schema that was in place at the time the data was created. This is important not just for business purposes but also in some compliance environments where you need to be able to prove what you knew at some point in time.

The second important aspect of this repository relates to how you query data in the archive. RainStor uses what are known as Bloom filters (named for their inventor, Burton H. Bloom), which are space-efficient probabilistic data structures (held as metadata) that are used to test whether an element is a member of a set. False positives are possible, but false negatives are not. Put simply, these filters tell the system where the data is not, so that the software looks for the data it needs only within relevant data blocks. The advantage of using these Bloom filters (which, technically, are based on bit vectors) is that they greatly increase performance and, at the same time, they require much less management and overhead than indexes. Note that the filters, like the data, are compressed.
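To illustrate the property just described, here is a minimal Bloom filter sketch in Python. It is our own toy code rather than RainStor's: the point is simply that a negative membership test is definitive, so a query engine that keeps one (compressed) filter per data block can skip any block whose filter rules a value out.

```python
# A minimal Bloom filter sketch (illustrative only, not RainStor's implementation).
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # the bit vector, held here as a single integer

    def _positions(self, item: str):
        # Derive k bit positions from salted SHA-1 digests of the item.
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # False positives are possible; false negatives are not.
        return all(self.bits & (1 << pos) for pos in self._positions(item))

# One filter per data block: if might_contain() returns False, the block
# provably does not hold the value and can be skipped by a query.
block_filter = BloomFilter()
for value in ["alice@example.com", "bob@example.com"]:
    block_filter.add(value)

print(block_filter.might_contain("alice@example.com"))  # True
print(block_filter.might_contain("zoe@example.com"))    # False (or, rarely, a false positive)
```

The trade-off is an occasional false positive, which merely costs an unnecessary block read; it never causes data to be missed.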
While on the subject of compression, RainStor guarantees 10x compression, which is about on a par with the best that the major database vendors can offer (and much better than you would expect from Hadoop). However, 20x to 40x compression is commonplace in practice and, as we shall see, one of RainStor's users that we spoke to claimed a footprint reduction of around 99%. These figures are significantly better than you would get from any merchant database vendor. From a technical perspective, RainStor uses a form of tokenisation with byte-level compression, combined with a linked list to enable data value and pattern de-duplication.

Querying the data

While we have mentioned some aspects of RainStor's query abilities already (Bloom filters and schema evolution), and we discuss the use of Hadoop in a later section, more generally RainStor includes a query engine that supports (translates) incoming SQL (ANSI standard SQL 92) so that you can run conventional business intelligence environments against RainStor. In addition, the company has recently introduced free text search capabilities that combine Bloom filters with some of the parsing aspects of Lucene, bringing the performance advantages of the former (according to RainStor, one to two orders of magnitude) to this requirement.

Compliance

RainStor provides an append-only environment. That is, you can add new records but you cannot update them. If you need to do the latter, you must delete the original record (there is a record-level delete function, assuming you are allowed to use it: records may be designated as subject to legal hold) and then insert a new record. This could be a limitation in analytic environments but should not be a problem for archiving. In fact, compliance-driven archiving actually requires data disposition and retention capabilities. RainStor provides record-level expiry and auto-delete capabilities as well as the ability to add comments and to audit these (and there are broader auditing capabilities, as you might expect). Where compliance is a major consideration, WORM (write once, read many) storage may be the preferred option, even though RainStor's Hadoop implementation has done much to improve the security of the Hadoop environment (see next section). MD5 fingerprinting is provided along with tamper-proofing.

Hadoop

One of the problems with Hadoop is that the security isn't great. It is not that you don't get any security with Hadoop distributions (Kerberos authentication is fairly standard, for example) but it is not as fine-grained as it needs to be for enterprise-level requirements and certainly won't be compliant with requirements such as PCI. Moreover, it is frequently the case that Kerberos gets turned off for performance reasons, so a more robust solution is required. RainStor, in its latest 5.5 release, has set out to resolve this issue. It now offers not just Kerberos but also LDAP and Active Directory support. In addition, the company offers data masking for both SQL and MapReduce functions. While this is not intended to be a full-blown data masking tool, it will be useful for masking data in, say, log files. Masking can be done in a consistent fashion so that the same piece of data is always masked in the same way.
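As a rough illustration of what consistent masking means in practice, the following sketch uses keyed hashing so that a given value always produces the same masked token. This is a generic technique shown under our own assumptions (a locally managed key and a hypothetical subscriber number); it is not a description of RainStor's actual masking functions.

```python
# Illustrative consistent (deterministic) masking, not RainStor's mechanism.
# The same input always produces the same masked value, so joins and counts
# on the masked column still work, while the original value is not exposed.
import hashlib
import hmac

MASKING_KEY = b"keep-this-secret"  # assumed to be managed securely elsewhere

def mask(value: str, length: int = 12) -> str:
    """Return a stable, irreversible token for a sensitive value."""
    digest = hmac.new(MASKING_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]

# The same subscriber number masks to the same token every time it appears,
# whether it is read via SQL or via a MapReduce job.
print(mask("+1-415-555-0100"))
print(mask("+1-415-555-0100"))  # identical output
print(mask("+1-415-555-0199"))  # different subscriber, different token
```

Deterministic masking of this sort preserves joins and distinct counts on the masked column while keeping the raw values out of query results.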
The way that RainStor is implemented on Hadoop is that RainStor partitions are stored within HDFS (the Hadoop distributed file system). You can choose which data you want to flatten: because RainStor understands SQL, you can store relevant data in tabular format if you want to, and flatten other data where a tabular format is not relevant. Also, of course, the SQL supported by RainStor means that you can perform functions such as joins, which would not otherwise be possible (see the example at the end of this section). It is also worth noting that RainStor files are treated as first-class objects within the context of Hadoop and MapReduce, so there is no need to change existing scripts (in Pig, say), other than to change a single parameter so that the query points to the RainStor partitions. The company has added native MapReduce support to its existing SQL capabilities and has partnered with Cloudera, Hortonworks and MapR. What you cannot do is combine MapReduce and SQL within a single query: you have to use one or the other, although we understand that RainStor is planning to introduce this capability in due course. Finally, the one other point to note is that RainStor understands timestamps, which is not something that is native to Hadoop.
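Because the archive is exposed through ODBC/JDBC and ANSI SQL 92, querying it, joins included, can look much like querying any other relational source. The fragment below is a hypothetical sketch only: it assumes the Python pyodbc package, a suitable ODBC driver, a configured data source named RAINSTOR_ARCHIVE, and invented table and column names, none of which are taken from RainStor's documentation.

```python
# Hypothetical example of querying archived data over ODBC. The DSN, credentials,
# table and column names are invented for illustration; RainStor exposes ANSI SQL 92.
import pyodbc

# Assume an ODBC data source called "RAINSTOR_ARCHIVE" has been configured.
conn = pyodbc.connect("DSN=RAINSTOR_ARCHIVE;UID=analyst;PWD=secret")
cursor = conn.cursor()

# Standard SQL, including joins, runs against the archive just as it would
# against a live relational database.
cursor.execute("""
    SELECT c.customer_id, COUNT(*) AS calls, SUM(d.duration_secs) AS total_secs
    FROM customers c
    JOIN call_detail_records d ON d.customer_id = c.customer_id
    WHERE d.call_date BETWEEN '2012-01-01' AND '2012-12-31'
    GROUP BY c.customer_id
    ORDER BY total_secs DESC
""")

for customer_id, calls, total_secs in cursor.fetchmany(10):
    print(customer_id, calls, total_secs)

conn.close()
```

The same style of query could equally be issued from an existing BI tool over its own ODBC/JDBC connection, which is the scenario the paper describes.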
Case Studies

We have interviewed two users of RainStor's archiving solutions to confirm (or otherwise) the company's claims.

Dell

In addition to being a partner, Dell has also adopted RainStor as an archiving solution within its own IT department, within the data warehousing and business intelligence group. The company uses Teradata as its main data warehousing platform and archives onto RainStor seldom-used data that it nevertheless wishes, or is required, to retain. Note that this is an archiving solution rather than one focused on analytics.

The initial justification for implementing RainStor was performance. The customer has a number of tables in its warehouse that are multi-terabyte in size, and the company could halve the size of these tables, and their related indexes, by archiving rarely accessed data. This in turn improves the query performance achieved from the data warehouse without needing costly hardware upgrades. The organisation did not undertake any product comparisons before opting in favour of RainStor, partly because Dell was already a RainStor partner and partly because it had previously looked at Informatica ILM (which embedded RainStor at the time, although that partnership has since ended), so it was familiar with the technology. It did, however, undertake a proof of concept in which RainStor demonstrated 40x compression rates on 2TB of data and, according to the Enterprise BI Technical Architecture lead we spoke with, the company fully expects at least 20x against the 50TB to be archived.

From this statement it will be clear that this implementation is not yet complete. In part, the reason for this is that the customer has opted to implement RainStor on top of Hadoop. It was originally going to implement RainStor natively, but then the company released its Hadoop option and the IT team lobbied management to move to a Hadoop environment, in part because the data warehouse team had no previous experience of Hadoop and saw this as an opportunity to gain it. In practice, according to the user we spoke to, "we learned some things the hard way" about Hadoop. Nevertheless, management is now very pleased that it opted for an approach based upon Hadoop. While the company initially thought it would not need support for Hadoop, it has subsequently decided that this would be a good idea and is licensing this from Cloudera.

The data stored in RainStor is varied: it consists of sales, service, manufacturing and inventory data. It is loaded using RainStor FastConnect within Informatica PowerCenter workflows, and TOAD Decision Point is used to query the data. The customer has also performed proofs of concept using SAP BusinessObjects to federate queries across data stored within the warehouse and the archive. As far as support is concerned, the customer was complimentary: he described the company as "very responsive and supportive" and said "we couldn't have asked for better support".
CenturyLink

CenturyLink is one of the largest telecommunications companies in the United States. Unlike Dell, which is a recent RainStor customer, it has been using the product since 2010 to archive data that is too expensive to retain within its Oracle Exadata data warehouse. It originally evaluated Vertica (now HP) and Greenplum (now EMC) as alternatives to RainStor but felt that RainStor offered a better architecture for supporting archiving. Indeed, the principal architect for the organisation stated that "we are getting Exadata speeds out of RainStor". He explained that sometimes it is (much) faster: "we have reports that ran in Oracle for 5.5 hours and errored out that finished in RainStor in less than 20 minutes". Access to the data is primarily through SAP BusinessObjects and QlikView, and queries can be federated.

Apart from the compression, which the customer said was "upwards of 99%: really tremendous", the other big advantage of implementing RainStor is that the issue of back-up and recovery goes away; you just have to tune the environment, "so we save a bundle of money on storage". The only difficulty is on the Oracle side, which he described as "going stupid" with the idea of an external database. Loading times were described as acceptable; you cannot trickle-feed data into RainStor, you have to present a file, then sort and load it.

The company has multiple implementations of RainStor in different locations and it expects its volume of archived data to surpass 1PB at some point next year. Currently this is all NAS-based, but the company is going through a Hadoop proof of concept, which it will use for unstructured archiving while retaining the existing RainStor implementations for structured and semi-structured data. The big advantage of RainStor on Hadoop, according to the customer, is that it adds security. It is interesting to note that the spokesperson for CenturyLink gave the same advice about implementing on Hadoop (that is, listen to RainStor) as Dell, the previous high-tech customer running on HDFS.

The customer we spoke to was complimentary about RainStor without being gushing. He described the company's service as "somewhere between excellent and above average", said that bug fixes were immediate, and was pleased that "we have had a lot fewer issues with RainStor than with other vendors" and that the system is very stable.
The vendor

RainStor was previously Clearpace Software. The company was founded in the UK by ex-MoD (Ministry of Defence) staff who had been working on a way to store data derived from battlefield simulations effectively. As you may imagine, this means very large volumes of data, which need to be ingested rapidly, stored for a long time and easily retrieved, hence the product's excellent levels of compression. However, in 2007 the company changed its name to RainStor and moved its headquarters to the United States, although the bulk of development remains within the UK.

Initially the company focused on reseller partnerships and an OEM model, notably with Dell, Teradata, Anritsu, AdaptiveMobile, Group2000, Informatica (which formerly embedded RainStor in its ILM product) and HP, as well as a number of systems integrators. However, more recently the company has moved towards a hybrid model, which includes both direct and indirect sales. RainStor focuses particularly on the banking and financial services, government, utilities, and telecommunications and media sectors. Given the scale of their data retention requirements, one might imagine that the healthcare and pharmaceutical sectors might also be fruitful for RainStor. The company has in excess of 150 customers worldwide.

Web site: www.rainstor.com
Summary

Archival is a no-brainer when it comes to return on investment and total cost of ownership. This is particularly true as data volumes grow and when you can get the sort of performance and scale that RainStor can offer. Moreover, the choice of archiving in the cloud or on Hadoop, as well as in conventional environments, means that archiving is within the pecuniary reach of organisations of all sizes. Bloor Research believes that archiving should be a standard part of any company's infrastructure and, moreover, that it should be higher up the priority list than it has been historically. With big data on the agenda this is already happening, and RainStor is well placed to take advantage of this opportunity. The company's tag line says "taking the big out of data", and this is absolutely true.

Further Information

Further information about this subject is available from http://www.bloorresearch.com/update/2190
Bloor Research overview

Bloor Research is one of Europe's leading IT research, analysis and consultancy organisations. We explain how to bring greater Agility to corporate IT systems through the effective governance, management and leverage of Information. We have built a reputation for telling the right story with independent, intelligent, well-articulated communications content and publications on all aspects of the ICT industry. We believe the objective of telling the right story is to:

• Describe the technology in the context of its business value and the other systems and processes it interacts with.
• Understand how new and innovative technologies fit in with existing ICT investments.
• Look at the whole market and explain all the solutions available and how they can be more effectively evaluated.
• Filter noise and make it easier to find the additional information or news that supports both investment and implementation.
• Ensure all our content is available through the most appropriate channel.

Founded in 1989, we have spent over two decades distributing research and analysis to IT user and vendor organisations throughout the world via online subscriptions, tailored research services, events and consultancy projects. We are committed to turning our knowledge into business value for you.

About the author

Philip Howard
Research Director - Data Management

Philip started in the computer industry way back in 1973 and has variously worked as a systems analyst, programmer and salesperson, as well as in marketing and product management, for a variety of companies including GEC Marconi, GPT, Philips Data Systems, Raytheon and NCR. After a quarter of a century of not being his own boss, Philip set up his own company in 1992 and his first client was Bloor Research (then ButlerBloor), with Philip working for the company as an associate analyst. His relationship with Bloor Research has continued since that time and he is now Research Director focused on Data Management.

Data management refers to the management, movement, governance and storage of data and involves diverse technologies that include (but are not limited to) databases and data warehousing, data integration (including ETL, data migration and data federation), data quality, master data management, metadata management, and log and event management. Philip also tracks spreadsheet management and complex event processing.

In addition to the numerous reports Philip has written on behalf of Bloor Research, he also contributes regularly to IT-Director.com and IT-Analysis.com and was previously editor of both Application Development News and Operating System News on behalf of Cambridge Market Intelligence (CMI). He has also contributed to various magazines and written a number of reports published by companies such as CMI and The Financial Times. Philip speaks regularly at conferences and other events throughout Europe and North America.

Away from work, Philip's primary leisure activities are canal boats, skiing, playing bridge (at which he is a Life Master), dining out and walking Benji the dog.
Copyright & disclaimer

This document is copyright © 2013 Bloor Research. No part of this publication may be reproduced by any method whatsoever without the prior consent of Bloor Research. Due to the nature of this material, numerous hardware and software products have been mentioned by name. In the majority, if not all, of the cases, these product names are claimed as trademarks by the companies that manufacture the products. It is not Bloor Research's intent to claim these names or trademarks as our own. Likewise, company logos, graphics or screen shots have been reproduced with the consent of the owner and are subject to that owner's copyright. Whilst every care has been taken in the preparation of this document to ensure that the information is correct, the publishers cannot accept responsibility for any errors or omissions.
2nd Floor, 145-157 St John Street, LONDON, EC1V 4PY, United Kingdom
Tel: +44 (0)207 043 9750
Fax: +44 (0)207 043 9748
Web: www.bloorresearch.com
Email: info@bloorresearch.com