Information Architecture



The Bloor Group
Actian and The Big Data Information Architecture
WHITE PAPER

Actian and The Big Data Information Architecture

Originally founded in 2005 to acquire the Ingres database business from CA, Actian has gradually built a portfolio of products, most of which squarely target the Big Data market. It includes several types of database products and a whole suite of data integration and analytics software, all of which culminate in the Actian Analytics Platform.

The Actian Analytics Platform

The Actian Analytics Platform is rich in database capability that can and surely will be used in Big Data projects and to build Big Data architectures. Its core functionalities can be briefly described as follows:

- It includes a very high performance column-store database that was built to deliver extreme scale-up performance on a single server. It is one of the few technologies that has been engineered to take maximum advantage of the on-chip vector instructions of x86 chips. It has been proven in database implementations up to tens of terabytes. It currently benchmarks as the world's fastest query-oriented database by a wide margin.
- Rather than choosing to build a more conventional MPP version of its column-store database, Actian preferred to implement it over Hadoop's HDFS and name the product Hadoop SQL Edition. While this product is in its first release at the time of writing, it has, nevertheless, benchmarked as considerably faster (by multiples) on the TPC-DS benchmark. Currently it can scale to 30 Hadoop nodes, but this will likely increase in future releases.
- It offers a scale-out analytical database which can be deployed across hundreds of server nodes to process extremely large collections of data at and beyond the petabyte level. It has many built-in analytical functions in the engine and thus parallelizes both queries and analytical calculations.
- It offers an object database which is often deployed as a graph database for traversing data networks rather than tables. That type of workload would be its most likely role within a Big Data environment, although it could also be used as a document database.

Data in Motion

The Actian Analytics Platform includes a software building and execution capability that processes data in flight. It can be used to build data workflows where data is processed as it is piped from one source to another. In terms of Big Data architecture, it is a key feature for Actian as it complements Actian's variety of database products. The important aspects of this capability are:

- It processes data in parallel using both pipeline parallelism and data segmentation parallelism. As such, it is extremely fast, and when used with HDFS, it is far faster than Hadoop's native MapReduce framework.
- The underlying parallelization engine auto-configures to make optimal use of the available computer resources on which it is deployed.
- It is the fundamental technology that was used to build Actian's ETL and data cleansing products and thus is responsible for their speed.
- It comes with a series of connectors to databases and data stores.

- For users and software developers, it provides a codeless drag-and-drop prototyping environment for building data workflows.
- It scales out across multiple server nodes, and it can span Hadoop and non-Hadoop environments.
- It can also interface to data streams.
- In respect to analytics, it is directly integrated with the open source KNIME suite of machine learning software and can execute routines written in the R language.

If one considers the broad field of business intelligence (BI) and data analytics, which will be the primary application area for Big Data, it is clear that many activities (data access, metadata capture, data cleansing, data transformation and organization prior to ingest into a database) are not database applications. They are, however, suitable applications for the workflow development and data processing features built into Actian's platform.

Clearly the Actian Analytics Platform can also be used to carry out analytical processing and to query Hadoop directly (using SQL via Hive or, of course, Actian's own Hadoop SQL Edition). Thus, in many scenarios, the platform is an alternative as well as a complement to an analytical database.

Actian and Big Data Architecture

In our research paper entitled The Big Data Information Architecture (June 2014) we describe an event-driven architecture that we expect to supersede the traditional data warehouse architecture that has dominated the IT industry for almost two decades. The Actian Analytics Platform fits the described architecture very well. We illustrate this in Figure 1 on the following page, which depicts what we refer to in our research paper as a Data Refinery and Processing Hub. This Hub is responsible for both ingesting data into an organization's data layer and providing a processing service that may involve data queries and analytical calculations on collections of data.

The Data Hub is an arrangement of hardware and software that replaces the collection of ETL jobs, data staging areas, data warehouse and operational data stores that constitutes the traditional BI environment. Additionally, it exceeds the capability of the traditional BI environment in being able to handle data streams and unstructured data, as well as large data volumes.

If we consider the Actian Analytics Platform from the database perspective, it is clearly well-equipped to provide a comprehensive database capability for the Data Hub. The platform's support for SQL workloads and its Hadoop SQL Edition for larger data volumes can deliver excellent performance, and its analytical database can handle analytical queries. Its object-based database is equipped to store data in the form of connected graphs or documents and can process the associated workloads.

The Actian Analytics Platform can be deployed to provide a continuous data flow service from Hadoop to any of Actian's data stores, including Hadoop SQL Edition. As the Data Hub gradually expands over time, the ETL capabilities can be maintained and augmented. Ideally, flows of data within the Data Hub will be managed so that a full data lineage is known, recorded and continually monitored. This is an activity in which the Actian Analytics Platform will be a critical component; it does not just flow data to where it is required, but it also keeps track of data location and data lineage. With multiple database engines, it may be desirable for reasons of physical performance to replicate some data within The Hub, as when, for example, it is required both within a traditional query database and a graph database.
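
To make the idea of recorded data location and lineage concrete, the following is a minimal Python sketch. It is not Actian's implementation; the class LineageLog, its methods and the store names are hypothetical and exist only for illustration. It assumes data is copied (never deleted) between stores, and it records each copy so that the current locations of a data set and the path back to its origin can be looked up later, including the case where one data set is replicated into both a query database and a graph database.

```python
from dataclasses import dataclass, field


@dataclass
class LineageLog:
    """Hypothetical in-memory record of where each data set has been copied."""

    # Each entry is (dataset, source_store, target_store), in recording order.
    copies: list = field(default_factory=list)

    def record_copy(self, dataset, source, target):
        """Note that `dataset` was copied from `source` to `target`."""
        self.copies.append((dataset, source, target))

    def locations(self, dataset):
        """Return every store that holds a copy of `dataset`, sorted by name."""
        stores = set()
        for name, source, target in self.copies:
            if name == dataset:
                stores.update((source, target))
        return sorted(stores)

    def lineage(self, dataset, store):
        """Trace `dataset` in `store` back to the store it first came from."""
        path = [store]
        current = store
        while True:
            parents = [s for n, s, t in self.copies
                       if n == dataset and t == current]
            # Stop at the origin, or if a cycle would revisit a store.
            if not parents or parents[0] in path:
                return list(reversed(path))
            current = parents[0]
            path.append(current)


# Example: one data set lands in Hadoop, then is replicated to two engines.
log = LineageLog()
log.record_copy("customers", "source_feed", "hadoop")
log.record_copy("customers", "hadoop", "query_db")
log.record_copy("customers", "hadoop", "graph_db")

print(log.locations("customers"))
print(log.lineage("customers", "graph_db"))
```

A production lineage service would persist these records and attach timestamps and transformation details; the sketch keeps only the minimum needed to answer "where is this data?" and "where did it come from?".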

Figure 1: Actian Analytics Platform Deployed in a Data Refinery and Processing Hub

Just as the Actian Analytics Platform would be deployed for data flow within the Hub, it would also be used for data pulled from external data sources or received directly as data streams. Similarly, it will be used for data export from The Hub, directly from Hadoop or any database within The Hub to feed data marts and export data to other environments. By employing the Actian Analytics Platform in this manner, all data movements to, from and within The Hub can execute in parallel.

A fundamental idea of The Data Hub is that, as far as possible, all SQL queries that run on corporate data would execute there. There may be pragmatic reasons for exporting data from The Hub to data marts to feed other databases (for example, supplying data to an IBM mainframe environment), but these would be minimized. Because The Data Hub is built to be a fully scalable environment, as workloads grow, more commodity servers are configured into the environment to handle the expanding demand.

BI and analytics applications that simply wish to access data would do so directly, connecting to one or another of the databases within The Hub to launch SQL queries, or possibly, directly harvesting the data.

The Actian Analytics Platform can also play the role of an analytics engine, either by employing the KNIME suite of machine learning algorithms or by directly using analytic routines created in a language like R or Python. As such, it can supplement the query capabilities of Hadoop SQL Edition or even Hive and apply parallel analytical processing to the queried data.
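
As a rough illustration of supplementing a SQL query with parallel analytical processing, the Python sketch below uses the standard-library sqlite3 module as a stand-in for any SQL database in The Hub and a thread pool as a stand-in for a parallel engine. Neither reflects Actian's actual interfaces. The table sales, its sample rows, the function summarize and the choice to segment by region are all assumptions made for the example.

```python
import sqlite3
import statistics
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Stand-in database: an in-memory SQLite table with invented sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("east", 30.0),
     ("west", 5.0), ("west", 15.0), ("west", 25.0)],
)

# Step 1: a SQL query selects the data of interest.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
conn.close()

# Step 2: segment the result set so each partition can be processed alone.
partitions = defaultdict(list)
for region, amount in rows:
    partitions[region].append(amount)


def summarize(item):
    """Hypothetical analytic routine: mean and population std dev of a partition."""
    region, amounts = item
    return region, statistics.mean(amounts), statistics.pstdev(amounts)


# Step 3: apply the routine to every partition concurrently.
with ThreadPoolExecutor() as pool:
    results = {
        region: (mean, spread)
        for region, mean, spread in pool.map(summarize, partitions.items())
    }

for region in sorted(results):
    print(region, results[region])
```

Threads keep the sketch short and self-contained. For CPU-bound routines in Python, separate processes or a distributed engine would be needed to obtain real parallel speed-up.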

Because the Actian Analytics Platform offers a development environment, it can also be used both to develop and execute other activities that may take place within The Data Hub, such as data cleansing, metadata discovery and so on. It can also orchestrate the activities of other software tools that might be used within The Hub. The Actian Analytics Platform is an extraordinarily versatile solution, and organizations that select Actian to provide the foundation of their Big Data information architecture will no doubt make extensive use of it.

Actian in Summary

As far as we are aware, Actian is the only vendor that currently provides a broad line of software capabilities that includes both a suite of database products that cater for multiple query types (SQL queries, analytical queries, graph queries, document queries) and also a data flow development environment and engine. As such, it has all the requisite components for building a Data Hub of the type described in our research report, and hence provides the foundation for a Big Data environment to initially supplement and ultimately replace the traditional data warehouse environment and support an extensive analytical capability.

About The Bloor Group

The Bloor Group is a consulting, research and technology analysis firm that focuses on open research and the use of modern media to gather knowledge and disseminate it to IT users. Visit both and for more information. The Bloor Group is the sole copyright holder of this publication.

Austin, TX