Actian SQL in Hadoop Buyer's Guide


Contents

Introduction: Big Data and Hadoop
SQL on Hadoop Benefits
Approaches to SQL on Hadoop
The Top 10 SQL in Hadoop Capabilities
SQL in Hadoop Decision Criteria
Actian: The High-performance SQL in Hadoop Database
Summary
Learn more
Try it for free!

Introduction: Big Data and Hadoop

When most people think of big data, they also think of Hadoop. Why? Because Hadoop has emerged as the de facto data operating system to address the biggest challenge posed by big data: enormous amounts of data being generated continuously. Organizations have come to recognize that they can use Hadoop to store massive data sets by distributing them across inexpensive commodity servers, at roughly a tenth of the cost of storing the data in a data warehouse.

All this Hadoop data, stored in an ever-growing reservoir, provides an incredible opportunity for organizations that can mine it to create transformational business outcomes, including:

- Delighting customers with exceptional service to increase revenue or reduce churn.
- Responding faster, or using new insights to create a competitive advantage.
- Reducing risk and fraud, which impact the bottom line of every organization.
- Uncovering opportunities for new product or service offerings.

However, while Hadoop has addressed the big data challenge of inexpensively storing large amounts of data of all shapes and sizes, it has fallen short as an analytics platform, partly because of the limitations of the technology itself:

Not a database. Hadoop consists of multiple components, but fundamentally it is a file system, the Hadoop Distributed File System (HDFS), with a specialized programming framework (MapReduce) for building programs that process the data stored in HDFS. It is not easy to find individual records or record sets, and Hadoop lacks most of the capabilities of a typical database to organize, access, and manage data.

Batch vs. low latency. Hadoop is designed for large batch jobs, where analyzing a single data set can take hours or even days. Because Hadoop is a file system, finding even a single record requires scanning all the files. Batch processing also makes it difficult for data scientists to explore data and build analytic models, and Hadoop cannot support the interactive, ad hoc queries required by most business analysts and applications.

Programming required. Hadoop requires complex, specialized MapReduce programs to process data. MapReduce programmers are in short supply and rarely understand the business objectives for how the Hadoop data is to be used. Moreover, writing, testing, and running a MapReduce job to prepare and analyze data can take weeks, and every data request from a data scientist or business analyst must go through a Java programmer, creating a huge bottleneck.

Thus, the need for rare and expensive skill sets, long and error-prone implementation cycles, lack of support for popular reporting and BI tools, and inadequate execution speed have led to a search for an alternative that combines the benefits of Hadoop with SQL, the world's most widely used data querying language. This guide is designed to help organizations understand which capabilities matter most when choosing a SQL on Hadoop solution.
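The gap described above, where every question becomes a full pass over files rather than a declarative query an engine can plan, can be sketched in a few lines of Python. This is an illustrative toy only, using SQLite as a stand-in database; the records and names are invented for the example:

```python
import sqlite3

# Toy dataset standing in for records spread across files in HDFS.
records = [(1, 250.0), (2, 99.0), (3, 410.0), (2, 35.0)]

# MapReduce-style access: every query is a hand-written full pass.
def full_scan_total(recs, customer_id):
    """Sum amounts for one customer by scanning every record (batch style)."""
    return sum(amount for cid, amount in recs if cid == customer_id)

# SQL-style access: declare *what* you want; the engine plans *how*.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
sql_total = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE customer_id = ?", (2,)
).fetchone()[0]

print(full_scan_total(records, 2))  # 134.0
print(sql_total)                    # 134.0
```

Both paths return the same answer; the difference is that the SQL version requires no custom program per question, which is precisely the productivity argument this guide makes.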

SQL on Hadoop Benefits

SQL is the standard trade language for anyone who uses databases and works with data. There are more than one million SQL-trained users and only about 150,000 MapReduce programmers (with an estimated shortage of 170,000). Almost every enterprise application and business analyst tool uses SQL to manage data, so it makes sense that SQL should be the way business users access Hadoop data. Here are six more reasons to consider a SQL on Hadoop solution:

1. Use existing SQL-trained users: no retraining is needed for user access to Hadoop data.
2. Use existing BI tools: immediate productivity with existing SQL-based business intelligence and visualization tools.
3. Use existing SQL applications: no need to rewrite queries to access Hadoop-based data.
4. No hunting for Hadoop skills: no searching for hard-to-find, expensive MapReduce programmers.
5. No transferring of Hadoop data: no need to move data out of Hadoop for business users to access it with SQL.
6. No managing duplicate Hadoop data: duplicating and maintaining another copy of Hadoop data in a separate data management environment negates the cost benefit of Hadoop and requires additional IT resources for support.

Approaches to SQL on Hadoop

The SQL on Hadoop marketplace is crowded with solutions that fall broadly into three categories:

Hadoop Connector: The organization deploys both a Hadoop cluster and a DBMS cluster, on the same or separate hardware, and uses a connector to pass data back and forth between the two systems. This approach is expensive and hard to manage, and is most often taken by traditional data warehouse vendors. Solutions in this category include HP Vertica, Teradata, and Oracle.

SQL and Hadoop: Vendors take an existing SQL engine and modify it so that, when generating a query execution plan, it can determine which parts of the query should be executed via MapReduce and which via SQL operators. Data processed via SQL operators is copied from HDFS into a local table structure, which again requires managing data in both HDFS and local tables. Solutions in this category include Hadapt, RainStor, Citus Data, and Splice Machine.

SQL on Hadoop: Vendors build SQL engines from the ground up to enable native SQL processing of data in HDFS while avoiding MapReduce. These products have limited SQL language support, rudimentary query optimizers that can require hand-crafting queries to specify the join order, and no support for trickle updates. Product immaturity shows in the lack of internationalization support, limited security features, and the absence of workload management. Solutions in this category include Impala, Drill, Stinger, and HAWQ.

Unfortunately, all of these categories fall short of delivering true SQL access to Hadoop data. Thus, a new category, SQL in Hadoop, is needed for solutions that provide an industrialized, high-performance analytics database that runs natively in Hadoop on top of HDFS.

The Top 10 SQL in Hadoop Capabilities

Actian has worked with hundreds of customers to understand their big data requirements, including SQL access to Hadoop data. Here are the top 10 capabilities required of a true SQL in Hadoop solution:

1. Comprehensive: A SQL in Hadoop capability shouldn't be used in isolation. Organizations need to view SQL in Hadoop as part of a complete analytics solution covering the end-to-end process: connecting to raw big data, analyzing data natively on Hadoop, and delivering analytics results that impact the business. Capabilities should include data integration; data blending and enrichment; a discovery analytics and data science workbench; high-performance analytical processing; and SQL access to analytical results, all natively on Hadoop.

2. Data Quality: Organizations need to ensure that Hadoop data accessed by business users has been curated for quality. This is especially true if businesses will use the Hadoop data for operational decision making. The SQL in Hadoop solution should be part of an analytics platform that supports data blending, enrichment, and quality validation.

3. SQL: The solution needs to support standard ANSI SQL so that it works easily and flexibly with SQL-based applications, as well as standard business intelligence and visualization tools. This also ensures that existing SQL queries can access Hadoop data without modification. In addition, the SQL in Hadoop capability should support advanced analytics, such as cubing and window functions.

4. Performance Optimized: The solution should be mature technology, strongly rooted in data management, with published TPC benchmarks. It should include a proven cost-based query planner and optimizer that makes optimal use of all available resources, from the node, memory, and cache all the way down to the CPU.

5. Security: The solution should include the native security expected of any enterprise-quality database management system (DBMS), including authentication, user- and role-based security, data protection, and encryption.

6. Read/Write: The ability not only to access but also to update and append Hadoop data is important. The SQL in Hadoop solution should be able to both read and write data on the HDFS nodes. It should also be fully ACID-compliant, with multi-version read consistency and system-wide failover protection.

7. Compression: Hadoop can reduce storage costs, but it still requires replicating copies for high availability. The SQL in Hadoop solution should compress data by at least a factor of 10 and store it in a native columnar format for faster SQL performance.

8. Manageability: The SQL in Hadoop solution should be certified as YARN-ready, ensuring it uses Hadoop YARN to automatically manage resources and nodes on the cluster. The solution should also include a web-based systems management console to monitor analytics and query processing.

9. Architecture: The SQL in Hadoop solution should expand and scale as the cluster grows. It should handle extremely complex queries, thousands of nodes and SQL users, and petabytes or more of data.

10. Native: SQL in Hadoop should run natively, without moving data off the HDFS nodes into a separate database or file system. It should also be flexible enough to support the most commonly used Hadoop distributions, including Apache, Cloudera, Hortonworks, and MapR.

SQL in Hadoop Decision Criteria

Choosing a SQL in Hadoop solution is a critical decision that will have a long-term impact on your application infrastructure and your ability to scale to support the needs of business users. While you may not need a solution that meets all of these criteria immediately, it is important to consider how your requirements may change over time.

1. Comprehensive: Is the solution part of a complete analytics platform with:
- Connectors to all types of disparate data (DBMS, file systems, social media, machine/sensor data, SaaS applications)?
- Parallelized integration and loading of data into Hadoop?
- A visual data science workbench to build reusable data and analytics workflows?
- A comprehensive set of hundreds of analytics and transformation functions?
- The ability to analyze massive amounts of structured and unstructured data without sampling?
- A high-performance data flow engine that is 10x faster than MapReduce for processing analytic workflows?
- The ability to blend and enrich Hadoop data with non-Hadoop data to improve context and analytical accuracy?

2. Data Quality: If business users are to use the Hadoop data for operational decision making, is the solution part of a complete analytics platform with the ability to perform:
- Profiling to assess the data and discover inconsistencies?
- Data cleansing and validation to improve data quality and consistency?

3. SQL: Does the solution support:
- Standard ANSI SQL, enabling the use of existing SQL without rewrites?
- Advanced analytics, including cubing, grouping, and window functions?
- Standard enterprise and desktop BI and visualization tools?

4. Performance Optimized: Does the solution have:
- A mature and proven cost-based query planner?
- Optimal use of all available resources, including node, memory, cache, and CPU?
- Support for use cases requiring responses in milliseconds to seconds rather than minutes to hours?
- Fast lookups on small subsets of the data?
- Fast performance on massive joins?
- Published performance benchmarks?

5. Security: Does the solution support:
- Role-based security?
- Authentication through LDAP or Active Directory?
- Column-level data encryption?
- User-level and role-level data access?

6. Read/Write: Does the solution provide:
- The ability to both read and append/update data in HDFS?
- Concurrent reads and updates without locking up the database?
- Full ACID compliance with multi-version read consistency?

7. Compression: Does the solution:
- Compress the data by at least a factor of 10 to reduce the amount of Hadoop storage?
- Store the data in a columnar format for faster access?

8. Manageability: Does the solution provide:
- YARN for automated Hadoop cluster resource management?
- Certification as YARN-ready?
- A web-based management console for monitoring analytic and query processing?

9. Architecture: Is the solution:
- Based on a mature, proven data management platform?
- Able to scale linearly to handle thousands of users, nodes, and petabytes of data?
- Able to provide system-wide failover protection?

10. Native: Does the solution:
- Run natively on Hadoop (HDFS) without moving the data or using a separate integration layer?
- Run natively on the most common Hadoop distributions, such as Apache, Cloudera, Hortonworks, and MapR?

11. Vendor: Is the vendor:
- Built upon years of deep data management and Hadoop experience?
- Well financed and stable?
- Present worldwide, with global customer support?
- Backed by a strong customer base?
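As a concrete instance of the window-function support the SQL criterion above asks about, here is a short Python sketch. SQLite (3.25+, bundled with Python 3.8+) serves only as a stand-in engine; a SQL-in-Hadoop database would accept the same ANSI SQL over HDFS-resident data. The table and column names are invented for illustration:

```python
import sqlite3  # window functions require SQLite 3.25+, bundled with Python 3.8+

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, day INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", 1, 100.0), ("east", 2, 50.0),
     ("west", 1, 80.0), ("west", 2, 120.0)],
)

# A running total per region -- the kind of window function the checklist names.
rows = conn.execute("""
    SELECT region, day,
           SUM(amount) OVER (PARTITION BY region ORDER BY day) AS running_total
    FROM sales
    ORDER BY region, day
""").fetchall()

print(rows)
# [('east', 1, 100.0), ('east', 2, 150.0), ('west', 1, 80.0), ('west', 2, 200.0)]
```

Because the query is standard SQL, an existing BI tool or report that emits it needs no rewrite when pointed at a Hadoop-resident table, which is exactly the portability the checklist is probing for.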

Actian: The High-performance SQL in Hadoop Database

Actian Vector in Hadoop uniquely offers the first true SQL in Hadoop solution. A key capability of the Actian Analytics Platform Hadoop SQL Edition, Actian Vector in Hadoop represents the next generation of innovation at Actian and takes a unique approach to bringing all the benefits of an industrialized, high-performance SQL database together with Hadoop.

Figure 1: Actian Analytics Platform Hadoop SQL Edition Capabilities

Actian Vector in Hadoop contains a mature RDBMS engine that performs native SQL processing of data in HDFS. It offers rich SQL language support, an advanced query optimizer, and support for trickle updates, and it has been certified for use with the most popular BI tools. It is built from mature vector database technology that has been hardened in the enterprise, with support for localization and internationalization, advanced security features, and workload management. It has also been benchmarked at more than 30 times faster than other approaches to SQL on Hadoop.

Figure 2: Actian Analytics Platform offers a fully functional analytics database in Hadoop

To understand what is special about Actian Vector in Hadoop, here are the key performance features that make it unique:

1. CPU Exploitation: Actian Vector was written from the ground up to take advantage of performance features in modern CPUs, resulting in dramatically higher data processing rates than other relational databases. Actian Vector in Hadoop leverages these innovations and brings this processing power to the data nodes in a Hadoop cluster.

2. Single Instruction, Multiple Data (SIMD): SIMD allows a single operation to be applied to every value in a set of data at once. Actian Vector takes advantage of SIMD by processing vectors of data through the Streaming SIMD Extensions instruction set. Because typical analytic queries process large volumes of data, SIMD brings the average computation per data value down to less than a single CPU cycle.

3. Parallel Execution: Actian implements a flexible, adaptive parallel execution algorithm that can be scaled up or scaled out to meet specific workload requirements. Actian Vector can execute statements in parallel using any number of CPU cores on a server or across any number of data nodes in a Hadoop cluster. Taking the raw power of the Vector data processing engine to the HDFS data is what gives Actian Vector in Hadoop its unique performance characteristics.

4. Updates via Positional Delta Trees (PDTs): Actian Vector in Hadoop implements a fully ACID-compliant transactional database with multi-version read consistency. One of the biggest challenges with HDFS is that it is not designed for incremental updates. Actian addresses this with high-performance, in-memory Positional Delta Trees (PDTs), which store small incremental changes (inserts that are not appends), as well as updates and deletes.

5. Out-of-the-box YARN Support: YARN provides resource negotiation and management for the entire Hadoop cluster and all applications running on it. Actian Vector is the first SQL in Hadoop capability certified as YARN-ready. This means Actian query workloads can run as first-class citizens on Hadoop clusters, sharing resources side by side with MapReduce-based applications.
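The vector-at-a-time execution idea behind the CPU features above can be illustrated with a toy Python sketch. This is not SIMD itself, which operates at the CPU-instruction level, but it shows the same amortization principle: pay per-call overhead once per block of values rather than once per value. Names and sizes here are invented for the example:

```python
from array import array

data = array("d", range(100_000))  # one column of values, as doubles
VECTOR_SIZE = 1024                 # fixed-size blocks ("vectors")

def tuple_at_a_time(values):
    # Classic row-at-a-time execution: one round of interpretation
    # overhead is paid for every single value.
    total = 0.0
    for v in values:
        total += v * 2.0
    return total

def vector_at_a_time(values):
    # Block-at-a-time execution: overhead is paid once per vector of
    # 1024 values, and the work inside each call is a tight bulk
    # operation -- the same amortization SIMD performs in hardware.
    total = 0.0
    for start in range(0, len(values), VECTOR_SIZE):
        block = values[start:start + VECTOR_SIZE]
        total += sum(block) * 2.0
    return total

assert tuple_at_a_time(data) == vector_at_a_time(data)
```

In a real vectorized engine the per-block bulk operation compiles down to SIMD instructions over cache-resident data, which is where the sub-cycle-per-value throughput claimed above comes from.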
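The idea behind Positional Delta Trees, leaving the base data immutable and merging small in-memory deltas at read time, can be sketched as follows. This is a hypothetical, drastically simplified model for illustration only; real PDTs are tree structures keyed by position and are considerably more sophisticated:

```python
class DeltaColumn:
    """Toy sketch of delta-on-immutable-base reads (illustrative only)."""

    def __init__(self, base):
        self.base = tuple(base)   # immutable base data, as stored in HDFS
        self.updates = {}         # position -> new value, kept in memory
        self.inserts = []         # appended values

    def update(self, pos, value):
        # Record the change in memory; the base file is never rewritten.
        self.updates[pos] = value

    def append(self, value):
        self.inserts.append(value)

    def scan(self):
        # Merge the in-memory deltas with the immutable base at read time.
        for pos, value in enumerate(self.base):
            yield self.updates.get(pos, value)
        yield from self.inserts

col = DeltaColumn([10, 20, 30])
col.update(1, 25)
col.append(40)
print(list(col.scan()))  # [10, 25, 30, 40]
```

This is how an append-only file system like HDFS can still back a database that accepts updates and deletes: the deltas absorb the changes, and readers see a merged, consistent view.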
Summary

If you need to analyze large volumes of data on Hadoop and deliver scalable, enterprise-grade SQL access to Hadoop data to your business, you should strongly consider Actian for its industrialized, high-performance SQL in Hadoop solution. Actian's SQL in Hadoop is the foundation for revolutionary performance gains in database processing, gains so game-changing that they appear on our competitors' long-term roadmaps. Implement an easy-to-deploy, easy-to-use, ANSI-compliant solution and benefit from significantly better query performance than any other SQL on Hadoop solution.

Learn more

Learn more about the Actian Analytics Platform Hadoop SQL Edition, which includes Actian Vector in Hadoop, our high-performance SQL in Hadoop capability: actian.com/hadoop

Try it for free!

Experience the difference of having a true analytics database for high-performance SQL access to Hadoop data. Download and try Actian's SQL in Hadoop capability free for 30 days: bigdata.actian.com/sql-in-hadoop

About Actian: Accelerating Big Data 2.0

Actian transforms big data into business value for any organization, not just the privileged few. Organizations discover new sources of revenue, business opportunities, and ways of mitigating risk with high-performance, in-database analytics complemented by extensive connectivity and data preparation. Actian makes Hadoop enterprise-grade by providing high-performance data blending and enrichment, visual design, and SQL analytics on Hadoop without the need for MapReduce skills. Among the tens of thousands of organizations using Actian are innovators using analytics for competitive advantage in industries such as financial services, telecommunications, digital media, healthcare, and retail. The company is headquartered in Silicon Valley and has offices worldwide.

www.actian.com
500 Arguello Street, Ste. 200, Redwood City, CA 94063
+1.888.446.4737 [Toll Free] +1.650.587.5500 [Tel]

© 2014 Actian Corporation. Actian, Big Data for the Rest of Us, Accelerating Big Data 2.0, and Actian Analytics Platform are trademarks of Actian and its subsidiaries. All other trademarks, trade names, service marks, and logos referenced herein belong to their respective companies. (WP05-0914)