Which SQL Engine Leads the Herd?

October 2014 Which SQL Engine Leads the Herd? A Comparison of three leading SQL-on-Hadoop Implementations for compatibility, performance and scalability

Which SQL Engine Leads the Herd? 2 Contents Executive Summary... 2 The case for SQL on Hadoop... 3 Standards are job one... 3 Evaluating SQL-on-Hadoop solutions... 3 SQL-on-Hadoop not all are equal... 4 Beware of cherry pickers... 5 Performance matters too... 5 Big SQL means Big Investment Protection... 6 Value beyond compliance... 6 So Who leads the herd?... 7 For More Information... 7 Executive Summary In an independently audited benchmark of three popular SQLon-Hadoop implementations, IBM showed that Hadoop is ready to run OLAP and complex query workloads at a fraction of the cost of traditional systems that is, if you choose the right technology. Compatibility Matters IBM s Big SQL was the only solution able to execute all 99 TPC-DS queries at scale with minor modifications permissible under TPC rules. Cloudera s Impala was able to run 52 queries, and Hive ran 58 1 queries in a manner that complied with the TPC rules. For the remaining queries, re-writes were required (some extensive) and some queries could not be made to run at all. Lack of SQL compatibility is a nuisance in a benchmark, but could be a costly disaster in a production environment. Throughput matters When comparing Big SQL against the subset of queries that could actually run on competing platforms, IBM ran queries on average 3.6 times faster than Impala and 5.4 times faster than Hive on a 10 TB scale test 2. Not only was Big SQL the only engine able to run the Hadoop DS workload it ran the workload significantly faster as well for both Hadoop-DS single-user and multi-user tests. Scale matters Hadoop is about big data after all. IBM had originally planned to compare all three vendors at a 30TB scale, but it achieving stability at scale was a challenge. While Big SQL could reliably execute all queries at a 30TB scale, the competitive platforms could not, exhibiting various run-time errors. The comparison was made instead at the 10TB scale where results could be repeated, and thus audited. These findings are compelling. Not only was IBM s Big SQL the only Hadoop solution tested able to actually run the complete set of queries, but it was found to be the fastest, the most scalable, and the most reliable as well. With so many vendors making claims about the performance and compatibility of SQL-on-Hadoop, IBM decided to put leading distributions to the test, conducting the first ever Hadoop-DS benchmark. The test compared IBM s Big SQL with Cloudera s Impala and Hortonworks Hive 0.13. Hadoop-DS is a Hadoop Decision Support benchmark developed by IBM modeled after the highly regarded Transaction Processing Council Decision Support (TPC-DS ) benchmark. To help make the process fair, IBM established three competing teams running each Hadoop distribution on identical hardware configurations. IBM engaged the services of an independent TPC professional to audit and help document the result. Among the key findings were: 1 These results refer to initial testing at 1 GB scale for compatibility. At 10 TB scale, both Impala and Hive ran fewer queries. 2 This result based on the Hadoop DS single-user run. Detailed performance results for single and multi user runs are detailed in a separate benchmark report.

Which SQL Engine Leads the Herd? 3 The case for SQL on Hadoop In today s data centers SQL has become a ubiquitous way to access and manipulate data. No longer a tool used just by developers and database administrators, today most professionals and analysts have at least some knowledge of SQL or use tools that rely on SQL as a standard. While non-structured data types gets all the attention for big data workloads, the majority of real projects involve transactional or log data 3 data formats generally well-suited to manipulation with SQL. While Hive was the only game in town just a few years ago, today there are at least a dozen competing commercial and open source efforts around SQL-on- Hadoop. Vendors are competing based on performance, compatibility, and the ability to scale to support real-world production workloads. Standards are job one Standards are important in every industry. They help reduce cost, expand markets, spur innovation, reduce risk and generally give organizations a competitive edge. This is true of SQL as well. In almost every organization, SQL is at the heart of enterprise data used in transactional systems, data warehouses, columnar databases and analytics platforms to name just a few examples. Additionally, a vast number of commercial and in-house developed tools used to access, manipulate and visualize data rely on SQL. SQL is lifeblood of the modern transaction and decision support systems. The last thing an organization wants to do is introduce technology that is not compatible with what they have. It s nice to be able to use open source software, but at the end of the day standardization is what matters most the software needs to work. Evaluating SQL-on-Hadoop solutions As customers know, testing their own applications is the only benchmark that matters, but when it comes to a standard benchmark, the TPC Benchmark DS (TPC-DS) is among the most thorough. TPC-DS is a decision support benchmark that models several aspects of the business operations of a global retailer. Comprised of 99 separate queries, it models real-world business operations that companies in this and other industries would find familiar. While there are no official results at the time of this writing, TPC-DS is widely regarded as a fair and complete benchmark. The rigor and realism of the benchmark makes it almost impossible for vendors to game the benchmark as long as they are properly adhering to the benchmark specification and rules. 3 70% of 465 survey respondents cite transactional data as a primary target for big data initiatives - Gartner research note Survey Analysis - Big Data Adoption in 2013 Shows Substance Behind the Hype Sept 12 2013 Analyst(s): Lisa Kart, Nick Heudecker, Frank Buytendijk

Which SQL Engine Leads the Herd? 4 SQL-on-Hadoop systems cannot meet several of the technology requirements of the TPC-DS benchmark, so IBM modeled the Hadoop-DS benchmark on TPC-DS, using the same data sets and queries, but not performing data maintenance operations, and not enforcing referential integrity or meeting other benchmark requirements not feasible with Hadoop systems. The benchmark is designed to model systems where operational data is used both to make business decisions quickly and to direct long range planning and operation. The types of queries involved fall broadly into four different categories. Reporting queries Ad-hoc queries Iterative OLAP queries Data mining queries Because the sizes of businesses vary, the benchmark is designed to scale also model different sizes of warehouses. Standard scale sizes are 100GB, 300GB, 1TB, 3TB, 10TB, 30TB and 100TB. SQL-on-Hadoop not all are equal One of the first hurdles in conducting the benchmark is simply getting the queries to run across all three Hadoop environments. From this point of view, not all SQL-on-Hadoop implementations are created equal. As shown in Figure 1, in initial testing IBM Big SQL was able to run 99 of the standard TPC-DS queries after building the dataset. 87 queries ran out-of-the box and an additional 12 were easily modified within a few hours with minor syntax changes allowable under the TPC-DS benchmark specification 4. Figure 1 Query compatibility by SQL-on-Hadoop solution Other distributions did not fare so well. In the case of Cloudera s Impala, 35 queries ran un-modified, 17 required minor modifications complying with TPC-DS rules, and 36 required more extensive non-compliant modifications. More concerning was that some of the 99 queries could not be run at 4 Section 4.2 of the TPC-DS 1.1.0 specification available from tpc.org provides rules around what types of modifications are permissible and which are not.

Which SQL Engine Leads the Herd? 5 all either because no re-write to the query was found or because the queries would fail at run-time. In the case of Hive.13 the situation was similar. 32 queries ran out-the-box, an additional 26 queries ran with compliant modifications, and 13 queries could be re-written with noncompliant modifications. As team scaled up the size of the test however queries that worked at smaller scale stopped working. At a 10TB dataset size 30 of the queries would not run at all. This exercise highlights the challenge that customers can be expected to encounter when seeking to adapt existing SQL schemas and applications to SQL-on-Hadoop implementations. Beware of cherry pickers Vendors have been making many performance claims related to the TPC-DS benchmark, cherry picking queries from the suite of 99 queries, and publishing only those queries that happen to work, and show an offering in the best possible light. In some cases, vendors have even altered table schemas to avoid compatibility issues or boost performance. Clearly this is not a proper way to run a benchmark. In fact, the rules of the benchmark specifically forbid this practice. The real news is not that selected queries can be made to run faster, but that many of the ANSI SQL queries in the benchmark simply don t run at all on competitive platforms. You can just imagine the challenges associated with getting your own production application running on a database platform that doesn t support 50% of your standard queries. This would amount to a re-write of the application, introducing risk, added-costs and certain delays. Performance matters too A full comparison between Impala, Hive and Big SQL could not be made because Hive and Impala could only run a subset of the queries. It was still possible though to compare results for the common set of 46 queries that all distributions were able to run at a 10 TB scale. Figure 2 shows a direct comparison of the elapsed time for the common set of queries across the three distributions. Although Big SQL was able to run all 99 queries, in this comparison only the queries that Hive and Impala were able to run are included in this result. IBM s Big SQL was able to complete all the common queries in 48 minutes and 28 seconds while Impala took 2 hours, 55 minutes and 36 seconds. Hive.13 exhibited the worst performance running the queries in 4 hours, 25 minutes and 49 seconds.

Which SQL Engine Leads the Herd? 6 18,000 16,000 14,000 12,000 10,000 8,000 6,000 4,000 2,000 Figure 2: Time in seconds to run the common set of 46 queries on each SQL-on-Hadoop platform The same ranking was observed in the multiuser test consisting of four concurrent streams of queries executing. Big SQL means Big Investment Protection The good news for IBM customers is that Big SQL offers significant investment protection. Not is Big SQL the only SQL-on-Hadoop test that was able to run all the queries, it was also the fastest, and the only engine to scale to a 30 TB dataset size. What s even better is that customers don t need to compromise on standards. Rigorous SQL compatibility means that customers can: 0 Big SQL Impala Hive Leverage existing investments in software, tools and people skills Run existing applications where appropriate over SQL-on-Hadoop data stores Runs queries faster, more efficiently and at a larger scale translating into a lower operations costs Value beyond compliance Standards compliance and performance are essential, but it is also important that a chosen SQL implementation play nice with others. In Hadoop, playing nice means a number of things: Supporting open data formats Using standard client-side database drivers Supporting built-in functions that SQL users expect Providing sophisticated security capabilities Federated access to multiple data sources Open, standard data formats SQL is a useful language. Thanks to standardization and decades of maturation, it is well known and adept at solving many problems. It is not however the only language available, nor is it always the best solution for every problem. Hadoop has an ever-expanding array of languages and tools for analyzing large datasets, but to be able to use these rich tools, data needs to exist in standard Hadoop data formats. Hadoop enthusiasts will be pleased to know that there is no such thing as a Big SQL data store. A table defined in Hive is a table defined in Big SQL and vice versa. Big SQL supports 100% native HDFS file formats. This is not true of all distributions, and customers should be wary of SQL-on- Hadoop implementations that introduce their own proprietary metadata. Common client-side drivers Beyond the ability to share SQL across platforms, Big SQL supports standard IBM client drivers allowing the same set of standards-compliant JDBC, JCC, ODBC, CLI and.net drivers to be used across multiple databases and operating systems. Clients using these drivers can access IBM Big SQL, DB2, IBM Informix and third party database environments transparently. By combining a standard SQL implementation with industry standard drivers, the number of ISV applications that can interact seamlessly with IBM s Big SQL is enlarged. Built-in functions drive productivity - Having an SQL implementation that merely works is different than having an environment that makes users productive. Incorporating a rich library of over 250 built-in functions along with SQL OLAP functions, Big SQL is built for analytics. It provides advanced features including sub-query support, additional SQL types and global (session) variables. With these additional features users

Which SQL Engine Leads the Herd? 7 can simply do more things, and answer more questions from within the SQL environment. When using other SQL solutions that may lack these features, users may find themselves writing custom code to implement the same capabilities already built into Big SQL. Security and auditing For some SQL implementations, security is an afterthought. Big SQL was built with security in mind. User authentication is handled using standard mechanisms including LDAP and Kerberos so that Big SQL fits seamlessly into your enterprise environment. Big SQL supports flexible authorization controls based on users, groups and roles. It uses standard SQL GRANT and REVOKE syntax familiar to database administrators. In addition to basic tablelevel access controls, Big SQL supports fine-grained role and column level access controls (RCAC). Fine grained access control and features like data masking help expand the range of solutions that Big SQL is applicable to. In addition to flexible authentication and authorization, Big SQL also provides extensive auditing facilities. In short, Big SQL brings the rich security features that RDBMS administrators expect to the world of Hadoop. Federated queries In modern data centers, data seldom exists in one place. Some data will exist in relational databases and other data will be in data warehouses or specialized column-oriented databases. Big SQL supports rich federation capabilities allowing users to write queries that access not only Hadoop-based data, but other databases as well. A single query may join data from Big SQL, Hive on Hadoop, a table on a Teradata warehouse, and data from an Oracle database. So Who leads the herd? The findings of this benchmark are compelling. Organizations are heavily invested in SQL. The last thing customers need is one SQL dialect for Hadoop, and another for their existing database environments. The fact that Big SQL was the only SQL-on-Hadoop implementation able to actually run the Hadoop-DS workload is important. The fact that it is also faster, more scalable, more stable, and has a richer set of features is very impressive indeed. Big SQL was the only implementation able to run the full Hadoop DS benchmark with all 99 queries Big SQL delivered over three times the performance of the nearest competitor in the single user test Big SQL was the only offering able to scale to 30 TB and run the full workload at that scale This result should not be surprising. IBM invented SQL after all and has over 30 years of experience building SQL query engines and optimizers. When it comes to SQL-on-Hadoop, IBM InfoSphere BigInsights with Big SQL clearly leads the herd. For More Information To learn more about Big SQL, download the free IBM whitepaper SQL-on-Hadoop without compromise at https://www14.software.ibm.com/webapp/iwm/web/signup.do? source=sw-infomgt&s_pkg=ov23626 To try IBM Big SQL for free, download IBM s free InfoSphere BigInsights QuickStart Edition, or run BigInsights in the cloud at http://www.ibm.com/software/data/infosphere/hadoop

Which SQL Engine Leads the Herd? 8 Please note: Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here. Copyright IBM Corporation 2014 IBM Canada 3600 Steeles Ave East Markham, Ontario L3R 9Z7 Produced in Canada October 2014 All Rights Reserved IBM, the IBM logo, ibm.com, BigInsights, Cognos, DB2, Informix, InfoSphere, PureData and z/os are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol ( or ), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at Copyright and trademark information at ibm.com/legal/copytrade.shtml TPC Benchmark, TPC-DS, and QphDS are trademarks of Transaction Processing Performance Council Cloudera, the Cloudera logo, Cloudera Impala are trademarks of Cloudera. Hortonworks is a trademark of Hortonworks Inc. Hadoop and Hive are trademarks of the Apache Software Foundation Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. UNIX is a registered trademark of The Open Group in the United States and other countries. Other company, product and service names may be trademarks or service marks of others. References in this publication to IBM products and services do not imply that IBM intends to make them available in all countries in which IBM operates. IMW14799-USEN-00