SQream Technologies
SQream DB: GPU-Based SQL Database
Technical Overview White Paper
Overview

SQream DB is an analytic database built from scratch to harness the performance of graphics processing units (GPUs) for handling petabyte-scale data, saving its users significant time and resources. This cost-effective solution gives BI teams, data scientists, engineers and even marketing teams new possibilities in big data analytics.

Running on one or more NVIDIA GPUs, SQream DB is capable of processing enormous data sets up to 100 times faster than any other leading data warehouse solution available today. It integrates easily with existing tools and relational SQL queries, boosting productivity while reducing infrastructure and operating costs. In tangible terms, running 100 times more queries while lowering the TCO makes SQream DB a valuable asset to any organization handling big data analytic workloads.

The SQream Advantage

With data creation exploding worldwide, organizations need to make use of, and stay on top of, the data they collect. They face a serious challenge in storing immense volumes of structured and semi-structured data, analyzing it and obtaining rapid, real-time, actionable insights from it. Organizations whose data is scaling quickly need a high-performance solution that continues to perform well on multi-petabyte data sets and heavy workloads. SQream DB is designed to address these needs, with the following four main advantages:

Small Server Size

SQream DB is designed from the ground up to serve as a powerful database while requiring as little as a single standard tower server or a 2U rack-mount enclosure.
Compared with a full 42U rack of vendor-supplied enclosures such as Teradata, Oracle Exadata and IBM PureData System for Analytics (formerly Netezza), a single 2U server is capable of yielding equal or better query execution performance. The savings in hardware, power, floor space, cooling and maintenance are enormous. SQream DB is not limited to the 2U form factor and can scale to larger configurations supporting multiple GPUs.

Scale

A GPU is a Massively Parallel Processor (MPP) on a Card

The idea behind SQream's architecture is to harness the readily available power of thousands of parallel processing cores in a cost-effective GPU, in order to compete with and overtake standard and parallel DBMS solutions running on dozens of expensive general-purpose processors.
[Figure: a multi-core CPU (up to 32 cores) compared with a GPU (up to 2880 cores), each shown with its cache and RAM.]

A 32-core CPU installation (latency-oriented) requires a lot of power and can cost thousands of dollars. A single throughput-oriented GPU, on the other hand, can have nearly 3000 onboard cores, delivering superior performance at a significantly lower cost and with 90% lower power consumption. With up to 20 times more processing power per node, suited to aggressive data operations, plus outstanding speed and scalability, it is easy to see how SQream DB benefits from the use of GPUs.

While other clustered solutions may be massively parallel through scaling out computers, SQream DB is massively parallel through the thousands of cores on board its GPUs. Moreover, several GPUs can be linked together inside the same enclosure, reducing both memory and network I/O while decreasing network load and latency.

Simplicity in Integration

Implementing SQream DB could not be easier. SQream DB uses the familiar ANSI SQL syntax, so there is no need for data remodeling and no new skills need to be acquired. Employees don't need retraining and do not have to rewrite hundreds of queries. Even third-party ETL and BI tools can be connected and used via industry-standard ODBC/JDBC interfaces, without hiring integration specialists.
[At the time of writing, SQream DB had been tested to work with the following ETL and BI tools: Pentaho, Talend, Informatica, DataStage, SSIS, QlikView, Spotfire, Tableau, Business Objects and even Excel.]

Simplicity by Design

SQream DB is a columnar database in which each column is stored as a collection of data chunks, each containing millions of values. SQream DB automatically creates smart metadata on top of each column and every data chunk. This smart metadata replaces the indexing used by most databases, eliminating the lengthy and limiting process of index creation while new data is ingested. The result is a smart grid for accessing any desired data on demand, at petabyte scale.

[Figure: SQream Database Architecture. Components shown: connectors (JDBC, .NET, ODBC), SQream Server, SQL Parser, Optimizer, Resource Manager, CPU/GPU execution graph, Runtime, I/O Manager, SQream Storage, Metadata, ext4/NTFS.]
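The per-chunk metadata described under Simplicity by Design can be pictured with a small sketch. This paper does not say what the smart metadata actually contains, so the sketch assumes the simplest plausible form: the minimum, maximum and row count of every chunk. The names ChunkMeta, build_metadata, chunks_to_read and count_rows are hypothetical and are not part of SQream DB.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChunkMeta:
    """Assumed metadata for one chunk: value range and row count."""
    lo: int
    hi: int
    count: int


def build_metadata(chunks: List[List[int]]) -> List[ChunkMeta]:
    """Compute metadata for every chunk, as would happen once at ingestion."""
    return [ChunkMeta(min(c), max(c), len(c)) for c in chunks]


def chunks_to_read(meta: List[ChunkMeta], lo: int, hi: int) -> List[int]:
    """Return indices of chunks whose value range overlaps [lo, hi].

    Chunks that cannot contain a matching value are never read from disk.
    """
    return [i for i, m in enumerate(meta) if m.hi >= lo and m.lo <= hi]


def count_rows(meta: List[ChunkMeta]) -> int:
    """Answer a COUNT(*) style query from metadata alone, without scanning."""
    return sum(m.count for m in meta)
```

With three chunks covering the ranges 1-9, 10-19 and 20-29, a filter on values between 12 and 21 needs only the second and third chunks, and a row count needs no chunk data at all. No index has to be declared or maintained for either query, which is the effect the paper attributes to smart metadata.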
Relational Algebra

SQream DB utilizes a concept called relational algebra, first proposed by Edgar F. Codd of IBM Research in 1969. This powerful model is grounded in set theory and is used by many SQL engines. Its operations, such as filters and joins, are concepts as fundamental as addition and multiplication are in arithmetic. Relational algebra is therefore not only well studied but also comprehensively battle-tested in real-world applications.

By transforming relational SQL queries into highly parallelizable relational algebra, SQream DB can efficiently perform complex operations on the massively parallel GPU cores. These operations are performed internally by the SQream DB compiler and require no user intervention.

Performance

Relational Algebra Optimizations

The SQream DB compiler does a lot of the heavy lifting. The compiler processes the given SQL query (from standard ODBC or JDBC connectors), creates an execution plan and then optimizes it. The result is an equivalent query that produces the same results but runs a lot faster. Because SQream DB works in a massively parallel environment, most of the optimizations involve combining repeated work and choosing alternative paths that reduce repeated processor and I/O operations.

GPU Parallelism

SQream DB's main processing power comes from the massively parallel NVIDIA GPU. The execution plan that the compiler chooses is optimized specifically for the NVIDIA GPU, resulting in high-speed, real-time performance at large scale. Using original patent-pending concepts, SQream DB's compiler and compressors reduce the amount of I/O and repeated operations before the data is even transferred to the GPU, which gives a large speed advantage on complex queries.

Storage

SQream DB utilizes powerful and robust columnar storage, split up into GPU-manageable chunks.
While some newer DBMS solutions are semi-columnar, SQream DB is fully columnar in both the storage and the query engine.

Vertical partitioning - columnar storage: This feature allows selective access to only the required subset of columns, reducing disk scan and memory I/O time compared with standard row storage. This seemingly straightforward concept enables SQream DB to operate very quickly.

Horizontal partitioning - extent storage: SQream DB automatically splits the storage horizontally into manageable chunks, enabling optimal use of the hardware resources and of the relatively small memory available in GPUs compared with CPU RAM.
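The two partitioning schemes can be sketched in a few lines. This is an illustration of the general idea only, not SQream DB's storage format: the function names and the tiny chunk size are assumptions, and the rows mirror the first three columns of the paper's sample employee table.

```python
from typing import Dict, Iterator, List, Sequence, Tuple

# Hypothetical chunk size for the demo; the paper speaks of millions of values.
CHUNK_ROWS = 3

COLUMNS = ["Emp_no", "Dept_id", "Hire_date"]
ROWS = [
    (1, 1, "2012-01-01"),
    (2, 1, "2014-05-16"),
    (3, 1, "2014-01-22"),
    (4, 2, "2012-06-08"),
    (5, 2, "2013-04-25"),
    (6, 3, "2013-08-01"),
]


def to_column_chunks(
    rows: Sequence[tuple], columns: List[str], chunk_rows: int = CHUNK_ROWS
) -> Dict[str, List[list]]:
    """Vertical partitioning (one list per column), then horizontal
    partitioning (each column list cut into fixed-size chunks)."""
    store: Dict[str, List[list]] = {}
    for idx, name in enumerate(columns):
        values = [row[idx] for row in rows]
        store[name] = [
            values[i:i + chunk_rows] for i in range(0, len(values), chunk_rows)
        ]
    return store


def scan(store: Dict[str, List[list]], wanted: List[str]) -> Iterator[Tuple[str, list]]:
    """Yield chunks of the requested columns only; other columns are untouched."""
    for name in wanted:
        for chunk in store[name]:
            yield name, chunk
```

A query that touches only Dept_id reads two small chunks and never loads Emp_no or Hire_date, and each chunk is small enough to be handed to a processor with limited memory one piece at a time.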
[Figure: a sample employee table. The figure repeats rows 1-3 as a horizontal chunk, and the first five values of Emp_no, Dept_id and Hire_date as column chunks.]

Emp_no  Dept_id  Hire_date   Emp_in   Dept_in
1       1        2012-01-01  Smith    John
2       1        2014-05-16  Johnson  Barbara
3       1        2014-01-22  Miller   Amanda
4       2        2012-06-08  Taylor   Evelyn
5       2        2013-04-25  Wilson   Bob
6       3        2013-08-01  Brown    Jim

Smart Metadata

Smart metadata is generated automatically for each chunk, on the fly, while data is ingested. It enables the immediate pinpointing of exactly the data required by each query. With leading RDBMS solutions, DBAs need to set up indexing on at least a few columns. With SQream DB's smart metadata, DBAs do not need to perform any data modeling or create indexes or primary keys, as these needs are handled automatically during data ingestion. The result is a smart grid for accessing and querying any desired data on demand, at petabyte scale.

Smart metadata enables ultra-fast, sub-second responses to specific queries such as SELECT COUNT or SELECT DISTINCT. SQream DB uses the smart metadata extensively, saving significant processing and I/O time by pinpointing the data chunks involved in each query.

SQream DB offers ultra-fast data ingestion. Processing is done on the GPU, leaving the CPU free to perform heavy I/O. As a result, the server can ingest up to 2 TB of data through ETL operations each hour, even with a basic configuration consisting of a single GPU card.

Compression

By using cutting-edge but well-established compression algorithms specially tuned for fast operation, SQream DB reduces disk storage size while still maintaining very fast queries. In fact, the compression algorithms are so fast that most hard drives will be the bottleneck of the compress/decompress process.
SQream's compression and decompression are performed on the fly on the GPU, 50 times faster than on a standard CPU. This is fast enough that SQream DB compresses and decompresses everything, whereas other leading databases compress only some of the data.
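The paper does not name the compression algorithms SQream DB uses. As a stand-in, the sketch below shows run-length encoding, a well-established scheme that suits columnar chunks because neighbouring values in one column are often equal. It runs on the CPU in plain Python purely for illustration; rle_encode and rle_decode are hypothetical names.

```python
from itertools import groupby
from typing import List, Tuple


def rle_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Collapse each run of equal neighbouring values into (value, run_length)."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def rle_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Expand (value, run_length) pairs back into the original column chunk."""
    out: List[int] = []
    for value, length in runs:
        out.extend([value] * length)
    return out
```

Applied to the Dept_id column of the sample employee table, the six values 1, 1, 1, 2, 2, 3 become three pairs, and decoding restores the chunk exactly. Each run can be expanded independently of the others, which is the kind of property that lets a decompression step be spread across many cores.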
Scaling

Linear scaling in performance

As opposed to other DBMSs, where performance decreases as data volume increases beyond a certain threshold, SQream DB's technology allows for steady performance regardless of the data scale.

Scaling in storage

Storage may be enlarged easily by adding more drives to the server. SQream DB's algorithms handle the rest. Since SQream DB is throughput-intensive, it is optimized for multi-terabyte conventional hard drives and basic SSDs.

Scaling in GPUs, not CPUs or nodes

Adding compute power is simple. There is no need to replace the entire server, only to plug in additional NVIDIA GPU cards.

Interfaces and Integration

SQL Support

SQream DB supports pure ANSI SQL. Stored-procedure languages such as Microsoft T-SQL and Oracle PL/SQL are not supported.

SQream DB integrates easily into existing systems by supporting both ODBC and JDBC connectors. Existing ETL and analytics tools and in-house applications can therefore stay in place, minimizing the time needed to get up and running with SQream DB.

SQream DB may be introduced on its own, as a standalone petabyte-scale database, to meet all analytic needs. However, there is no need to throw away existing solutions. Instead of upgrading current solutions by procuring additional hardware that scales non-linearly, organizations may plug in SQream DB as a secondary database, creating an on/offloading system and empowering existing investments.

IT Monitoring

SQream DB runs on standard hardware and integrates easily with any control and monitoring software already in use for tracking Linux-based machines.

Logging

SQream DB contains a built-in logger that tracks critical server information, enabling IT and security teams to gain insights from the server's operations, from failed login attempts and CPU time spent per query to read-write cycles and memory utilization.
Security

SQream DB offers username/password authentication at levels ranging from the cluster (multi-database) all the way down to the individual table.
Backup and Restore Operations

SQream DB offers backup and restore operations either via SQL statements or directly from the file system. The latter means that SQream DB can be backed up and restored using any external storage system (Data Replication Manager).

High Availability Configuration

Multiple SQream DB servers may be connected to a single external storage system, with only one server active at any point in time and the others passive. When the active server fails, a passive server mounts the shared storage and continues to respond to queries, without any data loss. [Active/Active operation and automatic failover are planned for the next release.]

Alternatively, SQream DB can run in a stand-alone cluster topology in which two servers, each with identical internal direct-attached storage, are both active. The first server ingests new data and serves queries while continuously updating the second. If the first server fails, the second takes control seamlessly, with no loss of time or data.

[Figure: Active and Passive servers connected to shared Storage.]

SQream vs. Other Big Data Solutions

Organizations may be considering a trendy new cluster or NoSQL solution. These are excellent for specific implementations, but they require experienced DBAs and new application development skills. Set against that, the benefit of SQream DB's painless, hassle-free integration is clear.
Summary

SQream DB delivers up to 100 times faster big data analytics compared with other key market players, while using a significantly smaller hardware footprint. SQream DB is the only solution that is truly capable of dealing with big data at massive and escalating magnitudes (petabyte scale and hundreds of billions of rows of data), and of doing so with relative ease and at extraordinary value.

SQream DB opens up new opportunities for organizations to do much more with their data, in ways relevant to their unique business use cases. Petabyte-scale data insights with hundreds of billions of entries are now within reach. Organizations may integrate SQream DB as a standalone database solution or as a complementary analytics database, maximizing existing core IT investments.

The SQream DB hardware architecture enables significant cost savings by using GPUs and their massively parallel abilities instead of clustering servers and nodes, which saves on hardware, infrastructure, utilities and maintenance costs.

Integrating SQream DB is straightforward. It requires no massive rewrites of SQL queries and no additional skills, and the database plugs in easily to the existing ecosystem, requiring little to no transition time and no investment in training.

All of the above translates into substantial gains for the organization by enabling it to run two orders of magnitude more queries, unlocking the critical business intelligence and information hidden in its collected big data. SQream DB puts organizations in a leading position while significantly reducing their hardware and operating costs.

For more information about SQream DB, visit www.sqream.com or call +972.3.544.4871.

Copyright 2010. All rights reserved. This document is provided for information purposes only and the contents hereof are subject to change without notice.
This document is not warranted to be error-free, nor subject to any other warranties or conditions, whether expressed orally or implied in law, including implied warranties and conditions of merchantability or fitness for a particular purpose. We specifically disclaim any liability with respect to this document and no contractual obligations are formed either directly or indirectly by this document. This document may not be reproduced in any form, for any purpose, without our prior written permission.